Utilizing Mapping Targets of Sequences Underrepresented in the Reference Assembly to Reduce False Positive Alignments
Overview
The human reference assembly remains incomplete because repeat-rich sequences found within centromeric regions and acrocentric short arms are underrepresented. Although these sequences are only marginally represented in the assembly, they are often fully represented in whole-genome short-read datasets. Their reads therefore align inappropriately and produce high read-depth signals that localize to a small number of assembled homologous regions. As a consequence, these regions often produce artifactual peak calls that confound hypothesis testing and large-scale genomic studies. To address this problem, we have constructed mapping targets that represent roughly 8% of the human genome that is generally omitted from the human reference assembly. By integrating these targets into standard mapping and peak-calling pipelines, we demonstrate a 10-fold reduction in signal in regions that overlap blacklisted regions, and we identify a comprehensive set of regions whose mapping is sensitive to the presence of the repeat-rich targets.
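The fold reduction described above can be illustrated with a minimal sketch. It assumes reads have already been aligned twice, once against the reference alone and once against the reference plus the additional mapping targets, and that each alignment is reduced to a (contig, position) pair. The function names, the contig name `repeat_target_1`, the region coordinates, and the read counts are all hypothetical toy values chosen for illustration; they are not the authors' pipeline or results.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), half-open coordinates
Read = Tuple[str, int]           # (contig, alignment start position)


def count_in_regions(reads: List[Read], regions: List[Interval]) -> int:
    """Count reads whose start position falls inside any of the regions."""
    by_contig: Dict[str, List[Tuple[int, int]]] = {}
    for contig, start, end in regions:
        by_contig.setdefault(contig, []).append((start, end))
    total = 0
    for contig, pos in reads:
        if any(start <= pos < end for start, end in by_contig.get(contig, [])):
            total += 1
    return total


def fold_reduction(reads_ref_only: List[Read],
                   reads_with_targets: List[Read],
                   regions: List[Interval]) -> float:
    """Ratio of in-region read counts before vs. after adding mapping targets."""
    before = count_in_regions(reads_ref_only, regions)
    after = count_in_regions(reads_with_targets, regions)
    if after == 0:
        # No signal left in the regions; report infinity if there was any before.
        return float("inf") if before > 0 else 1.0
    return before / after


# Hypothetical blacklisted region (toy coordinates, not from the paper).
blacklisted = [("chr1", 100, 200)]

# Toy alignment against the reference alone: 8 reads pile up in the region.
ref_only = [("chr1", 110 + i) for i in range(8)] + [("chr1", 500), ("chr2", 50)]

# Toy alignment with the extra targets: 6 of those reads now map to a
# hypothetical repeat-rich target contig instead of the assembled region.
with_targets = (
    [("chr1", 110), ("chr1", 111)]
    + [("repeat_target_1", 10 + i) for i in range(6)]
    + [("chr1", 500), ("chr2", 50)]
)

reduction = fold_reduction(ref_only, with_targets, blacklisted)
print(f"fold reduction in blacklisted regions: {reduction:.1f}")
```

In practice the read positions would come from an aligner's output rather than hand-written lists, and signal would more likely be measured as read depth or peak counts than as raw read starts; the sketch only shows the comparison itself.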