PMID: 31104630

Systematic Analysis of Dark and Camouflaged Genes Reveals Disease-relevant Genes Hiding in Plain Sight

Abstract

Background: The human genome contains "dark" gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions with few mappable reads that we call dark by depth, and others that have ambiguous alignment, called camouflaged. We assess how well long-read or linked-read technologies resolve these regions.
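The two categories defined above can be illustrated with a minimal sketch. It assumes each genomic position is summarized by the mapping qualities (MAPQ) of the reads covering it. The thresholds, the label names, and the helpers `classify_position` and `merge_regions` are hypothetical choices for illustration, not values or methods stated in this abstract.

```python
from typing import List, Tuple

# Illustrative thresholds only: assumptions, not values given in the abstract.
MAX_DARK_DEPTH = 5            # this many reads or fewer => "dark by depth"
LOW_MAPQ = 10                 # reads below this MAPQ count as ambiguously aligned
MIN_LOW_MAPQ_FRACTION = 0.9   # share of low-MAPQ reads that flags a position


def classify_position(mapqs: List[int]) -> str:
    """Label one position from the MAPQ values of the reads covering it."""
    depth = len(mapqs)
    if depth <= MAX_DARK_DEPTH:
        return "dark_by_depth"
    low = sum(1 for q in mapqs if q < LOW_MAPQ)
    if low / depth >= MIN_LOW_MAPQ_FRACTION:
        # Reads are present but cannot be placed uniquely; the abstract
        # calls regions with ambiguous alignment "camouflaged".
        return "ambiguous_alignment"
    return "ok"


def merge_regions(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Collapse runs of identical non-ok labels into half-open (start, end, label) intervals."""
    regions = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != "ok":
                regions.append((start, i, labels[start]))
            start = i
    return regions
```

Applying `classify_position` to every position of a gene body and passing the resulting labels to `merge_regions` yields a list of candidate dark intervals, which could then be intersected with exon annotations.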

Results: Based on standard whole-genome Illumina sequencing data, we identify 36,794 dark regions in 6054 gene bodies from pathways important to human health, development, and reproduction. Of these gene bodies, 8.7% are completely dark and 35.2% are ≥ 5% dark. We identify dark regions that are present in protein-coding exons across 748 genes. Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively. We present an algorithm to resolve most camouflaged regions and apply it to the Alzheimer's Disease Sequencing Project. We rescue a rare ten-nucleotide frameshift deletion in CR1, a top Alzheimer's disease gene, found in disease cases but not in controls.

Conclusions: Although we could not formally assess the association of the CR1 frameshift mutation with Alzheimer's disease because of insufficient sample size, we believe it merits investigation in a larger cohort. Thousands of potentially important genomic regions remain overlooked by short-read sequencing but are largely resolved by long-read technologies.

Citing Articles

Genome-wide profiling of highly similar paralogous genes using HiFi sequencing.

Chen X, Baker D, Dolzhenko E, Devaney J, Noya J, Berlyoung A Nat Commun. 2025; 16(1):2340.

PMID: 40057485 PMC: 11890787. DOI: 10.1038/s41467-025-57505-2.


The human immunoglobulin heavy chain constant gene locus is enriched for large complex structural variants and coding polymorphisms that vary in frequency among human populations.

Jana U, Rodriguez O, Lees W, Engelbrecht E, Vanwinkle Z, Vanwinkle Z bioRxiv. 2025.

PMID: 39990387 PMC: 11844466. DOI: 10.1101/2025.02.12.634878.


Locityper: targeted genotyping of complex polymorphic genes.

Prodanov T, Plender E, Seebohm G, Meuth S, Eichler E, Marschall T bioRxiv. 2025.

PMID: 39990346 PMC: 11844405. DOI: 10.1101/2024.05.03.592358.


Comparative evaluation of four exome enrichment solutions in 2024: Agilent, Roche, Vazyme and Nanodigmbio.

Belova V, Vasiliadis I, Repinskaia Z, Samitova A, Shmitko A, Ponikarovskaya N BMC Genomics. 2025; 26(1):76.

PMID: 39871131 PMC: 11770928. DOI: 10.1186/s12864-024-11196-z.


Genomic complexity and clinical significance of the RCCX locus.

Shiryagin V, Devyatkin A, Fateev O, Petriaikina E, Bogdanov V, Antysheva Z PeerJ. 2024; 12:e18243.

PMID: 39512309 PMC: 11542561. DOI: 10.7717/peerj.18243.

