An Expanded Sequence Context Model Broadly Explains Variability in Polymorphism Levels Across the Human Genome
The rate of single-nucleotide polymorphism varies substantially across the human genome and fundamentally influences evolution and the incidence of genetic disease. Previous studies of polymorphism levels across the genome have considered only the nucleotides immediately flanking a polymorphic site (the site's trinucleotide sequence context). The impact of larger sequence contexts has not been fully clarified, even though context substantially influences rates of polymorphism. Using a new statistical framework and data from the 1000 Genomes Project, we demonstrate that a heptanucleotide context explains >81% of variability in substitution probabilities, highlighting new mutation-promoting motifs at ApT dinucleotides and at CAAT and TACG sequences. Our approach also identifies previously undocumented variability in C-to-T substitutions at CpG sites, which is not immediately explained by differential methylation intensity. Using our model, we present informative substitution intolerance scores for genes and a new intolerance score for amino acids, and we demonstrate clinical use of the model in neuropsychiatric diseases.
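To make the idea of a context-dependent substitution probability concrete, the following is a minimal sketch and not the statistical framework described above. It assumes a plain reference string and a hypothetical list of `(position, alt_allele)` variants, and estimates the probability of each substitution as the fraction of occurrences of a k-mer context (k = 7 for heptanucleotides) whose central base is observed to be polymorphic. The function name, the input format, and the omission of strand collapsing are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def context_substitution_probs(reference, variants, k=7):
    """Estimate P(central base -> alt | k-mer context) as an observed fraction.

    reference: DNA string (hypothetical input, not a real genome loader).
    variants:  iterable of (0-based position, alt allele) pairs (assumed format).
    k:         odd context width; 7 gives a heptanucleotide context.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the site sits at the centre")
    flank = k // 2
    reference = reference.upper()
    valid = set("ACGT")

    # Count how often each k-mer context occurs in the reference.
    context_counts = Counter()
    for i in range(flank, len(reference) - flank):
        ctx = reference[i - flank:i + flank + 1]
        if set(ctx) <= valid:
            context_counts[ctx] += 1

    # Count observed substitutions, keyed by the context of the variant site.
    subst_counts = defaultdict(Counter)
    for pos, alt in variants:
        if pos < flank or pos >= len(reference) - flank:
            continue  # no full context available at the sequence edges
        ctx = reference[pos - flank:pos + flank + 1]
        alt = alt.upper()
        if ctx in context_counts and alt in valid and alt != reference[pos]:
            subst_counts[ctx][alt] += 1

    # Convert counts to per-context substitution fractions.
    probs = {}
    for ctx, alt_counts in subst_counts.items():
        for alt, count in alt_counts.items():
            probs[(ctx, alt)] = count / context_counts[ctx]
    return probs


# Toy usage: the context ACGTACG occurs twice, and one occurrence is polymorphic.
toy_probs = context_substitution_probs("AACGTACGTACGTAA", [(4, "C")])
```

In this toy input the central T of `ACGTACG` is substituted at one of its two occurrences, so the estimate for `("ACGTACG", "C")` is 0.5. A realistic analysis would also need to merge reverse-complement contexts and handle sparsely observed contexts, which this sketch deliberately leaves out.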