Estimation of Site Frequency Spectra from Low-coverage Sequencing Data Using Stochastic EM Reduces Overfitting, Runtime, and Memory Usage

Overview
Journal: Genetics
Specialty: Genetics
Date: 2022 Sep 29
PMID: 36173322
Abstract

The site frequency spectrum is an important summary statistic in population genetics, used for inference on demographic history and selection. However, estimating the site frequency spectrum from called genotypes introduces bias when working with low-coverage sequencing data. Methods exist for addressing this issue, but they can suffer from two problems. First, they can have very high computational demands, to the point that estimation may not be feasible for genome-scale data. Second, existing methods are prone to overfitting, especially when estimating multidimensional site frequency spectra. In this article, we present a stochastic expectation-maximization algorithm for inferring the site frequency spectrum from next-generation sequencing (NGS) data that addresses these challenges. We show that this algorithm greatly reduces runtime and enables estimation with constant, trivial RAM usage. Furthermore, the algorithm reduces overfitting and thereby improves downstream inference. An implementation is available at github.com/malthesr/winsfs.
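To make the idea of a stochastic expectation-maximization update concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the winsfs implementation: the names `em_step` and `window_em`, the parameters `block_size`, `window`, and `epochs`, and the averaging scheme are all hypothetical. It assumes each site is summarized by a vector of likelihoods, one per possible derived-allele count, and that the running estimate is the average of the most recent per-block EM updates.

```python
import random
from collections import deque


def em_step(sfs, block):
    """One EM update computed from a single block of sites.

    sfs   -- current estimate, a list of k probabilities
    block -- list of per-site likelihood vectors of length k
    Returns the mean posterior over allele-count categories for the block.
    """
    k = len(sfs)
    acc = [0.0] * k
    for lik in block:
        # E-step: posterior of each category for this site
        post = [l * p for l, p in zip(lik, sfs)]
        total = sum(post)
        if total == 0.0:
            continue  # site carries no information under the current estimate
        for j in range(k):
            acc[j] += post[j] / total
    norm = sum(acc)
    if norm == 0.0:
        return list(sfs)  # nothing usable in this block, keep the estimate
    # M-step: normalized posterior counts
    return [a / norm for a in acc]


def window_em(site_liks, k, block_size=100, window=5, epochs=20, seed=0):
    """Hypothetical stochastic EM: update on small blocks of shuffled sites
    and average the last `window` block-level updates."""
    rng = random.Random(seed)
    sites = list(site_liks)
    sfs = [1.0 / k] * k  # flat starting point (an assumption)
    recent = deque(maxlen=window)
    for _ in range(epochs):
        rng.shuffle(sites)
        for start in range(0, len(sites), block_size):
            block = sites[start:start + block_size]
            recent.append(em_step(sfs, block))
            sfs = [sum(col) / len(recent) for col in zip(*recent)]
    return sfs
```

In this sketch all sites are held in memory for simplicity. A variant that reads one block at a time from disk would only need memory proportional to `block_size` and `window`, which is one plausible way to obtain memory usage that does not grow with the number of sites.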

Citing Articles

Modeling Biases from Low-Pass Genome Sequencing to Enable Accurate Population Genetic Inferences.

Fonseca E, Tran L, Mendoza H, Gutenkunst R Mol Biol Evol. 2025; 42(1).

PMID: 39847470 PMC: 11756381. DOI: 10.1093/molbev/msaf002.


Genomic Insights Into Red Squirrels in Scotland Reveal Loss of Heterozygosity Associated With Extreme Founder Effects.

Marr M, Humble E, Lurz P, Wilson L, Milne E, Beckmann K Evol Appl. 2025; 18(1):e70072.

PMID: 39822659 PMC: 11735740. DOI: 10.1111/eva.70072.


Genomic and Acoustic Biogeography of the Iconic Sulphur-crested Cockatoo Clarifies Species Limits and Patterns of Intraspecific Diversity.

Sands A, Andersson A, Reid K, Hains T, Joseph L, Drew A Mol Biol Evol. 2024; 41(11).

PMID: 39447047 PMC: 11586666. DOI: 10.1093/molbev/msae222.


Persistent Gene Flow Suggests an Absence of Reproductive Isolation in an African Antelope Speciation Model.

Wang X, Pedersen C, Athanasiadis G, Garcia-Erill G, Hanghoj K, Bertola L Syst Biol. 2024; 73(6):979-994.

PMID: 39140829 PMC: 11637686. DOI: 10.1093/sysbio/syae037.


Impact of Holocene environmental change on the evolutionary ecology of an Arctic top predator.

Westbury M, Brown S, Lorenzen J, O'Neill S, Scott M, McCuaig J Sci Adv. 2023; 9(45):eadf3326.

PMID: 37939193 PMC: 10631739. DOI: 10.1126/sciadv.adf3326.

