Estimation of Site Frequency Spectra from Low-coverage Sequencing Data Using Stochastic EM Reduces Overfitting, Runtime, and Memory Usage
The site frequency spectrum is an important summary statistic in population genetics used for inference on demographic history and selection. However, estimation of the site frequency spectrum from called genotypes introduces bias when working with low-coverage sequencing data. Methods exist for addressing this issue, but they can suffer from two problems. First, they can have very high computational demands, to the point that it may not be possible to run estimation for genome-scale data. Second, existing methods are prone to overfitting, especially for multidimensional site frequency spectrum estimation. In this article, we present a stochastic expectation-maximization algorithm for inferring the site frequency spectrum from next-generation sequencing (NGS) data that addresses these challenges. We show that this algorithm greatly reduces runtime and enables estimation with constant, trivial RAM usage. Furthermore, the algorithm reduces overfitting and thereby improves downstream inference. An implementation is available at github.com/malthesr/winsfs.
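To make the idea of a stochastic expectation-maximization update concrete, the following is a minimal sketch for a one-dimensional spectrum. It is not the winsfs implementation. It assumes the input is a matrix of per-site likelihoods over derived-allele counts (here called `saf`), that sites are processed in shuffled blocks, and that the running estimate is the average of the most recent block-wise updates. The names `em_step`, `stochastic_em`, `n_blocks`, `window`, and `epochs` are hypothetical, and the block-and-window scheme is an assumption rather than something stated in the abstract.

```python
import numpy as np


def em_step(sfs, saf):
    """One EM update on a set of sites.

    sfs: current spectrum estimate, shape (n_bins,), sums to 1.
    saf: per-site likelihoods of each derived-allele count, shape (n_sites, n_bins).
    """
    post = saf * sfs                          # unnormalised posterior per site
    post /= post.sum(axis=1, keepdims=True)   # E-step: normalise each site
    return post.mean(axis=0)                  # M-step: average the posteriors


def stochastic_em(saf, n_blocks=10, window=3, epochs=5, seed=0):
    """Illustrative stochastic EM: update on blocks of sites, not all sites.

    Assumption: the estimate after each block is the mean of the last
    `window` block-wise updates, which smooths the noise of single blocks.
    """
    rng = np.random.default_rng(seed)
    n_sites, n_bins = saf.shape
    sfs = np.full(n_bins, 1.0 / n_bins)       # start from a flat spectrum
    recent = []                               # most recent block-wise updates
    for _ in range(epochs):
        order = rng.permutation(n_sites)      # visit sites in random order
        for block in np.array_split(order, n_blocks):
            recent.append(em_step(sfs, saf[block]))
            recent = recent[-window:]         # keep only the last `window`
            sfs = np.mean(recent, axis=0)
    return sfs
```

Each update in this sketch touches only one block of sites, which is the property that would allow the data to be streamed from disk instead of held in memory. The sketch itself keeps `saf` in memory for simplicity, so it does not demonstrate the constant memory usage reported above.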