
Geometry of the Sample Frequency Spectrum and the Perils of Demographic Inference

Overview
Journal Genetics
Specialty Genetics
Date 2018 Aug 2
PMID 30064984
Citations 13
Abstract

The sample frequency spectrum (SFS), which describes the distribution of mutant alleles in a sample of DNA sequences, is a widely used summary statistic in population genetics. The expected SFS has a strong dependence on the historical population demography and this property is exploited by popular statistical methods to infer complex demographic histories from DNA sequence data. Most, if not all, of these inference methods exhibit pathological behavior, however. Specifically, they often display runaway behavior in optimization, where the inferred population sizes and epoch durations can degenerate to zero or diverge to infinity, and show undesirable sensitivity to perturbations in the data. The goal of this article is to provide theoretical insights into why such problems arise. To this end, we characterize the geometry of the expected SFS for piecewise-constant demographies and use our results to show that the aforementioned pathological behavior of popular inference methods is intrinsic to the geometry of the expected SFS. We provide explicit descriptions and visualizations for a toy model, and generalize our intuition to arbitrary sample sizes using tools from convex and algebraic geometry. We also develop a universal characterization result which shows that the expected SFS of a sample of size n under an arbitrary population history can be recapitulated by a piecewise-constant demography with only κ_n epochs, where κ_n is between n/2 and 2n - 1. The set of expected SFS for piecewise-constant demographies with fewer than κ_n epochs is open and nonconvex, which causes the above phenomena for inference from data.
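As a concrete illustration of the summary statistic the abstract describes, the following minimal Python sketch tabulates an SFS from a small 0/1 haplotype matrix. It also evaluates the classical neutral expectation theta/i for a constant-size population, which is the simplest single-epoch case and not the article's piecewise-constant result. The function names, the toy data, and the value of theta are illustrative assumptions and do not come from the article.

```python
def sample_frequency_spectrum(haplotypes):
    """Tabulate the SFS of n aligned sequences.

    haplotypes: list of n equal-length lists of 0/1 values, where 1 marks
    the mutant (derived) allele at a site. Entry i-1 of the result counts
    the sites at which exactly i of the n sequences carry the mutant allele.
    """
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for site in zip(*haplotypes):      # iterate over alignment columns
        i = sum(site)                  # number of mutant copies at this site
        if 0 < i < n:                  # skip sites that are not polymorphic
            sfs[i - 1] += 1
    return sfs


def expected_sfs_constant_size(n, theta):
    """Standard neutral expectation for one constant-size epoch:
    E[xi_i] = theta / i for i = 1, ..., n-1 (theta is an assumed
    population-scaled mutation rate)."""
    return [theta / i for i in range(1, n)]


# Hypothetical toy data: 4 sequences (rows) by 4 sites (columns).
toy = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
]
observed = sample_frequency_spectrum(toy)
expected = expected_sfs_constant_size(4, 6.0)
```

Demographic inference methods of the kind discussed here compare an observed spectrum such as `observed` with the expected spectrum implied by a candidate population history; the sketch covers only the constant-size case, where that expectation has a closed form.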

Citing Articles

A previously reported bottleneck in human ancestry 900 kya is likely a statistical artifact.

Deng Y, Nielsen R, Song Y Genetics. 2024; 229(1):1-3.

PMID: 39679949 PMC: 11708913. DOI: 10.1093/genetics/iyae192.


Unraveling the genomic landscape of wrens along western Ecuador's precipitation gradient: Insights into hybridization, isolation by distance, and isolation by the environment.

Montalvo L, Kimball R, Austin J, Robinson S Ecol Evol. 2024; 14(7):e11661.

PMID: 38994212 PMC: 11237350. DOI: 10.1002/ece3.11661.


Conditional frequency spectra as a tool for studying selection on complex traits in biobanks.

Patel R, Weiss C, Zhu H, Mostafavi H, Simons Y, Spence J bioRxiv. 2024.

PMID: 38948697 PMC: 11212903. DOI: 10.1101/2024.06.15.599126.


Demographic history inference and the polyploid continuum.

Blischak P, Sajan M, Barker M, Gutenkunst R Genetics. 2023; 224(4).

PMID: 37279657 PMC: 10411560. DOI: 10.1093/genetics/iyad107.


Bayesian optimization for demographic inference.

Noskova E, Borovitskiy V G3 (Bethesda). 2023; 13(7).

PMID: 37070782 PMC: 10320152. DOI: 10.1093/g3journal/jkad080.

