A Fast and Flexible Statistical Model for Large-scale Population Genotype Data: Applications to Inferring Missing Genotypes and Haplotypic Phase

Overview
Journal: Am J Hum Genet
Publisher: Cell Press
Specialty: Genetics
Date: 2006 Mar 15
PMID: 16532393
Citations: 954
Abstract

We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate as, or more accurate than, existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.
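
To make the idea of cluster memberships that change along the chromosome more concrete, here is a minimal Python sketch of a forward-algorithm likelihood for a single haplotype under a toy cluster hidden Markov model. It is an illustration only and is not the fastPHASE implementation. The parameter names (`alpha` for cluster weights, `theta` for per-cluster allele frequencies, `jump` for the per-interval probability of re-drawing the cluster) and the exact transition form are assumptions chosen for this example; the abstract does not specify them.

```python
def haplotype_likelihood(hap, alpha, theta, jump):
    """Likelihood of one haplotype under a toy cluster HMM (illustrative only).

    hap   : list of 0/1 alleles, one per marker
    alpha : K cluster weights summing to 1 (assumed parameterization)
    theta : theta[k][m] = P(allele 1 | cluster k, marker m) (assumed)
    jump  : jump[m] = P(cluster is re-drawn between markers m-1 and m);
            jump[0] is unused (assumed)
    """
    n_clusters = len(alpha)

    def emit(k, m):
        # Probability of the observed allele at marker m given cluster k.
        return theta[k][m] if hap[m] == 1 else 1.0 - theta[k][m]

    # Initialise: the cluster at the first marker is drawn from alpha.
    fwd = [alpha[k] * emit(k, 0) for k in range(n_clusters)]

    for m in range(1, len(hap)):
        total = sum(fwd)
        # Either stay in the same cluster (prob 1 - jump[m]) or re-draw
        # the cluster from alpha (prob jump[m]), then emit the allele.
        fwd = [
            ((1.0 - jump[m]) * fwd[k] + jump[m] * alpha[k] * total) * emit(k, m)
            for k in range(n_clusters)
        ]

    return sum(fwd)


if __name__ == "__main__":
    # Hypothetical parameters for two clusters and three markers.
    alpha = [0.6, 0.4]
    theta = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.3]]
    jump = [0.0, 0.05, 0.5]
    print(haplotype_likelihood([1, 0, 1], alpha, theta, jump))
```

In this sketch, small values of `jump` keep a haplotype in the same cluster across adjacent markers, which gives block-like LD, while larger values let cluster membership change more often, which gives a more gradual decline in LD with distance. Because the stay-or-re-draw transition lets each update be computed from a running total, the cost per marker is linear in the number of clusters.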

Citing Articles

Resequencing Composite Kazakh Whiteheaded Cattle: Insights into Ancestral Breed Contributions, Selection Signatures, and Candidate Genetic Variants.

Khamzina A, Igoshin A, Muslimova Z, Turgumbekov A, Khussainov D, Yudin N Animals (Basel). 2025; 15(3).

PMID: 39943155 PMC: 11815988. DOI: 10.3390/ani15030385.


STICI: Split-Transformer with integrated convolutions for genotype imputation.

Mowlaei M, Li C, Jamialahmadi O, Dias R, Chen J, Jamialahmadi B Nat Commun. 2025; 16(1):1218.

PMID: 39890780 PMC: 11785734. DOI: 10.1038/s41467-025-56273-3.


Validation of selection signatures for coat color in the Podolica Italiana gray cattle breed.

Bruno S, Rovelli G, Landi V, Sbarra F, Quaglia A, Pilla F Front Genet. 2024; 15:1453295.

PMID: 39717482 PMC: 11663911. DOI: 10.3389/fgene.2024.1453295.


Identification of QTL-allele systems of seed size and oil content for simultaneous genomic improvement in Northeast China soybeans.

He J, Fu L, Hao X, Wu Y, Wang M, Zhang Q Front Plant Sci. 2024; 15:1483995.

PMID: 39610887 PMC: 11602309. DOI: 10.3389/fpls.2024.1483995.


Genome-wide scans for signatures of selection in North African sheep reveals differentially selected regions between fat- and thin-tailed breeds.

Ben-Jemaa S, Yahyaoui G, Kdidi S, Najjari A, Lenstra J, Mastrangelo S Anim Genet. 2024; 56(1):e13487.

PMID: 39573836 PMC: 11653233. DOI: 10.1111/age.13487.

