PCA-correlated SNPs for Structure Identification in Worldwide Human Populations

Overview
Journal: PLoS Genet
Specialty: Genetics
Date: 2007 Sep 26
PMID: 17892327
Citations: 95
Abstract

Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA) and recent results in theoretical computer science, we present a novel algorithm that, applied to genome-wide data, selects small subsets of SNPs (PCA-correlated SNPs) that reproduce the structure found by PCA on the complete dataset, without using ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure-informative SNPs are not portable across geographic regions. However, we identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset.
The algorithm that we introduce runs in seconds and can be easily applied to large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
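
The abstract does not spell out how SNPs are scored, so the following is only a minimal sketch of the general idea, under stated assumptions: each SNP is ranked by the sum of its squared loadings on the top principal components (a leverage-style score), and the highest-scoring SNPs are kept. The function name `pca_correlated_snps`, the scoring rule, the synthetic two-population data, and the sign-based split on the first component (standing in for the "simple clustering algorithm") are all illustrative choices, not the authors' implementation.

```python
import numpy as np


def pca_correlated_snps(genotypes, n_components, n_select):
    """Rank SNPs by a leverage-style score on the top principal components.

    genotypes: (individuals x SNPs) array of allele counts coded 0/1/2.
    Returns the indices of the n_select highest-scoring SNPs and all scores.
    The scoring rule is an assumption made for this sketch.
    """
    centered = genotypes - genotypes.mean(axis=0)  # center each SNP column
    # Thin SVD: rows of vt are the principal axes in SNP space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = np.sum(vt[:n_components] ** 2, axis=0)
    selected = np.argsort(scores)[::-1][:n_select]
    return selected, scores


# Synthetic toy data (hypothetical): two populations whose allele
# frequencies differ only at the first n_informative SNPs.
rng = np.random.default_rng(0)
n_per_pop, n_snps, n_informative = 50, 500, 20
freqs = np.full((2, n_snps), 0.5)
freqs[0, :n_informative] = 0.1
freqs[1, :n_informative] = 0.9
labels = np.repeat([0, 1], n_per_pop)
genotypes = rng.binomial(2, freqs[labels]).astype(float)

# Select SNPs without using the population labels.
selected, scores = pca_correlated_snps(genotypes, n_components=1, n_select=20)

# Re-run PCA on the selected SNPs only and split individuals by the sign
# of the first component (a stand-in for a simple clustering step).
subset = genotypes[:, selected]
subset = subset - subset.mean(axis=0)
pc1 = np.linalg.svd(subset, full_matrices=False)[0][:, 0]
assigned = (pc1 > 0).astype(int)

# The sign of a principal component is arbitrary, so compare both labelings.
agreement = max(np.mean(assigned == labels), np.mean(assigned != labels))
print("selected SNP indices:", sorted(selected.tolist()))
print("agreement with true populations:", agreement)
```

The labels are used only to generate the toy data and to check the result afterwards; the selection step itself sees nothing but the genotype matrix, which mirrors the unsupervised setting the abstract describes.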

Citing Articles

Determining population structure from k-mer frequencies.

Hrytsenko Y, Daniels N, Schwartz R. PeerJ. 2025; 13:e18939.

PMID: 40061228 PMC: 11890038. DOI: 10.7717/peerj.18939.


Development and Validation of a 5K Liquid Chip for Identifying Cashmere Goat Populations in Inner Mongolia Autonomous Region.

Zhang T, Xu Q, Zhou B, Xiao J, Zheng S, Li J. Animals (Basel). 2025; 14(24.

PMID: 39765493 PMC: 11672763. DOI: 10.3390/ani14243589.


A Pipeline and Recommendations for Population and Individual Diagnostic SNP Selection in Non-Model Species.

Armstrong E, Li C, Campana M, Ferrari T, Kelley J, Petrov D. Mol Ecol Resour. 2024; 25(3):e14048.

PMID: 39611246 PMC: 11887608. DOI: 10.1111/1755-0998.14048.


AlignStatPlot: An R package and online tool for robust sequence alignment statistics and innovative visualization of big data.

Alsamman A, El Allali A, Mokhtar M, Al-Shamaa K, Nassar A, Mousa K. PLoS One. 2023; 18(9):e0291204.

PMID: 37729135 PMC: 10511070. DOI: 10.1371/journal.pone.0291204.


Molecular characterization of doubled haploid lines derived from different cycles of the Iowa Stiff Stalk Synthetic (BSSS) maize population.

Ledesma A, Sales Ribeiro F, Uberti A, Edwards J, Hearne S, Frei U. Front Plant Sci. 2023; 14:1226072.

PMID: 37600186 PMC: 10433169. DOI: 10.3389/fpls.2023.1226072.

