
Imputation Accuracy Across Global Human Populations

Overview
Journal bioRxiv
Date 2023 Jun 9
PMID 37292811
Abstract

Genotype imputation is now fundamental to genome-wide association studies, but it lacks fairness because populations with non-European ancestries are underrepresented. The state-of-the-art imputation reference panel released by the Trans-Omics for Precision Medicine (TOPMed) initiative contains a substantial number of admixed African-ancestry and Hispanic/Latino samples, which allows these populations to be imputed with nearly the same accuracy as European-ancestry cohorts. However, imputation for populations primarily residing outside of North America may still fall short in performance due to persisting underrepresentation. To illustrate this point, we curated genome-wide array data from 23 publications published between 2008 and 2021. In total, we imputed over 43k individuals across 123 populations around the world. We identified a number of populations where imputation accuracy was markedly lower than that of European-ancestry populations. For instance, the mean imputation r-squared (Rsq) for alleles at 1-5% frequency in Saudi Arabians (N=1061), Vietnamese (N=1264), Thai (N=2435), and Papua New Guineans (N=776) was 0.79, 0.78, 0.76, and 0.62, respectively. In contrast, the mean Rsq ranged from 0.90 to 0.93 for comparable European populations matched in sample size and SNP content. Outside of Africa and Latin America, Rsq appeared to decrease as genetic distance to the European reference increased, as predicted. Further analysis using sequencing data as ground truth suggested that imputation software may overestimate imputation accuracy more for non-European populations than for European populations, implying a further disparity between populations. Using 1496 whole-genome-sequenced individuals from Taiwan Biobank as a reference, we also assessed a strategy to improve imputation for non-European populations with meta-imputation, which can combine results from TOPMed with smaller population-specific reference panels. We found that meta-imputation in this design did not improve Rsq genome-wide. Taken together, our analysis suggests that, with the current size of alternative reference panels, meta-imputation alone cannot improve imputation efficacy for underrepresented cohorts, and we must ultimately strive to increase the diversity and size of reference panels to promote equity within genetics research.
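
The abstract relies on two accuracy measures: the Rsq that imputation software reports per variant, averaged within an allele-frequency bin, and an accuracy computed against sequencing data treated as ground truth. The sketch below illustrates both under stated assumptions. The function names, the (frequency, Rsq) input layout, and the toy numbers are hypothetical and are not taken from the study. The squared Pearson correlation is used here as one common choice for the ground-truth comparison; the abstract does not specify the exact statistic.

```python
from statistics import mean


def mean_rsq_in_frequency_bin(variants, low=0.01, high=0.05):
    """Average per-variant Rsq over variants with low <= frequency <= high.

    `variants` is an iterable of (allele_frequency, rsq) pairs. This layout
    is an assumption made for the example, not a real file format.
    """
    selected = [rsq for freq, rsq in variants if low <= freq <= high]
    if not selected:
        raise ValueError("no variants fall inside the requested frequency bin")
    return mean(selected)


def squared_correlation(imputed_dosages, true_genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes.

    Both inputs hold one value per individual for a single variant:
    dosages in [0, 2] and sequenced genotypes coded 0, 1, or 2.
    """
    if len(imputed_dosages) != len(true_genotypes):
        raise ValueError("inputs must have the same length")
    if len(imputed_dosages) < 2:
        raise ValueError("need at least two individuals")
    mx = mean(imputed_dosages)
    my = mean(true_genotypes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(imputed_dosages, true_genotypes))
    sxx = sum((x - mx) ** 2 for x in imputed_dosages)
    syy = sum((y - my) ** 2 for y in true_genotypes)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when either input is constant")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Made-up values for illustration only; not results from the study.
    toy_variants = [(0.02, 0.80), (0.04, 0.60), (0.20, 0.95), (0.005, 0.30)]
    print(mean_rsq_in_frequency_bin(toy_variants))
    print(squared_correlation([0.1, 0.9, 1.8, 0.2], [0, 1, 2, 0]))
```

Comparing the two quantities for the same variants is one way to see whether the software-reported Rsq is optimistic relative to accuracy measured against sequencing.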
