
Valid Inference for Machine Learning-assisted Genome-wide Association Studies

Overview
Journal: Nat Genet
Specialty: Genetics
Date: 2024 Sep 30
PMID: 39349818
Abstract

Machine learning (ML) has become increasingly popular in almost all scientific disciplines, including human genetics. Owing to challenges related to sample collection and precise phenotyping, ML-assisted genome-wide association studies (GWAS), which use sophisticated ML techniques to impute phenotypes and then perform GWAS on the imputed outcomes, have become increasingly common in complex trait genetics research. However, the validity of ML-assisted GWAS associations has not been carefully evaluated. Here, we report pervasive risks for false-positive associations in ML-assisted GWAS and introduce Post-Prediction GWAS (POP-GWAS), a statistical framework that redesigns GWAS on ML-imputed outcomes. POP-GWAS ensures valid and powerful statistical inference irrespective of imputation quality and choice of algorithm, requiring only GWAS summary statistics as input. We employed POP-GWAS to perform a GWAS of bone mineral density derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, identifying 89 new loci and revealing skeletal site-specific genetic architecture. Our framework offers a robust analytic solution for future ML-assisted GWAS.
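The abstract does not give the POP-GWAS estimator, so the following is only a minimal sketch of one generic way to combine GWAS summary statistics so that the observed phenotype anchors the estimate while the imputed phenotype contributes precision. It is a control-variate construction in the spirit of prediction-based inference, not the authors' implementation. The names `SumStat` and `debiased_estimate`, the three-GWAS layout (observed phenotype in a labeled sample, imputed phenotype in the same labeled sample, imputed phenotype in an independent unlabeled sample), and the use of the observed-imputed phenotype correlation `r` to approximate the covariance of the two labeled-sample estimates are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class SumStat:
    beta: float  # per-SNP effect estimate from one GWAS
    se: float    # standard error of that estimate


def debiased_estimate(y_lab, yhat_lab, yhat_unlab, r):
    """Combine three per-SNP summary statistics (hypothetical helper).

    y_lab      : GWAS of the observed phenotype in the labeled sample
    yhat_lab   : GWAS of the imputed phenotype in the same labeled sample
    yhat_unlab : GWAS of the imputed phenotype in an independent unlabeled sample
    r          : correlation between observed and imputed phenotype, used here
                 (as an assumption) to approximate the covariance between the
                 two labeled-sample effect estimates
    """
    # Assumed covariance between the two estimates from the labeled sample.
    cov = r * y_lab.se * yhat_lab.se
    # Variance of the difference of the two imputed-phenotype estimates,
    # treating the labeled and unlabeled samples as independent.
    var_diff = yhat_lab.se ** 2 + yhat_unlab.se ** 2
    # Variance-minimizing weight for the control-variate term.
    w = cov / var_diff
    # The correction term has mean zero when both imputed-phenotype GWAS
    # target the same quantity, so the estimate stays anchored to y_lab.
    beta = y_lab.beta - w * (yhat_lab.beta - yhat_unlab.beta)
    var = y_lab.se ** 2 + w ** 2 * var_diff - 2.0 * w * cov
    se = math.sqrt(var)
    z = beta / se
    # Two-sided p-value under a standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return SumStat(beta, se), z, p


# Illustrative numbers only; they are not taken from the paper.
combined, z, p = debiased_estimate(
    SumStat(0.10, 0.05), SumStat(0.08, 0.04), SumStat(0.07, 0.01), r=0.8
)
```

Under these assumptions, a poor imputation (`r` near 0) drives the weight to zero and the result falls back to the labeled-sample GWAS of the observed phenotype, while a better imputation shrinks the standard error without changing what is being estimated.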

Citing Articles

ipd: an R package for conducting inference on predicted data.

Salerno S, Miao J, Afiaz A, Hoffman K, Neufeld A, Lu Q. Bioinformatics. 2025; 41(2).

PMID: 39898809 PMC: 11842045. DOI: 10.1093/bioinformatics/btaf055.
