Application of Two Machine Learning Algorithms to Genetic Association Studies in the Presence of Covariates

Overview
Journal BMC Genet
Publisher Biomed Central
Date 2008 Nov 19
PMID 19014573
Citations 8
Abstract

Background: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting the subsets of these inputs that are most predictive of a pre-defined trait. How these algorithms perform in the presence of covariates, however, is not well characterized.

Methods and Results: In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, we evaluate their performance under several underlying models. We also provide an application to a cohort of HIV-1-infected individuals receiving antiretroviral therapies.

Conclusion: Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.
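
The confounding concern raised in the conclusion can be illustrated with a small simulation. The sketch below is not the authors' simulation design, and it uses a plain Pearson correlation rather than RFs or MARS. The binary covariate, the stratum-specific allele frequencies, the effect size, and all function names are hypothetical choices made for illustration. The trait depends only on the covariate, yet the genotype appears associated with the trait until the analysis is stratified by the covariate.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=20000, seed=0):
    """Simulate (covariate, genotype, trait) rows under pure confounding.

    Assumptions (hypothetical, for illustration only):
    - the covariate is binary with probability 0.5;
    - the minor allele frequency is 0.4 in one stratum and 0.1 in the other;
    - the trait is the covariate plus standard normal noise, so the
      genotype has no direct effect on the trait.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        covariate = int(rng.random() < 0.5)
        maf = 0.4 if covariate else 0.1
        # Genotype coded as the number of minor alleles (0, 1, or 2).
        genotype = (rng.random() < maf) + (rng.random() < maf)
        trait = 1.0 * covariate + rng.gauss(0, 1)
        rows.append((covariate, genotype, trait))
    return rows


def marginal_and_stratified(rows):
    """Return the genotype-trait correlation overall and within each stratum."""
    marginal = pearson([g for _, g, _ in rows], [y for _, _, y in rows])
    stratified = {}
    for level in (0, 1):
        subset = [(g, y) for c, g, y in rows if c == level]
        stratified[level] = pearson([g for g, _ in subset],
                                    [y for _, y in subset])
    return marginal, stratified


rows = simulate()
marginal, stratified = marginal_and_stratified(rows)
print(f"marginal correlation:  {marginal:.3f}")
for level, r in stratified.items():
    print(f"within covariate = {level}: {r:.3f}")
```

Under these assumptions the marginal correlation is clearly positive, while the within-stratum correlations are close to zero. A variable-selection procedure given only the genotype would likely flag it as predictive, which is the kind of gene-covariate-trait relationship the conclusion suggests considering before applying an ML algorithm.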

Citing Articles

Covariate adjusted classification trees.

Asafu-Adjei J, Sampson A. Biostatistics. 2017; 19(1):42-53.

PMID: 28520903 PMC: 6075597. DOI: 10.1093/biostatistics/kxx015.


RAPIDSNPs: A new computational pipeline for rapidly identifying key genetic variants reveals previously unidentified SNPs that are significantly associated with individual platelet responses.

Salehe B, Jones C, Di Fatta G, McGuffin L. PLoS One. 2017; 12(4):e0175957.

PMID: 28441463 PMC: 5404774. DOI: 10.1371/journal.pone.0175957.


EPAS1 gene variants are associated with sprint/power athletic performance in two cohorts of European athletes.

Voisin S, Cieszczyk P, Pushkarev V, Dyatlov D, Vashlyayev B, Shumaylov V. BMC Genomics. 2014; 15:382.

PMID: 24884370 PMC: 4035083. DOI: 10.1186/1471-2164-15-382.


Integrative systems biology approaches in asthma pharmacogenomics.

Dahlin A, Tantisira K. Pharmacogenomics. 2012; 13(12):1387-404.

PMID: 22966888 PMC: 3553555. DOI: 10.2217/pgs.12.126.


An integrated approach to reduce the impact of minor allele frequency and linkage disequilibrium on variable importance measures for genome-wide data.

Walters R, Laurin C, Lubke G. Bioinformatics. 2012; 28(20):2615-23.

PMID: 22847933 PMC: 3467741. DOI: 10.1093/bioinformatics/bts483.

