Variable Selection Method for the Identification of Epistatic Models

Overview

Journal Pac Symp Biocomput

Publisher World Scientific

Specialty Biology

Date 2015 Jan 17

PMID 25592581

Citations 8

Authors

Emily Rose Holzinger

Silke Szymczak

Abhijit Dasgupta

James Malley

Qing Li

Joan E Bailey-Wilson

Abstract

Standard analysis methods for genome-wide association studies (GWAS) are not robust to complex disease models, such as interactions between variables with small main effects. These types of effects likely contribute to the heritability of complex human traits. Machine learning methods that are capable of identifying interactions, such as Random Forests (RF), are an alternative analysis approach. One caveat of RF is that there is no standardized method of selecting variables so that false positives are reduced while adequate power is retained. To this end, we have developed a novel variable selection method called relative recurrency variable importance metric (r2VIM). This method incorporates recurrency and variance estimation to assist in optimal threshold selection. For this study, we specifically address how this method performs in data with almost completely epistatic effects (i.e., no marginal effects). Our results show that, with appropriate parameter settings, r2VIM can identify interaction effects when the marginal effects are virtually nonexistent. It also outperforms logistic regression, which has essentially no power under this type of model when the number of potential features (genetic variants) is large. (All Supplementary Data can be found here: http://research.nhgri.nih.gov/manuscripts/Bailey-Wilson/r2VIM_epi/).
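The abstract names two ingredients, recurrency and variance estimation, without giving a formula. The sketch below is one way these could combine into a selection rule, and it is not the authors' implementation. It assumes that permutation importances from several independent RF runs are already available as a runs-by-variables matrix. It also assumes that the absolute value of the smallest importance in a run is a rough estimate of the spread of null importances. The function name `select_recurrent`, the `factor` parameter, and the example values are hypothetical.

```python
import numpy as np


def select_recurrent(importances, factor=1.0):
    """Select variables whose relative importance clears `factor` in every run.

    importances: array-like of shape (n_runs, n_variables) holding the
    permutation importances from independent random forest runs
    (hypothetical input; any RF implementation could supply it).
    """
    vim = np.asarray(importances, dtype=float)
    # Assumption: the absolute minimum importance in a run approximates the
    # scale of importances observed for variables with no effect.
    null_scale = np.abs(vim.min(axis=1, keepdims=True))
    if np.any(null_scale == 0):
        raise ValueError("a run has a minimum importance of zero; cannot rescale")
    relative = vim / null_scale
    # Recurrency: a variable is kept only if it passes the threshold in all runs.
    keep = (relative >= factor).all(axis=0)
    return np.flatnonzero(keep), relative


# Made-up importances for 3 runs and 5 variables (illustration only).
runs = np.array([
    [0.030, 0.001, -0.004, 0.002, -0.002],
    [0.025, 0.006, -0.005, -0.001, 0.001],
    [0.028, -0.002, -0.003, 0.004, 0.000],
])
selected, relative = select_recurrent(runs, factor=3.0)
print(selected)
```

In this made-up example only the first variable exceeds three times the estimated null scale in all three runs. Raising `factor` trades power for fewer false positives, which is the threshold choice the abstract refers to.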

Citing Articles

What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics.

Musolf A, Holzinger E, Malley J, Bailey-Wilson J. Hum Genet. 2021; 141(9):1515-1528.

PMID: 34862561 PMC: 9360120. DOI: 10.1007/s00439-021-02402-z.

A comparison of methods for interpreting random forest models of genetic association in the presence of non-additive interactions.

Orlenko A, Moore J. BioData Min. 2021; 14(1):9.

PMID: 33514397 PMC: 7847145. DOI: 10.1186/s13040-021-00243-0.

Evaluation of variable selection methods for random forests and omics data sets.

Degenhardt F, Seifert S, Szymczak S. Brief Bioinform. 2017; 20(2):492-503.

PMID: 29045534 PMC: 6433899. DOI: 10.1093/bib/bbx124.

Advantages of Synthetic Noise and Machine Learning for Analyzing Radioecological Data Sets.

Shuryak I. PLoS One. 2017; 12(1):e0170007.

PMID: 28068401 PMC: 5222373. DOI: 10.1371/journal.pone.0170007.

r2VIM: A new variable selection method for random forests in genome-wide association studies.

Szymczak S, Holzinger E, Dasgupta A, Malley J, Molloy A, Mills J. BioData Min. 2016; 9:7.

PMID: 26839594 PMC: 4736152. DOI: 10.1186/s13040-016-0087-3.
