VARIABLE SELECTION AND PREDICTION WITH INCOMPLETE HIGH-DIMENSIONAL DATA

Overview

Journal Ann Appl Stat

Date 2016 May 24

PMID 27213023

Citations 8

Authors

Ying Liu

Yuanjia Wang

Yang Feng

Melanie M Wall


Abstract

We propose a Multiple Imputation Random Lasso (MIRL) method to select important variables and to predict the outcome for an epidemiological study of Eating and Activity in Teens. In this study, 80% of individuals have at least one variable missing. Therefore, applying variable selection methods developed for complete data after listwise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. MIRL combines penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies are conducted to compare MIRL with several alternatives. MIRL outperforms the other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and its advantage is greater when the correlation among variables is high and the proportion of missing data is high. MIRL also shows improved performance compared with other applicable methods when applied to the study of Eating and Activity in Teens for boys and girls separately, and to a subgroup of low socioeconomic status (SES) Asian boys who are at high risk of developing obesity.
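The combination described in the abstract (multiple imputation, a lasso-type penalized regression, and stability selection) can be illustrated with a minimal sketch. This is not the authors' MIRL algorithm: the abstract does not give its details, so everything below is an assumption made for illustration. In particular, the hot-deck imputation is a crude stand-in for a proper multiple imputation procedure, the plain lasso omits the random variable sampling of random lasso, and the function names (`lasso_cd`, `impute_hot_deck`, `selection_frequency`), the penalty `lam=0.3`, and the 0.8 selection threshold are all hypothetical choices.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent.
    Minimizes (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coordinate j.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def impute_hot_deck(X, rng):
    """ASSUMPTION: simple stand-in for multiple imputation. Each missing
    cell is filled with a random observed value from the same column."""
    X_imp = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        X_imp[miss, j] = rng.choice(X[~miss, j], size=miss.sum())
    return X_imp

def selection_frequency(X, y, lam=0.3, n_imputations=5, n_bootstrap=20, seed=0):
    """Fraction of (imputation, bootstrap) fits in which each variable
    receives a nonzero lasso coefficient."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_imputations):
        X_imp = impute_hot_deck(X, rng)
        for _ in range(n_bootstrap):
            idx = rng.integers(0, n, size=n)  # bootstrap resample of rows
            Xb, yb = X_imp[idx], y[idx]
            Xb = (Xb - Xb.mean(axis=0)) / Xb.std(axis=0)  # standardize columns
            yb = yb - yb.mean()
            counts += lasso_cd(Xb, yb, lam) != 0
    return counts / (n_imputations * n_bootstrap)

# Synthetic data (hypothetical): only variables 0 and 1 affect the outcome.
rng = np.random.default_rng(1)
n, p = 200, 6
X_full = rng.normal(size=(n, p))
y = 2.0 * X_full[:, 0] - 1.5 * X_full[:, 1] + rng.normal(size=n)

# Delete about 20% of cells completely at random.
X = X_full.copy()
X[rng.random((n, p)) < 0.2] = np.nan

freq = selection_frequency(X, y)
selected = np.flatnonzero(freq >= 0.8)  # keep variables chosen in most fits
print("selection frequencies:", np.round(freq, 2))
print("selected variables:", selected)
```

Aggregating selection indicators over both imputations and bootstrap resamples means a variable is kept only if it is chosen consistently, which is the stability-selection idea the abstract refers to. A faithful implementation would replace the hot-deck step with a model-based imputation method and follow the procedure given in the full paper.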

Citing Articles

Multi-omics regulatory network inference in the presence of missing data.

Henao J, Lauber M, Azevedo M, Grekova A, Theis F, List M. Brief Bioinform. 2023; 24(5).

PMID: 37670505 PMC: 10516394. DOI: 10.1093/bib/bbad309.

Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods.

Du J, Boss J, Han P, Beesley L, Kleinsasser M, Goutman S. J Comput Graph Stat. 2023; 31(4):1063-1075.

PMID: 36644406 PMC: 9838615. DOI: 10.1080/10618600.2022.2035739.

How to apply variable selection machine learning algorithms with multiply imputed data: A missing discussion.

Gunn H, Rezvan P, Fernandez M, Comulada W. Psychol Methods. 2022; 28(2):452-471.

PMID: 35113633 PMC: 10117422. DOI: 10.1037/met0000478.

Structure and stability of symptoms in first episode psychosis: a longitudinal network approach.

Griffiths S, Leighton S, Mallikarjun P, Blake G, Everard L, Jones P. Transl Psychiatry. 2021; 11(1):567.

PMID: 34743179 PMC: 8572227. DOI: 10.1038/s41398-021-01687-y.

Health system influences on potentially avoidable hospital admissions by secondary mental health service use: A national ecological study.

Woodhead C, Martin P, Osborn D, Barratt H, Raine R. J Health Serv Res Policy. 2021; 27(1):22-30.

PMID: 34337981 PMC: 8772012. DOI: 10.1177/13558196211036739.
