
Using Background Knowledge from Preceding Studies for Building a Random Forest Prediction Model: A Plasmode Simulation Study

Overview
Journal: Entropy (Basel)
Publisher: MDPI
Date: 2022 Jun 24
PMID: 35741566
Abstract

There is increasing interest in machine learning (ML) algorithms for predicting patient outcomes, as these methods are designed to automatically discover complex data patterns. For example, the random forest (RF) algorithm is designed to identify relevant predictor variables out of a large set of candidates. In addition, researchers may use external information for variable selection to improve model interpretability and variable selection accuracy, and thereby prediction quality. However, it is unclear to what extent, if at all, RF and other ML methods may benefit from such external information. In this paper, we examine the usefulness of external information from prior variable-selection studies that used traditional statistical modeling approaches such as the Lasso, or suboptimal methods such as univariate selection. We conducted a plasmode simulation study based on subsampling a data set from a pharmacoepidemiologic study with nearly 200,000 individuals, two binary outcomes, and 1152 candidate predictor variables (mainly sparse binary). When the scope of candidate predictors was reduced based on external knowledge, RF models achieved better calibration, that is, better agreement between predictions and observed outcome rates. However, prediction quality as measured by cross-entropy, AUROC, or the Brier score did not improve. We recommend appraising the methodological quality of studies that serve as an external information source for future prediction model development.
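As a rough illustration of the comparison described in the abstract, the sketch below contrasts a random forest trained on a full candidate predictor set with one restricted to an externally preselected subset, scored with the same metric families mentioned above (Brier score, cross-entropy, AUROC). This is not the authors' code: it uses scikit-learn's RandomForestClassifier on small simulated data, and the names (X, y, preselected_idx, evaluate) and simulation settings are purely illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the study's pipeline): compare an RF on all
# candidate predictors vs. an RF on an externally preselected subset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulated stand-in for a large data set with mainly sparse binary predictors
# and a binary outcome that depends on only a few of them.
n, p = 5000, 200
X = (rng.random((n, p)) < 0.05).astype(float)
informative_idx = np.arange(10)
logit = X[:, informative_idx] @ rng.normal(1.0, 0.5, size=10) - 2.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# "External knowledge": columns flagged by a hypothetical preceding
# variable-selection study (here chosen to contain the informative ones).
preselected_idx = np.arange(20)

def evaluate(columns, label):
    # Fit an RF on the chosen columns and report calibration- and
    # discrimination-oriented metrics on the held-out split.
    rf = RandomForestClassifier(
        n_estimators=500, min_samples_leaf=10, random_state=0
    )
    rf.fit(X_train[:, columns], y_train)
    p_hat = rf.predict_proba(X_test[:, columns])[:, 1]
    print(
        f"{label}: Brier={brier_score_loss(y_test, p_hat):.4f}, "
        f"cross-entropy={log_loss(y_test, p_hat):.4f}, "
        f"AUROC={roc_auc_score(y_test, p_hat):.4f}"
    )

evaluate(np.arange(p), "all candidate predictors")
evaluate(preselected_idx, "externally preselected predictors")
```

In this toy setting the metrics may or may not differ noticeably; the paper's finding concerns calibration rather than these summary scores, and calibration could additionally be inspected with sklearn.calibration.calibration_curve on the predicted probabilities.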
