
The Behaviour of Random Forest Permutation-based Variable Importance Measures Under Predictor Correlation

Overview
Publisher Biomed Central
Specialty Biology
Date 2010 Mar 2
PMID 20187966
Citations 76
Abstract

Background: Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies, where predictor correlation is frequently observed. Recent studies of the permutation-based variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an extended simulation study to synthesize these results.

Results: When predictor correlation was present and predictors were associated with the outcome (HA), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while under the null hypothesis that no predictors are associated with the outcome (H0) the unconditional RF VIM was unbiased. Conditional VIMs showed lower values for correlated predictors than the unconditional VIMs under HA and were unbiased under H0. Scaled VIMs were clearly biased under both HA and H0.

Conclusions: Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are unbiased under the null hypothesis. Whether the observed increased VIMs for correlated predictors should be considered a "bias" (because they do not directly reflect the coefficients in the generating model) or a beneficial attribute of these VIMs depends on the application. For example, in genetic association studies, where correlation between markers may help to localize the functionally relevant variant, the increased importance of correlated predictors may be an advantage. On the other hand, we show examples where this increased importance may result in spurious signals.
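The distinctions the abstract draws (unconditional versus conditional permutation, unscaled versus scaled values) can be illustrated with a minimal, model-agnostic sketch. This is not the authors' simulation code and it does not fit a random forest. The function names, the toy predictor, and the data-generating model are all hypothetical. Repeated shuffles stand in for the per-tree out-of-bag differences of a real forest, and "conditional" permutation is simplified to shuffling within coarse bins of one correlated predictor. Scaling is assumed to mean dividing the mean error increase by its standard error.

```python
import random
import statistics


def mean_squared_error(predict, X, y):
    """Mean squared error of predict(row) against the targets y."""
    return statistics.fmean([(predict(row) - t) ** 2 for row, t in zip(X, y)])


def permuted_column(values, rng, strata=None):
    """Shuffle values globally, or only within strata when labels are given."""
    result = list(values)
    if strata is None:
        rng.shuffle(result)
        return result
    groups = {}
    for index, label in enumerate(strata):
        groups.setdefault(label, []).append(index)
    for indices in groups.values():
        block = [values[i] for i in indices]
        rng.shuffle(block)
        for i, v in zip(indices, block):
            result[i] = v
    return result


def error_increases(predict, X, y, column, n_repeats=25, seed=0, strata=None):
    """One error increase per repeat after permuting a single column.

    Assumption: each repeat plays the role of one tree's out-of-bag
    difference in a real random forest.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(predict, X, y)
    original = [row[column] for row in X]
    increases = []
    for _ in range(n_repeats):
        shuffled = permuted_column(original, rng, strata)
        X_perm = [row[:column] + [v] + row[column + 1:]
                  for row, v in zip(X, shuffled)]
        increases.append(mean_squared_error(predict, X_perm, y) - baseline)
    return increases


def unscaled_vim(increases):
    """Unscaled importance: the mean error increase."""
    return statistics.fmean(increases)


def scaled_vim(increases):
    """Scaled importance: mean increase divided by its standard error."""
    standard_error = statistics.stdev(increases) / len(increases) ** 0.5
    if standard_error == 0:
        return float("nan")  # undefined when every increase is identical
    return statistics.fmean(increases) / standard_error


# Hypothetical toy data: x1 is correlated with x0, x2 is independent noise,
# and only x0 enters the generating model.
data_rng = random.Random(1)
X, y = [], []
for _ in range(200):
    x0 = data_rng.gauss(0, 1)
    x1 = 0.9 * x0 + data_rng.gauss(0, 0.3)
    x2 = data_rng.gauss(0, 1)
    X.append([x0, x1, x2])
    y.append(2 * x0 + data_rng.gauss(0, 0.5))


def toy_predict(row):
    """Stand-in for a fitted model; it uses only the first predictor."""
    return 2 * row[0]


# Coarse bins of the correlated predictor x1 define the conditioning strata.
strata = [round(row[1]) for row in X]

unconditional = unscaled_vim(error_increases(toy_predict, X, y, 0))
conditional = unscaled_vim(error_increases(toy_predict, X, y, 0, strata=strata))
unused = unscaled_vim(error_increases(toy_predict, X, y, 2))
```

In this toy setting, shuffling x0 within bins of x1 preserves much of its relationship with the outcome, so the conditional value is smaller than the unconditional one, and a predictor the model never uses receives an importance of exactly zero. Because the scaled value divides by a standard error that shrinks as the number of repeats grows, it depends on that count, which is one reason to treat scaled values with caution.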

Citing Articles

Data science for pattern recognition in agricultural large time series data: A case study on sugarcane sucrose yield.

Bautista-Romero L, Sanchez-Murcia J, Ramirez-Gil J. Heliyon. 2025; 11(4):e42632.

PMID: 40034300 PMC: 11874567. DOI: 10.1016/j.heliyon.2025.e42632.


Out of (the) bag-encoding categorical predictors impacts out-of-bag samples.

Smith H, Biggs P, French N, Smith A, Marshall J. PeerJ Comput Sci. 2024; 10:e2445.

PMID: 39650463 PMC: 11623134. DOI: 10.7717/peerj-cs.2445.


A Return to Biased Nets: New Specifications and Approximate Bayesian Inference.

Butts C. J Math Sociol. 2024; 48(4):479-507.

PMID: 39309218 PMC: 11412518. DOI: 10.1080/0022250X.2024.2340137.


Low-density SNP markers with high prediction accuracy of genomic selection for bacterial wilt resistance in tomato.

Yeon J, Le N, Heo J, Sim S. Front Plant Sci. 2024; 15:1402693.

PMID: 38872894 PMC: 11169939. DOI: 10.3389/fpls.2024.1402693.


Vertical Metabolome Transfer from Mother to Child: An Explainable Machine Learning Method for Detecting Metabolomic Heritability.

Lovric M, Horner D, Chen L, Brustad N, Schoos A, Lasky-Su J. Metabolites. 2024; 14(3).

PMID: 38535296 PMC: 10972480. DOI: 10.3390/metabo14030136.


References
1. Cordell H. Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009; 10(6):392-404. PMC: 2872761. DOI: 10.1038/nrg2579.

2. Meng Y, Yu Y, Cupples L, Farrer L, Lunetta K. Performance of random forest when SNPs are in linkage disequilibrium. BMC Bioinformatics. 2009; 10:78. PMC: 2666661. DOI: 10.1186/1471-2105-10-78.

3. Nicodemus K, Malley J. Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics. 2009; 25(15):1884-90. DOI: 10.1093/bioinformatics/btp331.

4. Strobl C, Boulesteix A, Kneib T, Augustin T, Zeileis A. Conditional variable importance for random forests. BMC Bioinformatics. 2008; 9:307. PMC: 2491635. DOI: 10.1186/1471-2105-9-307.

5. Diaz-Uriarte R, de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006; 7:3. PMC: 1363357. DOI: 10.1186/1471-2105-7-3.