The Revival of the Gini Importance?

Overview

Journal: Bioinformatics

Publisher: Oxford University Press

Specialty: Biology

Date: 2018 May 15

PMID: 29757357

Citations: 151

Authors

Stefano Nembrini

Inke R König

Marvin N Wright

Abstract

Motivation: Random forests are fast and flexible, and they offer a robust approach to analyzing high-dimensional data. A key advantage over alternative machine learning algorithms is their variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency.
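The bias toward variables with many possible split points can be illustrated with a small simulation. The following is a minimal sketch, not code from the paper or from ranger: it assumes binary labels, a single root split, and predictors that are pure noise with ordered integer levels. The function names `best_split_decrease` and `mean_null_decrease` are hypothetical. Because the best split is chosen as a maximum over all candidate thresholds, a noise variable with more distinct values tends to achieve a larger impurity decrease by chance.

```python
import random


def best_split_decrease(x, y):
    """Largest Gini impurity decrease over all thresholds on x (binary y)."""
    n = len(y)
    pairs = sorted(zip(x, y))
    total_pos = sum(y)
    p = total_pos / n
    parent = 2.0 * p * (1.0 - p)
    best = 0.0
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid threshold between equal x values
        p_l = left_pos / i
        p_r = (total_pos - left_pos) / (n - i)
        children = (i / n) * 2.0 * p_l * (1.0 - p_l) \
            + ((n - i) / n) * 2.0 * p_r * (1.0 - p_r)
        best = max(best, parent - children)
    return best


def mean_null_decrease(n_levels, n=100, reps=200, seed=0):
    """Average best-split decrease when x is unrelated to y."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        y = [rng.randint(0, 1) for _ in range(n)]
        x = [rng.randrange(n_levels) for _ in range(n)]  # noise predictor
        total += best_split_decrease(x, y)
    return total / reps


few = mean_null_decrease(n_levels=2)    # one candidate split point
many = mean_null_decrease(n_levels=50)  # up to 49 candidate split points
print(f"mean decrease, 2 levels:  {few:.4f}")
print(f"mean decrease, 50 levels: {many:.4f}")
```

Both predictors carry no information about the label, yet the one with more levels receives the larger average impurity decrease, which is the effect an impurity-based importance would accumulate over a whole forest.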

Results: We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure that is unbiased with regard to the number of categories and minor allele frequency and is almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead beyond the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient.
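One way to picture the debiasing idea is to subtract the impurity decrease obtained by a permuted copy of a variable from the decrease obtained by the variable itself. The sketch below is an illustration under stated assumptions, not the ranger implementation: it works on a single root split with binary labels, and the names `best_split_decrease` and `simulate` are hypothetical. The abstract does not spell out the algorithm, so the permuted-copy subtraction here is an assumed simplification of the approach.

```python
import random


def best_split_decrease(x, y):
    """Largest Gini impurity decrease over all thresholds on x (binary y)."""
    n = len(y)
    pairs = sorted(zip(x, y))
    total_pos = sum(y)
    p = total_pos / n
    parent = 2.0 * p * (1.0 - p)
    best = 0.0
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid threshold between equal x values
        p_l = left_pos / i
        p_r = (total_pos - left_pos) / (n - i)
        children = (i / n) * 2.0 * p_l * (1.0 - p_l) \
            + ((n - i) / n) * 2.0 * p_r * (1.0 - p_r)
        best = max(best, parent - children)
    return best


def simulate(n_levels, informative, n=100, reps=300, seed=1):
    """Return (mean raw decrease, mean corrected decrease) over reps."""
    rng = random.Random(seed)
    raw_sum = 0.0
    corrected_sum = 0.0
    for _ in range(reps):
        x = [rng.randrange(n_levels) for _ in range(n)]
        if informative:
            # Label follows the upper half of the levels, flipped with prob 0.2.
            y = [int(v >= n_levels / 2) ^ int(rng.random() < 0.2) for v in x]
        else:
            y = [rng.randint(0, 1) for _ in range(n)]
        x_perm = x[:]
        rng.shuffle(x_perm)  # permuted copy: same values, no link to y
        raw = best_split_decrease(x, y)
        null = best_split_decrease(x_perm, y)
        raw_sum += raw
        corrected_sum += raw - null  # assumed correction: subtract null decrease
    return raw_sum / reps, corrected_sum / reps


raw_noise_few, corr_noise_few = simulate(2, informative=False)
raw_noise_many, corr_noise_many = simulate(50, informative=False)
raw_signal, corr_signal = simulate(2, informative=True)
print(f"noise,  2 levels: raw={raw_noise_few:.4f} corrected={corr_noise_few:.4f}")
print(f"noise, 50 levels: raw={raw_noise_many:.4f} corrected={corr_noise_many:.4f}")
print(f"signal, 2 levels: raw={raw_signal:.4f} corrected={corr_signal:.4f}")
```

In this toy setting the corrected value averages close to zero for noise predictors regardless of the number of levels, while an informative predictor keeps a clearly positive value. The permuted copy costs one extra split search per variable rather than a full permutation-importance pass over the forest.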

Availability and Implementation: The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Citing Articles

Unveiling the antiviral inhibitory activity of ebselen and ebsulfur derivatives on SARS-CoV-2 using machine learning-based QSAR, LB-PaCS-MD, and experimental assay.

Sinsulpsiri S, Nishii Y, Xu-Xu Q, Miura M, Wilasluck P, Salamteh K. Sci Rep. 2025; 15(1):6956.

PMID: 40011571 PMC: 11865625. DOI: 10.1038/s41598-025-91235-1.

Predicting positive test results using large-scale longitudinal data of demographics and medication history.

Pham A, El-Kareh R, Myers F, Ohno-Machado L, Kuo T. Heliyon. 2025; 11(1):e41350.

PMID: 39958729 PMC: 11825254. DOI: 10.1016/j.heliyon.2024.e41350.

Feature graphs for interpretable unsupervised tree ensembles: centrality, interaction, and application in disease subtyping.

Sirocchi C, Urschler M, Pfeifer B. BioData Min. 2025; 18(1):15.

PMID: 39955586 PMC: 11829558. DOI: 10.1186/s13040-025-00430-3.

Using Structured Codes and Free-Text Notes to Measure Information Complementarity in Electronic Health Records: Feasibility and Validation Study.

Seinen T, Kors J, van Mulligen E, Rijnbeek P. J Med Internet Res. 2025; 27:e66910.

PMID: 39946687 PMC: 11887999. DOI: 10.2196/66910.

Enhancing individual glomerular filtration rate assessment: can we trust the equation? Development and validation of machine learning models to assess the trustworthiness of estimated GFR compared to measured GFR.

Lanot A, Akesson A, Nakano F, Vens C, Bjork J, Nyman U. BMC Nephrol. 2025; 26(1):47.

PMID: 39885391 PMC: 11780799. DOI: 10.1186/s12882-025-03972-0.

References

Strobl C, Boulesteix A, Zeileis A, Hothorn T. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics. 2007; 8:25. PMC: 1796903. DOI: 10.1186/1471-2105-8-25.

van 't Veer L, Dai H, van de Vijver M, He Y, Hart A, Mao M. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002; 415(6871):530-6. DOI: 10.1038/415530a.

Walters R, Laurin C, Lubke G. An integrated approach to reduce the impact of minor allele frequency and linkage disequilibrium on variable importance measures for genome-wide data. Bioinformatics. 2012; 28(20):2615-23. PMC: 3467741. DOI: 10.1093/bioinformatics/bts483.

Nicodemus K. Letter to the editor: on the stability and ranking of predictors from random forest variable importance measures. Brief Bioinform. 2011; 12(4):369-73. PMC: 3137934. DOI: 10.1093/bib/bbr016.

Diaz-Uriarte R, de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006; 7:3. PMC: 1363357. DOI: 10.1186/1471-2105-7-3.

Nicodemus K, Malley J. Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics. 2009; 25(15):1884-90. DOI: 10.1093/bioinformatics/btp331.

Szymczak S, Holzinger E, Dasgupta A, Malley J, Molloy A, Mills J. r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min. 2016; 9:7. PMC: 4736152. DOI: 10.1186/s13040-016-0087-3.

Goldstein B, Polley E, Briggs F. Random forests for genetic association studies. Stat Appl Genet Mol Biol. 2012; 10(1):32. PMC: 3154091. DOI: 10.2202/1544-6115.1691.

Altmann A, Tolosi L, Sander O, Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010; 26(10):1340-7. DOI: 10.1093/bioinformatics/btq134.

Strobl C, Boulesteix A, Kneib T, Augustin T, Zeileis A. Conditional variable importance for random forests. BMC Bioinformatics. 2008; 9:307. PMC: 2491635. DOI: 10.1186/1471-2105-9-307.

Webster J, Gibbs J, Clarke J, Ray M, Zhang W, Holmans P. Genetic control of human brain transcript expression in Alzheimer disease. Am J Hum Genet. 2009; 84(4):445-58. PMC: 2667989. DOI: 10.1016/j.ajhg.2009.03.011.

Nicodemus K, Malley J, Strobl C, Ziegler A. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010; 11:110. PMC: 2848005. DOI: 10.1186/1471-2105-11-110.

Ishwaran H. The Effect of Splitting on Random Forests. Mach Learn. 2017; 99(1):75-118. PMC: 5599182. DOI: 10.1007/s10994-014-5451-2.

Wright M, Dankowski T, Ziegler A. Unbiased split variable selection for random survival forests using maximally selected rank statistics. Stat Med. 2017; 36(8):1272-1284. DOI: 10.1002/sim.7212.

Boulesteix A, Bender A, Lorenzo Bermejo J, Strobl C. Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations. Brief Bioinform. 2011; 13(3):292-304. DOI: 10.1093/bib/bbr053.

Calle M, Urrea V. Letter to the editor: Stability of Random Forest importance measures. Brief Bioinform. 2010; 12(1):86-9. DOI: 10.1093/bib/bbq011.

Degenhardt F, Seifert S, Szymczak S. Evaluation of variable selection methods for random forests and omics data sets. Brief Bioinform. 2017; 20(2):492-503. PMC: 6433899. DOI: 10.1093/bib/bbx124.

Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999; 286(5439):531-7. DOI: 10.1126/science.286.5439.531.