
Gene Selection and Classification of Microarray Data Using Random Forest

Overview
Publisher Biomed Central
Specialty Biology
Date 2006 Jan 10
PMID 16398926
Citations 619
Abstract

Background: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future diagnostic use in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene-ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited to microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.

Results: We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated data and nine microarray data sets, we show that random forest performs comparably to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than those of alternative methods) while preserving predictive accuracy.
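The abstract does not spell out the selection procedure. As a rough illustration only, one way importance-driven gene selection can work is backward elimination: fit a classifier, discard the least important genes, refit, and finally keep the smallest gene set whose estimated error is no worse than the best error seen. The sketch below assumes that pattern and is not taken from the paper. The names `backward_eliminate`, `fit`, `drop_fraction`, `min_genes`, and `toy_fit` are hypothetical, and the classifier is abstracted behind a callable so that the example runs without any external library.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: given the gene names to use, fit a classifier and
# return (estimated error, importance score per gene).
FitFn = Callable[[Sequence[str]], Tuple[float, Dict[str, float]]]


def backward_eliminate(
    genes: Sequence[str],
    fit: FitFn,
    drop_fraction: float = 0.2,  # assumed value, not stated in the abstract
    min_genes: int = 2,          # assumed stopping size
) -> Tuple[List[str], float]:
    """Repeatedly drop the least important genes and return the smallest
    gene set whose error equals the best error seen, with that error."""
    current = list(genes)
    history: List[Tuple[List[str], float]] = []
    while True:
        error, importance = fit(current)
        history.append((list(current), error))
        if len(current) <= min_genes:
            break
        # Rank genes from most to least important and keep the top part.
        ranked = sorted(current, key=lambda g: importance[g], reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        n_keep = max(min_genes, len(ranked) - n_drop)
        current = ranked[:n_keep]
    best_error = min(err for _, err in history)
    # Among the gene sets that reach the best error, prefer the smallest.
    candidates = [(s, e) for s, e in history if e <= best_error]
    return min(candidates, key=lambda item: len(item[0]))


def toy_fit(selected: Sequence[str]) -> Tuple[float, Dict[str, float]]:
    """Stand-in for a real classifier: only g1 and g2 carry signal."""
    informative = {"g1", "g2"}
    missing = len(informative - set(selected))
    error = 0.05 + 0.2 * missing  # error rises if a signal gene is lost
    importance = {g: (1.0 if g in informative else 0.0) for g in selected}
    return error, importance


genes = ["g1", "g2"] + [f"noise{i}" for i in range(18)]
selected, err = backward_eliminate(genes, toy_fit)
print(selected, err)
```

In practice, `fit` would wrap a random forest implementation that reports an out-of-bag error estimate and per-variable importance scores; `toy_fit` exists only to make the control flow visible.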

Conclusion: Because of their performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.

Citing Articles

A comparison of random forest variable selection methods for regression modeling of continuous outcomes.

O'Connell N, Jaeger B, Bullock G, Speiser J Brief Bioinform. 2025; 26(2).

PMID: 40062620 PMC: 11891652. DOI: 10.1093/bib/bbaf096.


Explainable artificial intelligence of DNA methylation-based brain tumor diagnostics.

Benfatto S, Sill M, Jones D, Pfister S, Sahm F, von Deimling A Nat Commun. 2025; 16(1):1787.

PMID: 39979307 PMC: 11842776. DOI: 10.1038/s41467-025-57078-0.


Comparison of Random Forest and Stepwise Regression for Variable Selection Using Low Prevalence Predictors: A Case Study in Paediatric Sepsis.

Gilholm P, Lister P, Irwin A, Harley A, Raman S, Schlapbach L Matern Child Health J. 2025; .

PMID: 39812888 DOI: 10.1007/s10995-025-04038-1.


Prognostic factors in patients with gastrointestinal perforation under the acute care surgery model: a retrospective cohort study.

Sung K, Hwang S, Lee J, Cho J BMC Surg. 2024; 24(1):406.

PMID: 39709362 PMC: 11662852. DOI: 10.1186/s12893-024-02687-7.


Monitoring data compilations can be leveraged to highlight relationships between estuarine and watershed factors influencing eutrophication in estuaries.

Pelletier M, Latimer J, Rashleigh B, Tilburg C, Charpentier M Environ Monit Assess. 2024; 197(1):80.

PMID: 39707068 PMC: 11753031. DOI: 10.1007/s10661-024-13564-4.

