Splitting Random Forest (SRF) for Determining Compact Sets of Genes That Distinguish Between Cancer Subtypes

Overview

Journal J Clin Bioinforma

Specialty Biology

Date 2012 May 24

PMID 22616791

Citations 3

Authors

Xiaowei Guan

Mark R Chance

Jill S Barnholtz-Sloan

Abstract

Background: The identification of very small subsets of predictive variables is an important topic that has not often been considered in the literature. To discover highly predictive yet compact gene set classifiers from whole genome expression data, a non-parametric, iterative algorithm, Splitting Random Forest (SRF), was developed to robustly identify genes that distinguish between molecular subtypes. The goal is to improve prediction accuracy while maintaining sparsity.

Results: The optimal SRF 50 run (SRF50) gene classifiers for glioblastoma (GB), breast cancer (BC) and ovarian cancer (OC) subtypes had overall prediction rates comparable to those from published datasets upon validation (80.1%-91.7%). The SRF50 sets outperformed other methods by identifying compact gene sets for distinguishing between the tested cancer subtypes (10- to 200-fold fewer genes than ANOVA or published gene sets). The SRF50 sets achieved superior and robust overall and subtype prediction accuracies when compared with single random forest (RF) and the Top 50 ANOVA results (SRF50 vs single RF: 80.1% vs 77.8% for GB, 84.0% vs 74.1% for BC, 89.8% vs 88.9% for OC; SRF50 vs Top 50 ANOVA: 80.1% vs 77.2% for GB, 84.0% vs 82.7% for BC, 89.8% vs 87.0% for OC). There was significant overlap between SRF50 and published gene sets, showing that SRF identifies the relevant subsets of important gene lists. In Ingenuity Pathway Analysis (IPA), the "hub" genes shared between the SRF50 and published gene sets were RB1, PIK3R1, PDGFBB and ERK1/2 for GB; ESR1, MYC, NFkB and ERK1/2 for BC; and Akt, FN1, NFkB, PDGFBB and ERK1/2 for OC.
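As a quick arithmetic check on the comparisons reported above, the following Python sketch tabulates SRF50's margin over each baseline in percentage points. The accuracies are copied from the Results paragraph; the dictionary keys and the function name `margins` are illustrative choices, not part of the published work.

```python
# Overall prediction accuracies (%) as reported in the Results paragraph.
# Key names are illustrative labels, not identifiers from the paper.
REPORTED = {
    "GB": {"srf50": 80.1, "single_rf": 77.8, "top50_anova": 77.2},
    "BC": {"srf50": 84.0, "single_rf": 74.1, "top50_anova": 82.7},
    "OC": {"srf50": 89.8, "single_rf": 88.9, "top50_anova": 87.0},
}


def margins(reported):
    """Return SRF50's margin over each baseline, in percentage points."""
    out = {}
    for cancer, acc in reported.items():
        out[cancer] = {
            method: round(acc["srf50"] - value, 1)
            for method, value in acc.items()
            if method != "srf50"
        }
    return out


if __name__ == "__main__":
    for cancer, diff in margins(REPORTED).items():
        print(
            f"{cancer}: +{diff['single_rf']:.1f} vs single RF, "
            f"+{diff['top50_anova']:.1f} vs Top 50 ANOVA"
        )
```

On these reported figures, the margins range from 0.9 percentage points (OC, SRF50 vs single RF) to 9.9 percentage points (BC, SRF50 vs single RF).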

Conclusions: The SRF approach is an effective driver of biomarker discovery research that reduces the number of genes needed for robust classification, dissects complex, high dimensional "omic" data and provides novel insights into the cellular mechanisms that define cancer subtypes.

Citing Articles

Molecular Subtyping of Cancer Based on Robust Graph Neural Network and Multi-Omics Data Integration.

Yin C, Cao Y, Sun P, Zhang H, Li Z, Xu Y Front Genet. 2022; 13:884028.

PMID: 35646077 PMC: 9137453. DOI: 10.3389/fgene.2022.884028.

Novel population of small tumour-initiating stem cells in the ovaries of women with borderline ovarian cancer.

Virant-Klun I, Stimpfel M Sci Rep. 2016; 6:34730.

PMID: 27703207 PMC: 5050448. DOI: 10.1038/srep34730.

A novel tree-based procedure for deciphering the genomic spectrum of clinical disease entities.

Mbogning C, Perdry H, Toussile W, Broet P J Clin Bioinforma. 2014; 4:6.

PMID: 24739673 PMC: 4129184. DOI: 10.1186/2043-9113-4-6.
