
Can Survival Prediction Be Improved by Merging Gene Expression Data Sets?

Overview
Journal PLoS One
Date 2009 Oct 24
PMID 19851466
Citations 27
Abstract

Background: High-throughput gene expression profiling technologies, which generate a wealth of data, are increasingly used to characterize tumor biopsies in clinical trials. By applying machine learning algorithms to such clinically documented data sets, one hopes to improve tumor diagnosis and prognosis as well as the prediction of treatment response. However, the small number of patients enrolled in a single trial limits the power of machine learning approaches because of over-fitting. One could partially overcome this limitation by merging data from different studies. Nevertheless, such data sets differ from each other with regard to technical biases, patient selection criteria and follow-up treatment. It is therefore unclear whether the advantage of increased sample size outweighs the disadvantage of the higher heterogeneity of merged data sets. Here, we present a systematic study to answer this question specifically for breast cancer data sets. We use survival prediction based on Cox regression as an assay to measure the added value of merged data sets.
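
For readers unfamiliar with the assay, the standard textbook form of Cox regression can be written as follows. This is the generic model, not a statement of the exact variant fitted in the study; the symbols (x for a patient's expression vector, β for the coefficients, δ_i for the event indicator, R(t_i) for the set of patients still at risk at time t_i) are introduced here for illustration, and the partial likelihood is shown under the assumption of no tied event times.

```latex
% Cox proportional hazards model: baseline hazard h_0(t) scaled by the covariates
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right)

% Partial likelihood maximized to estimate \beta (no tied event times assumed)
L(\beta) = \prod_{i:\,\delta_i = 1}
  \frac{\exp\!\left(\beta^{\top} x_i\right)}
       {\sum_{j \in R(t_i)} \exp\!\left(\beta^{\top} x_j\right)}

% Hazard ratio for a one-unit increase in the expression of gene k
\mathrm{HR}_k = \exp(\beta_k)
```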

Results: Using the time-dependent receiver operating characteristic area under the curve (ROC-AUC) and the hazard ratio as performance measures, we see overall no significant improvement or deterioration of survival prediction with merged data sets compared to individual data sets. This was apparently because a few genes with strong prognostic power were not available on all microarray platforms and thus were not retained in the merged data sets. Surprisingly, we found that the overall best performance was achieved with a single-gene predictor consisting of CYB5D1.
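
The loss of platform-specific genes described above can be illustrated with a minimal sketch. It assumes that merging keeps only the genes measured in every data set and that each gene is z-scored within its own study before pooling; the per-study standardization, the function names and the toy gene and platform names are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score one gene within one data set (population standard deviation)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def merge_on_common_genes(datasets):
    """Merge data sets, keeping only genes measured in every one of them.

    `datasets` maps a study name to {gene: [expression value per patient]}.
    Returns (common_genes, merged), where merged maps gene -> pooled values.
    """
    common = sorted(set.intersection(*(set(d) for d in datasets.values())))
    merged = {}
    for gene in common:
        pooled = []
        for name in sorted(datasets):
            # Assumption: standardize per study to damp platform-level offsets.
            pooled.extend(standardize(datasets[name][gene]))
        merged[gene] = pooled
    return common, merged


# Hypothetical toy data: GENE_C is not measured on platform_B.
toy = {
    "platform_A": {
        "GENE_A": [1.0, 2.0, 3.0],
        "GENE_B": [5.0, 5.5, 6.0],
        "GENE_C": [0.1, 0.9, 0.5],
    },
    "platform_B": {
        "GENE_A": [10.0, 12.0],
        "GENE_B": [7.0, 9.0],
    },
}

genes, merged = merge_on_common_genes(toy)
all_genes = set().union(*(set(d) for d in toy.values()))
dropped = sorted(all_genes - set(genes))
print("retained:", genes)
print("dropped:", dropped)
```

In this toy case GENE_C is dropped from the merged set even if it were the strongest predictor on platform_A, which mirrors the explanation given for why merging did not improve prediction.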

Conclusions: Merging did not deteriorate performance on average despite (a) the diversity of microarray platforms used, (b) the heterogeneity of patient cohorts, (c) the heterogeneity of breast cancer as a disease, (d) substantial variation in time to death or relapse, and (e) the reduced number of genes in the merged data sets. Predictors derived from the merged data sets were more robust, consistent and reproducible across microarray platforms. Moreover, merging data sets from different studies helps to better understand the biases of individual studies and can lead to the identification of strong survival factors such as CYB5D1 expression.

Citing Articles

Exploring SERPINA3 as a neuroinflammatory modulator in Alzheimer's disease with sex and regional brain variations.

Sanfilippo C, Castrogiovanni P, Imbesi R, Vecchio M, Sortino M, Musumeci G Metab Brain Dis. 2025; 40(1):83.

PMID: 39754632 DOI: 10.1007/s11011-024-01523-4.


Strategies for improving the performance of prediction models for response to immune checkpoint blockade therapy in cancer.

Zeng T, Zhang J, Stromberg A, Chen J, Wang C BMC Res Notes. 2024; 17(1):102.

PMID: 38594730 PMC: 11005243. DOI: 10.1186/s13104-024-06760-5.


Skeletal muscle of young females under resistance exercise exhibits a unique innate immune cell infiltration profile compared to males and elderly individuals.

Castrogiovanni P, Sanfilippo C, Imbesi R, Lazzarino G, Volti G, Tibullo D J Muscle Res Cell Motil. 2024; 45(4):171-190.

PMID: 38578562 DOI: 10.1007/s10974-024-09668-6.


Resistin-like beta reduction is associated to low survival rate and is downregulated by adjuvant therapy in colorectal cancer patients.

Di Rosa M, Di Cataldo A, Broggi G, Caltabiano R, Tibullo D, Castrogiovanni P Sci Rep. 2023; 13(1):1490.

PMID: 36707698 PMC: 9883247. DOI: 10.1038/s41598-023-28450-1.


A pairwise strategy for imputing predictive features when combining multiple datasets.

Wu Y, Ren B, Patil P Bioinformatics. 2022; 39(1).

PMID: 36576001 PMC: 9835467. DOI: 10.1093/bioinformatics/btac839.

