Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures

Overview

Journal PLoS One

Specialties General Medicine, Science

Date 2014 Jan 14

PMID 24416353

Citations 79

Authors

Isabella Zwiener

Barbara Frisch

Harald Binder


Abstract

Gene expression measurements have successfully been used for building prognostic signatures, i.e., for identifying a short list of important genes that can predict patient outcome. Mostly microarray measurements have been considered, and there is little advice available for building multivariable risk prediction models from RNA-Seq data. We specifically consider penalized regression techniques, such as the lasso and componentwise boosting, which can simultaneously consider all measurements and provide both multivariable regression models for prediction and automated variable selection. However, they might be affected by the typical skewness, mean-variance dependency, or extreme values of RNA-Seq covariates and therefore could benefit from transformations of the latter. In an analytical part, we highlight preferential selection of covariates with large variances, which is problematic due to the mean-variance dependency of RNA-Seq data. In a simulation study, we compare different transformations of RNA-Seq data for potentially improving detection of important genes. Specifically, we consider standardization, the log transformation, a variance-stabilizing transformation, the Box-Cox transformation, and rank-based transformations. In addition, the prediction performance for real data from patients with kidney cancer and acute myeloid leukemia is considered. We show that signature size, identification performance, and prediction performance critically depend on the choice of a suitable transformation. Rank-based transformations perform well in all scenarios and can even outperform complex variance-stabilizing approaches. Generally, the results illustrate that the distribution and potential transformations of RNA-Seq data need to be considered as a critical step when building risk prediction models by penalized regression techniques.
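As a rough illustration of the transformations named in the abstract, the following Python sketch applies standardization, a log transformation, a Box-Cox transformation, and a rank-based inverse normal transformation to a single gene's counts. It is not the authors' implementation: the function names, the pseudocount of 1, the fixed Box-Cox lambda, and the Blom-type offset of 3/8 are all assumptions chosen for illustration, and the variance-stabilizing transformation is omitted because the abstract does not specify it.

```python
from statistics import NormalDist

import numpy as np


def standardize(x):
    """Center to mean 0 and scale to unit sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)


def log_transform(x, pseudocount=1.0):
    """log(x + pseudocount); the pseudocount (an assumption) keeps zero counts finite."""
    return np.log(np.asarray(x, dtype=float) + pseudocount)


def box_cox(x, lam, shift=1.0):
    """Box-Cox transform of x + shift for a fixed, user-chosen lambda (assumption)."""
    x = np.asarray(x, dtype=float) + shift
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam


def average_ranks(x):
    """Ranks 1..n, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def rank_inverse_normal(x, offset=3.0 / 8.0):
    """Map ranks to standard normal quantiles via (rank - offset) / (n - 2*offset + 1)."""
    ranks = average_ranks(x)
    n = len(ranks)
    probs = (ranks - offset) / (n - 2.0 * offset + 1.0)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in probs])


# Hypothetical counts for one gene across seven samples, with one extreme value.
counts = np.array([0, 3, 5, 5, 12, 40, 1500])

z_raw = standardize(counts)
z_log = standardize(log_transform(counts))
z_rank = rank_inverse_normal(counts)
```

In this sketch the rank-based result depends only on the ordering of the counts, so it is unchanged by any strictly increasing transformation applied beforehand, and the extreme value of 1500 is pulled in to a moderate normal quantile rather than dominating the covariate's scale.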

Citing Articles

Analyzing the relationship between gene expression and phenotype in space-flown mice using a causal inference machine learning ensemble.

Casaletto J, Scott R, Myrick M, Mackintosh G, Chok H, Saravia-Butler A Sci Rep. 2025; 15(1):2363.

PMID: 39824847 PMC: 11748630. DOI: 10.1038/s41598-024-81394-y.

Shrinkage estimation of gene interaction networks in single-cell RNA sequencing data.

Vo D, Thorne T BMC Bioinformatics. 2024; 25(1):339.

PMID: 39462345 PMC: 11515282. DOI: 10.1186/s12859-024-05946-9.

ML-GAP: machine learning-enhanced genomic analysis pipeline using autoencoders and data augmentation.

Agraz M, Goksuluk D, Zhang P, Choi B, Clements R, Choudhary G Front Genet. 2024; 15:1442759.

PMID: 39399219 PMC: 11467662. DOI: 10.3389/fgene.2024.1442759.

Machine learning on multiple epigenetic features reveals H3K27Ac as a driver of gene expression prediction across patients with glioblastoma.

Suita Y, Bright Jr H, Pu Y, Toruner M, Idehen J, Tapinos N bioRxiv. 2024; .

PMID: 38979226 PMC: 11230286. DOI: 10.1101/2024.06.25.600585.

Evaluation of normalization methods for predicting quantitative phenotypes in metagenomic data analysis.

Wang B, Luan Y Front Genet. 2024; 15:1369628.

PMID: 38903761 PMC: 11188486. DOI: 10.3389/fgene.2024.1369628.
