Normalization, Testing, and False Discovery Rate Estimation for RNA-sequencing Data

Overview

Journal: Biostatistics

Publisher: Oxford University Press

Specialty: Public Health

Date: 2011 Oct 18

PMID: 22003245

Citations: 159

Authors

Jun Li

Daniela M Witten

Iain M Johnstone

Robert Tibshirani

Abstract

We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
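The abstract names three ingredients: a log-linear model for counts, a normalization step that does not rely on raw read totals, and an FDR estimate. The sketch below illustrates one way these pieces could fit together for a two-class outcome. It is an illustrative reading only, not the authors' implementation. All function names (`estimate_depth`, `score_statistic`, `permutation_fdr`), the quantile window, the iteration count, and the use of a generic permutation plug-in FDR estimate are assumptions made for this example; the paper's own FDR procedure is described as novel and may differ. The model assumed here is that the count for a gene in sample i is Poisson with mean d_i * beta * exp(gamma * y_i), where d_i is the sequencing depth of sample i and gamma = 0 under the null hypothesis.

```python
import random

def estimate_depth(counts, n_iter=5, lower=0.25, upper=0.75):
    """Estimate relative sequencing depths (summing to 1) from a gene-by-sample
    count table, using only genes that look non-differential.
    Assumes every sample has at least one read among the retained genes."""
    rows = [r for r in counts if sum(r) > 0]
    n_samples = len(rows[0])

    def depth_from(subset):
        col = [sum(r[i] for r in subset) for i in range(n_samples)]
        total = sum(col)
        return [c / total for c in col]

    d = depth_from(rows)  # start from plain total-count normalization
    for _ in range(n_iter):
        # Goodness-of-fit of each gene to the "depth only" Poisson model.
        gof = [sum((n - sum(r) * di) ** 2 / (sum(r) * di)
                   for n, di in zip(r, d)) for r in rows]
        ranked = sorted(gof)
        lo = ranked[int(lower * (len(ranked) - 1))]
        hi = ranked[int(upper * (len(ranked) - 1))]
        # Keep genes with mid-range fit; poorly fitting genes may be differential.
        subset = [r for r, g in zip(rows, gof) if lo <= g <= hi]
        d = depth_from(subset)
    return d

def score_statistic(row, y, d):
    """Score statistic for gamma = 0 in the assumed Poisson log-linear model.
    `y` holds the outcome per sample (0/1 for two classes)."""
    total = sum(row)
    u = sum(yi * (n - total * di) for yi, n, di in zip(y, row, d))
    mean_y = sum(di * yi for di, yi in zip(d, y))
    var = total * (sum(di * yi * yi for di, yi in zip(d, y)) - mean_y ** 2)
    return u * u / var if var > 0 else 0.0

def permutation_fdr(counts, y, d, cutoff, n_perm=200, seed=0):
    """Generic permutation plug-in FDR estimate at a given cutoff
    (a stand-in, not the paper's procedure)."""
    rng = random.Random(seed)
    observed = [score_statistic(r, y, d) for r in counts]
    n_called = sum(s >= cutoff for s in observed)
    if n_called == 0:
        return 0.0
    labels = list(y)
    false_calls = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any association between outcome and counts
        false_calls += sum(score_statistic(r, labels, d) >= cutoff
                           for r in counts)
    return min(1.0, false_calls / n_perm / n_called)
```

A typical call sequence would be `d = estimate_depth(counts)`, then `score_statistic(row, y, d)` per gene, then `permutation_fdr(counts, y, d, cutoff)` for a chosen cutoff. Estimating depth from a subset of well-fitting genes is meant to show why raw read totals can mislead: a few highly expressed differential genes inflate the totals of one class, which biases depth estimates for every other gene.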

Citing Articles

Normalization and selecting non-differentially expressed genes improve machine learning modelling of cross-platform transcriptomic data.

Deng F, Feng C, Gao N, Zhang L. ArXiv. 2025.

PMID: 39975431 PMC: 11838701.

Analysis of gene expression of Babesia gibsoni cultured with diminazene aceturate using RNA sequencing.

Matsuda N, Ito M, Nukada Y, Toyoma M, Nagai K, Motegi T. J Vet Med Sci. 2025; 87(2):181-188.

PMID: 39756884 PMC: 11830443. DOI: 10.1292/jvms.24-0395.

Genome-wide profiling of DNA repair proteins in single cells.

de Luca K, Rullens P, Karpinska M, de Vries S, Gacek-Matthews A, Pongor L. Nat Commun. 2024; 15(1):9918.

PMID: 39572529 PMC: 11582664. DOI: 10.1038/s41467-024-54159-4.

Quantitative proteomics reveals extensive lysine ubiquitination and transcription factor stability states in Arabidopsis.

Song G, Montes C, Olatunji D, Malik S, Ji C, Clark N. Plant Cell. 2024; 37(1).

PMID: 39570863 PMC: 11663597. DOI: 10.1093/plcell/koae310.

Global impacts of peroxisome and pexophagy dysfunction revealed through multi-omics analyses of lon2 and atg2 mutants.

Muhammad D, Clark N, Tharp N, Chatt E, Vierstra R, Bartel B. Plant J. 2024; 120(6):2563-2583.

PMID: 39526456 PMC: 11658196. DOI: 10.1111/tpj.17129.
