NBLDA: Negative Binomial Linear Discriminant Analysis for RNA-Seq Data

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2016 Sep 15

PMID 27623864

Citations 13

Authors

Kai Dong

Hongyu Zhao

Tiejun Tong

Xiang Wan

Abstract

Background: RNA-sequencing (RNA-Seq) has become a powerful technology for characterizing gene expression profiles because it is more accurate and comprehensive than microarrays. Statistical methods developed for microarray data can be applied to RNA-Seq data, but they are not ideal because RNA-Seq data are discrete counts. The Poisson and negative binomial distributions are commonly used to model count data. Recently, Witten (Annals Appl Stat 5:2493-2518, 2011) proposed a Poisson linear discriminant analysis for RNA-Seq data. The Poisson assumption, however, may be less appropriate than the negative binomial distribution when biological replicates are available and the data are overdispersed (i.e., when the variance is larger than the mean). Negative binomial variables are more complicated to model because they involve a dispersion parameter that must be estimated.
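To make the overdispersion point concrete, the following minimal Python sketch compares the two count models. It assumes a common mean/dispersion parameterization that is not taken from the paper: a negative binomial with mean mu and dispersion phi has variance mu + phi * mu^2, and it approaches the Poisson distribution as phi tends to 0. The function names are illustrative only.

```python
import math


def poisson_logpmf(x, mu):
    """Log-probability of count x under a Poisson model with mean mu."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)


def nb_logpmf(x, mu, phi):
    """Log-probability of count x under a negative binomial model.

    Assumed parameterization: mean mu > 0 and dispersion phi > 0,
    so that the variance is mu + phi * mu**2.
    """
    r = 1.0 / phi  # "size" parameter of the negative binomial
    return (
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        - r * math.log1p(mu / r)
        + x * (math.log(mu) - math.log(r + mu))
    )


def nb_variance(mu, phi):
    """Variance implied by the mean/dispersion parameterization."""
    return mu + phi * mu ** 2


if __name__ == "__main__":
    mu = 10.0
    # The gap to the Poisson log-pmf shrinks as the dispersion goes to 0.
    for phi in (1.0, 0.1, 1e-6):
        gap = nb_logpmf(5, mu, phi) - poisson_logpmf(5, mu)
        print(f"phi={phi:g}  variance={nb_variance(mu, phi):.4g}  "
              f"log-pmf gap vs Poisson={gap:.3g}")
```

With a Poisson model the variance is tied to the mean, so any positive phi gives the negative binomial strictly more variance at the same mean.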

Results: We propose a negative binomial linear discriminant analysis for RNA-Seq data. Using Bayes' rule, we construct the classifier by fitting a negative binomial model, and we propose plug-in rules to estimate its unknown parameters. We explore the relationship between the negative binomial classifier and the Poisson classifier, and numerically investigate the impact of dispersion on the discriminant score. In simulations, the proposed method outperforms existing methods. We also analyze two real RNA-Seq data sets to demonstrate its advantages in real-world applications.
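As a rough illustration of this kind of construction, the sketch below scores a sample against each class by adding the log prior to the sum of per-gene negative binomial log-likelihoods, then picks the class with the highest score. It is not the authors' implementation. It assumes genes are independent, ignores library-size normalization, and takes the per-gene dispersions `phi` as given instead of estimating them with the paper's plug-in rules. All function names are hypothetical.

```python
import math
from collections import defaultdict


def nb_logpmf(x, mu, phi):
    """Negative binomial log-pmf with mean mu > 0 and dispersion phi > 0 (assumed form)."""
    r = 1.0 / phi
    return (
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        - r * math.log1p(mu / r)
        + x * (math.log(mu) - math.log(r + mu))
    )


def fit(counts, labels, pseudo=0.5):
    """Estimate class priors and per-class, per-gene means from training counts.

    counts: list of samples, each a list of non-negative integer gene counts.
    labels: class label of each sample.
    pseudo: small constant added to each gene total so no mean is exactly zero.
    """
    by_class = defaultdict(list)
    for sample, label in zip(counts, labels):
        by_class[label].append(sample)
    n_total = len(counts)
    priors, means = {}, {}
    for label, samples in by_class.items():
        n = len(samples)
        priors[label] = n / n_total
        means[label] = [(sum(gene) + pseudo) / n for gene in zip(*samples)]
    return priors, means


def discriminant_scores(x, priors, means, phi):
    """Log prior plus summed per-gene NB log-likelihood, for each class."""
    return {
        label: math.log(priors[label])
        + sum(nb_logpmf(xj, mu, ph) for xj, mu, ph in zip(x, means[label], phi))
        for label in priors
    }


def predict(x, priors, means, phi):
    """Assign x to the class with the largest discriminant score."""
    scores = discriminant_scores(x, priors, means, phi)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy data: two genes, two classes with opposite expression patterns.
    train = [[10, 2], [12, 1], [9, 3], [2, 10], [1, 12], [3, 9]]
    labels = ["A", "A", "A", "B", "B", "B"]
    priors, means = fit(train, labels)
    print(predict([11, 2], priors, means, phi=[0.1, 0.1]))
```

Passing very small values in `phi` makes each per-gene term behave like a Poisson log-likelihood, which is one way to see how a negative binomial score of this form relates to a Poisson one.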

Conclusions: We have developed a new classifier based on the negative binomial model for RNA-Seq data classification. Our simulation results show that the proposed classifier performs better than existing methods, and it can serve as an effective tool for classifying RNA-Seq data. Based on the comparison results, we provide guidelines to help scientists decide which method to use in the discriminant analysis of RNA-Seq data. R code is available at http://www.comp.hkbu.edu.hk/~xwan/NBLDA.R or https://github.com/yangchadam/NBLDA.

Citing Articles

TabDEG: Classifying differentially expressed genes from RNA-seq data based on feature extraction and deep learning framework.

Feng S, Wang Z, Jin Y, Xu S. PLoS One. 2024; 19(7):e0305857.

PMID: 39037985 PMC: 11262683. DOI: 10.1371/journal.pone.0305857.

A Bayesian model for predicting monthly fire frequency in Kenya.

Orero L, Omondi E, Omolo B. PLoS One. 2024; 19(1):e0291800.

PMID: 38271480 PMC: 10810550. DOI: 10.1371/journal.pone.0291800.

Finite mixtures of matrix variate Poisson-log normal distributions for three-way count data.

Silva A, Qin X, Rothstein S, McNicholas P, Subedi S. Bioinformatics. 2023; 39(5).

PMID: 37018147 PMC: 10159656. DOI: 10.1093/bioinformatics/btad167.

scDLC: a deep learning framework to classify large sample single-cell RNA-seq data.

Zhou Y, Peng M, Yang B, Tong T, Zhang B, Tang N. BMC Genomics. 2022; 23(1):504.

PMID: 35831808 PMC: 9281153. DOI: 10.1186/s12864-022-08715-1.

LPDA: A new classification method based on linear programming.

Nueda M, Gandia C, Molina M. PLoS One. 2022; 17(7):e0270403.

PMID: 35797275 PMC: 9262202. DOI: 10.1371/journal.pone.0270403.
