Probabilistic Outlier Identification for RNA Sequencing Generalized Linear Models

Overview

Journal NAR Genom Bioinform

Publisher Oxford University Press

Specialty Biology

Date 2021 Mar 12

PMID 33709073

Citations 4

Authors

Stefano Mangiola

Evan A Thomas

Martin Modrak

Aki Vehtari

Anthony T Papenfuss

Abstract

Relative transcript abundance has proven to be a valuable tool for understanding the function of genes in biological systems. For the differential analysis of transcript abundance using RNA sequencing data, the negative binomial model is by far the most frequently adopted. However, common methods based on a negative binomial model are not robust to extreme outliers, which we found to be abundant in public datasets. So far, no rigorous, probabilistic methods for outlier detection have been developed for RNA sequencing data, leaving identification mostly to visual inspection. Recent advances in Bayesian computation allow large-scale comparison of observed data against their theoretical distribution under a statistical model. Here we propose ppcseq, a quality-control tool for differential expression analysis that identifies transcripts containing outlier data points, that is, data points that do not follow a negative binomial distribution. Applying ppcseq to several publicly available datasets analysed with popular tools, we show that 3 to 10 percent of differentially abundant transcripts, across algorithms and datasets, had statistics inflated by the presence of outliers.
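The core idea in the abstract, comparing each observed count against the range a negative binomial model would plausibly generate, can be illustrated with a small sketch. This is not the ppcseq implementation: it assumes the mean `mu` and size parameter `phi` are already known point values (a Bayesian workflow would instead use posterior draws), and the function names `nb_log_pmf`, `nb_interval` and `flag_outliers`, the example counts, and the 95% coverage are all hypothetical choices made for illustration.

```python
import math


def nb_log_pmf(y, mu, phi):
    """Log-probability of count y under a negative binomial with
    mean mu and size phi (variance = mu + mu**2 / phi)."""
    return (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
            + phi * math.log(phi / (phi + mu))
            + y * math.log(mu / (phi + mu)))


def nb_interval(mu, phi, coverage=0.95):
    """Central interval [lo, hi] containing at least `coverage` of the
    probability mass, found by accumulating the pmf from zero upwards."""
    tail = (1.0 - coverage) / 2.0
    cdf, y, lo = 0.0, 0, None
    while True:
        cdf += math.exp(nb_log_pmf(y, mu, phi))
        if lo is None and cdf > tail:
            lo = y  # smallest count whose CDF exceeds the lower tail
        if cdf >= 1.0 - tail:
            return lo, y  # smallest count whose CDF reaches the upper tail
        y += 1


def flag_outliers(counts, mu, phi, coverage=0.95):
    """Indices of counts that fall outside the central model interval."""
    lo, hi = nb_interval(mu, phi, coverage)
    return [i for i, c in enumerate(counts) if c < lo or c > hi]


# Hypothetical counts for one transcript across five samples;
# the last sample is an extreme value.
counts = [95, 110, 102, 88, 5000]
flagged = flag_outliers(counts, mu=100.0, phi=10.0)
print(flagged)
```

In a full posterior predictive check, the interval would be built from counts simulated under many posterior draws of the model parameters, and the coverage would need to be chosen with the number of tested data points in mind, since a fixed 95% interval flags some well-behaved observations by chance.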

Citing Articles

cellsig plug-in enhances CIBERSORTx signature selection for multidataset transcriptomes with sparse multilevel modelling.

Khan M, Wu J, Sun Y, Barrow A, Papenfuss A, Mangiola S. Bioinformatics. 2023; 39(12).

PMID: 37952182 PMC: 10692870. DOI: 10.1093/bioinformatics/btad685.

sccomp: Robust differential composition and variability analysis for single-cell data.

Mangiola S, Roth-Schulze A, Trussart M, Zozaya-Valdes E, Ma M, Gao Z. Proc Natl Acad Sci U S A. 2023; 120(33):e2203828120.

PMID: 37549298 PMC: 10438834. DOI: 10.1073/pnas.2203828120.

Taurine deficiency as a driver of aging.

Singh P, Gollapalli K, Mangiola S, Schranner D, Yusuf M, Chamoli M. Science. 2023; 380(6649):eabn9257.

PMID: 37289866 PMC: 10630957. DOI: 10.1126/science.abn9257.

Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data.

Parkinson E, Liberatore F, Watkins W, Andrews R, Edkins S, Hibbert J. Front Genet. 2023; 14:1158352.

PMID: 37113992 PMC: 10126415. DOI: 10.3389/fgene.2023.1158352.
