Perplexity: Evaluating Transcript Abundance Estimation in the Absence of Ground Truth

Overview

Journal Algorithms Mol Biol

Publisher BioMed Central

Date 2022 Mar 25

PMID 35331283

Authors

Jason Fan

Skylar Chan

Rob Patro

Abstract

Background: There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in the resulting estimates so that it can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpins the gene-expression-based analyses routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g., producing smooth versus sparse estimates), strategies for performing model selection in experimental data have been addressed informally at best.

Results: We derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. Furthermore, we demonstrate theoretically and experimentally that perplexity can be computed for arbitrary transcript abundance estimation models.
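For readers unfamiliar with the metric, the following is a minimal sketch of the generic perplexity definition used for language and topic models: the exponential of the negative mean log-probability of held-out items. It is not the authors' implementation. The function name `perplexity` and the input `fragment_probs` (one probability per held-out fragment, assumed to be computed elsewhere from a set of abundance estimates) are hypothetical, and the RNA-seq-specific corner cases mentioned above are not handled here; the sketch simply rejects zero-probability fragments.

```python
import math


def perplexity(fragment_probs):
    """Generic perplexity: exp(-(1/N) * sum(log p_i)).

    fragment_probs is a hypothetical input: the probability that a fitted
    model assigns to each held-out fragment. Lower perplexity means the
    model finds the held-out fragments less surprising.
    """
    if not fragment_probs:
        raise ValueError("need at least one held-out fragment")
    if any(p <= 0.0 for p in fragment_probs):
        # Assumption of this sketch: zero-probability fragments are an error.
        # The paper extends the metric to deal with such corner cases.
        raise ValueError("every fragment needs a strictly positive probability")
    mean_log_prob = sum(math.log(p) for p in fragment_probs) / len(fragment_probs)
    return math.exp(-mean_log_prob)


# A model that assigns probability 0.25 to every held-out fragment is as
# uncertain as a uniform choice among four outcomes.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Under this definition, comparing two sets of abundance estimates on the same held-out fragments amounts to preferring the one with the lower value, which is what makes the metric usable for model selection without ground truth.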

Conclusions: Alongside the derivation and implementation of perplexity for transcript abundance estimation, our study is the first to make model selection for transcript abundance estimation possible on experimental data in the absence of ground truth.
