Factorial Study of the RNA-seq Computational Workflow Identifies Biases As Technical Gene Signatures
RNA-seq is a modular experimental and computational approach for identifying and quantifying RNA molecules. This modularity allows the protocol to be adapted to explore RNA biology in new ways, but it also raises the importance of methodological thoroughness. Liberty of approach comes with the responsibility of choice, and such choices must be informed. Here, we present an approach that identifies gene group-specific quantification biases in current RNA-seq software and references by processing datasets with diverse RNA-seq computational pipelines and by decomposing the resulting expression datasets with independent component analysis, a matrix factorization method. By exploring the RNA-seq pipeline with this systemic approach, we identify genome annotation as a design choice that affects quantification results to the same extent as the choice of aligner and quantifier. We also show that the different choices in RNA-seq methodology are not independent, identifying interactions between genome annotations and quantification software. Genes were mainly affected by differences in their sequence, by overlapping genes, and by genes with similar sequences. Our approach offers an explanation for the observed biases by identifying the common features that the software and references use differently, thereby providing leads for improving RNA-seq methodology.
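The factorial design described above, in which every combination of aligner, quantifier, and genome annotation is run as its own pipeline, can be sketched as follows. This is a minimal illustration using only the Python standard library. The factor levels (`aligner_a`, `annotation_b`, and so on) and the helper names `factorial_design` and `pipeline_label` are hypothetical placeholders, not the tools or code used in the study, and the number of levels per factor is an assumption.

```python
from itertools import product

# Hypothetical factor levels; the study's actual software and references
# are not listed in this abstract, so these names are placeholders.
ALIGNERS = ("aligner_a", "aligner_b")
QUANTIFIERS = ("quantifier_a", "quantifier_b")
ANNOTATIONS = ("annotation_a", "annotation_b", "annotation_c")


def factorial_design(**factors):
    """Return one configuration dict per combination of factor levels."""
    names = list(factors)
    levels = (factors[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*levels)]


def pipeline_label(config):
    """Build a stable, human-readable identifier for one pipeline run."""
    return "|".join(f"{factor}={level}" for factor, level in config.items())


design = factorial_design(
    aligner=ALIGNERS,
    quantifier=QUANTIFIERS,
    annotation=ANNOTATIONS,
)
labels = [pipeline_label(config) for config in design]

if __name__ == "__main__":
    # One line per pipeline that would process the same input dataset.
    for label in labels:
        print(label)
```

Because every level of each factor is crossed with every level of the others, the expression matrices produced by such a design allow the effect of one choice (for example, the annotation) to be separated from the others, and allow interactions between choices to be detected. The decomposition step itself (independent component analysis) is not shown here.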