Workflow and Web Application for Annotating NCBI BioProject Transcriptome Data

Overview

Journal Database (Oxford)

Specialty Biology

Date 2017 Jun 13

PMID 28605765

Citations 3

Authors

Roberto Vera Alvarez

Newton Medeiros Vidal

Gina A Garzon-Martinez

Luz S Barrero

David Landsman

Leonardo Marino-Ramirez

Abstract

The volume of transcriptome data is growing exponentially due to rapid improvement of experimental technologies. In response, large central resources such as those of the National Center for Biotechnology Information (NCBI) are continually adapting their computational infrastructure to accommodate this large influx of data. New and specialized databases, such as Transcriptome Shotgun Assembly Sequence Database (TSA) and Sequence Read Archive (SRA), have been created to aid the development and expansion of centralized repositories. Although the central resource databases are under continual development, they do not include automatic pipelines to increase annotation of newly deposited data. Therefore, third-party applications are required to achieve that aim. Here, we present an automatic workflow and web application for the annotation of transcriptome data. The workflow creates secondary data such as sequencing reads and BLAST alignments, which are available through the web application. They are based on freely available bioinformatics tools and scripts developed in-house. The interactive web application provides a search engine and several browser utilities. Graphical views of transcript alignments are available through SeqViewer, an embedded tool developed by NCBI for viewing biological sequence data. The web application is tightly integrated with other NCBI web applications and tools to extend the functionality of data processing and interconnectivity. We present a case study for the species Physalis peruviana with data generated from BioProject ID 67621.

Database URL: http://www.ncbi.nlm.nih.gov/projects/physalis/
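The abstract mentions BLAST alignments as one kind of secondary data produced by the workflow but does not describe the in-house scripts. As a minimal sketch only, and not the authors' implementation, the following shows one common way such alignments can be summarized: keeping the highest-scoring hit per transcript from BLAST tabular output (the standard 12-column `-outfmt 6` layout). The function name `best_hits`, the E-value cutoff of 1e-5, and all transcript and protein identifiers are assumptions made for illustration.

```python
import csv
import io
from typing import Dict, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    evalue: float
    bitscore: float


def best_hits(tabular_text: str, max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Return the highest-bitscore hit per query from BLAST tabular text.

    Expects the default 12 tab-separated columns of -outfmt 6:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    The cutoff value is an illustrative assumption, not from the paper.
    """
    best: Dict[str, Hit] = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        # Skip blank lines and comment lines (as written by -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        hit = Hit(row[0], row[1], float(row[2]), float(row[10]), float(row[11]))
        # Discard hits weaker than the E-value cutoff.
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


# Hypothetical alignments; identifiers and scores are made up.
EXAMPLE_TABLE = "\n".join([
    "transcript_1\tprotA\t92.5\t200\t15\t0\t1\t200\t1\t200\t1e-50\t350.0",
    "transcript_1\tprotB\t95.0\t210\t10\t0\t1\t210\t1\t210\t1e-60\t410.0",
    "transcript_2\tprotC\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30.1",
    "transcript_2\tprotD\t70.0\t120\t36\t0\t1\t120\t1\t120\t1e-20\t95.2",
    "transcript_3\tprotE\t35.0\t40\t26\t0\t1\t40\t1\t40\t2.0\t25.0",
])

if __name__ == "__main__":
    for query, hit in sorted(best_hits(EXAMPLE_TABLE).items()):
        print(query, hit.subject, hit.bitscore)
```

In this sketch a transcript with no hit passing the cutoff is simply absent from the result, which makes it straightforward to list unannotated transcripts separately.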

Citing Articles

Combining transcriptome analysis and GWAS for identification and validation of marker genes in the Physalis peruviana-Fusarium oxysporum pathosystem.

Garzon-Martinez G, Garcia-Arias F, Enciso-Rodriguez F, Soto-Suarez M, Gonzalez C, Bombarely A. PeerJ. 2021; 9:e11135.

PMID: 33828924 PMC: 7993016. DOI: 10.7717/peerj.11135.

Transcriptome annotation in the cloud: complexity, best practices, and cost.

Vera Alvarez R, Marino-Ramirez L, Landsman D. Gigascience. 2021; 10(2).

PMID: 33511996 PMC: 7845158. DOI: 10.1093/gigascience/giaa163.

Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments.

Neuwald A, Lanczycki C, Hodges T, Marchler-Bauer A. Database (Oxford). 2020; 2020.

PMID: 32500917 PMC: 7297217. DOI: 10.1093/database/baaa042.
