AnnotaPipeline: An Integrated Tool to Annotate Eukaryotic Proteins Using Multi-omics Data

Overview

Journal Front Genet

Date 2022 Dec 9

PMID 36482896

Authors

Guilherme Augusto Maia

Vilmar Benetti Filho

Eric Kazuo Kawagoe

Tatiany Aparecida Teixeira Soratto

Renato Simoes Moreira

Edmundo Carlos Grisard

Glauber Wagner

Abstract

Assignment of gene function has been a crucial, laborious, and time-consuming step in genomics. Because a variety of sequencing platforms generate increasing amounts of data, manual annotation is no longer feasible. An integrated, automated pipeline that uses experimental data to validate predicted gene function is therefore highly relevant. Here, we present a computational workflow named AnnotaPipeline that integrates distinct software and data types in a proteogenomic approach to annotate and validate predicted features in genomic sequences. The pipeline takes as its primary input (i) nucleotide sequences in FASTA format, (ii) protein sequences in FASTA format, or (iii) structural annotation files (GFF3). Users can additionally supply RNA-seq data in FASTQ format and MS/MS data in mzXML or similar formats; the pipeline uses this transcriptomic and proteomic information to corroborate annotations and validate gene predictions, providing transcription and expression evidence for functional annotation. Reannotation of several available genomes was performed using AnnotaPipeline, resulting in a higher proportion of annotated proteins and a lower proportion of hypothetical proteins than in the annotations publicly available for these organisms. AnnotaPipeline is a Unix-based pipeline developed in Python and is available at: https://github.com/bioinformatics-ufsc/AnnotaPipeline.
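As an illustration of the proteogenomic idea described in the abstract, the following minimal Python sketch shows one way transcriptomic and proteomic evidence could be attached to a functional annotation. It is not taken from AnnotaPipeline: the class name, field names, thresholds, and evidence labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProteinEvidence:
    """One predicted protein with optional experimental support (hypothetical schema)."""
    protein_id: str
    description: str    # functional annotation, e.g. from a similarity search
    tpm: float = 0.0    # transcript abundance estimated from RNA-seq (assumed unit)
    peptides: int = 0   # number of peptides identified by MS/MS for this protein


def evidence_label(p: ProteinEvidence, min_tpm: float = 1.0, min_peptides: int = 1) -> str:
    """Classify the experimental support for a prediction.

    The cutoffs are illustrative assumptions, not values used by AnnotaPipeline.
    """
    transcribed = p.tpm >= min_tpm
    expressed = p.peptides >= min_peptides
    if transcribed and expressed:
        return "transcript+peptide"
    if expressed:
        return "peptide"
    if transcribed:
        return "transcript"
    return "prediction only"


def summarize(proteins):
    """Return (id, annotation, evidence label) rows for a list of proteins."""
    return [(p.protein_id, p.description, evidence_label(p)) for p in proteins]


if __name__ == "__main__":
    demo = [
        ProteinEvidence("g1.t1", "protein kinase", tpm=12.5, peptides=3),
        ProteinEvidence("g2.t1", "hypothetical protein", tpm=4.0),
        ProteinEvidence("g3.t1", "hypothetical protein"),
    ]
    for row in summarize(demo):
        print("\t".join(row))
```

In this sketch a hypothetical protein with transcript or peptide support can be told apart from one that rests on gene prediction alone, which is the kind of corroboration the abstract describes.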
