Flexible Protein Database Based on Amino Acid K-mers

Overview

Journal: Sci Rep

Specialty: Science

Date: 2022 Jun 1

PMID: 35650262

Authors

Maxime Deraspe

Sebastien Boisvert

Francois Laviolette

Paul H Roy

Jacques Corbeil

Abstract

Identification of proteins is one of the most computationally intensive steps in genomics studies. It usually relies on aligners that do not accommodate rich information on proteins and require additional pipelining steps for protein identification. We introduce kAAmer, a protein database engine based on amino-acid k-mers that provides efficient identification of proteins while supporting the incorporation of flexible annotations on these proteins. Moreover, the database is built to be used as a microservice, to be hosted and queried remotely.
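The idea of identifying proteins through shared amino-acid k-mers, with annotations stored alongside each protein, can be illustrated with a small in-memory sketch. This is not kAAmer's implementation or API. The class name `KmerProteinIndex`, the choice of k, the scoring by count of shared distinct k-mers, and the example sequences and annotations are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

K = 7  # k-mer length; an assumption for this sketch, not taken from the abstract


def kmers(seq, k=K):
    """Return all overlapping k-mers of an amino-acid sequence, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerProteinIndex:
    """Toy index: k-mer -> set of protein ids, plus free-form annotations."""

    def __init__(self, k=K):
        self.k = k
        self.kmer_to_proteins = defaultdict(set)
        self.annotations = {}

    def add(self, protein_id, sequence, **annotation):
        """Index every k-mer of the protein and store its annotation fields."""
        for km in kmers(sequence, self.k):
            self.kmer_to_proteins[km].add(protein_id)
        self.annotations[protein_id] = annotation

    def query(self, sequence, top=5):
        """Rank proteins by the number of distinct query k-mers they share."""
        hits = Counter()
        for km in set(kmers(sequence, self.k)):
            for pid in self.kmer_to_proteins.get(km, ()):
                hits[pid] += 1
        return [(pid, n, self.annotations[pid]) for pid, n in hits.most_common(top)]


# Hypothetical sequences and annotations, with a short k to keep the example small.
index = KmerProteinIndex(k=3)
index.add("protA", "MKTAYIAKQR", function="hypothetical kinase")
index.add("protB", "MSTNPKPQRK", function="hypothetical transporter")

results = index.query("KTAYIAK")
print(results)
```

Here the query shares five 3-mers with `protA` and none with `protB`, so only `protA` is returned, together with its annotation. A persistent key-value store behind an HTTP endpoint would be one way to turn such an index into a remotely queried service, but that part is left out of the sketch.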

Citing Articles

Missing microbial eukaryotes and misleading meta-omic conclusions.

Krinos A, Mars Brisbin M, Hu S, Cohen N, Rynearson T, Follows M. Nat Commun. 2024; 15(1):9873.

PMID: 39543100. PMC: 11564645. DOI: 10.1038/s41467-024-52212-w.

aaHash: recursive amino acid sequence hashing.

Wong J, Kazemi P, Coombe L, Warren R, Birol I. Bioinform Adv. 2023; 3(1):vbad162.

PMID: 38023332. PMC: 10660294. DOI: 10.1093/bioadv/vbad162.
