SIMAP--the Database of All-against-all Protein Sequence Similarities and Annotations with New Interfaces and Increased Coverage

Overview

Journal Nucleic Acids Res

Publisher Oxford University Press

Specialty Biochemistry

Date 2013 Oct 30

PMID 24165881

Citations 14

Authors

Roland Arnold

Florian Goldenberg

Hans-Werner Mewes

Thomas Rattei

Abstract

The Similarity Matrix of Proteins (SIMAP, http://mips.gsf.de/simap/) database has been designed to massively accelerate computationally expensive protein sequence analysis tasks in bioinformatics. It provides pre-calculated sequence similarities interconnecting the entire known protein sequence universe, complemented by pre-calculated protein features and domains, similarity clusters and functional annotations. SIMAP covers all major public protein databases as well as many consistently re-annotated metagenomes from different repositories. As of September 2013, SIMAP contains >163 million proteins corresponding to ∼70 million non-redundant sequences. SIMAP uses the sensitive FASTA search heuristics, the Smith-Waterman alignment algorithm, the InterPro database of protein domain models and the BLAST2GO functional annotation algorithm. SIMAP assists biologists by facilitating the interactive exploration of the protein sequence universe. Web-Service and DAS interfaces allow SIMAP to be connected with other bioinformatics tools and resources. All-against-all protein sequence similarity matrices of project-specific protein collections are generated on request. Recent improvements allow SIMAP to cover the rapidly growing universe of sequenced proteins. New Web-Service interfaces enhance the connectivity of SIMAP. Novel tools for interactive extraction of protein similarity networks have been added. Open access to SIMAP is provided through the web portal; the portal also contains instructions and links for software access and flat file downloads.
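
To make the idea of an all-against-all similarity matrix concrete, the following is a minimal Python sketch that scores every pair in a small protein collection with a Smith-Waterman local alignment. It is an illustration only and not SIMAP's implementation: the function names, the toy sequences, and the simple match/mismatch/linear-gap scoring are assumptions chosen for brevity, and the FASTA pre-filtering step mentioned in the abstract is omitted.

```python
from itertools import combinations


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Assumed toy scoring: fixed match/mismatch values and a linear gap
    penalty (a real pipeline would typically use a substitution matrix).
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def all_against_all(seqs):
    """Score every unordered pair in a dict mapping name -> sequence."""
    return {
        (x, y): smith_waterman_score(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from any database.
    proteins = {"p1": "MKTAYIAKQR", "p2": "MKTAYLAKQR", "p3": "GGGGG"}
    for pair, score in all_against_all(proteins).items():
        print(pair, score)
```

The number of pairs grows quadratically with the size of the collection, which is why pre-calculating and storing the scores, as the abstract describes, pays off for large sequence sets.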

Citing Articles

Protein-Coding Gene Families in Prokaryote Genome Comparisons.

Carhuaricra-Huaman D, Setubal J Methods Mol Biol. 2024; 2802:33-55.

PMID: 38819555 DOI: 10.1007/978-1-0716-3838-5_2.

Cracking the black box of deep sequence-based protein-protein interaction prediction.

Bernett J, Blumenthal D, List M Brief Bioinform. 2024; 25(2).

PMID: 38446741 PMC: 10939362. DOI: 10.1093/bib/bbae076.

Cytoscape stringApp 2.0: Analysis and Visualization of Heterogeneous Biological Networks.

Doncheva N, Morris J, Holze H, Kirsch R, Nastou K, Cuesta-Astroz Y J Proteome Res. 2022; 22(2):637-646.

PMID: 36512705 PMC: 9904289. DOI: 10.1021/acs.jproteome.2c00651.

eggNOG 6.0: enabling comparative genomics across 12 535 organisms.

Hernandez-Plaza A, Szklarczyk D, Botas J, Cantalapiedra C, Giner-Lamia J, Mende D Nucleic Acids Res. 2022; 51(D1):D389-D394.

PMID: 36399505 PMC: 9825578. DOI: 10.1093/nar/gkac1022.

Ten Years of Collaborative Progress in the Quest for Orthologs.

Linard B, Ebersberger I, McGlynn S, Glover N, Mochizuki T, Patricio M Mol Biol Evol. 2021; 38(8):3033-3045.

PMID: 33822172 PMC: 8321534. DOI: 10.1093/molbev/msab098.
