FAS: Assessing the Similarity Between Proteins Using Multi-layered Feature Architectures

Overview

Journal Bioinformatics

Publisher Oxford University Press

Specialty Biology

Date 2023 Apr 21

PMID 37084276

Authors

Julian Dosch

Holger Bergmann

Vinh Tran

Ingo Ebersberger

Abstract

Motivation: Protein sequence comparison is a fundamental element in the bioinformatics toolkit. When sequences are annotated with features such as functional domains, transmembrane domains, low complexity regions or secondary structure elements, the resulting feature architectures allow better informed comparisons. However, many existing schemes for scoring architecture similarities cannot cope with features arising from multiple annotation sources. Those that can still fall short when resolving overlapping and redundant feature annotations.

Results: Here, we introduce FAS, a scoring method that integrates features from multiple annotation sources in a directed acyclic architecture graph. Redundancies are resolved as part of the architecture comparison by finding the paths through the graphs that maximize the pair-wise architecture similarity. In a large-scale evaluation on more than 10 000 human-yeast ortholog pairs, architecture similarities assessed with FAS are consistently more plausible than those obtained using e-values to resolve overlaps or leaving overlaps unresolved. Three case studies demonstrate the utility of FAS on architecture comparison tasks: benchmarking of orthology assignment software, identification of functionally diverged orthologs, and diagnosing protein architecture changes stemming from faulty gene predictions. With the help of FAS, feature architecture comparisons can now be routinely integrated into these and many other applications.
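The idea of resolving redundancies by choosing the best-matching paths through two architecture graphs can be illustrated with a minimal Python sketch. This is not the FAS implementation or its scoring function. The `Feature` type, the edge rule (an edge joins a feature to the next non-overlapping feature), the exhaustive path enumeration, and the multiset Jaccard similarity are all assumptions made for illustration, and the feature names and coordinates are invented.

```python
from collections import Counter
from itertools import product
from typing import NamedTuple


class Feature(NamedTuple):
    name: str   # hypothetical "source:feature" label
    start: int  # first residue covered
    end: int    # last residue covered (inclusive)


def build_graph(features):
    """Directed acyclic graph over features: a -> b if b starts after a ends
    and no other feature fits completely between a and b."""
    feats = sorted(set(features), key=lambda f: (f.start, f.end))
    edges = {f: [] for f in feats}
    for a in feats:
        after = [b for b in feats if b.start > a.end]
        for b in after:
            if not any(c.end < b.start for c in after):
                edges[a].append(b)
    return feats, edges


def paths(features):
    """Enumerate all maximal paths, i.e. overlap-free linear architectures."""
    feats, edges = build_graph(features)
    # A feature can open a path if nothing ends before it starts.
    starts = [f for f in feats if not any(c.end < f.start for c in feats)]
    found = []

    def walk(node, prefix):
        prefix = prefix + [node]
        if not edges[node]:
            found.append(prefix)
        for nxt in edges[node]:
            walk(nxt, prefix)

    for s in starts:
        walk(s, [])
    return found or [[]]  # a protein without features has one empty path


def similarity(p, q):
    """Multiset Jaccard index on feature names (a stand-in score, not FAS)."""
    a = Counter(f.name for f in p)
    b = Counter(f.name for f in q)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


def best_paths(seed, query):
    """Pick the pair of paths that maximizes the pairwise similarity."""
    p, q = max(product(paths(seed), paths(query)),
               key=lambda pair: similarity(*pair))
    return similarity(p, q), p, q


# Invented example: two sources annotate the same region of the seed protein.
seed = [Feature("pfam:PH", 10, 100), Feature("smart:PH", 12, 98),
        Feature("tmhmm:TM", 150, 170)]
query = [Feature("smart:PH", 20, 105), Feature("tmhmm:TM", 160, 180)]

score, seed_path, query_path = best_paths(seed, query)
print(score, [f.name for f in seed_path])
```

In this toy case the two overlapping PH annotations in the seed end up on alternative paths, and the comparison itself selects the path that matches the query. Exhaustive enumeration grows quickly with the number of overlapping features, so it only serves to show the principle on small inputs.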

Availability and implementation: FAS is available as a Python package: https://pypi.org/project/greedyFAS/.

Citing Articles

New developments for the Quest for Orthologs benchmark service.

Altenhoff A, Nevers Y, Tran V, Jyothi D, Martin M, Cosentino S NAR Genom Bioinform. 2024; 6(4):lqae167.

PMID: 39664814 PMC: 11632614. DOI: 10.1093/nargab/lqae167.

Quest for Orthologs in the Era of Biodiversity Genomics.

Langschied F, Bordin N, Cosentino S, Fuentes-Palacios D, Glover N, Hiller M Genome Biol Evol. 2024; 16(10).

PMID: 39404012 PMC: 11523110. DOI: 10.1093/gbe/evae224.

SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models.

Cosentino S, Sriswasdi S, Iwasaki W Genome Biol. 2024; 25(1):195.

PMID: 39054525 PMC: 11270883. DOI: 10.1186/s13059-024-03298-4.

CANDy: Automated analysis of domain architectures in carbohydrate-active enzymes.

Windels A, Franceus J, Pleiss J, Desmet T PLoS One. 2024; 19(7):e0306410.

PMID: 38990885 PMC: 11238990. DOI: 10.1371/journal.pone.0306410.

Feature architecture aware phylogenetic profiling indicates a functional diversification of type IVa pili in the nosocomial pathogen Acinetobacter baumannii.

Iruegas R, Pfefferle K, Gottig S, Averhoff B, Ebersberger I PLoS Genet. 2023; 19(7):e1010646.

PMID: 37498819 PMC: 10374093. DOI: 10.1371/journal.pgen.1010646.
