Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence

Overview

Journal: PLoS Comput Biol

Specialty: Biology

Date: 2016 Jul 30

PMID: 27472895

Citations: 15

Authors

Juliana Bernardes

Gerson Zaverucha

Catherine Vaquero

Alessandra Carbone

Abstract

Traditional protein annotation methods describe known domains with probabilistic models that represent a consensus among homologous domain sequences. However, when the relevant signals become too weak to be identified by a global consensus, annotation attempts fail. Here we address the fundamental question of domain identification for highly divergent proteins. By using high-performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved across all species but may be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: (1) for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences; (2) a decision-making protocol combines the outcomes obtained from multiple models; (3) a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison, respectively. Because they are closer to actual protein sequences, clade-centered models are shown to be more specific and more functionally predictive than the widely used consensus models. Based on them, we improved the annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins, against 63% achieved previously, corresponding to a 30% improvement in the total number of Pfam domain predictions across the whole genome.
The method is applicable to any genome and opens new avenues for tackling evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE.
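To make the last step of the strategy concrete, the following Python sketch shows one simple way to pick a non-overlapping domain architecture from hits produced by several models of the same domain. It is only an illustration under stated assumptions, not the published method: the abstract describes a multi-criteria optimization that also uses domain co-occurrence, whereas this sketch optimizes a single criterion (total score) with classic weighted interval scheduling. The `Hit` class, the function `best_architecture`, and all domain names, model names, coordinates, and scores are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    domain: str   # domain family label (hypothetical)
    model: str    # clade-centered model that produced the hit (hypothetical)
    start: int    # 1-based inclusive start position on the protein
    end: int      # 1-based inclusive end position on the protein
    score: float  # bit-score-like value; higher is better (assumption)


def best_architecture(hits):
    """Return the non-overlapping subset of hits with maximal total score.

    Single-criterion stand-in (weighted interval scheduling) for the
    multi-criteria optimization mentioned in the abstract.
    """
    ordered = sorted(hits, key=lambda h: h.end)
    ends = [h.end for h in ordered]
    # best[i] = (total score, chosen hits) using only the first i hits
    best = [(0.0, [])]
    for i, hit in enumerate(ordered):
        # j = number of earlier hits that end before this hit starts
        j = bisect_right(ends, hit.start - 1, 0, i)
        take = (best[j][0] + hit.score, best[j][1] + [hit])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1][1]


# Hypothetical hits: two models of the same domain compete for one region.
hits = [
    Hit("DomA", "clade1", 1, 100, 50.0),
    Hit("DomA", "clade2", 10, 110, 65.0),
    Hit("DomB", "clade1", 90, 200, 40.0),
    Hit("DomC", "clade3", 120, 220, 55.0),
]
architecture = best_architecture(hits)
print([(h.domain, h.model, h.start, h.end) for h in architecture])
```

In this toy input, the higher-scoring of the two overlapping `DomA` hits is kept, and `DomB` is dropped because it overlaps both selected hits. A closer approximation of the described approach would need additional criteria, such as co-occurrence between the selected domains, which are deliberately left out here.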

Citing Articles

Evolutionary dynamics of genome size and content during the adaptive radiation of Heliconiini butterflies.

Cicconardi F, Milanetti E, Pinheiro de Castro E, Mazo-Vargas A, Van Belleghem S, Ruggieri A. Nat Commun. 2023; 14(1):5620.

PMID: 37699868 PMC: 10497600. DOI: 10.1038/s41467-023-41412-5.

CeGAL: Redefining a Widespread Fungal-Specific Transcription Factor Family Using an In Silico Error-Tracking Approach.

Mayer C, Vogt A, Uslu T, Scalzitti N, Chennen K, Poch O. J Fungi (Basel). 2023; 9(4).

PMID: 37108879 PMC: 10141177. DOI: 10.3390/jof9040424.

Multi-head attention-based U-Nets for predicting protein domain boundaries using 1D sequence features and 2D distance maps.

Mahmud S, Guo Z, Quadir F, Liu J, Cheng J. BMC Bioinformatics. 2022; 23(1):283.

PMID: 35854211 PMC: 9295499. DOI: 10.1186/s12859-022-04829-1.

Multiple Profile Models Extract Features from Protein Sequence Data and Resolve Functional Diversity of Very Different Protein Families.

Vicedomini R, Bouly J, Laine E, Falciatore A, Carbone A. Mol Biol Evol. 2022; 39(4).

PMID: 35353898 PMC: 9016551. DOI: 10.1093/molbev/msac070.

MyCLADE: a multi-source domain annotation server for sequence functional exploration.

Vicedomini R, Blachon C, Oteri F, Carbone A. Nucleic Acids Res. 2021; 49(W1):W452-W458.

PMID: 34023906 PMC: 8262732. DOI: 10.1093/nar/gkab395.
