Synthetic Protein Alignments by CCMgen Quantify Noise in Residue-residue Contact Prediction

Overview

Journal PLoS Comput Biol

Specialty Biology

Date 2018 Nov 6

PMID 30395601

Citations 13

Authors

Susann Vorberg

Stefan Seemayer

Johannes Söding

Abstract

Compensatory mutations between protein residues in physical contact can manifest themselves as statistical couplings between the corresponding columns in a multiple sequence alignment (MSA) of the protein family. Conversely, large coupling coefficients predict residue contacts. Methods for de novo protein structure prediction based on this approach are becoming increasingly reliable. Their main limitation is the strong systematic and statistical noise in the estimation of coupling coefficients, which has so far limited their application to very large protein families. While most research has focused on improving predictions by adding external information, little progress has been made in improving the statistical procedure at the core, because our lack of understanding of the sources of noise poses a major obstacle. First, we show theoretically that the expectation value of the coupling score assuming no coupling is proportional to the product of the square roots of the column entropies, and we propose a simple entropy bias correction (EntC) that subtracts out this expectation value. Second, we show that the average product correction (APC) includes the correction of the entropy bias, partly explaining its success. Third, we have developed CCMgen, the first method for simulating protein evolution and generating realistic synthetic MSAs with pairwise statistical residue couplings. Fourth, to learn exact statistical models that reliably reproduce observed alignment statistics, we developed CCMpredPy, an implementation of the persistent contrastive divergence (PCD) method for exact inference. Fifth, we demonstrate how CCMgen and CCMpredPy can facilitate the development of contact prediction methods by analysing the systematic noise contributions from phylogeny and entropy. Using the entropy bias correction, we can disentangle both sources of noise and find that entropy contributes roughly twice as much noise as phylogeny.
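
The two score corrections named in the abstract can be illustrated with a minimal Python sketch. This is not the CCMpredPy implementation. The function names, the integer encoding of the alignment, the use of natural-log Shannon entropy, and the least-squares fit of the proportionality constant `alpha` are all assumptions made for illustration; the abstract states only that the expected score under no coupling is proportional to the product of the square roots of the column entropies.

```python
import numpy as np


def column_entropies(msa, n_states=21):
    """Shannon entropy (in nats) of every alignment column.

    msa: integer array of shape (n_sequences, n_columns), values in [0, n_states).
    The integer encoding and n_states=21 (20 amino acids + gap) are assumptions.
    """
    n_seq, n_col = msa.shape
    entropies = np.zeros(n_col)
    for j in range(n_col):
        counts = np.bincount(msa[:, j], minlength=n_states)
        freqs = counts[counts > 0] / n_seq
        entropies[j] = -np.sum(freqs * np.log(freqs))
    return entropies


def apc(scores):
    """Average product correction: c_ij - mean_i * mean_j / overall_mean.

    Means are taken over off-diagonal entries of the symmetric score matrix.
    """
    n = scores.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    row_mean = np.where(off_diag, scores, 0.0).sum(axis=1) / (n - 1)
    overall_mean = scores[off_diag].mean()
    corrected = scores - np.outer(row_mean, row_mean) / overall_mean
    np.fill_diagonal(corrected, 0.0)
    return corrected


def entropy_correction(scores, entropies):
    """Entropy bias correction sketch: c_ij - alpha * sqrt(H_i) * sqrt(H_j).

    ASSUMPTION: alpha is fitted by least squares over off-diagonal pairs;
    the abstract does not say how the constant is chosen.
    """
    n = scores.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    root_h = np.sqrt(entropies)
    bias = np.outer(root_h, root_h)
    alpha = np.sum(scores[off_diag] * bias[off_diag]) / np.sum(bias[off_diag] ** 2)
    corrected = scores - alpha * bias
    np.fill_diagonal(corrected, 0.0)
    return corrected
```

In this sketch both corrections take a symmetric matrix of raw coupling scores, one row and column per alignment position. A score matrix that consists purely of entropy bias, that is, one exactly proportional to `sqrt(H_i) * sqrt(H_j)`, is reduced to zero by `entropy_correction`.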

Citing Articles

Impact of phylogeny on the inference of functional sectors from protein sequence data.

Dietler N, Abbara A, Choudhury S, Bitbol A PLoS Comput Biol. 2024; 20(9):e1012091.

PMID: 39312591 PMC: 11449291. DOI: 10.1371/journal.pcbi.1012091.

Enhancing coevolutionary signals in protein-protein interaction prediction through clade-wise alignment integration.

Fang T, Szklarczyk D, Hachilif R, von Mering C Sci Rep. 2024; 14(1):6009.

PMID: 38472223 PMC: 10933411. DOI: 10.1038/s41598-024-55655-9.

Chasing long-range evolutionary couplings in the AlphaFold era.

Karamanos T Biopolymers. 2023; 114(3):e23530.

PMID: 36752285 PMC: 10909459. DOI: 10.1002/bip.23530.

Impact of phylogeny on structural contact inference from protein sequence data.

Dietler N, Lupo U, Bitbol A J R Soc Interface. 2023; 20(199):20220707.

PMID: 36751926 PMC: 9905998. DOI: 10.1098/rsif.2022.0707.

Generative power of a protein language model trained on multiple sequence alignments.

Sgarbossa D, Lupo U, Bitbol A Elife. 2023; 12.

PMID: 36734516 PMC: 10038667. DOI: 10.7554/eLife.79854.
