
Synthetic Protein Alignments by CCMgen Quantify Noise in Residue-residue Contact Prediction

Overview
Specialty: Biology
Date: 2018 Nov 6
PMID: 30395601
Citations: 13
Abstract

Compensatory mutations between protein residues in physical contact can manifest themselves as statistical couplings between the corresponding columns in a multiple sequence alignment (MSA) of the protein family. Conversely, large coupling coefficients predict residue contacts. Methods for de novo protein structure prediction based on this approach are becoming increasingly reliable. Their main limitation is the strong systematic and statistical noise in the estimation of coupling coefficients, which has so far limited their application to very large protein families. While most research has focused on improving predictions by adding external information, little progress has been made in improving the core statistical procedure, because our lack of understanding of the sources of noise poses a major obstacle. First, we show theoretically that the expectation value of the coupling score in the absence of coupling is proportional to the product of the square roots of the column entropies, and we propose a simple entropy bias correction (EntC) that subtracts this expectation value. Second, we show that the average product correction (APC) includes the correction of the entropy bias, which partly explains its success. Third, we have developed CCMgen, the first method for simulating protein evolution and generating realistic synthetic MSAs with pairwise statistical residue couplings. Fourth, to learn exact statistical models that reliably reproduce observed alignment statistics, we developed CCMpredPy, an implementation of the persistent contrastive divergence (PCD) method for exact inference. Fifth, we demonstrate how CCMgen and CCMpredPy can facilitate the development of contact prediction methods by analysing the systematic noise contributions from phylogeny and entropy. Using the entropy bias correction, we can disentangle both sources of noise and find that entropy contributes roughly twice as much noise as phylogeny.
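The two corrections named in the abstract can be illustrated with a minimal sketch. It is not the CCMpredPy implementation: the function names, the use of natural-log Shannon entropy over all observed symbols, the exclusion of the diagonal from the averages, and the least-squares fit of the proportionality constant `alpha` are all assumptions made here for illustration. The abstract states only that the expected score without coupling is proportional to the product of the square roots of the column entropies.

```python
import math
from collections import Counter


def column_entropies(msa):
    """Shannon entropy (in nats) of every column of an alignment.

    `msa` is a list of equal-length strings; gaps are treated like any
    other symbol (an assumption of this sketch).
    """
    n_seqs = len(msa)
    entropies = []
    for column in zip(*msa):
        counts = Counter(column)
        entropies.append(
            -sum((c / n_seqs) * math.log(c / n_seqs) for c in counts.values())
        )
    return entropies


def average_product_correction(scores):
    """APC: S_ij - mean_i * mean_j / overall_mean, over off-diagonal entries."""
    n = len(scores)
    row_mean = [
        sum(scores[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)
    ]
    overall_mean = sum(row_mean) / n
    return [
        [
            0.0 if i == j else scores[i][j] - row_mean[i] * row_mean[j] / overall_mean
            for j in range(n)
        ]
        for i in range(n)
    ]


def entropy_correction(scores, entropies, alpha=None):
    """Subtract alpha * sqrt(H_i) * sqrt(H_j) from each off-diagonal score.

    If `alpha` is None it is fitted by least squares through the origin,
    which is a choice made for this sketch, not taken from the source.
    """
    n = len(scores)
    root = [math.sqrt(h) for h in entropies]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    if alpha is None:
        numerator = sum(scores[i][j] * root[i] * root[j] for i, j in pairs)
        denominator = sum((root[i] * root[j]) ** 2 for i, j in pairs)
        alpha = numerator / denominator if denominator > 0 else 0.0
    corrected = [[0.0] * n for _ in range(n)]
    for i, j in pairs:
        corrected[i][j] = scores[i][j] - alpha * root[i] * root[j]
    return corrected
```

In this sketch, a score matrix that consists purely of the entropy term is corrected to zero, whereas a real coupling score matrix would retain the residual signal attributed to contacts and phylogeny.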

Citing Articles

Impact of phylogeny on the inference of functional sectors from protein sequence data.

Dietler N, Abbara A, Choudhury S, Bitbol A. PLoS Comput Biol. 2024; 20(9):e1012091.

PMID: 39312591 PMC: 11449291. DOI: 10.1371/journal.pcbi.1012091.


Enhancing coevolutionary signals in protein-protein interaction prediction through clade-wise alignment integration.

Fang T, Szklarczyk D, Hachilif R, von Mering C. Sci Rep. 2024; 14(1):6009.

PMID: 38472223 PMC: 10933411. DOI: 10.1038/s41598-024-55655-9.


Chasing long-range evolutionary couplings in the AlphaFold era.

Karamanos T. Biopolymers. 2023; 114(3):e23530.

PMID: 36752285 PMC: 10909459. DOI: 10.1002/bip.23530.


Impact of phylogeny on structural contact inference from protein sequence data.

Dietler N, Lupo U, Bitbol A. J R Soc Interface. 2023; 20(199):20220707.

PMID: 36751926 PMC: 9905998. DOI: 10.1098/rsif.2022.0707.


Generative power of a protein language model trained on multiple sequence alignments.

Sgarbossa D, Lupo U, Bitbol A. Elife. 2023; 12.

PMID: 36734516 PMC: 10038667. DOI: 10.7554/eLife.79854.

