
Statistical Physics of Interacting Proteins: Impact of Dataset Size and Quality Assessed in Synthetic Sequences

Overview
Journal: Phys Rev E
Specialty: Biophysics
Date: 2020 Apr 16
PMID: 32290011
Citations: 5
Abstract

Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g., direct coupling analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins and identifying pairs of residues that form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and interblock couplings representing protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets of known interaction partners are available, and that an iterative pairing algorithm allows predictions to be made even without a training set. Noise in the training data deteriorates performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly as the dataset grows. This implies that it is generally good to incorporate more data even if their quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
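
The synthetic-data setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the coupling strength `J`, the edge probabilities, the Metropolis sampler, and the naive mean-field inversion with identity shrinkage are all assumptions chosen for illustration, and the DCA variant used in the paper may differ. The sketch draws Ising couplings on a block-structured random graph (blocks stand for proteins, interblock edges for interactions), samples spin configurations by Monte Carlo, and estimates couplings from the sample covariance.

```python
import numpy as np


def block_couplings(n_blocks, block_len, p_intra, p_inter, interacting, J=0.5, rng=None):
    """Random symmetric Ising couplings on a stochastic block model.

    Each block stands for one protein. Edges appear with probability
    p_intra inside a block, p_inter between blocks listed in `interacting`,
    and never between other blocks. All names and defaults are illustrative.
    """
    rng = np.random.default_rng(rng)
    n = n_blocks * block_len
    block = np.repeat(np.arange(n_blocks), block_len)
    same = block[:, None] == block[None, :]
    inter = np.zeros((n, n), dtype=bool)
    for a, b in interacting:
        m = (block[:, None] == a) & (block[None, :] == b)
        inter |= m | m.T
    prob = np.where(same, p_intra, np.where(inter, p_inter, 0.0))
    # Draw edges in the strict upper triangle, then symmetrize.
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    return J * (upper | upper.T), block


def sample_ising(couplings, n_samples, n_sweeps=20, rng=None):
    """Independent Metropolis samples from P(s) ~ exp(sum_{i<j} J_ij s_i s_j)."""
    rng = np.random.default_rng(rng)
    n = couplings.shape[0]
    samples = np.empty((n_samples, n), dtype=np.int8)
    for s in range(n_samples):
        spins = rng.choice(np.array([-1, 1]), size=n)
        for _ in range(n_sweeps * n):
            i = rng.integers(n)
            # Energy change from flipping spin i (diagonal couplings are zero).
            delta_e = 2.0 * spins[i] * (couplings[i] @ spins)
            if delta_e <= 0 or rng.random() < np.exp(-delta_e):
                spins[i] = -spins[i]
        samples[s] = spins
    return samples


def mean_field_couplings(samples, shrink=0.1):
    """Naive mean-field estimate: J_ij = -(C^-1)_ij for i != j.

    The covariance is shrunk toward the identity so that it stays
    invertible for small samples; this regularization is an assumption.
    """
    c = np.cov(samples.astype(float), rowvar=False)
    c = (1.0 - shrink) * c + shrink * np.eye(c.shape[0])
    j = -np.linalg.inv(c)
    np.fill_diagonal(j, 0.0)
    return j


if __name__ == "__main__":
    # Three "proteins" of 10 sites each; only proteins 0 and 1 interact.
    J_true, block = block_couplings(3, 10, 0.3, 0.2, [(0, 1)], rng=0)
    data = sample_ising(J_true, n_samples=200, rng=1)
    J_hat = mean_field_couplings(data)
    # Compare inferred coupling magnitude between blocks 0-1 and blocks 0-2.
    print(np.abs(J_hat[np.ix_(block == 0, block == 1)]).mean())
    print(np.abs(J_hat[np.ix_(block == 0, block == 2)]).mean())
```

In this toy setting, comparing the inferred interblock couplings for different block pairs corresponds to the partner-identification task, while ranking individual entries within an interacting pair corresponds to the contact-prediction task.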

Citing Articles

DiffPaSS-high-performance differentiable pairing of protein sequences using soft scores.

Lupo U, Sgarbossa D, Milighetti M, Bitbol A. Bioinformatics. 2024; 41(1).

PMID: 39672677 PMC: 11676329. DOI: 10.1093/bioinformatics/btae738.


Combining phylogeny and coevolution improves the inference of interaction partners among paralogous proteins.

Gandarilla-Perez C, Pinilla S, Bitbol A, Weigt M. PLoS Comput Biol. 2023; 19(3):e1011010.

PMID: 36996234 PMC: 10089317. DOI: 10.1371/journal.pcbi.1011010.


Impact of phylogeny on structural contact inference from protein sequence data.

Dietler N, Lupo U, Bitbol A. J R Soc Interface. 2023; 20(199):20220707.

PMID: 36751926 PMC: 9905998. DOI: 10.1098/rsif.2022.0707.


Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences.

Gerardos A, Dietler N, Bitbol A. PLoS Comput Biol. 2022; 18(5):e1010147.

PMID: 35576238 PMC: 9135348. DOI: 10.1371/journal.pcbi.1010147.


Inter-protein residue covariation information unravels physically interacting protein dimers.

Salmanian S, Pezeshk H, Sadeghi M. BMC Bioinformatics. 2020; 21(1):584.

PMID: 33334319 PMC: 7745481. DOI: 10.1186/s12859-020-03930-7.