
Size and Structure of the Sequence Space of Repeat Proteins

Overview
Specialty: Biology
Date: 2019 Aug 16
PMID: 31415557
Citations: 7
Abstract

The coding space of protein sequences is shaped by evolutionary constraints set by requirements of function and stability. We show that the coding space of a given protein family (the total number of sequences in that family) can be estimated using maximum entropy models trained on multiple sequence alignments of naturally occurring amino acid sequences. We analyzed and calculated the size of three abundant repeat protein families, whose members are large proteins made of many repetitions of conserved portions of ∼30 amino acids. While amino acid conservation at each position of the alignment explains most of the reduction of diversity relative to completely random sequences, we found that correlations between amino acid usage at different positions significantly impact that diversity. We quantified the impact of different types of correlations, functional and evolutionary, on sequence diversity. Analysis of the detailed structure of the coding space of the families revealed a rugged landscape, with many local energy minima of varying sizes arranged in a hierarchical structure, reminiscent of the frustrated energy landscapes of spin glasses in physics. This clustered structure indicates a multiplicity of subtypes within each family, and suggests new strategies for protein design.
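To make the notion of "size of the coding space" concrete, the sketch below computes the simplest version of such an estimate: the entropy of an independent-site model fitted to an alignment, exponentiated to give an effective number of sequences. This is an illustration only, not the method of the paper, which uses maximum entropy models that also account for correlations between positions. The function names and the toy alignment are hypothetical, and the sketch omits pseudocounts, sequence reweighting, and finite-sample corrections.

```python
import math
from collections import Counter


def site_entropies(alignment):
    """Return the Shannon entropy (in nats) of each alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    entropies = []
    for col in range(length):
        # Empirical amino acid frequencies at this position.
        counts = Counter(seq[col] for seq in alignment)
        total = sum(counts.values())
        entropies.append(
            -sum((n / total) * math.log(n / total) for n in counts.values())
        )
    return entropies


def effective_size(alignment):
    """Effective number of sequences, exp(S), under an independent-site model.

    S is the sum of per-column entropies, so this ignores any correlation
    between positions (an assumption of this sketch).
    """
    return math.exp(sum(site_entropies(alignment)))


# Hypothetical toy alignment: position 0 is fully conserved,
# positions 1 and 2 each use two amino acids equally often.
toy_alignment = ["AKL", "AKV", "ARL", "ARV"]

random_size = 20 ** len(toy_alignment[0])  # all sequences of length 3
print(effective_size(toy_alignment), "of", random_size, "possible sequences")
```

For a fixed set of single-site frequencies, the independent-site entropy is an upper bound on the entropy of any model that reproduces those frequencies, so adding correlations between positions can only shrink the estimate obtained here.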

Citing Articles

A metric and its derived protein network for evaluation of ortholog database inconsistency.

Yang W, Ji J, Fang G. BMC Bioinformatics. 2025; 26(1):6.

PMID: 39773281 PMC: 11707888. DOI: 10.1186/s12859-024-06023-x.


A transfer-learning approach to predict antigen immunogenicity and T-cell receptor specificity.

Bravi B, Di Gioacchino A, Fernandez-de-Cossio-Diaz J, Walczak A, Mora T, Cocco S. Elife. 2023; 12.

PMID: 37681658 PMC: 10522340. DOI: 10.7554/eLife.85126.


The Effect of Mutations in the TPR and Ankyrin Families of Alpha Solenoid Repeat Proteins.

Izert M, Szybowska P, Gorna M, Merski M. Front Bioinform. 2022; 1:696368.

PMID: 36303725 PMC: 9581033. DOI: 10.3389/fbinf.2021.696368.


Navigating the amino acid sequence space between functional proteins using a deep learning framework.

Bitard-Feildel T. PeerJ Comput Sci. 2021; 7:e684.

PMID: 34616884 PMC: 8459775. DOI: 10.7717/peerj-cs.684.


Exploring the sequence fitness landscape of a bridge between protein folds.

Tian P, Best R. PLoS Comput Biol. 2020; 16(10):e1008285.

PMID: 33048928 PMC: 7553338. DOI: 10.1371/journal.pcbi.1008285.

