A Unified Statistical Framework for Sequence Comparison and Structure Comparison

Overview

Journal Proc Natl Acad Sci U S A

Specialty Science

Date 1998 May 30

PMID 9600892

Citations 97

Authors

M Levitt

M Gerstein

Abstract

We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. The agreement, moreover, between the P values that we derive from this distribution and those reported by standard programs (e.g., BLAST and FASTA) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be related distantly shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure whereas many pairs have significant similarity in terms of structure but not sequence.
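The step described in the abstract, fitting an extreme-value distribution to comparison scores and reading off a P value, can be illustrated with a minimal Python sketch. It is not the authors' procedure: the method-of-moments fit, the synthetic scores, and the function names `fit_gumbel_moments`, `p_value`, and `sample_gumbel` are all assumptions made for illustration, and the abstract does not specify how the fit was actually performed.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel location (mu) and scale (beta) by the method of moments.

    Assumption: a moments fit stands in for whatever fitting method was used.
    For a Gumbel distribution, variance = (pi * beta)^2 / 6 and
    mean = mu + gamma * beta.
    """
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores)
    beta = stdev * math.sqrt(6.0) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def p_value(score, mu, beta):
    """Probability of a score at least this high by chance: 1 - exp(-exp(-z))."""
    z = (score - mu) / beta
    if z < -700.0:  # exp(-z) would overflow; the probability is effectively 1
        return 1.0
    return -math.expm1(-math.exp(-z))


def sample_gumbel(rng, mu, beta):
    """Draw one Gumbel variate by inverting the CDF (hypothetical test data)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly in (0, 1)
    return mu - beta * math.log(-math.log(u))


# Synthetic "all-vs.-all" scores standing in for real comparison scores.
rng = random.Random(0)
scores = [sample_gumbel(rng, 20.0, 5.0) for _ in range(5000)]

mu, beta = fit_gumbel_moments(scores)
print(f"fitted mu={mu:.2f}, beta={beta:.2f}")
for s in (25.0, 40.0, 60.0):
    print(f"score {s:.0f}: P = {p_value(s, mu, beta):.3g}")
```

In this sketch a higher score always maps to a smaller P value, so a single threshold on P can be applied to sequence and structure scores alike once each has its own fitted `mu` and `beta`.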

Citing Articles

How the technologies behind self-driving cars, social networks, ChatGPT, and DALL-E2 are changing structural biology.

Bochtler M Bioessays. 2024; 47(1):e2400155.

PMID: 39404756 PMC: 11662154. DOI: 10.1002/bies.202400155.

LoCoHD: a metric for comparing local environments of proteins.

Fazekas Z, Menyhard D, Perczel A Nat Commun. 2024; 15(1):4029.

PMID: 38740745 PMC: 11091161. DOI: 10.1038/s41467-024-48225-0.

Sequence-structure-function relationships in the microbial protein universe.

Leman J, Szczerbiak P, Renfrew P, Gligorijevic V, Berenberg D, Vatanen T Nat Commun. 2023; 14(1):2351.

PMID: 37100781 PMC: 10133388. DOI: 10.1038/s41467-023-37896-w.

InterPepRank: Assessment of Docked Peptide Conformations by a Deep Graph Network.

Johansson-Akhe I, Mirabello C, Wallner B Front Bioinform. 2022; 1:763102.

PMID: 36303778 PMC: 9581042. DOI: 10.3389/fbinf.2021.763102.

Estimating the Similarity between Protein Pockets.

Eguida M, Rognan D Int J Mol Sci. 2022; 23(20).

PMID: 36293316 PMC: 9604425. DOI: 10.3390/ijms232012462.

References

Murzin A, Brenner S, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995; 247(4):536-40. DOI: 10.1006/jmbi.1995.0159.

Gerstein M, Altman R. Using a measure of structural variation to define a core for the globins. Comput Appl Biosci. 1995; 11(6):633-44. DOI: 10.1093/bioinformatics/11.6.633.

Levitt M, Chothia C. Structural patterns in globular proteins. Nature. 1976; 261(5561):552-8. DOI: 10.1038/261552a0.

Lipman D, Pearson W. Rapid and sensitive protein similarity searches. Science. 1985; 227(4693):1435-41. DOI: 10.1126/science.2983426.

Gerstein M, Lesk A, Chothia C. Structural mechanisms for domain movements in proteins. Biochemistry. 1994; 33(22):6739-49. DOI: 10.1021/bi00188a001.

Brenner S, Hubbard T, Murzin A, Chothia C. Gene duplications in H. influenzae. Nature. 1995; 378(6553):140. DOI: 10.1038/378140a0.

Holm L, Sander C. Mapping the protein universe. Science. 1996; 273(5275):595-603. DOI: 10.1126/science.273.5275.595.

Remington S, Matthews B. A systematic approach to the comparison of protein structures. J Mol Biol. 1980; 140(1):77-99. DOI: 10.1016/0022-2836(80)90357-5.

Abola E, Sussman J, Prilusky J, Manning N. Protein Data Bank archives of three-dimensional macromolecular structures. Methods Enzymol. 1997; 277:556-71. DOI: 10.1016/s0076-6879(97)77031-9.

Falicov A, Cohen F. A surface of minimum area metric for the structural comparison of proteins. J Mol Biol. 1996; 258(5):871-92. DOI: 10.1006/jmbi.1996.0294.

Karlin S, Altschul S. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A. 1990; 87(6):2264-8. PMC: 53667. DOI: 10.1073/pnas.87.6.2264.

Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J Mol Biol. 1993; 233(1):123-38. DOI: 10.1006/jmbi.1993.1489.

Gibrat J, Madej T, Bryant S. Surprising similarities in structure comparison. Curr Opin Struct Biol. 1996; 6(3):377-85. DOI: 10.1016/s0959-440x(96)80058-3.

Pearson W. Identifying distantly related protein sequences. Comput Appl Biosci. 1997; 13(4):325-32. DOI: 10.1093/bioinformatics/13.4.325.

Pearson W. Effective protein sequence comparison. Methods Enzymol. 1996; 266:227-58. DOI: 10.1016/s0076-6879(96)66017-0.

Vriend G, Sander C. Detection of common three-dimensional substructures in proteins. Proteins. 1991; 11(1):52-8. DOI: 10.1002/prot.340110107.

Bryant S, Altschul S. Statistics of sequence-structure threading. Curr Opin Struct Biol. 1995; 5(2):236-44. DOI: 10.1016/0959-440x(95)80082-4.

Altschul S, Boguski M, Gish W, Wootton J. Issues in searching molecular sequence databases. Nat Genet. 1994; 6(2):119-29. DOI: 10.1038/ng0294-119.

Gerstein M, Levitt M. Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins. Protein Sci. 1998; 7(2):445-56. PMC: 2143933. DOI: 10.1002/pro.5560070226.

Pearson W, Lipman D. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988; 85(8):2444-8. PMC: 280013. DOI: 10.1073/pnas.85.8.2444.