Exploring the Nonlinear Geometry of Protein Homology

Overview

Journal Protein Sci

Specialty Biochemistry

Date 2003 Jul 24

PMID 12876310

Citations 1

Authors

Michael A Farnum

Huafeng Xu

Dimitris K Agrafiotis


Abstract

The explosion of biological data resulting from genomic and proteomic research has created a pressing need for data analysis techniques that work effectively on a large scale. An area of particular interest is the organization and visualization of large families of protein sequences. An increasingly popular approach is to embed the sequences into a low-dimensional Euclidean space in a way that preserves some predefined measure of sequence similarity. This method has been shown to produce maps that exhibit global order and continuity and reveal important evolutionary, structural, and functional relationships between the embedded proteins. However, protein sequences are related by evolutionary pathways that exhibit highly nonlinear geometry, which is invisible to classical embedding procedures such as multidimensional scaling (MDS) and nonlinear mapping (NLM). Here, we describe the use of stochastic proximity embedding (SPE) for producing Euclidean maps that preserve the intrinsic dimensionality and metric structure of the data. SPE extends previous approaches in two important ways: (1) It preserves only local relationships between closely related sequences, thus allowing the map to unfold and reveal its intrinsic dimension, and (2) it scales linearly with the number of sequences and therefore can be applied to very large protein families. The merits of the algorithm are illustrated using examples from the protein kinase and nuclear hormone receptor superfamilies.
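The two properties claimed for SPE, attention to local relationships only and linear scaling, can be illustrated with a small sketch. The abstract gives no algorithmic detail, so what follows is an assumption: it uses the pairwise-refinement form in which SPE is commonly described, where a random pair of points is drawn and nudged toward its target separation. The function names (`spe_embed`, `local_error`), the neighborhood `cutoff`, the linear learning-rate schedule, and the choice of ten updates per point per cycle are illustrative and not taken from the paper.

```python
import math
import random


def spe_embed(n, dissimilarity, dim=2, cutoff=1.0, cycles=100,
              steps_per_cycle=None, lr_start=1.0, lr_end=0.01,
              seed=0, eps=1e-9):
    """Embed n items in `dim` dimensions by stochastic pairwise refinement.

    `dissimilarity(i, j)` returns the target distance between items i and j.
    All parameter names and defaults are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    coords = [[rng.random() for _ in range(dim)] for _ in range(n)]
    if steps_per_cycle is None:
        # Work per cycle grows linearly with n; no full distance matrix is built.
        steps_per_cycle = 10 * n
    lr = lr_start
    lr_step = (lr_start - lr_end) / max(cycles - 1, 1)
    for _ in range(cycles):
        for _ in range(steps_per_cycle):
            i, j = rng.sample(range(n), 2)
            target = dissimilarity(i, j)
            dist = math.dist(coords[i], coords[j])
            # Pairs within the cutoff are always adjusted. Distant pairs are
            # only pushed apart when the map places them closer than the
            # target, so long-range distances are never forced to match.
            if target <= cutoff or dist < target:
                scale = lr * 0.5 * (target - dist) / (dist + eps)
                for k in range(dim):
                    delta = scale * (coords[i][k] - coords[j][k])
                    coords[i][k] += delta
                    coords[j][k] -= delta
        lr -= lr_step
    return coords


def local_error(coords, n, dissimilarity, cutoff):
    """Mean absolute gap between map and target distance for local pairs."""
    gaps = [abs(math.dist(coords[i], coords[j]) - dissimilarity(i, j))
            for i in range(n) for j in range(i + 1, n)
            if dissimilarity(i, j) <= cutoff]
    return sum(gaps) / len(gaps) if gaps else 0.0
```

For protein families, `dissimilarity` would wrap whatever sequence dissimilarity measure is chosen, for example one derived from pairwise alignment scores. That choice is left open here. Because each update touches a single pair, the measure can be evaluated on demand rather than precomputed for all pairs.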

Citing Articles

Molecular evolution of phosphoprotein phosphatases in Drosophila.

Miskei M, Adam C, Kovacs L, Karanyi Z, Dombradi V PLoS One. 2011; 6(7):e22218.

PMID: 21789237 PMC: 3137614. DOI: 10.1371/journal.pone.0022218.
