
Exploring the Nonlinear Geometry of Protein Homology

Overview
Journal: Protein Sci
Specialty: Biochemistry
Date: 2003 Jul 24
PMID: 12876310
Citations: 1
Abstract

The explosion of biological data resulting from genomic and proteomic research has created a pressing need for data analysis techniques that work effectively on a large scale. An area of particular interest is the organization and visualization of large families of protein sequences. An increasingly popular approach is to embed the sequences into a low-dimensional Euclidean space in a way that preserves some predefined measure of sequence similarity. This method has been shown to produce maps that exhibit global order and continuity and reveal important evolutionary, structural, and functional relationships between the embedded proteins. However, protein sequences are related by evolutionary pathways that exhibit highly nonlinear geometry, which is invisible to classical embedding procedures such as multidimensional scaling (MDS) and nonlinear mapping (NLM). Here, we describe the use of stochastic proximity embedding (SPE) for producing Euclidean maps that preserve the intrinsic dimensionality and metric structure of the data. SPE extends previous approaches in two important ways: (1) It preserves only local relationships between closely related sequences, thus allowing the map to unfold and reveal its intrinsic dimension, and (2) it scales linearly with the number of sequences and therefore can be applied to very large protein families. The merits of the algorithm are illustrated using examples from the protein kinase and nuclear hormone receptor superfamilies.
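The abstract does not spell out the update rule, so the following is only a minimal sketch of a stochastic-proximity-style embedding, assuming the commonly described pairwise refinement scheme: pick a random pair, compare its current map distance with its target dissimilarity, and nudge both points along the line joining them. Pairs whose target dissimilarity exceeds a neighborhood cutoff are only pushed apart when they sit too close and are never pulled together, which is one way to preserve local relationships only. The function name `spe_embed`, its parameters, and the linear learning-rate schedule are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def spe_embed(dist, n_points, dim=2, cutoff=1.0, n_cycles=100,
              steps_per_cycle=1000, lr_start=1.0, lr_end=0.01, seed=0):
    """Illustrative SPE-style embedding (hypothetical helper, not the authors' code).

    dist(i, j) must return the target dissimilarity between items i and j.
    Returns a list of n_points coordinate lists, each of length dim.
    """
    rng = random.Random(seed)
    eps = 1e-10  # guards against division by zero for coincident points
    # Random initial coordinates in the unit hypercube.
    coords = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    lr = lr_start
    lr_step = (lr_start - lr_end) / max(n_cycles - 1, 1)
    for _ in range(n_cycles):
        for _ in range(steps_per_cycle):
            # Sample one random pair; no full distance matrix is ever built.
            i, j = rng.sample(range(n_points), 2)
            target = dist(i, j)
            current = math.dist(coords[i], coords[j])
            # Distant pairs (beyond the cutoff) are left alone unless the map
            # places them closer together than their target dissimilarity.
            if target > cutoff and current >= target:
                continue
            scale = lr * 0.5 * (target - current) / (current + eps)
            for k in range(dim):
                diff = coords[i][k] - coords[j][k]
                coords[i][k] += scale * diff  # move i away from / toward j
                coords[j][k] -= scale * diff  # move j symmetrically
        lr -= lr_step  # assumed linear decay of the learning rate
    return coords
```

The work in this sketch is `n_cycles * steps_per_cycle` pair updates, so if the number of steps is chosen proportional to the number of sequences, the cost grows linearly rather than quadratically. In practice `dist` would wrap whatever sequence dissimilarity measure is in use, and `cutoff` would be tuned to the scale of that measure.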

Citing Articles

Molecular evolution of phosphoprotein phosphatases in Drosophila.

Miskei M, Adam C, Kovacs L, Karanyi Z, Dombradi V. PLoS One. 2011; 6(7):e22218.

PMID: 21789237 PMC: 3137614. DOI: 10.1371/journal.pone.0022218.
