
Solving the Protein Sequence Metric Problem

Overview
Specialty: Science
Date: 2005 Apr 27
PMID: 15851683
Citations: 186
Abstract

Biological sequences are composed of long strings of alphabetic letters rather than arrays of numerical values. Lack of a natural underlying metric for comparing such alphabetic data significantly inhibits sophisticated statistical analyses of sequences, modeling structural and functional aspects of proteins, and related problems. Herein, we use multivariate statistical analyses on almost 500 amino acid attributes to produce a small set of highly interpretable numeric patterns of amino acid variability. These high-dimensional attribute data are summarized by five multidimensional patterns of attribute covariation that reflect polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge. Numerical scores for each amino acid then transform amino acid sequences for statistical analyses. Relationships between transformed data and amino acid substitution matrices show significant associations for polarity and codon diversity scores. Transformed alphabetic data are used in analysis of variance and discriminant analysis to study DNA binding in the basic helix-loop-helix proteins. The transformed scores offer a general solution for analyzing a wide variety of sequence analysis problems.
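The transformation step described in the abstract, replacing each residue with its five numeric scores, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `transform_sequence` and `factor_means` are hypothetical, and the values in `TOY_SCORES` are invented placeholders, not the published scores. Real use would substitute the score table from the paper for all 20 amino acids.

```python
from typing import Dict, List, Sequence, Tuple

# The five patterns named in the abstract, in the order they are listed there.
FACTOR_NAMES = (
    "polarity",
    "secondary_structure",
    "molecular_volume",
    "codon_diversity",
    "electrostatic_charge",
)

Scores = Dict[str, Tuple[float, ...]]

# PLACEHOLDER values made up for illustration only.
# They are NOT the published scores and cover only four residues.
TOY_SCORES: Scores = {
    "A": (-0.5, -1.0, -0.5, 1.0, 0.0),
    "K": (1.5, -1.0, 0.5, 0.5, 1.5),
    "E": (1.0, -1.5, 1.0, 0.0, -1.0),
    "L": (-1.0, -1.0, 0.5, 1.0, 0.0),
}


def transform_sequence(seq: str, scores: Scores) -> List[Tuple[float, ...]]:
    """Map each residue letter to its tuple of numeric scores.

    Returns one row per sequence position, so a sequence of length n
    becomes an n x 5 numeric table suitable for statistical analysis.
    """
    rows = []
    for pos, residue in enumerate(seq.upper()):
        if residue not in scores:
            raise KeyError(f"no scores for residue {residue!r} at position {pos}")
        rows.append(tuple(scores[residue]))
    return rows


def factor_means(rows: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Average each score column over all positions of one sequence."""
    if not rows:
        raise ValueError("cannot summarize an empty sequence")
    n = len(rows)
    return {name: sum(col) / n for name, col in zip(FACTOR_NAMES, zip(*rows))}


if __name__ == "__main__":
    rows = transform_sequence("AKEL", TOY_SCORES)
    for residue, row in zip("AKEL", rows):
        print(residue, row)
    print(factor_means(rows))
```

Once sequences are in this numeric form, each alignment position contributes five ordinary variables, which is what allows methods such as analysis of variance or discriminant analysis to be applied to them.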

Citing Articles

Exploring the repository of de novo-designed bifunctional antimicrobial peptides through deep learning.

Dong R, Liu R, Liu Z, Liu Y, Zhao G, Li H. Elife. 2025; 13.

PMID: 40079572 PMC: 11906162. DOI: 10.7554/eLife.97330.


An antibody developability triaging pipeline exploiting protein language models.

Sweet-Jones J, Martin A. MAbs. 2025; 17(1):2472009.

PMID: 40038849 PMC: 11901365. DOI: 10.1080/19420862.2025.2472009.


Phenomenological Modeling of Antibody Response from Vaccine Strain Composition.

Ovchinnikov V, Karplus M. Antibodies (Basel). 2025; 14(1.

PMID: 39846614 PMC: 11755667. DOI: 10.3390/antib14010006.


Unveiling cross-reactivity: implications for immune response modulation in cancer.

Pretti M, Vieira G, Boroni M, Bonamino M. Brief Bioinform. 2025; 26(1).

PMID: 39831892 PMC: 11744606. DOI: 10.1093/bib/bbaf012.


Machine Learning Reveals Signatures of Promiscuous Microbial Amidases for Micropollutant Biotransformations.

Marti T, Schweizer D, Yu Y, Scharer M, Probst S, Robinson S. ACS Environ Au. 2025; 5(1):114-127.

PMID: 39830714 PMC: 11741061. DOI: 10.1021/acsenvironau.4c00066.


References
1. Henikoff S, Henikoff J. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992; 89(22):10915-9. PMC: 50453. DOI: 10.1073/pnas.89.22.10915.

2. Atchley W, Fernandes A. Sequence signatures and the probabilistic identification of proteins in the Myc-Max-Mad network. Proc Natl Acad Sci U S A. 2005; 102(18):6401-6. PMC: 1088358. DOI: 10.1073/pnas.0408964102.

3. Janin J, Wodak S. Conformation of amino acid side-chains in proteins. J Mol Biol. 1978; 125(3):357-86. DOI: 10.1016/0022-2836(78)90408-4.

4. Atchley W, Fitch W. A natural classification of the basic helix-loop-helix class of transcription factors. Proc Natl Acad Sci U S A. 1997; 94(10):5172-6. PMC: 24651. DOI: 10.1073/pnas.94.10.5172.

5. Oobatake M, Ooi T. An analysis of non-bonded energy of proteins. J Theor Biol. 1977; 67(3):567-84. DOI: 10.1016/0022-5193(77)90058-3.