
Methods for Assessing the Statistical Significance of Molecular Sequence Features by Using General Scoring Schemes

Overview
Specialty Science
Date 1990 Mar 1
PMID 2315319
Citations 411
Abstract

An unusual pattern in a nucleic acid or protein sequence or a region of strong similarity shared by two or more sequences may have biological significance. It is therefore desirable to know whether such a pattern can have arisen simply by chance. To identify interesting sequence patterns, appropriate scoring values can be assigned to the individual residues of a single sequence or to sets of residues when several sequences are compared. For single sequences, such scores can reflect biophysical properties such as charge, volume, hydrophobicity, or secondary structure potential; for multiple sequences, they can reflect nucleotide or amino acid similarity measured in a wide variety of ways. Using an appropriate random model, we present a theory that provides precise numerical formulas for assessing the statistical significance of any region with high aggregate score. A second class of results describes the composition of high-scoring segments. In certain contexts, these permit the choice of scoring systems which are "optimal" for distinguishing biologically relevant patterns. Examples are given of applications of the theory to a variety of protein sequences, highlighting segments with unusual biological features. These include distinctive charge regions in transcription factors and protooncogene products, pronounced hydrophobic segments in various receptor and transport proteins, and statistically significant subalignments involving the recently characterized cystic fibrosis gene.
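The abstract does not state the formulas themselves, so the following is only a minimal sketch of the kind of calculation it describes, under stated assumptions. It assumes an independent-residue random model with residue probabilities p_i and scores s_i whose expected value is negative and of which at least one is positive. The function names are illustrative and not taken from the paper. The constant K is treated as a user-supplied input here; its computation is not attempted. The p-value uses the familiar asymptotic form 1 - exp(-K n e^(-lambda S)) for a sequence of length n, which should be read as an approximation rather than the paper's exact statement.

```python
import math


def max_segment_score(scores):
    """Highest aggregate score over all contiguous segments (0 if none is positive)."""
    best = 0.0
    running = 0.0
    for s in scores:
        # Restart the segment whenever the running total drops below zero.
        running = max(0.0, running + s)
        best = max(best, running)
    return best


def solve_lambda(probs, scores, tol=1e-12):
    """Positive root of sum(p_i * exp(lambda * s_i)) = 1, found by bisection.

    Assumes a negative expected score and at least one positive score,
    which together guarantee a unique positive root.
    """
    expected = sum(p * s for p, s in zip(probs, scores))
    if expected >= 0 or max(scores) <= 0:
        raise ValueError("need negative expected score and a positive score")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in zip(probs, scores)) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) <= 0:  # grow the bracket until f changes sign
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def segment_p_value(score, n, lam, K):
    """Approximate chance of a segment scoring >= score in a random sequence
    of length n. K is an assumed, externally supplied constant."""
    return 1.0 - math.exp(-K * n * math.exp(-lam * score))


def target_frequencies(probs, scores, lam):
    """Residue frequencies expected inside high-scoring segments:
    q_i = p_i * exp(lambda * s_i)."""
    return [p * math.exp(lam * s) for p, s in zip(probs, scores)]


if __name__ == "__main__":
    # Toy two-letter model (hypothetical numbers): +1 with prob 0.25, -1 with prob 0.75.
    probs, scores = [0.25, 0.75], [1.0, -1.0]
    lam = solve_lambda(probs, scores)
    print("lambda:", lam)
    print("target frequencies:", target_frequencies(probs, scores, lam))
    print("p-value:", segment_p_value(score=10, n=1000, lam=lam, K=0.1))
```

Reading the scores back from the target frequencies, s_i = ln(q_i / p_i) / lambda, is one way to interpret the abstract's remark about "optimal" scoring systems: a scheme is well matched to a class of patterns when the composition it implies for high-scoring segments is the composition of those patterns.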

Citing Articles

Challenges in adjusting scoring matrices when comparing functional motifs with non-standard compositions.

Jarnot P Sci Rep. 2024; 14(1):31777.

PMID: 39738463 PMC: 11685636. DOI: 10.1038/s41598-024-82548-8.


Alignment-Free Viral Sequence Classification at Scale.

van Zyl D, Dunaiski M, Tegally H, Baxter C, de Oliveira T, Xavier J bioRxiv. 2024.

PMID: 39713356 PMC: 11661207. DOI: 10.1101/2024.12.10.627186.


A BLAST from the past: revisiting blastp's E-value.

Lu Y, Noble W, Keich U Bioinformatics. 2024; 40(12).

PMID: 39656790 PMC: 11652269. DOI: 10.1093/bioinformatics/btae729.


Mining and expression analysis of color related genes in Bougainvillea glabra bracts based on transcriptome sequencing.

Wang F, Yao G, Li J, Zhu W, Li Z, Sun Z Sci Rep. 2024; 14(1):24491.

PMID: 39424873 PMC: 11489674. DOI: 10.1038/s41598-024-73964-x.


Antibacterial and antibiofilm activities of bacteriocin produced by a new strain of Enterococcus faecalis BDR22.

Dutta B, Basu D, Lahiri D, Nag M, Ray R Naunyn Schmiedebergs Arch Pharmacol. 2024.

PMID: 39311922 DOI: 10.1007/s00210-024-03458-0.

