
Empirical Statistical Estimates for Sequence Similarity Searches

Overview
Journal: J Mol Biol
Publisher: Elsevier
Date: 1998 Mar 26
PMID: 9514730
Citations: 74
Abstract

The FASTA package of sequence comparison programs has been modified to provide accurate statistical estimates for local sequence similarity scores with gaps. These estimates are derived using the extreme value distribution from the mean and variance of the local similarity scores of unrelated sequences after the scores have been corrected for the expected effect of library sequence length. This approach allows accurate estimates to be calculated for both FASTA and Smith-Waterman similarity scores for protein/protein, DNA/DNA, and protein/translated-DNA comparisons. The accuracy of the statistical estimates is summarized for 54 protein families using FASTA and Smith-Waterman scores. Probability estimates calculated from the distribution of similarity scores are generally conservative, as are probabilities calculated using the Altschul-Gish lambda, kappa, and eta parameters. The performance of several alternative methods for correcting similarity scores for library-sequence length was evaluated using 54 protein superfamilies from the PIR39 database and 110 protein families from the Prosite/SwissProt rel. 34 database. Both regression-scaled and Altschul-Gish scaled scores perform significantly better than unscaled Smith-Waterman or FASTA similarity scores. When the Prosite/SwissProt test set is used, regression-scaled scores perform slightly better; when the PIR database is used, Altschul-Gish scaled scores perform best. Thus, length-corrected similarity scores improve the sensitivity of database searches. Statistical parameters that are derived from the distribution of similarity scores from the thousands of unrelated sequences typically encountered in a database search provide accurate estimates of statistical significance that can be used to infer sequence homology.
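The length correction and extreme value estimate described in the abstract can be illustrated with a minimal sketch. It is not the FASTA package's implementation. It assumes the length correction is an ordinary least-squares regression of score on the natural log of library sequence length, which the abstract does not specify, and it uses the standard extreme value distribution with mean 0 and variance 1. All function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_length_regression(scores, lengths):
    """Least-squares fit of score = a + b * ln(length).

    Assumption: a plain log-length regression over scores of unrelated
    library sequences. Returns (a, b, residual_variance).
    """
    xs = [math.log(n) for n in lengths]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_s = sum(scores) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (s - mean_s) for x, s in zip(xs, scores))
    b = sxy / sxx
    a = mean_s - b * mean_x
    residuals = [s - (a + b * x) for x, s in zip(xs, scores)]
    variance = sum(r * r for r in residuals) / (k - 2)
    return a, b, variance


def z_score(score, length, a, b, variance):
    """Length-corrected score in standard-deviation units."""
    return (score - (a + b * math.log(length))) / math.sqrt(variance)


def evd_pvalue(z):
    """P(Z >= z) for an extreme value distribution with mean 0, variance 1."""
    x = math.exp(-math.pi / math.sqrt(6.0) * z - EULER_GAMMA)
    return -math.expm1(-x)  # 1 - exp(-x), stable for very small x


def expectation(z, library_size):
    """Expected number of library sequences scoring at least z by chance."""
    return library_size * evd_pvalue(z)
```

In this sketch, a search would first fit the regression on the scores of all library sequences (most of which are unrelated to the query), then convert each raw score to a z-score and an expectation value. A production implementation would also need to exclude related, high-scoring sequences from the fit so that they do not inflate the variance.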

Citing Articles

Resistance and virulence genes characteristic of a South Asia Clade (I) Candida auris strain isolated from blood in Beijing.

Yang J, Ma G, Li Y, Shi Y, Liang G. Clinics (Sao Paulo). 2024; 79:100497.

PMID: 39284275 PMC: 11419799. DOI: 10.1016/j.clinsp.2024.100497.


The Pentameric Ligand-Gated Ion Channel Family: A New Member of the Voltage Gated Ion Channel Superfamily?

Dubey A, Baxter M, Hendargo K, Medrano-Soto A, Saier Jr M. Int J Mol Sci. 2024; 25(9).

PMID: 38732224 PMC: 11084639. DOI: 10.3390/ijms25095005.


KEGG tools for classification and analysis of viral proteins.

Jin Z, Sato Y, Kawashima M, Kanehisa M. Protein Sci. 2023; 32(12):e4820.

PMID: 37881892 PMC: 10661063. DOI: 10.1002/pro.4820.


Sequence Similarity among Structural Repeats in the Piezo Family of Mechanosensitive Ion Channels.

Hendargo K, Patel A, Chukwudozie O, Moreno-Hagelsieb G, Christen J, Medrano-Soto A. Microb Physiol. 2023; 33(1):49-62.

PMID: 37321192 PMC: 11283329. DOI: 10.1159/000531468.


Protein salvage and repurposing in evolution: Phospholipase D toxins are stabilized by a remodeled scrap of a membrane association domain.

Cordes M, Sundman A, Fox H, Binford G. Protein Sci. 2023; 32(7):e4701.

PMID: 37313620 PMC: 10303701. DOI: 10.1002/pro.4701.