
Improving the Accuracy of PSI-BLAST Protein Database Searches with Composition-based Statistics and Other Refinements

Overview
Specialty: Biochemistry
Date: 2001 Jul 14
PMID: 11452024
Citations: 524
Abstract

PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
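The normalized ROC measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common truncated form (often written ROC_n): for each of the first n false positives in the ranked hit list, count the true positives ranked above it, sum those counts, and divide by n times the total number of true positives. The function name `roc_n`, its parameters, and the handling of lists that contain fewer than n false positives are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], n_false: int, total_true: int) -> float:
    """Truncated, normalized ROC score in the range [0, 1].

    ranked_labels: hits ordered best-first; True marks a true positive.
    n_false:       number of false positives at which to truncate (hypothetical name).
    total_true:    number of true positives known to exist in the database.
    """
    if n_false <= 0 or total_true <= 0:
        raise ValueError("n_false and total_true must be positive")

    true_seen = 0   # true positives encountered so far
    false_seen = 0  # false positives encountered so far
    area = 0        # running sum of true positives ranked above each false positive

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n_false:
                break

    # Assumption: if the list ends before n_false false positives appear,
    # the missing ones are treated as ranked below every listed hit.
    area += (n_false - false_seen) * true_seen

    return area / (n_false * total_true)
```

For example, `roc_n([True, False, True, False], n_false=2, total_true=2)` sums 1 and 2 true positives above the two false positives and divides by 4. A list with every true positive ranked first scores 1, and a list with every false positive ranked first scores 0.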

Citing Articles

Piperideine-6-carboxylic acid regulates vitamin B6 homeostasis and modulates systemic immunity in plants.

Liu H, Iyer L, Norris P, Liu R, Yu K, Grant M. Nat Plants. 2025; 11(2):263-278.

PMID: 39953358 DOI: 10.1038/s41477-025-01906-0.


Further Development of SAMPDI-3D: A Machine Learning Method for Predicting Binding Free Energy Changes Caused by Mutations in Either Protein or DNA.

Rimal P, Paul S, Panday S, Alexov E. Genes (Basel). 2025; 16(1).

PMID: 39858648 PMC: 11764785. DOI: 10.3390/genes16010101.


Bioinformatic approach to explain how Mg from seawater may be incorporated into coral skeletons.

Bell T, Iguchi A, Ohno Y, Sakai K, Yokoyama Y. R Soc Open Sci. 2025; 12(1):232011.

PMID: 39845712 PMC: 11750370. DOI: 10.1098/rsos.232011.


TargetCLP: clathrin proteins prediction combining transformed and evolutionary scale modeling-based multi-view features via weighted feature integration approach.

Ullah M, Akbar S, Raza A, Khan K, Zou Q. Brief Bioinform. 2025; 26(1).

PMID: 39844339 PMC: 11753890. DOI: 10.1093/bib/bbaf026.


Challenges in adjusting scoring matrices when comparing functional motifs with non-standard compositions.

Jarnot P. Sci Rep. 2024; 14(1):31777.

PMID: 39738463 PMC: 11685636. DOI: 10.1038/s41598-024-82548-8.

