Empirical Determination of Effective Gap Penalties for Sequence Comparison
Motivation: No general theory guides the selection of gap penalties for local sequence alignment. We empirically determined the most effective gap penalties for protein sequence similarity searches with substitution matrices over a range of target evolutionary distances from 20 to 200 Point Accepted Mutations (PAMs).
Results: We embedded real and simulated homologs of protein sequences into a database and searched the database to determine the gap penalties that produced the best statistical significance for the distant homologs. The most effective penalty for the first residue in a gap (q + r) changes as a function of evolutionary distance, while the gap extension penalty for additional residues (r) does not. For these data, the optimal gap penalties for a given matrix scaled in 1/3 bit units (e.g., BLOSUM50, PAM200) are q = 25 - 0.1 × (target PAM distance) and r = 5. Our results provide an empirical basis for selection of gap penalties and demonstrate how optimal gap penalties behave as a function of the target evolutionary distance of the substitution matrix. These gap penalties can improve expectation values by at least one order of magnitude when searching with short sequences, and improve the alignment of proteins containing short sequences repeated in tandem.
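The penalty rule reported above can be illustrated with a short Python sketch. It is not the authors' code: the function names `gap_penalties` and `gap_cost` are hypothetical, the rule is assumed to apply only to matrices scaled in 1/3 bit units, and restricting the input to the 20 to 200 PAM range studied is a choice made here for illustration.

```python
def gap_penalties(target_pam):
    """Return (q, r) for a matrix scaled in 1/3 bit units.

    Implements the empirical rule from the abstract:
    q = 25 - 0.1 * (target PAM distance), r = 5.
    """
    # Assumption: reject distances outside the 20-200 PAM range that was studied,
    # since the abstract does not say how the rule extrapolates.
    if not 20 <= target_pam <= 200:
        raise ValueError("rule was determined only for 20-200 PAMs")
    q = 25 - target_pam / 10.0  # same as 25 - 0.1 * target_pam
    r = 5.0                     # extension penalty does not depend on distance
    return q, r


def gap_cost(length, q, r):
    """Total penalty for a gap of `length` residues.

    The first residue costs q + r and each additional residue costs r,
    so the total is q + r * length.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return q + r * length


if __name__ == "__main__":
    for pam in (20, 120, 200):
        q, r = gap_penalties(pam)
        print(f"PAM{pam}: q={q:g}, r={r:g}, first-residue penalty={gap_cost(1, q, r):g}")
```

For a PAM200-like matrix this gives q = 5 and r = 5, so the first gap residue costs 10; at a target distance of 20 PAMs, q rises to 23 while r stays at 5.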