Improving the Alignment Quality of Consistency Based Aligners with an Evaluation Function Using Synonymous Protein Words

Overview

Journal PLoS One

Specialties General Medicine, Science

Date 2011 Dec 14

PMID 22163274

Citations 2

Authors

Hsin-Nan Lin

Cedric Notredame

Jia-Ming Chang

Ting-Yi Sung

Wen-Lian Hsu

Abstract

Most sequence alignment tools can successfully align protein sequences with high levels of sequence identity. The accuracy of the corresponding structure alignment, however, decreases rapidly for distantly related sequences (<20% identity). In this range of identity, alignments optimized to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using only sequence information. Currently available methods differ essentially in how they measure the similarity between aligned residues: substitution matrices, Fourier transform, sophisticated profile-profile functions, or, more recently, consistency-based approaches.
In this paper, we present a flexible similarity measure for residue pairs to improve the quality of protein sequence alignment. Our approach, called SymAlign, relies on the identification of conserved words that are found across a sizeable fraction of the considered dataset and are supported by evolutionary analysis. These words are then used to define a position-specific substitution matrix that better reflects the biological significance of local similarity. The experimental results show that the SymAlign scoring scheme can be incorporated within T-Coffee to improve sequence alignment accuracy. We also demonstrate that SymAlign is less sensitive to the presence of structurally non-similar proteins. In the analysis of the relationship between sequence identity and structure similarity, SymAlign can better differentiate structurally similar proteins from non-similar proteins.
We show that protein sequence alignments can be significantly improved using a similarity estimation based on weighted n-grams. In our analysis of the alignments thus produced, sequence conservation becomes a better indicator of structural similarity.
SymAlign also provides alignment visualization that can display sub-optimal alignments on dot-matrices. The visualization makes it easy to identify well-supported alternative alignments that may not have been identified by dynamic programming. SymAlign is available at http://bio-cluster.iis.sinica.edu.tw/SymAlign/.
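
The abstract describes the scoring idea only in outline: words conserved across a sizeable fraction of the dataset are turned into position-specific scores for residue pairs. The following Python sketch illustrates that general idea under simplifying assumptions and is not the SymAlign implementation. It counts only exact n-gram matches (no synonymous words and no evolutionary analysis), and the function names `conserved_words` and `pair_scores`, the word length `n`, and the `min_fraction` threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def conserved_words(seqs, n=3, min_fraction=0.5):
    """Return {word: weight} for n-grams found in at least
    min_fraction of the sequences. The weight is the fraction of
    sequences containing the word (an illustrative choice)."""
    support = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for i in range(len(seq) - n + 1):
            support[seq[i:i + n]].add(idx)
    threshold = min_fraction * len(seqs)
    return {
        word: len(ids) / len(seqs)
        for word, ids in support.items()
        if len(ids) >= threshold
    }


def pair_scores(seq_a, seq_b, words, n=3):
    """Build position-specific scores for residue pairs (i, j).
    Each conserved word shared by both sequences adds its weight
    to every residue pair it covers."""
    # Index the positions of conserved words in seq_b.
    pos_b = defaultdict(list)
    for j in range(len(seq_b) - n + 1):
        word = seq_b[j:j + n]
        if word in words:
            pos_b[word].append(j)

    scores = defaultdict(float)
    for i in range(len(seq_a) - n + 1):
        word = seq_a[i:i + n]
        for j in pos_b.get(word, ()):
            for k in range(n):
                scores[(i + k, j + k)] += words[word]
    return dict(scores)


# Toy input, not taken from the paper.
seqs = ["MKVLAAG", "TTKVLAC", "GGGGGGG"]
words = conserved_words(seqs, n=3, min_fraction=0.5)
scores = pair_scores(seqs[0], seqs[1], words, n=3)
```

In this toy example, residue pairs covered by two overlapping conserved words accumulate a higher score than pairs covered by one, which is one simple way a weighted n-gram scheme could favor well-supported local matches. Scores of this kind could in principle be supplied to an aligner in place of a fixed substitution matrix; how SymAlign actually weights words and integrates with T-Coffee is described in the paper itself.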

Citing Articles

Identifying functionally informative evolutionary sequence profiles.

Gil N, Fiser A Bioinformatics. 2017; 34(8):1278-1286.

PMID: 29211823 PMC: 5905606. DOI: 10.1093/bioinformatics/btx779.

On the necessity of dissecting sequence similarity scores into segment-specific contributions for inferring protein homology, function prediction and annotation.

Wong W, Maurer-Stroh S, Eisenhaber B, Eisenhaber F BMC Bioinformatics. 2014; 15:166.

PMID: 24890864 PMC: 4061105. DOI: 10.1186/1471-2105-15-166.
