
Non-negative Matrix Factorization for Learning Alignment-specific Models of Protein Evolution

Overview
Journal PLoS One
Date 2012 Jan 5
PMID 22216138
Citations 4
Abstract

Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data are available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models.
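The two-stage idea described in the abstract (learn shared basis matrices from many alignments, then fit only a few weights for the alignment of interest) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the objective or the update rule, so the sketch assumes squared-error NNMF with the standard Lee-Seung multiplicative updates, uses synthetic data in place of real substitution matrices, and all function and variable names (`nnmf`, `fit_weights`, `general`, `target`) are hypothetical.

```python
import numpy as np

def nnmf(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (alignments x features) as W @ H.

    Assumption: squared-error objective with multiplicative updates,
    which keep W and H non-negative at every step.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

def fit_weights(v, H, n_iter=500, eps=1e-9):
    """With the basis H held fixed, fit non-negative weights w so w @ H ~ v."""
    w = np.ones(H.shape[0])
    HHt = H @ H.T
    for _ in range(n_iter):
        w *= (H @ v) / (HHt @ w + eps)
    return w

# Synthetic stand-in for a "general" dataset: 40 alignments, each summarised
# by a flattened 20 x 20 non-negative matrix generated from 3 hidden bases.
rng = np.random.default_rng(1)
n_states, rank = 20, 3
true_basis = rng.random((rank, n_states * n_states))
general = rng.random((40, rank)) @ true_basis

# Stage 1: learn the shared basis matrices from the general dataset.
_, H = nnmf(general, rank=rank)
basis = H.reshape(rank, n_states, n_states)

# Stage 2: for a new alignment, estimate only `rank` weights.
target = rng.random(rank) @ true_basis
weights = fit_weights(target, H)

# The alignment-specific model is the weighted sum of the basis matrices.
model = np.tensordot(weights, basis, axes=1)
rel_err = np.linalg.norm(model.ravel() - target) / np.linalg.norm(target)
print("weights:", weights)
print("relative reconstruction error:", rel_err)
```

The point of the sketch is the parameter count: the second stage estimates only `rank` numbers for the new alignment instead of a full matrix, which mirrors the abstract's claim that the model has far fewer parameters than a specialist model.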

Citing Articles

Online multi-modal robust non-negative dictionary learning for visual tracking.

Zhang X, Guan N, Tao D, Qiu X, Luo Z. PLoS One. 2015; 10(5):e0124685.

PMID: 25961715 PMC: 4427315. DOI: 10.1371/journal.pone.0124685.


Gene-wide identification of episodic selection.

Murrell B, Weaver S, Smith M, Wertheim J, Murrell S, Aylward A. Mol Biol Evol. 2015; 32(5):1365-71.

PMID: 25701167 PMC: 4408417. DOI: 10.1093/molbev/msv035.


Discriminant projective non-negative matrix factorization.

Guan N, Zhang X, Luo Z, Tao D, Yang X. PLoS One. 2013; 8(12):e83291.

PMID: 24376680 PMC: 3869764. DOI: 10.1371/journal.pone.0083291.


Superiority of a mechanistic codon substitution model even for protein sequences in phylogenetic analysis.

Miyazawa S. BMC Evol Biol. 2013; 13:257.

PMID: 24256155 PMC: 4225520. DOI: 10.1186/1471-2148-13-257.
