
Performance-based Selection of Likelihood Models for Phylogeny Estimation

Overview
Journal: Syst Biol
Specialty: Biology
Date: 2003 Oct 8
PMID: 14530134
Citations: 89
Abstract

Phylogenetic estimation has largely come to rely on explicitly model-based methods. This approach requires that a model be chosen and that the choice be justified. To date, justification has mostly been accomplished through use of likelihood-ratio tests (LRTs) to assess the relative fit of a nested series of reversible models. While this approach certainly represents an important advance over arbitrary model selection, the best-fitting model in a series may not always provide the most reliable phylogenetic estimates for finite real data sets, where all available models are surely incorrect. Here, we develop a novel approach to model selection, which is based on the Bayesian information criterion (BIC) but incorporates relative branch-length error as a performance measure in a decision theory (DT) framework. This DT method includes a penalty for overfitting, is applicable prior to running extensive analyses, and compares all models under consideration simultaneously, so it does not rely on a series of pairwise comparisons of models to traverse model space. We evaluate this method by examining four real data sets and by using those data sets to define simulation conditions. In the real data sets, the DT method selects models that are the same as, or simpler than, those chosen by conventional LRTs. To lend generality to the simulations, codon-based models (with parameters estimated from the real data sets) were used to generate simulated data sets, which are therefore more complex than any of the models we evaluate. On average, the DT method selects models that are simpler than those chosen by conventional LRTs. Nevertheless, these simpler models provide estimates of branch lengths that are more accurate, in terms of both relative error and absolute error, than those derived using the more complex (yet still wrong) models chosen by conventional LRTs. This method is available in a program called DT-ModSel.
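The abstract does not spell out the selection rule, so the following is only a minimal sketch of how a BIC-based decision-theoretic choice of this kind could work, not the DT-ModSel implementation. It assumes that each candidate model has already been fitted on a common topology, that exp(-BIC/2) serves as an approximate posterior weight for each model, and that the performance loss between two models is the Euclidean distance between their branch-length vectors. The function names, the dictionary layout, and the choice of distance are all hypothetical.

```python
import math


def bic(log_likelihood, num_params, num_sites):
    """Bayesian information criterion: -2 ln L + K ln n."""
    return -2.0 * log_likelihood + num_params * math.log(num_sites)


def bic_weights(bics):
    """Approximate posterior model weights proportional to exp(-BIC/2).

    The minimum BIC is subtracted first to avoid numerical underflow.
    """
    best = min(bics)
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]


def branch_length_distance(a, b):
    """Euclidean distance between two branch-length vectors (assumed loss)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_model(models, num_sites):
    """Return the model with the smallest expected branch-length loss.

    `models` maps a model name to a dict with hypothetical keys:
    "lnL" (maximized log-likelihood), "k" (number of free parameters),
    and "branches" (branch lengths estimated on a shared topology).
    """
    names = list(models)
    bics = [bic(models[n]["lnL"], models[n]["k"], num_sites) for n in names]
    weights = bic_weights(bics)
    risks = {}
    for ni in names:
        # Expected loss of using model ni, averaged over all candidates
        # weighted by their approximate posterior probabilities.
        risks[ni] = sum(
            w * branch_length_distance(models[ni]["branches"],
                                       models[nj]["branches"])
            for nj, w in zip(names, weights)
        )
    best = min(risks, key=risks.get)
    return best, risks
```

Because every candidate is scored against all the others in a single pass, no sequence of pairwise tests is needed, and the ln n term in the BIC supplies the penalty for overfitting. Any input values passed to `select_model` here would be illustrative rather than taken from the study.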
