
Selecting Informative Subsets of Sparse Supermatrices Increases the Chance to Find Correct Trees

Overview
Publisher Biomed Central
Specialty Biology
Date 2013 Dec 5
PMID 24299043
Citations 36
Abstract

Background: Character matrices with extensive missing data are frequently used in phylogenomics, with potentially detrimental effects on the accuracy and robustness of tree inference. Many investigators therefore select taxa and genes with high data coverage. The drawback of such selections is that they rely exclusively on data coverage and ignore the actual signal in the data, so they may not deliver optimal data matrices in terms of potential phylogenetic signal. To circumvent this problem, we developed a heuristic, implemented in the software mare, which (1) assesses the information content of genes in supermatrices using a measure of potential signal combined with data coverage and (2) reduces supermatrices with a simple hill-climbing procedure to submatrices with high total information content. We conducted simulation studies using matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10% and 30%.
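The two steps named in the Background (scoring a matrix by signal combined with coverage, then reducing it by hill climbing) can be illustrated with a minimal Python sketch. This is not the mare implementation: the scoring formula, the `alpha` size exponent, the `min_taxa` and `min_genes` limits, and the function names `score` and `reduce_matrix` are all assumptions made for illustration. The per-gene `signal` values are taken as given inputs here, whereas the abstract does not specify how potential signal is measured.

```python
from typing import List, Set, Tuple


def score(presence: List[List[int]], signal: List[float],
          taxa: Set[int], genes: Set[int], alpha: float = 0.5) -> float:
    """Hypothetical objective: signal-weighted fill of the submatrix,
    multiplied by a size term so that tiny submatrices are not favored."""
    cells = len(taxa) * len(genes)
    if cells == 0:
        return 0.0
    total_cells = len(presence) * len(signal)
    # Each present cell contributes the (assumed) signal value of its gene.
    weighted = sum(presence[t][g] * signal[g] for t in taxa for g in genes)
    return (weighted / cells) * (cells / total_cells) ** alpha


def reduce_matrix(presence: List[List[int]], signal: List[float],
                  alpha: float = 0.5, min_taxa: int = 2,
                  min_genes: int = 1) -> Tuple[List[int], List[int], float]:
    """Simple hill climbing: repeatedly drop the single taxon or gene whose
    removal raises the score the most; stop when no removal improves it."""
    taxa = set(range(len(presence)))
    genes = set(range(len(signal)))
    best = score(presence, signal, taxa, genes, alpha)
    while True:
        candidates = []
        if len(taxa) > min_taxa:
            candidates += [(taxa - {t}, genes) for t in sorted(taxa)]
        if len(genes) > min_genes:
            candidates += [(taxa, genes - {g}) for g in sorted(genes)]
        step_best, step_state = best, None
        for cand_taxa, cand_genes in candidates:
            s = score(presence, signal, cand_taxa, cand_genes, alpha)
            if s > step_best:  # strict improvement only
                step_best, step_state = s, (cand_taxa, cand_genes)
        if step_state is None:
            break  # local optimum reached
        (taxa, genes), best = step_state, step_best
    return sorted(taxa), sorted(genes), best
```

Calling `reduce_matrix(presence, signal)` with a 0/1 presence matrix (rows are taxa, columns are genes) returns the retained taxon indices, the retained gene indices, and the final score. Setting every entry of `signal` to the same value reduces the objective to a pure presence/absence criterion. As with any hill-climbing procedure, the result is a local optimum and depends on the chosen objective.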

Results: With matrices of 50 taxa × 50 genes, heterogeneous phylogenetic signal among genes, and data coverage between 10% and 30%, Maximum Likelihood (ML) tree reconstructions failed to recover correct trees. Selecting a data subset with the proposed approach increased the chance of recovering correct partial trees more than 10-fold. The simple hill-climbing procedure performed well whether it considered the information content of genes or only their presence/absence. We also applied our approach to an empirical data set addressing questions of vertebrate systematics. With this empirical data set, selecting a data subset with high information content that supports a tree with high average bootstrap support was most successful when the information content of genes was considered.

Conclusions: Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis, outperforming the commonly used simple selections of taxa and genes with high data coverage.

Citing Articles

Unraveling myriapod evolution: sealion, a novel quartet-based approach for evaluating phylogenetic uncertainty.

Kuck P, Wilkinson M, Romahn J, Seidel N, Meusemann K, Wagele J NAR Genom Bioinform. 2025; 7(1):lqaf018.

PMID: 40060371 PMC: 11886814. DOI: 10.1093/nargab/lqaf018.


The genomic and cellular basis of biosynthetic innovation in rove beetles.

Kitchen S, Naragon T, Bruckner A, Ladinsky M, Quinodoz S, Badroos J Cell. 2024; 187(14):3563-3584.e26.

PMID: 38889727 PMC: 11246231. DOI: 10.1016/j.cell.2024.05.012.


Multiple Origins of Bioluminescence in Beetles and Evolution of Luciferase Function.

He J, Li J, Zhang R, Dong Z, Liu G, Chang Z Mol Biol Evol. 2024; 41(1).

PMID: 38174583 PMC: 10798137. DOI: 10.1093/molbev/msad287.


Evolutionary Insights into the Relationship of Frogs, Salamanders, and Caecilians and Their Adaptive Traits, with an Emphasis on Salamander Regeneration and Longevity.

Lu B Animals (Basel). 2023; 13(22).

PMID: 38003067 PMC: 10668855. DOI: 10.3390/ani13223449.


Stepwise emergence of the neuronal gene expression program in early animal evolution.

Najle S, Grau-Bove X, Elek A, Navarrete C, Cianferoni D, Chiva C Cell. 2023; 186(21):4676-4693.e29.

PMID: 37729907 PMC: 10580291. DOI: 10.1016/j.cell.2023.08.027.

