
An Efficient, Nonphylogenetic Method for Detecting Genes Sharing Evolutionary Signals in Phylogenomic Data Sets

Overview
Date 2021 Aug 14
PMID 34390574
Abstract

Assessing the compatibility between gene family phylogenies is a crucial and often computationally demanding step in many phylogenomic analyses. Here, we describe the Evolutionary Similarity Index (IES), a means to assess shared evolution between gene families using a weighted orthogonal distance regression model applied to sequence distances. Using pairwise distance matrices circumvents comparisons between gene tree topologies, which are inherently uncertain and sensitive to evolutionary model choice, phylogenetic reconstruction artifacts, and other sources of error. Furthermore, IES enables the many-to-many pairing of multiple copies between similarly evolving gene families. This is done by selecting the set of non-overlapping pairs of copies, one from each assessed family, that yields the least sum of squared residuals. Analyses of simulated gene family data sets show that the accuracy of IES is on par with popular tree-based methods while also being less susceptible to noise introduced by sequence alignment and evolutionary model fitting. Applying IES to an empirical data set of 1,322 genes from 42 archaeal genomes identified eight major clusters of gene families with compatible evolutionary trends. The most cohesive cluster consisted of 62 genes with compatible evolutionary signal, which occur both as single copies and as multiple homologs per genome; phylogenetic analysis of concatenated alignments from this cluster produced a tree closely matching previously published species trees for Archaea. Four other clusters are mainly composed of accessory genes with limited distribution among Archaea and enriched toward specific metabolic functions. Pairwise evolutionary distances obtained from these accessory gene clusters suggest patterns of interphyla horizontal gene transfer. An IES implementation is available at https://github.com/lthiberiol/evolSimIndex.
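The two ideas in the abstract, regressing one family's pairwise distances on the other's and choosing the copy pairing with the least sum of squared residuals, can be illustrated with a minimal sketch. This is not the evolSimIndex code or its API. All function names and the data layout are hypothetical, the regression line is assumed to pass through the origin, weights default to 1, one pair is chosen per shared genome, and the pairing search is exhaustive, which is only practical for toy inputs.

```python
import math
from itertools import combinations, product


def odr_slope_through_origin(x, y, w=None):
    """Closed-form slope b of y = b*x minimising weighted orthogonal residuals."""
    w = [1.0] * len(x) if w is None else w
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    syy = sum(wi * yi * yi for wi, yi in zip(w, y))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    if sxy == 0:
        raise ValueError("slope undefined: weighted cross-product sum is zero")
    diff = syy - sxx
    # Root of sxy*b^2 + (sxx - syy)*b - sxy = 0 that minimises the objective.
    return (diff + math.hypot(diff, 2.0 * sxy)) / (2.0 * sxy)


def orthogonal_ssr(x, y, b, w=None):
    """Weighted sum of squared perpendicular distances to the line y = b*x."""
    w = [1.0] * len(x) if w is None else w
    vertical = sum(wi * (yi - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    return vertical / (1.0 + b * b)


def best_pairing(dist_a, dist_b, genome_a, genome_b):
    """Exhaustively pick one (gene_a, gene_b) pair per shared genome.

    dist_*   : dict mapping frozenset({gene1, gene2}) -> pairwise distance
    genome_* : dict mapping gene id -> genome id
    Returns (ssr, slope, pairing) for the pairing with the least SSR.
    """
    shared = sorted(set(genome_a.values()) & set(genome_b.values()))
    options = []
    for genome in shared:
        copies_a = [g for g, s in genome_a.items() if s == genome]
        copies_b = [g for g, s in genome_b.items() if s == genome]
        options.append(list(product(copies_a, copies_b)))

    best = None
    for pairing in product(*options):
        # Distances between the same genome pairs, taken from each family.
        x = [dist_a[frozenset((p[0], q[0]))] for p, q in combinations(pairing, 2)]
        y = [dist_b[frozenset((p[1], q[1]))] for p, q in combinations(pairing, 2)]
        slope = odr_slope_through_origin(x, y)
        ssr = orthogonal_ssr(x, y, slope)
        if best is None or ssr < best[0]:
            best = (ssr, slope, pairing)
    return best


def _d(pairs):
    """Helper: build a symmetric distance lookup from (gene1, gene2, dist) triples."""
    return {frozenset((g1, g2)): dist for g1, g2, dist in pairs}


# Toy data: family B has two copies in genome G3 (b3 and a divergent paralog b3x).
genome_a = {"a1": "G1", "a2": "G2", "a3": "G3"}
genome_b = {"b1": "G1", "b2": "G2", "b3": "G3", "b3x": "G3"}
dist_a = _d([("a1", "a2", 1.0), ("a1", "a3", 2.0), ("a2", "a3", 3.0)])
dist_b = _d([("b1", "b2", 2.0), ("b1", "b3", 4.0), ("b2", "b3", 6.0),
             ("b1", "b3x", 9.0), ("b2", "b3x", 1.0), ("b3", "b3x", 5.0)])

ssr, slope, pairing = best_pairing(dist_a, dist_b, genome_a, genome_b)
```

In this toy input the distances involving `b3` are exactly twice the corresponding family A distances, so the search should prefer pairing `a3` with `b3` over the paralog `b3x`. A real analysis would need a faster assignment strategy and the weighting scheme described in the paper.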
