
PhyloMix: Enhancing Microbiome-trait Association Prediction Through Phylogeny-mixing Augmentation

Overview
Journal Bioinformatics
Date 2025 Jan 12
PMID 39799515
Abstract

Motivation: Understanding the associations between traits and microbial composition is a fundamental objective in microbiome research. Researchers have recently turned to machine learning (ML) models for this task, with promising results. However, the effectiveness of advanced ML models is often limited by the characteristics of microbiome data, which are typically high-dimensional, compositional, and imbalanced. These characteristics can hinder a model's ability to fully explore the relationships among taxa in predictive analyses. Data augmentation has become crucial for addressing this challenge: it generates synthetic samples with artificial labels from existing data and adds them to the training set to improve ML model performance.

Results: Here, we propose PhyloMix, a novel data augmentation method designed specifically for microbiome data to enhance predictive analyses. PhyloMix uses the phylogenetic relationships among microbiome taxa as an informative prior to guide the generation of synthetic microbial samples. It creates a new sample by removing a subtree from one sample and combining the remainder with the corresponding subtree from another sample. Notably, PhyloMix is designed to address the compositional nature of microbiome data and handles both raw counts and relative abundances. This approach introduces sufficient diversity into the augmented samples, leading to improved predictive performance. We empirically evaluated PhyloMix on six real microbiome datasets across five commonly used ML models. PhyloMix significantly outperforms baseline methods, including sample-mixing-based data augmentation techniques such as vanilla mixup and compositional cutmix, as well as the phylogeny-based method TADA. We also demonstrated the wide applicability of PhyloMix in both supervised learning and contrastive representation learning.
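The subtree-mixing idea can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the repository for that); it is a simplified illustration under stated assumptions. The function name `subtree_mix`, the representation of a subtree as a list of leaf (taxon) indices, the rescaling rule that preserves the recipient's total abundance within the subtree, and the cutmix-style label weight are all hypothetical choices made for this example.

```python
import numpy as np

def subtree_mix(x_a, x_b, y_a, y_b, subtree_leaves):
    """Replace one subtree of sample A with the same subtree from sample B.

    x_a, x_b       : abundance vectors over the same taxa (counts or proportions)
    y_a, y_b       : numeric labels of the two samples
    subtree_leaves : indices of the taxa under the chosen subtree (hypothetical
                     representation; a real pipeline would read them from a tree)
    """
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    idx = np.asarray(subtree_leaves, dtype=int)

    mixed = x_a.copy()
    mass_a = x_a[idx].sum()  # abundance A holds in the subtree
    mass_b = x_b[idx].sum()  # abundance B holds in the subtree

    # Assumption: rescale B's subtree to A's subtree mass so the sample total
    # is unchanged. If B has no abundance there, A's values are kept as-is.
    if mass_b > 0:
        mixed[idx] = x_b[idx] * (mass_a / mass_b)

    # Assumption: weight the labels by the fraction of taxa kept from A.
    lam = 1.0 - idx.size / x_a.size
    y_mixed = lam * y_a + (1.0 - lam) * y_b
    return mixed, y_mixed

# Toy example with four taxa; taxa 2 and 3 form the swapped subtree.
x_a = [0.5, 0.3, 0.1, 0.1]
x_b = [0.1, 0.1, 0.6, 0.2]
mixed, y_mixed = subtree_mix(x_a, x_b, y_a=0.0, y_b=1.0, subtree_leaves=[2, 3])
print(mixed, y_mixed)
```

In this sketch the synthetic sample keeps the same total as sample A, so the same code applies to raw counts (which become fractional after rescaling) and to relative abundances.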

Availability and Implementation: The Apache-licensed source code is available at https://github.com/batmen-lab/phylomix.
