A Framework for Block-wise Missing Data in Multi-omics

Overview

Journal PLoS One

Specialties General Medicine
Science

Date 2024 Jul 23

PMID 39042603

Authors

Sergi Baena-Miret

Ferran Reverter

Esteban Vegas

Abstract

High-throughput technologies have generated vast amounts of omic data. There is broad consensus that integrating diverse omics sources improves predictive models and biomarker discovery. However, managing multiple omics datasets poses challenges such as data heterogeneity, noise, high dimensionality and missing data, especially in block-wise patterns. This study addresses the challenges of high dimensionality and block-wise missing data through a regularization and constraint-based approach. The methodology is implemented in the R package bwm for binary and continuous response variables, and applied to breast cancer and exposome multi-omics datasets, achieving strong performance even in scenarios where missing data is present in all omics. In the binary classification task, the proposed model achieves accuracy in the range of 86% to 92% and an F1 score in the range of 68% to 79%. In the regression task, the correlation between true and predicted responses is in the range of 72% to 76%. Performance metrics decline slightly as the percentage of missing data increases. In scenarios where block-wise missing data affects multiple omics, model performance actually surpasses that of scenarios where missing data is present in only one omics. One possible explanation is that the multi-omics scenarios introduce a greater diversity of observation profiles, leading to a more robust model. For some omics, feature selection is more consistent across block-wise missing data scenarios than for others.
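To make the notions of "block-wise missing data" and "observation profile" concrete, the following minimal Python sketch groups samples by which omics blocks they have fully observed. It is an illustration only and does not reproduce the bwm package or its interface; the function name `observation_profiles`, the block names `rna` and `prot`, and the toy values are all hypothetical.

```python
import math

def observation_profiles(rows, blocks):
    """Group sample indices by the set of omics blocks that are fully observed.

    rows   : list of samples, each a list of floats (NaN marks a missing value).
    blocks : dict mapping a block name to the column indices of that block.
    Returns a dict mapping a tuple of available block names to sample indices.
    """
    profiles = {}
    for i, row in enumerate(rows):
        # A block counts as available only if none of its columns is NaN.
        key = tuple(
            name
            for name, cols in blocks.items()
            if not any(math.isnan(row[c]) for c in cols)
        )
        profiles.setdefault(key, []).append(i)
    return profiles

nan = float("nan")

# Toy data (hypothetical): 4 samples, two blocks of two features each.
rows = [
    [1.0, 2.0, 3.0, 4.0],   # both blocks observed
    [1.5, 2.5, nan, nan],   # "prot" block missing
    [nan, nan, 3.5, 4.5],   # "rna" block missing
    [0.5, 1.0, 2.0, 3.0],   # both blocks observed
]
blocks = {"rna": [0, 1], "prot": [2, 3]}

profiles = observation_profiles(rows, blocks)
for available, samples in profiles.items():
    print(available, samples)
```

In this toy case the four samples fall into three profiles: both blocks observed, only `rna` observed, and only `prot` observed. A scenario in which several omics have missing blocks produces more distinct profiles, which is the "diversity of observation profiles" the abstract refers to.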

Citing Articles

Predictive analytics in bronchopulmonary dysplasia: past, present, and future.

McOmber B, Moreira A, Kirkman K, Acosta S, Rusin C, Shivanna B. Front Pediatr. 2024; 12:1483940.

PMID: 39633818 PMC: 11615574. DOI: 10.3389/fped.2024.1483940.
