Benchmark Study of Feature Selection Strategies for Multi-omics Data

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2022 Oct 5

PMID 36199022

Authors

Yingxia Li

Ulrich Mansmann

Shangming Du

Roman Hornung

Abstract

Background: In the last few years, multi-omics data, that is, datasets containing different types of high-dimensional molecular variables for the same samples, have become increasingly available. To date, several comparison studies have focused on feature selection methods for omics data, but to our knowledge, none has compared these methods for the special case of multi-omics data. Given that these data have specific structures that differentiate them from single-omics data, it is unclear whether different feature selection strategies are optimal for such data. In this paper, using 15 cancer multi-omics datasets, we compared four filter methods, two embedded methods, and two wrapper methods with respect to their performance in predicting a binary outcome in several situations that may affect the prediction results. As classifiers, we used support vector machines and random forests. The methods were compared using repeated fivefold cross-validation. The accuracy, the AUC, and the Brier score served as performance metrics.
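The evaluation setup described above (repeated fivefold cross-validation scored by accuracy, AUC, and Brier score) can be illustrated with a minimal standard-library sketch. It is not the authors' code: the function names, the stratified fold assignment, the 0.5 classification threshold, and the toy labels and probabilities are all assumptions made for illustration.

```python
import random

def repeated_stratified_kfold(y, k=5, repeats=2, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` rounds of k-fold CV.

    Stratification by class label is an assumption of this sketch.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        folds = [[] for _ in range(k)]
        for label in sorted(set(y)):
            idx = [i for i, v in enumerate(y) if v == label]
            rng.shuffle(idx)
            # Deal the shuffled indices of each class round-robin into folds.
            for pos, i in enumerate(idx):
                folds[pos % k].append(i)
        for f in range(k):
            test = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield train, test

def accuracy(y_true, p_hat, threshold=0.5):
    """Share of correct class predictions after thresholding probabilities."""
    pred = [1 if p >= threshold else 0 for p in p_hat]
    return sum(int(a == b) for a, b in zip(y_true, pred)) / len(y_true)

def auc(y_true, p_hat):
    """Probability that a random positive is ranked above a random negative."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def brier(y_true, p_hat):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)

# Toy data (hypothetical): true labels and predicted class-1 probabilities.
y_demo = [0, 0, 1, 1]
p_demo = [0.1, 0.4, 0.35, 0.8]
splits = list(repeated_stratified_kfold([0] * 10 + [1] * 10, k=5, repeats=2))
```

In a benchmark of this kind, feature selection would be rerun on each training split so that the test fold never influences which features are chosen; accuracy and AUC are better when higher, whereas the Brier score is better when lower.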

Results: The results suggested that, first, the chosen number of selected features affects the predictive performance for many, but not all, feature selection methods. Second, whether the features were selected by data type or from all data types concurrently did not considerably affect the predictive performance, but for some methods, concurrent selection took more time. Third, regardless of which performance measure was considered, the feature selection methods mRMR, the permutation importance of random forests, and the Lasso tended to outperform the other considered methods. Here, mRMR and the permutation importance of random forests already delivered strong predictive performance when considering only a few selected features. Finally, the wrapper methods were computationally much more expensive than the filter and embedded methods.
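The difference between selecting features by data type and selecting from all data types concurrently can be sketched as follows. The filter score used here (a standardized absolute difference in class means) is a hypothetical stand-in and not one of the methods benchmarked in the study; the block names, feature names, and toy values are likewise invented for illustration.

```python
from statistics import mean, pstdev

def filter_score(column, y):
    """Hypothetical filter: standardized absolute difference in class means."""
    g0 = [v for v, label in zip(column, y) if label == 0]
    g1 = [v for v, label in zip(column, y) if label == 1]
    spread = pstdev(column)
    return abs(mean(g1) - mean(g0)) / spread if spread > 0 else 0.0

def select_concurrent(blocks, y, n_features):
    """Rank features from all data types together and keep the top n_features."""
    scored = [(filter_score(col, y), block, name)
              for block, feats in blocks.items()
              for name, col in feats.items()]
    scored.sort(key=lambda t: -t[0])
    return [(block, name) for _, block, name in scored[:n_features]]

def select_by_type(blocks, y, n_per_block):
    """Rank features within each data type and keep the top n_per_block of each."""
    selected = []
    for block, feats in blocks.items():
        ranked = sorted(feats, key=lambda name: -filter_score(feats[name], y))
        selected += [(block, name) for name in ranked[:n_per_block]]
    return selected

# Toy multi-omics data (hypothetical): two data types, six samples.
y = [0, 0, 0, 1, 1, 1]
blocks = {
    "rna": {"g1": [1, 2, 1, 5, 6, 5],
            "g2": [0, 1, 0, 4, 3, 4],
            "g3": [3, 1, 2, 2, 3, 1]},
    "mirna": {"m1": [2, 3, 2, 3, 2, 3],
              "m2": [1, 1, 2, 2, 2, 3]},
}

concurrent = select_concurrent(blocks, y, n_features=2)
by_type = select_by_type(blocks, y, n_per_block=1)
```

On this toy input, concurrent selection takes both features from the "rna" block because they score highest overall, while selection by data type guarantees one feature from each block.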

Conclusions: We recommend the permutation importance of random forests and the filter method mRMR for feature selection using multi-omics data, although mRMR is considerably more computationally costly.
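To give an intuition for mRMR (max-relevance, min-redundancy) and for why it can be costly, here is a minimal greedy sketch. It makes a loud simplification: mutual information, on which the criterion is defined, is replaced by the absolute Pearson correlation, and relevance and redundancy are combined by a simple difference. The function names and toy data are hypothetical, and this is not the implementation used in the study.

```python
from statistics import mean

def abs_corr(a, b):
    """Absolute Pearson correlation (assumed stand-in for mutual information)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    return abs(cov) / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0

def mrmr(features, y, n_features):
    """Greedily add the feature with the best relevance-minus-redundancy score."""
    relevance = {name: abs_corr(col, y) for name, col in features.items()}
    selected = []
    remaining = list(features)
    while remaining and len(selected) < n_features:
        def criterion(name):
            if not selected:
                return relevance[name]
            # Redundancy: mean association with the already selected features.
            redundancy = mean(abs_corr(features[name], features[s])
                              for s in selected)
            return relevance[name] - redundancy
        best = max(remaining, key=criterion)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): "a_like" nearly duplicates "a"; "b" is weaker
# on its own but less redundant with "a".
y = [0, 0, 0, 1, 1, 1]
features = {
    "a": [1, 2, 1, 5, 6, 5],
    "a_like": [1, 2, 1, 5, 6, 4],
    "b": [1, 2, 3, 2, 3, 4],
}

chosen = mrmr(features, y, n_features=2)
by_relevance = sorted(features, key=lambda n: -abs_corr(features[n], y))[:2]
```

Ranking by relevance alone keeps the near-duplicate pair, whereas the greedy criterion prefers the less redundant feature. Each step compares every remaining candidate with every selected feature, so the work grows with both the number of candidates and the number of features selected, which is one plausible source of the extra cost.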

Citing Articles

Early Diagnosis of Bloodstream Infections Using Serum Metabolomic Analysis.

Han S, Li R, Wang H, Wang L, Gao Y, Wen Y. Metabolites. 2024; 14(12).

PMID: 39728466 PMC: 11676852. DOI: 10.3390/metabo14120685.

Transforming Clinical Research: The Power of High-Throughput Omics Integration.

Vitorino R. Proteomes. 2024; 12(3).

PMID: 39311198 PMC: 11417901. DOI: 10.3390/proteomes12030025.

Exploring genomic feature selection: A comparative analysis of GWAS and machine learning algorithms in a large-scale soybean dataset.

Al-Mamun H, Danilevicz M, Marsh J, Gondro C, Edwards D. Plant Genome. 2024; 18(1):e20503.

PMID: 39253773 PMC: 11726426. DOI: 10.1002/tpg2.20503.

Does combining numerous data types in multi-omics data improve or hinder performance in survival prediction? Insights from a large-scale benchmark study.

Li Y, Herold T, Mansmann U, Hornung R. BMC Med Inform Decis Mak. 2024; 24(1):244.

PMID: 39223659 PMC: 11370316. DOI: 10.1186/s12911-024-02642-9.

Logistic PCA explains differences between genome-scale metabolic models in terms of metabolic pathways.

Zehetner L, Szeliova D, Kraus B, Bort J, Zanghellini J. PLoS Comput Biol. 2024; 20(6):e1012236.

PMID: 38913731 PMC: 11226097. DOI: 10.1371/journal.pcbi.1012236.
