
Integrated Theory- and Data-driven Feature Selection in Gene Expression Data Analysis

Overview
Date 2018 Feb 10
PMID 29422764
Citations 8
Abstract

The exponential growth of high-dimensional biological data has led to a rapid increase in demand for automated approaches to knowledge production. Existing methods rely on two general approaches to address this challenge: 1) the theory-driven approach, which utilizes previously accumulated knowledge, and 2) the data-driven approach, which utilizes only the data to deduce scientific knowledge. Used alone, each approach is biased toward either past or present knowledge, because neither incorporates all of the knowledge currently available for making new discoveries. In this paper, we show how an integrated method can effectively address the high dimensionality of big biological data, which is a major problem for purely data-driven analysis approaches. We realize our approach in a novel two-step analytical workflow: the first step applies a new feature selection paradigm to high-throughput gene expression data, and the second step uses graphical causal modeling to extract causal relationships automatically. Our results on real-world clinical datasets from The Cancer Genome Atlas (TCGA) demonstrate that our method is capable of intelligently selecting genes for learning effective causal networks.
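The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the selection paradigm or the causal-modeling algorithm, so the sketch assumes a simple integration rule (keep genes named in a prior-knowledge list, then add the top-k remaining genes by variance) and substitutes a correlation-thresholded undirected graph for the causal-modeling step. A correlation graph is not a causal graph; it only marks where a causal discovery method would be plugged in. The function names, gene labels, and expression values are all hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def select_genes(expr, prior_genes, k):
    """Step 1 (assumed rule): prior-knowledge genes plus top-k genes by variance.

    expr maps a gene name to its expression values across samples.
    """
    # Theory-driven part: keep prior-list genes that are present in the data.
    from_prior = [g for g in prior_genes if g in expr]
    kept = set(from_prior)
    # Data-driven part: rank the remaining genes by population variance.
    rest = [g for g in expr if g not in kept]
    ranked = sorted(rest, key=lambda g: pvariance(expr[g]), reverse=True)
    return from_prior + ranked[:k]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_edges(expr, genes, threshold):
    """Step 2 stand-in: undirected edges where |correlation| >= threshold."""
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(expr[g], expr[h])) >= threshold:
                edges.append((g, h))
    return edges


# Toy expression matrix (hypothetical genes, four samples each).
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.1, 5.9, 8.0],
    "g3": [5.0, 5.1, 4.9, 5.0],
    "g4": [9.0, 1.0, 8.0, 2.0],
    "g5": [3.0, 3.0, 3.1, 3.0],
}
prior = ["g1", "g9"]  # "g9" is absent from the data and is ignored

selected = select_genes(expr, prior, k=2)
edges = correlation_edges(expr, selected, threshold=0.9)
print(selected)
print(edges)
```

In this toy example, g1 is retained because it appears in the prior list even though two other genes have higher variance, which is the kind of integration between prior knowledge and data that the abstract argues for. Reducing the gene set before graph learning also keeps the number of candidate edges manageable.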

Citing Articles

Review of feature selection approaches based on grouping of features.

Kuzudisli C, Bakir-Gungor B, Bulut N, Qaqish B, Yousef M. PeerJ. 2023; 11:e15666.

PMID: 37483989 PMC: 10358338. DOI: 10.7717/peerj.15666.


Applying a GAN-based classifier to improve transcriptome-based prognostication in breast cancer.

Gutta C, Morhard C, Rehm M. PLoS Comput Biol. 2023; 19(4):e1011035.

PMID: 37011102 PMC: 10101642. DOI: 10.1371/journal.pcbi.1011035.


Trust in the scientific research community predicts intent to comply with COVID-19 prevention measures: An analysis of a large-scale international survey dataset.

Han H. Epidemiol Infect. 2022; 150:e36.

PMID: 35131001 PMC: 8886075. DOI: 10.1017/S0950268822000255.


CogNet: classification of gene expression data based on ranked active-subnetwork-oriented KEGG pathway enrichment analysis.

Yousef M, Ulgen E, Sezerman O. PeerJ Comput Sci. 2021; 7:e336.

PMID: 33816987 PMC: 7959595. DOI: 10.7717/peerj-cs.336.


Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data.

Yousef M, Kumar A, Bakir-Gungor B. Entropy (Basel). 2020; 23(1).

PMID: 33374969 PMC: 7821996. DOI: 10.3390/e23010002.

