
Fast and Scalable Feature Selection for Gene Expression Data Using Hilbert-Schmidt Independence Criterion

Overview
Specialty Biology
Date 2017 Feb 10
PMID 28182548
Citations 6
Abstract

Goal: In computational biology, selecting a small subset of informative genes from microarray data remains challenging because each dataset contains thousands of genes. This paper aims to quantify the dependence between gene expression data and the response variables, and to identify a subset of the most informative genes using a fast and scalable multivariate algorithm.

Methods: A novel algorithm for feature selection from gene expression data was developed. The algorithm was based on the Hilbert-Schmidt independence criterion (HSIC), and was partly motivated by singular value decomposition (SVD).
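For orientation, the widely used empirical HSIC estimator is tr(KHLH)/(n-1)^2, where K and L are kernel matrices over the samples and H = I - (1/n)11^T is the centering matrix. The sketch below is not the paper's algorithm (the abstract does not give its details); it is a minimal univariate ranking that assumes a linear kernel on each gene and a delta kernel on class labels. The function name hsic_scores and the synthetic data are hypothetical.

```python
import numpy as np

def hsic_scores(X, L):
    """Per-gene empirical HSIC, assuming a linear kernel on each gene.

    X: (n_samples, n_genes) expression matrix.
    L: (n_samples, n_samples) kernel matrix on the response.
    For gene j with K = x_j x_j^T, tr(K H L H) reduces to x_j^T H L H x_j.
    """
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Xc = H @ X                            # center every gene across samples
    # (H x_j)^T L (H x_j) for all genes at once
    return np.einsum("ij,ij->j", Xc, L @ Xc) / (n - 1) ** 2

# Synthetic illustration (assumed data, not from the paper):
rng = np.random.default_rng(0)
n, p = 60, 200
y = rng.integers(0, 2, size=n)                  # two-class response
X = rng.normal(size=(n, p))
X[:, 0] += 3.0 * y                              # gene 0 depends on the class
L = (y[:, None] == y[None, :]).astype(float)    # delta kernel on labels

scores = hsic_scores(X, L)
top = np.argsort(scores)[::-1][:5]              # indices of the 5 highest-scoring genes
```

Ranking genes one at a time ignores interactions between genes; the abstract describes the proposed method as multivariate, so this sketch only illustrates the dependence measure itself.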

Results: The algorithm is computationally fast and scalable to large datasets. Moreover, it can be applied to problems with any type of response variable, including biclass, multiclass, and continuous responses. The performance of the proposed algorithm in terms of accuracy, stability of the selected genes, speed, and scalability was evaluated using both synthetic and real-world datasets. The simulation results demonstrated that the proposed algorithm effectively and efficiently extracted stable genes with high predictive capability, in particular for datasets with multiclass response variables.
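A kernel-based dependence measure such as HSIC can accommodate different response types simply by changing the kernel placed on the response. The abstract does not state which kernels the authors used, so the following is only a sketch under common assumptions: a delta kernel for biclass or multiclass labels and a Gaussian kernel (bandwidth from the median heuristic) for continuous responses. The helper response_kernel is a hypothetical name.

```python
import numpy as np

def response_kernel(y, kind):
    """Build an (n, n) kernel matrix on the response (illustrative choices only)."""
    y = np.asarray(y)
    if kind == "categorical":
        # Delta kernel: 1 when two samples share a label, else 0.
        return (y[:, None] == y[None, :]).astype(float)
    if kind == "continuous":
        y = y.astype(float)
        d2 = (y[:, None] - y[None, :]) ** 2
        pos = d2[d2 > 0]
        # Median heuristic for the bandwidth; fall back to 1.0 if y is constant.
        sigma2 = np.median(pos) if pos.size else 1.0
        return np.exp(-d2 / (2.0 * sigma2))
    raise ValueError(f"unknown response kind: {kind!r}")

L_class = response_kernel([0, 1, 1, 2], "categorical")
L_cont = response_kernel([0.0, 1.0, 3.0], "continuous")
```

The same scoring code can then be reused for all three response types, since only the matrix L changes.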

Conclusion/significance: The proposed method does not require the whole microarray dataset to be stored in memory, and thus can easily be scaled to large datasets. This capability is an important attribute in big data analytics, where data can be large and massively distributed.
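One way an HSIC-style score can be computed without holding the full expression matrix in memory is sketched below. It is an illustration under stated assumptions, not the authors' procedure: a linear kernel on each gene and a response kernel that factors as L = Z Z^T (for example, Z a one-hot label matrix). Under those assumptions x^T H L H x = ||(HZ)^T x||^2, so only the small matrix X^T (HZ) needs to be accumulated over blocks of samples. The name streamed_hsic_scores and the block layout are hypothetical.

```python
import numpy as np

def streamed_hsic_scores(x_blocks, Z):
    """Per-gene HSIC scores accumulated over blocks of samples.

    x_blocks: iterable of (n_block, n_genes) arrays, in the same sample order as Z.
    Z: (n_samples, c) response feature matrix with L = Z @ Z.T (assumed small).
    """
    Z = np.asarray(Z, dtype=float)
    n = Z.shape[0]
    Zc = Z - Z.mean(axis=0)          # equals H @ Z; needs only the response
    M = None                         # running (n_genes, c) accumulator for X^T H Z
    start = 0
    for block in x_blocks:
        stop = start + block.shape[0]
        part = block.T @ Zc[start:stop]
        M = part if M is None else M + part
        start = stop
    if start != n:
        raise ValueError("blocks do not cover exactly the samples in Z")
    return (M ** 2).sum(axis=1) / (n - 1) ** 2

# Usage on assumed synthetic data, read seven samples at a time:
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 12))
y = rng.integers(0, 3, size=30)
Z = np.eye(3)[y]                                   # one-hot labels
blocks = (X[i:i + 7] for i in range(0, 30, 7))
s = streamed_hsic_scores(blocks, Z)
```

Because each block contributes an independent partial sum, the same accumulation could in principle run on separate machines and be combined afterwards, which matches the distributed setting the conclusion mentions.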

Citing Articles

A review on advancements in feature selection and feature extraction for high-dimensional NGS data analysis.

Borah K, Shekhar Das H, Seth S, Mallick K, Rahaman Z, Mallik S. Funct Integr Genomics. 2024; 24(5):139.

PMID: 39158621 DOI: 10.1007/s10142-024-01415-x.


A feature selection method based on the Golden Jackal-Grey Wolf Hybrid Optimization Algorithm.

Liu G, Guo Z, Liu W, Jiang F, Fu E. PLoS One. 2024; 19(1):e0295579.

PMID: 38165924 PMC: 10760777. DOI: 10.1371/journal.pone.0295579.


Applying causal discovery to single-cell analyses using CausalCell.

Wen Y, Huang J, Guo S, Elyahu Y, Monsonego A, Zhang H. Elife. 2023; 12.

PMID: 37129360 PMC: 10229139. DOI: 10.7554/eLife.81464.


The amniotic fluid cell-free transcriptome in spontaneous preterm labor.

Bhatti G, Romero R, Gomez-Lopez N, Pique-Regi R, Pacora P, Jung E. Sci Rep. 2021; 11(1):13481.

PMID: 34188072 PMC: 8242007. DOI: 10.1038/s41598-021-92439-x.


Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions.

Mahendran N, Vincent P, Srinivasan K, Chang C. Front Genet. 2020; 11:603808.

PMID: 33362861 PMC: 7758324. DOI: 10.3389/fgene.2020.603808.