Fast and Scalable Feature Selection for Gene Expression Data Using Hilbert-Schmidt Independence Criterion
Goal: In computational biology, selecting a small subset of informative genes from microarray data remains a challenge because each dataset contains thousands of genes. This paper aims to quantify the dependence between gene expression data and the response variables, and to identify a subset of the most informative genes using a fast and scalable multivariate algorithm.
Methods: A novel algorithm for feature selection from gene expression data was developed. The algorithm was based on the Hilbert-Schmidt independence criterion (HSIC), and was partly motivated by singular value decomposition (SVD).
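For readers unfamiliar with HSIC, the following is a minimal sketch of the standard biased empirical estimator, HSIC = tr(KHLH) / (n - 1)^2, where K and L are kernel matrices over the features and the responses and H is the centering matrix. It is not the algorithm proposed in the paper: it scores each gene separately rather than multivariately, and the function names, the Gaussian kernel on expression values, and the delta kernel on class labels are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(x, sigma=1.0):
    """Gaussian kernel matrix for one gene (kernel choice is an assumption)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def delta_kernel(y):
    """Kernel on class labels: 1 if two samples share a label, else 0."""
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)


def hsic(K, L):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2


def hsic_scores(X, y, sigma=1.0):
    """Hypothetical helper: one HSIC score per gene (column of X)."""
    L = delta_kernel(y)
    return np.array(
        [hsic(rbf_kernel(X[:, j], sigma), L) for j in range(X.shape[1])]
    )
```

Genes with larger scores show stronger statistical dependence on the response. For a continuous response, the delta kernel could be swapped for a Gaussian kernel on the response values.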
Results: The algorithm is computationally fast and scalable to large datasets. Moreover, it can be applied to problems with any type of response variable, including biclass, multiclass, and continuous responses. The performance of the proposed algorithm in terms of accuracy, stability of the selected genes, speed, and scalability was evaluated using both synthetic and real-world datasets. The results demonstrated that the proposed algorithm effectively and efficiently extracted stable genes with high predictive capability, in particular for datasets with multiclass response variables.
Conclusion/significance: The proposed method does not require the whole microarray dataset to be stored in memory, and thus can easily be scaled to large datasets. This capability is an important attribute in big data analytics, where data can be large and massively distributed.
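One way an HSIC-based score can avoid holding the full expression matrix in memory is sketched below. This is an illustrative assumption and not the paper's procedure: with a linear kernel on a single gene x, tr(KHLH) reduces to x^T (HLH) x, so only the n-by-n centered response kernel needs to stay resident while blocks of genes are streamed through. The function names and the chunking scheme are hypothetical.

```python
import numpy as np


def centered_response_kernel(y):
    """H L H for a delta kernel on labels; size depends only on sample count."""
    y = np.asarray(y)
    n = len(y)
    L = (y[:, None] == y[None, :]).astype(float)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ L @ H


def chunked_linear_hsic(chunks, y):
    """Per-gene linear-kernel HSIC, computed one block of genes at a time.

    `chunks` is any iterable yielding arrays of shape (n_samples, n_genes_in_chunk),
    for example blocks read from disk or from different machines.
    """
    M = centered_response_kernel(y)
    n = M.shape[0]
    scores = []
    for block in chunks:
        block = np.asarray(block, dtype=float)
        # x^T M x for every column x of the block, without forming n-by-n kernels.
        scores.append(np.einsum("ij,ij->j", block, M @ block) / (n - 1) ** 2)
    return np.concatenate(scores)
```

Because each block is processed independently, the blocks could also be scored on separate workers and the results concatenated afterwards.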