The Unsupervised Feature Selection Algorithms Based on Standard Deviation and Cosine Similarity for Genomic Data Analysis

Overview
Journal Front Genet
Date 2021 May 31
PMID 34054930
Citations 4
Abstract

Genomic data are difficult to analyze because they combine tens of thousands of dimensions with a small number of examples and an unbalanced distribution of examples between classes. To tackle these challenges, this paper proposes an unsupervised feature selection technique based on standard deviation and cosine similarity, referred to as SCFS (Standard deviation and Cosine similarity based Feature Selection). SCFS defines the discernibility of a feature to measure its capability to distinguish between classes, and the independence of a feature to measure its redundancy with respect to other features. A 2-dimensional space is constructed with discernibility as the x-axis and independence as the y-axis to represent all features, so that features in the upper right corner have both comparatively high discernibility and high independence. The importance of a feature is defined as the product of its discernibility and its independence (i.e., the area of the rectangle enclosed by the feature's coordinate lines and the axes). The upper right corner features are therefore the most important and comprise the optimal feature subset. Three feature selection algorithms are derived from SCFS, differing in how independence is defined from cosine similarity: SCEFS (Standard deviation and Exponent Cosine similarity based Feature Selection), SCRFS (Standard deviation and Reciprocal Cosine similarity based Feature Selection) and SCAFS (Standard deviation and Anti-Cosine similarity based Feature Selection). KNN and SVM classifiers are built on the optimal feature subsets detected by each of these algorithms. Experimental results on 18 genomic datasets of cancers demonstrate that SCEFS, SCRFS and SCAFS can detect stable biomarkers with strong classification capability, which supports the effectiveness of the proposed idea. Functional analysis of these biomarkers shows that the occurrence of cancer is closely related to the regulation level of the biomarker genes. This finding will benefit cancer pathology research, drug development, early diagnosis, treatment and prevention.
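
The scoring idea in the abstract can be illustrated with a minimal Python sketch. The abstract states only that discernibility is based on standard deviation, that independence is based on cosine similarity, and that importance is their product. Everything else below is an assumption made for illustration: the function names `scfs_scores` and `select_top_k`, the three transforms (guessed from the names "exponent", "reciprocal" and "anti-cosine"), and the rule that a feature's independence is its smallest dissimilarity to any feature with higher discernibility. The paper's actual definitions may differ.

```python
import numpy as np

def scfs_scores(X, kind="exponent", eps=1e-12):
    """Score features of X (n_samples x n_features).

    Returns (discernibility, independence, importance).
    The independence rule and the transforms are assumptions, not the
    paper's stated formulas.
    """
    X = np.asarray(X, dtype=float)
    # Discernibility: sample standard deviation of each feature column.
    dis = X.std(axis=0, ddof=1)
    # Absolute cosine similarity between every pair of feature columns.
    unit = X / np.maximum(np.linalg.norm(X, axis=0), eps)
    cos = np.clip(np.abs(unit.T @ unit), 0.0, 1.0)
    # Assumed dissimilarity transforms, guessed from the algorithm names.
    if kind == "exponent":
        dissim = np.exp(-cos)
    elif kind == "reciprocal":
        dissim = 1.0 / np.maximum(cos, eps)  # eps guards division by zero
    elif kind == "anti":
        dissim = np.arccos(cos)
    else:
        raise ValueError(f"unknown kind: {kind}")
    order = np.argsort(-dis, kind="stable")  # most discernible first
    ind = np.empty(X.shape[1])
    for rank, i in enumerate(order):
        if rank == 0:
            # No feature outranks the top one: use its largest
            # dissimilarity to any other feature (assumed convention).
            others = np.delete(dissim[i], i)
            ind[i] = others.max() if others.size else 1.0
        else:
            # Smallest dissimilarity to any more discernible feature.
            ind[i] = dissim[i, order[:rank]].min()
    # Importance: product of discernibility and independence.
    return dis, ind, dis * ind

def select_top_k(X, k, kind="exponent"):
    """Indices of the k features with the largest importance."""
    _, _, score = scfs_scores(X, kind)
    return np.argsort(-score, kind="stable")[:k]
```

Under these assumptions, a feature that is a scaled copy of a more variable feature receives low independence and therefore low importance, while a high-variance feature that is orthogonal to the others lands in the "upper right corner" and is selected. The selected column indices could then be passed to any classifier, such as the KNN and SVM models mentioned in the abstract.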

Citing Articles

Adoption of K-means clustering algorithm in smart city security analysis and mythical experience analysis of urban image.

Han H PLoS One. 2025; 20(3):e0319620.

PMID: 40063658 PMC: 11892831. DOI: 10.1371/journal.pone.0319620.


MDFGNN-SMMA: prediction of potential small molecule-miRNA associations based on multi-source data fusion and graph neural networks.

Li J, Zhang X, Li B, Li Z, Chen Z BMC Bioinformatics. 2025; 26(1):13.

PMID: 39806287 PMC: 11730471. DOI: 10.1186/s12859-025-06040-4.


MIFAM-DTI: a drug-target interactions predicting model based on multi-source information fusion and attention mechanism.

Li J, Sun L, Liu L, Li Z Front Genet. 2024; 15:1381997.

PMID: 38770418 PMC: 11102998. DOI: 10.3389/fgene.2024.1381997.


Identification of survival-associated biomarkers based on three datasets by bioinformatics analysis in gastric cancer.

Yin L, Yuan H, Liu J, Xu X, Wang W, Bai X World J Clin Cases. 2023; 11(20):4763-4787.

PMID: 37584004 PMC: 10424043. DOI: 10.12998/wjcc.v11.i20.4763.


Automated Dashboards for the Identification of Pathogenic Circulating Tumor DNA Mutations in Longitudinal Blood Draws of Cancer Patients.

Udalov A, Kumar L, Gaudette A, Zhang R, Salomao J, Saigal S Methods Protoc. 2023; 6(3).

PMID: 37218906 PMC: 10204543. DOI: 10.3390/mps6030046.

