
A Framework for Feature Selection in Clustering

Overview
Journal J Am Stat Assoc
Specialty Public Health
Date 2010 Sep 3
PMID 20811510
Citations 138
Abstract

We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated data and on genomic data sets.
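The alternating scheme implied by the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. The abstract does not spell out the criterion, so the code assumes one common formulation of sparse K-means: maximize the weighted per-feature between-cluster sum of squares, with an L1 bound s (the lasso-type penalty, assumed to satisfy 1 <= s <= sqrt(p)) and an L2 bound of 1 on non-negative feature weights. All function names, the toy data, and the tuning values are hypothetical.

```python
import numpy as np


def soft_threshold(a, delta):
    """Elementwise soft-thresholding: sign(a) * max(|a| - delta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)


def update_weights(bcss, s, n_iter=40):
    """Maximize sum_j w_j * bcss_j s.t. ||w||_2 <= 1, ||w||_1 <= s, w >= 0."""
    a = np.maximum(bcss, 0.0)
    w = a / np.linalg.norm(a)
    if w.sum() <= s:  # L1 bound already satisfied, no thresholding needed
        return w
    lo, hi = 0.0, a.max()
    for _ in range(n_iter):  # bisection on the threshold so that ||w||_1 ~ s
        mid = (lo + hi) / 2
        w = soft_threshold(a, mid)
        w = w / np.linalg.norm(w)
        if w.sum() > s:
            lo = mid
        else:
            hi = mid
    w = soft_threshold(a, (lo + hi) / 2)
    return w / np.linalg.norm(w)


def kmeans(X, k, n_init=10, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random restarts; returns the best labels."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            new_centers = np.array([
                X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
                for c in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        cost = dist.min(axis=1).sum()  # cost of the last assignment step
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def bcss_per_feature(X, labels):
    """Between-cluster sum of squares per feature: total minus within-cluster."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return total - within


def sparse_kmeans(X, k, s, n_outer=10, seed=0):
    """Alternate between clustering (weights fixed) and reweighting (clusters fixed)."""
    p = X.shape[1]
    w = np.full(p, 1.0 / np.sqrt(p))  # start from equal weights
    for _ in range(n_outer):
        # Scaling by sqrt(w) makes squared distances weighted by w.
        labels = kmeans(X * np.sqrt(w), k, seed=seed)
        w_new = update_weights(bcss_per_feature(X, labels), s)
        converged = np.abs(w_new - w).sum() < 1e-4 * np.abs(w).sum()
        w = w_new
        if converged:
            break
    return labels, w


def make_toy_data(seed=0, n_per=20, p=30, p_signal=5):
    """Hypothetical data: 3 groups that differ only in the first p_signal features."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(3 * n_per, p))
    for g, shift in enumerate((-4.0, 0.0, 4.0)):
        X[g * n_per:(g + 1) * n_per, :p_signal] += shift
    return X


X = make_toy_data()
labels, w = sparse_kmeans(X, k=3, s=2.0)
print(np.round(w, 2))  # weights should concentrate on the first five features
```

In this sketch a single objective drives both steps: the clustering step improves it with the weights held fixed, and the weight step improves it with the clusters held fixed. A smaller s forces more weights to exactly zero, which is how features are selected. How s should be chosen in practice is not covered here.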

Citing Articles

Exploring BIRC family genes as prognostic biomarkers and therapeutic targets in prostate cancer.

Yu X, Liu Y, Mo Z, Luo R, Chen W Discov Oncol. 2025; 16(1):240.

PMID: 40009266 PMC: 11865399. DOI: 10.1007/s12672-025-02002-7.


An unsupervised learning approach for clustering joint trajectories of Alzheimer's disease biomarkers: An application to ADNI Data.

Sonmez T, Harvey D, Beckett L Alzheimers Dement. 2025; 21(2):e14524.

PMID: 39868506 PMC: 11851129. DOI: 10.1002/alz.14524.


Outcome-guided Bayesian clustering for disease subtype discovery using high-dimensional transcriptomic data.

Meng L, Huo Z J Appl Stat. 2025; 52(1):183-207.

PMID: 39811087 PMC: 11727188. DOI: 10.1080/02664763.2024.2362275.


Sparse kernel k-means clustering.

Park B, Park C, Hong S, Choi H J Appl Stat. 2025; 52(1):158-182.

PMID: 39811085 PMC: 11727190. DOI: 10.1080/02664763.2024.2362266.


Higher-Order Disease Interactions in Multimorbidity Measurement: Marginal Benefit Over Additive Disease Summation.

Wei M, Tseng C, Kang A J Gerontol A Biol Sci Med Sci. 2024; 80(1.

PMID: 39565288 PMC: 11701747. DOI: 10.1093/gerona/glae282.

