A Framework for Feature Selection in Clustering

Overview

Journal: J Am Stat Assoc

Specialty: Public Health

Date: 2010 Sep 3

PMID: 20811510

Citations: 138

Authors

Daniela M Witten

Robert Tibshirani

Abstract

We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated data and on genomic data sets.
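The abstract does not spell out the criterion, so the following is only a rough sketch of how a lasso-type penalty can drive feature selection in K-means. It assumes the criterion is a feature-weighted between-cluster sum of squares, maximised over the cluster assignment and over non-negative feature weights `w` subject to a unit L2 norm and an L1 bound `s`. The function names (`kmeans`, `bcss_per_feature`, `update_weights`, `sparse_kmeans`), the tuning parameter `s`, the farthest-point initialisation, and the fixed iteration counts are illustrative assumptions, not taken from the abstract.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [X[0]]
    for _ in range(1, k):
        # distance of every point to its nearest already-chosen center
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def bcss_per_feature(X, labels):
    """Between-cluster sum of squares, computed separately for each feature."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return np.maximum(total - within, 0.0)  # clip tiny negative round-off

def update_weights(a, s, n_steps=50):
    """Soft-threshold a, rescale to unit L2 norm, and pick the threshold
    by bisection so that the L1 norm of the weights is at most s (s >= 1)."""
    w = a / np.linalg.norm(a)
    if w.sum() <= s:  # no thresholding needed
        return w
    lo, hi, best = 0.0, a.max(), None
    for _ in range(n_steps):
        delta = (lo + hi) / 2
        t = np.maximum(a - delta, 0.0)
        w = t / np.linalg.norm(t)
        if w.sum() > s:
            lo = delta
        else:
            hi, best = delta, w
    return best if best is not None else w

def sparse_kmeans(X, k, s, n_outer=10):
    """Alternate between clustering on reweighted features and updating w."""
    p = X.shape[1]
    w = np.full(p, 1.0 / np.sqrt(p))
    for _ in range(n_outer):
        # scaling feature j by sqrt(w_j) weights its squared distances by w_j
        labels = kmeans(X * np.sqrt(w), k)
        w = update_weights(bcss_per_feature(X, labels), s)
    return labels, w
```

In this sketch a smaller `s` forces more weights to exactly zero, so the same objective determines both which features are used and how the observations are grouped. Features with zero weight play no role in the final clustering.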

Citing Articles

Exploring BIRC family genes as prognostic biomarkers and therapeutic targets in prostate cancer.

Yu X, Liu Y, Mo Z, Luo R, Chen W. Discov Oncol. 2025; 16(1):240.

PMID: 40009266 PMC: 11865399. DOI: 10.1007/s12672-025-02002-7.

An unsupervised learning approach for clustering joint trajectories of Alzheimer's disease biomarkers: An application to ADNI Data.

Sonmez T, Harvey D, Beckett L. Alzheimers Dement. 2025; 21(2):e14524.

PMID: 39868506 PMC: 11851129. DOI: 10.1002/alz.14524.

Outcome-guided Bayesian clustering for disease subtype discovery using high-dimensional transcriptomic data.

Meng L, Huo Z. J Appl Stat. 2025; 52(1):183-207.

PMID: 39811087 PMC: 11727188. DOI: 10.1080/02664763.2024.2362275.

Sparse kernel k-means clustering.

Park B, Park C, Hong S, Choi H. J Appl Stat. 2025; 52(1):158-182.

PMID: 39811085 PMC: 11727190. DOI: 10.1080/02664763.2024.2362266.

Higher-Order Disease Interactions in Multimorbidity Measurement: Marginal Benefit Over Additive Disease Summation.

Wei M, Tseng C, Kang A. J Gerontol A Biol Sci Med Sci. 2024; 80(1.

PMID: 39565288 PMC: 11701747. DOI: 10.1093/gerona/glae282.
