SMOTE for High-dimensional Class-imbalanced Data

Overview
Publisher BioMed Central
Specialty Biology
Date 2013 Mar 26
PMID 23522326
Citations 218
Abstract

Background: Classifiers trained on class-imbalanced data are biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, both of which produce class-balanced data. Undersampling is generally helpful, while random oversampling is not. The Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed as an improvement on random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and an empirical point of view, using simulated and real high-dimensional data.
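
As an illustration of the two resampling strategies mentioned in the Background, the following is a minimal NumPy sketch of random undersampling and random oversampling. It is not code from the paper; the function names, the simulated data, and the 80/20 class split are assumptions made for the example.

```python
import numpy as np

def random_undersample(X, y, rng):
    """Keep a random subset of each class so all classes match the smallest one."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def random_oversample(X, y, rng):
    """Add random duplicates to each class so all classes match the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=n_max - idx.size, replace=True)
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]

# Hypothetical high-dimensional data: 100 samples, 1000 variables, 80/20 imbalance.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
y = np.array([0] * 80 + [1] * 20)

X_under, y_under = random_undersample(X, y, rng)
X_over, y_over = random_oversample(X, y, rng)
print(np.bincount(y_under), np.bincount(y_over))
```

Undersampling discards majority-class samples, while random oversampling only repeats existing minority-class samples and so adds no new information.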

Results: While SMOTE seems beneficial in most cases with low-dimensional data, it does not attenuate the bias towards classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is first reduced by some type of variable selection; we explain why, without variable selection, k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases data variability and introduces correlation between samples. We explain how our findings affect class prediction for high-dimensional data.
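
The effect on means and variability can be checked numerically. The sketch below is a simplified NumPy version of SMOTE, not the authors' implementation: each synthetic sample is a random point on the segment between a minority-class sample and one of its k nearest minority-class neighbors. The function name `smote`, the simulated data, and the choices k = 5, 30 minority samples, and 500 variables are assumptions for illustration. Under the simplifying assumption that a sample and its neighbor are independent with equal variance, the interpolated value has 2/3 of the original variance, which is the kind of reduction the simulation should show.

```python
import numpy as np

def smote(X_min, n_new, k, rng):
    """Simplified SMOTE: interpolate between a minority sample and one of its
    k nearest minority-class neighbors (Euclidean distance)."""
    diff = X_min[:, None, :] - X_min[None, :, :]
    d2 = (diff ** 2).sum(axis=2)            # pairwise squared distances
    np.fill_diagonal(d2, np.inf)            # a sample is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]      # indices of the k nearest neighbors
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    u = rng.random((n_new, 1))              # interpolation weight in [0, 1)
    return X_min[base] + u * (X_min[neigh] - X_min[base])

# Hypothetical minority class: 30 samples, 500 independent standard-normal variables.
rng = np.random.default_rng(0)
X_min = rng.normal(size=(30, 500))
X_syn = smote(X_min, n_new=2000, k=5, rng=rng)

# Compare overall means and the average per-variable variance.
mean_gap = abs(X_syn.mean() - X_min.mean())
var_ratio = X_syn.var(axis=0).mean() / X_min.var(axis=0, ddof=1).mean()
print(f"gap between overall means: {mean_gap:.3f}")
print(f"synthetic / original variance: {var_ratio:.2f}")
```

Because every synthetic sample is built from two original samples, synthetic samples that share a parent are also correlated with each other and with that parent.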

Conclusions: In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from SMOTE, provided that variable selection is performed before SMOTE is applied; the benefit is larger if more neighbors are used. SMOTE should not be used for k-NN without variable selection, because it strongly biases the classification towards the minority class.
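
The recommended order of operations (variable selection first, then SMOTE, then Euclidean k-NN) can be sketched as follows. This is an illustrative NumPy pipeline under stated assumptions, not the procedure used in the paper: variables are ranked by a Welch t-statistic, and the helper names, the simulated data, the 40 retained variables, and k = 5 are all hypothetical choices.

```python
import numpy as np

def select_variables(X, y, n_keep):
    """Return indices of the n_keep variables with the largest |Welch t-statistic|."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(-np.abs(t))[:n_keep]

def smote(X_min, n_new, k, rng):
    """Simplified SMOTE: interpolate towards one of the k nearest minority neighbors."""
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    u = rng.random((n_new, 1))
    return X_min[base] + u * (X_min[neigh] - X_min[base])

def knn_predict(X_train, y_train, X_test, k):
    """Majority vote among the k nearest training samples (Euclidean distance, k odd)."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]
    return (y_train[nn].mean(axis=1) > 0.5).astype(int)

# Hypothetical data: 1000 variables, of which the first 20 differ between classes.
rng = np.random.default_rng(1)
p, n_informative = 1000, 20
shift = np.zeros(p)
shift[:n_informative] = 1.0

def simulate(n0, n1):
    X = rng.normal(size=(n0 + n1, p))
    y = np.array([0] * n0 + [1] * n1)
    X[y == 1] += shift
    return X, y

X_train, y_train = simulate(80, 20)   # imbalanced training set
X_test, y_test = simulate(50, 50)     # balanced test set

# 1) Variable selection on the training data only.
keep = select_variables(X_train, y_train, n_keep=40)
X_tr, X_te = X_train[:, keep], X_test[:, keep]

# 2) SMOTE in the reduced space until both classes have the same size.
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
X_bal = np.vstack([X_tr, smote(X_tr[y_train == 1], n_new, k=5, rng=rng)])
y_bal = np.concatenate([y_train, np.ones(n_new, dtype=int)])

# 3) Euclidean k-NN on the balanced, reduced training set.
pred = knn_predict(X_bal, y_bal, X_te, k=5)
for c in (0, 1):
    print(f"class {c} accuracy: {(pred[y_test == c] == c).mean():.2f}")
```

Reporting accuracy separately for each class, rather than overall accuracy, makes any remaining bias towards one class visible.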

Citing Articles

Machine learning prediction of right ventricular volume and ejection fraction from two-dimensional echocardiography in patients with pulmonary regurgitation.

Duong S, Dominy C, Arivazhagan N, Barris D, Hopkins K, Stern K Int J Cardiovasc Imaging. 2025; .

PMID: 40080276 DOI: 10.1007/s10554-025-03368-z.


Risk Factors for African Swine Fever in Wild Boar in Russia: Application of Regression for Classification Algorithms.

Zakharova O, Liskova E Animals (Basel). 2025; 15(4).

PMID: 40002992 PMC: 11851450. DOI: 10.3390/ani15040510.


Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm.

Fan X, Ye R, Gao Y, Xue K, Zhang Z, Xu J Front Artif Intell. 2025; 7:1473837.

PMID: 39881882 PMC: 11776094. DOI: 10.3389/frai.2024.1473837.


Machine Learning-Based Radiomics Analysis for Identifying KRAS Mutations in Non-Small-Cell Lung Cancer from CT Images: Challenges, Insights and Implications.

Schoneck M, Rehbach N, Lotter-Becker L, Persigehl T, Lennartz S, Caldeira L Life (Basel). 2025; 15(1).

PMID: 39860023 PMC: 11766547. DOI: 10.3390/life15010083.


Research on the improvement method of imbalance of ground penetrating radar image data.

Cao L, Liu L, Lu C, Chen R Sci Rep. 2025; 15(1):2859.

PMID: 39843521 PMC: 11754838. DOI: 10.1038/s41598-025-87123-3.

