SMOTE for High-dimensional Class-imbalanced Data
Background: Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally, undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve on random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.
Results: While SMOTE seems beneficial in most cases with low-dimensional data, for high-dimensional data it does not attenuate the bias towards the majority class for most classifiers, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is first reduced by some type of variable selection; we explain why, without variable selection, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how our findings impact class prediction for high-dimensional data.
Conclusions: In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
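The Results statement that SMOTE leaves class-specific means roughly unchanged while shrinking variability can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: the function name smote_sketch, the sample sizes, and the standard normal data are all assumptions chosen for illustration. Each synthetic sample is a random point on the segment between a minority sample and one of its k nearest minority neighbors, so it is a convex combination of two originals; if those two points were independent with a common variance, the uniform interpolation weight alone would reduce the variance to two-thirds of the original.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    Hypothetical helper for illustration only; assumes k < number of rows.
    """
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    nn_idx = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbors per sample
    # Pick a base sample, then one of its k neighbors, uniformly at random.
    base = rng.integers(0, n, size=n_new)
    neigh = nn_idx[base, rng.integers(0, k, size=n_new)]
    # Interpolate with a uniform weight u in [0, 1).
    u = rng.random((n_new, 1))
    return X_min[base] + u * (X_min[neigh] - X_min[base])

# Assumed toy setting: 20 minority samples, 1000 variables (high-dimensional).
rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 1000))
X_syn = smote_sketch(X_min, n_new=2000, k=5, seed=2)

# Average absolute shift of the per-variable means, and ratio of average variances.
mean_shift = np.abs(X_syn.mean(axis=0) - X_min.mean(axis=0)).mean()
var_ratio = X_syn.var(axis=0).mean() / X_min.var(axis=0).mean()
print("mean shift:", mean_shift)
print("variance ratio (synthetic / original):", var_ratio)
```

In this toy setting the variance ratio comes out below 1, in line with the reduced variability described above. Because each synthetic sample shares its coordinates with two original samples, it is also correlated with them, which is the source of the between-sample correlation mentioned in the Results.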