SMOTE for high-dimensional class-imbalanced data

doi:10.1186/1471-2105-14-106

Open Access Journal Article

Rok Blagus, +1 more

22 Mar 2013

BMC Bioinformatics, Vol. 14, Iss. 1, p. 106

Abstract:

Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally, undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve on random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.

While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is first reduced by some type of variable selection; we explain why, without variable selection, k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how our findings impact class prediction for high-dimensional data.

In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
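The claim that SMOTE reduces data variability can be illustrated with a small simulation. The following is a minimal NumPy sketch of the interpolation step of SMOTE as it is commonly described (each synthetic sample lies on the segment between a minority-class sample and one of its k nearest minority-class neighbors). It is not the authors' code: the function name `smote_sketch`, the Gaussian test data, and the sizes (40 samples, 500 variables, 400 synthetic samples, k = 5) are assumptions chosen only for illustration.

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Hypothetical helper: interpolate between minority samples and
    their k nearest minority-class neighbors (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    # Pairwise Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Pick a base sample and one of its k neighbors for each synthetic sample.
    base = rng.integers(0, n, size=n_synthetic)
    picked = neighbors[base, rng.integers(0, k, size=n_synthetic)]
    # Uniform(0, 1) interpolation weight, one per synthetic sample.
    u = rng.random((n_synthetic, 1))
    return X_min[base] + u * (X_min[picked] - X_min[base])

# Simulated high-dimensional minority class (illustrative sizes only).
rng = np.random.default_rng(1)
X_min = rng.normal(size=(40, 500))
X_syn = smote_sketch(X_min, n_synthetic=400, k=5)

# Average per-variable sample variance, original vs. synthetic.
var_orig = X_min.var(axis=0, ddof=1).mean()
var_syn = X_syn.var(axis=0, ddof=1).mean()
# Average absolute shift of the per-variable means.
mean_shift = np.abs(X_syn.mean(axis=0) - X_min.mean(axis=0)).mean()

print("mean per-variable variance, original: ", var_orig)
print("mean per-variable variance, synthetic:", var_syn)
print("mean absolute shift of variable means:", mean_shift)
```

Under these assumptions the synthetic samples should show a clearly smaller average variance than the original minority samples, because each one is a convex combination of two observed points. The same construction also makes synthetic samples that share a base sample or neighbor correlated with each other.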
