Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE

doi:10.1016/J.INS.2019.06.007

Home
/
Papers
/
Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE

Journal Article•DOI•

Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE

Georgios Douzas¹, Fernando Bacao¹•Institutions (1)

Universidade Nova de Lisboa¹

01 Oct 2019-Information Sciences (Elsevier)-Vol. 501, pp 118-135

TL;DR: Geometric SMOTE (G-SMOTE) is proposed as a enhancement of the SMOTE data generation mechanism and empirical results show a significant improvement in the quality of the generated data when G- SMOTE is used as an oversampling algorithm.

read less

About: This article is published in Information Sciences.The article was published on 2019-10-01. It has received 102 citations till now.

...read moreread less

Citations

PDF

Open Access

More filters

Journal Article•DOI•

RCSMOTE: Range-Controlled synthetic minority over-sampling technique for handling the class imbalance problem

[...]

Paria Soltanzadeh¹, Mahdi Hashemzadeh¹•Institutions (1)

Azarbaijan Shahid Madani University¹

04 Jan 2021-Information Sciences

TL;DR: An improved SMOTE-based method, namely Range-Controlled SMOTE (RCSMOTE), which targets all three problems simultaneously and overcomes the above-mentioned problems of SMOTE.

...read moreread less

68 citations

Journal Article•DOI•

LoRAS: an oversampling approach for imbalanced datasets

[...]

Saptarshi Bej¹, Narek Davtyan¹, Markus Wolfien¹, Mariam Nassar¹, Olaf Wolkenhauer¹ - Show less +1 more•Institutions (1)

University of Rostock¹

01 Feb 2021-Machine Learning

TL;DR: To explain the success of the algorithm, a mathematical framework is constructed to prove that LoRAS oversampling technique provides a better estimate for the mean of the underlying local data distribution of the minority class data space.

...read moreread less

Abstract: The Synthetic Minority Oversampling TEchnique (SMOTE) is widely-used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class, and effecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our algorithm with 14 publicly available imbalanced datasets using three different Machine Learning (ML) algorithms and compared the performance of LoRAS, SMOTE and several SMOTE extensions that share the concept of using convex combinations of minority class data points for oversampling with LoRAS. We observed that LoRAS, on average generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the extensions of SMOTE we have tested, improve the F1-Score with respect to SMOTE on an average, they compromise on the Balanced accuracy of a classification model. LoRAS on the contrary, improves both F1 Score and the Balanced accuracy thus produces better classification models. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that LoRAS oversampling technique provides a better estimate for the mean of the underlying local data distribution of the minority class data space.

...read moreread less

67 citations

Journal Article•DOI•

RSMOTE: A self-adaptive robust SMOTE for imbalanced problems with label noise

[...]

Baiyun Chen¹, Shuyin Xia¹, Zizhong Chen², Binggui Wang¹, Guoyin Wang¹ - Show less +1 more•Institutions (2)

Chongqing University of Posts and Telecommunications¹, University of California, Riverside²

01 Apr 2021-Information Sciences

TL;DR: In RSMOTE, relative density has been introduced to measure the local density of every minority sample, and the non-noisy minority samples are divided into the borderline samples and safe samples adaptively basing their distinguishing characteristics of relative density.

...read moreread less

65 citations

Posted Content•

DeepSMOTE: Fusing Deep Learning and SMOTE for Imbalanced Data.

[...]

Damien Dablain¹, Bartosz Krawczyk¹, Nitesh V. Chawla²•Institutions (2)

University of Notre Dame¹, Virginia Commonwealth University²

05 May 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: DeepSMOTE as discussed by the authors is a novel oversampling algorithm for deep learning models, which consists of three major components: an encoder/decoder framework, SMOTE-based over-sampling, and a dedicated loss function that is enhanced with a penalty term.

...read moreread less

Abstract: Despite over two decades of progress, imbalanced data is still considered a significant challenge for contemporary machine learning models. Modern advances in deep learning have magnified the importance of the imbalanced data problem. The two main approaches to address this issue are based on loss function modifications and instance resampling. Instance sampling is typically based on Generative Adversarial Networks (GANs), which may suffer from mode collapse. Therefore, there is a need for an oversampling method that is specifically tailored to deep learning models, can work on raw images while preserving their properties, and is capable of generating high quality, artificial images that can enhance minority classes and balance the training set. We propose DeepSMOTE - a novel oversampling algorithm for deep learning models. It is simple, yet effective in its design. It consists of three major components: (i) an encoder/decoder framework; (ii) SMOTE-based oversampling; and (iii) a dedicated loss function that is enhanced with a penalty term. An important advantage of DeepSMOTE over GAN-based oversampling is that DeepSMOTE does not require a discriminator, and it generates high-quality artificial images that are both information-rich and suitable for visual inspection. DeepSMOTE code is publicly available at: this https URL

...read moreread less

52 citations

Journal Article•DOI•

Multi-class imbalanced big data classification on Spark

[...]

William C. Sleeman¹, Bartosz Krawczyk¹•Institutions (1)

Virginia Commonwealth University¹

05 Jan 2021-Knowledge Based Systems

TL;DR: This paper proposes the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data, and proposes an efficient implementation of the discussed algorithm on Apache Spark.

...read moreread less

Abstract: Despite more than two decades of progress, learning from imbalanced data is still considered as one of the contemporary challenges in machine learning. This has been further complicated by the advent of the big data era, where popular algorithms dedicated to alleviating the class skew impact are no longer feasible due to the volume of datasets. Additionally, most of existing algorithms focus on binary imbalanced problems, where majority and minority classes are well-defined. Multi-class imbalanced data poses further challenges as the relationship between classes is much more complex and simple decomposition into a number of binary problems leads to a significant loss of information. In this paper, we propose the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data. We propose to analyze the instance-level difficulties in each class, leading to understanding what causes learning difficulties. We embed this information in popular resampling algorithms which allows for informative balancing of multiple classes. We propose an efficient implementation of the discussed algorithm on Apache Spark, including a novel version of SMOTE that overcomes spatial limitations in distributed environments of its predecessor. Extensive experimental study shows that using instance-level information significantly improves learning from multi-class imbalanced big data. Our framework can be downloaded from https://github.com/fsleeman/minority-type-imbalanced .

...read moreread less

50 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

Collapse

References

PDF

Open Access

More filters

Journal Article•

Scikit-learn: Machine Learning in Python

[...]

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel¹, Peter Prettenhofer², Ron Weiss³, Vincent Dubourg, Jake Vanderplas⁴, Alexandre Passos⁵, David Cournapeau, Matthieu Brucher⁶, Matthieu Perrot, Edouard Duchesnay - Show less +12 more•Institutions (6)

Kobe University¹, Bauhaus University, Weimar², Google³, University of Washington⁴, University of Massachusetts Amherst⁵, Total S.A.⁶

01 Feb 2011-Journal of Machine Learning Research

TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.

...read moreread less

Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

...read moreread less

47,974 citations

Posted Content•

Scikit-learn: Machine Learning in Python

[...]

Fabian Pedregosa¹, Gaël Varoquaux¹, Alexandre Gramfort¹, Vincent Michel¹, Bertrand Thirion¹, Olivier Grisel, Mathieu Blondel, Andreas Müller², Joel Nothman, Gilles Louppe², Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Edouard Duchesnay - Show less +15 more•Institutions (2)

French Institute for Research in Computer Science and Automation¹, University of Liège²

02 Jan 2012-arXiv: Learning

TL;DR: Scikit-learn as mentioned in this paper is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.

...read moreread less

28,898 citations

Book•

Generalized Linear Models

[...]

Peter McCullagh¹, John A. Nelder•Institutions (1)

Imperial College London¹

01 Jan 1983

TL;DR: In this paper, a generalization of the analysis of variance is given for these models using log- likelihoods, illustrated by examples relating to four distributions; the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables), and gamma (variance components).

...read moreread less

Abstract: The technique of iterative weighted linear regression can be used to obtain maximum likelihood estimates of the parameters with observations distributed according to some exponential family and systematic effects that can be made linear by a suitable transformation. A generalization of the analysis of variance is given for these models using log- likelihoods. These generalized linear models are illustrated by examples relating to four distributions; the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables) and gamma (variance components).

...read moreread less

23,215 citations

Journal Article•DOI•

Greedy function approximation: A gradient boosting machine.

[...]

Jerome H. Friedman¹•Institutions (1)

Stanford University¹

01 Oct 2001-Annals of Statistics

TL;DR: A general gradient descent boosting paradigm is developed for additive expansions based on any fitting criterion, and specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification.

...read moreread less

Abstract: Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent “boosting” paradigm is developed for additive expansions based on any fitting criterion.Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such “TreeBoost” models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Shapire and Friedman, Hastie and Tibshirani are discussed.

...read moreread less

17,764 citations

UCI Machine Learning Repository

[...]

A. Asuncion

01 Jan 2007

17,341 citations