Posted Content

Geometric SMOTE: Effective oversampling for imbalanced learning through a geometric extension of SMOTE.

TL;DR: This paper proposes Geometric SMOTE (G-SMOTE) as a generalization of the SMOTE data generation mechanism, and presents empirical results that show a significant improvement in the quality of the generated data when G-SMOTE is used as an oversampling algorithm.
Abstract: Classification of imbalanced datasets is a challenging task for standard algorithms. Although many methods exist to address this problem in different ways, generating artificial data for the minority class is a more general approach than algorithmic modifications. The SMOTE algorithm and its variations generate synthetic samples along a line segment that joins minority class instances. In this paper, we propose Geometric SMOTE (G-SMOTE) as a generalization of the SMOTE data generation mechanism. G-SMOTE generates synthetic samples in a geometric region of the input space around each selected minority instance. While in the basic configuration this region is a hyper-sphere, G-SMOTE allows its deformation to a hyper-spheroid and finally to a line segment, emulating, in the last case, the SMOTE mechanism. The performance of G-SMOTE is compared against multiple standard oversampling algorithms. We present empirical results that show a significant improvement in the quality of the generated data when G-SMOTE is used as an oversampling algorithm.
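The generation mechanism described above is straightforward to sketch. The following is a minimal, illustrative Python implementation of the basic hyper-sphere configuration only; the hyper-spheroid deformation and the neighbour selection strategies of the full G-SMOTE algorithm are omitted, and names such as g_smote_basic are ours, not the authors' reference implementation.

    import numpy as np

    def sample_in_hypersphere(center, radius, rng):
        """Draw a point uniformly at random from the ball of the
        given radius around `center`."""
        direction = rng.normal(size=center.shape)
        direction /= np.linalg.norm(direction)
        # Scaling the radius by U**(1/d) gives a uniform draw inside the ball.
        return center + radius * rng.uniform() ** (1 / center.size) * direction

    def g_smote_basic(X_min, n_samples, k=5, seed=0):
        """Basic (hyper-sphere) configuration: for each synthetic sample,
        pick a minority instance and sample inside the ball whose radius
        is the distance to one of its k nearest minority neighbours."""
        rng = np.random.default_rng(seed)
        out = []
        for _ in range(n_samples):
            i = rng.integers(len(X_min))
            d = np.linalg.norm(X_min - X_min[i], axis=1)
            neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
            radius = d[rng.choice(neighbours)]
            out.append(sample_in_hypersphere(X_min[i], radius, rng))
        return np.asarray(out)

Restricting the sampled region from the full ball to the segment toward the chosen neighbour recovers the SMOTE mechanism, which is exactly the deformation the abstract describes.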
Citations
Journal ArticleDOI
TL;DR: Critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy and an in-depth critical technical review of existing approaches to handle imbalanced datasets are provided.
Abstract: Class imbalance is an active research area in the machine learning community. However, existing and recent literature has shown that class overlap has an even greater negative impact on the performance of learning algorithms than class imbalance itself. This paper provides a detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment was carried out on the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handling imbalanced datasets. Existing solutions from the selected literature are critically reviewed and categorised as class distribution-based and class overlap-based methods. Emerging techniques and the latest developments in this area are also discussed in detail. Experimental results in this paper are consistent with the existing literature and show clearly that the performance of the learning algorithm deteriorates across varying degrees of class overlap, whereas class imbalance does not always have an effect. The review emphasises the need for further research towards handling class overlap in imbalanced datasets to effectively improve learning algorithms' performance.

93 citations

Journal ArticleDOI
TL;DR: The proposed approach, incorporating both growth and environmental parameters of different crop periods, could distinguish wheat powdery mildew and aphids at the regional scale, and the combination of SMOTE and BPNN could effectively improve the discrimination accuracy of the minor disease or pest.
Abstract: Monitoring and discriminating co-epidemic diseases and pests at regional scales are of practical importance in guiding differential treatment. A combination of vegetation and environmental parameters could improve the accuracy of discriminating crop diseases and pests. Different diseases and pests can cause similar stresses and symptoms during the same crop growth period, so combining growth period information can be useful for discerning different changes in crop diseases and pests. Additionally, imbalanced data often has detrimental effects on the performance of image classification. In this study, we developed an approach for discriminating crop diseases and pests based on bi-temporal Landsat-8 satellite imagery integrating both crop growth and environmental parameters. As a case study, the approach was applied to data from a period of typical co-epidemic outbreak of winter wheat powdery mildew and aphids in the Shijiazhuang area of Hebei Province, China. First, bi-temporal remotely sensed features characterizing growth indices and environmental factors were calculated based on two Landsat-8 images. The synthetic minority oversampling technique (SMOTE) algorithm was used to resample the imbalanced training data set before model construction. Then, a back propagation neural network (BPNN) based on a new training data set balanced by the SMOTE approach (SMOTE-BPNN) was developed to generate the regional wheat disease and pest distribution maps. BPNN and support vector machine (SVM) methods based on the original training data set were used for comparison and testing of the initial results. Our findings suggest that the proposed approach, incorporating both growth and environmental parameters of different crop periods, could distinguish wheat powdery mildew and aphids at the regional scale. The bi-temporal growth indices and environmental factors-based SMOTE-BPNN, BPNN, and SVM models all had an overall accuracy higher than 80%. Meanwhile, the SMOTE-BPNN method had the highest G-mean among the three methods. These results reveal that the combination of bi-temporal crop growth and environmental parameters is essential for improving the accuracy of crop disease and pest discrimination models. The combination of SMOTE and BPNN could effectively improve the discrimination accuracy of the minor disease or pest.
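The resample-then-train pattern the authors follow can be reproduced with off-the-shelf tools. Below is a minimal sketch, assuming random placeholder arrays X and y stand in for the Landsat-8 derived growth and environmental features, using imbalanced-learn's SMOTE and scikit-learn's MLPClassifier as a stand-in for the paper's back propagation neural network.

    import numpy as np
    from imblearn.over_sampling import SMOTE
    from imblearn.metrics import geometric_mean_score
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Placeholder data; the paper derives its features from bi-temporal
    # Landsat-8 imagery.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 12))
    y = np.r_[np.zeros(540, dtype=int), np.ones(60, dtype=int)]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Oversample the training split only, then fit the network.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                        random_state=0).fit(X_res, y_res)
    print("G-mean:", geometric_mean_score(y_te, clf.predict(X_te)))

Resampling only the training split matters: oversampling before the split would leak synthetic copies of test-set neighbourhoods into training and inflate the reported G-mean.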

30 citations


Cites methods from "Geometric SMOTE: Effective oversamp..."

  • ...Some modifications of SMOTE have been proposed [83,84]....


Journal ArticleDOI
06 May 2019
TL;DR: A thorough analysis of four supervised learning classifiers representing linear, ensemble, instance-based and neural network approaches on the Uwezo Annual Learning Assessment datasets for Tanzania as a case study indicates that two classifiers, logistic regression and multilayer perceptron, achieve the highest performance when an over-sampling technique is employed.
Abstract: School dropout is a widely recognized serious issue in developing countries. Meanwhile, machine learning techniques have gained much attention for addressing this problem. This paper presents a thorough analysis of four supervised learning classifiers representing linear, ensemble, instance-based and neural network approaches on the Uwezo Annual Learning Assessment datasets for Tanzania as a case study. The goal of the study is to provide data-driven algorithm recommendations to current researchers on the topic. Using three metrics, geometric mean, F-measure and adjusted geometric mean, we assessed and quantified the effect of different sampling techniques on the imbalanced dataset for model selection. We further indicate the significance of hyperparameter tuning in improving predictive performance. The results indicate that two classifiers, logistic regression and multilayer perceptron, achieve the highest performance when an over-sampling technique is employed. Furthermore, hyperparameter tuning improves each algorithm's performance compared to its baseline settings, and stacking these classifiers improves the overall predictive performance.
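The comparison the abstract describes, the same classifiers evaluated with and without oversampling under imbalance-aware metrics, might look like the following sketch. The Uwezo data is replaced by a synthetic placeholder, and imbalanced-learn's Pipeline keeps resampling inside each cross-validation fold so that only training folds are oversampled.

    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE
    from imblearn.metrics import geometric_mean_score
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import make_scorer
    from sklearn.model_selection import cross_val_score

    # Synthetic imbalanced stand-in for the Uwezo assessment data.
    X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

    gmean = make_scorer(geometric_mean_score)
    classifiers = [("logistic regression", LogisticRegression(max_iter=1000)),
                   ("multilayer perceptron", MLPClassifier(max_iter=500,
                                                           random_state=0))]
    for name, clf in classifiers:
        for label, steps in [("baseline", [("clf", clf)]),
                             ("SMOTE", [("smote", SMOTE(random_state=0)),
                                        ("clf", clf)])]:
            score = cross_val_score(Pipeline(steps), X, y,
                                    scoring=gmean, cv=5).mean()
            print(f"{name} ({label}): G-mean = {score:.3f}")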

16 citations


Cites methods from "Geometric SMOTE: Effective oversamp..."

  • ...SMOTEENN combines over and under sampling using SMOTE and edited nearest neighbour (EN) to generate more minority class in order to reinforce its signal [32] and random under sampler is a fast and easy way to balance the minority class by randomly selecting a subset of data for the targeted classes [33]....


Journal ArticleDOI
TL;DR: An oversampling method for large class imbalance problems that does not require a k-nearest-neighbours search is proposed; it is at least twice as fast as the fastest method reported in the literature while obtaining similar oversampling quality.
Abstract: Several oversampling methods have been proposed for solving the class imbalance problem. However, most of them require searching the k-nearest neighbors to generate synthetic objects. This requirement makes them time-consuming and therefore unsuitable for large datasets. In this paper, an oversampling method for large class imbalance problems that does not require the k-nearest neighbors' search is proposed. According to our experiments on large datasets with different degrees of imbalance, the proposed method is at least twice as fast as the fastest method reported in the literature while obtaining similar oversampling quality.
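The abstract does not spell out the generation mechanism, so the following is not the authors' method. It is only an illustrative k-NN-free oversampler, interpolating toward the minority-class centroid, that shows why dropping the neighbour search removes the main cost of SMOTE-style methods.

    import numpy as np

    def centroid_oversample(X_min, n_samples, seed=0):
        """Illustrative k-NN-free oversampler (NOT the paper's method):
        interpolate between random minority points and the minority
        centroid, avoiding any nearest-neighbour search entirely."""
        rng = np.random.default_rng(seed)
        centroid = X_min.mean(axis=0)
        idx = rng.integers(len(X_min), size=n_samples)
        lam = rng.uniform(size=(n_samples, 1))
        return X_min[idx] + lam * (centroid - X_min[idx])

Each synthetic sample here costs O(d), whereas a naive nearest-neighbour search costs O(n·d) per point; that search is the bottleneck the paper targets on large datasets.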

5 citations

References
Book
01 Jan 1983
TL;DR: In this paper, a generalization of the analysis of variance is given for these models using log-likelihoods, illustrated by examples relating to four distributions: the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables), and gamma (variance components).
Abstract: The technique of iterative weighted linear regression can be used to obtain maximum likelihood estimates of the parameters with observations distributed according to some exponential family and systematic effects that can be made linear by a suitable transformation. A generalization of the analysis of variance is given for these models using log-likelihoods. These generalized linear models are illustrated by examples relating to four distributions: the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables) and gamma (variance components).
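The "iterative weighted linear regression" the abstract refers to is what is now usually called iteratively reweighted least squares (IRLS). A compact sketch for the binomial case with the logit link, in plain NumPy (the design matrix X is assumed to already carry any intercept column):

    import numpy as np

    def irls_logistic(X, y, n_iter=25, tol=1e-8):
        """Fit a logistic regression by iteratively reweighted least
        squares, the GLM fitting scheme described in the abstract."""
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ beta                                  # linear predictor
            mu = np.clip(1 / (1 + np.exp(-eta)), 1e-9, 1 - 1e-9)
            w = mu * (1 - mu)                               # binomial variance weights
            z = eta + (y - mu) / w                          # working response
            beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
            if np.max(np.abs(beta_new - beta)) < tol:
                return beta_new
            beta = beta_new
        return beta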

23,215 citations

Journal ArticleDOI
TL;DR: A general gradient descent boosting paradigm is developed for additive expansions based on any fitting criterion, and specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification.
Abstract: Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire and Friedman, Hastie and Tibshirani are discussed.
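For the squared-error case the "gradient descent boosting" recipe reduces to repeatedly fitting a small tree to the current residuals, since the negative gradient of the squared loss is exactly the residual. A minimal sketch using scikit-learn regression trees as base learners:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_ls_boost(X, y, n_rounds=100, lr=0.1, max_depth=3):
        """Least-squares gradient boosting: each round fits a shallow
        tree to the negative gradient of the squared loss (the residuals)."""
        f0 = y.mean()                       # initial constant model
        f, trees = np.full(len(y), f0), []
        for _ in range(n_rounds):
            residual = y - f                # negative gradient at current fit
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
            f = f + lr * tree.predict(X)
            trees.append(tree)
        return f0, trees

    def predict_ls_boost(model, X, lr=0.1):
        f0, trees = model
        return f0 + lr * sum(t.predict(X) for t in trees)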

17,764 citations


"Geometric SMOTE: Effective oversamp..." refers methods in this paper

  • ...For the evaluation of the oversampling methods, Logistic Regression (LR) [McCullagh and Nelder, 1989] and Gradient Boosting Classifier (GBC) [Friedman, 2001] were used....


Journal ArticleDOI
TL;DR: In this article, a method of over-sampling the minority class involves creating synthetic minority class examples, which is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
Abstract: An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
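The over-sampling step this abstract describes, and which G-SMOTE generalizes, is linear interpolation between a minority instance and one of its k nearest minority neighbours. A minimal sketch of that core step (scikit-learn is used only for the neighbour search):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_samples(X_min, n_samples, k=5, seed=0):
        """Core SMOTE step: new = x + lam * (neighbour - x), lam ~ U(0, 1),
        with the neighbour drawn from x's k nearest minority neighbours."""
        rng = np.random.default_rng(seed)
        _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
        out = np.empty((n_samples, X_min.shape[1]))
        for j in range(n_samples):
            i = rng.integers(len(X_min))
            nb = X_min[idx[i, rng.integers(1, k + 1)]]  # idx[i, 0] is x itself
            out[j] = X_min[i] + rng.uniform() * (nb - X_min[i])
        return out

Because every synthetic point lies on a segment between two existing minority points, SMOTE samples from a one-dimensional subset of the input space; G-SMOTE's hyper-sphere relaxes exactly this constraint.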

17,313 citations
