Journal ArticleDOI

Class imbalance revisited: a new experimental setup to assess the performance of treatment methods

TLDR
A simple experimental design to assess the performance of class imbalance treatment methods is proposed, together with a statistical procedure, based on confidence intervals, for evaluating the relative degradations and recoveries.
Abstract
In the last decade, class imbalance has attracted a huge amount of attention from researchers and practitioners. Class imbalance is ubiquitous in Machine Learning, Data Mining and Pattern Recognition applications; therefore, these research communities have responded to such interest with dozens of methods and techniques. Surprisingly, many fundamental questions remain open, such as "Are all learning paradigms equally affected by class imbalance?", "What is the expected performance loss for different imbalance degrees?" and "How much of the performance loss can be recovered by the treatment methods?". In this paper, we propose a simple experimental design to assess the performance of class imbalance treatment methods. This experimental setup uses real data sets with artificially modified class distributions to evaluate classifiers over a wide range of class imbalance degrees. We apply this experimental design in a large-scale evaluation with 22 data sets and seven learning algorithms from different paradigms. We also propose a statistical procedure, based on confidence intervals, for evaluating the relative degradations and recoveries. This procedure allows a simple yet insightful visualization of the results and provides the basis for drawing statistical conclusions. Our results indicate that the expected performance loss, as a percentage of the performance obtained with the balanced distribution, is quite modest (below 5 %) for distributions with 10 % or more minority examples. However, the loss tends to increase quickly for higher degrees of class imbalance, reaching 20 % for 1 % of minority class examples. Support Vector Machines constitute the classifier paradigm least affected by class imbalance, being almost insensitive to all but the most imbalanced distributions. Finally, we show that the treatment methods only partially recover the performance losses: on average, about 30 % or less of the performance lost due to class imbalance was recovered by these methods.
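To make the setup concrete, here is a minimal sketch of the kind of experiment the abstract describes: subsample a real (here, synthetic stand-in) data set to fixed minority proportions and report the relative performance loss against the balanced reference. This is not the authors' code; scikit-learn and the helper names below are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def subsample_to_proportion(X, y, minority_prop):
    """Keep all majority (class 0) examples and subsample class 1
    so it makes up `minority_prop` of the resulting data set."""
    maj_idx = np.where(y == 0)[0]
    min_idx = np.where(y == 1)[0]
    n_min = int(len(maj_idx) * minority_prop / (1.0 - minority_prop))
    keep = rng.choice(min_idx, size=min(n_min, len(min_idx)), replace=False)
    idx = np.concatenate([maj_idx, keep])
    return X[idx], y[idx]

def auc_at_proportion(X, y, prop):
    Xs, ys = subsample_to_proportion(X, y, prop)
    Xtr, Xte, ytr, yte = train_test_split(Xs, ys, stratify=ys, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
    return roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])

X, y = make_classification(n_samples=20000, weights=[0.5, 0.5], random_state=0)
baseline = auc_at_proportion(X, y, 0.5)        # balanced reference
for prop in (0.10, 0.05, 0.01):                # increasing imbalance
    loss = 100.0 * (baseline - auc_at_proportion(X, y, prop)) / baseline
    print(f"{prop:.0%} minority -> relative AUC loss {loss:.1f}%")
```

Repeating such runs over resamples would yield the confidence intervals the paper's statistical procedure is built on; recovery would be measured analogously, after applying a treatment method to the imbalanced training set.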


Citations
Journal ArticleDOI

Learning from imbalanced data: open challenges and future directions

TL;DR: Seven vital areas of research on this topic are identified, covering the full spectrum of learning from imbalanced data: classification, regression, clustering, data streams, big data analytics and applications, e.g., in social media and computer vision.
Journal ArticleDOI

SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary

TL;DR: The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data because of its simplicity of design, as well as its robustness when applied to different types of problems.
Journal ArticleDOI

A Survey of Predictive Modeling on Imbalanced Domains

TL;DR: The main challenges raised by imbalanced domains are discussed, a definition of the problem is proposed, the main approaches to these tasks are described, and a taxonomy of the methods is proposed.
Journal ArticleDOI

Data imbalance in classification: Experimental evaluation

TL;DR: This paper demonstrates the effects of class imbalance on classification models and finds that the relationship between the class imbalance ratio and accuracy is convex.
Journal ArticleDOI

Tutorial on practical tips of the most influential data preprocessing algorithms in data mining

TL;DR: A real-world problem presented in the ECBDL'14 Big Data competition is used to provide a thorough analysis of the application of some preprocessing techniques, their combination and their performance.
References
Journal ArticleDOI

LIBSVM: A library for support vector machines

TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
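As a usage illustration (not from the paper): scikit-learn's SVC is implemented on top of LIBSVM, so a short sketch with it exercises the probability estimates and parameter selection the TL;DR mentions. The data set and grid values are invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Parameter selection via cross-validated grid search over C and gamma;
# probability=True enables LIBSVM's Platt-scaled probability estimates.
grid = GridSearchCV(SVC(probability=True),
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
print(grid.predict_proba(X[:3]))   # per-class probability estimates
```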
Book

C4.5: Programs for Machine Learning

TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Journal ArticleDOI

SMOTE: synthetic minority over-sampling technique

TL;DR: A method of over-sampling the minority class by creating synthetic minority-class examples is proposed and evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
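The core SMOTE idea fits in a few lines: each synthetic example interpolates between a minority instance and one of its k nearest minority neighbours. The sketch below is a minimal illustration, not the reference implementation (imbalanced-learn's SMOTE is a production version); scikit-learn is assumed only for the neighbour search.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_synthetic, k=5, seed=0):
    """Generate `n_synthetic` examples from minority-class points X_min."""
    rng = np.random.default_rng(seed)
    # k+1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]   # random neighbour, skipping self
        gap = rng.random()                   # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.default_rng(1).normal(size=(20, 2))  # toy minority class
print(smote(X_min, 5).shape)                           # -> (5, 2)
```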
Journal ArticleDOI

An introduction to ROC analysis

TL;DR: The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.
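As a small worked example of the metric used throughout the evaluation above (scikit-learn assumed; the toy labels and scores are invented): the ROC curve plots true-positive rate against false-positive rate as the decision threshold varies, and the AUC summarizes it as a single number.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])   # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC =", roc_auc_score(y_true, y_score))  # area under the ROC curve
```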