Learning from Imbalanced Data

doi:10.1109/TKDE.2008.239

Journal Article

Haibo He, +1 more

- 01 Sep 2009 -

IEEE Transactions on Knowledge and Data ...

- Vol. 21, Iss: 9, pp 1263-1284

TL;DR

A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.

Abstract:

With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.

Citations

Journal Article

The Precision-Recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets

Takaya Saito, +1 more

- 04 Mar 2015 -

PLOS ONE

TL;DR: It is shown that the visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with respect to conclusions about the reliability of classification performance, owing to an intuitive but wrong interpretation of specificity.
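The point about specificity can be illustrated with a small arithmetic sketch. The confusion-matrix counts below are hypothetical and chosen only for illustration; they are not taken from the cited article. The sketch keeps the per-class error rates fixed and increases only the number of negatives, so recall and false positive rate (the two ROC axes) stay the same while precision falls sharply.

```python
def rates(tp, fn, fp, tn):
    """Return (recall, false positive rate, precision) from confusion counts."""
    recall = tp / (tp + fn)       # true positive rate, the ROC y-axis
    fpr = fp / (fp + tn)          # 1 - specificity, the ROC x-axis
    precision = tp / (tp + fp)    # the precision-recall y-axis
    return recall, fpr, precision


# Hypothetical balanced set: 100 positives, 100 negatives.
balanced = rates(tp=80, fn=20, fp=10, tn=90)

# Hypothetical imbalanced set: 100 positives, 10,000 negatives,
# with the same per-class error rates as above.
imbalanced = rates(tp=80, fn=20, fp=1000, tn=9000)

for name, (recall, fpr, precision) in [("balanced", balanced),
                                       ("imbalanced", imbalanced)]:
    print(f"{name}: recall={recall:.2f} fpr={fpr:.2f} precision={precision:.3f}")
```

Both cases map to the same ROC point (recall 0.80, false positive rate 0.10), yet precision is about 0.889 in the balanced case and about 0.074 in the imbalanced one.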

Journal Article

A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches

Mikel Galar, +4 more

TL;DR: A taxonomy of ensemble-based methods for the class imbalance problem is proposed, in which each proposal is categorized by the ensemble methodology on which it is based, and a thorough empirical comparison of the most significant published approaches is carried out to show whether any of them makes a difference.

Proceedings Article

KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework

Jesús Alcalá-Fdez, +6 more

TL;DR: The aim of this paper is to present three new aspects of KEEL: KEEL-dataset, a data set repository which includes the data set partitions in the KEEL format and some guidelines for including new algorithms in KEEL, helping the researcher to compare the results of many approaches already included within the KEEL software.

Journal Article

A systematic study of the class imbalance problem in convolutional neural networks

Mateusz Buda, +2 more

- 01 Oct 2018 -

Neural Networks

TL;DR: The effect of class imbalance on classification performance is detrimental; the method of addressing class imbalance that emerged as dominant in almost all analyzed scenarios was oversampling; and thresholding should be applied to compensate for prior class probabilities when the overall number of properly classified cases is of interest.
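A minimal sketch of thresholding, assuming it means dividing each class score by that class's prior probability (estimated from training-set frequencies) and renormalizing. The function name and all numbers are hypothetical and are not taken from the cited article.

```python
def threshold_by_prior(posteriors, priors):
    """Divide each class posterior by its class prior, then renormalize.

    posteriors: per-class scores from a classifier trained on skewed data.
    priors: per-class frequencies in the training set (assumed known).
    """
    scaled = [p / q for p, q in zip(posteriors, priors)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical two-class case: class 0 makes up 90% of the training data.
priors = [0.9, 0.1]
posteriors = [0.7, 0.3]   # raw output favours the majority class

adjusted = threshold_by_prior(posteriors, priors)
raw_label = posteriors.index(max(posteriors))
adjusted_label = adjusted.index(max(adjusted))
print(raw_label, adjusted_label, [round(a, 3) for a in adjusted])
```

In this example the raw output picks class 0, while the prior-compensated output picks class 1, because a score of 0.3 is three times the minority prior of 0.1.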

Journal Article

Learning from imbalanced data: open challenges and future directions

Bartosz Krawczyk

- 22 Apr 2016 -

Progress in Artificial Intelligence

TL;DR: Seven vital areas of research in this topic are identified, covering the full spectrum of learning from imbalanced data: classification, regression, clustering, data streams, big data analytics and applications, e.g., in social media and computer vision.

References

Book

The Nature of Statistical Learning Theory

Vladimir Vapnik

TL;DR: Topics covered: setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?

Book

Neural Networks: A Comprehensive Foundation

Simon Haykin

TL;DR: Thorough, well-organized, and completely up to date, this book examines all the important aspects of this emerging technology, including the learning process, back-propagation learning, radial-basis function networks, self-organizing systems, modular networks, temporal processing and neurodynamics, and VLSI implementation of neural networks.

Journal Article

Classification and Regression Trees.

John Van Ryzin, +4 more

- 01 Mar 1986 -

Journal of the American Statistical Asso...

Journal Article

SMOTE: synthetic minority over-sampling technique

Nitesh V. Chawla, +3 more

- 01 Jan 2002 -

Journal of Artificial Intelligence Resea...

TL;DR: A method of over-sampling the minority class by creating synthetic minority class examples is presented and evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
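The idea of creating synthetic minority examples can be sketched as interpolation between a minority sample and one of its nearest minority-class neighbours. This is a simplified illustration, not the authors' implementation: the function name `smote_sketch`, its parameters, and the sample points are hypothetical, and base samples are drawn at random rather than processed one by one.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on line segments between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Hypothetical two-dimensional minority class with four samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating existing samples exactly.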

Journal Article

Induction of Decision Trees

J. R. Quinlan

- 25 Mar 1986 -

Machine Learning

TL;DR: This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, describes one such system, ID3, in detail, and discusses a reported shortcoming of the basic algorithm.

Related Papers (5)

SMOTE: synthetic minority over-sampling technique

Nitesh V. Chawla, +3 more

- 01 Jan 2002 -

Journal of Artificial Intelligence Resea...

A study of the behavior of several methods for balancing machine learning training data

Gustavo E. A. P. A. Batista, +2 more

- 01 Jun 2004 -

Sigkdd Explorations

Random Forests

Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning

The class imbalance problem: A systematic study