RUSBoost: A Hybrid Approach to Alleviating Class Imbalance

doi:10.1109/TSMCA.2009.2029559

Journal Article•DOI•

C. Seiffert¹, Taghi M. Khoshgoftaar¹, J. Van Hulse¹, Amri Napolitano¹•Institutions (1)

Florida Atlantic University¹

01 Jan 2010-Vol. 40, Iss: 1, pp 185-197

TL;DR: This paper presents RUSBoost, a new hybrid sampling/boosting algorithm for learning from skewed training data. It provides a simpler and faster alternative to SMOTEBoost, another algorithm that combines boosting and data sampling.

Abstract: Class imbalance is a problem that is common to many application domains. When examples of one class in a training data set vastly outnumber examples of the other class(es), traditional data mining algorithms tend to create suboptimal classification models. Several techniques have been used to alleviate the problem of class imbalance, including data sampling and boosting. In this paper, we present a new hybrid sampling/boosting algorithm, called RUSBoost, for learning from skewed training data. This algorithm provides a simpler and faster alternative to SMOTEBoost, which is another algorithm that combines boosting and data sampling. This paper evaluates the performances of RUSBoost and SMOTEBoost, as well as their individual components (random undersampling, synthetic minority oversampling technique, and AdaBoost). We conduct experiments using 15 data sets from various application domains, four base learners, and four evaluation metrics. RUSBoost and SMOTEBoost both outperform the other procedures, and RUSBoost performs comparably to (and often better than) SMOTEBoost while being a simpler and faster technique. Given these experimental results, we highly recommend RUSBoost as an attractive alternative for improving the classification performance of learners built using imbalanced data.
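The hybrid idea in the abstract, random undersampling (RUS) applied inside each boosting round, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the decision-stump weak learner, the balanced 50:50 undersampling target, the plain binary AdaBoost weight update (the paper's boosting variant may differ), the guard that skips rounds with weighted error of 0.5 or more, and the names `rusboost_fit`, `rusboost_predict` and `undersample`.

```python
import math
import random


def stump_predict(stump, row):
    """A decision stump predicts `polarity` when the feature exceeds the threshold."""
    feature, threshold, polarity = stump
    return polarity if row[feature] > threshold else -polarity


def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error on (X, y, w)."""
    best_err, best_stump = None, None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if best_err is None or err < best_err:
                    best_err, best_stump = err, stump
    return best_stump


def undersample(minority_idx, majority_idx, rng):
    """Random undersampling: keep every minority index, draw equally many majority ones."""
    n_keep = min(len(minority_idx), len(majority_idx))
    return minority_idx + rng.sample(majority_idx, n_keep)


def rusboost_fit(X, y, rounds=10, seed=0):
    """Labels are +1 (minority) and -1 (majority). Returns a list of (alpha, stump)."""
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n  # boosting weights over the FULL training set
    minority_idx = [i for i in range(n) if y[i] == 1]
    majority_idx = [i for i in range(n) if y[i] == -1]
    ensemble = []
    for _ in range(rounds):
        # Train the weak learner on a temporary, undersampled subset.
        kept = undersample(minority_idx, majority_idx, rng)
        total = sum(w[i] for i in kept)
        stump = train_stump([X[i] for i in kept], [y[i] for i in kept],
                            [w[i] / total for i in kept])
        # Evaluate and reweight on the full set, so no example is lost for good.
        preds = [stump_predict(stump, row) for row in X]
        err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        if err >= 0.5:  # assumption of this sketch: skip learners no better than chance
            continue
        err = max(err, 1e-10)  # avoid division by zero for a perfect learner
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        norm = sum(w)
        w = [wi / norm for wi in w]
    return ensemble


def rusboost_predict(ensemble, row):
    """Weighted vote of the stumps; ties and empty ensembles fall back to -1."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score > 0 else -1


# Toy data (hypothetical): 10 majority points at 0..9, 5 minority points at 20..24.
X = [[float(v)] for v in range(10)] + [[float(v)] for v in range(20, 25)]
y = [-1] * 10 + [1] * 5
model = rusboost_fit(X, y, rounds=10, seed=0)
print([rusboost_predict(model, row) for row in X])
```

The point of the sketch is the division of labour: each weak learner sees only a small balanced subset, which keeps training cheap, while the weight update runs over all examples, so majority examples dropped in one round can still influence later rounds.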

Citations

Journal Article•DOI•

A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches

Mikel Galar, Alberto Fernández¹, Edurne Barrenechea, Humberto Bustince, Francisco Herrera²•Institutions (2)

University of Jaén¹, University of Granada²

01 Jul 2012

TL;DR: Proposes a taxonomy of ensemble-based methods for the class imbalance problem, in which each proposal is categorized by the ensemble methodology it builds on, and carries out a thorough empirical comparison of the most significant published approaches to show whether any of them makes a difference.

Abstract: Classifier learning with data-sets that suffer from imbalanced class distributions is a challenging problem in data mining community. This issue occurs when the number of examples that represent one class is much lower than the ones of the other classes. Its presence in many real-world applications has brought along a growth of attention from researchers. In machine learning, the ensemble of classifiers are known to increase the accuracy of single classifiers by combining several of them, but neither of these learning techniques alone solve the class imbalance problem, to deal with this issue the ensemble learning algorithms have to be designed specifically. In this paper, our aim is to review the state of the art on ensemble techniques in the framework of imbalanced data-sets, with focus on two-class problems. We propose a taxonomy for ensemble-based methods to address the class imbalance where each proposal can be categorized depending on the inner ensemble methodology in which it is based. In addition, we develop a thorough empirical comparison by the consideration of the most significant published approaches, within the families of the taxonomy proposed, to show whether any of them makes a difference. This comparison has shown the good behavior of the simplest approaches which combine random undersampling techniques with bagging or boosting ensembles. In addition, the positive synergy between sampling techniques and bagging has stood out. Furthermore, our results show empirically that ensemble-based algorithms are worthwhile since they outperform the mere use of preprocessing techniques before learning the classifier, therefore justifying the increase of complexity by means of a significant enhancement of the results.

2,228 citations

Cites background or methods from "RUSBoost: A Hybrid Approach to Alle..."

...Likewise, UnderBagging is computationally harder than RUSBoost, in spite of obtaining comparable size trees, it uses four times more classifiers....
[...]
...The Boosting-based ensembles that are considered in our study are RUSBoost, SMOTEBoost and MSMOTEBoost....
[...]
...On the other hand, with regard to ensemble learning methods, a large number of different approaches have been proposed in the literature, including but not limited to SMOTEBoost [44], RUSBoost [45], IIVotes [46], EasyEnsemble [47], or SMOTEBagging [55]....
[...]
...Particularly noteworthy is the performance of RUSBoost, which is the computationally least complex among the best performers....
[...]
...have arisen as a possible solution to the class imbalance problem attracting great interest among researchers [45], [47], [50], [62]....
[...]

Journal Article•DOI•

Learning from class-imbalanced data

Guo Haixiang¹, Li Yijing¹, Jennifer Shang², Gu Mingyun¹, Huang Yuanyue¹, Gong Bing³•Institutions (3)

China University of Geosciences (Wuhan)¹, University of Pittsburgh², Technical University of Madrid³

01 May 2017-Expert Systems With Applications

TL;DR: An in-depth review of rare event detection from an imbalanced learning perspective and a comprehensive taxonomy of the existing application domains of imbalanced learning are provided.

Abstract: 527 articles related to imbalanced data and rare events are reviewed. Viewing reviewed papers from both technical and practical perspectives. Summarizing existing methods and corresponding statistics by a new taxonomy idea. Categorizing 162 application papers into 13 domains and giving introduction. Some opening questions are discussed at the end of this manuscript. Rare events, especially those that could potentially negatively impact society, often require humans decision-making responses. Detecting rare events can be viewed as a prediction task in data mining and machine learning communities. As these events are rarely observed in daily life, the prediction task suffers from a lack of balanced data. In this paper, we provide an in depth review of rare event detection from an imbalanced learning perspective. Five hundred and seventeen related papers that have been published in the past decade were collected for the study. The initial statistics suggested that rare events detection and imbalanced learning are concerned across a wide range of research areas from management science to engineering. We reviewed all collected papers from both a technical and a practical point of view. Modeling methods discussed include techniques such as data preprocessing, classification algorithms and model evaluation. For applications, we first provide a comprehensive taxonomy of the existing application domains of imbalanced learning, and then we detail the applications for each category. Finally, some suggestions from the reviewed papers are incorporated with our experiences and judgments to offer further research directions for the imbalanced learning and rare event detection fields.

1,448 citations

Journal Article•DOI•

An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics

Victoria López¹, Alberto Fernández², Salvador García², Vasile Palade³, Francisco Herrera¹•Institutions (3)

University of Granada¹, University of Jaén², University of Oxford³

20 Nov 2013-Information Sciences

TL;DR: This work carries out a thorough discussion on the main issues related to using data intrinsic characteristics in this classification problem, and introduces several approaches and recommendations to address these problems in conjunction with imbalanced data.

1,292 citations

Cites background or methods from "RUSBoost: A Hybrid Approach to Alle..."

...In recent years, ensembles of classifiers have arisen as a possible solution to the class imbalance problem [77,85,112,117,127,131]....
[...]
...M1 (AdaB-M1) [110], AdaBoost with costs outside the exponent (AdaC2) [117], RUSBoost (RUSB) [112], SMOTEBagging (SBAG) [130], and EasyEnsemble (EASY) [85]....
[...]
...Ensemble methods [101,108] are also frequently adapted to imbalanced domains, either by modifying the ensemble learning algorithm at the data-level approach to preprocess the data before the learning stage of each classifier [17,30,112] or by embedding a cost-sensitive framework in the ensemble learning process [44,117,122]....
[...]
...Following this idea, many approaches have been developed by modifying the standard boosting weight-update mechanism in order to improve the performance on the minority class and the small disjuncts [30,44,61,69,74,112,117,122]....
[...]
...In this last section of the experimental analysis on the behavior of the methodologies for addressing classification with imbalanced datasets, we will perform a cross-family comparison for the approaches previously selected as the representatives for each case, namely preprocessing (SMOTE and SMOTE+ENN), cost-sensitive learning (CS-Weighted and MetaCost) and ensemble techniques (RUSB and SBAG)....
[...]

Journal Article•DOI•

A Survey of Predictive Modeling on Imbalanced Domains

Paula Branco¹, Luís Torgo¹, Rita P. Ribeiro¹•Institutions (1)

University of Porto¹

13 Aug 2016-ACM Computing Surveys

TL;DR: The main challenges raised by imbalanced domains are discussed, a definition of the problem is proposed, the main approaches to these tasks are described, and a taxonomy of the methods is proposed.

Abstract: Many real-world data-mining applications involve obtaining predictive models using datasets with strongly imbalanced distributions of the target variable. Frequently, the least-common values of this target variable are associated with events that are highly relevant for end users (e.g., fraud detection, unusual returns on stock markets, anticipation of catastrophes, etc.). Moreover, the events may have different costs and benefits, which, when associated with the rarity of some of them on the available training data, creates serious problems to predictive modeling techniques. This article presents a survey of existing techniques for handling these important applications of predictive analytics. Although most of the existing work addresses classification tasks (nominal target variables), we also describe methods designed to handle similar problems within regression tasks (numeric target variables). In this survey, we discuss the main challenges raised by imbalanced domains, propose a definition of the problem, describe the main approaches to these tasks, propose a taxonomy of the methods, summarize the conclusions of existing comparative studies as well as some theoretical analyses of some methods, and refer to some related problems within predictive modeling.

730 citations

Cites methods from "RUSBoost: A Hybrid Approach to Alle..."

...Namely, it was combined with boosting [Seiffert et al. 2010] and bagging [Chang et al. 2003; Tao et al. 2006; Wang and Yao 2009; Wallace et al. 2011] and was applied to both classes in random forests in a method named Balanced Random Forest (BRF) [Chen et al. 2004]....
[...]

Journal Article•DOI•

Text Data Augmentation for Deep Learning.

Connor Shorten¹, Taghi M. Khoshgoftaar¹, Borko Furht¹•Institutions (1)

Florida Atlantic University¹

29 Jun 2021-Journal of Big Data

TL;DR: A survey of data augmentation for text data can be found in this article, where the major motifs of Data Augmentation are summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form.

Abstract: Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to preview a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.

487 citations

References

Book•

C4.5: Programs for Machine Learning

J. Ross Quinlan¹•Institutions (1)

University of Sydney¹

15 Oct 1992

TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.

Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations

Book•

Data Mining: Practical Machine Learning Tools and Techniques

Ian H. Witten, Eibe Frank, Mark Hall

25 Oct 1999

TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.

Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. *Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects *Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods *Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks-in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

20,196 citations

Journal Article•DOI•

SMOTE: synthetic minority over-sampling technique

Nitesh V. Chawla¹, Kevin W. Bowyer², Lawrence O. Hall¹, W. Philip Kegelmeyer³•Institutions (3)

University of South Florida¹, University of Notre Dame², Sandia National Laboratories³

01 Jan 2002-Journal of Artificial Intelligence Research

TL;DR: This article proposes over-sampling the minority class by creating synthetic minority class examples; the method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.

Abstract: An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
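The abstract says only that synthetic minority class examples are created. A common reading of SMOTE is that each synthetic point is an interpolation between a minority example and one of its nearest minority neighbours, and the following sketch illustrates that reading. The function name `smote`, the default `k=3`, the squared Euclidean distance and the toy points are assumptions for illustration, not details given on this page.

```python
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create `n_synthetic` points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance),
        # excluding `base` itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # The new point lies on the segment between `base` and `neighbour`.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority class: four corners of the unit square.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote(minority_points, n_synthetic=3, k=2):
    print(point)
```

Because every synthetic point is a convex combination of two existing minority points, it stays inside the region the minority class already occupies, unlike plain duplication, which adds no new locations at all.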

17,313 citations

UCI Repository of machine learning databases

Catherine Blake

01 Jan 1998

12,940 citations

"RUSBoost: A Hybrid Approach to Alle..." refers methods in this paper

...The remaining data sets were obtained from the popular University of California–Irvine repository [25], and they represent various application domains....
[...]

Journal Article•DOI•

SMOTE: Synthetic Minority Over-sampling Technique

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W.P. Kegelmeyer

09 Jun 2011-arXiv: Artificial Intelligence

Abstract: An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.

11,512 citations