SCUT: Multi-Class Imbalanced Data Classification using SMOTE
and Cluster-based Undersampling
Astha Agrawal¹, Herna L. Viktor¹ and Eric Paquet¹,²
¹School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada
²National Research Council of Canada, Ottawa, Ontario, Canada
Keywords: Multi-Class Imbalance, Undersampling, Oversampling, Classification, Clustering.
Abstract: Class imbalance is a crucial problem in machine learning and occurs in many domains. Specifically, the
two-class problem has received interest from researchers in recent years, leading to solutions for oil spill
detection, tumour discovery and fraudulent credit card detection, amongst others. However, handling class
imbalance in datasets that contain multiple classes, with varying degrees of imbalance, has received limited
attention. In such a multi-class imbalanced dataset, the classification model tends to favour the majority
classes and incorrectly classify instances from the minority classes as belonging to the majority classes,
leading to poor predictive accuracies. Further, there is a need to handle both the imbalances between classes
as well as address the selection of examples within a class (i.e. the so-called within class imbalance). In this
paper, we propose the SCUT hybrid sampling method, which is used to balance the number of training
examples in such a multi-class setting. Our SCUT approach oversamples minority class examples through
the generation of synthetic examples and employs cluster analysis in order to undersample majority classes.
In addition, it handles both within-class and between-class imbalance. Our experimental results on a
number of multi-class problems show that, when the SCUT method is used for pre-processing the data
before classification, we obtain highly accurate models that compare favourably to the state-of-the-art.
1 INTRODUCTION
In an imbalanced dataset used for classification, the
sizes of one or more classes are much greater than
the other classes. The classes with the larger number
of instances are called majority classes and the
classes with the smaller number of instances are
referred to as the minority classes. Intuitively, since
there are a large number of majority class examples,
a classification model tends to favour majority
classes while incorrectly classifying the examples
from the minority classes. However, in imbalanced
datasets, we are often more interested in correctly
classifying the minority classes. For instance, in a
two class setting within the medical domain, if we
are classifying patients’ condition, the minority class
(e.g. cancer) is of more interest than the majority
class (e.g. cancer free). In practice, many problems
have more than two classes. For example, in
bioinformatics, protein family classification, where a
protein may belong to a very small family within the
large Protein Data Bank repository (Viktor et al.,
2013), and protein fold prediction are both
multi-class problems. Typically, in such
a multi-class imbalanced dataset, there are multiple
classes that are underrepresented, that is, there may
be multiple majority classes and multiple minority
classes, resulting in skewed distributions.
A number of research studies have been conducted
in order to improve classification performance on
imbalanced binary class datasets, in which there is
one majority class and one minority class. However,
improving the performance on imbalanced multi-
class datasets has not been researched as
extensively. Consequently, most existing techniques
for improving classification performance on
imbalanced datasets are designed to be applied
directly on binary class imbalanced datasets. These
methods cannot be applied directly on multi-class
datasets (Wang and Yao, 2012). Rather, class
decomposition is usually used to convert a multi-
class problem into a binary class problem. For
instance, the One-versus-one (OVO) approach
employs one classifier for each possible pair of
classes, discarding the remaining instances that do
not belong to the pair under consideration. The One-
versus-all (OVA) approach, on the other hand,
considers one class as the positive class, and merges
the remaining classes to form the negative class. For
‘n’ classes, ‘n’ classifiers are used, and each class
acts as the positive class once (Fernández et al.,
2010). Subsequently, the results from different
classifiers are combined in order to reach a final
decision. Interested readers are referred to (Ramanan
et al., 2007) for detailed discussions of the OVO and
OVA approaches. However, combining results from
classifiers that are trained on different sub-problems
may result in classification errors (Wang and Yao,
2012). In addition, in OVO, each classifier is trained
only on a subset of the dataset, which may lead to
some data regions being left unlearned. In this paper,
we propose a different method to improve
classification performance on multi-class
imbalanced datasets which preserves the structure of
the data, without converting the dataset into a binary
class problem.
In addition to between-class imbalance (i.e. the
imbalance in the number of instances in each
class), within-class imbalance is also commonly
observed in datasets. Such a situation occurs when a
class is composed of different sub-clusters and these
sub-clusters do not contain the same number of
examples (Japkowicz, 2001). It follows that
between-class and within-class imbalances both
affect classification performance. In an attempt to
address these two problems, and in order to improve
classification performance on imbalanced datasets,
sampling methods are often used for pre-processing
the data prior to using a classifier to build a
classification model.
Sampling methods focus on adapting the class
distribution in order to reduce the between-class
imbalance. Sampling methods may be divided into
two categories, namely undersampling and
oversampling. Undersampling reduces the number
of majority class instances and oversampling
increases the number of minority class instances.
Unfortunately, both random oversampling and
undersampling techniques present some weaknesses.
For instance, random oversampling adds duplicate
minority class instances to the minority class. This
may result in smaller and more specific decision
regions causing the learner to over-fit the data. Also,
oversampling may increase the training time.
Random undersampling randomly takes away some
instances from the majority class. A drawback of
this method is that useful information may be taken
away (Han et al., 2005). Further, when performing
random undersampling, if the dataset has within-
class imbalance and some sub-clusters are
represented by very few instances, the probability
that instances from these sub-clusters are retained is
relatively low. Consequently, these instances may
remain unlearned.
SMOTE represents an improvement over random
oversampling in that the minority class is
oversampled by generating “synthetic” examples
(Chawla et al., 2002). However, in highly
imbalanced datasets, too much oversampling (i.e.
oversampling using a high sampling percentage)
may result in overfitting. This is especially
important in a multi-class setting where there are a
number of minority classes with very few examples.
Further, in a multi-class setting, there is a need to
find the correct balance, in terms of number of
examples, between multiple classes. In order to
address this issue, we propose an algorithm called
SCUT (SMOTE and Clustered Undersampling
Technique) which combines SMOTE and cluster-
based undersampling in order to handle between-
class and within-class imbalance.
Undersampling is required to balance the dataset
without using excessive oversampling. If majority
class instances are randomly selected, small
disjuncts with less representative data may remain
unlearned. Clustering the majority classes helps
identify sub-concepts, and if at least one instance is
selected from each sub-concept (cluster) while doing
undersampling, this issue might be addressed
(Sobhani et al., 2014). This reduces the risk of
unlearned regions when within-class imbalance
exists. In this setting,
combining clustering and undersampling makes
sense as it addresses the disadvantage of random
undersampling. To this end, Yen and Lee proposed
several cluster-based undersampling approaches to
select representative data as training data to improve
the classification accuracy for the minority class
(Yen and Lee, 2009). The main idea behind their
cluster-based undersampling approaches was based
on the assumption that each dataset has different
clusters and each cluster seems to have distinct
characteristics. Subsequently, from each cluster, a
suitable number of majority class samples were
selected (Yen and Lee, 2009). Rahman and Davis
also used a cluster-based undersampling technique
for classifying imbalanced cardiovascular data that
not only balances the data in a dataset, but further
selects good quality training set data for building
classification models (Rahman and Davis, 2013).
Chawla et al. combined random undersampling
with SMOTE, so that the minority class had a larger
presence in the training set. By combining
undersampling and oversampling, the initial bias of
the learner towards the majority class is reversed in
favour of the minority class (Chawla et al.,
2002). In summary, cluster-based undersampling
ensures that all sub-concepts are adequately
represented. When used in conjunction with
SMOTE, the hybrid sampling method thus helps to
ensure that between-class imbalance is reduced
without excessive use of oversampling and
undersampling.
This paper is organized as follows. Section 2
contains a description of the proposed method. In
Section 3, the experimental setup and results are
presented while Section 4 concludes the paper and
discusses our future plans.
2 SCUT ALGORITHM
Our SCUT algorithm combines both undersampling
and oversampling techniques in order to reduce the
imbalance between classes in a multi-class setting.
The pseudocode for our SCUT method is shown in
Figure 1.
For undersampling, we employ a cluster-based
undersampling technique, using the Expectation
Maximization (EM) algorithm (Dempster et al.,
1977). The EM algorithm replaces the hard clusters
by a probability distribution formed by a mixture of
Gaussians. Instead of being assigned to a particular
cluster, each member has a certain probability to
belong to a particular Gaussian distribution of the
mixture. The parameters of the mixture, including
the number of Gaussians, are determined with the
Expectation Maximization algorithm. An advantage
of using EM is that the number of clusters does not
have to be specified beforehand. EM clustering may
be used to find both hard and soft clusters. That is,
EM assigns a probability distribution to each
instance relative to each particular cluster (Dempster
et al., 1977).
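As an illustration of this soft-clustering behaviour (and not the WEKA-based implementation used in this work), the following sketch uses scikit-learn's GaussianMixture, which fits the same kind of Gaussian mixture by EM; note that, unlike WEKA's EM, it does not select the number of components itself, a point we return to in Section 3.

```python
# Illustration only: EM clustering of a single class with scikit-learn's
# GaussianMixture (this work used the WEKA EM implementation instead).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),        # stand-in for one class Di
               rng.normal(4, 1, (30, 2))])       # with two sub-concepts

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X)[:5])         # hard cluster assignments
print(gmm.predict_proba(X)[:5])   # soft membership probabilities per instance
```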
The SCUT algorithm proceeds as follows. The
dataset is split into n parts, namely D1, D2, D3, ..., Dn,
where n is the number of classes and Di represents a
single class. Subsequently, the mean (m) of the
number of instances of all the classes is calculated.
i) For all classes that have a number of instances
less than the mean m, oversampling is performed in
order to obtain a number of instances equal to the
mean. The sampling percentage used for SMOTE is
calculated such that the number of instances in the
class after oversampling is equal to m.
ii) For all classes that have a number of instances
greater than the mean m, undersampling is
conducted to obtain a number of instances equal to
the mean. Recall that the EM technique is used to
discover the clusters within each class (Dempster et
al., 1977). Subsequently, for each cluster within the
current class, instances are randomly selected such
that the total number of instances from all the
clusters is equal to m. Therefore, instead of fixing
the number of instances selected from each cluster,
we fix the total number of instances. It follows that a
different number of instances may be selected from
the various clusters. However, we aim to select the
instances as uniformly as possible. The selected
instances are combined together in order to obtain m
instances (for each class).
iii) All classes for which the number of instances
is equal to the mean m are left untouched.
Input:  Dataset D with n classes
Output: Dataset D' with all classes having m instances,
        where m is the mean number of instances of all classes

Split D into D1, D2, D3, ..., Dn, where Di is a single class
Calculate m

Undersampling:
For each Di, i = 1, 2, ..., n, where number of instances > m
    Cluster Di into clusters C1, C2, ..., Ck using the EM algorithm
    For each cluster Cj, j = 1, 2, ..., k
        Randomly select instances from Cj
        Add the selected instances to Cj'
    End For
    C = Ø
    For j = 1, 2, ..., k
        C = C ∪ Cj'
    End For
    Di = C
End For

Oversampling:
For each Di, i = 1, 2, ..., n, where number of instances < m
    Apply SMOTE on Di to obtain m instances
End For

For each Di, i = 1, 2, ..., n, where number of instances = m
    Leave Di unchanged
End For

D' = Ø
For i = 1, 2, ..., n
    D' = D' ∪ Di
End For
Return D'

Figure 1: SCUT Algorithm.

Finally, all the classes are merged together in order
to obtain a dataset D’, where all the classes have m
instances. Classification may be performed on D’
using an appropriate classifier.
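As a rough Python sketch only (the actual implementation extends WEKA, and the helper names, parameters and fixed number of mixture components below are illustrative assumptions), the procedure can be approximated with scikit-learn's GaussianMixture and imbalanced-learn's SMOTE:

```python
# Illustrative sketch of SCUT: cluster-based undersampling of majority classes
# and SMOTE oversampling of minority classes, both towards the mean class size m.
import numpy as np
from collections import Counter
from sklearn.mixture import GaussianMixture
from imblearn.over_sampling import SMOTE


def cluster_undersample(X_cls, m, n_components=3, seed=0):
    """Draw about m instances from one majority class, spread over EM clusters."""
    rng = np.random.default_rng(seed)
    gmm = GaussianMixture(n_components=min(n_components, len(X_cls)),
                          random_state=seed).fit(X_cls)
    labels = gmm.predict(X_cls)
    # Split the m picks as evenly as possible across clusters; very small
    # clusters may contribute fewer instances than their quota.
    quotas = np.array_split(np.arange(m), gmm.n_components)
    picked = []
    for k, quota in enumerate(quotas):
        members = np.where(labels == k)[0]
        take = min(len(quota), len(members))
        picked.extend(rng.choice(members, size=take, replace=False))
    return X_cls[picked]


def scut(X, y, seed=0):
    """Balance every class towards m, the mean class size (cf. Figure 1)."""
    counts = Counter(y)
    m = int(round(np.mean(list(counts.values()))))

    X_parts, y_parts, smote_targets = [], [], {}
    for cls, count in counts.items():
        X_cls = X[y == cls]
        if count > m:                    # majority class: cluster, then undersample
            X_cls = cluster_undersample(X_cls, m, seed=seed)
        elif count < m:                  # minority class: SMOTE it up to m below
            smote_targets[cls] = m
        X_parts.append(X_cls)
        y_parts.append(np.full(len(X_cls), cls))

    X_new, y_new = np.vstack(X_parts), np.concatenate(y_parts)
    if smote_targets:
        # k_neighbors must be smaller than the rarest class being oversampled
        # (a class with a single instance cannot be SMOTEd at all).
        k = max(1, min(5, min(counts[c] for c in smote_targets) - 1))
        X_new, y_new = SMOTE(sampling_strategy=smote_targets, k_neighbors=k,
                             random_state=seed).fit_resample(X_new, y_new)
    return X_new, y_new
```

The resulting balanced set (X_new, y_new) plays the role of D' and can be passed to any classifier.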
For instance, one of the datasets used in our work
is the Lymphography dataset, as obtained from the
KEEL repository (Alcalá-Fdez et al., 2011). This
dataset concerns detecting the presence of a
lymphoma, together with its current status, and
contains four (4) classes (normal, metastases,
malignant-lymphoma and fibrosis), with 2, 81, 61
and 4 examples, respectively. That is, the dataset has
a high level of imbalance and contains two majority
and two minority classes. The dataset is split into
four (4) classes and the mean is 37.
i) For class 1, the number of instances is 2, so
SMOTE is applied with a sampling percentage of
1850% in order to obtain 37 instances.
ii) For class 2, the number of examples is 81, so
EM is applied and 3 clusters are obtained, with the
numbers of instances equal to 29, 17 and 35
respectively. In order to obtain a total of 37
instances, 12, 12 and 13 instances are randomly
selected from the clusters.
iii) For class 3, the number of instances is 61.
When EM is applied, only one cluster is obtained.
Next, 37 instances are randomly selected from this
one cluster.
iv) The number of instances of class 4 is equal to
4, so SMOTE is applied with a sampling percentage
of 925% in order to obtain 37 instances.
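A quick arithmetic check of these numbers (assuming, as in the example above, that the sampling percentage is expressed as the target class size divided by the original class size, times 100):

```python
# Lymphography walk-through: mean class size and SMOTE percentages.
sizes = {"normal": 2, "metastases": 81, "malignant-lymphoma": 61, "fibrosis": 4}
m = round(sum(sizes.values()) / len(sizes))        # (2 + 81 + 61 + 4) / 4 = 37
pct = {c: 100 * m / n for c, n in sizes.items() if n < m}
print(m)    # 37
print(pct)  # {'normal': 1850.0, 'fibrosis': 925.0}
```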
Lastly, the classes are merged together and a new
dataset of 148 instances (in which each class has 37
examples) is obtained. The next section discusses
our experimental setup and results.
3 EXPERIMENTATION
We implemented our SCUT algorithm by extending
WEKA, an open source data mining tool that was
developed at the University of Waikato. For
classification, the WEKA implementations of four
classifiers, namely J48 (decision tree), SMO
(support vector machine), Naïve Bayes and IBk
(Nearest Neighbour), were used. For IBk, the
number of nearest neighbours (k) was set to five (5),
by inspection. Default values for the other
parameters were used.
A ten-fold cross validation approach was used
for testing and training. Ten-fold cross validation
has been shown to be an effective testing
methodology when datasets are not too small, since
each fold is a good representation of the entire
dataset (Japkowicz, 2001).
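For readers who prefer scikit-learn, the following is only an approximate analogue of this evaluation setup: the classifiers below are stand-ins for the WEKA implementations of J48, SMO, Naïve Bayes and IBk, and the synthetic data merely takes the place of a SCUT-balanced training set.

```python
# Approximate scikit-learn analogue of the ten-fold cross-validation setup.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier      # rough analogue of J48
from sklearn.svm import SVC                          # rough analogue of SMO
from sklearn.naive_bayes import GaussianNB           # Naive Bayes
from sklearn.neighbors import KNeighborsClassifier   # rough analogue of IBk

# Toy imbalanced data; in this work the input would be the balanced dataset D'.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=0)

classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "naive Bayes": GaussianNB(),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)        # ten-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```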
3.1 Benchmarking Datasets
Seven multi-class datasets from the KEEL
repository (Alcalá-Fdez et al., 2011) and the Wine
Quality dataset from the UCI repository (Lichman,
2013) (Cortez et al., 2009) were used in the
experiments. The details of these datasets are
summarized in Table 1. The table shows that the
number of classes in the datasets varies from three
(3) to ten (10) and the number of training examples
ranges from 148 to 6497. Here, the levels of
imbalance and numbers of classes with majority and
minority instances vary considerably.
The WEKA implementation of the EM cluster
analysis algorithm was used. Recall that the EM
approach employs probabilistic models which imply
that the number of clusters does not have to be
specified in advance. A major strength of EM, as
implemented in WEKA, is that it determines the
number of clusters automatically by cross validation.
This cross validation is performed as follows (a
sketch of the procedure is given after the list):
1. Initially, the number of clusters is set to one (1).
2. The training set is split randomly into ten (10)
   folds, as long as the number of instances in the
   training set is not smaller than ten. Otherwise, the
   number of folds is set equal to the number of
   instances.
3. EM is performed ten (10) times using the ten
   (10) folds.
4. The log-likelihood is averaged over all ten (10)
   results. If the log-likelihood increases, the number
   of clusters is increased by one (1) and the
   algorithm resumes from step 2.
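A minimal sketch of this selection loop, substituting scikit-learn's GaussianMixture and KFold for WEKA's internal EM cross-validation (the function name and stopping rule below are illustrative assumptions):

```python
# Choose the number of EM clusters by cross-validated log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold


def choose_n_clusters(X, max_clusters=10, seed=0):
    n_folds = min(10, len(X))            # fall back when there are < 10 instances
    best_k, best_ll = 1, -np.inf
    for k in range(1, max_clusters + 1):
        fold_ll = []
        for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True,
                                         random_state=seed).split(X):
            gmm = GaussianMixture(n_components=k,
                                  random_state=seed).fit(X[train_idx])
            fold_ll.append(gmm.score(X[test_idx]))   # mean held-out log-likelihood
        mean_ll = np.mean(fold_ll)
        if mean_ll <= best_ll:                       # stop once it no longer improves
            break
        best_k, best_ll = k, mean_ll
    return best_k


rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(5, 1, (60, 2))])
print(choose_n_clusters(X_demo))   # typically 2 for this well-separated toy data
```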
Table 1: Datasets.

Dataset        Size   # Classes   Class distribution
Thyroid         720       3       17, 37, 666
Lymphography    148       4       2, 81, 61, 4
Pageblocks      548       5       492, 33, 8, 12, 3
Dermatology     366       6       112, 61, 72, 49, 52, 20
Autos           159       6       3, 20, 48, 46, 29, 13
Ecoli           336       8       143, 77, 52, 35, 20, 5, 2, 2
Wine Quality   6497       7       30, 216, 2138, 2836, 1079, 193, 5
Yeast          1484      10       244, 429, 463, 44, 51, 163, 35, 30, 20, 5
References

Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L. and Herrera, F. (2011). KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3).

Chawla, N. V., Bowyer, K. W., Hall, L. O. and Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.

Han, H., Wang, W.-Y. and Mao, B.-H. (2005). Borderline-SMOTE: A New Over-sampling Method in Imbalanced Data Sets Learning. In Advances in Intelligent Computing (ICIC 2005).