SCUT: Multi-Class Imbalanced Data Classification using SMOTE
and Cluster-based Undersampling
Astha Agrawal¹, Herna L. Viktor¹ and Eric Paquet¹,²
¹School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada
²National Research Council of Canada, Ottawa, Ontario, Canada
Keywords: Multi-Class Imbalance, Undersampling, Oversampling, Classification, Clustering.
Abstract: Class imbalance is a crucial problem in machine learning and occurs in many domains. Specifically, the
two-class problem has received interest from researchers in recent years, leading to solutions for oil spill
detection, tumour discovery and fraudulent credit card detection, amongst others. However, handling class
imbalance in datasets that contain multiple classes, with varying degrees of imbalance, has received limited
attention. In such a multi-class imbalanced dataset, the classification model tends to favour the majority
classes and incorrectly classify instances from the minority classes as belonging to the majority classes,
leading to poor predictive accuracies. Further, there is a need to handle both the imbalances between classes
as well as address the selection of examples within a class (i.e. the so-called within class imbalance). In this
paper, we propose the SCUT hybrid sampling method, which is used to balance the number of training
examples in such a multi-class setting. Our SCUT approach oversamples minority class examples through
the generation of synthetic examples and employs cluster analysis in order to undersample majority classes.
In addition, it handles both within-class and between-class imbalance. Our experimental results on a
number of multi-class problems show that, when the SCUT method is used for pre-processing the data
before classification, we obtain highly accurate models that compare favourably to the state-of-the-art.
1 INTRODUCTION
In an imbalanced dataset used for classification, the
sizes of one or more classes are much greater than
the other classes. The classes with the larger number
of instances are called majority classes and the
classes with the smaller number of instances are
referred to as the minority classes. Intuitively, since
there are a large number of majority class examples,
a classification model tends to favour majority
classes while incorrectly classifying the examples
from the minority classes. However, in imbalanced
datasets, we are often more interested in correctly
classifying the minority classes. For instance, in a
two class setting within the medical domain, if we
are classifying patients’ condition, the minority class
(e.g. cancer) is of more interest than the majority
class (e.g. cancer free). In practice, many problems
have more than two classes. For example, in
bioinformatics, protein family classification, where a
protein may belong to a very small family within the
large Protein Data Bank repository (Viktor et al.,
2013), and protein fold prediction are both
multi-class problems. Typically, in such
a multi-class imbalanced dataset, there are multiple
classes that are underrepresented, that is, there may
be multiple majority classes and multiple minority
classes, resulting in skewed distributions.
A number of research studies have been conducted
in order to improve classification performance on
imbalanced binary class datasets, in which there is
one majority class and one minority class. However,
improving the performance on imbalanced multi-
class datasets has not been researched as
extensively. Consequently, most existing techniques
for improving classification performance on
imbalanced datasets are designed to be applied
directly on binary class imbalanced datasets. These
methods cannot be applied directly on multi-class
datasets (Wang and Yao, 2012). Rather, class
decomposition is usually used to convert a multi-
class problem into a binary class problem. For
instance, the One-versus-one (OVO) approach
employs one classifier for each possible pair of
classes, discarding the remaining instances that do
not belong to the pair under consideration. The One-
versus-all (OVA) approach, on the other hand,
considers one class as the positive class, and merges
the remaining classes to form the negative class. For
‘n’ classes, ‘n’ classifiers are used, and each class
acts as the positive class once (Fernández et al.,
2010). Subsequently, the results from different
classifiers are combined in order to reach a final
decision. Interested readers are referred to (Ramanan
et al., 2007) for detailed discussions of the OVO and
OVA approaches. However, combining results from
classifiers that are trained on different sub-problems
may result in classification errors (Wang and Yao,
2012). In addition, in OVO, each classifier is trained
only on a subset of the dataset, which may lead to
some data regions being left unlearned. In this paper,
we propose a different method to improve
classification performance on multi-class
imbalanced datasets which preserves the structure of
the data, without converting the dataset into a binary
class problem.
In addition to between-class imbalance (i.e. the
imbalance in the number of instances in each
class), within-class imbalance is also commonly
observed in datasets. Such a situation occurs when a
class is composed of different sub-clusters and these
sub-clusters do not contain the same number of
examples (Japkowicz, 2001). It follows that
between-class and within-class imbalances both
affect classification performance. In an attempt to
address these two problems, and in order to improve
classification performance on imbalanced datasets,
sampling methods are often used for pre-processing
the data prior to using a classifier to build a
classification model.
Sampling methods focus on adapting the class
distribution in order to reduce the between-class
imbalance. Sampling methods may be divided into
two categories, namely undersampling and
oversampling. Undersampling reduces the number
of majority class instances and oversampling
increases the number of minority class instances.
Unfortunately, both random oversampling and
undersampling techniques present some weaknesses.
For instance, random oversampling adds duplicate
minority class instances to the minority class. This
may result in smaller and more specific decision
regions causing the learner to over-fit the data. Also,
oversampling may increase the training time.
Random undersampling randomly takes away some
instances from the majority class. A drawback of
this method is that useful information may be taken
away (Han et al., 2005). Further, when performing
random undersampling, if the dataset has within-
class imbalance and some sub-clusters are
represented by very few instances, the probability
that instances from these sub-clusters are retained is
relatively low. Consequently, these instances may
remain unlearned.
SMOTE represents an improvement over random
oversampling in that the minority class is
oversampled by generating “synthetic” examples
(Chawla et al., 2002). However, in highly
imbalanced datasets, too much oversampling (i.e.
oversampling using a high sampling percentage)
may result in overfitting. This is especially
important in a multi-class setting where there are a
number of minority classes with very few examples.
Further, in a multi-class setting, there is a need to
find the correct balance, in terms of number of
examples, between multiple classes. In order to
address this issue, we propose an algorithm called
SCUT (SMOTE and Clustered Undersampling
Technique) which combines SMOTE and cluster-
based undersampling in order to handle between-
class and within-class imbalance.
Undersampling is required to balance the dataset
without using excessive oversampling. If majority
class instances are randomly selected, small
disjuncts with less representative data may remain
unlearned. Clustering the majority classes helps
identify sub-concepts, and if at least one instance is
selected from each sub-concept (cluster) while doing
undersampling, this issue might be addressed
(Sobhani et al., 2014). This reduces the risk of
unlearned regions when within-class imbalance
exists. In this setting,
combining clustering and undersampling makes
sense as it addresses the disadvantage of random
undersampling. To this end, Yen and Lee proposed
several cluster-based undersampling approaches to
select representative data as training data to improve
the classification accuracy for the minority class
(Yen and Lee, 2009). The main idea behind their
cluster-based undersampling approaches was based
on the assumption that each dataset has different
clusters and each cluster seems to have distinct
characteristics. Subsequently, from each cluster, a
suitable number of majority class samples were
selected (Yen and Lee, 2009). Rahman and Davis
also used a cluster-based undersampling technique
for classifying imbalanced cardiovascular data that
not only balances the data in a dataset, but further
selects good quality training set data for building
classification models (Rahman and Davis, 2013).
Chawla et al. combined random undersampling
with SMOTE, so that the minority class had a larger
presence in the training set. By combining
undersampling and oversampling, the initial bias of
the learner towards the majority class is reversed in
favour of the minority class (Chawla et al.,
2002). In summary, cluster-based undersampling
ensures that all sub-concepts are adequately
represented. When used in conjunction with
SMOTE, the hybrid sampling method thus helps to
ensure that between-class imbalance is reduced
without excessive use of oversampling and
undersampling.
This paper is organized as follows. Section 2
contains a description of the proposed method. In
Section 3, the experimental setup and results are
presented while Section 4 concludes the paper and
discusses our future plans.
2 SCUT ALGORITHM
Our SCUT algorithm combines both undersampling
and oversampling techniques in order to reduce the
imbalance between classes in a multi-class setting.
The pseudocode for our SCUT method is shown in
Figure 1.
For undersampling, we employ a cluster-based
undersampling technique, using the Expectation
Maximization (EM) algorithm (Dempster et al.,
1977). The EM algorithm replaces the hard clusters
by a probability distribution formed by a mixture of
Gaussians. Instead of being assigned to a particular
cluster, each member has a certain probability to
belong to a particular Gaussian distribution of the
mixture. The parameters of the mixture, including
the number of Gaussians, are determined with the
Expectation Maximization algorithm. An advantage
of using EM is that the number of clusters does not
have to be specified beforehand. EM clustering may
be used to find both hard and soft clusters. That is,
EM assigns a probability distribution to each
instance relative to each particular cluster (Dempster
et al., 1977).
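As an illustration of this soft-clustering behaviour (and not the WEKA-based implementation used in this work), the following sketch uses scikit-learn's GaussianMixture, which fits the same kind of Gaussian mixture by EM; note that, unlike WEKA's EM, it does not select the number of components itself, a point we return to in Section 3.

```python
# Illustration only: EM clustering of a single class with scikit-learn's
# GaussianMixture (this work used the WEKA EM implementation instead).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),        # stand-in for one class Di
               rng.normal(4, 1, (30, 2))])       # with two sub-concepts

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X)[:5])         # hard cluster assignments
print(gmm.predict_proba(X)[:5])   # soft membership probabilities per instance
```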
The SCUT algorithm proceeds as follows. The
dataset is split into n parts, namely D1, D2, D3, ..., Dn,
where n is the number of classes and Di represents a
single class. Subsequently, the mean (m) of the
number of instances of all the classes is calculated.
i) For all classes that have a number of instances
less than the mean m, oversampling is performed in
order to obtain a number of instances equal to the
mean. The sampling percentage used for SMOTE is
calculated such that the number of instances in the
class after oversampling is equal to m.
ii) For all classes that have a number of instances
greater than the mean m, undersampling is
conducted to obtain a number of instances equal to
the mean. Recall that the EM technique is used to
discover the clusters within each class (Dempster et
al., 1977). Subsequently, for each cluster within the
current class, instances are randomly selected such
that the total number of instances from all the
clusters is equal to m. Therefore, instead of fixing
the number of instances selected from each cluster,
we fix the total number of instances. It follows that a
different number of instances may be selected from
the various clusters. However, we aim to select the
instances as uniformly as possible. The selected
instances are combined together in order to obtain m
instances (for each class).
iii) All classes for which the number of instances
is equal to the mean m are left untouched.
Input:  Dataset D with n classes
Output: Dataset D' with all classes having m instances,
        where m is the mean number of instances of all classes

Split D into D1, D2, D3, ..., Dn, where Di is a single class
Calculate m

Undersampling:
For each Di, i = 1, 2, ..., n, where number of instances > m
    Cluster Di into clusters C1, C2, ..., Ck using the EM algorithm
    For each cluster Cj, j = 1, 2, ..., k
        Randomly select instances from Cj
        Add the selected instances to Cj'
    End For
    C = Ø
    For j = 1, 2, ..., k
        C = C ∪ Cj'
    End For
    Di = C
End For

Oversampling:
For each Di, i = 1, 2, ..., n, where number of instances < m
    Apply SMOTE on Di to obtain m instances
End For

For each Di, i = 1, 2, ..., n, where number of instances = m
    Leave Di unchanged
End For

D' = Ø
For i = 1, 2, ..., n
    D' = D' ∪ Di
End For
Return D'

Figure 1: SCUT Algorithm.

Finally, all the classes are merged together in order
to obtain a dataset D’, where all the classes have m
instances. Classification may be performed on D’
using an appropriate classifier.
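As a rough Python sketch only (the actual implementation extends WEKA, and the helper names, parameters and fixed number of mixture components below are illustrative assumptions), the procedure can be approximated with scikit-learn's GaussianMixture and imbalanced-learn's SMOTE:

```python
# Illustrative sketch of SCUT: cluster-based undersampling of majority classes
# and SMOTE oversampling of minority classes, both towards the mean class size m.
import numpy as np
from collections import Counter
from sklearn.mixture import GaussianMixture
from imblearn.over_sampling import SMOTE


def cluster_undersample(X_cls, m, n_components=3, seed=0):
    """Draw about m instances from one majority class, spread over EM clusters."""
    rng = np.random.default_rng(seed)
    gmm = GaussianMixture(n_components=min(n_components, len(X_cls)),
                          random_state=seed).fit(X_cls)
    labels = gmm.predict(X_cls)
    # Split the m picks as evenly as possible across clusters; very small
    # clusters may contribute fewer instances than their quota.
    quotas = np.array_split(np.arange(m), gmm.n_components)
    picked = []
    for k, quota in enumerate(quotas):
        members = np.where(labels == k)[0]
        take = min(len(quota), len(members))
        picked.extend(rng.choice(members, size=take, replace=False))
    return X_cls[picked]


def scut(X, y, seed=0):
    """Balance every class towards m, the mean class size (cf. Figure 1)."""
    counts = Counter(y)
    m = int(round(np.mean(list(counts.values()))))

    X_parts, y_parts, smote_targets = [], [], {}
    for cls, count in counts.items():
        X_cls = X[y == cls]
        if count > m:                    # majority class: cluster, then undersample
            X_cls = cluster_undersample(X_cls, m, seed=seed)
        elif count < m:                  # minority class: SMOTE it up to m below
            smote_targets[cls] = m
        X_parts.append(X_cls)
        y_parts.append(np.full(len(X_cls), cls))

    X_new, y_new = np.vstack(X_parts), np.concatenate(y_parts)
    if smote_targets:
        # k_neighbors must be smaller than the rarest class being oversampled
        # (a class with a single instance cannot be SMOTEd at all).
        k = max(1, min(5, min(counts[c] for c in smote_targets) - 1))
        X_new, y_new = SMOTE(sampling_strategy=smote_targets, k_neighbors=k,
                             random_state=seed).fit_resample(X_new, y_new)
    return X_new, y_new
```

The resulting balanced set (X_new, y_new) plays the role of D' and can be passed to any classifier.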
For instance, one of the datasets used in our work
is the Lymphography dataset, as obtained from the
KEEL repository (Alcalá-Fdez et al., 2011). This
dataset concerns detecting the presence of a
lymphoma, together with its current status, and
contains four (4) classes (normal, metastases,
malignant-lymphoma and fibrosis), with 2, 81, 61
and 4 examples, respectively. That is, the dataset has
a high level of imbalance and contains two majority
and two minority classes. The dataset is split into
four (4) classes and the mean is 37.
i) For class 1, the number of instances is 2, so
SMOTE is applied with a sampling percentage of
1850% in order to obtain 37 instances.
ii) For class 2, the number of examples is 81, so
EM is applied and 3 clusters are obtained, with the
numbers of instances equal to 29, 17 and 35
respectively. In order to obtain a total of 37
instances, 12, 12 and 13 instances are randomly
selected from the clusters.
iii) For class 3, the number of instances is 61.
When EM is applied, only one cluster is obtained.
Next, 37 instances are randomly selected from this
one cluster.
iv) The number of instances of class 4 is equal to
4, so SMOTE is applied with a sampling percentage
of 925% in order to obtain 37 instances.
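A quick arithmetic check of these numbers (assuming, as in the example above, that the sampling percentage is expressed as the target class size divided by the original class size, times 100):

```python
# Lymphography walk-through: mean class size and SMOTE percentages.
sizes = {"normal": 2, "metastases": 81, "malignant-lymphoma": 61, "fibrosis": 4}
m = round(sum(sizes.values()) / len(sizes))        # (2 + 81 + 61 + 4) / 4 = 37
pct = {c: 100 * m / n for c, n in sizes.items() if n < m}
print(m)    # 37
print(pct)  # {'normal': 1850.0, 'fibrosis': 925.0}
```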
Lastly, the classes are merged together and a new
dataset of 148 instances (in which each class has 37
examples) is obtained. The next section discusses
our experimental setup and results.
3 EXPERIMENTATION
We implemented our SCUT algorithm by extending
WEKA, an open source data mining tool that was
developed at the University of Waikato. For
classification, the WEKA implementations of four
classifiers, namely J48 (decision tree), SMO
(support vector machine), Naïve Bayes and IBk
(Nearest Neighbour), were used. For IBk, the
number of nearest neighbours (k) was set to five (5),
by inspection. Default values for the other
parameters were used.
A ten-fold cross validation approach was used
for testing and training. Ten-fold cross validation
has been shown to be an effective testing
methodology when datasets are not too small, since
each fold is a good representation of the entire
dataset (Japkowicz, 2001).
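For readers who prefer scikit-learn, the following is only an approximate analogue of this evaluation setup: the classifiers below are stand-ins for the WEKA implementations of J48, SMO, Naïve Bayes and IBk, and the synthetic data merely takes the place of a SCUT-balanced training set.

```python
# Approximate scikit-learn analogue of the ten-fold cross-validation setup.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier      # rough analogue of J48
from sklearn.svm import SVC                          # rough analogue of SMO
from sklearn.naive_bayes import GaussianNB           # Naive Bayes
from sklearn.neighbors import KNeighborsClassifier   # rough analogue of IBk

# Toy imbalanced data; in this work the input would be the balanced dataset D'.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=0)

classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "naive Bayes": GaussianNB(),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)        # ten-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```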
3.1 Benchmarking Datasets
Seven multi-class datasets from the KEEL
repository (Alcalá-Fdez et al., 2011) and the Wine
Quality dataset from the UCI repository (Lichman,
2013) (Cortez et al., 2009) were used in the
experiments. The details of these datasets are
summarized in Table 1. The table shows that the
number of classes in the datasets varies from three
(3) to ten (10) and the number of training examples
ranges from 148 to 6497. Here, the levels of
imbalance and numbers of classes with majority and
minority instances vary considerably.
The WEKA implementation of the EM cluster
analysis algorithm was used. Recall that the EM
approach employs probabilistic models which imply
that the number of clusters does not have to be
specified in advance. A major strength of EM, as
implemented in WEKA, is that it determines the
number of clusters automatically by cross validation.
This cross validation is performed as follows (a
sketch of the procedure is given after the list):
1. Initially, the number of clusters is set to one (1).
2. The training set is split randomly into ten (10)
   folds, as long as the number of instances in the
   training set is not smaller than ten. Otherwise, the
   number of folds is set equal to the number of
   instances.
3. EM is performed ten (10) times using the ten
   (10) folds.
4. The log-likelihood is averaged over all ten (10)
   results. If the log-likelihood increases, the number
   of clusters is increased by one (1) and the
   algorithm resumes from step 2.
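A minimal sketch of this selection loop, substituting scikit-learn's GaussianMixture and KFold for WEKA's internal EM cross-validation (the function name and stopping rule below are illustrative assumptions):

```python
# Choose the number of EM clusters by cross-validated log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold


def choose_n_clusters(X, max_clusters=10, seed=0):
    n_folds = min(10, len(X))            # fall back when there are < 10 instances
    best_k, best_ll = 1, -np.inf
    for k in range(1, max_clusters + 1):
        fold_ll = []
        for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True,
                                         random_state=seed).split(X):
            gmm = GaussianMixture(n_components=k,
                                  random_state=seed).fit(X[train_idx])
            fold_ll.append(gmm.score(X[test_idx]))   # mean held-out log-likelihood
        mean_ll = np.mean(fold_ll)
        if mean_ll <= best_ll:                       # stop once it no longer improves
            break
        best_k, best_ll = k, mean_ll
    return best_k


rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(5, 1, (60, 2))])
print(choose_n_clusters(X_demo))   # typically 2 for this well-separated toy data
```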
Table 1: Datasets.

Dataset        Size   # Classes   Class distribution
Thyroid         720       3       17, 37, 666
Lymphography    148       4       2, 81, 61, 4
Pageblocks      548       5       492, 33, 8, 12, 3
Dermatology     366       6       112, 61, 72, 49, 52, 20
Autos           159       6       3, 20, 48, 46, 29, 13
Ecoli           336       8       143, 77, 52, 35, 20, 5, 2, 2
Wine Quality   6497       7       30, 216, 2138, 2836, 1079, 193, 5
Yeast          1484      10       244, 429, 463, 44, 51, 163, 35, 30, 20, 5
References

Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L. and Herrera, F. (2011). KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3).

Chawla, N. V., Bowyer, K. W., Hall, L. O. and Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.

Han, H., Wang, W.-Y. and Mao, B.-H. (2005). Borderline-SMOTE: A New Over-sampling Method in Imbalanced Data Sets Learning. In Advances in Intelligent Computing (ICIC 2005).