Proceedings ArticleDOI

Swift Imbalance Data Classification using SMOTE and Extreme Learning Machine

01 Feb 2019
TL;DR: A hybrid method for classifying binary imbalanced data using the Synthetic Minority Over-sampling Technique (SMOTE) followed by an Extreme Learning Machine (ELM) is proposed, which combines a fast learning rate with effective prediction of the desired class.
Abstract: Continuous expansion in the fields of science and technology has led to the immense availability and attainability of data in every field. Fundamentally understanding and analyzing this data is a critical job in the decision-making process. Although great success has been achieved by prevailing data engineering and mining techniques, the problem of swiftly classifying imbalanced data persists in academia and industry. Skewness in data can be mitigated by upsampling or downsampling. A few techniques exist that first remove skewness and then perform classification; however, these methods suffer from drawbacks such as poor precision or slow learning rates. In this paper, a hybrid method to classify binary imbalanced data using the Synthetic Minority Over-sampling Technique (SMOTE) followed by an Extreme Learning Machine (ELM) is proposed. Our method combines a fast learning rate with effective prediction of the desired class. We verified our model on five standard imbalanced datasets and obtained higher F-measure, G-mean, and ROC scores for all of them.
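The pipeline the abstract describes can be pictured as follows. This is a minimal sketch, not the authors' code: it assumes imbalanced-learn's SMOTE implementation, a synthetic toy dataset, and an illustrative hidden-layer size of 200 with a tanh activation, none of which come from the paper.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy imbalanced binary problem (roughly 9:1 class ratio).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: SMOTE balances the training split only (never the test split).
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# Step 2: ELM -- random, untrained hidden layer; output weights solved in closed form.
rng = np.random.default_rng(0)
n_hidden = 200                                   # illustrative, not the paper's setting
W = rng.normal(size=(X.shape[1], n_hidden))      # random input weights
b = rng.normal(size=n_hidden)                    # random biases
H = np.tanh(X_bal @ W + b)                       # hidden-layer output matrix
beta = np.linalg.pinv(H) @ y_bal                 # least-squares output weights

y_pred = (np.tanh(X_te @ W + b) @ beta > 0.5).astype(int)
print("F1 on held-out data:", f1_score(y_te, y_pred))
```

The ELM property behind the paper's "swift" claim is visible in the last two training lines: the hidden layer is random and fixed, so training reduces to a single least-squares solve with no iterative loop.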
Citations
Journal ArticleDOI
TL;DR: This study proposes an automatic classification model for unstructured data, using transportation-related citizen requests from January 15th, 2016 until November 7th, 2018 of the City of Boston, USA, as an example; it has substantial academic importance given that it validates diverse machine learning-related theories through their application to unstructured data.
Abstract: Responding to requests from citizens is an essential administrative service that affects the daily life of people. The drastic increase in the volume of citizen requests in recent years has necessitated ongoing studies on the automatic classification of citizen requests due to the time, effort, and misclassification errors involved in manual classification. Even though prior studies have analyzed citizen requests according to topic and frequency, they ignore the complicated and dynamic nature of such a dataset. Using a deep learning algorithm, this study proposes an automatic classification model for unstructured data, taking transportation-related citizen requests from January 15th, 2016 until November 7th, 2018 of the City of Boston, USA, as an example. A combination of unsupervised and supervised learning was applied to the data. To address the issue of imbalance in the data, this study also considered an equalization method. Five stepwise models were applied to increase the classification accuracy for the unstructured data. The final model achieved a classification accuracy of 90%. The model proposed in this study is expected to generalize to the classification of other citizen requests or unstructured text data on specific topics in the future. Moreover, this study has substantial academic importance given that it validates diverse machine learning-related theories through their application to unstructured data.

19 citations
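The study's own classifier is a deep network combining unsupervised and supervised learning; the sketch below only illustrates the overall pipeline shape (vectorize text, equalize class counts, train, predict) with simple stand-in components. The example requests, the labels, and the use of TF-IDF plus logistic regression are all assumptions for illustration, not the paper's architecture.

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical citizen requests; the real data would come from Boston's 311 records.
texts = ["pothole on main street", "broken traffic light at 5th avenue",
         "request a new bus stop", "signal timing too short for pedestrians",
         "cracked pavement near the school", "bus shelter glass is damaged"]
labels = ["road", "signal", "transit", "signal", "road", "transit"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# "Equalization" step: over-sample so every class has equal support.
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X, labels)

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print(clf.predict(vec.transform(["traffic light stuck on red"])))
```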

Journal ArticleDOI
TL;DR: A novel weighted-class classification scheme is proposed to secure the network against malicious nodes while alleviating the problem of imbalanced data; it enhances the overall detection accuracy.
Abstract: Recently, machine learning algorithms have been proposed to design new security systems for anomaly detection, as they exhibit fast processing with real-time predictions. However, one of the major challenges in machine learning-based intrusion detection methods is how to include enough training examples for all the possible classes in the model to avoid the class imbalance problem and accurately detect the intrusions and their types. In this article, we propose a novel weighted-class classification scheme to secure the network against malicious nodes while alleviating the problem of imbalanced data. In the proposed system, we combine a supervised machine learning algorithm with the network nodes' past information and a specifically designed best-effort iterative algorithm to enhance the accuracy of rarely detectable attacks. The machine learning algorithm is used to generate a classifier that differentiates between the investigated attacks. The system stores these decisions in a private database. Then, we design a new weight optimization algorithm that exploits these decisions to generate a weights vector that includes the best weight for each class. The proposed model enhances the overall detection accuracy and maximizes the number of correctly detectable classes, even for classes with a relatively low number of training entries. The UNSW dataset has been used to evaluate the performance of the proposed model and compare it with state-of-the-art techniques.

15 citations
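The article's weight-optimization algorithm is its own contribution and is not reproduced here; the sketch below only shows the underlying idea of a per-class weight vector that boosts rare classes, using scikit-learn's inverse-frequency "balanced" heuristic on a synthetic stand-in for intrusion records.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic 3-class stand-in for intrusion records; class 2 is the rare attack type.
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.80, 0.15, 0.05], random_state=0)

# Inverse-frequency weights: the rarer the class, the larger its weight.
classes = np.unique(y)
w = compute_class_weight("balanced", classes=classes, y=y)
weights = dict(zip(classes, w))
print("per-class weights:", weights)

# Any classifier that accepts class_weight can consume the weight vector directly.
clf = RandomForestClassifier(class_weight=weights, random_state=0).fit(X, y)
```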


Cites methods from "Swift Imbalance Data Classification..."

  • ...Several techniques have been used to alleviate the problem of class imbalance, including data sampling (e.g., Random Undersampling (RUS) [12], which discards instances from the majority class) and boosting-based methods ([6], RUSBoost [13]), where synthetic examples from the rare or minority class are created to interpolate between existing minority-class instances, thus creating a denser minority class.... (The interpolation step is written out after this excerpt.)

    [...]
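The interpolation mentioned above is SMOTE's core operation: a synthetic point is placed at a random position on the line segment between a minority instance and one of its k nearest minority-class neighbors.

```latex
x_{\text{new}} = x_i + \lambda \,(\hat{x}_i - x_i), \qquad \lambda \sim U(0, 1)
```

Here $x_i$ is a minority-class instance, $\hat{x}_i$ is one of its $k$ nearest minority-class neighbors, and $\lambda$ is drawn uniformly at random, so $x_{\text{new}}$ always lies between two existing minority examples.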

Proceedings ArticleDOI
22 Sep 2021
TL;DR: In this article, a synthesized dataset that reflects real-time failures encountered in industry was used in a predictive maintenance model. Class imbalance hinders the performance of machine learning algorithms, and this is handled by evaluating SMOTE-based oversampling techniques.
Abstract: The identification of failures and defects in industrial machines has proven to be a challenge to gauging their warranty and performance. Depreciation in industrial machines occurs due to several factors, most commonly tool wear, strain, heat, and power failure. In this paper, machine learning algorithms are developed for the prediction of machine failures. A synthesized dataset that reflects real-time failures encountered in industry was used in the predictive maintenance model. Class imbalance hinders the performance of machine learning algorithms, and this is handled by evaluating SMOTE-based oversampling techniques. Using the SMOTE technique, a 7.83% increase in the AUC score is observed, improving the performance of the Random Forest classifier in distinguishing instances of non-failure from machine failures.

9 citations
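The before/after comparison behind a result like that 7.83% AUC gain can be reproduced in outline as below; the dataset here is synthetic, so the exact improvement will differ from the paper's predictive-maintenance data.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the machine-failure data (failures are ~5% of samples).
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Train the same classifier on the raw and the SMOTE-balanced training split.
variants = {"plain": (X_tr, y_tr),
            "SMOTE": SMOTE(random_state=1).fit_resample(X_tr, y_tr)}
for name, (Xf, yf) in variants.items():
    rf = RandomForestClassifier(random_state=1).fit(Xf, yf)
    auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
    print(f"{name:5s} AUC = {auc:.3f}")
```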

Book ChapterDOI
01 Jan 2021
TL;DR: In this article, the authors focus on the SMOTE and extreme learning machine algorithms to deal with the problem of highly imbalanced data, a setting in which ensemble learning algorithms also tend to be more effective classifiers.
Abstract: Imbalanced data are a common classification problem. Since it occurs in most real-world fields, the issue is increasingly important, and it is of particular concern for highly imbalanced datasets (when the class ratio is high). Different techniques have been developed for supervised learning on such sets. SMOTE is a well-known over-sampling method that addresses imbalance at the data level: it synthesizes new minority examples between pairs of closely located minority vectors. The learning algorithm itself, however, is typically not designed for imbalanced data, and ensemble learning algorithms have proven more effective at classifying imbalanced sets; simple ensemble ideas can be combined with the SMOTE algorithm to work with imbalanced data. There are detailed studies on imbalanced-data problems and the various approaches to resolving them; here we focus mainly on SMOTE and extreme learning machine algorithms.

7 citations
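One concrete way to pair SMOTE with an ensemble learner, as the chapter discusses, is to place the sampler and the classifier in a single resampling-aware pipeline so that SMOTE is re-fit on each training fold and never touches validation data. The sketch assumes imbalanced-learn's Pipeline and a random forest as the ensemble; the chapter does not prescribe these exact components.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Sampler + ensemble in one estimator: resampling happens inside fit(),
# so cross-validation scores are not inflated by synthetic test points.
model = Pipeline([("smote", SMOTE(random_state=0)),
                  ("forest", RandomForestClassifier(random_state=0))])

print("CV ROC-AUC:", cross_val_score(model, X, y, scoring="roc_auc").mean())
```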

Journal ArticleDOI
TL;DR: A geographically distributed system built on the Raspberry Pi 3B+ is presented that classifies field images in real time using classifiers trained on pre-classifications provided by a human supervisor.
Abstract: Digital agriculture contributes to agricultural efficiency through the use of tools such as computer vision, robotics, and precision agriculture. The objective of this work was to develop a system capable of classifying images through the recognition of pre-established patterns. To this end, a geographically distributed system was created, based on the Raspberry Pi 3B+ computer, which captures images in the field and stores them in a database, where they are made available to receive a pre-classification from a supervisor. After that, classifiers are generated, evaluated, and sent to the remote device to perform classification in real time. To evaluate the system, 23 classes grouped into 3 superclasses were defined, 36,979 images were captured, and 1,579 pre-classifications were performed, enabling classification tests via cross-validation with the number of splits equal to the number of classes and with shuffled data. These tests showed that the accuracy delivered by each classifier differs and is directly proportional to the number and balance of the samples, with accuracy varying from 11% to 79% for 26 and 2,200 samples considered, respectively. The system's response time was evaluated over 1,585 periods and remained at approximately 0.20 seconds, which, at a controlled vehicle speed, allows it to be used for real-time dispersal of inputs.

3 citations

References
Journal ArticleDOI
TL;DR: A method of over-sampling the minority class by creating synthetic minority-class examples is proposed and evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
Abstract: An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.

17,313 citations
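The combination the abstract advocates (SMOTE on the minority class plus under-sampling of the majority class) can be sketched with imbalanced-learn; the specific target ratios below are illustrative choices, not values from the paper.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# Step 1: SMOTE grows the minority class to 50% of the majority size.
X_o, y_o = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X, y)
# Step 2: random under-sampling trims the majority class down to a 1:1 ratio.
X_b, y_b = RandomUnderSampler(sampling_strategy=1.0, random_state=0).fit_resample(X_o, y_o)
print("after: ", Counter(y_b))
```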

Journal ArticleDOI
TL;DR: A new learning algorithm called ELM is proposed for single-hidden-layer feedforward neural networks (SLFNs); it randomly chooses hidden nodes and analytically determines the output weights, and tends to provide good generalization performance at extremely fast learning speed.

10,217 citations
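The analytic step the TL;DR refers to is a single linear solve. For N training pairs (x_i, t_i) and L hidden nodes with activation g, the random hidden layer produces an N-by-L matrix H, and the output weights β come from its Moore-Penrose pseudoinverse:

```latex
H_{ij} = g(w_j \cdot x_i + b_j), \qquad
H\beta = T, \qquad
\hat{\beta} = H^{\dagger} T
```

The input weights $w_j$ and biases $b_j$ are drawn at random and never updated, so there is no iterative training loop at all; this closed-form solve is the source of ELM's speed.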


"Swift Imbalance Data Classification..." refers methods in this paper

  • ...[26] proposed a simple and effective learning algorithm known as ELM....

    [...]

Journal ArticleDOI
TL;DR: A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.
Abstract: With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.

6,320 citations


"Swift Imbalance Data Classification..." refers background or methods in this paper

  • ...Certain significant methods for tackling the imbalanced dataset situation include two basic broad fields: Sampling methods for imbalanced data and Cost-Sensitive Methods for Imbalanced Learning [1]....

    [...]

  • ...For evaluation purposes, we have used Geometric mean (G-mean), F-score and ROC curve as the evaluation metrics, as these are principally used for assessing data classification algorithms [1].... (The first two metrics are written out after this list.)

    [...]

  • ...that the dataset available in these fields was imbalanced, which was the principal reason of the substandard classification performance [1]....

    [...]

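For reference, the two imbalance-aware metrics named in the excerpts above are, in standard confusion-matrix notation (TP, TN, FP, FN):

```latex
\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```

G-mean is the geometric mean of the true-positive and true-negative rates, so it collapses to zero if either class is ignored entirely, which is exactly the failure mode that plain accuracy hides on skewed data.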

Journal ArticleDOI
TL;DR: This work performs a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets, and shows that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC).
Abstract: There are several aspects that might influence the performance achieved by existing learning systems. It has been reported that one of these aspects is related to class imbalance, in which examples in training data belonging to one class heavily outnumber the examples in the other class. In this situation, which is found in real-world data describing an infrequent but important event, the learning system may have difficulty learning the concept related to the minority class. In this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets. Our experiments provide evidence that class imbalance does not systematically hinder the performance of learning systems. In fact, the problem seems to be related to learning with too few minority class examples in the presence of other complicating factors, such as class overlapping. Two of our proposed methods deal with these conditions directly, allying a known over-sampling method with data cleaning methods in order to produce better-defined class clusters. Our comparative experiments show that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC). This result seems to contradict results previously published in the literature. Two of our proposed methods, Smote + Tomek and Smote + ENN, presented very good results for data sets with a small number of positive examples. Moreover, Random over-sampling, a very simple over-sampling method, is very competitive with more complex over-sampling methods. Since the over-sampling methods provided very good performance results, we also measured the syntactic complexity of the decision trees induced from over-sampled data. Our results show that these trees are usually more complex than the ones induced from original data. Random over-sampling usually produced the smallest increase in the mean number of induced rules, and Smote + ENN the smallest increase in the mean number of conditions per rule, when compared among the investigated over-sampling methods.

2,914 citations
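Both hybrid methods the abstract highlights are available off the shelf in imbalanced-learn; the sketch below applies them to a synthetic dataset just to show the call pattern.

```python
from collections import Counter

from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("original:   ", Counter(y))

# SMOTE + Tomek links: over-sample, then drop Tomek pairs straddling the boundary.
X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)
print("SMOTE+Tomek:", Counter(y_st))

# SMOTE + ENN: over-sample, then clean noisy points with Edited Nearest Neighbours.
X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)
print("SMOTE+ENN:  ", Counter(y_se))
```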


"Swift Imbalance Data Classification..." refers methods in this paper

  • ...Some techniques include random undersampling and oversampling; informed undersampling [7]; synthetic sampling such as the Synthetic Minority Over-sampling Technique (SMOTE), which has shown a great deal of success in various applications [8]; and sampling with data cleaning techniques like the Condensed Nearest Neighbor rule [9], the OSS method [10], and the Neighborhood Cleaning rule (NCL) [11]....

    [...]

  • ...However, in case of a low imbalance ratio J, one can use downsampling, Condensed Nearest Neighbour (CNN) with T-Link [9], along with ELM (a CNN+T-Link sketch follows this list)....

    [...]

  • ...For comparing, we have also used a downsampling technique, CNN+T-Link [9]....

    [...]

  • ...The prior three have been used with original dataset, upsampled dataset (by using SMOTE) and downsampled dataset (by using CNN+T-Link)....

    [...]

  • ...In this work, four different models, namely DTC, ELM, and K-Means (each with the original data, data upsampled using SMOTE, and data downsampled using CNN+T-Link), along with W-ELM, have been compared against each other....

    [...]
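The downsampling baseline named in the excerpts, CNN followed by Tomek-link removal, can be approximated with imbalanced-learn's built-in samplers; chaining them as below is an assumption about the paper's exact procedure, which may differ in ordering or parameters.

```python
from collections import Counter

from imblearn.under_sampling import CondensedNearestNeighbour, TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# CNN keeps only a consistent subset of the majority class...
X_c, y_c = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)
# ...then Tomek-link removal deletes remaining borderline majority points.
X_d, y_d = TomekLinks().fit_resample(X_c, y_c)

print(Counter(y), "->", Counter(y_d))
```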
