Proceedings ArticleDOI

Swift Imbalance Data Classification using SMOTE and Extreme Learning Machine

01 Feb 2019
TL;DR: A hybrid method for classifying binary imbalanced data using the Synthetic Minority Over-sampling Technique (SMOTE) followed by an Extreme Learning Machine (ELM) is proposed, which combines a fast learning rate with effective prediction of the desired class.
Abstract: Continuous expansion in the fields of science and technology has led to the immense availability and attainability of data in every field. Fundamentally understanding and analyzing this data is a critical job in the decision-making process. Although great success has been achieved by prevailing data engineering and mining techniques, the problem of swiftly classifying imbalanced data persists in academia and industry. Skewness in data can be mitigated by upsampling or downsampling. A few techniques exist that first remove skewness and then perform classification; however, these methods suffer from drawbacks such as poor precision or slow learning rates. In this paper, a hybrid method to classify binary imbalanced data using the Synthetic Minority Over-sampling Technique (SMOTE) followed by an Extreme Learning Machine (ELM) is proposed. Our method combines a fast learning rate with effective prediction of the desired class. We verified our model on five standard imbalanced datasets and obtained higher F-measure, G-mean, and ROC scores for all of them.
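The pipeline the abstract describes can be pictured as follows. This is a minimal sketch, not the authors' code: it assumes imbalanced-learn's SMOTE implementation, a synthetic toy dataset, and an illustrative hidden-layer size of 200 with a tanh activation, none of which come from the paper.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy imbalanced binary problem (roughly 9:1 class ratio).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: SMOTE balances the training split only (never the test split).
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# Step 2: ELM -- random, untrained hidden layer; output weights solved in closed form.
rng = np.random.default_rng(0)
n_hidden = 200                                   # illustrative, not the paper's setting
W = rng.normal(size=(X.shape[1], n_hidden))      # random input weights
b = rng.normal(size=n_hidden)                    # random biases
H = np.tanh(X_bal @ W + b)                       # hidden-layer output matrix
beta = np.linalg.pinv(H) @ y_bal                 # least-squares output weights

y_pred = (np.tanh(X_te @ W + b) @ beta > 0.5).astype(int)
print("F1 on held-out data:", f1_score(y_te, y_pred))
```

The ELM property behind the paper's "swift" claim is visible in the last two training lines: the hidden layer is random and fixed, so training reduces to a single least-squares solve with no iterative loop.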
Citations
Journal ArticleDOI
TL;DR: This study proposes an automatic classification model for unstructured data, using transportation-related citizen requests from January 15th, 2016 until November 7th, 2018 of the City of Boston, USA, as an example; it has substantial academic importance given that it validates diverse machine learning-related theories through their application to unstructured data.
Abstract: Responding to requests from citizens is an essential administrative service that affects the daily life of people. The drastic increase in the volume of citizen requests in recent years has necessitated ongoing studies on the automatic classification of citizen requests due to the time, effort, and misclassification errors involved in manual classification. Even though prior studies have analyzed citizen requests according to topic and frequency, they ignore the complicated and dynamic nature of such a dataset. Using a deep learning algorithm, this study proposes an automatic classification model for unstructured data, taking transportation-related citizen requests from January 15th, 2016 until November 7th, 2018 of the City of Boston, USA, as an example. A combination of unsupervised and supervised learning was applied to the data. To address the issue of imbalance in the data, this study also considered an equalization method. Five stepwise models were applied to increase the classification accuracy for the unstructured data. The final model achieved a classification accuracy of 90%. The model proposed in this study is expected to generalize to the classification of other citizen requests or unstructured text data on specific topics in the future. Moreover, this study has substantial academic importance given that it validates diverse machine learning-related theories through their application to unstructured data.

19 citations
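The study's own classifier is a deep network combining unsupervised and supervised learning; the sketch below only illustrates the overall pipeline shape (vectorize text, equalize class counts, train, predict) with simple stand-in components. The example requests, the labels, and the use of TF-IDF plus logistic regression are all assumptions for illustration, not the paper's architecture.

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical citizen requests; the real data would come from Boston's 311 records.
texts = ["pothole on main street", "broken traffic light at 5th avenue",
         "request a new bus stop", "signal timing too short for pedestrians",
         "cracked pavement near the school", "bus shelter glass is damaged"]
labels = ["road", "signal", "transit", "signal", "road", "transit"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# "Equalization" step: over-sample so every class has equal support.
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X, labels)

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print(clf.predict(vec.transform(["traffic light stuck on red"])))
```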

Journal ArticleDOI
TL;DR: A novel weighted-class classification scheme is proposed to secure the network against malicious nodes while alleviating the problem of imbalanced data; it enhances the overall detection accuracy.
Abstract: Recently, machine learning algorithms have been proposed to design new security systems for anomaly detection, as they exhibit fast processing with real-time predictions. However, one of the major challenges in machine learning-based intrusion detection methods is how to include enough training examples for all the possible classes in the model to avoid the class imbalance problem and accurately detect the intrusions and their types. In this article, we propose a novel weighted-class classification scheme to secure the network against malicious nodes while alleviating the problem of imbalanced data. In the proposed system, we combine a supervised machine learning algorithm with the network nodes' past information and a specifically designed best-effort iterative algorithm to enhance the accuracy of rarely detectable attacks. The machine learning algorithm is used to generate a classifier that differentiates between the investigated attacks. The system stores these decisions in a private database. Then, we design a new weight optimization algorithm that exploits these decisions to generate a weights vector that includes the best weight for each class. The proposed model enhances the overall detection accuracy and maximizes the number of correctly detectable classes, even for classes with a relatively low number of training entries. The UNSW dataset has been used to evaluate the performance of the proposed model and compare it with state-of-the-art techniques.

15 citations
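The article's weight-optimization algorithm is its own contribution and is not reproduced here; the sketch below only shows the underlying idea of a per-class weight vector that boosts rare classes, using scikit-learn's inverse-frequency "balanced" heuristic on a synthetic stand-in for intrusion records.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic 3-class stand-in for intrusion records; class 2 is the rare attack type.
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.80, 0.15, 0.05], random_state=0)

# Inverse-frequency weights: the rarer the class, the larger its weight.
classes = np.unique(y)
w = compute_class_weight("balanced", classes=classes, y=y)
weights = dict(zip(classes, w))
print("per-class weights:", weights)

# Any classifier that accepts class_weight can consume the weight vector directly.
clf = RandomForestClassifier(class_weight=weights, random_state=0).fit(X, y)
```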


Cites methods from "Swift Imbalance Data Classification..."

  • ...Several techniques have been used to alleviate the problem of class imbalance, including data sampling (e.g., Random Undersampling (RUS) [12], which discards instances from the majority class) and boosting-based methods ([6], RUSBoost [13]), where synthetic examples from the rare or minority class are created to interpolate between existing minority-class instances, thus creating a denser minority class.... (The interpolation step is written out after this excerpt.)

    [...]
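The interpolation mentioned above is SMOTE's core operation: a synthetic point is placed at a random position on the line segment between a minority instance and one of its k nearest minority-class neighbors.

```latex
x_{\text{new}} = x_i + \lambda \,(\hat{x}_i - x_i), \qquad \lambda \sim U(0, 1)
```

Here $x_i$ is a minority-class instance, $\hat{x}_i$ is one of its $k$ nearest minority-class neighbors, and $\lambda$ is drawn uniformly at random, so $x_{\text{new}}$ always lies between two existing minority examples.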

Proceedings ArticleDOI
22 Sep 2021
TL;DR: In this article, a synthesized dataset that reflects real-time failures encountered in industry was used in a predictive maintenance model. Class imbalance hinders the performance of machine learning algorithms, and this is handled by evaluating SMOTE-based oversampling techniques.
Abstract: The identification of failures and defects in industrial machines has proven to be a challenge to gauging their warranty and performance. Depreciation in industrial machines occurs due to several factors, most commonly tool wear, strain, heat, and power failure. In this paper, machine learning algorithms are developed for the prediction of machine failures. A synthesized dataset that reflects real-time failures encountered in industry was used in the predictive maintenance model. Class imbalance hinders the performance of machine learning algorithms, and this is handled by evaluating SMOTE-based oversampling techniques. Using the SMOTE technique, a 7.83% increase in the AUC score is observed, improving the performance of the Random Forest classifier in distinguishing instances of non-failure from machine failures.

9 citations
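The before/after comparison behind a result like that 7.83% AUC gain can be reproduced in outline as below; the dataset here is synthetic, so the exact improvement will differ from the paper's predictive-maintenance data.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the machine-failure data (failures are ~5% of samples).
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Train the same classifier on the raw and the SMOTE-balanced training split.
variants = {"plain": (X_tr, y_tr),
            "SMOTE": SMOTE(random_state=1).fit_resample(X_tr, y_tr)}
for name, (Xf, yf) in variants.items():
    rf = RandomForestClassifier(random_state=1).fit(Xf, yf)
    auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
    print(f"{name:5s} AUC = {auc:.3f}")
```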

Book ChapterDOI
01 Jan 2021
TL;DR: In this article, the authors focus on the SMOTE and extreme learning machine algorithms to deal with the problem of highly imbalanced data, a setting in which ensemble learning algorithms also tend to be more effective classifiers.
Abstract: Imbalanced data are a common classification problem. Since it occurs in most real-world fields, the issue is increasingly important, and it is of particular concern for highly imbalanced datasets (when the class ratio is high). Different techniques have been developed for supervised learning on such sets. SMOTE is a well-known over-sampling method that addresses imbalance at the data level: it synthesizes new minority examples between pairs of closely located minority vectors. The learning algorithm itself, however, is typically not designed for imbalanced data, and ensemble learning algorithms have proven more effective at classifying imbalanced sets; simple ensemble ideas can be combined with the SMOTE algorithm to work with imbalanced data. There are detailed studies on imbalanced-data problems and the various approaches to resolving them; here we focus mainly on SMOTE and extreme learning machine algorithms.

7 citations
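One concrete way to pair SMOTE with an ensemble learner, as the chapter discusses, is to place the sampler and the classifier in a single resampling-aware pipeline so that SMOTE is re-fit on each training fold and never touches validation data. The sketch assumes imbalanced-learn's Pipeline and a random forest as the ensemble; the chapter does not prescribe these exact components.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Sampler + ensemble in one estimator: resampling happens inside fit(),
# so cross-validation scores are not inflated by synthetic test points.
model = Pipeline([("smote", SMOTE(random_state=0)),
                  ("forest", RandomForestClassifier(random_state=0))])

print("CV ROC-AUC:", cross_val_score(model, X, y, scoring="roc_auc").mean())
```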

Journal ArticleDOI
TL;DR: A geographically distributed system built on the Raspberry Pi 3B+ is presented that classifies field images in real time using classifiers trained on pre-classifications provided by a human supervisor.
Abstract: Digital agriculture contributes to agricultural efficiency through the use of tools such as computer vision, robotics, and precision agriculture. The objective of this work was to develop a system capable of classifying images through the recognition of pre-established patterns. To this end, a geographically distributed system was created, based on the Raspberry Pi 3B+ computer, which captures images in the field and stores them in a database, where they are made available to receive a pre-classification from a supervisor. After that, classifiers are generated, evaluated, and sent to the remote device to perform classification in real time. To evaluate the system, 23 classes grouped into 3 superclasses were defined, 36,979 images were captured, and 1,579 pre-classifications were performed, enabling classification tests via cross-validation with the number of splits equal to the number of classes and with shuffled data. These tests showed that the accuracy delivered by each classifier differs and is directly proportional to the number and balance of the samples, with accuracy varying from 11% to 79% for 26 and 2,200 samples considered, respectively. The system's response time was evaluated over 1,585 periods and remained at approximately 0.20 seconds, which, at a controlled vehicle speed, allows it to be used for real-time dispersal of inputs.

3 citations

References
Journal ArticleDOI
TL;DR: A method of over-sampling the minority class by creating synthetic minority-class examples is proposed and evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
Abstract: An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.

17,313 citations
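The combination the abstract advocates (SMOTE on the minority class plus under-sampling of the majority class) can be sketched with imbalanced-learn; the specific target ratios below are illustrative choices, not values from the paper.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# Step 1: SMOTE grows the minority class to 50% of the majority size.
X_o, y_o = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X, y)
# Step 2: random under-sampling trims the majority class down to a 1:1 ratio.
X_b, y_b = RandomUnderSampler(sampling_strategy=1.0, random_state=0).fit_resample(X_o, y_o)
print("after: ", Counter(y_b))
```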

Journal ArticleDOI
TL;DR: A new learning algorithm called ELM is proposed for single-hidden-layer feedforward neural networks (SLFNs); it randomly chooses hidden nodes and analytically determines the output weights, and tends to provide good generalization performance at extremely fast learning speed.

10,217 citations
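The analytic step the TL;DR refers to is a single linear solve. For N training pairs (x_i, t_i) and L hidden nodes with activation g, the random hidden layer produces an N-by-L matrix H, and the output weights β come from its Moore-Penrose pseudoinverse:

```latex
H_{ij} = g(w_j \cdot x_i + b_j), \qquad
H\beta = T, \qquad
\hat{\beta} = H^{\dagger} T
```

The input weights $w_j$ and biases $b_j$ are drawn at random and never updated, so there is no iterative training loop at all; this closed-form solve is the source of ELM's speed.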


"Swift Imbalance Data Classification..." refers methods in this paper

  • ...[26] proposed a simple and effective learning algorithm known as ELM....

    [...]

Journal ArticleDOI
TL;DR: A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.
Abstract: With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.

6,320 citations


"Swift Imbalance Data Classification..." refers background or methods in this paper

  • ...Certain significant methods for tackling the imbalanced dataset situation include two basic broad fields: Sampling methods for imbalanced data and Cost-Sensitive Methods for Imbalanced Learning [1]....

    [...]

  • ...For evaluation purposes, we have used Geometric mean (G-mean), F-score and ROC curve as the evaluation metrics, as these are principally used for assessing data classification algorithms [1].... (The first two metrics are written out after this list.)

    [...]

  • ...that the dataset available in these fields was imbalanced, which was the principal reason of the substandard classification performance [1]....

    [...]

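For reference, the two imbalance-aware metrics named in the excerpts above are, in standard confusion-matrix notation (TP, TN, FP, FN):

```latex
\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```

G-mean is the geometric mean of the true-positive and true-negative rates, so it collapses to zero if either class is ignored entirely, which is exactly the failure mode that plain accuracy hides on skewed data.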

Journal ArticleDOI
TL;DR: This work performs a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets, and shows that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC).
Abstract: There are several aspects that might influence the performance achieved by existing learning systems. It has been reported that one of these aspects is related to class imbalance, in which examples in training data belonging to one class heavily outnumber the examples in the other class. In this situation, which is found in real-world data describing an infrequent but important event, the learning system may have difficulty learning the concept related to the minority class. In this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets. Our experiments provide evidence that class imbalance does not systematically hinder the performance of learning systems. In fact, the problem seems to be related to learning with too few minority class examples in the presence of other complicating factors, such as class overlapping. Two of our proposed methods deal with these conditions directly, allying a known over-sampling method with data cleaning methods in order to produce better-defined class clusters. Our comparative experiments show that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC). This result seems to contradict results previously published in the literature. Two of our proposed methods, Smote + Tomek and Smote + ENN, presented very good results for data sets with a small number of positive examples. Moreover, Random over-sampling, a very simple over-sampling method, is very competitive with more complex over-sampling methods. Since the over-sampling methods provided very good performance results, we also measured the syntactic complexity of the decision trees induced from over-sampled data. Our results show that these trees are usually more complex than the ones induced from original data. Random over-sampling usually produced the smallest increase in the mean number of induced rules, and Smote + ENN the smallest increase in the mean number of conditions per rule, when compared among the investigated over-sampling methods.

2,914 citations
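Both hybrid methods the abstract highlights are available off the shelf in imbalanced-learn; the sketch below applies them to a synthetic dataset just to show the call pattern.

```python
from collections import Counter

from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("original:   ", Counter(y))

# SMOTE + Tomek links: over-sample, then drop Tomek pairs straddling the boundary.
X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)
print("SMOTE+Tomek:", Counter(y_st))

# SMOTE + ENN: over-sample, then clean noisy points with Edited Nearest Neighbours.
X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)
print("SMOTE+ENN:  ", Counter(y_se))
```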


"Swift Imbalance Data Classification..." refers methods in this paper

  • ...Some techniques include random undersampling and oversampling; informed undersampling [7]; synthetic sampling such as the Synthetic Minority Over-sampling Technique (SMOTE), which has shown a great deal of success in various applications [8]; and sampling with data cleaning techniques like the Condensed Nearest Neighbor rule [9], the OSS method [10], and the Neighborhood Cleaning rule (NCL) [11]....

    [...]

  • ...However, in case of a low imbalance ratio J, one can use downsampling, Condensed Nearest Neighbour (CNN) with T-Link [9], along with ELM (a CNN+T-Link sketch follows this list)....

    [...]

  • ...For comparing, we have also used a downsampling technique, CNN+T-Link [9]....

    [...]

  • ...The prior three have been used with original dataset, upsampled dataset (by using SMOTE) and downsampled dataset (by using CNN+T-Link)....

    [...]

  • ...In this work, four different models, namely DTC, ELM, and K-Means (each with the original data, data upsampled using SMOTE, and data downsampled using CNN+T-Link), along with W-ELM, have been compared against each other....

    [...]
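The downsampling baseline named in the excerpts, CNN followed by Tomek-link removal, can be approximated with imbalanced-learn's built-in samplers; chaining them as below is an assumption about the paper's exact procedure, which may differ in ordering or parameters.

```python
from collections import Counter

from imblearn.under_sampling import CondensedNearestNeighbour, TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# CNN keeps only a consistent subset of the majority class...
X_c, y_c = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)
# ...then Tomek-link removal deletes remaining borderline majority points.
X_d, y_d = TomekLinks().fit_resample(X_c, y_c)

print(Counter(y), "->", Counter(y_d))
```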
