Proceedings ArticleDOI

A new approach for handling imbalanced dataset using ANN and genetic algorithm

01 Apr 2016 - pp. 1987-1990
TL;DR: This paper compares the existing methods for handling imbalanced datasets and provides a new hybrid approach that improves the accuracy of classifiers on imbalanced data.
Abstract: Classification of imbalanced data is a major challenge for the community these days. Machine learning algorithms can evolve a one-sided classifier when the data are imbalanced. The vital challenge in the imbalanced-dataset problem is that the minority (tiny) classes are often the more useful ones, yet standard classifiers tend to be biased toward the majority (huge) classes and ignore the tiny ones. In this paper we compare the existing methods for handling imbalanced datasets and provide a new hybrid approach that improves classifier accuracy on imbalanced data.
Citations
Journal ArticleDOI
11 Oct 2018 - Symmetry
TL;DR: An efficient analytics framework is proposed, which is technically a progressive machine learning technique merged with Spark-based linear models, Multilayer Perceptron (MLP) and LSTM, using a two-stage cascade structure in order to enhance the predictive accuracy.
Abstract: Every day we experience unprecedented data growth from numerous sources, which contributes to big data in terms of volume, velocity, and variability. These datasets impose great challenges on analytics frameworks and computational resources, making the overall analysis difficult for extracting meaningful information in a timely manner. Thus, to harness these kinds of challenges, developing an efficient big data analytics framework is an important research topic. Consequently, to address these challenges by exploiting non-linear relationships from very large and high-dimensional datasets, machine learning (ML) and deep learning (DL) algorithms are being used in analytics frameworks. Apache Spark has been in use as one of the fastest big data processing platforms, helping to solve iterative ML tasks with its distributed ML library, Spark MLlib. Considering real-world research problems, DL architectures such as Long Short-Term Memory (LSTM) are an effective approach to overcoming practical issues such as reduced accuracy, long-term sequence dependency, and vanishing and exploding gradients in conventional deep architectures. In this paper, we propose an efficient analytics framework, which is technically a progressive machine learning technique merged with Spark-based linear models, Multilayer Perceptron (MLP) and LSTM, using a two-stage cascade structure in order to enhance the predictive accuracy. Our proposed architecture enables us to organize big data analytics in a scalable and efficient way. To show the effectiveness of our framework, we applied the cascading structure to two different real-life datasets to solve a multiclass and a binary classification problem, respectively. Experimental results show that our analytical framework outperforms state-of-the-art approaches with a high level of classification accuracy.

35 citations


Cites methods from "A new approach for handling imbalan..."

  • ...[35] analyzed several different methods of class imbalance problems....


Proceedings ArticleDOI
01 Nov 2017
TL;DR: This paper proposes a novel framework that combines the distributive computational abilities of Apache Spark and the advanced machine learning architecture of a deep multi-layer perceptron (MLP) using the popular concept of Cascade Learning.
Abstract: With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular, especially in industries. It is becoming increasingly evident that effective big data analysis is key to solving artificial intelligence problems. Thus, a multi-algorithm library was implemented in the Spark framework, called MLlib. While this library supports multiple machine learning algorithms, there is still scope to use the Spark setup efficiently for highly time-intensive and computationally expensive procedures like deep learning. In this paper, we propose a novel framework that combines the distributive computational abilities of Apache Spark and the advanced machine learning architecture of a deep multi-layer perceptron (MLP), using the popular concept of Cascade Learning. We conduct empirical analysis of our framework on two real world datasets. The results are encouraging and corroborate our proposed framework, in turn proving that it is an improvement over traditional big data analysis methods that use either Spark or Deep learning as individual elements.

30 citations


Cites methods from "A new approach for handling imbalan..."

  • ...A comprehensive analysis of various methods targeted towards solving the problem has been mentioned in [7]....


Posted Content
TL;DR: In this article, the authors propose a framework that combines the distributive computational abilities of Apache Spark and the advanced machine learning architecture of a deep multi-layer perceptron (MLP), using the popular concept of Cascade Learning.

24 citations

Journal ArticleDOI
25 Apr 2021
TL;DR: This study applies oversampling with a Random Forest classifier for dialect recognition and applies the Grid Search method for hyper-parameter optimization of the Random Forest algorithm.
Abstract: Speech recognition is one of the important research fields and is currently widely used for various applications. However, speech recognition performance is affected by the dialect of the speaker, so dialect recognition is often used as an additional feature in speech recognition. Recognizing dialects is not easy. Currently, machine learning technology is widely applied in dialect recognition. One of the challenges in machine-learning-based dialect recognition is class imbalance and overlap across a wide variety of classification techniques. This study applies a Random Forest-based oversampling approach for dialect recognition, and for hyper-parameter optimization of the Random Forest algorithm we apply the Grid Search method. Experiments on Speech Accent Archive data using MFCC features resulted in an accuracy of 0.91 and an AUC of 0.95.

5 citations


Cites background from "A new approach for handling imbalan..."

  • ...The second approach is Cost-Sensitive adjustment on the original data [13]; Cost-Sensitive learning is machine learning that takes misclassification costs into account [13]....


  • ...The data-level (sampling) approach can be used to modify the class distribution of the training data in order to balance it [13]; the data-level approach itself is a preprocessing stage carried out before building the machine learning model [12]....


  • ...In such conditions, one class of the dataset is represented by only a small number of examples (the minority class), while the other class makes up most of the data (the majority class) [13]....


References
Journal ArticleDOI
TL;DR: A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.
Abstract: With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.

6,320 citations

Book ChapterDOI
01 Jan 2018
TL;DR: This chapter aims to highlight the existence of imbalance in all real world data and the need to focus on the inherent characteristics present in imbalanced data that can degrade the performance of classifiers.
Abstract: Pattern identification in various domains has become one of the most researched fields. The accuracy of all traditional and standard classifiers is highly proportional to the completeness or quality of the training data. Completeness is bound by various parameters such as noise, highly representative samples of the real-world population, availability of training data, dimensionality, etc. Another very pressing and domineering issue identified in real-world datasets is that the data is dominated by typically occurring examples, with only a few rare or unusual occurrences. This distribution among classes makes real-world data inherently imbalanced in many domains like medicine, finance, marketing, web, fault detection, anomaly detection, etc. This chapter aims to highlight the existence of imbalance in all real-world data and the need to focus on the inherent characteristics present in imbalanced data that can degrade the performance of classifiers. It provides an overview of the existing effective methods and solutions implemented towards the significant problems of imbalanced data for improvement in the performance of standard classifiers. Efficient metrics for evaluating the performance of imbalanced learning models, followed by future directions for research, are also highlighted.

1,763 citations


"A new approach for handling imbalan..." refers background in this paper

  • ...Haibo He and Edwardo A. Garcia [5] give an extensive survey of research in learning from imbalanced data....


Book ChapterDOI
01 Jan 2005
TL;DR: In this Chapter, some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets are discussed.
Abstract: A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult “real-world” problems, many of which are characterized by imbalanced data. Additionally the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this Chapter, we discuss some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.

1,241 citations

Journal ArticleDOI
01 Apr 2012
TL;DR: This paper aims to both highlight the limitations of the current GP approaches in this area and develop several new fitness functions for binary classification with unbalanced data and empirically show that these new Fitness functions evolve classifiers with good performance on both the minority and majority classes.
Abstract: Machine learning algorithms such as genetic programming (GP) can evolve biased classifiers when data sets are unbalanced. Data sets are unbalanced when at least one class is represented by only a small number of training examples (called the minority class) while other classes make up the majority. In this scenario, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence that the larger majority class has on traditional training criteria in the fitness function. This paper aims to both highlight the limitations of the current GP approaches in this area and develop several new fitness functions for binary classification with unbalanced data. Using a range of real-world classification problems with class imbalance, we empirically show that these new fitness functions evolve classifiers with good performance on both the minority and majority classes. Our approaches use the original unbalanced training data in the GP learning process, without the need to artificially balance the training examples from the two classes (e.g., via sampling).

98 citations

01 Jan 2012
TL;DR: This survey paper elaborates on Artificial Neural Networks (ANNs), their various characteristics, and their business applications, addressing what neural networks are and why they are so important in today's artificial intelligence.
Abstract: In this survey paper, we elaborate on Artificial Neural Networks (ANNs), their various characteristics, and their business applications. We also address what neural networks are and why they are so important in today's artificial intelligence, given that numerous advances have been made in developing intelligent systems, some inspired by biological neural networks. ANNs provide very exciting alternatives and applications that can play an important role in today's computer science field. Some limitations are also mentioned.

96 citations