Proceedings ArticleDOI

A new approach for handling imbalanced dataset using ANN and genetic algorithm

01 Apr 2016 - pp. 1987-1990
TL;DR: This paper compares the existing methods for handling imbalanced datasets and provides a new hybrid approach that improves the accuracy of classifiers on imbalanced data.
Abstract: Classification of imbalanced data is a major challenge for the community these days. Machine learning algorithms can evolve a one-sided classifier when the data are imbalanced. The vital challenge in the imbalanced-dataset problem is that the minority (tiny) classes are often the more useful ones, yet standard classifiers tend to be biased toward the majority (huge) classes and ignore the tiny ones. In this paper we compare the existing methods for handling imbalanced datasets and provide a new hybrid approach that improves classifier accuracy on imbalanced data.
Citations
Journal ArticleDOI
11 Oct 2018 - Symmetry
TL;DR: An efficient analytics framework is proposed, which is technically a progressive machine learning technique merged with Spark-based linear models, Multilayer Perceptron (MLP) and LSTM, using a two-stage cascade structure in order to enhance the predictive accuracy.
Abstract: Every day we experience unprecedented data growth from numerous sources, which contributes to big data in terms of volume, velocity, and variability. These datasets impose great challenges on analytics frameworks and computational resources, making the overall analysis difficult for extracting meaningful information in a timely manner. Thus, to harness these kinds of challenges, developing an efficient big data analytics framework is an important research topic. Consequently, to address these challenges by exploiting non-linear relationships from very large and high-dimensional datasets, machine learning (ML) and deep learning (DL) algorithms are being used in analytics frameworks. Apache Spark has been in use as one of the fastest big data processing platforms, helping to solve iterative ML tasks with its distributed ML library, Spark MLlib. Considering real-world research problems, DL architectures such as Long Short-Term Memory (LSTM) are an effective approach to overcoming practical issues such as reduced accuracy, long-term sequence dependency, and vanishing and exploding gradients in conventional deep architectures. In this paper, we propose an efficient analytics framework, which is technically a progressive machine learning technique merged with Spark-based linear models, Multilayer Perceptron (MLP) and LSTM, using a two-stage cascade structure in order to enhance the predictive accuracy. Our proposed architecture enables us to organize big data analytics in a scalable and efficient way. To show the effectiveness of our framework, we applied the cascading structure to two different real-life datasets to solve a multiclass and a binary classification problem, respectively. Experimental results show that our analytical framework outperforms state-of-the-art approaches with a high level of classification accuracy.

35 citations


Cites methods from "A new approach for handling imbalan..."

  • ...[35] analyzed several different methods of class imbalance problems....


Proceedings ArticleDOI
01 Nov 2017
TL;DR: This paper proposes a novel framework that combines the distributive computational abilities of Apache Spark and the advanced machine learning architecture of a deep multi-layer perceptron (MLP) using the popular concept of Cascade Learning.
Abstract: With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular, especially in industries. It is becoming increasingly evident that effective big data analysis is key to solving artificial intelligence problems. Thus, a multi-algorithm library was implemented in the Spark framework, called MLlib. While this library supports multiple machine learning algorithms, there is still scope to use the Spark setup efficiently for highly time-intensive and computationally expensive procedures like deep learning. In this paper, we propose a novel framework that combines the distributive computational abilities of Apache Spark and the advanced machine learning architecture of a deep multi-layer perceptron (MLP), using the popular concept of Cascade Learning. We conduct empirical analysis of our framework on two real world datasets. The results are encouraging and corroborate our proposed framework, in turn proving that it is an improvement over traditional big data analysis methods that use either Spark or Deep learning as individual elements.

30 citations


Cites methods from "A new approach for handling imbalan..."

  • ...A comprehensive analysis of various methods targeted towards solving the problem has been mentioned in [7]....


Posted Content
TL;DR: In this article, the authors propose a framework that combines the distributive computational abilities of Apache Spark and the advanced machine learning architecture of a deep multi-layer perceptron (MLP), using the popular concept of Cascade Learning.

24 citations

Journal ArticleDOI
25 Apr 2021
TL;DR: This study applies oversampling with a Random Forest classifier for dialect recognition and applies the Grid Search method for hyper-parameter optimization of the Random Forest algorithm.
Abstract: Speech recognition is one of the important research fields and is currently widely used for various applications. However, speech recognition performance is affected by the dialect of the speaker, so dialect recognition is often used as an additional feature in speech recognition. Recognizing dialects is not easy. Currently, machine learning technology is widely applied in dialect recognition. One of the challenges in machine-learning-based dialect recognition is class imbalance and overlap across a wide variety of classification techniques. This study applies a Random Forest-based oversampling approach for dialect recognition, and for hyper-parameter optimization of the Random Forest algorithm we apply the Grid Search method. Experiments on Speech Accent Archive data using MFCC features resulted in an accuracy of 0.91 and an AUC of 0.95.

5 citations


Cites background from "A new approach for handling imbalan..."

  • ...The second approach is Cost-Sensitive adjustment on the original data [13]; Cost-Sensitive learning is machine learning that takes misclassification costs into account [13]....


  • ...The data-level (sampling) approach can be used to modify the class distribution of the training data in order to balance it [13]; the data-level approach itself is a preprocessing stage carried out before building the machine learning model [12]....


  • ...In such conditions, one class of the dataset is represented by only a small number of examples (the minority class), while the other class makes up most of the data (the majority class) [13]....


References
Journal ArticleDOI
TL;DR: A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.
Abstract: With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.

6,320 citations

Book ChapterDOI
01 Jan 2018
TL;DR: This chapter aims to highlight the existence of imbalance in all real world data and the need to focus on the inherent characteristics present in imbalanced data that can degrade the performance of classifiers.
Abstract: Pattern identification in various domains has become one of the most researched fields. The accuracy of all traditional and standard classifiers is highly proportional to the completeness or quality of the training data. Completeness is bound by various parameters such as noise, highly representative samples of the real-world population, availability of training data, dimensionality, etc. Another very pressing and domineering issue identified in real-world datasets is that the data is dominated by typically occurring examples, with only a few rare or unusual occurrences. This distribution among classes makes real-world data inherently imbalanced in many domains like medicine, finance, marketing, web, fault detection, anomaly detection, etc. This chapter aims to highlight the existence of imbalance in all real-world data and the need to focus on the inherent characteristics present in imbalanced data that can degrade the performance of classifiers. It provides an overview of the existing effective methods and solutions implemented towards the significant problems of imbalanced data for improvement in the performance of standard classifiers. Efficient metrics for evaluating the performance of imbalanced learning models, followed by future directions for research, are also highlighted.

1,763 citations


"A new approach for handling imbalan..." refers background in this paper

  • ...Haibo He and Edwardo A. Garcia [5] give an extensive survey of research in learning from imbalanced data....


Book ChapterDOI
01 Jan 2005
TL;DR: In this Chapter, some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets are discussed.
Abstract: A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult “real-world” problems, many of which are characterized by imbalanced data. Additionally the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this Chapter, we discuss some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.

1,241 citations

Journal ArticleDOI
01 Apr 2012
TL;DR: This paper aims to both highlight the limitations of the current GP approaches in this area and develop several new fitness functions for binary classification with unbalanced data and empirically show that these new Fitness functions evolve classifiers with good performance on both the minority and majority classes.
Abstract: Machine learning algorithms such as genetic programming (GP) can evolve biased classifiers when data sets are unbalanced. Data sets are unbalanced when at least one class is represented by only a small number of training examples (called the minority class) while other classes make up the majority. In this scenario, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence that the larger majority class has on traditional training criteria in the fitness function. This paper aims to both highlight the limitations of the current GP approaches in this area and develop several new fitness functions for binary classification with unbalanced data. Using a range of real-world classification problems with class imbalance, we empirically show that these new fitness functions evolve classifiers with good performance on both the minority and majority classes. Our approaches use the original unbalanced training data in the GP learning process, without the need to artificially balance the training examples from the two classes (e.g., via sampling).

98 citations

01 Jan 2012
TL;DR: This survey paper elaborates on Artificial Neural Networks (ANNs), their various characteristics, and their business applications, addressing what neural networks are and why they are so important in today's artificial intelligence.
Abstract: In this survey paper, we elaborate on Artificial Neural Networks (ANNs), their various characteristics, and their business applications. We also address what neural networks are and why they are so important in today's artificial intelligence, given that numerous advances have been made in developing intelligent systems, some inspired by biological neural networks. ANNs provide very exciting alternatives and applications that can play an important role in today's computer science field. Some limitations are also mentioned.

96 citations