Improved Accuracy of Naive Bayes Classifier for Determination of Customer Churn Uses SMOTE and Genetic Algorithms

doi:10.52465/JOSCEX.V1I1.5

Proceedings Article•DOI•

Afifah Ratna Safitri¹, Much Aziz Muslim¹•Institutions (1)

State University of Semarang¹

06 Oct 2020-Vol. 1, Iss: 1, pp 70-75

TL;DR: This study aims to improve the accuracy of Naive Bayes for customer classification by using SMOTE to handle class imbalance and a genetic algorithm for attribute selection.

Abstract: With increasing competition in the business world, many companies use data mining techniques to determine the level of customer loyalty. The customer data used in this study is the German Credit dataset obtained from UCI. This dataset has a class imbalance problem because the loyal class contains more records than the churn class. In addition, some attributes are irrelevant for customer classification, so attribute selection is needed to get more accurate classification results. One classification algorithm is Naive Bayes. Naive Bayes has been used as an effective classifier for years because it is easy to build and treats the attributes as independent. The purpose of this study is to improve the accuracy of Naive Bayes for customer classification. SMOTE and a genetic algorithm are applied to improve the accuracy: SMOTE handles the class imbalance problem, while the genetic algorithm performs attribute selection. Naive Bayes alone reaches an accuracy of 47.10%, Naive Bayes with SMOTE reaches a mean accuracy of 78.15%, and Naive Bayes with SMOTE and the genetic algorithm reaches 78.46%.
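
The oversampling step described in the abstract can be sketched as follows. This is a minimal illustration of the SMOTE interpolation idea only, not the authors' implementation: the function name `smote`, the toy two-feature rows, and the parameters `k` and `seed` are assumptions made for this example, and the sketch handles numeric features only (the German Credit data also contains categorical attributes, which would need separate treatment).

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy, made-up data: 6 "loyal" rows versus 3 "churn" rows, two numeric features each.
loyal = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.3, 1.0)]
churn = [(3.0, 3.0), (3.4, 2.8), (2.9, 3.5)]

# Generate just enough synthetic churn rows to match the loyal class size.
new_churn = smote(churn, n_synthetic=len(loyal) - len(churn), k=2)
balanced_churn = churn + new_churn
print(len(loyal), len(balanced_churn))
```

Because each synthetic row lies on a segment between two real minority rows, the new points stay inside the region already covered by the churn class rather than duplicating existing rows exactly.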

Citations

Journal Article•DOI•

Company bankruptcy prediction framework based on the most influential features using XGBoost and stacking ensemble learning

[...]

Much Aziz Muslim¹, Yosza Dasril¹•Institutions (1)

Universiti Tun Hussein Onn Malaysia¹

01 Dec 2021-International Journal of Electrical and Computer Engineering

TL;DR: This study aims to find the best predictive model or method to predict company bankruptcy using the dataset from Polish companies bankruptcy and uses the best feature selection and ensemble learning.

Abstract: Bankruptcy is often a very serious problem for companies. It can cause losses to stakeholders such as owners, investors, employees, and consumers. One way to prevent bankruptcy is to predict its likelihood from the company's financial data. Therefore, this study aims to find the best predictive model or method for predicting company bankruptcy using the Polish companies bankruptcy dataset. The prediction process uses feature selection and ensemble learning. The best features are selected using XGBoost feature importance with a weight value filter of 10. The ensemble learning method used is stacking, which is composed of base models and a meta learner. The base models are K-nearest neighbor, decision tree, SVM, and random forest, while the meta learner is LightGBM. The stacking model outperforms the base models, reaching an accuracy of 97%.

12 citations

Journal Article•DOI•

Improved logistics service quality for goods quality delivery services of companies using analytical hierarchy process

[...]

Popy Riliandini, Erika Noor Dianti¹, Sayidah Rohmatul Hidayah, Dwika Ananda Agustina Pertiwi¹•Institutions (1)

State University of Semarang¹

31 Mar 2021

TL;DR: The main dimensions of logistic service quality for improving service quality are ordering condition, time, and information quality, which can serve as a basis for companies' decision making when choosing alternative criteria priorities.

Abstract: Logistics plays a role in smooth transactions between companies because it facilitates the buying and selling of goods and services to fulfill the supply orders of consumer companies. This study aims to analyze the impact of improved Logistic Service Quality (LSQ) on the quality of goods delivery services, using LSQ dimensions from previous research. Sample data were obtained by distributing questionnaires to 61 respondents and were processed quantitatively with convergent validity and reliability tests. The results show that the main dimensions of logistic service quality for improving the quality of service are ordering condition, time, and information quality. Each comparison factor is tested for consistency using the Analytical Hierarchy Process (AHP). Each of the main criteria has a consistency value of less than 0.1, so the main criteria tested have a consistent comparison matrix and can be the basis of decision making for companies in choosing alternative criteria priorities.
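
The AHP consistency check mentioned in the abstract (a consistency value below 0.1) can be illustrated with a short sketch. The pairwise comparison values below are invented for illustration and are not taken from the study; the function name `consistency_ratio` is likewise an assumption. The sketch uses the common column-normalisation approximation of the priority vector and Saaty's random index values, and it is meant for matrices of size 3 to 5.

```python
# Saaty's random consistency index for matrix sizes 3..5.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}

def consistency_ratio(matrix):
    """Return (priorities, CR) for a reciprocal pairwise comparison matrix."""
    n = len(matrix)
    col_sums = [sum(row[j] for row in matrix) for j in range(n)]
    # Approximate priority vector: row averages of the column-normalised matrix.
    priorities = [
        sum(matrix[i][j] / col_sums[j] for j in range(n)) / n for i in range(n)
    ]
    # Estimate the principal eigenvalue from (A w)_i / w_i.
    weighted = [sum(matrix[i][j] * priorities[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(w / p for w, p in zip(weighted, priorities)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return priorities, consistency_index / RANDOM_INDEX[n]

# Hypothetical judgements for: ordering condition, time, information quality.
comparisons = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
priorities, cr = consistency_ratio(comparisons)
print([round(p, 3) for p in priorities], round(cr, 4))
```

A ratio below 0.1 is the conventional threshold for accepting the comparison matrix as consistent, which matches the criterion stated in the abstract.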

3 citations

Proceedings Article•DOI•

SMOTE Classification and Random Oversampling Naive Bayes in Imbalanced Data : (Case Study of Early Detection of Cervical Cancer in Indonesia)

[...]

Nur Silviyah Rahmi, Ni Wayan Surya Wardhani, Maria Bernadetha Mitakda, Regina Syahla Fauztina, Imelda Salsabila, +1 more

04 Nov 2022

TL;DR: In this article, the authors used SMOTE and Random Oversampling (ROS) sampling techniques to overcome imbalanced data, combined with the Naive Bayes classification method, for early detection of cervical cancer in Indonesia.

Abstract: Imbalanced data is a problem often encountered in classification, where the majority class has many more records than the minority class. Imbalanced data degrades the performance of classification methods in machine learning. This study adopted the SMOTE and Random Oversampling (ROS) sampling techniques to overcome imbalanced data, combined with the Naive Bayes classification method, for early detection of cervical cancer in Indonesia. Cervical cancer is a disease in which a group of abnormal cells grows in the cervix (mouth of the womb). It ranks number 2 among the cancers suffered by Indonesian women. Factors that influence its occurrence include eating behavior, personal hygiene behavior, motivational strength, social support, empowerment of knowledge, abilities, and desires. The data used are secondary data with a sample of 72 patients and 20 attributes. A total of 21 patients had cervical cancer and 51 patients did not, a ratio of about 30:70, so the data are imbalanced. Through this classification, the study expects to identify the factors that influence the occurrence of cervical cancer and to determine which of the two classifications performs best. The results show that the average performance of SMOTE Naive Bayes (81.73%) is higher than that of Random Oversampling Naive Bayes (81.12%). Therefore, SMOTE Naive Bayes outperforms Random Oversampling Naive Bayes.

3 citations

Journal Article•DOI•

Optimize Naïve Bayes Classifier Using Chi Square and Term Frequency Inverse Document Frequency For Amazon Review Sentiment Analysis

[...]

Anisa Falasari, Much Aziz Muslim

30 Mar 2022-Journal of Soft Computing Exploration

TL;DR: In this study, using sentiment labelled dataset (field amazon_labelled) obtained from UCI Machine Learning, the accuracy of the naïve bayes classifier in the amazon review sentiment analysis was 82% and the accuracy by applying chi square and TF-IDF is 83%.

Abstract: The rapid development of the internet has made information flow rapidly, which has an impact on the world of commerce. Some people who have bought a product write their opinion on social media or other online sites. Long-text buyer reviews need a machine to recognize opinions. Sentiment analysis applies text mining methods, and one of the methods applied in sentiment analysis is classification. One classification algorithm is the naïve Bayes classifier, a classification method with good efficiency and performance. However, it is very sensitive to having too many features, which lowers the accuracy. The accuracy of the naïve Bayes classifier can be improved by selecting features. One feature selection method is chi square, which selects features based on a predetermined top-K value, here 450. In addition, feature weighting can also improve the accuracy of the naïve Bayes classifier. One feature weighting technique is term frequency inverse document frequency (TF-IDF). This study uses the sentiment labelled dataset (field amazon_labelled) obtained from UCI Machine Learning. This dataset has 500 positive reviews and 500 negative reviews. The accuracy of the naïve Bayes classifier in the Amazon review sentiment analysis was 82%. Meanwhile, the accuracy of the naïve Bayes classifier with chi square and TF-IDF applied is 83%.
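
The TF-IDF weighting named in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the plain `tf * log(N / df)` variant (the abstract does not say which variant the authors used), whitespace tokenisation, and three invented reviews. The chi square selection step is not shown. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of reviews in which each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Made-up reviews for illustration only.
reviews = ["great phone great battery", "bad phone", "great case"]
weights = tf_idf(reviews)
for review, w in zip(reviews, weights):
    print(review, {term: round(value, 3) for term, value in w.items()})
```

Terms that occur in few reviews (such as "bad" here) receive a higher weight than terms shared across reviews (such as "phone"), which is the property that makes the weighting useful before classification.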

3 citations

Journal Article•DOI•

Recommendation of Yogyakarta tourism based on simple additive weighting under fuzziness

[...]

Eko Yunanto Utomo

31 Mar 2021

TL;DR: By combining the Triangular Fuzzy Number and Simple Additive Weighting methods, this study obtained the best 2 packages recommended for tourists to choose.

Abstract: Tourists who are unfamiliar with a destination or the tourist attraction they want to visit can choose tour and travel services. Tour and travel agencies provide a choice of tour packages in many variations. Choosing the right package and agency can benefit tourists, both financially and in terms of vacation quality. The data used in this study were obtained from several tour and travel agents. The variables used are the price of the package, the number of participants, and the number of facilities obtained. The method used in this study combines the Triangular Fuzzy Number (TFN) and the Simple Additive Weighting (SAW) method. The purpose of this study is to help tourists determine the most profitable or best packages. The results of this study obtained the best 2 packages recommended for tourists to choose.
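
The Simple Additive Weighting step can be sketched as below. Everything concrete here is an assumption made for illustration: the package values, the weights, the choice of price as the only cost criterion, and the function name `saw_rank`. The fuzzy (TFN) part of the method is omitted, so the sketch works on crisp values only.

```python
def saw_rank(packages, weights, cost_criteria):
    """Rank alternatives by the weighted sum of their normalised criterion values."""
    criteria = list(weights)
    # Reference value per criterion: minimum for cost criteria, maximum for benefit.
    best = {
        c: (min if c in cost_criteria else max)(p[c] for p in packages.values())
        for c in criteria
    }
    scores = {}
    for name, values in packages.items():
        score = 0.0
        for c in criteria:
            if c in cost_criteria:
                normalised = best[c] / values[c]   # lower is better
            else:
                normalised = values[c] / best[c]   # higher is better
            score += weights[c] * normalised
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical tour packages and weights (not from the study).
packages = {
    "A": {"price": 500, "participants": 10, "facilities": 5},
    "B": {"price": 400, "participants": 8, "facilities": 4},
    "C": {"price": 700, "participants": 12, "facilities": 6},
}
weights = {"price": 0.5, "participants": 0.2, "facilities": 0.3}

ranking = saw_rank(packages, weights, cost_criteria={"price"})
top_two = [name for name, _ in ranking[:2]]
print(ranking, top_two)
```

Taking the first two entries of the ranking mirrors the study's output of the best 2 recommended packages.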

2 citations

References

Journal Article•DOI•

Churn prediction on huge telecom data using hybrid firefly based classification

[...]

Ammar A. Q. Ahmed, D. Maheswari

01 Nov 2017-Egyptian Informatics Journal

TL;DR: A metaheuristic based churn prediction technique that performs churn prediction on huge telecom data using a hybridized form of Firefly algorithm as the classifier and it was observed that Firefly algorithm works best on churn data and the hybridized Firefly algorithm provides effective and faster results.

50 citations

Journal Article•DOI•

Churn prediction in the telecommunications sector using support vector machines

[...]

Ionut Brandusoiu, Gavril Toderean

01 Jan 2013-Annals of the Oradea University: Fascicle Management and Technological Engineering

47 citations

Journal Article•DOI•

Gaussian-Based SMOTE Algorithm for Solving Skewed Class Distributions

[...]

Hansoo Lee¹, Jonggeun Kim¹, Sungshin Kim¹•Institutions (1)

Pusan National University¹

25 Dec 2017-The International Journal of Fuzzy Logic and Intelligent Systems

44 citations

Journal Article•DOI•

Data Mining Classification Comparison (Naïve Bayes and C4.5 Algorithms)

[...]

Leni Marlina, Muslim, Andysah Putera Utama Siahaan

25 Aug 2016-International Journal of Engineering Trends and Technology

TL;DR: The author will do a comparison between the performance of the technical classification methods naïve Bayes and C4.5 algorithms.

Abstract: The development of data mining is inseparable from recent developments in information technology that enable the accumulation of large amounts of data. For example, a shopping mall records every sales transaction using various POS (point of sales) terminals. The database of these sales can reach a large storage capacity, with more added each day, especially when the shopping center develops into a nationwide network. The development of the internet also contributes substantially to the accumulation of data. However, the rapid growth of data accumulation has created a condition often referred to as "data rich but information poor", because the data collected cannot be used optimally for useful applications. Not infrequently the data set is simply left to become a "data grave". Several techniques are used in data mining, including association, classification, and clustering. In this paper, the author compares the performance of the naïve Bayes and C4.5 classification algorithms.

31 citations

Journal Article•DOI•

Review on factors affecting customer churn in telecom sector

[...]

Vishal Mahajan¹, Richa Misra², Renuka Mahajan³•Institutions (3)

HCL Technologies¹, Jaipuria Institute of Management², Amity University³

23 Aug 2017-International Journal of Data Analysis Techniques and Strategies

TL;DR: A model of the churn factors identified in the study is proposed to serve as a roadmap for building upon existing churn management techniques.

Abstract: The communications sector is evolving with new technologies and with wireless and wireline services. The industry's success depends on a better understanding of customer requirements and on superior quality of service and models. Customer churn has a huge impact on companies and is a prime focus area for companies that want to remain competitive and profitable. Hence, significant research has been undertaken by researchers worldwide to understand the dynamics of customer churn. This paper reviews around 75 recent journal articles (starting from the year 2000) to identify the various churn factors and their complex relationships in the existing telecom churn literature. It discusses in detail which factors were identified in the various studies, the sample sizes used, and the methods used by different researchers. The gaps identified in the previous studies are also discussed. A model of the churn factors identified in the study is proposed to serve as a roadmap for building upon existing churn management techniques.

21 citations