Proceedings ArticleDOI

Classification of Imbalanced Banking Dataset using Dimensionality Reduction

01 May 2019, pp. 1353–1357
TL;DR: This paper uses the imbalanced bank marketing dataset for a direct marketing campaign, available at the UCI machine learning repository, and compares the accuracy of classification algorithms such as Naive Bayes, J48, KNN, and BayesNet.
Abstract: Classification is an important data mining technique. In classification, the classifier automatically learns the properties of classes or categories from pre-defined training documents. In this paper we use the imbalanced bank marketing dataset for a direct marketing campaign, available at the UCI machine learning repository, and compare the accuracy of classification algorithms such as Naive Bayes, J48, KNN, and BayesNet. The CfsSubsetEval method is used for dimensionality reduction: it evaluates subsets of attributes based on their predictive ability and identifies redundancy among the selected features. Before dimensionality reduction, J48 achieves 89% accuracy; after dimensionality reduction, its accuracy increases to 91.2%.
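As an illustration only (the paper itself uses Weka), here is a minimal Python sketch of the same pipeline with scikit-learn stand-ins: mutual-information ranking approximates CfsSubsetEval's correlation-based subset selection, CART approximates J48, and the file name bank.csv is an assumed local copy of the UCI dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# UCI bank marketing data; the file name/path is an assumption
df = pd.read_csv("bank.csv", sep=";")
X = pd.get_dummies(df.drop(columns=["y"]))      # one-hot encode categoricals
y = (df["y"] == "yes").astype(int)              # imbalanced target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: decision tree on the full feature set
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("all features:", accuracy_score(y_te, full.predict(X_te)))

# Reduced set: keep the 10 most informative attributes, then retrain
sel = SelectKBest(mutual_info_classif, k=10).fit(X_tr, y_tr)
reduced = DecisionTreeClassifier(random_state=0).fit(sel.transform(X_tr), y_tr)
print("reduced features:",
      accuracy_score(y_te, reduced.predict(sel.transform(X_te))))
```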
Citations
Journal ArticleDOI
TL;DR: In this paper, a machine learning workflow is developed for high classification accuracy and improved prediction confidence, using a binary classification approach on a publicly available dataset from a Portuguese financial institution as a proof of concept.
Abstract: The UK financial sector increasingly employs machine learning techniques to enhance revenue and understand customer behaviour. In this study, we develop a machine learning workflow for high classification accuracy and improved prediction confidence, using a binary classification approach on a publicly available dataset from a Portuguese financial institution as a proof of concept. Our methodology includes data analysis, transformation, and the training and testing of machine learning classifiers such as Naïve Bayes, Decision Trees, Random Forests, Support Vector Machines, Logistic Regression, Artificial Neural Networks, AdaBoost, and Gradient Descent. We use stratified k-fold cross-validation (k = 5) and assemble the top-performing classifiers into a decision-making committee, resulting in over 92% accuracy with two-thirds voting confidence. The workflow is simple, adaptable, and suitable for UK banks, demonstrating its potential for practical implementation while preserving data privacy. Future work will extend our approach to UK banks, reformulate the problem as multi-class classification, and introduce automated pre-training steps for data analysis and transformation.
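A minimal sketch of the committee idea, assuming scikit-learn and a synthetic imbalanced dataset in place of the Portuguese bank data; plain hard majority voting stands in for the paper's two-thirds confidence rule, and only three of the paper's eight classifiers are shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the bank marketing data
X, y = make_classification(n_samples=2000, weights=[0.88], random_state=0)

models = {
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "lr": LogisticRegression(max_iter=1000),
    "nb": GaussianNB(),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, m in models.items():
    print(name, cross_val_score(m, X, y, cv=cv).mean())

# Committee of the individual classifiers; plain majority voting here
committee = VotingClassifier(estimators=list(models.items()), voting="hard")
print("committee", cross_val_score(committee, X, y, cv=cv).mean())
```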
References
Journal ArticleDOI
TL;DR: This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.
Abstract: This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. With each algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and review current and further research on the algorithm. These 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development.

4,944 citations

Journal ArticleDOI
Mahesh Pal
TL;DR: It is suggested that the random forest classifier performs as well as SVMs in terms of classification accuracy and training time, and that random forest classifiers require fewer user-defined parameters than SVMs, with those parameters being easier to define.
Abstract: Growing an ensemble of decision trees and allowing them to vote for the most popular class produced a significant increase in classification accuracy for land cover classification. The objective of this study is to present results obtained with the random forest classifier and to compare its performance with support vector machines (SVMs) in terms of classification accuracy, training time, and user-defined parameters. Landsat Enhanced Thematic Mapper Plus (ETM+) data of an area in the UK with seven different land covers were used. Results from this study suggest that the random forest classifier performs as well as SVMs in terms of classification accuracy and training time. This study also concludes that random forest classifiers require fewer user-defined parameters than SVMs, and that those parameters are easier to define.

2,255 citations
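A hypothetical sketch of this kind of comparison, using synthetic seven-class data in place of the ETM+ land-cover scene and timing each classifier's training alongside its test accuracy:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic 7-class data standing in for the seven land-cover classes
X, y = make_classification(n_samples=5000, n_classes=7, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              SVC(kernel="rbf", C=1.0, gamma="scale")):
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)                     # measure training time
    elapsed = time.perf_counter() - t0
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{type(model).__name__}: acc={acc:.3f}, train={elapsed:.2f}s")
```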

Proceedings Article
01 Jan 2004
TL;DR: A sufficient condition for the optimality of naive Bayes is presented and proved for a setting in which dependences between attributes do exist, providing evidence that dependences among attributes may cancel each other out.
Abstract: Naive Bayes is one of the most efficient and effective inductive learning algorithms for machine learning and data mining. Its competitive performance in classification is surprising, because the conditional independence assumption on which it is based is rarely true in real-world applications. An open question is: what is the true reason for the surprisingly good performance of naive Bayes in classification? In this paper, we propose a novel explanation of the superb classification performance of naive Bayes. We show that, essentially, the dependence distribution plays a crucial role: how the local dependence of a node distributes in each class, evenly or unevenly, and how the local dependencies of all nodes work together, consistently (supporting a certain classification) or inconsistently (canceling each other out). Therefore, no matter how strong the dependences among attributes are, naive Bayes can still be optimal if the dependences distribute evenly in classes, or if the dependences cancel each other out. We propose and prove a sufficient and necessary condition for the optimality of naive Bayes. Further, we investigate the optimality of naive Bayes under the Gaussian distribution. We present and prove a sufficient condition for the optimality of naive Bayes in which dependences between attributes do exist. This provides evidence that dependences among attributes may cancel each other out. In addition, we explore when naive Bayes works well.

Naive Bayes and Augmented Naive Bayes. Classification is a fundamental issue in machine learning and data mining. In classification, the goal of a learning algorithm is to construct a classifier given a set of training examples with class labels. Typically, an example $E$ is represented by a tuple of attribute values $(x_1, x_2, \cdots, x_n)$, where $x_i$ is the value of attribute $X_i$. Let $C$ represent the classification variable, and let $c$ be the value of $C$. In this paper, we assume that there are only two classes: $+$ (the positive class) or $-$ (the negative class). A classifier is a function that assigns a class label to an example. From the probability perspective, according to Bayes' rule, the probability of an example $E = (x_1, x_2, \cdots, x_n)$ being of class $c$ is

$$p(c|E) = \frac{p(E|c)\,p(c)}{p(E)}.$$

$E$ is classified as the class $C = +$ if and only if

$$f_b(E) = \frac{p(C = +|E)}{p(C = -|E)} \geq 1, \tag{1}$$

where $f_b(E)$ is called a Bayesian classifier. Assume that all attributes are independent given the value of the class variable; that is,

$$p(E|c) = p(x_1, x_2, \cdots, x_n|c) = \prod_{i=1}^{n} p(x_i|c).$$

1,536 citations
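A small sketch of the Bayesian ratio classifier $f_b(E)$ from the abstract under the conditional independence assumption, restricted to binary attributes with Laplace smoothing; the toy data and helper names are illustrative, not from the paper.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate p(c) and p(x_i = 1 | c) with Laplace smoothing (classes 0/1)."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        prior = (len(Xc) + 1) / (len(X) + 2)
        theta = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)   # p(x_i = 1 | c)
        params[c] = (prior, theta)
    return params

def f_b(params, x):
    """Posterior ratio p(+|x) / p(-|x); classify as + iff f_b(x) >= 1."""
    def joint(c):
        prior, theta = params[c]
        return prior * np.prod(np.where(x == 1, theta, 1 - theta))
    return joint(1) / joint(0)

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
params = fit_naive_bayes(X, y)
print(f_b(params, np.array([1, 0, 1])))   # >= 1 means predict the positive class
```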

Journal ArticleDOI
TL;DR: Findings of this paper indicate that customer retention received the most research attention, and that classification and association models are the two most commonly used models for data mining in CRM.
Abstract: Despite the importance of data mining techniques to customer relationship management (CRM), there is a lack of a comprehensive literature review and classification scheme for the area. This is the first identifiable academic literature review of the application of data mining techniques to CRM. It provides an academic database of literature from the period 2000-2006 covering 24 journals and proposes a classification scheme for the articles. Nine hundred articles were identified and reviewed for their direct relevance to applying data mining techniques to CRM. Eighty-seven articles were subsequently selected, reviewed, and classified. Each of the 87 selected papers was categorized on four CRM dimensions (Customer Identification, Customer Attraction, Customer Retention, and Customer Development) and seven data mining functions (Association, Classification, Clustering, Forecasting, Regression, Sequence Discovery, and Visualization). Papers were further classified into nine sub-categories of CRM elements under different data mining techniques based on the major focus of each paper. The review and classification process was independently verified. Findings indicate that customer retention received the most research attention, with most of that work related to one-to-one marketing and loyalty programs. Classification and association models are the two most commonly used models for data mining in CRM. Our analysis provides a roadmap to guide future research and facilitate knowledge accumulation and creation concerning the application of data mining techniques in CRM.

1,135 citations

Journal ArticleDOI
01 Jun 2014
TL;DR: A data mining approach is proposed to predict the success of telemarketing calls for selling bank long-term deposits at a Portuguese retail bank, with data collected from 2008 to 2013, thus including the effects of the recent financial crisis.
Abstract: We propose a data mining (DM) approach to predict the success of telemarketing calls for selling bank long-term deposits. A Portuguese retail bank was addressed, with data collected from 2008 to 2013, thus including the effects of the recent financial crisis. We analyzed a large set of 150 features related to bank client, product, and social-economic attributes. A semi-automatic feature selection was explored in the modeling phase, performed with the data prior to July 2012, which allowed a reduced set of 22 features to be selected. We also compared four DM models: logistic regression, decision trees (DTs), neural network (NN), and support vector machine. Using two metrics, area under the receiver operating characteristic curve (AUC) and area under the LIFT cumulative curve (ALIFT), the four models were tested on an evaluation set using the most recent data (after July 2012) and a rolling window scheme. The NN presented the best results (AUC = 0.8 and ALIFT = 0.7), allowing 79% of the subscribers to be reached by selecting only the better-classified half of the clients. Also, two knowledge extraction methods, a sensitivity analysis and a DT, were applied to the NN model and revealed several key attributes (e.g., Euribor rate, direction of the call, and bank agent experience). This knowledge extraction confirmed the model as credible and valuable for telemarketing campaign managers.

673 citations
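A sketch of the paper's two evaluation metrics, assuming model scores are available: roc_auc_score is scikit-learn's standard AUC, while the alift helper below is our own rank-based approximation of the area under the cumulative LIFT curve, not the paper's exact definition.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def alift(y_true, scores):
    """Rank-based approximation of the area under the cumulative LIFT curve:
    average share of all positives captured as contacts are taken in
    descending score order."""
    order = np.argsort(-scores)
    captured = np.cumsum(y_true[order]) / y_true.sum()
    return captured.mean()

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
scores = y * 0.5 + rng.random(1000)       # noisy but informative scores
print("AUC   =", roc_auc_score(y, scores))
print("ALIFT =", alift(y, scores))
```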