Journal ArticleDOI

Assessment of various supervised learning algorithms using different performance metrics

01 Nov 2017-Vol. 263, Iss: 4, pp 042087
TL;DR: This work compares the performance of supervised machine learning algorithms on a binary classification task by analysing metrics such as Accuracy, F-Measure, G-Measure, Precision, Misclassification Rate, False Positive Rate, True Positive Rate, Specificity, and Prevalence.
Abstract: Our work compares the performance of supervised machine learning algorithms on a binary classification task. The supervised machine learning algorithms considered are Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbour (KNN), Naive Bayes (NB) and Random Forest (RF). The paper focuses on comparing the performance of the above-mentioned algorithms on one binary classification task by analysing metrics such as Accuracy, F-Measure, G-Measure, Precision, Misclassification Rate, False Positive Rate, True Positive Rate, Specificity, and Prevalence.
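For readers who want to reproduce this kind of comparison, the sketch below (not the paper's own code) derives the listed metrics from binary confusion-matrix counts, using scikit-learn with synthetic data and a placeholder SVM; the G-measure is taken here as the geometric mean of precision and recall, which is one common definition.

```python
# Minimal sketch of the binary-classification metrics compared in the paper,
# computed from confusion-matrix counts. Data and classifier are stand-ins.
from math import sqrt

from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)   # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

y_pred = SVC().fit(X_tr, y_tr).predict(X_te)
tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
misclassification_rate = 1 - accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)                 # true positive rate (sensitivity)
specificity = tn / (tn + fp)
false_positive_rate = fp / (fp + tn)
prevalence = (tp + fn) / (tp + tn + fp + fn)
f_measure = 2 * precision * recall / (precision + recall)
g_measure = sqrt(precision * recall)    # one common G-measure definition

print(f"accuracy={accuracy:.3f}  F={f_measure:.3f}  G={g_measure:.3f}")
```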
Citations
Journal ArticleDOI
TL;DR: The aim of the study was to build a machine learning model that classifies and predicts motor insurance claim status; the RF classifier was found to perform slightly better than the multi-class Support Vector Machine.
Abstract: The insurance claim is a basic problem in insurance companies. Insurers face a constant challenge from growing insurance claim losses, because claim fraud occurs and the volume of claim data in insurance companies keeps increasing. As a result, it is difficult to classify the insured claim status during the claim review process. Therefore, the aim of the study was to build a machine learning model that classifies and predicts motor insurance claim status. To achieve this, missing value ratio, Z-score, encoding techniques and entropy were used as data set preparation techniques. The final preprocessed data sets were split into training and testing sets using k-fold cross-validation. Finally, the prediction models were built using Random Forest (RF) and multi-class Support Vector Machine (SVM). The performance of the RF and multi-class SVM classifiers was evaluated using Accuracy, Precision, Recall, and F-measure. The models predict the motor insurance claim status with 98.36% and 98.17% accuracy for the RF and SVM classifiers respectively. As a result, the RF classifier is slightly better than the multi-class Support Vector Machine. Developing and implementing a hybrid model, with a graphical user interface, that benefits from the advantages of different algorithms and applies the solution to the insurance company's real-world problem is a pressing item of future work.
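A hedged sketch of the kind of pipeline this abstract describes is given below. It is not the authors' code: the claims.csv file, its claim_status column and the exact preprocessing steps are assumptions, and scikit-learn stands in for whatever tooling the authors used.

```python
# Sketch: z-score scaling, one-hot encoding, k-fold cross-validation, and
# RF vs. multi-class SVM compared on accuracy, precision, recall, F-measure.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("claims.csv")                         # hypothetical data set
y = df["claim_status"]                                 # hypothetical target column
X = pd.get_dummies(df.drop(columns=["claim_status"]))  # encode categoricals

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

for name, model in [("RF", RandomForestClassifier(random_state=0)),
                    ("SVM", make_pipeline(StandardScaler(), SVC()))]:
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(name, {k: v.mean().round(4) for k, v in scores.items()
                 if k.startswith("test_")})
```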

3 citations


Cites background from "Assessment of various supervised le..."

  • ...These are TP (True positive), TN (True negative), FP (False positive) and FN (False negative) [16]....


Proceedings ArticleDOI
01 May 2019
TL;DR: The proposed system improves on existing BI tools by supporting predictive analytics alongside the functionalities offered by any BI tool, and it improves the user experience to accommodate the growing needs of the industry.
Abstract: Business Intelligence tools help to present a snapshot of the company by using graphical tools like pie charts, bar graphs and dashboards, which facilitate easy understanding and decision-making. However, measures can be adopted to make BI tools more user-friendly. Our proposal improves on existing BI tools by supporting predictive analytics alongside the functionalities offered by any BI tool. It also enables the user to ask queries in natural language format. The application analyses the query structure and categorizes it as a classification, regression or clustering problem. Once the query is categorized, it is processed by applying all the algorithms within that category which are supported by Apache Spark's machine learning library MLlib. These algorithms are compared on various evaluation metrics such as accuracy and precision, and the most suitable algorithm is used to form the final predictive model. A labelled dataset ensures that our predictive analysis model needs to focus on supervised learning algorithms only. The computed results are then represented in a graphical format for ease of comprehension by management. The proposed solution exploits Apache Spark's processing power, speed, ability to handle huge datasets and machine learning support, Apache Spark MLlib. For the implementation of the proposed solution we used a MongoDB database of a windmill electricity generation plant. The proposal offers added functionalities to BI tools and improves the user experience to accommodate the growing needs of the industry.
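The sketch below illustrates the core idea of running several Spark MLlib classifiers on the same data and keeping the one with the best evaluation score. It is an illustration only: the windmill_features.parquet file with ready-made features/label columns is a hypothetical stand-in for the paper's MongoDB-backed data, and only two candidate algorithms are shown.

```python
# Sketch: fit several Spark MLlib classifiers, evaluate each on a held-out
# split, and keep the best one as the final predictive model.
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("model-selection").getOrCreate()
# Hypothetical file; assumed to already contain "features" and "label" columns.
data = spark.read.parquet("windmill_features.parquet")
train, test = data.randomSplit([0.8, 0.2], seed=42)

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
candidates = {"logreg": LogisticRegression(), "rf": RandomForestClassifier()}

best_name, best_model, best_acc = None, None, -1.0
for name, estimator in candidates.items():
    model = estimator.fit(train)
    acc = evaluator.evaluate(model.transform(test))
    if acc > best_acc:
        best_name, best_model, best_acc = name, model, acc

print(f"selected {best_name} with accuracy {best_acc:.3f}")
```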

2 citations

Book ChapterDOI
TL;DR: In this article, a literature review is undertaken, classifying the most common supervised learning performance measurements in cybersecurity research; the key finding is that supervised learning is mostly used because of its capability to detect known patterns on a restrictive application challenge.
Abstract: Supervised learning (SL) is being increasingly adopted to enhance capability and mitigate cyberattacks. Published literature containing empirical studies often demonstrates an optimistic viewpoint, with promising results achieving greater than 90% accuracy when detecting and mitigating cyberattacks. These results are often generated on well-refined test scenarios. Cyberattack statistics show a continued increase in occurrence, and attacks continue to cause significant damage. As a result, organisations are becoming increasingly worried about suffering a cyberattack, which increases their desire to identify and adopt suitable solutions. The optimistic results presented in research studies might misrepresent an application's true capabilities and set unreachable expectations. The purpose of this paper is to investigate how SL techniques are applied to cybersecurity challenges and how they are evaluated. To pursue this aim, a literature review is undertaken, classifying the most common SL performance measurements in cybersecurity research. The key finding of this paper is that SL is mostly used because of its capability to detect known patterns on a restrictive application challenge. This could therefore be misleading for those wanting to utilise such systems.
Journal ArticleDOI
27 Feb 2023-Data
TL;DR: In this paper, different data balancing techniques were applied to improve prediction accuracy in the minority class while maintaining a satisfactory overall classification performance, and the results indicated that SMOTE with Edited Nearest Neighbor achieved the best classification performance on the 10-fold holdout sample.
Abstract: Predicting student dropout is a challenging problem in the education sector. This is due to an imbalance in student dropout data, mainly because the number of registered students is always higher than the number of dropout students. Developing a model without taking the data imbalance issue into account may lead to an ungeneralized model. In this study, different data balancing techniques were applied to improve prediction accuracy in the minority class while maintaining a satisfactory overall classification performance. Random Over Sampling, Random Under Sampling, Synthetic Minority Over Sampling (SMOTE), SMOTE with Edited Nearest Neighbor and SMOTE with Tomek links were tested, along with three popular classification models: Logistic Regression, Random Forest, and Multi-Layer Perceptron. Publicly accessible datasets from Tanzania and India were used to evaluate the effectiveness of the balancing techniques and prediction models. The results indicate that SMOTE with Edited Nearest Neighbor achieved the best classification performance on the 10-fold holdout sample. Furthermore, using the confusion matrix as the evaluation metric, Logistic Regression correctly classified the largest number of dropout students (57,348 for the Uwezo dataset and 13,430 for the India dataset). The application of these models allows for the precise prediction of at-risk students and the reduction of dropout rates.
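A minimal sketch of this comparison using the imbalanced-learn library is shown below. It is not the paper's code: the dropout.csv file and its binary dropout column are hypothetical, and only Logistic Regression is shown for brevity. Resampling inside an imbalanced-learn pipeline ensures that only the training folds are rebalanced, which is the standard way to avoid leaking synthetic samples into the test folds.

```python
# Sketch: compare resampling strategies for an imbalanced dropout data set.
import pandas as pd
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("dropout.csv")                      # hypothetical data set
X, y = df.drop(columns=["dropout"]), df["dropout"]   # y assumed binary 0/1

samplers = {
    "ROS": RandomOverSampler(random_state=0),
    "RUS": RandomUnderSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
    "SMOTE-ENN": SMOTEENN(random_state=0),
    "SMOTE-Tomek": SMOTETomek(random_state=0),
}
for name, sampler in samplers.items():
    pipe = make_pipeline(sampler, LogisticRegression(max_iter=1000))
    f1 = cross_val_score(pipe, X, y, cv=10, scoring="f1").mean()
    print(f"{name}: minority-class F1 = {f1:.3f}")
```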
Journal ArticleDOI
08 Jul 2021
TL;DR: This review covers different machine learning algorithms, their descriptions, and the pros and cons of each; it also covers the tasks of machine learning, the supervised learning process, and the tools, techniques and programming languages used to build machine learning models.
Abstract: This review focuses on different machine learning algorithms, their descriptions, and the pros and cons of each. It also covers the tasks of machine learning, the supervised learning process, and the tools, techniques and programming languages used to build machine learning models. Machine learning is a computational craft that learns over time from experience: it learns from data and turns input data into information, mapping input values to output values using labelled data. The objective of machine learning is to build algorithms that can learn from past data without the help of human experts. A learning algorithm is characterised by a task T, a performance measure P and training experience E. Machine learning algorithms are categorised as supervised, unsupervised and reinforcement learning, and supervised machine learning has two tasks, classification and regression. Hadoop with Spark and the Python programming language are most commonly used to build machine learning models. Information gain, gain ratio, the Gini index and the Random Forest algorithm are used to measure feature importance. This study found that most supervised learning work uses k-fold cross-validation to split the data set into training and testing sets; the standard value of k is five or ten, but it is not fixed, because it depends on the size of the data set. Model performance is mainly evaluated using classification accuracy, precision, recall, area under the curve and F-measure (F-score). In general, machine learning algorithms depend on the nature of the data, because the performance of a learning algorithm is affected by the data set; therefore it is impossible to say that the prediction accuracy of one algorithm is best over all others. As each machine learning algorithm has pros and cons, designing hybrid algorithms might overcome this problem.
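As a small illustration of one practice the review highlights, the sketch below runs 5-fold and 10-fold cross-validation with the commonly reported evaluation metrics on a stand-in scikit-learn data set; it is an example, not material from the review itself.

```python
# Sketch: k-fold cross-validation (k = 5 and 10) reporting accuracy, precision,
# recall, F-measure and area under the ROC curve for a random forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)   # stand-in labelled data set
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]

for k in (5, 10):                            # the "standard" k values cited by the review
    scores = cross_validate(RandomForestClassifier(random_state=0),
                            X, y, cv=k, scoring=metrics)
    print(k, {m: scores[f"test_{m}"].mean().round(3) for m in metrics})
```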
References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the splitting, and the ideas are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
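A brief scikit-learn sketch of the internal estimates described here, the out-of-bag error and variable importance, is shown below. It is an illustration on synthetic data, not Breiman's original Fortran implementation.

```python
# Sketch: out-of-bag error as an internal estimate of generalization error,
# plus impurity-based variable importance, for a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # stand-in data

forest = RandomForestClassifier(n_estimators=500, bootstrap=True,
                                oob_score=True, random_state=0).fit(X, y)
print("OOB error estimate:", round(1 - forest.oob_score_, 3))
print("variable importance (first 5):", forest.feature_importances_.round(3)[:5])
```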

79,257 citations

01 Jan 2007
TL;DR: Random forests add an additional layer of randomness to bagging and are robust against overfitting; the randomForest package provides an R interface to the Fortran programs by Breiman and Cutler.
Abstract: Recently there has been a lot of interest in “ensemble learning” — methods that generate many classifiers and aggregate their results. Two well-known methods are boosting (see, e.g., Schapire et al., 1998) and bagging (Breiman, 1996) of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees — each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values. The randomForest package provides an R interface to the Fortran programs by Breiman and Cutler (available at http://www.stat.berkeley.edu/users/breiman/). This article provides a brief introduction to the usage and features of the R functions.
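Since the article documents the R randomForest package, the following Python sketch is only a rough analogue: it maps the two parameters the abstract highlights, mtry (predictors tried at each split) and ntree (trees in the forest), onto scikit-learn's max_features and n_estimators, using synthetic data.

```python
# Sketch: the two key random-forest parameters, tried over a small grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)

for mtry in ("sqrt", 0.5):          # ~ mtry in R's randomForest
    for ntree in (100, 500):        # ~ ntree
        acc = cross_val_score(
            RandomForestClassifier(n_estimators=ntree, max_features=mtry,
                                   random_state=0),
            X, y, cv=5).mean()
        print(f"max_features={mtry}, n_estimators={ntree}: accuracy={acc:.3f}")
```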

14,830 citations

Proceedings Article
05 Dec 2005
TL;DR: In this article, a Mahalanobis distance metric for k-NN classification is trained with the goal that the k nearest neighbors always belong to the same class while examples from different classes are separated by a large margin.
Abstract: We show how to learn a Mahalanobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification.
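The NumPy fragment below is not the authors' semidefinite-programming solver; it only illustrates what a learned Mahalanobis metric means for kNN, using a hand-picked (assumed) linear map L in place of one learned with a large-margin objective.

```python
# Sketch: kNN under a Mahalanobis metric d(x, x') = ||L(x - x')||, i.e. M = L^T L.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                          # toy training points
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)
L = np.array([[2.0, 0.0],                              # assumed metric: stretch axis 0,
              [0.0, 0.5]])                             # shrink axis 1

def knn_predict(x, k=3):
    d = np.linalg.norm((X - x) @ L.T, axis=1)          # ||L(x_i - x)|| for every x_i
    return np.bincount(y[np.argsort(d)[:k]]).argmax()  # majority vote of k nearest

print(knn_predict(np.array([0.3, -1.0])))
```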

4,433 citations

Journal ArticleDOI
01 Jun 1991
TL;DR: The subjects of tree structure design, feature selection at each internal node, and decision and search strategies are discussed, as is the relation between decision trees and neural networks (NN).
Abstract: A survey is presented of current methods for decision tree classifier (DTC) design and the various existing issues. After considering the potential advantages of DTCs over single-stage classifiers, the subjects of tree structure design, feature selection at each internal node, and decision and search strategies are discussed. The relation between decision trees and neural networks (NN) is also discussed.
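As a quick illustration of the design choices the survey discusses, namely the split criterion used for feature selection at each internal node and the resulting tree structure, the scikit-learn sketch below trains a small tree on the Iris data and prints its structure; it is not code from the survey.

```python
# Sketch: an entropy-based decision tree and a text dump of its node structure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=list(load_iris().feature_names)))
```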

3,176 citations