Journal ArticleDOI

Assessment of various supervised learning algorithms using different performance metrics

01 Nov 2017-Vol. 263, Iss: 4, pp 042087
TL;DR: This work compares the performance of supervised machine learning algorithms on a binary classification task by analysing metrics such as Accuracy, F-Measure, G-Measure, Precision, Misclassification Rate, False Positive Rate, True Positive Rate, Specificity, and Prevalence.
Abstract: Our work compares the performance of supervised machine learning algorithms on a binary classification task. The supervised machine learning algorithms considered are Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbour (KNN), Naive Bayes (NB) and Random Forest (RF). The paper focuses on comparing the performance of the above-mentioned algorithms on one binary classification task by analysing metrics such as Accuracy, F-Measure, G-Measure, Precision, Misclassification Rate, False Positive Rate, True Positive Rate, Specificity, and Prevalence.
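For readers who want to reproduce this kind of comparison, the sketch below (not the paper's own code) derives the listed metrics from binary confusion-matrix counts, using scikit-learn with synthetic data and a placeholder SVM; the G-measure is taken here as the geometric mean of precision and recall, which is one common definition.

```python
# Minimal sketch of the binary-classification metrics compared in the paper,
# computed from confusion-matrix counts. Data and classifier are stand-ins.
from math import sqrt

from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)   # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

y_pred = SVC().fit(X_tr, y_tr).predict(X_te)
tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
misclassification_rate = 1 - accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)                 # true positive rate (sensitivity)
specificity = tn / (tn + fp)
false_positive_rate = fp / (fp + tn)
prevalence = (tp + fn) / (tp + tn + fp + fn)
f_measure = 2 * precision * recall / (precision + recall)
g_measure = sqrt(precision * recall)    # one common G-measure definition

print(f"accuracy={accuracy:.3f}  F={f_measure:.3f}  G={g_measure:.3f}")
```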
Citations
Journal ArticleDOI
TL;DR: The aim of the study was to build a machine learning model that classifies and predicts motor insurance claim status; the RF classifier was found to perform slightly better than the multi-class Support Vector Machine.
Abstract: The insurance claim is a basic problem in insurance companies. Insurers face a constant challenge from growing insurance claim losses, because claim fraud occurs and the volume of claim data in insurance companies keeps increasing. As a result, it is difficult to classify the insured claim status during the claim review process. Therefore, the aim of the study was to build a machine learning model that classifies and predicts motor insurance claim status. To achieve this, missing value ratio, Z-score, encoding techniques and entropy were used as data set preparation techniques. The final preprocessed data sets were split into training and testing sets using k-fold cross-validation. Finally, the prediction models were built using Random Forest (RF) and multi-class Support Vector Machine (SVM). The performance of the RF and multi-class SVM classifiers was evaluated using Accuracy, Precision, Recall, and F-measure. The models predict the motor insurance claim status with 98.36% and 98.17% accuracy for the RF and SVM classifiers respectively. As a result, the RF classifier is slightly better than the multi-class Support Vector Machine. Developing and implementing a hybrid model, with a graphical user interface, that benefits from the advantages of different algorithms and applies the solution to the insurance company's real-world problem is a pressing item of future work.
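A hedged sketch of the kind of pipeline this abstract describes is given below. It is not the authors' code: the claims.csv file, its claim_status column and the exact preprocessing steps are assumptions, and scikit-learn stands in for whatever tooling the authors used.

```python
# Sketch: z-score scaling, one-hot encoding, k-fold cross-validation, and
# RF vs. multi-class SVM compared on accuracy, precision, recall, F-measure.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("claims.csv")                         # hypothetical data set
y = df["claim_status"]                                 # hypothetical target column
X = pd.get_dummies(df.drop(columns=["claim_status"]))  # encode categoricals

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

for name, model in [("RF", RandomForestClassifier(random_state=0)),
                    ("SVM", make_pipeline(StandardScaler(), SVC()))]:
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(name, {k: v.mean().round(4) for k, v in scores.items()
                 if k.startswith("test_")})
```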

3 citations


Cites background from "Assessment of various supervised le..."

  • ...These are TP (True positive), TN (True negative), FP (False positive) and FN (False negative) [16]....


Proceedings ArticleDOI
01 May 2019
TL;DR: The proposed system improves on existing BI tools by supporting predictive analytics alongside the functionalities offered by any BI tool, and it improves the user experience to accommodate the growing needs of the industry.
Abstract: Business Intelligence tools help to present a snapshot of the company by using graphical tools like pie charts, bar graphs and dashboards, which facilitate easy understanding and decision-making. However, measures can be adopted to make BI tools more user-friendly. Our proposal improves on existing BI tools by supporting predictive analytics alongside the functionalities offered by any BI tool. It also enables the user to ask queries in natural language format. The application analyses the query structure and categorizes it as a classification, regression or clustering problem. Once the query is categorized, it is processed by applying all the algorithms within that category which are supported by Apache Spark's machine learning library MLlib. These algorithms are compared on various evaluation metrics such as accuracy and precision, and the most suitable algorithm is used to form the final predictive model. A labelled dataset ensures that our predictive analysis model needs to focus on supervised learning algorithms only. The computed results are then represented in a graphical format for ease of comprehension by management. The proposed solution exploits Apache Spark's processing power, speed, ability to handle huge datasets and machine learning support, Apache Spark MLlib. For the implementation of the proposed solution we used a MongoDB database of a windmill electricity generation plant. The proposal offers added functionalities to BI tools and improves the user experience to accommodate the growing needs of the industry.
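The sketch below illustrates the core idea of running several Spark MLlib classifiers on the same data and keeping the one with the best evaluation score. It is an illustration only: the windmill_features.parquet file with ready-made features/label columns is a hypothetical stand-in for the paper's MongoDB-backed data, and only two candidate algorithms are shown.

```python
# Sketch: fit several Spark MLlib classifiers, evaluate each on a held-out
# split, and keep the best one as the final predictive model.
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("model-selection").getOrCreate()
# Hypothetical file; assumed to already contain "features" and "label" columns.
data = spark.read.parquet("windmill_features.parquet")
train, test = data.randomSplit([0.8, 0.2], seed=42)

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
candidates = {"logreg": LogisticRegression(), "rf": RandomForestClassifier()}

best_name, best_model, best_acc = None, None, -1.0
for name, estimator in candidates.items():
    model = estimator.fit(train)
    acc = evaluator.evaluate(model.transform(test))
    if acc > best_acc:
        best_name, best_model, best_acc = name, model, acc

print(f"selected {best_name} with accuracy {best_acc:.3f}")
```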

2 citations

Book ChapterDOI
TL;DR: In this article, a literature review is undertaken, classifying the most common supervised learning performance measurements in cybersecurity research; the key finding is that supervised learning is mostly used because of its capability to detect known patterns on a restrictive application challenge.
Abstract: Supervised learning (SL) is being increasingly adopted to enhance capability and mitigate cyberattacks. Published literature containing empirical studies often demonstrates an optimistic viewpoint, with promising results achieving greater than 90% accuracy when detecting and mitigating cyberattacks. These results are often generated on well-refined test scenarios. Cyberattack statistics show a continued increase in occurrence, and attacks continue to cause significant damage. As a result, organisations are becoming increasingly worried about suffering a cyberattack, which increases their desire to identify and adopt suitable solutions. The optimistic results presented in research studies might misrepresent an application's true capabilities and set unreachable expectations. The purpose of this paper is to investigate how SL techniques are applied to cybersecurity challenges and how they are evaluated. To pursue this aim, a literature review is undertaken, classifying the most common SL performance measurements in cybersecurity research. The key finding of this paper is that SL is mostly used because of its capability to detect known patterns on a restrictive application challenge. This could therefore be misleading for those wanting to utilise such systems.
Journal ArticleDOI
27 Feb 2023-Data
TL;DR: In this paper, different data balancing techniques were applied to improve prediction accuracy in the minority class while maintaining a satisfactory overall classification performance, and the results indicated that SMOTE with Edited Nearest Neighbor achieved the best classification performance on the 10-fold holdout sample.
Abstract: Predicting student dropout is a challenging problem in the education sector. This is due to an imbalance in student dropout data, mainly because the number of registered students is always higher than the number of dropout students. Developing a model without taking the data imbalance issue into account may lead to an ungeneralized model. In this study, different data balancing techniques were applied to improve prediction accuracy in the minority class while maintaining a satisfactory overall classification performance. Random Over Sampling, Random Under Sampling, Synthetic Minority Over Sampling (SMOTE), SMOTE with Edited Nearest Neighbor and SMOTE with Tomek links were tested, along with three popular classification models: Logistic Regression, Random Forest, and Multi-Layer Perceptron. Publicly accessible datasets from Tanzania and India were used to evaluate the effectiveness of the balancing techniques and prediction models. The results indicate that SMOTE with Edited Nearest Neighbor achieved the best classification performance on the 10-fold holdout sample. Furthermore, using the confusion matrix as the evaluation metric, Logistic Regression correctly classified the largest number of dropout students (57,348 for the Uwezo dataset and 13,430 for the India dataset). The application of these models allows for the precise prediction of at-risk students and the reduction of dropout rates.
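A minimal sketch of this comparison using the imbalanced-learn library is shown below. It is not the paper's code: the dropout.csv file and its binary dropout column are hypothetical, and only Logistic Regression is shown for brevity. Resampling inside an imbalanced-learn pipeline ensures that only the training folds are rebalanced, which is the standard way to avoid leaking synthetic samples into the test folds.

```python
# Sketch: compare resampling strategies for an imbalanced dropout data set.
import pandas as pd
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("dropout.csv")                      # hypothetical data set
X, y = df.drop(columns=["dropout"]), df["dropout"]   # y assumed binary 0/1

samplers = {
    "ROS": RandomOverSampler(random_state=0),
    "RUS": RandomUnderSampler(random_state=0),
    "SMOTE": SMOTE(random_state=0),
    "SMOTE-ENN": SMOTEENN(random_state=0),
    "SMOTE-Tomek": SMOTETomek(random_state=0),
}
for name, sampler in samplers.items():
    pipe = make_pipeline(sampler, LogisticRegression(max_iter=1000))
    f1 = cross_val_score(pipe, X, y, cv=10, scoring="f1").mean()
    print(f"{name}: minority-class F1 = {f1:.3f}")
```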
Journal ArticleDOI
08 Jul 2021
TL;DR: This review covers different machine learning algorithms, their descriptions, and the pros and cons of each; it also covers the tasks of machine learning, the supervised learning process, and the tools, techniques and programming languages used to build machine learning models.
Abstract: This review focuses on different machine learning algorithms, their descriptions, and the pros and cons of each. It also covers the tasks of machine learning, the supervised learning process, and the tools, techniques and programming languages used to build machine learning models. Machine learning is a computational craft that learns over time from experience: it learns from data and turns input data into information, mapping input values to output values using labelled data. The objective of machine learning is to build algorithms that can learn from past data without the help of human experts. A learning algorithm is characterised by a task T, a performance measure P and training experience E. Machine learning algorithms are categorised as supervised, unsupervised and reinforcement learning, and supervised machine learning has two tasks, classification and regression. Hadoop with Spark and the Python programming language are most commonly used to build machine learning models. Information gain, gain ratio, the Gini index and the Random Forest algorithm are used to measure feature importance. This study found that most supervised learning work uses k-fold cross-validation to split the data set into training and testing sets; the standard value of k is five or ten, but it is not fixed, because it depends on the size of the data set. Model performance is mainly evaluated using classification accuracy, precision, recall, area under the curve and F-measure (F-score). In general, machine learning algorithms depend on the nature of the data, because the performance of a learning algorithm is affected by the data set; therefore it is impossible to say that the prediction accuracy of one algorithm is best over all others. As each machine learning algorithm has pros and cons, designing hybrid algorithms might overcome this problem.
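As a small illustration of one practice the review highlights, the sketch below runs 5-fold and 10-fold cross-validation with the commonly reported evaluation metrics on a stand-in scikit-learn data set; it is an example, not material from the review itself.

```python
# Sketch: k-fold cross-validation (k = 5 and 10) reporting accuracy, precision,
# recall, F-measure and area under the ROC curve for a random forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)   # stand-in labelled data set
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]

for k in (5, 10):                            # the "standard" k values cited by the review
    scores = cross_validate(RandomForestClassifier(random_state=0),
                            X, y, cv=k, scoring=metrics)
    print(k, {m: scores[f"test_{m}"].mean().round(3) for m in metrics})
```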
References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the splitting, and the ideas are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
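A brief scikit-learn sketch of the internal estimates described here, the out-of-bag error and variable importance, is shown below. It is an illustration on synthetic data, not Breiman's original Fortran implementation.

```python
# Sketch: out-of-bag error as an internal estimate of generalization error,
# plus impurity-based variable importance, for a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # stand-in data

forest = RandomForestClassifier(n_estimators=500, bootstrap=True,
                                oob_score=True, random_state=0).fit(X, y)
print("OOB error estimate:", round(1 - forest.oob_score_, 3))
print("variable importance (first 5):", forest.feature_importances_.round(3)[:5])
```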

79,257 citations

01 Jan 2007
TL;DR: Random forests add an additional layer of randomness to bagging and are robust against overfitting; the randomForest package provides an R interface to the Fortran programs by Breiman and Cutler.
Abstract: Recently there has been a lot of interest in “ensemble learning” — methods that generate many classifiers and aggregate their results. Two well-known methods are boosting (see, e.g., Schapire et al., 1998) and bagging (Breiman, 1996) of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees — each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values. The randomForest package provides an R interface to the Fortran programs by Breiman and Cutler (available at http://www.stat.berkeley.edu/users/breiman/). This article provides a brief introduction to the usage and features of the R functions.
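Since the article documents the R randomForest package, the following Python sketch is only a rough analogue: it maps the two parameters the abstract highlights, mtry (predictors tried at each split) and ntree (trees in the forest), onto scikit-learn's max_features and n_estimators, using synthetic data.

```python
# Sketch: the two key random-forest parameters, tried over a small grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)

for mtry in ("sqrt", 0.5):          # ~ mtry in R's randomForest
    for ntree in (100, 500):        # ~ ntree
        acc = cross_val_score(
            RandomForestClassifier(n_estimators=ntree, max_features=mtry,
                                   random_state=0),
            X, y, cv=5).mean()
        print(f"max_features={mtry}, n_estimators={ntree}: accuracy={acc:.3f}")
```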

14,830 citations

Proceedings Article
05 Dec 2005
TL;DR: In this article, a Mahalanobis distance metric for k-NN classification is trained with the goal that the k nearest neighbors always belong to the same class while examples from different classes are separated by a large margin.
Abstract: We show how to learn a Mahalanobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification.
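The NumPy fragment below is not the authors' semidefinite-programming solver; it only illustrates what a learned Mahalanobis metric means for kNN, using a hand-picked (assumed) linear map L in place of one learned with a large-margin objective.

```python
# Sketch: kNN under a Mahalanobis metric d(x, x') = ||L(x - x')||, i.e. M = L^T L.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                          # toy training points
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)
L = np.array([[2.0, 0.0],                              # assumed metric: stretch axis 0,
              [0.0, 0.5]])                             # shrink axis 1

def knn_predict(x, k=3):
    d = np.linalg.norm((X - x) @ L.T, axis=1)          # ||L(x_i - x)|| for every x_i
    return np.bincount(y[np.argsort(d)[:k]]).argmax()  # majority vote of k nearest

print(knn_predict(np.array([0.3, -1.0])))
```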

4,433 citations

Journal ArticleDOI
01 Jun 1991
TL;DR: The subjects of tree structure design, feature selection at each internal node, and decision and search strategies are discussed, as is the relation between decision trees and neural networks (NN).
Abstract: A survey is presented of current methods for decision tree classifier (DTC) design and the various existing issues. After considering the potential advantages of DTCs over single-stage classifiers, the subjects of tree structure design, feature selection at each internal node, and decision and search strategies are discussed. The relation between decision trees and neural networks (NN) is also discussed.
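As a quick illustration of the design choices the survey discusses, namely the split criterion used for feature selection at each internal node and the resulting tree structure, the scikit-learn sketch below trains a small tree on the Iris data and prints its structure; it is not code from the survey.

```python
# Sketch: an entropy-based decision tree and a text dump of its node structure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=list(load_iris().feature_names)))
```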

3,176 citations