Journal ArticleDOI

An up-to-date comparison of state-of-the-art classification algorithms

TL;DR: It is found that Stochastic Gradient Boosting Trees (GBDT) matches or exceeds the prediction performance of Support Vector Machines and Random Forests, while being the fastest algorithm in terms of prediction efficiency.
Abstract: Up-to-date report on the accuracy and efficiency of state-of-the-art classifiers. We compare the accuracy of 11 classification algorithms pairwise and groupwise. We examine separately the training, parameter-tuning, and testing time. GBDT and Random Forests yield the highest accuracy, outperforming SVM. GBDT is the fastest in testing, Naive Bayes the fastest in training. Current benchmark reports of classification algorithms generally concern common classifiers and their variants but do not include many algorithms that have been introduced in recent years. Moreover, important properties such as the dependency on the number of classes and features and CPU running time are typically not examined. In this paper, we carry out a comparative empirical study on both established classifiers and more recently proposed ones on 71 data sets originating from different domains, publicly available at the UCI and KEEL repositories. The list of 11 algorithms studied includes Extreme Learning Machine (ELM), Sparse Representation based Classification (SRC), and Deep Learning (DL), which have not been thoroughly investigated in existing comparative studies. It is found that Stochastic Gradient Boosting Trees (GBDT) matches or exceeds the prediction performance of Support Vector Machines (SVM) and Random Forests (RF), while being the fastest algorithm in terms of prediction efficiency. ELM also yields good accuracy results, ranking in the top 5 alongside GBDT, RF, SVM, and C4.5, but this performance varies widely across data sets. Unsurprisingly, the top accuracy performers have average or slow training-time efficiency. DL is the worst performer in terms of accuracy but the second fastest in prediction efficiency. SRC shows good accuracy performance but is the slowest classifier in both training and testing.

Summary (3 min read)

2. State-of-the-art in classifier comparison

  • In summary, as can be seen in Table 1, most of the existing works on classifier comparison have not considered several important classifiers, namely GBDT, ELM, SRC, and DL.
  • Moreover, the findings of these works are not always consistent; for example, the accuracy of RF and SVM was reported to be very similar in Macia and Bernado-Mansilla (2014), while in Brown and Mues (2012), Fernández-Delgado et al. (2014), and Lessmann et al. (2015), RF was reported to outperform SVM.
  • Therefore, it is essential to conduct an up-to-date comparative study of the current state-of-the-art classification algorithms, taking into consideration GBDT and the newer classifiers ELM, SRC, and DL.

3. Classification algorithms to be compared

  • For the sake of convenience and clarity, the authors group the classification algorithms investigated in this work into three groups.
  • The first category mainly includes Support Vector Machines (SVM) and Random Forests (RF), which are known to be among the best performers (and are therefore usually used as default classifiers).
  • The second category consists of algorithms proposed in recent years, which have consequently not been included in most comparative studies.
  • The algorithms in the third category are not as popular as SVM and RF, but they are also important and have found applications in many domains.
  • In particular, GBDT is commonly underutilised by many researchers and practitioners, even though it has been reported in the literature to achieve high classification accuracy, e.g. in Caruana and Niculescu-Mizil (2006), Brown and Mues (2012), and Chapelle and Chang (2011); a minimal usage sketch is given below.
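
A minimal sketch, as an illustrative assumption rather than the authors' experimental setup: training a stochastic gradient boosting trees classifier with scikit-learn on a toy data set. The data set, hyper-parameter values, and the Python/scikit-learn choice are all assumptions for exposition (the paper itself used other implementations).

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # subsample < 1.0 draws a random fraction of the training data per boosting
    # round, which is what makes the gradient boosting "stochastic" (GBDT).
    gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                      subsample=0.8, random_state=0)
    gbdt.fit(X_train, y_train)
    print("test accuracy:", gbdt.score(X_test, y_test))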

3.1. SVM and RF

  • Vanschoren, Blockeel, Pfahringer, and Holmes (2012) also performed experiments with UCI data sets and attested that RBF-based SVM and RF yield good results in predictive accuracy, but the variation in their performance is large because characteristics of the data heavily affect it.
  • They also observed that it is more rewarding to fine-tune RF and SVM than to bag or boost their default settings, whereas for C4.5, bagging and boosting outperform fine-tuning; a sketch contrasting the two strategies is given below.
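
The contrast between fine-tuning RF/SVM and bagging a default decision tree can be sketched as follows; the parameter grids, toy data set, and scikit-learn usage are illustrative assumptions, not the settings used by Vanschoren et al.

    from sklearn.datasets import load_digits
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_digits(return_X_y=True)

    # Fine-tuning: cross-validated grid search over RF and RBF-SVM hyper-parameters.
    rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                             {"n_estimators": [100, 300], "max_features": ["sqrt", 0.5]},
                             cv=5).fit(X, y)
    svm_search = GridSearchCV(SVC(kernel="rbf"),
                              {"C": [1, 10, 100], "gamma": ["scale", 0.01]},
                              cv=5).fit(X, y)

    # Contrast: bagging a decision tree left at its default settings (no tuning).
    bagged_default_tree = cross_val_score(
        BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
        X, y, cv=5).mean()

    print("tuned RF           :", rf_search.best_score_)
    print("tuned RBF-SVM      :", svm_search.best_score_)
    print("bagged default tree:", bagged_default_tree)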

3.2. Newer classifiers

  • A Deep Belief Network (DBN) (Hinton, Osindero, & Teh, 2006) is a type of DNN that uses restricted Boltzmann machines (RBMs) to derive good initial parameters.
  • Yet, few experiments have been carried out to comprehensively investigate the performance of DL on numeric (non-image) data; a rough sketch of RBM-style pre-training follows.
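
As a rough illustration only: a full DBN stacks several RBMs and then fine-tunes the whole network, whereas the sketch below (an assumption for exposition, not the authors' DBN implementation) uses a single scikit-learn BernoulliRBM feeding a logistic-regression classifier.

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import BernoulliRBM
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)
    X = X / 16.0  # BernoulliRBM expects inputs scaled to [0, 1]

    # One RBM learns an unsupervised feature representation; a simple classifier
    # is then trained on top of it (a DBN would stack several such layers).
    dbn_like = Pipeline([
        ("rbm", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=20, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    dbn_like.fit(X, y)
    print("training accuracy:", dbn_like.score(X, y))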

3.3. Other established classifiers

  • Logistic Regression (LR) (Cox, 1958) is a regression analysis method used when the dependent variable is categorical.
  • Extending linear regression, LR measures the relationship between the dependent variable and one or more independent variables by estimating probabilities with a logistic function.
  • It can also be generalised to multi-class problems (multinomial logistic regression), as sketched below.
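
A minimal multinomial logistic-regression sketch; the toy data set and settings are illustrative assumptions, not taken from the paper.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    # With the default lbfgs solver, scikit-learn fits a multinomial (softmax)
    # model for this three-class problem.
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X, y)
    print(lr.predict_proba(X[:3]))  # class-membership probabilities from the logistic/softmax function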

3.4.1. Classifier ensemble

  • In the "Yahoo! Learning to Rank Challenge", Chapelle and Chang (2011) pointed out that decision trees (especially GBDT) were the most popular class of functions among the top competitors.
  • Moreover, comparing the best solution of ensemble learning with the baseline of regression model (GBDT), the relevance gap is rather small.

3.4.2. Feature selection and extraction

  • Another approach to improving classification performance is to represent feature vectors as matrices.
  • Several studies have explored different feature-reshaping techniques using matrix representations, and the resulting methods have been shown to significantly outperform existing classifiers in both accuracy and computation time (Nanni, Brahnam, & Lumini, 2010).
  • Nanni, Brahnam, and Lumini (2012) propose a method that combines the Meyer continuous wavelet approach, which transforms a vector into a matrix, with a variant of local phase quantisation for feature extraction.
  • Using generic pattern-recognition data sets from the UCI repository, as well as real-world protein and peptide data sets, the authors show that their method yields similar or even better performance than the state-of-the-art. The basic vector-to-matrix idea is illustrated below.
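
A very rough illustration of the vector-to-matrix idea only: the sketch merely reshapes a feature vector, whereas Nanni et al. apply a Meyer continuous wavelet transform and extract local-phase-quantisation features.

    import numpy as np

    x = np.random.rand(64)   # a 64-dimensional feature vector
    M = x.reshape(8, 8)      # viewed as an 8 x 8 "image-like" matrix
    # Texture-style descriptors could then be extracted from M and passed to an
    # ordinary classifier in place of the raw vector.
    print(M.shape)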

4.2. Measuring a Classifier's prediction performance

  • The authors use both ACC and AUC metrics for evaluating the classification algorithms.
  • AUC values were computed using the implementation in the R package caTools (Tuszynski, 2008).
  • To compare the efficiency of the different classifiers, the authors also examine the training, parameter-tuning, and prediction time of each classifier separately; a sketch of such measurements is given after this list.
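
A sketch of how ACC, AUC, and the separate training and prediction times could be measured. The paper computed AUC with the R package caTools, so this Python/scikit-learn version is only an illustrative equivalent, not the authors' code.

    import time
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(random_state=0)

    t0 = time.perf_counter()
    clf.fit(X_train, y_train)                      # training time
    train_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    y_pred = clf.predict(X_test)                   # prediction (testing) time
    test_time = time.perf_counter() - t0

    acc = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])   # binary-class AUC
    print(f"ACC={acc:.3f}  AUC={auc:.3f}  train={train_time:.3f}s  test={test_time:.3f}s")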

5.2.2. Testing setting of each classifier

  • It should be noted that, while Neural Networks (NN) work on every data set, Deep Belief Networks (DBN) and Pylearn2 require non-negative matrices, and Pylearn2 additionally requires non-negative integers representing image pixel values, so some data sets cannot be tested with these implementations.
  • On the 20 data sets where all of these implementations are applicable, they show similar performance in terms of best accuracy.
  • It is worth noting that, although DL has shown outstanding prediction performance on image data, it does not produce satisfactory results on most of these data sets. A hypothetical pre-processing sketch for the non-negative-integer requirement is shown below.
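
A hypothetical pre-processing sketch for the non-negative-integer requirement mentioned above: features are rescaled to integers in [0, 255], the range of image pixel values. This is only one plausible way to satisfy the constraint, not necessarily the authors' procedure.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    def to_pixel_like(X):
        X01 = MinMaxScaler().fit_transform(X)        # scale each feature to [0, 1]
        return np.rint(X01 * 255).astype(np.uint8)   # round to integers in [0, 255]

    X = np.array([[-3.2, 10.0], [0.5, 12.5], [7.1, 11.0]])
    print(to_pixel_like(X))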

6.1. Accuracy comparisons of individual classifiers

  • In order to demonstrate the difference in prediction accuracy between classifiers, the authors make pairwise comparisons.
  • Comparing classifiers over multiple data sets using just the average accuracy values is not statistically safe, because these values are not commensurable (Demšar, 2006).
  • Demšar (2006) and Garcia and Herrera (2009) recommend pairwise significance testing for such comparative studies.
  • Moreover, because the accuracy values over all data sets for each classifier do not follow a normal distribution, the significance tests should be non-parametric; a sketch of such a test is given after this list.
  • The authors observe that the difference between GBDT and RF/SVM is not statistically significant, while the difference between GBDT/RF and ELM is significant.
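
A sketch of such a non-parametric pairwise test, using the Wilcoxon signed-rank test that Demšar (2006) recommends for comparing two classifiers over multiple data sets; the accuracy values below are fabricated for illustration.

    import numpy as np
    from scipy.stats import wilcoxon

    # per-data-set accuracies of two classifiers (made-up example values)
    acc_gbdt = np.array([0.91, 0.85, 0.78, 0.93, 0.88, 0.90, 0.82, 0.95])
    acc_elm  = np.array([0.88, 0.80, 0.79, 0.90, 0.84, 0.87, 0.80, 0.92])

    stat, p_value = wilcoxon(acc_gbdt, acc_elm)
    print(f"Wilcoxon statistic={stat:.2f}, p-value={p_value:.4f}")
    # A small p-value (e.g. < 0.05) indicates a statistically significant difference.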

6.2. Accuracy comparisons among groups of classifiers

  • To summarise, for the selected 71 data sets and the 11 algorithms under test, it is possible, with high probability, to obtain the most accurate prediction by considering just a small subset of the 11 algorithms instead of exhaustively testing all of them; the greedy selection sketched below illustrates the idea.
  • Therefore, this study can guide practitioners, engineers and researchers in promptly finding the most accurate classifier for their specific applications and data.
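
The idea of covering all data sets with a small subset of classifiers can be sketched as a greedy selection over a results table; the table below is fabricated for illustration and does not reproduce the paper's numbers.

    # data set -> classifiers attaining the top accuracy on it (made-up example)
    best_per_dataset = {
        "d1": {"GBDT", "RF"},
        "d2": {"SVM"},
        "d3": {"GBDT"},
        "d4": {"RF", "ELM", "GBDT"},
        "d5": {"SVM"},
    }

    chosen, uncovered = set(), set(best_per_dataset)
    while uncovered:
        # pick the classifier that is best on the largest number of uncovered data sets
        candidates = {c for d in uncovered for c in best_per_dataset[d]}
        clf = max(candidates, key=lambda c: sum(c in best_per_dataset[d] for d in uncovered))
        chosen.add(clf)
        uncovered = {d for d in uncovered if clf not in best_per_dataset[d]}

    print(chosen)   # {'GBDT', 'SVM'} already covers all five example data sets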

6.3. Influence of the number of classes and features on the prediction performance

  • The authors see that, when the number of features is smaller than 60, GBDT is the best classifier in terms of accuracy.
  • More importantly, it is significantly better than the other classifiers when the number of features is less than 40.
  • Specifically, when the number of features is below 20, GBDT, SVM, RF, and ELM have the best classification accuracies (note that the best classification accuracies of different algorithms may be identical).
  • When the number of features is between 20 and 40, GBDT remains the best classifier, whereas SVM, RF, ELM and C4.5 are generally among the next best classification algorithms.

6.4. AUC Comparisons of individual classifiers

  • The authors observe that RF ranks first in AUC mean rank, followed by GBDT and SVM; the short sketch below shows how such a mean rank is computed.
  • Consistent with the ACC results in Fig. 5, DL, AB, and NB are still the worst performers.
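
The "mean rank" used here can be sketched as follows: on every data set the classifiers are ranked by AUC (rank 1 = best) and the ranks are then averaged per classifier. The AUC table below is fabricated for illustration.

    import pandas as pd

    auc = pd.DataFrame(
        {"RF":   [0.95, 0.90, 0.88, 0.97],
         "GBDT": [0.94, 0.91, 0.86, 0.96],
         "SVM":  [0.93, 0.89, 0.87, 0.95],
         "DL":   [0.80, 0.75, 0.70, 0.85]},
        index=["d1", "d2", "d3", "d4"])

    mean_rank = auc.rank(axis=1, ascending=False).mean()
    print(mean_rank.sort_values())   # lower mean rank = better overall AUC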

6.5. Running time comparisons between classifiers

  • These classifiers can be divided into two groups with regard to testing time.
  • The authors see that the classifiers in the first group, i.e. those in Fig. 26, are the most efficient algorithms in terms of the median testing time.
  • It is worth noting that, in order to achieve its best classification accuracy, ELM needs parameter tuning, which is very time-demanding.


Citations
Journal ArticleDOI
TL;DR: A comprehensive overview of the modern classification algorithms used in EEG-based BCIs is provided, the principles of these methods and guidelines on when and how to use them are presented, and a number of challenges to further advance EEG classification in BCI are identified.
Abstract: Objective: Most current Electroencephalography (EEG)-based Brain-Computer Interfaces (BCIs) are based on machine learning algorithms. There is a large diversity of classifier types that are used in this field, as described in our 2007 review paper. Now, approximately 10 years after this review publication, many new algorithms have been developed and tested to classify EEG signals in BCIs. The time is therefore ripe for an updated review of EEG classification algorithms for BCIs. Approach: We surveyed the BCI and machine learning literature from 2007 to 2017 to identify the new classification approaches that have been investigated to design BCIs. We synthesize these studies in order to present such algorithms, to report how they were used for BCIs, what were the outcomes, and to identify their pros and cons. Main results: We found that the recently designed classification algorithms for EEG-based BCIs can be divided into four main categories: adaptive classifiers, matrix and tensor classifiers, transfer learning and deep learning, plus a few other miscellaneous classifiers. Among these, adaptive classifiers were demonstrated to be generally superior to static ones, even with unsupervised adaptation. Transfer learning can also prove useful although the benefits of transfer learning remain unpredictable. Riemannian geometry-based methods have reached state-of-the-art performances on multiple BCI problems and deserve to be explored more thoroughly, along with tensor-based methods. Shrinkage linear discriminant analysis and random forests also appear particularly useful for small training samples settings. On the other hand, deep learning methods have not yet shown convincing improvement over state-of-the-art BCI methods. Significance: This paper provides a comprehensive overview of the modern classification algorithms used in EEG-based BCIs, presents the principles of these methods and guidelines on when and how to use them. It also identifies a number of challenges to further advance EEG classification in BCI.

1,280 citations

Journal ArticleDOI
TL;DR: A comprehensive comparison between XGBoost, LightGBM, CatBoost, random forests and gradient boosting has been performed and indicates that CatBoost obtains the best results in generalization accuracy and AUC in the studied datasets although the differences are small.
Abstract: The family of gradient boosting algorithms has been recently extended with several interesting proposals (i.e. XGBoost, LightGBM and CatBoost) that focus on both speed and accuracy. XGBoost is a scalable ensemble technique that has demonstrated to be a reliable and efficient machine learning challenge solver. LightGBM is an accurate model focused on providing extremely fast training performance using selective sampling of high gradient instances. CatBoost modifies the computation of gradients to avoid the prediction shift in order to improve the accuracy of the model. This work proposes a practical analysis of how these novel variants of gradient boosting work in terms of training speed, generalization performance and hyper-parameter setup. In addition, a comprehensive comparison between XGBoost, LightGBM, CatBoost, random forests and gradient boosting has been performed using carefully tuned models as well as using their default settings. The results of this comparison indicate that CatBoost obtains the best results in generalization accuracy and AUC in the studied datasets, although the differences are small. LightGBM is the fastest of all methods but not the most accurate. XGBoost places second both in accuracy and in training speed. Finally, an extensive analysis of the effect of hyper-parameter tuning in XGBoost, LightGBM and CatBoost is carried out using two novel proposed tools.

375 citations

Journal ArticleDOI
TL;DR: A new generator and discriminator for a Generative Adversarial Network (GAN) are designed in this paper to generate more discriminant fault samples using a scheme of global optimization, in order to solve the problem of unbalanced fault samples.
Abstract: Deep learning can be applied to the field of fault diagnosis thanks to its powerful feature-representation capabilities. When the available samples of a certain fault class are very limited, the data are inevitably unbalanced. The fault features extracted from unbalanced data via deep learning are inaccurate, which can lead to a high misclassification rate. To solve this problem, a new generator and discriminator of a Generative Adversarial Network (GAN) are designed in this paper to generate more discriminant fault samples using a scheme of global optimization. The generator is designed to generate the fault features extracted from a few fault samples via an Auto Encoder (AE), instead of raw fault data samples. The training of the generator is guided by the fault features and the fault-diagnosis error, instead of the statistical coincidence used in traditional GANs. The discriminator is designed to filter out unqualified generated samples, in the sense that qualified samples are helpful for more accurate fault diagnosis. Experimental results on rolling bearings verify the effectiveness of the proposed algorithm.

318 citations

Journal ArticleDOI
TL;DR: This work reconstructs the high-dimensional features of Android applications (apps) and employs multiple CNNs to detect Android malware, and proposes a hybrid model based on a deep autoencoder (DAE) and a convolutional neural network (CNN) that shows powerful ability in feature extraction and malware detection.
Abstract: Android security incidents occurred frequently in recent years. To improve the accuracy and efficiency of large-scale Android malware detection, in this work, we propose a hybrid model based on deep autoencoder (DAE) and convolutional neural network (CNN). First, to improve the accuracy of malware detection, we reconstruct the high-dimensional features of Android applications (apps) and employ multiple CNNs to detect Android malware. In the serial convolutional neural network architecture (CNN-S), we use ReLU, a non-linear function, as the activation function to increase sparseness and "dropout" to prevent over-fitting. The convolutional layer and pooling layer are combined with the fully-connected layer to enhance feature extraction capability. Under these conditions, CNN-S shows powerful ability in feature extraction and malware detection. Second, to reduce the training time, we use a deep autoencoder as a pre-training method for CNN. With this combination, the deep autoencoder and CNN model (DAE-CNN) can learn more flexible patterns in a short time. We conduct experiments on 10,000 benign apps and 13,000 malicious apps. CNN-S demonstrates a significant improvement compared with traditional machine learning methods in Android malware detection. In detail, compared with SVM, the accuracy with the CNN-S model is improved by 5%, while the training time using the DAE-CNN model is reduced by 83% compared with the CNN-S model.

212 citations

Journal ArticleDOI
TL;DR: A mobile edge computing-based intelligent trust evaluation scheme is proposed to comprehensively evaluate the trustworthiness of sensor nodes using a probabilistic graphical model; it can effectively ensure the trustworthiness of sensor nodes and decrease energy consumption.
Abstract: As an enabler for smart industrial Internet of Things (IoT), sensor cloud facilitates data collection, processing, analysis, storage, and sharing on demand. However, compromised or malicious sensor nodes may cause the collected data to be invalid or even endanger the normal operation of an entire IoT system. Therefore, designing an effective mechanism to ensure the trustworthiness of sensor nodes is a critical issue. However, existing cloud computing models cannot provide direct and effective management for the sensor nodes. Meanwhile, the insufficient computation and storage ability of sensor nodes makes them incapable of performing complex intelligent algorithms. To this end, mobile edge nodes with relatively strong computation and storage ability are exploited to provide intelligent trust evaluation and management for sensor nodes. In this article, a mobile edge computing-based intelligent trust evaluation scheme is proposed to comprehensively evaluate the trustworthiness of sensor nodes using probabilistic graphical model. The proposed mechanism evaluates the trustworthiness of sensor nodes from data collection and communication behavior. Moreover, the moving path for the edge nodes is scheduled to improve the probability of direct trust evaluation and decrease the moving distance. An approximation algorithm with provable performance is designed. Extensive experiments validate that our method can effectively ensure the trustworthiness of sensor nodes and decrease the energy consumption.

156 citations


Cites background from "An up-to-date comparison of state-o..."

  • ...sensor nodes due to their limited computing and storage capabilities [15]....


References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations

Proceedings Article
03 Dec 2012
TL;DR: State-of-the-art image classification performance was achieved by the deep convolutional neural network (DCNN) discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Journal ArticleDOI
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

40,826 citations

Journal ArticleDOI
TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated and the performance of the support- vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensures high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.

37,861 citations

Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations


Additional excerpts

  • ...Duda, Stork, & Hart, 2000), and C4.5 (Quinlan, 1993), have also been adopted in many classification tasks....


Frequently Asked Questions (2)
Q1. What have the authors contributed in "An up-to-date comparison of state-of-the-art classification algorithms"?

Moreover, important properties such as the dependency on the number of classes and features and CPU running time are typically not examined. In this paper, the authors carry out a comparative empirical study on both established classifiers and more recently proposed ones on 71 data sets originating from different domains, publicly available at the UCI and KEEL repositories. The list of 11 algorithms studied includes Extreme Learning Machine (ELM), Sparse Representation based Classification (SRC), and Deep Learning (DL), which have not been thoroughly investigated in existing comparative studies.

In future work, the authors will further investigate the performance of the 11 classifiers in specific application domains and with different feature selection methods.

Trending Questions (1)
What are the state of the art video classification algorithms?

The paper does not mention any specific video classification algorithms.