Journal ArticleDOI

An up-to-date comparison of state-of-the-art classification algorithms

TL;DR: It is found that Stochastic Gradient Boosting Trees (GBDT) matches or exceeds the prediction performance of Support Vector Machines and Random Forests, while being the fastest algorithm in terms of prediction efficiency.
Abstract: Up-to-date report on the accuracy and efficiency of state-of-the-art classifiers. We compare the accuracy of 11 classification algorithms pairwise and groupwise. We examine separately the training, parameter-tuning, and testing time. GBDT and Random Forests yield the highest accuracy, outperforming SVM. GBDT is the fastest in testing, Naive Bayes the fastest in training. Current benchmark reports of classification algorithms generally concern common classifiers and their variants but do not include many algorithms that have been introduced in recent years. Moreover, important properties such as the dependency on the number of classes and features and CPU running time are typically not examined. In this paper, we carry out a comparative empirical study on both established classifiers and more recently proposed ones on 71 data sets originating from different domains, publicly available at the UCI and KEEL repositories. The list of 11 algorithms studied includes Extreme Learning Machine (ELM), Sparse Representation based Classification (SRC), and Deep Learning (DL), which have not been thoroughly investigated in existing comparative studies. It is found that Stochastic Gradient Boosting Trees (GBDT) matches or exceeds the prediction performance of Support Vector Machines (SVM) and Random Forests (RF), while being the fastest algorithm in terms of prediction efficiency. ELM also yields good accuracy results, ranking in the top 5 alongside GBDT, RF, SVM, and C4.5, but this performance varies widely across data sets. Unsurprisingly, the top accuracy performers have average or slow training-time efficiency. DL is the worst performer in terms of accuracy but the second fastest in prediction efficiency. SRC shows good accuracy performance but is the slowest classifier in both training and testing.

Summary (3 min read)

2. State-of-the-art in classifier comparison

  • In summary, as can be seen in Table 1, most of the existing works on classifier comparison have not considered several important classifiers, namely GBDT, ELM, SRC, and DL.
  • Moreover, the findings of these works are not always consistent; for example, the accuracy of RF and SVM was reported to be very similar in Macia and Bernado-Mansilla (2014), while in Brown and Mues (2012), Fernández-Delgado et al. (2014), and Lessmann et al. (2015), RF was reported to outperform SVM.
  • Therefore, it is essential to conduct an up-to-date comparative study of the current state-of-the-art classification algorithms, taking into consideration GBDT and the newer classifiers ELM, SRC, and DL.

3. Classification algorithms to be compared

  • For the sake of convenience and clarity, the authors group the classification algorithms investigated in this work into three groups.
  • The first category mainly includes Support Vector Machines (SVM) and Random Forests (RF), which are known to be among the best performers (and are therefore usually used as default classifiers).
  • The second category consists of algorithms proposed in recent years, which have consequently not been included in most comparative studies.
  • The algorithms in the third category are not as popular as SVM and RF, but they are also important and have found applications in many domains.
  • In particular, GBDT is commonly underutilised by many researchers and practitioners, even though it has been reported in the literature to achieve high classification accuracy, e.g. in Caruana and Niculescu-Mizil (2006), Brown and Mues (2012), and Chapelle and Chang (2011); a minimal usage sketch is given below.
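
A minimal sketch, as an illustrative assumption rather than the authors' experimental setup: training a stochastic gradient boosting trees classifier with scikit-learn on a toy data set. The data set, hyper-parameter values, and the Python/scikit-learn choice are all assumptions for exposition (the paper itself used other implementations).

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # subsample < 1.0 draws a random fraction of the training data per boosting
    # round, which is what makes the gradient boosting "stochastic" (GBDT).
    gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                      subsample=0.8, random_state=0)
    gbdt.fit(X_train, y_train)
    print("test accuracy:", gbdt.score(X_test, y_test))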

3.1. SVM and RF

  • Vanschoren, Blockeel, Pfahringer, and Holmes (2012) also performed experiments with UCI data sets and attested that RBF-based SVM and RF yield good results in predictive accuracy, but the variation in their performance is large because characteristics of the data heavily affect it.
  • They also observed that it is more rewarding to fine-tune RF and SVM than to bag or boost their default settings, whereas for C4.5, bagging and boosting outperform fine-tuning; a sketch contrasting the two strategies is given below.
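
The contrast between fine-tuning RF/SVM and bagging a default decision tree can be sketched as follows; the parameter grids, toy data set, and scikit-learn usage are illustrative assumptions, not the settings used by Vanschoren et al.

    from sklearn.datasets import load_digits
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_digits(return_X_y=True)

    # Fine-tuning: cross-validated grid search over RF and RBF-SVM hyper-parameters.
    rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                             {"n_estimators": [100, 300], "max_features": ["sqrt", 0.5]},
                             cv=5).fit(X, y)
    svm_search = GridSearchCV(SVC(kernel="rbf"),
                              {"C": [1, 10, 100], "gamma": ["scale", 0.01]},
                              cv=5).fit(X, y)

    # Contrast: bagging a decision tree left at its default settings (no tuning).
    bagged_default_tree = cross_val_score(
        BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
        X, y, cv=5).mean()

    print("tuned RF           :", rf_search.best_score_)
    print("tuned RBF-SVM      :", svm_search.best_score_)
    print("bagged default tree:", bagged_default_tree)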

3.2. Newer classifiers

  • A Deep Belief Network (DBN) (Hinton, Osindero, & Teh, 2006) is a type of DNN that uses restricted Boltzmann machines (RBMs) to derive good initial parameters.
  • Yet, few experiments have been carried out to comprehensively investigate the performance of DL on numeric (non-image) data; a rough sketch of RBM-style pre-training follows.
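
As a rough illustration only: a full DBN stacks several RBMs and then fine-tunes the whole network, whereas the sketch below (an assumption for exposition, not the authors' DBN implementation) uses a single scikit-learn BernoulliRBM feeding a logistic-regression classifier.

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import BernoulliRBM
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)
    X = X / 16.0  # BernoulliRBM expects inputs scaled to [0, 1]

    # One RBM learns an unsupervised feature representation; a simple classifier
    # is then trained on top of it (a DBN would stack several such layers).
    dbn_like = Pipeline([
        ("rbm", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=20, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    dbn_like.fit(X, y)
    print("training accuracy:", dbn_like.score(X, y))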

3.3. Other established classifiers

  • Logistic Regression (LR) (Cox, 1958) is a regression analysis method used when the dependent variable is categorical.
  • Extending linear regression, LR measures the relationship between the dependent variable and one or more independent variables by estimating probabilities with a logistic function.
  • It can also be generalised to multi-class problems (multinomial logistic regression), as sketched below.
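
A minimal multinomial logistic-regression sketch; the toy data set and settings are illustrative assumptions, not taken from the paper.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    # With the default lbfgs solver, scikit-learn fits a multinomial (softmax)
    # model for this three-class problem.
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X, y)
    print(lr.predict_proba(X[:3]))  # class-membership probabilities from the logistic/softmax function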

3.4.1. Classifier ensemble

  • In the "Yahoo! Learning to Rank Challenge", Chapelle and Chang (2011) pointed out that decision trees (especially GBDT) were the most popular class of functions among the top competitors.
  • Moreover, comparing the best solution of ensemble learning with the baseline of regression model (GBDT), the relevance gap is rather small.

3.4.2. Feature selection and extraction

  • Another approach to improving classification performance is to represent feature vectors as matrices.
  • Several studies have explored different feature-reshaping techniques using matrix representations, and the resulting methods have been shown to significantly outperform existing classifiers in both accuracy and computation time (Nanni, Brahnam, & Lumini, 2010).
  • Nanni, Brahnam, and Lumini (2012) propose a method that combines the Meyer continuous wavelet approach, which transforms a vector into a matrix, with a variant of local phase quantisation for feature extraction.
  • Using generic pattern-recognition data sets from the UCI repository, as well as real-world protein and peptide data sets, the authors show that their method yields similar or even better performance than the state-of-the-art. The basic vector-to-matrix idea is illustrated below.
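
A very rough illustration of the vector-to-matrix idea only: the sketch merely reshapes a feature vector, whereas Nanni et al. apply a Meyer continuous wavelet transform and extract local-phase-quantisation features.

    import numpy as np

    x = np.random.rand(64)   # a 64-dimensional feature vector
    M = x.reshape(8, 8)      # viewed as an 8 x 8 "image-like" matrix
    # Texture-style descriptors could then be extracted from M and passed to an
    # ordinary classifier in place of the raw vector.
    print(M.shape)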

4.2. Measuring a Classifier's prediction performance

  • The authors use both ACC and AUC metrics for evaluating the classification algorithms.
  • AUC values were computed using the implementation in the R package caTools (Tuszynski, 2008).
  • To compare the efficiency of the different classifiers, the authors also examine the training, parameter-tuning, and prediction time of each classifier separately; a sketch of such measurements is given after this list.
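
A sketch of how ACC, AUC, and the separate training and prediction times could be measured. The paper computed AUC with the R package caTools, so this Python/scikit-learn version is only an illustrative equivalent, not the authors' code.

    import time
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(random_state=0)

    t0 = time.perf_counter()
    clf.fit(X_train, y_train)                      # training time
    train_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    y_pred = clf.predict(X_test)                   # prediction (testing) time
    test_time = time.perf_counter() - t0

    acc = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])   # binary-class AUC
    print(f"ACC={acc:.3f}  AUC={auc:.3f}  train={train_time:.3f}s  test={test_time:.3f}s")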

5.2.2. Testing setting of each classifier

  • It should be noted that, while Neural Networks (NN) work on every data set, Deep Belief Networks (DBN) and Pylearn2 require non-negative matrices, and Pylearn2 additionally requires non-negative integers representing image pixel values, so some data sets cannot be tested with these implementations.
  • On the 20 data sets where all of these implementations are applicable, they show similar performance in terms of best accuracy.
  • It is worth noting that, although DL has shown outstanding prediction performance on image data, it does not produce satisfactory results on most of these data sets. A hypothetical pre-processing sketch for the non-negative-integer requirement is shown below.
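
A hypothetical pre-processing sketch for the non-negative-integer requirement mentioned above: features are rescaled to integers in [0, 255], the range of image pixel values. This is only one plausible way to satisfy the constraint, not necessarily the authors' procedure.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    def to_pixel_like(X):
        X01 = MinMaxScaler().fit_transform(X)        # scale each feature to [0, 1]
        return np.rint(X01 * 255).astype(np.uint8)   # round to integers in [0, 255]

    X = np.array([[-3.2, 10.0], [0.5, 12.5], [7.1, 11.0]])
    print(to_pixel_like(X))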

6.1. Accuracy comparisons of individual classifiers

  • In order to demonstrate the difference in prediction accuracy between classifiers, the authors make pairwise comparisons.
  • Comparing classifiers over multiple data sets using just the average accuracy values is not statistically safe, because these values are not commensurable (Demšar, 2006).
  • Demšar (2006) and Garcia and Herrera (2009) recommend pairwise significance testing for such comparative studies.
  • Moreover, because the accuracy values over all data sets for each classifier do not follow a normal distribution, the significance tests should be non-parametric; a sketch of such a test is given after this list.
  • The authors observe that the difference between GBDT and RF/SVM is not statistically significant, while the difference between GBDT/RF and ELM is significant.
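
A sketch of such a non-parametric pairwise test, using the Wilcoxon signed-rank test that Demšar (2006) recommends for comparing two classifiers over multiple data sets; the accuracy values below are fabricated for illustration.

    import numpy as np
    from scipy.stats import wilcoxon

    # per-data-set accuracies of two classifiers (made-up example values)
    acc_gbdt = np.array([0.91, 0.85, 0.78, 0.93, 0.88, 0.90, 0.82, 0.95])
    acc_elm  = np.array([0.88, 0.80, 0.79, 0.90, 0.84, 0.87, 0.80, 0.92])

    stat, p_value = wilcoxon(acc_gbdt, acc_elm)
    print(f"Wilcoxon statistic={stat:.2f}, p-value={p_value:.4f}")
    # A small p-value (e.g. < 0.05) indicates a statistically significant difference.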

6.2. Accuracy comparisons among groups of classifiers

  • To summarise, for the selected 71 data sets and the 11 algorithms under test, it is possible, with high probability, to obtain the most accurate prediction by considering just a small subset of the 11 algorithms instead of exhaustively testing all of them; the greedy selection sketched below illustrates the idea.
  • Therefore, this study can guide practitioners, engineers and researchers in promptly finding the most accurate classifier for their specific applications and data.
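
The idea of covering all data sets with a small subset of classifiers can be sketched as a greedy selection over a results table; the table below is fabricated for illustration and does not reproduce the paper's numbers.

    # data set -> classifiers attaining the top accuracy on it (made-up example)
    best_per_dataset = {
        "d1": {"GBDT", "RF"},
        "d2": {"SVM"},
        "d3": {"GBDT"},
        "d4": {"RF", "ELM", "GBDT"},
        "d5": {"SVM"},
    }

    chosen, uncovered = set(), set(best_per_dataset)
    while uncovered:
        # pick the classifier that is best on the largest number of uncovered data sets
        candidates = {c for d in uncovered for c in best_per_dataset[d]}
        clf = max(candidates, key=lambda c: sum(c in best_per_dataset[d] for d in uncovered))
        chosen.add(clf)
        uncovered = {d for d in uncovered if clf not in best_per_dataset[d]}

    print(chosen)   # {'GBDT', 'SVM'} already covers all five example data sets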

6.3. Influence of the number of classes and features on the prediction performance

  • The authors see that, when the number of features is smaller than 60, GBDT is the best classifier in terms of accuracy.
  • More importantly, it is significantly better than the other classifiers when the number of features is less than 40.
  • Specifically, when the number of features is below 20, GBDT, SVM, RF, and ELM have the best classification accuracies (note that the best classification accuracies of different algorithms may be identical).
  • When the number of features is between 20 and 40, GBDT remains the best classifier, whereas SVM, RF, ELM and C4.5 are generally among the next best classification algorithms.

6.4. AUC Comparisons of individual classifiers

  • The authors observe that RF ranks first in AUC mean rank, followed by GBDT and SVM; the short sketch below shows how such a mean rank is computed.
  • Consistent with the ACC results in Fig. 5, DL, AB, and NB are still the worst performers.
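
The "mean rank" used here can be sketched as follows: on every data set the classifiers are ranked by AUC (rank 1 = best) and the ranks are then averaged per classifier. The AUC table below is fabricated for illustration.

    import pandas as pd

    auc = pd.DataFrame(
        {"RF":   [0.95, 0.90, 0.88, 0.97],
         "GBDT": [0.94, 0.91, 0.86, 0.96],
         "SVM":  [0.93, 0.89, 0.87, 0.95],
         "DL":   [0.80, 0.75, 0.70, 0.85]},
        index=["d1", "d2", "d3", "d4"])

    mean_rank = auc.rank(axis=1, ascending=False).mean()
    print(mean_rank.sort_values())   # lower mean rank = better overall AUC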

6.5. Running time comparisons between classifiers

  • These classifiers can be divided into two groups with regard to testing time.
  • The authors see that the classifiers in the first group, i.e. those in Fig. 26, are the most efficient algorithms in terms of the median testing time.
  • It is worth noting that, in order to achieve its best classification accuracy, ELM needs parameter tuning, which is very time-demanding.


Citations
Journal ArticleDOI
TL;DR: A comprehensive overview of the modern classification algorithms used in EEG-based BCIs is provided, the principles of these methods and guidelines on when and how to use them are presented, and a number of challenges to further advance EEG classification in BCI are identified.
Abstract: Objective: Most current Electroencephalography (EEG)-based Brain-Computer Interfaces (BCIs) are based on machine learning algorithms. There is a large diversity of classifier types that are used in this field, as described in our 2007 review paper. Now, approximately 10 years after this review publication, many new algorithms have been developed and tested to classify EEG signals in BCIs. The time is therefore ripe for an updated review of EEG classification algorithms for BCIs. Approach: We surveyed the BCI and machine learning literature from 2007 to 2017 to identify the new classification approaches that have been investigated to design BCIs. We synthesize these studies in order to present such algorithms, to report how they were used for BCIs, what were the outcomes, and to identify their pros and cons. Main results: We found that the recently designed classification algorithms for EEG-based BCIs can be divided into four main categories: adaptive classifiers, matrix and tensor classifiers, transfer learning and deep learning, plus a few other miscellaneous classifiers. Among these, adaptive classifiers were demonstrated to be generally superior to static ones, even with unsupervised adaptation. Transfer learning can also prove useful although the benefits of transfer learning remain unpredictable. Riemannian geometry-based methods have reached state-of-the-art performances on multiple BCI problems and deserve to be explored more thoroughly, along with tensor-based methods. Shrinkage linear discriminant analysis and random forests also appear particularly useful for small training samples settings. On the other hand, deep learning methods have not yet shown convincing improvement over state-of-the-art BCI methods. Significance: This paper provides a comprehensive overview of the modern classification algorithms used in EEG-based BCIs, presents the principles of these methods and guidelines on when and how to use them. It also identifies a number of challenges to further advance EEG classification in BCI.

1,280 citations

Journal ArticleDOI
TL;DR: A comprehensive comparison between XGBoost, LightGBM, CatBoost, random forests and gradient boosting has been performed and indicates that CatBoost obtains the best results in generalization accuracy and AUC in the studied datasets although the differences are small.
Abstract: The family of gradient boosting algorithms has been recently extended with several interesting proposals (i.e. XGBoost, LightGBM and CatBoost) that focus on both speed and accuracy. XGBoost is a scalable ensemble technique that has demonstrated to be a reliable and efficient machine learning challenge solver. LightGBM is an accurate model focused on providing extremely fast training performance using selective sampling of high gradient instances. CatBoost modifies the computation of gradients to avoid the prediction shift in order to improve the accuracy of the model. This work proposes a practical analysis of how these novel variants of gradient boosting work in terms of training speed, generalization performance and hyper-parameter setup. In addition, a comprehensive comparison between XGBoost, LightGBM, CatBoost, random forests and gradient boosting has been performed using carefully tuned models as well as using their default settings. The results of this comparison indicate that CatBoost obtains the best results in generalization accuracy and AUC in the studied datasets, although the differences are small. LightGBM is the fastest of all methods but not the most accurate. XGBoost places second both in accuracy and in training speed. Finally, an extensive analysis of the effect of hyper-parameter tuning in XGBoost, LightGBM and CatBoost is carried out using two novel proposed tools.

375 citations

Journal ArticleDOI
TL;DR: A new generator and discriminator for a Generative Adversarial Network (GAN) are designed in this paper to generate more discriminant fault samples using a scheme of global optimization, in order to solve the problem of unbalanced fault samples.
Abstract: Deep learning can be applied to the field of fault diagnosis thanks to its powerful feature-representation capabilities. When the available samples of a certain fault class are very limited, the data are inevitably unbalanced. The fault features extracted from unbalanced data via deep learning are inaccurate, which can lead to a high misclassification rate. To solve this problem, a new generator and discriminator of a Generative Adversarial Network (GAN) are designed in this paper to generate more discriminant fault samples using a scheme of global optimization. The generator is designed to generate the fault features extracted from a few fault samples via an Auto Encoder (AE), instead of raw fault data samples. The training of the generator is guided by the fault features and the fault-diagnosis error, instead of the statistical coincidence used in traditional GANs. The discriminator is designed to filter out unqualified generated samples, in the sense that qualified samples are helpful for more accurate fault diagnosis. Experimental results on rolling bearings verify the effectiveness of the proposed algorithm.

318 citations

Journal ArticleDOI
TL;DR: This work reconstructs the high-dimensional features of Android applications (apps) and employs multiple CNNs to detect Android malware, and proposes a hybrid model based on a deep autoencoder (DAE) and a convolutional neural network (CNN) that shows powerful ability in feature extraction and malware detection.
Abstract: Android security incidents occurred frequently in recent years. To improve the accuracy and efficiency of large-scale Android malware detection, in this work, we propose a hybrid model based on deep autoencoder (DAE) and convolutional neural network (CNN). First, to improve the accuracy of malware detection, we reconstruct the high-dimensional features of Android applications (apps) and employ multiple CNNs to detect Android malware. In the serial convolutional neural network architecture (CNN-S), we use ReLU, a non-linear function, as the activation function to increase sparseness and "dropout" to prevent over-fitting. The convolutional layer and pooling layer are combined with the fully-connected layer to enhance feature extraction capability. Under these conditions, CNN-S shows powerful ability in feature extraction and malware detection. Second, to reduce the training time, we use a deep autoencoder as a pre-training method for CNN. With this combination, the deep autoencoder and CNN model (DAE-CNN) can learn more flexible patterns in a short time. We conduct experiments on 10,000 benign apps and 13,000 malicious apps. CNN-S demonstrates a significant improvement compared with traditional machine learning methods in Android malware detection. In detail, compared with SVM, the accuracy with the CNN-S model is improved by 5%, while the training time using the DAE-CNN model is reduced by 83% compared with the CNN-S model.

212 citations

Journal ArticleDOI
TL;DR: A mobile edge computing-based intelligent trust evaluation scheme is proposed to comprehensively evaluate the trustworthiness of sensor nodes using a probabilistic graphical model; it can effectively ensure the trustworthiness of sensor nodes and decrease energy consumption.
Abstract: As an enabler for smart industrial Internet of Things (IoT), sensor cloud facilitates data collection, processing, analysis, storage, and sharing on demand. However, compromised or malicious sensor nodes may cause the collected data to be invalid or even endanger the normal operation of an entire IoT system. Therefore, designing an effective mechanism to ensure the trustworthiness of sensor nodes is a critical issue. However, existing cloud computing models cannot provide direct and effective management for the sensor nodes. Meanwhile, the insufficient computation and storage ability of sensor nodes makes them incapable of performing complex intelligent algorithms. To this end, mobile edge nodes with relatively strong computation and storage ability are exploited to provide intelligent trust evaluation and management for sensor nodes. In this article, a mobile edge computing-based intelligent trust evaluation scheme is proposed to comprehensively evaluate the trustworthiness of sensor nodes using probabilistic graphical model. The proposed mechanism evaluates the trustworthiness of sensor nodes from data collection and communication behavior. Moreover, the moving path for the edge nodes is scheduled to improve the probability of direct trust evaluation and decrease the moving distance. An approximation algorithm with provable performance is designed. Extensive experiments validate that our method can effectively ensure the trustworthiness of sensor nodes and decrease the energy consumption.

156 citations


Cites background from "An up-to-date comparison of state-o..."

  • ...sensor nodes due to their limited computing and storage capabilities [15]....


References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations

Proceedings Article
03 Dec 2012
TL;DR: State-of-the-art image classification performance was achieved by the deep convolutional neural network (DCNN) discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Journal ArticleDOI
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

40,826 citations

Journal ArticleDOI
TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated and the performance of the support- vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensures high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.

37,861 citations

Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations


Additional excerpts

  • ...Duda, Stork, & Hart, 2000), and C4.5 (Quinlan, 1993), have also been adopted in many classification tasks....


Frequently Asked Questions (2)
Q1. What have the authors contributed in "An up-to-date comparison of state-of-the-art classification algorithms"?

Moreover, important properties such as the dependency on the number of classes and features and CPU running time are typically not examined. In this paper, the authors carry out a comparative empirical study on both established classifiers and more recently proposed ones on 71 data sets originating from different domains, publicly available at the UCI and KEEL repositories. The list of 11 algorithms studied includes Extreme Learning Machine (ELM), Sparse Representation based Classification (SRC), and Deep Learning (DL), which have not been thoroughly investigated in existing comparative studies.

In future work, the authors will further investigate the performance of the 11 classifiers in specific application domains and with different feature selection methods.

Trending Questions (1)
What are the state of the art video classification algorithms?

The paper does not mention any specific video classification algorithms.