Journal Article

Statistical Comparisons of Classifiers over Multiple Data Sets

01 Dec 2006-Journal of Machine Learning Research (JMLR.org)-Vol. 7, Iss: 1, pp 1-30
TL;DR: A set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers is recommended: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparisons of more classifiers over multiple data sets.
Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams.
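To make the recommendation concrete, here is a minimal sketch of both tests with SciPy (not code from the paper; the accuracy values and classifiers are hypothetical):

```python
# Demsar-style comparisons: Wilcoxon signed-ranks for two classifiers,
# Friedman for three or more, each measured on the same data sets.
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

# Accuracies of three classifiers on the same ten data sets (made-up values).
acc_a = np.array([0.81, 0.77, 0.92, 0.68, 0.85, 0.73, 0.90, 0.64, 0.79, 0.88])
acc_b = np.array([0.78, 0.75, 0.93, 0.61, 0.82, 0.70, 0.87, 0.60, 0.80, 0.84])
acc_c = np.array([0.80, 0.74, 0.90, 0.65, 0.83, 0.69, 0.88, 0.62, 0.78, 0.85])

stat, p = wilcoxon(acc_a, acc_b)                  # two classifiers
print(f"Wilcoxon: T={stat:.1f}, p={p:.3f}")

stat, p = friedmanchisquare(acc_a, acc_b, acc_c)  # more than two classifiers
print(f"Friedman: chi2={stat:.2f}, p={p:.3f}")
```

A significant Friedman result would then be followed by the post-hoc tests (e.g., Nemenyi) whose outcome the CD diagrams visualize.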


Citations
Journal ArticleDOI
TL;DR: The state of the art in evaluated methods for both classification and detection is reviewed, analysing whether the methods are statistically different, what they learn from the images, and what they find easy or confuse.
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset have become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

15,935 citations


Cites background from "Statistical Comparisons of Classifi..."

  • ...As noted above, there has recently been considerable interest in learning recognition from “weak” supervision (Duygulu et al 2002; Fergus et al 2007)....


Journal ArticleDOI
TL;DR: This paper presents a systematic analysis of twenty-four performance measures used in the complete spectrum of machine learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical, producing a measure invariance taxonomy with respect to all relevant label distribution changes in a classification problem.
Abstract: This paper presents a systematic analysis of twenty-four performance measures used in the complete spectrum of Machine Learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical. For each classification task, the study relates a set of changes in a confusion matrix to specific characteristics of data. The analysis then concentrates on the type of changes to a confusion matrix that do not change a measure and therefore preserve a classifier's evaluation (measure invariance). The result is the measure invariance taxonomy with respect to all relevant label distribution changes in a classification problem. This formal analysis is supported by examples of applications where invariance properties of measures lead to a more reliable evaluation of classifiers. Text classification supplements the discussion with several case studies.
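As a small illustration of the invariance idea (a sketch under assumed counts, not the paper's formal analysis): precision and recall ignore the true-negative cell of a binary confusion matrix, while accuracy does not, so a change confined to TN alters accuracy but leaves the other two measures untouched.

```python
# Minimal sketch: invariance of precision/recall to changes in the
# TN cell of a binary confusion matrix (all counts are made up).
def measures(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

print(measures(tp=40, fp=10, fn=20, tn=30))   # baseline matrix
print(measures(tp=40, fp=10, fn=20, tn=300))  # only TN changed: precision and
                                              # recall identical, accuracy shifts
```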

3,945 citations


Cites background from "Statistical Comparisons of Classifi..."

  • ...Demsar (2006) surveys how classifiers are compared over multiple data sets....


Journal ArticleDOI
TL;DR: The basics are discussed and a survey is given of a complete set of nonparametric procedures developed to perform both pairwise and multiple comparisons for multi-problem analysis.
Abstract: The interest in nonparametric statistical analysis has grown recently in the field of computational intelligence. In many experimental studies, the lack of the required properties for a proper application of parametric procedures - independence, normality, and homoscedasticity - yields to nonparametric ones the task of performing a rigorous comparison among algorithms. In this paper, we will discuss the basics and give a survey of a complete set of nonparametric procedures developed to perform both pairwise and multiple comparisons, for multi-problem analysis. The test problems of the CEC'2005 special session on real parameter optimization will help to illustrate the use of the tests throughout this tutorial, analyzing the results of a set of well-known evolutionary and swarm intelligence algorithms. This tutorial is concluded with a compilation of considerations and recommendations, which will guide practitioners when using these tests to contrast their experimental results.
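The rank transformation these multi-problem procedures share can be sketched as follows (hypothetical error values; each algorithm is ranked on every problem, and the tests then operate on the average ranks):

```python
# Per-problem ranking behind Friedman-style multi-problem comparisons.
# Rows are problems, columns are algorithms; lower error is better.
import numpy as np
from scipy.stats import rankdata

errors = np.array([[0.12, 0.15, 0.10],
                   [0.30, 0.28, 0.33],
                   [0.05, 0.09, 0.07],
                   [0.22, 0.21, 0.25]])
ranks = np.vstack([rankdata(row) for row in errors])  # 1 = best per problem
print("average ranks:", ranks.mean(axis=0))
```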

3,832 citations


Cites methods from "Statistical Comparisons of Classifi..."

  • ...For the Wilcoxon’s test, a maximum of 30 domains is suggested [4]....


Journal ArticleDOI
TL;DR: An extensive evaluation of the state of the art in monocular pedestrian detection is performed in a unified framework, evaluating sixteen pretrained state-of-the-art detectors across six data sets and proposing a refined per-frame evaluation methodology.
Abstract: Pedestrian detection is a key problem in computer vision, with several applications that have the potential to positively impact quality of life. In recent years, the number of approaches to detecting pedestrians in monocular images has grown steadily. However, multiple data sets and widely varying evaluation protocols are used, making direct comparisons difficult. To address these shortcomings, we perform an extensive evaluation of the state of the art in a unified framework. We make three primary contributions: 1) We put together a large, well-annotated, and realistic monocular pedestrian detection data set and study the statistics of the size, position, and occlusion patterns of pedestrians in urban scenes, 2) we propose a refined per-frame evaluation methodology that allows us to carry out probing and informative comparisons, including measuring performance in relation to scale and occlusion, and 3) we evaluate the performance of sixteen pretrained state-of-the-art detectors across six data sets. Our study allows us to assess the state of the art and provides a framework for gauging future efforts. Our experiments show that despite significant progress, performance still has much room for improvement. In particular, detection is disappointing at low resolutions and for partially occluded pedestrians.

3,170 citations


Cites background or methods from "Statistical Comparisons of Classifi..."

  • ...[87] found this non-parametric approach to be more robust....


  • ...(b) Critical difference diagram [87]: the x-axis shows mean rank, blue bars link detectors for which there is insufficient evidence to declare them statistically significantly different (due to the relatively low number of performance samples and fairly high variance)....


  • ...A further in-depth study by García and Herrera [88] concludes that the Nemenyi post-hoc test which was used by [87] (and also in the PASCAL challenge [14]) is too conservative for n × n comparisons such as in a benchmark....


  • ...[87] introduced a series of powerful statistical tests that operate on an m dataset by n algorithm performance matrix (e....

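The critical difference behind the CD diagrams mentioned in the snippets above is the Nemenyi quantity from Demšar's paper, CD = q_α · sqrt(k(k+1)/(6N)) for k algorithms on N data sets; a minimal sketch (the q_α value is the α = 0.05, k = 5 entry tabulated in the paper, and the detector/data-set counts are illustrative):

```python
# Nemenyi critical difference: two algorithms are significantly different
# if their average ranks differ by at least CD.
import math

def nemenyi_cd(k, n, q_alpha):
    """k algorithms, n data sets, q_alpha from the Studentized range table."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

print(nemenyi_cd(k=5, n=6, q_alpha=2.728))  # e.g., 5 detectors on 6 data sets
```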

References
Journal ArticleDOI
TL;DR: In this paper, a simple and widely accepted multiple test procedure of the sequentially rejective type is presented, i.e. hypotheses are rejected one at a time until no further rejections can be done.
Abstract: This paper presents a simple and widely applicable multiple test procedure of the sequentially rejective type, i.e. hypotheses are rejected one at a time until no further rejections can be done. It is shown that the test has a prescribed level of significance protection against error of the first kind for any combination of true hypotheses. The power properties of the test and a number of possible applications are also discussed.
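A minimal sketch of the step-down procedure the abstract describes (the function name and p-values are illustrative): sort the p-values ascending and compare the i-th smallest against α/(m − i + 1), stopping at the first non-rejection.

```python
# Holm's sequentially rejective procedure: reject while the i-th smallest
# p-value (1-indexed) is at most alpha / (m - i + 1); once one hypothesis
# survives, all remaining (larger) p-values are retained as well.
def holm(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):          # step is 0-indexed
        if p_values[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break
    return reject

print(holm([0.01, 0.04, 0.03, 0.005]))  # -> [True, False, False, True]
```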

20,459 citations


"Statistical Comparisons of Classifi..." refers methods in this paper

  • ...The simplest such methods are due to Holm (1979) and Hochberg (1988)....


01 Jan 1998

UCI Repository of Machine Learning Databases

12,940 citations


"Statistical Comparisons of Classifi..." refers methods in this paper

  • ...We have compiled a sample of forty real-world data sets from the UCI machine learning repository (Blake and Merz, 1998); we have used the data sets with discrete classes and avoided artificial data sets like Monk problems....


Book ChapterDOI
Frank Wilcoxon
TL;DR: The comparison of two treatments generally falls into one of two categories: (a) we may have a number of replications for each of the two treatments, which are unpaired, or (b) we may have a number of paired comparisons leading to a series of differences, some of which may be positive and some negative.
Abstract: The comparison of two treatments generally falls into one of the following two categories: (a) we may have a number of replications for each of the two treatments, which are unpaired, or (b) we may have a number of paired comparisons leading to a series of differences, some of which may be positive and some negative. The appropriate methods for testing the significance of the differences of the means in these two cases are described in most of the textbooks on statistical methods.
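Case (b) is the construction Demšar's recommended test builds on; a hand-rolled sketch with hypothetical paired differences shows how the signed-rank statistic is formed:

```python
# Wilcoxon signed-ranks: rank |differences| (average ranks for ties), then
# sum the ranks of positive and of negative differences separately.
from scipy.stats import rankdata

diffs = [0.03, -0.01, 0.04, 0.02, -0.02, 0.05]   # made-up paired differences
ranks = rankdata([abs(d) for d in diffs])
r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
print(r_plus, r_minus, min(r_plus, r_minus))      # T = min(R+, R-)
```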

12,871 citations


"Statistical Comparisons of Classifi..." refers methods in this paper

  • ...3.1.3 WILCOXON SIGNED-RANKS TEST The Wilcoxon signed-ranks test (Wilcoxon, 1945) is a non-parametric alternative to the paired t-test, which ranks the differences in performances of two classifiers for each data set, ignoring the signs, and compares the ranks for the positive and the negative…...


  • ...Since we will finally recommend the Wilcoxon (1945) signed-ranks test, it will be presented with more details....


Journal ArticleDOI
William S. Cleveland
TL;DR: Robust locally weighted regression as discussed by the authors is a method for smoothing a scatterplot, in which the fitted value at x_k is the value of a polynomial fit to the data using weighted least squares, where the weight for (x_i, y_i) is large if x_i is close to x_k and small if it is not.
Abstract: The visual information on a scatterplot can be greatly enhanced, with little additional cost, by computing and plotting smoothed points. Robust locally weighted regression is a method for smoothing a scatterplot, (x_i, y_i), i = 1, …, n, in which the fitted value at x_k is the value of a polynomial fit to the data using weighted least squares, where the weight for (x_i, y_i) is large if x_i is close to x_k and small if it is not. A robust fitting procedure is used that guards against deviant points distorting the smoothed points. Visual, computational, and statistical issues of robust locally weighted regression are discussed. Several examples, including data on lead intoxication, are used to illustrate the methodology.
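statsmodels ships a LOWESS smoother in the spirit of Cleveland's method; a short sketch on synthetic data (the it argument sets the number of robustifying iterations that down-weight deviant points):

```python
# Robust locally weighted scatterplot smoothing on noisy data with outliers.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
y[::25] += 3.0  # a few deviant points the robust fit should resist

# frac: fraction of the data used in each local weighted least-squares fit.
smoothed = lowess(y, x, frac=0.25, it=3)  # returns sorted (x, fitted) pairs
print(smoothed[:5])
```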

10,225 citations


"Statistical Comparisons of Classifi..." refers methods in this paper

  • ...C4.5), naive Bayesian learner that models continuous probabilities using LOESS (Cleveland, 1979), naive Bayesian learner with continuous attributes discretized using Fayyad-Irani’s discretization (Fayyad and Irani, 1993) and kNN (k=10, neighbour weights adjusted with the Gaussian kernel)....


  • ...4.1.1 DATA SETS AND LEARNING ALGORITHMS We based our experiments on several common learning algorithms and their variations: C4.5, C4.5 with m and C4.5 with cf fitted for optimal accuracy, another tree learning algorithm implemented in Orange (with features similar to the original C4.5), naive Bayesian learner that models continuous probabilities using LOESS (Cleveland, 1979), naive Bayesian learner with continuous attributes discretized using Fayyad-Irani’s discretization (Fayyad and Irani, 1993) and kNN (k=10, neighbour weights adjusted with the Gaussian kernel)....


  • ...…implemented in Orange (with features similar to the original C4.5), naive Bayesian learner that models continuous probabilities using LOESS (Cleveland, 1979), naive Bayesian learner with continuous attributes discretized using Fayyad-Irani’s discretization (Fayyad and Irani, 1993) and kNN…...
