Journal Article

Statistical Comparisons of Classifiers over Multiple Data Sets

01 Dec 2006-Journal of Machine Learning Research (JMLR.org)-Vol. 7, Iss: 1, pp 1-30
TL;DR: A set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers is recommended: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparisons of more classifiers over multiple data sets.
Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams.
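To make the recommendation concrete, here is a minimal sketch of both tests with SciPy (not code from the paper; the accuracy values and classifiers are hypothetical):

```python
# Demsar-style comparisons: Wilcoxon signed-ranks for two classifiers,
# Friedman for three or more, each measured on the same data sets.
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

# Accuracies of three classifiers on the same ten data sets (made-up values).
acc_a = np.array([0.81, 0.77, 0.92, 0.68, 0.85, 0.73, 0.90, 0.64, 0.79, 0.88])
acc_b = np.array([0.78, 0.75, 0.93, 0.61, 0.82, 0.70, 0.87, 0.60, 0.80, 0.84])
acc_c = np.array([0.80, 0.74, 0.90, 0.65, 0.83, 0.69, 0.88, 0.62, 0.78, 0.85])

stat, p = wilcoxon(acc_a, acc_b)                  # two classifiers
print(f"Wilcoxon: T={stat:.1f}, p={p:.3f}")

stat, p = friedmanchisquare(acc_a, acc_b, acc_c)  # more than two classifiers
print(f"Friedman: chi2={stat:.2f}, p={p:.3f}")
```

A significant Friedman result would then be followed by the post-hoc tests (e.g., Nemenyi) whose outcome the CD diagrams visualize.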


Citations
Journal ArticleDOI
TL;DR: The state of the art in evaluated methods for both classification and detection is reviewed, analysing whether the methods are statistically different, what they learn from the images, and what they find easy or confuse.
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset have become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

15,935 citations


Cites background from "Statistical Comparisons of Classifi..."

  • ...As noted above, there has recently been considerable interest in learning recognition from “weak” supervision (Duygulu et al 2002; Fergus et al 2007)....


Journal ArticleDOI
TL;DR: This paper presents a systematic analysis of twenty-four performance measures used in the complete spectrum of machine learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical, producing a measure invariance taxonomy with respect to all relevant label distribution changes in a classification problem.
Abstract: This paper presents a systematic analysis of twenty-four performance measures used in the complete spectrum of Machine Learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical. For each classification task, the study relates a set of changes in a confusion matrix to specific characteristics of data. The analysis then concentrates on the type of changes to a confusion matrix that do not change a measure and therefore preserve a classifier's evaluation (measure invariance). The result is the measure invariance taxonomy with respect to all relevant label distribution changes in a classification problem. This formal analysis is supported by examples of applications where invariance properties of measures lead to a more reliable evaluation of classifiers. Text classification supplements the discussion with several case studies.
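As a small illustration of the invariance idea (a sketch under assumed counts, not the paper's formal analysis): precision and recall ignore the true-negative cell of a binary confusion matrix, while accuracy does not, so a change confined to TN alters accuracy but leaves the other two measures untouched.

```python
# Minimal sketch: invariance of precision/recall to changes in the
# TN cell of a binary confusion matrix (all counts are made up).
def measures(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

print(measures(tp=40, fp=10, fn=20, tn=30))   # baseline matrix
print(measures(tp=40, fp=10, fn=20, tn=300))  # only TN changed: precision and
                                              # recall identical, accuracy shifts
```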

3,945 citations


Cites background from "Statistical Comparisons of Classifi..."

  • ...Demsar (2006) surveys how classifiers are compared over multiple data sets....


Journal ArticleDOI
TL;DR: The basics are discussed and a survey is given of a complete set of nonparametric procedures developed to perform both pairwise and multiple comparisons for multi-problem analysis.
Abstract: The interest in nonparametric statistical analysis has grown recently in the field of computational intelligence. In many experimental studies, the lack of the required properties for a proper application of parametric procedures - independence, normality, and homoscedasticity - yields to nonparametric ones the task of performing a rigorous comparison among algorithms. In this paper, we will discuss the basics and give a survey of a complete set of nonparametric procedures developed to perform both pairwise and multiple comparisons, for multi-problem analysis. The test problems of the CEC'2005 special session on real parameter optimization will help to illustrate the use of the tests throughout this tutorial, analyzing the results of a set of well-known evolutionary and swarm intelligence algorithms. This tutorial is concluded with a compilation of considerations and recommendations, which will guide practitioners when using these tests to contrast their experimental results.
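The rank transformation these multi-problem procedures share can be sketched as follows (hypothetical error values; each algorithm is ranked on every problem, and the tests then operate on the average ranks):

```python
# Per-problem ranking behind Friedman-style multi-problem comparisons.
# Rows are problems, columns are algorithms; lower error is better.
import numpy as np
from scipy.stats import rankdata

errors = np.array([[0.12, 0.15, 0.10],
                   [0.30, 0.28, 0.33],
                   [0.05, 0.09, 0.07],
                   [0.22, 0.21, 0.25]])
ranks = np.vstack([rankdata(row) for row in errors])  # 1 = best per problem
print("average ranks:", ranks.mean(axis=0))
```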

3,832 citations


Cites methods from "Statistical Comparisons of Classifi..."

  • ...For the Wilcoxon’s test, a maximum of 30 domains is suggested [4]....


Journal ArticleDOI
TL;DR: An extensive evaluation of the state of the art in monocular pedestrian detection is performed in a unified framework, evaluating sixteen pretrained state-of-the-art detectors across six data sets and proposing a refined per-frame evaluation methodology.
Abstract: Pedestrian detection is a key problem in computer vision, with several applications that have the potential to positively impact quality of life. In recent years, the number of approaches to detecting pedestrians in monocular images has grown steadily. However, multiple data sets and widely varying evaluation protocols are used, making direct comparisons difficult. To address these shortcomings, we perform an extensive evaluation of the state of the art in a unified framework. We make three primary contributions: 1) We put together a large, well-annotated, and realistic monocular pedestrian detection data set and study the statistics of the size, position, and occlusion patterns of pedestrians in urban scenes, 2) we propose a refined per-frame evaluation methodology that allows us to carry out probing and informative comparisons, including measuring performance in relation to scale and occlusion, and 3) we evaluate the performance of sixteen pretrained state-of-the-art detectors across six data sets. Our study allows us to assess the state of the art and provides a framework for gauging future efforts. Our experiments show that despite significant progress, performance still has much room for improvement. In particular, detection is disappointing at low resolutions and for partially occluded pedestrians.

3,170 citations


Cites background or methods from "Statistical Comparisons of Classifi..."

  • ...[87] found this non-parametric approach to be more robust....


  • ...(b) Critical difference diagram [87]: the x-axis shows mean rank, blue bars link detectors for which there is insufficient evidence to declare them statistically significantly different (due to the relatively low number of performance samples and fairly high variance)....


  • ...A further in-depth study by García and Herrera [88] concludes that the Nemenyi post-hoc test which was used by [87] (and also in the PASCAL challenge [14]) is too conservative for n × n comparisons such as in a benchmark....


  • ...[87] introduced a series of powerful statistical tests that operate on an m dataset by n algorithm performance matrix (e....

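The critical difference behind the CD diagrams mentioned in the snippets above is the Nemenyi quantity from Demšar's paper, CD = q_α · sqrt(k(k+1)/(6N)) for k algorithms on N data sets; a minimal sketch (the q_α value is the α = 0.05, k = 5 entry tabulated in the paper, and the detector/data-set counts are illustrative):

```python
# Nemenyi critical difference: two algorithms are significantly different
# if their average ranks differ by at least CD.
import math

def nemenyi_cd(k, n, q_alpha):
    """k algorithms, n data sets, q_alpha from the Studentized range table."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

print(nemenyi_cd(k=5, n=6, q_alpha=2.728))  # e.g., 5 detectors on 6 data sets
```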

References
Journal ArticleDOI
TL;DR: In this paper, a simple and widely accepted multiple test procedure of the sequentially rejective type is presented, i.e. hypotheses are rejected one at a time until no further rejections can be done.
Abstract: This paper presents a simple and widely applicable multiple test procedure of the sequentially rejective type, i.e. hypotheses are rejected one at a time until no further rejections can be done. It is shown that the test has a prescribed level of significance protection against error of the first kind for any combination of true hypotheses. The power properties of the test and a number of possible applications are also discussed.
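A minimal sketch of the step-down procedure the abstract describes (the function name and p-values are illustrative): sort the p-values ascending and compare the i-th smallest against α/(m − i + 1), stopping at the first non-rejection.

```python
# Holm's sequentially rejective procedure: reject while the i-th smallest
# p-value (1-indexed) is at most alpha / (m - i + 1); once one hypothesis
# survives, all remaining (larger) p-values are retained as well.
def holm(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):          # step is 0-indexed
        if p_values[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break
    return reject

print(holm([0.01, 0.04, 0.03, 0.005]))  # -> [True, False, False, True]
```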

20,459 citations


"Statistical Comparisons of Classifi..." refers methods in this paper

  • ...The simplest such methods are due to Holm (1979) and Hochberg (1988)....


01 Jan 1998

UCI Repository of Machine Learning Databases

12,940 citations


"Statistical Comparisons of Classifi..." refers methods in this paper

  • ...We have compiled a sample of forty real-world data sets from the UCI machine learning repository (Blake and Merz, 1998); we have used the data sets with discrete classes and avoided artificial data sets like Monk problems....


Book ChapterDOI
Frank Wilcoxon
TL;DR: The comparison of two treatments generally falls into one of two categories: (a) we may have a number of replications for each of the two treatments, which are unpaired, or (b) we may have a number of paired comparisons leading to a series of differences, some of which may be positive and some negative.
Abstract: The comparison of two treatments generally falls into one of the following two categories: (a) we may have a number of replications for each of the two treatments, which are unpaired, or (b) we may have a number of paired comparisons leading to a series of differences, some of which may be positive and some negative. The appropriate methods for testing the significance of the differences of the means in these two cases are described in most of the textbooks on statistical methods.
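Case (b) is the construction Demšar's recommended test builds on; a hand-rolled sketch with hypothetical paired differences shows how the signed-rank statistic is formed:

```python
# Wilcoxon signed-ranks: rank |differences| (average ranks for ties), then
# sum the ranks of positive and of negative differences separately.
from scipy.stats import rankdata

diffs = [0.03, -0.01, 0.04, 0.02, -0.02, 0.05]   # made-up paired differences
ranks = rankdata([abs(d) for d in diffs])
r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
print(r_plus, r_minus, min(r_plus, r_minus))      # T = min(R+, R-)
```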

12,871 citations


"Statistical Comparisons of Classifi..." refers methods in this paper

  • ...3.1.3 WILCOXON SIGNED-RANKS TEST The Wilcoxon signed-ranks test (Wilcoxon, 1945) is a non-parametric alternative to the paired t-test, which ranks the differences in performances of two classifiers for each data set, ignoring the signs, and compares the ranks for the positive and the negative…...


  • ...Since we will finally recommend the Wilcoxon (1945) signed-ranks test, it will be presented with more details....


Journal ArticleDOI
William S. Cleveland
TL;DR: Robust locally weighted regression as discussed by the authors is a method for smoothing a scatterplot, in which the fitted value at x_k is the value of a polynomial fit to the data using weighted least squares, where the weight for (x_i, y_i) is large if x_i is close to x_k and small if it is not.
Abstract: The visual information on a scatterplot can be greatly enhanced, with little additional cost, by computing and plotting smoothed points. Robust locally weighted regression is a method for smoothing a scatterplot, (x_i, y_i), i = 1, …, n, in which the fitted value at x_k is the value of a polynomial fit to the data using weighted least squares, where the weight for (x_i, y_i) is large if x_i is close to x_k and small if it is not. A robust fitting procedure is used that guards against deviant points distorting the smoothed points. Visual, computational, and statistical issues of robust locally weighted regression are discussed. Several examples, including data on lead intoxication, are used to illustrate the methodology.
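statsmodels ships a LOWESS smoother in the spirit of Cleveland's method; a short sketch on synthetic data (the it argument sets the number of robustifying iterations that down-weight deviant points):

```python
# Robust locally weighted scatterplot smoothing on noisy data with outliers.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
y[::25] += 3.0  # a few deviant points the robust fit should resist

# frac: fraction of the data used in each local weighted least-squares fit.
smoothed = lowess(y, x, frac=0.25, it=3)  # returns sorted (x, fitted) pairs
print(smoothed[:5])
```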

10,225 citations


"Statistical Comparisons of Classifi..." refers methods in this paper

  • ...C4.5), naive Bayesian learner that models continuous probabilities using LOESS (Cleveland, 1979), naive Bayesian learner with continuous attributes discretized using Fayyad-Irani’s discretization (Fayyad and Irani, 1993) and kNN (k=10, neighbour weights adjusted with the Gaussian kernel)....


  • ...4.1.1 DATA SETS AND LEARNING ALGORITHMS We based our experiments on several common learning algorithms and their variations: C4.5, C4.5 with m and C4.5 with cf fitted for optimal accuracy, another tree learning algorithm implemented in Orange (with features similar to the original C4.5), naive Bayesian learner that models continuous probabilities using LOESS (Cleveland, 1979), naive Bayesian learner with continuous attributes discretized using Fayyad-Irani’s discretization (Fayyad and Irani, 1993) and kNN (k=10, neighbour weights adjusted with the Gaussian kernel)....


  • ...…implemented in Orange (with features similar to the original C4.5), naive Bayesian learner that models continuous probabilities using LOESS (Cleveland, 1979), naive Bayesian learner with continuous attributes discretized using Fayyad-Irani’s discretization (Fayyad and Irani, 1993) and kNN…...
