Publication V
Markus Ojala and Gemma C. Garriga. 2010. Permutation tests for studying
classifier performance. Journal of Machine Learning Research, volume 11,
pages 1833-1863.
© 2010 by authors

Journal of Machine Learning Research 11 (2010) 1833-1863 Submitted 10/09; Revised 5/10; Published 6/10
Permutation Tests for Studying Classifier Performance
Markus Ojala MARKUS.OJALA@TKK.FI
Helsinki Institute for Information Technology
Department of Information and Computer Science
Aalto University School of Science and Technology
P.O. Box 15400, FI-00076 Aalto, Finland
Gemma C. Garriga GEMMA.GARRIGA@LIP6.FR
Université Pierre et Marie Curie
Laboratoire d’Informatique de Paris 6
4 place Jussieu, 75005 Paris, France
Editor: Xiaotong Shen
Abstract
We explore the framework of permutation-based p-values for assessing the performance of classi-
fiers. In this paper we study two simple permutation tests. The first test assesses whether the classifier
has found a real class structure in the data; the corresponding null distribution is estimated by per-
muting the labels in the data. This test has been used extensively in classification problems in
computational biology. The second test studies whether the classifier is exploiting the dependency
between the features in classification; the corresponding null distribution is estimated by permut-
ing the features within classes, inspired by restricted randomization techniques traditionally used
in statistics. This new test can serve to identify descriptive features which can be valuable infor-
mation in improving the classifier performance. We study the properties of these tests and present
an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the
classifier performance via permutation tests is effective. In particular, the restricted permutation
test clearly reveals whether the classifier exploits the interdependency between the features in the
data.
Keywords: classification, labeled data, permutation tests, restricted randomization, significance
testing
1. Introduction
Building effective classification systems is a central task in data mining and machine learning.
Usually, a classification algorithm builds a model from a given set of data records in which the labels
are known, and later, the learned model is used to assign labels to new data points. Applications of
such a classification setting abound in many fields, for instance, in text categorization, fraud detection,
optical character recognition, or medical diagnosis, to name a few.
For all these applications, a desired property of a good classifier is the power of generalization
to new, unknown instances. The detection and characterization of statistically significant predictive
patterns is crucial for obtaining a good classification accuracy that generalizes beyond the training
data. Unfortunately, it is very often the case that the number of available data points with labels is
not sufficient. Data from medical or biological applications, for example, are characterized by high
© 2010 Markus Ojala and Gemma C. Garriga.

Figure 1: Examples of two 16 × 8 nominal data sets D1 and D2, each having two classes. The last
column in both data sets denotes the class labels (+, −) of the samples in the rows.
dimensionality (thousands of features) and a small number of data points (tens of rows). An important
question is whether we should believe in the classification accuracy obtained by such classifiers.
The most traditional approach to this problem is to estimate the error of the classifier by means
of cross-validation or leave-one-out cross-validation, among others. This estimate, together with a
variance-based bound, provides an interval for the expected error of the classifier. The error estimate
itself is the best statistic when different classifiers are compared against each other (Hsing et al.,
2003). However, it has been argued that evaluating a single classifier with an error measurement
is ineffective for a small number of data samples (Braga-Neto and Dougherty, 2004; Golland et al.,
2005; Isaksson et al., 2008). Also, classical generalization bounds are not directly appropriate when
the dimensionality of the data is too high; for these reasons, some recent approaches using filtering
and regularization alleviate this problem (Rossi and Villa, 2006; Berlinet et al., 2008). Indeed,
for many other general cases, it is useful to have other statistics associated with the error in order
to understand better the behavior of the classifier. For example, even if a classification algorithm
produces a classifier with low error, the data itself may have no structure. Thus the question is, how
can we trust that the classifier has learned a significant predictive pattern in the data and that the
chosen classifier is appropriate for the specific classification task?
For instance, consider the small toy example in Figure 1. There are two nominal data matrices
D1 and D2 of sizes 16 × 8. Each row (data point) has two different values present, x and o. Both
data sets have a clear separation into the two given classes, + and −. However, it seems at first sight
that the structure within the classes for data set D1 is much simpler than for data set D2. If we train
a 1-Nearest Neighbor classifier on the data sets of Figure 1, we have that the classification error
(leave-one-out cross-validation) is 0.00 on both D1 and D2. However, is it true that the classifier is
using a real dependency in the data? Or are the dependencies in D1 or D2 just a random artifact of
some simple structure? It turns out that the good classification result in D1 is explained purely by
the different value distributions inside the classes, whereas in D2 the interdependency between the
features is important in classification. This example will be analyzed in detail later on in Section 3.3.
In recent years, a number of papers have suggested to use permutation-based p-values for as-
sessing the competence of a classifier (Golland and Fischl, 2003; Golland et al., 2005; Hsing et al.,
2003; Jensen, 1992; Molinaro et al., 2005). Essentially, the permutation test procedure measures
how likely it is that the observed accuracy would be obtained by chance. A p-value represents the fraction
of random data sets under a certain null hypothesis where the classifier behaved as well as or better
than in the original data.
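As an illustration of this definition, a minimal sketch of such a permutation p-value for the label-permutation test might look as follows. The sketch is ours, not the authors' code: the choice of scikit-learn, of a 1-nearest-neighbor classifier, and of the add-one convention for the empirical p-value are all assumptions made only for the example.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def loo_error(X, y):
        # Leave-one-out cross-validation error of a 1-nearest-neighbor classifier.
        clf = KNeighborsClassifier(n_neighbors=1)
        return 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

    def label_permutation_p_value(X, y, k=1000, seed=0):
        # Fraction of label-permuted data sets on which the classifier performs
        # at least as well as on the original data; the +1 in numerator and
        # denominator is a common convention for empirical permutation p-values.
        rng = np.random.default_rng(seed)
        e_orig = loo_error(X, y)
        at_least_as_good = sum(loo_error(X, rng.permutation(y)) <= e_orig
                               for _ in range(k))
        return (at_least_as_good + 1) / (k + 1)

A small p-value from this test then indicates that the classifier has found some real connection between the data and the class labels.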
Traditional permutation tests suggested in the recent literature study the null hypothesis that
the features and the labels are independent, that is, that there is no difference between the classes.
The null distribution under this null hypothesis is estimated by permuting the labels of the data set.
This corresponds also to the most traditional statistical methods (Good, 2000), where the results on
a control group are compared against the results on a treatment group. This simple test has been
proven effective already for selecting relevant genes in small data samples (Maglietta et al., 2007) or
for attribute selection in decision trees (Frank, 2000; Frank and Witten, 1998). However, the related
literature has not performed extensive experimental studies for this traditional test in more general
cases.
The goal of this paper is to study permutation tests for assessing the properties and performance
of the classifiers. We first study the traditional permutation test for testing whether the classifier has
found a real class structure, that is, a real connection between the data and the class labels. Our
experimental studies suggest that this traditional null hypothesis leads to very low p-values, thus
rendering the classifier significant most of the time even if the class structure is weak.
We then propose a test for studying whether the classifier is exploiting dependency between
some features for improving the classification accuracy. This second test is inspired by restricted
randomization techniques traditionally used in statistics (Good, 2000). We study its relation to
the traditional method both analytically and empirically. This new test can serve as a method for
obtaining descriptive properties for classifiers, namely whether the classifier is using the feature
dependency in the classification or not. For example, many existing classification algorithms are
like black boxes whose functionality is hard to interpret directly. In such cases, indirect methods
are needed to get descriptive information for the obtained class structure in the data.
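To make the restricted randomization behind this second test concrete, the following sketch permutes each feature column independently inside each class, which preserves each feature's class-conditional value distribution while breaking the dependencies between features within a class. The helper is our own illustration, not the authors' implementation:

    import numpy as np

    def permute_within_classes(X, y, seed=0):
        # One null sample for the restricted test: permute every column
        # separately within the rows of each class.
        rng = np.random.default_rng(seed)
        X_perm = X.copy()
        for label in np.unique(y):
            rows = np.flatnonzero(y == label)
            for j in range(X_perm.shape[1]):
                X_perm[rows, j] = X[rng.permutation(rows), j]
        return X_perm

A p-value is then obtained exactly as for the label-permutation test, except that the classifier is re-evaluated on permute_within_classes(X, y) rather than on data with permuted labels.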
If the studied data set is known to contain useful feature dependencies that increase the class
separation, this new test can be used to evaluate the classifier against this knowledge. For example,
often the data is gathered by a domain expert having deeper knowledge of the inner structure of
the data. If the classifier is not using a known useful dependency, the classifier performance could
be improved. For example, with medical data, if we are predicting the blood pressure of a person
based on the height and the weight of the individual, the dependency between these two features is
important in the classification, as a large body mass index is known to be connected with high blood
pressure. Both weight and height convey information about the blood pressure, but the dependency
between them is the most important factor in describing the blood pressure. Of course,
in this case we could introduce a new feature, the body mass index, but in general, this may not be
practical; for example, introducing too many new features can make the classification ineffective or
too time consuming.
If nothing is known in advance about the structure of the data, Test 2 can give some descriptive
information about the obtained class structure. This information can be useful as such for understanding
the properties of the classifier, or it can guide the search towards an optimal classifier. For example,
if the classifier is not exploiting the feature dependency, there might be no reason to use the chosen
classifier as either more complex classifiers (if the data contains useful feature dependencies) or
simpler classifiers (if the data does not contain useful feature dependencies) could perform better.
Note, however, that not all feature dependencies are useful in predicting the class labels. Therefore,
in the same way that traditional permutation tests have already been proven useful for selecting
relevant features in some contexts as mentioned above (Maglietta et al., 2007; Frank, 2000; Frank
and Witten, 1998), the new test can serve for selecting combinations of relevant features to boost
the classifier performance for specific applications.
The idea is to provide users with practical p-values for the analysis of the classifier. The per-
mutation tests provide useful statistics about the underlying reasons for the obtained classification
result. Indeed, no test is better than the other, but all provide us with information about the classifier
performance. Each p-value is a statistic about the classifier performance; each p-value depends on
the original data (whether it contains some real structure or not) and the classifier (whether it is able
to use certain structure in the data or not).
The remainder of the paper is organized as follows. In Section 2, we give the background on
classifiers and permutation-test p-values, and discuss connections with previous related work. In
Section 3, we describe two simple permutation methods and study their behavior on the small toy
example in Figure 1. In Section 4, we analyze in detail the properties of the different permutations
and the effect of the tests for synthetic data on four different classifiers. In Section 5, we give
experimental results on various real data sets. Finally, Section 6 concludes the paper.¹
2. Background
Let X be an n × m data matrix. For example, in gene expression analysis the values of the matrix X
are numerical expression measurements, each row is a tissue sample and each column represents a
gene. We denote the i-th row vector of X by X_i and the j-th column vector of X by X^j. Rows are also
called observations or data points, while columns are also called attributes or features. Observe that
we do not restrict the data domain of X and therefore the scale of its attributes can be categorical or
numerical.
Associated to the data points X_i we have a class label y_i. We assume a finite set of known class
labels Y, so y_i ∈ Y. Let D be the set of labeled data, D = {(X_i, y_i)}_{i=1}^{n}. For the gene expression
example above, the class labels associated to each tissue sample could be, for example, “sick” or
“healthy”.
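As a concrete illustration (ours, not the paper's notation), the labeled data D can simply be held as a feature matrix X paired with a label vector y; nominal data such as the x/o values of Figure 1 fit the same shape:

    import numpy as np

    # n = 4 observations, m = 3 features, with nominal values as in Figure 1.
    X = np.array([["x", "o", "x"],
                  ["x", "x", "o"],
                  ["o", "o", "o"],
                  ["o", "x", "o"]])
    y = np.array(["+", "+", "-", "-"])  # one class label y_i for each row X_i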
In a traditional classification task the aim is to predict the label of new data points by training
a classifier from D. The function learned by the classification algorithm is denoted by f : X → Y.
A test statistic is typically computed to evaluate the classifier performance: this can be either
the training error, cross-validation error or jackknife estimate, among others. Here we give as an
example the leave-one-out cross-validation error,

    e(f, D) = \frac{1}{n} \sum_{i=1}^{n} I\left( f_{D \setminus D_i}(X_i) \neq y_i \right)    (1)
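Read literally, Equation (1) trains the classifier once for every data point on the remaining n − 1 points and averages the indicator of misclassification; here f_{D\D_i} is the classifier learned without the i-th labeled point and I(·) is the indicator function. A direct sketch, assuming a generic classifier exposing fit and predict methods (the interface is our assumption, not part of the paper):

    import numpy as np

    def loo_cv_error(make_classifier, X, y):
        # e(f, D) from Equation (1): for each i, train on D \ D_i and check
        # whether the prediction for X_i differs from y_i.
        n = len(y)
        mistakes = 0
        for i in range(n):
            keep = np.arange(n) != i  # indices of D \ D_i
            f = make_classifier().fit(X[keep], y[keep])
            mistakes += int(f.predict(X[i:i + 1])[0] != y[i])
        return mistakes / n

For example, loo_cv_error(lambda: KNeighborsClassifier(n_neighbors=1), X, y) would compute the leave-one-out 1-Nearest Neighbor error discussed for the toy example of Figure 1.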
1. A shorter version of this paper appears in the proceedings of the IEEE International Conference on Data Mining (Ojala
and Garriga, 2009). This is an improved version based on valuable comments by reviewers which includes: detailed
discussions and examples, extended theoretical analysis of the tests including statistical power in special case scenar-
ios, related work comparisons and a thorough experimental evaluation with large data sets.

Frequently Asked Questions
Q1. What are the contributions in "Permutation tests for studying classifier performance"?

The authors explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper the authors study two simple permutation tests. The authors study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. 

However, if the classifier is not significant with Test 2, that is, if the authors obtain a high p-value, there are three different possibilities: (1) there are no dependencies between the features in the data; (2) there are some dependencies between the features in the data but they do not increase the class separation; or (3) there are useful dependencies between the features in the data that increase the class separation but the chosen classifier is not able to exploit them. Future work should explore the use of Test 2 for selecting the best discriminant features for classifiers, in similar fashion as Test 1 has been used for decision trees and other biological applications (Frank, 2000; Frank and Witten, 1998; Maglietta et al., 2007). Also, it would be useful to extend the setting to unsupervised learning, such as clustering. However, in general, when a high p-value is obtained with Test 2, the authors cannot know which of these applies to the data and to the chosen classifier.

The evaluation of the different models in this local search strategy is done via permutation tests, using the framework of multiple hypothesis testing (Benjamini and Hochberg, 1995; Holm, 1979). 
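The excerpt above only names the multiple hypothesis testing framework; as one concrete illustration, Holm's sequentially rejective procedure (Holm, 1979) applied to a collection of permutation p-values could be sketched as below (the function and its interface are ours, not taken from the paper):

    import numpy as np

    def holm_reject(p_values, alpha=0.05):
        # Step-down Holm procedure: visit the p-values in increasing order and
        # reject the i-th smallest while p_(i) <= alpha / (m - i + 1); stop at
        # the first failure.
        p = np.asarray(p_values, dtype=float)
        m = len(p)
        reject = np.zeros(m, dtype=bool)
        for rank, idx in enumerate(np.argsort(p)):  # rank = i - 1
            if p[idx] <= alpha / (m - rank):
                reject[idx] = True
            else:
                break
        return reject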

Note that when the null hypothesis is true, that is, t = 1/2, the power of Test 1 calculated by Equation (4) equals the significance level α, as it should.

On the randomized samples of data set D1 the authors obtain an average classification error of 0.53, a standard deviation of 0.14 and a minimum classification error of 0.13.

For large data sets, the authors divide the data set into a training set of 10 000 random rows and a test set containing the rest of the rows.

Missing values and the combination of nominal and numerical values are given as such as input to the classifiers; the classifiers' default approaches in Weka are used to handle these cases.

Bin(n, 1/2 − (1/π) arcsin ρ) ≈ N(n/2 − (n/π) arcsin ρ, n/4 − (n/π²) arcsin² ρ), where 1/2 − (1/π) arcsin ρ is the probability of incorrectly classifying a sample by Equation (2).
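The quoted approximation is the usual normal approximation Bin(n, p) ≈ N(np, np(1 − p)) with p = 1/2 − (1/π) arcsin ρ; as a quick check of the stated parameters (our own verification, not a derivation taken from the paper):

    % With p = 1/2 - (1/\pi)\arcsin\rho, the binomial moments are
    %   np      = n/2 - (n/\pi)\arcsin\rho,
    %   np(1-p) = n\left(\tfrac{1}{2}-\tfrac{1}{\pi}\arcsin\rho\right)
    %             \left(\tfrac{1}{2}+\tfrac{1}{\pi}\arcsin\rho\right)
    %           = n/4 - (n/\pi^{2})\arcsin^{2}\rho,
    % matching the mean and variance of the normal distribution quoted above.
    \mathrm{Bin}\!\left(n,\ \tfrac{1}{2}-\tfrac{1}{\pi}\arcsin\rho\right)
      \approx
    \mathcal{N}\!\left(\frac{n}{2}-\frac{n}{\pi}\arcsin\rho,\
                       \frac{n}{4}-\frac{n}{\pi^{2}}\arcsin^{2}\rho\right)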

That is, the authors try more complex classifiers that could use the possible existing feature dependency, as well as simpler classifiers that could perform better if no feature dependency exists. 

Note that in total the authors will compute the error of the classifier r + k times: r times on the original data and one time for each of the k randomized data sets.

As a more important reason, the traditional permutation tests easily regard the results as significant even if there is only a slight class structure present, because in the corresponding permuted data sets there is no class structure, especially if the original data set is large.

The authors find that permuting the data columns is the randomization method producing the most diverse samples, while permuting labels (Test 1) and permuting data within classes (Test 2) produce different randomized samples.