Publication V
Markus Ojala and Gemma C. Garriga. 2010. Permutation tests for studying
classifier performance. Journal of Machine Learning Research, volume 11,
pages 1833-1863.
© 2010 by authors

Journal of Machine Learning Research 11 (2010) 1833-1863 Submitted 10/09; Revised 5/10; Published 6/10
Permutation Tests for Studying Classifier Performance
Markus Ojala MARKUS.OJALA@TKK.FI
Helsinki Institute for Information Technology
Department of Information and Computer Science
Aalto University School of Science and Technology
P.O. Box 15400, FI-00076 Aalto, Finland
Gemma C. Garriga GEMMA.GARRIGA@LIP6.FR
Université Pierre et Marie Curie
Laboratoire d’Informatique de Paris 6
4 place Jussieu, 75005 Paris, France
Editor: Xiaotong Shen
Abstract
We explore the framework of permutation-based p-values for assessing the performance of classi-
fiers. In this paper we study two simple permutation tests. The first test assesses whether the classifier
has found a real class structure in the data; the corresponding null distribution is estimated by per-
muting the labels in the data. This test has been used extensively in classification problems in
computational biology. The second test studies whether the classifier is exploiting the dependency
between the features in classification; the corresponding null distribution is estimated by permut-
ing the features within classes, inspired by restricted randomization techniques traditionally used
in statistics. This new test can serve to identify descriptive features which can be valuable infor-
mation in improving the classifier performance. We study the properties of these tests and present
an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the
classifier performance via permutation tests is effective. In particular, the restricted permutation
test clearly reveals whether the classifier exploits the interdependency between the features in the
data.
Keywords: classification, labeled data, permutation tests, restricted randomization, significance
testing
1. Introduction
Building effective classification systems is a central task in data mining and machine learning.
Usually, a classification algorithm builds a model from a given set of data records in which the labels
are known, and later, the learned model is used to assign labels to new data points. Applications of
such a classification setting abound in many fields, for instance, in text categorization, fraud detection,
optical character recognition, or medical diagnosis, to name a few.
For all these applications, a desired property of a good classifier is the power of generalization
to new, unknown instances. The detection and characterization of statistically significant predictive
patterns is crucial for obtaining a good classification accuracy that generalizes beyond the training
data. Unfortunately, it is very often the case that the number of available data points with labels is
not sufficient. Data from medical or biological applications, for example, are characterized by high
© 2010 Markus Ojala and Gemma C. Garriga.

Figure 1: Examples of two 16 × 8 nominal data sets D1 and D2, each having two classes. The last
column in both data sets denotes the class labels (+, −) of the samples in the rows.
dimensionality (thousands of features) and a small number of data points (tens of rows). An important
question is whether we should believe in the classification accuracy obtained by such classifiers.
The most traditional approach to this problem is to estimate the error of the classifier by means
of cross-validation or leave-one-out cross-validation, among others. This estimate, together with a
variance-based bound, provides an interval for the expected error of the classifier. The error estimate
itself is the best statistic when different classifiers are compared against each other (Hsing et al.,
2003). However, it has been argued that evaluating a single classifier with an error measurement
is ineffective for a small number of data samples (Braga-Neto and Dougherty, 2004; Golland et al.,
2005; Isaksson et al., 2008). Also, classical generalization bounds are not directly appropriate when
the dimensionality of the data is too high; for these reasons, some recent approaches using filtering
and regularization alleviate this problem (Rossi and Villa, 2006; Berlinet et al., 2008). Indeed,
for many other general cases, it is useful to have other statistics associated with the error in order
to understand better the behavior of the classifier. For example, even if a classification algorithm
produces a classifier with low error, the data itself may have no structure. Thus the question is, how
can we trust that the classifier has learned a significant predictive pattern in the data and that the
chosen classifier is appropriate for the specific classification task?
For instance, consider the small toy example in Figure 1. There are two nominal data matrices
D1 and D2 of sizes 16 × 8. Each row (data point) has two different values present, x and o. Both
data sets have a clear separation into the two given classes, + and −. However, it seems at first sight
that the structure within the classes for data set D1 is much simpler than for data set D2. If we train
a 1-Nearest Neighbor classifier on the data sets of Figure 1, we have that the classification error
(leave-one-out cross-validation) is 0.00 on both D1 and D2. However, is it true that the classifier is
using a real dependency in the data? Or are the dependencies in D1 or D2 just a random artifact of
some simple structure? It turns out that the good classification result in D1 is explained purely by
the different value distributions inside the classes, whereas in D2 the interdependency between the
features is important in classification. This example will be analyzed in detail later on in Section 3.3.
In recent years, a number of papers have suggested to use permutation-based p-values for as-
sessing the competence of a classifier (Golland and Fischl, 2003; Golland et al., 2005; Hsing et al.,
2003; Jensen, 1992; Molinaro et al., 2005). Essentially, the permutation test procedure measures
how likely it is that the observed accuracy would be obtained by chance. A p-value represents the fraction
of random data sets under a certain null hypothesis where the classifier behaved as well as or better
than in the original data.
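As an illustration of this definition, a minimal sketch of such a permutation p-value for the label-permutation test might look as follows. The sketch is ours, not the authors' code: the choice of scikit-learn, of a 1-nearest-neighbor classifier, and of the add-one convention for the empirical p-value are all assumptions made only for the example.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def loo_error(X, y):
        # Leave-one-out cross-validation error of a 1-nearest-neighbor classifier.
        clf = KNeighborsClassifier(n_neighbors=1)
        return 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

    def label_permutation_p_value(X, y, k=1000, seed=0):
        # Fraction of label-permuted data sets on which the classifier performs
        # at least as well as on the original data; the +1 in numerator and
        # denominator is a common convention for empirical permutation p-values.
        rng = np.random.default_rng(seed)
        e_orig = loo_error(X, y)
        at_least_as_good = sum(loo_error(X, rng.permutation(y)) <= e_orig
                               for _ in range(k))
        return (at_least_as_good + 1) / (k + 1)

A small p-value from this test then indicates that the classifier has found some real connection between the data and the class labels.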
Traditional permutation tests suggested in the recent literature study the null hypothesis that
the features and the labels are independent, that is, that there is no difference between the classes.
The null distribution under this null hypothesis is estimated by permuting the labels of the data set.
This corresponds also to the most traditional statistical methods (Good, 2000), where the results on
a control group are compared against the results on a treatment group. This simple test has been
proven effective already for selecting relevant genes in small data samples (Maglietta et al., 2007) or
for attribute selection in decision trees (Frank, 2000; Frank and Witten, 1998). However, the related
literature has not performed extensive experimental studies for this traditional test in more general
cases.
The goal of this paper is to study permutation tests for assessing the properties and performance
of the classifiers. We first study the traditional permutation test for testing whether the classifier has
found a real class structure, that is, a real connection between the data and the class labels. Our
experimental studies suggest that this traditional null hypothesis leads to very low p-values, thus
rendering the classifier significant most of the time even if the class structure is weak.
We then propose a test for studying whether the classifier is exploiting dependency between
some features for improving the classification accuracy. This second test is inspired by restricted
randomization techniques traditionally used in statistics (Good, 2000). We study its relation to
the traditional method both analytically and empirically. This new test can serve as a method for
obtaining descriptive properties for classifiers, namely whether the classifier is using the feature
dependency in the classification or not. For example, many existing classification algorithms are
like black boxes whose functionality is hard to interpret directly. In such cases, indirect methods
are needed to get descriptive information for the obtained class structure in the data.
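To make the restricted randomization behind this second test concrete, the following sketch permutes each feature column independently inside each class, which preserves each feature's class-conditional value distribution while breaking the dependencies between features within a class. The helper is our own illustration, not the authors' implementation:

    import numpy as np

    def permute_within_classes(X, y, seed=0):
        # One null sample for the restricted test: permute every column
        # separately within the rows of each class.
        rng = np.random.default_rng(seed)
        X_perm = X.copy()
        for label in np.unique(y):
            rows = np.flatnonzero(y == label)
            for j in range(X_perm.shape[1]):
                X_perm[rows, j] = X[rng.permutation(rows), j]
        return X_perm

A p-value is then obtained exactly as for the label-permutation test, except that the classifier is re-evaluated on permute_within_classes(X, y) rather than on data with permuted labels.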
If the studied data set is known to contain useful feature dependencies that increase the class
separation, this new test can be used to evaluate the classifier against this knowledge. For example,
often the data is gathered by a domain expert having deeper knowledge of the inner structure of
the data. If the classifier is not using a known useful dependency, the classifier performance could
be improved. For example, with medical data, if we are predicting the blood pressure of a person
based on the height and the weight of the individual, the dependency between these two features is
important in the classification, as a large body mass index is known to be connected with high blood
pressure. Both weight and height convey information about the blood pressure, but the dependency
between them is the most important factor in describing the blood pressure. Of course,
in this case we could introduce a new feature, the body mass index, but in general, this may not be
practical; for example, introducing too many new features can make the classification ineffective or
too time consuming.
If nothing is known in advance about the structure of the data, Test 2 can give some descriptive
information about the obtained class structure. This information can be useful as such for understanding
the properties of the classifier, or it can guide the search towards an optimal classifier. For example,
if the classifier is not exploiting the feature dependency, there might be no reason to use the chosen
classifier as either more complex classifiers (if the data contains useful feature dependencies) or
simpler classifiers (if the data does not contain useful feature dependencies) could perform better.
Note, however, that not all feature dependencies are useful in predicting the class labels. Therefore,
in the same way that traditional permutation tests have already been proven useful for selecting
relevant features in some contexts as mentioned above (Maglietta et al., 2007; Frank, 2000; Frank
and Witten, 1998), the new test can serve for selecting combinations of relevant features to boost
the classifier performance for specific applications.
The idea is to provide users with practical p-values for the analysis of the classifier. The per-
mutation tests provide useful statistics about the underlying reasons for the obtained classification
result. Indeed, no test is better than the other, but all provide us with information about the classifier
performance. Each p-value is a statistic about the classifier performance; each p-value depends on
the original data (whether it contains some real structure or not) and the classifier (whether it is able
to use certain structure in the data or not).
The remainder of the paper is organized as follows. In Section 2, we give the background on
classifiers and permutation-test p-values, and discuss connections with previous related work. In
Section 3, we describe two simple permutation methods and study their behavior on the small toy
example in Figure 1. In Section 4, we analyze in detail the properties of the different permutations
and the effect of the tests for synthetic data on four different classifiers. In Section 5, we give
experimental results on various real data sets. Finally, Section 6 concludes the paper.¹
2. Background
Let X be an n × m data matrix. For example, in gene expression analysis the values of the matrix X
are numerical expression measurements, each row is a tissue sample and each column represents a
gene. We denote the i-th row vector of X by X_i and the j-th column vector of X by X^j. Rows are also
called observations or data points, while columns are also called attributes or features. Observe that
we do not restrict the data domain of X and therefore the scale of its attributes can be categorical or
numerical.
Associated to the data points X_i we have a class label y_i. We assume a finite set of known class
labels Y, so y_i ∈ Y. Let D be the set of labeled data, D = {(X_i, y_i)}_{i=1}^{n}. For the gene expression
example above, the class labels associated to each tissue sample could be, for example, “sick” or
“healthy”.
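As a concrete illustration (ours, not the paper's notation), the labeled data D can simply be held as a feature matrix X paired with a label vector y; nominal data such as the x/o values of Figure 1 fit the same shape:

    import numpy as np

    # n = 4 observations, m = 3 features, with nominal values as in Figure 1.
    X = np.array([["x", "o", "x"],
                  ["x", "x", "o"],
                  ["o", "o", "o"],
                  ["o", "x", "o"]])
    y = np.array(["+", "+", "-", "-"])  # one class label y_i for each row X_i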
In a traditional classification task the aim is to predict the label of new data points by training
a classifier from D. The function learned by the classification algorithm is denoted by f : X → Y.
A test statistic is typically computed to evaluate the classifier performance: this can be either
the training error, cross-validation error or jackknife estimate, among others. Here we give as an
example the leave-one-out cross-validation error,

    e(f, D) = \frac{1}{n} \sum_{i=1}^{n} I\left( f_{D \setminus D_i}(X_i) \neq y_i \right)    (1)
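Read literally, Equation (1) trains the classifier once for every data point on the remaining n − 1 points and averages the indicator of misclassification; here f_{D\D_i} is the classifier learned without the i-th labeled point and I(·) is the indicator function. A direct sketch, assuming a generic classifier exposing fit and predict methods (the interface is our assumption, not part of the paper):

    import numpy as np

    def loo_cv_error(make_classifier, X, y):
        # e(f, D) from Equation (1): for each i, train on D \ D_i and check
        # whether the prediction for X_i differs from y_i.
        n = len(y)
        mistakes = 0
        for i in range(n):
            keep = np.arange(n) != i  # indices of D \ D_i
            f = make_classifier().fit(X[keep], y[keep])
            mistakes += int(f.predict(X[i:i + 1])[0] != y[i])
        return mistakes / n

For example, loo_cv_error(lambda: KNeighborsClassifier(n_neighbors=1), X, y) would compute the leave-one-out 1-Nearest Neighbor error discussed for the toy example of Figure 1.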
1. A shorter version of this paper appears in the proceedings of the IEEE International Conference on Data Mining (Ojala
and Garriga, 2009). This is an improved version based on valuable comments by reviewers which includes: detailed
discussions and examples, extended theoretical analysis of the tests including statistical power in special case scenar-
ios, related work comparisons and a thorough experimental evaluation with large data sets.

Frequently Asked Questions
Q1. What are the contributions in "Permutation tests for studying classifier performance"?

The authors explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper the authors study two simple permutation tests. The authors study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. 

However, if the classifier is not significant with Test 2, that is, if the authors obtain a high p-value, there are three different possibilities: (1) there are no dependencies between the features in the data; (2) there are some dependencies between the features in the data but they do not increase the class separation; or (3) there are useful dependencies between the features in the data that increase the class separation but the chosen classifier is not able to exploit them. Future work should explore the use of Test 2 for selecting the best discriminant features for classifiers, in similar fashion as Test 1 has been used for decision trees and other biological applications (Frank, 2000; Frank and Witten, 1998; Maglietta et al., 2007). Also, it would be useful to extend the setting to unsupervised learning, such as clustering. However, in general, when a high p-value is obtained with Test 2, the authors cannot know which of these applies to the data and to the chosen classifier.

The evaluation of the different models in this local search strategy is done via permutation tests, using the framework of multiple hypothesis testing (Benjamini and Hochberg, 1995; Holm, 1979). 
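The excerpt above only names the multiple hypothesis testing framework; as one concrete illustration, Holm's sequentially rejective procedure (Holm, 1979) applied to a collection of permutation p-values could be sketched as below (the function and its interface are ours, not taken from the paper):

    import numpy as np

    def holm_reject(p_values, alpha=0.05):
        # Step-down Holm procedure: visit the p-values in increasing order and
        # reject the i-th smallest while p_(i) <= alpha / (m - i + 1); stop at
        # the first failure.
        p = np.asarray(p_values, dtype=float)
        m = len(p)
        reject = np.zeros(m, dtype=bool)
        for rank, idx in enumerate(np.argsort(p)):  # rank = i - 1
            if p[idx] <= alpha / (m - rank):
                reject[idx] = True
            else:
                break
        return reject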

Note that when the null hypothesis is true, that is, t = 1/2, the power of Test 1 calculated by Equation (4) equals the significance level α, as it should.

On the randomized samples of data set D1 the authors obtain an average classification error of 0.53, a standard deviation of 0.14 and a minimum classification error of 0.13.

For large data sets, the authors divide the data set into a training set of 10 000 random rows and a test set containing the rest of the rows.

Missing values and the combination of nominal and numerical values are given as such as input to the classifiers; the classifiers' default approaches in Weka are used to handle these cases.

Bin(n, 1/2 − (1/π) arcsin ρ) ≈ N(n/2 − (n/π) arcsin ρ, n/4 − (n/π²) arcsin² ρ), where 1/2 − (1/π) arcsin ρ is the probability of incorrectly classifying a sample by Equation (2).
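The quoted approximation is the usual normal approximation Bin(n, p) ≈ N(np, np(1 − p)) with p = 1/2 − (1/π) arcsin ρ; as a quick check of the stated parameters (our own verification, not a derivation taken from the paper):

    % With p = 1/2 - (1/\pi)\arcsin\rho, the binomial moments are
    %   np      = n/2 - (n/\pi)\arcsin\rho,
    %   np(1-p) = n\left(\tfrac{1}{2}-\tfrac{1}{\pi}\arcsin\rho\right)
    %             \left(\tfrac{1}{2}+\tfrac{1}{\pi}\arcsin\rho\right)
    %           = n/4 - (n/\pi^{2})\arcsin^{2}\rho,
    % matching the mean and variance of the normal distribution quoted above.
    \mathrm{Bin}\!\left(n,\ \tfrac{1}{2}-\tfrac{1}{\pi}\arcsin\rho\right)
      \approx
    \mathcal{N}\!\left(\frac{n}{2}-\frac{n}{\pi}\arcsin\rho,\
                       \frac{n}{4}-\frac{n}{\pi^{2}}\arcsin^{2}\rho\right)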

That is, the authors try more complex classifiers that could use the possible existing feature dependency, as well as simpler classifiers that could perform better if no feature dependency exists. 

Note that in total the authors will compute the error of the classifier r + k times: r times on the original data and one time for each of the k randomized data sets.

As a more important reason, the traditional permutation tests easily regard the results as significant even if there is only a slight class structure present, because in the corresponding permuted data sets there is no class structure, especially if the original data set is large.

The authors find that permuting the data columns is the randomization method producing the most diverse samples, while permuting labels (Test 1) and permuting data within classes (Test 2) produce different randomized samples.