Journal of Machine Learning Research 3 (2002) 463-482 Submitted 11/01; Published 10/02
Rademacher and Gaussian Complexities:
Risk Bounds and Structural Results
Peter L. Bartlett Peter.Bartlett@anu.edu.au
Shahar Mendelson shahar@csl.anu.edu.au
Research School of Information Sciences and Engineering
Australian National University
Canberra 0200, Australia
Editor: Philip M. Long
Abstract
We investigate the use of certain data-dependent estimates of the complexity of a function class, called Rademacher and gaussian complexities. In a decision theoretic setting, we prove general risk bounds in terms of these complexities. We consider function classes that can be expressed as combinations of functions from basis classes and show how the Rademacher and gaussian complexities of such a function class can be bounded in terms of the complexity of the basis classes. We give examples of the application of these techniques in finding data-dependent risk bounds for decision trees, neural networks and support vector machines.
Keywords: Error Bounds, Data-Dependent Complexity, Rademacher Averages, Maximum Discrepancy
1. Introduction
In learning problems like pattern classification and regression, a considerable amount of effort has been spent on obtaining good error bounds. These are useful, for example, for the problem of model selection—choosing a model of suitable complexity. Typically, such bounds take the form of a sum of two terms: some sample-based estimate of performance and a penalty term that is large for more complex models. For example, in pattern classification, the following theorem is an improvement of a classical result of Vapnik and Chervonenkis (Vapnik and Chervonenkis, 1971).
Theorem 1 Let $F$ be a class of $\{\pm 1\}$-valued functions defined on a set $X$. Let $P$ be a probability distribution on $X \times \{\pm 1\}$, and suppose that $(X_1, Y_1), \ldots, (X_n, Y_n)$ and $(X, Y)$ are chosen independently according to $P$. Then, there is an absolute constant $c$ such that for any integer $n$, with probability at least $1 - \delta$ over samples of length $n$, every $f$ in $F$ satisfies
$$P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + c\sqrt{\frac{\mathrm{VCdim}(F)}{n}},$$
where $\mathrm{VCdim}(F)$ denotes the Vapnik-Chervonenkis dimension of $F$,
$$\hat{P}_n(S) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_S(X_i, Y_i),$$
and $\mathbf{1}_S$ is the indicator function of $S$.
© 2002 Peter L. Bartlett and Shahar Mendelson.
In this case, the sample-based estimate of performance is the proportion of examples in the training sample that are misclassified by the function $f$, and the complexity penalty term involves the VC-dimension of the class of functions. It is natural to use such bounds for the model selection scheme known as complexity regularization: choose the model class containing the function with the best upper bound on its error. The performance of such a model selection scheme critically depends on how well the error bounds match the true error (see Bartlett et al., 2002). There is theoretical and experimental evidence that error bounds involving a fixed complexity penalty (that is, a penalty that does not depend on the training data) cannot be universally effective (Kearns et al., 1997).
Recently, several authors have considered alternative notions of the complexity of a function class: the maximum discrepancy (Bartlett et al., 2002) and the Rademacher and gaussian complexities (see Bartlett et al., 2002, Koltchinskii, 2000, Koltchinskii and Panchenko, 2000a,b, Mendelson, 2001b).
Definition 2 Let $\mu$ be a probability distribution on a set $X$ and suppose that $X_1, \ldots, X_n$ are independent samples selected according to $\mu$. Let $F$ be a class of functions mapping from $X$ to $\mathbb{R}$. Define the maximum discrepancy of $F$ as the random variable
$$\hat{D}_n(F) = \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^{n/2} f(X_i) - \frac{2}{n} \sum_{i=n/2+1}^{n} f(X_i) \right|.$$
Denote the expected maximum discrepancy of $F$ by $D_n(F) = E\hat{D}_n(F)$.
Define the random variable
$$\hat{R}_n(F) = E\left[ \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \right| \,\middle|\, X_1, \ldots, X_n \right],$$
where $\sigma_1, \ldots, \sigma_n$ are independent uniform $\{\pm 1\}$-valued random variables. Then the Rademacher complexity of $F$ is $R_n(F) = E\hat{R}_n(F)$. Similarly, define the random variable
$$\hat{G}_n(F) = E\left[ \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^{n} g_i f(X_i) \right| \,\middle|\, X_1, \ldots, X_n \right],$$
where $g_1, \ldots, g_n$ are independent gaussian $N(0, 1)$ random variables. The gaussian complexity of $F$ is $G_n(F) = E\hat{G}_n(F)$.
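For intuition, these quantities can be approximated numerically when the class is finite. The sketch below (a hypothetical helper, not from the paper) estimates the empirical Rademacher complexity $\hat{R}_n(F)$ by Monte Carlo over draws of the noise $\sigma$, representing each function in $F$ by its vector of values on the sample:

```python
import numpy as np

def empirical_rademacher(fvals, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
    R_hat_n(F) = E_sigma [ sup_{f in F} | (2/n) sum_i sigma_i f(X_i) | ],
    where fvals is a (|F|, n) array whose row j holds f_j evaluated on the sample."""
    rng = np.random.default_rng(seed)
    m, n = fvals.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # uniform {+1, -1} noise sequence
        corr = np.abs(fvals @ sigma) * (2.0 / n)  # |(2/n) sum_i sigma_i f(X_i)| for each f
        total += corr.max()                       # sup over the (finite) class
    return total / n_draws
```

A class that cannot correlate with noise at all (e.g. the single zero function) has complexity 0, while richer classes score higher; this matches the interpretation of $R_n(F)$ given below.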
All three quantities are intuitively reasonable as measures of complexity of the function class $F$: $\hat{D}_n(F)$ quantifies how much the behavior on half of the sample can be unrepresentative of the behavior on the other half, and both $R_n(F)$ and $G_n(F)$ quantify the extent to which some function in the class $F$ can be correlated with a noise sequence of length $n$.
The following two lemmas show that these complexity measures are closely related. The proof of the first is in Appendix A; the second is from (Tomczak-Jaegermann, 1989).
Lemma 3 Let $F$ be a class of functions that map to $[-1, 1]$. Then for every integer $n$,
$$\frac{R_n(F)}{2} - 2\sqrt{\frac{2}{n}} \leq D_n(F) \leq R_n(F) + 4\sqrt{\frac{2}{n}}.$$
If $F$ is closed under negation, the lower bound can be strengthened to
$$R_n(F) - 4\sqrt{\frac{2}{n}} \leq D_n(F).$$
Furthermore,
$$P^n\left\{ \left| \hat{D}_n(F) - D_n(F) \right| \geq \epsilon \right\} \leq 2\exp\left(-\frac{\epsilon^2 n}{2}\right).$$
Lemma 4 There are absolute constants $c$ and $C$ such that for every class $F$ and every integer $n$,
$$c R_n(F) \leq G_n(F) \leq C \ln n \, R_n(F).$$
The following theorem is an example of the usefulness of these notions of complexity. The proof of the first part is in (Bartlett et al., 2002). The proof of the second part is a slight refinement of a proof of a more general result which we give below (Theorem 8); it is presented in Appendix B.
Theorem 5 Let $P$ be a probability distribution on $X \times \{\pm 1\}$, let $F$ be a set of $\{\pm 1\}$-valued functions defined on $X$, and let $(X_i, Y_i)_{i=1}^n$ be training samples drawn according to $P^n$.
(a) With probability at least $1 - \delta$, every function $f$ in $F$ satisfies
$$P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + \hat{D}_n(F) + \sqrt{\frac{9\ln(1/\delta)}{2n}}.$$
(b) With probability at least $1 - \delta$, every function $f$ in $F$ satisfies
$$P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + \frac{R_n(F)}{2} + \sqrt{\frac{\ln(1/\delta)}{2n}}.$$
The following result shows that this theorem implies the upper bound of Theorem 1 in terms of VC-dimension, as well as a refinement in terms of VC-entropy. In particular, Theorem 5 can never be much worse than the VC results. Since the proof of Theorem 5 is a close analog of the first step of the proof of VC-style results (the symmetrization step), this is not surprising. In fact, the bounds of Theorem 5 can be considerably better than Theorem 1, since the first part of the following result is in terms of the empirical VC-dimension.
Theorem 6 Fix a sample $X_1, \ldots, X_n$. For a function class $F \subseteq \{\pm 1\}^X$, define the restriction of $F$ to the sample as
$$F_{|X} = \{ (f(X_1), \ldots, f(X_n)) : f \in F \}.$$
Define the empirical VC-dimension of $F$ as $d = \mathrm{VCdim}(F_{|X})$ and the empirical VC-entropy of $F$ as $E = \log_2 |F_{|X}|$. Then
$$\hat{G}_n(F) = O\left(\sqrt{d/n}\right) \quad \text{and} \quad \hat{G}_n(F) = O\left(\sqrt{E/n}\right).$$
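For a finite class represented by its behaviour vectors on the sample, the empirical VC-entropy appearing in Theorem 6 can be computed directly, since it only counts distinct restrictions. A minimal sketch (hypothetical helper, with `fvals` holding one row per function):

```python
import math

def empirical_vc_entropy(fvals):
    """Empirical VC-entropy E = log2 |F_{|X}|: the number of distinct
    behaviour vectors (f(X_1), ..., f(X_n)) realised by the class on the sample."""
    behaviours = {tuple(row) for row in fvals}
    return math.log2(len(behaviours))

# Four ±1-valued functions on a sample of size 3, two of them identical on it:
fvals = [(1, 1, 1), (1, -1, 1), (-1, -1, -1), (1, 1, 1)]
print(empirical_vc_entropy(fvals))  # log2(3) ≈ 1.585
```

Note the deduplication: only the behaviours on the sample matter, which is what makes the bound data-dependent.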
The proof of this theorem is based on an upper bound on $\hat{G}_n$ which is due to Dudley, together with an upper bound on covering numbers due to Haussler (see Mendelson, 2001a).
Koltchinskii and Panchenko (2000a) proved an analogous error bound in terms of margins. The margin of a real-valued function $f$ on a labelled example $(x, y) \in X \times \{\pm 1\}$ is $yf(x)$. For a function $h : X \times Y \to \mathbb{R}$ and a training sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, we write
$$\hat{E}_n h(X, Y) = \frac{1}{n} \sum_{i=1}^{n} h(X_i, Y_i).$$
Theorem 7 Let $P$ be a probability distribution on $X \times \{\pm 1\}$ and let $F$ be a set of real-valued functions defined on $X$, with $\sup\{|f(x)| : f \in F\}$ finite for all $x \in X$. Suppose that $\phi : \mathbb{R} \to [0, 1]$ satisfies $\phi(\alpha) \geq \mathbf{1}(\alpha \leq 0)$ and is Lipschitz with constant $L$. Then with probability at least $1 - \delta$ with respect to training samples $(X_i, Y_i)_{i=1}^n$ drawn according to $P^n$, every function in $F$ satisfies
$$P(Yf(X) \leq 0) \leq \hat{E}_n \phi(Yf(X)) + 2L R_n(F) + \sqrt{\frac{\ln(2/\delta)}{2n}}.$$
This improves a number of results bounding error in terms of a sample average of a margin error plus a penalty term involving the complexity of the real-valued class (such as covering numbers and fat-shattering dimensions; see Bartlett, 1998, Mason et al., 2000, Schapire et al., 1998, Shawe-Taylor et al., 1998).
In the next section, we give a bound of this form that is applicable in a more general, decision-theoretic setting. Here, we have an input space $X$, an action space $A$ and an output space $Y$. The $n$ training examples $(X_1, Y_1), \ldots, (X_n, Y_n)$ are selected independently according to a probability measure $P$ on $X \times Y$. There is a loss function $L : Y \times A \to [0, 1]$, so that $L(y, a)$ reflects the cost of taking a particular action $a \in A$ when the outcome is $y \in Y$. The aim of learning is to choose a function $f$ that maps from $X$ to $A$, so as to minimize the expected loss $EL(Y, f(X))$.
For example, in multiclass classification, the output space is the set $Y = \{1, \ldots, k\}$ of class labels. When using error correcting output codes (Kong and Dietterich, 1995, Schapire, 1997) for this problem, the action space might be $A = [0, 1]^m$, and for each $y \in Y$ there is a codeword $a_y \in A$. The loss function $L(y, a)$ is equal to 0 if the codeword $a_{y'}$ closest to $a$ has $y' = y$, and 1 otherwise.
Section 2 gives bounds on the expected loss for decision-theoretic problems of this kind in terms of the sample average of a Lipschitz dominating cost function (a function that is pointwise larger than the loss function) plus a complexity penalty term involving a Rademacher complexity.
We also consider the problem of estimating $R_n(F)$ and $G_n(F)$ (for instance, for model selection). These quantities can be estimated by solving an optimization problem over $F$. However, for cases of practical interest, such optimization problems are difficult. On the other hand, in many such cases, functions in $F$ can be represented as combinations of functions from simpler classes. This is the case, for instance, for decision trees, voting methods, and neural networks. In Section 3, we show how the complexity of such a class can be related to the complexity of the class of basis functions. Section 4 describes examples of the application of these techniques.
An earlier version of this paper appeared in COLT'01 (Bartlett and Mendelson, 2001).
2. Risk Bounds
We begin with some notation. Given an independent sample $(X_i, Y_i)_{i=1}^n$ distributed as $(X, Y)$, we denote by $P_n$ the empirical measure supported on that sample and by $\mu_n$ the empirical measure supported on $(X_i)_{i=1}^n$. We say a function $\phi : Y \times A \to \mathbb{R}$ dominates a loss function $L$ if for all $y \in Y$ and $a \in A$, $\phi(y, a) \geq L(y, a)$. For a class of functions $F$, $\mathrm{conv}\,F$ is the class of convex combinations of functions from $F$, $-F = \{-f : f \in F\}$, $\mathrm{absconv}\,F$ is the class of convex combinations of functions from $F \cup -F$, and $cF = \{cf : f \in F\}$. If $\phi$ is a function defined on the range of the functions in $F$, let $\phi \circ F = \{\phi \circ f \mid f \in F\}$. Given a set $A$, we denote its characteristic function by $\mathbf{1}_A$ or $\mathbf{1}(A)$. Finally, constants are denoted by $C$ or $c$; their values may change from line to line, or even within the same line.
Theorem 8 Consider a loss function $L : Y \times A \to [0, 1]$ and a dominating cost function $\phi : Y \times A \to [0, 1]$. Let $F$ be a class of functions mapping from $X$ to $A$ and let $(X_i, Y_i)_{i=1}^n$ be independently selected according to the probability measure $P$. Then, for any integer $n$ and any $0 < \delta < 1$, with probability at least $1 - \delta$ over samples of length $n$, every $f$ in $F$ satisfies
$$EL(Y, f(X)) \leq \hat{E}_n \phi(Y, f(X)) + R_n(\tilde{\phi} \circ F) + \sqrt{\frac{8\ln(2/\delta)}{n}},$$
where $\tilde{\phi} \circ F = \{(x, y) \mapsto \phi(y, f(x)) - \phi(y, 0) : f \in F\}$.
The proof uses McDiarmid’s inequality (McDiarmid, 1989).
Theorem 9 (McDiarmid's Inequality) Let $X_1, \ldots, X_n$ be independent random variables taking values in a set $A$, and assume that $f : A^n \to \mathbb{R}$ satisfies
$$\sup_{x_1, \ldots, x_n, x_i' \in A} \left| f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \right| \leq c_i$$
for every $1 \leq i \leq n$. Then, for every $t > 0$,
$$P\left\{ f(X_1, \ldots, X_n) - Ef(X_1, \ldots, X_n) \geq t \right\} \leq e^{-2t^2 / \sum_{i=1}^n c_i^2}.$$
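As a quick numeric sanity check on the constants, the bound can be evaluated directly. A minimal sketch (hypothetical helper) applying it to the empirical mean of $n$ independent $[0, 1]$-valued variables, for which each $c_i = 1/n$ and the bound reduces to Hoeffding's $e^{-2nt^2}$:

```python
import math

def mcdiarmid_bound(c, t):
    """Upper bound P{f - Ef >= t} <= exp(-2 t^2 / sum_i c_i^2) from
    McDiarmid's inequality, given the bounded-difference constants c_i."""
    return math.exp(-2.0 * t ** 2 / sum(ci ** 2 for ci in c))

# Empirical mean of n independent [0, 1]-valued variables: each c_i = 1/n,
# so the bound becomes exp(-2 n t^2), recovering Hoeffding's inequality.
n, t = 100, 0.1
print(mcdiarmid_bound([1.0 / n] * n, t))  # exp(-2) ≈ 0.135
```

The same helper gives the $2/n$ bounded-difference case used in the proof of Theorem 8, where the supremum term changes by at most $2/n$ when one sample point is replaced.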
Proof (of Theorem 8) Since $\phi$ dominates $L$, for all $f \in F$ we can write
$$EL(Y, f(X)) \leq E\phi(Y, f(X)) \leq \hat{E}_n \phi(Y, f(X)) + \sup_{h \in \phi \circ F} \left( Eh - \hat{E}_n h \right)$$
$$= \hat{E}_n \phi(Y, f(X)) + \sup_{h \in \tilde{\phi} \circ F} \left( Eh - \hat{E}_n h \right) + E\phi(Y, 0) - \hat{E}_n \phi(Y, 0).$$
When an $(X_i, Y_i)$ pair changes, the random variable $\sup_{h \in \tilde{\phi} \circ F} \left( Eh - \hat{E}_n h \right)$ can change by no more than $2/n$. McDiarmid's inequality implies that with probability at least $1 - \delta/2$,
$$\sup_h \left( Eh - \hat{E}_n h \right) \leq E \sup_h \left( Eh - \hat{E}_n h \right) + \sqrt{2\ln(2/\delta)/n}.$$