Rademacher and Gaussian Complexities: Risk Bounds and Structural Results
Peter L. Bartlett and Shahar Mendelson
Journal of Machine Learning Research, Vol. 3, pp. 463-482
Journal of Machine Learning Research 3 (2002) 463-482    Submitted 11/01; Published 10/02
Rademacher and Gaussian Complexities:
Risk Bounds and Structural Results
Peter L. Bartlett Peter.Bartlett@anu.edu.au
Shahar Mendelson shahar@csl.anu.edu.au
Research School of Information Sciences and Engineering
Australian National University
Canberra 0200, Australia
Editor: Philip M. Long
Abstract
We investigate the use of certain data-dependent estimates of the complexity of a function class, called Rademacher and gaussian complexities. In a decision theoretic setting, we prove general risk bounds in terms of these complexities. We consider function classes that can be expressed as combinations of functions from basis classes and show how the Rademacher and gaussian complexities of such a function class can be bounded in terms of the complexity of the basis classes. We give examples of the application of these techniques in finding data-dependent risk bounds for decision trees, neural networks and support vector machines.
Keywords: Error Bounds, Data-Dependent Complexity, Rademacher Averages, Maximum Discrepancy
1. Introduction
In learning problems like pattern classification and regression, a considerable amount of effort has been spent on obtaining good error bounds. These are useful, for example, for the problem of model selection—choosing a model of suitable complexity. Typically, such bounds take the form of a sum of two terms: some sample-based estimate of performance and a penalty term that is large for more complex models. For example, in pattern classification, the following theorem is an improvement of a classical result of Vapnik and Chervonenkis (Vapnik and Chervonenkis, 1971).
Theorem 1 Let F be a class of {±1}-valued functions defined on a set X. Let P be a probability distribution on X × {±1}, and suppose that (X_1, Y_1), ..., (X_n, Y_n) and (X, Y) are chosen independently according to P. Then, there is an absolute constant c such that for any integer n, with probability at least 1 − δ over samples of length n, every f in F satisfies

$$P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + c\sqrt{\frac{\mathrm{VCdim}(F)}{n}},$$

where VCdim(F) denotes the Vapnik-Chervonenkis dimension of F,

$$\hat{P}_n(S) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_S(X_i, Y_i),$$

and 1_S is the indicator function of S.

© 2002 Peter L. Bartlett and Shahar Mendelson.
In this case, the sample-based estimate of performance is the proportion of examples in the training sample that are misclassified by the function f, and the complexity penalty term involves the VC-dimension of the class of functions. It is natural to use such bounds for the model selection scheme known as complexity regularization: choose the model class containing the function with the best upper bound on its error. The performance of such a model selection scheme critically depends on how well the error bounds match the true error (see Bartlett et al., 2002). There is theoretical and experimental evidence that error bounds involving a fixed complexity penalty (that is, a penalty that does not depend on the training data) cannot be universally effective (Kearns et al., 1997).
Recently, several authors have considered alternative notions of the complexity of a function class: the maximum discrepancy (Bartlett et al., 2002) and the Rademacher and gaussian complexities (see Bartlett et al., 2002, Koltchinskii, 2000, Koltchinskii and Panchenko, 2000a,b, Mendelson, 2001b).
Definition 2 Let µ be a probability distribution on a set X and suppose that X_1, ..., X_n are independent samples selected according to µ. Let F be a class of functions mapping from X to R. Define the maximum discrepancy of F as the random variable

$$\hat{D}_n(F) = \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^{n/2} f(X_i) - \frac{2}{n} \sum_{i=n/2+1}^{n} f(X_i) \right|.$$

Denote the expected maximum discrepancy of F by D_n(F) = E D̂_n(F).
Define the random variable

$$\hat{R}_n(F) = \mathbb{E}\left[ \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \right| \,\middle|\, X_1, \ldots, X_n \right],$$

where σ_1, ..., σ_n are independent uniform {±1}-valued random variables. Then the Rademacher complexity of F is R_n(F) = E R̂_n(F). Similarly, define the random variable

$$\hat{G}_n(F) = \mathbb{E}\left[ \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^{n} g_i f(X_i) \right| \,\middle|\, X_1, \ldots, X_n \right],$$

where g_1, ..., g_n are independent gaussian N(0, 1) random variables. The gaussian complexity of F is G_n(F) = E Ĝ_n(F).
All three quantities are intuitively reasonable as measures of complexity of the function class F: D̂_n(F) quantifies how much the behavior on half of the sample can be unrepresentative of the behavior on the other half, and both R_n(F) and G_n(F) quantify the extent to which some function in the class F can be correlated with a noise sequence of length n. The following two lemmas show that these complexity measures are closely related. The proof of the first is in Appendix A; the second is from (Tomczak-Jaegermann, 1989).
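For readers who want to experiment, the quantities in Definition 2 can be approximated numerically when F is a small finite class. The sketch below is not from the paper: it estimates R̂_n(F) and Ĝ_n(F) by Monte Carlo and computes D̂_n(F) exactly, for a made-up class of threshold functions, using NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_complexity(fvals, noise_sampler, n_trials=2000):
    """Monte Carlo estimate of E sup_{f in F} |(2/n) sum_i noise_i f(X_i)|,
    conditioned on the sample, for a finite class.
    fvals[j, i] holds f_j(X_i)."""
    num_f, n = fvals.shape
    total = 0.0
    for _ in range(n_trials):
        noise = noise_sampler(n)
        # sup over the finite class of |(2/n) sum_i noise_i f(X_i)|
        total += np.abs(fvals @ noise).max() * 2.0 / n
    return total / n_trials

n = 100
X = rng.uniform(-1.0, 1.0, size=n)
# a small made-up finite class: threshold functions x -> sign(x - t)
thresholds = np.linspace(-1.0, 1.0, 21)
fvals = np.sign(X[None, :] - thresholds[:, None])      # shape (21, n)

# hat{R}_n(F): Rademacher noise; hat{G}_n(F): gaussian noise
rademacher = empirical_complexity(fvals, lambda m: rng.choice([-1.0, 1.0], size=m))
gaussian = empirical_complexity(fvals, lambda m: rng.standard_normal(m))

# maximum discrepancy hat{D}_n(F) on this sample (n even)
half = n // 2
discrepancy = np.abs(fvals[:, :half].sum(axis=1)
                     - fvals[:, half:].sum(axis=1)).max() * 2.0 / n

print(f"R_n(F) ~ {rademacher:.3f}, G_n(F) ~ {gaussian:.3f}, D_n(F) ~ {discrepancy:.3f}")
```

All three estimates shrink at roughly the √(1/n) rate the bounds above suggest; the sup over the class is computed by brute force, which is exactly the optimization that Section 3's structural results let one avoid for richer classes.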

Lemma 3 Let F be a class of functions that map to [−1, 1]. Then for every integer n,

$$\frac{R_n(F)}{2} - 2\sqrt{\frac{2}{n}} \leq D_n(F) \leq R_n(F) + 4\sqrt{\frac{2}{n}}.$$

If F is closed under negation, the lower bound can be strengthened to

$$R_n(F) - 4\sqrt{\frac{2}{n}} \leq D_n(F).$$

Furthermore,

$$P\left\{ \left| \hat{D}_n(F) - D_n(F) \right| \geq \epsilon \right\} \leq 2 \exp\left( -\frac{\epsilon^2 n}{2} \right).$$
Lemma 4 There are absolute constants c and C such that for every class F and every integer n,

$$c\, R_n(F) \leq G_n(F) \leq C \ln n \, R_n(F).$$
The following theorem is an example of the usefulness of these notions of complexity. The proof of the first part is in (Bartlett et al., 2002). The proof of the second part is a slight refinement of a proof of a more general result which we give below (Theorem 8); it is presented in Appendix B.
Theorem 5 Let P be a probability distribution on X × {±1}, let F be a set of {±1}-valued functions defined on X, and let (X_i, Y_i)_{i=1}^n be training samples drawn according to P^n.
(a) With probability at least 1 − δ, every function f in F satisfies

$$P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + \hat{D}_n(F) + \sqrt{\frac{9 \ln(1/\delta)}{2n}}.$$

(b) With probability at least 1 − δ, every function f in F satisfies

$$P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + \frac{R_n(F)}{2} + \sqrt{\frac{\ln(1/\delta)}{2n}}.$$
The following result shows that this theorem implies the upper bound of Theorem 1 in terms of VC-dimension, as well as a refinement in terms of VC-entropy. In particular, Theorem 5 can never be much worse than the VC results. Since the proof of Theorem 5 is a close analog of the first step of the proof of VC-style results (the symmetrization step), this is not surprising. In fact, the bounds of Theorem 5 can be considerably better than Theorem 1, since the first part of the following result is in terms of the empirical VC-dimension.
Theorem 6 Fix a sample X_1, ..., X_n. For a function class F ⊆ {±1}^X, define the restriction of F to the sample as

$$F_{|X_1^n} = \{ (f(X_1), \ldots, f(X_n)) : f \in F \}.$$

Define the empirical VC-dimension of F as d = VCdim(F_{|X_1^n}) and the empirical VC-entropy of F as E = log_2 |F_{|X_1^n}|. Then

$$\hat{G}_n(F) = O\left(\sqrt{d/n}\right) \quad \text{and} \quad \hat{G}_n(F) = O\left(\sqrt{E/n}\right).$$
The proof of this theorem is based on an upper bound on Ĝ_n which is due to Dudley, together with an upper bound on covering numbers due to Haussler (see Mendelson, 2001a).
Koltchinskii and Panchenko (2000a) proved an analogous error bound in terms of margins. The margin of a real-valued function f on a labelled example (x, y) ∈ X × {±1} is yf(x). For a function h : X × Y → R and a training sample (X_1, Y_1), ..., (X_n, Y_n), we write

$$\hat{E}_n h(X, Y) = \frac{1}{n} \sum_{i=1}^{n} h(X_i, Y_i).$$
Theorem 7 Let P be a probability distribution on X × {±1} and let F be a set of real-valued functions defined on X, with sup{|f(x)| : f ∈ F} finite for all x ∈ X. Suppose that φ : R → [0, 1] satisfies φ(α) ≥ 1(α ≤ 0) and is Lipschitz with constant L. Then with probability at least 1 − δ with respect to training samples (X_i, Y_i)_{i=1}^n drawn according to P^n, every function in F satisfies

$$P(Yf(X) \leq 0) \leq \hat{E}_n \phi(Yf(X)) + 2L\, R_n(F) + \sqrt{\frac{\ln(2/\delta)}{2n}}.$$
This improves a number of results bounding error in terms of a sample average of a margin error plus a penalty term involving the complexity of the real-valued class (such as covering numbers and fat-shattering dimensions; see Bartlett, 1998, Mason et al., 2000, Schapire et al., 1998, Shawe-Taylor et al., 1998).
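A standard concrete choice of φ in Theorem 7 (an illustration, not one prescribed by the paper) is the ramp loss with margin parameter γ, which dominates the step function 1(α ≤ 0) and is Lipschitz with constant L = 1/γ. The sketch below computes the empirical term Ê_n φ(Y f(X)) for a made-up real-valued classifier f.

```python
import numpy as np

def ramp_loss(margins, gamma):
    """phi(alpha) = 1 for alpha <= 0, linear on (0, gamma), 0 for alpha >= gamma.
    Dominates 1(alpha <= 0) and is Lipschitz with constant 1/gamma."""
    return np.clip(1.0 - margins / gamma, 0.0, 1.0)

rng = np.random.default_rng(1)
X = rng.standard_normal(200)
Y = np.sign(X + 0.3 * rng.standard_normal(200))   # noisy labels in {-1, +1}
f = lambda x: np.tanh(2.0 * x)                    # a fixed, made-up classifier
margins = Y * f(X)                                # Y f(X) for each example

# the empirical term E_n phi(Y f(X)) appearing in Theorem 7
empirical_phi = ramp_loss(margins, gamma=0.5).mean()
print(f"empirical ramp loss at gamma=0.5: {empirical_phi:.3f}")
```

Because φ dominates the step function, the empirical ramp loss is always at least the empirical misclassification rate, so the right-hand side of Theorem 7 is a legitimate (if looser) surrogate for the 0-1 error bound.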
In the next section, we give a bound of this form that is applicable in a more general, decision-theoretic setting. Here, we have an input space X, an action space A and an output space Y. The n training examples (X_1, Y_1), ..., (X_n, Y_n) are selected independently according to a probability measure P on X × Y. There is a loss function L : Y × A → [0, 1], so that L(y, a) reflects the cost of taking a particular action a ∈ A when the outcome is y ∈ Y. The aim of learning is to choose a function f that maps from X to A, so as to minimize the expected loss EL(Y, f(X)).
For example, in multiclass classification, the output space Y is the space Y = {1, ..., k} of class labels. When using error correcting output codes (Kong and Dietterich, 1995, Schapire, 1997) for this problem, the action space might be A = [0, 1]^m, and for each y ∈ Y there is a codeword a_y ∈ A. The loss function L(y, a) is equal to 0 if the closest codeword a_{y′} has y′ = y and 1 otherwise.
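As an illustrative sketch (the codebook, its dimensions, and the use of Euclidean distance below are made up for the example, not taken from the cited works), this error correcting output code loss amounts to nearest-codeword decoding:

```python
import numpy as np

# made-up codebook: one codeword in [0,1]^m per class label y in {0, ..., k-1}
codewords = np.array([
    [1, 1, 1, 0, 0],   # class 0
    [0, 1, 0, 1, 1],   # class 1
    [1, 0, 0, 0, 1],   # class 2
], dtype=float)

def ecoc_loss(y, a):
    """L(y, a) = 0 if the codeword closest to action a belongs to class y, else 1."""
    distances = np.linalg.norm(codewords - a, axis=1)
    return 0.0 if int(np.argmin(distances)) == y else 1.0

a = np.array([0.9, 0.8, 0.7, 0.1, 0.2])  # an action in [0,1]^5, near class 0's codeword
print(ecoc_loss(0, a), ecoc_loss(1, a))
```

Note that L here is {0, 1}-valued, hence bounded in [0, 1] as the decision-theoretic setting requires, so the risk bounds of Section 2 apply directly.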
Section 2 gives bounds on the expected loss for decision-theoretic problems of this kind in terms of the sample average of a Lipschitz dominating cost function (a function that is pointwise larger than the loss function) plus a complexity penalty term involving a Rademacher complexity.
We also consider the problem of estimating R_n(F) and G_n(F) (for instance, for model selection). These quantities can be estimated by solving an optimization problem over F. However, for cases of practical interest, such optimization problems are difficult. On the other hand, in many such cases, functions in F can be represented as combinations of functions from simpler classes. This is the case, for instance, for decision trees, voting methods, and neural networks. In Section 3, we show how the complexity of such a class can be related to the complexity of the class of basis functions. Section 4 describes examples of the application of these techniques.
An earlier version of this paper appeared in COLT’01 (Bartlett and Mendelson, 2001).
2. Risk Bounds
We begin with some notation. Given an independent sample (X_i, Y_i)_{i=1}^n distributed as (X, Y), we denote by P_n the empirical measure supported on that sample and by µ_n the empirical measure supported on (X_i)_{i=1}^n. We say a function φ : Y × A → R dominates a loss function L if for all y ∈ Y and a ∈ A, φ(y, a) ≥ L(y, a). For a class of functions F, conv F is the class of convex combinations of functions from F, −F = {−f : f ∈ F}, absconv F is the class of convex combinations of functions from F ∪ −F, and cF = {cf : f ∈ F}. If φ is a function defined on the range of the functions in F, let φ ◦ F = {φ ◦ f : f ∈ F}. Given a set A, we denote its characteristic function by 1_A or 1(A). Finally, constants are denoted by C or c. Their values may change from line to line, or even within the same line.
Theorem 8 Consider a loss function L : Y × A → [0, 1] and a dominating cost function φ : Y × A → [0, 1]. Let F be a class of functions mapping from X to A and let (X_i, Y_i)_{i=1}^n be independently selected according to the probability measure P. Then, for any integer n and any 0 < δ < 1, with probability at least 1 − δ over samples of length n, every f in F satisfies

$$EL(Y, f(X)) \leq \hat{E}_n \phi(Y, f(X)) + R_n(\tilde{\phi} \circ F) + \sqrt{\frac{8 \ln(2/\delta)}{n}},$$

where

$$\tilde{\phi} \circ F = \{ (x, y) \mapsto \phi(y, f(x)) - \phi(y, 0) : f \in F \}.$$
The proof uses McDiarmid’s inequality (McDiarmid, 1989).
Theorem 9 (McDiarmid’s Inequality) Let X_1, ..., X_n be independent random variables taking values in a set A, and assume that f : A^n → R satisfies

$$\sup_{x_1, \ldots, x_n, x_i' \in A} \left| f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \right| \leq c_i$$

for every 1 ≤ i ≤ n. Then, for every t > 0,

$$P\left\{ f(X_1, \ldots, X_n) - E f(X_1, \ldots, X_n) \geq t \right\} \leq e^{-2t^2 / \sum_{i=1}^{n} c_i^2}.$$
Proof (of Theorem 8) Since φ dominates L, for all f ∈ F we can write

$$EL(Y, f(X)) \leq E\phi(Y, f(X)) \leq \hat{E}_n \phi(Y, f(X)) + \sup_{h \in \phi \circ F} \left( Eh - \hat{E}_n h \right)$$
$$= \hat{E}_n \phi(Y, f(X)) + \sup_{h \in \tilde{\phi} \circ F} \left( Eh - \hat{E}_n h \right) + E\phi(Y, 0) - \hat{E}_n \phi(Y, 0).$$

When an (X_i, Y_i) pair changes, the random variable sup_{h ∈ φ̃◦F} (Eh − Ê_n h) can change by no more than 2/n. McDiarmid’s inequality implies that with probability at least 1 − δ/2,

$$\sup_{h \in \tilde{\phi} \circ F} \left( Eh - \hat{E}_n h \right) \leq E \sup_{h \in \tilde{\phi} \circ F} \left( Eh - \hat{E}_n h \right) + \sqrt{\frac{2 \ln(2/\delta)}{n}}.$$
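For numerical intuition about the McDiarmid step used above, the inequality can be checked by simulation in the simplest case: the sample mean of independent Uniform[0, 1] variables, where each c_i = 1/n and the bound becomes exp(−2nt²). This sketch is illustrative only and is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, t = 50, 20000, 0.1

# f(X_1, ..., X_n) = sample mean of [0,1]-valued variables: each c_i = 1/n,
# so McDiarmid gives P{f - Ef >= t} <= exp(-2 t^2 / (n * (1/n)^2)) = exp(-2 n t^2)
samples = rng.uniform(0.0, 1.0, size=(trials, n))
deviations = samples.mean(axis=1) - 0.5      # E f = 0.5 for Uniform[0,1]
empirical = (deviations >= t).mean()
bound = np.exp(-2.0 * n * t**2)
print(f"empirical tail: {empirical:.4f}  McDiarmid bound: {bound:.4f}")
```

The empirical tail probability sits well below the bound, as expected: McDiarmid holds for every function with bounded differences, so for any specific f it is typically loose.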

References
Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
Vapnik, V. N. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.
Devroye, L., Györfi, L., and Lugosi, G. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
Schapire, R. E., Freund, Y., Bartlett, P., and Lee, W. S. Boosting the margin: a new explanation for the effectiveness of voting methods. In Proceedings of the Fourteenth International Conference on Machine Learning, 1997.
Schapire, R. E., Freund, Y., Bartlett, P., and Lee, W. S. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.