Journal of Machine Learning Research 3 (2002) 463-482 Submitted 11/01; Published 10/02
Rademacher and Gaussian Complexities:
Risk Bounds and Structural Results
Peter L. Bartlett Peter.Bartlett@anu.edu.au
Shahar Mendelson shahar@csl.anu.edu.au
Research School of Information Sciences and Engineering
Australian National University
Canberra 0200, Australia
Editor: Philip M. Long
Abstract
We investigate the use of certain data-dependent estimates of the complexity of a function class, called Rademacher and gaussian complexities. In a decision theoretic setting, we prove general risk bounds in terms of these complexities. We consider function classes that can be expressed as combinations of functions from basis classes and show how the Rademacher and gaussian complexities of such a function class can be bounded in terms of the complexity of the basis classes. We give examples of the application of these techniques in finding data-dependent risk bounds for decision trees, neural networks and support vector machines.
Keywords: Error Bounds, Data-Dependent Complexity, Rademacher Averages, Maximum Discrepancy
1. Introduction
In learning problems like pattern classification and regression, a considerable amount of effort has been spent on obtaining good error bounds. These are useful, for example, for the problem of model selection—choosing a model of suitable complexity. Typically, such bounds take the form of a sum of two terms: some sample-based estimate of performance and a penalty term that is large for more complex models. For example, in pattern classification, the following theorem is an improvement of a classical result of Vapnik and Chervonenkis (Vapnik and Chervonenkis, 1971).
Theorem 1 Let $F$ be a class of $\{\pm 1\}$-valued functions defined on a set $X$. Let $P$ be a probability distribution on $X \times \{\pm 1\}$, and suppose that $(X_1, Y_1), \ldots, (X_n, Y_n)$ and $(X, Y)$ are chosen independently according to $P$. Then, there is an absolute constant $c$ such that for any integer $n$, with probability at least $1 - \delta$ over samples of length $n$, every $f$ in $F$ satisfies
$$P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + c\sqrt{\frac{\mathrm{VCdim}(F)}{n}},$$
where $\mathrm{VCdim}(F)$ denotes the Vapnik-Chervonenkis dimension of $F$,
$$\hat{P}_n(S) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_S(X_i, Y_i),$$
and $\mathbf{1}_S$ is the indicator function of $S$.
© 2002 Peter L. Bartlett and Shahar Mendelson.
In this case, the sample-based estimate of performance is the proportion of examples in the training sample that are misclassified by the function $f$, and the complexity penalty term involves the VC-dimension of the class of functions. It is natural to use such bounds for the model selection scheme known as complexity regularization: choose the model class containing the function with the best upper bound on its error. The performance of such a model selection scheme critically depends on how well the error bounds match the true error (see Bartlett et al., 2002). There is theoretical and experimental evidence that error bounds involving a fixed complexity penalty (that is, a penalty that does not depend on the training data) cannot be universally effective (Kearns et al., 1997).
Recently, several authors have considered alternative notions of the complexity of a function class: the maximum discrepancy (Bartlett et al., 2002) and the Rademacher and gaussian complexities (see Bartlett et al., 2002, Koltchinskii, 2000, Koltchinskii and Panchenko, 2000a,b, Mendelson, 2001b).
Definition 2 Let $\mu$ be a probability distribution on a set $X$ and suppose that $X_1, \ldots, X_n$ are independent samples selected according to $\mu$. Let $F$ be a class of functions mapping from $X$ to $\mathbb{R}$. Define the maximum discrepancy of $F$ as the random variable
$$\hat{D}_n(F) = \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^{n/2} f(X_i) - \frac{2}{n} \sum_{i=n/2+1}^{n} f(X_i) \right|.$$
Denote the expected maximum discrepancy of $F$ by $D_n(F) = E\hat{D}_n(F)$.
Define the random variable
$$\hat{R}_n(F) = E\left[ \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \right| \,\middle|\, X_1, \ldots, X_n \right],$$
where $\sigma_1, \ldots, \sigma_n$ are independent uniform $\{\pm 1\}$-valued random variables. Then the Rademacher complexity of $F$ is $R_n(F) = E\hat{R}_n(F)$. Similarly, define the random variable
$$\hat{G}_n(F) = E\left[ \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^{n} g_i f(X_i) \right| \,\middle|\, X_1, \ldots, X_n \right],$$
where $g_1, \ldots, g_n$ are independent gaussian $N(0, 1)$ random variables. The gaussian complexity of $F$ is $G_n(F) = E\hat{G}_n(F)$.
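For intuition, these quantities can be approximated numerically when the class is finite. The sketch below (a hypothetical helper, not from the paper) estimates the empirical Rademacher complexity $\hat{R}_n(F)$ by Monte Carlo over draws of the noise $\sigma$, representing each function in $F$ by its vector of values on the sample:

```python
import numpy as np

def empirical_rademacher(fvals, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
    R_hat_n(F) = E_sigma [ sup_{f in F} | (2/n) sum_i sigma_i f(X_i) | ],
    where fvals is a (|F|, n) array whose row j holds f_j evaluated on the sample."""
    rng = np.random.default_rng(seed)
    m, n = fvals.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # uniform {+1, -1} noise sequence
        corr = np.abs(fvals @ sigma) * (2.0 / n)  # |(2/n) sum_i sigma_i f(X_i)| for each f
        total += corr.max()                       # sup over the (finite) class
    return total / n_draws
```

A class that cannot correlate with noise at all (e.g. the single zero function) has complexity 0, while richer classes score higher; this matches the interpretation of $R_n(F)$ given below.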
All three quantities are intuitively reasonable as measures of complexity of the function class $F$: $\hat{D}_n(F)$ quantifies how much the behavior on half of the sample can be unrepresentative of the behavior on the other half, and both $R_n(F)$ and $G_n(F)$ quantify the extent to which some function in the class $F$ can be correlated with a noise sequence of length $n$.
The following two lemmas show that these complexity measures are closely related. The proof of the first is in Appendix A; the second is from (Tomczak-Jaegermann, 1989).
Lemma 3 Let $F$ be a class of functions that map to $[-1, 1]$. Then for every integer $n$,
$$\frac{R_n(F)}{2} - 2\sqrt{\frac{2}{n}} \leq D_n(F) \leq R_n(F) + 4\sqrt{\frac{2}{n}}.$$
If $F$ is closed under negation, the lower bound can be strengthened to
$$R_n(F) - 4\sqrt{\frac{2}{n}} \leq D_n(F).$$
Furthermore,
$$P^n\left\{ \left| \hat{D}_n(F) - D_n(F) \right| \geq \epsilon \right\} \leq 2\exp\left(-\frac{\epsilon^2 n}{2}\right).$$
Lemma 4 There are absolute constants $c$ and $C$ such that for every class $F$ and every integer $n$,
$$c R_n(F) \leq G_n(F) \leq C \ln n \, R_n(F).$$
The following theorem is an example of the usefulness of these notions of complexity. The proof of the first part is in (Bartlett et al., 2002). The proof of the second part is a slight refinement of a proof of a more general result which we give below (Theorem 8); it is presented in Appendix B.
Theorem 5 Let $P$ be a probability distribution on $X \times \{\pm 1\}$, let $F$ be a set of $\{\pm 1\}$-valued functions defined on $X$, and let $(X_i, Y_i)_{i=1}^n$ be training samples drawn according to $P^n$.
(a) With probability at least $1 - \delta$, every function $f$ in $F$ satisfies
$$P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + \hat{D}_n(F) + \sqrt{\frac{9\ln(1/\delta)}{2n}}.$$
(b) With probability at least $1 - \delta$, every function $f$ in $F$ satisfies
$$P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + \frac{R_n(F)}{2} + \sqrt{\frac{\ln(1/\delta)}{2n}}.$$
The following result shows that this theorem implies the upper bound of Theorem 1 in terms of VC-dimension, as well as a refinement in terms of VC-entropy. In particular, Theorem 5 can never be much worse than the VC results. Since the proof of Theorem 5 is a close analog of the first step of the proof of VC-style results (the symmetrization step), this is not surprising. In fact, the bounds of Theorem 5 can be considerably better than Theorem 1, since the first part of the following result is in terms of the empirical VC-dimension.
Theorem 6 Fix a sample $X_1, \ldots, X_n$. For a function class $F \subseteq \{\pm 1\}^X$, define the restriction of $F$ to the sample as
$$F_{|X} = \{ (f(X_1), \ldots, f(X_n)) : f \in F \}.$$
Define the empirical VC-dimension of $F$ as $d = \mathrm{VCdim}(F_{|X})$ and the empirical VC-entropy of $F$ as $E = \log_2 |F_{|X}|$. Then
$$\hat{G}_n(F) = O\left(\sqrt{d/n}\right) \quad \text{and} \quad \hat{G}_n(F) = O\left(\sqrt{E/n}\right).$$
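For a finite class represented by its behaviour vectors on the sample, the empirical VC-entropy appearing in Theorem 6 can be computed directly, since it only counts distinct restrictions. A minimal sketch (hypothetical helper, with `fvals` holding one row per function):

```python
import math

def empirical_vc_entropy(fvals):
    """Empirical VC-entropy E = log2 |F_{|X}|: the number of distinct
    behaviour vectors (f(X_1), ..., f(X_n)) realised by the class on the sample."""
    behaviours = {tuple(row) for row in fvals}
    return math.log2(len(behaviours))

# Four ±1-valued functions on a sample of size 3, two of them identical on it:
fvals = [(1, 1, 1), (1, -1, 1), (-1, -1, -1), (1, 1, 1)]
print(empirical_vc_entropy(fvals))  # log2(3) ≈ 1.585
```

Note the deduplication: only the behaviours on the sample matter, which is what makes the bound data-dependent.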
The proof of this theorem is based on an upper bound on $\hat{G}_n$ which is due to Dudley, together with an upper bound on covering numbers due to Haussler (see Mendelson, 2001a).
Koltchinskii and Panchenko (2000a) proved an analogous error bound in terms of margins. The margin of a real-valued function $f$ on a labelled example $(x, y) \in X \times \{\pm 1\}$ is $yf(x)$. For a function $h : X \times Y \to \mathbb{R}$ and a training sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, we write
$$\hat{E}_n h(X, Y) = \frac{1}{n} \sum_{i=1}^{n} h(X_i, Y_i).$$
Theorem 7 Let $P$ be a probability distribution on $X \times \{\pm 1\}$ and let $F$ be a set of real-valued functions defined on $X$, with $\sup\{|f(x)| : f \in F\}$ finite for all $x \in X$. Suppose that $\phi : \mathbb{R} \to [0, 1]$ satisfies $\phi(\alpha) \geq \mathbf{1}(\alpha \leq 0)$ and is Lipschitz with constant $L$. Then with probability at least $1 - \delta$ with respect to training samples $(X_i, Y_i)_{i=1}^n$ drawn according to $P^n$, every function in $F$ satisfies
$$P(Yf(X) \leq 0) \leq \hat{E}_n \phi(Yf(X)) + 2L R_n(F) + \sqrt{\frac{\ln(2/\delta)}{2n}}.$$
This improves a number of results bounding error in terms of a sample average of a margin error plus a penalty term involving the complexity of the real-valued class (such as covering numbers and fat-shattering dimensions; see Bartlett, 1998, Mason et al., 2000, Schapire et al., 1998, Shawe-Taylor et al., 1998).
In the next section, we give a bound of this form that is applicable in a more general, decision-theoretic setting. Here, we have an input space $X$, an action space $A$ and an output space $Y$. The $n$ training examples $(X_1, Y_1), \ldots, (X_n, Y_n)$ are selected independently according to a probability measure $P$ on $X \times Y$. There is a loss function $L : Y \times A \to [0, 1]$, so that $L(y, a)$ reflects the cost of taking a particular action $a \in A$ when the outcome is $y \in Y$. The aim of learning is to choose a function $f$ that maps from $X$ to $A$, so as to minimize the expected loss $EL(Y, f(X))$.
For example, in multiclass classification, the output space is the set $Y = \{1, \ldots, k\}$ of class labels. When using error correcting output codes (Kong and Dietterich, 1995, Schapire, 1997) for this problem, the action space might be $A = [0, 1]^m$, and for each $y \in Y$ there is a codeword $a_y \in A$. The loss function $L(y, a)$ is equal to 0 if the codeword $a_{y'}$ closest to $a$ has $y' = y$, and 1 otherwise.
Section 2 gives bounds on the expected loss for decision-theoretic problems of this kind in terms of the sample average of a Lipschitz dominating cost function (a function that is pointwise larger than the loss function) plus a complexity penalty term involving a Rademacher complexity.
We also consider the problem of estimating $R_n(F)$ and $G_n(F)$ (for instance, for model selection). These quantities can be estimated by solving an optimization problem over $F$. However, for cases of practical interest, such optimization problems are difficult. On the other hand, in many such cases, functions in $F$ can be represented as combinations of functions from simpler classes. This is the case, for instance, for decision trees, voting methods, and neural networks. In Section 3, we show how the complexity of such a class can be related to the complexity of the class of basis functions. Section 4 describes examples of the application of these techniques.
An earlier version of this paper appeared in COLT'01 (Bartlett and Mendelson, 2001).
2. Risk Bounds
We begin with some notation. Given an independent sample $(X_i, Y_i)_{i=1}^n$ distributed as $(X, Y)$, we denote by $P_n$ the empirical measure supported on that sample and by $\mu_n$ the empirical measure supported on $(X_i)_{i=1}^n$. We say a function $\phi : Y \times A \to \mathbb{R}$ dominates a loss function $L$ if for all $y \in Y$ and $a \in A$, $\phi(y, a) \geq L(y, a)$. For a class of functions $F$, $\mathrm{conv}\,F$ is the class of convex combinations of functions from $F$, $-F = \{-f : f \in F\}$, $\mathrm{absconv}\,F$ is the class of convex combinations of functions from $F \cup -F$, and $cF = \{cf : f \in F\}$. If $\phi$ is a function defined on the range of the functions in $F$, let $\phi \circ F = \{\phi \circ f \mid f \in F\}$. Given a set $A$, we denote its characteristic function by $\mathbf{1}_A$ or $\mathbf{1}(A)$. Finally, constants are denoted by $C$ or $c$; their values may change from line to line, or even within the same line.
Theorem 8 Consider a loss function $L : Y \times A \to [0, 1]$ and a dominating cost function $\phi : Y \times A \to [0, 1]$. Let $F$ be a class of functions mapping from $X$ to $A$ and let $(X_i, Y_i)_{i=1}^n$ be independently selected according to the probability measure $P$. Then, for any integer $n$ and any $0 < \delta < 1$, with probability at least $1 - \delta$ over samples of length $n$, every $f$ in $F$ satisfies
$$EL(Y, f(X)) \leq \hat{E}_n \phi(Y, f(X)) + R_n(\tilde{\phi} \circ F) + \sqrt{\frac{8\ln(2/\delta)}{n}},$$
where $\tilde{\phi} \circ F = \{(x, y) \mapsto \phi(y, f(x)) - \phi(y, 0) : f \in F\}$.
The proof uses McDiarmid’s inequality (McDiarmid, 1989).
Theorem 9 (McDiarmid's Inequality) Let $X_1, \ldots, X_n$ be independent random variables taking values in a set $A$, and assume that $f : A^n \to \mathbb{R}$ satisfies
$$\sup_{x_1, \ldots, x_n, x_i' \in A} \left| f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \right| \leq c_i$$
for every $1 \leq i \leq n$. Then, for every $t > 0$,
$$P\left\{ f(X_1, \ldots, X_n) - Ef(X_1, \ldots, X_n) \geq t \right\} \leq e^{-2t^2 / \sum_{i=1}^n c_i^2}.$$
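As a quick numeric sanity check on the constants, the bound can be evaluated directly. A minimal sketch (hypothetical helper) applying it to the empirical mean of $n$ independent $[0, 1]$-valued variables, for which each $c_i = 1/n$ and the bound reduces to Hoeffding's $e^{-2nt^2}$:

```python
import math

def mcdiarmid_bound(c, t):
    """Upper bound P{f - Ef >= t} <= exp(-2 t^2 / sum_i c_i^2) from
    McDiarmid's inequality, given the bounded-difference constants c_i."""
    return math.exp(-2.0 * t ** 2 / sum(ci ** 2 for ci in c))

# Empirical mean of n independent [0, 1]-valued variables: each c_i = 1/n,
# so the bound becomes exp(-2 n t^2), recovering Hoeffding's inequality.
n, t = 100, 0.1
print(mcdiarmid_bound([1.0 / n] * n, t))  # exp(-2) ≈ 0.135
```

The same helper gives the $2/n$ bounded-difference case used in the proof of Theorem 8, where the supremum term changes by at most $2/n$ when one sample point is replaced.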
Proof (of Theorem 8) Since $\phi$ dominates $L$, for all $f \in F$ we can write
$$EL(Y, f(X)) \leq E\phi(Y, f(X)) \leq \hat{E}_n \phi(Y, f(X)) + \sup_{h \in \phi \circ F} \left( Eh - \hat{E}_n h \right)$$
$$= \hat{E}_n \phi(Y, f(X)) + \sup_{h \in \tilde{\phi} \circ F} \left( Eh - \hat{E}_n h \right) + E\phi(Y, 0) - \hat{E}_n \phi(Y, 0).$$
When an $(X_i, Y_i)$ pair changes, the random variable $\sup_{h \in \tilde{\phi} \circ F} \left( Eh - \hat{E}_n h \right)$ can change by no more than $2/n$. McDiarmid's inequality implies that with probability at least $1 - \delta/2$,
$$\sup_h \left( Eh - \hat{E}_n h \right) \leq E \sup_h \left( Eh - \hat{E}_n h \right) + \sqrt{2\ln(2/\delta)/n}.$$