
Journal of Machine Learning Research 3 (2002) 463-482 Submitted 11/01; Published 10/02
Rademacher and Gaussian Complexities:
Risk Bounds and Structural Results
Peter L. Bartlett Peter.Bartlett@anu.edu.au
Shahar Mendelson shahar@csl.anu.edu.au
Research School of Information Sciences and Engineering
Australian National University
Canberra 0200, Australia
Editor: Philip M. Long
Abstract
We investigate the use of certain data-dependent estimates of the complexity of a function
class, called Rademacher and gaussian complexities. In a decision theoretic setting, we
prove general risk bounds in terms of these complexities. We consider function classes
that can be expressed as combinations of functions from basis classes and show how the
Rademacher and gaussian complexities of such a function class can be bounded in terms of
the complexity of the basis classes. We give examples of the application of these techniques
in finding data-dependent risk bounds for decision trees, neural networks and support vector
machines.
Keywords: Error Bounds, Data-Dependent Complexity, Rademacher Averages, Maximum Discrepancy
1. Introduction
In learning problems like pattern classification and regression, a considerable amount of
effort has been spent on obtaining good error bounds. These are useful, for example, for
the problem of model selection—choosing a model of suitable complexity. Typically, such
bounds take the form of a sum of two terms: some sample-based estimate of performance
and a penalty term that is large for more complex models. For example, in pattern classification, the following theorem is an improvement of a classical result of Vapnik and
Chervonenkis (Vapnik and Chervonenkis, 1971).
Theorem 1 Let F be a class of {±1}-valued functions defined on a set X. Let P be a
probability distribution on X × {±1}, and suppose that (X_1, Y_1), ..., (X_n, Y_n) and (X, Y)
are chosen independently according to P. Then, there is an absolute constant c such that
for any integer n, with probability at least 1 − δ over samples of length n, every f in F
satisfies

\[
P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + c\sqrt{\frac{\mathrm{VCdim}(F)}{n}},
\]

where VCdim(F) denotes the Vapnik-Chervonenkis dimension of F,

\[
\hat{P}_n(S) = \frac{1}{n}\sum_{i=1}^{n} 1_S(X_i, Y_i),
\]
© 2002 Peter L. Bartlett and Shahar Mendelson.

and 1_S is the indicator function of S.
In this case, the sample-based estimate of performance is the proportion of examples
in the training sample that are misclassified by the function f, and the complexity penalty
term involves the VC-dimension of the class of functions. It is natural to use such bounds
for the model selection scheme known as complexity regularization: choose the model class
containing the function with the best upper bound on its error. The performance of such
a model selection scheme critically depends on how well the error bounds match the true
error (see Bartlett et al., 2002). There is theoretical and experimental evidence that error
bounds involving a fixed complexity penalty (that is, a penalty that does not depend on
the training data) cannot be universally effective (Kearns et al., 1997).
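For concreteness, the right-hand side of Theorem 1 can be evaluated from a sample. The sketch below is illustrative only: the helper name, the 0/1 loss sequence, and the choice c = 1 (the theorem leaves the absolute constant unspecified) are all assumptions, not part of the paper.

```python
import math

def vc_bound(losses, vc_dim, c=1.0):
    """Empirical error plus the penalty c * sqrt(VCdim(F)/n) of Theorem 1.

    `losses` is the 0/1 misclassification sequence 1[Y_i != f(X_i)];
    c = 1.0 stands in for the unspecified absolute constant."""
    n = len(losses)
    empirical = sum(losses) / n
    return empirical + c * math.sqrt(vc_dim / n)

# Hypothetical run: 20 mistakes on 200 examples, for a class of threshold
# functions on the real line (VC-dimension 1).
losses = [1] * 20 + [0] * 180
bound = vc_bound(losses, vc_dim=1)
```

The penalty term does not depend on the training data, which is exactly the limitation the data-dependent complexities below address.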
Recently, several authors have considered alternative notions of the complexity of a function class: the maximum discrepancy (Bartlett et al., 2002) and the Rademacher and gaussian complexities (see Bartlett et al., 2002, Koltchinskii, 2000, Koltchinskii and Panchenko,
2000a,b, Mendelson, 2001b).
Definition 2 Let µ be a probability distribution on a set X and suppose that X_1, ..., X_n
are independent samples selected according to µ. Let F be a class of functions mapping from
X to R. Define the maximum discrepancy of F as the random variable

\[
\hat{D}_n(F) = \sup_{f \in F} \left( \frac{2}{n} \sum_{i=1}^{n/2} f(X_i) - \frac{2}{n} \sum_{i=n/2+1}^{n} f(X_i) \right).
\]

Denote the expected maximum discrepancy of F by D_n(F) = E D̂_n(F).
Define the random variable

\[
\hat{R}_n(F) = E\left[ \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \right| \,\middle|\, X_1, \ldots, X_n \right],
\]

where σ_1, ..., σ_n are independent uniform {±1}-valued random variables. Then the Rademacher
complexity of F is R_n(F) = E R̂_n(F). Similarly, define the random variable

\[
\hat{G}_n(F) = E\left[ \sup_{f \in F} \left| \frac{2}{n} \sum_{i=1}^{n} g_i f(X_i) \right| \,\middle|\, X_1, \ldots, X_n \right],
\]

where g_1, ..., g_n are independent gaussian N(0, 1) random variables. The gaussian complexity of F is G_n(F) = E Ĝ_n(F).
All three quantities are intuitively reasonable as measures of complexity of the function
class F: D̂_n(F) quantifies how much the behavior on half of the sample can be unrepresentative of the behavior on the other half, and both R_n(F) and G_n(F) quantify the extent
to which some function in the class F can be correlated with a noise sequence of length n.
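Since R̂_n(F) is a conditional expectation over the random signs only, it can be approximated by Monte Carlo once the sample is fixed, at least for a finite class. The sketch below is a hypothetical illustration (the class, sample values, and trial count are assumptions, not from the paper); replacing the ±1 draws with `rng.gauss(0, 1)` would estimate Ĝ_n(F) instead.

```python
import random

def empirical_rademacher(values, trials=2000, seed=0):
    """Monte Carlo estimate of R_hat_n(F) = E[ sup_f |(2/n) sum_i s_i f(X_i)| ]
    given the sample; `values` holds one row (f(X_1), ..., f(X_n)) per f."""
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(abs((2.0 / n) * sum(s * v for s, v in zip(sigma, row)))
                     for row in values)
    return total / trials

# Hypothetical two-function class evaluated on a sample of size 8.
F_values = [(1, -1, 1, 1, -1, 1, -1, -1),
            (1, 1, 1, 1, 1, 1, 1, 1)]
r_hat = empirical_rademacher(F_values)
```

For functions bounded by 1 the estimate always lies in [0, 2], since each supremum term is at most 2.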
The following two lemmas show that these complexity measures are closely related. The
proof of the first is in Appendix A; the second is from (Tomczak-Jaegermann, 1989).
Lemma 3 Let F be a class of functions that map to [−1, 1]. Then for every integer n,

\[
\frac{R_n(F)}{2} - 2\sqrt{\frac{2}{n}} \leq D_n(F) \leq R_n(F) + 4\sqrt{\frac{2}{n}}.
\]

If F is closed under negation, the lower bound can be strengthened to

\[
R_n(F) - 4\sqrt{\frac{2}{n}} \leq D_n(F).
\]

Furthermore,

\[
P^n\left\{ \left| \hat{D}_n(F) - D_n(F) \right| \geq \epsilon \right\} \leq 2\exp\left( -\frac{\epsilon^2 n}{2} \right).
\]
Lemma 4 There are absolute constants c and C such that for every class F and every
integer n,

\[
c R_n(F) \leq G_n(F) \leq C \ln n \, R_n(F).
\]
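Unlike the Rademacher and gaussian complexities, D̂_n(F) requires no randomization once the half-sample split is fixed, so for a finite class on an even-length sample it can be computed exactly. A minimal sketch (the class values below are hypothetical; note the class is closed under negation, as in the strengthened lower bound of Lemma 3):

```python
def max_discrepancy(values):
    """D_hat_n(F): the worst-case gap between the two half-sample averages,
    (2/n) * (sum_{i <= n/2} f(X_i) - sum_{i > n/2} f(X_i)), over a finite
    class; `values` holds one row (f(X_1), ..., f(X_n)) per function f."""
    n = len(values[0])
    half = n // 2
    return max((2.0 / n) * (sum(row[:half]) - sum(row[half:]))
               for row in values)

# A class closed under negation, {f, -f}, on a sample of size 4.
values = [(1, 1, -1, -1), (-1, -1, 1, 1)]
d_hat = max_discrepancy(values)
```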
The following theorem is an example of the usefulness of these notions of complexity.
The proof of the first part is in (Bartlett et al., 2002). The proof of the second part is a
slight refinement of a proof of a more general result which we give below (Theorem 8); it is
presented in Appendix B.
Theorem 5 Let P be a probability distribution on X × {±1}, let F be a set of {±1}-valued
functions defined on X, and let (X_i, Y_i)_{i=1}^n be training samples drawn according to P^n.

(a) With probability at least 1 − δ, every function f in F satisfies

\[
P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + \hat{D}_n(F) + \sqrt{\frac{9 \ln(1/\delta)}{2n}}.
\]

(b) With probability at least 1 − δ, every function f in F satisfies

\[
P(Y \neq f(X)) \leq \hat{P}_n(Y \neq f(X)) + \frac{R_n(F)}{2} + \sqrt{\frac{\ln(1/\delta)}{2n}}.
\]
The following result shows that this theorem implies the upper bound of Theorem 1
in terms of VC-dimension, as well as a refinement in terms of VC-entropy. In particular,
Theorem 5 can never be much worse than the VC results. Since the proof of Theorem 5 is
a close analog of the first step of the proof of VC-style results (the symmetrization step),
this is not surprising. In fact, the bounds of Theorem 5 can be considerably better than
Theorem 1, since the first part of the following result is in terms of the empirical VC-dimension.
Theorem 6 Fix a sample X_1, ..., X_n. For a function class F ⊆ {±1}^X, define the restriction of F to the sample as

\[
F_{|X_i} = \{ (f(X_1), \ldots, f(X_n)) : f \in F \}.
\]

Define the empirical VC-dimension of F as d = VCdim(F_{|X_i}) and the empirical VC-entropy
of F as E = log_2 |F_{|X_i}|. Then

\[
\hat{G}_n(F) = O\left( \sqrt{d/n} \right) \quad \text{and} \quad \hat{G}_n(F) = O\left( \sqrt{E/n} \right).
\]
The proof of this theorem is based on an upper bound on Ĝ_n which is due to Dudley, together
with an upper bound on covering numbers due to Haussler (see Mendelson, 2001a).
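For a finite (sub)class, the empirical VC-entropy of Theorem 6 can be computed directly by enumerating the restriction of the class to the sample. The threshold class and grid below are hypothetical, chosen only to make the count easy to verify by hand:

```python
import math

def empirical_vc_entropy(functions, sample):
    """E = log2 |F_{|X}|: the number of distinct sign patterns
    (f(X_1), ..., f(X_n)) realised by the class on the sample."""
    patterns = {tuple(f(x) for x in sample) for f in functions}
    return math.log2(len(patterns))

# Hypothetical class: thresholds f_t(x) = sign(x - t) over a small grid of t.
thresholds = [0.0, 0.5, 1.5, 2.5, 10.0]
functions = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]
sample = [0.2, 1.0, 2.0]
entropy = empirical_vc_entropy(functions, sample)
```

Here two of the five thresholds induce the same pattern on the sample, so only four distinct restrictions remain and E = 2.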
Koltchinskii and Panchenko (2000a) proved an analogous error bound in terms of margins. The margin of a real-valued function f on a labelled example (x, y) ∈ X × {±1} is
yf(x). For a function h : X × Y → R and a training sample (X_1, Y_1), ..., (X_n, Y_n), we write

\[
\hat{E}_n h(X, Y) = \frac{1}{n} \sum_{i=1}^{n} h(X_i, Y_i).
\]
Theorem 7 Let P be a probability distribution on X × {±1} and let F be a set of real-valued functions defined on X, with sup{|f(x)| : f ∈ F} finite for all x ∈ X. Suppose
that φ : R → [0, 1] satisfies φ(α) ≥ 1(α ≤ 0) and is Lipschitz with constant L. Then with
probability at least 1 − δ with respect to training samples (X_i, Y_i)_{i=1}^n drawn according to P^n,
every function in F satisfies

\[
P(Yf(X) \leq 0) \leq \hat{E}_n \phi(Yf(X)) + 2L R_n(F) + \sqrt{\frac{\ln(2/\delta)}{2n}}.
\]
This improves a number of results bounding error in terms of a sample average of a
margin error plus a penalty term involving the complexity of the real-valued class (such
as covering numbers and fat-shattering dimensions; see Bartlett, 1998, Mason et al., 2000,
Schapire et al., 1998, Shawe-Taylor et al., 1998).
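A standard choice of φ in Theorem 7 (not fixed by the theorem itself) is the ramp function φ_γ(α) = min(1, max(0, 1 − α/γ)), which dominates the indicator 1(α ≤ 0), takes values in [0, 1], and is Lipschitz with constant L = 1/γ. The sketch below assembles the three terms of the bound; the helper names, the margins, and the Rademacher estimate `r_n` are illustrative assumptions.

```python
import math

def ramp(alpha, gamma=1.0):
    """phi_gamma(alpha) = min(1, max(0, 1 - alpha/gamma)): dominates
    1(alpha <= 0), maps into [0, 1], and is (1/gamma)-Lipschitz."""
    return min(1.0, max(0.0, 1.0 - alpha / gamma))

def margin_bound(margins, r_n, gamma, delta):
    """Right-hand side of Theorem 7 with phi = ramp: empirical margin loss
    + 2 L R_n(F) + sqrt(ln(2/delta) / (2n)), where L = 1/gamma and r_n is
    an assumed estimate of R_n(F)."""
    n = len(margins)
    empirical = sum(ramp(m, gamma) for m in margins) / n
    return (empirical + 2 * (1.0 / gamma) * r_n
            + math.sqrt(math.log(2 / delta) / (2 * n)))

# Hypothetical margins Y_i f(X_i) and an assumed Rademacher estimate.
margins = [1.5, 0.8, -0.2, 2.0, 0.4]
b = margin_bound(margins, r_n=0.1, gamma=1.0, delta=0.05)
```

Shrinking γ tightens the empirical term but inflates the Lipschitz penalty 2R_n(F)/γ, the usual margin trade-off.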
In the next section, we give a bound of this form that is applicable in a more general,
decision-theoretic setting. Here, we have an input space X, an action space A and an
output space Y. The n training examples (X_1, Y_1), ..., (X_n, Y_n) are selected independently
according to a probability measure P on X × Y. There is a loss function L : Y × A → [0, 1],
so that L(y, a) reflects the cost of taking a particular action a ∈ A when the outcome is
y ∈ Y. The aim of learning is to choose a function f that maps from X to A, so as to
minimize the expected loss EL(Y, f(X)).
For example, in multiclass classification, the output space Y is the space Y = {1, ..., k}
of class labels. When using error correcting output codes (Kong and Dietterich, 1995,
Schapire, 1997) for this problem, the action space might be A = [0, 1]^m, and for each y ∈ Y
there is a codeword a_y ∈ A. The loss function L(y, a) is equal to 0 if the closest codeword
a_{y′} has y′ = y and 1 otherwise.
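This error-correcting-output-code loss can be sketched concretely. The code table and distance below are hypothetical (the text does not fix a metric; squared Euclidean distance is assumed here, with ties broken by label order):

```python
def ecoc_loss(y, action, codewords):
    """L(y, a) = 0 if the codeword closest to the action a belongs to class y,
    and 1 otherwise. `codewords` maps each label y to a_y in [0, 1]^m."""
    nearest = min(codewords,
                  key=lambda label: sum((ai - ci) ** 2
                                        for ai, ci in zip(action, codewords[label])))
    return 0 if nearest == y else 1

# Hypothetical 3-class code with m = 4 bits per codeword.
codes = {1: (0, 0, 0, 0), 2: (1, 1, 0, 0), 3: (1, 1, 1, 1)}
loss_correct = ecoc_loss(1, (0.1, 0.1, 0.0, 0.0), codes)  # nearest codeword is a_1
loss_wrong = ecoc_loss(3, (0.1, 0.1, 0.0, 0.0), codes)    # nearest is still a_1
```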
Section 2 gives bounds on the expected loss for decision-theoretic problems of this kind in
terms of the sample average of a Lipschitz dominating cost function (a function that is pointwise larger than the loss function) plus a complexity penalty term involving a Rademacher
complexity.
We also consider the problem of estimating R_n(F) and G_n(F) (for instance, for model
selection). These quantities can be estimated by solving an optimization problem over
F. However, for cases of practical interest, such optimization problems are difficult. On
the other hand, in many such cases, functions in F can be represented as combinations
of functions from simpler classes. This is the case, for instance, for decision trees, voting
methods, and neural networks. In Section 3, we show how the complexity of such a class can
be related to the complexity of the class of basis functions. Section 4 describes examples of
the application of these techniques.
An earlier version of this paper appeared in COLT'01 (Bartlett and Mendelson, 2001).
2. Risk Bounds
We begin with some notation. Given an independent sample (X_i, Y_i)_{i=1}^n distributed as
(X, Y), we denote by P_n the empirical measure supported on that sample and by µ_n the
empirical measure supported on (X_i)_{i=1}^n. We say a function φ : Y × A → R dominates a loss
function L if for all y ∈ Y and a ∈ A, φ(y, a) ≥ L(y, a). For a class of functions F, conv F
is the class of convex combinations of functions from F, −F = {−f : f ∈ F}, absconv F is
the class of convex combinations of functions from F ∪ −F, and cF = {cf : f ∈ F}. If φ is
a function defined on the range of the functions in F, let φ ∘ F = {φ ∘ f : f ∈ F}. Given a
set A, we denote its characteristic function by 1_A or 1(A). Finally, constants are denoted
by C or c. Their values may change from line to line, or even within the same line.
Theorem 8 Consider a loss function L : Y × A → [0, 1] and a dominating cost function
φ : Y × A → [0, 1]. Let F be a class of functions mapping from X to A and let (X_i, Y_i)_{i=1}^n
be independently selected according to the probability measure P. Then, for any integer n
and any 0 < δ < 1, with probability at least 1 − δ over samples of length n, every f in F
satisfies

\[
EL(Y, f(X)) \leq \hat{E}_n \phi(Y, f(X)) + R_n(\tilde{\phi} \circ F) + \sqrt{\frac{8 \ln(2/\delta)}{n}},
\]

where φ̃ ∘ F = {(x, y) ↦ φ(y, f(x)) − φ(y, 0) : f ∈ F}.
The proof uses McDiarmid’s inequality (McDiarmid, 1989).
Theorem 9 (McDiarmid's Inequality) Let X_1, ..., X_n be independent random variables
taking values in a set A, and assume that f : A^n → R satisfies

\[
\sup_{x_1, \ldots, x_n, x_i' \in A} \left| f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \right| \leq c_i
\]

for every 1 ≤ i ≤ n. Then, for every t > 0,

\[
P\left\{ f(X_1, \ldots, X_n) - Ef(X_1, \ldots, X_n) \geq t \right\} \leq e^{-2t^2 / \sum_{i=1}^{n} c_i^2}.
\]
Proof (of Theorem 8) Since φ dominates L, for all f ∈ F we can write

\[
\begin{aligned}
EL(Y, f(X)) &\leq E\phi(Y, f(X)) \\
&\leq \hat{E}_n \phi(Y, f(X)) + \sup_{h \in \phi \circ F} \left( Eh - \hat{E}_n h \right) \\
&= \hat{E}_n \phi(Y, f(X)) + \sup_{h \in \tilde{\phi} \circ F} \left( Eh - \hat{E}_n h \right) + E\phi(Y, 0) - \hat{E}_n \phi(Y, 0).
\end{aligned}
\]

When an (X_i, Y_i) pair changes, the random variable sup_{h ∈ φ̃∘F} (Eh − Ê_n h) can change by
no more than 2/n. McDiarmid's inequality implies that with probability at least 1 − δ/2,

\[
\sup_h \left( Eh - \hat{E}_n h \right) \leq E \sup_h \left( Eh - \hat{E}_n h \right) + \sqrt{2 \ln(2/\delta)/n}.
\]
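To see McDiarmid's inequality in action outside this proof, take f to be the empirical mean of [0, 1]-valued variables: each c_i = 1/n and the bound specializes to Hoeffding's exp(−2t²n). The simulation below is purely illustrative (the seed, sample size, deviation t, and trial count are arbitrary choices, not from the paper):

```python
import math
import random

def mcdiarmid_bound(t, c):
    """P{ f - Ef >= t } <= exp(-2 t^2 / sum_i c_i^2) (Theorem 9)."""
    return math.exp(-2 * t ** 2 / sum(ci ** 2 for ci in c))

rng = random.Random(0)
n, t, trials = 100, 0.2, 2000
# The empirical mean of n variables in [0, 1] changes by at most 1/n when one
# coordinate changes, so c_i = 1/n and the bound equals exp(-2 t^2 n).
bound = mcdiarmid_bound(t, [1.0 / n] * n)
# Frequency with which the mean of n uniform [0, 1] draws exceeds its
# expectation 1/2 by t, over repeated trials.
exceed_freq = sum(
    sum(rng.random() for _ in range(n)) / n - 0.5 >= t for _ in range(trials)
) / trials
```

With these parameters the bound is exp(−8) ≈ 3.4 × 10⁻⁴, and the simulated exceedance frequency stays below it, as the inequality guarantees in expectation.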
