Baltzer Journals July 2, 1995
Measurements of Generalisation
Based on Information Geometry
Huaiyu Zhu and Richard Rohwer
Neural Computing Research Group
Department of Computer Science and Applied Mathematics,
Aston University, Birmingham B4 7ET, UK
E-mail: zhuh@aston.ac.uk
Neural networks are statistical models and learning rules are estimators. In this paper a theory for measuring generalisation is developed by combining Bayesian decision theory with information geometry. The performance of an estimator is measured by the information divergence between the true distribution and the estimate, averaged over the Bayesian posterior. This unifies the majority of error measures currently in use. The optimal estimators also reveal some intricate interrelationships among information geometry, Banach spaces and sufficient statistics.
1 Introduction
A neural network (deterministic or stochastic) can be regarded as a parameterised statistical model $P(y|x, w)$, where $x \in X$ is the input, $y \in Y$ is the output and $w \in W$ is the weight. In an environment with an input distribution $P(x)$, it is also equivalent to $P(z|w)$, where $z := [x, y] \in Z := X \times Y$ denotes the combined input and output as data [11]. Learning is the task of inferring $w$ from $z$. It is a typical statistical inference problem in which a neural network model acts as a "likelihood function", a learning rule as an "estimator", the trained network as an "estimate" and the data set as a "sample". The set of probability measures on sample space $Z$ forms a (possibly infinite-dimensional) differential manifold $\mathcal{P}$ [2, 16]. A statistical model forms a finite-dimensional submanifold $\mathcal{Q}$, composed of representable distributions, parameterised by weights $w$ acting as coordinates.
To infer $w$ from $z$ requires additional information about $w$. In a Bayesian framework such auxiliary information is represented by a prior $P(p)$, where $p$ is the true but unknown distribution from which $z$ is drawn. This is then combined with the likelihood function $P(z|p)$ to yield the posterior distribution $P(p|z)$ via the Bayes formula $P(p|z) = P(z|p) P(p) / P(z)$.
An estimator $\tau : Z \to \mathcal{Q}$ must, for each $z$, fix one $q \in \mathcal{Q}$ which in a sense approximates $p$.$^1$ This requires a measure of "divergence" $D(p, q)$ between $p, q \in \mathcal{P}$ defined independently of parameterisation. General studies of divergences between probability distributions are provided by the theory of information geometry (see [2, 3, 7] and further references therein). The main thesis of this paper is that generalisation error should be measured by the posterior expectation of the information divergence between the true distribution and the estimate. We shall show that this retains most of the mathematical simplicity of mean squared error theory while being generally applicable to any statistical inference problem.
2 Measurements of Generalisation
The most natural "information divergence" between two distributions $p, q \in \mathcal{P}$ is the $\delta$-divergence defined as [2]$^2$

    D_\delta(p, q) := \frac{1}{\delta(1-\delta)} \Big( 1 - \int p^\delta q^{1-\delta} \Big), \qquad \forall \delta \in (0, 1).    (1)

The limits as $\delta$ tends to 0 and 1 are taken as definitions of $D_0$ and $D_1$, respectively. Following are some salient properties of the $\delta$-divergences [2]:

    D_\delta(p, q) = D_{1-\delta}(q, p) \ge 0, \qquad D_\delta(p, q) = 0 \iff p = q.    (2)

    D_0(q, p) = D_1(p, q) = K(p, q) := \int p \log \frac{p}{q}.    (3)

    D_{1/2}(p, q) = D_{1/2}(q, p) = 2 \int \big( \sqrt{p} - \sqrt{q} \big)^2.    (4)

    D_\delta(p, p + \Delta p) \approx \frac{1}{2} \int \frac{(\Delta p)^2}{p} \approx \frac{1}{2} \big\langle (\Delta \log p)^2 \big\rangle_p.    (5)

The quantity $K(p, q)$ is the Kullback-Leibler divergence (cross entropy). The quantity $D_{1/2}(p, q)$ is the Hellinger distance. The quantity $\int (\Delta p)^2 / p$ is usually called the $\chi^2$ distance between two nearby distributions.
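For a finite sample space the $\delta$-divergence (1) and its limits (3) can be evaluated directly. The following sketch (function and variable names are ours, for illustration only) checks the duality property (2) and the Hellinger identity (4) numerically on discrete distributions.

```python
import numpy as np

def delta_div(p, q, d):
    """delta-divergence of eq. (1) for discrete distributions p, q;
    the limits d = 0 and d = 1 are the KL divergences of eq. (3)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if d == 0.0:
        return np.sum(q * np.log(q / p))   # D_0(p, q) = K(q, p)
    if d == 1.0:
        return np.sum(p * np.log(p / q))   # D_1(p, q) = K(p, q)
    return (1.0 - np.sum(p**d * q**(1.0 - d))) / (d * (1.0 - d))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

# property (2): D_d(p, q) = D_{1-d}(q, p), and zero iff p = q
assert abs(delta_div(p, q, 0.3) - delta_div(q, p, 0.7)) < 1e-12
assert delta_div(p, p, 0.3) < 1e-12

# property (4): D_{1/2} is the (squared) Hellinger distance
hellinger = 2.0 * np.sum((np.sqrt(p) - np.sqrt(q))**2)
assert abs(delta_div(p, q, 0.5) - hellinger) < 1e-12
```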
Armed with the $\delta$-divergence, we now define the generalisation error

    E(\tau) := \int_p P(p) \int_z P(z|p)\, D_\delta(p, \tau(z)), \qquad E(q|z) := \int_p P(p|z)\, D_\delta(p, q),    (6)

where $p$ is the true distribution, $\tau$ is the learning rule, $z$ is the data, and $q = \tau(z)$ is the estimate. A learning rule $\tau$ is called $\delta$-optimal if it minimises $E(\tau)$. A probability distribution $q$ is called a $\delta$-optimal estimate, or simply a $\delta$-estimate, from data $z$, if it minimises $E(q|z)$. The following theorem is a special case of a standard result from Bayesian decision theory.

$^1$ Some Bayesian methods give the entire posterior $P(p|z)$ instead of a point estimate $q$ as the answer. They will be shown later to be a special case of the current framework.
$^2$ This is essentially Amari's $\alpha$-divergence, where $\alpha \in [-1, 1]$, re-parameterised by $\delta = (1-\alpha)/2 \in [0, 1]$ for technical convenience, following [6].
Theorem 2.1 (Coherence)
A learning rule $\tau$ is $\delta$-optimal if and only if for any data $z$, excluding a set of zero probability, the result of training $q = \tau(z)$ is a $\delta$-estimate.
Definition 2.2 ($\delta$-coordinate)
Let $a := 1/\delta$, $b := 1/(1-\delta)$. Let $L^a$ be the Banach space of $a$th power integrable functions. Then $L^a$ and $L^b$ are dual to each other as Banach spaces. Let $p \in \mathcal{P}$. Its $\delta$-coordinate is defined as $l_\delta(p) := p^\delta / \delta \in L^a$ for $\delta > 0$, and $l_0(p) := \log p$ [2]. Denote by $l_\delta^{-1}$ the inverse of $l_\delta$.
Theorem 2.3 ($\delta$-estimator in $\mathcal{P}$)
The $\delta$-estimate $\hat{q} \in \mathcal{P}$ is uniquely given [14] by $\hat{q} = l_\delta^{-1}\big( \int P(p|z)\, l_\delta(p) \big)$.
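The $\delta$-estimator theorem above is easy to sketch when the posterior is supported on finitely many candidate distributions: average the $\delta$-coordinates and map back. The helper names below are ours, not from the paper.

```python
import numpy as np

def l_delta(p, d):
    """delta-coordinate l_d(p) = p**d / d, with l_0(p) = log p."""
    return np.log(p) if d == 0.0 else p**d / d

def l_delta_inv(u, d):
    """Inverse of the delta-coordinate map."""
    return np.exp(u) if d == 0.0 else (d * u)**(1.0 / d)

def delta_estimate(weights, candidates, d):
    """Posterior average taken in the delta-coordinate, mapped back."""
    u = sum(w * l_delta(np.asarray(p, float), d)
            for w, p in zip(weights, candidates))
    return l_delta_inv(u, d)

# the 1-estimate is the posterior mean (marginal) distribution
phat = delta_estimate([0.5, 0.5], [[0.2, 0.8], [0.6, 0.4]], 1.0)
assert np.allclose(phat, [0.4, 0.6])

# for general d this realises phat**d = <p**d> (the delta-average)
phat = delta_estimate([0.5, 0.5], [[0.2, 0.8], [0.6, 0.4]], 0.5)
avg = 0.5 * np.array([0.2, 0.8])**0.5 + 0.5 * np.array([0.6, 0.4])**0.5
assert np.allclose(phat**0.5, avg)
```

For $\delta < 1$ the averaged coordinate generally corresponds to an unnormalised measure, which is why the extension to finite positive measures in the next section is needed.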
3 Divergence between Finite Positive Measures
One of the most useful properties of the least mean square estimate is the so-called $MSE = VAR + BIAS^2$ relation, which also implies that, for a given linear space $W$, the LMS estimate of $w$ within $W$ is given by the projection of the posterior mean $\hat{w}$ onto $W$. This is generalised to the following theorem [16], applying the generalised Pythagorean Theorem for $\delta$-divergences [2].
Theorem 3.4 (Error decomposition in $\mathcal{Q}$)
Let $\mathcal{Q}$ be a $\delta$-flat manifold. Let $P(p)$ be a prior on $\mathcal{Q}$. Then $\forall q \in \mathcal{Q}$, $\forall z \in Z$,

    E(q|z) = E(\hat{p}\,|z) + D_\delta(\hat{p}, q),    (7)

where $\hat{p}$ is the $\delta$-estimate in $\mathcal{Q}$.
To apply this theorem it is necessary to extend the definition of the $\delta$-divergence to $\widetilde{\mathcal{P}}$, the space of finite positive measures, which is $\delta$-flat for any $\delta$ for a finite sample space $Z$ [2], following suggestions in [2].
Definition 3.5 ($\delta$-divergence on $\widetilde{\mathcal{P}}$)
The $\delta$-divergence on $\widetilde{\mathcal{P}}$ is defined by

    D_\delta(p, q) := \frac{1}{\delta(1-\delta)} \int \big( \delta p + (1-\delta) q - p^\delta q^{1-\delta} \big).    (8)
This definition retains most of the important properties of the $\delta$-divergence on $\mathcal{P}$, and reduces to the original definition when restricted to $\mathcal{P}$. It has the additional advantage of being the integral of a positive measure, making it possible to attribute the divergence between two measures to their divergence over various events [16]. In particular, the generalised cross entropy is [16]

    K(p, q) := \int \Big( q - p + p \log \frac{p}{q} \Big).    (9)

The $\delta$-divergence defines a differential structure on $\widetilde{\mathcal{P}}$. The Riemannian geometry and the $\delta$-affine connections can be obtained by the Eguchi relations [2, 7]. The most important advantage of this definition is that the following theorem is true and can be proved by purely algebraic manipulation [16].
Theorem 3.6 (Error decomposition on $\widetilde{\mathcal{P}}$)
Let $P(p)$ be a distribution over $\widetilde{\mathcal{P}}$. Let $q \in \widetilde{\mathcal{P}}$. Then

    \langle D_\delta(p, q) \rangle = \langle D_\delta(p, \hat{p}) \rangle + D_\delta(\hat{p}, q),    (10)

where $\hat{p}$ is the $\delta$-average of $p$ given by $\hat{p}^\delta := \langle p^\delta \rangle$.
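The decomposition (10) is an exact algebraic identity, which can be confirmed numerically with the extended divergence of eq. (8). The sketch below (variable names ours) uses a two-point posterior over unnormalised measures.

```python
import numpy as np

def ext_div(p, q, d):
    """Extended delta-divergence on finite positive measures, eq. (8)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(d * p + (1 - d) * q - p**d * q**(1 - d)) / (d * (1 - d))

d = 0.3
w = np.array([0.7, 0.3])                  # posterior weights
ps = np.array([[0.2, 0.8], [0.5, 0.6]])   # candidate positive measures
q = np.array([0.4, 0.6])                  # an arbitrary estimate

phat = (w @ ps**d)**(1 / d)               # delta-average: phat**d = <p**d>
lhs = np.dot(w, [ext_div(p, q, d) for p in ps])
rhs = np.dot(w, [ext_div(p, phat, d) for p in ps]) + ext_div(phat, q, d)
assert abs(lhs - rhs) < 1e-12             # eq. (10) holds exactly
```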
Theorem 3.7 ($\delta$-estimator in $\widetilde{\mathcal{P}}$)
The $\delta$-estimate $\hat{p} = \tau(z)$ in $\widetilde{\mathcal{P}}$ is given by $\hat{p}^\delta = \langle p^\delta \rangle_z$. In particular, the 1-estimate is the posterior marginal distribution $\hat{p} = \langle p \rangle_z$.
Theorem 3.8 ($\delta$-estimator in $\mathcal{Q}$)
Let $\mathcal{Q}$ be an arbitrary submanifold of $\widetilde{\mathcal{P}}$. The $\delta$-estimate $\hat{q}$ in $\mathcal{Q}$ is given by the $\delta$-projection of $\hat{p}$ onto $\mathcal{Q}$, where $\hat{p}$ is the $\delta$-estimate in $\widetilde{\mathcal{P}}$.
4 Examples and Applications to Neural Networks
Explicit formulas are derived for the optimal estimators for the multinomial [15]
and normal distributions [14].
Example 1
Let $m \in \mathbb{N}^n$, $p \in \mathcal{P} = \Delta^{n-1}$, $a \in \mathbb{R}^n_+$. Consider the multinomial family of distributions $M(m|p)$ with a Dirichlet prior $D(p|a)$. The posterior is also a Dirichlet distribution $D(p|a+m)$. The $\delta$-estimate $\hat{p} \in \widetilde{\mathcal{P}}$ is given by $(\hat{p}_i)^\delta = (a_i + m_i)_\delta / (|a + m|)_\delta$, where $|a| := \sum_i a_i$ and $(a)_b := \Gamma(a+b)/\Gamma(a)$. In particular, $\hat{p}_i = (a_i + m_i)/|a + m|$ for $\delta = 1$, and $\hat{p}_i = \exp\big( \psi(a_i + m_i) - \psi(|a + m|) \big)$ for $\delta = 0$, where $\psi$ is the digamma function. The $\delta$-estimate $\hat{q} \in \mathcal{P}$ is given by normalising $\hat{p}$.
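The Pochhammer factor $(a)_b = \Gamma(a+b)/\Gamma(a)$ in Example 1 can be computed stably via the log-Gamma function, so the $\delta$-estimate is a few lines of code (the function names are ours; the $\delta = 0$ case would additionally need the digamma function and is omitted here).

```python
import math

def poch(x, b):
    """Pochhammer factor (x)_b = Gamma(x + b) / Gamma(x), via lgamma."""
    return math.exp(math.lgamma(x + b) - math.lgamma(x))

def multinomial_delta_estimate(a, m, d):
    """Unnormalised delta-estimate: (p_i)**d = (a_i + m_i)_d / (|a + m|)_d."""
    total = sum(a) + sum(m)
    return [(poch(ai + mi, d) / poch(total, d))**(1.0 / d)
            for ai, mi in zip(a, m)]

# d = 1 recovers the usual posterior mean (a_i + m_i) / |a + m|
phat = multinomial_delta_estimate([1, 1, 1], [3, 1, 0], 1.0)
assert all(abs(x - y) < 1e-12 for x, y in zip(phat, [4/7, 2/7, 1/7]))

# the estimate in P is obtained by normalising phat
qhat = [x / sum(phat) for x in phat]
```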
Example 2
Let $z, \mu \in \mathbb{R}$, $h \in \mathbb{R}_+$, $a \in \mathbb{R}$, $n \in \mathbb{R}_+$. Consider the Gaussian family of distributions $f(z|\mu) = N(z - \mu \,|\, h)$, with fixed variance $\sigma^2 = 1/h$. Let the prior be another Gaussian $f(\mu) = N(\mu - a \,|\, nh)$. Then the posterior after seeing a sample $z$ of size $k$ is also a Gaussian $f(\mu|z) = N(\mu - a_k \,|\, n_k h)$, where $n_k = n + k$, $a_k = (na + \sum z)/n_k$, which is also the posterior least squares estimate. The $\delta$-estimate $\hat{q} \in \mathcal{P}$ is given by the density $f(z'|\hat{q}) = N\big( z' - a_k \,\big|\, h/(1 + \delta/n_k) \big)$.
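The Gaussian posterior update and the precision of the $\delta$-estimate in Example 2 are one-liners; the sketch below is our illustration of the formulas, not code from the paper.

```python
def gaussian_delta_estimate(sample, a, n, h, d):
    """Posterior n_k = n + k, a_k = (n*a + sum(z)) / n_k; the delta-estimate
    density is N(z' - a_k | h / (1 + d/n_k)).  Returns (mean, precision)."""
    k = len(sample)
    n_k = n + k
    a_k = (n * a + sum(sample)) / n_k
    return a_k, h / (1.0 + d / n_k)

mean, prec = gaussian_delta_estimate([1.0, 2.0, 3.0], a=0.0, n=1.0, h=1.0, d=1.0)
assert abs(mean - 1.5) < 1e-12      # a_k = (0 + 6) / 4
assert abs(prec - 0.8) < 1e-12      # h / (1 + 1/4)
```

Note that $\delta = 0$ leaves the precision at $h$ (the plug-in density at the posterior mean), while $\delta = 1$ gives the broader posterior marginal, consistent with the 1-estimate of Theorem 3.7.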
The entities $|a|$ for the multinomial model and $n$ for the Gaussian model are effective previous sample sizes, a fact known since Fisher's time. In a restricted model, the sample size might not be well reflected, and some ancillary statistics may be used for information recovery [2].
Example 3
In some Bayesian methods, such as the Monte Carlo method [10], no estimator is explicitly given. Instead, the posterior is directly used for sampling $p$. This produces a prediction distribution on test data which is the posterior marginal distribution. Therefore these methods are implicitly 1-estimators.
Example 4
Multilayer neural networks are usually not $\delta$-convex for any $\delta$, and there may exist local optima of $E(q|z)$ on $\mathcal{Q}$. A practical learning rule is usually a gradient descent rule which moves $w$ in the direction which reduces $E(q|z)$. The 1-divergence can be minimised by a supervised learning rule, the Boltzmann machine learning rule [1]. The 0-divergence can be minimised by a reinforcement learning rule, the simulated annealing reinforcement learning rule for stochastic networks [13].

    \operatorname{Min}_q K(p, q) \iff \Delta w \propto \langle \partial_w l_0(q) \rangle_p - \langle \partial_w l_0(q) \rangle_q    (11)

    \operatorname{Min}_q K(q, p) \iff \Delta w \propto \langle \partial_w l_0(q), \; l_0(p) - l_0(q) \rangle_q    (12)
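For a simple exponential-family case, rule (11) reduces to a familiar gradient: with $q = \mathrm{softmax}(w)$ one gets $\langle \partial_w l_0(q) \rangle_p - \langle \partial_w l_0(q) \rangle_q = p - q$, since the $q$-average of the score vanishes. A minimal sketch of this special case (our construction, not a Boltzmann machine):

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def kl_step(p, w, lr=0.5):
    """One gradient step minimising K(p, q) over w, rule (11):
    the update direction <d/dw log q>_p - <d/dw log q>_q equals p - q here."""
    return w + lr * (p - softmax(w))

p = np.array([0.2, 0.3, 0.5])
w = np.zeros(3)
for _ in range(5000):
    w = kl_step(p, w)
assert np.max(np.abs(softmax(w) - p)) < 1e-3   # q has converged to p
```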