Baltzer Journals July 2, 1995
Measurements of Generalisation
Based on Information Geometry
Huaiyu Zhu and Richard Rohwer
Neural Computing Research Group
Department of Computer Science and Applied Mathematics,
Aston University, Birmingham B4 7ET, UK
E-mail: zhuh@aston.ac.uk
Neural networks are statistical models and learning rules are estimators. In this paper a theory for measuring generalisation is developed by combining Bayesian decision theory with information geometry. The performance of an estimator is measured by the information divergence between the true distribution and the estimate, averaged over the Bayesian posterior. This unifies the majority of error measures currently in use. The optimal estimators also reveal some intricate interrelationships among information geometry, Banach spaces and sufficient statistics.
1 Introduction
A neural network (deterministic or stochastic) can be regarded as a parameterised statistical model $P(y|x,w)$, where $x \in X$ is the input, $y \in Y$ is the output and $w \in W$ is the weight. In an environment with an input distribution $P(x)$, it is also equivalent to $P(z|w)$, where $z := [x,y] \in Z := X \times Y$ denotes the combined input and output as data [11]. Learning is the task of inferring $w$ from $z$. It is a typical statistical inference problem in which a neural network model acts as a "likelihood function", a learning rule as an "estimator", the trained network as an "estimate" and the data set as a "sample". The set of probability measures on sample space $Z$ forms a (possibly infinite dimensional) differential manifold $\mathcal{P}$ [2, 16]. A statistical model forms a finite-dimensional submanifold $\mathcal{Q}$, composed of representable distributions, parameterised by weights $w$ acting as coordinates.

To infer $w$ from $z$ requires additional information about $w$. In a Bayesian framework such auxiliary information is represented by a prior $P(p)$, where $p$ is the true but unknown distribution from which $z$ is drawn. This is then combined with the likelihood function $P(z|p)$ to yield the posterior distribution $P(p|z)$ via the Bayes formula $P(p|z) = P(z|p)P(p)/P(z)$.
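For a finite family of candidate distributions the Bayes formula above is a one-line computation. The following Python sketch (an illustration, not part of the original text; NumPy is assumed and the candidate distributions, prior and data are made up) obtains $P(p|z)$ from $P(z|p)$ and $P(p)$:

import numpy as np

# Three candidate distributions p over a binary Z, and a prior P(p) over them.
candidates = np.array([[0.9, 0.1],
                       [0.5, 0.5],
                       [0.2, 0.8]])
prior = np.array([0.3, 0.4, 0.3])

z = np.array([0, 0, 1, 0])                      # observed data, as indices into Z

likelihood = np.prod(candidates[:, z], axis=1)  # P(z|p) for each candidate p
posterior = likelihood * prior                  # numerator of the Bayes formula
posterior /= posterior.sum()                    # divide by P(z), the sum of the numerator
print(posterior)                                # P(p|z)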

An estimator $\tau : Z \to \mathcal{Q}$ must, for each $z$, fix one $q \in \mathcal{Q}$ which in a sense approximates $p$. (Some Bayesian methods give the entire posterior $P(p|z)$ instead of a point estimate $q$ as the answer; they will be shown later to be a special case of the current framework.) This requires a measure of "divergence" $D(p,q)$ between $p, q \in \mathcal{P}$ defined independently of parameterisation. General studies on divergences between probability distributions are provided by the theory of information geometry (see [2, 3, 7] and further references therein). The main thesis of this paper is that generalisation error should be measured by the posterior expectation of the information divergence between the true distribution and the estimate. We shall show that this retains most of the mathematical simplicity of mean squared error theory while being generally applicable to any statistical inference problem.
2 Measurements of Generalisation
The most natural "information divergence" between two distributions $p, q \in \mathcal{P}$ is the $\delta$-divergence, defined as [2]

$$D_\delta(p, q) := \frac{1}{\delta(1-\delta)} \left( 1 - \int p^\delta q^{1-\delta} \right), \qquad \forall \delta \in (0, 1). \tag{1}$$

(This is essentially Amari's $\alpha$-divergence, where $\alpha \in [-1, 1]$, re-parameterised by $\delta = (1-\alpha)/2 \in [0, 1]$ for technical convenience, following [6].) The limits as $\delta$ tends to 0 and 1 are taken as definitions of $D_0$ and $D_1$, respectively. Following are some salient properties of the $\delta$-divergences [2]:

$$D_\delta(p, q) = D_{1-\delta}(q, p) \ge 0, \qquad D_\delta(p, q) = 0 \iff p = q. \tag{2}$$

$$D_0(q, p) = D_1(p, q) = K(p, q) := \int p \log \frac{p}{q}. \tag{3}$$

$$D_{1/2}(p, q) = D_{1/2}(q, p) = 2 \int \left( \sqrt{p} - \sqrt{q} \right)^2. \tag{4}$$

$$D_\delta(p, p + \mathrm{d}p) \simeq \frac{1}{2} \int \frac{(\mathrm{d}p)^2}{p} \simeq \frac{1}{2} \int p \, (\mathrm{d}\log p)^2. \tag{5}$$

The quantity $K(p, q)$ is the Kullback-Leibler divergence (cross entropy). The quantity $D_{1/2}(p, q)$ is the Hellinger distance. The quantity $\int (\mathrm{d}p)^2/p$ is usually called the $\chi^2$ distance between two nearby distributions.
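As a numerical illustration (not from the original paper; NumPy is assumed and the values are arbitrary), the following sketch evaluates Eq. (1) for discrete distributions and checks the duality (2), the Kullback-Leibler limit (3) and the Hellinger form (4):

import numpy as np

def div_delta(p, q, d):
    # Eq. (1) for discrete distributions p, q and delta = d in (0, 1).
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (1.0 - np.sum(p**d * q**(1.0 - d))) / (d * (1.0 - d))

def kl(p, q):
    # Eq. (3): K(p, q) = integral of p log(p/q).
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

print(div_delta(p, q, 0.3), div_delta(q, p, 0.7))    # duality (2): the two agree
print(div_delta(p, q, 0.999), kl(p, q))              # limit (3): D_delta -> K as delta -> 1
print(div_delta(p, q, 0.5),                          # Hellinger form (4)
      2.0 * np.sum((np.sqrt(p) - np.sqrt(q))**2))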
Armed with the $\delta$-divergence, we now define the generalisation error

$$E_\delta(\tau) := \int_p P(p) \int_z P(z|p)\, D_\delta(p, \tau(z)), \qquad E_\delta(q|z) := \int_p P(p|z)\, D_\delta(p, q), \tag{6}$$

where $p$ is the true distribution, $\tau$ is the learning rule, $z$ is the data, and $q = \tau(z)$ is the estimate. A learning rule $\tau$ is called $\delta$-optimal if it minimises $E_\delta(\tau)$. A probability distribution $q$ is called a $\delta$-optimal estimate, or simply a $\delta$-estimate, from data $z$, if it minimises $E_\delta(q|z)$. The following theorem is a special case of a standard result from Bayesian decision theory.
Theorem 1 (Coherence). A learning rule $\tau$ is $\delta$-optimal if and only if, for any data $z$, excluding a set of zero probability, the result of training $q = \tau(z)$ is a $\delta$-estimate.
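The posterior expectation $E_\delta(q|z)$ in Eq. (6) can be approximated by Monte Carlo whenever samples from $P(p|z)$ are available. A sketch (not from the original paper; a Dirichlet distribution stands in for the posterior and the candidate estimates are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def div_delta(p, q, d):
    # Eq. (1), vectorised over a batch of p's.
    return (1.0 - np.sum(p**d * q**(1.0 - d), axis=-1)) / (d * (1.0 - d))

# A Dirichlet distribution over the 2-simplex stands in for the posterior P(p|z).
post_samples = rng.dirichlet([3.0, 2.0, 5.0], size=20000)

def gen_error(q, d):
    # Monte Carlo estimate of E_delta(q|z) = <D_delta(p, q)> over P(p|z), Eq. (6).
    return div_delta(post_samples, np.asarray(q, float), d).mean()

# The delta-optimal estimate minimises this quantity; here the posterior mean
# [0.3, 0.2, 0.5] scores noticeably better than the uniform distribution.
print(gen_error([0.3, 0.2, 0.5], d=0.5), gen_error([1.0/3, 1.0/3, 1.0/3], d=0.5))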
Denition .2 (
-coordinate)
Let
:= 1
=
,
:= 1
=
(1
?
)
. Let
L
be the Banach space of
th power integrable
functions. Then
L
and
L
are dual to each other as Banach spaces. Let
p
2 P
.
Its
-coordinate is dened as
l
(
p
) :=
p
=
2
L
for
>
0
, and
0
l
(
p
) := log
p
[2].
Denote by
1
=
l
the inverse of
l
.
Theorem 3 ($\delta$-estimator in $\mathcal{P}$). The $\delta$-estimate $\hat q \in \mathcal{P}$ is uniquely given [14] by $\hat q \propto (l^\delta)^{-1}\!\left( \int P(p|z)\, l^\delta(p) \right)$.
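A sketch of Theorem 3 in the same finite setting (not from the original paper; the Dirichlet stand-in posterior and $\delta = 1/2$ are arbitrary choices): average the $\delta$-coordinate $l^\delta(p) = p^\delta/\delta$ over posterior samples, map back through $(l^\delta)^{-1}$, and normalise.

import numpy as np

rng = np.random.default_rng(0)
d = 0.5                                                      # delta
post_samples = rng.dirichlet([3.0, 2.0, 5.0], size=20000)    # stand-in for P(p|z)

l_delta = lambda p: p**d / d                    # delta-coordinate (Definition 2)
l_delta_inv = lambda u: (d * u)**(1.0 / d)      # its inverse

# Theorem 3: q_hat is proportional to the inverse image of the
# posterior-averaged delta-coordinate of p.
q_hat = l_delta_inv(l_delta(post_samples).mean(axis=0))
q_hat /= q_hat.sum()                            # normalise to land in P
print(q_hat)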
3 Divergence between Finite Positive Measures
One of the most useful properties of the least mean square estimate is the so-called $MSE = VAR + BIAS^2$ relation, which also implies that, for a given linear space $W$, the LMS estimate of $w$ within $W$ is given by the projection of the posterior mean $\hat w$ onto $W$. This is generalised to the following theorem [16], applying the generalised Pythagorean Theorem for $\delta$-divergences [2].
Theorem 4 (Error decomposition in $\mathcal{Q}$). Let $\mathcal{Q}$ be a $\delta$-flat manifold. Let $P(p)$ be a prior on $\mathcal{Q}$. Then $\forall q \in \mathcal{Q}$, $\forall z \in Z$,

$$E_\delta(q|z) = E_\delta(\hat p|z) + D_\delta(\hat p, q), \tag{7}$$

where $\hat p$ is the $\delta$-estimate in $\mathcal{Q}$.
To apply this theorem it is necessary to extend the definition of $\delta$-divergence to $\widetilde{\mathcal{P}}$, the space of finite positive measures, which is $\delta$-flat for any $\delta$ for a finite sample space $Z$ [2], following suggestions in [2].

Denition .5 (
-divergence on
e
P
)
The
-divergence on
e
P
is dened by
D
(
p; q
) : =
1
(1
?
)
Z
?
p
+ (1
?
)
q
?
p
q
1
?
(8)
This denition retains most of the imp ortant prop erties of
-divergence on
P
, and reduces to the original denition when restricted to
P
. It has the addi-
tional advantage of b eing the integral of a positive measure, making it possible to
attribute the divergence b etween two measures to their divergence over various
events [16]. In particular, the generalised cross entropy is [16]
K
(
p; q
) :=
Z
q
?
p
+
p
log
p
q
:
(9)
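The extended divergence (8) and the generalised cross entropy (9) are straightforward to evaluate for discrete finite positive measures; the sketch below (not from the original paper, with arbitrary values) also confirms that (8) reduces to (1) when both arguments are normalised:

import numpy as np

def div_delta_ext(p, q, d):
    # Eq. (8): delta-divergence between finite positive measures on a finite Z.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(d * p + (1.0 - d) * q - p**d * q**(1.0 - d)) / (d * (1.0 - d))

def gen_kl(p, q):
    # Eq. (9): generalised cross entropy.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(q - p + p * np.log(p / q))

p = np.array([0.2, 0.3, 0.5])            # a probability distribution
q = np.array([0.4, 0.4, 0.2])            # another probability distribution
m = np.array([0.5, 1.0, 2.0])            # an unnormalised positive measure

print(div_delta_ext(p, q, 0.3))          # coincides with Eq. (1), since p and q are normalised
print(div_delta_ext(p, m, 0.3), gen_kl(p, m))   # both defined and non-negative on measures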
The $\delta$-divergence defines a differential structure on $\widetilde{\mathcal{P}}$. The Riemannian geometry and the $\delta$-affine connections can be obtained by the Eguchi relations [2, 7]. The most important advantage of this definition is that the following theorem holds and can be proved by purely algebraic manipulation [16].
Theorem 6 (Error decomposition on $\widetilde{\mathcal{P}}$). Let $P(p)$ be a distribution over $\widetilde{\mathcal{P}}$. Let $q \in \widetilde{\mathcal{P}}$. Then

$$\langle D_\delta(p, q) \rangle = \langle D_\delta(p, \hat p) \rangle + D_\delta(\hat p, q), \tag{10}$$

where $\hat p$ is the $\delta$-average of $p$, given by $\hat p := \langle p^\delta \rangle^{1/\delta}$.
Theorem 7 ($\delta$-estimator in $\widetilde{\mathcal{P}}$). The $\delta$-estimate $\hat p = \tau(z)$ in $\widetilde{\mathcal{P}}$ is given by $\hat p = \langle p^\delta \rangle_z^{1/\delta}$. In particular, the 1-estimate is the posterior marginal distribution $\hat p = \langle p \rangle_z$.
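The decomposition (10) can be checked numerically. In the sketch below (not from the original paper; a sampled cloud of distributions stands in for $P(p)$ and the values are arbitrary), $\hat p$ is formed as the $\delta$-average $\langle p^\delta\rangle^{1/\delta}$ of the sample, and the two sides of (10) agree:

import numpy as np

rng = np.random.default_rng(1)
d = 0.3

def div_ext(p, q):
    # Eq. (8) on a finite Z, vectorised over a batch of p's.
    return np.sum(d * p + (1.0 - d) * q - p**d * q**(1.0 - d), axis=-1) / (d * (1.0 - d))

ps = rng.dirichlet([2.0, 1.0, 4.0], size=50000)   # a cloud of p's standing in for P(p)
q = np.array([0.25, 0.25, 0.5])                   # an arbitrary fixed measure

p_hat = (ps**d).mean(axis=0)**(1.0 / d)           # the delta-average of Theorem 6

lhs = div_ext(ps, q).mean()                       # <D_delta(p, q)>
rhs = div_ext(ps, p_hat).mean() + div_ext(p_hat, q)
print(lhs, rhs)   # equal up to floating point: (10) holds exactly for the sample average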
Theorem 8 ($\delta$-estimator in $\mathcal{Q}$). Let $\mathcal{Q}$ be an arbitrary submanifold of $\widetilde{\mathcal{P}}$. The $\delta$-estimate $\hat q$ in $\mathcal{Q}$ is given by the $\delta$-projection of $\hat p$ onto $\mathcal{Q}$, where $\hat p$ is the $\delta$-estimate in $\widetilde{\mathcal{P}}$.
4 Examples and Applications to Neural Networks
Explicit formulas are derived for the optimal estimators for the multinomial [15]
and normal distributions [14].

Example 1. Let $m \in \mathbf{N}^n$, $p \in \mathcal{P} = \Delta^{n-1}$, $a \in \mathbf{R}^n_+$. Consider the multinomial family of distributions $M(m|p)$ with a Dirichlet prior $D(p|a)$. The posterior is also a Dirichlet distribution, $D(p|a+m)$. The $\delta$-estimate $\hat p \in \widetilde{\mathcal{P}}$ is given by $(\hat p_i)^\delta = (a_i + m_i)_\delta / (|a+m|)_\delta$, where $|a| := \sum_i a_i$ and $(a)_b := \Gamma(a+b)/\Gamma(a)$. In particular, $\hat p_i = (a_i + m_i)/|a+m|$ for $\delta = 1$, and $\hat p_i = \exp\!\left( \psi(a_i + m_i) - \psi(|a+m|) \right)$ for $\delta = 0$, where $\psi$ is the digamma function. The $\delta$-estimate $\hat q \in \mathcal{P}$ is given by normalising $\hat p$.
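A sketch of the closed forms in Example 1 (not from the original paper; SciPy is assumed, and the prior parameters and counts are made up):

import numpy as np
from scipy.special import gammaln, digamma

a = np.array([1.0, 1.0, 1.0])        # Dirichlet prior parameters (illustrative)
m = np.array([3.0, 0.0, 7.0])        # observed multinomial counts (illustrative)
post = a + m                         # parameters of the Dirichlet posterior D(p|a+m)

def delta_estimate(d):
    # (p_hat_i)^d = (a_i+m_i)_d / (|a+m|)_d, with (a)_b = Gamma(a+b)/Gamma(a);
    # the d = 0 limit uses the digamma function, as in Example 1.
    if d == 0:
        return np.exp(digamma(post) - digamma(post.sum()))
    log_p_hat = (gammaln(post + d) - gammaln(post)
                 - gammaln(post.sum() + d) + gammaln(post.sum())) / d
    return np.exp(log_p_hat)

for d in (1.0, 0.5, 0.0):
    p_hat = delta_estimate(d)                 # delta-estimate in the extended space
    print(d, p_hat, p_hat / p_hat.sum())      # normalising gives the estimate in P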
Example 2. Let $z, \mu \in \mathbf{R}$, $h \in \mathbf{R}_+$, $a \in \mathbf{R}$, $n \in \mathbf{R}_+$. Consider the Gaussian family of distributions $f(z|\mu) = N(z - \mu \,|\, h)$, with fixed variance $\sigma^2 = 1/h$. Let the prior be another Gaussian, $f(\mu) = N(\mu - a \,|\, nh)$. Then the posterior after seeing a sample $z$ of size $k$ is also a Gaussian, $f(\mu|z) = N(\mu - a_k \,|\, n_k h)$, where $n_k = n + k$ and $a_k = (na + \sum z)/n_k$, which is also the posterior least squares estimate. The $\delta$-estimate $\hat q \in \mathcal{P}$ is given by the density $f(z'|\hat q) = N\!\left( z' - a_k \,\big|\, h/(1 + \delta/n_k) \right)$.
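A sketch of Example 2 (not from the original paper; the sample, prior parameters and $\delta$ are made up), comparing the closed-form variance $(1+\delta/n_k)/h$ with a direct Monte Carlo evaluation of $\hat p \propto \langle f(\cdot|\mu)^\delta\rangle_z^{1/\delta}$:

import numpy as np

rng = np.random.default_rng(0)
h, n, a, d = 1.0, 2.0, 0.0, 0.5          # precision h, prior size n, prior mean a, delta
z = np.array([1.2, 0.7, 1.9, 1.1])       # an observed sample of size k = 4
k = len(z)
n_k = n + k
a_k = (n * a + z.sum()) / n_k            # posterior mean, also the least squares estimate

print("closed-form variance:", (1.0 + d / n_k) / h)   # from N(z' - a_k | h / (1 + d/n_k))

# Direct check: p_hat is proportional to ( < f(z'|mu)^d > over the posterior )^(1/d).
mus = rng.normal(a_k, np.sqrt(1.0 / (n_k * h)), size=20000)
zg = np.linspace(a_k - 6.0, a_k + 6.0, 401)
dens = np.exp(-0.5 * d * h * (zg[None, :] - mus[:, None])**2).mean(axis=0)**(1.0 / d)
dens /= dens.sum() * (zg[1] - zg[0])                   # normalise on the grid
print("monte carlo variance:", ((zg - a_k)**2 * dens).sum() * (zg[1] - zg[0]))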
The entities $|a|$ for the multinomial model and $n$ for the Gaussian model are effective previous sample sizes, a fact known since Fisher's time. In a restricted model, the sample size might not be well reflected, and some ancillary statistics may be used for information recovery [2].
Example 3. In some Bayesian methods, such as the Monte Carlo method [10], no estimator is explicitly given. Instead, the posterior is directly used for sampling $p$. This produces a prediction distribution on test data which is the posterior marginal distribution. Therefore these methods are implicitly 1-estimators.
Example 4. Multilayer neural networks are usually not $\delta$-convex for any $\delta$, and there may exist local optima of $E_\delta(\cdot|z)$ on $\mathcal{Q}$. A practical learning rule is usually a gradient descent rule which moves $w$ in the direction which reduces $E_\delta(q|z)$. The 1-divergence can be minimised by a supervised learning rule, the Boltzmann machine learning rule [1]. The 0-divergence can be minimised by a reinforcement learning rule, the simulated annealing reinforcement learning rule for stochastic networks [13].

$$\operatorname{Min}_q K(p, q) \iff \Delta w \propto \left\langle \partial_w l^0(q) \right\rangle_p - \left\langle \partial_w l^0(q) \right\rangle_q \tag{11}$$

$$\operatorname{Min}_q K(q, p) \iff \Delta w \propto \left\langle \partial_w l^0(q), \; l^0(p) - l^0(q) \right\rangle_q \tag{12}$$
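To illustrate rule (11) concretely (a sketch, not from the original paper; a softmax-parameterised distribution on a four-point $Z$ is used in place of a Boltzmann machine), the update $\Delta w \propto \langle\partial_w l^0(q)\rangle_p - \langle\partial_w l^0(q)\rangle_q$ reduces to $p - q$ for this toy model and drives $K(p,q)$ towards zero:

import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])       # target distribution on a four-point Z
w = np.zeros(4)                          # weights of the toy model q(z|w) = softmax(w)

def q_of(w):
    e = np.exp(w - w.max())
    return e / e.sum()

# For softmax, d log q(z_i) / d w_j = delta_ij - q_j, so
#   <d_w l0(q)>_p - <d_w l0(q)>_q = (p - q) - 0 = p - q,
# i.e. rule (11): a "clamped" expectation under p minus a "free" expectation under q.
eta = 0.5
for _ in range(2000):
    w += eta * (p - q_of(w))             # Delta w proportional to the two-phase difference

q = q_of(w)
print(q)                                 # close to p
print(np.sum(p * np.log(p / q)))         # remaining 1-divergence K(p, q), near zero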

References

D. H. Ackley, G. E. Hinton and T. J. Sejnowski, A learning algorithm for Boltzmann machines.
H. White, Learning in artificial neural networks: A statistical perspective.
D. J. C. MacKay, Bayesian methods for adaptive models, PhD dissertation.