Baltzer Journals July 2, 1995
Measurements of Generalisation
Based on Information Geometry
Huaiyu Zhu and Richard Rohwer
Neural Computing Research Group
Department of Computer Science and Applied Mathematics,
Aston University, Birmingham B4 7ET, UK
E-mail: zhuh@aston.ac.uk
Neural networks are statistical models and learning rules are estimators. In this paper a theory for measuring generalisation is developed by combining Bayesian decision theory with information geometry. The performance of an estimator is measured by the information divergence between the true distribution and the estimate, averaged over the Bayesian posterior. This unifies the majority of error measures currently in use. The optimal estimators also reveal some intricate interrelationships among information geometry, Banach spaces and sufficient statistics.
1 Introduction
A neural network (deterministic or stochastic) can be regarded as a parameterised statistical model $P(y|x, w)$, where $x \in X$ is the input, $y \in Y$ is the output and $w \in W$ is the weight. In an environment with an input distribution $P(x)$, it is also equivalent to $P(z|w)$, where $z := [x, y] \in Z := X \times Y$ denotes the combined input and output as data [11]. Learning is the task of inferring $w$ from $z$. It is a typical statistical inference problem in which a neural network model acts as a "likelihood function", a learning rule as an "estimator", the trained network as an "estimate" and the data set as a "sample". The set of probability measures on sample space $Z$ forms a (possibly infinite-dimensional) differential manifold $\mathcal{P}$ [2, 16]. A statistical model forms a finite-dimensional submanifold $\mathcal{Q}$, composed of representable distributions, parameterised by weights $w$ acting as coordinates.
To infer $w$ from $z$ requires additional information about $w$. In a Bayesian framework such auxiliary information is represented by a prior $P(p)$, where $p$ is the true but unknown distribution from which $z$ is drawn. This is then combined with the likelihood function $P(z|p)$ to yield the posterior distribution $P(p|z)$ via the Bayes formula $P(p|z) = P(z|p) P(p) / P(z)$.
An estimator $\tau : Z \to \mathcal{Q}$ must, for each $z$, fix one $q \in \mathcal{Q}$ which in a sense approximates $p$.$^1$ This requires a measure of "divergence" $D(p, q)$ between $p, q \in \mathcal{P}$ defined independently of parameterisation. General studies of divergences between probability distributions are provided by the theory of information geometry (see [2, 3, 7] and further references therein). The main thesis of this paper is that generalisation error should be measured by the posterior expectation of the information divergence between the true distribution and the estimate. We shall show that this retains most of the mathematical simplicity of mean squared error theory while being generally applicable to any statistical inference problem.
2 Measurements of Generalisation
The most natural "information divergence" between two distributions $p, q \in \mathcal{P}$ is the $\delta$-divergence defined as [2]$^2$

    D_\delta(p, q) := \frac{1}{\delta(1-\delta)} \Big( 1 - \int p^\delta q^{1-\delta} \Big), \qquad \forall \delta \in (0, 1).    (1)

The limits as $\delta$ tends to 0 and 1 are taken as definitions of $D_0$ and $D_1$, respectively. Following are some salient properties of the $\delta$-divergences [2]:

    D_\delta(p, q) = D_{1-\delta}(q, p) \ge 0, \qquad D_\delta(p, q) = 0 \iff p = q.    (2)

    D_0(q, p) = D_1(p, q) = K(p, q) := \int p \log \frac{p}{q}.    (3)

    D_{1/2}(p, q) = D_{1/2}(q, p) = 2 \int \big( \sqrt{p} - \sqrt{q} \big)^2.    (4)

    D_\delta(p, p + \Delta p) \approx \frac{1}{2} \int \frac{(\Delta p)^2}{p} \approx \frac{1}{2} \big\langle (\Delta \log p)^2 \big\rangle_p.    (5)

The quantity $K(p, q)$ is the Kullback-Leibler divergence (cross entropy). The quantity $D_{1/2}(p, q)$ is the Hellinger distance. The quantity $\int (\Delta p)^2 / p$ is usually called the $\chi^2$ distance between two nearby distributions.
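For a finite sample space the $\delta$-divergence (1) and its limits (3) can be evaluated directly. The following sketch (function and variable names are ours, for illustration only) checks the duality property (2) and the Hellinger identity (4) numerically on discrete distributions.

```python
import numpy as np

def delta_div(p, q, d):
    """delta-divergence of eq. (1) for discrete distributions p, q;
    the limits d = 0 and d = 1 are the KL divergences of eq. (3)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if d == 0.0:
        return np.sum(q * np.log(q / p))   # D_0(p, q) = K(q, p)
    if d == 1.0:
        return np.sum(p * np.log(p / q))   # D_1(p, q) = K(p, q)
    return (1.0 - np.sum(p**d * q**(1.0 - d))) / (d * (1.0 - d))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

# property (2): D_d(p, q) = D_{1-d}(q, p), and zero iff p = q
assert abs(delta_div(p, q, 0.3) - delta_div(q, p, 0.7)) < 1e-12
assert delta_div(p, p, 0.3) < 1e-12

# property (4): D_{1/2} is the (squared) Hellinger distance
hellinger = 2.0 * np.sum((np.sqrt(p) - np.sqrt(q))**2)
assert abs(delta_div(p, q, 0.5) - hellinger) < 1e-12
```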
Armed with the $\delta$-divergence, we now define the generalisation error

    E(\tau) := \int_p P(p) \int_z P(z|p)\, D_\delta(p, \tau(z)), \qquad E(q|z) := \int_p P(p|z)\, D_\delta(p, q),    (6)

where $p$ is the true distribution, $\tau$ is the learning rule, $z$ is the data, and $q = \tau(z)$ is the estimate. A learning rule $\tau$ is called $\delta$-optimal if it minimises $E(\tau)$. A probability distribution $q$ is called a $\delta$-optimal estimate, or simply a $\delta$-estimate, from data $z$, if it minimises $E(q|z)$. The following theorem is a special case of a standard result from Bayesian decision theory.

$^1$ Some Bayesian methods give the entire posterior $P(p|z)$ instead of a point estimate $q$ as the answer. They will be shown later to be a special case of the current framework.
$^2$ This is essentially Amari's $\alpha$-divergence, where $\alpha \in [-1, 1]$, re-parameterised by $\delta = (1-\alpha)/2 \in [0, 1]$ for technical convenience, following [6].
Theorem 2.1 (Coherence)
A learning rule $\tau$ is $\delta$-optimal if and only if for any data $z$, excluding a set of zero probability, the result of training $q = \tau(z)$ is a $\delta$-estimate.
Definition 2.2 ($\delta$-coordinate)
Let $a := 1/\delta$, $b := 1/(1-\delta)$. Let $L^a$ be the Banach space of $a$th power integrable functions. Then $L^a$ and $L^b$ are dual to each other as Banach spaces. Let $p \in \mathcal{P}$. Its $\delta$-coordinate is defined as $l_\delta(p) := p^\delta / \delta \in L^a$ for $\delta > 0$, and $l_0(p) := \log p$ [2]. Denote by $l_\delta^{-1}$ the inverse of $l_\delta$.
Theorem 2.3 ($\delta$-estimator in $\mathcal{P}$)
The $\delta$-estimate $\hat{q} \in \mathcal{P}$ is uniquely given [14] by $\hat{q} = l_\delta^{-1}\big( \int P(p|z)\, l_\delta(p) \big)$.
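The $\delta$-estimator theorem above is easy to sketch when the posterior is supported on finitely many candidate distributions: average the $\delta$-coordinates and map back. The helper names below are ours, not from the paper.

```python
import numpy as np

def l_delta(p, d):
    """delta-coordinate l_d(p) = p**d / d, with l_0(p) = log p."""
    return np.log(p) if d == 0.0 else p**d / d

def l_delta_inv(u, d):
    """Inverse of the delta-coordinate map."""
    return np.exp(u) if d == 0.0 else (d * u)**(1.0 / d)

def delta_estimate(weights, candidates, d):
    """Posterior average taken in the delta-coordinate, mapped back."""
    u = sum(w * l_delta(np.asarray(p, float), d)
            for w, p in zip(weights, candidates))
    return l_delta_inv(u, d)

# the 1-estimate is the posterior mean (marginal) distribution
phat = delta_estimate([0.5, 0.5], [[0.2, 0.8], [0.6, 0.4]], 1.0)
assert np.allclose(phat, [0.4, 0.6])

# for general d this realises phat**d = <p**d> (the delta-average)
phat = delta_estimate([0.5, 0.5], [[0.2, 0.8], [0.6, 0.4]], 0.5)
avg = 0.5 * np.array([0.2, 0.8])**0.5 + 0.5 * np.array([0.6, 0.4])**0.5
assert np.allclose(phat**0.5, avg)
```

For $\delta < 1$ the averaged coordinate generally corresponds to an unnormalised measure, which is why the extension to finite positive measures in the next section is needed.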
3 Divergence between Finite Positive Measures
One of the most useful properties of the least mean square estimate is the so-called $MSE = VAR + BIAS^2$ relation, which also implies that, for a given linear space $W$, the LMS estimate of $w$ within $W$ is given by the projection of the posterior mean $\hat{w}$ onto $W$. This is generalised to the following theorem [16], applying the generalised Pythagorean Theorem for $\delta$-divergences [2].
Theorem 3.4 (Error decomposition in $\mathcal{Q}$)
Let $\mathcal{Q}$ be a $\delta$-flat manifold. Let $P(p)$ be a prior on $\mathcal{Q}$. Then $\forall q \in \mathcal{Q}$, $\forall z \in Z$,

    E(q|z) = E(\hat{p}\,|z) + D_\delta(\hat{p}, q),    (7)

where $\hat{p}$ is the $\delta$-estimate in $\mathcal{Q}$.
To apply this theorem it is necessary to extend the definition of the $\delta$-divergence to $\widetilde{\mathcal{P}}$, the space of finite positive measures, which is $\delta$-flat for any $\delta$ for a finite sample space $Z$ [2], following suggestions in [2].
Definition 3.5 ($\delta$-divergence on $\widetilde{\mathcal{P}}$)
The $\delta$-divergence on $\widetilde{\mathcal{P}}$ is defined by

    D_\delta(p, q) := \frac{1}{\delta(1-\delta)} \int \big( \delta p + (1-\delta) q - p^\delta q^{1-\delta} \big).    (8)
This definition retains most of the important properties of the $\delta$-divergence on $\mathcal{P}$, and reduces to the original definition when restricted to $\mathcal{P}$. It has the additional advantage of being the integral of a positive measure, making it possible to attribute the divergence between two measures to their divergence over various events [16]. In particular, the generalised cross entropy is [16]

    K(p, q) := \int \Big( q - p + p \log \frac{p}{q} \Big).    (9)

The $\delta$-divergence defines a differential structure on $\widetilde{\mathcal{P}}$. The Riemannian geometry and the $\delta$-affine connections can be obtained by the Eguchi relations [2, 7]. The most important advantage of this definition is that the following theorem is true and can be proved by purely algebraic manipulation [16].
Theorem 3.6 (Error decomposition on $\widetilde{\mathcal{P}}$)
Let $P(p)$ be a distribution over $\widetilde{\mathcal{P}}$. Let $q \in \widetilde{\mathcal{P}}$. Then

    \langle D_\delta(p, q) \rangle = \langle D_\delta(p, \hat{p}) \rangle + D_\delta(\hat{p}, q),    (10)

where $\hat{p}$ is the $\delta$-average of $p$ given by $\hat{p}^\delta := \langle p^\delta \rangle$.
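The decomposition (10) is an exact algebraic identity, which can be confirmed numerically with the extended divergence of eq. (8). The sketch below (variable names ours) uses a two-point posterior over unnormalised measures.

```python
import numpy as np

def ext_div(p, q, d):
    """Extended delta-divergence on finite positive measures, eq. (8)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(d * p + (1 - d) * q - p**d * q**(1 - d)) / (d * (1 - d))

d = 0.3
w = np.array([0.7, 0.3])                  # posterior weights
ps = np.array([[0.2, 0.8], [0.5, 0.6]])   # candidate positive measures
q = np.array([0.4, 0.6])                  # an arbitrary estimate

phat = (w @ ps**d)**(1 / d)               # delta-average: phat**d = <p**d>
lhs = np.dot(w, [ext_div(p, q, d) for p in ps])
rhs = np.dot(w, [ext_div(p, phat, d) for p in ps]) + ext_div(phat, q, d)
assert abs(lhs - rhs) < 1e-12             # eq. (10) holds exactly
```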
Theorem 3.7 ($\delta$-estimator in $\widetilde{\mathcal{P}}$)
The $\delta$-estimate $\hat{p} = \tau(z)$ in $\widetilde{\mathcal{P}}$ is given by $\hat{p}^\delta = \langle p^\delta \rangle_z$. In particular, the 1-estimate is the posterior marginal distribution $\hat{p} = \langle p \rangle_z$.
Theorem 3.8 ($\delta$-estimator in $\mathcal{Q}$)
Let $\mathcal{Q}$ be an arbitrary submanifold of $\widetilde{\mathcal{P}}$. The $\delta$-estimate $\hat{q}$ in $\mathcal{Q}$ is given by the $\delta$-projection of $\hat{p}$ onto $\mathcal{Q}$, where $\hat{p}$ is the $\delta$-estimate in $\widetilde{\mathcal{P}}$.
4 Examples and Applications to Neural Networks
Explicit formulas are derived for the optimal estimators for the multinomial [15]
and normal distributions [14].
Example 1
Let $m \in \mathbb{N}^n$, $p \in \mathcal{P} = \Delta^{n-1}$, $a \in \mathbb{R}^n_+$. Consider the multinomial family of distributions $M(m|p)$ with a Dirichlet prior $D(p|a)$. The posterior is also a Dirichlet distribution $D(p|a+m)$. The $\delta$-estimate $\hat{p} \in \widetilde{\mathcal{P}}$ is given by $(\hat{p}_i)^\delta = (a_i + m_i)_\delta / (|a + m|)_\delta$, where $|a| := \sum_i a_i$ and $(a)_b := \Gamma(a+b)/\Gamma(a)$. In particular, $\hat{p}_i = (a_i + m_i)/|a + m|$ for $\delta = 1$, and $\hat{p}_i = \exp\big( \psi(a_i + m_i) - \psi(|a + m|) \big)$ for $\delta = 0$, where $\psi$ is the digamma function. The $\delta$-estimate $\hat{q} \in \mathcal{P}$ is given by normalising $\hat{p}$.
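The Pochhammer factor $(a)_b = \Gamma(a+b)/\Gamma(a)$ in Example 1 can be computed stably via the log-Gamma function, so the $\delta$-estimate is a few lines of code (the function names are ours; the $\delta = 0$ case would additionally need the digamma function and is omitted here).

```python
import math

def poch(x, b):
    """Pochhammer factor (x)_b = Gamma(x + b) / Gamma(x), via lgamma."""
    return math.exp(math.lgamma(x + b) - math.lgamma(x))

def multinomial_delta_estimate(a, m, d):
    """Unnormalised delta-estimate: (p_i)**d = (a_i + m_i)_d / (|a + m|)_d."""
    total = sum(a) + sum(m)
    return [(poch(ai + mi, d) / poch(total, d))**(1.0 / d)
            for ai, mi in zip(a, m)]

# d = 1 recovers the usual posterior mean (a_i + m_i) / |a + m|
phat = multinomial_delta_estimate([1, 1, 1], [3, 1, 0], 1.0)
assert all(abs(x - y) < 1e-12 for x, y in zip(phat, [4/7, 2/7, 1/7]))

# the estimate in P is obtained by normalising phat
qhat = [x / sum(phat) for x in phat]
```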
Example 2
Let $z, \mu \in \mathbb{R}$, $h \in \mathbb{R}_+$, $a \in \mathbb{R}$, $n \in \mathbb{R}_+$. Consider the Gaussian family of distributions $f(z|\mu) = N(z - \mu \,|\, h)$, with fixed variance $\sigma^2 = 1/h$. Let the prior be another Gaussian $f(\mu) = N(\mu - a \,|\, nh)$. Then the posterior after seeing a sample $z$ of size $k$ is also a Gaussian $f(\mu|z) = N(\mu - a_k \,|\, n_k h)$, where $n_k = n + k$, $a_k = (na + \sum z)/n_k$, which is also the posterior least squares estimate. The $\delta$-estimate $\hat{q} \in \mathcal{P}$ is given by the density $f(z'|\hat{q}) = N\big( z' - a_k \,\big|\, h/(1 + \delta/n_k) \big)$.
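The Gaussian posterior update and the precision of the $\delta$-estimate in Example 2 are one-liners; the sketch below is our illustration of the formulas, not code from the paper.

```python
def gaussian_delta_estimate(sample, a, n, h, d):
    """Posterior n_k = n + k, a_k = (n*a + sum(z)) / n_k; the delta-estimate
    density is N(z' - a_k | h / (1 + d/n_k)).  Returns (mean, precision)."""
    k = len(sample)
    n_k = n + k
    a_k = (n * a + sum(sample)) / n_k
    return a_k, h / (1.0 + d / n_k)

mean, prec = gaussian_delta_estimate([1.0, 2.0, 3.0], a=0.0, n=1.0, h=1.0, d=1.0)
assert abs(mean - 1.5) < 1e-12      # a_k = (0 + 6) / 4
assert abs(prec - 0.8) < 1e-12      # h / (1 + 1/4)
```

Note that $\delta = 0$ leaves the precision at $h$ (the plug-in density at the posterior mean), while $\delta = 1$ gives the broader posterior marginal, consistent with the 1-estimate of Theorem 3.7.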
The entities $|a|$ for the multinomial model and $n$ for the Gaussian model are effective previous sample sizes, a fact known since Fisher's time. In a restricted model, the sample size might not be well reflected, and some ancillary statistics may be used for information recovery [2].
Example 3
In some Bayesian methods, such as the Monte Carlo method [10], no estimator is explicitly given. Instead, the posterior is directly used for sampling $p$. This produces a prediction distribution on test data which is the posterior marginal distribution. Therefore these methods are implicitly 1-estimators.
Example 4
Multilayer neural networks are usually not $\delta$-convex for any $\delta$, and there may exist local optima of $E(q|z)$ on $\mathcal{Q}$. A practical learning rule is usually a gradient descent rule which moves $w$ in the direction which reduces $E(q|z)$. The 1-divergence can be minimised by a supervised learning rule, the Boltzmann machine learning rule [1]. The 0-divergence can be minimised by a reinforcement learning rule, the simulated annealing reinforcement learning rule for stochastic networks [13].

    \operatorname{Min}_q K(p, q) \iff \Delta w \propto \langle \partial_w l_0(q) \rangle_p - \langle \partial_w l_0(q) \rangle_q    (11)

    \operatorname{Min}_q K(q, p) \iff \Delta w \propto \langle \partial_w l_0(q), \; l_0(p) - l_0(q) \rangle_q    (12)
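For a simple exponential-family case, rule (11) reduces to a familiar gradient: with $q = \mathrm{softmax}(w)$ one gets $\langle \partial_w l_0(q) \rangle_p - \langle \partial_w l_0(q) \rangle_q = p - q$, since the $q$-average of the score vanishes. A minimal sketch of this special case (our construction, not a Boltzmann machine):

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def kl_step(p, w, lr=0.5):
    """One gradient step minimising K(p, q) over w, rule (11):
    the update direction <d/dw log q>_p - <d/dw log q>_q equals p - q here."""
    return w + lr * (p - softmax(w))

p = np.array([0.2, 0.3, 0.5])
w = np.zeros(3)
for _ in range(5000):
    w = kl_step(p, w)
assert np.max(np.abs(softmax(w) - p)) < 1e-3   # q has converged to p
```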