A practical Bayesian framework for backpropagation networks

David J. C. MacKay
Neural Computation, 01 May 1992, Vol. 4, Iss. 3, pp. 448-472

Communicated by David Haussler

A Practical Bayesian Framework for Backpropagation Networks

David J. C. MacKay*
Computation and Neural Systems, California Institute of Technology 139-74, Pasadena, CA 91125 USA
A quantitative and practical Bayesian framework is described for learning of mappings in feedforward networks. The framework makes possible (1) objective comparisons between solutions using alternative network architectures, (2) objective stopping rules for network pruning or growing procedures, (3) objective choice of magnitude and type of weight decay terms or additive regularizers (for penalizing large weights, etc.), (4) a measure of the effective number of well-determined parameters in a model, (5) quantified estimates of the error bars on network parameters and on network output, and (6) objective comparisons with alternative learning and interpolation models such as splines and radial basis functions. The Bayesian "evidence" automatically embodies "Occam's razor," penalizing overflexible and overcomplex models. The Bayesian approach helps detect poor underlying assumptions in learning models. For learning models well matched to a problem, a good correlation between generalization ability and the Bayesian evidence is obtained.
This paper makes use of the Bayesian framework for regularization and model comparison described in the companion paper "Bayesian Interpolation" (MacKay 1992a). This framework is due to Gull and Skilling (Gull 1989).
1 The Gaps in Backprop
There are many knobs on the black box of "backprop" [learning by backpropagation of errors (Rumelhart et al. 1986)]. Generally these knobs are set by rules of thumb, trial and error, and the use of reserved test data to assess generalization ability (or more sophisticated cross-validation). The knobs fall into two classes: (1) parameters that change the effective learning model, for example, number of hidden units, and weight decay terms; and (2) parameters concerned with function optimization technique, for example, "momentum" terms. This paper is concerned with making objective the choice of the parameters in the first class, and with ranking alternative solutions to a learning problem in a way that makes full use of all the available data. Bayesian techniques will be described that are both theoretically well-founded and practically implementable.

*Present address: Darwin College, Cambridge CB3 9EU, U.K.
Let us review the basic framework for learning in networks, then discuss the points at which objective techniques are needed. The training set for the mapping to be learned is a set of input-target pairs D = {x^m, t^m}, where m is a label running over the pairs. A neural network architecture A is invented, consisting of a specification of the number of layers, the number of units in each layer, the type of activation function performed by each unit, and the available connections between the units. If a set of values w is assigned to the connections in the network, the network defines a mapping y(x; w, A) from the input activities x to the output activities y.¹ The distance of this mapping to the training set is measured by some error function; for example, the error for the entire data set is commonly taken to be

$$E_D(D \mid \mathbf{w}, \mathcal{A}) = \sum_m \tfrac{1}{2} \left[ \mathbf{y}(\mathbf{x}^m; \mathbf{w}, \mathcal{A}) - \mathbf{t}^m \right]^2 \qquad (1.1)$$
The task of "learning" is to find a set of connections w that gives a mapping that fits the training set well, that is, has small error E_D; it is also hoped that the learned connections will "generalize" well to new examples. Plain backpropagation learns by performing gradient descent on E_D in w-space. Modifications include the addition of a "momentum" term, and the inclusion of noise in the descent process. More efficient optimization techniques may also be used, such as conjugate gradients or variable metric methods. This paper will not discuss computational modifications concerned only with speeding the optimization. It will address, however, those modifications to the plain backprop algorithm that implicitly or explicitly modify the objective function, with decay terms or regularizers.
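As a concrete illustration (my sketch, not code from the paper), the setup so far can be written in a few lines of Python/NumPy: a one-hidden-layer network y(x; w, A), the data error E_D of equation 1.1, and plain gradient descent on E_D. The architecture, data, and step size are arbitrary placeholders, and a finite-difference gradient stands in for analytic backpropagation for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set D = {x^m, t^m}: 20 scalar pairs from a noisy sine.
X = rng.uniform(-3, 3, size=(20, 1))
T = np.sin(X) + 0.1 * rng.normal(size=X.shape)

# Architecture A: 1 input, H tanh hidden units, 1 linear output.
# The vector w packs all weights and biases (k = 3H + 1 parameters).
H = 5
def unpack(w):
    W1 = w[:H].reshape(H, 1); b1 = w[H:2*H]
    W2 = w[2*H:3*H].reshape(1, H); b2 = w[3*H:]
    return W1, b1, W2, b2

def y(x, w):
    # The mapping y(x; w, A) from input activities to output activities.
    W1, b1, W2, b2 = unpack(w)
    return (W2 @ np.tanh(W1 @ x.T + b1[:, None])).T + b2

def E_D(w):
    # Equation 1.1: sum over pairs m of (1/2) [y(x^m; w, A) - t^m]^2.
    return 0.5 * np.sum((y(X, w) - T) ** 2)

def grad(f, w, eps=1e-6):
    # Finite-difference gradient (stand-in for backpropagation).
    g = np.zeros_like(w)
    for i in range(w.size):
        dw = np.zeros_like(w); dw[i] = eps
        g[i] = (f(w + dw) - f(w - dw)) / (2 * eps)
    return g

w = 0.1 * rng.normal(size=3*H + 1)
for step in range(2000):      # plain gradient descent on E_D in w-space
    w -= 0.01 * grad(E_D, w)
print("final E_D:", E_D(w))
```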
It is moderately common for extra regularizing terms E_W(w) to be added to E_D; for example, terms that penalize large weights may be introduced, in the hope of achieving a smoother or simpler mapping (Hinton and Sejnowski 1986; Ji et al. 1990; Nowlan 1991; Rumelhart 1987; Weigend et al. 1991). Some of the "hints" in Abu-Mostafa (1990b) also fall into the category of additive weight-dependent energies. A sample weight energy term is

$$E_W(\mathbf{w} \mid \mathcal{A}) = \sum_i \tfrac{1}{2} w_i^2 \qquad (1.2)$$
¹The framework developed in this paper will apply not only to networks composed of "neurons," but to any regression model for which we can compute the derivatives of the outputs with respect to the parameters, ∂y(x; w, A)/∂w.

The weight energy may be implicit; for example, "weight decay" (subtraction of a multiple of w in the weight change rule) corresponds to the energy in equation 1.2. Gradient-based optimization is then used to minimize the combined function

$$M = \alpha E_W(\mathbf{w} \mid \mathcal{A}) + \beta E_D(D \mid \mathbf{w}, \mathcal{A}) \qquad (1.3)$$

where α and β are "black box" parameters.
The constant α should not be confused with the "momentum" parameter sometimes introduced into backprop; in the present context α is a decay rate or regularizing constant. Also note that α should not be viewed as causing "forgetting"; E_D is defined as the error on the entire data set, so gradient descent on M treats all data points equally irrespective of the order in which they were acquired.
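Continuing the earlier sketch (again my illustration, with α and β as arbitrary placeholder values), the combined objective of equation 1.3 adds the quadratic weight energy of equation 1.2 to the data error. Since dE_W/dw = w, each gradient step on M subtracts a multiple of w from the weights, which is exactly the implicit "weight decay" rule just described.

```python
alpha, beta = 0.01, 1.0   # "black box" regularizing constants (placeholders)

def E_W(w):
    # Equation 1.2: sum over weights of (1/2) w_i^2.
    return 0.5 * np.sum(w ** 2)

def M(w):
    # Equation 1.3: M = alpha * E_W + beta * E_D.
    return alpha * E_W(w) + beta * E_D(w)

# A gradient step on M,
#   w <- w - eta * (alpha * w + beta * dE_D/dw),
# is gradient descent on E_D plus subtraction of a multiple of w:
# "weight decay" as an explicit energy term.
eta = 0.01
for step in range(2000):
    w = w - eta * (alpha * w + beta * grad(E_D, w))
print("final M:", M(w))
```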
1.1 What Is Lacking. The above procedures include a host of free parameters such as the choice of neural network architecture, and of the regularizing constant α. There are not yet established ways of objectively setting these parameters, though there are many rules of thumb (see Ji et al. 1990; Weigend et al. 1991, for examples).
One popular way of comparing networks trained with different parameter values is to assess their performance by measuring the error on an unseen test set or by similar cross-validation techniques. The data are divided into two sets: a training set that is used to optimize the parameters w of the network, and a test set that is used to optimize control parameters such as α and the architecture A. However, the utility of these techniques in determining values for the parameters α and β, or for comparing alternative network solutions, etc., is limited because a large test set may be needed to reduce the signal-to-noise ratio in the test error, and cross-validation is computationally demanding. Furthermore, if there are several parameters such as α and β, it is out of the question to optimize such parameters by repeating the learning with all possible values of these parameters and using a test set. Such parameters must be optimized on line.
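To make the cost being criticized concrete, here is a hedged sketch (my illustration, reusing the names from the earlier snippets, with an arbitrary candidate grid) of setting α by exhaustive search against a held-out test set: every candidate value requires a complete training run, and the held-out pairs are lost to training.

```python
# Hypothetical grid search over alpha using a reserved test set.
# Each candidate costs a full training run -- the expense the text
# argues against, and the motivation for optimizing alpha on line.
X_train, T_train = X[:15], T[:15]      # sacrifice 5 pairs as a test set
X_test,  T_test  = X[15:], T[15:]

def train(alpha, steps=2000, eta=0.01):
    w = 0.1 * rng.normal(size=3*H + 1)
    E = lambda w: 0.5 * np.sum((y(X_train, w) - T_train) ** 2)
    for _ in range(steps):
        w = w - eta * (alpha * w + grad(E, w))
    return w

best = min((0.001, 0.01, 0.1, 1.0),
           key=lambda a: np.sum((y(X_test, train(a)) - T_test) ** 2))
print("alpha chosen by test error:", best)
```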
It is, therefore, interesting to study objective criteria for setting free parameters and comparing alternative solutions that depend only on the data set used for the training. Such criteria will prove especially important in applications where the total amount of data is limited, so that one does not want to sacrifice good data for use as a test set. Rather, we wish to find a way to use all our data both in the process of optimizing the parameters w and in the process of optimizing control parameters such as α and A.
This paper will describe practical Bayesian methods for filling the following holes in the neural network framework just described:

1. Objective criteria for comparing alternative neural network solutions, in particular with different architectures A. Given a single architecture A, there may be more than one minimum of the objective function M. If there is a large disparity in M between the minima then it is plausible to choose the solution with smallest M. But where the difference is not so great it is desirable to be able to assign an objective preference to the alternatives. It is also desirable to be able to assign preferences to neural network solutions using different numbers of hidden units, and different activation functions. Here there is an "Occam's razor" problem: the more free parameters a model has, the smaller the data error E_D it can achieve. So we cannot simply choose the architecture with smallest data error; that would lead us to an overcomplex network that generalizes poorly. The use of weight decay does not fully alleviate this problem; networks with too many hidden units still generalize worse, even if weight decay is used (see Section 4).
2. Objective criteria for setting the decay rate α. As in the choice of A above, there is an "Occam's razor" problem: a small value of α in equation 1.3 allows the weights to become large and overfit the noise in the data. This leads to a small value of the data error E_D (and a small value of M), so we cannot base our choice of α only on E_D or M. The Bayesian solution presented here can be implemented on-line; that is, it is not necessary to do multiple learning runs with different values of α in order to find the best.
3. Objective choice of regularizing function E_W.

4. Objective criteria for choosing between a neural network solution and a solution using a different learning or interpolation model, for example, splines or radial basis functions.
1.2 The Probability Connection. Tishby et al. (1989) introduced a probabilistic view of learning that is an important step toward solving the problems listed above. The idea is to force a probabilistic interpretation onto the neural network technique so as to be able to make objective statements. This interpretation does not involve the addition of any new arbitrary functions or parameters, but it involves assigning a meaning to the functions and parameters that are already used.

My work is based on the same probabilistic framework, and extends it using concepts and techniques adapted from Gull and Skilling's (Gull 1989) Bayesian image reconstruction methods. This paper also adopts a shift in emphasis from Tishby et al.'s paper. Their work concentrated on predicting the average generalization ability of one network trained on a task drawn from a known prior ensemble of tasks. This is called forward probability. In this paper the emphasis will be on quantifying the relative plausibilities of many alternative solutions to an interpolation or classification task; that task is defined by a single data set produced by the real world, and we do not know the prior ensemble from which the task comes. This is called inverse probability. This paper avoids using the language of statistical physics, in order to maintain wider readability, and to avoid concepts that would sound strange in that language; for example, "the probability distribution of the temperature" is unfamiliar in physics, but "the probability distribution of the noise variance" is its innocent counterpart in literal terms.
Let us now review the probabilistic interpretation of network learning.
• Likelihood. A network with specified architecture A and connections w is viewed as making predictions about the target outputs as a function of input x in accordance with the probability distribution

$$P(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta, \mathcal{A}) = \frac{\exp[-\beta E(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \mathcal{A})]}{Z_m(\beta)} \qquad (1.4)$$

where Z_m(β) = ∫dt exp(−βE). E is the error for a single datum, and β is a measure of the presumed noise included in t. If E is the quadratic error function then this corresponds to the assumption that t includes additive gaussian noise with variance σ_ν² = 1/β.
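The gaussian correspondence can be made explicit with a one-line derivation (my addition, using the quadratic single-datum error E = ½(y − t)² implied by equation 1.1, for a scalar target):

$$Z_m(\beta) = \int \exp\!\left[-\tfrac{\beta}{2}(y - t)^2\right] dt = \sqrt{2\pi/\beta}, \qquad P(t \mid \mathbf{x}, \mathbf{w}, \beta, \mathcal{A}) = \sqrt{\tfrac{\beta}{2\pi}}\, \exp\!\left[-\tfrac{\beta}{2}(t - y)^2\right],$$

which is exactly a gaussian centered on the network output y with variance σ_ν² = 1/β.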
• Prior. A prior probability is assigned to alternative network connection strengths w, written in the form

$$P(\mathbf{w} \mid \alpha, \mathcal{A}, R) = \frac{\exp[-\alpha E_W(\mathbf{w} \mid \mathcal{A})]}{Z_W(\alpha)} \qquad (1.5)$$

where Z_W(α) = ∫d^k w exp(−αE_W). Here α is a measure of the characteristic expected connection magnitude. If E_W is quadratic as specified in equation 1.2 then weights are expected to come from a gaussian with zero mean and variance σ_W² = 1/α. Alternative "regularizers" R (each using a different energy function E_W) implicitly correspond to alternative hypotheses about the statistics of the environment.
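For the quadratic E_W of equation 1.2 the normalizer also evaluates in closed form (my addition, with k the number of parameters):

$$Z_W(\alpha) = \int d^k\mathbf{w}\, \exp\Big(-\alpha \sum_i \tfrac{1}{2} w_i^2\Big) = \left(\frac{2\pi}{\alpha}\right)^{k/2},$$

so the prior factorizes into independent zero-mean gaussians, each with variance σ_W² = 1/α, as stated.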
• The posterior probability of the network connections w is then

$$P(\mathbf{w} \mid D, \alpha, \beta, \mathcal{A}, R) = \frac{\exp[-\alpha E_W - \beta E_D]}{Z_M(\alpha, \beta)} \qquad (1.6)$$

where Z_M(α, β) = ∫d^k w exp(−αE_W − βE_D). Notice that the exponent in this expression is the same as (minus) the objective function M defined in equation 1.3.
So under this framework, minimization of M = αE_W + βE_D is identical to finding the (locally) most probable parameters w_MP; minimization of E_D alone is identical to finding the maximum likelihood parameters w_ML. Thus an interpretation has been given to backpropagation's energy functions E_D and E_W, and to the parameters α and β.
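This identification is easy to check numerically. The sketch below (my illustration, reusing names from the earlier snippets) evaluates the unnormalized log posterior of equation 1.6 at the trained weights and at nearby perturbations: since Z_M(α, β) does not depend on w, log P(w | D, α, β, A, R) = −M(w) + const, so the minimizer of M is the (locally) most probable w.

```python
# Unnormalized log posterior from equation 1.6: Z_M(alpha, beta)
# is constant in w, so log P(w | D, ...) = -M(w) + const.
def log_posterior_unnorm(w):
    return -M(w)

w_mp = w  # weights obtained by minimizing M above
for trial in range(5):
    w_near = w_mp + 0.1 * rng.normal(size=w_mp.shape)
    # Perturbations should raise M and hence lower the posterior
    # (locally): minimizing M == finding the most probable weights.
    print(M(w_near) > M(w_mp))
```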

Citations

Neural networks for pattern recognition (book). The first comprehensive treatment of feedforward neural networks from the perspective of statistical pattern recognition; designed as a text, with over 100 exercises.

Deep learning in neural networks (journal article). A historical survey that compactly summarizes relevant work, much of it from the previous millennium, reviewing deep supervised learning, unsupervised learning, reinforcement learning, evolutionary computation, and indirect search for short programs encoding deep and large networks.

Pattern Recognition and Machine Learning (book). Covers probability distributions and linear models for regression and classification, along with a discussion of combining models in the context of machine learning.

Information Theory, Inference and Learning Algorithms (book). A textbook on the mathematics underpinning the most dynamic areas of modern science and engineering.

Bayesian interpolation (journal article). Demonstrates the Bayesian approach to regularization and model comparison by studying the problem of interpolating noisy data, examining the posterior probability distribution of regularizing constants and noise levels.
References

Learning representations by back-propagating errors (journal article). Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.

Bayesian interpolation (journal article). Demonstrates the Bayesian approach to regularization and model comparison by studying the problem of interpolating noisy data, examining the posterior probability distribution of regularizing constants and noise levels.

Optimal Brain Damage (conference proceedings). Derives a class of practical and nearly optimal schemes for adapting the size of a neural network, using second-derivative information to trade off network complexity against training-set error.

Learning and relearning in Boltzmann machines (book chapter). Contains sections on relaxation searches, easy and hard learning, the Boltzmann machine learning algorithm, an example of hard learning, achieving reliable computation with unreliable hardware, and an example of the effects of damage.

The evidence framework applied to classification networks (journal article). Demonstrates that the Bayesian framework for model comparison described for regression models in MacKay (1992a,b) can also be applied to classification problems, and derives an information-based data selection criterion within this framework.