A practical Bayesian framework for backpropagation networks

David J. C. MacKay
Neural Computation, 01 May 1992, Vol. 4, Iss. 3, pp. 448-472

Communicated by David Haussler

A Practical Bayesian Framework for Backpropagation Networks

David J. C. MacKay*
Computation and Neural Systems, California Institute of Technology 139-74, Pasadena, CA 91125 USA
A quantitative and practical Bayesian framework is described for learning of mappings in feedforward networks. The framework makes possible (1) objective comparisons between solutions using alternative network architectures, (2) objective stopping rules for network pruning or growing procedures, (3) objective choice of magnitude and type of weight decay terms or additive regularizers (for penalizing large weights, etc.), (4) a measure of the effective number of well-determined parameters in a model, (5) quantified estimates of the error bars on network parameters and on network output, and (6) objective comparisons with alternative learning and interpolation models such as splines and radial basis functions. The Bayesian "evidence" automatically embodies "Occam's razor," penalizing overflexible and overcomplex models. The Bayesian approach helps detect poor underlying assumptions in learning models. For learning models well matched to a problem, a good correlation between generalization ability and the Bayesian evidence is obtained.
This paper makes use of the Bayesian framework for regularization and model comparison described in the companion paper "Bayesian Interpolation" (MacKay 1992a). This framework is due to Gull and Skilling (Gull 1989).
1 The Gaps in Backprop
There are many knobs on the black box of "backprop" [learning by backpropagation of errors (Rumelhart et al. 1986)]. Generally these knobs are set by rules of thumb, trial and error, and the use of reserved test data to assess generalization ability (or more sophisticated cross-validation). The knobs fall into two classes: (1) parameters that change the effective learning model, for example, number of hidden units, and weight decay terms; and (2) parameters concerned with function optimization technique, for example, "momentum" terms. This paper is concerned with making objective the choice of the parameters in the first class, and with ranking alternative solutions to a learning problem in a way that makes full use of all the available data. Bayesian techniques will be described that are both theoretically well-founded and practically implementable.

*Present address: Darwin College, Cambridge CB3 9EU, U.K.
Let us review the basic framework for learning in networks, then discuss the points at which objective techniques are needed. The training set for the mapping to be learned is a set of input-target pairs D = {x^m, t^m}, where m is a label running over the pairs. A neural network architecture A is invented, consisting of a specification of the number of layers, the number of units in each layer, the type of activation function performed by each unit, and the available connections between the units. If a set of values w is assigned to the connections in the network, the network defines a mapping y(x; w, A) from the input activities x to the output activities y.¹ The distance of this mapping to the training set is measured by some error function; for example, the error for the entire data set is commonly taken to be

$$E_D(D \mid \mathbf{w}, \mathcal{A}) = \sum_m \tfrac{1}{2} \left[ \mathbf{y}(\mathbf{x}^m; \mathbf{w}, \mathcal{A}) - \mathbf{t}^m \right]^2 \qquad (1.1)$$
The task of "learning" is to find a set of connections w that gives a mapping that fits the training set well, that is, has small error E_D; it is also hoped that the learned connections will "generalize" well to new examples. Plain backpropagation learns by performing gradient descent on E_D in w-space. Modifications include the addition of a "momentum" term, and the inclusion of noise in the descent process. More efficient optimization techniques may also be used, such as conjugate gradients or variable metric methods. This paper will not discuss computational modifications concerned only with speeding the optimization. It will address, however, those modifications to the plain backprop algorithm that implicitly or explicitly modify the objective function, with decay terms or regularizers.
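As a concrete illustration (my sketch, not code from the paper), the setup so far can be written in a few lines of Python/NumPy: a one-hidden-layer network y(x; w, A), the data error E_D of equation 1.1, and plain gradient descent on E_D. The architecture, data, and step size are arbitrary placeholders, and a finite-difference gradient stands in for analytic backpropagation for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set D = {x^m, t^m}: 20 scalar pairs from a noisy sine.
X = rng.uniform(-3, 3, size=(20, 1))
T = np.sin(X) + 0.1 * rng.normal(size=X.shape)

# Architecture A: 1 input, H tanh hidden units, 1 linear output.
# The vector w packs all weights and biases (k = 3H + 1 parameters).
H = 5
def unpack(w):
    W1 = w[:H].reshape(H, 1); b1 = w[H:2*H]
    W2 = w[2*H:3*H].reshape(1, H); b2 = w[3*H:]
    return W1, b1, W2, b2

def y(x, w):
    # The mapping y(x; w, A) from input activities to output activities.
    W1, b1, W2, b2 = unpack(w)
    return (W2 @ np.tanh(W1 @ x.T + b1[:, None])).T + b2

def E_D(w):
    # Equation 1.1: sum over pairs m of (1/2) [y(x^m; w, A) - t^m]^2.
    return 0.5 * np.sum((y(X, w) - T) ** 2)

def grad(f, w, eps=1e-6):
    # Finite-difference gradient (stand-in for backpropagation).
    g = np.zeros_like(w)
    for i in range(w.size):
        dw = np.zeros_like(w); dw[i] = eps
        g[i] = (f(w + dw) - f(w - dw)) / (2 * eps)
    return g

w = 0.1 * rng.normal(size=3*H + 1)
for step in range(2000):      # plain gradient descent on E_D in w-space
    w -= 0.01 * grad(E_D, w)
print("final E_D:", E_D(w))
```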
It is moderately common for extra regularizing terms E_W(w) to be added to E_D; for example, terms that penalize large weights may be introduced, in the hope of achieving a smoother or simpler mapping (Hinton and Sejnowski 1986; Ji et al. 1990; Nowlan 1991; Rumelhart 1987; Weigend et al. 1991). Some of the "hints" in Abu-Mostafa (1990b) also fall into the category of additive weight-dependent energies. A sample weight energy term is

$$E_W(\mathbf{w} \mid \mathcal{A}) = \sum_i \tfrac{1}{2} w_i^2 \qquad (1.2)$$
¹The framework developed in this paper will apply not only to networks composed of "neurons," but to any regression model for which we can compute the derivatives of the outputs with respect to the parameters, ∂y(x; w, A)/∂w.

The weight energy may be implicit; for example, "weight decay" (subtraction of a multiple of w in the weight change rule) corresponds to the energy in equation 1.2. Gradient-based optimization is then used to minimize the combined function

$$M = \alpha E_W(\mathbf{w} \mid \mathcal{A}) + \beta E_D(D \mid \mathbf{w}, \mathcal{A}) \qquad (1.3)$$

where α and β are "black box" parameters.
The constant α should not be confused with the "momentum" parameter sometimes introduced into backprop; in the present context α is a decay rate or regularizing constant. Also note that α should not be viewed as causing "forgetting"; E_D is defined as the error on the entire data set, so gradient descent on M treats all data points equally irrespective of the order in which they were acquired.
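Continuing the earlier sketch (again my illustration, with α and β as arbitrary placeholder values), the combined objective of equation 1.3 adds the quadratic weight energy of equation 1.2 to the data error. Since dE_W/dw = w, each gradient step on M subtracts a multiple of w from the weights, which is exactly the implicit "weight decay" rule just described.

```python
alpha, beta = 0.01, 1.0   # "black box" regularizing constants (placeholders)

def E_W(w):
    # Equation 1.2: sum over weights of (1/2) w_i^2.
    return 0.5 * np.sum(w ** 2)

def M(w):
    # Equation 1.3: M = alpha * E_W + beta * E_D.
    return alpha * E_W(w) + beta * E_D(w)

# A gradient step on M,
#   w <- w - eta * (alpha * w + beta * dE_D/dw),
# is gradient descent on E_D plus subtraction of a multiple of w:
# "weight decay" as an explicit energy term.
eta = 0.01
for step in range(2000):
    w = w - eta * (alpha * w + beta * grad(E_D, w))
print("final M:", M(w))
```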
1.1 What Is Lacking. The above procedures include a host of free parameters such as the choice of neural network architecture, and of the regularizing constant α. There are not yet established ways of objectively setting these parameters, though there are many rules of thumb (see Ji et al. 1990; Weigend et al. 1991, for examples).
One popular way of comparing networks trained with different parameter values is to assess their performance by measuring the error on an unseen test set or by similar cross-validation techniques. The data are divided into two sets: a training set that is used to optimize the parameters w of the network, and a test set that is used to optimize control parameters such as α and the architecture A. However, the utility of these techniques in determining values for the parameters α and β, or for comparing alternative network solutions, etc., is limited because a large test set may be needed to reduce the signal-to-noise ratio in the test error, and cross-validation is computationally demanding. Furthermore, if there are several parameters such as α and β, it is out of the question to optimize such parameters by repeating the learning with all possible values of these parameters and using a test set. Such parameters must be optimized on line.
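To make the cost being criticized concrete, here is a hedged sketch (my illustration, reusing the names from the earlier snippets, with an arbitrary candidate grid) of setting α by exhaustive search against a held-out test set: every candidate value requires a complete training run, and the held-out pairs are lost to training.

```python
# Hypothetical grid search over alpha using a reserved test set.
# Each candidate costs a full training run -- the expense the text
# argues against, and the motivation for optimizing alpha on line.
X_train, T_train = X[:15], T[:15]      # sacrifice 5 pairs as a test set
X_test,  T_test  = X[15:], T[15:]

def train(alpha, steps=2000, eta=0.01):
    w = 0.1 * rng.normal(size=3*H + 1)
    E = lambda w: 0.5 * np.sum((y(X_train, w) - T_train) ** 2)
    for _ in range(steps):
        w = w - eta * (alpha * w + grad(E, w))
    return w

best = min((0.001, 0.01, 0.1, 1.0),
           key=lambda a: np.sum((y(X_test, train(a)) - T_test) ** 2))
print("alpha chosen by test error:", best)
```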
It is, therefore, interesting to study objective criteria for setting free parameters and comparing alternative solutions that depend only on the data set used for the training. Such criteria will prove especially important in applications where the total amount of data is limited, so that one does not want to sacrifice good data for use as a test set. Rather, we wish to find a way to use all our data both in the process of optimizing the parameters w and in the process of optimizing control parameters such as α and A.
This paper will describe practical Bayesian methods for filling the following holes in the neural network framework just described:

1. Objective criteria for comparing alternative neural network solutions, in particular with different architectures A. Given a single architecture A, there may be more than one minimum of the objective function M. If there is a large disparity in M between the minima then it is plausible to choose the solution with smallest M. But where the difference is not so great it is desirable to be able to assign an objective preference to the alternatives. It is also desirable to be able to assign preferences to neural network solutions using different numbers of hidden units, and different activation functions. Here there is an "Occam's razor" problem: the more free parameters a model has, the smaller the data error E_D it can achieve. So we cannot simply choose the architecture with smallest data error; that would lead us to an overcomplex network that generalizes poorly. The use of weight decay does not fully alleviate this problem; networks with too many hidden units still generalize worse, even if weight decay is used (see Section 4).
2. Objective criteria for setting the decay rate α. As in the choice of A above, there is an "Occam's razor" problem: a small value of α in equation 1.3 allows the weights to become large and overfit the noise in the data. This leads to a small value of the data error E_D (and a small value of M), so we cannot base our choice of α only on E_D or M. The Bayesian solution presented here can be implemented on-line; that is, it is not necessary to do multiple learning runs with different values of α in order to find the best.
3. Objective choice of regularizing function E_W.

4. Objective criteria for choosing between a neural network solution and a solution using a different learning or interpolation model, for example, splines or radial basis functions.
1.2 The Probability Connection. Tishby et al. (1989) introduced a probabilistic view of learning that is an important step toward solving the problems listed above. The idea is to force a probabilistic interpretation onto the neural network technique so as to be able to make objective statements. This interpretation does not involve the addition of any new arbitrary functions or parameters, but it involves assigning a meaning to the functions and parameters that are already used.

My work is based on the same probabilistic framework, and extends it using concepts and techniques adapted from Gull and Skilling's (Gull 1989) Bayesian image reconstruction methods. This paper also adopts a shift in emphasis from Tishby et al.'s paper. Their work concentrated on predicting the average generalization ability of one network trained on a task drawn from a known prior ensemble of tasks. This is called forward probability. In this paper the emphasis will be on quantifying the relative plausibilities of many alternative solutions to an interpolation or classification task; that task is defined by a single data set produced by the real world, and we do not know the prior ensemble from which the task comes. This is called inverse probability. This paper avoids using the language of statistical physics, in order to maintain wider readability, and to avoid concepts that would sound strange in that language; for example, "the probability distribution of the temperature" is unfamiliar in physics, but "the probability distribution of the noise variance" is its innocent counterpart in literal terms.
Let us now review the probabilistic interpretation of network learning.
• Likelihood. A network with specified architecture A and connections w is viewed as making predictions about the target outputs as a function of input x in accordance with the probability distribution

$$P(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta, \mathcal{A}) = \frac{\exp[-\beta E(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \mathcal{A})]}{Z_m(\beta)} \qquad (1.4)$$

where Z_m(β) = ∫dt exp(−βE). E is the error for a single datum, and β is a measure of the presumed noise included in t. If E is the quadratic error function then this corresponds to the assumption that t includes additive gaussian noise with variance σ_ν² = 1/β.
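The gaussian correspondence can be made explicit with a one-line derivation (my addition, using the quadratic single-datum error E = ½(y − t)² implied by equation 1.1, for a scalar target):

$$Z_m(\beta) = \int \exp\!\left[-\tfrac{\beta}{2}(y - t)^2\right] dt = \sqrt{2\pi/\beta}, \qquad P(t \mid \mathbf{x}, \mathbf{w}, \beta, \mathcal{A}) = \sqrt{\tfrac{\beta}{2\pi}}\, \exp\!\left[-\tfrac{\beta}{2}(t - y)^2\right],$$

which is exactly a gaussian centered on the network output y with variance σ_ν² = 1/β.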
• Prior. A prior probability is assigned to alternative network connection strengths w, written in the form

$$P(\mathbf{w} \mid \alpha, \mathcal{A}, R) = \frac{\exp[-\alpha E_W(\mathbf{w} \mid \mathcal{A})]}{Z_W(\alpha)} \qquad (1.5)$$

where Z_W(α) = ∫d^k w exp(−αE_W). Here α is a measure of the characteristic expected connection magnitude. If E_W is quadratic as specified in equation 1.2 then weights are expected to come from a gaussian with zero mean and variance σ_W² = 1/α. Alternative "regularizers" R (each using a different energy function E_W) implicitly correspond to alternative hypotheses about the statistics of the environment.
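For the quadratic E_W of equation 1.2 the normalizer also evaluates in closed form (my addition, with k the number of parameters):

$$Z_W(\alpha) = \int d^k\mathbf{w}\, \exp\Big(-\alpha \sum_i \tfrac{1}{2} w_i^2\Big) = \left(\frac{2\pi}{\alpha}\right)^{k/2},$$

so the prior factorizes into independent zero-mean gaussians, each with variance σ_W² = 1/α, as stated.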
• The posterior probability of the network connections w is then

$$P(\mathbf{w} \mid D, \alpha, \beta, \mathcal{A}, R) = \frac{\exp[-\alpha E_W - \beta E_D]}{Z_M(\alpha, \beta)} \qquad (1.6)$$

where Z_M(α, β) = ∫d^k w exp(−αE_W − βE_D). Notice that the exponent in this expression is the same as (minus) the objective function M defined in equation 1.3.
So under this framework, minimization of M = αE_W + βE_D is identical to finding the (locally) most probable parameters w_MP; minimization of E_D alone is identical to finding the maximum likelihood parameters w_ML. Thus an interpretation has been given to backpropagation's energy functions E_D and E_W, and to the parameters α and β.
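This identification is easy to check numerically. The sketch below (my illustration, reusing names from the earlier snippets) evaluates the unnormalized log posterior of equation 1.6 at the trained weights and at nearby perturbations: since Z_M(α, β) does not depend on w, log P(w | D, α, β, A, R) = −M(w) + const, so the minimizer of M is the (locally) most probable w.

```python
# Unnormalized log posterior from equation 1.6: Z_M(alpha, beta)
# is constant in w, so log P(w | D, ...) = -M(w) + const.
def log_posterior_unnorm(w):
    return -M(w)

w_mp = w  # weights obtained by minimizing M above
for trial in range(5):
    w_near = w_mp + 0.1 * rng.normal(size=w_mp.shape)
    # Perturbations should raise M and hence lower the posterior
    # (locally): minimizing M == finding the most probable weights.
    print(M(w_near) > M(w_mp))
```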

Citations

Neural networks for pattern recognition (book). The first comprehensive treatment of feedforward neural networks from the perspective of statistical pattern recognition; designed as a text, with over 100 exercises.

Deep learning in neural networks (journal article). A historical survey that compactly summarizes relevant work, much of it from the previous millennium, reviewing deep supervised learning, unsupervised learning, reinforcement learning, evolutionary computation, and indirect search for short programs encoding deep and large networks.

Pattern Recognition and Machine Learning (book). Covers probability distributions and linear models for regression and classification, along with a discussion of combining models in the context of machine learning.

Information Theory, Inference and Learning Algorithms (book). A textbook on the mathematics underpinning the most dynamic areas of modern science and engineering.

Bayesian interpolation (journal article). Demonstrates the Bayesian approach to regularization and model comparison by studying the problem of interpolating noisy data, examining the posterior probability distribution of regularizing constants and noise levels.
References

Learning representations by back-propagating errors (journal article). Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.

Bayesian interpolation (journal article). Demonstrates the Bayesian approach to regularization and model comparison by studying the problem of interpolating noisy data, examining the posterior probability distribution of regularizing constants and noise levels.

Optimal Brain Damage (conference proceedings). Derives a class of practical and nearly optimal schemes for adapting the size of a neural network, using second-derivative information to trade off network complexity against training-set error.

Learning and relearning in Boltzmann machines (book chapter). Contains sections on relaxation searches, easy and hard learning, the Boltzmann machine learning algorithm, an example of hard learning, achieving reliable computation with unreliable hardware, and an example of the effects of damage.

The evidence framework applied to classification networks (journal article). Demonstrates that the Bayesian framework for model comparison described for regression models in MacKay (1992a,b) can also be applied to classification problems, and derives an information-based data selection criterion within this framework.