Proceedings of 1993 International Joint Conference on Neural Networks

Hierarchical mixtures of experts and the EM algorithm

Michael I. Jordan
Department of Brain and Cognitive Sciences
MIT
Cambridge, MA 02139

Robert A. Jacobs
Department of Psychology
University of Rochester
Rochester, NY 14627
Abstract

We present a tree-structured architecture for supervised learning. The statistical model underlying the architecture is a hierarchical mixture model in which both the mixture coefficients and the mixture components are generalized linear models (GLIM's). Learning is treated as a maximum likelihood problem; in particular, we present an Expectation-Maximization (EM) algorithm for adjusting the parameters of the architecture. We also develop an on-line learning algorithm in which the parameters are updated incrementally. Comparative simulation results are presented in the robot dynamics domain.
1 INTRODUCTION

In the statistical literature and in the machine learning literature, divide-and-conquer algorithms have become increasingly popular. The CART algorithm [1], the MARS algorithm [5], and the ID3 algorithm [12] are well-known examples. These algorithms fit surfaces to data by explicitly dividing the input space into a nested sequence of regions, and by fitting simple surfaces (e.g., constant functions) within these regions. The advantages of these algorithms include the interpretability of their solutions and the speed of the training process.

In this paper we present a neural network architecture that is a close cousin to architectures such as CART and MARS. As in our earlier work [6,7], we formulate the learning problem for this architecture as a maximum likelihood problem. In the current paper we utilize the Expectation-Maximization (EM) framework to derive the learning algorithm.
2 HIERARCHICAL MIXTURES OF EXPERTS
The algorithms that we discuss in this paper are supervised learning algorithms. We explicitly address the case of regression, in which the input vectors are elements of ℜ^m and the output vectors are elements of ℜ^n. Our model also handles classification problems and counting problems in which the outputs are integer-valued. The data are assumed to form a countable set of paired observations X = {(x^(t), y^(t))}. In the case of the batch algorithm discussed below, this set is assumed to be finite; in the case of the on-line algorithm, the set may be infinite.

[Figure 1: A two-level hierarchical mixture of experts. Two expert networks and a gating network each receive the input x.]
We propose to solve nonlinear supervised learning problems by dividing the input space into a nested set of regions and fitting simple surfaces to the data that fall in these regions. The regions have "soft" boundaries, meaning that data points may lie simultaneously in multiple regions. The boundaries between regions are themselves simple parameterized surfaces that are adjusted by the learning algorithm.
The hierarchical mixture-of-experts (HME) architecture is shown in Figure 1.¹ The architecture is a tree in which the gating networks sit at the nonterminals of the tree. These networks receive the vector x as input and produce scalar outputs that are a partition of unity at each point in the input space. The expert networks sit at the leaves of the tree. Each expert produces an output vector μ_ij for each input vector. These output vectors proceed up the tree, being multiplied by the gating network outputs and summed at the nonterminals.

¹To simplify the presentation, we restrict ourselves to a two-level hierarchy throughout the paper. All of the algorithms that we describe, however, generalize readily to hierarchies of arbitrary depth. See [9] for a recursive formalism that handles arbitrary trees.
All of the expert networks in the tree are linear with a single output nonlinearity. We will refer to such a network as "generalized linear," borrowing the terminology from statistics [11]. Expert network (i,j) produces its output μ_ij as a generalized linear function of the input x:

μ_ij = f(U_ij x),    (1)

where U_ij is a weight matrix and f is a fixed continuous nonlinearity. The vector x is assumed to include a fixed component of one to allow for an intercept term.

For regression problems, f(·) is the identity function (i.e., the experts are linear). For binary classification problems, f(·) is generally taken to be the logistic function, in which case the expert outputs are interpreted as the log odds of "success" under a Bernoulli probability model. Other models (e.g., multiway classification, counting, rate estimation and survival estimation) are handled readily by making other choices for f(·). These models are smoothed piecewise analogs of the corresponding generalized linear models (GLIM's; cf. [11]).

The gating networks are also generalized linear. At the top level, define linear predictors ξ_i as follows:

ξ_i = v_i^T x,    (2)

where v_i is a weight vector. Then the ith output of the top-level gating network is the "softmax" function of the ξ_i [2,11]:

g_i = exp(ξ_i) / Σ_k exp(ξ_k).    (3)

Note that the g_i are positive and sum to one for each x.
The gating networks at the lower level are defined similarly, yielding outputs g_{j|i} that are obtained by taking the softmax function of linear predictors ξ_ij = v_ij^T x.

The output vector at each nonterminal of the tree is the weighted output of the experts below that nonterminal. That is, the output at the ith nonterminal in the second layer of the two-level tree is:

μ_i = Σ_j g_{j|i} μ_ij,

and the output at the top level of the tree is:

μ = Σ_i g_i μ_i.

Note that both the g's and the μ's depend on the input x; thus the total output is a nonlinear function of the input.
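To make the data flow concrete, here is a small sketch of the forward pass of a two-level HME with linear experts and softmax gating networks. This is an illustration rather than the authors' code; the array shapes, the function names, and the use of NumPy are our assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hme_forward(x, U, V_top, V_low):
    """Forward pass of a two-level HME with linear experts (identity link).

    x      : (m,) input, assumed to end with a fixed 1 for the intercept term
    U      : (I, J, n, m) expert weight matrices U_ij          (Eq. 1)
    V_top  : (I, m) top-level gating weight vectors v_i        (Eqs. 2-3)
    V_low  : (I, J, m) lower-level gating weight vectors v_ij
    Returns the blended output mu together with the gating and expert outputs.
    """
    g = softmax(V_top @ x)                        # g_i: partition of unity over branches
    g_low = softmax(V_low @ x)                    # g_{j|i}: softmax within each branch i
    mu_ij = np.einsum('ijnm,m->ijn', U, x)        # expert outputs mu_ij = U_ij x
    mu_i = np.einsum('ij,ijn->in', g_low, mu_ij)  # blend experts below nonterminal i
    mu = g @ mu_i                                 # blend at the top of the tree
    return mu, g, g_low, mu_ij

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
I, J, n, m = 2, 2, 4, 13                          # e.g., 12 inputs plus an intercept
x = np.append(rng.normal(size=m - 1), 1.0)
mu, g, g_low, _ = hme_forward(x,
                              rng.normal(size=(I, J, n, m)),
                              rng.normal(size=(I, m)),
                              rng.normal(size=(I, J, m)))
assert np.isclose(g.sum(), 1.0)                   # gating outputs sum to one for each x
```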
2.1 A PROBABILITY MODEL

The hierarchy can be given a probabilistic interpretation. We suppose that the mechanism by which data are generated by the environment involves a nested sequence of decisions that terminates in a regressive process that maps x to y. The decisions are modeled as multinomial random variables. That is, for each x, we interpret the values g_i(x, v_i⁰) as the multinomial probabilities associated with the first decision and the g_{j|i}(x, v_ij⁰) as the (conditional) multinomial probabilities associated with the second decision. We use a statistical model to model these probabilities; in particular, our choice of parameterization (cf. Eqs. 2 and 3) corresponds to a log-linear probability model, a special case of a GLIM that is commonly used for "soft" multiway classification [11]. Under the log-linear model, we interpret the gating networks as modeling the input-dependent multinomial probabilities of making particular nested sequences of decisions.
Once a particular sequence of decisions has been made, output y is assumed to be generated according to the following generalized linear statistical model. First, a linear predictor η_ij⁰ is formed:

η_ij⁰ = U_ij⁰ x,

where the superscript refers to the "true" values of the parameters. The expected value of y is obtained by passing the linear predictor through the link function f:

μ_ij⁰ = f(η_ij⁰).

The output y is then chosen from a probability density P, with mean μ_ij⁰ and "dispersion" parameter φ_ij⁰. We denote the density of y as:

P(y | x, θ_ij⁰),

where the parameter vector θ_ij⁰ includes the weights U_ij⁰ and the dispersion parameter φ_ij⁰. We assume the density P to be a member of the exponential family of densities [11]. The interpretation of the dispersion parameter depends on the particular choice of density. For example, in the case of the n-dimensional Gaussian, the dispersion parameter is the covariance matrix.²

Given these assumptions, the total probability of generating y from x is the mixture of the probabilities of generating y from each of the component densities, where the mixture components are multinomial probabilities:

P(y | x, θ⁰) = Σ_i g_i(x, v_i⁰) Σ_j g_{j|i}(x, v_ij⁰) P(y | x, θ_ij⁰).    (4)

²Not all exponential family densities have a dispersion parameter; in particular, the Bernoulli density has no dispersion parameter.
Note that θ⁰ includes the expert network parameters θ_ij⁰ as well as the gating network parameters v_i⁰ and v_ij⁰. Note also that we can utilize Eq. 4 without the superscripts to refer to the probability model defined by a particular HME architecture, irrespective of any reference to a "true" model.
2.2
POSTERIOR PROBABILITIES
In developing the learning algorithms to be pre-
sented in the remainder of the paper, it will prove
useful to define posterior probabilities associated
with the nodes of the tree. The terms “posterior”
and “prior” have meaning in this context during the
training of the system. We refer to the probabili-
ties
g2
and
gjli
as
prior
probabilities, because they
are computed basad oniy on the input
x,
without
knowledge of the corresponding target output
y.
A
posterior
probability is defined once both the input
and the target output are known. Using Bayes’ rule,
we define the posterior probabilities at the nodes of
the tree
as
follows
and
where we have drolpped the dependence on the input
and the parameters to simplify the notation.
We will also find it useful to define the joint pos-
terior probability
hij,
the product of
hj
and
hjli.
This quantity is the probability that expert network
(i,j)
can be considered to have generated the data,
based on knowledge of both the input and the out-
put. Once again, we emphasize that all of these
quantities are conditional on the input
x.
In deeper trees, the posterior probability associated
with an expert network is simply the product of
the conditional posterior probabilities along the path
from the root of the tree to that expert.
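As a concrete illustration of this bookkeeping, the sketch below computes h_i, h_{j|i} and the joint posterior h_ij for a single data pair, assuming isotropic Gaussian expert densities P_ij with a fixed, shared variance (an assumption made only for this example; the paper allows any exponential-family density). Function and variable names are ours.

```python
import numpy as np

def gaussian_density(y, mu, sigma2=1.0):
    """Isotropic Gaussian density P_ij(y) with mean mu and variance sigma2 per dimension."""
    d = y - mu
    n = y.shape[-1]
    return np.exp(-0.5 * np.sum(d * d, axis=-1) / sigma2) / (2 * np.pi * sigma2) ** (n / 2)

def posteriors(g, g_low, mu_ij, y, sigma2=1.0):
    """Posterior probabilities for a two-level HME (Eqs. 5 and 6 and their product).

    g      : (I,)      prior probabilities g_i from the top-level gating network
    g_low  : (I, J)    prior probabilities g_{j|i} from the lower-level gating networks
    mu_ij  : (I, J, n) expert means for the current input x
    y      : (n,)      observed target
    Returns h_i (I,), h_{j|i} (I, J), and the joint posteriors h_ij (I, J).
    """
    p = gaussian_density(y, mu_ij, sigma2)                # P_ij(y), shape (I, J)
    joint = g[:, None] * g_low * p                        # g_i g_{j|i} P_ij(y)
    total = joint.sum()                                   # mixture density of Eq. (4)
    h_ij = joint / total                                  # joint posterior for expert (i, j)
    h_i = h_ij.sum(axis=1)                                # Eq. (5)
    h_j_given_i = g_low * p / (g_low * p).sum(axis=1, keepdims=True)   # Eq. (6)
    return h_i, h_j_given_i, h_ij                         # note h_ij == h_i[:, None] * h_j_given_i
```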
2.3 THE LIKELIHOOD

We treat the problem of learning in the HME architecture as a maximum likelihood estimation problem. The log likelihood of a data set X = {(x^(t), y^(t))}_1^N is obtained by taking the log of the product of N densities of the form of Eq. 4, which yields the following log likelihood:

l(θ; X) = Σ_t ln Σ_i g_i^(t) Σ_j g_{j|i}^(t) P_ij(y^(t)).    (7)

We wish to maximize this function with respect to the parameters θ.
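Continuing the Gaussian-expert assumption from the sketch above, the log likelihood of Eq. 7 is simply the log of the mixture density of Eq. 4 summed over the data set. A possible vectorized implementation (names and shapes are ours, not the paper's):

```python
import numpy as np

def hme_log_likelihood(G, G_low, MU, Y, sigma2=1.0):
    """Log likelihood of Eq. (7) for a two-level HME with assumed isotropic Gaussian experts.

    G     : (N, I)       gating outputs g_i^(t)
    G_low : (N, I, J)    gating outputs g_{j|i}^(t)
    MU    : (N, I, J, n) expert means mu_ij^(t)
    Y     : (N, n)       targets y^(t)
    """
    n = Y.shape[-1]
    d = Y[:, None, None, :] - MU                                  # residuals y^(t) - mu_ij^(t)
    log_p = (-0.5 * np.sum(d * d, axis=-1) / sigma2
             - 0.5 * n * np.log(2 * np.pi * sigma2))              # ln P_ij(y^(t)), shape (N, I, J)
    # log of the mixture density of Eq. (4), computed stably via log-sum-exp
    log_mix = np.log(G[:, :, None]) + np.log(G_low) + log_p
    m = log_mix.max(axis=(1, 2), keepdims=True)
    log_density = m[:, 0, 0] + np.log(np.exp(log_mix - m).sum(axis=(1, 2)))
    return log_density.sum()                                      # sum over t, Eq. (7)
```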
2.4 THE EM ALGORITHM

In the following sections we develop a learning algorithm for the HME architecture based on the Expectation-Maximization (EM) framework [4]. We derive an EM algorithm for the architecture that consists of the iterative solution of a coupled set of iteratively-reweighted least-squares problems.

EM is an iterative approach to maximum likelihood estimation. Each iteration of an EM algorithm is composed of two steps: an Estimation (E) step and a Maximization (M) step. The M step involves the maximization of a likelihood function that is redefined in each iteration by the E step. An application of EM generally begins with the observation that the optimization of the likelihood function l(θ; X) would be simplified if only a set of additional variables, called "missing" or "hidden" variables, were known. In this context, we refer to the observable data X as the "incomplete data" and posit a "complete data" set Y that includes the missing variables Z. We specify a probability model that links the fictive missing variables to the actual data: P(y, z | x, θ). The logarithm of the density P defines the "complete-data likelihood," l_c(θ; Y). The original likelihood, l(θ; X), is referred to in this context as the "incomplete-data likelihood." It is the relationship between these two likelihood functions that motivates the EM algorithm. Note that the complete-data likelihood is a random variable, because the missing variables Z are in fact unknown.

An EM algorithm first finds the expected value of the complete-data likelihood, given the observed data and the current model. This is the E step:

Q(θ, θ^(p)) = E[l_c(θ; Y) | X],

where θ^(p) is the value of the parameters at the pth iteration and the expectation is taken with respect to θ^(p). This step yields a deterministic function Q. The M step maximizes this function with respect to θ to find the new parameter estimates:

θ^(p+1) = arg max_θ Q(θ, θ^(p)).

The E step is then repeated to yield an improved estimate of the complete likelihood and the process iterates.

An iterative step of EM chooses a parameter value that increases the value of Q, the expectation of the complete likelihood. What is the effect of such a step on the incomplete likelihood? Dempster et al. [4] proved that an increase in Q implies an increase in the incomplete likelihood:

l(θ^(p+1); X) ≥ l(θ^(p); X),

with equality obtaining only at the stationary points of l. Thus the likelihood l increases monotonically along the sequence of parameter estimates generated by an EM algorithm. In practice this implies convergence to a local maximum.
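To fix the control flow described above, here is a deliberately generic EM skeleton (not specific to the HME); the callables are placeholders for the E and M steps derived in the next section, and the monotonicity assertion reflects the Dempster et al. result.

```python
from typing import Any, Callable

def em(params: Any,
       e_step: Callable[[Any], Any],            # returns the posteriors/statistics that define Q
       m_step: Callable[[Any, Any], Any],       # returns argmax_theta Q(theta, theta_p)
       log_likelihood: Callable[[Any], float],  # incomplete-data log likelihood l(theta; X)
       tol: float = 1e-6,
       max_iter: int = 100) -> Any:
    """Generic EM loop: alternate E and M steps until the (monotonically
    increasing) incomplete-data log likelihood stops improving."""
    ll_old = log_likelihood(params)
    for _ in range(max_iter):
        stats = e_step(params)            # E step: expectations under the current parameters
        params = m_step(params, stats)    # M step: maximize the expected complete-data likelihood
        ll_new = log_likelihood(params)
        assert ll_new >= ll_old - 1e-9    # EM guarantees monotone improvement
        if ll_new - ll_old < tol:
            break
        ll_old = ll_new
    return params
```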
2.5 APPLYING EM TO THE HME ARCHITECTURE

To develop an EM algorithm for the HME architecture, we must define appropriate "missing data" so as to simplify the likelihood function. We define indicator variables z_ij such that one and only one of the z_ij is one for any given data point. These indicator variables have the interpretation as labels that specify which expert in the probability model generated the data point. This choice of missing data yields the following complete-data likelihood:

l_c(θ; Y) = Σ_t Σ_i Σ_j z_ij^(t) ln { g_i^(t) g_{j|i}^(t) P_ij(y^(t)) }.    (8)

Note the relationship of the complete-data likelihood in Eq. 8 to the incomplete-data likelihood in Eq. 7. The use of the indicator variables z_ij has allowed the logarithm to be brought inside the summation signs, substantially simplifying the maximization problem.
We now define the E step of the EM algorithm by taking the expectation of the complete-data likelihood, replacing each indicator variable z_ij^(t) by its expected value given the data and the current parameters, which is the joint posterior h_ij^(t):

Q(θ, θ^(p)) = Σ_t Σ_i Σ_j h_ij^(t) ln { g_i^(t) g_{j|i}^(t) P_ij(y^(t)) }.    (9)

The M step requires maximizing Q(θ, θ^(p)) with respect to the expert network parameters and the gating network parameters. Examining Eq. 9, we see that the expert network parameters influence the Q function only through the terms h_ij^(t) ln P_ij(y^(t)), and the gating network parameters influence the Q function only through the terms h_i^(t) ln g_i^(t) and h_ij^(t) ln g_{j|i}^(t). Thus the M step decouples into separate maximizations of these weighted log likelihoods, one for each expert network and one for each gating network.

Note that each of these maximization problems is itself a maximum likelihood problem, given that P_ij, g_i and g_{j|i} are probability densities. Moreover, given our parameterization of these densities, the log likelihoods that we have obtained are weighted log likelihoods for generalized linear models (GLIM's). An efficient algorithm known as iteratively reweighted least-squares (IRLS) is available to solve the maximum likelihood problem for such models [11]. (See [8] for a discussion of IRLS.)

In summary, the EM algorithm that we have obtained involves a calculation of posterior probabilities in the outer loop (the E step), and the solution of a set of weighted IRLS problems in the inner loop (the M step). We summarize the algorithm as follows:

HME Algorithm 1

1. For each data pair (x^(t), y^(t)), compute the posterior probabilities h_i^(t) and h_{j|i}^(t) using the current values of the parameters.
2. For each expert (i,j), solve an IRLS problem with observations {(x^(t), y^(t))}_1^N and observation weights {h_ij^(t)}_1^N.
3. For each top-level gating network, solve an IRLS problem with observations {(x^(t), h_i^(t))}_1^N.
4. For each lower-level gating network, solve an IRLS problem with observations {(x^(t), h_{j|i}^(t))}_1^N and observation weights {h_i^(t)}_1^N.
5. Iterate using the updated parameter values.
[8] also presents an approximation to this algorithm in which the gating networks are fit by least-squares rather than maximum likelihood. In this case, the IRLS inner loop reduces to a weighted least-squares problem that can be solved without iteration.
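As a rough sketch of what these inner-loop fits look like in the Gaussian-expert case, the code below performs the M step for a single expert by one weighted least-squares solve (which coincides with IRLS for a Gaussian expert with an identity link) and fits the top-level gating network by a few gradient steps on its weighted multinomial log likelihood, a simple stand-in for the IRLS or least-squares fits described above. The function names, the ridge term, and the gradient-based gating fit are our assumptions, not the paper's prescription.

```python
import numpy as np

def weighted_least_squares(X, Y, w, ridge=1e-8):
    """Closed-form weighted least squares: argmin_W sum_t w_t ||Y_t - X_t W||^2.
    X: (N, m), Y: (N, n), w: (N,). Returns W of shape (m, n)."""
    Xw = X * w[:, None]
    A = X.T @ Xw + ridge * np.eye(X.shape[1])     # small ridge term for numerical safety
    return np.linalg.solve(A, Xw.T @ Y)

def m_step_expert(X, Y, h_ij):
    """M step for one Gaussian expert (i, j): weighted least-squares fit of y on x
    with the joint posteriors h_ij as observation weights."""
    W = weighted_least_squares(X, Y, h_ij)        # prediction is x @ W
    return W.T                                    # return U_ij with shape (n, m)

def m_step_top_gate(X, H, V, lr=0.1, n_iters=50):
    """Approximate M step for the top-level gating network: gradient ascent on the
    weighted multinomial log likelihood sum_t sum_i h_i^(t) ln g_i^(t).
    X: (N, m) inputs, H: (N, I) posteriors h_i, V: (I, m) gating weights v_i."""
    for _ in range(n_iters):
        Z = X @ V.T
        Z -= Z.max(axis=1, keepdims=True)
        G = np.exp(Z)
        G /= G.sum(axis=1, keepdims=True)         # softmax gating outputs g_i
        V += lr * (H - G).T @ X / X.shape[0]      # gradient of the weighted log likelihood
    return V
```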
2.5.1 Simulation Results

We tested the algorithm on a nonlinear system identification problem. The data were obtained from a simulation of a four-joint robot arm moving in three-dimensional space. The network must learn the forward dynamics of the arm; a mapping from twelve coupled input variables to four output variables. This mapping is rather smooth and we expect the error for a global fitting algorithm like backpropagation to be small; our main interest is in the training time.

Table 1: Average values of relative error and number of epochs required for convergence for the batch algorithms.

Architecture      Relative Error   # Epochs
CART              .17              NA
CART (oblique)    .13              NA
MARS              .16              NA
We generated 15,000 data points for training and 5,000 points for testing. For each epoch (i.e., each pass through the training set), we computed the relative error on the test set. Relative error is computed as a ratio between the mean squared error and the mean squared error that would be obtained if the learner were to output the mean value of the outputs for all data points.
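For reference, relative error as defined here can be computed as follows (a trivial sketch; variable names are ours):

```python
import numpy as np

def relative_error(Y_true, Y_pred):
    """Ratio of the model's mean squared error to the MSE of a predictor
    that always outputs the mean of the targets."""
    mse_model = np.mean((Y_true - Y_pred) ** 2)
    mse_mean = np.mean((Y_true - Y_true.mean(axis=0)) ** 2)
    return mse_model / mse_mean
```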
We compared the performance of a binary hierarchy to that of the best linear approximation, a backpropagation network, the CART algorithm and the MARS algorithm. The hierarchy was a four-level hierarchy with 16 expert networks and 15 gating networks. Each expert network had 4 output units and each gating network had 1 output unit. The backpropagation network had 60 hidden units, which yields approximately the same number of parameters in the network as in the hierarchy. The MARS algorithm was run with a maximum of 16 basis functions, based on the fact that each such function corresponds roughly to a single expert in the HME architecture.
Table 1 reports the average values of minimum relative error and the convergence times for all architectures. As can be seen in the Table, the backpropagation algorithm required 5,500 passes through the data to converge to a relative error of 0.09. The HME algorithm converged to a similar relative error in only 35 passes through the data. CART and MARS required similar CPU time as compared to the HME algorithm, but produced less accurate fits. (For further details on the simulation, see [8].)
As shown in Figure 2, the HME architecture lends itself well to graphical investigation. This figure displays the time sequence of the distributions of posterior probabilities across the training set at each node of the tree. At Epoch 0, before any learning has taken place, most of the posterior probabilities at each node are approximately 0.5 across the training set. As the training proceeds, the histograms flatten out, eventually approaching bimodal distributions in which the posterior probabilities are either one or zero for most of the training patterns. This evolution is indicative of increasingly sharp splits being fit by the gating networks. Note that there is a tendency for the splits to be formed more rapidly at higher levels in the tree than at lower levels.

[Figure 2: A sequence of histogram trees for the HME architecture. Each histogram displays the distribution of posterior probabilities across the training set at each node in the tree. Panels shown include Epoch 19 and Epoch 29.]
2.6 AN ON-LINE ALGORITHM

Jordan and Jacobs [8] derive an on-line algorithm for the HME architecture using techniques from recursive estimation theory [10].

The on-line update rule for the parameters of the expert networks is given by the following recursive equation:

U_ij^(t+1) = U_ij^(t) + h_i^(t) h_{j|i}^(t) (y^(t) - μ_ij^(t)) x^(t)T R_ij^(t),    (10)

where R_ij is the inverse covariance matrix for expert network (i,j). This matrix is itself updated by a recursive least-squares recursion in which λ acts as a decay parameter.

Similar update rules are obtained for the parameters of the gating networks. See [8] for further details.
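A sketch of one such on-line update for a single expert network is given below. The U update follows Eq. (10); for the matrix R_ij we substitute a standard exponentially weighted recursive least-squares update with forgetting factor λ, since the paper defers the exact R update to [8]; that particular form is therefore an assumption here, as are the function and variable names.

```python
import numpy as np

def online_expert_update(U, R, x, y, h_i, h_ji, lam=0.99):
    """One on-line update of expert (i, j).

    U    : (n, m) expert weight matrix U_ij
    R    : (m, m) inverse covariance matrix R_ij
    x, y : current input (m,) and target (n,)
    h_i, h_ji : posteriors h_i^(t) and h_{j|i}^(t) for this data point
    """
    mu = U @ x                                   # current expert prediction mu_ij^(t)
    # Eq. (10): correction weighted by the joint posterior h_i * h_{j|i}
    U = U + h_i * h_ji * np.outer(y - mu, x @ R)
    # Assumed exponentially weighted RLS update of R with forgetting factor lam
    Rx = R @ x
    R = (R - np.outer(Rx, Rx) / (lam + x @ Rx)) / lam
    return U, R
```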
2.6.1 Simulation Results

The on-line algorithm was tested on the robot dynamics problem described in the previous section.

References
More filters
Book

Generalized Linear Models

TL;DR: In this paper, a generalization of the analysis of variance is given for these models using log- likelihoods, illustrated by examples relating to four distributions; the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables), and gamma (variance components).
Journal ArticleDOI

Multilayer feedforward networks are universal approximators

TL;DR: It is rigorously established that standard multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available.
Book

Statistical Analysis with Missing Data

TL;DR: This work states that maximum Likelihood for General Patterns of Missing Data: Introduction and Theory with Ignorable Nonresponse and large-Sample Inference Based on Maximum Likelihood Estimates is likely to be high.
Journal ArticleDOI

Induction of Decision Trees

J. R. Quinlan
- 25 Mar 1986 - 
TL;DR: In this paper, an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail, is described, and a reported shortcoming of the basic algorithm is discussed.