Proceedings of 1993 International Joint Conference on Neural Networks

Hierarchical mixtures of experts and the EM algorithm

Michael I. Jordan
Department of Brain and Cognitive Sciences
MIT
Cambridge, MA 02139

Robert A. Jacobs
Department of Psychology
University of Rochester
Rochester, NY 14627
Abstract

We present a tree-structured architecture for supervised learning. The statistical model underlying the architecture is a hierarchical mixture model in which both the mixture coefficients and the mixture components are generalized linear models (GLIM's). Learning is treated as a maximum likelihood problem; in particular, we present an Expectation-Maximization (EM) algorithm for adjusting the parameters of the architecture. We also develop an on-line learning algorithm in which the parameters are updated incrementally. Comparative simulation results are presented in the robot dynamics domain.
1 INTRODUCTION

In the statistical literature and in the machine learning literature, divide-and-conquer algorithms have become increasingly popular. The CART algorithm [1], the MARS algorithm [5], and the ID3 algorithm [12] are well-known examples. These algorithms fit surfaces to data by explicitly dividing the input space into a nested sequence of regions, and by fitting simple surfaces (e.g., constant functions) within these regions. The advantages of these algorithms include the interpretability of their solutions and the speed of the training process.

In this paper we present a neural network architecture that is a close cousin to architectures such as CART and MARS. As in our earlier work [6,7], we formulate the learning problem for this architecture as a maximum likelihood problem. In the current paper we utilize the Expectation-Maximization (EM) framework to derive the learning algorithm.
2 HIERARCHICAL MIXTURES OF EXPERTS
The algorithms that we discuss in this paper are supervised learning algorithms. We explicitly address the case of regression, in which the input vectors are elements of ℜ^m and the output vectors are elements of ℜ^n. Our model also handles classification problems and counting problems in which the outputs are integer-valued. The data are assumed to form a countable set of paired observations X = {(x^(t), y^(t))}. In the case of the batch algorithm discussed below, this set is assumed to be finite; in the case of the on-line algorithm, the set may be infinite.

[Figure 1: A two-level hierarchical mixture of experts. Two expert networks and a gating network each receive the input x.]
We propose to solve nonlinear supervised learning problems by dividing the input space into a nested set of regions and fitting simple surfaces to the data that fall in these regions. The regions have "soft" boundaries, meaning that data points may lie simultaneously in multiple regions. The boundaries between regions are themselves simple parameterized surfaces that are adjusted by the learning algorithm.
The hierarchical mixture-of-experts (HME) architecture is shown in Figure 1.¹ The architecture is a tree in which the gating networks sit at the nonterminals of the tree. These networks receive the vector x as input and produce scalar outputs that are a partition of unity at each point in the input space. The expert networks sit at the leaves of the tree. Each expert produces an output vector μ_ij for each input vector. These output vectors proceed up the tree, being multiplied by the gating network outputs and summed at the nonterminals.

¹To simplify the presentation, we restrict ourselves to a two-level hierarchy throughout the paper. All of the algorithms that we describe, however, generalize readily to hierarchies of arbitrary depth. See [9] for a recursive formalism that handles arbitrary trees.
All of the expert networks in the tree are linear with a single output nonlinearity. We will refer to such a network as "generalized linear," borrowing the terminology from statistics [11]. Expert network (i,j) produces its output μ_ij as a generalized linear function of the input x:

μ_ij = f(U_ij x),    (1)

where U_ij is a weight matrix and f is a fixed continuous nonlinearity. The vector x is assumed to include a fixed component of one to allow for an intercept term.

For regression problems, f(·) is the identity function (i.e., the experts are linear). For binary classification problems, f(·) is generally taken to be the logistic function, in which case the expert outputs are interpreted as the log odds of "success" under a Bernoulli probability model. Other models (e.g., multiway classification, counting, rate estimation and survival estimation) are handled readily by making other choices for f(·). These models are smoothed piecewise analogs of the corresponding generalized linear models (GLIM's; cf. [11]).

The gating networks are also generalized linear. At the top level, define linear predictors ξ_i as follows:

ξ_i = v_i^T x,    (2)

where v_i is a weight vector. Then the ith output of the top-level gating network is the "softmax" function of the ξ_i [2,11]:

g_i = exp(ξ_i) / Σ_k exp(ξ_k).    (3)

Note that the g_i are positive and sum to one for each x.
The gating networks at the lower level are defined similarly, yielding outputs g_{j|i} that are obtained by taking the softmax function of linear predictors ξ_ij = v_ij^T x.

The output vector at each nonterminal of the tree is the weighted output of the experts below that nonterminal. That is, the output at the ith nonterminal in the second layer of the two-level tree is:

μ_i = Σ_j g_{j|i} μ_ij,

and the output at the top level of the tree is:

μ = Σ_i g_i μ_i.

Note that both the g's and the μ's depend on the input x; thus the total output is a nonlinear function of the input.
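To make the data flow concrete, here is a small sketch of the forward pass of a two-level HME with linear experts and softmax gating networks. This is an illustration rather than the authors' code; the array shapes, the function names, and the use of NumPy are our assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hme_forward(x, U, V_top, V_low):
    """Forward pass of a two-level HME with linear experts (identity link).

    x      : (m,) input, assumed to end with a fixed 1 for the intercept term
    U      : (I, J, n, m) expert weight matrices U_ij          (Eq. 1)
    V_top  : (I, m) top-level gating weight vectors v_i        (Eqs. 2-3)
    V_low  : (I, J, m) lower-level gating weight vectors v_ij
    Returns the blended output mu together with the gating and expert outputs.
    """
    g = softmax(V_top @ x)                        # g_i: partition of unity over branches
    g_low = softmax(V_low @ x)                    # g_{j|i}: softmax within each branch i
    mu_ij = np.einsum('ijnm,m->ijn', U, x)        # expert outputs mu_ij = U_ij x
    mu_i = np.einsum('ij,ijn->in', g_low, mu_ij)  # blend experts below nonterminal i
    mu = g @ mu_i                                 # blend at the top of the tree
    return mu, g, g_low, mu_ij

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
I, J, n, m = 2, 2, 4, 13                          # e.g., 12 inputs plus an intercept
x = np.append(rng.normal(size=m - 1), 1.0)
mu, g, g_low, _ = hme_forward(x,
                              rng.normal(size=(I, J, n, m)),
                              rng.normal(size=(I, m)),
                              rng.normal(size=(I, J, m)))
assert np.isclose(g.sum(), 1.0)                   # gating outputs sum to one for each x
```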
2.1 A PROBABILITY MODEL

The hierarchy can be given a probabilistic interpretation. We suppose that the mechanism by which data are generated by the environment involves a nested sequence of decisions that terminates in a regressive process that maps x to y. The decisions are modeled as multinomial random variables. That is, for each x, we interpret the values g_i(x, v_i⁰) as the multinomial probabilities associated with the first decision and the g_{j|i}(x, v_ij⁰) as the (conditional) multinomial probabilities associated with the second decision. We use a statistical model to model these probabilities; in particular, our choice of parameterization (cf. Eqs. 2 and 3) corresponds to a log-linear probability model, a special case of a GLIM that is commonly used for "soft" multiway classification [11]. Under the log-linear model, we interpret the gating networks as modeling the input-dependent multinomial probabilities of making particular nested sequences of decisions.
Once a particular sequence of decisions has been made, output y is assumed to be generated according to the following generalized linear statistical model. First, a linear predictor η_ij⁰ is formed:

η_ij⁰ = U_ij⁰ x,

where the superscript refers to the "true" values of the parameters. The expected value of y is obtained by passing the linear predictor through the link function f:

μ_ij⁰ = f(η_ij⁰).

The output y is then chosen from a probability density P, with mean μ_ij⁰ and "dispersion" parameter φ_ij⁰. We denote the density of y as:

P(y | x, θ_ij⁰),

where the parameter vector θ_ij⁰ includes the weights U_ij⁰ and the dispersion parameter φ_ij⁰. We assume the density P to be a member of the exponential family of densities [11]. The interpretation of the dispersion parameter depends on the particular choice of density. For example, in the case of the n-dimensional Gaussian, the dispersion parameter is the covariance matrix.²

Given these assumptions, the total probability of generating y from x is the mixture of the probabilities of generating y from each of the component densities, where the mixture components are multinomial probabilities:

P(y | x, θ⁰) = Σ_i g_i(x, v_i⁰) Σ_j g_{j|i}(x, v_ij⁰) P(y | x, θ_ij⁰).    (4)

²Not all exponential family densities have a dispersion parameter; in particular, the Bernoulli density has no dispersion parameter.
Note that θ⁰ includes the expert network parameters θ_ij⁰ as well as the gating network parameters v_i⁰ and v_ij⁰. Note also that we can utilize Eq. 4 without the superscripts to refer to the probability model defined by a particular HME architecture, irrespective of any reference to a "true" model.
2.2
POSTERIOR PROBABILITIES
In developing the learning algorithms to be pre-
sented in the remainder of the paper, it will prove
useful to define posterior probabilities associated
with the nodes of the tree. The terms “posterior”
and “prior” have meaning in this context during the
training of the system. We refer to the probabili-
ties
g2
and
gjli
as
prior
probabilities, because they
are computed basad oniy on the input
x,
without
knowledge of the corresponding target output
y.
A
posterior
probability is defined once both the input
and the target output are known. Using Bayes’ rule,
we define the posterior probabilities at the nodes of
the tree
as
follows
and
where we have drolpped the dependence on the input
and the parameters to simplify the notation.
We will also find it useful to define the joint pos-
terior probability
hij,
the product of
hj
and
hjli.
This quantity is the probability that expert network
(i,j)
can be considered to have generated the data,
based on knowledge of both the input and the out-
put. Once again, we emphasize that all of these
quantities are conditional on the input
x.
In deeper trees, the posterior probability associated
with an expert network is simply the product of
the conditional posterior probabilities along the path
from the root of the tree to that expert.
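As a concrete illustration of this bookkeeping, the sketch below computes h_i, h_{j|i} and the joint posterior h_ij for a single data pair, assuming isotropic Gaussian expert densities P_ij with a fixed, shared variance (an assumption made only for this example; the paper allows any exponential-family density). Function and variable names are ours.

```python
import numpy as np

def gaussian_density(y, mu, sigma2=1.0):
    """Isotropic Gaussian density P_ij(y) with mean mu and variance sigma2 per dimension."""
    d = y - mu
    n = y.shape[-1]
    return np.exp(-0.5 * np.sum(d * d, axis=-1) / sigma2) / (2 * np.pi * sigma2) ** (n / 2)

def posteriors(g, g_low, mu_ij, y, sigma2=1.0):
    """Posterior probabilities for a two-level HME (Eqs. 5 and 6 and their product).

    g      : (I,)      prior probabilities g_i from the top-level gating network
    g_low  : (I, J)    prior probabilities g_{j|i} from the lower-level gating networks
    mu_ij  : (I, J, n) expert means for the current input x
    y      : (n,)      observed target
    Returns h_i (I,), h_{j|i} (I, J), and the joint posteriors h_ij (I, J).
    """
    p = gaussian_density(y, mu_ij, sigma2)                # P_ij(y), shape (I, J)
    joint = g[:, None] * g_low * p                        # g_i g_{j|i} P_ij(y)
    total = joint.sum()                                   # mixture density of Eq. (4)
    h_ij = joint / total                                  # joint posterior for expert (i, j)
    h_i = h_ij.sum(axis=1)                                # Eq. (5)
    h_j_given_i = g_low * p / (g_low * p).sum(axis=1, keepdims=True)   # Eq. (6)
    return h_i, h_j_given_i, h_ij                         # note h_ij == h_i[:, None] * h_j_given_i
```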
2.3 THE LIKELIHOOD

We treat the problem of learning in the HME architecture as a maximum likelihood estimation problem. The log likelihood of a data set X = {(x^(t), y^(t))}_1^N is obtained by taking the log of the product of N densities of the form of Eq. 4, which yields the following log likelihood:

l(θ; X) = Σ_t ln Σ_i g_i^(t) Σ_j g_{j|i}^(t) P_ij(y^(t)).    (7)

We wish to maximize this function with respect to the parameters θ.
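Continuing the Gaussian-expert assumption from the sketch above, the log likelihood of Eq. 7 is simply the log of the mixture density of Eq. 4 summed over the data set. A possible vectorized implementation (names and shapes are ours, not the paper's):

```python
import numpy as np

def hme_log_likelihood(G, G_low, MU, Y, sigma2=1.0):
    """Log likelihood of Eq. (7) for a two-level HME with assumed isotropic Gaussian experts.

    G     : (N, I)       gating outputs g_i^(t)
    G_low : (N, I, J)    gating outputs g_{j|i}^(t)
    MU    : (N, I, J, n) expert means mu_ij^(t)
    Y     : (N, n)       targets y^(t)
    """
    n = Y.shape[-1]
    d = Y[:, None, None, :] - MU                                  # residuals y^(t) - mu_ij^(t)
    log_p = (-0.5 * np.sum(d * d, axis=-1) / sigma2
             - 0.5 * n * np.log(2 * np.pi * sigma2))              # ln P_ij(y^(t)), shape (N, I, J)
    # log of the mixture density of Eq. (4), computed stably via log-sum-exp
    log_mix = np.log(G[:, :, None]) + np.log(G_low) + log_p
    m = log_mix.max(axis=(1, 2), keepdims=True)
    log_density = m[:, 0, 0] + np.log(np.exp(log_mix - m).sum(axis=(1, 2)))
    return log_density.sum()                                      # sum over t, Eq. (7)
```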
2.4 THE EM ALGORITHM

In the following sections we develop a learning algorithm for the HME architecture based on the Expectation-Maximization (EM) framework [4]. We derive an EM algorithm for the architecture that consists of the iterative solution of a coupled set of iteratively-reweighted least-squares problems.

EM is an iterative approach to maximum likelihood estimation. Each iteration of an EM algorithm is composed of two steps: an Estimation (E) step and a Maximization (M) step. The M step involves the maximization of a likelihood function that is redefined in each iteration by the E step. An application of EM generally begins with the observation that the optimization of the likelihood function l(θ; X) would be simplified if only a set of additional variables, called "missing" or "hidden" variables, were known. In this context, we refer to the observable data X as the "incomplete data" and posit a "complete data" set Y that includes the missing variables Z. We specify a probability model that links the fictive missing variables to the actual data: P(y, z | x, θ). The logarithm of the density P defines the "complete-data likelihood," l_c(θ; Y). The original likelihood, l(θ; X), is referred to in this context as the "incomplete-data likelihood." It is the relationship between these two likelihood functions that motivates the EM algorithm. Note that the complete-data likelihood is a random variable, because the missing variables Z are in fact unknown.

An EM algorithm first finds the expected value of the complete-data likelihood, given the observed data and the current model. This is the E step:

Q(θ, θ^(p)) = E[l_c(θ; Y) | X],

where θ^(p) is the value of the parameters at the pth iteration and the expectation is taken with respect to θ^(p). This step yields a deterministic function Q. The M step maximizes this function with respect to θ to find the new parameter estimates:

θ^(p+1) = arg max_θ Q(θ, θ^(p)).

The E step is then repeated to yield an improved estimate of the complete likelihood and the process iterates.

An iterative step of EM chooses a parameter value that increases the value of Q, the expectation of the complete likelihood. What is the effect of such a step on the incomplete likelihood? Dempster et al. [4] proved that an increase in Q implies an increase in the incomplete likelihood:

l(θ^(p+1); X) ≥ l(θ^(p); X),

with equality obtaining only at the stationary points of l. Thus the likelihood l increases monotonically along the sequence of parameter estimates generated by an EM algorithm. In practice this implies convergence to a local maximum.
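To fix the control flow described above, here is a deliberately generic EM skeleton (not specific to the HME); the callables are placeholders for the E and M steps derived in the next section, and the monotonicity assertion reflects the Dempster et al. result.

```python
from typing import Any, Callable

def em(params: Any,
       e_step: Callable[[Any], Any],            # returns the posteriors/statistics that define Q
       m_step: Callable[[Any, Any], Any],       # returns argmax_theta Q(theta, theta_p)
       log_likelihood: Callable[[Any], float],  # incomplete-data log likelihood l(theta; X)
       tol: float = 1e-6,
       max_iter: int = 100) -> Any:
    """Generic EM loop: alternate E and M steps until the (monotonically
    increasing) incomplete-data log likelihood stops improving."""
    ll_old = log_likelihood(params)
    for _ in range(max_iter):
        stats = e_step(params)            # E step: expectations under the current parameters
        params = m_step(params, stats)    # M step: maximize the expected complete-data likelihood
        ll_new = log_likelihood(params)
        assert ll_new >= ll_old - 1e-9    # EM guarantees monotone improvement
        if ll_new - ll_old < tol:
            break
        ll_old = ll_new
    return params
```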
2.5 APPLYING EM TO THE HME ARCHITECTURE

To develop an EM algorithm for the HME architecture, we must define appropriate "missing data" so as to simplify the likelihood function. We define indicator variables z_ij such that one and only one of the z_ij is one for any given data point. These indicator variables have the interpretation as labels that specify which expert in the probability model generated the data point. This choice of missing data yields the following complete-data likelihood:

l_c(θ; Y) = Σ_t Σ_i Σ_j z_ij^(t) ln { g_i^(t) g_{j|i}^(t) P_ij(y^(t)) }.    (8)

Note the relationship of the complete-data likelihood in Eq. 8 to the incomplete-data likelihood in Eq. 7. The use of the indicator variables z_ij has allowed the logarithm to be brought inside the summation signs, substantially simplifying the maximization problem.
We now define the E step of the EM algorithm by taking the expectation of the complete-data likelihood, replacing each indicator variable z_ij^(t) by its expected value given the data and the current parameters, which is the joint posterior h_ij^(t):

Q(θ, θ^(p)) = Σ_t Σ_i Σ_j h_ij^(t) ln { g_i^(t) g_{j|i}^(t) P_ij(y^(t)) }.    (9)

The M step requires maximizing Q(θ, θ^(p)) with respect to the expert network parameters and the gating network parameters. Examining Eq. 9, we see that the expert network parameters influence the Q function only through the terms h_ij^(t) ln P_ij(y^(t)), and the gating network parameters influence the Q function only through the terms h_i^(t) ln g_i^(t) and h_ij^(t) ln g_{j|i}^(t). Thus the M step decouples into separate maximizations of these weighted log likelihoods, one for each expert network and one for each gating network.

Note that each of these maximization problems is itself a maximum likelihood problem, given that P_ij, g_i and g_{j|i} are probability densities. Moreover, given our parameterization of these densities, the log likelihoods that we have obtained are weighted log likelihoods for generalized linear models (GLIM's). An efficient algorithm known as iteratively reweighted least-squares (IRLS) is available to solve the maximum likelihood problem for such models [11]. (See [8] for a discussion of IRLS.)

In summary, the EM algorithm that we have obtained involves a calculation of posterior probabilities in the outer loop (the E step), and the solution of a set of weighted IRLS problems in the inner loop (the M step). We summarize the algorithm as follows:

HME Algorithm 1

1. For each data pair (x^(t), y^(t)), compute the posterior probabilities h_i^(t) and h_{j|i}^(t) using the current values of the parameters.
2. For each expert (i,j), solve an IRLS problem with observations {(x^(t), y^(t))}_1^N and observation weights {h_ij^(t)}_1^N.
3. For each top-level gating network, solve an IRLS problem with observations {(x^(t), h_i^(t))}_1^N.
4. For each lower-level gating network, solve an IRLS problem with observations {(x^(t), h_{j|i}^(t))}_1^N and observation weights {h_i^(t)}_1^N.
5. Iterate using the updated parameter values.
[8] also presents an approximation to this algorithm in which the gating networks are fit by least-squares rather than maximum likelihood. In this case, the IRLS inner loop reduces to a weighted least-squares problem that can be solved without iteration.
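As a rough sketch of what these inner-loop fits look like in the Gaussian-expert case, the code below performs the M step for a single expert by one weighted least-squares solve (which coincides with IRLS for a Gaussian expert with an identity link) and fits the top-level gating network by a few gradient steps on its weighted multinomial log likelihood, a simple stand-in for the IRLS or least-squares fits described above. The function names, the ridge term, and the gradient-based gating fit are our assumptions, not the paper's prescription.

```python
import numpy as np

def weighted_least_squares(X, Y, w, ridge=1e-8):
    """Closed-form weighted least squares: argmin_W sum_t w_t ||Y_t - X_t W||^2.
    X: (N, m), Y: (N, n), w: (N,). Returns W of shape (m, n)."""
    Xw = X * w[:, None]
    A = X.T @ Xw + ridge * np.eye(X.shape[1])     # small ridge term for numerical safety
    return np.linalg.solve(A, Xw.T @ Y)

def m_step_expert(X, Y, h_ij):
    """M step for one Gaussian expert (i, j): weighted least-squares fit of y on x
    with the joint posteriors h_ij as observation weights."""
    W = weighted_least_squares(X, Y, h_ij)        # prediction is x @ W
    return W.T                                    # return U_ij with shape (n, m)

def m_step_top_gate(X, H, V, lr=0.1, n_iters=50):
    """Approximate M step for the top-level gating network: gradient ascent on the
    weighted multinomial log likelihood sum_t sum_i h_i^(t) ln g_i^(t).
    X: (N, m) inputs, H: (N, I) posteriors h_i, V: (I, m) gating weights v_i."""
    for _ in range(n_iters):
        Z = X @ V.T
        Z -= Z.max(axis=1, keepdims=True)
        G = np.exp(Z)
        G /= G.sum(axis=1, keepdims=True)         # softmax gating outputs g_i
        V += lr * (H - G).T @ X / X.shape[0]      # gradient of the weighted log likelihood
    return V
```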
2.5.1 Simulation Results

We tested the algorithm on a nonlinear system identification problem. The data were obtained from a simulation of a four-joint robot arm moving in three-dimensional space. The network must learn the forward dynamics of the arm; a mapping from twelve coupled input variables to four output variables. This mapping is rather smooth and we expect the error for a global fitting algorithm like backpropagation to be small; our main interest is in the training time.

Table 1: Average values of relative error and number of epochs required for convergence for the batch algorithms.

Architecture      Relative Error   # Epochs
CART              .17              NA
CART (oblique)    .13              NA
MARS              .16              NA
We generated 15,000 data points for training and 5,000 points for testing. For each epoch (i.e., each pass through the training set), we computed the relative error on the test set. Relative error is computed as a ratio between the mean squared error and the mean squared error that would be obtained if the learner were to output the mean value of the outputs for all data points.
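For reference, relative error as defined here can be computed as follows (a trivial sketch; variable names are ours):

```python
import numpy as np

def relative_error(Y_true, Y_pred):
    """Ratio of the model's mean squared error to the MSE of a predictor
    that always outputs the mean of the targets."""
    mse_model = np.mean((Y_true - Y_pred) ** 2)
    mse_mean = np.mean((Y_true - Y_true.mean(axis=0)) ** 2)
    return mse_model / mse_mean
```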
We compared the performance of a binary hierarchy to that of the best linear approximation, a backpropagation network, the CART algorithm and the MARS algorithm. The hierarchy was a four-level hierarchy with 16 expert networks and 15 gating networks. Each expert network had 4 output units and each gating network had 1 output unit. The backpropagation network had 60 hidden units, which yields approximately the same number of parameters in the network as in the hierarchy. The MARS algorithm was run with a maximum of 16 basis functions, based on the fact that each such function corresponds roughly to a single expert in the HME architecture.
Table 1 reports the average values of minimum relative error and the convergence times for all architectures. As can be seen in the Table, the backpropagation algorithm required 5,500 passes through the data to converge to a relative error of 0.09. The HME algorithm converged to a similar relative error in only 35 passes through the data. CART and MARS required similar CPU time as compared to the HME algorithm, but produced less accurate fits. (For further details on the simulation, see [8].)
As shown in Figure 2, the HME architecture lends itself well to graphical investigation. This figure displays the time sequence of the distributions of posterior probabilities across the training set at each node of the tree. At Epoch 0, before any learning has taken place, most of the posterior probabilities at each node are approximately 0.5 across the training set. As the training proceeds, the histograms flatten out, eventually approaching bimodal distributions in which the posterior probabilities are either one or zero for most of the training patterns. This evolution is indicative of increasingly sharp splits being fit by the gating networks. Note that there is a tendency for the splits to be formed more rapidly at higher levels in the tree than at lower levels.

[Figure 2: A sequence of histogram trees for the HME architecture. Each histogram displays the distribution of posterior probabilities across the training set at each node in the tree. Panels shown include Epoch 19 and Epoch 29.]
2.6 AN ON-LINE ALGORITHM

Jordan and Jacobs [8] derive an on-line algorithm for the HME architecture using techniques from recursive estimation theory [10].

The on-line update rule for the parameters of the expert networks is given by the following recursive equation:

U_ij^(t+1) = U_ij^(t) + h_i^(t) h_{j|i}^(t) (y^(t) - μ_ij^(t)) x^(t)T R_ij^(t),    (10)

where R_ij is the inverse covariance matrix for expert network (i,j). This matrix is itself updated by a recursive least-squares recursion in which λ acts as a decay parameter.

Similar update rules are obtained for the parameters of the gating networks. See [8] for further details.
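A sketch of one such on-line update for a single expert network is given below. The U update follows Eq. (10); for the matrix R_ij we substitute a standard exponentially weighted recursive least-squares update with forgetting factor λ, since the paper defers the exact R update to [8]; that particular form is therefore an assumption here, as are the function and variable names.

```python
import numpy as np

def online_expert_update(U, R, x, y, h_i, h_ji, lam=0.99):
    """One on-line update of expert (i, j).

    U    : (n, m) expert weight matrix U_ij
    R    : (m, m) inverse covariance matrix R_ij
    x, y : current input (m,) and target (n,)
    h_i, h_ji : posteriors h_i^(t) and h_{j|i}^(t) for this data point
    """
    mu = U @ x                                   # current expert prediction mu_ij^(t)
    # Eq. (10): correction weighted by the joint posterior h_i * h_{j|i}
    U = U + h_i * h_ji * np.outer(y - mu, x @ R)
    # Assumed exponentially weighted RLS update of R with forgetting factor lam
    Rx = R @ x
    R = (R - np.outer(Rx, Rx) / (lam + x @ Rx)) / lam
    return U, R
```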
2.6.1 Simulation Results

The on-line algorithm was tested on the robot dynamics problem described in the previous section.

References
More filters
Book

Generalized Linear Models

TL;DR: In this paper, a generalization of the analysis of variance is given for these models using log- likelihoods, illustrated by examples relating to four distributions; the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables), and gamma (variance components).
Journal ArticleDOI

Multilayer feedforward networks are universal approximators

TL;DR: It is rigorously established that standard multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available.
Book

Statistical Analysis with Missing Data

TL;DR: This work states that maximum Likelihood for General Patterns of Missing Data: Introduction and Theory with Ignorable Nonresponse and large-Sample Inference Based on Maximum Likelihood Estimates is likely to be high.
Journal ArticleDOI

Induction of Decision Trees

J. R. Quinlan
- 25 Mar 1986 - 
TL;DR: In this paper, an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail, is described, and a reported shortcoming of the basic algorithm is discussed.