
Citation for published version: Runnalls, Andrew R. (2007) Kullback-Leibler approach to Gaussian mixture reduction. IEEE Transactions on Aerospace and Electronic Systems, 43 (3). pp. 989-999. ISSN 0018-9251. DOI: https://doi.org/10.1109/TAES.2007.4383588. Link to record in KAR: https://kar.kent.ac.uk/2782/

A Kullback-Leibler Approach
to Gaussian Mixture Reduction
Andrew R. Runnalls
© 2006 IEEE. Personal use of this material is permitted.
However, permission to reprint/republish this material for
advertising or promotional purposes or for creating new
collective works for resale or redistribution to servers or
lists, or to reuse any copyrighted component of this work
in other works must be obtained from the IEEE.
This material is presented to ensure timely dissemination
of scholarly and technical work. Copyright and all rights
therein are retained by authors or by other copyright
holders. All persons copying this information are expected
to adhere to the terms and constraints invoked by each
author’s copyright. In most cases, these works may not be
reposted without the explicit permission of the copyright
holder.
Abstract: A common problem in multi-target tracking is to approximate a Gaussian mixture by one containing fewer components; similar problems can arise in integrated navigation. A common approach is successively to merge pairs of components, replacing the pair with a single Gaussian component whose moments up to second order match those of the merged pair. Salmond [1] and Williams [2], [3] have each proposed algorithms along these lines, but using different criteria for selecting the pair to be merged at each stage. The paper shows how under certain circumstances each of these pair-selection criteria can give rise to anomalous behaviour, and proposes that a key consideration should be the Kullback-Leibler discrimination of the reduced mixture with respect to the original mixture. Although computing this directly would normally be impractical, the paper shows how an easily-computed upper bound can be used as a pair-selection criterion which avoids the anomalies of the earlier approaches. The behaviour of the three algorithms is compared using a high-dimensional example drawn from terrain-referenced navigation.

Index Terms: Gaussian mixture, data fusion, integrated navigation, tracking.
I. INTRODUCTION

Several data fusion algorithms, usually derived in some way from the Kalman filter, represent the state of the observed system as a mixture of Gaussian distributions. An important example is the multiple hypothesis approach to tracking multiple targets where there is ambiguity in assigning observations to tracks—see for example [4, Sec. 6.7]—and this is the application motivating Salmond's and Williams's papers cited below. However, Gaussian mixture approaches are also useful in integrated navigation applications where, for example, there is some ambiguity in the position fixes used to augment an inertial navigation system: this is the application motivating the present note [5], [6].

Andrew Runnalls is with the Computing Laboratory at the University of Kent, Canterbury CT2 7NF, Kent, UK. (e-mail: A.R.Runnalls@kent.ac.uk).
A common drawback with these Gaussian mixture algorithms is that there is a tendency for the number of components of the mixture to grow without bound: indeed, if the algorithm were simply to follow the statistical model on which the method is based, the number of components would increase exponentially over time. To combat this, various pragmatic measures must be taken to keep the number of components in check. Typically this will be achieved either by discarding components with low probability, and/or by merging components which represent similar state hypotheses.
Salmond [1] proposed a mixture reduction algorithm in
which the number of components is reduced by repeatedly
choosing the two components that appear to be most similar
to each other, and merging them. His criterion of similarity is
based on concepts from the statistical analysis of variance, and
seeks to minimise the increase in ‘within-component’ variance
resulting from merging the two chosen components.
Williams [2], [3] proposed a mixture reduction algorithm
based on an integrated squared difference (ISD) similarity
measure, which as he points out has the big advantage that
the similarity between two arbitrary Gaussian mixtures can
be expressed in closed form. The algorithm he proposes uses
a hill-climbing optimisation to search for a reduced mixture
with the greatest similarity to the original mixture; however,
to find starting points for the optimisation process, he uses a
pairwise merge algorithm similar to Salmond’s, but using the
ISD similarity measure.
In the present paper, we propose a third variation on the
pairwise-merge approach, in which the measure of similarity
between two components is based on the Kullback-Leibler
(KL) discrimination measure [7].
The layout is as follows: Sec. II introduces a brief notation for Gaussian mixtures, defines the concept of a moment-preserving merge of two or more components of such a mixture, and outlines the pairwise-merge type of mixture reduction algorithm being considered in this paper. Sec. III introduces the KL discrimination measure. Sec. IV describes the criterion proposed in [1] for selecting which pair of components to merge at each stage, and identifies two properties of this criterion that may be considered anomalous. Sec. V similarly studies the ISD criterion proposed by Williams, and identifies a property of this criterion that may be considered undesirable in some applications, particularly where the system state vector has high dimensionality. Sec. VI proposes a dissimilarity measure for pair selection based on KL discrimination, and explores its properties; Sec. VII then discusses the advantages and disadvantages of a pairwise merge algorithm based on this dissimilarity measure. Sec. VIII compares the operation of the Salmond, Williams, and KL reduction algorithms in reducing a high-dimensional mixture arising in terrain-referenced navigation. Finally Sec. IX draws conclusions.
II. GENERAL BACKGROUND
A. Notation
We shall represent a component of a Gaussian mixture using notation of the form (w, µ, P): this represents a component with non-negative weight w, mean vector µ and covariance matrix P. (We shall assume throughout that components' covariance matrices are strictly positive definite, and not merely non-negative definite.) We shall use notation such as {(w_1, µ_1, P_1), (w_2, µ_2, P_2), ..., (w_n, µ_n, P_n)} to denote a mixture of n such components; such a mixture must satisfy w_1 + ... + w_n = 1, and has probability density function:

f(x) = \sum_{i=1}^{n} \frac{w_i}{\sqrt{(2\pi)^d \det P_i}} \exp\left( -\tfrac{1}{2} (x - \mu_i)^T P_i^{-1} (x - \mu_i) \right)

where d is the dimensionality of the state vector x. A plain (unmixed) Gaussian distribution will be written using notation such as {(1, µ, P)}.
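For illustration, the density above can be evaluated directly in a few lines; the following sketch (Python with NumPy and SciPy; the function and variable names are illustrative and not taken from the paper) computes the density of a small Gaussian mixture at a point.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, components):
    """Density of a Gaussian mixture at point x.

    `components` is a list of (w, mu, P) triples with the weights
    summing to one, mirroring the {(w1, mu1, P1), ...} notation above.
    """
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=P)
               for w, mu, P in components)

# A two-component mixture in d = 2 dimensions.
mix = [(0.6, np.array([0.0, 0.0]), np.eye(2)),
       (0.4, np.array([1.0, 1.0]), 0.5 * np.eye(2))]
print(mixture_pdf(np.array([0.5, 0.5]), mix))
```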
B. Merging Two Components
Suppose we are given a mixture of two Gaussian components:

{(w_1, µ_1, P_1), (w_2, µ_2, P_2)}   (1)

(where w_1 + w_2 = 1) and that we wish to approximate this mixture as a single Gaussian. A strong candidate is the Gaussian whose zeroth, first and second-order moments match those of (1), i.e. the Gaussian with mean vector µ and covariance matrix P as follows:

\mu = w_1 \mu_1 + w_2 \mu_2

P = w_1 \left( P_1 + (\mu_1 - \mu)(\mu_1 - \mu)^T \right) + w_2 \left( P_2 + (\mu_2 - \mu)(\mu_2 - \mu)^T \right)
  = w_1 P_1 + w_2 P_2 + w_1 w_2 (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T

(Theorem 3.2 will show that {(1, µ, P)} is the Gaussian whose Kullback-Leibler discrimination from the mixture (1) is minimal.)
We shall refer to (1, µ, P) as the moment-preserving merge of (w_1, µ_1, P_1) and (w_2, µ_2, P_2). More generally, we can remove the restriction that w_1 + w_2 = 1: given two weighted Gaussian components (w_i, µ_i, P_i) and (w_j, µ_j, P_j), with w_i + w_j ≤ 1, their moment-preserving merge is the Gaussian component (w_ij, µ_ij, P_ij) as follows (cf. [3, Eqs 2-4]):

w_{ij} = w_i + w_j   (2)

\mu_{ij} = w_{i|ij} \mu_i + w_{j|ij} \mu_j   (3)

P_{ij} = w_{i|ij} P_i + w_{j|ij} P_j + w_{i|ij} w_{j|ij} (\mu_i - \mu_j)(\mu_i - \mu_j)^T   (4)

where we write w_{i|ij} = w_i/(w_i + w_j) and w_{j|ij} = w_j/(w_i + w_j).
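A direct transcription of Eqs. (2)-(4) into code might look like the following sketch (Python/NumPy; the function name is ours, not from the paper).

```python
import numpy as np

def moment_preserving_merge(wi, mui, Pi, wj, muj, Pj):
    """Merge two weighted Gaussian components, matching moments
    up to second order, following Eqs. (2)-(4)."""
    w = wi + wj                     # Eq. (2)
    wi_c, wj_c = wi / w, wj / w     # w_{i|ij} and w_{j|ij}
    mu = wi_c * mui + wj_c * muj    # Eq. (3)
    dmu = mui - muj
    P = wi_c * Pi + wj_c * Pj + wi_c * wj_c * np.outer(dmu, dmu)  # Eq. (4)
    return w, mu, P

# Example: merging the two components of a mixture like (1) with w1 = w2 = 0.5.
w, mu, P = moment_preserving_merge(0.5, np.zeros(2), np.eye(2),
                                   0.5, np.ones(2), np.eye(2))
print(w, mu, P)
```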
C. Mixture Reduction Algorithm
Suppose that we are given a mixture with n components, and we wish to approximate it by a mixture of m components, where m < n. In this paper, we focus on algorithms which operate in the following general way:

    While more than m components remain, choose the two components that in a sense to be defined are least dissimilar, and replace them by their moment-preserving merge.

The algorithm proposed in [1, Sec. 4] is of this type, using the dissimilarity measure to be described in Sec. IV; the algorithm proposed in [2], [3] uses an algorithm of this type to determine starting points for an optimisation procedure.
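The greedy pairwise-merge scheme just described can be sketched as follows; the dissimilarity function is left as a parameter, since Secs. IV-VI consider different choices for it (the code and its names are ours, not taken from [1]-[3]).

```python
import numpy as np

def reduce_mixture(components, m, dissimilarity):
    """Greedy pairwise reduction: repeatedly merge the least
    dissimilar pair until only m components remain.

    `components` is a list of (w, mu, P) triples; `dissimilarity`
    is any function of two components returning a scalar cost.
    """
    comps = list(components)
    while len(comps) > m:
        # Find the pair (i, j) with the smallest dissimilarity.
        i, j = min(((a, b) for a in range(len(comps))
                    for b in range(a + 1, len(comps))),
                   key=lambda ab: dissimilarity(comps[ab[0]], comps[ab[1]]))
        (wi, mui, Pi), (wj, muj, Pj) = comps[i], comps[j]
        # Moment-preserving merge, Eqs. (2)-(4).
        w = wi + wj
        mu = (wi * mui + wj * muj) / w
        d = mui - muj
        P = (wi * Pi + wj * Pj) / w + (wi * wj / w**2) * np.outer(d, d)
        comps = [c for k, c in enumerate(comps) if k not in (i, j)] + [(w, mu, P)]
    return comps
```

Salmond's criterion of Sec. IV, the ISD-based cost of Sec. V, or the KL-based bound the paper goes on to propose could each be plugged in as `dissimilarity`.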
III. KULLBACK-LEIBLER DISCRIMINATION
If f_1(x) and f_2(x) are probability density functions over R^d, the Kullback-Leibler (KL) discrimination¹ of f_2 from f_1 is defined as:

d_{kl}(f_1, f_2) = \int_{\mathbb{R}^d} f_1(x) \log \frac{f_1(x)}{f_2(x)} \, dx   (5)

Although clearly d_{kl}(f, f) = 0, and d_{kl}(f, g) ≥ 0 (cf. [8, Theorem 2.6.3], [9, Theorem 4.3.1]), in general it is not true that d_{kl}(f, g) = d_{kl}(g, f), nor that d_{kl}(f, g) + d_{kl}(g, h) ≥ d_{kl}(f, h).
To give an informal motivation for KL discrimination, suppose that we have a stream of data x_1, x_2, ... which we assume to be independent samples either from f(x) or from g(x), and we wish to decide which. From a Bayesian perspective, the approach we might take is to continue drawing samples until the likelihood ratio \prod_i (f(x_i)/g(x_i)) exceeds some predefined threshold, say 100:1 in favour of one candidate or the other. Equivalently, we will be aiming to achieve a sample large enough that the logarithm of the likelihood ratio falls outside the bounds ±log 100. Now suppose that (unknown to us) the data stream is actually coming from f(x). Then the expected value of the log-likelihood-ratio for a single sample point will be E(log(f(x)/g(x))) = d_{kl}(f, g). Consequently, the expected log-likelihood-ratio for the full sample will exceed log 100 provided the sample size exceeds (log 100)/d_{kl}(f, g). Roughly speaking, small values of d_{kl}(f, g) mean that we will need large samples to distinguish f from g, and conversely.
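This sample-size interpretation is easy to check numerically; the following sketch (Python/NumPy, our own illustration) estimates E(log(f(x)/g(x))) by simulation for two one-dimensional Gaussians and compares it with the closed-form value of d_kl(f, g).

```python
import numpy as np

rng = np.random.default_rng(0)

# f = N(0, 1), g = N(1, 2^2), in one dimension.
mu_f, s_f = 0.0, 1.0
mu_g, s_g = 1.0, 2.0

def log_ratio(x):
    """log(f(x)/g(x)) for the two univariate Gaussians above."""
    return (np.log(s_g / s_f)
            - 0.5 * ((x - mu_f) / s_f) ** 2
            + 0.5 * ((x - mu_g) / s_g) ** 2)

x = rng.normal(mu_f, s_f, size=200_000)          # samples drawn from f
d_kl_mc = log_ratio(x).mean()                    # Monte Carlo estimate of d_kl(f, g)
d_kl_exact = (np.log(s_g / s_f)
              + (s_f**2 + (mu_f - mu_g)**2) / (2 * s_g**2) - 0.5)

print(d_kl_mc, d_kl_exact)        # both approximately 0.443
print(np.log(100) / d_kl_exact)   # roughly 10 samples expected for a 100:1 ratio
```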
The remainder of this section introduces theorems about
Kullback-Leibler discrimination that we shall use in Sec. VI,
and can be skipped on a first reading.
Theorem 3.1: Let g_1(x) be the d-dimensional Gaussian pdf with mean vector µ_1 and positive definite covariance matrix P_1, and let g_2(x) be the d-dimensional Gaussian pdf with mean vector µ_2 and p.d. covariance matrix P_2. Then:

2 d_{kl}(g_1, g_2) = \mathrm{tr}\left( P_2^{-1} \left[ P_1 - P_2 + (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \right] \right) + \log \frac{\det(P_2)}{\det(P_1)}
¹Also referred to as cross-entropy, Kullback-Leibler information, or Kullback-Leibler divergence. However, Kullback and Leibler themselves [7] and several subsequent authors use the term 'divergence' to refer to d_{kl}(f_1, f_2) + d_{kl}(f_2, f_1). It is also sometimes called the Kullback-Leibler distance, despite not satisfying the usual requirements for a distance measure.

For a proof see for example [9, Theorem 7.2.8].
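The closed form in Theorem 3.1 is straightforward to implement; the sketch below (Python/NumPy, our own helper) computes d_kl(g_1, g_2) for two Gaussian densities.

```python
import numpy as np

def kl_gaussian(mu1, P1, mu2, P2):
    """Kullback-Leibler discrimination d_kl(g1, g2) of the Gaussian
    g2 = N(mu2, P2) from g1 = N(mu1, P1), per Theorem 3.1."""
    dmu = mu1 - mu2
    M = P1 - P2 + np.outer(dmu, dmu)
    # tr(P2^{-1} M), computed via a linear solve rather than an explicit inverse.
    term_tr = np.trace(np.linalg.solve(P2, M))
    term_logdet = np.linalg.slogdet(P2)[1] - np.linalg.slogdet(P1)[1]
    return 0.5 * (term_tr + term_logdet)

# Sanity check: the discrimination of a density from itself is zero.
mu, P = np.array([1.0, -1.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
print(kl_gaussian(mu, P, mu, P))   # 0.0 (up to rounding)
```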
Theorem 3.2: Let f(x) be a probability density function over d dimensions with well-defined mean µ* and covariance matrix P*, where P* is strictly positive-definite. As before, let {(1, µ, P)} denote the Gaussian density with mean µ and p.d. covariance matrix P. Then the unique minimum value of d_{kl}(f, {(1, µ, P)}) is achieved when µ = µ* and P = P*.
For a proof see the Appendix.
Theorem 3.3: If f(x), h_1(x) and h_2(x) are any pdfs over d dimensions and 0 ≤ w ≤ 1 then, writing w̄ for 1 − w:

d_{kl}(w h_1 + \bar{w} h_2, f) ≤ w\, d_{kl}(h_1, f) + \bar{w}\, d_{kl}(h_2, f)

d_{kl}(f, w h_1 + \bar{w} h_2) ≤ w\, d_{kl}(f, h_1) + \bar{w}\, d_{kl}(f, h_2)
This is a standard result: for a proof see [9, Theorem 4.3.2]
or [8, Theorem 2.7.2].
Theorem 3.4: If f_1(x), f_2(x) and h(x) are any pdfs over d dimensions, 0 ≤ w ≤ 1 and w̄ = 1 − w, then:

d_{kl}(w f_1 + \bar{w} h, w f_2 + \bar{w} h) ≤ w\, d_{kl}(f_1, f_2)
For a proof see the Appendix.
IV. SALMOND'S CRITERION
Let {(w_1, µ_1, P_1), ..., (w_n, µ_n, P_n)} be an n-component Gaussian mixture, and let µ and P be respectively the overall mean and the overall variance of this mixture. Clearly

\mu = \sum_{i=1}^{n} w_i \mu_i

while P can be written as P = W + B, where W is the 'within-components' contribution to the total variance, given by:

W = \sum_{i=1}^{n} w_i P_i

while B is the 'between-components' contribution given by:

B = \sum_{i=1}^{n} w_i (\mu_i - \mu)(\mu_i - \mu)^T
When two components are replaced by their moment-preserving merge, the effect is, roughly speaking, to increase W and decrease B by a corresponding amount, leaving the total variance P unchanged. Salmond's general idea [1, Sec. 4] is to choose for merging two components i and j such that the increase in W is minimised. He shows that the change in W when components i and j are replaced by their moment-preserving merge is

\Delta W_{ij} = \frac{w_i w_j}{w_i + w_j} (\mu_i - \mu_j)(\mu_i - \mu_j)^T

However, ΔW_{ij} is a matrix, whereas we require a scalar dissimilarity measure. Salmond proposes using the following measure:

D_s^2(i, j) = \mathrm{tr}\left( P^{-1} \Delta W_{ij} \right)   (6)

Here the trace reduces its matrix argument to a scalar, and the premultiplication by P^{-1} ensures that the resulting dissimilarity measure is invariant under linear transformations of the state space.
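For illustration, criterion (6) might be computed as follows (Python/NumPy sketch; the helper names are ours and not taken from [1]). Examples 4.1 and 4.2 below apply this criterion directly.

```python
import numpy as np

def overall_moments(components):
    """Overall mean and covariance P = W + B of a Gaussian mixture,
    given as a list of (w, mu, P) triples."""
    ws = np.array([w for w, _, _ in components])
    mus = np.array([mu for _, mu, _ in components])
    mu = ws @ mus
    W = sum(w * P for w, _, P in components)
    B = sum(w * np.outer(m - mu, m - mu) for w, m in zip(ws, mus))
    return mu, W + B

def salmond_cost(components, i, j):
    """Salmond's dissimilarity D_s^2(i, j) = tr(P^{-1} Delta W_ij)."""
    _, P = overall_moments(components)
    wi, mui, _ = components[i]
    wj, muj, _ = components[j]
    d = mui - muj
    dW = (wi * wj / (wi + wj)) * np.outer(d, d)
    return np.trace(np.linalg.solve(P, dW))
```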
[Fig. 1: Means of the components A, B, C, D in Example 4.2, plotted in the (x, y) plane with both axes running from -1.5 to 1.5.]
However, the dissimilarity measure defined in (6) has two
properties that may be considered undesirable as a basis for
choosing which components to merge. First, the measure
depends on the means of the components, but not on their
individual covariance matrices, leading to the behaviour in this
example:
Example 4.1: A mixture comprises three two-dimensional components {(1/3, µ, P_1), (1/3, µ + δµ, P_1), (1/3, µ, P_2)}, where δµ is very small (e.g. δµ = (0.0001, 0.0001)^T) but P_2 is very different from P_1:

P_1 = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}   P_2 = \begin{pmatrix} 1 & -0.9 \\ -0.9 & 1 \end{pmatrix}

We wish to reduce the mixture to two components. Then, using (6), we will choose to merge the first and third components, yielding a merged component (2/3, µ, I_2), where I_2 is the two-dimensional identity matrix.

The reader may well consider that in this example it would be better to merge the first two components, yielding (2/3, µ + (1/2)δµ, P_1 + (1/4)δµ δµ^T).
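A quick numerical check of Example 4.1 (Python/NumPy; the small helper below is ours, and the covariance matrices are those reconstructed above):

```python
import numpy as np

# Components of Example 4.1, taking mu = 0 without loss of generality.
dmu = np.array([1e-4, 1e-4])
P1 = np.array([[1.0, 0.9], [0.9, 1.0]])
P2 = np.array([[1.0, -0.9], [-0.9, 1.0]])
comps = [(1/3, np.zeros(2), P1), (1/3, dmu, P1), (1/3, np.zeros(2), P2)]

# Overall covariance P = W + B, then Salmond's criterion (6) for each pair.
mu = sum(w * m for w, m, _ in comps)
P = sum(w * (C + np.outer(m - mu, m - mu)) for w, m, C in comps)

def ds2(i, j):
    wi, mi, _ = comps[i]; wj, mj, _ = comps[j]
    d = mi - mj
    return np.trace(np.linalg.solve(P, (wi * wj / (wi + wj)) * np.outer(d, d)))

costs = {(i, j): ds2(i, j) for i in range(3) for j in range(i + 1, 3)}
print(min(costs, key=costs.get))   # (0, 2): the criterion merges the first and third
```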
The second drawback arises from the presence of the overall
covariance P within (6). This has the implication that adding
a new component to a mixture may alter the order in which
the existing components are merged, as shown in the following
example.
Example 4.2: A mixture over the two dimensions (x, y) consists of four components

A = (0.25, (0.661, 1)^T, I_2)   (7)
B = (0.25, (1.339, -1)^T, I_2)   (8)
C = (0.25, (-0.692, 1.1)^T, I_2)   (9)
D = (0.25, (-1.308, -1.1)^T, I_2)   (10)

(The means of the components are shown in Fig. 1.) We wish to reduce this mixture to three components. It is readily established that the overall mean of the mixture is (0, 0)^T, and its covariance matrix is 2.105 I_2. From the latter fact, it follows that criterion (6) will lead us simply to merge the two components whose means are closest together, namely A and C.

Now modify the original mixture by reducing the weights of components A to D to 0.2, and adding a fifth component E = (0.2, (0, 10)^T, I_2). We wish to reduce this new mixture to three components. It turns out that criterion (6) now selects components A and B for the first merge, and components C and D for the second merge. This is because, although E is a weak candidate for either merge, its inclusion in the mixture has greatly increased its overall variance in the y-direction, meaning that (6) now weights differences in x more heavily than differences in y.
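The switch in merge order can be reproduced with a few lines of NumPy (our own sketch; the component means use the signs as reconstructed in Eqs. (7)-(10) above).

```python
import numpy as np

def best_pair(comps):
    """Index pair minimising Salmond's criterion (6) within `comps`."""
    mu = sum(w * m for w, m, _ in comps)
    P = sum(w * (C + np.outer(m - mu, m - mu)) for w, m, C in comps)
    def cost(i, j):
        wi, mi, _ = comps[i]; wj, mj, _ = comps[j]
        d = mi - mj
        return np.trace(np.linalg.solve(P, (wi * wj / (wi + wj)) * np.outer(d, d)))
    pairs = [(i, j) for i in range(len(comps)) for j in range(i + 1, len(comps))]
    return min(pairs, key=lambda ij: cost(*ij))

I2 = np.eye(2)
means = {'A': [0.661, 1], 'B': [1.339, -1], 'C': [-0.692, 1.1], 'D': [-1.308, -1.1]}
four = [(0.25, np.array(m), I2) for m in means.values()]
print(best_pair(four))            # (0, 2): components A and C

five = [(0.2, np.array(m), I2) for m in means.values()]
five.append((0.2, np.array([0.0, 10.0]), I2))
print(best_pair(five))            # (0, 1): components A and B
```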
V. WILLIAMS'S CRITERION
Williams [2] and Williams and Maybeck [3] propose a method of Gaussian mixture reduction based on the integrated squared difference (ISD) measure of the dissimilarity between two pdfs f_1(x) and f_2(x):

J_S = \int \left( f_1(x) - f_2(x) \right)^2 dx

(cf. [3, Eq. 4]). This has the important property that the dissimilarity between two arbitrary Gaussian mixtures can be expressed in closed form (given in [3, Eq. 10])—a property regrettably not shared by the measure proposed in the present paper.
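The closed form of [3, Eq. 10] is not reproduced in this paper, but an equivalent computation follows from the standard identity \int N(x; µ_a, P_a) N(x; µ_b, P_b) dx = N(µ_a; µ_b, P_a + P_b). The sketch below (Python/SciPy, our own code) computes the ISD between two Gaussian mixtures on that basis.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_overlap(mu_a, Pa, mu_b, Pb):
    """Integral of the product of two Gaussian densities."""
    return multivariate_normal.pdf(mu_a, mean=mu_b, cov=Pa + Pb)

def isd(mix1, mix2):
    """Integrated squared difference between two Gaussian mixtures,
    each given as a list of (w, mu, P) triples."""
    def cross(mx, my):
        return sum(wa * wb * gaussian_overlap(ma, Pa, mb, Pb)
                   for wa, ma, Pa in mx for wb, mb, Pb in my)
    return cross(mix1, mix1) - 2 * cross(mix1, mix2) + cross(mix2, mix2)

# Example: ISD cost of replacing a symmetric pair by its moment-preserving merge.
pair = [(0.5, np.array([-0.5, 0.0]), np.eye(2)),
        (0.5, np.array([0.5, 0.0]), np.eye(2))]
merged = [(1.0, np.array([0.0, 0.0]), np.eye(2) + np.diag([0.25, 0.0]))]
print(isd(pair, merged))
```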
Their algorithm for reducing an n-component mixture to an m-component mixture (m < n) can be summarised as follows:

1) While more than m components remain, consider all possible operations of the following two kinds:
   - Deleting a component and renormalising the remaining mixture;
   - Replacing a pair of components with their moment-preserving merge;
   and in each case evaluate the ISD-dissimilarity of the resulting mixture from the original mixture. Apply the operation for which this dissimilarity is a minimum.
2) Use the resulting m-component mixture as the starting point for a gradient-based optimisation technique, to seek an m-component mixture with lower dissimilarity to the original mixture.

The authors note that the optimisation at Step 2 will seek a local minimum rather than the global minimum: hence the need to choose the starting point carefully.
The ISD cost measure circumvents both of the drawbacks of Salmond's criterion. First, the measure depends explicitly on the covariance matrices as well as the means of the components. Second, the cost incurred by merging two components depends only on the parameters of those components, and not on other characteristics of the mixture of which they form a part. Consequently, the anomalies observed in Examples 4.1 and 4.2 do not arise.
However, the ISD criterion leads to puzzling behaviour of its own. To illustrate this, we will focus on mixtures where the components are radially symmetric, i.e. the covariance matrices are multiples of the identity matrix. Consider first the case where the starting mixture is {(w, µ − cσu, σ²I_d), (w, µ + cσu, σ²I_d)}, where µ is arbitrary and u is a d-dimensional unit vector. The means of the two components of this mixture are distance 2cσ apart.

In this case it follows from [3, Eq. 12] that the ISD cost of deleting one of the components (and raising the other component to unit weight) is given by:

J_S = \frac{4w^2}{\sigma^d \sqrt{(4\pi)^d}} h_D(c)   (11)

where

h_D(c) = \tfrac{1}{2} \left( 1 - \exp(-c^2) \right)   (12)

while the cost of replacing the two components by their moment-preserving merge, namely (2w, µ, σ²(I + c²uu^T)), is:

J_S = \frac{4w^2}{\sigma^d \sqrt{(4\pi)^d}} h_M(c)   (13)

where

h_M(c) = \tfrac{1}{2} \left( 1 + \exp(-c^2) \right) + \frac{1}{\sqrt{1 + c^2}} - 2 \sqrt{\frac{2}{2 + c^2}} \exp\left( -\frac{c^2}{2(2 + c^2)} \right)   (14)
The functions h_M(c) and h_D(c) are both zero for c = 0 and, as c increases, both functions increase monotonically, tending towards 1/2 as c → ∞. It can be shown that h_D(c) > h_M(c) except when c is zero, so the deletion option will not be considered further.
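The comparison between h_D and h_M is easy to verify numerically; the following sketch (Python/NumPy, our own code) evaluates (12) and (14) over a range of c.

```python
import numpy as np

def h_D(c):
    """Deletion cost factor, Eq. (12)."""
    return 0.5 * (1.0 - np.exp(-c**2))

def h_M(c):
    """Merge cost factor, Eq. (14)."""
    return (0.5 * (1.0 + np.exp(-c**2))
            + 1.0 / np.sqrt(1.0 + c**2)
            - 2.0 * np.sqrt(2.0 / (2.0 + c**2)) * np.exp(-c**2 / (2.0 * (2.0 + c**2))))

for ci in np.linspace(0.0, 5.0, 11):
    print(f"c = {ci:4.1f}   h_D = {h_D(ci):.4f}   h_M = {h_M(ci):.4f}")
# Both start at 0, rise monotonically towards 1/2, and h_D(c) > h_M(c) for c > 0.
```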
In the example under consideration, σ acts simply as a scale
factor, but it nevertheless appears in (13), raised moreover to
the power d. This leads to some surprising behaviour in the
way in which Williams’s algorithm selects pairwise merges, as
in the following twelve-dimensional example. (It is not unusual
in inertial navigation applications for the state vector to have
15 or more dimensions.)
Example 5.1: A mixture over the space (x_1, ..., x_12) comprises four components

A = (0.25, (-20, 0.5, 0, ..., 0)^T, I_12)   (15)
B = (0.25, (-20, -0.5, 0, ..., 0)^T, I_12)   (16)
C = (0.25, (20, 10, 0, ..., 0)^T, 4 I_12)   (17)
D = (0.25, (20, -10, 0, ..., 0)^T, 4 I_12)   (18)

where in each mean vector the ellipsis ... comprises eight zeroes. Note that components A and B have negligible probability within the region where x_1 > 0, and C and D have negligible probability within the region x_1 < 0.
Assume that we wish to reduce this four-component mixture to three components. Now, according to (13) the cost of replacing components A and B by their moment-preserving merge is

J_S = \frac{1}{4 (4\pi)^6} h_M(0.5) \approx 6.39 \times 10^{-12}
