
Citation for published version: Runnalls, Andrew R. (2007) Kullback-Leibler approach to Gaussian mixture reduction. IEEE Transactions on Aerospace and Electronic Systems, 43 (3). pp. 989-999. ISSN 0018-9251. DOI: https://doi.org/10.1109/TAES.2007.4383588. Link to record in KAR: https://kar.kent.ac.uk/2782/

A Kullback-Leibler Approach
to Gaussian Mixture Reduction
Andrew R. Runnalls
© 2006 IEEE. Personal use of this material is permitted.
However, permission to reprint/republish this material for
advertising or promotional purposes or for creating new
collective works for resale or redistribution to servers or
lists, or to reuse any copyrighted component of this work
in other works must be obtained from the IEEE.
This material is presented to ensure timely dissemination
of scholarly and technical work. Copyright and all rights
therein are retained by authors or by other copyright
holders. All persons copying this information are expected
to adhere to the terms and constraints invoked by each
author’s copyright. In most cases, these works may not be
reposted without the explicit permission of the copyright
holder.
Abstract: A common problem in multi-target tracking is to approximate a Gaussian mixture by one containing fewer components; similar problems can arise in integrated navigation. A common approach is successively to merge pairs of components, replacing the pair with a single Gaussian component whose moments up to second order match those of the merged pair. Salmond [1] and Williams [2], [3] have each proposed algorithms along these lines, but using different criteria for selecting the pair to be merged at each stage. The paper shows how under certain circumstances each of these pair-selection criteria can give rise to anomalous behaviour, and proposes that a key consideration should be the Kullback-Leibler discrimination of the reduced mixture with respect to the original mixture. Although computing this directly would normally be impractical, the paper shows how an easily-computed upper bound can be used as a pair-selection criterion which avoids the anomalies of the earlier approaches. The behaviour of the three algorithms is compared using a high-dimensional example drawn from terrain-referenced navigation.

Index Terms: Gaussian mixture, data fusion, integrated navigation, tracking.
I. INTRODUCTION

Several data fusion algorithms, usually derived in some way from the Kalman filter, represent the state of the observed system as a mixture of Gaussian distributions. An important example is the multiple hypothesis approach to tracking multiple targets where there is ambiguity in assigning observations to tracks—see for example [4, Sec. 6.7]—and this is the application motivating Salmond's and Williams's papers cited below. However, Gaussian mixture approaches are also useful in integrated navigation applications where, for example, there is some ambiguity in the position fixes used to augment an inertial navigation system: this is the application motivating the present note [5], [6].

Andrew Runnalls is with the Computing Laboratory at the University of Kent, Canterbury CT2 7NF, Kent, UK. (e-mail: A.R.Runnalls@kent.ac.uk).
A common drawback with these Gaussian mixture algorithms is that there is a tendency for the number of components of the mixture to grow without bound: indeed, if the algorithm were simply to follow the statistical model on which the method is based, the number of components would increase exponentially over time. To combat this, various pragmatic measures must be taken to keep the number of components in check. Typically this will be achieved either by discarding components with low probability, and/or by merging components which represent similar state hypotheses.
Salmond [1] proposed a mixture reduction algorithm in
which the number of components is reduced by repeatedly
choosing the two components that appear to be most similar
to each other, and merging them. His criterion of similarity is
based on concepts from the statistical analysis of variance, and
seeks to minimise the increase in ‘within-component’ variance
resulting from merging the two chosen components.
Williams [2], [3] proposed a mixture reduction algorithm
based on an integrated squared difference (ISD) similarity
measure, which as he points out has the big advantage that
the similarity between two arbitrary Gaussian mixtures can
be expressed in closed form. The algorithm he proposes uses
a hill-climbing optimisation to search for a reduced mixture
with the greatest similarity to the original mixture; however,
to find starting points for the optimisation process, he uses a
pairwise merge algorithm similar to Salmond’s, but using the
ISD similarity measure.
In the present paper, we propose a third variation on the
pairwise-merge approach, in which the measure of similarity
between two components is based on the Kullback-Leibler
(KL) discrimination measure [7].
The layout is as follows: Sec. II introduces a brief notation for Gaussian mixtures, defines the concept of a moment-preserving merge of two or more components of such a mixture, and outlines the pairwise-merge type of mixture reduction algorithm being considered in this paper. Sec. III introduces the KL discrimination measure. Sec. IV describes the criterion proposed in [1] for selecting which pair of components to merge at each stage, and identifies two properties of this criterion that may be considered anomalous. Sec. V similarly studies the ISD criterion proposed by Williams, and identifies a property of this criterion that may be considered undesirable in some applications, particularly where the system state vector has high dimensionality. Sec. VI proposes a dissimilarity measure for pair selection based on KL discrimination, and explores its properties; Sec. VII then discusses the advantages and disadvantages of a pairwise merge algorithm based on this dissimilarity measure. Sec. VIII compares the operation of the Salmond, Williams, and KL reduction algorithms in reducing a high-dimensional mixture arising in terrain-referenced navigation. Finally Sec. IX draws conclusions.
II. GENERAL BACKGROUND
A. Notation
We shall represent a component of a Gaussian mixture using notation of the form (w, µ, P): this represents a component with non-negative weight w, mean vector µ and covariance matrix P. (We shall assume throughout that components' covariance matrices are strictly positive definite, and not merely non-negative definite.) We shall use notation such as {(w_1, µ_1, P_1), (w_2, µ_2, P_2), ..., (w_n, µ_n, P_n)} to denote a mixture of n such components; such a mixture must satisfy w_1 + ... + w_n = 1, and has probability density function:

f(x) = \sum_{i=1}^{n} \frac{w_i}{\sqrt{(2\pi)^d \det P_i}} \exp\left( -\tfrac{1}{2} (x - \mu_i)^T P_i^{-1} (x - \mu_i) \right)

where d is the dimensionality of the state vector x. A plain (unmixed) Gaussian distribution will be written using notation such as {(1, µ, P)}.
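For illustration, the density above can be evaluated directly in a few lines; the following sketch (Python with NumPy and SciPy; the function and variable names are illustrative and not taken from the paper) computes the density of a small Gaussian mixture at a point.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, components):
    """Density of a Gaussian mixture at point x.

    `components` is a list of (w, mu, P) triples with the weights
    summing to one, mirroring the {(w1, mu1, P1), ...} notation above.
    """
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=P)
               for w, mu, P in components)

# A two-component mixture in d = 2 dimensions.
mix = [(0.6, np.array([0.0, 0.0]), np.eye(2)),
       (0.4, np.array([1.0, 1.0]), 0.5 * np.eye(2))]
print(mixture_pdf(np.array([0.5, 0.5]), mix))
```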
B. Merging Two Components
Suppose we are given a mixture of two Gaussian components:

{(w_1, µ_1, P_1), (w_2, µ_2, P_2)}   (1)

(where w_1 + w_2 = 1) and that we wish to approximate this mixture as a single Gaussian. A strong candidate is the Gaussian whose zeroth, first and second-order moments match those of (1), i.e. the Gaussian with mean vector µ and covariance matrix P as follows:

\mu = w_1 \mu_1 + w_2 \mu_2

P = w_1 \left( P_1 + (\mu_1 - \mu)(\mu_1 - \mu)^T \right) + w_2 \left( P_2 + (\mu_2 - \mu)(\mu_2 - \mu)^T \right)
  = w_1 P_1 + w_2 P_2 + w_1 w_2 (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T

(Theorem 3.2 will show that {(1, µ, P)} is the Gaussian whose Kullback-Leibler discrimination from the mixture (1) is minimal.)
We shall refer to (1, µ, P) as the moment-preserving merge of (w_1, µ_1, P_1) and (w_2, µ_2, P_2). More generally, we can remove the restriction that w_1 + w_2 = 1: given two weighted Gaussian components (w_i, µ_i, P_i) and (w_j, µ_j, P_j), with w_i + w_j ≤ 1, their moment-preserving merge is the Gaussian component (w_ij, µ_ij, P_ij) as follows (cf. [3, Eqs 2-4]):

w_{ij} = w_i + w_j   (2)

\mu_{ij} = w_{i|ij} \mu_i + w_{j|ij} \mu_j   (3)

P_{ij} = w_{i|ij} P_i + w_{j|ij} P_j + w_{i|ij} w_{j|ij} (\mu_i - \mu_j)(\mu_i - \mu_j)^T   (4)

where we write w_{i|ij} = w_i/(w_i + w_j) and w_{j|ij} = w_j/(w_i + w_j).
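A direct transcription of Eqs. (2)-(4) into code might look like the following sketch (Python/NumPy; the function name is ours, not from the paper).

```python
import numpy as np

def moment_preserving_merge(wi, mui, Pi, wj, muj, Pj):
    """Merge two weighted Gaussian components, matching moments
    up to second order, following Eqs. (2)-(4)."""
    w = wi + wj                     # Eq. (2)
    wi_c, wj_c = wi / w, wj / w     # w_{i|ij} and w_{j|ij}
    mu = wi_c * mui + wj_c * muj    # Eq. (3)
    dmu = mui - muj
    P = wi_c * Pi + wj_c * Pj + wi_c * wj_c * np.outer(dmu, dmu)  # Eq. (4)
    return w, mu, P

# Example: merging the two components of a mixture like (1) with w1 = w2 = 0.5.
w, mu, P = moment_preserving_merge(0.5, np.zeros(2), np.eye(2),
                                   0.5, np.ones(2), np.eye(2))
print(w, mu, P)
```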
C. Mixture Reduction Algorithm
Suppose that we are given a mixture with n components, and we wish to approximate it by a mixture of m components, where m < n. In this paper, we focus on algorithms which operate in the following general way:

    While more than m components remain, choose the two components that in a sense to be defined are least dissimilar, and replace them by their moment-preserving merge.

The algorithm proposed in [1, Sec. 4] is of this type, using the dissimilarity measure to be described in Sec. IV; the algorithm proposed in [2], [3] uses an algorithm of this type to determine starting points for an optimisation procedure.
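The greedy pairwise-merge scheme just described can be sketched as follows; the dissimilarity function is left as a parameter, since Secs. IV-VI consider different choices for it (the code and its names are ours, not taken from [1]-[3]).

```python
import numpy as np

def reduce_mixture(components, m, dissimilarity):
    """Greedy pairwise reduction: repeatedly merge the least
    dissimilar pair until only m components remain.

    `components` is a list of (w, mu, P) triples; `dissimilarity`
    is any function of two components returning a scalar cost.
    """
    comps = list(components)
    while len(comps) > m:
        # Find the pair (i, j) with the smallest dissimilarity.
        i, j = min(((a, b) for a in range(len(comps))
                    for b in range(a + 1, len(comps))),
                   key=lambda ab: dissimilarity(comps[ab[0]], comps[ab[1]]))
        (wi, mui, Pi), (wj, muj, Pj) = comps[i], comps[j]
        # Moment-preserving merge, Eqs. (2)-(4).
        w = wi + wj
        mu = (wi * mui + wj * muj) / w
        d = mui - muj
        P = (wi * Pi + wj * Pj) / w + (wi * wj / w**2) * np.outer(d, d)
        comps = [c for k, c in enumerate(comps) if k not in (i, j)] + [(w, mu, P)]
    return comps
```

Salmond's criterion of Sec. IV, the ISD-based cost of Sec. V, or the KL-based bound the paper goes on to propose could each be plugged in as `dissimilarity`.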
III. KULLBACK-LEIBLER DISCRIMINATION
If f_1(x) and f_2(x) are probability density functions over R^d, the Kullback-Leibler (KL) discrimination¹ of f_2 from f_1 is defined as:

d_{kl}(f_1, f_2) = \int_{\mathbb{R}^d} f_1(x) \log \frac{f_1(x)}{f_2(x)} \, dx   (5)

Although clearly d_{kl}(f, f) = 0, and d_{kl}(f, g) ≥ 0 (cf. [8, Theorem 2.6.3], [9, Theorem 4.3.1]), in general it is not true that d_{kl}(f, g) = d_{kl}(g, f), nor that d_{kl}(f, g) + d_{kl}(g, h) ≥ d_{kl}(f, h).
To give an informal motivation for KL discrimination, suppose that we have a stream of data x_1, x_2, ... which we assume to be independent samples either from f(x) or from g(x), and we wish to decide which. From a Bayesian perspective, the approach we might take is to continue drawing samples until the likelihood ratio \prod_i (f(x_i)/g(x_i)) exceeds some predefined threshold, say 100:1 in favour of one candidate or the other. Equivalently, we will be aiming to achieve a sample large enough that the logarithm of the likelihood ratio falls outside the bounds ±log 100. Now suppose that (unknown to us) the data stream is actually coming from f(x). Then the expected value of the log-likelihood-ratio for a single sample point will be E(log(f(x)/g(x))) = d_{kl}(f, g). Consequently, the expected log-likelihood-ratio for the full sample will exceed log 100 provided the sample size exceeds (log 100)/d_{kl}(f, g). Roughly speaking, small values of d_{kl}(f, g) mean that we will need large samples to distinguish f from g, and conversely.
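This sample-size interpretation is easy to check numerically; the following sketch (Python/NumPy, our own illustration) estimates E(log(f(x)/g(x))) by simulation for two one-dimensional Gaussians and compares it with the closed-form value of d_kl(f, g).

```python
import numpy as np

rng = np.random.default_rng(0)

# f = N(0, 1), g = N(1, 2^2), in one dimension.
mu_f, s_f = 0.0, 1.0
mu_g, s_g = 1.0, 2.0

def log_ratio(x):
    """log(f(x)/g(x)) for the two univariate Gaussians above."""
    return (np.log(s_g / s_f)
            - 0.5 * ((x - mu_f) / s_f) ** 2
            + 0.5 * ((x - mu_g) / s_g) ** 2)

x = rng.normal(mu_f, s_f, size=200_000)          # samples drawn from f
d_kl_mc = log_ratio(x).mean()                    # Monte Carlo estimate of d_kl(f, g)
d_kl_exact = (np.log(s_g / s_f)
              + (s_f**2 + (mu_f - mu_g)**2) / (2 * s_g**2) - 0.5)

print(d_kl_mc, d_kl_exact)        # both approximately 0.443
print(np.log(100) / d_kl_exact)   # roughly 10 samples expected for a 100:1 ratio
```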
The remainder of this section introduces theorems about
Kullback-Leibler discrimination that we shall use in Sec. VI,
and can be skipped on a first reading.
Theorem 3.1: Let g_1(x) be the d-dimensional Gaussian pdf with mean vector µ_1 and positive definite covariance matrix P_1, and let g_2(x) be the d-dimensional Gaussian pdf with mean vector µ_2 and p.d. covariance matrix P_2. Then:

2 d_{kl}(g_1, g_2) = \mathrm{tr}\left( P_2^{-1} \left[ P_1 - P_2 + (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \right] \right) + \log \frac{\det(P_2)}{\det(P_1)}
¹Also referred to as cross-entropy, Kullback-Leibler information, or Kullback-Leibler divergence. However, Kullback and Leibler themselves [7] and several subsequent authors use the term 'divergence' to refer to d_{kl}(f_1, f_2) + d_{kl}(f_2, f_1). It is also sometimes called the Kullback-Leibler distance, despite not satisfying the usual requirements for a distance measure.

For a proof see for example [9, Theorem 7.2.8].
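The closed form in Theorem 3.1 is straightforward to implement; the sketch below (Python/NumPy, our own helper) computes d_kl(g_1, g_2) for two Gaussian densities.

```python
import numpy as np

def kl_gaussian(mu1, P1, mu2, P2):
    """Kullback-Leibler discrimination d_kl(g1, g2) of the Gaussian
    g2 = N(mu2, P2) from g1 = N(mu1, P1), per Theorem 3.1."""
    dmu = mu1 - mu2
    M = P1 - P2 + np.outer(dmu, dmu)
    # tr(P2^{-1} M), computed via a linear solve rather than an explicit inverse.
    term_tr = np.trace(np.linalg.solve(P2, M))
    term_logdet = np.linalg.slogdet(P2)[1] - np.linalg.slogdet(P1)[1]
    return 0.5 * (term_tr + term_logdet)

# Sanity check: the discrimination of a density from itself is zero.
mu, P = np.array([1.0, -1.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
print(kl_gaussian(mu, P, mu, P))   # 0.0 (up to rounding)
```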
Theorem 3.2: Let f(x) be a probability density function over d dimensions with well-defined mean µ* and covariance matrix P*, where P* is strictly positive-definite. As before, let {(1, µ, P)} denote the Gaussian density with mean µ and p.d. covariance matrix P. Then the unique minimum value of d_{kl}(f, {(1, µ, P)}) is achieved when µ = µ* and P = P*.
For a proof see the Appendix.
Theorem 3.3: If f(x), h_1(x) and h_2(x) are any pdfs over d dimensions and 0 ≤ w ≤ 1 then, writing w̄ for 1 − w:

d_{kl}(w h_1 + \bar{w} h_2, f) ≤ w\, d_{kl}(h_1, f) + \bar{w}\, d_{kl}(h_2, f)

d_{kl}(f, w h_1 + \bar{w} h_2) ≤ w\, d_{kl}(f, h_1) + \bar{w}\, d_{kl}(f, h_2)
This is a standard result: for a proof see [9, Theorem 4.3.2]
or [8, Theorem 2.7.2].
Theorem 3.4: If f_1(x), f_2(x) and h(x) are any pdfs over d dimensions, 0 ≤ w ≤ 1 and w̄ = 1 − w, then:

d_{kl}(w f_1 + \bar{w} h, w f_2 + \bar{w} h) ≤ w\, d_{kl}(f_1, f_2)
For a proof see the Appendix.
IV. SALMOND'S CRITERION
Let {(w_1, µ_1, P_1), ..., (w_n, µ_n, P_n)} be an n-component Gaussian mixture, and let µ and P be respectively the overall mean and the overall variance of this mixture. Clearly

\mu = \sum_{i=1}^{n} w_i \mu_i

while P can be written as P = W + B, where W is the 'within-components' contribution to the total variance, given by:

W = \sum_{i=1}^{n} w_i P_i

while B is the 'between-components' contribution given by:

B = \sum_{i=1}^{n} w_i (\mu_i - \mu)(\mu_i - \mu)^T
When two components are replaced by their moment-preserving merge, the effect is, roughly speaking, to increase W and decrease B by a corresponding amount, leaving the total variance P unchanged. Salmond's general idea [1, Sec. 4] is to choose for merging two components i and j such that the increase in W is minimised. He shows that the change in W when components i and j are replaced by their moment-preserving merge is

\Delta W_{ij} = \frac{w_i w_j}{w_i + w_j} (\mu_i - \mu_j)(\mu_i - \mu_j)^T

However, ΔW_{ij} is a matrix, whereas we require a scalar dissimilarity measure. Salmond proposes using the following measure:

D_s^2(i, j) = \mathrm{tr}\left( P^{-1} \Delta W_{ij} \right)   (6)

Here the trace reduces its matrix argument to a scalar, and the premultiplication by P^{-1} ensures that the resulting dissimilarity measure is invariant under linear transformations of the state space.
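For illustration, criterion (6) might be computed as follows (Python/NumPy sketch; the helper names are ours and not taken from [1]). Examples 4.1 and 4.2 below apply this criterion directly.

```python
import numpy as np

def overall_moments(components):
    """Overall mean and covariance P = W + B of a Gaussian mixture,
    given as a list of (w, mu, P) triples."""
    ws = np.array([w for w, _, _ in components])
    mus = np.array([mu for _, mu, _ in components])
    mu = ws @ mus
    W = sum(w * P for w, _, P in components)
    B = sum(w * np.outer(m - mu, m - mu) for w, m in zip(ws, mus))
    return mu, W + B

def salmond_cost(components, i, j):
    """Salmond's dissimilarity D_s^2(i, j) = tr(P^{-1} Delta W_ij)."""
    _, P = overall_moments(components)
    wi, mui, _ = components[i]
    wj, muj, _ = components[j]
    d = mui - muj
    dW = (wi * wj / (wi + wj)) * np.outer(d, d)
    return np.trace(np.linalg.solve(P, dW))
```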
[Fig. 1: Means of the components A, B, C, D in Example 4.2, plotted in the (x, y) plane with both axes running from -1.5 to 1.5.]
However, the dissimilarity measure defined in (6) has two
properties that may be considered undesirable as a basis for
choosing which components to merge. First, the measure
depends on the means of the components, but not on their
individual covariance matrices, leading to the behaviour in this
example:
Example 4.1: A mixture comprises three two-dimensional components {(1/3, µ, P_1), (1/3, µ + δµ, P_1), (1/3, µ, P_2)}, where δµ is very small (e.g. δµ = (0.0001, 0.0001)^T) but P_2 is very different from P_1:

P_1 = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}   P_2 = \begin{pmatrix} 1 & -0.9 \\ -0.9 & 1 \end{pmatrix}

We wish to reduce the mixture to two components. Then, using (6), we will choose to merge the first and third components, yielding a merged component (2/3, µ, I_2), where I_2 is the two-dimensional identity matrix.

The reader may well consider that in this example it would be better to merge the first two components, yielding (2/3, µ + (1/2)δµ, P_1 + (1/4)δµ δµ^T).
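A quick numerical check of Example 4.1 (Python/NumPy; the small helper below is ours, and the covariance matrices are those reconstructed above):

```python
import numpy as np

# Components of Example 4.1, taking mu = 0 without loss of generality.
dmu = np.array([1e-4, 1e-4])
P1 = np.array([[1.0, 0.9], [0.9, 1.0]])
P2 = np.array([[1.0, -0.9], [-0.9, 1.0]])
comps = [(1/3, np.zeros(2), P1), (1/3, dmu, P1), (1/3, np.zeros(2), P2)]

# Overall covariance P = W + B, then Salmond's criterion (6) for each pair.
mu = sum(w * m for w, m, _ in comps)
P = sum(w * (C + np.outer(m - mu, m - mu)) for w, m, C in comps)

def ds2(i, j):
    wi, mi, _ = comps[i]; wj, mj, _ = comps[j]
    d = mi - mj
    return np.trace(np.linalg.solve(P, (wi * wj / (wi + wj)) * np.outer(d, d)))

costs = {(i, j): ds2(i, j) for i in range(3) for j in range(i + 1, 3)}
print(min(costs, key=costs.get))   # (0, 2): the criterion merges the first and third
```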
The second drawback arises from the presence of the overall
covariance P within (6). This has the implication that adding
a new component to a mixture may alter the order in which
the existing components are merged, as shown in the following
example.
Example 4.2: A mixture over the two dimensions (x, y) consists of four components

A = (0.25, (0.661, 1)^T, I_2)   (7)
B = (0.25, (1.339, -1)^T, I_2)   (8)
C = (0.25, (-0.692, 1.1)^T, I_2)   (9)
D = (0.25, (-1.308, -1.1)^T, I_2)   (10)

(The means of the components are shown in Fig. 1.) We wish to reduce this mixture to three components. It is readily established that the overall mean of the mixture is (0, 0)^T, and its covariance matrix is 2.105 I_2. From the latter fact, it follows that criterion (6) will lead us simply to merge the two components whose means are closest together, namely A and C.

Now modify the original mixture by reducing the weights of components A to D to 0.2, and adding a fifth component E = (0.2, (0, 10)^T, I_2). We wish to reduce this new mixture to three components. It turns out that criterion (6) now selects components A and B for the first merge, and components C and D for the second merge. This is because, although E is a weak candidate for either merge, its inclusion in the mixture has greatly increased its overall variance in the y-direction, meaning that (6) now weights differences in x more heavily than differences in y.
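The switch in merge order can be reproduced with a few lines of NumPy (our own sketch; the component means use the signs as reconstructed in Eqs. (7)-(10) above).

```python
import numpy as np

def best_pair(comps):
    """Index pair minimising Salmond's criterion (6) within `comps`."""
    mu = sum(w * m for w, m, _ in comps)
    P = sum(w * (C + np.outer(m - mu, m - mu)) for w, m, C in comps)
    def cost(i, j):
        wi, mi, _ = comps[i]; wj, mj, _ = comps[j]
        d = mi - mj
        return np.trace(np.linalg.solve(P, (wi * wj / (wi + wj)) * np.outer(d, d)))
    pairs = [(i, j) for i in range(len(comps)) for j in range(i + 1, len(comps))]
    return min(pairs, key=lambda ij: cost(*ij))

I2 = np.eye(2)
means = {'A': [0.661, 1], 'B': [1.339, -1], 'C': [-0.692, 1.1], 'D': [-1.308, -1.1]}
four = [(0.25, np.array(m), I2) for m in means.values()]
print(best_pair(four))            # (0, 2): components A and C

five = [(0.2, np.array(m), I2) for m in means.values()]
five.append((0.2, np.array([0.0, 10.0]), I2))
print(best_pair(five))            # (0, 1): components A and B
```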
V. WILLIAMS'S CRITERION
Williams [2] and Williams and Maybeck [3] propose a method of Gaussian mixture reduction based on the integrated squared difference (ISD) measure of the dissimilarity between two pdfs f_1(x) and f_2(x):

J_S = \int \left( f_1(x) - f_2(x) \right)^2 dx

(cf. [3, Eq. 4]). This has the important property that the dissimilarity between two arbitrary Gaussian mixtures can be expressed in closed form (given in [3, Eq. 10])—a property regrettably not shared by the measure proposed in the present paper.
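The closed form of [3, Eq. 10] is not reproduced in this paper, but an equivalent computation follows from the standard identity \int N(x; µ_a, P_a) N(x; µ_b, P_b) dx = N(µ_a; µ_b, P_a + P_b). The sketch below (Python/SciPy, our own code) computes the ISD between two Gaussian mixtures on that basis.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_overlap(mu_a, Pa, mu_b, Pb):
    """Integral of the product of two Gaussian densities."""
    return multivariate_normal.pdf(mu_a, mean=mu_b, cov=Pa + Pb)

def isd(mix1, mix2):
    """Integrated squared difference between two Gaussian mixtures,
    each given as a list of (w, mu, P) triples."""
    def cross(mx, my):
        return sum(wa * wb * gaussian_overlap(ma, Pa, mb, Pb)
                   for wa, ma, Pa in mx for wb, mb, Pb in my)
    return cross(mix1, mix1) - 2 * cross(mix1, mix2) + cross(mix2, mix2)

# Example: ISD cost of replacing a symmetric pair by its moment-preserving merge.
pair = [(0.5, np.array([-0.5, 0.0]), np.eye(2)),
        (0.5, np.array([0.5, 0.0]), np.eye(2))]
merged = [(1.0, np.array([0.0, 0.0]), np.eye(2) + np.diag([0.25, 0.0]))]
print(isd(pair, merged))
```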
Their algorithm for reducing an n-component mixture to an m-component mixture (m < n) can be summarised as follows:

1) While more than m components remain, consider all possible operations of the following two kinds:
   - Deleting a component and renormalising the remaining mixture;
   - Replacing a pair of components with their moment-preserving merge;
   and in each case evaluate the ISD-dissimilarity of the resulting mixture from the original mixture. Apply the operation for which this dissimilarity is a minimum.
2) Use the resulting m-component mixture as the starting point for a gradient-based optimisation technique, to seek an m-component mixture with lower dissimilarity to the original mixture.

The authors note that the optimisation at Step 2 will seek a local minimum rather than the global minimum: hence the need to choose the starting point carefully.
The ISD cost measure circumvents both of the drawbacks of Salmond's criterion. First, the measure depends explicitly on the covariance matrices as well as the means of the components. Second, the cost incurred by merging two components depends only on the parameters of those components, and not on other characteristics of the mixture of which they form a part. Consequently, the anomalies observed in Examples 4.1 and 4.2 do not arise.
However, the ISD criterion leads to puzzling behaviour of its own. To illustrate this, we will focus on mixtures where the components are radially symmetric, i.e. the covariance matrices are multiples of the identity matrix. Consider first the case where the starting mixture is {(w, µ − cσu, σ²I_d), (w, µ + cσu, σ²I_d)}, where µ is arbitrary and u is a d-dimensional unit vector. The means of the two components of this mixture are distance 2cσ apart.

In this case it follows from [3, Eq. 12] that the ISD cost of deleting one of the components (and raising the other component to unit weight) is given by:

J_S = \frac{4w^2}{\sigma^d \sqrt{(4\pi)^d}} h_D(c)   (11)

where

h_D(c) = \tfrac{1}{2} \left( 1 - \exp(-c^2) \right)   (12)

while the cost of replacing the two components by their moment-preserving merge, namely (2w, µ, σ²(I + c²uu^T)), is:

J_S = \frac{4w^2}{\sigma^d \sqrt{(4\pi)^d}} h_M(c)   (13)

where

h_M(c) = \tfrac{1}{2} \left( 1 + \exp(-c^2) \right) + \frac{1}{\sqrt{1 + c^2}} - 2 \sqrt{\frac{2}{2 + c^2}} \exp\left( -\frac{c^2}{2(2 + c^2)} \right)   (14)
The functions h_M(c) and h_D(c) are both zero for c = 0 and, as c increases, both functions increase monotonically, tending towards 1/2 as c → ∞. It can be shown that h_D(c) > h_M(c) except when c is zero, so the deletion option will not be considered further.
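The comparison between h_D and h_M is easy to verify numerically; the following sketch (Python/NumPy, our own code) evaluates (12) and (14) over a range of c.

```python
import numpy as np

def h_D(c):
    """Deletion cost factor, Eq. (12)."""
    return 0.5 * (1.0 - np.exp(-c**2))

def h_M(c):
    """Merge cost factor, Eq. (14)."""
    return (0.5 * (1.0 + np.exp(-c**2))
            + 1.0 / np.sqrt(1.0 + c**2)
            - 2.0 * np.sqrt(2.0 / (2.0 + c**2)) * np.exp(-c**2 / (2.0 * (2.0 + c**2))))

for ci in np.linspace(0.0, 5.0, 11):
    print(f"c = {ci:4.1f}   h_D = {h_D(ci):.4f}   h_M = {h_M(ci):.4f}")
# Both start at 0, rise monotonically towards 1/2, and h_D(c) > h_M(c) for c > 0.
```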
In the example under consideration, σ acts simply as a scale
factor, but it nevertheless appears in (13), raised moreover to
the power d. This leads to some surprising behaviour in the
way in which Williams’s algorithm selects pairwise merges, as
in the following twelve-dimensional example. (It is not unusual
in inertial navigation applications for the state vector to have
15 or more dimensions.)
Example 5.1: A mixture over the space (x_1, ..., x_12) comprises four components

A = (0.25, (-20, 0.5, 0, ..., 0)^T, I_12)   (15)
B = (0.25, (-20, -0.5, 0, ..., 0)^T, I_12)   (16)
C = (0.25, (20, 10, 0, ..., 0)^T, 4 I_12)   (17)
D = (0.25, (20, -10, 0, ..., 0)^T, 4 I_12)   (18)

where in each mean vector the ellipsis ... comprises eight zeroes. Note that components A and B have negligible probability within the region where x_1 > 0, and C and D have negligible probability within the region x_1 < 0.
Assume that we wish to reduce this four-component mixture to three components. Now, according to (13) the cost of replacing components A and B by their moment-preserving merge is

J_S = \frac{1}{4 (4\pi)^6} h_M(0.5) \approx 6.39 \times 10^{-12}
