Recursive Unsupervised Learning of
Finite Mixture Models
Zoran Zivkovic, Member, IEEE Computer Society,
and Ferdinand van der Heijden, Member,
IEEE Computer Society
Abstract—There are two open problems when finite mixture densities are used to
model multivariate data: the selection of the number of components and the
initialization. In this paper, we propose an online (recursive) algorithm that
estimates the parameters of the mixture and that simultaneously selects the
number of components. The new algorithm starts with a large number of randomly
initialized components. A prior is used as a bias for maximally structured models.
A stochastic approximation recursive learning algorithm is proposed to search for
the maximum a posteriori (MAP) solution and to discard the irrelevant
components.
Index Terms—Online (recursive) estimation, unsupervised learning, finite
mixtures, model selection, EM-algorithm.
1 INTRODUCTION
FINITE mixture probability density models have been analyzed
many times and used extensively for modeling multivariate data
[16], [8]. In [3] and [6], an efficient heuristic was used to
simultaneously estimate the parameters of a mixture and select
the appropriate number of its components. The idea is to start with
a large number of components and introduce a prior to express our
preference for compact models. During some iterative search
procedure for the MAP solution, the prior drives the irrelevant
components to extinction. The “entropic-prior” from [3] leads to a
MAP estimate that minimizes the entropy and, hence, leads to a
compact model. The Dirichlet prior from [6] gives a solution that is
related to model selection using the “Minimum Message Length”
(MML) criterion [20].
This paper is inspired by the aforementioned papers [3], [6].
Our contribution is in developing an online version which is
potentially very useful in many situations since it is highly
memory and time efficient. We use a stochastic approximation
procedure to estimate the parameters of the mixture recursively.
More on the behavior of approximate recursive equations can be
found in [13], [5], [15]. We propose a way to include the suggested
prior from [6] in the recursive equations. This enables the online
selection of the number of components of the mixture. We show
that the new algorithm can reach solutions similar to those
obtained by batch algorithms.
In Sections 2 and 3 of the paper, we introduce the notation and
discuss some standard problems associated with finite mixture
fitting. In Section 4, we describe the mentioned heuristic that
enables us to estimate the parameters of the mixture and to
simultaneously select the number of its components. Further, in
Section 5, we develop an online version. The final practical
algorithm we used in our experiments is described in Section 6. In
Section 7, we demonstrate how the new algorithm performs for a
number of standard problems and compare it to some batch
algorithms.
2 PARAMETER ESTIMATION
A mixture density with M components for a d-dimensional random variable $\vec{x}$ is given by:

$$p(\vec{x};\vec{\theta}) = \sum_{m=1}^{M} \pi_m\, p_m(\vec{x};\vec{\theta}_m), \quad \text{with} \quad \sum_{m=1}^{M} \pi_m = 1, \qquad (1)$$

where $\vec{\theta} = \{\pi_1, \ldots, \pi_M, \vec{\theta}_1, \ldots, \vec{\theta}_M\}$ are the parameters. The number of parameters depends on the number of components M, and the notation $\vec{\theta}(M)$ will be used to stress this when needed. The mth component of the mixture is denoted by $p_m(\vec{x};\vec{\theta}_m)$ and $\vec{\theta}_m$ are its parameters. The mixing weights, denoted by $\pi_m$, are nonnegative and add up to one.
Given a set of t data samples $\{\vec{x}^{(1)}, \ldots, \vec{x}^{(t)}\}$, the maximum likelihood (ML) estimate of the parameter values is:

$$\hat{\vec{\theta}} = \arg\max_{\vec{\theta}} \big( \log p(\mathcal{X};\vec{\theta}) \big).$$
The Expectation Maximization (EM) algorithm [4] is commonly
used to search for the solution. The EM algorithm is an iterative
procedure that searches for a local maximum of the log-likelihood
function. In order to apply the EM algorithm, we need to introduce
for each $\vec{x}$ a discrete unobserved indicator vector $\vec{y} = [y_1 \ldots y_M]^T$. The indicator vector specifies (by means of position coding) the mixture component from which the observation $\vec{x}$ is drawn. The new joint density function can be written as a product:

$$p(\vec{x},\vec{y};\vec{\theta}) = p(\vec{y};\pi_1, \ldots, \pi_M)\, p(\vec{x}|\vec{y};\vec{\theta}_1, \ldots, \vec{\theta}_M) = \prod_{m=1}^{M} \pi_m^{y_m}\, p_m(\vec{x};\vec{\theta}_m)^{y_m},$$

where exactly one of the $y_m$ from $\vec{y}$ can be equal to 1 and the others are zero. The indicators $\vec{y}$ have a multinomial distribution defined by the mixing weights $\pi_1, \ldots, \pi_M$. The EM algorithm starts with some initial parameter estimate $\hat{\vec{\theta}}^{(0)}$. If we denote the set of unobserved data by $\mathcal{Y} = \{\vec{y}^{(1)}, \ldots, \vec{y}^{(t)}\}$, the estimate $\hat{\vec{\theta}}^{(k)}$ from the kth iteration of the EM algorithm is obtained using the previous estimate $\hat{\vec{\theta}}^{(k-1)}$:

$$\text{E step:}\quad Q(\vec{\theta};\hat{\vec{\theta}}^{(k-1)}) = E_{\mathcal{Y}}\big(\log p(\mathcal{X},\mathcal{Y};\vec{\theta})\,\big|\,\mathcal{X},\hat{\vec{\theta}}^{(k-1)}\big) = \sum_{\text{all possible }\mathcal{Y}} p(\mathcal{Y}|\mathcal{X};\hat{\vec{\theta}}^{(k-1)})\, \log p(\mathcal{X},\mathcal{Y};\vec{\theta})$$

$$\text{M step:}\quad \hat{\vec{\theta}}^{(k)} = \arg\max_{\vec{\theta}} \big( Q(\vec{\theta};\hat{\vec{\theta}}^{(k-1)}) \big). \qquad (2)$$
The attractiveness of the EM algorithm is that it is easy to
implement and it converges to a local maximum of the log-
likelihood function. However, one of the serious limitations of the
EM algorithm is that it can end up in a poor local maximum if not
properly initialized. The selection of the initial parameter values is still an open question that has been studied many times. Some recent efforts were reported in [3], [6], [17], [18], [19].
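To make the E and M steps above concrete, the following is a minimal NumPy sketch of one batch EM iteration for a mixture of full-covariance Gaussians. It is an illustration of the standard equations only, not the authors' code; the function names (`gaussian_pdf`, `em_step`) and the data layout (samples as rows of `X`) are our own choices.

```python
import numpy as np

def gaussian_pdf(X, mu, C):
    """Multivariate normal density N(x; mu, C) evaluated for each row of X."""
    d = X.shape[1]
    diff = X - mu
    Cinv = np.linalg.inv(C)
    mahal = np.einsum('ij,jk,ik->i', diff, Cinv, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(C))
    return np.exp(-0.5 * mahal) / norm

def em_step(X, pi, mus, Cs):
    """One EM iteration: the E-step computes the 'ownerships' o_m(x),
    the M-step re-estimates mixing weights, means, and covariances."""
    t, d = X.shape
    M = len(pi)
    # E-step: o_m(x^(i)) = pi_m p_m(x^(i)) / p(x^(i))
    o = np.column_stack([pi[m] * gaussian_pdf(X, mus[m], Cs[m]) for m in range(M)])
    o /= o.sum(axis=1, keepdims=True)
    # M-step: ownership-weighted ML estimates
    Nm = o.sum(axis=0)                      # "soft counts" per component
    pi_new = Nm / t
    mus_new = (o.T @ X) / Nm[:, None]
    Cs_new = []
    for m in range(M):
        diff = X - mus_new[m]
        Cs_new.append((o[:, m, None] * diff).T @ diff / Nm[m])
    return pi_new, mus_new, Cs_new
```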
3 MODEL SELECTION
Note that, in order to use the EM algorithm, we need to know the
appropriate number of components M. Too many components lead
to “overfitting” and too few to “underfitting.” Choosing an appropriate number of components is important. Sometimes, for example, the appropriate number of components can reveal important underlying structure that characterizes the data.
. Z. Zivkovic is with the Informatics Institute, University of Amsterdam,
Kruislaan 403, 1098SJ Amsterdam, The Netherlands.
E-mail: zivkovic@science.uva.nl.
. F. van der Heijden is with the Laboratory for Measurement and
Instrumentation, University of Twente, PO Box 217, 7500AE Enschede,
The Netherlands. E-mail: f.vanderheijden@utwente.nl.
Manuscript received 18 Nov. 2002; revised 24 June 2003; accepted 3 Nov.
2003.
Recommended for acceptance by Y. Amit.

Full Bayesian approaches sample from the full a posteriori distribution with the number of components M considered unknown. This is possible using Markov chain Monte Carlo methods as reported in [11], [10]. However, these methods are still far too computationally demanding. Most of the practical model selection techniques are based on maximizing the following type of criterion:
$$J(M, \vec{\theta}(M)) = \log p(\mathcal{X};\vec{\theta}(M)) - P(M). \qquad (3)$$

Here, $\log p(\mathcal{X};\vec{\theta}(M))$ is the log-likelihood for the available data.
This part can be maximized using the EM. However, introducing more mixture components always increases the log-likelihood. The balance is achieved by introducing $P(M)$, which penalizes complex solutions. Some examples of such criteria are the Akaike Information Criterion [1], the Bayesian Inference Criterion [14], the Minimum Description Length [12], the Minimum Message Length (MML) [20], etc. For a detailed review, see, for example, [8].
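As a concrete instance of a criterion of the form (3), the sketch below scores a fitted Gaussian mixture with a BIC-style penalty, one of the criteria listed above. The penalty choice and the helper names are ours, and a per-component density function (for example, the `gaussian_pdf` helper sketched in Section 2) is assumed to be passed in as `pdf`.

```python
import numpy as np

def mixture_loglik(X, pi, mus, Cs, pdf):
    """log p(X; theta) = sum_i log sum_m pi_m p_m(x^(i); theta_m)."""
    p = sum(pi[m] * pdf(X, mus[m], Cs[m]) for m in range(len(pi)))
    return np.sum(np.log(p))

def penalized_criterion(X, pi, mus, Cs, pdf):
    """J(M, theta(M)) = log-likelihood - P(M), with a BIC-style penalty P(M)."""
    t, d = X.shape
    M = len(pi)
    # free parameters: (M-1) mixing weights + M means + M full covariance matrices
    n_free = (M - 1) + M * (d + d * (d + 1) // 2)
    penalty = 0.5 * n_free * np.log(t)
    return mixture_loglik(X, pi, mus, Cs, pdf) - penalty
```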
4 SOLUTION USING MAP ESTIMATION
The standard procedure for selecting M is the following: Find the ML estimate for different M-s and choose the M that maximizes (3). Suppose that we introduce a prior $p(\vec{\theta}(M))$ for the mixture parameters that penalizes complex solutions in a similar way as $P(M)$ from (3). Instead of (3), we could use:

$$\log p(\mathcal{X};\vec{\theta}(M)) + \log p(\vec{\theta}(M)). \qquad (4)$$

As in [6] and [3], we use the simplest prior choice, the prior only on the mixing weights $\pi_m$-s. For example, the Dirichlet prior (see [7], chapter 16) for the mixing weights is given by:

$$p(\vec{\theta}(M)) \propto \exp\Big(\sum_{m=1}^{M} c_m \log \pi_m\Big) = \prod_{m=1}^{M} \pi_m^{c_m}. \qquad (5)$$
The procedure is then as follows: We start with a large number
of randomly initialized components M and search for the MAP
solution using some iterative procedure, for example, the
EM algorithm. The prior drives the irrelevant components to
extinction. In this way, while searching for the MAP solution,
the number of components M is reduced until the balance is
achieved.
It can be shown that the standard MML model selection criterion can be approximated by the Dirichlet prior with the coefficients $c_m$ equal to $-N/2$, where N is the number of parameters per component of the mixture. See [6] for details. The parameters $c_m$ have a meaningful interpretation. For a multinomial distribution, $c_m$ represents the prior evidence (in the MAP sense) for the class m (the number of samples a priori belonging to that class). Negative prior evidence means that we will accept that the class m exists only if there is enough evidence from the data for the existence of this class. If there are many parameters per component, we will need many data samples to estimate them. In this sense, the presented linear connection between the $c_m$ and N seems very logical. The procedure from [6] starts with all the $\pi_m$-s equal to $1/M$. Although there is no proof of optimality, it seems reasonable to discard the component m when its weight $\pi_m$ becomes negative. This also ensures that the mixing weights stay nonnegative.
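As a toy numerical illustration of this "negative prior evidence" interpretation (anticipating the MAP weight estimate (7) derived in Section 5, which subtracts $c = N/2$ soft counts from every component), consider the following sketch; all numbers and names are hypothetical example choices of ours.

```python
import numpy as np

d = 2                                   # data dimension (example value)
N = d + d * (d + 1) // 2                # parameters of one full-covariance Gaussian: N = 5
c = N / 2                               # prior "evidence" subtracted from each component

# Hypothetical accumulated ownerships sum_i o_m(x^(i)) for t = 100 samples, M = 4 components:
soft_counts = np.array([58.0, 38.0, 2.0, 2.0])
t, M = soft_counts.sum(), len(soft_counts)

# MAP estimate of the mixing weights: components whose support does not exceed c go negative
weights = (soft_counts - c) / (t - M * c)
print(weights)   # the last two components have negative weights and would be discarded
```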
The “entropic prior” from [3] has a similar form: $p(\vec{\theta}(M)) \propto \exp\big(-\gamma H(\pi_1, \ldots, \pi_M)\big)$, where $H(\pi_1, \ldots, \pi_M) = -\sum_{m=1}^{M} \pi_m \log \pi_m$ is the entropy measure for the underlying multinomial distribution and $\gamma$ is a parameter. We use the mentioned Dirichlet prior because it leads to a closed-form solution.
5 RECURSIVE (ONLINE) SOLUTION
For the ML estimate, the following holds: $\frac{\partial}{\partial\hat{\vec{\theta}}} \log p(\mathcal{X};\hat{\vec{\theta}}) = 0$. The mixing weights are constrained to sum up to 1. We take this into account by introducing the Lagrange multiplier $\lambda$ and get: $\frac{\partial}{\partial\hat{\pi}_m}\big(\log p(\mathcal{X};\hat{\vec{\theta}}) + \lambda(\sum_{m=1}^{M}\hat{\pi}_m - 1)\big) = 0$. From here, after getting rid of $\lambda$, it follows that the ML estimate for t data samples should satisfy $\hat{\pi}_m^{(t)} = \frac{1}{t}\sum_{i=1}^{t} o_m^{(t)}(\vec{x}^{(i)})$ with the “ownerships” defined as:

$$o_m^{(t)}(\vec{x}) = \hat{\pi}_m^{(t)}\, p_m(\vec{x};\hat{\vec{\theta}}_m^{(t)}) \,/\, p(\vec{x};\hat{\vec{\theta}}^{(t)}). \qquad (6)$$
Similarly, for the MAP solution, we have $\frac{\partial}{\partial\hat{\pi}_m}\big(\log p(\mathcal{X};\hat{\vec{\theta}}) + \log p(\hat{\vec{\theta}}) + \lambda(\sum_{m=1}^{M}\hat{\pi}_m - 1)\big) = 0$, where $p(\hat{\vec{\theta}})$ is the mentioned Dirichlet prior (5). For t data samples, we get:

$$\hat{\pi}_m^{(t)} = \frac{1}{K}\Big(\sum_{i=1}^{t} o_m^{(t)}(\vec{x}^{(i)}) - c\Big), \qquad (7)$$

where $K = \sum_{m=1}^{M}\big(\sum_{i=1}^{t} o_m^{(t)}(\vec{x}^{(i)}) - c\big) = t - Mc$ (since $\sum_{m=1}^{M} o_m^{(t)} = 1$). The parameters of the prior are $c_m = -c$ (and $c = N/2$ as mentioned before). We rewrite (7) as:

$$\hat{\pi}_m^{(t)} = \frac{\hat{\Pi}_m - c/t}{1 - Mc/t}, \qquad (8)$$

where $\hat{\Pi}_m = \frac{1}{t}\sum_{i=1}^{t} o_m^{(t)}(\vec{x}^{(i)})$ is the mentioned ML estimate and the bias from the prior is introduced through $c/t$. The bias decreases for larger data sets (larger t). However, if a small bias is acceptable, we can keep it constant by fixing $c/t$ to $c_T = c/T$ with some large T.
This means that the bias will always be the same as if it would have been for a data set with T samples. If we assume that the parameter estimates do not change much when a new sample $\vec{x}^{(t+1)}$ is added and, therefore, $o_m^{(t+1)}(\vec{x}^{(i)})$ can be approximated by $o_m^{(t)}(\vec{x}^{(i)})$ that uses the previous parameter estimates, we get the following well-behaved and easy-to-use recursive update equation:

$$\hat{\pi}_m^{(t+1)} = \hat{\pi}_m^{(t)} + (1+t)^{-1}\left(\frac{o_m^{(t)}(\vec{x}^{(t+1)})}{1 - Mc_T} - \hat{\pi}_m^{(t)}\right) - (1+t)^{-1}\frac{c_T}{1 - Mc_T}. \qquad (9)$$

Here, T should be sufficiently large to make sure that $Mc_T < 1$. We start with initial $\hat{\pi}_m^{(0)} = 1/M$ and discard the mth component when $\hat{\pi}_m^{(t+1)} < 0$. Note that the straightforward recursive version of (7), given by $\hat{\pi}_m^{(t+1)} = \hat{\pi}_m^{(t)} + (1 + t - Mc)^{-1}\big(o_m^{(t)}(\vec{x}^{(t+1)}) - \hat{\pi}_m^{(t)}\big)$, is not very useful. For small t, the update is negative and the weights for the components with high $o_m^{(t)}(\vec{x}^{(t+1)})$ are decreased instead of increased. In order to avoid the negative update, we could start with a larger value for t, but then we cancel out the influence of the prior. This motivates the important choice we made to fix the influence of the prior.
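The following small sketch illustrates this point numerically: the step size of the straightforward recursion is negative for small t, whereas the fixed-bias update (9) behaves well from the first sample. The particular values (d = 2, M = 30, T = 1000) are example choices of ours.

```python
import numpy as np

d, M = 2, 30                       # example: 2-D data, 30 initial components
N = d + d * (d + 1) // 2           # parameters per component -> N = 5
c = N / 2                          # prior coefficient, c = 2.5, so M*c = 75
T = 1000                           # "effective" prior sample size for the fixed bias
c_T = c / T

# Straightforward recursion of (7): the step size 1/(1 + t - M*c) is negative for t < M*c - 1,
# so components with large ownership would initially be down-weighted.
for t in [1, 10, 50, 80]:
    print(t, 1.0 / (1 + t - M * c))

# Fixed-bias update (9): the step size 1/(1 + t) stays positive for every t,
# and the prior enters only through the small constant c_T (here M*c_T = 0.075 < 1).
def update_weight(pi_m, o_m, t):
    step = 1.0 / (1 + t)
    return pi_m + step * (o_m / (1 - M * c_T) - pi_m) - step * c_T / (1 - M * c_T)

print(update_weight(pi_m=1.0 / M, o_m=0.9, t=1))   # the weight grows, as it should
```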
The most commonly used mixture is the Gaussian mixture. A mixture component $p_m(\vec{x};\vec{\theta}_m) = \mathcal{N}(\vec{x};\vec{\mu}_m, C_m)$ has its mean $\vec{\mu}_m$ and its covariance matrix $C_m$ as the parameters. The prior has influence only on the mixing weights and we can use the recursive equations:

$$\hat{\vec{\mu}}_m^{(t+1)} = \hat{\vec{\mu}}_m^{(t)} + (t+1)^{-1}\,\frac{o_m^{(t)}(\vec{x}^{(t+1)})}{\hat{\pi}_m^{(t)}}\,\big(\vec{x}^{(t+1)} - \hat{\vec{\mu}}_m^{(t)}\big) \qquad (10)$$

$$\hat{C}_m^{(t+1)} = \hat{C}_m^{(t)} + (t+1)^{-1}\,\frac{o_m^{(t)}(\vec{x}^{(t+1)})}{\hat{\pi}_m^{(t)}}\,\Big(\big(\vec{x}^{(t+1)} - \hat{\vec{\mu}}_m^{(t)}\big)\big(\vec{x}^{(t+1)} - \hat{\vec{\mu}}_m^{(t)}\big)^T - \hat{C}_m^{(t)}\Big) \qquad (11)$$

from [15] for the rest of the parameters.

6 A SIMPLE PRACTICAL ALGORITHM
For an online procedure, it is reasonable to fix the influence of the new samples by replacing the term $(1+t)^{-1}$ from the recursive update equations (9), (10), and (11) by $\alpha = 1/T$. There are also some practical reasons for using a fixed small constant $\alpha$. It reduces the problems with instability of the equations for small t. Furthermore, a fixed $\alpha$ helps in forgetting the out-of-date statistics (random initialization and component deletion) more rapidly. It is equivalent to introducing an exponentially decaying envelope: $(1-\alpha)^{t-i}$ is applied to the influence of the sample $\vec{x}^{(i)}$.
For the sake of clarity, we present here the whole algorithm we used in our experiments. We start with a large number of components M and with a random initialization of the parameters (see the next section for an example). We have $c_T = c/T = \alpha N/2$. Furthermore, we use Gaussian mixture components with full covariance matrices. Therefore, if the data is d-dimensional, we have $N = d + d(d+1)/2$ (the number of parameters for a Gaussian with a full covariance matrix). The online algorithm is then given by:
. Input: new data sample $\vec{x}^{(t+1)}$, current parameter estimates $\hat{\vec{\theta}}^{(t)}$.
. Calculate the “ownerships”: $o_m^{(t)}(\vec{x}^{(t+1)}) = \hat{\pi}_m^{(t)}\, p_m(\vec{x}^{(t+1)};\hat{\vec{\theta}}_m^{(t)}) \,/\, p(\vec{x}^{(t+1)};\hat{\vec{\theta}}^{(t)})$.
. Update the mixture weights: $\hat{\pi}_m^{(t+1)} = \hat{\pi}_m^{(t)} + \alpha\left(\frac{o_m^{(t)}(\vec{x}^{(t+1)})}{1 - Mc_T} - \hat{\pi}_m^{(t)}\right) - \alpha\,\frac{c_T}{1 - Mc_T}$.
. Check if there are irrelevant components: if $\hat{\pi}_m^{(t+1)} < 0$, discard the component m, set $M = M - 1$, and renormalize the remaining mixing weights.
. Update the rest of the parameters:
- $\hat{\vec{\mu}}_m^{(t+1)} = \hat{\vec{\mu}}_m^{(t)} + w\,\vec{\delta}$ (where $w = \alpha\, o_m^{(t)}(\vec{x}^{(t+1)}) / \hat{\pi}_m^{(t)}$ and $\vec{\delta} = \vec{x}^{(t+1)} - \hat{\vec{\mu}}_m^{(t)}$).
- $\hat{C}_m^{(t+1)} = \hat{C}_m^{(t)} + w\big(\vec{\delta}\vec{\delta}^T - \hat{C}_m^{(t)}\big)$ (tip: limit the update speed, $w = \min(20\alpha, w)$).
. Output: new parameter estimates $\hat{\vec{\theta}}^{(t+1)}$.
This simple algorithm can be implemented in only a few lines of code; a sketch is given below. The recommended upper limit $20\alpha$ for w simply means that the updating speed is limited for the covariance matrices of the components representing less than 5 percent of the data. This was necessary since $\vec{\delta}\vec{\delta}^T$ is a singular matrix and the covariance matrices may become singular if updated too fast.
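For concreteness, here is a minimal NumPy sketch of one pass of the update listed above: ownerships, the weight update with the fixed prior bias, discarding of irrelevant components, and the mean/covariance updates with the capped step w. The function names and data layout are our own choices, and the sketch is an illustration of the equations rather than the authors' implementation.

```python
import numpy as np

def gaussian_pdf(x, mu, C):
    """Density N(x; mu, C) of a single d-dimensional sample x."""
    d = x.shape[0]
    diff = x - mu
    mahal = diff @ np.linalg.solve(C, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(C))
    return np.exp(-0.5 * mahal) / norm

def online_update(x, pi, mus, Cs, alpha, c_T):
    """One recursive update of the mixture (pi: array, mus/Cs: lists) for a new sample x."""
    M = len(pi)
    # Ownerships o_m(x) = pi_m p_m(x) / p(x)
    o = np.array([pi[m] * gaussian_pdf(x, mus[m], Cs[m]) for m in range(M)])
    o /= o.sum()
    # Mixing-weight update with the fixed prior bias c_T
    pi = pi + alpha * (o / (1 - M * c_T) - pi) - alpha * c_T / (1 - M * c_T)
    # Discard irrelevant components (non-positive weight) and renormalize
    keep = pi > 0
    pi, o = pi[keep], o[keep]
    mus = [m for m, k in zip(mus, keep) if k]
    Cs = [C for C, k in zip(Cs, keep) if k]
    pi /= pi.sum()
    # Mean and covariance updates with the capped step w = min(20*alpha, alpha*o_m/pi_m)
    for m in range(len(pi)):
        w = min(20 * alpha, alpha * o[m] / pi[m])
        delta = x - mus[m]
        mus[m] = mus[m] + w * delta
        Cs[m] = Cs[m] + w * (np.outer(delta, delta) - Cs[m])
    return pi, mus, Cs
```

Under the choices made in the text, a typical call would use $\alpha = 1/T$ and $c_T = \alpha N/2$ with $N = d + d(d+1)/2$.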
7 EXPERIMENTS
In this section, we demonstrate the algorithm performance on a
few standard problems. We show summary results from 100 trials
for each data set. For the real-world data sets, we randomly sample
from the data to generate longer sequences needed for our
sequential algorithm. First, for each of the problems, we present in Fig. 1 how the selected number of components of the mixture changes as new samples are sequentially added. The number of components that was finally selected is presented in the form of a histogram for the 100 trials. In Fig. 2, we present a comparison with some batch algorithms and study the influence of the parameter $\alpha$.
The random initialization of the parameters is the same as in [6]. The means $\hat{\vec{\mu}}_m^{(0)}$ of the mixture components are initialized by some randomly chosen data points. The initial covariance matrices are a fraction ($1/10$ here) of the mean global diagonal covariance matrix:

$$C_m^{(0)} = \frac{1}{10d}\,\mathrm{trace}\left(\frac{1}{n}\sum_{i=1}^{n}\big(\vec{x}^{(i)} - \hat{\vec{\mu}}\big)\big(\vec{x}^{(i)} - \hat{\vec{\mu}}\big)^T\right) I,$$

where $\hat{\vec{\mu}} = \frac{1}{n}\sum_{i=1}^{n}\vec{x}^{(i)}$ is the global mean of the data and I is the identity matrix with proper dimensions. We used the first $n = 100$ samples (it is also possible to estimate this initial covariance matrix recursively). Finally, we set the initial mixing weights to $\hat{\pi}_m^{(0)} = 1/M$. The initial number of components M should be large enough so that the initialization reasonably covers the data. We used here the same initial number of components as in [6].
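The random initialization just described can be written down directly. The sketch below follows the stated recipe (randomly chosen data points as means, a 1/10 fraction of the mean global diagonal covariance as a scaled identity, equal mixing weights); the function name and the NumPy random generator are our own choices.

```python
import numpy as np

def random_init(X_first, M, frac=0.1):
    """Initialize an M-component Gaussian mixture from the first n samples X_first (n x d)."""
    n, d = X_first.shape
    rng = np.random.default_rng()
    # Means: M randomly chosen data points
    mus = [X_first[i].copy() for i in rng.choice(n, size=M, replace=False)]
    # Covariances: a fraction of the mean global diagonal covariance (scaled identity)
    global_mean = X_first.mean(axis=0)
    diff = X_first - global_mean
    sigma2 = frac / d * np.trace(diff.T @ diff / n)
    Cs = [sigma2 * np.eye(d) for _ in range(M)]
    # Equal initial mixing weights
    pi = np.full(M, 1.0 / M)
    return pi, mus, Cs
```

Feeding the remaining samples one at a time into an update step such as the one sketched in Section 6 then completes the procedure.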
7.1 The “Three Gaussians” Data Set
First, we analyze a Gaussian mixture with mixing weights $\pi_1 = \pi_2 = \pi_3 = 1/3$, means $\vec{\mu}_1 = [0 \;\; {-2}]^T$, $\vec{\mu}_2 = [0 \;\; 0]^T$, $\vec{\mu}_3 = [0 \;\; 2]^T$, and covariance matrices

$$C_1 = C_2 = C_3 = \begin{bmatrix} 2 & 0 \\ 0 & 0.2 \end{bmatrix}.$$
A modified version of the EM called “DAEM” from [17] was able to find the correct solution using a “bad” initialization. For a data set with 900 samples, they needed more than 200 iterations to get close to the solution. Here, we start with $M = 30$ mixture components. With random initialization, we performed 100 trials and the new algorithm was always able to find the correct solution while simultaneously estimating the parameters of the mixture and selecting the number of components. A similar batch algorithm from [6] needs about 200 iterations to identify the three components (on a data set with 900 samples). From the plot in Fig. 1, we see that already after 9,000 samples the new algorithm is usually able to identify the three components. The computation costs for 9,000 samples are approximately the same as for only 10 iterations of the EM algorithm on a data set with 900 samples. Consequently, the new algorithm for this data set is about 20 times faster in finding a similar solution (a typical solution is presented in Fig. 1 by the “$\sigma = 2$” contours of the Gaussian components). In [9], some approximate recursive versions of the EM algorithm were compared to the standard EM algorithm and it was shown that the recursive versions are usually faster. This is consistent with our results. Empirically, we decided that 50 samples per class are enough and used $\alpha = 1/150$.
7.2 The “Iris” Data Set
We disregard the class information from the well-known 3-class, 4-dimensional “Iris” data set [2]. From the 100 trials, the clusters were properly identified 81 times. This shows that the order in which the data is presented can influence the recursive solution. The data set had only 150 samples (50 per class) that were repeated many times. We expect that the algorithm would perform better with more data samples. We used $\alpha = 1/150$. The typical solution in Fig. 1 is presented by projecting the 4-dimensional data onto the first two principal components.
7.3 The “Shrinking Spiral” Data Set
This data set presents a 1-dimensional manifold (a “shrinking spiral”) in three dimensions with added noise: $\vec{x} = [(13 - 0.5t)\cos t \;\;\; (0.5t - 13)\sin t \;\;\; t]^T + \vec{n}$, with $t \sim \mathrm{Uniform}[0, 4\pi]$ and the noise $\vec{n} \sim \mathcal{N}(0, I)$. The modified EM called “SMEM” from [18] was reported to be able to fit a 10-component mixture in about 350 iterations. The batch algorithm from [6] fits the mixture and selects 11, 12, or 13 components using typically 300 to 400 iterations for a 900-sample data set. From the graph in Fig. 1, it is clear that we achieve similar results, but much faster. About 18,000 samples were enough to arrive at a similar solution. Consequently, again, the new algorithm is about 20 times faster. There are no clusters in this data set. The fixed $\alpha$ has the effect that the influence of the old data is downweighted by the exponentially decaying envelope $(1-\alpha)^{t-k}$ (for $k < t$). For comparison with the other algorithms that used 900 samples, we limited the influence of the older samples to 5 percent of the influence of the current sample by setting $\alpha = -\log(0.05)/900$. In Fig. 1, we present a typical solution by
showing for each component the eigenvector corresponding to the
largest eigenvalue of the covariance matrix.
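For reproducibility of this setup, the sketch below generates data from the stated “shrinking spiral” model and computes $\alpha$ from the 5 percent influence rule used here; the function name and the use of NumPy's default generator are our own choices.

```python
import numpy as np

def shrinking_spiral(n_samples, rng=None):
    """Samples x = [(13 - 0.5t)cos t, (0.5t - 13)sin t, t]^T + noise, t ~ Uniform[0, 4*pi]."""
    rng = rng or np.random.default_rng()
    t = rng.uniform(0.0, 4.0 * np.pi, size=n_samples)
    X = np.column_stack([(13 - 0.5 * t) * np.cos(t),
                         (0.5 * t - 13) * np.sin(t),
                         t])
    return X + rng.standard_normal((n_samples, 3))   # additive N(0, I) noise

# Down-weight samples older than 900 steps to at most 5 percent of the current influence:
# (1 - alpha)**900 = 0.05, so alpha is approximately -log(0.05)/900 (using log(1-a) ~ -a).
alpha = -np.log(0.05) / 900
```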
7.4 The “Enzyme” Data Set
The 1-dimensional “Enzyme” data set has 245 data samples. It was shown in [11] using MCMC that the number of components supported by the data is most likely four, but two and three are also good choices. Our algorithm arrived at similar solutions. In a similar way as before, we used $\alpha = -\log(0.05)/245$.
7.5 Comparison with Some Batch Algorithms
The following standard batch methods were considered for comparison: the EM algorithm initialized using the result from k-means clustering; the SMEM method [18]; and the greedy EM method [19] that starts with a single component and adds new ones, which was reported to be faster than the elaborate SMEM. We used 900 samples for the “Three Gaussians” and the “Shrinking Spiral” data sets. The batch algorithms assume a known number of components: three for the “Three Gaussians” and the “Iris” data, 13 for the “Shrinking Spiral,” and four for the “Enzyme” data set. Our new unsupervised recursive algorithm, RUEM, has selected on average approximately the same number of components for the chosen $\alpha$. All the iterative batch algorithms in our experiments stop if the change in the log-likelihood is less than $10^{-5}$. The results are presented in Fig. 2a. The best likelihood and the lowest standard deviation are reported in bold. We also added the ideal ML result obtained using a carefully initialized EM. For the “Iris” data, the EM was initialized using the means and the covariances of the three classes. However, the solution where the two close clusters are modeled using one component was better in terms of likelihood. This “wrong” solution was found occasionally by some of the algorithms. The results from the RUEM are biased. Furthermore, the parameter $\alpha$ controls the speed of updating the parameters and, therefore, also the effective amount of data that is considered. Therefore, we also present the results “polished” by additionally applying the EM algorithm and using the same sample size as for the batch algorithms. The RUEM results and the “polished” results are better than or similar to the batch results. We also observe that the greedy EM algorithm has problems with the “Iris” and the “Shrinking Spiral” data.
7.6 The Influence of the Parameter $\alpha$
In Figs. 2b and 2c, we show the influence of the parameter $\alpha$ on the selected number of components. We also plot the log-likelihood
Fig. 1. Model selection results for a few standard problems (summary from 100 trials).
