IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 17, NO. 8, AUGUST 1995
Mean Shift, Mode Seeking, and Clustering
Yizong Cheng
Abstract-Mean shift, a simple iterative procedure that shifts each data point to the average of data points in its neighborhood, is generalized and analyzed in this paper. This generalization makes some k-means like clustering algorithms its special cases. It is shown that mean shift is a mode-seeking process on a surface constructed with a "shadow" kernel. For Gaussian kernels, mean shift is a gradient mapping. Convergence is studied for mean shift iterations. Cluster analysis is treated as a deterministic problem of finding a fixed point of mean shift that characterizes the data. Applications in clustering and Hough transform are demonstrated. Mean shift is also considered as an evolutionary strategy that performs multistart global optimization.
Index Terms-Mean shift, gradient descent, global optimization, Hough transform, cluster analysis, k-means clustering.
I. INTRODUCTION
Let data be a finite set S embedded in the n-dimensional Euclidean space, X. Let K be a flat kernel that is the characteristic function of the λ-ball in X,

K(x) = 1 if ||x|| ≤ λ, and 0 if ||x|| > λ.    (1)

The sample mean at x ∈ X is

m(x) = Σ_{s∈S} K(s − x) s / Σ_{s∈S} K(s − x).    (2)
The difference m(x) − x is called mean shift in Fukunaga and Hostetler [1]. The repeated movement of data points to the sample means is called the mean shift algorithm [1], [2]. In each iteration of the algorithm, s ← m(s) is performed for all s ∈ S simultaneously.
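The procedure just described can be sketched in a few lines of Python; the code and names below are ours, not the paper's, and the step count is illustrative.

```python
import numpy as np

def flat_mean_shift(S, lam, iters=10):
    """Repeatedly move every point of S to the sample mean of the points
    of S inside the lambda-ball around it (s <- m(s), simultaneously)."""
    S = np.asarray(S, dtype=float)
    for _ in range(iters):
        dists = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)  # pairwise distances
        K = (dists <= lam).astype(float)                               # flat kernel weights
        S = (K @ S) / K.sum(axis=1, keepdims=True)                     # sample means m(s)
    return S
```

Each point lies in its own λ-ball, so the denominator above is never zero.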
The mean shift algorithm has been proposed as a method for cluster analysis [1], [2], [3]. However, the intuition that mean shift is gradient ascent needs a rigorous basis, the convergence of the process needs verification, and its relation with similar algorithms needs clarification.
In this paper, the mean shift algorithm is generalized in three ways. First, nonflat kernels are allowed. Second, points in data can be weighted. Third, shift can be performed on any subset of X, while the data set S stays the same.
In Section II, kernels with five operations are defined. A specific weight function unifying certain fuzzy clustering algorithms, including the "maximum-entropy" clustering algorithm, will be discussed. It will be shown that the k-means clustering algorithm is a limit case of mean shift. A relation among kernels called "shadow" will be defined in Section III. It will be proved that mean shift on any kernel is equivalent to gradient ascent on the density estimated with a shadow of it. Convergence and its rate are the subject of Section IV. Section V shows some peculiar behaviors of mean shift in cluster analysis, with application in Hough transform. Section VI shows how, with a twist in weight assignment, the deterministic mean shift is transformed into a probabilistic evolutionary strategy, and how it becomes a global optimization algorithm.
II. GENERALIZING MEAN SHIFT

In Section II, we first define the kernel, its notation, and operations. Then we define the generalized sample mean and the generalized mean shift algorithm. We show how this algorithm encompasses some other familiar clustering algorithms and how k-means clustering becomes a limit instance of mean shift.
A. Kernels

DEFINITION 1. Let X be the n-dimensional Euclidean space, R^n. Denote the ith component of x ∈ X by x_i. The norm of x ∈ X is a nonnegative number ||x|| such that ||x||² = Σ_{i=1}^n |x_i|². The inner product of x and y in X is ⟨x, y⟩ = Σ_{i=1}^n x_i y_i. A function K : X → R is said to be a kernel if there exists a profile, k : [0, ∞) → R, such that

K(x) = k(||x||²)    (3)

and
1) k is nonnegative.
2) k is nonincreasing: k(a) ≥ k(b) if a < b.
3) k is piecewise continuous and ∫_0^∞ k(r) dr < ∞.
Let a > 0. If K is a kernel, then aK, K_a, and K^a are kernels defined, respectively, as

(aK)(x) = aK(x),   K_a(x) = K(x/a),   K^a(x) = (K(x))^a.    (4)
Manuscript received Aug. 27, 1993; revised Mar. 28, 1995. Recommended for acceptance by R. Duin.
Y. Cheng is with the Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio 45221.
IEEECS Log Number P95097.
If K and H are kernels, then K + H is a kernel defined as (K + H)(x) = K(x) + H(x), and KH is a kernel defined as (KH)(x) = K(x)H(x). These five operators can be ordered in descending precedence as K^a, K_a, aK, KH, and K + H. □
CLAIM 1. We have

a(KH) = (aK)H = K(aH),
a(K + H) = aK + aH,
(K + H)_a = K_a + H_a,
(KH)_a = K_a H_a,
(aK)_a = aK_a. □
EXAMPLE 1. Two kernels frequently used in this paper are the unit flat kernel

F(x) = 1 if ||x|| ≤ 1, and 0 if ||x|| > 1,    (5)

and the unit Gaussian kernel

G(x) = e^{−||x||²}.    (6)

These kernels are shown in Fig. 1. Clearly, the characteristic function of the λ-ball, (1), is F_λ. Also notice that G^β = G_{β^{−1/2}}.

Fig. 1. (a) The flat kernel F and (b) the Gaussian kernel G.
A kernel can be "truncated" by being multiplied by a flat kernel. For example, a truncated Gaussian kernel is

(G^β F_λ)(x) = e^{−β||x||²} if ||x|| ≤ λ, and 0 if ||x|| > λ.    (7)

Notice that (GF)_λ = G^{λ^{−2}} F_λ. Fig. 2 shows some of the truncated Gaussian kernels. □

Fig. 2. Truncated Gaussian kernels.
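The kernel notation above can be mirrored directly in code. The sketch below uses our own names and assumes NumPy; it encodes a kernel through its profile, K(x) = k(||x||²), together with the scaling and power operations and the two kernels of Example 1.

```python
import numpy as np

def flat_profile(r):              # profile of the unit flat kernel F
    return np.where(np.asarray(r) <= 1.0, 1.0, 0.0)

def gaussian_profile(r):          # profile of the unit Gaussian kernel G
    return np.exp(-np.asarray(r))

def kernel(profile):              # K(x) = k(||x||^2)
    return lambda x: profile(np.sum(np.square(x), axis=-1))

def scaled(K, a):                 # K_a(x) = K(x / a)
    return lambda x: K(np.asarray(x) / a)

def powered(K, a):                # K^a(x) = (K(x))^a
    return lambda x: K(x) ** a

F, G = kernel(flat_profile), kernel(gaussian_profile)
GF = lambda x: G(x) * F(x)        # a truncated Gaussian, as in (7)

# Example 1's identity G^beta = G_{beta^(-1/2)}, checked at one point:
beta, x = 2.0, np.array([0.3, 0.4])
assert np.isclose(powered(G, beta)(x), scaled(G, beta ** -0.5)(x))
```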
B. Mean Shift Algorithms

Now we redefine the mean shift algorithm based on our generalizations summarized in the introduction.

DEFINITION 2. Let S ⊂ X be a finite set (the "data" or "sample"). Let K be a kernel and w : S → (0, ∞) a weight function. The sample mean with kernel K at x ∈ X is defined as

m(x) = Σ_{s∈S} K(s − x) w(s) s / Σ_{s∈S} K(s − x) w(s).    (8)

Let T ⊂ X be a finite set (the "cluster centers"). The evolution of T in the form of iterations T ← m(T) with m(T) = {m(t); t ∈ T} is called a mean shift algorithm. For each t ∈ T, there is a sequence t, m(t), m(m(t)), ..., that is called the trajectory of t. The weight w(s) can be either fixed throughout the process or re-evaluated after each iteration. It may also be a function of the current T. The algorithm halts when it reaches a fixed point (m(T) = T).

When T is S, the mean shift algorithm is called a blurring process, indicating the successive blurring of the data set, S. □
REMARK 1. The original mean shift process proposed in [1], [3] is a blurring process, in which T = S. In Definition 2, it is generalized so that T and S may be separate sets with S fixed through the process, although the initial T may be a copy of S. Notice that in (8), kernel K can be replaced with kernel aK for any a > 0, without generating any difference. This is the reason why we did not insist that ∫_X K(x) dx = 1, which will attach a factor to K that is related to n, the dimensionality of X. Similarly, the weights w(s) can be normalized so that Σ_{s∈S} w(s) = 1. Because of the inconsequentiality of these factors, we will use the simplest possible expressions for the kernel and the weights. We also have to assume that T is initialized such that Σ_{s∈S} K(s − t) w(s) > 0 for all t ∈ T. Also notice that this is a parallel algorithm, in the sense that all t ∈ T are simultaneously updated based on the previous t and w(s) values. □
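A minimal implementation of the generalized algorithm of Definition 2 might look as follows. This is a sketch with our own names; it assumes the initialization condition of Remark 1, so that the denominators are positive.

```python
import numpy as np

def sample_mean(x, S, w, profile):
    """m(x) of (8): sum_s K(s-x) w(s) s / sum_s K(s-x) w(s), with K(x) = k(||x||^2)."""
    k = profile(np.sum((S - x) ** 2, axis=1)) * w       # K(s - x) w(s), one value per s
    return (k[:, None] * S).sum(axis=0) / k.sum()       # assumes the kernel covers part of S

def mean_shift(T, S, w, profile, iters=50):
    """Evolve the cluster centers T by T <- m(T) while the data S stay fixed."""
    S, T, w = np.asarray(S, float), np.asarray(T, float).copy(), np.asarray(w, float)
    for _ in range(iters):
        T = np.array([sample_mean(t, S, w, profile) for t in T])   # simultaneous update
    return T

# e.g. initial T as a copy of S, unit weights, and a truncated Gaussian profile:
# T = mean_shift(S.copy(), S, np.ones(len(S)), lambda r: np.exp(-r) * (r <= 4.0))
```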
EXAMPLE 2. The "maximum entropy" clustering algorithm of Rose, Gurewitz, and Fox [4] is a mean shift algorithm when T and S are separate sets, G^β is the kernel, and

w(s) = 1 / Σ_{t∈T} G^β(s − t),   s ∈ S.    (9)

These authors also mention that when β approaches infinity, the algorithm degenerates to k-means clustering, which is often described as an optimizing Picard iterative routine [7]:

1) Randomly initialize "cluster centers," T.
2) Compute the following function on T × S:

v_{t,s} = 1 if t = argmin_{t'∈T} ||s − t'||², and 0 otherwise.    (10)

3) Update "cluster centers":

t ← Σ_{s∈S} v_{t,s} s / Σ_{s∈S} v_{t,s},   t ∈ T.

Go to 2.

Indeed, when the profile k is strictly decreasing,

lim_{β→∞} K^β(s − t) / Σ_{t'∈T} K^β(s − t') = v_{t,s}.    (11)

Thus, k-means clustering is the limit of the mean shift algorithm with a strictly decreasing kernel K^β as β → ∞. □
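As a sketch of Example 2 (our code, with an illustrative β), one iteration of this scheme recomputes the weights (9) from the current T and then applies the sample mean (8) with kernel G^β; for large β the soft assignments harden into (10) and the update approaches the k-means step.

```python
import numpy as np

def max_entropy_step(T, S, beta):
    """One T <- m(T) update with kernel G^beta and the weights (9), recomputed from T."""
    d2 = np.sum((S[:, None, :] - T[None, :, :]) ** 2, axis=2)   # ||s - t||^2, shape (|S|, |T|)
    K = np.exp(-beta * d2)                                      # G^beta(s - t)
    w = 1.0 / K.sum(axis=1)                                     # weight (9): 1 / sum_t G^beta(s - t)
    Kw = K * w[:, None]                                         # K(s - t) w(s), soft assignments
    return (Kw.T @ S) / Kw.sum(axis=0)[:, None]                 # new centers m(t), t in T

# For moderate beta this is the soft "maximum entropy" update; as beta grows,
# each row of Kw concentrates on the nearest center, approaching v_{t,s} of (10).
```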
III. MEAN SHIFT AS GRADIENT MAPPING

It has been pointed out in [1] that mean shift is a "very intuitive" estimate of the gradient of the data density. In this section, we give a more rigorous study of this intuition. Theorem 1 relates each kernel to a "shadow" kernel so that mean shift using a kernel will be in the gradient direction of the density estimate using the corresponding "shadow" kernel.

A. Shadow of a Kernel
DEFINITION 3. Kernel H is said to be a shadow of kernel K if the mean shift using K,

m(x) − x = Σ_{s∈S} K(s − x) w(s) s / Σ_{s∈S} K(s − x) w(s) − x,    (12)

is in the gradient direction at x of the density estimate using H,

q(x) = Σ_{s∈S} H(s − x) w(s).    (14)

THEOREM 1. Kernel H is a shadow of kernel K if and only if their profiles, h and k, satisfy the following equation,

h(r) = f(r) + c ∫_r^∞ k(t) dt,    (15)

where c > 0 is a constant and f is a piecewise constant function.

PROOF. The mean shift using kernel K can be rewritten as

m(x) − x = (1/p(x)) Σ_{s∈S} k(||s − x||²) w(s) (s − x)    (16)

with p(x) = Σ_{s∈S} k(||s − x||²) w(s), the density estimate using K. The gradient of (14) at x is

∇q(x) = −2 Σ_{s∈S} h'(||s − x||²) (s − x) w(s).    (17)

To have (16) and (17) point to the same direction, we need h'(r) = −ck(r) for all r and some c > 0. By the fundamental theorem of calculus and the requirement that ∫_0^∞ k(r) dr < ∞, (15) is the only solution. In this case, we have

m(x) − x = (1 / (2c p(x))) ∇q(x),    (18)

or, the magnitude of mean shift is in proportion to the ratio of the gradient and the local density estimate using kernel K. When a discontinuous point z is allowed in h, a constant can be added to h from 0 to z, and h'(r) = −ck(r) is still satisfied except when r = z. □

CLAIM 2. Suppose kernel H is a shadow of K, and a > 0. The following are true.
1) aH is a shadow of K.
2) H_a is a shadow of K_a.
3) If L is a shadow of M, then H + L is a shadow of K + M.
4) A truncated kernel KF_λ may not be continuous at ||x|| = λ. If the shadow is also allowed to be discontinuous at the same points, then HF_λ is a shadow of KF_λ. □

EXAMPLE 3. Using (15) we find that the Epanechnikov kernel is a shadow of the flat kernel, (5), and the biweight kernel is a shadow of the Epanechnikov kernel. These kernels are so named in [2] and they are shown in Fig. 3. □

Fig. 3. (a) The Epanechnikov kernel and (b) the biweight kernel.
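As an illustration of (15), take f ≡ 0 and c = 1 and start from the flat profile k(r) = 1 for 0 ≤ r ≤ 1 and k(r) = 0 otherwise. Then

h(r) = ∫_r^∞ k(t) dt = 1 − r for 0 ≤ r ≤ 1, and 0 for r > 1,

which is the (unnormalized) Epanechnikov profile. Applying (15) once more, ∫_r^1 (1 − t) dt = (1 − r)²/2 on [0, 1], which is proportional to the biweight profile, so the biweight kernel is in turn a shadow of the Epanechnikov kernel.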
B. Gaussian Kernels

THEOREM 2. The only kernels that are their own shadows are the Gaussian kernel G^β and its truncated version G^β F_λ. In this case, the mean shift is equal to

m(x) − x = (1/(2β)) ∇ log q(x),    (21)

where q is the data density estimate using the same kernel.

PROOF. From Theorem 1 we know that kernel K is its own shadow if and only if k'(r) = −ck(r). Using the method of separation of variables, we have

∫_0^r k'(t)/k(t) dt = −cr, or log k(r) − log k(0) = −cr.

This gives us k(r) = k(0) e^{−cr}, which makes K the Gaussian kernel. If discontinuities are allowed in k, then we have the truncated Gaussian kernel. When K is its own shadow, p in (18) is equal to q, and (∇q(x))/q(x) = ∇ log q(x). □
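Equation (21) can be checked numerically. The short script below is ours; the data, β, and the test point are arbitrary. It compares the mean shift step under G^β with a finite-difference gradient of log q.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(50, 2))                  # sample data (ours)
beta, x, eps = 0.7, np.array([0.3, -0.2]), 1e-6

def log_q(x):                                  # log of the density estimate with G^beta
    return np.log(np.sum(np.exp(-beta * np.sum((S - x) ** 2, axis=1))))

k = np.exp(-beta * np.sum((S - x) ** 2, axis=1))
shift = (k[:, None] * S).sum(axis=0) / k.sum() - x            # m(x) - x

grad = np.array([(log_q(x + eps * e) - log_q(x - eps * e)) / (2 * eps) for e in np.eye(2)])
print(np.allclose(shift, grad / (2 * beta), atol=1e-6))       # expect True
```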
REMARK 2. A mapping f : R^n → R^n is said to be a gradient mapping if there exists a function g : R^n → R such that f(x) = ∇g(x) for all x [6]. Theorem 2 is a corollary of a more general result from the symmetry principle: f is a gradient mapping if and only if the Jacobian matrix of f is symmetric. In our case, f(x) = m(x) − x, and by equating ∂f_i/∂x_j and ∂f_j/∂x_i for all i and j, one obtains the necessary and sufficient condition that k'(r) = −ck(r) for any mean shift to be a gradient mapping. □
C. Mode Seeking

Suppose an idealized mode in the density surface also has a Gaussian shape, which, without loss of generality, centers at the origin:

q(x) = A e^{−γ||x||²},   A > 0, γ > 0.    (22)

The mean shift is then

m(x) − x = (1/(2β)) ∇ log q(x) = −2γ x q(x) / (2β q(x)) = −(γ/β) x.    (23)

Because the density surface is estimated with the kernel G^β, any mode approximated by superimposing G^β will have a γ < β. The mean shift (23) will not cause overshoots in this case.

Mean shift is steepest ascent with a varying step size that is the magnitude of the gradient. A notorious problem associated with steepest ascent with fixed step size is the slow movement on plateaus of the surface. For a density surface, large plateaus happen only at low density regions, and after taking the logarithm, the inclination of a plateau is magnified. Combined with the preceding result about overshoot avoidance, mean shift is well-adjusted steepest ascent. □

IV. CONVERGENCE

Theorem 1 says that the mean shift algorithm is steepest ascent over the density of S. Each T point climbs the hill in the density surface independently. Therefore, if S or its density does not change during the execution of the algorithm, the convergence of the evolution of T is a consequence of the convergence of steepest ascent for individual T points. However, in a blurring process, T is S, and S and its density change as the result of each iteration. In this case, convergence is not as obvious as steepest ascent. The main results of this section are two convergence theorems about the blurring process. The concepts of radius and diameter of data, defined below, will be used in the proofs.

A. Radius and Diameter of Data

DEFINITION 4. A direction in X is a point on the unit sphere. That is, a ∈ X is a direction if and only if ||a|| = 1. We call the mapping π_a : X → R with π_a(x) = ⟨x, a⟩ the projection in the direction of a. Let π_a(S) = {π_a(s); s ∈ S}. The convex hull of a set Y ⊂ X is defined as

h(Y) = ∩_{||a||=1} {x ∈ X; min π_a(Y) ≤ π_a(x) ≤ max π_a(Y)}.    (24) □

CLAIM 3. The following are true.
1) min π_a(S) ≤ min π_a(m(T)) ≤ max π_a(m(T)) ≤ max π_a(S).
2) m(T) ⊆ h(S).
3) In a blurring process, we have h(S) ⊇ h(m(S)) ⊇ h(m(m(S))) ⊇ ⋯. There exists an x ∈ X that is in all the convex hulls of data. It is possible to make a translation so that the origin is in all the convex hulls of data. □

DEFINITION 5. Suppose after a translation, the origin is in all the convex hulls of data during a blurring process. Then, ρ(S) = max{||s||; s ∈ S} is said to be the radius of data. The diameter of data is defined as

d(S) = sup_{||a||=1} (max π_a(S) − min π_a(S)).    (25) □

It should be clear that ρ(S) ≤ d(S) ≤ 2ρ(S). Because the convex hulls of data form a shrinking inclusion sequence, the radius or diameter of data also form a nonincreasing nonnegative sequence, which must approach a nonnegative limit. Theorem 3 says that this limit is zero when the kernel in the blurring process has a support wide enough to cover the data set.

B. Blurring With Broad Kernels

THEOREM 3. Let k be the profile of the kernel used in a blurring process, and S_0 the initial data. If k(d²(S_0)) ≥ κ for some κ > 0, then the diameter of data approaches zero. The convergence rate is at least as fast as

d(m(S)) ≤ (1 − κ/(4k(0))) d(S).    (26)

PROOF. Let π be a projection, u = min π(S), v = max π(S), and z = (u + v)/2. Suppose

Σ_{s∈S, π(s)≤z} w(s) ≥ Σ_{s∈S, π(s)>z} w(s).    (27)

Then, for s ∈ S,

v − π(m(s)) = Σ_{s'∈S} K(s' − s) w(s') (v − π(s')) / Σ_{s'∈S} K(s' − s) w(s')
            ≥ (v − z) Σ_{s'∈S, π(s')≤z} K(s' − s) w(s') / Σ_{s'∈S} K(s' − s) w(s')
            ≥ ((v − u)/2) · κ Σ_{s'∈S, π(s')≤z} w(s') / (k(0) Σ_{s'∈S} w(s'))
            ≥ κ(v − u) / (4k(0)).    (28)

Clearly, we have

max π(m(S)) − min π(m(S)) ≤ max π(m(S)) − u ≤ (1 − κ/(4k(0))) (v − u).    (29)

The same result can be obtained when the inequality in (27) is reversed. Therefore, the result holds for all projections, and because v − u ≤ d(S), we completed the proof. □

EXAMPLE 4. As the simplest example, let X be the real line and S = {x, y} with x < y. Let k be the profile of the kernel with k(|x − y|²) > 0. If the weights on x and y are the same, a blurring process will make them converge to (x + y)/2 with the rate

m(y) − m(x) = [(k(0) − k(|x − y|²)) / (k(0) + k(|x − y|²))] (y − x).    (30)

The difference between the two points will not become zero at any iteration, unless the kernel is flat in the range containing both points. When the weights are different, x and y converge to a point that may not be the weighted average of the initial x and y with the same weights.

Now assume that S is fixed through the process while T is initialized to S. Because this is no longer a blurring process, the result in Theorem 3 does not apply. That is, T may converge to more than one point. The two T points converge to (x + y)/2 when the density of S is unimodal. Otherwise, they will converge to the two modes of the bimodal density. When G_λ is used as the kernel, and x = 0, y = 1, the density is

q(z) = e^{−z²/λ²} + e^{−(z−1)²/λ²}.

The first derivative of q(z) always has a zero at z = 1/2, but the second derivative of q(z) at z = 1/2 is negative when λ > 1/√2 and positive when λ < 1/√2. Thus 1/√2 is a critical λ value. When λ is larger than this critical value, the density is unimodal and the two T points will converge to 1/2. When λ is smaller than 1/√2, they will approach two distinct limit points. □

C. Blurring with Truncated Kernels

When truncated kernels are used in the blurring process, S may converge to many points. In fact, if the kernel is truncated so that it will not cover more than one point in S, then S will not change in the process. The result of Theorem 3 applies to an isolated group of data points that can be covered by a truncated kernel. In this case, they eventually converge to one point, although the rest of the data set may converge to other cluster centers. Again, when the kernel is not flat, no merger will take place after any finite number of iterations. (Flat kernels generate a special case where merger is possible in a finite number of steps. See [8].)

When the blurring process is simulated on a digital computer, points do merge in a finite number of steps, due to the limits on floating-point precision. In fact, there is a minimum distance between data points that can be maintained. Theorem 4 below shows that under this condition, the blurring process terminates in a finite number of steps.

LEMMA 1. Suppose X = R^n and r(S) is the radius of data in a blurring process. If the minimum distance between data points is δ, then for any direction a, there cannot be more than one data point s ∈ S with π_a(s) > r(S) − h, where h is a function of r(S), δ, and n.

PROOF. Suppose there is a direction a such that there are s, t ∈ S with π_a(s) > r(S) − h and π_a(t) > r(S) − h. Let b be a direction perpendicular to a and d_b = |π_b(s) − π_b(t)|. Because both s and t are at least r(S) − h away from the origin in direction a, and one of them must be d_b/2 away from the origin in direction b, the square of the radius of data cannot be smaller than (r(S) − h)² + d_b²/4. It follows that

d_b² ≤ 8r(S)h − 4h².

The distance between s and t, ||s − t||, satisfies

δ² ≤ ||s − t||² < (n − 1)(8r(S)h − 4h²) + h².

If (n − 1)(8r(S)h − 4h²) + h² ≤ δ², then these s and t cannot exist. This condition is satisfied when

h ≤ [4(n − 1) r(S) − √(16(n − 1)² r²(S) − (4n − 5)δ²)] / (4n − 5). □

References

B.W. Silverman, Density Estimation for Statistics and Data Analysis.
J.M. Ortega and W.C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables.
K. Fukunaga and L.D. Hostetler, "The estimation of the gradient of a density function, with applications in pattern recognition."
S.Z. Selim and M.A. Ismail, "K-means-type algorithms: A generalized convergence theorem and characterization of local optimality."
K. Rose, E. Gurewitz, and G.C. Fox, "Statistical mechanics and phase transitions in clustering."