IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 17, NO. 8, AUGUST 1995
Mean Shift, Mode Seeking, and Clustering
Yizong Cheng
Abstract-Mean shift, a simple iterative procedure that shifts each data point to the average of data points in its neighborhood, is generalized and analyzed in this paper. This generalization makes some k-means like clustering algorithms its special cases. It is shown that mean shift is a mode-seeking process on a surface constructed with a "shadow" kernel. For Gaussian kernels, mean shift is a gradient mapping. Convergence is studied for mean shift iterations. Cluster analysis is treated as a deterministic problem of finding a fixed point of mean shift that characterizes the data. Applications in clustering and Hough transform are demonstrated. Mean shift is also considered as an evolutionary strategy that performs multistart global optimization.
Index Terms-Mean shift, gradient descent, global optimization, Hough transform, cluster analysis, k-means clustering.
I. INTRODUCTION
Let data be a finite set S embedded in the n-dimensional Euclidean space, X. Let K be a flat kernel that is the characteristic function of the λ-ball in X,

K(x) = 1 if ||x|| ≤ λ, and 0 if ||x|| > λ.    (1)

The sample mean at x ∈ X is

m(x) = Σ_{s∈S} K(s − x) s / Σ_{s∈S} K(s − x).    (2)
The difference m(x) − x is called mean shift in Fukunaga and Hostetler [1]. The repeated movement of data points to the sample means is called the mean shift algorithm [1], [2]. In each iteration of the algorithm, s ← m(s) is performed for all s ∈ S simultaneously.
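The procedure just described can be sketched in a few lines of Python; the code and names below are ours, not the paper's, and the step count is illustrative.

```python
import numpy as np

def flat_mean_shift(S, lam, iters=10):
    """Repeatedly move every point of S to the sample mean of the points
    of S inside the lambda-ball around it (s <- m(s), simultaneously)."""
    S = np.asarray(S, dtype=float)
    for _ in range(iters):
        dists = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)  # pairwise distances
        K = (dists <= lam).astype(float)                               # flat kernel weights
        S = (K @ S) / K.sum(axis=1, keepdims=True)                     # sample means m(s)
    return S
```

Each point lies in its own λ-ball, so the denominator above is never zero.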
The mean shift algorithm has been proposed as a method for cluster analysis [1], [2], [3]. However, the intuition that mean shift is gradient ascent needs a rigorous basis, the convergence of the process needs verification, and its relation with similar algorithms needs clarification.
In this paper, the mean shift algorithm is generalized in three ways. First, nonflat kernels are allowed. Second, points in data can be weighted. Third, shift can be performed on any subset of X, while the data set S stays the same.
In Section II, kernels with five operations are defined. A specific weight function unifying certain fuzzy clustering algorithms, including the "maximum-entropy" clustering algorithm, will be discussed. It will be shown that the k-means clustering algorithm is a limit case of mean shift. A relation among kernels called "shadow" will be defined in Section III. It will be proved that mean shift on any kernel is equivalent to gradient ascent on the density estimated with a shadow of it. Convergence and its rate are the subject of Section IV. Section V shows some peculiar behaviors of mean shift in cluster analysis, with application in Hough transform. Section VI shows how, with a twist in weight assignment, the deterministic mean shift is transformed into a probabilistic evolutionary strategy, and how it becomes a global optimization algorithm.
II. GENERALIZING MEAN SHIFT

In Section II, we first define the kernel, its notation, and operations. Then we define the generalized sample mean and the generalized mean shift algorithm. We show how this algorithm encompasses some other familiar clustering algorithms and how k-means clustering becomes a limit instance of mean shift.
A. Kernels

DEFINITION 1. Let X be the n-dimensional Euclidean space, R^n. Denote the ith component of x ∈ X by x_i. The norm of x ∈ X is a nonnegative number ||x|| such that ||x||² = Σ_{i=1}^n |x_i|². The inner product of x and y in X is ⟨x, y⟩ = Σ_{i=1}^n x_i y_i. A function K : X → R is said to be a kernel if there exists a profile, k : [0, ∞) → R, such that

K(x) = k(||x||²)    (3)

and
1) k is nonnegative.
2) k is nonincreasing: k(a) ≥ k(b) if a < b.
3) k is piecewise continuous and ∫_0^∞ k(r) dr < ∞.
Let a > 0. If K is a kernel, then aK, K_a, and K^a are kernels defined, respectively, as

(aK)(x) = aK(x),   K_a(x) = K(x/a),   K^a(x) = (K(x))^a.    (4)
Manuscript received Aug. 27, 1993; revised Mar. 28, 1995. Recommended for acceptance by R. Duin.
Y. Cheng is with the Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio 45221.
IEEECS Log Number P95097.
If K and H are kernels, then K + H is a kernel defined as (K + H)(x) = K(x) + H(x), and KH is a kernel defined as (KH)(x) = K(x)H(x). These five operators can be ordered in descending precedence as K^a, K_a, aK, KH, and K + H. □
CLAIM 1. We have

a(KH) = (aK)H = K(aH),
a(K + H) = aK + aH,
(K + H)_a = K_a + H_a,
(KH)_a = K_a H_a,
(aK)_a = aK_a. □
EXAMPLE 1. Two kernels frequently used in this paper are the unit flat kernel

F(x) = 1 if ||x|| ≤ 1, and 0 if ||x|| > 1,    (5)

and the unit Gaussian kernel

G(x) = e^{−||x||²}.    (6)

These kernels are shown in Fig. 1. Clearly, the characteristic function of the λ-ball, (1), is F_λ. Also notice that G^β = G_{β^{−1/2}}.

Fig. 1. (a) The flat kernel F and (b) the Gaussian kernel G.
A kernel can be "truncated" by being multiplied by a flat kernel. For example, a truncated Gaussian kernel is

(G^β F_λ)(x) = e^{−β||x||²} if ||x|| ≤ λ, and 0 if ||x|| > λ.    (7)

Notice that (GF)_λ = G^{λ^{−2}} F_λ. Fig. 2 shows some of the truncated Gaussian kernels. □

Fig. 2. Truncated Gaussian kernels.
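The kernel notation above can be mirrored directly in code. The sketch below uses our own names and assumes NumPy; it encodes a kernel through its profile, K(x) = k(||x||²), together with the scaling and power operations and the two kernels of Example 1.

```python
import numpy as np

def flat_profile(r):              # profile of the unit flat kernel F
    return np.where(np.asarray(r) <= 1.0, 1.0, 0.0)

def gaussian_profile(r):          # profile of the unit Gaussian kernel G
    return np.exp(-np.asarray(r))

def kernel(profile):              # K(x) = k(||x||^2)
    return lambda x: profile(np.sum(np.square(x), axis=-1))

def scaled(K, a):                 # K_a(x) = K(x / a)
    return lambda x: K(np.asarray(x) / a)

def powered(K, a):                # K^a(x) = (K(x))^a
    return lambda x: K(x) ** a

F, G = kernel(flat_profile), kernel(gaussian_profile)
GF = lambda x: G(x) * F(x)        # a truncated Gaussian, as in (7)

# Example 1's identity G^beta = G_{beta^(-1/2)}, checked at one point:
beta, x = 2.0, np.array([0.3, 0.4])
assert np.isclose(powered(G, beta)(x), scaled(G, beta ** -0.5)(x))
```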
B. Mean Shift Algorithms

Now we redefine the mean shift algorithm based on our generalizations summarized in the introduction.

DEFINITION 2. Let S ⊂ X be a finite set (the "data" or "sample"). Let K be a kernel and w : S → (0, ∞) a weight function. The sample mean with kernel K at x ∈ X is defined as

m(x) = Σ_{s∈S} K(s − x) w(s) s / Σ_{s∈S} K(s − x) w(s).    (8)

Let T ⊂ X be a finite set (the "cluster centers"). The evolution of T in the form of iterations T ← m(T) with m(T) = {m(t); t ∈ T} is called a mean shift algorithm. For each t ∈ T, there is a sequence t, m(t), m(m(t)), ..., that is called the trajectory of t. The weight w(s) can be either fixed throughout the process or re-evaluated after each iteration. It may also be a function of the current T. The algorithm halts when it reaches a fixed point (m(T) = T).

When T is S, the mean shift algorithm is called a blurring process, indicating the successive blurring of the data set, S. □
REMARK 1. The original mean shift process proposed in [1], [3] is a blurring process, in which T = S. In Definition 2, it is generalized so that T and S may be separate sets with S fixed through the process, although the initial T may be a copy of S. Notice that in (8), kernel K can be replaced with kernel aK for any a > 0, without generating any difference. This is the reason why we did not insist that ∫_X K(x) dx = 1, which will attach a factor to K that is related to n, the dimensionality of X. Similarly, the weights w(s) can be normalized so that Σ_{s∈S} w(s) = 1. Because of the inconsequentiality of these factors, we will use the simplest possible expressions for the kernel and the weights. We also have to assume that T is initialized such that Σ_{s∈S} K(s − t) w(s) > 0 for all t ∈ T. Also notice that this is a parallel algorithm, in the sense that all t ∈ T are simultaneously updated based on the previous t and w(s) values. □
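A minimal implementation of the generalized algorithm of Definition 2 might look as follows. This is a sketch with our own names; it assumes the initialization condition of Remark 1, so that the denominators are positive.

```python
import numpy as np

def sample_mean(x, S, w, profile):
    """m(x) of (8): sum_s K(s-x) w(s) s / sum_s K(s-x) w(s), with K(x) = k(||x||^2)."""
    k = profile(np.sum((S - x) ** 2, axis=1)) * w       # K(s - x) w(s), one value per s
    return (k[:, None] * S).sum(axis=0) / k.sum()       # assumes the kernel covers part of S

def mean_shift(T, S, w, profile, iters=50):
    """Evolve the cluster centers T by T <- m(T) while the data S stay fixed."""
    S, T, w = np.asarray(S, float), np.asarray(T, float).copy(), np.asarray(w, float)
    for _ in range(iters):
        T = np.array([sample_mean(t, S, w, profile) for t in T])   # simultaneous update
    return T

# e.g. initial T as a copy of S, unit weights, and a truncated Gaussian profile:
# T = mean_shift(S.copy(), S, np.ones(len(S)), lambda r: np.exp(-r) * (r <= 4.0))
```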
EXAMPLE 2. The "maximum entropy" clustering algorithm of Rose, Gurewitz, and Fox [4] is a mean shift algorithm when T and S are separate sets, G^β is the kernel, and

w(s) = 1 / Σ_{t∈T} G^β(s − t),   s ∈ S.    (9)

These authors also mention that when β approaches infinity, the algorithm degenerates to k-means clustering, which is often described as an optimizing Picard iterative routine [7]:

1) Randomly initialize "cluster centers," T.
2) Compute the following function on T × S:

v_{t,s} = 1 if t = argmin_{t'∈T} ||s − t'||², and 0 otherwise.    (10)

3) Update "cluster centers":

t ← Σ_{s∈S} v_{t,s} s / Σ_{s∈S} v_{t,s},   t ∈ T.

Go to 2.

Indeed, when the profile k is strictly decreasing,

lim_{β→∞} K^β(s − t) / Σ_{t'∈T} K^β(s − t') = v_{t,s}.    (11)

Thus, k-means clustering is the limit of the mean shift algorithm with a strictly decreasing kernel K^β as β → ∞. □
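As a sketch of Example 2 (our code, with an illustrative β), one iteration of this scheme recomputes the weights (9) from the current T and then applies the sample mean (8) with kernel G^β; for large β the soft assignments harden into (10) and the update approaches the k-means step.

```python
import numpy as np

def max_entropy_step(T, S, beta):
    """One T <- m(T) update with kernel G^beta and the weights (9), recomputed from T."""
    d2 = np.sum((S[:, None, :] - T[None, :, :]) ** 2, axis=2)   # ||s - t||^2, shape (|S|, |T|)
    K = np.exp(-beta * d2)                                      # G^beta(s - t)
    w = 1.0 / K.sum(axis=1)                                     # weight (9): 1 / sum_t G^beta(s - t)
    Kw = K * w[:, None]                                         # K(s - t) w(s), soft assignments
    return (Kw.T @ S) / Kw.sum(axis=0)[:, None]                 # new centers m(t), t in T

# For moderate beta this is the soft "maximum entropy" update; as beta grows,
# each row of Kw concentrates on the nearest center, approaching v_{t,s} of (10).
```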
III. MEAN SHIFT AS GRADIENT MAPPING

It has been pointed out in [1] that mean shift is a "very intuitive" estimate of the gradient of the data density. In this section, we give a more rigorous study of this intuition. Theorem 1 relates each kernel to a "shadow" kernel so that mean shift using a kernel will be in the gradient direction of the density estimate using the corresponding "shadow" kernel.

A. Shadow of a Kernel
DEFINITION 3. Kernel H is said to be a shadow of kernel K if the mean shift using K,

m(x) − x = Σ_{s∈S} K(s − x) w(s) s / Σ_{s∈S} K(s − x) w(s) − x,    (12)

is in the gradient direction at x of the density estimate using H,

q(x) = Σ_{s∈S} H(s − x) w(s).    (14)

THEOREM 1. Kernel H is a shadow of kernel K if and only if their profiles, h and k, satisfy the following equation,

h(r) = f(r) + c ∫_r^∞ k(t) dt,    (15)

where c > 0 is a constant and f is a piecewise constant function.

PROOF. The mean shift using kernel K can be rewritten as

m(x) − x = (1/p(x)) Σ_{s∈S} k(||s − x||²) w(s) (s − x)    (16)

with p(x) = Σ_{s∈S} k(||s − x||²) w(s), the density estimate using K. The gradient of (14) at x is

∇q(x) = −2 Σ_{s∈S} h'(||s − x||²) (s − x) w(s).    (17)

To have (16) and (17) point to the same direction, we need h'(r) = −ck(r) for all r and some c > 0. By the fundamental theorem of calculus and the requirement that ∫_0^∞ k(r) dr < ∞, (15) is the only solution. In this case, we have

m(x) − x = (1 / (2c p(x))) ∇q(x),    (18)

or, the magnitude of mean shift is in proportion to the ratio of the gradient and the local density estimate using kernel K. When a discontinuous point z is allowed in h, a constant can be added to h from 0 to z, and h'(r) = −ck(r) is still satisfied except when r = z. □

CLAIM 2. Suppose kernel H is a shadow of K, and a > 0. The following are true.
1) aH is a shadow of K.
2) H_a is a shadow of K_a.
3) If L is a shadow of M, then H + L is a shadow of K + M.
4) A truncated kernel KF_λ may not be continuous at ||x|| = λ. If the shadow is also allowed to be discontinuous at the same points, then HF_λ is a shadow of KF_λ. □

EXAMPLE 3. Using (15) we find that the Epanechnikov kernel is a shadow of the flat kernel, (5), and the biweight kernel is a shadow of the Epanechnikov kernel. These kernels are so named in [2] and they are shown in Fig. 3. □

Fig. 3. (a) The Epanechnikov kernel and (b) the biweight kernel.
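As an illustration of (15), take f ≡ 0 and c = 1 and start from the flat profile k(r) = 1 for 0 ≤ r ≤ 1 and k(r) = 0 otherwise. Then

h(r) = ∫_r^∞ k(t) dt = 1 − r for 0 ≤ r ≤ 1, and 0 for r > 1,

which is the (unnormalized) Epanechnikov profile. Applying (15) once more, ∫_r^1 (1 − t) dt = (1 − r)²/2 on [0, 1], which is proportional to the biweight profile, so the biweight kernel is in turn a shadow of the Epanechnikov kernel.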
B. Gaussian Kernels

THEOREM 2. The only kernels that are their own shadows are the Gaussian kernel G^β and its truncated version G^β F_λ. In this case, the mean shift is equal to

m(x) − x = (1/(2β)) ∇ log q(x),    (21)

where q is the data density estimate using the same kernel.

PROOF. From Theorem 1 we know that kernel K is its own shadow if and only if k'(r) = −ck(r). Using the method of separation of variables, we have

∫_0^r k'(t)/k(t) dt = −cr, or log k(r) − log k(0) = −cr.

This gives us k(r) = k(0) e^{−cr}, which makes K the Gaussian kernel. If discontinuities are allowed in k, then we have the truncated Gaussian kernel. When K is its own shadow, p in (18) is equal to q, and (∇q(x))/q(x) = ∇ log q(x). □
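Equation (21) can be checked numerically. The short script below is ours; the data, β, and the test point are arbitrary. It compares the mean shift step under G^β with a finite-difference gradient of log q.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(50, 2))                  # sample data (ours)
beta, x, eps = 0.7, np.array([0.3, -0.2]), 1e-6

def log_q(x):                                  # log of the density estimate with G^beta
    return np.log(np.sum(np.exp(-beta * np.sum((S - x) ** 2, axis=1))))

k = np.exp(-beta * np.sum((S - x) ** 2, axis=1))
shift = (k[:, None] * S).sum(axis=0) / k.sum() - x            # m(x) - x

grad = np.array([(log_q(x + eps * e) - log_q(x - eps * e)) / (2 * eps) for e in np.eye(2)])
print(np.allclose(shift, grad / (2 * beta), atol=1e-6))       # expect True
```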
REMARK 2. A mapping f : R^n → R^n is said to be a gradient mapping if there exists a function g : R^n → R such that f(x) = ∇g(x) for all x [6]. Theorem 2 is a corollary of a more general result from the symmetry principle: f is a gradient mapping if and only if the Jacobian matrix of f is symmetric. In our case, f(x) = m(x) − x, and by equating ∂f_i/∂x_j and ∂f_j/∂x_i for all i and j, one obtains the necessary and sufficient condition that k'(r) = −ck(r) for any mean shift to be a gradient mapping. □
C. Mode Seeking

Suppose an idealized mode in the density surface also has a Gaussian shape, which, without loss of generality, centers at the origin:

q(x) = A e^{−γ||x||²},   A > 0, γ > 0.    (22)

The mean shift is then

m(x) − x = (1/(2β)) ∇ log q(x) = −2γ x q(x) / (2β q(x)) = −(γ/β) x.    (23)

Because the density surface is estimated with the kernel G^β, any mode approximated by superimposing G^β will have a γ < β. The mean shift (23) will not cause overshoots in this case.

Mean shift is steepest ascent with a varying step size that is the magnitude of the gradient. A notorious problem associated with steepest ascent with fixed step size is the slow movement on plateaus of the surface. For a density surface, large plateaus happen only at low density regions, and after taking the logarithm, the inclination of a plateau is magnified. Combined with the preceding result about overshoot avoidance, mean shift is well-adjusted steepest ascent. □

IV. CONVERGENCE

Theorem 1 says that the mean shift algorithm is steepest ascent over the density of S. Each T point climbs the hill in the density surface independently. Therefore, if S or its density does not change during the execution of the algorithm, the convergence of the evolution of T is a consequence of the convergence of steepest ascent for individual T points. However, in a blurring process, T is S, and S and its density change as the result of each iteration. In this case, convergence is not as obvious as steepest ascent. The main results of this section are two convergence theorems about the blurring process. The concepts of radius and diameter of data, defined below, will be used in the proofs.

A. Radius and Diameter of Data

DEFINITION 4. A direction in X is a point on the unit sphere. That is, a ∈ X is a direction if and only if ||a|| = 1. We call the mapping π_a : X → R with π_a(x) = ⟨x, a⟩ the projection in the direction of a. Let π_a(S) = {π_a(s); s ∈ S}. The convex hull of a set Y ⊂ X is defined as

h(Y) = ∩_{||a||=1} {x ∈ X; min π_a(Y) ≤ π_a(x) ≤ max π_a(Y)}.    (24) □

CLAIM 3. The following are true.
1) min π_a(S) ≤ min π_a(m(T)) ≤ max π_a(m(T)) ≤ max π_a(S).
2) m(T) ⊆ h(S).
3) In a blurring process, we have h(S) ⊇ h(m(S)) ⊇ h(m(m(S))) ⊇ ⋯. There exists an x ∈ X that is in all the convex hulls of data. It is possible to make a translation so that the origin is in all the convex hulls of data. □

DEFINITION 5. Suppose after a translation, the origin is in all the convex hulls of data during a blurring process. Then, ρ(S) = max{||s||; s ∈ S} is said to be the radius of data. The diameter of data is defined as

d(S) = sup_{||a||=1} (max π_a(S) − min π_a(S)).    (25) □

It should be clear that ρ(S) ≤ d(S) ≤ 2ρ(S). Because the convex hulls of data form a shrinking inclusion sequence, the radius or diameter of data also form a nonincreasing nonnegative sequence, which must approach a nonnegative limit. Theorem 3 says that this limit is zero when the kernel in the blurring process has a support wide enough to cover the data set.

B. Blurring With Broad Kernels

THEOREM 3. Let k be the profile of the kernel used in a blurring process, and S_0 the initial data. If k(d²(S_0)) ≥ κ for some κ > 0, then the diameter of data approaches zero. The convergence rate is at least as fast as

d(m(S)) ≤ (1 − κ/(4k(0))) d(S).    (26)

PROOF. Let π be a projection, u = min π(S), v = max π(S), and z = (u + v)/2. Suppose

Σ_{s∈S, π(s)≤z} w(s) ≥ Σ_{s∈S, π(s)>z} w(s).    (27)

Then, for s ∈ S,

v − π(m(s)) = Σ_{s'∈S} K(s' − s) w(s') (v − π(s')) / Σ_{s'∈S} K(s' − s) w(s')
            ≥ (v − z) Σ_{s'∈S, π(s')≤z} K(s' − s) w(s') / Σ_{s'∈S} K(s' − s) w(s')
            ≥ ((v − u)/2) · κ Σ_{s'∈S, π(s')≤z} w(s') / (k(0) Σ_{s'∈S} w(s'))
            ≥ κ(v − u) / (4k(0)).    (28)

Clearly, we have

max π(m(S)) − min π(m(S)) ≤ max π(m(S)) − u ≤ (1 − κ/(4k(0))) (v − u).    (29)

The same result can be obtained when the inequality in (27) is reversed. Therefore, the result holds for all projections, and because v − u ≤ d(S), we completed the proof. □

EXAMPLE 4. As the simplest example, let X be the real line and S = {x, y} with x < y. Let k be the profile of the kernel with k(|x − y|²) > 0. If the weights on x and y are the same, a blurring process will make them converge to (x + y)/2 with the rate

m(y) − m(x) = [(k(0) − k(|x − y|²)) / (k(0) + k(|x − y|²))] (y − x).    (30)

The difference between the two points will not become zero at any iteration, unless the kernel is flat in the range containing both points. When the weights are different, x and y converge to a point that may not be the weighted average of the initial x and y with the same weights.

Now assume that S is fixed through the process while T is initialized to S. Because this is no longer a blurring process, the result in Theorem 3 does not apply. That is, T may converge to more than one point. The two T points converge to (x + y)/2 when the density of S is unimodal. Otherwise, they will converge to the two modes of the bimodal density. When G_λ is used as the kernel, and x = 0, y = 1, the density is

q(z) = e^{−z²/λ²} + e^{−(z−1)²/λ²}.

The first derivative of q(z) always has a zero at z = 1/2, but the second derivative of q(z) at z = 1/2 is negative when λ > 1/√2 and positive when λ < 1/√2. Thus 1/√2 is a critical λ value. When λ is larger than this critical value, the density is unimodal and the two T points will converge to 1/2. When λ is smaller than 1/√2, they will approach two distinct limit points. □

C. Blurring with Truncated Kernels

When truncated kernels are used in the blurring process, S may converge to many points. In fact, if the kernel is truncated so that it will not cover more than one point in S, then S will not change in the process. The result of Theorem 3 applies to an isolated group of data points that can be covered by a truncated kernel. In this case, they eventually converge to one point, although the rest of the data set may converge to other cluster centers. Again, when the kernel is not flat, no merger will take place after any finite number of iterations. (Flat kernels generate a special case where merger is possible in a finite number of steps. See [8].)

When the blurring process is simulated on a digital computer, points do merge in a finite number of steps, due to the limits on floating-point precision. In fact, there is a minimum distance between data points that can be maintained. Theorem 4 below shows that under this condition, the blurring process terminates in a finite number of steps.

LEMMA 1. Suppose X = R^n and r(S) is the radius of data in a blurring process. If the minimum distance between data points is δ, then for any direction a, there cannot be more than one data point s ∈ S with π_a(s) > r(S) − h, where h is a function of r(S), δ, and n.

PROOF. Suppose there is a direction a such that there are s, t ∈ S with π_a(s) > r(S) − h and π_a(t) > r(S) − h. Let b be a direction perpendicular to a and d_b = |π_b(s) − π_b(t)|. Because both s and t are at least r(S) − h away from the origin in direction a, and one of them must be d_b/2 away from the origin in direction b, the square of the radius of data cannot be smaller than (r(S) − h)² + d_b²/4. It follows that

d_b² ≤ 8r(S)h − 4h².

The distance between s and t, ||s − t||, satisfies

δ² ≤ ||s − t||² < (n − 1)(8r(S)h − 4h²) + h².

If (n − 1)(8r(S)h − 4h²) + h² ≤ δ², then these s and t cannot exist. This condition is satisfied when

h ≤ [4(n − 1) r(S) − √(16(n − 1)² r²(S) − (4n − 5)δ²)] / (4n − 5). □

References

B.W. Silverman, Density Estimation for Statistics and Data Analysis.
J.M. Ortega and W.C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables.
K. Fukunaga and L.D. Hostetler, "The estimation of the gradient of a density function, with applications in pattern recognition."
S.Z. Selim and M.A. Ismail, "K-means-type algorithms: A generalized convergence theorem and characterization of local optimality."
K. Rose, E. Gurewitz, and G.C. Fox, "Statistical mechanics and phase transitions in clustering."