Background and Foreground Modeling Using
Nonparametric Kernel Density Estimation for
Visual Surveillance
AHMED ELGAMMAL, RAMANI DURAISWAMI, MEMBER, IEEE, DAVID HARWOOD, AND
LARRY S. DAVIS, FELLOW, IEEE
Invited Paper
Automatic understanding of events happening at a site is the
ultimate goal for many visual surveillance systems. Higher level
understanding of events requires that certain lower level computer
vision tasks be performed. These may include detection of unusual
motion, tracking targets, labeling body parts, and understanding
the interactions between people. To achieve many of these tasks,
it is necessary to build representations of the appearance of
objects in the scene. This paper focuses on two issues related to
this problem. First, we construct a statistical representation of
the scene background that supports sensitive detection of moving
objects in the scene, but is robust to clutter arising out of natural
scene variations. Second, we build statistical representations of
the foreground regions (moving objects) that support their tracking
and support occlusion reasoning. The probability density functions
(pdfs) associated with the background and foreground are likely
to vary from image to image and will not in general have a known
parametric form. We accordingly utilize general nonparametric
kernel density estimation techniques for building these statistical
representations of the background and the foreground. These
techniques estimate the pdf directly from the data without any
assumptions about the underlying distributions. Example results
from applications are presented.
Keywords—Background subtraction, color modeling, kernel
density estimation, occlusion modeling, tracking, visual surveil-
lance.
Manuscript received May 31, 2001; revised February 15, 2002. This
work was supported in part by the ARDA Video Analysis and Content
Exploitation project under Contract MDA 90400C2110 and in part by
Philips Research.
A. Elgammal is with the Computer Vision Laboratory, University
of Maryland Institute for Advanced Computer Studies, Department of
Computer Science, University of Maryland, College Park, MD 20742 USA
(e-mail: elgammal@cs.umd.edu).
R. Duraiswami, D. Harwood, and L. S. Davis are with the Computer
Vision Laboratory, University of Maryland Institute for Advanced Computer
Studies, University of Maryland, College Park, MD 20742 USA (e-mail:
ramani@umiacs.umd.edu; harwood@umiacs.umd.edu; lsd@cs.umd.edu).
Publisher Item Identifier 10.1109/JPROC.2002.801448.
I. INTRODUCTION
In automated surveillance systems, cameras and other sen-
sors are typically used to monitor activities at a site with the
goal of automatically understanding events happening at the
site. Automatic event understanding would enable function-
alities such as detection of suspicious activities and site se-
curity. Current systems archive huge volumes of video for
eventual off-line human inspection. The automatic detection
of events in videos would facilitate efficient archiving and
automatic annotation. It could be used to direct the attention
of human operators to potential problems. The automatic de-
tection of events would also dramatically reduce the band-
width required for video transmission and storage as only in-
teresting pieces would need to be transmitted or stored.
Higher level understanding of events requires certain
lower level computer vision tasks to be performed such
as detection of unusual motion, tracking targets, labeling
body parts, and understanding the interactions between
people. For many of these tasks, it is necessary to build
representations of the appearance of objects in the scene. For
example, the detection of unusual motions can be achieved
by building a representation of the scene background and
comparing new frames with this representation. This process
is called background subtraction. Building representations
for foreground objects (targets) is essential for tracking
them and maintaining their identities. This paper focuses
on two issues: how to construct a statistical representation
of the scene background that supports sensitive detection
of moving objects in the scene and how to build statistical
representations of the foreground (moving objects) that
support their tracking.
One useful tool for building such representations is statistical modeling, where a process is modeled as a random variable in a feature space with an associated probability density function (pdf). The density function could be represented parametrically using a specified statistical distribution that is assumed to approximate the actual distribution, with the associated parameters estimated from training data. Alternatively, nonparametric approaches could be used. These estimate the density function directly from the data without any assumptions about the underlying distribution. This avoids having to choose a model and estimate its distribution parameters.
0018-9219/02$17.00 © 2002 IEEE
PROCEEDINGS OF THE IEEE, VOL. 90, NO. 7, JULY 2002 1151
A particular nonparametric technique that estimates the
underlying density, avoids having to store the complete data,
and is quite general is the kernel density estimation tech-
nique. In this technique, the underlying pdf is estimated as

$$\hat{f}(x) = \sum_{i=1}^{N} \alpha_i K(x - x_i) \qquad (1)$$

where $K$ is a "kernel function" (typically a Gaussian) centered at each data point $x_i$ in feature space, and the $\alpha_i$ are weighting coefficients (typically uniform weights are used, i.e., $\alpha_i = 1/N$). Kernel density estimators asymptotically converge to any density function [1], [2]. This property
makes these techniques quite general and applicable to many
vision problems where the underlying density is not known.
In this paper, kernel density estimation techniques are
utilized for building representations for both the background
and the foreground. We present an adaptive background
modeling and background subtraction technique that is able
to detect moving targets in challenging outdoor environ-
ments with moving trees and changing illumination. We also
present a technique for modeling foreground regions and
show how it can be used for segmenting major body parts of
a person and for segmenting groups of people.
II. KERNEL DENSITY ESTIMATION TECHNIQUES
Given a sample $S = \{x_i\},\ i = 1, \ldots, N$, from a distribution with density function $f(x)$, an estimate $\hat{f}(x)$ of the density at $x$ can be calculated using

$$\hat{f}(x) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x - x_i) \qquad (2)$$

where $K_\sigma$ is a kernel function (sometimes called a "window" function) with a bandwidth (scale) $\sigma$ such that $K_\sigma(t) = (1/\sigma)K(t/\sigma)$. The kernel function $K$ should satisfy $K(t) \ge 0$ and $\int K(t)\,dt = 1$. We can think of (2) as estimating
the pdf by averaging the effect of a set of kernel functions
centered at each data point. Alternatively, since the kernel
function is symmetric, we can also regard this computation
as averaging the effect of a kernel function centered at the
estimation point and evaluated at each data point. Kernel
density estimators asymptotically converge to any density
function with sufficient samples [1], [2]. This property makes
the technique quite general for estimating the density of
any distribution. In fact, all other nonparametric density
estimation methods, e.g., histograms, can be shown to be
asymptotically kernel methods [1].
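As a concrete illustration of (2), the following short sketch estimates a one-dimensional density with a Gaussian kernel. The function name and the sample values are our own, purely illustrative choices, not from the paper:

```python
import math

def kde_estimate(x, samples, sigma):
    """Estimate the density at x from a sample using a Gaussian kernel
    K_sigma of bandwidth sigma, averaging one kernel per data point."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return sum(norm * math.exp(-((x - xi) ** 2) / (2.0 * sigma ** 2))
               for xi in samples) / len(samples)

# Gray levels observed at one pixel over time:
samples = [100, 102, 98, 101, 99]
p_near = kde_estimate(100, samples, sigma=2.0)  # near the data: high density
p_far = kde_estimate(140, samples, sigma=2.0)   # far from the data: near zero
```

The estimate is high near the observed values and falls off away from them, without ever assuming a parametric form for the underlying distribution.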
For higher dimensions, products of one-dimensional (1-D) kernels [1] can be used as

$$\hat{f}(x) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} K_{\sigma_j}(x_j - x_{ij}) \qquad (3)$$

where the same kernel function is used in each dimension with a suitable bandwidth $\sigma_j$ for each dimension. We can avoid having to store the complete data set by weighting the samples as

$$\hat{f}(x) = \sum_{i=1}^{N} \alpha_i \prod_{j=1}^{d} K_{\sigma_j}(x_j - x_{ij})$$

where the $\alpha_i$'s are weighting coefficients that sum up to one.
A variety of kernel functions with different properties have
been used in the literature. Typically the Gaussian kernel is
used for its continuity, differentiability, and locality proper-
ties. Note that choosing the Gaussian as a kernel function
is different from fitting the distribution to a Gaussian model
(normal distribution). Here, the Gaussian is only used as a
function to weight the data points. Unlike parametric fitting
of a mixture of Gaussians, kernel density estimation is a more
general approach that does not assume any specific shape for
the density function. A good discussion of kernel estimation
techniques can be found in [1]. The major drawback of using
the nonparametric kernel density estimator is its computa-
tional cost. This becomes less of a problem as the available
computational power increases and as efficient computational
methods have become available recently [3], [4].
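To make the product-kernel form in (3) concrete, the following sketch evaluates a color density with one Gaussian kernel per channel and a separate bandwidth for each dimension. The sample values and names are illustrative assumptions of ours:

```python
import math

def product_kernel_density(x, samples, sigmas):
    """Density estimate for a d-dimensional color feature x using a
    product of 1-D Gaussian kernels with per-channel bandwidths."""
    total = 0.0
    for xi in samples:
        k = 1.0
        for xj, xij, s in zip(x, xi, sigmas):
            k *= math.exp(-((xj - xij) ** 2) / (2.0 * s * s)) \
                 / (s * math.sqrt(2.0 * math.pi))
        total += k
    return total / len(samples)

# RGB values observed at one pixel over time:
rgb_samples = [(120, 80, 60), (122, 78, 61), (119, 81, 59)]
sigmas = (3.0, 3.0, 3.0)
p_bg = product_kernel_density((121, 79, 60), rgb_samples, sigmas)  # background-like
p_fg = product_kernel_density((20, 200, 200), rgb_samples, sigmas)  # foreground-like
```

A color close to the stored sample receives a much higher density than one far from it in any channel, since the per-channel kernels multiply.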
III. MODELING THE BACKGROUND
A. Background Subtraction: A Review
1) The Concept: In video surveillance systems, sta-
tionary cameras are typically used to monitor activities at
outdoor or indoor sites. Since the cameras are stationary, the
detection of moving objects can be achieved by comparing
each new frame with a representation of the scene back-
ground. This process is called background subtraction and
the scene representation is called the background model.
Typically, background subtraction forms the first stage
in an automated visual surveillance system. Results from
background subtraction are used for further processing, such
as tracking targets and understanding events.
A central issue in building a representation for the scene
background is what features to use for this representation
or, in other words, what to model in the background. In
the literature, a variety of features have been used for
background modeling, including pixel-based features (pixel
intensity, edges, disparity) and region-based features (e.g.,
block correlation). The choice of the features affects how
the background model tolerates changes in the scene and the
granularity of the detected foreground objects.
In any indoor or outdoor scene, there are changes that
occur over time and may be classified as changes to the scene
background. It is important that the background model toler-
ates these kind of changes, either by being invariant to them
or by adapting to them. These changes can be local, affecting
only part of the background, or global, affecting the entire
background. The study of these changes is essential to un-
derstand the motivations behind different background sub-
traction techniques. We classify these changes according to
their source.
Illumination changes:
gradual change in illumination, as might occur in out-
door scenes due to the change in the location of the sun;

sudden change in illumination as might occur in an in-
door environment by switching the lights on or off, or
in an outdoor environment by a change between cloudy
and sunny conditions;
shadows cast on the background by objects in the back-
ground itself (e.g., buildings and trees) or by moving
foreground objects.
Motion changes:
image changes due to small camera displacements
(these are common in outdoor situations due to wind
load or other sources of motion which causes global
motion in the images);
motion in parts of the background, for example, tree
branches moving with the wind or rippling water.
Changes introduced to the background: These include any
change in the geometry or the appearance of the background
of the scene introduced by targets. Such changes typically
occur when something relatively permanent is introduced into the scene background (for example, if somebody introduces something into, or removes something from, the background; if a car is parked in the scene or moves out of the scene; or if a person stays stationary in the scene for an extended period).
2) Practice: Many researchers have proposed methods to address some of the issues regarding background modeling, and we provide a brief review of the relevant work here.
Pixel intensity is the most commonly used feature in back-
ground modeling. If we monitor the intensity value of a pixel
over time in a completely static scene, then the pixel in-
tensity can be reasonably modeled with a Gaussian distribution $N(\mu, \sigma^2)$, given that the image noise over time can be modeled by a zero-mean Gaussian distribution $N(0, \sigma^2)$.
This Gaussian distribution model for the intensity value of a
pixel is the underlying model for many background subtraction techniques. For example, one of the simplest background
subtraction techniques is to calculate an average image of
the scene, subtract each new frame from this image, and
threshold the result. This basic Gaussian model can adapt to
slow changes in the scene (for example, gradual illumination
changes) by recursively updating the model using a simple
adaptive filter. This basic adaptive model is used in [5]; also,
Kalman filtering for adaptation is used in [6]–[8].
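The basic adaptive model described above can be sketched in a few lines. Here pixels are a flat list of gray values, and the learning rate `alpha` and the threshold are assumed values chosen for illustration only:

```python
def update_background(bg, frame, alpha=0.05):
    """Recursive adaptive filter: blend the new frame into the background
    so gradual changes (e.g., illumination) are absorbed over time."""
    return [(1.0 - alpha) * b + alpha * f for b, f in zip(bg, frame)]

def detect_foreground(bg, frame, threshold=25.0):
    """Threshold the per-pixel difference from the background model."""
    return [abs(f - b) > threshold for b, f in zip(bg, frame)]

bg = [100.0, 100.0, 100.0]           # average image (3 pixels for brevity)
frame = [101.0, 99.0, 180.0]         # third pixel covered by a moving object
mask = detect_foreground(bg, frame)  # [False, False, True]
bg = update_background(bg, frame)    # bg[2] drifts slightly toward 180
```

Note how the moving object still contaminates the model slowly through the update; this is exactly the limitation that motivates the mixture and nonparametric models discussed next.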
Typically, in outdoor environments with moving trees and
bushes, the scene background is not completely static. For
example, one pixel can be the image of the sky in one frame,
a tree leaf in another frame, a tree branch in a third frame,
and some mixture subsequently. In each situation, the pixel
will have a different intensity (color), so a single Gaussian
assumption for the pdf of the pixel intensity will not hold.
Instead, a generalization based on a mixture of Gaussians
has been used in [9]–[11] to model such variations. In [9]
and [10], the pixel intensity was modeled by a mixture of $K$ Gaussian distributions ($K$ is a small number from 3 to 5).
The mixture is weighted by the frequency with which each
of the Gaussians explains the background. In [11], a mixture
of three Gaussian distributions was used to model the pixel
value for traffic surveillance applications. The pixel inten-
sity was modeled as a weighted mixture of three Gaussian
distributions corresponding to road, shadow, and vehicle dis-
tribution. Adaptation of the Gaussian mixture models can be
achieved using an incremental version of the EM algorithm.
In [12], linear prediction using the Wiener filter is used to
predict pixel intensity given a recent history of values. The
prediction coefficients are recomputed each frame from the
sample covariance to achieve adaptivity. Linear prediction
using the Kalman filter was also used in [6]–[8].
All of the previously mentioned models are based on sta-
tistical modeling of pixel intensity with the ability to adapt
the model. While pixel intensity is not invariant to illumi-
nation changes, model adaptation makes it possible for such
techniques to adapt to gradual changes in illumination. On
the other hand, a sudden change in illumination presents a
challenge to such models.
Another approach to model a wide range of variations
in the pixel intensity is to represent these variations as dis-
crete states corresponding to modes of the environment, e.g.,
lights on/off or cloudy/sunny skies. Hidden Markov models
(HMMs) have been used for this purpose in [13] and [14].
In [13], a three-state HMM has been used to model the in-
tensity of a pixel for a traffic-monitoring application where
the three states correspond to the background, shadow, and
foreground. The use of HMMs imposes a temporal continuity
constraint on the pixel intensity, i.e., if the pixel is detected as
a part of the foreground, then it is expected to remain part of
the foreground for a period of time before switching back to
be part of the background. In [14], the topology of the HMM
representing global image intensity is learned while learning
the background. At each global intensity state, the pixel in-
tensity is modeled using a single Gaussian. It was shown that
the model is able to learn simple scenarios like switching the
lights on and off.
Alternatively, edge features have also been used to model
the background. The use of edge features to model the back-
ground is motivated by the desire to have a representation
of the scene background that is invariant to illumination
changes. In [15], foreground edges are detected by com-
paring the edges in each new frame with an edge map of the
background which is called the background “primal sketch.”
The major drawback of using edge features to model the
background is that it would only be possible to detect edges
of foreground objects instead of the dense connected regions
that result from pixel-intensity-based approaches. A fusion
of intensity and edge information was used in [16].
Block-based approaches have been also used for modeling
the background. Block matching has been extensively used
for change detection between consecutive frames. In [17],
each image block is fit to a second-order bivariate polynomial
and the remaining variations are assumed to be noise. A sta-
tistical likelihood test is then used to detect blocks with sig-
nificant change. In [18], each block was represented with its
median template over the background learning period and its
block standard deviation. Subsequently, at each new frame,
each block is correlated with its corresponding template, and
blocks with too much deviation relative to the measured stan-
dard deviation are considered to be foreground. The major
drawback with block-based approaches is that the detection
unit is a whole image block and therefore they are only suit-
able for coarse detection.
ELGAMMAL et al.: MODELING USING NONPARAMETRIC KERNEL DENSITY ESTIMATION FOR VISUAL SURVEILLANCE 1153

In order to monitor wide areas with sufficient resolution,
cameras with zoom lenses are often mounted on pan-tilt plat-
forms. This enables high-resolution imagery to be obtained
from any arbitrary viewing angle from the location where
the camera is mounted. The use of background subtraction
in such situations requires a representation of the scene
background for any arbitrary pan-tilt-zoom combination,
which is an extension to the original background subtraction
concept with a stationary camera. In [19], image mosaicing
techniques are used to build panoramic representations of
the scene background. Alternatively, in [20], a represen-
tation of the scene background as a finite set of images
on a virtual polyhedron is used to construct images of the
scene background at any arbitrary pan-tilt-zoom setting.
Both techniques assume that the camera rotates about its optical center, so that there is no significant motion parallax.
B. Nonparametric Background Modeling
In this section, we describe a background model and a
background subtraction process that we have developed,
based on nonparametric kernel density estimation. The
model uses pixel intensity (color) as the basic feature for
modeling the background. The model keeps a sample of
intensity values for each pixel in the image and uses this
sample to estimate the density function of the pixel intensity
distribution. Therefore, the model is able to estimate the
probability of any newly observed intensity value. The
model can handle situations where the background of the
scene is cluttered and not completely static but contains
small motions that are due to moving tree branches and
bushes. The model is updated continuously and therefore
adapts to changes in the scene background.
1) Background Subtraction: Let $S = \{x_1, x_2, \ldots, x_N\}$ be a sample of intensity values for a pixel. Given this sample, we can obtain an estimate of the pixel intensity pdf at any intensity value using kernel density estimation. Given the observed intensity $x_t$ at time $t$, we can estimate the probability of this observation as

$$\Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x_t - x_i) \qquad (4)$$

where $K_\sigma$ is a kernel function with bandwidth $\sigma$. This estimate can be generalized to use color features by using kernel products as

$$\Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} K_{\sigma_j}(x_{t_j} - x_{i_j}) \qquad (5)$$

where $x_t$ is a $d$-dimensional color feature and $K_{\sigma_j}$ is a kernel function with bandwidth $\sigma_j$ in the $j$th color space dimension. If we choose our kernel function $K$ to be Gaussian, then the density can be estimated as

$$\Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-(x_{t_j} - x_{i_j})^2 / 2\sigma_j^2} \qquad (6)$$
Fig. 1. Background Subtraction. (a) Original image. (b) Estimated
probability image.
Using this probability estimate, the pixel is considered to be a foreground pixel if $\Pr(x_t) < th$, where the threshold $th$ is a global threshold over all the images that can be adjusted to achieve a desired percentage of false positives. Practically, the probability estimation in (6) can be calculated in a very fast way using precalculated lookup tables for the kernel function values given the intensity value difference $(x_{t_j} - x_{i_j})$ and the kernel function bandwidth. Moreover, a partial evaluation of the sum in (6) is usually sufficient to surpass the threshold at most image pixels, since most of the image is typically from the background. This allows us to construct a very fast implementation.
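A minimal gray-scale sketch of this scheme, including the lookup table and the partial evaluation of the sum in (6), might look as follows. Thresholds, names, and sample values are our own illustrative choices:

```python
import math

def make_kernel_lut(sigma, max_diff=256):
    """Precompute the Gaussian kernel value for every possible integer
    intensity difference, as suggested in the text."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [norm * math.exp(-(d * d) / (2.0 * sigma * sigma))
            for d in range(max_diff)]

def is_foreground(x_t, sample, lut, th):
    """Foreground test Pr(x_t) < th.  The sum over the sample stops
    early once it is large enough to guarantee Pr(x_t) >= th (partial
    evaluation), which is the common case for background pixels."""
    target = th * len(sample)
    total = 0.0
    for xi in sample:
        total += lut[abs(x_t - xi)]
        if total >= target:       # already provably background
            return False
    return True

lut = make_kernel_lut(sigma=5.0)
sample = [100, 103, 98, 101, 99, 102, 100, 97]   # recent history of one pixel
fg_bg = is_foreground(100, sample, lut, th=1e-3)   # False: typical value
fg_obj = is_foreground(200, sample, lut, th=1e-3)  # True: far from history
```

For a background pixel the loop usually exits after the very first kernel evaluation, which is what makes the full-image subtraction fast in practice.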
Since kernel density estimation is a general approach, the estimate of (4) can converge to any pixel intensity density function. Here, the estimate is based on the most recent $N$ samples used in the computation. Therefore, adaptation of the model can be achieved simply by adding new samples and ignoring older samples [21]. Fig. 1(b) shows the estimated background probability, where brighter pixels represent lower background probability.
One major issue that needs to be addressed when using
kernel density estimation technique is the choice of suitable
kernel bandwidth (scale). Theoretically, as the number of
samples reaches infinity, the choice of the bandwidth is
insignificant and the estimate will approach the actual
density. Practically, since only a finite number of samples
are used and the computation must be performed in real
time, the choice of suitable bandwidth is essential. Too
small a bandwidth will lead to a ragged density estimate,

while too wide a bandwidth will lead to an over-smoothed
density estimate [2]. Since the expected variations in pixel
intensity over time are different from one location to another
in the image, a different kernel bandwidth is used for each
pixel. Also, a different kernel bandwidth is used for each
color channel.
To estimate the kernel bandwidth $\sigma_j$ for the $j$th color channel for a given pixel, we compute the median absolute deviation over the sample for consecutive intensity values of the pixel. That is, the median $m$ of $|x_i - x_{i+1}|$ for each consecutive pair $(x_i, x_{i+1})$ in the sample is calculated independently for each color channel. The motivation behind the use of the median of absolute deviations is that pixel intensities over time are expected to have jumps because different objects (e.g., sky, branch, leaf, and mixtures when an edge passes through the pixel) are projected onto the same pixel at different times. Since we are measuring deviations between two consecutive intensity values, the pair $(x_i, x_{i+1})$ usually comes from the same local-in-time distribution, and only a few pairs are expected to come from cross distributions (intensity jumps). The median is a robust estimate and should not be affected by a few such jumps.
If we assume that this local-in-time distribution is Gaussian $N(\mu, \sigma^2)$, then the distribution of the deviation $(x_i - x_{i+1})$ is also Gaussian, $N(0, 2\sigma^2)$. Since this distribution is symmetric, the median of the absolute deviations, $m$, is equivalent to the quarter percentile of the deviation distribution. That is,

$$\Pr\left(N(0, 2\sigma^2) > m\right) = 0.25$$

and therefore the standard deviation of the first distribution can be estimated as

$$\sigma = \frac{m}{0.68\sqrt{2}}.$$

Since the deviations are integer gray scale (color) values, linear interpolation is used to obtain more accurate median values.
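The bandwidth estimate follows directly from this recipe, with $\sigma = m/(0.68\sqrt{2})$ and $m$ the median absolute deviation between consecutive samples. The sketch below works on one channel and, for simplicity, skips the linear interpolation step; names and values are ours:

```python
import statistics

def bandwidth_from_history(values):
    """Bandwidth estimate sigma = m / (0.68 * sqrt(2)), where m is the
    median of the absolute deviations between consecutive samples."""
    deviations = [abs(a - b) for a, b in zip(values, values[1:])]
    m = statistics.median(deviations)
    return m / (0.68 * 2.0 ** 0.5)

# A pixel that mostly images one surface but occasionally jumps to
# another (e.g., a branch crossing it):
history = [100, 101, 100, 99, 100, 160, 161, 100, 101, 100]
sigma = bandwidth_from_history(history)   # close to 1.04: jumps are ignored
```

Even though two deviations in this history are around 60 gray levels, the median keeps the estimate near the local-in-time noise level, which is the robustness property the text argues for.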
2) Probabilistic Suppression of False Detection: In out-
door environments with fluctuating backgrounds, there are
two sources of false detections. First, there are false detec-
tions due to random noise which are expected to be homo-
geneous over the entire image. Second, there are false detec-
tions due to small movements in the scene background that
are not represented by the background model. This can occur
locally, for example, if a tree branch moves further than it
did during model generation. This can also occur globally in
the image as a result of small camera displacements caused
by wind load, which is common in outdoor surveillance and
causes many false detections. These kinds of false detections
are usually spatially clustered in the image, and they are not
easy to eliminate using morphological techniques or noise
filtering because these operations might also affect detection
of small and/or occluded targets.
If a part of the background (a tree branch, for example)
moves to occupy a new pixel, but it was not part of the model
for that pixel, then it will be detected as a foreground object.
However, this object will have a high probability of being
a part of the background distribution corresponding to its
original pixel. Assuming that only a small displacement can
occur between consecutive frames, we decide if a detected
pixel is caused by a background object that has moved by
considering the background distributions of a small neigh-
borhood of the detection location.
Let $x_t$ be the observed value of a pixel detected as a foreground pixel at time $t$. We define the pixel displacement probability $P_N(x_t)$ to be the maximum probability that the observed value, $x_t$, belongs to the background distribution of some point in the neighborhood $N(x)$ of $x$:

$$P_N(x_t) = \max_{y \in N(x)} \Pr(x_t \mid B_y)$$

where $B_y$ is the background sample for pixel $y$, and the probability estimation $\Pr(x_t \mid B_y)$ is calculated using the kernel function estimation as in (6). By thresholding $P_N$ for detected pixels, we can eliminate many false detections due to small motions in the background scene. To avoid losing true detections that might accidentally be similar to the background of some nearby pixel (e.g., camouflaged targets), a constraint is added that the whole detected foreground object must have moved from a nearby location, and not only some of its pixels. The component displacement probability $P_C$ is defined to be the probability that a detected connected component $C$ has been displaced from a nearby location. This probability is estimated by

$$P_C = \prod_{x \in C} P_N(x).$$

For a connected component corresponding to a real target, the probability that this component has displaced from the background will be very small. So, a detected pixel $x$ will be considered to be a part of the background only if $P_N(x) > th_1$ and $P_C > th_2$.
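Under the Gaussian-kernel model of (6), the pixel displacement probability and the component displacement probability can be sketched as follows. The neighborhood representation and names are our own simplifications; a real implementation would scan a small image window around each detection:

```python
import math

def kde_prob(x, sample, sigma):
    """Pr(x | B_y): background probability of value x under sample B_y,
    using a one-dimensional Gaussian kernel estimate."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return sum(norm * math.exp(-((x - xi) ** 2) / (2.0 * sigma * sigma))
               for xi in sample) / len(sample)

def displacement_prob(x, neighborhood_samples, sigma):
    """P_N(x): maximum background probability over the samples B_y of the
    pixels y in a small neighborhood of the detection."""
    return max(kde_prob(x, s, sigma) for s in neighborhood_samples)

def component_displacement_prob(values, neighborhoods, sigma):
    """P_C: product of P_N over all pixels of a detected component."""
    p = 1.0
    for x, nbhd in zip(values, neighborhoods):
        p *= displacement_prob(x, nbhd, sigma)
    return p

# Background samples of two neighboring pixels (sky-like and leaf-like):
nbhd = [[100, 100, 101, 99], [150, 151, 149, 150]]
p_moved = displacement_prob(150, nbhd, sigma=5.0)   # branch moved here: high
p_target = displacement_prob(50, nbhd, sigma=5.0)   # true target: near zero
p_c = component_displacement_prob([150, 150], [nbhd, nbhd], sigma=5.0)
```

A detection whose value matches some nearby background sample gets a high $P_N$ and is suppressed, while a genuine target keeps both $P_N$ and the product $P_C$ small.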
Fig. 2 illustrates the effect of the second stage of detec-
tion. The result after the first stage is shown in Fig. 2(b).
In this example, the background has not been updated for
several seconds, and the camera has been slightly displaced
during this time interval, so we see many false detections
along high-contrast edges. Fig. 2(c) shows the result after
suppressing the detected pixels with high displacement prob-
ability. Most false detections due to displacement were elim-
inated, and only random noise that is uncorrelated with the
scene remains as false detections. However, some true de-
tected pixels were also lost. The final result of the second
stage of the detection is shown in Fig. 2(d), where the com-
ponent displacement probability constraint was added. Fig.
3(b) shows results for a case where, as a result of the wind load, the camera is shaking slightly, resulting in many clustered false detections, especially along the edges. After probabilistic suppression of false detections [Fig. 3(c)], most of these clustered false detections are suppressed, while the small target on the left side of the image remains.