Bayesian Modelling of Dynamic Scenes for Object
Detection
Yaser Sheikh and Mubarak Shah
Abstract
Accurate detection of moving objects is an important precursor to stable tracking or recognition. In
this paper, we present an object detection scheme that has three innovations over existing approaches.
Firstly, the model of the intensities of image pixels as independent random variables is challenged and
it is asserted that useful correlation exists in intensities of spatially proximal pixels. This correlation is
exploited to sustain high levels of detection accuracy in the presence of dynamic backgrounds. By using
a non-parametric density estimation method over a joint domain-range representation of image pixels,
multi-modal spatial uncertainties and complex dependencies between the domain (location) and range
(color) are directly modeled. We propose a model of the background as a single probability density.
Secondly, temporal persistence is proposed as a detection criterion. Unlike previous approaches to object
detection which detect objects by building adaptive models of the background, the foreground is modeled
to augment the detection of objects (without explicit tracking), since objects detected in the preceding
frame contain substantial evidence for detection in the current frame. Finally, the background and
foreground models are used competitively in a MAP-MRF decision framework, stressing spatial context
as a condition of detecting interesting objects and the posterior function is maximized efficiently by
finding the minimum cut of a capacitated graph. Experimental validation of the proposed method is
performed and presented on a diverse set of dynamic scenes.
Keywords
Object Detection, Kernel Density Estimation, Joint Domain Range, MAP-MRF Estimation.
I. Introduction
Automated surveillance systems typically use stationary sensors to monitor an envi-
ronment of interest. The assumption that the sensor remains stationary between the

incidence of each video frame allows the use of statistical background modeling tech-
niques for the detection of moving objects such as [39], [33] and [7]. Since ‘interesting’
objects in a scene are usually defined to be moving ones, such object detection provides
a reliable foundation for other surveillance tasks like tracking ([14], [16], [5]) and is often
also an important prerequisite for action or object recognition. However, the assumption
of a stationary sensor does not necessarily imply a stationary background. Examples of
‘nonstationary’ background motion abound in the real world, including periodic motions,
such as ceiling fans, pendulums, or escalators, and dynamic textures, such as fountains,
swaying trees or ocean ripples (shown in Figure 1). Furthermore, the assumption that
the sensor remains stationary is often nominally violated by common phenomena such as
wind or ground vibrations and to a larger degree by (stationary) hand-held cameras. If
natural scenes are to be modeled it is essential that object detection algorithms operate
reliably in such circumstances. Background modeling techniques have also been used for
foreground detection in pan-tilt-zoom cameras, [37]. Since the focal point does not change
when a camera pans or tilts, planar-projective motion compensation can be performed to
create a background mosaic model. Often, however, due to independently moving objects
motion compensation may not be exact, and background modeling approaches that do not
take such nominal misalignment into account usually perform poorly. Thus, a principal
proposition in this work is that modeling spatial uncertainties is important for real world
deployment, and we provide an intuitive and novel representation of the scene background
that consistently yields high detection accuracy.
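The representation this proposition leads to, a joint domain-range density over background samples, can be sketched roughly as follows. This is illustrative only: the kernel form, bandwidth selection, and sample management actually used in the paper are not given in this excerpt, and all names here are placeholders.

```python
import numpy as np

def background_likelihood(pixel, samples, bandwidths):
    """Kernel density estimate of P(pixel | background) in the joint
    domain-range space. `pixel` is a 5-vector (x, y, r, g, b), `samples`
    is an (n, 5) array of background observations, and `bandwidths` holds
    per-dimension Gaussian kernel widths (illustrative values)."""
    d = (pixel - samples) / bandwidths          # normalized distances, shape (n, 5)
    k = np.exp(-0.5 * np.sum(d * d, axis=1))    # product of 1-D Gaussian kernels
    norm = np.prod(bandwidths) * (2 * np.pi) ** 2.5
    return k.sum() / (len(samples) * norm)
```

Because location (x, y) enters the density alongside color, a sample observed at a neighboring pixel still contributes probability mass here, which is how spatial uncertainty such as camera jitter is absorbed by the model.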
In addition, we propose a new constraint for object detection and demonstrate signif-
icant improvements in detection. The central criterion that is traditionally exploited for
detecting moving objects is background difference, some examples being [17], [39], [26]

Fig. 1. Various sources of dynamic behavior. The flow vectors represent the motion in the scene. (a)
The lake-side water ripples and shimmers. (b) The fountain, like the lake-side water, is a temporal texture
and does not have exactly repeating motion. (c) A strong breeze can cause nominal motion (camera jitter)
of up to 25 pixels between consecutive frames.
and [33]. When an object enters the field of view it partially occludes the background
and can be detected through background differencing approaches if its appearance differs
from the portion of the background it occludes. Sometimes, however, during the course
of an object’s journey across the field of view, some colors may be similar to those of
the background, and in such cases detection using background differencing approaches
fails. To address this limitation and to improve detection in general, a new criterion called
temporal persistence is proposed here and exploited in conjunction with background differ-
ence for accurate detection. True foreground objects, as opposed to spurious noise, tend
to maintain consistent colors and remain in the same spatial area (i.e., frame-to-frame
color transformation and motion are small). Thus, foreground information from the frame
incident at time t contains substantial evidence for the detection of foreground objects
at time t + 1. In this paper, this fact is exploited by maintaining both background and
foreground models to be used competitively for object detection in stationary cameras,
without explicit tracking.
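A minimal sketch of this idea follows. The buffer policy, function names, and the threshold are illustrative assumptions, not the paper's actual decision rule (which appears later in the paper, outside this excerpt): detections at time t seed a foreground sample set used at t + 1, and the two models compete via a likelihood ratio.

```python
import numpy as np

def update_foreground_samples(fg_samples, detections, max_samples=1000):
    """Temporal persistence: pixels detected as foreground at time t are
    appended to the foreground sample set used for detection at t + 1.
    `detections` is an (m, 5) array of (x, y, r, g, b) vectors from the
    current frame; a bounded buffer (illustrative policy) keeps the model
    recent."""
    fg_samples = np.vstack([fg_samples, detections]) if len(fg_samples) else detections
    return fg_samples[-max_samples:]            # keep only the most recent samples

def likelihood_ratio_label(p_fg, p_bg, gamma=1.0):
    """Competitive decision: label a pixel foreground when its foreground
    likelihood exceeds gamma times its background likelihood (gamma is an
    illustrative threshold)."""
    return p_fg > gamma * p_bg
```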

Finally, once pixel-wise probabilities are obtained for belonging to the background,
decisions are usually made by direct thresholding. Instead, we assert that spatial context
is an important constraint when making decisions about a pixel label, i.e. a pixel’s label
is not independent of the pixel’s neighborhood labels (this can be justified on Bayesian
grounds using Markov Random Fields [11], [23]). We introduce a MAP-MRF framework
that competitively uses both the background and the foreground models to make decisions
based on spatial context. We demonstrate that the maximum a posteriori solution can
be efficiently computed by finding the minimum cut of a capacitated graph, to make an
optimal inference based on neighborhood information at each pixel.
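The graph construction behind this step can be sketched as follows. This toy version uses a plain Edmonds-Karp max-flow, whereas the paper relies on efficient min-cut algorithms for capacitated graphs; the grid layout, parameter names, and the Ising smoothness term are illustrative assumptions.

```python
from collections import deque

def map_mrf_segmentation(ratios, width, height, smoothness=1.0):
    """MAP-MRF binary labeling by minimum cut. `ratios[i]` is the
    log-likelihood ratio log(P_fg / P_bg) of pixel i on a width x height
    grid; `smoothness` is an Ising pairwise penalty encouraging neighbors
    to share a label. Returns a list of booleans (True = foreground)."""
    n = width * height
    s, t = n, n + 1                             # source and sink terminals
    cap = [dict() for _ in range(n + 2)]        # residual capacities

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)               # ensure residual edge exists

    for i, r in enumerate(ratios):              # unary (data) terms
        if r > 0:
            add_edge(s, i, r)                   # cutting this = penalty for bg label
        elif r < 0:
            add_edge(i, t, -r)                  # cutting this = penalty for fg label
    for y in range(height):                     # pairwise (smoothness) terms
        for x in range(width):
            i = y * width + x
            if x + 1 < width:
                add_edge(i, i + 1, smoothness); add_edge(i + 1, i, smoothness)
            if y + 1 < height:
                add_edge(i, i + width, smoothness); add_edge(i + width, i, smoothness)

    while True:                                 # Edmonds-Karp: BFS augmenting paths
        parent = {s: s}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        bottleneck, v = float('inf'), t
        while v != s:
            u = parent[v]; bottleneck = min(bottleneck, cap[u][v]); v = u
        v = t
        while v != s:
            u = parent[v]; cap[u][v] -= bottleneck; cap[v][u] += bottleneck; v = u

    reachable = {s}                             # source side of the minimum cut
    q = deque([s])
    while q:
        u = q.popleft()
        for v, c in cap[u].items():
            if c > 1e-12 and v not in reachable:
                reachable.add(v); q.append(v)
    return [i in reachable for i in range(n)]
```

Note how the smoothness term does the work spatial context demands: a single weakly background-looking pixel surrounded by strong foreground evidence ends up labeled foreground, because cutting its pairwise edges costs more than cutting its weak data term.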
The rest of the paper is organized as follows. Section I-A reviews related work in the
field and discusses the proposed approach in the context of previous work. A description
of the proposed approach is presented in Section I-B. Section II discusses modeling
spatial uncertainty (Section II-A) and the use of the foreground model for object
detection (Section II-B), and then describes the overall MAP-MRF framework together
with an algorithmic description of the proposed approach (Section II-C). Qualitative
and quantitative experimental results are shown
in Section III, followed by conclusions in Section IV.
A. Previous Work
Since the late 1970s, differencing of adjacent frames in a video sequence has been used for
object detection in stationary cameras, [17]. However, it was realized that straightforward
background subtraction was unsuited to surveillance of real-world situations and statistical
techniques were introduced to model the uncertainties of background pixel colors. In
the context of this work, these background modeling methods can be classified into two
categories: (1) Methods that employ local (pixel-wise) models of intensity and (2) Methods

that have regional models of intensity.
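The baseline these statistical techniques improved upon, straightforward differencing of adjacent frames, amounts to thresholding the absolute intensity change at each pixel. A minimal illustration (the threshold value is illustrative):

```python
import numpy as np

def frame_difference_mask(prev, curr, tau=25):
    """Baseline change detection: mark a pixel as moving when its intensity
    changes by more than `tau` between consecutive frames. `tau` is an
    illustrative threshold; real surveillance scenes need the statistical
    models discussed in the text."""
    return np.abs(curr.astype(int) - prev.astype(int)) > tau
```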
Most background modeling approaches tend to fall into the first category of pixel-wise
models. Early approaches operated on the premise that the color of a pixel over time in a
static scene could be modeled by a single Gaussian distribution, N(µ, Σ). In their seminal
work, Wren et al. [39] modeled the color of each pixel, I(x, y), with a single three-dimensional
Gaussian, I(x, y) ∼ N(µ(x, y), Σ(x, y)). The mean µ(x, y) and the covariance Σ(x, y),
were learned from color observations in consecutive frames. Once the pixel-wise back-
ground model was derived, the likelihood of each incident pixel color could be computed
and labelled as belonging to the background or not. Similar approaches that used Kalman
filtering for updating were proposed in [20] and [21]. A robust detection algorithm was
also proposed in [14]. While these methods were among the first to principally model the
uncertainty of each pixel color, it was quickly found that the single Gaussian pdf was ill-
suited to most outdoor situations, since repetitive object motion, shadows or reflectance
often caused multiple pixel colors to belong to the background at each pixel. To address
some of these issues, Friedman and Russell, and independently Stauffer and Grimson,
[9], [33] proposed modeling each pixel intensity as a mixture of Gaussians, instead, to
account for the multi-modality of the ‘underlying’ likelihood function of the background
color. An incident pixel was compared to every Gaussian density in the pixel's model;
if a match (defined by a threshold) was found, the mean and variance of the matched
Gaussian density were updated; otherwise, a new Gaussian density with mean equal
to the current pixel color and some initial variance was introduced into the mixture. Thus,
each pixel was classified depending on whether the matched distribution represented the
background process. While the use of Gaussian mixture models was tested extensively, it
did not explicitly model the spatial dependencies of neighboring pixel colors that may be
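A single update step of this mixture-of-Gaussians scheme can be sketched as follows, simplified to grayscale and a single pixel. The parameter values, learning rate, and replacement policy are illustrative in the style of [9], [33], not a reproduction of either method.

```python
import numpy as np

def update_pixel_mixture(modes, value, match_sigmas=2.5, alpha=0.05, init_var=100.0):
    """One mixture-of-Gaussians update for a single pixel. `modes` is a list
    of [weight, mean, var] components; the first component matched (within
    `match_sigmas` standard deviations of the observation) is adapted, and
    if none matches, a new component replaces the weakest. Returns
    (modes, index_of_matched_or_new_component)."""
    for k, (w, mu, var) in enumerate(modes):
        if abs(value - mu) <= match_sigmas * np.sqrt(var):
            # adapt the matched Gaussian toward the new observation
            modes[k][1] = (1 - alpha) * mu + alpha * value
            modes[k][2] = (1 - alpha) * var + alpha * (value - mu) ** 2
            for j in range(len(modes)):         # re-weight all components
                modes[j][0] = (1 - alpha) * modes[j][0] + (alpha if j == k else 0.0)
            return modes, k
    # no match: replace the lowest-weight component with a new Gaussian
    weakest = min(range(len(modes)), key=lambda j: modes[j][0])
    modes[weakest] = [alpha, float(value), init_var]
    return modes, weakest
```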

References
S. Geman and D. Geman, “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images,” IEEE Trans. Pattern Analysis and Machine Intelligence, 1984.
D. Comaniciu and P. Meer, “Mean Shift: A Robust Approach Toward Feature Space Analysis,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2002.
K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, 1990.
E. Parzen, “On Estimation of a Probability Density Function and Mode,” Annals of Mathematical Statistics, 1962.
Y. Boykov, O. Veksler, and R. Zabih, “Fast Approximate Energy Minimization via Graph Cuts,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2001.