Bayesian Modelling of Dynamic Scenes for Object
Detection
Yaser Sheikh and Mubarak Shah
Abstract
Accurate detection of moving objects is an important precursor to stable tracking or recognition. In
this paper, we present an object detection scheme that has three innovations over existing approaches.
Firstly, the model of the intensities of image pixels as independent random variables is challenged and
it is asserted that useful correlation exists in intensities of spatially proximal pixels. This correlation is
exploited to sustain high levels of detection accuracy in the presence of dynamic backgrounds. By using
a non-parametric density estimation method over a joint domain-range representation of image pixels,
multi-modal spatial uncertainties and complex dependencies between the domain (location) and range
(color) are directly modeled. We propose a model of the background as a single probability density.
Secondly, temporal persistence is proposed as a detection criterion. Unlike previous approaches to object
detection which detect objects by building adaptive models of the background, the foreground is modeled
to augment the detection of objects (without explicit tracking), since objects detected in the preceding
frame contain substantial evidence for detection in the current frame. Finally, the background and
foreground models are used competitively in a MAP-MRF decision framework, stressing spatial context
as a condition of detecting interesting objects and the posterior function is maximized efficiently by
finding the minimum cut of a capacitated graph. Experimental validation of the proposed method is
performed and presented on a diverse set of dynamic scenes.
Keywords
Object Detection, Kernel Density Estimation, Joint Domain Range, MAP-MRF Estimation.
I. Introduction
Automated surveillance systems typically use stationary sensors to monitor an envi-
ronment of interest. The assumption that the sensor remains stationary between the

incidence of each video frame allows the use of statistical background modeling tech-
niques for the detection of moving objects such as [39], [33] and [7]. Since ‘interesting’
objects in a scene are usually defined to be moving ones, such object detection provides
a reliable foundation for other surveillance tasks like tracking ([14], [16], [5]) and is often
also an important prerequisite for action or object recognition. However, the assumption
of a stationary sensor does not necessarily imply a stationary background. Examples of
‘nonstationary’ background motion abound in the real world, including periodic motions,
such as ceiling fans, pendulums, or escalators, and dynamic textures, such as fountains,
swaying trees or ocean ripples (shown in Figure 1). Furthermore, the assumption that
the sensor remains stationary is often nominally violated by common phenomena such as
wind or ground vibrations and to a larger degree by (stationary) hand-held cameras. If
natural scenes are to be modeled it is essential that object detection algorithms operate
reliably in such circumstances. Background modeling techniques have also been used for
foreground detection in pan-tilt-zoom cameras, [37]. Since the focal point does not change
when a camera pans or tilts, planar-projective motion compensation can be performed to
create a background mosaic model. Often, however, due to independently moving objects
motion compensation may not be exact, and background modeling approaches that do not
take such nominal misalignment into account usually perform poorly. Thus, a principal
proposition in this work is that modeling spatial uncertainties is important for real world
deployment, and we provide an intuitive and novel representation of the scene background
that consistently yields high detection accuracy.
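The representation this proposition leads to, a joint domain-range density over background samples, can be sketched roughly as follows. This is illustrative only: the kernel form, bandwidth selection, and sample management actually used in the paper are not given in this excerpt, and all names here are placeholders.

```python
import numpy as np

def background_likelihood(pixel, samples, bandwidths):
    """Kernel density estimate of P(pixel | background) in the joint
    domain-range space. `pixel` is a 5-vector (x, y, r, g, b), `samples`
    is an (n, 5) array of background observations, and `bandwidths` holds
    per-dimension Gaussian kernel widths (illustrative values)."""
    d = (pixel - samples) / bandwidths          # normalized distances, shape (n, 5)
    k = np.exp(-0.5 * np.sum(d * d, axis=1))    # product of 1-D Gaussian kernels
    norm = np.prod(bandwidths) * (2 * np.pi) ** 2.5
    return k.sum() / (len(samples) * norm)
```

Because location (x, y) enters the density alongside color, a sample observed at a neighboring pixel still contributes probability mass here, which is how spatial uncertainty such as camera jitter is absorbed by the model.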
In addition, we propose a new constraint for object detection and demonstrate signif-
icant improvements in detection. The central criterion that is traditionally exploited for
detecting moving objects is background difference, some examples being [17], [39], [26]

Fig. 1. Various sources of dynamic behavior. The flow vectors represent the motion in the scene. (a)
The lake-side water ripples and shimmers. (b) The fountain, like the lake-side water, is a temporal texture
and does not have exactly repeating motion. (c) A strong breeze can cause nominal motion (camera jitter)
of up to 25 pixels between consecutive frames.
and [33]. When an object enters the field of view it partially occludes the background
and can be detected through background differencing approaches if its appearance differs
from the portion of the background it occludes. Sometimes, however, during the course
of an object’s journey across the field of view, some colors may be similar to those of
the background, and in such cases detection using background differencing approaches
fails. To address this limitation and to improve detection in general, a new criterion called
temporal persistence is proposed here and exploited in conjunction with background differ-
ence for accurate detection. True foreground objects, as opposed to spurious noise, tend
to maintain consistent colors and remain in the same spatial area (i.e., frame-to-frame
color transformation and motion are small). Thus, foreground information from the frame
incident at time t contains substantial evidence for the detection of foreground objects
at time t + 1. In this paper, this fact is exploited by maintaining both background and
foreground models to be used competitively for object detection in stationary cameras,
without explicit tracking.
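A minimal sketch of this idea follows. The buffer policy, function names, and the threshold are illustrative assumptions, not the paper's actual decision rule (which appears later in the paper, outside this excerpt): detections at time t seed a foreground sample set used at t + 1, and the two models compete via a likelihood ratio.

```python
import numpy as np

def update_foreground_samples(fg_samples, detections, max_samples=1000):
    """Temporal persistence: pixels detected as foreground at time t are
    appended to the foreground sample set used for detection at t + 1.
    `detections` is an (m, 5) array of (x, y, r, g, b) vectors from the
    current frame; a bounded buffer (illustrative policy) keeps the model
    recent."""
    fg_samples = np.vstack([fg_samples, detections]) if len(fg_samples) else detections
    return fg_samples[-max_samples:]            # keep only the most recent samples

def likelihood_ratio_label(p_fg, p_bg, gamma=1.0):
    """Competitive decision: label a pixel foreground when its foreground
    likelihood exceeds gamma times its background likelihood (gamma is an
    illustrative threshold)."""
    return p_fg > gamma * p_bg
```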

Finally, once pixel-wise probabilities are obtained for belonging to the background,
decisions are usually made by direct thresholding. Instead, we assert that spatial context
is an important constraint when making decisions about a pixel label, i.e. a pixel’s label
is not independent of the pixel’s neighborhood labels (this can be justified on Bayesian
grounds using Markov Random Fields [11], [23]). We introduce a MAP-MRF framework
that competitively uses both the background and the foreground models to make decisions
based on spatial context. We demonstrate that the maximum a posteriori solution can
be efficiently computed by finding the minimum cut of a capacitated graph, to make an
optimal inference based on neighborhood information at each pixel.
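The graph construction behind this step can be sketched as follows. This toy version uses a plain Edmonds-Karp max-flow, whereas the paper relies on efficient min-cut algorithms for capacitated graphs; the grid layout, parameter names, and the Ising smoothness term are illustrative assumptions.

```python
from collections import deque

def map_mrf_segmentation(ratios, width, height, smoothness=1.0):
    """MAP-MRF binary labeling by minimum cut. `ratios[i]` is the
    log-likelihood ratio log(P_fg / P_bg) of pixel i on a width x height
    grid; `smoothness` is an Ising pairwise penalty encouraging neighbors
    to share a label. Returns a list of booleans (True = foreground)."""
    n = width * height
    s, t = n, n + 1                             # source and sink terminals
    cap = [dict() for _ in range(n + 2)]        # residual capacities

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)               # ensure residual edge exists

    for i, r in enumerate(ratios):              # unary (data) terms
        if r > 0:
            add_edge(s, i, r)                   # cutting this = penalty for bg label
        elif r < 0:
            add_edge(i, t, -r)                  # cutting this = penalty for fg label
    for y in range(height):                     # pairwise (smoothness) terms
        for x in range(width):
            i = y * width + x
            if x + 1 < width:
                add_edge(i, i + 1, smoothness); add_edge(i + 1, i, smoothness)
            if y + 1 < height:
                add_edge(i, i + width, smoothness); add_edge(i + width, i, smoothness)

    while True:                                 # Edmonds-Karp: BFS augmenting paths
        parent = {s: s}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        bottleneck, v = float('inf'), t
        while v != s:
            u = parent[v]; bottleneck = min(bottleneck, cap[u][v]); v = u
        v = t
        while v != s:
            u = parent[v]; cap[u][v] -= bottleneck; cap[v][u] += bottleneck; v = u

    reachable = {s}                             # source side of the minimum cut
    q = deque([s])
    while q:
        u = q.popleft()
        for v, c in cap[u].items():
            if c > 1e-12 and v not in reachable:
                reachable.add(v); q.append(v)
    return [i in reachable for i in range(n)]
```

Note how the smoothness term does the work spatial context demands: a single weakly background-looking pixel surrounded by strong foreground evidence ends up labeled foreground, because cutting its pairwise edges costs more than cutting its weak data term.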
The rest of the paper is organized as follows. Section I-A reviews related work in the
field and discusses the proposed approach in the context of previous work. A description
of the proposed approach is presented in Section I-B. Section II discusses modeling
spatial uncertainty (Section II-A) and the use of the foreground model for object
detection (Section II-B), and then describes the overall MAP-MRF framework together
with an algorithmic description of the proposed approach (Section II-C). Qualitative
and quantitative experimental results are shown
in Section III, followed by conclusions in Section IV.
A. Previous Work
Since the late 1970s, differencing of adjacent frames in a video sequence has been used for
object detection in stationary cameras, [17]. However, it was realized that straightforward
background subtraction was unsuited to surveillance of real-world situations and statistical
techniques were introduced to model the uncertainties of background pixel colors. In
the context of this work, these background modeling methods can be classified into two
categories: (1) Methods that employ local (pixel-wise) models of intensity and (2) Methods

that have regional models of intensity.
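The baseline these statistical techniques improved upon, straightforward differencing of adjacent frames, amounts to thresholding the absolute intensity change at each pixel. A minimal illustration (the threshold value is illustrative):

```python
import numpy as np

def frame_difference_mask(prev, curr, tau=25):
    """Baseline change detection: mark a pixel as moving when its intensity
    changes by more than `tau` between consecutive frames. `tau` is an
    illustrative threshold; real surveillance scenes need the statistical
    models discussed in the text."""
    return np.abs(curr.astype(int) - prev.astype(int)) > tau
```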
Most background modeling approaches tend to fall into the first category of pixel-wise
models. Early approaches operated on the premise that the color of a pixel over time in a
static scene could be modeled by a single Gaussian distribution, N(µ, Σ). In their seminal
work, Wren et al. [39] modeled the color of each pixel, I(x, y), with a single three-dimensional
Gaussian, I(x, y) ∼ N(µ(x, y), Σ(x, y)). The mean µ(x, y) and the covariance Σ(x, y),
were learned from color observations in consecutive frames. Once the pixel-wise back-
ground model was derived, the likelihood of each incident pixel color could be computed
and labelled as belonging to the background or not. Similar approaches that used Kalman
filtering for updating were proposed in [20] and [21]. A robust detection algorithm was
also proposed in [14]. While these methods were among the first to principally model the
uncertainty of each pixel color, it was quickly found that the single Gaussian pdf was ill-
suited to most outdoor situations, since repetitive object motion, shadows or reflectance
often caused multiple pixel colors to belong to the background at each pixel. To address
some of these issues, Friedman and Russell, and independently Stauffer and Grimson,
[9], [33] proposed modeling each pixel intensity as a mixture of Gaussians, instead, to
account for the multi-modality of the ‘underlying’ likelihood function of the background
color. An incident pixel was compared to every Gaussian density in the pixel's model;
if a match (defined by a threshold) was found, the mean and variance of the matched
Gaussian density were updated; otherwise, a new Gaussian density with mean equal
to the current pixel color and some initial variance was introduced into the mixture. Thus,
each pixel was classified depending on whether the matched distribution represented the
background process. While the use of Gaussian mixture models was tested extensively, it
did not explicitly model the spatial dependencies of neighboring pixel colors that may be
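A single update step of this mixture-of-Gaussians scheme can be sketched as follows, simplified to grayscale and a single pixel. The parameter values, learning rate, and replacement policy are illustrative in the style of [9], [33], not a reproduction of either method.

```python
import numpy as np

def update_pixel_mixture(modes, value, match_sigmas=2.5, alpha=0.05, init_var=100.0):
    """One mixture-of-Gaussians update for a single pixel. `modes` is a list
    of [weight, mean, var] components; the first component matched (within
    `match_sigmas` standard deviations of the observation) is adapted, and
    if none matches, a new component replaces the weakest. Returns
    (modes, index_of_matched_or_new_component)."""
    for k, (w, mu, var) in enumerate(modes):
        if abs(value - mu) <= match_sigmas * np.sqrt(var):
            # adapt the matched Gaussian toward the new observation
            modes[k][1] = (1 - alpha) * mu + alpha * value
            modes[k][2] = (1 - alpha) * var + alpha * (value - mu) ** 2
            for j in range(len(modes)):         # re-weight all components
                modes[j][0] = (1 - alpha) * modes[j][0] + (alpha if j == k else 0.0)
            return modes, k
    # no match: replace the lowest-weight component with a new Gaussian
    weakest = min(range(len(modes)), key=lambda j: modes[j][0])
    modes[weakest] = [alpha, float(value), init_var]
    return modes, weakest
```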

References
S. Geman and D. Geman, “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images,” IEEE Trans. Pattern Analysis and Machine Intelligence, 1984.
D. Comaniciu and P. Meer, “Mean Shift: A Robust Approach Toward Feature Space Analysis,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2002.
K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, 1990.
E. Parzen, “On Estimation of a Probability Density Function and Mode,” Annals of Mathematical Statistics, 1962.
Y. Boykov, O. Veksler, and R. Zabih, “Fast Approximate Energy Minimization via Graph Cuts,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2001.