
Robust Visual Tracking by Integrating Multiple Cues
based on Co-inference Learning
YING WU
Department of Electrical & Computer Engineering, Northwestern University,
2145 Sheridan Road, Evanston, IL 60208
yingwu@ece.northwestern.edu
THOMAS S. HUANG
Beckman Institute, University of Illinois at Urbana-Champaign,
405 N. Mathews, Urbana, IL 61801
huang@ifp.uiuc.edu
Abstract. Visual tracking can be treated as a parameter estimation problem that infers target states based on
image observations from video sequences. A richer target representation would improve the chances of
successful tracking in cluttered and dynamic environments, and thus enhance the robustness. Richer representations can be
constructed by either specifying a detailed model of a single cue or combining a set of rough models of multiple
cues. Both approaches increase the dimensionality of the state space, which results in a dramatic increase of
computation. To investigate the integration of rough models from multiple cues and to explore computationally
efficient algorithms, this paper formulates the problem of multiple cue integration and tracking in a probabilistic
framework based on a factorized graphical model. Structured variational analysis of such a graphical model factorizes
different modalities and suggests a co-inference process among these modalities. Based on the importance sampling
technique, a sequential Monte Carlo algorithm is proposed to provide an efficient simulation and approximation of
the co-inferencing of multiple cues. This algorithm runs in real-time at around 30Hz. Our extensive experiments
show that the proposed algorithm performs robustly in a large variety of tracking scenarios. The approach presented
in this paper has the potential to solve other problems including sensor fusion problems.
Keywords: Visual tracking, sequential Monte Carlo, importance sampling, co-inference, factorized graphical model,
variational analysis
1. Introduction
With the rapid enhancement of computational power provided by computer hardware, computers
are increasingly able to afford some visual capabilities. Recent years have witnessed a rapid
development of research and applications in visual surveillance and vision-based interfaces, in
which visual tracking plays an important role. Aiming at developing more natural and non-invasive
human computer interfaces and recognizing human actions visually, tremendous research efforts
have been devoted to visual tracking and analysis of human movements (Gavrila, 1999; Pavlović
et al., 1997; Wu and Huang, 2001a).
One of the purposes of visual tracking is to infer the states of the targets from image sequences.
Besides 2D positions, visual tracking is also expected to recover other states, such as poses,
articulations or deformations, depending on different applications. Although the tracking problem
is well formulated in the research of control theory and signal processing, visual tracking involves
many fundamental research problems in object representations, image analysis and matching.
Since the target states are hidden and can only be inferred from observable visual features, two
difficulties confront visual tracking: evaluating state hypotheses against the observed image evidence
and searching the state space.
Bottom-up and top-down approaches are two kinds of methodologies for the visual tracking
problem. Bottom-up approaches generally tend to reconstruct the target states by analyzing the
image contents, for example, by fitting a curve to recover a parametric shape. In contrast, top-down
approaches generate and evaluate a set of state hypotheses based on target models. Tracking
© 2003 Kluwer Academic Publishers. Printed in the Netherlands.

is achieved by evaluating and verifying these hypotheses on image observations. Certainly, these
two approaches could be combined.
Bottom-up methods might be computationally efficient, yet the robustness largely depends
on the ability of image analysis, because grouping, tracing and fitting image pixels could be
overwhelmed by image clutters and noise. On the other hand, top-down approaches depend
less on image analysis, because the target hypotheses serve as strong constraints for analyzing
images. But the performance of the top-down approaches is largely determined by the methods of
generating and verifying hypotheses. To achieve robust tracking, a large number of hypotheses may
be maintained so that more computation would be involved for evaluating them. The combination
of these two methodologies could keep the robustness but reduce the computation.
Visual tracking techniques generally have four elements: the target representation, the observation
representation, the hypothesis generation, and the hypothesis evaluation, which roughly
characterize tracking performance and limitations.
To discriminate the target from other objects, the target representation, denoted by X, which
could include different modalities such as shape, color, appearance, and motion, characterizes
the target in a state space either explicitly or implicitly. Although finding the representations
for targets is a fundamental problem in computer vision, visual tracking research generally em-
ploys concise representations to facilitate computational efficiency. For example, parameterized
shapes (Isard and Blake, 1996; Isard and Blake, 1998b), and color distributions (Comaniciu et al.,
2000; Raja et al., 1998; Toyama et al., 1999; Wu and Huang, 2000) are often used as target
representations. To provide a more constrained description of the target, some methods employ
both shape and color (Azoz et al., 1998; Birchfield, 1998; Isard and Blake, 1998b; Rasmussen and
Hager, 1998; Wren et al., 1997; Toyama and Wu, 2000). Obviously, a unique characterization of
the target would be quite helpful to visual tracking, but it will involve high dimensionality. To add
uniqueness to the target representation, many methods even employ image appearances, such as
image templates (Hager and Belhumeur, 1996; Li and Chellapa, 2000; Tao et al., 2000) or eigen-
space representation (Black and Jepson, 1996), as target representations. For example, if you know
a person, it would be a bit easier to track this person in a crowd. In addition, motion can also be
taken into account in target representations, since different objects can be discriminated by the
differences of their motions. On the other hand, if two objects share the same representation, it
would be difficult to correctly track either of them when they are close in the state space, if there
is no prior from the dynamics of the targets’ movements.
Closely related to the target representation is the observation representation, denoted by Z,
which defines the image evidence, i.e., the image features observed. For example, if the target is
represented by its contour shape, the corresponding image edges are expected to be observed in
images. If the target is characterized by its color appearances, certain color distribution patterns
in images can be used as the observations of the target.
The hypotheses evaluation calculates the matching between state hypotheses and image obser-
vations. We need to measure the likelihood of the image observations given state hypotheses, i.e.,
p(Z|X), so as to infer the posterior p(X|Z) of a hypothesis given a certain image observation in the
MAP estimation framework. However, the evaluation can be quite challenging, for instance when
evaluating a shape hypothesis on a cluttered image. Although some analytical results were reported
in (Blake and Isard, 1998), many current tracking methods take ad hoc measurements.
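To make the evaluation step concrete, here is a minimal sketch of scoring a contour hypothesis against detected edge points; the Gaussian edge-noise model is our own toy assumption, not the paper's measurement model:

```python
import numpy as np

def shape_likelihood(hypothesis_pts, edge_pts, sigma=2.0):
    """Toy observation likelihood p(Z|X): for each point on the
    hypothesized contour, find the nearest detected edge point and
    score the residual under an assumed Gaussian noise model."""
    d2 = ((hypothesis_pts[:, None, :] - edge_pts[None, :, :]) ** 2).sum(-1)
    nearest = np.sqrt(d2.min(axis=1))          # distance to closest edge
    return float(np.exp(-0.5 * (nearest / sigma) ** 2).prod())

# a hypothesis lying exactly on the edges scores higher than one offset from them
edges = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
good = shape_likelihood(edges.copy(), edges)
bad = shape_likelihood(edges + np.array([0.0, 3.0]), edges)
assert good > bad
```

Real trackers measure edges along normals of the hypothesized contour; the nearest-point search here is only the simplest stand-in.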
The hypothesis generation, denoted by p(X_t | X_{t-1}), produces new state hypotheses based on old
estimates of the target state, implying the evolution of the target’s dynamic processes. Target’s
dynamics can be embedded in such a predicting process. At a certain time instant, the target
state is a random vector. The a posteriori probability distribution of the target state given the
observation history changes with time. Therefore, the tracking problem can be viewed as a problem
of propagating conditional probability densities. The target state at a certain time instant can
be estimated by computing its conditional probability density. The Kalman filtering
technique gives a classic example of hypotheses generation under Gaussian assumptions, due to
which the densities are characterized and parameterized by their means and covariances. Thus,
the hypothesis generation characterizes the search range and confidence level of the tracking. On
the other hand, if the Gaussian assumption does not hold, which is very likely in image clutters, we
can represent the posterior densities in non-parametric forms. In this circumstance, the hypotheses
generation can be viewed as an evolution process of a set of hypotheses, or state samples (particles),
which facilitates a Monte Carlo approach for tracking. The Condensation (Blake and Isard, 1998)
algorithm is one such example.
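A minimal sketch of one such sample-based step, in the spirit of Condensation (resample by weight, predict through the dynamics, reweight on the observation); the 1-D state and the Gaussian dynamics/likelihood are toy assumptions, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)

def condensation_step(particles, weights, dynamics, likelihood):
    """One factored-sampling step: resample according to weights,
    propagate through the dynamic model p(X_t|X_{t-1}), then reweight
    each hypothesis on the new observation via p(Z_t|X_t)."""
    n = len(particles)
    idx = rng.choice(n, size=n, p=weights)        # select by weight
    predicted = dynamics(particles[idx])          # predict new hypotheses
    w = likelihood(predicted)                     # evaluate on observation
    return predicted, w / w.sum()

# toy 1-D example: constant-position dynamics with noise; observation
# likelihood peaked at a hypothetical true state 5.0
particles = rng.uniform(0.0, 10.0, size=500)
weights = np.full(500, 1.0 / 500)
dyn = lambda x: x + rng.normal(0.0, 0.3, size=x.shape)
lik = lambda x: np.exp(-0.5 * ((x - 5.0) / 1.0) ** 2)
for _ in range(10):
    particles, weights = condensation_step(particles, weights, dyn, lik)
est = float((particles * weights).sum())
assert abs(est - 5.0) < 1.0
```

The weighted particle set is a non-parametric representation of the posterior density, which is exactly what the Gaussian assumption of Kalman filtering cannot provide in clutter.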
Using a rough target model would not be robust. For example, if we use an ellipse to model the
head, the visual tracking might be unstable in a cluttered environment, e.g., when the head moves
in front of a book shelf, since the false image edges incurred by the clutter would likely distract the
tracker. Thus, to enhance the robustness, a richer target representation should be employed, since
it brings more uniqueness. There are two kinds of ideas. One approach is to construct and use a
detailed target model. For example, a B-spline shape model can be used to accurately describe a
contour, based on which excellent contour tracking results have been reported (Blake and Isard,
1998). The other approach is to combine several rough representations or models of different cues.
For example, we can simultaneously use a rough shape model and a rough color model to represent
the head. Would the combination of multiple rough models result in robust tracking? If
so, how should multiple cues and rough models be integrated? How do these different cue models interact
with each other? This paper will try to answer these questions.
This paper formulates the problem of integrating multiple cues for robust tracking as the
probabilistic inference problem of a factorized graphical model. To analyze this complex graphical
model, a variational method is taken to approximate the Bayesian inference. Interestingly, the
analysis reveals a co-inference phenomenon of multiple modalities, which illustrates the interac-
tions among different cues, i.e., one cue can be inferred iteratively by the other cues. Based on this,
this paper presents an efficient sequential Monte Carlo tracking algorithm to integrate multiple
visual cues, in which the co-inference of different modalities is approximated.
In Section 2, we will give a brief overview of the research of multiple cue integration in the
context of robust tracking. Then the factorized graphical model used in our tracking formulation
will be presented in Section 3; the co-inference phenomenon will be analyzed and explained in this
section as well. Section 4 will describe different techniques in sequential Monte Carlo approaches
for tracking problems. Based on importance sampling techniques, our proposed approach of the
co-inference will be presented in Section 5, and the details of our tracking implementation and
experiments will be described in Section 7. Section 8 will conclude the paper by discussing the
proposed approach and pointing out some possible directions for future investigations.
2. Multiple Cue Integration
We often notice that a state hypothesis from a richer target representation, either a detailed model or
a combination of multiple rough models, has a better chance of being correctly verified against various
image observations. For example, combining the color appearance of the target can largely enhance
the robustness of contour tracking against heavily cluttered backgrounds, and integrating shape
and color representations could result in better tracking performance against color distracters.
In addition to effective hypothesis evaluation, integrating multiple cues would reduce the tracker's
dependency on accurate dynamic models of the targets. Dynamic models play an important role
in tracking since they provide predictions to reduce search and computations. In many cases,
the parameters of dynamic models are specified in advance, or learned by training sequences.
However, if the parameters are not properly set, the tracker would be under large risks of failure.
It is desirable to develop robust trackers that work with weak dynamic models.
Interestingly, the integration and interaction of multiple cues for tracking would not require
accurate dynamic models. Here is the intuition. Suppose we represent the target by two modalities:
shape and color appearance. The two modalities have their own dynamic models, which means
that the target is deformable, and the lighting could change, but it is difficult to know in advance
about how the shape will deform and how the lighting will change. Therefore, we can only assume
very rough dynamic models as approximations. However, such rough dynamic models will
sometimes be violated, so that predictions based on them could deviate largely
and fail the tracker if the two modalities are treated separately. Our main idea is to let the
two modalities interact with each other. For example, if the shape changes very little but the
lighting changes a lot, the estimation of the color appearance can be fulfilled by taking advantage
of shape estimates, instead of relying on the predictions from the dynamics of color appearances.
Symmetrically, if the lighting changes very slowly but the target deforms a lot, the deformation
can be more robustly localized with the help of the color appearance estimates, instead of taking a
strong prediction prior from the deformation dynamics. Naturally, we shall ask: what if the changes
of both deformation and lighting are quite large? The problem becomes essentially untrackable
if no accurate dynamic model is available. Fortunately, we can still approach it by taking
the estimate that maximizes the joint probability of both modalities, and finding the most likely
state estimates. A detailed formulation and analysis will be given later in this paper.
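The intuition above can be sketched as alternating estimation between the two modalities. The following toy example is our own simplification, not the paper's algorithm: scalar "shape" and "color" states and an additive observation model are hypothetical choices, used only to show each cue borrowing the other's estimate in place of a strong dynamic prior:

```python
import numpy as np

rng = np.random.default_rng(1)

def co_inference_sweep(s, c, z):
    """One schematic co-inference sweep: the observation z depends on
    both modalities (here, toy model z = shape + color), so each
    particle set is reweighted with the other modality's current
    estimate held fixed, then resampled with a little diffusion."""
    n = len(s)
    w = np.exp(-0.5 * (z - (s + c.mean())) ** 2)   # score shape given color estimate
    s = rng.choice(s, size=n, p=w / w.sum()) + rng.normal(0.0, 0.05, n)
    w = np.exp(-0.5 * (z - (s.mean() + c)) ** 2)   # score color given shape estimate
    c = rng.choice(c, size=n, p=w / w.sum()) + rng.normal(0.0, 0.05, n)
    return s, c

# hypothetical ground truth: shape parameter 2.0 and color parameter 3.0,
# observed only through their sum z = 5.0
s = rng.uniform(0.0, 4.0, 400)
c = rng.uniform(1.0, 5.0, 400)
for _ in range(5):
    s, c = co_inference_sweep(s, c, z=5.0)
assert abs(s.mean() + c.mean() - 5.0) < 0.5
```

Note that neither modality relies on its own dynamic prediction here; each is localized through the other's estimate, which is the interaction the co-inference analysis later formalizes.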
Multiple cue integration can be done at the level of both observation representation and object
model. At the observation level, some approaches combine the measurements of the multiple
modalities for each hypothesis (Azoz et al., 1998; Birchfield, 1998). Although robust to some
extent, the methods of combining the likelihood measurements of different sources are often ad
hoc. In addition, to integrate shape and color, many tracking algorithms assume fixed target
color appearances (Azoz et al., 1998; Darrell et al., 1998; Isard and Blake, 1998b; Toyama and
Wu, 2000) to enable efficient color segmentation. However, such an assumption is often invalid
in practice. Instead of assuming a fixed color representation, non-stationary color tracking meth-
ods (Raja et al., 1998; Wu and Huang, 2000) adapt the color changes and update the color
models. At the object representation level, some methods also include the color modality in the
target representation (Bregler, 1997; Rasmussen and Hager, 1998; Wren et al., 1997), in which
a multivariate Gaussian can be used to represent both color and motion parameters. Obviously,
tracking both shape and color simultaneously would be a formidable problem, since it increases the
dimensionality of the state space. In addition, the interaction of multiple modalities is interesting
and important for the robustness. A switching scheme can be used to coordinate different trackers
that track different modalities (Toyama and Hager, 1996; Toyama and Wu, 2000). Generally,
different modalities are updated sequentially in these methods. However, a more profound and
systematic investigation of the interaction of multiple modalities is desirable. This paper tries to
investigate the relationship among different modalities for robust visual tracking, and to identify
an efficient way to facilitate simultaneous tracking of these modalities.
3. Graphical Models for Tracking
In this section, we formulate the visual tracking problem in a probabilistic framework. The
integration of multiple cues is characterized by a factorized graphical model, and we use the
variational analysis approach to approximate the probabilistic inference.
Following the notations of Isard and Blake (Isard and Blake, 1996; Blake and Isard, 1998), we
denote the target states and image observations by X_t and Z_t, respectively. The histories of the
states and measurements are denoted by X_t = (X_1, . . . , X_t) and Z_t = (Z_1, . . . , Z_t). The tracking
problem can be formulated as an inference problem with the prediction prior p(X_{t+1} | Z_t). We have
p(X_{t+1} | Z_{t+1}) ∝ p(Z_{t+1} | X_{t+1}) p(X_{t+1} | Z_t),

p(X_{t+1} | Z_t) = ∫ p(X_{t+1} | X_t) p(X_t | Z_t) dX_t,

where p(Z_t | X_t) represents the measurement or observation likelihood, and p(X_{t+1} | X_t) is the
dynamic model.
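On a discretized state space the prediction and correction equations above can be computed exactly. A brief sketch, where the three-state model and its numbers are toy assumptions:

```python
import numpy as np

def bayes_recursion(prior, transition, likelihood):
    """Exact Bayesian filtering recursion on a discretized state space:
    prediction:  p(X_{t+1} | Z_t) = sum_x p(X_{t+1} | x) p(x | Z_t)
    correction:  p(X_{t+1} | Z_{t+1}) proportional to
                 p(Z_{t+1} | X_{t+1}) p(X_{t+1} | Z_t)."""
    predicted = transition.T @ prior          # integrate out X_t
    posterior = likelihood * predicted        # multiply in the evidence
    return posterior / posterior.sum()        # normalize

# 3-state toy: a mostly-staying target, currently observed in state 2
prior = np.array([1/3, 1/3, 1/3])
transition = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.1, 0.1, 0.8]])      # rows: p(next | current)
likelihood = np.array([0.1, 0.1, 0.8])        # p(Z | X) for each state
post = bayes_recursion(prior, transition, likelihood)
assert post.argmax() == 2
```

For continuous, high-dimensional states this sum becomes an intractable integral, which is why the sequential Monte Carlo approximation described later is needed.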
The probabilistic formulation of the tracking problem can be represented by the graphical
model shown in Figure 1, where the X nodes are hidden states and the Z nodes are observations.
This is similar to the Hidden Markov Model (Rabiner, 1989). At time t, the observation Z_t is
independent of the previous states and previous observations given the current state X_t, i.e.,
p(Z_t | X_t, Z_1, . . . , Z_{t-1}) = p(Z_t | X_t), and the states have a first-order Markov property, i.e.,
p(X_t | X_1, . . . , X_{t-1}) = p(X_t | X_{t-1}).
Figure 1. The tracking problem can be represented by a graphical model, similar to the Hidden Markov Model.
Based on this graphical model, the tracking problem can be approached by probabilistic inference
techniques. However, when the dimensionality of the hidden states increases, inference
and learning become difficult due to the dramatic increase of computation. Fortunately, a
distributed state representation based on factorized graphical models can largely ease this difficulty
by decoupling the dynamics of different subsets of hidden states. Combining a set of rough models
for different cues can be formulated by such factorized graphical models. For example, the target states
can be decomposed into shape states and color states, as shown in Figure 2(a). In addition, the
observation can also be separated into Z^s_t and Z^c_t for shape and color, respectively, as in Figure 2(b).
Each observation depends on both the color and shape states.
Figure 2. Factorized Graphical Models: (a) The states of the target can be decomposed into shape states X^s_t and color states X^c_t in a factorized graphical model. (b) The observation can also be separated into Z^s_t and Z^c_t.
Due to the complex structure of the densely connected factorized network in Figure 2, exact
inference would be formidable. One approach to this problem is based on statistical sampling
methods, such as Gibbs sampling. Another approach is analytical, through probabilistic
variational analysis. The basic idea is to approximate the posterior probability p(X_t | Z_t) of the
hidden states by a tractable distribution Q(X_t) which has good analytical properties. The optimal
model parameters, as well as the variational parameters, are found by minimizing the
discrepancy between these two distributions. A lower bound on the log likelihood log P(Z_t) can
References

Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training.
Isard, M. and Blake, A. (1998). Condensation: Conditional density propagation for visual tracking.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition.
Swain, M. J. and Ballard, D. H. (1991). Color indexing.