
Robust Visual Tracking by Integrating Multiple Cues
based on Co-inference Learning
YING WU
Department of Electrical & Computer Engineering, Northwestern University,
2145 Sheridan Road, Evanston, IL 60208
yingwu@ece.northwestern.edu
THOMAS S. HUANG
Beckman Institute, University of Illinois at Urbana-Champaign,
405 N. Mathews, Urbana, IL 61801
huang@ifp.uiuc.edu
Abstract. Visual tracking can be treated as a parameter estimation problem that infers target states based on
image observations from video sequences. A richer target representation would improve the chances of
successful tracking in cluttered and dynamic environments, and thus enhance the robustness. Richer representations can be
constructed by either specifying a detailed model of a single cue or combining a set of rough models of multiple
cues. Both approaches increase the dimensionality of the state space, which results in a dramatic increase of
computation. To investigate the integration of rough models from multiple cues and to explore computationally
efficient algorithms, this paper formulates the problem of multiple cue integration and tracking in a probabilistic
framework based on a factorized graphical model. Structured variational analysis of such a graphical model factorizes
different modalities and suggests a co-inference process among these modalities. Based on the importance sampling
technique, a sequential Monte Carlo algorithm is proposed to provide an efficient simulation and approximation of
the co-inferencing of multiple cues. This algorithm runs in real-time at around 30Hz. Our extensive experiments
show that the proposed algorithm performs robustly in a large variety of tracking scenarios. The approach presented
in this paper has the potential to solve other problems including sensor fusion problems.
Keywords: Visual tracking, sequential Monte Carlo, importance sampling, co-inference, factorized graphical model,
variational analysis
1. Introduction
With the rapid enhancement of computational power provided by computer hardware, computers
are increasingly able to afford some visual capabilities. Recent years have witnessed a rapid
development of research and applications in visual surveillance and vision-based interfaces, in
which visual tracking plays an important role. Aiming at developing more natural and non-invasive
human computer interfaces and recognizing human actions visually, tremendous research efforts
have been devoted to visual tracking and analysis of human movements (Gavrila, 1999; Pavlović
et al., 1997; Wu and Huang, 2001a).
One of the purposes of visual tracking is to infer the states of the targets from image sequences.
Besides 2D positions, visual tracking is also expected to recover other states, such as poses,
articulations or deformations, depending on different applications. Although the tracking problem
is well formulated in the research of control theory and signal processing, visual tracking involves
many fundamental research problems in object representations, image analysis and matching.
Since the target states are hidden and can only be inferred from observable visual features, two
difficulties confront visual tracking: evaluating state hypotheses against the observed image evidence
and searching the state space.
Bottom-up and top-down approaches are two kinds of methodologies for the visual tracking
problem. Bottom-up approaches generally tend to reconstruct the target states by analyzing the
image contents, for example, by fitting a curve to recover a parametric shape. In contrast, top-down
approaches generate and evaluate a set of state hypotheses based on target models. Tracking
© 2003 Kluwer Academic Publishers. Printed in the Netherlands.

is achieved by evaluating and verifying these hypotheses on image observations. Certainly, these
two approaches could be combined.
Bottom-up methods might be computationally efficient, yet the robustness largely depends
on the ability of image analysis, because grouping, tracing and fitting image pixels could be
overwhelmed by image clutters and noise. On the other hand, top-down approaches depend
less on image analysis, because the target hypotheses serve as strong constraints for analyzing
images. But the performance of the top-down approaches is largely determined by the methods of
generating and verifying hypotheses. To achieve robust tracking, a large number of hypotheses may
be maintained so that more computation would be involved for evaluating them. The combination
of these two methodologies could keep the robustness but reduce the computation.
Visual tracking techniques generally have four elements: the target representation, the observation
representation, the hypothesis generation, and the hypothesis evaluation, which roughly
characterize tracking performance and limitations.
To discriminate the target from other objects, the target representation, denoted by X, which
could include different modalities such as shape, color, appearance, and motion, characterizes
the target in a state space either explicitly or implicitly. Although finding the representations
for targets is a fundamental problem in computer vision, visual tracking research generally em-
ploys concise representations to facilitate computational efficiency. For example, parameterized
shapes (Isard and Blake, 1996; Isard and Blake, 1998b), and color distributions (Comaniciu et al.,
2000; Raja et al., 1998; Toyama et al., 1999; Wu and Huang, 2000) are often used as target
representations. To provide a more constrained description of the target, some methods employ
both shape and color (Azoz et al., 1998; Birchfield, 1998; Isard and Blake, 1998b; Rasmussen and
Hager, 1998; Wren et al., 1997; Toyama and Wu, 2000). Obviously, a unique characterization of
the target would be quite helpful to visual tracking, but it will involve high dimensionality. To add
uniqueness to the target representation, many methods even employ image appearances, such as
image templates (Hager and Belhumeur, 1996; Li and Chellapa, 2000; Tao et al., 2000) or eigen-
space representation (Black and Jepson, 1996), as target representations. For example, if you know
a person, it would be a bit easier to track this person in a crowd. In addition, motion can also be
taken into account in target representations, since different objects can be discriminated by the
differences of their motions. On the other hand, if two objects share the same representation, it
would be difficult to correctly track either of them when they are close in the state space, if there
is no prior from the dynamics of the targets’ movements.
Closely related to the target representation is the observation representation, denoted by Z,
which defines the image evidence, i.e., the image features observed. For example, if the target is
represented by its contour shape, the corresponding image edges are expected to be observed in
images. If the target is characterized by its color appearances, certain color distribution patterns
in images can be used as the observations of the target.
The hypotheses evaluation calculates the matching between state hypotheses and image obser-
vations. We need to measure the likelihood of the image observations given state hypotheses, i.e.,
p(Z|X), so as to infer the posterior p(X|Z) of a hypothesis given a certain image observation in the
MAP estimation framework. However, the evaluation can be quite challenging, for instance when
evaluating a shape hypothesis on a cluttered image. Although some analytical results were reported
in (Blake and Isard, 1998), many current tracking methods take ad hoc measurements.
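To make the evaluation step concrete, here is a minimal sketch of scoring a contour hypothesis against detected edge points; the Gaussian edge-noise model is our own toy assumption, not the paper's measurement model:

```python
import numpy as np

def shape_likelihood(hypothesis_pts, edge_pts, sigma=2.0):
    """Toy observation likelihood p(Z|X): for each point on the
    hypothesized contour, find the nearest detected edge point and
    score the residual under an assumed Gaussian noise model."""
    d2 = ((hypothesis_pts[:, None, :] - edge_pts[None, :, :]) ** 2).sum(-1)
    nearest = np.sqrt(d2.min(axis=1))          # distance to closest edge
    return float(np.exp(-0.5 * (nearest / sigma) ** 2).prod())

# a hypothesis lying exactly on the edges scores higher than one offset from them
edges = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
good = shape_likelihood(edges.copy(), edges)
bad = shape_likelihood(edges + np.array([0.0, 3.0]), edges)
assert good > bad
```

Real trackers measure edges along normals of the hypothesized contour; the nearest-point search here is only the simplest stand-in.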
The hypothesis generation, denoted by p(X_t | X_{t-1}), produces new state hypotheses based on old
estimates of the target state, implying the evolution of the target’s dynamic processes. Target’s
dynamics can be embedded in such a predicting process. At a certain time instant, the target
state is a random vector. The a posteriori probability distribution of the target state given the
observation history changes with time. Therefore, the tracking problem can be viewed as a problem
of propagating conditional probability densities. The target state at a certain time instant can
be estimated by computing its conditional probability density. The Kalman filtering
technique gives a classic example of hypotheses generation under Gaussian assumptions, due to
which the densities are characterized and parameterized by their means and covariances. Thus,
the hypothesis generation characterizes the search range and confidence level of the tracking. On
the other hand, if the Gaussian assumption does not hold, which is very likely in image clutters, we
can represent the posterior densities in non-parametric forms. In this circumstance, the hypotheses
generation can be viewed as an evolution process of a set of hypotheses, or state samples (particles),
which facilitates a Monte Carlo approach for tracking. The Condensation (Blake and Isard, 1998)
algorithm is one such example.
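A minimal sketch of one such sample-based step, in the spirit of Condensation (resample by weight, predict through the dynamics, reweight on the observation); the 1-D state and the Gaussian dynamics/likelihood are toy assumptions, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)

def condensation_step(particles, weights, dynamics, likelihood):
    """One factored-sampling step: resample according to weights,
    propagate through the dynamic model p(X_t|X_{t-1}), then reweight
    each hypothesis on the new observation via p(Z_t|X_t)."""
    n = len(particles)
    idx = rng.choice(n, size=n, p=weights)        # select by weight
    predicted = dynamics(particles[idx])          # predict new hypotheses
    w = likelihood(predicted)                     # evaluate on observation
    return predicted, w / w.sum()

# toy 1-D example: constant-position dynamics with noise; observation
# likelihood peaked at a hypothetical true state 5.0
particles = rng.uniform(0.0, 10.0, size=500)
weights = np.full(500, 1.0 / 500)
dyn = lambda x: x + rng.normal(0.0, 0.3, size=x.shape)
lik = lambda x: np.exp(-0.5 * ((x - 5.0) / 1.0) ** 2)
for _ in range(10):
    particles, weights = condensation_step(particles, weights, dyn, lik)
est = float((particles * weights).sum())
assert abs(est - 5.0) < 1.0
```

The weighted particle set is a non-parametric representation of the posterior density, which is exactly what the Gaussian assumption of Kalman filtering cannot provide in clutter.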
Using a rough target model would not be robust. For example, if we use an ellipse to model the
head, the visual tracking might be unstable in a cluttered environment, e.g., when the head moves
in front of a book shelf, since the false image edges incurred by the clutter would likely distract the
tracker. Thus, to enhance the robustness, a richer target representation should be employed, since
it brings more uniqueness. There are two kinds of ideas. One approach is to construct and use a
detailed target model. For example, a B-spline shape model can be used to accurately describe a
contour, based on which excellent contour tracking results have been reported (Blake and Isard,
1998). The other approach is to combine several rough representations or models of different cues.
For example, we can simultaneously use a rough shape model and a rough color model to represent
the head. Would the combination of multiple rough models result in robust tracking? If
so, how should multiple cues and rough models be integrated? How do these different cue models interact
with each other? This paper will try to answer these questions.
This paper formulates the problem of integrating multiple cues for robust tracking as the
probabilistic inference problem of a factorized graphical model. To analyze this complex graphical
model, a variational method is taken to approximate the Bayesian inference. Interestingly, the
analysis reveals a co-inference phenomenon of multiple modalities, which illustrates the interac-
tions among different cues, i.e., one cue can be inferred iteratively by the other cues. Based on this,
this paper presents an efficient sequential Monte Carlo tracking algorithm to integrate multiple
visual cues, in which the co-inference of different modalities is approximated.
In Section 2, we will give a brief overview of the research of multiple cue integration in the
context of robust tracking. Then the factorized graphical model used in our tracking formulation
will be presented in Section 3; the co-inference phenomenon will be analyzed and explained in this
section as well. Section 4 will describe different techniques in sequential Monte Carlo approaches
for tracking problems. Based on importance sampling techniques, our proposed approach of the
co-inference will be presented in Section 5, and the details of our tracking implementation and
experiments will be described in Section 7. Section 8 will conclude the paper by discussing the
proposed approach and pointing out some possible directions for future investigations.
2. Multiple Cue Integration
We often notice that a state hypothesis from a richer target representation, either a detailed model or
a combination of multiple rough models, has a better chance of being correctly verified against various
image observations. For example, combining the color appearance of the target can largely enhance
the robustness of contour tracking against heavily cluttered backgrounds, and integrating shape
and color representations could result in better tracking performance against color distracters.
In addition to effective hypothesis evaluation, integrating multiple cues would reduce the tracker's
dependency on accurate dynamic models of the targets. Dynamic models play an important role
in tracking since they provide predictions to reduce search and computations. In many cases,
the parameters of dynamic models are specified in advance, or learned by training sequences.
However, if the parameters are not properly set, the tracker would be under large risks of failure.
It is desirable to develop robust trackers that work with weak dynamic models.
Interestingly, the integration and interaction of multiple cues for tracking would not require
accurate dynamic models. Here is the intuition. Suppose we represent the target by two modalities:
shape and color appearance. The two modalities have their own dynamic models, which means
that the target is deformable, and the lighting could change, but it is difficult to know in advance
about how the shape will deform and how the lighting will change. Therefore, we can only assume
very rough dynamic models as approximations. However, such rough dynamic models will
sometimes be violated, so that predictions based on them could deviate largely
and fail the tracker if the two modalities are treated separately. Our main idea is to let the
two modalities interact with each other. For example, if the shape changes very little but the
lighting changes a lot, the estimation of the color appearance can be fulfilled by taking advantage
of shape estimates, instead of relying on the predictions from the dynamics of color appearances.
Symmetrically, if the lighting changes very slowly but the target deforms a lot, the deformation
can be more robustly localized with the help of the color appearance estimates, instead of taking a
strong prediction prior from the deformation dynamics. Naturally, we shall ask: what if the changes
of both deformation and lighting are quite large? The problem becomes essentially untrackable
if no accurate dynamic model is available. Fortunately, we can still approach it by taking
the estimate that maximizes the joint probability of both modalities, and finding the most likely
state estimates. A detailed formulation and analysis will be given later in this paper.
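The intuition above can be sketched as alternating estimation between the two modalities. The following toy example is our own simplification, not the paper's algorithm: scalar "shape" and "color" states and an additive observation model are hypothetical choices, used only to show each cue borrowing the other's estimate in place of a strong dynamic prior:

```python
import numpy as np

rng = np.random.default_rng(1)

def co_inference_sweep(s, c, z):
    """One schematic co-inference sweep: the observation z depends on
    both modalities (here, toy model z = shape + color), so each
    particle set is reweighted with the other modality's current
    estimate held fixed, then resampled with a little diffusion."""
    n = len(s)
    w = np.exp(-0.5 * (z - (s + c.mean())) ** 2)   # score shape given color estimate
    s = rng.choice(s, size=n, p=w / w.sum()) + rng.normal(0.0, 0.05, n)
    w = np.exp(-0.5 * (z - (s.mean() + c)) ** 2)   # score color given shape estimate
    c = rng.choice(c, size=n, p=w / w.sum()) + rng.normal(0.0, 0.05, n)
    return s, c

# hypothetical ground truth: shape parameter 2.0 and color parameter 3.0,
# observed only through their sum z = 5.0
s = rng.uniform(0.0, 4.0, 400)
c = rng.uniform(1.0, 5.0, 400)
for _ in range(5):
    s, c = co_inference_sweep(s, c, z=5.0)
assert abs(s.mean() + c.mean() - 5.0) < 0.5
```

Note that neither modality relies on its own dynamic prediction here; each is localized through the other's estimate, which is the interaction the co-inference analysis later formalizes.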
Multiple cue integration can be done at the level of both observation representation and object
model. At the observation level, some approaches combine the measurements of the multiple
modalities for each hypothesis (Azoz et al., 1998; Birchfield, 1998). Although robust to some
extent, the methods of combining the likelihood measurements of different sources are often ad
hoc. In addition, to integrate shape and color, many tracking algorithms assume fixed target
color appearances (Azoz et al., 1998; Darrell et al., 1998; Isard and Blake, 1998b; Toyama and
Wu, 2000) to enable efficient color segmentation. However, such an assumption is often invalid
in practice. Instead of assuming a fixed color representation, non-stationary color tracking meth-
ods (Raja et al., 1998; Wu and Huang, 2000) adapt the color changes and update the color
models. At the object representation level, some methods also include the color modality in the
target representation (Bregler, 1997; Rasmussen and Hager, 1998; Wren et al., 1997), in which
a multivariate Gaussian can be used to represent both color and motion parameters. Obviously,
tracking both shape and color simultaneously would be a formidable problem, since it increases the
dimensionality of the state space. In addition, the interaction of multiple modalities is interesting
and important for the robustness. A switching scheme can be used to coordinate different trackers
that track different modalities (Toyama and Hager, 1996; Toyama and Wu, 2000). Generally,
different modalities are updated sequentially in these methods. However, a more profound and
systematic investigation of the interaction of multiple modalities is desirable. This paper tries to
investigate the relationship among different modalities for robust visual tracking, and to identify
an efficient way to facilitate simultaneous tracking of these modalities.
3. Graphical Models for Tracking
In this section, we formulate the visual tracking problem in a probabilistic framework. The
integration of multiple cues is characterized by a factorized graphical model, and we use the
variational analysis approach to approximate the probabilistic inference.
Following the notations of Isard and Blake (Isard and Blake, 1996; Blake and Isard, 1998), we
denote the target states and image observations by X_t and Z_t, respectively. The histories of the
states and measurements are denoted by X_t = (X_1, . . . , X_t) and Z_t = (Z_1, . . . , Z_t). The tracking
problem can be formulated as an inference problem with the prediction prior p(X_{t+1} | Z_t). We have
p(X_{t+1} | Z_{t+1}) ∝ p(Z_{t+1} | X_{t+1}) p(X_{t+1} | Z_t),

p(X_{t+1} | Z_t) = ∫ p(X_{t+1} | X_t) p(X_t | Z_t) dX_t,

where p(Z_t | X_t) represents the measurement or observation likelihood, and p(X_{t+1} | X_t) is the
dynamic model.
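On a discretized state space the prediction and correction equations above can be computed exactly. A brief sketch, where the three-state model and its numbers are toy assumptions:

```python
import numpy as np

def bayes_recursion(prior, transition, likelihood):
    """Exact Bayesian filtering recursion on a discretized state space:
    prediction:  p(X_{t+1} | Z_t) = sum_x p(X_{t+1} | x) p(x | Z_t)
    correction:  p(X_{t+1} | Z_{t+1}) proportional to
                 p(Z_{t+1} | X_{t+1}) p(X_{t+1} | Z_t)."""
    predicted = transition.T @ prior          # integrate out X_t
    posterior = likelihood * predicted        # multiply in the evidence
    return posterior / posterior.sum()        # normalize

# 3-state toy: a mostly-staying target, currently observed in state 2
prior = np.array([1/3, 1/3, 1/3])
transition = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.1, 0.1, 0.8]])      # rows: p(next | current)
likelihood = np.array([0.1, 0.1, 0.8])        # p(Z | X) for each state
post = bayes_recursion(prior, transition, likelihood)
assert post.argmax() == 2
```

For continuous, high-dimensional states this sum becomes an intractable integral, which is why the sequential Monte Carlo approximation described later is needed.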
The probabilistic formulation of the tracking problem can be represented by the graphical
model shown in Figure 1, where the X nodes are hidden states and the Z nodes are observations.
This is similar to the Hidden Markov Model (Rabiner, 1989). At time t, the observation Z_t is
independent of the previous states and previous observations given the current state X_t, i.e.,
p(Z_t | X_t, Z_1, . . . , Z_{t-1}) = p(Z_t | X_t), and the states have a first-order Markov property, i.e.,
p(X_t | X_1, . . . , X_{t-1}) = p(X_t | X_{t-1}).
Figure 1. The tracking problem can be represented by a graphical model, similar to the Hidden Markov Model.
Based on this graphical model, the tracking problem can be approached by probabilistic inference
techniques. However, when the dimensionality of the hidden states increases, inference
and learning become difficult due to the dramatic increase of computation. Fortunately, a
distributed state representation based on factorized graphical models can largely ease this difficulty
by decoupling the dynamics of different subsets of hidden states. Combining a set of rough models
for different cues can be formulated by such factorized graphical models. For example, the target states
can be decomposed into shape states and color states, as shown in Figure 2(a). In addition, the
observation can also be separated into Z^s_t and Z^c_t for shape and color, respectively, as in Figure 2(b).
Each observation depends on both the color and shape states.
Figure 2. Factorized Graphical Models: (a) The states of the target can be decomposed into shape states X^s_t and color states X^c_t in a factorized graphical model. (b) The observation can also be separated into Z^s_t and Z^c_t.
Due to the complex structure of the densely connected factorized network in Figure 2, exact
inference would be formidable. One approach to this problem is based on statistical sampling
methods, such as Gibbs sampling. Another approach is analytical, through probabilistic
variational analysis. The basic idea is to approximate the posterior probability p(X_t | Z_t) of the
hidden states by a tractable distribution Q(X_t) which has good analytical properties. The optimal
model parameters, as well as the variational parameters, are found by minimizing the
discrepancy between these two distributions. A lower bound on the log likelihood log P(Z_t) can
References

Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training.
Isard, M. and Blake, A. (1998). Condensation: Conditional density propagation for visual tracking.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition.
Swain, M. J. and Ballard, D. H. (1991). Color indexing.