Q: What are the contributions in "Avoiding the “streetlight effect”: tracking by exploring likelihood modes"?

In their approach, modes are found using efficient example-based matching followed by local refinement to find peaks and estimate peak bandwidth. The authors show comparative results on real and synthetic images in a high degree of freedom articulated tracking task.

Q: What is the way to define an adequate likelihood model?

In the case of monocular data, an adequate likelihood model could be defined [17] by the reprojection error of the 3D articulated model onto the images.

Q: What is the common method for calculating the likelihood of a pose?

The authors apply a local search algorithm using initializations $\{x^{init}\}$ from both the centers of the modes $\mu^{t-1}_i$ of the likelihood $p(y_{t-1} \mid x_{t-1})$ at the previous time step as well as pose estimates provided by a global search algorithm such as PSH.

Q: How can ELMO search a large region of the pose space?

In contrast to classic sampling approaches, their method can explore a much larger region of the pose space since searching a vast number of examples with an approximate nearest neighbor search and refining a few modes is much more efficient than maintaining a particle set of a sufficient size.

(Open Access) Avoiding the "streetlight effect": tracking by exploring likelihood modes (2005) | David Demirdjian

Q: How do the authors obtain the temporal prior?

The authors obtain the temporal prior by propagating modes of the posterior computed at the previous time step through a weak dynamics model.

Q: How many bins were used in the EDH?

For images of 200 by 200 pixels used in their database, with 3 scales (8, 16 and 32 pixels) and with location step size of half the scale, the EDH consisted of N = 13,076 bins.

Q: What is the probability of the posterior distribution at the previous time step?

If the posterior distribution at the previous time step (and thus the temporal prior, as the authors assume simple diffusion dynamics) is estimated as a mixture of K Gaussians, and the likelihood is a sum of L Gaussians, then it is reasonable to expect that the posterior estimate at the current time step will be a mixture of L × K Gaussians.

Q: How can the authors estimate the posterior of a pose?

The authors will show, however, that when the temporal prior is wide (i.e. the noise covariance is much greater than the covariance of the likelihood modes), then the estimate of the posterior may be obtained simply by modifying the weights of the likelihood Gaussians according to the prior.

Q: What is the importance of a good local optimization algorithm?

In order for local optimization to succeed, it is important to select starting pose hypotheses that are sufficiently close to the modes.

Avoiding the “Streetlight Effect”: Tracking by Exploring Likelihood Modes

David Demirdjian, Leonid Taycher, Gregory Shakhnarovich, Kristen Grauman, and Trevor Darrell

Computer Science and Artificial Intelligence Laboratory

Massachusetts Institute of Technology

Cambridge, MA, 02139

Abstract

Classic methods for Bayesian inference effectively constrain search to lie within regions of significant probability of the temporal prior. This is efficient with an accurate dynamics model, but otherwise is prone to ignore significant peaks in the true posterior. A more accurate posterior estimate can be obtained by explicitly finding modes of the likelihood function and combining them with a weak temporal prior. In our approach modes are found using efficient example-based matching followed by local refinement to find peaks and estimate peak bandwidth. By reweighting these peaks according to the temporal prior we obtain an estimate of the full posterior model. We show comparative results on real and synthetic images in a high degree of freedom articulated tracking task.

1. Introduction

Online articulated human tracking is the task of inferring (for each frame) the pose that both explains the observed image well, and is consistent with previous pose estimates and our notion of human motion dynamics. The human pose space is known to be large, making brute-force search methods infeasible.

Since the peaks in the compatibility function between images and pose are sharp [19], and dynamics are highly uncertain (except for very structured cases such as walking), a large number of hypotheses may have to be generated in order to locate the actual pose. When posed in probabilistic terms, the problem is the following: the pose likelihood is sharp but multi-modal, and the (dynamics-based) temporal prior is wide.

Looking under a streetlight to find a lost object at night is an apt metaphor for classic approaches to this task, which typically search within a region of the state space surrounding the estimate at a previous time step. It may not be where the object is, but it’s an easy place to search! So goes the rationale of existing Bayesian tracking approaches, which base search on a strong temporal prior. In practice the “streetlight” (i.e., samples from the prior) can be narrow and bright (have high sample density), or be broad and dim (low density); neither is sufficient to find sharp peaks of the true posterior that are far from modes of the prior. Searching under the streetlight, i.e., under the prior, is seemingly desirable, but if the object is actually “in the dark” it is a futile endeavor.

Ideally we would like to evaluate the likelihood of a very broad and dense set of samples from the prior, but this is impractical with existing probabilistic filtering methods. Broad search requires an extremely large number of samples, which are too costly to test and propagate individually. However, with a sharp likelihood and a wide prior the shape of the posterior distribution depends much more on the shape of the likelihood than on the temporal prior. Tracking performance may thus be improved by finding modes of the likelihood function first and incorporating prior information later.

In this paper we show how a broad search for modes of the likelihood function can proceed efficiently, mitigating the streetlight effect by considering regions of state space that appear highly likely based on the observation in the current frame. Whereas maintaining and propagating a very large set of samples representing a prior is impractical, we show how modes of the likelihood function can be sought efficiently using fast search methods.

We leverage the recent introduction of view-based or example-based methods [16, 11, 2], in which the dependency between the pose and body appearance is learned directly from a large number of appearance/pose examples. Such methods can be used to quickly locate pose samples that are likely to be close to the modes of the likelihood functions. Local, gradient-based search can then find mode peaks, and estimate mode bandwidth. We are thus able to efficiently estimate the complete likelihood function as a mixture of a few Gaussians, each representing a narrow peak in the likelihood.

By reweighting these peaks according to the temporal prior we obtain an estimate of the full posterior model. In contrast to previous view-based tracking methods, our posterior accurately captures the multimodality of the likelihood function when appropriate. In contrast to previous sample-based methods it is able to search more broadly through the state space, rather than only around the prior (or streetlight, to complete the metaphor).

In Proceedings of the IEEE International Conference on Computer Vision, Beijing, China, October 2005.

In the following section we review relevant related work on probabilistic tracking. We then present our method for Exploring Likelihood MOdes (ELMO), and describe mode detection, refinement, and temporal integration in turn. We evaluate our approach with standard sequences from publicly available rendering software and motion capture data, as well as with real image sequences.

2. Prior Work

The core of our algorithm is the exploration of pose space by finding modes of the likelihood function, and weighting them by the prior to form an estimate of the posterior. Modes are estimated by initializing a model-based gradient-ascent algorithm at poses returned by a nearest neighbor matching algorithm.

Pose estimation algorithms often use gradient ascent to optimize the likelihood function (or pose-observation compatibility function in deterministic methods). Since likelihood modes are sharp, the initial hypothesis from which optimization is started is extremely important; gradient ascent is not likely to locate the mode if initialized far from it. Deterministic methods [14, 7, 8] use the previously estimated pose to start the search. While this is reasonable in situations with small interframe motion, such algorithms may lose track when fast motion or occlusion occurs.

While classic sampling-based probabilistic tracking algorithms [17, 15] only evaluate the likelihood function, recent approaches also use local optimization methods initialized at samples from the temporal prior [19, 9, 4]. The Hybrid Monte Carlo method of [5] incorporates gradient information directly into the sampling process. Since the temporal prior is obtained by propagating the pose posterior at the previous time step through the uncertain dynamics, many samples need to be drawn from it in order to get a good initialization point. The multi-hypothesis tracking approach of [4] is similar to ours in that only modes of the posterior (rather than individual samples) are propagated through dynamics; however, it still requires sampling the propagated modes in order to obtain seeds for local optimization. Algorithms such as [20, 18] base their sampling method on the likelihood rather than the temporal prior, but still require generating and evaluating a large number of hypotheses.

As has been shown in [19], a local optimization is often only as good as its starting location, and the wide temporal prior is not the best source for pose samples that are close to a mode of the likelihood. Fortunately, several pose estimation methods have been recently developed that bypass using a human body model altogether. Instead they use a large number of view/pose pairs to directly learn the dependency between the image and the human pose. Relevance vector machine regression on the current observation and the previous pose estimate is used in [1] to find a mode of the posterior. The single-frame pose estimation algorithm of [16] uses parameter sensitive hashing to retrieve several samples with poses similar to the image, followed by robust regression. In [11], a mixture model prior over multi-view shape and pose is used to directly infer the unknown pose of an observed silhouette shape in a single frame.

3. Tracking with Likelihood Modes

We approach online pose estimation in video sequences as filtering in a probabilistic framework. The philosophy of our algorithm is based on two observations regarding the articulated tracking task. On the one hand, body dynamics are often uncertain so the temporal pose prior is wide: it assigns relatively large probability to large regions in the pose space. On the other hand, common likelihood functions (the compatibility between a rendered model and an observed image) are sharp, but multi-modal. A reasonable approximation to a sharply peaked multi-modal likelihood function is a weighted sum of Gaussians with small covariances.
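As a minimal illustration of this approximation, the sketch below evaluates a one-dimensional weighted sum of narrow Gaussians. The pose is reduced to a scalar and the mode parameters are hypothetical numbers chosen only to show the shape: high values at the peaks and essentially zero in between.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_value(x, modes):
    """Evaluate a weighted sum of Gaussians; modes is a list of (weight, mean, var)."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in modes)

# Two sharp, well-separated peaks (hypothetical values, not from the paper).
likelihood_modes = [(0.7, -2.0, 0.01), (0.3, 1.5, 0.02)]

at_peak = mixture_value(-2.0, likelihood_modes)       # large: on top of a mode
between_peaks = mixture_value(0.0, likelihood_modes)  # negligible: between modes
```

A sample drawn between the two peaks contributes almost nothing, which is why the starting points for local search matter so much.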

Our algorithm, ELMO, proceeds as follows: we estimate modes of the likelihood function by selecting a set of initial pose hypotheses and refining them using a gradient-based technique which is able to both locate the mode of the likelihood and estimate its covariance. We obtain the temporal prior by propagating modes of the posterior computed at the previous time step through a weak dynamics model. Finally, we compute an estimate of the posterior distribution by reweighting the likelihood modes according to the temporal prior. An overview of the algorithm is shown in Figure 1.
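The three steps just described can be sketched as a single filtering step. This is an illustrative outline on a scalar pose, not the authors' implementation: `retrieve` and `refine` are hypothetical callables standing in for the example-based search and the local optimizer, and the prior is a diffusion of the previous modes with variance `noise_var`.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def elmo_step(observation, prev_modes, retrieve, refine, noise_var):
    """One filtering step on a 1-D pose.

    prev_modes: list of (weight, mean, var) posterior modes from time t-1.
    retrieve(observation): hypothetical stand-in for example-based search;
        returns candidate poses.
    refine(observation, x_init): hypothetical stand-in for local optimization;
        returns (unnormalized_weight, mean, var) of the likelihood mode reached.
    noise_var: variance of the diffusion dynamics.
    """
    # 1. Seeds: retrieved examples plus the previous mode centers.
    seeds = list(retrieve(observation)) + [m for _, m, _ in prev_modes]
    # 2. Local refinement; duplicate optima are collapsed so that repeated
    #    convergence to one peak does not inflate its weight.
    modes = {}
    for x_init in seeds:
        w_hat, mean, var = refine(observation, x_init)
        modes[round(mean, 3)] = (w_hat, mean, var)
    # 3. Reweight each likelihood mode by the wide temporal prior at its mean.
    posterior = []
    for w_hat, mean, var in modes.values():
        prior = sum(w * gaussian_pdf(mean, m, noise_var) for w, m, _ in prev_modes)
        posterior.append((w_hat * prior, mean, var))
    total = sum(w for w, _, _ in posterior)
    return [(w / total, mean, var) for w, mean, var in posterior]

# Toy usage: the likelihood has equal peaks at 1.0 and 5.0 (hypothetical).
def toy_retrieve(observation):
    return [0.9, 1.1, 4.8]

def toy_refine(observation, x_init):
    peak = min((1.0, 5.0), key=lambda p: abs(p - x_init))
    return (1.0, peak, 0.01)

posterior = elmo_step(None, [(1.0, 1.2, 0.01)], toy_retrieve, toy_refine, noise_var=4.0)
```

In the toy run the mode near the previous estimate receives the larger weight, while the distant mode is kept with a smaller one instead of being discarded.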

In order for local optimization to succeed, it is important to select starting pose hypotheses that are sufficiently close to the modes. While it is possible to generate initial hypotheses from the wide temporal prior [19, 5, 17], or by uniformly sampling the pose space, in both of these methods a large number of samples would need to be drawn in order to obtain a hypothesis adequately close to the mode. Instead, we use a learning-based search method which, after being trained on a suitable number of image/pose examples, is able to quickly extract pose hypotheses that with high probability correspond to the observed image.
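A rough back-of-the-envelope calculation shows why uniform sampling scales poorly. The sketch assumes, purely for illustration, a pose space normalized to the unit hypercube and treats "adequately close" as falling inside an axis-aligned cube of side `tolerance` around a mode; neither assumption comes from the paper.

```python
def expected_uniform_samples(dims, tolerance):
    """Expected number of uniform samples from [0, 1]^dims before one falls
    inside an axis-aligned cube of side `tolerance` around a fixed mode."""
    hit_probability = tolerance ** dims
    # Number of trials until the first success is geometric with mean 1/p.
    return 1.0 / hit_probability

samples_3d = expected_uniform_samples(3, 0.1)    # about a thousand
samples_10d = expected_uniform_samples(10, 0.1)  # about ten billion
```

Under these assumptions the required sample count grows exponentially with the number of degrees of freedom.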

There are significant methodological differences between ELMO and classic particle filtering approaches. At no time is a density represented as a (large) set of samples, and so the need for a large number of likelihood evaluations is avoided. Furthermore, repeated instances of the same hypothesis do not imply a greater probability of that hypothesis. We do assume that at least one pose hypothesis will be extracted for each significant peak in the likelihood function. Thus a mode with low likelihood will have low weight even if the gradient ascent algorithm converged to it from multiple starting hypotheses.

Figure 1: High-level overview of the ELMO algorithm. A set of pose hypotheses near the modes of the likelihood function are extracted using nearest neighbor search. The modes are refined with a gradient ascent algorithm initialized at every hypothesis, and a weighted sum of Gaussians estimate is computed for the likelihood function. Note that the number of hypotheses corresponding to a mode does not impact its estimated value. The posterior is then estimated by reweighting members of the mixture according to the temporal prior. (Diagram labels: pose hypothesis, nearest neighbor search, local optimization, true likelihood function, estimated modes of the likelihood function, temporal prior distribution, reweighting using the prior, estimated posterior distribution.)

Since our algorithm is less reliant on the temporal prior for initializing search, it is likely to handle occlusions better than standard filtering methods. Indeed, ELMO can directly find the correct likelihood modes in the post-occlusion frames rather than starting with a (necessarily) wide prior.

3.1. Sampling with Parameter-Sensitive Hashing

A key component of our approach is the ability to quickly search the pose space for the small set of samples that lie close to the modes of the likelihood function. While there are a variety of fast regression or nearest neighbor search methods that are appropriate for our task, in this paper we rely on parameter-sensitive hashing (PSH) [16]. PSH is a randomized algorithm for the indexing and retrieval of data that allows very fast search of a large database of examples for instances similar to a query in a parameter space. In our case it means that from a database of images labeled with the corresponding articulated poses, we can quickly retrieve examples that with high probability have pose similar to the unknown pose in the input image. This is done by learning, from examples of images with similar and dissimilar poses, a set of hashing functions under which collision is correlated with pose similarity, rather than directly with appearance similarity.

Thus, the pose examples returned by PSH typically lie close to the modes of the likelihood function and should be an appropriate set of initial hypotheses for a local optimization algorithm even if the training algorithm uses features different from those used to compute the likelihood. Furthermore, PSH is a modification of a locality-sensitive hashing algorithm [10] and shares its sublinear running time. Searching over tens of thousands of examples with PSH is orders of magnitude faster than propagating and evaluating an equivalent number of samples in a particle filter. As a result, the number of likelihood mode hypotheses that we can search is much larger than the number of samples that we could possibly maintain in a particle filter (as shown in the experiments below).
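The indexing idea can be sketched with a simplified hashing scheme in the spirit of locality-sensitive hashing. This is not PSH itself: PSH learns which features and thresholds to use from pairs of similar and dissimilar poses, whereas the hypothetical `BitHashIndex` below draws them at random and assumes features scaled to [0, 1]. It only illustrates how several tables of short bit keys give fast candidate retrieval.

```python
import random
from collections import defaultdict

class BitHashIndex:
    """LSH-style index: each table keys an example by k thresholded features."""

    def __init__(self, dims, num_tables, bits_per_key, seed=0):
        rng = random.Random(seed)
        # Each hash bit is a (feature index, threshold) pair. PSH would select
        # these from training data; here they are drawn at random.
        self.tables = [
            ([(rng.randrange(dims), rng.random()) for _ in range(bits_per_key)],
             defaultdict(list))
            for _ in range(num_tables)
        ]

    @staticmethod
    def _key(bits, features):
        return tuple(features[i] > t for i, t in bits)

    def add(self, features, pose):
        for bits, buckets in self.tables:
            buckets[self._key(bits, features)].append(pose)

    def query(self, features):
        """Return poses of all examples colliding with the query in any table."""
        found = []
        for bits, buckets in self.tables:
            for pose in buckets.get(self._key(bits, features), []):
                if pose not in found:
                    found.append(pose)
        return found

index = BitHashIndex(dims=4, num_tables=5, bits_per_key=3)
index.add([0.1, 0.1, 0.1, 0.1], "arms_down")
index.add([0.9, 0.9, 0.9, 0.9], "arms_up")
candidates = index.query([0.1, 0.1, 0.1, 0.1])
```

A query touches one bucket per table regardless of database size, which is where the sublinear lookup cost comes from; the retrieved poses then serve as seeds for local optimization.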

3.2. Local Optimization

We would like the likelihood p(y|x) to represent the compatibility between the observed visual data y and the shape of a 3D articulated model corresponding to the pose x.

In this paper, visual observations y consist of calibrated stereo image pairs which are used to build a 3D reconstruction of the scene. The shape of the human body in pose x is given by a 3D articulated model B(x). Intuitively, the best fit ˆx is obtained when the surface of the articulated model B(ˆx) lies closest to the observed scene points. Therefore we define the likelihood p(y|x) based on the distance between the articulated model and the observed scene. Such a criterion has been commonly used for stereo-based tracking [3, 13]. In the case of monocular data, an adequate likelihood model could be defined [17] by the reprojection error of the 3D articulated model onto the images.

Let $M(y) = \{M_i(y)\}$ be the set of 3D points of the scene reconstructed from the stereo image pair. Let $\{N_k(x)\}$ be a set of sample points from the articulated model $B(x)$. In practice, the distance $d(M(y), B(x))$ between the scene points and the articulated model can be written as:

$$d^2(M(y), B(x)) = \sum_k d^2(M(y), N_k(x)) \qquad (1)$$

where $d(M(y), N_k(x))$ is the Euclidean distance between the point cloud $M(y)$ and the point $N_k(x)$.

A likelihood model $p(y \mid x)$ naturally follows as:

$$p(y \mid x) \propto \exp\{-\lambda\, d^2(M(y), B(x))\} \qquad (2)$$

where $\lambda$ is a parameter depending on the uncertainty of the 3D reconstruction.
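Eqs. (1) and (2) can be sketched in a few lines. The sketch reads eq. (1) as a sum of squared distances and takes the distance from a model point to the point cloud to be the distance to its nearest scene point; both readings are assumptions, as are the function names and the toy coordinates.

```python
import math

def squared_distance_to_cloud(point, cloud):
    """Squared Euclidean distance from `point` to its nearest neighbor in `cloud`."""
    return min(sum((a - b) ** 2 for a, b in zip(point, q)) for q in cloud)

def model_to_scene_distance(scene_points, model_points):
    """Eq. (1): sum over model sample points of their squared distance to the scene."""
    return sum(squared_distance_to_cloud(n, scene_points) for n in model_points)

def unnormalized_likelihood(scene_points, model_points, lam):
    """Eq. (2) up to the proportionality constant."""
    return math.exp(-lam * model_to_scene_distance(scene_points, model_points))

# Toy scene and two candidate model samplings (hypothetical coordinates).
scene = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
good_fit = unnormalized_likelihood(scene, [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)], lam=0.5)
poor_fit = unnormalized_likelihood(scene, [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)], lam=0.5)
```

The brute-force nearest-neighbor search is quadratic in the number of points; a spatial index would be the natural replacement for real point clouds.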

Given a set of pose hypotheses returned by PSH and mode locations propagated from the previous time step, we fit a sum of Gaussians (3) to the approximate likelihood at time $t$, $p(y_t \mid x_t)$, defined in eq. (2).

We apply a local search algorithm using initializations $\{x^{init}\}$ from both the centers of the modes $\mu^{t-1}_i$ of the likelihood $p(y_{t-1} \mid x_{t-1})$ at the previous time step as well as pose estimates provided by a global search algorithm such as PSH. For each initialization $x^{init}$, we look for a local maximum $\mu_k$ (with covariance $C_k$) of $p(y_t \mid x_t)$. In many cases, the local optima $\mu_k$ converge to the same peaks of the likelihood $p(y_t \mid x_t)$. Only the highest optima $(\mu^t_i)$ are kept to represent the full likelihood model $p(y_t \mid x_t)$. In practice, an average of 5 modes is usually kept.
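The multi-start procedure can be sketched on a scalar pose. Everything here is illustrative: the bimodal `log_likelihood` is a hypothetical stand-in for eq. (2), plain gradient ascent with finite differences replaces the paper's optimizer, and the mode variance is taken from the curvature of the log-likelihood at the optimum (a Laplace-style estimate, which is an assumption about how bandwidth might be computed).

```python
import math

def log_likelihood(x):
    """Toy bimodal log-likelihood (hypothetical): peaks near 0 and 3."""
    return math.log(math.exp(-0.5 * x ** 2 / 0.04)
                    + 0.5 * math.exp(-0.5 * (x - 3.0) ** 2 / 0.04))

def refine(x, step=0.01, iters=200, h=1e-4):
    """Gradient ascent from x; returns (mean, variance, log-likelihood at mean)."""
    for _ in range(iters):
        grad = (log_likelihood(x + h) - log_likelihood(x - h)) / (2 * h)
        x += step * grad
    # Second derivative of the log-likelihood gives the inverse variance.
    curvature = (log_likelihood(x + h) - 2 * log_likelihood(x)
                 + log_likelihood(x - h)) / h ** 2
    return x, -1.0 / curvature, log_likelihood(x)

def find_modes(inits, max_modes=5, merge_tol=1e-2):
    """Refine every seed, merge optima that coincide, keep the highest ones."""
    modes = []
    for x0 in inits:
        mean, var, value = refine(x0)
        if all(abs(mean - m) > merge_tol for m, _, _ in modes):
            modes.append((mean, var, value))
    modes.sort(key=lambda mode: mode[2], reverse=True)
    return modes[:max_modes]

# Five seeds collapse onto two distinct modes.
modes = find_modes([-0.2, 0.1, 0.3, 2.8, 3.1])
```

Merging before ranking is what keeps a peak reached from many seeds from counting more than a peak reached once.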

The local optimum $\mu_k$ can be found using standard optimization techniques such as gradient ascent or Levenberg-Marquardt. However, in the particular case of likelihood functions based on a 3D metric error such as $d^2(M(y), B(x))$, approximative techniques such as those based on the Iterative Closest Point (ICP) algorithm [3] can be used in order to estimate the optimum $\mu_k$ and covariance $C_k$ (see [7, 8]). Such algorithms are proven to converge (when initialized close to the solution) and are less computationally intensive than standard optimization techniques.
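To make the ICP idea concrete, the sketch below restricts the unknown pose to a pure translation of a rigid point set. This is a deliberate simplification and not the articulated variant of [7, 8]: it only shows the alternation between nearest-point correspondence and a closed-form least-squares update. Function names and point coordinates are hypothetical.

```python
def nearest(point, cloud):
    """Scene point closest to `point` in squared Euclidean distance."""
    return min(cloud, key=lambda q: sum((a - b) ** 2 for a, b in zip(point, q)))

def icp_translation(model_points, scene_points, iters=20):
    """Estimate the translation aligning model_points to scene_points."""
    dims = len(model_points[0])
    t = [0.0] * dims
    for _ in range(iters):
        moved = [[a + s for a, s in zip(p, t)] for p in model_points]
        # Correspondence step: match each moved model point to its closest scene point.
        matches = [nearest(p, scene_points) for p in moved]
        # Update step: for a pure translation the least-squares solution
        # is the mean residual between matches and moved points.
        for d in range(dims):
            t[d] += sum(m[d] - p[d] for m, p in zip(matches, moved)) / len(moved)
    return t

model = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
scene = [(x + 0.2, y - 0.1) for x, y in model]
translation = icp_translation(model, scene)
```

As the text notes, convergence depends on starting close to the solution: with a large initial offset the nearest-point matches would be wrong and the iteration could settle in a different optimum.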

3.3. Temporal Integration

In typical articulated tracking tasks, as discussed above, the temporal prior provides less information about the posterior distribution than the likelihood function. Given a sum of Gaussians representation of the likelihood function, we show here how to efficiently integrate information over time and estimate an instantaneous posterior.

A key challenge when propagating mixture models is the combinatorial complexity cost. Indeed, if the posterior distribution at the previous time step (and thus the temporal prior, as we assume simple diffusion dynamics) is estimated as a mixture of K Gaussians, and the likelihood is a sum of L Gaussians, then it is reasonable to expect that the posterior estimate at the current time step will be a mixture of L × K Gaussians. We will show, however, that when the temporal prior is wide (i.e., the noise covariance is much greater than the covariance of the likelihood modes), then the estimate of the posterior may be obtained simply by modifying the weights of the likelihood Gaussians according to the prior.

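Let $y_t$ be the observation at time $t$, and $x_t$ be the pose. Let the pose likelihood and temporal prior be

$$p(y_t \mid x_t) = \sum_{i=1}^{L} \hat{w}^t_i \, N(x_t; \mu^t_i, C^t_i), \qquad (3)$$

$$p(x_t \mid y_1, \ldots, y_{t-1}) = \sum_{j=1}^{K} w^{t-1}_j \, N(x_t; \mu^{t-1}_j, C^{t-1}_j + C_\eta), \qquad (4)$$

where $N(x; \mu, C) = \frac{1}{\sqrt{(2\pi)^n |C|}} \exp\{-\tfrac{1}{2}(x-\mu)^T C^{-1} (x-\mu)\}$ and $n$ is the dimension of the pose vector.

The $i$th mode in the likelihood has mean $\mu^t_i$, covariance $C^t_i$ and value $\hat{w}^t_i / \sqrt{(2\pi)^n |C^t_i|}$. Each component of the temporal prior has arisen from the posterior modes estimated at the previous time step (characterized by means $\mu^{t-1}_j$, covariances $C^{t-1}_j$ and weights $w^{t-1}_j$) after combination with Gaussian noise with covariance $C_\eta$.

In general the posterior distribution

$$p(x_t \mid y_1, \ldots, y_t) \propto p(y_t \mid x_t)\, p(x_t \mid y_1, \ldots, y_{t-1})$$

would be a mixture of $L \times K$ terms of the form $N(x_t; \mu^t_i, C^t_i)\, N(x_t; \mu^{t-1}_j, C^{t-1}_j + C_\eta)$. Each such product can be expressed as:

$$N(x_t; \mu^t_i, C^t_i)\, N(x_t; \mu^{t-1}_j, C^{t-1}_j + C_\eta) = k\, N(x_t; \hat{\mu}, \hat{C}), \text{ where}$$

$$k = N(\mu^t_i; \mu^{t-1}_j, C^t_i + C^{t-1}_j + C_\eta)$$

$$\hat{C} = \left((C^t_i)^{-1} + (C^{t-1}_j + C_\eta)^{-1}\right)^{-1}$$

$$\hat{\mu} = \hat{C}\left((C^t_i)^{-1} \mu^t_i + (C^{t-1}_j + C_\eta)^{-1} \mu^{t-1}_j\right)$$

Since we assume that the noise covariance is much greater than the covariance of the likelihood modes, the following is true:

$$C^t_i + C_\eta \approx C_\eta$$

$$(C^t_i)^{-1} + (C_\eta)^{-1} \approx (C^t_i)^{-1}$$

The product can be approximated as

$$N(x_t; \mu^t_i, C^t_i)\, N(x_t; \mu^{t-1}_j, C^{t-1}_j + C_\eta) \approx N(\mu^t_i; \mu^{t-1}_j, C_\eta)\, N(x_t; \mu^t_i, C^t_i) \qquad (5)$$

and the posterior distribution is reduced to

$$p(x_t \mid y_1, \ldots, y_t) \approx \sum_{i=1}^{L} w^t_i \, N(x_t; \mu^t_i, C^t_i), \qquad (6)$$

$$w^t_i = \hat{w}^t_i \sum_{j=1}^{K} w^{t-1}_j \, N(\mu^t_i; \mu^{t-1}_j, C_\eta)$$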

Intuitively, we can expect that the wide temporal prior does not vary much over the region of support of each Gaussian in the likelihood, and the posterior distribution is then the mixture of the same Gaussians but with their weights modified by the probabilities assigned to their means by the temporal prior.
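The quality of the approximation in eq. (5) can be checked numerically in one dimension. The sketch below compares the direct product of a narrow likelihood component and a wide prior component with the exact Gaussian product identity and with the wide-prior approximation. All numbers are hypothetical and the variable names (`c_eta` for the noise variance, and so on) are introduced only for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def exact_product(x, mu_i, c_i, mu_j, c_j, c_eta):
    """k * N(x; mu_hat, c_hat): exact product of a likelihood and a prior component."""
    c_prior = c_j + c_eta
    k = gaussian_pdf(mu_i, mu_j, c_i + c_prior)
    c_hat = 1.0 / (1.0 / c_i + 1.0 / c_prior)
    mu_hat = c_hat * (mu_i / c_i + mu_j / c_prior)
    return k * gaussian_pdf(x, mu_hat, c_hat)

def wide_prior_approximation(x, mu_i, c_i, mu_j, c_eta):
    """Right-hand side of eq. (5): prior evaluated at the likelihood mean."""
    return gaussian_pdf(mu_i, mu_j, c_eta) * gaussian_pdf(x, mu_i, c_i)

# Narrow likelihood mode (variance 0.01), wide diffusion noise (variance 4.0).
mu_i, c_i, mu_j, c_j, c_eta, x = 1.0, 0.01, 0.0, 0.01, 4.0, 1.05

direct = gaussian_pdf(x, mu_i, c_i) * gaussian_pdf(x, mu_j, c_j + c_eta)
exact = exact_product(x, mu_i, c_i, mu_j, c_j, c_eta)
approx = wide_prior_approximation(x, mu_i, c_i, mu_j, c_eta)
relative_error = abs(approx - direct) / direct
```

With a noise variance several hundred times larger than the mode variance, the approximation stays within a few percent of the exact product at this test point; the error grows as the two variances approach each other.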

4. Implementation and Experiments

In order to validate our approach, we performed various experiments to compare our algorithm (ELMO) against both its component algorithms PSH and ICP, as well as the particle filtering method Condensation [12].

The feature space over which PSH hash functions were constructed consisted of concatenated multiscale edge direction histograms (EDH) as in [16]. The EDH of an image is computed by applying an edge detector, assigning each edge pixel to one of the fixed directional bins (four in our case), counting the number of edge pixels for each direction falling in each of a number of subwindows of various sizes taken at various locations, and finally concatenating the obtained counts in a single feature vector. For images of 200 by 200 pixels used in our database, with 3 scales (8, 16 and 32 pixels) and with location step size of half the scale, the EDH consisted of N = 13,076 bins. We then selected M = 3547 features for which the true-positive rate [16] was above 0.65 and the true-positive/false-positive gap was at least 0.1. The data were then indexed by l = 50 hash tables with k = 18 bit keys. For every frame, we retrieve K = 50 training examples and use their poses to initialize the ICP.
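The bin count can be reproduced with a short calculation. The text does not spell out how subwindows are placed at the image border, so the sketch assumes one window origin every half-scale pixels across the full image width in each axis; that convention is an assumption, chosen because it yields the stated total.

```python
def edh_bin_count(image_size, scales, directions):
    """Count EDH bins, assuming window origins every scale // 2 pixels per axis."""
    total_windows = 0
    for scale in scales:
        step = scale // 2
        positions_per_axis = image_size // step
        total_windows += positions_per_axis ** 2
    return total_windows * directions

# 200x200 images, scales of 8, 16 and 32 pixels, four edge directions.
bins = edh_bin_count(200, (8, 16, 32), 4)
```

Under this convention the three scales contribute 2500, 625 and 144 windows, and four direction bins per window give 13,076 in total.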

Figure 3: Example of color and disparity images used in the synthetic sequences.

The labeled pose database indexed by PSH in our system consists of 60,000 images of humanoid models in randomly sampled poses created with Poser [6]. The models were constrained to an upright posture, but the articulation in the upper limbs as well as the orientation of the torso was constrained only by anatomical feasibility. We rendered the images from a viewpoint consistent with the camera settings of the tracker, and for each image saved the articulated pose information (3D locations of key body joints: neck, shoulders, elbows etc.). Pose similarity when training PSH was defined as less than 5 cm difference between any two joints.
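The similarity criterion can be written as a small predicate. The sketch interprets it as requiring every pair of corresponding joints in the two poses to lie within 5 cm of each other; that reading, the function name, and the use of meters are assumptions made for illustration.

```python
import math

def poses_similar(joints_a, joints_b, threshold_m=0.05):
    """True if every pair of corresponding joints is closer than threshold_m (meters)."""
    return all(math.dist(a, b) < threshold_m for a, b in zip(joints_a, joints_b))

# Two joints per pose (hypothetical 3D coordinates in meters).
pose_a = [(0.00, 1.50, 0.00), (0.20, 1.40, 0.00)]
pose_b = [(0.01, 1.50, 0.00), (0.20, 1.43, 0.00)]  # both joints within 5 cm
pose_c = [(0.00, 1.50, 0.00), (0.30, 1.40, 0.00)]  # second joint 10 cm away

similar_ab = poses_similar(pose_a, pose_b)
similar_ac = poses_similar(pose_a, pose_c)
```

A single joint outside the threshold is enough to label the pair dissimilar, which makes the criterion strict for poses with many joints.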

4.1. Synthetic Sequences

The first set of experiments evaluates the ground truth error relative to an extensive set of synthetic sequences.

Testing data consisted of a collection of synthetic sequences of people performing various kinds of activities (e.g. walking, playing sports, greeting). The synthetic sequences were generated from motion capture data taken from a public website (http://www.mocapdata.com) and rendered using Poser [6] to produce stereo image pairs. Then, standard correlation-based stereo was performed on the image pairs to produce a “realistic” disparity image as shown in Figure 3.

Some of the sequences contain many challenges for articulated tracking algorithms, including perspective effects (e.g. images taken from a 45 degree angle, hands moving very close to the camera), multiple self-occlusions (e.g. body turned on the side, completely hiding one of the arms), partial visibility (e.g. arms out of the field of view of the camera) and fast motions. Also note that the synthetic sequences have been rendered with characters and features different from the ones used in the PSH training set.

The synthetic sequences’ images were used as input for the Condensation, PSH, ICP, and ELMO algorithms. The Condensation algorithm was implemented as described in [12] and run using N = 1000 particles. We use the same likelihood function for Condensation and ELMO. The PSH and ICP algorithms were implemented following [16] (except that we omitted the local regression step) and [8] respectively. We fixed the number of candidates returned by PSH to 50 and computed the pose as the candidate with highest likelihood. Note that in order to run the ICP and Condensation algorithms, the articulated model


References

A method for registration of 3-D shapes

CONDENSATION - Conditional Density Propagation for Visual Tracking

Similarity Search in High Dimensions via Hashing

Articulated body motion capture by annealed particle filtering

Fast pose estimation with parameter-sensitive hashing
