Avoiding the “Streetlight Effect”: Tracking by Exploring Likelihood Modes
David Demirdjian, Leonid Taycher, Gregory Shakhnarovich, Kristen Grauman, and Trevor Darrell
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA, 02139
Abstract
Classic methods for Bayesian inference effectively con-
strain search to lie within regions of significant probability
of the temporal prior. This is efficient with an accurate dy-
namics model, but otherwise is prone to ignore significant
peaks in the true posterior. A more accurate posterior es-
timate can be obtained by explicitly finding modes of the
likelihood function and combining them with a weak tem-
poral prior. In our approach modes are found using effi-
cient example-based matching followed by local refinement
to find peaks and estimate peak bandwidth. By reweight-
ing these peaks according to the temporal prior we obtain
an estimate of the full posterior model. We show compara-
tive results on real and synthetic images in a high degree of
freedom articulated tracking task.
1. Introduction
Online articulated human tracking is the task of inferring
(for each frame) the pose that both explains the observed
image well, and is consistent with previous pose estimates
and our notion of human motion dynamics. The human
pose space is known to be large, making brute-force search
methods infeasible.
Since the peaks in the compatibility function between
images and pose are sharp [19], and dynamics are highly
uncertain (except for very structured cases such as walking),
a large number of hypotheses may have to be generated in
order to locate the actual pose. When posed in probabilistic
terms, the problem is the following: the pose likelihood is
sharp but multi-modal, and the (dynamics-based) temporal
prior is wide.
Looking under a streetlight to find a lost object at night
is an apt metaphor for classic approaches to this task, which
typically search within a region of the state space surround-
ing the estimate at a previous time step. It may not be
where the object is, but it’s an easy place to search! So
goes the rationale of existing Bayesian tracking approaches,
which base search on a strong temporal prior. In practice
the “streetlight” (i.e., samples from the prior) can be narrow
and bright (have high sample density), or be broad and dim
(low density); neither is sufficient to find sharp peaks of the
true posterior that are far from modes of the prior. Search-
ing under the streetlight, i.e., under the prior, is seemingly
desirable, but if the object is actually “in the dark” it is a
futile endeavor.
Ideally we would like to evaluate the likelihood of a
very broad and dense set of samples from the prior but this
is impractical with existing probabilistic filtering methods.
Broad search requires an extremely large number of sam-
ples, which are too costly to test and propagate individually.
However, with a sharp likelihood and a wide prior the shape
of the posterior distribution depends much more on the
shape of the likelihood than on the temporal prior. Tracking
performance may thus be improved by finding modes of the
likelihood function first and incorporating prior information
later.
In this paper we show how a broad search for modes of
the likelihood function can proceed efficiently, mitigating
the streetlight effect by considering regions of state space
that appear highly likely based on the observation in the
current frame. Whereas maintaining and propagating a very
large set of samples representing a prior is impractical, we
show how modes of the likelihood function can be sought
efficiently using fast search methods.
We leverage the recent introduction of view-based or
example-based methods [16, 11, 2], in which the dependency
between the pose and body appearance is learned
directly from a large number of appearance/pose examples.
Such methods can be used to quickly locate pose samples
that are likely to be close to the modes of the likelihood
functions. Local, gradient-based search can then find mode
peaks, and estimate mode bandwidth. We are thus able
to efficiently estimate the complete likelihood function as
a mixture of a few Gaussians, each representing a narrow
peak in the likelihood.
By reweighting these peaks according to the temporal
prior we obtain an estimate of the full posterior model. In
contrast to previous view-based tracking methods, our pos-
terior accurately captures the multimodality of the likeli-
hood function when appropriate. In contrast to previous
sample-based methods it is able to search more broadly
through the state space, rather than only around the prior
In Proceedings of the IEEE International Conference on Computer Vision, Beijing, China, October 2005.

(or streetlight, to complete the metaphor).
In the following section we review relevant related work
on probabilistic tracking. We then present our method for
Exploring Likelihood MOdes (ELMO), and describe mode
detection, refinement, and temporal integration in turn. We
evaluate our approach with standard sequences from pub-
licly available rendering software and motion capture data,
as well as with real image sequences.
2. Prior Work
The core of our algorithm is the exploration of pose space
by finding modes of the likelihood function, and weight-
ing them by the prior to form an estimate of the posterior.
Modes are estimated by initializing a model-based gradient-
ascent algorithm at poses returned by a nearest neighbor
matching algorithm.
Pose estimation algorithms often use gradient ascent to
optimize the likelihood function (or pose-observation com-
patibility function in deterministic methods). Since like-
lihood modes are sharp, the initial hypothesis from which
optimization is started is extremely important; gradient as-
cent is not likely to locate the mode if initialized far from
it. Deterministic methods [14, 7, 8] use the previously es-
timated pose to start the search. While this is reasonable
in situations with small interframe motion, such algorithms
may lose track when fast motion or occlusion occurs.
While classic sampling-based probabilistic tracking al-
gorithms [17, 15] only evaluate the likelihood function, re-
cent approaches also use local optimization methods initial-
ized at samples from the temporal prior [19, 9, 4]. The
Hybrid Monte Carlo method of [5] incorporates gradient
information directly into the sampling process. Since the
temporal prior is obtained by propagating the pose posterior
at the previous time step through the uncertain dynamics, many
samples need to be drawn from it in order to get a good ini-
tialization point. The multi-hypothesis tracking approach
of [4] is similar to ours in that only modes of the posterior
(rather than individual samples) are propagated through dynamics;
however, it still requires sampling the propagated
modes in order to obtain seeds for local optimization. Al-
gorithms such as [20, 18] base their sampling method on
the likelihood rather than the temporal prior, but still require
generating and evaluating a large number of hypotheses.
As has been shown in [19], a local optimization is often
only as good as its starting location, and the wide temporal
prior is not the best source for pose samples that are close
to a mode of the likelihood. Fortunately, several pose esti-
mation methods have been recently developed that bypass
using a human body model altogether. Instead they use a
large number of view/pose pairs to directly learn the depen-
dency between the image and the human pose. Relevance
vector machine regression on the current observation and
the previous pose estimate is used in [1] to find a mode of
the posterior. The single-frame pose estimation algorithm
of [16] uses parameter-sensitive hashing to retrieve several
samples with poses similar to the image, followed by robust
regression. In [11], a mixture model prior over multi-view
shape and pose is used to directly infer the unknown pose
of an observed silhouette shape in a single frame.
3. Tracking with Likelihood Modes
We approach online pose estimation in video sequences as
filtering in a probabilistic framework. The philosophy of
our algorithm is based on two observations regarding the
articulated tracking task. On the one hand, body dynamics
are often uncertain, so the temporal pose prior is wide:
it assigns relatively large probability to large regions in the
pose space. On the other hand, common likelihood func-
tions (the compatibility between a rendered model and an
observed image) are sharp, but multi-modal. A reasonable
approximation to a sharply peaked multi-modal likelihood
function is a weighted sum of Gaussians with small covari-
ances.
Our algorithm, ELMO, proceeds as follows: we estimate
modes of the likelihood function by selecting a set of initial
pose hypotheses and refining them using a gradient-based
technique which is able to both locate the mode of the like-
lihood and estimate its covariance. We obtain the tempo-
ral prior by propagating modes of the posterior computed
at the previous time step through a weak dynamics model.
Finally, we compute an estimate of the posterior distribu-
tion by reweighting the likelihood modes according to the
temporal prior. An overview of the algorithm is shown in
Figure 1.
In order for local optimization to succeed, it is impor-
tant to select starting pose hypotheses that are sufficiently
close to the modes. While it is possible to generate initial
hypotheses from the wide temporal prior [19, 5, 17], or by
uniformly sampling the pose space, both of these methods
would need to draw a large number of samples in
order to obtain a hypothesis adequately close to the mode.
Instead, we use a learning-based search method which, af-
ter being trained on a suitable number of image/pose ex-
amples, is able to quickly extract pose hypotheses that with
high probability correspond to the observed image.
There are significant methodological differences be-
tween ELMO and classic particle filtering approaches. At
no time is a density represented as a (large) set of samples,
and so the need for a large number of likelihood evaluations
is avoided. Furthermore, repeated instances of the same hy-
pothesis do not imply a greater probability of that hypothe-
sis. We do assume that at least one pose hypothesis will be
extracted for each significant peak in the likelihood func-
tion. Thus a mode with low likelihood will have low weight
[Figure 1 diagram labels: Pose Hypothesis; Nearest neighbor search; Local optimization; Estimated Modes of the Likelihood Function; True likelihood function; Temporal Prior Distribution; Reweighting using the prior; Estimated Posterior Distribution]
Figure 1: High-level overview of the ELMO algorithm. A set of pose hypotheses near the modes of the likelihood function
are extracted using nearest neighbor search. The modes are refined with a gradient ascent algorithm initialized at every
hypothesis, and a weighted sum of Gaussians estimate is computed for the likelihood function. Note that the number of
hypotheses corresponding to a mode does not impact its estimated value. The posterior is then estimated by reweighting
members of the mixture according to the temporal prior.
even if the gradient ascent algorithm converged to it from
multiple starting hypotheses.
Since our algorithm is less reliant on the temporal prior
for initializing search, it is likely to handle occlusions better
than standard filtering methods. Indeed, ELMO can directly
find the correct likelihood modes in the post-occlusion
frames rather than starting with a (necessarily) wide prior.
3.1. Sampling with Parameter-Sensitive Hashing
A key component of our approach is the ability to quickly
search the pose space for the small set of samples that lie
close to the modes of the likelihood function. While there
are a variety of fast regression or nearest neighbor search
methods that are appropriate for our task, in this paper we
rely on parameter-sensitive hashing (PSH) [16]. PSH is a
randomized algorithm for the indexing and retrieval of data
that allows very fast search of a large database of examples
for instances similar to a query in a parameter space. In our
case it means that from a database of images labeled with
the corresponding articulated poses, we can quickly retrieve
examples that with high probability have pose similar to the
unknown pose in the input image. This is done by learn-
ing, from examples of images with similar and dissimilar
poses, a set of hashing functions under which collision is
correlated with pose similarity, rather than directly with ap-
pearance similarity.
Thus, the pose examples returned by PSH typically lie
close to the modes of the likelihood function and should
be an appropriate set of initial hypotheses for a local optimization
algorithm even if the training algorithm uses
features different from those used to compute the likelihood.
Furthermore, PSH is a modification of a locality-sensitive
hashing algorithm [10] and shares its sublinear
running time. Searching over tens of thousands of examples
with PSH is orders of magnitude faster than propagating and
evaluating an equivalent number of samples in a particle fil-
ter. As a result, the number of likelihood mode hypotheses
that we can search is much larger than the number of sam-
ples that we could possibly maintain in a particle filter (as
shown in the experiments below).
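The retrieval idea can be illustrated with a minimal locality-sensitive hashing sketch. It is not PSH itself: PSH learns from pose-labeled examples which features and thresholds to use, whereas the bit tests below are drawn at random. The function names and the toy 8-dimensional feature vectors are hypothetical.

```python
import random
from collections import defaultdict


def make_hash(dim, n_bits, rng):
    """Draw n_bits random (feature index, threshold) tests; each test yields one key bit."""
    tests = [(rng.randrange(dim), rng.random()) for _ in range(n_bits)]
    return lambda v: tuple(v[i] > t for i, t in tests)


def build_tables(features, n_tables, n_bits, seed=0):
    """Index every example in n_tables independent hash tables."""
    rng = random.Random(seed)
    dim = len(features[0])
    tables = []
    for _ in range(n_tables):
        h = make_hash(dim, n_bits, rng)
        buckets = defaultdict(list)
        for idx, v in enumerate(features):
            buckets[h(v)].append(idx)
        tables.append((h, buckets))
    return tables


def query(tables, features, q, n_results):
    """Collect examples colliding with q in any table, then rank them by distance."""
    candidates = set()
    for h, buckets in tables:
        candidates.update(buckets.get(h(q), ()))

    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(features[i], q))

    return sorted(candidates, key=sq_dist)[:n_results]


# Toy database standing in for image features (assumption: values in [0, 1)).
rng = random.Random(1)
database = [[rng.random() for _ in range(8)] for _ in range(2000)]
tables = build_tables(database, n_tables=20, n_bits=6)
hits = query(tables, database, database[42], n_results=5)
```

Only the examples that share a bucket with the query are ever compared against it, which is where the sublinear behaviour comes from. In the tracker, the poses attached to the retrieved examples would serve as starting points for local optimization.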
3.2. Local Optimization
We would like the likelihood p(y|x) to represent the com-
patibility between the observed visual data y and the shape
of a 3D articulated model corresponding to the pose x.
In this paper, visual observations y consist of calibrated
stereo image pairs which are used to build a 3D reconstruc-
tion of the scene. The shape of the human body in pose x is
given by a 3D articulated model B(x). Intuitively, the best
fit $\hat{x}$ is obtained when the surface of the articulated model
$B(\hat{x})$ lies closest to the observed scene points. Therefore
we define the likelihood p(y|x) based on the distance between
the articulated model and the observed scene. Such
a criterion has been commonly used for stereo-based tracking
[3, 13]. In the case of monocular data, an adequate likeli-
hood model could be defined [17] by the reprojection error
of the 3D articulated model onto the images.
Let $M(y) = \{M_i(y)\}$ be the set of 3D points of
the scene reconstructed from the stereo image pair. Let
$\{N_j(x)\}$ be a set of sample points from the articulated
model B(x). In practice, the distance d(M(y), B(x)) between
the scene points and the articulated model can be
written as:
$$d^2(M(y), B(x)) = \sum_j d_E^2(M(y), N_j(x)) \qquad (1)$$
where $d_E^2(\cdot, \cdot)$ is the squared Euclidean distance between the point
cloud M(y) and the point $N_j(x)$.
A likelihood model p(y|x) naturally follows as:
$$p(y|x) \propto \exp\{-\lambda\, d^2(M(y), B(x))\} \qquad (2)$$
where λ is a parameter depending on the uncertainty of the 3D
reconstruction.
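As a rough illustration of eqs. (1) and (2), the sketch below evaluates the unnormalized likelihood for a handful of points. It assumes that the distance between the point cloud and a model point is the distance to the nearest cloud point, as in ICP-style matching; the text does not state this explicitly. The model points are passed in directly, so sampling them from B(x) is left out, and the function names are hypothetical.

```python
import math


def sq_dist_to_cloud(point, cloud):
    """Squared Euclidean distance from one model point to its closest scene point."""
    return min(sum((a - b) ** 2 for a, b in zip(point, m)) for m in cloud)


def sq_model_distance(scene_points, model_points):
    """Eq. (1): sum over the model sample points N_j(x) of d_E^2(M(y), N_j(x))."""
    return sum(sq_dist_to_cloud(n, scene_points) for n in model_points)


def unnormalized_likelihood(scene_points, model_points, lam):
    """Eq. (2) up to a constant factor: exp(-lambda * d^2(M(y), B(x)))."""
    return math.exp(-lam * sq_model_distance(scene_points, model_points))


scene = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # reconstructed scene points M(y)
model = [(0.0, 0.0, 0.1), (1.0, 0.2, 0.0)]   # sample points N_j(x) for one pose x
d2 = sq_model_distance(scene, model)          # 0.1**2 + 0.2**2
lik = unnormalized_likelihood(scene, model, lam=10.0)
```

A larger λ makes the peaks of the likelihood sharper, which corresponds to trusting the 3D reconstruction more.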
Given a set of pose hypotheses returned by PSH and
mode locations propagated from the previous time step, we
fit a sum of Gaussians (3) to the approximate likelihood at
time t, $p(y^t|x^t)$, defined in eq. (2).
We apply a local search algorithm using initializations
$\{x^{init}\}$ from both the centers of the modes $\mu_i^{t-1}$ of the
likelihood $p(y^{t-1}|x^{t-1})$ at the previous time step as well as
pose estimates provided by a global search algorithm such
as PSH. For each initialization $x_k^{init}$, we look for a local
maximum $\mu_k^t$ (with covariance $C_k^t$) of $p(y^t|x^t)$. In many
cases, the local optima $\mu_k^t$ converge to the same peaks of
the likelihood $p(y^t|x^t)$. Only the highest optima $(\mu_k^t, C_k^t)$
are kept to represent the full likelihood model $p(y^t|x^t)$. In
practice, an average of 5 modes is usually kept.
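The refine-and-prune step can be sketched in one dimension. This is a toy stand-in under several assumptions: plain gradient ascent with numerical derivatives replaces the ICP-based optimizer, the variance is taken from the curvature of the log-likelihood at the mode (a Laplace-style estimate), and `toy_log_lik`, the step size and the merge tolerance are all hypothetical.

```python
import math


def refine(log_lik, x0, step=0.01, iters=2000, h=1e-4):
    """Gradient ascent from x0; returns (mode, variance, log-likelihood at the mode)."""
    x = x0
    for _ in range(iters):
        grad = (log_lik(x + h) - log_lik(x - h)) / (2 * h)
        x += step * grad
    # Curvature of the log-likelihood at the mode gives the inverse variance.
    curv = (log_lik(x + h) - 2 * log_lik(x) + log_lik(x - h)) / h ** 2
    return x, -1.0 / curv, log_lik(x)


def find_modes(log_lik, inits, n_keep=5, merge_tol=1e-2):
    """Refine every initialization, merge duplicates, keep the n_keep highest optima."""
    modes = []
    for x0 in inits:
        mu, var, val = refine(log_lik, x0)
        if all(abs(mu - m) > merge_tol for m, _, _ in modes):
            modes.append((mu, var, val))
    return sorted(modes, key=lambda m: -m[2])[:n_keep]


def toy_log_lik(x):
    """Two sharp peaks: at 1.0 (variance 0.04) and at -2.0 (variance 0.09, half height)."""
    return math.log(math.exp(-(x - 1.0) ** 2 / 0.08)
                    + 0.5 * math.exp(-(x + 2.0) ** 2 / 0.18))


modes = find_modes(toy_log_lik, inits=[0.8, 1.1, 0.9, -1.8, -2.3])
```

Three of the five initializations converge to the peak near 1.0, yet that peak is stored once, so the number of hypotheses reaching a mode does not change its estimated value.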
The local optimum $\mu_k$ can be found using standard optimization
techniques such as gradient ascent or Levenberg-Marquardt.
However, in the particular case of likelihood
functions based on a 3D metric error such as
$d^2(M(y), B(x))$, approximate techniques such as those
based on the Iterative Closest Point (ICP) algorithm [3] can
be used in order to estimate the optimum $\mu_k$ and covariance
$C_k$ (see [7, 8]). Such algorithms are proven to converge
(when initialized close to the solution) and are less computationally
intensive than standard optimization techniques.
3.3. Temporal Integration
In typical articulated tracking tasks, as discussed above, the
temporal prior provides less information about the poste-
rior distribution than the likelihood function. Given a sum
of Gaussians representation of the likelihood function, we
show here how to efficiently integrate information over time
and estimate an instantaneous posterior.
A key challenge when propagating mixture models is the
combinatorial complexity cost. Indeed, if the posterior dis-
tribution at the previous time step (and thus the temporal
prior, as we assume simple diffusion dynamics) is estimated
as a mixture of K Gaussians, and the likelihood is a sum of
L Gaussians, then it is reasonable to expect that the poste-
rior estimate at the current time step will be a mixture of
L × K Gaussians. We will show, however, that when the
temporal prior is wide (i.e. the noise covariance is much
greater than the covariance of the likelihood modes), then
the estimate of the posterior may be obtained simply by
modifying the weights of the likelihood Gaussians accord-
ing to the prior.
Let $y^t$ be the observation at time t, and $x^t$ be the pose.
Let the pose likelihood and temporal prior be
$$p(y^t|x^t) = \sum_{i=1}^{L} \hat{w}_i^t\, N(x^t; \mu_i^t, C_i^t), \qquad (3)$$
$$p(x^t|y^0, y^1, \ldots, y^{t-1}) = \sum_{j=1}^{K} w_j^{t-1}\, N(x^t; \mu_j^{t-1}, C_j^{t-1} + C_\eta) \qquad (4)$$
where $N(x; \mu, C) = \frac{1}{\sqrt{(2\pi)^D |C|}}\, e^{-\frac{1}{2}(x-\mu)^T C^{-1} (x-\mu)}$.
The ith mode in the likelihood has mean $\mu_i^t$, covariance
$C_i^t$ and value $\hat{w}_i^t / \sqrt{(2\pi)^D |C_i^t|}$. Each component of the temporal
prior has arisen from the posterior modes estimated at
the previous time step (characterized by means $\mu_j^{t-1}$,
covariances $C_j^{t-1}$ and weights $w_j^{t-1}$) after combination with
Gaussian noise with covariance $C_\eta$.
In general the posterior distribution
$$p(x^t|y^0, y^1, \ldots, y^t) \propto p(y^t|x^t)\, p(x^t|y^0, y^1, \ldots, y^{t-1})$$
would be a mixture of L × K terms of the form
$N(x^t; \mu_i^t, C_i^t)\, N(x^t; \mu_j^{t-1}, C_j^{t-1} + C_\eta)$. Each such
product can be expressed as:
$$N(x^t; \mu_i^t, C_i^t)\, N(x^t; \mu_j^{t-1}, C_j^{t-1} + C_\eta) = k\, N(x^t; \hat{\mu}_i, \hat{C}_i),$$
where
$$k = N(\mu_i^t; \mu_j^{t-1}, C_i^t + C_j^{t-1} + C_\eta)$$
$$\hat{C}_i = \left((C_i^t)^{-1} + (C_j^{t-1} + C_\eta)^{-1}\right)^{-1}$$
$$\hat{\mu}_i = \hat{C}_i \left((C_i^t)^{-1} \mu_i^t + (C_j^{t-1} + C_\eta)^{-1} \mu_j^{t-1}\right)$$
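The product identity can be checked numerically in one dimension, where covariances reduce to scalar variances. The sketch below is only a sanity check with made-up numbers: `var_a` plays the role of a narrow likelihood mode and `var_b` the role of the combined prior variance $C_j^{t-1} + C_\eta$.

```python
import math


def gauss(x, mu, var):
    """1-D Gaussian density N(x; mu, var), with var a variance."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def gaussian_product(mu_a, var_a, mu_b, var_b):
    """Return (k, mu_hat, var_hat) such that N(x; a) * N(x; b) = k * N(x; mu_hat, var_hat)."""
    k = gauss(mu_a, mu_b, var_a + var_b)
    var_hat = 1.0 / (1.0 / var_a + 1.0 / var_b)
    mu_hat = var_hat * (mu_a / var_a + mu_b / var_b)
    return k, mu_hat, var_hat


mu_a, var_a = 0.3, 0.01          # narrow likelihood mode (illustrative values)
mu_b, var_b = 0.0, 0.05 + 1.0    # prior component: previous variance plus noise variance
k, mu_hat, var_hat = gaussian_product(mu_a, var_a, mu_b, var_b)

x = 0.35
lhs = gauss(x, mu_a, var_a) * gauss(x, mu_b, var_b)
rhs = k * gauss(x, mu_hat, var_hat)
```

With a prior this much wider than the likelihood mode, `mu_hat` and `var_hat` stay close to `mu_a` and `var_a`, so the product is nearly the likelihood Gaussian scaled by `k`.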
Since we assume that the noise covariance is much
greater than the covariance of the likelihood modes, the following
is true:
$$C_i^t + C_\eta \approx C_\eta$$
$$(C_i^t)^{-1} + (C_\eta)^{-1} \approx (C_i^t)^{-1}$$
The product can be approximated as
$$N(x^t; \mu_i^t, C_i^t)\, N(x^t; \mu_j^{t-1}, C_j^{t-1} + C_\eta) \approx N(\mu_i^t; \mu_j^{t-1}, C_\eta)\, N(x^t; \mu_i^t, C_i^t) \qquad (5)$$
and the posterior distribution is reduced to
$$p(x^t|y^0, y^1, \ldots, y^t) \approx \frac{1}{\sum_{i}^{L} w_i^t} \sum_{i=1}^{L} w_i^t\, N(x^t; \mu_i^t, C_i^t), \qquad (6)$$
$$w_i^t = \hat{w}_i^t \sum_{j=1}^{K} w_j^{t-1}\, N(\mu_i^t; \mu_j^{t-1}, C_\eta)$$
Intuitively, we can expect that the wide temporal prior
does not vary much over the region of support of each Gaus-
sian in the likelihood, and the posterior distribution is then
the mixture of the same Gaussians but with their weights
modified by the probabilities assigned to their means by the
temporal prior.
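The reweighting of eq. (6) can be sketched in one dimension as follows. The mixtures are plain lists of `(weight, mean, variance)` triples, and the mode locations and noise variance in the example are made up for illustration.

```python
import math


def gauss(x, mu, var):
    """1-D Gaussian density N(x; mu, var), with var a variance."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def reweight(lik_modes, prev_posterior, var_eta):
    """Eq. (6): keep the likelihood Gaussians, rescale their weights by the wide prior.

    lik_modes:      list of (w_hat, mu, var) for the current likelihood modes
    prev_posterior: list of (w, mu, var) for the previous posterior modes
    var_eta:        variance of the diffusion noise
    """
    raw = []
    for w_hat, mu, _ in lik_modes:
        prior_at_mu = sum(w * gauss(mu, mu_prev, var_eta)
                          for w, mu_prev, _ in prev_posterior)
        raw.append(w_hat * prior_at_mu)
    total = sum(raw)
    return [(r / total, mu, var) for r, (_, mu, var) in zip(raw, lik_modes)]


likelihood = [(0.5, 0.2, 0.01), (0.5, 3.0, 0.01)]  # two equally strong likelihood modes
previous = [(1.0, 0.0, 0.02)]                      # single posterior mode at the last frame
posterior = reweight(likelihood, previous, var_eta=1.0)
```

Both likelihood modes survive with their means and variances unchanged. The mode near the previous estimate receives almost all of the weight, while the distant one remains available should later frames support it.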
4. Implementation and Experiments
In order to validate our approach, we performed various ex-
periments to compare our algorithm (ELMO) against both
its component algorithms PSH and ICP, as well as the par-
ticle filtering method Condensation [12].
The feature space over which PSH hash functions were
constructed consisted of concatenated multiscale edge di-
rection histograms (EDH) as in [16]. The EDH of an image
is computed by applying an edge detector, assigning each
edge pixel to one of the fixed directional bins (four in our
case), counting the number of edge pixels for each direc-
tion falling in each of a number of subwindows of various
sizes taken at various locations, and finally concatenating
the obtained counts in a single feature vector. For images of
200 by 200 pixels used in our database, with 3 scales (8, 16
and 32 pixels) and with location step size of half the scale,
the EDH consisted of N = 13,076 bins. We then selected
M = 3547 features for which the true-positive rate [16]
was above 0.65 and the true-positive/false-positive gap was
at least 0.1. The data were then indexed by l = 50 hash
tables with k = 18 bit keys. For every frame, we retrieve
K = 50 training examples and use their poses to initialize
the ICP.
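The stated bin count can be reproduced with a short calculation. The counting convention below is an assumption, since the text does not spell out how image borders are handled: window origins are placed at every multiple of the step along each axis, and windows near the border may extend past the image.

```python
def edh_bin_count(image_size=200, scales=(8, 16, 32), n_directions=4):
    """Count EDH bins, assuming window origins at every multiple of scale // 2."""
    total_windows = 0
    for scale in scales:
        step = scale // 2
        per_axis = image_size // step  # origins 0, step, 2*step, ... below image_size
        total_windows += per_axis ** 2
    return total_windows * n_directions


n_bins = edh_bin_count()  # (50**2 + 25**2 + 12**2) * 4 = 13076
```

Under this convention the three scales contribute 2500, 625 and 144 windows, and four direction bins per window give the 13,076 features quoted above.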
Figure 3: Example of color and disparity images used in the
synthetic sequences.
The labeled pose database indexed by PSH in our system
consists of 60,000 images of humanoid models in randomly
sampled poses created with Poser [6]. The models
were constrained to an upright posture, but the articulation
in the upper limbs as well as the orientation of the torso was
constrained only by anatomical feasibility. We rendered the
images from a viewpoint consistent with the camera settings
of the tracker, and for each image saved the articulated pose
information (3D locations of key body joints: neck, shoul-
ders, elbows etc.). Pose similarity when training PSH was
defined as less than 5 cm difference between any two joints.
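One way to read this similarity criterion is sketched below. It assumes that "difference between any two joints" means the distance between each pair of corresponding joints in the two poses, and that poses are lists of 3D joint positions in centimetres. The function name and the example coordinates are hypothetical.

```python
import math


def poses_similar(pose_a, pose_b, max_joint_dist_cm=5.0):
    """True if every pair of corresponding 3D joints is closer than the threshold."""
    return all(math.dist(a, b) < max_joint_dist_cm for a, b in zip(pose_a, pose_b))


# Two joints per pose (e.g. neck and one elbow), coordinates in cm.
reference = [(0.0, 150.0, 0.0), (30.0, 120.0, 5.0)]
small_move = [(1.0, 151.0, 0.0), (32.0, 121.0, 5.0)]
large_move = [(1.0, 151.0, 0.0), (40.0, 120.0, 5.0)]

similar = poses_similar(reference, small_move)
dissimilar = poses_similar(reference, large_move)
```

Pairs labeled this way would provide the positive and negative examples from which PSH selects its hash functions.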
4.1. Synthetic Sequences
The first set of experiments evaluates the ground truth error
relative to an extensive set of synthetic sequences.
Testing data consisted of a collection of synthetic se-
quences of people performing various kinds of activities
(e.g. walking, playing sports, greeting). The synthetic
sequences were generated from motion capture data taken
from a public website¹ and rendered using Poser [6] to produce
stereo image pairs. Then, standard correlation-based
stereo was performed on the image pairs to produce a “real-
istic” disparity image as shown in Figure 3.
Some of the sequences contain many challenges for ar-
ticulated tracking algorithms, including perspective effects
(e.g. images taken from a 45 degree angle, hands mov-
ing very close to the camera), multiple self-occlusions (e.g.
body turned on the side, completely hiding one of the arms),
partial visibility (e.g. arms out of the field of view of the
camera) and fast motions. Also note that the synthetic se-
quences have been rendered with characters and features
different from the ones used in the PSH training set.
The synthetic sequences’ images were used as input for
the Condensation, PSH, ICP, and ELMO algorithms. The
Condensation algorithm was implemented as described in
[12] and run using N = 1000 particles. We use the same
likelihood function for Condensation and ELMO. The PSH
and ICP algorithms were implemented following [16]² and
[8] respectively. We fixed the number of candidates returned
by PSH to 50 and computed the pose as the candidate
with highest likelihood. Note that in order to run
the ICP and Condensation algorithms, the articulated model
¹ http://www.mocapdata.com
² Except that we omitted the local regression step.

References
Journal ArticleDOI

A method for registration of 3-D shapes

TL;DR: In this paper, the authors describe a general-purpose representation-independent method for the accurate and computationally efficient registration of 3D shapes including free-form curves and surfaces, based on the iterative closest point (ICP) algorithm, which requires only a procedure to find the closest point on a geometric entity to a given point.
Journal ArticleDOI

CONDENSATION - Conditional Density Propagation for Visual Tracking

TL;DR: The Condensation algorithm uses “factored sampling”, previously applied to the interpretation of static images, in which the probability distribution of possible interpretations is represented by a randomly generated set.
Proceedings Article

Similarity Search in High Dimensions via Hashing

TL;DR: Experimental results indicate that the novel scheme for approximate similarity search based on hashing scales well even for a relatively large number of dimensions, and provides experimental evidence that the method gives improvement in running time over other methods for searching in highdimensional spaces based on hierarchical tree decomposition.
Proceedings ArticleDOI

Articulated body motion capture by annealed particle filtering

TL;DR: The principal contribution of the paper is the development of a modified particle filter for search in high dimensional configuration spaces that uses a continuation principle based on annealing to introduce the influence of narrow peaks in the fitness function, gradually.
Proceedings ArticleDOI

Fast pose estimation with parameter-sensitive hashing

TL;DR: A new algorithm is introduced that learns a set of hashing functions that efficiently index examples relevant to a particular estimation task, and can rapidly and accurately estimate the articulated pose of human figures from a large database of example images.