
HAL Id: hal-00976387
https://hal.archives-ouvertes.fr/hal-00976387
Submitted on 9 Apr 2014
PixelTrack: a fast adaptive algorithm for tracking non-rigid objects
Stefan Duffner, Christophe Garcia

To cite this version:
Stefan Duffner, Christophe Garcia. PixelTrack: a fast adaptive algorithm for tracking non-rigid objects. International Conference on Computer Vision (ICCV 2013), Dec 2013, Sydney, Australia. pp. 2480-2487. ⟨hal-00976387⟩

PixelTrack: a fast adaptive algorithm for tracking non-rigid objects
Stefan Duffner and Christophe Garcia
Université de Lyon, CNRS
INSA-Lyon, LIRIS, UMR5205, F-69621, France
stefan.duffner@liris.cnrs.fr, christophe.garcia@liris.cnrs.fr
Abstract
In this paper, we present a novel algorithm for fast tracking of generic objects in videos. The algorithm uses two components: a detector that makes use of the generalised Hough transform with pixel-based descriptors, and a probabilistic segmentation method based on global models for foreground and background. These components are used for tracking in a combined way, and they adapt each other in a co-training manner. Through effective model adaptation and segmentation, the algorithm is able to track objects that undergo rigid and non-rigid deformations and considerable shape and appearance variations. The proposed tracking method has been thoroughly evaluated on challenging standard videos, and outperforms state-of-the-art tracking methods designed for the same task. Finally, the proposed models allow for an extremely efficient implementation, and thus tracking is very fast.
1. Introduction
Given a video stream, tracking arbitrary objects that are non-rigid, moving or static, rotating and deforming, partially occluded, under changing illumination and without any prior knowledge is a challenging task. This unconstrained tracking problem, where the object model is initialised from a bounding box in the first video frame and continuously adapted, has been increasingly addressed in the literature in recent years. When no prior knowledge about the object's shape and appearance is available, one of the main difficulties is to incrementally learn a robust model from consecutive video frames. This model should generalise to new, unseen appearances and avoid drift, i.e. the gradual inclusion of background appearance, which can ultimately lead to tracking failure.

Our method addresses these issues with an adaptive approach combining a detector based on pixel-based descriptors and a probabilistic segmentation framework.
1.1. Related Work
Earlier works [21, 11, 32, 30, 41, 18] on visual object tracking mostly consider a bounding box (or some other simple geometric model) representation of the object to track, and often a global appearance model is used. These classical methods are very robust to some degree of appearance change and local deformations (as in face tracking), and also allow for a fast implementation. However, for tracking non-rigid objects that undergo a large amount of deformation and appearance variation, e.g. due to occlusions or illumination changes, these approaches are less suitable. Although some algorithms effectively cope with object deformations by tracking their contour (e.g. [31, 42, 10]), most of them require the object to be moving or need prior shape knowledge [12]. Others describe an object by a relatively dense set of keypoints that are matched in each frame [26, 19] to track the object. However, these methods have mostly been applied to relatively rigid objects.
Many existing methods follow a tracking-by-detection approach, where a discriminative model of the object to track is built and updated “online”, i.e. during the tracking, in order to adapt to possible appearance changes. For example, Adam et al. [1] use a patch-based appearance model with integral histograms of colour and intensity. The dynamic patch template configuration allows the model to capture spatial structure and to be robust to partial occlusions. Grabner et al. [16] proposed an Online Adaboost (OAB) learning algorithm that dynamically selects weak classifiers that discriminate between the object image region and the background. Later, they extended this method to a semi-supervised algorithm [17] that uses a fixed (or adaptive [39]) prior model to avoid drift and an online boosting framework learning with unlabelled data. Babenko et al. [4] presented another online method based on Multiple Instance Learning (MIL), where the positive training examples are bags of image patches containing at least one positive (object) image patch. Besides boosting algorithms, Online Random Forests have been proposed for adaptive visual object tracking [36, 37], where randomised trees are incrementally grown to classify an image region as object or background. Kalal et al. [22] also use randomised forests, which they combine effectively with a Lucas-Kanade tracker in a framework called Tracking-Learning-Detection (TLD), where the tracker updates the detector using spatial and temporal constraints and the detector re-initialises the tracker in case of drift.
In order to cope with changing appearance, Mei et al. [28] introduced the l1 tracker that is based on a sparse set of appearance templates that are collected during tracking and used in the observation model of a particle filter. Recently, several extensions have been proposed [6, 20, 43] to improve the robustness and reduce the complexity of this method. However, these approaches are still relatively time-consuming due to the complex l1 minimisation. A sparse set of templates has also been used by Liu et al. [27], but with smaller image patches of object parts, and by Kwon et al. [23] in their Visual Tracking Decomposition (VTD) method. In a similar spirit, Ross et al. [34] propose a particle filter algorithm called IVT that uses an observation model relying on the eigenbasis of image patches computed online using an incremental PCA algorithm. Other approaches, more similar to ours, use a pixel-based classifier [3, 9]. Avidan et al. [3], for example, proposed an ensemble tracking method that labels each pixel as foreground or background with an Adaboost algorithm that is updated online. However, all of these methods still operate more or less on image regions described by bounding boxes and inherently have difficulty tracking objects undergoing large deformations.
To overcome this problem, recent approaches integrate some form of segmentation into the tracking process. For example, Nejhum et al. [29] proposed to track articulated objects with a set of independent rectangular blocks that are used in a refinement step to segment the object with a graph-cut algorithm. Similarly, although not segmenting the object, Kwon et al. [24] handle deforming objects by tracking configurations of a dynamic set of image patches, and they use Basin Hopping Monte Carlo (BHMC) sampling to reduce the computational complexity. Other approaches [33, 40] use a segmentation on the superpixel level. Bibby et al. [8] propose an adaptive probabilistic framework separating the tracking of non-rigid objects into registration and level-set segmentation, where posterior probabilities are computed at the pixel level. Aeschliman et al. [2] also combined tracking and segmentation in a Bayesian framework, where pixel-wise likelihood distributions of several objects and the background are modelled by Gaussian functions whose parameters are learnt online. In a different application context, pixel-based descriptors have also been used for 3D articulated human-body detection and tracking by Shotton et al. [38] on segmented depth images. In the approach recently proposed by Belagiannis et al. [7], a graph-cut segmentation is applied separately to the image patches provided by a particle filter.
The work of Godec et al. [15] is probably the most similar to ours. The authors proposed a patch-based voting algorithm with Hough forests [14]. By back-projecting the patches that voted for the object centre, the authors initialise a graph-cut algorithm to segment foreground from background. The resulting segmentation is then used to update the patches' foreground and background probabilities in the Hough forest. This method achieves state-of-the-art tracking results on many challenging benchmark videos. However, due to the graph-cut segmentation, it is relatively slow. Also, the segmentation is discrete and binary, which can increase the risk of drift due to wrongly segmented image regions.
1.2. Motivation
The algorithm presented in this paper is inspired by these recent works on combined tracking and segmentation, which is beneficial for tracking non-rigid objects. Furthermore, patch-based local descriptors have shown state-of-the-art performance due to their ability to handle appearance changes with large object deformations.

In this paper, we integrate these concepts and present a novel tracking-by-detection algorithm that relies on a Hough-voting scheme based on pixel descriptors. The method tightly integrates with a probabilistic segmentation of foreground and background that is used to incrementally update the local pixel-based descriptors and vice versa. The local Hough-voting model and the global colour model both operate at the pixel level and thus allow for very efficient model representation and inference.

The following scientific contributions are made:
• a fast tracking algorithm using a detector based on the generalised Hough transform and pixel descriptors in combination with a probabilistic soft segmentation method,
• a co-training framework, where the local pixel-based detector model is used to update the global segmentation model and vice versa,
• a thorough performance evaluation on standard datasets and a comparison with other state-of-the-art tracking algorithms.

Note that the main goal of the proposed approach is to provide an accurate position estimate of an object to track in a video. Here, we are not so much interested in a perfect segmentation of the object from the background. Better approaches for segmentation exist in the literature, but they are rather complex. Thus, instead of using these, we aimed at reducing the overall computational complexity.
In the following, we first describe the overall approach (Section 2). Then we detail each of the components of the algorithm (Sections 3-5). Finally, experimental results are presented (Section 7).

Figure 1. The overall tracking procedure for one video frame. Each pixel inside the search window (blue dotted rectangle) in the input image casts a vote (1) according to the current Hough transform model (darker: high vote sum, lighter: low vote sum). Then the maximum vote sum (circled in red) and all pixels that have contributed (i.e. the backprojection) are determined (2). In parallel, the current segmentation model is used to segment the image region inside the search window (3) (binarised segmentation shown in red). Finally, after updating the object's current position (4), the segmentation model is adapted (5) using the backprojected image pixels, and the Hough transform model is updated (6) with foreground pixels from the segmentation and background pixels from a region around the object (blue frame).
2. Overall Approach
The overall procedure of detection, segmentation, tracking, and model adaptation for one video frame is illustrated in Fig. 1. The algorithm receives as input the current video frame as well as the bounding box and segmentation from the tracking result of the previous frame. The pixel-based Hough transform is applied to each pixel inside the search window (the enlarged bounding box), i.e. each pixel votes for the centre position of the object according to the learnt model. Votes are accumulated in a common reference frame, the voting map, and the position with the highest sum of votes determines the most likely position of the object's centre (see Section 3 for a more detailed explanation). Then, the pixels that have contributed to the maximum vote are determined. This process is called backprojection. In parallel, the image region corresponding to the search window is segmented using the current segmentation model. That is, each pixel is assigned a foreground probability (see Section 4). The position of the tracked object is updated using the maximum vote position and the centre of mass of the segmentation output (see Section 5). Finally, the models are adapted in a co-training manner to avoid drift. That means the segmentation model is updated using the backprojection, and the pixel-based Hough model is adapted according to the segmentation output (see Section 6 for more details).
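To make the data flow of these six steps concrete, here is a schematic per-frame sketch. It is illustrative only: the helper names (enlarge, hough_vote, backproject, segment and the two update functions) are hypothetical stand-ins for the components detailed in the following sections, not the authors' API.

```python
import numpy as np

def track_frame(frame, model, prev_box, prev_seg):
    window = enlarge(prev_box)                          # search window
    vote_map, votes = hough_vote(frame, model, window)  # (1) pixel voting
    x_max = np.unravel_index(vote_map.argmax(), vote_map.shape)  # (row, col)
    bproj = backproject(votes, x_max)                   # (2) contributing pixels
    fg_prob = segment(frame, model, window, prev_seg)   # (3) soft segmentation
    box = update_position(x_max, fg_prob)               # (4) fuse both estimates
    update_segmentation_model(model, bproj)             # (5) co-training:
    update_hough_model(model, fg_prob)                  # (6) models adapt each other
    return box, fg_prob
```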
3. Pixel-based Hough Voting
We developed a new detection algorithm relying on the generalised Hough transform [5]. In contrast to existing models developed recently for similar tasks (e.g. [14, 15]), which use Hough forests, i.e. Random Forests trained on small image patches, or the Implicit Shape Model (ISM) [25], our method operates at the pixel level.

This has the following advantages:
• pixel-based descriptors are more suitable for detecting objects that are extremely small in the image (e.g. for far-field vision),
• the feature space is relatively small and depends little (or not at all) on the spatial neighbourhood, which makes training and updating the model easier and more coherent with the object's appearance changes,
• training and applying the detector is extremely fast, as both can be implemented with look-up tables.
One drawback of a pixel-based Hough model arises when the object's image region contains primarily pixels of very similar colours (and gradients). In that case, the pixels on their own may not be discriminative enough to infer the object's centre position. Note that patch-based methods also have difficulties with uniform regions. In practice, however, this is rarely the case, and it may be controlled by increasing the discriminative power of the descriptors (at the cost of invariance). Also, in this tracking framework, this risk is considerably reduced by combining the detector with the segmentation output.

Let us now consider the model creation and application in detail. Figure 2 illustrates the model creation (training) and detection process.
3.1. Training
Let us denote by $x = (x_1, x_2)$ the position of a pixel $I(x)$ in an image $I$. In the training image, the pixels inside a given initial bounding box are quantised according to the vector composed of their HSV colour values (with separate V quantisation) and their x and y gradient orientation (with a separate quantisation for low gradient magnitudes) (see Fig. 2, left). Experiments showed that colour alone also works well, but gradient alone does not. This amounts to computing an N-dimensional histogram $D = (D_z)$, $z = 1, \ldots, N$, which is referred to as the pixel-based Hough model in this paper. Here, we use $N = (16 \times 16 + 16) \times (8 + 1) = 2448$ (16 colour bins and 8 gradient orientations). The vectors $D_z = \{d_z^1, \ldots, d_z^{M_z}\}$ contain $M_z$ displacement vectors $d_z^m = (x_{zm}, w_{zm})$, each associated with a weight $w_{zm} = 1.0$. Thus, training consists in constructing $D$, where each pixel $I(x)$ in the given bounding box produces a displacement vector $d_z$ (arrows in Fig. 2) corresponding to its quantised value $z_x$ and pointing to the centre of the bounding box.

Figure 2. Training and detection with the pixel-based Hough model. Left: the model D is constructed by storing, for each quantised pixel value in the given bounding box, all the displacement vectors to the object's centre position (here only colour is used for illustration). Right: the object is detected in a search window by accumulating the displacement votes of each pixel in a voting map (bright pixels: many votes, dark pixels: few votes).
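A minimal, runnable sketch of this training step is given below, assuming one particular realisation of the quantisation: 16×16 H/S bins plus 16 separate V bins for low-saturation pixels, and 8 gradient-orientation bins plus one bin for low gradient magnitudes, giving N = (16 × 16 + 16) × (8 + 1) = 2448 as in the text. The bin boundaries and thresholds are illustrative assumptions, not the authors' exact choices.

```python
from collections import defaultdict
import numpy as np
import cv2  # used only for colour conversion and gradients

def quantise(img_bgr, sat_thresh=32, mag_thresh=20.0):
    """Return the quantised value z in [0, 2448) for every pixel."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(int)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    hs = (h * 16 // 180) * 16 + (s * 16 // 256)                 # 16x16 H/S bins
    colour = np.where(s < sat_thresh, 256 + v * 16 // 256, hs)  # 16 separate V bins
    grey = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gx = cv2.Sobel(grey, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(grey, cv2.CV_32F, 0, 1)
    ori = ((np.arctan2(gy, gx) % np.pi) / np.pi * 8).astype(int) % 8
    grad = np.where(np.hypot(gx, gy) < mag_thresh, 8, ori)  # extra low-magnitude bin
    return colour * 9 + grad                                 # z in [0, 272 * 9)

def train(img_bgr, box):
    """Build D: for each z, the displacements to the box centre, weight 1.0."""
    x0, y0, w, h = box
    cx, cy = x0 + w // 2, y0 + h // 2
    z = quantise(img_bgr)
    D = defaultdict(list)
    for py in range(y0, y0 + h):
        for px in range(x0, x0 + w):
            D[z[py, px]].append(((cx - px, cy - py), 1.0))
    return D
```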
3.2. Detection
In a new video frame, the object can be detected by letting each pixel $I(x)$ inside the search window vote according to the $D_z$ corresponding to its quantised value $z_x$. The right part of Fig. 2 illustrates this. Each vote is a list of displacements $d_z^m$ that are weighted by $w_{zm}$ and accumulated in a voting map. The detector's output is then simply the position in the voting map with the maximum value, $x_{max}$.

Note that, as illustrated in Fig. 2, the position estimate is “diffused” by two factors: the deformation of the object (one of the green pixels in the figure), and pixels of the same colour (green and blue pixels). But the maximum value in the voting map is still distinctive and corresponds well to the centre position of the object. This could also be observed in our experiments.

Nevertheless, to be robust to very small deformations, we group the votes in small voting cells of 3×3 pixels (as in [15]).
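Continuing the sketch above (and reusing its illustrative quantise() and model D), the voting step could look as follows; the 3×3 cell pooling mirrors the grouping just described, while the window handling is simplified.

```python
def detect(img_bgr, D, window):
    """Accumulate weighted displacement votes and return the maximum position."""
    wx, wy, ww, wh = window
    z = quantise(img_bgr)
    H, W = z.shape
    vote_map = np.zeros((H, W), np.float32)
    for py in range(wy, wy + wh):
        for px in range(wx, wx + ww):
            for (dx, dy), wgt in D.get(z[py, px], ()):
                vx, vy = px + dx, py + dy
                if 0 <= vx < W and 0 <= vy < H:
                    vote_map[vy, vx] += wgt           # cast a weighted vote
    # pool votes into 3x3 cells for robustness to small deformations
    cells = vote_map[:H - H % 3, :W - W % 3]
    cells = cells.reshape(H // 3, 3, W // 3, 3).sum(axis=(1, 3))
    cy, cx = np.unravel_index(cells.argmax(), cells.shape)
    return (cx * 3 + 1, cy * 3 + 1), vote_map         # centre of winning cell
```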
3.3. Backprojection
With the position of the maximum in the voting map, $x_{max}$, we can determine which pixels in the search window contributed to it during the detection. This process is illustrated in Fig. 1 and is called backprojection. More precisely, let $z$ be the value of pixel $I(x)$. Then, the backprojection $b$ at each position $x$ is defined as:

$$b_x = \begin{cases} w_{zm} & \text{if } d_z^m \text{ voted for } x_{max}, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
The backprojected pixels are used for adapting the segmentation model (see Section 6 for more details). The idea behind this is that, intuitively, pixels that contributed to $x_{max}$ most likely correspond to the object.
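A sketch of Eq. (1) under the same assumptions as the previous snippets: each window pixel receives the weight of its vote if one of its displacements lands in the winning 3×3 cell, and 0 otherwise.

```python
def backproject(img_bgr, D, window, x_max, cell=3):
    """Per-pixel backprojection b_x of Eq. (1)."""
    wx, wy, ww, wh = window
    mx, my = x_max
    z = quantise(img_bgr)
    b = np.zeros(z.shape, np.float32)
    for py in range(wy, wy + wh):
        for px in range(wx, wx + ww):
            for (dx, dy), wgt in D.get(z[py, px], ()):
                # did this displacement vote for (the cell around) x_max?
                if abs(px + dx - mx) <= cell // 2 and abs(py + dy - my) <= cell // 2:
                    b[py, px] = wgt
                    break
    return b
```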
4. Segmentation
Complementary to the local pixel-wise Hough model, a global segmentation model is trained and adapted to allow for varying object shapes and for a better discrimination between foreground (object) and background, especially when the shape and appearance change drastically and abruptly.

A probabilistic soft segmentation approach is adopted here (similar to [2]). Let $c_{t,x} \in \{0, 1\}$ be the class of the pixel at position $x$ at time $t$: 0 for background, and 1 for foreground, and let $y_{1:t,x}$ be the pixel's colour observations from time 1 to $t$. For clarity, we drop the index $x$ in the following. In order to incorporate the segmentation of the previous video frame at time $t-1$ and to make the estimation more robust, we use a recursive Bayesian formulation, where, at time $t$, each pixel (in the search window) is assigned the foreground probability:

$$p(c_t = 1 \mid y_{1:t}) = Z^{-1}\, p(y_t \mid c_t = 1) \sum_{c'_{t-1}} p(c_t = 1 \mid c'_{t-1})\, p(c'_{t-1} \mid y_{1:t-1}) \,, \qquad (2)$$
where $Z$ is a normalisation constant to make the probabilities sum to 1. The distributions $p(y_t \mid c_t)$ are modelled with HSV colour histograms with $12 \times 12$ bins for the H and S channels and 12 separate bins for the V channel. The foreground histogram is initialised from the image region defined by the bounding box around the object in the first frame. The background histogram is initialised from the image region surrounding this rectangle (with some margin between). The transition probabilities for foreground and background are set to:

$$p(c_t = 0 \mid c_{t-1}) = 0.6 \,, \qquad p(c_t = 1 \mid c_{t-1}) = 0.4 \,, \qquad (3)$$

which is an empirical choice that has been validated experimentally. Note that the tracking algorithm is not very sensitive to these parameters.
As opposed to recent work on image segmentation (e.g. [35]), we treat each pixel independently, which, in general, leads to a less regularised solution but at the same time reduces the computational complexity considerably. As stated in Section 1.2, we are not so much interested here in a perfectly “clean” segmentation but rather in fast and robust tracking of the position of an object.
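A minimal sketch of one recursive update of Eq. (2), assuming the per-pixel class-conditional likelihoods have already been looked up in the two HSV histograms. The transition values follow Eq. (3); expressing them as a generic matrix is an illustrative choice, not the authors' code.

```python
import numpy as np

def update_fg_prob(lik_fg, lik_bg, prev_fg, trans=((0.6, 0.4), (0.6, 0.4))):
    """One step of Eq. (2) for a whole window of pixels.

    lik_fg, lik_bg: p(y_t | c_t = 1) and p(y_t | c_t = 0) per pixel
    prev_fg:        p(c_{t-1} = 1 | y_{1:t-1}) per pixel
    trans[i][j]:    p(c_t = j | c_{t-1} = i), values as in Eq. (3)
    """
    prev_bg = 1.0 - prev_fg
    # prediction: marginalise over the previous class c_{t-1}
    pred_fg = prev_bg * trans[0][1] + prev_fg * trans[1][1]
    pred_bg = prev_bg * trans[0][0] + prev_fg * trans[1][0]
    # correction and per-pixel normalisation (the constant Z)
    post_fg = lik_fg * pred_fg
    post_bg = lik_bg * pred_bg
    return post_fg / (post_fg + post_bg + 1e-12)
```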
