
HAL Id: hal-00976387
https://hal.archives-ouvertes.fr/hal-00976387
Submitted on 9 Apr 2014
PixelTrack: a fast adaptive algorithm for tracking non-rigid objects
Stefan Duffner, Christophe Garcia

To cite this version:
Stefan Duffner, Christophe Garcia. PixelTrack: a fast adaptive algorithm for tracking non-rigid objects. International Conference on Computer Vision (ICCV 2013), Dec 2013, Sydney, Australia. pp. 2480-2487. ⟨hal-00976387⟩

PixelTrack: a fast adaptive algorithm for tracking non-rigid objects
Stefan Duffner and Christophe Garcia
Université de Lyon, CNRS
INSA-Lyon, LIRIS, UMR5205, F-69621, France
stefan.duffner@liris.cnrs.fr, christophe.garcia@liris.cnrs.fr
Abstract
In this paper, we present a novel algorithm for fast tracking of generic objects in videos. The algorithm uses two components: a detector that makes use of the generalised Hough transform with pixel-based descriptors, and a probabilistic segmentation method based on global models for foreground and background. These components are used for tracking in a combined way, and they adapt each other in a co-training manner. Through effective model adaptation and segmentation, the algorithm is able to track objects that undergo rigid and non-rigid deformations and considerable shape and appearance variations. The proposed tracking method has been thoroughly evaluated on challenging standard videos, and outperforms state-of-the-art tracking methods designed for the same task. Finally, the proposed models allow for an extremely efficient implementation, and thus tracking is very fast.
1. Introduction
Given a video stream, tracking arbitrary objects that are non-rigid, moving or static, rotating and deforming, partially occluded, under changing illumination and without any prior knowledge is a challenging task. This unconstrained tracking problem, where the object model is initialised from a bounding box in the first video frame and continuously adapted, has been increasingly addressed in the literature in recent years. When no prior knowledge about the object's shape and appearance is available, one of the main difficulties is to incrementally learn a robust model from consecutive video frames. This model should generalise to new, unseen appearances and avoid drift, i.e. the gradual inclusion of background appearance, which can ultimately lead to tracking failure.

Our method addresses these issues with an adaptive approach combining a detector based on pixel-based descriptors and a probabilistic segmentation framework.
1.1. Related Work
Earlier works [21, 11, 32, 30, 41, 18] on visual object tracking mostly consider a bounding box (or some other simple geometric model) representation of the object to track, and often a global appearance model is used. These classical methods are very robust to some degree of appearance change and local deformations (as in face tracking), and also allow for a fast implementation. However, for tracking non-rigid objects that undergo a large amount of deformation and appearance variation, e.g. due to occlusions or illumination changes, these approaches are less suitable. Although some algorithms effectively cope with object deformations by tracking their contour (e.g. [31, 42, 10]), most of them require the object to be moving or need prior shape knowledge [12]. Others describe an object by a relatively dense set of keypoints that are matched in each frame [26, 19] to track the object. However, these methods have mostly been applied to relatively rigid objects.
Many existing methods follow a tracking-by-detection approach, where a discriminative model of the object to track is built and updated “online”, i.e. during the tracking, in order to adapt to possible appearance changes. For example, Adam et al. [1] use a patch-based appearance model with integral histograms of colour and intensity. The dynamic patch template configuration allows the model to capture spatial structure and to be robust to partial occlusions. Grabner et al. [16] proposed an Online Adaboost (OAB) learning algorithm that dynamically selects weak classifiers that discriminate between the object image region and the background. Later, they extended this method to a semi-supervised algorithm [17] that uses a fixed (or adaptive [39]) prior model to avoid drift and an online boosting framework learning with unlabelled data. Babenko et al. [4] presented another online method based on Multiple Instance Learning (MIL), where the positive training examples are bags of image patches containing at least one positive (object) image patch. Besides boosting algorithms, Online Random Forests have been proposed for adaptive visual object tracking [36, 37], where randomised trees are incrementally grown to classify an image region as object or background. Kalal et al. [22] also use randomised forests, which they combine effectively with a Lucas-Kanade tracker in a framework called Tracking-Learning-Detection (TLD), where the tracker updates the detector using spatial and temporal constraints and the detector re-initialises the tracker in case of drift.
In order to cope with changing appearance, Mei et al. [28] introduced the l1 tracker that is based on a sparse set of appearance templates that are collected during tracking and used in the observation model of a particle filter. Recently, several extensions have been proposed [6, 20, 43] to improve the robustness and reduce the complexity of this method. However, these approaches are still relatively time-consuming due to the complex l1 minimisation. A sparse set of templates has also been used by Liu et al. [27], but with smaller image patches of object parts, and by Kwon et al. [23] in their Visual Tracking Decomposition (VTD) method. In a similar spirit, Ross et al. [34] propose a particle filter algorithm called IVT that uses an observation model relying on the eigenbasis of image patches computed online using an incremental PCA algorithm. Other approaches, more similar to ours, use a pixel-based classifier [3, 9]. Avidan et al. [3], for example, proposed an ensemble tracking method that labels each pixel as foreground or background with an Adaboost algorithm that is updated online. However, all of these methods still operate more or less on image regions described by bounding boxes and inherently have difficulty tracking objects undergoing large deformations.
To overcome this problem, recent approaches integrate some form of segmentation into the tracking process. For example, Nejhum et al. [29] proposed to track articulated objects with a set of independent rectangular blocks that are used in a refinement step to segment the object with a graph-cut algorithm. Similarly, although not segmenting the object, Kwon et al. [24] handle deforming objects by tracking configurations of a dynamic set of image patches, and they use Basin Hopping Monte Carlo (BHMC) sampling to reduce the computational complexity. Other approaches [33, 40] use a segmentation on the superpixel level. Bibby et al. [8] propose an adaptive probabilistic framework separating the tracking of non-rigid objects into registration and level-set segmentation, where posterior probabilities are computed at the pixel level. Aeschliman et al. [2] also combined tracking and segmentation in a Bayesian framework, where pixel-wise likelihood distributions of several objects and the background are modelled by Gaussian functions whose parameters are learnt online. In a different application context, pixel-based descriptors have also been used for 3D articulated human-body detection and tracking by Shotton et al. [38] on segmented depth images. In the approach recently proposed by Belagiannis et al. [7], a graph-cut segmentation is applied separately to the image patches provided by a particle filter.
The work of Godec et al. [15] is probably the most similar to ours. The authors proposed a patch-based voting algorithm with Hough forests [14]. By back-projecting the patches that voted for the object centre, the authors initialise a graph-cut algorithm to segment foreground from background. The resulting segmentation is then used to update the patches' foreground and background probabilities in the Hough forest. This method achieves state-of-the-art tracking results on many challenging benchmark videos. However, due to the graph-cut segmentation, it is relatively slow. Also, the segmentation is discrete and binary, which can increase the risk of drift due to wrongly segmented image regions.
1.2. Motivation
The algorithm presented in this paper is inspired by these recent works on combined tracking and segmentation, which is beneficial for tracking non-rigid objects. Furthermore, patch-based local descriptors have shown state-of-the-art performance due to their ability to handle appearance changes with large object deformations.

In this paper, we integrate these concepts and present a novel tracking-by-detection algorithm that relies on a Hough-voting scheme based on pixel descriptors. The method tightly integrates with a probabilistic segmentation of foreground and background that is used to incrementally update the local pixel-based descriptors and vice versa. The local Hough-voting model and the global colour model both operate at the pixel level and thus allow for very efficient model representation and inference.

The following scientific contributions are made:
• a fast tracking algorithm using a detector based on the generalised Hough transform and pixel descriptors in combination with a probabilistic soft segmentation method,
• a co-training framework, where the local pixel-based detector model is used to update the global segmentation model and vice versa,
• a thorough performance evaluation on standard datasets and a comparison with other state-of-the-art tracking algorithms.

Note that the main goal of the proposed approach is to provide an accurate position estimate of an object to track in a video. Here, we are not so much interested in a perfect segmentation of the object from the background. Better approaches for segmentation exist in the literature, but they are rather complex. Thus, instead of using these, we aimed at reducing the overall computational complexity.
In the following, we first describe the overall approach (Section 2). Then we detail each of the components of the algorithm (Sections 3-5). Finally, experimental results are presented (Section 7).

Figure 1. The overall tracking procedure for one video frame. Each pixel inside the search window (blue dotted rectangle) in the input image casts a vote (1) according to the current Hough transform model (darker: high vote sum, lighter: low vote sum). Then the maximum vote sum (circled in red) and all pixels that have contributed (i.e. the backprojection) are determined (2). In parallel, the current segmentation model is used to segment the image region inside the search window (3) (binarised segmentation shown in red). Finally, after updating the object's current position (4), the segmentation model is adapted (5) using the backprojected image pixels, and the Hough transform model is updated (6) with foreground pixels from the segmentation and background pixels from a region around the object (blue frame).
2. Overall Approach
The overall procedure of detection, segmentation, tracking, and model adaptation for one video frame is illustrated in Fig. 1. The algorithm receives as input the current video frame as well as the bounding box and segmentation from the tracking result of the previous frame. The pixel-based Hough transform is applied to each pixel inside the search window (the enlarged bounding box), i.e. each pixel votes for the centre position of the object according to the learnt model. Votes are accumulated in a common reference frame, the voting map, and the position with the highest sum of votes determines the most likely position of the object's centre (see Section 3 for a more detailed explanation). Then, the pixels that have contributed to the maximum vote are determined. This process is called backprojection. In parallel, the image region corresponding to the search window is segmented using the current segmentation model. That is, each pixel is assigned a foreground probability (see Section 4). The position of the tracked object is updated using the maximum vote position and the centre of mass of the segmentation output (see Section 5). Finally, the models are adapted in a co-training manner to avoid drift. That means the segmentation model is updated using the backprojection, and the pixel-based Hough model is adapted according to the segmentation output (see Section 6 for more details).
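To make the data flow of these six steps concrete, here is a schematic per-frame sketch. It is illustrative only: the helper names (enlarge, hough_vote, backproject, segment and the two update functions) are hypothetical stand-ins for the components detailed in the following sections, not the authors' API.

```python
import numpy as np

def track_frame(frame, model, prev_box, prev_seg):
    window = enlarge(prev_box)                          # search window
    vote_map, votes = hough_vote(frame, model, window)  # (1) pixel voting
    x_max = np.unravel_index(vote_map.argmax(), vote_map.shape)  # (row, col)
    bproj = backproject(votes, x_max)                   # (2) contributing pixels
    fg_prob = segment(frame, model, window, prev_seg)   # (3) soft segmentation
    box = update_position(x_max, fg_prob)               # (4) fuse both estimates
    update_segmentation_model(model, bproj)             # (5) co-training:
    update_hough_model(model, fg_prob)                  # (6) models adapt each other
    return box, fg_prob
```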
3. Pixel-based Hough Voting
We developed a new detection algorithm relying on the generalised Hough transform [5]. In contrast to existing models developed recently for similar tasks (e.g. [14, 15]), which use Hough forests, i.e. Random Forests trained on small image patches, or the Implicit Shape Model (ISM) [25], our method operates at the pixel level.

This has the following advantages:
• pixel-based descriptors are more suitable for detecting objects that are extremely small in the image (e.g. for far-field vision),
• the feature space is relatively small and depends little (or not at all) on the spatial neighbourhood, which makes training and updating the model easier and more coherent with the object's appearance changes,
• training and applying the detector is extremely fast, as both can be implemented with look-up tables.
One drawback of a pixel-based Hough model arises when the object's image region contains primarily pixels of very similar colours (and gradients). In that case, the pixels on their own may not be discriminative enough to infer the object's centre position. Note that patch-based methods also have difficulties with uniform regions. In practice, however, this is rarely the case, and it may be controlled by increasing the discriminative power of the descriptors (at the cost of invariance). Also, in this tracking framework, this risk is considerably reduced by combining the detector with the segmentation output.

Let us now consider the model creation and application in detail. Figure 2 illustrates the model creation (training) and detection process.
3.1. Training
Let us denote by $x = (x_1, x_2)$ the position of a pixel $I(x)$ in an image $I$. In the training image, the pixels inside a given initial bounding box are quantised according to the vector composed of their HSV colour values (with separate V quantisation) and their x and y gradient orientation (with a separate quantisation for low gradient magnitudes) (see Fig. 2, left). Experiments showed that colour alone also works well, but gradient alone does not. This amounts to computing an N-dimensional histogram $D = (D_z)$, $z = 1, \ldots, N$, which is referred to as the pixel-based Hough model in this paper. Here, we use $N = (16 \times 16 + 16) \times (8 + 1) = 2448$ (16 colour bins and 8 gradient orientations). The vectors $D_z = \{d_z^1, \ldots, d_z^{M_z}\}$ contain $M_z$ displacement vectors $d_z^m = (x_{zm}, w_{zm})$, each associated with a weight $w_{zm} = 1.0$. Thus, training consists in constructing $D$, where each pixel $I(x)$ in the given bounding box produces a displacement vector $d_z$ (arrows in Fig. 2) corresponding to its quantised value $z_x$ and pointing to the centre of the bounding box.

Figure 2. Training and detection with the pixel-based Hough model. Left: the model D is constructed by storing, for each quantised pixel value in the given bounding box, all the displacement vectors to the object's centre position (here only colour is used for illustration). Right: the object is detected in a search window by accumulating the displacement votes of each pixel in a voting map (bright pixels: many votes, dark pixels: few votes).
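A minimal, runnable sketch of this training step is given below, assuming one particular realisation of the quantisation: 16×16 H/S bins plus 16 separate V bins for low-saturation pixels, and 8 gradient-orientation bins plus one bin for low gradient magnitudes, giving N = (16 × 16 + 16) × (8 + 1) = 2448 as in the text. The bin boundaries and thresholds are illustrative assumptions, not the authors' exact choices.

```python
from collections import defaultdict
import numpy as np
import cv2  # used only for colour conversion and gradients

def quantise(img_bgr, sat_thresh=32, mag_thresh=20.0):
    """Return the quantised value z in [0, 2448) for every pixel."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(int)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    hs = (h * 16 // 180) * 16 + (s * 16 // 256)                 # 16x16 H/S bins
    colour = np.where(s < sat_thresh, 256 + v * 16 // 256, hs)  # 16 separate V bins
    grey = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gx = cv2.Sobel(grey, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(grey, cv2.CV_32F, 0, 1)
    ori = ((np.arctan2(gy, gx) % np.pi) / np.pi * 8).astype(int) % 8
    grad = np.where(np.hypot(gx, gy) < mag_thresh, 8, ori)  # extra low-magnitude bin
    return colour * 9 + grad                                 # z in [0, 272 * 9)

def train(img_bgr, box):
    """Build D: for each z, the displacements to the box centre, weight 1.0."""
    x0, y0, w, h = box
    cx, cy = x0 + w // 2, y0 + h // 2
    z = quantise(img_bgr)
    D = defaultdict(list)
    for py in range(y0, y0 + h):
        for px in range(x0, x0 + w):
            D[z[py, px]].append(((cx - px, cy - py), 1.0))
    return D
```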
3.2. Detection
In a new video frame, the object can be detected by letting each pixel $I(x)$ inside the search window vote according to the $D_z$ corresponding to its quantised value $z_x$. The right part of Fig. 2 illustrates this. Each vote is a list of displacements $d_z^m$ that are weighted by $w_{zm}$ and accumulated in a voting map. The detector's output is then simply the position in the voting map with the maximum value, $x_{max}$.

Note that, as illustrated in Fig. 2, the position estimate is “diffused” by two factors: the deformation of the object (one of the green pixels in the figure), and pixels of the same colour (green and blue pixels). But the maximum value in the voting map is still distinctive and corresponds well to the centre position of the object. This could also be observed in our experiments.

Nevertheless, to be robust to very small deformations, we group the votes in small voting cells of 3×3 pixels (as in [15]).
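Continuing the sketch above (and reusing its illustrative quantise() and model D), the voting step could look as follows; the 3×3 cell pooling mirrors the grouping just described, while the window handling is simplified.

```python
def detect(img_bgr, D, window):
    """Accumulate weighted displacement votes and return the maximum position."""
    wx, wy, ww, wh = window
    z = quantise(img_bgr)
    H, W = z.shape
    vote_map = np.zeros((H, W), np.float32)
    for py in range(wy, wy + wh):
        for px in range(wx, wx + ww):
            for (dx, dy), wgt in D.get(z[py, px], ()):
                vx, vy = px + dx, py + dy
                if 0 <= vx < W and 0 <= vy < H:
                    vote_map[vy, vx] += wgt           # cast a weighted vote
    # pool votes into 3x3 cells for robustness to small deformations
    cells = vote_map[:H - H % 3, :W - W % 3]
    cells = cells.reshape(H // 3, 3, W // 3, 3).sum(axis=(1, 3))
    cy, cx = np.unravel_index(cells.argmax(), cells.shape)
    return (cx * 3 + 1, cy * 3 + 1), vote_map         # centre of winning cell
```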
3.3. Backprojection
With the position of the maximum in the voting map, $x_{max}$, we can determine which pixels in the search window contributed to it during the detection. This process is illustrated in Fig. 1 and is called backprojection. More precisely, let $z$ be the value of pixel $I(x)$. Then, the backprojection $b$ at each position $x$ is defined as:

$$b_x = \begin{cases} w_{zm} & \text{if } d_z^m \text{ voted for } x_{max}, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
The backprojected pixels are used for adapting the segmentation model (see Section 6 for more details). The idea behind this is that, intuitively, pixels that contributed to $x_{max}$ most likely correspond to the object.
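A sketch of Eq. (1) under the same assumptions as the previous snippets: each window pixel receives the weight of its vote if one of its displacements lands in the winning 3×3 cell, and 0 otherwise.

```python
def backproject(img_bgr, D, window, x_max, cell=3):
    """Per-pixel backprojection b_x of Eq. (1)."""
    wx, wy, ww, wh = window
    mx, my = x_max
    z = quantise(img_bgr)
    b = np.zeros(z.shape, np.float32)
    for py in range(wy, wy + wh):
        for px in range(wx, wx + ww):
            for (dx, dy), wgt in D.get(z[py, px], ()):
                # did this displacement vote for (the cell around) x_max?
                if abs(px + dx - mx) <= cell // 2 and abs(py + dy - my) <= cell // 2:
                    b[py, px] = wgt
                    break
    return b
```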
4. Segmentation
Complementary to the local pixel-wise Hough model, a global segmentation model is trained and adapted to allow for varying object shapes and for a better discrimination between foreground (object) and background, especially when the shape and appearance change drastically and abruptly.

A probabilistic soft segmentation approach is adopted here (similar to [2]). Let $c_{t,x} \in \{0, 1\}$ be the class of the pixel at position $x$ at time $t$: 0 for background, and 1 for foreground, and let $y_{1:t,x}$ be the pixel's colour observations from time 1 to $t$. For clarity, we drop the index $x$ in the following. In order to incorporate the segmentation of the previous video frame at time $t-1$ and to make the estimation more robust, we use a recursive Bayesian formulation, where, at time $t$, each pixel (in the search window) is assigned the foreground probability:

$$p(c_t = 1 \mid y_{1:t}) = Z^{-1}\, p(y_t \mid c_t = 1) \sum_{c'_{t-1}} p(c_t = 1 \mid c'_{t-1})\, p(c'_{t-1} \mid y_{1:t-1}) \,, \qquad (2)$$
where $Z$ is a normalisation constant to make the probabilities sum to 1. The distributions $p(y_t \mid c_t)$ are modelled with HSV colour histograms with $12 \times 12$ bins for the H and S channels and 12 separate bins for the V channel. The foreground histogram is initialised from the image region defined by the bounding box around the object in the first frame. The background histogram is initialised from the image region surrounding this rectangle (with some margin between). The transition probabilities for foreground and background are set to:

$$p(c_t = 0 \mid c_{t-1}) = 0.6 \,, \qquad p(c_t = 1 \mid c_{t-1}) = 0.4 \,, \qquad (3)$$

which is an empirical choice that has been validated experimentally. Note that the tracking algorithm is not very sensitive to these parameters.
As opposed to recent work on image segmentation (e.g. [35]), we treat each pixel independently, which, in general, leads to a less regularised solution but at the same time reduces the computational complexity considerably. As stated in Section 1.2, we are not so much interested here in a perfectly “clean” segmentation but rather in fast and robust tracking of the position of an object.
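A minimal sketch of one recursive update of Eq. (2), assuming the per-pixel class-conditional likelihoods have already been looked up in the two HSV histograms. The transition values follow Eq. (3); expressing them as a generic matrix is an illustrative choice, not the authors' code.

```python
import numpy as np

def update_fg_prob(lik_fg, lik_bg, prev_fg, trans=((0.6, 0.4), (0.6, 0.4))):
    """One step of Eq. (2) for a whole window of pixels.

    lik_fg, lik_bg: p(y_t | c_t = 1) and p(y_t | c_t = 0) per pixel
    prev_fg:        p(c_{t-1} = 1 | y_{1:t-1}) per pixel
    trans[i][j]:    p(c_t = j | c_{t-1} = i), values as in Eq. (3)
    """
    prev_bg = 1.0 - prev_fg
    # prediction: marginalise over the previous class c_{t-1}
    pred_fg = prev_bg * trans[0][1] + prev_fg * trans[1][1]
    pred_bg = prev_bg * trans[0][0] + prev_fg * trans[1][0]
    # correction and per-pixel normalisation (the constant Z)
    post_fg = lik_fg * pred_fg
    post_bg = lik_bg * pred_bg
    return post_fg / (post_fg + post_bg + 1e-12)
```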
