Online Object Tracking: A Benchmark
Yi Wu
University of California at Merced
ywu29@ucmerced.edu
Jongwoo Lim
Hanyang University
jlim@hanyang.ac.kr
Ming-Hsuan Yang
University of California at Merced
mhyang@ucmerced.edu
Abstract
Object tracking is one of the most important components in numerous applications of computer vision. While much progress has been made in recent years with efforts on sharing code and datasets, it is of great importance to develop a library and benchmark to gauge the state of the art. After briefly reviewing recent advances of online object tracking, we carry out large scale experiments with various evaluation criteria to understand how these algorithms perform. The test image sequences are annotated with different attributes for performance evaluation and analysis. By analyzing quantitative results, we identify effective approaches for robust tracking and provide potential future research directions in this field.
1. Introduction
Object tracking is one of the most important components
in a wide range of applications in computer vision, such
as surveillance, human computer interaction, and medical
imaging [60, 12]. Given the initialized state (e.g., position and size) of a target object in a frame of a video, the goal of tracking is to estimate the states of the target in the subsequent frames. Although object tracking has been studied for several decades, and much progress has been made in recent years [28, 16, 47, 5, 40, 26, 19], it remains a very challenging problem. Numerous factors affect the performance of a tracking algorithm, such as illumination variation, occlusion, and background clutter, and no single tracking approach can successfully handle all scenarios. Therefore, it is crucial to evaluate the performance of state-of-the-art trackers to demonstrate their strengths and weaknesses and to help identify future research directions for designing more robust algorithms.
For comprehensive performance evaluation, it is critical to collect a representative dataset. Several datasets exist for visual tracking in surveillance scenarios, such as the VIVID [14], CAVIAR [21], and PETS databases. However, the target objects in these surveillance sequences are usually humans or cars of small size, and the background is usually static. Although some tracking datasets [47, 5, 33] for generic scenes are annotated with bounding boxes, most are not. For sequences without labeled ground truth, it is difficult to evaluate tracking algorithms, as the reported results are based on inconsistently annotated object locations.
Recently, more tracking source codes have been made publicly available, e.g., the OAB [22], IVT [47], MIL [5], L1 [40], and TLD [31] algorithms, which have been commonly used for evaluation. However, the input and output formats of most trackers differ, which makes large scale performance evaluation inconvenient. In this work, we build a code library that includes most publicly available trackers and a test dataset with ground-truth annotations to facilitate the evaluation task. Additionally, each sequence in the dataset is annotated with attributes that often affect tracking performance, such as occlusion, fast motion, and illumination variation.
One common issue in assessing tracking algorithms is that results are reported based on just a few sequences with different initial conditions or parameters; thus, the results do not provide a holistic view of these algorithms. For fair and comprehensive performance evaluation, we propose to perturb the initial state spatially and temporally from the ground-truth target locations. While robustness to initialization is a well-known problem in the field, it is seldom addressed in the literature. To the best of our knowledge, this is the first comprehensive work to address and analyze the initialization problem of object tracking. We use precision plots based on the location error metric and success plots based on the overlap metric to analyze the performance of each algorithm.
The contribution of this work is three-fold:
Dataset. We build a tracking dataset with 50 fully annotated sequences to facilitate tracking evaluation.
Code library. We integrate most publicly available trackers into our code library with uniform input and output formats to facilitate large scale performance evaluation. At present, it includes 29 tracking algorithms.
Robustness evaluation. The initial bounding boxes for tracking are sampled spatially and temporally to evaluate the robustness and characteristics of trackers. Each tracker is extensively evaluated by analyzing more than 660,000 bounding box outputs.
This work mainly focuses on the online¹ tracking of a single target. The code library, annotated dataset, and all the tracking results are available on the website http://visual-tracking.net.
2. Related Work
In this section, we review recent algorithms for object tracking in terms of several main modules: target representation scheme, search mechanism, and model update. In addition, some methods have been proposed that build on combining trackers or mining context information.
Representation Scheme. Object representation is one of the major components in any visual tracker, and numerous schemes have been presented [35]. Since the pioneering work of Lucas and Kanade [37, 8], holistic templates (raw intensity values) have been widely used for tracking [25, 39, 2]. Subsequently, subspace-based tracking approaches [11, 47] have been proposed to better account for appearance changes. Furthermore, Mei and Ling [40] proposed a tracking approach based on sparse representation to handle corrupted appearance, and it has recently been further improved [41, 57, 64, 10, 55, 42]. In addition to templates, many other visual features have been adopted in tracking algorithms, such as color histograms [16], histograms of oriented gradients (HOG) [17, 52], covariance region descriptors [53, 46, 56], and Haar-like features [54, 22]. Recently, discriminative models have been widely adopted in tracking [15, 4], where a binary classifier is learned online to discriminate the target from the background. Numerous learning methods have been adapted to the tracking problem, such as SVM [3], structured output SVM [26], ranking SVM [7], boosting [4, 22], semi-boosting [23], and multi-instance boosting [5]. To make trackers more robust to pose variation and partial occlusion, an object can be represented by parts, where each part is described by descriptors or histograms. In [1], several local histograms are used to represent the object in a pre-defined grid structure. Kwon and Lee [32] propose an approach that automatically updates the topology of local patches to handle large pose changes. To better handle appearance variations, some approaches that integrate multiple representation schemes have recently been proposed [62, 51, 33].
Search Mechanism. To estimate the state of the target objects, deterministic or stochastic methods have been used. When the tracking problem is posed within an optimization framework, assuming the objective function is differentiable with respect to the motion parameters, gradient descent methods can be used to locate the target efficiently [37, 16, 20, 49]. However, these objective functions are
¹ Here, the word online means that during tracking only the information of the previous few frames is used for inference at any time instance.
usually nonlinear and contain many local minima. To alleviate this problem, dense sampling methods have been adopted [22, 5, 26] at the expense of high computational load. On the other hand, stochastic search algorithms such as particle filters [28, 44] have been widely used, since they are relatively insensitive to local minima and computationally efficient [47, 40, 30].
Model Update. It is crucial to update the target representation or model to account for appearance variations. Matthews et al. [39] address the template update problem for the Lucas-Kanade algorithm [37], where the template is updated with a combination of the fixed reference template extracted from the first frame and the result from the most recent frame. Effective update algorithms have also been proposed via online mixture models [29], online boosting [22], and incremental subspace update [47]. For discriminative models, the main issue has been improving the sample collection to make the online-trained classifier more robust [23, 5, 31, 26]. While much progress has been made, it remains difficult to design an adaptive appearance model that avoids drift.
Context and Fusion of Trackers. Context information is also very important for tracking. Recently, some approaches mine auxiliary objects or local visual information surrounding the target to assist tracking [59, 24, 18]. The context information is especially helpful when the target is fully occluded or leaves the image region [24]. To improve tracking performance, some tracker fusion methods have been proposed recently. Santner et al. [48] propose an approach that combines static, moderately adaptive, and highly adaptive trackers to account for appearance changes. Multiple trackers [34] or multiple feature sets [61] can also be maintained and selected in a Bayesian framework to better account for appearance changes.
3. Evaluated Algorithms and Datasets
For fair evaluation, we test the tracking algorithms whose original source or binary codes are publicly available, as all implementations inevitably involve technical details and specific parameter settings². Table 1 shows the list of the evaluated tracking algorithms. We also evaluate the trackers in the VIVID testbed [14], including the mean shift (MS-V), template matching (TM-V), ratio shift (RS-V), and peak difference (PD-V) methods.
In recent years, many benchmark datasets have been developed for various vision problems, such as the Berkeley segmentation [38], FERET face recognition [45], and optical flow [9] datasets. There exist some datasets for tracking in surveillance scenarios, such as the VIVID [14] and CAVIAR [21] datasets. For generic visual tracking, more
² Some source codes [36, 58] are obtained through direct contact, and some methods are implemented on our own [44, 16].

Method Representation Search MU Code FPS
CPF [44] L, IH PF N C 109
LOT [43] L, color PF Y M 0.70
IVT [47] H, PCA, GM PF Y MC 33.4
ASLA [30] L, SR, GM PF Y MC 8.5
SCM [65] L, SR, GM+DM PF Y MC 0.51
L1APG [10] H, SR, GM PF Y MC 2.0
MTT [64] H, SR, GM PF Y M 1.0
VTD [33] H, SPCA, GM MCMC Y MC-E 5.7
VTS [34] L, SPCA, GM MCMC Y MC-E 5.7
LSK [36] L, SR, GM LOS Y M-E 5.5
ORIA [58] H, T, GM LOS Y M 9.0
DFT [49] L, T LOS Y M 13.2
KMS [16] H, IH LOS N C 3,159
SMS [13] H, IH LOS N C 19.2
VR-V [15] H, color LOS Y MC 109
Frag [1] L, IH DS N C 6.3
OAB [22] H, Haar, DM DS Y C 22.4
SemiT [23] H, Haar, DM DS Y C 11.2
BSBT [50] H, Haar, DM DS Y C 7.0
MIL [5] H, Haar, DM DS Y C 38.1
CT [63] H, Haar, DM DS Y MC 64.4
TLD [31] L, BP, DM DS Y MC 28.1
Struck [26] H, Haar, DM DS Y C 20.2
CSK [27] H, T, DM DS Y M 362
CXT [18] H, BP, DM DS Y C 15.3
Table 1. Evaluated tracking algorithms (MU: model update, FPS: frames per second). For representation schemes, L: local, H: holistic, T: template, IH: intensity histogram, BP: binary pattern, PCA: principal component analysis, SPCA: sparse PCA, SR: sparse representation, DM: discriminative model, GM: generative model. For search mechanism, PF: particle filter, MCMC: Markov Chain Monte Carlo, LOS: local optimum search, DS: dense sampling search. For model update, N: no, Y: yes. In the Code column, M: Matlab, C: C/C++, MC: mixture of Matlab and C/C++, suffix E: executable binary code.
sequences have been used for evaluation [47, 5]. However, most sequences do not have ground truth annotations, and quantitative evaluation results may be generated with different initial conditions. To facilitate fair performance evaluation, we have collected and annotated the most commonly used tracking sequences. Figure 1 shows the first frame of each sequence, where the target object is initialized with a bounding box.
Attributes of a test sequence. Evaluating trackers is difficult because many factors can affect tracking performance. For better evaluation and analysis of the strengths and weaknesses of tracking approaches, we categorize the sequences by annotating them with the 11 attributes shown in Table 2.
The attribute distribution in our dataset is shown in Figure 2(a). Some attributes, e.g., OPR and IPR, occur more frequently than others. One sequence is often annotated with several attributes. Aside from summarizing performance on the whole dataset, we also construct several subsets corresponding to the attributes to report performance under specific challenging conditions. For example, the OCC subset contains 29 sequences which can be used to analyze the
Attr Description
IV Illumination Variation - the illumination in the target region is significantly changed.
SV Scale Variation - the ratio of the bounding boxes of the first frame and the current frame is out of the range [1/t_s, t_s], t_s > 1 (t_s = 2).
OCC Occlusion - the target is partially or fully occluded.
DEF Deformation - non-rigid object deformation.
MB Motion Blur - the target region is blurred due to the motion of the target or camera.
FM Fast Motion - the motion of the ground truth is larger than t_m pixels (t_m = 20).
IPR In-Plane Rotation - the target rotates in the image plane.
OPR Out-of-Plane Rotation - the target rotates out of the image plane.
OV Out-of-View - some portion of the target leaves the view.
BC Background Clutters - the background near the target has similar color or texture as the target.
LR Low Resolution - the number of pixels inside the ground-truth bounding box is less than t_r (t_r = 400).
Table 2. List of the attributes annotated to the test sequences. The threshold values used in this work are also shown.
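As a concrete reading of the thresholds in Table 2, the SV, FM, and LR checks can be sketched as below. This is a minimal Python illustration with hypothetical helper names, not the benchmark's annotation code; it assumes the SV ratio is taken over box areas and that FM is measured between consecutive ground-truth centers, details the paper leaves to the supplement.

```python
import math

# Thresholds from Table 2; a box is (x, y, w, h), center at (x + w/2, y + h/2).
T_S, T_M, T_R = 2.0, 20, 400

def is_scale_variation(box0, box):
    # SV: size ratio between the first and current ground-truth box is
    # outside [1/t_s, t_s] (area ratio assumed here).
    r = (box[2] * box[3]) / (box0[2] * box0[3])
    return r > T_S or r < 1.0 / T_S

def is_fast_motion(prev_box, box):
    # FM: the ground-truth center moves more than t_m = 20 pixels.
    dx = (box[0] + box[2] / 2) - (prev_box[0] + prev_box[2] / 2)
    dy = (box[1] + box[3] / 2) - (prev_box[1] + prev_box[3] / 2)
    return math.hypot(dx, dy) > T_M

def is_low_resolution(box):
    # LR: fewer than t_r = 400 pixels inside the ground-truth box.
    return box[2] * box[3] < T_R
```

A sequence would receive an attribute if the corresponding condition holds in some frame of its ground-truth annotation.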
Figure 2. (a) Attribute distribution of the entire test set, and (b) the distribution of the sequences with the occlusion (OCC) attribute.
performance of trackers in handling occlusion. The attribute distributions of the OCC subset are shown in Figure 2(b); others are available in the supplemental material.
4. Evaluation Methodology
In this work, we use the precision and success rate for quantitative analysis. In addition, we evaluate the robustness of tracking algorithms in two aspects.
Precision plot. One widely used evaluation metric for tracking precision is the center location error, defined as the Euclidean distance between the center locations of the tracked targets and the manually labeled ground truths. The average center location error over all frames of one sequence is then used to summarize the overall performance for that sequence. However, when a tracker loses the target, the output location can be random, and the average error value may not measure the tracking performance correctly [6]. Recently, the precision plot [6, 27] has been adopted to measure overall tracking performance. It shows the percentage of frames whose estimated location is within a given threshold distance of the ground truth. As the representative precision score for each tracker, we use the score at the threshold of 20 pixels [6].
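The precision plot described above can be sketched as follows. This is a minimal Python illustration with hypothetical function names, not the benchmark's released Matlab/C code; it assumes per-frame (x, y) centers are already extracted from the tracker output and ground truth.

```python
import math

def precision_curve(pred_centers, gt_centers, thresholds):
    """Fraction of frames whose center location error is within each threshold.
    pred_centers / gt_centers: lists of (x, y) centers, one per frame."""
    errs = [math.hypot(px - gx, py - gy)
            for (px, py), (gx, gy) in zip(pred_centers, gt_centers)]
    n = len(errs)
    return [sum(e <= t for e in errs) / n for t in thresholds]

def precision_score(pred_centers, gt_centers, tau=20.0):
    # Representative precision score: the curve value at a 20-pixel threshold.
    return precision_curve(pred_centers, gt_centers, [tau])[0]
```

Plotting `precision_curve` over a range of thresholds (e.g., 0 to 50 pixels) gives one curve per tracker, and `precision_score` gives the single number used for ranking.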
Success plot. Another evaluation metric is the bounding box overlap. Given the tracked bounding box r_t and the

Figure 1. Tracking sequences for evaluation. The first frame with the bounding box of the target object is shown for each sequence. The
sequences are ordered based on our ranking results (See supplementary material): the ones on the top left are more difficult for tracking
than the ones on the bottom right. Note that we annotated two targets for the jogging sequence.
ground truth bounding box r_a, the overlap score is defined as S = |r_t ∩ r_a| / |r_t ∪ r_a|, where ∩ and ∪ represent the intersection and union of two regions, respectively, and |·| denotes the number of pixels in the region. To measure the performance on a sequence of frames, we count the number of successful frames whose overlap S is larger than a given threshold t_o. The success plot shows the ratio of successful frames as the threshold varies from 0 to 1. Using one success rate value at a specific threshold (e.g., t_o = 0.5) for tracker evaluation may not be fair or representative. Instead, we use the area under curve (AUC) of each success plot to rank the tracking algorithms.
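The overlap score and the AUC of the success plot can be sketched as below. This is a simple illustrative sketch with hypothetical names, not the benchmark's actual implementation; the number of sampled thresholds is an assumption.

```python
def overlap(rt, ra):
    """Overlap score S = |rt ∩ ra| / |rt ∪ ra| for boxes (x, y, w, h)."""
    ix = max(0, min(rt[0] + rt[2], ra[0] + ra[2]) - max(rt[0], ra[0]))
    iy = max(0, min(rt[1] + rt[3], ra[1] + ra[3]) - max(rt[1], ra[1]))
    inter = ix * iy
    union = rt[2] * rt[3] + ra[2] * ra[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, num_thresholds=101):
    """Average success rate over overlap thresholds sampled in [0, 1],
    i.e., the AUC of the success plot."""
    scores = [overlap(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    n = len(scores)
    rates = [sum(s > t / (num_thresholds - 1) for s in scores) / n
             for t in range(num_thresholds)]
    return sum(rates) / num_thresholds
```

Ranking trackers by `success_auc` rewards accuracy across all overlap thresholds rather than at a single cut-off such as t_o = 0.5.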
Robustness Evaluation. The conventional way to evaluate trackers is to run them throughout a test sequence, initialized with the ground truth position in the first frame, and report the average precision or success rate. We refer to this as one-pass evaluation (OPE). However, a tracker may be sensitive to initialization, and its performance with a different initialization at a different start frame may become much worse or better. Therefore, we propose two ways to analyze a tracker's robustness to initialization: perturbing the initialization temporally (i.e., starting at different frames) and spatially (i.e., starting with different bounding boxes). These tests are referred to as temporal robustness evaluation (TRE) and spatial robustness evaluation (SRE), respectively.
The proposed test scenarios arise frequently in real-world applications, as a tracker is often initialized by an object detector, which is likely to introduce initialization errors in terms of position and scale. In addition, an object detector may be used to re-initialize a tracker at different time instances. By investigating a tracker's characteristics under the robustness evaluation, a more thorough understanding and analysis of the tracking algorithm can be carried out.
Temporal Robustness Evaluation. Given one start frame together with the ground-truth bounding box of the target at that frame, the tracker is initialized and runs to the end of the sequence, i.e., over one segment of the entire sequence. The tracker is evaluated on each segment, and the overall statistics are tallied.
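The segmentation underlying TRE can be sketched as below. This is a hypothetical sketch: the paper partitions each sequence into 20 segments (Section 5) but leaves the exact partitioning scheme to the supplement, so evenly spaced start frames are assumed here.

```python
def tre_start_frames(seq_len, num_segments=20):
    """Start frames for TRE: one tracker run is launched at the first frame of
    each of num_segments segments and continues to the end of the sequence."""
    return [round(i * seq_len / num_segments) for i in range(num_segments)]
```

Note that later start frames yield shorter runs, which is why the paper observes higher average scores under TRE than under OPE.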
Spatial Robustness Evaluation. We sample the initial bounding box in the first frame by shifting or scaling the ground truth. Here, we use 8 spatial shifts, including 4 center shifts and 4 corner shifts, and 4 scale variations (see supplement). The amount of shift is 10% of the target size, and the scale ratio varies among 0.8, 0.9, 1.1, and 1.2 of the ground truth. Thus, we evaluate each tracker 12 times for SRE.
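The 12 SRE initializations can be sketched as below. This is a Python sketch with assumed shift directions (4 axis-aligned and 4 diagonal shifts of 10% of the target size, and scaling about the box center); the exact shift set used in the benchmark is given in its supplement.

```python
def sre_initializations(gt):
    """12 perturbed initial boxes for SRE from ground truth gt = (x, y, w, h)."""
    x, y, w, h = gt
    dx, dy = 0.1 * w, 0.1 * h
    shifts = [(-dx, 0), (dx, 0), (0, -dy), (0, dy),        # center shifts (assumed)
              (-dx, -dy), (-dx, dy), (dx, -dy), (dx, dy)]  # corner shifts (assumed)
    inits = [(x + sx, y + sy, w, h) for sx, sy in shifts]
    for s in (0.8, 0.9, 1.1, 1.2):                         # 4 scale variations
        nw, nh = s * w, s * h
        inits.append((x + (w - nw) / 2, y + (h - nh) / 2, nw, nh))
    return inits
```

Running a tracker once from each returned box and averaging the resulting precision or success scores gives its SRE performance on that sequence.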
5. Evaluation Results
For each tracker, the default parameters with the source
code are used in all evaluations. Table 1 lists the average
FPS of each tracker in OPE running on a PC with Intel i7
3770 CPU (3.4GHz). More detailed speed statistics, such as
minimum and maximum, are available in the supplement.
For OPE, each tracker is tested on more than 29,000 frames. For SRE, each tracker is evaluated 12 times on each sequence, generating more than 350,000 bounding box results. For TRE, each sequence is partitioned into 20 segments, and thus each tracker is run on around 310,000 frames. To the best of our knowledge, this is the largest scale performance evaluation of visual tracking. We report the most important findings in this manuscript; more details and figures can be found in the supplement.
5.1. Overall Performance
The overall performance of all the trackers is summarized by the success and precision plots shown in Figure 3, where only the top 10 algorithms are presented for clarity; the complete plots are displayed in the supplementary material.
Figure 3. Plots of OPE, SRE, and TRE. The performance score for each tracker is shown in the legend. For each figure, the top 10 trackers are presented for clarity, and complete plots are in the supplementary material (best viewed on a high-resolution display).
For success plots, we use AUC scores to summarize and rank the trackers, while for precision plots we use the results at the error threshold of 20 pixels for ranking. In the precision plots, the rankings of some trackers differ slightly from those in the success plots because the two plots are based on different metrics that measure different characteristics of trackers. Because the AUC score of the success plot measures the overall performance, which is more informative than the score at a single threshold, in the following we mainly analyze the rankings based on the success plots and use the precision plots as auxiliary.
The average TRE performance is higher than that of OPE because the number of frames decreases from the first to the last segment of TRE. As trackers tend to perform well on shorter sequences, the average of all the results in TRE tends to be higher. On the other hand, the average performance of SRE is lower than that of OPE. Initialization errors tend to cause trackers to update with imprecise appearance information, thereby causing gradual drift.
In the success plots, the top ranked tracker SCM in OPE outperforms Struck by 2.6% but is 1.9% below Struck in SRE. The results also show that OPE is not the best performance indicator, as OPE is effectively one trial of SRE or TRE. The ranking of TLD in TRE is lower than in OPE and SRE. This is because TLD, with its re-detection module, performs well in long sequences, whereas TRE contains numerous short segments. The success plots of Struck in TRE and SRE show that the success rate of Struck is higher than that of SCM and ASLA when the overlap threshold is small, but lower than SCM and ASLA when the overlap threshold is large. This is because Struck only estimates the location of the target and does not handle scale variation.
Sparse representations are used in SCM, ASLA, LSK, MTT, and L1APG. These trackers perform well in SRE and TRE, which suggests that sparse representations are effective models for accounting for appearance change (e.g., occlusion). We note that SCM, ASLA, and LSK outperform MTT and L1APG. The results suggest that local sparse representations are more effective than holistic sparse templates. The AUC score of ASLA decreases less than those of the other top 5 trackers from OPE to SRE, and the ranking of ASLA also increases. This indicates that the alignment-pooling technique adopted by ASLA is more robust to misalignment and background clutter.
Among the top 10 trackers, CSK has the highest speed, where the proposed circulant structure plays a key role. The VTD and VTS methods adopt mixture models to improve tracking performance. Compared with other higher ranked trackers, their performance bottleneck can be attributed to their adopted representation based on sparse principal component analysis, where the holistic templates are used.