Proceedings ArticleDOI

Accurate scale estimation for robust visual tracking

TL;DR: This paper presents a novel approach to robust scale estimation, based on discriminative correlation filters for both translation and scale, that handles large scale variations in complex image sequences while remaining accurate and efficient.
Abstract: Robust scale estimation is a challenging problem in visual object tracking. Most existing methods fail to handle large scale variations in complex image sequences. This paper presents a novel approach ...
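
To make the scale-estimation idea concrete, here is a minimal NumPy sketch of a scale-pyramid search with a learned one-dimensional correlation filter along the scale dimension, in the spirit of the approach described above. The number of scales, the scale step, the `extract_feats` helper and the layout of `scale_filter_fft` are illustrative assumptions, not the paper's settings or code.

```python
import numpy as np

def estimate_scale(image, center, base_size, scale_filter_fft, extract_feats,
                   n_scales=33, scale_step=1.02):
    """Pick the best scale by correlating a learned 1-D scale filter with
    features sampled from a pyramid of patches around the current target.

    `extract_feats(image, center, size)` is an assumed helper returning a
    fixed-length feature vector for one patch; `scale_filter_fft` holds the
    FFT (along the scale dimension) of the learned filter, one column per
    feature dimension. Both are illustrative placeholders.
    """
    exponents = np.arange(n_scales) - (n_scales - 1) / 2
    scales = scale_step ** exponents                      # geometric scale pyramid

    # One feature vector per scale, stacked into an (n_scales, d) sample.
    samples = np.stack([
        extract_feats(image, center, (base_size[0] * s, base_size[1] * s))
        for s in scales
    ])

    # Correlate along the scale dimension in the Fourier domain and return
    # the scale factor with the highest response.
    sample_fft = np.fft.fft(samples, axis=0)
    response = np.real(np.fft.ifft(
        np.sum(np.conj(scale_filter_fft) * sample_fft, axis=1)))
    return scales[np.argmax(response)]
```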


Citations
Book ChapterDOI
08 Oct 2016
TL;DR: A basic tracking algorithm is equipped with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video and achieves state-of-the-art performance in multiple benchmarks.
Abstract: The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object’s appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.
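
The operation at the heart of this formulation is a cross-correlation between the embedding of the exemplar and the embedding of a larger search region, both computed by the same convolutional network. A minimal PyTorch sketch of that comparison follows; the toy `embed` network and the 127/255 input sizes are placeholders, not the authors' architecture.

```python
import torch
import torch.nn.functional as F

def siamese_response(embed, exemplar, search):
    """Score map from a fully-convolutional Siamese comparison.

    `embed` is any convolutional feature extractor applied to both inputs
    with shared weights; the exemplar embedding is used as a correlation
    kernel over the search embedding.
    """
    z = embed(exemplar)            # (1, C, h, w)   exemplar features
    x = embed(search)              # (1, C, H, W)   search-region features
    # Cross-correlation == convolution with the exemplar features as kernel.
    return F.conv2d(x, z)          # (1, 1, H-h+1, W-w+1) response map

# Usage with a toy single-layer embedding (illustrative only):
embed = torch.nn.Conv2d(3, 64, kernel_size=3)
score = siamese_response(embed,
                         torch.randn(1, 3, 127, 127),   # exemplar crop
                         torch.randn(1, 3, 255, 255))   # search crop
```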

2,936 citations


Cites methods from "Accurate scale estimation for robus..."

  • ...3 we also compare against seven more recent state-of-the-art trackers presented in the major computer vision conferences and that can run at frame-rate speed: Staple [33], LCT [34], CCT [35], SCT4 [36], DLSSVM NU [37], DSST [38] and KCFDP [39]....

Proceedings ArticleDOI
18 Jun 2018
TL;DR: The Siamese region proposal network (Siamese-RPN) is proposed, which is trained end-to-end offline with large-scale image pairs for visual object tracking and consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork with a classification branch and a regression branch.
Abstract: Visual object tracking has been a fundamental topic in recent years and many deep learning based trackers have achieved state-of-the-art performance on multiple benchmarks. However, most of these trackers can hardly reach top performance at real-time speed. In this paper, we propose the Siamese region proposal network (Siamese-RPN), which is trained end-to-end offline with large-scale image pairs. Specifically, it consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork comprising a classification branch and a regression branch. In the inference phase, the proposed framework is formulated as a local one-shot detection task. We can pre-compute the template branch of the Siamese subnetwork and formulate the correlation layers as trivial convolution layers to perform online tracking. Benefiting from the proposal refinement, the traditional multi-scale test and online fine-tuning can be discarded. The Siamese-RPN runs at 160 FPS while achieving leading performance in the VOT2015, VOT2016 and VOT2017 real-time challenges.
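
The description above maps onto a head in which template features are lifted into correlation kernels for a classification branch (2k channels for k anchors) and a regression branch (4k channels), and the correlation itself becomes an ordinary convolution over the adjusted search features. The PyTorch sketch below illustrates that structure only; the channel counts, kernel sizes and feature-map sizes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

class SiamRPNHead(torch.nn.Module):
    """Sketch of a Siamese-RPN style head: template features become correlation
    kernels for classification (2k channels) and regression (4k channels).
    All sizes are illustrative, not the paper's configuration."""

    def __init__(self, feat_ch=256, k=5):
        super().__init__()
        self.k, self.c = k, feat_ch
        self.conv_cls_z = torch.nn.Conv2d(feat_ch, feat_ch * 2 * k, 3)  # template -> cls kernels
        self.conv_reg_z = torch.nn.Conv2d(feat_ch, feat_ch * 4 * k, 3)  # template -> reg kernels
        self.conv_cls_x = torch.nn.Conv2d(feat_ch, feat_ch, 3)          # search-side adjust
        self.conv_reg_x = torch.nn.Conv2d(feat_ch, feat_ch, 3)

    def forward(self, z_feat, x_feat):
        # Template branch: can be computed once per sequence (one-shot kernels).
        cls_k = self.conv_cls_z(z_feat)                  # (1, 2k*C, hz, wz)
        reg_k = self.conv_reg_z(z_feat)                  # (1, 4k*C, hz, wz)
        hz, wz = cls_k.shape[-2:]
        cls_k = cls_k.view(2 * self.k, self.c, hz, wz)
        reg_k = reg_k.view(4 * self.k, self.c, hz, wz)
        # Search branch: the correlation layer is a plain convolution.
        cls = F.conv2d(self.conv_cls_x(x_feat), cls_k)   # (1, 2k, H', W') anchor scores
        reg = F.conv2d(self.conv_reg_x(x_feat), reg_k)   # (1, 4k, H', W') box offsets
        return cls, reg

# Usage with dummy template/search features (illustrative sizes):
head = SiamRPNHead()
cls, reg = head(torch.randn(1, 256, 6, 6), torch.randn(1, 256, 22, 22))
```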

2,016 citations


Cites methods from "Accurate scale estimation for robus..."

  • ...In this experiment, we compare our method with several representative trackers, including PTAV [11], CREST [31], SRDCF [8], SINT [33], CSR-DCF [23], Siamese-FC [4], Staple [3], CFNet [35] and DSST [9]....

Proceedings ArticleDOI
21 Jul 2017
TL;DR: This work revisits the core DCF formulation and introduces a factorized convolution operator, which drastically reduces the number of parameters in the model, and a compact generative model of the training sample distribution, which significantly reduces memory and time complexity while providing better sample diversity.
Abstract: In recent years, Discriminative Correlation Filter (DCF) based methods have significantly advanced the state-of-the-art in tracking. However, in the pursuit of ever increasing tracking performance, their characteristic speed and real-time capability have gradually faded. Further, the increasingly complex models, with a massive number of trainable parameters, have introduced the risk of severe over-fitting. In this work, we tackle the key causes behind the problems of computational complexity and over-fitting, with the aim of simultaneously improving both speed and performance. We revisit the core DCF formulation and introduce: (i) a factorized convolution operator, which drastically reduces the number of parameters in the model, (ii) a compact generative model of the training sample distribution, which significantly reduces memory and time complexity while providing better diversity of samples, and (iii) a conservative model update strategy with improved robustness and reduced complexity. We perform comprehensive experiments on four benchmarks: VOT2016, UAV123, OTB-2015, and TempleColor. When using expensive deep features, our tracker provides a 20-fold speedup and achieves a 13.0% relative gain in Expected Average Overlap compared to the top-ranked method [12] in the VOT2016 challenge. Moreover, our fast variant, using hand-crafted features, operates at 60 Hz on a single CPU, while obtaining 65.0% AUC on OTB-2015.
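
The factorized convolution operator amounts to learning a small projection matrix alongside a compact set of filters, so detection reduces to projecting the D-channel features down to C channels and correlating them with the C filters. The NumPy sketch below shows only that dimensionality-reduction step in the Fourier domain, ignoring the continuous-domain formulation the tracker builds on; the array shapes and the name `factorized_response` are illustrative assumptions.

```python
import numpy as np

def factorized_response(x, P, f_hat):
    """Detection score for a factorized-filter model in the Fourier domain.

    x      : (H, W, D) feature map with D channels (e.g. deep features)
    P      : (D, C) learned projection matrix with C << D
    f_hat  : (H, W, C) per-channel FFT of the C compact filters
    Sizes and names are illustrative; only the factorization idea is shown.
    """
    x_proj = x @ P                                  # project features: (H, W, C)
    x_hat = np.fft.fft2(x_proj, axes=(0, 1))        # per-channel 2-D FFT
    # Correlate each projected channel with its filter and sum the channels.
    score = np.real(np.fft.ifft2(np.sum(np.conj(f_hat) * x_hat, axis=2)))
    return score                                    # (H, W) response map

# Usage with random data (D=512 channels reduced to C=64, made-up sizes):
H, W, D, C = 32, 32, 512, 64
score = factorized_response(np.random.randn(H, W, D),
                            np.random.randn(D, C),
                            np.fft.fft2(np.random.randn(H, W, C), axes=(0, 1)))
```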

1,993 citations


Cites background or methods from "Accurate scale estimation for robus..."

  • ...The recent advancement in DCF based tracking performance is driven by the use of multi-dimensional features [13, 15], robust scale estimation [7, 11], non-linear kernels [20], long-term memory components [28], sophisticated learning models [3, 10] and reducing boundary effects [9, 16]....

  • ...OTB2015 Dataset: We compare our tracker with 20 state-of-the-art methods: TLD [22], Struck [19], CFLB [16], ACT [13], TGPR [17], KCF [20], DSST [7], SAMF [25], MEEM [38], DAT [33], LCT [28], HCF [27], SRDCF [9], SRDCFad [10], DeepSRDCF [8], Staple [1], MDNet [31], SiameseFC [2], TCNN [30] and C-COT [12]....

Proceedings ArticleDOI
27 Jun 2016
TL;DR: A novel visual tracking algorithm is proposed, based on representations from a Convolutional Neural Network that is discriminatively pretrained on a large set of videos with tracking ground-truths to obtain a generic target representation.
Abstract: We propose a novel visual tracking algorithm based on the representations from a discriminatively trained Convolutional Neural Network (CNN). Our algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. Our network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify the target in each domain. We train the network with respect to each domain iteratively to obtain generic target representations in the shared layers. When tracking a target in a new sequence, we construct a new network by combining the shared layers in the pretrained CNN with a new binary classification layer, which is updated online. Online tracking is performed by evaluating the candidate windows randomly sampled around the previous target state. The proposed algorithm demonstrates outstanding performance in existing tracking benchmarks.
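
Architecturally, this corresponds to a shared trunk followed by one small binary (target vs. background) head per training sequence, with the heads discarded and a fresh head trained online for a new video. The PyTorch sketch below captures that split; the layer sizes and patch size are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiDomainNet(nn.Module):
    """Sketch of a multi-domain architecture: shared layers learned across all
    training sequences, plus one binary classification head per domain.
    Layer sizes are illustrative only."""

    def __init__(self, num_domains, feat_dim=512):
        super().__init__()
        self.shared = nn.Sequential(              # shared convolutional/fc trunk
            nn.Conv2d(3, 96, 7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(3), nn.Flatten(),
            nn.Linear(96 * 3 * 3, feat_dim), nn.ReLU())
        # One domain-specific target/background classifier per training sequence.
        self.branches = nn.ModuleList(
            [nn.Linear(feat_dim, 2) for _ in range(num_domains)])

    def forward(self, patches, domain):
        return self.branches[domain](self.shared(patches))

# During tracking, the branches are replaced by a new head updated online:
net = MultiDomainNet(num_domains=50)
online_head = nn.Linear(512, 2)                   # trained on samples from the new video
scores = online_head(net.shared(torch.randn(8, 3, 107, 107)))
```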

1,960 citations


Cites background or methods from "Accurate scale estimation for robus..."

  • ...The one-pass evaluation (OPE) is employed to compare our algorithm with the six state-of-the-art trackers including MUSTer [21], CNN-SVM [20], MEEM [42], TGPR [12], DSST [6] and KCF [18], as well as the top 2 trackers included in the benchmark—SCM [44] and Struck [17]....

  • ...Visual tracking, however, has been less affected by these popular trends since it is difficult to collect a large amount of training data for video processing applications and training algorithms specialized for visual tracking are not available yet, while the approaches based on low-level hand-crafted features still work well in practice [18, 6, 21, 42]....

  • ...We compare our algorithm with the top 5 trackers in VOT2014 challenge—DSST [6], SAMF [29], KCF [18], DGT [4] and PLT 14 [25]—and additional two state-of-the-art trackers MUSTer [21] and MEEM [42]....

  • ...In recent years, correlation filters have gained attention in the area of visual tracking due to their computational efficiency and competitive performance [3, 18, 6, 21]....

  • ...The one-pass evaluation (OPE) is employed to compare our algorithm with the six state-of-the-art trackers including MUSTer [21], CNN-SVM [20], MEEM [42], TGPR [12], DSST [6] and KCF [18], as well as the top 2 trackers included in the benchmark—SCM [44]...

Posted Content
TL;DR: A novel visual tracking algorithm is proposed based on the representations from a discriminatively trained Convolutional Neural Network (CNN), which is pretrained using a large set of videos with tracking ground-truths to obtain a generic target representation.
Abstract: We propose a novel visual tracking algorithm based on the representations from a discriminatively trained Convolutional Neural Network (CNN). Our algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. Our network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify the target in each domain. We train the network with respect to each domain iteratively to obtain generic target representations in the shared layers. When tracking a target in a new sequence, we construct a new network by combining the shared layers in the pretrained CNN with a new binary classification layer, which is updated online. Online tracking is performed by evaluating the candidate windows randomly sampled around the previous target state. The proposed algorithm illustrates outstanding performance compared with state-of-the-art methods in existing tracking benchmarks.

1,818 citations

References
Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.
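
The training scheme alternates between fixing the latent values for the positive examples (choosing, for each positive, the configuration that scores highest under the current model) and solving the resulting convex hinge-loss problem. The toy NumPy loop below sketches that alternation with plain subgradient descent; the data layout and hyperparameters are illustrative assumptions, and hard-negative mining is omitted.

```python
import numpy as np

def latent_svm(pos_candidates, neg, C=1.0, outer_iters=5, lr=1e-3, epochs=50):
    """Toy latent-SVM alternation (illustrative, not the DPM implementation).

    pos_candidates : list of (m_i, d) arrays; each positive example has m_i
                     candidate latent placements, each giving a d-dim feature.
    neg            : (n, d) array of negative features.
    """
    d = neg.shape[1]
    w = np.zeros(d)
    for _ in range(outer_iters):
        # Step 1: fix latent values -- pick the highest-scoring candidate per positive.
        pos = np.stack([cands[np.argmax(cands @ w)] for cands in pos_candidates])
        X = np.vstack([pos, neg])
        y = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
        # Step 2: solve the now-convex hinge-loss problem by subgradient descent.
        for _ in range(epochs):
            margins = y * (X @ w)
            viol = margins < 1
            grad = w - C * (y[viol, None] * X[viol]).sum(axis=0)
            w -= lr * grad
    return w

# Toy usage: 20 positives with 3 candidate placements each, 100 negatives.
rng = np.random.default_rng(0)
pos = [rng.normal(1.0, 1.0, size=(3, 16)) for _ in range(20)]
neg = rng.normal(-1.0, 1.0, size=(100, 16))
w = latent_svm(pos, neg)
```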

10,501 citations

Journal ArticleDOI
TL;DR: A new kernelized correlation filter (KCF) is derived that, unlike other kernel algorithms, has exactly the same complexity as its linear counterpart; a fast multi-channel extension of linear correlation filters, called the dual correlation filter (DCF), is also proposed. Both outperform top-ranking trackers such as Struck or TLD on a 50-video benchmark, despite being implemented in a few lines of code.
Abstract: The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled sample patches. Such sets of samples are riddled with redundancies—any overlapping pixels are constrained to be the same. Based on this simple observation, we propose an analytic model for datasets of thousands of translated patches. By showing that the resulting data matrix is circulant, we can diagonalize it with the discrete Fourier transform, reducing both storage and computation by several orders of magnitude. Interestingly, for linear regression our formulation is equivalent to a correlation filter, used by some of the fastest competitive trackers. For kernel regression, however, we derive a new kernelized correlation filter (KCF), that unlike other kernel algorithms has the exact same complexity as its linear counterpart. Building on it, we also propose a fast multi-channel extension of linear correlation filters, via a linear kernel, which we call dual correlation filter (DCF). Both KCF and DCF outperform top-ranking trackers such as Struck or TLD on a 50 videos benchmark, despite running at hundreds of frames-per-second, and being implemented in a few lines of code (Algorithm 1). To encourage further developments, our tracking framework was made open-source.
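
The closed forms referred to in the abstract fit in a few lines: training solves alpha_hat = y_hat / (k_hat^{xx} + lambda) and detection evaluates the inverse FFT of k_hat^{xz} ⊙ alpha_hat, with the kernel correlations themselves computed via FFTs. The single-channel NumPy sketch below follows those formulas with a Gaussian kernel; the normalization inside the kernel and the label construction are common implementation choices assumed here, not taken from the released code.

```python
import numpy as np

def gaussian_correlation(x, z, sigma=0.5):
    """Kernel correlation k^{xz} over all cyclic shifts, computed with FFTs."""
    c = np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(z)))
    d = (x ** 2).sum() + (z ** 2).sum() - 2.0 * np.real(c)
    return np.exp(-np.maximum(d, 0) / (sigma ** 2 * x.size))

def kcf_train(x, y, lam=1e-4):
    """Closed-form dual solution: alpha_hat = y_hat / (k_hat^{xx} + lambda)."""
    k = gaussian_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def kcf_detect(alpha_hat, x, z):
    """Response map over all cyclic shifts of a new patch z."""
    k = gaussian_correlation(z, x)
    return np.real(np.fft.ifft2(np.fft.fft2(k) * alpha_hat))

# Usage: train on a patch with a Gaussian label map, detect in the next frame.
patch = np.random.randn(64, 64)
yy, xx = np.mgrid[-32:32, -32:32]
label = np.exp(-(xx ** 2 + yy ** 2) / (2 * 4.0 ** 2))
alpha_hat = kcf_train(patch, np.roll(label, (32, 32), axis=(0, 1)))
response = kcf_detect(alpha_hat, patch, np.random.randn(64, 64))
```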

4,994 citations


"Accurate scale estimation for robus..." refers methods in this paper

  • ...Similar to [13], we use HOG features for the translation filter and concatenate it with the usual image intensity features....

  • ...The discriminative correlation filters described in section 2 have recently been extended to multi-dimensional features for a variety of applications, including visual tracking [4, 13], object detection [8, 12] and object alignment [2]....

Proceedings ArticleDOI
23 Jun 2013
TL;DR: Large scale experiments are carried out with various evaluation criteria to identify effective approaches for robust tracking and provide potential future research directions in this field.
Abstract: Object tracking is one of the most important components in numerous applications of computer vision. While much progress has been made in recent years with efforts on sharing code and datasets, it is of great importance to develop a library and benchmark to gauge the state of the art. After briefly reviewing recent advances of online object tracking, we carry out large scale experiments with various evaluation criteria to understand how these algorithms perform. The test image sequences are annotated with different attributes for performance evaluation and analysis. By analyzing quantitative results, we identify effective approaches for robust tracking and provide potential future research directions in this field.
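
The evaluation protocol described here boils down to two curves per tracker: distance precision as a function of a center-location-error threshold (commonly reported at 20 pixels) and the success plot, i.e. the fraction of frames whose intersection-over-union exceeds each overlap threshold, summarized by its area under the curve. A minimal NumPy version of these one-pass-evaluation (OPE) metrics, assuming boxes are given as (x, y, w, h), is sketched below.

```python
import numpy as np

def ope_metrics(pred, gt, dp_threshold=20):
    """Distance precision and success AUC for one-pass evaluation (OPE).

    pred, gt : (N, 4) arrays of (x, y, w, h) boxes, one row per frame.
    Returns precision at `dp_threshold` pixels and the success-plot AUC.
    """
    # Center location error per frame.
    pc = pred[:, :2] + pred[:, 2:] / 2
    gc = gt[:, :2] + gt[:, 2:] / 2
    cle = np.linalg.norm(pc - gc, axis=1)
    precision = np.mean(cle <= dp_threshold)

    # Intersection-over-union per frame.
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    iou = inter / np.maximum(union, 1e-12)

    # Success plot: fraction of frames above each overlap threshold; AUC = mean.
    thresholds = np.linspace(0, 1, 21)
    success = np.array([np.mean(iou > t) for t in thresholds])
    return precision, success.mean()

# Usage with random boxes (illustrative):
p, auc = ope_metrics(np.random.rand(500, 4) * 100, np.random.rand(500, 4) * 100)
```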

3,828 citations


"Accurate scale estimation for robus..." refers background or methods in this paper

  • ...In addition, the results are presented using precision and success plots [18]....

  • ...To validate our approach, we perform extensive experiments on all the 28 image sequences annotated with “Scale Variation (SV)” in the recent benchmark evaluation [18]....

  • ...Evaluation Methodology: The performance of our approach is quantitatively validated by following the protocol used in [18]....

  • ...Datasets: We employ all the 28 sequences annotated with the scale variation attribute in the recent evaluation of tracking methods [18]....

  • ...[18] performed a comprehensive evaluation of online visual tracking approaches....

Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new type of correlation filter is presented, the Minimum Output Sum of Squared Error (MOSSE) filter, which produces stable correlation filters when initialized from only a single frame; occlusion is detected via the peak-to-sidelobe ratio, which enables the tracker to pause and resume where it left off when the object reappears.
Abstract: Although not commonly used, correlation filters can track complex objects through rotations, occlusions and other distractions at over 20 times the rate of current state-of-the-art techniques. The oldest and simplest correlation filters use simple templates and generally fail when applied to tracking. More modern approaches such as ASEF and UMACE perform better, but their training needs are poorly suited to tracking. Visual tracking requires robust filters to be trained from a single frame and dynamically adapted as the appearance of the target object changes. This paper presents a new type of correlation filter, a Minimum Output Sum of Squared Error (MOSSE) filter, which produces stable correlation filters when initialized using a single frame. A tracker based upon MOSSE filters is robust to variations in lighting, scale, pose, and nonrigid deformations while operating at 669 frames per second. Occlusion is detected based upon the peak-to-sidelobe ratio, which enables the tracker to pause and resume where it left off when the object reappears.
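
The filter itself is a small amount of Fourier-domain arithmetic: accumulate G ⊙ F* in a numerator and F ⊙ F* in a denominator, blend both with a learning rate for adaptation, and monitor the peak-to-sidelobe ratio of the response for failure detection. The NumPy sketch below follows that recipe; the learning rate, regularizer and simplified sidelobe region are illustrative choices, not the paper's exact settings.

```python
import numpy as np

class MosseFilter:
    """Minimal MOSSE-style correlation filter (illustrative parameters)."""

    def __init__(self, lr=0.125, eps=1e-5):
        self.lr, self.eps = lr, eps
        self.A = self.B = None      # running numerator / denominator

    def update(self, patch, goal):
        """patch: preprocessed image patch; goal: desired Gaussian response."""
        F_ = np.fft.fft2(patch)
        G = np.fft.fft2(goal)
        A_new, B_new = G * np.conj(F_), F_ * np.conj(F_)
        if self.A is None:                       # first frame: initialize
            self.A, self.B = A_new, B_new
        else:                                    # running average (adaptation)
            self.A = self.lr * A_new + (1 - self.lr) * self.A
            self.B = self.lr * B_new + (1 - self.lr) * self.B

    def respond(self, patch):
        H_conj = self.A / (self.B + self.eps)    # H* = A / B
        resp = np.real(np.fft.ifft2(np.fft.fft2(patch) * H_conj))
        # Peak-to-sidelobe ratio (the paper excludes a window around the peak;
        # here only the peak pixel is dropped, a simplification).
        peak = resp.max()
        side = np.delete(resp.flatten(), resp.argmax())
        psr = (peak - side.mean()) / (side.std() + 1e-12)
        return resp, psr                         # low PSR signals occlusion/failure

# Usage: initialize from one frame with a Gaussian goal, then query a new frame.
yy, xx = np.mgrid[:64, :64]
goal = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / (2 * 2.0 ** 2))
tracker = MosseFilter()
tracker.update(np.random.randn(64, 64), goal)
response, psr = tracker.respond(np.random.randn(64, 64))
```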

2,948 citations


"Accurate scale estimation for robus..." refers background or methods in this paper

  • ...In recent years, tracking-by-detection methods [3, 9, 11, 19] have shown to provide excellent tracking performance....

  • ...The intensity based baseline roughly corresponds to the MOSSE tracker proposed in [3], but without any explicit failure detection component....

  • ...[3], is based on finding an adaptive correlation filter by minimizing the output sum of squared error (MOSSE)....

  • ...As mentioned in [3], the regularization parameter alleviates the problem of zero-frequency components in the spectrum of f , which would lead to division by zero....

  • ...Finally, the extracted features are always multiplied by a Hann window, as described in [3]....

Book ChapterDOI
07 Oct 2012
TL;DR: Using the well-established theory of Circulant matrices, this work provides a link to Fourier analysis that opens up the possibility of extremely fast learning and detection with the Fast Fourier Transform, which can be done in the dual space of kernel machines as fast as with linear classifiers.
Abstract: Recent years have seen greater interest in the use of discriminative classifiers in tracking systems, owing to their success in object detection. They are trained online with samples collected during tracking. Unfortunately, the potentially large number of samples becomes a computational burden, which directly conflicts with real-time requirements. On the other hand, limiting the samples may sacrifice performance. Interestingly, we observed that, as we add more and more samples, the problem acquires circulant structure. Using the well-established theory of Circulant matrices, we provide a link to Fourier analysis that opens up the possibility of extremely fast learning and detection with the Fast Fourier Transform. This can be done in the dual space of kernel machines as fast as with linear classifiers. We derive closed-form solutions for training and detection with several types of kernels, including the popular Gaussian and polynomial kernels. The resulting tracker achieves performance competitive with the state-of-the-art, can be implemented with only a few lines of code and runs at hundreds of frames-per-second. MATLAB code is provided in the paper (see Algorithm 1).
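
The link to Fourier analysis is easiest to see in the linear case: because the matrix of all cyclic shifts of a patch is circulant, ridge regression over those shifts has an elementwise closed form in the DFT domain, and the kernelized version in the paper adds one kernel-correlation FFT on top of the same structure. The NumPy sketch below shows the linear closed form only; it illustrates the diagonalization trick rather than reproducing the paper's kernel formulation.

```python
import numpy as np

def circulant_ridge_regression(x, y, lam=1e-4):
    """Train a linear ridge regressor over all cyclic shifts of the patch x.

    The data matrix of cyclic shifts is circulant, so it is diagonalized by
    the DFT and the solution is elementwise:
        w_hat = conj(x_hat) * y_hat / (|x_hat|^2 + lam)
    """
    x_hat, y_hat = np.fft.fft2(x), np.fft.fft2(y)
    w_hat = np.conj(x_hat) * y_hat / (x_hat * np.conj(x_hat) + lam)
    return np.real(np.fft.ifft2(w_hat))          # filter in the spatial domain

def detect(w, z):
    """Evaluate the regressor on all cyclic shifts of a new patch z via FFTs."""
    return np.real(np.fft.ifft2(np.fft.fft2(z) * np.conj(np.fft.fft2(w))))
```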

2,197 citations


"Accurate scale estimation for robus..." refers background or methods in this paper

  • ...In recent years, tracking-by-detection methods [3, 9, 11, 19] have shown to provide excellent tracking performance....

  • ...Given an image patch, the CSK tracker works by learning a kernelized least-squares classifier of the target appearance....

  • ...[table fragment: median OP comparison of Ours against ASLA [14], SCM [20], Struck [9], TLD [15], EDFT [6], L1APG [1], DFT [17], LOT [16], CSK [11], LSHT [10] and CT [19]]...

  • ...Most tracking-by-detection methods, such as the CSK and MOSSE, are limited to only estimating the target translation....

  • ...[Figure 2: Precision and success plots over all the 28 sequences; the legend scores rank the proposed tracker first in both plots (precision: Ours 0.745, Struck 0.659, ASLA 0.612, ...; success: Ours 0.549, ASLA 0.492, SCM 0.477, ...)]...
