
Convolutional Features for Correlation Filter Based Visual Tracking
Martin Danelljan, Gustav Häger, Fahad Shahbaz Khan and Michael Felsberg
Conference article
Cite this conference article as:
Danelljan, M., Häger, G., Shahbaz Khan, F., Felsberg, M. Convolutional Features for Correlation Filter Based Visual Tracking. In 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), IEEE conference proceedings; 2015, pp. 621-629. ISBN: 978-1-4673-9711-7
DOI: https://doi.org/10.1109/ICCVW.2015.84
Copyright: IEEE conference proceedings
The self-archived postprint version of this conference article is available at Linköping University Institutional Repository (DiVA):
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-128869

Convolutional Features for Correlation Filter Based Visual Tracking

Martin Danelljan¹, Gustav Häger¹, Fahad Shahbaz Khan, Michael Felsberg
Computer Vision Laboratory, Linköping University, Sweden
{martin.danelljan, gustav.hager, fahad.khan, michael.felsberg}@liu.se
Abstract

Visual object tracking is a challenging computer vision problem with numerous real-world applications. This paper investigates the impact of convolutional features for the visual tracking problem. We propose to use activations from the convolutional layer of a CNN in discriminative correlation filter based tracking frameworks. These activations have several advantages compared to the standard deep features (fully connected layers). Firstly, they mitigate the need for task-specific fine-tuning. Secondly, they contain structural information crucial for the tracking problem. Lastly, these activations have low dimensionality.

We perform comprehensive experiments on three benchmark datasets: OTB, ALOV300++ and the recently introduced VOT2015. Surprisingly, different from image classification, our results suggest that activations from the first layer provide superior tracking performance compared to the deeper layers. Our results further show that the convolutional features provide improved results compared to standard hand-crafted features. Finally, results comparable to state-of-the-art trackers are obtained on all three benchmark datasets.
1. Introduction
Visual tracking is the task of estimating the trajectory of a target object in an image sequence. It has many important real-world applications, such as robotics [11] and road scene understanding [18]. In the generic tracking problem, the target can be any object, and only its initial location is known. This problem is challenging due to several factors, such as appearance changes, scale variations, deformations and occlusions. Most state-of-the-art approaches tackle the tracking problem by learning a discriminative appearance model of the target object. Such approaches [9, 12, 21] rely on rich feature representations for describing the target and background appearance. This paper investigates robust feature representations for visual tracking.
Among the discriminative tracking methods, correlation filter based approaches have recently shown excellent performance on benchmark tracking datasets [24, 39]. These approaches learn a discriminative correlation filter (DCF) from example patches of the target appearance. Initially, the DCF framework was restricted to a single feature channel (e.g. a grayscale image) [4]. Later works have investigated extending the single-channel DCF to multi-channel feature representations for tracking [12]. However, existing DCF based approaches [4, 12, 21] suffer from the periodic boundary effects induced by circular correlation. Only recently, Danelljan et al. [10] proposed Spatially Regularized Discriminative Correlation Filters (SRDCF) to mitigate the negative effects of the inherent periodic assumption of the standard DCF. In this work, we investigate convolutional features within both the standard DCF framework and the more recent SRDCF framework.

¹ Both authors contributed equally.

Figure 1. A comparison of the proposed feature representation with three commonly employed hand-crafted features, namely image intensity, Color Names (CN) and Histograms of Oriented Gradients (HOG). Tracking results from our DCF tracker on three example sequences are shown. The convolutional features used in our tracker provide a richer description of the target appearance, leading to better performance.
Initially, most tracking approaches relied on using only image intensity information or simple color transformations [4, 30, 32] for feature representation. In recent years, hand-crafted histogram-based descriptors have shown improved results for visual tracking. Feature representations such as HOG [21], Color Names [12] and channel representations [8] have successfully been employed in DCF based tracking frameworks. These descriptors aim at capturing the shape, color or luminance information of the target appearance. Combining multiple features has also been investigated [28] within a DCF framework.
Recently, Convolutional Neural Networks (CNNs) have significantly advanced the state-of-the-art in many vision applications, including object recognition [25, 31] and object detection [19]. These networks take a fixed-size RGB image as input to a sequence of convolution, local normalization and pooling operations (called layers). The final layers in the network are fully connected (FC), and are typically used to extract features for classification. CNNs require a large amount of training data, and are trained on the large-scale ImageNet dataset [13]. It has been shown that the deep features extracted from the network (the FC layer) are generic and can be used for a variety of vision applications [2].
As discussed above, the common strategy is to extract deep features from the activations of the FC layer of the pre-trained network. Beyond the FC layer, activations from convolutional layers of the network have recently been shown to achieve superior results for image classification [6]. These convolutional layers are discriminative, semantically meaningful and contain structural information crucial for the localization task. Additionally, the use of convolutional features mitigates the need for the task-specific fine-tuning employed with standard deep features. In such approaches, it has been shown that activations from the last convolutional layer provide improved results compared to other layers of the same network [6].
In this work, we investigate the impact of convolutional features in two DCF based tracking frameworks: the standard DCF framework and the SRDCF framework [10]. Contrary to image classification, we show that activations from the first layer provide superior tracking performance compared to the deeper layers of the network. Finally, we provide both qualitative and quantitative comparisons of convolutional features with the standard hand-crafted histogram descriptors commonly used within DCF based trackers.

Comprehensive experiments are performed on three benchmark datasets: the Online Tracking Benchmark (OTB) [39], the Amsterdam Library of Ordinary Videos for tracking (ALOV300++) [35] and the Visual Object Tracking (VOT) challenge 2015 [1]. Our results demonstrate that superior performance is obtained by using convolutional features compared to standard hand-crafted feature representations. Finally, we show that our proposed tracker achieves state-of-the-art tracking performance on all three benchmark datasets. Figure 1 provides a comparison of our tracker employing convolutional features with commonly used feature representations within the same DCF based tracking framework.
The paper is organized as follows. Section 2 discusses related work in tracking and convolutional neural networks. Our tracking framework is described in Section 3: the employed DCF and SRDCF frameworks are briefly presented in Sections 3.1 and 3.2 respectively, while the used convolutional features are discussed in Section 3.3. Section 4 contains the experimental evaluations and results. Finally, conclusions are provided in Section 5.
2. Related Work
The visual tracking problem can be approached using generative [34, 22] or discriminative [20, 3, 40] appearance models. The latter methods apply machine learning techniques to discriminate the target appearance from the background. Recently, Discriminative Correlation Filter (DCF) [4] based approaches have achieved state-of-the-art results on benchmark tracking datasets [24, 39]. The success of DCF based methods is evident from the outcome of the Visual Object Tracking (VOT) 2014 challenge [24], where the top three entries employ variants of the DCF framework. Related methods [12, 21] have also shown excellent results on the Object Tracking Benchmark (OTB) [39]. In this work, we employ the DCF framework to investigate the impact of convolutional features for tracking.
The DCF based tracking approaches learn a correlation filter to discriminate between the target and background appearance. The training data is composed of observed samples of the target appearance and the surrounding background. Bolme et al. [4] initially proposed the MOSSE tracker, which is restricted to using a single feature channel, typically a grayscale image. Henriques et al. [21] introduced a kernelized version of the tracker to allow non-linear classification boundaries. More recent works [12, 9, 21] have achieved a significant increase in tracking performance by investigating the use of multi-dimensional features in the DCF tracking framework.
Despite their success, it is known that standard DCF based trackers greatly suffer from the periodic assumption induced by circular correlation. This leads to inaccurate and insufficient training samples as well as a restricted search area. Galoogahi et al. [16] propose to solve a constrained problem using the Alternating Direction Method of Multipliers (ADMM) to preserve the correct filter size. This method is however restricted to a single feature channel and hence not applicable for our purpose. Recently, Danelljan et al. [10] tackled these issues by introducing the Spatially Regularized DCF (SRDCF). Their approach allows the expansion of the training and search regions without increasing the effective filter size. This increases the discriminative power and robustness of the tracker, leading to a significant performance gain. Moreover, the filter is optimized directly in the Fourier domain using Gauss-Seidel iterations, while every ADMM iteration in [16] requires a transition between the spatial and Fourier domains.
In the last few years, convolutional neural networks (CNNs) have significantly advanced the state-of-the-art in object recognition and detection benchmarks [33]. CNNs learn invariant features by a series of convolution and pooling operations, which are followed by one or more fully connected (FC) layers. The entire network is trained on raw pixels with a fixed input size. In order to train these networks, a large amount of labeled training data [26] is required. The activations of the fully connected layers in a trained deep network are known to contain general-purpose features applicable to several visual recognition tasks, such as attribute recognition, action recognition and scene classification [2].
Interestingly, recent results [6, 29] suggest that improved performance is obtained using convolutional layer activations instead of those extracted from the fully connected layers of the same network. The convolutional layers in deep networks are discriminative, semantically meaningful and mitigate the need to apply task-specific fine-tuning. The work of [29] proposes a cross-convolutional-layer pooling approach, which employs the feature maps of one convolutional layer as local features. The image representation is obtained by pooling the extracted features using the feature maps of the successive convolutional layers. A multi-scale convolutional feature based approach is proposed in [6] for texture classification and object recognition. In their method, activations from the convolutional layer of a pre-trained network are used as local features. Further, it was shown that the activations of the last convolutional layer of the network provide superior performance compared to other layers [6] for visual recognition.
Despite the success of deep features in several computer vision tasks, less attention has been dedicated to investigating deep features in the context of visual tracking. Fan et al. [14] propose a human tracking algorithm that learns convolutional features from offline training data. The authors of [38] propose a compact deep feature based tracking framework that learns generic features by employing a stacked denoising auto-encoder. Zhou et al. [42] investigate boosting techniques to construct an ensemble of deep networks for visual tracking. Li et al. [27] propose a deep tracking framework using a candidate pool of multiple CNNs. Different from the above-mentioned works, we investigate the impact of deep features for DCF based tracking. We exploit the spatial structure of the convolutional features for learning a DCF (or SRDCF), which acts as a final classification layer in the network. In this paper, we also investigate the performance of different convolutional layers and compare them with standard hand-crafted features.
3. Method
Our tracking approach is based on learning a DCF or an SRDCF from samples of the target appearance. For image description, we employ convolutional features extracted from these samples. In each new frame, the learned DCF is applied to the convolutional features extracted at the predicted target location. A location estimate is then obtained by maximizing the detection scores.
3.1. Discriminative Correlation Filters
In this work, we use a standard DCF framework to investigate the impact of convolutional features for tracking. The DCF framework utilizes the properties of circular correlation to efficiently train and apply a classifier in a sliding window fashion. The resulting classifier is a correlation (or convolution) filter which is applied to the input feature channels. Hence, the correlation operation within the DCF acts similarly to a convolutional layer in a CNN. The corresponding learned filter can be viewed as a final convolutional classification layer in the network. Unlike the costly methods typically applied for training CNNs, the DCF is trained efficiently by solving a linear least-squares problem and exploiting the Fast Fourier Transform (FFT).
The discriminative correlation filter $f_t$ is learned from a set of example patches $x_k$, which are sampled at each frame $k = 1, \ldots, t$. Here, $t$ denotes the current frame number. The patches are all of the same size and are typically centered at the estimated target location in each frame. We denote feature channel $j$ of $x_k$ by a superscript, $x_k^j$. In our case, $x_k^j$ corresponds to the output of channel $j$ at a convolutional layer in the CNN. The objective is to learn a correlation filter $f_t^j$ for each channel $j$ that minimizes the following loss,

$$\varepsilon = \sum_{k=1}^{t} \alpha_k \left\| f_t \star x_k - y_k \right\|^2 + \lambda \left\| f_t \right\|^2 . \tag{1}$$
Here, $\star$ denotes circular correlation generalized to multi-channel signals in the conventional way by computing inner products. That is, the correlation outputs of the individual channels are summed over the channel dimension to produce a single-channel output. The desired correlation output $y_k$ is set to a Gaussian function with its peak placed at the target center location [4]. A weight parameter $\lambda$ controls the impact of the regularization term, while the weights $\alpha_k$ determine the impact of each training sample.
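As a concrete illustration of the label construction, the following is a minimal numpy sketch (not the authors' code) of a Gaussian output $y$ with its peak at the patch center; the standard deviation `sigma` is a free parameter that we choose arbitrarily here.

```python
import numpy as np

def gaussian_label(height, width, sigma=2.0):
    """Desired correlation output y: a 2-D Gaussian peaked at the patch center."""
    ys = np.arange(height) - height // 2
    xs = np.arange(width) - width // 2
    Y, X = np.meshgrid(ys, xs, indexing="ij")
    return np.exp(-(X**2 + Y**2) / (2.0 * sigma**2))

y = gaussian_label(64, 64)  # peak value 1.0 at the center of a 64x64 patch
```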
To find an approximate solution of (1), we use the online update rule of [9]. At frame $t$, the numerator $\hat{g}_t$ and denominator $\hat{h}_t$ of the discrete Fourier transformed (DFT) filter $\hat{f}_t$ are updated as,

$$\hat{g}_t^j = (1 - \gamma)\, \hat{g}_{t-1}^j + \gamma\, \overline{\hat{y}_t} \cdot \hat{x}_t^j \tag{2a}$$

$$\hat{h}_t = (1 - \gamma)\, \hat{h}_{t-1} + \gamma \left( \sum_{l=1}^{d} \overline{\hat{x}_t^l} \cdot \hat{x}_t^l + \lambda \right) . \tag{2b}$$

Here, the hat denotes the 2-dimensional DFT, the bar denotes complex conjugation and $\cdot$ denotes pointwise multiplication. The scalar $\gamma \in [0, 1]$ is a learning rate parameter and $d$ is the number of feature channels. The sought filter can then be constructed by a pointwise division, $\hat{f}_t^j = \hat{g}_t^j / \hat{h}_t$.
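For concreteness, a minimal numpy sketch of one such update might look as follows; the array layout, variable names and default parameter values are our own assumptions and are not taken from the paper.

```python
import numpy as np

def update_filter(g_prev, h_prev, x, y, gamma=0.025, lam=0.01):
    """One online DCF update following eqs. (2a)-(2b).

    x: (d, H, W) feature channels of the new training patch
    y: (H, W) desired Gaussian correlation output
    g_prev: (d, H, W) numerator and h_prev: (H, W) denominator from
            frame t-1 (pass None for both on the first frame)
    """
    x_hat = np.fft.fft2(x, axes=(-2, -1))      # per-channel 2-D DFT
    y_hat = np.fft.fft2(y)
    g_obs = np.conj(y_hat)[None] * x_hat       # new-sample term of (2a)
    h_obs = np.sum(np.conj(x_hat) * x_hat, axis=0).real + lam  # term of (2b)
    if g_prev is None:
        return g_obs, h_obs                    # first frame: no running average
    g = (1.0 - gamma) * g_prev + gamma * g_obs
    h = (1.0 - gamma) * h_prev + gamma * h_obs
    return g, h

# The transformed filter is then obtained by pointwise division:
# f_hat = g / h[None]
```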
To locate the target at frame $t$, a sample patch $z_t$ is first extracted at the previous location. The filter is then applied by computing the correlation scores in the Fourier domain,

$$s_t = \mathcal{F}^{-1} \left\{ \sum_{j=1}^{d} \overline{\hat{f}_{t-1}^j} \cdot \hat{z}_t^j \right\} . \tag{3}$$

Here, $\mathcal{F}^{-1}$ denotes the inverse DFT. To obtain an estimate of the target scale, we apply the learned filter at multiple resolutions. The target location and scale in the image are then updated by finding the maximum correlation score over all evaluated locations and scales.
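The detection step of (3), together with the multi-resolution scale search, can be sketched as follows; again, this is a schematic numpy illustration under the conventions of the update sketch above, not the authors' implementation.

```python
import numpy as np

def detection_scores(g, h, z):
    """Correlation scores of eq. (3) for a test patch.

    z: (d, H, W) feature channels extracted at the predicted location
    g, h: numerator and denominator of the learned filter
    """
    z_hat = np.fft.fft2(z, axes=(-2, -1))
    f_hat = g / h[None]                              # pointwise division
    s_hat = np.sum(np.conj(f_hat) * z_hat, axis=0)   # sum over channels
    return np.fft.ifft2(s_hat).real                  # inverse DFT

# Usage: compute scores for patches extracted at several resolutions,
# e.g. relative scales (0.95, 1.0, 1.05), and keep the location and
# scale of the global maximum:
# scores = detection_scores(g, h, z)
# dy, dx = np.unravel_index(np.argmax(scores), scores.shape)
```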
3.2. Spatially Regularized Discriminative Correlation Filters

As discussed above, the conventional DCF tracking approaches have demonstrated impressive performance in recent years. However, the standard DCF formulation is severely hampered by the periodic assumption introduced by the circular correlation. This leads to unwanted periodic boundary effects at both the training and detection stages. Such periodic boundary effects limit the performance of the DCF in several aspects. First, the DCF trackers struggle in cases of fast motion due to a restricted search region. More importantly, the inaccurate and insufficient training data limit the discriminative power of the learned model and lead to over-fitting.
To mitigate the periodic boundary effects, Danelljan et al. [10] recently proposed Spatially Regularized Correlation Filters (SRDCF), leading to a significant performance boost for correlation based trackers. The authors introduced a spatial regularization function $w$ that penalizes filter coefficients residing outside the target bounding box. This allows an expansion of the training and detection regions without increasing the effective filter size. Instead of (1), the following cost is minimized,

$$\varepsilon = \sum_{k=1}^{t} \alpha_k \left\| f_t \star x_k - y_k \right\|^2 + \sum_{l=1}^{d} \left\| w \cdot f_t^l \right\|^2 . \tag{4}$$
The spatial regularization function $w$ reflects the reliability of visual features depending on their spatial location. The function $w$ is therefore set to smoothly increase with the distance from the target center, as suggested in [10]. Since background coefficients in the filter $f_t$ are suppressed by assigning larger weights in $w$, the emphasis on background information at the detection stage is reduced. On the contrary, a naive expansion of the sample size (using a standard regularization) would also result in a similar increase in the effective filter size. However, this leads to a large emphasis on background features, thereby severely degrading the discriminative power of the learned model.
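The exact form of $w$ is not specified here; in [10] it is chosen as a quadratic function of the distance to the target center. Below is a hedged sketch under that assumption, where the constants `mu` and `eta` are illustrative values of our own, not the parameters used in the paper.

```python
import numpy as np

def spatial_weights(height, width, target_h, target_w, mu=0.1, eta=3.0):
    """Spatial regularization w: small over the target box and smoothly
    increasing with distance from its center (form assumed from [10])."""
    ys = (np.arange(height) - height // 2) / (target_h / 2.0)
    xs = (np.arange(width) - width // 2) / (target_w / 2.0)
    Y, X = np.meshgrid(ys, xs, indexing="ij")
    return mu + eta * (X**2 + Y**2)

w = spatial_weights(200, 200, 50, 50)  # ~mu near the center, large at the borders
```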
The cost (4) can be efficiently minimized in the Fourier domain by exploiting the sparsity of the DFT coefficients $\hat{w}$. Instead of relying on approximate solutions such as (2), [10] proposes an iterative Gauss-Seidel minimization scheme that converges to the global minimum of (4). We refer to [10] for a detailed description of the SRDCF training procedure.
3.3. Convolutional Features for DCF Tracking

Traditionally, DCF based approaches rely on hand-crafted features for image description [12, 21, 28]. In this work, we instead investigate the use of convolutional layer activations for DCF based tracking. We employ the imagenet-vgg-2048 network [5] using the implementation in the MatConvNet library [37].¹ The network is trained on the ImageNet dataset for the image classification task. The employed network contains five convolutional layers and uses a 224 × 224 RGB image as input. At each convolutional layer, we employ the activations produced after the rectified linear unit (ReLU) non-linearity. The samples used for training and detection in the DCF framework ($x_k$ and $z_k$ respectively) are obtained by extracting the convolutional features at the appropriate image location.

When computing the convolutional features, the image patch is pre-processed by first resizing it to the input size (224 × 224) and then subtracting the mean of the network training data. For grayscale images, we simply set the R, G and B channels equal to the grayscale intensities. As discussed in [4], the extracted features are always multiplied with a Hann window.
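A minimal sketch of this pre-processing is given below, assuming OpenCV for resizing; the per-channel mean values and the feature extraction call are placeholders (the actual implementation uses the mean of the network's training data and the imagenet-vgg-2048 network).

```python
import numpy as np
import cv2

INPUT_SIZE = 224
MEAN_RGB = np.array([123.68, 116.78, 103.94])  # placeholder channel means

def preprocess_patch(patch):
    """Resize the patch to the network input size, replicate grayscale
    to R, G and B, and subtract the training-data mean."""
    if patch.ndim == 2:                        # grayscale image
        patch = np.stack([patch] * 3, axis=-1)
    patch = cv2.resize(patch.astype(np.float32), (INPUT_SIZE, INPUT_SIZE))
    return patch - MEAN_RGB

def hann_window(height, width):
    """2-D Hann window, multiplied pointwise with each feature channel."""
    return np.outer(np.hanning(height), np.hanning(width))

# features = forward_to_conv_layer(preprocess_patch(patch))  # network call
# features *= hann_window(*features.shape[-2:])              # per channel
```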
4. Experiments

We perform experimental evaluation on three public benchmark datasets: the Online Tracking Benchmark (OTB) [39], the Amsterdam Library of Ordinary Videos for tracking (ALOV300++) [35] and the Visual Object Tracking (VOT) challenge 2015 [1].

¹ The network is available at http://www.vlfeat.org/matconvnet/models/imagenet-vgg-m-2048.mat

References
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
- N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In CVPR, 2005.
- O. Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In CVPR, 2014.