
Convolutional Features for Correlation Filter Based Visual Tracking
Martin Danelljan, Gustav Häger, Fahad Shahbaz Khan and Michael Felsberg
Conference article
Cite this conference article as:
Danelljan, M., Häger, G., Shahbaz Khan, F., Felsberg, M. Convolutional Features for Correlation Filter Based Visual Tracking. In 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), IEEE conference proceedings; 2015, pp. 621-629. ISBN: 978-1-4673-9711-7
DOI: https://doi.org/10.1109/ICCVW.2015.84
Copyright: IEEE conference proceedings
The self-archived postprint version of this conference article is available at Linköping University Institutional Repository (DiVA):
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-128869

Convolutional Features for Correlation Filter Based Visual Tracking

Martin Danelljan¹, Gustav Häger¹, Fahad Shahbaz Khan, Michael Felsberg
Computer Vision Laboratory, Linköping University, Sweden
{martin.danelljan, gustav.hager, fahad.khan, michael.felsberg}@liu.se
Abstract

Visual object tracking is a challenging computer vision problem with numerous real-world applications. This paper investigates the impact of convolutional features for the visual tracking problem. We propose to use activations from the convolutional layer of a CNN in discriminative correlation filter based tracking frameworks. These activations have several advantages compared to the standard deep features (fully connected layers). Firstly, they mitigate the need for task-specific fine-tuning. Secondly, they contain structural information crucial for the tracking problem. Lastly, these activations have low dimensionality.

We perform comprehensive experiments on three benchmark datasets: OTB, ALOV300++ and the recently introduced VOT2015. Surprisingly, different from image classification, our results suggest that activations from the first layer provide superior tracking performance compared to the deeper layers. Our results further show that the convolutional features provide improved results compared to standard hand-crafted features. Finally, results comparable to state-of-the-art trackers are obtained on all three benchmark datasets.
1. Introduction
Visual tracking is the task of estimating the trajectory of a target object in an image sequence. It has many important real-world applications, such as robotics [11] and road scene understanding [18]. In the generic tracking problem, the target can be any object, and only its initial location is known. This problem is challenging due to several factors, such as appearance changes, scale variations, deformations and occlusions. Most state-of-the-art approaches tackle the tracking problem by learning a discriminative appearance model of the target object. Such approaches [9, 12, 21] rely on rich feature representations for describing the target and background appearance. This paper investigates robust feature representations for visual tracking.
Among the discriminative tracking methods, correlation filter based approaches have recently shown excellent performance on benchmark tracking datasets [24, 39]. These approaches learn a discriminative correlation filter (DCF) from example patches of the target appearance. Initially, the DCF framework was restricted to a single feature channel (e.g. a grayscale image) [4]. Later works have investigated extending the single-channel DCF to multi-channel feature representations for tracking [12]. However, existing DCF based approaches [4, 12, 21] suffer from the periodic boundary effects induced by circular correlation. Only recently, Danelljan et al. [10] proposed Spatially Regularized Discriminative Correlation Filters (SRDCF) to mitigate the negative effects of the inherent periodic assumption of the standard DCF. In this work, we investigate convolutional features within both the standard DCF framework and the more recent SRDCF framework.

¹ Both authors contributed equally.

Figure 1. A comparison of the proposed feature representation with three commonly employed hand-crafted features, namely image intensity, Color Names (CN) and Histograms of Oriented Gradients (HOG). Tracking results from our DCF tracker on three example sequences are shown. The convolutional features used in our tracker provide a richer description of the target appearance, leading to better performance.
Initially, most tracking approaches relied on using only image intensity information or simple color transformations [4, 30, 32] for feature representation. In recent years, hand-crafted histogram-based descriptors have shown improved results for visual tracking. Feature representations such as HOG [21], Color Names [12] and channel representations [8] have successfully been employed in DCF based tracking frameworks. These descriptors aim at capturing the shape, color or luminance information of the target appearance. Combining multiple features has also been investigated [28] within a DCF framework.
Recently, Convolutional Neural Networks (CNNs) have significantly advanced the state-of-the-art in many vision applications, including object recognition [25, 31] and object detection [19]. These networks take a fixed-size RGB image as input to a sequence of convolution, local normalization and pooling operations (called layers). The final layers in the network are fully connected (FC), and are typically used to extract features for classification. CNNs require a large amount of training data, and are trained on the large-scale ImageNet dataset [13]. It has been shown that the deep features extracted from the network (the FC layer) are generic and can be used for a variety of vision applications [2].
As discussed above, the common strategy is to extract deep features from the activations of the FC layer of the pre-trained network. Beyond the FC layer, activations from convolutional layers of the network have recently been shown to achieve superior results for image classification [6]. These convolutional layers are discriminative, semantically meaningful and contain structural information crucial for the localization task. Additionally, the use of convolutional features mitigates the need for the task-specific fine-tuning employed with standard deep features. In such approaches, it has been shown that activations from the last convolutional layer provide improved results compared to other layers of the same network [6].
In this work, we investigate the impact of convolutional features in two DCF based tracking frameworks: the standard DCF framework and the SRDCF framework [10]. Contrary to image classification, we show that activations from the first layer provide superior tracking performance compared to the deeper layers of the network. Finally, we provide both qualitative and quantitative comparisons of convolutional features with the standard hand-crafted histogram descriptors commonly used within DCF based trackers.

Comprehensive experiments are performed on three benchmark datasets: the Online Tracking Benchmark (OTB) [39], the Amsterdam Library of Ordinary Videos for tracking (ALOV300++) [35] and the Visual Object Tracking (VOT) challenge 2015 [1]. Our results demonstrate that superior performance is obtained by using convolutional features compared to standard hand-crafted feature representations. Finally, we show that our proposed tracker achieves state-of-the-art tracking performance on all three benchmark datasets. Figure 1 provides a comparison of our tracker employing convolutional features with commonly used feature representations within the same DCF based tracking framework.
The paper is organized as follows. Section 2 discusses related work in tracking and convolutional neural networks. Our tracking framework is described in Section 3: the employed DCF and SRDCF frameworks are briefly presented in Sections 3.1 and 3.2 respectively, while the used convolutional features are discussed in Section 3.3. Section 4 contains the experimental evaluations and results. Finally, conclusions are provided in Section 5.
2. Related Work
The visual tracking problem can be approached using generative [34, 22] or discriminative [20, 3, 40] appearance models. The latter methods apply machine learning techniques to discriminate the target appearance from the background. Recently, Discriminative Correlation Filter (DCF) [4] based approaches have achieved state-of-the-art results on benchmark tracking datasets [24, 39]. The success of DCF based methods is evident from the outcome of the Visual Object Tracking (VOT) 2014 challenge [24], where the top three entries employ variants of the DCF framework. Related methods [12, 21] have also shown excellent results on the Object Tracking Benchmark (OTB) [39]. In this work, we employ the DCF framework to investigate the impact of convolutional features for tracking.
The DCF based tracking approaches learn a correlation filter to discriminate between the target and background appearance. The training data is composed of observed samples of the target appearance and the surrounding background. Bolme et al. [4] initially proposed the MOSSE tracker, which is restricted to using a single feature channel, typically a grayscale image. Henriques et al. [21] introduced a kernelized version of the tracker to allow non-linear classification boundaries. More recent works [12, 9, 21] have achieved a significant increase in tracking performance by investigating the use of multi-dimensional features in the DCF tracking framework.
Despite their success, it is known that standard DCF based trackers greatly suffer from the periodic assumption induced by circular correlation. This leads to inaccurate and insufficient training samples as well as a restricted search area. Galoogahi et al. [16] propose to solve a constrained problem using the Alternating Direction Method of Multipliers (ADMM) to preserve the correct filter size. This method is however restricted to a single feature channel and hence not applicable for our purpose. Recently, Danelljan et al. [10] tackled these issues by introducing the Spatially Regularized DCF (SRDCF). Their approach allows the expansion of the training and search regions without increasing the effective filter size. This increases the discriminative power and robustness of the tracker, leading to a significant performance gain. Moreover, the filter is optimized directly in the Fourier domain using Gauss-Seidel iterations, while every ADMM iteration in [16] requires a transition between the spatial and Fourier domains.
In the last few years, convolutional neural networks (CNNs) have significantly advanced the state-of-the-art in object recognition and detection benchmarks [33]. CNNs learn invariant features by a series of convolution and pooling operations, which are followed by one or more fully connected (FC) layers. The entire network is trained on raw pixels with a fixed input size. In order to train these networks, a large amount of labeled training data [26] is required. The activations of the fully connected layers in a trained deep network are known to contain general-purpose features applicable to several visual recognition tasks, such as attribute recognition, action recognition and scene classification [2].
Interestingly, recent results [6, 29] suggest that improved performance is obtained using convolutional layer activations instead of those extracted from the fully connected layers of the same network. The convolutional layers in deep networks are discriminative, semantically meaningful and mitigate the need to apply task-specific fine-tuning. The work of [29] proposes a cross-convolutional-layer pooling approach, which employs the feature maps of one convolutional layer as local features. The image representation is obtained by pooling the extracted features using the feature maps of the successive convolutional layers. A multi-scale convolutional feature based approach is proposed in [6] for texture classification and object recognition. In their method, activations from the convolutional layer of a pre-trained network are used as local features. Further, it was shown that the activations of the last convolutional layer of the network provide superior performance compared to other layers [6] for visual recognition.
Despite the success of deep features in several computer vision tasks, less attention has been dedicated to investigating deep features in the context of visual tracking. Fan et al. [14] propose a human tracking algorithm that learns convolutional features from offline training data. The authors of [38] propose a compact deep feature based tracking framework that learns generic features by employing a stacked denoising auto-encoder. Zhou et al. [42] investigate boosting techniques to construct an ensemble of deep networks for visual tracking. Li et al. [27] propose a deep tracking framework using a candidate pool of multiple CNNs. Different from the above-mentioned works, we investigate the impact of deep features for DCF based tracking. We exploit the spatial structure of the convolutional features for learning a DCF (or SRDCF), which acts as a final classification layer in the network. In this paper, we also investigate the performance of different convolutional layers and compare them with standard hand-crafted features.
3. Method
Our tracking approach is based on learning a DCF or an SRDCF from samples of the target appearance. For image description, we employ convolutional features extracted from these samples. In each new frame, the learned DCF is applied to the convolutional features extracted at the predicted target location. A location estimate is then obtained by maximizing the detection scores.
3.1. Discriminative Correlation Filters
In this work, we use a standard DCF framework to investigate the impact of convolutional features for tracking. The DCF framework utilizes the properties of circular correlation to efficiently train and apply a classifier in a sliding window fashion. The resulting classifier is a correlation (or convolution) filter which is applied to the input feature channels. Hence, the correlation operation within the DCF acts similarly to a convolutional layer in a CNN. The corresponding learned filter can be viewed as a final convolutional classification layer in the network. Unlike the costly methods typically applied for training CNNs, the DCF is trained efficiently by solving a linear least-squares problem and exploiting the Fast Fourier Transform (FFT).
The discriminative correlation filter $f_t$ is learned from a set of example patches $x_k$, which are sampled at each frame $k = 1, \ldots, t$. Here, $t$ denotes the current frame number. The patches are all of the same size and are typically centered at the estimated target location in each frame. We denote feature channel $j$ of $x_k$ by a superscript, $x_k^j$. In our case, $x_k^j$ corresponds to the output of channel $j$ at a convolutional layer in the CNN. The objective is to learn a correlation filter $f_t^j$ for each channel $j$ that minimizes the following loss,

$$\varepsilon = \sum_{k=1}^{t} \alpha_k \left\| f_t \star x_k - y_k \right\|^2 + \lambda \left\| f_t \right\|^2 . \tag{1}$$
Here, $\star$ denotes circular correlation generalized to multi-channel signals in the conventional way by computing inner products. That is, the correlation outputs of the individual channels are summed over the channel dimension to produce a single-channel output. The desired correlation output $y_k$ is set to a Gaussian function with its peak placed at the target center location [4]. A weight parameter $\lambda$ controls the impact of the regularization term, while the weights $\alpha_k$ determine the impact of each training sample.
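As a concrete illustration of the label construction, the following is a minimal numpy sketch (not the authors' code) of a Gaussian output $y$ with its peak at the patch center; the standard deviation `sigma` is a free parameter that we choose arbitrarily here.

```python
import numpy as np

def gaussian_label(height, width, sigma=2.0):
    """Desired correlation output y: a 2-D Gaussian peaked at the patch center."""
    ys = np.arange(height) - height // 2
    xs = np.arange(width) - width // 2
    Y, X = np.meshgrid(ys, xs, indexing="ij")
    return np.exp(-(X**2 + Y**2) / (2.0 * sigma**2))

y = gaussian_label(64, 64)  # peak value 1.0 at the center of a 64x64 patch
```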
To find an approximate solution of (1), we use the online update rule of [9]. At frame $t$, the numerator $\hat{g}_t$ and denominator $\hat{h}_t$ of the discrete Fourier transformed (DFT) filter $\hat{f}_t$ are updated as,

$$\hat{g}_t^j = (1 - \gamma)\, \hat{g}_{t-1}^j + \gamma\, \overline{\hat{y}_t} \cdot \hat{x}_t^j \tag{2a}$$

$$\hat{h}_t = (1 - \gamma)\, \hat{h}_{t-1} + \gamma \left( \sum_{l=1}^{d} \overline{\hat{x}_t^l} \cdot \hat{x}_t^l + \lambda \right) . \tag{2b}$$

Here, the hat denotes the 2-dimensional DFT, the bar denotes complex conjugation and $\cdot$ denotes pointwise multiplication. The scalar $\gamma \in [0, 1]$ is a learning rate parameter and $d$ is the number of feature channels. The sought filter can then be constructed by a pointwise division, $\hat{f}_t^j = \hat{g}_t^j / \hat{h}_t$.
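For concreteness, a minimal numpy sketch of one such update might look as follows; the array layout, variable names and default parameter values are our own assumptions and are not taken from the paper.

```python
import numpy as np

def update_filter(g_prev, h_prev, x, y, gamma=0.025, lam=0.01):
    """One online DCF update following eqs. (2a)-(2b).

    x: (d, H, W) feature channels of the new training patch
    y: (H, W) desired Gaussian correlation output
    g_prev: (d, H, W) numerator and h_prev: (H, W) denominator from
            frame t-1 (pass None for both on the first frame)
    """
    x_hat = np.fft.fft2(x, axes=(-2, -1))      # per-channel 2-D DFT
    y_hat = np.fft.fft2(y)
    g_obs = np.conj(y_hat)[None] * x_hat       # new-sample term of (2a)
    h_obs = np.sum(np.conj(x_hat) * x_hat, axis=0).real + lam  # term of (2b)
    if g_prev is None:
        return g_obs, h_obs                    # first frame: no running average
    g = (1.0 - gamma) * g_prev + gamma * g_obs
    h = (1.0 - gamma) * h_prev + gamma * h_obs
    return g, h

# The transformed filter is then obtained by pointwise division:
# f_hat = g / h[None]
```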
To locate the target at frame $t$, a sample patch $z_t$ is first extracted at the previous location. The filter is then applied by computing the correlation scores in the Fourier domain,

$$s_t = \mathcal{F}^{-1} \left\{ \sum_{j=1}^{d} \overline{\hat{f}_{t-1}^j} \cdot \hat{z}_t^j \right\} . \tag{3}$$

Here, $\mathcal{F}^{-1}$ denotes the inverse DFT. To obtain an estimate of the target scale, we apply the learned filter at multiple resolutions. The target location and scale in the image are then updated by finding the maximum correlation score over all evaluated locations and scales.
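The detection step of (3), together with the multi-resolution scale search, can be sketched as follows; again, this is a schematic numpy illustration under the conventions of the update sketch above, not the authors' implementation.

```python
import numpy as np

def detection_scores(g, h, z):
    """Correlation scores of eq. (3) for a test patch.

    z: (d, H, W) feature channels extracted at the predicted location
    g, h: numerator and denominator of the learned filter
    """
    z_hat = np.fft.fft2(z, axes=(-2, -1))
    f_hat = g / h[None]                              # pointwise division
    s_hat = np.sum(np.conj(f_hat) * z_hat, axis=0)   # sum over channels
    return np.fft.ifft2(s_hat).real                  # inverse DFT

# Usage: compute scores for patches extracted at several resolutions,
# e.g. relative scales (0.95, 1.0, 1.05), and keep the location and
# scale of the global maximum:
# scores = detection_scores(g, h, z)
# dy, dx = np.unravel_index(np.argmax(scores), scores.shape)
```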
3.2. Spatially Regularized Discriminative Correlation Filters

As discussed above, the conventional DCF tracking approaches have demonstrated impressive performance in recent years. However, the standard DCF formulation is severely hampered by the periodic assumption introduced by the circular correlation. This leads to unwanted periodic boundary effects at both the training and detection stages. Such periodic boundary effects limit the performance of the DCF in several aspects. First, the DCF trackers struggle in cases of fast motion due to a restricted search region. More importantly, the inaccurate and insufficient training data limit the discriminative power of the learned model and lead to over-fitting.
To mitigate the periodic boundary effects, Danelljan et al. [10] recently proposed Spatially Regularized Correlation Filters (SRDCF), leading to a significant performance boost for correlation based trackers. The authors introduced a spatial regularization function $w$ that penalizes filter coefficients residing outside the target bounding box. This allows an expansion of the training and detection regions without increasing the effective filter size. Instead of (1), the following cost is minimized,

$$\varepsilon = \sum_{k=1}^{t} \alpha_k \left\| f_t \star x_k - y_k \right\|^2 + \sum_{l=1}^{d} \left\| w \cdot f_t^l \right\|^2 . \tag{4}$$
The spatial regularization function $w$ reflects the reliability of visual features depending on their spatial location. The function $w$ is therefore set to smoothly increase with the distance from the target center, as suggested in [10]. Since background coefficients in the filter $f_t$ are suppressed by assigning larger weights in $w$, the emphasis on background information at the detection stage is reduced. On the contrary, a naive expansion of the sample size (using a standard regularization) would also result in a similar increase in the effective filter size. However, this leads to a large emphasis on background features, thereby severely degrading the discriminative power of the learned model.
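The exact form of $w$ is not specified here; in [10] it is chosen as a quadratic function of the distance to the target center. Below is a hedged sketch under that assumption, where the constants `mu` and `eta` are illustrative values of our own, not the parameters used in the paper.

```python
import numpy as np

def spatial_weights(height, width, target_h, target_w, mu=0.1, eta=3.0):
    """Spatial regularization w: small over the target box and smoothly
    increasing with distance from its center (form assumed from [10])."""
    ys = (np.arange(height) - height // 2) / (target_h / 2.0)
    xs = (np.arange(width) - width // 2) / (target_w / 2.0)
    Y, X = np.meshgrid(ys, xs, indexing="ij")
    return mu + eta * (X**2 + Y**2)

w = spatial_weights(200, 200, 50, 50)  # ~mu near the center, large at the borders
```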
The cost (4) can be efficiently minimized in the Fourier domain by exploiting the sparsity of the DFT coefficients $\hat{w}$. Instead of relying on approximate solutions such as (2), [10] proposes an iterative Gauss-Seidel minimization scheme that converges to the global minimum of (4). We refer to [10] for a detailed description of the SRDCF training procedure.
3.3. Convolutional Features for DCF Tracking

Traditionally, DCF based approaches rely on hand-crafted features for image description [12, 21, 28]. In this work, we instead investigate the use of convolutional layer activations for DCF based tracking. We employ the imagenet-vgg-2048 network [5] using the implementation in the MatConvNet library [37].¹ The network is trained on the ImageNet dataset for the image classification task. The employed network contains five convolutional layers and uses a 224 × 224 RGB image as input. At each convolutional layer, we employ the activations produced after the rectified linear unit (ReLU) non-linearity. The samples used for training and detection in the DCF framework ($x_k$ and $z_k$ respectively) are obtained by extracting the convolutional features at the appropriate image location.

When computing the convolutional features, the image patch is pre-processed by first resizing it to the input size (224 × 224) and then subtracting the mean of the network training data. For grayscale images, we simply set the R, G and B channels equal to the grayscale intensities. As discussed in [4], the extracted features are always multiplied with a Hann window.
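A minimal sketch of this pre-processing is given below, assuming OpenCV for resizing; the per-channel mean values and the feature extraction call are placeholders (the actual implementation uses the mean of the network's training data and the imagenet-vgg-2048 network).

```python
import numpy as np
import cv2

INPUT_SIZE = 224
MEAN_RGB = np.array([123.68, 116.78, 103.94])  # placeholder channel means

def preprocess_patch(patch):
    """Resize the patch to the network input size, replicate grayscale
    to R, G and B, and subtract the training-data mean."""
    if patch.ndim == 2:                        # grayscale image
        patch = np.stack([patch] * 3, axis=-1)
    patch = cv2.resize(patch.astype(np.float32), (INPUT_SIZE, INPUT_SIZE))
    return patch - MEAN_RGB

def hann_window(height, width):
    """2-D Hann window, multiplied pointwise with each feature channel."""
    return np.outer(np.hanning(height), np.hanning(width))

# features = forward_to_conv_layer(preprocess_patch(patch))  # network call
# features *= hann_window(*features.shape[-2:])              # per channel
```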
4. Experiments

We perform experimental evaluation on three public benchmark datasets: the Online Tracking Benchmark (OTB) [39], the Amsterdam Library of Ordinary Videos for tracking (ALOV300++) [35] and the Visual Object Tracking (VOT) challenge 2015 [1].

¹ The network is available at http://www.vlfeat.org/matconvnet/models/imagenet-vgg-m-2048.mat

References
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
- N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In CVPR, 2005.
- O. Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In CVPR, 2014.