Abstract:
Visual object tracking is a challenging computer vision problem with numerous real-world applications. This paper investigates the impact of convolutional features on the visual tracking problem. We propose to use activations from the convolutional layer of a CNN in discriminative correlation filter based tracking frameworks. These activations have several advantages compared to the standard deep features (fully connected layers). Firstly, they mitigate the need for task-specific fine-tuning. Secondly, they contain structural information crucial for the tracking problem. Lastly, these activations have low dimensionality. We perform comprehensive experiments on three benchmark datasets: OTB, ALOV300++ and the recently introduced VOT2015. Surprisingly, in contrast to image classification, our results suggest that activations from the first layer provide superior tracking performance compared to the deeper layers. Our results further show that the convolutional features provide improved results compared to standard hand-crafted features. Finally, results comparable to state-of-the-art trackers are obtained on all three benchmark datasets.
TL;DR: A basic tracking algorithm is equipped with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video and achieves state-of-the-art performance in multiple benchmarks.
TL;DR: This work revisits the core DCF formulation and introduces a factorized convolution operator, which drastically reduces the number of parameters in the model, and a compact generative model of the training sample distribution that significantly reduces memory and time complexity while providing better diversity of samples.
TL;DR: This work proves the core reason Siamese trackers still have accuracy gap comes from the lack of strict translation invariance, and proposes a new model architecture to perform depth-wise and layer-wise aggregations, which not only improves the accuracy but also reduces the model size.
TL;DR: In this paper, the Correlation Filter learner is interpreted as a differentiable layer in a deep neural network, which enables learning deep features that are tightly coupled to the correlation filter.
TL;DR: A large deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieves state-of-the-art image classification performance.
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images; it has been run annually from 2010 to present, attracting participation from more than fifty institutions.
TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Q1. What contributions do the authors present in the paper "Convolutional features for correlation filter based visual tracking"?
This paper investigates the impact of convolutional features on the visual tracking problem. The authors propose to use activations from the convolutional layer of a CNN in discriminative correlation filter based tracking frameworks. The authors perform comprehensive experiments on three benchmark datasets: OTB, ALOV300++ and the recently introduced VOT2015. Surprisingly, in contrast to image classification, their results suggest that activations from the first layer provide superior tracking performance compared to the deeper layers. Their results further show that the convolutional features provide improved results compared to standard hand-crafted features.
Q2. What are the popular features in the DCF framework?
Feature representations such as HOG [21], Color Names [12] and channel representations [8] have successfully been employed in DCF based tracking frameworks.
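To make the HOG idea concrete, below is a minimal, hypothetical sketch of a single HOG cell: an L2-normalized histogram of unsigned gradient orientations, weighted by gradient magnitude. Full HOG implementations additionally use a grid of cells, block normalization, and bin interpolation; the function name and parameters here are illustrative, not from the paper.

```python
import numpy as np

def hog_cell_histogram(patch, n_bins=9):
    """Simplified HOG: orientation histogram of one cell (unsigned gradients)."""
    gy, gx = np.gradient(patch.astype(float))          # gradients along rows, cols
    mag = np.hypot(gx, gy)                             # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0       # unsigned orientation in [0, 180)
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), mag.ravel()):
        hist[b] += m                                   # magnitude-weighted vote
    return hist / (np.linalg.norm(hist) + 1e-8)        # L2 normalization
```

A patch containing only a vertical edge produces purely horizontal gradients, so all the histogram mass falls into the 0-degree bin.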
Q3. How does their tracker achieve state-of-the-art results?
The Discriminative Correlation Filter (DCF) [4] based approaches have achieved state-of-the-art results on benchmark tracking datasets [24, 39].
Q4. What do Galoogahi et al. propose to do?
Galoogahi et al. [16] propose to solve a constrained problem using the Alternating Direction Method of Multipliers (ADMM) to preserve the correct filter size.
Q5. What is the way to train a deep network?
The activations of fully connected layers in a trained deep network are known to contain general-purpose features applicable to several visual recognition tasks such as attribute recognition, action recognition and scene classification [2].
Q6. What is the weight parameter for the correlation filter?
A weight parameter λ controls the impact of the regularization term, while the weights αk determine the impact of each training sample.
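For context, these weights enter the standard multi-channel DCF learning objective. The following is a sketch of that conventional formulation (the symbols $f_k^l$ for the $l$-th feature channel of training sample $k$, $g_k$ for the desired correlation output, and $h^l$ for the filter channels are the usual notation in this line of work, not quoted from this page):

```latex
\varepsilon = \sum_{k=1}^{t} \alpha_k \Bigl\| \sum_{l=1}^{d} h^{l} \star f_k^{l} - g_k \Bigr\|^{2}
            + \lambda \sum_{l=1}^{d} \bigl\| h^{l} \bigr\|^{2}
```

Here $\star$ denotes circular correlation; the first term weights each sample's data-fitting error by $\alpha_k$, and the second term is the regularizer controlled by $\lambda$.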
Q7. How is the F-score calculated for each video?
For each video, the F-score is computed based on the percentage of successfully tracked frames, using an intersection-over-union overlap threshold of 0.5.
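This per-frame success criterion can be sketched in a few lines: compute the intersection-over-union of predicted and ground-truth boxes, count a frame as successfully tracked when the overlap reaches 0.5, and take the fraction of such frames per video. The function names and box format `(x, y, w, h)` are illustrative assumptions, not from the benchmark's reference code.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, thr=0.5):
    """Fraction of frames whose predicted box overlaps ground truth by >= thr."""
    hits = sum(iou(p, g) >= thr for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```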
Q8. What is the common type of tracking?
Bolme et al. [4] initially proposed the MOSSE tracker, which is restricted to using a single feature channel, typically a grayscale image.
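A minimal sketch of the MOSSE idea follows: a single-channel filter is learned in closed form in the Fourier domain, and localization is done by correlating a new patch with the filter and finding the response peak. The variable names and the synthetic setup are illustrative; the original tracker additionally uses preprocessing, random affine training samples, and online updates.

```python
import numpy as np

def train_mosse(samples, responses, lam=1e-2):
    """Closed-form MOSSE filter in the Fourier domain.

    samples:   grayscale training patches (2-D arrays)
    responses: desired outputs, e.g. Gaussians peaked on the target
    lam:       regularization constant avoiding division by zero
    """
    A = 0.0
    B = 0.0
    for f, g in zip(samples, responses):
        F = np.fft.fft2(f)
        G = np.fft.fft2(g)
        A = A + G * np.conj(F)   # numerator:   sum_i G_i . conj(F_i)
        B = B + F * np.conj(F)   # denominator: sum_i F_i . conj(F_i)
    return A / (B + lam)         # conjugate of the learned filter, H*

def respond(H_conj, patch):
    """Correlate a patch with the filter; the response peak locates the target."""
    return np.real(np.fft.ifft2(H_conj * np.fft.fft2(patch)))
```

Training on a patch with a Gaussian response centered on the target makes the correlation output of that patch peak at the target location.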
Q9. What is the way to extract features from the network?
Other than the FC layer, activations from convolutional layers of the network have recently been shown to achieve superior results for image classification [6].
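As a toy illustration of why convolutional activations retain the spatial structure that tracking needs (unlike flattened fully-connected features), a single hand-written convolution-plus-ReLU "layer" already yields a spatially indexed feature map. This is a didactic sketch of one such activation map, not the network used in the paper:

```python
import numpy as np

def conv_feature_map(image, kernel):
    """Toy convolutional-layer activation: valid 2-D correlation followed by ReLU."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return np.maximum(out, 0.0)  # ReLU nonlinearity
```

With an edge-detecting kernel, the activations fire exactly at the image positions containing the edge, so the feature map localizes structure rather than discarding it.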