Proceedings Article

Learning Spatially Regularized Correlation Filters for Visual Tracking

TL;DR: The proposed SRDCF formulation allows the correlation filters to be learned on a significantly larger set of negative training samples, without corrupting the positive samples, and an optimization strategy is proposed, based on the iterative Gauss-Seidel method, for efficient online learning.
Abstract: Robust and accurate visual tracking is one of the most challenging computer vision problems. Due to the inherent lack of training data, a robust approach for constructing a target appearance model is crucial. Recently, discriminatively learned correlation filters (DCF) have been successfully applied to address this problem for tracking. These methods utilize a periodic assumption of the training samples to efficiently learn a classifier on all patches in the target neighborhood. However, the periodic assumption also introduces unwanted boundary effects, which severely degrade the quality of the tracking model. We propose Spatially Regularized Discriminative Correlation Filters (SRDCF) for tracking. A spatial regularization component is introduced in the learning to penalize correlation filter coefficients depending on their spatial location. Our SRDCF formulation allows the correlation filters to be learned on a significantly larger set of negative training samples, without corrupting the positive samples. We further propose an optimization strategy, based on the iterative Gauss-Seidel method, for efficient online learning of our SRDCF. Experiments are performed on four benchmark datasets: OTB-2013, ALOV++, OTB-2015, and VOT2014. Our approach achieves state-of-the-art results on all four datasets. On OTB-2013 and OTB-2015, we obtain an absolute gain of 8.0% and 8.2% respectively, in mean overlap precision, compared to the best existing trackers.
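As a rough illustration of the two ideas in the abstract, the sketch below learns a 1-D filter on all cyclic shifts of one sample with a position-dependent penalty, then solves the resulting normal equations with Gauss-Seidel sweeps. It is a spatial-domain toy and not the paper's formulation: the penalty shape, the values of `mu` and `eta`, the random sample and all function names are assumptions made for this example.

```python
import numpy as np

def cyclic_shift_matrix(x):
    """Row s holds x cyclically shifted by s: the periodic training set of a DCF."""
    return np.stack([np.roll(x, -s) for s in range(len(x))])

def spatial_penalty(n, mu=1.0, eta=3.0):
    """Hypothetical quadratic penalty: mu at the filter centre (index 0),
    growing with cyclic distance from it."""
    k = np.arange(n)
    dist = np.minimum(k, n - k)          # cyclic distance to index 0
    return mu + eta * (dist / (n / 2)) ** 2

def gauss_seidel(A, b, sweeps=50):
    """Plain Gauss-Seidel sweeps for A f = b, starting from zero."""
    f = np.zeros_like(b, dtype=float)
    for _ in range(sweeps):
        for i in range(len(b)):
            # entries before i were already updated in this sweep
            off_diag = A[i, :i] @ f[:i] + A[i, i + 1:] @ f[i + 1:]
            f[i] = (b[i] - off_diag) / A[i, i]
    return f

def objective(X, y, w, f):
    """Data term plus spatially weighted regulariser."""
    return np.sum((X @ f - y) ** 2) + np.sum((w * f) ** 2)

rng = np.random.default_rng(0)
n = 32
x = rng.standard_normal(n)                             # toy 1-D training sample
k = np.arange(n)
y = np.exp(-0.5 * (np.minimum(k, n - k) / 2.0) ** 2)   # desired response, peak at shift 0
X = cyclic_shift_matrix(x)
w = spatial_penalty(n)

# Normal equations of: min_f ||X f - y||^2 + ||w * f||^2
A = X.T @ X + np.diag(w ** 2)
b = X.T @ y
f_one = gauss_seidel(A, b, sweeps=1)
f_many = gauss_seidel(A, b, sweeps=50)
```

Because `A` is symmetric positive definite, each sweep is an exact coordinate-wise minimisation, so `objective(X, y, w, f_many)` cannot exceed `objective(X, y, w, f_one)`. Comparing `f_many` with `np.linalg.solve(A, b)` shows how far the iteration has progressed.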
Citations
Proceedings Article
18 Jun 2018
TL;DR: The Siamese region proposal network (Siamese-RPN) is proposed for visual object tracking. It is trained end-to-end off-line with large-scale image pairs and consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork with a classification branch and a regression branch.
Abstract: Visual object tracking has been a fundamental topic in recent years and many deep learning based trackers have achieved state-of-the-art performance on multiple benchmarks. However, most of these trackers can hardly get top performance with real-time speed. In this paper, we propose the Siamese region proposal network (Siamese-RPN) which is end-to-end trained off-line with large-scale image pairs. Specifically, it consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork including the classification branch and regression branch. In the inference phase, the proposed framework is formulated as a local one-shot detection task. We can pre-compute the template branch of the Siamese subnetwork and formulate the correlation layers as trivial convolution layers to perform online tracking. Thanks to the proposal refinement, the traditional multi-scale test and online fine-tuning can be discarded. The Siamese-RPN runs at 160 FPS while achieving leading performance in VOT2015, VOT2016 and VOT2017 real-time challenges.

2,016 citations


Cites methods from "Learning Spatially Regularized Corr..."

  • ...In this experiment, we compare our method with several representative trackers, including PTAV [11], CREST [31], SRDCF [8], SINT [33], CSR-DCF [23], Siamese-FC [4], Staple [3], CFNet [35] and DSST [9]....

    [...]

  • ...Specifically, it can surpass CSRDCF++ in the 2nd place by 14% and surpass Siamese-FC in the 3rd place by 33%....

    [...]

Posted Content
TL;DR: In this paper, a fully-convolutional Siamese network is trained end-to-end on the ILSVRC15 dataset for object detection in video, which achieves state-of-the-art performance.
Abstract: The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object's appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.

1,613 citations

Proceedings Article
21 Jul 2017
TL;DR: In this paper, the Correlation Filter learner is interpreted as a differentiable layer in a deep neural network, which enables learning deep features that are tightly coupled to the correlation filter.
Abstract: The Correlation Filter is an algorithm that trains a linear template to discriminate between images and their translations. It is well suited to object tracking because its formulation in the Fourier domain provides a fast solution, enabling the detector to be re-trained once per frame. Previous works that use the Correlation Filter, however, have adopted features that were either manually designed or trained for a different task. This work is the first to overcome this limitation by interpreting the Correlation Filter learner, which has a closed-form solution, as a differentiable layer in a deep neural network. This enables learning deep features that are tightly coupled to the Correlation Filter. Experiments illustrate that our method has the important practical benefit of allowing lightweight architectures to achieve state-of-the-art performance at high framerates.

1,329 citations

Book Chapter
TL;DR: Discriminative Correlation Filters have demonstrated excellent performance for visual object tracking and the key to their success is the ability to efficiently exploit available negative data.
Abstract: Discriminative Correlation Filters (DCF) have demonstrated excellent performance for visual object tracking. The key to their success is the ability to efficiently exploit available negative data by including all shifted versions of a training sample. However, the underlying DCF formulation is restricted to single-resolution feature maps, significantly limiting its potential. In this paper, we go beyond the conventional DCF framework and introduce a novel formulation for training continuous convolution filters. We employ an implicit interpolation model to pose the learning problem in the continuous spatial domain. Our proposed formulation enables efficient integration of multi-resolution deep feature maps, leading to superior results on three object tracking benchmarks: OTB-2015 (+5.1% in mean OP), Temple-Color (+4.6% in mean OP), and VOT2015 (20% relative reduction in failure rate). Additionally, our approach is capable of sub-pixel localization, crucial for the task of accurate feature point tracking. We also demonstrate the effectiveness of our learning formulation in extensive feature point tracking experiments. Code and supplementary material are available at this http URL.

1,324 citations


Cites background or methods from "Learning Spatially Regularized Corr..."

  • ...Among the compared methods, the SRDCF and its variants SRDCFdecon and DeepSRDCF provide the best results, all obtaining AUC scores above 60%....

    [...]

  • ...To detect the target, we perform a multi-scale search strategy [11,31] with 5 scales and a relative scale factor 1....

    [...]

  • ...Conventional DCF formulations [11,17,24] assume the feature channels to have the same spatial resolution, i....

    [...]

  • ...We also compare with SRDCFdecon, which integrates the adaptive decontamination of the training set [12] in SRDCF, and DeepSRDCF [10] employing activations from the first convolutional layer....

    [...]

  • ...The Fourier coefficients ŵ of the penalty function w are computed as described in [11]....

    [...]

Book Chapter
08 Oct 2016
TL;DR: A new aerial video dataset and benchmark for low-altitude UAV target tracking, together with a photo-realistic UAV simulator that can be coupled with tracking methods and used to extend existing real-world datasets.
Abstract: In this paper, we propose a new aerial video dataset and benchmark for low altitude UAV target tracking, as well as a photo-realistic UAV simulator that can be coupled with tracking methods. Our benchmark provides the first evaluation of many state-of-the-art and popular trackers on 123 new and fully annotated HD video sequences captured from a low-altitude aerial perspective. Among the compared trackers, we determine which ones are the most suitable for UAV tracking both in terms of tracking accuracy and run-time. The simulator can be used to evaluate tracking algorithms in real-time scenarios before they are deployed on a UAV “in the field”, as well as to generate synthetic but photo-realistic tracking datasets with automatic ground truth annotations to easily extend existing real-world datasets. Both the benchmark and simulator are made publicly available to the vision community on our website to further research in the area of object tracking from UAVs (https://ivul.kaust.edu.sa/Pages/pub-benchmark-simulator-uav.aspx).

1,277 citations


Cites methods from "Learning Spatially Regularized Corr..."

  • ...In addition, we include several of the latest trackers such as MEEM [44], MUSTER [18], DSST [8] (winner VOT2014) and SRDCF [7] (winner VOT-TIR2015 and OpenCV challenge)....

    [...]

  • ...The top performing tracker on the UAV123 dataset in terms of precision and success is SRDCF [7]....

    [...]

References
Journal Article
TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated and the performance of the support-vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensures high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.

37,861 citations


"Learning Spatially Regularized Corr..." refers methods in this paper

  • ...The main difference from other techniques, such as support vector machines [6], is that the DCF formulation exploits the properties of circular correlation for efficient training and detection....

    [...]

Proceedings Article
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"Learning Spatially Regularized Corr..." refers background in this paper

  • ...Similar to recent DCF based trackers [8, 20, 24], we also employ HOG features, using a cell size of 4×4 pixels....

    [...]

  • ...Recent work [9, 8, 10, 20, 24] have shown a notable improvement by learning multi-channel filters on multi-dimensional features, such as HOG [7] or Color-Names [31]....

    [...]

  • ...Contrary to [14], we target the problem of multi-dimensional features, such as HOG, crucial for the overall tracking performance [10, 20]....

    [...]

Journal Article
TL;DR: A new kernelized correlation filter (KCF) is derived that, unlike other kernel algorithms, has exactly the same complexity as its linear counterpart. A fast multi-channel extension of linear correlation filters, called the dual correlation filter (DCF), is also proposed. Both outperform top-ranking trackers such as Struck or TLD on a 50-video benchmark, despite being implemented in a few lines of code.
Abstract: The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled sample patches. Such sets of samples are riddled with redundancies—any overlapping pixels are constrained to be the same. Based on this simple observation, we propose an analytic model for datasets of thousands of translated patches. By showing that the resulting data matrix is circulant, we can diagonalize it with the discrete Fourier transform, reducing both storage and computation by several orders of magnitude. Interestingly, for linear regression our formulation is equivalent to a correlation filter, used by some of the fastest competitive trackers. For kernel regression, however, we derive a new kernelized correlation filter (KCF), that unlike other kernel algorithms has the exact same complexity as its linear counterpart. Building on it, we also propose a fast multi-channel extension of linear correlation filters, via a linear kernel, which we call dual correlation filter (DCF). Both KCF and DCF outperform top-ranking trackers such as Struck or TLD on a 50 videos benchmark, despite running at hundreds of frames-per-second, and being implemented in a few lines of code (Algorithm 1). To encourage further developments, our tracking framework was made open-source.
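The circulant-matrix observation in this abstract can be checked numerically. The sketch below is a minimal single-channel, 1-D example with made-up data and hypothetical function names: it solves ridge regression once on the explicit matrix of all cyclic shifts and once element-wise in the Fourier domain. Where the complex conjugate appears in the closed form depends on the shift direction and DFT convention; the form used here matches `np.roll(x, s)` rows and `np.fft.fft`.

```python
import numpy as np

def ridge_direct(x, y, lam):
    """Ridge regression on the explicit data matrix of all cyclic shifts of x."""
    n = len(x)
    X = np.stack([np.roll(x, s) for s in range(n)])   # circulant: row s is x shifted by s
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def ridge_fourier(x, y, lam):
    """Same solution computed element-wise in the Fourier domain; no matrix is formed."""
    x_hat = np.fft.fft(x)
    y_hat = np.fft.fft(y)
    w_hat = x_hat * y_hat / (np.abs(x_hat) ** 2 + lam)
    return np.real(np.fft.ifft(w_hat))

rng = np.random.default_rng(1)
n = 64
x = rng.standard_normal(n)      # toy base sample
y = rng.standard_normal(n)      # toy regression targets, one per shift
lam = 0.1                       # ridge parameter (arbitrary choice)

w_direct = ridge_direct(x, y, lam)
w_fourier = ridge_fourier(x, y, lam)
```

The direct solve costs on the order of n³ operations and stores an n×n matrix, while the Fourier version needs only a few FFTs of length n, which is the saving in storage and computation that the abstract refers to.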

4,994 citations


"Learning Spatially Regularized Corr..." refers background or methods in this paper

  • ...Similar to recent DCF based trackers [8, 20, 24], we also employ HOG features, using a cell size of 4×4 pixels....

    [...]

  • ...Among the existing methods, SAMF and MEEM provide the best results with mean OP of 64.7% [...] [Figure 7: success plots (overlap precision [%] against overlap threshold) for out-of-plane rotation (39), scale variation (28), motion blur (12) and occlusion (29).]...

    [...]

  • ...Contrary to [14], we target the problem of multi-dimensional features, such as HOG, crucial for the overall tracking performance [10, 20]....

    [...]

  • ...We provide a comparison of our tracker with 24 state-of-the-art methods from the literature: MIL [2], IVT [28], CT [36], TLD [22], DFT [29], EDFT [12], ASLA [21], L1APG [3], CSK [19], SCM [37], LOT [26], CPF [27], CXT [11], Frag [1], Struck [16], LSHT [17], LSST [32], ACT [10], KCF [20], CFLB [14], DSST [8], SAMF [24], TGPR [15] and MEEM [35]....

    [...]

  • ...approaches [5, 8, 10, 19, 20, 24] have successfully been applied to the tracking problem [23]....

    [...]

Proceedings Article
23 Jun 2013
TL;DR: Large scale experiments are carried out with various evaluation criteria to identify effective approaches for robust tracking and provide potential future research directions in this field.
Abstract: Object tracking is one of the most important components in numerous applications of computer vision. While much progress has been made in recent years with efforts on sharing code and datasets, it is of great importance to develop a library and benchmark to gauge the state of the art. After briefly reviewing recent advances of online object tracking, we carry out large scale experiments with various evaluation criteria to understand how these algorithms perform. The test image sequences are annotated with different attributes for performance evaluation and analysis. By analyzing quantitative results, we identify effective approaches for robust tracking and provide potential future research directions in this field.

3,828 citations


"Learning Spatially Regularized Corr..." refers methods in this paper

  • ...Figure 5 shows the success plots for TRE and SRE on the OTB-2013 dataset with 50 videos....

    [...]

  • ...We further propose an optimization strategy, based on the iterative Gauss-Seidel method, for efficient online learning of our SRDCF. Experiments are performed on four benchmark datasets: OTB-2013, ALOV++, OTB-2015, and VOT2014....

    [...]

  • ...Attribute-based analysis of our approach on the OTB-2013 dataset with 50 videos....

    [...]

  • ...The dataset extends OTB-2013 and contains 100 videos....

    [...]

  • ...Table 1 shows the mean overlap precision (OP) for the four methods on the OTB-2013 dataset....

    [...]
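The excerpts above report mean overlap precision (OP) without defining it. A minimal sketch follows, assuming OP is the percentage of frames in which the intersection-over-union between the predicted and ground-truth boxes exceeds a threshold (0.5 here) and that boxes are given as (x, y, width, height). The function names and the example boxes are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def overlap_precision(predicted, ground_truth, threshold=0.5):
    """Percentage of frames whose IoU exceeds the threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predicted, ground_truth))
    return 100.0 * hits / len(ground_truth)

# Four toy frames with the same ground-truth box in each
ground_truth = [(0, 0, 10, 10)] * 4
predicted = [
    (0, 0, 10, 10),    # perfect overlap, IoU = 1
    (1, 0, 10, 10),    # small drift, IoU = 90 / 110
    (5, 0, 10, 10),    # half shifted, IoU = 50 / 150
    (20, 20, 10, 10),  # lost target, IoU = 0
]
op = overlap_precision(predicted, ground_truth)   # two of four frames exceed 0.5
```

A mean OP over a dataset would then be the average of this per-sequence value across all videos.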

Journal Article
TL;DR: A tracking method that incrementally learns a low-dimensional subspace representation, efficiently adapting online to changes in the appearance of the target, and includes a method for correctly updating the sample mean and a forgetting factor to ensure less modeling power is expended fitting older observations.
Abstract: Visual tracking, in essence, deals with non-stationary image streams that change over time. While most existing algorithms are able to track objects well in controlled environments, they usually fail in the presence of significant variation of the object's appearance or surrounding illumination. One reason for such failures is that many algorithms employ fixed appearance models of the target. Such models are trained using only appearance data available before tracking begins, which in practice limits the range of appearances that are modeled, and ignores the large volume of information (such as shape changes or specific lighting conditions) that becomes available during tracking. In this paper, we present a tracking method that incrementally learns a low-dimensional subspace representation, efficiently adapting online to changes in the appearance of the target. The model update, based on incremental algorithms for principal component analysis, includes two important features: a method for correctly updating the sample mean, and a forgetting factor to ensure less modeling power is expended fitting older observations. Both of these features contribute measurably to improving overall tracking performance. Numerous experiments demonstrate the effectiveness of the proposed tracking algorithm in indoor and outdoor environments where the target objects undergo large changes in pose, scale, and illumination.

3,151 citations


"Learning Spatially Regularized Corr..." refers methods in this paper

  • ...We provide a comparison of our tracker with 24 state-of-the-art methods from the literature: MIL [2], IVT [28], CT [36], TLD [22], DFT [29], EDFT [12], ASLA [21], L1APG [3], CSK [19], SCM [37], LOT [26], CPF [27], CXT [11], Frag [1], Struck [16], LSHT [17], LSST [32], ACT [10], KCF [20], CFLB [14], DSST [8], SAMF [24], TGPR [15] and MEEM [35]....

    [...]