Proceedings ArticleDOI

Accurate scale estimation for robust visual tracking

TL;DR: This paper presents a novel approach to robust scale estimation, based on discriminative correlation filters for both translation and scale, that handles large scale variations in complex image sequences while remaining accurate and efficient.
Abstract: Robust scale estimation is a challenging problem in visual object tracking. Most existing methods fail to handle large scale variations in complex image sequences. This paper presents a novel approach ...
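
To make the scale-estimation idea concrete, here is a minimal NumPy sketch of a scale-pyramid search with a learned one-dimensional correlation filter along the scale dimension, in the spirit of the approach described above. The number of scales, the scale step, the `extract_feats` helper and the layout of `scale_filter_fft` are illustrative assumptions, not the paper's settings or code.

```python
import numpy as np

def estimate_scale(image, center, base_size, scale_filter_fft, extract_feats,
                   n_scales=33, scale_step=1.02):
    """Pick the best scale by correlating a learned 1-D scale filter with
    features sampled from a pyramid of patches around the current target.

    `extract_feats(image, center, size)` is an assumed helper returning a
    fixed-length feature vector for one patch; `scale_filter_fft` holds the
    FFT (along the scale dimension) of the learned filter, one column per
    feature dimension. Both are illustrative placeholders.
    """
    exponents = np.arange(n_scales) - (n_scales - 1) / 2
    scales = scale_step ** exponents                      # geometric scale pyramid

    # One feature vector per scale, stacked into an (n_scales, d) sample.
    samples = np.stack([
        extract_feats(image, center, (base_size[0] * s, base_size[1] * s))
        for s in scales
    ])

    # Correlate along the scale dimension in the Fourier domain and return
    # the scale factor with the highest response.
    sample_fft = np.fft.fft(samples, axis=0)
    response = np.real(np.fft.ifft(
        np.sum(np.conj(scale_filter_fft) * sample_fft, axis=1)))
    return scales[np.argmax(response)]
```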


Citations
Book ChapterDOI
08 Oct 2016
TL;DR: A basic tracking algorithm is equipped with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video and achieves state-of-the-art performance in multiple benchmarks.
Abstract: The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object’s appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.
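
The operation at the heart of this formulation is a cross-correlation between the embedding of the exemplar and the embedding of a larger search region, both computed by the same convolutional network. A minimal PyTorch sketch of that comparison follows; the toy `embed` network and the 127/255 input sizes are placeholders, not the authors' architecture.

```python
import torch
import torch.nn.functional as F

def siamese_response(embed, exemplar, search):
    """Score map from a fully-convolutional Siamese comparison.

    `embed` is any convolutional feature extractor applied to both inputs
    with shared weights; the exemplar embedding is used as a correlation
    kernel over the search embedding.
    """
    z = embed(exemplar)            # (1, C, h, w)   exemplar features
    x = embed(search)              # (1, C, H, W)   search-region features
    # Cross-correlation == convolution with the exemplar features as kernel.
    return F.conv2d(x, z)          # (1, 1, H-h+1, W-w+1) response map

# Usage with a toy single-layer embedding (illustrative only):
embed = torch.nn.Conv2d(3, 64, kernel_size=3)
score = siamese_response(embed,
                         torch.randn(1, 3, 127, 127),   # exemplar crop
                         torch.randn(1, 3, 255, 255))   # search crop
```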

2,936 citations


Cites methods from "Accurate scale estimation for robus..."

  • ...3 we also compare against seven more recent state-of-the-art trackers presented in the major computer vision conferences and that can run at frame-rate speed: Staple [33], LCT [34], CCT [35], SCT4 [36], DLSSVM NU [37], DSST [38] and KCFDP [39]....

Proceedings ArticleDOI
18 Jun 2018
TL;DR: The Siamese region proposal network (Siamese-RPN) is proposed, which is trained end-to-end offline with large-scale image pairs for visual object tracking and consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork with a classification branch and a regression branch.
Abstract: Visual object tracking has been a fundamental topic in recent years and many deep learning based trackers have achieved state-of-the-art performance on multiple benchmarks. However, most of these trackers can hardly reach top performance at real-time speed. In this paper, we propose the Siamese region proposal network (Siamese-RPN), which is trained end-to-end offline with large-scale image pairs. Specifically, it consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork comprising a classification branch and a regression branch. In the inference phase, the proposed framework is formulated as a local one-shot detection task. We can pre-compute the template branch of the Siamese subnetwork and formulate the correlation layers as trivial convolution layers to perform online tracking. Benefiting from the proposal refinement, the traditional multi-scale test and online fine-tuning can be discarded. The Siamese-RPN runs at 160 FPS while achieving leading performance in the VOT2015, VOT2016 and VOT2017 real-time challenges.
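
The description above maps onto a head in which template features are lifted into correlation kernels for a classification branch (2k channels for k anchors) and a regression branch (4k channels), and the correlation itself becomes an ordinary convolution over the adjusted search features. The PyTorch sketch below illustrates that structure only; the channel counts, kernel sizes and feature-map sizes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

class SiamRPNHead(torch.nn.Module):
    """Sketch of a Siamese-RPN style head: template features become correlation
    kernels for classification (2k channels) and regression (4k channels).
    All sizes are illustrative, not the paper's configuration."""

    def __init__(self, feat_ch=256, k=5):
        super().__init__()
        self.k, self.c = k, feat_ch
        self.conv_cls_z = torch.nn.Conv2d(feat_ch, feat_ch * 2 * k, 3)  # template -> cls kernels
        self.conv_reg_z = torch.nn.Conv2d(feat_ch, feat_ch * 4 * k, 3)  # template -> reg kernels
        self.conv_cls_x = torch.nn.Conv2d(feat_ch, feat_ch, 3)          # search-side adjust
        self.conv_reg_x = torch.nn.Conv2d(feat_ch, feat_ch, 3)

    def forward(self, z_feat, x_feat):
        # Template branch: can be computed once per sequence (one-shot kernels).
        cls_k = self.conv_cls_z(z_feat)                  # (1, 2k*C, hz, wz)
        reg_k = self.conv_reg_z(z_feat)                  # (1, 4k*C, hz, wz)
        hz, wz = cls_k.shape[-2:]
        cls_k = cls_k.view(2 * self.k, self.c, hz, wz)
        reg_k = reg_k.view(4 * self.k, self.c, hz, wz)
        # Search branch: the correlation layer is a plain convolution.
        cls = F.conv2d(self.conv_cls_x(x_feat), cls_k)   # (1, 2k, H', W') anchor scores
        reg = F.conv2d(self.conv_reg_x(x_feat), reg_k)   # (1, 4k, H', W') box offsets
        return cls, reg

# Usage with dummy template/search features (illustrative sizes):
head = SiamRPNHead()
cls, reg = head(torch.randn(1, 256, 6, 6), torch.randn(1, 256, 22, 22))
```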

2,016 citations


Cites methods from "Accurate scale estimation for robus..."

  • ...In this experiment, we compare our method with several representative trackers, including PTAV [11], CREST [31], SRDCF [8], SINT [33], CSR-DCF [23], Siamese-FC [4], Staple [3], CFNet [35] and DSST [9]....

Proceedings ArticleDOI
21 Jul 2017
TL;DR: This work revisits the core DCF formulation and introduces a factorized convolution operator, which drastically reduces the number of parameters in the model, and a compact generative model of the training sample distribution, which significantly reduces memory and time complexity while providing better sample diversity.
Abstract: In recent years, Discriminative Correlation Filter (DCF) based methods have significantly advanced the state-of-the-art in tracking. However, in the pursuit of ever increasing tracking performance, their characteristic speed and real-time capability have gradually faded. Further, the increasingly complex models, with a massive number of trainable parameters, have introduced the risk of severe over-fitting. In this work, we tackle the key causes behind the problems of computational complexity and over-fitting, with the aim of simultaneously improving both speed and performance. We revisit the core DCF formulation and introduce: (i) a factorized convolution operator, which drastically reduces the number of parameters in the model, (ii) a compact generative model of the training sample distribution, which significantly reduces memory and time complexity while providing better diversity of samples, and (iii) a conservative model update strategy with improved robustness and reduced complexity. We perform comprehensive experiments on four benchmarks: VOT2016, UAV123, OTB-2015, and TempleColor. When using expensive deep features, our tracker provides a 20-fold speedup and achieves a 13.0% relative gain in Expected Average Overlap compared to the top-ranked method [12] in the VOT2016 challenge. Moreover, our fast variant, using hand-crafted features, operates at 60 Hz on a single CPU, while obtaining 65.0% AUC on OTB-2015.
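
The factorized convolution operator amounts to learning a small projection matrix alongside a compact set of filters, so detection reduces to projecting the D-channel features down to C channels and correlating them with the C filters. The NumPy sketch below shows only that dimensionality-reduction step in the Fourier domain, ignoring the continuous-domain formulation the tracker builds on; the array shapes and the name `factorized_response` are illustrative assumptions.

```python
import numpy as np

def factorized_response(x, P, f_hat):
    """Detection score for a factorized-filter model in the Fourier domain.

    x      : (H, W, D) feature map with D channels (e.g. deep features)
    P      : (D, C) learned projection matrix with C << D
    f_hat  : (H, W, C) per-channel FFT of the C compact filters
    Sizes and names are illustrative; only the factorization idea is shown.
    """
    x_proj = x @ P                                  # project features: (H, W, C)
    x_hat = np.fft.fft2(x_proj, axes=(0, 1))        # per-channel 2-D FFT
    # Correlate each projected channel with its filter and sum the channels.
    score = np.real(np.fft.ifft2(np.sum(np.conj(f_hat) * x_hat, axis=2)))
    return score                                    # (H, W) response map

# Usage with random data (D=512 channels reduced to C=64, made-up sizes):
H, W, D, C = 32, 32, 512, 64
score = factorized_response(np.random.randn(H, W, D),
                            np.random.randn(D, C),
                            np.fft.fft2(np.random.randn(H, W, C), axes=(0, 1)))
```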

1,993 citations


Cites background or methods from "Accurate scale estimation for robus..."

  • ...The recent advancement in DCF based tracking performance is driven by the use of multi-dimensional features [13, 15], robust scale estimation [7, 11], non-linear kernels [20], long-term memory components [28], sophisticated learning models [3, 10] and reducing boundary effects [9, 16]....

  • ...OTB2015 Dataset: We compare our tracker with 20 state-of-the-art methods: TLD [22], Struck [19], CFLB [16], ACT [13], TGPR [17], KCF [20], DSST [7], SAMF [25], MEEM [38], DAT [33], LCT [28], HCF [27], SRDCF [9], SRDCFad [10], DeepSRDCF [8], Staple [1], MDNet [31], SiameseFC [2], TCNN [30] and C-COT [12]....

Proceedings ArticleDOI
27 Jun 2016
TL;DR: A novel visual tracking algorithm is proposed, based on representations from a Convolutional Neural Network that is discriminatively pretrained on a large set of videos with tracking ground-truths to obtain a generic target representation.
Abstract: We propose a novel visual tracking algorithm based on the representations from a discriminatively trained Convolutional Neural Network (CNN). Our algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. Our network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify the target in each domain. We train the network with respect to each domain iteratively to obtain generic target representations in the shared layers. When tracking a target in a new sequence, we construct a new network by combining the shared layers in the pretrained CNN with a new binary classification layer, which is updated online. Online tracking is performed by evaluating the candidate windows randomly sampled around the previous target state. The proposed algorithm demonstrates outstanding performance in existing tracking benchmarks.
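
Architecturally, this corresponds to a shared trunk followed by one small binary (target vs. background) head per training sequence, with the heads discarded and a fresh head trained online for a new video. The PyTorch sketch below captures that split; the layer sizes and patch size are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiDomainNet(nn.Module):
    """Sketch of a multi-domain architecture: shared layers learned across all
    training sequences, plus one binary classification head per domain.
    Layer sizes are illustrative only."""

    def __init__(self, num_domains, feat_dim=512):
        super().__init__()
        self.shared = nn.Sequential(              # shared convolutional/fc trunk
            nn.Conv2d(3, 96, 7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(3), nn.Flatten(),
            nn.Linear(96 * 3 * 3, feat_dim), nn.ReLU())
        # One domain-specific target/background classifier per training sequence.
        self.branches = nn.ModuleList(
            [nn.Linear(feat_dim, 2) for _ in range(num_domains)])

    def forward(self, patches, domain):
        return self.branches[domain](self.shared(patches))

# During tracking, the branches are replaced by a new head updated online:
net = MultiDomainNet(num_domains=50)
online_head = nn.Linear(512, 2)                   # trained on samples from the new video
scores = online_head(net.shared(torch.randn(8, 3, 107, 107)))
```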

1,960 citations


Cites background or methods from "Accurate scale estimation for robus..."

  • ...The one-pass evaluation (OPE) is employed to compare our algorithm with the six state-of-the-art trackers including MUSTer [21], CNN-SVM [20], MEEM [42], TGPR [12], DSST [6] and KCF [18], as well as the top 2 trackers included in the benchmark—SCM [44] and Struck [17]....

  • ...Visual tracking, however, has been less affected by these popular trends since it is difficult to collect a large amount of training data for video processing applications and training algorithms specialized for visual tracking are not available yet, while the approaches based on low-level hand-crafted features still work well in practice [18, 6, 21, 42]....

  • ...We compare our algorithm with the top 5 trackers in VOT2014 challenge—DSST [6], SAMF [29], KCF [18], DGT [4] and PLT 14 [25]—and additional two state-of-the-art trackers MUSTer [21] and MEEM [42]....

  • ...In recent years, correlation filters have gained attention in the area of visual tracking due to their computational efficiency and competitive performance [3, 18, 6, 21]....

  • ...The one-pass evaluation (OPE) is employed to compare our algorithm with the six state-of-the-art trackers including MUSTer [21], CNN-SVM [20], MEEM [42], TGPR [12], DSST [6] and KCF [18], as well as the top 2 trackers included in the benchmark—SCM [44]...

Posted Content
TL;DR: A novel visual tracking algorithm is proposed based on the representations from a discriminatively trained Convolutional Neural Network (CNN), which is pretrained using a large set of videos with tracking ground-truths to obtain a generic target representation.
Abstract: We propose a novel visual tracking algorithm based on the representations from a discriminatively trained Convolutional Neural Network (CNN). Our algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. Our network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify the target in each domain. We train the network with respect to each domain iteratively to obtain generic target representations in the shared layers. When tracking a target in a new sequence, we construct a new network by combining the shared layers in the pretrained CNN with a new binary classification layer, which is updated online. Online tracking is performed by evaluating the candidate windows randomly sampled around the previous target state. The proposed algorithm illustrates outstanding performance compared with state-of-the-art methods in existing tracking benchmarks.

1,818 citations

References
Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.
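
The training scheme alternates between fixing the latent values for the positive examples (choosing, for each positive, the configuration that scores highest under the current model) and solving the resulting convex hinge-loss problem. The toy NumPy loop below sketches that alternation with plain subgradient descent; the data layout and hyperparameters are illustrative assumptions, and hard-negative mining is omitted.

```python
import numpy as np

def latent_svm(pos_candidates, neg, C=1.0, outer_iters=5, lr=1e-3, epochs=50):
    """Toy latent-SVM alternation (illustrative, not the DPM implementation).

    pos_candidates : list of (m_i, d) arrays; each positive example has m_i
                     candidate latent placements, each giving a d-dim feature.
    neg            : (n, d) array of negative features.
    """
    d = neg.shape[1]
    w = np.zeros(d)
    for _ in range(outer_iters):
        # Step 1: fix latent values -- pick the highest-scoring candidate per positive.
        pos = np.stack([cands[np.argmax(cands @ w)] for cands in pos_candidates])
        X = np.vstack([pos, neg])
        y = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
        # Step 2: solve the now-convex hinge-loss problem by subgradient descent.
        for _ in range(epochs):
            margins = y * (X @ w)
            viol = margins < 1
            grad = w - C * (y[viol, None] * X[viol]).sum(axis=0)
            w -= lr * grad
    return w

# Toy usage: 20 positives with 3 candidate placements each, 100 negatives.
rng = np.random.default_rng(0)
pos = [rng.normal(1.0, 1.0, size=(3, 16)) for _ in range(20)]
neg = rng.normal(-1.0, 1.0, size=(100, 16))
w = latent_svm(pos, neg)
```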

10,501 citations

Journal ArticleDOI
TL;DR: A new kernelized correlation filter (KCF) is derived that, unlike other kernel algorithms, has exactly the same complexity as its linear counterpart; a fast multi-channel extension of linear correlation filters, called the dual correlation filter (DCF), is also proposed. Both outperform top-ranking trackers such as Struck or TLD on a 50-video benchmark, despite being implemented in a few lines of code.
Abstract: The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled sample patches. Such sets of samples are riddled with redundancies—any overlapping pixels are constrained to be the same. Based on this simple observation, we propose an analytic model for datasets of thousands of translated patches. By showing that the resulting data matrix is circulant, we can diagonalize it with the discrete Fourier transform, reducing both storage and computation by several orders of magnitude. Interestingly, for linear regression our formulation is equivalent to a correlation filter, used by some of the fastest competitive trackers. For kernel regression, however, we derive a new kernelized correlation filter (KCF), that unlike other kernel algorithms has the exact same complexity as its linear counterpart. Building on it, we also propose a fast multi-channel extension of linear correlation filters, via a linear kernel, which we call dual correlation filter (DCF). Both KCF and DCF outperform top-ranking trackers such as Struck or TLD on a 50 videos benchmark, despite running at hundreds of frames-per-second, and being implemented in a few lines of code (Algorithm 1). To encourage further developments, our tracking framework was made open-source.
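
The closed forms referred to in the abstract fit in a few lines: training solves alpha_hat = y_hat / (k_hat^{xx} + lambda) and detection evaluates the inverse FFT of k_hat^{xz} ⊙ alpha_hat, with the kernel correlations themselves computed via FFTs. The single-channel NumPy sketch below follows those formulas with a Gaussian kernel; the normalization inside the kernel and the label construction are common implementation choices assumed here, not taken from the released code.

```python
import numpy as np

def gaussian_correlation(x, z, sigma=0.5):
    """Kernel correlation k^{xz} over all cyclic shifts, computed with FFTs."""
    c = np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(z)))
    d = (x ** 2).sum() + (z ** 2).sum() - 2.0 * np.real(c)
    return np.exp(-np.maximum(d, 0) / (sigma ** 2 * x.size))

def kcf_train(x, y, lam=1e-4):
    """Closed-form dual solution: alpha_hat = y_hat / (k_hat^{xx} + lambda)."""
    k = gaussian_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def kcf_detect(alpha_hat, x, z):
    """Response map over all cyclic shifts of a new patch z."""
    k = gaussian_correlation(z, x)
    return np.real(np.fft.ifft2(np.fft.fft2(k) * alpha_hat))

# Usage: train on a patch with a Gaussian label map, detect in the next frame.
patch = np.random.randn(64, 64)
yy, xx = np.mgrid[-32:32, -32:32]
label = np.exp(-(xx ** 2 + yy ** 2) / (2 * 4.0 ** 2))
alpha_hat = kcf_train(patch, np.roll(label, (32, 32), axis=(0, 1)))
response = kcf_detect(alpha_hat, patch, np.random.randn(64, 64))
```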

4,994 citations


"Accurate scale estimation for robus..." refers methods in this paper

  • ...Similar to [13], we use HOG features for the translation filter and concatenate it with the usual image intensity features....

  • ...The discriminative correlation filters described in section 2 have recently been extended to multi-dimensional features for a variety of applications, including visual tracking [4, 13], object detection [8, 12] and object alignment [2]....

Proceedings ArticleDOI
23 Jun 2013
TL;DR: Large scale experiments are carried out with various evaluation criteria to identify effective approaches for robust tracking and provide potential future research directions in this field.
Abstract: Object tracking is one of the most important components in numerous applications of computer vision. While much progress has been made in recent years with efforts on sharing code and datasets, it is of great importance to develop a library and benchmark to gauge the state of the art. After briefly reviewing recent advances of online object tracking, we carry out large scale experiments with various evaluation criteria to understand how these algorithms perform. The test image sequences are annotated with different attributes for performance evaluation and analysis. By analyzing quantitative results, we identify effective approaches for robust tracking and provide potential future research directions in this field.
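
The evaluation protocol described here boils down to two curves per tracker: distance precision as a function of a center-location-error threshold (commonly reported at 20 pixels) and the success plot, i.e. the fraction of frames whose intersection-over-union exceeds each overlap threshold, summarized by its area under the curve. A minimal NumPy version of these one-pass-evaluation (OPE) metrics, assuming boxes are given as (x, y, w, h), is sketched below.

```python
import numpy as np

def ope_metrics(pred, gt, dp_threshold=20):
    """Distance precision and success AUC for one-pass evaluation (OPE).

    pred, gt : (N, 4) arrays of (x, y, w, h) boxes, one row per frame.
    Returns precision at `dp_threshold` pixels and the success-plot AUC.
    """
    # Center location error per frame.
    pc = pred[:, :2] + pred[:, 2:] / 2
    gc = gt[:, :2] + gt[:, 2:] / 2
    cle = np.linalg.norm(pc - gc, axis=1)
    precision = np.mean(cle <= dp_threshold)

    # Intersection-over-union per frame.
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    iou = inter / np.maximum(union, 1e-12)

    # Success plot: fraction of frames above each overlap threshold; AUC = mean.
    thresholds = np.linspace(0, 1, 21)
    success = np.array([np.mean(iou > t) for t in thresholds])
    return precision, success.mean()

# Usage with random boxes (illustrative):
p, auc = ope_metrics(np.random.rand(500, 4) * 100, np.random.rand(500, 4) * 100)
```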

3,828 citations


"Accurate scale estimation for robus..." refers background or methods in this paper

  • ...In addition, the results are presented using precision and success plots [18]....

  • ...To validate our approach, we perform extensive experiments on all the 28 image sequences annotated with “Scale Variation (SV)” in the recent benchmark evaluation [18]....

  • ...Evaluation Methodology: The performance of our approach is quantitatively validated by following the protocol used in [18]....

  • ...Datasets: We employ all the 28 sequences annotated with the scale variation attribute in the recent evaluation of tracking methods [18]....

  • ...[18] performed a comprehensive evaluation of online visual tracking approaches....

Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new type of correlation filter is presented, the Minimum Output Sum of Squared Error (MOSSE) filter, which produces stable correlation filters when initialized from only a single frame; occlusion is detected via the peak-to-sidelobe ratio, which enables the tracker to pause and resume where it left off when the object reappears.
Abstract: Although not commonly used, correlation filters can track complex objects through rotations, occlusions and other distractions at over 20 times the rate of current state-of-the-art techniques. The oldest and simplest correlation filters use simple templates and generally fail when applied to tracking. More modern approaches such as ASEF and UMACE perform better, but their training needs are poorly suited to tracking. Visual tracking requires robust filters to be trained from a single frame and dynamically adapted as the appearance of the target object changes. This paper presents a new type of correlation filter, a Minimum Output Sum of Squared Error (MOSSE) filter, which produces stable correlation filters when initialized using a single frame. A tracker based upon MOSSE filters is robust to variations in lighting, scale, pose, and nonrigid deformations while operating at 669 frames per second. Occlusion is detected based upon the peak-to-sidelobe ratio, which enables the tracker to pause and resume where it left off when the object reappears.
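
The filter itself is a small amount of Fourier-domain arithmetic: accumulate G ⊙ F* in a numerator and F ⊙ F* in a denominator, blend both with a learning rate for adaptation, and monitor the peak-to-sidelobe ratio of the response for failure detection. The NumPy sketch below follows that recipe; the learning rate, regularizer and simplified sidelobe region are illustrative choices, not the paper's exact settings.

```python
import numpy as np

class MosseFilter:
    """Minimal MOSSE-style correlation filter (illustrative parameters)."""

    def __init__(self, lr=0.125, eps=1e-5):
        self.lr, self.eps = lr, eps
        self.A = self.B = None      # running numerator / denominator

    def update(self, patch, goal):
        """patch: preprocessed image patch; goal: desired Gaussian response."""
        F_ = np.fft.fft2(patch)
        G = np.fft.fft2(goal)
        A_new, B_new = G * np.conj(F_), F_ * np.conj(F_)
        if self.A is None:                       # first frame: initialize
            self.A, self.B = A_new, B_new
        else:                                    # running average (adaptation)
            self.A = self.lr * A_new + (1 - self.lr) * self.A
            self.B = self.lr * B_new + (1 - self.lr) * self.B

    def respond(self, patch):
        H_conj = self.A / (self.B + self.eps)    # H* = A / B
        resp = np.real(np.fft.ifft2(np.fft.fft2(patch) * H_conj))
        # Peak-to-sidelobe ratio (the paper excludes a window around the peak;
        # here only the peak pixel is dropped, a simplification).
        peak = resp.max()
        side = np.delete(resp.flatten(), resp.argmax())
        psr = (peak - side.mean()) / (side.std() + 1e-12)
        return resp, psr                         # low PSR signals occlusion/failure

# Usage: initialize from one frame with a Gaussian goal, then query a new frame.
yy, xx = np.mgrid[:64, :64]
goal = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / (2 * 2.0 ** 2))
tracker = MosseFilter()
tracker.update(np.random.randn(64, 64), goal)
response, psr = tracker.respond(np.random.randn(64, 64))
```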

2,948 citations


"Accurate scale estimation for robus..." refers background or methods in this paper

  • ...In recent years, tracking-by-detection methods [3, 9, 11, 19] have shown to provide excellent tracking performance....

  • ...The intensity based baseline roughly corresponds to the MOSSE tracker proposed in [3], but without any explicit failure detection component....

  • ...[3], is based on finding an adaptive correlation filter by minimizing the output sum of squared error (MOSSE)....

  • ...As mentioned in [3], the regularization parameter alleviates the problem of zero-frequency components in the spectrum of f , which would lead to division by zero....

  • ...Finally, the extracted features are always multiplied by a Hann window, as described in [3]....

Book ChapterDOI
07 Oct 2012
TL;DR: Using the well-established theory of Circulant matrices, this work provides a link to Fourier analysis that opens up the possibility of extremely fast learning and detection with the Fast Fourier Transform, which can be done in the dual space of kernel machines as fast as with linear classifiers.
Abstract: Recent years have seen greater interest in the use of discriminative classifiers in tracking systems, owing to their success in object detection. They are trained online with samples collected during tracking. Unfortunately, the potentially large number of samples becomes a computational burden, which directly conflicts with real-time requirements. On the other hand, limiting the samples may sacrifice performance. Interestingly, we observed that, as we add more and more samples, the problem acquires circulant structure. Using the well-established theory of Circulant matrices, we provide a link to Fourier analysis that opens up the possibility of extremely fast learning and detection with the Fast Fourier Transform. This can be done in the dual space of kernel machines as fast as with linear classifiers. We derive closed-form solutions for training and detection with several types of kernels, including the popular Gaussian and polynomial kernels. The resulting tracker achieves performance competitive with the state-of-the-art, can be implemented with only a few lines of code and runs at hundreds of frames-per-second. MATLAB code is provided in the paper (see Algorithm 1).
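
The link to Fourier analysis is easiest to see in the linear case: because the matrix of all cyclic shifts of a patch is circulant, ridge regression over those shifts has an elementwise closed form in the DFT domain, and the kernelized version in the paper adds one kernel-correlation FFT on top of the same structure. The NumPy sketch below shows the linear closed form only; it illustrates the diagonalization trick rather than reproducing the paper's kernel formulation.

```python
import numpy as np

def circulant_ridge_regression(x, y, lam=1e-4):
    """Train a linear ridge regressor over all cyclic shifts of the patch x.

    The data matrix of cyclic shifts is circulant, so it is diagonalized by
    the DFT and the solution is elementwise:
        w_hat = conj(x_hat) * y_hat / (|x_hat|^2 + lam)
    """
    x_hat, y_hat = np.fft.fft2(x), np.fft.fft2(y)
    w_hat = np.conj(x_hat) * y_hat / (x_hat * np.conj(x_hat) + lam)
    return np.real(np.fft.ifft2(w_hat))          # filter in the spatial domain

def detect(w, z):
    """Evaluate the regressor on all cyclic shifts of a new patch z via FFTs."""
    return np.real(np.fft.ifft2(np.fft.fft2(z) * np.conj(np.fft.fft2(w))))
```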

2,197 citations


"Accurate scale estimation for robus..." refers background or methods in this paper

  • ...In recent years, tracking-by-detection methods [3, 9, 11, 19] have shown to provide excellent tracking performance....

  • ...Given an image patch, the CSK tracker works by learning a kernelized least-squares classifier of the target appearance....

  • ...[table fragment: median OP comparison of Ours against ASLA [14], SCM [20], Struck [9], TLD [15], EDFT [6], L1APG [1], DFT [17], LOT [16], CSK [11], LSHT [10] and CT [19]]...

  • ...Most tracking-by-detection methods, such as the CSK and MOSSE, are limited to only estimating the target translation....

  • ...[Figure 2: Precision and success plots over all the 28 sequences; the legend scores rank the proposed tracker first in both plots (precision: Ours 0.745, Struck 0.659, ASLA 0.612, ...; success: Ours 0.549, ASLA 0.492, SCM 0.477, ...)]...
