
Showing papers on "Video tracking published in 2017"


Proceedings ArticleDOI
21 Jul 2017
TL;DR: This work revisits the core DCF formulation and introduces a factorized convolution operator, which drastically reduces the number of parameters in the model, and a compact generative model of the training sample distribution that significantly reduces memory and time complexity while providing better sample diversity.
Abstract: In recent years, Discriminative Correlation Filter (DCF) based methods have significantly advanced the state-of-the-art in tracking. However, in the pursuit of ever increasing tracking performance, their characteristic speed and real-time capability have gradually faded. Further, the increasingly complex models, with a massive number of trainable parameters, have introduced the risk of severe over-fitting. In this work, we tackle the key causes behind the problems of computational complexity and over-fitting, with the aim of simultaneously improving both speed and performance. We revisit the core DCF formulation and introduce: (i) a factorized convolution operator, which drastically reduces the number of parameters in the model, (ii) a compact generative model of the training sample distribution, which significantly reduces memory and time complexity while providing better diversity of samples, (iii) a conservative model update strategy with improved robustness and reduced complexity. We perform comprehensive experiments on four benchmarks: VOT2016, UAV123, OTB-2015, and TempleColor. When using expensive deep features, our tracker provides a 20-fold speedup and achieves a 13.0% relative gain in Expected Average Overlap compared to the top-ranked method [12] in the VOT2016 challenge. Moreover, our fast variant, using hand-crafted features, operates at 60 Hz on a single CPU, while obtaining 65.0% AUC on OTB-2015.

1,993 citations
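
The factorized convolution idea above can be made concrete in a few lines: instead of learning one filter per feature channel, a learned projection matrix maps the D feature channels down to C ≪ D before correlation. Below is a minimal numpy sketch of such an operator, with random tensors standing in for the learned projection `P`, filter bank `f`, and deep features `x`; it illustrates the parameter reduction, not the authors' implementation.

```python
# Minimal sketch of a factorized convolution operator in the spirit of ECO.
# x: (D, H, W) feature map; P: (D, C) projection with C << D; f: (C, H, W) filters.
import numpy as np

def factorized_response(x, P, f):
    """Detection scores computed with C filters instead of D (fewer parameters)."""
    z = np.tensordot(P.T, x, axes=1)           # project features: (C, H, W)
    zf = np.fft.fft2(z, axes=(-2, -1))         # correlate in the Fourier domain
    ff = np.fft.fft2(f, axes=(-2, -1))
    resp = np.fft.ifft2(np.conj(ff) * zf, axes=(-2, -1)).real
    return resp.sum(axis=0)                    # (H, W) response map

D, C, H, W = 512, 64, 50, 50
x = np.random.randn(D, H, W).astype(np.float32)
P = np.random.randn(D, C).astype(np.float32) / np.sqrt(D)
f = np.random.randn(C, H, W).astype(np.float32)
print(factorized_response(x, P, f).shape)      # (50, 50)
```

With D = 512 and C = 64, the filter bank shrinks roughly eightfold, which is where the claimed speed and over-fitting benefits originate.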


Proceedings ArticleDOI
21 Mar 2017
TL;DR: This paper integrates appearance information to improve the performance of SORT and reduces the number of identity switches, achieving overall competitive performance at high frame rates.
Abstract: Simple Online and Realtime Tracking (SORT) is a pragmatic approach to multiple object tracking with a focus on simple, effective algorithms. In this paper, we integrate appearance information to improve the performance of SORT. Due to this extension we are able to track objects through longer periods of occlusion, effectively reducing the number of identity switches. In the spirit of the original framework, we place much of the computational complexity into an offline pre-training stage where we learn a deep association metric on a large-scale person re-identification dataset. During online application, we establish measurement-to-track associations using nearest neighbor queries in visual appearance space. Experimental evaluation shows that our extensions reduce the number of identity switches by 45%, achieving overall competitive performance at high frame rates.

1,808 citations
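
The online association step described above reduces to nearest-neighbor queries in an appearance embedding space, solved as a linear assignment problem. A hedged numpy/scipy sketch follows; the 128-d descriptors would come from the offline-trained re-identification network, which is assumed here.

```python
# Appearance-based track/detection association: cosine distance + Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, det_feats, max_cosine_dist=0.2):
    """Match existing tracks to new detections by cosine distance in embedding space."""
    a = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    b = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                       # cosine distance matrix
    rows, cols = linear_sum_assignment(cost)   # optimal assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cosine_dist]

tracks = np.random.randn(3, 128)               # stored appearance descriptors
dets = np.random.randn(4, 128)                 # descriptors of new detections
print(associate(tracks, dets))                 # list of (track, detection) pairs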


Proceedings ArticleDOI
21 Jul 2017
TL;DR: In this paper, the Correlation Filter learner is interpreted as a differentiable layer in a deep neural network, which enables learning deep features that are tightly coupled to the correlation filter.
Abstract: The Correlation Filter is an algorithm that trains a linear template to discriminate between images and their translations. It is well suited to object tracking because its formulation in the Fourier domain provides a fast solution, enabling the detector to be re-trained once per frame. Previous works that use the Correlation Filter, however, have adopted features that were either manually designed or trained for a different task. This work is the first to overcome this limitation by interpreting the Correlation Filter learner, which has a closed-form solution, as a differentiable layer in a deep neural network. This enables learning deep features that are tightly coupled to the Correlation Filter. Experiments illustrate that our method has the important practical benefit of allowing lightweight architectures to achieve state-of-the-art performance at high framerates.

1,329 citations
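
The closed-form solution that makes the Correlation Filter usable as a network layer is ordinary ridge regression in the Fourier domain. Below is a single-channel numpy sketch of that solution (training and detection); the paper's contribution is back-propagating through this map, which is omitted here.

```python
# Closed-form correlation filter: train on patch x with desired response y,
# then detect on a new patch z. A sketch, not the authors' implementation.
import numpy as np

def train_cf(x, y, lam=1e-2):
    xf, yf = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(xf) * yf / (np.conj(xf) * xf + lam)   # filter in the Fourier domain

def detect(wf, z):
    return np.fft.ifft2(wf * np.fft.fft2(z)).real        # response map

H = W = 64
yy, xx = np.mgrid[0:H, 0:W]
y = np.exp(-0.5 * ((yy - H // 2) ** 2 + (xx - W // 2) ** 2) / 4.0)
y = np.roll(y, (-H // 2, -W // 2), axis=(0, 1))          # Gaussian label, peak at origin
x = np.random.randn(H, W)
wf = train_cf(x, y)
print(detect(wf, x).max())                               # ~1: strong peak on the training patch
```

Because every step is differentiable, gradients can flow from the response map back into the feature extractor, which is exactly what enables the tightly coupled deep features.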



Journal ArticleDOI
TL;DR: In this article, a scale adaptive tracking approach by learning separate discriminative correlation filters for translation and scale estimation is proposed, which directly learns the appearance change induced by variations in the target scale.
Abstract: Accurate scale estimation of a target is a challenging research problem in visual object tracking. Most state-of-the-art methods employ an exhaustive scale search to estimate the target size. The exhaustive search strategy is computationally expensive and struggles with large scale variations. This paper investigates the problem of accurate and robust scale estimation in a tracking-by-detection framework. We propose a novel scale adaptive tracking approach by learning separate discriminative correlation filters for translation and scale estimation. The explicit scale filter is learned online using the target appearance sampled at a set of different scales. Contrary to standard approaches, our method directly learns the appearance change induced by variations in the target scale. Additionally, we investigate strategies to reduce the computational cost of our approach. Extensive experiments are performed on the OTB and the VOT2014 datasets. Compared to the standard exhaustive scale search, our approach achieves a gain of 2.5 percent in average overlap precision on the OTB dataset. Additionally, our method is computationally efficient, operating at a 50 percent higher frame rate than the exhaustive scale search. Our method obtains the top rank in performance by outperforming 19 state-of-the-art trackers on OTB and 37 state-of-the-art trackers on VOT2014.

945 citations
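
The separate scale filter can be illustrated compactly: descriptors are extracted from the target resized to a pyramid of scales, and a 1-D multi-channel correlation filter over the scale axis picks the best one. In the sketch below, random vectors stand in for the HOG descriptors of the scaled patches; only the filter algebra follows the described approach.

```python
# 1-D correlation filter over the scale dimension (explicit scale filter sketch).
import numpy as np

n_scales, d = 17, 31 * 4 * 4                 # scale samples x descriptor length
F = np.random.randn(n_scales, d)             # one descriptor per scale sample
y = np.exp(-0.5 * ((np.arange(n_scales) - n_scales // 2) / 1.0) ** 2)  # Gaussian label

Ff, yf = np.fft.fft(F, axis=0), np.fft.fft(y)
denom = (np.conj(Ff) * Ff).sum(axis=1, keepdims=True) + 1e-2
Hf = np.conj(Ff) * yf[:, None] / denom       # multi-channel scale filter

Zf = np.fft.fft(F, axis=0)                   # test on the training sample
resp = np.fft.ifft((Hf * Zf).sum(axis=1)).real
print(np.argmax(resp))                       # -> 8, the centre (current) scale
```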


Proceedings ArticleDOI
21 Jul 2017
TL;DR: The channel and spatial reliability concepts are introduced to DCF tracking, and a novel learning algorithm is provided for their efficient and seamless integration into the filter update and the tracking process.
Abstract: Short-term tracking is an open and challenging problem for which discriminative correlation filters (DCF) have shown excellent performance. We introduce the channel and spatial reliability concepts to DCF tracking and provide a novel learning algorithm for their efficient and seamless integration into the filter update and the tracking process. The spatial reliability map adjusts the filter support to the part of the object suitable for tracking. This allows tracking of non-rectangular objects as well as extending the search region. Channel reliability reflects the quality of the learned filter and is used as a feature weighting coefficient in localization. Experimentally, with only two simple standard features, HOGs and Colornames, the novel CSR-DCF method – DCF with Channel and Spatial Reliability – achieves state-of-the-art results on VOT 2016, VOT 2015 and OTB. The CSR-DCF runs in real-time on a CPU.

941 citations
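
Channel reliability as a feature weighting coefficient in localization can be sketched as follows; the weight used here (normalized per-channel peak response) is a simplification of the paper's learned reliability scores.

```python
# Channel-reliability-weighted localization: fuse per-channel response maps.
import numpy as np

def weighted_localization(responses):
    """responses: (C, H, W) per-channel correlation responses."""
    w = responses.reshape(responses.shape[0], -1).max(axis=1)  # peak per channel
    w = np.clip(w, 0, None)
    w /= w.sum() + 1e-12                                       # reliability weights
    fused = (w[:, None, None] * responses).sum(axis=0)
    return np.unravel_index(fused.argmax(), fused.shape)       # estimated target position

resp = np.random.rand(10, 50, 50)
print(weighted_localization(resp))
```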


Proceedings ArticleDOI
01 Jul 2017
TL;DR: A class of temporal models that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection, capable of capturing action compositions, segment durations, and long-range dependencies, and over an order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks.
Abstract: The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns, whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over an order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.

859 citations
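
A hedged PyTorch sketch of the Dilated TCN variant: stacked 1-D convolutions with exponentially growing dilation give each output frame a long temporal receptive field at low cost. Layer widths and depths are illustrative, not the paper's.

```python
# Dilated temporal convolutional network producing per-frame action logits.
import torch
import torch.nn as nn

class DilatedTCN(nn.Module):
    def __init__(self, in_dim, n_classes, channels=64, n_layers=4):
        super().__init__()
        layers = []
        for i in range(n_layers):
            d = 2 ** i                                  # dilation 1, 2, 4, 8
            layers += [nn.Conv1d(in_dim if i == 0 else channels, channels,
                                 kernel_size=3, dilation=d, padding=d),
                       nn.ReLU()]
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, x):                               # x: (batch, features, time)
        return self.head(self.body(x))                  # (batch, classes, time)

x = torch.randn(2, 128, 200)       # 200 frames of 128-d spatiotemporal features
print(DilatedTCN(128, 10)(x).shape)                     # torch.Size([2, 10, 200])
```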


Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper proposes a dynamic Siamese network with a fast transformation learning model that enables effective online learning of target appearance variation and background suppression from previous frames, and presents elementwise multi-layer fusion to adaptively integrate the network outputs using multi-level deep features.
Abstract: How to effectively learn temporal variation of target appearance and exclude the interference of cluttered background, while maintaining real-time response, is an essential problem of visual object tracking. Recently, Siamese networks have shown great potential for matching-based trackers in achieving balanced accuracy and beyond-real-time speed. However, they still lag behind classification-and-updating-based trackers in tolerating temporal changes of objects and imaging conditions. In this paper, we propose a dynamic Siamese network with a fast transformation learning model that enables effective online learning of target appearance variation and background suppression from previous frames. We then present elementwise multi-layer fusion to adaptively integrate the network outputs using multi-level deep features. Unlike state-of-the-art trackers, our approach allows the use of any feasible generally or particularly trained features, such as SiamFC and VGG. More importantly, the proposed dynamic Siamese network can be jointly trained as a whole directly on labeled video sequences, and can thus take full advantage of the rich spatio-temporal information of moving objects. As a result, our approach achieves state-of-the-art performance on the OTB-2013 and VOT-2015 benchmarks, while exhibiting better-balanced accuracy and real-time response than state-of-the-art competitors.

772 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: This work proposes a Background-Aware CF based on hand-crafted features (HOG) that can efficiently model how both the foreground and background of the object vary over time, demonstrating superior accuracy and real-time performance compared to state-of-the-art trackers.
Abstract: Correlation Filters (CFs) have recently demonstrated excellent performance in terms of rapidly tracking objects under challenging photometric and geometric variations. The strength of the approach comes from its ability to efficiently learn - on the fly - how the object is changing over time. A fundamental drawback to CFs, however, is that the background of the target is not modeled over time, which can result in suboptimal performance. Recent tracking algorithms have suggested to resolve this drawback by either learning CFs from more discriminative deep features (e.g. DeepSRDCF [9] and CCOT [11]) or learning complex deep trackers (e.g. MDNet [28] and FCNT [33]). While such methods have been shown to work well, they suffer from high complexity: extracting deep features or applying deep tracking frameworks is very computationally expensive. This limits the real-time performance of such methods, even on high-end GPUs. This work proposes a Background-Aware CF based on hand-crafted features (HOG [6]) that can efficiently model how both the foreground and background of the object vary over time. Our approach, like conventional CFs, is extremely computationally efficient, and extensive experiments over multiple tracking benchmarks demonstrate the superior accuracy and real-time performance of our method compared to the state-of-the-art trackers.

679 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: Deep voxel flow combines the advantages of optical-flow-based and neural-network-based methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones; the technique requires no human supervision and can be applied at any video resolution.
Abstract: We address the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that hallucinate pixel values directly often produce blurry results. We combine the advantages of these two methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient, and can be applied at any video resolution. We demonstrate that our method produces results that both quantitatively and qualitatively improve upon the state-of-the-art.

601 citations
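
The core of the approach is differentiable sampling: the network predicts a per-pixel (dx, dy, dt) "voxel flow", pixels are bilinearly sampled from the previous and next frames along that flow, and dt blends the two samples. Below is a sketch using torch grid_sample; the zero flow here stands in for the network's prediction.

```python
# Frame synthesis by trilinear sampling along a predicted voxel flow (sketch).
import torch
import torch.nn.functional as F

def synthesize(prev, nxt, flow):
    """prev/nxt: (B,3,H,W); flow: (B,3,H,W) with (dx,dy) in [-1,1] coords, dt in [0,1]."""
    B, _, H, W = prev.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                            indexing="ij")
    base = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)   # identity sampling grid
    d = flow[:, :2].permute(0, 2, 3, 1)                       # (B,H,W,2) displacement
    dt = flow[:, 2:3]                                         # temporal blend weight
    from_prev = F.grid_sample(prev, base - d, align_corners=True)
    from_next = F.grid_sample(nxt, base + d, align_corners=True)
    return (1 - dt) * from_prev + dt * from_next

prev, nxt = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 3, 64, 64)            # zero flow, dt=0 -> reproduces prev
print(torch.allclose(synthesize(prev, nxt, flow), prev, atol=1e-5))   # True
```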


Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper proposes a large-margin object tracking method that absorbs the strong discriminative ability of structured output SVMs and is significantly sped up by the correlation filter algorithm.
Abstract: Structured output support vector machine (SVM) based tracking algorithms have shown favorable performance recently. Nonetheless, their time-consuming candidate sampling and complex optimization limit real-time applications. In this paper, we propose a novel large-margin object tracking method that absorbs the strong discriminative ability of structured output SVMs and is significantly sped up by the correlation filter algorithm. In addition, a multimodal target detection technique is proposed to improve target localization precision and prevent the model drift introduced by similar objects or background noise. Furthermore, we exploit feedback from high-confidence tracking results to avoid the model corruption problem. We implement two versions of the proposed tracker, with representations from both conventional hand-crafted and deep convolutional neural network (CNN) features, to validate the strong compatibility of the algorithm. The experimental results demonstrate that the proposed tracker performs superiorly against several state-of-the-art algorithms on challenging benchmark sequences while running at speeds in excess of 80 frames per second.
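
The high-confidence feedback above hinges on scoring the quality of a correlation response map. The paper's average peak-to-correlation energy (APCE) criterion is simple to write down; the update threshold one would compare against is an assumption here.

```python
# APCE response-quality measure: high for a single sharp peak, low for noisy maps.
import numpy as np

def apce(resp):
    fmax, fmin = resp.max(), resp.min()
    return (fmax - fmin) ** 2 / np.mean((resp - fmin) ** 2)

resp = np.random.rand(50, 50) * 0.1
resp[25, 25] = 1.0                      # a single sharp peak -> high APCE
print(apce(resp))                       # large value: safe to update the model
print(apce(np.random.rand(50, 50)))     # flat/multi-modal: skip the update
```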

Proceedings ArticleDOI
01 Jul 2017
TL;DR: Through evaluation on the OTB dataset, the proposed tracker is validated to achieve competitive performance while being three times faster than state-of-the-art deep-network-based trackers.
Abstract: This paper proposes a novel tracker which is controlled by sequentially pursuing actions learned by deep reinforcement learning. In contrast to existing trackers using deep networks, the proposed tracker is designed to achieve light computation as well as satisfactory tracking accuracy in both location and scale. The deep network to control actions is pre-trained using various training sequences and fine-tuned during tracking for online adaptation to target and background changes. The pre-training is done by utilizing deep reinforcement learning as well as supervised learning. The use of reinforcement learning enables even partially labeled data to be successfully utilized for semi-supervised learning. Through evaluation on the OTB dataset, the proposed tracker is validated to achieve competitive performance while being three times faster than state-of-the-art deep-network-based trackers. The fast version of the proposed method, which operates in real-time on a GPU, outperforms the state-of-the-art real-time trackers.

Proceedings ArticleDOI
Esteban Real1, Jonathon Shlens1, Stefano Mazzocchi1, Xin Pan1, Vincent Vanhoucke1 
01 Jul 2017
TL;DR: A new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB), which consists of approximately 380,000 video segments automatically selected to feature objects in natural settings without editing or post-processing.
Abstract: We introduce a new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB). The data set consists of approximately 380,000 video segments about 19s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the COCO [32] label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second. The use of a cascade of increasingly precise human annotations ensures a label accuracy above 95% for every class and tight bounding boxes. Finally, we train and evaluate well-known deep network architectures and report baseline figures for per-frame classification and localization. We also demonstrate how the temporal contiguity of video can potentially be used to improve such inferences. The data set can be found at https://research.google.com/youtube-bb. We hope the availability of such a large curated corpus will spur new advances in video object detection and tracking.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This work presents a tracking-by-detection algorithm that can compete with more sophisticated approaches at a fraction of the computational cost, and demonstrates its potential in thorough experiments with a wide range of object detectors.
Abstract: Tracking-by-detection is a common approach to multi-object tracking. With ever increasing performances of object detectors, the basis for a tracker becomes much more reliable. In combination with commonly higher frame rates, this poses a shift in the challenges for a successful tracker. That shift enables the deployment of much simpler tracking algorithms which can compete with more sophisticated approaches at a fraction of the computational cost. We present such an algorithm and show with thorough experiments its potential using a wide range of object detectors. The proposed method can easily run at 100K fps while outperforming the state-of-the-art on the DETRAC vehicle tracking dataset.
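
The algorithm is simple enough to sketch in full: each track greedily takes the highest-IoU detection in the new frame and unmatched detections start new tracks. A minimal Python sketch of that loop follows (track termination and minimum-length filtering omitted).

```python
# Minimal IoU-based tracking-by-detection step.
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def step(tracks, detections, sigma_iou=0.5):
    """Extend each active track with its best-matching detection."""
    dets = list(detections)
    for t in tracks:
        if not dets:
            break
        best = max(dets, key=lambda d: iou(t[-1], d))
        if iou(t[-1], best) >= sigma_iou:
            t.append(best)
            dets.remove(best)
    tracks.extend([d] for d in dets)      # unmatched detections start new tracks
    return tracks

tracks = step([], [(0, 0, 10, 10)])
tracks = step(tracks, [(1, 1, 11, 11), (50, 50, 60, 60)])
print(len(tracks), len(tracks[0]))        # 2 tracks, the first now spans 2 frames
```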

Proceedings ArticleDOI
Matej Kristan1, Ales Leonardis2, Jiri Matas3, Michael Felsberg4, Roman Pflugfelder5, Luka Čehovin Zajc1, Tomas Vojir3, Gustav Häger4, Alan Lukezic1, Abdelrahman Eldesokey4, Gustavo Fernandez5, Alvaro Garcia-Martin6, Andrej Muhič1, Alfredo Petrosino7, Alireza Memarmoghadam8, Andrea Vedaldi9, Antoine Manzanera10, Antoine Tran10, A. Aydin Alatan11, Bogdan Mocanu, Boyu Chen12, Chang Huang, Changsheng Xu13, Chong Sun12, Dalong Du, David Zhang, Dawei Du13, Deepak Mishra, Erhan Gundogdu14, Erhan Gundogdu11, Erik Velasco-Salido, Fahad Shahbaz Khan4, Francesco Battistone, Gorthi R. K. Sai Subrahmanyam, Goutam Bhat4, Guan Huang, Guilherme Sousa Bastos, Guna Seetharaman15, Hongliang Zhang16, Houqiang Li17, Huchuan Lu12, Isabela Drummond, Jack Valmadre9, Jae-chan Jeong18, Jaeil Cho18, Jae-Yeong Lee18, Jana Noskova, Jianke Zhu19, Jin Gao13, Jingyu Liu13, Ji-Wan Kim18, João F. Henriques9, José M. Martínez, Junfei Zhuang20, Junliang Xing13, Junyu Gao13, Kai Chen21, Kannappan Palaniappan22, Karel Lebeda, Ke Gao22, Kris M. Kitani23, Lei Zhang, Lijun Wang12, Lingxiao Yang, Longyin Wen24, Luca Bertinetto9, Mahdieh Poostchi22, Martin Danelljan4, Matthias Mueller25, Mengdan Zhang13, Ming-Hsuan Yang26, Nianhao Xie16, Ning Wang17, Ondrej Miksik9, Payman Moallem8, Pallavi Venugopal M, Pedro Senna, Philip H. S. Torr9, Qiang Wang13, Qifeng Yu16, Qingming Huang13, Rafael Martin-Nieto, Richard Bowden27, Risheng Liu12, Ruxandra Tapu, Simon Hadfield27, Siwei Lyu28, Stuart Golodetz9, Sunglok Choi18, Tianzhu Zhang13, Titus Zaharia, Vincenzo Santopietro, Wei Zou13, Weiming Hu13, Wenbing Tao21, Wenbo Li28, Wengang Zhou17, Xianguo Yu16, Xiao Bian24, Yang Li19, Yifan Xing23, Yingruo Fan20, Zheng Zhu13, Zhipeng Zhang13, Zhiqun He20 
01 Jul 2017
TL;DR: The Visual Object Tracking challenge VOT2017 is the fifth annual tracker benchmarking activity organized by the VOT initiative; results of 51 trackers are presented, many of them state-of-the-art methods published at major computer vision conferences or journals in recent years.
Abstract: The Visual Object Tracking challenge VOT2017 is the fifth annual tracker benchmarking activity organized by the VOT initiative. Results of 51 trackers are presented; many are state-of-the-art methods published at major computer vision conferences or journals in recent years. The evaluation included the standard VOT and other popular methodologies and a new "real-time" experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. Performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. VOT2017 goes beyond its predecessors by (i) improving the VOT public dataset and introducing a separate VOT2017 sequestered dataset, (ii) introducing a real-time tracking experiment and (iii) releasing a redesigned toolkit that supports complex experiments. The dataset, the evaluation kit and the results are publicly available at the challenge website.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: In this paper, the authors use a combination of offline and online learning strategies, where the former produces a refined mask from the previous frame estimate and the latter allows capturing the appearance of the specific object instance.
Abstract: Inspired by recent advances of deep learning in instance segmentation and object tracking, we introduce the concept of convnet-based guidance applied to video object segmentation. Our model proceeds on a per-frame basis, guided by the output of the previous frame towards the object of interest in the next frame. We demonstrate that highly accurate object segmentation in videos can be enabled by using a convolutional neural network (convnet) trained with static images only. The key component of our approach is a combination of offline and online learning strategies, where the former produces a refined mask from the previous frame estimate and the latter allows capturing the appearance of the specific object instance. Our method can handle different types of input annotations, such as bounding boxes and segments, while leveraging an arbitrary amount of annotated frames. Therefore our system is suitable for diverse applications with different requirements in terms of accuracy and efficiency. In our extensive evaluation, we obtain competitive results on three different datasets, independently of the type of input annotation.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: The proposed MCPF is designed to exploit and complement the strengths of an MCF and a particle filter, and can effectively maintain multiple modes in the posterior density using fewer particles than conventional particle filters, thereby lowering the computational cost.
Abstract: In this paper, we propose a multi-task correlation particle filter (MCPF) for robust visual tracking. We first present the multi-task correlation filter (MCF) that takes the interdependencies among different features into account to learn correlation filters jointly. The proposed MCPF is designed to exploit and complement the strengths of an MCF and a particle filter. Compared with existing tracking methods based on correlation filters and particle filters, the proposed tracker has several advantages. First, it can shepherd the sampled particles toward the modes of the target state distribution via the MCF, thereby resulting in robust tracking performance. Second, it can effectively handle large variations in scale via a particle sampling strategy. Third, it can effectively maintain multiple modes in the posterior density using fewer particles than conventional particle filters, thereby lowering the computational cost. Extensive experimental results on three benchmark datasets demonstrate that the proposed MCPF performs favorably against the state-of-the-art methods.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This paper proposes the CREST algorithm to reformulate DCFs as a one-layer convolutional neural network, applying residual learning to account for appearance changes and reduce model degradation during online updates.
Abstract: Discriminative correlation filters (DCFs) have been shown to perform superiorly in visual tracking. They only need a small set of training samples from the initial frame to generate an appearance model. However, existing DCFs learn the filters separately from feature extraction, and update these filters using a moving average operation with an empirical weight. These DCF trackers hardly benefit from end-to-end training. In this paper, we propose the CREST algorithm to reformulate DCFs as a one-layer convolutional neural network. Our method integrates feature extraction, response map generation as well as model update into the neural network for end-to-end training. To reduce model degradation during online updates, we apply residual learning to take appearance changes into account. Extensive experiments on the benchmark datasets demonstrate that our CREST tracker performs favorably against state-of-the-art trackers.
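
Reformulating the DCF as a one-layer convolutional network, plus a residual branch for appearance change, can be sketched in PyTorch as below; sizes are illustrative, and the paper's spatial/temporal residual design is condensed into a single branch.

```python
# DCF as a conv layer with a residual correction branch (sketch).
import torch
import torch.nn as nn

class CRESTHead(nn.Module):
    def __init__(self, c, k=31):
        super().__init__()
        self.base = nn.Conv2d(c, 1, kernel_size=k, padding=k // 2)   # the "DCF" layer
        self.residual = nn.Sequential(                                # models appearance change
            nn.Conv2d(c, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, feat):
        return self.base(feat) + self.residual(feat)                  # response map

feat = torch.randn(1, 64, 50, 50)      # conv features of the search region
print(CRESTHead(64)(feat).shape)       # torch.Size([1, 1, 50, 50])
```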

Posted Content
TL;DR: MoCoGAN decomposes the visual signals in a video into content and motion, and learns this decomposition in an unsupervised manner using both image and video discriminators.
Abstract: Visual signals in a video can be divided into content and motion. While content specifies which objects are in the video, motion describes their dynamics. Based on this prior, we propose the Motion and Content decomposed Generative Adversarial Network (MoCoGAN) framework for video generation. The proposed framework generates a video by mapping a sequence of random vectors to a sequence of video frames. Each random vector consists of a content part and a motion part. While the content part is kept fixed, the motion part is realized as a stochastic process. To learn motion and content decomposition in an unsupervised manner, we introduce a novel adversarial learning scheme utilizing both image and video discriminators. Extensive experimental results on several challenging datasets, with qualitative and quantitative comparison to the state-of-the-art approaches, verify the effectiveness of the proposed framework. In addition, we show that MoCoGAN allows one to generate videos with the same content but different motion, as well as videos with different content and the same motion.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper sets up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression, and introduces correlation features that represent object co-occurrences across time to aid the ConvNet during tracking.
Abstract: Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during tracking; and (iii) we link the frame level detections based on our across-frame tracklets to produce high accuracy detections at the video level. Our ConvNet architecture for spatiotemporal object detection is evaluated on the large-scale ImageNet VID dataset where it achieves state-of-the-art results. Our approach provides better single model performance than the winning method of the last ImageNet challenge while being conceptually much simpler. Finally, we show that by increasing the temporal stride we can dramatically increase the tracker speed.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper proposes a CNN-based framework for online multi-object tracking with a spatial-temporal attention mechanism (STAM) to handle drift caused by occlusion and interaction among targets.
Abstract: In this paper, we propose a CNN-based framework for online MOT. This framework utilizes the merits of single object trackers in adapting appearance models and searching for the target in the next frame. Simply applying a single object tracker to MOT, however, encounters problems with computational efficiency and drift caused by occlusion. Our framework achieves computational efficiency by sharing features and using ROI-Pooling to obtain individual features for each target. Some online-learned target-specific CNN layers are used for adapting the appearance model of each target. In the framework, we introduce a spatial-temporal attention mechanism (STAM) to handle the drift caused by occlusion and interaction among targets. The visibility map of the target is learned and used for inferring the spatial attention map, which is then applied to weight the features. Besides, the occlusion status can be estimated from the visibility map, which controls the online updating process via a weighted loss on training samples with different occlusion statuses in different frames; this can be considered a temporal attention mechanism. The proposed algorithm achieves 34.3% and 46.0% MOTA on the challenging MOT15 and MOT16 benchmarks, respectively.

Proceedings ArticleDOI
20 Apr 2017
TL;DR: A novel two-stream neural network with an explicit memory module for segmenting moving objects in unconstrained videos, together with an extensive ablative analysis investigating the influence of each component of the proposed framework.
Abstract: This paper addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module to build a "visual memory" in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. Given a video frame as input, our approach assigns each pixel an object or background label based on the learned spatio-temporal features as well as the "visual memory" specific to the video, acquired automatically without any manually-annotated frames. The visual memory is implemented with convolutional gated recurrent units, which allow spatial information to be propagated over time. We evaluate our method extensively on two benchmarks, the DAVIS and Freiburg-Berkeley motion segmentation datasets, and show state-of-the-art results. For example, our approach outperforms the top method on the DAVIS dataset by nearly 6%. We also provide an extensive ablative analysis to investigate the influence of each component in the proposed framework.
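
The convolutional gated recurrent unit at the heart of the visual memory is compact; here is a hedged PyTorch sketch of one ConvGRU cell rolled over a short feature sequence (channel sizes illustrative).

```python
# ConvGRU cell: a GRU whose gates are 2-D convolutions, preserving spatial layout.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_c, hid_c, k=3):
        super().__init__()
        p = k // 2
        self.zr = nn.Conv2d(in_c + hid_c, 2 * hid_c, k, padding=p)  # update/reset gates
        self.hc = nn.Conv2d(in_c + hid_c, hid_c, k, padding=p)      # candidate state

    def forward(self, x, h):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], 1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.hc(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde     # spatial information propagated over time

cell, h = ConvGRUCell(64, 32), torch.zeros(1, 32, 28, 28)
for t in range(5):                           # roll over a short feature sequence
    h = cell(torch.randn(1, 64, 28, 28), h)
print(h.shape)                               # torch.Size([1, 32, 28, 28])
```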

Journal ArticleDOI
Daniel Kang1, John Emmons1, Firas Abuzaid1, Peter Bailis1, Matei Zaharia1 
01 Aug 2017
TL;DR: NoScope cascades two types of models: specialized models that forego the full generality of the reference model but faithfully mimic its behavior for the target video and object, and difference detectors that highlight temporal differences across frames.
Abstract: Recent advances in computer vision---in the form of deep neural networks---have made it possible to query increasing volumes of video data with high accuracy. However, neural network inference is computationally expensive at scale: applying a state-of-the-art object detector in real time (i.e., 30+ frames per second) to a single video requires a $4000 GPU. In response, we present NoScope, a system for querying videos that can reduce the cost of neural network video analysis by up to three orders of magnitude via inference-optimized model search. Given a target video, object to detect, and reference neural network, NoScope automatically searches for and trains a sequence, or cascade, of models that preserves the accuracy of the reference network but is specialized to the target video and is therefore far less computationally expensive. NoScope cascades two types of models: specialized models that forego the full generality of the reference model but faithfully mimic its behavior for the target video and object; and difference detectors that highlight temporal differences across frames. We show that the optimal cascade architecture differs across videos and objects, so NoScope uses an efficient cost-based optimizer to search across models and cascades. With this approach, NoScope achieves speed-ups of two to three orders of magnitude (265-15,500x real-time) on binary classification tasks over fixed-angle webcam and surveillance video while maintaining accuracy within 1--5% of state-of-the-art neural networks.
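
The difference-detector stage of such a cascade is easy to sketch: a cheap frame-difference test decides when the expensive reference network actually needs to run. `expensive_detector` below is an assumed stand-in, and the threshold would in practice be chosen by the cost-based optimizer.

```python
# Difference detector: reuse the last label when the frame barely changed.
import numpy as np

def cascade(frames, expensive_detector, diff_thresh=8.0):
    last_frame, last_label, labels = None, False, []
    for f in frames:
        if last_frame is not None and np.mean(np.abs(f - last_frame)) < diff_thresh:
            labels.append(last_label)        # skip inference on near-identical frames
        else:
            last_label = expensive_detector(f)
            labels.append(last_label)
        last_frame = f
    return labels

frames = [np.full((64, 64), i // 10, dtype=np.float32) for i in range(30)]
calls = []
detector = lambda f: (calls.append(1), f.mean() > 1)[1]
cascade(frames, detector)
print(len(calls), "detector call(s) for", len(frames), "frames")
```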

Posted Content
TL;DR: This work presents an end-to-end lightweight network architecture, namely DCFNet, to learn the convolutional features and perform the correlation tracking process simultaneously, and treats DCF as a special correlation filter layer added in a Siamese network.
Abstract: Discriminant Correlation Filters (DCF) based methods have become a dominant approach to online object tracking. The features used in these methods, however, are either hand-crafted features like HoGs, or convolutional features trained independently on other tasks like image classification. In this work, we present an end-to-end lightweight network architecture, namely DCFNet, to learn the convolutional features and perform the correlation tracking process simultaneously. Specifically, we treat DCF as a special correlation filter layer added in a Siamese network, and carefully derive the backpropagation through it by defining the network output as the probability heatmap of object location. Since the derivation is still carried out in the Fourier frequency domain, the efficiency property of DCF is preserved. This enables our tracker to run at more than 60 FPS during test time, while achieving a significant accuracy gain compared with KCF using HoGs. Extensive evaluations on the OTB-2013, OTB-2015, and VOT2015 benchmarks demonstrate that the proposed DCFNet tracker is competitive with several state-of-the-art trackers, while being more compact and much faster.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work proposes Quadruplet Convolutional Neural Networks (Quad-CNN) for multi-object tracking, which learn to associate object detections across frames using quadruplet losses and employ a multi-task loss to jointly learn object association and bounding box regression for better localization.
Abstract: We propose Quadruplet Convolutional Neural Networks (Quad-CNN) for multi-object tracking, which learn to associate object detections across frames using quadruplet losses. The proposed networks consider target appearances together with their temporal adjacencies for data association. Unlike conventional ranking losses, the quadruplet loss enforces an additional constraint that makes temporally adjacent detections more closely located than the ones with large temporal gaps. We also employ a multi-task loss to jointly learn object association and bounding box regression for better localization. The whole network is trained end-to-end. For tracking, the target association is performed by minimax label propagation using the metric learned from the proposed network. We evaluate performance of our multi-object tracking algorithm on public MOT Challenge datasets, and achieve outstanding results.
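
The quadruplet constraint can be written as a pair of margin terms: the temporally adjacent positive must be closer to the anchor than a temporally distant positive, which in turn must be closer than a negative. A hedged PyTorch sketch with illustrative margins follows.

```python
# Quadruplet-style ranking loss over embeddings (sketch, margins illustrative).
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, pos_near, pos_far, neg, m1=0.2, m2=0.2):
    d = lambda a, b: (a - b).pow(2).sum(dim=1)              # squared distances
    near_vs_far = F.relu(d(anchor, pos_near) - d(anchor, pos_far) + m1)
    far_vs_neg = F.relu(d(anchor, pos_far) - d(anchor, neg) + m2)
    return (near_vs_far + far_vs_neg).mean()

e = lambda: torch.randn(8, 128, requires_grad=True)         # batch of embeddings
loss = quadruplet_loss(e(), e(), e(), e())
loss.backward()
print(float(loss))
```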

Proceedings ArticleDOI
Heng Fan1, Haibin Ling1
01 Oct 2017
TL;DR: This paper presents a novel parallel tracking and verifying (PTAV) framework, consisting of two components, a tracker T and a verifier V, working in parallel on two separate threads.
Abstract: Being intensively studied, visual tracking has seen great recent advances in either speed (e.g., with correlation filters) or accuracy (e.g., with deep features). Real-time and high accuracy tracking algorithms, however, remain scarce. In this paper we study the problem from a new perspective and present a novel parallel tracking and verifying (PTAV) framework, by taking advantage of the ubiquity of multithread techniques and borrowing from the success of parallel tracking and mapping in visual SLAM. Our PTAV framework typically consists of two components, a tracker T and a verifier V, working in parallel on two separate threads. The tracker T aims to provide super real-time tracking inference and is expected to perform well most of the time; by contrast, the verifier V checks the tracking results and corrects T when needed. The key innovation is that V does not work on every frame but only upon requests from T; in turn, T may adjust the tracking according to feedback from V. With such collaboration, PTAV enjoys both the high efficiency provided by T and the strong discriminative power of V. In our extensive experiments on popular benchmarks including OTB2013, OTB2015, TC128 and UAV20L, PTAV achieves the best tracking accuracy among all real-time trackers, and in fact performs even better than many deep learning based solutions. Moreover, as a general framework, PTAV is very flexible and has great room for improvement and generalization.
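
The tracker/verifier collaboration is essentially a two-thread producer/consumer pattern: T tracks every frame and occasionally enqueues a verification request, while V answers asynchronously and T applies corrections when they arrive. A minimal Python sketch with placeholder tracker and verifier functions follows.

```python
# Parallel tracking-and-verifying pattern with two threads and queues (sketch).
import queue
import threading

requests, verdicts = queue.Queue(), queue.Queue()

def verifier():
    while True:
        idx, result = requests.get()
        if idx is None:
            break
        verdicts.put((idx, result))          # placeholder: V accepts T's result

threading.Thread(target=verifier, daemon=True).start()
track = lambda frame, prev: prev             # placeholder fast tracker

state = (0, 0, 10, 10)
for idx in range(100):
    state = track(None, state)               # super real-time path, every frame
    if idx % 10 == 0:
        requests.put((idx, state))           # ask V only occasionally
    while not verdicts.empty():
        _, corrected = verdicts.get_nowait() # apply V's feedback when available
        state = corrected
requests.put((None, None))
print("final state:", state)
```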

Journal ArticleDOI
TL;DR: In this article, a multi-scale image processing method is applied to the frames of the video of a vibrating structure to extract the local pixel phases that encode local structural vibration, establishing a full-field spatio-temporal motion matrix.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this paper, the authors propose an end-to-end network for online video style transfer, which generates temporally coherent stylized video sequences in near real-time by incorporating short-term coherence.
Abstract: Training a feed-forward network for the fast neural style transfer of images has proven successful, but the naive extension of processing videos frame by frame is prone to producing flickering results. We propose the first end-to-end network for online video style transfer, which generates temporally coherent stylized video sequences in near real-time. Two key ideas include an efficient network incorporating short-term coherence, and propagating short-term coherence to long-term, which ensures consistency over a longer period of time. Our network can incorporate different image stylization networks and clearly outperforms the per-frame baseline both qualitatively and quantitatively. Moreover, it can achieve visually comparable coherence to optimization-based video style transfer, but is three orders of magnitude faster.

Proceedings Article
12 Feb 2017
TL;DR: This paper builds a drone tracking benchmark and proposes baseline algorithms that explicitly estimate camera ego-motion via geometric transformations based on background feature points.
Abstract: Despite recent advances in the visual tracking community, most studies so far have focused on the observation model. As another important component of the tracking system, the motion model is much less well explored, especially in extreme scenarios. In this paper, we consider one such scenario in which the camera is mounted on an unmanned aerial vehicle (UAV) or drone. We build a benchmark dataset of high diversity, consisting of 70 videos captured by drone cameras. To address the challenging issue of severe camera motion, we devise simple baselines that model the camera motion by geometric transformation based on background feature points. An extensive comparison of recent state-of-the-art trackers and their motion model variants on our drone tracking dataset validates both the necessity of the dataset and the effectiveness of the proposed methods. Our aim for this work is to lay the foundation for further research in the UAV tracking area.
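
The motion-model baselines estimate camera ego-motion as a geometric transformation fit to background feature points; below is a hedged OpenCV sketch of that step, with synthetic points standing in for tracked background features.

```python
# Camera ego-motion as a homography fit to background feature points (sketch).
import cv2
import numpy as np

pts_prev = np.random.rand(50, 1, 2).astype(np.float32) * 500
H_true = np.array([[1, 0, 12], [0, 1, -7], [0, 0, 1]], dtype=np.float32)
pts_curr = cv2.perspectiveTransform(pts_prev, H_true)   # simulate a camera shift

H, inliers = cv2.findHomography(pts_prev, pts_curr, cv2.RANSAC, 3.0)
print(np.round(H, 2))                # recovers the (12, -7) pixel ego-motion

# The target's motion model can then be compensated for camera movement:
box_center = np.array([[[100.0, 100.0]]], dtype=np.float32)
print(cv2.perspectiveTransform(box_center, H))          # ~ (112, 93)
```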
Abstract: Despite recent advances in the visual tracking community, most studies so far have focused on the observation model. As another important component in the tracking system, the motion model is much less well-explored especially for some extreme scenarios. In this paper, we consider one such scenario in which the camera is mounted on an unmanned aerial vehicle (UAV) or drone. We build a benchmark dataset of high diversity, consisting of 70 videos captured by drone cameras. To address the challenging issue of severe camera motion, we devise simple baselines to model the camera motion by geometric transformation based on background feature points. An extensive comparison of recent state-of-the-art trackers and their motion model variants on our drone tracking dataset validates both the necessity of the dataset and the effectiveness of the proposed methods. Our aim for this work is to lay the foundation for further research in the UAV tracking area.