
Showing papers on "Video tracking published in 2017"


Proceedings ArticleDOI
21 Jul 2017
TL;DR: This work revisits the core DCF formulation and introduces a factorized convolution operator, which drastically reduces the number of parameters in the model, and a compact generative model of the training sample distribution that significantly reduces memory and time complexity while providing better sample diversity.
Abstract: In recent years, Discriminative Correlation Filter (DCF) based methods have significantly advanced the state-of-the-art in tracking. However, in the pursuit of ever increasing tracking performance, their characteristic speed and real-time capability have gradually faded. Further, the increasingly complex models, with a massive number of trainable parameters, have introduced the risk of severe over-fitting. In this work, we tackle the key causes behind the problems of computational complexity and over-fitting, with the aim of simultaneously improving both speed and performance. We revisit the core DCF formulation and introduce: (i) a factorized convolution operator, which drastically reduces the number of parameters in the model, (ii) a compact generative model of the training sample distribution, which significantly reduces memory and time complexity while providing better diversity of samples, (iii) a conservative model update strategy with improved robustness and reduced complexity. We perform comprehensive experiments on four benchmarks: VOT2016, UAV123, OTB-2015, and TempleColor. When using expensive deep features, our tracker provides a 20-fold speedup and achieves a 13.0% relative gain in Expected Average Overlap compared to the top-ranked method [12] in the VOT2016 challenge. Moreover, our fast variant, using hand-crafted features, operates at 60 Hz on a single CPU, while obtaining 65.0% AUC on OTB-2015.

1,993 citations
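
The factorized convolution idea above can be made concrete in a few lines: instead of learning one filter per feature channel, a learned projection matrix maps the D feature channels down to C ≪ D before correlation. Below is a minimal numpy sketch of such an operator, with random tensors standing in for the learned projection `P`, filter bank `f`, and deep features `x`; it illustrates the parameter reduction, not the authors' implementation.

```python
# Minimal sketch of a factorized convolution operator in the spirit of ECO.
# x: (D, H, W) feature map; P: (D, C) projection with C << D; f: (C, H, W) filters.
import numpy as np

def factorized_response(x, P, f):
    """Detection scores computed with C filters instead of D (fewer parameters)."""
    z = np.tensordot(P.T, x, axes=1)           # project features: (C, H, W)
    zf = np.fft.fft2(z, axes=(-2, -1))         # correlate in the Fourier domain
    ff = np.fft.fft2(f, axes=(-2, -1))
    resp = np.fft.ifft2(np.conj(ff) * zf, axes=(-2, -1)).real
    return resp.sum(axis=0)                    # (H, W) response map

D, C, H, W = 512, 64, 50, 50
x = np.random.randn(D, H, W).astype(np.float32)
P = np.random.randn(D, C).astype(np.float32) / np.sqrt(D)
f = np.random.randn(C, H, W).astype(np.float32)
print(factorized_response(x, P, f).shape)      # (50, 50)
```

With D = 512 and C = 64, the filter bank shrinks roughly eightfold, which is where the claimed speed and over-fitting benefits originate.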


Proceedings ArticleDOI
21 Mar 2017
TL;DR: This paper integrates appearance information to improve the performance of SORT and reduces the number of identity switches, achieving overall competitive performance at high frame rates.
Abstract: Simple Online and Realtime Tracking (SORT) is a pragmatic approach to multiple object tracking with a focus on simple, effective algorithms. In this paper, we integrate appearance information to improve the performance of SORT. Due to this extension we are able to track objects through longer periods of occlusion, effectively reducing the number of identity switches. In the spirit of the original framework, we place much of the computational complexity into an offline pre-training stage where we learn a deep association metric on a large-scale person re-identification dataset. During online application, we establish measurement-to-track associations using nearest neighbor queries in visual appearance space. Experimental evaluation shows that our extensions reduce the number of identity switches by 45%, achieving overall competitive performance at high frame rates.

1,808 citations
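
The online association step described above reduces to nearest-neighbor queries in an appearance embedding space, solved as a linear assignment problem. A hedged numpy/scipy sketch follows; the 128-d descriptors would come from the offline-trained re-identification network, which is assumed here.

```python
# Appearance-based track/detection association: cosine distance + Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, det_feats, max_cosine_dist=0.2):
    """Match existing tracks to new detections by cosine distance in embedding space."""
    a = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    b = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                       # cosine distance matrix
    rows, cols = linear_sum_assignment(cost)   # optimal assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cosine_dist]

tracks = np.random.randn(3, 128)               # stored appearance descriptors
dets = np.random.randn(4, 128)                 # descriptors of new detections
print(associate(tracks, dets))                 # list of (track, detection) pairs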


Proceedings ArticleDOI
21 Jul 2017
TL;DR: In this paper, the Correlation Filter learner is interpreted as a differentiable layer in a deep neural network, which enables learning deep features that are tightly coupled to the correlation filter.
Abstract: The Correlation Filter is an algorithm that trains a linear template to discriminate between images and their translations. It is well suited to object tracking because its formulation in the Fourier domain provides a fast solution, enabling the detector to be re-trained once per frame. Previous works that use the Correlation Filter, however, have adopted features that were either manually designed or trained for a different task. This work is the first to overcome this limitation by interpreting the Correlation Filter learner, which has a closed-form solution, as a differentiable layer in a deep neural network. This enables learning deep features that are tightly coupled to the Correlation Filter. Experiments illustrate that our method has the important practical benefit of allowing lightweight architectures to achieve state-of-the-art performance at high framerates.

1,329 citations
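
The closed-form solution that makes the Correlation Filter usable as a network layer is ordinary ridge regression in the Fourier domain. Below is a single-channel numpy sketch of that solution (training and detection); the paper's contribution is back-propagating through this map, which is omitted here.

```python
# Closed-form correlation filter: train on patch x with desired response y,
# then detect on a new patch z. A sketch, not the authors' implementation.
import numpy as np

def train_cf(x, y, lam=1e-2):
    xf, yf = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(xf) * yf / (np.conj(xf) * xf + lam)   # filter in the Fourier domain

def detect(wf, z):
    return np.fft.ifft2(wf * np.fft.fft2(z)).real        # response map

H = W = 64
yy, xx = np.mgrid[0:H, 0:W]
y = np.exp(-0.5 * ((yy - H // 2) ** 2 + (xx - W // 2) ** 2) / 4.0)
y = np.roll(y, (-H // 2, -W // 2), axis=(0, 1))          # Gaussian label, peak at origin
x = np.random.randn(H, W)
wf = train_cf(x, y)
print(detect(wf, x).max())                               # ~1: strong peak on the training patch
```

Because every step is differentiable, gradients can flow from the response map back into the feature extractor, which is exactly what enables the tightly coupled deep features.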



Journal ArticleDOI
TL;DR: In this article, a scale adaptive tracking approach by learning separate discriminative correlation filters for translation and scale estimation is proposed, which directly learns the appearance change induced by variations in the target scale.
Abstract: Accurate scale estimation of a target is a challenging research problem in visual object tracking. Most state-of-the-art methods employ an exhaustive scale search to estimate the target size. The exhaustive search strategy is computationally expensive and struggles with large scale variations. This paper investigates the problem of accurate and robust scale estimation in a tracking-by-detection framework. We propose a novel scale adaptive tracking approach by learning separate discriminative correlation filters for translation and scale estimation. The explicit scale filter is learned online using the target appearance sampled at a set of different scales. Contrary to standard approaches, our method directly learns the appearance change induced by variations in the target scale. Additionally, we investigate strategies to reduce the computational cost of our approach. Extensive experiments are performed on the OTB and the VOT2014 datasets. Compared to the standard exhaustive scale search, our approach achieves a gain of 2.5 percent in average overlap precision on the OTB dataset. Additionally, our method is computationally efficient, operating at a 50 percent higher frame rate than the exhaustive scale search. Our method obtains the top rank in performance by outperforming 19 state-of-the-art trackers on OTB and 37 state-of-the-art trackers on VOT2014.

945 citations
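
The separate scale filter can be illustrated compactly: descriptors are extracted from the target resized to a pyramid of scales, and a 1-D multi-channel correlation filter over the scale axis picks the best one. In the sketch below, random vectors stand in for the HOG descriptors of the scaled patches; only the filter algebra follows the described approach.

```python
# 1-D correlation filter over the scale dimension (explicit scale filter sketch).
import numpy as np

n_scales, d = 17, 31 * 4 * 4                 # scale samples x descriptor length
F = np.random.randn(n_scales, d)             # one descriptor per scale sample
y = np.exp(-0.5 * ((np.arange(n_scales) - n_scales // 2) / 1.0) ** 2)  # Gaussian label

Ff, yf = np.fft.fft(F, axis=0), np.fft.fft(y)
denom = (np.conj(Ff) * Ff).sum(axis=1, keepdims=True) + 1e-2
Hf = np.conj(Ff) * yf[:, None] / denom       # multi-channel scale filter

Zf = np.fft.fft(F, axis=0)                   # test on the training sample
resp = np.fft.ifft((Hf * Zf).sum(axis=1)).real
print(np.argmax(resp))                       # -> 8, the centre (current) scale
```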


Proceedings ArticleDOI
21 Jul 2017
TL;DR: The channel and spatial reliability concepts are introduced to DCF tracking, and a novel learning algorithm is provided for their efficient and seamless integration into the filter update and the tracking process.
Abstract: Short-term tracking is an open and challenging problem for which discriminative correlation filters (DCF) have shown excellent performance. We introduce the channel and spatial reliability concepts to DCF tracking and provide a novel learning algorithm for their efficient and seamless integration into the filter update and the tracking process. The spatial reliability map adjusts the filter support to the part of the object suitable for tracking. This allows tracking of non-rectangular objects as well as extending the search region. Channel reliability reflects the quality of the learned filter and is used as a feature weighting coefficient in localization. Experimentally, with only two simple standard features, HOGs and Colornames, the novel CSR-DCF method – DCF with Channel and Spatial Reliability – achieves state-of-the-art results on VOT 2016, VOT 2015 and OTB. The CSR-DCF runs in real-time on a CPU.

941 citations
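
Channel reliability as a feature weighting coefficient in localization can be sketched as follows; the weight used here (normalized per-channel peak response) is a simplification of the paper's learned reliability scores.

```python
# Channel-reliability-weighted localization: fuse per-channel response maps.
import numpy as np

def weighted_localization(responses):
    """responses: (C, H, W) per-channel correlation responses."""
    w = responses.reshape(responses.shape[0], -1).max(axis=1)  # peak per channel
    w = np.clip(w, 0, None)
    w /= w.sum() + 1e-12                                       # reliability weights
    fused = (w[:, None, None] * responses).sum(axis=0)
    return np.unravel_index(fused.argmax(), fused.shape)       # estimated target position

resp = np.random.rand(10, 50, 50)
print(weighted_localization(resp))
```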


Proceedings ArticleDOI
01 Jul 2017
TL;DR: A class of temporal models that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection, capable of capturing action compositions, segment durations, and long-range dependencies, and over an order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks.
Abstract: The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns, whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over an order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.

859 citations
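
A hedged PyTorch sketch of the Dilated TCN variant: stacked 1-D convolutions with exponentially growing dilation give each output frame a long temporal receptive field at low cost. Layer widths and depths are illustrative, not the paper's.

```python
# Dilated temporal convolutional network producing per-frame action logits.
import torch
import torch.nn as nn

class DilatedTCN(nn.Module):
    def __init__(self, in_dim, n_classes, channels=64, n_layers=4):
        super().__init__()
        layers = []
        for i in range(n_layers):
            d = 2 ** i                                  # dilation 1, 2, 4, 8
            layers += [nn.Conv1d(in_dim if i == 0 else channels, channels,
                                 kernel_size=3, dilation=d, padding=d),
                       nn.ReLU()]
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, x):                               # x: (batch, features, time)
        return self.head(self.body(x))                  # (batch, classes, time)

x = torch.randn(2, 128, 200)       # 200 frames of 128-d spatiotemporal features
print(DilatedTCN(128, 10)(x).shape)                     # torch.Size([2, 10, 200])
```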


Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper proposes a dynamic Siamese network with a fast transformation learning model that enables effective online learning of target appearance variation and background suppression from previous frames, and presents elementwise multi-layer fusion to adaptively integrate the network outputs using multi-level deep features.
Abstract: How to effectively learn temporal variation of target appearance and exclude the interference of cluttered background, while maintaining real-time response, is an essential problem of visual object tracking. Recently, Siamese networks have shown great potential for matching-based trackers in achieving balanced accuracy and beyond-real-time speed. However, they still lag behind classification-and-updating-based trackers in tolerating temporal changes of objects and imaging conditions. In this paper, we propose a dynamic Siamese network with a fast transformation learning model that enables effective online learning of target appearance variation and background suppression from previous frames. We then present elementwise multi-layer fusion to adaptively integrate the network outputs using multi-level deep features. Unlike state-of-the-art trackers, our approach allows the use of any feasible generally or particularly trained features, such as SiamFC and VGG. More importantly, the proposed dynamic Siamese network can be jointly trained as a whole directly on labeled video sequences, and can thus take full advantage of the rich spatio-temporal information of moving objects. As a result, our approach achieves state-of-the-art performance on the OTB-2013 and VOT-2015 benchmarks, while exhibiting better-balanced accuracy and real-time response than state-of-the-art competitors.

772 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: This work proposes a Background-Aware CF based on hand-crafted features (HOG) that can efficiently model how both the foreground and background of the object vary over time, demonstrating superior accuracy and real-time performance compared to state-of-the-art trackers.
Abstract: Correlation Filters (CFs) have recently demonstrated excellent performance in terms of rapidly tracking objects under challenging photometric and geometric variations. The strength of the approach comes from its ability to efficiently learn - on the fly - how the object is changing over time. A fundamental drawback to CFs, however, is that the background of the target is not modeled over time, which can result in suboptimal performance. Recent tracking algorithms have suggested to resolve this drawback by either learning CFs from more discriminative deep features (e.g. DeepSRDCF [9] and CCOT [11]) or learning complex deep trackers (e.g. MDNet [28] and FCNT [33]). While such methods have been shown to work well, they suffer from high complexity: extracting deep features or applying deep tracking frameworks is very computationally expensive. This limits the real-time performance of such methods, even on high-end GPUs. This work proposes a Background-Aware CF based on hand-crafted features (HOG [6]) that can efficiently model how both the foreground and background of the object vary over time. Our approach, like conventional CFs, is extremely computationally efficient, and extensive experiments over multiple tracking benchmarks demonstrate the superior accuracy and real-time performance of our method compared to the state-of-the-art trackers.

679 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: Deep voxel flow combines the advantages of optical-flow-based and neural-network-based methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones; the technique requires no human supervision and can be applied at any video resolution.
Abstract: We address the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that hallucinate pixel values directly often produce blurry results. We combine the advantages of these two methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient, and can be applied at any video resolution. We demonstrate that our method produces results that both quantitatively and qualitatively improve upon the state-of-the-art.

601 citations
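
The core of the approach is differentiable sampling: the network predicts a per-pixel (dx, dy, dt) "voxel flow", pixels are bilinearly sampled from the previous and next frames along that flow, and dt blends the two samples. Below is a sketch using torch grid_sample; the zero flow here stands in for the network's prediction.

```python
# Frame synthesis by trilinear sampling along a predicted voxel flow (sketch).
import torch
import torch.nn.functional as F

def synthesize(prev, nxt, flow):
    """prev/nxt: (B,3,H,W); flow: (B,3,H,W) with (dx,dy) in [-1,1] coords, dt in [0,1]."""
    B, _, H, W = prev.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                            indexing="ij")
    base = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)   # identity sampling grid
    d = flow[:, :2].permute(0, 2, 3, 1)                       # (B,H,W,2) displacement
    dt = flow[:, 2:3]                                         # temporal blend weight
    from_prev = F.grid_sample(prev, base - d, align_corners=True)
    from_next = F.grid_sample(nxt, base + d, align_corners=True)
    return (1 - dt) * from_prev + dt * from_next

prev, nxt = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 3, 64, 64)            # zero flow, dt=0 -> reproduces prev
print(torch.allclose(synthesize(prev, nxt, flow), prev, atol=1e-5))   # True
```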


Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper proposes a large-margin object tracking method that absorbs the strong discriminative ability of structured output SVMs and is significantly sped up by the correlation filter algorithm.
Abstract: Structured output support vector machine (SVM) based tracking algorithms have shown favorable performance recently. Nonetheless, their time-consuming candidate sampling and complex optimization limit real-time applications. In this paper, we propose a novel large-margin object tracking method that absorbs the strong discriminative ability of structured output SVMs and is significantly sped up by the correlation filter algorithm. In addition, a multimodal target detection technique is proposed to improve target localization precision and prevent the model drift introduced by similar objects or background noise. Furthermore, we exploit feedback from high-confidence tracking results to avoid the model corruption problem. We implement two versions of the proposed tracker, with representations from both conventional hand-crafted and deep convolutional neural network (CNN) features, to validate the strong compatibility of the algorithm. The experimental results demonstrate that the proposed tracker performs superiorly against several state-of-the-art algorithms on challenging benchmark sequences while running at speeds in excess of 80 frames per second.
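
The high-confidence feedback above hinges on scoring the quality of a correlation response map. The paper's average peak-to-correlation energy (APCE) criterion is simple to write down; the update threshold one would compare against is an assumption here.

```python
# APCE response-quality measure: high for a single sharp peak, low for noisy maps.
import numpy as np

def apce(resp):
    fmax, fmin = resp.max(), resp.min()
    return (fmax - fmin) ** 2 / np.mean((resp - fmin) ** 2)

resp = np.random.rand(50, 50) * 0.1
resp[25, 25] = 1.0                      # a single sharp peak -> high APCE
print(apce(resp))                       # large value: safe to update the model
print(apce(np.random.rand(50, 50)))     # flat/multi-modal: skip the update
```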

Proceedings ArticleDOI
01 Jul 2017
TL;DR: Through evaluation on the OTB dataset, the proposed tracker is validated to achieve competitive performance while being three times faster than state-of-the-art deep-network-based trackers.
Abstract: This paper proposes a novel tracker which is controlled by sequentially pursuing actions learned by deep reinforcement learning. In contrast to existing trackers using deep networks, the proposed tracker is designed to achieve light computation as well as satisfactory tracking accuracy in both location and scale. The deep network to control actions is pre-trained using various training sequences and fine-tuned during tracking for online adaptation to target and background changes. The pre-training is done by utilizing deep reinforcement learning as well as supervised learning. The use of reinforcement learning enables even partially labeled data to be successfully utilized for semi-supervised learning. Through evaluation on the OTB dataset, the proposed tracker is validated to achieve competitive performance while being three times faster than state-of-the-art deep-network-based trackers. The fast version of the proposed method, which operates in real-time on a GPU, outperforms the state-of-the-art real-time trackers.

Proceedings ArticleDOI
Esteban Real1, Jonathon Shlens1, Stefano Mazzocchi1, Xin Pan1, Vincent Vanhoucke1 
01 Jul 2017
TL;DR: A new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB), which consists of approximately 380,000 video segments automatically selected to feature objects in natural settings without editing or post-processing.
Abstract: We introduce a new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB). The data set consists of approximately 380,000 video segments about 19s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the COCO [32] label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second. The use of a cascade of increasingly precise human annotations ensures a label accuracy above 95% for every class and tight bounding boxes. Finally, we train and evaluate well-known deep network architectures and report baseline figures for per-frame classification and localization. We also demonstrate how the temporal contiguity of video can potentially be used to improve such inferences. The data set can be found at https://research.google.com/youtube-bb. We hope the availability of such a large curated corpus will spur new advances in video object detection and tracking.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This work presents a tracking-by-detection algorithm that can compete with more sophisticated approaches at a fraction of the computational cost, and demonstrates its potential in thorough experiments with a wide range of object detectors.
Abstract: Tracking-by-detection is a common approach to multi-object tracking. With ever increasing performances of object detectors, the basis for a tracker becomes much more reliable. In combination with commonly higher frame rates, this poses a shift in the challenges for a successful tracker. That shift enables the deployment of much simpler tracking algorithms which can compete with more sophisticated approaches at a fraction of the computational cost. We present such an algorithm and show with thorough experiments its potential using a wide range of object detectors. The proposed method can easily run at 100K fps while outperforming the state-of-the-art on the DETRAC vehicle tracking dataset.
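
The algorithm is simple enough to sketch in full: each track greedily takes the highest-IoU detection in the new frame and unmatched detections start new tracks. A minimal Python sketch of that loop follows (track termination and minimum-length filtering omitted).

```python
# Minimal IoU-based tracking-by-detection step.
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def step(tracks, detections, sigma_iou=0.5):
    """Extend each active track with its best-matching detection."""
    dets = list(detections)
    for t in tracks:
        if not dets:
            break
        best = max(dets, key=lambda d: iou(t[-1], d))
        if iou(t[-1], best) >= sigma_iou:
            t.append(best)
            dets.remove(best)
    tracks.extend([d] for d in dets)      # unmatched detections start new tracks
    return tracks

tracks = step([], [(0, 0, 10, 10)])
tracks = step(tracks, [(1, 1, 11, 11), (50, 50, 60, 60)])
print(len(tracks), len(tracks[0]))        # 2 tracks, the first now spans 2 frames
```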

Proceedings ArticleDOI
Matej Kristan1, Ales Leonardis2, Jiri Matas3, Michael Felsberg4, Roman Pflugfelder5, Luka Čehovin Zajc1, Tomas Vojir3, Gustav Häger4, Alan Lukezic1, Abdelrahman Eldesokey4, Gustavo Fernandez5, Alvaro Garcia-Martin6, Andrej Muhič1, Alfredo Petrosino7, Alireza Memarmoghadam8, Andrea Vedaldi9, Antoine Manzanera10, Antoine Tran10, A. Aydin Alatan11, Bogdan Mocanu, Boyu Chen12, Chang Huang, Changsheng Xu13, Chong Sun12, Dalong Du, David Zhang, Dawei Du13, Deepak Mishra, Erhan Gundogdu14, Erhan Gundogdu11, Erik Velasco-Salido, Fahad Shahbaz Khan4, Francesco Battistone, Gorthi R. K. Sai Subrahmanyam, Goutam Bhat4, Guan Huang, Guilherme Sousa Bastos, Guna Seetharaman15, Hongliang Zhang16, Houqiang Li17, Huchuan Lu12, Isabela Drummond, Jack Valmadre9, Jae-chan Jeong18, Jaeil Cho18, Jae-Yeong Lee18, Jana Noskova, Jianke Zhu19, Jin Gao13, Jingyu Liu13, Ji-Wan Kim18, João F. Henriques9, José M. Martínez, Junfei Zhuang20, Junliang Xing13, Junyu Gao13, Kai Chen21, Kannappan Palaniappan22, Karel Lebeda, Ke Gao22, Kris M. Kitani23, Lei Zhang, Lijun Wang12, Lingxiao Yang, Longyin Wen24, Luca Bertinetto9, Mahdieh Poostchi22, Martin Danelljan4, Matthias Mueller25, Mengdan Zhang13, Ming-Hsuan Yang26, Nianhao Xie16, Ning Wang17, Ondrej Miksik9, Payman Moallem8, Pallavi Venugopal M, Pedro Senna, Philip H. S. Torr9, Qiang Wang13, Qifeng Yu16, Qingming Huang13, Rafael Martin-Nieto, Richard Bowden27, Risheng Liu12, Ruxandra Tapu, Simon Hadfield27, Siwei Lyu28, Stuart Golodetz9, Sunglok Choi18, Tianzhu Zhang13, Titus Zaharia, Vincenzo Santopietro, Wei Zou13, Weiming Hu13, Wenbing Tao21, Wenbo Li28, Wengang Zhou17, Xianguo Yu16, Xiao Bian24, Yang Li19, Yifan Xing23, Yingruo Fan20, Zheng Zhu13, Zhipeng Zhang13, Zhiqun He20 
01 Jul 2017
TL;DR: The Visual Object Tracking challenge VOT2017 is the fifth annual tracker benchmarking activity organized by the VOT initiative; results of 51 trackers are presented, many of them state-of-the-art methods published at major computer vision conferences or journals in recent years.
Abstract: The Visual Object Tracking challenge VOT2017 is the fifth annual tracker benchmarking activity organized by the VOT initiative. Results of 51 trackers are presented; many are state-of-the-art methods published at major computer vision conferences or journals in recent years. The evaluation included the standard VOT and other popular methodologies and a new "real-time" experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. Performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. VOT2017 goes beyond its predecessors by (i) improving the VOT public dataset and introducing a separate VOT2017 sequestered dataset, (ii) introducing a real-time tracking experiment and (iii) releasing a redesigned toolkit that supports complex experiments. The dataset, the evaluation kit and the results are publicly available at the challenge website.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: In this paper, the authors use a combination of offline and online learning strategies, where the former produces a refined mask from the previous frame estimate and the latter allows capturing the appearance of the specific object instance.
Abstract: Inspired by recent advances of deep learning in instance segmentation and object tracking, we introduce the concept of convnet-based guidance applied to video object segmentation. Our model proceeds on a per-frame basis, guided by the output of the previous frame towards the object of interest in the next frame. We demonstrate that highly accurate object segmentation in videos can be enabled by using a convolutional neural network (convnet) trained with static images only. The key component of our approach is a combination of offline and online learning strategies, where the former produces a refined mask from the previous frame estimate and the latter allows capturing the appearance of the specific object instance. Our method can handle different types of input annotations, such as bounding boxes and segments, while leveraging an arbitrary amount of annotated frames. Therefore our system is suitable for diverse applications with different requirements in terms of accuracy and efficiency. In our extensive evaluation, we obtain competitive results on three different datasets, independently of the type of input annotation.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: The proposed MCPF is designed to exploit and complement the strengths of an MCF and a particle filter, and can effectively maintain multiple modes in the posterior density using fewer particles than conventional particle filters, thereby lowering the computational cost.
Abstract: In this paper, we propose a multi-task correlation particle filter (MCPF) for robust visual tracking. We first present the multi-task correlation filter (MCF) that takes the interdependencies among different features into account to learn correlation filters jointly. The proposed MCPF is designed to exploit and complement the strengths of an MCF and a particle filter. Compared with existing tracking methods based on correlation filters and particle filters, the proposed tracker has several advantages. First, it can shepherd the sampled particles toward the modes of the target state distribution via the MCF, thereby resulting in robust tracking performance. Second, it can effectively handle large variations in scale via a particle sampling strategy. Third, it can effectively maintain multiple modes in the posterior density using fewer particles than conventional particle filters, thereby lowering the computational cost. Extensive experimental results on three benchmark datasets demonstrate that the proposed MCPF performs favorably against the state-of-the-art methods.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This paper proposes the CREST algorithm to reformulate DCFs as a one-layer convolutional neural network, applying residual learning to account for appearance changes and reduce model degradation during online updates.
Abstract: Discriminative correlation filters (DCFs) have been shown to perform superiorly in visual tracking. They only need a small set of training samples from the initial frame to generate an appearance model. However, existing DCFs learn the filters separately from feature extraction, and update these filters using a moving average operation with an empirical weight. These DCF trackers hardly benefit from end-to-end training. In this paper, we propose the CREST algorithm to reformulate DCFs as a one-layer convolutional neural network. Our method integrates feature extraction, response map generation as well as model update into the neural network for end-to-end training. To reduce model degradation during online updates, we apply residual learning to take appearance changes into account. Extensive experiments on the benchmark datasets demonstrate that our CREST tracker performs favorably against state-of-the-art trackers.
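
Reformulating the DCF as a one-layer convolutional network, plus a residual branch for appearance change, can be sketched in PyTorch as below; sizes are illustrative, and the paper's spatial/temporal residual design is condensed into a single branch.

```python
# DCF as a conv layer with a residual correction branch (sketch).
import torch
import torch.nn as nn

class CRESTHead(nn.Module):
    def __init__(self, c, k=31):
        super().__init__()
        self.base = nn.Conv2d(c, 1, kernel_size=k, padding=k // 2)   # the "DCF" layer
        self.residual = nn.Sequential(                                # models appearance change
            nn.Conv2d(c, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, feat):
        return self.base(feat) + self.residual(feat)                  # response map

feat = torch.randn(1, 64, 50, 50)      # conv features of the search region
print(CRESTHead(64)(feat).shape)       # torch.Size([1, 1, 50, 50])
```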

Posted Content
TL;DR: MoCoGAN decomposes the visual signals in a video into content and motion, and learns this decomposition in an unsupervised manner using both image and video discriminators.
Abstract: Visual signals in a video can be divided into content and motion. While content specifies which objects are in the video, motion describes their dynamics. Based on this prior, we propose the Motion and Content decomposed Generative Adversarial Network (MoCoGAN) framework for video generation. The proposed framework generates a video by mapping a sequence of random vectors to a sequence of video frames. Each random vector consists of a content part and a motion part. While the content part is kept fixed, the motion part is realized as a stochastic process. To learn motion and content decomposition in an unsupervised manner, we introduce a novel adversarial learning scheme utilizing both image and video discriminators. Extensive experimental results on several challenging datasets, with qualitative and quantitative comparison to the state-of-the-art approaches, verify the effectiveness of the proposed framework. In addition, we show that MoCoGAN allows one to generate videos with the same content but different motion, as well as videos with different content and the same motion.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper sets up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression, and introduces correlation features that represent object co-occurrences across time to aid the ConvNet during tracking.
Abstract: Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during tracking; and (iii) we link the frame level detections based on our across-frame tracklets to produce high accuracy detections at the video level. Our ConvNet architecture for spatiotemporal object detection is evaluated on the large-scale ImageNet VID dataset where it achieves state-of-the-art results. Our approach provides better single model performance than the winning method of the last ImageNet challenge while being conceptually much simpler. Finally, we show that by increasing the temporal stride we can dramatically increase the tracker speed.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper proposes a CNN-based framework for online multi-object tracking with a spatial-temporal attention mechanism (STAM) to handle drift caused by occlusion and interaction among targets.
Abstract: In this paper, we propose a CNN-based framework for online MOT. This framework utilizes the merits of single object trackers in adapting appearance models and searching for the target in the next frame. Simply applying a single object tracker to MOT, however, encounters problems with computational efficiency and drift caused by occlusion. Our framework achieves computational efficiency by sharing features and using ROI-Pooling to obtain individual features for each target. Some online-learned target-specific CNN layers are used for adapting the appearance model of each target. In the framework, we introduce a spatial-temporal attention mechanism (STAM) to handle the drift caused by occlusion and interaction among targets. The visibility map of the target is learned and used for inferring the spatial attention map, which is then applied to weight the features. Besides, the occlusion status can be estimated from the visibility map, which controls the online updating process via a weighted loss on training samples with different occlusion statuses in different frames; this can be considered a temporal attention mechanism. The proposed algorithm achieves 34.3% and 46.0% MOTA on the challenging MOT15 and MOT16 benchmarks, respectively.

Proceedings ArticleDOI
20 Apr 2017
TL;DR: A novel two-stream neural network with an explicit memory module for segmenting moving objects in unconstrained videos, together with an extensive ablative analysis investigating the influence of each component of the proposed framework.
Abstract: This paper addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module to build a "visual memory" in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. Given a video frame as input, our approach assigns each pixel an object or background label based on the learned spatio-temporal features as well as the "visual memory" specific to the video, acquired automatically without any manually-annotated frames. The visual memory is implemented with convolutional gated recurrent units, which allow spatial information to be propagated over time. We evaluate our method extensively on two benchmarks, the DAVIS and Freiburg-Berkeley motion segmentation datasets, and show state-of-the-art results. For example, our approach outperforms the top method on the DAVIS dataset by nearly 6%. We also provide an extensive ablative analysis to investigate the influence of each component in the proposed framework.
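
The convolutional gated recurrent unit at the heart of the visual memory is compact; here is a hedged PyTorch sketch of one ConvGRU cell rolled over a short feature sequence (channel sizes illustrative).

```python
# ConvGRU cell: a GRU whose gates are 2-D convolutions, preserving spatial layout.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_c, hid_c, k=3):
        super().__init__()
        p = k // 2
        self.zr = nn.Conv2d(in_c + hid_c, 2 * hid_c, k, padding=p)  # update/reset gates
        self.hc = nn.Conv2d(in_c + hid_c, hid_c, k, padding=p)      # candidate state

    def forward(self, x, h):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], 1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.hc(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde     # spatial information propagated over time

cell, h = ConvGRUCell(64, 32), torch.zeros(1, 32, 28, 28)
for t in range(5):                           # roll over a short feature sequence
    h = cell(torch.randn(1, 64, 28, 28), h)
print(h.shape)                               # torch.Size([1, 32, 28, 28])
```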

Journal ArticleDOI
Daniel Kang1, John Emmons1, Firas Abuzaid1, Peter Bailis1, Matei Zaharia1 
01 Aug 2017
TL;DR: NoScope cascades two types of models: specialized models that forego the full generality of the reference model but faithfully mimic its behavior for the target video and object, and difference detectors that highlight temporal differences across frames.
Abstract: Recent advances in computer vision---in the form of deep neural networks---have made it possible to query increasing volumes of video data with high accuracy. However, neural network inference is computationally expensive at scale: applying a state-of-the-art object detector in real time (i.e., 30+ frames per second) to a single video requires a $4000 GPU. In response, we present NoScope, a system for querying videos that can reduce the cost of neural network video analysis by up to three orders of magnitude via inference-optimized model search. Given a target video, object to detect, and reference neural network, NoScope automatically searches for and trains a sequence, or cascade, of models that preserves the accuracy of the reference network but is specialized to the target video and is therefore far less computationally expensive. NoScope cascades two types of models: specialized models that forego the full generality of the reference model but faithfully mimic its behavior for the target video and object; and difference detectors that highlight temporal differences across frames. We show that the optimal cascade architecture differs across videos and objects, so NoScope uses an efficient cost-based optimizer to search across models and cascades. With this approach, NoScope achieves speed-ups of two to three orders of magnitude (265-15,500x real-time) on binary classification tasks over fixed-angle webcam and surveillance video while maintaining accuracy within 1--5% of state-of-the-art neural networks.
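
The difference-detector stage of such a cascade is easy to sketch: a cheap frame-difference test decides when the expensive reference network actually needs to run. `expensive_detector` below is an assumed stand-in, and the threshold would in practice be chosen by the cost-based optimizer.

```python
# Difference detector: reuse the last label when the frame barely changed.
import numpy as np

def cascade(frames, expensive_detector, diff_thresh=8.0):
    last_frame, last_label, labels = None, False, []
    for f in frames:
        if last_frame is not None and np.mean(np.abs(f - last_frame)) < diff_thresh:
            labels.append(last_label)        # skip inference on near-identical frames
        else:
            last_label = expensive_detector(f)
            labels.append(last_label)
        last_frame = f
    return labels

frames = [np.full((64, 64), i // 10, dtype=np.float32) for i in range(30)]
calls = []
detector = lambda f: (calls.append(1), f.mean() > 1)[1]
cascade(frames, detector)
print(len(calls), "detector call(s) for", len(frames), "frames")
```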

Posted Content
TL;DR: This work presents an end-to-end lightweight network architecture, namely DCFNet, to learn the convolutional features and perform the correlation tracking process simultaneously, and treats DCF as a special correlation filter layer added in a Siamese network.
Abstract: Discriminant Correlation Filters (DCF) based methods have become a dominant approach to online object tracking. The features used in these methods, however, are either hand-crafted features like HoGs, or convolutional features trained independently on other tasks like image classification. In this work, we present an end-to-end lightweight network architecture, namely DCFNet, to learn the convolutional features and perform the correlation tracking process simultaneously. Specifically, we treat DCF as a special correlation filter layer added in a Siamese network, and carefully derive the backpropagation through it by defining the network output as the probability heatmap of object location. Since the derivation is still carried out in the Fourier frequency domain, the efficiency property of DCF is preserved. This enables our tracker to run at more than 60 FPS during test time, while achieving a significant accuracy gain compared with KCF using HoGs. Extensive evaluations on the OTB-2013, OTB-2015, and VOT2015 benchmarks demonstrate that the proposed DCFNet tracker is competitive with several state-of-the-art trackers, while being more compact and much faster.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work proposes Quadruplet Convolutional Neural Networks (Quad-CNN) for multi-object tracking, which learn to associate object detections across frames using quadruplet losses and employ a multi-task loss to jointly learn object association and bounding box regression for better localization.
Abstract: We propose Quadruplet Convolutional Neural Networks (Quad-CNN) for multi-object tracking, which learn to associate object detections across frames using quadruplet losses. The proposed networks consider target appearances together with their temporal adjacencies for data association. Unlike conventional ranking losses, the quadruplet loss enforces an additional constraint that makes temporally adjacent detections more closely located than the ones with large temporal gaps. We also employ a multi-task loss to jointly learn object association and bounding box regression for better localization. The whole network is trained end-to-end. For tracking, the target association is performed by minimax label propagation using the metric learned from the proposed network. We evaluate performance of our multi-object tracking algorithm on public MOT Challenge datasets, and achieve outstanding results.
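
The quadruplet constraint can be written as a pair of margin terms: the temporally adjacent positive must be closer to the anchor than a temporally distant positive, which in turn must be closer than a negative. A hedged PyTorch sketch with illustrative margins follows.

```python
# Quadruplet-style ranking loss over embeddings (sketch, margins illustrative).
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, pos_near, pos_far, neg, m1=0.2, m2=0.2):
    d = lambda a, b: (a - b).pow(2).sum(dim=1)              # squared distances
    near_vs_far = F.relu(d(anchor, pos_near) - d(anchor, pos_far) + m1)
    far_vs_neg = F.relu(d(anchor, pos_far) - d(anchor, neg) + m2)
    return (near_vs_far + far_vs_neg).mean()

e = lambda: torch.randn(8, 128, requires_grad=True)         # batch of embeddings
loss = quadruplet_loss(e(), e(), e(), e())
loss.backward()
print(float(loss))
```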

Proceedings ArticleDOI
Heng Fan1, Haibin Ling1
01 Oct 2017
TL;DR: This paper presents a novel parallel tracking and verifying (PTAV) framework, consisting of two components, a tracker T and a verifier V, working in parallel on two separate threads.
Abstract: Being intensively studied, visual tracking has seen great recent advances in either speed (e.g., with correlation filters) or accuracy (e.g., with deep features). Real-time and high accuracy tracking algorithms, however, remain scarce. In this paper we study the problem from a new perspective and present a novel parallel tracking and verifying (PTAV) framework, by taking advantage of the ubiquity of multithread techniques and borrowing from the success of parallel tracking and mapping in visual SLAM. Our PTAV framework typically consists of two components, a tracker T and a verifier V, working in parallel on two separate threads. The tracker T aims to provide super real-time tracking inference and is expected to perform well most of the time; by contrast, the verifier V checks the tracking results and corrects T when needed. The key innovation is that V does not work on every frame but only upon requests from T; in turn, T may adjust the tracking according to feedback from V. With such collaboration, PTAV enjoys both the high efficiency provided by T and the strong discriminative power of V. In our extensive experiments on popular benchmarks including OTB2013, OTB2015, TC128 and UAV20L, PTAV achieves the best tracking accuracy among all real-time trackers, and in fact performs even better than many deep learning based solutions. Moreover, as a general framework, PTAV is very flexible and has great room for improvement and generalization.
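
The tracker/verifier collaboration is essentially a two-thread producer/consumer pattern: T tracks every frame and occasionally enqueues a verification request, while V answers asynchronously and T applies corrections when they arrive. A minimal Python sketch with placeholder tracker and verifier functions follows.

```python
# Parallel tracking-and-verifying pattern with two threads and queues (sketch).
import queue
import threading

requests, verdicts = queue.Queue(), queue.Queue()

def verifier():
    while True:
        idx, result = requests.get()
        if idx is None:
            break
        verdicts.put((idx, result))          # placeholder: V accepts T's result

threading.Thread(target=verifier, daemon=True).start()
track = lambda frame, prev: prev             # placeholder fast tracker

state = (0, 0, 10, 10)
for idx in range(100):
    state = track(None, state)               # super real-time path, every frame
    if idx % 10 == 0:
        requests.put((idx, state))           # ask V only occasionally
    while not verdicts.empty():
        _, corrected = verdicts.get_nowait() # apply V's feedback when available
        state = corrected
requests.put((None, None))
print("final state:", state)
```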

Journal ArticleDOI
TL;DR: In this article, a multi-scale image processing method is applied to the frames of the video of a vibrating structure to extract the local pixel phases that encode local structural vibration, establishing a full-field spatio-temporal motion matrix.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this paper, the authors propose an end-to-end network for online video style transfer, which generates temporally coherent stylized video sequences in near real-time by incorporating short-term coherence.
Abstract: Training a feed-forward network for the fast neural style transfer of images has proven successful, but the naive extension of processing videos frame by frame is prone to producing flickering results. We propose the first end-to-end network for online video style transfer, which generates temporally coherent stylized video sequences in near real-time. Two key ideas include an efficient network incorporating short-term coherence, and propagating short-term coherence to long-term, which ensures consistency over a longer period of time. Our network can incorporate different image stylization networks and clearly outperforms the per-frame baseline both qualitatively and quantitatively. Moreover, it can achieve visually comparable coherence to optimization-based video style transfer, but is three orders of magnitude faster.

Proceedings Article
12 Feb 2017
TL;DR: This paper builds a drone tracking benchmark and proposes baseline algorithms that explicitly estimate camera ego-motion via geometric transformations based on background feature points.
Abstract: Despite recent advances in the visual tracking community, most studies so far have focused on the observation model. As another important component of the tracking system, the motion model is much less well explored, especially in extreme scenarios. In this paper, we consider one such scenario in which the camera is mounted on an unmanned aerial vehicle (UAV) or drone. We build a benchmark dataset of high diversity, consisting of 70 videos captured by drone cameras. To address the challenging issue of severe camera motion, we devise simple baselines that model the camera motion by geometric transformation based on background feature points. An extensive comparison of recent state-of-the-art trackers and their motion model variants on our drone tracking dataset validates both the necessity of the dataset and the effectiveness of the proposed methods. Our aim for this work is to lay the foundation for further research in the UAV tracking area.
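
The motion-model baselines estimate camera ego-motion as a geometric transformation fit to background feature points; below is a hedged OpenCV sketch of that step, with synthetic points standing in for tracked background features.

```python
# Camera ego-motion as a homography fit to background feature points (sketch).
import cv2
import numpy as np

pts_prev = np.random.rand(50, 1, 2).astype(np.float32) * 500
H_true = np.array([[1, 0, 12], [0, 1, -7], [0, 0, 1]], dtype=np.float32)
pts_curr = cv2.perspectiveTransform(pts_prev, H_true)   # simulate a camera shift

H, inliers = cv2.findHomography(pts_prev, pts_curr, cv2.RANSAC, 3.0)
print(np.round(H, 2))                # recovers the (12, -7) pixel ego-motion

# The target's motion model can then be compensated for camera movement:
box_center = np.array([[[100.0, 100.0]]], dtype=np.float32)
print(cv2.perspectiveTransform(box_center, H))          # ~ (112, 93)
```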
Abstract: Despite recent advances in the visual tracking community, most studies so far have focused on the observation model. As another important component in the tracking system, the motion model is much less well-explored especially for some extreme scenarios. In this paper, we consider one such scenario in which the camera is mounted on an unmanned aerial vehicle (UAV) or drone. We build a benchmark dataset of high diversity, consisting of 70 videos captured by drone cameras. To address the challenging issue of severe camera motion, we devise simple baselines to model the camera motion by geometric transformation based on background feature points. An extensive comparison of recent state-of-the-art trackers and their motion model variants on our drone tracking dataset validates both the necessity of the dataset and the effectiveness of the proposed methods. Our aim for this work is to lay the foundation for further research in the UAV tracking area.