
Showing papers on "Video tracking published in 2018"


Proceedings ArticleDOI
18 Jun 2018
TL;DR: The Siamese region proposal network (Siamese-RPN) is proposed for visual object tracking; it is trained end-to-end off-line with large-scale image pairs and consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork comprising a classification branch and a regression branch.
Abstract: Visual object tracking has been a fundamental topic in recent years and many deep learning based trackers have achieved state-of-the-art performance on multiple benchmarks. However, most of these trackers can hardly get top performance with real-time speed. In this paper, we propose the Siamese region proposal network (Siamese-RPN), which is end-to-end trained off-line with large-scale image pairs. Specifically, it consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork including a classification branch and a regression branch. In the inference phase, the proposed framework is formulated as a local one-shot detection task. We can pre-compute the template branch of the Siamese subnetwork and formulate the correlation layers as trivial convolution layers to perform online tracking. Benefiting from the proposal refinement, traditional multi-scale testing and online fine-tuning can be discarded. The Siamese-RPN runs at 160 FPS while achieving leading performance in the VOT2015, VOT2016 and VOT2017 real-time challenges.
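Since the template branch can be pre-computed and the correlation layer reduces to a plain convolution, the per-frame online step is essentially a single conv2d call. A minimal sketch of that idea (illustrative shapes, not the authors' implementation):

```python
# Minimal sketch (not the authors' code): the pre-computed template embedding
# acts as a convolution kernel, so per-frame tracking is one conv2d call.
import torch
import torch.nn.functional as F

def siamese_correlation(template_feat, search_feat):
    """template_feat: (C, th, tw) computed once from the first frame.
    search_feat: (C, sh, sw) computed from the current frame.
    Returns a response map of shape (1, 1, sh - th + 1, sw - tw + 1)."""
    kernel = template_feat.unsqueeze(0)   # (1, C, th, tw), used as conv weight
    search = search_feat.unsqueeze(0)     # (1, C, sh, sw)
    return F.conv2d(search, kernel)       # cross-correlation == convolution here

# Toy usage with random features standing in for CNN outputs.
response = siamese_correlation(torch.randn(256, 6, 6), torch.randn(256, 22, 22))
print(response.shape)  # torch.Size([1, 1, 17, 17])
```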

2,016 citations


Book ChapterDOI
Zheng Zhu1, Qiang Wang1, Bo Li2, Wei Wu2, Junjie Yan2, Weiming Hu1 
08 Sep 2018
TL;DR: Zhu et al. propose a distractor-aware Siamese network for accurate and long-term tracking, which uses an effective sampling strategy to control the distribution of training data and make the model focus on semantic distractors.
Abstract: Recently, Siamese networks have drawn great attention in the visual tracking community because of their balanced accuracy and speed. However, the features used in most Siamese tracking approaches can only discriminate the foreground from non-semantic backgrounds. Semantic backgrounds are always considered as distractors, which hinders the robustness of Siamese trackers. In this paper, we focus on learning distractor-aware Siamese networks for accurate and long-term tracking. To this end, the features used in traditional Siamese trackers are analyzed first. We observe that the imbalanced distribution of training data makes the learned features less discriminative. During the off-line training phase, an effective sampling strategy is introduced to control this distribution and make the model focus on the semantic distractors. During inference, a novel distractor-aware module is designed to perform incremental learning, which can effectively transfer the general embedding to the current video domain. In addition, we extend the proposed approach for long-term tracking by introducing a simple yet effective local-to-global search region strategy. Extensive experiments on benchmarks show that our approach significantly outperforms the state of the art, yielding a 9.6% relative gain on the VOT2016 dataset and a 35.9% relative gain on the UAV20L dataset. The proposed tracker can perform at 160 FPS on short-term benchmarks and 110 FPS on long-term benchmarks.
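One way to picture the distractor-aware inference step is as a re-ranking of candidate proposals, where similarity to distractors collected online is subtracted from similarity to the target template. A toy sketch of that idea (parameter names are assumptions, not the released code):

```python
# Illustrative sketch (assumptions, not the released DaSiamRPN code): re-rank
# candidates by penalising their similarity to online-collected distractors.
import numpy as np

def distractor_aware_score(sim_to_target, sim_to_distractors, weights, alpha=0.5):
    """sim_to_target: (N,) similarity of N candidates to the target template.
    sim_to_distractors: (N, M) similarity of each candidate to M distractors.
    weights: (M,) importance of each distractor."""
    penalty = sim_to_distractors @ weights / (weights.sum() + 1e-8)
    return sim_to_target - alpha * penalty

scores = distractor_aware_score(np.array([0.9, 0.8, 0.7]),
                                np.array([[0.8, 0.1], [0.2, 0.1], [0.1, 0.0]]),
                                np.array([1.0, 0.5]))
print(scores.argmax())  # candidate 0 loses its lead once penalised; prints 1
```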

711 citations


Posted Content
Zheng Zhu1, Qiang Wang1, Bo Li2, Wei Wu2, Junjie Yan2, Weiming Hu1 
TL;DR: This paper focuses on learning distractor-aware Siamese networks for accurate and long-term tracking, and extends the proposed approach for long-term tracking by introducing a simple yet effective local-to-global search region strategy.
Abstract: Recently, Siamese networks have drawn great attention in the visual tracking community because of their balanced accuracy and speed. However, the features used in most Siamese tracking approaches can only discriminate the foreground from non-semantic backgrounds. Semantic backgrounds are always considered as distractors, which hinders the robustness of Siamese trackers. In this paper, we focus on learning distractor-aware Siamese networks for accurate and long-term tracking. To this end, the features used in traditional Siamese trackers are analyzed first. We observe that the imbalanced distribution of training data makes the learned features less discriminative. During the off-line training phase, an effective sampling strategy is introduced to control this distribution and make the model focus on the semantic distractors. During inference, a novel distractor-aware module is designed to perform incremental learning, which can effectively transfer the general embedding to the current video domain. In addition, we extend the proposed approach for long-term tracking by introducing a simple yet effective local-to-global search region strategy. Extensive experiments on benchmarks show that our approach significantly outperforms the state of the art, yielding a 9.6% relative gain on the VOT2016 dataset and a 35.9% relative gain on the UAV20L dataset. The proposed tracker can perform at 160 FPS on short-term benchmarks and 110 FPS on long-term benchmarks.

644 citations


Proceedings ArticleDOI
Wenjie Luo1, Bin Yang1, Raquel Urtasun1
18 Jun 2018
TL;DR: A novel deep neural network that is able to jointly reason about 3D detection, tracking and motion forecasting given data captured by a 3D sensor is proposed, which is very efficient in terms of both memory and computation.
Abstract: In this paper we propose a novel deep neural network that is able to jointly reason about 3D detection, tracking and motion forecasting given data captured by a 3D sensor. By jointly reasoning about these tasks, our holistic approach is more robust to occlusion as well as sparse data at range. Our approach performs 3D convolutions across space and time over a bird's eye view representation of the 3D world, which is very efficient in terms of both memory and computation. Our experiments on a new very large-scale dataset captured in several North American cities show that we can outperform the state of the art by a large margin. Importantly, by sharing computation we can perform all tasks in as little as 30 ms.
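The core operation the abstract describes, convolving jointly over space and time on a bird's-eye-view grid, can be pictured as a single 3D convolution over stacked occupancy maps. A minimal sketch with assumed shapes:

```python
# Minimal sketch (shapes are assumptions): a 3D convolution over a stack of
# bird's-eye-view grids from consecutive LiDAR sweeps, so space and time are
# convolved jointly.
import torch
import torch.nn as nn

bev = torch.randn(1, 1, 5, 144, 80)   # (batch, channels, time, x, y) occupancy
spatiotemporal_conv = nn.Conv3d(in_channels=1, out_channels=32,
                                kernel_size=(3, 5, 5), padding=(1, 2, 2))
print(spatiotemporal_conv(bev).shape)  # torch.Size([1, 32, 5, 144, 80])
```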

584 citations


Book ChapterDOI
08 Sep 2018
TL;DR: This work presents TrackingNet, the first large-scale dataset and benchmark for object tracking in the wild, which covers a wide selection of object classes in broad and diverse context and provides an extensive benchmark on TrackingNet by evaluating more than 20 trackers.
Abstract: Despite the numerous developments in object tracking, further improvement of current tracking algorithms is limited by small and mostly saturated datasets. As a matter of fact, data-hungry trackers based on deep learning currently rely on object detection datasets due to the scarcity of dedicated large-scale tracking datasets. In this work, we present TrackingNet, the first large-scale dataset and benchmark for object tracking in the wild. We provide more than 30K videos with more than 14 million dense bounding box annotations. Our dataset covers a wide selection of object classes in broad and diverse contexts. By releasing such a large-scale dataset, we expect deep trackers to further improve and generalize. In addition, we introduce a new benchmark composed of 500 novel videos, modeled with a distribution similar to our training dataset. By sequestering the annotation of the test set and providing an online evaluation server, we provide a fair benchmark for future development of object trackers. Deep trackers fine-tuned on a fraction of our dataset improve their performance by up to 1.6% on OTB100 and up to 1.7% on TrackingNet Test. We provide an extensive benchmark on TrackingNet by evaluating more than 20 trackers. Our results suggest that object tracking in the wild is far from being solved.

570 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: The VITAL algorithm uses a generative network to randomly generate masks, which are applied to adaptively drop out input features to capture a variety of appearance changes; the network identifies the mask that maintains the most robust features of the target objects over a long temporal span.
Abstract: The tracking-by-detection framework consists of two stages, i.e., drawing samples around the target object in the first stage and classifying each sample as the target object or as background in the second stage. The performance of existing trackers using deep classification networks is limited by two aspects. First, the positive samples in each frame are highly spatially overlapped, and they fail to capture rich appearance variations. Second, there exists extreme class imbalance between positive and negative samples. This paper presents the VITAL algorithm to address these two problems via adversarial learning. To augment positive samples, we use a generative network to randomly generate masks, which are applied to adaptively dropout input features to capture a variety of appearance changes. With the use of adversarial learning, our network identifies the mask that maintains the most robust features of the target objects over a long temporal span. In addition, to handle the issue of class imbalance, we propose a high-order cost sensitive loss to decrease the effect of easy negative samples to facilitate training the classification network. Extensive experiments on benchmark datasets demonstrate that the proposed tracker performs favorably against state-of-the-art approaches.
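The exact high-order cost-sensitive loss is defined in the paper; the sketch below only illustrates the general idea of down-weighting easy negative samples so they contribute little to the gradient (a focal-style modulation with hypothetical names, not the authors' formulation):

```python
# Sketch of the idea only: easy samples (confidently classified) receive a
# tiny weight, so training focuses on hard negatives.
import torch

def cost_sensitive_nll(pred_prob, labels, order=2):
    """pred_prob: (N, 2) softmax outputs; labels: (N,) in {0, 1} (1 = target)."""
    p_correct = pred_prob.gather(1, labels.unsqueeze(1)).squeeze(1)
    weight = (1.0 - p_correct) ** order          # easy samples -> tiny weight
    return -(weight * torch.log(p_correct + 1e-8)).mean()

probs = torch.tensor([[0.95, 0.05], [0.30, 0.70]])   # easy negative, hard positive
labels = torch.tensor([0, 1])
print(cost_sensitive_nll(probs, labels))
```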

539 citations


Book ChapterDOI
08 Sep 2018
TL;DR: A novel triplet loss is proposed to extract expressive deep feature for object tracking by adding it into Siamese network framework instead of pairwise loss for training.
Abstract: Object tracking is still a critical and challenging problem with many applications in computer vision. For this challenge, more and more researchers pay attention to applying deep learning to obtain powerful features for better tracking accuracy. In this paper, a novel triplet loss is proposed to extract expressive deep features for object tracking by adding it into the Siamese network framework instead of a pairwise loss for training. Without adding any inputs, our approach is able to utilize more elements for training to achieve a more powerful feature via combinations of original samples. Furthermore, we propose a theoretical analysis, combining a comparison of gradients and back-propagation, to prove the effectiveness of our method. In experiments, we apply the proposed triplet loss to three real-time trackers based on the Siamese network. The results on several popular tracking benchmarks show that our variants operate at almost the same frame-rate as the baseline trackers and achieve superior tracking performance compared with them, as well as accuracy comparable to recent state-of-the-art real-time trackers.
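For intuition, a standard margin-based triplet loss over Siamese embeddings looks like the sketch below (the paper derives its own variant of the loss; this is only the common form of the idea):

```python
# Minimal sketch of a margin-based triplet loss over embedding vectors.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Encourage d(anchor, positive) + margin < d(anchor, negative)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

emb = lambda n: torch.randn(n, 128)   # stand-in for Siamese-network embeddings
print(triplet_loss(emb(4), emb(4), emb(4)))
```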

506 citations


Posted Content
TL;DR: LaSOT, as discussed by the authors, is a high-quality benchmark for large-scale single object tracking, consisting of 1,400 sequences with more than 3.5M frames in total.
Abstract: In this paper, we present LaSOT, a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box, making LaSOT the largest, to the best of our knowledge, densely annotated tracking benchmark. The average video length of LaSOT is more than 2,500 frames, and each sequence comprises various challenges deriving from the wild, where target objects may disappear and reappear in the view. By releasing LaSOT, we expect to provide the community with a large-scale dedicated benchmark with high quality for both the training of deep trackers and the veritable evaluation of tracking algorithms. Moreover, considering the close connections of visual appearance and natural language, we enrich LaSOT by providing additional language specification, aiming at encouraging the exploration of natural linguistic features for tracking. A thorough experimental evaluation of 35 tracking algorithms on LaSOT is presented with detailed analysis, and the results demonstrate that there is still significant room for improvement.

501 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: A Residual Attentional Siamese Network (RASNet) for high performance object tracking that not only mitigates the over-fitting problem in deep network training, but also enhances its discriminative capacity and adaptability due to the separation of representation learning and discriminator learning.
Abstract: Offline training for object tracking has recently shown great potential in balancing tracking accuracy and speed. However, it is still difficult to adapt an offline trained model to a target tracked online. This work presents a Residual Attentional Siamese Network (RASNet) for high performance object tracking. The RASNet model reformulates the correlation filter within a Siamese tracking framework, and introduces different kinds of attention mechanisms to adapt the model without updating it online. In particular, by exploiting the offline trained general attention, the target adapted residual attention, and the channel favored feature attention, the RASNet not only mitigates the over-fitting problem in deep network training, but also enhances its discriminative capacity and adaptability due to the separation of representation learning and discriminator learning. The proposed deep architecture is trained from end to end and takes full advantage of the rich spatial temporal information to achieve robust visual tracking. Experimental results on two recent benchmarks, OTB-2015 and VOT2017, show that the RASNet tracker achieves state-of-the-art tracking accuracy while running at more than 80 frames per second.

499 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: This work proposes a novel approach for the synthetic generation of training data that is based on a geometrically consistent image-to-image translation network, and uses a neural network that translates synthetic images to "real" images, such that the so-generated images follow the same statistical distribution as real-world hand images.
Abstract: We address the highly challenging problem of real-time 3D hand tracking based on a monocular RGB-only sequence. Our tracking method combines a convolutional neural network with a kinematic 3D hand model, such that it generalizes well to unseen data, is robust to occlusions and varying camera viewpoints, and leads to anatomically plausible as well as temporally smooth hand motions. For training our CNN we propose a novel approach for the synthetic generation of training data that is based on a geometrically consistent image-to-image translation network. To be more specific, we use a neural network that translates synthetic images to "real" images, such that the so-generated images follow the same statistical distribution as real-world hand images. For training this translation network we combine an adversarial loss and a cycle-consistency loss with a geometric consistency loss in order to preserve geometric properties (such as hand pose) during translation. We demonstrate that our hand tracking system outperforms the current state-of-the-art on challenging RGB-only footage.

484 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: The proposed SA-Siam introduces a channel attention mechanism for the semantic branch and outperforms all other real-time trackers by a large margin on the OTB-2013/50/100 benchmarks.
Abstract: Observing that Semantic features learned in an image classification task and Appearance features learned in a similarity matching task complement each other, we build a twofold Siamese network, named SA-Siam, for real-time object tracking. SA-Siam is composed of a semantic branch and an appearance branch. Each branch is a similarity-learning Siamese network. An important design choice in SA-Siam is to separately train the two branches to keep the heterogeneity of the two types of features. In addition, we propose a channel attention mechanism for the semantic branch. Channel-wise weights are computed according to the channel activations around the target position. While the inherited architecture from SiamFC [3] allows our tracker to operate beyond real-time, the twofold design and the attention mechanism significantly improve the tracking performance. The proposed SA-Siam outperforms all other real-time trackers by a large margin on OTB-2013/50/100 benchmarks.

Book ChapterDOI
08 Sep 2018
TL;DR: In this article, a new unconstrained UAV benchmark dataset is proposed for object detection, single object tracking, and multiple object tracking with new levels of challenge, including high density, small objects, and camera motion, and a detailed quantitative study is performed using the most recent state-of-the-art algorithms for each task.
Abstract: With the advantage of high mobility, Unmanned Aerial Vehicles (UAVs) are used to fuel numerous important applications in computer vision, delivering more efficiency and convenience than surveillance cameras with fixed camera angle, scale and view. However, very limited UAV datasets have been proposed, and they focus only on a specific task such as visual tracking or object detection in relatively constrained scenarios. Consequently, it is of great importance to develop an unconstrained UAV benchmark to boost related research. In this paper, we construct a new UAV benchmark focusing on complex scenarios with new levels of challenge. Selected from 10 hours of raw video, about 80,000 representative frames are fully annotated with bounding boxes as well as up to 14 kinds of attributes (e.g., weather condition, flying altitude, camera view, vehicle category, and occlusion) for three fundamental computer vision tasks: object detection, single object tracking, and multiple object tracking. Then, a detailed quantitative study is performed using the most recent state-of-the-art algorithms for each task. Experimental results show that the current state-of-the-art methods perform relatively worse on our dataset, due to the new challenges that appear in UAV-based real scenes, e.g., high density, small objects, and camera motion. To our knowledge, our work is the first to explore such issues in unconstrained scenes comprehensively. The dataset and all the experimental results are available at https://sites.google.com/site/daviddo0323/.

Proceedings ArticleDOI
01 Jun 2018
TL;DR: PoseTrack is a new large-scale benchmark for video-based human pose estimation and articulated tracking that conducts an extensive experimental study on recent approaches to articulated pose tracking and provides analysis of the strengths and weaknesses of the state of the art.
Abstract: Existing systems for video-based pose estimation and tracking struggle to perform well on realistic videos with multiple people and often fail to output body-pose trajectories consistent over time. To address this shortcoming this paper introduces PoseTrack which is a new large-scale benchmark for video-based human pose estimation and articulated tracking. Our new benchmark encompasses three tasks focusing on i) single-frame multi-person pose estimation, ii) multi-person pose estimation in videos, and iii) multi-person articulated tracking. To establish the benchmark, we collect, annotate and release a new dataset that features videos with multiple people labeled with person tracks and articulated pose. A public centralized evaluation server is provided to allow the research community to evaluate on a held-out test set. Furthermore, we conduct an extensive experimental study on recent approaches to articulated pose tracking and provide analysis of the strengths and weaknesses of the state of the art. We envision that the proposed benchmark will stimulate productive research both by providing a large and representative training dataset as well as providing a platform to objectively evaluate and compare the proposed methods. The benchmark is freely accessible at https://posetrack.net/.

Book ChapterDOI
08 Sep 2018
TL;DR: This paper introduces a cost-sensitive tracking loss based on the state-of-the-art visual tracker which encourages the model to focus on hard negative distractors during online learning and proposes Dual Matching Attention Networks (DMAN) with both spatial and temporal attention mechanisms.
Abstract: In this paper, we propose an online Multi-Object Tracking (MOT) approach which integrates the merits of single object tracking and data association methods in a unified framework to handle noisy detections and frequent interactions between targets. Specifically, for applying single object tracking in MOT, we introduce a cost-sensitive tracking loss based on the state-of-the-art visual tracker, which encourages the model to focus on hard negative distractors during online learning. For data association, we propose Dual Matching Attention Networks (DMAN) with both spatial and temporal attention mechanisms. The spatial attention module generates dual attention maps which enable the network to focus on the matching patterns of the input image pair, while the temporal attention module adaptively allocates different levels of attention to different samples in the tracklet to suppress noisy observations. Experimental results on the MOT benchmark datasets show that the proposed algorithm performs favorably against both online and offline trackers in terms of identity-preserving metrics.

Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, the authors propose an adaptive fusion approach that leverages the complementary properties of both deep and shallow features to improve both robustness and accuracy, which significantly outperforms the top performing tracker from the challenge with a relative gain of 17% in EAO.
Abstract: In the field of generic object tracking numerous attempts have been made to exploit deep features. Despite all expectations, deep trackers are yet to reach an outstanding level of performance compared to methods solely based on handcrafted features. In this paper, we investigate this key issue and propose an approach to unlock the true potential of deep features for tracking. We systematically study the characteristics of both deep and shallow features, and their relation to tracking accuracy and robustness. We identify the limited data and low spatial resolution as the main challenges, and propose strategies to counter these issues when integrating deep features for tracking. Furthermore, we propose a novel adaptive fusion approach that leverages the complementary properties of deep and shallow features to improve both robustness and accuracy. Extensive experiments are performed on four challenging datasets. On VOT2017, our approach significantly outperforms the top performing tracker from the challenge with a relative gain of 17% in EAO.

Posted Content
TL;DR: This paper proposes a novel adaptive fusion approach that leverages the complementary properties of deep and shallow features to improve both robustness and accuracy in generic object tracking.
Abstract: In the field of generic object tracking numerous attempts have been made to exploit deep features. Despite all expectations, deep trackers are yet to reach an outstanding level of performance compared to methods solely based on handcrafted features. In this paper, we investigate this key issue and propose an approach to unlock the true potential of deep features for tracking. We systematically study the characteristics of both deep and shallow features, and their relation to tracking accuracy and robustness. We identify the limited data and low spatial resolution as the main challenges, and propose strategies to counter these issues when integrating deep features for tracking. Furthermore, we propose a novel adaptive fusion approach that leverages the complementary properties of deep and shallow features to improve both robustness and accuracy. Extensive experiments are performed on four challenging datasets. On VOT2017, our approach significantly outperforms the top performing tracker from the challenge with a relative gain of 17% in EAO.

Posted Content
TL;DR: A large-scale visual object detection and tracking benchmark, named VisDrone2018, aiming at advancing visual understanding tasks on the drone platform, with more than 2.5 million annotated instances in 179,264 images/video frames, being the largest such dataset ever published.
Abstract: In this paper we present a large-scale visual object detection and tracking benchmark, named VisDrone2018, aiming at advancing visual understanding tasks on the drone platform. The images and video sequences in the benchmark were captured over various urban/suburban areas of 14 different cities across China from north to south. Specifically, VisDrone2018 consists of 263 video clips and 10,209 images (no overlap with video clips) with rich annotations, including object bounding boxes, object categories, occlusion, truncation ratios, etc. With an intensive amount of effort, our benchmark has more than 2.5 million annotated instances in 179,264 images/video frames. Being the largest such dataset ever published, the benchmark enables extensive evaluation and investigation of visual analysis algorithms on the drone platform. In particular, we design four popular tasks with the benchmark, including object detection in images, object detection in videos, single object tracking, and multi-object tracking. All these tasks are extremely challenging in the proposed dataset due to factors such as occlusion, large scale and pose variation, and fast motion. We hope the benchmark largely boosts the research and development of visual analysis on drone platforms.

Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, a dynamic memory network is proposed to adapt the template to the target's appearance variations during tracking, where an LSTM is used as a memory controller, where the input is the search feature map and the outputs are the control signals for the reading and writing process of the memory block.
Abstract: Template-matching methods for visual tracking have gained popularity recently due to their comparable performance and fast speed. However, they lack effective ways to adapt to changes in the target object’s appearance, making their tracking accuracy still far from state-of-the-art. In this paper, we propose a dynamic memory network to adapt the template to the target’s appearance variations during tracking. An LSTM is used as a memory controller, where the input is the search feature map and the outputs are the control signals for the reading and writing process of the memory block. As the location of the target is at first unknown in the search feature map, an attention mechanism is applied to concentrate the LSTM input on the potential target. To prevent aggressive model adaptivity, we apply gated residual template learning to control the amount of retrieved memory that is used to combine with the initial template. Unlike tracking-by-detection methods where the object’s information is maintained by the weight parameters of neural networks, which requires expensive online fine-tuning to be adaptable, our tracker runs completely feed-forward and adapts to the target’s appearance changes by updating the external memory. Moreover, unlike other tracking methods where the model capacity is fixed after offline training, the capacity of our tracker can be easily enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on OTB and VOT demonstrate that our tracker MemTrack performs favorably against state-of-the-art tracking methods while retaining a real-time speed of 50 fps.
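The gated residual template update the abstract mentions can be pictured as adding a gated fraction of the memory read-out on top of the fixed initial template. A rough sketch with assumed tensor shapes (not the authors' code):

```python
# Sketch under assumptions: conservative template adaptation by gating the
# retrieved memory before adding it to the fixed initial template.
import torch

def gated_residual_template(initial_template, retrieved_template, gate):
    """All tensors (C, H, W); gate in [0, 1] controls how much retrieved
    memory is added on top of the fixed initial template."""
    return initial_template + gate * retrieved_template

gate = torch.sigmoid(torch.randn(256, 1, 1))      # channel-wise residual gate
new_t = gated_residual_template(torch.randn(256, 6, 6), torch.randn(256, 6, 6), gate)
print(new_t.shape)  # torch.Size([256, 6, 6])
```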

Proceedings ArticleDOI
01 Mar 2018
TL;DR: This work proposes the Recurrent Autoregressive Network (RAN), a temporal generative modeling framework to characterize the appearance and motion dynamics of multiple objects over time and achieves top-ranked results on the two benchmarks.
Abstract: The main challenge of online multi-object tracking is to reliably associate object trajectories with detections in each video frame based on their tracking history. In this work, we propose the Recurrent Autoregressive Network (RAN), a temporal generative modeling framework to characterize the appearance and motion dynamics of multiple objects over time. The RAN couples an external memory and an internal memory. The external memory explicitly stores previous inputs of each trajectory in a time window, while the internal memory learns to summarize long-term tracking history and associate detections by processing the external memory. We conduct experiments on the MOT 2015 and 2016 datasets to demonstrate the robustness of our tracking method in highly crowded and occluded scenes. Our method achieves top-ranked results on the two benchmarks.

Journal ArticleDOI
TL;DR: This paper defines the tracklet confidence using the detectability and continuity of a tracklet, decomposes a multi-object tracking problem into small subproblems based on the tracklet confidence, and solves the online multi-object tracking problem by associating tracklets and detections in different ways according to their confidence values.
Abstract: Online multi-object tracking aims at estimating the tracks of multiple objects instantly with each incoming frame and the information provided up to the moment. It still remains a difficult problem in complex scenes, because of the large ambiguity in associating multiple objects in consecutive frames and the low discriminability between objects' appearances. In this paper, we propose a robust online multi-object tracking method that can handle these difficulties effectively. We first define the tracklet confidence using the detectability and continuity of a tracklet, and decompose a multi-object tracking problem into small subproblems based on the tracklet confidence. We then solve the online multi-object tracking problem by associating tracklets and detections in different ways according to their confidence values. Based on this strategy, tracklets sequentially grow with online-provided detections, and fragmented tracklets are linked up with others without any iterative and expensive association steps. For more reliable association between tracklets and detections, we also propose a deep appearance learning method to learn a discriminative appearance model from large training datasets, since conventional appearance learning methods do not provide rich representations that can distinguish multiple objects with large appearance variations. In addition, we combine online transfer learning for improving appearance discriminability by adapting the pre-trained deep model during online tracking. Experiments with challenging public datasets show distinct performance improvement over other state-of-the-art batch and online tracking methods, and demonstrate the effectiveness and usefulness of the proposed methods for online multi-object tracking.
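A toy rendering of the tracklet-confidence idea (the precise definition is in the paper): confidence combines how well a tracklet's detections matched (detectability) with a penalty for missed frames (continuity), and then selects between local and global association:

```python
# Hedged sketch: names and the exact functional form are illustrative only.
import math

def tracklet_confidence(affinities, length, num_missing, beta=1.2):
    detectability = sum(affinities) / max(len(affinities), 1)
    continuity = math.exp(-beta * num_missing / max(length, 1))
    return detectability * continuity

high = tracklet_confidence([0.9, 0.85, 0.95], length=30, num_missing=1)
low  = tracklet_confidence([0.5, 0.4],        length=10, num_missing=6)
# High-confidence tracklets are associated with detections locally;
# low-confidence ones are linked globally to other tracklets.
print(high, low)
```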

Proceedings ArticleDOI
01 Jun 2018
TL;DR: This paper proposes DoubleFusion, which combines volumetric dynamic reconstruction with data-driven template fitting to simultaneously reconstruct detailed geometry, non-rigid motion and the inner human body shape from a single depth camera.
Abstract: We propose DoubleFusion, a new real-time system that combines volumetric dynamic reconstruction with data-driven template fitting to simultaneously reconstruct detailed geometry, non-rigid motion and the inner human body shape from a single depth camera. One of the key contributions of this method is a double layer representation consisting of a complete parametric body shape inside, and a gradually fused outer surface layer. A pre-defined node graph on the body surface parameterizes the non-rigid deformations near the body, and a free-form dynamically changing graph parameterizes the outer surface layer far from the body, which allows more general reconstruction. We further propose a joint motion tracking method based on the double layer representation to enable robust and fast motion tracking performance. Moreover, the inner body shape is optimized online and forced to fit inside the outer surface layer. Overall, our method enables increasingly denoised, detailed and complete surface reconstructions, fast motion tracking performance and plausible inner body shape reconstruction in real-time. In particular, experiments show improved fast motion tracking and loop closure performance on more challenging scenarios.

Proceedings ArticleDOI
01 Oct 2018
TL;DR: MaskFusion is a real-time, object-aware, semantic and dynamic RGB-D SLAM system that goes beyond traditional systems, which output a purely geometric map of a static scene.
Abstract: We present MaskFusion, a real-time, object-aware, semantic and dynamic RGB-D SLAM system that goes beyond traditional systems which output a purely geometric map of a static scene. MaskFusion recognizes, segments and assigns semantic class labels to different objects in the scene, while tracking and reconstructing them even when they move independently from the camera. As an RGB-D camera scans a cluttered scene, image-based instance-level semantic segmentation creates semantic object masks that enable realtime object recognition and the creation of an object-level representation for the world map. Unlike previous recognition-based SLAM systems, MaskFusion does not require known models of the objects it can recognize, and can deal with multiple independent motions. MaskFusion takes full advantage of using instance-level semantic segmentation to enable semantic labels to be fused into an object-aware map, unlike recent semantics enabled SLAM systems that perform voxel-level semantic segmentation. We show augmented-reality applications that demonstrate the unique features of the map output by MaskFusion: instance-aware, semantic and dynamic. Code will be made available.

Book ChapterDOI
08 Sep 2018
TL;DR: A novel recurrent network model, the Bilinear LSTM, is proposed in order to improve the learning of long-term appearance models via a recurrent network based on intuitions drawn from recursive least squares.
Abstract: In recent deep online and near-online multi-object tracking approaches, a difficulty has been to incorporate long-term appearance models to efficiently score object tracks under severe occlusion and multiple missing detections. In this paper, we propose a novel recurrent network model, the Bilinear LSTM, in order to improve the learning of long-term appearance models via a recurrent network. Based on intuitions drawn from recursive least squares, Bilinear LSTM stores building blocks of a linear predictor in its memory, which is then coupled with the input in a multiplicative manner, instead of the additive coupling in conventional LSTM approaches. Such coupling resembles an online learned classifier/regressor at each time step, which we have found to improve performances in using LSTM for appearance modeling. We also propose novel data augmentation approaches to efficiently train recurrent models that score object tracks on both appearance and motion. We train an LSTM that can score object tracks based on both appearance and motion and utilize it in a multiple hypothesis tracking framework. In experiments, we show that with our novel LSTM model, we achieved state-of-the-art performance on near-online multiple object tracking on the MOT 2016 and MOT 2017 benchmarks.
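The multiplicative coupling that distinguishes Bilinear LSTM from a conventional LSTM can be sketched as reshaping the hidden state into a matrix that acts as a linear predictor on the new appearance feature. A minimal illustration (dimensions and module names are assumptions):

```python
# Minimal sketch of the multiplicative coupling: the memory (hidden state) is
# reshaped into a matrix and applied to the input, rather than added to it.
import torch
import torch.nn as nn

class BilinearCoupling(nn.Module):
    def __init__(self, feat_dim=256, rows=8):
        super().__init__()
        self.rows, self.feat_dim = rows, feat_dim
        self.lstm = nn.LSTMCell(feat_dim, rows * feat_dim)

    def forward(self, x, state):
        h, c = self.lstm(x, state)                        # h: (B, rows*feat_dim)
        predictor = h.view(-1, self.rows, self.feat_dim)  # memory as linear predictor
        out = torch.relu(torch.bmm(predictor, x.unsqueeze(2))).squeeze(2)
        return out, (h, c)

m = BilinearCoupling()
x = torch.randn(4, 256)
state = (torch.zeros(4, 8 * 256), torch.zeros(4, 8 * 256))
out, state = m(x, state)
print(out.shape)  # torch.Size([4, 8])
```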

Journal ArticleDOI
TL;DR: A new feature extractor, Bi-Weighted Oriented Optical Flow (Bi-WOOF), is proposed to encode the essential expressiveness of the apex frame of a video, with the proposed technique achieving state-of-the-art F1-score recognition performance.
Abstract: Despite recent interest and advances in facial micro-expression research, there is still plenty of room for improvement in terms of micro-expression recognition. Conventional feature extraction approaches for micro-expression video consider either the whole video sequence or a part of it, for representation. However, with the high-speed video capture of micro-expressions (100–200 fps), are all frames necessary to provide a sufficiently meaningful representation? Is the luxury of data a bane to accurate recognition? A novel proposition is presented in this paper, whereby we utilize only two images per video, namely, the apex frame and the onset frame. The apex frame of a video contains the highest intensity of expression changes among all frames, while the onset is the perfect choice of a reference frame with neutral expression. A new feature extractor, Bi-Weighted Oriented Optical Flow (Bi-WOOF) is proposed to encode essential expressiveness of the apex frame. We evaluated the proposed method on five micro-expression databases—CAS(ME)^2, CASME II, SMIC-HS, SMIC-NIR and SMIC-VIS. Our experiments lend credence to our hypothesis, with our proposed technique achieving a state-of-the-art F1-score recognition performance of 0.61 and 0.62 in the high frame rate CASME II and SMIC-HS databases respectively.
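The weighting idea behind Bi-WOOF can be illustrated, in much-simplified form, as a magnitude-weighted orientation histogram of the optical flow between the onset and apex frames (this toy sketch omits the bi-weighting and block structure of the full descriptor):

```python
# Toy sketch only, not the full Bi-WOOF descriptor: stronger flow vectors
# contribute more to the orientation histogram.
import numpy as np

def weighted_orientation_histogram(flow, bins=8):
    """flow: (H, W, 2) displacement field from the onset frame to the apex frame."""
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    magnitude = np.hypot(dx, dy)
    orientation = np.arctan2(dy, dx)                    # in [-pi, pi]
    hist, _ = np.histogram(orientation, bins=bins, range=(-np.pi, np.pi),
                           weights=magnitude)
    return hist / (hist.sum() + 1e-8)

print(weighted_orientation_histogram(np.random.randn(64, 64, 2)))
```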

Proceedings ArticleDOI
01 Jun 2018
TL;DR: This paper proposes FlowTrack, which makes use of the rich flow information in consecutive frames to improve the feature representation and the tracking accuracy by combining optical flow estimation, feature extraction, aggregation and correlation filter tracking.
Abstract: Discriminative correlation filters (DCF) with deep convolutional features have achieved favorable performance in recent tracking benchmarks. However, most existing DCF trackers only consider appearance features of the current frame, and hardly benefit from motion and inter-frame information. The lack of temporal information degrades the tracking performance during challenges such as partial occlusion and deformation. In this paper, we propose the FlowTrack, which focuses on making use of the rich flow information in consecutive frames to improve the feature representation and the tracking accuracy. The FlowTrack formulates individual components, including optical flow estimation, feature extraction, aggregation and correlation filter tracking, as special layers in the network. To the best of our knowledge, this is the first work to jointly train the flow and tracking tasks in a deep learning framework. The historical feature maps at predefined intervals are then warped and aggregated with the current ones under the guidance of flow. For adaptive aggregation, we propose a novel spatial-temporal attention mechanism. In experiments, the proposed method achieves leading performance on OTB2013, OTB2015, VOT2015 and VOT2016.
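The flow-guided aggregation step can be pictured as warping a historical feature map to the current frame with a flow field and blending it with the current features. A rough sketch using bilinear sampling (not the authors' implementation; the real method learns the aggregation weights with spatial-temporal attention):

```python
# Rough sketch: warp a past feature map by optical flow, then blend it with
# the current features using a fixed weight standing in for learned attention.
import torch
import torch.nn.functional as F

def warp_by_flow(feat, flow):
    """feat: (1, C, H, W); flow: (1, 2, H, W) in pixels (dx, dy)."""
    _, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (W - 1) * 2 - 1     # normalise to [-1, 1]
    grid_y = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)     # (1, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

feat_prev, feat_cur = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
flow = torch.zeros(1, 2, 32, 32)                     # identity flow for the demo
aggregated = 0.6 * feat_cur + 0.4 * warp_by_flow(feat_prev, flow)
print(aggregated.shape)  # torch.Size([1, 64, 32, 32])
```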

Journal ArticleDOI
TL;DR: This paper proposes a new discriminative correlation filter (DCF) based tracking method with adaptive spatial feature selection and temporal consistency constraints, with which the new tracker enables joint spatial-temporal filter learning.
Abstract: With efficient appearance learning models, Discriminative Correlation Filter (DCF) has been proven to be very successful in recent video object tracking benchmarks and competitions. However, the existing DCF paradigm suffers from two major issues, i.e., spatial boundary effect and temporal filter degradation. To mitigate these challenges, we propose a new DCF-based tracking method. The key innovations of the proposed method include adaptive spatial feature selection and temporal consistent constraints, with which the new tracker enables joint spatial-temporal filter learning in a lower dimensional discriminative manifold. More specifically, we apply structured spatial sparsity constraints to multi-channel filters. Consequently, the process of learning spatial filters can be approximated by the lasso regularisation. To encourage temporal consistency, the filter model is restricted to lie around its historical value and updated locally to preserve the global structure in the manifold. Last, a unified optimisation framework is proposed to jointly select temporal consistency preserving spatial features and learn discriminative filters with the augmented Lagrangian method. Qualitative and quantitative evaluations have been conducted on a number of well-known benchmarking datasets such as OTB2013, OTB50, OTB100, Temple-Colour, UAV123 and VOT2018. The experimental results demonstrate the superiority of the proposed method over the state-of-the-art approaches.
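A schematic form of the objective described above, with illustrative notation only (the paper's exact formulation embeds the feature-selection variables differently), combines a ridge-regression data term, a lasso-style spatial sparsity term, and a temporal consistency term tying the filter to its previous value:

```latex
\min_{\mathbf{f}}\;
\Big\| \mathbf{y} - \sum_{c=1}^{C} \mathbf{x}_c \ast \mathbf{f}_c \Big\|_2^2
\;+\; \lambda_1 \sum_{c=1}^{C} \| \mathbf{f}_c \|_1
\;+\; \lambda_2 \, \| \mathbf{f} - \mathbf{f}_{t-1} \|_2^2
```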

Journal ArticleDOI
TL;DR: Results show that the proposed Hybrid SCA-DE-based tracker can track an arbitrary target more robustly in various challenging conditions than the other trackers and is very competitive compared to the state-of-the-art metaheuristic algorithms.

Proceedings Article
Chenglong Li1, Xinyan Liang1, Yijuan Lu2, Nan Zhao1, Jin Tang1 
23 May 2018
TL;DR: Li et al. propose a novel graph-based approach to learn a robust object representation for RGB-T tracking, in which the tracked object is represented with a graph with image patches as nodes.
Abstract: RGB-Thermal (RGB-T) object tracking receives more and more attention due to the strongly complementary benefits of thermal information to visible data. However, RGB-T research is limited by the lack of a comprehensive evaluation platform. In this paper, we propose a large-scale video benchmark dataset for RGB-T tracking. It has three major advantages over existing ones: 1) Its size is sufficiently large for large-scale performance evaluation (total frame number: 234K, maximum frames per sequence: 8K). 2) The alignment between RGB-T sequence pairs is highly accurate, which does not need pre- or post-processing. 3) The occlusion levels are annotated for occlusion-sensitive performance analysis of different tracking algorithms. Moreover, we propose a novel graph-based approach to learn a robust object representation for RGB-T tracking. In particular, the tracked object is represented with a graph with image patches as nodes. This graph, including graph structure, node weights and edge weights, is dynamically learned in a unified ADMM (alternating direction method of multipliers)-based optimization framework, in which the modality weights are also incorporated for adaptive fusion of multiple source data. Extensive experiments on the large-scale dataset are executed to demonstrate the effectiveness of the proposed tracker against other state-of-the-art tracking methods. We also provide new insights and potential research directions to the field of RGB-T object tracking.

Journal ArticleDOI
TL;DR: This paper presents a review of the digital video watermarking techniques in which their applications, challenges, and important properties are discussed, and categorizes them based on the domain in which they embed the watermark.
Abstract: The illegal distribution of a digital movie is a common and significant threat to the film industry. With the advent of high-speed broadband Internet access, a pirated copy of a digital video can now be easily distributed to a global audience. A possible means of limiting this type of digital theft is digital video watermarking whereby additional information, called a watermark, is embedded in the host video. This watermark can be extracted at the decoder and used to determine whether the video content is watermarked. This paper presents a review of the digital video watermarking techniques in which their applications, challenges, and important properties are discussed, and categorizes them based on the domain in which they embed the watermark. It then provides an overview of a few emerging innovative solutions using watermarks. Protecting a 3D video by watermarking is an emerging area of research. The relevant 3D video watermarking techniques in the literature are classified based on the image-based representations of a 3D video in stereoscopic, depth-image-based rendering, and multi-view video watermarking. We discuss each technique, and then present a survey of the literature. Finally, we provide a summary of this paper and propose some future research directions.

Journal ArticleDOI
TL;DR: A survey of the latest methods of moving object detection in video sequences captured by a moving camera is presented, along with the main methods that propose improvements to the general concepts of these techniques.