Online Multi-object Tracking Using CNN-Based Single Object Tracker with Spatial-Temporal Attention Mechanism
Qi Chu,Wanli Ouyang,Hongsheng Li,Xiaogang Wang,Bin Liu,Nenghai Yu +5 more
- pp 4846-4855
Reads0
Chats0
TLDR
Zhang et al. as mentioned in this paper proposed a spatial-temporal attention mechanism (STAM) to handle the drift caused by occlusion and interaction among targets, which can be considered as temporal attention mechanism.Abstract:
In this paper, we propose a CNN-based framework for online MOT. This framework utilizes the merits of single object trackers in adapting appearance models and searching for target in the next frame. Simply applying single object tracker for MOT will encounter the problem in computational efficiency and drifted results caused by occlusion. Our framework achieves computational efficiency by sharing features and using ROI-Pooling to obtain individual features for each target. Some online learned target-specific CNN layers are used for adapting the appearance model for each target. In the framework, we introduce spatial-temporal attention mechanism (STAM) to handle the drift caused by occlusion and interaction among targets. The visibility map of the target is learned and used for inferring the spatial attention map. The spatial attention map is then applied to weight the features. Besides, the occlusion status can be estimated from the visibility map, which controls the online updating process via weighted loss on training samples with different occlusion statuses in different frames. It can be considered as temporal attention mechanism. The proposed algorithm achieves 34.3% and 46.0% in MOTA on challenging MOT15 and MOT16 benchmark dataset respectively.read more
Citations
More filters
Journal ArticleDOI
GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild
TL;DR: A large tracking database that offers an unprecedentedly wide coverage of common moving objects in the wild, called GOT-10k, and the first video trajectory dataset that uses the semantic hierarchy of WordNet to guide class population, which ensures a comprehensive and relatively unbiased coverage of diverse moving objects.
Proceedings ArticleDOI
Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking
TL;DR: A Residual Attentional Siamese Network (RASNet) for high performance object tracking that mitigates the over-fitting problem in deep network training, but also enhances its discriminative capacity and adaptability due to the separation of representation learning and discriminator learning.
Journal ArticleDOI
Deep learning in video multi-object tracking: A survey
Gioele Ciaparrone,Gioele Ciaparrone,Francisco Luque Sánchez,Siham Tabik,Luigi Troiano,Roberto Tagliaferri,Francisco Herrera +6 more
TL;DR: A comprehensive survey on works that employ Deep Learning models to solve the task of MOT on single-camera videos, identifying a number of similarities among the top-performing methods and presenting some possible future research directions.
Book ChapterDOI
Online Multi-Object Tracking with Dual Matching Attention Networks
TL;DR: This paper introduces a cost-sensitive tracking loss based on the state-of-the-art visual tracker which encourages the model to focus on hard negative distractors during online learning and proposes Dual Matching Attention Networks (DMAN) with both spatial and temporal attention mechanisms.
Proceedings ArticleDOI
Real-Time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-Identification
TL;DR: Zhang et al. as discussed by the authors proposed to handle unreliable detection by collecting candidates from outputs of both detection and tracking, and applied optimal selection from a considerable amount of candidates in real-time, and presented a novel scoring function based on a fully convolutional neural network.
References
More filters
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan,Andrew Zisserman +1 more
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan,Andrew Zisserman +1 more
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings ArticleDOI
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings ArticleDOI
Histograms of oriented gradients for human detection
Navneet Dalal,Bill Triggs +1 more
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.