Online Multi-object Tracking Using CNN-Based Single Object Tracker with Spatial-Temporal Attention Mechanism

doi:10.1109/ICCV.2017.518

Open AccessProceedings ArticleDOI

Online Multi-object Tracking Using CNN-Based Single Object Tracker with Spatial-Temporal Attention Mechanism

Qi Chu, +5 more

- pp 4846-4855

Chats0

TLDR

Zhang et al. as mentioned in this paper proposed a spatial-temporal attention mechanism (STAM) to handle the drift caused by occlusion and interaction among targets, which can be considered as temporal attention mechanism.

Abstract:

In this paper, we propose a CNN-based framework for online MOT. This framework utilizes the merits of single object trackers in adapting appearance models and searching for target in the next frame. Simply applying single object tracker for MOT will encounter the problem in computational efficiency and drifted results caused by occlusion. Our framework achieves computational efficiency by sharing features and using ROI-Pooling to obtain individual features for each target. Some online learned target-specific CNN layers are used for adapting the appearance model for each target. In the framework, we introduce spatial-temporal attention mechanism (STAM) to handle the drift caused by occlusion and interaction among targets. The visibility map of the target is learned and used for inferring the spatial attention map. The spatial attention map is then applied to weight the features. Besides, the occlusion status can be estimated from the visibility map, which controls the online updating process via weighted loss on training samples with different occlusion statuses in different frames. It can be considered as temporal attention mechanism. The proposed algorithm achieves 34.3% and 46.0% in MOTA on challenging MOT15 and MOT16 benchmark dataset respectively.

Citations

PDF

Open Access

More filters

Journal ArticleDOI

GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild

Lianghua Huang, +2 more

- 01 May 2021 -

IEEE Transactions on Pattern Analysis an...

TL;DR: A large tracking database that offers an unprecedentedly wide coverage of common moving objects in the wild, called GOT-10k, and the first video trajectory dataset that uses the semantic hierarchy of WordNet to guide class population, which ensures a comprehensive and relatively unbiased coverage of diverse moving objects.

...read moreread less

Proceedings ArticleDOI

Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking

Qiang Wang, +5 more

TL;DR: A Residual Attentional Siamese Network (RASNet) for high performance object tracking that mitigates the over-fitting problem in deep network training, but also enhances its discriminative capacity and adaptability due to the separation of representation learning and discriminator learning.

...read moreread less

Journal ArticleDOI

Deep learning in video multi-object tracking: A survey

Gioele Ciaparrone, +6 more

- 14 Mar 2020 -

Neurocomputing

TL;DR: A comprehensive survey on works that employ Deep Learning models to solve the task of MOT on single-camera videos, identifying a number of similarities among the top-performing methods and presenting some possible future research directions.

...read moreread less

Book ChapterDOI

Online Multi-Object Tracking with Dual Matching Attention Networks

Ji Zhu, +5 more

TL;DR: This paper introduces a cost-sensitive tracking loss based on the state-of-the-art visual tracker which encourages the model to focus on hard negative distractors during online learning and proposes Dual Matching Attention Networks (DMAN) with both spatial and temporal attention mechanisms.

...read moreread less

Proceedings ArticleDOI

Real-Time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-Identification

Long Chen, +3 more

TL;DR: Zhang et al. as discussed by the authors proposed to handle unreliable detection by collecting candidates from outputs of both detection and tracking, and applied optimal selection from a considerable amount of candidates in real-time, and presented a novel scoring function based on a fully convolutional neural network.

...read moreread less

Collapse

References

PDF

Open Access

More filters

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

...read moreread less

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

...read moreread less

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

...read moreread less

Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

...read moreread less

Proceedings ArticleDOI

Histograms of oriented gradients for human detection

Navneet Dalal, +1 more

TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.

...read moreread less