Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction

doi:10.1109/ICCV.2017.393

Open Access, Proceedings Article

pp. 3657-3666

TLDR

This work presents a deep-learning framework for real-time multiple spatio-temporal (S/T) action localisation and classification that is not only capable of performing S/T detection in real time, but can also perform early action prediction in an online fashion.

Abstract:

We present a deep-learning framework for real-time multiple spatio-temporal (S/T) action localisation and classification. Current state-of-the-art approaches work offline and are too slow to be useful in real-world settings. To overcome their limitations, we introduce two major developments. Firstly, we adopt real-time SSD (Single Shot MultiBox Detector) CNNs to regress and classify detection boxes in each video frame potentially containing an action of interest. Secondly, we design an original and efficient online algorithm to incrementally construct and label ‘action tubes’ from the SSD frame-level detections. As a result, our system is not only capable of performing S/T detection in real time, but can also perform early action prediction in an online fashion. We achieve new state-of-the-art results in both S/T action localisation and early action prediction on the challenging UCF101-24 and J-HMDB-21 benchmarks, even when compared to the top offline competitors. To the best of our knowledge, ours is the first real-time (up to 40 fps) system able to perform online S/T action localisation on the untrimmed videos of UCF101-24.
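
The incremental tube construction described in the abstract can be illustrated with a simplified greedy linker. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Detection`, `Tube`, `iou` and `update_tubes`, the IoU threshold, and the miss limit are all hypothetical, detections are assumed to belong to a single action class, and the temporal labelling step is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    box: Box
    score: float  # confidence for one action class (assumed input format)


@dataclass
class Tube:
    boxes: List[Box] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)
    start_frame: int = 0
    misses: int = 0  # consecutive frames without a matching detection

    @property
    def mean_score(self) -> float:
        return sum(self.scores) / len(self.scores)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def update_tubes(tubes: List[Tube], detections: List[Detection], frame_idx: int,
                 iou_thresh: float = 0.1, max_misses: int = 5) -> List[Tube]:
    """Process one frame: extend live tubes greedily, drop stale ones, start new ones."""
    unused = list(detections)
    # Higher-scoring tubes pick their detection first.
    for tube in sorted(tubes, key=lambda t: t.mean_score, reverse=True):
        candidates = [d for d in unused if iou(tube.boxes[-1], d.box) >= iou_thresh]
        if candidates:
            best = max(candidates, key=lambda d: d.score)
            tube.boxes.append(best.box)
            tube.scores.append(best.score)
            tube.misses = 0
            unused.remove(best)
        else:
            tube.misses += 1
    # Terminate tubes that have gone unmatched for too long.
    live = [t for t in tubes if t.misses <= max_misses]
    # Every unassigned detection seeds a new tube.
    for d in unused:
        live.append(Tube([d.box], [d.score], start_frame=frame_idx))
    return live
```

Calling `update_tubes` once per incoming frame keeps the set of tubes current without revisiting earlier frames, which is what makes this style of linking usable online; a partially built tube's running score could then serve as an early prediction.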

Citations

Proceedings Article

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

Chunhui Gu, +12 more

TL;DR: The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.

Proceedings Article

Rethinking the Faster R-CNN Architecture for Temporal Action Localization

Yu-Wei Chao, +5 more

TL;DR: TAL-Net improves receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations, and better exploits the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields.

Proceedings Article

Video Action Transformer Network

Rohit Girdhar, +3 more

TL;DR: The Action Transformer uses a Transformer-style architecture to aggregate features from the spatio-temporal context around the person whose actions are being classified. By using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others.

Book Chapter

ECO: Efficient Convolutional Network for Online Video Understanding

Mohammadreza Zolfaghari, +2 more

TL;DR: A network architecture that takes long-term content into account while enabling fast per-video processing. It achieves competitive performance across all datasets while being 10 to 80 times faster than state-of-the-art methods.

Proceedings Article

Long-Term Feature Banks for Detailed Video Understanding

Chao-Yuan Wu, +5 more

TL;DR: In this article, a long-term feature bank is proposed to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds, enabling existing video models to relate the present to the past, and put events in context.

References

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: A deep convolutional neural network that achieved state-of-the-art performance, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

Book Chapter

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

TL;DR: A new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.

Proceedings Article

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

Ross Girshick, +3 more

TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects. When labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

Book Chapter

SSD: Single Shot MultiBox Detector

Wei Liu, +6 more

TL;DR: The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.

Proceedings Article

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, +3 more

TL;DR: Ren et al. propose a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.

Related Papers (5)

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Learning Spatiotemporal Features with 3D Convolutional Networks

Two-Stream Convolutional Networks for Action Recognition in Videos

Deep Residual Learning for Image Recognition