Open Access · Proceedings Article · DOI

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs

TLDR
The authors exploit the effectiveness of deep networks for temporal action localization via three segment-based 3D ConvNets: a proposal network identifies candidate segments in a long video that may contain actions, a classification network learns a one-vs-all action classification model that serves as initialization for the localization network, and a localization network fine-tunes the learned classification network to localize each action instance.
Abstract
We address temporal action localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions, (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network, and (3) a localization network fine-tunes the learned classification network to localize each action instance. We propose a novel loss function for the localization network to explicitly consider temporal overlap and achieve high temporal localization accuracy. In the end, only the proposal network and the localization network are used during prediction. On two large-scale benchmarks, our approach achieves significantly superior performance compared with other state-of-the-art systems: mAP increases from 1.7% to 7.4% on MEXaction2 and from 15.0% to 19.0% on THUMOS 2014.
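To make the prediction-time pipeline concrete, the following sketch shows how multi-scale candidate segments could be generated, filtered by a proposal network, scored per class by a localization network, and post-processed with temporal non-maximum suppression. It is a minimal illustration assuming hypothetical scoring callables (`proposal_score`, `localization_scores`), window scales, and thresholds; it is not the authors' implementation.

```python
# Minimal sketch of a multi-stage temporal localization pipeline (assumption:
# the trained segment-based 3D ConvNets are represented by the hypothetical
# callables `proposal_score` and `localization_scores`).
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

def generate_segments(video_len: float, scales=(16, 32, 64, 128, 256, 512),
                      fps: float = 25.0, overlap: float = 0.75) -> List[Segment]:
    """Multi-scale sliding windows over the video timeline."""
    segments = []
    for num_frames in scales:
        length = num_frames / fps
        stride = max(length * (1.0 - overlap), 1.0 / fps)
        t = 0.0
        while t + length <= video_len:
            segments.append((t, t + length))
            t += stride
    return segments

def temporal_iou(a: Segment, b: Segment) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(dets: List[Tuple[Segment, int, float]], thresh: float = 0.4):
    """Greedy temporal NMS over (segment, class, score) detections."""
    dets = sorted(dets, key=lambda d: d[2], reverse=True)
    keep = []
    for d in dets:
        if all(d[1] != k[1] or temporal_iou(d[0], k[0]) < thresh for k in keep):
            keep.append(d)
    return keep

def localize(video_len: float,
             proposal_score: Callable[[Segment], float],
             localization_scores: Callable[[Segment], List[float]],
             prop_thresh: float = 0.5):
    dets = []
    for seg in generate_segments(video_len):
        if proposal_score(seg) < prop_thresh:   # proposal network: drop likely background
            continue
        scores = localization_scores(seg)       # localization network: per-class scores
        cls = max(range(len(scores)), key=scores.__getitem__)
        dets.append((seg, cls, scores[cls]))
    return temporal_nms(dets)
```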



Citations
Proceedings Article · DOI

A Closer Look at Spatiotemporal Convolutions for Action Recognition

TL;DR: In this article, a new spatio-temporal convolutional block "R(2+1)D" was proposed, which achieved state-of-the-art performance on Sports-1M, Kinetics, UCF101, and HMDB51.
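For context, the "(2+1)D" idea factorizes a full 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution. The sketch below, assuming PyTorch and an arbitrarily chosen number of intermediate channels (the paper picks it to match the parameter count of the corresponding 3D convolution), illustrates the factorization rather than the exact R(2+1)D network.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """(2+1)D block: a spatial (1, k, k) conv followed by a temporal (k, 1, 1) conv,
    with a nonlinearity in between. `mid_channels` is a free parameter here."""
    def __init__(self, in_channels, out_channels, mid_channels, k=3):
        super().__init__()
        pad = k // 2
        self.spatial = nn.Conv3d(in_channels, mid_channels, kernel_size=(1, k, k),
                                 padding=(0, pad, pad), bias=False)
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_channels, out_channels, kernel_size=(k, 1, 1),
                                  padding=(pad, 0, 0), bias=False)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

# Example: a batch of two 8-frame RGB clips at 112x112 resolution.
clip = torch.randn(2, 3, 8, 112, 112)
out = Conv2Plus1D(3, 64, mid_channels=45)(clip)  # -> (2, 64, 8, 112, 112)
```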
Proceedings Article · DOI

R-C3D: Region Convolutional 3D Network for Temporal Activity Detection

TL;DR: Region Convolutional 3D Network (R-C3D), as discussed by the authors, uses a three-dimensional fully convolutional network to extract meaningful spatio-temporal features that capture activities, accurately localizing the start and end times of each activity.
Proceedings Article · DOI

Rethinking the Faster R-CNN Architecture for Temporal Action Localization

TL;DR: TAL-Net, as discussed by the authors, improves receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations, and better exploits the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields.
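A small worked example of the receptive-field arithmetic behind such alignment: for a stack of stride-1 temporal convolutions, the receptive field in frames is 1 + sum over layers of (k_i - 1) * d_i, where k_i is the kernel size and d_i the dilation, so dilation can grow the temporal receptive field to match typical action durations. The layer configuration below is illustrative only.

```python
def temporal_receptive_field(layers):
    """Receptive field (in frames) of stacked stride-1 temporal convolutions,
    given (kernel_size, dilation) pairs: RF = 1 + sum((k - 1) * d)."""
    return 1 + sum((k - 1) * d for k, d in layers)

# e.g. three kernel-3 temporal convs with dilations 1, 2 and 4:
print(temporal_receptive_field([(3, 1), (3, 2), (3, 4)]))  # -> 15 frames
```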
Proceedings Article · DOI

Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-Supervised Object and Action Localization

TL;DR: The key idea is to hide patches in a training image randomly, forcing the network to seek other relevant parts when the most discriminative part is hidden; this yields superior performance compared to previous methods for weakly-supervised object localization on the ILSVRC dataset.
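The hiding step itself is easy to sketch: divide each training image into a grid and replace randomly chosen cells with a fill value such as the dataset mean pixel. The grid size and hiding probability below are illustrative assumptions.

```python
import numpy as np

def hide_patches(image: np.ndarray, grid: int = 4, p_hide: float = 0.5,
                 fill_value=None, rng=None) -> np.ndarray:
    """Randomly hide grid cells of a training image of shape (H, W, C).

    Each of the grid x grid cells is replaced with `fill_value` with
    probability `p_hide`; the per-image mean is used here as a stand-in
    for the dataset mean pixel."""
    rng = rng or np.random.default_rng()
    out = image.copy()
    if fill_value is None:
        fill_value = image.mean(axis=(0, 1))
    h, w = image.shape[:2]
    ys = np.linspace(0, h, grid + 1, dtype=int)
    xs = np.linspace(0, w, grid + 1, dtype=int)
    for i in range(grid):
        for j in range(grid):
            if rng.random() < p_hide:
                out[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = fill_value
    return out

augmented = hide_patches(np.random.rand(224, 224, 3).astype(np.float32))
```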
Proceedings Article · DOI

Temporal Action Detection with Structured Segment Networks

TL;DR: In this article, a structured segment network (SSN) is proposed that models the temporal structure of each action instance via a structured temporal pyramid, on top of which a decomposed discriminative model comprising two classifiers, for classifying actions and determining completeness respectively, is applied.
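A rough sketch of the stage-wise pooling idea, assuming per-snippet features have already been extracted over an augmented proposal; the stage splits and pyramid configuration here are illustrative, not the paper's exact setting.

```python
import numpy as np

def structured_pool(features: np.ndarray, start_frac: float = 0.25,
                    end_frac: float = 0.25, course_parts: int = 2) -> np.ndarray:
    """Pool per-snippet features of shape (T, D) into starting / course / ending
    stage descriptors; the middle (course) stage also gets `course_parts`
    finer pooled parts, forming a small temporal pyramid."""
    T = features.shape[0]
    s = max(1, int(T * start_frac))
    e = max(1, int(T * end_frac))
    starting, course, ending = features[:s], features[s:T - e], features[T - e:]
    if course.shape[0] == 0:          # degenerate case for very short proposals
        course = features
    pools = [starting.mean(axis=0), course.mean(axis=0), ending.mean(axis=0)]
    pools += [part.mean(axis=0) for part in np.array_split(course, course_parts)]
    return np.concatenate(pools)      # fed to the activity and completeness classifiers

descriptor = structured_pool(np.random.rand(20, 128))  # -> vector of length 5 * 128
```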
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art performance on ImageNet classification, as discussed by the authors.
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings Article · DOI

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

TL;DR: R-CNN, as discussed by the authors, combines CNNs with bottom-up region proposals to localize and segment objects, and when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Proceedings Article · DOI

Fast R-CNN

TL;DR: Fast R-CNN, as discussed by the authors, is a Fast Region-based Convolutional Network method for object detection that employs several innovations to improve training and testing speed while also increasing detection accuracy, achieving a higher mAP on PASCAL VOC 2012.
Proceedings Article

Faster R-CNN: towards real-time object detection with region proposal networks

TL;DR: The authors propose a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.
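As a reference point for the anchor mechanism the RPN classifies and regresses, here is a minimal sketch of generating anchor boxes at several scales and aspect ratios centered on each feature-map cell; the stride, scales, and ratios are illustrative values, not the exact configuration.

```python
import numpy as np

def generate_anchors(feat_h: int, feat_w: int, stride: int = 16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)) -> np.ndarray:
    """Return anchors of shape (feat_h * feat_w * len(scales) * len(ratios), 4)
    as (x1, y1, x2, y2) in image coordinates. Each ratio r is treated as
    width/height, with the anchor area fixed to scale**2."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.asarray(anchors, dtype=np.float32)

boxes = generate_anchors(38, 50)  # e.g. a conv feature map for a ~600x800 image
```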