UntrimmedNets for Weakly Supervised Action Recognition and Detection
Limin Wang, Yuanjun Xiong, Dahua Lin, Luc Van Gool
- pp 6402-6411
TL;DR: This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances.
Abstract:
Current action recognition methods rely heavily on trimmed videos for model training. However, acquiring a large-scale trimmed video dataset is expensive and time-consuming. This paper presents a new weakly supervised architecture, called UntrimmedNet, which directly learns action recognition models from untrimmed videos without requiring temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and to reason about the temporal duration of action instances, respectively. Both components are implemented with feed-forward networks, so UntrimmedNet is an end-to-end trainable architecture. We exploit the learned models for weakly supervised action recognition (WSR) and detection (WSD) on the untrimmed video datasets THUMOS14 and ActivityNet. Although UntrimmedNet employs only weak supervision, it achieves performance superior or comparable to that of strongly supervised approaches on these two datasets.
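The coupling of classification and selection described in the abstract can be sketched as a soft-attention aggregation over clip-level scores: the selection module weights each clip, and the weighted per-clip class probabilities are pooled into a video-level prediction trained with only a video-level label. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation; the function names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def untrimmednet_soft_aggregate(clip_class_scores, clip_attention_scores):
    """Combine per-clip classification with soft clip selection.

    clip_class_scores:     (num_clips, num_classes) raw scores from the
                           classification module (illustrative stand-in).
    clip_attention_scores: (num_clips,) raw scores from the selection module.
    Returns a (num_classes,) video-level probability vector, which can be
    trained against the video-level label with cross-entropy.
    """
    attention = softmax(clip_attention_scores, axis=0)   # weights over clips
    clip_probs = softmax(clip_class_scores, axis=1)      # per-clip class probs
    return (attention[:, None] * clip_probs).sum(axis=0) # attention-weighted pool

# Two clips: the first strongly prefers class 0 and gets nearly all attention,
# so the video-level prediction follows it.
video_probs = untrimmednet_soft_aggregate(
    np.array([[2.0, 0.0], [0.0, 2.0]]),
    np.array([5.0, -5.0]),
)
```

Because the attention weights are a softmax over clips, the pooled vector is a convex combination of per-clip probability vectors and therefore remains a valid distribution; at detection time the same attention scores can serve as clip-level action indicators.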
Citations
Proceedings ArticleDOI
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection
Huijuan Xu, Abir Das, Kate Saenko
TL;DR: Region Convolutional 3D Network (R-C3D) uses a three-dimensional fully convolutional network to extract meaningful spatio-temporal features that capture activities, accurately localizing the start and end times of each activity.
Proceedings ArticleDOI
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A. Ross, Jia Deng, Rahul Sukthankar
TL;DR: TAL-Net improves receptive field alignment with a multi-scale architecture that accommodates extreme variation in action durations, and better exploits the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields.
Book ChapterDOI
BSN: Boundary Sensitive Network for Temporal Action Proposal Generation
TL;DR: An effective proposal generation method, named Boundary-Sensitive Network (BSN), which adopts a "local to global" fashion and significantly improves state-of-the-art temporal action detection performance.
Proceedings ArticleDOI
Graph Convolutional Networks for Temporal Action Localization
TL;DR: This work models proposal-proposal relations with Graph Convolutional Networks (GCNs) to exploit the context information of each proposal and the correlations between distinct actions.
Proceedings ArticleDOI
BMN: Boundary-Matching Network for Temporal Action Proposal Generation
TL;DR: This work proposes an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously, and can achieve state-of-the-art temporal action detection performance.
References
Proceedings ArticleDOI
Weakly Supervised Deep Detection Networks
Hakan Bilen, Andrea Vedaldi
TL;DR: This paper proposes a weakly supervised deep detection architecture that modifies an image-level classification network to operate at the level of image regions, performing region selection and classification simultaneously.
Proceedings ArticleDOI
Action snippets: How many frames does human action recognition require?
Konrad Schindler, L. Van Gool
TL;DR: It turns out that even local shape and optic flow from a single frame are enough to achieve approximately 90% correct recognition, and snippets of 5-7 frames suffice to match the performance obtainable with the entire video sequence.
Journal ArticleDOI
Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning
TL;DR: This work follows a multiple-instance learning approach that iteratively trains the detector and infers the object locations in the positive training images, and proposes a window refinement method that improves localization accuracy by incorporating an objectness prior.
Proceedings ArticleDOI
Learning Activity Progression in LSTMs for Activity Detection and Early Detection
TL;DR: This work designs novel ranking losses that directly penalize the model for violating the monotonicity of activity progression; these losses are used together with a classification loss when training LSTM models.
Book ChapterDOI
DAPs: Deep Action Proposals for Action Understanding
TL;DR: This work introduces Deep Action Proposals (DAPs), an effective and efficient algorithm for generating temporal action proposals from long videos; it outperforms previous work on a large-scale action benchmark, runs at 134 FPS, making it practical for large-scale scenarios, and exhibits an appealing ability to generalize.
Related Papers (5)
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Joao Carreira, Andrew Zisserman