UntrimmedNets for Weakly Supervised Action Recognition and Detection
Limin Wang, Yuanjun Xiong, Dahua Lin, Luc Van Gool
- pp 6402-6411
TL;DR: This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances.
Abstract:
Current action recognition methods rely heavily on trimmed videos for model training. However, acquiring a large-scale trimmed video dataset is expensive and time-consuming. This paper presents a new weakly supervised architecture, called UntrimmedNet, which directly learns action recognition models from untrimmed videos without requiring temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and to reason about the temporal duration of action instances, respectively. Both components are implemented with feed-forward networks, so UntrimmedNet is an end-to-end trainable architecture. We exploit the learned models for weakly supervised action recognition (WSR) and detection (WSD) on the untrimmed video datasets THUMOS14 and ActivityNet. Although UntrimmedNet employs only weak supervision, it achieves performance superior or comparable to that of strongly supervised approaches on these two datasets.
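The coupling of classification and selection described in the abstract can be sketched as a soft-attention aggregation over clip-level scores: the selection module weights each clip, and the weighted per-clip class probabilities are pooled into a video-level prediction trained with only a video-level label. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation; the function names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def untrimmednet_soft_aggregate(clip_class_scores, clip_attention_scores):
    """Combine per-clip classification with soft clip selection.

    clip_class_scores:     (num_clips, num_classes) raw scores from the
                           classification module (illustrative stand-in).
    clip_attention_scores: (num_clips,) raw scores from the selection module.
    Returns a (num_classes,) video-level probability vector, which can be
    trained against the video-level label with cross-entropy.
    """
    attention = softmax(clip_attention_scores, axis=0)   # weights over clips
    clip_probs = softmax(clip_class_scores, axis=1)      # per-clip class probs
    return (attention[:, None] * clip_probs).sum(axis=0) # attention-weighted pool

# Two clips: the first strongly prefers class 0 and gets nearly all attention,
# so the video-level prediction follows it.
video_probs = untrimmednet_soft_aggregate(
    np.array([[2.0, 0.0], [0.0, 2.0]]),
    np.array([5.0, -5.0]),
)
```

Because the attention weights are a softmax over clips, the pooled vector is a convex combination of per-clip probability vectors and therefore remains a valid distribution; at detection time the same attention scores can serve as clip-level action indicators.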
Citations
Proceedings ArticleDOI
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection
Huijuan Xu, Abir Das, Kate Saenko
TL;DR: Region Convolutional 3D Network (R-C3D) uses a three-dimensional fully convolutional network to extract meaningful spatio-temporal features that capture activities, accurately localizing the start and end times of each activity.
Proceedings ArticleDOI
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A. Ross, Jia Deng, Rahul Sukthankar
TL;DR: TAL-Net improves receptive field alignment with a multi-scale architecture that accommodates extreme variation in action durations, and better exploits the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields.
Book ChapterDOI
BSN: Boundary Sensitive Network for Temporal Action Proposal Generation
TL;DR: An effective proposal generation method, named Boundary-Sensitive Network (BSN), which adopts a "local to global" fashion and significantly improves state-of-the-art temporal action detection performance.
Proceedings ArticleDOI
Graph Convolutional Networks for Temporal Action Localization
TL;DR: This work models proposal-proposal relations with Graph Convolutional Networks (GCNs) to exploit the context information of each proposal and the correlations between distinct actions.
Proceedings ArticleDOI
BMN: Boundary-Matching Network for Temporal Action Proposal Generation
TL;DR: This work proposes an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously, and can achieve state-of-the-art temporal action detection performance.
References
Proceedings ArticleDOI
Weakly Supervised Deep Detection Networks
Hakan Bilen, Andrea Vedaldi
TL;DR: This paper proposes a weakly supervised deep detection architecture that modifies an image-level classification network to operate at the level of image regions, performing region selection and classification simultaneously.
Proceedings ArticleDOI
Action snippets: How many frames does human action recognition require?
Konrad Schindler, L. Van Gool
TL;DR: It turns out that even local shape and optic flow from a single frame are enough to achieve approximately 90% correct recognition, and snippets of 5-7 frames suffice to match the performance obtainable with the entire video sequence.
Journal ArticleDOI
Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning
TL;DR: This work follows a multiple-instance learning approach that iteratively trains the detector and infers the object locations in the positive training images, and proposes a window refinement method that improves localization accuracy by incorporating an objectness prior.
Proceedings ArticleDOI
Learning Activity Progression in LSTMs for Activity Detection and Early Detection
TL;DR: This work designs novel ranking losses that directly penalize the model for violating the monotonicity of activity progression; these losses are used together with a classification loss when training LSTM models.
Book ChapterDOI
DAPs: Deep Action Proposals for Action Understanding
TL;DR: This work introduces Deep Action Proposals (DAPs), an effective and efficient algorithm for generating temporal action proposals from long videos; it outperforms previous work on a large-scale action benchmark, runs at 134 FPS, making it practical for large-scale scenarios, and exhibits an appealing ability to generalize.
Related Papers (5)
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Joao Carreira, Andrew Zisserman