Open Access Proceedings Article (DOI)

UntrimmedNets for Weakly Supervised Action Recognition and Detection

TLDR
This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without the requirement of temporal annotations of action instances.
Abstract
Current action recognition methods heavily rely on trimmed videos for model training. However, it is expensive and time-consuming to acquire a large-scale trimmed video dataset. This paper presents a new weakly supervised architecture, called UntrimmedNet, which is able to directly learn action recognition models from untrimmed videos without requiring temporal annotations of action instances. Our UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and reason about the temporal duration of action instances, respectively. These two components are implemented with feed-forward networks, and UntrimmedNet is therefore an end-to-end trainable architecture. We exploit the learned models for weakly supervised action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet. Although our UntrimmedNet only employs weak supervision, our method achieves performance superior or comparable to that of strongly supervised approaches on these two datasets.
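To make the two-module design in the abstract concrete, below is a minimal sketch of the weighted-aggregation idea, assuming precomputed clip-level features and the soft (attention-based) variant of the selection module. The class name, feature dimension, and training snippet are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UntrimmedNetSketch(nn.Module):
    """Sketch of the UntrimmedNet idea: per-clip classification scores are
    combined with selection (attention) weights so the whole model can be
    trained from video-level labels only."""

    def __init__(self, feat_dim=1024, num_classes=101):
        super().__init__()
        # Classification module: per-clip action scores.
        self.classifier = nn.Linear(feat_dim, num_classes)
        # Selection module (soft variant): per-clip importance score.
        self.selector = nn.Linear(feat_dim, 1)

    def forward(self, clip_feats):
        # clip_feats: (num_clips, feat_dim) features of clips sampled
        # from one untrimmed video.
        clip_scores = self.classifier(clip_feats)            # (N, C)
        attn = F.softmax(self.selector(clip_feats), dim=0)   # (N, 1)
        # Video-level prediction: attention-weighted sum of clip scores.
        video_score = (attn * clip_scores).sum(dim=0)        # (C,)
        return video_score, clip_scores, attn


# Training uses only the video-level label; the per-clip attention can later
# be thresholded to localize action instances (the WSD setting).
model = UntrimmedNetSketch(feat_dim=1024, num_classes=101)
feats = torch.randn(30, 1024)      # 30 clips from one untrimmed video
video_label = torch.tensor(5)      # video-level class index
video_score, _, _ = model(feats)
loss = F.cross_entropy(video_score.unsqueeze(0), video_label.unsqueeze(0))
loss.backward()
```

The key point this sketch illustrates is that the loss is computed only at the video level, so no temporal annotations are needed during training, while the selection weights still provide clip-level evidence for detection.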



Citations
Proceedings Article (DOI)

R-C3D: Region Convolutional 3D Network for Temporal Activity Detection

TL;DR: Region Convolutional 3D Network (R-C3D) uses a three-dimensional fully convolutional network to extract spatio-temporal features that capture activities and accurately localize the start and end times of each activity.
Proceedings Article (DOI)

Rethinking the Faster R-CNN Architecture for Temporal Action Localization

TL;DR: TAL-Net improves receptive field alignment using a multi-scale architecture that accommodates extreme variation in action durations, and better exploits the temporal context of actions for both proposal generation and action classification by appropriately extending the receptive fields.
Book Chapter (DOI)

BSN: Boundary Sensitive Network for Temporal Action Proposal Generation

TL;DR: An effective proposal generation method, named Boundary-Sensitive Network (BSN), adopts a "local to global" fashion and significantly improves state-of-the-art temporal action detection performance.
Proceedings Article (DOI)

Graph Convolutional Networks for Temporal Action Localization

TL;DR: This work exploits proposal-proposal relations using Graph Convolutional Networks (GCNs) to capture the context information for each proposal and the correlations between distinct actions.
Proceedings Article (DOI)

BMN: Boundary-Matching Network for Temporal Action Proposal Generation

TL;DR: This work proposes an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously, and can achieve state-of-the-art temporal action detection performance.
References
Proceedings Article (DOI)

Weakly Supervised Deep Detection Networks

TL;DR: This paper proposes a weakly supervised deep detection architecture that modifies a pre-trained image classification network to operate at the level of image regions, performing region selection and classification simultaneously.
Proceedings Article (DOI)

Action snippets: How many frames does human action recognition require?

TL;DR: It turns out that even local shape and optic flow from a single frame are enough to achieve approximately 90% correct recognition, and snippets of 5-7 frames are enough to achieve performance similar to that obtained with the entire video sequence.
Journal Article (DOI)

Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning

TL;DR: This work follows a multiple-instance learning approach that iteratively trains the detector and infers the object locations in the positive training images, and proposes a window refinement method that improves localization accuracy by incorporating an objectness prior.
Proceedings Article (DOI)

Learning Activity Progression in LSTMs for Activity Detection and Early Detection

TL;DR: This work designs novel ranking losses that directly penalize the model for violating the monotonicity of activity progression, and uses them together with a classification loss to train LSTM models.
Book Chapter (DOI)

DAPs: Deep Action Proposals for Action Understanding

TL;DR: Deep Action Proposals (DAPs), an effective and efficient algorithm for generating temporal action proposals from long videos, is introduced; it outperforms previous work on a large-scale action benchmark, runs at 134 FPS, making it practical for large-scale scenarios, and exhibits an appealing ability to generalize.