Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling

doi:10.1109/CVPR.2017.140

Open Access, Proceedings Article

- pp 1273-1282

TLDR

A combination of a discriminative representation of subactions, modeled by a recurrent neural network, and a coarse probabilistic model to allow for a temporal alignment and inference over long sequences of human actions is proposed.

Abstract:

We present an approach for weakly supervised learning of human actions. Given a set of videos and an ordered list of the occurring actions, the goal is to infer the start and end frames of the related action classes within each video and to train the respective action classifiers without any need for hand-labeled frame boundaries. To address this task, we propose a combination of a discriminative representation of subactions, modeled by a recurrent neural network, and a coarse probabilistic model that allows for temporal alignment and inference over long sequences. While this system alone already generates good results, we show that performance can be further improved by matching the number of subactions to the characteristics of the different action classes. To this end, we adapt the number of subaction classes by iterating realignment and reestimation during training. The proposed system is evaluated on two benchmark datasets, the Breakfast dataset and the Hollywood Extended dataset, and shows competitive performance on various weak learning tasks such as temporal action segmentation and action alignment.
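The alignment step described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: it assumes one state per action (no subactions, no length or transition model) and takes per-frame class log-probabilities as given, where the paper would obtain such scores from its recurrent network. The function names `align_to_transcript` and `to_segments`, the labels, and the toy probabilities are all hypothetical.

```python
import math


def align_to_transcript(frame_log_probs, transcript):
    """Assign every frame to one entry of an ordered transcript.

    frame_log_probs: one dict per frame, mapping class label -> log-probability
                     (assumed to come from some frame-wise classifier).
    transcript:      ordered list of action labels known to occur in the video.
    Returns one transcript index per frame; indices are non-decreasing,
    start at 0 and end at len(transcript) - 1.
    """
    num_frames, num_entries = len(frame_log_probs), len(transcript)
    if num_entries == 0 or num_frames < num_entries:
        raise ValueError("need at least one frame per transcript entry")

    neg_inf = -math.inf
    # best[t][n]: best total score of frames 0..t with frame t assigned to entry n
    best = [[neg_inf] * num_entries for _ in range(num_frames)]
    # advanced[t][n]: True if frame t-1 was assigned to entry n-1 on the best path
    advanced = [[False] * num_entries for _ in range(num_frames)]
    best[0][0] = frame_log_probs[0][transcript[0]]

    for t in range(1, num_frames):
        for n in range(min(t + 1, num_entries)):
            score = frame_log_probs[t][transcript[n]]
            stay = best[t - 1][n]
            advance = best[t - 1][n - 1] if n > 0 else neg_inf
            if advance > stay:
                best[t][n] = advance + score
                advanced[t][n] = True
            else:
                best[t][n] = stay + score

    # Backtrack from the last frame, which must belong to the last entry.
    n = num_entries - 1
    path = [n]
    for t in range(num_frames - 1, 0, -1):
        if advanced[t][n]:
            n -= 1
        path.append(n)
    path.reverse()
    return path


def to_segments(path, transcript):
    """Turn a per-frame alignment into (label, start_frame, end_frame) tuples."""
    segments, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((transcript[path[start]], start, t - 1))
            start = t
    return segments


# Toy example: six frames, two actions, made-up classifier outputs.
pour_probs = [0.9, 0.8, 0.3, 0.2, 0.1, 0.1]
frames = [{"pour": math.log(p), "stir": math.log(1.0 - p)} for p in pour_probs]
example_transcript = ["pour", "stir"]
example_path = align_to_transcript(frames, example_transcript)
print(to_segments(example_path, example_transcript))
```

In a weakly supervised loop of the kind the abstract outlines, the inferred segments would serve as frame labels for retraining the classifier, after which the alignment is recomputed. That retraining step, and any adaptation of the number of subactions, is outside this sketch.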

Citations

Posted Content

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Antoine Miech, +5 more

- 07 Jun 2019 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.

Journal Article

Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates

Jun Liu, +4 more

- 01 Dec 2018 -

IEEE Transactions on Pattern Analysis an...

TL;DR: A new gating mechanism is introduced within the LSTM module, with which the network can learn the reliability of the sequential data and accordingly adjust the effect of the input data on the updating of the long-term context representation stored in the unit's memory cell.

Proceedings Article

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Antoine Miech, +5 more

TL;DR: This article proposes learning text-video embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations, which leads to state-of-the-art results on instructional video datasets such as YouCook2 or CrossTask.

Proceedings Article

Weakly Supervised Action Localization by Sparse Temporal Pooling Network

Phuc Xuan Nguyen, +3 more

TL;DR: In this article, a weakly supervised temporal action localization algorithm is proposed, which learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations.

Proceedings Article

MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation

Yazan Abu Farha, +1 more

TL;DR: A multi-stage architecture for the temporal action segmentation task that achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.

References

Posted Content

Empirical evaluation of gated recurrent neural networks on sequence modeling

Junyoung Chung, +5 more

- 11 Dec 2014 -

arXiv: Neural and Evolutionary Computing

TL;DR: Recurrent units that implement a gating mechanism, such as the long short-term memory (LSTM) unit and the more recently proposed gated recurrent unit (GRU), are found to outperform more traditional units such as tanh units, and the GRU is found to be comparable to the LSTM.

Proceedings Article

Two-Stream Convolutional Networks for Action Recognition in Videos

Karen Simonyan, +1 more

TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.

Proceedings Article

Large-Scale Video Classification with Convolutional Neural Networks

Andrej Karpathy, +5 more

TL;DR: This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.

Proceedings Article

On the Properties of Neural Machine Translation: Encoder--Decoder Approaches

Kyunghyun Cho, +5 more

TL;DR: This paper analyzes neural machine translation using two models, the RNN Encoder-Decoder and a newly proposed gated recursive convolutional neural network. Neural machine translation performs relatively well on short sentences without unknown words, but its performance degrades rapidly as the length of the sentence and the number of unknown words increase. The gated recursive convolutional network is found to learn a grammatical structure of a sentence automatically.

Related Papers (5)

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities

Temporal Convolutional Networks for Action Segmentation and Detection

Action Recognition with Improved Trajectories

Two-Stream Convolutional Networks for Action Recognition in Videos