Open Access · Proceedings Article

Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment

TL;DR: A novel action modeling framework is proposed, which consists of a new temporal convolutional network, named Temporal Convolutional Feature Pyramid Network (TCFPN), for predicting frame-wise action labels, and a novel training strategy for weakly-supervised sequence modeling, named Iterative Soft Boundary Assignment (ISBA), to align action sequences and update the network in an iterative fashion.
Abstract
In this work, we address the task of weakly-supervised human action segmentation in long, untrimmed videos. Recent methods have relied on expensive learning models, such as Recurrent Neural Networks (RNNs) and Hidden Markov Models (HMMs). However, these methods incur high computational cost and therefore do not scale to large datasets. To overcome these limitations, we design for efficiency and scalability. We propose a novel action modeling framework consisting of a new temporal convolutional network, named Temporal Convolutional Feature Pyramid Network (TCFPN), for predicting frame-wise action labels, and a novel training strategy for weakly-supervised sequence modeling, named Iterative Soft Boundary Assignment (ISBA), to align action sequences and update the network in an iterative fashion. The proposed framework is evaluated on two benchmark datasets, Breakfast and Hollywood Extended, with four different evaluation metrics. Extensive experimental results show that our method achieves competitive or superior performance compared to state-of-the-art methods.
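The iterative alignment idea in the abstract can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's exact ISBA algorithm: it starts from boundaries placed between the ordered transcript actions, then shifts each boundary toward nearby frames where the model's frame-wise predictions better support the two adjacent actions. The function names (`align`, `boundary_step`), the `width` search window, and the scoring rule are all illustrative assumptions, and the "soft" labeling of frames near boundaries that the paper's name implies is omitted here.

```python
import numpy as np

def align(transcript, bounds, n_frames):
    """Expand an ordered action transcript plus interior boundaries
    into frame-wise labels."""
    labels = np.empty(n_frames, dtype=int)
    edges = [0] + list(bounds) + [n_frames]
    for action, (start, end) in zip(transcript, zip(edges[:-1], edges[1:])):
        labels[start:end] = action
    return labels

def boundary_step(transcript, bounds, probs, width=3):
    """One illustrative alignment update (hypothetical rule, not the
    paper's exact ISBA): shift each interior boundary within +/- `width`
    frames to the position where the frame-wise probabilities `probs`
    (n_frames x n_actions) best support the left action before the
    boundary and the right action after it."""
    new_bounds = []
    for i, b in enumerate(bounds):
        left, right = transcript[i], transcript[i + 1]
        # keep boundaries strictly increasing while searching the window
        lo = max(new_bounds[-1] + 1 if new_bounds else 1, b - width)
        hi = min(len(probs) - 1, b + width)
        best = max(range(lo, hi + 1),
                   key=lambda t: probs[lo - 1:t, left].sum()
                               + probs[t:hi + 1, right].sum())
        new_bounds.append(best)
    return new_bounds
```

In a full ISBA-style loop one would alternate: train the frame-wise predictor (TCFPN in the paper) on the current alignment, predict `probs`, update the boundaries, and repeat until the alignment stabilizes.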



Citations
Proceedings Article

ActBERT: Learning Global-Local Video-Text Representations

TL;DR: This paper introduces ActBERT for self-supervised learning of joint video-text representations from unlabeled data and introduces an ENtangled Transformer block to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions.
Proceedings Article

MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation

TL;DR: A multi-stage architecture for the temporal action segmentation task that achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
Proceedings Article

Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization

TL;DR: This work identifies two underexplored problems posed by the weak supervision for temporal action localization, namely action completeness modeling and action-context separation, and proposes a multi-branch neural network in which branches are enforced to discover distinctive action parts.
Proceedings Article

FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding

TL;DR: FineGym is a new dataset built on top of gymnasium videos that provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy and systematically investigates different methods on this dataset and obtains a number of interesting findings.
Proceedings Article

COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis

TL;DR: The COIN dataset contains 11,827 videos of 180 tasks in 12 domains (e.g., vehicles, gadgets) related to daily life, and all the videos are annotated with a series of step descriptions and corresponding temporal boundaries.
References
Proceedings Article

Feature Pyramid Networks for Object Detection

TL;DR: This paper exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost and achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles.
Proceedings Article

Learning realistic human actions from movies

TL;DR: A new method for video classification is presented that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, and is shown to improve state-of-the-art results on the standard KTH action dataset.
Proceedings Article

Actions in context

TL;DR: This paper automatically discovers relevant scene classes and their correlation with human actions, shows how to learn selected scene classes from video without manual supervision, and develops a joint framework for action and scene recognition that demonstrates improved recognition of both in natural video.
Proceedings Article

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

TL;DR: A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.