Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment
Chenliang Xu, Li Ding +1 more
- pp. 6508–6516
TLDR
A novel action modeling framework is proposed, which consists of a new temporal convolutional network, named Temporal Convolutional Feature Pyramid Network (TCFPN), for predicting frame-wise action labels, and a novel training strategy for weakly-supervised sequence modeling, named Iterative Soft Boundary Assignment (ISBA), to align action sequences and update the network in an iterative fashion.

Abstract:
In this work, we address the task of weakly-supervised human action segmentation in long, untrimmed videos. Recent methods have relied on expensive learning models, such as Recurrent Neural Networks (RNN) and Hidden Markov Models (HMM). However, these methods suffer from high computational cost and thus cannot be deployed at large scale. To overcome these limitations, the keys to our design are efficiency and scalability. We propose a novel action modeling framework, which consists of a new temporal convolutional network, named Temporal Convolutional Feature Pyramid Network (TCFPN), for predicting frame-wise action labels, and a novel training strategy for weakly-supervised sequence modeling, named Iterative Soft Boundary Assignment (ISBA), to align action sequences and update the network in an iterative fashion. The proposed framework is evaluated on two benchmark datasets, Breakfast and Hollywood Extended, with four different evaluation metrics. Extensive experimental results show that our methods achieve competitive or superior performance to state-of-the-art methods.
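The abstract describes ISBA only at a high level: starting from an ordered transcript of actions, generate frame-wise pseudo-labels, then soften the labels around action boundaries so the network is not trained on hard, possibly misplaced cuts. The sketch below is a minimal numpy illustration of that idea under my own assumptions (uniform initial alignment, linear soft weights near boundaries); the function names and the `width` parameter are hypothetical, not the paper's implementation.

```python
import numpy as np

def uniform_alignment(transcript, n_frames):
    """Initial pseudo-labels: split the frames evenly among the
    transcript's actions (a common weakly-supervised initialization)."""
    bounds = np.linspace(0, n_frames, len(transcript) + 1).round().astype(int)
    labels = np.empty(n_frames, dtype=int)
    for i, action in enumerate(transcript):
        labels[bounds[i]:bounds[i + 1]] = action
    return labels

def soften_boundaries(labels, n_classes, width=3):
    """Near each boundary, spread probability mass linearly between the
    two adjacent actions instead of committing to a hard label."""
    n = len(labels)
    target = np.zeros((n, n_classes))
    target[np.arange(n), labels] = 1.0
    boundaries = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    for b in boundaries:
        left, right = labels[b - 1], labels[b]
        for d in range(-width, width):
            t = b + d
            if 0 <= t < n:
                w = (d + width) / (2 * width)  # 0 far left, -> 1 far right
                target[t] = 0.0
                target[t, left] = 1.0 - w
                target[t, right] = w
    return target
```

In an ISBA-style loop, these soft targets would train the network, whose predictions in turn re-estimate the boundaries for the next iteration.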
Citations
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu, Yi Yang +1 more
TL;DR: This paper introduces ActBERT for self-supervised learning of joint video-text representations from unlabeled data, together with an ENtangled Transformer block that encodes three sources of information: global actions, local regional objects, and linguistic descriptions.
MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation
Yazan Abu Farha, Juergen Gall +1 more
TL;DR: A multi-stage architecture for the temporal action segmentation task is proposed that achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
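The defining idea of MS-TCN is that each stage after the first takes the previous stage's class probabilities as input and refines them, which suppresses over-segmentation. The numpy sketch below only mimics that stage-to-stage interface: I substitute a toy per-class temporal smoothing for the real stages, which are stacks of dilated 1-D convolutions with residual connections; all names here are illustrative assumptions.

```python
import numpy as np

def refine_stage(probs, kernel=5):
    """Toy refinement 'stage': moving-average smoothing of each class's
    score over time, then renormalization. Real MS-TCN stages are learned
    dilated 1-D conv stacks; this only illustrates the interface."""
    k = np.ones(kernel) / kernel
    smoothed = np.stack(
        [np.convolve(probs[:, c], k, mode="same") for c in range(probs.shape[1])],
        axis=1,
    )
    return smoothed / smoothed.sum(axis=1, keepdims=True)

def multi_stage(frame_scores, n_stages=4):
    """Stage 1 predicts from frame features; each later stage consumes the
    previous stage's per-frame class probabilities and refines them."""
    probs = frame_scores / frame_scores.sum(axis=1, keepdims=True)
    for _ in range(n_stages - 1):
        probs = refine_stage(probs)
    return probs
```

During training, MS-TCN applies a loss at every stage's output, not just the last one, so each stage is pushed toward the ground truth.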
Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization
TL;DR: This work identifies two underexplored problems posed by the weak supervision for temporal action localization, namely action completeness modeling and action-context separation, and proposes a multi-branch neural network in which branches are enforced to discover distinctive action parts.
FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding
TL;DR: FineGym is a new dataset built on top of gymnasium videos that provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy and systematically investigates different methods on this dataset and obtains a number of interesting findings.
COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis
TL;DR: The COIN dataset contains 11,827 videos of 180 tasks in 12 domains (e.g., vehicles, gadgets) related to daily life; all videos are annotated with a series of step descriptions and their corresponding temporal boundaries.
References
Maximum likelihood from incomplete data via the EM algorithm
Feature Pyramid Networks for Object Detection
TL;DR: This paper exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost and achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles.
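The FPN summary above hinges on one mechanism: a top-down pathway that upsamples coarser feature maps and adds finer-resolution lateral connections at each level. Below is a minimal numpy sketch of that merge under stated simplifications: real FPN passes each lateral through a 1x1 conv to unify channel width and smooths each sum with a 3x3 conv, both omitted here, so the inputs must already share a channel dimension. Function names are my own.

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbour upsampling along the first (spatial) axis."""
    return np.repeat(x, 2, axis=0)

def fpn_topdown(c2, c3, c4):
    """Top-down pathway with additive lateral connections.
    c2..c4 are backbone feature maps, finest (c2) to coarsest (c4),
    each half the resolution of the previous one."""
    p4 = c4                      # coarsest level passes straight through
    p3 = c3 + upsample2(p4)      # upsample coarse semantics, add lateral
    p2 = c2 + upsample2(p3)
    return p2, p3, p4
```

TCFPN in the main paper applies the same pyramid idea along the temporal axis of a video rather than the spatial axes of an image.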
Learning realistic human actions from movies
TL;DR: A new method for video classification is presented that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, and is shown to improve state-of-the-art results on the standard KTH action dataset.
Actions in context
TL;DR: This paper automatically discovers relevant scene classes and their correlation with human actions, shows how to learn selected scene classes from video without manual supervision, and develops a joint framework for action and scene recognition that demonstrates improved recognition of both in natural video.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
TL;DR: A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summary of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
Related Papers (5)
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Joao Carreira, Andrew Zisserman +1 more