Open Access · Posted Content

Speech2Action: Cross-modal Supervision for Action Recognition

TLDR
This work trains a BERT-based Speech2Action classifier on over a thousand movie screenplays to predict action labels from transcribed speech segments, and applies this model to the speech segments of a large unlabelled movie corpus.
Abstract
Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions as well as contain the speech of characters, and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single manually labelled action example.
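The pipeline described above can be sketched in a few lines: a text classifier fine-tuned on screenplay speech/action pairs is run over transcribed speech segments, and only confident predictions are kept as weak labels for the corresponding video clips. The snippet below is an illustrative sketch only; the checkpoint path, label set, and confidence threshold are assumptions, not the authors' released code or values.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ACTION_LABELS = ["phone", "drive", "run", "eat", "dance"]  # hypothetical subset of action classes
CONFIDENCE_THRESHOLD = 0.9  # keep only high-confidence predictions as weak labels (assumed value)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/speech2action-checkpoint",   # hypothetical fine-tuned Speech2Action checkpoint
    num_labels=len(ACTION_LABELS),
)
model.eval()

def weak_label(speech_segment: str):
    """Return (action_label, confidence) or None if the prediction is below threshold."""
    inputs = tokenizer(speech_segment, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    conf, idx = probs.max(dim=-1)
    if conf.item() < CONFIDENCE_THRESHOLD:
        return None  # discard ambiguous speech segments
    return ACTION_LABELS[idx.item()], conf.item()

print(weak_label("Keep your hands on the wheel and your eyes on the road."))

Clips that receive a confident label would then be used as weak supervision for a standard video classifier, which is where the reported gains on action recognition benchmarks come from.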


Citations
Journal Article

Multi-modal Self-Supervision from Generalized Data Transformations

TL;DR: The framework of Generalized Data Transformations is introduced to reduce several recent self-supervised learning objectives to a single formulation for ease of comparison, analysis, and extension. The framework allows a choice between being invariant or distinctive to data transformations, yielding different supervisory signals, and derives the conditions that combinations of transformations must obey in order to lead to well-posed learning objectives.
Posted Content

Human Action Recognition from Various Data Modalities: A Review

TL;DR: This paper reviews both hand-crafted feature-based and deep learning-based methods for single data modalities, as well as methods based on multiple modalities, including fusion-based frameworks and co-learning-based approaches for HAR.
Proceedings Article

Labelling unlabelled videos from scratch with multi-modal self-supervision

TL;DR: It is shown that unsupervised labelling of a video dataset does not come for free from strong feature encoders. A novel clustering method is proposed that allows pseudo-labelling of the video dataset without any human annotations by leveraging the natural correspondence between the audio and visual modalities.
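As an illustration of the general idea of pseudo-labelling from multi-modal features, the sketch below clusters concatenated audio and visual embeddings with plain k-means and treats the cluster ids as labels. The paper proposes its own multi-modal clustering method, so this generic substitute, together with the feature extractors and cluster count, is an assumption.

import numpy as np
from sklearn.cluster import KMeans

num_clips, num_pseudo_classes = 1000, 50
video_emb = np.random.randn(num_clips, 512)   # placeholder visual features per clip
audio_emb = np.random.randn(num_clips, 128)   # placeholder audio features per clip

# Cluster the joint representation; cluster ids act as pseudo-labels that can
# supervise a classifier without any human annotation.
joint = np.concatenate([video_emb, audio_emb], axis=1)
pseudo_labels = KMeans(n_clusters=num_pseudo_classes, n_init=10).fit_predict(joint)
print(pseudo_labels[:10])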
Proceedings Article

Towards Long-Form Video Understanding

TL;DR: In this article, the authors introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets and show that existing state-of-the-art short-term models are limited for long-term tasks.
Journal Article

Multimodal Learning with Transformers: A Survey

TL;DR: A comprehensive survey of Transformer techniques oriented at multimodal data and a discussion of open problems and potential research directions for the community are presented.
References
Proceedings Article

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT, as presented in this paper, pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
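A minimal sketch of that fine-tuning recipe, assuming the Hugging Face transformers library: the pre-trained encoder is kept as-is and a single linear output layer is added on top of the pooled sentence representation. The number of classes and the example inputs and labels are placeholders.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # The only task-specific parameters: one additional linear output layer.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(outputs.pooler_output)  # [CLS]-based sentence representation

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertClassifier(num_classes=3)
batch = tokenizer(["he drew his gun", "pass the salt"], padding=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 2]))  # labels are illustrative
loss.backward()  # fine-tunes the encoder and the new output layer jointly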
Proceedings Article

Non-local Neural Networks

TL;DR: This paper proposes the non-local operation, which computes the response at a position as a weighted sum of the features at all positions and can be used to capture long-range dependencies.
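A compact sketch of the non-local operation in its embedded-Gaussian form: every position attends to every other position, and the response is a softmax-weighted sum of value embeddings added back to the input through a residual connection. The channel sizes and layer choices below follow common practice rather than the paper's exact implementation.

import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        inner = channels // 2
        self.theta = nn.Conv1d(channels, inner, kernel_size=1)  # query embedding
        self.phi = nn.Conv1d(channels, inner, kernel_size=1)    # key embedding
        self.g = nn.Conv1d(channels, inner, kernel_size=1)      # value embedding
        self.out = nn.Conv1d(inner, channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, positions), positions = T*H*W flattened
        q, k, v = self.theta(x), self.phi(x), self.g(x)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # pairwise similarities f(x_i, x_j)
        y = (attn @ v.transpose(1, 2)).transpose(1, 2)        # weighted sum over all positions
        return x + self.out(y)                                # residual connection

x = torch.randn(2, 64, 8 * 7 * 7)   # e.g. 8 frames of 7x7 feature maps, flattened
print(NonLocalBlock(64)(x).shape)    # torch.Size([2, 64, 392])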
Posted Content

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

TL;DR: GNMT, Google's Neural Machine Translation system, is presented; it attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Proceedings Article

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

TL;DR: In this article, a Two-Stream Inflated 3D ConvNet (I3D) is proposed to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and their parameters.
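The "inflation" idea behind I3D can be illustrated with a short helper that turns a pretrained 2D convolution into a 3D one by repeating its kernel along the temporal axis and rescaling, so that a video of identical frames initially produces the same activations as the 2D network. This is an illustrative sketch under that assumption, not the authors' conversion code.

import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    # Repeat the 2D weights across time and divide by time_dim to preserve activation scale.
    weight3d = conv2d.weight.data.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.data.copy_(weight3d)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)   # e.g. an ImageNet-pretrained stem
conv3d = inflate_conv2d(conv2d)
print(conv3d(torch.randn(1, 3, 16, 112, 112)).shape)            # (1, 64, 16, 56, 56)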