Open Access · Posted Content

Speech2Action: Cross-modal Supervision for Action Recognition

TLDR
This work trains a BERT-based Speech2Action classifier on over a thousand movie screenplays to predict action labels from transcribed speech segments, and applies this model to the speech segments of a large unlabelled movie corpus.
Abstract
Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions as well as contain the speech of characters, and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single manually labelled action example.
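The pipeline described above can be sketched in a few lines: a text classifier fine-tuned on screenplay speech/action pairs is run over transcribed speech segments, and only confident predictions are kept as weak labels for the corresponding video clips. The snippet below is an illustrative sketch only; the checkpoint path, label set, and confidence threshold are assumptions, not the authors' released code or values.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ACTION_LABELS = ["phone", "drive", "run", "eat", "dance"]  # hypothetical subset of action classes
CONFIDENCE_THRESHOLD = 0.9  # keep only high-confidence predictions as weak labels (assumed value)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/speech2action-checkpoint",   # hypothetical fine-tuned Speech2Action checkpoint
    num_labels=len(ACTION_LABELS),
)
model.eval()

def weak_label(speech_segment: str):
    """Return (action_label, confidence) or None if the prediction is below threshold."""
    inputs = tokenizer(speech_segment, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    conf, idx = probs.max(dim=-1)
    if conf.item() < CONFIDENCE_THRESHOLD:
        return None  # discard ambiguous speech segments
    return ACTION_LABELS[idx.item()], conf.item()

print(weak_label("Keep your hands on the wheel and your eyes on the road."))

Clips that receive a confident label would then be used as weak supervision for a standard video classifier, which is where the reported gains on action recognition benchmarks come from.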


Citations
Journal Article

Multi-modal Self-Supervision from Generalized Data Transformations

TL;DR: The framework of Generalized Data Transformations is introduced to reduce several recent self-supervised learning objectives to a single formulation for ease of comparison, analysis, and extension. The framework allows a choice between being invariant or distinctive to data transformations, yielding different supervisory signals, and derives the conditions that combinations of transformations must obey in order to lead to well-posed learning objectives.
Posted Content

Human Action Recognition from Various Data Modalities: A Review

TL;DR: This paper reviews both hand-crafted feature-based and deep learning-based methods for single data modalities, as well as methods based on multiple modalities, including fusion-based frameworks and co-learning-based approaches for HAR.
Proceedings Article

Labelling unlabelled videos from scratch with multi-modal self-supervision

TL;DR: It is shown that unsupervised labelling of a video dataset does not come for free from strong feature encoders. A novel clustering method is proposed that allows pseudo-labelling of the video dataset without any human annotations by leveraging the natural correspondence between the audio and visual modalities.
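As an illustration of the general idea of pseudo-labelling from multi-modal features, the sketch below clusters concatenated audio and visual embeddings with plain k-means and treats the cluster ids as labels. The paper proposes its own multi-modal clustering method, so this generic substitute, together with the feature extractors and cluster count, is an assumption.

import numpy as np
from sklearn.cluster import KMeans

num_clips, num_pseudo_classes = 1000, 50
video_emb = np.random.randn(num_clips, 512)   # placeholder visual features per clip
audio_emb = np.random.randn(num_clips, 128)   # placeholder audio features per clip

# Cluster the joint representation; cluster ids act as pseudo-labels that can
# supervise a classifier without any human annotation.
joint = np.concatenate([video_emb, audio_emb], axis=1)
pseudo_labels = KMeans(n_clusters=num_pseudo_classes, n_init=10).fit_predict(joint)
print(pseudo_labels[:10])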
Proceedings Article

Towards Long-Form Video Understanding

TL;DR: In this article, the authors introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets and show that existing state-of-the-art short-term models are limited for long-term tasks.
Journal Article

Multimodal Learning with Transformers: A Survey

TL;DR: A comprehensive survey of Transformer techniques oriented at multimodal data and a discussion of open problems and potential research directions for the community are presented.
References
Proceedings Article

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT, as presented in this paper, pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
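A minimal sketch of that fine-tuning recipe, assuming the Hugging Face transformers library: the pre-trained encoder is kept as-is and a single linear output layer is added on top of the pooled sentence representation. The number of classes and the example inputs and labels are placeholders.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # The only task-specific parameters: one additional linear output layer.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(outputs.pooler_output)  # [CLS]-based sentence representation

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertClassifier(num_classes=3)
batch = tokenizer(["he drew his gun", "pass the salt"], padding=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 2]))  # labels are illustrative
loss.backward()  # fine-tunes the encoder and the new output layer jointly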
Proceedings Article

Non-local Neural Networks

TL;DR: This paper proposes the non-local operation, which computes the response at a position as a weighted sum of the features at all positions and can be used to capture long-range dependencies.
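A compact sketch of the non-local operation in its embedded-Gaussian form: every position attends to every other position, and the response is a softmax-weighted sum of value embeddings added back to the input through a residual connection. The channel sizes and layer choices below follow common practice rather than the paper's exact implementation.

import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        inner = channels // 2
        self.theta = nn.Conv1d(channels, inner, kernel_size=1)  # query embedding
        self.phi = nn.Conv1d(channels, inner, kernel_size=1)    # key embedding
        self.g = nn.Conv1d(channels, inner, kernel_size=1)      # value embedding
        self.out = nn.Conv1d(inner, channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, positions), positions = T*H*W flattened
        q, k, v = self.theta(x), self.phi(x), self.g(x)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # pairwise similarities f(x_i, x_j)
        y = (attn @ v.transpose(1, 2)).transpose(1, 2)        # weighted sum over all positions
        return x + self.out(y)                                # residual connection

x = torch.randn(2, 64, 8 * 7 * 7)   # e.g. 8 frames of 7x7 feature maps, flattened
print(NonLocalBlock(64)(x).shape)    # torch.Size([2, 64, 392])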
Posted Content

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

TL;DR: GNMT, Google's Neural Machine Translation system, is presented; it attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Proceedings Article

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

TL;DR: In this article, a Two-Stream Inflated 3D ConvNet (I3D) is proposed to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and their parameters.
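The "inflation" idea behind I3D can be illustrated with a short helper that turns a pretrained 2D convolution into a 3D one by repeating its kernel along the temporal axis and rescaling, so that a video of identical frames initially produces the same activations as the 2D network. This is an illustrative sketch under that assumption, not the authors' conversion code.

import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    # Repeat the 2D weights across time and divide by time_dim to preserve activation scale.
    weight3d = conv2d.weight.data.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.data.copy_(weight3d)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)   # e.g. an ImageNet-pretrained stem
conv3d = inflate_conv2d(conv2d)
print(conv3d(torch.randn(1, 3, 16, 112, 112)).shape)            # (1, 64, 16, 56, 56)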