Self-Supervised Learning of Class Embeddings from Video
Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
TLDR
This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information in the form of landmarks, and demonstrates quantitatively and experimentally that the learned embeddings do indeed generalise.
Abstract
This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information in the form of landmarks. At train time, two frames of the same video of an object class (e.g. human upper body) are extracted and each encoded to an embedding. Conditioned on these embeddings, the decoder network is tasked to transform one frame into another. To successfully perform long range transformations (e.g. a wrist lowered in one image should be mapped to the same wrist raised in another), we introduce a new hierarchical probabilistic network decoder model. Once trained, the embedding can be used for a variety of downstream tasks and domains. We demonstrate our approach quantitatively on three distinct deformable object classes - human full bodies, upper bodies, faces - and show experimentally that the learned embeddings do indeed generalise. They achieve state-of-the-art performance in comparison to other self-supervised methods trained on the same datasets, and approach the performance of fully supervised methods.
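The training setup the abstract describes (encode two frames of the same video, then reconstruct one frame conditioned on both embeddings) can be sketched very loosely in numpy. All weights, shapes, and the linear encoder/decoder below are invented for illustration; the paper's actual model is a deep hierarchical probabilistic decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(frame, W):
    """Toy encoder: flatten the frame and project it to an embedding."""
    return np.tanh(W @ frame.ravel())

def decoder(src_emb, tgt_emb, frame_src, V):
    """Toy conditional decoder: predicts the target frame from the source
    frame plus both embeddings (a linear stand-in, not the paper's model)."""
    cond = np.concatenate([src_emb, tgt_emb])
    residual = (V @ cond).reshape(frame_src.shape)
    return frame_src + residual

# Two 8x8 "frames" of the same object in different poses (synthetic data).
frame_a, frame_b = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
W = rng.standard_normal((16, 64)) * 0.1   # encoder weights (made up sizes)
V = rng.standard_normal((64, 32)) * 0.1   # decoder weights (made up sizes)

emb_a, emb_b = encoder(frame_a, W), encoder(frame_b, W)
pred_b = decoder(emb_a, emb_b, frame_a, V)

# Self-supervised photometric loss: no labels, only frame reconstruction.
loss = np.mean((pred_b - frame_b) ** 2)
print(pred_b.shape, loss >= 0.0)
```

The key point is that the supervisory signal is the target frame itself, so the embedding is forced to carry the pose information needed to map one frame onto the other.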
Citations
Proceedings ArticleDOI
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
TL;DR: This article showed that the emotional content of speech correlates with the facial expression of the speaker, and that this correlation allows emotion recognition to be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation.
Posted Content
Memory-augmented Dense Predictive Coding for Video Representation Learning
TL;DR: A new architecture and learning framework Memory-augmented Dense Predictive Coding (MemDPC) is proposed for the self-supervised learning from video, in particular for representations for action recognition, trained with a predictive attention mechanism over the set of compressed memories.
Book ChapterDOI
Memory-Augmented Dense Predictive Coding for Video Representation Learning
TL;DR: In this article, the authors propose a self-supervised learning framework for action recognition from video, which is trained with a predictive attention mechanism over the set of compressed memories, such that any future states can always be constructed by a convex combination of the condensed representations, allowing to make multiple hypotheses efficiently.
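The "convex combination of condensed representations" idea can be sketched as softmax attention over a fixed memory bank. Slot counts and dimensions below are made up, and this is a numpy illustration of the mechanism, not MemDPC's implementation.

```python
import numpy as np

def memory_readout(query, memory):
    """Predictive attention over a set of compressed memories: the predicted
    future state is a convex combination (softmax weights) of the slots."""
    logits = memory @ query                       # one score per slot
    logits -= logits.max()                        # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    return weights @ memory, weights              # convex combination

rng = np.random.default_rng(1)
memory = rng.standard_normal((64, 128))           # 64 slots, 128-d (made up)
query = rng.standard_normal(128)
state, w = memory_readout(query, memory)

# Weights are non-negative and sum to one, so the predicted state always
# lies inside the convex hull of the memory slots.
print(state.shape, np.isclose(w.sum(), 1.0), bool((w >= 0).all()))
```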
Book ChapterDOI
A survey on automatic multimodal emotion recognition in the wild
Garima Sharma, Abhinav Dhall +1 more
TL;DR: This chapter gives a detailed overview of different emotion recognition techniques and the predominantly used signal modalities, and presents a thorough comparison of standard emotion-labelled databases.
Journal ArticleDOI
Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward
TL;DR: This article presents a comprehensive review and detailed analysis of existing tools and machine learning (ML) based approaches for deepfake generation, and of the methodologies used to detect such manipulations in both audio and video.
References
Proceedings Article
Auto-Encoding Variational Bayes
Diederik P. Kingma, Max Welling
TL;DR: A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.
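The core trick of Auto-Encoding Variational Bayes is the reparameterization z = mu + sigma * eps, which makes the sampled latent differentiable with respect to the posterior parameters; for a diagonal Gaussian posterior the KL term of the objective is available in closed form. A minimal numpy sketch (dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    so gradients can flow through mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.zeros(4)
log_var = np.zeros(4)        # unit variance: the KL is exactly zero
z = reparameterize(mu, log_var)
print(z.shape, kl_to_standard_normal(mu, log_var))
```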
Proceedings Article
An iterative image registration technique with an application to stereo vision
Bruce D. Lucas, Takeo Kanade
TL;DR: In this paper, the spatial intensity gradient of the images is used to find a good match using a type of Newton-Raphson iteration, which can be generalized to handle rotation, scaling and shearing.
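The Newton-Raphson idea from the TL;DR can be shown on the simplest case, a one-dimensional translation: each step uses the spatial intensity gradient of the warped signal to update the shift estimate. The signals below are synthetic, and the full method generalizes the same update to rotation, scaling, and shearing.

```python
import numpy as np

def lucas_kanade_shift(template, image, d0=0.0, iters=10):
    """1-D Lucas-Kanade: estimate the translation d that aligns image(x + d)
    with template(x), via gradient-based Newton-Raphson updates."""
    x = np.arange(len(template), dtype=float)
    d = d0
    for _ in range(iters):
        warped = np.interp(x + d, x, image)     # current alignment guess
        grad = np.gradient(warped)              # spatial intensity gradient
        err = template - warped
        denom = np.sum(grad * grad)
        if denom < 1e-12:
            break
        d += np.sum(grad * err) / denom         # Newton-Raphson style step
    return d

# Synthetic test: a Gaussian bump whose peak is displaced by 2 samples.
x = np.arange(64, dtype=float)
template = np.exp(-0.5 * ((x - 30) / 4.0) ** 2)
image = np.exp(-0.5 * ((x - 28) / 4.0) ** 2)

# Warping image(x + d) onto the template should recover d of about -2.
d = lucas_kanade_shift(template, image)
print(round(d, 2))
```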
Proceedings Article
Spatial transformer networks
TL;DR: This work introduces a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network, and can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps.
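The heart of a Spatial Transformer is building a sampling grid from a transformation's parameters and bilinearly sampling the input at those grid locations, both of which are differentiable. A numpy sketch of that grid-and-sample step for a 2x3 affine matrix (the module in the paper also includes a localisation network that predicts theta, omitted here):

```python
import numpy as np

def affine_grid_sample(img, theta):
    """Build a normalized sampling grid from the 2x3 affine matrix theta
    and bilinearly sample img at the resulting source coordinates."""
    H, W = img.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    sx, sy = theta @ coords                     # source coords in [-1, 1]
    # Map normalized coordinates back to pixel indices, clipped to the image.
    px = np.clip((sx + 1) * 0.5 * (W - 1), 0, W - 1)
    py = np.clip((sy + 1) * 0.5 * (H - 1), 0, H - 1)
    x0, y0 = np.floor(px).astype(int), np.floor(py).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = px - x0, py - y0
    out = (img[y0, x0] * (1 - wx) * (1 - wy) + img[y0, x1] * wx * (1 - wy)
           + img[y1, x0] * (1 - wx) * wy + img[y1, x1] * wx * wy)
    return out.reshape(H, W)

img = np.arange(16.0).reshape(4, 4)
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
out = affine_grid_sample(img, identity)
print(np.allclose(out, img))    # identity transform reproduces the input
```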
Proceedings ArticleDOI
Curriculum learning
TL;DR: It is hypothesized that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
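The continuation-method view can be illustrated with a pacing rule: a difficulty threshold that starts low (only easy examples are trained on) and rises until the full training distribution is used. The linear thresholding rule below is a made-up illustration, not a schedule from the paper.

```python
def curriculum_weights(difficulties, progress):
    """Include only examples whose difficulty is below a threshold that
    grows with training progress in [0, 1] (continuation-style schedule)."""
    lo, hi = min(difficulties), max(difficulties)
    threshold = lo + progress * (hi - lo)
    return [1.0 if d <= threshold else 0.0 for d in difficulties]

diffs = [0.1, 0.4, 0.7, 1.0]
print(curriculum_weights(diffs, 0.0))   # early training: easiest example only
print(curriculum_weights(diffs, 1.0))   # end of training: everything included
```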
Proceedings ArticleDOI
FlowNet: Learning Optical Flow with Convolutional Networks
Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox
TL;DR: In this paper, the authors propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations, and show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI.
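The layer that "correlates feature vectors at different image locations" computes, for each position in one feature map, dot products against the other feature map's vectors over a displacement window. A toy single-scale numpy version (the sizes are made up, and FlowNetC's real layer adds striding and runs on learned convolutional features):

```python
import numpy as np

def correlation(f1, f2, max_disp=1):
    """For each position in f1, dot its feature vector with f2's features
    at every displacement within max_disp; one output channel per shift."""
    H, W, C = f1.shape
    D = 2 * max_disp + 1
    out = np.zeros((H, W, D * D))
    pad = np.pad(f2, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)))
    k = 0
    for dy in range(D):
        for dx in range(D):
            shifted = pad[dy:dy + H, dx:dx + W]
            out[:, :, k] = np.sum(f1 * shifted, axis=-1)
            k += 1
    return out

rng = np.random.default_rng(0)
f1, f2 = rng.standard_normal((6, 6, 8)), rng.standard_normal((6, 6, 8))
vol = correlation(f1, f2, max_disp=2)
print(vol.shape)   # (6, 6, 25): one channel per displacement
```

The zero-displacement channel is simply the per-pixel dot product of the two feature maps; the surrounding channels give the matching cost for each candidate motion.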
Related Papers (5)
Self-supervised Spatiotemporal Feature Learning by Video Geometric Transformations
Longlong Jing, Yingli Tian +1 more