Open Access · Proceedings ArticleDOI

Self-Supervised Learning of Class Embeddings from Video

TL;DR: This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information in the form of landmarks, and demonstrates quantitatively and experimentally that the learned embeddings do indeed generalise.
Abstract
This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information in the form of landmarks. At train time, two frames of the same video of an object class (e.g. human upper body) are extracted and each encoded to an embedding. Conditioned on these embeddings, the decoder network is tasked to transform one frame into another. To successfully perform long range transformations (e.g. a wrist lowered in one image should be mapped to the same wrist raised in another), we introduce a new hierarchical probabilistic network decoder model. Once trained, the embedding can be used for a variety of downstream tasks and domains. We demonstrate our approach quantitatively on three distinct deformable object classes - human full bodies, upper bodies, faces - and show experimentally that the learned embeddings do indeed generalise. They achieve state-of-the-art performance in comparison to other self-supervised methods trained on the same datasets, and approach the performance of fully supervised methods.
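The training setup the abstract describes, two frames of the same video encoded to embeddings, with a decoder conditioned on both embeddings tasked to transform one frame into the other, can be sketched as a toy example. This is a minimal illustration only: simple linear maps stand in for the paper's deep encoder and hierarchical probabilistic decoder, and all dimensions, weights, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration; the real model uses image tensors).
FRAME_DIM = 64   # flattened frame
EMBED_DIM = 8    # class-specific pose/shape embedding size

# Random linear maps stand in for the trained encoder and decoder networks.
W_enc = rng.normal(size=(EMBED_DIM, FRAME_DIM)) * 0.1
W_dec = rng.normal(size=(FRAME_DIM, 2 * EMBED_DIM)) * 0.1

def encode(frame):
    """Map a frame to its pose/shape embedding."""
    return W_enc @ frame

def decode(src_embed, tgt_embed):
    """Reconstruct the target frame, conditioned on both embeddings."""
    return W_dec @ np.concatenate([src_embed, tgt_embed])

# Two frames extracted from the same video of one object instance.
frame_a = rng.normal(size=FRAME_DIM)
frame_b = rng.normal(size=FRAME_DIM)

# Self-supervised objective: transform frame_a into frame_b.
recon = decode(encode(frame_a), encode(frame_b))
loss = np.mean((recon - frame_b) ** 2)  # reconstruction error drives training
```

No labels are needed: the target frame itself supervises the reconstruction, which is what lets the embedding be learned from raw video.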



Citations
Proceedings ArticleDOI

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

TL;DR: This article shows that the emotional content of speech correlates with the facial expression of the speaker, and that this emotional content can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation.
Posted Content

Memory-augmented Dense Predictive Coding for Video Representation Learning

TL;DR: A new architecture and learning framework Memory-augmented Dense Predictive Coding (MemDPC) is proposed for the self-supervised learning from video, in particular for representations for action recognition, trained with a predictive attention mechanism over the set of compressed memories.
Book ChapterDOI

Memory-Augmented Dense Predictive Coding for Video Representation Learning

TL;DR: In this article, the authors propose a self-supervised learning framework for action recognition from video, which is trained with a predictive attention mechanism over the set of compressed memories, such that any future states can always be constructed by a convex combination of the condensed representations, allowing to make multiple hypotheses efficiently.
Book ChapterDOI

A survey on automatic multimodal emotion recognition in the wild

TL;DR: This chapter gives a detailed overview of different emotion recognition techniques and the predominantly used signal modalities and a thorough comparison of standard emotion labelled databases is presented.
Journal ArticleDOI

Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward

04 Jun 2022
TL;DR: This article presents a comprehensive review and detailed analysis of existing tools and machine learning (ML) based approaches for deepfake generation, together with the methodologies used to detect such manipulations in both audio and video.
References
Proceedings Article

Auto-Encoding Variational Bayes

TL;DR: A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.
Proceedings Article

An iterative image registration technique with an application to stereo vision

TL;DR: In this paper, the spatial intensity gradient of the images is used to find a good match using a type of Newton-Raphson iteration, which can be generalized to handle rotation, scaling and shearing.
Proceedings Article

Spatial transformer networks

TL;DR: This work introduces a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network, and can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps.
Proceedings ArticleDOI

Curriculum learning

TL;DR: It is hypothesized that curriculum learning affects both the speed of convergence of the training process to a minimum and the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
Proceedings ArticleDOI

FlowNet: Learning Optical Flow with Convolutional Networks

TL;DR: In this paper, the authors propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations, and show that networks trained on unrealistic synthetic data still generalize very well to existing datasets such as Sintel and KITTI.