Self-Supervised Learning of Class Embeddings from Video
Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
TLDR
This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information in the form of landmarks, and demonstrates quantitatively and experimentally that the learned embeddings do indeed generalise.
Abstract
This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information in the form of landmarks. At train time, two frames of the same video of an object class (e.g. human upper body) are extracted and each encoded to an embedding. Conditioned on these embeddings, the decoder network is tasked to transform one frame into another. To successfully perform long range transformations (e.g. a wrist lowered in one image should be mapped to the same wrist raised in another), we introduce a new hierarchical probabilistic network decoder model. Once trained, the embedding can be used for a variety of downstream tasks and domains. We demonstrate our approach quantitatively on three distinct deformable object classes - human full bodies, upper bodies, faces - and show experimentally that the learned embeddings do indeed generalise. They achieve state-of-the-art performance in comparison to other self-supervised methods trained on the same datasets, and approach the performance of fully supervised methods.
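The training setup the abstract describes (encode two frames of the same video, then reconstruct one frame conditioned on both embeddings) can be sketched very loosely in numpy. All weights, shapes, and the linear encoder/decoder below are invented for illustration; the paper's actual model is a deep hierarchical probabilistic decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(frame, W):
    """Toy encoder: flatten the frame and project it to an embedding."""
    return np.tanh(W @ frame.ravel())

def decoder(src_emb, tgt_emb, frame_src, V):
    """Toy conditional decoder: predicts the target frame from the source
    frame plus both embeddings (a linear stand-in, not the paper's model)."""
    cond = np.concatenate([src_emb, tgt_emb])
    residual = (V @ cond).reshape(frame_src.shape)
    return frame_src + residual

# Two 8x8 "frames" of the same object in different poses (synthetic data).
frame_a, frame_b = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
W = rng.standard_normal((16, 64)) * 0.1   # encoder weights (made up sizes)
V = rng.standard_normal((64, 32)) * 0.1   # decoder weights (made up sizes)

emb_a, emb_b = encoder(frame_a, W), encoder(frame_b, W)
pred_b = decoder(emb_a, emb_b, frame_a, V)

# Self-supervised photometric loss: no labels, only frame reconstruction.
loss = np.mean((pred_b - frame_b) ** 2)
print(pred_b.shape, loss >= 0.0)
```

The key point is that the supervisory signal is the target frame itself, so the embedding is forced to carry the pose information needed to map one frame onto the other.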
Citations
Proceedings ArticleDOI
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
TL;DR: This article showed that the emotional content of speech correlates with the facial expression of the speaker, and that this correlation allows emotion recognition to be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation.
Posted Content
Memory-augmented Dense Predictive Coding for Video Representation Learning
TL;DR: A new architecture and learning framework Memory-augmented Dense Predictive Coding (MemDPC) is proposed for the self-supervised learning from video, in particular for representations for action recognition, trained with a predictive attention mechanism over the set of compressed memories.
Book ChapterDOI
Memory-Augmented Dense Predictive Coding for Video Representation Learning
TL;DR: In this article, the authors propose a self-supervised learning framework for action recognition from video, which is trained with a predictive attention mechanism over the set of compressed memories, such that any future states can always be constructed by a convex combination of the condensed representations, allowing to make multiple hypotheses efficiently.
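The "convex combination of condensed representations" idea can be sketched as softmax attention over a fixed memory bank. Slot counts and dimensions below are made up, and this is a numpy illustration of the mechanism, not MemDPC's implementation.

```python
import numpy as np

def memory_readout(query, memory):
    """Predictive attention over a set of compressed memories: the predicted
    future state is a convex combination (softmax weights) of the slots."""
    logits = memory @ query                       # one score per slot
    logits -= logits.max()                        # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    return weights @ memory, weights              # convex combination

rng = np.random.default_rng(1)
memory = rng.standard_normal((64, 128))           # 64 slots, 128-d (made up)
query = rng.standard_normal(128)
state, w = memory_readout(query, memory)

# Weights are non-negative and sum to one, so the predicted state always
# lies inside the convex hull of the memory slots.
print(state.shape, np.isclose(w.sum(), 1.0), bool((w >= 0).all()))
```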
Book ChapterDOI
A survey on automatic multimodal emotion recognition in the wild
Garima Sharma, Abhinav Dhall +1 more
TL;DR: This chapter gives a detailed overview of different emotion recognition techniques and the predominantly used signal modalities, and presents a thorough comparison of standard emotion-labelled databases.
Journal ArticleDOI
Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward
TL;DR: This article presents a comprehensive review and detailed analysis of existing tools and machine learning (ML) based approaches for deepfake generation, and of the methodologies used to detect such manipulations in both audio and video.
References
Proceedings Article
Auto-Encoding Variational Bayes
Diederik P. Kingma, Max Welling
TL;DR: A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.
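The core trick of Auto-Encoding Variational Bayes is the reparameterization z = mu + sigma * eps, which makes the sampled latent differentiable with respect to the posterior parameters; for a diagonal Gaussian posterior the KL term of the objective is available in closed form. A minimal numpy sketch (dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    so gradients can flow through mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.zeros(4)
log_var = np.zeros(4)        # unit variance: the KL is exactly zero
z = reparameterize(mu, log_var)
print(z.shape, kl_to_standard_normal(mu, log_var))
```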
Proceedings Article
An iterative image registration technique with an application to stereo vision
Bruce D. Lucas, Takeo Kanade
TL;DR: In this paper, the spatial intensity gradient of the images is used to find a good match using a type of Newton-Raphson iteration, which can be generalized to handle rotation, scaling and shearing.
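The Newton-Raphson idea from the TL;DR can be shown on the simplest case, a one-dimensional translation: each step uses the spatial intensity gradient of the warped signal to update the shift estimate. The signals below are synthetic, and the full method generalizes the same update to rotation, scaling, and shearing.

```python
import numpy as np

def lucas_kanade_shift(template, image, d0=0.0, iters=10):
    """1-D Lucas-Kanade: estimate the translation d that aligns image(x + d)
    with template(x), via gradient-based Newton-Raphson updates."""
    x = np.arange(len(template), dtype=float)
    d = d0
    for _ in range(iters):
        warped = np.interp(x + d, x, image)     # current alignment guess
        grad = np.gradient(warped)              # spatial intensity gradient
        err = template - warped
        denom = np.sum(grad * grad)
        if denom < 1e-12:
            break
        d += np.sum(grad * err) / denom         # Newton-Raphson style step
    return d

# Synthetic test: a Gaussian bump whose peak is displaced by 2 samples.
x = np.arange(64, dtype=float)
template = np.exp(-0.5 * ((x - 30) / 4.0) ** 2)
image = np.exp(-0.5 * ((x - 28) / 4.0) ** 2)

# Warping image(x + d) onto the template should recover d of about -2.
d = lucas_kanade_shift(template, image)
print(round(d, 2))
```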
Proceedings Article
Spatial transformer networks
TL;DR: This work introduces a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network, and can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps.
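The heart of a Spatial Transformer is building a sampling grid from a transformation's parameters and bilinearly sampling the input at those grid locations, both of which are differentiable. A numpy sketch of that grid-and-sample step for a 2x3 affine matrix (the module in the paper also includes a localisation network that predicts theta, omitted here):

```python
import numpy as np

def affine_grid_sample(img, theta):
    """Build a normalized sampling grid from the 2x3 affine matrix theta
    and bilinearly sample img at the resulting source coordinates."""
    H, W = img.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    sx, sy = theta @ coords                     # source coords in [-1, 1]
    # Map normalized coordinates back to pixel indices, clipped to the image.
    px = np.clip((sx + 1) * 0.5 * (W - 1), 0, W - 1)
    py = np.clip((sy + 1) * 0.5 * (H - 1), 0, H - 1)
    x0, y0 = np.floor(px).astype(int), np.floor(py).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = px - x0, py - y0
    out = (img[y0, x0] * (1 - wx) * (1 - wy) + img[y0, x1] * wx * (1 - wy)
           + img[y1, x0] * (1 - wx) * wy + img[y1, x1] * wx * wy)
    return out.reshape(H, W)

img = np.arange(16.0).reshape(4, 4)
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
out = affine_grid_sample(img, identity)
print(np.allclose(out, img))    # identity transform reproduces the input
```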
Proceedings ArticleDOI
Curriculum learning
TL;DR: It is hypothesized that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
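The continuation-method view can be illustrated with a pacing rule: a difficulty threshold that starts low (only easy examples are trained on) and rises until the full training distribution is used. The linear thresholding rule below is a made-up illustration, not a schedule from the paper.

```python
def curriculum_weights(difficulties, progress):
    """Include only examples whose difficulty is below a threshold that
    grows with training progress in [0, 1] (continuation-style schedule)."""
    lo, hi = min(difficulties), max(difficulties)
    threshold = lo + progress * (hi - lo)
    return [1.0 if d <= threshold else 0.0 for d in difficulties]

diffs = [0.1, 0.4, 0.7, 1.0]
print(curriculum_weights(diffs, 0.0))   # early training: easiest example only
print(curriculum_weights(diffs, 1.0))   # end of training: everything included
```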
Proceedings ArticleDOI
FlowNet: Learning Optical Flow with Convolutional Networks
Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox
TL;DR: In this paper, the authors propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations, and show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI.
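The layer that "correlates feature vectors at different image locations" computes, for each position in one feature map, dot products against the other feature map's vectors over a displacement window. A toy single-scale numpy version (the sizes are made up, and FlowNetC's real layer adds striding and runs on learned convolutional features):

```python
import numpy as np

def correlation(f1, f2, max_disp=1):
    """For each position in f1, dot its feature vector with f2's features
    at every displacement within max_disp; one output channel per shift."""
    H, W, C = f1.shape
    D = 2 * max_disp + 1
    out = np.zeros((H, W, D * D))
    pad = np.pad(f2, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)))
    k = 0
    for dy in range(D):
        for dx in range(D):
            shifted = pad[dy:dy + H, dx:dx + W]
            out[:, :, k] = np.sum(f1 * shifted, axis=-1)
            k += 1
    return out

rng = np.random.default_rng(0)
f1, f2 = rng.standard_normal((6, 6, 8)), rng.standard_normal((6, 6, 8))
vol = correlation(f1, f2, max_disp=2)
print(vol.shape)   # (6, 6, 25): one channel per displacement
```

The zero-displacement channel is simply the per-pixel dot product of the two feature maps; the surrounding channels give the matching cost for each candidate motion.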
Related Papers (5)
Self-supervised Spatiotemporal Feature Learning by Video Geometric Transformations
Longlong Jing, Yingli Tian +1 more