Look, Listen and Learn

doi:10.1109/ICCV.2017.73

Proceedings Article

- pp 609-617

TLDR

This paper identifies a valuable but so far untapped source of information contained in video itself, the correspondence between the visual and the audio streams, and introduces a novel “Audio-Visual Correspondence” learning task that makes use of it.

Abstract:

We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself – the correspondence between the visual and the audio streams, and we introduce a novel “Audio-Visual Correspondence” learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.
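As a rough illustration of how such a correspondence task could be set up as binary classification, the sketch below samples training pairs from unlabelled videos. It is not the authors' implementation: the `Video` type, the string identifiers standing in for frames and audio clips, the 50/50 positive/negative split, and the choice to draw mismatched audio from a different video are all assumptions made for the example.

```python
import random
from typing import List, Tuple

# Hypothetical stand-in for real data: each video is a list of
# time-aligned (frame_id, audio_id) segments.
Video = List[Tuple[str, str]]


def sample_avc_pair(videos: List[Video], rng: random.Random) -> Tuple[str, str, int]:
    """Return (frame, audio, label); label 1 = corresponding, 0 = mismatched.

    Assumes at least two videos, so a mismatched audio clip can always be
    drawn from a different video than the frame.
    """
    vi = rng.randrange(len(videos))
    frame, audio = rng.choice(videos[vi])
    if rng.random() < 0.5:
        # Positive pair: frame and audio come from the same segment.
        return frame, audio, 1
    # Negative pair: keep the frame, take audio from another video.
    other = rng.choice([i for i in range(len(videos)) if i != vi])
    _, other_audio = rng.choice(videos[other])
    return frame, other_audio, 0


if __name__ == "__main__":
    rng = random.Random(0)
    toy = [[(f"v{v}_f{t}", f"v{v}_a{t}") for t in range(3)] for v in range(4)]
    for _ in range(5):
        print(sample_avc_pair(toy, rng))
```

No manual labels are needed here: the label comes for free from whether the two streams were taken from the same place, which is what lets raw video act as its own supervision. A visual network and an audio network would then be trained jointly to predict this label.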

Citations

Proceedings Article

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu, +3 more

TL;DR: ViLBERT extends the BERT architecture to a multi-modal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers.

Proceedings Article

Self-Supervised Learning of Pretext-Invariant Representations

Ishan Misra, +1 more

TL;DR: This work develops Pretext-Invariant Representation Learning (PIRL), which learns invariant representations based on pretext tasks. PIRL substantially improves the semantic quality of the learned image representations and sets a new state of the art in self-supervised learning from images.

Posted Content

Data-Efficient Image Recognition with Contrastive Predictive Coding

Olivier J. Hénaff, +6 more

- 22 May 2019 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This work revisits and improves Contrastive Predictive Coding, an unsupervised objective for learning representations that make the variability in natural signals more predictable, and produces features that support state-of-the-art linear classification accuracy on the ImageNet dataset.

Journal Article

Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey

Longlong Jing, +1 more

- 01 Nov 2021 -

IEEE Transactions on Pattern Analysis an...

TL;DR: This survey provides an extensive review of deep learning-based self-supervised methods for learning general visual features from images or videos. As a subset of unsupervised learning, these methods learn general image and video features from large-scale unlabeled data without using any human-annotated labels.

Book Chapter

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Andrew Owens, +1 more

TL;DR: In this paper, the authors argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and they propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.

References

Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: The authors train a deep convolutional neural network that achieves state-of-the-art performance. It consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

Proceedings Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, +1 more

TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

Journal ArticleDOI

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, +11 more

- 01 Dec 2015 -

International Journal of Computer Vision

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images. It has been run annually from 2010 to present, attracting participation from more than fifty institutions.

Related Papers (5)

Deep Residual Learning for Image Recognition

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Audio Set: An ontology and human-labeled dataset for audio events

Adam: A Method for Stochastic Optimization

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset