Proceedings ArticleDOI
Learning Words by Drawing Images
Dídac Surís, Adrià Recasens, David Bau, David Harwath, James Glass, Antonio Torralba
pp. 2029–2038
TL;DR: Training that takes advantage of GAN-generated edited examples improves the model's ability to learn attributes compared to previous results.
Abstract:
We propose a framework for learning through drawing. Our goal is to learn the correspondence between spoken words and abstract visual attributes, from a dataset of spoken descriptions of images. Building upon recent findings that GAN representations can be manipulated to edit semantic concepts in the generated output, we propose a new method to use such GAN-generated images to train a model using a triplet loss. To apply the method, we develop Audio CLEVRGAN, a new dataset of audio descriptions of GAN-generated CLEVR images, and we describe a training procedure that creates a curriculum of GAN-generated images that focuses training on image pairs that differ in a specific, informative way. Training is done without additional supervision beyond the spoken captions and the GAN. We find that training that takes advantage of GAN-generated edited examples results in improvements in the model's ability to learn attributes compared to previous results. Our proposed learning framework also results in models that can associate spoken words with some abstract visual concepts such as color and size.
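The triplet-loss idea in the abstract can be sketched in a few lines. Everything here is illustrative: the embeddings are toy vectors, whereas the actual model embeds raw spoken captions and images with learned neural encoders, and the negative is a GAN-edited image differing in one attribute.

```python
# Minimal sketch of the triplet-loss setup described in the abstract.
# An audio-caption embedding should score higher (dot product) with its
# matching image than with a GAN-edited negative image.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, margin - s(a, p) + s(a, n))."""
    return max(0.0, margin - dot(anchor, positive) + dot(anchor, negative))

# Toy embeddings: the caption matches the positive image more strongly.
caption = [1.0, 0.0]
image_pos = [0.9, 0.1]   # original image
image_neg = [0.1, 0.9]   # hypothetical GAN-edited image (one attribute changed)

loss = triplet_loss(caption, image_pos, image_neg)   # ≈ 0.2
```

The curriculum described in the abstract would then control which edited negatives are presented at each stage of training.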
Citations
Posted Content
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
TL;DR: This work introduces Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs, and performs analysis of AVLnet's learned representations, showing the model has learned to relate visual objects with salient words and natural sounds.
Journal ArticleDOI
Adversarial text-to-image synthesis: A review.
TL;DR: This paper surveys state-of-the-art adversarial text-to-image synthesis models, contextualizing their development since their inception five years ago and proposing a taxonomy based on the level of supervision.
Posted Content
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
TL;DR: This article used a discriminative, multimodal grounding objective which forces the learned units to be useful for semantic image retrieval, achieving a 27.3% reduction in ABX error rate over the top-performing submission, while keeping the bitrate approximately the same.
Proceedings Article
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
TL;DR: This paper presents a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech and shows that this method is capable of capturing both word-level and sub-word units, depending on how it is configured.
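The vector-quantization step mentioned in the summary above can be illustrated with a nearest-neighbor codebook lookup. The codebook and feature values here are made up; in the cited work the codebook is learned jointly with the speech model.

```python
# Sketch of a vector-quantization layer: map a continuous feature vector
# to the index and entry of its nearest codebook vector (L2 distance).

def quantize(feature, codebook):
    """Return (index, entry) of the codebook vector nearest to `feature`."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sqdist(feature, codebook[i]))
    return idx, codebook[idx]

# Illustrative 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
idx, entry = quantize([0.9, 1.1], codebook)   # nearest entry is [1.0, 1.0]
```

Discretizing features this way is what lets the model expose word-level and sub-word units, depending on where the quantization layers are placed.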
Proceedings ArticleDOI
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Mathew Monfort, SouYoung Jin, Alexander H. Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva
TL;DR: The Spoken Moments (S-MiT) dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of events, is presented, along with a novel Adaptive Mean Margin (AMM) approach to contrastive learning; the authors' models are evaluated on video/caption retrieval across multiple datasets.
References
Journal ArticleDOI
Generative Adversarial Nets
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
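The adversarial objective summarized above can be sketched numerically. This is a deliberately stripped-down illustration: `d_real` and `d_fake` stand in for discriminator outputs in (0, 1), and the networks and optimization loop are omitted entirely.

```python
import math

def discriminator_loss(d_real, d_fake):
    """D's loss: negative of its objective log D(x) + log(1 - D(G(z)))."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss -log D(G(z)), commonly used in practice."""
    return -math.log(d_fake)

# At the theoretical equilibrium, D outputs 0.5 everywhere, giving
# discriminator loss -2 * log(0.5) = 2 * log 2.
eq_loss = discriminator_loss(0.5, 0.5)
```

In training, the two losses are minimized in alternation, which is what makes the process adversarial.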
Proceedings ArticleDOI
Curriculum learning
TL;DR: It is hypothesized that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
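A toy rendering of the curriculum idea from the summary above: order training examples by a difficulty score and present the easiest first. The example names and scores are purely illustrative; real curricula derive difficulty from task-specific signals.

```python
# Hypothetical (example_id, difficulty) pairs; lower means easier.
examples = [("pair_c", 0.9), ("pair_a", 0.2), ("pair_b", 0.5)]

def curriculum_order(examples):
    """Sort training examples from easiest to hardest."""
    return [name for name, difficulty in sorted(examples, key=lambda e: e[1])]

order = curriculum_order(examples)   # ["pair_a", "pair_b", "pair_c"]
```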
Proceedings Article
Progressive Growing of GANs for Improved Quality, Stability, and Variation
TL;DR: A new training methodology for GANs is proposed that grows both the generator and discriminator progressively, starting from a low resolution and adding new layers that model increasingly fine details as training progresses.
Proceedings ArticleDOI
FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks
TL;DR: The concept of end-to-end learning of optical flow is advanced and made to work really well, and faster variants are presented that allow optical flow computation at up to 140 fps with accuracy matching the original FlowNet.
Proceedings Article
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
TL;DR: Deep convolutional generative adversarial networks (DCGANs) are introduced for unsupervised representation learning, learning a hierarchy of representations from object parts to scenes in both the generator and discriminator.