Proceedings ArticleDOI
Learning Words by Drawing Images
Dídac Surís, Adrià Recasens, David Bau, David Harwath, James Glass, Antonio Torralba
pp. 2029–2038
TL;DR: Training that takes advantage of GAN-generated edited examples improves the model's ability to learn attributes compared to previous results.
Abstract:
We propose a framework for learning through drawing. Our goal is to learn the correspondence between spoken words and abstract visual attributes, from a dataset of spoken descriptions of images. Building upon recent findings that GAN representations can be manipulated to edit semantic concepts in the generated output, we propose a new method to use such GAN-generated images to train a model using a triplet loss. To apply the method, we develop Audio CLEVRGAN, a new dataset of audio descriptions of GAN-generated CLEVR images, and we describe a training procedure that creates a curriculum of GAN-generated images that focuses training on image pairs that differ in a specific, informative way. Training is done without additional supervision beyond the spoken captions and the GAN. We find that training that takes advantage of GAN-generated edited examples results in improvements in the model's ability to learn attributes compared to previous results. Our proposed learning framework also results in models that can associate spoken words with some abstract visual concepts such as color and size.
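The triplet-loss idea in the abstract can be sketched in a few lines. Everything here is illustrative: the embeddings are toy vectors, whereas the actual model embeds raw spoken captions and images with learned neural encoders, and the negative is a GAN-edited image differing in one attribute.

```python
# Minimal sketch of the triplet-loss setup described in the abstract.
# An audio-caption embedding should score higher (dot product) with its
# matching image than with a GAN-edited negative image.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, margin - s(a, p) + s(a, n))."""
    return max(0.0, margin - dot(anchor, positive) + dot(anchor, negative))

# Toy embeddings: the caption matches the positive image more strongly.
caption = [1.0, 0.0]
image_pos = [0.9, 0.1]   # original image
image_neg = [0.1, 0.9]   # hypothetical GAN-edited image (one attribute changed)

loss = triplet_loss(caption, image_pos, image_neg)   # ≈ 0.2
```

The curriculum described in the abstract would then control which edited negatives are presented at each stage of training.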
Citations
Posted Content
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
TL;DR: This work introduces Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs, and performs analysis of AVLnet's learned representations, showing the model has learned to relate visual objects with salient words and natural sounds.
Journal ArticleDOI
Adversarial text-to-image synthesis: A review.
TL;DR: This paper surveys state-of-the-art adversarial text-to-image synthesis models, contextualizing their development since their inception five years ago and proposing a taxonomy based on the level of supervision.
Posted Content
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
TL;DR: This article used a discriminative, multimodal grounding objective which forces the learned units to be useful for semantic image retrieval, achieving a 27.3% reduction in ABX error rate over the top-performing submission, while keeping the bitrate approximately the same.
Proceedings Article
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
TL;DR: This paper presents a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech and shows that this method is capable of capturing both word-level and sub-word units, depending on how it is configured.
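The vector-quantization step mentioned in the summary above can be illustrated with a nearest-neighbor codebook lookup. The codebook and feature values here are made up; in the cited work the codebook is learned jointly with the speech model.

```python
# Sketch of a vector-quantization layer: map a continuous feature vector
# to the index and entry of its nearest codebook vector (L2 distance).

def quantize(feature, codebook):
    """Return (index, entry) of the codebook vector nearest to `feature`."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sqdist(feature, codebook[i]))
    return idx, codebook[idx]

# Illustrative 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
idx, entry = quantize([0.9, 1.1], codebook)   # nearest entry is [1.0, 1.0]
```

Discretizing features this way is what lets the model expose word-level and sub-word units, depending on where the quantization layers are placed.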
Proceedings ArticleDOI
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Mathew Monfort, SouYoung Jin, Alexander H. Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva
TL;DR: The Spoken Moments (S-MiT) dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of events, is presented, along with a novel Adaptive Mean Margin (AMM) approach to contrastive learning; the authors' models are evaluated on video/caption retrieval across multiple datasets.
References
Journal ArticleDOI
Generative Adversarial Nets
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
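The adversarial objective summarized above can be sketched numerically. This is a deliberately stripped-down illustration: `d_real` and `d_fake` stand in for discriminator outputs in (0, 1), and the networks and optimization loop are omitted entirely.

```python
import math

def discriminator_loss(d_real, d_fake):
    """D's loss: negative of its objective log D(x) + log(1 - D(G(z)))."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss -log D(G(z)), commonly used in practice."""
    return -math.log(d_fake)

# At the theoretical equilibrium, D outputs 0.5 everywhere, giving
# discriminator loss -2 * log(0.5) = 2 * log 2.
eq_loss = discriminator_loss(0.5, 0.5)
```

In training, the two losses are minimized in alternation, which is what makes the process adversarial.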
Proceedings ArticleDOI
Curriculum learning
TL;DR: It is hypothesized that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
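A toy rendering of the curriculum idea from the summary above: order training examples by a difficulty score and present the easiest first. The example names and scores are purely illustrative; real curricula derive difficulty from task-specific signals.

```python
# Hypothetical (example_id, difficulty) pairs; lower means easier.
examples = [("pair_c", 0.9), ("pair_a", 0.2), ("pair_b", 0.5)]

def curriculum_order(examples):
    """Sort training examples from easiest to hardest."""
    return [name for name, difficulty in sorted(examples, key=lambda e: e[1])]

order = curriculum_order(examples)   # ["pair_a", "pair_b", "pair_c"]
```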
Proceedings Article
Progressive Growing of GANs for Improved Quality, Stability, and Variation
TL;DR: A new training methodology for GANs is proposed that grows both the generator and discriminator progressively, starting from a low resolution and adding new layers that model increasingly fine details as training progresses.
Proceedings ArticleDOI
FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks
TL;DR: The concept of end-to-end learning of optical flow is advanced and made to work really well, and faster variants are presented that allow optical flow computation at up to 140 fps with accuracy matching the original FlowNet.
Proceedings Article
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
TL;DR: Deep convolutional generative adversarial networks (DCGANs) are introduced for unsupervised representation learning, learning a hierarchy of representations from object parts to scenes in both the generator and discriminator.