Proceedings ArticleDOI

Learning Words by Drawing Images

TL;DR: Training that takes advantage of GAN-generated edited examples improves the model's ability to learn attributes compared to previous results.
Abstract
We propose a framework for learning through drawing. Our goal is to learn the correspondence between spoken words and abstract visual attributes, from a dataset of spoken descriptions of images. Building upon recent findings that GAN representations can be manipulated to edit semantic concepts in the generated output, we propose a new method to use such GAN-generated images to train a model using a triplet loss. To apply the method, we develop Audio CLEVRGAN, a new dataset of audio descriptions of GAN-generated CLEVR images, and we describe a training procedure that creates a curriculum of GAN-generated images that focuses training on image pairs that differ in a specific, informative way. Training is done without additional supervision beyond the spoken captions and the GAN. We find that training that takes advantage of GAN-generated edited examples results in improvements in the model's ability to learn attributes compared to previous results. Our proposed learning framework also results in models that can associate spoken words with some abstract visual concepts such as color and size.
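The abstract describes training with a triplet loss over spoken captions and GAN-generated image pairs that differ in one informative attribute. As a rough illustration of how such a loss behaves, here is a minimal plain-Python sketch; the dot-product similarity, the margin value, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss over embedding vectors (illustrative sketch).

    Pushes the anchor-positive similarity above the anchor-negative
    similarity by at least `margin`; returns 0 once the margin is met.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    sim_pos = dot(anchor, positive)  # e.g. caption vs. matching image
    sim_neg = dot(anchor, negative)  # e.g. caption vs. edited image
    return max(0.0, margin - sim_pos + sim_neg)
```

In the setting the abstract describes, the anchor would be the embedding of a spoken caption, the positive the matching GAN image, and the negative a GAN-edited version of that image differing in a specific attribute (say, color), so the loss focuses learning on exactly that attribute.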



Citations
Posted Content

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

TL;DR: This work introduces Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs, and performs analysis of AVLnet's learned representations, showing the model has learned to relate visual objects with salient words and natural sounds.
Journal ArticleDOI

Adversarial text-to-image synthesis: A review.

TL;DR: This paper reviews state-of-the-art adversarial text-to-image synthesis models, contextualizing their development since their inception five years ago and proposing a taxonomy based on the level of supervision.
Posted Content

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

TL;DR: This article uses a discriminative, multimodal grounding objective which forces the learned units to be useful for semantic image retrieval, achieving a 27.3% reduction in ABX error rate over the top-performing submission while keeping the bitrate approximately the same.
Proceedings Article

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

TL;DR: This paper presents a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech and shows that this method is capable of capturing both word-level and sub-word units, depending on how it is configured.
Proceedings ArticleDOI

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

TL;DR: The Spoken Moments (S-MiT) dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of events, is presented, along with a novel Adaptive Mean Margin (AMM) approach to contrastive learning; the authors' models are evaluated on video/caption retrieval across multiple datasets.
References
Journal ArticleDOI

Generative Adversarial Nets

TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
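The adversarial process summarized above is commonly written as a two-player minimax game between G and D (this is the standard formulation from the original paper, reproduced here for reference):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] +
  \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
```

D is trained to assign high probability to real samples x and low probability to generated samples G(z), while G is trained to fool D.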
Proceedings ArticleDOI

Curriculum learning

TL;DR: It is hypothesized that curriculum learning affects both the speed of convergence of the training process to a minimum and the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
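The core idea of curriculum learning is to present easy examples first and gradually admit harder ones. A toy sketch of that scheduling, where the `difficulty` scoring function and the fixed-stage widening scheme are illustrative assumptions rather than the cited paper's exact procedure:

```python
def curriculum_batches(examples, difficulty, n_stages=3, batch_size=2):
    """Yield (stage, batch) pairs that gradually admit harder examples.

    `difficulty` maps an example to a score (lower = easier). Stage k
    trains on the easiest k/n_stages fraction of the data, so the pool
    widens toward the full dataset as training progresses.
    """
    ranked = sorted(examples, key=difficulty)
    for stage in range(1, n_stages + 1):
        # Admit the easiest stage/n_stages fraction of the data.
        cutoff = max(1, (len(ranked) * stage) // n_stages)
        pool = ranked[:cutoff]
        for i in range(0, len(pool), batch_size):
            yield stage, pool[i:i + batch_size]
```

In the context of the main paper above, a curriculum of GAN-generated images plays an analogous role: training is focused on image pairs whose single-attribute difference makes them informative at the current stage.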
Proceedings Article

Progressive Growing of GANs for Improved Quality, Stability, and Variation

TL;DR: The authors propose a new training methodology for GANs that grows both the generator and discriminator progressively, starting from a low resolution and adding new layers that model increasingly fine details as training progresses.
Proceedings ArticleDOI

FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks

TL;DR: The concept of end-to-end learning of optical flow is advanced and shown to work well; faster variants are presented that allow optical flow computation at up to 140 fps with accuracy matching the original FlowNet.
Proceedings Article

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

TL;DR: Deep convolutional generative adversarial networks (DCGANs) learn a hierarchy of representations, from object parts to scenes, in both the generator and discriminator for unsupervised representation learning.