Open Access · Posted Content

Visually Indicated Sounds

TLDR
In this article, the authors use a recurrent neural network to predict sound features from videos and then produce a waveform from these features with an example-based synthesis procedure, showing that the sounds predicted by their model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and convey significant information about material properties and physical interactions.
Abstract
Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. In this paper, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We present an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick. This algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We show that the sounds predicted by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that they convey significant information about material properties and physical interactions.
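The abstract describes a two-stage pipeline: a recurrent network maps per-frame visual features to sound features, and an example-based synthesis step retrieves waveform snippets whose sound features best match the predictions. The following is a minimal sketch of that idea, not the paper's implementation: all function names and weight matrices are hypothetical, a plain tanh RNN cell stands in for the paper's LSTM, and the retrieval step is a simple nearest-neighbor lookup over exemplar features.

```python
import numpy as np

def predict_sound_features(frame_features, W_h, W_x, W_out):
    """Toy recurrent predictor: maps per-frame visual features to
    sound-feature vectors, one step per video frame.
    (Stand-in for the paper's LSTM-based model.)"""
    h = np.zeros(W_h.shape[0])
    preds = []
    for x in frame_features:
        h = np.tanh(W_h @ h + W_x @ x)  # simple RNN cell
        preds.append(W_out @ h)
    return np.stack(preds)

def example_based_synthesis(pred_features, exemplar_features, exemplar_waveforms):
    """Example-based synthesis as nearest-neighbor retrieval: for each
    predicted sound-feature vector, pick the training exemplar with the
    closest features and concatenate the matching waveform snippets."""
    snippets = []
    for f in pred_features:
        dists = np.linalg.norm(exemplar_features - f, axis=1)
        snippets.append(exemplar_waveforms[np.argmin(dists)])
    return np.concatenate(snippets)
```

The retrieval step sidesteps direct waveform regression: the network only has to predict compact sound features, and real recorded audio supplies the fine-grained waveform detail.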


Citations
Posted Content

Image-to-Image Translation with Conditional Adversarial Networks

TL;DR: Conditional adversarial networks, as discussed by the authors, are a general-purpose solution to image-to-image translation problems, effective for synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.
Posted Content

SoundNet: Learning Sound Representations from Unlabeled Video

TL;DR: In this article, the authors leverage the natural synchronization between vision and sound to learn an acoustic representation from two million unlabeled videos, proposing a student-teacher training procedure that transfers discriminative visual knowledge from well-established visual recognition models into the sound modality, using unlabeled video as a bridge.
Posted Content

Time-Contrastive Networks: Self-Supervised Learning from Video

TL;DR: A self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints is proposed; the authors demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can serve as a reward function within a reinforcement learning algorithm.
Proceedings ArticleDOI

Generating the Future with Adversarial Transformers

TL;DR: This work presents a model that generates the future by transforming pixels in the past, and explicitly disentangles the model's memory from the prediction, which helps the model learn desirable invariances.
Proceedings ArticleDOI

Learning Aligned Cross-Modal Representations from Weakly Aligned Data

TL;DR: The experiments suggest that the scene representation can help transfer representations across modalities for retrieval, and the visualizations suggest that units emerge in the shared representation that tend to activate on consistent concepts independently of the modality.
References
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Posted Content

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

TL;DR: Batch Normalization as mentioned in this paper normalizes layer inputs for each training mini-batch to reduce the internal covariate shift in deep neural networks, and achieves state-of-the-art performance on ImageNet.
Posted Content

Caffe: Convolutional Architecture for Fast Feature Embedding

TL;DR: Caffe as discussed by the authors is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.