Vision-Infused Deep Audio Inpainting

doi:10.1109/ICCV.2019.00037

Open Access, Proceedings Article

Hang Zhou, +4 more

- pp 283-292

TLDR

This work considers a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that correspond to their accompanying videos. The synthesized segments are coherent with their video counterparts, showing the effectiveness of the proposed Vision-Infused Audio Inpainter (VIAI).

Abstract:

Multi-modality perception is essential to developing interactive intelligence. In this work, we consider a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that correspond to their accompanying videos. We identify two key aspects for a successful inpainter: (1) It is desirable to operate on spectrograms instead of raw audio, so that recent advances in deep semantic image inpainting can be leveraged to go beyond the limitations of traditional audio inpainting. (2) To synthesize visually indicated audio, a visual-audio joint feature space needs to be learned with synchronization of audio and video. To facilitate a large-scale study, we collect a new multi-modality instrument-playing dataset called MUSIC-Extra-Solo (MUSICES) by enriching the MUSIC dataset. Extensive experiments demonstrate that our framework is capable of inpainting realistic and varying audio segments with or without visual contexts. More importantly, our synthesized audio segments are coherent with their video counterparts, showing the effectiveness of our proposed Vision-Infused Audio Inpainter (VIAI).
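As a rough illustration of point (1) in the abstract, the sketch below shows how a missing audio segment can be represented as a masked block of spectrogram frames, which is the kind of input an image-style inpainter would fill in. It is not the authors' VIAI pipeline: the function names, the frame length, the hop size, and the test tone are all assumptions chosen for the example, and a naive DFT stands in for a proper FFT-based STFT so that only the standard library is needed.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Return a list of frames; each frame is a list of |DFT| magnitudes.

    Hypothetical helper: Hann window plus a naive DFT, standard library only.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(n_bins):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def mask_time_segment(spec, first_frame, last_frame):
    """Zero out frames first_frame..last_frame-1 and return (masked, mask).

    mask[t] is 1 for frames to be inpainted and 0 for known frames.
    """
    masked = [list(frame) for frame in spec]
    mask = [0] * len(spec)
    for t in range(first_frame, min(last_frame, len(spec))):
        masked[t] = [0.0] * len(spec[t])
        mask[t] = 1
    return masked, mask


# Assumed input: a 440 Hz tone sampled at 8 kHz (illustrative values only).
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(512)]
spec = magnitude_spectrogram(tone)
masked_spec, mask = mask_time_segment(spec, first_frame=5, last_frame=9)
```

An inpainting model would take `masked_spec` and `mask` (plus, in the vision-infused setting, features from the accompanying video frames) and predict the zeroed frames; that model is outside the scope of this sketch.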

Citations

Proceedings Article

Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation

Hang Zhou, +5 more

TL;DR: In this article, a pose code is learned in a modulated convolution-based reconstruction framework to generate pose-controllable talking faces with audio-visual modality modularization.

Proceedings Article

Listen to Look: Action Recognition by Previewing Audio

Ruohan Gao, +3 more

TL;DR: In this article, an attention-based long short-term memory network was proposed to iteratively select useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition.

Posted Content

Listen to Look: Action Recognition by Previewing Audio

Ruohan Gao, +3 more

- 10 Dec 2019 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: A framework for efficient action recognition in untrimmed video is proposed that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. An ImgAud2Vid framework is devised that hallucinates clip-level features by distilling from lighter modalities, reducing short-term temporal redundancy for efficient video-level recognition.

Proceedings Article

Rotate-and-Render: Unsupervised Photorealistic Face Rotation From Single-View Images

Hang Zhou, +4 more

TL;DR: In this paper, the authors propose a novel unsupervised framework that can synthesize photo-realistic rotated faces using only single-view image collections in the wild, which can serve as a strong self-supervision.

Book Chapter

Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation

Hang Zhou, +4 more

TL;DR: This work integrates both stereo generation and source separation into a unified framework, Sep-Stereo, by considering source separation as a particular type of audio spatialization, and proposes a novel associative pyramid network architecture carefully designed for audio-visual feature fusion.

References

Proceedings Article

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.

Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Journal Article

Image quality assessment: from error visibility to structural similarity

Zhou Wang, +3 more

- 01 Apr 2004 -

IEEE Transactions on Image Processing

TL;DR: In this article, a structural similarity index is proposed for image quality assessment based on the degradation of structural information, and it is compared against both subjective ratings and objective methods on a database of images compressed with JPEG and JPEG2000.

Journal Article

Generative Adversarial Nets

Ian Goodfellow, +7 more

TL;DR: A new framework for estimating generative models via an adversarial process is proposed, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.

Proceedings Article

Image-to-Image Translation with Conditional Adversarial Networks

Phillip Isola, +3 more

TL;DR: Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems and it is demonstrated that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.

Related Papers (5)

The Sound of Pixels

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Deep Residual Learning for Image Recognition

Look, Listen and Learn

Objects that Sound