Open Access · Proceedings Article

Vision-Infused Deep Audio Inpainting

TLDR
This work considers a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that correspond to their accompanying videos, and shows that the proposed Vision-Infused Audio Inpainter (VIAI) produces audio segments that are coherent with their video counterparts.
Abstract
Multi-modality perception is essential to developing interactive intelligence. In this work, we consider a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that correspond to their accompanying videos. We identify two key aspects for a successful inpainter: (1) It is desirable to operate on spectrograms instead of raw audio. Recent advances in deep semantic image inpainting could be leveraged to go beyond the limitations of traditional audio inpainting. (2) To synthesize visually indicated audio, a visual-audio joint feature space needs to be learned with synchronization of audio and video. To facilitate a large-scale study, we collect a new multi-modality instrument-playing dataset called MUSIC-Extra-Solo (MUSICES) by enriching the MUSIC dataset. Extensive experiments demonstrate that our framework is capable of inpainting realistic and varying audio segments with or without visual contexts. More importantly, our synthesized audio segments are coherent with their video counterparts, showing the effectiveness of our proposed Vision-Infused Audio Inpainter (VIAI).
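
The abstract's first point, operating on spectrograms rather than raw audio, can be illustrated with a toy setup. The sketch below is not the VIAI pipeline, which this page does not describe; it only shows, under assumed frame sizes and with hypothetical helper names (`magnitude_spectrogram`, `mask_time_span`), how a waveform becomes a time-frequency image and how a missing segment becomes a masked block of frames that an inpainter would then have to fill.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_len=32, hop=16):
    """Return a list of frames, each holding frame_len // 2 + 1 magnitudes."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT over the non-negative frequency bins (fine for a toy signal).
        mags = []
        for k in range(frame_len // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def mask_time_span(spec, start_frame, end_frame):
    """Zero out frames in [start_frame, end_frame); return (masked_spec, mask)."""
    mask = [0 if start_frame <= t < end_frame else 1 for t in range(len(spec))]
    masked = [[value * m for value in frame] for frame, m in zip(spec, mask)]
    return masked, mask


# Toy input: a 1 kHz sine sampled at 8 kHz (both values are assumptions).
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]

spec = magnitude_spectrogram(signal)
# Frames 5..8 play the role of the missing audio segment.
masked_spec, mask = mask_time_span(spec, 5, 9)
```

Treating the masked spectrogram as an image with a rectangular hole is what makes image-inpainting techniques applicable; a real system would also need to recover phase or use a vocoder to get back to a waveform.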


Citations
Proceedings Article

Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation

TL;DR: In this article, a pose code is learned in a modulated convolution-based reconstruction framework to generate pose-controllable talking faces with audio-visual modality modularization.
Proceedings Article

Listen to Look: Action Recognition by Previewing Audio

TL;DR: In this article, an attention-based long short-term memory network was proposed to iteratively select useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition.
Posted Content

Listen to Look: Action Recognition by Previewing Audio

TL;DR: A framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies is proposed, and an ImgAud2Vid framework is devised that hallucinates clip-level features by distilling from lighter modalities, reducing short-term temporal redundancy for efficient video-level recognition.
Proceedings Article

Rotate-and-Render: Unsupervised Photorealistic Face Rotation From Single-View Images

TL;DR: In this paper, the authors propose a novel unsupervised framework that can synthesize photo-realistic rotated faces using only single-view image collections in the wild, which can serve as a strong self-supervision.
Book Chapter

Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation

TL;DR: This work integrates both stereo generation and source separation into a unified framework, Sep-Stereo, by considering source separation as a particular type of audio spatialization, and proposes a novel associative pyramid network architecture carefully designed for audio-visual feature fusion.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Journal Article

Image quality assessment: from error visibility to structural similarity

TL;DR: In this article, a structural similarity index is proposed for image quality assessment based on the degradation of structural information, and it is compared against both subjective ratings and objective methods on a database of images compressed with JPEG and JPEG2000.
Journal Article

Generative Adversarial Nets

TL;DR: A new framework for estimating generative models via an adversarial process is proposed, in which two models are trained simultaneously: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Proceedings Article

Image-to-Image Translation with Conditional Adversarial Networks

TL;DR: Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems and it is demonstrated that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.