Vision-Infused Deep Audio Inpainting
Hang Zhou, Ziwei Liu, Xudong Xu, Ping Luo, Xiaogang Wang
- pp 283-292
TLDR
This work considers a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that correspond to their accompanying videos. The synthesized segments are coherent with their video counterparts, showing the effectiveness of the proposed Vision-Infused Audio Inpainter (VIAI).
Abstract:
Multi-modality perception is essential to develop interactive intelligence. In this work, we consider a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that correspond to their accompanying videos. We identify two key aspects for a successful inpainter: (1) It is desirable to operate on spectrograms instead of raw audio, so that recent advances in deep semantic image inpainting can be leveraged to go beyond the limitations of traditional audio inpainting. (2) To synthesize visually indicated audio, a visual-audio joint feature space needs to be learned with synchronization of audio and video. To facilitate a large-scale study, we collect a new multi-modality instrument-playing dataset called MUSIC-Extra-Solo (MUSICES) by enriching the MUSIC dataset. Extensive experiments demonstrate that our framework is capable of inpainting realistic and varying audio segments with or without visual contexts. More importantly, our synthesized audio segments are coherent with their video counterparts, showing the effectiveness of our proposed Vision-Infused Audio Inpainter (VIAI).
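As an illustration of point (1), the sketch below shows how an audio clip could be turned into a magnitude spectrogram and how a time segment could be blanked out to form an inpainting input. This is not the authors' pipeline: the function names `magnitude_spectrogram` and `mask_time_segment`, the FFT size, the hop length, and the masked frame range are all assumptions chosen for the example.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=512, hop=128):
    """Hann-windowed magnitude STFT; returns an array of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft over each frame gives the non-negative frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T

def mask_time_segment(spec, start_frame, end_frame):
    """Zero out frames [start_frame, end_frame); returns (masked_spec, mask)."""
    mask = np.ones_like(spec)
    mask[:, start_frame:end_frame] = 0.0
    return spec * mask, mask

# Toy input (assumption): one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)

spec = magnitude_spectrogram(audio)
masked, mask = mask_time_segment(spec, start_frame=40, end_frame=60)
```

The masked array is a 2-D image with a missing vertical band, which is the form of input that image inpainting networks are designed to fill. A model would receive `masked` together with `mask` and predict the missing frames.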
Citations
Proceedings Article
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
TL;DR: In this article, a pose code is learned in a modulated convolution-based reconstruction framework to generate pose-controllable talking faces with audio-visual modality modularization.
Proceedings Article
Listen to Look: Action Recognition by Previewing Audio
TL;DR: In this article, an attention-based long short-term memory network was proposed to iteratively select useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition.
Posted Content
Listen to Look: Action Recognition by Previewing Audio
TL;DR: A framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies is proposed, and an ImgAud2Vid framework is devised that hallucinates clip-level features by distilling from lighter modalities, reducing short-term temporal redundancy for efficient video-level recognition.
Proceedings Article
Rotate-and-Render: Unsupervised Photorealistic Face Rotation From Single-View Images
TL;DR: In this paper, the authors propose a novel unsupervised framework that can synthesize photo-realistic rotated faces using only single-view image collections in the wild, which can serve as a strong self-supervision.
Book Chapter
Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation
TL;DR: This work integrates both stereo generation and source separation into a unified framework, Sep-Stereo, by considering source separation as a particular type of audio spatialization, and proposes a novel associative pyramid network architecture carefully designed for audio-visual feature fusion.