Open Access, Posted Content

Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning

TLDR
A novel arbitrary talking face generation framework is proposed that discovers audio-visual coherence via an Asymmetric Mutual Information Estimator (AMIE). A Dynamic Attention (DA) block selectively focuses on the lip area of the input image during the training stage to further enhance lip synchronization.
Abstract
Talking face generation aims to synthesize, from a given speech clip and facial image, a face video with precise lip synchronization and a smooth transition of facial motion over the entire video. Most existing methods focus on either disentangling the information in a single image or learning temporal information between frames. However, cross-modality coherence between audio and video information has not been well addressed during synthesis. In this paper, we propose a novel arbitrary talking face generation framework that discovers audio-visual coherence via the proposed Asymmetric Mutual Information Estimator (AMIE). In addition, we propose a Dynamic Attention (DA) block that selectively focuses on the lip area of the input image during the training stage to further enhance lip synchronization. Experimental results on the benchmark LRW and GRID datasets show that the method surpasses state-of-the-art methods on prevalent metrics, with robust high-resolution synthesis across gender and pose variations.
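
The abstract does not spell out how AMIE estimates mutual information, so the following is only a minimal sketch of the general idea of scoring audio-visual coherence with a mutual information lower bound. It uses the generic Donsker-Varadhan bound, I(A; V) >= E_joint[T] - log E_marginals[exp(T)], and is not the paper's asymmetric estimator. The function names, the scalar "features", and the fixed product critic are all assumptions made for illustration; in practice the critic T would be a trained network over audio and frame embeddings.

```python
import math


def dv_lower_bound(audio_feats, visual_feats, critic):
    """Donsker-Varadhan lower bound on I(audio; visual) from paired samples.

    Hypothetical helper: audio_feats[i] and visual_feats[i] are assumed to be
    time-aligned, and critic(a, v) returns a scalar score.
    """
    if len(audio_feats) != len(visual_feats) or not audio_feats:
        raise ValueError("need equally sized, non-empty sample lists")
    n = len(audio_feats)

    # Expectation of the critic under the joint: aligned (audio, frame) pairs.
    joint_term = sum(critic(a, v) for a, v in zip(audio_feats, visual_feats)) / n

    # Expectation of exp(critic) under the product of the empirical marginals:
    # every audio sample is paired with every frame.
    scores = [critic(a, v) for a in audio_feats for v in visual_feats]

    # Numerically stable log-mean-exp.
    top = max(scores)
    log_mean_exp = top + math.log(sum(math.exp(s - top) for s in scores) / len(scores))

    return joint_term - log_mean_exp


def product_critic(a, v):
    """Toy critic (an assumption): high when audio and frame features agree in sign."""
    return a * v


# Toy data: perfectly synchronized one-dimensional "features".
audio = [-1.0, 1.0, -1.0, 1.0]
frames = [-1.0, 1.0, -1.0, 1.0]
bound = dv_lower_bound(audio, frames, product_critic)
print(f"DV lower bound on mutual information: {bound:.4f} nats")
```

A critic that ignores its inputs gives a bound of zero, while a critic that rewards aligned pairs gives a positive bound. That gap is the quantity a generator could be trained to increase in order to keep synthesized frames coherent with the driving audio.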


Citations
Posted Content

A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications

TL;DR: This paper reviews various GAN methods from the perspectives of algorithms, theory, and applications, and compares the commonalities and differences of these methods.
Posted Content

Deep Audio-Visual Learning: A Survey

TL;DR: This article presents a comprehensive survey of recent developments in audio-visual learning, dividing current tasks into four subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning.
Book Chapter

Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation

TL;DR: This paper proposes Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which handles detailed local facial semantics and the global head-torso relationship through two semantic-aware modules.
Journal Article

Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

TL;DR: This paper defines two key principles, modality heterogeneity and interconnections, that have driven subsequent innovations, and proposes a taxonomy of six core technical challenges covering historical and recent trends: representation, alignment, reasoning, generation, transference, and quantification.
References
Journal Article

Long short-term memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Book Chapter

U-Net: Convolutional Networks for Biomedical Image Segmentation

TL;DR: This paper proposes a network and training strategy that relies on heavy use of data augmentation to use the available annotated samples more efficiently. The network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.
Journal Article

Image quality assessment: from error visibility to structural similarity

TL;DR: In this article, a structural similarity index is proposed for image quality assessment based on the degradation of structural information, and it is compared with both subjective ratings and objective methods on a database of images compressed with JPEG and JPEG2000.
Journal Article

Generative Adversarial Nets

TL;DR: A new framework is proposed for estimating generative models via an adversarial process, in which two models are trained simultaneously: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Posted Content

U-Net: Convolutional Networks for Biomedical Image Segmentation

TL;DR: It is shown that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.