Journal ArticleDOI

Visual Speech Synthesis Using a Variable-Order Switching Shared Gaussian Process Dynamical Model

TLDR
The quantitative results demonstrate that the proposed method outperforms other state-of-the-art methods in visual speech synthesis and the qualitative results reveal that the synthetic videos are comparable to ground truth in terms of visual perception and intelligibility.
Abstract
In this paper, we present a novel approach to speech-driven facial animation using a non-parametric switching state space model based on Gaussian processes. The model is an extension of the shared Gaussian process dynamical model, augmented with switching states. Two talking head corpora are processed by extracting visual and audio data from the sequences, followed by a parameterization of both data streams. Phonetic labels are obtained by performing forced phonetic alignment on the audio. The switching states are found using a variable-length Markov model trained on the labelled phonetic data. The audio and visual data corresponding to the phonemes matching each switching state are extracted and modelled together using a shared Gaussian process dynamical model. We propose a synthesis method that takes into account both previous and future phonetic context, thus accounting for forward and backward coarticulation in speech. Both objective and subjective evaluation results are presented. The quantitative results demonstrate that the proposed method outperforms other state-of-the-art methods in visual speech synthesis, and the qualitative results reveal that the synthetic videos are comparable to ground truth in terms of visual perception and intelligibility.
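To make the switching mechanism concrete, the sketch below (Python; all names and the back-off table are illustrative assumptions, not the authors' implementation) shows how a variable-length Markov model style lookup can select a switching state from phonetic context: the longest matching context wins, with shorter suffixes as fall-backs.

```python
# Hypothetical sketch of VLMM-style switching-state selection.
# state_table would be learned offline from the force-aligned phonetic data;
# here it is a toy stand-in.

def select_switching_state(context, state_table, max_order=3):
    """Pick a switching state for a phonetic context (most recent label last)."""
    # Prefer the longest available context, backing off to shorter suffixes,
    # mirroring how a VLMM conditions on a variable amount of history.
    for order in range(min(max_order, len(context)), 0, -1):
        suffix = tuple(context[-order:])
        if suffix in state_table:
            return state_table[suffix]
    return state_table.get((), 0)  # default state if nothing matches

# Toy usage: /a/ preceded by /b/ maps to a different state than bare /a/.
table = {('b', 'a'): 4, ('a',): 2, (): 0}
print(select_switching_state(('sil', 'b', 'a'), table))  # -> 4
```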


Citations
Journal ArticleDOI

Audio-driven facial animation by joint end-to-end learning of pose and emotion

TL;DR: This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
Journal ArticleDOI

BLTRCNN-Based 3-D Articulatory Movement Prediction: Learning Articulatory Synchronicity From Both Text and Audio Inputs

TL;DR: This work proposes a new network architecture for articulatory movement prediction from both text and audio inputs, called a bottleneck long-term recurrent convolutional neural network (BLTRCNN), and is, to the authors' knowledge, the first to predict articulatory movements with a DNN by fusing text and audio inputs.
Proceedings ArticleDOI

Sequence-to-sequence articulatory inversion through time convolution of sub-band frequency signals

TL;DR: On a speaker-independent AAI task, it is shown that the convolutional features outperform the original filterbank features and can be combined with phonetic features, which bring independent information to the solution of the problem.
Journal ArticleDOI

Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models

TL;DR: It is argued that a DNN vector-to-vector regression front-end for speech enhancement (DNN-SE) can play a key role in AAI when used to enhance spectral features prior to AAI back-end processing.
Proceedings Article

Revisiting Gaussian process dynamical models

TL;DR: Four new algorithms are presented for learning GPDMs with incomplete training data, together with a new conditional model (CM+) for recovering incomplete test data; both adopt the Bayesian framework and can fully and properly use the partially observed data.
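As background for GPDMs, here is a minimal sketch of the Gaussian-process regression step at their core: predicting the next latent state from the current one using consecutive-pair training data. The RBF kernel, hyperparameters, and toy trajectory are assumptions for illustration, not that paper's algorithms.

```python
import numpy as np

# Minimal GP dynamics sketch: learn x_t -> x_{t+1} in latent space and
# return the posterior mean prediction for a query state (toy settings).

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_next_state(X, x_now, noise=1e-4):
    X_in, X_out = X[:-1], X[1:]                 # consecutive-pair training data
    K = rbf(X_in, X_in) + noise * np.eye(len(X_in))
    k_star = rbf(x_now[None, :], X_in)          # covariance with the query
    return k_star @ np.linalg.solve(K, X_out)   # posterior mean of x_{t+1}

# Toy latent trajectory on a circle; predict the state after the last one.
t = np.linspace(0, 2 * np.pi, 50)
X = np.stack([np.cos(t), np.sin(t)], axis=1)
print(gp_next_state(X, X[-1]))
```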
References
Journal ArticleDOI

Active appearance models

Abstract: We describe a new method of matching statistical models of appearance to images. A set of model parameters controls modes of shape and gray-level variation learned from a training set. We construct an efficient iterative matching algorithm by learning the relationship between perturbations in the model parameters and the induced image errors.
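The core trick of that matching algorithm, a learned linear map from image residuals to parameter corrections, can be sketched as follows; the data is synthetic and the variable names are illustrative.

```python
import numpy as np

# Hedged AAM-style sketch: perturb known parameters, record the induced
# residuals, and fit a linear regression from residuals to corrections.

rng = np.random.default_rng(0)
n_params, n_pixels, n_samples = 5, 200, 400

J = rng.normal(size=(n_pixels, n_params))          # toy residual generator
delta_p = rng.normal(size=(n_samples, n_params))   # known perturbations
residuals = delta_p @ J.T + 0.01 * rng.normal(size=(n_samples, n_pixels))

# Least-squares fit of R such that delta_p ~= residuals @ R.
R, *_ = np.linalg.lstsq(residuals, delta_p, rcond=None)

# One matching iteration: map an observed residual to a parameter update.
r = J @ np.array([0.5, -0.2, 0.0, 0.1, 0.3])
print(r @ R)  # should approximate (0.5, -0.2, 0.0, 0.1, 0.3)
```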
Journal ArticleDOI

Fast and robust fixed-point algorithms for independent component analysis

TL;DR: Using maximum-entropy approximations of differential entropy, a family of new contrast (objective) functions for ICA is derived; these enable both the estimation of the whole decomposition by minimizing mutual information and the estimation of individual independent components as projection-pursuit directions.
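For reference, the one-unit fixed-point update behind FastICA (with the tanh contrast function) looks roughly like this; the sketch assumes the data matrix has already been centred and whitened, and the names are illustrative.

```python
import numpy as np

# One-unit FastICA fixed-point iteration with g = tanh, assuming X
# (dimensions x samples) is centred and whitened beforehand.
def fastica_one_unit(X, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        wx = w @ X
        g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        w_new = (X * g).mean(axis=1) - g_prime.mean() * w   # fixed-point step
        w_new /= np.linalg.norm(w_new)
        if abs(w_new @ w) > 1 - 1e-8:                       # converged (up to sign)
            return w_new
        w = w_new
    return w
```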
Journal ArticleDOI

Hearing lips and seeing voices

TL;DR: The study reported here demonstrates a previously unrecognised influence of vision upon speech perception: on being shown a film of a young woman's talking head in which repeated utterances of the syllable [ba] had been dubbed onto lip movements for [ga], normal adults reported hearing [da].
Proceedings ArticleDOI

A morphable model for the synthesis of 3D faces

TL;DR: A new technique for modeling textured 3D faces by transforming the shape and texture of the examples into a vector space representation, which regulates the naturalness of modeled faces, avoiding faces with an "unlikely" appearance.
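The vector-space idea is easy to illustrate with a toy PCA model: example faces are flattened into vectors, new faces are the mean plus a linear combination of principal modes, and keeping coefficients within the learned distribution is what keeps generated faces plausible. Dimensions and data below are stand-ins for real 3D scans.

```python
import numpy as np

# Toy morphable-model sketch: PCA over flattened example faces, then
# sample mode coefficients from the learned Gaussian to build a new face.

rng = np.random.default_rng(1)
n_faces, n_dims = 20, 300                 # e.g. 100 vertices x 3 coordinates
faces = rng.normal(size=(n_faces, n_dims))

mean_face = faces.mean(axis=0)
U, S, Vt = np.linalg.svd(faces - mean_face, full_matrices=False)
components = Vt[:5]                       # top-5 shape modes
sigma = S[:5] / np.sqrt(n_faces - 1)      # per-mode standard deviations

# Small coefficients (relative to sigma) keep the face "likely".
alpha = rng.normal(size=5) * sigma
new_face = mean_face + alpha @ components
print(new_face.shape)                     # (300,)
```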