Journal ArticleDOI

Visual Speech Synthesis Using a Variable-Order Switching Shared Gaussian Process Dynamical Model

TLDR
The quantitative results demonstrate that the proposed method outperforms other state-of-the-art methods in visual speech synthesis and the qualitative results reveal that the synthetic videos are comparable to ground truth in terms of visual perception and intelligibility.
Abstract
In this paper, we present a novel approach to speech-driven facial animation using a non-parametric switching state space model based on Gaussian processes. The model is an extension of the shared Gaussian process dynamical model, augmented with switching states. Two talking head corpora are processed by extracting visual and audio data from the sequences, followed by a parameterization of both data streams. Phonetic labels are obtained by performing forced phonetic alignment on the audio. The switching states are found using a variable-length Markov model trained on the labelled phonetic data. The audio and visual data corresponding to the phonemes matching each switching state are extracted and modelled together using a shared Gaussian process dynamical model. We propose a synthesis method that takes into account both previous and future phonetic context, thus accounting for forward and backward coarticulation in speech. Both objective and subjective evaluation results are presented. The quantitative results demonstrate that the proposed method outperforms other state-of-the-art methods in visual speech synthesis, and the qualitative results reveal that the synthetic videos are comparable to ground truth in terms of visual perception and intelligibility.
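To make the switching mechanism concrete, the sketch below (Python; all names and the back-off table are illustrative assumptions, not the authors' implementation) shows how a variable-length Markov model style lookup can select a switching state from phonetic context: the longest matching context wins, with shorter suffixes as fall-backs.

```python
# Hypothetical sketch of VLMM-style switching-state selection.
# state_table would be learned offline from the force-aligned phonetic data;
# here it is a toy stand-in.

def select_switching_state(context, state_table, max_order=3):
    """Pick a switching state for a phonetic context (most recent label last)."""
    # Prefer the longest available context, backing off to shorter suffixes,
    # mirroring how a VLMM conditions on a variable amount of history.
    for order in range(min(max_order, len(context)), 0, -1):
        suffix = tuple(context[-order:])
        if suffix in state_table:
            return state_table[suffix]
    return state_table.get((), 0)  # default state if nothing matches

# Toy usage: /a/ preceded by /b/ maps to a different state than bare /a/.
table = {('b', 'a'): 4, ('a',): 2, (): 0}
print(select_switching_state(('sil', 'b', 'a'), table))  # -> 4
```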


Citations
Journal ArticleDOI

Audio-driven facial animation by joint end-to-end learning of pose and emotion

TL;DR: This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.
Journal ArticleDOI

BLTRCNN-Based 3-D Articulatory Movement Prediction: Learning Articulatory Synchronicity From Both Text and Audio Inputs

TL;DR: This work proposes a new network architecture for articulatory movement prediction from both text and audio inputs, called a bottleneck long-term recurrent convolutional neural network (BLTRCNN), and is, to the authors' knowledge, the first to predict articulatory movements with a DNN by fusing text and audio inputs.
Proceedings ArticleDOI

Sequence-to-sequence articulatory inversion through time convolution of sub-band frequency signals

TL;DR: On a speaker-independent AAI task, it is shown that the convolutional features outperform the original filterbank features and can be combined with phonetic features, which bring independent information to the solution of the problem.
Journal ArticleDOI

Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models

TL;DR: It is argued that a DNN vector-to-vector regression front-end for speech enhancement (DNN-SE) can play a key role in AAI when used to enhance spectral features prior to AAI back-end processing.
Proceedings Article

Revisiting Gaussian process dynamical models

TL;DR: Four new algorithms are presented for learning GPDMs with incomplete training data, together with a new conditional model (CM+) for recovering incomplete test data; both adopt the Bayesian framework and can fully and properly use the partially observed data.
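As background for GPDMs, here is a minimal sketch of the Gaussian-process regression step at their core: predicting the next latent state from the current one using consecutive-pair training data. The RBF kernel, hyperparameters, and toy trajectory are assumptions for illustration, not that paper's algorithms.

```python
import numpy as np

# Minimal GP dynamics sketch: learn x_t -> x_{t+1} in latent space and
# return the posterior mean prediction for a query state (toy settings).

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_next_state(X, x_now, noise=1e-4):
    X_in, X_out = X[:-1], X[1:]                 # consecutive-pair training data
    K = rbf(X_in, X_in) + noise * np.eye(len(X_in))
    k_star = rbf(x_now[None, :], X_in)          # covariance with the query
    return k_star @ np.linalg.solve(K, X_out)   # posterior mean of x_{t+1}

# Toy latent trajectory on a circle; predict the state after the last one.
t = np.linspace(0, 2 * np.pi, 50)
X = np.stack([np.cos(t), np.sin(t)], axis=1)
print(gp_next_state(X, X[-1]))
```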
References
Journal ArticleDOI

Active appearance models

Abstract: We describe a new method of matching statistical models of appearance to images. A set of model parameters controls modes of shape and gray-level variation learned from a training set. We construct an efficient iterative matching algorithm by learning the relationship between perturbations in the model parameters and the induced image errors.
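The core trick of that matching algorithm, a learned linear map from image residuals to parameter corrections, can be sketched as follows; the data is synthetic and the variable names are illustrative.

```python
import numpy as np

# Hedged AAM-style sketch: perturb known parameters, record the induced
# residuals, and fit a linear regression from residuals to corrections.

rng = np.random.default_rng(0)
n_params, n_pixels, n_samples = 5, 200, 400

J = rng.normal(size=(n_pixels, n_params))          # toy residual generator
delta_p = rng.normal(size=(n_samples, n_params))   # known perturbations
residuals = delta_p @ J.T + 0.01 * rng.normal(size=(n_samples, n_pixels))

# Least-squares fit of R such that delta_p ~= residuals @ R.
R, *_ = np.linalg.lstsq(residuals, delta_p, rcond=None)

# One matching iteration: map an observed residual to a parameter update.
r = J @ np.array([0.5, -0.2, 0.0, 0.1, 0.3])
print(r @ R)  # should approximate (0.5, -0.2, 0.0, 0.1, 0.3)
```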
Journal ArticleDOI

Fast and robust fixed-point algorithms for independent component analysis

TL;DR: Using maximum-entropy approximations of differential entropy, a family of new contrast (objective) functions for ICA is derived; these enable both the estimation of the whole decomposition by minimizing mutual information and the estimation of individual independent components as projection-pursuit directions.
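For reference, the one-unit fixed-point update behind FastICA (with the tanh contrast function) looks roughly like this; the sketch assumes the data matrix has already been centred and whitened, and the names are illustrative.

```python
import numpy as np

# One-unit FastICA fixed-point iteration with g = tanh, assuming X
# (dimensions x samples) is centred and whitened beforehand.
def fastica_one_unit(X, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        wx = w @ X
        g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        w_new = (X * g).mean(axis=1) - g_prime.mean() * w   # fixed-point step
        w_new /= np.linalg.norm(w_new)
        if abs(w_new @ w) > 1 - 1e-8:                       # converged (up to sign)
            return w_new
        w = w_new
    return w
```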
Journal ArticleDOI

Hearing lips and seeing voices

TL;DR: The study reported here demonstrates a previously unrecognised influence of vision upon speech perception: on being shown a film of a young woman's talking head in which repeated utterances of the syllable [ba] had been dubbed onto lip movements for [ga], normal adults reported hearing [da].
Proceedings ArticleDOI

A morphable model for the synthesis of 3D faces

TL;DR: A new technique for modeling textured 3D faces by transforming the shape and texture of the examples into a vector space representation, which regulates the naturalness of modeled faces, avoiding faces with an "unlikely" appearance.
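The vector-space idea is easy to illustrate with a toy PCA model: example faces are flattened into vectors, new faces are the mean plus a linear combination of principal modes, and keeping coefficients within the learned distribution is what keeps generated faces plausible. Dimensions and data below are stand-ins for real 3D scans.

```python
import numpy as np

# Toy morphable-model sketch: PCA over flattened example faces, then
# sample mode coefficients from the learned Gaussian to build a new face.

rng = np.random.default_rng(1)
n_faces, n_dims = 20, 300                 # e.g. 100 vertices x 3 coordinates
faces = rng.normal(size=(n_faces, n_dims))

mean_face = faces.mean(axis=0)
U, S, Vt = np.linalg.svd(faces - mean_face, full_matrices=False)
components = Vt[:5]                       # top-5 shape modes
sigma = S[:5] / np.sqrt(n_faces - 1)      # per-mode standard deviations

# Small coefficients (relative to sigma) keep the face "likely".
alpha = rng.normal(size=5) * sigma
new_face = mean_face + alpha @ components
print(new_face.shape)                     # (300,)
```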