
Showing papers on "Viseme published in 2005"


Journal ArticleDOI
01 Jul 2005
TL;DR: Face Transfer is a method for mapping videorecorded performances of one individual to facial animations of another, based on a multilinear model of 3D face meshes that separably parameterizes the space of geometric variations due to different attributes.
Abstract: Face Transfer is a method for mapping videorecorded performances of one individual to facial animations of another. It extracts visemes (speech-related mouth articulations), expressions, and three-dimensional (3D) pose from monocular video or film footage. These parameters are then used to generate and drive a detailed 3D textured face mesh for a target identity, which can be seamlessly rendered back into target footage. The underlying face model automatically adjusts for how the target performs facial expressions and visemes. The performance data can be easily edited to change the visemes, expressions, pose, or even the identity of the target---the attributes are separably controllable. This supports a wide variety of video rewrite and puppetry applications. Face Transfer is based on a multilinear model of 3D face meshes that separably parameterizes the space of geometric variations due to different attributes (e.g., identity, expression, and viseme). Separability means that each of these attributes can be independently varied. A multilinear model can be estimated from a Cartesian product of examples (identities × expressions × visemes) with techniques from statistical analysis, but only after careful preprocessing of the geometric data set to secure one-to-one correspondence, to minimize cross-coupling artifacts, and to fill in any missing examples. Face Transfer offers new solutions to these problems and links the estimated model with a face-tracking algorithm to extract pose, expression, and viseme parameters.
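To make the "separably parameterized" multilinear model concrete, here is a hypothetical Python/NumPy sketch of the general idea, not the authors' implementation: a core tensor is contracted with per-attribute weight vectors, so identity, expression, and viseme can each be varied on its own. All sizes and data below are made-up placeholders.

```python
# Hypothetical sketch of a multilinear (Tucker-style) face model.
import numpy as np

n_verts, n_id, n_expr, n_vis = 1000, 15, 10, 8            # assumed dimensions
core = np.random.rand(3 * n_verts, n_id, n_expr, n_vis)   # stands in for the learned core tensor

def synthesize_mesh(core, w_id, w_expr, w_vis):
    """Contract the core with identity/expression/viseme weights (mode products)."""
    verts = np.einsum('vijk,i,j,k->v', core, w_id, w_expr, w_vis)
    return verts.reshape(-1, 3)

# Varying only the viseme weights changes mouth articulation while identity and
# expression stay fixed -- the 'separability' the abstract refers to.
mesh = synthesize_mesh(core, np.ones(n_id) / n_id,
                       np.ones(n_expr) / n_expr,
                       np.eye(n_vis)[2])                   # e.g. select viseme #2
```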

679 citations


Proceedings ArticleDOI
17 Oct 2005
TL;DR: A novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory-feature classifier scores, which can model varying degrees of co-articulation in a principled way, is presented.

Abstract: We present an approach to detecting and recognizing spoken isolated phrases based solely on visual input. We adopt an architecture that first employs discriminative detection of visual speech and articulatory features, and then performs recognition using a model that accounts for the loose synchronization of the feature streams. Discriminative classifiers detect the subclass of lip appearance corresponding to the presence of speech, and further decompose it into features corresponding to the physical components of articulatory production. These components often evolve in a semi-independent fashion, and conventional viseme-based approaches to recognition fail to capture the resulting co-articulation effects. We present a novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory-feature classifier scores, which can model varying degrees of co-articulation in a principled way. We evaluate our visual-only recognition system on a command utterance task. We show comparative results on lip detection and speech/non-speech classification, as well as recognition performance against several baseline systems.

87 citations


Journal ArticleDOI
TL;DR: The utility of the techniques described in this paper is shown by implementing them in a text-to-audiovisual-speech system that automatically creates accurate real-time animated speech from unrestricted input text.
Abstract: We present a facial model designed primarily to support animated speech. Our facial model takes facial geometry as input and transforms it into a parametric deformable model. The facial model uses a muscle-based parameterization, allowing for easier integration between speech synchrony and facial expressions. Our facial model has a highly deformable lip model that is grafted onto the input facial geometry to provide the necessary geometric complexity needed for creating lip shapes and high-quality renderings. Our facial model also includes a highly deformable tongue model that can represent the shapes the tongue undergoes during speech. We add teeth, gums, and upper palate geometry to complete the inner mouth. To decrease the processing time, we hierarchically deform the facial surface. We also present a method to animate the facial model over time to create animated speech using a model of coarticulation that blends visemes together using dominance functions. We treat visemes as a dynamic shaping of the vocal tract by describing visemes as curves instead of keyframes. We show the utility of the techniques described in this paper by implementing them in a text-to-audiovisual-speech system that creates animation of speech from unrestricted text. The facial and coarticulation models must first be interactively initialized. The system then automatically creates accurate real-time animated speech from the input text. It is capable of cheaply producing tremendous amounts of animated speech with very low resource requirements.
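The coarticulation model above blends viseme targets with dominance functions. The following is a minimal sketch of that blending idea in the Cohen-Massaro style; the targets, centre times, and dominance parameters are invented for illustration and are not the paper's values.

```python
# Minimal sketch of dominance-function blending of viseme targets.
import numpy as np

viseme_targets = np.array([0.1, 0.8, 0.3])     # e.g. a lip-opening parameter per viseme
centre_times   = np.array([0.10, 0.25, 0.45])  # viseme centre times in seconds
alpha, theta, c = 1.0, 12.0, 1.0               # dominance magnitude / decay / exponent

def blended_parameter(t):
    """Blend viseme targets weighted by exponential dominance functions."""
    dominance = alpha * np.exp(-theta * np.abs(t - centre_times) ** c)
    return np.sum(dominance * viseme_targets) / np.sum(dominance)

# Sampling the blend over time yields a smooth articulation curve rather than
# a sequence of keyframes, matching the 'visemes as curves' view above.
curve = [blended_parameter(t) for t in np.linspace(0.0, 0.6, 61)]
```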

61 citations


Patent
01 Jul 2005
TL;DR: In this article, a sequence of visemes, each associated with one or more phonemes, is mapped onto a 3D target face and concatenated; divisemes are represented by motion trajectories of a set of facial points.

Abstract: The disclosure describes methods for synthesis of accurate visible speech using transformations of motion-capture data. Methods are provided for synthesis of visible speech in a three-dimensional face. A sequence of visemes, each associated with one or more phonemes, is mapped onto a three-dimensional target face and concatenated. The sequence may include divisemes corresponding to pairwise sequences of phonemes, wherein each diviseme is comprised of motion trajectories of a set of facial points. The sequence may also include multi-units corresponding to words and sequences of words. Various techniques involving mapping and concatenation are also addressed.
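The patent text names concatenation of diviseme trajectories but gives no algorithm, so the sketch below is only a guessed illustration: two trajectories of facial points are joined with a short linear cross-fade so the motion stays continuous at the seam.

```python
# Hedged illustration of concatenating two diviseme motion trajectories.
import numpy as np

def concatenate_divisemes(traj_a, traj_b, overlap=5):
    """traj_a, traj_b: (frames, points, 3) trajectories; blend 'overlap' frames at the joint."""
    w = np.linspace(0.0, 1.0, overlap)[:, None, None]
    blended = (1.0 - w) * traj_a[-overlap:] + w * traj_b[:overlap]
    return np.concatenate([traj_a[:-overlap], blended, traj_b[overlap:]], axis=0)

a = np.random.rand(20, 30, 3)   # stand-in diviseme trajectories (20 frames, 30 points)
b = np.random.rand(18, 30, 3)
sequence = concatenate_divisemes(a, b)
```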

25 citations


Proceedings ArticleDOI
06 Jul 2005
TL;DR: A new method for mapping natural speech to lip shape animation in real time using neural networks, which eliminates the need for tedious manual neural network design by trial and error and considerably improves the viseme classification results.

Abstract: In this paper we present a new method for mapping natural speech to lip shape animation in real time. The speech signal, represented by MFCC vectors, is classified into viseme classes using neural networks. The topology of the neural networks is automatically configured using genetic algorithms. This eliminates the need for tedious manual neural network design by trial and error and considerably improves the viseme classification results. The method is suitable for both real-time and offline applications.
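As a rough sketch of the MFCC-to-viseme classification step described above (the genetic-algorithm topology search is omitted here), the snippet below assumes librosa and scikit-learn; the audio file name and the per-frame viseme labels are placeholders.

```python
# MFCC frames classified into viseme classes with a small neural network (sketch).
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

y, sr = librosa.load("utterance.wav", sr=16000)        # assumed audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12).T   # one 12-dim vector per frame

labels = np.random.randint(0, 14, len(mfcc))           # placeholder viseme labels (14 classes)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
clf.fit(mfcc, labels)

predicted_visemes = clf.predict(mfcc)                  # one viseme class per audio frame
```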

23 citations


Journal ArticleDOI
01 Aug 2005
TL;DR: An efficient system for realistic speech animation is proposed, which supports all steps of the animation pipeline, from the capture or design of 3-D head models up to the synthesis and editing of the performance.
Abstract: An efficient system for realistic speech animation is proposed. The system supports all steps of the animation pipeline, from the capture or design of 3-D head models up to the synthesis and editing of the performance. This pipeline is fully 3-D, which yields high flexibility in the use of the animated character. Real detailed 3-D face dynamics, observed at video frame rate for thousands of points on the face of speaking actors, underpin the realism of the facial deformations. These are given a compact and intuitive representation via independent component analysis (ICA). Performances amount to trajectories through this ‘viseme space’. When asked to animate a face, the system replicates the ‘visemes’ that it has learned, and adds the necessary co-articulation effects. Realism has been improved through comparisons with motion-captured ground truth. Faces for which no 3-D dynamics could be observed can be animated nonetheless. Their visemes are adapted automatically to their physiognomy by localising the face in a ‘face space’.
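A small sketch of building such a ‘viseme space’ with ICA, at the level of detail the abstract gives; the data here are random stand-ins for the captured 3-D face dynamics (frames × flattened vertex coordinates), and the number of components is an assumption.

```python
# ICA 'viseme space': compress per-frame 3-D face geometry to a few components.
import numpy as np
from sklearn.decomposition import FastICA

frames = np.random.rand(200, 3 * 300)          # 200 frames, 300 tracked 3-D points (stand-in)
ica = FastICA(n_components=10, random_state=0)
trajectory = ica.fit_transform(frames)         # each frame becomes a point in the viseme space

# A performance is a trajectory through this low-dimensional space; new face
# geometry can be synthesized by inverting the transform.
reconstructed = ica.inverse_transform(trajectory)
```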

22 citations


Proceedings Article
21 Oct 2005
TL;DR: This master's thesis investigates automatic lip synchronization, a method for generating an animation of a 3D human face model in which the animation is driven only by a speech signal.

Abstract: This master's thesis investigates automatic lip synchronization. It is a method for generating an animation of a 3D human face model where the animation is driven only by a speech signal. The whole process is completely automatic and starts from the speech signal. Automatic lip synchronization consists of two main parts: audio-to-visual mapping and face synthesis. The thesis proposes and implements a system for automatic lip synchronization of synthetic 3D avatars based only on speech input. The speech signal is classified into viseme classes using neural networks. The topology of the neural networks is automatically configured using genetic algorithms. The visual representations of phonemes, visemes, as defined in MPEG-4 FA, are used for face synthesis. The system is adapted to the specifics of the Croatian language. A detailed system validation based on three different evaluation methods is carried out, and potential applications of these technologies are discussed in detail. The method is suitable for both real-time and offline applications. It is speaker independent and multilingual.
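The thesis configures the network topology with a genetic algorithm; the following is a heavily simplified, hypothetical sketch of that idea only: a tiny GA searches over hidden-layer sizes and scores each candidate by cross-validation on placeholder MFCC/viseme data.

```python
# Toy genetic search over MLP topologies (selection + mutation only).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))          # stand-in MFCC vectors
y = rng.integers(0, 6, size=300)        # stand-in viseme labels

def fitness(hidden_sizes):
    clf = MLPClassifier(hidden_layer_sizes=hidden_sizes, max_iter=300)
    return cross_val_score(clf, X, y, cv=3).mean()

population = [tuple(rng.integers(4, 64, size=rng.integers(1, 3))) for _ in range(6)]
for generation in range(5):
    parents = sorted(population, key=fitness, reverse=True)[:3]      # selection
    children = [tuple(max(4, s + int(rng.integers(-8, 9))) for s in p)  # mutation
                for p in parents]
    population = parents + children

best_topology = max(population, key=fitness)
```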

18 citations


Journal ArticleDOI
TL;DR: Results of experiments on identifying a group of confusable visemes indicate that the proposed approach to discriminative training of HMM is able to increase the recognition accuracy by an average of 20% compared with the conventional HMMs that are trained with the Baum-Welch estimation.
Abstract: The hidden Markov model (HMM) has been a popular mathematical approach for sequence classification, such as speech recognition, since the 1980s. In this paper, a novel two-channel training strategy is proposed for discriminative training of HMMs. For the proposed training strategy, a novel separable-distance function that measures the difference between a pair of training samples is adopted as the criterion function. The symbol emission matrix of an HMM is split into two channels: a static channel to maintain the validity of the HMM and a dynamic channel that is modified to maximize the separable distance. The parameters of the two-channel HMM are estimated by iterative application of expectation-maximization (EM) operations. As an example of the application of the novel approach, a hierarchical speaker-dependent visual speech recognition system is trained using the two-channel HMMs. Results of experiments on identifying a group of confusable visemes indicate that the proposed approach is able to increase the recognition accuracy by an average of 20% compared with conventional HMMs trained with Baum-Welch estimation.
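As a very rough, hypothetical illustration of the two-channel idea only (the separable-distance maximization and EM updates are not shown), the sketch below combines a fixed "static" emission channel with an adjustable "dynamic" one and evaluates a sequence with the standard forward algorithm. All parameter values are made up.

```python
# Two-channel emission matrix feeding a standard forward-algorithm likelihood.
import numpy as np

A  = np.array([[0.8, 0.2],
               [0.3, 0.7]])                # state transition matrix
pi = np.array([0.6, 0.4])                  # initial state distribution
B_static  = np.array([[0.5, 0.3, 0.2],
                      [0.2, 0.3, 0.5]])    # kept fixed to preserve a valid HMM
B_dynamic = np.array([[0.6, 0.2, 0.2],
                      [0.1, 0.4, 0.5]])    # adjusted during discriminative training

B = 0.5 * (B_static + B_dynamic)
B /= B.sum(axis=1, keepdims=True)          # rows remain proper distributions

def forward_likelihood(obs):
    """P(observation sequence | model) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward_likelihood([0, 2, 1, 1, 2]))
```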

16 citations


Book
20 Jan 2005
TL;DR: In this paper, three puzzles of multimodal speech perception are investigated: temporal organization of cued speech production, bimodal perception within the natural time-course of speech production and sensory information for face perception.
Abstract: 1. Three puzzles of multimodal speech perception R. E. Remez 2. Visual speech perception L. E. Bernstein 3. Dynamic information for face perception K. Lander and V. Bruce 4. Investigating auditory-visual speech perception development D. Burnham and K. Sekiyama 5. Brain bases for seeing speech: FMRI studies of speechreading R. Campbell and M. MacSweeney 6. Temporal organization of cued speech production D. Beautemps, M.-A. Cathiard, V. Attina and C. Savariaux 7. Bimodal perception within the natural time-course of speech production M.-A. Cathiard, A. Vilain, R. Laboissiere, H. Loevenbruck, C. Savariaux and J.-L. Schwartz 8. Visual and audiovisual synthesis and recognition of speech by computers N. M. Brooke and S. D. Scott 9. Audiovisual automatic speech recognition G. Potamianos, C. Neti, J. Luettin and I. Matthews 10. Image-based facial synthesis M. Slaney and C. Bregler 11. A trainable videorealistic speech animation system T. Ezzat, G. Geiger and T. Poggio 12. Animated speech: research progress and applications D. W. Massaro, M. M. Cohen, M. Tabain, J. Beskow and R. Clark 13. Empirical perceptual-motor linkage of multimodal speech E. Vatikiotis-Bateson and K. G. Munhall 14. Sensorimotor characteristics of speech production G. Bailly, P. Badin, L. Reveret and A. Ben Youssef.

13 citations



Patent
22 Feb 2005
TL;DR: In this paper, a technique for extracting visemes is proposed, where each of the time domain classification vectors is derived from one of the successive frames of digitized analog speech information.
Abstract: A technique for extracting visemes includes receiving successive frames of digitized analog speech information obtained from the speech signal at a fixed rate (210), filtering each of the successive frames of digitized analog speech information to synchronously generate time domain frame classification vectors at the fixed rate (215, 220, 225, 230, 235, 240), and analyzing each of the time domain classification vectors (250) to synchronously generate a set of visemes corresponding to each of the successive frames of digitized speech information at the fixed rate. Each of the time domain frame classification vectors is derived from one of the successive frames of digitized analog speech information. N multi-taper discrete prolate spheroid sequence basis (MTDPSSB) functions (220) that are factors of a Fredholm integral of the first kind may be used for the filtering, and the analyzing may use a spatial classification function (250). The latency is less than 100 milliseconds.
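The patent names multi-taper DPSS filtering of fixed-rate speech frames; the snippet below is only a guessed sketch of that step using SciPy's DPSS windows. The frame length, time-bandwidth product, taper count, and the choice of per-taper log energies as the "frame classification vector" are all assumptions, not the patent's specification.

```python
# Multi-taper (DPSS) analysis of one speech frame (sketch).
import numpy as np
from scipy.signal.windows import dpss

frame_len, NW, n_tapers = 256, 2.5, 4
tapers = dpss(frame_len, NW, Kmax=n_tapers)       # (n_tapers, frame_len) Slepian sequences

def frame_classification_vector(frame):
    """Apply each taper to the speech frame and summarize its spectrum."""
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return np.log(spectra.mean(axis=1) + 1e-12)   # one value per taper

frame = np.random.randn(frame_len)                # stand-in 16 ms speech frame
vec = frame_classification_vector(frame)
```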

Proceedings ArticleDOI
15 Jun 2005
TL;DR: A new method for mapping natural speech to lip shape animation in real time using neural networks that eliminates the need for tedious manual neural network design by trial and error and considerably improves the viseme classification results.
Abstract: In this paper we present a new method for mapping natural speech to lip shape animation in real time. The speech signal, represented by MFCC vectors, is classified into viseme classes using neural networks. The topology of the neural networks is automatically configured using genetic algorithms. This eliminates the need for tedious manual neural network design by trial and error and considerably improves the viseme classification results. The method works in both real-time and offline modes and is suitable for various applications. Building on the described lip-sync system, we propose new multimedia services for mobile devices.

Proceedings ArticleDOI
Guobin Ou, Xin Li, XiaoCao Yao, Hongbin Jia, Yi Lu Murphey
27 Dec 2005
TL;DR: An algorithm that automatically extracts lip areas from speaker images, and a neural network system that integrates the two different types of signals to give accurate identification of speakers are developed.
Abstract: We present a speaker identification system that uses synchronized speech signals and lip features. We developed an algorithm that automatically extracts lip areas from speaker images, and a neural network system that integrates the two different types of signals to give accurate identification of speakers. We show that the proposed system gives better performance than systems that use only speech or only lip features, in both text-dependent and text-independent speaker identification applications.
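A minimal sketch of feature-level fusion of the two modalities for speaker identification; the feature dimensions, speaker count, and data are placeholders and the network is not the paper's architecture.

```python
# Concatenate audio and lip features, then classify the speaker (sketch).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
audio_feats = rng.normal(size=(400, 13))      # e.g. per-utterance MFCC statistics
lip_feats   = rng.normal(size=(400, 20))      # e.g. lip-region shape/appearance features
speakers    = rng.integers(0, 10, size=400)   # stand-in speaker labels

fused = np.hstack([audio_feats, lip_feats])   # feature-level fusion of the two modalities
clf = MLPClassifier(hidden_layer_sizes=(40,), max_iter=400).fit(fused, speakers)
```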

01 Jan 2005
TL;DR: The results show that lip animation can indeed enhance speech perception if done correctly, and lip movement that does not correlate with the presented speech resulted in worse performance in the presence of masking noise than when no lip animation was used at all.
Abstract: The addition of facial animation to characters greatly contributes to realism and presence in virtual environments. Even simple animations can make a character seem more lifelike and more believable. The purpose of this study was to determine whether the rudimentary lip animations used in most virtual environments could influence the perception of speech. The results show that lip animation can indeed enhance speech perception if done correctly. Lip movement that did not correlate with the presented speech, however, resulted in worse performance in the presence of masking noise than when no lip animation was used at all.


Journal ArticleDOI
TL;DR: A Chinese text-to-visual speech synthesis system based on a data-driven (sample-based) approach, realized by concatenating short video segments, is presented, together with an effective method to construct two visual confusion trees for Chinese initials and finals.

Abstract: Text-To-Visual Speech (TTVS) synthesis by computer can increase speech intelligibility and make human-computer interaction interfaces more friendly. This paper describes a Chinese text-to-visual speech synthesis system based on a data-driven (sample-based) approach, which is realized by concatenating short video segments. An effective method to construct two visual confusion trees for Chinese initials and finals is developed. A co-articulation model based on visual distance and a hardness factor is proposed, which is used for sentence selection for the recording corpus in the analysis phase and for unit selection in the synthesis phase. The visible discontinuity between boundary images of concatenated video segments is smoothed by an image-morphing technique. By combining this with acoustic Text-To-Speech (TTS) synthesis, a complete Chinese text-to-visual speech synthesis system is realized.
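The co-articulation cost based on "visual distance" and a "hardness factor" is only named above, not specified, so the following is a guessed form for illustration: a candidate unit is scored by its visual distance to the previous unit, scaled by how resistant (hard) the boundary viseme is to co-articulation.

```python
# Guessed unit-selection join cost combining visual distance and hardness (sketch).
import numpy as np

def join_cost(prev_unit_feats, cand_unit_feats, hardness):
    """Lower is better; hardness in [0, 1] scales the visual-distance penalty."""
    visual_distance = np.linalg.norm(prev_unit_feats - cand_unit_feats)
    return hardness * visual_distance

def select_unit(prev_unit_feats, candidates, hardness):
    costs = [join_cost(prev_unit_feats, c, hardness) for c in candidates]
    return int(np.argmin(costs))

prev  = np.random.rand(8)                     # stand-in visual features of the last unit
cands = [np.random.rand(8) for _ in range(5)] # candidate units from the corpus
best  = select_unit(prev, cands, hardness=0.7)
```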

Proceedings ArticleDOI
04 Sep 2005
TL;DR: A new hybrid visual feature combination suitable for audio-visual speech recognition was implemented; experiments using only the visual features resulted in high recognition accuracy and improved audio-visual speech recognition drastically.

Abstract: In this work, a system for audio-visual speech recognition is presented. A new hybrid visual feature combination, suitable for audio-visual speech recognition, was implemented. The features comprise both the shape and the appearance of the lips; dimensionality reduction is applied using the discrete cosine transform (DCT). A large visual speech database of the German language has been assembled, the German Audio-Visual Database (GAVD). The experiments conducted using only visual features resulted in a high recognition accuracy and improved audio-visual speech recognition drastically.
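A short sketch of the DCT-based dimensionality reduction of the lip appearance: a 2-D DCT of the grayscale lip region, keeping only a small low-frequency block of coefficients. The image size and block size are assumptions, not the paper's settings.

```python
# 2-D DCT of the lip region, retaining a low-frequency coefficient block (sketch).
import numpy as np
from scipy.fft import dctn

lip_roi = np.random.rand(32, 64)              # stand-in grayscale lip image
coeffs = dctn(lip_roi, norm='ortho')          # 2-D DCT
appearance_feature = coeffs[:6, :6].ravel()   # keep a 6x6 low-frequency block (36 values)

# These appearance features would then be combined with lip-shape features to
# form the hybrid visual feature vector described above.
```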

Book ChapterDOI
05 Sep 2005
TL;DR: Real-time classification algorithms are presented for visual mouth appearances (visemes) which correspond to phonemes and their speech contexts, used in the design of a talking-head application; the DFT+LDA approach has practical advantages over the MESH+LDA classifier.

Abstract: Real-time classification algorithms are presented for visual mouth appearances (visemes) which correspond to phonemes and their speech contexts. They are used in the design of a talking-head application. Two feature extraction procedures were verified. The first is based on a normalized triangle mesh covering the mouth area and a color image texture vector indexed by barycentric coordinates. The second performs a Discrete Fourier Transform on an image rectangle enclosing the mouth, keeping only a small block of DFT coefficients. The classifier has been designed by an optimized LDA method which uses a two-singular-subspace approach. Despite its higher computational complexity (about three milliseconds per video frame on a Pentium IV 3.2 GHz), the DFT+LDA approach has practical advantages over the MESH+LDA classifier. Firstly, its recognition rate is more than two percentage points higher (99.3% versus 97.2%). Secondly, the automatic identification of the rectangle covering the mouth is more robust than the automatic identification of the triangle mesh covering the mouth.
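A sketch of the DFT-block-plus-LDA pipeline at the level of detail given above: a small block of 2-D DFT magnitudes from the mouth rectangle feeds an LDA classifier. Image dimensions, block size, and labels are placeholders rather than the paper's settings.

```python
# Low-frequency DFT block features classified with LDA (sketch).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def dft_block_features(mouth_img, block=8):
    spectrum = np.abs(np.fft.fft2(mouth_img))
    return spectrum[:block, :block].ravel()        # keep a small coefficient block

rng = np.random.default_rng(2)
images = rng.random((200, 32, 48))                 # stand-in mouth rectangles
labels = rng.integers(0, 6, size=200)              # stand-in viseme classes

X = np.array([dft_block_features(im) for im in images])
clf = LinearDiscriminantAnalysis().fit(X, labels)
pred = clf.predict(X)
```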

Proceedings Article
01 Jan 2005
TL;DR: A German viseme inventory for visemically transcribing text according to its phonetic transcription is introduced; an inventory of German viseme classes in a SAMPA-like labelling is worked out and a model for automatic visemic transcription of given input text is trained.

Abstract: In this paper, we introduce a German viseme inventory for visemically transcribing text according to its phonetic transcription. A viseme set like the one presented in this work is essential for speech-driven audio-visual synthesis, because the selection of appropriate video segments is based on the visemically transcribed input text. For text-to-speech synthesis, a transcription of the input text into a phonemic representation is used in order to avoid ambiguous meanings, to acquire the correct pronunciation of the underlying input text, and to serve as labels in unit-selection-based synthesis systems. Likewise, visual synthesis requires a transcription that represents, analogously to the phonemes, their visual counterparts, called visemes in the related literature, which also serve as unit labels in our data-driven, video-realistic audio-visual synthesis system. We worked out an inventory of German viseme classes in a SAMPA-like labelling and trained a model for automatic visemic transcription of given input text.
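For illustration only, here is a tiny fragment of what a SAMPA-like phoneme-to-viseme mapping and a trivial transcription function could look like; the class names and groupings are invented and are not the paper's inventory.

```python
# Hypothetical phoneme-to-viseme mapping (SAMPA-like symbols).
PHONEME_TO_VISEME = {
    "p": "V_BILABIAL", "b": "V_BILABIAL", "m": "V_BILABIAL",
    "f": "V_LABIODENTAL", "v": "V_LABIODENTAL",
    "o:": "V_ROUNDED", "u:": "V_ROUNDED",
    "a:": "V_OPEN", "a": "V_OPEN",
}

def visemic_transcription(phonemes):
    """Map a phonemic transcription to viseme labels (unknowns get a default class)."""
    return [PHONEME_TO_VISEME.get(p, "V_NEUTRAL") for p in phonemes]

print(visemic_transcription(["m", "a:", "m", "a"]))   # ['V_BILABIAL', 'V_OPEN', ...]
```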

01 Jan 2005
TL;DR: This paper presents a method for spatio-temporal tracking of points of interest in the speaker’s face and for indicating the different configurations of the mouth through visemes, in order to establish a correlation between phonemes and visemes.

Abstract: Speech recognition is a basic component of several current research projects. However, to understand speech, hearing is not always enough: it is sometimes necessary to see it. Indeed, perceptual studies have shown that the visual information provided by the interlocutor’s face contributes largely to the improvement of speech intelligibility under degraded communication conditions. Several domains are concerned with the use of such visual information, such as e-learning and human-machine interaction. This paper presents a method for spatio-temporal tracking of some points of interest in the speaker’s face and for indicating the different configurations of the mouth through visemes. These visemes are then associated with relatively precise physical measures, such as the spreading of the lips and the mouth height, in order to establish a correlation between phonemes and visemes. The results of our experiment show that the whole set of French phonemes can be described by these visemes.
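A simple sketch of the physical measures mentioned above (lip spreading and mouth height) computed from four tracked mouth landmarks; the landmark positions and the threshold-based labelling are hypothetical.

```python
# Lip spreading and mouth height from four tracked mouth points (sketch).
import numpy as np

def mouth_measures(left, right, top, bottom):
    """Each argument is an (x, y) landmark; returns (spreading, height) in pixels."""
    spreading = np.linalg.norm(np.subtract(right, left))
    height = np.linalg.norm(np.subtract(bottom, top))
    return spreading, height

spreading, height = mouth_measures((10, 50), (70, 52), (40, 35), (40, 70))
label = "open-vowel viseme" if height > spreading * 0.5 else "spread/closed viseme"
```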

Proceedings ArticleDOI
19 May 2005
TL;DR: Two methods for automatic facial gesturing of graphically embodied animated agents are presented; one drives a conversational agent from speech in an automatic lip-sync process, while the other provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by the appropriate facial gestures.

Abstract: We present two methods for automatic facial gesturing of graphically embodied animated agents. In one case, a conversational agent is driven by speech in an automatic lip-sync process. By analyzing the speech input, lip movements are determined from the speech signal. The other method provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by the appropriate facial gestures. The proposed statistical model for generating the virtual speaker's facial gestures can also be applied as an addition to the lip synchronization process in order to obtain speech-driven facial gesturing. In this case the statistical model will be triggered by the prosody of the input speech instead of a lexical analysis of the input text.

Proceedings ArticleDOI
16 Dec 2005
TL;DR: A context-based visubsyllable database is set up to map Chinese initials or finals to their corresponding mouth shapes, so that 3D facial animation can be synthesized from speech signal input.

Abstract: This paper describes a prototype implementation of a speech-driven facial animation system for embedded devices. The system comprises speech recognition and talking-head synthesis. A context-based visubsyllable database is set up to map Chinese initials or finals to their corresponding pronunciation mouth shapes. With the database, 3D facial animation can be synthesized from speech signal input. Experimental results show the system works well in simulating real mouth shapes and providing a friendly interface in communication terminals.

Book ChapterDOI
28 Sep 2005
TL;DR: Real-time recognition of visual face appearances (visemes) which correspond to phonemes and their speech contexts is presented, and it appears that the LDA classifier outperforms the PCA subspace technique.

Abstract: Real-time recognition of visual face appearances (visemes) which correspond to phonemes and their speech contexts is presented. We distinguish six major classes of visemes. Features are extracted in the form of normalized image texture. The normalization procedure uses barycentric coordinates in a mesh of triangles superimposed onto a reference facial image. The mesh itself is defined using a subset of FAP points conforming to the MPEG-4 standard. The classifiers were designed by PCA subspace and LDA methods. It appears that the LDA classifier outperforms the subspace technique: it is better than the best PCA subspace classifier in recognition rate by more than 13 percentage points (97% versus 84%), it is more than 10 times faster (0.5 ms versus 7 ms), and its time is negligible compared with the mouth-image normalization time (0.5 ms versus 5 ms).
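As a small aside on the barycentric-coordinate step used for texture normalization: a pixel's barycentric coordinates inside a mesh triangle are invariant to how the triangle deforms, so textures from different mouth shapes become comparable. The sketch below shows only that coordinate computation, with made-up points.

```python
# Barycentric coordinates of a 2-D point inside a triangle (sketch).
import numpy as np

def barycentric(p, a, b, c):
    """Return (alpha, beta, gamma) of point p with respect to triangle (a, b, c)."""
    T = np.column_stack([np.subtract(b, a), np.subtract(c, a)])
    beta, gamma = np.linalg.solve(T, np.subtract(p, a))
    return 1.0 - beta - gamma, beta, gamma

# The same (alpha, beta, gamma) indexes the corresponding texture sample in every
# frame, which is how a normalized image texture vector can be assembled.
print(barycentric((2.0, 1.0), (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)))
```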

Proceedings ArticleDOI
24 Oct 2005
TL;DR: Two methods for automatic facial gesturing of graphically embodied animated agents are presented; one drives a conversational agent from speech in an automatic lip-sync process, while the other provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by the appropriate facial gestures.

Abstract: We present two methods for automatic facial gesturing of graphically embodied animated agents. In one case, a conversational agent is driven by speech in an automatic lip-sync process. By analyzing the speech input, lip movements are determined from the speech signal. The other method provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by the appropriate facial gestures. The proposed statistical model for generating the virtual speaker's facial gestures can also be applied as an addition to the lip synchronization process in order to obtain speech-driven facial gesturing. In this case the statistical model is triggered by the prosody of the input speech instead of a lexical analysis of the input text.


Proceedings ArticleDOI
04 Sep 2005
TL;DR: The results show that, by using the HMM models defined in the training phase, the speech recognizer reliably detects specific speech sounds with a small error rate.

Abstract: This paper presents an "elitist approach" for automatically extracting well-realized speech sounds with high confidence. The elitist approach uses a speech recognition system based on hidden Markov models (HMMs). The HMMs are trained on speech sounds which are systematically well detected in an iterative procedure. The results show that, by using the HMM models defined in the training phase, the speech recognizer reliably detects specific speech sounds with a small error rate.

Proceedings ArticleDOI
12 May 2005
TL;DR: A system for automatic generation of facial animation keyframes using a MaxScript control script is developed; after the keyframes are created, the animator can fine-tune them and add emotional facial expressions.

Abstract: To reduce the effort of an animator in creating facial animation keyframes, we developed a system for automatic keyframe generation using a MaxScript control script. The input to the script is a parameter file containing the phonemes of the prerecorded soundtrack and their durations. Recognition of the phonemes is done by an LVQ neural network. After the keyframes are created by the script, the animator can fine-tune them and add facial expressions of emotions.
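The actual control script is MaxScript; as a language-neutral illustration of the keyframe-generation idea only, the Python sketch below reads "phoneme duration" lines, accumulates time, and emits one (frame, phoneme) keyframe per phoneme. The file format shown is an assumption.

```python
# Turn a phoneme/duration parameter file into keyframe times (sketch).
def keyframes_from_parameter_file(path, fps=25):
    keyframes, t = [], 0.0
    with open(path) as f:
        for line in f:                        # e.g. "m 0.08" -> phoneme, duration in seconds
            phoneme, duration = line.split()
            keyframes.append((round(t * fps), phoneme))
            t += float(duration)
    return keyframes

# Each (frame, phoneme) pair would then be mapped to a morph-target or bone-pose
# keyframe by the animation script, after which the animator fine-tunes by hand.
```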

01 Jan 2005
TL;DR: This paper looks at an alternate technique for applying multivariate statistical techniques to lip-sync a cartoon or model with an audio stream in real time, which requires orders of magnitude less processing power than traditional methods.
Abstract: With the advance of modern computer hardware, computer animation has advanced by leaps and bounds. What formerly took weeks of processing can now be generated on the fly. However, the actors in games often stand mute with faces unmoving, or speak only in canned phrases, as the technology for calculating their lip positions from an arbitrary sound segment has lagged behind the technology that allows the movement of those lips to be rendered in real time. Traditional speech recognition techniques require the entire utterance to be present, or at least a wide window around the text to be matched, so that higher-level structure can be used in determining what words are being spoken. This approach, while highly appropriate for recognizing the sounds present in an audio stream and mapping those to speech, is less applicable to the problem of "lip-syncing" in real time. This paper looks at an alternate technique: applying multivariate statistical techniques to lip-sync a cartoon or model with an audio stream in real time, which requires orders of magnitude less processing power than traditional methods.

01 Jan 2005
TL;DR: This approach additionally requires a modification to traditional techniques employed for the estimation of hidden Markov models (HMMs), whose resultant models the authors refer to as free-parts HMMs (FP-HMMs); results are presented on the CUAVE audio-visual speech database.
Abstract: Motivated by the success of free-parts based representations in face recognition [1] we have attempted to address some of the problems associated with applying such a philosophy to the task of speaker-independent automatic speech reading. Hitherto, a major problem with canonical area-based approaches in automatic speech reading is the intrinsic lack of training observations due to the visual speech modality's low sample rate and large variability in appearance. We believe a free-parts representation can overcome many of these limitations due to its natural ability to generalize by producing many observations from a single mouth image, whilst still preserving the ability to discriminate between various visual-speech units. This approach additionally requires a modification to traditional techniques employed for the estimation of hidden Markov Models (HMMs), whose resultant models we currently refer to as free-parts HMMs (FP-HMMs). Results will be presented on the CUAVE audiovisual speech database.

Journal ArticleDOI
TL;DR: C. Richie and D. Kewley-Port used pre-training data (vowel-identification confusion matrices) to determine whether vowel visemes exist for untrained speechreaders.
Abstract: The status of visemes, groups of visually confusable speech sounds, for American English vowels has been disputed for some time. While some researchers claim that vowels are visually distinguishable, others claim that some vowels are visually confusable and comprise viseme categories. Data from our study on speechreading words and sentences were examined for evidence of vowel visemes [C. Richie and D. Kewley‐Port, J. Acoust. Soc. Am. 117, 2570 (2005)]. Normal‐hearing listeners were tested in auditory‐visual conditions in masking noise designed to simulate a hearing loss. They were trained on speechreading tasks emphasizing vowels, consonants, or vowels and consonants combined. Pre‐ and post‐training speechreading tests included identification of 10 vowels in CVC context. Pre‐training data (vowel‐identification confusion matrices) were used to determine whether vowel visemes exist for untrained speechreaders. Post‐training results were examined to determine whether the number of vowel response categories increased and whether the number of vowel identification errors decreased, for trained versus untrained participants. The impact of these training programs on speechreading performance is discussed in terms of vowel visemes. [Work supported by NIHDCD02229.]
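One common way to derive candidate viseme categories from identification confusion matrices like those described above is to cluster sounds that are mutually confused; the sketch below shows that general idea with hierarchical clustering. The matrix is random filler, not the study's data, and the number of clusters is an arbitrary choice.

```python
# Group vowels into candidate visemes by clustering a confusion matrix (sketch).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
C = rng.random((10, 10))                        # stand-in 10-vowel confusion matrix
C = C / C.sum(axis=1, keepdims=True)            # rows: stimulus -> response proportions

similarity = 0.5 * (C + C.T)                    # symmetrize mutual confusions
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)

condensed = distance[np.triu_indices(10, k=1)]  # condensed form expected by linkage()
groups = fcluster(linkage(condensed, method='average'), t=4, criterion='maxclust')
# Vowels sharing a group label would be treated as one viseme category.
```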