
Showing papers on "Viseme published in 2005"


Journal ArticleDOI
01 Jul 2005
TL;DR: Face Transfer is a method for mapping videorecorded performances of one individual to facial animations of another, based on a multilinear model of 3D face meshes that separably parameterizes the space of geometric variations due to different attributes.
Abstract: Face Transfer is a method for mapping videorecorded performances of one individual to facial animations of another. It extracts visemes (speech-related mouth articulations), expressions, and three-dimensional (3D) pose from monocular video or film footage. These parameters are then used to generate and drive a detailed 3D textured face mesh for a target identity, which can be seamlessly rendered back into target footage. The underlying face model automatically adjusts for how the target performs facial expressions and visemes. The performance data can be easily edited to change the visemes, expressions, pose, or even the identity of the target---the attributes are separably controllable. This supports a wide variety of video rewrite and puppetry applications. Face Transfer is based on a multilinear model of 3D face meshes that separably parameterizes the space of geometric variations due to different attributes (e.g., identity, expression, and viseme). Separability means that each of these attributes can be independently varied. A multilinear model can be estimated from a Cartesian product of examples (identities × expressions × visemes) with techniques from statistical analysis, but only after careful preprocessing of the geometric data set to secure one-to-one correspondence, to minimize cross-coupling artifacts, and to fill in any missing examples. Face Transfer offers new solutions to these problems and links the estimated model with a face-tracking algorithm to extract pose, expression, and viseme parameters.
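To make the "separably parameterized" multilinear model concrete, here is a hypothetical Python/NumPy sketch of the general idea, not the authors' implementation: a core tensor is contracted with per-attribute weight vectors, so identity, expression, and viseme can each be varied on its own. All sizes and data below are made-up placeholders.

```python
# Hypothetical sketch of a multilinear (Tucker-style) face model.
import numpy as np

n_verts, n_id, n_expr, n_vis = 1000, 15, 10, 8            # assumed dimensions
core = np.random.rand(3 * n_verts, n_id, n_expr, n_vis)   # stands in for the learned core tensor

def synthesize_mesh(core, w_id, w_expr, w_vis):
    """Contract the core with identity/expression/viseme weights (mode products)."""
    verts = np.einsum('vijk,i,j,k->v', core, w_id, w_expr, w_vis)
    return verts.reshape(-1, 3)

# Varying only the viseme weights changes mouth articulation while identity and
# expression stay fixed -- the 'separability' the abstract refers to.
mesh = synthesize_mesh(core, np.ones(n_id) / n_id,
                       np.ones(n_expr) / n_expr,
                       np.eye(n_vis)[2])                   # e.g. select viseme #2
```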

679 citations


Proceedings ArticleDOI
17 Oct 2005
TL;DR: A novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory-feature classifier scores, which can model varying degrees of co-articulation in a principled way, is presented.

Abstract: We present an approach to detecting and recognizing spoken isolated phrases based solely on visual input. We adopt an architecture that first employs discriminative detection of visual speech and articulatory features, and then performs recognition using a model that accounts for the loose synchronization of the feature streams. Discriminative classifiers detect the subclass of lip appearance corresponding to the presence of speech, and further decompose it into features corresponding to the physical components of articulatory production. These components often evolve in a semi-independent fashion, and conventional viseme-based approaches to recognition fail to capture the resulting co-articulation effects. We present a novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory-feature classifier scores, which can model varying degrees of co-articulation in a principled way. We evaluate our visual-only recognition system on a command utterance task. We show comparative results on lip detection and speech/non-speech classification, as well as recognition performance against several baseline systems.

87 citations


Journal ArticleDOI
TL;DR: The utility of the techniques described in this paper is shown by implementing them in a text-to-audiovisual-speech system that automatically creates accurate real-time animated speech from unrestricted input text.
Abstract: We present a facial model designed primarily to support animated speech. Our facial model takes facial geometry as input and transforms it into a parametric deformable model. The facial model uses a muscle-based parameterization, allowing for easier integration between speech synchrony and facial expressions. Our facial model has a highly deformable lip model that is grafted onto the input facial geometry to provide the necessary geometric complexity needed for creating lip shapes and high-quality renderings. Our facial model also includes a highly deformable tongue model that can represent the shapes the tongue undergoes during speech. We add teeth, gums, and upper palate geometry to complete the inner mouth. To decrease the processing time, we hierarchically deform the facial surface. We also present a method to animate the facial model over time to create animated speech using a model of coarticulation that blends visemes together using dominance functions. We treat visemes as a dynamic shaping of the vocal tract by describing visemes as curves instead of keyframes. We show the utility of the techniques described in this paper by implementing them in a text-to-audiovisual-speech system that creates animation of speech from unrestricted text. The facial and coarticulation models must first be interactively initialized. The system then automatically creates accurate real-time animated speech from the input text. It is capable of cheaply producing tremendous amounts of animated speech with very low resource requirements.
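The coarticulation model above blends viseme targets with dominance functions. The following is a minimal sketch of that blending idea in the Cohen-Massaro style; the targets, centre times, and dominance parameters are invented for illustration and are not the paper's values.

```python
# Minimal sketch of dominance-function blending of viseme targets.
import numpy as np

viseme_targets = np.array([0.1, 0.8, 0.3])     # e.g. a lip-opening parameter per viseme
centre_times   = np.array([0.10, 0.25, 0.45])  # viseme centre times in seconds
alpha, theta, c = 1.0, 12.0, 1.0               # dominance magnitude / decay / exponent

def blended_parameter(t):
    """Blend viseme targets weighted by exponential dominance functions."""
    dominance = alpha * np.exp(-theta * np.abs(t - centre_times) ** c)
    return np.sum(dominance * viseme_targets) / np.sum(dominance)

# Sampling the blend over time yields a smooth articulation curve rather than
# a sequence of keyframes, matching the 'visemes as curves' view above.
curve = [blended_parameter(t) for t in np.linspace(0.0, 0.6, 61)]
```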

61 citations


Patent
01 Jul 2005
TL;DR: In this article, a sequence of visemes, each associated with one or more phonemes, is mapped onto a 3D target face and concatenated; divisemes are represented by motion trajectories of a set of facial points.

Abstract: The disclosure describes methods for synthesis of accurate visible speech using transformations of motion-capture data. Methods are provided for synthesis of visible speech in a three-dimensional face. A sequence of visemes, each associated with one or more phonemes, is mapped onto a three-dimensional target face and concatenated. The sequence may include divisemes corresponding to pairwise sequences of phonemes, wherein each diviseme is comprised of motion trajectories of a set of facial points. The sequence may also include multi-units corresponding to words and sequences of words. Various techniques involving mapping and concatenation are also addressed.
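The patent text names concatenation of diviseme trajectories but gives no algorithm, so the sketch below is only a guessed illustration: two trajectories of facial points are joined with a short linear cross-fade so the motion stays continuous at the seam.

```python
# Hedged illustration of concatenating two diviseme motion trajectories.
import numpy as np

def concatenate_divisemes(traj_a, traj_b, overlap=5):
    """traj_a, traj_b: (frames, points, 3) trajectories; blend 'overlap' frames at the joint."""
    w = np.linspace(0.0, 1.0, overlap)[:, None, None]
    blended = (1.0 - w) * traj_a[-overlap:] + w * traj_b[:overlap]
    return np.concatenate([traj_a[:-overlap], blended, traj_b[overlap:]], axis=0)

a = np.random.rand(20, 30, 3)   # stand-in diviseme trajectories (20 frames, 30 points)
b = np.random.rand(18, 30, 3)
sequence = concatenate_divisemes(a, b)
```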

25 citations


Proceedings ArticleDOI
06 Jul 2005
TL;DR: A new method for mapping natural speech to lip shape animation in real time using neural networks, which eliminates the need for tedious manual neural network design by trial and error and considerably improves the viseme classification results.

Abstract: In this paper we present a new method for mapping natural speech to lip shape animation in real time. The speech signal, represented by MFCC vectors, is classified into viseme classes using neural networks. The topology of the neural networks is automatically configured using genetic algorithms. This eliminates the need for tedious manual neural network design by trial and error and considerably improves the viseme classification results. The method is suitable for both real-time and offline applications.
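As a rough sketch of the MFCC-to-viseme classification step described above (the genetic-algorithm topology search is omitted here), the snippet below assumes librosa and scikit-learn; the audio file name and the per-frame viseme labels are placeholders.

```python
# MFCC frames classified into viseme classes with a small neural network (sketch).
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

y, sr = librosa.load("utterance.wav", sr=16000)        # assumed audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12).T   # one 12-dim vector per frame

labels = np.random.randint(0, 14, len(mfcc))           # placeholder viseme labels (14 classes)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
clf.fit(mfcc, labels)

predicted_visemes = clf.predict(mfcc)                  # one viseme class per audio frame
```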

23 citations


Journal ArticleDOI
01 Aug 2005
TL;DR: An efficient system for realistic speech animation is proposed, which supports all steps of the animation pipeline, from the capture or design of 3-D head models up to the synthesis and editing of the performance.
Abstract: An efficient system for realistic speech animation is proposed. The system supports all steps of the animation pipeline, from the capture or design of 3-D head models up to the synthesis and editing of the performance. This pipeline is fully 3-D, which yields high flexibility in the use of the animated character. Real detailed 3-D face dynamics, observed at video frame rate for thousands of points on the face of speaking actors, underpin the realism of the facial deformations. These are given a compact and intuitive representation via independent component analysis (ICA). Performances amount to trajectories through this ‘viseme space’. When asked to animate a face, the system replicates the ‘visemes’ that it has learned, and adds the necessary co-articulation effects. Realism has been improved through comparisons with motion-captured ground truth. Faces for which no 3-D dynamics could be observed can be animated nonetheless. Their visemes are adapted automatically to their physiognomy by localising the face in a ‘face space’.
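A small sketch of building such a ‘viseme space’ with ICA, at the level of detail the abstract gives; the data here are random stand-ins for the captured 3-D face dynamics (frames × flattened vertex coordinates), and the number of components is an assumption.

```python
# ICA 'viseme space': compress per-frame 3-D face geometry to a few components.
import numpy as np
from sklearn.decomposition import FastICA

frames = np.random.rand(200, 3 * 300)          # 200 frames, 300 tracked 3-D points (stand-in)
ica = FastICA(n_components=10, random_state=0)
trajectory = ica.fit_transform(frames)         # each frame becomes a point in the viseme space

# A performance is a trajectory through this low-dimensional space; new face
# geometry can be synthesized by inverting the transform.
reconstructed = ica.inverse_transform(trajectory)
```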

22 citations


Proceedings Article
21 Oct 2005
TL;DR: This master's thesis investigates automatic lip synchronization, a method for generating an animation of a 3D human face model in which the animation is driven only by a speech signal.

Abstract: This master's thesis investigates automatic lip synchronization. It is a method for generating an animation of a 3D human face model where the animation is driven only by a speech signal. The whole process is completely automatic and starts from the speech signal. Automatic lip synchronization consists of two main parts: audio-to-visual mapping and face synthesis. The thesis proposes and implements a system for automatic lip synchronization of synthetic 3D avatars based only on speech input. The speech signal is classified into viseme classes using neural networks. The topology of the neural networks is automatically configured using genetic algorithms. The visual representations of phonemes, visemes, as defined in MPEG-4 FA, are used for face synthesis. The system is adapted to the specifics of the Croatian language. A detailed system validation based on three different evaluation methods is carried out, and potential applications of these technologies are discussed in detail. The method is suitable for both real-time and offline applications. It is speaker independent and multilingual.
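The thesis configures the network topology with a genetic algorithm; the following is a heavily simplified, hypothetical sketch of that idea only: a tiny GA searches over hidden-layer sizes and scores each candidate by cross-validation on placeholder MFCC/viseme data.

```python
# Toy genetic search over MLP topologies (selection + mutation only).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))          # stand-in MFCC vectors
y = rng.integers(0, 6, size=300)        # stand-in viseme labels

def fitness(hidden_sizes):
    clf = MLPClassifier(hidden_layer_sizes=hidden_sizes, max_iter=300)
    return cross_val_score(clf, X, y, cv=3).mean()

population = [tuple(rng.integers(4, 64, size=rng.integers(1, 3))) for _ in range(6)]
for generation in range(5):
    parents = sorted(population, key=fitness, reverse=True)[:3]      # selection
    children = [tuple(max(4, s + int(rng.integers(-8, 9))) for s in p)  # mutation
                for p in parents]
    population = parents + children

best_topology = max(population, key=fitness)
```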

18 citations


Journal ArticleDOI
TL;DR: Results of experiments on identifying a group of confusable visemes indicate that the proposed approach to discriminative training of HMM is able to increase the recognition accuracy by an average of 20% compared with the conventional HMMs that are trained with the Baum-Welch estimation.
Abstract: The hidden Markov model (HMM) has been a popular mathematical approach for sequence classification, such as speech recognition, since the 1980s. In this paper, a novel two-channel training strategy is proposed for discriminative training of HMMs. For the proposed training strategy, a novel separable-distance function that measures the difference between a pair of training samples is adopted as the criterion function. The symbol emission matrix of an HMM is split into two channels: a static channel to maintain the validity of the HMM and a dynamic channel that is modified to maximize the separable distance. The parameters of the two-channel HMM are estimated by iterative application of expectation-maximization (EM) operations. As an example of the application of the novel approach, a hierarchical speaker-dependent visual speech recognition system is trained using the two-channel HMMs. Results of experiments on identifying a group of confusable visemes indicate that the proposed approach is able to increase the recognition accuracy by an average of 20% compared with conventional HMMs trained with Baum-Welch estimation.
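As a very rough, hypothetical illustration of the two-channel idea only (the separable-distance maximization and EM updates are not shown), the sketch below combines a fixed "static" emission channel with an adjustable "dynamic" one and evaluates a sequence with the standard forward algorithm. All parameter values are made up.

```python
# Two-channel emission matrix feeding a standard forward-algorithm likelihood.
import numpy as np

A  = np.array([[0.8, 0.2],
               [0.3, 0.7]])                # state transition matrix
pi = np.array([0.6, 0.4])                  # initial state distribution
B_static  = np.array([[0.5, 0.3, 0.2],
                      [0.2, 0.3, 0.5]])    # kept fixed to preserve a valid HMM
B_dynamic = np.array([[0.6, 0.2, 0.2],
                      [0.1, 0.4, 0.5]])    # adjusted during discriminative training

B = 0.5 * (B_static + B_dynamic)
B /= B.sum(axis=1, keepdims=True)          # rows remain proper distributions

def forward_likelihood(obs):
    """P(observation sequence | model) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward_likelihood([0, 2, 1, 1, 2]))
```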

16 citations


Book
20 Jan 2005
TL;DR: In this paper, three puzzles of multimodal speech perception are investigated: temporal organization of cued speech production, bimodal perception within the natural time-course of speech production and sensory information for face perception.
Abstract: 1. Three puzzles of multimodal speech perception R. E. Remez 2. Visual speech perception L. E. Bernstein 3. Dynamic information for face perception K. Lander and V. Bruce 4. Investigating auditory-visual speech perception development D. Burnham and K. Sekiyama 5. Brain bases for seeing speech: FMRI studies of speechreading R. Campbell and M. MacSweeney 6. Temporal organization of cued speech production D. Beautemps, M.-A. Cathiard, V. Attina and C. Savariaux 7. Bimodal perception within the natural time-course of speech production M.-A. Cathiard, A. Vilain, R. Laboissiere, H. Loevenbruck, C. Savariaux and J.-L. Schwartz 8. Visual and audiovisual synthesis and recognition of speech by computers N. M. Brooke and S. D. Scott 9. Audiovisual automatic speech recognition G. Potamianos, C. Neti, J. Luettin and I. Matthews 10. Image-based facial synthesis M. Slaney and C. Bregler 11. A trainable videorealistic speech animation system T. Ezzat, G. Geiger and T. Poggio 12. Animated speech: research progress and applications D. W. Massaro, M. M. Cohen, M. Tabain, J. Beskow and R. Clark 13. Empirical perceptual-motor linkage of multimodal speech E. Vatikiotis-Bateson and K. G. Munhall 14. Sensorimotor characteristics of speech production G. Bailly, P. Badin, L. Reveret and A. Ben Youssef.

13 citations



Patent
22 Feb 2005
TL;DR: In this paper, a technique for extracting visemes is proposed, where each of the time domain classification vectors is derived from one of the successive frames of digitized analog speech information.
Abstract: A technique for extracting visemes includes receiving successive frames of digitized analog speech information obtained from the speech signal at a fixed rate (210), filtering each of the successive frames of digitized analog speech information to synchronously generate time domain frame classification vectors at the fixed rate (215, 220, 225, 230, 235, 240), and analyzing each of the time domain classification vectors (250) to synchronously generate a set of visemes corresponding to each of the successive frames of digitized speech information at the fixed rate. Each of the time domain frame classification vectors is derived from one of the successive frames of digitized analog speech information. N multi-taper discrete prolate spheroid sequence basis (MTDPSSB) functions (220) that are factors of a Fredholm integral of the first kind may be used for the filtering, and the analyzing may use a spatial classification function (250). The latency is less than 100 milliseconds.
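The patent names multi-taper DPSS filtering of fixed-rate speech frames; the snippet below is only a guessed sketch of that step using SciPy's DPSS windows. The frame length, time-bandwidth product, taper count, and the choice of per-taper log energies as the "frame classification vector" are all assumptions, not the patent's specification.

```python
# Multi-taper (DPSS) analysis of one speech frame (sketch).
import numpy as np
from scipy.signal.windows import dpss

frame_len, NW, n_tapers = 256, 2.5, 4
tapers = dpss(frame_len, NW, Kmax=n_tapers)       # (n_tapers, frame_len) Slepian sequences

def frame_classification_vector(frame):
    """Apply each taper to the speech frame and summarize its spectrum."""
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return np.log(spectra.mean(axis=1) + 1e-12)   # one value per taper

frame = np.random.randn(frame_len)                # stand-in 16 ms speech frame
vec = frame_classification_vector(frame)
```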

Proceedings ArticleDOI
15 Jun 2005
TL;DR: A new method for mapping natural speech to lip shape animation in real time using neural networks that eliminates the need for tedious manual neural network design by trial and error and considerably improves the viseme classification results.
Abstract: In this paper we present a new method for mapping natural speech to lip shape animation in real time. The speech signal, represented by MFCC vectors, is classified into viseme classes using neural networks. The topology of the neural networks is automatically configured using genetic algorithms. This eliminates the need for tedious manual neural network design by trial and error and considerably improves the viseme classification results. The method works in both real-time and offline modes and is suitable for various applications. Building on the described lip-sync system, we propose new multimedia services for mobile devices.

Proceedings ArticleDOI
Guobin Ou, Xin Li, XiaoCao Yao, Hongbin Jia, Yi Lu Murphey
27 Dec 2005
TL;DR: An algorithm that automatically extracts lip areas from speaker images, and a neural network system that integrates the two different types of signals to give accurate identification of speakers are developed.
Abstract: We present a speaker identification system that uses synchronized speech signals and lip features. We developed an algorithm that automatically extracts lip areas from speaker images, and a neural network system that integrates the two different types of signals to give accurate identification of speakers. We show that the proposed system gives better performance than systems that use only speech or only lip features, in both text-dependent and text-independent speaker identification applications.
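A minimal sketch of feature-level fusion of the two modalities for speaker identification; the feature dimensions, speaker count, and data are placeholders and the network is not the paper's architecture.

```python
# Concatenate audio and lip features, then classify the speaker (sketch).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
audio_feats = rng.normal(size=(400, 13))      # e.g. per-utterance MFCC statistics
lip_feats   = rng.normal(size=(400, 20))      # e.g. lip-region shape/appearance features
speakers    = rng.integers(0, 10, size=400)   # stand-in speaker labels

fused = np.hstack([audio_feats, lip_feats])   # feature-level fusion of the two modalities
clf = MLPClassifier(hidden_layer_sizes=(40,), max_iter=400).fit(fused, speakers)
```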

01 Jan 2005
TL;DR: The results show that lip animation can indeed enhance speech perception if done correctly, and lip movement that does not correlate with the presented speech resulted in worse performance in the presence of masking noise than when no lip animation was used at all.
Abstract: The addition of facial animation to characters greatly contributes to realism and presence in virtual environments. Even simple animations can make a character seem more lifelike and more believable. The purpose of this study was to determine whether the rudimentary lip animations used in most virtual environments could influence the perception of speech. The results show that lip animation can indeed enhance speech perception if done correctly. Lip movement that did not correlate with the presented speech, however, resulted in worse performance in the presence of masking noise than when no lip animation was used at all.


Journal ArticleDOI
TL;DR: A Chinese text-to-visual speech synthesis system based on a data-driven (sample-based) approach, realized by concatenating short video segments, is presented, together with an effective method to construct two visual confusion trees for Chinese initials and finals.

Abstract: Text-To-Visual Speech (TTVS) synthesis by computer can increase speech intelligibility and make human-computer interaction interfaces more friendly. This paper describes a Chinese text-to-visual speech synthesis system based on a data-driven (sample-based) approach, which is realized by concatenating short video segments. An effective method to construct two visual confusion trees for Chinese initials and finals is developed. A co-articulation model based on visual distance and a hardness factor is proposed, which is used for sentence selection for the recording corpus in the analysis phase and for unit selection in the synthesis phase. The visible discontinuity between boundary images of concatenated video segments is smoothed by an image-morphing technique. By combining this with acoustic Text-To-Speech (TTS) synthesis, a complete Chinese text-to-visual speech synthesis system is realized.
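The co-articulation cost based on "visual distance" and a "hardness factor" is only named above, not specified, so the following is a guessed form for illustration: a candidate unit is scored by its visual distance to the previous unit, scaled by how resistant (hard) the boundary viseme is to co-articulation.

```python
# Guessed unit-selection join cost combining visual distance and hardness (sketch).
import numpy as np

def join_cost(prev_unit_feats, cand_unit_feats, hardness):
    """Lower is better; hardness in [0, 1] scales the visual-distance penalty."""
    visual_distance = np.linalg.norm(prev_unit_feats - cand_unit_feats)
    return hardness * visual_distance

def select_unit(prev_unit_feats, candidates, hardness):
    costs = [join_cost(prev_unit_feats, c, hardness) for c in candidates]
    return int(np.argmin(costs))

prev  = np.random.rand(8)                     # stand-in visual features of the last unit
cands = [np.random.rand(8) for _ in range(5)] # candidate units from the corpus
best  = select_unit(prev, cands, hardness=0.7)
```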

Proceedings ArticleDOI
04 Sep 2005
TL;DR: A new hybrid visual feature combination suitable for audio-visual speech recognition was implemented; experiments using only the visual features resulted in high recognition accuracy and improved audio-visual speech recognition drastically.

Abstract: In this work, a system for audio-visual speech recognition is presented. A new hybrid visual feature combination, suitable for audio-visual speech recognition, was implemented. The features comprise both the shape and the appearance of the lips; dimensionality reduction is applied using the discrete cosine transform (DCT). A large visual speech database of the German language has been assembled, the German Audio-Visual Database (GAVD). The experiments conducted using only visual features resulted in a high recognition accuracy and improved audio-visual speech recognition drastically.
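A short sketch of the DCT-based dimensionality reduction of the lip appearance: a 2-D DCT of the grayscale lip region, keeping only a small low-frequency block of coefficients. The image size and block size are assumptions, not the paper's settings.

```python
# 2-D DCT of the lip region, retaining a low-frequency coefficient block (sketch).
import numpy as np
from scipy.fft import dctn

lip_roi = np.random.rand(32, 64)              # stand-in grayscale lip image
coeffs = dctn(lip_roi, norm='ortho')          # 2-D DCT
appearance_feature = coeffs[:6, :6].ravel()   # keep a 6x6 low-frequency block (36 values)

# These appearance features would then be combined with lip-shape features to
# form the hybrid visual feature vector described above.
```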

Book ChapterDOI
05 Sep 2005
TL;DR: Real-time classification algorithms are presented for visual mouth appearances (visemes) which correspond to phonemes and their speech contexts, used in the design of a talking-head application; the DFT+LDA approach has practical advantages over the MESH+LDA classifier.

Abstract: Real-time classification algorithms are presented for visual mouth appearances (visemes) which correspond to phonemes and their speech contexts. They are used in the design of a talking-head application. Two feature extraction procedures were verified. The first is based on a normalized triangle mesh covering the mouth area and a color image texture vector indexed by barycentric coordinates. The second performs a Discrete Fourier Transform on an image rectangle enclosing the mouth, keeping only a small block of DFT coefficients. The classifier has been designed by an optimized LDA method which uses a two-singular-subspace approach. Despite its higher computational complexity (about three milliseconds per video frame on a Pentium IV 3.2 GHz), the DFT+LDA approach has practical advantages over the MESH+LDA classifier. Firstly, its recognition rate is more than two percentage points higher (99.3% versus 97.2%). Secondly, the automatic identification of the rectangle covering the mouth is more robust than the automatic identification of the triangle mesh covering the mouth.
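A sketch of the DFT-block-plus-LDA pipeline at the level of detail given above: a small block of 2-D DFT magnitudes from the mouth rectangle feeds an LDA classifier. Image dimensions, block size, and labels are placeholders rather than the paper's settings.

```python
# Low-frequency DFT block features classified with LDA (sketch).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def dft_block_features(mouth_img, block=8):
    spectrum = np.abs(np.fft.fft2(mouth_img))
    return spectrum[:block, :block].ravel()        # keep a small coefficient block

rng = np.random.default_rng(2)
images = rng.random((200, 32, 48))                 # stand-in mouth rectangles
labels = rng.integers(0, 6, size=200)              # stand-in viseme classes

X = np.array([dft_block_features(im) for im in images])
clf = LinearDiscriminantAnalysis().fit(X, labels)
pred = clf.predict(X)
```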

Proceedings Article
01 Jan 2005
TL;DR: A German viseme inventory for visemically transcribing text according to its phonetic transcription is introduced; an inventory of German viseme classes in a SAMPA-like labelling is worked out and a model for automatic visemic transcription of given input text is trained.

Abstract: In this paper, we introduce a German viseme inventory for visemically transcribing text according to its phonetic transcription. A viseme set like the one presented in this work is essential for speech-driven audio-visual synthesis, because the selection of appropriate video segments is based on the visemically transcribed input text. For text-to-speech synthesis, a transcription of the input text into a phonemic representation is used in order to avoid ambiguous meanings, to acquire the correct pronunciation of the underlying input text, and to serve as labels in unit-selection-based synthesis systems. Likewise, visual synthesis requires a transcription that represents, analogously to the phonemes, their visual counterparts, called visemes in the related literature, which also serve as unit labels in our data-driven, video-realistic audio-visual synthesis system. We worked out an inventory of German viseme classes in a SAMPA-like labelling and trained a model for automatic visemic transcription of given input text.
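For illustration only, here is a tiny fragment of what a SAMPA-like phoneme-to-viseme mapping and a trivial transcription function could look like; the class names and groupings are invented and are not the paper's inventory.

```python
# Hypothetical phoneme-to-viseme mapping (SAMPA-like symbols).
PHONEME_TO_VISEME = {
    "p": "V_BILABIAL", "b": "V_BILABIAL", "m": "V_BILABIAL",
    "f": "V_LABIODENTAL", "v": "V_LABIODENTAL",
    "o:": "V_ROUNDED", "u:": "V_ROUNDED",
    "a:": "V_OPEN", "a": "V_OPEN",
}

def visemic_transcription(phonemes):
    """Map a phonemic transcription to viseme labels (unknowns get a default class)."""
    return [PHONEME_TO_VISEME.get(p, "V_NEUTRAL") for p in phonemes]

print(visemic_transcription(["m", "a:", "m", "a"]))   # ['V_BILABIAL', 'V_OPEN', ...]
```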

01 Jan 2005
TL;DR: This paper presents a method for spatio-temporal tracking of points of interest in the speaker’s face and for indicating the different configurations of the mouth through visemes, in order to establish a correlation between phonemes and visemes.

Abstract: Speech recognition is a basic component of several current research projects. However, to understand speech, hearing is not always enough: it is sometimes necessary to see it. Indeed, perceptual studies have shown that the visual information provided by the interlocutor’s face contributes largely to the improvement of speech intelligibility under degraded communication conditions. Several domains are concerned with the use of such visual information, such as e-learning and human-machine interaction. This paper presents a method for spatio-temporal tracking of some points of interest in the speaker’s face and for indicating the different configurations of the mouth through visemes. These visemes are then associated with relatively precise physical measures, such as the spreading of the lips and the mouth height, in order to establish a correlation between phonemes and visemes. The results of our experiment show that the whole set of French phonemes can be described by these visemes.
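A simple sketch of the physical measures mentioned above (lip spreading and mouth height) computed from four tracked mouth landmarks; the landmark positions and the threshold-based labelling are hypothetical.

```python
# Lip spreading and mouth height from four tracked mouth points (sketch).
import numpy as np

def mouth_measures(left, right, top, bottom):
    """Each argument is an (x, y) landmark; returns (spreading, height) in pixels."""
    spreading = np.linalg.norm(np.subtract(right, left))
    height = np.linalg.norm(np.subtract(bottom, top))
    return spreading, height

spreading, height = mouth_measures((10, 50), (70, 52), (40, 35), (40, 70))
label = "open-vowel viseme" if height > spreading * 0.5 else "spread/closed viseme"
```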

Proceedings ArticleDOI
19 May 2005
TL;DR: Two methods for automatic facial gesturing of graphically embodied animated agents are presented; one drives a conversational agent from speech in an automatic lip-sync process, while the other provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by the appropriate facial gestures.

Abstract: We present two methods for automatic facial gesturing of graphically embodied animated agents. In one case, a conversational agent is driven by speech in an automatic lip-sync process. By analyzing the speech input, lip movements are determined from the speech signal. The other method provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by the appropriate facial gestures. The proposed statistical model for generating the virtual speaker's facial gestures can also be applied as an addition to the lip synchronization process in order to obtain speech-driven facial gesturing. In this case the statistical model will be triggered by the prosody of the input speech instead of a lexical analysis of the input text.

Proceedings ArticleDOI
16 Dec 2005
TL;DR: A context-based visubsyllable database is set up to map Chinese initials or finals to their corresponding mouth shapes, so that 3D facial animation can be synthesized from speech signal input.

Abstract: This paper describes a prototype implementation of a speech-driven facial animation system for embedded devices. The system comprises speech recognition and talking-head synthesis. A context-based visubsyllable database is set up to map Chinese initials or finals to their corresponding pronunciation mouth shapes. With the database, 3D facial animation can be synthesized from speech signal input. Experimental results show the system works well in simulating real mouth shapes and providing a friendly interface in communication terminals.

Book ChapterDOI
28 Sep 2005
TL;DR: Real-time recognition of visual face appearances (visemes) which correspond to phonemes and their speech contexts is presented, and it appears that the LDA classifier outperforms the PCA subspace technique.

Abstract: Real-time recognition of visual face appearances (visemes) which correspond to phonemes and their speech contexts is presented. We distinguish six major classes of visemes. Features are extracted in the form of normalized image texture. The normalization procedure uses barycentric coordinates in a mesh of triangles superimposed onto a reference facial image. The mesh itself is defined using a subset of FAP points conforming to the MPEG-4 standard. The classifiers were designed by PCA subspace and LDA methods. It appears that the LDA classifier outperforms the subspace technique: it is better than the best PCA subspace classifier in recognition rate by more than 13 percentage points (97% versus 84%), it is more than 10 times faster (0.5 ms versus 7 ms), and its time is negligible compared with the mouth-image normalization time (0.5 ms versus 5 ms).
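As a small aside on the barycentric-coordinate step used for texture normalization: a pixel's barycentric coordinates inside a mesh triangle are invariant to how the triangle deforms, so textures from different mouth shapes become comparable. The sketch below shows only that coordinate computation, with made-up points.

```python
# Barycentric coordinates of a 2-D point inside a triangle (sketch).
import numpy as np

def barycentric(p, a, b, c):
    """Return (alpha, beta, gamma) of point p with respect to triangle (a, b, c)."""
    T = np.column_stack([np.subtract(b, a), np.subtract(c, a)])
    beta, gamma = np.linalg.solve(T, np.subtract(p, a))
    return 1.0 - beta - gamma, beta, gamma

# The same (alpha, beta, gamma) indexes the corresponding texture sample in every
# frame, which is how a normalized image texture vector can be assembled.
print(barycentric((2.0, 1.0), (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)))
```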

Proceedings ArticleDOI
24 Oct 2005
TL;DR: Two methods for automatic facial gesturing of graphically embodied animated agents are presented; one drives a conversational agent from speech in an automatic lip-sync process, while the other provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by the appropriate facial gestures.

Abstract: We present two methods for automatic facial gesturing of graphically embodied animated agents. In one case, a conversational agent is driven by speech in an automatic lip-sync process. By analyzing the speech input, lip movements are determined from the speech signal. The other method provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by the appropriate facial gestures. The proposed statistical model for generating the virtual speaker's facial gestures can also be applied as an addition to the lip synchronization process in order to obtain speech-driven facial gesturing. In this case the statistical model is triggered by the prosody of the input speech instead of a lexical analysis of the input text.


Proceedings ArticleDOI
04 Sep 2005
TL;DR: The results show that, by using the HMM models defined in the training phase, the speech recognizer reliably detects specific speech sounds with a small error rate.

Abstract: This paper presents an "elitist approach" for automatically extracting well-realized speech sounds with high confidence. The elitist approach uses a speech recognition system based on hidden Markov models (HMMs). The HMMs are trained on speech sounds which are systematically well detected in an iterative procedure. The results show that, by using the HMM models defined in the training phase, the speech recognizer reliably detects specific speech sounds with a small error rate.

Proceedings ArticleDOI
12 May 2005
TL;DR: A system for automatic generation of facial animation keyframes using a MaxScript control script is developed; after the keyframes are created, the animator can fine-tune them and add emotional facial expressions.

Abstract: To reduce the effort of an animator in creating facial animation keyframes, we developed a system for automatic keyframe generation using a MaxScript control script. The input to the script is a parameter file containing the phonemes of the prerecorded soundtrack and their durations. Recognition of the phonemes is done by an LVQ neural network. After the keyframes are created by the script, the animator can fine-tune them and add facial expressions of emotions.
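The actual control script is MaxScript; as a language-neutral illustration of the keyframe-generation idea only, the Python sketch below reads "phoneme duration" lines, accumulates time, and emits one (frame, phoneme) keyframe per phoneme. The file format shown is an assumption.

```python
# Turn a phoneme/duration parameter file into keyframe times (sketch).
def keyframes_from_parameter_file(path, fps=25):
    keyframes, t = [], 0.0
    with open(path) as f:
        for line in f:                        # e.g. "m 0.08" -> phoneme, duration in seconds
            phoneme, duration = line.split()
            keyframes.append((round(t * fps), phoneme))
            t += float(duration)
    return keyframes

# Each (frame, phoneme) pair would then be mapped to a morph-target or bone-pose
# keyframe by the animation script, after which the animator fine-tunes by hand.
```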

01 Jan 2005
TL;DR: This paper looks at an alternate technique for applying multivariate statistical techniques to lip-sync a cartoon or model with an audio stream in real time, which requires orders of magnitude less processing power than traditional methods.
Abstract: With the advance of modern computer hardware, computer animation has advanced by leaps and bounds. What formerly took weeks of processing can now be generated on the fly. However, the actors in games often stand mute with faces unmoving, or speak only in canned phrases, as the technology for calculating their lip positions from an arbitrary sound segment has lagged behind the technology that allows the movement of those lips to be rendered in real time. Traditional speech recognition techniques require the entire utterance to be present, or at least a wide window around the text to be matched, so that higher-level structure can be used in determining what words are being spoken. This approach, while highly appropriate for recognizing the sounds present in an audio stream and mapping those to speech, is less applicable to the problem of "lip-syncing" in real time. This paper looks at an alternate technique: applying multivariate statistical techniques to lip-sync a cartoon or model with an audio stream in real time, which requires orders of magnitude less processing power than traditional methods.

01 Jan 2005
TL;DR: This approach additionally requires a modification to traditional techniques employed for the estimation of hidden Markov models (HMMs), whose resultant models the authors refer to as free-parts HMMs (FP-HMMs); results are presented on the CUAVE audio-visual speech database.
Abstract: Motivated by the success of free-parts based representations in face recognition [1] we have attempted to address some of the problems associated with applying such a philosophy to the task of speaker-independent automatic speech reading. Hitherto, a major problem with canonical area-based approaches in automatic speech reading is the intrinsic lack of training observations due to the visual speech modality's low sample rate and large variability in appearance. We believe a free-parts representation can overcome many of these limitations due to its natural ability to generalize by producing many observations from a single mouth image, whilst still preserving the ability to discriminate between various visual-speech units. This approach additionally requires a modification to traditional techniques employed for the estimation of hidden Markov Models (HMMs), whose resultant models we currently refer to as free-parts HMMs (FP-HMMs). Results will be presented on the CUAVE audiovisual speech database.

Journal ArticleDOI
TL;DR: C. Richie and D. Kewley-Port used pre-training data (vowel-identification confusion matrices) to determine whether vowel visemes exist for untrained speechreaders.
Abstract: The status of visemes, groups of visually confusable speech sounds, for American English vowels has been disputed for some time. While some researchers claim that vowels are visually distinguishable, others claim that some vowels are visually confusable and comprise viseme categories. Data from our study on speechreading words and sentences were examined for evidence of vowel visemes [C. Richie and D. Kewley‐Port, J. Acoust. Soc. Am. 117, 2570 (2005)]. Normal‐hearing listeners were tested in auditory‐visual conditions in masking noise designed to simulate a hearing loss. They were trained on speechreading tasks emphasizing vowels, consonants, or vowels and consonants combined. Pre‐ and post‐training speechreading tests included identification of 10 vowels in CVC context. Pre‐training data (vowel‐identification confusion matrices) were used to determine whether vowel visemes exist for untrained speechreaders. Post‐training results were examined to determine whether the number of vowel response categories increased and whether the number of vowel identification errors decreased, for trained versus untrained participants. The impact of these training programs on speechreading performance is discussed in terms of vowel visemes. [Work supported by NIHDCD02229.]
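One common way to derive candidate viseme categories from identification confusion matrices like those described above is to cluster sounds that are mutually confused; the sketch below shows that general idea with hierarchical clustering. The matrix is random filler, not the study's data, and the number of clusters is an arbitrary choice.

```python
# Group vowels into candidate visemes by clustering a confusion matrix (sketch).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
C = rng.random((10, 10))                        # stand-in 10-vowel confusion matrix
C = C / C.sum(axis=1, keepdims=True)            # rows: stimulus -> response proportions

similarity = 0.5 * (C + C.T)                    # symmetrize mutual confusions
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)

condensed = distance[np.triu_indices(10, k=1)]  # condensed form expected by linkage()
groups = fcluster(linkage(condensed, method='average'), t=4, criterion='maxclust')
# Vowels sharing a group label would be treated as one viseme category.
```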