
Showing papers on "Viseme published in 2012"


Proceedings ArticleDOI
29 Jul 2012
TL;DR: It is found that dynamic visemes are able to produce more accurate and visually pleasing speech animation given phonetically annotated audio, reducing the amount of time that an animator needs to spend manually refining the animation.
Abstract: We present a new method for generating a dynamic, concatenative unit of visual speech that can generate realistic visual speech animation. We redefine visemes as temporal units that describe distinctive speech movements of the visual speech articulators. Traditionally, visemes have been conceived as a set of static mouth shapes representing clusters of contrastive phonemes (e.g. /p, b, m/ and /f, v/). In this work, the motion of the visual speech articulators is used to generate discrete, dynamic visual speech gestures. These gestures are clustered, providing a finite set of movements that describe visual speech: the visemes. Dynamic visemes are applied to speech animation by simply concatenating viseme units, and we compare them to static visemes using subjective evaluation. We find that dynamic visemes are able to produce more accurate and visually pleasing speech animation given phonetically annotated audio, reducing the amount of time that an animator needs to spend manually refining the animation.
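As a rough illustration of the clustering step described above, the sketch below time-normalises segmented articulator gestures and groups them with k-means; the resampling length, feature layout and cluster count are illustrative assumptions, not the authors' settings.

```python
# Sketch: cluster segmented visual-speech gestures into dynamic visemes.
# Assumes each gesture is an (n_frames, n_params) array of articulator
# parameters (e.g. AAM or marker coordinates); all names are illustrative.
import numpy as np
from scipy.interpolate import interp1d
from sklearn.cluster import KMeans

def resample_gesture(gesture, n_samples=20):
    """Time-normalise a gesture to a fixed number of samples."""
    t = np.linspace(0.0, 1.0, len(gesture))
    t_new = np.linspace(0.0, 1.0, n_samples)
    return interp1d(t, gesture, axis=0)(t_new).ravel()

def cluster_dynamic_visemes(gestures, n_visemes=30):
    """Cluster time-normalised gestures; each cluster is one dynamic viseme."""
    features = np.stack([resample_gesture(g) for g in gestures])
    km = KMeans(n_clusters=n_visemes, n_init=10, random_state=0).fit(features)
    return km.labels_, km.cluster_centers_
```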

132 citations


Proceedings Article
01 Jan 2012
TL;DR: These initial experiments demonstrate that the choice of visual unit requires more careful attention in audio-visual speech recognition system development, and the best visual-only recognition on the VidTIMIT database is achieved using a linguistically motivated viseme set.
Abstract: Phonemes are the standard modelling unit in HMM-based continuous speech recognition systems. Visemes are the equivalent unit in the visual domain, but there is less agreement on precisely what visemes are, or how many to model on the visual side of audio-visual speech recognition systems. This paper compares the use of 5 viseme maps in a continuous speech recognition task. The focus of the study is visual-only recognition, to examine the choice of viseme map. All the maps are based on the phoneme-to-viseme approach, created using either a linguistic method or a data-driven method. DCT, PCA and optical flow are used as the visual features. The best visual-only recognition on the VidTIMIT database is achieved using a linguistically motivated viseme set. These initial experiments demonstrate that the choice of visual unit requires more careful attention in audio-visual speech recognition system development.
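To make the phoneme-to-viseme approach concrete, here is a minimal sketch that collapses a phoneme transcript into viseme labels; the particular groupings form a hypothetical linguistically motivated map, not one of the five maps evaluated in the paper.

```python
# Sketch: collapse a phoneme transcript into viseme labels using a
# linguistically motivated phoneme-to-viseme map (groupings are illustrative).
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar", "s": "V_alveolar",
    "k": "V_velar", "g": "V_velar",
    "iy": "V_spread", "ih": "V_spread",
    "uw": "V_round", "ow": "V_round",
    "aa": "V_open", "ae": "V_open",
}

def phonemes_to_visemes(phonemes):
    """Map a phoneme sequence to visemes, merging adjacent duplicates."""
    visemes = [PHONEME_TO_VISEME.get(p, "V_other") for p in phonemes]
    return [v for i, v in enumerate(visemes) if i == 0 or v != visemes[i - 1]]

print(phonemes_to_visemes(["b", "ih", "n"]))  # ['V_bilabial', 'V_spread', 'V_alveolar']
```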

72 citations


Journal ArticleDOI
TL;DR: An image-based visual speech animation system that represents a video sequence by a low-dimensional continuous curve embedded in a path graph and establishes a map from the curve to the image domain to preserve the video dynamics of a talking face.
Abstract: An image-based visual speech animation system is presented in this paper. A video model is proposed to preserve the video dynamics of a talking face. The model represents a video sequence by a low-dimensional continuous curve embedded in a path graph and establishes a map from the curve to the image domain. When selecting video segments for synthesis, we loosen the traditional requirement of using the triphone as the unit, allowing segments to contain longer stretches of natural talking motion. Dense videos are sampled from the segments, concatenated, and downsampled to train a video model that enables efficient time alignment and motion smoothing for the final video synthesis. Different viseme definitions are used to investigate the impact of visemes on the video realism of the animated talking face. The system is built on a public database and tested both objectively and subjectively.
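As a loose sketch of the general idea of representing a talking-face video as a low-dimensional curve, the code below uses ordinary PCA as a stand-in for the paper's path-graph embedding and resamples the resulting curve for time alignment; the dimensionality is an assumption.

```python
# Sketch: embed a frame sequence as a low-dimensional curve and resample it
# for time alignment. PCA stands in for the paper's path-graph embedding.
import numpy as np
from sklearn.decomposition import PCA

def video_to_curve(frames, n_dims=10):
    """frames: (n_frames, height*width) array of vectorised images."""
    pca = PCA(n_components=n_dims)
    curve = pca.fit_transform(frames)       # one point on the curve per frame
    return curve, pca

def resample_curve(curve, n_points):
    """Piecewise-linear resampling of the curve to n_points (time alignment)."""
    t = np.linspace(0.0, 1.0, len(curve))
    t_new = np.linspace(0.0, 1.0, n_points)
    return np.stack([np.interp(t_new, t, curve[:, d])
                     for d in range(curve.shape[1])], axis=1)
```

Mapping points on the curve back to the image domain could be approximated with `pca.inverse_transform`, although the paper's learned map is more elaborate.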

35 citations


Proceedings ArticleDOI
25 Mar 2012
TL;DR: A multiview dataset of connected words is generated that can be analysed both by an automatic system, based on linear predictive trackers and active appearance models, and by human lip-readers; the automatic system is also good at estimating its own fallibility.
Abstract: Computer lip-reading is one of the great signal processing challenges. Not only is the signal noisy, it is variable. However, automatic performance is almost never compared with that of human lip-readers. Partly this is because of the paucity of human lip-readers, and partly because most automatic systems only handle data that are trivial and therefore not representative of human speech. Here we generate a multiview dataset of connected words that can be analysed both by an automatic system, based on linear predictive trackers and active appearance models, and by human lip-readers. The automatic system we devise has a viseme accuracy of ≈ 46%, which is comparable to poor professional human lip-readers. However, unlike human lip-readers, our system is good at estimating its own fallibility.
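For context, a viseme accuracy of the kind reported above is typically computed from an edit-distance alignment between the recognised and reference viseme sequences; the sketch below shows that standard calculation under unit edit costs and is not the authors' scoring code.

```python
# Sketch: HTK-style accuracy from a minimum edit-distance alignment between a
# reference and a recognised viseme sequence.
def viseme_accuracy(reference, hypothesis):
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edit cost to align reference[:i] with hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return (n - d[n][m]) / n  # equals (N - D - S - I) / N under unit edit costs
```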

32 citations


Book ChapterDOI
01 Apr 2012

17 citations


Journal ArticleDOI
TL;DR: A significant increase in recognition effectiveness and processing speed was noted during tests, given properly selected CHMM parameters, an adequate codebook size, and an appropriate fusion of the audio-visual characteristics.
Abstract: This paper focuses on combining audio-visual signals for Polish speech recognition under conditions of a highly disturbed audio speech signal. Recognition of audio-visual speech was based on combined hidden Markov models (CHMM). The described methods were developed for single isolated commands; nevertheless, their effectiveness indicates that they would also work similarly in continuous audio-visual speech recognition. The problem of visual speech analysis is very difficult and computationally demanding, mostly because of the extreme amount of data that needs to be processed. Therefore, the audio-visual speech recognition method is used only while the audio speech signal is exposed to a considerable level of distortion. The authors' own methods of lip-edge detection and visual characteristic extraction are proposed in this paper. Moreover, a method of fusing speech characteristics for an audio-visual signal was proposed and tested. A significant increase in recognition effectiveness and processing speed was noted during tests, given properly selected CHMM parameters, an adequate codebook size, and an appropriate fusion of the audio-visual characteristics. The experimental results were very promising and close to those achieved by leading scientists in the field of audio-visual speech recognition.
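A minimal sketch of weighted audio-visual fusion in the spirit of the work described above, assuming per-word log-likelihoods are already available from separate audio and visual models; the SNR-dependent stream weighting shown here is a common baseline, not the authors' CHMM formulation.

```python
# Sketch: late fusion of audio and visual model scores with a stream weight
# that shifts towards the visual stream as the audio becomes noisier.
import numpy as np

def fuse_scores(audio_loglik, visual_loglik, snr_db):
    """audio_loglik, visual_loglik: dicts word -> log-likelihood."""
    # Map SNR in [0, 30] dB to an audio stream weight in [0.2, 0.9] (heuristic).
    w_audio = float(np.clip(0.2 + 0.7 * (snr_db / 30.0), 0.2, 0.9))
    w_visual = 1.0 - w_audio
    fused = {w: w_audio * audio_loglik[w] + w_visual * visual_loglik[w]
             for w in audio_loglik}
    return max(fused, key=fused.get), fused
```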

13 citations


Book ChapterDOI
01 Jan 2012
TL;DR: The potential of the framework is demonstrated by developing the first automatic visual speech animation system for European Portuguese based on the concatenation of visemes and assessing the quality of two different phoneme-to-viseme mappings devised for the language.
Abstract: Visual speech animation, or lip synchronization, is the process of matching speech with the lip movements of a virtual character. It is a challenging task because all articulatory movements must be controlled and synchronized with the audio signal. Existing language-independent systems usually require fine-tuning by an artist to avoid artefacts appearing in the animation. In this paper, we present a modular visual speech animation framework aimed at speeding up and easing the visual speech animation process as compared with traditional techniques. We demonstrate the potential of the framework by developing the first automatic visual speech animation system for European Portuguese based on the concatenation of visemes. We also present the results of a preliminary evaluation that was carried out to assess the quality of two different phoneme-to-viseme mappings devised for the language.
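To illustrate the concatenative idea, the hedged sketch below turns phoneme timings into viseme keyframes and linearly interpolates mouth-shape parameters between them; the viseme targets and parameter names are hypothetical, not either of the European Portuguese mappings evaluated in the paper.

```python
# Sketch: concatenative viseme animation from phoneme timings.
# Each viseme is a target vector of mouth-shape parameters (illustrative values).
import numpy as np

VISEME_TARGETS = {               # hypothetical [jaw_open, lip_round] vectors
    "V_closed": np.array([0.0, 0.0]),
    "V_open":   np.array([0.8, 0.1]),
    "V_round":  np.array([0.3, 0.9]),
}

def animate(phoneme_timings, phoneme_to_viseme, fps=25):
    """phoneme_timings: list of (phoneme, start_s, end_s). Returns a (T, 2) frame array."""
    keyframes = [((s + e) / 2.0, VISEME_TARGETS[phoneme_to_viseme[p]])
                 for p, s, e in phoneme_timings]
    times = np.array([t for t, _ in keyframes])
    values = np.stack([v for _, v in keyframes])
    frame_times = np.arange(0.0, phoneme_timings[-1][2], 1.0 / fps)
    return np.stack([np.interp(frame_times, times, values[:, d])
                     for d in range(values.shape[1])], axis=1)
```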

13 citations


Journal ArticleDOI
TL;DR: The results provide arguments for the involvement of the speech motor cortex in phonological discrimination, and suggest a multimodal representation of speech units.

13 citations


Journal ArticleDOI
TL;DR: A main effect of pseudo-homovisemy is found, suggesting that at least some deaf individuals do automatically access sublexical structure during single-word reading, and a working model of single- word reading by deaf adults based on the dual-route cascaded model of reading aloud is proposed.
Abstract: There is an ongoing debate whether deaf individuals access phonology when reading, and if so, what impact the ability to access phonology might have on reading achievement. However, the debate so far has been theoretically unspecific on two accounts: (a) the phonological units deaf individuals may have of oral language have not been specified and (b) there seem to be no explicit cognitive models specifying how phonology and other factors operate in reading by deaf individuals. We propose that deaf individuals have representations of the sublexical structure of oral-aural language which are based on mouth shapes and that these sublexical units are activated during reading by deaf individuals. We specify the sublexical units of deaf German readers as 11 "visemes" and incorporate the viseme set into a working model of single-word reading by deaf adults based on the dual-route cascaded model of reading aloud by Coltheart, Rastle, Perry, Langdon, and Ziegler (2001. DRC: A dual route cascaded model of visual word recognition and reading aloud. Psychological Review, 108, 204-256. doi: 10.1037//0033-295x.108.1.204). We assessed the indirect route of this model by investigating the "pseudo-homoviseme" effect using a lexical decision task in deaf German reading adults. We found a main effect of pseudo-homovisemy, suggesting that at least some deaf individuals do automatically access sublexical structure during single-word reading.
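As a rough illustration of the pseudo-homoviseme manipulation, the sketch below maps letter strings onto viseme codes and flags nonwords whose viseme string coincides with that of a real word; the letter-to-viseme groupings are hypothetical stand-ins for the 11 German visemes used in the study.

```python
# Sketch: detect pseudo-homovisemes, i.e. nonwords whose viseme string
# matches that of a real word (letter-to-viseme groups are illustrative).
LETTER_TO_VISEME = {
    "p": "B", "b": "B", "m": "B",      # bilabial mouth shape
    "f": "F", "w": "F", "v": "F",      # labiodental
    "t": "T", "d": "T", "n": "T", "s": "T",
    "a": "A", "e": "E", "i": "E", "o": "O", "u": "O",
}

def viseme_string(word):
    return "".join(LETTER_TO_VISEME.get(ch, "?") for ch in word.lower())

def is_pseudo_homoviseme(nonword, lexicon):
    target = viseme_string(nonword)
    return any(viseme_string(w) == target for w in lexicon)

print(is_pseudo_homoviseme("pame", ["babe", "lake"]))  # True: 'pame' shares visemes with 'babe'
```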

12 citations


Dissertation
24 Aug 2012
TL;DR: Investigation of techniques for audiovisual speech synthesis, using both viseme-based and data-driven approaches to implement multiple talking heads suggests that the use of talking heads in technology-enhanced learning could be useful in addition to traditional methods.
Abstract: This thesis investigates the use of synthetic talking heads, with lip, tongue and face movements synchronized with synthesized or natural speech, in technology-enhanced learning This work applies talking heads in a speech tutoring application for teaching English as a second language Previous studies have shown that speech perception is aided by visual information, but more research is needed to determine the effectiveness of visualization of articulators in pronunciation training This thesis explores whether or not visual speech technology can give an improvement in learning pronunciation This thesis investigates techniques for audiovisual speech synthesis, using both viseme-based and data-driven approaches to implement multiple talking heads Intelligibility studies found the audiovisual heads to be more intelligible than audio alone, and the data-driven head was found to be more intelligible than the viseme-driven implementation The talking heads are applied in a pronunciation-training application, which is evaluated by second-language learners to investigate the benefit of visual speech in technology-enhanced learning User trials explored the efficacy of the software in demonstrating the /b/–/p/ contrast in English The results indicate that learners showed an improvement in listening and pronunciation after using the software, while the benefit of visualization compared to auditory training alone varied between individuals User evaluations found that the talking heads were perceived to be helpful in learning pronunciation, and the positive feedback on the tutoring system suggests that the use of talking heads in technology-enhanced learning could be useful in addition to traditional methods

6 citations



Journal ArticleDOI
TL;DR: A framework is introduced for synthesizing lip-sync character speech animation in real time from a given speech sequence and its corresponding text, starting from training dominated animeme models for each kind of phoneme by learning the character's animation control signal through an expectation-maximization (EM)-style optimization approach.
Abstract: Character speech animation is traditionally considered as important but tedious work, especially when taking lip synchronization (lip-sync) into consideration. Although there are some methods proposed to ease the burden on artists to create facial and speech animation, almost none is fast and efficient. In this paper, we introduce a framework for synthesizing lip-sync character speech animation in real time from a given speech sequence and its corresponding texts, starting from training dominated animeme models (DAMs) for each kind of phoneme by learning the character's animation control signal through an expectation-maximization (EM)-style optimization approach. The DAMs are further decomposed to polynomial-fitted animeme models and corresponding dominance functions while taking coarticulation into account. Finally, given a novel speech sequence and its corresponding texts, the animation control signal of the character can be synthesized in real time with the trained DAMs. The synthesized lip-sync animation can even preserve exaggerated characteristics of the character's facial geometry. Moreover, since our method can perform in real time, it can be used for many applications, such as lip-sync animation prototyping, multilingual animation reproduction, avatar speech, and mass animation production. Furthermore, the synthesized animation control signal can be imported into 3-D packages for further adjustment, so our method can be easily integrated into the existing production pipeline.
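The dominance-function decomposition is close in spirit to Cohen-Massaro coarticulation blending; the sketch below shows that generic blending scheme with assumed exponential dominance functions, rather than the paper's trained dominated animeme models.

```python
# Sketch: blend per-phoneme target shapes with exponential dominance functions
# (Cohen-Massaro-style coarticulation); parameters are illustrative.
import numpy as np

def dominance(t, center, magnitude=1.0, rate=8.0):
    """Exponentially decaying dominance of a segment centred at `center` (seconds)."""
    return magnitude * np.exp(-rate * np.abs(t - center))

def blend(segments, t):
    """segments: list of (center_time_s, target_param_vector). Returns blended shape at t."""
    weights = np.array([dominance(t, c) for c, _ in segments])
    targets = np.stack([v for _, v in segments])
    return (weights[:, None] * targets).sum(axis=0) / weights.sum()
```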


Book ChapterDOI
01 Jan 2012
TL;DR: This chapter reviews systems that use ASR techniques to evaluate the pronunciation of people who suffer from speech or voice impairments, presenting their main innovations and some of the available resources.
Abstract: Speech disorders are disabilities widely present in the young population, although adults may also suffer from such disorders after certain physical problems. In this context, the detection and, further, the correction of such disabilities may be handled by Automatic Speech Recognition (ASR) technology. The first works on speech disorder detection began in the early 70s and seem to follow the same evolution as ASR itself. Indeed, these early works were based mainly on signal processing techniques. Progressively, systems dealing with speech disorders have incorporated more ideas from ASR technology. In particular, Hidden Markov Models, the state-of-the-art approach in ASR systems, are used. This chapter reviews systems that use ASR techniques to evaluate the pronunciation of people who suffer from speech or voice impairments. The authors investigate the existing systems and present their main innovations and some of the available resources.

17 Sep 2012
TL;DR: This work demonstrates 11 final visemes representing the 28 consonantal Arabic phonemes and shows the variation of pitch for each viseme.
Abstract: The aim of this work is to present primary research on Arabic audiovisual analysis. Each language has multiple phonemes and visemes, and each viseme can correspond to multiple phonemes. The first part focuses on how to classify Arabic visemes from still images, whereas the second part shows the variation of pitch for each viseme. We have not taken viseme coarticulation into account. To evaluate the performance of the proposed method, we collected a large number of visual speech signals from ten Algerian speakers, male and female, pronouncing 28 Arabic syllables at different moments. In this work, we demonstrate 11 final visemes representing the 28 consonantal Arabic phonemes.

Proceedings ArticleDOI
01 Nov 2012
TL;DR: A novel word lip-reading system is proposed that detects and determines initial mouth-shape codes to recognize uttered consonants, and is thereby able to discriminate words consisting of the same sequential vowel codes but containing different consonant codes.
Abstract: Visual speech recognition, or lip reading, is an approach for noise-robust speech recognition that adds the speaker's visual cues to the audio information. Visual-only speech recognition is applicable to speaker verification and to multimedia interfaces for supporting speech-impaired persons. The sequential mouth-shape code method is an effective approach to lip reading, particularly for uttered Japanese words, that utilizes two kinds of distinctive mouth shapes, known as first and last mouth shapes, which appear intermittently. One advantage of this method is its low computational burden for the learning and word registration processes. This paper proposes a novel word lip recognition system that detects and determines initial mouth-shape codes to recognize uttered consonants. The proposed method is thereby able to discriminate different words consisting of the same sequential vowel codes but containing different consonant codes. The conducted experiments demonstrate that the proposed system provides a higher recognition rate than the conventional ones.
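A toy sketch of the word-discrimination idea: each registered word is stored as an initial consonant mouth-shape code plus a sequence of vowel codes, so two words with identical vowel codes can still be told apart; all codes and words below are hypothetical placeholders.

```python
# Sketch: word lookup by vowel mouth-shape codes plus an initial consonant
# mouth-shape code (codes are hypothetical placeholders).
REGISTERED_WORDS = {
    # word: (initial_consonant_code, tuple_of_vowel_codes)
    "kaki":  ("K", ("A", "I")),
    "sashi": ("S", ("A", "I")),   # same vowel codes, different initial code
}

def recognise(initial_code, vowel_codes):
    matches = [w for w, (ic, vc) in REGISTERED_WORDS.items()
               if vc == tuple(vowel_codes) and ic == initial_code]
    return matches[0] if matches else None

print(recognise("S", ["A", "I"]))  # 'sashi', disambiguated by the initial code
```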

Proceedings Article
18 Oct 2012
TL;DR: A framework is presented for helping people with hearing disabilities learn how to articulate correctly in Romanian; it also works as a training assistant for Romanian-language lip-reading.
Abstract: In this paper, we propose a 3D facial animation model for simulating visual speech production in the Romanian language. Using a set of existing 3D key shapes representing facial animation visemes, fluid animations describing facial activity during speech pronunciation are produced, taking into account the Romanian-language coarticulation effects discussed in this paper. A novel mathematical model for defining efficient viseme coarticulation functions for 3D facial animation is also provided. The 3D tongue activity can be closely observed in real time while different words are pronounced in Romanian by allowing transparency for the 3D head model, thus making the tongue, teeth and the entire oral cavity visible. The purpose of this work is to provide a framework designed to help people with hearing disabilities learn how to articulate correctly in Romanian, and also to work as a training assistant for Romanian-language lip-reading.

Journal ArticleDOI
TL;DR: In this paper, the impact of various emotional states on speech prosody analysis is verified using a corpus of speech signals across gender for making a standard database of different linguistic, and paralinguistic factors.
Abstract: Speech can be described as the act of producing voice through the use of the vocal folds and vocal apparatus to create a linguistic act designed to convey information. Linguists classify the speech sounds used in a language into a number of abstract categories called phonemes, which allow us to group together subsets of speech sounds. Speech signals carry different features, which need detailed study across gender in order to build a standard database of different linguistic and paralinguistic factors. When people interact with others they convey emotions. Emotions play a vital role in any kind of decision in affective, social or business contexts, and are manifested in verbal and facial expressions as well as in written texts. The objective of this study is to verify the impact of various emotional states on speech prosody analysis.


Proceedings ArticleDOI
21 Sep 2012
TL;DR: In this paper, the authors propose using appropriate speech production data to improve the quality of articulatory animation for audiovisual (AV) speech synthesis.
Abstract: The importance of modeling speech articulation for high-quality audiovisual (AV) speech synthesis is widely acknowledged. Nevertheless, while state-of-the-art, data-driven approaches to facial animation can make use of sophisticated motion capture techniques, the animation of the intraoral articulators (viz. the tongue, jaw, and velum) typically makes use of simple rules or viseme morphing, in stark contrast to the otherwise high quality of facial modeling. Using appropriate speech production data could significantly improve the quality of articulatory animation for AV synthesis.

Book ChapterDOI
29 Apr 2012
TL;DR: This work presents a method for automatic detection of the outer edges of the lips, which was used to identify individual words in audio-visual speech recognition and how to use video speech to divide the audio signal into phonemes.
Abstract: This paper proposes a method of tracking the lips in an audio-visual speech recognition system. The presented method consists of a face detector, face tracker, lip detector, lip tracker, and word classifier. In speech recognition systems, the audio signal is exposed to a large amount of acoustic noise; therefore, researchers are looking for ways to reduce the effect of audio interference on recognition results. Visual speech is a source that is not perturbed by the acoustic environment and noise. To analyze visual speech, one has to develop a method of lip tracking. This work presents a method for automatic detection of the outer edges of the lips, which was used to identify individual words in audio-visual speech recognition. Additionally, the paper shows how to use visual speech to divide the audio signal into phonemes.
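A minimal sketch of the front end of such a pipeline, using OpenCV's stock Haar cascade for face detection and a fixed lower-face crop as the lip region of interest; the crop proportions are assumptions, and the paper's own edge-based lip detector and tracker are not reproduced here.

```python
# Sketch: detect a face with OpenCV's Haar cascade and crop a lower-face
# region as the lip ROI (crop proportions are rough assumptions).
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lip_roi(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest detected face
    # Lips lie roughly in the lower third of the face box, horizontally centred.
    return frame_bgr[y + int(0.65 * h): y + h, x + int(0.2 * w): x + int(0.8 * w)]
```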

01 Jan 2012
TL;DR: The experimental results indicate that the new Visual Speech Unit concept achieves 90% recognition rate when the system has been applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only 52%.
Abstract: In this paper we propose a new learning-based representation, referred to as the Visual Speech Unit (VSU), for visual speech recognition (VSR). The new Visual Speech Unit concept extends the standard viseme model currently applied in VSR by including in the representation not only the data associated with the visemes, but also the transitory information between consecutive visemes. The developed speech recognition system consists of several computational stages: (a) lip segmentation, (b) construction of Expectation-Maximization Principal Component Analysis (EM-PCA) manifolds from the input video image, (c) registration between the models of the VSUs and the EM-PCA data constructed from the input image sequence, and (d) recognition of the VSUs using a standard Hidden Markov Model (HMM) classification scheme. In this paper we were particularly interested in evaluating the classification accuracy obtained for our new VSU models compared with that attained for standard (MPEG-4) viseme models. The experimental results indicate that we achieved a 90% recognition rate when the system was applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only 52%.
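As a simplified stand-in for the EM-PCA and HMM stages described above, the sketch below reduces lip features with ordinary PCA and trains one Gaussian HMM per visual class using hmmlearn; the manifold registration step is not reproduced, and all dimensions are assumptions.

```python
# Sketch: per-class Gaussian HMMs over PCA-reduced lip features
# (ordinary PCA stands in for the paper's EM-PCA manifolds).
import numpy as np
from sklearn.decomposition import PCA
from hmmlearn import hmm

def train_models(sequences_by_class, n_dims=15, n_states=5):
    """sequences_by_class: dict label -> list of (n_frames, n_feats) arrays."""
    all_frames = np.vstack([s for seqs in sequences_by_class.values() for s in seqs])
    pca = PCA(n_components=n_dims).fit(all_frames)
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack([pca.transform(s) for s in seqs])
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        models[label] = m.fit(X, lengths)
    return pca, models

def classify(sequence, pca, models):
    x = pca.transform(sequence)
    return max(models, key=lambda label: models[label].score(x))
```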

Book ChapterDOI
01 Jan 2012
TL;DR: A representation model of visual speech based on the local binary pattern (LBP) and the discrete cosine transform (DCT) of mouth images is proposed, which shows better performance than using the global feature alone.
Abstract: This paper aims to establish an effective visual speech feature representation for Chinese viseme recognition. We propose and discuss a representation model of visual speech based on the local binary pattern (LBP) and the discrete cosine transform (DCT) of mouth images. The joint model combines the advantages of local and global texture information, and shows better performance than using the global feature alone. By computing the LBP and DCT of each mouth frame captured while the subject is speaking, a Hidden Markov Model (HMM) is trained on the training dataset and employed to recognize new visual speech. The experiments show that this visual speech feature model performs well in classifying different speaking states.
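A hedged sketch of the joint local/global feature described above, computing a uniform LBP histogram and a low-frequency block of the 2-D DCT for one grayscale mouth image; the histogram size and the number of retained coefficients are assumptions.

```python
# Sketch: joint LBP (local texture) + DCT (global texture) feature for one
# grayscale mouth image; parameter choices are illustrative.
import numpy as np
from skimage.feature import local_binary_pattern
from scipy.fft import dctn

def lbp_dct_feature(mouth_gray, lbp_points=8, lbp_radius=1, dct_coeffs=6):
    # Uniform LBP histogram over the whole mouth region (local texture).
    lbp = local_binary_pattern(mouth_gray, lbp_points, lbp_radius, method="uniform")
    hist, _ = np.histogram(lbp, bins=lbp_points + 2,
                           range=(0, lbp_points + 2), density=True)
    # Low-frequency block of the 2-D DCT as a global descriptor.
    dct = dctn(mouth_gray.astype(float), norm="ortho")
    global_part = dct[:dct_coeffs, :dct_coeffs].ravel()
    return np.concatenate([hist, global_part])
```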

Patent
17 Jan 2012
TL;DR: A computer-implemented method is presented for animating the mouth of a face, graphically represented by a face model, in accordance with a given sequence of visemes.
Abstract: A computer-implemented method of animating a mouth of a face in accordance with a given sequence of visemes, said method comprising: graphically representing said face by a face model; for each of a plurality V of possible visemes, obtaining by measurement a plurality I of different scans or samples of said visemes; representing each of said plurality of I samples of each of said plurality of V different visemes by said face animation model to generate a matrix, based on said scans or samples, which spans a viseme space such that a sequence of visemes can be represented as a trajectory through said viseme space; and applying a Bayesian approach to obtain the best path through said viseme space for said given sequence of visemes.

Journal ArticleDOI
TL;DR: The paper shows that an HMM describing the dynamics of speech, coupled with a combined feature describing global and local texture, is the best model.
Abstract: This paper aims to provide a solution for constructing a Chinese visual speech feature model based on HMMs. We propose and discuss three kinds of representation models of visual speech: lip geometrical features, lip motion features and lip texture features. A model that combines the advantages of local LBP and global DCT texture information shows better performance than any single feature, and likewise a model that combines local LBP and geometrical information outperforms a single feature. By computing the viseme recognition rates of these models, the paper shows that an HMM describing the dynamics of speech, coupled with the combined feature describing global and local texture, is the best model.

Proceedings ArticleDOI
29 May 2012
TL;DR: Based on the characteristics of Chinese speech pronunciation, lip features of consonants and vowels are extracted, grey relational analysis is used to construct fuzzy similarity relation matrices, fuzzy clustering is used to classify the consonants and vowels, and 13 basic static visemes for Chinese are defined.
Abstract: In this paper, based on the characteristics of Chinese speech pronunciation, we first extract the lip features of consonants and vowels, use grey relational analysis to construct a fuzzy similarity relation matrix for the consonants and vowels, and then use fuzzy clustering to classify them; finally, we define 13 basic static visemes for Chinese. We implemented a TTVS (Text-To-Visual-Speech) system to verify the performance of the visemes.
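For illustration, here is a compact fuzzy c-means routine of the kind used for the clustering step above, applied to generic lip-feature vectors; the grey relational analysis stage and the specific 13-viseme outcome are not reproduced.

```python
# Sketch: plain fuzzy c-means over lip-feature vectors (NumPy only);
# the grey-relational similarity step from the paper is not included.
import numpy as np

def fuzzy_cmeans(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """X: (n_samples, n_feats). Returns (membership matrix, cluster centers)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)                 # initial fuzzy memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (dist ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)             # standard FCM membership update
    return U, centers
```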

Proceedings Article
21 May 2012
TL;DR: This paper presents the effort to integrate and extend existing text-to-speech system for Croatian language and a face and body animation system to produce a system capable of real-time visual text- to-speech synthesis for Croatianlanguage.
Abstract: Virtual human characters are being used in a variety of applications, such as computer games, movies, tutoring software and virtual guides. Virtual characters interact with persons using speech and gestures. In some applications speech can be pre-recorded and synchronized with animation to produce a believable virtual character. However, in many cases the ability to produce synthesized speech synchronized with facial animation is necessary or desirable, for example when the output text is not known in advance, when a choice among various voices is needed, or because recording natural speech is too time-consuming and expensive for a given application. This paper presents the effort to integrate and extend an existing text-to-speech system for the Croatian language and a face and body animation system to produce a system capable of real-time visual text-to-speech synthesis for Croatian.

Book ChapterDOI
25 Jun 2012
TL;DR: A design tool for creating correct speech visemes is presented, and the correctness of the generated visemes is tested on Slovak speech domains.
Abstract: Many deaf people use lip reading as their main form of communication. A viseme is a representational unit used to classify speech sounds in the visual domain; it describes the particular facial and oral positions and movements that occur alongside the voicing of phonemes. A design tool for creating correct speech visemes has been developed. It is composed of five modules: one for creating phonemes, one for creating 3D speech visemes, one for facial expressions, one for synchronization between phonemes and visemes, and lastly one for generating speech triphones. We test the correctness of the generated visemes on Slovak speech domains. The paper describes our developed tool.

Proceedings ArticleDOI
01 Nov 2012
TL;DR: This paper presents the design and implementation of an online speech driven talking head animation system that first recognizes phoneme sequence from the input speech with a Chinese Mandarin speech recognizer, and is used to drive the facial animations on a 3-dimentional talking head.
Abstract: This paper presents the design and implementation of an online speech driven talking head animation system The system first recognizes phoneme sequence from the input speech with a Chinese Mandarin speech recognizer The phoneme sequence is further transformed to a sequence of visemes The sequence of MPEG-4 facial animation parameters (FAPs) is further derived from the viseme sequence, and is used to drive the facial animations on a 3-dimentional talking head The architecture and the major features are also presented in the paper, together with the evaluations of the system