
Showing papers on "Viseme published in 2012"


Proceedings ArticleDOI
29 Jul 2012
TL;DR: It is found that dynamic visemes are able to produce more accurate and visually pleasing speech animation given phonetically annotated audio, reducing the amount of time that an animator needs to spend manually refining the animation.
Abstract: We present a new method for generating a dynamic, concatenative unit of visual speech that can generate realistic visual speech animation. We redefine visemes as temporal units that describe distinctive speech movements of the visual speech articulators. Traditionally, visemes have been conceived as a set of static mouth shapes representing clusters of contrastive phonemes (e.g. /p, b, m/ and /f, v/). In this work, the motion of the visual speech articulators is used to generate discrete, dynamic visual speech gestures. These gestures are clustered, providing a finite set of movements that describe visual speech: the visemes. Dynamic visemes are applied to speech animation by simply concatenating viseme units, and we compare them to static visemes using subjective evaluation. We find that dynamic visemes are able to produce more accurate and visually pleasing speech animation given phonetically annotated audio, reducing the amount of time that an animator needs to spend manually refining the animation.
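As a rough illustration of the clustering step described above, the sketch below time-normalises segmented articulator gestures and groups them with k-means; the resampling length, feature layout and cluster count are illustrative assumptions, not the authors' settings.

```python
# Sketch: cluster segmented visual-speech gestures into dynamic visemes.
# Assumes each gesture is an (n_frames, n_params) array of articulator
# parameters (e.g. AAM or marker coordinates); all names are illustrative.
import numpy as np
from scipy.interpolate import interp1d
from sklearn.cluster import KMeans

def resample_gesture(gesture, n_samples=20):
    """Time-normalise a gesture to a fixed number of samples."""
    t = np.linspace(0.0, 1.0, len(gesture))
    t_new = np.linspace(0.0, 1.0, n_samples)
    return interp1d(t, gesture, axis=0)(t_new).ravel()

def cluster_dynamic_visemes(gestures, n_visemes=30):
    """Cluster time-normalised gestures; each cluster is one dynamic viseme."""
    features = np.stack([resample_gesture(g) for g in gestures])
    km = KMeans(n_clusters=n_visemes, n_init=10, random_state=0).fit(features)
    return km.labels_, km.cluster_centers_
```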

132 citations


Proceedings Article
01 Jan 2012
TL;DR: These initial experiments demonstrate that the choice of visual unit requires more careful attention in audio-visual speech recognition system development, and the best visual-only recognition on the VidTIMIT database is achieved using a linguistically motivated viseme set.
Abstract: Phonemes are the standard modelling unit in HMM-based continuous speech recognition systems. Visemes are the equivalent unit in the visual domain, but there is less agreement on precisely what visemes are, or how many to model on the visual side of audio-visual speech recognition systems. This paper compares the use of 5 viseme maps in a continuous speech recognition task. The focus of the study is visual-only recognition, to examine the choice of viseme map. All the maps are based on the phoneme-to-viseme approach, created using either a linguistic method or a data-driven method. DCT, PCA and optical flow are used as the visual features. The best visual-only recognition on the VidTIMIT database is achieved using a linguistically motivated viseme set. These initial experiments demonstrate that the choice of visual unit requires more careful attention in audio-visual speech recognition system development.
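To make the phoneme-to-viseme approach concrete, here is a minimal sketch that collapses a phoneme transcript into viseme labels; the particular groupings form a hypothetical linguistically motivated map, not one of the five maps evaluated in the paper.

```python
# Sketch: collapse a phoneme transcript into viseme labels using a
# linguistically motivated phoneme-to-viseme map (groupings are illustrative).
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar", "s": "V_alveolar",
    "k": "V_velar", "g": "V_velar",
    "iy": "V_spread", "ih": "V_spread",
    "uw": "V_round", "ow": "V_round",
    "aa": "V_open", "ae": "V_open",
}

def phonemes_to_visemes(phonemes):
    """Map a phoneme sequence to visemes, merging adjacent duplicates."""
    visemes = [PHONEME_TO_VISEME.get(p, "V_other") for p in phonemes]
    return [v for i, v in enumerate(visemes) if i == 0 or v != visemes[i - 1]]

print(phonemes_to_visemes(["b", "ih", "n"]))  # ['V_bilabial', 'V_spread', 'V_alveolar']
```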

72 citations


Journal ArticleDOI
TL;DR: An image-based visual speech animation system that represents a video sequence by a low-dimensional continuous curve embedded in a path graph and establishes a map from the curve to the image domain to preserve the video dynamics of a talking face.
Abstract: An image-based visual speech animation system is presented in this paper. A video model is proposed to preserve the video dynamics of a talking face. The model represents a video sequence by a low-dimensional continuous curve embedded in a path graph and establishes a map from the curve to the image domain. When selecting video segments for synthesis, we loosen the traditional requirement of using the triphone as the unit, allowing segments to contain longer stretches of natural talking motion. Dense videos are sampled from the segments, concatenated, and downsampled to train a video model that enables efficient time alignment and motion smoothing for the final video synthesis. Different viseme definitions are used to investigate the impact of visemes on the video realism of the animated talking face. The system is built on a public database and tested both objectively and subjectively.
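As a loose sketch of the general idea of representing a talking-face video as a low-dimensional curve, the code below uses ordinary PCA as a stand-in for the paper's path-graph embedding and resamples the resulting curve for time alignment; the dimensionality is an assumption.

```python
# Sketch: embed a frame sequence as a low-dimensional curve and resample it
# for time alignment. PCA stands in for the paper's path-graph embedding.
import numpy as np
from sklearn.decomposition import PCA

def video_to_curve(frames, n_dims=10):
    """frames: (n_frames, height*width) array of vectorised images."""
    pca = PCA(n_components=n_dims)
    curve = pca.fit_transform(frames)       # one point on the curve per frame
    return curve, pca

def resample_curve(curve, n_points):
    """Piecewise-linear resampling of the curve to n_points (time alignment)."""
    t = np.linspace(0.0, 1.0, len(curve))
    t_new = np.linspace(0.0, 1.0, n_points)
    return np.stack([np.interp(t_new, t, curve[:, d])
                     for d in range(curve.shape[1])], axis=1)
```

Mapping points on the curve back to the image domain could be approximated with `pca.inverse_transform`, although the paper's learned map is more elaborate.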

35 citations


Proceedings ArticleDOI
25 Mar 2012
TL;DR: A multiview dataset of connected words is generated that can be analysed both by an automatic system, based on linear predictive trackers and active appearance models, and by human lip-readers; the automatic system is also good at estimating its own fallibility.
Abstract: Computer lip-reading is one of the great signal processing challenges. Not only is the signal noisy, it is variable. However, automatic performance is almost never compared with that of human lip-readers. Partly this is because of the paucity of human lip-readers, and partly because most automatic systems only handle data that are trivial and therefore not representative of human speech. Here we generate a multiview dataset of connected words that can be analysed both by an automatic system, based on linear predictive trackers and active appearance models, and by human lip-readers. The automatic system we devise has a viseme accuracy of ≈ 46%, which is comparable to poor professional human lip-readers. However, unlike human lip-readers, our system is good at estimating its own fallibility.
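For context, a viseme accuracy of the kind reported above is typically computed from an edit-distance alignment between the recognised and reference viseme sequences; the sketch below shows that standard calculation under unit edit costs and is not the authors' scoring code.

```python
# Sketch: HTK-style accuracy from a minimum edit-distance alignment between a
# reference and a recognised viseme sequence.
def viseme_accuracy(reference, hypothesis):
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edit cost to align reference[:i] with hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return (n - d[n][m]) / n  # equals (N - D - S - I) / N under unit edit costs
```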

32 citations


Book ChapterDOI
01 Apr 2012

17 citations


Journal ArticleDOI
TL;DR: A significant increase in recognition effectiveness and processing speed was noted during tests, given properly selected CHMM parameters, an adequate codebook size, and an appropriate fusion of the audio-visual characteristics.
Abstract: This paper focuses on combining audio-visual signals for Polish speech recognition under conditions of a highly disturbed audio speech signal. Recognition of audio-visual speech was based on combined hidden Markov models (CHMM). The described methods were developed for single isolated commands; nevertheless, their effectiveness indicates that they would also work similarly in continuous audio-visual speech recognition. The problem of visual speech analysis is very difficult and computationally demanding, mostly because of the extreme amount of data that needs to be processed. Therefore, the audio-visual speech recognition method is used only while the audio speech signal is exposed to a considerable level of distortion. The authors' own methods of lip-edge detection and visual characteristic extraction are proposed in this paper. Moreover, a method of fusing speech characteristics for an audio-visual signal was proposed and tested. A significant increase in recognition effectiveness and processing speed was noted during tests, given properly selected CHMM parameters, an adequate codebook size, and an appropriate fusion of the audio-visual characteristics. The experimental results were very promising and close to those achieved by leading scientists in the field of audio-visual speech recognition.
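A minimal sketch of weighted audio-visual fusion in the spirit of the work described above, assuming per-word log-likelihoods are already available from separate audio and visual models; the SNR-dependent stream weighting shown here is a common baseline, not the authors' CHMM formulation.

```python
# Sketch: late fusion of audio and visual model scores with a stream weight
# that shifts towards the visual stream as the audio becomes noisier.
import numpy as np

def fuse_scores(audio_loglik, visual_loglik, snr_db):
    """audio_loglik, visual_loglik: dicts word -> log-likelihood."""
    # Map SNR in [0, 30] dB to an audio stream weight in [0.2, 0.9] (heuristic).
    w_audio = float(np.clip(0.2 + 0.7 * (snr_db / 30.0), 0.2, 0.9))
    w_visual = 1.0 - w_audio
    fused = {w: w_audio * audio_loglik[w] + w_visual * visual_loglik[w]
             for w in audio_loglik}
    return max(fused, key=fused.get), fused
```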

13 citations


Book ChapterDOI
01 Jan 2012
TL;DR: The potential of the framework is demonstrated by developing the first automatic visual speech animation system for European Portuguese based on the concatenation of visemes and assessing the quality of two different phoneme-to-viseme mappings devised for the language.
Abstract: Visual speech animation, or lip synchronization, is the process of matching speech with the lip movements of a virtual character. It is a challenging task because all articulatory movements must be controlled and synchronized with the audio signal. Existing language-independent systems usually require fine-tuning by an artist to avoid artefacts appearing in the animation. In this paper, we present a modular visual speech animation framework aimed at speeding up and easing the visual speech animation process as compared with traditional techniques. We demonstrate the potential of the framework by developing the first automatic visual speech animation system for European Portuguese based on the concatenation of visemes. We also present the results of a preliminary evaluation that was carried out to assess the quality of two different phoneme-to-viseme mappings devised for the language.
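To illustrate the concatenative idea, the hedged sketch below turns phoneme timings into viseme keyframes and linearly interpolates mouth-shape parameters between them; the viseme targets and parameter names are hypothetical, not either of the European Portuguese mappings evaluated in the paper.

```python
# Sketch: concatenative viseme animation from phoneme timings.
# Each viseme is a target vector of mouth-shape parameters (illustrative values).
import numpy as np

VISEME_TARGETS = {               # hypothetical [jaw_open, lip_round] vectors
    "V_closed": np.array([0.0, 0.0]),
    "V_open":   np.array([0.8, 0.1]),
    "V_round":  np.array([0.3, 0.9]),
}

def animate(phoneme_timings, phoneme_to_viseme, fps=25):
    """phoneme_timings: list of (phoneme, start_s, end_s). Returns a (T, 2) frame array."""
    keyframes = [((s + e) / 2.0, VISEME_TARGETS[phoneme_to_viseme[p]])
                 for p, s, e in phoneme_timings]
    times = np.array([t for t, _ in keyframes])
    values = np.stack([v for _, v in keyframes])
    frame_times = np.arange(0.0, phoneme_timings[-1][2], 1.0 / fps)
    return np.stack([np.interp(frame_times, times, values[:, d])
                     for d in range(values.shape[1])], axis=1)
```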

13 citations


Journal ArticleDOI
TL;DR: The results provide arguments for the involvement of the speech motor cortex in phonological discrimination, and suggest a multimodal representation of speech units.

13 citations


Journal ArticleDOI
TL;DR: A main effect of pseudo-homovisemy is found, suggesting that at least some deaf individuals do automatically access sublexical structure during single-word reading, and a working model of single- word reading by deaf adults based on the dual-route cascaded model of reading aloud is proposed.
Abstract: There is an ongoing debate whether deaf individuals access phonology when reading, and if so, what impact the ability to access phonology might have on reading achievement. However, the debate so far has been theoretically unspecific on two accounts: (a) the phonological units deaf individuals may have of oral language have not been specified and (b) there seem to be no explicit cognitive models specifying how phonology and other factors operate in reading by deaf individuals. We propose that deaf individuals have representations of the sublexical structure of oral-aural language which are based on mouth shapes and that these sublexical units are activated during reading by deaf individuals. We specify the sublexical units of deaf German readers as 11 "visemes" and incorporate the viseme set into a working model of single-word reading by deaf adults based on the dual-route cascaded model of reading aloud by Coltheart, Rastle, Perry, Langdon, and Ziegler (2001. DRC: A dual route cascaded model of visual word recognition and reading aloud. Psychological Review, 108, 204-256. doi: 10.1037//0033-295x.108.1.204). We assessed the indirect route of this model by investigating the "pseudo-homoviseme" effect using a lexical decision task in deaf German reading adults. We found a main effect of pseudo-homovisemy, suggesting that at least some deaf individuals do automatically access sublexical structure during single-word reading.
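As a rough illustration of the pseudo-homoviseme manipulation, the sketch below maps letter strings onto viseme codes and flags nonwords whose viseme string coincides with that of a real word; the letter-to-viseme groupings are hypothetical stand-ins for the 11 German visemes used in the study.

```python
# Sketch: detect pseudo-homovisemes, i.e. nonwords whose viseme string
# matches that of a real word (letter-to-viseme groups are illustrative).
LETTER_TO_VISEME = {
    "p": "B", "b": "B", "m": "B",      # bilabial mouth shape
    "f": "F", "w": "F", "v": "F",      # labiodental
    "t": "T", "d": "T", "n": "T", "s": "T",
    "a": "A", "e": "E", "i": "E", "o": "O", "u": "O",
}

def viseme_string(word):
    return "".join(LETTER_TO_VISEME.get(ch, "?") for ch in word.lower())

def is_pseudo_homoviseme(nonword, lexicon):
    target = viseme_string(nonword)
    return any(viseme_string(w) == target for w in lexicon)

print(is_pseudo_homoviseme("pame", ["babe", "lake"]))  # True: 'pame' shares visemes with 'babe'
```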

12 citations


Dissertation
24 Aug 2012
TL;DR: Investigation of techniques for audiovisual speech synthesis, using both viseme-based and data-driven approaches to implement multiple talking heads suggests that the use of talking heads in technology-enhanced learning could be useful in addition to traditional methods.
Abstract: This thesis investigates the use of synthetic talking heads, with lip, tongue and face movements synchronized with synthesized or natural speech, in technology-enhanced learning This work applies talking heads in a speech tutoring application for teaching English as a second language Previous studies have shown that speech perception is aided by visual information, but more research is needed to determine the effectiveness of visualization of articulators in pronunciation training This thesis explores whether or not visual speech technology can give an improvement in learning pronunciation This thesis investigates techniques for audiovisual speech synthesis, using both viseme-based and data-driven approaches to implement multiple talking heads Intelligibility studies found the audiovisual heads to be more intelligible than audio alone, and the data-driven head was found to be more intelligible than the viseme-driven implementation The talking heads are applied in a pronunciation-training application, which is evaluated by second-language learners to investigate the benefit of visual speech in technology-enhanced learning User trials explored the efficacy of the software in demonstrating the /b/–/p/ contrast in English The results indicate that learners showed an improvement in listening and pronunciation after using the software, while the benefit of visualization compared to auditory training alone varied between individuals User evaluations found that the talking heads were perceived to be helpful in learning pronunciation, and the positive feedback on the tutoring system suggests that the use of talking heads in technology-enhanced learning could be useful in addition to traditional methods

6 citations



Journal ArticleDOI
TL;DR: A framework is introduced for synthesizing lip-sync character speech animation in real time from a given speech sequence and its corresponding text, starting from training dominated animeme models for each kind of phoneme by learning the character's animation control signal through an expectation-maximization (EM)-style optimization approach.
Abstract: Character speech animation is traditionally considered as important but tedious work, especially when taking lip synchronization (lip-sync) into consideration. Although there are some methods proposed to ease the burden on artists to create facial and speech animation, almost none is fast and efficient. In this paper, we introduce a framework for synthesizing lip-sync character speech animation in real time from a given speech sequence and its corresponding texts, starting from training dominated animeme models (DAMs) for each kind of phoneme by learning the character's animation control signal through an expectation-maximization (EM)-style optimization approach. The DAMs are further decomposed to polynomial-fitted animeme models and corresponding dominance functions while taking coarticulation into account. Finally, given a novel speech sequence and its corresponding texts, the animation control signal of the character can be synthesized in real time with the trained DAMs. The synthesized lip-sync animation can even preserve exaggerated characteristics of the character's facial geometry. Moreover, since our method can perform in real time, it can be used for many applications, such as lip-sync animation prototyping, multilingual animation reproduction, avatar speech, and mass animation production. Furthermore, the synthesized animation control signal can be imported into 3-D packages for further adjustment, so our method can be easily integrated into the existing production pipeline.
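The dominance-function decomposition is close in spirit to Cohen-Massaro coarticulation blending; the sketch below shows that generic blending scheme with assumed exponential dominance functions, rather than the paper's trained dominated animeme models.

```python
# Sketch: blend per-phoneme target shapes with exponential dominance functions
# (Cohen-Massaro-style coarticulation); parameters are illustrative.
import numpy as np

def dominance(t, center, magnitude=1.0, rate=8.0):
    """Exponentially decaying dominance of a segment centred at `center` (seconds)."""
    return magnitude * np.exp(-rate * np.abs(t - center))

def blend(segments, t):
    """segments: list of (center_time_s, target_param_vector). Returns blended shape at t."""
    weights = np.array([dominance(t, c) for c, _ in segments])
    targets = np.stack([v for _, v in segments])
    return (weights[:, None] * targets).sum(axis=0) / weights.sum()
```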


Book ChapterDOI
01 Jan 2012
TL;DR: This chapter reviews systems that use ASR techniques to evaluate the pronunciation of people who suffer from speech or voice impairments, presenting their main innovations and some of the available resources.
Abstract: Speech disorders are disabilities widely present in the young population, although adults may also suffer from such disorders after certain physical problems. In this context, the detection and, further, the correction of such disabilities may be handled by Automatic Speech Recognition (ASR) technology. The first works on speech disorder detection began in the early 70s and seem to follow the same evolution as ASR itself. Indeed, these early works were based mainly on signal processing techniques. Progressively, systems dealing with speech disorders have incorporated more ideas from ASR technology. In particular, Hidden Markov Models, the state-of-the-art approach in ASR systems, are used. This chapter reviews systems that use ASR techniques to evaluate the pronunciation of people who suffer from speech or voice impairments. The authors investigate the existing systems and present their main innovations and some of the available resources.

17 Sep 2012
TL;DR: This work demonstrates 11 final visemes representing the 28 consonantal Arabic phonemes and shows the variation of pitch for each viseme.
Abstract: The aim of this work is to present primary research on Arabic audiovisual analysis. Each language has multiple phonemes and visemes, and each viseme can correspond to multiple phonemes. The first part focuses on how to classify Arabic visemes from still images, whereas the second part shows the variation of pitch for each viseme. We have not taken viseme coarticulation into account. To evaluate the performance of the proposed method, we collected a large number of visual speech signals from ten Algerian speakers, male and female, pronouncing 28 Arabic syllables at different moments. In this work, we demonstrate 11 final visemes representing the 28 consonantal Arabic phonemes.

Proceedings ArticleDOI
01 Nov 2012
TL;DR: A novel word lip-reading system is proposed that detects and determines initial mouth-shape codes to recognize uttered consonants, and is thereby able to discriminate words consisting of the same sequential vowel codes but containing different consonant codes.
Abstract: Visual speech recognition, or lip reading, is an approach for noise-robust speech recognition that adds the speaker's visual cues to the audio information. Visual-only speech recognition is applicable to speaker verification and to multimedia interfaces for supporting speech-impaired persons. The sequential mouth-shape code method is an effective approach to lip reading, particularly for uttered Japanese words, that utilizes two kinds of distinctive mouth shapes, known as first and last mouth shapes, which appear intermittently. One advantage of this method is its low computational burden for the learning and word registration processes. This paper proposes a novel word lip recognition system that detects and determines initial mouth-shape codes to recognize uttered consonants. The proposed method is thereby able to discriminate different words consisting of the same sequential vowel codes but containing different consonant codes. The conducted experiments demonstrate that the proposed system provides a higher recognition rate than the conventional ones.
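A toy sketch of the word-discrimination idea: each registered word is stored as an initial consonant mouth-shape code plus a sequence of vowel codes, so two words with identical vowel codes can still be told apart; all codes and words below are hypothetical placeholders.

```python
# Sketch: word lookup by vowel mouth-shape codes plus an initial consonant
# mouth-shape code (codes are hypothetical placeholders).
REGISTERED_WORDS = {
    # word: (initial_consonant_code, tuple_of_vowel_codes)
    "kaki":  ("K", ("A", "I")),
    "sashi": ("S", ("A", "I")),   # same vowel codes, different initial code
}

def recognise(initial_code, vowel_codes):
    matches = [w for w, (ic, vc) in REGISTERED_WORDS.items()
               if vc == tuple(vowel_codes) and ic == initial_code]
    return matches[0] if matches else None

print(recognise("S", ["A", "I"]))  # 'sashi', disambiguated by the initial code
```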

Proceedings Article
18 Oct 2012
TL;DR: A framework is presented for helping people with hearing disabilities learn how to articulate correctly in Romanian; it also works as a training assistant for Romanian-language lip-reading.
Abstract: In this paper, we propose a 3D facial animation model for simulating visual speech production in the Romanian language. Using a set of existing 3D key shapes representing facial animation visemes, fluid animations describing facial activity during speech pronunciation are produced, taking into account the Romanian-language coarticulation effects discussed in this paper. A novel mathematical model for defining efficient viseme coarticulation functions for 3D facial animation is also provided. The 3D tongue activity can be closely observed in real time while different words are pronounced in Romanian by allowing transparency for the 3D head model, thus making the tongue, teeth and the entire oral cavity visible. The purpose of this work is to provide a framework designed to help people with hearing disabilities learn how to articulate correctly in Romanian, and also to work as a training assistant for Romanian-language lip-reading.

Journal ArticleDOI
TL;DR: In this paper, the impact of various emotional states on speech prosody analysis is verified using a corpus of speech signals across gender for making a standard database of different linguistic, and paralinguistic factors.
Abstract: Speech can be described as the act of producing voice through the use of the vocal folds and vocal apparatus to create a linguistic act designed to convey information. Linguists classify the speech sounds used in a language into a number of abstract categories called phonemes, which allow us to group together subsets of speech sounds. Speech signals carry different features, which need detailed study across gender in order to build a standard database of different linguistic and paralinguistic factors. When people interact with others they convey emotions. Emotions play a vital role in any kind of decision in affective, social or business contexts, and are manifested in verbal and facial expressions as well as in written texts. The objective of this study is to verify the impact of various emotional states on speech prosody analysis.


Proceedings ArticleDOI
21 Sep 2012
TL;DR: In this paper, the authors propose using appropriate speech production data to improve the quality of articulatory animation for audiovisual (AV) speech synthesis.
Abstract: The importance of modeling speech articulation for high-quality audiovisual (AV) speech synthesis is widely acknowledged. Nevertheless, while state-of-the-art, data-driven approaches to facial animation can make use of sophisticated motion capture techniques, the animation of the intraoral articulators (viz. the tongue, jaw, and velum) typically makes use of simple rules or viseme morphing, in stark contrast to the otherwise high quality of facial modeling. Using appropriate speech production data could significantly improve the quality of articulatory animation for AV synthesis.

Book ChapterDOI
29 Apr 2012
TL;DR: This work presents a method for automatic detection of the outer edges of the lips, which was used to identify individual words in audio-visual speech recognition and how to use video speech to divide the audio signal into phonemes.
Abstract: This paper proposes a method of tracking the lips in an audio-visual speech recognition system. The presented method consists of a face detector, face tracker, lip detector, lip tracker, and word classifier. In speech recognition systems, the audio signal is exposed to a large amount of acoustic noise; therefore, researchers are looking for ways to reduce the effect of audio interference on recognition results. Visual speech is a source that is not perturbed by the acoustic environment and noise. To analyze visual speech, one has to develop a method of lip tracking. This work presents a method for automatic detection of the outer edges of the lips, which was used to identify individual words in audio-visual speech recognition. Additionally, the paper shows how to use visual speech to divide the audio signal into phonemes.
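A minimal sketch of the front end of such a pipeline, using OpenCV's stock Haar cascade for face detection and a fixed lower-face crop as the lip region of interest; the crop proportions are assumptions, and the paper's own edge-based lip detector and tracker are not reproduced here.

```python
# Sketch: detect a face with OpenCV's Haar cascade and crop a lower-face
# region as the lip ROI (crop proportions are rough assumptions).
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lip_roi(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest detected face
    # Lips lie roughly in the lower third of the face box, horizontally centred.
    return frame_bgr[y + int(0.65 * h): y + h, x + int(0.2 * w): x + int(0.8 * w)]
```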

01 Jan 2012
TL;DR: The experimental results indicate that the new Visual Speech Unit concept achieves 90% recognition rate when the system has been applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only 52%.
Abstract: In this paper we propose a new learning-based representation, referred to as the Visual Speech Unit (VSU), for visual speech recognition (VSR). The new Visual Speech Unit concept extends the standard viseme model currently applied in VSR by including in the representation not only the data associated with the visemes, but also the transitory information between consecutive visemes. The developed speech recognition system consists of several computational stages: (a) lip segmentation, (b) construction of Expectation-Maximization Principal Component Analysis (EM-PCA) manifolds from the input video image, (c) registration between the models of the VSUs and the EM-PCA data constructed from the input image sequence, and (d) recognition of the VSUs using a standard Hidden Markov Model (HMM) classification scheme. In this paper we were particularly interested in evaluating the classification accuracy obtained for our new VSU models compared with that attained for standard (MPEG-4) viseme models. The experimental results indicate that we achieved a 90% recognition rate when the system was applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only 52%.
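As a simplified stand-in for the EM-PCA and HMM stages described above, the sketch below reduces lip features with ordinary PCA and trains one Gaussian HMM per visual class using hmmlearn; the manifold registration step is not reproduced, and all dimensions are assumptions.

```python
# Sketch: per-class Gaussian HMMs over PCA-reduced lip features
# (ordinary PCA stands in for the paper's EM-PCA manifolds).
import numpy as np
from sklearn.decomposition import PCA
from hmmlearn import hmm

def train_models(sequences_by_class, n_dims=15, n_states=5):
    """sequences_by_class: dict label -> list of (n_frames, n_feats) arrays."""
    all_frames = np.vstack([s for seqs in sequences_by_class.values() for s in seqs])
    pca = PCA(n_components=n_dims).fit(all_frames)
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack([pca.transform(s) for s in seqs])
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        models[label] = m.fit(X, lengths)
    return pca, models

def classify(sequence, pca, models):
    x = pca.transform(sequence)
    return max(models, key=lambda label: models[label].score(x))
```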

Book ChapterDOI
01 Jan 2012
TL;DR: A representation model of visual speech based on the local binary pattern (LBP) and the discrete cosine transform (DCT) of mouth images is proposed, which shows better performance than using the global feature alone.
Abstract: This paper aims to establish an effective visual speech feature representation for Chinese viseme recognition. We propose and discuss a representation model of visual speech based on the local binary pattern (LBP) and the discrete cosine transform (DCT) of mouth images. The joint model combines the advantages of local and global texture information, and shows better performance than using the global feature alone. By computing the LBP and DCT of each mouth frame captured while the subject is speaking, a Hidden Markov Model (HMM) is trained on the training dataset and employed to recognize new visual speech. The experiments show that this visual speech feature model performs well in classifying different speaking states.
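A hedged sketch of the joint local/global feature described above, computing a uniform LBP histogram and a low-frequency block of the 2-D DCT for one grayscale mouth image; the histogram size and the number of retained coefficients are assumptions.

```python
# Sketch: joint LBP (local texture) + DCT (global texture) feature for one
# grayscale mouth image; parameter choices are illustrative.
import numpy as np
from skimage.feature import local_binary_pattern
from scipy.fft import dctn

def lbp_dct_feature(mouth_gray, lbp_points=8, lbp_radius=1, dct_coeffs=6):
    # Uniform LBP histogram over the whole mouth region (local texture).
    lbp = local_binary_pattern(mouth_gray, lbp_points, lbp_radius, method="uniform")
    hist, _ = np.histogram(lbp, bins=lbp_points + 2,
                           range=(0, lbp_points + 2), density=True)
    # Low-frequency block of the 2-D DCT as a global descriptor.
    dct = dctn(mouth_gray.astype(float), norm="ortho")
    global_part = dct[:dct_coeffs, :dct_coeffs].ravel()
    return np.concatenate([hist, global_part])
```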

Patent
17 Jan 2012
TL;DR: A computer-implemented method is presented for animating the mouth of a face, graphically represented by a face model, in accordance with a given sequence of visemes.
Abstract: A computer-implemented method of animating a mouth of a face in accordance with a given sequence of visemes, said method comprising: graphically representing said face by a face model; for each of a plurality V of possible visemes, obtaining by measurement a plurality I of different scans or samples of said visemes; representing each of said plurality of I samples of each of said plurality of V different visemes by said face animation model to generate a matrix, based on said scans or samples, which spans a viseme space such that a sequence of visemes can be represented as a trajectory through said viseme space; and applying a Bayesian approach to obtain the best path through said viseme space for said given sequence of visemes.

Journal ArticleDOI
TL;DR: The paper shows that an HMM describing the dynamics of speech, coupled with a combined feature describing global and local texture, is the best model.
Abstract: This paper aims to provide a solution for constructing a Chinese visual speech feature model based on HMMs. We propose and discuss three kinds of representation models of visual speech: lip geometrical features, lip motion features and lip texture features. A model that combines the advantages of local LBP and global DCT texture information shows better performance than any single feature, and likewise a model that combines local LBP and geometrical information outperforms a single feature. By computing the viseme recognition rates of these models, the paper shows that an HMM describing the dynamics of speech, coupled with the combined feature describing global and local texture, is the best model.

Proceedings ArticleDOI
29 May 2012
TL;DR: Based on the characteristics of Chinese speech pronunciation, lip features of consonants and vowels are extracted, grey relational analysis is used to construct fuzzy similarity relation matrices, fuzzy clustering is used to classify the consonants and vowels, and 13 basic static visemes for Chinese are defined.
Abstract: In this paper, based on the characteristics of Chinese speech pronunciation, we first extract the lip features of consonants and vowels, use grey relational analysis to construct a fuzzy similarity relation matrix for the consonants and vowels, and then use fuzzy clustering to classify them; finally, we define 13 basic static visemes for Chinese. We implemented a TTVS (Text-To-Visual-Speech) system to verify the performance of the visemes.
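For illustration, here is a compact fuzzy c-means routine of the kind used for the clustering step above, applied to generic lip-feature vectors; the grey relational analysis stage and the specific 13-viseme outcome are not reproduced.

```python
# Sketch: plain fuzzy c-means over lip-feature vectors (NumPy only);
# the grey-relational similarity step from the paper is not included.
import numpy as np

def fuzzy_cmeans(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """X: (n_samples, n_feats). Returns (membership matrix, cluster centers)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)                 # initial fuzzy memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (dist ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)             # standard FCM membership update
    return U, centers
```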

Proceedings Article
21 May 2012
TL;DR: This paper presents the effort to integrate and extend existing text-to-speech system for Croatian language and a face and body animation system to produce a system capable of real-time visual text- to-speech synthesis for Croatianlanguage.
Abstract: Virtual human characters are being used in a variety of applications, such as computer games, movies, tutoring software and virtual guides. Virtual characters interact with persons using speech and gestures. In some applications speech can be pre-recorded and synchronized with animation to produce a believable virtual character. However, in many cases the ability to produce synthesized speech synchronized with facial animation is necessary or desirable, for example when the output text is not known in advance, when a choice among various voices is needed, or because recording natural speech is too time-consuming and expensive for a given application. This paper presents the effort to integrate and extend an existing text-to-speech system for the Croatian language and a face and body animation system to produce a system capable of real-time visual text-to-speech synthesis for Croatian.

Book ChapterDOI
25 Jun 2012
TL;DR: A design tool for creating correct speech visemes is presented, and the correctness of the generated visemes is tested on Slovak speech domains.
Abstract: Many deaf people use lip reading as their main form of communication. A viseme is a representational unit used to classify speech sounds in the visual domain; it describes the particular facial and oral positions and movements that occur alongside the voicing of phonemes. A design tool for creating correct speech visemes has been developed. It is composed of five modules: one for creating phonemes, one for creating 3D speech visemes, one for facial expressions, one for synchronization between phonemes and visemes, and lastly one for generating speech triphones. We test the correctness of the generated visemes on Slovak speech domains. The paper describes our developed tool.

Proceedings ArticleDOI
01 Nov 2012
TL;DR: This paper presents the design and implementation of an online speech driven talking head animation system that first recognizes phoneme sequence from the input speech with a Chinese Mandarin speech recognizer, and is used to drive the facial animations on a 3-dimentional talking head.
Abstract: This paper presents the design and implementation of an online speech driven talking head animation system The system first recognizes phoneme sequence from the input speech with a Chinese Mandarin speech recognizer The phoneme sequence is further transformed to a sequence of visemes The sequence of MPEG-4 facial animation parameters (FAPs) is further derived from the viseme sequence, and is used to drive the facial animations on a 3-dimentional talking head The architecture and the major features are also presented in the paper, together with the evaluations of the system