
Showing papers on "Viseme" published in 2011


Proceedings Article
01 Jan 2011
TL;DR: This paper shows that session-independent training methods may be used to obtain robust EMG-based speech recognizers which cope well with unseen recording sessions as well as with speaking mode variations.
Abstract: This paper reports on our recent research in speech recognition by surface electromyography (EMG), which is the technology of recording the electric activation potentials of the human articulatory muscles by surface electrodes in order to recognize speech. This method can be used to create Silent Speech Interfaces, since the EMG signal is available even when no audible signal is transmitted or captured. Several past studies have shown that EMG signals may vary greatly between different recording sessions, even of one and the same speaker. This paper shows that session-independent training methods may be used to obtain robust EMG-based speech recognizers which cope well with unseen recording sessions as well as with speaking mode variations. Our best session-independent recognition system, trained on 280 utterances of 7 different sessions, achieves an average 21.93% Word Error Rate (WER) on a testing vocabulary of 108 words. The overall best session-adaptive recognition system, based on a session-independent system and adapted towards the test session with 40 adaptation sentences, achieves an average WER of 15.66%, which is a relative improvement of 21% compared to the baseline average WER of 19.96% of a session-dependent recognition system trained only on a single session of 40 sentences.
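The WER and relative-improvement figures above follow the standard definitions; the sketch below shows how they can be computed. The example utterances and the final arithmetic are illustrative, not taken from the paper's data.

```python
def wer(reference, hypothesis):
    """Word Error Rate = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

# hypothetical reference/hypothesis pair: 2 substitutions over 5 words -> 0.4
print(wer("silent speech interfaces use emg", "silent speech interface uses emg"))
# relative improvement as quoted above: (baseline - adapted) / baseline
print(round((19.96 - 15.66) / 19.96 * 100, 1))        # ~21.5%, the reported "21%"
```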

71 citations


Proceedings ArticleDOI
28 Nov 2011
TL;DR: Bengali speech corpus development for speaker independent continuous speech recognition for two age groups is presented and phone and triphone labeled speech corpora are created.
Abstract: This paper presents Bengali speech corpus development for speaker-independent continuous speech recognition. A speech corpus is the backbone of an automatic speech recognition (ASR) system. Speech corpora can be classified into several classes; a corpus may be language dependent or age dependent. We have developed speech corpora for two age groups: the younger group spans 20 to 40 years of age, whereas the older group spans 60 to 80 years. We have created phone- and triphone-labeled speech corpora. Initially, speech samples are aligned with a statistical modeling technique. The statistically labeled files are then pruned by manual correction. The Hidden Markov Model Toolkit (HTK) has been used for aligning the speech data. We have evaluated phoneme recognition and continuous word recognition performance to check the quality of the speech corpus.

59 citations


Book ChapterDOI
TL;DR: Visual speech recognition (VSR) deals with the visual domain of speech and involves image processing, artificial intelligence, object detection, pattern recognition, statistical modelling, etc and has received a great deal of attention in the last decade.
Abstract: Lip reading is used to understand or interpret speech without hearing it, a technique especially mastered by people with hearing difficulties. The ability to lip read enables a person with a hearing impairment to communicate with others and to engage in social activities, which otherwise would be difficult. Recent advances in the fields of computer vision, pattern recognition, and signal processing have led to a growing interest in automating this challenging task of lip reading. Indeed, automating the human ability to lip read, a process referred to as visual speech recognition (VSR) (or sometimes speech reading), could open the door for other novel related applications. VSR has received a great deal of attention in the last decade for its potential use in applications such as human-computer interaction (HCI), audio-visual speech recognition (AVSR), speaker recognition, talking heads, sign language recognition and video surveillance. Its main aim is to recognise spoken word(s) by using only the visual signal that is produced during speech. Hence, VSR deals with the visual domain of speech and involves image processing, artificial intelligence, object detection, pattern recognition, statistical modelling, etc.

38 citations


Proceedings Article
01 Aug 2011
TL;DR: A visual-only speech recognition system is presented, trained using either PCA or optical flow visual features, and results are analyzed in order to establish which of the four candidate viseme definitions performs best.
Abstract: Audio-visual speech recognition (AVSR) involves recognising what a speaker is uttering using both audio and visual cues. While phonemes, the units of speech in the audio domain, are well documented, this is not equally true for the speech units in the visual domain: visemes. In the literature, only a generic viseme definition is recognised. There is no agreement on what visemes practically imply, and whether they relate just to mouth position or to mouth movement. In this paper a visual-only speech recognition system is presented, trained using either PCA or optical flow visual features. The recognition rate changes depending on which practical viseme definition is used. Four viseme definitions were tested and the results are analysed in order to establish which of the four candidates is the best-performing viseme definition.
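As a rough illustration of one of the two feature types compared above, the sketch below extracts PCA-based visual features from a stack of mouth-region frames; the frame data, image size and number of components are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
frames = rng.random((120, 32, 48))          # 120 synthetic grayscale mouth ROIs
X = frames.reshape(len(frames), -1)         # flatten each frame into a vector

pca = PCA(n_components=20)                  # assumed feature dimensionality
features = pca.fit_transform(X)             # one 20-dim visual feature per frame
print(features.shape, round(pca.explained_variance_ratio_.sum(), 3))
```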

36 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: A visual speech synthesizer providing midsagittal and front views of the vocal tract to help language learners to correct their mispronunciations is presented.
Abstract: This paper presents a visual speech synthesizer providing midsagittal and front views of the vocal tract to help language learners to correct their mispronunciations. We adopt a set of allophonic rules to determine the visualization of allophonic variations. We also implement coarticulation by decomposing a viseme (visualization of all articulators) into viseme components (visualization of tongue, lips, jaw, and velum separately). Viseme components are morphed independently while the temporally adjacent articulations are considered. Subjective evaluation involving 6 subjects with linguistic background shows that 54% of their responses prefer having allophonic variations incorporated.
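The decomposition idea described above can be pictured as independent per-articulator blending; the component names, parameter vectors and blend weights below are illustrative assumptions rather than the authors' model.

```python
import numpy as np

ARTICULATORS = ("tongue", "lips", "jaw", "velum")

def morph(prev_v, cur_v, next_v, t, carry=0.25):
    """Blend per-articulator parameters at normalized time t in [0, 1] within the
    current viseme, letting the temporally adjacent targets bleed in near the edges."""
    out = {}
    for a in ARTICULATORS:
        target = np.asarray(cur_v[a], dtype=float)
        if t < 0.5 and prev_v is not None:            # entering: blend from the previous viseme
            w = carry * (1.0 - 2.0 * t)
            target = (1 - w) * target + w * np.asarray(prev_v[a], dtype=float)
        elif t >= 0.5 and next_v is not None:         # leaving: blend toward the next viseme
            w = carry * (2.0 * t - 1.0)
            target = (1 - w) * target + w * np.asarray(next_v[a], dtype=float)
        out[a] = target
    return out

# toy viseme targets, two parameters per articulator (purely illustrative)
v_open = {a: [0.8, 0.2] for a in ARTICULATORS}
v_closed = {a: [0.1, 0.9] for a in ARTICULATORS}
print(morph(None, v_open, v_closed, t=0.9)["lips"])   # lips already drifting toward the next target
```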

18 citations


Journal ArticleDOI
TL;DR: A lip-reading technique that identifies visemes from visual data only and without evaluating the corresponding acoustic signals is presented, indicating that the proposed method is more robust to inter-subject variations with high sensitivity and specificity for 12 out of 14 visemes.
Abstract: A lip-reading technique that identifies visemes from visual data only and without evaluating the corresponding acoustic signals is presented. The technique is based on the vertical components of optical flow (OF) analysis, which are classified using support vector machines (SVM). The OF is decomposed into multiple non-overlapping fixed-scale blocks and statistical features of each block are computed for successive video frames of an utterance. The technique performs automatic temporal segmentation (i.e., determining the start and the end of an utterance) of the utterances, achieved by a pair-wise pixel comparison method, which evaluates the differences in intensity of corresponding pixels in two successive frames. The experiments were conducted on a database of 14 visemes taken from seven subjects and the accuracy was tested using five-fold and ten-fold cross-validation for binary and multiclass SVM, respectively, to determine the impact of subject variations. Unlike other systems in the literature, the results indicate that the proposed method is more robust to inter-subject variations, with high sensitivity and specificity for 12 out of 14 visemes. Potential applications of such a system include human-computer interfaces (HCI) for mobility-impaired users, lip-reading mobile phones, in-vehicle systems, and the improvement of speech-based computer control in noisy environments.
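A minimal sketch of the described feature pipeline, assuming OpenCV's Farneback optical flow and an RBF-kernel SVM as stand-ins; the block grid, frame sizes and labels are synthetic, not the paper's data.

```python
import numpy as np
import cv2
from sklearn.svm import SVC

def block_features(prev_gray, cur_gray, grid=(4, 4)):
    # dense optical flow between two consecutive grayscale frames
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    v = flow[..., 1]                                  # keep only the vertical component
    h, w = v.shape
    bh, bw = h // grid[0], w // grid[1]
    feats = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            block = v[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            feats += [block.mean(), block.std()]      # statistical features per block
    return np.array(feats)

rng = np.random.default_rng(1)
frames = (rng.random((10, 64, 64)) * 255).astype(np.uint8)   # synthetic mouth frames
X = np.stack([block_features(frames[i], frames[i + 1]) for i in range(9)])
y = np.arange(9) % 2                                  # dummy binary viseme labels
print(SVC(kernel="rbf").fit(X, y).score(X, y))
```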

16 citations


Journal ArticleDOI
TL;DR: In perceptual experiments, the synthetic visual speech of a computer-animated Mandarin talking head was evaluated and subsequently improved, and the basic units of Mandarin visual speech were determined for initial consonants and final single-vowels.

16 citations


Proceedings ArticleDOI
28 Nov 2011
TL;DR: This paper presents the group's latest progress in developing Enunciate — an online computer-aided pronunciation training (CAPT) system for Chinese learners of English, which consists of an audio-enabled web interface, a speech recognizer for mispronunciation detection and diagnosis, a speech synthesizer and a viseme animator.
Abstract: This paper presents our group's latest progress in developing Enunciate — an online computer-aided pronunciation training (CAPT) system for Chinese learners of English. Presently, the system targets segmental pronunciation errors. It consists of an audio-enabled web interface, a speech recognizer for mispronunciation detection and diagnosis, a speech synthesizer and a viseme animator. We present a summary of the system's architecture and major interactive features. We also present statistics from evaluations by English teachers and university students who participated in pilot trials. We are also extending the system to cover suprasegmental training and mobile access.

16 citations


02 May 2011

16 citations


Proceedings ArticleDOI
01 Nov 2011
TL;DR: This paper designs a lip synchronization system for the authors' humanoid robot using Microsoft Speech API (SAPI), builds sixteen lip shapes to perform all the visemes, and implements the whole system on the humanoid robot head to demonstrate its success.
Abstract: The most common way for humans to understand each other is through communication, and the same holds for human-robot interaction. The ability to talk is one of the most important technologies in the field of intelligent robotics. When it comes to talking, there are two basic types of signals transmitted by people: auditory and visual. Speech synthesis and speech recognition are crucial abilities for the auditory signal; likewise, lip synchronization is the key technique for the visual signal. Life-like lip synchronization could contribute to improved human-robot interaction. In this paper, we design a lip synchronization system for our humanoid robot using Microsoft Speech API (SAPI). With thirty degrees of freedom in total (twelve for the mouth), we build sixteen lip shapes to perform all the visemes. With the proposed system, precise and life-like lip synchronization gives users a favorable impression and a sense of closeness. Finally, the whole system is implemented on our humanoid robot head to demonstrate the success of the proposed system.
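A hedged sketch of the mapping layer implied above: SAPI reports viseme events as small integer IDs, and each must be turned into one of the sixteen lip shapes over the twelve mouth joints. The ID grouping and the joint values below are illustrative assumptions, not the authors' calibration.

```python
# hypothetical lip-shape table: shape name -> targets for the 12 mouth joints (radians)
LIP_SHAPES = {f"shape_{i:02d}": [0.05 * i] * 12 for i in range(16)}

# hypothetical many-to-one map from SAPI viseme IDs (0..21) to the 16 shapes
VISEME_TO_SHAPE = {vid: f"shape_{vid % 16:02d}" for vid in range(22)}

def on_viseme_event(viseme_id):
    """Called whenever the TTS engine signals the next viseme; returns joint targets."""
    return LIP_SHAPES[VISEME_TO_SHAPE[viseme_id]]

print(on_viseme_event(3))   # joint targets sent to the mouth servos
```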

12 citations


Journal ArticleDOI
TL;DR: This study examines the degree to which visual /p,b,m/ are discriminable in production and perception, and establishes the absence/presence of systematic visual differences in bilabial productions.
Abstract: Previous studies have demonstrated that the labial stops /p,b,m/ may be impossible to discriminate visually, leading to their designation as a single viseme. This perceptual limitation has engendered the belief that there are no visible differences in the production of /p,b,m/, with consequences for research in machine recognition, where production differences below the level of the viseme have been largely ignored. Kinematic studies using high‐speed cine, however, have previously documented systematic differences in the production of labial consonants. This study examines the degree to which visual /p,b,m/ are discriminable in production and perception. Two experiments—one designed to measure kinematic orofacial movement using optical flow analysis and one designed to test perceiver discrimination of /p,b,m/—were used to establish the absence/presence of systematic visual differences in bilabial productions, and to replicate the previous perception findings. Results from the optical flow analysis indicat...

Book ChapterDOI
13 Jun 2011

01 Jan 2011
TL;DR: The experimental results have shown that use of asynchronous frameworks for combined audible and visible speech processing results in improvement of the accuracy of audiovisual speech recognition as well as the naturalness and the intelligibility of speech synthesis.
Abstract: In this paper, we present a study of temporal correlations of audiovisual units in continuous Russian speech. The corpus-based study identifies natural time asynchronies between the flows of the audible and visible speech modalities, partially caused by the inertia of the articulatory organs. Original methods for speech asynchrony modeling have been proposed and studied using bimodal ASR and TTS systems. The experimental results have shown that the use of asynchronous frameworks for combined audible and visible speech processing improves the accuracy of audiovisual speech recognition as well as the naturalness and intelligibility of speech synthesis.

Proceedings ArticleDOI
06 Jun 2011
TL;DR: This article proposes a novel approach introducing a consonant-vowel detector and using two classifiers: an HMM-based classifier for the recognition of the "consonant part" of the phoneme and a classifier for the "vowel part", making it applicable to any set of words of varying size or content.
Abstract: Viseme-based Visual Speech Recognition (VSR) systems, using Hidden Markov Models (HMM) for phoneme recognition, generally use a 3-state left-right HMM for each viseme to recognize. In this article, we propose a novel approach introducing a consonant-vowel detector and using two classifiers: an HMM-based classifier for the recognition of the "consonant part" of the phoneme and a classifier for the "vowel part". The benefits of such an approach include (1) reducing the number of hidden states and (2) reducing the number of HMMs. We tested our method on a limited set of words of Modern Classical Arabic and achieved a recognition rate of 81.7%. Moreover, the proposed model is speaker-independent and uses visemes as the basic units, thereby making it applicable to any set of words of varying size or content.
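A rough sketch, not the paper's system, of a 3-state left-to-right Gaussian HMM such as the one used for the "consonant part", built with hmmlearn; the feature dimensionality and the training data are synthetic stand-ins.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def left_right_hmm(n_states=3):
    # keep the topology fixed: only means and covariances are (re)estimated
    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=20, init_params="cm", params="cm")
    model.startprob_ = np.array([1.0, 0.0, 0.0])          # always start in state 0
    model.transmat_ = np.array([[0.6, 0.4, 0.0],          # left-to-right transitions
                                [0.0, 0.6, 0.4],
                                [0.0, 0.0, 1.0]])
    return model

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 12))          # concatenated visual feature frames (synthetic)
lengths = [50, 50, 50, 50]              # four example utterances of one viseme class
model = left_right_hmm().fit(X, lengths)
print(model.score(X[:50]))              # log-likelihood used to pick the best class
```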

Journal ArticleDOI
TL;DR: The translingual phoneme-to-viseme mapping module offered in this paper is suitable for animating foreign languages and can be driven by Lithuanian phonetics to produce samples of animated Lithuanian speech.
Abstract: A methodology for visualizing Lithuanian phonemes is proposed, using the viseme set of a base language appended with new visemes defined to animate specific Lithuanian phonemes. Translingual viseme mapping is applied for Lithuanian speech animation. The phoneme-to-viseme mapping module is one of the most important parts of the framework for Lithuanian speech animation. A facial animation engine compatible with the MPEG-4 standard is used to integrate the data flows. The base language of the exploited engine is English, and the proposed architecture explains how it can be driven by Lithuanian phonetics to obtain samples of animated Lithuanian speech. The translingual phoneme-to-viseme mapping module offered in this paper is also suitable for animating other foreign languages. Ill. 2, bibl. 14, tabl. 1 (in English; abstracts in English and Lithuanian). http://dx.doi.org/10.5755/j01.eee.111.5.365
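The mapping module can be pictured as a lookup that prefers the appended Lithuanian-specific visemes and otherwise falls back to the base-language (MPEG-4-style) set; the sketch below is illustrative only, and its phoneme symbols and viseme indices are assumptions, not the paper's published table.

```python
# hypothetical base-language (MPEG-4-style) viseme indices; 0 = neutral/silence
MPEG4_BASE = {"p": 1, "b": 1, "m": 1, "t": 4, "d": 4, "k": 5, "g": 5,
              "s": 7, "z": 7, "n": 8, "l": 8, "a": 10, "e": 11, "i": 12,
              "o": 13, "u": 14}
# hypothetical appended visemes for Lithuanian-specific phonemes
APPENDED = {"ė": 15, "ū": 16}

def phoneme_to_viseme(phoneme):
    if phoneme in APPENDED:                   # language-specific viseme first
        return APPENDED[phoneme]
    return MPEG4_BASE.get(phoneme, 0)         # fall back to the base set or neutral

print([phoneme_to_viseme(p) for p in ["l", "a", "b", "a", "s"]])  # "labas" -> [8, 10, 1, 10, 7]
```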

01 Jan 2011
TL;DR: This work investigates perceived audio-visual asynchrony, specifically anticipatory coarticulation, in which the visual cues of a speech sound may occur before the acoustic cues, and finds that the newly proposed models outperform previously suggested asynchrony models for both alignment and recognition tasks.
Abstract: This work investigates perceived audio-visual asynchrony, specifically anticipatory coarticulation, in which the visual cues (e.g. lip rounding) of a speech sound may occur before the acoustic cues. This phenomenon often gives the impression that the visual and acoustic signals are asynchronous. This effect can be accounted for using models based on multiple hidden Markov models with some synchrony constraints linking states in different modalities, though generally only within phones and not across phone boundaries. In this work, we consider several such models, implemented as dynamic Bayesian networks (DBNs). We study the models' ability to accurately locate audio and viseme (audio and video sub-word units, respectively) boundaries in the audio and video signals, and compare them with human labels of these boundaries. This alignment task is important on its own for purposes of linguistic analysis, as it can serve as an analysis tool and a convenience tool to linguists. Furthermore, these advances in alignment systems can carry over into the speech recognition domain. This thesis makes several contributions. First, this work presents a new set of manually labeled phonetic boundary data in words expected to display asynchrony, and analysis of the data confirms our expectations about this phenomenon. Second, this work presents a new software program called AVDDisplay which allows the viewing of audio, video, and alignment data simultaneously and in sync. This tool is essential for the alignment analysis detailed in this work. Third, new DBN-based models of audio-visual asynchrony are presented. The newly proposed models consider linguistic context within the asynchrony model. Fourth, alignment experiments are used to compare system performance with the hand-labeled ground truth. Finally, the performance of these models in a speech recognition context is examined. This work finds that the newly proposed models outperform previously suggested asynchrony models for both alignment and recognition tasks.

Proceedings Article
27 Aug 2011
TL;DR: In this paper, phoneme-based and viseme-based audiovisual speech synthesis techniques are compared in order to explore the balance between data availability and an improved audiovisual coherence for synthesis optimization.
Abstract: A common approach in visual speech synthesis is the use of visemes as atomic units of speech. In this paper, phoneme-based and viseme-based audiovisual speech synthesis techniques are compared in order to explore the balance between data availability and an improved audiovisual coherence for synthesis optimization. A technique for automatic viseme clustering is described and compared to the standardized viseme set described in MPEG-4. Both objective and subjective testing indicated that a phoneme-based approach leads to better synthesis results. In addition, the test results improve when more distinct visemes are defined. This raises some questions about the widely applied viseme-based approach. It appears that a many-to-one phoneme-to-viseme mapping is not capable of describing all subtle details of the visual speech information. In addition, with viseme-based synthesis the perceived synthesis quality is affected by the loss of audiovisual coherence in the synthetic speech.
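A hedged sketch of automatic viseme clustering in the spirit described above: each phoneme is represented by the mean of its visual feature vectors and the phonemes are grouped with k-means into viseme classes. The features and the number of clusters are illustrative assumptions, not the paper's method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
phonemes = ["p", "b", "m", "f", "v", "a", "e", "i", "o", "u"]
mean_visual_feat = {p: rng.normal(size=8) for p in phonemes}   # synthetic per-phoneme means

X = np.stack([mean_visual_feat[p] for p in phonemes])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

viseme_map = {p: int(c) for p, c in zip(phonemes, labels)}     # many-to-one phoneme-to-viseme map
print(viseme_map)
```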

Book ChapterDOI
20 Nov 2011
TL;DR: This paper proposes, for spoken word recognition, to utilize a combined parameter as the visual feature, extracted by an Active Appearance Model applied to a face image including the lip area; the recognition rate was improved by the proposed feature compared to conventional features such as DCT and the principal component score.
Abstract: As one of the techniques for robust speech recognition in noisy environments, audio-visual speech recognition, which uses dynamic visual information of the lips together with audio information, is attracting attention, and research on it has advanced in recent years. Since visual information plays a great role in audio-visual speech recognition, what to select as the visual feature becomes a significant point. This paper proposes, for spoken word recognition, to utilize a combined parameter as the visual feature, extracted by an Active Appearance Model applied to a face image including the lip area. The combined parameter contains both coordinate and intensity information as the visual feature. The recognition rate was improved by the proposed feature compared to conventional features such as DCT and the principal component score. Finally, we integrated the phoneme score from the audio information and the viseme score from the visual information with high accuracy.
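A simplified sketch of the combined-parameter idea: shape (landmark coordinates) and appearance (lip-area intensities) are each reduced with PCA and the concatenated scores are reduced once more, yielding one joint visual feature per frame. The data and dimensions are synthetic assumptions, not the paper's AAM.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
n_frames = 100
shapes = rng.normal(size=(n_frames, 2 * 20))       # 20 lip landmarks (x, y) per frame
textures = rng.normal(size=(n_frames, 32 * 32))    # warped lip-area intensities per frame

b_shape = PCA(n_components=8).fit_transform(shapes)      # shape parameters
b_tex = PCA(n_components=8).fit_transform(textures)      # appearance parameters
combined = PCA(n_components=10).fit_transform(np.hstack([b_shape, b_tex]))
print(combined.shape)                                    # (100, 10): one combined parameter per frame
```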

01 Jan 2011
TL;DR: Congruent AV speech facilitation and incongruent McGurk effects were tested by comparing percent correct syllable identification for full-face visual speech stimuli with upper-face-only conditions, and the results showed more accurate identification for congruent stimuli and less accurate responses for incongruent ones.
Abstract: When the talker’s face (visual speech) can be seen, speech perception is both facilitated (for congruent visual speech) and interfered with (for incongruent visual speech). The current study investigated whether the degree of these visual speech effects was affected by the presence of an additional irrelevant talking face. In the experiment, auditory speech targets (vCv syllables) were presented in noise for subsequent speech identification. Participants were presented with the full display or upper-half (control) display of a talker’s face uttering single syllables either in central vision (Exp 1) or in the visual periphery (Exp 2). In addition, another talker was presented (silently uttering a sentence) either in the periphery (Exp 1) or in central vision (Exp 2). Participants’ eye-movements were monitored to ensure that participants always fixated centrally. Congruent AV speech facilitation and incongruent McGurk effects were tested by comparing percent correct syllable identification for full face visual speech stimuli compared to upper-face only conditions. The results showed more accurate identification for congruent stimuli and less accurate responses for incongruent ones (full face condition vs. the upper-half face control). The magnitude of the McGurk effect was greater when the face articulating the syllable was presented in central vision (with visual speech noise in the periphery) than when it was presented in the periphery (with central visual speech noise). The size of the congruent AV speech effect, however, did not differ as a function of central or peripheral presentation.

Journal ArticleDOI
TL;DR: To evaluate the performance of the proposed method, the authors collected a large number of visual speech signals from five Algerian speakers, male and female, pronouncing 28 Arabic phonemes at different moments, and pitch analysis is investigated to show its variation for each viseme.
Abstract: Visemes are the unique facial positions required to produce phonemes, which are the smallest phonetic units distinguished by the speakers of a particular language. Each language has multiple phonemes and visemes, and each viseme can correspond to multiple phonemes. However, the current literature on viseme research indicates that the mapping between phonemes and visemes is many-to-one: there are many phonemes which look alike visually, and hence they fall into the same visemic category. To evaluate the performance of the proposed method, the authors collected a large number of visual speech signals from five Algerian speakers (male and female), recorded at different moments while pronouncing 28 Arabic phonemes. For each frame the lip area is manually located with a rectangle of size proportional to 120×160, centred on the mouth, and converted to grayscale. Finally, the mean and the standard deviation of the pixel values of the lip area are computed, using 20 images for each phoneme sequence, to classify the visemes. Pitch analysis is also investigated to show its variation for each viseme.
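A minimal sketch of the feature computation described above, assuming synthetic frames: 20 grayscale lip-area images of size 120×160 per phoneme sequence are reduced to per-frame pixel means and standard deviations.

```python
import numpy as np

rng = np.random.default_rng(5)
sequence = rng.integers(0, 256, size=(20, 120, 160)).astype(np.float64)  # 20 lip-area frames

means = sequence.reshape(20, -1).mean(axis=1)   # one pixel mean per frame
stds = sequence.reshape(20, -1).std(axis=1)     # one pixel standard deviation per frame
feature = np.concatenate([means, stds])         # 40-dim descriptor used to classify the viseme
print(feature.shape)
```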

Proceedings ArticleDOI
27 Apr 2011
TL;DR: This paper starts from a 3D human face model that can be adjusted to a particular face and considers a total of 15 different positions, the visemes, to accurately model the articulation of the Portuguese language.
Abstract: In this paper we present an approach for creating interactive and speaking avatar models, based on standard face images. We start from a 3D human face model that can be adjusted to a particular face. In order to adjust the 3D model from a 2D image, a new two-step method is presented. First, a process based on Procrustes analysis is applied in order to find the best match for the input key points, obtaining the rotation, translation and scale needed to best fit the model to the photo. Then, using the resulting model, we refine the face mesh by applying a linear transform to each vertex. In terms of visual speech animation, we consider a total of 15 different positions, the visemes, to accurately model the articulation of the Portuguese language. For normalization purposes, each viseme is defined from the generic neutral face. The animation process is visually represented with linear time interpolation, given a sequence of visemes and their instants of occurrence.
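The first step can be pictured as a 2D similarity Procrustes fit recovering the scale, rotation and translation that map the model key points onto the key points marked on the photo; the point coordinates below are illustrative, not the paper's data.

```python
import numpy as np

def procrustes_fit(model_pts, image_pts):
    """Return scale s, rotation R (2x2) and translation t so that s*R@p + t ~ image point."""
    mu_m, mu_i = model_pts.mean(0), image_pts.mean(0)
    A, B = model_pts - mu_m, image_pts - mu_i
    U, S, Vt = np.linalg.svd(A.T @ B)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:                 # guard against reflections
        Vt[-1] *= -1
        R = (U @ Vt).T
    s = S.sum() / (A ** 2).sum()
    t = mu_i - s * R @ mu_m
    return s, R, t

model = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.5], [0.0, 1.5]])        # model key points
photo = np.array([[10.0, 20.0], [14.0, 21.0], [12.5, 27.0], [8.5, 26.0]])  # marked photo key points
s, R, t = procrustes_fit(model, photo)
print(np.round(s * (R @ model.T).T + t, 2))   # model points mapped onto the photo
```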

Book ChapterDOI
14 Oct 2011
TL;DR: The ultimate goal of ASR research is to allow a computer to recognize in real‐time, with 100% accuracy, all words that are intelligibly spoken by any person, independent of vocabulary size, noise, speaker characteristics or accent.
Abstract: Automatic speech recognition (ASR) can be defined as the independent, computer-driven transcription of spoken language into readable text in real time (Stuckless, 1994). In a nutshell, ASR is technology that allows a computer to identify the words that a person speaks into a microphone or telephone and convert them to written text. Having a machine understand fluently spoken speech has driven speech research for more than 50 years. Although ASR technology is not yet at the point where machines understand all speech, in any acoustic environment, or by any person, it is used on a day-to-day basis in a number of applications and services. The ultimate goal of ASR research is to allow a computer to recognize in real time, with 100% accuracy, all words that are intelligibly spoken by any person, independent of vocabulary size, noise, speaker characteristics or accent. Today, if the system is trained to learn an individual speaker's voice, then much larger vocabularies are possible and accuracy can be greater than 90%. Commercially available ASR systems usually require only a short period of speaker training and may successfully capture continuous speech with a large vocabulary at normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy if operated under optimal conditions. 'Optimal conditions' usually assume that users have speech characteristics which match the training data, can achieve proper speaker adaptation, and work in a low-noise environment (e.g. a quiet space). This explains why some users, especially those whose speech is heavily accented, might achieve recognition rates much lower than expected. The earliest attempts to devise systems for automatic speech recognition by machine were made in the 1950s. Speech recognition technology was designed initially for individuals in the disability community. For example, voice recognition can help people with musculoskeletal disabilities caused by multiple sclerosis, cerebral palsy, or arthritis achieve maximum productivity on computers. During the early 1990s, tremendous market opportunities emerged for speech recognition computer technology. The early versions of these products were clunky and hard to use. The early language recognition systems had to make compromises: they were "tuned" to be dependent on a particular speaker, had a small vocabulary, or used a very stylized and rigid syntax. However, in the computer industry, nothing stays the same for very long and by the end of the 1990s there was a whole new crop of commercial …

Proceedings ArticleDOI
20 Oct 2011
TL;DR: The main idea is to extract and capture a viseme from the video of a human talking and the phonemic scripts inside this video, and generate a talking head animation video by synchronizing the time-stamp of each phoneme to concatenated visemes.
Abstract: We consider the problem of making lip movement for an animated talking character, which consumes effort and cost during the animation development process. The main idea is to extract and capture a viseme from the video of a human talking and the phonemic scripts inside this video. After that, we generate a talking head animation video by synchronizing the time-stamp of each phoneme to concatenated visemes. The results of experimental tests are reported, indicating good accuracy.

Proceedings ArticleDOI
07 Apr 2011
TL;DR: A speaker identification system is described in which lip information is fused with the corresponding speech information from each speaker using a multilayer perceptron classifier; the energy, the zero-cross ratio (ZCR) and the pitch are used as features for the audio modality.
Abstract: Audio-only speaker/speech recognition (ASR) systems are far from perfect, especially under noisy conditions. Furthermore, it is a known fact that the content of speech can be revealed partially through lip-reading. Human speech perception is bimodal in nature: humans combine audio and visual information in deciding what has been spoken, especially in noisy environments. In this paper, we describe a speaker identification system where lip information is fused with the corresponding speech information from each speaker. The energy, the zero-cross ratio (ZCR) and the pitch are used as features for the audio modality. The features for the lip texture modality are the 2D-DCT coefficients of the luminance component. Intuitively, we would expect lip information to be somewhat complementary to speech information due to the range of lip movements associated with the production of the corresponding phonemes in speech. The two modalities are fused using a multilayer perceptron classifier.
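A rough sketch of the bimodal feature extraction and fusion described above (not the paper's code): frame energy and zero-cross ratio from the audio, low-order 2D-DCT coefficients of the lip-region luminance, concatenated and fed to a multilayer perceptron. Pitch extraction is omitted here and all data is synthetic.

```python
import numpy as np
from scipy.fft import dctn
from sklearn.neural_network import MLPClassifier

def audio_features(frame):
    energy = float(np.sum(frame ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))   # zero-cross ratio
    return np.array([energy, zcr])

def lip_features(gray_roi, k=6):
    coeffs = dctn(gray_roi, norm="ortho")      # 2D-DCT of the lip-region luminance
    return coeffs[:k, :k].ravel()              # keep the low-frequency block

rng = np.random.default_rng(6)
X, y = [], []
for speaker in range(4):
    for _ in range(30):
        frame = rng.normal(scale=0.1 + 0.05 * speaker, size=400)   # toy audio frame
        roi = rng.random((32, 32)) + 0.1 * speaker                 # toy lip image
        X.append(np.concatenate([audio_features(frame), lip_features(roi)]))
        y.append(speaker)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0).fit(X, y)
print(clf.score(X, y))    # training accuracy of the fused speaker identifier
```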

01 Jan 2011
TL;DR: The focus of this thesis is to develop computer vision algorithms for a visual speech recognition system to identify the visemes used in speech recognition systems.
Abstract: The focus of this thesis is to develop computer vision algorithms for a visual speech recognition system to identify visemes. The majority of existing speech recognition systems are based on audio-visual signals and have been developed for speech enhancement.

Journal ArticleDOI
TL;DR: In this article, a dynamic viseme model is presented, in which inner-syllable visemes are described by initial-final phones, and inter-syllable visemes are determined by a hierarchical control function.
Abstract: The aspirate and pronunciation of a Chinese syllable is similar to an 'olivary nucleus', and syllables are mildly connected by coarticulation. Accordingly, a dynamic viseme model is presented. Inner-syllable visemes are described by initial-final phones, and inter-syllable visemes are determined by a hierarchical control function. Experimental results show improvement in the naturalness of lip animation, as compared to a visual speech synthesis system based on triphones.

Journal Article
TL;DR: This paper presents a real-time speech driven talking avatar that is able to speak with live speech input and has many potential applications in videophones, virtual conferences, audio/video chats and entertainment.
Abstract: This paper presents a real-time speech driven talking avatar. Unlike most talking avatars, in which the speech-synchronized facial animation is generated offline, this talking avatar is able to speak with live speech input. This life-like talking avatar has many potential applications in videophones, virtual conferences, audio/video chats and entertainment. Since phonemes are the smallest units of pronunciation, a real-time phoneme recognizer was built. The synchronization between the input live speech and the facial motion used a phoneme recognition and output algorithm. The coarticulation effects are included in a dynamic viseme generation algorithm to coordinate the facial animation parameters (FAPs) from the input phonemes. The MPEG-4 compliant avatar model is driven by the generated FAPs. Tests show that the avatar motion is synchronized and natural, with MOS values of 3.42 and 3.5.

Book ChapterDOI
01 Jan 2011
TL;DR: Automatic speech recognition systems convert speech from a recorded audio signal to text using a probabilistic approach to infer original words given the observable signal.
Abstract: Automatic speech recognition (ASR) systems convert speech from a recorded audio signal to text. Humans convert words to speech with their speech production mechanism. An ASR system aims to infer those original words given the observable signal. The most common and, as of today, best method is the probabilistic approach: a speech signal corresponds to any word (or sequence of words) in the vocabulary with a certain probability.
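The probabilistic approach boils down to picking the word (sequence) W that maximizes P(W|X), or equivalently P(X|W)P(W); a tiny sketch over a toy vocabulary with made-up log-probabilities:

```python
import math

def decode(acoustic_logp, language_logp):
    """Return argmax_W [ log P(X|W) + log P(W) ] over a toy vocabulary."""
    return max(acoustic_logp, key=lambda w: acoustic_logp[w] + language_logp[w])

acoustic_logp = {"viseme": -12.3, "visine": -11.9, "museum": -20.4}          # log P(X|W)
language_logp = {w: math.log(p) for w, p in
                 {"viseme": 0.02, "visine": 0.001, "museum": 0.05}.items()}  # log P(W)
print(decode(acoustic_logp, language_logp))   # -> "viseme"
```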

Proceedings ArticleDOI
11 Dec 2011
TL;DR: This paper focuses on reducing the cost and workload in this process, and applies this technique for use with Thai speech, indicating good accuracy of the synchronized lip movement with the speech, compared to an artist-animated talking character.
Abstract: Lip synchronization in character animation is generally done in animation films and games, consuming effort and cost during the animation development process. In this paper, we focus on reducing the cost and workload in this process, and apply the technique to Thai speech. The main idea is to extract and capture a viseme from the video of a human talking and the phonemic scripts inside this video. First, this approach starts by separating the human talking video into two parts that contain the speech and the frame sequence; then the speech, combined with a phonemic script, is used to extract the time-stamp of each phoneme with forced-alignment techniques; next, we create a visyllable database by mapping the start time of each selected phoneme to an image and capturing the positions of interest from that image; after that, we generate a talking head animation video by synchronizing the time-stamp of each phoneme to concatenated visemes. The results of experimental tests are reported, indicating good accuracy of the synchronized lip movement with the speech, compared to an artist-animated talking character.
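The concatenation step can be sketched as turning forced-alignment time-stamps into a per-frame sequence of viseme images; the phoneme names, timings, frame rate and database entries below are illustrative assumptions, not the authors' data.

```python
FPS = 25                                           # assumed video frame rate
VISEME_DB = {"silence": "neutral.png", "k": "k.png", "a:": "aa.png", "p": "p.png"}

# (phoneme, start_sec, end_sec) triples as produced by forced alignment (hypothetical)
alignment = [("silence", 0.00, 0.12), ("k", 0.12, 0.20), ("a:", 0.20, 0.45), ("p", 0.45, 0.60)]

def frames_for(alignment, fps=FPS):
    frames = []
    for phoneme, start, end in alignment:
        n = max(1, round((end - start) * fps))     # how many video frames this phoneme covers
        frames.extend([VISEME_DB.get(phoneme, VISEME_DB["silence"])] * n)
    return frames

print(frames_for(alignment))   # per-frame image sequence to concatenate into the output video
```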

Proceedings ArticleDOI
01 Nov 2011
TL;DR: This work presents a method to synchronize the image and the speech, using Microsoft's Speech Application Programming Interface (SAPI) as the speech synthesis tool.
Abstract: Synchronization between speech and mouth shape involves technologies such as computer vision, speech synthesis, and speech recognition. We present a method to synchronize the image and the speech, and we use Microsoft's Speech Application Programming Interface (SAPI) as the speech synthesis tool. Speech animation includes two components: the speech and the image. The speech synthesis output is obtained from Text-to-Speech (TTS), and the images of the visemes are generated with the software FaceGen Modeller. Three key pictures are imported into this software to calibrate and generate the face model. The viseme event handler in C# connects the mouth-shape image with the corresponding viseme. The images are loaded sequentially, and the visemes are matched with the images one by one. The main applications of speech synthesis are as assistive devices, e.g. the use of screen readers for people with visual impairment. A mute person can take advantage of this technology to talk to others. In recent years, speech synthesis has been extensively applied in service robotics and entertainment productions such as language learning, education, video games, animations, and music videos.