
Showing papers on "Viseme published in 2008"


Journal ArticleDOI
TL;DR: A novel real-time multimodal human-avatar interaction (RTM-HAI) framework with vision-based remote animation control (RAC) that integrates audio-visual analysis and synthesis modules to realize multichannel and runtime animations, visual TTS, and real-time viseme detection and rendering.
Abstract: This paper presents a novel real-time multimodal human-avatar interaction (RTM-HAI) framework with vision-based remote animation control (RAC). The framework is designed for both mobile and desktop avatar-based human-machine or human-human visual communications in real-world scenarios. Using 3-D components stored in the Java mobile 3-D (M3G) file format, the avatar models can be flexibly constructed and customized on the fly on any mobile device or system that supports the M3G standard. For the RAC head tracker, we propose a 2-D real-time face detection/tracking strategy built around an interactive loop, in which detection and tracking complement each other for efficient and reliable face localization that tolerates extreme user movement. With the face location robustly tracked, the RAC head tracker selects a main user and estimates the user's head rolling, tilting, yawing, scaling, horizontal, and vertical motion in order to generate avatar animation parameters. The animation parameters can be used either locally or remotely and can be transmitted through a socket over the network. In addition, the framework integrates audio-visual analysis and synthesis modules to realize multichannel and runtime animations, visual TTS, and real-time viseme detection and rendering. It is recognized as an effective design for future realistic industrial products such as humanoid kiosks and human-to-human mobile communication.
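
The abstract does not give a wire format, so purely as an illustration of how per-frame head animation parameters might be packaged and "transmitted through a socket over the network", here is a minimal Python sketch. The field names, JSON encoding, host and port are assumptions for the example, not the paper's protocol.

```python
# Minimal sketch (not the paper's implementation): package head-pose animation
# parameters and push them to a remote avatar renderer over a TCP socket.
import json
import socket
from dataclasses import dataclass, asdict

@dataclass
class AnimationParams:
    roll: float    # head rolling, degrees
    tilt: float    # head tilting (pitch), degrees
    yaw: float     # head yawing, degrees
    scale: float   # apparent head scale (proxy for depth)
    dx: float      # horizontal translation, normalized
    dy: float      # vertical translation, normalized

def send_params(params: AnimationParams, host: str = "127.0.0.1", port: int = 9000) -> None:
    """Serialize the parameters as one JSON line and send them to the renderer."""
    payload = (json.dumps(asdict(params)) + "\n").encode("utf-8")
    with socket.create_connection((host, port), timeout=1.0) as sock:
        sock.sendall(payload)

if __name__ == "__main__":
    # Requires a renderer listening on the port above; otherwise this just reports the failure.
    try:
        send_params(AnimationParams(roll=2.5, tilt=-1.0, yaw=10.0, scale=1.05, dx=0.02, dy=-0.01))
    except OSError as err:
        print("no renderer listening:", err)
```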

70 citations


Patent
19 Nov 2008
TL;DR: In this paper, a system for voice personalization of video content is described, which includes a composition of a background scene having a character, head model data representing an individualized three-dimensional (3D) head model of a user, audio data simulating the user's voice, and a viseme track containing instructions for causing the individualized 3D head model to lip sync the words contained in the audio data.
Abstract: Systems and methods are disclosed for performing voice personalization of video content. The personalized media content may include a composition of a background scene having a character, head model data representing an individualized three-dimensional (3D) head model of a user, audio data simulating the user's voice, and a viseme track containing instructions for causing the individualized 3D head model to lip sync the words contained in the audio data. The audio data simulating the user's voice can be generated using a voice transformation process. In certain examples, the audio data is based on a text input or selected by the user (e.g., via a telephone or computer) or a textual dialogue of a background character.

31 citations


Proceedings ArticleDOI
26 Aug 2008
TL;DR: A complete pipeline of efficient and low-cost techniques to construct a realistic 3D text-driven emotive audio-visual avatar from a single 2D frontal-view face image of any person on the fly is proposed.
Abstract: In this paper, we propose a complete pipeline of efficient and low-cost techniques to construct a realistic 3D text-driven emotive audio-visual avatar from a single 2D frontal-view face image of any person on the fly. This real-time conversion is achieved in three steps. First, a personalized 3D face model is built from the 2D face image using a fully automatic 3D face shape and texture reconstruction framework. Second, using standard MPEG-4 FAPs (Facial Animation Parameters), the face model is animated by the viseme and expression channels and is complemented by the visual prosody channel that controls head, eye and eyelid movements. Finally, the facial animation is combined and synchronized with the emotive synthetic speech generated by incorporating an emotion transformer into a Festival-MBROLA text-to-neutral-speech synthesizer.

27 citations


Book
01 Nov 2008
TL;DR: A pictorial guide to typical and atypical speech sounds, as discussed by the authors.
Abstract: Speech sounds: a pictorial guide to typical and atypical speech.

17 citations


Book ChapterDOI
01 Jan 2008
TL;DR: With the development of new trends in human-machine interfaces, animated feature films and video games, better avatars and virtual agents are required, and new automatic approaches are needed to synthesize realistic animation that captures and resembles the complex relationship between these communicative channels.
Abstract: With the development of new trends in human-machine interfaces, animated feature films and video games, better avatars and virtual agents are required that more accurately mimic how humans communicate and interact. Gestures and speech are jointly used to express intended messages. The tone and energy of the speech, facial expression, rigid head motion and hand motion combine in a non-trivial manner as they unfold in natural human interaction. Given that the use of large motion capture datasets is expensive and can only be applied in planned scenarios, new automatic approaches are required to synthesize realistic animation that captures and resembles the complex relationship between these communicative channels. One useful and practical approach is the use of acoustic features to generate gestures, exploiting the link between gestures and speech. Since the shape of the lips is determined by the underlying articulation, acoustic features have been used to generate visual visemes that match the spoken sentences [4, 5, 12, 17]. Likewise, acoustic features have been used to synthesize facial expressions [11, 30], exploiting the fact that the same muscles used for articulation also affect the shape of the face [44, 46]. One important gesture that has received less attention than other aspects of facial animation is rigid head motion. Head motion is important not only to acknowledge active listening or replace verbal information (e.g. a "nod"), but also for many aspects of human communication.

17 citations


Dissertation
01 Nov 2008
TL;DR: A large section of this thesis is dedicated to analysing the performance of the new visual speech unit model compared with that attained for standard (MPEG-4) viseme models.
Abstract: This dissertation presents a new learning-based representation that is referred to as a Visual Speech Unit for visual speech recognition (VSR). The automated recognition of human speech using only features from the visual domain has become a significant research topic that plays an essential role in the development of many multimedia systems such as audio-visual speech recognition (AVSR), mobile phone applications, human-computer interaction (HCI) and sign language recognition. The inclusion of lip visual information is opportune since it can improve the overall accuracy of audio or hand recognition algorithms, especially when such systems are operated in environments characterized by a high level of acoustic noise. The main contribution of the work presented in this thesis is the development of a new learning-based representation referred to as the Visual Speech Unit (VSU). The main components of the developed visual speech recognition system are applied to: (a) segment the mouth region of interest, (b) extract the visual features from the real-time input video image and (c) identify the visual speech units. The major difficulty associated with VSR systems resides in the identification of the smallest elements contained in the image sequences that represent the lip movements in the visual domain. The Visual Speech Unit concept as proposed represents an extension of the standard viseme model that is currently applied for VSR. The VSU model augments the standard viseme approach by including in the new representation not only the data associated with the articulation of the visemes but also the transitory information between consecutive visemes. A large section of this thesis is dedicated to analysing the performance of the new visual speech unit model compared with that attained for standard (MPEG-4) viseme models. Two experimental results indicate that: 1. The developed VSR system achieved 80-90% correct recognition when applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only 62-72%. 2. When 15 words are identified using VSUs and visemes as the visual speech elements, the accuracy rate for word recognition based on VSUs is 7-12% higher than the accuracy rate based on visemes.

12 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This work extends and improves a recently introduced dynamic Bayesian network based audio-visual automatic speech recognition (AV-ASR) system to model the audio and visual streams as being composed of separate, yet related, sub-word units.
Abstract: This work extends and improves a recently introduced (Dec. 2007) dynamic Bayesian network (DBN) based audio-visual automatic speech recognition (AV-ASR) system. That system models the audio and visual components of speech as being composed of the same sub-word units when, in fact, this is not psycholinguistically true. We extend the system to model the audio and visual streams as being composed of separate, yet related, sub-word units. We also introduce a novel stream weighting structure incorporated into the model itself. In doing so, our system makes improvements in word error rate (WER) and overall recognition accuracy in a large vocabulary continuous speech recognition (LVCSR) task. The "best" performing proposed system attains a WER of 66.71%, whereas the "best" baseline system performs at a WER of 64.30%. The proposed system also improves accuracy to 45.95% from 39.40%.
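
For readers unfamiliar with audio-visual stream weighting, the sketch below shows the common exponent-weighted fusion of per-stream likelihoods. It is only background: the paper's contribution is a weighting structure incorporated into the DBN itself, which this toy function does not reproduce.

```python
# Minimal sketch of stream-weighted audio-visual likelihood fusion, a common
# device in AV-ASR decoders. The weight lam and the example values are assumptions.
import math

def fused_log_likelihood(logp_audio: float, logp_video: float, lam: float) -> float:
    """Combine per-stream log-likelihoods for one state with weight lam in [0, 1].

    Equivalent to P_audio**lam * P_video**(1 - lam) in the probability domain.
    """
    return lam * logp_audio + (1.0 - lam) * logp_video

# Example: weight the audio stream more heavily in clean acoustic conditions.
print(fused_log_likelihood(math.log(0.4), math.log(0.1), lam=0.8))
```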

12 citations




Proceedings ArticleDOI
20 Jun 2008
TL;DR: Experimental results show that, for randomly chosen speech, the synthetic visual speech is smooth and realistic.
Abstract: In order to realize realistic visual speech synthesis, a visual speech synthesis method based on Chinese dynamic visemes is proposed. Using the mouth feature parameters of Chinese static visemes, consonants and vowels are classified with a clustering algorithm. According to the characteristics of Chinese pronunciation, 40 basic dynamic visemes are obtained by combining consonant types and vowel types. With these dynamic visemes and the corresponding phonemes, a two-layer hidden Markov model (HMM) is built and trained. Experimental results show that, for randomly chosen speech, the synthetic visual speech is smooth and realistic.
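
As a rough illustration of the clustering step described above (grouping sounds by their static mouth feature parameters), here is a minimal k-means sketch; the feature values and the number of clusters are invented for the example and are not the paper's configuration.

```python
# Minimal sketch: cluster phonemes into viseme classes by their static mouth
# feature parameters. Features (width, height, protrusion) and k are assumptions.
import numpy as np
from sklearn.cluster import KMeans

phonemes = ["a", "o", "e", "i", "u", "b", "p", "m", "f", "d"]
mouth_features = np.array([
    [0.90, 0.80, 0.10], [0.55, 0.70, 0.60], [0.70, 0.45, 0.15],
    [0.85, 0.25, 0.05], [0.40, 0.35, 0.70], [0.30, 0.05, 0.20],
    [0.30, 0.05, 0.20], [0.30, 0.05, 0.20], [0.45, 0.15, 0.10],
    [0.60, 0.30, 0.10],
])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(mouth_features)
for phoneme, label in zip(phonemes, kmeans.labels_):
    print(f"{phoneme} -> viseme class {label}")
```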

10 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper presents some minimal signal modification techniques for reducing audible artifacts in speech synthesis due to discontinuities in pitch, energy, and formant trajectories at the joining point of the units.
Abstract: Our earlier work [1] on speech synthesis has shown that syllables can produce reasonably natural quality speech. Nevertheless, audible artifacts are present due to discontinuities in pitch, energy, and formant trajectories at the joining point of the units. In this paper, we present some minimal signal modification techniques for reducing these artifacts.
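
As an illustration only, the sketch below applies one kind of minimal modification: linearly bridging an F0 discontinuity at a unit boundary. It is a toy under assumed inputs, not the techniques proposed in the paper.

```python
# Minimal sketch: smooth an F0 (pitch) discontinuity at the joining point of two
# concatenated units by linearly interpolating values in a small window around the join.
import numpy as np

def smooth_f0_at_join(f0: np.ndarray, join: int, half_window: int = 5) -> np.ndarray:
    """Linearly bridge the F0 contour across a unit boundary at frame index `join`."""
    out = f0.copy()
    lo, hi = max(0, join - half_window), min(len(f0) - 1, join + half_window)
    out[lo:hi + 1] = np.linspace(f0[lo], f0[hi], hi - lo + 1)
    return out

# Example: an abrupt 20 Hz jump at frame 50 is ramped over +/- 5 frames.
contour = np.concatenate([np.full(50, 120.0), np.full(50, 140.0)])
smoothed = smooth_f0_at_join(contour, join=50)
print(contour[45:55])
print(np.round(smoothed[45:55], 1))
```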

10 citations


Journal ArticleDOI
TL;DR: In this article, a new automatic approach for lip POI localization and feature extraction on a speaker's face based on mouth color information and a geometrical model of the lips is presented.
Abstract: An automatic lip-reading system is among the assistive technologies for hearing-impaired or elderly people. We can imagine, for example, a dependent person commanding a machine with an easy lip movement or by simply pronouncing a viseme (visual phoneme). A lip-reading system is decomposed into three subsystems: a lip localization subsystem, then a feature extraction subsystem, followed by a classification system that maps feature vectors to visemes. The major difficulty in a lip-reading system is the extraction of the visual speech descriptors. To accomplish this task, it is necessary to carry out automatic localization and tracking of the labial gestures. We present, in this paper, a new automatic approach for lip POI localization and feature extraction on a speaker's face based on mouth color information and a geometrical model of the lips. The extracted visual information is then classified in order to recognize the uttered viseme. We have developed our Automatic Lip Feature Extraction prototype (ALiFE). The ALiFE prototype is evaluated for multiple speakers under natural conditions. Experiments include a group of French visemes for different speakers. Results reveal that our system recognizes 94.64% of the tested French visemes.
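
As a hedged illustration of mouth-color-based lip localization, the sketch below thresholds a pseudo-hue map (R/(R+G)), a cue often used to separate lips from skin; the transform and the fixed threshold are assumptions and not necessarily what ALiFE uses.

```python
# Minimal sketch of color-based lip highlighting. The pseudo-hue R/(R+G) tends
# to be higher on lips than on surrounding skin; values here are illustrative.
import numpy as np

def lip_mask(rgb: np.ndarray, threshold: float = 0.56) -> np.ndarray:
    """Return a boolean mask of likely lip pixels from an HxWx3 float image in [0, 1]."""
    r, g = rgb[..., 0], rgb[..., 1]
    pseudo_hue = r / (r + g + 1e-6)
    return pseudo_hue > threshold

# Example on a random image; a real system would then fit a geometric lip model
# to the largest connected region of the mask.
image = np.random.rand(120, 160, 3)
print(lip_mask(image).sum(), "candidate lip pixels")
```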

Journal ArticleDOI
10 Jan 2008
TL;DR: This work describes an approach to pose-based interpolation that deals with coarticulation using a constraint-based technique and demonstrates it using a Mexican-Spanish talking head, which can vary its speed of talking and produce coarticulation effects.
Abstract: A common approach to produce visual speech is to interpolate the parameters describing a sequence of mouth shapes, known as visemes, where a viseme corresponds to a phoneme in an utterance. The interpolation process must consider the issue of context-dependent shape, or coarticulation, in order to produce realistic-looking speech. We describe an approach to such pose-based interpolation that deals with coarticulation using a constraint-based technique. This is demonstrated using a Mexican-Spanish talking head, which can vary its speed of talking and produce coarticulation effects.
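
To make pose-based interpolation concrete, here is a minimal sketch that eases between two viseme parameter vectors; the paper's constraint-based handling of coarticulation goes well beyond this.

```python
# Minimal sketch of interpolating between viseme parameter vectors (e.g. mouth
# width, openness, protrusion). Cosine ease in/out is used purely for illustration.
import numpy as np

def interpolate_visemes(v0: np.ndarray, v1: np.ndarray, n_frames: int) -> np.ndarray:
    """Return n_frames parameter vectors easing from viseme v0 to viseme v1."""
    t = np.linspace(0.0, 1.0, n_frames)
    ease = 0.5 - 0.5 * np.cos(np.pi * t)          # smooth start and end
    return (1.0 - ease)[:, None] * v0 + ease[:, None] * v1

# Example: from an /m/-like closed mouth to an /a/-like open mouth over 10 frames.
closed = np.array([0.30, 0.05, 0.20])
open_a = np.array([0.90, 0.80, 0.10])
print(np.round(interpolate_visemes(closed, open_a, 10), 2))
```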

Proceedings ArticleDOI
05 Nov 2008
TL;DR: Visual speech synthesis experiments and subjective evaluation results show that mouth animations can be obtained which are not only realistic with clear and smooth mouth images, but also in good accordance with the acoustic pronunciation and intensity of the input speech.
Abstract: This paper presents a novel speech-driven, accurate, realistic visual speech synthesis approach. Firstly, an audio-visual instance database is built for different viseme context combinations, i.e. diviseme units, using 100 audio-visual speech sentences of a female speaker. Then a diviseme instance selection algorithm is introduced to choose the optimal diviseme instances for the viseme contexts in the input speech, considering the concatenation smoothness of the image sequences, the matching of the mouth movements to the acoustic pronunciation process, as well as the intensity of the input speech. Finally, mouth image sequences of the corresponding viseme segments in the selected diviseme instances are time-warped and blended to construct the mouth images of the final animation. Visual speech synthesis experiments and subjective evaluation results show that mouth animations can be obtained which are not only realistic, with clear and smooth mouth images, but also in good accordance with the acoustic pronunciation and intensity of the input speech.

Journal ArticleDOI
TL;DR: Objective and subjective evaluations on the synthesized mouth animations prove that the multimodal diviseme instance selection algorithm proposed in this paper outperforms the triphone unit selection algorithm in Video Rewrite.
Abstract: This paper presents a novel audio-visual diviseme (viseme pair) instance selection and concatenation method for speech-driven photo-realistic mouth animation. Firstly, an audio-visual diviseme database is built consisting of the audio feature sequences, intensity sequences and visual feature sequences of the instances. In the Viterbi-based diviseme instance selection, we set the accumulative cost as the weighted sum of three items: 1) the logarithm of the concatenation smoothness of the synthesized mouth trajectory; 2) the logarithm of the pronunciation distance; 3) the logarithm of the audio intensity distance between the candidate diviseme instance and the target diviseme segment in the incoming speech. The selected diviseme instances are time-warped and blended to construct the mouth animation. Objective and subjective evaluations of the synthesized mouth animations prove that the multimodal diviseme instance selection algorithm proposed in this paper outperforms the triphone unit selection algorithm in Video Rewrite. Clear, accurate, smooth mouth animations can be obtained that match well with the pronunciation and intensity changes in the incoming speech. Moreover, with the logarithm function in the accumulative cost, it is easy to set the weights to obtain optimal mouth animations.
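
The accumulative cost described above fits naturally into standard Viterbi dynamic programming over candidate instances. The sketch below shows that structure with placeholder cost functions and weights; the actual features, distances and weights used in the paper are not reproduced here.

```python
# Minimal sketch of Viterbi-style unit (diviseme instance) selection with an
# accumulative cost built from weighted log-distances. Cost functions, weights
# and the toy example data are assumptions for illustration.
import math

def select_instances(segments, candidates, concat_cost, target_cost, w_concat=1.0, w_target=1.0):
    """Pick one instance per target diviseme segment with minimum accumulative cost.

    segments: list of target diviseme segments from the incoming speech.
    candidates: candidates[i] is a list of hashable instance ids for segment i.
    concat_cost(prev, cur): smoothness distance between consecutive instances.
    target_cost(cand, seg): pronunciation/intensity distance to the target segment.
    """
    n = len(segments)
    best = [{} for _ in range(n)]                 # best[i][cand] = (cost, backpointer)
    for c in candidates[0]:
        best[0][c] = (w_target * math.log(1.0 + target_cost(c, segments[0])), None)
    for i in range(1, n):
        for c in candidates[i]:
            tc = w_target * math.log(1.0 + target_cost(c, segments[i]))
            prev_costs = {p: cost + w_concat * math.log(1.0 + concat_cost(p, c))
                          for p, (cost, _) in best[i - 1].items()}
            p_best = min(prev_costs, key=prev_costs.get)
            best[i][c] = (prev_costs[p_best] + tc, p_best)
    last = min(best[-1], key=lambda c: best[-1][c][0])
    path = [last]
    for i in range(n - 1, 0, -1):                 # trace back the optimal sequence
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Toy example with two target segments and string-labelled candidate instances.
chosen = select_instances(
    segments=["d-i", "i-v"],
    candidates=[["d-i_1", "d-i_2"], ["i-v_1", "i-v_2"]],
    concat_cost=lambda p, c: 0.2 if p.endswith("_1") == c.endswith("_1") else 1.5,
    target_cost=lambda cand, seg: 0.1 if cand.endswith("_1") else 0.6,
)
print(chosen)
```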

Dissertation
01 Jan 2008
TL;DR: This work shows that the effect of temporal variation due to coarticulation is statistically significant and should be taken into account in modelling visual speech synthesis, and provides the foundation for further research towards achieving perceptually realistic animation of a talking head and the understanding of visual dynamics of shape and texture during speech.
Abstract: Face to face dialogue is the most natural mode of communication between humans. The combination of human visual perception of expression and perception of changes in intonation provides semantic information that communicates ideas, feelings and concepts. The realistic modelling of speech movements, through automatic facial animation, and maintaining audio-visual coherence is still a challenge in both the computer graphics and film industries. A common approach to producing visual speech is to interpolate parameters that describe mouth variation in sequence, known as visemes. A viseme corresponds to a phoneme in an utterance. Most talking head systems use sets of static visemes, represented by a single mouth shape image or 3D model. However, discretising visemes in this way does not account for context-dependent dynamic information, coarticulation. This thesis presents several visual analysis and dynamic modelling techniques for visual phones. This spans several areas of work, from capture and representation through to analysis and synthesis of speech movements and coarticulation. A novel method is reported for the automatic extraction of inner-lip contour edges from sequences of mouth images in speech. The proposed detection technique is a key-frame exemplar-based method that is not dependent on any prior frame information for initialisation, allowing for reliable and accurate inner-lip localisation for the large frame-to-frame changes in lip shape inherent in 25 Hz video of visual speech. Visual analysis of phonemes in continuous speech is performed, which involves the investigation of mouth representations as well as a comparative analysis between static and dynamic representations of visemes. The analysis shows the need to analyse and model the underlying dynamics of visemes due to coarticulation. Finally, visual analysis of lip coarticulation in Vowel-Consonant-Vowel (VCV) utterances is presented. Based on ensemble statistics, a novel approach to the analysis and modelling of temporal dynamics is presented. Results show that the temporal influence of coarticulation is significant both in lip shape variation and in the timings of lip movement during coarticulation. This work shows that the effect of temporal variation due to coarticulation is statistically significant and should be taken into account in modelling visual speech synthesis. The work in this thesis provides the foundation for further research towards achieving perceptually realistic animation of a talking head and the understanding of visual dynamics of shape and texture during speech.

Patent
04 Apr 2008
TL;DR: In this paper, an image signal analysis means extracts vowel information of the voiced speech from the input lip image, and the ratio of the lip opening size at vowel pronunciation to a predetermined reference size is extracted as a pitch ratio.
Abstract: PROBLEM TO BE SOLVED: To include the intonation intended by a speaker in the synthetic speech when synthetic voiced speech is generated from non-voiced speech and a lip image. SOLUTION: In a speech synthesis device, the non-voiced speech of the speaker and a photographed lip image are input synchronously to generate synthetic voiced speech. An image signal analysis means extracts vowel information of the voiced speech from the input lip image, and the ratio of the lip opening size at vowel pronunciation to a predetermined reference size is extracted as a pitch ratio. A speech signal analysis means extracts consonant information from the input non-voiced speech together with a sound model of the non-voiced vowel corresponding to the vowel extracted by the image signal analysis means; it extracts text information using a built-in dictionary, which stores phoneme sequences and words in association with each other, and a language model for estimating the word sequence; and it extracts the duration of the whole pronunciation from the power variation of the input non-voiced speech. A speech synthesis means synthesizes voiced speech with intonation added, based on the information extracted by both analysis means. COPYRIGHT: (C)2010, JPO&INPIT

Proceedings ArticleDOI
01 Dec 2008
TL;DR: A method of lip shape coordination with speech in the facial expression robot system "H&F robot-III", which can model the talking robot's lip shape using a visual speech system that includes three modules: speech recognition, lip shape recognition and a lip pose actuator.
Abstract: This paper proposes a method of lip shape coordination with speech in the facial expression robot system "H&F robot-III". The proposed method can model the talking robot's lip shape using a visual speech system, which includes three modules: speech recognition, lip shape recognition and a lip pose actuator. In the lip shape recognition module, a viseme representation method is proposed for synthesising human visual speech. To analyze the robot's lip shape, a lip shape model is developed based on anatomy and the facial action coding system (FACS). When the robot speaks, lip shape coordination with speech can be realized through a basic lip shape or a combination of basic lip shapes. In the "H&F robot-III" system, the lip shape is realized through a slide and guide-slot mechanism, which implements the two-way movement of the lip muscles. Finally, experimental results on lip coordination with speech are shown: when speaking the same word, the lip shape of the robot is similar to that of a human.

Proceedings ArticleDOI
01 Sep 2008
TL;DR: A new learning-based approach to speech synthesis that achieves mouth movements with rich and expressive articulation for novel audio input by using a Locally Linear Embedding representation of feature points on 3D scans, and a system of viseme categories, which are used to define triphone substitution rules and a cost function.
Abstract: This paper presents a new learning-based approach to speech synthesis that achieves mouth movements with rich and expressive articulation for novel audio input. From a database of 3D triphone motions, our algorithm picks the optimal sequences based on a triphone similarity measure, and concatenates them to create new utterances that include coarticulation effects. By using a Locally Linear Embedding (LLE) representation of feature points on 3D scans, we propose a model that defines a measure of similarity among visemes, and a system of viseme categories, which are used to define triphone substitution rules and a cost function. Moreover, we compute deformation vectors for several facial expressions, allowing expression variation to be smoothly added to the speech animation. In an entirely data-driven approach, our automated procedure for defining viseme categories closely reproduces the groups of related visemes that are defined in the phonetics literature. The structure of our selection method is intrinsic to the nature of speech and generates a substitution table that can be reused as-is in different speech animation systems.
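
As a small illustration of the kind of LLE representation mentioned above, the sketch below embeds mouth-shape feature vectors with scikit-learn's LocallyLinearEmbedding and scores viseme similarity as the distance between class centroids in the embedded space. The random data and the similarity definition are assumptions for the example.

```python
# Minimal sketch: embed mouth-shape features with LLE and compare visemes in the
# embedded space. Data is random; the paper uses feature points on registered 3D scans.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
mouth_shapes = rng.normal(size=(200, 30))    # 200 frames x 30 stacked feature-point coords

lle = LocallyLinearEmbedding(n_neighbors=10, n_components=3, random_state=0)
embedded = lle.fit_transform(mouth_shapes)

def viseme_similarity(frames_a, frames_b) -> float:
    """Similarity of two visemes, each given as a list of indices of example frames."""
    centroid_a = embedded[frames_a].mean(axis=0)
    centroid_b = embedded[frames_b].mean(axis=0)
    return float(-np.linalg.norm(centroid_a - centroid_b))   # higher = more similar

print(viseme_similarity(list(range(0, 20)), list(range(20, 40))))
```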

Proceedings ArticleDOI
15 Aug 2008
TL;DR: Previous work on digital speech signal processing is discussed, along with how to apply existing speech processing techniques to the proposed algorithms for speech-driven lip motion animation for Japanese-style anime.
Abstract: In the production of Japanese-style anime, the movement of the lips with speech is usually reduced to the more convenient 'open' and 'close' of the mouth because of the expensive production cost. In this paper we provide an approach to speech-driven lip animation for Japanese-style anime. First we discuss previous work on digital speech signal processing and show how to apply existing speech processing techniques to our work. Then we propose our algorithms for speech-driven lip motion animation. Finally, experimental results are provided.

Patent
16 Jul 2008
TL;DR: In this paper, the authors propose a speech recognizer with low processing requirements and high recognition performance, especially for speech recognition of a tone language, in which tone information indicating the tone of a selected label is extracted from the input speech and the label is corrected on the basis of the extracted tone information and the content of a pattern list.
Abstract: PROBLEM TO BE SOLVED: To provide a speech recognizer that requires little processing and achieves high recognition performance, especially for speech recognition of a tone language. SOLUTION: The speech recognizer extracts a fundamental frequency from the input speech and acoustically analyses the input speech; selects one of plural speech recognition results obtained by speech recognition and outputs a label string indicating the selected speech recognition result; selects at least one label in the output label string on the basis of a pattern list held in advance; and extracts tone information indicating the tone of the selected label on the basis of the fundamental frequency extracted from the input speech, and corrects the selected label on the basis of the extracted tone information and the content of the pattern list.

Proceedings ArticleDOI
22 Sep 2008
TL;DR: A comparative study between spontaneous speech and read Mandarin speech in the context of automatic speech recognition and the technique of Multispace distribution (MSD) to model partially continuous F0 contours is presented.
Abstract: In this paper, we present a comparative study between spontaneous speech and read Mandarin speech in the context of automatic speech recognition. We focus on analysis and modeling of prosodic features, based on a unique speech corpus that contains similar amounts of read and spontaneous speech data from the same group of speakers. Statistical analysis is carried out on tone contours and duration of syllable and subsyllable units. Speech recognition experiments are performed to evaluate the effectiveness of different approaches to incorporate prosodic features into acoustic modeling. A key problem being addressed is how to deal with the unvoiced frames where F0 values are unavailable. We apply the technique of Multispace distribution (MSD) to model partially continuous F0 contours. For spontaneous speech, the tonal-syllable error rate is reduced from the MFCC baseline of 64.8% to 59.4% with the MSD based prosody model. For read speech, the performance improves from 46.0% to 36.4%.


Journal IssueDOI
TL;DR: This work proposes a set of algorithms to efficiently make speech animation for 3D cartoon characters based on blendshapes, a linear interpolation technique, which is widely used in facial animation practice.
Abstract: We propose a set of algorithms to efficiently make speech animation for 3D cartoon characters. Our prototype system is based on blendshapes, a linear interpolation technique which is widely used in facial animation practice. In our system, a few base target shapes of the character, a prerecorded voice, and its transcription are required as input. We describe a simple technique that amplifies the target shapes from these few inputs using a generic database of viseme mouth shapes. We also introduce additional lip-synch editing parameters that allow designers to quickly tune the lip movements. Based on these, we implement our prototype system as a Maya plug-in. The demonstration movies created with this system illustrate well the practicality of our approach. Copyright © 2008 John Wiley & Sons, Ltd.
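
Because blendshapes are a linear interpolation technique, the core computation is compact: each frame is the neutral shape plus a weighted sum of target-shape offsets. The shapes and weights below are toy values, not the paper's data.

```python
# Minimal sketch of blendshape-based mouth animation: each output frame is a
# linear combination of base target shapes driven by per-frame viseme weights.
import numpy as np

# Base target shapes: neutral plus three mouth poses, each as N vertices x 3 (toy data).
n_vertices = 4
neutral = np.zeros((n_vertices, 3))
targets = {
    "open":   np.array([[0, -1, 0], [0, -1, 0], [0, 0, 0], [0, 0, 0]], float),
    "wide":   np.array([[-1, 0, 0], [1, 0, 0], [-1, 0, 0], [1, 0, 0]], float),
    "pucker": np.array([[0.5, 0, 1], [-0.5, 0, 1], [0.5, 0, 1], [-0.5, 0, 1]], float),
}

def blend(weights: dict) -> np.ndarray:
    """Linear blendshape: neutral + sum_i w_i * (target_i - neutral)."""
    frame = neutral.copy()
    for name, w in weights.items():
        frame += w * (targets[name] - neutral)
    return frame

# Example frame for an /o/-like viseme: partly open, partly puckered.
print(blend({"open": 0.6, "pucker": 0.4}))
```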

01 Jun 2008
TL;DR: In this article, the distances between vowels based on the produced lip shapes were analyzed in an experiment of distant interaction between a deaf participant using CS and a normal-hearing participant, and the results showed that the sets of vowels with similar lip shapes (so-called visemes) are the same groups as for a normal-hearing cuer.
Abstract: While studies on Cued Speech (CS) perception are numerous, those referring to its production by a normal-hearing cuer are fewer, and they are almost non-existent in the case of a deaf cuer. The latter is the topic of this contribution. In an experiment of distant interaction between a deaf participant using CS and a normal-hearing participant, we analyze the distances between vowels based on the produced lip shapes. The results show, first, that the sets of vowels with similar lip shapes (so-called visemes) are the same groups as for a normal-hearing cuer. Second, the lip shapes are coherent with the CS system. Finally, the analysis of the temporal coordination between the CS hand gestures and the lip gestures reveals the same scheme as first observed by Attina [6] with normal-hearing cuers and confirmed by Aboutabit et al. [3] in a more complex speech context, i.e. the CS hand gesture leading the lip shapes.

01 Sep 2008
TL;DR: A parameterisation of lip movements is described which maintains the dynamic structure inherent in the task of producing speech sounds and is believed to be appropriate to various areas of speech modeling, in particular the synthesis of speech lip movements.
Abstract: In this paper we describe a parameterisation of lip movements which maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. This data is mapped into a space which maintains the relationships between samples and their temporal derivatives. By incorporating dynamic information within the parameterisation of lip movements we can model the cyclical structure, as well as the causal nature of speech movements as described by an underlying visual speech manifold. It is believed that such a structure will be appropriate to various areas of speech modeling, in particular the synthesis of speech lip movements.
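
One simple way to keep temporal-derivative information in such a parameterisation is to stack each sample with its frame-to-frame derivative before reducing dimensionality. The sketch below does this with PCA on a toy trajectory; it is only an approximation of the described approach, which works from 3D stereo reconstructions of TIMIT sentences.

```python
# Minimal sketch of a dynamic parameterisation: augment each lip-shape sample
# with its temporal derivative, then reduce dimensionality. Toy data and PCA
# are assumptions, not the paper's pipeline.
import numpy as np
from sklearn.decomposition import PCA

# Toy lip trajectory: T frames x D shape parameters.
T, D = 200, 12
t = np.linspace(0, 4 * np.pi, T)[:, None]
shapes = np.sin(t + np.linspace(0, np.pi, D)[None, :])

velocities = np.gradient(shapes, axis=0)             # temporal derivatives
dynamic_features = np.hstack([shapes, velocities])   # position + velocity per frame

embedding = PCA(n_components=3).fit_transform(dynamic_features)
print(embedding.shape)   # (200, 3): a low-dimensional dynamic parameterisation
```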

Proceedings ArticleDOI
07 Apr 2008
TL;DR: This paper presents a hybrid framework for lip reading which is based on both audio and visual speech parameters extracted from a video stream of isolated spoken words, and shows interesting results in the recognition of isolated words.
Abstract: It is well known that the speech production and perception process is inherently bimodal, consisting of audio and visual components. Recently there has been increased interest in using the visual modality in combination with the acoustic modality for improved speech processing. This field of study has gained the title of audio-visual speech processing. Lip movement recognition, also known as lip reading, is a communication skill which involves the interpretation of lip movements in order to estimate some important parameters of the lips that include, but are not limited to, size, shape and orientation. In this paper, we present a hybrid framework for lip reading which is based on both audio and visual speech parameters extracted from a video stream of isolated spoken words. The proposed algorithm is self-tuned in the sense that it starts with an estimation of speech parameters based on visual lip features, and then the coefficients of the algorithm are fine-tuned based on the extracted audio parameters. In the audio speech processing part, the extracted audio features are used to generate a vector containing information about the speech phonemes. This information is later used to enhance the recognition and matching process. For lip feature extraction, we use a modified version of the method used by F. Huang and T. Chen for tracking of multiple faces. This method is based on statistical color modeling and the deformable template. Experiments based on the proposed framework showed interesting results in the recognition of isolated words.

Proceedings ArticleDOI
11 Aug 2008
TL;DR: A system to synthesize lip-sync speech animation given a novel utterance is presented that uses a nonlinear blend-shape method and derives key shapes using a novel automatic clustering algorithm.
Abstract: Facial animation is traditionally considered an important but tedious task for many applications. Recently the demand for lip-sync animation has been increasing, but there seem to be few fast and easy generation methods. In this talk, a system to synthesize lip-sync speech animation given a novel utterance is presented. Our system uses a nonlinear blend-shape method and derives key shapes using a novel automatic clustering algorithm. Finally, a Gaussian phoneme model is used to predict the proper motion dynamics for synthesizing a new speech animation.

Journal Article
TL;DR: Results of speech recognition using DTW and a pattern matching algorithm are improved, and methods for modifying speech prosodic characteristics are researched to make HCI more intelligent.
Abstract: Emotional speech processing technology plays an important role in the improvement of HCI. In this paper, numerous speech samples of the same speaker are recorded under three primary emotions. Statistical analysis is used to analyze the pitch, energy, duration and related prosodic parameters, based on which a PCA method is used to recognize emotion states, with a correct rate of 91.67%. Combined with the emotion recognition results, the results of speech recognition using DTW and a pattern matching algorithm are improved. Meanwhile, methods for modifying speech prosodic characteristics are researched to make HCI more intelligent.
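
Dynamic time warping (DTW), the template-matching technique used above, can be sketched in a few lines. The toy version below aligns 1-D contours; a real recognizer would compare frame-level feature vectors such as MFCCs.

```python
# Minimal sketch of classic dynamic time warping between two feature sequences.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW alignment cost between two sequences of feature frames."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.atleast_1d(a[i - 1] - b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# A time-stretched copy of a contour aligns more cheaply than an unrelated one.
ref = np.sin(np.linspace(0, 3, 40))
stretched = np.sin(np.linspace(0, 3, 60))
other = np.cos(np.linspace(0, 3, 60))
print(dtw_distance(ref, stretched), dtw_distance(ref, other))
```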


Book ChapterDOI
01 Jan 2008
TL;DR: This chapter describes results of research that tries to realistically connect personality and 3D characters, not only on an expressive level (for example, generating individualized expressions on a 3D face), but also with real-time video tracking, on a dialogue level and on a perceptive level.
Abstract: With the emergence of 3D graphics, we are now able to create very realistic 3D characters that can move and talk. Multimodal interaction with such characters is also possible, as various technologies have matured for speech and video analysis, natural language dialogues, and animation. However, the behavior expressed by these characters is far from believable in most systems. We feel that this problem arises due to their lack of individuality on various levels: perception, dialogue, and expression. In this chapter, we describe results of research that tries to realistically connect personality and 3D characters, not only on an expressive level (for example, generating individualized expressions on a 3D face), but also with real-time video tracking, on a dialogue level (generating responses that actually correspond to what a certain personality in a certain emotional state would say) and on a perceptive level (having a virtual character that uses user expression data to create corresponding behavior). The idea of linking personality with agent behavior has been discussed by Marsella et al. [33], regarding the influence of emotion on behavior in general, and by Johns et al. [21], regarding how personality and emotion can affect decision making. Traditionally, any text- or voice-driven speech animation system uses phonemes as the basic units of speech, and visemes as the basic units of animation. Though text-to-speech synthesizers and phoneme recognizers often use biphone-based techniques, the end user seldom has access to this information, except for dedicated systems. Most commercially and freely available software applications allow access to only time-stamped phoneme streams along with audio. Thus, in order to generate animation from this information, an extra level of processing, namely co-articulation, is required. This process takes care of the influence of the neighboring visemes for fluent speech production. This processing stage can be eliminated by using the syllable as the basic unit of speech rather than the phoneme. Overall, we do not intend to give a complete survey of ongoing research in behavior, emotion, and personality. Our main goal is to create believable conversational agents that can interact through many modalities. We thus concentrate on emotion extraction from a real user (Section 2.3), visyllable-based speech animation (Section 2.4), and dialogue systems and emotions (Section 2.5).