
Showing papers on "Viseme published in 2008"


Journal ArticleDOI
TL;DR: A novel real-time multimodal human-avatar interaction (RTM-HAI) framework with vision-based remote animation control (RAC) that integrates audio-visual analysis and synthesis modules to realize multichannel and runtime animations, visual TTS, and real-time viseme detection and rendering.
Abstract: This paper presents a novel real-time multimodal human-avatar interaction (RTM-HAI) framework with vision-based remote animation control (RAC). The framework is designed for both mobile and desktop avatar-based human-machine or human-human visual communications in real-world scenarios. Using 3-D components stored in the Java mobile 3-D (M3G) file format, the avatar models can be flexibly constructed and customized on the fly on any mobile device or system that supports the M3G standard. For the RAC head tracker, we propose a 2-D real-time face detection/tracking strategy built around an interactive loop, in which detection and tracking complement each other for efficient and reliable face localization that tolerates extreme user movement. With the face location robustly tracked, the RAC head tracker selects a main user and estimates the user's head rolling, tilting, yawing, scaling, horizontal, and vertical motion in order to generate avatar animation parameters. The animation parameters can be used either locally or remotely and can be transmitted through a socket over the network. In addition, the framework integrates audio-visual analysis and synthesis modules to realize multichannel and runtime animations, visual TTS, and real-time viseme detection and rendering. It is recognized as an effective design for future realistic industrial products such as humanoid kiosks and human-to-human mobile communication.
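
The abstract does not give a wire format, so purely as an illustration of how per-frame head animation parameters might be packaged and "transmitted through a socket over the network", here is a minimal Python sketch. The field names, JSON encoding, host and port are assumptions for the example, not the paper's protocol.

```python
# Minimal sketch (not the paper's implementation): package head-pose animation
# parameters and push them to a remote avatar renderer over a TCP socket.
import json
import socket
from dataclasses import dataclass, asdict

@dataclass
class AnimationParams:
    roll: float    # head rolling, degrees
    tilt: float    # head tilting (pitch), degrees
    yaw: float     # head yawing, degrees
    scale: float   # apparent head scale (proxy for depth)
    dx: float      # horizontal translation, normalized
    dy: float      # vertical translation, normalized

def send_params(params: AnimationParams, host: str = "127.0.0.1", port: int = 9000) -> None:
    """Serialize the parameters as one JSON line and send them to the renderer."""
    payload = (json.dumps(asdict(params)) + "\n").encode("utf-8")
    with socket.create_connection((host, port), timeout=1.0) as sock:
        sock.sendall(payload)

if __name__ == "__main__":
    # Requires a renderer listening on the port above; otherwise this just reports the failure.
    try:
        send_params(AnimationParams(roll=2.5, tilt=-1.0, yaw=10.0, scale=1.05, dx=0.02, dy=-0.01))
    except OSError as err:
        print("no renderer listening:", err)
```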

70 citations


Patent
19 Nov 2008
TL;DR: In this paper, a system for voice personalization of video content is described, which includes a composition of a background scene having a character, head model data representing an individualized three-dimensional (3D) head model of a user, audio data simulating the user's voice, and a viseme track containing instructions for causing the individualized 3D head model to lip sync the words contained in the audio data.
Abstract: Systems and methods are disclosed for performing voice personalization of video content. The personalized media content may include a composition of a background scene having a character, head model data representing an individualized three-dimensional (3D) head model of a user, audio data simulating the user's voice, and a viseme track containing instructions for causing the individualized 3D head model to lip sync the words contained in the audio data. The audio data simulating the user's voice can be generated using a voice transformation process. In certain examples, the audio data is based on a text input or selected by the user (e.g., via a telephone or computer) or a textual dialogue of a background character.

31 citations


Proceedings ArticleDOI
26 Aug 2008
TL;DR: A complete pipeline of efficient and low-cost techniques to construct a realistic 3D text-driven emotive audio-visual avatar from a single 2D frontal-view face image of any person on the fly is proposed.
Abstract: In this paper, we propose a complete pipeline of efficient and low-cost techniques to construct a realistic 3D text-driven emotive audio-visual avatar from a single 2D frontal-view face image of any person on the fly. This real-time conversion is achieved in three steps. First, a personalized 3D face model is built from the 2D face image using a fully automatic 3D face shape and texture reconstruction framework. Second, using standard MPEG-4 FAPs (Facial Animation Parameters), the face model is animated by the viseme and expression channels and is complemented by the visual prosody channel that controls head, eye and eyelid movements. Finally, the facial animation is combined and synchronized with the emotive synthetic speech generated by incorporating an emotion transformer into a Festival-MBROLA text-to-neutral-speech synthesizer.

27 citations


Book
01 Nov 2008
TL;DR: A pictorial guide to typical and atypical speech sounds, as discussed by the authors.
Abstract: Speech sounds: a pictorial guide to typical and atypical speech.

17 citations


Book ChapterDOI
01 Jan 2008
TL;DR: With the development of new trends in human-machine interfaces, animated feature films and video games, better avatars and virtual agents are required, and new automatic approaches are needed to synthesize realistic animation that captures and resembles the complex relationship between these communicative channels.
Abstract: With the development of new trends in human-machine interfaces, animated feature films and video games, better avatars and virtual agents are required that more accurately mimic how humans communicate and interact. Gestures and speech are jointly used to express intended messages. The tone and energy of the speech, facial expression, rigid head motion and hand motion combine in a non-trivial manner as they unfold in natural human interaction. Given that the use of large motion capture datasets is expensive and can only be applied in planned scenarios, new automatic approaches are required to synthesize realistic animation that captures and resembles the complex relationship between these communicative channels. One useful and practical approach is the use of acoustic features to generate gestures, exploiting the link between gestures and speech. Since the shape of the lips is determined by the underlying articulation, acoustic features have been used to generate visual visemes that match the spoken sentences [4, 5, 12, 17]. Likewise, acoustic features have been used to synthesize facial expressions [11, 30], exploiting the fact that the same muscles used for articulation also affect the shape of the face [44, 46]. One important gesture that has received less attention than other aspects of facial animation is rigid head motion. Head motion is important not only to acknowledge active listening or replace verbal information (e.g. a "nod"), but also for many aspects of human communication.

17 citations


Dissertation
01 Nov 2008
TL;DR: A large section of this thesis is dedicated to analysing the performance of the new visual speech unit model compared with that attained for standard (MPEG-4) viseme models.
Abstract: This dissertation presents a new learning-based representation that is referred to as a Visual Speech Unit for visual speech recognition (VSR). The automated recognition of human speech using only features from the visual domain has become a significant research topic that plays an essential role in the development of many multimedia systems such as audio-visual speech recognition (AVSR), mobile phone applications, human-computer interaction (HCI) and sign language recognition. The inclusion of lip visual information is opportune since it can improve the overall accuracy of audio or hand recognition algorithms, especially when such systems are operated in environments characterized by a high level of acoustic noise. The main contribution of the work presented in this thesis is the development of a new learning-based representation referred to as the Visual Speech Unit (VSU). The main components of the developed visual speech recognition system are applied to: (a) segment the mouth region of interest, (b) extract the visual features from the real-time input video image and (c) identify the visual speech units. The major difficulty associated with VSR systems resides in the identification of the smallest elements contained in the image sequences that represent the lip movements in the visual domain. The Visual Speech Unit concept as proposed represents an extension of the standard viseme model that is currently applied for VSR. The VSU model augments the standard viseme approach by including in the new representation not only the data associated with the articulation of the visemes but also the transitory information between consecutive visemes. A large section of this thesis is dedicated to analysing the performance of the new visual speech unit model compared with that attained for standard (MPEG-4) viseme models. Two experimental results indicate that: 1. The developed VSR system achieved 80-90% correct recognition when applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only 62-72%. 2. When 15 words are identified using VSUs and visemes as the visual speech elements, the accuracy rate for word recognition based on VSUs is 7-12% higher than the accuracy rate based on visemes.

12 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This work extends and improves a recently introduced dynamic Bayesian network based audio-visual automatic speech recognition (AV-ASR) system to model the audio and visual streams as being composed of separate, yet related, sub-word units.
Abstract: This work extends and improves a recently introduced (Dec. 2007) dynamic Bayesian network (DBN) based audio-visual automatic speech recognition (AV-ASR) system. That system models the audio and visual components of speech as being composed of the same sub-word units when, in fact, this is not psycholinguistically true. We extend the system to model the audio and visual streams as being composed of separate, yet related, sub-word units. We also introduce a novel stream weighting structure incorporated into the model itself. In doing so, our system makes improvements in word error rate (WER) and overall recognition accuracy in a large vocabulary continuous speech recognition (LVCSR) task. The "best" performing proposed system attains a WER of 66.71%, whereas the "best" baseline system performs at a WER of 64.30%. The proposed system also improves accuracy to 45.95% from 39.40%.
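
For readers unfamiliar with audio-visual stream weighting, the sketch below shows the common exponent-weighted fusion of per-stream likelihoods. It is only background: the paper's contribution is a weighting structure incorporated into the DBN itself, which this toy function does not reproduce.

```python
# Minimal sketch of stream-weighted audio-visual likelihood fusion, a common
# device in AV-ASR decoders. The weight lam and the example values are assumptions.
import math

def fused_log_likelihood(logp_audio: float, logp_video: float, lam: float) -> float:
    """Combine per-stream log-likelihoods for one state with weight lam in [0, 1].

    Equivalent to P_audio**lam * P_video**(1 - lam) in the probability domain.
    """
    return lam * logp_audio + (1.0 - lam) * logp_video

# Example: weight the audio stream more heavily in clean acoustic conditions.
print(fused_log_likelihood(math.log(0.4), math.log(0.1), lam=0.8))
```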

12 citations




Proceedings ArticleDOI
20 Jun 2008
TL;DR: Experimental results show that, for randomly chosen speech, the synthetic visual speech is smooth and realistic.
Abstract: In order to realize realistic visual speech synthesis, a visual speech synthesis method based on Chinese dynamic visemes is proposed. Using the mouth feature parameters of Chinese static visemes, consonants and vowels are classified with a clustering algorithm. According to the characteristics of Chinese pronunciation, 40 basic dynamic visemes are obtained by combining consonant types and vowel types. With these dynamic visemes and the corresponding phonemes, a two-layer hidden Markov model (HMM) is built and trained. Experimental results show that, for randomly chosen speech, the synthetic visual speech is smooth and realistic.
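
As a rough illustration of the clustering step described above (grouping sounds by their static mouth feature parameters), here is a minimal k-means sketch; the feature values and the number of clusters are invented for the example and are not the paper's configuration.

```python
# Minimal sketch: cluster phonemes into viseme classes by their static mouth
# feature parameters. Features (width, height, protrusion) and k are assumptions.
import numpy as np
from sklearn.cluster import KMeans

phonemes = ["a", "o", "e", "i", "u", "b", "p", "m", "f", "d"]
mouth_features = np.array([
    [0.90, 0.80, 0.10], [0.55, 0.70, 0.60], [0.70, 0.45, 0.15],
    [0.85, 0.25, 0.05], [0.40, 0.35, 0.70], [0.30, 0.05, 0.20],
    [0.30, 0.05, 0.20], [0.30, 0.05, 0.20], [0.45, 0.15, 0.10],
    [0.60, 0.30, 0.10],
])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(mouth_features)
for phoneme, label in zip(phonemes, kmeans.labels_):
    print(f"{phoneme} -> viseme class {label}")
```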

10 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This paper presents some minimal signal modification techniques for reducing audible artifacts in speech synthesis due to discontinuities in pitch, energy, and formant trajectories at the joining point of the units.
Abstract: Our earlier work [1] on speech synthesis has shown that syllables can produce reasonably natural quality speech. Nevertheless, audible artifacts are present due to discontinuities in pitch, energy, and formant trajectories at the joining point of the units. In this paper, we present some minimal signal modification techniques for reducing these artifacts.
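
As an illustration only, the sketch below applies one kind of minimal modification: linearly bridging an F0 discontinuity at a unit boundary. It is a toy under assumed inputs, not the techniques proposed in the paper.

```python
# Minimal sketch: smooth an F0 (pitch) discontinuity at the joining point of two
# concatenated units by linearly interpolating values in a small window around the join.
import numpy as np

def smooth_f0_at_join(f0: np.ndarray, join: int, half_window: int = 5) -> np.ndarray:
    """Linearly bridge the F0 contour across a unit boundary at frame index `join`."""
    out = f0.copy()
    lo, hi = max(0, join - half_window), min(len(f0) - 1, join + half_window)
    out[lo:hi + 1] = np.linspace(f0[lo], f0[hi], hi - lo + 1)
    return out

# Example: an abrupt 20 Hz jump at frame 50 is ramped over +/- 5 frames.
contour = np.concatenate([np.full(50, 120.0), np.full(50, 140.0)])
smoothed = smooth_f0_at_join(contour, join=50)
print(contour[45:55])
print(np.round(smoothed[45:55], 1))
```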

10 citations


Journal ArticleDOI
TL;DR: In this article, a new automatic approach for lip POI localization and feature extraction on a speaker's face based on mouth color information and a geometrical model of the lips is presented.
Abstract: An automatic lip-reading system is among the assistive technologies for hearing-impaired or elderly people. We can imagine, for example, a dependent person commanding a machine with an easy lip movement or by simply pronouncing a viseme (visual phoneme). A lip-reading system is decomposed into three subsystems: a lip localization subsystem, then a feature extraction subsystem, followed by a classification system that maps feature vectors to visemes. The major difficulty in a lip-reading system is the extraction of the visual speech descriptors. To accomplish this task, it is necessary to carry out automatic localization and tracking of the labial gestures. We present, in this paper, a new automatic approach for lip POI localization and feature extraction on a speaker's face based on mouth color information and a geometrical model of the lips. The extracted visual information is then classified in order to recognize the uttered viseme. We have developed our Automatic Lip Feature Extraction prototype (ALiFE). The ALiFE prototype is evaluated for multiple speakers under natural conditions. Experiments include a group of French visemes for different speakers. Results reveal that our system recognizes 94.64% of the tested French visemes.
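
As a hedged illustration of mouth-color-based lip localization, the sketch below thresholds a pseudo-hue map (R/(R+G)), a cue often used to separate lips from skin; the transform and the fixed threshold are assumptions and not necessarily what ALiFE uses.

```python
# Minimal sketch of color-based lip highlighting. The pseudo-hue R/(R+G) tends
# to be higher on lips than on surrounding skin; values here are illustrative.
import numpy as np

def lip_mask(rgb: np.ndarray, threshold: float = 0.56) -> np.ndarray:
    """Return a boolean mask of likely lip pixels from an HxWx3 float image in [0, 1]."""
    r, g = rgb[..., 0], rgb[..., 1]
    pseudo_hue = r / (r + g + 1e-6)
    return pseudo_hue > threshold

# Example on a random image; a real system would then fit a geometric lip model
# to the largest connected region of the mask.
image = np.random.rand(120, 160, 3)
print(lip_mask(image).sum(), "candidate lip pixels")
```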

Journal ArticleDOI
10 Jan 2008
TL;DR: This work describes an approach to pose-based interpolation that deals with coarticulation using a constraint-based technique and demonstrates it using a Mexican-Spanish talking head, which can vary its speed of talking and produce coarticulation effects.
Abstract: A common approach to produce visual speech is to interpolate the parameters describing a sequence of mouth shapes, known as visemes, where a viseme corresponds to a phoneme in an utterance. The interpolation process must consider the issue of context-dependent shape, or coarticulation, in order to produce realistic-looking speech. We describe an approach to such pose-based interpolation that deals with coarticulation using a constraint-based technique. This is demonstrated using a Mexican-Spanish talking head, which can vary its speed of talking and produce coarticulation effects.
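
To make pose-based interpolation concrete, here is a minimal sketch that eases between two viseme parameter vectors; the paper's constraint-based handling of coarticulation goes well beyond this.

```python
# Minimal sketch of interpolating between viseme parameter vectors (e.g. mouth
# width, openness, protrusion). Cosine ease in/out is used purely for illustration.
import numpy as np

def interpolate_visemes(v0: np.ndarray, v1: np.ndarray, n_frames: int) -> np.ndarray:
    """Return n_frames parameter vectors easing from viseme v0 to viseme v1."""
    t = np.linspace(0.0, 1.0, n_frames)
    ease = 0.5 - 0.5 * np.cos(np.pi * t)          # smooth start and end
    return (1.0 - ease)[:, None] * v0 + ease[:, None] * v1

# Example: from an /m/-like closed mouth to an /a/-like open mouth over 10 frames.
closed = np.array([0.30, 0.05, 0.20])
open_a = np.array([0.90, 0.80, 0.10])
print(np.round(interpolate_visemes(closed, open_a, 10), 2))
```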

Proceedings ArticleDOI
05 Nov 2008
TL;DR: Visual speech synthesis experiments and subjective evaluation results show that mouth animations can be obtained which are not only realistic with clear and smooth mouth images, but also in good accordance with the acoustic pronunciation and intensity of the input speech.
Abstract: This paper presents a novel speech-driven, accurate, realistic visual speech synthesis approach. Firstly, an audio-visual instance database is built for different viseme context combinations, i.e. diviseme units, using 100 audio-visual speech sentences of a female speaker. Then a diviseme instance selection algorithm is introduced to choose the optimal diviseme instances for the viseme contexts in the input speech, considering the concatenation smoothness of the image sequences, the matching of the mouth movements to the acoustic pronunciation process, as well as the intensity of the input speech. Finally, mouth image sequences of the corresponding viseme segments in the selected diviseme instances are time-warped and blended to construct the mouth images of the final animation. Visual speech synthesis experiments and subjective evaluation results show that mouth animations can be obtained which are not only realistic, with clear and smooth mouth images, but also in good accordance with the acoustic pronunciation and intensity of the input speech.

Journal ArticleDOI
TL;DR: Objective and subjective evaluations on the synthesized mouth animations prove that the multimodal diviseme instance selection algorithm proposed in this paper outperforms the triphone unit selection algorithm in Video Rewrite.
Abstract: This paper presents a novel audio-visual diviseme (viseme pair) instance selection and concatenation method for speech-driven photo-realistic mouth animation. Firstly, an audio-visual diviseme database is built consisting of the audio feature sequences, intensity sequences and visual feature sequences of the instances. In the Viterbi-based diviseme instance selection, we set the accumulative cost as the weighted sum of three items: 1) the logarithm of the concatenation smoothness of the synthesized mouth trajectory; 2) the logarithm of the pronunciation distance; 3) the logarithm of the audio intensity distance between the candidate diviseme instance and the target diviseme segment in the incoming speech. The selected diviseme instances are time-warped and blended to construct the mouth animation. Objective and subjective evaluations of the synthesized mouth animations prove that the multimodal diviseme instance selection algorithm proposed in this paper outperforms the triphone unit selection algorithm in Video Rewrite. Clear, accurate, smooth mouth animations can be obtained that match well with the pronunciation and intensity changes in the incoming speech. Moreover, with the logarithm function in the accumulative cost, it is easy to set the weights to obtain optimal mouth animations.
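
The accumulative cost described above fits naturally into standard Viterbi dynamic programming over candidate instances. The sketch below shows that structure with placeholder cost functions and weights; the actual features, distances and weights used in the paper are not reproduced here.

```python
# Minimal sketch of Viterbi-style unit (diviseme instance) selection with an
# accumulative cost built from weighted log-distances. Cost functions, weights
# and the toy example data are assumptions for illustration.
import math

def select_instances(segments, candidates, concat_cost, target_cost, w_concat=1.0, w_target=1.0):
    """Pick one instance per target diviseme segment with minimum accumulative cost.

    segments: list of target diviseme segments from the incoming speech.
    candidates: candidates[i] is a list of hashable instance ids for segment i.
    concat_cost(prev, cur): smoothness distance between consecutive instances.
    target_cost(cand, seg): pronunciation/intensity distance to the target segment.
    """
    n = len(segments)
    best = [{} for _ in range(n)]                 # best[i][cand] = (cost, backpointer)
    for c in candidates[0]:
        best[0][c] = (w_target * math.log(1.0 + target_cost(c, segments[0])), None)
    for i in range(1, n):
        for c in candidates[i]:
            tc = w_target * math.log(1.0 + target_cost(c, segments[i]))
            prev_costs = {p: cost + w_concat * math.log(1.0 + concat_cost(p, c))
                          for p, (cost, _) in best[i - 1].items()}
            p_best = min(prev_costs, key=prev_costs.get)
            best[i][c] = (prev_costs[p_best] + tc, p_best)
    last = min(best[-1], key=lambda c: best[-1][c][0])
    path = [last]
    for i in range(n - 1, 0, -1):                 # trace back the optimal sequence
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Toy example with two target segments and string-labelled candidate instances.
chosen = select_instances(
    segments=["d-i", "i-v"],
    candidates=[["d-i_1", "d-i_2"], ["i-v_1", "i-v_2"]],
    concat_cost=lambda p, c: 0.2 if p.endswith("_1") == c.endswith("_1") else 1.5,
    target_cost=lambda cand, seg: 0.1 if cand.endswith("_1") else 0.6,
)
print(chosen)
```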

Dissertation
01 Jan 2008
TL;DR: This work shows that the effect of temporal variation due to coarticulation is statistically significant and should be taken into account in modelling visual speech synthesis, and provides the foundation for further research towards achieving perceptually realistic animation of a talking head and the understanding of visual dynamics of shape and texture during speech.
Abstract: Face to face dialogue is the most natural mode of communication between humans. The combination of human visual perception of expression and perception of changes in intonation provides semantic information that communicates ideas, feelings and concepts. The realistic modelling of speech movements, through automatic facial animation, and maintaining audio-visual coherence is still a challenge in both the computer graphics and film industries. A common approach to producing visual speech is to interpolate parameters that describe mouth variation in sequence, known as visemes. A viseme corresponds to a phoneme in an utterance. Most talking head systems use sets of static visemes, represented by a single mouth shape image or 3D model. However, discretising visemes in this way does not account for context-dependent dynamic information, coarticulation. This thesis presents several visual analysis and dynamic modelling techniques for visual phones. This spans several areas of work, from capture and representation through to analysis and synthesis of speech movements and coarticulation. A novel method is reported for the automatic extraction of inner-lip contour edges from sequences of mouth images in speech. The proposed detection technique is a key-frame exemplar-based method that is not dependent on any prior frame information for initialisation, allowing for reliable and accurate inner-lip localisation for the large frame-to-frame changes in lip shape inherent in 25 Hz video of visual speech. Visual analysis of phonemes in continuous speech is performed, which involves the investigation of mouth representations as well as a comparative analysis between static and dynamic representations of visemes. The analysis shows the need to analyse and model the underlying dynamics of visemes due to coarticulation. Finally, visual analysis of lip coarticulation in Vowel-Consonant-Vowel (VCV) utterances is presented. Based on ensemble statistics, a novel approach to the analysis and modelling of temporal dynamics is presented. Results show that the temporal influence of coarticulation is significant both in lip shape variation and in the timings of lip movement during coarticulation. This work shows that the effect of temporal variation due to coarticulation is statistically significant and should be taken into account in modelling visual speech synthesis. The work in this thesis provides the foundation for further research towards achieving perceptually realistic animation of a talking head and the understanding of visual dynamics of shape and texture during speech.

Patent
04 Apr 2008
TL;DR: In this paper, an image signal analysis means extracts vowel information of the voiced speech from the input lip image, and the ratio of the lip opening size at vowel pronunciation to a predetermined reference size is extracted as a pitch ratio.
Abstract: PROBLEM TO BE SOLVED: To include the intonation intended by a speaker in the synthetic speech when synthetic voiced speech is generated from non-voiced speech and a lip image. SOLUTION: In a speech synthesis device, the non-voiced speech of the speaker and a photographed lip image are input synchronously to generate synthetic voiced speech. An image signal analysis means extracts vowel information of the voiced speech from the input lip image, and the ratio of the lip opening size at vowel pronunciation to a predetermined reference size is extracted as a pitch ratio. A speech signal analysis means extracts consonant information from the input non-voiced speech together with a sound model of the non-voiced vowel corresponding to the vowel extracted by the image signal analysis means; it extracts text information using a built-in dictionary, which stores phoneme sequences and words in association with each other, and a language model for estimating the word sequence; and it extracts the duration of the whole pronunciation from the power variation of the input non-voiced speech. A speech synthesis means synthesizes voiced speech with intonation added, based on the information extracted by both analysis means. COPYRIGHT: (C)2010, JPO&INPIT

Proceedings ArticleDOI
01 Dec 2008
TL;DR: A method of lip shape coordination with speech in the facial expression robot system "H&F robot-III", which can model the talking robot's lip shape using a visual speech system that includes three modules: speech recognition, lip shape recognition and a lip pose actuator.
Abstract: This paper proposes a method of lip shape coordination with speech in the facial expression robot system "H&F robot-III". The proposed method can model the talking robot's lip shape using a visual speech system, which includes three modules: speech recognition, lip shape recognition and a lip pose actuator. In the lip shape recognition module, a viseme representation method is proposed for synthesising human visual speech. To analyze the robot's lip shape, a lip shape model is developed based on anatomy and the facial action coding system (FACS). When the robot speaks, lip shape coordination with speech can be realized through a basic lip shape or a combination of basic lip shapes. In the "H&F robot-III" system, the lip shape is realized through a slide and guide-slot mechanism, which implements the two-way movement of the lip muscles. Finally, experimental results on lip coordination with speech are shown: when speaking the same word, the lip shape of the robot is similar to that of a human.

Proceedings ArticleDOI
01 Sep 2008
TL;DR: A new learning-based approach to speech synthesis that achieves mouth movements with rich and expressive articulation for novel audio input by using a Locally Linear Embedding representation of feature points on 3D scans, and a system of viseme categories, which are used to define triphone substitution rules and a cost function.
Abstract: This paper presents a new learning-based approach to speech synthesis that achieves mouth movements with rich and expressive articulation for novel audio input. From a database of 3D triphone motions, our algorithm picks the optimal sequences based on a triphone similarity measure, and concatenates them to create new utterances that include coarticulation effects. By using a Locally Linear Embedding (LLE) representation of feature points on 3D scans, we propose a model that defines a measure of similarity among visemes, and a system of viseme categories, which are used to define triphone substitution rules and a cost function. Moreover, we compute deformation vectors for several facial expressions, allowing expression variation to be smoothly added to the speech animation. In an entirely data-driven approach, our automated procedure for defining viseme categories closely reproduces the groups of related visemes that are defined in the phonetics literature. The structure of our selection method is intrinsic to the nature of speech and generates a substitution table that can be reused as-is in different speech animation systems.
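
As a small illustration of the kind of LLE representation mentioned above, the sketch below embeds mouth-shape feature vectors with scikit-learn's LocallyLinearEmbedding and scores viseme similarity as the distance between class centroids in the embedded space. The random data and the similarity definition are assumptions for the example.

```python
# Minimal sketch: embed mouth-shape features with LLE and compare visemes in the
# embedded space. Data is random; the paper uses feature points on registered 3D scans.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
mouth_shapes = rng.normal(size=(200, 30))    # 200 frames x 30 stacked feature-point coords

lle = LocallyLinearEmbedding(n_neighbors=10, n_components=3, random_state=0)
embedded = lle.fit_transform(mouth_shapes)

def viseme_similarity(frames_a, frames_b) -> float:
    """Similarity of two visemes, each given as a list of indices of example frames."""
    centroid_a = embedded[frames_a].mean(axis=0)
    centroid_b = embedded[frames_b].mean(axis=0)
    return float(-np.linalg.norm(centroid_a - centroid_b))   # higher = more similar

print(viseme_similarity(list(range(0, 20)), list(range(20, 40))))
```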

Proceedings ArticleDOI
15 Aug 2008
TL;DR: Previous work on digital speech signal processing is discussed, along with how to apply existing speech processing techniques to the proposed algorithms for speech-driven lip motion animation for Japanese-style anime.
Abstract: In the production of Japanese-style anime, the movement of the lips with speech is usually reduced to the more convenient 'open' and 'close' of the mouth because of the expensive production cost. In this paper we provide an approach to speech-driven lip animation for Japanese-style anime. First we discuss previous work on digital speech signal processing and show how to apply existing speech processing techniques to our work. Then we propose our algorithms for speech-driven lip motion animation. Finally, experimental results are provided.

Patent
16 Jul 2008
TL;DR: In this paper, the authors propose a speech recognizer with low processing requirements and high recognition performance, especially for speech recognition of a tone language, in which tone information indicating the tone of a selected label is extracted from the input speech and the label is corrected on the basis of the extracted tone information and the content of a pattern list.
Abstract: PROBLEM TO BE SOLVED: To provide a speech recognizer that requires little processing and achieves high recognition performance, especially for speech recognition of a tone language. SOLUTION: The speech recognizer extracts a fundamental frequency from the input speech and acoustically analyses the input speech; selects one of plural speech recognition results obtained by speech recognition and outputs a label string indicating the selected speech recognition result; selects at least one label in the output label string on the basis of a pattern list held in advance; and extracts tone information indicating the tone of the selected label on the basis of the fundamental frequency extracted from the input speech, and corrects the selected label on the basis of the extracted tone information and the content of the pattern list.

Proceedings ArticleDOI
22 Sep 2008
TL;DR: A comparative study between spontaneous speech and read Mandarin speech in the context of automatic speech recognition and the technique of Multispace distribution (MSD) to model partially continuous F0 contours is presented.
Abstract: In this paper, we present a comparative study between spontaneous speech and read Mandarin speech in the context of automatic speech recognition. We focus on analysis and modeling of prosodic features, based on a unique speech corpus that contains similar amounts of read and spontaneous speech data from the same group of speakers. Statistical analysis is carried out on tone contours and duration of syllable and subsyllable units. Speech recognition experiments are performed to evaluate the effectiveness of different approaches to incorporate prosodic features into acoustic modeling. A key problem being addressed is how to deal with the unvoiced frames where F0 values are unavailable. We apply the technique of Multispace distribution (MSD) to model partially continuous F0 contours. For spontaneous speech, the tonal-syllable error rate is reduced from the MFCC baseline of 64.8% to 59.4% with the MSD based prosody model. For read speech, the performance improves from 46.0% to 36.4%.


Journal IssueDOI
TL;DR: This work proposes a set of algorithms to efficiently make speech animation for 3D cartoon characters based on blendshapes, a linear interpolation technique, which is widely used in facial animation practice.
Abstract: We propose a set of algorithms to efficiently make speech animation for 3D cartoon characters. Our prototype system is based on blendshapes, a linear interpolation technique which is widely used in facial animation practice. In our system, a few base target shapes of the character, a prerecorded voice, and its transcription are required as input. We describe a simple technique that amplifies the target shapes from these few inputs using a generic database of viseme mouth shapes. We also introduce additional lip-synch editing parameters that allow designers to quickly tune the lip movements. Based on these, we implement our prototype system as a Maya plug-in. The demonstration movies created with this system illustrate well the practicality of our approach. Copyright © 2008 John Wiley & Sons, Ltd.
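
Because blendshapes are a linear interpolation technique, the core computation is compact: each frame is the neutral shape plus a weighted sum of target-shape offsets. The shapes and weights below are toy values, not the paper's data.

```python
# Minimal sketch of blendshape-based mouth animation: each output frame is a
# linear combination of base target shapes driven by per-frame viseme weights.
import numpy as np

# Base target shapes: neutral plus three mouth poses, each as N vertices x 3 (toy data).
n_vertices = 4
neutral = np.zeros((n_vertices, 3))
targets = {
    "open":   np.array([[0, -1, 0], [0, -1, 0], [0, 0, 0], [0, 0, 0]], float),
    "wide":   np.array([[-1, 0, 0], [1, 0, 0], [-1, 0, 0], [1, 0, 0]], float),
    "pucker": np.array([[0.5, 0, 1], [-0.5, 0, 1], [0.5, 0, 1], [-0.5, 0, 1]], float),
}

def blend(weights: dict) -> np.ndarray:
    """Linear blendshape: neutral + sum_i w_i * (target_i - neutral)."""
    frame = neutral.copy()
    for name, w in weights.items():
        frame += w * (targets[name] - neutral)
    return frame

# Example frame for an /o/-like viseme: partly open, partly puckered.
print(blend({"open": 0.6, "pucker": 0.4}))
```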

01 Jun 2008
TL;DR: In this article, the distances between vowels based on the produced lip shapes were analyzed in an experiment of distant interaction between a deaf participant using CS and a normal-hearing participant, and the results showed that the sets of vowels with similar lip shapes (so-called visemes) are the same groups as for a normal-hearing cuer.
Abstract: While studies on Cued Speech (CS) perception are numerous, those referring to its production by a normal-hearing cuer are fewer, and they are almost non-existent in the case of a deaf cuer. The latter is the topic of this contribution. In an experiment of distant interaction between a deaf participant using CS and a normal-hearing participant, we analyze the distances between vowels based on the produced lip shapes. The results show, first, that the sets of vowels with similar lip shapes (so-called visemes) are the same groups as for a normal-hearing cuer. Second, the lip shapes are coherent with the CS system. Finally, the analysis of the temporal coordination between the CS hand gestures and the lip gestures reveals the same scheme as first observed by Attina [6] with normal-hearing cuers and confirmed by Aboutabit et al. [3] in a more complex speech context, i.e. the CS hand gesture leading the lip shapes.

01 Sep 2008
TL;DR: A parameterisation of lip movements is described which maintains the dynamic structure inherent in the task of producing speech sounds and is believed to be appropriate to various areas of speech modeling, in particular the synthesis of speech lip movements.
Abstract: In this paper we describe a parameterisation of lip movements which maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. This data is mapped into a space which maintains the relationships between samples and their temporal derivatives. By incorporating dynamic information within the parameterisation of lip movements we can model the cyclical structure, as well as the causal nature of speech movements as described by an underlying visual speech manifold. It is believed that such a structure will be appropriate to various areas of speech modeling, in particular the synthesis of speech lip movements.
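
One simple way to keep temporal-derivative information in such a parameterisation is to stack each sample with its frame-to-frame derivative before reducing dimensionality. The sketch below does this with PCA on a toy trajectory; it is only an approximation of the described approach, which works from 3D stereo reconstructions of TIMIT sentences.

```python
# Minimal sketch of a dynamic parameterisation: augment each lip-shape sample
# with its temporal derivative, then reduce dimensionality. Toy data and PCA
# are assumptions, not the paper's pipeline.
import numpy as np
from sklearn.decomposition import PCA

# Toy lip trajectory: T frames x D shape parameters.
T, D = 200, 12
t = np.linspace(0, 4 * np.pi, T)[:, None]
shapes = np.sin(t + np.linspace(0, np.pi, D)[None, :])

velocities = np.gradient(shapes, axis=0)             # temporal derivatives
dynamic_features = np.hstack([shapes, velocities])   # position + velocity per frame

embedding = PCA(n_components=3).fit_transform(dynamic_features)
print(embedding.shape)   # (200, 3): a low-dimensional dynamic parameterisation
```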

Proceedings ArticleDOI
07 Apr 2008
TL;DR: This paper presents a hybrid framework for lip reading which is based on both audio and visual speech parameters extracted from a video stream of isolated spoken words, and shows interesting results in the recognition of isolated words.
Abstract: It is well known that the speech production and perception process is inherently bimodal, consisting of audio and visual components. Recently there has been increased interest in using the visual modality in combination with the acoustic modality for improved speech processing. This field of study has gained the title of audio-visual speech processing. Lip movement recognition, also known as lip reading, is a communication skill which involves the interpretation of lip movements in order to estimate some important parameters of the lips that include, but are not limited to, size, shape and orientation. In this paper, we present a hybrid framework for lip reading which is based on both audio and visual speech parameters extracted from a video stream of isolated spoken words. The proposed algorithm is self-tuned in the sense that it starts with an estimation of speech parameters based on visual lip features, and then the coefficients of the algorithm are fine-tuned based on the extracted audio parameters. In the audio speech processing part, the extracted audio features are used to generate a vector containing information about the speech phonemes. This information is later used to enhance the recognition and matching process. For lip feature extraction, we use a modified version of the method used by F. Huang and T. Chen for tracking of multiple faces. This method is based on statistical color modeling and the deformable template. Experiments based on the proposed framework showed interesting results in the recognition of isolated words.

Proceedings ArticleDOI
11 Aug 2008
TL;DR: A system to synthesize lip-sync speech animation given a novel utterance is presented that uses a nonlinear blend-shape method and derives key shapes using a novel automatic clustering algorithm.
Abstract: Facial animation is traditionally considered an important but tedious task for many applications. Recently the demand for lip-sync animation has been increasing, but there seem to be few fast and easy generation methods. In this talk, a system to synthesize lip-sync speech animation given a novel utterance is presented. Our system uses a nonlinear blend-shape method and derives key shapes using a novel automatic clustering algorithm. Finally, a Gaussian phoneme model is used to predict the proper motion dynamics for synthesizing a new speech animation.

Journal Article
TL;DR: Results of speech recognition using DTW and a pattern matching algorithm are improved, and methods for modifying speech prosodic characteristics are researched to make HCI more intelligent.
Abstract: Emotional speech processing technology plays an important role in the improvement of HCI. In this paper, numerous speech samples of the same speaker are recorded under three primary emotions. Statistical analysis is used to analyze the pitch, energy, duration and related prosodic parameters, based on which a PCA method is used to recognize emotion states, with a correct rate of 91.67%. Combined with the emotion recognition results, the results of speech recognition using DTW and a pattern matching algorithm are improved. Meanwhile, methods for modifying speech prosodic characteristics are researched to make HCI more intelligent.
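
Dynamic time warping (DTW), the template-matching technique used above, can be sketched in a few lines. The toy version below aligns 1-D contours; a real recognizer would compare frame-level feature vectors such as MFCCs.

```python
# Minimal sketch of classic dynamic time warping between two feature sequences.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW alignment cost between two sequences of feature frames."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.atleast_1d(a[i - 1] - b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# A time-stretched copy of a contour aligns more cheaply than an unrelated one.
ref = np.sin(np.linspace(0, 3, 40))
stretched = np.sin(np.linspace(0, 3, 60))
other = np.cos(np.linspace(0, 3, 60))
print(dtw_distance(ref, stretched), dtw_distance(ref, other))
```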


Book ChapterDOI
01 Jan 2008
TL;DR: This chapter describes results of research that tries to realistically connect personality and 3D characters, not only on an expressive level (for example, generating individualized expressions on a 3D face), but also with real-time video tracking, on a dialogue level and on a perceptive level.
Abstract: With the emergence of 3D graphics, we are now able to create very realistic 3D characters that can move and talk. Multimodal interaction with such characters is also possible, as various technologies have matured for speech and video analysis, natural language dialogues, and animation. However, the behavior expressed by these characters is far from believable in most systems. We feel that this problem arises due to their lack of individuality on various levels: perception, dialogue, and expression. In this chapter, we describe results of research that tries to realistically connect personality and 3D characters, not only on an expressive level (for example, generating individualized expressions on a 3D face), but also with real-time video tracking, on a dialogue level (generating responses that actually correspond to what a certain personality in a certain emotional state would say) and on a perceptive level (having a virtual character that uses user expression data to create corresponding behavior). The idea of linking personality with agent behavior has been discussed by Marsella et al. [33], regarding the influence of emotion on behavior in general, and by Johns et al. [21], regarding how personality and emotion can affect decision making. Traditionally, any text- or voice-driven speech animation system uses phonemes as the basic units of speech, and visemes as the basic units of animation. Though text-to-speech synthesizers and phoneme recognizers often use biphone-based techniques, the end user seldom has access to this information, except for dedicated systems. Most commercially and freely available software applications allow access to only time-stamped phoneme streams along with audio. Thus, in order to generate animation from this information, an extra level of processing, namely co-articulation, is required. This process takes care of the influence of the neighboring visemes for fluent speech production. This processing stage can be eliminated by using the syllable as the basic unit of speech rather than the phoneme. Overall, we do not intend to give a complete survey of ongoing research in behavior, emotion, and personality. Our main goal is to create believable conversational agents that can interact through many modalities. We thus concentrate on emotion extraction from a real user (Section 2.3), visyllable-based speech animation (Section 2.4), and dialogue systems and emotions (Section 2.5).