
Showing papers on "Viseme published in 2002"


Journal ArticleDOI
TL;DR: This paper examines the suitability of support vector machines for visual speech recognition, modeling each word as a temporal sequence of visemes corresponding to the different phones realized and integrating the viseme classifiers as nodes in a Viterbi lattice.
Abstract: Visual speech recognition is an emerging research field. In this paper, we examine the suitability of support vector machines for visual speech recognition. Each word is modeled as a temporal sequence of visemes corresponding to the different phones realized. One support vector machine is trained to recognize each viseme and its output is converted to a posterior probability through a sigmoidal mapping. To model the temporal character of speech, the support vector machines are integrated as nodes into a Viterbi lattice. We test the performance of the proposed approach on a small visual speech recognition task, namely the recognition of the first four digits in English. The word recognition rate obtained is at the level of the previous best reported rates.
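To make the sigmoidal mapping step concrete, here is a minimal Python sketch of converting raw SVM decision scores to posterior-like probabilities (Platt-style scaling); the parameter values and frame scores are illustrative, not taken from the paper:

    import numpy as np

    def svm_score_to_posterior(score, a=-1.0, b=0.0):
        """Map a raw SVM decision score to a posterior-like probability via a
        sigmoid (Platt scaling); a and b would normally be fitted on held-out data."""
        return 1.0 / (1.0 + np.exp(a * score + b))

    # Example: posteriors for one viseme SVM over a short sequence of frames.
    frame_scores = np.array([-1.2, 0.3, 2.1, 0.8])   # raw decision_function outputs
    print(svm_score_to_posterior(frame_scores))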

47 citations


Journal ArticleDOI
TL;DR: The strategy followed by the paper, which focuses on speech, is a kind of bootstrap procedure that first learns 3D shape statistics from a talking face tracked with a relatively small number of markers, rather than simulating facial anatomy.
Abstract: Realistic face animation is especially hard as we are all experts in the perception and interpretation of face dynamics. One approach is to simulate facial anatomy. Alternatively, animation can be based on first observing the visible 3D dynamics, extracting the basic modes, and putting these together according to the required performance. This is the strategy followed by the paper, which focuses on speech. The approach follows a kind of bootstrap procedure. First, 3D shape statistics are learned from a talking face with a relatively small number of markers. A 3D reconstruction is produced at temporal intervals of 1/25 seconds. A topological mask of the lower half of the face is fitted to the motion. Principal component analysis (PCA) of the mask shapes reduces the dimension of the mask shape space. The result is twofold. On the one hand, the face can be animated; in our case it can be made to speak new sentences. On the other hand, face dynamics can be tracked in 3D without markers for performance capture. Copyright © 2002 John Wiley & Sons, Ltd.
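As an illustration of the dimensionality-reduction step, the following sketch runs PCA over flattened mask shapes; the array sizes and component count are placeholders, not the paper's configuration:

    import numpy as np
    from sklearn.decomposition import PCA

    # Placeholder data: 500 frames of a lower-face mask with 300 vertices (x, y, z each).
    rng = np.random.default_rng(0)
    masks = rng.normal(size=(500, 300 * 3))

    # Reduce the mask-shape space to a handful of modes, then rebuild masks from them.
    pca = PCA(n_components=10)
    coeffs = pca.fit_transform(masks)              # per-frame coordinates in mode space
    reconstructed = pca.inverse_transform(coeffs)  # masks approximated from the modes
    print(coeffs.shape, pca.explained_variance_ratio_[:3])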

44 citations


Book ChapterDOI
18 Aug 2002
TL;DR: Two approaches to using HMMs (hidden Markov models) to convert audio signals to a sequence of visemes are compared, and it is found that the error rates can be reduced to 20.5% and 13.9%, respectively, when similar viseme classes are merged.
Abstract: We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shape. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. They can also be used to help people with impaired hearing. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is utilized to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we have found that the error rates can be reduced to 20.5% and 13.9%, respectively.
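The second approach reduces to a many-to-one lookup from phonemes to visemes followed by merging repeats; a hedged sketch with an invented mapping (the paper's actual viseme classes differ):

    # Illustrative phoneme-to-viseme table; not the classes used in the paper.
    PHONEME_TO_VISEME = {
        "p": "bilabial", "b": "bilabial", "m": "bilabial",
        "f": "labiodental", "v": "labiodental",
        "iy": "spread", "ih": "spread",
        "ao": "rounded", "uw": "rounded",
    }

    def phonemes_to_visemes(phoneme_seq):
        """Convert a recognized phoneme sequence to a viseme sequence,
        merging consecutive identical visemes."""
        visemes = []
        for ph in phoneme_seq:
            v = PHONEME_TO_VISEME.get(ph, "neutral")
            if not visemes or visemes[-1] != v:
                visemes.append(v)
        return visemes

    print(phonemes_to_visemes(["p", "ao", "b", "m", "iy"]))   # ['bilabial', 'rounded', 'bilabial', 'spread']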

42 citations


Patent
21 Oct 2002
TL;DR: In this article, a speech recognition unit (41B) selects the acoustic model set whose associated recording distance is nearest to the distance supplied from the distance calculator (47) and performs speech recognition by using that acoustic model set.
Abstract: A speech recognition apparatus and a speech recognition method capable of improving speech recognition accuracy. A distance calculator (47) calculates the distance between a speaking user and a microphone (21) and supplies the distance to a speech recognition unit (41B). The speech recognition unit (41B) contains acoustic model sets generated from speech data recorded at a plurality of different distances. The speech recognition unit (41B) selects the acoustic model set whose recording distance is nearest to the distance supplied from the distance calculator (47) and performs speech recognition by using that acoustic model set.
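The selection step described here amounts to a nearest-neighbour lookup over the recording distances of the stored model sets; a minimal sketch (distances and model names are invented):

    # Acoustic model sets keyed by the recording distance (metres) of their training speech.
    ACOUSTIC_MODEL_SETS = {0.3: "close_talk_models", 1.0: "mid_range_models", 3.0: "far_field_models"}

    def select_model_set(measured_distance):
        """Pick the model set whose training distance is closest to the measured one."""
        best = min(ACOUSTIC_MODEL_SETS, key=lambda d: abs(d - measured_distance))
        return ACOUSTIC_MODEL_SETS[best]

    print(select_model_set(1.4))   # -> "mid_range_models"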

38 citations


Proceedings Article
01 Jan 2002
TL;DR: Several extensions to the original coarticulation algorithm of Cohen and Massaro are implemented, including an optimization to improve performance as well as special treatment of the closure and release phases of bilabial stops and other phonemes.
Abstract: We present a method for generating realistic speech-synchronized facial animations using a physics-based approach with support for coarticulation, i.e. the coloring of a speech segment by surrounding segments. We have implemented several extensions to the original coarticulation algorithm of Cohen and Massaro [Cohen93]. The enhancements include an optimization to improve performance as well as special treatment of the closure and release phases of bilabial stops and other phonemes. Furthermore, for phonemes that are shorter than the sampling intervals of the algorithm and might therefore be missed, additional key frames are created to ensure their impact on the animation.
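The Cohen-Massaro model blends per-segment articulatory targets with overlapping dominance functions; the sketch below shows that blend in a simplified form (exponential dominance, invented parameter values), not the authors' extended algorithm:

    import numpy as np

    def dominance(t, center, alpha=1.0, theta=4.0):
        """Simplified dominance function: exponential fall-off around a segment centre."""
        return alpha * np.exp(-theta * np.abs(t - center))

    def blended_parameter(t, segments):
        """Dominance-weighted average of segment targets at time t.
        segments: list of (centre_time, target_value) pairs."""
        weights = np.array([dominance(t, c) for c, _ in segments])
        targets = np.array([tv for _, tv in segments])
        return float(weights @ targets / weights.sum())

    # Lip-opening targets for three successive segments (times in seconds, values invented).
    segs = [(0.10, 0.2), (0.25, 0.9), (0.40, 0.1)]
    for t in np.linspace(0.05, 0.45, 5):
        print(round(float(t), 2), round(blended_parameter(t, segs), 3))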

32 citations


Proceedings ArticleDOI
10 Dec 2002
TL;DR: Experiments conducted on a small visual speech recognition task using very simple features demonstrate a word recognition rate on the level of the best rates previously reported, even without training the state transition probabilities in the Viterbi lattices, proving the suitability of support vector machines for visual speech recognition.
Abstract: In this paper we propose a visual speech recognition network based on support vector machines. Each word of the dictionary is modeled by a set of temporal sequences of visemes. Each viseme is described by a support vector machine, and the temporal character of speech is modeled by integrating the support vector machines as nodes into a Viterbi decoding lattice. Experiments conducted on a small visual speech recognition task using very simple features demonstrate a word recognition rate on the level of the best rates previously reported even without training the state transition probabilities in the Viterbi lattices. This proves the suitability of support vector machines for visual speech recognition.
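A compact sketch of the Viterbi decoding step over per-frame viseme posteriors, using a uniform (untrained) transition matrix as the abstract suggests is already sufficient; the numbers are illustrative:

    import numpy as np

    def viterbi(log_post, log_trans):
        """Best state path through a lattice of per-frame log posteriors.
        log_post: (T, N) frame-wise viseme log probabilities.
        log_trans: (N, N) log transition scores (here uniform, i.e. untrained)."""
        T, N = log_post.shape
        delta = np.empty((T, N))
        back = np.zeros((T, N), dtype=int)
        delta[0] = log_post[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_trans   # [previous state, current state]
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_post[t]
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    posteriors = np.array([[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6]])
    uniform = np.log(np.full((3, 3), 1.0 / 3.0))
    print(viterbi(np.log(posteriors), uniform))   # -> [0, 1, 2]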

31 citations


PatentDOI
TL;DR: In this paper, an automatic speech recognizer responsive only to acoustic speech utterances is activated only in response to acoustic energy having a spectrum associated with the speech utterances and at least one facial characteristic associated with them.
Abstract: An automatic speech recognizer only responsive to acoustic speech utterances is activated only in response to acoustic energy having a spectrum associated with the speech utterances and at least one facial characteristic associated with the speech utterances. In one embodiment, a speaker must be looking directly into a video camera and the voices and facial characteristics of plural speakers must be matched to enable activation of the automatic speech recognizer.

26 citations


Book ChapterDOI
16 Dec 2002
TL;DR: A novel subword lip reading system using continuous Hidden Markov Models (HMMs) configured according to the statistical features of lip motion and trained with the Baum-Welch method is presented.
Abstract: In this paper, a novel subword lip reading system using continuous Hidden Markov Models (HMMs) is presented. The constituent HMMs are configured according to the statistical features of lip motion and trained with the Baum-Welch method. The performance of the proposed system in identifying the fourteen visemes defined in the MPEG-4 standard is addressed. Experimental results show that an average accuracy above 80% can be achieved with the proposed system.
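A hedged sketch of a per-viseme training and classification loop with the hmmlearn package (GaussianHMM uses Baum-Welch internally); the feature dimensions, state counts and data below are placeholders, not the paper's configuration:

    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)
    # Placeholder lip-motion features: viseme label -> list of (frames x dims) sequences.
    train = {v: [rng.normal(size=(20, 6)) for _ in range(5)]
             for v in ["bilabial", "rounded", "spread"]}

    models = {}
    for viseme, seqs in train.items():
        X, lengths = np.vstack(seqs), [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)                     # Baum-Welch re-estimation
        models[viseme] = m

    test_seq = rng.normal(size=(20, 6))
    print(max(models, key=lambda v: models[v].score(test_seq)))  # highest log-likelihood wins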

25 citations


Dissertation
01 Jan 2002

22 citations


Proceedings ArticleDOI
07 Nov 2002
TL;DR: Experiments conducted on a small visual speech recognition task show a word recognition rate on the level of the best rates previously reported, even without training the state transition probabilities in the Viterbi lattice and using very simple features.
Abstract: In this paper we propose a visual speech recognition network based on support vector machines. Each word of the dictionary is described as a temporal sequence of visemes. Each viseme is described by a support vector machine, and the temporal character of speech is modeled by integrating the support vector machines as nodes into a Viterbi decoding lattice. Experiments conducted on a small visual speech recognition task show a word recognition rate on the level of the best rates previously reported, even without training the state transition probabilities in the Viterbi lattice and using very simple features. This proves the suitability of support vector machines for visual speech recognition.

19 citations


Proceedings ArticleDOI
19 Jun 2002
TL;DR: Some of the issues that make speech so complex to model visually are discussed; lip-sync animation is an important component in creating a realistic human figure.
Abstract: Lip-sync animation is complex and challenging. It promises to be important in natural human-computer interfaces and entertainment, as well as to aid in the education of the deaf. It is an important component in creating a realistic human figure. Speech is governed by principles from anatomy, physics, and psychophysiology. We discuss some of the issues that make speech so complex to model visually.

Proceedings Article
01 Jan 2002
TL;DR: A method for realistic face animation is proposed, which focuses on speech animation; it replicates the 3D 'visemes' that it has learned from talking actors and adds the necessary coarticulation effects.
Abstract: A method for realistic face animation is proposed. In particular it focuses on speech animation. When asked to animate a face it replicates the 3D 'visemes' that it has learned from talking actors, and adds the necessary coarticulation effects. The speech animation could be based on as few as 16 modes, extracted through Independent Component Analysis from different face dynamics. The exact deformation fields that come with the different visemes are adapted by the system to take the shape of the given face into account. By localising the face to be animated in a face space, in which the locations of the neutral example faces are also known, the visemes are adapted automatically according to the relative distances to these examples.
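One way to read the adaptation step is as a distance-weighted combination of the example faces' per-viseme deformation fields; a toy sketch under that reading (the weighting scheme and array sizes are assumptions, not the paper's method):

    import numpy as np

    def adapt_deformation(new_face, example_faces, example_deformations, eps=1e-6):
        """Weight each example's viseme deformation field by the inverse distance
        of the new face to that example in face space."""
        dists = np.linalg.norm(example_faces - new_face, axis=1)
        weights = 1.0 / (dists + eps)
        weights /= weights.sum()
        return np.tensordot(weights, example_deformations, axes=1)

    rng = np.random.default_rng(1)
    examples = rng.normal(size=(4, 50))          # 4 neutral example faces in a 50-D face space
    deformations = rng.normal(size=(4, 300, 3))  # per-example 3D deformation field for one viseme
    new_face = rng.normal(size=50)
    print(adapt_deformation(new_face, examples, deformations).shape)   # (300, 3)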

Proceedings ArticleDOI
TL;DR: This paper describes a 15-year research effort to improve the automatic pronunciation of proper names and details the issues involved in applying those pronunciations to speech synthesis and speech recognition.
Abstract: This paper describes a 15-year research effort to improve the automatic pronunciation of proper names and details the issues involved in applying those pronunciations to speech synthesis and speech recognition.

Patent
Richard J. Qian1
07 Oct 2002
TL;DR: In this paper, a technique for generating an animated character based on visual and audio input from a live subject is described, which is a technique of extracting phonemes to select corresponding visemes.
Abstract: A technique for generating an animated character based on visual and audio input from a live subject. Further described is a technique of extracting phonemes to select corresponding visemes to model a set of physical positions of the subject or emotional expression of the subject.

Proceedings ArticleDOI
26 Aug 2002
TL;DR: A dynamic viseme model for visual speech synthesis is proposed that can deal with the co-articulation problem and various pauses in continuous speech.
Abstract: Personalizing a talking head means not only personalizing the head model but also personalizing its talking manner. In this paper, we propose a dynamic viseme model for visual speech synthesis that can deal with the co-articulation problem and various pauses in continuous speech. Facial animation parameters (FAPs) defined in MPEG-4 are estimated according to the feature points tracked in two orthogonal views via a mirror setup. The individual talking manner, described by model parameters, is learnt from the FAP data to implement a personalized talking head.

Patent
Kiran Challapali1
06 Sep 2002
TL;DR: In this paper, a video processing system and method for processing a stream of frames of video data is presented, consisting of a packaging system that includes a viseme identification system that determines if frames of inputted video data correspond to at least one predetermined viseme; a viseme library for storing frames that correspond to the at least one predetermined viseme; and an encoder for encoding each frame that corresponds to the at least one predetermined viseme.
Abstract: A video processing system and method for processing a stream of frames of video data. The system comprises a packaging system that includes: a viseme identification system (10) that determines if frames of inputted video data correspond to at least one predetermined viseme; a viseme library (16) for storing frames that correspond to the at least one predetermined viseme; and an encoder (14) for encoding each frame that corresponds to the at least one predetermined viseme, wherein the encoder utilizes a previously stored frame in the viseme library to encode a current frame. Also provided is a receiver system that includes: a decoder for decoding encoded frames of video data; a reference frame library for storing decoded frames; and wherein the decoder utilizes a previously decoded frame from the frame reference library to decode a current encoded frame, and wherein the previously decoded frame belongs to the same viseme as the current encoded frame.
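In outline, the encoder classifies each frame's viseme and codes it relative to the stored frame for that viseme; the toy codec below illustrates the idea with a plain pixel residual (a stand-in, not the patented scheme):

    import numpy as np

    class VisemeLibraryCodec:
        """Toy codec: keeps one reference frame per viseme and sends residuals."""
        def __init__(self):
            self.library = {}

        def encode(self, frame, viseme_id):
            ref = self.library.get(viseme_id)
            if ref is None:
                self.library[viseme_id] = frame.copy()
                return ("intra", viseme_id, frame)       # first occurrence: send the full frame
            return ("residual", viseme_id, frame - ref)  # later occurrences: send the difference

        def decode(self, packet):
            kind, viseme_id, payload = packet
            if kind == "intra":
                self.library[viseme_id] = payload.copy()
                return payload
            return self.library[viseme_id] + payload

    enc, dec = VisemeLibraryCodec(), VisemeLibraryCodec()
    f1, f2 = np.zeros((4, 4)), np.full((4, 4), 0.1)
    p1, p2 = enc.encode(f1, "rounded"), enc.encode(f2, "rounded")
    dec.decode(p1)
    print(np.allclose(dec.decode(p2), f2))   # True: frame rebuilt from the library reference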

Proceedings ArticleDOI
09 Dec 2002
TL;DR: An overview of the large scale national project entitled "Spontaneous speech: corpus and processing technology" in Japan is given and the major results of experiments that have been conducted so far are reported, including spontaneous presentation speech recognition, automatic speech summarization, and message-driven speech recognition.
Abstract: How to recognize and understand spontaneous speech is one of the most important issues in state-of-the-art speech recognition technology. In this context, a five-year large scale national project entitled "Spontaneous speech: corpus and processing technology" started in Japan in 1999. This paper gives an overview of the project and reports on the major results of experiments that have been conducted so far at Tokyo Institute of Technology, including spontaneous presentation speech recognition, automatic speech summarization, and message-driven speech recognition. The paper also discusses the most important research problems to be solved in order to achieve ultimate spontaneous speech recognition systems.

Proceedings ArticleDOI
19 Jun 2002
TL;DR: This work generates new, photo-realistic visemes from a single neutral face image by transformation using a set of prototype visemes, which allows visual speech to be generated from photographs and portraits where a full set of visemes is not available.
Abstract: Animated talking faces can be generated from a set of predefined face and mouth shapes (visemes) by either concatenation or morphing. Each facial image corresponds to one or more phonemes, which are generated in synchrony with the visual changes. Existing implementations require a full set of facial visemes to be captured or created by an artist before the images can be animated. In this work we generate new, photo-realistic visemes from a single neutral face image by transformation using a set of prototype visemes. This allows us to generate visual speech from photographs and portraits where a full set of visemes is not available.
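One simple reading of the transformation is that the prototype's neutral-to-viseme change is carried over to the new neutral image; the additive pixel sketch below is a crude stand-in for the shape-and-appearance transformation actually needed:

    import numpy as np

    def synthesize_viseme(new_neutral, proto_neutral, proto_viseme):
        """Apply the prototype's neutral-to-viseme change to a new neutral face image
        (purely additive; a real system would warp shape as well as appearance)."""
        change = proto_viseme - proto_neutral
        return np.clip(new_neutral + change, 0.0, 1.0)

    rng = np.random.default_rng(2)
    proto_neutral = rng.random((64, 64))
    proto_viseme = np.clip(proto_neutral + 0.05, 0.0, 1.0)   # e.g. a prototype rounded-lip shape
    new_neutral = rng.random((64, 64))
    print(synthesize_viseme(new_neutral, proto_neutral, proto_viseme).shape)   # (64, 64)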

Book ChapterDOI
11 Apr 2002
TL;DR: A new system is proposed for the recognition of visual speech based on support vector machines, which have proved to be powerful classifiers in other visual tasks; the use of viseme models, as opposed to entire-word models, offers the advantage of easy generalization to large-vocabulary recognition tasks.
Abstract: Speech recognition based on visual information is an emerging research field. We propose here a new system for the recognition of visual speech based on support vector machines, which proved to be powerful classifiers in other visual tasks. We use support vector machines to recognize the mouth shape corresponding to the different phones produced. To model the temporal character of the speech we employ Viterbi decoding in a network of support vector machines. The recognition rate obtained is higher than those reported earlier when the same features were used. The proposed solution offers the advantage of an easy generalization to large vocabulary recognition tasks due to the use of viseme models, as opposed to entire word models.

Proceedings Article
01 Jan 2002
TL;DR: It is demonstrated that working with syllables provides the basis for linguistically motivated speech recognition using the previously reported notion of the Pseudo-Articulatory Representation (PAR).
Abstract: This paper presents an account of the use of syllable structure as the basis for a novel approach to speech recognition. This contrasts with the serial organization of more conventional phonetic segments, and their use in speech recognition systems. It is demonstrated that working with syllables provides the basis for linguistically motivated speech recognition using the previously reported notion of the Pseudo-Articulatory Representation (PAR). The results are very promising taking into account the preliminary nature of the work and the novelty of the approach. A related paper [1] deals with theoretical issues in greater depth.

Book ChapterDOI
12 Aug 2002
TL;DR: This paper compares two approaches in using HMMs (hidden Markov models) to convert audio signals to a sequence of visemes, which are the generic face images corresponding to particular sounds.
Abstract: Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each triviseme, which is a viseme with its left and right context, and the audio signals are directly recognized as a sequence of trivisemes. In the second approach, each triphone is modeled with an HMM, and a general triphone recognizer is used to produce a triphone sequence from the audio signals. The triviseme or triphone sequence is then converted to a viseme sequence. The performances of the two viseme recognition systems are evaluated on the TIMIT speech corpus.

Journal Article
TL;DR: This work defines 28 basic static visemes of Chinese based on a study of the movement of the visual articulators in Chinese speech and of the pronunciation rules, and describes them in terms of 28 of the total of 68 MPEG-4 FAPs.

01 Jan 2002
TL;DR: This case study illustrates that it is indeed possible to obtain visually relevant speech segmentation data directly from a purely acoustic observation sequence.
Abstract: This paper addresses the problem of animating a talking figure, such as an avatar, using speech input only. The proposed system is based on Hidden Markov Models for the acoustic observation vectors of the speech sounds that correspond to each of 16 visually distinct mouth shapes (called visemes). This case study illustrates that it is indeed possible to obtain visually relevant speech segmentation data directly from a purely acoustic observation sequence.

Proceedings Article
01 Jan 2002
TL;DR: A general framework for the integration of speaker and speech recognizers is presented, and it is shown that the a posteriori probability can be expressed as the product of four terms: a likelihood score from a speaker-independent speech recognizer, the (normalized) likelihood score of a text-dependent speaker recognizer, the likelihood of a speaker-dependent statistical language model, and the prior probability of the speaker.
Abstract: This paper presents a general framework for the integration of speaker and speech recognizers. The framework poses the problem of combining speech and speaker recognizers as the joint maximization of the a posteriori probability of the word sequence and speaker given the observed utterance. It is shown that the a posteriori probability can be expressed as the product of four terms: a likelihood score from a speaker-independent speech recognizer, the (normalized) likelihood score of a text-dependent speaker recognizer, the likelihood of a speaker-dependent statistical language model, and the prior probability of the speaker. Efficient search strategies are discussed, with a particular focus on the problem of recognizing and verifying name-based identity claims over very large populations (e.g., "My name is John Doe"). The efficient search approach uses a speaker-independent recognizer to first generate a list of top hypotheses, followed by a re-sorting of this list based on the combined score of the four terms discussed above. Experimental results on an over-the-telephone speech recognition task show a 34% reduction in the error rate where the test set consists of users speaking their first and last name from a grammar covering 1 million unique persons.
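Reading the abstract, the four-term decomposition can plausibly be written as follows (our own notation: O the observed utterance, W the word sequence, S the speaker; this is an interpretation, not the paper's equations):

    \hat{W},\hat{S}
      = \arg\max_{W,S} P(W,S \mid O)
      = \arg\max_{W,S}\;
        \underbrace{p(O \mid W)}_{\text{speaker-indep.\ acoustic score}}\,
        \underbrace{\frac{p(O \mid W,S)}{p(O \mid W)}}_{\text{normalized speaker score}}\,
        \underbrace{P(W \mid S)}_{\text{speaker-dep.\ language model}}\,
        \underbrace{P(S)}_{\text{speaker prior}}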

Journal Article
TL;DR: In this paper, the 3D positions of thousands of points on the face were determined at the temporal resolution of video and then decomposed into their basic modes using ICA; these modes better capture the underlying, anatomical changes that the face undergoes.
Abstract: The paper discusses the detailed analysis of visual speech. As with other forms of biological motion, humans are known to be very sensitive to the realism in the ways the lips move. In order to determine the elements that come into play in the perceptual analysis of visual speech, it is important to have control over the data. The paper discusses the capture of detailed 3D deformations of faces when talking. The data are detailed in both a temporal and spatial sense. The 3D positions of thousands of points on the face are determined at the temporal resolution of video. These data have been decomposed into their basic modes using ICA. It is noteworthy that this yielded better results than a mere PCA analysis, which results in modes that individually represent facial changes that are anatomically inconsistent. The ICs better capture the underlying, anatomical changes that the face undergoes. Different visemes are all based on the underlying, joint action of the facial muscles. The IC modes do not reflect single muscles, but nevertheless decompose the speech-related deformations into anatomically convincing modes, coined 'pseudo-muscles'.
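A minimal sketch of extracting such modes with FastICA from scikit-learn; the data are random placeholders for the captured vertex trajectories, and the PCA-then-ICA pipeline is a common practical choice rather than necessarily the authors' exact procedure:

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    rng = np.random.default_rng(3)
    # Placeholder capture: 1000 frames, 2000 tracked 3D points flattened to 6000 coordinates.
    frames = rng.normal(size=(1000, 6000))

    # Reduce to a manageable subspace first, then split it into independent modes
    # ("pseudo-muscles" in the paper's terminology).
    reduced = PCA(n_components=16).fit_transform(frames)
    modes = FastICA(n_components=16, max_iter=500).fit_transform(reduced)
    print(modes.shape)   # per-frame activations of the 16 independent modes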

Proceedings Article
01 Jan 2002
TL;DR: This study investigates the extent to which a localist-distributive hybrid formal model of human memory replicates observed behavioral patterns in perception and recognition of appropriately coded language data.
Abstract: This study investigates the extent to which a localist-distributive hybrid formal model of human memory replicates observed behavioral patterns in perception and recognition of appropriately coded language data. Extending previous research that considered for modeled memorization only items with uniform, undefined randomly generated featural specifications, a MINERVA2 simulation was trained to recognize linguistic events and categories at both acoustic-phonetic and phonological-featural processing levels. Results of both test conditions parallel two important effects observed in behavioral data and are discussed with respect to speech perception as well as human memory research.

Proceedings ArticleDOI
04 Nov 2002
TL;DR: Results show that, compared to the phoneme-based system, the tied-state triseme-based speech recognition system gives talking head animation with smoother and more plausible mouth shapes.
Abstract: In this paper, we present a viseme (the basic speech unit in the visual domain) based continuous speech recognition system, which segments speech into viseme sequences with timing boundaries to drive a talking head. In the viseme Hidden Markov Model (HMM) training, the instances of a viseme with different contexts are formulated as trisemes. Based on the mouth shape parameter Liprounding and the defined viseme similarity weight (VSW) from the 3D viseme facial models, 166 questions concerning the viseme contexts are designed to build triseme decision trees that tie the states of trisemes with similar contexts, so that they can share the same parameters. To evaluate the system performance, image-related measurements are also taken to evaluate the resulting viseme sequences, with 'jerky instances' in the Liprounding and VSW graphs evaluating their smoothness. Results show that, compared to the phoneme-based system, the tied-state triseme-based speech recognition system gives talking head animation with smoother and more plausible mouth shapes.

Journal Article
TL;DR: In this paper, audio signals are automatically converted to visual images of mouth shape; the visual speech is represented as a sequence of visemes, which are the generic face images corresponding to particular sounds.
Abstract: We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shape. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. They can also be used to help people with impaired hearing. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is utilized to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we have found that the error rates can be reduced to 20.5% and 13.9%, respectively.


Journal ArticleDOI
TL;DR: This is a preliminary study of the use of synthetic speech for the elderly using CMU's VIKIA; experimental conditions are discussed, actual figures are shown, and conclusions are drawn for speech communication between computers and the elderly.
Abstract: An aging population still needs to access information, such as bus schedules. It is evident that they will be doing so using computers, and especially interfaces using speech input and output. This is a preliminary study of the use of synthetic speech for the elderly. In it, twenty persons between the ages of 60 and 80 were asked to listen to speech emitted by a robot (CMU's VIKIA) and to write down what they heard. All of the speech was natural prerecorded speech (not synthetic) read by one female speaker. There were four listening conditions: (a) only speech emitted, (b) robot moves before emitting speech, (c) face has lip movement during speech, (d) both (b) and (c). There were very few errors for conditions (b), (c), and (d), but errors existed for condition (a). The presentation will discuss the experimental conditions, show actual figures, and try to draw conclusions for speech communication between computers and the elderly.