
Showing papers on "Viseme published in 2002"


Journal ArticleDOI
TL;DR: This paper examines the suitability of support vector machines for visual speech recognition, modeling each word as a temporal sequence of visemes corresponding to the different phones realized and integrating the viseme classifiers as nodes in a Viterbi lattice.
Abstract: Visual speech recognition is an emerging research field. In this paper, we examine the suitability of support vector machines for visual speech recognition. Each word is modeled as a temporal sequence of visemes corresponding to the different phones realized. One support vector machine is trained to recognize each viseme and its output is converted to a posterior probability through a sigmoidal mapping. To model the temporal character of speech, the support vector machines are integrated as nodes into a Viterbi lattice. We test the performance of the proposed approach on a small visual speech recognition task, namely the recognition of the first four digits in English. The word recognition rate obtained is at the level of the previous best reported rates.
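To make the sigmoidal mapping step concrete, here is a minimal Python sketch of converting raw SVM decision scores to posterior-like probabilities (Platt-style scaling); the parameter values and frame scores are illustrative, not taken from the paper:

    import numpy as np

    def svm_score_to_posterior(score, a=-1.0, b=0.0):
        """Map a raw SVM decision score to a posterior-like probability via a
        sigmoid (Platt scaling); a and b would normally be fitted on held-out data."""
        return 1.0 / (1.0 + np.exp(a * score + b))

    # Example: posteriors for one viseme SVM over a short sequence of frames.
    frame_scores = np.array([-1.2, 0.3, 2.1, 0.8])   # raw decision_function outputs
    print(svm_score_to_posterior(frame_scores))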

47 citations


Journal ArticleDOI
TL;DR: The strategy followed by the paper, which focuses on speech, is a kind of bootstrap procedure that first learns 3D shape statistics from a talking face tracked with a relatively small number of markers, rather than simulating facial anatomy.
Abstract: Realistic face animation is especially hard as we are all experts in the perception and interpretation of face dynamics. One approach is to simulate facial anatomy. Alternatively, animation can be based on first observing the visible 3D dynamics, extracting the basic modes, and putting these together according to the required performance. This is the strategy followed by the paper, which focuses on speech. The approach follows a kind of bootstrap procedure. First, 3D shape statistics are learned from a talking face with a relatively small number of markers. A 3D reconstruction is produced at temporal intervals of 1/25 seconds. A topological mask of the lower half of the face is fitted to the motion. Principal component analysis (PCA) of the mask shapes reduces the dimension of the mask shape space. The result is twofold. On the one hand, the face can be animated; in our case it can be made to speak new sentences. On the other hand, face dynamics can be tracked in 3D without markers for performance capture. Copyright © 2002 John Wiley & Sons, Ltd.
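As an illustration of the dimensionality-reduction step, the following sketch runs PCA over flattened mask shapes; the array sizes and component count are placeholders, not the paper's configuration:

    import numpy as np
    from sklearn.decomposition import PCA

    # Placeholder data: 500 frames of a lower-face mask with 300 vertices (x, y, z each).
    rng = np.random.default_rng(0)
    masks = rng.normal(size=(500, 300 * 3))

    # Reduce the mask-shape space to a handful of modes, then rebuild masks from them.
    pca = PCA(n_components=10)
    coeffs = pca.fit_transform(masks)              # per-frame coordinates in mode space
    reconstructed = pca.inverse_transform(coeffs)  # masks approximated from the modes
    print(coeffs.shape, pca.explained_variance_ratio_[:3])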

44 citations


Book ChapterDOI
18 Aug 2002
TL;DR: Two approaches to using HMMs (hidden Markov models) to convert audio signals to a sequence of visemes are compared, and it is found that the error rates can be reduced to 20.5% and 13.9%, respectively, when similar viseme classes are merged.
Abstract: We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shape. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. They can also be used to help people with impaired hearing. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is utilized to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we have found that the error rates can be reduced to 20.5% and 13.9%, respectively.
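The second approach reduces to a many-to-one lookup from phonemes to visemes followed by merging repeats; a hedged sketch with an invented mapping (the paper's actual viseme classes differ):

    # Illustrative phoneme-to-viseme table; not the classes used in the paper.
    PHONEME_TO_VISEME = {
        "p": "bilabial", "b": "bilabial", "m": "bilabial",
        "f": "labiodental", "v": "labiodental",
        "iy": "spread", "ih": "spread",
        "ao": "rounded", "uw": "rounded",
    }

    def phonemes_to_visemes(phoneme_seq):
        """Convert a recognized phoneme sequence to a viseme sequence,
        merging consecutive identical visemes."""
        visemes = []
        for ph in phoneme_seq:
            v = PHONEME_TO_VISEME.get(ph, "neutral")
            if not visemes or visemes[-1] != v:
                visemes.append(v)
        return visemes

    print(phonemes_to_visemes(["p", "ao", "b", "m", "iy"]))   # ['bilabial', 'rounded', 'bilabial', 'spread']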

42 citations


Patent
21 Oct 2002
TL;DR: In this article, a speech recognition unit (41B) selects the acoustic model set whose associated recording distance is nearest to the distance supplied from the distance calculator (47) and performs speech recognition by using that acoustic model set.
Abstract: A speech recognition apparatus and a speech recognition method capable of improving speech recognition accuracy. A distance calculator (47) calculates the distance between a speaking user and a microphone (21) and supplies the distance to a speech recognition unit (41B). The speech recognition unit (41B) contains acoustic model sets generated from speech data recorded at a plurality of different distances. The speech recognition unit (41B) selects the acoustic model set whose recording distance is nearest to the distance supplied from the distance calculator (47) and performs speech recognition by using that acoustic model set.
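The selection step described here amounts to a nearest-neighbour lookup over the recording distances of the stored model sets; a minimal sketch (distances and model names are invented):

    # Acoustic model sets keyed by the recording distance (metres) of their training speech.
    ACOUSTIC_MODEL_SETS = {0.3: "close_talk_models", 1.0: "mid_range_models", 3.0: "far_field_models"}

    def select_model_set(measured_distance):
        """Pick the model set whose training distance is closest to the measured one."""
        best = min(ACOUSTIC_MODEL_SETS, key=lambda d: abs(d - measured_distance))
        return ACOUSTIC_MODEL_SETS[best]

    print(select_model_set(1.4))   # -> "mid_range_models"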

38 citations


Proceedings Article
01 Jan 2002
TL;DR: Several extensions to the original coarticulation algorithm of Cohen and Massaro are implemented, including an optimization to improve performance as well as special treatment of the closure and release phases of bilabial stops and other phonemes.
Abstract: We present a method for generating realistic speech-synchronized facial animations using a physics-based approach with support for coarticulation, i.e. the coloring of a speech segment by surrounding segments. We have implemented several extensions to the original coarticulation algorithm of Cohen and Massaro [Cohen93]. The enhancements include an optimization to improve performance as well as special treatment of the closure and release phases of bilabial stops and other phonemes. Furthermore, for phonemes that are shorter than the sampling intervals of the algorithm and might therefore be missed, additional key frames are created to ensure their impact on the animation.
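The Cohen-Massaro model blends per-segment articulatory targets with overlapping dominance functions; the sketch below shows that blend in a simplified form (exponential dominance, invented parameter values), not the authors' extended algorithm:

    import numpy as np

    def dominance(t, center, alpha=1.0, theta=4.0):
        """Simplified dominance function: exponential fall-off around a segment centre."""
        return alpha * np.exp(-theta * np.abs(t - center))

    def blended_parameter(t, segments):
        """Dominance-weighted average of segment targets at time t.
        segments: list of (centre_time, target_value) pairs."""
        weights = np.array([dominance(t, c) for c, _ in segments])
        targets = np.array([tv for _, tv in segments])
        return float(weights @ targets / weights.sum())

    # Lip-opening targets for three successive segments (times in seconds, values invented).
    segs = [(0.10, 0.2), (0.25, 0.9), (0.40, 0.1)]
    for t in np.linspace(0.05, 0.45, 5):
        print(round(float(t), 2), round(blended_parameter(t, segs), 3))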

32 citations


Proceedings ArticleDOI
10 Dec 2002
TL;DR: Experiments conducted on a small visual speech recognition task using very simple features demonstrate a word recognition rate on the level of the best rates previously reported, even without training the state transition probabilities in the Viterbi lattices, proving the suitability of support vector machines for visual speech recognition.
Abstract: In this paper we propose a visual speech recognition network based on support vector machines. Each word of the dictionary is modeled by a set of temporal sequences of visemes. Each viseme is described by a support vector machine, and the temporal character of speech is modeled by integrating the support vector machines as nodes into a Viterbi decoding lattice. Experiments conducted on a small visual speech recognition task using very simple features demonstrate a word recognition rate on the level of the best rates previously reported even without training the state transition probabilities in the Viterbi lattices. This proves the suitability of support vector machines for visual speech recognition.
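A compact sketch of the Viterbi decoding step over per-frame viseme posteriors, using a uniform (untrained) transition matrix as the abstract suggests is already sufficient; the numbers are illustrative:

    import numpy as np

    def viterbi(log_post, log_trans):
        """Best state path through a lattice of per-frame log posteriors.
        log_post: (T, N) frame-wise viseme log probabilities.
        log_trans: (N, N) log transition scores (here uniform, i.e. untrained)."""
        T, N = log_post.shape
        delta = np.empty((T, N))
        back = np.zeros((T, N), dtype=int)
        delta[0] = log_post[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_trans   # [previous state, current state]
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_post[t]
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    posteriors = np.array([[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6]])
    uniform = np.log(np.full((3, 3), 1.0 / 3.0))
    print(viterbi(np.log(posteriors), uniform))   # -> [0, 1, 2]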

31 citations


PatentDOI
TL;DR: In this paper, an automatic speech recognizer responsive only to acoustic speech utterances is activated only in response to acoustic energy having a spectrum associated with the speech utterances and at least one facial characteristic associated with them.
Abstract: An automatic speech recognizer only responsive to acoustic speech utterances is activated only in response to acoustic energy having a spectrum associated with the speech utterances and at least one facial characteristic associated with the speech utterances. In one embodiment, a speaker must be looking directly into a video camera and the voices and facial characteristics of plural speakers must be matched to enable activation of the automatic speech recognizer.

26 citations


Book ChapterDOI
16 Dec 2002
TL;DR: A novel subword lip reading system using continuous Hidden Markov Models (HMMs) configured according to the statistical features of lip motion and trained with the Baum-Welch method is presented.
Abstract: In this paper, a novel subword lip reading system using continuous Hidden Markov Models (HMMs) is presented. The constituent HMMs are configured according to the statistical features of lip motion and trained with the Baum-Welch method. The performance of the proposed system in identifying the fourteen visemes defined in the MPEG-4 standard is addressed. Experimental results show that an average accuracy above 80% can be achieved with the proposed system.
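A hedged sketch of a per-viseme training and classification loop with the hmmlearn package (GaussianHMM uses Baum-Welch internally); the feature dimensions, state counts and data below are placeholders, not the paper's configuration:

    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)
    # Placeholder lip-motion features: viseme label -> list of (frames x dims) sequences.
    train = {v: [rng.normal(size=(20, 6)) for _ in range(5)]
             for v in ["bilabial", "rounded", "spread"]}

    models = {}
    for viseme, seqs in train.items():
        X, lengths = np.vstack(seqs), [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)                     # Baum-Welch re-estimation
        models[viseme] = m

    test_seq = rng.normal(size=(20, 6))
    print(max(models, key=lambda v: models[v].score(test_seq)))  # highest log-likelihood wins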

25 citations


Dissertation
01 Jan 2002

22 citations


Proceedings ArticleDOI
07 Nov 2002
TL;DR: Experiments conducted on a small visual speech recognition task show a word recognition rate on the level of the best rates previously reported, even without training the state transition probabilities in the Viterbi lattice and using very simple features.
Abstract: In this paper we propose a visual speech recognition network based on support vector machines. Each word of the dictionary is described as a temporal sequence of visemes. Each viseme is described by a support vector machine, and the temporal character of speech is modeled by integrating the support vector machines as nodes into a Viterbi decoding lattice. Experiments conducted on a small visual speech recognition task show a word recognition rate on the level of the best rates previously reported, even without training the state transition probabilities in the Viterbi lattice and using very simple features. This proves the suitability of support vector machines for visual speech recognition.

19 citations


Proceedings ArticleDOI
19 Jun 2002
TL;DR: Some of the issues that make speech so complex to model visually are discussed; lip-sync animation is an important component in creating a realistic human figure.
Abstract: Lip-sync animation is complex and challenging. It promises to be important in natural human-computer interfaces and entertainment, as well as to aid in the education of the deaf. It is an important component in creating a realistic human figure. Speech is governed by principles from anatomy, physics, and psychophysiology. We discuss some of the issues that make speech so complex to model visually.

Proceedings Article
01 Jan 2002
TL;DR: A method for realistic face animation is proposed, which focuses on speech animation; it replicates the 3D 'visemes' that it has learned from talking actors and adds the necessary coarticulation effects.
Abstract: A method for realistic face animation is proposed. In particular it focuses on speech animation. When asked to animate a face it replicates the 3D 'visemes' that it has learned from talking actors, and adds the necessary coarticulation effects. The speech animation could be based on as few as 16 modes, extracted through Independent Component Analysis from different face dynamics. The exact deformation fields that come with the different visemes are adapted by the system to take the shape of the given face into account. By localising the face to be animated in a face space, in which the locations of the neutral example faces are also known, the visemes are adapted automatically according to the relative distances to these examples.
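One way to read the adaptation step is as a distance-weighted combination of the example faces' per-viseme deformation fields; a toy sketch under that reading (the weighting scheme and array sizes are assumptions, not the paper's method):

    import numpy as np

    def adapt_deformation(new_face, example_faces, example_deformations, eps=1e-6):
        """Weight each example's viseme deformation field by the inverse distance
        of the new face to that example in face space."""
        dists = np.linalg.norm(example_faces - new_face, axis=1)
        weights = 1.0 / (dists + eps)
        weights /= weights.sum()
        return np.tensordot(weights, example_deformations, axes=1)

    rng = np.random.default_rng(1)
    examples = rng.normal(size=(4, 50))          # 4 neutral example faces in a 50-D face space
    deformations = rng.normal(size=(4, 300, 3))  # per-example 3D deformation field for one viseme
    new_face = rng.normal(size=50)
    print(adapt_deformation(new_face, examples, deformations).shape)   # (300, 3)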

Proceedings ArticleDOI
TL;DR: This paper describes a 15-year research effort to improve the automatic pronunciation of proper names and details the issues involved in applying those pronunciations to speech synthesis and speech recognition.
Abstract: This paper describes a 15-year research effort to improve the automatic pronunciation of proper names and details the issues involved in applying those pronunciations to speech synthesis and speech recognition.

Patent
Richard J. Qian1
07 Oct 2002
TL;DR: In this paper, a technique for generating an animated character based on visual and audio input from a live subject is described, which is a technique of extracting phonemes to select corresponding visemes.
Abstract: A technique for generating an animated character based on visual and audio input from a live subject. Further described is a technique of extracting phonemes to select corresponding visemes to model a set of physical positions of the subject or emotional expression of the subject.

Proceedings ArticleDOI
26 Aug 2002
TL;DR: A dynamic viseme model for visual speech synthesis is proposed that can deal with the co-articulation problem and various pauses in continuous speech.
Abstract: Personalizing a talking head means not only personalizing the head model but also personalizing its talking manner. In this paper, we propose a dynamic viseme model for visual speech synthesis that can deal with the co-articulation problem and various pauses in continuous speech. Facial animation parameters (FAPs) defined in MPEG-4 are estimated according to the feature points tracked in two orthogonal views via a mirror setup. The individual talking manner, described by model parameters, is learnt from the FAP data to implement a personalized talking head.

Patent
Kiran Challapali1
06 Sep 2002
TL;DR: In this paper, a video processing system and method for processing a stream of frames of video data is presented, consisting of a packaging system that includes a viseme identification system that determines if frames of inputted video data correspond to at least one predetermined viseme; a viseme library for storing frames that correspond to the at least one predetermined viseme; and an encoder for encoding each frame that corresponds to the at least one predetermined viseme.
Abstract: A video processing system and method for processing a stream of frames of video data. The system comprises a packaging system that includes: a viseme identification system (10) that determines if frames of inputted video data correspond to at least one predetermined viseme; a viseme library (16) for storing frames that correspond to the at least one predetermined viseme; and an encoder (14) for encoding each frame that corresponds to the at least one predetermined viseme, wherein the encoder utilizes a previously stored frame in the viseme library to encode a current frame. Also provided is a receiver system that includes: a decoder for decoding encoded frames of video data; a reference frame library for storing decoded frames; and wherein the decoder utilizes a previously decoded frame from the frame reference library to decode a current encoded frame, and wherein the previously decoded frame belongs to the same viseme as the current encoded frame.
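In outline, the encoder classifies each frame's viseme and codes it relative to the stored frame for that viseme; the toy codec below illustrates the idea with a plain pixel residual (a stand-in, not the patented scheme):

    import numpy as np

    class VisemeLibraryCodec:
        """Toy codec: keeps one reference frame per viseme and sends residuals."""
        def __init__(self):
            self.library = {}

        def encode(self, frame, viseme_id):
            ref = self.library.get(viseme_id)
            if ref is None:
                self.library[viseme_id] = frame.copy()
                return ("intra", viseme_id, frame)       # first occurrence: send the full frame
            return ("residual", viseme_id, frame - ref)  # later occurrences: send the difference

        def decode(self, packet):
            kind, viseme_id, payload = packet
            if kind == "intra":
                self.library[viseme_id] = payload.copy()
                return payload
            return self.library[viseme_id] + payload

    enc, dec = VisemeLibraryCodec(), VisemeLibraryCodec()
    f1, f2 = np.zeros((4, 4)), np.full((4, 4), 0.1)
    p1, p2 = enc.encode(f1, "rounded"), enc.encode(f2, "rounded")
    dec.decode(p1)
    print(np.allclose(dec.decode(p2), f2))   # True: frame rebuilt from the library reference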

Proceedings ArticleDOI
09 Dec 2002
TL;DR: An overview of the large scale national project entitled "Spontaneous speech: corpus and processing technology" in Japan is given and the major results of experiments that have been conducted so far are reported, including spontaneous presentation speech recognition, automatic speech summarization, and message-driven speech recognition.
Abstract: How to recognize and understand spontaneous speech is one of the most important issues in state-of-the-art speech recognition technology. In this context, a five-year large scale national project entitled "Spontaneous speech: corpus and processing technology" started in Japan in 1999. This paper gives an overview of the project and reports on the major results of experiments that have been conducted so far at Tokyo Institute of Technology, including spontaneous presentation speech recognition, automatic speech summarization, and message-driven speech recognition. The paper also discusses the most important research problems to be solved in order to achieve ultimate spontaneous speech recognition systems.

Proceedings ArticleDOI
19 Jun 2002
TL;DR: This work generates new, photo-realistic visemes from a single neutral face image by transformation using a set of prototype visemes, which allows visual speech to be generated from photographs and portraits where a full set of visemes is not available.
Abstract: Animated talking faces can be generated from a set of predefined face and mouth shapes (visemes) by either concatenation or morphing. Each facial image corresponds to one or more phonemes, which are generated in synchrony with the visual changes. Existing implementations require a full set of facial visemes to be captured or created by an artist before the images can be animated. In this work we generate new, photo-realistic visemes from a single neutral face image by transformation using a set of prototype visemes. This allows us to generate visual speech from photographs and portraits where a full set of visemes is not available.
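One simple reading of the transformation is that the prototype's neutral-to-viseme change is carried over to the new neutral image; the additive pixel sketch below is a crude stand-in for the shape-and-appearance transformation actually needed:

    import numpy as np

    def synthesize_viseme(new_neutral, proto_neutral, proto_viseme):
        """Apply the prototype's neutral-to-viseme change to a new neutral face image
        (purely additive; a real system would warp shape as well as appearance)."""
        change = proto_viseme - proto_neutral
        return np.clip(new_neutral + change, 0.0, 1.0)

    rng = np.random.default_rng(2)
    proto_neutral = rng.random((64, 64))
    proto_viseme = np.clip(proto_neutral + 0.05, 0.0, 1.0)   # e.g. a prototype rounded-lip shape
    new_neutral = rng.random((64, 64))
    print(synthesize_viseme(new_neutral, proto_neutral, proto_viseme).shape)   # (64, 64)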

Book ChapterDOI
11 Apr 2002
TL;DR: A new system is proposed for the recognition of visual speech based on support vector machines, which have proved to be powerful classifiers in other visual tasks; the use of viseme models, as opposed to entire-word models, offers the advantage of easy generalization to large-vocabulary recognition tasks.
Abstract: Speech recognition based on visual information is an emerging research field. We propose here a new system for the recognition of visual speech based on support vector machines, which proved to be powerful classifiers in other visual tasks. We use support vector machines to recognize the mouth shape corresponding to the different phones produced. To model the temporal character of the speech we employ Viterbi decoding in a network of support vector machines. The recognition rate obtained is higher than those reported earlier when the same features were used. The proposed solution offers the advantage of an easy generalization to large vocabulary recognition tasks due to the use of viseme models, as opposed to entire word models.

Proceedings Article
01 Jan 2002
TL;DR: It is demonstrated that working with syllables provides the basis for linguistically motivated speech recognition using the previously reported notion of the Pseudo-Articulatory Representation (PAR).
Abstract: This paper presents an account of the use of syllable structure as the basis for a novel approach to speech recognition. This contrasts with the serial organization of more conventional phonetic segments, and their use in speech recognition systems. It is demonstrated that working with syllables provides the basis for linguistically motivated speech recognition using the previously reported notion of the Pseudo-Articulatory Representation (PAR). The results are very promising taking into account the preliminary nature of the work and the novelty of the approach. A related paper [1] deals with theoretical issues in greater depth.

Book ChapterDOI
12 Aug 2002
TL;DR: This paper compares two approaches in using HMMs (hidden Markov models) to convert audio signals to a sequence of visemes, which are the generic face images corresponding to particular sounds.
Abstract: Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each triviseme, which is a viseme with its left and right context, and the audio signals are directly recognized as a sequence of trivisemes. In the second approach, each triphone is modeled with an HMM, and a general triphone recognizer is used to produce a triphone sequence from the audio signals. The triviseme or triphone sequence is then converted to a viseme sequence. The performances of the two viseme recognition systems are evaluated on the TIMIT speech corpus.

Journal Article
TL;DR: This work defines 28 basic static visemes of Chinese based on a study of the movement of the visual articulators in Chinese speech and of the pronunciation rules, and describes them in terms of 28 of the total of 68 MPEG-4 FAPs.

01 Jan 2002
TL;DR: This case study illustrates that it is indeed possible to obtain visually relevant speech segmentation data directly from a purely acoustic observation sequence.
Abstract: This paper addresses the problem of animating a talking figure, such as an avatar, using speech input only. The proposed system is based on Hidden Markov Models for the acoustic observation vectors of the speech sounds that correspond to each of 16 visually distinct mouth shapes (called visemes). This case study illustrates that it is indeed possible to obtain visually relevant speech segmentation data directly from a purely acoustic observation sequence.

Proceedings Article
01 Jan 2002
TL;DR: A general framework for the integration of speaker and speech recognizers is presented, and it is shown that the a posteriori probability can be expressed as the product of four terms: a likelihood score from a speaker-independent speech recognizer, the (normalized) likelihood score of a text-dependent speaker recognizer, the likelihood of a speaker-dependent statistical language model, and the prior probability of the speaker.
Abstract: This paper presents a general framework for the integration of speaker and speech recognizers. The framework poses the problem of combining speech and speaker recognizers as the joint maximization of the a posteriori probability of the word sequence and speaker given the observed utterance. It is shown that the a posteriori probability can be expressed as the product of four terms: a likelihood score from a speaker-independent speech recognizer, the (normalized) likelihood score of a text-dependent speaker recognizer, the likelihood of a speaker-dependent statistical language model, and the prior probability of the speaker. Efficient search strategies are discussed, with a particular focus on the problem of recognizing and verifying name-based identity claims over very large populations (e.g., "My name is John Doe"). The efficient search approach uses a speaker-independent recognizer to first generate a list of top hypotheses, followed by a re-sorting of this list based on the combined score of the four terms discussed above. Experimental results on an over-the-telephone speech recognition task show a 34% reduction in the error rate where the test set consists of users speaking their first and last name from a grammar covering 1 million unique persons.
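Reading the abstract, the four-term decomposition can plausibly be written as follows (our own notation: O the observed utterance, W the word sequence, S the speaker; this is an interpretation, not the paper's equations):

    \hat{W},\hat{S}
      = \arg\max_{W,S} P(W,S \mid O)
      = \arg\max_{W,S}\;
        \underbrace{p(O \mid W)}_{\text{speaker-indep.\ acoustic score}}\,
        \underbrace{\frac{p(O \mid W,S)}{p(O \mid W)}}_{\text{normalized speaker score}}\,
        \underbrace{P(W \mid S)}_{\text{speaker-dep.\ language model}}\,
        \underbrace{P(S)}_{\text{speaker prior}}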

Journal Article
TL;DR: In this paper, the 3D positions of thousands of points on the face were determined at the temporal resolution of video and then decomposed into their basic modes using ICA; these modes better capture the underlying, anatomical changes that the face undergoes.
Abstract: The paper discusses the detailed analysis of visual speech. As with other forms of biological motion, humans are known to be very sensitive to the realism in the ways the lips move. In order to determine the elements that come into play in the perceptual analysis of visual speech, it is important to have control over the data. The paper discusses the capture of detailed 3D deformations of faces when talking. The data are detailed in both a temporal and spatial sense. The 3D positions of thousands of points on the face are determined at the temporal resolution of video. These data have been decomposed into their basic modes using ICA. It is noteworthy that this yielded better results than a mere PCA analysis, which results in modes that individually represent facial changes that are anatomically inconsistent. The ICs better capture the underlying, anatomical changes that the face undergoes. Different visemes are all based on the underlying, joint action of the facial muscles. The IC modes do not reflect single muscles, but nevertheless decompose the speech-related deformations into anatomically convincing modes, coined 'pseudo-muscles'.
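A minimal sketch of extracting such modes with FastICA from scikit-learn; the data are random placeholders for the captured vertex trajectories, and the PCA-then-ICA pipeline is a common practical choice rather than necessarily the authors' exact procedure:

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    rng = np.random.default_rng(3)
    # Placeholder capture: 1000 frames, 2000 tracked 3D points flattened to 6000 coordinates.
    frames = rng.normal(size=(1000, 6000))

    # Reduce to a manageable subspace first, then split it into independent modes
    # ("pseudo-muscles" in the paper's terminology).
    reduced = PCA(n_components=16).fit_transform(frames)
    modes = FastICA(n_components=16, max_iter=500).fit_transform(reduced)
    print(modes.shape)   # per-frame activations of the 16 independent modes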

Proceedings Article
01 Jan 2002
TL;DR: This study investigates the extent to which a localist-distributive hybrid formal model of human memory replicates observed behavioral patterns in perception and recognition of appropriately coded language data.
Abstract: This study investigates the extent to which a localist-distributive hybrid formal model of human memory replicates observed behavioral patterns in perception and recognition of appropriately coded language data. Extending previous research that considered for modeled memorization only items with uniform, undefined randomly generated featural specifications, a MINERVA2 simulation was trained to recognize linguistic events and categories at both acoustic-phonetic and phonological-featural processing levels. Results of both test conditions parallel two important effects observed in behavioral data and are discussed with respect to speech perception as well as human memory research.

Proceedings ArticleDOI
04 Nov 2002
TL;DR: Results show that, compared to the phoneme-based system, the tied-state triseme-based speech recognition system gives talking head animation with smoother and more plausible mouth shapes.
Abstract: In this paper, we present a viseme (the basic speech unit in the visual domain) based continuous speech recognition system, which segments speech into viseme sequences with timing boundaries to drive a talking head. In the viseme Hidden Markov Model (HMM) training, the instances of a viseme with different contexts are formulated as trisemes. Based on the mouth shape parameter Liprounding and the defined viseme similarity weight (VSW) from the 3D viseme facial models, 166 questions concerning the viseme contexts are designed to build triseme decision trees that tie the states of trisemes with similar contexts, so that they can share the same parameters. To evaluate the system performance, image-related measurements are also taken to evaluate the resulting viseme sequences, with 'jerky instances' in the Liprounding and VSW graphs evaluating their smoothness. Results show that, compared to the phoneme-based system, the tied-state triseme-based speech recognition system gives talking head animation with smoother and more plausible mouth shapes.

Journal Article
TL;DR: In this paper, audio signals are automatically converted to visual images of mouth shape; the visual speech is represented as a sequence of visemes, which are the generic face images corresponding to particular sounds.
Abstract: We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shape. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. They can also be used to help people with impaired hearing. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is utilized to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we have found that the error rates can be reduced to 20.5% and 13.9%, respectively.


Journal ArticleDOI
TL;DR: This is a preliminary study of the use of synthetic speech for the elderly using CMU's VIKIA; experimental conditions are discussed, actual figures are shown, and conclusions are drawn for speech communication between computers and the elderly.
Abstract: An aging population still needs to access information, such as bus schedules. It is evident that they will be doing so using computers, and especially interfaces using speech input and output. This is a preliminary study of the use of synthetic speech for the elderly. In it, twenty persons between the ages of 60 and 80 were asked to listen to speech emitted by a robot (CMU's VIKIA) and to write down what they heard. All of the speech was natural prerecorded speech (not synthetic) read by one female speaker. There were four listening conditions: (a) only speech emitted, (b) robot moves before emitting speech, (c) face has lip movement during speech, (d) both (b) and (c). There were very few errors for conditions (b), (c), and (d), but errors existed for condition (a). The presentation will discuss the experimental conditions, show actual figures, and try to draw conclusions for speech communication between computers and the elderly.