
Showing papers on "Viseme published in 1996"


Journal ArticleDOI
TL;DR: The intended audience for the book Fundamentals of Speech Synthesis and Recognition, edited by Eric Keller, is the "whole new generation of computer scientists" who will wonder "what speech is all about" in the context of building or deploying computer-based speech technology.
Abstract: The intended audience for the book Fundamentals of Speech Synthesis and Recognition, edited by Eric Keller, is the “whole new generation of computer scientists” who will wonder “what speech is all about” in the context of building or deploying computer-based speech technology. Apart from the reader’s being a “well motivated” “computer scientist,” there are no specific prerequisites mentioned. Given the title, one would thus hope for a broad, balanced introduction to speech science and technology. This, the book is not. It does, however, contain some excellent material, much of which is accessible to the lay reader in speech. The book is divided into three sections: “Background,” aimed at introducing basic speech science and technology concepts; “State of the Art,” which, as the name implies, provides a window on current practical consequences of speech research and development; and “Challenges,” presenting research questions in the areas of speech production, perception, synthesis, and human-machine interaction. Because the chapters are written by several contributors, it seems most appropriate to review them individually, rather than to attempt generalizations that cover the book as a whole. After glancing through the introduction to the “Background” section and through Chapter 1, “Fundamentals of Phonetic Science,” I was ready to put the book down permanently. These are rife with colloquialisms, nonstandard terminology, inaccuracies, and misleading material. The introduction to phonetics is primarily contained in one footnote. The discussion of speech acoustics suggests a general lack of understanding of the area. The terminology is nonstandard without having the redeeming value of clarifying issues. For instance, all potential points of constriction of the vocal tract are termed “ports,” with the “linguo-palatal port” having the same valence as the velar port. The introduction to the voicing mechanism and the accompanying illustration convey no insight regarding the self-oscillatory nature of the process. No introduction is provided for basic measurement concepts (frequency, intensity, spectrum, spectrogram). Basic processing methods are incompletely or inaccurately described: “[LPC] which is often calculated by taking a spectrum of an autocorrelation.” The only redeeming aspects of the chapter are its brevity and the references contained at the end.
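For reference, the conventional procedure that the quoted sentence mangles is: window a frame of speech, compute its short-time autocorrelation, and solve for the predictor coefficients with the Levinson-Durbin recursion. A minimal sketch of that textbook computation (illustrative only; not code from the book under review):

    # Autocorrelation-method LPC via the Levinson-Durbin recursion.
    # Illustrative sketch of the standard procedure, not the book's text.
    import numpy as np

    def lpc(frame, order):
        """Return LPC coefficients a[0..order] (a[0] = 1) and the error."""
        n = len(frame)
        # Short-time autocorrelation r[0..order] of the windowed samples
        r = [float(np.dot(frame[:n - k], frame[k:])) for k in range(order + 1)]
        a = [1.0]           # prediction polynomial, grown one order per step
        err = r[0]          # residual (prediction error) energy
        for i in range(1, order + 1):
            acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
            k = -acc / err                      # reflection coefficient
            a = [a[j] + k * a[i - j] if 0 < j < i else a[j]
                 for j in range(i)] + [k]
            err *= 1.0 - k * k
        return np.array(a), err

A frame's next sample is then predicted as minus the dot product of a[1:] with the preceding samples; nothing in this pipeline amounts to "taking a spectrum of an autocorrelation."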

77 citations


Book ChapterDOI
01 Jan 1996
TL;DR: For a number of years, researchers have studied patterns of consonant and vowel confusions in speechreading. Visually similar speech sounds are referred to as visemes, and factors affecting viseme groupings include coarticulation effects of accompanying sounds, environmental effects (e.g., lighting), and articulatory differences among various talkers, as discussed in this paper.
Abstract: For a number of years, researchers have studied patterns of consonant and vowel confusions in speechreading. Visually similar speech sounds are referred to as visemes. Factors affecting viseme groupings include coarticulation effects of accompanying sounds, environmental effects (e.g., lighting), and articulatory differences among various talkers. The contribution of the latter, talker differences, is addressed in this paper.
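To make the viseme notion concrete: given a confusion matrix from a speechreading test, phonemes whose responses pattern together can be grouped automatically. The sketch below uses hierarchical clustering; the phoneme set, confusion proportions, and merge threshold are all hypothetical, not data from the paper.

    # Hypothetical sketch: derive viseme groups from a consonant
    # confusion matrix via hierarchical clustering.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    phones = ["p", "b", "m", "f", "v", "th"]
    # conf[i, j]: proportion of trials where phone i was perceived as j
    # (invented numbers; bilabials confuse with each other, as do f/v)
    conf = np.array([
        [0.50, 0.30, 0.15, 0.02, 0.02, 0.01],
        [0.28, 0.52, 0.15, 0.02, 0.02, 0.01],
        [0.15, 0.15, 0.65, 0.02, 0.02, 0.01],
        [0.02, 0.02, 0.02, 0.55, 0.35, 0.04],
        [0.02, 0.02, 0.02, 0.33, 0.55, 0.06],
        [0.02, 0.02, 0.02, 0.10, 0.10, 0.74],
    ])
    # Symmetric dissimilarity: frequently confused phones are "close"
    sim = (conf + conf.T) / 2.0
    dist = 1.0 - sim / sim.max()
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    groups = fcluster(Z, t=0.8, criterion="distance")
    for g in sorted(set(groups)):
        print("viseme:", [p for p, lab in zip(phones, groups) if lab == g])

With these invented numbers the procedure recovers the classic groupings {p, b, m}, {f, v}, and {th}; on real data the groups would also shift with lighting, coarticulation, and talker, as the abstract notes.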

52 citations


Proceedings ArticleDOI
03 Oct 1996
TL;DR: An audiovisual speech synthesizer from unlimited French text is presented, using a 3-D parametric model of the face; coarticulation depends on the phonetic context, the speech rate, and a "hypo-hyper articulation" coefficient adjustable by the user.
Abstract: An audiovisual speech synthesizer from unlimited French text is presented. It uses a 3-D parametric model of the face, controlled by eight parameters. Target values have been assigned to the parameters for each French viseme, based upon measurements made on a human speaker. Parameter trajectories are modeled by means of dominance functions associated with each parameter and each viseme. A dominance function is characterized by three coefficients, so that coarticulation finally depends on the phonetic context, the speech rate, and a "hypo-hyper articulation" coefficient adjustable by the user. Finally, the visual and audiovisual intelligibility of the first version of the visual synthesizer has been evaluated and compared to that of the acoustic synthesizer on which it was implemented.
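A dominance-function scheme of this general kind (in the style of Cohen and Massaro's coarticulation model) can be sketched as follows: each viseme contributes a target value per facial parameter, weighted over time by a dominance function with three coefficients, and the trajectory is the dominance-weighted average of the targets. The exact functional form and coefficients in the paper may differ; all numbers below are invented for illustration.

    # Sketch of dominance-function coarticulation for one facial
    # parameter (e.g., lip rounding in [0, 1]). Illustrative values only.
    import numpy as np

    def dominance(t, center, alpha, theta, c):
        # Negative-exponential dominance peaking at the viseme's center
        return alpha * np.exp(-theta * np.abs(t - center) ** c)

    # (center time in s, target value, alpha, theta, c) per viseme
    visemes = [(0.10, 0.2, 1.0, 30.0, 1.0),   # spread vowel
               (0.30, 0.9, 1.2, 25.0, 1.0),   # rounded vowel
               (0.50, 0.4, 0.8, 35.0, 1.0)]   # neutral consonant

    t = np.linspace(0.0, 0.6, 121)
    num = np.zeros_like(t)
    den = np.zeros_like(t)
    for center, target, alpha, theta, c in visemes:
        d = dominance(t, center, alpha, theta, c)
        num += d * target
        den += d
    trajectory = num / den    # dominance-weighted average of targets

Speech rate and the hypo-hyper articulation coefficient would plausibly act by rescaling the timing and magnitude coefficients, strengthening or weakening each viseme's pull on the trajectory.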

49 citations


Journal ArticleDOI
TL;DR: Rather than posing completely new paradigms for ASR that more closely parallel the relationship between speech production and human speech understanding, the authors describe work aimed at integrating speech production models into existing ASR formalisms.
Abstract: This paper investigates the issues that are associated with applying speech production models to automatic speech recognition (ASR). Here the applicability of articulatory representations to ASR is considered independently of the role of articulatory representations in speech perception. While the question of whether it is necessary or even possible for human listeners to recover the state of the articulators during the process of perceiving speech is an important one, it is not considered here. Hence, the authors refrain from posing completely new paradigms for ASR which more closely parallel the relationship between speech production and human speech understanding. Instead, work aimed at integrating speech production models into existing ASR formalisms is described.

48 citations


Book ChapterDOI
01 Jan 1996
TL;DR: Recognition of the synthetic talker is reasonably close to that of the human talker, but a significant distance remains to be covered and improvements to the synthetic phoneme specifications are discussed.
Abstract: We report here on an experiment comparing visual recognition of monosyllabic words produced either by our computer-animated talker or a human talker. Recognition of the synthetic talker is reasonably close to that of the human talker, but a significant distance remains to be covered and we discuss improvements to the synthetic phoneme specifications. In an additional experiment using the same paradigm, we compare perception of our animated talker with a similarly generated point-light display, finding significantly worse performance for the latter for a number of viseme classes. We conclude with some ideas for future progress and briefly describe our new animated tongue.

42 citations


Book ChapterDOI
01 Jan 1996
TL;DR: The feature selection process uses a correlation matrix, principal component analysis, and speechreading heuristics to reduce the number of oral-cavity features from 35 to 13, and the analysis concludes that the dynamic oral-cavity features offer great potential for machine speechreading and for the teaching of human speechreading.
Abstract: We describe a methodology to automatically identify visemes and to determine important oral-cavity features for a speaker-dependent, optical continuous speech recognizer. A viseme, as defined by Fisher (1968), represents phones that contain optically similar sequences of oral-cavity movements. Large-vocabulary, continuous acoustic speech recognizers that use Hidden Markov Models (HMMs) require accurate phone models (Lee 1989). Similarly, an optical recognizer requires accurate viseme models (Goldschen 1993). Since no universal agreement exists on a subjective viseme definition, we provide an empirical viseme definition using HMMs. We train a set of phone HMMs using optical information, and then cluster similar phone HMMs to form viseme HMMs. We compare our algorithmic phone-to-viseme mapping with the mappings from human speechreading experts. We start, however, by describing the oral-cavity feature selection process used to determine features that characterize the movements of the oral cavity during speech. The feature selection process uses a correlation matrix, principal component analysis, and speechreading heuristics to reduce the number of oral-cavity features from 35 to 13. Our analysis concludes that the dynamic oral-cavity features offer great potential for machine speechreading and for the teaching of human speechreading.
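The feature-reduction pass described above can be sketched like this: drop one of each pair of near-redundant oral-cavity features using the correlation matrix, then use PCA to check how much variance the survivors retain. Everything below (data, threshold, counts) is hypothetical, standing in for the paper's measured features.

    # Hypothetical sketch of correlation-plus-PCA feature reduction.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 35))        # 500 frames x 35 oral-cavity features
    # Simulate two nearly redundant measurements (e.g., two mouth widths)
    X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=500)

    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        # Keep feature j unless it is nearly redundant with one already kept
        if all(abs(corr[j, k]) < 0.9 for k in keep):
            keep.append(j)

    Xk = X[:, keep] - X[:, keep].mean(axis=0)
    # PCA via SVD: how much variance do the leading components capture?
    _, s, _ = np.linalg.svd(Xk, full_matrices=False)
    var = (s ** 2) / (s ** 2).sum()
    print(f"{len(keep)} features kept; "
          f"first 13 PCs capture {var[:13].sum():.0%} of variance")

On the paper's real measurements this kind of pass, combined with speechreading heuristics, is what justifies cutting the feature set from 35 to 13.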

39 citations


Proceedings ArticleDOI
10 Sep 1996
TL;DR: Three systems for integrating visual data into automatic speech recognition are developed and tested, each dealing with one or both of the identified problems and proposing a different integration strategy; some of the proposed solutions give satisfactory results.
Abstract: This paper presents our work on the integration of visual data into automatic speech recognition systems. We particularly aim at solving two problems:
• classification differences between the modeling of acoustic information (phonemes) and visual information (visemes);
• the phenomena of anticipation and retention of visemes relative to the corresponding phonemes.
We developed and tested three systems, each dealing with one or both problems and proposing a different integration strategy. The comparison of system performances shows that some of the solutions we propose give satisfactory results, and suggests that further work on some of the others would lead to additional improvements.
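One family of strategies such systems can draw on is late (decision-level) fusion, where per-frame acoustic and visual class scores are combined with a reliability weight and a small time offset between the streams absorbs some viseme anticipation or retention. The sketch below illustrates that general idea; it is not a reconstruction of the three systems in the paper, and all names and values are hypothetical.

    # Late-fusion sketch: combine acoustic and visual per-frame
    # log-likelihoods with a reliability weight, searching a small
    # audio-visual offset to absorb viseme anticipation/retention.
    # Hypothetical illustration, not the paper's systems.
    import numpy as np

    def fuse(audio_ll, video_ll, weight=0.7, max_offset=2):
        """audio_ll, video_ll: (frames, classes) log-likelihood arrays."""
        best = None
        for off in range(-max_offset, max_offset + 1):
            # Shift the visual stream in time (np.roll wraps at the
            # edges, which is acceptable for a sketch)
            v = np.roll(video_ll, off, axis=0)
            fused = weight * audio_ll + (1.0 - weight) * v
            score = float(fused.max(axis=1).sum())   # greedy frame score
            if best is None or score > best[0]:
                best = (score, fused.argmax(axis=1), off)
        return best  # (score, per-frame class decisions, chosen offset)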

10 citations


Proceedings ArticleDOI
03 Oct 1996
TL;DR: The authors show how the pseudo-articulatory representations facilitate the details of speech processing, for both synthesis and recognition, and give details of work in progress on recognition.
Abstract: Pseudo-articulatory representations are increasingly being used in work on speech synthesis and recognition. The value of such representations lies in their derivation from linguistic abstractions: they are based on articulatory idealizations used by linguists to describe speech. Iles and Edmondson (1994) demonstrated that, using these representations, it is possible to overcome the many-to-one problem in mapping articulatory configuration to acoustic signal. The authors show how the representations facilitate the details of speech processing, for both synthesis and recognition, and give details of work in progress on recognition. The role of pseudo-articulatory representations in the development of an integrated approach to synthesis and recognition is also discussed.

5 citations


Patent
08 Feb 1996
TL;DR: In this article, a method and device for assessing the quality of speech are presented: the speech to be evaluated is listened to by a person who reproduces it, and the stops of the vowel sounds in the produced and the reproduced speech are identified.
Abstract: The present invention relates to a method and device for assessing the quality of speech. The speech to be evaluated is listened to by a person, who then reproduces it. The stops of the vowel sounds in the produced and the reproduced speech are identified, and the difference between corresponding vowel-sound stops is registered. An average value is computed from the obtained differences; this average indicates the quality of the produced speech. The invention can be used to evaluate different speech-producing sources, such as equipment and machines, as well as people's ability to comprehend the speech.
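The scoring rule reduces to simple arithmetic: pair each vowel-sound stop time in the produced speech with the corresponding stop in the listener's reproduction, take the differences, and average them. A sketch with hypothetical timestamps:

    # Average the differences between paired vowel-sound stop times in
    # the produced and reproduced speech. Timestamps (s) are invented.
    produced   = [0.42, 0.91, 1.37, 1.88]   # vowel stops in source speech
    reproduced = [0.45, 0.97, 1.41, 1.97]   # same vowels, as repeated

    diffs = [abs(r - p) for p, r in zip(produced, reproduced)]
    quality_score = sum(diffs) / len(diffs)  # lower average => better quality
    print(f"mean vowel-stop difference: {quality_score:.3f} s")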

5 citations


Book ChapterDOI
01 Jan 1996
TL;DR: A facial animation program with an open input-text vocabulary is designed as a training aid for speechreading, built on appropriate motion models for the visual articulatory movements that are relevant to speechreading.
Abstract: The goal of this paper is to introduce appropriate motion models for those visual articulatory movements that are relevant to the process of speechreading and, with these, to design a facial animation program with an open input-text vocabulary for use as a training aid for speechreading.

4 citations