
Showing papers on "Viseme published in 1994"


Book
21 Jul 1994
TL;DR: Part I: Speech and language 1. Communication 2. The production of speech 3. The sounds of speech 4. The description and classification of speech sounds 5. Sounds in language Part II: The sounds of English 6. The historical background 7. Standard and regional accents 8. The English vowels 9. The English consonants Part III: Words and connected speech 10. Words 11. Connected speech 12. Words in connected speech 13. Teaching the pronunciation of English
Abstract: PART I: Speech and language 1. Communication 2. The production of speech 3. The sounds of speech 4. The description and classification of speech sounds 5. Sounds in language PART II: The sounds of English 6. The historical background 7. Standard and regional accents 8. The English vowels 9. The English consonants PART III: Words and connected speech 10. Words 11. Connected speech 12. Words in connected speech 13. Teaching the pronunciation of English

659 citations


Journal ArticleDOI
TL;DR: Phonological data show unmistakably that the acoustic-auditory properties of speech sounds, not their articulations, are the primary determinant of their behavior.
Abstract: Should the human or mechanical listener attempt to recover articulatory information when recognizing speech? The answer is ‘‘no,’’ with one qualification: listeners who eventually need to speak aloud the utterances they hear others make would be aided by figuring out what other speakers do with their vocal tracts, but even in this case it may not be absolutely necessary. (There is also no dispute that the speech scientist wishing to understand how speech works needs to know the articulations giving rise to specific sounds.) The argument has three parts: (a) the phonologies of languages (e.g., their segment inventories, phonotactic patterns, etc.) unmistakably optimize sounds, not articulations; (b) infants and even various nonhuman species can differentiate certain sound contrasts in human speech even though it is highly unlikely that they can deduce the vocal tract movements generating the sounds; (c) humans can differentiate many nonspeech sounds almost as complex as speech, e.g., music, machine noises, as well as bird and monkey vocalizations, where there is little or no possibility of recovering the mechanisms producing the sounds.

136 citations


Proceedings ArticleDOI
31 Oct 1994
TL;DR: A continuous optical automatic speech recognizer that uses optical information from the oral-cavity shadow of a speaker is described; it achieves 25.3 percent recognition on sentences having a perplexity of 150 without using any syntactic, semantic, acoustic, or contextual guides.
Abstract: We describe a continuous optical automatic speech recognizer (OASR) that uses optical information from the oral-cavity shadow of a speaker. The system achieves 25.3 percent recognition on sentences having a perplexity of 150 without using any syntactic, semantic, acoustic, or contextual guides. We introduce 13, mostly dynamic, oral-cavity features used for optical recognition, present phones that appear optically similar (visemes) for our speaker, and present the recognition results for our hidden Markov models (HMMs) using visemes, trisemes, and generalized trisemes. We conclude that future research is warranted for optical recognition, especially when combined with other input modalities.
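
As an illustration of how HMM-style decoding over viseme states can work, here is a minimal sketch assuming a toy viseme label set, sticky transition probabilities, and random stand-in emission scores; it is not the paper's OASR, which used 13 oral-cavity features and trained HMMs on visemes, trisemes, and generalized trisemes.

```python
# Minimal sketch (not the paper's system): decode per-frame viseme scores
# into a viseme sequence with a Viterbi pass.  The viseme label set, the
# transition matrix, and the emission scores are illustrative assumptions.
import numpy as np

VISEMES = ["sil", "p_b_m", "f_v", "round", "open"]   # assumed label set

def viterbi(log_emit, log_trans, log_init):
    """log_emit: (T, S) per-frame log-likelihoods of each viseme state."""
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans     # (S, S) path scores
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                      # backtrack
        path.append(int(back[t, path[-1]]))
    return [VISEMES[s] for s in reversed(path)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = len(VISEMES)
    log_trans = np.log(np.full((S, S), 0.05) + np.eye(S) * 0.75)  # sticky states
    log_init = np.log(np.full(S, 1.0 / S))
    log_emit = rng.normal(size=(20, S))    # stand-in for optical feature scores
    print(viterbi(log_emit, log_trans, log_init))
```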

80 citations


Journal ArticleDOI
TL;DR: Videotaped lists of meaningless Dutch syllables were presented to four subject groups differing in their knowledge of and experience with lipreading, in order to establish viseme classifications of Dutch vowels and consonants.
Abstract: Videotaped lists of meaningless Dutch syllables were presented in quiet to four subject groups, differing with respect to their knowledge of and experience with lipreading (lipreading expertise). Syllables consisted of all Dutch consonants within three vowel contexts, and of all Dutch vowels within four consonant contexts. Three speakers pronounced all syllable lists. The aim of the research was (1) to establish viseme classifications of Dutch vowels and consonants; (2) to interpret the visual‐perceptual dimensions underlying this classification and relate them to acoustic‐phonetic parameters; (3) to establish the effect of lipreading expertise on the classification of visually similar phonemes (visemes). In general, viseme classification proved very constant with different subject groups: Lipreading expertise is not related to viseme recognition. Important visual features in consonant lipreading are lip articulation, degree of oral cavity opening, and place of articulation, leading to the following viseme classification: /p,b,m/, /f,v,υ/, /s,z,■/, and /t,d,n,j,l,k,x,r,■,h/. In the acoustic domain, these features may be related to spectral differences. Vowel features in lipreading are lip rounding, degree of lip opening, and vowel duration, yielding the following visemes: /i,■,e,e,ei,a,■/, /u,y,œ,■/, /o/,o/, and /au,œy/. In the acoustic domain, lip rounding may roughly be related to the second formant, lip opening to the first formant.
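
Viseme classes of this kind are typically derived by grouping phonemes that lipreaders confuse with one another. The sketch below shows one way such a grouping could be computed from a confusion matrix with hierarchical clustering; the phoneme set, the confusion counts, and the choice of three clusters are assumptions for illustration, not the study's data or analysis.

```python
# Minimal sketch: derive viseme groups by clustering a phoneme confusion
# matrix from a (hypothetical) lipreading test.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

phones = ["p", "b", "m", "f", "v", "t", "d", "n"]
# Toy confusion counts: rows = stimulus phoneme, columns = response.
conf = np.array([
    [20,  9,  8,  1,  0,  1,  0,  1],
    [ 9, 19,  9,  1,  1,  0,  1,  0],
    [ 8,  9, 21,  0,  1,  1,  0,  1],
    [ 1,  1,  0, 22, 14,  1,  0,  1],
    [ 0,  1,  1, 15, 20,  1,  1,  1],
    [ 1,  0,  1,  1,  1, 18, 10,  9],
    [ 0,  1,  0,  0,  1, 11, 19,  8],
    [ 1,  0,  1,  1,  1,  9,  9, 18],
], dtype=float)

# Symmetrise and turn confusability into a distance (more confusion = closer).
p = conf / conf.sum(axis=1, keepdims=True)
sim = (p + p.T) / 2
dist = 1.0 - sim / sim.max()
np.fill_diagonal(dist, 0.0)

groups = fcluster(linkage(squareform(dist, checks=False), method="average"),
                  t=3, criterion="maxclust")
for g in sorted(set(groups)):
    print("viseme class:", [ph for ph, k in zip(phones, groups) if k == g])
```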

32 citations


Proceedings Article
01 Jan 1994
TL;DR: A statistical model of speech is developed that incorporates certain temporal properties of human speech perception that may in principle allow for statistical modeling of speech components that are more relevant for discrimination between candidate utterances during speech recognition.
Abstract: We have developed a statistical model of speech that incorporates certain temporal properties of human speech perception. The primary goal of this work is to avoid a number of current constraining assumptions for statistical speech recognition systems, particularly the model of speech as a sequence of stationary segments consisting of uncorrelated acoustic vectors. A focus on perceptual models may in principle allow for statistical modeling of speech components that are more relevant for discrimination between candidate utterances during speech recognition. In particular, we hope to develop systems that have some of the robust properties of human audition for speech collected under adverse conditions. The outline of this new research direction is given here, along with some preliminary theoretical work.

28 citations


Proceedings ArticleDOI
31 Oct 1994
TL;DR: The goal of automatic lip-sync (ALS) is to translate speech sounds into mouth shapes, and the direct map from sound to shape avoids many language problems associated with SR and provides a unique domain for error correction.
Abstract: The goal of automatic lip-sync (ALS) is to translate speech sounds into mouth shapes. Although this seems related to speech recognition (SR), the direct map from sound to shape avoids many language problems associated with SR and provides a unique domain for error correction. Among other things, ALS animation may be used for animating cartoons realistically and as an aid to the hearing disabled. Currently, a program named Owie performs speaker-dependent ALS for vowels.
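
To make the "direct map from sound to shape" concrete, here is a minimal sketch of one possible vowel-to-mouth-shape lookup based on the first two formants; the formant targets, shape names, and nearest-neighbour rule are illustrative assumptions and are not how Owie works.

```python
# Minimal sketch: pick a mouth shape for a vowel frame from its (F1, F2).
import numpy as np

# Approximate (F1, F2) targets in Hz for a few American English vowels.
VOWEL_FORMANTS = {
    "iy": (270, 2290),   # "beet"
    "ae": (660, 1720),   # "bat"
    "aa": (730, 1090),   # "father"
    "uw": (300,  870),   # "boot"
}
MOUTH_SHAPE = {          # assumed mouth-shape labels for animation
    "iy": "spread_narrow",
    "ae": "open_spread",
    "aa": "wide_open",
    "uw": "rounded_closed",
}

def mouth_shape_for_frame(f1, f2):
    """Return the nearest vowel target in (F1, F2) space and its mouth shape."""
    best = min(VOWEL_FORMANTS,
               key=lambda v: np.hypot(f1 - VOWEL_FORMANTS[v][0],
                                      f2 - VOWEL_FORMANTS[v][1]))
    return best, MOUTH_SHAPE[best]

if __name__ == "__main__":
    print(mouth_shape_for_frame(710, 1150))   # -> ('aa', 'wide_open')
    print(mouth_shape_for_frame(310,  900))   # -> ('uw', 'rounded_closed')
```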

15 citations



Proceedings ArticleDOI
19 Apr 1994
TL;DR: A novel approach for classifying continuous speech into visible mouth-shape related classes (called visemes) is described; an average frame-level viseme recognition rate of 84.7% is a promising result, considering that the test is applied to continuous, multi-speaker, large-vocabulary speech.
Abstract: The paper describes a novel approach for classifying continuous speech into visible mouth-shape related classes (called visemes). The selection and comparison of various acoustic speech features and the use of context information in the classification are addressed. Continuous speech is classified into 9 visible mouth-shape related classes on an acoustic frame basis. Some mouth-shape related acoustic speech signal features are selected as the input to a classifier constructed with a recurrent neural network (RNN). 304 training sentences and 88 testing sentences are chosen from the DARPA TIMIT continuous speech database. The average viseme recognition rate for the test set reaches 84.7% at the frame level, which is a quite promising result considering that the test is applied to continuous, multi-speaker, large-vocabulary speech.
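
For a concrete picture of frame-level viseme classification with a recurrent network, here is a minimal sketch in PyTorch rather than the paper's original RNN; the feature dimension, hidden size, and random training data are assumptions, and the paper's actual features, architecture, and TIMIT setup are not reproduced.

```python
# Minimal sketch: classify per-frame acoustic features into 9 viseme classes.
import torch
import torch.nn as nn

N_VISEMES = 9        # number of visible mouth-shape classes in the paper
FEAT_DIM = 13        # assumed per-frame acoustic feature dimension

class FrameVisemeRNN(nn.Module):
    def __init__(self, feat_dim=FEAT_DIM, hidden=64, n_classes=N_VISEMES):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        h, _ = self.rnn(x)                # (batch, frames, hidden)
        return self.out(h)                # per-frame class logits

if __name__ == "__main__":
    model = FrameVisemeRNN()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    x = torch.randn(4, 100, FEAT_DIM)            # toy feature frames
    y = torch.randint(0, N_VISEMES, (4, 100))    # toy frame labels
    for _ in range(3):                           # tiny illustrative training loop
        logits = model(x)
        loss = loss_fn(logits.reshape(-1, N_VISEMES), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(logits.argmax(dim=-1).shape)           # per-frame viseme predictions
```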

5 citations


Journal ArticleDOI
TL;DR: In this article, visible and articulatory archiphones and diphone "aliases" were formalized, using standard phonological distinctive features, and a system of disemes was created.
Abstract: Realistic on‐screen computer ‘‘assistants’’ need synthetic visible speech that has accurate, pleasing articulation; they also need to run in real‐time on a personal computer. Previous visible speech systems have used 9–32 ‘‘visemes,’’ minimal contrastive units of visible articulation. Viseme‐based animation has been choppy, inaccurate, and insufficiently plastic. In contrast, concatenative speech synthesis utilizes a large inventory of diphone units. Therefore, two improvements are needed: expansion beyond simple visemes, and reduction of the number of diphones so that they may be mapped to the facial images in real‐time. To this end, visible and articulatory archiphones and diphone ‘‘aliases’’ were formalized, using standard phonological distinctive features, and a system of disemes was created. Disemes begin during one viseme (phone) in an archiphonic family and end somewhere during the following viseme (phone), in another archiphonic family. In this way, many transitions that occur in approximately 1800 diphones (for General American English) can be visually depicted by the same diseme, due to their similarity in lip, teeth, and tongue image positions. The effectiveness of mapping these disemes may be demonstrated, using a variety of on‐screen agents and faces.
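
The bookkeeping behind disemes can be illustrated with a small lookup: collapse phones into archiphonic families and name each diphone by its family pair, so that many diphones share one diseme. The family groupings below are assumptions for illustration, not the paper's archiphone inventory.

```python
# Minimal sketch: map diphones to disemes via assumed archiphonic families.
from itertools import product

ARCHIPHONE = {                       # phone -> assumed archiphonic family
    "p": "BILAB", "b": "BILAB", "m": "BILAB",
    "f": "LABDENT", "v": "LABDENT",
    "t": "ALV", "d": "ALV", "n": "ALV", "s": "ALV", "z": "ALV",
    "iy": "SPREAD", "ih": "SPREAD",
    "uw": "ROUND", "ow": "ROUND",
    "aa": "OPEN", "ae": "OPEN",
}

def diseme(phone_a, phone_b):
    """Name a diphone by the archiphonic families of its two phones."""
    return (ARCHIPHONE[phone_a], ARCHIPHONE[phone_b])

if __name__ == "__main__":
    diphones = list(product(ARCHIPHONE, repeat=2))
    disemes = {diseme(a, b) for a, b in diphones}
    print(len(diphones), "diphones ->", len(disemes), "disemes")
    # Two different diphones depicted by the same diseme (same image sequence):
    print(diseme("p", "iy"), "==", diseme("m", "ih"))
```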

3 citations


PatentDOI
TL;DR: The recognition rate of a speech recognition system is improved by compensating for changes in the user's speech that result from factors such as emotion, anxiety or fatigue.
Abstract: The recognition rate of a speech recognition system is improved by compensating for changes in the user's speech that result from factors such as emotion, anxiety or fatigue. A speech signal derived from a user's utterance is modified by a preprocessor and provided to a speech recognition system to improve the recognition rate. The speech signal is modified based on a bio-signal which is indicative of the user's emotional state.
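
As a rough illustration of conditioning a speech preprocessor on a bio-signal, here is a minimal sketch assuming a scalar "stress" level that adjusts gain and pre-emphasis before recognition; the specific mapping is an illustrative assumption, not the patent's method.

```python
# Minimal sketch: adjust a speech signal using a bio-signal stress level.
import numpy as np

def preprocess(speech, bio_level, alpha_base=0.97):
    """speech: 1-D float array; bio_level: 0.0 (calm) .. 1.0 (stressed)."""
    # Assumed compensation: attenuate and de-emphasise stressed speech slightly
    # before handing it to the recognizer.
    gain = 1.0 - 0.3 * bio_level
    alpha = alpha_base * (1.0 - 0.2 * bio_level)
    emphasized = np.append(speech[0], speech[1:] - alpha * speech[:-1])
    return gain * emphasized

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    frame = rng.normal(size=16000)          # stand-in for one second of audio
    print(np.std(preprocess(frame, 0.0)), np.std(preprocess(frame, 0.9)))
```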