Topic

Viseme

About: Viseme is a research topic. Over the lifetime, 865 publications have been published within this topic, receiving 17,889 citations.


Papers
01 Jan 1989
TL;DR: This thesis discusses vowel recognition in continuous speech using a Gaussian classifier, a neural network, and hidden Markov models.
Abstract: Table of contents of the thesis "Vowel Recognition in Continuous Speech" (11/14/89):
Chapter 1, Speech Understanding: The Speech Recognition Problem; Speaker Related Systems; Continuous Speech; Vocabulary Size; The Vowel Classifier.
Chapter 2, Phonetics: Phoneme Variability; Coarticulation; Spectrogram Reading.
Chapter 3, Production, Acoustics and Perception of Vowels: Source-Filter Theory; The Source; The Filter; Vowel Production; Diphthongs; Semi-Vowels; Vowel Nasalization; Vowel Acoustics; Effects of Coarticulation; Vowel Perception; Automatic Vowel Recognition; Summary.
Chapter 4, System Implementation: General Description; Database; Feature Sets; Linear Predictive Coding; Spectral Moments; Median Value; Formants and Fundamental Frequency; Vowel Extraction; Preclassification; Maximum Likelihood; Neural Network; Dynamic Classification Using Hidden Markov Models.
Chapter 5, Results and Conclusions: Database Size; Vowel Separability; Feature Set; Preclassification Results; Understanding Classification Errors; Dynamic Classification; Average Center Values; Three-Frame Sampling; Projecting Results; Conclusions; Further Studies.
Chapter 6, User Documentation: Building the Database; Designing the Neural Network; Designing the Gaussian Classifier; Designing the Hidden Markov Model; Extra Useful Routines.
References; Appendices A-E (Appendix E: Glossary).
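The thesis compares a maximum-likelihood (Gaussian) classifier, a neural network, and hidden Markov models as vowel classifiers. As a rough illustration of the first of these, here is a minimal Python sketch of a maximum-likelihood Gaussian classifier over formant-style features; the class and variable names and the [F1, F2] feature layout are assumptions for illustration, not the thesis's actual code.

```python
import numpy as np

class GaussianVowelClassifier:
    """One full-covariance Gaussian per vowel class; classification picks the
    class with the highest log-likelihood. A generic sketch of the technique."""

    def fit(self, X, y):
        # X: (n_samples, n_features), e.g. [F1, F2] formant pairs in Hz (assumed layout)
        self.classes_ = np.unique(y)
        self.params_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            cov = np.cov(Xc, rowvar=False)
            self.params_[c] = (mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1])
        return self

    def predict(self, X):
        scores = []
        for c in self.classes_:
            mu, inv_cov, log_det = self.params_[c]
            d = X - mu
            # Gaussian log-likelihood up to a shared constant:
            # -0.5 * (log|Sigma| + d^T Sigma^{-1} d)
            scores.append(-0.5 * (log_det + np.einsum('ij,jk,ik->i', d, inv_cov, d)))
        return self.classes_[np.argmax(np.stack(scores), axis=0)]
```

Equal class priors are assumed; with unequal priors one would add log P(class) to each class score.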

1 citations

01 Jan 2012
TL;DR: The experimental results indicate that the new Visual Speech Unit concept achieves a 90% recognition rate when the system is applied to the identification of 60 classes of VSUs, whereas the recognition rate for the standard set of MPEG-4 visemes is only 52%.
Abstract: In this paper we propose a new learning-based representation, referred to as the Visual Speech Unit (VSU), for visual speech recognition (VSR). The new Visual Speech Unit concept extends the standard viseme model currently applied for VSR by including in the representation not only the data associated with the visemes, but also the transitory information between consecutive visemes. The developed speech recognition system consists of several computational stages: (a) lips segmentation, (b) construction of the Expectation-Maximization Principal Component Analysis (EM-PCA) manifolds from the input video image, (c) registration between the models of the VSUs and the EM-PCA data constructed from the input image sequence, and (d) recognition of the VSUs using a standard Hidden Markov Model (HMM) classification scheme. In this paper we were particularly interested in evaluating the classification accuracy obtained for our new VSU models compared with that attained for standard (MPEG-4) viseme models. The experimental results indicate that we achieved a 90% recognition rate when the system was applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only 52%.
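Stage (d) is a standard maximum-likelihood HMM decision: one HMM is trained per VSU class, and recognition selects the class whose model assigns the observed feature sequence the highest log-likelihood. The sketch below shows that stage using the third-party hmmlearn library; the function names, the 3-state topology, and diagonal covariances are illustrative assumptions, and the paper's stages (a)-(c) are abstracted into the per-frame feature arrays.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # third-party HMM library

def train_vsu_models(train_seqs, n_states=3):
    """train_seqs: dict mapping a VSU label to a list of (T_i, d) arrays of
    per-frame features (e.g. EM-PCA coefficients). Trains one HMM per class."""
    models = {}
    for label, seqs in train_seqs.items():
        X = np.vstack(seqs)               # concatenate all training sequences
        lengths = [len(s) for s in seqs]  # hmmlearn needs per-sequence lengths
        model = GaussianHMM(n_components=n_states, covariance_type="diag")
        models[label] = model.fit(X, lengths)
    return models

def recognize_vsu(models, features):
    """Classify a (T, d) feature sequence as the VSU whose HMM scores it highest."""
    return max(models, key=lambda label: models[label].score(features))
```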

1 citations

Proceedings Article
01 Jan 1997
TL;DR: This paper quantifies the geometry of speech turbulence, as reflected in the fragmentation of the time signal, using fractal models; it describes an algorithm for estimating the short-time fractal dimension of speech signals based on multiscale morphological filtering and discusses its potential for phonetic classification.
Abstract: The dynamics of airflow during speech production may often result in some small or large degree of turbulence. In this paper, we quantify the geometry of speech turbulence, as reflected in the fragmentation of the time signal, by using fractal models. We describe an efficient algorithm for estimating the short-time fractal dimension of speech signals based on multiscale morphological filtering and discuss its potential for phonetic classification. We also report experimental results on using the short-time fractal dimension of speech signals at multiple scales as additional features in an automatic speech recognition system using hidden Markov models, which provides a modest improvement in speech recognition performance.
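The covering construction behind such an algorithm is compact enough to sketch: dilate and erode the waveform with flat structuring elements of increasing scale s, take the area A(s) between the two envelopes, and estimate the fractal dimension D as the slope of log(A(s)/s^2) against log(1/s). The Python sketch below is a generic version of this morphological covering estimate, not the authors' implementation; the scale range is an arbitrary choice.

```python
import numpy as np
from scipy.ndimage import maximum_filter1d, minimum_filter1d

def fractal_dimension(x, scales=(1, 2, 3, 4, 5, 6)):
    """Estimate the fractal (Minkowski-Bouligand) dimension of a 1-D signal
    by the morphological covering method: a generic sketch of the technique."""
    log_inv_s, log_area = [], []
    for s in scales:
        size = 2 * s + 1  # flat structuring element of half-width s
        # area between the dilated and eroded envelopes at scale s
        cover = maximum_filter1d(x, size) - minimum_filter1d(x, size)
        log_inv_s.append(np.log(1.0 / s))
        log_area.append(np.log(cover.sum() / s**2))
    # slope of log(A(s)/s^2) vs log(1/s) estimates the dimension D
    D, _ = np.polyfit(log_inv_s, log_area, 1)
    return D
```

For the short-time features used in the paper, an estimate like this would be computed over sliding analysis frames and appended to the recognizer's feature vector.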

1 citations

Proceedings Article
01 Jan 1995

1 citations

Journal Article (DOI)
TL;DR: Fifteen hours of training with each of these skilled lipreaders in the LA condition suggest that the consonants /s,l,TH/ are reliably identified through lipreading alone; of the nonvisible consonants, /t,d,k,g,n,y/ remain troublesome, and a processor optimized specifically to enhance those contrasts might prove the better speechreading aid.
Abstract: The homorganic obstruent pairs /p-b, t-d, k-g, ch-j, f-v, th-TH, s-z, sh-zh/ are notoriously confusable in the lipreading alone (LA) condition, and the nasal consonants /m/ and /n/ are often mistaken for their homorganic oral counterparts. Implant users' speechreading of these viseme group members is improved by the addition of either multiple-channel or single-channel electrical stimulation. The palatal obstruent distinctions /sh, zh, ch, j/ have been targeted for remediation via other speechreading aids. One approach to implant sound processor setting involves optimizing to distinguish the nonvisible frequently occurring consonants /t,d,k,s,z,n,l,TH/ (E. Schubert, personal communication). This optimization method resulted in significant speechreading improvement and even some open speech comprehension without lipreading for one deaf patient (M. White, personal communication). During a 50-week consonant training program conducted in our laboratory, two experimental subjects spent seven sessions identifying consonants in the LA condition, and five subsequent sessions identifying the same consonants in the stimulation plus lipreading condition, aided by a single-channel sound processor. Our 15 hours of training with each of these skilled lipreaders in the LA condition suggest that the consonants /s,l,TH/ are quite reliably identified through lipreading alone. Of the consonants not visible on the lips, /t,d,k,g,n,y/ are the troublesome contrasts. The single-channel sound processor provides some help in disambiguating these six consonants for deaf speechreaders, although a processor optimized to enhance specifically these contrasts might prove to be the better speechreading aid.
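The confusions described above track viseme-group structure: consonants sharing a place of articulation are nearly indistinguishable on the lips, which is why the homorganic pairs and the nasals pattern together. Below is a small Python sketch of such a grouping, usable for scoring lipreading-alone confusions; the exact group boundaries are an illustrative assumption rather than the article's published sets.

```python
# Consonants grouped into viseme classes by place of articulation
# (illustrative grouping, not taken verbatim from the article).
VISEME_GROUPS = {
    "bilabial":    {"p", "b", "m"},
    "labiodental": {"f", "v"},
    "dental":      {"th", "TH"},   # voiceless/voiced dental fricatives
    "alveolar":    {"t", "d", "n", "s", "z", "l"},
    "palatal":     {"sh", "zh", "ch", "j"},
    "velar":       {"k", "g"},
}

def same_viseme(a, b):
    """True if two consonants fall in the same viseme group, i.e. a
    lipreading-alone confusion between them is expected."""
    return any(a in group and b in group for group in VISEME_GROUPS.values())

# e.g. same_viseme("p", "b") -> True, same_viseme("s", "sh") -> False
```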

1 citations


Network Information
Related Topics (5)
Vocabulary: 44.6K papers, 941.5K citations, 78% related
Feature vector: 48.8K papers, 954.4K citations, 76% related
Feature extraction: 111.8K papers, 2.1M citations, 75% related
Feature (computer vision): 128.2K papers, 1.7M citations, 74% related
Unsupervised learning: 22.7K papers, 1M citations, 73% related
Performance Metrics
No. of papers in the topic in previous years:
Year / Papers
2023: 7
2022: 12
2021: 13
2020: 39
2019: 19
2018: 22