scispace - formally typeset
Topic

Viseme

About: Viseme is a research topic. Over its lifetime, 865 publications have appeared on this topic, receiving 17,889 citations.


Papers
Proceedings ArticleDOI
15 Jun 2005
TL;DR: A new method for mapping natural speech to lip shape animation in real time using neural networks that eliminates the need for tedious manual neural network design by trial and error and considerably improves the viseme classification results.
Abstract: In this paper we present a new method for mapping natural speech to lip shape animation in real time. The speech signal, represented by MFCC vectors, is classified into viseme classes using neural networks. The topology of the neural networks is automatically configured using genetic algorithms. This eliminates the need for tedious manual neural network design by trial and error and considerably improves the viseme classification results. The method works in both real-time and offline modes and is suitable for various applications. We therefore propose new multimedia services for mobile devices based on the described lip-sync system.

6 citations
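The classification step described above can be sketched as a small feed-forward pass from an MFCC frame to a viseme class. This is a minimal illustration, not the authors' system: the layer size, weights, and class count below are hypothetical stand-ins, and in the paper the topology would be chosen by a genetic algorithm and the weights learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MFCC = 12      # MFCC coefficients per frame (a typical choice)
N_VISEMES = 5    # number of viseme classes (illustrative)

# Hypothetical weights; in the paper these would be trained, and the
# hidden-layer size chosen automatically by a genetic algorithm.
W1 = rng.normal(size=(N_MFCC, 8))
b1 = np.zeros(8)
W2 = rng.normal(size=(8, N_VISEMES))
b2 = np.zeros(N_VISEMES)

def classify_frame(mfcc: np.ndarray) -> int:
    """Return the viseme class index for one MFCC frame."""
    h = np.tanh(mfcc @ W1 + b1)      # hidden layer
    logits = h @ W2 + b2             # output layer, one score per viseme
    return int(np.argmax(logits))    # winning viseme class

frame = rng.normal(size=N_MFCC)      # stand-in for a real MFCC frame
viseme = classify_frame(frame)
```

In a real-time setting this pass would run per audio frame, with the predicted viseme driving the lip-shape animation.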

Journal ArticleDOI
TL;DR: An automatic multimodal translation system suitable for video mail or automatic dubbing into other languages; only the image of the speech organs is substituted with a synthesized one, generated by a 3D wire-frame model adaptable to any speaker.
Abstract: We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it to the translated speech. This system also introduces both a face synthesis technique that can generate any viseme lip shape and a face tracking technique that can estimate the original position and rotation of a speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the speech organ's image with the synthesized one, which is made by a 3D wire-frame model that is adaptable to any speaker. Our approach provides translated image synthesis with an extremely small database. The tracking motion of the face from a video image is performed by template matching. In this system, the translation and rotation of the face are detected by using a 3D personal face model whose texture is captured from a video frame. We also propose a method to customize the personal face model by using our GUI tool. By combining these techniques and the translated voice synthesis technique, an automatic multimodal translation can be achieved that is suitable for video mail or automatic dubbing systems into other languages.

6 citations
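The face-tracking step above relies on template matching against the video frame. A minimal sketch of that idea, assuming grayscale arrays and a simple sum-of-squared-differences (SSD) search (the paper's actual matcher and 3D-model-based pose estimation are more involved):

```python
import numpy as np

def match_template(frame: np.ndarray, template: np.ndarray):
    """Return the (row, col) offset in `frame` that best matches `template`."""
    th, tw = template.shape
    fh, fw = frame.shape
    best, best_pos = float("inf"), (0, 0)
    for y in range(fh - th + 1):
        for x in range(fw - tw + 1):
            patch = frame[y:y + th, x:x + tw]
            ssd = float(np.sum((patch - template) ** 2))  # dissimilarity
            if ssd < best:
                best, best_pos = ssd, (y, x)
    return best_pos

rng = np.random.default_rng(1)
frame = rng.random((20, 20))            # synthetic stand-in for a video frame
template = frame[5:9, 7:11].copy()      # template cut from a known location
print(match_template(frame, template))  # → (5, 7)
```

Tracking the detected offset across frames gives the face translation; rotation, as described above, requires fitting the textured 3D personal face model.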

01 Sep 2000
TL;DR: The robust technique of Support Vector Machines is applied for learning a regression function from a sparse subset of Haar coefficients to the LMM parameters to bypass current computationally intensive methods for matching objects to morphable models.
Abstract: This paper describes a method for estimating the parameters of a linear morphable model (LMM) that models mouth images. The method uses a learning-based approach to estimate the LMM parameters directly from the images of the object class (in this case mouths). Thus this method can be used to bypass current computationally intensive methods that use analysis by synthesis for matching objects to morphable models. We have used the invariance properties of Haar wavelets for representing mouth images. We apply the robust technique of Support Vector Machines (SVM) for learning a regression function from a sparse subset of Haar coefficients to the LMM parameters. The estimation of LMM parameters could possibly have application to other problems in vision. We investigate one such application, namely viseme recognition.

6 citations
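The Haar representation that feeds the SVM regression above can be illustrated with a 1-D Haar decomposition; the SVM itself is omitted here, and the paper works on 2-D mouth images rather than this toy 1-D signal. A sketch of the coefficient computation:

```python
import numpy as np

def haar_1d(signal: np.ndarray) -> np.ndarray:
    """Full 1-D Haar decomposition (signal length must be a power of two)."""
    out = signal.astype(float).copy()
    n = len(out)
    while n > 1:
        half = n // 2
        avg = (out[:n:2] + out[1:n:2]) / 2.0   # approximation coefficients
        diff = (out[:n:2] - out[1:n:2]) / 2.0  # detail coefficients
        out[:half], out[half:n] = avg, diff
        n = half
    return out

# A constant signal has only one nonzero Haar coefficient (its mean),
# which is why a sparse subset of coefficients can carry the signal.
coeffs = haar_1d(np.array([3.0, 3.0, 3.0, 3.0]))
print(coeffs)  # → [3. 0. 0. 0.]
```

Sparsity of the detail coefficients is what makes regressing LMM parameters from a small subset of them plausible.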

Proceedings ArticleDOI
Cyril Allauzen1, Michael Riley1
25 Aug 2013
TL;DR: A modified composition algorithm is described that is used for combining two finite-state transducers, representing the context-dependent lexicon and the language model respectively, in large vocabulary speech recognition.
Abstract: This paper describes a modified composition algorithm that is used for combining two finite-state transducers, representing the context-dependent lexicon and the language model respectively, in large vocabulary speech recognition. This algorithm is a hybrid between the static and dynamic expansion of the resultant transducer, which maps from context-dependent phones to words and is searched during decoding. The approach is to pre-compute part of the recognition transducer and leave the balance to be expanded during decoding. This method allows for a fine-grained trade-off between space and time in recognition. For example, the time overhead of purely dynamic expansion can be reduced by over six-fold with only a 20% increase in memory in a collection of large-vocabulary recognition tasks available on the Google Android platform.

6 citations
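The dynamic-expansion side of the hybrid above can be sketched as on-demand composition: states of the composed transducer are pairs that are expanded only when reached, and cached thereafter. The two toy transducers below are illustrative stand-ins, not a real lexicon or language model, and this sketch ignores weights and epsilon transitions entirely.

```python
from functools import lru_cache

# arcs: state -> list of (input_label, output_label, next_state)
T1 = {0: [("a", "x", 1)], 1: [("b", "y", 2)], 2: []}   # maps "ab" -> "xy"
T2 = {0: [("x", "X", 1)], 1: [("y", "Y", 2)], 2: []}   # maps "xy" -> "XY"

@lru_cache(maxsize=None)
def expand(s1: int, s2: int):
    """Arcs leaving composed state (s1, s2); computed on demand, then cached."""
    arcs = []
    for i1, o1, n1 in T1[s1]:
        for i2, o2, n2 in T2[s2]:
            if o1 == i2:                    # T1's output feeds T2's input
                arcs.append((i1, o2, (n1, n2)))
    return arcs

def transduce(inputs: str):
    """Run the lazily composed machine T1 . T2 over an input sequence."""
    state, outputs = (0, 0), []
    for sym in inputs:
        matches = [(o, n) for (i, o, n) in expand(*state) if i == sym]
        if not matches:
            return None                     # no matching arc: rejected
        o, state = matches[0]
        outputs.append(o)
    return "".join(outputs)

print(transduce("ab"))  # → XY
```

Pre-computing `expand` for frequently visited state pairs, and leaving the rest to the cached on-demand path, is the spirit of the paper's static/dynamic trade-off.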


Network Information
Related Topics (5)
Vocabulary: 44.6K papers, 941.5K citations (78% related)
Feature vector: 48.8K papers, 954.4K citations (76% related)
Feature extraction: 111.8K papers, 2.1M citations (75% related)
Feature (computer vision): 128.2K papers, 1.7M citations (74% related)
Unsupervised learning: 22.7K papers, 1M citations (73% related)
Performance
Metrics
No. of papers in the topic in previous years
Year   Papers
2023   7
2022   12
2021   13
2020   39
2019   19
2018   22