Topic
Viseme
About: Viseme is a research topic. Over its lifetime, 865 publications have been published within this topic, receiving 17,889 citations.
Papers published on a yearly basis
Papers
15 Jun 2005
TL;DR: A new method maps natural speech to lip-shape animation in real time using neural networks; genetic algorithms configure the network topology automatically, eliminating tedious manual design by trial and error and considerably improving viseme classification.
Abstract: In this paper we present a new method for mapping natural speech to lip-shape animation in real time. The speech signal, represented by MFCC vectors, is classified into viseme classes using neural networks. The topology of the neural networks is configured automatically using genetic algorithms, which eliminates the need for tedious manual network design by trial and error and considerably improves the viseme classification results. The method runs in both real-time and offline modes and is suitable for various applications; we therefore propose new multimedia services for mobile devices based on the described lip-sync system.
6 citations
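As a rough illustration of the classification step described above, the sketch below trains a small one-hidden-layer network to map MFCC-like vectors to viseme classes. Everything here is synthetic and illustrative: the data is randomly generated rather than real speech, and the fixed hidden-layer size stands in for the paper's genetic-algorithm topology search.

```python
import numpy as np

# Hypothetical sketch (not the paper's system): a one-hidden-layer network
# maps 13-dim MFCC-like vectors to one of 5 viseme classes.
rng = np.random.default_rng(0)
N_MFCC, N_HIDDEN, N_VISEMES = 13, 8, 5

# Synthetic "MFCC" frames: each viseme class clusters around a random mean.
means = rng.normal(size=(N_VISEMES, N_MFCC))
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(40, N_MFCC)) for m in means])
y = np.repeat(np.arange(N_VISEMES), 40)

W1 = rng.normal(scale=0.1, size=(N_MFCC, N_HIDDEN))
W2 = rng.normal(scale=0.1, size=(N_HIDDEN, N_VISEMES))

def forward(X):
    h = np.tanh(X @ W1)                               # hidden activations
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)        # softmax probabilities

# Plain batch gradient descent on cross-entropy loss.
for _ in range(500):
    h, p = forward(X)
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1                   # dL/dlogits
    grad /= len(y)
    W2 -= 0.5 * (h.T @ grad)
    W1 -= 0.5 * (X.T @ ((grad @ W2.T) * (1 - h ** 2)))

_, p = forward(X)
accuracy = (p.argmax(axis=1) == y).mean()
```

In the actual system the per-frame class posteriors would then drive lip-shape animation; here `accuracy` on the training clusters is only a sanity check that the mapping is learnable.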
TL;DR: An automatic multimodal translation suitable for video mail or automatic dubbing into other languages is achieved by substituting only the image of the speech organs with a synthesized one, generated by a 3D wire-frame model adaptable to any speaker.
Abstract: We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it to the translated speech. The system combines a face synthesis technique that can generate any viseme lip shape with a face tracking technique that estimates the original position and rotation of the speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the image of the speech organs with a synthesized one, generated by a 3D wire-frame model adaptable to any speaker. Our approach provides translated image synthesis with an extremely small database. The face is tracked across video frames by template matching, and its translation and rotation are detected using a 3D personal face model whose texture is captured from a video frame. We also propose a method to customize the personal face model with our GUI tool. Combining these techniques with translated voice synthesis yields an automatic multimodal translation suitable for video mail or automatic dubbing into other languages.
6 citations
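The template-matching step used for face tracking above can be sketched as exhaustive normalized cross-correlation over a frame. This is a minimal, assumption-laden toy: the frame and template are random arrays, and a real system would track a textured 3D personal face model rather than a raw 2D patch.

```python
import numpy as np

# Hypothetical sketch of template matching: slide the template over the
# frame and score each position by normalized cross-correlation (NCC).
def match_template(frame, template):
    th, tw = template.shape
    t = template - template.mean()
    best, best_pos = -np.inf, (0, 0)
    for i in range(frame.shape[0] - th + 1):
        for j in range(frame.shape[1] - tw + 1):
            patch = frame[i:i + th, j:j + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p * p).sum() * (t * t).sum())
            score = (p * t).sum() / denom if denom > 0 else 0.0
            if score > best:
                best, best_pos = score, (i, j)
    return best_pos, best

rng = np.random.default_rng(1)
frame = rng.random((40, 40))            # synthetic video frame
template = frame[12:20, 25:33].copy()   # patch cut from a known location
pos, score = match_template(frame, template)
# pos recovers (12, 25); score is ~1.0 for an exact match
```

NCC is invariant to brightness offsets and contrast scaling of the patch, which is why it is a common choice for this kind of tracking; production systems use FFT-based correlation rather than this exhaustive double loop.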
01 Sep 2000
TL;DR: The robust technique of Support Vector Machines is applied to learn a regression function from a sparse subset of Haar coefficients to the LMM parameters, bypassing current computationally intensive analysis-by-synthesis methods for matching objects to morphable models.
Abstract: This paper describes a method for estimating the parameters of a linear morphable model (LMM) that models mouth images. The method uses a learning-based approach to estimate the LMM parameters directly from images of the object class (in this case, mouths). Thus this method can be used to bypass current computationally intensive methods that use analysis by synthesis for matching objects to morphable models. We have used the invariance properties of Haar wavelets for representing mouth images. We apply the robust technique of Support Vector Machines (SVM) for learning a regression function from a sparse subset of Haar coefficients to the LMM parameters. The estimation of LMM parameters could possibly have application to other problems in vision. We investigate one such application, namely viseme recognition.
6 citations
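A minimal sketch of the regression idea above, assuming scikit-learn's `SVR` as the SVM regressor: Haar coefficients of a synthetic 1-D "mouth" signal are regressed onto a single latent parameter standing in for an LMM parameter. All data and names here are illustrative, not the paper's actual images or model.

```python
import numpy as np
from sklearn.svm import SVR

def haar_coeffs(signal):
    """One level of the 1-D Haar transform: pairwise averages and differences."""
    s = signal.reshape(-1, 2)
    avg = (s[:, 0] + s[:, 1]) / np.sqrt(2)
    diff = (s[:, 0] - s[:, 1]) / np.sqrt(2)
    return np.concatenate([avg, diff])

rng = np.random.default_rng(2)
# Synthetic stand-in data: each 16-sample "mouth" signal is controlled by
# one latent parameter p (think "mouth openness"), which we try to recover.
params = rng.uniform(0, 1, size=200)
signals = np.array([p * np.sin(np.linspace(0, np.pi, 16))
                    + rng.normal(scale=0.01, size=16) for p in params])

# Sparse subset of Haar coefficients as features, as in the paper's idea.
X = np.array([haar_coeffs(s)[:4] for s in signals])

svr = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X[:150], params[:150])
pred = svr.predict(X[150:])
mae = np.abs(pred - params[150:]).mean()
```

The appeal of this direct-regression route is exactly what the abstract claims: one cheap feature transform plus one learned function, instead of an iterative analysis-by-synthesis fit per image.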
01 Apr 2012
6 citations
25 Aug 2013
TL;DR: A modified composition algorithm combines two finite-state transducers, representing the context-dependent lexicon and the language model respectively, for large-vocabulary speech recognition.
Abstract: This paper describes a modified composition algorithm that is used for combining two finite-state transducers, representing the context-dependent lexicon and the language model respectively, in large-vocabulary speech recognition. The algorithm is a hybrid between static and dynamic expansion of the resultant transducer, which maps from context-dependent phones to words and is searched during decoding. The approach is to pre-compute part of the recognition transducer and leave the balance to be expanded during decoding, allowing a fine-grained trade-off between space and time in recognition. For example, the time overhead of purely dynamic expansion can be reduced more than six-fold with only a 20% increase in memory in a collection of large-vocabulary recognition tasks available on the Google Android platform.
6 citations
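The transducer composition described above can be sketched as a product construction: states of the result are pairs of states, and an arc exists where an output label of the first transducer matches an input label of the second. The toy below builds the full product eagerly (the paper's contribution is precisely to expand only part of it aheadad of decoding) and omits epsilon-label handling for brevity; labels and states are invented for illustration.

```python
# Hypothetical sketch of finite-state transducer composition.
# Transducers are dicts: state -> list of (in_label, out_label, next_state).
def compose(t1, t2):
    arcs, stack, seen = {}, [(0, 0)], {(0, 0)}
    while stack:
        s1, s2 = stack.pop()
        arcs[(s1, s2)] = []
        for i1, o1, n1 in t1.get(s1, []):
            for i2, o2, n2 in t2.get(s2, []):
                if o1 == i2:  # matching labels fuse into one arc
                    arcs[(s1, s2)].append((i1, o2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
    return arcs

# Toy example: t1 maps context-dependent phones to phones,
# t2 maps phones to words ("-" marks a word-internal placeholder).
t1 = {0: [("k/a_t", "k", 1)], 1: [("a/k_t", "a", 2)], 2: [("t/a_#", "t", 3)]}
t2 = {0: [("k", "-", 1)], 1: [("a", "-", 2)], 2: [("t", "cat", 3)]}
result = compose(t1, t2)
# The single path through `result` reads context-dependent phones on the
# input side and emits the word "cat" on its final arc.
```

The hybrid scheme in the paper would run this product construction offline for the dense, frequently visited part of the machine and lazily on demand for the rest, which is where the reported space/time trade-off comes from.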