Topic

Viseme

About: Viseme is a research topic. A viseme is the visual counterpart of a phoneme: the facial and lip configuration that accompanies a unit of speech. Over the lifetime of the topic, 865 publications have been published, receiving 17,889 citations.


Papers
Posted Content
TL;DR: In this article, a three-stage Long Short-Term Memory (LSTM) network architecture is proposed to produce animator-centric speech motion curves that drive a JALI or standard FACS-based production face-rig, directly from input audio.
Abstract: We present a novel deep-learning-based approach to producing animator-centric speech motion curves that drive a JALI or standard FACS-based production face-rig directly from input audio. Our three-stage Long Short-Term Memory (LSTM) network architecture is motivated by psycho-linguistic insights: segmenting speech audio into a stream of phonetic groups is sufficient for viseme construction; speech styles like mumbling or shouting are strongly correlated with the motion of facial landmarks; and animator style is encoded in viseme motion-curve profiles. Our contribution is a solution for automatic, real-time lip synchronization from audio that integrates seamlessly into existing animation pipelines. We evaluate our results by cross-validation against ground-truth data, animator critique and edits, visual comparison to recent deep-learning lip-synchronization solutions, and by showing our approach to be resilient to diversity in speaker and language.

29 citations
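
The three-stage decomposition described above maps naturally onto a stacked recurrent model. The following PyTorch sketch is illustrative only, not the authors' network: the feature sizes, the softmax phonetic-group posteriors, and the exact wiring of the style stream into the curve-generating LSTM are all assumptions made for the example.

import torch
import torch.nn as nn

class ThreeStageLipSync(nn.Module):
    # Hypothetical pipeline in the spirit of the abstract:
    # audio -> phonetic groups -> speech style -> viseme motion curves.
    def __init__(self, n_audio=26, n_groups=12, n_style=8, n_curves=30, hidden=128):
        super().__init__()
        self.group_lstm = nn.LSTM(n_audio, hidden, batch_first=True)
        self.to_groups = nn.Linear(hidden, n_groups)    # phonetic-group posteriors
        self.style_lstm = nn.LSTM(n_audio, hidden, batch_first=True)
        self.to_style = nn.Linear(hidden, n_style)      # mumbling/shouting style code
        self.curve_lstm = nn.LSTM(n_groups + n_style, hidden, batch_first=True)
        self.to_curves = nn.Linear(hidden, n_curves)    # rig-ready motion curves

    def forward(self, audio):                           # audio: (batch, frames, n_audio)
        g, _ = self.group_lstm(audio)
        groups = torch.softmax(self.to_groups(g), dim=-1)
        s, _ = self.style_lstm(audio)
        style = self.to_style(s)
        c, _ = self.curve_lstm(torch.cat([groups, style], dim=-1))
        return self.to_curves(c)                        # one curve vector per audio frame

curves = ThreeStageLipSync()(torch.randn(1, 100, 26))   # 100 audio frames in, 100 curve frames out

Each output frame would then be mapped onto JALI or FACS rig controls by the animation pipeline.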

28 Aug 2006
TL;DR: In this paper, a trainable trajectory formation system for facial animation is proposed that dissociates the parametric spaces and methods for movement planning and execution; planning is achieved by HMM-based trajectory formation.
Abstract: A new trainable trajectory formation system for facial animation is proposed that dissociates the parametric spaces and methods for movement planning and execution. Movement planning is achieved by HMM-based trajectory formation; movement execution is performed by concatenation of multi-represented diphones. Planning ensures that the essential visual characteristics of visemes are reached (lip closing for bilabials, rounding and opening for palatal fricatives, etc.) and that appropriate coarticulation is planned. Execution grafts phonetic details and idiosyncratic articulatory strategies (dissymmetries, the importance of jaw movements, etc.) onto the planned gestural score.

28 citations
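
The planning/execution split is the interesting design choice here: planning only has to hit the essential visual targets, while execution supplies phonetic detail. Below is a toy Python sketch of that split, with invented viseme targets standing in for the HMM planner and linear interpolation standing in for diphone concatenation:

import numpy as np

VISEME_TARGETS = {                                 # invented example values
    "b": {"aperture": 0.0, "rounding": 0.3},       # bilabial: lips must close
    "S": {"aperture": 0.4, "rounding": 0.9},       # palatal fricative: rounded
    "a": {"aperture": 1.0, "rounding": 0.2},       # open vowel
}

def plan(phonemes):
    # Planning stage (stand-in for HMM trajectory formation): look up the
    # essential visual target each viseme must reach.
    return [VISEME_TARGETS[p] for p in phonemes]

def execute(targets, frames_per_segment=10):
    # Execution stage (stand-in for diphone concatenation): interpolate
    # between successive targets; only the lip-aperture channel is shown.
    segments = [np.linspace(a["aperture"], b["aperture"], frames_per_segment)
                for a, b in zip(targets, targets[1:])]
    return np.concatenate(segments)

print(execute(plan(["b", "a", "S"])))              # closure -> open -> rounded

In the actual system the execution stage would graft recorded articulatory detail onto this planned score rather than interpolating linearly.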

Proceedings Article
01 Jan 1994
TL;DR: A statistical model of speech is developed that incorporates certain temporal properties of human speech perception; such perceptual modeling may in principle allow statistical modeling of the speech components that are most relevant for discriminating between candidate utterances during recognition.
Abstract: We have developed a statistical model of speech that incorporates certain temporal properties of human speech perception. The primary goal of this work is to avoid a number of current constraining assumptions for statistical speech recognition systems, particularly the model of speech as a sequence of stationary segments consisting of uncorrelated acoustic vectors. A focus on perceptual models may in principle allow for statistical modeling of speech components that are more relevant for discrimination between candidate utterances during speech recognition. In particular, we hope to develop systems that have some of the robust properties of human audition for speech collected under adverse conditions. The outline of this new research direction is given here, along with some preliminary theoretical work.

28 citations
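
The assumption being dropped, that a segment is a stationary sequence of uncorrelated acoustic vectors, can be made concrete with a toy likelihood comparison. The numbers and the first-order temporal model below are invented for illustration; they are not the paper's model:

import numpy as np
from scipy.stats import norm

x = np.array([0.9, 1.1, 1.3, 1.5])       # a slowly rising feature track

# (a) stationary segment of uncorrelated frames: one Gaussian fits all frames
mu, sigma = x.mean(), x.std(ddof=1)
ll_iid = norm.logpdf(x, mu, sigma).sum()

# (b) first-order temporal model: each frame is predicted from the previous one
rho = 0.9                                 # assumed frame-to-frame correlation
resid = x[1:] - rho * x[:-1]
ll_ar = (norm.logpdf(x[0], mu, sigma)
         + norm.logpdf(resid, resid.mean(), resid.std(ddof=1)).sum())

print(ll_iid, ll_ar)                      # the temporal model fits the ramp far better

A model that captures such frame-to-frame structure can spend its capacity on the dynamics that actually discriminate between candidate utterances.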

Journal Article
TL;DR: Methods for reinforcing visible-speech recognition within the separate-identification framework are outlined, and it is shown that these methods improve the performance of the DI+SI-based system under varying noise-level conditions.
Abstract: In recent years a number of techniques have been proposed to improve the accuracy and robustness of automatic speech recognition in noisy environments. Among these, supplementing the acoustic information with visual data, mostly extracted from the speaker's lip shapes, has proved successful. We have already demonstrated the effectiveness of integrating visual data at two different levels during speech decoding, according to both direct and separate identification strategies (DI+SI). This paper outlines methods for reinforcing visible-speech recognition in the framework of separate identification. First, we define visual-specific units using a self-organizing mapping technique. Second, we complete a stochastic learning of these units with a discriminative, neural-network-based technique for speech recognition purposes. Finally, we show on a connected-letter speech recognition task that using these methods improves the performance of the DI+SI-based system under varying noise-level conditions.

28 citations
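
Of the two learning steps above, the self-organizing mapping is the easiest to sketch. Below is a minimal 1-D SOM over stand-in lip-shape feature vectors; the data, map size, and decay schedules are all invented, and the discriminative network and DI+SI decoding stages are omitted:

import numpy as np

rng = np.random.default_rng(0)
lip_feats = rng.normal(size=(500, 4))     # stand-in lip-shape descriptors
units = rng.normal(size=(8, 4))           # 8 prototype "visual units"

for t in range(2000):                     # online SOM training
    x = lip_feats[rng.integers(len(lip_feats))]
    bmu = np.argmin(((units - x) ** 2).sum(axis=1))      # best-matching unit
    lr = 0.1 * (1 - t / 2000)                            # decaying learning rate
    radius = 2.0 * (1 - t / 2000) + 0.5                  # shrinking neighbourhood
    dist = np.abs(np.arange(8) - bmu)                    # distance on the 1-D grid
    h = np.exp(-dist ** 2 / (2 * radius ** 2))
    units += lr * h[:, None] * (x - units)               # pull neighbours toward x

labels = np.argmin(((lip_feats[:, None, :] - units) ** 2).sum(-1), axis=1)
print(np.bincount(labels, minlength=8))   # frames assigned to each visual unit

Each frame's unit label would then feed the discriminative, neural-network-based recognizer described in the abstract.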


Network Information
Related Topics (5)
Topic                        Papers     Citations    Related
Vocabulary                   44.6K      941.5K       78%
Feature vector               48.8K      954.4K       76%
Feature extraction           111.8K     2.1M         75%
Feature (computer vision)    128.2K     1.7M         74%
Unsupervised learning        22.7K      1M           73%
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    7
2022    12
2021    13
2020    39
2019    19
2018    22