
Viseme

About: Viseme is a research topic. Over its lifetime, 865 publications on this topic have received 17,889 citations.


Papers
Proceedings ArticleDOI
01 May 1986
TL;DR: An automated method of synchronizing facial animation to recorded speech is described; it retains intelligibility and natural speech rhythm while achieving a “synthetic realism” consistent with computer animation.
Abstract: An automated method of synchronizing facial animation to recorded speech is described. In this method, a common speech synthesis method (linear prediction) is adapted to provide simple and accurate phoneme recognition. The recognized phonemes are then associated with mouth positions to provide keyframes for computer animation of speech using a parametric model of the human face. The linear prediction software, once implemented, can also be used for speech resynthesis. The synthesis retains intelligibility and natural speech rhythm while achieving a “synthetic realism” consistent with computer animation. Speech synthesis also enables certain useful manipulations for the purpose of computer character animation.

104 citations
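
The pipeline above reduces to a phoneme-recognition stage followed by a phoneme-to-mouth-shape lookup. Below is a minimal sketch of that second stage; the phoneme labels, viseme grouping, and keyframe format are illustrative assumptions, not the paper's actual tables.

```python
# Hypothetical sketch of the phoneme-to-keyframe stage: a many-to-one
# phoneme-to-viseme table and a function that turns timed phoneme
# segments into mouth-pose keyframes. All labels and groupings here
# are assumptions for illustration.

PHONEME_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed",   # bilabials share a pose
    "f": "lip_teeth", "v": "lip_teeth",            # labiodentals
    "aa": "open_wide", "ae": "open_wide",          # open vowels
    "iy": "spread", "ih": "spread",                # spread-lip vowels
    "uw": "rounded", "ow": "rounded",              # rounded vowels
    "sil": "neutral",                              # silence
}

def phonemes_to_keyframes(segments):
    """Map (phoneme, start_sec, end_sec) segments to animation keyframes,
    placing one viseme pose at each segment's midpoint."""
    return [
        {"time": (start + end) / 2.0,
         "pose": PHONEME_TO_VISEME.get(phoneme, "neutral")}
        for phoneme, start, end in segments
    ]

# Example with made-up timings for the word "mow":
print(phonemes_to_keyframes([("m", 0.00, 0.12), ("ow", 0.12, 0.40)]))
# [{'time': 0.06, 'pose': 'closed'}, {'time': 0.26, 'pose': 'rounded'}]
```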

Journal ArticleDOI
TL;DR: The automatic derivation of mouth movement from a speech soundtrack is shown to be a tractable problem; a common speech synthesis method, linear prediction, is adapted to provide simple and accurate phoneme recognition.
Abstract: The problem of creating mouth animation synchronized to recorded speech is discussed. Review of a model of speech sound generation indicates that the automatic derivation of mouth movement from a speech soundtrack is a tractable problem. Several automatic lip-sync techniques are compared, and one method is described in detail. In this method a common speech synthesis method, linear prediction, is adapted to provide simple and accurate phoneme recognition. The recognized phonemes are associated with mouth positions to provide keyframes for computer animation of speech. Experience with this technique indicates that automatic lip-sync can produce useful results.

101 citations
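
For the linear-prediction front end both papers above rely on, a minimal sketch follows, assuming the standard autocorrelation (Yule-Walker) formulation and nearest-template classification. The frame size, LPC order, and reference templates are placeholder choices, not values from the papers.

```python
# A minimal sketch of an LPC front end: estimate prediction coefficients
# per frame and classify each frame by its nearest stored template.
import numpy as np

def lpc_coefficients(frame, order=10):
    """Estimate linear prediction coefficients for one windowed frame."""
    frame = frame * np.hamming(len(frame))        # taper frame edges
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz autocorrelation system R a = r[1..order] (Yule-Walker).
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R + 1e-6 * np.eye(order), r[1:order + 1])

def classify_frame(frame, templates, order=10):
    """Label a frame with the phoneme whose stored LPC template is closest
    in Euclidean distance. `templates` maps phoneme -> coefficient vector."""
    coeffs = lpc_coefficients(frame, order)
    return min(templates, key=lambda ph: np.linalg.norm(coeffs - templates[ph]))
```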

Patent
TL;DR: A method and apparatus for converting input text into an audio-visual speech stream is presented, producing a talking face image that enunciates the text; the stream is rendered in real time, displaying a photo-realistic talking face.
Abstract: A method and apparatus of converting input text into an audio-visual speech stream resulting in a talking face image enunciating the text. This method of converting input text into an audio-visual speech stream comprises the steps of: recording a visual corpus of a human-subject, building a viseme interpolation database, and synchronizing the talking face image with the text stream. In a preferred embodiment, viseme transitions are automatically calculated using optical flow methods, and morphing techniques are employed to result in smooth viseme transitions. The viseme transitions are concatenated together and synchronized with the phonemes according to the timing information. The audio-visual speech stream is then displayed in real time, thereby displaying a photo-realistic talking face.

98 citations
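
The optical-flow-plus-morphing step this patent describes can be approximated with off-the-shelf tools. The sketch below uses OpenCV's Farneback flow and a backward-sampling warp as assumed stand-ins for the patent's unspecified flow and morphing methods.

```python
# Hypothetical approximation of a flow-based viseme morph: estimate dense
# optical flow from viseme image A to viseme image B, warp A partway
# along the flow, and cross-dissolve with B.
import cv2
import numpy as np

def morph_visemes(img_a, img_b, t):
    """Return an in-between mouth image at blend factor t in [0, 1]."""
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Backward-sampling approximation: the output pixel at q takes A's
    # value at q - t * flow(q), moving A a fraction t toward B.
    map_x = (grid_x - t * flow[..., 0]).astype(np.float32)
    map_y = (grid_y - t * flow[..., 1]).astype(np.float32)
    warped = cv2.remap(img_a, map_x, map_y, cv2.INTER_LINEAR)
    return cv2.addWeighted(warped, 1.0 - t, img_b, t, 0)
```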

Patent
TL;DR: A system for learning a mapping between time-varying signals drives facial animation directly from speech, without laborious voice track analysis. Its output is a sequence of facial control parameters suitable for driving animation ranging from warped photorealistic images to 3D cartoon characters.
Abstract: A system for learning a mapping between time-varying signals is used to drive facial animation directly from speech, without laborious voice track analysis. The system learns dynamical models of facial and vocal action from observations of a face and the facial gestures made while speaking. Instead of depending on heuristic intermediate representations such as phonemes or visemes, the system trains hidden Markov models to obtain its own optimal representation of vocal and facial action. An entropy-minimizing training technique using an entropic prior ensures that these models contain sufficient dynamical information to synthesize realistic facial motion to accompany new vocal performances. In addition, they can make optimal use of context to handle ambiguity and relatively long-lasting facial co-articulation effects. The output of the system is a sequence of facial control parameters suitable for driving a variety of different kinds of animation ranging from warped photorealistic images to 3D cartoon characters.

96 citations
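
As a simplified stand-in for the patent's learned mapping, the sketch below fits a Gaussian HMM on acoustic features, attaches mean facial control parameters to each hidden state from aligned training data, and Viterbi-decodes new audio into a facial-parameter sequence. The entropic prior and joint audio-visual training described in the abstract are not reproduced; hmmlearn's GaussianHMM is an assumed substitute.

```python
# Simplified audio-to-face mapping via a Gaussian HMM over vocal features.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_audio_to_face(audio_feats, face_params, n_states=20):
    """audio_feats: (T, d_a) acoustic frames; face_params: (T, d_f) facial
    control parameters aligned frame-by-frame with the audio."""
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    hmm.fit(audio_feats)
    states = hmm.predict(audio_feats)
    # Average facial parameters per state; unused states stay at zero.
    state_faces = np.zeros((n_states, face_params.shape[1]))
    for s in range(n_states):
        mask = states == s
        if mask.any():
            state_faces[s] = face_params[mask].mean(axis=0)
    return hmm, state_faces

def synthesize_face(hmm, state_faces, new_audio_feats):
    """Decode new audio and look up facial parameters per hidden state."""
    return state_faces[hmm.predict(new_audio_feats)]
```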

PatentDOI
TL;DR: A computerized system time aligns frames of spoken training data against models of the speech sounds; automatically selects different sets of phonetic context classifications which divide the speech sound models into speech sound groups aligned against acoustically similar frames; creates model components from the frames aligned againstspeech sound groups with related classifications; and uses these model components to build a separate model for each related speech sound group.
Abstract: A computerized system time aligns frames of spoken training data against models of the speech sounds; automatically selects different sets of phonetic context classifications which divide the speech sound models into speech sound groups aligned against acoustically similar frames; creates model components from the frames aligned against speech sound groups with related classifications; and uses these model components to build a separate model for each related speech sound group. A decision tree classifies speech sounds into such groups, and related speech sound groups descend from common tree nodes. New speech samples time aligned against a given speech sound group's model update models of related speech sound groups, decreasing the training data required to adapt the system. The phonetic context classifications can be based on knowledge of which contextual features are associated with acoustic similarity. The computerized system samples speech sounds using a first, larger, parameter set; automatically selects combinations of phonetic context classifications which divide the speech sounds into groups whose frames are acoustically similar, such as by use of a decision tree; selects a second, smaller, set of parameters based on that set's ability to separate the frames aligned with each speech sound group, such as by use of linear discriminant analysis; and then uses these new parameters to represent frames and speech sound models. Then, using the new parameters, a decision tree classifier can be used to re-classify the speech sounds and to calculate new acoustic models for the resulting groups of speech sounds.

95 citations
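
One concrete step from the abstract above, selecting a second, smaller parameter set by its ability to separate frames aligned with each speech-sound group, can be sketched with linear discriminant analysis. The group labels are assumed to come from the decision-tree context clustering; the tree itself is not reimplemented here, and scikit-learn's LDA is an assumed stand-in.

```python
# Rough sketch of LDA-based parameter-set reduction for speech-sound groups.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def reduce_parameters(frames, group_labels, n_dims=13):
    """frames: (N, D) vectors in the first, larger parameter set;
    group_labels: (N,) speech-sound-group index per frame.
    n_dims must satisfy n_dims <= min(n_groups - 1, D)."""
    lda = LinearDiscriminantAnalysis(n_components=n_dims)
    reduced = lda.fit_transform(frames, group_labels)   # (N, n_dims)
    return reduced, lda   # lda.transform() re-parameterizes new frames
```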


Network Information
Related Topics (5)
Vocabulary: 44.6K papers, 941.5K citations, 78% related
Feature vector: 48.8K papers, 954.4K citations, 76% related
Feature extraction: 111.8K papers, 2.1M citations, 75% related
Feature (computer vision): 128.2K papers, 1.7M citations, 74% related
Unsupervised learning: 22.7K papers, 1M citations, 73% related
Performance Metrics
No. of papers in the topic in previous years:

Year  Papers
2023  7
2022  12
2021  13
2020  39
2019  19
2018  22