Topic
Viseme
About: Viseme is a research topic. Over its lifetime, 865 publications have been published within this topic, receiving 17,889 citations.
Papers published on a yearly basis
Papers
01 Oct 2000
TL;DR: ICSLP2000: the 6th International Conference on Spoken Language Processing, October 16-20, 2000, Beijing, China.
45 citations
TL;DR: Develops time warping and motion-vector blending at the juncture of two divisemes, together with an algorithm that searches for the optimal concatenation of visible speech, to produce the final concatenative motion sequence.
Abstract: We present a technique for accurate automatic visible speech synthesis from textual input. When provided with a speech waveform and the text of a spoken sentence, the system produces accurate visible speech synchronized with the audio signal. To develop the system, we collected motion capture data from a speaker's face during production of a set of words containing all diviseme sequences in English. The motion capture points from the speaker's face are retargeted to the vertices of the polygons of a 3D face model. When synthesizing a new utterance, the system locates the required sequence of divisemes, shrinks or expands each diviseme based on the desired phoneme segment durations in the target utterance, then moves the polygons in the regions of the lips and lower face to correspond to the spatial coordinates of the motion capture data. The motion mapping is realized by a key-shape mapping function learned from a set of viseme examples in the source and target faces. A well-posed numerical algorithm estimates the shape blending coefficients. Time warping and motion vector blending at the juncture of two divisemes, and the algorithm to search for the optimal concatenated visible speech, are also developed to provide the final concatenative motion sequence. Copyright © 2004 John Wiley & Sons, Ltd.
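The abstract above describes stretching or shrinking each diviseme to match target phoneme durations, then blending motion at the juncture of two divisemes. A minimal sketch of those two steps with NumPy, assuming motion sequences are stored as frames × points × 3 arrays and using simple linear resampling and linear cross-fade weights (the paper's actual warping and blending functions are not specified in the abstract):

```python
import numpy as np

def time_warp(seq, target_len):
    """Linearly resample a motion sequence (frames x points x 3)
    to a new frame count, stretching or shrinking the diviseme."""
    src = np.linspace(0.0, 1.0, len(seq))
    dst = np.linspace(0.0, 1.0, target_len)
    out = np.empty((target_len,) + seq.shape[1:])
    for i in range(seq.shape[1]):
        for j in range(seq.shape[2]):
            out[:, i, j] = np.interp(dst, src, seq[:, i, j])
    return out

def blend_junction(a, b, overlap):
    """Cross-fade the last `overlap` frames of diviseme `a` into the
    first `overlap` frames of diviseme `b` with linear weights."""
    w = np.linspace(1.0, 0.0, overlap)[:, None, None]
    mixed = w * a[-overlap:] + (1.0 - w) * b[:overlap]
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])
```

Linear interpolation and linear cross-fades are the simplest choices here; smoother warps (e.g. cubic) would avoid velocity discontinuities at the juncture.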
44 citations
TL;DR: Focusing on speech, the paper follows a kind of bootstrap procedure: 3D shape statistics are learned from a talking face with a relatively small number of markers, as an alternative to simulating facial anatomy.
Abstract: Realistic face animation is especially hard as we are all experts in the perception and interpretation of face dynamics. One approach is to simulate facial anatomy. Alternatively, animation can be based on first observing the visible 3D dynamics, extracting the basic modes, and putting these together according to the required performance. This is the strategy followed by the paper, which focuses on speech. The approach follows a kind of bootstrap procedure. First, 3D shape statistics are learned from a talking face with a relatively small number of markers. A 3D reconstruction is produced at temporal intervals of 1/25 seconds. A topological mask of the lower half of the face is fitted to the motion. Principal component analysis (PCA) of the mask shapes reduces the dimension of the mask shape space. The result is twofold. On the one hand, the face can be animated; in our case it can be made to speak new sentences. On the other hand, face dynamics can be tracked in 3D without markers for performance capture. Copyright © 2002 John Wiley & Sons, Ltd.
44 citations
01 Jan 1996
TL;DR: Recognition of the synthetic talker is reasonably close to that of the human talker, but a significant distance remains, and improvements to the synthetic phoneme specifications are discussed.
Abstract: We report here on an experiment comparing visual recognition of monosyllabic words produced either by our computer-animated talker or a human talker. Recognition of the synthetic talker is reasonably close to that of the human talker, but a significant distance remains to be covered and we discuss improvements to the synthetic phoneme specifications. In an additional experiment using the same paradigm, we compare perception of our animated talker with a similarly generated point-light display, finding significantly worse performance for the latter for a number of viseme classes. We conclude with some ideas for future progress and briefly describe our new animated tongue.
42 citations
18 Aug 2002
TL;DR: Two approaches to using HMMs (hidden Markov models) to convert audio signals to a sequence of visemes are compared; when similar viseme classes are merged, the error rates can be reduced to 20.5% and 13.9%, respectively.
Abstract: We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shape. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. They can also be used to help people with impaired hearing. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is utilized to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we have found that the error rates can be reduced to 20.5% and 13.9%, respectively.
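The second approach in the abstract above converts a recognized phoneme sequence into a viseme sequence via a many-to-one mapping, since several phonemes share the same visible mouth shape. A minimal sketch of that conversion step, with a small hypothetical phoneme-to-viseme table (the paper's actual viseme classes are not listed in the abstract):

```python
# Hypothetical many-to-one phoneme-to-viseme table, for illustration
# only; the paper's own class definitions are not given here.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "s": "alveolar", "z": "alveolar",
    "k": "velar", "g": "velar",
    "iy": "spread", "ih": "spread",
    "aa": "open", "ah": "open",
    "uw": "rounded", "ow": "rounded",
}

def phonemes_to_visemes(phonemes):
    """Map a recognized phoneme sequence to a viseme sequence,
    collapsing consecutive duplicates (adjacent phonemes that share
    a mouth shape yield a single viseme)."""
    visemes = []
    for p in phonemes:
        v = PHONEME_TO_VISEME.get(p, "neutral")
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes
```

For example, `phonemes_to_visemes(["p", "b", "m"])` collapses to a single `"bilabial"` viseme, which is exactly why merging similar classes lowers the reported error rates.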
42 citations