Topic
Viseme
About: Viseme is a research topic. Over the lifetime, 865 publications have been published within this topic receiving 17889 citations.
Papers published on a yearly basis
Papers
01 Aug 2005
TL;DR: An efficient system for realistic speech animation is proposed, which supports all steps of the animation pipeline, from the capture or design of 3-D head models up to the synthesis and editing of the performance.
Abstract: An efficient system for realistic speech animation is proposed. The system supports all steps of the animation pipeline, from the capture or design of 3-D head models up to the synthesis and editing of the performance. This pipeline is fully 3-D, which yields high flexibility in the use of the animated character. Real, detailed 3-D face dynamics, observed at video frame rate for thousands of points on the faces of speaking actors, underpin the realism of the facial deformations. These are given a compact and intuitive representation via independent component analysis (ICA). Performances amount to trajectories through this ‘viseme space’. When asked to animate a face, the system replicates the ‘visemes’ that it has learned, and adds the necessary co-articulation effects. Realism has been improved through comparisons with motion-captured ground truth. Faces for which no 3-D dynamics could be observed can be animated nonetheless. Their visemes are adapted automatically to their physiognomy by localising the face in a ‘face space’.
22 citations
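The abstract above describes compressing captured 3-D face dynamics into a low-dimensional ‘viseme space’ via ICA, with each performance a trajectory of coefficients through that space. A minimal NumPy sketch of the idea, using an SVD/PCA basis as a stand-in for ICA (the frame counts, point counts, and random data are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical capture: 200 frames of 3-D positions for 50 face points,
# flattened to (frames, 150). The real system tracks thousands of points.
frames = rng.normal(size=(200, 150))

# Center and decompose. SVD stands in here for the ICA used in the paper;
# both yield a compact linear basis of deformation components.
mean_face = frames.mean(axis=0)
U, S, Vt = np.linalg.svd(frames - mean_face, full_matrices=False)

k = 6                    # keep a handful of 'viseme space' dimensions
basis = Vt[:k]           # (k, 150) deformation components

# A performance is then a trajectory of k coefficients per frame.
trajectory = (frames - mean_face) @ basis.T   # (200, k)

# Any point on the trajectory reconstructs an approximate face shape.
reconstructed = trajectory @ basis + mean_face
print(reconstructed.shape)  # (200, 150)
```

Animating a new utterance amounts to generating a new coefficient trajectory and mapping it back through the basis.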
23 Jun 1997
TL;DR: The perceptual boundaries of speech reading and multimedia technology, i.e. the constraints that affect speech-reading performance, are investigated, and conclusions are drawn on the relationship between viseme groupings, accuracy of viseme recognition, and presentation rate.
Abstract: In the future, multimedia technology will be able to provide video frame rates equal to or better than 30 frames per second (FPS). Until that time, the hearing-impaired community will be using band-limited communication systems over unshielded twisted-pair copper wiring. As a result, multimedia communication systems will use a coder/decoder (CODEC) to compress the video and audio signals for transmission. For these systems to be usable by the hearing-impaired community, the algorithms within the CODEC have to be designed to account for the perceptual boundaries of the hearing impaired. We investigate the perceptual boundaries of speech reading and multimedia technology, which are the constraints that affect speech-reading performance. We analyze and draw conclusions on the relationship between viseme groupings, accuracy of viseme recognition, and presentation rate. These results are critical in the design of multimedia systems for the hearing impaired.
22 citations
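A viseme grouping, as analyzed in the entry above, is a many-to-one collapse of phonemes onto visually indistinguishable mouth-shape classes. A small illustrative sketch (these particular groupings are a common textbook-style example, not taken from this paper):

```python
# Illustrative many-to-one phoneme-to-viseme grouping; actual groupings
# vary between studies and are assumptions here, not from the paper.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "s": "alveolar", "z": "alveolar",
    "k": "velar", "g": "velar",
}

def to_visemes(phonemes):
    """Collapse a phoneme sequence onto its visually distinct classes."""
    return [PHONEME_TO_VISEME.get(p, "other") for p in phonemes]

print(to_visemes(["b", "a", "t"]))  # ['bilabial', 'other', 'alveolar']
```

Coarser groupings raise per-class recognition accuracy but discard distinctions the viewer must recover from context, which is why the relationship between grouping, accuracy, and presentation rate matters for CODEC design.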
01 Jan 1999
TL;DR: A three-dimensional facial model is combined with a commercial audio text-to-speech synthesizer; the confusion patterns of consonants and the identification of the Finnish visemes are examined to assess the intelligibility of both natural and synthetic auditory speech.
Abstract: We describe our Finnish audio-visual speech synthesizer, its evaluation and discuss possible improvements. We have combined a three dimensional facial model with a commercial audio text-to-speech synthesizer. The visual speech is based on a letter-to-viseme mapping and the animation is created by linear interpolation between the visemes. An intelligibility test was run to quantify the benefit of seeing the synthetic and natural face on hearing the synthetic and natural voice presented at different signal to noise ratios. Both natural and synthetic faces improved the intelligibility of both natural and synthetic auditory speech. We examined the confusion patterns of consonants and the identification of the Finnish visemes. We also propose how the viseme repertoire of the talking head can be improved.
22 citations
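The Finnish synthesizer above drives its animation by a letter-to-viseme mapping followed by linear interpolation between viseme targets. A minimal sketch of that pipeline, with made-up two-parameter viseme targets (jaw opening, lip rounding) standing in for the paper's full 3-D facial model:

```python
import numpy as np

# Hypothetical viseme targets as (jaw_open, lip_rounding) parameters;
# the real system drives a 3-D facial model, not two scalars.
VISEMES = {
    "a": np.array([0.9, 0.2]),
    "m": np.array([0.0, 0.1]),
    "o": np.array([0.6, 0.9]),
}

def animate(letters, steps=4):
    """Letter-to-viseme mapping, then linear interpolation between
    consecutive viseme targets to produce animation frames."""
    keys = [VISEMES[c] for c in letters]
    out = []
    for a, b in zip(keys, keys[1:]):
        for t in np.linspace(0.0, 1.0, steps, endpoint=False):
            out.append((1 - t) * a + t * b)
    out.append(keys[-1])
    return np.stack(out)

frames = animate("mao")
print(frames.shape)  # (9, 2): 4 steps per transition plus the final pose
```

Linear interpolation is the simplest co-articulation-free scheme; the paper's proposed improvements to the viseme repertoire address exactly the distinctions such interpolation smears out.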
TL;DR: The phoneme lipreading system's word accuracy outperforms the viseme-based system's word accuracy; however, the phoneme system achieved lower accuracy at the unit level, which shows the importance of the dictionary for decoding classification outputs into words.
Abstract: There is debate over whether phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies have tried to improve lipreading accuracy by focusing on visemes, with varying results. We compare the performance of a lipreading system by modeling visual speech using either 13 viseme or 38 phoneme units. We report the accuracy of our system at both word and unit levels. The evaluation task is large-vocabulary continuous speech using the TCD-TIMIT corpus. We complete our visual speech modeling via hybrid DNN-HMMs, and our visual speech decoder is a Weighted Finite-State Transducer (WFST). We use DCT and Eigenlips as representations of the mouth ROI image. The phoneme lipreading system's word accuracy outperforms the viseme-based system's word accuracy. However, the phoneme system achieved lower accuracy at the unit level, which shows the importance of the dictionary for decoding classification outputs into words.
22 citations
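The finding above, that word accuracy can exceed unit accuracy, follows from the dictionary constraining decoding: a noisy unit sequence still snaps to the nearest valid pronunciation. A toy sketch of that effect using a closest-match lookup (a crude stand-in for the WFST decoder; the dictionary entries are hypothetical, not TCD-TIMIT):

```python
from difflib import SequenceMatcher

# Toy pronunciation dictionary (hypothetical entries).
DICT = {
    "bat": ["b", "ae", "t"],
    "cat": ["k", "ae", "t"],
    "mat": ["m", "ae", "t"],
}

def decode(units):
    """Map a noisy unit sequence to the closest dictionary word.
    A simplification of the WFST-based decoding used in the paper."""
    def sim(word):
        return SequenceMatcher(None, units, DICT[word]).ratio()
    return max(DICT, key=sim)

# One of three units is misrecognized (33% unit error rate), yet the
# dictionary constraint still recovers the intended word.
print(decode(["b", "ah", "t"]))  # bat
```

This is why a unit set with higher raw confusability (38 phonemes vs 13 visemes) can still yield higher word accuracy once the dictionary prunes impossible sequences.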