Topic
Viseme
About: Viseme is a research topic. Over its lifetime, 865 publications have been published within this topic, receiving 17,889 citations.
Papers published on a yearly basis
Papers
01 Oct 2000
TL;DR: ICSLP2000: the 6th International Conference on Spoken Language Processing, October 16-20, 2000, Beijing, China.
45 citations
TL;DR: Develops time warping and motion-vector blending at the juncture of two divisemes, together with an algorithm that searches for the optimal concatenation of visible speech, to produce the final concatenative motion sequence.
Abstract: We present a technique for accurate automatic visible speech synthesis from textual input. When provided with a speech waveform and the text of a spoken sentence, the system produces accurate visible speech synchronized with the audio signal. To develop the system, we collected motion capture data from a speaker's face during production of a set of words containing all diviseme sequences in English. The motion capture points from the speaker's face are retargeted to the vertices of the polygons of a 3D face model. When synthesizing a new utterance, the system locates the required sequence of divisemes, shrinks or expands each diviseme based on the desired phoneme segment durations in the target utterance, then moves the polygons in the regions of the lips and lower face to correspond to the spatial coordinates of the motion capture data. The motion mapping is realized by a key-shape mapping function learned from a set of viseme examples in the source and target faces. A well-posed numerical algorithm estimates the shape blending coefficients. Time warping and motion vector blending at the juncture of two divisemes, and the algorithm to search for the optimal concatenated visible speech, are also developed to provide the final concatenative motion sequence. Copyright © 2004 John Wiley & Sons, Ltd.
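The abstract above describes stretching or shrinking each diviseme to match target phoneme durations, then blending motion at the juncture of two divisemes. A minimal sketch of those two steps with NumPy, assuming motion sequences are stored as frames × points × 3 arrays and using simple linear resampling and linear cross-fade weights (the paper's actual warping and blending functions are not specified in the abstract):

```python
import numpy as np

def time_warp(seq, target_len):
    """Linearly resample a motion sequence (frames x points x 3)
    to a new frame count, stretching or shrinking the diviseme."""
    src = np.linspace(0.0, 1.0, len(seq))
    dst = np.linspace(0.0, 1.0, target_len)
    out = np.empty((target_len,) + seq.shape[1:])
    for i in range(seq.shape[1]):
        for j in range(seq.shape[2]):
            out[:, i, j] = np.interp(dst, src, seq[:, i, j])
    return out

def blend_junction(a, b, overlap):
    """Cross-fade the last `overlap` frames of diviseme `a` into the
    first `overlap` frames of diviseme `b` with linear weights."""
    w = np.linspace(1.0, 0.0, overlap)[:, None, None]
    mixed = w * a[-overlap:] + (1.0 - w) * b[:overlap]
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])
```

Linear interpolation and linear cross-fades are the simplest choices here; smoother warps (e.g. cubic) would avoid velocity discontinuities at the juncture.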
44 citations
TL;DR: Focusing on speech, the paper follows a kind of bootstrap procedure: 3D shape statistics are learned from a talking face with a relatively small number of markers, as an alternative to simulating facial anatomy.
Abstract: Realistic face animation is especially hard as we are all experts in the perception and interpretation of face dynamics. One approach is to simulate facial anatomy. Alternatively, animation can be based on first observing the visible 3D dynamics, extracting the basic modes, and putting these together according to the required performance. This is the strategy followed by the paper, which focuses on speech. The approach follows a kind of bootstrap procedure. First, 3D shape statistics are learned from a talking face with a relatively small number of markers. A 3D reconstruction is produced at temporal intervals of 1/25 seconds. A topological mask of the lower half of the face is fitted to the motion. Principal component analysis (PCA) of the mask shapes reduces the dimension of the mask shape space. The result is twofold. On the one hand, the face can be animated; in our case it can be made to speak new sentences. On the other hand, face dynamics can be tracked in 3D without markers for performance capture. Copyright © 2002 John Wiley & Sons, Ltd.
44 citations
01 Jan 1996
TL;DR: Recognition of the synthetic talker is reasonably close to that of the human talker, but a significant distance remains, and improvements to the synthetic phoneme specifications are discussed.
Abstract: We report here on an experiment comparing visual recognition of monosyllabic words produced either by our computer-animated talker or a human talker. Recognition of the synthetic talker is reasonably close to that of the human talker, but a significant distance remains to be covered and we discuss improvements to the synthetic phoneme specifications. In an additional experiment using the same paradigm, we compare perception of our animated talker with a similarly generated point-light display, finding significantly worse performance for the latter for a number of viseme classes. We conclude with some ideas for future progress and briefly describe our new animated tongue.
42 citations
18 Aug 2002
TL;DR: Two approaches to using HMMs (hidden Markov models) to convert audio signals to a sequence of visemes are compared; when similar viseme classes are merged, the error rates can be reduced to 20.5% and 13.9%, respectively.
Abstract: We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shape. The visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. They can also be used to help people with impaired hearing. We use HMMs (hidden Markov models) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is utilized to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we have found that the error rates can be reduced to 20.5% and 13.9%, respectively.
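The second approach in the abstract above converts a recognized phoneme sequence into a viseme sequence via a many-to-one mapping, since several phonemes share the same visible mouth shape. A minimal sketch of that conversion step, with a small hypothetical phoneme-to-viseme table (the paper's actual viseme classes are not listed in the abstract):

```python
# Hypothetical many-to-one phoneme-to-viseme table, for illustration
# only; the paper's own class definitions are not given here.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "s": "alveolar", "z": "alveolar",
    "k": "velar", "g": "velar",
    "iy": "spread", "ih": "spread",
    "aa": "open", "ah": "open",
    "uw": "rounded", "ow": "rounded",
}

def phonemes_to_visemes(phonemes):
    """Map a recognized phoneme sequence to a viseme sequence,
    collapsing consecutive duplicates (adjacent phonemes that share
    a mouth shape yield a single viseme)."""
    visemes = []
    for p in phonemes:
        v = PHONEME_TO_VISEME.get(p, "neutral")
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes
```

For example, `phonemes_to_visemes(["p", "b", "m"])` collapses to a single `"bilabial"` viseme, which is exactly why merging similar classes lowers the reported error rates.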
42 citations