Topic

Viseme

About: Viseme is a research topic. Over its lifetime, 865 publications have been published within this topic, receiving 17,889 citations.


Papers
Proceedings Article
01 Jan 2003
TL;DR: This paper extends prior work in multi-stream modeling by introducing cross-stream observation dependencies and a new discriminative criterion for selecting such dependencies. Experimental results combining short-term PLP features with long-term TRAP features show gains for a multi-stream model with partial state asynchrony over a baseline HMM.
Abstract: This paper extends prior work in multi-stream modeling by introducing cross-stream observation dependencies and a new discriminative criterion for selecting such dependencies. Experimental results combining short-term PLP features with long-term TRAP features show gains associated with a multi-stream model with partial state asynchrony over a baseline HMM. Frame-based analyses show significant discriminant information in the added cross-stream dependencies, but so far there are only small gains in recognition accuracy.
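
As a rough, hypothetical sketch of the multi-stream idea (not the paper's actual implementation), the Python below scores one frame against one HMM state by combining a PLP stream and a TRAP stream log-linearly, with a linear cross-stream term conditioning the TRAP observation on the PLP frame. The state parameters (mu_plp, cov_plp, mu_trap, cov_trap, A) and the stream weights are invented placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def multistream_log_likelihood(o_plp, o_trap, state, w=(0.6, 0.4)):
    """Score one frame against one HMM state using two feature streams.

    Toy stand-in for multi-stream modelling: each stream has its own
    Gaussian observation model, and a cross-stream dependency is added
    by conditioning the TRAP stream's mean on the PLP observation.
    The state parameters and stream weights are invented placeholders.
    """
    # Per-stream Gaussian for the PLP observation.
    ll_plp = multivariate_normal.logpdf(o_plp, state["mu_plp"], state["cov_plp"])
    # Cross-stream dependency: TRAP mean shifted by a linear map of the PLP frame.
    mu_trap = state["mu_trap"] + state["A"] @ np.asarray(o_plp)
    ll_trap = multivariate_normal.logpdf(o_trap, mu_trap, state["cov_trap"])
    # Log-linear combination with per-stream weights (partial state
    # asynchrony, handled at the model level in the paper, is not shown here).
    return w[0] * ll_plp + w[1] * ll_trap
```

In a full recognizer this score would take the place of the single-stream emission probability inside standard HMM decoding.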

9 citations

Journal ArticleDOI
TL;DR: A structured approach to creating speaker-dependent visemes with a fixed number of visemes within each set, based upon clustering phonemes, which significantly improves on previous lipreading results with RMAV speakers.
Abstract: Lipreading is understanding speech from observed lip movements. An observed series of lip motions is an ordered sequence of visual lip gestures. These gestures are commonly known as 'visemes', though they are not yet formally defined. In this article, we describe a structured approach which allows us to create speaker-dependent visemes with a fixed number of visemes within each set. We create sets of visemes for sizes two to 45. Each set of visemes is based upon clustering phonemes, thus each set has a unique phoneme-to-viseme mapping. We first present an experiment using these maps and the Resource Management Audio-Visual (RMAV) dataset which shows the effect of changing the viseme map size in speaker-dependent machine lipreading and demonstrates that word recognition with phoneme classifiers is possible. Furthermore, we show that there are intermediate units between visemes and phonemes which are better still. Second, we present a novel two-pass training scheme for phoneme classifiers. This approach uses the new intermediary visual units from our first experiment as classifiers in the first pass; we then use the phoneme-to-viseme maps to retrain these into phoneme classifiers. This method significantly improves on previous lipreading results with RMAV speakers.
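
The clustering step can be sketched briefly. Assuming visual confusability between phonemes is summarized in a phoneme confusion matrix (an assumption; the paper's speaker-dependent procedure differs in detail), hierarchical clustering cut at a fixed number of clusters yields one phoneme-to-viseme map per set size:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def phoneme_to_viseme_map(confusion, phonemes, n_visemes):
    """Cluster phonemes into exactly `n_visemes` viseme classes.

    `confusion` is a phoneme confusion matrix (higher = more visually
    confusable); it stands in for whatever speaker-dependent similarity
    the real procedure uses. Returns a dict: phoneme -> viseme label.
    """
    sim = np.asarray(confusion, dtype=float)
    sim = sim / sim.max()
    # Confusable phonemes should end up close together, so invert similarity.
    dist = 1.0 - 0.5 * (sim + sim.T)   # symmetrize, then turn into a distance
    np.fill_diagonal(dist, 0.0)
    # Agglomerative clustering, cut so at most n_visemes clusters remain.
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_visemes, criterion="maxclust")
    return {p: f"V{lab:02d}" for p, lab in zip(phonemes, labels)}
```

Sweeping n_visemes from 2 to 45 over the same matrix would generate the family of phoneme-to-viseme maps the experiment compares.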

9 citations

Proceedings ArticleDOI
19 Apr 2015
TL;DR: This paper introduces a method for automatic redubbing of video that exploits the many-to-many mapping of phoneme sequences to lip movements modelled as dynamic visemes, and explores the natural ambiguity in visual speech.
Abstract: This paper introduces a method for automatic redubbing of video that exploits the many-to-many mapping of phoneme sequences to lip movements modelled as dynamic visemes [1]. For a given utterance, the corresponding dynamic viseme sequence is sampled to construct a graph of possible phoneme sequences that synchronize with the video. When composed with a pronunciation dictionary and language model, this produces a vast number of word sequences that are in sync with the original video, literally putting plausible words into the mouth of the speaker. We demonstrate that traditional, many-to-one, static visemes lack flexibility for this application as they produce significantly fewer word sequences. This work explores the natural ambiguity in visual speech, offering insights for automatic speech recognition and highlighting the importance of language modeling.
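
As a toy sketch of the graph idea, with a hypothetical dv_to_phones mapping (each dynamic viseme maps to several candidate phoneme subsequences, giving the many-to-many structure) and a lexicon of phoneme tuples, the enumeration and dictionary filtering might look like this; ranking with a language model, which the paper also uses, is omitted:

```python
from itertools import product

def candidate_phoneme_sequences(viseme_seq, dv_to_phones, lexicon, limit=1000):
    """Enumerate phoneme strings that synchronize with a dynamic-viseme
    sequence, keeping only those that segment into dictionary words.

    `dv_to_phones` and `lexicon` are toy placeholders, not the paper's models.
    """
    results = []
    # Cartesian product over per-viseme alternatives enumerates the graph paths.
    for choice in product(*(dv_to_phones[v] for v in viseme_seq)):
        phones = tuple(p for segment in choice for p in segment)
        if segments_into_words(phones, lexicon):
            results.append(phones)
            if len(results) >= limit:
                break
    return results

def segments_into_words(phones, lexicon, memo=None):
    """True if the phoneme tuple can be split into a sequence of lexicon words."""
    if memo is None:
        memo = {}
    if not phones:
        return True
    if phones in memo:
        return memo[phones]
    ok = any(phones[:n] in lexicon and segments_into_words(phones[n:], lexicon, memo)
             for n in range(1, len(phones) + 1))
    memo[phones] = ok
    return ok
```

Because each dynamic viseme admits several phoneme realizations, even short viseme sequences yield many word sequences; with many-to-one static visemes the per-viseme alternatives collapse, which is why the paper finds far fewer candidates.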

9 citations

Journal ArticleDOI
10 Jan 2008
TL;DR: This work describes an approach to pose-based interpolation that deals with coarticulation using a constraint-based technique and demonstrates it using a Mexican-Spanish talking head, which can vary its speed of talking and produce coarticulation effects.
Abstract: A common approach to produce visual speech is to interpolate the parameters describing a sequence of mouth shapes, known as visemes, where a viseme corresponds to a phoneme in an utterance. The interpolation process must consider the issue of context-dependent shape, or coarticulation, in order to produce realistic-looking speech. We describe an approach to such pose-based interpolation that deals with coarticulation using a constraint-based technique. This is demonstrated using a Mexican-Spanish talking head, which can vary its speed of talking and produce coarticulation effects.
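
A minimal sketch of pose-based viseme blending, using Gaussian dominance windows as a stand-in for the paper's constraint-based coarticulation (the window width and mouth-shape parameterization are assumptions, not the authors' method):

```python
import numpy as np

def interpolate_visemes(keyframes, times, t, width=0.08):
    """Blend viseme parameter vectors at time `t` (seconds).

    `keyframes` is an (n, d) array of mouth-shape parameters and `times`
    their timestamps. Each viseme contributes through a Gaussian dominance
    window; `width` (an assumed value) controls how far a viseme's
    influence spreads into its neighbours.
    """
    times = np.asarray(times, dtype=float)
    # Dominance of each viseme decays with temporal distance from t.
    w = np.exp(-0.5 * ((t - times) / width) ** 2)
    w = w / (w.sum() + 1e-12)  # normalize; epsilon guards against underflow
    # Overlapping windows blend neighbouring visemes: a simple coarticulation.
    return w @ np.asarray(keyframes, dtype=float)
```

Narrowing `width` sharpens articulation; widening it smears neighbouring visemes into each other, the context-dependent effect that the constraint-based method controls in a more principled way, and scaling `times` changes the speed of talking.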

9 citations


Network Information
Related Topics (5)
Vocabulary
44.6K papers, 941.5K citations
78% related
Feature vector
48.8K papers, 954.4K citations
76% related
Feature extraction
111.8K papers, 2.1M citations
75% related
Feature (computer vision)
128.2K papers, 1.7M citations
74% related
Unsupervised learning
22.7K papers, 1M citations
73% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2023    7
2022    12
2021    13
2020    39
2019    19
2018    22