Topic

Viseme

About: Viseme is a research topic. Over its lifetime, 865 publications have been published on this topic, receiving 17,889 citations.


Papers
Journal ArticleDOI
TL;DR: A novel method is proposed for generating acoustic models for viseme recognition from speech via transformations of trained phoneme acoustic models; the method is language-independent and needs only the available speech resources.
Abstract: Viseme recognition from speech is one of the methods needed to operate a talking head system, which can be used in various areas, such as mobile services and applications, gaming, the entertainment industry, and so on. This paper proposes a novel method for generating acoustic models for viseme recognition from speech. The viseme acoustic models were generated using transformations from trained phoneme acoustic models. The proposed transformation method is language-independent; only the available speech resources are needed. The viseme sequence with corresponding time information was produced as a result of recognition using context-dependent acoustic models. The evaluation of the proposed acoustic models’ transformation method was carried out on a test scenario with phonetically balanced words, in which the results were compared to the baseline viseme recognition system. The improvement in viseme accuracy was statistically significant when using the proposed method for transforming acoustic models. DOI: http://dx.doi.org/10.5755/j01.eee.19.9.5657

2 citations
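The paper does not spell out the transformation itself, but any such approach starts from a many-to-one phoneme-to-viseme table applied to the trained phoneme models. A minimal Python sketch under that assumption; the mapping and the group_phoneme_models helper are illustrative, not the paper's actual method:

```python
# Hypothetical many-to-one phoneme-to-viseme table and a grouping helper;
# the mapping below is illustrative, not the one used in the paper.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "s": "V_alveolar", "z": "V_alveolar",
    "k": "V_velar", "g": "V_velar",
    "aa": "V_open", "ae": "V_open",
    "iy": "V_spread", "ih": "V_spread",
    "uw": "V_rounded", "ow": "V_rounded",
}

def group_phoneme_models(phoneme_models):
    """Group trained phoneme model parameters under their target viseme class,
    ready to be merged into one viseme acoustic model per class."""
    viseme_models = {}
    for phoneme, params in phoneme_models.items():
        viseme = PHONEME_TO_VISEME.get(phoneme)
        if viseme is None:
            continue  # phoneme outside the illustrative table
        viseme_models.setdefault(viseme, []).append(params)
    return viseme_models
```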

Proceedings ArticleDOI
01 Jan 2006
TL;DR: In this paper, a stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus; this data is mapped into a space that maintains the relationships between samples and their temporal derivatives.
Abstract: In this paper we describe a parameterisation of lip movements which maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. This data is mapped into a space which maintains the relationships between samples and their temporal derivatives. By incorporating dynamic information within the parameterisation of lip movements we can model the cyclical structure, as well as the causal nature of speech movements as described by an underlying visual speech manifold. It is believed that such a structure will be appropriate to various areas of speech modeling, in particular the synthesis of speech lip movements.

2 citations
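One simple way to keep samples and their temporal derivatives together, as the abstract describes, is to augment each frame of lip parameters with its time derivative. A NumPy sketch under that assumption; the function name and the use of first-order differences are assumptions, not the paper's actual parameterisation:

```python
import numpy as np

def augment_with_deltas(frames):
    """Append first-order temporal derivatives to each frame of lip parameters.

    frames: (T, D) array of per-frame lip-shape parameters.
    Returns a (T, 2*D) array of [parameters, deltas].
    """
    deltas = np.gradient(frames, axis=0)  # central differences along time
    return np.concatenate([frames, deltas], axis=1)

# Example: 100 frames of a 10-dimensional lip-shape parameterisation
trajectory = np.random.randn(100, 10)
augmented = augment_with_deltas(trajectory)  # shape (100, 20)
```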

Proceedings Article
01 Jan 2010
TL;DR: A novel model synthesis method is proposed for band-limited speech recognition; it detects the speech bandwidth automatically and, when the bandwidth changes, synthesizes a new acoustic model using only a full-bandwidth model.
Abstract: A recognizer trained with full-bandwidth speech performs badly when recognizing band-limited speech because of environment mismatch. In this paper, we propose a novel model synthesis method for band-limited speech recognition. It detects the speech bandwidth automatically and synthesizes a new acoustic model using only a full-bandwidth model when the bandwidth changes. Experiments conducted on the TIMIT/NTIMIT databases show that the proposed method achieves substantial improvement over the baseline speech recognizer.

2 citations
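The abstract does not detail how the bandwidth is detected; a common and simple heuristic is to compare the spectral energy above a cutoff frequency (for example 4 kHz for telephone-style speech) with the total energy. A rough Python sketch under that assumption; the cutoff and threshold values are illustrative:

```python
import numpy as np

def estimate_bandwidth(signal, sample_rate, cutoff_hz=4000.0, ratio_threshold=0.01):
    """Rough bandwidth check: compare the spectral energy above cutoff_hz
    with the total energy and label the signal accordingly."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    high_energy = spectrum[freqs >= cutoff_hz].sum()
    total_energy = spectrum.sum() + 1e-12
    return "band-limited" if high_energy / total_energy < ratio_threshold else "full-bandwidth"
```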

Posted Content
TL;DR: In this paper, the authors propose a method that exploits external text data (for viseme-to-character mapping) by dividing video-to-character conversion into two stages, namely converting video to visemes and then converting visemes to characters, using separate models.
Abstract: Lip-reading is the task of recognizing speech from lip movements. This is difficult because the lip movements for some words are very similar when they are pronounced. Visemes are used to describe lip movements during a conversation. This paper shows how to use external text data (for viseme-to-character mapping) by dividing video-to-character conversion into two stages, namely converting video to visemes and then converting visemes to characters, using separate models. Our proposed method improves word error rate by 4% compared to the standard sequence-to-sequence lip-reading model on the BBC-Oxford Lip Reading Sentences 2 (LRS2) dataset.

2 citations
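The value of external text here is that plain text can be turned into (viseme sequence, character sequence) pairs for training the second stage. A hypothetical Python sketch; the CHAR_TO_VISEME table is purely illustrative, and a real pipeline would go through a pronunciation lexicon and a phoneme-to-viseme table rather than mapping single letters:

```python
# Hypothetical helper that turns external text into (viseme sequence, character
# sequence) training pairs for the second (viseme -> character) stage.
CHAR_TO_VISEME = {
    "a": "V1", "e": "V1", "i": "V2", "o": "V3", "u": "V3",
    "p": "V4", "b": "V4", "m": "V4",
    "f": "V5", "v": "V5",
    "t": "V6", "d": "V6", "s": "V6", "z": "V6",
}

def text_to_training_pair(text):
    """Map a line of external text to a viseme sequence paired with its characters."""
    chars = [c for c in text.lower() if c in CHAR_TO_VISEME]
    visemes = [CHAR_TO_VISEME[c] for c in chars]
    return visemes, chars

# Example: building second-stage training data from plain text
visemes, chars = text_to_training_pair("about a dozen possibilities")
```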


Network Information
Related Topics (5)

Topic                        Papers    Citations    Related
Vocabulary                   44.6K     941.5K       78%
Feature vector               48.8K     954.4K       76%
Feature extraction           111.8K    2.1M         75%
Feature (computer vision)    128.2K    1.7M         74%
Unsupervised learning        22.7K     1M           73%
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    7
2022    12
2021    13
2020    39
2019    19
2018    22