Topic: Viseme

About: Viseme is a research topic. Over its lifetime, 865 publications have been published within this topic, receiving 17,889 citations.


Papers
Book Chapter DOI
27 Feb 2017
TL;DR: This work constructs an automatic system that uses DCT and SIFT descriptors to extract the main characteristics of the mouth region and HMMs to model the statistical relations of both viseme and phoneme sequences, and it builds a phoneme-to-viseme mapping based on visual similarities between phonemes to maximize word recognition.
Abstract: Speech is the most used communication method between humans and it is considered a multisensory process. Even though there is a popular belief that speech is something that we hear, there is overwhelming evidence that the brain treats speech as something that we both hear and see. Much of the research has focused on Automatic Speech Recognition (ASR) systems, treating speech primarily as an acoustic form of communication. In recent years, there has been an increasing interest in systems for Automatic Lip-Reading (ALR), although exploiting the visual information has proved to be challenging. One of the main problems in ALR is how to make the system robust to the visual ambiguities that appear at the word level. These ambiguities make the definition of the minimum distinguishable unit of the video domain confused and imprecise. In contrast to the audio domain, where the phoneme is the standard minimum auditory unit, there is no consensus on the definition of the minimum visual unit (the viseme). In this work, we focus on the automatic construction of a phoneme-to-viseme mapping based on visual similarities between phonemes to maximize word recognition. We investigate the usefulness of different phoneme-to-viseme mappings, obtaining the best results for intermediate vocabulary lengths. We construct an automatic system that uses DCT and SIFT descriptors to extract the main characteristics of the mouth region and HMMs to model the statistical relations of both viseme and phoneme sequences. We test our system on two Spanish corpora with continuous speech (AV@CAR and VLRF) containing 19 and 24 speakers, respectively. Our results indicate that we are able to recognize 47% (resp. 51%) of the phonemes and 23% (resp. 21%) of the words for AV@CAR and VLRF. We also show additional results that support the usefulness of visemes. Experiments on a comparable ALR system trained exclusively using phonemes at all its stages confirm the existence of strong visual ambiguities between groups of phonemes. This fact, together with the higher word accuracy obtained when using phoneme-to-viseme mappings, justifies the usefulness of visemes instead of the direct use of phonemes for ALR.
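As an illustration of the mapping step described in this abstract, the sketch below (not the authors' exact procedure; the phoneme set, confusion values and cluster count are invented for the example) derives a phoneme-to-viseme mapping by agglomeratively clustering phonemes whose visual confusion is high:

```python
# Illustrative sketch: cluster phonemes into viseme classes from a
# visual-similarity (confusion) matrix. All values below are toy data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

phonemes = ["p", "b", "m", "f", "v", "t", "d", "s"]

# Toy symmetric matrix: entry [i, j] ~ how often phoneme i is visually
# confused with phoneme j (higher = more similar on the lips).
confusion = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.0],
    [0.9, 1.0, 0.8, 0.1, 0.1, 0.1, 0.1, 0.0],
    [0.8, 0.8, 1.0, 0.1, 0.1, 0.1, 0.1, 0.0],
    [0.1, 0.1, 0.1, 1.0, 0.9, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.9, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 1.0, 0.8, 0.3],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 1.0, 0.3],
    [0.0, 0.0, 0.0, 0.1, 0.1, 0.3, 0.3, 1.0],
])

# Turn similarity into a distance and cluster; the cluster count plays the
# role of the viseme-vocabulary size (intermediate sizes worked best above).
distance = 1.0 - confusion
np.fill_diagonal(distance, 0.0)
links = linkage(squareform(distance, checks=False), method="average")
labels = fcluster(links, t=3, criterion="maxclust")

phoneme_to_viseme = {p: f"V{l}" for p, l in zip(phonemes, labels)}
print(phoneme_to_viseme)  # e.g. {'p': 'V1', 'b': 'V1', 'm': 'V1', ...}
```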

3 citations

Journal Article DOI
TL;DR: In this article, the authors present the Korean Standard Monosyllabic Word List for Adults (KS-MWL-A).
Abstract: Background and Purpose: Previous studies on the Korean viseme system were built from limited combinations of phonemes and therefore failed to reflect the diverse characteristics of everyday conversational speech. This study therefore used the Korean Standard Monosyllabic Word List for Adults (KS-MWL-A), which reflects the phoneme occurrence rates of everyday conversational speech, to...

3 citations

Proceedings Article DOI
27 Aug 2007
TL;DR: A lipreading recognition experiment on Arabic, in which consonant-vowel stimuli were presented as visual-only speech and participants were asked to report what they recognized, shows that some phonemes were well discriminated, whereas for others discrimination depended on the context.
Abstract: In this paper, we present a study of visual speech in Arabic. More specifically, we performed a lipreading recognition experiment on Arabic in which a set of consonant-vowel stimuli were presented as visual-only speech and participants were asked to report what they recognized. The overall lipreading scores were consistent with experiments in other languages. The resulting consonant confusion matrix shows that some of the phonemes were well discriminated, whereas for others discrimination depended on the context. Results are discussed based on the category of phonemes and the vowel context.
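The consonant confusion matrix mentioned in this abstract can be built by tallying participants' responses per presented stimulus. The sketch below uses invented response pairs and consonant labels purely to show the bookkeeping; it is not the study's data or procedure:

```python
# Illustrative sketch (toy data): tally visual-only identification responses
# into a consonant confusion matrix, row-normalized so each row estimates
# P(reported consonant | presented consonant).
import numpy as np

consonants = ["b", "m", "f", "th", "s"]
# (presented, reported) pairs as might be collected from participants.
responses = [("b", "b"), ("b", "m"), ("b", "b"), ("m", "b"), ("m", "m"),
             ("f", "f"), ("f", "th"), ("th", "f"), ("s", "s"), ("s", "th")]

idx = {c: i for i, c in enumerate(consonants)}
counts = np.zeros((len(consonants), len(consonants)))
for presented, reported in responses:
    counts[idx[presented], idx[reported]] += 1

# Normalize each row by its number of presentations (guarding empty rows).
row_sums = counts.sum(axis=1, keepdims=True)
confusion = counts / np.clip(row_sums, 1, None)
print(np.round(confusion, 2))
```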

3 citations

Proceedings Article DOI
01 Sep 1998
TL;DR: This paper implicitly differentiates between the quality of visual representation necessary for speech and speaker recognition and assesses the performance of visual lip features with respect to well established audio features.
Abstract: This paper implicitly differentiates between the quality of visual representation necessary for speech and speaker recognition, and assesses the performance of visual lip features with respect to well-established audio features. Blue-lip-highlighted data is used to show how variations in lip measurements can influence speech and speaker recognition. From these experiments and other researchers' results [1], it is postulated that the fine detail of the lips is critical for speaker recognition but, conversely, the same amount of detail does not noticeably improve visual speech recognition. Visual error rates of 26.3% and 70% are achieved for cross-digit speaker and cross-speaker speech recognition, respectively.

3 citations

Proceedings Article DOI
27 May 2015
TL;DR: This study proposes introducing co-articulation into speech synchronization to produce more realistic animation, together with a viseme mapping based on the consonant-vowel (CV) syllable pattern of the Indonesian language; the resulting, more specific viseme groups support the development of realistic speech synchronization.
Abstract: Speech synchronization is a widely studied problem in the field of facial animation that produces speech animation, but many challenges remain, one of which is realistic speech synchronization. Because visual phonemes (visemes) differ between languages, it is very difficult to build speech synchronization tools that are applicable to all languages, and at present there are no speech synchronization tools that give good results for the Indonesian language. This study proposes introducing co-articulation into speech synchronization to produce more realistic animation, together with a viseme mapping based on the consonant-vowel (CV) syllable pattern of the Indonesian language. The result is a more specific viseme grouping that supports the development of realistic speech synchronization, hereafter called Bahasa Speech Sync. Co-articulation is computed with a Kochanek-Bartels spline interpolation approach, which adds tension, bias and continuity parameters, using 4 control points taken from real human videos to accommodate the concept of co-articulation. Viseme mapping is done by comparing the distances between 12 crucial points and a reference point for each syllable. Based on our proposed viseme grouping procedure, we simplify viseme generation from the combination of 21 consonants and 5 vowels into 24 viseme groups, 18 of which represent the start position while 6 represent the end position. A test of the similarity of movement between the generated animation and real human videos achieved an 89% "realistic" perception based on our proposed distance criteria.
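The abstract names Kochanek-Bartels (TCB) spline interpolation with tension, bias and continuity parameters over 4 control points. The sketch below implements the standard TCB segment interpolation under that assumption; the control-point values and the mouth-opening interpretation are hypothetical stand-ins for the trajectories the authors take from real video:

```python
# Illustrative sketch of Kochanek-Bartels (TCB) interpolation over four
# control points, as a stand-in for the co-articulation curve described above.
import numpy as np

def tcb_tangents(p_prev, p0, p1, p_next, tension, continuity, bias):
    """Outgoing tangent at p0 and incoming tangent at p1 for segment p0->p1."""
    d_out = ((1 - tension) * (1 + bias) * (1 - continuity) / 2) * (p0 - p_prev) \
          + ((1 - tension) * (1 - bias) * (1 + continuity) / 2) * (p1 - p0)
    d_in  = ((1 - tension) * (1 + bias) * (1 + continuity) / 2) * (p1 - p0) \
          + ((1 - tension) * (1 - bias) * (1 - continuity) / 2) * (p_next - p1)
    return d_out, d_in

def tcb_segment(p_prev, p0, p1, p_next, tension=0.0, continuity=0.0,
                bias=0.0, steps=20):
    """Hermite-interpolate the middle segment p0->p1 using TCB tangents."""
    d_out, d_in = tcb_tangents(p_prev, p0, p1, p_next, tension, continuity, bias)
    s = np.linspace(0.0, 1.0, steps)[:, None]
    h00 = 2 * s**3 - 3 * s**2 + 1   # Hermite basis functions
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * p0 + h10 * d_out + h01 * p1 + h11 * d_in

# Four hypothetical mouth-opening keyframes (e.g. sampled from video frames).
p = [np.array([0.0]), np.array([0.4]), np.array([0.7]), np.array([0.2])]
curve = tcb_segment(*p, tension=0.2, continuity=0.0, bias=-0.1)
print(curve.ravel())  # interpolated values between the middle two keyframes
```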

3 citations


Network Information
Related Topics (5)
Vocabulary: 44.6K papers, 941.5K citations (78% related)
Feature vector: 48.8K papers, 954.4K citations (76% related)
Feature extraction: 111.8K papers, 2.1M citations (75% related)
Feature (computer vision): 128.2K papers, 1.7M citations (74% related)
Unsupervised learning: 22.7K papers, 1M citations (73% related)
Performance Metrics
No. of papers in the topic in previous years:
Year | Papers
2023 | 7
2022 | 12
2021 | 13
2020 | 39
2019 | 19
2018 | 22