Topic

Viseme

About: Viseme is a research topic. Over its lifetime, 865 publications have been published on this topic, receiving 17,889 citations.


Papers
Book Chapter
01 Jan 1993
TL;DR: While this study focuses on the feasibility, validity, and stand-alone contribution of continuous OASR, future highly robust recognition systems should combine optical and acoustic information with syntactic, semantic, and pragmatic aids.
Abstract: This study describes the design and implementation of a novel continuous speech recognizer that uses optical information from the oral-cavity shadow of a speaker. The system uses hidden Markov models (HMMs) trained to discriminate optical information and achieves a recognition rate of 25.3 percent on 150 test sentences. This is the first system to accomplish continuous optical automatic speech recognition (OASR). This level of performance, achieved without syntactic, semantic, or any other contextual guide to the recognition process, indicates that OASR may serve as a major supplement for robust multi-modal recognition in noisy environments. Additionally, new features important for OASR were discovered, and novel approaches to vector quantization, training, and clustering were utilized. The study contains three major components. First, it hypothesizes 35 static and dynamic optical features to characterize the shadow of the speaker's oral cavity. Using the corresponding correlation matrix and a principal component analysis, the study discarded 22 of these features. The remaining 13 features are mostly dynamic, unlike the static features used by previous researchers. Second, the study merged phonemes that appear optically similar on the speaker's oral-cavity region into visemes. The visemes were objectively analyzed and discriminated using HMM and clustering algorithms. Most significantly, the computationally derived visemes for the speaker are consistent with the phoneme-to-viseme mappings discussed by most lipreading experts; this agreement, in a sense, validates the selection of oral-cavity features. Third, the study trained the HMMs to recognize, without a grammar, a set of sentences having a perplexity of 150, using visemes, trisemes (triplets of visemes), and generalized trisemes (clustered trisemes). The system achieved recognition rates of 2 percent, 12.7 percent, and 25.3 percent using, respectively, viseme HMMs, triseme HMMs, and generalized triseme HMMs. The study concludes that the methodologies used in this investigation demonstrate the need for further research on continuous OASR and on the integration of optical information with other recognition methods. While this study focuses on the feasibility, validity, and stand-alone contribution of continuous OASR, future highly robust recognition systems should combine optical and acoustic information with syntactic, semantic, and pragmatic aids.

94 citations
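The abstract's two preprocessing steps, reducing 35 hypothesized oral-cavity features to 13 and merging optically similar phonemes into visemes, can be illustrated with a short sketch. The Python below is not the paper's code: the PCA-based ranking heuristic, the feature matrix, and the phoneme-to-viseme mapping are all illustrative assumptions.

```python
# Minimal sketch of PCA-guided feature selection plus phoneme-to-viseme
# merging, in the spirit of the study above. All numbers and names are
# illustrative, not taken from the paper.
import numpy as np

def select_features(X: np.ndarray, keep: int = 13) -> np.ndarray:
    """Keep the `keep` original feature columns with the strongest
    variance-weighted loadings on the top `keep` principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    # Loading of each original feature on the top components,
    # weighted by the variance those components explain.
    scores = (eigvecs[:, -keep:] ** 2) @ eigvals[-keep:]
    top = np.sort(np.argsort(scores)[::-1][:keep])
    return X[:, top]

# Illustrative phoneme-to-viseme merge: optically indistinguishable phonemes
# share one viseme label (a textbook-style mapping, not the paper's own).
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "k": "V_velar", "g": "V_velar",
}

def to_visemes(phonemes):
    return [PHONEME_TO_VISEME.get(p, "V_other") for p in phonemes]

if __name__ == "__main__":
    X = np.random.rand(200, 35)          # 200 frames x 35 optical features
    print(select_features(X).shape)      # -> (200, 13)
    print(to_visemes(["p", "a", "m"]))   # ['V_bilabial', 'V_other', 'V_bilabial']
```

The relabeled viseme sequences would then serve as the training targets for the viseme and triseme HMMs the abstract describes.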

Proceedings Article
14 Jun 2020
TL;DR: This work describes a technique to detect manipulated videos by exploiting the fact that the dynamics of the mouth shape (visemes) are occasionally inconsistent with a spoken phoneme, and demonstrates the efficacy and robustness of this approach in detecting different types of deep-fake videos, including in-the-wild deep fakes.
Abstract: Recent advances in machine learning and computer graphics have made it easier to convincingly manipulate video and audio. These so-called deep-fake videos range from complete full-face synthesis and replacement (face-swap), to complete mouth and audio synthesis and replacement (lip-sync), and partial word-based audio and mouth synthesis and replacement. Detection of deep fakes with only a small spatial and temporal manipulation is particularly challenging. We describe a technique to detect such manipulated videos by exploiting the fact that the dynamics of the mouth shape - visemes - are occasionally inconsistent with a spoken phoneme. We focus on the visemes associated with words having the sound M (mama), B (baba), or P (papa) in which the mouth must completely close in order to pronounce these phonemes. We observe that this is not the case in many deep-fake videos. Such phoneme-viseme mismatches can, therefore, be used to detect even spatially small and temporally localized manipulations. We demonstrate the efficacy and robustness of this approach to detect different types of deep-fake videos, including in-the-wild deep fakes.

90 citations
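The phoneme-viseme mismatch test is simple enough to sketch. The Python below assumes a per-frame lip-aperture measurement from an upstream face tracker and phoneme-level time alignments; the closure threshold and all names are illustrative assumptions, not the authors'.

```python
# Hedged sketch of the mismatch check described above: for each aligned
# M/B/P phoneme, test whether the mouth ever closes during that segment.
from dataclasses import dataclass

BILABIALS = {"M", "B", "P"}     # phonemes requiring full lip closure
CLOSURE_THRESHOLD = 0.05        # assumed normalized aperture meaning "closed"

@dataclass
class PhonemeSegment:
    phoneme: str
    start_frame: int
    end_frame: int              # exclusive

def mismatched_segments(openness, segments):
    """Return bilabial segments during which the lips never close:
    candidate evidence of a lip-sync manipulation."""
    flagged = []
    for seg in segments:
        if seg.phoneme not in BILABIALS:
            continue
        window = openness[seg.start_frame:seg.end_frame]
        if window and min(window) > CLOSURE_THRESHOLD:
            flagged.append(seg)
    return flagged

if __name__ == "__main__":
    aperture = [0.30, 0.20, 0.12, 0.10, 0.15, 0.25]   # mouth never closes
    segs = [PhonemeSegment("M", 1, 4), PhonemeSegment("A", 4, 6)]
    for seg in mismatched_segments(aperture, segs):
        print(f"possible manipulation: /{seg.phoneme}/ "
              f"frames {seg.start_frame}-{seg.end_frame}")
```

The core signal being tested is exactly what the abstract describes: a bilabial phoneme whose aligned frames never show a closed mouth.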

Proceedings Article
17 Oct 2005
TL;DR: A novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory feature classifier scores, which can model varying degrees of co-articulation in a principled way, is presented.
Abstract: We present an approach to detecting and recognizing spoken isolated phrases based solely on visual input. We adopt an architecture that first employs discriminative detection of visual speech and articulatory features, and then performs recognition using a model that accounts for the loose synchronization of the feature streams. Discriminative classifiers detect the subclass of lip appearance corresponding to the presence of speech, and further decompose it into features corresponding to the physical components of articulatory production. These components often evolve in a semi-independent fashion, and conventional viseme-based approaches to recognition fail to capture the resulting co-articulation effects. We present a novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory feature classifier scores, which can model varying degrees of co-articulation in a principled way. We evaluate our visual-only recognition system on a command utterance task. We show comparative results on lip detection and speech/non-speech classification, as well as recognition performance against several baseline systems.

87 citations
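A faithful dynamic Bayesian network with loosely synchronized streams is more than a short example can carry, but the flavor of scoring frames against per-stream articulatory-feature targets can be sketched. The Python below is a deliberately simplified, fully synchronous variant: it fuses per-stream classifier log-scores with a weighted sum and decodes with a left-to-right Viterbi, omitting the inter-stream asynchrony that is the paper's main contribution; all shapes, weights, and names are illustrative assumptions.

```python
# Simplified multi-stream scoring: each frame carries per-stream articulatory
# feature classifier log-scores, and each HMM state of a word model specifies
# one expected feature value per stream. Synchronous sketch, not the paper's DBN.
import numpy as np

def viterbi_score(stream_logp: np.ndarray, state_targets: np.ndarray,
                  stream_weights: np.ndarray) -> float:
    """stream_logp: (T, S, V) log-score of each of V values for S streams.
    state_targets: (Q, S) expected value index per state and stream.
    Returns the best left-to-right path log-score for this word model."""
    T, S, _ = stream_logp.shape
    Q = state_targets.shape[0]
    # Per-frame, per-state observation score: weighted sum over streams.
    obs = np.empty((T, Q))
    for q in range(Q):
        idx = state_targets[q]                       # (S,) targets per stream
        obs[:, q] = (stream_logp[:, np.arange(S), idx] * stream_weights).sum(axis=1)
    NEG = -np.inf
    delta = np.full(Q, NEG)
    delta[0] = obs[0, 0]                             # must start in state 0
    for t in range(1, T):
        stay = delta
        move = np.concatenate(([NEG], delta[:-1]))   # advance one state
        delta = np.maximum(stay, move) + obs[t]
    return delta[-1]                                 # must end in last state

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logp = np.log(rng.dirichlet(np.ones(3), size=(12, 2)))  # 12 frames, 2 streams
    targets = np.array([[0, 0], [1, 2], [2, 1]])            # 3-state word model
    print(viterbi_score(logp, targets, np.array([0.5, 0.5])))
```

Scoring every word model this way and picking the best total is the synchronous baseline the paper's DBN improves on by letting the streams drift apart within a word.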

Journal Article
TL;DR: The concept of speech management (SM) is introduced: the processes whereby a speaker manages his or her linguistic contributions to a communicative interaction, covering phenomena previously studied under rubrics such as "planning", "editing", and "self-repair".
Abstract: This paper introduces the concept of speech management (SM), which refers to processes whereby a speaker manages his or her linguistic contributions to a communicative interaction, and which involves phenomena which have previously been studied under such rubrics as “planning”, “editing”, “(self-)repair”, etc. It is argued that SM phenomena exhibit considerable systematicity and regularity and must be considered part of the linguistic system. Furthermore, it is argued that SM phenomena must be related not only to such intraindividual factors as planning and memory, but also to interactional factors such as turntaking and feedback, and to informational content. Structural and functional taxonomies are presented together with a formal description of complex types of SM. The structural types are exemplified with data from a corpus of SM phenomena.

85 citations


Network Information
Related Topics (5)
Topic                        Papers    Citations    Related
Vocabulary                   44.6K     941.5K       78%
Feature vector               48.8K     954.4K       76%
Feature extraction           111.8K    2.1M         75%
Feature (computer vision)    128.2K    1.7M         74%
Unsupervised learning        22.7K     1M           73%
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    7
2022    12
2021    13
2020    39
2019    19
2018    22