Topic

Viseme

About: Viseme is a research topic. Over its lifetime, 865 publications have been published on this topic, receiving 17,889 citations.


Papers
Book Chapter
01 Jan 1993
TL;DR: While this study focuses on the feasibility, validity, and stand-alone contribution of continuous OASR, future highly robust recognition systems should combine optical and acoustic information with syntactic, semantic, and pragmatic aids.
Abstract: This study describes the design and implementation of a novel continuous speech recognizer that uses optical information from the oral-cavity shadow of a speaker. The system uses hidden Markov models (HMMs) trained to discriminate optical information and achieves a recognition rate of 25.3 percent on 150 test sentences. This is the first system to accomplish continuous optical automatic speech recognition (OASR). This level of performance, achieved without syntactic, semantic, or any other contextual guide to the recognition process, indicates that OASR may serve as a major supplement for robust multi-modal recognition in noisy environments. Additionally, new features important for OASR were discovered, and novel approaches to vector quantization, training, and clustering were utilized. The study contains three major components. First, it hypothesizes 35 static and dynamic optical features to characterize the shadow of the speaker's oral cavity. Using the corresponding correlation matrix and a principal component analysis, the study discarded 22 of these features. The remaining 13 features are mostly dynamic, unlike the static features used by previous researchers. Second, the study merged phonemes that appear optically similar on the speaker's oral-cavity region into visemes. The visemes were objectively analyzed and discriminated using HMM and clustering algorithms. Most significantly, the computationally derived visemes for the speaker are consistent with the phoneme-to-viseme mappings discussed by most lipreading experts; this agreement, in a sense, validates the selection of oral-cavity features. Third, the study trained the HMMs to recognize, without a grammar, a set of sentences having a perplexity of 150, using visemes, trisemes (triplets of visemes), and generalized trisemes (clustered trisemes). The system achieved recognition rates of 2 percent, 12.7 percent, and 25.3 percent using, respectively, viseme HMMs, triseme HMMs, and generalized triseme HMMs. The study concludes that the methodologies used in this investigation demonstrate the need for further research on continuous OASR and on the integration of optical information with other recognition methods. While this study focuses on the feasibility, validity, and stand-alone contribution of continuous OASR, future highly robust recognition systems should combine optical and acoustic information with syntactic, semantic, and pragmatic aids.

94 citations
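The abstract's two preprocessing steps, reducing 35 hypothesized oral-cavity features to 13 and merging optically similar phonemes into visemes, can be illustrated with a short sketch. The Python below is not the paper's code: the PCA-based ranking heuristic, the feature matrix, and the phoneme-to-viseme mapping are all illustrative assumptions.

```python
# Minimal sketch of PCA-guided feature selection plus phoneme-to-viseme
# merging, in the spirit of the study above. All numbers and names are
# illustrative, not taken from the paper.
import numpy as np

def select_features(X: np.ndarray, keep: int = 13) -> np.ndarray:
    """Keep the `keep` original feature columns with the strongest
    variance-weighted loadings on the top `keep` principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    # Loading of each original feature on the top components,
    # weighted by the variance those components explain.
    scores = (eigvecs[:, -keep:] ** 2) @ eigvals[-keep:]
    top = np.sort(np.argsort(scores)[::-1][:keep])
    return X[:, top]

# Illustrative phoneme-to-viseme merge: optically indistinguishable phonemes
# share one viseme label (a textbook-style mapping, not the paper's own).
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "k": "V_velar", "g": "V_velar",
}

def to_visemes(phonemes):
    return [PHONEME_TO_VISEME.get(p, "V_other") for p in phonemes]

if __name__ == "__main__":
    X = np.random.rand(200, 35)          # 200 frames x 35 optical features
    print(select_features(X).shape)      # -> (200, 13)
    print(to_visemes(["p", "a", "m"]))   # ['V_bilabial', 'V_other', 'V_bilabial']
```

The relabeled viseme sequences would then serve as the training targets for the viseme and triseme HMMs the abstract describes.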

Proceedings Article
14 Jun 2020
TL;DR: This work describes a technique to detect manipulated videos by exploiting the fact that the dynamics of the mouth shape (visemes) are occasionally inconsistent with a spoken phoneme, and demonstrates the efficacy and robustness of this approach in detecting different types of deep-fake videos, including in-the-wild deep fakes.
Abstract: Recent advances in machine learning and computer graphics have made it easier to convincingly manipulate video and audio. These so-called deep-fake videos range from complete full-face synthesis and replacement (face-swap), to complete mouth and audio synthesis and replacement (lip-sync), and partial word-based audio and mouth synthesis and replacement. Detection of deep fakes with only a small spatial and temporal manipulation is particularly challenging. We describe a technique to detect such manipulated videos by exploiting the fact that the dynamics of the mouth shape - visemes - are occasionally inconsistent with a spoken phoneme. We focus on the visemes associated with words having the sound M (mama), B (baba), or P (papa) in which the mouth must completely close in order to pronounce these phonemes. We observe that this is not the case in many deep-fake videos. Such phoneme-viseme mismatches can, therefore, be used to detect even spatially small and temporally localized manipulations. We demonstrate the efficacy and robustness of this approach to detect different types of deep-fake videos, including in-the-wild deep fakes.

90 citations
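The phoneme-viseme mismatch test is simple enough to sketch. The Python below assumes a per-frame lip-aperture measurement from an upstream face tracker and phoneme-level time alignments; the closure threshold and all names are illustrative assumptions, not the authors'.

```python
# Hedged sketch of the mismatch check described above: for each aligned
# M/B/P phoneme, test whether the mouth ever closes during that segment.
from dataclasses import dataclass

BILABIALS = {"M", "B", "P"}     # phonemes requiring full lip closure
CLOSURE_THRESHOLD = 0.05        # assumed normalized aperture meaning "closed"

@dataclass
class PhonemeSegment:
    phoneme: str
    start_frame: int
    end_frame: int              # exclusive

def mismatched_segments(openness, segments):
    """Return bilabial segments during which the lips never close:
    candidate evidence of a lip-sync manipulation."""
    flagged = []
    for seg in segments:
        if seg.phoneme not in BILABIALS:
            continue
        window = openness[seg.start_frame:seg.end_frame]
        if window and min(window) > CLOSURE_THRESHOLD:
            flagged.append(seg)
    return flagged

if __name__ == "__main__":
    aperture = [0.30, 0.20, 0.12, 0.10, 0.15, 0.25]   # mouth never closes
    segs = [PhonemeSegment("M", 1, 4), PhonemeSegment("A", 4, 6)]
    for seg in mismatched_segments(aperture, segs):
        print(f"possible manipulation: /{seg.phoneme}/ "
              f"frames {seg.start_frame}-{seg.end_frame}")
```

The core signal being tested is exactly what the abstract describes: a bilabial phoneme whose aligned frames never show a closed mouth.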

Proceedings Article
17 Oct 2005
TL;DR: A novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory feature classifier scores, which can model varying degrees of co-articulation in a principled way, is presented.
Abstract: We present an approach to detecting and recognizing spoken isolated phrases based solely on visual input. We adopt an architecture that first employs discriminative detection of visual speech and articulatory features, and then performs recognition using a model that accounts for the loose synchronization of the feature streams. Discriminative classifiers detect the subclass of lip appearance corresponding to the presence of speech, and further decompose it into features corresponding to the physical components of articulatory production. These components often evolve in a semi-independent fashion, and conventional viseme-based approaches to recognition fail to capture the resulting co-articulation effects. We present a novel dynamic Bayesian network with a multi-stream structure and observations consisting of articulatory feature classifier scores, which can model varying degrees of co-articulation in a principled way. We evaluate our visual-only recognition system on a command utterance task. We show comparative results on lip detection and speech/non-speech classification, as well as recognition performance against several baseline systems.

87 citations
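A faithful dynamic Bayesian network with loosely synchronized streams is more than a short example can carry, but the flavor of scoring frames against per-stream articulatory-feature targets can be sketched. The Python below is a deliberately simplified, fully synchronous variant: it fuses per-stream classifier log-scores with a weighted sum and decodes with a left-to-right Viterbi, omitting the inter-stream asynchrony that is the paper's main contribution; all shapes, weights, and names are illustrative assumptions.

```python
# Simplified multi-stream scoring: each frame carries per-stream articulatory
# feature classifier log-scores, and each HMM state of a word model specifies
# one expected feature value per stream. Synchronous sketch, not the paper's DBN.
import numpy as np

def viterbi_score(stream_logp: np.ndarray, state_targets: np.ndarray,
                  stream_weights: np.ndarray) -> float:
    """stream_logp: (T, S, V) log-score of each of V values for S streams.
    state_targets: (Q, S) expected value index per state and stream.
    Returns the best left-to-right path log-score for this word model."""
    T, S, _ = stream_logp.shape
    Q = state_targets.shape[0]
    # Per-frame, per-state observation score: weighted sum over streams.
    obs = np.empty((T, Q))
    for q in range(Q):
        idx = state_targets[q]                       # (S,) targets per stream
        obs[:, q] = (stream_logp[:, np.arange(S), idx] * stream_weights).sum(axis=1)
    NEG = -np.inf
    delta = np.full(Q, NEG)
    delta[0] = obs[0, 0]                             # must start in state 0
    for t in range(1, T):
        stay = delta
        move = np.concatenate(([NEG], delta[:-1]))   # advance one state
        delta = np.maximum(stay, move) + obs[t]
    return delta[-1]                                 # must end in last state

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logp = np.log(rng.dirichlet(np.ones(3), size=(12, 2)))  # 12 frames, 2 streams
    targets = np.array([[0, 0], [1, 2], [2, 1]])            # 3-state word model
    print(viterbi_score(logp, targets, np.array([0.5, 0.5])))
```

Scoring every word model this way and picking the best total is the synchronous baseline the paper's DBN improves on by letting the streams drift apart within a word.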

Journal Article
TL;DR: The concept of speech management (SM) is introduced: the processes whereby a speaker manages his or her linguistic contributions to a communicative interaction, covering phenomena previously studied under rubrics such as "planning", "editing", and "self-repair".
Abstract: This paper introduces the concept of speech management (SM), which refers to processes whereby a speaker manages his or her linguistic contributions to a communicative interaction, and which involves phenomena which have previously been studied under such rubrics as “planning”, “editing”, “(self-)repair”, etc. It is argued that SM phenomena exhibit considerable systematicity and regularity and must be considered part of the linguistic system. Furthermore, it is argued that SM phenomena must be related not only to such intraindividual factors as planning and memory, but also to interactional factors such as turntaking and feedback, and to informational content. Structural and functional taxonomies are presented together with a formal description of complex types of SM. The structural types are exemplified with data from a corpus of SM phenomena.

85 citations


Network Information
Related Topics (5)
Topic                        Papers    Citations    Related
Vocabulary                   44.6K     941.5K       78%
Feature vector               48.8K     954.4K       76%
Feature extraction           111.8K    2.1M         75%
Feature (computer vision)    128.2K    1.7M         74%
Unsupervised learning        22.7K     1M           73%
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    7
2022    12
2021    13
2020    39
2019    19
2018    22