Open Access Proceedings Article

A German viseme-set for automatic transcription of input text used for audio-visual speech synthesis.

Christian Weiss et al.
pp. 2945–2948
TLDR
A German viseme inventory for visemically transcribing text according to its phonetic transcription is introduced; an inventory of German viseme classes in a SAMPA-like labelling is worked out, and a model for automatic visemic transcription of given input text is trained.
Abstract
In this paper, we introduce a German viseme inventory for visemically transcribing text according to its phonetic transcription. A viseme set like the one presented in this work is essential for speech-driven audio-visual synthesis, because the selection of appropriate video segments is based on the visemically transcribed input text. For text-to-speech synthesis, a transcription of the input text into a phonemic representation is used to avoid ambiguous meanings, to acquire the correct pronunciation of the underlying input text, and to serve as labels in unit-selection-based synthesis systems. Likewise, the visual synthesis requires a transcription that represents, analogously to the phonemes, the visual counterpart, called a viseme in the related literature, which also serves as a unit label in our data-driven, video-realistic audio-visual synthesis system. We worked out an inventory of German viseme classes in a SAMPA-like labelling and trained a model for automatic visemic transcription of given input text.
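The listing does not reproduce the paper's actual phoneme-to-viseme table, so the following is only a minimal sketch of the kind of many-to-one mapping the abstract describes; every viseme class name and grouping below is an illustrative assumption, not the authors' inventory.

```python
# Illustrative sketch of a many-to-one mapping from German SAMPA phonemes
# to viseme classes. Class names and groupings are assumptions for
# demonstration only; they are NOT the inventory defined in the paper.
GERMAN_SAMPA_TO_VISEME = {
    # bilabials share one closed-lips mouth shape
    "p": "V_BILABIAL", "b": "V_BILABIAL", "m": "V_BILABIAL",
    # labiodentals
    "f": "V_LABIODENTAL", "v": "V_LABIODENTAL",
    # rounded vowels
    "o:": "V_ROUNDED", "u:": "V_ROUNDED", "y:": "V_ROUNDED",
    # open vowels
    "a": "V_OPEN", "a:": "V_OPEN",
}

def visemic_transcription(phonemes):
    """Map a SAMPA phoneme sequence to viseme labels, collapsing repeats."""
    visemes = []
    for ph in phonemes:
        vis = GERMAN_SAMPA_TO_VISEME.get(ph, "V_OTHER")
        if not visemes or visemes[-1] != vis:
            visemes.append(vis)
    return visemes

# "Mama" in SAMPA: m a: m a
print(visemic_transcription(["m", "a:", "m", "a"]))
# -> ['V_BILABIAL', 'V_OPEN', 'V_BILABIAL', 'V_OPEN']
```

Collapsing adjacent identical visemes reflects that consecutive phonemes sharing one mouth shape select the same visual unit; whether the authors' transcription does this is not stated in the abstract.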


Citations
Journal Article (DOI)

Animated virtual characters to explore audio-visual speech in controlled and naturalistic environments.

TL;DR: The results conclusively demonstrate that computer-generated speech stimuli are a judicious choice, and that they can supplement natural speech while offering greater control over stimulus timing and content.

Weakly Supervised Automatic Transcription of Mouthings for Gloss-Based Sign Language Corpora

TL;DR: This work proposes a method to automatically annotate mouthings in sign language corpora, requiring no more than a simple gloss annotation and a source of weak supervision, such as automatic speech transcripts.

Handling multimodality and scarce resources in sign language machine translation

TL;DR: This thesis improves the automatic alignment between annotated signs and their spoken-language translations by applying morphosyntactic and semantic analyses and by bridging the differences between the languages to find corresponding signs and phrases.
Proceedings Article (DOI)

Avatars 4 all: an avatar generation toolchain

TL;DR: This work-in-progress paper presents an application-driven approach to a versatile avatar-generation toolchain, targeting the lack of an economical solution for integrating avatars across multiple platforms and devices.
References
Journal Article (DOI)

Hearing lips and seeing voices

TL;DR: The study reported here demonstrates a previously unrecognised influence of vision upon speech perception: on being shown a film of a young woman's talking head in which repeated utterances of the syllable [ba] had been dubbed onto lip movements for [ga], normal adults reported hearing [da].
Journal Article (DOI)

A maximum entropy approach to natural language processing

TL;DR: A maximum-likelihood approach for automatically constructing maximum entropy models is presented, and an efficient implementation of this approach is described, using several problems in natural language processing as examples.
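For context (the listing itself gives no formulas), the conditional maximum entropy model that this reference constructs has the standard exponential form, with feature functions f_i, weights λ_i fitted by maximum likelihood, and per-context normalizer Z(x):

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_i \lambda_i f_i(x, y') \Big)
```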
Proceedings Article (DOI)

Video Rewrite: driving visual speech with audio

TL;DR: Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack.
Book

Perceiving talking faces: from speech perception to a behavioral principle

TL;DR: This book presents a framework for perceiving and synthesizing talking faces, broadening the framework of the model, testing the model, and synthesizing and evaluating talking faces.
Book Chapter (DOI)

Modeling Coarticulation in Synthetic Visual Speech

TL;DR: An implementation of Lofqvist’s (1990) gestural theory of speech production for visual speech synthesis is described, along with the graphically controlled development system.
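This reference blends overlapping per-segment articulatory targets with dominance functions that peak at each segment's center and decay with temporal distance. A minimal sketch in the spirit of that dominance-function approach; all parameter values and the lip-rounding example are chosen purely for illustration:

```python
import math

def dominance(t, center, alpha=1.0, theta=4.0, c=1.0):
    """Dominance weight: peaks at the segment center and decays
    exponentially with temporal distance (illustrative parameters)."""
    return alpha * math.exp(-theta * abs(t - center) ** c)

def blended_parameter(t, segments):
    """Dominance-weighted average of per-segment articulatory targets.
    segments: list of (center_time_s, target_value) pairs."""
    weights = [dominance(t, center) for center, _ in segments]
    total = sum(weights)
    return sum(w * target for w, (_, target) in zip(weights, segments)) / total

# Two overlapping segments with lip-rounding targets 0.9 and 0.1:
# the blend transitions smoothly between them (coarticulation).
segments = [(0.10, 0.9), (0.25, 0.1)]
for t in (0.10, 0.175, 0.25):
    print(f"t={t:.3f}s  lip-rounding={blended_parameter(t, segments):.2f}")
```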