Open Access

Influence of Phone-Viseme Temporal Correlations on Audiovisual STT and TTS Performance.

Abstract
In this paper, we present a study of the temporal correlations of audiovisual units in continuous Russian speech. The corpus-based study identifies natural time asynchronies between the flows of the audible and visible speech modalities, caused in part by the inertia of the articulatory organs. Original methods for modeling this speech asynchrony are proposed and evaluated using bimodal ASR and TTS systems. The experimental results show that using asynchronous frameworks for combined audible and visible speech processing improves the accuracy of audiovisual speech recognition as well as the naturalness and intelligibility of speech synthesis.
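The abstract does not include code, but the quantity under study can be illustrated: given phone and viseme boundary times (e.g., from forced alignment), the audio-visual asynchrony of each unit is the signed offset between corresponding onsets. A minimal Python sketch, with all boundary times and function names invented for illustration:

```python
# Illustrative sketch: measuring phone-viseme onset asynchrony from
# forced-alignment boundaries. All data below are hypothetical; the
# paper's corpus and alignment tooling are not reproduced here.

def onset_asynchronies(phone_onsets, viseme_onsets):
    """Signed offsets (seconds) between corresponding phone and viseme
    onsets; positive values mean the visible gesture starts earlier."""
    assert len(phone_onsets) == len(viseme_onsets)
    return [p - v for p, v in zip(phone_onsets, viseme_onsets)]

# Hypothetical aligned onsets for one utterance (seconds).
phones  = [0.10, 0.24, 0.41, 0.58]   # acoustic segment starts
visemes = [0.06, 0.22, 0.35, 0.55]   # lip-gesture starts

offsets = onset_asynchronies(phones, visemes)
mean_async = sum(offsets) / len(offsets)
print(offsets, f"mean asynchrony = {mean_async:.3f} s")
```

With these invented numbers the offsets are all positive, i.e., the lip gesture leads the sound, which is the kind of natural asynchrony the abstract attributes to articulator inertia.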


Citations
Book Chapter

A Universal Assistive Technology with Multimodal Input and Multimedia Output Interfaces

TL;DR: The conceptual model and the software-hardware architecture of the universal assistive technology, with its levels and components, are described, and several multimodal systems and interfaces for people with disabilities are proposed.
Book Chapter

HAVRUS Corpus: High-Speed Recordings of Audio-Visual Russian Speech

TL;DR: A software-hardware complex for collecting audio-visual speech databases with a high-speed camera and a dynamic microphone is presented, and the architecture of the developed software, along with details of HAVRUS, the collected database of Russian audio-visual speech, is described.
Journal Article

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

TL;DR: The inner workings of AV Align are investigated and its audio-visual alignment patterns are visualised; a regularisation method that involves predicting lip-related Action Units from visual representations is proposed, which leads to better exploitation of the visual modality and encourages researchers to rethink the multimodal convergence problem when there is one dominant modality.
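As a rough, hypothetical illustration of such a regularisation scheme (not the paper's actual code), an auxiliary head can predict lip-related Action Units from the visual encoder's features, with its loss added to the recognition loss; the layer sizes, the weight lam, and all names below are assumptions:

```python
import torch
import torch.nn as nn

# Sketch of AU-based regularisation: an auxiliary head predicts
# lip-related Action Units from visual encoder features, and its loss
# is added to the main recognition loss. Dimensions are hypothetical.

visual_dim, n_aus, lam = 256, 5, 0.1

au_head = nn.Linear(visual_dim, n_aus)    # auxiliary AU predictor
au_criterion = nn.BCEWithLogitsLoss()     # AUs as binary targets

def total_loss(asr_loss, visual_feats, au_targets):
    """Recognition loss plus weighted AU-prediction loss."""
    au_logits = au_head(visual_feats)     # (batch, frames, n_aus)
    return asr_loss + lam * au_criterion(au_logits, au_targets)

# Hypothetical batch: 2 utterances, 10 visual frames each.
feats = torch.randn(2, 10, visual_dim)
targets = torch.randint(0, 2, (2, 10, n_aus)).float()
print(total_loss(torch.tensor(1.5), feats, targets))
```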
Book Chapter

Multimodal synthesizer for Russian and Czech sign languages and audio-visual speech

TL;DR: A model of a computer-animated avatar for the Russian and Czech sign languages is presented, with particular attention to the animation principles of the "talking head"; this allows the program's functionality to be extended as far as possible, making it suitable not only for deaf and hard-of-hearing people but also for blind and non-disabled people.
References
Journal Article

Articulatory phonology: an overview.

TL;DR: It is suggested that the gestural approach clarifies the understanding of phonological development, by positing that prelinguistic units of action are harnessed into (gestural) phonological structures through differentiation and coordination.
Proceedings Article

A coupled HMM for audio-visual speech recognition

TL;DR: This paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM) to model the state asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time.
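The coupling idea can be made concrete with a toy transition model in which each stream's next state is conditioned on the previous states of both streams, permitting bounded asynchrony while keeping the streams correlated. A hypothetical NumPy sketch, not the cited paper's implementation:

```python
import numpy as np

# Toy coupled-HMM transition: each stream's next state depends on the
# previous states of BOTH streams, which models bounded asynchrony
# while preserving audio-visual correlation. Values are illustrative.

n_a, n_v = 3, 3  # audio / visual states per unit

rng = np.random.default_rng(0)
# A[i, j, k]: P(next audio state k | prev audio i, prev visual j)
A = rng.random((n_a, n_v, n_a)); A /= A.sum(axis=2, keepdims=True)
# V[i, j, k]: P(next visual state k | prev audio i, prev visual j)
V = rng.random((n_a, n_v, n_v)); V /= V.sum(axis=2, keepdims=True)

def joint_transition(prev_a, prev_v):
    """P(next_a, next_v | prev_a, prev_v) under the coupled factorization."""
    return np.outer(A[prev_a, prev_v], V[prev_a, prev_v])

P = joint_transition(1, 2)
print(P.sum())  # a proper joint distribution sums to 1.0
```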
Journal Article

Inter-language differences in the influence of visual cues in speech perception.

TL;DR: The authors found that Japanese listeners were less influenced by visual cues than American listeners when presented with an audio signal dubbed onto a video recording of a talking face producing an incongruent syllable, whereas robust effects of visual cues have been reported for native English speakers in English-speaking cultures.
Journal Article

On the Importance of Audiovisual Coherence for the Perceived Quality of Synthesized Visual Speech

TL;DR: This work extended the well-known unit-selection audio synthesis technique to work with multimodal segments containing original combinations of audio and video, and showed that the degree of coherence between the auditory and visual modes influences the perceived quality of the synthesized visual speech fragment.
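A toy version of how such a coherence preference could enter a unit-selection cost (the weights and candidate structure are invented for illustration, not taken from the cited work):

```python
# Toy unit-selection cost with an audio-visual coherence term: units
# whose audio and video come from the same original recording are
# cheaper, reflecting the reported preference for coherent segments.
# The weights below are illustrative inventions.

W_TARGET, W_JOIN, W_COHERENCE = 1.0, 0.5, 2.0

def unit_cost(target_cost, join_cost, same_origin):
    """Lower is better; incoherent audio/video pairs pay a penalty."""
    penalty = 0.0 if same_origin else W_COHERENCE
    return W_TARGET * target_cost + W_JOIN * join_cost + penalty

print(unit_cost(0.3, 0.2, same_origin=True))   # 0.4
print(unit_cost(0.3, 0.2, same_origin=False))  # 2.4
```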

Learning optimal audiovisual phasing for a HMM-based control model for facial animation

TL;DR: In this article, an HMM-based trajectory formation model that predicts the articulatory trajectories of a talking face from phonetic input was proposed, along with a phasing model that predicts the delays between the acoustic boundaries of the allophones to be synthesized and the gestural boundaries of the HMM triphones.
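The quantity such a phasing model learns to predict can be illustrated directly: for each allophone, the signed delay between its acoustic boundary and the gestural boundary of its HMM triphone. All times and labels below are invented for illustration:

```python
# Sketch of preparing training targets for a phasing model: each
# allophone contributes one signed delay between its acoustic boundary
# and the gestural boundary of its HMM triphone.

samples = [
    # (allophone, acoustic boundary s, gestural boundary s)
    ("a", 0.12, 0.09),
    ("p", 0.30, 0.28),
    ("t", 0.47, 0.41),
]

training_pairs = [(ph, ac - ge) for ph, ac, ge in samples]
for ph, delay in training_pairs:
    print(f"{ph}: delay = {delay:+.2f} s")
```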