Open Access

Influence of Phone-Viseme Temporal Correlations on Audiovisual STT and TTS Performance.

Abstract
In this paper, we present a study of the temporal correlations of audiovisual units in continuous Russian speech. The corpus-based study identifies natural time asynchronies between the flows of the audible and visible speech modalities, caused in part by the inertia of the articulatory organs. Original methods for modeling this speech asynchrony are proposed and evaluated using bimodal ASR and TTS systems. The experimental results show that using asynchronous frameworks for combined audible and visible speech processing improves the accuracy of audiovisual speech recognition as well as the naturalness and intelligibility of speech synthesis.
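The abstract does not include code, but the quantity under study can be illustrated: given phone and viseme boundary times (e.g., from forced alignment), the audio-visual asynchrony of each unit is the signed offset between corresponding onsets. A minimal Python sketch, with all boundary times and function names invented for illustration:

```python
# Illustrative sketch: measuring phone-viseme onset asynchrony from
# forced-alignment boundaries. All data below are hypothetical; the
# paper's corpus and alignment tooling are not reproduced here.

def onset_asynchronies(phone_onsets, viseme_onsets):
    """Signed offsets (seconds) between corresponding phone and viseme
    onsets; positive values mean the visible gesture starts earlier."""
    assert len(phone_onsets) == len(viseme_onsets)
    return [p - v for p, v in zip(phone_onsets, viseme_onsets)]

# Hypothetical aligned onsets for one utterance (seconds).
phones  = [0.10, 0.24, 0.41, 0.58]   # acoustic segment starts
visemes = [0.06, 0.22, 0.35, 0.55]   # lip-gesture starts

offsets = onset_asynchronies(phones, visemes)
mean_async = sum(offsets) / len(offsets)
print(offsets, f"mean asynchrony = {mean_async:.3f} s")
```

With these invented numbers the offsets are all positive, i.e., the lip gesture leads the sound, which is the kind of natural asynchrony the abstract attributes to articulator inertia.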


Citations
Book Chapter

A Universal Assistive Technology with Multimodal Input and Multimedia Output Interfaces

TL;DR: The conceptual model and the software-hardware architecture of the universal assistive technology, with its levels and components, are described, and several multimodal systems and interfaces for people with disabilities are proposed.
Book Chapter

HAVRUS Corpus: High-Speed Recordings of Audio-Visual Russian Speech

TL;DR: A software-hardware complex for collecting audio-visual speech databases with a high-speed camera and a dynamic microphone is presented, and the architecture of the developed software, along with details of HAVRUS, the collected database of Russian audio-visual speech, is described.
Journal Article

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

TL;DR: The inner workings of AV Align are investigated and its audio-visual alignment patterns are visualised; a regularisation method that involves predicting lip-related Action Units from visual representations is proposed, which leads to better exploitation of the visual modality and encourages researchers to rethink the multimodal convergence problem when there is one dominant modality.
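As a rough, hypothetical illustration of such a regularisation scheme (not the paper's actual code), an auxiliary head can predict lip-related Action Units from the visual encoder's features, with its loss added to the recognition loss; the layer sizes, the weight lam, and all names below are assumptions:

```python
import torch
import torch.nn as nn

# Sketch of AU-based regularisation: an auxiliary head predicts
# lip-related Action Units from visual encoder features, and its loss
# is added to the main recognition loss. Dimensions are hypothetical.

visual_dim, n_aus, lam = 256, 5, 0.1

au_head = nn.Linear(visual_dim, n_aus)    # auxiliary AU predictor
au_criterion = nn.BCEWithLogitsLoss()     # AUs as binary targets

def total_loss(asr_loss, visual_feats, au_targets):
    """Recognition loss plus weighted AU-prediction loss."""
    au_logits = au_head(visual_feats)     # (batch, frames, n_aus)
    return asr_loss + lam * au_criterion(au_logits, au_targets)

# Hypothetical batch: 2 utterances, 10 visual frames each.
feats = torch.randn(2, 10, visual_dim)
targets = torch.randint(0, 2, (2, 10, n_aus)).float()
print(total_loss(torch.tensor(1.5), feats, targets))
```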
Book Chapter

Multimodal synthesizer for Russian and Czech sign languages and audio-visual speech

TL;DR: A model of a computer-animated avatar for the Russian and Czech sign languages is presented, with particular attention to the animation principles of the "talking head"; this allows the program's functionality to be extended as far as possible, making it suitable not only for deaf and hard-of-hearing people but also for blind and non-disabled people.
References
Journal Article

Articulatory phonology: an overview.

TL;DR: It is suggested that the gestural approach clarifies the understanding of phonological development, by positing that prelinguistic units of action are harnessed into (gestural) phonological structures through differentiation and coordination.
Proceedings Article

A coupled HMM for audio-visual speech recognition

TL;DR: This paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM) to model the state asynchrony of the audio and visual observation sequences while still preserving their natural correlation over time.
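The coupling idea can be made concrete with a toy transition model in which each stream's next state is conditioned on the previous states of both streams, permitting bounded asynchrony while keeping the streams correlated. A hypothetical NumPy sketch, not the cited paper's implementation:

```python
import numpy as np

# Toy coupled-HMM transition: each stream's next state depends on the
# previous states of BOTH streams, which models bounded asynchrony
# while preserving audio-visual correlation. Values are illustrative.

n_a, n_v = 3, 3  # audio / visual states per unit

rng = np.random.default_rng(0)
# A[i, j, k]: P(next audio state k | prev audio i, prev visual j)
A = rng.random((n_a, n_v, n_a)); A /= A.sum(axis=2, keepdims=True)
# V[i, j, k]: P(next visual state k | prev audio i, prev visual j)
V = rng.random((n_a, n_v, n_v)); V /= V.sum(axis=2, keepdims=True)

def joint_transition(prev_a, prev_v):
    """P(next_a, next_v | prev_a, prev_v) under the coupled factorization."""
    return np.outer(A[prev_a, prev_v], V[prev_a, prev_v])

P = joint_transition(1, 2)
print(P.sum())  # a proper joint distribution sums to 1.0
```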
Journal Article

Inter-language differences in the influence of visual cues in speech perception.

TL;DR: The authors found that Japanese listeners were less influenced by visual cues than American listeners when presented with an audio signal dubbed onto a video recording of a talking face producing an incongruent syllable, whereas robust effects of visual cues have been reported for native English speakers in English-speaking cultures.
Journal Article

On the Importance of Audiovisual Coherence for the Perceived Quality of Synthesized Visual Speech

TL;DR: This work extended the well-known unit-selection audio synthesis technique to work with multimodal segments containing original combinations of audio and video, and showed that the degree of coherence between the auditory and visual modes influences the perceived quality of the synthesized visual speech fragment.
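A toy version of how such a coherence preference could enter a unit-selection cost (the weights and candidate structure are invented for illustration, not taken from the cited work):

```python
# Toy unit-selection cost with an audio-visual coherence term: units
# whose audio and video come from the same original recording are
# cheaper, reflecting the reported preference for coherent segments.
# The weights below are illustrative inventions.

W_TARGET, W_JOIN, W_COHERENCE = 1.0, 0.5, 2.0

def unit_cost(target_cost, join_cost, same_origin):
    """Lower is better; incoherent audio/video pairs pay a penalty."""
    penalty = 0.0 if same_origin else W_COHERENCE
    return W_TARGET * target_cost + W_JOIN * join_cost + penalty

print(unit_cost(0.3, 0.2, same_origin=True))   # 0.4
print(unit_cost(0.3, 0.2, same_origin=False))  # 2.4
```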

Learning optimal audiovisual phasing for a HMM-based control model for facial animation

TL;DR: In this article, an HMM-based trajectory formation model that predicts the articulatory trajectories of a talking face from phonetic input was proposed, along with a phasing model that predicts the delays between the acoustic boundaries of the allophones to be synthesized and the gestural boundaries of the HMM triphones.
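The quantity such a phasing model learns to predict can be illustrated directly: for each allophone, the signed delay between its acoustic boundary and the gestural boundary of its HMM triphone. All times and labels below are invented for illustration:

```python
# Sketch of preparing training targets for a phasing model: each
# allophone contributes one signed delay between its acoustic boundary
# and the gestural boundary of its HMM triphone.

samples = [
    # (allophone, acoustic boundary s, gestural boundary s)
    ("a", 0.12, 0.09),
    ("p", 0.30, 0.28),
    ("t", 0.47, 0.41),
]

training_pairs = [(ph, ac - ge) for ph, ac, ge in samples]
for ph, delay in training_pairs:
    print(f"{ph}: delay = {delay:+.2f} s")
```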