Open Access · Posted Content

Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

TL;DR
In this article, the authors present multi-speaker experiments using the recently published TaL80 corpus and adapt the x-vector framework, popular in speech processing, to operate on ultrasound tongue videos.
Abstract
Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings not only represent the linguistic content but are also highly specific to the individual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In the experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize well to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought about a marginal reduction in the error rate of the spectral estimation step.
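To make the x-vector adaptation concrete, the following is a minimal sketch, not the authors' implementation. It only assumes that the frame-level TDNN layers of a standard x-vector network are fed flattened ultrasound frames instead of acoustic features, and that statistics pooling yields an utterance-level speaker embedding used for 50-way speaker classification (matching the speaker recognition setup in the abstract). The frame resolution, layer sizes, and embedding dimension are illustrative assumptions.

    import torch
    import torch.nn as nn


    class UltrasoundXVector(nn.Module):
        """x-vector-style network operating on ultrasound tongue image sequences.

        Frame resolution, layer widths and the embedding size are placeholders,
        not values taken from the paper.
        """

        def __init__(self, frame_height=64, frame_width=64, embed_dim=512, n_speakers=50):
            super().__init__()
            in_dim = frame_height * frame_width  # each ultrasound frame, flattened
            # Frame-level layers: 1-D convolutions over time play the role of the
            # TDNN layers in the original x-vector recipe.
            self.frame_layers = nn.Sequential(
                nn.Conv1d(in_dim, 512, kernel_size=5, dilation=1), nn.ReLU(), nn.BatchNorm1d(512),
                nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(), nn.BatchNorm1d(512),
                nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(), nn.BatchNorm1d(512),
                nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(1500),
            )
            # Segment-level layers after statistics pooling (mean + std -> 3000 dims).
            self.segment1 = nn.Linear(2 * 1500, embed_dim)   # embedding is taken here
            self.segment2 = nn.Linear(embed_dim, embed_dim)
            self.classifier = nn.Linear(embed_dim, n_speakers)
            self.relu = nn.ReLU()

        def forward(self, frames):
            # frames: (batch, time, height, width) ultrasound video
            b, t, h, w = frames.shape
            x = frames.reshape(b, t, h * w).transpose(1, 2)   # (batch, features, time)
            x = self.frame_layers(x)
            # Statistics pooling: summarize the whole utterance in a single vector.
            stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)
            embedding = self.segment1(stats)                  # speaker embedding ("x-vector")
            logits = self.classifier(self.relu(self.segment2(self.relu(embedding))))
            return logits, embedding


    if __name__ == "__main__":
        model = UltrasoundXVector()
        video = torch.randn(2, 100, 64, 64)    # 2 utterances, 100 ultrasound frames each
        logits, xvec = model(video)
        print(logits.shape, xvec.shape)        # torch.Size([2, 50]) torch.Size([2, 512])

In such a setup, the embedding returned by the forward pass could then be concatenated to the input of the ultrasound-to-speech conversion network to condition it on the speaker; how exactly the paper injects the embedding is not specified here, so that wiring is left out of the sketch.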


References
Proceedings Article

Articulatory-to-speech conversion using bi-directional long short-term memory

TL;DR: This study developed a method for simultaneously estimating the spectral envelope and sound source parameters from articulatory data obtained with an electromagnetic articulography (EMA) sensor, and performed objective and subjective evaluations of the word error rate to examine the effectiveness of the method.
Proceedings Article

Enhancing multimodal silent speech interfaces with feature selection

TL;DR: Both unsupervised and supervised feature selection (FS) techniques improve classification accuracy on both individual and combined modalities, and both are useful as pre-processing for feature fusion.
Proceedings Article

Vocoder-Based Speech Synthesis from Silent Videos

TL;DR: In this article, the authors presented a deep learning method to synthesize speech from the silent video of a talker: the model learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm.
Proceedings Article

Speaker-independent Classification of Phonetic Segments from Raw Ultrasound in Child Speech

TL;DR: This work investigates the classification of phonetic segments (tongue shapes) from raw ultrasound recordings under several training scenarios: speaker-dependent, multi-speaker, speaker-independent, and speaker-adapted, and observes that models underperform when applied to data from speakers not seen at training time.
Proceedings Article

TaL: A Synchronised Multi-Speaker Corpus of Ultrasound Tongue Imaging, Audio, and Lip Videos

TL;DR: The Tongue and Lips corpus (TaL) is a multi-speaker corpus of audio, ultrasound tongue imaging, and lip videos; it contains 24 hours of parallel ultrasound, video, and audio data, of which approximately 13.5 hours are speech.