Open Access · Posted Content

Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

TL;DR
In this article, the authors present multi-speaker experiments using the recently published TaL80 corpus and adapt the x-vector framework, popular in speech processing, to operate on ultrasound tongue videos.
Abstract
Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings not only represent the linguistic content but are also highly specific to the individual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In the experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize well to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought about a marginal reduction in the error rate of the spectral estimation step.
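To make the x-vector adaptation concrete, the following is a minimal sketch, not the authors' implementation. It only assumes that the frame-level TDNN layers of a standard x-vector network are fed flattened ultrasound frames instead of acoustic features, and that statistics pooling yields an utterance-level speaker embedding used for 50-way speaker classification (matching the speaker recognition setup in the abstract). The frame resolution, layer sizes, and embedding dimension are illustrative assumptions.

    import torch
    import torch.nn as nn


    class UltrasoundXVector(nn.Module):
        """x-vector-style network operating on ultrasound tongue image sequences.

        Frame resolution, layer widths and the embedding size are placeholders,
        not values taken from the paper.
        """

        def __init__(self, frame_height=64, frame_width=64, embed_dim=512, n_speakers=50):
            super().__init__()
            in_dim = frame_height * frame_width  # each ultrasound frame, flattened
            # Frame-level layers: 1-D convolutions over time play the role of the
            # TDNN layers in the original x-vector recipe.
            self.frame_layers = nn.Sequential(
                nn.Conv1d(in_dim, 512, kernel_size=5, dilation=1), nn.ReLU(), nn.BatchNorm1d(512),
                nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(), nn.BatchNorm1d(512),
                nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(), nn.BatchNorm1d(512),
                nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(1500),
            )
            # Segment-level layers after statistics pooling (mean + std -> 3000 dims).
            self.segment1 = nn.Linear(2 * 1500, embed_dim)   # embedding is taken here
            self.segment2 = nn.Linear(embed_dim, embed_dim)
            self.classifier = nn.Linear(embed_dim, n_speakers)
            self.relu = nn.ReLU()

        def forward(self, frames):
            # frames: (batch, time, height, width) ultrasound video
            b, t, h, w = frames.shape
            x = frames.reshape(b, t, h * w).transpose(1, 2)   # (batch, features, time)
            x = self.frame_layers(x)
            # Statistics pooling: summarize the whole utterance in a single vector.
            stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)
            embedding = self.segment1(stats)                  # speaker embedding ("x-vector")
            logits = self.classifier(self.relu(self.segment2(self.relu(embedding))))
            return logits, embedding


    if __name__ == "__main__":
        model = UltrasoundXVector()
        video = torch.randn(2, 100, 64, 64)    # 2 utterances, 100 ultrasound frames each
        logits, xvec = model(video)
        print(logits.shape, xvec.shape)        # torch.Size([2, 50]) torch.Size([2, 512])

In such a setup, the embedding returned by the forward pass could then be concatenated to the input of the ultrasound-to-speech conversion network to condition it on the speaker; how exactly the paper injects the embedding is not specified here, so that wiring is left out of the sketch.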


References
Proceedings Article

Articulatory-to-speech conversion using bi-directional long short-term memory

TL;DR: This study developed a method for simultaneously estimating the spectral envelope and sound source parameters from articulatory data obtained with an electromagnetic articulography (EMA) sensor, and performed objective and subjective evaluations of the word error rate to examine the effectiveness of the method.
Proceedings Article

Enhancing multimodal silent speech interfaces with feature selection

TL;DR: Both unsupervised and supervised feature selection (FS) techniques improve classification accuracy on both individual and combined modalities, and both are useful as pre-processing for feature fusion.
Proceedings Article

Vocoder-Based Speech Synthesis from Silent Videos

TL;DR: In this article, the authors presented a deep learning method to synthesize speech from the silent video of a talker: the model learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm.
Proceedings Article

Speaker-independent Classification of Phonetic Segments from Raw Ultrasound in Child Speech

TL;DR: This work investigates the classification of phonetic segments (tongue shapes) from raw ultrasound recordings under several training scenarios: speaker-dependent, multi-speaker, speaker-independent, and speaker-adapted, and observes that models underperform when applied to data from speakers not seen at training time.
Proceedings Article

TaL: A Synchronised Multi-Speaker Corpus of Ultrasound Tongue Imaging, Audio, and Lip Videos

TL;DR: The Tongue and Lips corpus (TaL) is a multi-speaker corpus of audio, ultrasound tongue imaging, and lip videos; it contains 24 hours of parallel ultrasound, video, and audio data, of which approximately 13.5 hours are speech.