Open Access · Posted Content

Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

TL;DR
In this article, the authors presented multi-speaker experiments using the recently published TaL80 corpus and adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos.
Abstract
Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings capture not only the linguistic content but also characteristics that are highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adjusted the x-vector framework, popular in speech processing, to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In the experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize well to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought about a marginal reduction in the error rate of the spectral estimation step.
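To illustrate the x-vector adjustment described in the abstract, here is a minimal PyTorch sketch of a speaker-embedding network operating on ultrasound tongue video: per-frame CNN features are pooled over time with mean and standard deviation statistics, as in the x-vector recipe, and mapped to a fixed-size embedding. The layer sizes, the 64×128 frame resolution, and the 50-speaker classifier head are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class UltrasoundXVector(nn.Module):
    """x-vector-style network for ultrasound tongue video (sketch).

    Frame-level features come from a small per-frame CNN, are aggregated
    over time with statistics pooling (mean + std), and are mapped to a
    fixed-size speaker embedding. All sizes are illustrative.
    """
    def __init__(self, embed_dim=512, n_speakers=50):
        super().__init__()
        # Per-frame feature extractor; each frame is assumed to be 1 x 64 x 128.
        self.frame_net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),   # -> 32 x 4 x 4 per frame
            nn.Flatten(),              # -> 512-dim frame vector
        )
        self.embedding = nn.Linear(2 * 512, embed_dim)   # mean+std pooled stats
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, video):                      # video: (B, T, 1, 64, 128)
        b, t = video.shape[:2]
        frames = self.frame_net(video.flatten(0, 1)).view(b, t, -1)
        stats = torch.cat([frames.mean(1), frames.std(1)], dim=-1)
        xvec = self.embedding(stats)               # the speaker embedding
        return self.classifier(xvec), xvec

model = UltrasoundXVector()
logits, xvec = model(torch.randn(2, 100, 1, 64, 128))  # 2 clips, 100 frames each
```

After training the classifier head on the 50 recognition speakers, the `xvec` output can be reused as a conditioning vector for unseen speakers, which is how the abstract's embedding experiments would plug into the conversion network.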


References
Journal ArticleDOI

Applying DNN Adaptation to Reduce the Session Dependency of Ultrasound Tongue Imaging-Based Silent Speech Interfaces

TL;DR: The results indicate that, with adaptation, less training data and training time are needed to reach the same speech quality as training a new DNN from scratch, showing that DNN adaptation can be useful for handling session dependency.
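As a rough illustration of the adaptation idea (not the paper's exact procedure), the sketch below fine-tunes an already-trained ultrasound-to-speech network on a small amount of data from a new recording session; `pretrained_model` and `adapt_loader` are hypothetical placeholders, and the small learning rate is an assumption.

```python
import torch

def adapt_to_session(pretrained_model, adapt_loader, lr=1e-4, epochs=5):
    """Fine-tune a pretrained ultrasound-to-speech DNN on a small amount
    of new-session data instead of retraining from scratch (sketch)."""
    opt = torch.optim.Adam(pretrained_model.parameters(), lr=lr)  # small LR
    loss_fn = torch.nn.MSELoss()  # regression to spectral targets
    pretrained_model.train()
    for _ in range(epochs):
        for ultrasound, spectrum in adapt_loader:
            opt.zero_grad()
            loss = loss_fn(pretrained_model(ultrasound), spectrum)
            loss.backward()
            opt.step()
    return pretrained_model
```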
Book ChapterDOI

3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

TL;DR: This work experiments with an approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time, and finds experimentally that the 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.
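A minimal sketch of the 3D-convolution idea, assuming PyTorch: the kernels span (time, height, width), so a short window of ultrasound frames is mapped to one spectral frame. All shapes, including the 80-dimensional target, are illustrative assumptions.

```python
import torch
import torch.nn as nn

# 3D-CNN sketch: convolutions cover (time, height, width), so temporal
# context is learned jointly with spatial tongue-contour features.
net3d = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=(5, 3, 3), padding=(2, 1, 1)), nn.ReLU(),
    nn.MaxPool3d((1, 2, 2)),
    nn.Conv3d(16, 32, kernel_size=(5, 3, 3), padding=(2, 1, 1)), nn.ReLU(),
    nn.AdaptiveAvgPool3d((1, 4, 4)),  # collapse time, keep a coarse spatial grid
    nn.Flatten(),
    nn.Linear(32 * 4 * 4, 80),        # e.g. one mel-spectral frame
)

out = net3d(torch.randn(2, 1, 9, 64, 128))  # (batch, channel, time, H, W)
print(out.shape)                             # torch.Size([2, 80])
```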
Proceedings ArticleDOI

Cross-Speaker Silent-Speech Command Word Recognition Using Electro-Optical Stomatography

TL;DR: In this paper, the authors presented the results of a study that used a measurement technology called Electro-Optical Stomatography to capture speech movements and used the acquired data to recognize a set of command words.
Posted Content

Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders

TL;DR: In this article, the authors compare the performance of three deep neural architectures on the MRI-to-speech estimation task, combining convolutional (CNN) and recurrent (LSTM) layers.
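For contrast with the 3D-CNN sketch above, here is a minimal sketch of the CNN+LSTM family of architectures the TL;DR refers to: a per-frame CNN encoder followed by an LSTM over time, regressing to vocoder features. Sizes and the input resolution are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CnnLstmRegressor(nn.Module):
    """Per-frame CNN encoder + LSTM over time, regressing to vocoder
    features, one frame of parameters per video frame (sketch)."""
    def __init__(self, feat_dim=80):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),       # 512 per frame
        )
        self.lstm = nn.LSTM(512, 256, batch_first=True)
        self.out = nn.Linear(256, feat_dim)              # vocoder parameters

    def forward(self, video):                            # (B, T, 1, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.lstm(feats)
        return self.out(hidden)                          # (B, T, feat_dim)

pred = CnnLstmRegressor()(torch.randn(2, 50, 1, 68, 68))  # MRI-like frames
```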
Book ChapterDOI

Improving Neural Silent Speech Interface Models by Adversarial Training

TL;DR: In this paper, a Generative Adversarial Network (GAN) is proposed to improve the perceptual quality of the generated signals by increasing their similarity to real signals, where the similarity is evaluated via a discriminator network.
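A minimal sketch of one such adversarial training step, assuming PyTorch: the discriminator D learns to separate real from generated spectral frames, while the generator G combines a regression loss with an adversarial term that rewards fooling D. G, D, the optimizers, the data, and the 0.01 loss weighting are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

def gan_step(G, D, opt_g, opt_d, ultrasound, real_spec):
    """One adversarial training step for an SSI generator (sketch)."""
    bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()
    fake_spec = G(ultrasound)

    # Discriminator update: push real frames toward 1, generated toward 0.
    opt_d.zero_grad()
    d_real = D(real_spec)
    d_fake = D(fake_spec.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    d_loss.backward()
    opt_d.step()

    # Generator update: regression to the target plus an adversarial term.
    opt_g.zero_grad()
    d_out = D(fake_spec)
    g_loss = mse(fake_spec, real_spec) + 0.01 * bce(d_out, torch.ones_like(d_out))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```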