Open Access · Posted Content

Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

TL;DR
In this article, the authors presented multi-speaker experiments using the recently published TaL80 corpus and adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos.
Abstract
Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings capture not only the linguistic content but also characteristics that are highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adjusted the x-vector framework, popular in speech processing, to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In the experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize well to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought about a marginal reduction in the error rate of the spectral estimation step.
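To illustrate the x-vector adjustment described in the abstract, here is a minimal PyTorch sketch of a speaker-embedding network operating on ultrasound tongue video: per-frame CNN features are pooled over time with mean and standard deviation statistics, as in the x-vector recipe, and mapped to a fixed-size embedding. The layer sizes, the 64×128 frame resolution, and the 50-speaker classifier head are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class UltrasoundXVector(nn.Module):
    """x-vector-style network for ultrasound tongue video (sketch).

    Frame-level features come from a small per-frame CNN, are aggregated
    over time with statistics pooling (mean + std), and are mapped to a
    fixed-size speaker embedding. All sizes are illustrative.
    """
    def __init__(self, embed_dim=512, n_speakers=50):
        super().__init__()
        # Per-frame feature extractor; each frame is assumed to be 1 x 64 x 128.
        self.frame_net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),   # -> 32 x 4 x 4 per frame
            nn.Flatten(),              # -> 512-dim frame vector
        )
        self.embedding = nn.Linear(2 * 512, embed_dim)   # mean+std pooled stats
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, video):                      # video: (B, T, 1, 64, 128)
        b, t = video.shape[:2]
        frames = self.frame_net(video.flatten(0, 1)).view(b, t, -1)
        stats = torch.cat([frames.mean(1), frames.std(1)], dim=-1)
        xvec = self.embedding(stats)               # the speaker embedding
        return self.classifier(xvec), xvec

model = UltrasoundXVector()
logits, xvec = model(torch.randn(2, 100, 1, 64, 128))  # 2 clips, 100 frames each
```

After training the classifier head on the 50 recognition speakers, the `xvec` output can be reused as a conditioning vector for unseen speakers, which is how the abstract's embedding experiments would plug into the conversion network.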


References
Journal ArticleDOI

Applying DNN Adaptation to Reduce the Session Dependency of Ultrasound Tongue Imaging-Based Silent Speech Interfaces

TL;DR: The results indicate that, with adaptation, less training data and training time are needed to reach the same speech quality as training a new DNN from scratch, showing that DNN adaptation can be useful for handling session dependency.
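As a rough illustration of the adaptation idea (not the paper's exact procedure), the sketch below fine-tunes an already-trained ultrasound-to-speech network on a small amount of data from a new recording session; `pretrained_model` and `adapt_loader` are hypothetical placeholders, and the small learning rate is an assumption.

```python
import torch

def adapt_to_session(pretrained_model, adapt_loader, lr=1e-4, epochs=5):
    """Fine-tune a pretrained ultrasound-to-speech DNN on a small amount
    of new-session data instead of retraining from scratch (sketch)."""
    opt = torch.optim.Adam(pretrained_model.parameters(), lr=lr)  # small LR
    loss_fn = torch.nn.MSELoss()  # regression to spectral targets
    pretrained_model.train()
    for _ in range(epochs):
        for ultrasound, spectrum in adapt_loader:
            opt.zero_grad()
            loss = loss_fn(pretrained_model(ultrasound), spectrum)
            loss.backward()
            opt.step()
    return pretrained_model
```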
Book ChapterDOI

3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

TL;DR: This work experiments with an approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time, and finds experimentally that the 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.
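A minimal sketch of the 3D-convolution idea, assuming PyTorch: the kernels span (time, height, width), so a short window of ultrasound frames is mapped to one spectral frame. All shapes, including the 80-dimensional target, are illustrative assumptions.

```python
import torch
import torch.nn as nn

# 3D-CNN sketch: convolutions cover (time, height, width), so temporal
# context is learned jointly with spatial tongue-contour features.
net3d = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=(5, 3, 3), padding=(2, 1, 1)), nn.ReLU(),
    nn.MaxPool3d((1, 2, 2)),
    nn.Conv3d(16, 32, kernel_size=(5, 3, 3), padding=(2, 1, 1)), nn.ReLU(),
    nn.AdaptiveAvgPool3d((1, 4, 4)),  # collapse time, keep a coarse spatial grid
    nn.Flatten(),
    nn.Linear(32 * 4 * 4, 80),        # e.g. one mel-spectral frame
)

out = net3d(torch.randn(2, 1, 9, 64, 128))  # (batch, channel, time, H, W)
print(out.shape)                             # torch.Size([2, 80])
```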
Proceedings ArticleDOI

Cross-Speaker Silent-Speech Command Word Recognition Using Electro-Optical Stomatography

TL;DR: In this paper, the authors presented the results of a study that used a measurement technology called Electro-Optical Stomatography to capture speech movements and used the acquired data to recognize a set of command words.
Posted Content

Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders

TL;DR: In this article, the authors compare the performance of three deep neural architectures on the MRI-to-speech estimation task, combining convolutional (CNN) and recurrent (LSTM) layers.
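For contrast with the 3D-CNN sketch above, here is a minimal sketch of the CNN+LSTM family of architectures the TL;DR refers to: a per-frame CNN encoder followed by an LSTM over time, regressing to vocoder features. Sizes and the input resolution are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CnnLstmRegressor(nn.Module):
    """Per-frame CNN encoder + LSTM over time, regressing to vocoder
    features, one frame of parameters per video frame (sketch)."""
    def __init__(self, feat_dim=80):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),       # 512 per frame
        )
        self.lstm = nn.LSTM(512, 256, batch_first=True)
        self.out = nn.Linear(256, feat_dim)              # vocoder parameters

    def forward(self, video):                            # (B, T, 1, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.lstm(feats)
        return self.out(hidden)                          # (B, T, feat_dim)

pred = CnnLstmRegressor()(torch.randn(2, 50, 1, 68, 68))  # MRI-like frames
```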
Book ChapterDOI

Improving Neural Silent Speech Interface Models by Adversarial Training

TL;DR: In this paper, a Generative Adversarial Network (GAN) is proposed to improve the perceptual quality of the generated signals by increasing their similarity to real signals, where the similarity is evaluated via a discriminator network.
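A minimal sketch of one such adversarial training step, assuming PyTorch: the discriminator D learns to separate real from generated spectral frames, while the generator G combines a regression loss with an adversarial term that rewards fooling D. G, D, the optimizers, the data, and the 0.01 loss weighting are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

def gan_step(G, D, opt_g, opt_d, ultrasound, real_spec):
    """One adversarial training step for an SSI generator (sketch)."""
    bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()
    fake_spec = G(ultrasound)

    # Discriminator update: push real frames toward 1, generated toward 0.
    opt_d.zero_grad()
    d_real = D(real_spec)
    d_fake = D(fake_spec.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    d_loss.backward()
    opt_d.step()

    # Generator update: regression to the target plus an adversarial term.
    opt_g.zero_grad()
    d_out = D(fake_spec)
    g_loss = mse(fake_spec, real_spec) + 0.01 * bce(d_out, torch.ones_like(d_out))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```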