Posted Content (Open Access)
Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces
TL;DR: The authors present multi-speaker experiments using the recently published TaL80 corpus and adapt the x-vector framework, popular in speech processing, to operate on ultrasound tongue videos.

Abstract:
Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings represent not only the linguistic content but are also highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adapted the x-vector framework, popular in speech processing, to operate on ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In our experiments, we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize well to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought about a marginal reduction in the error rate of the spectral estimation step.
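The core of the x-vector framework is statistics pooling: frame-level features are summarized by their per-dimension mean and standard deviation over time, yielding one fixed-size vector per recording, which a trained affine layer then maps to the speaker embedding. A minimal NumPy sketch of that pooling step, with hypothetical feature and embedding sizes and random stand-in data (not the paper's trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for frame-level features extracted from an
# ultrasound tongue video: 200 frames, 128 features per frame.
frames = rng.standard_normal((200, 128))

def stats_pooling(feats):
    """x-vector-style statistics pooling: concatenate the per-feature
    mean and standard deviation over the time axis, turning a
    variable-length sequence into a single fixed-size vector."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return np.concatenate([mu, sigma])

# Hypothetical projection to a 64-dimensional embedding. In the x-vector
# framework this is a trained affine layer; random weights here, purely
# for illustration of the shapes involved.
W = rng.standard_normal((64, 2 * frames.shape[1])) * 0.05
embedding = W @ stats_pooling(frames)
print(embedding.shape)  # (64,)
```

Because the pooled vector has a fixed size regardless of how many frames the video contains, the same embedding extractor handles utterances of any length.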
References
Journal Article
Applying dnn adaptation to reduce the session dependency of ultrasound tongue imaging-based silent speech interfaces
Gábor Gosztolya, Tamás Grósz, László Tóth, Alexandra Markó, Tamás Gábor Csapó +6 more
TL;DR: The results indicate that with adaptation, less training data and training time are needed to achieve the same speech quality than when training a new DNN from scratch, and it is shown that DNN adaptation can be useful for handling session dependency.
Book Chapter
3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces
TL;DR: This work experiments with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time, and finds experimentally that the 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.
Proceedings Article
Cross-Speaker Silent-Speech Command Word Recognition Using Electro-Optical Stomatography
Simon Stone, Peter Birkholz +1 more
TL;DR: In this paper, the authors present the results of a study using a measurement technology called Electro-Optical Stomatography to capture speech movements and use the acquired data to recognize a number of command words.
Posted Content
Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders.
TL;DR: In this article, the authors compare the performance of three deep neural architectures for the estimation task, combining convolutional (CNN) and recurrence-based (LSTM) neural layers.
Book Chapter
Improving Neural Silent Speech Interface Models by Adversarial Training
TL;DR: In this paper, a Generative Adversarial Network (GAN) is proposed to improve the perceptual quality of the generated signals by increasing their similarity to real signals, where the similarity is evaluated via a discriminator network.
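The 3D-CNN entry above replaces per-frame 2D filtering with convolution over a time axis, so each output value aggregates a small spatio-temporal patch of the frame sequence. A minimal NumPy sketch of that operation on toy sizes and random data (a naive "valid" convolution, not the referenced paper's architecture):

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution (cross-correlation, as in
    deep-learning frameworks). volume: (T, H, W) stack of frames;
    kernel: (kt, kh, kw). The leading axis is time, which is what
    distinguishes 3D convolution from per-frame 2D convolution."""
    T, H, W = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(
                    volume[t:t + kt, i:i + kh, j:j + kw] * kernel
                )
    return out

# Toy example: 8 frames of 16x16 pixels, a 3x3x3 averaging kernel.
frames = np.random.default_rng(1).standard_normal((8, 16, 16))
kernel = np.full((3, 3, 3), 1.0 / 27.0)
out = conv3d_valid(frames, kernel)
print(out.shape)  # (6, 14, 14)
```

Real SSI networks stack many such filters with learned weights; the point of the sketch is only how the time axis shrinks along with the spatial axes.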
Related Papers (5)
Speaker-Adaptive Speech Recognition Based on Surface Electromyography
Michael Wand, Tanja Schultz +1 more