DNN-Based Ultrasound-to-Speech Conversion for a Silent Speech Interface.

doi:10.21437/INTERSPEECH.2017-939

Open Access · Proceedings Article

- pp 3672-3676

TLDR

It is found that the representation that used several neighboring image frames in combination with a feature selection method was preferred both by the subjects taking part in the listening experiments, and in terms of the Normalized Mean Squared Error.

Abstract:

In this paper we present our initial results in articulatory-to-acoustic conversion based on tongue movement recordings using Deep Neural Networks (DNNs). Despite the fact that deep learning has revolutionized several fields, so far only a few researchers have applied DNNs for this task. Here, we compare various possible feature representation approaches combined with DNN-based regression. As the input, we recorded synchronized 2D ultrasound images and speech signals. The task of the DNN was to estimate Mel-Generalized Cepstrum-based Line Spectral Pair (MGC-LSP) coefficients, which then served as input to a standard pulse-noise vocoder for speech synthesis. As the raw ultrasound images have a relatively high resolution, we experimented with various feature selection and transformation approaches to reduce the size of the feature vectors. The synthetic speech signals resulting from the various DNN configurations were evaluated both using objective measures and a subjective listening test. We found that the representation that used several neighboring image frames in combination with a feature selection method was preferred both by the subjects taking part in the listening experiments, and in terms of the Normalized Mean Squared Error. Our results may be useful for creating Silent Speech Interface applications in the future.
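The preferred representation described in the abstract (feature selection plus several neighboring frames) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each ultrasound frame is flattened to a pixel vector, that a pixel is scored by its maximum absolute Pearson correlation with any target coefficient, and that the top 20% of pixels are kept (one plausible reading of the "max., 20%" note in the Figure 1 caption). The function names `correlation_mask` and `stack_neighbors`, and the context width of 2, are hypothetical.

```python
import numpy as np


def correlation_mask(frames, targets, keep_fraction=0.2):
    """Boolean mask over pixels, keeping those most correlated with the targets.

    frames:  (n_frames, n_pixels) flattened ultrasound images
    targets: (n_frames, n_targets) spectral coefficients (e.g. MGC-LSP)
    Assumption: a pixel's score is its max |Pearson r| over all targets.
    """
    x = frames - frames.mean(axis=0)
    y = targets - targets.mean(axis=0)
    denom = np.outer(np.linalg.norm(x, axis=0), np.linalg.norm(y, axis=0))
    denom[denom == 0] = np.inf  # constant pixels or targets get correlation 0
    corr = (x.T @ y) / denom    # shape (n_pixels, n_targets)
    score = np.abs(corr).max(axis=1)
    n_keep = max(1, int(round(keep_fraction * score.size)))
    mask = np.zeros(score.size, dtype=bool)
    mask[np.argsort(score)[::-1][:n_keep]] = True
    return mask


def stack_neighbors(features, context=2):
    """Concatenate each frame with `context` frames on either side.

    Edge frames are padded by repeating the first/last frame.
    """
    n = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])
```

With 50 pixels per frame, `correlation_mask(frames, targets)` keeps 10 of them, and `stack_neighbors(frames[:, mask], context=2)` yields a 50-dimensional input per frame (10 selected pixels × 5 frames) that a regression DNN could map to the spectral targets.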

Figures

Table 2: Naturalness scores of listening test #1

Table 1: NMSE and mean R² scores on the development set

Figure 2: The first two extracted Eigentongues.

Figure 1: A raw ultrasound image and the mask which was used in our correlation-based feature selection method (max., 20%).

Figure 3: Results of listening test #2 concerning naturalness. The error bars show the 95% confidence intervals.
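Table 1 reports NMSE and mean R² as objective measures. The page does not give their exact definitions, so the sketch below uses common ones as an assumption: NMSE is the total squared error divided by the total squared deviation of the targets from their per-coefficient mean (pooled over all coefficients), and mean R² is the coefficient of determination computed per coefficient and then averaged. The function names `nmse` and `mean_r2` are hypothetical, and the paper's definitions may differ.

```python
import numpy as np


def nmse(y_true, y_pred):
    """Pooled normalized mean squared error (assumed definition).

    y_true, y_pred: (n_frames, n_coefficients)
    0 means perfect prediction; 1 matches always predicting the target mean.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = np.sum((y_true - y_pred) ** 2)
    ref = np.sum((y_true - y_true.mean(axis=0)) ** 2)
    return float(err / ref)


def mean_r2(y_true, y_pred):
    """Coefficient of determination per coefficient, averaged (assumed definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))
```

Under these definitions a lower NMSE and a higher mean R² both indicate a closer match between predicted and reference spectral coefficients.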

Citations

Proceedings Article

SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks

Naoki Kimura, +2 more

TL;DR: A system is proposed that detects a user's unvoiced utterance and recognizes its content without the user vocalizing; audio signals generated by the system are confirmed to be able to control existing smart speakers.

Proceedings Article

F0 Estimation for DNN-Based Ultrasound Silent Speech Interfaces

Tamás Grósz, +4 more

TL;DR: Deep neural networks are used to perform articulatory-to-acoustic conversion from ultrasound images, with an emphasis on estimating the voicing feature and the F0 curve from the ultrasound input, achieving a correlation of 0.74.

Proceedings Article

DNN-based Acoustic-to-Articulatory Inversion using Ultrasound Tongue Imaging

Dagoberto Porras, +2 more

TL;DR: This paper implements several Deep Neural Networks to estimate articulatory information from the acoustic signal, and shows that CW-SSIM is the most useful error measure in the UTI context.

Journal Article

Non-Invasive Silent Phoneme Recognition Using Microwave Signals

Peter Birkholz, +3 more

- 01 Dec 2018 -

IEEE Transactions on Audio, Speech, and ...

TL;DR: Electromagnetic transmission and reflection measurements of the vocal tract have great potential for future silent-speech interfaces, and are suggested to be a viable alternative to established methods.

Proceedings Article

Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces.

László Tóth, +4 more

TL;DR: The results show that the parallel learning of the two types of targets is indeed beneficial for both tasks, and improvements are obtained by using multi-task training of deep neural networks as a weight initialization step before task-specific training.

References

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: State-of-the-art performance was achieved by a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

Journal Article

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, +3 more

- 01 Jun 2017 -

IEEE Transactions on Pattern Analysis an...

TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.

Posted Content

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, +3 more

- 04 Jun 2015 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Faster R-CNN introduces a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-CNN for detection.

Journal Article

Eigenfaces for recognition

Matthew Turk, +1 more

- 01 Jan 1991 -

Journal of Cognitive Neuroscience

TL;DR: A near-real-time computer system that can locate and track a subject's head, and then recognize the person by comparing characteristics of the face to those of known individuals, and that is easy to implement using a neural network architecture.

Proceedings Article

Holistically-Nested Edge Detection

Saining Xie, +1 more

TL;DR: HED turns pixel-wise edge classification into image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets to approach the human ability to resolve the challenging ambiguity in edge and object boundary detection.
