Open Access, Proceedings Article

DNN-Based Ultrasound-to-Speech Conversion for a Silent Speech Interface.

TLDR
The representation that used several neighboring image frames in combination with a feature selection method was preferred both by the subjects in the listening experiments and in terms of the Normalized Mean Squared Error.
Abstract
In this paper we present our initial results in articulatory-to-acoustic conversion based on tongue movement recordings using Deep Neural Networks (DNNs). Although deep learning has revolutionized several fields, so far only a few researchers have applied DNNs to this task. Here, we compare various feature representation approaches combined with DNN-based regression. As input, we recorded synchronized 2D ultrasound images and speech signals. The task of the DNN was to estimate Mel-Generalized Cepstrum-based Line Spectral Pair (MGC-LSP) coefficients, which then served as input to a standard pulse-noise vocoder for speech synthesis. As the raw ultrasound images have a relatively high resolution, we experimented with various feature selection and transformation approaches to reduce the size of the feature vectors. The synthetic speech signals resulting from the various DNN configurations were evaluated using both objective measures and a subjective listening test. We found that the representation that used several neighboring image frames in combination with a feature selection method was preferred both by the subjects taking part in the listening experiments and in terms of the Normalized Mean Squared Error. Our results may be useful for creating Silent Speech Interface applications in the future.
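The abstract does not spell out how neighboring frames were combined, which feature selection method was used, or how the Normalized Mean Squared Error was normalized. The following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation: each ultrasound frame is taken to be a flattened row vector, feature selection is approximated by a correlation ranking against the target coefficients, and NMSE is taken as the mean squared error divided by the variance of the target. The function names `stack_neighbors`, `select_by_correlation`, and `nmse`, and the parameters `context` and `keep`, are hypothetical.

```python
import numpy as np


def stack_neighbors(frames, context=2):
    """Concatenate each frame with `context` frames on either side.

    frames: array of shape (n_frames, n_features).
    Edge frames are padded by repeating the first/last frame (an assumption).
    Returns an array of shape (n_frames, (2 * context + 1) * n_features).
    """
    n_frames = frames.shape[0]
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n_frames] for i in range(2 * context + 1)])


def select_by_correlation(features, targets, keep):
    """Return sorted indices of the `keep` feature columns whose absolute
    Pearson correlation with any target column is largest.

    This ranking is an illustrative stand-in for "a feature selection
    method"; the abstract does not name the method actually used.
    """
    f = features - features.mean(axis=0)
    t = targets - targets.mean(axis=0)
    denom = np.outer(np.linalg.norm(f, axis=0), np.linalg.norm(t, axis=0))
    denom[denom == 0] = np.inf  # constant columns get zero correlation
    corr = (f.T @ t) / denom
    score = np.abs(corr).max(axis=1)
    return np.sort(np.argsort(score)[::-1][:keep])


def nmse(y_true, y_pred):
    """Mean squared error divided by the variance of the target
    (an assumed normalization)."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```

In this sketch one would first pick columns with `select_by_correlation`, then call `stack_neighbors` on the reduced frames before feeding them to a regression network; the order of the two steps is also an assumption. Under this normalization, a predictor that always outputs the global mean of the target scores an NMSE of 1, and a perfect predictor scores 0.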


Citations
Proceedings Article

SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks

TL;DR: A system to detect a user's unvoiced utterance and recognize the utterance contents without the user's uttering voice is proposed, and it is confirmed that audio signals generated by the system can control the existing smart speakers.
Proceedings Article

F0 Estimation for DNN-Based Ultrasound Silent Speech Interfaces

TL;DR: Deep neural networks are experimented with to perform articulatory-to-acoustic conversion from ultrasound images, with an emphasis on estimating the voicing feature and the F0 curve from the ultrasound input, with a correlation rate of 0.74.
Proceedings Article

DNN-based Acoustic-to-Articulatory Inversion using Ultrasound Tongue Imaging

TL;DR: This paper implemented several different Deep Neural Networks to estimate the articulatory information from the acoustic signal, and shows that CW-SSIM is the most useful error measure in the UTI context.
Journal Article

Non-Invasive Silent Phoneme Recognition Using Microwave Signals

TL;DR: Electromagnetic transmission and reflection measurements of the vocal tract have great potential for future silent-speech interfaces, and are suggested to be a viable alternative to established methods.
Proceedings Article

Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces.

TL;DR: The results show that the parallel learning of the two types of targets is indeed beneficial for both tasks, and improvements are obtained by using multi-task training of deep neural networks as a weight initialization step before task-specific training.