Open Access · Posted Content

Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders.

TLDR
In this article, the authors compare the performance of three deep neural architectures for the estimation task, combining convolutional (CNN) and recurrence-based (LSTM) neural layers.
Abstract
Several approaches exist for the recording of articulatory movements, such as electromagnetic and permanent magnetic articulography, ultrasound tongue imaging and surface electromyography. Although magnetic resonance imaging (MRI) is more costly than the above approaches, recent developments in this area now allow the recording of real-time MRI videos of the articulators with an acceptable resolution. Here, we experiment with the reconstruction of the speech signal from a real-time MRI recording using deep neural networks. Instead of estimating speech directly, our networks are trained to output a spectral vector, from which we reconstruct the speech signal using the WaveGlow neural vocoder. We compare the performance of three deep neural architectures for the estimation task, combining convolutional (CNN) and recurrence-based (LSTM) neural layers. Besides the mean absolute error (MAE) of our networks, we also evaluate our models by comparing the original and reconstructed speech signals using several objective speech quality metrics, such as mel-cepstral distortion (MCD), Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ) and Signal-to-Distortion Ratio (SDR). The results indicate that our approach can successfully reconstruct the gross spectral shape, but more improvements are needed to reproduce the fine spectral details.
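As a rough illustration of the pipeline described above, here is a minimal PyTorch sketch (all layer sizes and names are illustrative choices, not the authors' configuration): a small CNN encodes each MRI frame, an LSTM models the frame sequence, and a linear head predicts one 80-bin mel-spectral vector per frame, which a neural vocoder such as WaveGlow would then invert to a waveform. The L1 loss at the end corresponds to the MAE objective mentioned in the abstract.

import torch
import torch.nn as nn

class MRIToMel(nn.Module):
    """Hypothetical CNN+LSTM regressor from MRI video to mel frames."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),           # (32, 4, 4) per frame
        )
        self.lstm = nn.LSTM(32 * 4 * 4, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_mels)

    def forward(self, video):                       # (batch, time, 1, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))       # encode all frames jointly
        feats = feats.flatten(1).view(b, t, -1)     # back to (b, t, 512)
        out, _ = self.lstm(feats)                   # model the time axis
        return self.head(out)                       # (b, t, n_mels)

model = MRIToMel()
mels = model(torch.randn(2, 50, 1, 68, 68))         # 50 MRI frames in, 50 mel frames out
loss = nn.L1Loss()(mels, torch.zeros_like(mels))    # MAE training objective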


Citations
Journal ArticleDOI

Optimizing the Ultrasound Tongue Image Representation for Residual Network-Based Articulatory-to-Acoustic Mapping

TL;DR: In this paper, the authors compared the raw scanline representation with the wedge-shaped processed ultrasound tongue image (UTI) as the input of a residual network applied for articulatory-to-acoustic mapping (AAM).
Book ChapterDOI

Voice Activity Detection for Ultrasound-Based Silent Speech Interfaces Using Convolutional Neural Networks.

TL;DR: In this paper, a convolutional neural network classifier was used to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal.
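A minimal sketch of this labeling scheme, under the assumption (our choice, not necessarily the authors') that py-webrtcvad plays the role of the conventional VAD: the parallel speech signal is cut into short frames, each frame is tagged speech/silence, and the resulting labels supervise a small image classifier.

import numpy as np
import torch.nn as nn
import webrtcvad

def frame_labels(pcm16, sr=16000, frame_ms=30, aggressiveness=2):
    # Label each 30 ms chunk of 16-bit mono PCM as speech (1) or silence (0).
    vad = webrtcvad.Vad(aggressiveness)
    step = int(sr * frame_ms / 1000)
    return np.array([int(vad.is_speech(pcm16[i:i + step].tobytes(), sr))
                     for i in range(0, len(pcm16) - step, step)])

# Toy CNN classifier over single ultrasound images (architecture is illustrative).
classifier = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.LazyLinear(2),                 # silent vs. speech logits
)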
Posted Content

Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

TL;DR: In this article, the authors presented multi-speaker experiments using the recently published TaL80 corpus and adjusted the x-vector framework, popular in speech processing, to operate with ultrasound tongue videos.
Book ChapterDOI

Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator

TL;DR: In this paper, a fully convolutional asymmetric translator guided by a self residual attention strategy was proposed to exploit the moving muscular structures during speech, leveraging the pairwise correlation of samples of the same utterance through a latent-space representation disentanglement strategy.
References
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
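To make the long-time-lag claim concrete, here is a toy PyTorch setup (illustrative only, not from the paper): the label of each sequence is fixed by its first element, and the LSTM must carry that information across 1000 steps to the read-out, exactly the regime the constant error carousels were designed for.

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
head = nn.Linear(32, 2)

x = torch.randn(8, 1000, 1)              # 8 sequences, 1000 time steps each
y = (x[:, 0, 0] > 0).long()              # label decided at step 0
out, _ = lstm(x)
logits = head(out[:, -1])                # read out 1000 steps later
loss = nn.functional.cross_entropy(logits, y)
loss.backward()                          # error flows back through the gates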
Journal ArticleDOI

An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech

TL;DR: A short-time objective intelligibility measure (STOI) is presented, which shows high correlation with the intelligibility of noisy and time-frequency weighted noisy speech (e.g., resulting from noise reduction) in three different listening experiments, and correlates better with speech intelligibility than five other reference objective intelligibility models.
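For readers who want to compute the measure, a hedged usage sketch with the open-source pystoi package (one common implementation; the authors' reference code may differ). The signals below are random stand-ins for real clean and degraded recordings.

import numpy as np
from pystoi import stoi

fs = 10000                                   # STOI operates on 10 kHz speech
clean = np.random.randn(3 * fs)              # stand-in for a clean recording
degraded = clean + 0.5 * np.random.randn(3 * fs)
score = stoi(clean, degraded, fs, extended=False)   # roughly 0..1, higher is better
print(f"STOI = {score:.3f}")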
Proceedings ArticleDOI

A Closer Look at Spatiotemporal Convolutions for Action Recognition

TL;DR: In this article, a new spatiotemporal convolutional block, "R(2+1)D", was proposed, which achieved state-of-the-art performance on Sports-1M, Kinetics, UCF101, and HMDB51.
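A minimal sketch of the factorization the block's name encodes: the full t x k x k 3D convolution is replaced by a 1 x k x k spatial convolution followed by a t x 1 x 1 temporal one, with a ReLU in between that adds nonlinearity at a comparable parameter count (channel widths here are arbitrary).

import torch
import torch.nn as nn

class R2Plus1DConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, t=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, k, k),
                                 padding=(0, k // 2, k // 2))   # 2D spatial part
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(out_ch, out_ch, (t, 1, 1),
                                  padding=(t // 2, 0, 0))       # 1D temporal part

    def forward(self, x):                    # x: (batch, ch, time, H, W)
        return self.temporal(self.relu(self.spatial(x)))

block = R2Plus1DConv(3, 64)
video = torch.randn(1, 3, 16, 112, 112)      # a 16-frame RGB clip
print(block(video).shape)                    # torch.Size([1, 64, 16, 112, 112])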
Proceedings ArticleDOI

Waveglow: A Flow-based Generative Network for Speech Synthesis

TL;DR: WaveGlow is a flow-based network capable of generating high-quality speech from mel-spectrograms without the need for auto-regression; it is implemented as a single network, trained with a single cost function: maximizing the likelihood of the training data.
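A hedged sketch of the inversion step that the MRI paper above builds on, using the torch.hub entry point NVIDIA publishes for WaveGlow (the entry-point name and mel layout follow their example and may have changed; the random mel here is purely a stand-in for a predicted spectrogram).

import torch

waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp32')
waveglow = waveglow.remove_weightnorm(waveglow).eval()

mel = torch.randn(1, 80, 300)                # (batch, n_mels, frames) stand-in
with torch.no_grad():
    audio = waveglow.infer(mel)              # waveform samples at 22.05 kHz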
Proceedings Article

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

TL;DR: The proposed model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the paper also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.
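As a toy illustration of the generator side (deliberately far smaller and simpler than the actual MelGAN architecture), here is a fully convolutional, non-autoregressive stack of transposed convolutions that upsamples 80-bin mel frames to a waveform by a factor of 8 x 8 x 4 = 256:

import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.Conv1d(80, 256, 7, padding=3),
    nn.ConvTranspose1d(256, 128, 16, stride=8, padding=4), nn.LeakyReLU(0.2),
    nn.ConvTranspose1d(128, 64, 16, stride=8, padding=4), nn.LeakyReLU(0.2),
    nn.ConvTranspose1d(64, 1, 8, stride=4, padding=2), nn.Tanh(),
)

mel = torch.randn(1, 80, 100)                # 100 mel frames
audio = generator(mel)                       # (1, 1, 100 * 256) samples
print(audio.shape)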