Open Access · Posted Content

Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders.

TLDR
In this article, the authors compare the performance of three deep neural architectures for the estimation task, combining convolutional (CNN) and recurrence-based (LSTM) neural layers.
Abstract
Several approaches exist for the recording of articulatory movements, such as electromagnetic and permanent magnetic articulography, ultrasound tongue imaging and surface electromyography. Although magnetic resonance imaging (MRI) is more costly than the above approaches, recent developments in this area now allow the recording of real-time MRI videos of the articulators with an acceptable resolution. Here, we experiment with the reconstruction of the speech signal from a real-time MRI recording using deep neural networks. Instead of estimating speech directly, our networks are trained to output a spectral vector, from which we reconstruct the speech signal using the WaveGlow neural vocoder. We compare the performance of three deep neural architectures for the estimation task, combining convolutional (CNN) and recurrence-based (LSTM) neural layers. Besides the mean absolute error (MAE) of our networks, we also evaluate our models by comparing the original and reconstructed speech signals using several objective speech quality metrics, such as mel-cepstral distortion (MCD), Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ) and Signal-to-Distortion Ratio (SDR). The results indicate that our approach can successfully reconstruct the gross spectral shape, but more improvements are needed to reproduce the fine spectral details.
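As a rough illustration of the pipeline described above, here is a minimal PyTorch sketch (all layer sizes and names are illustrative choices, not the authors' configuration): a small CNN encodes each MRI frame, an LSTM models the frame sequence, and a linear head predicts one 80-bin mel-spectral vector per frame, which a neural vocoder such as WaveGlow would then invert to a waveform. The L1 loss at the end corresponds to the MAE objective mentioned in the abstract.

import torch
import torch.nn as nn

class MRIToMel(nn.Module):
    """Hypothetical CNN+LSTM regressor from MRI video to mel frames."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),           # (32, 4, 4) per frame
        )
        self.lstm = nn.LSTM(32 * 4 * 4, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_mels)

    def forward(self, video):                       # (batch, time, 1, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))       # encode all frames jointly
        feats = feats.flatten(1).view(b, t, -1)     # back to (b, t, 512)
        out, _ = self.lstm(feats)                   # model the time axis
        return self.head(out)                       # (b, t, n_mels)

model = MRIToMel()
mels = model(torch.randn(2, 50, 1, 68, 68))         # 50 MRI frames in, 50 mel frames out
loss = nn.L1Loss()(mels, torch.zeros_like(mels))    # MAE training objective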


Citations
Journal ArticleDOI

Optimizing the Ultrasound Tongue Image Representation for Residual Network-Based Articulatory-to-Acoustic Mapping

TL;DR: In this paper, the authors compared the raw scanline representation with the wedge-shaped processed ultrasound tongue image (UTI) as the input of a residual network applied for articulatory-to-acoustic mapping (AAM).
Book ChapterDOI

Voice Activity Detection for Ultrasound-Based Silent Speech Interfaces Using Convolutional Neural Networks.

TL;DR: In this paper, a convolutional neural network classifier was used to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal.
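A minimal sketch of this labeling scheme, under the assumption (our choice, not necessarily the authors') that py-webrtcvad plays the role of the conventional VAD: the parallel speech signal is cut into short frames, each frame is tagged speech/silence, and the resulting labels supervise a small image classifier.

import numpy as np
import torch.nn as nn
import webrtcvad

def frame_labels(pcm16, sr=16000, frame_ms=30, aggressiveness=2):
    # Label each 30 ms chunk of 16-bit mono PCM as speech (1) or silence (0).
    vad = webrtcvad.Vad(aggressiveness)
    step = int(sr * frame_ms / 1000)
    return np.array([int(vad.is_speech(pcm16[i:i + step].tobytes(), sr))
                     for i in range(0, len(pcm16) - step, step)])

# Toy CNN classifier over single ultrasound images (architecture is illustrative).
classifier = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.LazyLinear(2),                 # silent vs. speech logits
)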
Posted Content

Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

TL;DR: In this article, the authors presented multi-speaker experiments using the recently published TaL80 corpus and adjusted the x-vector framework, popular in speech processing, to operate with ultrasound tongue videos.
Book ChapterDOI

Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator

TL;DR: In this paper, a fully convolutional asymmetric translator guided by a self residual attention strategy was proposed to exploit the moving muscular structures during speech, leveraging the pairwise correlation of samples of the same utterance through a latent-space representation disentanglement strategy.
References
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
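To make the long-time-lag claim concrete, here is a toy PyTorch setup (illustrative only, not from the paper): the label of each sequence is fixed by its first element, and the LSTM must carry that information across 1000 steps to the read-out, exactly the regime the constant error carousels were designed for.

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
head = nn.Linear(32, 2)

x = torch.randn(8, 1000, 1)              # 8 sequences, 1000 time steps each
y = (x[:, 0, 0] > 0).long()              # label decided at step 0
out, _ = lstm(x)
logits = head(out[:, -1])                # read out 1000 steps later
loss = nn.functional.cross_entropy(logits, y)
loss.backward()                          # error flows back through the gates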
Journal ArticleDOI

An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech

TL;DR: A short-time objective intelligibility measure (STOI) is presented, which shows high correlation with the intelligibility of noisy and time-frequency weighted noisy speech (e.g., resulting from noise reduction) in three different listening experiments, and correlates better with speech intelligibility than five other reference objective intelligibility models.
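For readers who want to compute the measure, a hedged usage sketch with the open-source pystoi package (one common implementation; the authors' reference code may differ). The signals below are random stand-ins for real clean and degraded recordings.

import numpy as np
from pystoi import stoi

fs = 10000                                   # STOI operates on 10 kHz speech
clean = np.random.randn(3 * fs)              # stand-in for a clean recording
degraded = clean + 0.5 * np.random.randn(3 * fs)
score = stoi(clean, degraded, fs, extended=False)   # roughly 0..1, higher is better
print(f"STOI = {score:.3f}")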
Proceedings ArticleDOI

A Closer Look at Spatiotemporal Convolutions for Action Recognition

TL;DR: In this article, a new spatiotemporal convolutional block, "R(2+1)D", was proposed, which achieved state-of-the-art performance on Sports-1M, Kinetics, UCF101, and HMDB51.
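A minimal sketch of the factorization the block's name encodes: the full t x k x k 3D convolution is replaced by a 1 x k x k spatial convolution followed by a t x 1 x 1 temporal one, with a ReLU in between that adds nonlinearity at a comparable parameter count (channel widths here are arbitrary).

import torch
import torch.nn as nn

class R2Plus1DConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, t=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, k, k),
                                 padding=(0, k // 2, k // 2))   # 2D spatial part
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(out_ch, out_ch, (t, 1, 1),
                                  padding=(t // 2, 0, 0))       # 1D temporal part

    def forward(self, x):                    # x: (batch, ch, time, H, W)
        return self.temporal(self.relu(self.spatial(x)))

block = R2Plus1DConv(3, 64)
video = torch.randn(1, 3, 16, 112, 112)      # a 16-frame RGB clip
print(block(video).shape)                    # torch.Size([1, 64, 16, 112, 112])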
Proceedings ArticleDOI

Waveglow: A Flow-based Generative Network for Speech Synthesis

TL;DR: WaveGlow is a flow-based network capable of generating high-quality speech from mel-spectrograms without the need for auto-regression; it is implemented as a single network, trained with a single cost function: maximizing the likelihood of the training data.
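A hedged sketch of the inversion step that the MRI paper above builds on, using the torch.hub entry point NVIDIA publishes for WaveGlow (the entry-point name and mel layout follow their example and may have changed; the random mel here is purely a stand-in for a predicted spectrogram).

import torch

waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp32')
waveglow = waveglow.remove_weightnorm(waveglow).eval()

mel = torch.randn(1, 80, 300)                # (batch, n_mels, frames) stand-in
with torch.no_grad():
    audio = waveglow.infer(mel)              # waveform samples at 22.05 kHz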
Proceedings Article

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

TL;DR: The proposed model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the paper also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.
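As a toy illustration of the generator side (deliberately far smaller and simpler than the actual MelGAN architecture), here is a fully convolutional, non-autoregressive stack of transposed convolutions that upsamples 80-bin mel frames to a waveform by a factor of 8 x 8 x 4 = 256:

import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.Conv1d(80, 256, 7, padding=3),
    nn.ConvTranspose1d(256, 128, 16, stride=8, padding=4), nn.LeakyReLU(0.2),
    nn.ConvTranspose1d(128, 64, 16, stride=8, padding=4), nn.LeakyReLU(0.2),
    nn.ConvTranspose1d(64, 1, 8, stride=4, padding=2), nn.Tanh(),
)

mel = torch.randn(1, 80, 100)                # 100 mel frames
audio = generator(mel)                       # (1, 1, 100 * 256) samples
print(audio.shape)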