Posted Content

Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces

TL;DR: In this article, the authors present multi-speaker experiments using the recently published TaL80 corpus and adapt the x-vector framework, popular in speech processing, to operate on ultrasound tongue videos.
Abstract: Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings capture not only the linguistic content but are also highly specific to the individual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In these experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize nicely to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought about a marginal reduction in the error rate of the spectral estimation step.
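The paper's multi-speaker setup conditions the conversion network on a speaker embedding. A minimal sketch of that idea, assuming (hypothetically, this is not the authors' code) that conditioning is done by concatenating the same x-vector to every frame-level feature vector, as is common in multi-speaker speech synthesis:

```python
# Hypothetical sketch: conditioning an articulatory-to-acoustic
# conversion model on a speaker embedding by concatenating the
# embedding to each ultrasound frame's feature vector.

def condition_on_speaker(frame_features, speaker_embedding):
    """Append the speaker embedding to every frame's features,
    producing the input of the conversion network."""
    return [frame + speaker_embedding for frame in frame_features]

# Toy example: 3 frames of 4-dim features, a 2-dim speaker embedding.
frames = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8],
          [0.9, 1.0, 1.1, 1.2]]
xvec = [0.3, -0.7]

conditioned = condition_on_speaker(frames, xvec)
assert len(conditioned) == 3 and len(conditioned[0]) == 6
```

The embedding is constant across the utterance, so only the frame-level features vary over time; the network can use the extra dimensions to normalize away speaker-specific articulation.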
References
Proceedings ArticleDOI
15 Apr 2018
TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve the robustness of deep neural network embeddings for speaker recognition.
Abstract: In this paper, we use data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition. The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings that we call x-vectors. Prior studies have found that embeddings leverage large-scale training datasets better than i-vectors. However, it can be challenging to collect substantial quantities of labeled data for training. We use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness. The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese. We find that while augmentation is beneficial in the PLDA classifier, it is not helpful in the i-vector extractor. However, the x-vector DNN effectively exploits data augmentation, due to its supervised training. As a result, the x-vectors achieve superior performance on the evaluation datasets.

2,300 citations
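The augmentation described above mixes noise into clean audio at a controlled signal-to-noise ratio. An illustrative sketch of the additive-noise part (this is not the actual Kaldi recipe; the scaling rule is the standard RMS-based one):

```python
import math
import random

def rms(x):
    """Root-mean-square level of a signal given as a list of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def add_noise(clean, noise, snr_db):
    """Scale `noise` so the mix has the requested SNR (dB), then add it."""
    gain = rms(clean) / (rms(noise) * 10 ** (snr_db / 20.0))
    return [c + gain * n for c, n in zip(clean, noise)]

random.seed(0)
clean = [math.sin(0.01 * i) for i in range(1000)]
noise = [random.gauss(0.0, 1.0) for _ in range(1000)]
noisy = add_noise(clean, noise, snr_db=10.0)

# The injected noise component sits exactly 10 dB below the clean signal.
residual = [n - c for n, c in zip(noisy, clean)]
assert abs(20 * math.log10(rms(clean) / rms(residual)) - 10.0) < 1e-9
```

Reverberation augmentation works analogously but convolves the clean signal with a room impulse response instead of adding a scaled noise track.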

Proceedings ArticleDOI
12 Apr 2018
TL;DR: In this article, a new spatiotemporal convolutional block "R(2+1)D" was proposed, which achieves results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101, and HMDB51.
Abstract: In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.

1,827 citations
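The "(2+1)D" factorization replaces a full t×d×d 3D convolution with a d×d spatial convolution into M intermediate channels followed by a length-t temporal convolution, with M chosen so the parameter counts roughly match (the matching formula below follows the R(2+1)D paper; the concrete channel numbers are just an example):

```python
# Parameter accounting for the (2+1)D factorization of a 3D convolution.

def matched_mid_channels(t, d, n_in, n_out):
    """Intermediate channel count M that keeps the factorized block's
    parameter count close to the full 3D convolution's."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def params_3d(t, d, n_in, n_out):
    """Weights in a full t x d x d 3D convolution."""
    return t * d * d * n_in * n_out

def params_2plus1d(t, d, n_in, n_out):
    """Weights in the factorized spatial (d x d) + temporal (t) pair."""
    m = matched_mid_channels(t, d, n_in, n_out)
    return d * d * n_in * m + t * m * n_out

t, d, n_in, n_out = 3, 3, 64, 64
# With M matched this way, the factorized block spends roughly the same
# parameter budget but gains an extra nonlinearity between the two convs.
assert params_2plus1d(t, d, n_in, n_out) <= params_3d(t, d, n_in, n_out)
```

The accuracy gains reported in the paper are attributed to this doubled nonlinearity and to the factorized block being easier to optimize, not to a change in capacity.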

Proceedings ArticleDOI
06 Sep 2015
TL;DR: This paper proposes a time delay neural network architecture which models long term temporal dependencies with training times comparable to standard feed-forward DNNs and uses sub-sampling to reduce computation during training.
Abstract: Recurrent neural network architectures have been shown to efficiently model long term temporal dependencies between acoustic events. However, the training time of recurrent networks is higher than that of feed-forward networks due to the sequential nature of the learning algorithm. In this paper we propose a time delay neural network architecture which models long term temporal dependencies with training times comparable to standard feed-forward DNNs. The network uses sub-sampling to reduce computation during training. On the Switchboard task we show a relative improvement of 6% over the baseline DNN model. We present results on several LVCSR tasks with training data ranging from 3 to 1800 hours to show the effectiveness of the TDNN architecture in learning wider temporal dependencies in both small and large data scenarios.

1,016 citations
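The key trick in this TDNN is that sub-sampled layer contexts (e.g. splicing only frames {-3, +3} instead of the full range [-3..3]) preserve the network's total receptive field while computing far fewer activations. A small sketch of how per-layer contexts compose (the context sets below are illustrative, not the paper's exact configuration):

```python
# How TDNN layer contexts compose: each layer splices its input at the
# given frame offsets; the receptive field of the stack is the sum of
# the per-layer extremes, regardless of whether the interior offsets
# are kept (dense) or dropped (sub-sampled).

def receptive_field(layer_contexts):
    """Total [left, right] input context seen by the top layer."""
    left = sum(min(c) for c in layer_contexts)
    right = sum(max(c) for c in layer_contexts)
    return left, right

dense      = [[-2, -1, 0, 1, 2], [-1, 0, 1], [-1, 0, 1], [-3, -2, -1, 0, 1, 2, 3]]
subsampled = [[-2, -1, 0, 1, 2], [-1, 1],    [-1, 1],    [-3, 3]]

# Same temporal span, but the sub-sampled stack splices far fewer frames.
assert receptive_field(dense) == receptive_field(subsampled) == (-7, 7)
assert sum(len(c) for c in subsampled) < sum(len(c) for c in dense)
```

The same pattern (sparse temporal offsets at higher layers) is what the x-vector architecture in the first reference builds on.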

Proceedings ArticleDOI
12 May 2019
TL;DR: This paper proposes WaveGlow, a flow-based network capable of generating high-quality speech from mel-spectrograms without the need for auto-regression; it is implemented as a single network trained with a single cost function: maximizing the likelihood of the training data.
Abstract: In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow [1] and WaveNet [2] in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable. Our PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation. All code will be made publicly available online [3].

606 citations
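The flow in WaveGlow is built from affine coupling layers, which are exactly invertible and have a cheap log-determinant. A toy sketch of the coupling idea (the "network" producing scale and bias below is a deterministic stand-in, not the WaveGlow architecture): half of the input passes through unchanged and parameterizes an affine transform of the other half.

```python
import math

def _scale_and_bias(xa):
    # Stand-in for the learned WaveNet-like network inside the coupling
    # layer: here, scale and bias are simple functions of xa.
    s = [math.exp(0.1 * v) for v in xa]   # strictly positive scales
    t = [0.5 * v for v in xa]             # biases
    return s, t

def coupling_forward(xa, xb):
    s, t = _scale_and_bias(xa)
    yb = [si * bi + ti for si, bi, ti in zip(s, xb, t)]
    log_det = sum(math.log(si) for si in s)   # Jacobian log-determinant
    return xa, yb, log_det

def coupling_inverse(ya, yb):
    s, t = _scale_and_bias(ya)               # ya == xa, so s, t recomputable
    xb = [(bi - ti) / si for si, bi, ti in zip(s, yb, t)]
    return ya, xb

xa, xb = [0.2, -1.0], [3.0, 4.0]
ya, yb, log_det = coupling_forward(xa, xb)
_, xb_rec = coupling_inverse(ya, yb)
assert all(abs(a - b) < 1e-12 for a, b in zip(xb, xb_rec))
```

Because every layer is invertible and the log-determinant is just a sum of log-scales, the exact data likelihood can be maximized directly, which is what makes the single-cost-function training simple and stable.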

Proceedings ArticleDOI
29 Mar 2018
TL;DR: Attentive statistics pooling for deep speaker embedding in text-independent speaker verification uses an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations, capturing long-term variations in speaker characteristics more effectively.
Abstract: This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an utterance-level feature. Our method utilizes an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations. In this way, it can capture long-term variations in speaker characteristics more effectively. An evaluation on the NIST SRE 2012 and the VoxCeleb data sets shows that it reduces equal error rates (EERs) from the conventional method by 7.5% and 8.1%, respectively.

450 citations
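The pooling step described above can be sketched in a few lines. In this illustrative version the per-frame attention scores are given directly (in the paper they come from a small trained attention network); the utterance-level vector is the weighted mean concatenated with the weighted standard deviation:

```python
import math

def softmax(scores):
    """Numerically stable softmax over per-frame attention scores."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [v / z for v in e]

def attentive_stats_pool(frames, scores):
    """Weighted mean ++ weighted std over frames (T x D -> 2D vector)."""
    w = softmax(scores)
    T, D = len(frames), len(frames[0])
    mean = [sum(w[t] * frames[t][d] for t in range(T)) for d in range(D)]
    std = [math.sqrt(max(sum(w[t] * (frames[t][d] - mean[d]) ** 2
                            for t in range(T)), 0.0))
           for d in range(D)]
    return mean + std  # concatenated, as in statistics pooling

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = attentive_stats_pool(frames, scores=[0.0, 0.0, 0.0])
# With uniform scores this reduces to plain statistics pooling.
assert all(abs(p - e) < 1e-9 for p, e in zip(pooled[:2], [3.0, 4.0]))
```

With non-uniform scores, frames the attention network deems more speaker-discriminative dominate both statistics, which is where the reported EER reductions come from.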