Natural Text-to-Speech Synthesis by Conditioning Spectrogram Predictions from Transformer Network on WaveGlow Vocoder

doi:10.1109/ISCMI51676.2020.9311564

Proceedings ArticleDOI

Natural Text-to-Speech Synthesis by Conditioning Spectrogram Predictions from Transformer Network on WaveGlow Vocoder

G Sanjay, +2 more

Chats0

TLDR

This work aims to look at some of the issues and limitations present in the current works of Tacotron-2, and attempts to further improve its performance by modifying its architecture.

Abstract:

Text to Speech (TTS) is a form of speech synthesis where the text is converted into a spoken human-like voice output. The state of the art methods for TTS employs a neural network based approach. This work aims to look at some of the issues and limitations present in the current works, specifically Tacotron-2, and attempts to further improve its performance by modifying its architecture. The modified model uses Transformer network as a Spectrogram Prediction Network (SPN) and WaveGlow as an Audio Generation Network (AGN). For the modified model, performance improvements are seen in terms of the speech output generated for corresponding texts, the inference time taken for audio generation, and a Mean Opinion Score (MOS) of 4.10 (out of 5) is obtained.

Citations

PDF

Open Access

More filters

Proceedings ArticleDOI

Spoofing-Aware Speaker Verification with Unsupervised Domain Adaptation

Xuechen Liu, +2 more

TL;DR: This paper begins the concern of enhancing the spoofing robustness of the automatic speaker verification (ASV) system, without the primary presence of a separate countermeasure module by employing three unsupervised domain adaptation techniques to optimize the back-end using the audio data in the training partition of the ASVspoof 2019 dataset.

...read moreread less

Proceedings ArticleDOI

Bi-Sep: A Multi-Resolution Cross-Domain Monaural Speech Separation Framework

Kuan-Hsun Ho, +2 more

TL;DR: In this article , a Bi-Projection Fusion (BPF) mechanism was proposed to merge the information between two domains to improve the fine-grained vision of time-domain methods.

...read moreread less

References

PDF

Open Access

More filters

Posted Content

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, +2 more

- 01 Sep 2014 -

arXiv: Computation and Language

TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

...read moreread less

Posted Content

WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, +8 more

- 12 Sep 2016 -

arXiv: Sound

TL;DR: This paper proposed WaveNet, a deep neural network for generating audio waveforms, which is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones.

...read moreread less

WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, +8 more

TL;DR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.

...read moreread less

Proceedings Article

Variational Inference with Normalizing Flows

Danilo Jimenez Rezende, +1 more

TL;DR: It is demonstrated that the theoretical advantages of having posteriors that better match the true posterior, combined with the scalability of amortized variational approaches, provides a clear improvement in performance and applicability of variational inference.

...read moreread less

Proceedings Article

Attention-based models for speech recognition

Jan Chorowski, +4 more

TL;DR: The authors proposed a location-aware attention mechanism for the TIMET phoneme recognition task, which achieved an improved 18.7% phoneme error rate (PER) on utterances which are roughly as long as the ones it was trained on.

...read moreread less