Proceedings ArticleDOI

Natural Text-to-Speech Synthesis by Conditioning Spectrogram Predictions from Transformer Network on WaveGlow Vocoder

TLDR
This work examines some of the issues and limitations in current approaches, specifically Tacotron-2, and attempts to further improve its performance by modifying its architecture.
Abstract
Text-to-Speech (TTS) is a form of speech synthesis in which text is converted into spoken, human-like voice output. State-of-the-art methods for TTS employ a neural-network-based approach. This work examines some of the issues and limitations in current approaches, specifically Tacotron-2, and attempts to further improve its performance by modifying its architecture. The modified model uses a Transformer network as the Spectrogram Prediction Network (SPN) and WaveGlow as the Audio Generation Network (AGN). The modified model shows improvements in the quality of the speech output generated for corresponding texts and in the inference time taken for audio generation, and obtains a Mean Opinion Score (MOS) of 4.10 (out of 5).
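The two-stage pipeline described in the abstract can be sketched in code. The stub classes below are illustrative only, not the paper's actual models: the real SPN is a Transformer that aligns text to mel-spectrogram frames via attention, and the real AGN is a WaveGlow-style flow vocoder. The mel-channel count (80) and hop length (256) are common choices assumed here, not values taken from the paper.

```python
import numpy as np

N_MELS = 80       # mel channels (common choice, assumed)
HOP_LENGTH = 256  # waveform samples per spectrogram frame (assumed)

class SpectrogramPredictionNetwork:
    """Stub SPN: emits one mel frame per input character.
    A real Transformer SPN predicts frame counts via attention/stop tokens."""
    def predict(self, text: str) -> np.ndarray:
        n_frames = len(text)
        return np.zeros((N_MELS, n_frames))  # placeholder mel spectrogram

class AudioGenerationNetwork:
    """Stub AGN: upsamples frames to a waveform.
    A real WaveGlow-style vocoder inverts the mel spectrogram with a flow."""
    def generate(self, mel: np.ndarray) -> np.ndarray:
        n_frames = mel.shape[1]
        return np.zeros(n_frames * HOP_LENGTH)  # placeholder audio samples

def synthesize(text: str) -> np.ndarray:
    """Text -> mel spectrogram (SPN) -> waveform (AGN)."""
    mel = SpectrogramPredictionNetwork().predict(text)
    return AudioGenerationNetwork().generate(mel)

audio = synthesize("Hello world")
print(audio.shape)  # (2816,): 11 characters x 256 samples per frame
```

The key design point illustrated is the decoupling: the SPN and AGN communicate only through the mel spectrogram, so either stage can be swapped (e.g. Tacotron-2's SPN for a Transformer, or WaveNet for WaveGlow) without retraining the other from scratch.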


Citations
Proceedings ArticleDOI

Spoofing-Aware Speaker Verification with Unsupervised Domain Adaptation

TL;DR: This paper addresses the spoofing robustness of automatic speaker verification (ASV) systems without requiring a separate countermeasure module, by employing three unsupervised domain adaptation techniques to optimize the back-end using the audio data in the training partition of the ASVspoof 2019 dataset.
Proceedings ArticleDOI

Bi-Sep: A Multi-Resolution Cross-Domain Monaural Speech Separation Framework

TL;DR: In this article, a Bi-Projection Fusion (BPF) mechanism is proposed to merge information between the two domains and improve the fine-grained view of time-domain methods.
References
Posted Content

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: In this paper, the authors propose a model that soft-searches for the parts of a source sentence relevant to predicting a target word, without having to explicitly form these parts as a hard segment.
Posted Content

WaveNet: A Generative Model for Raw Audio

TL;DR: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms that is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; it can be efficiently trained on data with tens of thousands of samples per second of audio, and can also be employed as a discriminative model, returning promising results for phoneme recognition.
Proceedings Article

Variational Inference with Normalizing Flows

TL;DR: It is demonstrated that the theoretical advantages of posteriors that better match the true posterior, combined with the scalability of amortized variational approaches, provide a clear improvement in the performance and applicability of variational inference.
Proceedings Article

Attention-based models for speech recognition

TL;DR: The authors propose a location-aware attention mechanism for the TIMIT phoneme recognition task, achieving an improved 18.7% phoneme error rate (PER) on utterances roughly as long as those it was trained on.
Trending Questions (1)
What is the state of the art for speech to text?

The provided paper is about improving the performance of Text-to-Speech (TTS) synthesis using a modified model; it does not discuss the state of the art for speech-to-text.