Open Access Posted Content

Tacotron: Towards End-to-End Speech Synthesis

TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Abstract
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.
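
As a rough, hedged illustration of the abstract's last claim, the sketch below compares the number of autoregressive steps needed per second of audio for a sample-level model versus a frame-level decoder that emits r spectrogram frames per step; the specific sample rate, frame hop, and reduction factor r are assumptions chosen for illustration, not figures quoted from this listing.

    # Python sketch: why frame-level generation needs far fewer autoregressive
    # steps than sample-level generation. The numbers (24 kHz audio, 12.5 ms
    # frame hop, r = 2 frames per decoder step) are illustrative assumptions.
    sample_rate = 24_000                        # samples per second
    hop_ms = 12.5                               # spectrogram frame hop in milliseconds
    r = 2                                       # frames emitted per decoder step

    sample_level_steps = sample_rate            # one step per waveform sample
    frames_per_second = int(1000 / hop_ms)      # 80 spectrogram frames
    frame_level_steps = frames_per_second // r  # 40 decoder steps

    print("sample-level steps per second:", sample_level_steps)
    print("frame-level steps per second: ", frame_level_steps)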


Citations
Posted Content

WaveGlow: A Flow-based Generative Network for Speech Synthesis

TL;DR: WaveGlow is a flow-based network capable of generating high-quality speech from mel-spectrograms. It is implemented using only a single network and trained with a single cost function, maximizing the likelihood of the training data, which makes the training procedure simple and stable.
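
As a hedged sketch of the single maximum-likelihood cost described above, the snippet below evaluates the change-of-variables negative log-likelihood for a toy one-parameter affine flow under a standard Gaussian prior; the transform, parameter names, and values are assumptions for illustration and are far simpler than WaveGlow's actual architecture.

    import numpy as np

    # Toy invertible flow: z = (x - b) / exp(s). Under a standard Gaussian
    # prior on z, the change of variables gives
    #   -log p(x) = 0.5 * z**2 + 0.5 * log(2*pi) + s
    # where the trailing +s is -log|det dz/dx| for this transform.
    def flow_nll(x, s, b):
        z = (x - b) / np.exp(s)
        return 0.5 * z**2 + 0.5 * np.log(2 * np.pi) + s

    x = np.random.randn(5)                  # stand-in for audio samples
    nll = flow_nll(x, s=0.1, b=0.0)
    print("mean NLL:", nll.mean())          # training minimizes this quantity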
Posted Content

Monotonic Chunkwise Attention

TL;DR: This paper proposes Monotonic Chunkwise Attention (MoChA), which adaptively splits the input sequence into small chunks over which soft attention is computed, and shows that models utilizing MoChA can be trained efficiently with standard backpropagation.
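
The simplified sketch below computes soft attention over a small chunk of the input rather than the full sequence, which is the chunking idea the summary refers to; choosing the chunk endpoint from the raw energies with argmax is an assumption made for brevity and stands in for the trained monotonic attention mechanism MoChA actually uses.

    import numpy as np

    def chunkwise_soft_attention(energies, chunk_size=3):
        # Soft attention restricted to a chunk ending at an assumed endpoint.
        # Illustrative only: real MoChA selects the endpoint with a learned
        # monotonic attention mechanism rather than this hard argmax.
        end = int(np.argmax(energies))
        start = max(0, end - chunk_size + 1)
        chunk = energies[start:end + 1]
        weights = np.exp(chunk - chunk.max())
        weights /= weights.sum()                 # softmax over the chunk only
        full = np.zeros_like(energies)
        full[start:end + 1] = weights
        return full

    energies = np.array([0.1, 0.3, 2.0, 1.2, 0.4, 0.0])
    print(chunkwise_soft_attention(energies))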
Posted Content

Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion.

TL;DR: From the results of crowd-sourced listening tests, it is observed that VC methods have progressed rapidly thanks to advanced deep learning methods, although for the cross-lingual conversion task the overall naturalness and similarity scores were lower than those for the intra-lingual conversion task.
Posted Content

MelNet: A Generative Model for Audio in the Frequency Domain

TL;DR: This work designs a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve, and applies it to a variety of audio generation tasks, showing improvements over previous approaches in both density estimates and human judgments.
Posted Content

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

TL;DR: The mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality, and results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training are provided.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; the resulting models won 1st place on the ILSVRC 2015 classification task.
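
A minimal, hedged sketch of the residual idea the summary describes: the block's layers learn a residual function F(x) and the input is added back through an identity shortcut. The fully connected layers and weight names below are assumptions for brevity; the paper's networks use convolutional blocks.

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def residual_block(x, w1, w2):
        # Learn a residual F(x), then add the identity shortcut: y = F(x) + x.
        out = relu(x @ w1)          # first transformation
        out = out @ w2              # second transformation, before the addition
        return relu(out + x)        # shortcut connection, then nonlinearity

    d = 8
    x = np.random.randn(4, d)
    w1 = np.random.randn(d, d) * 0.1
    w2 = np.random.randn(d, d) * 0.1
    print(residual_block(x, w1, w2).shape)   # (4, 8)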
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
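
The sketch below implements the update the summary describes, maintaining exponential moving averages of the gradient and squared gradient (the lower-order moment estimates), applying bias correction, and taking a per-parameter scaled step; the toy objective and learning rate are assumptions for illustration, while beta1, beta2, and eps follow the paper's suggested defaults.

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Adaptive moment estimates with bias correction, then a scaled step.
        m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
        v = beta2 * v + (1 - beta2) * grad**2         # second-moment estimate
        m_hat = m / (1 - beta1**t)                    # bias-corrected first moment
        v_hat = v / (1 - beta2**t)                    # bias-corrected second moment
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    # Toy run: minimize f(theta) = theta**2 starting from theta = 3.
    theta, m, v = 3.0, 0.0, 0.0
    for t in range(1, 2001):
        theta, m, v = adam_step(theta, grad=2 * theta, m=m, v=v, t=t, lr=0.05)
    print(round(theta, 4))   # close to 0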
Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
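
As a hedged sketch of the soft-search mechanism, the snippet below scores every source position against the current decoder state with a small feed-forward network, softmaxes the scores, and forms a context vector as the weighted sum of source states; the matrix names and the use of raw encoder states (rather than the paper's bidirectional-RNN annotations) are simplifying assumptions.

    import numpy as np

    def additive_soft_attention(query, keys, Wq, Wk, v):
        # Score each source position, normalize, and take a weighted sum.
        scores = np.tanh(query @ Wq + keys @ Wk) @ v    # shape (T,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                        # soft alignment over sources
        context = weights @ keys                        # weighted source summary
        return context, weights

    T, d, a = 6, 8, 16                       # source length, state dim, attention dim
    keys = np.random.randn(T, d)             # encoder states
    query = np.random.randn(d)               # current decoder state
    Wq, Wk = np.random.randn(d, a), np.random.randn(d, a)
    v = np.random.randn(a)
    context, weights = additive_soft_attention(query, keys, Wq, Wk, v)
    print(weights.round(2), context.shape)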
Posted Content

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

TL;DR: Batch Normalization, as presented in this paper, normalizes layer inputs for each training mini-batch to reduce internal covariate shift in deep neural networks, and achieves state-of-the-art performance on ImageNet.
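
A minimal sketch of the training-mode forward pass the summary describes: each feature is normalized with statistics computed over the current mini-batch, then scaled and shifted by learned parameters; the shapes and parameter values below are illustrative assumptions.

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        # Normalize each feature over the mini-batch, then scale and shift.
        mean = x.mean(axis=0)                    # per-feature batch mean
        var = x.var(axis=0)                      # per-feature batch variance
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    x = np.random.randn(32, 4) * 5 + 2           # mini-batch of 32 examples, 4 features
    gamma, beta = np.ones(4), np.zeros(4)
    y = batch_norm_forward(x, gamma, beta)
    print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # approximately 0 and 1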