Open Access Proceedings ArticleDOI

Representation Mixing for TTS Synthesis

TLDR
This article proposes representation mixing, a simple method for combining multiple types of linguistic information in a single encoder, enabling flexible choice between character, phoneme, or mixed representations during inference.
Abstract
Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method for combining multiple types of linguistic information in a single encoder, named representation mixing, enabling flexible choice between character, phoneme, or mixed representations during inference. Experiments and user studies on a public audiobook corpus show the efficacy of our approach.
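Representation mixing, as summarized in the abstract, lets a single encoder accept interleaved character and phoneme tokens. A minimal sketch of how such mixed embeddings might be formed is below; the table sizes, variable names, and the choice of summing a symbol embedding with a type embedding are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_CHARS, N_PHONEMES, DIM = 30, 45, 8
char_table = rng.standard_normal((N_CHARS, DIM))
phone_table = rng.standard_normal((N_PHONEMES, DIM))
type_table = rng.standard_normal((2, DIM))  # 0 = character, 1 = phoneme

def mix_embed(sequence):
    """Embed a mixed sequence of (symbol_id, is_phoneme) pairs.

    Each token's embedding is the sum of its symbol embedding (looked up
    in the character or phoneme table, depending on its type) and a type
    embedding that tells the encoder which representation it came from.
    """
    out = []
    for symbol_id, is_phoneme in sequence:
        table = phone_table if is_phoneme else char_table
        out.append(table[symbol_id] + type_table[int(is_phoneme)])
    return np.stack(out)

# A word spelled with three characters, followed by three phonemes:
mixed = [(2, 0), (0, 0), (19, 0), (7, 1), (1, 1), (33, 1)]
emb = mix_embed(mixed)
print(emb.shape)  # (6, 8)
```

At inference time the same encoder can then be fed all characters, all phonemes, or any mixture, since every token carries its own type information.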


Citations
Posted Content

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

TL;DR: This paper introduces LibriTTS, a new speech corpus for text-to-speech use, derived from the original audio and text materials of the LibriSpeech corpus, which was designed for training and evaluating automatic speech recognition systems.
Posted Content

FastPitch: Parallel Text-to-speech with Pitch Prediction

TL;DR: It is found that uniformly increasing or decreasing the pitch with FastPitch generates speech that resembles voluntary voice modulation, with quality comparable to state-of-the-art systems.
Proceedings ArticleDOI

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

TL;DR: This article presents a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high-quality speech in multiple languages.
Proceedings ArticleDOI

Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis

TL;DR: It is concluded that GMM attention and DCA can generalize to very long utterances, while preserving naturalness for shorter, in-domain utterances.