Representation Mixing for TTS Synthesis
Kyle Kastner, João Felipe Santos, Yoshua Bengio, Aaron Courville
pp. 5906–5910
TLDR
This article proposes a method, named representation mixing, for combining multiple types of linguistic information in a single encoder, enabling flexible choice between character, phoneme, or mixed representations during inference.
Abstract:
Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method for combining multiple types of linguistic information in a single encoder, named representation mixing, enabling flexible choice between character, phoneme, or mixed representations during inference. Experiments and user studies on a public audiobook corpus show the efficacy of our approach.
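The abstract above can be sketched as a toy embedding scheme: each input token carries a token embedding plus a "representation type" embedding marking it as a character or a phoneme, so character, phoneme, and mixed sequences flow through one encoder. Everything here (vocabulary, dimensions, function name) is a hypothetical illustration, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint vocabulary of characters and phonemes, plus a
# 2-entry "representation type" table (0 = character, 1 = phoneme).
VOCAB = {"h": 0, "e": 1, "l": 2, "o": 3, "HH": 4, "AH0": 5, "L": 6, "OW": 7}
EMB_DIM = 8
token_emb = rng.normal(size=(len(VOCAB), EMB_DIM))
type_emb = rng.normal(size=(2, EMB_DIM))

def mix_embed(tokens, types):
    """Sum the token embedding and the representation-type embedding
    at each position, so char/phoneme tokens can be freely interleaved."""
    ids = np.array([VOCAB[t] for t in tokens])
    return token_emb[ids] + type_emb[np.array(types)]

# A mixed sequence: phonemes for the first syllable, characters after.
seq = mix_embed(["HH", "AH0", "l", "o"], [1, 1, 0, 0])
print(seq.shape)  # (4, 8)
```

Because the two embeddings are summed into one sequence, the downstream encoder never needs to know which representation was chosen at inference time.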
Citations
Posted Content
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
TL;DR: This paper introduced LibriTTS, a new speech corpus for text-to-speech use, derived from the original audio and text materials of the LibriSpeech corpus, which was designed for training and evaluating automatic speech recognition systems.
Posted Content
FastPitch: Parallel Text-to-speech with Pitch Prediction
TL;DR: It is found that uniformly increasing or decreasing the pitch with FastPitch generates speech that resembles voluntary modulation of the voice, with quality comparable to state-of-the-art systems.
Proceedings ArticleDOI
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
Yu Zhang, Ron Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran, et al.
TL;DR: This article presented a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high-quality speech in multiple languages and to transfer voices across languages, e.g. between English and Mandarin.
Proceedings ArticleDOI
Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis
Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David T. H. Kao, Matt Shannon, Tom Bagby, et al.
TL;DR: It is concluded that GMM attention and DCA can generalize to very long utterances, while preserving naturalness for shorter, in-domain utterances.
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks substantially deeper than those used previously; the resulting models won 1st place in the ILSVRC 2015 classification task.
Journal ArticleDOI
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Journal Article
Dropout: a simple way to prevent neural networks from overfitting
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
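The regularizer summarized above can be sketched as "inverted" dropout, the common formulation in which surviving activations are rescaled at training time so no change is needed at test time. A minimal NumPy sketch; the function name and shapes are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during training,
    and rescale survivors by 1/(1-p) so expected activations are unchanged."""
    if not training:
        return x  # identity at inference time
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

acts = np.ones(10)
print(dropout(acts, p=0.5))          # mix of 0.0 and 2.0 entries
print(dropout(acts, training=False))  # unchanged at inference
```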
Proceedings Article
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe, Christian Szegedy
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Proceedings Article
Neural Machine Translation by Jointly Learning to Align and Translate
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend it by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
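The "(soft-)search" above amounts to computing a softmax-normalized alignment over encoder states and taking their weighted sum as a context vector, rather than squeezing the whole source into one fixed-length code. A toy NumPy sketch; it uses dot-product scoring for brevity (the paper itself scores with a small additive/MLP function), and all values are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Toy example: one decoder state attends over 3 encoder states (dim 4).
enc = np.array([[1.0, 0, 0, 0],
                [0.0, 1, 0, 0],
                [0.0, 0, 1, 0]])
dec = np.array([0.0, 1, 0, 0])

scores = enc @ dec        # one relevance score per source position
weights = softmax(scores)  # soft alignment: sums to 1
context = weights @ enc    # weighted sum replaces the fixed-length vector
print(np.round(weights, 3))  # prints [0.212 0.576 0.212]
```

The second encoder state matches the decoder query, so it receives the largest alignment weight while the others still contribute softly.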