Open Access · Posted Content

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

TL;DR
This paper introduces a new speech corpus called "LibriTTS" for text-to-speech use, derived from the original audio and text materials of the LibriSpeech corpus, which was built for training and evaluating automatic speech recognition systems.
Abstract
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues that make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers, together with the corresponding texts. Experimental results show that neural end-to-end TTS models trained on the LibriTTS corpus achieved mean opinion scores above 4.0 for naturalness for five out of six evaluation speakers. The corpus is freely available for download from this http URL.


Citations
Proceedings ArticleDOI

MLS: A Large-Scale Multilingual Dataset for Speech Research.

TL;DR: This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research; the authors believe such a large transcribed dataset will open new avenues in ASR and text-to-speech research.
Posted Content

LibriMix: An Open-Source Dataset for Generalizable Speech Separation

TL;DR: The experiments show that the generalization error is smaller for models trained with LibriMix than with WHAM!, in both clean and noisy conditions; a third test set, based on VCTK for speech and WHAM! for noise, is also introduced.
Proceedings ArticleDOI

Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens

TL;DR: The paper presents a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data, and synthesizes samples that include style transfer from speakers, singers, and styles not seen during training, procedural manipulation of rhythm and pitch, and choir synthesis.
Proceedings ArticleDOI

Generating Diverse and Natural Text-to-Speech Samples Using a Quantized Fine-Grained VAE and Autoregressive Prosody Prior

TL;DR: Experimental results show that the proposed sequential prior over a discrete latent space generates more natural-sounding samples, significantly improving naturalness in random sample generation, and that random sampling can be used as data augmentation to improve ASR performance.
Posted Content

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

TL;DR: Mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in speech quality; the paper also provides results on control of speech variation, interpolation between samples, and style transfer between speakers seen and unseen during training.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
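The update rule summarized above can be sketched in a few lines. The following is a minimal NumPy illustration of the moment estimates and bias correction described in the paper, not the authors' reference implementation; the quadratic objective and the hyperparameter values are arbitrary toy choices (the defaults match those suggested in the paper).

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: adaptive estimates of the first and second moments."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # biased second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(x) = x^2 starting from x = 5.0
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):          # t starts at 1 for the bias correction
    grad = 2 * theta              # gradient of x^2
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)  # theta has moved close to the minimum at 0
```

Note that the effective step size is roughly bounded by the learning rate, since `m_hat / sqrt(v_hat)` is close to ±1 when gradients are consistent in sign.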
Proceedings ArticleDOI

Librispeech: An ASR corpus based on public domain audio books

TL;DR: It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Proceedings ArticleDOI

Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling

TL;DR: This paper introduces the first distributed training of LSTM RNNs, using asynchronous stochastic gradient descent optimization on a large cluster of machines, and shows that a two-layer deep LSTM RNN in which each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance.
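The linear recurrent projection layer mentioned in the summary can be sketched as a single-step NumPy computation. This is a hypothetical illustration, not the paper's implementation: the dimensions, random initialization, and variable names are assumptions chosen only to show how the projection matrix reduces the recurrent state below the cell dimension.

```python
import numpy as np

def lstmp_step(x, h, c, W, R, b, P):
    """One step of an LSTM with a linear recurrent projection layer (LSTMP).

    x: input (n_in,); h: projected recurrent state (n_proj,);
    c: cell state (n_cell,); W: (4*n_cell, n_in) input weights;
    R: (4*n_cell, n_proj) recurrent weights; b: (4*n_cell,) biases;
    P: (n_proj, n_cell) projection matrix.
    """
    z = W @ x + R @ h + b
    i, f, g, o = np.split(z, 4)                      # gate pre-activations
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)     # input/forget/output gates
    c = f * c + i * np.tanh(g)                       # cell state update
    m = o * np.tanh(c)                               # cell output (n_cell,)
    h = P @ m                                        # projection: n_cell -> n_proj
    return h, c

n_in, n_cell, n_proj = 8, 16, 4                      # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_cell, n_in)) * 0.1
R = rng.normal(size=(4 * n_cell, n_proj)) * 0.1
b = np.zeros(4 * n_cell)
P = rng.normal(size=(n_proj, n_cell)) * 0.1
h, c = np.zeros(n_proj), np.zeros(n_cell)
for x in rng.normal(size=(5, n_in)):                 # run a short sequence
    h, c = lstmp_step(x, h, c, W, R, b, P)
print(h.shape, c.shape)  # (4,) (16,)
```

Because the recurrent matrix `R` acts on the projected state, its size scales with `n_proj` rather than `n_cell`, which is the parameter saving the architecture exploits.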
Journal ArticleDOI

Violin plots: A box plot-density trace synergism

TL;DR: A proposed further adaptation, the violin plot, pools the best statistical features of alternative graphical representations of batches of data and adds the information available from local density estimates to the basic summary statistics inherent in box plots.
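The idea of combining a density trace with box-plot summary statistics can be illustrated with Matplotlib's built-in `violinplot`. The bimodal sample data below is an arbitrary illustration (not from the paper), chosen because a box plot alone would hide the two modes that the density trace reveals.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Two batches of data: one bimodal, one unimodal
batches = [
    np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)]),
    rng.normal(0, 1, 600),
]

fig, ax = plt.subplots()
# Each "violin" is a kernel density estimate mirrored about the axis,
# with the median marked as the box-plot-style summary statistic
parts = ax.violinplot(batches, showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["bimodal", "unimodal"])
ax.set_ylabel("value")
fig.savefig("violins.png")
```

The bimodal violin shows two bulges where a box plot would report only a single median and interquartile range.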
Journal ArticleDOI

Statistical Parametric Speech Synthesis

TL;DR: This paper gives a general overview of techniques in statistical parametric speech synthesis, and contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years.