Open Access · Posted Content

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

TL;DR
This paper introduces a new speech corpus called "LibriTTS" for text-to-speech use, derived from the original audio and text materials of the LibriSpeech corpus, which was built for training and evaluating automatic speech recognition systems.
Abstract
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues that make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers, together with the corresponding texts. Experimental results show that neural end-to-end TTS models trained on the LibriTTS corpus achieved mean opinion scores above 4.0 for naturalness for five out of six evaluation speakers. The corpus is freely available for download from this http URL.


Citations
Proceedings ArticleDOI

MLS: A Large-Scale Multilingual Dataset for Speech Research.

TL;DR: This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research; the authors believe such a large transcribed dataset will open new avenues in ASR and text-to-speech research.
Posted Content

LibriMix: An Open-Source Dataset for Generalizable Speech Separation

TL;DR: The experiments show that the generalization error is smaller for models trained with LibriMix than with WHAM!, in both clean and noisy conditions; a third test set, based on VCTK for speech and WHAM! for noise, is also introduced.
Proceedings ArticleDOI

Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens

TL;DR: The paper presents a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data, and synthesizes samples that include style transfer from speakers, singers, and styles not seen during training, procedural manipulation of rhythm and pitch, and choir synthesis.
Proceedings ArticleDOI

Generating Diverse and Natural Text-to-Speech Samples Using a Quantized Fine-Grained VAE and Autoregressive Prosody Prior

TL;DR: Experimental results show that the proposed sequential prior over a discrete latent space generates more natural-sounding samples, significantly improving naturalness in random sample generation, and that random sampling can be used as data augmentation to improve ASR performance.
Posted Content

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

TL;DR: Mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in speech quality; the paper also provides results on control of speech variation, interpolation between samples, and style transfer between speakers seen and unseen during training.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
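The update rule summarized above can be sketched in a few lines. The following is a minimal NumPy illustration of the moment estimates and bias correction described in the paper, not the authors' reference implementation; the quadratic objective and the hyperparameter values are arbitrary toy choices (the defaults match those suggested in the paper).

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: adaptive estimates of the first and second moments."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # biased second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(x) = x^2 starting from x = 5.0
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):          # t starts at 1 for the bias correction
    grad = 2 * theta              # gradient of x^2
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)  # theta has moved close to the minimum at 0
```

Note that the effective step size is roughly bounded by the learning rate, since `m_hat / sqrt(v_hat)` is close to ±1 when gradients are consistent in sign.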
Proceedings ArticleDOI

Librispeech: An ASR corpus based on public domain audio books

TL;DR: It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Proceedings ArticleDOI

Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling

TL;DR: This paper introduces the first distributed training of LSTM RNNs, using asynchronous stochastic gradient descent optimization on a large cluster of machines, and shows that a two-layer deep LSTM RNN in which each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance.
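The linear recurrent projection layer mentioned in the summary can be sketched as a single-step NumPy computation. This is a hypothetical illustration, not the paper's implementation: the dimensions, random initialization, and variable names are assumptions chosen only to show how the projection matrix reduces the recurrent state below the cell dimension.

```python
import numpy as np

def lstmp_step(x, h, c, W, R, b, P):
    """One step of an LSTM with a linear recurrent projection layer (LSTMP).

    x: input (n_in,); h: projected recurrent state (n_proj,);
    c: cell state (n_cell,); W: (4*n_cell, n_in) input weights;
    R: (4*n_cell, n_proj) recurrent weights; b: (4*n_cell,) biases;
    P: (n_proj, n_cell) projection matrix.
    """
    z = W @ x + R @ h + b
    i, f, g, o = np.split(z, 4)                      # gate pre-activations
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)     # input/forget/output gates
    c = f * c + i * np.tanh(g)                       # cell state update
    m = o * np.tanh(c)                               # cell output (n_cell,)
    h = P @ m                                        # projection: n_cell -> n_proj
    return h, c

n_in, n_cell, n_proj = 8, 16, 4                      # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_cell, n_in)) * 0.1
R = rng.normal(size=(4 * n_cell, n_proj)) * 0.1
b = np.zeros(4 * n_cell)
P = rng.normal(size=(n_proj, n_cell)) * 0.1
h, c = np.zeros(n_proj), np.zeros(n_cell)
for x in rng.normal(size=(5, n_in)):                 # run a short sequence
    h, c = lstmp_step(x, h, c, W, R, b, P)
print(h.shape, c.shape)  # (4,) (16,)
```

Because the recurrent matrix `R` acts on the projected state, its size scales with `n_proj` rather than `n_cell`, which is the parameter saving the architecture exploits.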
Journal ArticleDOI

Violin plots: A box plot-density trace synergism

TL;DR: A proposed further adaptation, the violin plot, pools the best statistical features of alternative graphical representations of batches of data and adds the information available from local density estimates to the basic summary statistics inherent in box plots.
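The idea of combining a density trace with box-plot summary statistics can be illustrated with Matplotlib's built-in `violinplot`. The bimodal sample data below is an arbitrary illustration (not from the paper), chosen because a box plot alone would hide the two modes that the density trace reveals.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Two batches of data: one bimodal, one unimodal
batches = [
    np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)]),
    rng.normal(0, 1, 600),
]

fig, ax = plt.subplots()
# Each "violin" is a kernel density estimate mirrored about the axis,
# with the median marked as the box-plot-style summary statistic
parts = ax.violinplot(batches, showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["bimodal", "unimodal"])
ax.set_ylabel("value")
fig.savefig("violins.png")
```

The bimodal violin shows two bulges where a box plot would report only a single median and interquartile range.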
Journal ArticleDOI

Statistical Parametric Speech Synthesis

TL;DR: This paper gives a general overview of techniques in statistical parametric speech synthesis, and contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years.