Open Access · Posted Content

Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora

TLDR
In this article, a multi-speaker text-to-speech model is proposed to generate synthetic speech with better quality and stability than a speaker-dependent one when the available data of a target speaker is insufficient to train a high-quality speaker-dependent neural TTS system.
Abstract
When the available data of a target speaker is insufficient to train a high-quality speaker-dependent neural text-to-speech (TTS) system, we can combine data from multiple speakers and train a multi-speaker TTS model instead. Many studies have shown that a neural multi-speaker TTS model trained on a small amount of combined data from multiple speakers can generate synthetic speech with better quality and stability than a speaker-dependent one. However, when the amount of data from each speaker is highly imbalanced, the best way to make use of the excess data remains unknown. Our experiments showed that simply combining all available data from every speaker to train a multi-speaker model produces performance better than or at least similar to that of its speaker-dependent counterpart. Moreover, by using an ensemble multi-speaker model, in which each subsystem is trained on a subset of the available data, we can further improve the quality of the synthetic speech, especially for underrepresented speakers whose training data is limited.
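To make the ensemble strategy concrete, here is a minimal Python sketch of one way such subsets could be built: every subsystem keeps all data from the underrepresented speakers but only a capped, disjoint slice of the dominant speaker's data. The function and variable names are illustrative assumptions, not the authors' implementation.

import random
from collections import defaultdict

def make_ensemble_subsets(utterances, dominant_speaker, n_subsystems, cap):
    # utterances: list of (speaker_id, utterance) pairs.
    by_speaker = defaultdict(list)
    for spk, utt in utterances:
        by_speaker[spk].append(utt)

    dominant = by_speaker[dominant_speaker]
    random.shuffle(dominant)

    subsets = []
    for i in range(n_subsystems):
        # Each subsystem gets a disjoint, capped slice of the dominant
        # speaker's data ...
        subset = [(dominant_speaker, u) for u in dominant[i * cap:(i + 1) * cap]]
        # ... plus all data from every underrepresented speaker.
        for spk, utts in by_speaker.items():
            if spk != dominant_speaker:
                subset.extend((spk, u) for u in utts)
        subsets.append(subset)
    return subsets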


Citations
Posted Content

Expressive TTS Training with Frame and Style Reconstruction Loss

TL;DR: This study is the first to incorporate utterance-level perceptual quality as a loss function into Tacotron training for improved expressiveness, marking a departure from the style-token paradigm.
Posted Content

The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS.

TL;DR: This paper revisits a naive approach to voice conversion, utilizing ESPnet, an open-source end-to-end speech processing toolkit, together with the many well-configured pretrained models provided by the community, and demonstrates the promising ability of seq2seq models to convert speaker identity.
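The cascade itself is easy to express in code. A minimal sketch, assuming hypothetical asr_model and tts_model objects with transcribe and synthesize methods (not ESPnet's actual API):

def convert_voice(source_wav, target_speaker_id, asr_model, tts_model):
    # 1. Recognize the linguistic content of the source speech.
    text = asr_model.transcribe(source_wav)
    # 2. Re-synthesize the same content in the target speaker's voice.
    return tts_model.synthesize(text, speaker_id=target_speaker_id)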
Proceedings Article (DOI)

Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling.

TL;DR: The authors investigated to what extent multilingual multi-speaker modeling can be an alternative to monolingual multi-speaker modeling, and explored how data from foreign languages may best be combined with low-resource language data.
Posted Content

Teacher-Student Training for Robust Tacotron-based TTS

TL;DR: This article proposed a teacher-student training scheme, a form of knowledge distillation, for Tacotron-based TTS by introducing a distillation loss function in addition to the feature loss function.
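A generic sketch of such a combined objective in PyTorch; it illustrates the idea of adding a distillation term to a feature loss, not the paper's exact formulation, and the weighting factor alpha is an assumption:

import torch.nn.functional as F

def teacher_student_loss(student_mel, teacher_mel, target_mel, alpha=0.5):
    # Feature loss: match the ground-truth mel spectrogram.
    feature_loss = F.l1_loss(student_mel, target_mel)
    # Distillation loss: match the teacher model's prediction.
    distillation_loss = F.l1_loss(student_mel, teacher_mel)
    return feature_loss + alpha * distillation_loss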
Journal Article (DOI)

Modeling Prosodic Phrasing With Multi-Task Learning in Tacotron-Based TTS

TL;DR: This letter proposes a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks, and shows that the proposed training scheme consistently improves voice quality for both Chinese and Mongolian systems.
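A generic sketch of such a multi-task objective in PyTorch; the choice of loss terms and the weighting beta are illustrative assumptions:

import torch.nn.functional as F

def multitask_loss(pred_mel, target_mel, break_logits, break_labels, beta=1.0):
    # Task 1: predict the mel spectrum.
    mel_loss = F.l1_loss(pred_mel, target_mel)
    # Task 2: predict phrase breaks
    # (break_logits: (N, classes), break_labels: (N,)).
    break_loss = F.cross_entropy(break_logits, break_labels)
    return mel_loss + beta * break_loss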
References
Proceedings Article (DOI)

Effect of Data Reduction on Sequence-to-sequence Neural TTS

TL;DR: This paper showed that a lack of data from one speaker can be compensated for with data from other speakers, and that the naturalness of Tacotron2-like models trained on a blend of 5k utterances from 7 speakers is better than or equivalent to that of speaker-dependent models trained on a large amount of data.
Proceedings Article (DOI)

Utterance Selection for Optimizing Intelligibility of TTS Voices Trained on ASR Data

TL;DR: It is found that, when a large amount of data is available, selecting utterances from the corpus based on criteria such as the standard deviation of f0, a fast speaking rate, and hypo-articulation produces the most intelligible voices.
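A minimal sketch of criterion-based selection, assuming each utterance carries precomputed f0 and duration statistics; the field names and the scoring function are assumptions, and the hypo-articulation criterion is omitted for brevity:

import numpy as np

def select_utterances(utterances, budget):
    # Score each utterance by f0 variability plus speaking rate
    # (phones per second), two of the criteria named above.
    def score(u):
        return np.std(u["f0"]) + u["num_phones"] / u["duration_sec"]
    return sorted(utterances, key=score, reverse=True)[:budget]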
Proceedings Article (DOI)

Speaker representations for speaker adaptation in multiple speakers' BLSTM-RNN-based speech synthesis

TL;DR: Experimental results show that speaker representations input to the first layer of the acoustic model can effectively control speaker identity during speaker-adaptive training, thus improving the synthesized speech quality for speakers included in the training phase.
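A minimal PyTorch sketch of this conditioning scheme, concatenating a learned speaker embedding to the input of the first BLSTM layer only; the dimensions and module layout are assumptions:

import torch
import torch.nn as nn

class SpeakerConditionedBLSTM(nn.Module):
    def __init__(self, ling_dim, spk_dim, hidden, out_dim, n_speakers):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        # The speaker representation enters only at the first layer.
        self.blstm1 = nn.LSTM(ling_dim + spk_dim, hidden,
                              bidirectional=True, batch_first=True)
        self.blstm2 = nn.LSTM(2 * hidden, hidden,
                              bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, ling_feats, speaker_id):
        # ling_feats: (batch, time, ling_dim); speaker_id: (batch,)
        spk = self.spk_emb(speaker_id)                      # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, ling_feats.size(1), -1)
        x, _ = self.blstm1(torch.cat([ling_feats, spk], dim=-1))
        x, _ = self.blstm2(x)
        return self.proj(x)                                 # acoustic features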
Proceedings Article (DOI)

Corpus building for data-driven TTS systems

TL;DR: This work built a large, balanced Mandarin text-and-speech corpus, the IBM Mandarin TTS Corpus, designed for both statistical prosody modeling and the context dependence of phonemic features, and investigated the problem of choosing a proper synthesis unit.
Proceedings Article (DOI)

Data Selection for Improving Naturalness of TTS Voices Trained on Small Found Corpuses

TL;DR: This work investigates techniques that select training data from small, found corpora to improve the naturalness of synthesized text-to-speech voices, and proposes three metrics related to the narrator's articulation that give significant improvements in naturalness.