Open Access · Posted Content
Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
TLDR
In this article, a multi-speaker text-to-speech model is proposed that generates synthetic speech with better quality and stability than a speaker-dependent one when the available data of a target speaker is insufficient to train a high-quality speaker-dependent neural TTS system.
Abstract
When the available data of a target speaker is insufficient to train a high-quality speaker-dependent neural text-to-speech (TTS) system, we can combine data from multiple speakers and train a multi-speaker TTS model instead. Many studies have shown that a neural multi-speaker TTS model trained on small amounts of data combined from multiple speakers can generate synthetic speech with better quality and stability than a speaker-dependent one. However, when the amount of data from each speaker is highly unbalanced, the best way to make use of the excess data remains unknown. Our experiments showed that simply combining all available data from every speaker to train a multi-speaker model yields performance better than, or at least similar to, its speaker-dependent counterpart. Moreover, by using an ensemble multi-speaker model, in which each subsystem is trained on a subset of the available data, we can further improve the quality of the synthetic speech, especially for underrepresented speakers whose training data is limited.
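The abstract does not spell out how the available data is split among the ensemble subsystems. As a minimal sketch of the general idea — each subsystem keeps all of the scarce speakers' data while the over-represented speaker's utterances are sharded across subsystems — the following Python function illustrates one plausible partitioning. The function name `ensemble_subsets` and the round-robin sharding are illustrative assumptions, not the authors' exact recipe.

```python
def ensemble_subsets(corpus, num_subsystems):
    """Split a speaker-imbalanced corpus into training subsets for an
    ensemble of multi-speaker TTS subsystems.

    corpus: dict mapping speaker id -> list of utterance ids.
    Each subsystem keeps all data from every speaker except the largest
    one, whose utterances are sharded round-robin so that no single
    subsystem is dominated by the over-represented speaker.
    """
    largest = max(corpus, key=lambda s: len(corpus[s]))
    subsets = []
    for k in range(num_subsystems):
        # copy the full data of every under-represented speaker
        subset = {s: list(u) for s, u in corpus.items() if s != largest}
        # shard k takes every num_subsystems-th utterance, offset by k
        subset[largest] = corpus[largest][k::num_subsystems]
        subsets.append(subset)
    return subsets

# toy corpus: one speaker with 100 utterances, one with only 10
corpus = {"spk_big": [f"big_{i}" for i in range(100)],
          "spk_small": [f"small_{i}" for i in range(10)]}
subsets = ensemble_subsets(corpus, num_subsystems=4)
print([len(s["spk_big"]) for s in subsets])  # prints [25, 25, 25, 25]
```

Each subsystem thus sees all 10 low-resource utterances plus a disjoint 25-utterance shard of the dominant speaker, which matches the paper's stated motivation of helping underrepresented speakers while still exploiting the excess data.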
Citations
Posted Content
Expressive TTS Training with Frame and Style Reconstruction Loss
TL;DR: This is the first study to incorporate utterance-level perceptual quality as a loss function in Tacotron training for improved expressiveness, marking a departure from the style-token paradigm.
Posted Content
The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS.
TL;DR: This paper revisits a naive approach for voice conversion by utilizing ESPnet, an open-source end-to-end speech processing toolkit, and the many well-configured pretrained models provided by the community, demonstrating the promising ability of seq2seq models to convert speaker identity.
Proceedings Article
Efficient Neural Speech Synthesis for Low-Resource Languages Through Multilingual Modeling.
TL;DR: The authors investigated to what extent multilingual multi-speaker modeling can serve as an alternative to monolingual multi-speaker modeling, and explored how data from foreign languages may best be combined with low-resource language data.
Posted Content
Teacher-Student Training for Robust Tacotron-based TTS
TL;DR: This article proposes a teacher-student training scheme, known as knowledge distillation, for Tacotron-based TTS, introducing a distillation loss function in addition to the feature loss function.
Journal Article
Modeling Prosodic Phrasing With Multi-Task Learning in Tacotron-Based TTS
TL;DR: This letter proposes a multi-task learning scheme for Tacotron training that optimizes the system to predict both the mel spectrum and phrase breaks, and shows that the proposed training scheme consistently improves voice quality for both the Chinese and Mongolian systems.
References
Proceedings Article
Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis
TL;DR: This paper proposes an approach to multi-speaker TTS modeling with a general DNN, where the same hidden layers are shared among different speakers while the output layers are composed of speaker-dependent nodes modeling the target of each speaker.
Proceedings Article
Thousands of voices for HMM-based speech synthesis
Junichi Yamagishi,Bela Usabaev,Simon King,Oliver Watts,John Dines,Jilei Tian,Rile Hu,Yong Guan,Keiichiro Oura,Keiichi Tokuda,Reima Karhila,Mikko Kurimo +11 more
TL;DR: In this paper, a speaker-adaptive HMM-based speech synthesis system is proposed that produces high-quality voices from non-TTS corpora such as ASR corpora.
Proceedings Article
Sample-efficient adaptive text-to-speech
Yutian Chen,Yannis M. Assael,Brendan Shillingford,David Budden,Scott Reed,Heiga Zen,Quan Wang,Luis C. Cobo,Andrew Trask,Ben Laurie,Caglar Gulcehre,Aaron van den Oord,Oriol Vinyals,Nando de Freitas +13 more
TL;DR: In this article, a meta-learning approach for adaptive text-to-speech (TTS) with few data is presented, where the aim is to produce a network that requires few data at deployment time to rapidly adapt to new speakers.
Journal Article
Thousands of Voices for HMM-Based Speech Synthesis–Analysis and Application of TTS Systems Built on Various ASR Corpora
Junichi Yamagishi,Bela Usabaev,Simon King,Oliver Watts,John Dines,Jilei Tian,Yong Guan,Rile Hu,Keiichiro Oura,Yi-Jian Wu,Keiichi Tokuda,Reima Karhila,Mikko Kurimo +12 more
TL;DR: This paper demonstrates the thousands of voices for HMM-based speech synthesis that are made from several popular ASR corpora such as the Wall Street Journal, Resource Management, Globalphone, and SPEECON databases.
Posted Content
Sample Efficient Adaptive Text-to-Speech
Yutian Chen,Yannis M. Assael,Brendan Shillingford,David Budden,Scott Reed,Heiga Zen,Quan Wang,Luis C. Cobo,Andrew Trask,Ben Laurie,Caglar Gulcehre,Aaron van den Oord,Oriol Vinyals,Nando de Freitas +13 more
TL;DR: Three strategies for adapting the multi-speaker neural network to new speakers are introduced and benchmarked, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
Related Papers (5)
Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice
Yan Deng,Lei He,Frank K. Soong +2 more