Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice

Open Access Posted Content

- 13 Dec 2018 -

TLDR

The multi-speaker latent space is investigated to improve neural TTS for adapting the system to new speakers with only several minutes of speech or enhancing a premium voice by utilizing the data from other speakers for richer contextual coverage and better generalization.

Abstract:

Neural TTS has been shown to generate high-quality synthesized speech. In this paper, we investigate the multi-speaker latent space to improve neural TTS in two ways: adapting the system to new speakers with only several minutes of speech, and enhancing a premium voice by utilizing data from other speakers for richer contextual coverage and better generalization. A multi-speaker neural TTS model is built with the speaker information embedded in both the spectral and the speaker latent space. The experimental results show that, with less than 5 minutes of training data from a new speaker, the new model achieves an MOS of 4.16 in naturalness and 4.64 in speaker similarity, close to human recordings (4.74). For a well-trained premium voice, we achieve an MOS of 4.5 on out-of-domain texts, which is comparable to the MOS of 4.58 for professional recordings and significantly outperforms the single-speaker result of 4.28.
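To put the reported scores side by side, the short sketch below computes the MOS differences implied by the abstract. It is only illustrative arithmetic: the dictionary keys and the `mos_gap` helper are hypothetical names, and it assumes that 4.74 is the speaker-similarity score of the human recordings.

```python
# MOS values copied from the abstract; the key names are illustrative only.
MOS = {
    "new_speaker_similarity": 4.64,
    "human_recording_similarity": 4.74,  # assumed to be a similarity score
    "premium_multi_speaker": 4.50,
    "premium_single_speaker": 4.28,
    "professional_recording": 4.58,
}


def mos_gap(a: str, b: str) -> float:
    """Return MOS[a] - MOS[b], rounded to two decimals."""
    return round(MOS[a] - MOS[b], 2)


# Distance of the adapted new-speaker voice from human recordings.
similarity_gap = mos_gap("human_recording_similarity", "new_speaker_similarity")
# Gain of the multi-speaker premium voice over the single-speaker one.
premium_gain = mos_gap("premium_multi_speaker", "premium_single_speaker")
# Remaining distance of the premium voice from professional recordings.
premium_gap = mos_gap("professional_recording", "premium_multi_speaker")

print(similarity_gap, premium_gain, premium_gap)
```

Read this way, the gain of the premium voice over the single-speaker system is larger than the gap still separating it from professional recordings.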

Citations

Proceedings Article

Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings

Erica Cooper, +6 more

TL;DR: In this article, the authors investigate multi-speaker modeling for end-to-end text-to-speech synthesis and study the effects of different types of state-of-the-art neural speaker embeddings on speaker similarity for unseen speakers.

Proceedings Article

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability.

Jinyu Li, +10 more

TL;DR: In this paper, the authors describe the recent development of RNN-T models with reduced GPU memory consumption during training, better initialization strategy, and advanced encoder modeling with future lookahead.

Posted Content

Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings

Erica Cooper, +6 more

- 23 Oct 2019 -

arXiv: Audio and Speech Processing

TL;DR: Learnable dictionary encoding-based speaker embeddings with angular softmax loss can improve equal error rates over x-vectors in a speaker verification task and improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.

Proceedings Article

Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding.

Seungwoo Choi, +3 more

TL;DR: Attentron, a few-shot TTS model that clones the voices of speakers unseen during training, is proposed; it significantly outperforms state-of-the-art models in speaker similarity and quality when generating speech for unseen speakers.

Proceedings Article

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Yujia Xiao, +3 more

TL;DR: This study investigates linguistic features and BERT-derived information to improve the prosody of Mandarin Chinese TTS and finds that the model with additional character embeddings from BERT performs best, outperforming the baseline by 0.17 MOS.

References

Posted Content

WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, +8 more

- 12 Sep 2016 -

arXiv: Sound

TL;DR: This paper proposes WaveNet, a deep neural network for generating audio waveforms that is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones.

WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, +8 more

TL;DR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.

Proceedings Article

Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions

Jonathan Shen, +12 more

TL;DR: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. It is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.

Proceedings Article

Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang, +13 more

TL;DR: Tacotron is an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization.

Posted Content

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Jonathan Shen, +12 more

- 16 Dec 2017 -

arXiv: Computation and Language

TL;DR: Tacotron 2 uses a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms.

arXiv: Audio and Speech Processing

Related Papers (5)

Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Neural Voice Cloning with a Few Samples

Tacotron: Towards End-to-End Speech Synthesis

Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora