Open Access, Posted Content

Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice

TLDR
This paper investigates the multi-speaker latent space to improve neural TTS in two ways: adapting the system to a new speaker with only several minutes of speech, and enhancing a premium voice by using data from other speakers for richer contextual coverage and better generalization.
Abstract
Neural TTS has been shown to generate high-quality synthesized speech. In this paper, we investigate the multi-speaker latent space to improve neural TTS, either by adapting the system to new speakers with only several minutes of speech or by enhancing a premium voice with data from other speakers for richer contextual coverage and better generalization. A multi-speaker neural TTS model is built with speaker information embedded in both the spectral and the speaker latent space. The experimental results show that, with less than 5 minutes of training data from a new speaker, the new model achieves an MOS of 4.16 in naturalness and 4.64 in speaker similarity, close to human recordings (4.74). For a well-trained premium voice, we achieve an MOS of 4.5 for out-of-domain texts, which is comparable to the MOS of 4.58 for professional recordings and significantly outperforms the single-speaker result of 4.28.
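The MOS figures quoted in the abstract are averages of listener ratings. As a rough illustration only, the sketch below computes a mean opinion score and a normal-approximation 95% confidence interval. It assumes a 1 to 5 rating scale; the function name `mean_opinion_score` and the sample ratings are hypothetical and are not taken from the paper, whose exact evaluation protocol is not described here.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes each rating is on a 1-5 scale (an assumption, not from the paper).
    The interval uses the normal approximation, which is crude for small samples.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a spread")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie in the range 1 to 5")
    mos = statistics.fmean(ratings)
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * std_err


# Hypothetical ratings from eight listeners for one synthesized utterance.
ratings = [5, 4, 4, 5, 3, 4, 5, 4]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

In practice, MOS tests usually pool ratings over many utterances and listeners, so the interval would be much tighter than in this toy example.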


Citations
Proceedings Article

Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings

TL;DR: In this article, the authors investigate multi-speaker modeling for end-to-end text-to-speech synthesis and study the effects of different types of state-of-the-art neural speaker embeddings on speaker similarity for unseen speakers.
Proceedings Article

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability.

TL;DR: In this paper, the authors describe the recent development of RNN-T models with reduced GPU memory consumption during training, better initialization strategy, and advanced encoder modeling with future lookahead.
Posted Content

Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings

TL;DR: Learnable dictionary encoding-based speaker embeddings with angular softmax loss can improve equal error rates over x-vectors in a speaker verification task and improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.
Proceedings Article

Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding.

TL;DR: Attentron, a few-shot TTS model that clones the voices of speakers unseen during training, is proposed; it significantly outperforms state-of-the-art models in speaker similarity and quality when generating speech for unseen speakers.
Proceedings Article

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

TL;DR: This study investigates linguistic features and BERT-derived information to improve the prosody of Mandarin Chinese TTS, and finds that the model with additional character embeddings from BERT performs best, outperforming the baseline by 0.17 MOS.
References
Posted Content

WaveNet: A Generative Model for Raw Audio

TL;DR: This paper proposed WaveNet, a deep neural network for generating audio waveforms, which is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones.
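The autoregressive conditioning described in this summary corresponds to factorizing the joint probability of a waveform $\mathbf{x} = \{x_1, \ldots, x_T\}$ as a product of per-sample conditionals, as given in the WaveNet paper:

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

Each factor is the predictive distribution for sample $x_t$ given all earlier samples, which is why generation proceeds one sample at a time.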

WaveNet: A Generative Model for Raw Audio

TL;DR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Proceedings Article

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

TL;DR: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. It is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Proceedings Article

Tacotron: Towards End-to-End Speech Synthesis

TL;DR: Tacotron is an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization.
Posted Content

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

TL;DR: Tacotron 2 uses a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms.