Open Access Posted Content

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

TL;DR
This paper presents a neural network-based system for text-to-speech (TTS) synthesis that can generate speech audio in the voices of many different speakers, including speakers unseen during training.
Abstract
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

Citations
Proceedings Article

Few-Shot Adversarial Learning of Realistic Neural Talking Head Models

TL;DR: This work presents a system that performs lengthy meta-learning on a large dataset of videos and is able to frame few- and one-shot learning of neural talking head models of previously unseen people as adversarial training problems with high-capacity generators and discriminators.
Proceedings Article

ASVspoof 2019: Future horizons in spoofed and fake audio detection

TL;DR: The 2019 database, protocols and challenge results are described, and major findings which demonstrate the real progress made in protecting against the threat of spoofing and fake audio are outlined.
Posted Content

Jukebox: A Generative Model for Music

TL;DR: It is shown that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes, and can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.
Posted Content

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

TL;DR: This paper introduces a new speech corpus, LibriTTS, designed for text-to-speech use; it is derived from the original audio and text materials of the LibriSpeech corpus, which was used for training and evaluating automatic speech recognition systems.
Proceedings Article

Fully Supervised Speaker Diarization

TL;DR: A fully supervised speaker diarization approach named unbounded interleaved-state recurrent neural networks (UIS-RNN) is proposed; given extracted speaker-discriminative embeddings, it decodes in an online fashion, whereas most state-of-the-art systems rely on offline clustering.
References
Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Journal Article

Suppression of acoustic noise in speech using spectral subtraction

TL;DR: A stand-alone noise suppression algorithm is presented that resynthesizes a speech waveform and can be used as a pre-processor for narrow-band voice communications systems, speech recognition systems, or speaker authentication systems.
Proceedings Article

Librispeech: An ASR corpus based on public domain audio books

TL;DR: It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Posted Content

WaveNet: A Generative Model for Raw Audio

TL;DR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.