Open Access · Posted Content
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu +10 more
TL;DR: In this paper, a neural network-based system for text-to-speech (TTS) synthesis is presented that is able to generate speech audio in the voices of many different speakers, including those unseen during training.
Abstract:
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time-domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
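The three-component pipeline described in the abstract can be sketched in code. The following is a minimal, runnable illustration of the data flow only: every function body here is a hypothetical stand-in (a CRC-seeded projection for the encoder, a broadcast multiply for the synthesizer, frame upsampling for the vocoder), not the paper's trained models. The dimensions (256-dim embedding, 80 mel channels, 256-sample hop) are common settings assumed for the sketch.

```python
import zlib
import numpy as np

EMBED_DIM = 256  # fixed-dimensional speaker embedding (size assumed here)
N_MELS = 80      # mel channels, a typical Tacotron 2 setting
HOP = 256        # waveform samples per mel frame in this sketch

def speaker_encoder(reference_audio: np.ndarray) -> np.ndarray:
    # Stand-in for the speaker-verification encoder: maps seconds of
    # reference speech to a fixed, L2-normalized embedding. The real
    # system uses a trained network; a CRC-seeded random projection
    # keeps this sketch runnable and deterministic per input.
    seed = zlib.crc32(reference_audio.tobytes())
    e = np.random.default_rng(seed).standard_normal(EMBED_DIM)
    return e / np.linalg.norm(e)  # unit norm, as in verification encoders

def synthesizer(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    # Stand-in for the Tacotron 2-style network: the speaker embedding
    # modulates the text features at every frame, so the predicted mel
    # spectrogram depends on the target voice.
    n_frames = 5 * len(text)  # crude duration heuristic for the sketch
    text_features = np.ones((n_frames, N_MELS))
    return text_features * (1.0 + speaker_embedding[:N_MELS])

def vocoder(mel: np.ndarray) -> np.ndarray:
    # Stand-in for the autoregressive WaveNet vocoder: converts mel
    # frames into time-domain samples (HOP samples per frame).
    return np.repeat(np.tanh(mel.mean(axis=1)), HOP)

# Pipeline: seconds of reference speech fix the voice; the text is free.
reference = np.sin(np.linspace(0.0, 1000.0, 16000 * 5))  # 5 s "recording"
embedding = speaker_encoder(reference)
mel = synthesizer("Hello world", embedding)
waveform = vocoder(mel)
```

The key design point the sketch preserves is that the encoder is trained separately from the synthesizer, so any reference clip, including one from an unseen speaker, yields an embedding the synthesizer can condition on.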
Citations
Proceedings Article
Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
TL;DR: This work presents a system that performs lengthy meta-learning on a large dataset of videos, and is able to frame few- and one-shot learning of neural talking head models of previously unseen people as adversarial training problems with high capacity generators and discriminators.
Proceedings Article
ASVspoof 2019: Future horizons in spoofed and fake audio detection
Massimiliano Todisco, Xin Wang, Ville Vestman, Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee +9 more
TL;DR: The 2019 database, protocols and challenge results are described, and major findings which demonstrate the real progress made in protecting against the threat of spoofing and fake audio are outlined.
Posted Content
Jukebox: A Generative Model for Music
TL;DR: It is shown that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes, and can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.
Posted Content
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
TL;DR: This paper introduces a new speech corpus, "LibriTTS," designed for text-to-speech use; it is derived from the original audio and text materials of the LibriSpeech corpus, which was built for training and evaluating automatic speech recognition systems.
Proceedings Article
Fully Supervised Speaker Diarization
TL;DR: A fully supervised speaker diarization approach, named unbounded interleaved-state recurrent neural networks (UIS-RNN), given extracted speaker-discriminative embeddings, which decodes in an online fashion while most state-of-the-art systems rely on offline clustering.
References
Proceedings Article
Neural Machine Translation by Jointly Learning to Align and Translate
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Posted Content
Neural Machine Translation by Jointly Learning to Align and Translate
TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Journal Article
Suppression of acoustic noise in speech using spectral subtraction
TL;DR: A stand-alone noise suppression algorithm is presented that resynthesizes a speech waveform and can be used as a pre-processor to narrow-band voice communications systems, speech recognition systems, or speaker authentication systems.
Proceedings Article
Librispeech: An ASR corpus based on public domain audio books
TL;DR: It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
WaveNet: A Generative Model for Raw Audio
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, Koray Kavukcuoglu +8 more
TL;DR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.