Open Access · Proceedings Article · DOI

BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization

TL;DR
Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
Abstract
We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
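The search described in the abstract can be sketched as a standard Gaussian-process Bayesian optimization loop over a fine-tuning hyper-parameter. The sketch below is a minimal illustration under stated assumptions: the `similarity_score` function is a hypothetical stand-in for the expensive evaluation BOFFIN TTS performs (fine-tune the base model with a candidate learning rate, then score speaker similarity), and the single learning-rate search space is a simplification of the paper's full hyper-parameter configuration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real objective: in BOFFIN TTS, evaluating a
# candidate learning rate means actually fine-tuning the TTS model and
# measuring speaker similarity. Here we fake it with a smooth function that
# peaks near lr = 1e-3.
def similarity_score(lr):
    return np.exp(-((np.log10(lr) + 3.0) ** 2))

bounds = (1e-5, 1e-1)  # assumed search range for the fine-tuning learning rate

# Initial random design in log space
X = rng.uniform(np.log10(bounds[0]), np.log10(bounds[1]), size=(4, 1))
y = np.array([similarity_score(10 ** x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):  # Bayesian-optimization iterations
    gp.fit(X, y)
    # Expected improvement over a dense grid of candidate log-learning-rates
    cand = np.linspace(np.log10(bounds[0]), np.log10(bounds[1]), 200)[:, None]
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = cand[np.argmax(ei)]          # most promising candidate
    X = np.vstack([X, x_next[None, :]])
    y = np.append(y, similarity_score(10 ** x_next[0]))

best_lr = 10 ** X[np.argmax(y), 0]        # best configuration found
```

Because each evaluation is costly (a full fine-tuning run), the model-based acquisition step is what makes this practical compared to grid or random search.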



Citations
Posted Content

Learning from Very Few Samples: A Survey

TL;DR: This survey extensively reviews 300+ papers on few-shot learning (FSL) spanning from the 2000s to 2019, providing a timely and comprehensive overview of the field; it categorizes FSL approaches into generative-model-based and discriminative-model-based kinds, with particular emphasis on meta-learning-based FSL approaches.
Proceedings Article

AdaSpeech: Adaptive Text to Speech for Custom Voice

TL;DR: In this paper, an adaptive text-to-speech (TTS) system for high-quality and efficient customization of new voices is proposed, which uses one acoustic encoder to extract an utterance-level vector and another one to extract a sequence of phoneme-level vectors.
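The two-granularity conditioning this summary describes can be illustrated with a toy sketch. Everything below is an assumption for illustration only: the mean-pooling "encoders", the even frame-per-phoneme split, and the shapes are hypothetical, not AdaSpeech's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical utterance-level encoder: summarize all frames into one vector.
def utterance_encoder(mel):            # mel: (frames, n_mels)
    return mel.mean(axis=0)            # -> (n_mels,)

# Hypothetical phoneme-level encoder: one vector per phoneme, assuming frames
# are split evenly across phonemes (a simplification of real alignment).
def phoneme_encoder(mel, n_phonemes):
    chunks = np.array_split(mel, n_phonemes, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])  # -> (n_phonemes, n_mels)

mel = rng.normal(size=(120, 80))       # 120 frames of an 80-bin mel spectrogram
utt_vec = utterance_encoder(mel)       # coarse, utterance-level condition
phn_vecs = phoneme_encoder(mel, 12)    # fine, phoneme-level conditions
```

The point of the sketch is the interface: one vector captures global speaker/utterance characteristics, while the per-phoneme sequence captures local variation.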
Posted Content

AdaDurIAN: Few-shot Adaptation for Neural Text-to-Speech with DurIAN.

TL;DR: AdaDurIAN is introduced by training an improved DurIAN-based average model and leveraging it for few-shot learning, with a speaker-independent content encoder shared across different speakers; it can outperform the baseline end-to-end system by a large margin.
Posted Content

Are we Forgetting about Compositional Optimisers in Bayesian Optimisation

TL;DR: This paper highlights the empirical advantages of the compositional approach to acquisition function maximisation across 3958 individual experiments comprising synthetic optimisation tasks as well as tasks from the 2020 NeurIPS competition on Black-Box Optimisation for Machine Learning.
Posted Content

A Survey on Neural Speech Synthesis.

TL;DR: A comprehensive survey on neural text-to-speech (TTS) can be found in this paper, focusing on the key components in neural TTS, including text analysis, acoustic models and vocoders.
References
Journal Article

Random search for hyper-parameter optimization

TL;DR: This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid, and shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms.
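The core argument of that paper can be shown in a few lines: when only some hyper-parameters matter, a grid wastes its budget repeating the same values on the important axis, while random search tries a fresh value on every trial. The objective below is a made-up example, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy objective where only the first of two hyper-parameters matters:
# the optimum is at x = 0.7, and y is irrelevant.
def objective(x, y):
    return -((x - 0.7) ** 2)

# Budget of 9 trials for both strategies.
# Grid search: a 3 x 3 grid tries only 3 distinct values of the important axis.
grid_vals = np.linspace(0.0, 1.0, 3)
grid_best = max(objective(x, y) for x in grid_vals for y in grid_vals)

# Random search: 9 trials give 9 distinct values of the important axis.
rand_xy = rng.uniform(0.0, 1.0, size=(9, 2))
rand_best = max(objective(x, y) for x, y in rand_xy)
```

With the same budget, random search effectively samples the important dimension three times more densely here, which is why it tends to find better configurations when the objective has low effective dimensionality.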
Book ChapterDOI

Gaussian processes in machine learning

TL;DR: In this paper, the authors give a basic introduction to Gaussian Process regression models and present the simple equations for incorporating training data and examine how to learn the hyperparameters using the marginal likelihood.
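The workflow that chapter describes (fit a GP to training data, learn kernel hyper-parameters by maximizing the marginal likelihood, then predict with uncertainty) can be sketched with scikit-learn. The sine-wave data below is a made-up example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)

# Noisy observations of a smooth function
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

# Kernel hyper-parameters (length scale, noise level) are learned by
# maximizing the log marginal likelihood during fit().
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

# Predictions come with calibrated uncertainty (posterior standard deviation)
mean, std = gp.predict(np.array([[2.5]]), return_std=True)
```

The predictive standard deviation is what Bayesian optimization exploits: it tells the acquisition function where the surrogate model is still uncertain.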
Proceedings Article

Practical Bayesian Optimization of Machine Learning Algorithms

TL;DR: This work describes new algorithms that take into account the variable cost of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation and shows that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms.
Journal ArticleDOI

Taking the Human Out of the Loop: A Review of Bayesian Optimization

TL;DR: This review paper introduces Bayesian optimization, highlights some of its methodological aspects, and showcases a wide range of applications.

WaveNet: A Generative Model for Raw Audio

TL;DR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
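WaveNet's key structural idea, dilated causal convolutions whose receptive field doubles with each layer, can be shown in a toy NumPy sketch. This is an illustration of the mechanism only: real WaveNet layers also use gated activations, residual connections, and learned weights, none of which appear here.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    # output[t] depends only on x[t], x[t - dilation], ... : strictly causal,
    # so the model never conditions on future audio samples.
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for i, wi in enumerate(w):
            tap = t - i * dilation
            if tap >= 0:
                y[t] += wi * x[tap]
    return y

def wavenet_stack(x):
    # Dilations 1, 2, 4 double the receptive field with each layer,
    # reaching 8 samples with only 3 two-tap layers.
    out = x
    for d in (1, 2, 4):
        out = causal_dilated_conv(out, [0.5, 0.5], dilation=d)
    return out

x = np.arange(8, dtype=float)
out = wavenet_stack(x)
```

This exponential growth of the receptive field is what lets WaveNet model long-range structure in raw audio at tens of thousands of samples per second without impossibly deep networks.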
Frequently Asked Questions (1)
Q1. What are the contributions in "Boffin tts: few-shot speaker adaptation by bayesian optimization" ?

The authors present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. The authors demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, the authors are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques.