Open Access · Proceedings Article · DOI

BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization

TL;DR
Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
Abstract
We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
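The search described in the abstract can be sketched as a standard Gaussian-process Bayesian optimization loop over a fine-tuning hyper-parameter. The sketch below is a minimal illustration under stated assumptions: the `similarity_score` function is a hypothetical stand-in for the expensive evaluation BOFFIN TTS performs (fine-tune the base model with a candidate learning rate, then score speaker similarity), and the single learning-rate search space is a simplification of the paper's full hyper-parameter configuration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real objective: in BOFFIN TTS, evaluating a
# candidate learning rate means actually fine-tuning the TTS model and
# measuring speaker similarity. Here we fake it with a smooth function that
# peaks near lr = 1e-3.
def similarity_score(lr):
    return np.exp(-((np.log10(lr) + 3.0) ** 2))

bounds = (1e-5, 1e-1)  # assumed search range for the fine-tuning learning rate

# Initial random design in log space
X = rng.uniform(np.log10(bounds[0]), np.log10(bounds[1]), size=(4, 1))
y = np.array([similarity_score(10 ** x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):  # Bayesian-optimization iterations
    gp.fit(X, y)
    # Expected improvement over a dense grid of candidate log-learning-rates
    cand = np.linspace(np.log10(bounds[0]), np.log10(bounds[1]), 200)[:, None]
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = cand[np.argmax(ei)]          # most promising candidate
    X = np.vstack([X, x_next[None, :]])
    y = np.append(y, similarity_score(10 ** x_next[0]))

best_lr = 10 ** X[np.argmax(y), 0]        # best configuration found
```

Because each evaluation is costly (a full fine-tuning run), the model-based acquisition step is what makes this practical compared to grid or random search.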



Citations
Posted Content

Learning from Very Few Samples: A Survey

TL;DR: This survey extensively reviews 300+ papers on few-shot learning (FSL) spanning from the 2000s to 2019, providing a timely and comprehensive overview of the field; it categorizes FSL approaches into generative-model-based and discriminative-model-based kinds, with particular emphasis on meta-learning-based FSL approaches.
Proceedings Article

AdaSpeech: Adaptive Text to Speech for Custom Voice

TL;DR: In this paper, an adaptive text-to-speech (TTS) system for high-quality and efficient customization of new voices is proposed, which uses one acoustic encoder to extract an utterance-level vector and another one to extract a sequence of phoneme-level vectors.
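The two-granularity conditioning this summary describes can be illustrated with a toy sketch. Everything below is an assumption for illustration only: the mean-pooling "encoders", the even frame-per-phoneme split, and the shapes are hypothetical, not AdaSpeech's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical utterance-level encoder: summarize all frames into one vector.
def utterance_encoder(mel):            # mel: (frames, n_mels)
    return mel.mean(axis=0)            # -> (n_mels,)

# Hypothetical phoneme-level encoder: one vector per phoneme, assuming frames
# are split evenly across phonemes (a simplification of real alignment).
def phoneme_encoder(mel, n_phonemes):
    chunks = np.array_split(mel, n_phonemes, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])  # -> (n_phonemes, n_mels)

mel = rng.normal(size=(120, 80))       # 120 frames of an 80-bin mel spectrogram
utt_vec = utterance_encoder(mel)       # coarse, utterance-level condition
phn_vecs = phoneme_encoder(mel, 12)    # fine, phoneme-level conditions
```

The point of the sketch is the interface: one vector captures global speaker/utterance characteristics, while the per-phoneme sequence captures local variation.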
Posted Content

AdaDurIAN: Few-shot Adaptation for Neural Text-to-Speech with DurIAN.

TL;DR: AdaDurIAN is introduced by training an improved DurIAN-based average model and leveraging it for few-shot learning, with a speaker-independent content encoder shared across different speakers; it can outperform the baseline end-to-end system by a large margin.
Posted Content

Are we Forgetting about Compositional Optimisers in Bayesian Optimisation

TL;DR: This paper highlights the empirical advantages of the compositional approach to acquisition function maximisation across 3958 individual experiments comprising synthetic optimisation tasks as well as tasks from the 2020 NeurIPS competition on Black-Box Optimisation for Machine Learning.
Posted Content

A Survey on Neural Speech Synthesis.

TL;DR: A comprehensive survey on neural text-to-speech (TTS) can be found in this paper, focusing on the key components in neural TTS, including text analysis, acoustic models and vocoders.
References
Journal Article

Random search for hyper-parameter optimization

TL;DR: This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid, and shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms.
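The core argument of that paper can be shown in a few lines: when only some hyper-parameters matter, a grid wastes its budget repeating the same values on the important axis, while random search tries a fresh value on every trial. The objective below is a made-up example, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy objective where only the first of two hyper-parameters matters:
# the optimum is at x = 0.7, and y is irrelevant.
def objective(x, y):
    return -((x - 0.7) ** 2)

# Budget of 9 trials for both strategies.
# Grid search: a 3 x 3 grid tries only 3 distinct values of the important axis.
grid_vals = np.linspace(0.0, 1.0, 3)
grid_best = max(objective(x, y) for x in grid_vals for y in grid_vals)

# Random search: 9 trials give 9 distinct values of the important axis.
rand_xy = rng.uniform(0.0, 1.0, size=(9, 2))
rand_best = max(objective(x, y) for x, y in rand_xy)
```

With the same budget, random search effectively samples the important dimension three times more densely here, which is why it tends to find better configurations when the objective has low effective dimensionality.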
Book ChapterDOI

Gaussian processes in machine learning

TL;DR: In this paper, the authors give a basic introduction to Gaussian Process regression models and present the simple equations for incorporating training data and examine how to learn the hyperparameters using the marginal likelihood.
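The workflow that chapter describes (fit a GP to training data, learn kernel hyper-parameters by maximizing the marginal likelihood, then predict with uncertainty) can be sketched with scikit-learn. The sine-wave data below is a made-up example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)

# Noisy observations of a smooth function
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

# Kernel hyper-parameters (length scale, noise level) are learned by
# maximizing the log marginal likelihood during fit().
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

# Predictions come with calibrated uncertainty (posterior standard deviation)
mean, std = gp.predict(np.array([[2.5]]), return_std=True)
```

The predictive standard deviation is what Bayesian optimization exploits: it tells the acquisition function where the surrogate model is still uncertain.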
Proceedings Article

Practical Bayesian Optimization of Machine Learning Algorithms

TL;DR: This work describes new algorithms that take into account the variable cost of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation and shows that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms.
Journal ArticleDOI

Taking the Human Out of the Loop: A Review of Bayesian Optimization

TL;DR: This review paper introduces Bayesian optimization, highlights some of its methodological aspects, and showcases a wide range of applications.

WaveNet: A Generative Model for Raw Audio

TL;DR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
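WaveNet's key structural idea, dilated causal convolutions whose receptive field doubles with each layer, can be shown in a toy NumPy sketch. This is an illustration of the mechanism only: real WaveNet layers also use gated activations, residual connections, and learned weights, none of which appear here.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    # output[t] depends only on x[t], x[t - dilation], ... : strictly causal,
    # so the model never conditions on future audio samples.
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for i, wi in enumerate(w):
            tap = t - i * dilation
            if tap >= 0:
                y[t] += wi * x[tap]
    return y

def wavenet_stack(x):
    # Dilations 1, 2, 4 double the receptive field with each layer,
    # reaching 8 samples with only 3 two-tap layers.
    out = x
    for d in (1, 2, 4):
        out = causal_dilated_conv(out, [0.5, 0.5], dilation=d)
    return out

x = np.arange(8, dtype=float)
out = wavenet_stack(x)
```

This exponential growth of the receptive field is what lets WaveNet model long-range structure in raw audio at tens of thousands of samples per second without impossibly deep networks.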
Frequently Asked Questions (1)
Q1. What are the contributions in "Boffin tts: few-shot speaker adaptation by bayesian optimization" ?

The authors present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. The authors demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, the authors are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques.