
Showing papers by "Roberto Barra-Chicote" published in 2020


Proceedings ArticleDOI
14 May 2020
TL;DR: Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
Abstract: We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
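For illustration only, here is a minimal sketch of this kind of hyper-parameter search written with scikit-optimize's Gaussian-process optimizer. The objective function is a dummy placeholder standing in for "fine-tune the base model on the target speaker and score speaker similarity", and the hyper-parameter ranges are assumptions, not the paper's settings.

# Hedged sketch: Bayesian optimization over fine-tuning hyper-parameters.
from skopt import gp_minimize
from skopt.space import Real, Integer

def fine_tune_and_score(params):
    # Placeholder objective: in the real system this would fine-tune the
    # pre-trained TTS model with these hyper-parameters and return a value
    # to minimise, e.g. negative speaker similarity on held-out utterances.
    learning_rate, batch_size, finetune_steps = params
    return -(1.0 / (1.0 + abs(learning_rate - 1e-4) * 1e4))  # dummy score

search_space = [
    Real(1e-5, 1e-3, prior="log-uniform", name="learning_rate"),
    Integer(8, 64, name="batch_size"),
    Integer(500, 10000, name="finetune_steps"),
]

result = gp_minimize(fine_tune_and_score, search_space, n_calls=20, random_state=0)
print("Best hyper-parameters found:", result.x)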

51 citations


Posted Content
TL;DR: A first subjective evaluation of automatic dubbing of excerpts of TED Talks from English into Italian is reported, measuring the perceived naturalness of automatic dubbing and the relative importance of each proposed enhancement.
Abstract: We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing. Our architecture features neural machine translation generating output of preferred length, prosodic alignment of the translation with the original speech segments, neural text-to-speech with fine-tuning of the duration of each utterance, and, finally, audio rendering that enriches the text-to-speech output with background noise and reverberation extracted from the original audio. We report on a subjective evaluation of automatic dubbing of excerpts of TED Talks from English into Italian, which measures the perceived naturalness of automatic dubbing and the relative importance of each proposed enhancement.
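As a small, purely illustrative example of the duration fine-tuning step, the snippet below rescales predicted per-phoneme durations so that a synthesized utterance fills the time slot of the original speech segment; the function, the 30 ms floor, and the example values are assumptions rather than the paper's implementation.

def fit_durations(phoneme_durations_ms, target_seconds, floor_ms=30):
    # Rescale per-phoneme durations (in ms) so their sum matches the target slot.
    scale = (target_seconds * 1000.0) / sum(phoneme_durations_ms)
    # Clamp to an assumed floor so very short phonemes remain intelligible.
    return [max(floor_ms, d * scale) for d in phoneme_durations_ms]

durations = [55, 80, 120, 60, 95, 70]          # predicted durations in milliseconds
fitted = fit_durations(durations, target_seconds=0.6)
print(fitted, round(sum(fitted)))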

23 citations


Proceedings ArticleDOI
25 Oct 2020
TL;DR: This paper focuses on recent progress on the prosodic alignment component, which aims at synchronizing the translated transcript with the original utterances, and presents empirical results for English-to-Italian dubbing on a publicly available collection of TED Talks.
Abstract: Automatic dubbing aims at replacing all speech contained in a video with speech in a different language, so that the result sounds and looks as natural as the original. Hence, in addition to conveying the same content of an original utterance (which is the typical objective of speech translation), dubbed speech should ideally also match its duration, the lip movements and gestures in the video, timbre, emotion and prosody of the speaker, and finally background noise and reverberation of the environment. In this paper, after describing our dubbing architecture, we focus on recent progress on the prosodic alignment component, which aims at synchronizing the translated transcript with the original utterances. We present empirical results for English-to-Italian dubbing on a publicly available collection of TED Talks. Our new prosodic alignment model, which allows for small relaxations in synchronicity, significantly improves both prosodic alignment accuracy and overall subjective dubbing quality compared to previous work.
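As a toy illustration of what prosodic alignment involves (not the model evaluated in the paper), the dynamic program below splits a translated word sequence into as many contiguous chunks as there are original speech segments, minimizing duration mismatch while tolerating a small amount of asynchrony. The per-word duration estimates, the slack term, and all example values are assumptions.

def align(words, word_durs, seg_durs, slack=0.3):
    n, m = len(words), len(seg_durs)
    INF = float("inf")
    prefix = [0.0]
    for d in word_durs:
        prefix.append(prefix[-1] + d)
    # cost[i][j]: best cost of assigning the first i words to the first j segments.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for j in range(1, m + 1):
        for i in range(j, n + 1):
            for k in range(j - 1, i):
                chunk = prefix[i] - prefix[k]
                mismatch = abs(chunk - seg_durs[j - 1])
                cand = cost[k][j - 1] + max(0.0, mismatch - slack)  # relaxed synchronicity
                if cand < cost[i][j]:
                    cost[i][j], back[i][j] = cand, k
    # Walk back through the table to recover the chunk boundaries.
    chunks, i = [], n
    for j in range(m, 0, -1):
        k = back[i][j]
        chunks.append(" ".join(words[k:i]))
        i = k
    return list(reversed(chunks))

words = "questa è una breve dimostrazione di doppiaggio automatico".split()
word_durs = [0.08 * len(w) for w in words]   # crude duration estimate per word (s)
seg_durs = [1.2, 1.1, 1.0]                   # original segment lengths in seconds
print(align(words, word_durs, seg_durs))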

22 citations


Proceedings ArticleDOI
01 Jul 2020
TL;DR: In this paper, the authors present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing, and report a subjective evaluation on excerpts of TED Talks dubbed from English into Italian that measures the perceived naturalness of the automatic dubbing and the relative importance of each proposed enhancement.
Abstract: We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing. Our architecture features neural machine translation generating output of preferred length, prosodic alignment of the translation with the original speech segments, neural text-to-speech with fine-tuning of the duration of each utterance, and, finally, audio rendering that enriches the text-to-speech output with background noise and reverberation extracted from the original audio. We report and discuss results of a first subjective evaluation of automatic dubbing of excerpts of TED Talks from English into Italian, which measures the perceived naturalness of automatic dubbing and the relative importance of each proposed enhancement.

14 citations


Proceedings ArticleDOI
04 May 2020
TL;DR: This article proposes a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second, enhancing the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a VAE and a Householder Flow.
Abstract: We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while jointly improving perceptual metrics over state-of-the-art. At synthesis time we use one example of expressive style as a reference input to the encoder for generating any text in the desired style. Perceptual MUSHRA evaluations show that we can create a voice with a 9% relative naturalness improvement over standard Neural Text-to-Speech, while also improving the perceived emotional intensity (59 compared to the 55 of neutral speech).
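The Householder Flow itself is simple to write down. Below is a minimal PyTorch sketch of how such a flow refines a VAE latent vector through a chain of learned reflections; the latent size, the number of flow steps, and the fact that the reflection vectors are free parameters (rather than predicted from the encoder, as a full model would typically do) are simplifying assumptions.

import torch
import torch.nn as nn

class HouseholderFlow(nn.Module):
    def __init__(self, latent_dim, n_steps=4):
        super().__init__()
        # One reflection vector per flow step, kept here as free parameters.
        self.vs = nn.ParameterList(
            [nn.Parameter(torch.randn(latent_dim)) for _ in range(n_steps)]
        )

    def forward(self, z):
        # Apply successive Householder reflections: z <- z - 2 (v.z / v.v) v.
        # Reflections are volume-preserving, so the Jacobian log-determinant is 0.
        for v in self.vs:
            coef = 2.0 * (z @ v) / (v @ v)
            z = z - coef.unsqueeze(-1) * v
        return z

latent_dim = 64
mu, log_var = torch.zeros(latent_dim), torch.zeros(latent_dim)
z0 = mu + torch.exp(0.5 * log_var) * torch.randn(latent_dim)  # reparameterisation trick
zk = HouseholderFlow(latent_dim)(z0.unsqueeze(0))             # refined latent, shape (1, 64)
print(zk.shape)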

12 citations


Posted Content
TL;DR: The authors use Bayesian optimization to tune the hyper-parameters that control fine-tuning of a pre-trained TTS model to mimic a new speaker from a small corpus of target utterances, achieving an average 30% improvement in speaker similarity over standard techniques.
Abstract: We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.

8 citations


Posted Content
TL;DR: The use of a sentence-level conditioning vector, namely the latent vector from a pre-trained VAE component of a Tacotron 2-style sequence-to-sequence model, is investigated as a way to improve the signal quality of a Parallel WaveNet neural vocoder.
Abstract: Recently the state-of-the-art text-to-speech synthesis systems have shifted to a two-model approach: a sequence-to-sequence model to predict a representation of speech (typically mel-spectrograms), followed by a 'neural vocoder' model which produces the time-domain speech waveform from this intermediate speech representation. This approach is capable of synthesizing speech that is confusable with natural speech recordings. However, the inference speed of neural vocoder approaches represents a major obstacle for deploying this technology for commercial applications. Parallel WaveNet is one approach which has been developed to address this issue, trading off some synthesis quality for significantly faster inference speed. In this paper we investigate the use of a sentence-level conditioning vector to improve the signal quality of a Parallel WaveNet neural vocoder. We condition the neural vocoder with the latent vector from a pre-trained VAE component of a Tacotron 2-style sequence-to-sequence model. With this, we are able to significantly improve the quality of vocoded speech.
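A rough sketch of the conditioning idea, assuming PyTorch-style modules: the single utterance-level VAE latent is broadcast across time and concatenated with the mel-spectrogram frames that locally condition the vocoder. The dimensions and the module name below are illustrative, not the paper's implementation.

import torch
import torch.nn as nn

class ConditioningStack(nn.Module):
    def __init__(self, n_mels=80, latent_dim=64, hidden=128):
        super().__init__()
        self.proj = nn.Conv1d(n_mels + latent_dim, hidden, kernel_size=1)

    def forward(self, mels, utt_latent):
        # mels: (batch, n_mels, frames); utt_latent: (batch, latent_dim)
        frames = mels.size(-1)
        tiled = utt_latent.unsqueeze(-1).expand(-1, -1, frames)  # broadcast over time
        cond = torch.cat([mels, tiled], dim=1)                   # frame- plus sentence-level
        return self.proj(cond)  # fed on to the vocoder's conditioning/upsampling network

mels = torch.randn(2, 80, 400)        # predicted mel-spectrogram
utt_latent = torch.randn(2, 64)       # sentence-level VAE latent
print(ConditioningStack()(mels, utt_latent).shape)  # (2, 128, 400)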

3 citations


Posted Content
TL;DR: This paper proposes an attention-based deep learning model that automatically derives an optimal syllable-level representation from frame-level and phoneme-level audio features, which, combined with Neural TTS data augmentation, achieves 94.8% precision and 49.2% recall for the detection of incorrectly stressed words in L2 English speech of Slavic and Baltic speakers.
Abstract: This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learning model that automatically derives optimal syllable-level representation from frame-level and phoneme-level audio features. Training this model is challenging because of the limited amount of incorrect stress patterns. To solve this problem, we propose to augment the training set with incorrectly stressed words generated with Neural TTS. Combining both techniques achieves 94.8% precision and 49.2% recall for the detection of incorrectly stressed words in L2 English speech of Slavic and Baltic speakers.
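To give a flavour of the attention-based feature extraction (a toy sketch, not the authors' model): attention weights are learned over the frame-level features belonging to a syllable and used to pool them into a single syllable-level vector that feeds a stress classifier. Feature dimensions, the masking scheme, and the classifier head are assumptions.

import torch
import torch.nn as nn

class SyllableAttentionPool(nn.Module):
    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))
        self.classifier = nn.Linear(feat_dim, 2)  # stressed vs. unstressed

    def forward(self, frames, mask):
        # frames: (batch, n_frames, feat_dim); mask: (batch, n_frames) booleans
        logits = self.score(frames).squeeze(-1)
        logits = logits.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(logits, dim=-1)              # attention over frames
        syllable = (weights.unsqueeze(-1) * frames).sum(1)   # syllable-level vector
        return self.classifier(syllable)

frames = torch.randn(3, 50, 40)                     # frame-level audio features
mask = torch.arange(50).expand(3, 50) < torch.tensor([[30], [45], [20]])
print(SyllableAttentionPool()(frames, mask).shape)  # (3, 2)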

1 citation