
Showing papers by "Roberto Barra-Chicote" published in 2020


Proceedings ArticleDOI
14 May 2020
TL;DR: Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
Abstract: We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
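For illustration only, here is a minimal sketch of this kind of hyper-parameter search written with scikit-optimize's Gaussian-process optimizer. The objective function is a dummy placeholder standing in for "fine-tune the base model on the target speaker and score speaker similarity", and the hyper-parameter ranges are assumptions, not the paper's settings.

# Hedged sketch: Bayesian optimization over fine-tuning hyper-parameters.
from skopt import gp_minimize
from skopt.space import Real, Integer

def fine_tune_and_score(params):
    # Placeholder objective: in the real system this would fine-tune the
    # pre-trained TTS model with these hyper-parameters and return a value
    # to minimise, e.g. negative speaker similarity on held-out utterances.
    learning_rate, batch_size, finetune_steps = params
    return -(1.0 / (1.0 + abs(learning_rate - 1e-4) * 1e4))  # dummy score

search_space = [
    Real(1e-5, 1e-3, prior="log-uniform", name="learning_rate"),
    Integer(8, 64, name="batch_size"),
    Integer(500, 10000, name="finetune_steps"),
]

result = gp_minimize(fine_tune_and_score, search_space, n_calls=20, random_state=0)
print("Best hyper-parameters found:", result.x)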

51 citations


Posted Content
TL;DR: A first subjective evaluation of automatic dubbing of excerpts of TED Talks from English into Italian is reported, measuring the perceived naturalness of automatic dubbing and the relative importance of each proposed enhancement.
Abstract: We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing. Our architecture features neural machine translation generating output of preferred length, prosodic alignment of the translation with the original speech segments, neural text-to-speech with fine-tuning of the duration of each utterance, and, finally, audio rendering that enriches the text-to-speech output with background noise and reverberation extracted from the original audio. We report on a subjective evaluation of automatic dubbing of excerpts of TED Talks from English into Italian, which measures the perceived naturalness of automatic dubbing and the relative importance of each proposed enhancement.
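As a small, purely illustrative example of the duration fine-tuning step, the snippet below rescales predicted per-phoneme durations so that a synthesized utterance fills the time slot of the original speech segment; the function, the 30 ms floor, and the example values are assumptions rather than the paper's implementation.

def fit_durations(phoneme_durations_ms, target_seconds, floor_ms=30):
    # Rescale per-phoneme durations (in ms) so their sum matches the target slot.
    scale = (target_seconds * 1000.0) / sum(phoneme_durations_ms)
    # Clamp to an assumed floor so very short phonemes remain intelligible.
    return [max(floor_ms, d * scale) for d in phoneme_durations_ms]

durations = [55, 80, 120, 60, 95, 70]          # predicted durations in milliseconds
fitted = fit_durations(durations, target_seconds=0.6)
print(fitted, round(sum(fitted)))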

23 citations


Proceedings ArticleDOI
25 Oct 2020
TL;DR: This paper focuses on recent progress on the prosodic alignment component, which aims at synchronizing the translated transcript with the original utterances, and presents empirical results for English-to-Italian dubbing on a publicly available collection of TED Talks.
Abstract: Automatic dubbing aims at replacing all speech contained in a video with speech in a different language, so that the result sounds and looks as natural as the original. Hence, in addition to conveying the same content of an original utterance (which is the typical objective of speech translation), dubbed speech should ideally also match its duration, the lip movements and gestures in the video, timbre, emotion and prosody of the speaker, and finally background noise and reverberation of the environment. In this paper, after describing our dubbing architecture, we focus on recent progress on the prosodic alignment component, which aims at synchronizing the translated transcript with the original utterances. We present empirical results for English-to-Italian dubbing on a publicly available collection of TED Talks. Our new prosodic alignment model, which allows for small relaxations in synchronicity, significantly improves both prosodic alignment accuracy and overall subjective dubbing quality compared to previous work.
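As a toy illustration of what prosodic alignment involves (not the model evaluated in the paper), the dynamic program below splits a translated word sequence into as many contiguous chunks as there are original speech segments, minimizing duration mismatch while tolerating a small amount of asynchrony. The per-word duration estimates, the slack term, and all example values are assumptions.

def align(words, word_durs, seg_durs, slack=0.3):
    n, m = len(words), len(seg_durs)
    INF = float("inf")
    prefix = [0.0]
    for d in word_durs:
        prefix.append(prefix[-1] + d)
    # cost[i][j]: best cost of assigning the first i words to the first j segments.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for j in range(1, m + 1):
        for i in range(j, n + 1):
            for k in range(j - 1, i):
                chunk = prefix[i] - prefix[k]
                mismatch = abs(chunk - seg_durs[j - 1])
                cand = cost[k][j - 1] + max(0.0, mismatch - slack)  # relaxed synchronicity
                if cand < cost[i][j]:
                    cost[i][j], back[i][j] = cand, k
    # Walk back through the table to recover the chunk boundaries.
    chunks, i = [], n
    for j in range(m, 0, -1):
        k = back[i][j]
        chunks.append(" ".join(words[k:i]))
        i = k
    return list(reversed(chunks))

words = "questa è una breve dimostrazione di doppiaggio automatico".split()
word_durs = [0.08 * len(w) for w in words]   # crude duration estimate per word (s)
seg_durs = [1.2, 1.1, 1.0]                   # original segment lengths in seconds
print(align(words, word_durs, seg_durs))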

22 citations


Proceedings ArticleDOI
01 Jul 2020
TL;DR: In this paper, the authors present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing, and report a subjective evaluation on excerpts of TED Talks dubbed from English into Italian that measures the perceived naturalness of the automatic dubbing and the relative importance of each proposed enhancement.
Abstract: We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing. Our architecture features neural machine translation generating output of preferred length, prosodic alignment of the translation with the original speech segments, neural text-to-speech with fine-tuning of the duration of each utterance, and, finally, audio rendering that enriches the text-to-speech output with background noise and reverberation extracted from the original audio. We report and discuss results of a first subjective evaluation of automatic dubbing of excerpts of TED Talks from English into Italian, which measures the perceived naturalness of automatic dubbing and the relative importance of each proposed enhancement.

14 citations


Proceedings ArticleDOI
04 May 2020
TL;DR: This article proposes a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second, enhancing the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a VAE and a Householder Flow.
Abstract: We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while jointly improving perceptual metrics over state-of-the-art. At synthesis time we use one example of expressive style as a reference input to the encoder for generating any text in the desired style. Perceptual MUSHRA evaluations show that we can create a voice with a 9% relative naturalness improvement over standard Neural Text-to-Speech, while also improving the perceived emotional intensity (59 compared to the 55 of neutral speech).
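The Householder Flow itself is simple to write down. Below is a minimal PyTorch sketch of how such a flow refines a VAE latent vector through a chain of learned reflections; the latent size, the number of flow steps, and the fact that the reflection vectors are free parameters (rather than predicted from the encoder, as a full model would typically do) are simplifying assumptions.

import torch
import torch.nn as nn

class HouseholderFlow(nn.Module):
    def __init__(self, latent_dim, n_steps=4):
        super().__init__()
        # One reflection vector per flow step, kept here as free parameters.
        self.vs = nn.ParameterList(
            [nn.Parameter(torch.randn(latent_dim)) for _ in range(n_steps)]
        )

    def forward(self, z):
        # Apply successive Householder reflections: z <- z - 2 (v.z / v.v) v.
        # Reflections are volume-preserving, so the Jacobian log-determinant is 0.
        for v in self.vs:
            coef = 2.0 * (z @ v) / (v @ v)
            z = z - coef.unsqueeze(-1) * v
        return z

latent_dim = 64
mu, log_var = torch.zeros(latent_dim), torch.zeros(latent_dim)
z0 = mu + torch.exp(0.5 * log_var) * torch.randn(latent_dim)  # reparameterisation trick
zk = HouseholderFlow(latent_dim)(z0.unsqueeze(0))             # refined latent, shape (1, 64)
print(zk.shape)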

12 citations


Posted Content
TL;DR: The authors use Bayesian optimization to tune the hyper-parameters that control fine-tuning of a pre-trained TTS model to mimic a new speaker from a small corpus of target utterances, achieving an average 30% improvement in speaker similarity over standard techniques.
Abstract: We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.

8 citations


Posted Content
TL;DR: The use of a sentence-level conditioning vector, namely the latent vector from a pre-trained VAE component of a Tacotron 2-style sequence-to-sequence model, is investigated as a way to improve the signal quality of a Parallel WaveNet neural vocoder.
Abstract: Recently the state-of-the-art text-to-speech synthesis systems have shifted to a two-model approach: a sequence-to-sequence model to predict a representation of speech (typically mel-spectrograms), followed by a 'neural vocoder' model which produces the time-domain speech waveform from this intermediate speech representation. This approach is capable of synthesizing speech that is confusable with natural speech recordings. However, the inference speed of neural vocoder approaches represents a major obstacle for deploying this technology for commercial applications. Parallel WaveNet is one approach which has been developed to address this issue, trading off some synthesis quality for significantly faster inference speed. In this paper we investigate the use of a sentence-level conditioning vector to improve the signal quality of a Parallel WaveNet neural vocoder. We condition the neural vocoder with the latent vector from a pre-trained VAE component of a Tacotron 2-style sequence-to-sequence model. With this, we are able to significantly improve the quality of vocoded speech.
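A rough sketch of the conditioning idea, assuming PyTorch-style modules: the single utterance-level VAE latent is broadcast across time and concatenated with the mel-spectrogram frames that locally condition the vocoder. The dimensions and the module name below are illustrative, not the paper's implementation.

import torch
import torch.nn as nn

class ConditioningStack(nn.Module):
    def __init__(self, n_mels=80, latent_dim=64, hidden=128):
        super().__init__()
        self.proj = nn.Conv1d(n_mels + latent_dim, hidden, kernel_size=1)

    def forward(self, mels, utt_latent):
        # mels: (batch, n_mels, frames); utt_latent: (batch, latent_dim)
        frames = mels.size(-1)
        tiled = utt_latent.unsqueeze(-1).expand(-1, -1, frames)  # broadcast over time
        cond = torch.cat([mels, tiled], dim=1)                   # frame- plus sentence-level
        return self.proj(cond)  # fed on to the vocoder's conditioning/upsampling network

mels = torch.randn(2, 80, 400)        # predicted mel-spectrogram
utt_latent = torch.randn(2, 64)       # sentence-level VAE latent
print(ConditioningStack()(mels, utt_latent).shape)  # (2, 128, 400)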

3 citations


Posted Content
TL;DR: This paper proposes an attention-based deep learning model that automatically derives an optimal syllable-level representation from frame-level and phoneme-level audio features, which, combined with Neural TTS data augmentation, achieves 94.8% precision and 49.2% recall for the detection of incorrectly stressed words in L2 English speech of Slavic and Baltic speakers.
Abstract: This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learning model that automatically derives optimal syllable-level representation from frame-level and phoneme-level audio features. Training this model is challenging because of the limited amount of incorrect stress patterns. To solve this problem, we propose to augment the training set with incorrectly stressed words generated with Neural TTS. Combining both techniques achieves 94.8% precision and 49.2% recall for the detection of incorrectly stressed words in L2 English speech of Slavic and Baltic speakers.
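To give a flavour of the attention-based feature extraction (a toy sketch, not the authors' model): attention weights are learned over the frame-level features belonging to a syllable and used to pool them into a single syllable-level vector that feeds a stress classifier. Feature dimensions, the masking scheme, and the classifier head are assumptions.

import torch
import torch.nn as nn

class SyllableAttentionPool(nn.Module):
    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))
        self.classifier = nn.Linear(feat_dim, 2)  # stressed vs. unstressed

    def forward(self, frames, mask):
        # frames: (batch, n_frames, feat_dim); mask: (batch, n_frames) booleans
        logits = self.score(frames).squeeze(-1)
        logits = logits.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(logits, dim=-1)              # attention over frames
        syllable = (weights.unsqueeze(-1) * frames).sum(1)   # syllable-level vector
        return self.classifier(syllable)

frames = torch.randn(3, 50, 40)                     # frame-level audio features
mask = torch.arange(50).expand(3, 50) < torch.tensor([[30], [45], [20]])
print(SyllableAttentionPool()(frames, mask).shape)  # (3, 2)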

1 citation