Author

Roberto Barra-Chicote

Other affiliations: Technical University of Madrid
Bio: Roberto Barra-Chicote is an academic researcher from Amazon.com. The author has contributed to research in topics: Speech synthesis & Computer science. The author has an h-index of 17 and has co-authored 58 publications receiving 707 citations. Previous affiliations of Roberto Barra-Chicote include Technical University of Madrid.


Papers
Journal ArticleDOI
TL;DR: Adapted MFCC and PLP coefficients improve human activity recognition and segmentation accuracy while considerably reducing feature-vector size; they significantly outperform baseline error rates and significantly reduce the segmentation error rate.
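The core idea behind cepstral features such as MFCCs is to compress a frame's log spectrum into a handful of DCT coefficients, which is what shrinks the feature vector. A simplified sketch of that compression step (a plain log-spectrum cepstrum, omitting the mel filterbank and the paper's adaptations, with an illustrative 13-coefficient cutoff):

```python
import numpy as np
from scipy.fft import dct

# Simplified cepstral-feature sketch: window a frame of a 1-D signal,
# take its log power spectrum, and keep only the first 13 DCT
# coefficients. This is the compression idea behind MFCC/PLP features;
# full MFCCs additionally warp the spectrum with a mel filterbank.

def cepstral_features(frame, n_coeffs=13):
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    log_spec = np.log(spectrum + 1e-10)       # avoid log(0)
    return dct(log_spec, norm="ortho")[:n_coeffs]

frame = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 256))  # toy 256-sample frame
feats = cepstral_features(frame)
print(feats.shape)  # → (13,)
```

A 256-sample frame yields 129 spectral bins but only 13 cepstral coefficients, which is the feature-size reduction the TL;DR refers to.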

83 citations

Proceedings ArticleDOI
15 Sep 2019
TL;DR: A WaveRNN-based vocoder is shown to be capable of generating speech of consistently good quality regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario when the recording conditions are studio-quality.
Abstract: This paper explores the potential universality of neural vocoders. We train a WaveRNN-based vocoder on 74 speakers coming from 17 languages. This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario when the recording conditions are studio-quality. When the recordings show significant changes in quality, or when moving towards non-speech vocalizations or singing, the vocoder still significantly outperforms speaker-dependent vocoders, but operates at a lower average relative MUSHRA of 75%. These results are shown to be consistent across languages, whether seen during training (e.g. English or Japanese) or unseen (e.g. Wolof, Swahili, Amharic).
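The abstract's headline numbers are *relative* MUSHRA scores: the vocoder's mean listener rating expressed as a percentage of the mean rating given to natural speech. A plausible reading of that metric (not the authors' actual evaluation code, and with fabricated toy ratings) can be sketched as:

```python
import numpy as np

def relative_mushra(system_scores, reference_scores):
    """Mean rating of a system as a percentage of the mean rating
    given to the natural-speech reference."""
    return 100.0 * np.mean(system_scores) / np.mean(reference_scores)

# Toy listener ratings on the 0-100 MUSHRA scale (fabricated for illustration).
natural = [95, 92, 98, 94]
vocoded = [93, 90, 95, 93]
print(round(relative_mushra(vocoded, natural), 1))  # → 97.9
```

Under this reading, 98% means vocoded speech was rated nearly as high as the natural reference, while 75% marks the clearly audible gap on out-of-domain material.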

80 citations

Journal ArticleDOI
TL;DR: The analysis shows that, although the HMM method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions having context-dependent prosodic patterns.

75 citations

Proceedings ArticleDOI
14 May 2020
TL;DR: Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
Abstract: We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
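The loop the abstract describes — a surrogate model proposing hyper-parameter values, each evaluated by an expensive fine-tuning run — can be sketched with a minimal Gaussian-process surrogate and expected-improvement acquisition. Everything here is illustrative, not the paper's setup: a single learning-rate hyper-parameter, a toy stand-in objective for "speaker similarity after fine-tuning" (the real objective requires fine-tuning a TTS model per trial), and hand-picked kernel settings.

```python
import math
import numpy as np

def rbf(a, b, ls=0.15):
    # Squared-exponential kernel over a 1-D hyper-parameter.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(x_tr, y_tr, x_q, noise=1e-6):
    # Standard GP regression equations; k(x, x) = 1 for this kernel.
    K = rbf(x_tr, x_tr) + noise * np.eye(len(x_tr))
    Ks = rbf(x_tr, x_q)
    mu = Ks.T @ np.linalg.solve(K, y_tr)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

erf_vec = np.vectorize(math.erf)

def expected_improvement(mu, sigma, best):
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + erf_vec(z / math.sqrt(2)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def toy_similarity(lr):  # pretend speaker-similarity score, peaks at lr = 0.45
    return np.exp(-((lr - 0.45) ** 2) / 0.02)

rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, 3)          # three initial random trials
ys = toy_similarity(xs)
grid = np.linspace(0, 1, 200)
for _ in range(10):                # each step = one (simulated) fine-tuning run
    mu, sigma = gp_posterior(xs, ys, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, ys.max()))]
    xs = np.append(xs, x_next)
    ys = np.append(ys, toy_similarity(x_next))

print(round(float(xs[np.argmax(ys)]), 2))   # best learning rate found
```

The point of the design is sample efficiency: each proposed value costs one full fine-tuning run, so the surrogate must locate a good configuration in tens, not thousands, of trials.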

51 citations

Posted Content
TL;DR: The authors trained a WaveRNN-based vocoder on 74 speakers coming from 17 languages and found that the results were consistent across languages, whether seen during training or unseen (e.g. Wolof, Swahili, Amharic).

38 citations


Cited by
Journal ArticleDOI
01 Oct 1980

1,565 citations

Journal ArticleDOI
TL;DR: The basic phenomena of the last fifteen years are reviewed, with commentary on databases, modelling and annotation, the unit of analysis and prototypicality, and automatic processing, including discussions of features, classification, robustness, evaluation, implementation and system integration.

671 citations

Journal ArticleDOI
TL;DR: It is shown that Bayesian models are able to use prior information and model measurements with various distributions, and a range of deep neural networks can be integrated in multi-modal learning for capturing the complex mechanism of biological systems.
Abstract: Driven by high-throughput sequencing techniques, modern genomic and clinical studies are in a strong need of integrative machine learning models for better use of vast volumes of heterogeneous information in the deep understanding of biological systems and the development of predictive models. How data from multiple sources (called multi-view data) are incorporated in a learning system is a key step for successful analysis. In this article, we provide a comprehensive review on omics and clinical data integration techniques, from a machine learning perspective, for various analyses such as prediction, clustering, dimension reduction and association. We shall show that Bayesian models are able to use prior information and model measurements with various distributions; tree-based methods can either build a tree with all features or collectively make a final decision based on trees learned from each view; kernel methods fuse the similarity matrices learned from individual views together for a final similarity matrix or learning model; network-based fusion methods are capable of inferring direct and indirect associations in a heterogeneous network; matrix factorization models have potential to learn interactions among features from different views; and a range of deep neural networks can be integrated in multi-modal learning for capturing the complex mechanism of biological systems.
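Of the fusion strategies the review lists, the kernel approach is the easiest to make concrete: compute a similarity (kernel) matrix per data view, then combine them into one fused matrix for a downstream kernel method. A minimal sketch with random stand-in data and a plain unweighted average (real methods typically learn per-view weights):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise RBF similarities between the rows (samples) of X.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
omics_view = rng.normal(size=(5, 20))     # e.g. gene expression, 5 samples
clinical_view = rng.normal(size=(5, 4))   # e.g. clinical covariates, same samples

# Fuse per-view similarity matrices into one matrix over the 5 samples.
K_fused = 0.5 * rbf_kernel(omics_view) + 0.5 * rbf_kernel(clinical_view)
print(K_fused.shape, bool(np.allclose(K_fused, K_fused.T)))  # → (5, 5) True
```

Because the fused matrix lives in sample space rather than feature space, views with wildly different dimensionality (20 genes vs. 4 clinical variables here) combine into a single object that kernel clustering or a kernel SVM can consume directly.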

333 citations