Home
/
Authors
/
Rafael Valle

Author

Rafael Valle

Other affiliations: University of California, Berkeley

Bio: Rafael Valle is an academic researcher from Nvidia. The author has contributed to research in topics: Speech synthesis & Computer science. The author has an hindex of 9, co-authored 26 publications receiving 992 citations. Previous affiliations of Rafael Valle include University of California, Berkeley.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Waveglow: A Flow-based Generative Network for Speech Synthesis

[...]

Ryan Prenger¹, Rafael Valle¹, Bryan Catanzaro¹•Institutions (1)

Nvidia¹

12 May 2019

TL;DR: WaveGlow as mentioned in this paper is a flow-based network capable of generating high quality speech from mel-spectrograms without the need for auto-regression, and it is implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data.

...read moreread less

Abstract: In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow [1] and WaveNet [2] in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable. Our PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation. All code will be made publicly available online [3].

...read moreread less

606 citations

Posted Content•

WaveGlow: A Flow-based Generative Network for Speech Synthesis

[...]

Ryan Prenger¹, Rafael Valle¹, Bryan Catanzaro¹•Institutions (1)

Nvidia¹

31 Oct 2018-arXiv: Sound

TL;DR: WaveGlow is a flow-based network capable of generating high quality speech from mel-spectrograms, implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.

...read moreread less

Abstract: In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable. Our PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation. All code will be made publicly available online.

...read moreread less

525 citations

Proceedings Article•DOI•

Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens

[...]

Rafael Valle¹, Jason Li¹, Ryan Prenger¹, Bryan Catanzaro¹•Institutions (1)

Nvidia¹

04 May 2020

TL;DR: A multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data, and synthesized samples that include style transfer from other speakers, singers and styles not seen during training, procedural manipulation of rhythm and pitch and choir synthesis.

...read moreread less

Abstract: Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles ranging from read speech to expressive speech, from slow drawls to rap and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data without alignments between text and audio. We evaluate our models using the LJSpeech and LibriTTS datasets. We provide F0 Frame Errors and synthesized samples that include style transfer from other speakers, singers and styles not seen during training, procedural manipulation of rhythm and pitch and choir synthesis.

...read moreread less

122 citations

Posted Content•

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

[...]

Rafael Valle¹, Kevin J. Shih¹, Ryan Prenger¹, Bryan Catanzaro²•Institutions (2)

Nvidia¹, University of California, Berkeley²

12 May 2020-arXiv: Sound

TL;DR: The mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality, and results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training are provided.

...read moreread less

Abstract: In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows insights from IAF and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be manipulated to control many aspects of speech synthesis (pitch, tone, speech rate, cadence, accent). Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. In addition, we provide results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training. Code and pre-trained models will be made publicly available at this https URL

...read moreread less

85 citations

Journal Article•DOI•

Missing Data Imputation for Supervised Learning

[...]

Jason Poulos¹, Rafael Valle¹•Institutions (1)

University of California, Berkeley¹

19 Mar 2018-Applied Artificial Intelligence

TL;DR: In this article, the authors compare methods for imputing missing categorical data and find that missing data imputation can help improve the performance of prediction models in situations where missing data hide useful information.

...read moreread less

Abstract: Missing data imputation can help improve the performance of prediction models in situations where missing data hide useful information This paper compares methods for imputing missing categorical

...read moreread less

40 citations

1
2
3
4
…
5
6
7

Collapse

Cited by

PDF

Open Access

More filters

Posted Content•

Denoising Diffusion Probabilistic Models

[...]

Jonathan Ho¹, Ajay Jain¹, Pieter Abbeel¹•Institutions (1)

University of California, Berkeley¹

19 Jun 2020-arXiv: Learning

TL;DR: High quality image synthesis results are presented using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics, which naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.

...read moreread less

Abstract: We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at this https URL

...read moreread less

2,704 citations

Posted Content•

Normalizing Flows for Probabilistic Modeling and Inference

[...]

George Papamakarios¹, Eric Nalisnick¹, Danilo Jimenez Rezende¹, Shakir Mohamed¹, Balaji Lakshminarayanan¹ - Show less +1 more•Institutions (1)

Google¹

05 Dec 2019-arXiv: Machine Learning

TL;DR: This review places special emphasis on the fundamental principles of flow design, and discusses foundational topics such as expressive power and computational trade-offs, and summarizes the use of flows for tasks such as generative modeling, approximate inference, and supervised learning.

...read moreread less

Abstract: Normalizing flows provide a general mechanism for defining expressive probability distributions, only requiring the specification of a (usually simple) base distribution and a series of bijective transformations. There has been much recent work on normalizing flows, ranging from improving their expressive power to expanding their application. We believe the field has now matured and is in need of a unified perspective. In this review, we attempt to provide such a perspective by describing flows through the lens of probabilistic modeling and inference. We place special emphasis on the fundamental principles of flow design, and discuss foundational topics such as expressive power and computational trade-offs. We also broaden the conceptual framing of flows by relating them to more general probability transformations. Lastly, we summarize the use of flows for tasks such as generative modeling, approximate inference, and supervised learning.

...read moreread less

716 citations

Journal Article•DOI•

Normalizing Flows: An Introduction and Review of Current Methods

[...]

Ivan Kobyzev, Simon J. D. Prince, Marcus A. Brubaker

01 Nov 2021-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The goal of this survey article is to give a coherent and comprehensive review of the literature around the construction and use of Normalizing Flows for distribution learning to provide context and explanation of the models.

...read moreread less

Abstract: Normalizing Flows are generative models which produce tractable distributions where both sampling and density evaluation can be efficient and exact. The goal of this survey article is to give a coherent and comprehensive review of the literature around the construction and use of Normalizing Flows for distribution learning. We aim to provide context and explanation of the models, review current state-of-the-art literature, and identify open questions and promising future directions.

...read moreread less

683 citations

Posted Content•

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

[...]

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

12 Oct 2020-arXiv: Sound

TL;DR: It is demonstrated that modeling periodic patterns of an audio is crucial for enhancing sample quality and the generality of HiFi-GAN is shown to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis.

...read moreread less

Abstract: Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of an audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

...read moreread less

629 citations

Proceedings Article•

FastSpeech: Fast, Robust and Controllable Text to Speech

[...]

Yi Ren¹, Yangjun Ruan¹, Xu Tan², Tao Qin², Sheng Zhao², Zhou Zhao¹, Tie-Yan Liu² - Show less +3 more•Institutions (2)

Zhejiang University¹, Microsoft²

22 May 2019

TL;DR: FastSpeech as mentioned in this paper proposes a feed-forward network based on Transformer to generate mel-spectrogram in parallel for text-to-speech (TTS) by extracting attention alignments from an encoder-decoder based teacher model.

...read moreread less

Abstract: Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation. Experiments on the LJSpeech dataset show that our parallel model matches autoregressive models in terms of speech quality, nearly eliminates the problem of word skipping and repeating in particularly hard cases, and can adjust voice speed smoothly. Most importantly, compared with autoregressive Transformer TTS, our model speeds up mel-spectrogram generation by 270x and the end-to-end speech synthesis by 38x. Therefore, we call our model FastSpeech.

...read moreread less

623 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse