Home
/
Authors
/
Chris Donahue

Author

Chris Donahue

Bio: Chris Donahue is an academic researcher from University of California, San Diego. The author has contributed to research in topics: Engineering & Video game. The author has an hindex of 15, co-authored 30 publications receiving 1252 citations. Previous affiliations of Chris Donahue include Stanford University.

Topics: Engineering, Video game, Computer science, MIDI, Language model ...read more

Papers

PDF

Open Access

More filters

Proceedings Article•

Adversarial Audio Synthesis

[...]

Chris Donahue¹, Julian McAuley², Miller S. Puckette¹•Institutions (2)

University of California, San Diego¹, University of California²

27 Sep 2018

TL;DR: WaveGAN is a first attempt at applying GANs to unsupervised synthesis of raw-waveform audio, capable of synthesizing one second slices of audio waveforms with global coherence, suitable for sound effect generation.

...read moreread less

Abstract: Audio signals are sampled at high temporal resolutions, and learning to synthesize audio requires capturing structure across a range of timescales. Generative adversarial networks (GANs) have seen wide success at generating images that are both locally and globally coherent, but they have seen little application to audio generation. In this paper we introduce WaveGAN, a first attempt at applying GANs to unsupervised synthesis of raw-waveform audio. WaveGAN is capable of synthesizing one second slices of audio waveforms with global coherence, suitable for sound effect generation. Our experiments demonstrate that, without labels, WaveGAN learns to produce intelligible words when trained on a small-vocabulary speech dataset, and can also synthesize audio from other domains such as drums, bird vocalizations, and piano. We compare WaveGAN to a method which applies GANs designed for image generation on image-like audio feature representations, finding both approaches to be promising.

...read moreread less

248 citations

Posted Content•

Adversarial Audio Synthesis

[...]

Chris Donahue¹, Julian McAuley², Miller S. Puckette¹•Institutions (2)

University of California, San Diego¹, University of California²

12 Feb 2018-arXiv: Sound

TL;DR: WaveGAN as mentioned in this paper uses GANs to synthesize one second slices of audio waveforms with global coherence, suitable for sound effect generation, which can also synthesize audio from other domains such as drums, bird vocalizations, and piano.

...read moreread less

246 citations

Proceedings Article•DOI•

Enabling Language Models to Fill in the Blanks.

[...]

Chris Donahue¹, Mina Lee¹, Percy Liang¹•Institutions (1)

Stanford University¹

01 Jul 2020

TL;DR: It is shown that humans have difficulty identifying sentences infilled by the approach, which can enable LMs to infill entire sentences effectively on three different domains: short stories, scientific abstracts, and lyrics.

...read moreread less

Abstract: We present a simple approach for text infilling, the task of predicting missing spans of text at any position in a document. While infilling could enable rich functionality especially for writing assistance tools, more attention has been devoted to language modeling—a special case of infilling where text is predicted at the end of a document. In this paper, we aim to extend the capabilities of language models (LMs) to the more general task of infilling. To this end, we train (or fine tune) off-the-shelf LMs on sequences containing the concatenation of artificially-masked text and the text which was masked. We show that this approach, which we call infilling by language modeling, can enable LMs to infill entire sentences effectively on three different domains: short stories, scientific abstracts, and lyrics. Furthermore, we show that humans have difficulty identifying sentences infilled by our approach as machine-generated in the domain of short stories.

...read moreread less

136 citations

Proceedings Article•DOI•

Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition

[...]

Chris Donahue¹, Bo Li², Rohit Prabhavalkar²•Institutions (2)

University of California, San Diego¹, Google²

15 Apr 2018

TL;DR: In this article, the authors investigate the effectiveness of GANs for speech enhancement, in the context of improving noise robustness of automatic speech recognition (ASR) systems, and propose operating GAN on log-Mel filterbank spectra instead of waveforms, which requires less computation and is robust to reverberant noise.

...read moreread less

Abstract: We investigate the effectiveness of generative adversarial networks (GANs) for speech enhancement, in the context of improving noise robustness of automatic speech recognition (ASR) systems. Prior work [1] demonstrates that GANs can effectively suppress additive noise in raw waveform speech signals, improving perceptual quality metrics; however this technique was not justified in the context of ASR. In this work, we conduct a detailed study to measure the effectiveness of GANs in enhancing speech contaminated by both additive and reverberant noise. Motivated by recent advances in image processing [2], we propose operating GANs on log-Mel filterbank spectra instead of waveforms, which requires less computation and is more robust to reverberant noise. While GAN enhancement improves the performance of a clean-trained ASR system on noisy speech, it falls short of the performance achieved by conventional multi-style training (MTR). By appending the GAN-enhanced features to the noisy inputs and retraining, we achieve a 7% WER improvement relative to the MTR system.

...read moreread less

136 citations

Posted Content•

Synthesizing Audio with Generative Adversarial Networks

[...]

Chris Donahue, Julian McAuley, Miller S. Puckette

12 Feb 2018

TL;DR: WaveGAN is introduced, a first attempt at applying GANs to raw audio synthesis in an unsupervised setting and it is found that human judges prefer the generated examples from WaveGAN over those from a method which naively apply GAns on image-like audio feature representations.

...read moreread less

Abstract: While Generative Adversarial Networks (GANs) have seen wide success at the problem of synthesizing realistic images, they have seen little application to the problem of unsupervised audio generation Unlike for images, a barrier to success is that the best discriminative representations for audio tend to be non-invertible, and thus cannot be used to synthesize listenable outputs In this paper, we introduce WaveGAN, a first attempt at applying GANs to raw audio synthesis in an unsupervised setting Our experiments on speech demonstrate that WaveGAN can produce intelligible words from a small vocabulary of human speech, as well as synthesize audio from other domains such as bird vocalizations, drums, and piano Qualitatively, we find that human judges prefer the generated examples from WaveGAN over those from a method which naively apply GANs on image-like audio feature representations

...read moreread less

132 citations

1
2
3
4
…
5
6
7

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•

“Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告

[...]

杉山拓海

12 Sep 2017-Computers & Graphics

3,940 citations

Book Chapter•DOI•

Multimodal Unsupervised Image-to-Image Translation

[...]

Xun Huang¹, Ming-Yu Liu², Serge Belongie¹, Jan Kautz²•Institutions (2)

Cornell University¹, Nvidia²

08 Sep 2018

TL;DR: In this article, the authors propose a multimodal unsupervised image-to-image (MUNIT) framework, where the image representation can be decomposed into a content code that is domain-invariant and a style code that captures domain-specific properties.

...read moreread less

Abstract: Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any examples of corresponding image pairs. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image \(\text{ Translation } \text{(MUNIT) }\) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to state-of-the-art approaches further demonstrate the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at https://github.com/nvlabs/MUNIT.

...read moreread less

1,874 citations

Proceedings Article•

Implicit Neural Representations with Periodic Activation Functions

[...]

Vincent Sitzmann¹, Julien N. P. Martel¹, Alexander W. Bergman¹, David B. Lindell¹, Gordon Wetzstein¹ - Show less +1 more•Institutions (1)

Stanford University¹

17 Jun 2020

TL;DR: In this paper, the authors propose to leverage periodic activation functions for implicit neural representations and demonstrate that these networks, dubbed sinusoidal representation networks or Sirens, are ideally suited for representing complex natural signals and their derivatives.

...read moreread less

Abstract: Implicitly defined, continuous, differentiable signal representations parameterized by neural networks have emerged as a powerful paradigm, offering many possible benefits over conventional representations. However, current network architectures for such implicit neural representations are incapable of modeling signals with fine detail, and fail to represent a signal's spatial and temporal derivatives, despite the fact that these are essential to many physical signals defined implicitly as the solution to partial differential equations. We propose to leverage periodic activation functions for implicit neural representations and demonstrate that these networks, dubbed sinusoidal representation networks or Sirens, are ideally suited for representing complex natural signals and their derivatives. We analyze Siren activation statistics to propose a principled initialization scheme and demonstrate the representation of images, wavefields, video, sound, and their derivatives. Further, we show how Sirens can be leveraged to solve challenging boundary value problems, such as particular Eikonal equations (yielding signed distance functions), the Poisson equation, and the Helmholtz and wave equations. Lastly, we combine Sirens with hypernetworks to learn priors over the space of Siren functions.

...read moreread less

1,058 citations

Posted Content•

Multimodal Unsupervised Image-to-Image Translation

[...]

Xun Huang¹, Ming-Yu Liu², Serge Belongie¹, Jan Kautz²•Institutions (2)

Cornell University¹, Nvidia²

12 Apr 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: A Multimodal Unsupervised Image-to-image Translation (MUNIT) framework that assumes that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties.

...read moreread less

Abstract: Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to the state-of-the-art approaches further demonstrates the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at this https URL

...read moreread less

923 citations

Posted Content•

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

[...]

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

12 Oct 2020-arXiv: Sound

TL;DR: It is demonstrated that modeling periodic patterns of an audio is crucial for enhancing sample quality and the generality of HiFi-GAN is shown to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis.

...read moreread less

Abstract: Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of an audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

...read moreread less

629 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse