Open Access Journal Article (DOI)

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

TL;DR
The proposed method generates more natural spectral parameters and $F_0$ than the conventional minimum generation error training algorithm regardless of its hyperparameter settings, and a Wasserstein GAN minimizing the Earth-Mover's distance works best for improving synthetic speech quality.
Abstract
A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural network techniques can artificially synthesize speech waveforms, the synthetic speech quality is low compared with that of natural speech. One issue causing this quality degradation is the oversmoothing effect often observed in generated speech parameters. The GAN introduced in this paper consists of two neural networks: a discriminator that distinguishes natural from generated samples, and a generator that deceives the discriminator. In the proposed framework, the discriminator is trained to distinguish natural from generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation error (MGE) loss and an adversarial loss for deceiving the discriminator. Since the objective of the GAN is to minimize the divergence (i.e., the distribution difference) between the natural and generated speech parameters, the proposed method effectively alleviates the oversmoothing of the generated speech parameters. We evaluated the effectiveness of the method for text-to-speech and voice conversion, and found that it generates more natural spectral parameters and $F_0$ than the conventional MGE training algorithm regardless of its hyperparameter settings. Furthermore, we investigated the effect of the divergences minimized by various GANs, and found that a Wasserstein GAN minimizing the Earth-Mover's distance works best in terms of improving synthetic speech quality.
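The training scheme described in the abstract reduces to a two-step loop per batch: update the discriminator, then update the acoustic model against the weighted sum of losses. Below is a minimal PyTorch-style sketch of that loop; the network shapes, the MSE stand-in for the MGE loss, and the weight `w_adv` are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

ling_dim, feat_dim = 300, 60   # assumed input/output dimensionalities

generator = nn.Sequential(      # acoustic model: linguistic -> speech params
    nn.Linear(ling_dim, 256), nn.ReLU(),
    nn.Linear(256, feat_dim))
discriminator = nn.Sequential(  # natural-vs-generated classifier
    nn.Linear(feat_dim, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
mge = nn.MSELoss()   # mean squared error stands in for the MGE loss here
w_adv = 0.1          # illustrative weight on the adversarial term

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def train_step(ling_feats, natural_params):
    batch = natural_params.size(0)
    real, fake = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) Discriminator: distinguish natural from generated speech parameters.
    generated = generator(ling_feats).detach()
    d_loss = bce(discriminator(natural_params), real) + \
             bce(discriminator(generated), fake)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Acoustic model: weighted sum of the generation loss and an
    #    adversarial loss that rewards deceiving the discriminator.
    generated = generator(ling_feats)
    g_loss = mge(generated, natural_params) + \
             w_adv * bce(discriminator(generated), real)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```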


Citations
Journal Article (DOI)

A Survey of Deep Learning: Platforms, Applications and Emerging Research Trends

TL;DR: A thorough investigation of deep learning applications and mechanisms, compiled as a categorical collection of state-of-the-art research, to provide a broad reference for those seeking a primer on deep learning and its various implementations, platforms, algorithms, and uses in smart-world systems.
Posted Content

A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications

TL;DR: This paper provides a review of various GAN methods from the perspectives of algorithms, theory, and applications, and compares the commonalities and differences of these methods.
Proceedings Article (DOI)

StarGAN-VC: Non-Parallel Many-to-Many Voice Conversion Using Star Generative Adversarial Networks

TL;DR: StarGAN-VC uses a variant of the generative adversarial network (GAN) called StarGAN to learn many-to-many mappings across different attribute domains using a single generator.
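The single-generator idea can be sketched concisely: one network conditioned on a target-domain code covers every conversion pair, instead of training one network per pair. This is a minimal PyTorch-style sketch; the feature dimensionality, domain count, and dense layers are illustrative stand-ins for StarGAN-VC's actual convolutional architecture.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Single generator for all source->target domain pairs, conditioned
    on a one-hot target-domain code c (shapes are assumptions)."""
    def __init__(self, feat_dim=36, n_domains=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + n_domains, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim))

    def forward(self, x, c):
        # Concatenating features with the target-domain code lets one
        # network learn many-to-many mappings rather than one per pair.
        return self.net(torch.cat([x, c], dim=-1))
```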
Proceedings Article (DOI)

WaveNet Vocoder with Limited Training Data for Voice Conversion

TL;DR: Experimental results show that WaveNet vocoders built using the proposed method outperform the conventional STRAIGHT vocoder, and the system achieves an average naturalness MOS of 4.13 in VCC 2018, the highest among all submitted systems.
Journal Article (DOI)

Speech analysis for health: Current state-of-the-art and the increasing impact of deep learning

TL;DR: Current state-of-the-art approaches to speech-based health detection are reviewed, with a particular focus on the impact of deep learning within this domain.
References
Journal Article (DOI)

Generative Adversarial Nets

TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
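The adversarial process summarized above is usually written as the following two-player minimax game, in standard notation with data distribution $p_{\text{data}}$ and prior noise distribution $p_z$:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$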
Journal Article (DOI)

Reducing the Dimensionality of Data with Neural Networks

TL;DR: In this article, an effective way of initializing weights is described that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool for reducing the dimensionality of data.
Proceedings Article (DOI)

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

TL;DR: In this article, the authors propose LIME, a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem.
Proceedings Article

Algorithms for Non-negative Matrix Factorization

TL;DR: Two different multiplicative algorithms for non-negative matrix factorization are analyzed; one can be shown to minimize the conventional least squares error, while the other minimizes the generalized Kullback-Leibler divergence.
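As a concrete illustration, here is the least-squares variant of these multiplicative updates in a minimal NumPy sketch; the random initialization and fixed iteration count are simplifying assumptions in place of proper convergence checks.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-10):
    """Lee-Seung multiplicative updates minimizing ||V - W H||_F^2."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        # Update H: H <- H * (W^T V) / (W^T W H)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # Update W: W <- W * (V H^T) / (W H H^T)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because both updates are elementwise multiplications by non-negative ratios, the factors stay non-negative throughout without any explicit projection step.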
Proceedings Article

Deep Sparse Rectifier Neural Networks

TL;DR: This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability.
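For reference, the rectifier activation in question is simply

$$f(x) = \max(0, x),$$

whose hard kink at zero is the non-linearity and non-differentiability the summary refers to.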