Journal Article

Audio Enhancement and Synthesis using Generative Adversarial Networks: A Survey

17 Jan 2019 - International Journal of Computer Applications (Foundation of Computer Science (FCS), NY, USA) - Vol. 182, Iss. 35, pp. 27-31
TL;DR: This survey explores GAN-based techniques for speech synthesis, speech enhancement, music generation, and general audio synthesis, and examines the strengths and weaknesses of GANs along with variants created to address those weaknesses.
Abstract: Generative adversarial networks (GAN) have become prominent in the field of machine learning. Their premise is based on a minimax game in which a generator and discriminator “compete” against each other until an optimal point is reached. The goal of the generator is to produce synthetic samples that match that of real data. The discriminator tries to classify the real data as real and the generated data as not real. Together, the generator improves to the point where the fake data and real data are identical to the discriminator. GAN has been successfully applied in the image processing field over a large range of GAN variant architectures. Although not as prominent, the audio enhancement and synthesis field has also benefitted from GAN in a variety of different forms. In this survey paper, different techniques involving GAN will be explored relative to speech synthesis, speech enhancement, music generation, and general audio synthesis. Strengths and weaknesses of GAN will be looked at including variants created to combat those weaknesses. Also, a few similar machine learning architectures will be explored that may help achieve promising results.

Citations
Posted Content
TL;DR: This paper reviews various GAN methods from the perspectives of algorithms, theory, and applications, and compares the commonalities and differences of these methods.
Abstract: Generative adversarial networks (GANs) have recently become a hot research topic. GANs have been widely studied since 2014, and a large number of algorithms have been proposed. However, there are few comprehensive studies explaining the connections among different GAN variants and how they have evolved. In this paper, we attempt to provide a review of various GAN methods from the perspectives of algorithms, theory, and applications. Firstly, the motivations, mathematical representations, and structure of most GAN algorithms are introduced in detail. Furthermore, GANs have been combined with other machine learning algorithms for specific applications, such as semi-supervised learning, transfer learning, and reinforcement learning. This paper compares the commonalities and differences of these GAN methods. Secondly, theoretical issues related to GANs are investigated. Thirdly, typical applications of GANs in image processing and computer vision, natural language processing, music, speech and audio, the medical field, and data science are illustrated. Finally, the future open research problems for GANs are pointed out.

344 citations


Cites background from "Audio Enhancement and Synthesis usi..."

  • ...1) GANs for specific applications: There are surveys of using GANs for specific applications such as image synthesis and editing [5], audio enhancement and synthesis [6]....

    [...]

Journal Article
TL;DR: This paper reviews the various GAN methods from the perspectives of algorithms, theory, and applications, introduces the motivations, mathematical representations, and structures of most GAN algorithms in detail, and compares their commonalities and differences.
Abstract: Generative adversarial networks (GANs) have recently become a hot research topic; however, they have been studied since 2014, and a large number of algorithms have been proposed. Nevertheless, few comprehensive studies explain the connections among different GAN variants and how they have evolved. In this paper, we attempt to provide a review of the various GAN methods from the perspectives of algorithms, theory, and applications. First, the motivations, mathematical representations, and structures of most GAN algorithms are introduced in detail, and we compare their commonalities and differences. Second, theoretical issues related to GANs are investigated. Finally, typical applications of GANs in image processing and computer vision, natural language processing, music, speech and audio, the medical field, and data science are discussed.

77 citations

Posted Content
TL;DR: This is among the first survey papers to review state-of-the-art video GAN models; it also summarizes the main improvements in GAN frameworks that were not initially developed for the video domain but have been adopted in multiple video GAN variations.
Abstract: With the increasing interest in the content creation field in multiple sectors such as media, education, and entertainment, there is an increasing trend in papers that use AI algorithms to generate content such as images, videos, audio, and text. Generative Adversarial Networks (GANs) are among the promising models that synthesize data samples similar to real data samples. While the variations of GAN models, in general, have been covered to some extent in several survey papers, to the best of our knowledge, this is among the first survey papers that review the state-of-the-art video GAN models. This paper first categorizes GAN review papers into general GAN review papers, image GAN review papers, and special field GAN review papers such as anomaly detection, medical imaging, or cybersecurity. The paper then summarizes the main improvements in GAN frameworks that were not initially developed for the video domain but have been adopted in multiple video GAN variations. Then, a comprehensive review of video GAN models is provided under two main divisions according to the presence or absence of a condition. The conditional models are then further grouped according to the type of condition into audio, text, video, and image. The paper concludes by highlighting the main challenges and limitations of the current video GAN models. A comprehensive list of datasets, applied loss functions, and evaluation metrics is provided in the supplementary material.

20 citations


Additional excerpts

  • ...The field of synthesizing and enhancing audio using GANs architectures has also been reviewed [37]....

    [...]

Posted Content
TL;DR: A comprehensive survey of recent advances in human emotion synthesis by studying available databases, advantages, and disadvantages of the generative models along with the related training strategies considering two principal human communication modalities, namely audio and video.
Abstract: Synthesizing realistic data samples is of great value for both academic and industrial communities. Deep generative models have become an emerging topic in various research areas like computer vision and signal processing. Affective computing, a topic of a broad interest in computer vision society, has been no exception and has benefited from generative models. In fact, affective computing observed a rapid derivation of generative models during the last two decades. Applications of such models include but are not limited to emotion recognition and classification, unimodal emotion synthesis, and cross-modal emotion synthesis. As a result, we conducted a review of recent advances in human emotion synthesis by studying available databases, advantages, and disadvantages of the generative models along with the related training strategies considering two principal human communication modalities, namely audio and video. In this context, facial expression synthesis, speech emotion synthesis, and the audio-visual (cross-modal) emotion synthesis is reviewed extensively under different application scenarios. Gradually, we discuss open research problems to push the boundaries of this research area for future works.

10 citations


Additional excerpts

  • ...enhancement and synthesis [15], image synthesis [16], and text synthesis [17]....

    [...]

Journal Article
TL;DR: The authors propose a method, named MC-GAN, that reconstructs complex hydrological structures by using deep convolutional generative adversarial networks (DCGAN) in the Monte-Carlo simulation process.
Abstract: Characterization of complex subsurface structures is challenging due to the demand to preserve geological realism of the training images in earth and environmental sciences. In this work, we propose a novel method to reconstruct complex hydrological structures by using deep convolutional generative adversarial networks (DCGAN) in the Monte-Carlo simulation process, named MC-GAN. Network architectures for reconstructing both two-dimensional (2D) and three-dimensional (3D) complex spatial structures are provided in this method. We first exploit the robust DCGAN to reproduce abundant and various spatial pattern blocks. Then, we combine the various heterogeneous patterns to reconstruct a complex hydrological structure by using the Monte-Carlo stochastic simulation process. The method is able to represent multiple-scale spatial structures under the premise of using the same generative adversarial network architecture. It not only ensures the simulation efficiency, but also makes the heterogeneous patterns in the realizations more diverse. Three sets of training images were used to test the capability of the proposed method. The experiment results demonstrate that our method can accurately characterize complex heterogeneous spatial structures. At the same time, the trained deep learning model can be reused effectively to generate multiple-scale spatial structures.

10 citations

References
Journal Article
08 Dec 2014
TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are trained simultaneously: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Abstract: We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to ½ everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
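The minimax game described in this abstract can be illustrated with a short numerical sketch. It assumes the standard value function from the original GAN formulation, V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], which the abstract itself does not spell out. The function name `gan_value` and the sample discriminator outputs are hypothetical and chosen only for illustration.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    (Hypothetical helper; the inputs stand in for a trained discriminator.)
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At the solution described in the abstract, D equals 1/2 everywhere,
# so the value is log(1/2) + log(1/2) = -log(4).
half = np.full(1000, 0.5)
v_equilibrium = gan_value(half, half)

# A discriminator that still separates real from fake scores higher,
# which is what D maximizes and G tries to undo.
v_confident = gan_value([0.9], [0.1])
```

Here D tries to push the value up while G tries to push it down; when the generated distribution matches the data, the best D can do is output 1/2, giving a value of -log 4.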

38,211 citations

Posted Content
TL;DR: The conditional version of generative adversarial nets is introduced, which can be constructed by simply feeding the data to condition on, y, to both the generator and the discriminator, and it is shown that this model can generate MNIST digits conditioned on class labels.
Abstract: Generative Adversarial Nets [8] were recently introduced as a novel way to train generative models. In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data, y, we wish to condition on to both the generator and discriminator. We show that this model can generate MNIST digits conditioned on class labels. We also illustrate how this model could be used to learn a multi-modal model, and provide preliminary examples of an application to image tagging in which we demonstrate how this approach can generate descriptive tags which are not part of training labels.
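One simple way to feed y to both networks is to concatenate a one-hot encoding of the label onto each network's input. The sketch below shows only that input construction. It is an assumption for illustration, not necessarily the exact wiring used in the paper, and the noise size of 100, the flattened 784-pixel image, and the helper names are all hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot row vectors."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.shape[0], num_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

def condition(inputs, labels, num_classes):
    """Append the one-hot condition y to each input row (hypothetical helper)."""
    return np.concatenate([inputs, one_hot(labels, num_classes)], axis=1)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 100))   # generator noise (size is an assumption)
x = rng.standard_normal((4, 784))   # stand-in for flattened 28x28 images
y = np.array([3, 1, 4, 1])          # class labels to condition on

generator_input = condition(z, y, num_classes=10)      # shape (4, 110)
discriminator_input = condition(x, y, num_classes=10)  # shape (4, 794)
```

Because the discriminator also sees y, it can reject a realistic sample that does not match its label, which is what pushes the generator to respect the condition.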

7,987 citations


"Audio Enhancement and Synthesis usi..." refers background or methods in this paper

  • ...Further work may require combining the best properties of various GAN architectures [5][8][10][15][9] to improve existing structures....

    [...]

  • ...Paper [3] has attempted to address the issue by combining cGAN (conditional GAN) [15] and SPSS in a multi-task learning framework....

    [...]

Posted Content
TL;DR: This work proposes an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input, which performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning.
Abstract: Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only low-quality samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models over discrete data. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms.
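The idea of penalizing the norm of the critic's gradient with respect to its input can be sketched on a toy critic whose gradient is known in closed form. The sketch assumes the commonly cited form of the penalty, lambda * E[(||grad D(x_hat)||_2 - 1)^2] evaluated at random interpolates x_hat between real and generated samples, with lambda = 10. The tanh critic, its weights, and the helper names are hypothetical; a real implementation would use automatic differentiation.

```python
import numpy as np

def critic_input_grad(x, w):
    """Gradient of the toy critic D(x) = tanh(x . w) with respect to x.

    d/dx tanh(x . w) = (1 - tanh(x . w)^2) * w, computed per row of x.
    """
    scale = 1.0 - np.tanh(x @ w) ** 2       # shape (batch,)
    return scale[:, None] * w[None, :]      # shape (batch, dim)

def gradient_penalty(real, fake, w, rng, lam=10.0):
    """Penalty lam * mean((||grad_x D(x_hat)||_2 - 1)^2) at random interpolates."""
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake  # points between real and fake
    grad_norm = np.linalg.norm(critic_input_grad(x_hat, w), axis=1)
    return lam * np.mean((grad_norm - 1.0) ** 2)

rng = np.random.default_rng(0)
real = rng.standard_normal((8, 5))
fake = rng.standard_normal((8, 5))
w = rng.standard_normal(5)

penalty = gradient_penalty(real, fake, w, rng)
# A critic with zero weights has zero gradient, so the penalty equals lam.
penalty_flat = gradient_penalty(real, fake, np.zeros(5), rng)
```

The penalty is added to the critic's loss, so gradient norms far from 1 are discouraged softly instead of being enforced by clipping the weights.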

4,133 citations


"Audio Enhancement and Synthesis usi..." refers background in this paper

  • ...Further work may require combining the best properties of various GAN architectures [5][8][10][15][9] to improve existing structures....

    [...]

  • ...[8] proposed that instead of a weight clipping, a gradient penalty can be used....

    [...]

Journal Article
TL;DR: Generative adversarial networks (GANs) provide a way to learn deep representations without extensively annotated training data by deriving backpropagation signals through a competitive process involving a pair of networks.
Abstract: Generative adversarial networks (GANs) provide a way to learn deep representations without extensively annotated training data. They achieve this by deriving backpropagation signals through a competitive process involving a pair of networks. The representations that can be learned by GANs may be used in a variety of applications, including image synthesis, semantic image editing, style transfer, image superresolution, and classification. The aim of this review article is to provide an overview of GANs for the signal processing community, drawing on familiar analogies and concepts where possible. In addition to identifying different methods for training and constructing GANs, we also point to remaining challenges in their theory and application.

1,413 citations

Proceedings Article
28 Mar 2017
TL;DR: This work proposes the use of generative adversarial networks for speech enhancement, operates at the waveform level, trains the model end-to-end, and incorporates 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them.
Abstract: Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level feature. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are being increasingly used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm the effectiveness of it. With that, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.

1,001 citations


"Audio Enhancement and Synthesis usi..." refers background or methods in this paper

  • ...The results show that SEGAN works well as an end-to-end method for speech enhancement....

    [...]

  • ...While many speech enhancement methods use spectrograms or SPSS methods, Speech Enhancement GAN (SEGAN) [17] operates on the waveform level....

    [...]

  • ...SEGAN can operate on raw audio and learn from different speaker and noise conditions....

    [...]

  • ..."SEGAN: Speech enhancement generative adversarial network." arXiv preprint arXiv:1703.09452 (2017)....

    [...]

  • ...One key feature is the use of skip connections in which low level details of the signal pass straight through to the decoder [17]....

    [...]
