Proceedings ArticleDOI

CBLDNN-Based Speaker-Independent Speech Separation Via Generative Adversarial Training

TLDR
The experimental results show that the proposed CBLDNN-GAT model achieves an 11.0 dB signal-to-distortion ratio (SDR) improvement, a new state-of-the-art result.
Abstract
In this paper, we propose a speaker-independent multi-speaker monaural speech separation system (CBLDNN-GAT) based on a convolutional, bidirectional long short-term memory, deep feedforward neural network (CBLDNN) with generative adversarial training (GAT). Our system aims at obtaining better speech quality rather than only minimizing a mean squared error (MSE). In the initial phase, we use log-mel filterbank and pitch features to warm up the CBLDNN in a multi-task manner, so that information that contributes to separating speech and improving speech quality is integrated into the model. We apply GAT throughout training, which pushes the separated speech toward being indistinguishable from real speech. We evaluate CBLDNN-GAT on the WSJ0-2mix dataset. The experimental results show that the proposed model achieves an 11.0 dB signal-to-distortion ratio (SDR) improvement, a new state-of-the-art result.
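To make the training scheme in the abstract concrete, here is a minimal PyTorch-style sketch of generative adversarial training for a mask-based separator. It is an illustration under stated assumptions, not the authors' implementation: the `Separator` and `Discriminator` modules, feature dimensions, and losses are hypothetical stand-ins, and the paper's log-mel/pitch multi-task warm-up stage is omitted.

```python
# Minimal sketch (not the authors' code): adversarial training of a
# speech separator. Shapes, modules, and hyperparameters are illustrative.
import torch
import torch.nn as nn

class Separator(nn.Module):
    """Hypothetical stand-in for the CBLDNN generator: conv -> BLSTM -> feedforward masks."""
    def __init__(self, n_feats=80, hidden=256, n_spk=2):
        super().__init__()
        self.conv = nn.Conv1d(n_feats, hidden, kernel_size=3, padding=1)
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.mask = nn.Linear(2 * hidden, n_feats * n_spk)
        self.n_spk, self.n_feats = n_spk, n_feats

    def forward(self, mix):                          # mix: (B, T, F) magnitude features
        h = torch.relu(self.conv(mix.transpose(1, 2))).transpose(1, 2)
        h, _ = self.blstm(h)                         # (B, T, 2*hidden)
        m = torch.sigmoid(self.mask(h))              # (B, T, F * n_spk)
        m = m.view(mix.size(0), -1, self.n_spk, self.n_feats)
        return m * mix.unsqueeze(2)                  # (B, T, n_spk, F) per-speaker estimates

class Discriminator(nn.Module):
    """Scores whether a set of per-speaker spectra looks like real clean speech."""
    def __init__(self, n_feats=80, n_spk=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_spk * n_feats, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

    def forward(self, x):                            # x: (B, T, n_spk, F)
        return self.net(x.flatten(2)).mean(dim=1)    # one logit per utterance

G, D = Separator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(mix, refs):                           # refs: (B, T, n_spk, F) clean sources
    est = G(mix)
    # Discriminator step: real references vs. detached separator outputs.
    real, fake = torch.ones(refs.size(0), 1), torch.zeros(refs.size(0), 1)
    d_loss = bce(D(refs), real) + bce(D(est.detach()), fake)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: fool D, plus a reconstruction term (plain MSE for brevity).
    g_loss = bce(D(est), real) + nn.functional.mse_loss(est, refs)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

In practice a permutation-invariant criterion would replace the plain MSE term, so that the ordering of estimated speakers against the references does not matter.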


Citations
Journal ArticleDOI

Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

TL;DR: A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures.
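As a rough sketch of the pattern this summary describes (not the published Conv-TasNet code), a time-domain separator replaces the STFT with a learned 1-D convolutional encoder, estimates per-speaker masks on that representation, and resynthesizes waveforms with a transposed convolution; the deep temporal convolutional mask network is reduced here to a single hypothetical layer:

```python
# Illustrative encoder/mask/decoder skeleton in the Conv-TasNet style.
import torch
import torch.nn as nn

class TimeDomainSeparator(nn.Module):
    def __init__(self, n_filters=512, win=16, n_spk=2):
        super().__init__()
        stride = win // 2
        self.encoder = nn.Conv1d(1, n_filters, win, stride=stride, bias=False)
        self.mask_net = nn.Conv1d(n_filters, n_filters * n_spk, 1)  # stand-in for the TCN
        self.decoder = nn.ConvTranspose1d(n_filters, 1, win, stride=stride, bias=False)
        self.n_spk = n_spk

    def forward(self, wav):                        # wav: (B, 1, T) raw waveform
        rep = torch.relu(self.encoder(wav))        # (B, N, L) learned representation
        masks = torch.sigmoid(self.mask_net(rep))  # (B, N * n_spk, L)
        masks = masks.view(wav.size(0), self.n_spk, -1, masks.size(-1))
        srcs = [self.decoder(masks[:, s] * rep) for s in range(self.n_spk)]
        return torch.stack(srcs, dim=1)            # (B, n_spk, 1, T') separated waveforms
```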
Posted Content

Wavesplit: End-to-End Speech Separation by Speaker Clustering

TL;DR: Wavesplit redefines the state-of-the-art on clean mixtures of 2 or 3 speakers, as well as in noisy and reverberated settings, and set a new benchmark on the recent LibriMix dataset.
Journal ArticleDOI

Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation

TL;DR: In this article, the authors decompose the multi-speaker separation task into the stages of simultaneous grouping and sequential grouping, which achieves state-of-the-art results with a modest model size.
Proceedings ArticleDOI

Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation

TL;DR: A dual-path transformer network (DPTNet) for end-to-end speech separation, which introduces direct context-awareness into the modeling of speech sequences by means of an improved transformer.
Posted Content

Voice Separation with an Unknown Number of Multiple Speakers

TL;DR: A new method is presented for separating a mixed audio sequence in which multiple voices speak simultaneously; it greatly outperforms the current state of the art, which, as shown, is not competitive for more than two speakers.
References
Journal ArticleDOI

Generative Adversarial Nets

TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than from G.
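For reference, the adversarial process summarized here is the two-player minimax game from the cited paper, with generator G, discriminator D, data distribution p_data, and noise prior p_z:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```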
Posted Content

Image-to-Image Translation with Conditional Adversarial Networks

TL;DR: Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems; they can be used to synthesize photos from label maps, reconstruct objects from edge maps, and colorize images, among other tasks.
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: Describes the design of Kaldi, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Proceedings ArticleDOI

A unified architecture for natural language processing: deep neural networks with multitask learning

TL;DR: This work describes a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense using a language model.