Proceedings ArticleDOI

CBLDNN-Based Speaker-Independent Speech Separation Via Generative Adversarial Training

TLDR
The experimental results show that the proposed CBLDNN-GAT model achieves an 11.0 dB signal-to-distortion ratio (SDR) improvement, a new state-of-the-art result.
Abstract
In this paper, we propose a speaker-independent multi-speaker monaural speech separation system (CBLDNN-GAT) based on a convolutional, bidirectional long short-term memory, deep feedforward neural network (CBLDNN) with generative adversarial training (GAT). Our system aims at obtaining better speech quality rather than only minimizing a mean squared error (MSE). In the initial phase, we use log-mel filterbank and pitch features to warm up the CBLDNN in a multi-task manner, so that information that contributes to separating speech and improving speech quality is integrated into the model. We apply GAT throughout training, which pushes the separated speech toward being indistinguishable from real speech. We evaluate CBLDNN-GAT on the WSJ0-2mix dataset. The experimental results show that the proposed model achieves an 11.0 dB signal-to-distortion ratio (SDR) improvement, a new state-of-the-art result.
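To make the training scheme in the abstract concrete, here is a minimal PyTorch-style sketch of generative adversarial training for a mask-based separator. It is an illustration under stated assumptions, not the authors' implementation: the `Separator` and `Discriminator` modules, feature dimensions, and losses are hypothetical stand-ins, and the paper's log-mel/pitch multi-task warm-up stage is omitted.

```python
# Minimal sketch (not the authors' code): adversarial training of a
# speech separator. Shapes, modules, and hyperparameters are illustrative.
import torch
import torch.nn as nn

class Separator(nn.Module):
    """Hypothetical stand-in for the CBLDNN generator: conv -> BLSTM -> feedforward masks."""
    def __init__(self, n_feats=80, hidden=256, n_spk=2):
        super().__init__()
        self.conv = nn.Conv1d(n_feats, hidden, kernel_size=3, padding=1)
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.mask = nn.Linear(2 * hidden, n_feats * n_spk)
        self.n_spk, self.n_feats = n_spk, n_feats

    def forward(self, mix):                          # mix: (B, T, F) magnitude features
        h = torch.relu(self.conv(mix.transpose(1, 2))).transpose(1, 2)
        h, _ = self.blstm(h)                         # (B, T, 2*hidden)
        m = torch.sigmoid(self.mask(h))              # (B, T, F * n_spk)
        m = m.view(mix.size(0), -1, self.n_spk, self.n_feats)
        return m * mix.unsqueeze(2)                  # (B, T, n_spk, F) per-speaker estimates

class Discriminator(nn.Module):
    """Scores whether a set of per-speaker spectra looks like real clean speech."""
    def __init__(self, n_feats=80, n_spk=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_spk * n_feats, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

    def forward(self, x):                            # x: (B, T, n_spk, F)
        return self.net(x.flatten(2)).mean(dim=1)    # one logit per utterance

G, D = Separator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(mix, refs):                           # refs: (B, T, n_spk, F) clean sources
    est = G(mix)
    # Discriminator step: real references vs. detached separator outputs.
    real, fake = torch.ones(refs.size(0), 1), torch.zeros(refs.size(0), 1)
    d_loss = bce(D(refs), real) + bce(D(est.detach()), fake)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: fool D, plus a reconstruction term (plain MSE for brevity).
    g_loss = bce(D(est), real) + nn.functional.mse_loss(est, refs)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

In practice a permutation-invariant criterion would replace the plain MSE term, so that the ordering of estimated speakers against the references does not matter.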


Citations
Journal ArticleDOI

Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

TL;DR: A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures.
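As a rough sketch of the pattern this summary describes (not the published Conv-TasNet code), a time-domain separator replaces the STFT with a learned 1-D convolutional encoder, estimates per-speaker masks on that representation, and resynthesizes waveforms with a transposed convolution; the deep temporal convolutional mask network is reduced here to a single hypothetical layer:

```python
# Illustrative encoder/mask/decoder skeleton in the Conv-TasNet style.
import torch
import torch.nn as nn

class TimeDomainSeparator(nn.Module):
    def __init__(self, n_filters=512, win=16, n_spk=2):
        super().__init__()
        stride = win // 2
        self.encoder = nn.Conv1d(1, n_filters, win, stride=stride, bias=False)
        self.mask_net = nn.Conv1d(n_filters, n_filters * n_spk, 1)  # stand-in for the TCN
        self.decoder = nn.ConvTranspose1d(n_filters, 1, win, stride=stride, bias=False)
        self.n_spk = n_spk

    def forward(self, wav):                        # wav: (B, 1, T) raw waveform
        rep = torch.relu(self.encoder(wav))        # (B, N, L) learned representation
        masks = torch.sigmoid(self.mask_net(rep))  # (B, N * n_spk, L)
        masks = masks.view(wav.size(0), self.n_spk, -1, masks.size(-1))
        srcs = [self.decoder(masks[:, s] * rep) for s in range(self.n_spk)]
        return torch.stack(srcs, dim=1)            # (B, n_spk, 1, T') separated waveforms
```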
Posted Content

Wavesplit: End-to-End Speech Separation by Speaker Clustering

TL;DR: Wavesplit redefines the state-of-the-art on clean mixtures of 2 or 3 speakers, as well as in noisy and reverberated settings, and set a new benchmark on the recent LibriMix dataset.
Journal ArticleDOI

Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation

TL;DR: In this article, the authors decompose the multi-speaker separation task into the stages of simultaneous grouping and sequential grouping, which achieves state-of-the-art results with a modest model size.
Proceedings ArticleDOI

Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation

TL;DR: A dual-path transformer network (DPTNet) for end-to-end speech separation, which introduces direct context-awareness into the modeling of speech sequences by means of an improved transformer.
Posted Content

Voice Separation with an Unknown Number of Multiple Speakers

TL;DR: A new method is presented for separating a mixed audio sequence in which multiple voices speak simultaneously; it greatly outperforms the current state of the art, which, as shown, is not competitive for more than two speakers.
References
Journal ArticleDOI

Generative Adversarial Nets

TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than from G.
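For reference, the adversarial process summarized here is the two-player minimax game from the cited paper, with generator G, discriminator D, data distribution p_data, and noise prior p_z:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```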
Posted Content

Image-to-Image Translation with Conditional Adversarial Networks

TL;DR: Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems; they can be used to synthesize photos from label maps, reconstruct objects from edge maps, and colorize images, among other tasks.
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: Describes the design of Kaldi, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Proceedings ArticleDOI

A unified architecture for natural language processing: deep neural networks with multitask learning

TL;DR: This work describes a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense using a language model.