Proceedings ArticleDOI
CBLDNN-Based Speaker-Independent Speech Separation Via Generative Adversarial Training
Chenxing Li,Lei Zhu,Shuang Xu,Peng Gao,Bo Xu +4 more
- pp 711-715
TLDR
The experimental results show that the proposed CBLDNN-GAT model achieves an 11.0 dB signal-to-distortion ratio (SDR) improvement, which is the new state-of-the-art result.
Abstract:
In this paper, we propose a speaker-independent multi-speaker monaural speech separation system (CBLDNN-GAT) based on a convolutional, bidirectional long short-term memory, deep feedforward neural network (CBLDNN) with generative adversarial training (GAT). Our system aims at obtaining better speech quality rather than only minimizing a mean square error (MSE). In the initial phase, we utilize log-mel filterbank and pitch features to warm up our CBLDNN in a multi-task manner. Thus, the information that contributes to separating speech and improving speech quality is integrated into the model. We execute GAT throughout the training, which makes the separated speech indistinguishable from the real one. We evaluate CBLDNN-GAT on the WSJ0-2mix dataset. The experimental results show that the proposed model achieves an 11.0 dB signal-to-distortion ratio (SDR) improvement, which is the new state-of-the-art result.
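To make the reported metric concrete, the following is a minimal sketch of how an SDR improvement can be computed. It assumes a simplified energy-ratio definition of SDR (target energy over residual energy), not the BSS Eval decomposition that separation papers typically report, so its numbers are not directly comparable to the 11.0 dB figure. The function names `sdr_db` and `sdr_improvement_db` and the synthetic signals are hypothetical and used only for illustration.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: reference energy divided by residual energy.

    Assumption: this is a plain energy ratio, not the BSS Eval SDR.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    residual = reference - estimate
    ratio = (np.sum(reference ** 2) + eps) / (np.sum(residual ** 2) + eps)
    return 10.0 * np.log10(ratio)


def sdr_improvement_db(reference, estimate, mixture):
    """SDR of the separated estimate minus SDR of the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Synthetic stand-ins for a target speaker and an interfering speaker.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)

mixture = target + interferer           # two-speaker mixture
estimate = target + 0.1 * interferer    # separation leaves 10% of the interferer

sdri = sdr_improvement_db(target, estimate, mixture)
print(f"SDR improvement: {sdri:.1f} dB")
```

Under this simplified definition, attenuating the interferer amplitude by a factor of 10 corresponds to an improvement of about 20 dB, whatever the starting SDR of the mixture.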
Citations
Journal ArticleDOI
Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
Yi Luo,Nima Mesgarani +1 more
TL;DR: A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
Posted Content
Wavesplit: End-to-End Speech Separation by Speaker Clustering
Neil Zeghidour,David Grangier +1 more
TL;DR: Wavesplit redefines the state-of-the-art on clean mixtures of 2 or 3 speakers, as well as in noisy and reverberated settings, and sets a new benchmark on the recent LibriMix dataset.
Journal ArticleDOI
Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation
Yuzhou Liu,DeLiang Wang +1 more
TL;DR: The multi-speaker separation task is decomposed into two stages, simultaneous grouping and sequential grouping; the resulting approach achieves state-of-the-art results with a modest model size.
Proceedings ArticleDOI
Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation.
TL;DR: A dual-path transformer network (DPTNet) for end-to-end speech separation, which introduces direct context-awareness into the modeling of speech sequences by introducing an improved transformer.
Posted Content
Voice Separation with an Unknown Number of Multiple Speakers
TL;DR: A new method is presented for separating a mixed audio sequence, in which multiple voices speak simultaneously, that greatly outperforms the current state of the art, which, as it is shown, is not competitive for more than two speakers.
References
Journal ArticleDOI
Generative Adversarial Nets
Ian Goodfellow,Jean Pouget-Abadie,Mehdi Mirza,Bing Xu,David Warde-Farley,Sherjil Ozair,Aaron Courville,Yoshua Bengio +7 more
TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Posted Content
Image-to-Image Translation with Conditional Adversarial Networks
TL;DR: Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems; the approach can synthesize photos from label maps, reconstruct objects from edge maps, and colorize images, among other tasks.
Posted Content
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Martín Abadi,Ashish Agarwal,Paul Barham,Eugene Brevdo,Zhifeng Chen,Craig Citro,Greg S. Corrado,Andy Davis,Jeffrey Dean,Matthieu Devin,Sanjay Ghemawat,Ian Goodfellow,Andrew Harp,Geoffrey Irving,Michael Isard,Yangqing Jia,Rafal Jozefowicz,Lukasz Kaiser,Manjunath Kudlur,Josh Levenberg,Dan Mané,Rajat Monga,Sherry Moore,Derek G. Murray,Chris Olah,Mike Schuster,Jonathon Shlens,Benoit Steiner,Ilya Sutskever,Kunal Talwar,Paul A. Tucker,Vincent Vanhoucke,Vijay K. Vasudevan,Fernanda B. Viégas,Oriol Vinyals,Pete Warden,Martin Wattenberg,Martin Wicke,Yuan Yu,Xiaoqiang Zheng +39 more
TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Proceedings Article
The Kaldi Speech Recognition Toolkit
Daniel Povey,Arnab Ghoshal,Gilles Boulianne,Lukas Burget,Ondrej Glembek,Nagendra Kumar Goel,Mirko Hannemann,Petr Motlicek,Yanmin Qian,Petr Schwarz,Jan Silovsky,Georg Stemmer,Karel Vesely +12 more
TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Proceedings ArticleDOI
A unified architecture for natural language processing: deep neural networks with multitask learning
Ronan Collobert,Jason Weston +1 more
TL;DR: This work describes a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense using a language model.
Related Papers (5)
Deep attractor network for single-microphone speaker separation
Zhuo Chen,Yi Luo,Nima Mesgarani +2 more