End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
Zhong-Qiu Wang, Jonathan Le Roux, DeLiang Wang, John R. Hershey
- pp 2708-2712
TLDR
In this paper, the authors proposed an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network.
Abstract:
This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network. Previous approaches, rather than computing a loss on the reconstructed signal, used a surrogate loss based on the target STFT magnitudes. This ignores reconstruction error introduced by phase inconsistency. In our approach, the loss function is directly defined on the reconstructed signals, which are optimized for best separation. In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers. While mask values are typically limited to lie between zero and one for approaches using the mixture phase for reconstruction, this limitation is less relevant if the estimated magnitudes are to be used together with phase reconstruction. We thus propose several novel activation functions for the output layer of the T-F masking, to allow mask values beyond one. On the publicly available wsj0-2mix dataset, our approach achieves state-of-the-art 12.6 dB scale-invariant signal-to-distortion ratio (SI-SDR) and 13.1 dB SDR, revealing new possibilities for deep-learning-based phase reconstruction and representing fundamental progress towards solving the notoriously hard cocktail party problem.
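The pipeline the abstract describes (estimated magnitudes, a fixed number of STFT and inverse STFT passes, and a signal-level metric) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the abstract does not name the phase reconstruction algorithm, so the update below assumes a MISI-style rule (redistribute the mixture residual evenly across sources, then re-estimate phase). The window size, hop size, and the helper names `stft`, `istft`, `unfolded_phase_reconstruction`, and `si_sdr` are all hypothetical. In a trainable version each call would be a differentiable layer; here everything is plain NumPy.

```python
import numpy as np

# Assumed analysis settings (not taken from the paper).
WIN, HOP = 256, 128
# Square-root periodic Hann window: with 50% overlap, the product of the
# analysis and synthesis windows sums to one, so istft(stft(x)) == x.
WINDOW = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(WIN) / WIN))


def stft(x):
    """Real signal of shape (T,) -> complex spectrogram (frames, WIN // 2 + 1)."""
    # Pad one hop on each side (plus enough to reach a multiple of HOP)
    # so every original sample is covered by exactly two frames.
    padded = np.pad(x, (HOP, HOP + (-len(x)) % HOP))
    n_frames = (len(padded) - WIN) // HOP + 1
    frames = np.stack([padded[i * HOP:i * HOP + WIN] for i in range(n_frames)])
    return np.fft.rfft(frames * WINDOW, axis=1)


def istft(spec, length):
    """Overlap-add inverse of stft(); returns the first `length` samples."""
    frames = np.fft.irfft(spec, n=WIN, axis=1) * WINDOW
    out = np.zeros(HOP * (len(frames) - 1) + WIN)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + WIN] += frame
    return out[HOP:HOP + length]


def unfolded_phase_reconstruction(mixture, magnitudes, n_iters=3):
    """Run a fixed number of MISI-style iterations (an assumed update rule).

    magnitudes: list of estimated magnitude spectrograms, one per source.
    Returns an array of shape (n_sources, T) with time-domain estimates.
    """
    n_src, length = len(magnitudes), len(mixture)
    # Start every source from the mixture phase.
    phases = [np.angle(stft(mixture))] * n_src
    for _ in range(n_iters):
        estimates = [istft(m * np.exp(1j * p), length)
                     for m, p in zip(magnitudes, phases)]
        # Spread the part of the mixture not yet explained across sources.
        residual = mixture - sum(estimates)
        phases = [np.angle(stft(e + residual / n_src)) for e in estimates]
    return np.stack([istft(m * np.exp(1j * p), length)
                     for m, p in zip(magnitudes, phases)])


def si_sdr(estimate, target):
    """Scale-invariant signal-to-distortion ratio in dB."""
    scale = np.dot(estimate, target) / np.dot(target, target)
    projection = scale * target
    noise = estimate - projection
    return 10 * np.log10(np.sum(projection ** 2) / np.sum(noise ** 2))
```

In an end-to-end setup, the `magnitudes` argument would be the network's mask output multiplied by the mixture magnitude, and the training loss would be computed on the returned waveforms rather than on the magnitudes. Because the magnitudes are combined with a re-estimated phase rather than the mixture phase, nothing in this sketch requires mask values to stay below one.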
Citations
Journal Article
Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
Yi Luo, Nima Mesgarani
TL;DR: This paper proposes a fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
Proceedings Article
SDR – Half-baked or Well Done?
TL;DR: This paper proposes the scale-invariant signal-to-distortion ratio (SI-SDR) as a more robust measure for single-channel separation than the SDR implemented in the BSS_eval toolkit.
Proceedings Article
Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation
Yi Luo, Zhuo Chen, Takuya Yoshioka
TL;DR: This paper proposes a dual-path recurrent neural network (DPRNN) for modeling extremely long sequences. Conventional RNNs are not effective for modeling such long sequences due to optimization difficulties, while one-dimensional CNNs cannot perform utterance-level sequence modeling when their receptive field is smaller than the sequence length.
Posted Content
Algorithm Unrolling: Interpretable, Efficient Deep Learning for Signal and Image Processing
TL;DR: The increasing popularity of unrolled deep networks is due, in part, to their potential in developing efficient, high-performance (yet interpretable) network architectures from reasonably sized training sets.
Journal Article
Algorithm Unrolling: Interpretable, Efficient Deep Learning for Signal and Image Processing
TL;DR: In this paper, an emerging technique called algorithm unrolling, or unfolding, offers promise in eliminating these issues by providing a concrete and systematic connection between iterative algorithms that are widely used in signal processing and deep neural networks.
References
Journal Article
Performance measurement in blind audio source separation
TL;DR: This paper considers four different sets of allowed distortions in blind audio source separation algorithms, from time-invariant gains to time-varying filters, and derives a global performance measure using an energy ratio, plus a separate performance measure for each error term.
Journal Article
On training targets for supervised speech separation
TL;DR: Results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics, and that masking based targets, in general, are significantly better than spectral envelope based targets.
Journal Article
Supervised Speech Separation Based on Deep Learning: An Overview
DeLiang Wang, Jitong Chen
TL;DR: A comprehensive overview of deep learning-based supervised speech separation can be found in this paper, where three main components of supervised separation are discussed: learning machines, training targets, and acoustic features.
Proceedings Article
SEGAN: Speech Enhancement Generative Adversarial Network
TL;DR: This work proposes the use of generative adversarial networks for speech enhancement. The model operates at the waveform level, is trained end-to-end, and incorporates 28 speakers and 40 different noise conditions, such that model parameters are shared across them.
Journal Article
Computational Auditory Scene Analysis: Principles, Algorithms, and Applications
DeLiang Wang, Guy J. Brown
TL;DR: This paper focuses on the development of model-based speech segregation in CASA systems, which was first introduced in 2000 and has since been upgraded to a full-blown model-based system.