Time-Frequency Masking in the Complex Domain for Speech Dereverberation and Denoising
TLDR
This paper performs dereverberation and denoising using supervised learning with a deep neural network, and defines the complex ideal ratio mask so that direct speech results after the mask is applied to reverberant and noisy speech.
Abstract
In real-world situations, speech is masked by both background noise and reverberation, which negatively affect perceptual quality and intelligibility. In this paper, we address monaural speech separation in reverberant and noisy environments. We perform dereverberation and denoising using supervised learning with a deep neural network. Specifically, we enhance the magnitude and phase by performing separation with an estimate of the complex ideal ratio mask. We define the complex ideal ratio mask so that direct speech results after the mask is applied to reverberant and noisy speech. Our approach is evaluated using simulated and real room impulse responses, and with background noises. The proposed approach significantly improves objective speech quality and intelligibility. Evaluations and comparisons show that it outperforms related methods in many reverberant and noisy environments.
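To illustrate the idea in the abstract: the complex ideal ratio mask (cIRM) is defined in the STFT domain so that element-wise complex multiplication of the mask with the reverberant-noisy spectrogram yields the direct-speech spectrogram, enhancing magnitude and phase jointly. The following is a minimal NumPy sketch of that definition, not the paper's implementation; the function names and the small stabilizing constant `eps` are our own additions.

```python
import numpy as np

def complex_ideal_ratio_mask(Y, S, eps=1e-8):
    """Compute the cIRM M such that S ≈ M * Y (element-wise complex product).

    Y : complex STFT of reverberant-noisy speech
    S : complex STFT of the direct (target) speech
    eps : small constant (our addition) to avoid division by zero
    """
    # Complex division S / Y written out in real and imaginary parts.
    denom = Y.real**2 + Y.imag**2 + eps
    m_real = (Y.real * S.real + Y.imag * S.imag) / denom
    m_imag = (Y.real * S.imag - Y.imag * S.real) / denom
    return m_real + 1j * m_imag

def apply_mask(M, Y):
    # Enhancement step: element-wise complex multiplication recovers
    # both the magnitude and the phase of the target speech.
    return M * Y
```

In the supervised setting described by the paper, a deep neural network is trained to estimate this mask from features of the noisy-reverberant input; at test time the estimated mask is applied to the mixture STFT and the result is inverted back to a waveform.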
Citations
Journal ArticleDOI
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Zixing Zhang, Jürgen T. Geiger, Jouni Pohjalainen, Amr El-Desoky Mousa, Wenyu Jin, Björn Schuller +5 more
TL;DR: A review of recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems.
Journal ArticleDOI
Speaker-Independent Speech Separation With Deep Attractor Network
Yi Luo, Zhuo Chen, Nima Mesgarani +2 more
TL;DR: In this article, a neural network is used to project the time-frequency representation of the mixture signal into a high-dimensional embedding space and a reference point (attractor) is created to represent each speaker.
Proceedings ArticleDOI
Speech Denoising with Deep Feature Losses
TL;DR: In this article, a fully-convolutional context aggregation network using a deep feature loss is proposed to denoise speech signals by processing the raw waveform directly, which achieves state-of-the-art performance in objective speech quality metrics and in large-scale perceptual experiments with human listeners.
Proceedings ArticleDOI
End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
TL;DR: In this paper, the authors proposed an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network.
References
Proceedings ArticleDOI
Deep Recurrent De-Noising Auto-Encoder and blind de-reverberation for reverberated speech recognition
TL;DR: Results on the 2014 REVERB Challenge development set indicate that the DAE front-end provides complementary performance gains to multi-condition training, feature transformations, and model adaptation.
Proceedings ArticleDOI
Robust speech dereverberation based on non-negativity and sparse nature of speech spectrograms
TL;DR: This paper presents a blind dereverberation method designed to recover the subband envelope of an original speech signal from its reverberant version, formulated as a blind deconvolution problem with non-negative constraints, regularized by the sparse nature of speech spectrograms.
Proceedings ArticleDOI
A Supervised Learning Approach to Monaural Segregation of Reverberant Speech
Zhaozhang Jin, DeLiang Wang +1 more
TL;DR: A supervised learning approach to monaural segregation of reverberant voiced speech is proposed, which learns to map from a set of pitch-based auditory features to a grouping cue encoding the posterior probability of a time-frequency (T-F) unit being target dominant given observed features.
Journal ArticleDOI
Dynamic Precedence Effect Modeling for Source Separation in Reverberant Environments
TL;DR: This study tests a previously proposed binaural separation/precedence model in real rooms with a range of reverberant conditions and concludes that adaptation is necessary and can yield significant gains in separation performance.
Journal ArticleDOI
Pitch-based monaural segregation of reverberant speech
Nicoleta Roman, DeLiang Wang +1 more
TL;DR: This work proposes a two-stage monaural separation system that combines the inverse filtering of the room impulse response corresponding to target location and a pitch-based speech segregation method, and shows that the proposed system results in considerable signal-to-noise ratio gains across different conditions.