Open Access · Journal Article · DOI

Time-Frequency Masking in the Complex Domain for Speech Dereverberation and Denoising

TL;DR
This paper performs dereverberation and denoising using supervised learning with a deep neural network and defines the complex ideal ratio mask so that direct speech results after the mask is applied to reverberant and noisy speech.
Abstract
In real-world situations, speech is masked by both background noise and reverberation, which negatively affect perceptual quality and intelligibility. In this paper, we address monaural speech separation in reverberant and noisy environments. We perform dereverberation and denoising using supervised learning with a deep neural network. Specifically, we enhance the magnitude and phase by performing separation with an estimate of the complex ideal ratio mask. We define the complex ideal ratio mask so that direct speech results after the mask is applied to reverberant and noisy speech. Our approach is evaluated using simulated and real room impulse responses, and with background noises. The proposed approach improves objective speech quality and intelligibility significantly. Evaluations and comparisons show that it outperforms related methods in many reverberant and noisy environments.
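The complex ideal ratio mask described above can be sketched numerically: by definition, a mask that recovers the direct-speech STFT when applied (by pointwise complex multiplication) to the STFT of reverberant and noisy speech is the complex division of the two spectrograms, written out in real and imaginary parts. The NumPy sketch below is a minimal illustration of that identity; the function names and the `eps` stabilizer are assumptions for this example, not details from the paper.

```python
import numpy as np

def complex_ideal_ratio_mask(Y, S, eps=1e-8):
    """Complex ideal ratio mask M such that M * Y ~= S.

    Y: STFT of reverberant-noisy speech; S: STFT of direct speech
    (complex arrays of the same shape). Equivalent to the complex
    division S / Y, expanded into real and imaginary components.
    """
    denom = Y.real ** 2 + Y.imag ** 2 + eps  # |Y|^2, stabilized
    m_real = (Y.real * S.real + Y.imag * S.imag) / denom
    m_imag = (Y.real * S.imag - Y.imag * S.real) / denom
    return m_real + 1j * m_imag

def apply_mask(M, Y):
    """Pointwise complex multiplication enhances magnitude and phase jointly."""
    return M * Y
```

In the supervised setting the deep network is trained to estimate this mask from features of the mixture; at test time the estimated mask replaces the ideal one above.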



Citations
Journal Article · DOI

Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

TL;DR: A review of recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems.
Journal Article · DOI

Speaker-Independent Speech Separation With Deep Attractor Network

TL;DR: In this article, a neural network is used to project the time-frequency representation of the mixture signal into a high-dimensional embedding space and a reference point (attractor) is created to represent each speaker.
Proceedings Article · DOI

Speech Denoising with Deep Feature Losses.

TL;DR: In this article, a fully-convolutional context aggregation network using a deep feature loss is proposed to denoise speech signals by processing the raw waveform directly, which achieves state-of-the-art performance in objective speech quality metrics and in large-scale perceptual experiments with human listeners.
Proceedings Article · DOI

End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction

TL;DR: In this paper, the authors proposed an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network.
References
Journal Article · DOI

Binaural classification for reverberant speech segregation using deep neural networks

TL;DR: Evaluations and comparisons show that DNN-based binaural classification produces superior segregation performance in a variety of multisource and reverberant conditions.
Journal Article · DOI

Theory of Speech Masking by Reverberation

TL;DR: In this article, a general statistical theory for the masking effect of reverberation on the intelligibility of words is developed for a series of discrete pulses distributed statistically over a 30-db range in sound pressure level in a given frequency band.
Proceedings Article · DOI

Recognizing reverberant speech with RASTA-PLP

TL;DR: The authors' experimental variant on RASTA processing provides a statistically significant improvement in performance on reverberant speech, with a best word error rate of 64.1%.
Proceedings Article · DOI

A deep neural network for time-domain signal reconstruction

TL;DR: A new deep network is proposed that directly reconstructs the time-domain clean signal through an inverse fast Fourier transform layer and significantly outperforms a recent non-negative matrix factorization based separation system in both objective speech intelligibility and quality.
Journal Article · DOI

A Supervised Learning Approach to Monaural Segregation of Reverberant Speech

TL;DR: A supervised learning approach to monaural segregation of reverberant voiced speech is proposed, which learns to map from a set of pitch-based auditory features to a grouping cue encoding the posterior probability of a time-frequency (T-F) unit being target dominant given observed features.