Time-Frequency Masking in the Complex Domain for Speech Dereverberation and Denoising
TLDR
This paper performs dereverberation and denoising using supervised learning with a deep neural network and defines the complex ideal ratio mask so that direct speech results after the mask is applied to reverberant and noisy speech.
Abstract:
In real-world situations, speech is masked by both background noise and reverberation, which negatively affect perceptual quality and intelligibility. In this paper, we address monaural speech separation in reverberant and noisy environments. We perform dereverberation and denoising using supervised learning with a deep neural network. Specifically, we enhance the magnitude and phase by performing separation with an estimate of the complex ideal ratio mask. We define the complex ideal ratio mask so that direct speech results after the mask is applied to reverberant and noisy speech. Our approach is evaluated using simulated and real room impulse responses, and with background noises. The proposed approach improves objective speech quality and intelligibility significantly. Evaluations and comparisons show that it outperforms related methods in many reverberant and noisy environments.
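The mask definition in the abstract can be illustrated with a minimal sketch. It assumes the mask is applied by complex multiplication in each time-frequency unit, so the ideal mask is the complex quotient of the direct-speech spectrogram and the reverberant, noisy one. The function names, the `eps` stabilizer, and the random arrays standing in for STFTs are all assumptions made for illustration. The sketch computes only the ideal (oracle) mask and does not cover estimating it with a deep neural network.

```python
import numpy as np


def complex_ideal_ratio_mask(direct_stft, mixture_stft, eps=1e-12):
    """Return a complex mask M with M * mixture_stft ~= direct_stft.

    Both inputs are complex arrays of the same shape (frequency x time).
    eps is a hypothetical stabilizer that guards against division by zero.
    """
    y_r, y_i = mixture_stft.real, mixture_stft.imag
    s_r, s_i = direct_stft.real, direct_stft.imag
    denom = y_r**2 + y_i**2 + eps
    # Real and imaginary parts of the complex quotient S / Y.
    m_r = (y_r * s_r + y_i * s_i) / denom
    m_i = (y_r * s_i - y_i * s_r) / denom
    return m_r + 1j * m_i


def apply_mask(mask, mixture_stft):
    """Complex multiplication per T-F unit changes magnitude and phase."""
    return mask * mixture_stft


# Toy data: random complex values stand in for real STFT coefficients.
rng = np.random.default_rng(0)
shape = (5, 4)  # (frequency bins, time frames), chosen arbitrarily
direct = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
interference = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = direct + interference  # stand-in for reverberant, noisy speech

mask = complex_ideal_ratio_mask(direct, mixture)
recovered = apply_mask(mask, mixture)
max_error = float(np.max(np.abs(recovered - direct)))
```

Because the mask has both real and imaginary components, applying it corrects the phase as well as the magnitude of the mixture, which a real-valued ratio mask cannot do.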
Citations
Journal Article
Speech Separation based on Contrastive Learning and Deep Modularization
TL;DR: In this paper, the authors use contrastive learning to learn frame representations, then use those representations in a downstream deep modularization task to cluster frames by speaker identity. They demonstrate experimentally that different frames of a speaker can be viewed as augmentations of a given hidden standard frame of that speaker.
Journal Article
A Novel Jointly Optimized Cooperative DAE-DNN Approach Based on a New Multi-Target Step-Wise Learning for Speech Enhancement
TL;DR: In this article, a cooperative structure of deep autoencoders (DAEs) as generative models and deep neural networks (DNNs) is proposed for speech enhancement; it achieves an average perceptual evaluation of speech quality (PESQ) improvement of up to about 0.3 on the TIMIT dataset.
Posted Content
Improved Speaker-Dependent Separation for CHiME-5 Challenge
TL;DR: This paper summarizes several follow-up contributions for improving the submitted NWPU speaker-dependent system for the CHiME-5 challenge, which aims to solve the problem of multi-channel, highly overlapped conversational speech recognition in a dinner party scenario with reverberation and non-stationary noise.
Journal Article
Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition
Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu
TL;DR: In this paper, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separation.
Journal Article
Non-intrusive speech quality prediction based on the blind estimation of clean speech and the i-vector framework
TL;DR: Experimental results show that the proposed system provides higher correlations with perceptual speech quality than several benchmark non-intrusive measures, especially for noisy and enhanced speech.
References
Proceedings Article
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
TL;DR: Adaptive subgradient methods as discussed by the authors dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning, which allows us to find needles in haystacks in the form of very predictive but rarely seen features.
Journal Article
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
TL;DR: This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
Journal Article
Multitask Learning
TL;DR: Multitask Learning (MTL), as described in this paper, is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias.
Journal Article
Image method for efficiently simulating small‐room acoustics
Jont B. Allen, David A. Berkley
TL;DR: Discusses the theoretical and practical use of image techniques for simulating the impulse response between two points in a small rectangular room; the resulting impulse response, when convolved with any desired input signal, simulates room reverberation of that signal.
Journal Article
Perceptual linear predictive (PLP) analysis of speech
TL;DR: A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum and yields a low-dimensional representation of speech.