
Showing papers by "DeLiang Wang" published in 2019


Proceedings ArticleDOI
12 May 2019
TL;DR: Experimental results demonstrate that the proposed model gives consistently better enhancement results than a state-of-the-art real-time convolutional recurrent model.
Abstract: This work proposes a fully convolutional neural network (CNN) for real-time speech enhancement in the time domain. The proposed CNN is an encoder-decoder based architecture with an additional temporal convolutional module (TCM) inserted between the encoder and the decoder. We call this architecture a Temporal Convolutional Neural Network (TCNN). The encoder in the TCNN creates a low-dimensional representation of a noisy input frame. The TCM uses causal and dilated convolutional layers to utilize the encoder output of the current and previous frames. The decoder uses the TCM output to reconstruct the enhanced frame. The proposed model is trained in a speaker- and noise-independent way. Experimental results demonstrate that the proposed model gives consistently better enhancement results than a state-of-the-art real-time convolutional recurrent model. Moreover, since the model is fully convolutional, it has far fewer trainable parameters than earlier models.
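A minimal PyTorch sketch of the causal, dilated convolutions used in a temporal convolutional module; the block structure, channel counts and dilation schedule below are illustrative assumptions, not the paper's exact TCM configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedBlock(nn.Module):
    """One residual block of a temporal convolutional module (simplified sketch)."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left-padding only -> causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):                                # x: (batch, channels, frames)
        y = F.pad(x, (self.pad, 0))                      # pad past frames, never future ones
        y = torch.relu(self.norm(self.conv(y)))
        return x + y                                     # residual connection

# Exponentially growing dilations let the module use many previous frames.
tcm = nn.Sequential(*[CausalDilatedBlock(256, dilation=2 ** i) for i in range(6)])
out = tcm(torch.randn(1, 256, 100))                      # (batch, channels, frames)
```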

268 citations


Journal ArticleDOI
TL;DR: A new learning mechanism for a fully convolutional neural network (CNN) that addresses speech enhancement in the time domain by training the CNN with a mean absolute error loss between the enhanced short-time Fourier transform (STFT) magnitude and the clean STFT magnitude.
Abstract: This paper proposes a new learning mechanism for a fully convolutional neural network (CNN) to address speech enhancement in the time domain. The CNN takes as input the time frames of a noisy utterance and outputs the time frames of the enhanced utterance. At training time, we add an extra operation that converts the time domain to the frequency domain. This conversion corresponds to a simple matrix multiplication and is hence differentiable, implying that a frequency-domain loss can be used for training in the time domain. We use a mean absolute error loss between the enhanced short-time Fourier transform (STFT) magnitude and the clean STFT magnitude to train the CNN. This way, the model can exploit the domain knowledge of converting a signal to the frequency domain for analysis. Moreover, this approach avoids the well-known invalid STFT problem, since the proposed CNN operates in the time domain. Experimental results demonstrate that the proposed method substantially outperforms other speech enhancement methods. The proposed method is easy to implement and applicable to related speech processing tasks that require time-frequency masking or spectral mapping.
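A sketch of the key idea that the time-to-frequency conversion is a plain, differentiable matrix multiplication, so a magnitude-domain loss can train a time-domain network; the frame layout, FFT size and windowing below are illustrative assumptions:

```python
import math
import torch

def frame_magnitudes(frames, n_fft=512):
    """STFT magnitudes of windowed time frames via explicit DFT matrices (a matrix multiply)."""
    n = torch.arange(n_fft, dtype=torch.float32)
    k = torch.arange(n_fft // 2 + 1, dtype=torch.float32)
    angles = 2.0 * math.pi * torch.outer(k, n) / n_fft       # (freq, n_fft)
    cos_mat, sin_mat = torch.cos(angles), torch.sin(angles)
    x = frames * torch.hann_window(n_fft)                    # frames: (batch, n_frames, n_fft)
    real = x @ cos_mat.T                                     # (batch, n_frames, freq)
    imag = -(x @ sin_mat.T)
    return torch.sqrt(real ** 2 + imag ** 2 + 1e-8)          # magnitude spectra

def frequency_domain_mae(enhanced_frames, clean_frames):
    """Mean absolute error between enhanced and clean STFT magnitudes; gradients flow to the time domain."""
    return torch.mean(torch.abs(frame_magnitudes(enhanced_frames) - frame_magnitudes(clean_frames)))
```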

212 citations


Journal ArticleDOI
TL;DR: This work treats speech enhancement as a sequence-to-sequence mapping, and presents a novel convolutional neural network (CNN) architecture for monaural speech enhancement that consistently outperforms a DNN, a unidirectional long short-term memory (LSTM) model, and a bidirectional LSTM model in terms of objective speech intelligibility and quality metrics.
Abstract: For supervised speech enhancement, contextual information is important for accurate mask estimation or spectral mapping. However, commonly used deep neural networks (DNNs) are limited in capturing temporal contexts. To leverage long-term contexts for tracking a target speaker, we treat speech enhancement as a sequence-to-sequence mapping, and present a novel convolutional neural network (CNN) architecture for monaural speech enhancement. The key idea is to systematically aggregate contexts through dilated convolutions, which significantly expand receptive fields. The CNN model additionally incorporates gating mechanisms and residual learning. Our experimental results suggest that the proposed model generalizes well to untrained noises and untrained speakers. It consistently outperforms a DNN, a unidirectional long short-term memory (LSTM) model, and a bidirectional LSTM model in terms of objective speech intelligibility and quality metrics. Moreover, the proposed model has far fewer parameters than DNN and LSTM models.
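A simplified PyTorch sketch of a dilated convolution with a gating mechanism and a residual connection; the actual block design in the paper (layer counts, normalization, non-linearities) may differ:

```python
import torch
import torch.nn as nn

class GatedDilatedResidualBlock(nn.Module):
    """Dilated convolution modulated by a learned sigmoid gate, with residual learning."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation                 # keep the sequence length
        self.filt = nn.Conv1d(channels, channels, kernel_size, dilation=dilation, padding=pad)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation, padding=pad)

    def forward(self, x):                                       # x: (batch, channels, frames)
        y = self.filt(x) * torch.sigmoid(self.gate(x))          # gating mechanism
        return x + y                                            # residual connection

# Stacking blocks with dilations 1, 2, 4, ... rapidly expands the receptive field.
net = nn.Sequential(*[GatedDilatedResidualBlock(64, dilation=2 ** i) for i in range(5)])
```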

149 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: A new convolutional recurrent network (CRN) for complex spectral mapping is proposed, which leads to a causal system for noise- and speaker-independent speech enhancement and significantly outperforms an existing convolutional neural network (CNN), as well as a strong CRN for magnitude spectral mapping.
Abstract: Phase is important for perceptual quality in speech enhancement. However, it seems intractable to directly estimate the phase spectrogram through supervised learning due to the lack of clear structure in the phase spectrogram. Complex spectral mapping aims to estimate the real and imaginary spectrograms of clean speech from those of noisy speech, which simultaneously enhances the magnitude and phase responses of noisy speech. In this paper, we propose a new convolutional recurrent network (CRN) for complex spectral mapping, which leads to a causal system for noise- and speaker-independent speech enhancement. In terms of objective intelligibility and perceptual quality, the proposed CRN significantly outperforms an existing convolutional neural network (CNN) for complex spectral mapping, as well as a strong CRN for magnitude spectral mapping. We additionally incorporate a newly developed group strategy to substantially reduce the number of trainable parameters and the computational cost without sacrificing performance.
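A toy sketch of the complex spectral mapping setup: the network is fed the real and imaginary spectrograms of noisy speech and regresses those of clean speech. The model below is a hypothetical stand-in, not the paper's CRN architecture, and the STFT parameters are assumptions:

```python
import torch
import torch.nn as nn

def real_imag_spectra(wave, n_fft=320, hop=160):
    """Stack real and imaginary STFT components as features: (batch, frames, 2 * freq)."""
    spec = torch.stft(wave, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return torch.cat([spec.real, spec.imag], dim=1).transpose(1, 2)

class ToyComplexMapper(nn.Module):
    """Placeholder recurrent mapper from noisy to clean real/imaginary spectrograms."""
    def __init__(self, freq_bins=161):
        super().__init__()
        self.rnn = nn.LSTM(2 * freq_bins, 256, batch_first=True)
        self.out = nn.Linear(256, 2 * freq_bins)

    def forward(self, noisy_ri):
        h, _ = self.rnn(noisy_ri)
        return self.out(h)

model = ToyComplexMapper()
noisy, clean = torch.randn(4, 16000), torch.randn(4, 16000)       # placeholder waveforms
loss = torch.mean((model(real_imag_spectra(noisy)) - real_imag_spectra(clean)) ** 2)
loss.backward()                                                    # fits magnitude and phase jointly
```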

139 citations


Journal ArticleDOI
TL;DR: In this article, the authors decompose the multi-speaker separation task into the stages of simultaneous grouping and sequential grouping, which achieves state-of-the-art results with a modest model size.
Abstract: We address talker-independent monaural speaker separation from the perspectives of deep learning and computational auditory scene analysis (CASA). Specifically, we decompose the multi-speaker separation task into the stages of simultaneous grouping and sequential grouping. Simultaneous grouping is first performed in each time frame by separating the spectra of different speakers with a permutation-invariantly trained neural network. In the second stage, the frame-level separated spectra are sequentially grouped to different speakers by a clustering network. The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives. Experimental results on the benchmark WSJ0-2mix database show that the new approach achieves the state-of-the-art results with a modest model size.
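A minimal sketch of the frame-level permutation-invariant objective behind the simultaneous-grouping stage, for two speakers; the tensor layout and squared-error criterion are assumptions for illustration:

```python
import torch

def frame_level_pit_loss(est, ref):
    """Permutation-invariant loss evaluated independently in every time frame.

    est, ref: (batch, frames, 2, freq) estimated / reference spectra of two speakers.
    """
    direct = ((est - ref) ** 2).mean(dim=-1).sum(dim=-1)                   # speaker order kept
    swapped = ((est - ref.flip(dims=[2])) ** 2).mean(dim=-1).sum(dim=-1)   # speaker order swapped
    return torch.minimum(direct, swapped).mean()                           # best assignment per frame

loss = frame_level_pit_loss(torch.rand(2, 100, 2, 129), torch.rand(2, 100, 2, 129))
```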

132 citations


Journal ArticleDOI
TL;DR: This study tightly integrates complementary spectral and spatial features for deep learning based multi-channel speaker separation in reverberant environments: individual speakers are localized so that an enhancement network trained on both spatial and spectral features can extract the speaker from an estimated direction and with specific spectral structures.
Abstract: This study tightly integrates complementary spectral and spatial features for deep learning based multi-channel speaker separation in reverberant environments. The key idea is to localize individual speakers so that an enhancement network can be trained on spatial as well as spectral features to extract the speaker from an estimated direction and with specific spectral structures. The spatial and spectral features are designed in a way such that the trained models are blind to the number of microphones and microphone geometry. To determine the direction of the speaker of interest, we identify time-frequency (T-F) units dominated by that speaker and only use them for direction estimation. The T-F unit level speaker dominance is determined by a two-channel chimera++ network, which combines deep clustering and permutation invariant training at the objective function level, and integrates spectral and interchannel phase patterns at the input feature level. In addition, T-F masking based beamforming is tightly integrated in the system by leveraging the magnitudes and phases produced by beamforming. Strong separation performance has been observed on reverberant talker-independent speaker separation, which separates reverberant speaker mixtures based on a random number of microphones arranged in arbitrary linear-array geometry.
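A sketch of one common interchannel phase feature (cosine/sine of the interchannel phase difference) of the kind combined with spectral features at the network input; the paper's exact spatial feature design may differ:

```python
import numpy as np

def cos_sin_ipd(stft_ref, stft_other):
    """Cosine and sine of the interchannel phase difference between two microphone STFTs.

    stft_ref, stft_other: complex arrays of shape (frames, freq) from a microphone pair.
    """
    ipd = np.angle(stft_other) - np.angle(stft_ref)
    return np.stack([np.cos(ipd), np.sin(ipd)], axis=-1)    # (frames, freq, 2), geometry-agnostic
```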

127 citations


Journal ArticleDOI
TL;DR: Deep learning-based time-frequency (T-F) masking has dramatically advanced monaural (single-channel) speech separation and enhancement; this study investigates its potential for direction of arrival (DOA) estimation in noisy and reverberant environments.
Abstract: Deep learning-based time-frequency (T-F) masking has dramatically advanced monaural (single-channel) speech separation and enhancement. This study investigates its potential for direction of arrival (DOA) estimation in noisy and reverberant environments. We explore ways of combining T-F masking and conventional localization algorithms, such as generalized cross correlation with phase transform, as well as newly proposed algorithms based on steered-response SNR and steering vectors. The key idea is to utilize deep neural networks (DNNs) to identify speech dominant T-F units containing relatively clean phase for DOA estimation. Our DNN is trained using only monaural spectral information, and this makes the trained model directly applicable to arrays with various numbers of microphones arranged in diverse geometries. Although only monaural information is used for training, experimental results show strong robustness of the proposed approach in new environments with intense noise and room reverberation, outperforming traditional DOA estimation methods by large margins. Our study also suggests that the ideal ratio mask and its variants remain effective training targets for robust speaker localization.
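A sketch of one way to combine an estimated T-F mask with GCC-PHAT so that only speech-dominant units contribute to localization; the specific pooling and the paper's other proposed algorithms are not reproduced here:

```python
import numpy as np

def masked_gcc_phat(stft1, stft2, mask, n_fft=512):
    """Mask-weighted GCC-PHAT between two microphones.

    stft1, stft2: complex STFTs of shape (frames, n_fft // 2 + 1).
    mask: DNN-estimated speech-dominance weights in [0, 1], same shape.
    Returns the cross-correlation over lags; its peak gives the TDOA estimate.
    """
    cross = stft1 * np.conj(stft2)
    phat = cross / (np.abs(cross) + 1e-8)            # phase transform: keep phase, drop magnitude
    pooled = (mask * phat).sum(axis=0)               # pool speech-dominant units over time
    return np.fft.fftshift(np.fft.irfft(pooled, n=n_fft))
```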

88 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: In this article, the authors investigated phase reconstruction for deep learning based monaural talker-independent speaker separation in the short-time Fourier transform (STFT) domain, and proposed three algorithms based on iterative phase reconstruction, group delay estimation, and phase-difference sign prediction.
Abstract: This study investigates phase reconstruction for deep learning based monaural talker-independent speaker separation in the short-time Fourier transform (STFT) domain. The key observation is that, for a mixture of two sources, with their magnitudes accurately estimated and under a geometric constraint, the absolute phase difference between each source and the mixture can be uniquely determined; in addition, the source phases at each time-frequency (T-F) unit can be narrowed down to only two candidates. To pick the right candidate, we propose three algorithms based on iterative phase reconstruction, group delay estimation, and phase-difference sign prediction. State-of-the-art results are obtained on the publicly available wsj0-2mix and wsj0-3mix corpora.
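A numerical sketch of the geometric constraint: with Y = S1 + S2 and all three magnitudes known, the law of cosines pins the phase of S1 to the mixture phase plus or minus one angle. The function names and array shapes are illustrative:

```python
import numpy as np

def source_phase_candidates(mag_s1, mag_s2, mix_stft):
    """Two candidate phases per T-F unit for source 1, given accurate magnitude estimates.

    From Y = S1 + S2:  |S2|^2 = |Y|^2 + |S1|^2 - 2 |Y| |S1| cos(delta),
    so the phase of S1 equals the mixture phase +delta or -delta.
    """
    mag_y = np.abs(mix_stft)
    cos_delta = (mag_y ** 2 + mag_s1 ** 2 - mag_s2 ** 2) / (2.0 * mag_y * mag_s1 + 1e-8)
    delta = np.arccos(np.clip(cos_delta, -1.0, 1.0))
    theta_y = np.angle(mix_stft)
    # Choosing between the two candidates is the job of the sign-selection algorithms.
    return theta_y + delta, theta_y - delta
```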

84 citations


Journal ArticleDOI
TL;DR: This work proposes a two-stage strategy to enhance corrupted speech, where denoising and dereverberation are conducted sequentially using deep neural networks, and designs a new objective function that incorporates clean phase during model training to better estimate spectral magnitudes.
Abstract: In real-world situations, speech reaching our ears is commonly corrupted by both room reverberation and background noise. These distortions are detrimental to speech intelligibility and quality, and also pose a serious problem to many speech-related applications, including automatic speech and speaker recognition. In order to deal with the combined effects of noise and reverberation, we propose a two-stage strategy to enhance corrupted speech, where denoising and dereverberation are conducted sequentially using deep neural networks. In addition, we design a new objective function that incorporates clean phase during model training to better estimate spectral magnitudes, which would in turn yield better phase estimates when combined with iterative phase reconstruction. The two-stage model is then jointly trained to optimize the proposed objective function. Systematic evaluations and comparisons show that the proposed algorithm improves objective metrics of speech intelligibility and quality substantially, and significantly outperforms previous one-stage enhancement systems.
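The paper's exact objective is not reproduced here; as an illustration of folding the clean phase into a magnitude loss, the sketch below uses the standard phase-sensitive target |S| cos(theta_S - theta_Y), which is one common way to make magnitude training phase-aware:

```python
import torch

def phase_sensitive_target(clean_stft, noisy_stft):
    """Phase-sensitive magnitude target: |S| cos(theta_S - theta_Y)."""
    delta = torch.angle(clean_stft) - torch.angle(noisy_stft)
    return torch.abs(clean_stft) * torch.cos(delta)

def phase_aware_magnitude_loss(est_mag, clean_stft, noisy_stft):
    # Penalize deviation from the phase-aware target instead of the plain clean magnitude.
    return torch.mean((est_mag - phase_sensitive_target(clean_stft, noisy_stft)) ** 2)
```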

83 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: A causal system to address acoustic echo and noise cancellation jointly as deep learning based speech separation, which incorporates a convolutional recurrent network (CRN) and a recurrent network with long short-term memory (LSTM).
Abstract: We formulate acoustic echo and noise cancellation jointly as deep learning based speech separation, where near-end speech is separated from a single microphone recording and sent to the far end. We propose a causal system to address this problem, which incorporates a convolutional recurrent network (CRN) and a recurrent network with long short-term memory (LSTM). The system is trained to estimate the real and imaginary spectrograms of near-end speech and detect the activity of near-end speech from the microphone signal and far-end signal. Subsequently, the estimated real and imaginary spectrograms are used to separate the near-end signal, hence removing echo and noise. The trained near-end speech detector is employed to further suppress residual echo and noise. Evaluation results show that the proposed method effectively removes acoustic echo and background noise in the presence of nonlinear distortions for both simulated and measured room impulse responses (RIRs). Additionally, the proposed method generalizes well to untrained noises, RIRs and speakers.

66 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: A novel deep learning based framework for real-time speech enhancement on dual-microphone mobile phones in a close-talk scenario that incorporates a convolutional recurrent network (CRN) with high computational efficiency.
Abstract: In mobile speech communication, the quality and intelligibility of the received speech can be severely degraded by background noise if the far-end talker is in an adverse acoustic environment. Therefore, speech enhancement algorithms are typically integrated into mobile phones to remove background noise. In this paper, we propose a novel deep learning based framework for real-time speech enhancement on dual-microphone mobile phones in a close-talk scenario. It incorporates a convolutional recurrent network (CRN) with high computational efficiency. In addition, the framework amounts to a causal system, which is necessary for real-time processing on mobile phones. We find that the proposed approach consistently outperforms a deep neural network (DNN) based method, as well as two traditional methods for speech enhancement.

Posted Content
TL;DR: The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives, with a modest model size.
Abstract: We address talker-independent monaural speaker separation from the perspectives of deep learning and computational auditory scene analysis (CASA). Specifically, we decompose the multi-speaker separation task into the stages of simultaneous grouping and sequential grouping. Simultaneous grouping is first performed in each time frame by separating the spectra of different speakers with a permutation-invariantly trained neural network. In the second stage, the frame-level separated spectra are sequentially grouped to different speakers by a clustering network. The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives. Experimental results on the benchmark WSJ0-2mix database show that the new approach achieves the state-of-the-art results with a modest model size.

Proceedings ArticleDOI
12 May 2019
TL;DR: This work explores CDNNs for speech enhancement by training a CDNN that learns to map the complex-valued noisy short-time Fourier transform (STFT) to the clean STFT, and proposes complex-valued extensions of the parametric rectified linear unit (PReLU) nonlinearity that help to improve the performance of CDNNs.
Abstract: A recent study has demonstrated the effectiveness of complex-valued deep neural networks (CDNNs) using newly developed tools such as complex batch normalization and complex residual blocks. Motivated by the fact that CDNNs are well suited for the processing of complex-domain representations, we explore CDNNs for speech enhancement. In particular, we train a CDNN that learns to map the complex-valued noisy short-time Fourier transform (STFT) to the clean STFT. Additionally, we propose complex-valued extensions of the parametric rectified linear unit (PReLU) nonlinearity that help to improve the performance of CDNNs. Experimental results demonstrate that a CDNN using the proposed nonlinearity can give similar or better enhancement results compared to real-valued deep neural networks (DNNs).
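One plausible complex-valued extension of PReLU, applying an independent learnable PReLU to the real and imaginary parts; the extensions actually proposed in the paper may be defined differently:

```python
import torch
import torch.nn as nn

class ComplexPReLU(nn.Module):
    """Split-style complex PReLU: separate learnable PReLUs on real and imaginary parts."""
    def __init__(self, num_parameters=1):
        super().__init__()
        self.prelu_re = nn.PReLU(num_parameters)
        self.prelu_im = nn.PReLU(num_parameters)

    def forward(self, z):                               # z: complex-valued tensor
        return torch.complex(self.prelu_re(z.real), self.prelu_im(z.imag))

act = ComplexPReLU()
out = act(torch.randn(8, 161, dtype=torch.cfloat))      # e.g., one frame of complex STFT features
```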

Proceedings ArticleDOI
15 Sep 2019
TL;DR: It is shown that rank-1 approximation of a speech covariance matrix based on generalized eigenvalue decomposition leads to the best results for the masking-based MVDR beamformer.
Abstract: Despite successful applications of multi-channel signal processing in robust automatic speech recognition (ASR), relatively little research has been conducted on the effectiveness of such techniques in the robust speaker recognition domain. This paper introduces time-frequency (T-F) masking-based beamforming to address text-independent speaker recognition in conditions where strong diffuse noise and reverberation are both present. We examine various masking-based beamformers, such as the parameterized multi-channel Wiener filter, the generalized eigenvalue (GEV) beamformer and the minimum variance distortionless response (MVDR) beamformer, and evaluate their performance in terms of speaker recognition accuracy for i-vector and x-vector based systems. In addition, we present a different formulation for estimating steering vectors from speech covariance matrices. We show that rank-1 approximation of a speech covariance matrix based on generalized eigenvalue decomposition leads to the best results for the masking-based MVDR beamformer. Experiments on the recently introduced NIST SRE 2010 retransmitted corpus show that the MVDR beamformer with rank-1 approximation provides an absolute reduction of 5.55% in equal error rate compared to a standard masking-based MVDR beamformer.
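A sketch, at a single frequency bin, of a GEVD-based rank-1 steering-vector estimate plugged into the standard MVDR solution; the mask-based covariance estimation is assumed done elsewhere, and the paper's exact formulation may differ in details such as scaling and reference selection:

```python
import numpy as np
from scipy.linalg import eigh

def rank1_mvdr_weights(phi_s, phi_n, ref_mic=0):
    """MVDR weights with the steering vector taken from a rank-1 GEVD approximation of phi_s.

    phi_s, phi_n: (M, M) mask-estimated speech / noise spatial covariance matrices (one frequency).
    """
    _, eigvecs = eigh(phi_s, phi_n)                    # generalized eigenvectors, ascending order
    principal = eigvecs[:, -1]                         # principal generalized eigenvector
    steering = phi_n @ principal                       # direction of the rank-1 speech covariance
    steering = steering / steering[ref_mic]            # normalize to the reference microphone
    num = np.linalg.solve(phi_n, steering)             # phi_n^{-1} d
    return num / (steering.conj() @ num)               # w = phi_n^{-1} d / (d^H phi_n^{-1} d)
```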

Journal ArticleDOI
TL;DR: Substantial intelligibility improvements were found for hearing-impaired and normal-hearing listeners across a range of target-to-interferer ratios (TIRs) and this study highlights the difficulty associated with perceiving speech in reverberant-noisy environments.
Abstract: For deep learning based speech segregation to have translational significance as a noise-reduction tool, it must perform in a wide variety of acoustic environments. In the current study, performance was examined when target speech was subjected to interference from a single talker and room reverberation. Conditions were compared in which an algorithm was trained to remove both reverberation and interfering speech, or only interfering speech. A recurrent neural network incorporating bidirectional long short-term memory was trained to estimate the ideal ratio mask corresponding to target speech. Substantial intelligibility improvements were found for hearing-impaired (HI) and normal-hearing (NH) listeners across a range of target-to-interferer ratios (TIRs). HI listeners performed better with reverberation removed, whereas NH listeners demonstrated no difference. Algorithm benefit averaged 56 percentage points for the HI listeners at the least-favorable TIR, allowing these listeners to perform numerically better than young NH listeners without processing. The current study highlights the difficulty associated with perceiving speech in reverberant-noisy environments, and it extends the range of environments in which deep learning based speech segregation can be effectively applied. This increasingly wide array of environments includes not only a variety of background noises and interfering speech, but also room reverberation.

Proceedings ArticleDOI
15 Sep 2019
TL;DR: Experimental results suggest that distortion-independent acoustic modeling is able to overcome the distortion problem, and the models investigated in this paper outperform the previous best system on the CHiME-2 corpus.
Abstract: Monaural speech enhancement has made dramatic advances since the introduction of deep learning a few years ago. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced expected improvements in ASR performance. The lack of an enhancement benefit on recognition, or the gap between monaural speech enhancement and recognition, is often attributed to speech distortions introduced in the enhancement process. In this article, we analyze the distortion problem, compare different acoustic models, and investigate a distortion-independent training scheme for monaural speech recognition. Experimental results suggest that distortion-independent acoustic modeling is able to overcome the distortion problem. Such an acoustic model can also work with speech enhancement models different from the one used during training. Moreover, the models investigated in this paper outperform the previous best system on the CHiME-2 corpus.

Journal ArticleDOI
TL;DR: This paper addresses the problem of talker-dependent speaker separation in reverberant conditions, which are characteristic of real-world environments, and proposes two-stage networks to effectively deal with both speaker separation and speech dereverberation.
Abstract: Speaker separation refers to the problem of separating speech signals from a mixture of simultaneous speakers. Previous studies are limited to addressing the speaker separation problem in anechoic conditions. This paper addresses the problem of talker-dependent speaker separation in reverberant conditions, which are characteristic of real-world environments. We employ recurrent neural networks with bidirectional long short-term memory (BLSTM) to separate and dereverberate the target speech signal. We propose two-stage networks to effectively deal with both speaker separation and speech dereverberation. In the two-stage model, the first stage separates and dereverberates two-talker mixtures and the second stage further enhances the separated target signal. We have extensively evaluated the two-stage architecture, and our empirical results demonstrate large improvements over unprocessed mixtures and clear performance gain over single-stage networks in a wide range of target-to-interferer ratios and reverberation times in simulated as well as recorded rooms. Moreover, we show that time-frequency masking yields better performance than spectral mapping for reverberant speaker separation.

Journal ArticleDOI
TL;DR: Speech materials free of ceiling effects were employed to reveal the optimal trade-off between rejecting noise and retaining speech during time-frequency masking, and this relative criterion value (−7 dB) was found to hold across noise types that differ in acoustic spectro-temporal complexity.
Abstract: Hearing-impaired listeners' intolerance to background noise during speech perception is well known. The current study employed speech materials free of ceiling effects to reveal the optimal trade-off between rejecting noise and retaining speech during time-frequency masking. This relative criterion value (−7 dB) was found to hold across noise types that differ in acoustic spectro-temporal complexity. It was also found that listeners with hearing impairment and those with normal hearing performed optimally at this same value, suggesting no true noise intolerance once time-frequency units containing speech are extracted.


Posted Content
TL;DR: In this article, a distortion-independent training scheme for monaural speech recognition is proposed, which is able to overcome the distortion problem and outperform the previous best system on the CHiME-2 corpus.
Abstract: Monaural speech enhancement has made dramatic advances since the introduction of deep learning a few years ago. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced expected improvements in ASR performance. The lack of an enhancement benefit on recognition, or the gap between monaural speech enhancement and recognition, is often attributed to speech distortions introduced in the enhancement process. In this study, we analyze the distortion problem, compare different acoustic models, and investigate a distortion-independent training scheme for monaural speech recognition. Experimental results suggest that distortion-independent acoustic modeling is able to overcome the distortion problem. Such an acoustic model can also work with speech enhancement models different from the one used during training. Moreover, the models investigated in this paper outperform the previous best system on the CHiME-2 corpus.

Proceedings ArticleDOI
01 May 2019
TL;DR: Simulation results show that, compared to cℓ1-MC-ANC, the proposed algorithms exhibit faster convergence or higher noise reduction at steady state in both free field and reverberant environments.
Abstract: Multichannel active noise control (MC-ANC) aims to cancel low-frequency noise in an enclosure. If noise sources are distributed sparsely in space, adding an ℓ1-norm constraint to the standard MC-ANC helps to reduce the complexity of the system and accelerate the convergence rate. However, the convergence performance of ℓ1-norm constrained MC-ANC (cℓ1-MC-ANC) degrades significantly in reverberant environments. In this paper, we analyze the necessity of using sparsity-inducing algorithms with distinct zero-attracting strengths over loudspeakers, and then derive three algorithms of this kind in the complex domain. Simulation results show that, compared to cℓ1-MC-ANC, the proposed algorithms exhibit faster convergence or higher noise reduction at steady state in both free field and reverberant environments.
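The proposed complex-domain, per-loudspeaker algorithms and the filtered-x structure of ANC are not reproduced here; as background, the sketch below shows the basic zero-attracting (ℓ1-constrained) LMS update that the sparsity idea builds on:

```python
import numpy as np

def zero_attracting_lms_step(w, x, d, mu=0.01, rho=1e-4):
    """One zero-attracting LMS update: standard LMS plus an l1-induced shrinkage term.

    w: filter coefficients, x: input regressor vector, d: desired sample.
    The rho * sign(w) term pulls small coefficients toward zero, promoting sparse solutions.
    """
    e = d - np.dot(w, x)                        # a priori error
    w_new = w + mu * e * x - rho * np.sign(w)   # gradient step + zero attractor
    return w_new, e
```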