Showing papers by DeLiang Wang published in 2013


Proceedings ArticleDOI
26 May 2013
TL;DR: The proposed feature enhancement algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask.
Abstract: We propose a feature enhancement algorithm to improve robust automatic speech recognition (ASR). The algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask. The estimated IRM is used to filter out noise from a noisy Mel spectrogram before performing cepstral feature extraction for ASR. On the noisy subset of the Aurora-4 robust ASR corpus, the proposed enhancement obtains a relative improvement of over 38% in terms of word error rates using ASR models trained in clean conditions, and an improvement of over 14% when the models are trained using the multi-condition training data. In terms of instantaneous SNR estimation performance, the proposed system obtains a mean absolute error of less than 4 dB in most frequency channels.
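To make the masking step concrete, here is a minimal numpy sketch of applying an estimated ratio mask to a noisy Mel spectrogram before cepstral feature extraction. The mask and spectrogram are random placeholders standing in for the DNN output and real audio; the DNN estimator itself and the exact Aurora-4 front end are not reproduced.

```python
import numpy as np
from scipy.fft import dct

# Illustrative shapes: 26 Mel channels, 100 frames.
n_mel, n_frames = 26, 100
rng = np.random.default_rng(0)

# Stand-ins: a noisy Mel spectrogram and a DNN-estimated ratio mask in [0, 1].
noisy_mel = rng.random((n_mel, n_frames)) + 1e-3   # placeholder for real Mel energies
estimated_irm = rng.random((n_mel, n_frames))      # placeholder for the DNN output

# Filter out noise by pointwise multiplication with the estimated mask.
enhanced_mel = estimated_irm * noisy_mel

# Cepstral feature extraction: log compression followed by a DCT over channels,
# keeping the first 13 coefficients (a common MFCC configuration).
log_mel = np.log(enhanced_mel + 1e-8)
cepstra = dct(log_mel, type=2, axis=0, norm='ortho')[:13, :]
print(cepstra.shape)  # (13, 100)
```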

557 citations


Journal ArticleDOI
TL;DR: This work proposes to learn more linearly separable and discriminative features from raw acoustic features and train linear SVMs, which are much easier and faster to train than kernel SVMs.
Abstract: Formulating speech separation as a binary classification problem has been shown to be effective. While good separation performance is achieved in matched test conditions using kernel support vector machines (SVMs), separation in unmatched conditions involving new speakers and environments remains a big challenge. A simple yet effective method to cope with the mismatch is to include many different acoustic conditions into the training set. However, large-scale training is almost intractable for kernel machines due to computational complexity. To enable training on relatively large datasets, we propose to learn more linearly separable and discriminative features from raw acoustic features and train linear SVMs, which are much easier and faster to train than kernel SVMs. For feature learning, we employ standard pre-trained deep neural networks (DNNs). The proposed DNN-SVM system is trained on a variety of acoustic conditions within a reasonable amount of time. Experiments on various test mixtures demonstrate good generalization to unseen speakers and background noises.
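A minimal sketch of the DNN-SVM idea, under the assumption that a pre-trained feed-forward network supplies the learned features: raw acoustic features are passed through a small network (random weights used here as placeholders for pre-trained ones) and a linear SVM is trained on the resulting activations.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-in data: 1000 time-frequency units, 85-dimensional raw acoustic features,
# binary labels (1 = speech-dominant, 0 = noise-dominant).
X_raw = rng.standard_normal((1000, 85))
y = rng.integers(0, 2, size=1000)

# A tiny feed-forward network used only as a feature extractor. In the actual
# system these weights come from a pre-trained DNN; random weights are placeholders.
W1, b1 = rng.standard_normal((85, 200)) * 0.1, np.zeros(200)
W2, b2 = rng.standard_normal((200, 50)) * 0.1, np.zeros(50)

def dnn_features(x):
    h1 = np.maximum(0.0, x @ W1 + b1)      # hidden layer 1
    return np.maximum(0.0, h1 @ W2 + b2)   # hidden layer 2: the "learned" features

# Train a linear SVM on the learned features; far cheaper than a kernel SVM.
clf = LinearSVC(C=1.0, max_iter=5000)
clf.fit(dnn_features(X_raw), y)
mask_estimate = clf.predict(dnn_features(X_raw))
```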

460 citations


Journal ArticleDOI
TL;DR: Testing using normal-hearing and HI listeners indicated that intelligibility increased following processing in all conditions, and increases were larger for HI listeners, for the modulated background, and for the least-favorable SNRs.
Abstract: Despite considerable effort, monaural (single-microphone) algorithms capable of increasing the intelligibility of speech in noise have remained elusive. Successful development of such an algorithm is especially important for hearing-impaired (HI) listeners, given their particular difficulty in noisy backgrounds. In the current study, an algorithm based on binary masking was developed to separate speech from noise. Unlike the ideal binary mask, which requires prior knowledge of the premixed signals, the masks used to segregate speech from noise in the current study were estimated by training the algorithm on speech not used during testing. Sentences were mixed with speech-shaped noise and with babble at various signal-to-noise ratios (SNRs). Testing using normal-hearing and HI listeners indicated that intelligibility increased following processing in all conditions. These increases were larger for HI listeners, for the modulated background, and for the least-favorable SNRs. They were also often substantial, allowing several HI listeners to improve intelligibility from scores near zero to values above 70%.

213 citations


Journal ArticleDOI
TL;DR: This paper expands T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA), and perceptual linear prediction (PLP), and proposes to use a group Lasso approach to select complementary features in a principled way.
Abstract: Monaural speech segregation has been a very challenging problem for decades. By casting speech segregation as a binary classification problem, recent advances have been made in computational auditory scene analysis on segregation of both voiced and unvoiced speech. So far, pitch and amplitude modulation spectrogram have been used as two main kinds of time-frequency (T-F) unit level features in classification. In this paper, we expand T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP). Comprehensive comparisons are performed in order to identify effective features for classification-based speech segregation. Our experiments in matched and unmatched test conditions show that these newly included features significantly improve speech segregation performance. Specifically, GFCC and RASTA-PLP are the best single features in matched-noise and unmatched-noise test conditions, respectively. We also find that pitch-based features are crucial for good generalization to unseen environments. To further explore complementarity in terms of discriminative power, we propose to use a group Lasso approach to select complementary features in a principled way. The final combined feature set yields promising results in both matched and unmatched test conditions.
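The group Lasso selection can be illustrated with a small proximal-gradient sketch: each candidate feature type (e.g., GFCC, RASTA-PLP, pitch-based) occupies one coefficient block, and blocks that do not contribute are driven exactly to zero. This is a generic formulation with toy data, not the paper's exact setup.

```python
import numpy as np

def group_soft_threshold(w, threshold):
    """Proximal operator of the group-Lasso penalty: shrink a whole group toward zero."""
    norm = np.linalg.norm(w)
    if norm <= threshold:
        return np.zeros_like(w)
    return (1.0 - threshold / norm) * w

def group_lasso(X, y, groups, lam=0.3, step=0.1, n_iter=1000):
    """Least-squares regression with a group-Lasso penalty via proximal gradient.

    groups: list of index arrays, one per feature group (e.g., GFCC, RASTA-PLP, pitch).
    Groups whose coefficient blocks end up exactly zero are effectively deselected.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - step * grad
        for idx in groups:
            w[idx] = group_soft_threshold(w[idx], step * lam)
    return w

# Toy example: 3 groups of 5 features; only group 0 is truly informative.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 15))
y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)
groups = [np.arange(0, 5), np.arange(5, 10), np.arange(10, 15)]
w = group_lasso(X, y, groups)
print([np.linalg.norm(w[idx]) for idx in groups])  # non-zero mainly for group 0
```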

192 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: This study reveals that the nonlinear rectification primarily accounts for the noise robustness differences, and suggests how to enhance MFCC robustness and further improve GFCC robustness by adopting a different time-frequency representation.
Abstract: Automatic speaker recognition can achieve a high level of performance in matched training and testing conditions. However, such performance drops significantly in mismatched noisy conditions. Recent research indicates that a new speaker feature, gammatone frequency cepstral coefficients (GFCC), exhibits superior noise robustness to commonly used mel-frequency cepstral coefficients (MFCC). To gain a deep understanding of the intrinsic robustness of GFCC relative to MFCC, we design speaker identification experiments to systematically analyze their differences and similarities. This study reveals that the nonlinear rectification primarily accounts for the noise robustness differences. Moreover, this study suggests how to enhance MFCC robustness, and further improve GFCC robustness by adopting a different time-frequency representation.
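The rectification difference under study can be isolated in a few lines: MFCC-style processing applies log compression to filterbank energies, while GFCC-style processing applies cubic-root compression, with the DCT kept identical. The filterbank energies below are random placeholders.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)

# Placeholder filterbank energies (e.g., 64 gammatone or 26 Mel channels x frames).
energies = rng.random((64, 200)) + 1e-6

def cepstra(fb_energies, rectification):
    """DCT cepstra after a chosen nonlinear rectification of filterbank energies."""
    if rectification == 'log':        # MFCC-style compression
        compressed = np.log(fb_energies)
    elif rectification == 'cubic':    # GFCC-style cubic-root compression
        compressed = np.cbrt(fb_energies)
    else:
        raise ValueError(rectification)
    return dct(compressed, type=2, axis=0, norm='ortho')[:23, :]

mfcc_like = cepstra(energies, 'log')
gfcc_like = cepstra(energies, 'cubic')
```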

159 citations


Journal ArticleDOI
TL;DR: An unsupervised approach to separating cochannel speech follows the two main stages of computational auditory scene analysis, segmentation and grouping, and produces significant SNR improvements across a range of input SNRs.
Abstract: Cochannel (two-talker) speech separation is predominantly addressed using pretrained speaker-dependent models. In this paper, we propose an unsupervised approach to separating cochannel speech. Our approach follows the two main stages of computational auditory scene analysis: segmentation and grouping. For voiced speech segregation, the proposed system utilizes a tandem algorithm for simultaneous grouping and then unsupervised clustering for sequential grouping. The clustering is performed by a search to maximize the ratio of between- and within-group speaker distances while penalizing within-group concurrent pitches. To segregate unvoiced speech, we first produce unvoiced speech segments based on onset/offset analysis. The segments are grouped using the complementary binary masks of segregated voiced speech. Despite its simplicity, our approach produces significant SNR improvements across a range of input SNRs. The proposed system yields competitive performance in comparison to other speaker-independent and model-based methods.

110 citations


Journal ArticleDOI
TL;DR: A novel hidden Markov model framework estimates the most probable path through a multisource state space; the proposed framework improves segregation relative to several two-microphone comparison systems based solely on azimuth cues.
Abstract: We propose an approach to binaural detection, localization and segregation of speech based on pitch and azimuth cues. We formulate the problem as a search through a multisource state space across time, where each multisource state encodes the number of active sources, and the azimuth and pitch of each active source. A set of multilayer perceptrons are trained to assign time-frequency units to one of the active sources in each multisource state based jointly on observed pitch and azimuth cues. We develop a novel hidden Markov model framework to estimate the most probable path through the multisource state space. An estimated state path encodes a solution to the detection, localization, pitch estimation and simultaneous organization problems. Segregation is then achieved with an azimuth-based sequential organization stage. We demonstrate that the proposed framework improves segregation relative to several two-microphone comparison systems that are based solely on azimuth cues. Performance gains are consistent across a variety of reverberant conditions.
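A generic Viterbi recursion, as a stand-in for decoding the most probable path through the multisource state space; here the states are plain indices and the observation and transition scores are random placeholders, whereas in the paper each state encodes the number of active sources and their azimuths and pitches.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_prior):
    """Most probable state path given per-frame log observation scores,
    a log transition matrix, and a log prior over states."""
    n_frames, n_states = log_obs.shape
    delta = log_prior + log_obs[0]
    backptr = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans            # (from_state, to_state)
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(n_states)] + log_obs[t]
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(n_frames - 2, -1, -1):
        path[t] = backptr[t + 1][path[t + 1]]
    return path

# Toy run: 50 frames, 30 states (placeholders for (source count, azimuth, pitch) tuples).
rng = np.random.default_rng(0)
path = viterbi(rng.standard_normal((50, 30)),
               np.log(np.full((30, 30), 1 / 30)),
               np.log(np.full(30, 1 / 30)))
```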

45 citations


Journal ArticleDOI
TL;DR: This work demonstrates the effectiveness of directly using the masked data on both small- and large-vocabulary datasets, and suggests a much better baseline than unenhanced speech for future work in missing feature ASR.
Abstract: Recently, much work has been devoted to the computation of binary masks for speech segregation. Conventional wisdom in the field of ASR holds that these binary masks cannot be used directly; the missing energy significantly affects the calculation of the cepstral features commonly used in ASR. We show that this commonly held belief may be a misconception; we demonstrate the effectiveness of directly using the masked data on both a small and large vocabulary dataset. In fact, this approach, which we term the direct masking approach, performs comparably to two previously proposed missing feature techniques. We also investigate the reasons why other researchers may have not come to this conclusion; variance normalization of the features is a significant factor in performance. This work suggests a much better baseline than unenhanced speech for future work in missing feature ASR.
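A minimal sketch of direct masking as described: zero out noise-dominant T-F units, compute cepstral features from the masked spectrogram without any missing-feature reconstruction, and apply per-utterance mean and variance normalization, the factor the paper identifies as significant. Arrays are placeholders.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)

# Placeholders: noisy spectrogram magnitudes and an estimated binary mask.
noisy_spec = rng.random((64, 150)) + 1e-6
binary_mask = (rng.random((64, 150)) > 0.5).astype(float)

# Direct masking: use the masked spectrogram as-is, with no missing-feature imputation.
masked_spec = binary_mask * noisy_spec

# Cepstral features computed directly from the masked data.
cepstra = dct(np.log(masked_spec + 1e-8), type=2, axis=0, norm='ortho')[:13, :]

# Per-utterance mean and variance normalization of each cepstral dimension.
normalized = (cepstra - cepstra.mean(axis=1, keepdims=True)) / \
             (cepstra.std(axis=1, keepdims=True) + 1e-8)
```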

43 citations


Journal ArticleDOI
TL;DR: Methods are presented that require only a small training corpus and produce high-quality IBM estimates under conditions unseen during training.
Abstract: Monaural speech separation is a well-recognized challenge. Recent studies utilize supervised classification methods to estimate the ideal binary mask (IBM) to address the problem. In a supervised learning framework, the issue of generalization to conditions different from those in training is very important. This paper presents methods that require only a small training corpus and can generalize to unseen conditions. The system utilizes support vector machines to learn classification cues and then employs a rethresholding technique to estimate the IBM. A distribution fitting method is used to generalize to unseen signal-to-noise ratio conditions and voice activity detection based adaptation is used to generalize to unseen noise conditions. Systematic evaluation and comparison show that the proposed approach produces high quality IBM estimates under unseen conditions.

34 citations


Journal ArticleDOI
TL;DR: This first study of the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs), and vocabulary sizes indicates that maximizing the SNR gain is probably not an appropriate goal for improving either human or machine recognition of noisy speech.
Abstract: Processing noisy signals using the ideal binary mask improves automatic speech recognition (ASR) performance. This paper presents the first study that investigates the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs), and vocabulary sizes. Binary masks are computed either by comparing the SNR within a time-frequency unit of a mixture signal with a local criterion (LC), or by comparing the local target energy with the long-term average spectral energy of speech. ASR results show that (1) akin to human speech recognition, binary masking significantly improves ASR performance even when the SNR is as low as −60 dB; (2) the ASR performance profiles are qualitatively similar to those obtained in human intelligibility experiments; (3) the difference between the LC and mixture SNR is more correlated to the recognition accuracy than LC; (4) LC at which the performance peaks is lower than 0 dB, which is the threshold that maximizes the SNR gain of processed signals. This broad agreement with human performance is rather surprising. The results also indicate that maximizing the SNR gain is probably not an appropriate goal for improving either human or machine recognition of noisy speech.
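The two mask definitions compared in the paper can be sketched as follows, assuming access to the premixed target and noise energies; LC is the local criterion in dB and the energy arrays are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder premixed energies per time-frequency unit (channels x frames).
target_energy = rng.random((64, 200)) + 1e-8
noise_energy = rng.random((64, 200)) + 1e-8

# Definition 1: local SNR within each T-F unit compared against a local criterion LC.
LC_dB = -6.0
local_snr_db = 10.0 * np.log10(target_energy / noise_energy)
ibm_snr = (local_snr_db > LC_dB).astype(float)

# Definition 2: local target energy compared against the long-term average
# spectral energy of speech in the same frequency channel.
long_term_avg = target_energy.mean(axis=1, keepdims=True)
ibm_lta = (10.0 * np.log10(target_energy / long_term_avg) > LC_dB).astype(float)
```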

30 citations


Journal ArticleDOI
TL;DR: An iterative algorithm adapts speaker models to match the signal levels in testing; it improves separation results significantly and is not limited to a given set of SNR levels.
Abstract: Cochannel speech separation aims to separate two speech signals from a single mixture. In a supervised scenario, the identities of two speakers are given, and current methods use pre-trained speaker models for separation. One issue in model-based methods is the mismatch between training and test signal levels. We propose an iterative algorithm to adapt speaker models to match the signal levels in testing. Our algorithm first obtains initial estimates of source signals using unadapted speaker models and then detects the input signal-to-noise ratio (SNR) of the mixture. The input SNR is then used to adapt the speaker models for more accurate estimation. The two steps iterate until convergence. Compared to search-based SNR detection methods, our method is not limited to given SNR levels. Evaluations demonstrate that the iterative procedure converges quickly in a considerable range of SNRs and improves separation results significantly. Comparisons show that the proposed system performs significantly better than related model-based systems.
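A toy sketch of the iterative loop described above: separate with the current (gain-adapted) models, re-estimate the signal levels from the separated estimates, and repeat until convergence. Simple spectral templates and scalar gains stand in for full speaker models and model adaptation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for trained speaker models: fixed average spectra for speakers A and B.
model_a = rng.random(64) + 0.1
model_b = rng.random(64) + 0.1

# A toy mixture spectrum built with speaker B 6 dB below speaker A.
true_gain_a, true_gain_b = 1.0, 10 ** (-6 / 20)
mixture = true_gain_a * model_a + true_gain_b * model_b

gain_a, gain_b = 1.0, 1.0          # initial (unadapted) signal levels
for _ in range(100):
    # 1. Estimate the two sources with the current gain-adapted models (ratio masking).
    total = gain_a * model_a + gain_b * model_b
    est_a = mixture * (gain_a * model_a) / total
    est_b = mixture * (gain_b * model_b) / total
    # 2. Re-estimate the signal levels from the source estimates and adapt the models.
    new_gain_a = est_a.sum() / model_a.sum()
    new_gain_b = est_b.sum() / model_b.sum()
    if abs(new_gain_a - gain_a) + abs(new_gain_b - gain_b) < 1e-8:
        break
    gain_a, gain_b = new_gain_a, new_gain_b

# Detected level offset (dB) between the adapted models; the toy mixture above
# was constructed with a 6 dB offset between the two speakers.
print(round(20 * np.log10(gain_a / gain_b), 2))
```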

Proceedings ArticleDOI
26 May 2013
TL;DR: This study presents a method that alleviates the generalization issue by attempting to denoise acoustic features before training and testing, and shows that a standard multilayer perceptron with proper regularization performs well on this task.
Abstract: Speech separation has been recently formulated as a classification problem. Classification as a form of supervised learning usually performs well on background noises when parts of them are seen in the training set. However, the performance can be significantly worse when generalizing to completely unseen noises. In this study, we present a method that alleviates the generalization issue by attempting to denoise acoustic features before training and testing. We show that a standard multilayer perceptron with proper regularization performs well on this task. Experimental results indicate that the resulting separation system performs significantly better in a variety of unknown noises in low SNR conditions. In a negative SNR condition, we also show that the proposed system produces more intelligible speech according to two recently proposed objective speech intelligibility measures.
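A minimal sketch of the feature-denoising front end, assuming paired noisy and clean feature vectors are available for training: a regularized multilayer perceptron (here scikit-learn's MLPRegressor with an L2 penalty) maps noisy features toward their clean counterparts before the separation classifier sees them. The data are random placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy stand-ins: "clean" acoustic feature vectors and noisy versions of them.
clean = rng.standard_normal((2000, 40))
noisy = clean + 0.5 * rng.standard_normal((2000, 40))

# A standard MLP with L2 regularization (alpha), trained to map noisy features
# back to their clean counterparts; this is the feature-denoising front end.
denoiser = MLPRegressor(hidden_layer_sizes=(128, 128), alpha=1e-3,
                        max_iter=300, random_state=0)
denoiser.fit(noisy, clean)

# At test time, features are denoised before being fed to the separation classifier.
denoised = denoiser.predict(noisy)
# Mean squared error against the clean features, before and after denoising.
print(np.mean((noisy - clean) ** 2), np.mean((denoised - clean) ** 2))
```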

Proceedings ArticleDOI
26 May 2013
TL;DR: A novel framework, called bidirectional speech decoding (BSD), performs speech separation and robust automatic speech recognition (ASR) in a unified fashion and obtains a relative improvement of 17% in word error rate over the noisy baseline.
Abstract: We present a novel framework for performing speech separation and robust automatic speech recognition (ASR) in a unified fashion. Separation is performed by estimating the ideal binary mask (IBM), which identifies speech dominant and noise dominant units in a time-frequency (T-F) representation of the noisy signal. ASR is performed on extracted cepstral features after binary masking. Previous systems perform these steps in a sequential fashion - separation followed by recognition. The proposed framework, which we call bidirectional speech decoding (BSD), unifies these two stages. It does this by using multiple IBM estimators each of which is designed specifically for a back-end acoustic phonetic unit (BPU) of the recognizer. The standard ASR decoder is modified to use these IBM estimators to obtain BPU-specific cepstra during likelihood calculation. On the Aurora-4 robust ASR task, the proposed framework obtains a relative improvement of 17% in word error rate over the noisy baseline. It also obtains significant improvements in the quality of the estimated IBM.

Proceedings ArticleDOI
26 May 2013
TL;DR: A two-stage approach improves the perceptual quality of speech separated by binary masking: a binary mask generated in the first stage performs the separation, and a sparse-representation reconstruction in the second stage improves quality.
Abstract: Speech separation based on time-frequency masking has been shown to improve intelligibility of speech signals corrupted by noise. A perceived weakness of binary masking is the quality of separated speech. In this paper, an approach for improving the perceptual quality of separated speech from binary masking is proposed. Our approach consists of two stages, where a binary mask is generated in the first stage that effectively performs speech separation. In the second stage, a sparse-representation approach is used to represent the separated signal by a linear combination of Short-time Fourier Transform (STFT) magnitudes that are generated from a clean speech dictionary. Overlap-and-add synthesis is then used to generate an estimate of the speech signal. The performance of the proposed approach is evaluated with the Perceptual Evaluation of Speech Quality (PESQ), which is a standard objective speech quality measure. The proposed algorithm offers considerable improvements in speech quality over binary-masked noisy speech and other reconstruction approaches.
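The second stage can be sketched with off-the-shelf sparse coding, assuming a clean-speech dictionary of STFT magnitude atoms: each separated frame is approximated by a sparse combination of atoms (orthogonal matching pursuit is used here as a convenient solver; the paper's exact sparse-representation method may differ).

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)

# Stand-in clean-speech dictionary: 200 STFT-magnitude atoms over 257 frequency bins.
n_freq, n_atoms = 257, 200
dictionary = np.abs(rng.standard_normal((n_freq, n_atoms)))

# Stand-in: STFT magnitudes of one binary-masked (separated) frame.
masked_frame = np.abs(rng.standard_normal(n_freq))

# Second stage: approximate the separated frame as a sparse linear combination of
# clean-speech atoms, smoothing over the energy removed by the binary mask.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=10, fit_intercept=False)
omp.fit(dictionary, masked_frame)
reconstructed_frame = dictionary @ omp.coef_

# In the full system, reconstructed magnitudes for all frames would be combined
# with phase and overlap-and-add synthesis to produce the enhanced waveform.
```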

01 Jan 2013
TL;DR: In this paper, a parsimonious model of short-term and long-term synaptic plasticity at the electrophysiological level is presented, consisting of two interacting differential equations, one describing alterations of the synaptic weight and the other describing changes to the speed of recovery (forgetting).
Abstract: It has been demonstrated that short-term habituation may be caused by a decrease in release of presynaptic neurotransmitters and long-term habituation seems to be caused by morphological changes of presynaptic terminals. A parsimonious model of short-term and long-term synaptic plasticity at the electrophysiological level is presented. This model consists of two interacting differential equations, one describing alterations of the synaptic weight and the other describing changes to the speed of recovery (forgetting). The latter exhibits an inverse S-shaped curve whose high value corresponds to fast recovery (short-term habituation) and low value corresponds to slow recovery (long-term habituation). The model has been tested on short-term and a set of long-term habituation data of prey-catching behavior in toads, spanning minutes to hours to several weeks.
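An illustrative numerical integration of a model with this structure: a fast variable for the synaptic weight and a slow variable for the recovery speed. The equations and constants below are assumptions chosen only to show the coupled short-term/long-term behaviour, not the paper's exact formulation.

```python
import numpy as np

def simulate(stimulus, dt=0.1, alpha=0.1, beta=1.0, gamma=0.001):
    """Euler integration of an assumed two-variable habituation model:
    y is the synaptic weight, z the speed of recovery."""
    y, z = 1.0, 1.0
    ys = []
    for s in stimulus:
        dy = alpha * z * (1.0 - y) - beta * y * s  # depression with stimulation, recovery at speed z
        dz = -gamma * z * s                        # repeated stimulation slows recovery (long-term habituation)
        y += dt * dy
        z += dt * dz
        ys.append(y)
    return np.array(ys)

# A stimulus train followed by rest: the weight y drops during stimulation and
# recovers afterwards, more slowly once the recovery speed z has decreased.
stimulus = np.concatenate([np.ones(100), np.zeros(400)])
trace = simulate(stimulus)
```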

Proceedings ArticleDOI
26 May 2013
TL;DR: This work proposes a novel metric learning method to learn invariant speech features in the kernel space that are robust to different noise types and are expected to generalize to unseen noise conditions.
Abstract: Recent studies on speech separation show that the ideal binary mask (IBM) substantially improves speech intelligibility in noise. Supervised learning can be used to effectively estimate the IBM. However, supervised learning has trouble dealing with the situations where the probabilistic properties of the training data and the test data do not match, resulting in a challenging issue of generalization whereby the system trained under particular noise conditions may not generalize to new noise conditions. We propose to use a novel metric learning method to learn invariant speech features in the kernel space. As the learned features encode speech-related information that is robust to different noise types, the system is expected to generalize to unseen noise conditions. Evaluations show the advantage of the proposed approach over other speech separation systems.

Proceedings ArticleDOI
01 Oct 2013
TL;DR: This work suggests the ideal time-frequency binary mask as a main goal for computational auditory scene analysis and describes novel methods for dealing with the generalization issue when support vector machines are used to estimate the ideal binary mask.
Abstract: Summary form only given. Speech separation, or the cocktail party problem, is a widely acknowledged challenge. Part of the challenge stems from the confusion of what the computational goal should be. While the separation of every sound source in a mixture is considered the gold standard, I argue that such an objective is neither realistic nor what the human auditory system does. Motivated by the auditory masking phenomenon, we have suggested instead the ideal time-frequency binary mask as a main goal for computational auditory scene analysis. This leads to a new formulation of speech separation that classifies time-frequency units into two classes: those dominated by the target speech and the rest. In supervised learning, a paramount issue is generalization to conditions unseen during training. I describe novel methods to deal with the generalization issue where support vector machines (SVMs) are used to estimate the ideal binary mask. One method employs distribution fitting to adapt to unseen signal-to-noise ratios and iterative voice activity detection to adapt to unseen noises. Another method learns more linearly separable features using deep neural networks (DNNs) and then couples DNN and linear SVM for training on a variety of noisy conditions. Systematic evaluations show high quality separation in new acoustic environments.