
Showing papers on "Speech enhancement published in 2014"


Journal ArticleDOI
TL;DR: Results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics, and that masking based targets, in general, are significantly better than spectral envelope based targets.
Abstract: Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and large speech intelligibility gains. The supervised learning framework, however, is not restricted to the use of binary targets. In this study, we evaluate and compare separation results by using different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking based targets, in general, are significantly better than spectral envelope based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation.
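
To make the compared targets concrete, here is a minimal numpy sketch of how binary-mask and ratio-mask training targets can be computed from parallel clean-speech and noise magnitude spectrograms; the local SNR criterion, the compression exponent, and the approximation of the mixture magnitude by the sum of component magnitudes are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def ideal_masks(speech_mag, noise_mag, lc_db=0.0, beta=0.5):
    """Training targets from parallel clean-speech and noise magnitude
    spectrograms (frequency x frames)."""
    speech_pow = speech_mag ** 2
    noise_pow = noise_mag ** 2

    # Ideal binary mask: 1 where the local SNR exceeds the criterion lc_db.
    local_snr_db = 10.0 * np.log10((speech_pow + 1e-12) / (noise_pow + 1e-12))
    ibm = (local_snr_db > lc_db).astype(np.float32)

    # Ideal ratio mask: soft speech-to-mixture energy ratio, compressed by beta.
    irm = (speech_pow / (speech_pow + noise_pow + 1e-12)) ** beta

    # Spectral magnitude mask: clean magnitude over an approximate mixture
    # magnitude (sum of component magnitudes used as the approximation).
    fft_mask = speech_mag / (speech_mag + noise_mag + 1e-12)
    return ibm, irm, fft_mask
```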

1,046 citations


Journal ArticleDOI
TL;DR: This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture that tends to achieve significant improvements in terms of various objective quality measures.
Abstract: This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture. In the DNN learning process, a large training set ensures a powerful modeling capability to estimate the complicated nonlinear mapping from observed noisy speech to desired clean signals. Acoustic context was found to improve the continuity of the speech separated from the background noise, without the annoying musical artifacts commonly observed in conventional speech enhancement algorithms. A series of pilot experiments were conducted under multi-condition training with more than 100 hours of simulated speech data, resulting in a good generalization capability even in mismatched testing conditions. When compared with the logarithmic minimum mean square error approach, the proposed DNN-based algorithm tends to achieve significant improvements in terms of various objective quality measures. Furthermore, in a subjective preference evaluation with 10 listeners, 76.35% of the subjects preferred the DNN-based enhanced speech to that obtained with the conventional technique.
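
As an illustration of the regression framework described above, the following PyTorch sketch defines a small feed-forward DNN that maps noisy log-power spectral features with acoustic context to clean features under an MSE criterion; layer sizes, context width, and optimizer settings are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class RegressionDNN(nn.Module):
    """Maps a window of noisy log-power spectral frames to the clean
    center frame (sizes are illustrative, not the paper's settings)."""
    def __init__(self, n_bins=257, context=7, hidden=2048, n_layers=3):
        super().__init__()
        layers, dim = [], n_bins * context
        for _ in range(n_layers):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        layers.append(nn.Linear(dim, n_bins))   # predict the clean center frame
        self.net = nn.Sequential(*layers)

    def forward(self, noisy_context):            # (batch, n_bins * context)
        return self.net(noisy_context)

model = RegressionDNN()
loss_fn = nn.MSELoss()                            # minimum mean square error on features
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
```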

860 citations


Journal ArticleDOI
TL;DR: It is shown that, when the noisy phase is enhanced using the proposed phase reconstruction, instrumental measures predict an increase of speech quality over a range of signal to noise ratios, even without explicit amplitude enhancement.
Abstract: The enhancement of speech which is corrupted by noise is commonly performed in the short-time discrete Fourier transform domain. In case only a single microphone signal is available, typically only the spectral amplitude is modified. However, it has recently been shown that an improved spectral phase can as well be utilized for speech enhancement, e.g., for phase-sensitive amplitude estimation. In this paper, we therefore present a method to reconstruct the spectral phase of voiced speech from only the fundamental frequency and the noisy observation. The importance of the spectral phase is highlighted and we elaborate on the reason why noise reduction can be achieved by modifications of the spectral phase. We show that, when the noisy phase is enhanced using the proposed phase reconstruction, instrumental measures predict an increase of speech quality over a range of signal to noise ratios, even without explicit amplitude enhancement.
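
The following numpy sketch illustrates the core idea of propagating the spectral phase of voiced speech across frames from the fundamental frequency alone; it omits the across-frequency (band) reconstruction of the full method, and all parameter names are illustrative.

```python
import numpy as np

def reconstruct_harmonic_phase(f0_per_frame, n_harmonics, hop, fs, phi0=None):
    """Propagate the phase of each harmonic across frames using only the
    fundamental frequency (a sketch of the temporal part of the method)."""
    n_frames = len(f0_per_frame)
    k = np.arange(1, n_harmonics + 1)             # harmonic indices
    phase = np.zeros((n_frames, n_harmonics))
    phase[0] = phi0 if phi0 is not None else 0.0  # initial phase (e.g. from the noisy frame)
    for l in range(1, n_frames):
        # phase advance of harmonic k over one hop at the current f0
        phase[l] = phase[l - 1] + 2.0 * np.pi * k * f0_per_frame[l] * hop / fs
    return np.angle(np.exp(1j * phase))           # wrap to (-pi, pi]
```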

197 citations


Journal ArticleDOI
TL;DR: It is shown that, in comparison with the reference method, a Wiener filter with the decision-directed approach for SNR estimation, the WDA-based speech enhancement methods achieve better objective speech quality regardless of whether the noise conditions are included in the training set.

139 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: The proposed Long Short-Term Memory recurrent neural networks are trained to predict clean speech as well as noise features from noisy speech features, and a magnitude-domain soft mask is constructed from these features; the resulting method outperforms unsupervised magnitude-domain spectral subtraction by a large margin in terms of source-distortion ratio.
Abstract: In this paper we propose the use of Long Short-Term Memory recurrent neural networks for speech enhancement. Networks are trained to predict clean speech as well as noise features from noisy speech features, and a magnitude domain soft mask is constructed from these features. Extensive tests are run on 73 k noisy and reverberated utterances from the Audio-Visual Interest Corpus of spontaneous, emotionally colored speech, degraded by several hours of real noise recordings comprising stationary and non-stationary sources and convolutive noise from the Aachen Room Impulse Response database. In the result, the proposed method is shown to provide superior noise reduction at low signal-to-noise ratios while creating very little artifacts at higher signal-to-noise ratios, thereby outperforming unsupervised magnitude domain spectral subtraction by a large margin in terms of source-distortion ratio.
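
A minimal numpy sketch of the mask construction step: the network's speech and noise magnitude estimates are combined into a soft mask that is applied to the noisy STFT, keeping the noisy phase. The exact mask definition in the paper may differ.

```python
import numpy as np

def soft_mask_enhance(speech_est_mag, noise_est_mag, noisy_stft, floor=1e-8):
    """Build a magnitude-domain soft mask from estimated speech and noise
    magnitudes and apply it to the noisy STFT (noisy phase is retained)."""
    mask = speech_est_mag / (speech_est_mag + noise_est_mag + floor)
    return mask * noisy_stft          # enhanced complex STFT
```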

135 citations


Proceedings ArticleDOI
14 Sep 2014
TL;DR: Three algorithms are proposed to address the mismatch problem in deep neural network (DNN) based speech enhancement; the combined strategies suppress highly non-stationary noise better than all the competing state-of-the-art techniques.
Abstract: We propose three algorithms to address the mismatch problem in deep neural network (DNN) based speech enhancement. First, we investigate noise aware training by incorporating noise information in the test utterance with an ideal binary mask based dynamic noise estimation approach to improve the DNN's speech separation ability from the noisy signal. Next, a set of more than 100 noise types is adopted to enrich the generalization capabilities of the DNN to unseen and non-stationary noise conditions. Finally, the quality of the enhanced speech can further be improved by global variance equalization. Empirical results show that each of the three proposed techniques contributes to the performance improvement. Compared to the conventional logarithmic minimum mean squared error speech enhancement method, our DNN system achieves 0.32 PESQ (perceptual evaluation of speech quality) improvement across six signal-to-noise ratio levels ranging from -5 dB to 20 dB on a test set with unknown noise types. We also observe that the combined strategies suppress highly non-stationary noise better than all the competing state-of-the-art techniques we have evaluated. Index Terms: speech enhancement, deep neural networks, noise aware training, ideal binary mask, non-stationary noise
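
The noise-aware training idea can be sketched as follows in numpy: each noisy input frame (with context) is augmented with a noise estimate derived from an estimated binary mask. The averaging scheme and variable names are illustrative, not the paper's exact dynamic estimator.

```python
import numpy as np

def nat_features(noisy_logspec, est_mask, context=3):
    """Noise-aware training (NAT) inputs: each noisy frame plus context is
    concatenated with a noise estimate averaged over frames that the
    estimated binary mask marks as noise-dominated."""
    n_frames, n_bins = noisy_logspec.shape
    noise_frames = noisy_logspec[est_mask.mean(axis=1) < 0.5]     # noise-dominated frames
    noise_est = (noise_frames.mean(axis=0) if len(noise_frames)
                 else noisy_logspec.mean(axis=0))

    feats = []
    for t in range(n_frames):
        idx = np.clip(np.arange(t - context, t + context + 1), 0, n_frames - 1)
        feats.append(np.concatenate([noisy_logspec[idx].ravel(), noise_est]))
    return np.stack(feats)   # (n_frames, (2*context+1)*n_bins + n_bins)
```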

102 citations


Journal ArticleDOI
TL;DR: Experimental results based on the NIST 2010 SRE dataset suggest that the proposed VAD outperforms conventional ones whenever interview-style speech is involved, and it is demonstrated that noise reduction is vital for energy-based VAD under low SNR.

98 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: Results presented in this paper indicate that channel concatenation gives results similar to or better than beamforming, and that augmenting the standard DNN input with the bottleneck feature from a Speaker Aware Deep Neural Network (SADNN) shows a general advantage over the standard DNN based recognition system and yields additional improvements for far field speech recognition.
Abstract: This paper presents an investigation of far field speech recognition using beamforming and channel concatenation in the context of Deep Neural Network (DNN) based feature extraction. While speech enhancement with beamforming is attractive, the algorithms are typically signal-based with no information about the special properties of speech. A simple alternative to beamforming is concatenating multiple channel features. Results presented in this paper indicate that channel concatenation gives similar or better results. On average the DNN front-end yields a 25% relative reduction in Word Error Rate (WER). Further experiments aim at including relevant information in training adapted DNN features. Augmenting the standard DNN input with the bottleneck feature from a Speaker Aware Deep Neural Network (SADNN) shows a general advantage over the standard DNN based recognition system, and yields additional improvements for far field speech recognition.

93 citations


Journal ArticleDOI
TL;DR: This work develops an algorithm, called 'overlapping group shrinkage' (OGS), based on the minimization of a convex cost function involving a group-sparsity promoting penalty function, that produces denoised speech that is relatively free of musical noise.

85 citations


Journal ArticleDOI
TL;DR: The simulation results reveal the superiority of the proposed Wiener filtering method in the case of Additive White Gaussian Noise (AWGN) as well as colored noise.
Abstract: This paper proposes an adaptive Wiener filtering method for speech enhancement. This method depends on the adaptation of the filter transfer function from sample to sample based on the speech signal statistics; the local mean and the local variance. It is implemented in the time domain rather than in the frequency domain to accommodate for the time-varying nature of the speech signals. The proposed method is compared to the traditional frequency-domain Wiener filtering, spectral subtraction and wavelet denoising methods using different speech quality metrics. The simulation results reveal the superiority of the proposed Wiener filtering method in the case of Additive White Gaussian Noise (AWGN) as well as colored noise.
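
A minimal numpy sketch of such a time-domain adaptive Wiener filter, driven by the local mean and local variance of the noisy samples; the window length and the noise-variance estimate from a leading noise-only segment are assumptions made for illustration.

```python
import numpy as np

def adaptive_wiener(x, win=31, noise_var=None):
    """Sample-by-sample time-domain Wiener filtering from local statistics."""
    if noise_var is None:
        noise_var = np.var(x[: win * 10])          # assume a leading noise-only segment
    kernel = np.ones(win) / win
    local_mean = np.convolve(x, kernel, mode="same")
    local_var = np.convolve(x ** 2, kernel, mode="same") - local_mean ** 2
    signal_var = np.maximum(local_var - noise_var, 0.0)
    gain = signal_var / (signal_var + noise_var + 1e-12)
    return local_mean + gain * (x - local_mean)
```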

76 citations


Patent
17 Dec 2014
TL;DR: A trigger detection block detects the presence of data representing a trigger phrase in the received data, and a second trigger phrase detection block is used to detect the presence of the trigger phrase in the enhanced stored data.
Abstract: Received data representing speech is stored, and a trigger detection block detects a presence of data representing a trigger phrase in the received data. In response, a first part of the stored data representing at least a part of the trigger phrase is supplied to an adaptive speech enhancement block, which is trained on the first part of the stored data to derive adapted parameters for the speech enhancement block. A second part of the stored data, overlapping with the first part of the stored data, is supplied to the adaptive speech enhancement block operating with said adapted parameters, to form enhanced stored data. A second trigger phrase detection block detects the presence of data representing the trigger phrase in the enhanced stored data. In response, enhanced speech data are output from the speech enhancement block for further processing, such as speech recognition. Detecting the presence of data representing the trigger phrase in the received data is carried out by means of a first trigger phrase detection block, detecting the presence of data representing the trigger phrase in the enhanced stored data is carried out by means of a second trigger phrase detection block, and the second trigger phrase detection block operates with different, typically more rigorous, detection criteria from the first trigger phrase detection block.

Journal ArticleDOI
TL;DR: This paper presents a distributed delay and sum beamformer (DDSB) and introduces an improved general distributed synchronous averaging (IGDSA) algorithm, which can be used in any connected network and is combined with the DDSB so that multiple node pairs can update their estimates simultaneously.
Abstract: In this paper, we investigate the use of randomized gossip for distributed speech enhancement and present a distributed delay and sum beamformer (DDSB). In a randomly connected wireless acoustic sensor network, the DDSB estimates the desired signal at each node by communicating only with its neighbors. We first provide the asynchronous DDSB (ADDSB) where each pair of neighboring nodes updates its data asynchronously. Then, we introduce an improved general distributed synchronous averaging (IGDSA) algorithm, which can be used in any connected network, and combine that with the DDSB algorithm where multiple node pairs can update their estimates simultaneously. For convergence analysis, we first provide bounds for the worst case averaging time of the ADDSB for the best and worst connected networks, and then we compare the convergence rate of the ADDSB with the original synchronous DDSB (OSDDSB) and the improved synchronous DDSB (ISDDSB) in regular networks. This convergence rate comparison is extended to randomly connected non-regular networks using simulations. The simulation results show that the DDSB using the different updating schemes converges to the optimal estimates of the centralized beamformer and that the proposed IGDSA algorithm converges much faster than the original synchronous communication scheme, in particular for non-regular networks. Moreover, comparisons are performed with several existing distributed speech enhancement methods from literature, assuming that the steering vector is given. In the simulated scenario, the proposed method leads to a slight performance improvement at the expense of a higher communication cost. The presented method is not constrained to a certain network topology (e.g., tree connected or fully connected), while this is the case for many of the reference methods.
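
The randomized gossip primitive underlying the asynchronous DDSB can be sketched as repeated pairwise averaging between neighboring nodes, which converges to the network-wide average that each node needs; the toy ring network below is purely illustrative.

```python
import numpy as np

def pairwise_gossip(values, neighbors, n_iters=5000, rng=None):
    """Asynchronous randomized gossip: at each step a random node and one of
    its neighbors replace their values by the pair average, so all nodes
    converge to the network-wide mean."""
    rng = rng or np.random.default_rng(0)
    x = np.array(values, dtype=float)
    for _ in range(n_iters):
        i = rng.integers(len(x))
        j = rng.choice(neighbors[i])
        x[i] = x[j] = 0.5 * (x[i] + x[j])
    return x

# Example: a 4-node ring network; all entries converge to ~2.5
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(pairwise_gossip([1.0, 2.0, 3.0, 4.0], neighbors))
```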

Journal ArticleDOI
TL;DR: The proposed speech enhancement technique for signals corrupted by nonstationary acoustic noise applies empirical mode decomposition to the noisy speech signal to obtain a set of intrinsic mode functions (IMF), and adopts the Hurst exponent in the selection of the IMFs used to reconstruct the speech.
Abstract: This paper presents a speech enhancement technique for signals corrupted by nonstationary acoustic noises. The proposed approach applies the empirical mode decomposition (EMD) to the noisy speech signal and obtains a set of intrinsic mode functions (IMF). The main contribution of the proposed procedure is the adoption of the Hurst exponent in the selection of IMFs to reconstruct the speech. This EMD and Hurst-based (EMDH) approach is evaluated in speech enhancement experiments considering environmental acoustic noises with different indices of nonstationarity. The results show that the EMDH improves the segmental signal-to-noise ratio and an overall quality composite measure, encompassing the perceptual evaluation of speech quality (PESQ). Moreover, the short-time objective intelligibility (STOI) measure reinforces the superior performance of EMDH. Finally, the EMDH is also examined in a speaker identification task in noisy conditions. The proposed technique leads to the highest speaker identification rates when compared to the baseline speech enhancement algorithms and also to a multicondition training procedure.
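
A rough Python sketch of the EMDH idea, assuming a third-party EMD implementation (here the PyEMD package) and a simple rescaled-range Hurst estimator; the selection threshold is illustrative and not the value used in the paper.

```python
import numpy as np
from PyEMD import EMD          # third-party package; any EMD implementation works

def hurst_rs(x):
    """Rough rescaled-range (R/S) estimate of the Hurst exponent."""
    n = len(x)
    sizes = [s for s in (2 ** np.arange(4, int(np.log2(n)))) if s <= n // 2]
    rs = []
    for s in sizes:
        chunks = x[: (n // s) * s].reshape(-1, s)
        dev = np.cumsum(chunks - chunks.mean(axis=1, keepdims=True), axis=1)
        r = dev.max(axis=1) - dev.min(axis=1)
        sd = chunks.std(axis=1) + 1e-12
        rs.append(np.mean(r / sd))
    slope, _ = np.polyfit(np.log(sizes), np.log(rs), 1)
    return slope

def emdh_enhance(noisy, hurst_threshold=0.8):
    """Decompose the noisy signal into IMFs and rebuild the speech from the
    IMFs whose Hurst exponent exceeds a threshold (threshold illustrative)."""
    imfs = EMD().emd(noisy)
    keep = [imf for imf in imfs if hurst_rs(imf) > hurst_threshold]
    return np.sum(keep, axis=0) if keep else noisy
```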

Journal ArticleDOI
TL;DR: Success in this emerging field will expand the application of voice-based machine interfaces, such as Siri, the intelligent personal assistant on the iPhone and iPad, to much more realistic settings and thereby provide more natural human-machine interfaces.
Abstract: The separation of speech signals measured at multiple microphones in noisy and reverberant environments using only the audio modality has limitations because there is generally insufficient information to fully discriminate the different sound sources. Humans mitigate this problem by exploiting the visual modality, which is insensitive to background noise and can provide contextual information about the audio scene. This advantage has inspired the creation of the new field of audiovisual (AV) speech source separation that targets exploiting visual modality alongside the microphone measurements in a machine. Success in this emerging field will expand the application of voice-based machine interfaces, such as Siri, the intelligent personal assistant on the iPhone and iPad, to much more realistic settings and thereby provide more natural human-machine interfaces.

Journal ArticleDOI
TL;DR: The experimental results show that the proposed singing voice enhancement technique considerably improved the performance of a simple pitch estimation technique, and these results prove the effectiveness of the proposed method.
Abstract: We propose a novel singing voice enhancement technique for monaural music audio signals, which is a quite challenging problem. Many singing voice enhancement techniques have been proposed recently; however, our approach is based on a quite different idea from these existing methods. We focus on the fluctuation of a singing voice and detect it by exploiting two differently resolved spectrograms: one has rich temporal resolution and poor frequency resolution, while the other has rich frequency resolution and poor temporal resolution. On these two spectrograms, the shapes of fluctuating components are quite different. Based on this idea, we propose a singing voice enhancement technique that we call two-stage harmonic/percussive sound separation (HPSS). In this paper, we describe the details of two-stage HPSS and evaluate the performance of the method. The experimental results show that SDR, a commonly used criterion for the task, was improved by around 4 dB, which is considerably higher than existing methods. In addition, we also evaluated the performance of the method as a preprocessing step for melody estimation in music. The experimental results show that our singing voice enhancement technique considerably improved the performance of a simple pitch estimation technique. These results prove the effectiveness of the proposed method.
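
One plausible way to realize the two-resolution idea in Python, using the median-filtering HPSS available in librosa as a stand-in for the paper's separator; the window lengths and the routing of the vocal component through the two stages follow the description above but are otherwise assumptions.

```python
import librosa

def two_stage_vocal_enhance(y, sr):
    """Two-stage HPSS sketch.
    Stage 1 (long window, fine frequency resolution): the fluctuating singing
    voice falls on the 'percussive' side together with drums.
    Stage 2 (short window, fine time resolution): relative to drums the voice
    now behaves 'harmonically', so the harmonic side of stage 2 is kept."""
    # Stage 1: long analysis window
    S1 = librosa.stft(y, n_fft=4096, hop_length=1024)
    _, voice_plus_perc = librosa.decompose.hpss(S1)
    y1 = librosa.istft(voice_plus_perc, hop_length=1024, length=len(y))

    # Stage 2: short analysis window
    S2 = librosa.stft(y1, n_fft=512, hop_length=128)
    voice, _ = librosa.decompose.hpss(S2)
    return librosa.istft(voice, hop_length=128, length=len(y))
```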

Journal ArticleDOI
TL;DR: Novel speaking-aid systems based on one-to-many eigenvoice conversion (EVC) to enhance three types of alaryngeal speech: esophageal speech, electrolaryngeal speech, and body-conducted silent electrolaryngeal speech are presented.
Abstract: In this paper, we present novel speaking-aid systems based on one-to-many eigenvoice conversion (EVC) to enhance three types of alaryngeal speech: esophageal speech, electrolaryngeal speech, and body-conducted silent electrolaryngeal speech. Although alaryngeal speech allows laryngectomees to utter speech sounds, it suffers from the lack of speech quality and speaker individuality. To improve the speech quality of alaryngeal speech, alaryngeal-speech-to-speech (AL-to-Speech) methods based on statistical voice conversion have been proposed. In this paper, one-to-many EVC capable of flexibly controlling the converted voice quality by adapting the conversion model to given target natural voices is further implemented for the AL-to-Speech methods to effectively recover speaker individuality of each type of alaryngeal speech. These proposed systems are compared with each other from various perspectives. The experimental results demonstrate that our proposed systems are capable of effectively addressing the issues of alaryngeal speech, e.g., yielding significant improvements in speech quality of each type of alaryngeal speech.

Proceedings ArticleDOI
04 May 2014
TL;DR: Objective evaluations using perceptual evaluation of speech quality (PESQ) indicate that the proposed segmental NMF (SNMF) speech enhancement scheme increases the sound quality in noise conditions and outperforms the well-known MMSE log-spectral amplitude (LSA) estimation.
Abstract: The conventional NMF-based speech enhancement algorithm analyzes the magnitude spectrograms of both clean speech and noise in the training data via NMF and estimates a set of spectral basis vectors. These basis vectors are used to span a space to approximate the magnitude spectrogram of the noise-corrupted testing utterances. Finally, the components associated with the clean-speech spectral basis vectors are used to construct the updated magnitude spectrogram, producing an enhanced speech utterance. Considering that the rich spectral-temporal structure may be explored in local frequency and time-varying spectral patches, this study proposes a segmental NMF (SNMF) speech enhancement scheme to improve the conventional frame-wise NMF-based method. Two algorithms are derived to decompose the original nonnegative matrix associated with the magnitude spectrogram; the first algorithm is used in the spectral domain and the second algorithm is used in the temporal domain. When using the decomposition processes, noisy speech signals can be modeled more precisely, and spectrograms regarding the speech part can be constituted more favorably compared with using the conventional NMF-based method. Objective evaluations using perceptual evaluation of speech quality (PESQ) indicate that the proposed SNMF strategy increases the sound quality in noise conditions and outperforms the well-known MMSE log-spectral amplitude (LSA) estimation.
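
For reference, the conventional frame-wise NMF enhancement that the segmental scheme improves upon can be sketched as follows: activations are estimated on fixed, pre-learned speech and noise bases and the speech part is kept via a Wiener-like mask (Euclidean multiplicative updates are used here purely for simplicity).

```python
import numpy as np

def nmf_activations(V, W, n_iter=100, eps=1e-10):
    """Multiplicative updates (Euclidean cost) for H in V ~ W H with W fixed."""
    H = np.random.default_rng(0).random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def nmf_enhance(noisy_mag, W_speech, W_noise, eps=1e-10):
    """Frame-wise NMF enhancement with pre-learned speech and noise bases."""
    W = np.hstack([W_speech, W_noise])
    H = nmf_activations(noisy_mag, W)
    ks = W_speech.shape[1]
    speech_part = W_speech @ H[:ks]
    noise_part = W_noise @ H[ks:]
    mask = speech_part / (speech_part + noise_part + eps)   # Wiener-like mask
    return mask * noisy_mag
```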

Journal ArticleDOI
TL;DR: Instrumental measures predict that by incorporating uncertain prior information of the phase, the quality and intelligibility of processed speech can be improved both over traditional phase insensitive approaches, and approaches that treat prior information on the phase as deterministic.
Abstract: While most short-time discrete Fourier transform-based single-channel speech enhancement algorithms only modify the noisy spectral amplitude, in recent years the interest in phase processing has increased in the field. The goal of this paper is twofold. First, we derive Bayesian probability density functions and estimators for the clean speech phase when different amounts of prior knowledge about the speech and noise amplitudes are given. Second, we derive a joint Bayesian estimator of the clean speech amplitudes and phases, when uncertain a priori knowledge on the phase is available. Instrumental measures predict that by incorporating uncertain prior information of the phase, the quality and intelligibility of processed speech can be improved both over traditional phase insensitive approaches, and approaches that treat prior information on the phase as deterministic.

Journal ArticleDOI
TL;DR: Using data from three different listening tests it is shown that the proposed objective intelligibility measures provide promising results for speech intelligibility prediction in different scenarios of speech enhancement where speech is processed by non-linear modification strategies.
Abstract: We propose a novel method for objective speech intelligibility prediction which can be useful in many application domains such as hearing instruments and forensics. Most objective intelligibility measures available in the literature employ some kind of signal-to-noise ratio (SNR) or a correlation-based comparison between the spectro-temporal representations of clean and processed speech. In this paper, we investigate the speech intelligibility prediction from the viewpoint of information theory and introduce novel objective intelligibility measures based on the estimated mutual information between the temporal envelopes of clean speech and processed speech in the subband domain. Mutual information allows to account for higher order statistics and hence to consider dependencies beyond the conventional second order statistics. Using data from three different listening tests it is shown that the proposed objective intelligibility measures provide promising results for speech intelligibility prediction in different scenarios of speech enhancement where speech is processed by non-linear modification strategies.
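
The central quantity can be sketched as a histogram-based mutual information estimate between the clean and processed temporal envelopes of one subband; the paper's actual estimator and the aggregation across subbands are more refined than this illustration.

```python
import numpy as np

def envelope_mi(clean_env, proc_env, n_bins=16):
    """Histogram-based mutual information (in bits) between the clean and
    processed temporal envelopes of one subband."""
    joint, _, _ = np.histogram2d(clean_env, proc_env, bins=n_bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

An overall intelligibility score would then average such estimates over the subbands.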

Journal ArticleDOI
TL;DR: Compared to the traditional DMAs, the proposed designs are more robust with respect to white noise amplification while they are capable of achieving similar directional gains.
Abstract: Differential microphone arrays (DMAs), due to their small size and enhanced directivity, are quite promising in speech enhancement applications. However, it is well known that differential beamformers have the drawback of white noise amplification, which is a major issue in the processing of wideband signals such as speech. In this paper, we focus on the design of robust DMAs. Based on the Maclaurin series approximation and frequency-independent beampatterns, the robust first-, second-, and third-order DMAs are proposed by using more microphones than the order plus one, and the corresponding minimum-norm filters are derived. Compared to the traditional DMAs, the proposed designs are more robust with respect to white noise amplification while they are capable of achieving similar directional gains.
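
A minimal numpy sketch of the minimum-norm filter idea: with more microphones than constraints, the weights that satisfy the distortionless and null constraints with the smallest norm are h = D^H (D D^H)^{-1} c, which is what limits white noise amplification. The array geometry and constraint angles below are illustrative assumptions, not the paper's designs.

```python
import numpy as np

def min_norm_dma_filter(freq_hz, mic_positions_m, theta_constraints, c=343.0):
    """Minimum-norm beamforming filter for a linear endfire array: solve
    D(omega) h = c_vec with minimum ||h|| given constraint (angle, value) pairs."""
    omega = 2.0 * np.pi * freq_hz
    angles, targets = zip(*theta_constraints)
    # Row of D: steering vector of the array toward one constraint angle.
    D = np.array([np.exp(-1j * omega * mic_positions_m * np.cos(th) / c)
                  for th in angles])
    c_vec = np.array(targets, dtype=complex)
    return D.conj().T @ np.linalg.solve(D @ D.conj().T, c_vec)

# First-order cardioid-like design with 3 closely spaced microphones:
# distortionless at 0 rad, null at pi rad.
h = min_norm_dma_filter(1000.0, np.array([0.0, 0.01, 0.02]),
                        [(0.0, 1.0), (np.pi, 0.0)])
```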

Proceedings ArticleDOI
12 May 2014
TL;DR: A task-supervised NMF method is proposed for adapting the basis spectra learned in the first stage to enhance the performance on the specific task used in the second stage; the problem is cast as a bilevel optimization program that can be efficiently solved via stochastic gradient descent.
Abstract: Traditionally, NMF algorithms consist of two separate stages: a training stage, in which a generative model is learned; and a testing stage in which the pre-learned model is used in a high level task such as enhancement, separation, or classification. As an alternative, we propose a task-supervised NMF method for the adaptation of the basis spectra learned in the first stage to enhance the performance on the specific task used in the second stage. We cast this problem as a bilevel optimization program that can be efficiently solved via stochastic gradient descent. The proposed approach is general enough to handle sparsity priors of the activations, and allow non-Euclidean data terms such as β-divergences. The framework is evaluated on single-channel speech enhancement tasks.


Proceedings ArticleDOI
12 May 2014
TL;DR: Subjective and objective evaluation results show that the proposed automated tuning methodology greatly improves the enhanced speech quality, potentially saving resources over manual evaluation, speeding up development and deployment time, and guiding the algorithmic design.
Abstract: In this paper, we propose a formal methodology for tuning the parameters of a single-microphone speech enhancement system for hands-free devices. The tuning problem is formulated as a large-scale nonlinear programming problem that is solved by a genetic algorithm to determine the global solution. A conversational speech database is automatically generated by modeling the interactivity in telephone conversations, and perceptual objective quality measures are used as the optimization criteria for the automated tuning over the generated database. A subjective listening test is then performed by comparing the automatically tuned system based on objective criteria to the system tuned by expert human listeners. Subjective and objective evaluation results show that the proposed automated tuning methodology greatly improves the enhanced speech quality, potentially saving resources over manual evaluation, speeding up development and deployment time, and guiding the algorithmic design.
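
The tuning loop can be sketched as a simple genetic algorithm over the enhancement parameters; `quality_fn` is a hypothetical wrapper that would run the enhancer over the generated conversational database and return an average objective quality score, and all GA settings are illustrative.

```python
import numpy as np

def tune_parameters(quality_fn, bounds, pop_size=30, n_gens=50, rng=None):
    """Evolve a population of parameter vectors, keeping those with the best
    objective quality score (selection, uniform crossover, Gaussian mutation)."""
    rng = rng or np.random.default_rng(0)
    lo, hi = np.array(bounds).T                    # bounds: list of (min, max) pairs
    pop = rng.uniform(lo, hi, size=(pop_size, len(bounds)))
    for _ in range(n_gens):
        scores = np.array([quality_fn(p) for p in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]         # keep the best half
        kids = []
        while len(kids) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(len(a)) < 0.5, a, b)        # uniform crossover
            child = child + rng.normal(0, 0.05 * (hi - lo))          # Gaussian mutation
            kids.append(np.clip(child, lo, hi))
        pop = np.vstack([parents, kids])
    scores = np.array([quality_fn(p) for p in pop])
    return pop[int(np.argmax(scores))]
```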

Proceedings ArticleDOI
09 Jul 2014
TL;DR: Experimental results show that the quality of the estimated clean speech signal is improved both subjectively and objectively in terms of perceptual evaluation of speech quality (PESQ), especially in mismatch environments where the additive noise is not seen in the DNN training.
Abstract: We address an over-smoothing issue of enhanced speech in deep neural network (DNN) based speech enhancement and propose a global variance equalization framework with two schemes, namely post-processing and post-training with a modified objective function, for the equalization between the global variance of the estimated and the reference speech. Experimental results show that the quality of the estimated clean speech signal is improved both subjectively and objectively in terms of perceptual evaluation of speech quality (PESQ), especially in mismatched environments where the additive noise is not seen in the DNN training.
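
A minimal numpy sketch of the post-processing variant of global variance equalization: the estimated features are rescaled around their mean so that their per-bin global variance matches that of clean reference speech. The exact normalization used in the paper may differ.

```python
import numpy as np

def gv_equalize(est_logspec, ref_gv):
    """Rescale estimated log-spectra (frames x bins) so their per-bin global
    variance matches ref_gv, which would be measured on clean training data."""
    mean = est_logspec.mean(axis=0, keepdims=True)
    est_gv = est_logspec.var(axis=0, keepdims=True) + 1e-12
    return mean + np.sqrt(ref_gv / est_gv) * (est_logspec - mean)
```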

Journal ArticleDOI
TL;DR: An alternating projection algorithm is developed to separate the speech and noise magnitude spectra by imposing rank and sparsity constraints, with which the enhanced time-domain speech can be constructed from the sparse matrix by inverse discrete Fourier transform and overlap-add synthesis.

Journal ArticleDOI
TL;DR: This work combines a state-of-the-art GMM system with a deep Long Short-Term Memory (LSTM) recurrent neural network in a double-stream architecture; the LSTM's memory cells enable it to learn long-range temporal context, thus increasing the robustness against noise and reverberation.
Abstract: In this article we address the problem of distant speech recognition for reverberant noisy environments. Speech enhancement methods, e.g., using non-negative matrix factorization (NMF), are successful in improving the robustness of ASR systems. Furthermore, discriminative training and feature transformations are employed to increase the robustness of traditional systems using Gaussian mixture models (GMM). On the other hand, acoustic models based on deep neural networks (DNN) were recently shown to outperform GMMs. In this work, we combine a state-of-the-art GMM system with a deep Long Short-Term Memory (LSTM) recurrent neural network in a double-stream architecture. Such networks use memory cells in the hidden units, enabling them to learn long-range temporal context, and thus increasing the robustness against noise and reverberation. The network is trained to predict frame-wise phoneme estimates, which are converted into observation likelihoods to be used as an acoustic model. It is of particular interest whether the LSTM system is capable of improving a robust state-of-the-art GMM system, which is confirmed in the experimental results. In addition, we investigate the efficiency of NMF for speech enhancement on the front-end side. Experiments are conducted on the medium-vocabulary task of the 2nd ‘CHiME’ Speech Separation and Recognition Challenge, which includes reverberation and highly variable noise. Experimental results show that the average word error rate of the challenge baseline is reduced by 64% relative. The best challenge entry, a noise-robust state-of-the-art recognition system, is outperformed by 25% relative.
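
The posterior-to-likelihood conversion mentioned above is a standard hybrid NN/HMM step and can be sketched in a single numpy function; the flooring constant is an illustrative assumption.

```python
import numpy as np

def posteriors_to_loglikelihoods(posteriors, priors, floor=1e-8):
    """Convert frame-wise phoneme posteriors into scaled observation
    log-likelihoods for the HMM decoder by dividing by the class priors."""
    return np.log(np.maximum(posteriors, floor)) - np.log(np.maximum(priors, floor))
```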

Journal ArticleDOI
TL;DR: If the speech correlation is properly estimated, the previously derived subband filters discussed in this work show significantly less speech distortion than conventional noise reduction algorithms; the quality and intelligibility of the processed signals are predicted by objective measures.
Abstract: Recently, it has been proposed to use the minimum-variance distortionless-response (MVDR) approach in single-channel speech enhancement in the short-time frequency domain. By applying optimal FIR filters to each subband signal, these filters reduce additive noise components with less speech distortion compared to conventional approaches. An important ingredient to these filters is the temporal correlation of the speech signals. We derive algorithms to provide a blind estimation of this quantity based on a maximum-likelihood and maximum a-posteriori estimation. To derive proper models for the inter-frame correlation of the speech and noise signals, we investigate their statistics on a large dataset. If the speech correlation is properly estimated, the previously derived subband filters discussed in this work show significantly less speech distortion compared to conventional noise reduction algorithms. Therefore, the focus of the experimental parts of this work lies on the quality and intelligibility of the processed signals. To evaluate the performance of the subband filters in combination with the clean speech inter-frame correlation estimators, we predict the speech quality and intelligibility by objective measures.

Journal ArticleDOI
TL;DR: Experimental results prove that incorporating the proposed blind spectral weighting (BSW) technique into the standard MFCC feature extraction framework yields significant improvement in SID performance under reverberation mismatch.
Abstract: Room reverberation poses various deleterious effects on performance of automatic speech systems. Speaker identification (SID) performance, in particular, degrades rapidly as reverberation time increases. Reverberation causes two forms of spectro-temporal distortions on speech signals: i) self-masking which is due to early reflections and ii) overlap-masking which is due to late reverberation. Overlap-masking effect of reverberation has been shown to have a greater adverse impact on performance of speech systems. Motivated by this fact, this study proposes a blind spectral weighting (BSW) technique for suppressing the reverberation overlap-masking effect on SID systems. The technique is blind in the sense that prior knowledge of neither the anechoic signal nor the room impulse response is required. Performance of the proposed technique is evaluated on speaker verification tasks under simulated and actual reverberant mismatched conditions. Evaluations are conducted in the context of the conventional GMM-UBM as well as the state-of-the-art i-vector based systems. The GMM-UBM experiments are performed using speech material from a new data corpus well suited for speaker verification experiments under actual reverberant mismatched conditions, entitled MultiRoom8. The i-vector experiments are carried out with microphone (interview and phonecall) data from the NIST SRE 2010 extended evaluation set which are digitally convolved with three different measured room impulse responses extracted from the Aachen impulse response (AIR) database. Experimental results prove that incorporating the proposed blind technique into the standard MFCC feature extraction framework yields significant improvement in SID performance under reverberation mismatch.

Journal ArticleDOI
TL;DR: A novel method for high resolution source localization based on the group delay of MUSIC is described; it resolves spatially close sources even under reverberation owing to its spatial additive property.
Abstract: Subspace-based source localization methods utilize the spectral magnitude of the MUltiple SIgnal Classification (MUSIC) method. However, in all these methods, a large number of sensors are required to resolve closely spaced sources. A novel method for high resolution source localization based on the group delay of MUSIC is described in this work. The method can resolve both the azimuth and elevation angles of closely spaced sources using a minimal number of sensors over a planar array. At the direction of arrival (DOA) of the desired source, a transition is observed in the phase spectrum of MUSIC. The negative differential of the phase spectrum also called group delay, results in a peak at the DOA. The proposed MUSIC-Group delay spectrum defined as product of MUSIC-Magnitude (MM) and group delay spectra, resolves spatially close sources even under reverberation owing to its spatial additive property. This is illustrated by performing spectral analysis of the MUSIC-Group delay function under reverberant environments. A mathematical proof for the spatial additive property of group delay spectrum is also provided. Source localization error analysis, sensor perturbation analysis, and Cramer-Rao bound (CRB) analysis are then performed to verify the robustness of the MUSIC-Group delay method. Experiments on speech enhancement and distant speech recognition are also conducted on spatialized TIMIT and MONC databases. Experimental results obtained using objective performance measures and word error rates (WER) indicate reasonable robustness when compared to conventional source localization methods in literature.

Proceedings ArticleDOI
20 Nov 2014
TL;DR: A microphone array post-filter design effective in various noise environments is proposed for implementation in a hands-free speech interface; the post-filter is designed by estimating the power spectral density of each target/interference source and the background noise to reduce the estimation error of the post-filter.
Abstract: A microphone array post-filter design effective in various noise environments is proposed for implementation in a hands-free speech interface. Although various post-filter designs have been studied, few methods have been effective in practical use since different types of sound sources, such as spatially coherent target/interference sources and incoherent background noise, are included in the observation signals. This paper describes a post-filter that was designed after estimating the power spectral density (PSD) of each target/interference source and background noise to reduce the estimation error of the post-filter. This was achieved by assuming that the background noise was composed of stationary signals. Through experiments involving four microphones, we confirmed that our proposed method works well in certain noisy environments.
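
A minimal sketch of how a Wiener-type post-filter gain could be formed from the estimated PSDs; the PSD estimation under the stationary-background-noise assumption is the paper's actual contribution and is not reproduced here, and the spectral floor value is illustrative.

```python
import numpy as np

def postfilter_gain(psd_target, psd_interf, psd_noise, floor=0.05):
    """Wiener-type post-filter gain per time-frequency bin, built from the
    estimated PSDs of the target, the interference, and the background noise."""
    gain = psd_target / (psd_target + psd_interf + psd_noise + 1e-12)
    return np.maximum(gain, floor)     # spectral floor to limit musical noise
```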