
Showing papers in "IEEE Transactions on Speech and Audio Processing in 2003"


Journal ArticleDOI
TL;DR: In this article, an improved minima controlled recursive averaging (IMCRA) approach is proposed for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR).
Abstract: Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. We present an improved minima controlled recursive averaging (IMCRA) approach for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Then, smoothing in the second iteration excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in nonstationary noise environments and under low SNR conditions, the IMCRA approach is very effective. In particular, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.
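The core recursion of such minima-controlled estimators can be sketched in a few lines. This is a minimal illustration of the time-varying smoothing update only; the function name, the base smoothing constant, and the way the speech presence probability is obtained are assumptions, not the paper's full IMCRA procedure:

```python
import numpy as np

def update_noise_estimate(noise_psd, frame_psd, p_speech, alpha_d=0.85):
    """One recursive-averaging update per frequency bin.

    noise_psd : previous noise PSD estimate, per bin
    frame_psd : current frame's spectral power, per bin
    p_speech  : speech presence probability, per bin, in [0, 1]
    alpha_d   : base smoothing constant (illustrative value)

    The effective smoothing parameter is pushed toward 1 where speech is
    likely present, so the noise estimate is frozen during speech activity.
    """
    alpha = alpha_d + (1.0 - alpha_d) * p_speech
    return alpha * noise_psd + (1.0 - alpha) * frame_psd
```

With `p_speech = 0` the estimate tracks the incoming power; with `p_speech = 1` it is held fixed, which is what makes the estimator robust to weak speech leakage.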

902 citations


Journal ArticleDOI
TL;DR: This paper, Part II, generalizes the basic BCC schemes presented in Part I and includes BCC for multichannel signals and employs an enhanced set of perceptual spatial cues for BCC synthesis.
Abstract: Binaural Cue Coding (BCC) is a method for multichannel spatial rendering based on one down-mixed audio channel and side information. The companion paper (Part I) covers the psychoacoustic fundamentals of this method and outlines principles for the design of BCC schemes. The BCC analysis and synthesis methods of Part I are motivated and presented in the framework of stereophonic audio coding. This paper, Part II, generalizes the basic BCC schemes presented in Part I. It includes BCC for multichannel signals and employs an enhanced set of perceptual spatial cues for BCC synthesis. A scheme for multichannel audio coding is presented. Moreover, a modified scheme is derived that allows flexible rendering of the spatial image at the receiver supporting dynamic control. All aspects of complete BCC encoder and decoder implementations are discussed, such as down-mixing of the input signals, low complexity estimation of the spatial cues, and quantization and coding of the side information. Application examples are given and the performance of the coder implementations are evaluated and discussed based on subjective listening test results.

464 citations


Journal ArticleDOI
TL;DR: A generalized subspace approach is proposed for enhancement of speech corrupted by colored noise using a nonunitary transform based on the simultaneous diagonalization of the clean speech and noise covariance matrices to project the noisy signal onto a signal-plus-noise subspace and a noise subspace.
Abstract: A generalized subspace approach is proposed for enhancement of speech corrupted by colored noise. A nonunitary transform, based on the simultaneous diagonalization of the clean speech and noise covariance matrices, is used to project the noisy signal onto a signal-plus-noise subspace and a noise subspace. The clean signal is estimated by nulling the signal components in the noise subspace and retaining the components in the signal subspace. The applied transform has built-in prewhitening and can therefore be used in general for colored noise. The proposed approach is shown to be a generalization of the approach proposed by Y. Ephraim and H.L. Van Trees (see ibid., vol.3, p.251-66, 1995) for white noise. Two estimators are derived based on the nonunitary transform, one based on time-domain constraints and one based on spectral domain constraints. Objective and subjective measures demonstrate improvements over other subspace-based methods when tested with TIMIT sentences corrupted with speech-shaped noise and multi-talker babble.
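The simultaneous diagonalization at the heart of this nonunitary transform can be sketched via noise-covariance whitening; this is the generic linear-algebra construction (names are illustrative, and the paper's estimator details are omitted):

```python
import numpy as np

def joint_diagonalizer(R_speech, R_noise):
    """Build the nonunitary transform V that simultaneously diagonalizes
    the clean-speech and noise covariance matrices:
        V.T @ R_noise @ V = I,   V.T @ R_speech @ V = diag(lam).
    Implemented by whitening with respect to the noise covariance (the
    built-in prewhitening mentioned above) and eigendecomposing the
    whitened speech covariance.
    """
    d, U = np.linalg.eigh(R_noise)
    W = U @ np.diag(d ** -0.5) @ U.T          # R_noise^{-1/2} (prewhitening)
    lam, Q = np.linalg.eigh(W @ R_speech @ W)
    V = W @ Q
    return lam, V
```

Because `V` diagonalizes both covariances, noise-subspace components can be nulled in the transformed domain without assuming the noise is white.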

382 citations


Journal ArticleDOI
TL;DR: It is shown that there is an optimum frame size that is determined by the trade-off between maintaining the number of samples in each frequency bin to estimate statistics and covering the whole reverberation, and that it is not good to be constrained by the condition T>P.
Abstract: Despite several recent proposals to achieve blind source separation (BSS) for realistic acoustic signals, the separation performance is still not good enough. In particular, when the impulse responses are long, performance is highly limited. In this paper, we consider a two-input, two-output convolutive BSS problem. First, we show that it is not good to be constrained by the condition T>P, where T is the frame length of the DFT and P is the length of the room impulse responses. We show that there is an optimum frame size that is determined by the trade-off between maintaining the number of samples in each frequency bin to estimate statistics and covering the whole reverberation. We also clarify the reason for the poor performance of BSS in long reverberant environments, highlighting that the framework of BSS works as two sets of frequency-domain adaptive beamformers. Although BSS can reduce reverberant sound to some extent, like adaptive beamformers it mainly removes the sound arriving from the jammer direction. This is the reason for the difficulty of BSS in reverberant environments.

360 citations


Journal ArticleDOI
TL;DR: The spectral smoothness principle is proposed as an efficient new mechanism for estimating the spectral envelopes of detected sounds; the method works robustly in noise and is able to handle sounds that exhibit inharmonicities.
Abstract: A new method for estimating the fundamental frequencies of concurrent musical sounds is described. The method is based on an iterative approach, where the fundamental frequency of the most prominent sound is estimated, the sound is subtracted from the mixture, and the process is repeated for the residual signal. For the estimation stage, an algorithm is proposed which utilizes the frequency relationships of simultaneous spectral components, without assuming ideal harmonicity. For the subtraction stage, the spectral smoothness principle is proposed as an efficient new mechanism in estimating the spectral envelopes of detected sounds. With these techniques, multiple fundamental frequency estimation can be performed quite accurately in a single time frame, without the use of long-term temporal features. The experimental data comprised recorded samples of 30 musical instruments from four different sources. Multiple fundamental frequency estimation was performed for random sound source and pitch combinations. Error rates for mixtures ranging from one to six simultaneous sounds were 1.8%, 3.9%, 6.3%, 9.9%, 14%, and 18%, respectively. In musical interval and chord identification tasks, the algorithm outperformed the average of ten trained musicians. The method works robustly in noise, and is able to handle sounds that exhibit inharmonicities. The inharmonicity factor and spectral envelope of each sound are estimated along with the fundamental frequency.
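The estimate-and-subtract loop can be illustrated on a discretized magnitude spectrum with an idealized harmonic-sum salience. This stand-in is far simpler than the paper's algorithm: the salience measure, the candidate F0 grid, and the hard zeroing of partials (in place of smooth-envelope subtraction) are all assumptions made for illustration:

```python
import numpy as np

def iterative_f0_estimation(spectrum, freqs, n_sounds, n_harm=10):
    """Estimate-and-subtract loop: pick the F0 whose harmonic positions
    carry the most energy, zero those partials, repeat on the residual.
    """
    residual = spectrum.astype(float).copy()
    candidates = freqs[(freqs > 60.0) & (freqs < 600.0)]
    f0s = []
    for _ in range(n_sounds):
        saliences = [
            residual[[np.argmin(np.abs(freqs - h * f0))
                      for h in range(1, n_harm + 1)]].sum()
            for f0 in candidates
        ]
        best_f0 = float(candidates[int(np.argmax(saliences))])
        f0s.append(best_f0)
        for h in range(1, n_harm + 1):   # remove the detected sound's partials
            residual[np.argmin(np.abs(freqs - h * best_f0))] = 0.0
    return f0s
```

On a synthetic mixture of two harmonic combs, the loop first finds the stronger F0, removes its partials, and then recovers the weaker one from the residual.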

356 citations


Journal ArticleDOI
TL;DR: A general framework for tracking an acoustic source using particle filters is formulated and four specific algorithms that fit within this framework are discussed, and results indicate that the proposed family of algorithms are able to accurately track a moving source in a moderately reverberant room.
Abstract: Traditional acoustic source localization algorithms attempt to find the current location of the acoustic source using data collected at an array of sensors at the current time only. In the presence of strong multipath, these traditional algorithms often erroneously locate a multipath reflection rather than the true source location. A recently proposed approach that appears promising in overcoming this drawback of traditional algorithms is a state-space approach using particle filtering. In this paper we formulate a general framework for tracking an acoustic source using particle filters. We discuss four specific algorithms that fit within this framework, and demonstrate their performance using both simulated reverberant data and data recorded in a moderately reverberant office room (with a measured reverberation time of 0.39 s). The results indicate that the proposed family of algorithms is able to accurately track a moving source in a moderately reverberant room.
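A single predict-weight-resample cycle of a bootstrap particle filter, the building block shared by such state-space trackers, might look like this for a scalar source position. The random-walk motion model and the caller-supplied likelihood are assumptions: in an acoustic tracker the likelihood would be derived from the microphone-array data.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, observe_loglik, motion_std=0.05):
    """One predict-weight-resample cycle of a bootstrap particle filter."""
    # predict: propagate particles through a random-walk motion model
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # update: weight particles by the observation log-likelihood
    logw = np.log(weights) + observe_loglik(particles)
    logw -= logw.max()
    weights = np.exp(logw)
    weights /= weights.sum()
    # resample: draw a new, equally weighted particle set
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```

Iterating this step lets the particle cloud follow the likelihood peak rather than a single-frame location estimate, which is what makes the approach robust to spurious multipath peaks.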

353 citations


Journal ArticleDOI
TL;DR: A more general expression for the post-filter estimation is developed based on an assumed knowledge of the complex coherence of the noise field, which can be used to construct a more appropriate post-filter in a variety of noise fields.
Abstract: This paper introduces a novel technique for estimating the signal power spectral density to be used in the transfer function of a microphone array post-filter. The technique is a generalization of the existing Zelinski post-filter, which uses the auto- and cross-spectral densities of the array inputs to estimate the signal and noise spectral densities. The Zelinski technique, however, assumes zero cross-correlation between the noise on different sensors. This assumption is inaccurate, particularly at low frequencies and for arrays with closely spaced sensors, and thus the corresponding post-filter is suboptimal in realistic noise conditions. In this paper, a more general expression of the post-filter estimation is developed based on an assumed knowledge of the complex coherence of the noise field. This general expression can be used to construct a more appropriate post-filter in a variety of different noise fields. In experiments using real noise recordings from a computer office, the modified post-filter results in significant improvement in terms of objective speech quality measures and speech recognition performance using a diffuse noise model.
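For example, for a spherically isotropic (diffuse) field, the complex coherence between two omnidirectional sensors reduces to a real sinc function of frequency and spacing; this is the standard textbook quantity such a generalized post-filter would plug in (sketched here, not taken from the paper's code):

```python
import numpy as np

def diffuse_coherence(f, d, c=343.0):
    """Coherence of a diffuse noise field between two omnidirectional
    sensors spaced d meters apart:
        Gamma(f) = sin(2*pi*f*d/c) / (2*pi*f*d/c)
    np.sinc(t) computes sin(pi*t)/(pi*t), hence the argument scaling.
    """
    return np.sinc(2.0 * f * d / c)
```

The coherence is 1 at DC and decays toward zero with frequency, which is why the zero-cross-correlation assumption of the Zelinski post-filter fails most severely at low frequencies and for closely spaced sensors.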

316 citations


Journal ArticleDOI
TL;DR: This work presents a robust algorithm for multipitch tracking of noisy speech that combines an improved channel and peak selection method, a new method for extracting periodicity information across different channels, and a hidden Markov model for forming continuous pitch tracks.
Abstract: An effective multipitch tracking algorithm for noisy speech is critical for acoustic signal processing. However, the performance of existing algorithms is not satisfactory. We present a robust algorithm for multipitch tracking of noisy speech. Our approach integrates an improved channel and peak selection method, a new method for extracting periodicity information across different channels, and a hidden Markov model (HMM) for forming continuous pitch tracks. The resulting algorithm can reliably track single and double pitch tracks in a noisy environment. We suggest a pitch error measure for the multipitch situation. The proposed algorithm is evaluated on a database of speech utterances mixed with various types of interference. Quantitative comparisons show that our algorithm significantly outperforms existing ones.

308 citations


Journal ArticleDOI
TL;DR: This paper discusses the most relevant binaural perception phenomena exploited by BCC and presents a psychoacoustically motivated approach for designing a BCC analyzer and synthesizer and suggests that the performance given by the reference synthesizer is not significantly compromised when using a low-complexity FFT-based synthesizer.
Abstract: Binaural Cue Coding (BCC) is a method for multichannel spatial rendering based on one down-mixed audio channel and BCC side information. The BCC side information has a low data rate and it is derived from the multichannel encoder input signal. A natural application of BCC is multichannel audio data rate reduction since only a single down-mixed audio channel needs to be transmitted. An alternative BCC scheme for efficient joint transmission of independent source signals supports flexible spatial rendering at the decoder. This paper (Part I) discusses the most relevant binaural perception phenomena exploited by BCC. Based on that, it presents a psychoacoustically motivated approach for designing a BCC analyzer and synthesizer. This leads to a reference implementation for analysis and synthesis of stereophonic audio signals based on a Cochlear Filter Bank. BCC synthesizer implementations based on the FFT are presented as low-complexity alternatives. A subjective audio quality assessment of these implementations shows the robust performance of BCC for critical speech and audio material. Moreover, the results suggest that the performance given by the reference synthesizer is not significantly compromised when using a low-complexity FFT-based synthesizer. The companion paper (Part II) generalizes BCC analysis and synthesis for multichannel audio and proposes complete BCC schemes including quantization and coding. Part II also describes an alternative BCC scheme with flexible rendering capability at the decoder and proposes several applications for both BCC schemes.

231 citations


Journal ArticleDOI
TL;DR: Through an analysis of age-related acoustic characteristics of children's speech in the context of automatic speech recognition (ASR), effects such as frequency scaling of spectral envelope parameters are demonstrated, and a speaker normalization algorithm that combines frequency warping and model transformation is shown to reduce acoustic variability.
Abstract: Developmental changes in speech production introduce age-dependent spectral and temporal variability in the speech signal produced by children. Such variabilities pose challenges for robust automatic recognition of children's speech. Through an analysis of age-related acoustic characteristics of children's speech in the context of automatic speech recognition (ASR), effects such as frequency scaling of spectral envelope parameters are demonstrated. Recognition experiments using acoustic models trained from adult speech and tested against speech from children of various ages clearly show performance degradation with decreasing age. On average, the word error rates are two to five times worse for children's speech than for adult speech. Various techniques for improving ASR performance on children's speech are reported. A speaker normalization algorithm that combines frequency warping and model transformation is shown to reduce acoustic variability and significantly improve ASR performance for child speakers (by 25-45% under various model training and testing conditions). The use of age-dependent acoustic models further reduces word error rate by 10%. The potential of using piece-wise linear and phoneme-dependent frequency warping algorithms for reducing the variability in the acoustic feature space of children is also investigated.
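A piecewise linear frequency warp of the kind investigated above can be sketched as follows; the breakpoint fraction, parameter names, and Nyquist value are illustrative choices, not values from the paper:

```python
import numpy as np

def piecewise_linear_warp(freqs, alpha, f_nyq=8000.0, break_frac=0.875):
    """Scale frequencies by warp factor alpha below a breakpoint, then
    connect the warped breakpoint linearly to the (fixed) Nyquist
    frequency, so the warped axis still spans [0, f_nyq].
    """
    fb = break_frac * f_nyq
    low = alpha * freqs
    high = alpha * fb + (f_nyq - alpha * fb) * (freqs - fb) / (f_nyq - fb)
    return np.where(freqs <= fb, low, high)
```

Applying such a warp to the filterbank frequencies (with alpha > 1 for shorter vocal tracts) is one common way to normalize children's spectra toward adult-trained models.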

213 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed watermarking algorithm is sustainable against compression algorithms such as MP3 and AAC, as well as common signal processing manipulation attacks.
Abstract: The paper presents the modified patchwork algorithm (MPA), a statistical technique for an audio watermarking algorithm in the transform (not only discrete cosine transform (DCT), but also DFT and DWT) domain. The MPA is an enhanced version of the conventional patchwork algorithm. The MPA is sufficiently robust to withstand some attacks defined by the Secure Digital Music Initiative (SDMI). Experimental results show that the proposed watermarking algorithm is sustainable against compression algorithms such as MP3 and AAC, as well as common signal processing manipulation attacks.

Journal ArticleDOI
TL;DR: The most frequently used approach, based on a modified Hidden Markov Model (HMM) phonetic recognizer, is analyzed; a general framework for the local refinement of boundaries is proposed, and the performance of several pattern classification approaches is compared within this framework.
Abstract: This paper presents the results and conclusions of a thorough study on automatic phonetic segmentation. It starts with a review of the state of the art in this field. Then, it analyzes the most frequently used approach, based on a modified Hidden Markov Model (HMM) phonetic recognizer. For this approach, a statistical correction procedure is proposed to compensate for the systematic errors produced by context-dependent HMMs, and the use of speaker adaptation techniques is considered to increase segmentation precision. Finally, this paper explores the possibility of locally refining the boundaries obtained with the former techniques. A general framework is proposed for the local refinement of boundaries, and the performance of several pattern classification approaches (fuzzy logic, neural networks and Gaussian mixture models) is compared within this framework. The resulting phonetic segmentation scheme was able to increase the performance of a baseline HMM segmentation tool from 27.12%, 79.27%, and 97.75% of automatic boundary marks with errors smaller than 5, 20, and 50 ms, respectively, to 65.86%, 96.01%, and 99.31% in speaker-dependent mode, which is a reasonably good approximation to manual segmentation.

Journal ArticleDOI
TL;DR: This paper presents a Frequency to Eigendomain Transformation (FET) that permits the calculation of a perceptually based eigenfilter, yielding improved shaping of the residual noise from a perceptual perspective.
Abstract: The major drawback of most noise reduction methods in speech applications is the annoying residual noise known as musical noise. A potential solution to this artifact is the incorporation of a human hearing model in the suppression filter design. However, since the available models are usually developed in the frequency domain, it is not clear how they can be applied in the signal subspace approach for speech enhancement. In this paper, we present a Frequency to Eigendomain Transformation (FET) which permits the calculation of a perceptually based eigenfilter. This filter yields improved shaping of the residual noise from a perceptual perspective. The proposed method can also be used with the general case of colored noise. Spectrogram illustrations and listening test results are given to show the superiority of the proposed method over the conventional signal subspace approach.

Journal ArticleDOI
TL;DR: A simple but useful statistical model is developed for the room transfer function, enabling analysis of acoustical source localization methods when room reverberation is present; the so-called PHAT time-delay estimator is shown to be optimal among a class of cross-correlation based time-delay estimators.
Abstract: Room reverberation is typically the main obstacle for designing robust microphone-based source localization systems. The purpose of the paper is to analyze the achievable performance of acoustical source localization methods when room reverberation is present. To facilitate the analysis, we apply well known results from room acoustics to develop a simple but useful statistical model for the room transfer function. The properties of the statistical model are found to correlate well with results from real data measurements. The room transfer function model is further applied to analyze the statistical properties of some existing methods for source localization. In this respect we consider especially the asymptotic error variance and the probability of an anomalous estimate. A noteworthy outcome of the analysis is that the so-called PHAT time-delay estimator is shown to be optimal among a class of cross-correlation based time-delay estimators. To verify our results on the error variance and the outlier probability we apply the image method for simulation of the room transfer function.
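The PHAT-weighted generalized cross-correlation analyzed here is straightforward to sketch: the cross-spectrum is normalized to unit magnitude so that only phase information contributes to the correlation peak. This is a generic implementation, not the paper's analysis code:

```python
import numpy as np

def gcc_phat_delay(x, y, fs):
    """Estimate the delay of y relative to x with the PHAT weighting."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cross = np.conj(X) * Y
    cross /= np.abs(cross) + 1e-15        # PHAT: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2]))   # center zero lag
    lag = int(np.argmax(np.abs(cc))) - n // 2
    return lag / fs
```

Discarding the magnitude spectrum equalizes the contribution of all frequencies, which is intuitively why PHAT holds up well when the room transfer function colors the signals.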

Journal ArticleDOI
TL;DR: Two array signal processing techniques are combined with independent component analysis (ICA) to enhance the performance of blind separation of acoustic signals in a reflective environment by using the subspace method, which reduces the effect of room reflection when the system is used in a room.
Abstract: Two array signal processing techniques are combined with independent component analysis (ICA) to enhance the performance of blind separation of acoustic signals in a reflective environment. The first technique is the subspace method which reduces the effect of room reflection when the system is used in a room. Room reflection is one of the biggest problems in blind source separation (BSS) in acoustic environments. The second technique is a method of solving permutation. For employing the subspace method, ICA must be used in the frequency domain, and precise permutation is necessary for all frequencies. In this method, a physical property of the mixing matrix, i.e., the coherency in adjacent frequencies, is utilized to solve the permutation. The experiments in a meeting room showed that the subspace method improved the rate of automatic speech recognition from 50% to 68% and that the method of solving permutation achieves performance that closely approaches that of the correct permutation, differing by only 4% in recognition rate.

Journal ArticleDOI
TL;DR: The simulation results show that the proposed soft VAD that uses a Laplacian distribution model for speech signals outperforms the previous VAD that uses a Gaussian model.
Abstract: A new voice activity detector (VAD) is developed in this paper. The VAD is derived by applying a Bayesian hypothesis test on decorrelated speech samples. The signal is first decorrelated using an orthogonal transformation, e.g., discrete cosine transform (DCT) or the adaptive Karhunen-Loeve transform (KLT). The distributions of clean speech and noise signals are assumed to be Laplacian and Gaussian, respectively, as investigated recently. In addition, a hidden Markov model (HMM) is employed with two states representing silence and speech. The proposed soft VAD estimates the probability of voice being active (VBA), recursively. To this end, first the a priori probability of VBA is estimated/predicted based on feedback information from the previous time instance. Then the predicted probability is combined/updated with the new observed signal to calculate the probability of VBA at the current time instance. The required parameters of both speech and noise signals are estimated, adaptively, by the maximum likelihood (ML) approach. The simulation results show that the proposed soft VAD that uses a Laplacian distribution model for speech signals outperforms the previous VAD that uses a Gaussian model.
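The per-frame hypothesis test can be sketched as a log-likelihood ratio over decorrelated coefficients under the stated Laplacian (speech) and Gaussian (noise) models. The fixed parameters here stand in for the adaptively ML-estimated ones in the paper:

```python
import numpy as np

def vad_log_likelihood_ratio(coeffs, b_speech=1.0, sigma_noise=1.0):
    """Frame-level log-likelihood ratio: Laplacian(0, b_speech) for speech
    vs. zero-mean Gaussian(sigma_noise) for noise. Positive values favor
    the speech hypothesis.
    """
    log_speech = -np.log(2.0 * b_speech) - np.abs(coeffs) / b_speech
    log_noise = (-0.5 * np.log(2.0 * np.pi * sigma_noise ** 2)
                 - coeffs ** 2 / (2.0 * sigma_noise ** 2))
    return float(np.sum(log_speech - log_noise))
```

In the full soft VAD this ratio would be combined with the HMM-propagated prior on the speech state rather than thresholded directly.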

Journal ArticleDOI
TL;DR: In this article, distortion discriminant analysis (DDA) is proposed to map audio data to feature vectors for classification, retrieval, or identification tasks, while keeping the feature extraction operation computationally efficient.
Abstract: Mapping audio data to feature vectors for the classification, retrieval or identification tasks presents four principal challenges. The dimensionality of the input must be significantly reduced; the resulting features must be robust to likely distortions of the input; the features must be informative for the task at hand; and the feature extraction operation must be computationally efficient. We propose distortion discriminant analysis (DDA), which fulfills all four of these requirements. DDA constructs a linear, convolutional neural network out of layers, each of which performs an oriented PCA dimensional reduction. We demonstrate the effectiveness of DDA on two audio fingerprinting tasks: searching for 500 audio clips in 36 h of audio test data; and playing over 10 days of audio against a database with approximately 240 000 fingerprints. We show that the system is robust to kinds of noise that are not present in the training procedure. In the large test, the system gives a false positive rate of 1.5 × 10^-8 per audio clip, per fingerprint, at a false negative rate of 0.2% per clip.

Journal ArticleDOI
TL;DR: This paper proposes a new method based on the multichannel spatial correlation matrix for time delay estimation that can take advantage of the redundancy when more than two microphones are available and can help the estimator to better cope with noise and reverberation.
Abstract: To find the position of an acoustic source in a room, typically, a set of relative delays among different microphone pairs needs to be determined. The generalized cross-correlation (GCC) method is the most popular to do so and is well explained in a landmark paper by Knapp and Carter. In this paper, the idea of cross-correlation coefficient between two random signals is generalized to the multichannel case by using the notion of spatial prediction. The multichannel spatial correlation matrix is then deduced and its properties are discussed. We then propose a new method based on the multichannel spatial correlation matrix for time delay estimation. It is shown that this new approach can take advantage of the redundancy when more than two microphones are available and this redundancy can help the estimator to better cope with noise and reverberation.

Journal ArticleDOI
TL;DR: A nonlinear acoustic echo cancellation algorithm, mainly focused on loudspeaker distortions, is presented; it is composed of two distinct modules organized in a cascaded structure, based on polynomial Volterra filters and standard linear filtering.
Abstract: This paper describes a nonlinear acoustic echo cancellation algorithm, mainly focused on loudspeaker distortions. The proposed system is composed of two distinct modules organized in a cascaded structure: a nonlinear module based on polynomial Volterra filters models the loudspeaker, and a second module of standard linear filtering identifies the impulse response of the acoustic path. The tracking of the overall system model is achieved by a modified normalized-least mean square algorithm for which equations are derived. Stability conditions are given, and particular attention is placed on the transient behavior of cascaded filters. Finally, results of real data recorded with Alcatel GSM material are presented.

Journal ArticleDOI
TL;DR: The authors review current approaches to blind source separation and propose a fast frequency-domain ICA framework, providing a solution for the permutation problem encountered in these methods.
Abstract: The problem of separation of audio sources recorded in a real world situation is well established in modern literature. A method to solve this problem is blind source separation (BSS) using independent component analysis (ICA). The recording environment is usually modeled as convolutive. Previous research on ICA of instantaneous mixtures provided solid background for the separation of convolved mixtures. The authors review current approaches to the subject and propose a fast frequency-domain ICA framework, providing a solution for the permutation problem encountered in these methods.

Journal ArticleDOI
TL;DR: A low complexity quantization scheme using transform coding and bit allocation techniques which allows for easy mapping from observation to quantized value is developed for both fixed rate and variable rate systems.
Abstract: A computationally efficient, high quality, vector quantization scheme based on a parametric probability density function (PDF) is proposed. In this scheme, the observations are modeled as i.i.d. realizations of a multivariate Gaussian mixture density. The mixture model parameters are efficiently estimated using the expectation maximization (EM) algorithm. A low complexity quantization scheme using transform coding and bit allocation techniques which allows for easy mapping from observation to quantized value is developed for both fixed rate and variable rate systems. An attractive feature of this method is that source encoding using the resultant codebook involves very few searches and its computational complexity is minimal and independent of the rate of the system. Furthermore, the proposed scheme is bit scalable and can switch seamlessly between a memoryless quantizer and a quantizer with memory. The usefulness of the approach is demonstrated for speech coding where Gaussian mixture models are used to model speech line spectral frequencies. The performance of the memoryless quantizer is 1-3 bits better than conventional quantization schemes.
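The classic high-rate bit allocation rule underlying such transform-coding quantizers assigns each transform component the average rate plus half the log ratio of its variance to the geometric mean. This is a textbook sketch, not the paper's scheme (which handles clipped allocations and variable-rate coding in more detail):

```python
import numpy as np

def bit_allocation(variances, total_bits):
    """High-rate optimal allocation for independent Gaussian components:
        b_i = B/n + 0.5 * log2(var_i / geometric_mean(var)).
    Negative allocations are clipped to zero in this sketch; practical
    coders redistribute the clipped bits iteratively.
    """
    n = len(variances)
    geo_mean = np.exp(np.mean(np.log(variances)))
    bits = total_bits / n + 0.5 * np.log2(variances / geo_mean)
    return np.maximum(bits, 0.0)
```

High-variance components receive more bits, which is what keeps the per-component quantization distortion balanced after the decorrelating transform.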

Journal ArticleDOI
TL;DR: In this article, a new online secondary path modeling method with auxiliary noise power scheduling and adaptive filter norm manipulation is proposed to mitigate the increase in residual noise due to the auxiliary noise.
Abstract: In many practical cases for active noise control (ANC), the online secondary path modeling methods that use auxiliary noise are often applied. However, the auxiliary noise contributes to residual noise, and thus deteriorates the noise control performance of ANC systems. Moreover, a sudden and large change in the secondary path leads to easy divergence of the existing online secondary path modeling methods. To mitigate these problems, this paper proposes a new online secondary path modeling method with auxiliary noise power scheduling and adaptive filter norm manipulation. The auxiliary noise power is scheduled based on the convergence status of an ANC system with consideration of the variation of the primary noise. The purpose is to limit the increase in residual noise caused by the auxiliary noise. In addition, the norm manipulation is applied to adaptive filters in the ANC system. The objective is to avoid over-updates of adaptive filters due to the sudden large change in the secondary path and thus prevent the ANC system from diverging. Computer simulations show the effectiveness and robustness of the proposed method.

Journal ArticleDOI
TL;DR: Objective speech quality measures, informal listening tests, and the results of automatic speech recognition experiments indicate a substantial benefit from AMS-based noise suppression, in comparison to unprocessed noisy speech.
Abstract: A single-microphone noise suppression algorithm is described that is based on a novel approach for the estimation of the signal-to-noise ratio (SNR) in different frequency channels: The input signal is transformed into neurophysiologically-motivated spectro-temporal input features. These patterns are called amplitude modulation spectrograms (AMS), as they contain information of both center frequencies and modulation frequencies within each 32 ms-analysis frame. The different representations of speech and noise in AMS patterns are detected by a neural network, which estimates the present SNR in each frequency channel. Quantitative experiments show a reliable estimation of the SNR for most types of nonspeech background noise. For noise suppression, the frequency bands are attenuated according to the estimated present SNR using a Wiener filter approach. Objective speech quality measures, informal listening tests, and the results of automatic speech recognition experiments indicate a substantial benefit from AMS-based noise suppression, in comparison to unprocessed noisy speech.
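The per-band Wiener attenuation follows directly from the estimated SNR; a minimal sketch (the spectral floor is an illustrative addition to limit musical noise, not a value from the paper):

```python
import numpy as np

def wiener_gain(snr_db, floor_db=-15.0):
    """Per-band Wiener attenuation H = xi / (1 + xi), where xi is the
    linear SNR converted from the band-wise estimate in dB; the gain is
    lower-bounded by a spectral floor.
    """
    xi = 10.0 ** (np.asarray(snr_db, dtype=float) / 10.0)
    return np.maximum(xi / (1.0 + xi), 10.0 ** (floor_db / 20.0))
```

Bands where the network estimates high SNR pass nearly unattenuated, while low-SNR bands are suppressed down to the floor.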

Journal ArticleDOI
TL;DR: The experimental results demonstrated the crucial importance of using the newly introduced iterations in improving the earlier stochastic approximation technique, and showed sensitivity of the noise estimation algorithm's performance to the forgetting factor embedded in the algorithm.
Abstract: We describe a novel algorithm for recursive estimation of nonstationary acoustic noise that corrupts clean speech, and a successful application of the algorithm in the speech feature enhancement framework of noise-normalized SPLICE for robust speech recognition. The noise estimation algorithm makes use of a nonlinear model of the acoustic environment in the cepstral domain. Central to the algorithm is an innovative iterative stochastic approximation technique that improves the piecewise linear approximation to the nonlinearity involved and thereby increases the accuracy of the noise estimate. We report comprehensive experiments on SPLICE-based, noise-robust speech recognition for the AURORA2 task using the results of iterative stochastic approximation. The effectiveness of the new technique is demonstrated in comparison with a more traditional MMSE noise estimation algorithm under otherwise identical conditions. The word error rate reductions achieved by iterative stochastic approximation for recursive noise estimation in the framework of noise-normalized SPLICE are 27.9% for the multicondition training mode and 67.4% for the clean-only training mode, compared with the results using the standard cepstra with no speech enhancement and the baseline HMM supplied by AURORA2. These represent the best performance in the clean-training category of the September 2001 AURORA2 evaluation. The relative error rate reductions achieved with the same noise estimate increase to 48.40% and 76.86%, respectively, for the two training modes when a better-designed HMM system is used. The experimental results demonstrate the crucial importance of the newly introduced iterations in improving the earlier stochastic approximation technique, and show the sensitivity of the noise estimation algorithm's performance to the forgetting factor embedded in the algorithm.
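Per log-spectral bin, the nonlinear environment model behind such algorithms reduces to y = x + log(1 + e^(n-x)). A toy sketch of that model, plus a recursive noise tracker with forgetting factor `lam`, is shown below; the paper's actual algorithm linearizes the model and iterates a stochastic approximation, which is not reproduced here, and the fixed `clean_guess` is an illustrative assumption.

```python
import math

def log_add(x, n):
    """Log-spectral mixing model: noisy = log(exp(clean) + exp(noise))."""
    return x + math.log1p(math.exp(n - x))

def recursive_noise_estimate(noisy_frames, clean_guess, lam=0.9, n0=0.0):
    """Toy recursive noise tracker: invert the log-add model with a fixed
    clean-speech guess, then smooth with forgetting factor lam."""
    n_hat = n0
    track = []
    for y in noisy_frames:
        # instantaneous noise estimate from the (approximate) inverse model
        resid = math.exp(y) - math.exp(clean_guess)
        inst = math.log(resid) if resid > 1e-12 else n_hat
        n_hat = lam * n_hat + (1.0 - lam) * inst
        track.append(n_hat)
    return track
```

A larger `lam` smooths more but reacts more slowly to noise changes, which is the sensitivity to the forgetting factor the abstract mentions.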

Journal ArticleDOI
TL;DR: The experimental results show that computational reduction by a factor of 17 can be achieved with 5% relative reduction in equal error rate (EER) compared with the baseline, and the SGMM-SBM shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.
Abstract: We present an integrated system with structural Gaussian mixture models (SGMMs) and a neural network for achieving both computational efficiency and high accuracy in text-independent speaker verification. A structural background model (SBM) is constructed first by hierarchically clustering all Gaussian mixture components in a universal background model (UBM). In this way the acoustic space is partitioned into multiple regions at different levels of resolution. For each target speaker, an SGMM can be generated through multilevel maximum a posteriori (MAP) adaptation from the SBM. During testing, only a small subset of the Gaussian mixture components is scored for each feature vector, which reduces the computational cost significantly. Furthermore, the scores obtained in different layers of the tree-structured models are combined via a neural network for the final decision. Different configurations are compared in experiments conducted on the telephony speech data used in the NIST speaker verification evaluation. The experimental results show that a computational reduction by a factor of 17 can be achieved with a 5% relative reduction in equal error rate (EER) compared with the baseline. The SGMM-SBM also shows advantages over the recently proposed hash GMM, including higher speed and better verification performance.
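The "score only a small subset" idea can be sketched with a two-level, one-dimensional toy tree: pick the best coarse component(s), then evaluate only their children. Uniform weights, depth two, and top-1 selection are simplifying assumptions; the paper's SGMMs are MAP-adapted multilevel models.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def structural_score(x, coarse, children, top_n=1):
    """Score a frame against a two-level structural mixture: pick the
    top_n coarse components, then evaluate only their children."""
    coarse_scores = [(log_gauss(x, m, v), i) for i, (m, v) in enumerate(coarse)]
    coarse_scores.sort(reverse=True)
    best = [i for _, i in coarse_scores[:top_n]]
    fine = [log_gauss(x, m, v) for i in best for (m, v) in children[i]]
    # log-sum-exp over the selected fine components (uniform weights)
    mx = max(fine)
    return mx + math.log(sum(math.exp(s - mx) for s in fine) / len(fine))
```

Only the children of the winning coarse node are evaluated, so the per-frame cost grows with the tree depth rather than with the total number of mixture components.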

Journal ArticleDOI
TL;DR: Multichannel affine and fast affine projection algorithms are introduced for active noise control or acoustic equalization and it is shown that they can provide the best convergence performance (even over recursive-least-squares algorithms) when nonideal noisy acoustic plant models are used in the adaptive systems.
Abstract: In the field of adaptive signal processing, it is well known that affine projection algorithms, and their low-complexity implementations known as fast affine projection algorithms, can produce a good tradeoff between convergence speed and computational complexity. Although these algorithms typically do not provide the same convergence speed as recursive-least-squares algorithms, they can converge much faster than stochastic gradient descent algorithms, without the large increase in computational load or the instability often found in recursive-least-squares algorithms. In this paper, multichannel affine and fast affine projection algorithms are introduced for active noise control or acoustic equalization. Multichannel fast affine projection algorithms have previously been published for acoustic echo cancellation, but active noise control and acoustic equalization pose a very different problem, leading to different structures, as explained in the paper. The computational complexity of the new algorithms is evaluated, and it is shown through simulations that not only can the new algorithms provide the expected tradeoff between convergence performance and computational complexity, they can also provide the best convergence performance (even over recursive-least-squares algorithms) when nonideal noisy acoustic plant models are used in the adaptive systems.
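The core affine projection update is w += mu * X^T (X X^T + delta*I)^{-1} e, where the rows of X are the P most recent input regressors and e the corresponding a priori errors. A single-channel, projection-order-2 sketch (so the 2x2 inverse stays closed-form) is shown below; this illustrates plain APA, not the multichannel or fast variants the paper introduces.

```python
def apa_update_order2(w, x_hist, d_hist, mu=0.5, delta=1e-4):
    """One affine projection update of order 2 for filter w.
    x_hist: the two most recent regressor vectors (newest first);
    d_hist: the two corresponding desired samples."""
    x0, x1 = x_hist
    # a priori errors for the two most recent constraints
    e0 = d_hist[0] - sum(a * b for a, b in zip(w, x0))
    e1 = d_hist[1] - sum(a * b for a, b in zip(w, x1))
    # 2x2 Gram matrix X X^T of the regressors, regularized by delta
    r00 = sum(a * a for a in x0) + delta
    r11 = sum(a * a for a in x1) + delta
    r01 = sum(a * b for a, b in zip(x0, x1))
    det = r00 * r11 - r01 * r01
    g0 = (r11 * e0 - r01 * e1) / det     # (X X^T + delta I)^{-1} e
    g1 = (r00 * e1 - r01 * e0) / det
    # w += mu * X^T g
    return [wi + mu * (g0 * a + g1 * b) for wi, a, b in zip(w, x0, x1)]
```

With order 1 this reduces to NLMS; higher orders decorrelate the input and speed up convergence at the cost of inverting a P x P matrix, which is what the fast versions avoid recomputing from scratch.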

Journal ArticleDOI
TL;DR: A new perceptually motivated approach is proposed for enhancement of speech corrupted by colored noise that takes into account the frequency masking properties of the human auditory system and reduces the perceptual effect of the residual noise.
Abstract: A new perceptually motivated approach is proposed for enhancement of speech corrupted by colored noise. The proposed approach takes into account the frequency masking properties of the human auditory system and reduces the perceptual effect of the residual noise. This new perceptual method is incorporated into a frequency-domain speech enhancement method and a subspace-based speech enhancement method. A better power spectrum/autocorrelation function estimator is also developed to improve the performance of the proposed algorithms. Objective measures and informal listening tests demonstrate significant improvements over other methods when tested with TIMIT sentences corrupted by various types of colored noise.
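One common way such masking properties are exploited can be sketched as a gain floor: attenuate a band only as far as needed to push the residual noise just under the masking threshold, which spares the speech from over-suppression. The gain rule below is an illustrative simplification, not the paper's estimator.

```python
def perceptual_gain(noisy_power, noise_power, mask_threshold):
    """Per-band gain: residual noise after gain g has power g^2 * noise_power,
    so any g <= sqrt(mask_threshold / noise_power) leaves it inaudible."""
    if noise_power <= 0.0:
        return 1.0
    # plain Wiener / spectral-subtraction gain
    snr = max(noisy_power - noise_power, 0.0) / noise_power
    g_wiener = snr / (1.0 + snr)
    # largest gain that still hides the residual noise under the mask
    g_mask = (mask_threshold / noise_power) ** 0.5
    return min(1.0, max(g_wiener, g_mask))
```

When the noise in a band is already fully masked (`mask_threshold >= noise_power`), the gain stays at 1 and the speech passes undistorted.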

Journal ArticleDOI
TL;DR: This work presents a design of nearly perfect reconstruction nonuniform oversampled filter banks which implement signal decompositions of this kind.
Abstract: In emerging audio technology applications, there is a need for decompositions of audio signals into oversampled subband components with time-frequency resolution that mimics that of the cochlear filter bank, and with high aliasing attenuation in each subband independently rather than relying on aliasing cancellation. We present a design of nearly perfect reconstruction nonuniform oversampled filter banks that implement signal decompositions of this kind.

Journal ArticleDOI
TL;DR: The proposed method offers better performance regarding suppression levels of disturbing signals and much less distortion to the source speech in a car hands-free mobile telephony environment.
Abstract: This paper presents a new method for designing oversampled uniform DFT filter banks for the special application of subband adaptive beamforming with microphone arrays. Since array applications rely on the fact that different source positions give rise to different signal delays, a beamformer alters the phase information of the signals. This in turn leads to signal degradations when perfect reconstruction filter banks are used for the subband decomposition and reconstruction. The objective of the filter bank design is to minimize the magnitude of each aliasing component individually, so that aliasing distortion remains small even though phase alterations occur in the subbands. The proposed method is evaluated in a car hands-free mobile telephony environment, and the results show that it offers better suppression of disturbing signals and much less distortion of the source speech.
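The uniform DFT filter bank structure itself can be sketched by modulating one lowpass prototype to all band centers; the design problem the paper addresses, choosing that prototype (with oversampled decimation) so each band's aliasing terms are individually small, is not reproduced here.

```python
import cmath
import math

def dft_filterbank(prototype, num_bands):
    """Modulate a lowpass prototype into a uniform DFT filter bank:
    band k has impulse response h[n] * exp(j * 2*pi*k*n / num_bands)."""
    bank = []
    for k in range(num_bands):
        w = 2.0 * math.pi * k / num_bands
        bank.append([h * cmath.exp(1j * w * n) for n, h in enumerate(prototype)])
    return bank
```

Because every band is a modulated copy of the same prototype, designing one good prototype (long, with high stopband attenuation) controls the aliasing of all subbands at once, which is what allows phase modifications in the subbands without exposing aliasing.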

Journal ArticleDOI
Hong Kook Kim, Richard Rose
TL;DR: In this paper, a set of acoustic feature pre-processing techniques for improving automatic speech recognition (ASR) performance on noisy speech recognition tasks is presented. The main contribution is an approach to cepstrum-domain feature compensation in ASR motivated by techniques for decomposing speech and noise that were originally developed for noisy speech enhancement.
Abstract: This paper presents a set of acoustic feature pre-processing techniques that are applied to improve automatic speech recognition (ASR) performance on noisy speech recognition tasks. The principal contribution of this paper is an approach to cepstrum-domain feature compensation in ASR that is motivated by techniques for decomposing speech and noise originally developed for noisy speech enhancement. This approach is applied, in combination with other feature compensation algorithms, to compensate ASR features obtained from a mel-filterbank cepstrum coefficient front end. Performance comparisons are made with respect to applying the minimum mean squared error log spectral amplitude (MMSE-LSA) estimator-based speech enhancement algorithm prior to feature analysis. An experimental study is presented in which the feature compensation approaches described in the paper are found to greatly reduce ASR word error rates, compared to uncompensated features, under conditions of environment and channel mismatch.
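For orientation, the cepstrum-domain setting can be sketched with the standard MFCC-style pipeline plus the simplest possible cepstral bias compensation, cepstral mean normalization. This stands in for, and is much simpler than, the paper's decomposition-based compensation; all names here are illustrative.

```python
import math

def dct_cepstrum(log_energies, num_ceps):
    """Mel-filterbank log energies -> cepstrum via DCT-II (MFCC front end)."""
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * k * (i + 0.5) / n)
                for i, e in enumerate(log_energies))
            for k in range(num_ceps)]

def compensate(ceps_frames):
    """Simplest cepstrum-domain compensation: subtract the per-dimension
    mean over the utterance, removing a stationary channel/noise bias."""
    dims = len(ceps_frames[0])
    means = [sum(f[d] for f in ceps_frames) / len(ceps_frames)
             for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in ceps_frames]
```

A convolutive channel adds a constant offset in the log/cepstral domain, which is why mean subtraction removes it; additive noise is not a constant cepstral offset, which is why the more elaborate speech/noise decomposition approaches of this paper are needed.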