Proceedings ArticleDOI

Effect of phase-sensitive environment model and higher order VTS on noisy speech feature enhancement [speech recognition applications]

18 Mar 2005-Vol. 1, pp 433-436
TL;DR: This paper proposes two modifications to obtain more accurate versions of the statistics of the combined HMM (starting from a clean speech and a noise model) and explains how the front-end clean speech model itself can be improved by a preprocessing of the training data.
Abstract: Model-based techniques for robust speech recognition often require the statistics of noisy speech. In this paper, we propose two modifications to obtain more accurate versions of the statistics of the combined HMM (starting from a clean speech and a noise model). Usually, the phase difference between speech and noise is neglected in the acoustic environment model. However, we show how a phase-sensitive environment model can be efficiently integrated in the context of multi-stream model-based feature enhancement and gives rise to more accurate covariance matrices for the noisy speech. Also, by expanding the vector Taylor series up to the second order term, an improved noisy speech mean can be obtained. Finally, we explain how the front-end clean speech model itself can be improved by a preprocessing of the training data. Recognition results on the Aurora4 database illustrate the effect on the noise robustness for each of these modifications.
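The second-order VTS idea in this abstract can be illustrated with a small numerical sketch. The code below uses the standard phase-insensitive log-mel mismatch function y = x + log(1 + e^(n-x)) and adds a finite-difference second-order Vector Taylor Series correction to the noisy-speech mean; the function and all variable names are illustrative assumptions, not the paper's exact formulation (which additionally includes the phase term):

```python
import numpy as np

def mismatch(x, n):
    """Noisy log-mel feature as a function of clean speech x and noise n."""
    return x + np.log1p(np.exp(n - x))

def noisy_mean_vts(mu_x, var_x, mu_n, var_n, order=2, eps=1e-3):
    """Approximate E[y] by a VTS expansion of the mismatch function.

    order=1 keeps only f(mu_x, mu_n); order=2 adds the 0.5 * f'' * var
    correction terms (diagonal covariances assumed for simplicity).
    """
    mu_y = mismatch(mu_x, mu_n)
    if order >= 2:
        # numerical second derivatives w.r.t. x and n (finite differences)
        d2_x = (mismatch(mu_x + eps, mu_n) - 2 * mu_y
                + mismatch(mu_x - eps, mu_n)) / eps**2
        d2_n = (mismatch(mu_x, mu_n + eps) - 2 * mu_y
                + mismatch(mu_x, mu_n - eps)) / eps**2
        mu_y = mu_y + 0.5 * (d2_x * var_x + d2_n * var_n)
    return mu_y

mu1 = noisy_mean_vts(np.array([10.0]), np.array([4.0]),
                     np.array([8.0]), np.array([4.0]), order=1)
mu2 = noisy_mean_vts(np.array([10.0]), np.array([4.0]),
                     np.array([8.0]), np.array([4.0]), order=2)
```

In this toy setting the second-order correction visibly shifts the noisy mean upward relative to the first-order value, which is why expanding beyond the linear term changes the combined-model statistics.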


Citations
Journal ArticleDOI
TL;DR: A thorough overview of modern noise-robust techniques for ASR developed over the past 30 years is provided and methods that are proven to be successful and that are likely to sustain or expand their future applicability are emphasized.
Abstract: New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustic distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid, consistent, and common mathematical foundation for noise-robust ASR, which is lacking at present. This article is intended to fill this gap and to provide a thorough overview of modern noise-robust techniques for ASR developed over the past 30 years. We emphasize methods that are proven to be successful and that are likely to sustain or expand their future applicability. We distill key insights from our comprehensive overview in this field and take a fresh look at a few old problems, which nevertheless are still highly relevant today. Specifically, we have analyzed and categorized a wide range of noise-robust techniques using five different criteria: 1) feature-domain vs. model-domain processing, 2) the use of prior knowledge about the acoustic environment distortion, 3) the use of explicit environment-distortion models, 4) deterministic vs. uncertainty processing, and 5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the testing stage. With this taxonomy-oriented review, we equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions in this field are also carefully analyzed.

534 citations

Book ChapterDOI
Li Deng1
01 Jan 2011
TL;DR: The Bayesian framework is used as a common thread for connecting, analyzing, and categorizing a number of popular approaches to the solutions pursued in the recent past on the problem of uncertainty handling in robust speech recognition.
Abstract: Noise robustness has long been an active area of research that captures significant interest from speech recognition researchers and developers. In this chapter, with a focus on the problem of uncertainty handling in robust speech recognition, we use the Bayesian framework as a common thread for connecting, analyzing, and categorizing a number of popular approaches to the solutions pursued in the recent past. The topics covered in this chapter include 1) Bayesian decision rules with unreliable features and unreliable model parameters; 2) principled ways of computing feature uncertainty using structured speech distortion models; 3) use of a phase factor in an advanced speech distortion model for feature compensation; 4) a novel perspective on model compensation as a special implementation of the general Bayesian predictive classification rule capitalizing on model parameter uncertainty; 5) taxonomy of noise compensation techniques using two distinct axes, feature vs. model domain and structured vs. unstructured transformation; and 6) noise-adaptive training as a hybrid feature-model compensation framework and its various forms of extension.

44 citations

Proceedings Article
01 Jan 2009
TL;DR: An analytic derivation of the moments of the phase factor between clean speech and noise cepstral or log-mel-spectral feature vectors is presented, leading to significant improvements in word accuracy on the AURORA2 database.
Abstract: In this paper we present an analytic derivation of the moments of the phase factor between clean speech and noise cepstral or log-mel-spectral feature vectors. The development shows, among others, that the probability density of the phase factor is of sub-Gaussian nature and that it is independent of the noise type and the signal-to-noise ratio, however dependent on the mel filter bank index. Further we show how to compute the contribution of the phase factor to both the mean and the variance of the noisy speech observation likelihood, which relates the speech and noise feature vectors to those of noisy speech. The resulting phase-sensitive observation model is then used in model-based speech feature enhancement, leading to significant improvements in word accuracy on the AURORA2 database. Index Terms: model-based feature enhancement, phase-sensitive observation model, phase factor distribution
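The distributional claims in this abstract are easy to check numerically for a simplified stand-in for the phase factor. The sketch below treats one mel band as an unweighted average of cos(θ_k) over a few DFT bins with uniform random phases; this simplification and all names are assumptions for illustration, not the paper's analytic derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
K, trials = 8, 200_000
# theta_k: unknown speech/noise phase difference per DFT bin, uniform on [-pi, pi]
theta = rng.uniform(-np.pi, np.pi, size=(trials, K))
# simplified phase factor: average of cos(theta) over the bins of one mel band
alpha = np.cos(theta).mean(axis=1)

mean = alpha.mean()
# excess kurtosis < 0 indicates a sub-Gaussian (lighter-tailed) density
kurt = (((alpha - mean) ** 4).mean() / alpha.var() ** 2) - 3.0
```

The sample mean comes out essentially zero and the excess kurtosis negative, consistent with the sub-Gaussian shape reported in the paper; the spread depends only on how many bins the band averages over, mirroring the stated dependence on the mel filter bank index.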

25 citations


Cites background or methods from "Effect of phase-sensitive environme..."

  • ...Since a numerical evaluation of the resulting integrals is computationally very demanding if not almost impossible, the observation probability is approximated by a Gaussian, where the effect of the phase factor is either modelled as a contribution to the mean [4], to the variance [2] or to both mean and variance [3]....


  • ...However, it is well-known that a more accurate model is obtained if a phase factor α, which results from the unknown phase between the complex speech and noise short-term discrete-time Fourier transform, is taken into account [1, 2, 3, 4]....


  • ...Subsequently, the observation probability p(y|x,n) can be determined either by Vector Taylor Series approximation up to linear [5] or higher-order terms [3] or by Monte Carlo Integration [4]....


  • ...[3], we achieved recognition accuracies of 85....


Book ChapterDOI
01 Jan 2011
TL;DR: This chapter describes the underlying concepts of model-based noise compensation for robust speech recognition, shows how it can be applied to standard systems, and considers important practical issues.
Abstract: A powerful approach for handling uncertainty in observations is to modify the statistical model of the data to appropriately reflect this uncertainty. For the task of noise-robust speech recognition, this requires modifying an underlying “clean” acoustic model to be representative of speech in a particular target acoustic environment. This chapter describes the underlying concepts of model-based noise compensation for robust speech recognition and how it can be applied to standard systems. The chapter will then consider important practical issues. These include i) acoustic environment noise parameter estimation; ii) efficient acoustic model compensation and likelihood calculation; and iii) adaptive training to handle multi-style training data. The chapter will conclude by discussing the limitations of the current approaches and research options to address them.

23 citations


Cites methods from "Effect of phase-sensitive environme..."

  • ...For VTS the cross term can be found using [62, 43]...


Proceedings Article
01 Jan 2002
TL;DR: In this paper, the phase relationship between clean speech and the corrupting noise in acoustic distortion is captured by the MMSE estimator, which achieves high efficiency by exploiting a single-point Taylor series expansion to approximate the joint probability of clean and noisy speech as a multivariate Gaussian.
Abstract: In this paper we present an MMSE (minimum mean square error) speech feature enhancement algorithm, capitalizing on a new probabilistic, nonlinear environment model that effectively incorporates the phase relationship between the clean speech and the corrupting noise in acoustic distortion. The MMSE estimator based on this phase-sensitive model is derived and it achieves high efficiency by exploiting a single-point Taylor series expansion to approximate the joint probability of clean and noisy speech as a multivariate Gaussian. As an integral component of the enhancement algorithm, we also present a new sequential MAP-based nonstationary noise estimator. Experimental results on the Aurora2 task demonstrate the importance of exploiting the phase relationship in the speech corruption process captured by the MMSE estimator. The phase-sensitive MMSE estimator reported in this paper performs significantly better than phase-insensitive spectral subtraction (54% error rate reduction), and also noticeably better than a phase-insensitive MMSE estimator as our previous state-of-the-art technique reported in [2] (7% error rate reduction), under otherwise identical experimental conditions of speech recognition.
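Once the joint density of clean and noisy speech is approximated as a single multivariate Gaussian, as this abstract describes, the MMSE estimate reduces to the conditional mean. A one-dimensional sketch with invented numbers:

```python
import numpy as np

# Joint-Gaussian parameters for one log-mel dimension; the values are
# made up for illustration and would normally come from the linearized
# environment model.
mu_x, var_x = 10.0, 4.0        # clean-speech prior
mu_y, var_y = 10.5, 4.5        # noisy-speech statistics from the model
cov_xy = 3.2                   # cross-covariance from the linearization

def mmse_estimate(y):
    """Conditional mean E[x | y] of a jointly Gaussian (x, y)."""
    return mu_x + cov_xy / var_y * (y - mu_y)

x_hat = mmse_estimate(12.0)    # observed noisy feature
```

The estimate pulls the observation back toward the clean-speech prior in proportion to the cross-covariance, which is where the phase-sensitive model enters: a more accurate environment model yields more accurate joint statistics and hence a better conditional mean.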

14 citations

References
Journal ArticleDOI
TL;DR: After training on clean speech data, the performance of the recognizer was found to be severely degraded when noise was added to the speech signal at between 10 and 18 dB, but using PMC the performance was restored to a level comparable with that obtained when training directly in the noise corrupted environment.
Abstract: This paper addresses the problem of automatic speech recognition in the presence of interfering noise. It focuses on the parallel model combination (PMC) scheme, which has been shown to be a powerful technique for achieving noise robustness. Most experiments reported on PMC to date have been on small, 10-50 word vocabulary systems. Experiments on the Resource Management (RM) database, a 1000 word continuous speech recognition task, reveal compensation requirements not highlighted by the smaller vocabulary tasks. In particular, it is necessary to compensate the dynamic parameters as well as the static parameters to achieve good recognition performance. The database used for these experiments was the RM speaker independent task with either Lynx Helicopter noise or Operation Room noise from the NOISEX-92 database added. The experiments reported here used the HTK RM recognizer developed at CUED modified to include PMC based compensation for the static, delta and delta-delta parameters. After training on clean speech data, the performance of the recognizer was found to be severely degraded when noise was added to the speech signal at between 10 and 18 dB. However, using PMC the performance was restored to a level comparable with that obtained when training directly in the noise corrupted environment.
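The core combination step behind PMC can be sketched in a few lines. The toy below applies the log-add idea directly in the log-mel domain (the cepstral DCT/IDCT steps and the compensation of variances and dynamic parameters discussed in the abstract are omitted); all numbers are invented:

```python
import numpy as np

def pmc_log_add(mu_x, mu_n):
    """Combine clean-speech and noise log-spectral means: map both to the
    linear spectral domain, add the power contributions, map back."""
    return np.log(np.exp(mu_x) + np.exp(mu_n))

mu_x = np.array([10.0, 9.0])   # clean speech log-mel means (made up)
mu_n = np.array([8.0, 9.0])    # noise log-mel means (made up)
mu_y = pmc_log_add(mu_x, mu_n)
```

Note how the channel with equal speech and noise energy shifts by log 2, while the channel where speech dominates barely moves; this is the nonlinearity that makes combining the models in the log domain nontrivial.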

509 citations

Proceedings ArticleDOI
07 May 1996
TL;DR: This work introduces the use of a vector Taylor series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel.
Abstract: In this paper we introduce a new analytical approach to environment compensation for speech recognition. Previous attempts at solving analytically the problem of noisy speech recognition have either used an overly-simplified mathematical description of the effects of noise on the statistics of speech or they have relied on the availability of large environment-specific adaptation sets. Some of the previous methods required the use of adaptation data that consists of simultaneously-recorded or "stereo" recordings of clean and degraded speech. In this work we introduce the use of a vector Taylor series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel. The VTS approach is computationally efficient. It can be applied either to the incoming speech feature vectors, or to the statistics representing these vectors. In the first case the speech is compensated and then recognized; in the second case HMM statistics are modified using the VTS formulation. Both approaches use only the actual speech segment being recognized to compute the parameters required for environmental compensation. We evaluate the performance of two implementations of VTS algorithms using the CMU SPHINX-II system on the 100-word alphanumeric CENSUS database and on the 1993 5000-word ARPA Wall Street Journal database. Artificial white Gaussian noise is added to both databases. The VTS approaches provide significant improvements in recognition accuracy compared to previous algorithms.
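The VTS approach described in this abstract can be sketched for a single log-mel channel: linearize the mismatch function around the clean-speech and noise means, then propagate means and variances through the linearization. The variable names and numbers below are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def vts_first_order(mu_x, var_x, mu_n, var_n):
    """First-order VTS compensation of one log-mel channel for
    y = x + log(1 + exp(n - x)), assuming independent x and n."""
    g = sigmoid(mu_n - mu_x)          # dy/dn at the expansion point
    mu_y = mu_x + np.log1p(np.exp(mu_n - mu_x))
    # dy/dx = 1 - g, dy/dn = g; variances add for independent x and n
    var_y = (1.0 - g) ** 2 * var_x + g ** 2 * var_n
    return mu_y, var_y

mu_y, var_y = vts_first_order(10.0, 4.0, 8.0, 1.0)
```

Because the Jacobian weights sum the two variances with factors (1-g)^2 and g^2, the compensated variance interpolates between the clean-speech and noise variances depending on the local signal-to-noise ratio at the expansion point.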

480 citations

Proceedings ArticleDOI
17 May 2004
TL;DR: This paper investigates the use of higher-order autoregressive vector predictors for tracking the noise in noisy speech signals and finds that predictors of order greater than 1 are not observed to improve the performance beyond that obtained with a first-order predictor.
Abstract: This paper investigates the use of higher-order autoregressive vector predictors for tracking the noise in noisy speech signals. The autoregressive predictors form the state equation of a linear dynamical system that models the spectral dynamics of the noise process. Experiments show that the use of such models to track noise can lead to large gains in recognition performance on speech compensated for the estimated noise. However, predictors of order greater than 1 are not observed to improve the performance beyond that obtained with a first-order predictor. We analyze and explain why this is so.
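The linear dynamical system described in this abstract can be sketched as a scalar Kalman filter whose state equation is a first-order autoregressive predictor; the AR coefficient, noise variances and observations below are invented for illustration:

```python
import numpy as np

def kalman_ar1(obs, a=0.95, q=0.1, r=1.0):
    """Track a hidden AR(1) noise process n_t = a*n_{t-1} + w_t from
    noisy observations with an identity observation model."""
    n_hat, p = 0.0, 1.0           # state estimate and its variance
    track = []
    for y in obs:
        # predict with the AR(1) state equation
        n_pred, p_pred = a * n_hat, a * a * p + q
        # update with the new observation
        k = p_pred / (p_pred + r)
        n_hat = n_pred + k * (y - n_pred)
        p = (1.0 - k) * p_pred
        track.append(n_hat)
    return np.array(track)

est = kalman_ar1([1.0, 1.2, 0.9, 1.1])
```

A higher-order predictor would simply stack several past states into the state vector; the paper's finding is that, for noise tracking, doing so does not beat this first-order form.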

41 citations


"Effect of phase-sensitive environme..." refers background in this paper

  • ...One class of techniques that addresses this problem consists of model-based techniques that either modify the back-end statistical models [1, 2] or compensate the observed acoustic feature vectors using estimates of the clean speech and/or the background (noise) model parameters [3, 4, 5, 6]....


Proceedings Article
01 Oct 2004
TL;DR: Two techniques to cover the gap between the true and the estimated clean speech features in the context of Model-Based Feature Enhancement (MBFE) for noise robust speech recognition are presented.
Abstract: In this paper we present two techniques to cover the gap between the true and the estimated clean speech features in the context of Model-Based Feature Enhancement (MBFE) for noise robust speech recognition. While in the output of every feature enhancement algorithm some residual uncertainty remains, currently this information is mostly discarded. Firstly, we explain how not only a global MMSE estimate of the clean speech but also several alternative (state-conditional) estimates are supplied to the back-end for recognition. Secondly, we explore the benefits of calculating the variance of the front-end estimate and incorporating this in the acoustic models of the recogniser. Experiments on the Aurora2 task confirmed the superior performance of the resulting system: an average increase in recognition accuracy from 85.65% to 88.50% was obtained for the clean training condition.
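The second idea in this abstract, passing the variance of the front-end estimate to the recogniser and adding it to the acoustic-model variances when scoring, can be sketched for a scalar feature. The Gaussian parameters below are invented for illustration:

```python
import numpy as np

def log_likelihood(x, mu, var, enh_var=0.0):
    """Log N(x; mu, var + enh_var): the enhancement variance simply
    broadens the acoustic-model Gaussian."""
    v = var + enh_var
    return -0.5 * (np.log(2 * np.pi * v) + (x - mu) ** 2 / v)

x_hat = 1.8                      # enhanced feature (point estimate)
ll_plain = log_likelihood(x_hat, mu=0.0, var=1.0)
ll_uncert = log_likelihood(x_hat, mu=0.0, var=1.0, enh_var=0.5)
```

For a feature that lands in the tail of the model Gaussian, the broadened variance penalises the mismatch less, so uncertain enhanced frames influence the state scores less sharply than confident ones.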

25 citations


Additional excerpts

  • ...One class of techniques that addresses this problem consists of model-based techniques that either modify the back-end statistical models [1, 2] or compensate the observed acoustic feature vectors using estimates of the clean speech and/or the background (noise) model parameters [3, 4, 5, 6]....


  • ...Instead, the back-end acoustic model, which is more detailed than the front-end, can use a larger context in the decision process [6]....


Proceedings ArticleDOI
17 May 2004
TL;DR: This paper successfully extended the model-based feature enhancement (MBFE) algorithm to jointly remove additive and convolutional noise from corrupted speech to cure the resulting performance degradation.
Abstract: In this paper we describe how we successfully extended the model-based feature enhancement (MBFE) algorithm to jointly remove additive and convolutional noise from corrupted speech. Although a model of the clean speech can incorporate prior knowledge into the feature enhancement process, this model no longer yields an accurate fit if a different microphone is used. To cure the resulting performance degradation, we merge a new iterative EM algorithm to estimate the channel, and the MBFE-algorithm to remove nonstationary additive noise. In the latter, the parameters of a shifted clean speech HMM and a noise HMM are first combined by a vector Taylor series approximation and then the state-conditional MMSE-estimates of the clean speech are calculated. Recognition experiments confirmed the superior performance on the Aurora4 recognition task. An average relative reduction in WER of 12% and 2.8% on the clean and multi condition training respectively, was obtained compared to the Advanced Front-End standard.

24 citations


Additional excerpts

  • ...Model-Based Feature Enhancement (MBFE) is a scalable and efficient technique to jointly reduce the interfering additive and convolutional noise from a noisy speech utterance before recognition by an ASR system [7, 9]....


  • ...The corresponding update formula is given by [9]: δh = [ Σ_t Σ_{(i,j)} γ_t^{(i,j)} F′^{(i,j)} ( Σ_x^{(i,j)} )^{-1} F^{(i,j)} ]^{-1}....


  • ...First, we showed how the phase difference between speech and noise (that is often neglected in the acoustic environment model) gives rise to an additional term in the calculation of the covariance matrices for the noisy speech....


  • ...The speaker-independent LVCSR system developed by the ESAT speech group of the K.U.Leuven is used as a back-end recogniser (details can be found in [9])....
