Proceedings ArticleDOI

Noise and speaker compensation in the Log filter bank domain

25 Mar 2012-pp 4709-4712

TL;DR: The elegance of the proposed approach is that, given the speech data, the authors directly obtain MFCC features that are robust to noise and speaker variations; the approach shows a significant relative improvement of 31% over the baseline on the Aurora-4 task.

Abstract: In this paper, we propose a method to compensate for noise and speaker variability directly in the Log filter-bank (FB) domain, so that MFCC features are robust to noise and speaker variations. For noise compensation, we use the Vector Taylor Series (VTS) approach in the Log FB domain, and speaker normalization is also done in the Log FB domain using linear Vocal Tract Length Normalization (VTLN) matrices. For VTLN, optimal selection of the warp factor is done in the Log FB domain using a canonical GMM model, avoiding the two-pass approach needed by an HMM model. Further, this can be efficiently implemented using sufficient statistics obtained from the GMM and the FB-VTLN matrices. The warp-factor selection using the GMM can also be done in the cepstral domain by applying DCT matrices, without the usual approximations associated with conventional linear VTLN. The elegance of the proposed approach is that, given the speech data, we directly obtain MFCC features that are robust to noise and speaker variations. The proposed approach shows a significant relative improvement of 31% over the baseline on the Aurora-4 task.
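As a minimal numeric sketch of the mismatch being compensated (channel count and values are illustrative, not from the paper): in the linear filter-bank domain, noisy energy is the sum of clean and noise energies, so per Log FB channel y = x + log(1 + exp(n - x)).

```python
import numpy as np

# Toy sketch of the Log FB mismatch function (all names and sizes are
# illustrative): linear-domain addition of clean and noise energies
# becomes y = x + log(1 + exp(n - x)) in the log domain, per channel.

def noisy_log_fb(x, n):
    return x + np.log1p(np.exp(n - x))

num_channels = 24
x = np.zeros(num_channels)           # clean Log FB energies
n = np.full(num_channels, -2.0)      # noise Log FB energies (below the speech)
y = noisy_log_fb(x, n)               # equals log(exp(x) + exp(n)) per channel
```

Additive noise can only raise the log energy, which is why compensation amounts to undoing this nonlinear shift.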



Citations
Journal ArticleDOI
TL;DR: This paper presents a new approach to speech enhancement from single-channel measurements involving both noise and channel distortion (i.e., convolutional noise), and demonstrates its applications for robust speech recognition and for improving noisy speech quality.
Abstract: This paper presents a new approach to speech enhancement from single-channel measurements involving both noise and channel distortion (i.e., convolutional noise), and demonstrates its applications for robust speech recognition and for improving noisy speech quality. The approach is based on finding longest matching segments (LMS) from a corpus of clean, wideband speech. The approach adds three novel developments to our previous LMS research. First, we address the problem of channel distortion as well as additive noise. Second, we present an improved method for modeling noise for speech estimation. Third, we present an iterative algorithm which updates the noise and channel estimates of the corpus data model. In experiments using speech recognition as a test with the Aurora 4 database, the use of our enhancement approach as a preprocessor for feature extraction significantly improved the performance of a baseline recognition system. In another comparison against conventional enhancement algorithms, both the PESQ and the segmental SNR ratings of the LMS algorithm were superior to the other methods for noisy speech enhancement.
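The core LMS search can be illustrated with a hedged toy sketch: slide the noisy sequence over a clean corpus and keep the longest run of consecutive frames within a distance tolerance. The real method operates on speech feature vectors with noise and channel models; the names and 1-D "frames" below are made up for illustration.

```python
# Toy longest-matching-segment (LMS) search over scalar "frames";
# function name and tolerance are illustrative, not from the paper.

def longest_matching_segment(noisy, corpus, tol=0.5):
    best = (0, 0, 0)                  # (length, start_in_noisy, start_in_corpus)
    for i in range(len(noisy)):
        for j in range(len(corpus)):
            L = 0
            while (i + L < len(noisy) and j + L < len(corpus)
                   and abs(noisy[i + L] - corpus[j + L]) <= tol):
                L += 1
            if L > best[0]:
                best = (L, i, j)
    return best

corpus = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noisy = [9.0, 2.1, 3.1, 3.9, 9.0]     # middle frames roughly match corpus[2:5]
length, ni, ci = longest_matching_segment(noisy, corpus)
```

A production system would use vector distances and prune the quadratic search, but the matched clean segments are then the basis for the enhanced estimate.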

8 citations

Proceedings ArticleDOI
14 Sep 2014
TL;DR: This paper presents a new approach to single-channel speech enhancement involving both noise and channel distortion (i.e., convolutional noise) based on finding longest matching segments (LMS) from a corpus of clean, wideband speech.
Abstract: This paper presents a new approach to single-channel speech enhancement involving both noise and channel distortion (i.e., convolutional noise). The approach is based on finding longest matching segments (LMS) from a corpus of clean, wideband speech. The approach adds three novel developments to our previous LMS research. First, we address the problem of channel distortion as well as additive noise. Second, we present an improved method for modeling noise. Third, we present an iterative algorithm for improved speech estimates. In experiments using speech recognition as a test with the Aurora 4 database, the use of our enhancement approach as a preprocessor for feature extraction significantly improved the performance of a baseline recognition system. In another comparison against conventional enhancement algorithms, both the PESQ and the segmental SNR ratings of the LMS algorithm were superior to the other methods for noisy speech enhancement.

3 citations


References
Proceedings ArticleDOI
07 May 1996
TL;DR: This work introduces the use of a vector Taylor series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel.
Abstract: In this paper we introduce a new analytical approach to environment compensation for speech recognition. Previous attempts at solving analytically the problem of noisy speech recognition have either used an overly-simplified mathematical description of the effects of noise on the statistics of speech or they have relied on the availability of large environment-specific adaptation sets. Some of the previous methods required the use of adaptation data that consists of simultaneously-recorded or "stereo" recordings of clean and degraded speech. In this work we introduce the use of a vector Taylor series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel. The VTS approach is computationally efficient. It can be applied either to the incoming speech feature vectors, or to the statistics representing these vectors. In the first case the speech is compensated and then recognized; in the second case HMM statistics are modified using the VTS formulation. Both approaches use only the actual speech segment being recognized to compute the parameters required for environmental compensation. We evaluate the performance of two implementations of VTS algorithms using the CMU SPHINX-II system on the 100-word alphanumeric CENSUS database and on the 1993 5000-word ARPA Wall Street Journal database. Artificial white Gaussian noise is added to both databases. The VTS approaches provide significant improvements in recognition accuracy compared to previous algorithms.
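The VTS idea can be sketched as a first-order Taylor expansion of the log-domain mismatch function y = x + log(1 + exp(n - x)); the expansion points and vectors below are toy values, not from the paper.

```python
import numpy as np

# First-order VTS linearization of the log-domain mismatch (toy values):
# dy/dx = sigmoid(x0 - n0) and dy/dn = 1 - dy/dx, elementwise (diagonal
# Jacobians), so the expansion stays cheap to evaluate.

def mismatch(x, n):
    return x + np.log1p(np.exp(n - x))

def vts_first_order(x, n, x0, n0):
    G = 1.0 / (1.0 + np.exp(n0 - x0))        # diagonal Jacobian w.r.t. x
    return mismatch(x0, n0) + G * (x - x0) + (1.0 - G) * (n - n0)

x0 = np.array([0.0, 1.0]); n0 = np.array([-1.0, -1.0])
x = x0 + 0.01; n = n0 - 0.02                 # small perturbation
approx = vts_first_order(x, n, x0, n0)
exact = mismatch(x, n)
err = np.max(np.abs(approx - exact))         # second-order small near x0, n0
```

Because the expansion is linear in x and n, Gaussian statistics of clean speech and noise map to Gaussian statistics of noisy speech, which is what makes HMM/feature compensation tractable.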

477 citations

Journal ArticleDOI
TL;DR: An efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filterbank in mel-frequency cepstrum feature analysis are presented.
Abstract: In an effort to reduce the degradation in speech recognition performance caused by variation in vocal tract shape among speakers, a frequency warping approach to speaker normalization is investigated. A set of low complexity, maximum likelihood based frequency warping procedures have been applied to speaker normalization for a telephone based connected digit recognition task. This paper presents an efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filterbank in mel-frequency cepstrum feature analysis. An experimental study comparing these techniques to other well-known techniques for reducing variability is described. The results have shown that frequency warping is consistently able to reduce word error rate by 20% even for very short utterances.
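Filterbank-based frequency warping of the kind described here can be sketched as a piecewise-linear remapping of the filterbank center frequencies; the break point, band edge, and function names below are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

# Hedged sketch of piecewise-linear VTLN frequency warping (parameters
# are illustrative): linear with slope alpha up to a break frequency,
# then a second linear segment chosen so the band edge maps to itself.

def warp_frequency(f, alpha, f_max=8000.0, f_break=0.875):
    fb = f_break * f_max
    f = np.asarray(f, dtype=float)
    lo = alpha * f                                         # main linear warp
    hi = alpha * fb + (f_max - alpha * fb) * (f - fb) / (f_max - fb)
    return np.where(f <= fb, lo, hi)

centers = np.linspace(100.0, 7900.0, 24)   # nominal FB center frequencies
warped = warp_frequency(centers, alpha=1.1)
```

In practice the warped centers are plugged back into the mel filterbank, so normalization costs nothing beyond recomputing the filters for each candidate warp factor.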

322 citations


"Noise and speaker compensation in t..." refers background in this paper

  • ...Both noise compensation and speaker normalization are done in the feature domain....


Journal ArticleDOI
TL;DR: The performance of the new LT was comparable to that of regular VTLN implemented by warping the Mel filterbank, when the MLS criterion was used for FW estimation, and it is shown that the approximations involved do not lead to any performance degradation.
Abstract: Vocal tract length normalization (VTLN) for standard filterbank-based Mel frequency cepstral coefficient (MFCC) features is usually implemented by warping the center frequencies of the Mel filterbank, and the warping factor is estimated using the maximum likelihood score (MLS) criterion. A linear transform (LT) equivalent for frequency warping (FW) would enable more efficient MLS estimation. We recently proposed a novel LT to perform FW for VTLN and model adaptation with standard MFCC features. In this paper, we present the mathematical derivation of the LT and give a compact formula to calculate it for any FW function. We also show that our LT is closely related to different LTs previously proposed for FW with cepstral features, and these LTs for FW are all shown to be numerically almost identical for the sine-log all-pass transform (SLAPT) warping functions. Our formula for the transformation matrix is, however, computationally simpler and, unlike other previous LT approaches to VTLN with MFCC features, no modification of the standard MFCC feature extraction scheme is required. In VTLN and speaker adaptive modeling (SAM) experiments with the DARPA resource management (RM1) database, the performance of the new LT was comparable to that of regular VTLN implemented by warping the Mel filterbank, when the MLS criterion was used for FW estimation. This demonstrates that the approximations involved do not lead to any performance degradation. Performance comparable to front end VTLN was also obtained with LT adaptation of HMM means in the back end, combined with mean bias and variance adaptation according to the maximum likelihood linear regression (MLLR) framework. The FW methods performed significantly better than standard MLLR for very limited adaptation data (1 utterance), and were equally effective with unsupervised parameter estimation. We also performed speaker adaptive training (SAT) with feature space LT denoted CLTFW. 
Global CLTFW SAT gave results comparable to SAM and VTLN. By estimating multiple CLTFW transforms using a regression tree, and including an additive bias, we obtained significantly improved results compared to VTLN, with increasing adaptation data.

46 citations


"Noise and speaker compensation in t..." refers methods in this paper

  • ...VTS model for noisy cepstra is given by cy = cx + D·log(1 + e^(D^-1 (cn - cx)))  (9), where cy is the noisy cepstra, cx is the clean cepstra, cn is the noise vector due to additive noise at the input, and D is the DCT matrix....

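The cepstral-domain VTS relation quoted above, cy = cx + D·log(1 + e^(D^-1 (cn - cx))), can be checked numerically under a simplifying assumption: a full square orthonormal DCT, so that D^-1 = D^T (truncated MFCCs would need a pseudo-inverse). Sizes and values are illustrative.

```python
import numpy as np

# Numeric check of the quoted VTS cepstral model using a full square,
# orthonormal DCT-II matrix (so inv(D) == D.T); sizes are illustrative.

def dct_matrix(M):
    k = np.arange(M)[:, None]
    m = np.arange(M)[None, :]
    D = np.sqrt(2.0 / M) * np.cos(np.pi * (m + 0.5) * k / M)
    D[0, :] /= np.sqrt(2.0)
    return D                          # orthonormal DCT-II

M = 24
D = dct_matrix(M)
fx = np.random.default_rng(0).normal(size=M)   # clean Log FB energies
fn = fx - 2.0                                  # noise Log FB energies
cx, cn = D @ fx, D @ fn                        # clean / noise cepstra

cy = cx + D @ np.log1p(np.exp(D.T @ (cn - cx)))   # the quoted VTS model
fy = fx + np.log1p(np.exp(fn - fx))               # same mismatch in Log FB domain
```

With an orthonormal D the cepstral formula is exactly the DCT of the Log FB mismatch, which is why compensation can be done in either domain.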

Proceedings ArticleDOI
04 Sep 2005
TL;DR: The proposed method exploits the bandlimited interpolation idea (in the frequency domain) to do the necessary frequency warping and yields exact results as long as the cepstral coefficients are quefrency-limited.

Abstract: In this paper, we show that frequency-warping (including VTLN) can be implemented through linear transformation of conventional MFCC. Unlike the Pitz-Ney [1] continuous domain approach, we directly determine the relation between frequency-warping and the linear transformation in the discrete domain. The advantage of such an approach is that it can be applied to any frequency-warping and is not limited to cases where an analytical closed-form solution can be found. The proposed method exploits the bandlimited interpolation idea (in the frequency domain) to do the necessary frequency-warping and yields exact results as long as the cepstral coefficients are quefrency-limited. This idea of quefrency-limitedness shows the importance of the filter-bank smoothing of the spectra, which has been ignored in [1, 2]. Furthermore, unlike [1], since we operate in the discrete domain, we can also apply the usual discrete-cosine transform (i.e. DCT-II) on the logarithm of the filter-bank output to get conventional MFCC features. Therefore, using our proposed method, we can linearly transform conventional MFCC cepstra to do VTLN and we do not require any recomputation of the warped features. We provide experimental results in support of this approach.
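The linear-transform view can be sketched as follows, under illustrative assumptions (small sizes, a plain linear warp of the bin index): for a quefrency-limited cepstrum, the log spectrum can be resampled at warped, possibly fractional, bin positions via the inverse-DCT basis, so warping becomes c_warped = (D @ W) c.

```python
import numpy as np

# Hedged sketch of frequency warping as a linear transform of cepstra:
# W evaluates the inverse-DCT basis at warped bin positions, so the
# warped cepstra are (D @ W) @ c. Sizes and the warp are toy choices.

def dct_matrix(M):
    k = np.arange(M)[:, None]
    m = np.arange(M)[None, :]
    D = np.sqrt(2.0 / M) * np.cos(np.pi * (m + 0.5) * k / M)
    D[0, :] /= np.sqrt(2.0)
    return D                                  # orthonormal: D @ D.T == I

def idct_basis_at(points, M):
    # inverse-DCT basis evaluated at (possibly fractional) bin positions
    k = np.arange(M)[None, :]
    s = np.full(M, np.sqrt(2.0 / M)); s[0] = np.sqrt(1.0 / M)
    p = np.asarray(points, dtype=float)[:, None]
    return s * np.cos(np.pi * (p + 0.5) * k / M)

M = 20
D = dct_matrix(M)
identity_positions = np.arange(M, dtype=float)
T_identity = D @ idct_basis_at(identity_positions, M)   # should be I

alpha = 1.05                                  # toy linear warp of the bin index
warped_positions = np.clip(alpha * identity_positions, 0.0, M - 1.0)
T_warp = D @ idct_basis_at(warped_positions, M)         # c_warped = T_warp @ c
```

The identity warp recovering the identity matrix is the sanity check that the band-limited interpolation is exact for quefrency-limited cepstra.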

35 citations


"Noise and speaker compensation in t..." refers background or methods in this paper

  • ...5 are obtained for low pass (LP) filtered and high pass (HP) filtered cepstral coefficients at different SNR levels....


  • ...VTS model for noisy cepstra is given by cy = cx + D·log(1 + e^(D^-1 (cn - cx)))  (9), where cy is the noisy cepstra, cx is the clean cepstra, cn is the noise vector due to additive noise at the input, and D is the DCT matrix....


  • ...The basic idea of this analysis is that, for any good "noise compensating" method, the histogram of "cleaned" features should match that of the original speech features....


Proceedings Article
01 Jan 2008
TL;DR: This paper develops a computationally efficient approach for warp factor estimation in Vocal Tract Length Normalization (VTLN) that has recognition performance that is comparable to conventional VTLN and yet is computationally more efficient.
Abstract: In this paper, we develop a computationally efficient approach for warp factor estimation in Vocal Tract Length Normalization (VTLN). Recently we have shown that warped features can be obtained by a linear transformation of the unwarped features. Using the warp matrices we show that warp factor estimation can be efficiently performed in an EM framework. This can be done by collecting Sufficient Statistics by aligning the unwarped utterances only once. The likelihood of warped features, which are necessary for warp factor estimation, are computed by appropriately modifying the sufficient statistics using the warp matrices. We show using OGI, TIDIGITS and RM task that this approach has recognition performance that is comparable to conventional VTLN and yet is computationally more efficient.
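The selection step itself can be sketched in a hedged way: score each candidate warp matrix A_alpha by the log-likelihood of the warped features under a canonical GMM and keep the argmax. The GMM, the features, and the (diagonal) warp matrices below are toy stand-ins, not the paper's models; the efficiency gain from sufficient statistics is not reproduced here.

```python
import numpy as np

# Toy GMM-based warp-factor selection: everything below (features, GMM,
# diagonal "warp matrices") is an illustrative stand-in.

def gmm_loglik(X, means, weights):
    # diagonal unit-variance GMM, log-likelihood summed over all frames
    diff = X[:, None, :] - means[None, :, :]               # (T, K, d)
    log_comp = -0.5 * np.sum(diff ** 2, axis=2) + np.log(weights)
    log_comp -= 0.5 * X.shape[1] * np.log(2.0 * np.pi)
    m = log_comp.max(axis=1)
    return float(np.sum(m + np.log(np.sum(np.exp(log_comp - m[:, None]), axis=1))))

rng = np.random.default_rng(0)
d, T, K = 4, 100, 3
means = 5.0 * rng.normal(size=(K, d))          # well-separated components
weights = np.full(K, 1.0 / K)
X = means[rng.integers(K, size=T)]             # frames drawn at the means

alphas = [0.9, 1.0, 1.1]
warps = {a: a * np.eye(d) for a in alphas}     # toy stand-ins for A_alpha
scores = {a: gmm_loglik(X @ warps[a].T, means, weights) for a in alphas}
best = max(scores, key=scores.get)             # identity warp should win here
```

The paper's contribution is to evaluate these per-alpha likelihoods from sufficient statistics collected in a single alignment pass, rather than re-scoring the data for every candidate as this naive sketch does.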

31 citations


"Noise and speaker compensation in t..." refers methods in this paper

  • ...• VTLN Warping: Warping is done in the Log FB domain by multiplying the VTS-compensated Log FB coefficients (Fvts) with warping matrices (A0....
