Proceedings ArticleDOI

Robust speech recognition through selection of speaker and environment transforms

TL;DR: The proposed method is simple, since it involves only the choice of pre-computed environment and speaker transforms, and can therefore be applied with very little test data, unlike many other speaker- and noise-compensation methods.
Abstract: In this paper, we address the problem of robustness to both noise and speaker variability in automatic speech recognition (ASR). We propose the use of pre-computed noise and speaker transforms; an optimal combination of the two is chosen at test time using the maximum-likelihood (ML) criterion. These pre-computed transforms are obtained during training from data covering the noise conditions usually encountered in the particular ASR task. The environment transforms are estimated during training within the constrained-MLLR (CMLLR) framework, while for the speaker transforms we use analytically determined linear-VTLN matrices. Even though the exact noise environment may not be encountered during test, the ML-based choice of the closest environment transform provides “sufficient” cleaning; this is corroborated by experimental results on the Aurora-2 task, with performance comparable to histogram-equalization and vector Taylor series approaches. The proposed method is simple, since it involves only the choice of pre-computed environment and speaker transforms, and can therefore be applied with very little test data, unlike many other speaker- and noise-compensation methods.
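The selection step the abstract describes can be pictured as an exhaustive ML search over the grid of pre-computed (environment, speaker) transform pairs. The sketch below is a toy illustration, assuming a diagonal-Gaussian stand-in for the full HMM likelihood; the names `log_likelihood`, `apply_cmllr`, and `select_transforms` are hypothetical, not from the paper:

```python
import numpy as np

def log_likelihood(features, model_mean, model_var):
    """Diagonal-Gaussian log-likelihood summed over frames (toy stand-in
    for a full HMM/GMM scoring pass)."""
    diff = features - model_mean
    return -0.5 * np.sum(diff**2 / model_var + np.log(2 * np.pi * model_var))

def apply_cmllr(features, A, b):
    """Constrained-MLLR feature transform: x' = A x + b."""
    return features @ A.T + b

def select_transforms(features, env_transforms, vtln_matrices, mean, var):
    """Return the (environment, warp) index pair with the highest
    likelihood on the test features, plus that likelihood."""
    best, best_pair = -np.inf, None
    for i, (A_env, b_env) in enumerate(env_transforms):
        for j, A_vtln in enumerate(vtln_matrices):
            # Speaker (VTLN) transform first, then the environment transform.
            x = apply_cmllr(features @ A_vtln.T, A_env, b_env)
            ll = log_likelihood(x, mean, var)
            if ll > best:
                best, best_pair = ll, (i, j)
    return best_pair, best
```

Because both transform sets are pre-computed during training, test-time work reduces to this scoring loop, which is why the method needs so little test data.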
References
Journal ArticleDOI
TL;DR: The paper compares the two possible forms of model-based transforms: unconstrained, where any combination of mean and variance transform may be used, and constrained, which requires the variance transform to have the same form as the mean transform.

1,755 citations


"Robust speech recognition through s..." refers to methods in this paper

  • ...Two broad approaches to speaker-normalization are speaker-adaptation based approaches such as Maximum Likelihood Linear Regression (MLLR) or Constrained-MLLR (CMLLR) [1] and Vocal-tract Length Normalization (VTLN) [2]....


  • ...The pre-computed environment transforms were obtained from training data using the CMLLR framework, while the speaker transforms are a set of linear-VTLN matrices corresponding to the range of warp factors....


  • ...(1)(c) also shows the results for the best case (an upper bound), wherein a CMLLR transform is used instead of VTLN as the speaker transform....


  • ...Once the speaker variability is removed from the features, we can use all the train utterances collected in a specific noise environment (e.g. car noise, restaurant, etc.) at different noise levels (e.g. very noisy, noisy, less noisy, clean) and estimate environment-noise-specific CMLLR transforms....


  • ...In [10], a cascade of CMLLR transforms is used, which enables a transform estimated in one environment to be used with the same speaker in another environment....


Proceedings ArticleDOI
07 May 1996
TL;DR: This work introduces the use of a vector Taylor series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel.
Abstract: In this paper we introduce a new analytical approach to environment compensation for speech recognition. Previous attempts at solving analytically the problem of noisy speech recognition have either used an overly-simplified mathematical description of the effects of noise on the statistics of speech or they have relied on the availability of large environment-specific adaptation sets. Some of the previous methods required the use of adaptation data that consists of simultaneously-recorded or "stereo" recordings of clean and degraded speech. In this work we introduce the use of a vector Taylor series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel. The VTS approach is computationally efficient. It can be applied either to the incoming speech feature vectors, or to the statistics representing these vectors. In the first case the speech is compensated and then recognized; in the second case HMM statistics are modified using the VTS formulation. Both approaches use only the actual speech segment being recognized to compute the parameters required for environmental compensation. We evaluate the performance of two implementations of VTS algorithms using the CMU SPHINX-II system on the 100-word alphanumeric CENSUS database and on the 1993 5000-word ARPA Wall Street Journal database. Artificial white Gaussian noise is added to both databases. The VTS approaches provide significant improvements in recognition accuracy compared to previous algorithms.
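The first-order VTS relations this abstract refers to can be sketched in the log-spectral domain, assuming the standard mismatch function y = x + h + log(1 + exp(n − x − h)) for clean speech x, additive noise n, and channel h. This is an illustration of the general technique only, not the paper's SPHINX-II implementation:

```python
import numpy as np

def vts_compensate_mean(mu_x, mu_n, h):
    """First-order VTS mean compensation in the log-spectral domain:
    mu_y ≈ mu_x + h + log(1 + exp(mu_n - mu_x - h))."""
    return mu_x + h + np.log1p(np.exp(mu_n - mu_x - h))

def vts_jacobian(mu_x, mu_n, h):
    """G = dy/dx evaluated at the expansion point; governs how much of
    the clean-speech variance survives in the noisy observation."""
    return 1.0 / (1.0 + np.exp(mu_n - mu_x - h))

def vts_compensate_var(var_x, var_n, mu_x, mu_n, h):
    """Variance compensation under the linearization:
    var_y ≈ G^2 var_x + (1 - G)^2 var_n."""
    G = vts_jacobian(mu_x, mu_n, h)
    return G**2 * var_x + (1.0 - G)**2 * var_n
```

At low noise (mu_n far below mu_x) the compensation vanishes and G → 1; at high noise the noisy statistics are dominated by the noise model, matching the intuition behind the expansion.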

480 citations


"Robust speech recognition through s..." refers to methods in this paper

  • ...Combination of VTS with VTLN [7] and VTS with MLLR are studied in [8]....


  • ...In the histogram based approaches, adequate speech data is required to get robust estimates of the quantiles, while in the VTS based approach the noise models are obtained from the first few and last few frames of the utterance....


  • ...Two commonly used noise-compensation approaches are those based on histogram equalization (HEQ) [3] and those based on Vector Taylor Series (VTS) [4]....


  • ...nonlinear compensation techniques like HEQ and VTS with VTLN [5][7]....


Journal ArticleDOI
TL;DR: An efficient means for estimating a linear frequency Warping factor and a simple mechanism for implementing frequency warping by modifying the filterbank in mel-frequency cepstrum feature analysis are presented.
Abstract: In an effort to reduce the degradation in speech recognition performance caused by variation in vocal tract shape among speakers, a frequency warping approach to speaker normalization is investigated. A set of low complexity, maximum likelihood based frequency warping procedures have been applied to speaker normalization for a telephone based connected digit recognition task. This paper presents an efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filterbank in mel-frequency cepstrum feature analysis. An experimental study comparing these techniques to other well-known techniques for reducing variability is described. The results have shown that frequency warping is consistently able to reduce word error rate by 20% even for very short utterances.
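The filterbank-based warping this abstract describes can be illustrated by warping the frequency axis before placing the mel filters. The sketch below assumes a piecewise-linear warp with a hypothetical inflection point at 0.85 of the maximum frequency, so the warped axis still spans the full range; the exact warp form used in the paper may differ:

```python
import numpy as np

def warp_frequencies(freqs, alpha, f_max):
    """Piecewise-linear VTLN-style frequency warp (assumed form).
    Below the cutoff the axis is scaled by the warp factor alpha; above
    it, a linear segment maps the remainder onto [alpha*cutoff, f_max]
    so that f_max always maps to f_max."""
    cutoff = 0.85 * f_max  # hypothetical inflection point
    return np.where(
        freqs <= cutoff,
        alpha * freqs,
        alpha * cutoff
        + (f_max - alpha * cutoff) / (f_max - cutoff) * (freqs - cutoff),
    )
```

Applying this warp to the filterbank center frequencies (rather than resampling the signal) is what makes the normalization cheap enough to search over a small grid of warp factors per speaker.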

338 citations


"Robust speech recognition through s..." refers to methods in this paper

  • ...Two broad approaches to speaker-normalization are speaker-adaptation based approaches such as Maximum Likelihood Linear Regression (MLLR) or Constrained-MLLR (CMLLR) [1] and Vocal-tract Length Normalization (VTLN) [2]....


Journal ArticleDOI
TL;DR: The paper describes how the proposed method of compensating for nonlinear distortions in speech representation caused by noise can be applied to robust speech recognition and it is compared with other compensation techniques.
Abstract: This paper describes a method of compensating for nonlinear distortions in speech representation caused by noise. The method described here is based on the histogram equalization method often used in digital image processing. Histogram equalization is applied to each component of the feature vector in order to improve the robustness of speech recognition systems. The paper describes how the proposed method can be applied to robust speech recognition and it is compared with other compensation techniques. The recognition experiments, including results in the AURORA II framework, demonstrate the effectiveness of histogram equalization when it is applied either alone or in combination with other compensation techniques.
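Per-component histogram equalization amounts to mapping each test feature value through the test CDF and then through the inverse reference CDF, so the transformed component matches the reference distribution. A minimal empirical-quantile sketch for one feature dimension (the function name is illustrative, not from the paper):

```python
import numpy as np

def histogram_equalize(test_feat, ref_feat):
    """Map a 1-D test feature component onto the reference distribution
    via empirical CDF matching (per-dimension HEQ)."""
    # Empirical CDF value of each test sample within the test utterance.
    ranks = np.searchsorted(np.sort(test_feat), test_feat, side="right")
    cdf = ranks / len(test_feat)
    # Invert the reference CDF by linear interpolation over its quantiles.
    ref_sorted = np.sort(ref_feat)
    ref_cdf = np.arange(1, len(ref_feat) + 1) / len(ref_feat)
    return np.interp(cdf, ref_cdf, ref_sorted)
```

As the excerpts above note, this needs enough test speech to estimate the quantiles robustly, which is the practical limitation the selected-transform approach avoids.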

332 citations


"Robust speech recognition through s..." refers to methods in this paper

  • ...Two commonly used noise-compensation approaches are those based on histogram equalization (HEQ) [3] and those based on Vector Taylor Series (VTS) [4]....



  • ...nonlinear compensation techniques like HEQ and VTS with VTLN [5][7]....


  • ...Combination of VTLN with HEQ is studied in [5][6]....


Proceedings Article
01 Jan 2001
TL;DR: The proposed technique, acoustic factorisation, attempts to model explicitly all the factors that affect the acoustic signal, and may be used in a more flexible fashion than in standard adaptive training schemes.
Abstract: This paper describes a new technique for training a speech recognition system on inhomogeneous training data. The proposed technique, acoustic factorisation, attempts to model explicitly all the factors that affect the acoustic signal. By explicitly modelling all the factors, the trained model set may be used in a more flexible fashion than in standard adaptive training schemes. Since an individual model is trained for each factor, it is possible to factor-in only those factors that are appropriate to a particular target domain, for example the distribution over all training speakers. The target domain specific factors are simply estimated from limited target specific data, for example the target acoustic environment. The paper describes the theory of this new approach for the transforms for a particular speaker and environment. Initial experiments on a large vocabulary speech recognition task are presented.
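Acoustic factorisation relies on speaker and environment effects being separable into factor-specific transforms that can be cascaded. As a minimal illustration of why cascading is cheap (not the paper's CAT/MLLR implementation): two affine feature transforms applied in sequence collapse into a single affine transform, so swapping the environment factor leaves the speaker factor untouched.

```python
import numpy as np

def compose_affine(A1, b1, A2, b2):
    """Compose two affine feature transforms applied in sequence:
    x' = A2 (A1 x + b1) + b2 = (A2 A1) x + (A2 b1 + b2)."""
    return A2 @ A1, A2 @ b1 + b2
```

With this composition, a speaker transform estimated in one environment can be paired with any environment transform by a single matrix product, which is the flexibility the factorised training scheme is after.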

41 citations


"Robust speech recognition through s..." refers to methods in this paper

  • ...Further, in [9][10], the noise and environment transforms have to be estimated using test utterances as adaptation data....


  • ...Gales [9] proposed the acoustic-factorization approach to separate the noise and speaker effects, using a cluster-adaptive training (CAT) approach for environment-transform estimation and MLLR for speaker-transform estimation....
