What contributions have the authors mentioned in the paper "A vector Taylor series approach for environment-independent speech recognition"?

In this paper the authors introduce a new analytical approach to environment compensation for speech recognition. They use a Vector Taylor Series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel. The authors evaluate the performance of two implementations of VTS algorithms using the CMU SPHINX-II system on the 100-word alphanumeric CENSUS database and on the 1993 5000-word ARPA Wall Street Journal database. The VTS approaches provide significant improvements in recognition accuracy compared to previous algorithms.

What is the description of the VTS algorithm?

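The VTS algorithms approximate the generic vector function f(x, n, q) with a truncated vector Taylor series: VTS-0 uses only the zeroth-order terms, and VTS-1 uses the zeroth-order and first-order terms. The authors also speculate that using more generic polynomial approximations that are optimized to minimize the error for the parameters of the distribution of z may give even better performance.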

(Open Access) A vector Taylor series approach for environment-independent speech recognition (1996) | Pedro J. Moreno

Q: What is the function of the linear filter?

As in previous papers, the authors assume a model of the environment in which speech is corrupted by unknown additive stationary noise and linearly filtered by an unknown channel: $Z(\omega) = X(\omega)\,|H(\omega)|^2 + N(\omega)$, where $Z(\omega)$ represents the power spectrum of the degraded speech, $X(\omega)$ is the power spectrum of the clean speech, $H(\omega)$ is the transfer function of the linear filter, and $N(\omega)$ is the power spectrum of the additive noise.

A VECTOR TAYLOR SERIES APPROACH FOR ENVIRONMENT-INDEPENDENT SPEECH RECOGNITION

Pedro J. Moreno, Bhiksha Raj and Richard M. Stern

Department of Electrical and Computer Engineering & School of Computer Science

Carnegie Mellon University

Pittsburgh, Pennsylvania 15213

ABSTRACT

In this paper we introduce a new analytical approach to environment compensation for speech recognition. Previous attempts at solving analytically the problem of noisy speech recognition have either used an overly simplified mathematical description of the effects of noise on the statistics of speech or they have relied on the availability of large environment-specific adaptation sets. Some of the previous methods required the use of adaptation data that consists of simultaneously-recorded or “stereo” recordings of clean and degraded speech. In this work we introduce the use of a Vector Taylor series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel. The VTS approach is computationally efficient. It can be applied either to the incoming speech feature vectors, or to the statistics representing these vectors. In the first case the speech is compensated and then recognized; in the second case HMM statistics are modified using the VTS formulation. Both approaches use only the actual speech segment being recognized to compute the parameters required for environmental compensation.

We evaluate the performance of two implementations of VTS algorithms using the CMU SPHINX-II system on the 100-word alphanumeric CENSUS database and on the 1993 5000-word ARPA Wall Street Journal database. Artificial white Gaussian noise is added to both databases. The VTS approaches provide significant improvements in recognition accuracy compared to previous algorithms.

1. INTRODUCTION

As speech recognition systems become more accurate and more sophisticated, robustness to noise, channel, and other environmental effects becomes increasingly important. In the past few years, researchers at CMU and other sites have developed a series of techniques to address this problem. Many of these environment compensation algorithms take advantage of the availability of “stereo data”, i.e. speech databases that are simultaneously recorded in high-quality and degraded environments (e.g. [1][2]). Other algorithms make use of non-simultaneously-recorded adaptation data from the degraded environment (e.g. [5]). Still other algorithms (e.g. [6]) use knowledge of noise statistics and extensive computation to adapt the HMMs of clean speech to a new environment. Unfortunately, stereo data, a priori knowledge about the testing environment, and/or the computational resource requirements of such algorithms are frequently unavailable.

From a practical point of view, algorithms that can compensate for the effects of the environment with almost no previous knowledge, and that only require a small segment of the speech signal to perform the compensation, are far more attractive than those that require environment-specific training information of any sort. Such compensation algorithms tend to be based on an analytic characterization of the nature of the degradation, rather than a mere empirical characterization of a large number of examples.

The CDCN algorithm [3] is an example of this class of model-based algorithms that has been applied with success to several databases. Nevertheless, the CDCN algorithm has some limitations:

• It does not model the effects of the environment on the variance of speech distributions.

• The noise is approximated with only limited accuracy at low SNRs.

The VTS algorithms described in this paper address these problems. Specifically, they:

• Require only the segment of noisy speech signal to be recognized to perform compensation.

• Model the effect of the environment on all the statistics of the probability density function (PDF) of speech.

• Provide a unified treatment of the noise and channel reestimation problem.

• Use a better, Gaussian, model for the PDF of the log-spectra of the noise.

2. A MODEL OF THE ENVIRONMENT

As in previous papers we assume a model of the environment in which speech is corrupted by unknown additive stationary noise and linearly filtered by an unknown channel:

$$Z(\omega) = X(\omega)\,|H(\omega)|^2 + N(\omega)$$

where $Z(\omega)$ represents the power spectrum of the degraded speech, $X(\omega)$ is the power spectrum of the clean speech, $H(\omega)$ is the transfer function of the linear filter, and $N(\omega)$ is the power spectrum of the additive noise.

In the log-spectral domain this relation can be expressed as:

$$z = x + q + \log\left(1 + e^{\,n - x - q}\right)$$

or in more general terms

$$z = x + f(x, n, q)$$

where q is an unknown parameter that represents the effects of linear filtering in the log-spectral domain.

We also assume that the PDF of the log-spectra of the speech signal can be well represented by a summation of multivariate Gaussian distributions:

$$p(x) = \sum_{k=0}^{M-1} P[k]\, \mathcal{N}_x(\mu_{x,k}, \Sigma_{x,k})$$

Furthermore, we assume that the statistics of the noise can be well represented by a single Gaussian $\mathcal{N}_n(\mu_n, \Sigma_n)$.

The problem of compensation is twofold. First, the parameters q, $\mu_n$, and $\Sigma_n$ need to be determined. Second, the distribution of z given the PDF of x and the parameters q, $\mu_n$, and $\Sigma_n$ has to be computed. Because of the non-linearity of the function $f(x, n, q)$, both problems are non-trivial. Only for very simple expressions of the function $f(x, n, q)$ can p(z) be computed analytically. For other functions such as $\log(1 + e^{\,n - x - q})$ it is not possible to compute p(z) analytically. While p(z) could be computed by Monte-Carlo methods, this approach is computationally expensive and requires previous knowledge of the parameters $\mu_n$, $\Sigma_n$ and q. VTS provides a framework that enables an analytical solution to both problems.
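The environment model can be illustrated with a minimal NumPy sketch for a single log-spectral component. The function names (`f_env`, `degrade`, `monte_carlo_stats`), the sample count, and the seed are illustrative assumptions, not part of the paper; the Monte-Carlo routine only shows the brute-force alternative that the text describes as expensive.

```python
import numpy as np

def f_env(x, n, q):
    """Nonlinear term f(x, n, q) = q + log(1 + exp(n - x - q)), so z = x + f."""
    # logaddexp(0, a) evaluates log(1 + exp(a)) without overflow.
    return q + np.logaddexp(0.0, n - x - q)

def degrade(x, n, q):
    """Log-spectrum z of the degraded speech for clean speech x, noise n, channel q."""
    return x + f_env(x, n, q)

def monte_carlo_stats(mu_x, var_x, mu_n, var_n, q, num_samples=200_000, seed=0):
    """Brute-force estimate of the mean and variance of z (assumed scalar Gaussians)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_x, np.sqrt(var_x), num_samples)
    n = rng.normal(mu_n, np.sqrt(var_n), num_samples)
    z = degrade(x, n, q)
    return z.mean(), z.var()

if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    print(monte_carlo_stats(mu_x=5.0, var_x=1.0, mu_n=4.0, var_n=0.1, q=0.5))
```

Exponentiating `degrade(x, n, q)` recovers the power-domain relation, which is a quick way to check that the log-domain expression matches the additive model.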

3. DESCRIPTION OF THE VTS ALGORITHMS

The key of the new VTS algorithms is to approximate the generic vector function $f(x, n, q)$ with a vector Taylor series approximation:

$$f(x, n, q) \cong f(x_0, n_0, q_0) + \frac{d}{dx} f(x_0, n_0, q_0)\,\{x - x_0\} + \frac{d}{dn} f(x_0, n_0, q_0)\,\{n - n_0\} + \frac{d}{dq} f(x_0, n_0, q_0)\,\{q - q_0\} + \dots$$

where $f(x_0, n_0, q_0)$ is the vector function evaluated at a particular vector point. Similarly, $\frac{d}{dx} f(x_0, n_0, q_0)$ represents the matrix derivative of the vector function at a particular vector point. The higher order terms of the Taylor series involve higher order derivatives resulting in tensors.

Footnote 1. In fact, the function $f(x, n, q)$ could be any function.

The Taylor expansion is exact everywhere when the order of the Taylor series is infinite. However, when x has a Gaussian distribution, the function can be expanded around the mean of x and the expansion needs to be good only within a relatively narrow region around the mean. We take advantage of this fact to truncate the Taylor series after just a few terms.

VTS-0 uses only the zeroth-order terms of the Taylor series and VTS-1 uses the zeroth-order and first-order terms. Higher orders of VTS are also possible when greater approximation accuracy is required.
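As a rough illustration of the two truncation orders, the sketch below evaluates the zeroth-order and first-order expansions of f(x, n, q) = q + log(1 + e^(n - x - q)) for one scalar component. The helper names and the expansion point are assumptions made for this example; the derivative expressions follow from differentiating this particular f and are not quoted from the paper.

```python
import numpy as np

def f_env(x, n, q):
    """f(x, n, q) = q + log(1 + exp(n - x - q))."""
    return q + np.logaddexp(0.0, n - x - q)

def f_gradients(x0, n0, q0):
    """Partial derivatives (df/dx, df/dn, df/dq) at the expansion point."""
    u0 = n0 - x0 - q0
    s = np.exp(u0 - np.logaddexp(0.0, u0))  # e^u / (1 + e^u), overflow-safe
    return -s, s, 1.0 - s

def f_taylor(x, n, q, x0, n0, q0, order):
    """Truncated expansion: order 0 mirrors VTS-0, order 1 mirrors VTS-1."""
    approx = f_env(x0, n0, q0)
    if order >= 1:
        fx, fn, fq = f_gradients(x0, n0, q0)
        approx = approx + fx * (x - x0) + fn * (n - n0) + fq * (q - q0)
    return approx

if __name__ == "__main__":
    x0, n0, q0 = 2.0, 1.0, 0.3   # hypothetical expansion point
    x, n, q = 2.2, 0.9, 0.3      # nearby evaluation point
    exact = f_env(x, n, q)
    for order in (0, 1):
        print(order, abs(f_taylor(x, n, q, x0, n0, q0, order) - exact))
```

Near the expansion point the first-order error is smaller than the zeroth-order error, which is the behaviour the truncation argument relies on.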

3.1. Modeling speech statistics using VTS

To confirm that the Taylor series approximations are a good alternative to the Monte-Carlo approach, simulations were performed using artificial data. A one-dimensional set of vectors was produced using Monte-Carlo simulation, and these clean signal vectors were contaminated with noise at different signal-to-noise ratios (SNRs) and passed through a linear channel, producing a set of noisy vectors.

Figure 1 shows how the resulting means of the noisy vector set z can be approximated quite well by the Taylor series. In this figure we show the mean of the simulated noisy input signal, as well as the mean computed using the Taylor series expansion of orders 0 and 2. As we see, the zeroth-order expansion provides a reasonably good approximation. However, at lower SNRs the second-order Taylor series expansion provides an even better approximation of the actual mean.

Similarly, in Figure 2 we present the zeroth-order and first-order Taylor series approximations to the variance. The first-order approximation is closer to the real variance than the zeroth-order approximation. Odd orders of the Taylor series do not contribute to the approximation of the mean, which is why the mean is compared at orders 0 and 2.
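A simulation in the spirit of this comparison can be sketched as follows. The paper does not list its simulation settings, so the means, variances, sample count, and the function name `compare_vts_with_monte_carlo` are assumptions; the second-order mean and first-order variance formulas are derived here for the scalar model z = x + q + log(1 + e^(n - x - q)) with independent Gaussian x and n.

```python
import numpy as np

def compare_vts_with_monte_carlo(mu_x, var_x, mu_n, var_n, q,
                                 num_samples=200_000, seed=0):
    """Return Monte-Carlo and truncated-Taylor estimates of the mean and variance of z."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_x, np.sqrt(var_x), num_samples)
    n = rng.normal(mu_n, np.sqrt(var_n), num_samples)
    z = x + q + np.logaddexp(0.0, n - x - q)

    # Expansion point: the means of x and n, and the channel value q.
    u0 = mu_n - mu_x - q
    s = np.exp(u0 - np.logaddexp(0.0, u0))   # df/dn = -df/dx at the expansion point
    f0 = q + np.logaddexp(0.0, u0)
    curvature = s * (1.0 - s)                # d2f/dx2 = d2f/dn2 at the expansion point

    return {
        "mean_mc": z.mean(),
        "mean_vts0": mu_x + f0,
        "mean_vts2": mu_x + f0 + 0.5 * curvature * (var_x + var_n),
        "var_mc": z.var(),
        "var_vts0": var_x,
        "var_vts1": (1.0 - s) ** 2 * var_x + s ** 2 * var_n,
    }

if __name__ == "__main__":
    # Sweep the noise mean to move from high to low SNR (hypothetical values).
    for mu_n in (-10.0, -5.0, 0.0, 5.0):
        stats = compare_vts_with_monte_carlo(0.0, 1.0, mu_n, 0.01, 0.0)
        print(mu_n, {k: round(float(v), 3) for k, v in stats.items()})
```

The first-order terms have zero expectation when the expansion point is the mean, so the first correction to the zeroth-order mean appears at second order.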

3.2. Statistics of clean and noisy speech

The statistics of clean speech can be modeled as a mixture of Gaussian distributions. The parameters describing these statistics are estimated using basic EM methods.

The goal of the VTS algorithm is to estimate the PDF of noisy speech given the PDF of clean speech, a segment of noisy speech and the Taylor series expansion that relates noisy speech to clean speech. Once the PDF of the noisy speech is computed, minimum mean square estimation (MMSE) can be used to predict the unobserved clean speech sequence.

Alternately, if HMMs are used to describe the PDF of clean speech we can use the Taylor series approach to compute the noisy HMMs and perform recognition on the noisy signal itself. In this paper we only report results obtained using the first approach.

Figure 1. Effects of noise on the mean of the incoming signal. The exact values of the mean and estimates of the mean obtained from the zeroth-order and second-order VTS expansion are compared over a range of SNRs. [Plot: mean estimate versus SNR (dB); curves: exact mean, zeroth-order VTS, second-order VTS.]

Zeroth-order Vector Taylor Series expansion (VTS-0): The zeroth-order Taylor series expansion of $f(x, n, q)$ results in a Gaussian distribution for the noisy speech z when x is Gaussian:

$$p(z) = \mathcal{N}_z(\mu_z, \Sigma_z)$$

The mean vector and covariance matrices that represent the noisy speech statistics are computed as

$$\mu_z = E(z) = E\big(x + f(x_0, n_0, q_0)\big) = \mu_x + f(x_0, n_0, q_0)$$

$$\Sigma_z = \Sigma_x$$

First-order Vector Taylor Series expansion (VTS-1): In the case of the first-order Taylor series expansion of $f(x, n, q)$ the resulting distribution of z is also Gaussian when x is Gaussian. The new mean vector is computed as

$$\mu_z = E\big(x + f(x_0, n_0, q_0)\big) + E\Big(\frac{d}{dx} f(x_0, n_0, q_0)\,\{x - x_0\}\Big) + E\Big(\frac{d}{dn} f(x_0, n_0, q_0)\,\{n - n_0\}\Big) + E\Big(\frac{d}{dq} f(x_0, n_0, q_0)\,\{q - q_0\}\Big)$$

In a similar fashion, the new covariance matrix can be expressed as

$$\Sigma_z = \Big(I + \frac{d}{dx} f(x_0, n_0, q_0)\Big)\,\Sigma_x\,\Big(I + \frac{d}{dx} f(x_0, n_0, q_0)\Big)^T + \Big(\frac{d}{dn} f(x_0, n_0, q_0)\Big)\,\Sigma_n\,\Big(\frac{d}{dn} f(x_0, n_0, q_0)\Big)^T$$

where $\Sigma_n$ is the variance of the noise.

Figure 2. Effects of noise on the variance of the signal. The exact values of the variance and estimates of the variance obtained from the zeroth-order and first-order VTS expansion are compared over a range of SNRs. [Plot: variance estimate versus SNR (dB); curves: exact variance, zeroth-order VTS, first-order VTS.]

For both VTS-0 and VTS-1 the parameters q and $\mu_n$, and hence the parameters $\mu_z$ and $\Sigma_z$, are estimated iteratively using a modified version of the EM algorithm. VTS-1 also estimates the variance of noise, $\Sigma_n$. The algorithms proceed as follows:

1. Obtain initial estimates of q, $\mu_n$ and $\Sigma_n$.

2. Expand the function $f(x, n, q)$ around the mean vector $\mu_{x,k}$ of each Gaussian in the distribution of x, and the estimates of $\mu_n$ and q.

3. Estimate the parameters of the distribution of z, $\mu_{z,k}$ and $\Sigma_{z,k}$.

4. Perform a single iteration of the EM algorithm to re-estimate the values of q and $\mu_n$. In the case of VTS-1, $\Sigma_n$ is also re-estimated.

5. If the likelihood of the observed noisy data has not converged, return to Step 2.

Because the distribution of x is assumed to be a Gaussian mixture, the resulting distribution computed for z is also a Gaussian mixture distribution with a one-to-one correspondence between each Gaussian in the distribution of x and a Gaussian in the distribution of z.

In all cases, the covariance matrices of the clean speech, the noisy speech, and the additive noise are assumed to be diagonal in order to reduce the computational complexity of the algorithm. Non-diagonal matrices would result in a computationally expensive tensor formulation.
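Steps 2 and 3 of the procedure can be sketched for diagonal covariances as shown below. This is a hedged illustration only: the function name `vts_noisy_gmm` and the array layout are assumptions, the expansion point is taken to be the component mean together with the current estimates of the noise mean and q, and the EM re-estimation of Step 4 is not shown because the paper does not give its update equations here.

```python
import numpy as np

def vts_noisy_gmm(mu_x, var_x, mu_n, var_n, q, order=1):
    """Map clean-speech Gaussians to noisy-speech Gaussians (diagonal covariances).

    mu_x, var_x : (M, D) means and variances of the M clean-speech Gaussians
    mu_n, var_n : (D,) mean and variance of the single noise Gaussian
    q           : (D,) channel term in the log-spectral domain
    The mixture weights P[k] carry over unchanged (one-to-one correspondence).
    """
    u = mu_n - mu_x - q                       # (M, D), argument of the nonlinearity
    f0 = q + np.logaddexp(0.0, u)             # f evaluated at each expansion point
    mu_z = mu_x + f0                          # same mean for order 0 and order 1
    if order == 0:
        var_z = var_x.copy()                  # zeroth order leaves variances unchanged
    else:
        s = np.exp(u - np.logaddexp(0.0, u))  # df/dn; 1 + df/dx equals 1 - s
        var_z = (1.0 - s) ** 2 * var_x + s ** 2 * var_n
    return mu_z, var_z

if __name__ == "__main__":
    # Two hypothetical Gaussians in a two-dimensional log-spectral space.
    mu_x = np.array([[10.0, 12.0], [8.0, 9.0]])
    var_x = np.ones((2, 2))
    mu_z, var_z = vts_noisy_gmm(mu_x, var_x, np.array([9.0, 9.0]),
                                np.array([0.5, 0.5]), np.array([0.3, -0.2]))
    print(mu_z)
    print(var_z)
```

With diagonal covariances every operation is element-wise, which reflects the complexity argument made above for avoiding full matrices.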

3.3. Compensation of noisy speech

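Once the parameters of the distribution of z are computed, an MMSE estimate is used to calculate the clean speech given the observed noisy speech:

$$\hat{x}_{MMSE} = E(x \mid z) = \int x\, p(x \mid z)\, dx$$

$$\hat{x}_{MMSE} = \int \big(z - f(x, n, q)\big)\, p(x \mid z)\, dx$$

The results obtained depend on which order Taylor series approximation is used. The zeroth-order approximation produces

$$\hat{x}_{MMSE} = z - \sum_{k=0}^{M-1} P[k \mid z]\, f(\mu_{x,k}, \mu_n, q)$$

A similar value is obtained for the first-order approximation.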
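The zeroth-order MMSE estimate can be sketched as follows, assuming diagonal covariances and noisy-speech Gaussian parameters that have already been computed. The function names, argument layout, and example values are hypothetical; only the final formula (z minus the posterior-weighted correction terms) follows the zeroth-order expression in the text.

```python
import numpy as np

def log_gauss_diag(z, mu, var):
    """Log-density of each frame under each diagonal Gaussian: (T, D), (M, D) -> (T, M)."""
    diff = z[:, None, :] - mu[None, :, :]
    return -0.5 * np.sum(np.log(2.0 * np.pi * var)[None] + diff ** 2 / var[None], axis=2)

def mmse_vts0(z, weights, mu_x, mu_z, var_z, mu_n, q):
    """Zeroth-order MMSE clean-speech estimate for noisy frames z of shape (T, D)."""
    # Posterior P[k | z] of each noisy-speech Gaussian, normalized in the log domain.
    log_post = np.log(weights)[None, :] + log_gauss_diag(z, mu_z, var_z)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                          # (T, M)
    # Correction term f(mu_{x,k}, mu_n, q) for every Gaussian.
    f_k = q + np.logaddexp(0.0, mu_n - mu_x - q)     # (M, D)
    return z - post @ f_k                            # (T, D)

if __name__ == "__main__":
    # Hypothetical two-component, one-dimensional example.
    weights = np.array([0.4, 0.6])
    mu_x = np.array([[3.0], [6.0]])
    mu_n, q = np.array([2.0]), np.array([0.1])
    mu_z = mu_x + q + np.logaddexp(0.0, mu_n - mu_x - q)
    var_z = np.ones((2, 1))
    z = np.array([[3.5], [6.2]])
    print(mmse_vts0(z, weights, mu_x, mu_z, var_z, mu_n, q))
```

Normalizing the posteriors in the log domain avoids underflow when a frame is far from all Gaussians.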

4. EXPERIMENTAL RESULTS

The effectiveness of the VTS algorithms was evaluated by artificially contaminating utterances from the CMU CENSUS database [3] and from the ARPA Wall Street Journal task with white noise at different SNRs. The SPHINX-II continuous speech recognition system was used.

In Figure 3 we compare the effectiveness of the zeroth-order VTS algorithm and the first-order VTS algorithm to that of another model-based compensation algorithm, CDCN [3] (which does not require stereo data), and of the empirical algorithm RATZ [4] (which does require stereo data).

The VTS-0 algorithm performs better than CDCN at all SNRs, and the VTS-1 algorithm is observed to perform even better than the VTS-0 algorithm. In fact, at all SNRs, VTS-0 outperforms RATZ, which is an algorithm that assumes the availability of stereo data.

In Figure 4 we present results from a similar experiment using the 5,000-word evaluation set of the 1993 ARPA Wall Street Journal test set. As before, the data were contaminated by artificial white noise at different SNRs. Again, the zeroth-order VTS algorithm outperforms the CDCN algorithm at all SNRs.

Figure 3. Comparison of recognition accuracy obtained for the CENSUS database using the zeroth-order and first-order VTS, CDCN, and RATZ algorithms as a function of SNR. The dotted curves indicate baseline performance using cepstral mean normalization only, as well as results obtained by completely retraining the system in the new environment. [Plot: word accuracy (%) versus SNR (dB), 5 to 25 dB; curves: VTS, RATZ, CDCN, CMN, Retrained.]

Figure 4. Comparison of recognition accuracy obtained for the 1993 ARPA 5000-word WSJ0 database using the zeroth-order and first-order VTS, CDCN, and RATZ algorithms as a function of SNR. The dotted curves are as in Fig. 3. [Plot: word accuracy (%) versus SNR (dB), 5 to 25 dB; curves: VTS, CDCN, Blind RATZ, No Comp., Cepstral Mean Normalization.]

5. DISCUSSION

A truncated Taylor series is a special case of a polynomial approximation to a function. It is well known that for polynomial approximation of any order of a function, better polynomials exist than the Taylor series. Hence, we speculate that using more generic polynomial approximations that are optimized to minimize the error for the parameters of the distribution of z may give us even better performance. Simulations indicate that the more general polynomials provide much better estimates of the mean and variance of z than the Taylor series. In fact, the VTS algorithms may be viewed as a special case of algorithms based on more generic polynomial expansions. Similarly, CDCN may be viewed as a special case of the VTS approach worked out for a zeroth-order polynomial in the cepstral domain.
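The claim that better polynomials than the Taylor series exist can be illustrated with a small sketch. The paper does not specify how its more general polynomials are fitted, so the choice here is an assumption: a quadratic fitted by Gaussian-weighted least squares to log(1 + e^u), compared with the second-order Taylor polynomial around the same point. All names and parameter values are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def softplus(u):
    """log(1 + exp(u)), the nonlinear part of f with u = n - x - q."""
    return np.logaddexp(0.0, u)

def taylor2(u, u0):
    """Second-order Taylor polynomial of softplus around u0."""
    s = np.exp(u0 - softplus(u0))            # first derivative at u0
    return softplus(u0) + s * (u - u0) + 0.5 * s * (1.0 - s) * (u - u0) ** 2

def fit_quadratic(u0, sigma, num_points=2001):
    """Quadratic minimizing the Gaussian-weighted squared error around u0."""
    u = np.linspace(u0 - 4.0 * sigma, u0 + 4.0 * sigma, num_points)
    w = np.exp(-0.5 * ((u - u0) / sigma) ** 2)
    # polyfit weights multiply the residuals, so pass sqrt(w) to weight squared errors by w.
    coeffs = P.polyfit(u - u0, softplus(u), deg=2, w=np.sqrt(w))
    return u, w, P.polyval(u - u0, coeffs)

def weighted_mse(approx, exact, w):
    return np.sum(w * (approx - exact) ** 2) / np.sum(w)

if __name__ == "__main__":
    u, w, fitted = fit_quadratic(u0=0.0, sigma=1.5)
    print("taylor :", weighted_mse(taylor2(u, 0.0), softplus(u), w))
    print("fitted :", weighted_mse(fitted, softplus(u), w))
```

By construction the fitted quadratic cannot have a larger weighted error than the Taylor polynomial on these sample points, since the Taylor polynomial is itself one of the candidate quadratics.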

6. SUMMARY

In this paper we introduce an efficient approximation that analytically handles the problem of compensating for the effects of noisy and filtered speech with a bare minimum of testing data and no “stereo” training data. The algorithms presented provide significant improvement over previous work. We provide an easily expandable framework for further improving the performance of these algorithms, at greater computational expense, by increasing the order of the Taylor series approximation.

ACKNOWLEDGEMENTS

The authors thank Evandro Gouvea, Matthew Siegler and Uday Jain for useful discussions, and especially Matthew Siegler for helping us with the simulations. Pedro J. Moreno has been supported by a Fulbright fellowship awarded by the Ministerio de Educación y Ciencia, Spain. This research was sponsored by the Department of the Navy, Naval Research Laboratory under Grant No. N00014-93-1-2005. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

REFERENCES

1. F.-H. Liu (1994). Environmental Adaptation for Robust Speech Recognition. Ph.D. Dissertation, ECE Department, CMU, July 1994.

2. L. Neumeyer and M. Weintraub (1994). “Probabilistic Optimum Filtering for Robust Speech Recognition”. Proc. ICASSP-94.

3. A. Acero (1990). Acoustical and Environmental Robustness in Automatic Speech Recognition. Ph.D. Dissertation, ECE Department, CMU, Sept. 1990.

4. P. J. Moreno, B. Raj and R. M. Stern (1995). “Multivariate Gaussian Based Cepstral Normalization for Robust Speech Recognition”. Proc. ICASSP-95.

5. C. J. Leggetter and P. C. Woodland (1995). “Flexible Speaker Adaptation using Maximum Likelihood Linear Regression”. Proc. ARPA Spoken Language Systems Technology Workshop, January 1995.

6. M. Gales and S. Young (1995). “A fast and flexible implementation of Parallel Model Combination”. Proc. ICASSP-95.
