A vector Taylor series approach for environment-independent speech recognition

Abstract
In this paper we introduce a new analytical approach to environment compensation for speech recognition. Previous attempts at solving analytically the problem of noisy speech recognition have either used an overly-simplified mathematical description of the effects of noise on the statistics of speech or they have relied on the availability of large environment-specific adaptation sets. Some of the previous methods required the use of adaptation data that consists of simultaneously-recorded or "stereo" recordings of clean and degraded speech. In this work we introduce the use of a vector Taylor series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel. The VTS approach is computationally efficient. It can be applied either to the incoming speech feature vectors, or to the statistics representing these vectors. In the first case the speech is compensated and then recognized; in the second case HMM statistics are modified using the VTS formulation. Both approaches use only the actual speech segment being recognized to compute the parameters required for environmental compensation. We evaluate the performance of two implementations of VTS algorithms using the CMU SPHINX-II system on the 100-word alphanumeric CENSUS database and on the 1993 5000-word ARPA Wall Street Journal database. Artificial white Gaussian noise is added to both databases. The VTS approaches provide significant improvements in recognition accuracy compared to previous algorithms.


Pedro J. Moreno, Bhiksha Raj and Richard M. Stern
Department of Electrical and Computer Engineering & School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213

1. INTRODUCTION
As speech recognition systems become more accurate and more sophisticated, robustness to noise, channel, and other environmental effects becomes increasingly important. In the past few years, researchers at CMU and other sites have developed a series of techniques to address this problem. Many of these environment compensation algorithms take advantage of the availability of "stereo data", i.e. speech databases that are simultaneously recorded in high-quality and degraded environments (e.g. [1][2]). Other algorithms make use of non-simultaneously-recorded adaptation data from the degraded environment (e.g. [5]). Still other algorithms (e.g. [6]) use knowledge of noise statistics and extensive computation to adapt the HMMs of clean speech to a new environment. Unfortunately, stereo data, a priori knowledge about the testing environment, and/or the computational resources that such algorithms require are frequently unavailable.
From a practical point of view, algorithms that can compensate for the effects of the environment with almost no previous knowledge, and that only require a small segment of the speech signal to perform the compensation, are far more attractive than those that require environment-specific training information of any sort. Such compensation algorithms tend to be based on an analytic characterization of the nature of the degradation, rather than a mere empirical characterization of a large number of examples.

The CDCN algorithm [3] is an example of this class of model-based algorithms that has been applied with success to several databases. Nevertheless, the CDCN algorithm has some limitations:
• It does not model the effects of the environment on the variance of speech distributions.
• The noise is approximated with only limited accuracy at low SNRs.

The VTS algorithms described in this paper address these problems. Specifically, they:
• Require only the segment of noisy speech signal to be recognized to perform compensation.
• Model the effect of the environment on all the statistics of the probability density function (PDF) of speech.
• Provide a unified treatment of the noise and channel reestimation problem.
• Use a better, Gaussian, model for the PDF of the log-spectra of the noise.
2. A MODEL OF THE ENVIRONMENT
As in previous papers we assume a model of the environment in which speech is corrupted by unknown additive stationary noise and linearly filtered by an unknown channel:

$$Z(\omega) = X(\omega)\,|H(\omega)|^2 + N(\omega)$$

where $Z(\omega)$ represents the power spectrum of the degraded speech, $X(\omega)$ is the power spectrum of the clean speech, $H(\omega)$ is the transfer function of the linear filter, and $N(\omega)$ is the power spectrum of the additive noise.

In the log-spectral domain this relation can be expressed as:

$$z = x + q + \log\left(1 + e^{\,n - x - q}\right)$$

or in more general terms (in fact, the function $f$ could be any function):

$$z = x + f(x, n, q)$$

where q is an unknown parameter that represents the effects of linear filtering in the log-spectral domain.

We also assume that the PDF of the log-spectra of the speech signal can be well represented by a summation of multivariate Gaussian distributions:

$$p(x) = \sum_{k=0}^{M-1} P[k]\, N_x(\mu_{x,k}, \Sigma_{x,k})$$

Furthermore, we assume that the statistics of the noise can be well represented by a single Gaussian $N_n(\mu_n, \Sigma_n)$.

The problem of compensation is twofold. First, the parameters q, $\mu_n$, and $\Sigma_n$ need to be determined. Second, the distribution of z given the PDF of x and the parameters q, $\mu_n$, and $\Sigma_n$ has to be computed. Because of the non-linearity of the function $f(x, n, q)$, both problems are non-trivial. Only for very simple expressions of the function $f(x, n, q)$ can p(z) be computed analytically. For other functions such as $\log\left(1 + e^{\,n - x - q}\right)$ it is not possible to compute p(z) analytically. While p(z) could be computed by Monte-Carlo methods, this approach is computationally expensive and requires previous knowledge of the parameters $\mu_n$, $\Sigma_n$ and q. VTS provides a framework that enables an analytical solution to both problems.

3. DESCRIPTION OF THE VTS ALGORITHMS
The key of the new VTS algorithms is to approximate the generic vector function $f(x, n, q)$ with a vector Taylor series approximation:

$$f(x, n, q) \cong f(x_0, n_0, q_0) + \frac{d}{dx} f(x_0, n_0, q_0)\,(x - x_0) + \frac{d}{dn} f(x_0, n_0, q_0)\,(n - n_0) + \frac{d}{dq} f(x_0, n_0, q_0)\,(q - q_0) + \ldots$$

where $f(x_0, n_0, q_0)$ is the vector function evaluated at a particular vector point. Similarly, $\frac{d}{dx} f(x_0, n_0, q_0)$ represents the matrix derivative of the vector function at a particular vector point. The higher order terms of the Taylor series involve higher order derivatives resulting in tensors.

The Taylor expansion is exact everywhere when the order of the Taylor series is infinite. However, when x has a Gaussian distribution, the function can be expanded around the mean of x and the expansion needs to be good only within a relatively narrow region around the mean. We take advantage of this fact to truncate the Taylor series after just a few terms.
VTS-0 uses only the zeroth-order terms of the Taylor series
and VTS-1 uses the zeroth-order and first-order terms. Higher
orders of VTS are also possible when greater approximation
accuracy is required.
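
As an illustration only, the environment function and its truncated expansion can be sketched in a few lines of NumPy. The sketch assumes the log-spectral model of Section 2, so that $f(x,n,q) = q + \log(1 + e^{n-x-q})$ acts elementwise on log-spectral vectors and its Jacobians are diagonal. The function names `f`, `df_dn` and `f_taylor` are hypothetical and are not taken from the paper.

```python
import numpy as np

def f(x, n, q):
    """Environment term so that z = x + f(x, n, q), elementwise on log-spectra."""
    return q + np.logaddexp(0.0, n - x - q)   # q + log(1 + exp(n - x - q))

def df_dn(x, n, q):
    """Diagonal of df/dn: the logistic function of (n - x - q)."""
    return 1.0 / (1.0 + np.exp(-(n - x - q)))

def f_taylor(x, n, q, x0, n0, q0, order=1):
    """Zeroth- or first-order Taylor approximation of f around (x0, n0, q0)."""
    f0 = f(x0, n0, q0)
    if order == 0:
        return f0
    s = df_dn(x0, n0, q0)
    # For this f: df/dx = -s, df/dn = s, df/dq = 1 - s (all diagonal).
    return f0 - s * (x - x0) + s * (n - n0) + (1.0 - s) * (q - q0)

# Hypothetical expansion point and a nearby evaluation point.
x0 = np.array([2.0, 0.0, -1.0])
n0 = np.zeros(3)
q0 = np.full(3, 0.1)
x, n = x0 + 0.2, n0 - 0.1

err0 = np.abs(f(x, n, q0) - f_taylor(x, n, q0, x0, n0, q0, order=0))
err1 = np.abs(f(x, n, q0) - f_taylor(x, n, q0, x0, n0, q0, order=1))
print("zeroth-order error:", err0)
print("first-order error: ", err1)
```

Near the expansion point the first-order error is smaller than the zeroth-order error in every component, which is the property the truncation relies on.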
3.1. Modeling speech statistics using VTS
To confirm that the Taylor series approximations are a good alternative to the Monte-Carlo approach, simulations were performed using artificial data. A one-dimensional set of vectors was produced using Monte-Carlo simulation, and these clean signal vectors were contaminated with noise at different signal-to-noise ratios (SNRs) and passed through a linear channel, producing a set of noisy vectors.

Figure 1 shows how the resulting means of the noisy vector set z can be approximated quite well by the Taylor series. In this figure we show the mean of the simulated noisy input signal, as well as the mean computed using the Taylor series expansion of orders 0 and 2. As we see, the zeroth-order expansion provides a reasonably good approximation. However, at lower SNRs the second-order Taylor series expansion provides an even better approximation of the actual distribution.

Similarly, in Figure 2 we present the zeroth-order and first-order Taylor series approximations to the variance. The first-order approximation is closer to the real variance than the zeroth-order approximation. Odd orders of the Taylor series do not contribute to the approximation of the mean, because the odd central moments of a Gaussian are zero; this is why Figure 1 compares orders 0 and 2.
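
A simulation of this kind can be sketched as follows. This is not the authors' simulation code: the parameter values are arbitrary assumptions, the "gap" is the difference between the speech and noise means in natural-log units rather than a calibrated SNR in dB, and the second-order correction is derived here for the Section 2 model with x and n independent, for which $\partial^2 f/\partial x^2 = \partial^2 f/\partial n^2 = s(1-s)$, where $s$ is the logistic function of $\mu_n - \mu_x - q$.

```python
import numpy as np

def vts_mean(mu_x, var_x, mu_n, var_n, q, order):
    """Taylor-series estimate of E[z] for scalar Gaussian x and n (order 0 or 2)."""
    u = mu_n - mu_x - q
    mean = mu_x + q + np.logaddexp(0.0, u)        # mu_x + f(mu_x, mu_n, q)
    if order >= 2:
        s = 1.0 / (1.0 + np.exp(-u))              # logistic(u)
        mean += 0.5 * s * (1.0 - s) * (var_x + var_n)
    return mean

def monte_carlo_mean(mu_x, var_x, mu_n, var_n, q, num=200_000, seed=0):
    """Sample-based estimate of E[z] with z = x + q + log(1 + exp(n - x - q))."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_x, np.sqrt(var_x), num)
    n = rng.normal(mu_n, np.sqrt(var_n), num)
    return np.logaddexp(x + q, n).mean()

# Assumed setting: unit-variance speech, nearly deterministic noise, no channel.
for gap in (-4.0, 0.0, 4.0, 8.0):
    args = (gap, 1.0, 0.0, 0.01, 0.0)             # mu_x, var_x, mu_n, var_n, q
    print(gap, monte_carlo_mean(*args),
          vts_mean(*args, order=0), vts_mean(*args, order=2))
```

When the speech and noise means are close, the second-order estimate tracks the Monte-Carlo mean more closely than the zeroth-order estimate; when speech dominates, both agree with it.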
3.2. Statistics of clean and noisy speech
The statistics of clean speech can be modeled as a mixture of Gaussian distributions. The parameters describing these statistics are estimated using basic EM methods.

The goal of the VTS algorithm is to estimate the PDF of noisy speech given the PDF of clean speech, a segment of noisy speech and the Taylor series expansion that relates noisy speech to clean speech. Once the PDF of the noisy speech is computed, minimum mean square error (MMSE) estimation can be used to predict the unobserved clean speech sequence.

Alternately, if HMMs are used to describe the PDF of clean speech we can use the Taylor series approach to compute the noisy HMMs and perform recognition on the noisy signal itself. In this paper we only report results obtained using the first approach.

Figure 1. Effects of noise on the mean of the incoming signal. The exact values of the mean and estimates of the mean obtained from the zeroth-order and second-order VTS expansion are compared over a range of SNRs. [Plot: mean estimate versus SNR (dB); curves: exact mean, 0th-order VTS, 2nd-order VTS.]
Zeroth-order Vector Taylor Series expansion (VTS-0): The zeroth-order Taylor series expansion of $f(x, n, q)$ results in a Gaussian distribution for the noisy speech z when x is Gaussian:

$$p(z) = N_z(\mu_z, \Sigma_z)$$

The mean vector and covariance matrix that represent the noisy speech statistics are computed as

$$\mu_z = E(z) = E\left(x + f(x_0, n_0, q_0)\right) = \mu_x + f(x_0, n_0, q_0)$$

$$\Sigma_z = \Sigma_x$$

First-order Vector Taylor Series expansion (VTS-1): In the case of the first-order Taylor series expansion of $f(x, n, q)$ the resulting distribution of z is also Gaussian when x is Gaussian. The new mean vector is computed as

$$\mu_z = E\left(x + f(x_0, n_0, q_0)\right) + E\left(\frac{d}{dx} f(x_0, n_0, q_0)\,(x - x_0)\right) + E\left(\frac{d}{dn} f(x_0, n_0, q_0)\,(n - n_0)\right) + E\left(\frac{d}{dq} f(x_0, n_0, q_0)\,(q - q_0)\right)$$

In a similar fashion, the new covariance matrix can be expressed as

$$\Sigma_z = \left(I + \frac{d}{dx} f(x_0, n_0, q_0)\right)^T \Sigma_x \left(I + \frac{d}{dx} f(x_0, n_0, q_0)\right) + \left(\frac{d}{dn} f(x_0, n_0, q_0)\right)^T \Sigma_n \left(\frac{d}{dn} f(x_0, n_0, q_0)\right)$$

where $\Sigma_n$ is the variance of the noise. For the log-spectral model of Section 2, $\frac{d}{dn} f = -\frac{d}{dx} f$, so the noise term can equivalently be written using $\frac{d}{dx} f$.

Figure 2. Effects of noise on the variance of the signal. The exact values of the variance and estimates of the variance obtained from the zeroth-order and first-order VTS expansion are compared over a range of SNRs. [Plot: variance estimate versus SNR (dB); curves: exact variance, 0th-order VTS, 1st-order VTS.]

For both VTS-0 and VTS-1 the parameters q and $\mu_n$, and hence the parameters $\mu_z$ and $\Sigma_z$, are estimated iteratively using a modified version of the EM algorithm. VTS-1 also estimates the variance of the noise, $\Sigma_n$. The algorithms proceed as follows:
1. Obtain initial estimates of q, $\mu_n$ and $\Sigma_n$.
2. Expand the function $f(x, n, q)$ around the mean vector $\mu_{x,k}$ of each Gaussian in the distribution of x, and the estimates of $\mu_n$ and q.
3. Estimate the parameters of the distribution of z, $\mu_{z,k}$ and $\Sigma_{z,k}$.
4. Perform a single iteration of the EM algorithm to re-estimate the values of q and $\mu_n$. In the case of VTS-1, $\Sigma_n$ is also re-estimated.
5. If the likelihood of the observed noisy data has not converged, return to Step 2.

Because the distribution of x is assumed to be a Gaussian mixture, the resulting distribution computed for z is also a Gaussian mixture distribution, with a one-to-one correspondence between each Gaussian in the distribution of x and a Gaussian in the distribution of z.

In all cases, the covariance matrices of the clean speech, the noisy speech, and the additive noise are assumed to be diagonal in order to reduce the computational complexity of the algorithm. Non-diagonal matrices would result in a computationally expensive tensor formulation.
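
Steps 2 and 3 can be sketched for diagonal covariances as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes the log-spectral model $f(x,n,q) = q + \log(1 + e^{n-x-q})$, expands each Gaussian at $(\mu_{x,k}, \mu_n, q)$ so that the first-order terms have zero mean, and treats q as fixed within an iteration. The EM re-estimation of Step 4 is not shown because its update formulas are not given in this text. The function name `noisy_gmm_params` and the array shapes are hypothetical.

```python
import numpy as np

def noisy_gmm_params(mu_x, var_x, mu_n, var_n, q, order=1):
    """Map clean-speech Gaussians to noisy-speech Gaussians (diagonal covariances).

    mu_x, var_x : (M, D) means and variances of the M clean-speech Gaussians
    mu_n, var_n : (D,) noise mean and variance
    q           : (D,) channel term in the log-spectral domain
    """
    u = mu_n - mu_x - q                       # (M, D)
    mu_z = mu_x + q + np.logaddexp(0.0, u)    # mu_x + f(mu_x, mu_n, q)
    if order == 0:
        var_z = var_x.copy()                  # VTS-0: Sigma_z = Sigma_x
    else:
        s = 1.0 / (1.0 + np.exp(-u))          # df/dn = s, df/dx = -s
        var_z = (1.0 - s) ** 2 * var_x + s ** 2 * var_n   # VTS-1
    return mu_z, var_z

# Two hypothetical one-dimensional Gaussians: one far above the noise, one far below.
mu_x = np.array([[20.0], [-20.0]])
var_x = np.ones((2, 1))
mu_n, var_n, q = np.array([0.0]), np.array([0.25]), np.array([0.0])

mu_z1, var_z1 = noisy_gmm_params(mu_x, var_x, mu_n, var_n, q, order=1)
mu_z0, var_z0 = noisy_gmm_params(mu_x, var_x, mu_n, var_n, q, order=0)
print(mu_z1.ravel(), var_z1.ravel())
```

In this sketch the first-order variance moves from the speech variance toward the noise variance as the noise begins to dominate, while the zeroth-order variance stays at the clean-speech value.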
3.3. Compensation of noisy speech
Once the parameters of the distribution of z are computed, an MMSE estimate is used to calculate the clean speech given the observed noisy speech:

$$\hat{x}_{MMSE} = E(x \mid z) = \int x\, p(x \mid z)\, dx$$

$$\hat{x}_{MMSE} = \int \left(z - f(x, n, q)\right) p(x \mid z)\, dx$$

The results obtained depend on which order Taylor series approximation is used. The zeroth-order approximation produces

$$\hat{x}_{MMSE} \cong z - \sum_{k=0}^{M-1} P[k \mid z]\, f(\mu_{x,k}, \mu_n, q)$$

A similar expression is obtained for the first-order approximation.

4. EXPERIMENTAL RESULTS
The effectiveness of the VTS algorithms was evaluated by artificially contaminating utterances from the CMU CENSUS database [3] and from the ARPA Wall Street Journal task with white noise at different SNRs. The SPHINX-II continuous speech recognition system was used.

In Figure 3 we compare the effectiveness of the zeroth-order VTS algorithm and the first-order VTS algorithm to the effectiveness of another model-based compensation algorithm, CDCN [3] (which does not require stereo data), and the empirical algorithm RATZ [4] (which does require stereo data). The VTS-0 algorithm performs better than CDCN at all SNRs, and the VTS-1 algorithm is observed to perform even better than the VTS-0 algorithm. In fact, at all SNRs, VTS-0 outperforms RATZ, which is an algorithm that assumes the availability of stereo data.

In Figure 4 we present results from a similar experiment using the 5,000-word evaluation set of the 1993 ARPA Wall Street Journal test set. As before, the data were contaminated by artificial white noise at different SNRs. Again, the zeroth-order VTS algorithm outperforms the CDCN algorithm at all SNRs.
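
Returning to the compensation step of Section 3.3, the zeroth-order MMSE estimate can be sketched as follows. This is an illustrative sketch under assumptions, not the code used in the experiments above: covariances are diagonal, the environment function is the log-spectral form $f = q + \log(1 + e^{n-x-q})$, and the posteriors $P[k \mid z]$ are computed from the VTS-0 noisy Gaussians (means shifted by $f$, variances unchanged). The function name `mmse_vts0` and the array shapes are hypothetical.

```python
import numpy as np

def mmse_vts0(z, weights, mu_x, var_x, mu_n, q):
    """Zeroth-order VTS MMSE estimate of clean log-spectra.

    z       : (T, D) observed noisy frames
    weights : (M,) mixture weights P[k] of the clean-speech model
    mu_x, var_x : (M, D) clean-speech means and diagonal variances
    mu_n, q : (D,) noise mean and channel term
    """
    f_k = q + np.logaddexp(0.0, mu_n - mu_x - q)        # f(mu_{x,k}, mu_n, q), (M, D)
    mu_z = mu_x + f_k                                   # VTS-0 noisy means
    diff = z[:, None, :] - mu_z[None, :, :]             # (T, M, D)
    log_lik = -0.5 * np.sum(diff ** 2 / var_x + np.log(2.0 * np.pi * var_x), axis=2)
    log_post = np.log(weights) + log_lik                # unnormalized log P[k | z]
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                             # (T, M), rows sum to 1
    return z - post @ f_k                               # z - sum_k P[k|z] f_k

# Hypothetical single-Gaussian case: speech far above the noise, channel q = 0.5.
z = np.array([[9.0], [11.0]])
x_hat = mmse_vts0(z, np.array([1.0]), np.array([[10.0]]), np.array([[1.0]]),
                  np.array([-10.0]), np.array([0.5]))
print(x_hat.ravel())
```

With negligible noise the correction reduces to subtracting the channel term q, as expected from the model z = x + q.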
Figure 3. Comparison of recognition accuracy obtained for the CENSUS database using the zeroth-order and first-order VTS, CDCN, and RATZ algorithms as a function of SNR. The dotted curves indicate baseline performance using cepstral mean normalization only, as well as results obtained by completely retraining the system in the new environment. [Plot: word accuracy (%) versus SNR (dB); curves: 1st-order VTS, 0th-order VTS, RATZ, CDCN, CMN, Retrained.]

Figure 4. Comparison of recognition accuracy obtained for the 1993 ARPA 5000-word WSJ0 database using the zeroth-order and first-order VTS, CDCN, and RATZ algorithms as a function of SNR. The dotted curves are as in Fig. 3. [Plot: word accuracy (%) versus SNR (dB); curves: 1st-order VTS, 0th-order VTS, CDCN, Blind RATZ, No Comp., Cepstral Mean Normalization.]

5. DISCUSSION
A truncated Taylor series is a special case of a polynomial approximation to a function. It is well known that, for a polynomial approximation of any given order, better polynomials exist than the Taylor series. Hence, we speculate that using more generic polynomial approximations that are optimized to minimize the error for the parameters of the distribution of z may give us even better performance. Simulations indicate that the more general polynomials provide much better estimates of the mean and variance of z than the Taylor series. In fact, the VTS algorithms may be viewed as a special case of algorithms based on more generic polynomial expansions. Similarly, CDCN may be viewed as a special case of the VTS approach, worked out for a zeroth-order polynomial in the cepstral domain.
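
One hypothetical way to see that better polynomials than the Taylor series exist is to compare a second-order Taylor polynomial with a least-squares quadratic fitted over samples of x. The paper does not specify which polynomial family or fitting criterion its simulations used, so the least-squares fit and all parameter values below are assumptions made purely for illustration, in a one-dimensional setting with fixed noise and channel.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, sigma_x, n, q = 0.0, 2.0, 0.0, 0.0        # assumed 1-D setting
x = rng.normal(mu_x, sigma_x, 50_000)
f_exact = q + np.logaddexp(0.0, n - x - q)      # f(x, n, q) at the samples

# Second-order Taylor polynomial of f around mu_x.
u0 = n - mu_x - q
s = 1.0 / (1.0 + np.exp(-u0))
f_taylor2 = ((q + np.logaddexp(0.0, u0))
             - s * (x - mu_x)
             + 0.5 * s * (1.0 - s) * (x - mu_x) ** 2)

# Least-squares quadratic fitted over the same samples.
coeffs = np.polyfit(x - mu_x, f_exact, deg=2)
f_fit2 = np.polyval(coeffs, x - mu_x)

mse_taylor = np.mean((f_exact - f_taylor2) ** 2)
mse_fit = np.mean((f_exact - f_fit2) ** 2)
print("mean squared error, Taylor:", mse_taylor)
print("mean squared error, fitted:", mse_fit)
```

The fitted quadratic cannot do worse than the Taylor quadratic in mean squared error over the samples, since the Taylor polynomial is itself one of the candidates the fit searches over.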
6. SUMMARY
In this paper we introduce an efficient approximation that analytically handles the problem of compensating for the effects of noise and filtering on speech with a bare minimum of testing data and no stereo training data. The algorithms presented provide significant improvement over previous work. We provide an easily expandable framework for further improving the performance of these algorithms, at greater computational expense, by increasing the order of the Taylor series approximation.
ACKNOWLEDGEMENTS
The authors thank Evandro Gouvea, Matthew Siegler and Uday Jain for useful discussions, and especially Matthew Siegler for helping us with the simulations. Pedro J. Moreno has been supported by a Fulbright fellowship awarded by the Ministerio de Educación y Ciencia, Spain. This research was sponsored by the Department of the Navy, Naval Research Laboratory under Grant No. N00014-93-1-2005. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
REFERENCES
1. F.-H. Liu (1994). Environmental Adaptation for Robust Speech Recognition. Ph.D. Dissertation, ECE Department, CMU, July 1994.
2. L. Neumeyer and M. Weintraub (1994). Probabilistic Optimum Filtering for Robust Speech Recognition. Proc. ICASSP-94.
3. A. Acero (1990). Acoustical and Environmental Robustness in Automatic Speech Recognition. Ph.D. Dissertation, ECE Department, CMU, Sept. 1990.
4. P. J. Moreno, B. Raj and R. M. Stern (1995). Multivariate Gaussian Based Cepstral Normalization for Robust Speech Recognition. Proc. ICASSP-95.
5. C. J. Leggetter and P. C. Woodland (1995). Flexible Speaker Adaptation using Maximum Likelihood Linear Regression. Proc. ARPA Spoken Language Systems Technology Workshop, January 1995.
6. M. Gales and S. Young (1995). A Fast and Flexible Implementation of Parallel Model Combination. Proc. ICASSP-95.
