How many mel frequency cepstral coefficients are passed to the recognizer?

After a discrete cosine transform of the logarithmic filterbank outputs the authors obtain 12 mel frequency cepstral coefficients, which, augmented by 12 regression coefficients, are passed to the recognizer.

How many frames are used to estimate the power spectrum of a speech recognizer?

As the estimation is more reliable if more data is available, the authors use the notation $N(\omega, t)$ to denote an estimation of $N(\omega)$ using all frames from the beginning of the utterance up to frame $t$.

How much car noise was inserted before each test utterance?

In order to verify this theoretical consideration by an experiment, the authors inserted 0.5 seconds of car noise from a BMW 540 at 50 km/h before the beginning of each sound file of the test set.

How can the authors estimate the noise power spectrum from the speech signal?

This observation can be used to estimate a noise power spectrum $N(\omega)$ from the observed speech signal $X(\omega, t)$ by taking the $q$-th quantile over time in every frequency band.

(Open Access) Quantile based noise estimation for spectral subtraction and Wiener filtering (2000) | Volker Stahl

Q: What have the authors contributed in "Quantile based noise estimation for spectral subtraction and Wiener filtering"?

In this paper the authors restrict their considerations to the case where only a single microphone recording of the noisy signal is available. The algorithms which the authors investigate proceed in two steps:

Q: What are the future works mentioned in the paper "Quantile based noise estimation for spectral subtraction and Wiener filtering"?

The latter seems superior according to experimental evidence (Section 5).

Q: What is the relative reduction for the optimal choice q?

The word error rate without any noise reduction method is 11.7%, i.e. the relative reduction is 26% for the optimal choice $q = 0.55$.

Q: What is the recursion of the utterance?

The recursion is initialized by $N(\omega, 0) = X(\omega, 0)$, which reflects the assumption that the first frame of an utterance does not contain speech.

Q: What is the cost of the quantile based noise estimation method?

The quantile based noise estimation method gives significantly better results but is more expensive in terms of computing time and memory.

QUANTILE BASED NOISE ESTIMATION FOR SPECTRAL SUBTRACTION AND WIENER FILTERING

Volker Stahl, Alexander Fischer and Rolf Bippus

Philips Research Laboratories

Weisshausstrasse 2, D-52066 Aachen, Germany

email: {vstahl,afischer,bippus}@pfa.research.philips.com

ABSTRACT

Elimination of additive noise from a speech signal is a fundamental problem in audio signal processing. In this paper we restrict our considerations to the case where only a single microphone recording of the noisy signal is available. The algorithms which we investigate proceed in two steps: First, the noise power spectrum is estimated. A method based on temporal quantiles in the power spectral domain is proposed and compared with pause detection and recursive averaging. The second step is to eliminate the estimated noise from the observed signal by spectral subtraction or Wiener filtering. The database used in the experiments comprises 6034 utterances of German digits and digit strings by 770 speakers in 10 different cars. Without noise reduction, we obtain an error rate of 11.7%. Quantile based noise estimation and Wiener filtering reduce the error rate to 8.6%. Similar improvements are achieved in an experiment with artificial, non-stationary noise.

1. INTRODUCTION

The error rate of speech recognition systems increases dramatically in the presence of noise. It is therefore essential to provide some means of noise reduction in the front end of speech recognizers which operate under adverse conditions. A particularly noisy but important application domain of speech recognition is the car environment [3, 4, 5, 2, 8, 9]. In this paper we investigate different noise reduction methods and carry out experiments on a large speech database which has been recorded in the car.

The paper is structured as follows: In Section 2 we give a brief description of the speech recognition system and the database used for the experiments. Model assumptions on the speech and noise signal are stated in Section 3. In Section 4 we discuss two methods to estimate the noise power spectrum. The first method is based on frame wise speech / non-speech classification and recursive averaging over non-speech frames. As pause detection in noisy environments is a difficult problem, we propose a second method, which does not depend on a classifier. The noise is estimated as a temporal quantile in the power spectral domain. According to an experimental comparison, quantile based noise estimation performs significantly better, especially under non-stationary noise. In Section 5 we apply spectral subtraction and Wiener filtering to eliminate the estimated noise from the input signal. The results are summarized in Section 6.

2. DATABASE AND SPEECH RECOGNITION SYSTEM

The experimental results reported in this paper are based on the German digit string subset of the MoTiV database [7]. The corpus comprises 6034 utterances (4436 for training and 1598 for evaluating the error rate) by 770 speakers in 10 cars at various driving situations. Training and evaluation is always done on the matched scenario, i.e. the same noise elimination methods are applied during training and evaluation.

The speech recognizer is a continuous mixture density hidden Markov model (HMM) system whose parameters are estimated by Viterbi training. Each mixture consists of 8 Gaussian densities with density specific, diagonal covariance matrices. The system uses two HMMs for each digit, one for male and one for female speakers. The signal analysis is as follows: The observed speech signal is subdivided into overlapping, 16 ms spaced frames of 32 ms length. For each frame the power spectrum is estimated through a Hamming windowed FFT followed by a filter bank with 15 mel spaced triangular kernels. After a discrete cosine transform of the logarithmic filterbank outputs we obtain 12 mel frequency cepstral coefficients, which, augmented by 12 regression coefficients, are passed to the recognizer. In this paper we experiment with an additional preprocessing step in the power spectral domain in order to reduce additive noise in the signal.
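The framing and power spectrum step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sampling rate argument are hypothetical (the paper does not give the sampling rate), a naive DFT stands in for the FFT, and the mel filter bank and cosine transform are left out.

```python
import cmath
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def frame_power_spectra(signal, sample_rate, frame_ms=32, shift_ms=16):
    """Split a signal into overlapping frames and return one power
    spectrum (magnitude squared DFT coefficients) per frame.

    frame_ms and shift_ms follow the text (32 ms frames, 16 ms spacing);
    sample_rate is an assumption supplied by the caller.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = hamming(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = []
        # Naive DFT over the non-negative frequencies; an FFT would be
        # used in practice.
        for k in range(frame_len // 2 + 1):
            coeff = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(coeff) ** 2)
        spectra.append(spectrum)
    return spectra
```

The result is a list indexed first by frame $t$ and then by frequency bin, which is the layout of $X(\omega, t)$ assumed by the other sketches in this text.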

3. NOTATION AND ASSUMPTIONS

We assume that the observed noise signal is a realization of a wide sense stationary process [11]. The major part of this paper deals with the estimation of its power spectrum $N(\omega)$. As the estimation is more reliable if more data is available, we use the notation $N(\omega, t)$ to denote an estimation of $N(\omega)$ using all frames from the beginning of the utterance up to frame $t$. Further, we assume that the clean speech signal within each frame $t$ is an instance of a wide sense stationary process with power spectrum $S(\omega, t)$. For the sake of notational simplicity we do not distinguish between power spectra and periodogram based power spectrum estimations. As the speech and noise signal are assumed to be additive and independent, the power spectrum of the observed signal is

$$X(\omega, t) = S(\omega, t) + N(\omega).$$

The power spectrum $X(\omega, t)$ is estimated by magnitude squared Fourier coefficients of the observed signal in frame $t$. The clean speech signal power spectrum can therefore be estimated as

$$S(\omega, t) = X(\omega, t) - N(\omega, t).$$

4. ESTIMATION OF THE NOISE SPECTRUM

A crucial step in noise suppression methods like Wiener filtering or spectral subtraction is the estimation of the noise spectrum. There are applications where this task is simplified by some prior knowledge of the noise spectrum or by multi channel recordings. However, in this paper we assume that there is only a single microphone and all we know about the noise signal is that it is more or less stationary, independent of the speech signal and additive.

A commonly used method for noise spectrum estimation is to average over sections in the input signal which do not contain speech (Section 4.1). However, this approach requires that non-speech sections can be detected reliably, which is difficult especially under noisy conditions. Moreover, it relies on the fact that there actually exists a sufficient amount of non-speech in the signal. In order to avoid these problems, we propose a method to estimate the noise spectrum without explicit frame wise speech / non-speech classification (Section 4.2). The idea is to estimate the noise energy in each frequency band by temporal quantiles in the power spectral domain.

4.1. Noise Spectrum Estimation Based on Frame Wise Speech / Non-Speech Classification

If the signal to noise ratio is not too low, a simple method to detect speech is based on the signal energy. As the noise signal is assumed to be stationary, the signal energy in the entire utterance is greater than or equal to the noise energy. If the energy in a frame is significantly larger than the estimated noise energy, then the frame is likely to contain speech. Otherwise it is a pure noise frame and is used to update the current noise estimation. Let $X(\omega, t)$ be the power spectrum at frequency $\omega$ in the $t$-th frame of the input signal and $N(\omega, t)$ be the power spectrum of the estimated noise energy at frequency $\omega$ in frame $t$. A simple recursive formula to estimate the noise energy $N(\omega, t)$ is as follows:

$$N(\omega, t) = \begin{cases} N(\omega, t-1) & \text{if } \mathrm{XNR}(t) > \beta, \\ (1 - \gamma)\, N(\omega, t-1) + \gamma\, X(\omega, t) & \text{else} \end{cases} \qquad (1)$$

for all $\omega$, where

$$\mathrm{XNR}(t) = \frac{\sum_{\omega} X(\omega, t)}{\sum_{\omega} N(\omega, t-1)}.$$

The recursion is initialized by $N(\omega, 0) = X(\omega, 0)$, which reflects the assumption that the first frame of an utterance does not contain speech. Note that each frame is classified as either pure noise or speech plus noise. Equation (1) has two parameters $\beta$ and $\gamma$ which depend on the speech data under consideration. Parameter $\beta$ is related to the signal to noise ratio. Parameter $\gamma$ determines the adaptation speed of the noise estimation. According to experimental results $\beta = \ldots 8$ and $\gamma = 0.03$ perform well for the MoTiV corpus. The estimated noise $N(\omega, t)$ is removed from the input signal $X(\omega, t)$ by means of a Wiener filter, see Section 5. With this noise elimination method we obtain a word error rate of 10.3%. Without noise elimination the word error rate is 11.7%, i.e. the relative improvement is 12%.
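A minimal sketch of the recursion in equation (1) is given below. It assumes that the frame level ratio XNR is the summed frame energy divided by the summed previous noise estimate, and that every noise estimate has positive total energy. The function name and the default threshold `beta=2.0` are hypothetical placeholders, not the values used in the experiments; `gamma=0.03` is the adaptation value reported in the text.

```python
def recursive_noise_estimate(frames, beta=2.0, gamma=0.03):
    """Classifier based noise estimation in the spirit of equation (1).

    frames[t][w] is the observed power spectrum X(w, t).
    Returns noise[t][w], the running estimate N(w, t).
    beta (speech threshold) is a hypothetical default.
    """
    # N(w, 0) = X(w, 0): the first frame is assumed to be pure noise.
    noise = [list(frames[0])]
    for t in range(1, len(frames)):
        prev = noise[-1]
        xnr = sum(frames[t]) / sum(prev)
        if xnr > beta:
            # Frame classified as speech: keep the previous estimate.
            noise.append(list(prev))
        else:
            # Frame classified as noise: recursive averaging.
            noise.append(
                [(1.0 - gamma) * n + gamma * x for n, x in zip(prev, frames[t])]
            )
    return noise
```

A frame whose energy is far above the current estimate leaves the estimate untouched, while low energy frames pull it slowly towards the observed spectrum.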

Frame wise speech / non-speech classification under noisy conditions is a difficult problem far from being solved satisfactorily. The frame error rate of the speech / non-speech classifier described above is around 16% on the MoTiV corpus. In the next section we describe a method for estimating the noise spectrum which does not require explicit speech / non-speech classification.

4.2. Quantile Based Noise Spectrum Estimation

In [10] an algorithm for noise estimation based on minimum statistics has been proposed. As the minimum is sensitive to outliers we use a quantile different from the minimum. The algorithm proposed in this section is somewhat simpler and has fewer parameters than the one in [10] but is computationally more expensive. A similar method has been described in [2].

It is well known that even in speech sections of the input signal not all frequency bands are permanently occupied with speech. In fact, a significant percentage of the time the energy in each frequency band is on the noise level. This observation can be used to estimate a noise power spectrum $N(\omega)$ from the observed speech signal $X(\omega, t)$ by taking the $q$-th quantile over time in every frequency band. More precisely, for every $\omega$ the frames of the entire utterance $X(\omega, t)$, $t = 0, \ldots, T$ are sorted such that $X(\omega, t_0) \le X(\omega, t_1) \le \ldots \le X(\omega, t_T)$. The $q$-quantile noise estimation is defined as

$$N(\omega) = X(\omega, t_{\lfloor qT \rfloor}). \qquad (2)$$

For example, $q = 0$ yields the minimum, $q = 1$ the maximum and $q = 0.5$ the median. This approach is based on the assumption that each frequency band carries at least the $q$-th fraction of time only noise, even during speech sections. Obviously this is true for very small values of $q$ but in order to obtain a robust estimation of the noise spectrum, which is not sensitive to outliers, we hope that $q$ is somewhere near the median, i.e. $q \approx 0.5$.

[Figure 1: plot of $N(\omega)$ against the quantile $q$ from 0.1 to 0.9, with one curve each for 300 Hz, 1500 Hz and 3000 Hz.]

Figure 1: Quantiles of the energy distribution in the observed signal $X$ at 300 Hz, 1500 Hz and 3000 Hz for a typical utterance of the MoTiV corpus.
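The quantile estimate of equation (2) can be written down almost literally. The following sketch assumes the spectra are available as a list of frames, each a list of per-band energies; the function name is hypothetical.

```python
import math

def quantile_noise_estimate(frames, q=0.55):
    """Non-causal quantile noise estimate as in equation (2).

    frames[t][w] is X(w, t) for t = 0..T. For every band w the values
    are sorted over time and the element at index floor(q * T) is taken.
    """
    T = len(frames) - 1
    index = math.floor(q * T)
    n_bands = len(frames[0])
    return [sorted(frame[w] for frame in frames)[index] for w in range(n_bands)]
```

With `q=0` the function returns the per-band minimum, with `q=1` the maximum and with `q=0.5` the median, matching the examples given for equation (2).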

Figure 1 shows $N(\omega)$ according to (2) in dependence of $q$ for 3 different frequencies $\omega$ and a typical 7 digit utterance taken from the MoTiV corpus. Roughly in 80-90% of the frames the signal energy in the frequency bands is low, i.e. close to the noise energy level and only in 10-20% of the time the frequency band carries high energy, voiced speech. Note that the curves also depend on the duration of the pause sections in the signal. However, the major part of the utterance in Figure 1 was speech. For the MoTiV corpus the optimal value for $q$ was determined experimentally. The estimated noise $N(\omega)$ was eliminated from the signal by a Wiener filter, see Section 5. The resulting word error rates (WER) are summarized in Table 1. The word error rate without any noise reduction method is 11.7%, i.e. the relative reduction is 26% for the optimal choice $q = 0.55$. With a 5909 words test set we obtain under certain simplifying assumptions a confidence interval of 0.8% on the 95% significance level for the baseline error rate 11.7%. Error rates below 10.9% are therefore significant improvements.
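The arithmetic behind these figures can be checked with a short sketch. The simplifying assumptions are not spelled out in the text; the code below assumes independent word errors and a normal approximation to the binomial distribution, which is one common choice and reproduces the quoted 0.8%. Both function names are hypothetical.

```python
import math

def wer_confidence_halfwidth(wer, n_words, z=1.96):
    """Half width of a normal approximation confidence interval for an
    error rate measured on n_words words (z = 1.96 for 95%).

    Assumes independent word errors, which is an assumption of this
    sketch and not necessarily the one made in the paper.
    """
    return z * math.sqrt(wer * (1.0 - wer) / n_words)

def relative_reduction(baseline, improved):
    """Relative error rate reduction with respect to the baseline."""
    return (baseline - improved) / baseline

halfwidth = wer_confidence_halfwidth(0.117, 5909)   # about 0.008, i.e. 0.8%
best = relative_reduction(11.7, 8.6)                # about 0.26, i.e. 26%
```

Subtracting the half width from the baseline gives 11.7% - 0.8% = 10.9%, the significance threshold used in the text.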

q        0.2   0.3   0.4   0.5   0.55  0.6   0.7
WER (%)  11.3  10.8  10.1  8.9   8.6   8.8   9.7

Table 1: Word error rate with Wiener filter and noise estimation $N(\omega)$ according to (2).

Causality. Note that the estimation of the noise spectrum depends on the entire utterance $X(\omega, t)$ for all $t = 0, \ldots, T$. A noise suppression filter based on this approach is therefore not causal. However, if we define $N(\omega, t)$ as the $q$-quantile of $X(\omega, \tau)$ for $\tau = 0, \ldots, t$, we obtain a causal filter. Table 2 summarizes the results of the same experiments as in Table 1 but this time we used a causal noise estimation. The error rates achieved by the causal filter are slightly higher than for the non-causal case. The reason is that the noise estimation at the beginning of the signal is very unreliable because little data is available to estimate $N(\omega, t)$ for small $t$.

q        0.2   0.3   0.4   0.5   0.55  0.6   0.7
WER (%)  11.5  10.8  10.0  8.8   8.9   9.1   10.2

Table 2: Same experiment as in Table 1 but with causal noise estimation.
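A causal variant of the quantile estimate can be sketched by keeping a sorted history per frequency band and reading off the quantile after every new frame. The function name is hypothetical, and the use of `bisect.insort` is an implementation choice of this sketch, not something prescribed by the paper.

```python
import bisect
import math

def causal_quantile_noise(frames, q=0.5):
    """Causal quantile noise estimate.

    frames[t][w] is X(w, t). The estimate for frame t is the q-quantile
    of the observations in frames 0..t, using the same index rule as
    equation (2) with T replaced by t.
    """
    n_bands = len(frames[0])
    sorted_history = [[] for _ in range(n_bands)]
    estimates = []
    for t, frame in enumerate(frames):
        row = []
        for w in range(n_bands):
            # Keep all past observations of this band in sorted order.
            bisect.insort(sorted_history[w], frame[w])
            row.append(sorted_history[w][math.floor(q * t)])
        estimates.append(row)
    return estimates
```

The history grows by one element per frame and band, which is the growth in cost and memory that motivates approximate quantile computations.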

Efficiency. The computational cost and memory consumption for estimating $N(\omega, t)$ grows with $t$. This is problematic for real time and low resource implementations. As a consequence we investigated approximate methods for the quantile computation which are more efficient in terms of time and space. The idea is to store the observations $X(\omega, t)$ for $t = 0, 1, \ldots$ in a buffer with fixed length. Separate buffers are used for each frequency $\omega$. If a buffer is full, then the largest and the smallest element are removed from the buffer. The quantile is determined by considering only the elements in the buffer. The obvious question now is how large the buffer should be and how much the recognition error rate increases with a finite length buffer. Results of experiments with different buffer lengths and $q = 0.5$ are reported in Table 3. As expected, the error rate increases for small buffer sizes and achieves asymptotically the error rate of the exact quantile computation. Another method to improve efficiency is to integrate several adjacent frequencies and do a band wise noise estimation [6].

Buffer length  3     5     10    20    40    60    100
WER (%)        10.6  10.2  9.3   9.1   9.3   9.2   8.9

Table 3: Same experiment as in Table 2 for $q = 0.5$ but with limited buffer length for the quantile computation.
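The buffer based approximation can be illustrated with a small class that holds one buffer, i.e. one frequency band. The text does not specify whether the extreme elements are removed before or after the new observation is stored; this sketch assumes they are removed first. The class name and interface are hypothetical.

```python
import bisect
import math

class QuantileBuffer:
    """Approximate running quantile for one frequency band using a
    sorted buffer of bounded length."""

    def __init__(self, max_len, q=0.5):
        if max_len < 2:
            raise ValueError("max_len must be at least 2")
        self.max_len = max_len
        self.q = q
        self.values = []  # always kept in ascending order

    def update(self, x):
        """Store observation x and return the current quantile estimate."""
        if len(self.values) >= self.max_len:
            # Buffer full: drop the largest and the smallest element.
            del self.values[-1]
            del self.values[0]
        bisect.insort(self.values, x)
        return self.values[math.floor(self.q * (len(self.values) - 1))]
```

One instance would be created per frequency band and fed with $X(\omega, t)$ frame by frame. Because both extremes are dropped together, the buffer tends to retain values around the median, which fits the choice $q = 0.5$ used in Table 3.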

Non-stationary Noise. We observed that the classifier based method in Section 4.1 performs quite poorly if the noise energy increases abruptly, say at time $t^*$. The reason is that the estimated noise $N(\omega, t^*)$ at time $t^*$ is small compared to subsequent input frames $X(\omega, t)$ for $t > t^*$, especially if frame $X(\omega, t)$ does not contain speech. Therefore, according to (1), all frames after $t^*$ are classified as speech and hence the noise estimation will not be updated any more after time $t^*$, i.e. $N(\omega, t) = N(\omega, t^*)$ for all $t > t^*$. In other words, the noise estimation does not converge to the observed noise. The quantile based method presented in this section does not suffer from this problem and seems therefore advantageous for non-stationary noise. In order to verify this theoretical consideration by an experiment, we inserted 0.5 seconds of car noise from a BMW 540 at 50 km/h before the beginning of each sound file of the test set. The columns of Table 4 contain the word error rates for the cases no noise reduction, noise estimation by the classifier based method and noise estimation by the quantile based method for $q = 0.5$ and buffer sizes 10, 20, 60, and unlimited respectively. In each scenario the error rate is significantly higher than in the corresponding case without inserted car noise. The deterioration for the classifier based noise estimation method, however, is much more severe than for the quantile based method and is even worse than for the case without noise elimination. The adaptation time to a changing noise signal in the quantile based method is proportional to the buffer length, which explains why in this experiment shorter buffer lengths give better results.

Method         none  classifier  quantile  quantile  quantile  quantile
Buffer length  -     -           10        20        60        unlimited
WER (%)        13.7  18.5        10.1      10.5      10.6      11.7

Table 4: Word error rate if 0.5 seconds of low energy car noise are added to the beginning of the sound files of the test set.

5. ELIMINATION OF THE NOISE FROM THE SPEECH SIGNAL

In the previous section we discussed methods for estimating the noise power spectrum $N(\omega, t)$. In this section we review approaches for eliminating the estimated noise from the observed signal. If we had complete information about the noise spectrum, i.e. magnitude and phase, the noise elimination would amount to a simple subtraction of the complex Fourier coefficients. Unfortunately we have no phase information of the noise. Hence we apply spectral subtraction and Wiener filtering for the noise elimination. The FIR Wiener filter is defined as the linear filter which minimizes the mean square error in the time domain. Spectral subtraction relies on the fact that the power spectrum of the sum of two independent random signals is the sum of the power spectra. The noise elimination rule of spectral subtraction is therefore simply to subtract the power spectrum of the estimated noise from the power spectrum of the observed signal. Surprisingly the formulae for the Wiener filter and spectral subtraction are quite similar. Let

$$H(\omega, t) = \frac{X(\omega, t) - N(\omega, t)}{X(\omega, t)}. \qquad (3)$$

The noise reduced signal $S(\omega, t)$ by Wiener filtering is

$$S(\omega, t) = H^2(\omega, t)\, X(\omega, t),$$

noise reduction by spectral subtraction is defined as

$$S(\omega, t) = X(\omega, t) - N(\omega, t) = H(\omega, t)\, X(\omega, t).$$

Sometimes the long term estimated noise power spectrum $N(\omega, t)$ can be larger than the instantaneous observed power spectrum $X(\omega, t)$. In this case we would expect that the noise reduced power spectrum $S(\omega, t)$ should be zero. Therefore (3) is usually modified as

$$H(\omega, t) = \frac{\max\bigl(X(\omega, t) - N(\omega, t),\, 0\bigr)}{X(\omega, t)}.$$

Experimental experience indicates that better recognition results are achieved if a small fraction of the noise power is left in the signal [1, 10]. Hence, the energy of the noise reduced signal $\tilde{S}(\omega, t)$ which is passed to the recognizer is

$$\tilde{S}(\omega, t) = \max\bigl(S(\omega, t),\, \delta\, N(\omega, t)\bigr),$$

where $\delta = 0.04$ has been chosen experimentally.

An experimental comparison of spectral subtraction and Wiener filtering for a noise floor factor $\delta = 0.04$ is given in Table 5. The noise power spectrum $N(\omega, t)$ has been estimated as in Table 2.

q            0.2   0.3   0.4   0.5   0.6   0.7   0.8
Wiener (%)   11.5  10.8  10.0  8.8   9.1   10.2  12.1
Subtr. (%)   11.7  11.4  10.9  10.1  9.9   9.6   11.6

Table 5: Experimental comparison of the word error rates of Wiener filtering and spectral subtraction.
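The two elimination rules compared in Table 5 can be sketched for a single frame as follows. The sketch assumes a gain of the form max(X - N, 0) / X, applied squared for Wiener filtering and linearly for spectral subtraction, followed by a noise floor of 0.04 times the noise estimate. The function names and the `method` argument are hypothetical.

```python
def wiener_gain(x, n):
    """Gain max(X - N, 0) / X for one frequency bin (0 if X is 0)."""
    return max(x - n, 0.0) / x if x > 0.0 else 0.0

def reduce_noise(x_frame, n_frame, method="wiener", floor=0.04):
    """Noise reduced power spectrum for one frame.

    x_frame[w] is the observed power X, n_frame[w] the noise estimate N.
    method "wiener" applies the squared gain, any other value applies
    plain spectral subtraction. A fraction `floor` of the noise power
    is always left in the signal.
    """
    out = []
    for x, n in zip(x_frame, n_frame):
        h = wiener_gain(x, n)
        s = h * h * x if method == "wiener" else h * x
        out.append(max(s, floor * n))
    return out
```

For an observed power of 4 and a noise estimate of 1 the gain is 0.75, so Wiener filtering yields 2.25 and spectral subtraction yields 3. When the noise estimate exceeds the observation, both rules fall back to the noise floor.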

As suggested in [1] the performance of spectral subtraction can be improved by subtracting an overestimation of the noise power spectrum, i.e.

$$\tilde{S}(\omega, t) = \max\bigl(X(\omega, t) - \alpha\, N(\omega, t),\, \delta\, N(\omega, t)\bigr),$$

with overestimation factor $\alpha$ and noise floor factor $\delta$. In our experiments we found an optimum for $\alpha = \ldots 5$, which gives a word error rate of 9.2% for $q = \ldots$.
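Spectral subtraction with an overestimated noise spectrum differs from the plain rule only in the factor applied to the noise estimate. In the sketch below the function name is hypothetical and `alpha` has to be supplied by the caller, since the optimal value is not fully legible in the source; the default `floor=0.04` is the noise floor factor quoted in this section.

```python
def oversubtract(x_frame, n_frame, alpha, floor=0.04):
    """Spectral subtraction with overestimation factor alpha.

    Subtracts alpha times the noise estimate from the observed power
    and keeps at least `floor` times the noise power in every bin.
    """
    return [max(x - alpha * n, floor * n) for x, n in zip(x_frame, n_frame)]
```

With `alpha=1` this reduces to plain spectral subtraction with a noise floor. Larger values remove more energy and push more bins down to the floor.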

6. CONCLUSION

We investigated methods to remove additive noise from a speech signal which has been recorded in the car environment by a single microphone. The error rate of a speech recognizer has been reduced by up to 26% relative by quantile based noise estimation in the power spectral domain and Wiener filtering. The methods proceed in two steps: Estimation of the noise signal and elimination.

Noise Estimation. We studied two noise estimation methods: The first one is based on frame wise speech / non-speech classification and recursive smoothing over non-speech frames (Section 4.1), the second method estimates the noise by quantiles in the power spectral domain (Section 4.2). The quantile based noise estimation method gives significantly better results but is more expensive in terms of computing time and memory. An approximation algorithm for improving the efficiency of the quantile based method has been proposed. The classifier based method requires prior knowledge about the signal to noise ratio, which is not the case for the quantile based method. However, the quantile based method relies on assumptions on energy distributions of human speech in the time-frequency domain, which need to be verified by more experiments. Finally, the quantile based method seems to work better for certain kinds of non-stationary noise than the classifier based method.

Noise Elimination. Two methods for removing the estimated noise have been investigated, namely spectral subtraction and Wiener filtering. The latter seems superior according to experimental evidence (Section 5). If spectral subtraction is modified such that an appropriate overestimation of the noise is subtracted, then the achieved error rate comes close to the Wiener filter.

7. REFERENCES

[1] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of Speech Corrupted by Acoustic Noise," in Proc. ICASSP, (Washington, USA), pp. 208-211, Apr. 1979.

[2] H. G. Hirsch and C. Ehrlicher, "Noise Estimation Techniques for Robust Speech Recognition," in Proc. ICASSP, pp. 153-157, 1995.

[3] B. H. Juang, "Speech Recognition in Adverse Environments," Computer Speech and Language, vol. 5, pp. 275-294, 1991.

[4] J.-C. Junqua and J. P. Haton, "Robustness in Automatic Speech Recognition: Fundamentals and Applications," Kluwer, Boston, 1996.

[5] P. Lockwood and J. Boudy, "Experiments with a Nonlinear Spectral Subtractor (NSS), Hidden Markov Models and the projection, for robust speech recognition in cars," Speech Communication, vol. 11, pp. 215-228, 1992.

[6] L. Singh and S. Sridharan, "Speech Enhancement using Critical Band Spectral Subtraction," in Proc. ICSLP, (Sydney, Australia), Nov. 1998.

[7] D. Langmann, T. Schneider, R. Grudszus, A. Fischer, T. Crull, H. Pfitzinger, M. Westphal, and U. Jekosch, "CSDC - The MoTiV Car-Speech Data Collection," in First International Conference on Language Resources and Evaluation, (Granada, Spain), May 1998.

[8] A. Fischer and V. Stahl, "Subword Unit based Speech Recognition in Car Environments," in Proc. ICASSP, (Seattle, USA), pp. 257-261, May 1998.

[9] A. Fischer and V. Stahl, "Database and Online Adaptation for improved Speech Recognition in Car Environments," in Proc. ICASSP, (Phoenix, USA), pp. 445-449, March 1999.

[10] R. Martin, "Spectral Subtraction based on Minimum Statistics," in Proc. European Signal Processing Conference, pp. 1182-1185, Sep. 1994.

[11] M. H. Hayes, "Statistical Digital Signal Processing and Modeling," John Wiley & Sons, Inc., 1996.
