Quantile based noise estimation for spectral subtraction and Wiener filtering

Volker Stahl, +2 more
- Vol. 3, pp 1875-1878
TLDR
This paper restricts its considerations to the case where only a single microphone recording of the noisy signal is available and proposes a method based on temporal quantiles in the power spectral domain, which is compared with pause detection and recursive averaging.
Abstract
Elimination of additive noise from a speech signal is a fundamental problem in audio signal processing. In this paper we restrict our considerations to the case where only a single microphone recording of the noisy signal is available. The algorithms which we investigate proceed in two steps. First, the noise power spectrum is estimated. A method based on temporal quantiles in the power spectral domain is proposed and compared with pause detection and recursive averaging. The second step is to eliminate the estimated noise from the observed signal by spectral subtraction or Wiener filtering. The database used in the experiments comprises 6034 utterances of German digits and digit strings by 770 speakers in 10 different cars. Without noise reduction, we obtain an error rate of 11.7%. Quantile based noise estimation and Wiener filtering reduce the error rate to 8.6%. Similar improvements are achieved in an experiment with artificial, non-stationary noise.


QUANTILE BASED NOISE ESTIMATION FOR
SPECTRAL SUBTRACTION AND WIENER FILTERING
Volker Stahl, Alexander Fischer and Rolf Bippus
Philips Research Laboratories
Weisshausstrasse 2, D-52066 Aachen, Germany
email: {vstahl,afischer,bippus}@pfa.research.philips.com
ABSTRACT
Elimination of additive noise from a speech signal is a fundamental problem in audio signal processing. In this paper we restrict our considerations to the case where only a single microphone recording of the noisy signal is available. The algorithms which we investigate proceed in two steps: First, the noise power spectrum is estimated. A method based on temporal quantiles in the power spectral domain is proposed and compared with pause detection and recursive averaging. The second step is to eliminate the estimated noise from the observed signal by spectral subtraction or Wiener filtering. The database used in the experiments comprises 6034 utterances of German digits and digit strings by 770 speakers in 10 different cars. Without noise reduction, we obtain an error rate of 11.7%. Quantile based noise estimation and Wiener filtering reduce the error rate to 8.6%. Similar improvements are achieved in an experiment with artificial, non-stationary noise.
1. INTRODUCTION
The error rate of speech recognition systems increases dramatically in the presence of noise. It is therefore essential to provide some means of noise reduction in the front end of speech recognizers which operate under adverse conditions. A particularly noisy but important application domain of speech recognition is the car environment [3, 4, 5, 2, 8, 9]. In this paper we investigate different noise reduction methods and carry out experiments on a large speech database which has been recorded in the car.

The paper is structured as follows: In Section 2 we give a brief description of the speech recognition system and the database used for the experiments. Model assumptions on the speech and noise signal are stated in Section 3. In Section 4 we discuss two methods to estimate the noise power spectrum. The first method is based on frame wise speech / non-speech classification and recursive averaging over non-speech frames. As pause detection in noisy environments is a difficult problem, we propose a second method, which does not depend on a classifier. The noise is estimated as a temporal quantile in the power spectral domain. According to an experimental comparison, quantile based noise estimation performs significantly better, especially under non-stationary noise. In Section 5 we apply spectral subtraction and Wiener filtering to eliminate the estimated noise from the input signal. The results are summarized in Section 6.
2. DATABASE AND SPEECH RECOGNITION SYSTEM
The experimental results reported in this paper are based on the German digit string subset of the MoTiV database [7]. The corpus comprises 6034 utterances (4436 for training and 1598 for evaluating the error rate) by 770 speakers in 10 cars in various driving situations. Training and evaluation are always done on the matched scenario, i.e. the same noise elimination methods are applied during training and evaluation.

The speech recognizer is a continuous mixture density hidden Markov model (HMM) system whose parameters are estimated by Viterbi training. Each mixture consists of 8 Gaussian densities with density specific, diagonal covariance matrices. The system uses two HMMs for each digit, one for male and one for female speakers. The signal analysis is as follows: The observed speech signal is subdivided into overlapping frames of 32 ms length, spaced 16 ms apart. For each frame the power spectrum is estimated through a Hamming windowed FFT followed by a filter bank with 15 mel spaced triangular kernels. After a discrete cosine transform of the logarithmic filter bank outputs we obtain 12 mel frequency cepstral coefficients, which, augmented by 12 regression coefficients, are passed to the recognizer. In this paper we experiment with an additional preprocessing step in the power spectral domain in order to reduce additive noise in the signal.
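
The first stage of this signal analysis (framing, Hamming window, power spectrum) can be sketched as follows. This is a minimal illustration under stated assumptions, not the recognizer's actual front end: the function name `frame_power_spectra` is hypothetical, the sample rate is a free parameter because the text does not state it, a naive DFT stands in for the FFT, and the mel filter bank and cosine transform are left out.

```python
import cmath
import math


def frame_power_spectra(samples, sample_rate, frame_ms=32, shift_ms=16):
    """Split a signal into overlapping frames and return one power spectrum per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    # Hamming window of the frame length.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT for clarity (assumption: a real front end would use an FFT).
        spectrum = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            spectrum.append(abs(coeff) ** 2)  # magnitude squared coefficient
        spectra.append(spectrum)
    return spectra
```

Each returned list holds the magnitude squared Fourier coefficients of one frame, which is the domain in which the noise reduction step described in this paper operates.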
3. NOTATION AND ASSUMPTIONS
We assume that the observed noise signal is a realization of a wide sense stationary process [11]. The major part of this paper deals with the estimation of its power spectrum $N(\omega)$. As the estimation is more reliable if more data is available, we use the notation $N(\omega, t)$ to denote an estimation of $N(\omega)$ using all frames from the beginning of the utterance up to frame $t$. Further, we assume that the clean speech signal within each frame $t$ is an instance of a wide sense stationary process with power spectrum $S(\omega, t)$. For the sake of notational simplicity we do not distinguish between power spectra and periodogram based power spectrum estimations. As the speech and noise signal are assumed to be additive and independent, the power spectrum of the observed signal is

$$X(\omega, t) = S(\omega, t) + N(\omega).$$

The power spectrum $X(\omega, t)$ is estimated by magnitude squared Fourier coefficients of the observed signal in frame $t$. The clean speech signal power spectrum can therefore be estimated as

$$S(\omega, t) = X(\omega, t) - N(\omega, t).$$
4. ESTIMATION OF THE NOISE SPECTRUM
A crucial step in noise suppression methods like Wiener filtering or spectral subtraction is the estimation of the noise spectrum. There are applications where this task is simplified by some prior knowledge of the noise spectrum or by multi channel recordings. However, in this paper we assume that there is only a single microphone and all we know about the noise signal is that it is more or less stationary, independent of the speech signal and additive.

A commonly used method for noise spectrum estimation is to average over sections in the input signal which do not contain speech (Section 4.1). However, this approach requires that non-speech sections can be detected reliably, which is difficult especially under noisy conditions. Moreover, it relies on the fact that there actually exists a sufficient amount of non-speech in the signal. In order to avoid these problems, we propose a method to estimate the noise spectrum without explicit frame wise speech / non-speech classification (Section 4.2). The idea is to estimate the noise energy in each frequency band by temporal quantiles in the power spectral domain.
4.1. Noise Spectrum Estimation Based on Frame Wise Speech / Non-Speech Classification
If the signal to noise ratio is not too low, a simple method to detect speech is based on the signal energy. As the noise signal is assumed to be stationary, the signal energy in the entire utterance is greater than or equal to the noise energy. If the energy in a frame is significantly larger than the estimated noise energy, then the frame is likely to contain speech. Otherwise it is a pure noise frame and is used to update the current noise estimation. Let $X(\omega, t)$ be the power spectrum at frequency $\omega$ in the $t$-th frame of the input signal and $N(\omega, t)$ be the power spectrum of the estimated noise at frequency $\omega$ in frame $t$. A simple recursive formula to estimate the noise energy $N(\omega, t)$ is as follows:

$$N(\omega, t) = \begin{cases} N(\omega, t-1) & \text{if } \mathrm{XNR}(t) > \gamma \\ (1-\delta)\, N(\omega, t-1) + \delta\, X(\omega, t) & \text{else} \end{cases} \qquad (1)$$

$$\mathrm{XNR}(t) = \frac{\sum_{\omega} X(\omega, t)}{\sum_{\omega} N(\omega, t-1)}$$

for all $\omega$. The recursion is initialized by $N(\omega, 0) = X(\omega, 0)$, which reflects the assumption that the first frame of an utterance does not contain speech. Note that each frame is classified as either pure noise or speech plus noise. Equation (1) has two parameters $\gamma$ and $\delta$ which depend on the speech data under consideration. Parameter $\gamma$ is related to the signal to noise ratio. Parameter $\delta$ determines the adaptation speed of the noise estimation. According to experimental results $\gamma = 1.8$ and $\delta = 0.03$ perform well for the MoTiV corpus. The estimated noise $N(\omega, t)$ is removed from the input signal $X(\omega, t)$ by means of a Wiener filter, see Section 5. With this noise elimination method we obtain a word error rate of 10.3%. Without noise elimination the word error rate is 11.7%, i.e. the relative improvement is 12%.
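
The recursion in equation (1) can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiments: the function name, the list-of-lists frame layout and the parameter names `threshold` and `rate` are assumptions made here; their defaults are the two values reported above for the MoTiV corpus (1.8 and 0.03).

```python
def classifier_noise_estimate(frames, threshold=1.8, rate=0.03):
    """Recursive noise estimate over frames judged to be non-speech.

    frames: list of power spectra, one list of band energies per frame.
    Returns the noise estimate after every frame.
    """
    noise = list(frames[0])  # the first frame is assumed to be pure noise
    history = [list(noise)]
    for frame in frames[1:]:
        # Ratio of total observed energy to total estimated noise energy.
        xnr = sum(frame) / sum(noise)
        if xnr <= threshold:
            # Non-speech frame: move the estimate towards the observation.
            noise = [(1 - rate) * n + rate * x for n, x in zip(noise, frame)]
        # Otherwise the frame counts as speech and the estimate is kept.
        history.append(list(noise))
    return history
```

The sketch assumes that the first frame has nonzero energy, since the energy ratio is undefined otherwise.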

Frame wise speech / non-speech classification under noisy conditions is a difficult problem far from being solved satisfactorily. The frame error rate of the speech / non-speech classifier described above is around 16% on the MoTiV corpus. In the next section we describe a method for estimating the noise spectrum which does not require explicit speech / non-speech classification.
4.2. Quantile Based Noise Spectrum Estimation
In [10] an algorithm for noise estimation based on minimum statistics has been proposed. As the minimum is sensitive to outliers we use a quantile different from the minimum. The algorithm proposed in this section is somewhat simpler and has fewer parameters than the one in [10] but is computationally more expensive. A similar method has been described in [2].

It is well known that even in speech sections of the input signal not all frequency bands are permanently occupied with speech. In fact, for a significant percentage of the time the energy in each frequency band is on the noise level. This observation can be used to estimate a noise power spectrum $N(\omega)$ from the observed speech signal $X(\omega, t)$ by taking the $q$-th quantile over time in every frequency band. More precisely, for every $\omega$ the frames of the entire utterance $X(\omega, t)$, $t = 0, \dots, T$ are sorted such that $X(\omega, t_0) \le X(\omega, t_1) \le \dots \le X(\omega, t_T)$. The $q$-quantile noise estimation is defined as

$$N(\omega) = X(\omega, t_{\lfloor qT \rfloor}). \qquad (2)$$

For example, $q = 0$ yields the minimum, $q = 1$ the maximum and $q = 0.5$ the median. This approach is based on the assumption that each frequency band carries only noise for at least the $q$-th fraction of the time, even during speech sections. Obviously this is true for very small values of $q$, but in order to obtain a robust estimation of the noise spectrum, which is not sensitive to outliers, we hope that $q$ is somewhere near the median, i.e. $q \approx 0.5$.
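
Equation (2) translates almost directly into code. The following is a minimal sketch, assuming that the utterance is available as a list of per-frame power spectra; the function name and the data layout are illustrative choices, not taken from the paper.

```python
import math


def quantile_noise_estimate(frames, q=0.5):
    """Per-band q-quantile over time, following equation (2).

    frames: list of T+1 power spectra, each a list of band energies.
    """
    last = len(frames) - 1          # T in the text
    index = math.floor(q * last)    # floor(q * T)
    num_bands = len(frames[0])
    noise = []
    for band in range(num_bands):
        # Sort the energies of this band over all frames of the utterance.
        energies = sorted(frame[band] for frame in frames)
        noise.append(energies[index])
    return noise
```

With `q=0` the function returns the per-band minimum and with `q=1` the per-band maximum, matching the special cases named above.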

Figure 1: Quantiles of the energy distribution in the observed signal $X$ at 300 Hz, 1500 Hz and 3000 Hz for a typical utterance of the MoTiV corpus (horizontal axis: $q$ from 0.1 to 0.9; vertical axis: $N(\omega)$).

Figure 1 shows $N(\omega)$ according to (2) as a function of $q$ for 3 different frequencies $\omega$ and a typical 7 digit utterance taken from the MoTiV corpus. In roughly 80-90% of the frames the signal energy in the frequency bands is low, i.e. close to the noise energy level, and only 10-20% of the time does the frequency band carry high energy, voiced speech. Note that the curves also depend on the duration of the pause sections in the signal. However, the major part of the utterance in Figure 1 was speech. For the MoTiV corpus the optimal value for $q$ was determined experimentally. The estimated noise $N(\omega)$ was eliminated from the signal by a Wiener filter, see Section 5. The resulting word error rates (WER) are summarized in Table 1. The word error rate without any noise reduction method is 11.7%, i.e. the relative reduction is 26% for the optimal choice $q = 0.55$. With a 5909 word test set we obtain, under certain simplifying assumptions, a confidence interval of 0.8% at the 95% level for the baseline error rate of 11.7%. Error rates below 10.9% are therefore significant improvements.
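
The two figures quoted here can be checked by hand. The simplifying assumptions are not spelled out in the text; one plausible reading, used here purely as an assumption, is that word errors are independent, so that the error count is binomial and the normal approximation applies with $p = 0.117$ and $n = 5909$:

$$1.96 \sqrt{\frac{p(1-p)}{n}} = 1.96 \sqrt{\frac{0.117 \cdot 0.883}{5909}} \approx 0.0082,$$

i.e. about 0.8 percentage points, which gives the threshold $11.7\% - 0.8\% = 10.9\%$. The relative reduction for $q = 0.55$ follows from the two error rates as

$$\frac{11.7 - 8.6}{11.7} \approx 0.26.$$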

q      0.2    0.3    0.4    0.5    0.55   0.6    0.7
WER    11.3   10.8   10.1   8.9    8.6    8.8    9.7
Table 1: Word error rate (%) with Wiener filter and noise estimation $N(\omega)$ according to (2).

Causality. Note that the estimation of the noise spectrum depends on the entire utterance $X(\omega, t)$ for all $t = 0, \dots, T$. A noise suppression filter based on this approach is therefore not causal. However, if we define $N(\omega, t)$ as the $q$-quantile of $X(\omega, \tau)$ for $\tau = 0, \dots, t$, we obtain a causal filter. Table 2 summarizes the results of the same experiments as in Table 1 but this time we used a causal noise estimation. The error rates achieved by the causal filter are slightly higher than for the non-causal case. The reason is that the noise estimation at the beginning of the signal is very unreliable because little data is available to estimate $N(\omega, t)$ for small $t$.

q      0.2    0.3    0.4    0.5    0.55   0.6    0.7
WER    11.5   10.8   10.0   8.8    8.9    9.1    10.2
Table 2: Same experiment as in Table 1 but with causal noise estimation (word error rate in %).

Eciency.
The computational cost and memory con-
sumption for estimating
N
(
!; t
)grows with
t
. This is prob-
lematic for real time and low resource implementations. As
a consequence weinvestigated approximate methods for the
quantile computation whichare moreecient in terms of
time and space. The idea is to store the observations
X
(
!; t
)
for
t
=0
;
1
;:::
in a buer with xed length . Separate
buers are used for each frequency
!
. If a buer is full, then
the largest and the smallest elementareremoved from the
buer. The quantile is determined by considering only the
elements in the buer. The obvious question now is how
large the buer should be and how muchtherecognition
error rate increases with a nite length buer. Results of
experiments with dierent buer lengths and
q
=0
:
5are
reported in Table 3. As expected, the error rate increases
for small buer sizes and achieves asymptotically the error
rate of the exact quantile computation. Another metho d to
3 5 10 20 40 60 100
WER 10.6 10.2 9.3 9.1 9.3 9.2 8.9
Table 3: Same exp erimentasin Table 2 for
q
= 0
:
5 but
with limited buer length for the quantile computation.
improveeciencyistointegrate several adjacent frequen-
cies and do a band wise noise estimation [6].
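
The buffer scheme can be sketched as a small class that holds one sorted buffer per frequency band. This is a hedged illustration: the class name `QuantileBuffer` and its interface are hypothetical, and the choice to evict the two extremes just before inserting the next observation is one reading of the description above, which does not fix the exact order of operations.

```python
import bisect
import math


class QuantileBuffer:
    """Fixed-length sorted buffer for one frequency band."""

    def __init__(self, length, q=0.5):
        self.length = length
        self.q = q
        self.values = []   # kept sorted at all times

    def update(self, energy):
        """Add one observation and return the current quantile estimate."""
        if len(self.values) >= self.length:
            # Buffer full: drop the smallest and the largest element.
            self.values.pop(0)
            if self.values:
                self.values.pop()
        bisect.insort(self.values, energy)
        index = math.floor(self.q * (len(self.values) - 1))
        return self.values[index]
```

Memory per band is bounded by the buffer length, and each update costs at most one sorted insertion into a list of that size, independent of how many frames have been seen.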

Non-stationary Noise. We observed that the classifier based method in Section 4.1 performs quite poorly if the noise energy increases abruptly, say at time $\hat{t}$. The reason is that the estimated noise $N(\omega, \hat{t})$ at time $\hat{t}$ is small compared to subsequent input frames $X(\omega, t)$ for $t > \hat{t}$, especially if frame $X(\omega, t)$ does not contain speech. Therefore, according to (1), all frames after $\hat{t}$ are classified as speech and hence the noise estimation will not be updated any more after time $\hat{t}$, i.e.

$$N(\omega, t) = N(\omega, \hat{t}) \quad \text{for all } t > \hat{t}.$$

In other words, the noise estimation does not converge to the observed noise. The quantile based method presented in this section does not suffer from this problem and seems therefore advantageous for non-stationary noise. In order to verify this theoretical consideration by an experiment, we inserted 0.5 seconds of car noise from a BMW 540 at 50 km/h before the beginning of each sound file of the test set. The columns of Table 4 contain the word error rates for the cases no noise reduction, noise estimation by the classifier based method and noise estimation by the quantile based method for $q = 0.5$ and buffer sizes 10, 20, 60, and unlimited respectively. In each scenario the error rate is significantly higher than in the corresponding case without inserted car noise. The deterioration for the classifier based noise estimation method, however, is much more severe than for the quantile based method and is even worse than for the case without noise elimination. The adaptation time to a changing noise signal in the quantile based method is proportional to the buffer length, which explains why in this experiment shorter buffer lengths give better results.

Method   none   classifier   quantile, buffer length:
                             10     20     60     unlimited
WER      13.7   18.5         10.1   10.5   10.6   11.7
Table 4: Word error rate (%) if 0.5 seconds of low energy car noise are added to the beginning of the sound files of the test set.
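
The freezing effect can be reproduced on synthetic data. The sketch below tracks a single frequency band whose energy jumps from a quiet to a loud noise level without any speech. All names and the energy values are hypothetical, the classifier uses the threshold 1.8 and adaptation rate 0.03 from Section 4.1, and the buffered median follows one reading of the buffer scheme from this section.

```python
import bisect


def classifier_track(energies, threshold=1.8, rate=0.03):
    """Single-band version of the recursion in equation (1)."""
    noise = energies[0]
    track = [noise]
    for x in energies[1:]:
        if x / noise <= threshold:
            # Frame classified as noise: update the estimate.
            noise = (1 - rate) * noise + rate * x
        track.append(noise)
    return track


def buffered_median_track(energies, length=10):
    """Median over a sorted buffer that drops its extremes when full."""
    buf, track = [], []
    for x in energies:
        if len(buf) >= length:
            buf.pop(0)
            if buf:
                buf.pop()
        bisect.insort(buf, x)
        track.append(buf[(len(buf) - 1) // 2])
    return track


# Hypothetical band energies: quiet noise, then an abrupt rise with no speech.
energies = [1.0] * 25 + [5.0] * 40
frozen = classifier_track(energies)
adaptive = buffered_median_track(energies)
```

In this toy signal every loud frame exceeds the threshold, so `frozen` stays at the quiet level until the end, while `adaptive` still reports the quiet level on the first loud frame and has moved to the loud level by the last frame.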
5. ELIMINATION OF THE NOISE FROM THE SPEECH SIGNAL
In the previous section we discussed methods for estimating the noise power spectrum $N(\omega, t)$. In this section we review approaches for eliminating the estimated noise from the observed signal. If we had complete information about the noise spectrum, i.e. magnitude and phase, the noise elimination would amount to a simple subtraction of the complex Fourier coefficients. Unfortunately we have no phase information of the noise. Hence we apply spectral subtraction and Wiener filtering for the noise elimination. The FIR Wiener filter is defined as the linear filter which minimizes the mean square error in the time domain. Spectral subtraction relies on the fact that the power spectrum of the sum of two independent random signals is the sum of the power spectra. The noise elimination rule of spectral subtraction is therefore simply to subtract the power spectrum of the estimated noise from the power spectrum of the observed signal. Surprisingly the formulae for the Wiener filter and spectral subtraction are quite similar. Let

$$H(\omega, t) = \frac{X(\omega, t) - N(\omega, t)}{X(\omega, t)}. \qquad (3)$$

The noise reduced signal $S(\omega, t)$ by Wiener filtering is

$$S(\omega, t) = H(\omega, t)^2\, X(\omega, t),$$

noise reduction by spectral subtraction is defined as

$$S(\omega, t) = X(\omega, t) - N(\omega, t) = H(\omega, t)\, X(\omega, t).$$

Sometimes the long term estimated noise power spectrum $N(\omega, t)$ can be larger than the instantaneous observed power spectrum $X(\omega, t)$. In this case we would expect that the noise reduced power spectrum $S(\omega, t)$ should be zero. Therefore (3) is usually modified as

$$H(\omega, t) = \frac{\max(X(\omega, t) - N(\omega, t),\, 0)}{X(\omega, t)}.$$
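
For a single frequency band and frame, the clipped gain and the two noise elimination rules amount to the following. This is a minimal sketch with hypothetical function names; returning a gain of zero for a band with zero observed energy is an assumption added here to avoid a division by zero.

```python
def gain(x, n):
    """Clipped gain H = max(X - N, 0) / X for one band."""
    return max(x - n, 0.0) / x if x > 0 else 0.0


def wiener(x, n):
    """Wiener-filtered band energy, H^2 * X."""
    return gain(x, n) ** 2 * x


def spectral_subtraction(x, n):
    """Spectral subtraction, H * X, which equals max(X - N, 0)."""
    return gain(x, n) * x
```

Because the gain never exceeds one, the Wiener rule always leaves at most as much energy in a band as spectral subtraction does. For example, with `x = 4.0` and `n = 1.0` the gain is 0.75, spectral subtraction yields 3.0 and the Wiener rule yields 2.25.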

Experimental experience indicates that better recognition results are achieved if a small fraction of the noise power is left in the signal [1, 10]. Hence, the energy of the noise reduced signal $S(\omega, t)$ which is passed to the recognizer is

$$S(\omega, t) = \max(S(\omega, t),\, \beta N(\omega, t))$$

where $\beta = 0.04$ has been chosen experimentally.

An experimental comparison of spectral subtraction and Wiener filtering for $\beta = 0.04$ is given in Table 5. The noise power spectrum $N(\omega, t)$ has been estimated as in Table 2.

q        0.2    0.3    0.4    0.5    0.6    0.7    0.8
Wiener   11.5   10.8   10.0   8.8    9.1    10.2   12.1
Subtr.   11.7   11.4   10.9   10.1   9.9    9.6    11.6
Table 5: Experimental comparison of the word error rates (%) of Wiener filtering and spectral subtraction.

As suggested in [1] the performance of spectral subtraction can be improved by subtracting an overestimation of the noise power spectrum, i.e.

$$S_{\alpha,\beta}(\omega, t) = \max(X(\omega, t) - \alpha N(\omega, t),\, \beta N(\omega, t)).$$

In our experiments we found an optimum for $\alpha = 2.5$, which gives a word error rate of 9.2% for $q = 0.5$.
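
The noise floor and the overestimated subtraction can be sketched per band as follows. The function and parameter names are hypothetical; the defaults are the floor factor 0.04 and the overestimation factor 2.5 reported above. Reusing 0.04 as the floor inside the overestimated subtraction is an assumption, since the text does not state which floor was used in that experiment.

```python
def floored(s, n, floor=0.04):
    """Keep at least a fraction `floor` of the noise power in the band."""
    return max(s, floor * n)


def oversubtraction(x, n, over=2.5, floor=0.04):
    """Subtract an overestimate of the noise, then apply the noise floor."""
    return max(x - over * n, floor * n)
```

For an observed energy of 4.0 and a noise estimate of 1.0 the overestimated subtraction leaves 1.5; for an observed energy of 2.0 the difference is negative, so the floor of 0.04 times the noise estimate is returned instead.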
6. CONCLUSION
We investigated methods to remove additive noise from a speech signal which has been recorded in the car environment by a single microphone. The error rate of a speech recognizer has been reduced by up to 26% relative by quantile based noise estimation in the power spectral domain and Wiener filtering. The methods proceed in two steps: Estimation of the noise signal and elimination.

Noise Estimation. We studied two noise estimation methods: The first one is based on frame wise speech / non-speech classification and recursive smoothing over non-speech frames (Section 4.1), the second method estimates the noise by quantiles in the power spectral domain (Section 4.2). The quantile based noise estimation method gives significantly better results but is more expensive in terms of computing time and memory. An approximation algorithm for improving the efficiency of the quantile based method has been proposed. The classifier based method requires prior knowledge about the signal to noise ratio, which is not the case for the quantile based method. However, the quantile based method relies on assumptions on energy distributions of human speech in the time-frequency domain, which need to be verified by more experiments. Finally, the quantile based method seems to work better for certain kinds of non-stationary noise than the classifier based method.

Noise Elimination. Two methods for removing the estimated noise have been investigated, namely spectral subtraction and Wiener filtering. The latter seems superior according to experimental evidence (Section 5). If spectral subtraction is modified such that an appropriate overestimation of the noise is subtracted, then the achieved error rate comes close to the Wiener filter.
7. REFERENCES
[1] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of Speech Corrupted by Acoustic Noise," in Proc. ICASSP, (Washington, USA), pp. 208-211, Apr. 1979.
[2] H. G. Hirsch and C. Ehrlicher, "Noise Estimation Techniques for Robust Speech Recognition," in Proc. ICASSP, pp. 153-157, 1995.
[3] B. H. Juang, "Speech Recognition in Adverse Environments," Computer Speech and Language, vol. 5, pp. 275-294, 1991.
[4] J.-C. Junqua and J. P. Haton, "Robustness in Automatic Speech Recognition: Fundamentals and Applications," Kluwer, Boston, 1996.
[5] P. Lockwood and J. Boudy, "Experiments with a Nonlinear Spectral Subtractor (NSS), Hidden Markov Models and the projection, for robust speech recognition in cars," Speech Communication, vol. 11, pp. 215-228, 1992.
[6] L. Singh and S. Sridharan, "Speech Enhancement using Critical Band Spectral Subtraction," in Proc. ICSLP, (Sydney, Australia), Nov. 1998.
[7] D. Langmann, T. Schneider, R. Grudszus, A. Fischer, T. Crull, H. Pfitzinger, M. Westphal, and U. Jekosch, "CSDC - The MoTiV Car-Speech Data Collection," in First International Conference on Language Resources and Evaluation, (Granada, Spain), May 1998.
[8] A. Fischer and V. Stahl, "Subword Unit based Speech Recognition in Car Environments," in Proc. ICASSP, (Seattle, USA), pp. 257-261, May 1998.
[9] A. Fischer and V. Stahl, "Database and Online Adaptation for improved Speech Recognition in Car Environments," in Proc. ICASSP, (Phoenix, USA), pp. 445-449, March 1999.
[10] R. Martin, "Spectral Subtraction based on Minimum Statistics," Proc. European Signal Processing Conference, pp. 1182-1185, Sep. 1994.
[11] M. H. Hayes, "Statistical Digital Signal Processing and Modeling," John Wiley & Sons, Inc., 1996.