JOINT LATE REVERBERATION AND NOISE POWER SPECTRAL DENSITY ESTIMATION
IN A SPATIALLY HOMOGENEOUS NOISE FIELD
Ina Kodrasi, Simon Doclo
University of Oldenburg, Department of Medical Physics and Acoustics
and Cluster of Excellence Hearing4All, Oldenburg, Germany
Idiap Research Institute, Speech and Audio Processing Group, Martigny, Switzerland
{ina.kodrasi,simon.doclo}@uni-oldenburg.de
ABSTRACT
Many multi-channel dereverberation and noise reduction techniques
such as the multi-channel Wiener filter (MWF) require an estimate
of the late reverberation and noise power spectral densities (PSDs).
State-of-the-art multi-channel methods for estimating the late rever-
beration PSD typically assume that the noise PSD matrix is known.
Instead of assuming that the noise PSD matrix is known, in this pa-
per we model the noise as a spatially homogeneous sound field with
an unknown time-varying PSD and a known time-invariant spatial
coherence matrix. Based on this model, two joint estimators of the
late reverberation and noise PSDs are proposed, i.e., a non-blocking-
based estimator which simultaneously estimates the target signal,
late reverberation, and noise PSDs, and a blocking-based estima-
tor which first estimates the late reverberation and noise PSDs at the
output of a blocking matrix aiming to block the target signal. Ex-
perimental results show that the proposed blocking-based estimator
yields the best performance when used in an MWF, even resulting in
a similar or better performance than a state-of-the-art blocking-based
estimator of the late reverberation PSD which assumes that the noise
PSD matrix is known.
Index Terms: PSD estimation, late reverberation, noise,
MWF, least-squares
1. INTRODUCTION
In many hands-free speech communication applications, the recorded
microphone signals do not only contain the desired speech signal,
but also attenuated and delayed copies of the desired speech signal
due to reverberation, as well as additive noise. While early reverber-
ation may be desirable [1], late reverberation and noise may degrade
the perceived quality and hinder the intelligibility of speech [2, 3].
Hence, effective dereverberation and noise reduction techniques are
required.
A commonly used dereverberation and noise reduction tech-
nique is the multi-channel Wiener filter (MWF), which aims at
minimizing the mean-square error between the output signal and
the target signal [4–6]. The implementation of the MWF requires
(among other parameters) an estimate of the late reverberation
and noise power spectral densities (PSDs). To estimate the late
reverberation PSD, several single-channel estimators based on a
temporal model of reverberation [7–9] as well as multi-channel
estimators based on a diffuse sound field model for the late rever-
beration [10–18] have been proposed. To the best of our knowledge,
state-of-the-art multi-channel late reverberation PSD estimators
estimate the late reverberation PSD assuming that an estimate of the
noise PSD matrix is available. The noise PSD matrix is typically
estimated from the microphone signals during speech pauses detected
by means of a voice activity detector (VAD) [19, 20], generally
requiring the noise PSD to be rather time-invariant. However, in
many acoustic scenarios, e.g., in highly reverberant environments,
speech pauses may rarely occur, making the estimation of the noise
PSD matrix challenging. In addition, in many acoustic scenarios
the noise PSD can be time-varying, e.g., when the noise consists of
microphone self-noise in a system with the input gain automatically
adjusted during operation using an automatic gain control.
This work was supported by the Cluster of Excellence Hearing4All,
funded by the German Research Foundation (DFG), and the joint Lower
Saxony-Israeli Project ATHENA, funded by the State of Lower Saxony.
Instead of assuming that an estimate of the noise PSD matrix
is available, in this paper we model the noise as a spatially homo-
geneous sound field with a time-varying PSD and assume that only
knowledge of the time-invariant spatial coherence matrix is avail-
able. Two alternative joint estimators of the late reverberation and
noise PSDs are proposed, i.e., a non-blocking-based estimator which
simultaneously estimates the target signal, late reverberation, and
noise PSDs, and a blocking-based estimator which first estimates
only the late reverberation and noise PSDs at the output of a block-
ing matrix aiming to block the target signal. The proposed PSD es-
timators can be viewed as extensions of the PSD estimators in [10]
and [16], where only the target signal and late reverberation PSDs
are estimated assuming that an estimate of the noise PSD matrix is
available. Simulation results for several realistic acoustic scenarios
show that the proposed blocking-based PSD estimator yields the best
performance when used in an MWF, also yielding a similar or better
performance than the PSD estimator in [10] which assumes that the
noise PSD matrix is known.
2. SIGNAL MODEL AND ASSUMPTIONS
Consider a reverberant and noisy multi-channel acoustic system with a single speech source and $M$ microphones. In the short-time Fourier transform (STFT) domain, the $M$-dimensional vector of the received microphone signals $\mathbf{y}(k,l) = [Y_1(k,l) \; \ldots \; Y_M(k,l)]^T$ at frequency bin $k$ and frame index $l$ is given by

$$\mathbf{y}(k,l) = \underbrace{\mathbf{x}_e(k,l) + \mathbf{x}_r(k,l)}_{\mathbf{x}(k,l)} + \mathbf{v}(k,l), \tag{1}$$

with $\mathbf{x}_e(k,l)$ the direct and early reverberation component, $\mathbf{x}_r(k,l)$ the late reverberation component, $\mathbf{x}(k,l)$ the reverberant component, and $\mathbf{v}(k,l)$ the noise component. The vectors $\mathbf{x}_e(k,l)$, $\mathbf{x}_r(k,l)$, $\mathbf{x}(k,l)$, and $\mathbf{v}(k,l)$ are defined similarly to $\mathbf{y}(k,l)$. The direct and early reverberation component $\mathbf{x}_e(k,l)$ can be expressed as

$$\mathbf{x}_e(k,l) = S(k,l)\,\mathbf{d}(k), \tag{2}$$

with $S(k,l)$ the target signal, i.e., the direct and early reverberation component received at the reference microphone, and $\mathbf{d}(k) = [D_1(k) \; \ldots \; D_M(k)]^T$ the $M$-dimensional vector of relative transfer functions (RTFs) of the target signal between the reference microphone and all microphones. The target signal $S(k,l)$ is often defined as the direct component only, such that the RTF vector $\mathbf{d}(k)$ only depends on the direction of arrival (DOA) of the speech source and the microphone array geometry [10–14, 16]. For conciseness, the frequency index $k$ is omitted in the remainder of this paper.
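Assuming that the components in (1) are mutually uncorrelated, the PSD matrix of the microphone signals is equal to

$$\boldsymbol{\Phi}_{\mathbf{y}}(l) = E\{\mathbf{y}(l)\mathbf{y}^H(l)\} = \underbrace{\boldsymbol{\Phi}_{\mathbf{x}_e}(l) + \boldsymbol{\Phi}_{\mathbf{x}_r}(l)}_{\boldsymbol{\Phi}_{\mathbf{x}}(l)} + \boldsymbol{\Phi}_{\mathbf{v}}(l), \tag{3}$$

where $E$ denotes the expectation operator, $\boldsymbol{\Phi}_{\mathbf{x}_e}(l)$ is the direct and early reverberation PSD matrix, $\boldsymbol{\Phi}_{\mathbf{x}_r}(l)$ is the late reverberation PSD matrix, $\boldsymbol{\Phi}_{\mathbf{x}}(l)$ is the reverberant PSD matrix, and $\boldsymbol{\Phi}_{\mathbf{v}}(l)$ is the noise PSD matrix. The PSD matrix $\boldsymbol{\Phi}_{\mathbf{x}_e}(l)$ can be expressed as (cf. (2))

$$\boldsymbol{\Phi}_{\mathbf{x}_e}(l) = \Phi_s(l)\,\mathbf{d}\mathbf{d}^H, \tag{4}$$

with $\Phi_s(l)$ the time-varying PSD of the target signal, i.e., $\Phi_s(l) = E\{|S(l)|^2\}$. Modeling the late reverberation as a diffuse sound field [10–18], the PSD matrix $\boldsymbol{\Phi}_{\mathbf{x}_r}(l)$ can be expressed as

$$\boldsymbol{\Phi}_{\mathbf{x}_r}(l) = \Phi_r(l)\,\boldsymbol{\Gamma}, \tag{5}$$

with $\Phi_r(l)$ the time-varying PSD of the late reverberation and $\boldsymbol{\Gamma}$ the spatial coherence matrix of a diffuse sound field, which can be analytically computed based on the microphone array geometry [21]. Modeling the additive noise as a spatially homogeneous sound field, the noise PSD matrix $\boldsymbol{\Phi}_{\mathbf{v}}(l)$ can be expressed as

$$\boldsymbol{\Phi}_{\mathbf{v}}(l) = \Phi_v(l)\,\boldsymbol{\Psi}, \tag{6}$$

with $\Phi_v(l)$ the time-varying noise PSD and $\boldsymbol{\Psi}$ the spatial coherence matrix of the noise, which is assumed to be time-invariant. In the presence of spatially uncorrelated noise (e.g., microphone self-noise), $\boldsymbol{\Psi} = \mathbf{I}$, with $\mathbf{I}$ the $M \times M$-dimensional identity matrix. Using (4), (5), and (6), the PSD matrix $\boldsymbol{\Phi}_{\mathbf{y}}(l)$ is equal to

$$\boldsymbol{\Phi}_{\mathbf{y}}(l) = \Phi_s(l)\,\mathbf{d}\mathbf{d}^H + \Phi_r(l)\,\boldsymbol{\Gamma} + \Phi_v(l)\,\boldsymbol{\Psi}. \tag{7}$$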
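As an illustration of the model in (7), the following minimal Python sketch assembles the PSD matrix for a single frequency bin. Several ingredients are assumptions made only for this example and are not taken from the paper: the helper names (`diffuse_coherence`, `far_field_rtf`, `model_psd_matrix`), the speed of sound of 343 m/s, the free-field far-field RTF model with its sign convention, the spherically isotropic sinc coherence used for the diffuse field, and the numerical PSD values.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def diffuse_coherence(mic_positions, freq_hz, c=SPEED_OF_SOUND):
    """Coherence of a spherically isotropic field: sin(x)/x with x = 2*pi*f*dist/c."""
    diff = mic_positions[:, None, :] - mic_positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # np.sinc(t) computes sin(pi*t)/(pi*t), hence the argument 2*f*dist/c.
    return np.sinc(2.0 * freq_hz * dist / c)


def far_field_rtf(mic_positions, doa_deg, freq_hz, ref=0, c=SPEED_OF_SOUND):
    """Hypothetical free-field RTF vector for a far-field source in the x-y plane."""
    theta = np.deg2rad(doa_deg)
    unit = np.array([np.cos(theta), np.sin(theta), 0.0])
    delays = -(mic_positions @ unit) / c          # arrival delay relative to the origin
    steering = np.exp(-2j * np.pi * freq_hz * delays)
    return steering / steering[ref]               # normalise to the reference microphone


def model_psd_matrix(phi_s, phi_r, phi_v, d, gamma, psi):
    """Equation (7): Phi_y = phi_s d d^H + phi_r Gamma + phi_v Psi."""
    return phi_s * np.outer(d, d.conj()) + phi_r * gamma + phi_v * psi


# Four-microphone linear array with 3 cm spacing, evaluated at 2 kHz.
mics = np.array([[0.03 * m, 0.0, 0.0] for m in range(4)])
d = far_field_rtf(mics, doa_deg=45.0, freq_hz=2000.0)
gamma = diffuse_coherence(mics, freq_hz=2000.0)
psi = np.eye(4)                                   # spatially uncorrelated noise
phi_y = model_psd_matrix(1.0, 0.5, 0.1, d, gamma, psi)
```

The three scalars are the only time-varying quantities in this model; the matrices built from `d`, `gamma`, and `psi` stay fixed across frames.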
Given the filter vector $\mathbf{w}(l) = [W_1(l) \; \ldots \; W_M(l)]^T$, the output signal $Z(l)$ of the speech enhancement system is equal to the sum of the filtered microphone signals, i.e., $Z(l) = \mathbf{w}^H(l)\mathbf{y}(l)$. Dereverberation and noise reduction techniques aim at designing the filter $\mathbf{w}(l)$ such that the output signal $Z(l)$ is as close as possible to the target signal $S(l)$. A widely used dereverberation and noise reduction technique is the MWF, which aims at minimizing the mean-square error between $Z(l)$ and $S(l)$ [4–6]. The MWF is typically implemented as a minimum variance distortionless response (MVDR) beamformer $\mathbf{w}_{\rm MVDR}(l)$ followed by a single-channel Wiener postfilter $G(l)$ [10, 12–18], i.e.,

$$\mathbf{w}_{\rm MWF}(l) = \underbrace{\frac{[\hat{\Phi}_r(l)\boldsymbol{\Gamma} + \hat{\Phi}_v(l)\boldsymbol{\Psi}]^{-1}\mathbf{d}}{\mathbf{d}^H[\hat{\Phi}_r(l)\boldsymbol{\Gamma} + \hat{\Phi}_v(l)\boldsymbol{\Psi}]^{-1}\mathbf{d}}}_{\mathbf{w}_{\rm MVDR}(l)} \; \underbrace{\frac{\hat{\rho}(l)}{1 + \hat{\rho}(l)}}_{G(l)}, \tag{8}$$

with $\hat{\Phi}_r(l)$ and $\hat{\Phi}_v(l)$ denoting the estimated late reverberation and noise PSDs, respectively, and $\hat{\rho}(l)$ denoting the estimated target-to-late reverberation and noise ratio (TRNR) at the output of the MVDR beamformer. The TRNR can be estimated as

$$\hat{\rho}(l) = \frac{\hat{\Phi}_s(l)}{\hat{\Phi}_{rn}(l)}, \tag{9}$$

with $\hat{\Phi}_s(l)$ denoting the estimated target signal PSD and $\hat{\Phi}_{rn}(l) = \{\mathbf{d}^H[\hat{\Phi}_r(l)\boldsymbol{\Gamma} + \hat{\Phi}_v(l)\boldsymbol{\Psi}]^{-1}\mathbf{d}\}^{-1}$ the estimated residual late reverberation and noise PSD at the output of the MVDR beamformer. Alternatively, $\hat{\rho}(l)$ can be estimated using the decision directed approach as [16, 22]

$$\hat{\rho}_{\rm DD}(l) = \beta\,\frac{|Z(l-1)|^2}{\hat{\Phi}_{rn}(l-1)} + (1-\beta)\,\frac{\hat{\Phi}_s(l)}{\hat{\Phi}_{rn}(l)}, \tag{10}$$

with $\beta$ a smoothing parameter. As can be observed in (8), (9), and (10), the implementation of the MWF requires estimates of the time-varying target signal, late reverberation, and noise PSDs. The objective of this paper is to derive estimates $\hat{\Phi}_s(l)$, $\hat{\Phi}_r(l)$, and $\hat{\Phi}_v(l)$, assuming that the RTF vector $\mathbf{d}$, the diffuse spatial coherence matrix $\boldsymbol{\Gamma}$, and the noise spatial coherence matrix $\boldsymbol{\Psi}$ are known. The RTF vector can be constructed based on a DOA estimate, the diffuse spatial coherence matrix can be constructed based on the microphone array geometry, and the noise spatial coherence matrix can be constructed assuming a reasonable sound field model for the noise.
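The structure of (8), (9), and (10) can be sketched for one frequency bin as follows. This is a minimal Python illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the PSD estimates are taken as given inputs, and the gain floor (here -17 dB, applied to the amplitude gain) is only one possible reading of how a minimum postfilter gain might be enforced.

```python
import numpy as np


def mvdr_and_residual(d, gamma, psi, phi_r_hat, phi_v_hat):
    """MVDR filter of (8) and the residual interference PSD at its output."""
    interference = phi_r_hat * gamma + phi_v_hat * psi
    r_inv_d = np.linalg.solve(interference, d)     # [.]^{-1} d without an explicit inverse
    denom = np.real(np.vdot(d, r_inv_d))           # d^H [.]^{-1} d (np.vdot conjugates d)
    return r_inv_d / denom, 1.0 / denom


def wiener_gain(rho, min_gain_db=-17.0):
    """Postfilter G = rho / (1 + rho), floored at an assumed minimum amplitude gain."""
    gain = rho / (1.0 + rho)
    return max(gain, 10.0 ** (min_gain_db / 20.0))


def trnr_decision_directed(z_prev, phi_rn_prev, phi_s_hat, phi_rn, beta=0.98):
    """Equation (10): smoothed TRNR using the previous output Z(l-1)."""
    return beta * abs(z_prev) ** 2 / phi_rn_prev + (1.0 - beta) * phi_s_hat / phi_rn


# Toy example with M = 4 and illustrative PSD estimates.
M = 4
d = np.exp(-1j * 0.8 * np.arange(M))
dist = 0.03 * np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
gamma = np.sinc(2.0 * 2000.0 * dist / 343.0)       # assumed diffuse coherence
psi = np.eye(M)

w_mvdr, phi_rn = mvdr_and_residual(d, gamma, psi, phi_r_hat=0.5, phi_v_hat=0.1)
rho = 1.0 / phi_rn                                  # equation (9) with phi_s_hat = 1.0
gain = wiener_gain(rho)
w_mwf = gain * w_mvdr

y = 0.7 * d                                         # a noiseless target-only snapshot
z = np.vdot(w_mwf, y)                               # Z(l) = w^H y
rho_dd = trnr_decision_directed(z, phi_rn, 1.0, phi_rn)
```

Because the MVDR stage is distortionless with respect to `d`, a target-only snapshot is scaled only by the postfilter gain, so `z` equals `gain * 0.7` in this toy case.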
3. JOINT TARGET SIGNAL, LATE REVERBERATION,
AND NOISE PSD ESTIMATORS
To the best of our knowledge, state-of-the-art multi-channel PSD estimators do not explicitly model the noise as a spatially homogeneous sound field and only derive target signal and late reverberation PSD estimates $\hat{\Phi}_s(l)$ and $\hat{\Phi}_r(l)$ assuming that an estimate of the noise PSD matrix $\boldsymbol{\Phi}_{\mathbf{v}}(l)$ is available [10–18]. The noise PSD matrix is typically estimated from the microphone signals during speech pauses detected by means of a VAD [19, 20], generally requiring the noise PSD $\Phi_v(l)$ to be time-invariant. Instead of assuming that an estimate of the noise PSD matrix $\boldsymbol{\Phi}_{\mathbf{v}}(l)$ is available, in this paper we assume that only knowledge of the noise spatial coherence matrix $\boldsymbol{\Psi}$ is available and propose a non-blocking-based and a blocking-based estimator of the target signal PSD $\Phi_s(l)$, the late reverberation PSD $\Phi_r(l)$, and the noise PSD $\Phi_v(l)$. The proposed PSD estimators can be viewed as extensions of the PSD estimators in [10] and [16], where only estimates of $\Phi_s(l)$ and $\Phi_r(l)$ are derived assuming that the noise PSD matrix $\boldsymbol{\Phi}_{\mathbf{v}}(l)$ is known.
3.1. Non-blocking-based PSD estimator
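In the following we propose to simultaneously estimate the target signal, late reverberation, and noise PSDs using the signal model in (7) and an estimate of the PSD matrix $\boldsymbol{\Phi}_{\mathbf{y}}(l)$. An estimate of $\boldsymbol{\Phi}_{\mathbf{y}}(l)$ can be directly obtained from the microphone signals using recursive averaging as

$$\hat{\boldsymbol{\Phi}}_{\mathbf{y}}(l) = \alpha\,\mathbf{y}(l)\mathbf{y}^H(l) + (1-\alpha)\,\hat{\boldsymbol{\Phi}}_{\mathbf{y}}(l-1), \tag{11}$$

with $\alpha$ a smoothing factor. Matching (11) to (7) and since the matrices $\mathbf{d}\mathbf{d}^H$, $\boldsymbol{\Gamma}$, and $\boldsymbol{\Psi}$ are known, a system of $M(M+1)/2$ equations with three unknowns $\Phi_s(l)$, $\Phi_r(l)$, and $\Phi_v(l)$ arises.$^1$ For $M \geq 3$, the system of equations is overdetermined and an estimate of the unknown PSDs $\Phi_s(l)$, $\Phi_r(l)$, and $\Phi_v(l)$ can be obtained by minimizing the least-squares cost function$^2$

$$J_n(l) = \|\hat{\boldsymbol{\Phi}}_{\mathbf{y}}(l) - \Phi_s(l)\,\mathbf{d}\mathbf{d}^H - \Phi_r(l)\,\boldsymbol{\Gamma} - \Phi_v(l)\,\boldsymbol{\Psi}\|_F^2, \tag{12}$$

with $\|\cdot\|_F$ the matrix Frobenius norm. Setting the derivative of (12) with respect to $\Phi_s(l)$, $\Phi_r(l)$, and $\Phi_v(l)$ to 0 results in a system of equations which can be written as

$$\underbrace{\begin{bmatrix} (\mathbf{d}^H\mathbf{d})^2 & \mathbf{d}^H\boldsymbol{\Gamma}\mathbf{d} & \mathbf{d}^H\boldsymbol{\Psi}\mathbf{d} \\ \mathbf{d}^H\boldsymbol{\Gamma}\mathbf{d} & \mathrm{tr}\{\boldsymbol{\Gamma}^H\boldsymbol{\Gamma}\} & \mathrm{tr}\{\boldsymbol{\Gamma}^H\boldsymbol{\Psi}\} \\ \mathbf{d}^H\boldsymbol{\Psi}\mathbf{d} & \mathrm{tr}\{\boldsymbol{\Gamma}^H\boldsymbol{\Psi}\} & \mathrm{tr}\{\boldsymbol{\Psi}^H\boldsymbol{\Psi}\} \end{bmatrix}}_{\mathbf{A}_n} \underbrace{\begin{bmatrix} \hat{\Phi}_{s,n}(l) \\ \hat{\Phi}_{r,n}(l) \\ \hat{\Phi}_{v,n}(l) \end{bmatrix}}_{\hat{\boldsymbol{\phi}}_n(l)} = \underbrace{\begin{bmatrix} \mathbf{d}^H\hat{\boldsymbol{\Phi}}_{\mathbf{y}}(l)\mathbf{d} \\ \mathrm{tr}\{\hat{\boldsymbol{\Phi}}_{\mathbf{y}}^H(l)\boldsymbol{\Gamma}\} \\ \mathrm{tr}\{\hat{\boldsymbol{\Phi}}_{\mathbf{y}}^H(l)\boldsymbol{\Psi}\} \end{bmatrix}}_{\mathbf{p}_n(l)}, \tag{13}$$

where $\mathrm{tr}\{\cdot\}$ denotes the trace operator and the quantities $\mathbf{A}_n$, $\hat{\boldsymbol{\phi}}_n(l)$, and $\mathbf{p}_n(l)$ have been introduced in order to simplify the notation. The solution to (13) is given by

$$\hat{\boldsymbol{\phi}}_n(l) = \mathbf{A}_n^{-1}\mathbf{p}_n(l), \tag{14}$$

with the proposed target signal PSD estimate $\hat{\Phi}_{s,n}(l)$ being the first element of $\hat{\boldsymbol{\phi}}_n(l)$, the late reverberation PSD estimate $\hat{\Phi}_{r,n}(l)$ being the second element of $\hat{\boldsymbol{\phi}}_n(l)$, and the noise PSD estimate $\hat{\Phi}_{v,n}(l)$ being the third element of $\hat{\boldsymbol{\phi}}_n(l)$.

$^1$ Note that since the matrices $\hat{\boldsymbol{\Phi}}_{\mathbf{y}}(l)$, $\mathbf{d}\mathbf{d}^H$, $\boldsymbol{\Gamma}$, and $\boldsymbol{\Psi}$ are symmetric, matching (11) to (7) yields $M(M+1)/2$ equations instead of $M^2$ equations.
$^2$ Note that this non-blocking-based least-squares cost function has already been used in [23] in the context of noise reduction only, in order to estimate the PSDs of different spatially homogeneous noise fields.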
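The recursion in (11) and the normal equations (13) and (14) can be sketched in a few lines of Python. The function names and the toy values below are assumptions introduced only for illustration; the entries of the system matrix are computed as Frobenius inner products $\mathrm{tr}\{\mathbf{X}^H\mathbf{Y}\}$ of the three model matrices, which reproduces the entries of $\mathbf{A}_n$ and $\mathbf{p}_n(l)$ because all matrices involved are Hermitian.

```python
import numpy as np


def update_psd_matrix(phi_prev, y, alpha):
    """Equation (11): recursive averaging of the instantaneous outer product."""
    return alpha * np.outer(y, y.conj()) + (1.0 - alpha) * phi_prev


def nonblocking_psd_estimate(phi_y_hat, d, gamma, psi):
    """Equations (13)-(14): returns [phi_s, phi_r, phi_v] for one frequency bin."""
    basis = [np.outer(d, d.conj()), gamma, psi]

    def inner(a, b):
        # Frobenius inner product tr{a^H b}; real for Hermitian arguments.
        return np.real(np.trace(a.conj().T @ b))

    A_n = np.array([[inner(bi, bj) for bj in basis] for bi in basis])
    p_n = np.array([inner(bi, phi_y_hat) for bi in basis])
    return np.linalg.solve(A_n, p_n)


# Toy check: when the PSD matrix follows (7) exactly, the PSDs are recovered.
M = 4
d = np.exp(-1j * 0.8 * np.arange(M))
dist = 0.03 * np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
gamma = np.sinc(2.0 * 2000.0 * dist / 343.0)       # assumed diffuse coherence
psi = np.eye(M)
phi_y = 1.0 * np.outer(d, d.conj()) + 0.5 * gamma + 0.1 * psi

estimates = nonblocking_psd_estimate(phi_y, d, gamma, psi)
```

In a frame-by-frame loop, `update_psd_matrix` would be called once per frame and bin, followed by `nonblocking_psd_estimate` on the updated matrix. The sketch does not clip negative estimates, which can occur when the model is violated.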
3.2. Blocking-based PSD estimator
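In the following we propose an alternative PSD estimator which first estimates the late reverberation and noise PSDs using reference signals at the output of a blocking matrix aiming to block the target signal. Based on the estimated late reverberation and noise PSDs, the target signal PSD is then estimated in a second step.

In order to block the target signal, an $M \times (M-1)$-dimensional blocking matrix $\mathbf{B}$ is constructed such that

$$\mathbf{B}^H\mathbf{d} = \mathbf{0}, \tag{15}$$

and a set of $M-1$ reference signals $\tilde{\mathbf{u}}(l)$ containing only late reverberation and noise is generated as $\tilde{\mathbf{u}}(l) = \mathbf{B}^H\mathbf{y}(l)$. There exist many blocking matrices which satisfy (15). In this paper, the blocking matrix is computed from the first $M-1$ columns of the matrix $\mathbf{T}$ defined as

$$\mathbf{T} = \mathbf{I} - \frac{\mathbf{d}\mathbf{d}^H}{\|\mathbf{d}\|_2^2}. \tag{16}$$

Based on (7) and (15), the PSD matrix of the reference signals at the blocking matrix output can be expressed as

$$\boldsymbol{\Phi}_{\tilde{\mathbf{u}}}(l) = E\{\tilde{\mathbf{u}}(l)\tilde{\mathbf{u}}^H(l)\} = \Phi_r(l)\underbrace{\mathbf{B}^H\boldsymbol{\Gamma}\mathbf{B}}_{\tilde{\boldsymbol{\Gamma}}} + \Phi_v(l)\underbrace{\mathbf{B}^H\boldsymbol{\Psi}\mathbf{B}}_{\tilde{\boldsymbol{\Psi}}}. \tag{17}$$

The matrices $\tilde{\boldsymbol{\Gamma}}$ and $\tilde{\boldsymbol{\Psi}}$ can be computed using the known spatial coherence matrices $\boldsymbol{\Gamma}$ and $\boldsymbol{\Psi}$, and an estimate $\hat{\boldsymbol{\Phi}}_{\tilde{\mathbf{u}}}(l)$ of the PSD matrix $\boldsymbol{\Phi}_{\tilde{\mathbf{u}}}(l)$ can be directly obtained from the reference signals similarly to (11). Matching the estimated PSD matrix $\hat{\boldsymbol{\Phi}}_{\tilde{\mathbf{u}}}(l)$ to (17) gives rise to a system of $M(M-1)/2$ equations with two unknowns $\Phi_r(l)$ and $\Phi_v(l)$.$^3$ For $M \geq 3$, the system of equations is overdetermined and an estimate of $\Phi_r(l)$ and $\Phi_v(l)$ can be obtained by minimizing the least-squares cost function

$$J_b(l) = \|\hat{\boldsymbol{\Phi}}_{\tilde{\mathbf{u}}}(l) - \Phi_r(l)\,\tilde{\boldsymbol{\Gamma}} - \Phi_v(l)\,\tilde{\boldsymbol{\Psi}}\|_F^2. \tag{18}$$

Setting the derivative of (18) with respect to $\Phi_r(l)$ and $\Phi_v(l)$ to 0 yields a system of equations which can be written as

$$\underbrace{\begin{bmatrix} \mathrm{tr}\{\tilde{\boldsymbol{\Gamma}}^H\tilde{\boldsymbol{\Gamma}}\} & \mathrm{tr}\{\tilde{\boldsymbol{\Gamma}}^H\tilde{\boldsymbol{\Psi}}\} \\ \mathrm{tr}\{\tilde{\boldsymbol{\Gamma}}^H\tilde{\boldsymbol{\Psi}}\} & \mathrm{tr}\{\tilde{\boldsymbol{\Psi}}^H\tilde{\boldsymbol{\Psi}}\} \end{bmatrix}}_{\mathbf{A}_b} \underbrace{\begin{bmatrix} \hat{\Phi}_{r,b}(l) \\ \hat{\Phi}_{v,b}(l) \end{bmatrix}}_{\hat{\boldsymbol{\phi}}_b(l)} = \underbrace{\begin{bmatrix} \mathrm{tr}\{\hat{\boldsymbol{\Phi}}_{\tilde{\mathbf{u}}}^H(l)\tilde{\boldsymbol{\Gamma}}\} \\ \mathrm{tr}\{\hat{\boldsymbol{\Phi}}_{\tilde{\mathbf{u}}}^H(l)\tilde{\boldsymbol{\Psi}}\} \end{bmatrix}}_{\mathbf{p}_b(l)}, \tag{19}$$

where the quantities $\mathbf{A}_b$, $\hat{\boldsymbol{\phi}}_b(l)$, and $\mathbf{p}_b(l)$ have been introduced in order to simplify the notation. The solution to (19) is given by

$$\hat{\boldsymbol{\phi}}_b(l) = \mathbf{A}_b^{-1}\mathbf{p}_b(l), \tag{20}$$

with the proposed blocking-based late reverberation PSD estimate $\hat{\Phi}_{r,b}(l)$ being the first element of $\hat{\boldsymbol{\phi}}_b(l)$ and the noise PSD estimate $\hat{\Phi}_{v,b}(l)$ being the second element of $\hat{\boldsymbol{\phi}}_b(l)$. Using the late reverberation and noise PSD estimates $\hat{\Phi}_{r,b}(l)$ and $\hat{\Phi}_{v,b}(l)$, the blocking-based target signal PSD can be estimated as

$$\hat{\Phi}_{s,b}(l) = \frac{1}{\mathbf{d}^H\mathbf{d}}\,\mathrm{tr}\{\hat{\boldsymbol{\Phi}}_{\mathbf{y}}(l) - \hat{\Phi}_{r,b}(l)\,\boldsymbol{\Gamma} - \hat{\Phi}_{v,b}(l)\,\boldsymbol{\Psi}\}. \tag{21}$$

$^3$ Note that since the matrices $\hat{\boldsymbol{\Phi}}_{\tilde{\mathbf{u}}}(l)$, $\tilde{\boldsymbol{\Gamma}}$, and $\tilde{\boldsymbol{\Psi}}$ are symmetric, matching $\hat{\boldsymbol{\Phi}}_{\tilde{\mathbf{u}}}(l)$ to (17) yields $M(M-1)/2$ equations instead of $(M-1)^2$ equations.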
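The two-step procedure of (15) to (21) can be sketched as follows for a single frequency bin. The function names and toy values are assumptions for illustration. One simplification is made and should be read as such: instead of averaging the reference signals directly, the sketch forms $\mathbf{B}^H\hat{\boldsymbol{\Phi}}_{\mathbf{y}}(l)\mathbf{B}$, which gives the same matrix when the same smoothing factor is used, since the recursion (11) is linear.

```python
import numpy as np


def blocking_matrix(d):
    """First M-1 columns of T = I - d d^H / ||d||^2, equation (16)."""
    M = d.shape[0]
    T = np.eye(M) - np.outer(d, d.conj()) / np.real(np.vdot(d, d))
    return T[:, : M - 1]


def blocking_psd_estimate(phi_y_hat, d, gamma, psi):
    """Equations (17)-(21): returns [phi_s, phi_r, phi_v] for one frequency bin."""
    B = blocking_matrix(d)
    BH = B.conj().T
    phi_u = BH @ phi_y_hat @ B                     # PSD matrix of the reference signals
    gamma_t = BH @ gamma @ B
    psi_t = BH @ psi @ B

    def inner(a, b):
        # Frobenius inner product tr{a^H b}; real for Hermitian arguments.
        return np.real(np.trace(a.conj().T @ b))

    A_b = np.array([[inner(gamma_t, gamma_t), inner(gamma_t, psi_t)],
                    [inner(gamma_t, psi_t), inner(psi_t, psi_t)]])
    p_b = np.array([inner(phi_u, gamma_t), inner(phi_u, psi_t)])
    phi_r, phi_v = np.linalg.solve(A_b, p_b)       # equation (20)

    residual = phi_y_hat - phi_r * gamma - phi_v * psi
    phi_s = np.real(np.trace(residual)) / np.real(np.vdot(d, d))   # equation (21)
    return np.array([phi_s, phi_r, phi_v])


# Toy check on a PSD matrix that follows (7) exactly.
M = 4
d = np.exp(-1j * 0.8 * np.arange(M))
dist = 0.03 * np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
gamma = np.sinc(2.0 * 2000.0 * dist / 343.0)       # assumed diffuse coherence
psi = np.eye(M)
phi_y = 1.0 * np.outer(d, d.conj()) + 0.5 * gamma + 0.1 * psi

B = blocking_matrix(d)
estimates = blocking_psd_estimate(phi_y, d, gamma, psi)
```

Note that the least-squares system (19) is singular if $\tilde{\boldsymbol{\Gamma}}$ and $\tilde{\boldsymbol{\Psi}}$ are proportional, e.g., at frequencies where the diffuse coherence matrix approaches the identity matrix; the sketch does not guard against this case.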
It should be noted that if the signal model in (7) perfectly holds, the non-blocking-based estimator proposed in Section 3.1 and the blocking-based estimator proposed in this section would result in the same PSD estimates. In practice, however, the signal model in (7) does not perfectly hold since the early and late reverberation components are not perfectly uncorrelated, the late reverberation is not perfectly diffuse, and the noise typically cannot be perfectly modeled by a spatially homogeneous sound field. Furthermore, estimating the matrices $\boldsymbol{\Phi}_{\mathbf{y}}(l)$ and $\boldsymbol{\Phi}_{\tilde{\mathbf{u}}}(l)$ by recursive averaging of a single realization of the signals only approximates the expected value. As a result, the proposed PSD estimators yield different PSD estimates in practice. As will be shown in Section 4, using the blocking-based PSD estimates in an MWF yields a better performance than using the non-blocking-based PSD estimates.
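The statement that both estimators coincide under a perfect model but differ otherwise can be checked numerically. The sketch below is a toy experiment under assumptions that are not part of the paper: hypothetical function names, an assumed sinc coherence for the diffuse field, and a random Hermitian perturbation standing in for model mismatch and estimation errors.

```python
import numpy as np


def ls_weights(target, basis):
    """Least-squares weights of Hermitian basis matrices under the Frobenius norm."""
    def inner(a, b):
        return np.real(np.trace(a.conj().T @ b))

    A = np.array([[inner(bi, bj) for bj in basis] for bi in basis])
    p = np.array([inner(bi, target) for bi in basis])
    return np.linalg.solve(A, p)


def nonblocking(phi_y, d, gamma, psi):
    """Joint fit of [phi_s, phi_r, phi_v], as in (13)-(14)."""
    return ls_weights(phi_y, [np.outer(d, d.conj()), gamma, psi])


def blocking(phi_y, d, gamma, psi):
    """Fit of [phi_r, phi_v] after blocking, then phi_s from (21)."""
    M = len(d)
    norm2 = np.real(np.vdot(d, d))
    B = (np.eye(M) - np.outer(d, d.conj()) / norm2)[:, : M - 1]

    def blocked(X):
        return B.conj().T @ X @ B

    phi_r, phi_v = ls_weights(blocked(phi_y), [blocked(gamma), blocked(psi)])
    phi_s = np.real(np.trace(phi_y - phi_r * gamma - phi_v * psi)) / norm2
    return np.array([phi_s, phi_r, phi_v])


rng = np.random.default_rng(0)
M = 4
d = np.exp(-1j * 0.8 * np.arange(M))
dist = 0.03 * np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
gamma = np.sinc(2.0 * 2000.0 * dist / 343.0)
psi = np.eye(M)
exact = 1.0 * np.outer(d, d.conj()) + 0.5 * gamma + 0.1 * psi

# Hermitian perturbation emulating a PSD matrix that deviates from (7).
E = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
perturbed = exact + 0.05 * (E + E.conj().T)

gap_exact = np.max(np.abs(nonblocking(exact, d, gamma, psi)
                          - blocking(exact, d, gamma, psi)))
gap_perturbed = np.max(np.abs(nonblocking(perturbed, d, gamma, psi)
                              - blocking(perturbed, d, gamma, psi)))
print(gap_exact, gap_perturbed)
```

With the exact model the gap is at the level of numerical round-off, whereas the perturbed matrix leads to visibly different estimates. This toy experiment says nothing about which estimator is more accurate; that comparison is the subject of Section 4.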
4. EXPERIMENTAL RESULTS
In this section, we investigate the dereverberation and noise reduction performance of the MWF using the proposed PSD estimators and two alternative versions to compute the TRNR. More precisely, we investigate the performance of the MWF implemented using
• the proposed non-blocking-based estimator with the TRNR estimated as in (9), which will be referred to as NBB,
• the proposed non-blocking-based estimator with the TRNR estimated as in (10), which will be referred to as NBB-DD,
• the proposed blocking-based estimator with the TRNR estimated as in (9), which will be referred to as BB, and
• the proposed blocking-based estimator with the TRNR estimated as in (10), which will be referred to as BB-DD.
In addition, the performance of the BB and BB-DD methods will be compared to the performance of the MWF implemented using the target signal and late reverberation PSD estimates from [10], where it is assumed that an estimate of the noise PSD matrix is available.
4.1. Setup and instrumental measures
We consider three multi-channel acoustic systems with a single speech source and $M = 4$ microphones. The first acoustic system consists of a linear microphone array with an inter-sensor distance of 3 cm [24], the second acoustic system consists of a circular microphone array with a radius of 10 cm [25], and the third acoustic system consists of a linear microphone array with an inter-sensor distance of 6 cm [26]. Table 1 presents the reverberation time $T_{60}$, the DOA $\theta$ of the speech source, and the direct-to-reverberation ratio (DRR) for each acoustic system. The speech components are generated by convolving a 38 s long clean speech signal with measured room impulse responses at a sampling frequency $f_s = 16$ kHz. The noise components consist of stationary uncorrelated noise with a broadband reverberant signal-to-noise ratio (RSNR) between 10 dB and 40 dB. The reverberant speech-plus-noise signal is preceded by a 1 s long noise-only segment such that when using the PSD estimator from [10], the noise PSD matrix can be estimated from the noise-only segment. The signals are processed using a weighted overlap-add STFT framework with a frame size of 1024 samples and an overlap of 75%. The first microphone is arbitrarily selected as the reference microphone. The target signal is defined as the direct component only, such that the RTF vector can be computed based on the DOA of the speech source.

Table 1: Characteristics of the considered acoustic systems.
Acoustic system   T60 [s]   θ [°]   DRR [dB]
1                 0.61      90      0.76
2                 0.73      45      1.43
3                 1.25      15      0.04
The PSD matrices $\hat{\boldsymbol{\Phi}}_{\mathbf{y}}(l)$ and $\hat{\boldsymbol{\Phi}}_{\tilde{\mathbf{u}}}(l)$ are estimated as in (11) with a smoothing factor $\alpha$ corresponding to a time constant of 40 ms. The diffuse spatial coherence matrix $\boldsymbol{\Gamma}$ is computed based on the microphone array geometry and the noise spatial coherence matrix is set to $\boldsymbol{\Psi} = \mathbf{I}$. The smoothing parameter in (10) is set to $\beta = 0.98$ and the minimum gain of the single-channel Wiener postfilter is set to -17 dB. For the estimator from [10], the noise PSD matrix $\boldsymbol{\Phi}_{\mathbf{v}}$ is estimated as

$$\hat{\boldsymbol{\Phi}}_{\mathbf{v}} = \frac{1}{L_v}\sum_{l=1}^{L_v}\mathbf{v}(l)\mathbf{v}^H(l), \tag{22}$$

with $L_v$ being the total number of noise-only frames.
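The sample average in (22) and the relation between a time constant and the smoothing factor in (11) can be sketched as follows. The mapping from time constant to $\alpha$ is not given in the text; the exponential-decay convention used here, as well as the function names and the random test data, are assumptions for illustration only.

```python
import numpy as np


def smoothing_factor(time_constant_s, frame_size, overlap, fs):
    """Assumed mapping: the weight (1 - alpha) of the past equals exp(-hop / tau)."""
    hop_s = frame_size * (1.0 - overlap) / fs
    return 1.0 - np.exp(-hop_s / time_constant_s)


def noise_psd_matrix(noise_frames):
    """Equation (22) for one frequency bin; noise_frames has shape (L_v, M)."""
    num_frames = noise_frames.shape[0]
    # Entry (i, j) is the average of v_i(l) * conj(v_j(l)) over the frames.
    return noise_frames.T @ noise_frames.conj() / num_frames


# Frame size 1024, 75% overlap, 16 kHz: the hop is 256 samples, i.e., 16 ms.
alpha = smoothing_factor(0.040, frame_size=1024, overlap=0.75, fs=16000)

# Synthetic spatially uncorrelated noise frames standing in for STFT data.
rng = np.random.default_rng(1)
frames = (rng.standard_normal((62, 4)) + 1j * rng.standard_normal((62, 4))) / np.sqrt(2.0)
phi_v_hat = noise_psd_matrix(frames)
```

Under this assumed convention, a 40 ms time constant at a 16 ms hop gives $\alpha \approx 0.33$; other conventions would give somewhat different values.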
The performance is evaluated in terms of the improvement in frequency-weighted segmental SNR (ΔfwSSNR) [27] and log-likelihood ratio (ΔLLR) [27] between the output signal and the reference microphone signal. The fwSSNR and LLR measures are intrusive measures comparing the signal being evaluated to a reference signal. The reference signal used in this paper is the anechoic speech signal. It should be noted that a positive ΔfwSSNR and a negative ΔLLR indicate a performance improvement.
4.2. Performance of the proposed estimators
In this section the performance of NBB, NBB-DD, BB, and BB-DD
is investigated for all considered RSNRs and acoustic systems. The
presented performance measures are averaged over all considered
acoustic systems.
Fig. 1 depicts the performance of all considered techniques in terms of the fwSSNR improvement (ΔfwSSNR) and the LLR improvement (ΔLLR). It can be observed that, as expected, for all considered techniques the performance improvement decreases as the RSNR increases. Furthermore, it can be observed that in terms of both performance measures and for all considered RSNRs, a larger performance improvement is obtained when using NBB-DD instead of NBB, suggesting that smoothing the TRNR estimate using the decision directed approach is particularly important when using non-blocking-based PSD estimates. In addition, it can be observed that BB and BB-DD outperform NBB and NBB-DD for all considered RSNRs. While BB-DD yields the largest fwSSNR improvement, BB yields the largest LLR improvement. Informal listening tests suggest that BB-DD yields a better perceptual quality than BB, with BB introducing more musical noise and signal artifacts than BB-DD.

Fig. 1: MWF performance using the proposed PSD estimators (NBB, NBB-DD, BB, BB-DD) as a function of the RSNR [dB]: (a) ΔfwSSNR [dB], (b) ΔLLR [dB].

Table 2: Average performance of the MWF using the proposed blocking-based estimator and the estimator from [10], which assumes that the noise PSD matrix is known (RSNR = 10 dB).
               BB      BR      BB-DD   BR-DD
ΔfwSSNR [dB]   6.47    5.31    7.07    6.53
ΔLLR [dB]      -0.41   -0.33   -0.36   -0.31
In summary, the presented results show that for the considered
acoustic scenarios, the proposed blocking-based PSD estimates yield
a better performance than the non-blocking-based PSD estimates.
4.3. Performance of the proposed blocking-based estimator and
the state-of-the-art estimator from [10]
In this section, the performance of BB and BB-DD is compared to the performance of the estimator from [10], which uses a blocking matrix and only estimates the target signal and late reverberation PSDs, assuming that an estimate of the noise PSD matrix is available. The noise PSD matrix is estimated as in (22) and the MWF is implemented using $\hat{\boldsymbol{\Phi}}_{\mathbf{v}}$ (instead of $\hat{\Phi}_v(l)\boldsymbol{\Psi}$ in (8)) with the TRNR estimated as in (9) or (10). Using [10] with the TRNR estimated as in (9) will be referred to as BR, whereas using [10] with the TRNR estimated as in (10) will be referred to as BR-DD. Due to space constraints, only the performance for RSNR = 10 dB is presented and, as before, the performance is averaged over all considered acoustic systems.
Table 2 presents the performance of the considered techniques in terms of the fwSSNR improvement (ΔfwSSNR) and the LLR improvement (ΔLLR). It can be observed that BB and BB-DD result in a similar or better performance than BR and BR-DD, respectively. It should be noted that the noise PSD matrix estimate used for BR and BR-DD is rather accurate, since the noise is stationary and all noise-only frames are used to compute the PSD matrix. The presented results show that the proposed blocking-based estimator manages to remove the assumption that the noise PSD matrix is known and additionally estimates the noise PSD without hindering the dereverberation and noise reduction performance.
5. CONCLUSION
In this paper joint estimators for the late reverberation and noise
PSDs have been derived, removing the assumption made by state-
of-the-art late reverberation PSD estimators that the noise PSD ma-
trix is known. Modeling the noise as a spatially homogeneous sound
field with an unknown time-varying PSD and a known time-invariant
spatial coherence matrix, we have derived a non-blocking-based and
a blocking-based joint estimator of the late reverberation and noise
PSDs. Simulation results show that the proposed blocking-based
PSD estimator yields the best performance when used in an MWF,
also yielding a similar or better performance than a state-of-the-art
blocking-based late reverberation PSD estimator which assumes that
the noise PSD matrix is known.

6. REFERENCES
[1] J. S. Bradley, H. Sato, and M. Picard, “On the importance of
early reflections for speech in rooms, Journal of the Acous-
tical Society of America, vol. 113, no. 6, pp. 3233–3244, June
2003.
[2] R. Beutelmann and T. Brand, “Prediction of speech intelligi-
bility in spatial noise and reverberation for normal-hearing and
hearing-impaired listeners, Journal of the Acoustical Society
of America, vol. 120, no. 1, pp. 331–342, July 2006.
[3] A. Warzybok, I. Kodrasi, J. O. Jungmann, E. A. P. Ha-
bets, T. Gerkmann, A. Mertins, S. Doclo, B. Kollmeier, and
S. Goetze, “Subjective speech quality and speech intelligibil-
ity evaluation of single-channel dereverberation algorithms,
in Proc. International Workshop on Acoustic Echo and Noise
Control, Antibes, France, Sept. 2014, pp. 333–337.
[4] S. Doclo and M. Moonen, “Combined frequency-domain
dereverberation and noise reduction technique for multi-
microphone speech enhancement, in Proc. International
Workshop on Acoustic Echo and Noise Control, Darmstadt,
Germany, Sept. 2001, pp. 31–34.
[5] E. A. P. Habets and J. Benesty, A two-stage beamforming ap-
proach for noise reduction and dereverberation, IEEE Trans-
actions on Audio, Speech, and Language Processing, vol. 21,
no. 5, pp. 945–958, May 2013.
[6] B. Cauchi, I. Kodrasi, R. Rehr, S. Gerlach, A. Juki
´
c, T. Gerk-
mann, S. Doclo, and S. Goetze, “Combination of MVDR
beamforming and single-channel spectral processing for en-
hancing noisy and reverberant speech, EURASIP Journal on
Advances in Signal Processing, vol. 2015, no. 1, 2015.
[7] K. Lebart and J. M. Boucher, A new method based on spectral
subtraction for speech dereverberation, Acta Acoustica, vol.
87, no. 3, pp. 359–366, May-Jun. 2001.
[8] E. A. P. Habets, S. Gannot, and I. Cohen, “Late reverber-
ant spectral variance estimation based on a statistical model,
IEEE Signal Processing Letters, vol. 16, no. 9, pp. 770–774,
Sept. 2009.
[9] S. Braun, B. Schwartz, S. Gannot, and E. A. P. Habets, “Late
reverberation PSD estimation for single-channel dereverbera-
tion using relative convolutive transfer functions, in Proc.
International Workshop on Acoustic Echo and Noise Control,
Xi’an, China, Sept. 2016.
[10] S. Braun and E. A. P. Habets, “Dereverberation in noisy en-
vironments using reference signals and a maximum likelihood
estimator, in Proc. European Signal Processing Conference,
Marrakech, Morocco, Sept. 2013.
[11] O. Thiergart and E. A. P. Habets, “Extracting reverberant sound
using a linearly constrained minimum variance spatial filter,
IEEE Signal Processing Letters, vol. 21, no. 5, pp. 630–634,
May 2014.
[12] S. Braun and E. A. P. Habets, A multichannel diffuse
power estimator for dereverberation in the presence of multi-
ple sources, EURASIP Journal on Applied Signal Processing,
vol. 2015, no. 1, Dec. 2015.
[13] O. Schwartz, S. Braun, S. Gannot, and E. A. P. Habets, “Maxi-
mum likelihood estimation of the late reverberant power spec-
tral density in noisy environments, in Proc. IEEE Workshop
on Applications of Signal Processing to Audio and Acoustics,
New York, USA, Oct. 2015.
[14] O. Schwartz, S. Gannot, and E. A. P. Habets, “Joint maximum
likelihood estimation of late reverberant and speech power
spectral density in noisy environments, in Proc. IEEE Inter-
national Conference on Acoustics, Speech, and Signal Process-
ing, Shanghai, China, Mar. 2016, pp. 151–155.
[15] A. Kuklasi
´
nski, S. Doclo, S. H. Jensen, and J. Jensen, “Max-
imum likelihood PSD estimation for speech enhancement in
reverberation and noise, IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 24, no. 9, pp. 1595–
1608, Sept. 2016.
[16] O. Schwartz, S. Gannot, and E. A. P. Habets, “Joint estima-
tion of late reverberant and speech power spectral densities in
noisy environments using Frobenius norm, in Proc. European
Signal Processing Conference, Budapest, Hungary, Sept. 2016,
pp. 1123–1127.
[17] I. Kodrasi and S. Doclo, “Late reverberant power spectral den-
sity estimation based on an eigenvalue decomposition, in
Proc. IEEE International Conference on Acoustics, Speech,
and Signal Processing, New Orleans, USA, Mar. 2017, pp.
611–615.
[18] I. Kodrasi and S. Doclo, “Multi-channel late reverberation
power spectral density estimation based on nuclear norm min-
imization, in Proc. IEEE Workshop on Applications of Sig-
nal Processing to Audio and Acoustics, New York, USA, Oct.
2017, pp. 101–105.
[19] J. Rami
´
rez, J. C. Segura, C. Ben
´
ıtez,
´
A. de la Torre, and A. Ru-
bio, “Efficient voice activity detection algorithms using long-
term speech information, Speech Communication, vol. 42, no.
3, pp. 271–287, Apr. 2004.
[20] K. Ishizuka, T. Nakatani, M. Fujimoto, and N. Miyazaki,
“Noise robust voice activity detection based on periodic to ape-
riodic component ratio, Speech Communication, vol. 52, no.
1, pp. 41–60, Jan. 2010.
[21] B. F. Cron and C. H. Sherman, “Spatial-correlation functions
for various noise models, The Journal of the Acoustical Soci-
ety of America, vol. 34, no. 11, pp. 1732–1736, Nov. 1962.
[22] Y. Ephraim and D. Malah, “Speech enhancement using a min-
imum mean-square error short-time spectral amplitude estima-
tor, IEEE Transactions on Acoustics, Speech and Signal Pro-
cessing, vol. 32, no. 6, pp. 1109–1121, Dec. 1984.
[23] Y. A. Huang, A. Luebs, J. Skoglund, and W. B. Kleijn, “Glob-
ally optimized least-squares post-filtering for microphone ar-
ray speech enhancement, in Proc. IEEE International Confer-
ence on Acoustics, Speech, and Signal Processing, Shanghai,
China, Mar. 2016, pp. 380–384.
[24] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel
audio database in various acoustic environments, in Proc.
International Workshop on Acoustic Echo and Noise Control,
Antibes, France, Sept. 2014, pp. 313–317.
[25] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets,
R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas,
T. Nakatani, B. Raj, A. Sehr, and T. Yoshioka, “A summary of
the REVERB challenge: state-of-the-art and remaining
challenges in reverberant speech processing research,” EURASIP
Journal on Advances in Signal Processing, vol. 2016, no. 1,
Jan. 2016.
[26] J. Eaton, N. D. Gaubitch, A. H. Moore, and P. A. Naylor, “The
ACE challenge - Corpus description and performance
evaluation,” in Proc. IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics, New York, USA, Oct. 2015.
[27] S. Quackenbush, T. Barnwell, and M. Clements, Objective
measures of speech quality, Prentice-Hall, New Jersey, USA,
1988.