HAL Id: hal-01414179, version 2 (https://hal.inria.fr/hal-01414179v2), submitted 4 Mar 2017.
To cite this version: Sharon Gannot, Emmanuel Vincent, Shmulik Markovich-Golan, Alexey Ozerov. A consolidated perspective on multi-microphone speech enhancement and source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(4), pp. 692–730, 2017. doi: 10.1109/TASLP.2016.2647702.
A Consolidated Perspective on Multi-Microphone
Speech Enhancement and Source Separation
Sharon Gannot, Emmanuel Vincent, Shmulik Markovich-Golan, and Alexey Ozerov
Abstract—Speech enhancement and separation are core prob-
lems in audio signal processing, with commercial applications
in devices as diverse as mobile phones, conference call systems,
hands-free systems, or hearing aids. In addition, they are cru-
cial pre-processing steps for noise-robust automatic speech and
speaker recognition. Many devices now have two to eight mi-
crophones. The enhancement and separation capabilities offered
by these multichannel interfaces are usually greater than those
of single-channel interfaces. Research in speech enhancement
and separation has followed two convergent paths, starting
with microphone array processing and blind source separation,
respectively. These communities are now strongly interrelated
and routinely borrow ideas from each other. Yet, a comprehensive
overview of the common foundations and the differences between
these approaches is lacking at present. In this article, we propose
to fill this gap by analyzing a large number of established
and recent techniques according to four transverse axes: a)
the acoustic impulse response model, b) the spatial filter design
criterion, c) the parameter estimation algorithm, and d) optional
postfiltering. We conclude this overview paper by providing a list
of software and data resources and by discussing perspectives and
future trends in the field.
Index Terms—Multichannel, array processing, beamforming,
Wiener filter, independent component analysis, sparse component
analysis, expectation-maximization, postfiltering.
I. INTRODUCTION
Speech enhancement and separation are core problems in
audio signal processing. Real-world speech signals often
involve one or more of the following distortions: reverberation,
interfering speakers, and/or noise. In this context, source
separation refers to the problem of extracting one or more
target speakers and cancelling interfering speakers and/or
noise. Speech enhancement is more general, in that it refers
to the problem of extracting one or more target speakers and
cancelling one or more of these three types of distortion. If
one focuses on removing interfering speakers and noise, as
opposed to reverberation, the terms “signal enhancement” and “source separation” become essentially interchangeable.
These problems arise in various real scenarios. For instance,
spoken communication over mobile phones or hands-free
systems requires the enhancement or separation of the near-
end speaker’s voice with respect to interfering speakers and
environmental noises before it is transmitted to the far-end
listener. Conference call systems or hearing aids face the same
problem, except that several speakers may be considered as
S. Gannot and S. Markovich-Golan are with Bar-Ilan University, Ramat-Gan 5290002, Israel (e-mail: gannot@eng.biu.ac.il, shmuel.markovich@biu.ac.il). E. Vincent is with Inria, 54600 Villers-lès-Nancy, France (e-mail: emmanuel.vincent@inria.fr). A. Ozerov is with Technicolor R&D, 35576 Cesson-Sévigné, France (e-mail: alexey.ozerov@technicolor.com).
targets. Speech enhancement and separation are also crucial
pre-processing steps for robust automatic speech recognition
and understanding, as available in today’s personal assistants,
GPS, televisions, video game consoles, and medical dictation
devices. More generally, they are believed to be necessary to
provide humanoid robots, assistive devices, and surveillance
systems with machine audition capabilities. While the above
applications require real-time processing, off-line separation of
singing voice, drums, and other musical instruments has been
successfully used for music information retrieval, upmixing of
mono or stereo movie soundtracks to 3D sound formats, and
remixing of music recordings. Other applications, e.g. meeting transcription, can also be processed off-line.
With few exceptions such as speech codecs and old sound
archives, the input signals are multichannel. The number of
microphones per device has steadily increased in the last
few years. Most smartphones, tablets and in-car hands-free
systems are now equipped with two or three microphones.
Hearing aids typically feature two microphones per ear and
a wireless link [1] to enable communication between the left
and right hearing aids, and conference call systems with eight
microphones are commercially available. Research prototypes
with forty to hundreds of microphones have been demonstrated
in lecture halls, office and domestic environments [2]–[6].
The enhancement capabilities offered by these multichannel
interfaces are usually greater than those of single-channel in-
terfaces. They make it possible to design multichannel spatial
filters that selectively enhance or suppress sounds in certain
directions (or volumes) by exploiting the spatial diversity, e.g.
phase and level differences, or more generally, the different
acoustic properties between channels. Single-channel spectral
filters, in contrast, require much more detailed knowledge
about the target and the noise and they usually result in smaller
quality improvement. As a matter of fact, it can be shown that
the maximum quality improvement theoretically achievable
with only two microphones is already much greater than with
a single microphone and that it keeps increasing with more
microphones [7].
Hundreds of multichannel audio signal enhancement tech-
niques have been proposed in the literature over the last forty
years along two historical research paths. Microphone array
processing emerged from the theory of sensor array processing
for telecommunications and it focused mostly on the local-
ization and enhancement of speech in noisy or reverberant
environments [8]–[12], while blind source separation (BSS)
was later popularized by the machine learning community
and it addressed “cocktail party” scenarios involving several
sound sources mixed together [13]–[18]. These two research

tracks have converged in the last decade and they are hardly
distinguishable today. As will be shown in this overview paper,
source separation techniques are not necessarily blind anymore
and most of them exploit the same theoretical tools, impulse
response models and spatial filtering principles as speech
enhancement techniques.
Despite this convergence, most books and reviews have
focused on either of these tracks. This article intends to fill this
gap by providing a comprehensive overview of their common
foundations and their differences. The vastness of the topic
requires us to limit the scope of this overview to the following:
– we focus on multichannel recordings made by multiple microphones, as opposed to multichannel signals created by mixing software which do not match the acoustics of real environments;
– we mostly study the enhancement and separation of speech with respect to interfering speech sources and environmental noise in reverberant environments, as opposed to cancelling echoes and reverberation of the target speech;
– we concentrate on truly multichannel techniques based on acoustic impulse response models and multichannel filtering: as such, we only briefly introduce speech and noise models, computational auditory scene analysis (CASA) models, and time-frequency masking techniques used to assist multichannel processing, but do not describe their use for single-channel or channel-wise filtering in depth;
– we do not describe possible use of the enhanced signals for subsequent tasks;
– time difference of arrival (TDOA) estimation and speaker localization of (multiple) sound sources are beyond the scope of this paper.
Readers interested in multichannel signals created by pro-
fessional mixing software and in the use of source separation
as a prior step to audio upmixing and remixing may refer
to, e.g., [19]–[21]. Echo cancellation, dereverberation, and
CASA are major topics described in the books [22]–[25].
For more information about advanced spectral models and
their use for single-channel and channel-wise spectral filtering,
see, e.g., [18], [26], [27]. For the use of speech enhancement
and musical instrument separation as pre-processing steps
for speech recognition and music information retrieval, see,
e.g., [28]–[31]. For a survey of TDOA and location estimation
techniques, interested readers may refer to [32]–[34].
In spite of its limited scope, this overview still covers a
wide field of research. In order to classify existing techniques
irrespectively of their origin in microphone array processing or
BSS, we adopt four transverse axes: a) the acoustic impulse
response model, b) the spatial filter design criterion, c) the
parameter estimation algorithm, and d) optional postfiltering.
These four modeling and processing steps are common to
all techniques, as illustrated in Fig. 1. The structure of the
article is as follows. We recall useful elements of acoustics
and introduce general notations in Section II. After describing
various acoustic impulse response models in Section III,
we define the fundamental concepts of spatial filtering in
Section IV and review existing design criteria, estimation algo-
rithms, and postfiltering techniques in Sections V, VI, and VII,
respectively. We provide a list of resources in Section VIII and
conclude in Section IX by summarizing the similarities and the
differences between approaches originating from microphone
array processing and BSS and discussing perspectives in the
field.
II. ELEMENTS OF ACOUSTICS AND NOTATIONS
From now on, we assume that two or more sound sources
are simultaneously recorded by two or more microphones.
The microphones are assumed to be omnidirectional, unless
explicitly stated otherwise. The set of microphones is called
a microphone array. Each recorded signal is called a channel
and the set of recorded signals is the array input signal or the
mixture signal.
A. Physics
Sound is a variation of air pressure on the order of $10^{-2}$ Pa for a speech source at a distance of 1 m, on top of the average atmospheric pressure of $10^{5}$ Pa. For such pressure values, the wave equation that governs the propagation of sound in air is linear [35]. This has two implications:
1) the pressure field at any time is the sum of the pressure
fields resulting from each source at that time;
2) the pressure field emitted at a given source propagates
over space and time according to a linear operation.
Unless clipping occurs, microphones operate linearly to record the pressure value at a given point in space. If one considers the pressure field emitted by each source as the target¹, the overall phenomenon is therefore linear.
In the free field, the solution to the wave equation is given by the spherical wave model. The waveform $x_i(\tilde{t})$ recorded at point $i$ when emitting a waveform $s_j(\tilde{t})$ at point $j$ is equal to
$$x_i(\tilde{t}) = \frac{1}{4\pi q_{ij}}\, s_j\!\left(\tilde{t} - \frac{q_{ij}}{c}\right) \qquad (1)$$
with $\tilde{t}$ denoting continuous time, $q_{ij}$ the distance between points $i$ and $j$, and $c$ the speed of sound, that is 343 m/s at 20°C. This speed is very small compared to the speed of light, so that propagation delays are not negligible. The recorded waveform differs from the emitted waveform by a delay of $q_{ij}/c$ and an attenuation factor of $1/(4\pi q_{ij})$.
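To make (1) concrete in discrete time, the short sketch below applies the free-field model to a sampled signal: the propagation reduces to a gain of $1/(4\pi q_{ij})$ and a delay of $q_{ij}/c$, here rounded to an integer number of samples for simplicity. The function name, sampling rate, and integer-sample rounding are illustrative assumptions, not part of the paper.

```python
import numpy as np

def free_field_propagation(s, distance_m, fs, c=343.0):
    """Delay and attenuate a source signal according to the spherical wave
    model x_i(t) = s_j(t - q_ij / c) / (4 * pi * q_ij).  The continuous-time
    delay is rounded to an integer number of samples."""
    delay_samples = int(round(distance_m / c * fs))   # q_ij / c, in samples
    gain = 1.0 / (4.0 * np.pi * distance_m)           # 1 / (4 * pi * q_ij)
    x = np.zeros(len(s) + delay_samples)
    x[delay_samples:] = gain * s                      # delayed, attenuated copy
    return x

# Example: a 1 kHz tone recorded at 1 m and at 2 m from the source; doubling
# the distance roughly doubles the delay and halves the amplitude (-6 dB).
fs = 16000
t = np.arange(fs) / fs
s = np.sin(2 * np.pi * 1000 * t)
x_1m = free_field_propagation(s, 1.0, fs)
x_2m = free_field_propagation(s, 2.0, fs)
```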
In the presence of obstacles, the sound wave is affected in
different ways depending on its frequency ν. The wavelength
λ = c/ν of audio varies from 17 mm at ν = 20 kHz to 17 m
at ν = 20 Hz.
When the sound wave hits an object of dimension smaller
than λ, it is not affected. When it hits an obstacle of compara-
ble dimension to λ, it is subject to diffraction. The wavefront is bent in a way that depends on the shape of the obstacle, its material, and the angle of incidence. Roughly speaking, it
will take more time for the wave to pass the obstacle and it
will be more attenuated than in air. This phenomenon occurs
¹Loudspeakers and musical instruments such as the trumpet do not operate linearly. These nonlinearities occur within solid parts of the loudspeaker or the instrument, however, before vibration is transmitted to air.

[Figure 1: block diagram. The source signals pass through Room acoustics (Sec. II and III) to produce the multichannel mixture signal; Spatial filtering (Sec. IV and V) produces the filtered signal and Postfiltering (Sec. VII) the postfiltered signal, with Parameter estimation (Sec. VI) feeding both processing stages.]
Figure 1. General schema showing acoustical propagation (gray) and the processing steps behind speech enhancement and source separation (black). Plain arrows indicate the processing order common to all algorithms and dashed arrows the feedback loops for certain algorithms.
most notably for hearing aid users, whose torso, head, and pinna act as obstacles [36]. It also explains source directivity,
i.e. the fact that the sound emitted by a source depends on
direction.
When the wave hits a large rigid surface of dimension larger
than λ, it is subject to reflection. The direction of the reflected
wave is symmetrical to the direction of the incident wave with
respect to the surface normal. Only part of the wave power is
reflected: the rest is absorbed by the surface. The absorption
ratio depends on the material and the angle of incidence [37].
It is on the order of 1% for a tiled floor, 7% for a concrete
wall, and 15% for a carpeted floor.
Due to these small values, many successive wave reflections
typically occur before the power becomes negligible. This
induces multiple propagation paths between each source and
each microphone, each with a different delay and attenuation
factor. The waves corresponding to different paths are coherent
and may result in constructive or destructive interference.
B. Deterministic perspective
Let us now move from the physical domain to discrete time
signal processing. We assume that the recorded sound scene
consists of J sources and that the number of microphones is
equal to I. We adopt the following general notations: scalars
are represented by plain letters, vectors by bold lowercase
letters, and matrices by bold uppercase letters. The microphone index, the source index, and the time index are denoted by i, j, and t, respectively. The superscripts $^T$ and $^H$ denote matrix transposition and Hermitian transposition, respectively.
According to the first linearity assumption in Section II-A, the multichannel mixture signal $\mathbf{x}(t) = [x_1(t), \dots, x_I(t)]^T$ can be expressed as
$$\mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t) \qquad (2)$$
where $\mathbf{c}_j(t) = [c_{1j}(t), \dots, c_{Ij}(t)]^T$ is the spatial image [38]
of source j, that is the contribution of that source to the
sound recorded at the microphones. This formulation is very
general: it applies both to targets and noise, and multiple noise
sounds can be modeled either as multiple sources or as a single
source [39]. In particular, it is valid for spatially diffuse sources
such as wind, trucks, or large musical instruments, which emit
sound in a large region of space.
In the case of a point source, the second linearity assumption makes it possible to express $\mathbf{c}_j(t)$ by linear convolution of a single-channel source signal $s_j(t)$ and the vector $\mathbf{a}_j(t, \tau) = [a_{1j}(t, \tau), \dots, a_{Ij}(t, \tau)]^T$ of acoustic impulse responses (AIRs) from the source to the microphones:
$$\mathbf{c}_j(t) = \sum_{\tau = 0}^{\infty} \mathbf{a}_j(t, \tau)\, s_j(t - \tau) \qquad (3)$$
This expression only holds for sources such as human speakers
which emit sound in a tight region of space. The AIRs result
from the summation of the multiple propagation paths and
they vary over time due to movements of the source, of the
microphones, or of other objects in the environment. When
such movements are small, they can be approximated as time-invariant and denoted as $\mathbf{a}_j(\tau)$.
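As an illustration of (2) and (3) in the time-invariant case, the sketch below builds an I-channel mixture by convolving each AIR $a_{ij}(\tau)$ with its source signal $s_j(t)$ and summing the resulting spatial images. The random AIRs and signal lengths are placeholders for the example only, not a room simulation.

```python
import numpy as np

def mix(sources, airs):
    """sources: list of J 1-D arrays s_j(t).
    airs: array of shape (J, I, L) holding a_ij(tau) for source j, mic i, lag tau.
    Returns the I-channel mixture x(t) as in (2)-(3):
        x_i(t) = sum_j sum_tau a_ij(tau) * s_j(t - tau)."""
    J, I, L = airs.shape
    T = max(len(s) for s in sources) + L - 1
    x = np.zeros((I, T))
    for j, s in enumerate(sources):
        for i in range(I):
            c_ij = np.convolve(airs[j, i], s)      # spatial image of source j at mic i
            x[i, :len(c_ij)] += c_ij               # sum over sources, eq. (2)
    return x

# Toy example with J = 2 sources, I = 3 microphones, L = 256-tap AIRs.
rng = np.random.default_rng(0)
sources = [rng.standard_normal(16000), rng.standard_normal(16000)]
airs = rng.standard_normal((2, 3, 256)) * 0.01
x = mix(sources, airs)      # shape (3, 16255)
```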
A schematic illustration of the shape of an AIR is provided
in Fig. 2. It consists of three successive parts. The first peak is
the direct path from the source to the microphone, as modeled
in (1). It is followed by early echoes corresponding to the
first few reflections on the room boundaries and the furniture.
Subsequent reflections cannot be distinguished from each other
anymore and they form an exponentially decreasing tail called
reverberation. This overall shape is often described by two
quantities: the reverberation time (RT), that is, the time it takes for the reverberant tail to decay by 60 decibels (dB), and the direct-to-reverberant ratio (DRR), that is, the ratio of the power
of direct sound (i.e., direct path) to that of the rest of the
AIR. The RT depends solely on the room, while the DRR
also depends on the source-to-microphone distance. The RT
is virtually equal to 0 in outdoor conditions due to the absence
of reflection and it is on the order of 50 ms in a car [40], 0.2
to 0.8 s in office or domestic conditions, 0.4 s to 1 s in a
classroom, and 1 s or more in an auditorium [41].
Fig. 3 depicts a real AIR measured in a meeting room. It has
both positive and negative values and it exhibits a strong first
reflection on a table just after the direct path, but its magnitude
follows the overall shape in Fig. 2.
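Both descriptors can be computed directly from a measured AIR. The sketch below estimates the DRR by splitting the response into a short direct-path window around the main peak and the remainder, and estimates the RT from the slope of the Schroeder backward-integrated energy decay curve, a standard method not described in this paper. The window length and the -5 to -25 dB fitting range are common but arbitrary choices.

```python
import numpy as np

def drr_db(air, fs, direct_ms=2.5):
    """Direct-to-reverberant ratio: power of a short window around the main
    peak (direct path) versus the power of the rest of the AIR, in dB."""
    peak = np.argmax(np.abs(air))
    half = int(direct_ms * 1e-3 * fs / 2)
    direct = air[max(0, peak - half):peak + half + 1]
    rest = np.concatenate([air[:max(0, peak - half)], air[peak + half + 1:]])
    return 10 * np.log10(np.sum(direct**2) / np.sum(rest**2))

def rt60_s(air, fs, fit_range_db=(-5.0, -25.0)):
    """Reverberation time via Schroeder backward integration: fit a line to
    the energy decay curve between -5 and -25 dB, extrapolate to -60 dB."""
    edc = np.cumsum(air[::-1]**2)[::-1]            # energy remaining after each lag
    edc_db = 10 * np.log10(edc / edc[0])
    hi, lo = fit_range_db
    idx = np.where((edc_db <= hi) & (edc_db >= lo))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # dB per second (negative)
    return -60.0 / slope
```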
C. Statistical perspective
Besides the above deterministic characterization of AIRs, it
is useful to adopt a statistical point of view [35], [42]. To do
so, we decompose AIRs as
$$a_{ij}(\tau) = e_{ij}(\tau) + r_{ij}(\tau) \qquad (4)$$

[Figure 2: schematic AIR, magnitude versus $\tau$ (ms), with the direct path, the early echoes, and the reverberant tail indicated.]
Figure 2. Schematic illustration of the shape of an AIR for a reverberation time of 0.25 s (from [18]).

[Figure 3: measured AIR $a_{ij}(\tau)$ versus $\tau$ (ms), first 0.1 s.]
Figure 3. First 0.1 s of a real AIR from the Aachen Impulse Response Database [41] recorded in a meeting room with a reverberation time of 0.23 s and a source-to-microphone distance of 1.45 m.
where $e_{ij}(\tau)$ models the direct path and early echoes and $r_{ij}(\tau)$ models reverberation.
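As a rough illustration of this decomposition, the sketch below builds a synthetic AIR from a direct-path impulse $e_{ij}(\tau)$ plus a reverberant part $r_{ij}(\tau)$ drawn as zero-mean Gaussian noise with an exponential decay set by the desired RT, anticipating the statistical properties discussed in the next paragraph. The delay, gains, and reverberant level below are illustrative assumptions, not a measured response.

```python
import numpy as np

def synthetic_air(fs, rt60_s, distance_m, length_s=0.4, c=343.0, seed=0):
    """a_ij(tau) = e_ij(tau) + r_ij(tau): direct-path impulse plus an
    exponentially decaying Gaussian reverberant tail (60 dB decay in rt60_s)."""
    rng = np.random.default_rng(seed)
    n = int(length_s * fs)
    tau = np.arange(n) / fs

    # Direct path: delayed, attenuated impulse as in the free-field model (1).
    e = np.zeros(n)
    d0 = int(round(distance_m / c * fs))
    e[d0] = 1.0 / (4 * np.pi * distance_m)

    # Reverberant part: Gaussian noise * exp(-t / tau_decay), starting after
    # the direct path; a 60 dB decay in rt60_s gives tau_decay = rt60_s / (3 ln 10).
    tau_decay = rt60_s / (3 * np.log(10))
    r = rng.standard_normal(n) * np.exp(-tau / tau_decay)
    r[:d0 + 1] = 0.0
    r *= 0.1 * e[d0]            # arbitrary reverberant level (controls the DRR)
    return e + r

air = synthetic_air(fs=16000, rt60_s=0.25, distance_m=1.45)
```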
The fact that reverberation results from the superposition of thousands to millions of acoustic paths makes it follow the law of large numbers. This implies three useful properties. Firstly, $r_{ij}(\tau)$ can be modeled as a zero-mean Gaussian noise signal whose amplitude decays exponentially over time according to the room's RT [43]. Secondly, the covariance $E(r_{ij}(\nu)\, r_{ij}^*(\nu'))$ between its Fourier transform $r_{ij}(\nu)$ at two different frequencies $\nu$ and $\nu'$ decays quickly with the difference between $\nu$ and $\nu'$ [44], [45]. Thirdly, if the room's RT is large enough, the reverberant sound field is diffuse, homogeneous and isotropic, which means that it has equal power in all directions of space. This last property makes it possible to compute the normalized correlation between two different channels $i$ and $i'$ in closed form as [35], [45], [46]
$$\Omega_{ii'}(\nu) = \frac{E_{\mathrm{spat}}\!\left(r_{ij}(\nu)\, r_{i'j}^*(\nu)\right)}{\sqrt{E_{\mathrm{spat}}\!\left(|r_{ij}(\nu)|^2\right)}\,\sqrt{E_{\mathrm{spat}}\!\left(|r_{i'j}(\nu)|^2\right)}} = \frac{\sin(2\pi\nu \ell_{ii'}/c)}{2\pi\nu \ell_{ii'}/c} \qquad (5)$$
where $E_{\mathrm{spat}}$ denotes spatial expectation over all possible absolute positions of the sources and the microphone array in the room, and $\ell_{ii'}$ the distance between the microphones. Note that the result does not depend on $j$ anymore. This quantity, known as the interchannel coherence, is shown in Fig. 4. It is close to one for small arrays and low frequencies and it decreases as the microphone distance or the frequency increases. We can further define the $I \times I$ coherence matrix $\boldsymbol{\Omega}(\nu)$ of the diffuse sound field by concatenating all elements from (5) as $(\boldsymbol{\Omega}(\nu))_{ii'} = \Omega_{ii'}(\nu)$.

[Figure 4: interchannel coherence $\Omega_{ii'}(\nu)$ versus frequency $\nu$ (0 to 8 kHz) for microphone distances $\ell_{ii'}$ = 5 cm, 20 cm, and 1 m.]
Figure 4. Interchannel coherence $\Omega_{ii'}(\nu)$ of the reverberant part of an AIR as a function of microphone distance $\ell_{ii'}$ and frequency $\nu$.
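As a sketch of how (5) is used in practice, the function below evaluates the diffuse-field coherence matrix $\boldsymbol{\Omega}(\nu)$ for an arbitrary array geometry, one frequency at a time. The uniform linear array in the example is only an illustration; function and variable names are ours.

```python
import numpy as np

def diffuse_coherence_matrix(mic_positions, nu, c=343.0):
    """Coherence matrix of a diffuse sound field, with entries
    (Omega(nu))_{ii'} = sin(2*pi*nu*l_ii'/c) / (2*pi*nu*l_ii'/c) as in (5).
    mic_positions: (I, 3) array of microphone coordinates in meters."""
    diff = mic_positions[:, None, :] - mic_positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)             # pairwise distances l_ii'
    # np.sinc(x) = sin(pi*x)/(pi*x), so passing 2*nu*dist/c yields eq. (5).
    return np.sinc(2 * nu * dist / c)

# Example: 4-microphone uniform linear array with 5 cm spacing, at 1 kHz.
mics = np.stack([np.arange(4) * 0.05, np.zeros(4), np.zeros(4)], axis=1)
Omega = diffuse_coherence_matrix(mics, nu=1000.0)    # (4, 4), ones on the diagonal
```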
It is interesting to note that both deterministic and statistical
perspectives are valid. The appropriate choice depends on the
observation length, and both perspectives can be useful in
accomplishing different tasks [47]. We will elaborate on this
issue in the subsequent section.
III. ACOUSTIC IMPULSE RESPONSE MODELS
The above properties of AIRs can be modeled and exploited
to design enhancement techniques. Five categories of models
have been proposed in the literature. A model is defined by
a parameterization of the AIRs and possible prior knowledge
about the parameter values. This prior knowledge can take the
form of deterministic constraints, penalty terms which we shall
denote by P(.), or probabilistic priors which we shall denote
by p(.).
A. Time-domain models
The simplest approach is to consider the AIRs as finite
impulse response (FIR) filters modeled by their time-domain
coefficients $\mathbf{a}_j(t, \tau)$ or $\mathbf{a}_j(\tau)$, $\tau \in \{0, \dots, L-1\}$. The assumed
length L is generally on the order of several hundred to a
few thousand taps. This model was very popular in the early
stages of research [48]–[55]. Recently, interest has revived
with sparse penalties which account for prior knowledge about
the physical properties of AIRs, namely the facts that power
concentrates in the direct path and the first early echoes [56]–
[60] and that the time envelope decays exponentially [61], but
these penalties have not yet been used in a BSS context.
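To illustrate what a sparsity-penalized time-domain fit can look like, the sketch below estimates an L-tap FIR approximation of a single AIR from a known source signal and one microphone signal by minimizing a least-squares term plus an ℓ1 penalty with a few iterations of ISTA. This is a generic proximal-gradient recipe under our own assumptions, not the specific algorithms of [56]–[61], and the step size and penalty weight are placeholder values.

```python
import numpy as np

def estimate_sparse_air(s, x, L=512, lam=0.01, n_iter=200):
    """ISTA for  min_a  0.5 * ||x - conv(s, a)||^2 + lam * ||a||_1,
    where a holds L time-domain AIR coefficients.  The convolution is written
    as a matrix S so that conv(s, a) = S @ a."""
    T = len(x)
    S = np.zeros((T, L))
    for tau in range(L):                       # column tau = s delayed by tau samples
        S[tau:, tau] = s[:T - tau]
    step = 1.0 / np.linalg.norm(S, 2) ** 2     # 1 / Lipschitz constant of the gradient
    a = np.zeros(L)
    for _ in range(n_iter):
        grad = S.T @ (S @ a - x)               # gradient of the least-squares term
        z = a - step * grad
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft threshold
    return a
```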
Time-domain modeling of AIRs exhibits several limitations.
Firstly, prior knowledge about the spatial position of the
sources does not easily translate into constraints on the AIR
coefficients [62]. Secondly, the source signals are typically
