HAL Id: hal-01414179, version 2 (https://hal.inria.fr/hal-01414179v2), submitted 4 Mar 2017.
To cite this version: Sharon Gannot, Emmanuel Vincent, Shmulik Markovich-Golan, Alexey Ozerov. A consolidated perspective on multi-microphone speech enhancement and source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(4), pp. 692–730, 2017. doi: 10.1109/TASLP.2016.2647702.
A Consolidated Perspective on Multi-Microphone
Speech Enhancement and Source Separation
Sharon Gannot, Emmanuel Vincent, Shmulik Markovich-Golan, and Alexey Ozerov
Abstract—Speech enhancement and separation are core prob-
lems in audio signal processing, with commercial applications
in devices as diverse as mobile phones, conference call systems,
hands-free systems, or hearing aids. In addition, they are cru-
cial pre-processing steps for noise-robust automatic speech and
speaker recognition. Many devices now have two to eight mi-
crophones. The enhancement and separation capabilities offered
by these multichannel interfaces are usually greater than those
of single-channel interfaces. Research in speech enhancement
and separation has followed two convergent paths, starting
with microphone array processing and blind source separation,
respectively. These communities are now strongly interrelated
and routinely borrow ideas from each other. Yet, a comprehensive
overview of the common foundations and the differences between
these approaches is lacking at present. In this article, we propose
to fill this gap by analyzing a large number of established
and recent techniques according to four transverse axes: a)
the acoustic impulse response model, b) the spatial filter design
criterion, c) the parameter estimation algorithm, and d) optional
postfiltering. We conclude this overview paper by providing a list
of software and data resources and by discussing perspectives and
future trends in the field.
Index Terms—Multichannel, array processing, beamforming,
Wiener filter, independent component analysis, sparse component
analysis, expectation-maximization, postfiltering.
I. INTRODUCTION
Speech enhancement and separation are core problems in
audio signal processing. Real-world speech signals often
involve one or more of the following distortions: reverberation,
interfering speakers, and/or noise. In this context, source
separation refers to the problem of extracting one or more
target speakers and cancelling interfering speakers and/or
noise. Speech enhancement is more general, in that it refers
to the problem of extracting one or more target speakers and
cancelling one or more of these three types of distortion. If
one focuses on removing interfering speakers and noise, as
opposed to reverberation, the terms “signal enhancement” and “source separation” become essentially interchangeable.
These problems arise in various real scenarios. For instance,
spoken communication over mobile phones or hands-free
systems requires the enhancement or separation of the near-
end speaker’s voice with respect to interfering speakers and
environmental noises before it is transmitted to the far-end
listener. Conference call systems or hearing aids face the same
problem, except that several speakers may be considered as
S. Gannot and S. Markovich-Golan are with Bar-Ilan University, Ramat-Gan 5290002, Israel (e-mail: gannot@eng.biu.ac.il, shmuel.markovich@biu.ac.il). E. Vincent is with Inria, 54600 Villers-lès-Nancy, France (e-mail: emmanuel.vincent@inria.fr). A. Ozerov is with Technicolor R&D, 35576 Cesson-Sévigné, France (e-mail: alexey.ozerov@technicolor.com).
targets. Speech enhancement and separation are also crucial
pre-processing steps for robust automatic speech recognition
and understanding, as available in today’s personal assistants,
GPS, televisions, video game consoles, and medical dictation
devices. More generally, they are believed to be necessary to
provide humanoid robots, assistive devices, and surveillance
systems with machine audition capabilities. While the above
applications require real-time processing, off-line separation of
singing voice, drums, and other musical instruments has been
successfully used for music information retrieval, upmixing of
mono or stereo movie soundtracks to 3D sound formats, and
remixing of music recordings. Other applications, e.g. meeting transcription, can also be processed off-line.
With few exceptions such as speech codecs and old sound
archives, the input signals are multichannel. The number of
microphones per device has steadily increased in the last
few years. Most smartphones, tablets and in-car hands-free
systems are now equipped with two or three microphones.
Hearing aids typically feature two microphones per ear and
a wireless link [1] to enable communication between the left
and right hearing aids, and conference call systems with eight
microphones are commercially available. Research prototypes
with forty to hundreds of microphones have been demonstrated
in lecture halls, office and domestic environments [2]–[6].
The enhancement capabilities offered by these multichannel
interfaces are usually greater than those of single-channel in-
terfaces. They make it possible to design multichannel spatial
filters that selectively enhance or suppress sounds in certain
directions (or volumes) by exploiting the spatial diversity, e.g.
phase and level differences, or more generally, the different
acoustic properties between channels. Single-channel spectral
filters, in contrast, require much more detailed knowledge
about the target and the noise and they usually result in smaller
quality improvement. As a matter of fact, it can be shown that
the maximum quality improvement theoretically achievable
with only two microphones is already much greater than with
a single microphone and that it keeps increasing with more
microphones [7].
Hundreds of multichannel audio signal enhancement tech-
niques have been proposed in the literature over the last forty
years along two historical research paths. Microphone array
processing emerged from the theory of sensor array processing
for telecommunications and it focused mostly on the local-
ization and enhancement of speech in noisy or reverberant
environments [8]–[12], while blind source separation (BSS)
was later popularized by the machine learning community
and it addressed “cocktail party” scenarios involving several
sound sources mixed together [13]–[18]. These two research

tracks have converged in the last decade and they are hardly
distinguishable today. As will be shown in this overview paper,
source separation techniques are not necessarily blind anymore
and most of them exploit the same theoretical tools, impulse
response models and spatial filtering principles as speech
enhancement techniques.
Despite this convergence, most books and reviews have
focused on either of these tracks. This article intends to fill this
gap by providing a comprehensive overview of their common
foundations and their differences. The vastness of the topic
requires us to limit the scope of this overview to the following:
– we focus on multichannel recordings made by multiple microphones, as opposed to multichannel signals created by mixing software which do not match the acoustics of real environments;
– we mostly study the enhancement and separation of speech with respect to interfering speech sources and environmental noise in reverberant environments, as opposed to cancelling echoes and reverberation of the target speech;
– we concentrate on truly multichannel techniques based on acoustic impulse response models and multichannel filtering: as such, we only briefly introduce speech and noise models, computational auditory scene analysis (CASA) models, and time-frequency masking techniques used to assist multichannel processing, but do not describe their use for single-channel or channel-wise filtering in depth;
– we do not describe possible use of the enhanced signals for subsequent tasks;
– time difference of arrival (TDOA) estimation and speaker localization of (multiple) sound sources are beyond the scope of this paper.
Readers interested in multichannel signals created by pro-
fessional mixing software and in the use of source separation
as a prior step to audio upmixing and remixing may refer
to, e.g., [19]–[21]. Echo cancellation, dereverberation, and
CASA are major topics described in the books [22]–[25].
For more information about advanced spectral models and
their use for single-channel and channel-wise spectral filtering,
see, e.g., [18], [26], [27]. For the use of speech enhancement
and musical instrument separation as pre-processing steps
for speech recognition and music information retrieval, see,
e.g., [28]–[31]. For a survey of TDOA and location estimation
techniques, interested readers may refer to [32]–[34].
In spite of its limited scope, this overview still covers a
wide field of research. In order to classify existing techniques
irrespectively of their origin in microphone array processing or
BSS, we adopt four transverse axes: a) the acoustic impulse
response model, b) the spatial filter design criterion, c) the
parameter estimation algorithm, and d) optional postfiltering.
These four modeling and processing steps are common to
all techniques, as illustrated in Fig. 1. The structure of the
article is as follows. We recall useful elements of acoustics
and introduce general notations in Section II. After describing
various acoustic impulse response models in Section III,
we define the fundamental concepts of spatial filtering in
Section IV and review existing design criteria, estimation algo-
rithms, and postfiltering techniques in Sections V, VI, and VII,
respectively. We provide a list of resources in Section VIII and
conclude in Section IX by summarizing the similarities and the
differences between approaches originating from microphone
array processing and BSS and discussing perspectives in the
field.
II. ELEMENTS OF ACOUSTICS AND NOTATIONS
From now on, we assume that two or more sound sources
are simultaneously recorded by two or more microphones.
The microphones are assumed to be omnidirectional, unless
explicitly stated otherwise. The set of microphones is called
a microphone array. Each recorded signal is called a channel
and the set of recorded signals is the array input signal or the
mixture signal.
A. Physics
Sound is a variation of air pressure on the order of $10^{-2}$ Pa for a speech source at a distance of 1 m, on top of the average atmospheric pressure of $10^{5}$ Pa. For such pressure values, the wave equation that governs the propagation of sound in air is linear [35]. This has two implications:
1) the pressure field at any time is the sum of the pressure
fields resulting from each source at that time;
2) the pressure field emitted at a given source propagates
over space and time according to a linear operation.
Unless clipping occurs, microphones operate linearly to record the pressure value at a given point in space. If one considers the pressure field emitted by each source as the target¹, the overall phenomenon is therefore linear.
In the free field, the solution to the wave equation is given by the spherical wave model. The waveform $x_i(\tilde{t})$ recorded at point $i$ when emitting a waveform $s_j(\tilde{t})$ at point $j$ is equal to
$$x_i(\tilde{t}) = \frac{1}{4\pi q_{ij}}\, s_j\!\left(\tilde{t} - \frac{q_{ij}}{c}\right) \qquad (1)$$
with $\tilde{t}$ denoting continuous time, $q_{ij}$ the distance between points $i$ and $j$, and $c$ the speed of sound, that is 343 m/s at 20°C. This speed is very small compared to the speed of light, so that propagation delays are not negligible. The recorded waveform differs from the emitted waveform by a delay of $q_{ij}/c$ and an attenuation factor of $1/(4\pi q_{ij})$.
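To make (1) concrete in discrete time, the short sketch below applies the free-field model to a sampled signal: the propagation reduces to a gain of $1/(4\pi q_{ij})$ and a delay of $q_{ij}/c$, here rounded to an integer number of samples for simplicity. The function name, sampling rate, and integer-sample rounding are illustrative assumptions, not part of the paper.

```python
import numpy as np

def free_field_propagation(s, distance_m, fs, c=343.0):
    """Delay and attenuate a source signal according to the spherical wave
    model x_i(t) = s_j(t - q_ij / c) / (4 * pi * q_ij).  The continuous-time
    delay is rounded to an integer number of samples."""
    delay_samples = int(round(distance_m / c * fs))   # q_ij / c, in samples
    gain = 1.0 / (4.0 * np.pi * distance_m)           # 1 / (4 * pi * q_ij)
    x = np.zeros(len(s) + delay_samples)
    x[delay_samples:] = gain * s                      # delayed, attenuated copy
    return x

# Example: a 1 kHz tone recorded at 1 m and at 2 m from the source; doubling
# the distance roughly doubles the delay and halves the amplitude (-6 dB).
fs = 16000
t = np.arange(fs) / fs
s = np.sin(2 * np.pi * 1000 * t)
x_1m = free_field_propagation(s, 1.0, fs)
x_2m = free_field_propagation(s, 2.0, fs)
```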
In the presence of obstacles, the sound wave is affected in
different ways depending on its frequency ν. The wavelength
λ = c/ν of audio varies from 17 mm at ν = 20 kHz to 17 m
at ν = 20 Hz.
When the sound wave hits an object of dimension smaller
than λ, it is not affected. When it hits an obstacle of compara-
ble dimension to λ, it is subject to diffraction. The wavefront is bent in a way that depends on the shape of the obstacle, its material, and the angle of incidence. Roughly speaking, it
will take more time for the wave to pass the obstacle and it
will be more attenuated than in air. This phenomenon occurs
¹Loudspeakers and musical instruments such as the trumpet do not operate linearly. These nonlinearities occur within solid parts of the loudspeaker or the instrument, however, before vibration is transmitted to air.

[Figure 1: block diagram. The source signals pass through Room acoustics (Sec. II and III) to produce the multichannel mixture signal; Spatial filtering (Sec. IV and V) produces the filtered signal and Postfiltering (Sec. VII) the postfiltered signal, with Parameter estimation (Sec. VI) feeding both processing stages.]
Figure 1. General schema showing acoustical propagation (gray) and the processing steps behind speech enhancement and source separation (black). Plain arrows indicate the processing order common to all algorithms and dashed arrows the feedback loops for certain algorithms.
most notably for hearing aid users, whose torso, head, and pinna act as obstacles [36]. It also explains source directivity,
i.e. the fact that the sound emitted by a source depends on
direction.
When the wave hits a large rigid surface of dimension larger
than λ, it is subject to reflection. The direction of the reflected
wave is symmetrical to the direction of the incident wave with
respect to the surface normal. Only part of the wave power is
reflected: the rest is absorbed by the surface. The absorption
ratio depends on the material and the angle of incidence [37].
It is on the order of 1% for a tiled floor, 7% for a concrete
wall, and 15% for a carpeted floor.
Due to these small values, many successive wave reflections
typically occur before the power becomes negligible. This
induces multiple propagation paths between each source and
each microphone, each with a different delay and attenuation
factor. The waves corresponding to different paths are coherent
and may result in constructive or destructive interference.
B. Deterministic perspective
Let us now move from the physical domain to discrete time
signal processing. We assume that the recorded sound scene
consists of J sources and that the number of microphones is
equal to I. We adopt the following general notations: scalars
are represented by plain letters, vectors by bold lowercase
letters, and matrices by bold uppercase letters. The microphone index, the source index, and the time index are denoted by i, j, and t, respectively. The superscripts $^T$ and $^H$ denote matrix transposition and Hermitian transposition, respectively.
According to the first linearity assumption in Section II-A, the multichannel mixture signal $\mathbf{x}(t) = [x_1(t), \dots, x_I(t)]^T$ can be expressed as
$$\mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t) \qquad (2)$$
where $\mathbf{c}_j(t) = [c_{1j}(t), \dots, c_{Ij}(t)]^T$ is the spatial image [38]
of source j, that is the contribution of that source to the
sound recorded at the microphones. This formulation is very
general: it applies both to targets and noise, and multiple noise
sounds can be modeled either as multiple sources or as a single
source [39]. In particular, it is valid for spatially diffuse sources
such as wind, trucks, or large musical instruments, which emit
sound in a large region of space.
In the case of a point source, the second linearity assumption makes it possible to express $\mathbf{c}_j(t)$ by linear convolution of a single-channel source signal $s_j(t)$ and the vector $\mathbf{a}_j(t, \tau) = [a_{1j}(t, \tau), \dots, a_{Ij}(t, \tau)]^T$ of acoustic impulse responses (AIRs) from the source to the microphones:
$$\mathbf{c}_j(t) = \sum_{\tau = 0}^{\infty} \mathbf{a}_j(t, \tau)\, s_j(t - \tau) \qquad (3)$$
This expression only holds for sources such as human speakers
which emit sound in a tight region of space. The AIRs result
from the summation of the multiple propagation paths and
they vary over time due to movements of the source, of the
microphones, or of other objects in the environment. When
such movements are small, they can be approximated as time-invariant and denoted as $\mathbf{a}_j(\tau)$.
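As an illustration of (2) and (3) in the time-invariant case, the sketch below builds an I-channel mixture by convolving each AIR $a_{ij}(\tau)$ with its source signal $s_j(t)$ and summing the resulting spatial images. The random AIRs and signal lengths are placeholders for the example only, not a room simulation.

```python
import numpy as np

def mix(sources, airs):
    """sources: list of J 1-D arrays s_j(t).
    airs: array of shape (J, I, L) holding a_ij(tau) for source j, mic i, lag tau.
    Returns the I-channel mixture x(t) as in (2)-(3):
        x_i(t) = sum_j sum_tau a_ij(tau) * s_j(t - tau)."""
    J, I, L = airs.shape
    T = max(len(s) for s in sources) + L - 1
    x = np.zeros((I, T))
    for j, s in enumerate(sources):
        for i in range(I):
            c_ij = np.convolve(airs[j, i], s)      # spatial image of source j at mic i
            x[i, :len(c_ij)] += c_ij               # sum over sources, eq. (2)
    return x

# Toy example with J = 2 sources, I = 3 microphones, L = 256-tap AIRs.
rng = np.random.default_rng(0)
sources = [rng.standard_normal(16000), rng.standard_normal(16000)]
airs = rng.standard_normal((2, 3, 256)) * 0.01
x = mix(sources, airs)      # shape (3, 16255)
```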
A schematic illustration of the shape of an AIR is provided
in Fig. 2. It consists of three successive parts. The first peak is
the direct path from the source to the microphone, as modeled
in (1). It is followed by early echoes corresponding to the
first few reflections on the room boundaries and the furniture.
Subsequent reflections cannot be distinguished from each other
anymore and they form an exponentially decreasing tail called
reverberation. This overall shape is often described by two
quantities: the reverberation time (RT), that is, the time it takes for the reverberant tail to decay by 60 decibels (dB), and the direct-to-reverberant ratio (DRR), that is, the ratio of the power
of direct sound (i.e., direct path) to that of the rest of the
AIR. The RT depends solely on the room, while the DRR
also depends on the source-to-microphone distance. The RT
is virtually equal to 0 in outdoor conditions due to the absence
of reflection and it is on the order of 50 ms in a car [40], 0.2
to 0.8 s in office or domestic conditions, 0.4 s to 1 s in a
classroom, and 1 s or more in an auditorium [41].
Fig. 3 depicts a real AIR measured in a meeting room. It has
both positive and negative values and it exhibits a strong first
reflection on a table just after the direct path, but its magnitude
follows the overall shape in Fig. 2.
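Both descriptors can be computed directly from a measured AIR. The sketch below estimates the DRR by splitting the response into a short direct-path window around the main peak and the remainder, and estimates the RT from the slope of the Schroeder backward-integrated energy decay curve, a standard method not described in this paper. The window length and the -5 to -25 dB fitting range are common but arbitrary choices.

```python
import numpy as np

def drr_db(air, fs, direct_ms=2.5):
    """Direct-to-reverberant ratio: power of a short window around the main
    peak (direct path) versus the power of the rest of the AIR, in dB."""
    peak = np.argmax(np.abs(air))
    half = int(direct_ms * 1e-3 * fs / 2)
    direct = air[max(0, peak - half):peak + half + 1]
    rest = np.concatenate([air[:max(0, peak - half)], air[peak + half + 1:]])
    return 10 * np.log10(np.sum(direct**2) / np.sum(rest**2))

def rt60_s(air, fs, fit_range_db=(-5.0, -25.0)):
    """Reverberation time via Schroeder backward integration: fit a line to
    the energy decay curve between -5 and -25 dB, extrapolate to -60 dB."""
    edc = np.cumsum(air[::-1]**2)[::-1]            # energy remaining after each lag
    edc_db = 10 * np.log10(edc / edc[0])
    hi, lo = fit_range_db
    idx = np.where((edc_db <= hi) & (edc_db >= lo))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # dB per second (negative)
    return -60.0 / slope
```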
C. Statistical perspective
Besides the above deterministic characterization of AIRs, it
is useful to adopt a statistical point of view [35], [42]. To do
so, we decompose AIRs as
$$a_{ij}(\tau) = e_{ij}(\tau) + r_{ij}(\tau) \qquad (4)$$

[Figure 2: schematic AIR, magnitude versus $\tau$ (ms), with the direct path, the early echoes, and the reverberant tail indicated.]
Figure 2. Schematic illustration of the shape of an AIR for a reverberation time of 0.25 s (from [18]).

[Figure 3: measured AIR $a_{ij}(\tau)$ versus $\tau$ (ms), first 0.1 s.]
Figure 3. First 0.1 s of a real AIR from the Aachen Impulse Response Database [41] recorded in a meeting room with a reverberation time of 0.23 s and a source-to-microphone distance of 1.45 m.
where $e_{ij}(\tau)$ models the direct path and early echoes and $r_{ij}(\tau)$ models reverberation.
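As a rough illustration of this decomposition, the sketch below builds a synthetic AIR from a direct-path impulse $e_{ij}(\tau)$ plus a reverberant part $r_{ij}(\tau)$ drawn as zero-mean Gaussian noise with an exponential decay set by the desired RT, anticipating the statistical properties discussed in the next paragraph. The delay, gains, and reverberant level below are illustrative assumptions, not a measured response.

```python
import numpy as np

def synthetic_air(fs, rt60_s, distance_m, length_s=0.4, c=343.0, seed=0):
    """a_ij(tau) = e_ij(tau) + r_ij(tau): direct-path impulse plus an
    exponentially decaying Gaussian reverberant tail (60 dB decay in rt60_s)."""
    rng = np.random.default_rng(seed)
    n = int(length_s * fs)
    tau = np.arange(n) / fs

    # Direct path: delayed, attenuated impulse as in the free-field model (1).
    e = np.zeros(n)
    d0 = int(round(distance_m / c * fs))
    e[d0] = 1.0 / (4 * np.pi * distance_m)

    # Reverberant part: Gaussian noise * exp(-t / tau_decay), starting after
    # the direct path; a 60 dB decay in rt60_s gives tau_decay = rt60_s / (3 ln 10).
    tau_decay = rt60_s / (3 * np.log(10))
    r = rng.standard_normal(n) * np.exp(-tau / tau_decay)
    r[:d0 + 1] = 0.0
    r *= 0.1 * e[d0]            # arbitrary reverberant level (controls the DRR)
    return e + r

air = synthetic_air(fs=16000, rt60_s=0.25, distance_m=1.45)
```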
The fact that reverberation results from the superposition of thousands to millions of acoustic paths makes it follow the law of large numbers. This implies three useful properties. Firstly, $r_{ij}(\tau)$ can be modeled as a zero-mean Gaussian noise signal whose amplitude decays exponentially over time according to the room's RT [43]. Secondly, the covariance $E(r_{ij}(\nu)\, r_{ij}^*(\nu'))$ between its Fourier transform $r_{ij}(\nu)$ at two different frequencies $\nu$ and $\nu'$ decays quickly with the difference between $\nu$ and $\nu'$ [44], [45]. Thirdly, if the room's RT is large enough, the reverberant sound field is diffuse, homogeneous and isotropic, which means that it has equal power in all directions of space. This last property makes it possible to compute the normalized correlation between two different channels $i$ and $i'$ in closed form as [35], [45], [46]
$$\Omega_{ii'}(\nu) = \frac{E_{\mathrm{spat}}\!\left(r_{ij}(\nu)\, r_{i'j}^*(\nu)\right)}{\sqrt{E_{\mathrm{spat}}\!\left(|r_{ij}(\nu)|^2\right)}\,\sqrt{E_{\mathrm{spat}}\!\left(|r_{i'j}(\nu)|^2\right)}} = \frac{\sin(2\pi\nu \ell_{ii'}/c)}{2\pi\nu \ell_{ii'}/c} \qquad (5)$$
where $E_{\mathrm{spat}}$ denotes spatial expectation over all possible absolute positions of the sources and the microphone array in the room, and $\ell_{ii'}$ the distance between the microphones. Note that the result does not depend on $j$ anymore. This quantity, known as the interchannel coherence, is shown in Fig. 4. It is close to one for small arrays and low frequencies and it decreases as the microphone distance or the frequency increases. We can further define the $I \times I$ coherence matrix $\boldsymbol{\Omega}(\nu)$ of the diffuse sound field by concatenating all elements from (5) as $(\boldsymbol{\Omega}(\nu))_{ii'} = \Omega_{ii'}(\nu)$.

[Figure 4: interchannel coherence $\Omega_{ii'}(\nu)$ versus frequency $\nu$ (0 to 8 kHz) for microphone distances $\ell_{ii'}$ = 5 cm, 20 cm, and 1 m.]
Figure 4. Interchannel coherence $\Omega_{ii'}(\nu)$ of the reverberant part of an AIR as a function of microphone distance $\ell_{ii'}$ and frequency $\nu$.
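As a sketch of how (5) is used in practice, the function below evaluates the diffuse-field coherence matrix $\boldsymbol{\Omega}(\nu)$ for an arbitrary array geometry, one frequency at a time. The uniform linear array in the example is only an illustration; function and variable names are ours.

```python
import numpy as np

def diffuse_coherence_matrix(mic_positions, nu, c=343.0):
    """Coherence matrix of a diffuse sound field, with entries
    (Omega(nu))_{ii'} = sin(2*pi*nu*l_ii'/c) / (2*pi*nu*l_ii'/c) as in (5).
    mic_positions: (I, 3) array of microphone coordinates in meters."""
    diff = mic_positions[:, None, :] - mic_positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)             # pairwise distances l_ii'
    # np.sinc(x) = sin(pi*x)/(pi*x), so passing 2*nu*dist/c yields eq. (5).
    return np.sinc(2 * nu * dist / c)

# Example: 4-microphone uniform linear array with 5 cm spacing, at 1 kHz.
mics = np.stack([np.arange(4) * 0.05, np.zeros(4), np.zeros(4)], axis=1)
Omega = diffuse_coherence_matrix(mics, nu=1000.0)    # (4, 4), ones on the diagonal
```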
It is interesting to note that both deterministic and statistical
perspectives are valid. The appropriate choice depends on the
observation length, and both perspectives can be useful in
accomplishing different tasks [47]. We will elaborate on this
issue in the subsequent section.
III. ACOUSTIC IMPULSE RESPONSE MODELS
The above properties of AIRs can be modeled and exploited
to design enhancement techniques. Five categories of models
have been proposed in the literature. A model is defined by
a parameterization of the AIRs and possible prior knowledge
about the parameter values. This prior knowledge can take the
form of deterministic constraints, penalty terms which we shall
denote by P(.), or probabilistic priors which we shall denote
by p(.).
A. Time-domain models
The simplest approach is to consider the AIRs as finite
impulse response (FIR) filters modeled by their time-domain
coefficients $\mathbf{a}_j(t, \tau)$ or $\mathbf{a}_j(\tau)$, $\tau \in \{0, \dots, L-1\}$. The assumed
length L is generally on the order of several hundred to a
few thousand taps. This model was very popular in the early
stages of research [48]–[55]. Recently, interest has revived
with sparse penalties which account for prior knowledge about
the physical properties of AIRs, namely the facts that power
concentrates in the direct path and the first early echoes [56]–
[60] and that the time envelope decays exponentially [61], but
these penalties have not yet been used in a BSS context.
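To illustrate what a sparsity-penalized time-domain fit can look like, the sketch below estimates an L-tap FIR approximation of a single AIR from a known source signal and one microphone signal by minimizing a least-squares term plus an ℓ1 penalty with a few iterations of ISTA. This is a generic proximal-gradient recipe under our own assumptions, not the specific algorithms of [56]–[61], and the step size and penalty weight are placeholder values.

```python
import numpy as np

def estimate_sparse_air(s, x, L=512, lam=0.01, n_iter=200):
    """ISTA for  min_a  0.5 * ||x - conv(s, a)||^2 + lam * ||a||_1,
    where a holds L time-domain AIR coefficients.  The convolution is written
    as a matrix S so that conv(s, a) = S @ a."""
    T = len(x)
    S = np.zeros((T, L))
    for tau in range(L):                       # column tau = s delayed by tau samples
        S[tau:, tau] = s[:T - tau]
    step = 1.0 / np.linalg.norm(S, 2) ** 2     # 1 / Lipschitz constant of the gradient
    a = np.zeros(L)
    for _ in range(n_iter):
        grad = S.T @ (S @ a - x)               # gradient of the least-squares term
        z = a - step * grad
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft threshold
    return a
```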
Time-domain modeling of AIRs exhibits several limitations.
Firstly, prior knowledge about the spatial position of the
sources does not easily translate into constraints on the AIR
coefficients [62]. Secondly, the source signals are typically
