Acoustic Beamforming for Speaker Diarization of
Meetings
Xavier Anguera, Member, IEEE, Chuck Wooters, Member, IEEE, Javier Hernando, Member, IEEE
Abstract—When performing speaker diarization on recordings
from meetings, multiple microphones of different qualities are
usually available and distributed around the meeting room.
Although several approaches have been proposed in recent years
to take advantage of multiple microphones, they are either
too computationally expensive and not easily scalable or they
cannot outperform the simpler case of using the best single
microphone. In this work the use of classic acoustic beamforming
techniques is proposed together with several novel algorithms
to create a complete frontend for speaker diarization in the
meeting room domain. New techniques we present include
blind reference-channel selection, two-step Time Delay of Arrival
(TDOA) Viterbi postprocessing, and a dynamic output signal
weighting algorithm, together with using such TDOA values in
the diarization to complement the acoustic information. Tests on
speaker diarization show a 25% relative improvement on the test
set compared to using a single most centrally located microphone.
Additional experimental results show improvements using these
techniques in a speech recognition task.
Index Terms—acoustic beamforming, speaker diarization,
speaker segmentation and clustering, meetings processing.
I. INTRODUCTION
Possibly the most noticeable difference when performing
speaker diarization in the meetings environment versus
other domains (like broadcast news or telephone speech) is
the availability, at times, of multiple microphone channels,
synchronously recording what occurs in the meeting. Their
varied locations, quantity, and wide range of signal quality
has made it difficult to come up with automatic ways to take
advantage of these multiple channels for speech-related tasks
such as speaker diarization.
In the system developed by Macquarie University [1] and
the TNO/AMI systems ([2] and [3]), either the most centrally
located microphone (known a priori) or a randomly selected
single microphone was used for speaker diarization. This
approach was designed to prevent low quality microphones
from affecting the results. Such approaches ignore the potential
advantage of using multiple microphones: making use of the
alternate microphone channels to create an improved signal
as the interaction moves from one speaker to another. Several
alternatives have been proposed to analyze and switch chan-
nels dynamically as the meeting progresses. At CMU [4] this
is done before any speaker diarization processing by using a
combination of energy and signal-to-noise metrics. However,
this approach creates a patchwork-type signal which could
At the time of this work X. Anguera was visiting the International Computer
Science Institute (ICSI), Berkeley, California. C. Wooters is currently with
ICSI and J. Hernando is with Universitat Politecnica de Catalunya (UPC),
Barcelona, Spain.
interfere with the speaker diarization algorithms. In an
alternative presented in an initial LIA implementation [5], all
channels were processed in parallel and the best segments
from each channel were selected at the output. This technique
is computationally expensive as a full speaker diarization
processing must be performed for every channel. Later, LIA
proposed ([6] and [7]) a weighted sum of all channels into
a single channel prior to performing diarization. However,
this approach does not take into account the fact that the
signals may be misaligned due to the propagation time of
speech through the air or hardware timing issues, resulting in
a summed signal that contains echoes, and usually performs
worse than the best single channel.
To take advantage of the multiple microphones available in
a typical meeting room, we previously proposed ([8] and [9])
the use of microphone array beamforming for speech/acoustic
enhancement (see [10], [11]). Although the task at hand differs
from the classic due to some of the assumptions in the
beamforming theory, it was found to be beneficial to use it
as a starting-point for taking advantage of the multiple distant
microphones.
In this work we propose a full acoustic beamforming fron-
tend, based on weighted-delay&sum techniques [10], aimed at
creating a single enhanced signal from an unknown number
of multiple microphone channels. This system is designed
for recordings made in meetings in which several speakers
and other sources of interference are present. Several new
algorithms are proposed to adapt the general beamforming
theory to this particular domain. Algorithms proposed in-
clude the automatic selection of the reference channel, the
computation of the N-best channel delays, postprocessing
techniques to select the optimum delay values (including a
noise thresholding and a two-step selection algorithm via
Viterbi decoding), and a dynamic channel-weight estimation
to reduce the negative impact of low quality channels.
The system presented here was used as part of ICSI’s
submission to the Spring 2006 Rich Transcription evaluation
(RT06s) organized by NIST [12], both in the speaker diariza-
tion and in the speech recognition systems. Additionally, the
software is currently available as open-source [13].
The next section describes the modules used in the acoustic
beamforming system. Then, we present experimental results
showing the improvements gained by using the new system
within the task of speaker diarization, and finally, we present
results for the task of speech recognition.

JOURNAL OF L
A
T
E
X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 2
II. MULTICHANNEL ACOUSTIC BEAMFORMING SYSTEM
IMPLEMENTATION
The acoustic beamforming system is based on the weighted-
delay&sum microphone array theory, which is a generalization
of the well-known delay&sum beamforming tech-
nique ([14], [15]). The signal output y[n] is expressed as the
weighted sum of the different channels as follows:
y[n] = \sum_{m=1}^{M} W_m[n]\, x_m\big[n - \mathrm{TDOA}_{(m,\mathrm{ref})}[n]\big] \qquad (1)

where W_m[n] is the relative weight for microphone m (out of
M microphones) at instant n, with the sum of all weights
equal to 1; x_m[n] is the signal for each channel, and
\mathrm{TDOA}_{(m,\mathrm{ref})}[n] (Time Delay of Arrival) is the relative delay
between each channel and the reference channel, used to
align all signals with each other at each instant n.
In practice, \mathrm{TDOA}_{(m,\mathrm{ref})}[n] is estimated via cross-correlation
techniques once every several acoustic frames. In the imple-
mentation presented here it corresponds to once every 250 ms,
using GCC-PHAT (Generalized Cross Correlation with Phase
Transform) as proposed in [16] and [17] and described below.
We will refer to these as "acoustic segments", and we will
refer to the (usually larger) set of frames used to estimate the
cross-correlation measure as the "analysis window".
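To make eq. (1) concrete, the following is a minimal sketch of the weighted delay-and-sum, assuming the per-segment weights and integer TDOA values (in samples) have already been estimated by the blocks described below; function and variable names are illustrative and not taken from the released tool [13].

```python
# Minimal weighted delay-and-sum following Eq. (1). Assumes the weights sum
# to 1 across channels at each segment and TDOAs are integer sample delays.
import numpy as np

def delay_and_sum(x, weights, tdoa, seg_len):
    """x: (M, n_samples) channel signals; weights, tdoa: (M, n_segments);
    seg_len: samples per acoustic segment (e.g. 250 ms at the sample rate)."""
    M, n = x.shape
    y = np.zeros(n)
    for c in range(weights.shape[1]):
        start, end = c * seg_len, min((c + 1) * seg_len, n)
        for m in range(M):
            # Read channel m shifted by its TDOA so all channels align,
            # then accumulate its weighted samples into the output.
            idx = np.clip(np.arange(start, end) - tdoa[m, c], 0, n - 1)
            y[start:end] += weights[m, c] * x[m, idx]
    return y
```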
The weighted-delay&sum technique was selected for use in
the meetings domain given the following set of constraints:
- Unknown locations of the microphones in the meeting room.
- Non-uniform microphone settings (gain, recording offsets, etc.).
- Unknown location and number of speakers in the room. Due to this constraint, any techniques based on known source locations are unsuitable.
- Unknown number of microphones in the meeting room. The system should be able to handle from 2 to >100 microphone channels.
Figure 1 shows the different blocks involved in the pro-
posed weighted-delay&sum process. The process can be split
into four main blocks. First, signal enhancement via Wiener
filtering is performed on each individual channel to reduce the
noise. Next, the information extraction block is in charge of
estimating which channel to use as the reference channel, an
overall weighting factor for the output signal, the skew present
in the ICSI meetings, and the N-best TDOA values at each
analysis segment. Third, a selection of the appropriate TDOA
delays between signals is obtained in order to optimally align
the channels before the sum. Finally, the signals are aligned
and summed. The output of the system is composed of the
acoustic signal and a vector of TDOA values, which can be
used as extra information about a speaker’s position. A more
detailed description of each block follows.
A. Individual Channel Signal Enhancement
Prior to doing any multichannel beamforming, each individ-
ual channel is Wiener filtered [18]. This aims at cleaning the
signal of corrupting noise, which is assumed to be additive
and of a stochastic nature. The implementation of Wiener
filtering is taken from the ICSI-SRI-UW system used for
ASR in [19], and applied to each channel independently.
This implementation performs an internal speech/non-speech
and noise power estimation for each channel independently,
ignoring any multichannel properties or microphone locations.
The use of such filtering improves the beamforming as it
increases the quality of the signal, even though it introduces
a small phase nonlinearity given that the filter is not of
linear phase. Alternative multichannel Wiener filters were
not considered but could further improve results by taking
advantage of redundancies in the different input channels.
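The filter actually used is the ICSI-SRI-UW implementation cited above; purely as a rough stand-in, the sketch below applies a textbook single-channel spectral Wiener gain with a crude noise estimate taken from the lowest-energy frames, ignoring the internal speech/non-speech estimation of the real system.

```python
# Hedged stand-in for single-channel Wiener filtering: STFT analysis, a
# Wiener gain 1 - N/P with the noise PSD estimated from the 10% lowest-energy
# frames (assumed non-speech), and overlap-add resynthesis.
import numpy as np

def wiener_filter(x, frame=512, hop=256):
    win = np.hanning(frame)
    frames = np.stack([x[i:i + frame] * win
                       for i in range(0, len(x) - frame, hop)])
    X = np.fft.rfft(frames, axis=1)
    power = np.abs(X) ** 2
    # Crude noise PSD: average of the lowest-energy frames; the real
    # implementation estimates this adaptively per channel.
    order = np.argsort(power.sum(axis=1))
    noise_psd = power[order[: max(1, len(order) // 10)]].mean(axis=0)
    gain = np.maximum(1.0 - noise_psd / np.maximum(power, 1e-12), 0.0)
    Y = gain * X
    # Overlap-add resynthesis of the filtered frames.
    y = np.zeros(len(x))
    for k, i in enumerate(range(0, len(x) - frame, hop)):
        y[i:i + frame] += np.fft.irfft(Y[k], n=frame)
    return y
```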
B. Meeting Information Extraction block
The algorithms in this block extract information from the
input signals to be used further on in the process to construct
the output signal. It is composed of four algorithms: reference
channel estimation, overall channels weighting factor, ICSI
meetings skew estimation, and the TDOA N-best delays
estimation.
1) Reference Channel Estimation: This algorithm attempts
to automatically find the most centrally located and best
quality channel to be used as the reference channel in further
processing. It is important for this channel to be the best
representative of the acoustics in the meeting, as the correct
estimation of the delays of each of the channels depends on
the reference chosen.
In the meetings used for the Rich Transcription evaluations
[20], there is one microphone that is selected as the most cen-
trally located microphone. This microphone channel is used in
the Single Distant Microphone (SDM) task. The SDM channel
is chosen given the room layout and the prior knowledge of
the microphone types. The module presented here ignores
the channel chosen for the SDM condition and selects one
microphone automatically based only on the acoustics. This
is intended for system robustness in cases where absolutely
no information is available on the room layout or microphone
placements.
In order to find the reference channel, we use a metric
based on a time-average of the cross-correlation between each
channel i and all of the others j = 1 . . . M, j ≠ i, computed
on segments of 1 second, as
\overline{\mathrm{xcorr}}_i = \frac{1}{K(M-1)} \sum_{k=1}^{K} \sum_{j=1,\, j \neq i}^{M} \mathrm{xcorr}[i,j;k] \qquad (2)
where M is the total number of channels/microphones and
K = 200 indicates the number of one second blocks used
in the average. The xcorr[i, j; k] indicates a standard cross-
correlation measure between channels i and j for each block
k. The channel i with the highest average cross-correlation
was chosen as the reference channel. An alternative SNR
metric was analyzed and the results were not conclusive as
to which method performed better in all cases. The cross-
correlation metric was chosen as it matches the algorithm's
search for maximum correlation values and because it is
simple to implement.
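A sketch of the selection metric of eq. (2) follows: for each channel, the peak cross-correlation with every other channel is averaged over K one-second blocks, and the channel with the highest average is returned. The per-block unit-energy normalization is an implementation assumption, as the text does not fix the exact cross-correlation variant.

```python
# Reference channel selection per Eq. (2): average cross-correlation of each
# channel against all others over K one-second blocks.
import numpy as np
from scipy.signal import correlate

def pick_reference(x, fs, K=200):
    """x: (M, n_samples) channel signals; fs: sample rate in Hz."""
    M = x.shape[0]
    score = np.zeros(M)
    for k in range(K):
        block = x[:, k * fs:(k + 1) * fs]
        # Unit-energy normalization so correlation peaks are comparable.
        b = block / (np.linalg.norm(block, axis=1, keepdims=True) + 1e-12)
        for i in range(M):
            for j in range(M):
                if i != j:
                    # Peak of the cross-correlation as xcorr[i, j; k].
                    score[i] += correlate(b[i], b[j], mode="full",
                                          method="fft").max()
    # Eq. (2): normalize by K(M-1) and pick the best channel.
    return int(np.argmax(score / (K * (M - 1))))
```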

JOURNAL OF L
A
T
E
X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 3
[Fig. 1. Weighted-delay&sum block diagram. Four blocks: A) individual channel signal enhancement (Wiener filtering); B) meeting information extraction (reference channel computation, overall channels weighting factor, ICSI meetings skew estimation, GCC-PHAT cross-correlation N-best estimation); C) TDOA values selection (noise threshold estimation and thresholding, dual-pass Viterbi decoding of delays); D) output signal generation (interchannel output weight adaptation, bad-quality segments elimination, channels sum), producing the enhanced signal and the TDOA values.]
2) Overall Channels Weighting Factor: For practical rea-
sons, speech processing applications use acoustic data that
was sampled with a limited number of bits (e.g. 16 bits per
sample) providing a certain amount of dynamic range, which
is often not fully used because the recorded signals are of low
amplitude. When summing up several input signals, we are
increasing the resolution in the resulting signal, and thus we
must try to take advantage of as much of the output resolution
as possible. The overall channel weighting factor is used to
normalize the input signals to match the file’s available dy-
namic range. It is useful for low amplitude input signals since
the beamformed output has greater resolution and therefore
can be scaled appropriately to minimize the quantization errors
generated by scaling it to the output sampling requirements.
There are several methods in signal processing for finding
the maximum value of a signal in order to perform amplitude
normalization. These include computing the absolute maximum
amplitude, the Root Mean Square (RMS) value, or other
variations of these, over the entire recording. It was observed in
meetings data that the signals may contain low energy areas
(silence regions) with short average durations, and high energy
areas (impulsive noises like door slams, or laughs), with even
shorter duration. Using the absolute maximum or RMS would
“saturate” the normalizing factor to the highest possible value
or bias it according to the amount of silence in the meeting.
So instead, we chose a windowed maximum averaging to try
to increase the likelihood that every window contains some
speech. In each window the maximum value is found and
these max values are averaged over the entire recording. The
weighting factor was obtained directly from this average.
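The windowed maximum averaging can be sketched as follows; the window length and the target output level are assumptions, since the text does not specify them.

```python
# Overall weighting factor via windowed maximum averaging: the maximum
# absolute amplitude of each window is averaged over the recording, which is
# robust both to silences and to short impulsive noises (door slams, laughs).
import numpy as np

def overall_weighting_factor(x, fs, win_sec=10.0, target=30000.0):
    """x: 1-D signal; returns a scale factor toward the 16-bit range."""
    w = int(win_sec * fs)
    maxima = [np.abs(x[i:i + w]).max() for i in range(0, len(x) - w + 1, w)]
    return target / max(np.mean(maxima), 1.0)
```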
3) ICSI Meetings Skew Estimation: This module was cre-
ated to deal with the meetings that come from the ICSI
Meeting Corpus, some of which have an error in the syn-
chronization of the channels. This was originally detected and
reported in [21], indicating that the hardware used for the
recordings was found not to keep an exact synchronization
between the different channels, resulting in a skew between
channels of multiples of 2.64 ms. It is not possible to know
beforehand the amount of skew of each of the channels as the
room setup did not follow a consistent ordering regarding the
connections to the hardware being used. Therefore we need to
automatically detect such skew so that it does not affect the
beamforming.
The artificially generated skew does not affect the general
processing of the channels by an ASR system as it does not
need exact time alignment between the channels: utterance
boundaries always include a silence “guard” region, and
the usual parametrizations (10-20ms long) cover small time
differences.
It does pose a problem though when computing the delays
between channels as it introduces an artificial delay between
channel pairs, which forces us to use a larger analysis window
for the ICSI meetings than with other meetings in order
to compute the delays accurately. This increases the chance
of delay estimation error. This module is therefore used to
estimate the skew between each channel and the reference
channel (in the case of ICSI meetings) and use it as a constant
bias in the rest of the delay processing.
In order to estimate the bias, an average cross-correlation
metric was put in place in order to obtain the average (across
time) delay between each channel and the reference channel
for a set of long acoustic windows (around 20 seconds), evenly
distributed along the meeting.
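A possible realization of this estimate is sketched below: the delay between each channel and the reference is measured over a few long windows spread across the meeting, averaged, and snapped to the nearest multiple of 2.64 ms. The number of windows and the snapping step are assumptions consistent with the text.

```python
# Skew (constant bias) estimation for ICSI meetings: average long-window
# cross-correlation delays, then round to a multiple of 2.64 ms.
import numpy as np
from scipy.signal import correlate

def estimate_skew(x_m, x_ref, fs, n_windows=5, win_sec=20.0):
    w = int(win_sec * fs)
    starts = np.linspace(0, len(x_ref) - w, n_windows, dtype=int)
    delays = []
    for s in starts:
        a, b = x_m[s:s + w], x_ref[s:s + w]
        c = correlate(a - a.mean(), b - b.mean(), mode="full", method="fft")
        delays.append(np.argmax(c) - (w - 1))   # lag in samples
    unit = 0.00264 * fs                         # 2.64 ms in samples
    return round(np.mean(delays) / unit) * unit # constant bias (samples)
```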
4) TDOA N-best delays estimation: The time delay of arrival
(TDOA) between each of the channels and the reference
channel is computed in segments of 250
ms. This allows the beamforming to quickly modify its beam
steering whenever the active speaker changes. In this imple-
mentation the TDOA was computed over a window of 500ms
(called the analysis window), which covers the current analysis
segment and the next. The sizes of the analysis window and
of the segment constitute a tradeoff. A large analysis window
or segment window leads to a reduction in the resolution
of changes in the TDOA. On the other hand, using a small
analysis window reduces the robustness of the estimation. The
reduction of the segment size also increases the computational
cost of the system, while not increasing the quality of the
output signal. The selection of the scroll and analysis window
sizes was done empirically given some development data and
no exhaustive study was performed to fine-tune these values.

JOURNAL OF L
A
T
E
X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 4
In order to compute the TDOA between the reference
channel and any other channel for any given segment it is
usual to estimate it as the delay that maximizes the cross-
correlation between the two segments. In current beamforming
systems, the use of the cross-correlation in its classical form
\big(R^{i,\mathrm{ref}}_{\mathrm{xcorr}}(d) = \sum_{n=0}^{N} x_i[n]\, x_{\mathrm{ref}}[n+d]\big) is avoided as it is very
sensitive to noise and reverberation. To improve robustness
against these problems, it is common practice to use the
GCC-PHAT. This variation of the standard cross-correlation
applies an amplitude normalization in the frequency domain,
maintaining the phase information, which conveys the delay
information between the signals.
Given two signals x_i(n) and x_{\mathrm{ref}}(n), the GCC-PHAT is
computed as:

\hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d) = \mathcal{F}^{-1}\!\left( \frac{X_i(f)\,[X_{\mathrm{ref}}(f)]^{*}}{\big| X_i(f)\,[X_{\mathrm{ref}}(f)]^{*} \big|} \right) \qquad (3)

where X_i(f) and X_{\mathrm{ref}}(f) are the Fourier transforms of the
two signals, \mathcal{F}^{-1} indicates the inverse Fourier transform,
[\cdot]^{*} denotes the complex conjugate and |\cdot| is the modulus.
The resulting \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d) is the correlation function between
signals i and ref. All its values range from 0 to 1 given
the frequency-domain amplitude normalization performed.
The time delay of arrival (TDOA) for these two microphones
(i and ref) is estimated as

\mathrm{TDOA}^{i}_{1} = \arg\max_{d} \big( \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d) \big) \qquad (4)

which we denote with subscript 1 (1st-best) to differentiate it
from further computed values.
Although the maximum value of \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d) corresponds to
the estimated TDOA for that particular segment and micro-
phone pair, it does not always "point" at the correct speaker
during that segment. In the system proposed here the top
N relative maxima of \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d) are computed instead (we
use N around 4), and several post-processing techniques are
used to "stabilize" and choose the appropriate delay before
aligning the signals for the sum. Therefore, for each analysis
segment we obtain a vector \mathrm{TDOA}^{i}_{n} for each microphone
i = 1 \ldots M, i \neq \mathrm{ref}, with its corresponding correlation values
\mathrm{GCC\text{-}PHAT}^{i}_{n}, with n = 1 \ldots N.
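For reference, a compact GCC-PHAT N-best sketch (eqs. (3)-(4)) is given below; N = 4 follows the text, while the +/- 30 ms delay search range is an assumption.

```python
# GCC-PHAT and N-best peak picking: whiten the cross-spectrum to keep only
# phase, invert, and take the N highest local maxima within a delay range.
import numpy as np

def gcc_phat_nbest(x_i, x_ref, fs, n_best=4, max_delay_s=0.030):
    n = len(x_i) + len(x_ref)
    X = np.fft.rfft(x_i, n=n) * np.conj(np.fft.rfft(x_ref, n=n))
    r = np.fft.irfft(X / np.maximum(np.abs(X), 1e-12), n=n)
    max_lag = int(max_delay_s * fs)
    # Rearrange so index 0 corresponds to lag -max_lag .. +max_lag.
    r = np.concatenate([r[-max_lag:], r[: max_lag + 1]])
    # Local maxima, sorted by GCC-PHAT value.
    peaks = [k for k in range(1, len(r) - 1) if r[k - 1] < r[k] >= r[k + 1]]
    peaks.sort(key=lambda k: r[k], reverse=True)
    tdoas = [k - max_lag for k in peaks[:n_best]]   # delays in samples
    values = [float(r[k]) for k in peaks[:n_best]]  # GCC-PHAT values
    return tdoas, values
```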
We identified three cases where it is not appropriate to use
the absolute maximum (1st-best) of \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d). First, the
maximum can be due to spurious noises or events not related
to the active speaker, while the active speaker is actually
represented by another local maximum of the cross-correlation.
Second, when two or more speakers are speaking simultaneously,
each speaker will be represented by a different maximum in the
cross-correlation function, but the absolute maximum might
not be constantly assigned to the same speaker, resulting in
artificial speaker switching. Finally, when the processed segment
is entirely filled with non-speech acoustic data (either noise or
random acoustic events), the \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d)
function peaks randomly over all possible delays, making it
unsuitable for beamforming. In this case no source delay
information can be extracted from the signal, and the delays
ought to be discarded entirely and substituted by others from
the surrounding time frames, as will be seen in the next section.
C. TDOA Values Selection/Post-Processing
Once the TDOA values of all channels across the whole meeting
have been computed, it is desirable to apply a TDOA post-
processing to obtain the set of delay values to be applied to
each of the signals when performing the weighted-delay&sum
as proposed in eq. 1. We implemented two filtering steps,
a noisy TDOA detection and elimination (TDOA continuity
enhancement), and 1-best TDOA selection from the N-best
vector.
1) Noisy TDOA Thresholding: This first proposed filtering
step is intended to detect those TDOA values that are not
reliable. A TDOA value does not show any useful information
when it is computed over a silence (or mainly silence) region
or when the SNR of either of the signals being compared
is low, making them very dissimilar. The first problem could
be addressed by using a speech/non-speech detector prior to
any further processing, but prior experimentation indicated
that further errors were introduced due to the detector. The
selected algorithm applies a simple continuity filter on the
TDOA values for each segment c based on their GCC-PHAT
values by using a noise threshold \Theta_{\mathrm{noise}} in the following way:

\mathrm{TDOA}^{i}_{n}[c] = \begin{cases} \mathrm{TDOA}^{i}_{n}[c-1] & \text{if } \mathrm{GCC\text{-}PHAT}^{i}_{1}[c] < \Theta_{\mathrm{noise}} \\ \mathrm{TDOA}^{i}_{n}[c] & \text{if } \mathrm{GCC\text{-}PHAT}^{i}_{1}[c] \geq \Theta_{\mathrm{noise}} \end{cases} \qquad (5)
where \Theta_{\mathrm{noise}} is defined as the minimum correlation value
below which the correlation cannot be assumed to return
reliable results. It is set independently for every meeting, as
the correlation values are dependent not only on the signal
quality but also on the microphone distribution in the different
meeting rooms. In order to find an appropriate value for it, the
histogram of the distribution of correlation values needs to be
evaluated for each meeting. In our implementation a threshold
was selected at the value which filters out the lowest 10% of
the cross-correlation frames, using the histogram for all cross-
correlation values from all microphones in each meeting.
Experimentation showed that the final performance did not
decrease when computing a threshold over the distribution
of all correlation values together, compared to individual
threshold values computed for each channel independently,
which would impose a higher computational burden on the
system.
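The thresholding of eq. (5) together with the meeting-wide lowest-10% histogram rule can be sketched as follows; pooling the 1st-best GCC-PHAT values of all channels and taking the 10th percentile is how we read the text.

```python
# Noisy TDOA thresholding per Eq. (5): set the threshold at the 10th
# percentile of all 1st-best GCC-PHAT values in the meeting, and carry over
# the previous segment's delays whenever a segment falls below it.
import numpy as np

def threshold_tdoas(tdoa, gcc1):
    """tdoa: (M, C, N) N-best delays per channel and segment.
    gcc1: (M, C) 1st-best GCC-PHAT value per channel and segment."""
    theta = np.percentile(gcc1, 10.0)      # filters the lowest 10% of frames
    out = tdoa.copy()
    M, C, _ = tdoa.shape
    for m in range(M):
        for c in range(1, C):
            if gcc1[m, c] < theta:
                out[m, c] = out[m, c - 1]  # reuse previous segment's values
    return out
```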
2) Dual-Step Viterbi Post-Processing: This second post-
processing technique applied to the computed delays is used
to select the appropriate delay among the N-best
GCC-PHAT values computed previously. The aim here
is to maximize speaker continuity by avoiding constant delay
switching in the case of multiple speakers, and to filter out
undesired beam steering towards spurious noises present in
the room.
As seen in figure 2, a two-step Viterbi decoding of the N-best
TDOA values is proposed. The first step consists of a local
(single-channel) decoding where the two best delays are chosen
from the N-best delays computed for that channel at every
segment. The second decoding step considers all combinations
of two-best delays across all channels, and selects the final
single TDOA value that is most consistent across all channels.
[Fig. 2. Weighted-delay&sum double-Viterbi delay selection: an individual Viterbi pass per channel selects the 2-best TDOA values from that channel's N-best values, and a second, multiple-channel Viterbi pass then selects the final TDOA values across all channels.]
For each step, one needs to define the topology of the state
sequence used in the Viterbi decoding and the emission and
transition weights to be used. The use of a two-step algorithm
is due in part to computational constraints since an exhaustive
search over all possible combinations of all N -best values for
all channels would easily become computationally prohibitive.
Both steps choose the most probable (and second most
probable) sequence of hidden states where each one is related
to the TDOA values computed for one segment. In the first
step the set of possible states at each segment c is given
by the computed N-best values. Each possible state has an
emission probability-like value for each processed segment.
This value is equal to the \log \mathrm{GCC\text{-}PHAT}^{i}_{n}[c] value for channel
i, with n = 1 \ldots N. No prior scaling or normalization is
required as the GCC-PHAT values range from 0 to 1 (given the
amplitude normalization performed in the frequency domain
in its definition).
The transition weight between two states in step 1 decreases
linearly with the distance between their delays. Given two
nodes, i and j at segments c and c-1, respectively, the
transition weight for a given channel m is defined as

\mathrm{Tr}^{m}_{1}[i,j;c] = \frac{\Delta\mathrm{diff}^{m}[i,j;c] - \big| \mathrm{TDOA}^{m}_{i}[c] - \mathrm{TDOA}^{m}_{j}[c-1] \big|}{\Delta\mathrm{diff}^{m}[i,j;c]} \qquad (6)

where \Delta\mathrm{diff}^{m}[i,j;c] = \max_{\forall i,j} \big( \big| \mathrm{TDOA}^{m}_{i}[c] - \mathrm{TDOA}^{m}_{j}[c-1] \big| \big).
This way all transition weights are locally bounded
between 0 and 1, assigning a 0 weight to the furthest-away
delay pair. This implies that only N-1 TDOA values will
be considered at each segment.
This first Viterbi step aims at finding the two best TDOA
values (from the computed N -best) that represent the meet-
ing’s speakers at any given time. By doing so it is believed that
the system will be able to choose the most appropriate/stable
TDOA value for that segment and a secondary delay, which
may come from interfering events, e.g. other speakers or the
same speaker’s echoes. The TDOA values can be any two (not
allowing the paths to collapse) of the N-best TDOA values
computed previously by the system, and are chosen exclusively
based on their distance to surrounding TDOA values and their
GCC-PHAT values.
The second pass Viterbi decoding finds the best possible
path given the set of hidden states generated by all possible
combinations of delays from the two-best delays obtained ear-
lier for each channel. Given a vector g(l) of dimension M-1
(the number of channels for which TDOA values are
computed), which is the l-th combination of possible indexes
from the 2-best TDOA values for each channel (obtained in
step 1), it is expanded as g(l) = [g(l,1) \ldots g(l,M-1)], where
each element g(l,m) \in \{0,1\}, with 2^{M-1} possible combinations.
One can rewrite \mathrm{GCC\text{-}PHAT}^{m}_{g(l,m)}[c] as the GCC-PHAT value
associated with the g(l,m)-best TDOA value for channel
m at segment c, which takes values in [0,1]. Then the
emission probability-like values are obtained as the (log-domain)
product of the individual GCC-PHAT values of each considered
TDOA combination g(l) at segment c as

P_{2}(g(l))[c] = \sum_{m=1}^{M} \log\big( \mathrm{GCC\text{-}PHAT}^{m}_{g(l,m)}[c] \big) \qquad (7)

which can be considered the extension of the individual-channel
emission probability-like values to the case of multiple
TDOA values, where we consider the different dimensions
to be independent from each other (interpreted as independence
of the TDOA values obtained for each channel at segment c,
not of their relationship with each other in space along time).
The transition weights are computed in a similar way as in
the first step, but they now introduce a new dimension
to the computation, as a vector of possible TDOA values
needs to be taken into account. As was done with the emission
probability-like values, the total distance is considered to
be the sum of the individual distances for each element.
Assuming \mathrm{TDOA}^{m}_{g(l,m)}[c] is the TDOA value for the g(l,m)-
best element in channel m at segment c, the transition weights
between two TDOA combinations for all microphones are
determined by

\mathrm{Tr}_{2}[i,j;c] = \sum_{m=1}^{M} \frac{\Delta\mathrm{diff}[i,j;c] - \big| \mathrm{TDOA}^{m}_{g(i,m)}[c] - \mathrm{TDOA}^{m}_{g(j,m)}[c-1] \big|}{\Delta\mathrm{diff}[i,j;c]} \qquad (8)

where now \Delta\mathrm{diff}[i,j;c] = \max_{\forall i,j,m} \big( \big| \mathrm{TDOA}^{m}_{g(i,m)}[c] - \mathrm{TDOA}^{m}_{g(j,m)}[c-1] \big| \big).
This second processing step considers the relationship in
space present between all channels, as they are presumably
steering to the same position. By performing a decoding over
time, it selects the TDOA vector elements according to their
distance to nearby vectors.
In both cases, the transition weights are modified (raised to
a power) to emphasize their effect in the decision of the best
path. This is similar to the use of word-transition penalties in
ASR systems. It will be shown in the experiments section
that a weight of 25 in both cases appears to optimize the
diarization error rate on the development set.
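To ground the description, the sketch below implements the first (single-channel) Viterbi step with log GCC-PHAT emissions and the eq. (6) transition weights raised to the power 25. For brevity it backtracks only the single best path, whereas the actual system keeps the two best non-collapsing paths per channel and then runs the second, cross-channel step over their 2^{M-1} combinations.

```python
# Step-1 Viterbi over the N-best TDOA values of one channel: emissions are
# log GCC-PHAT values, transitions follow Eq. (6) raised to a power of 25.
import numpy as np

def viterbi_step1(tdoa, gcc, trans_pow=25.0):
    """tdoa, gcc: (C, N) N-best delays and GCC-PHAT values for one channel.
    Returns, per segment, the index of the chosen N-best candidate."""
    C, N = tdoa.shape
    score = np.log(np.maximum(gcc[0], 1e-12))   # initial emission scores
    back = np.zeros((C, N), dtype=int)
    for c in range(1, C):
        # Pairwise delay distances between segment c and c-1 candidates.
        dist = np.abs(tdoa[c][:, None] - tdoa[c - 1][None, :])   # (N, N)
        tr = (dist.max() - dist) / max(dist.max(), 1e-12)        # Eq. (6)
        # Accumulate: previous score + powered transition (in log domain).
        cand = score[None, :] + trans_pow * np.log(np.maximum(tr, 1e-12))
        back[c] = np.argmax(cand, axis=1)
        score = cand[np.arange(N), back[c]] + np.log(np.maximum(gcc[c], 1e-12))
    # Backtrace the best state sequence.
    path = [int(np.argmax(score))]
    for c in range(C - 1, 0, -1):
        path.append(int(back[c, path[-1]]))
    return path[::-1]
```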
To illustrate how the two-step Viterbi decoding works on the
TDOA values, let us consider the example in figure 3a. This
