Acoustic Beamforming for Speaker Diarization of
Meetings
Xavier Anguera, Member, IEEE, Chuck Wooters, Member, IEEE, Javier Hernando, Member, IEEE
Abstract—When performing speaker diarization on recordings
from meetings, multiple microphones of different qualities are
usually available and distributed around the meeting room.
Although several approaches have been proposed in recent years
to take advantage of multiple microphones, they are either
too computationally expensive and not easily scalable or they
cannot outperform the simpler case of using the best single
microphone. In this work the use of classic acoustic beamforming
techniques is proposed together with several novel algorithms
to create a complete frontend for speaker diarization in the
meeting room domain. New techniques we present include
blind reference-channel selection, two-step Time Delay of Arrival
(TDOA) Viterbi postprocessing, and a dynamic output signal
weighting algorithm, together with using such TDOA values in
the diarization to complement the acoustic information. Tests on
speaker diarization show a 25% relative improvement on the test
set compared to using a single most centrally located microphone.
Additional experimental results show improvements using these
techniques in a speech recognition task.
Index Terms—acoustic beamforming, speaker diarization,
speaker segmentation and clustering, meetings processing.
I. INTRODUCTION
Possibly the most noticeable difference when performing
speaker diarization in the meetings environment versus
other domains (like broadcast news or telephone speech) is
the availability, at times, of multiple microphone channels,
synchronously recording what occurs in the meeting. Their
varied locations, quantity, and wide range of signal quality
has made it difficult to come up with automatic ways to take
advantage of these multiple channels for speech-related tasks
such as speaker diarization.
In the system developed by Macquarie University [1] and
the TNO/AMI systems ([2] and [3]), either the most centrally
located microphone (known a priori) or a randomly selected
single microphone was used for speaker diarization. This
approach was designed to prevent low quality microphones
from affecting the results. Such approaches ignore the potential
advantage of using multiple microphones: making use of the
alternate microphone channels to create an improved signal
as the interaction moves from one speaker to another. Several
alternatives have been proposed to analyze and switch chan-
nels dynamically as the meeting progresses. At CMU [4] this
is done before any speaker diarization processing by using a
combination of energy and signal-to-noise metrics. However,
this approach creates a patchwork-type signal which could
At the time of this work X. Anguera was visiting the International Computer
Science Institute (ICSI), Berkeley, California. C. Wooters is currently with
ICSI and J. Hernando is with Universitat Politecnica de Catalunya (UPC),
Barcelona, Spain.
interfere with the speaker diarization algorithms. In an
alternative presented in an initial LIA implementation [5], all
channels were processed in parallel and the best segments
from each channel were selected at the output. This technique
is computationally expensive as a full speaker diarization
processing must be performed for every channel. Later, LIA
proposed ([6] and [7]) a weighted sum of all channels into
a single channel prior to performing diarization. However,
this approach does not take into account the fact that the
signals may be misaligned due to the propagation time of
speech through the air or hardware timing issues, resulting in
a summed signal that contains echoes, and usually performs
worse than the best single channel.
To take advantage of the multiple microphones available in
a typical meeting room, we previously proposed ([8] and [9])
the use of microphone array beamforming for speech/acoustic
enhancement (see [10], [11]). Although the task at hand differs
from the classic due to some of the assumptions in the
beamforming theory, it was found to be beneficial to use it
as a starting-point for taking advantage of the multiple distant
microphones.
In this work we propose a full acoustic beamforming fron-
tend, based on weighted-delay&sum techniques [10], aimed at
creating a single enhanced signal from an unknown number
of multiple microphone channels. This system is designed
for recordings made in meetings in which several speakers
and other sources of interference are present. Several new
algorithms are proposed to adapt the general beamforming
theory to this particular domain. Algorithms proposed in-
clude the automatic selection of the reference channel, the
computation of the N-best channel delays, postprocessing
techniques to select the optimum delay values (including a
noise thresholding and a two-step selection algorithm via
Viterbi decoding), and a dynamic channel-weight estimation
to reduce the negative impact of low quality channels.
The system presented here was used as part of ICSI’s
submission to the Spring 2006 Rich Transcription evaluation
(RT06s) organized by NIST [12], both in the speaker diariza-
tion and in the speech recognition systems. Additionally, the
software is currently available as open-source [13].
The next section describes the modules used in the acoustic
beamforming system. Then, we present experimental results
showing the improvements gained by using the new system
within the task of speaker diarization, and finally, we present
results for the task of speech recognition.

JOURNAL OF L
A
T
E
X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 2
II. MULTICHANNEL ACOUSTIC BEAMFORMING SYSTEM
IMPLEMENTATION
The acoustic beamforming system is based on the weighted-
delay&sum microphone array theory, which is a generalization
of the well-known delay&sum beamforming tech-
nique ([14], [15]). The signal output y[n] is expressed as the
weighted sum of the different channels as follows:
y[n] = \sum_{m=1}^{M} W_m[n]\, x_m\big[n - \mathrm{TDOA}_{(m,\mathrm{ref})}[n]\big] \qquad (1)

where W_m[n] is the relative weight for microphone m (out of
M microphones) at instant n, with the sum of all weights
equal to 1; x_m[n] is the signal for each channel, and
\mathrm{TDOA}_{(m,\mathrm{ref})}[n] (Time Delay of Arrival) is the relative delay
between each channel and the reference channel, used to
align all signals with each other at each instant n.
In practice, \mathrm{TDOA}_{(m,\mathrm{ref})}[n] is estimated via cross-correlation
techniques once every several acoustic frames. In the imple-
mentation presented here it corresponds to once every 250 ms,
using GCC-PHAT (Generalized Cross Correlation with Phase
Transform) as proposed in [16] and [17] and described below.
We will refer to these as "acoustic segments", and we will
refer to the (usually larger) set of frames used to estimate the
cross-correlation measure as the "analysis window".
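To make eq. (1) concrete, the following is a minimal sketch of the weighted delay-and-sum, assuming the per-segment weights and integer TDOA values (in samples) have already been estimated by the blocks described below; function and variable names are illustrative and not taken from the released tool [13].

```python
# Minimal weighted delay-and-sum following Eq. (1). Assumes the weights sum
# to 1 across channels at each segment and TDOAs are integer sample delays.
import numpy as np

def delay_and_sum(x, weights, tdoa, seg_len):
    """x: (M, n_samples) channel signals; weights, tdoa: (M, n_segments);
    seg_len: samples per acoustic segment (e.g. 250 ms at the sample rate)."""
    M, n = x.shape
    y = np.zeros(n)
    for c in range(weights.shape[1]):
        start, end = c * seg_len, min((c + 1) * seg_len, n)
        for m in range(M):
            # Read channel m shifted by its TDOA so all channels align,
            # then accumulate its weighted samples into the output.
            idx = np.clip(np.arange(start, end) - tdoa[m, c], 0, n - 1)
            y[start:end] += weights[m, c] * x[m, idx]
    return y
```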
The weighted-delay&sum technique was selected for use in
the meetings domain given the following set of constraints:
- Unknown locations of the microphones in the meeting room.
- Non-uniform microphone settings (gain, recording offsets, etc.).
- Unknown location and number of speakers in the room. Due to this constraint, any techniques based on known source locations are unsuitable.
- Unknown number of microphones in the meeting room. The system should be able to handle from 2 to >100 microphone channels.
Figure 1 shows the different blocks involved in the pro-
posed weighted-delay&sum process. The process can be split
into four main blocks. First, signal enhancement via Wiener
filtering is performed on each individual channel to reduce the
noise. Next, the information extraction block is in charge of
estimating which channel to use as the reference channel, an
overall weighting factor for the output signal, the skew present
in the ICSI meetings, and the N-best TDOA values at each
analysis segment. Third, a selection of the appropriate TDOA
delays between signals is obtained in order to optimally align
the channels before the sum. Finally, the signals are aligned
and summed. The output of the system is composed of the
acoustic signal and a vector of TDOA values, which can be
used as extra information about a speaker’s position. A more
detailed description of each block follows.
A. Individual Channel Signal Enhancement
Prior to doing any multichannel beamforming, each individ-
ual channel is Wiener filtered [18]. This aims at cleaning the
signal of corrupting noise, which is assumed to be additive
and of a stochastic nature. The implementation of Wiener
filtering is taken from the ICSI-SRI-UW system used for
ASR in [19], and applied to each channel independently.
This implementation performs an internal speech/non-speech
and noise power estimation for each channel independently,
ignoring any multichannel properties or microphone locations.
The use of such filtering improves the beamforming as it
increases the quality of the signal, even though it introduces
a small phase nonlinearity given that the filter is not of
linear phase. Alternative multichannel Wiener filters were
not considered but could further improve results by taking
advantage of redundancies in the different input channels.
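The filter actually used is the ICSI-SRI-UW implementation cited above; purely as a rough stand-in, the sketch below applies a textbook single-channel spectral Wiener gain with a crude noise estimate taken from the lowest-energy frames, ignoring the internal speech/non-speech estimation of the real system.

```python
# Hedged stand-in for single-channel Wiener filtering: STFT analysis, a
# Wiener gain 1 - N/P with the noise PSD estimated from the 10% lowest-energy
# frames (assumed non-speech), and overlap-add resynthesis.
import numpy as np

def wiener_filter(x, frame=512, hop=256):
    win = np.hanning(frame)
    frames = np.stack([x[i:i + frame] * win
                       for i in range(0, len(x) - frame, hop)])
    X = np.fft.rfft(frames, axis=1)
    power = np.abs(X) ** 2
    # Crude noise PSD: average of the lowest-energy frames; the real
    # implementation estimates this adaptively per channel.
    order = np.argsort(power.sum(axis=1))
    noise_psd = power[order[: max(1, len(order) // 10)]].mean(axis=0)
    gain = np.maximum(1.0 - noise_psd / np.maximum(power, 1e-12), 0.0)
    Y = gain * X
    # Overlap-add resynthesis of the filtered frames.
    y = np.zeros(len(x))
    for k, i in enumerate(range(0, len(x) - frame, hop)):
        y[i:i + frame] += np.fft.irfft(Y[k], n=frame)
    return y
```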
B. Meeting Information Extraction block
The algorithms in this block extract information from the
input signals to be used further on in the process to construct
the output signal. It is composed of four algorithms: reference
channel estimation, overall channels weighting factor, ICSI
meetings skew estimation, and the TDOA N-best delays
estimation.
1) Reference Channel Estimation: This algorithm attempts
to automatically find the most centrally located and best
quality channel to be used as the reference channel in further
processing. It is important for this channel to be the best
representative of the acoustics in the meeting, as the correct
estimation of the delays of each of the channels depends on
the reference chosen.
In the meetings used for the Rich Transcription evaluations
[20], there is one microphone that is selected as the most cen-
trally located microphone. This microphone channel is used in
the Single Distant Microphone (SDM) task. The SDM channel
is chosen given the room layout and the prior knowledge of
the microphone types. The module presented here ignores
the channel chosen for the SDM condition and selects one
microphone automatically based only on the acoustics. This
is intended for system robustness in cases where absolutely
no information is available on the room layout or microphone
placements.
In order to find the reference channel, we use a metric
based on a time-average of the cross-correlation between each
channel i and all of the others j = 1 . . . M, j ≠ i, computed
on segments of 1 second, as
\overline{\mathrm{xcorr}}_i = \frac{1}{K(M-1)} \sum_{k=1}^{K} \sum_{j=1,\, j \neq i}^{M} \mathrm{xcorr}[i,j;k] \qquad (2)
where M is the total number of channels/microphones and
K = 200 indicates the number of one second blocks used
in the average. The xcorr[i, j; k] indicates a standard cross-
correlation measure between channels i and j for each block
k. The channel i with the highest average cross-correlation
was chosen as the reference channel. An alternative SNR
metric was analyzed and the results were not conclusive as
to which method performed better in all cases. The cross-
correlation metric was chosen as it matches the algorithm's
search for maximum correlation values and because it is
simple to implement.
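A sketch of the selection metric of eq. (2) follows: for each channel, the peak cross-correlation with every other channel is averaged over K one-second blocks, and the channel with the highest average is returned. The per-block unit-energy normalization is an implementation assumption, as the text does not fix the exact cross-correlation variant.

```python
# Reference channel selection per Eq. (2): average cross-correlation of each
# channel against all others over K one-second blocks.
import numpy as np
from scipy.signal import correlate

def pick_reference(x, fs, K=200):
    """x: (M, n_samples) channel signals; fs: sample rate in Hz."""
    M = x.shape[0]
    score = np.zeros(M)
    for k in range(K):
        block = x[:, k * fs:(k + 1) * fs]
        # Unit-energy normalization so correlation peaks are comparable.
        b = block / (np.linalg.norm(block, axis=1, keepdims=True) + 1e-12)
        for i in range(M):
            for j in range(M):
                if i != j:
                    # Peak of the cross-correlation as xcorr[i, j; k].
                    score[i] += correlate(b[i], b[j], mode="full",
                                          method="fft").max()
    # Eq. (2): normalize by K(M-1) and pick the best channel.
    return int(np.argmax(score / (K * (M - 1))))
```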

JOURNAL OF L
A
T
E
X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 3
[Fig. 1. Weighted-delay&sum block diagram. Four blocks: A) individual channel signal enhancement (Wiener filtering); B) meeting information extraction (reference channel computation, overall channels weighting factor, ICSI meetings skew estimation, GCC-PHAT cross-correlation N-best estimation); C) TDOA values selection (noise threshold estimation and thresholding, dual-pass Viterbi decoding of delays); D) output signal generation (interchannel output weight adaptation, bad-quality segments elimination, channels sum), producing the enhanced signal and the TDOA values.]
2) Overall Channels Weighting Factor: For practical rea-
sons, speech processing applications use acoustic data that
was sampled with a limited number of bits (e.g. 16 bits per
sample) providing a certain amount of dynamic range, which
is often not fully used because the recorded signals are of low
amplitude. When summing up several input signals, we are
increasing the resolution in the resulting signal, and thus we
must try to take advantage of as much of the output resolution
as possible. The overall channel weighting factor is used to
normalize the input signals to match the file’s available dy-
namic range. It is useful for low amplitude input signals since
the beamformed output has greater resolution and therefore
can be scaled appropriately to minimize the quantization errors
generated by scaling it to the output sampling requirements.
There are several methods in signal processing for finding
the maximum value of a signal in order to perform amplitude
normalization. These include computing the absolute maximum
amplitude, the Root Mean Square (RMS) value, or other
variations of these, over the entire recording. It was observed in
meetings data that the signals may contain low energy areas
(silence regions) with short average durations, and high energy
areas (impulsive noises like door slams, or laughs), with even
shorter duration. Using the absolute maximum or RMS would
“saturate” the normalizing factor to the highest possible value
or bias it according to the amount of silence in the meeting.
So instead, we chose a windowed maximum averaging to try
to increase the likelihood that every window contains some
speech. In each window the maximum value is found and
these max values are averaged over the entire recording. The
weighting factor was obtained directly from this average.
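The windowed maximum averaging can be sketched as follows; the window length and the target output level are assumptions, since the text does not specify them.

```python
# Overall weighting factor via windowed maximum averaging: the maximum
# absolute amplitude of each window is averaged over the recording, which is
# robust both to silences and to short impulsive noises (door slams, laughs).
import numpy as np

def overall_weighting_factor(x, fs, win_sec=10.0, target=30000.0):
    """x: 1-D signal; returns a scale factor toward the 16-bit range."""
    w = int(win_sec * fs)
    maxima = [np.abs(x[i:i + w]).max() for i in range(0, len(x) - w + 1, w)]
    return target / max(np.mean(maxima), 1.0)
```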
3) ICSI Meetings Skew Estimation: This module was cre-
ated to deal with the meetings that come from the ICSI
Meeting Corpus, some of which have an error in the syn-
chronization of the channels. This was originally detected and
reported in [21], indicating that the hardware used for the
recordings was found not to keep an exact synchronization
between the different channels, resulting in a skew between
channels of multiples of 2.64 ms. It is not possible to know
beforehand the amount of skew of each of the channels as the
room setup did not follow a consistent ordering regarding the
connections to the hardware being used. Therefore we need to
automatically detect such skew so that it does not affect the
beamforming.
The artificially generated skew does not affect the general
processing of the channels by an ASR system as it does not
need exact time alignment between the channels: utterance
boundaries always include a silence “guard” region, and
the usual parametrizations (10-20ms long) cover small time
differences.
It does pose a problem though when computing the delays
between channels as it introduces an artificial delay between
channel pairs, which forces us to use a larger analysis window
for the ICSI meetings than with other meetings in order
to compute the delays accurately. This increases the chance
of delay estimation error. This module is therefore used to
estimate the skew between each channel and the reference
channel (in the case of ICSI meetings) and use it as a constant
bias in the rest of the delay processing.
In order to estimate the bias, an average cross-correlation
metric was put in place in order to obtain the average (across
time) delay between each channel and the reference channel
for a set of long acoustic windows (around 20 seconds), evenly
distributed along the meeting.
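A possible realization of this estimate is sketched below: the delay between each channel and the reference is measured over a few long windows spread across the meeting, averaged, and snapped to the nearest multiple of 2.64 ms. The number of windows and the snapping step are assumptions consistent with the text.

```python
# Skew (constant bias) estimation for ICSI meetings: average long-window
# cross-correlation delays, then round to a multiple of 2.64 ms.
import numpy as np
from scipy.signal import correlate

def estimate_skew(x_m, x_ref, fs, n_windows=5, win_sec=20.0):
    w = int(win_sec * fs)
    starts = np.linspace(0, len(x_ref) - w, n_windows, dtype=int)
    delays = []
    for s in starts:
        a, b = x_m[s:s + w], x_ref[s:s + w]
        c = correlate(a - a.mean(), b - b.mean(), mode="full", method="fft")
        delays.append(np.argmax(c) - (w - 1))   # lag in samples
    unit = 0.00264 * fs                         # 2.64 ms in samples
    return round(np.mean(delays) / unit) * unit # constant bias (samples)
```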
4) TDOA N-best delays estimation: The time delay of arrival
(TDOA) between each of the channels and the reference
channel is computed in segments of 250
ms. This allows the beamforming to quickly modify its beam
steering whenever the active speaker changes. In this imple-
mentation the TDOA was computed over a window of 500ms
(called the analysis window), which covers the current analysis
segment and the next. The sizes of the analysis window and
of the segment constitute a tradeoff. A large analysis window
or segment window leads to a reduction in the resolution
of changes in the TDOA. On the other hand, using a small
analysis window reduces the robustness of the estimation. The
reduction of the segment size also increases the computational
cost of the system, while not increasing the quality of the
output signal. The selection of the scroll and analysis window
sizes was done empirically given some development data and
no exhaustive study was performed to fine-tune these values.

JOURNAL OF L
A
T
E
X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 4
In order to compute the TDOA between the reference
channel and any other channel for any given segment it is
usual to estimate it as the delay that maximizes the cross-
correlation between the two segments. In current beamforming
systems, the use of the cross-correlation in its classical form
\big(R^{i,\mathrm{ref}}_{\mathrm{xcorr}}(d) = \sum_{n=0}^{N} x_i[n]\, x_{\mathrm{ref}}[n+d]\big) is avoided as it is very
sensitive to noise and reverberation. To improve robustness
against these problems, it is common practice to use the
GCC-PHAT. This variation of the standard cross-correlation
applies an amplitude normalization in the frequency domain,
maintaining the phase information, which conveys the delay
information between the signals.
Given two signals x_i(n) and x_{\mathrm{ref}}(n), the GCC-PHAT is
computed as:

\hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d) = \mathcal{F}^{-1}\!\left( \frac{X_i(f)\,[X_{\mathrm{ref}}(f)]^{*}}{\big| X_i(f)\,[X_{\mathrm{ref}}(f)]^{*} \big|} \right) \qquad (3)

where X_i(f) and X_{\mathrm{ref}}(f) are the Fourier transforms of the
two signals, \mathcal{F}^{-1} indicates the inverse Fourier transform,
[\cdot]^{*} denotes the complex conjugate and |\cdot| is the modulus.
The resulting \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d) is the correlation function between
signals i and ref. All its values range from 0 to 1 given
the frequency-domain amplitude normalization performed.
The time delay of arrival (TDOA) for these two microphones
(i and ref) is estimated as

\mathrm{TDOA}^{i}_{1} = \arg\max_{d} \big( \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d) \big) \qquad (4)

which we denote with subscript 1 (1st-best) to differentiate it
from further computed values.
Although the maximum value of \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d) corresponds to
the estimated TDOA for that particular segment and micro-
phone pair, it does not always "point" at the correct speaker
during that segment. In the system proposed here the top
N relative maxima of \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d) are computed instead (we
use N around 4), and several post-processing techniques are
used to "stabilize" and choose the appropriate delay before
aligning the signals for the sum. Therefore, for each analysis
segment we obtain a vector \mathrm{TDOA}^{i}_{n} for each microphone
i = 1 \ldots M, i \neq \mathrm{ref}, with its corresponding correlation values
\mathrm{GCC\text{-}PHAT}^{i}_{n}, with n = 1 \ldots N.
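For reference, a compact GCC-PHAT N-best sketch (eqs. (3)-(4)) is given below; N = 4 follows the text, while the +/- 30 ms delay search range is an assumption.

```python
# GCC-PHAT and N-best peak picking: whiten the cross-spectrum to keep only
# phase, invert, and take the N highest local maxima within a delay range.
import numpy as np

def gcc_phat_nbest(x_i, x_ref, fs, n_best=4, max_delay_s=0.030):
    n = len(x_i) + len(x_ref)
    X = np.fft.rfft(x_i, n=n) * np.conj(np.fft.rfft(x_ref, n=n))
    r = np.fft.irfft(X / np.maximum(np.abs(X), 1e-12), n=n)
    max_lag = int(max_delay_s * fs)
    # Rearrange so index 0 corresponds to lag -max_lag .. +max_lag.
    r = np.concatenate([r[-max_lag:], r[: max_lag + 1]])
    # Local maxima, sorted by GCC-PHAT value.
    peaks = [k for k in range(1, len(r) - 1) if r[k - 1] < r[k] >= r[k + 1]]
    peaks.sort(key=lambda k: r[k], reverse=True)
    tdoas = [k - max_lag for k in peaks[:n_best]]   # delays in samples
    values = [float(r[k]) for k in peaks[:n_best]]  # GCC-PHAT values
    return tdoas, values
```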
We identified three cases where it is not appropriate to use
the absolute maximum (1st-best) of \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d). First, the
maximum can be due to spurious noises or events not related
to the active speaker, while the active speaker is actually
represented by another local maximum of the cross-correlation.
Second, when two or more speakers are speaking simultaneously,
each speaker will be represented by a different maximum in the
cross-correlation function, but the absolute maximum might
not be constantly assigned to the same speaker, resulting in
artificial speaker switching. Finally, when the processed segment
is entirely filled with non-speech acoustic data (either noise or
random acoustic events), the \hat{R}^{i,\mathrm{ref}}_{\mathrm{PHAT}}(d)
function peaks randomly over all possible delays, making it
unsuitable for beamforming. In this case no source delay
information can be extracted from the signal, and the delays
ought to be discarded entirely and substituted by others from
the surrounding time frames, as will be seen in the next section.
C. TDOA Values Selection/Post-Processing
Once the TDOA values of all channels across the whole meeting
have been computed, it is desirable to apply a TDOA post-
processing to obtain the set of delay values to be applied to
each of the signals when performing the weighted-delay&sum
as proposed in eq. 1. We implemented two filtering steps,
a noisy TDOA detection and elimination (TDOA continuity
enhancement), and 1-best TDOA selection from the N-best
vector.
1) Noisy TDOA Thresholding: This first proposed filtering
step is intended to detect those TDOA values that are not
reliable. A TDOA value does not show any useful information
when it is computed over a silence (or mainly silence) region
or when the SNR of either of the signals being compared
is low, making them very dissimilar. The first problem could
be addressed by using a speech/non-speech detector prior to
any further processing, but prior experimentation indicated
that further errors were introduced due to the detector. The
selected algorithm applies a simple continuity filter on the
TDOA values for each segment c based on their GCC-PHAT
values by using a noise threshold \Theta_{\mathrm{noise}} in the following way:

\mathrm{TDOA}^{i}_{n}[c] = \begin{cases} \mathrm{TDOA}^{i}_{n}[c-1] & \text{if } \mathrm{GCC\text{-}PHAT}^{i}_{1}[c] < \Theta_{\mathrm{noise}} \\ \mathrm{TDOA}^{i}_{n}[c] & \text{if } \mathrm{GCC\text{-}PHAT}^{i}_{1}[c] \geq \Theta_{\mathrm{noise}} \end{cases} \qquad (5)
where \Theta_{\mathrm{noise}} is defined as the minimum correlation value
below which the correlation cannot be assumed to return
reliable results. It is set independently for every meeting, as
the correlation values are dependent not only on the signal
quality but also on the microphone distribution in the different
meeting rooms. In order to find an appropriate value for it, the
histogram of the distribution of correlation values needs to be
evaluated for each meeting. In our implementation a threshold
was selected at the value which filters out the lowest 10% of
the cross-correlation frames, using the histogram for all cross-
correlation values from all microphones in each meeting.
Experimentation showed that the final performance did not
decrease when computing a threshold over the distribution
of all correlation values together, compared to individual
threshold values computed for each channel independently,
which would impose a higher computational burden on the
system.
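The thresholding of eq. (5) together with the meeting-wide lowest-10% histogram rule can be sketched as follows; pooling the 1st-best GCC-PHAT values of all channels and taking the 10th percentile is how we read the text.

```python
# Noisy TDOA thresholding per Eq. (5): set the threshold at the 10th
# percentile of all 1st-best GCC-PHAT values in the meeting, and carry over
# the previous segment's delays whenever a segment falls below it.
import numpy as np

def threshold_tdoas(tdoa, gcc1):
    """tdoa: (M, C, N) N-best delays per channel and segment.
    gcc1: (M, C) 1st-best GCC-PHAT value per channel and segment."""
    theta = np.percentile(gcc1, 10.0)      # filters the lowest 10% of frames
    out = tdoa.copy()
    M, C, _ = tdoa.shape
    for m in range(M):
        for c in range(1, C):
            if gcc1[m, c] < theta:
                out[m, c] = out[m, c - 1]  # reuse previous segment's values
    return out
```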
2) Dual-Step Viterbi Post-Processing: This second post-
processing technique applied to the computed delays is used
to select the appropriate delay among the N-best
GCC-PHAT values computed previously. The aim here
is to maximize speaker continuity by avoiding constant delay
switching in the case of multiple speakers, and to filter out
undesired beam steering towards spurious noises present in
the room.
As seen in figure 2, a two-step Viterbi decoding of the N-best
TDOA values is proposed. The first step consists of a local
(single-channel) decoding where the two best delays are chosen
from the N-best delays computed for that channel at every
segment. The second decoding step considers all combinations
of two-best delays across all channels, and selects the final
single TDOA value that is most consistent across all channels.
[Fig. 2. Weighted-delay&sum double-Viterbi delay selection: an individual Viterbi pass per channel selects the 2-best TDOA values from that channel's N-best values, and a second, multiple-channel Viterbi pass then selects the final TDOA values across all channels.]
For each step, one needs to define the topology of the state
sequence used in the Viterbi decoding and the emission and
transition weights to be used. The use of a two-step algorithm
is due in part to computational constraints since an exhaustive
search over all possible combinations of all N -best values for
all channels would easily become computationally prohibitive.
Both steps choose the most probable (and second most
probable) sequence of hidden states where each one is related
to the TDOA values computed for one segment. In the first
step the set of possible states at each segment c is given
by the computed N-best values. Each possible state has an
emission probability-like value for each processed segment.
This value is equal to the \log \mathrm{GCC\text{-}PHAT}^{i}_{n}[c] value for channel
i, with n = 1 \ldots N. No prior scaling or normalization is
required as the GCC-PHAT values range from 0 to 1 (given the
amplitude normalization performed in the frequency domain
in its definition).
The transition weight between two states in step 1 decreases
linearly with the distance between their delays. Given two
nodes, i and j at segments c and c-1, respectively, the
transition weight for a given channel m is defined as

\mathrm{Tr}^{m}_{1}[i,j;c] = \frac{\Delta\mathrm{diff}^{m}[i,j;c] - \big| \mathrm{TDOA}^{m}_{i}[c] - \mathrm{TDOA}^{m}_{j}[c-1] \big|}{\Delta\mathrm{diff}^{m}[i,j;c]} \qquad (6)

where \Delta\mathrm{diff}^{m}[i,j;c] = \max_{\forall i,j} \big( \big| \mathrm{TDOA}^{m}_{i}[c] - \mathrm{TDOA}^{m}_{j}[c-1] \big| \big).
This way all transition weights are locally bounded
between 0 and 1, assigning a 0 weight to the furthest-away
delay pair. This implies that only N-1 TDOA values will
be considered at each segment.
This first Viterbi step aims at finding the two best TDOA
values (from the computed N -best) that represent the meet-
ing’s speakers at any given time. By doing so it is believed that
the system will be able to choose the most appropriate/stable
TDOA value for that segment and a secondary delay, which
may come from interfering events, e.g. other speakers or the
same speaker’s echoes. The TDOA values can be any two (not
allowing the paths to collapse) of the N-best TDOA values
computed previously by the system, and are chosen exclusively
based on their distance to surrounding TDOA values and their
GCC-PHAT values.
The second pass Viterbi decoding finds the best possible
path given the set of hidden states generated by all possible
combinations of delays from the two-best delays obtained ear-
lier for each channel. Given a vector g(l) of dimension M-1
(the number of channels for which TDOA values are
computed), which is the l-th combination of possible indexes
from the 2-best TDOA values for each channel (obtained in
step 1), it is expanded as g(l) = [g(l,1) \ldots g(l,M-1)], where
each element g(l,m) \in \{0,1\}, with 2^{M-1} possible combinations.
One can rewrite \mathrm{GCC\text{-}PHAT}^{m}_{g(l,m)}[c] as the GCC-PHAT value
associated with the g(l,m)-best TDOA value for channel
m at segment c, which takes values in [0,1]. Then the
emission probability-like values are obtained as the (log-domain)
product of the individual GCC-PHAT values of each considered
TDOA combination g(l) at segment c as

P_{2}(g(l))[c] = \sum_{m=1}^{M} \log\big( \mathrm{GCC\text{-}PHAT}^{m}_{g(l,m)}[c] \big) \qquad (7)

which can be considered the extension of the individual-channel
emission probability-like values to the case of multiple
TDOA values, where we consider the different dimensions
to be independent from each other (interpreted as independence
of the TDOA values obtained for each channel at segment c,
not of their relationship with each other in space along time).
The transition weights are computed in a similar way as in
the first step, but they now introduce a new dimension
to the computation, as a vector of possible TDOA values
needs to be taken into account. As was done with the emission
probability-like values, the total distance is considered to
be the sum of the individual distances for each element.
Assuming \mathrm{TDOA}^{m}_{g(l,m)}[c] is the TDOA value for the g(l,m)-
best element in channel m at segment c, the transition weights
between two TDOA combinations for all microphones are
determined by

\mathrm{Tr}_{2}[i,j;c] = \sum_{m=1}^{M} \frac{\Delta\mathrm{diff}[i,j;c] - \big| \mathrm{TDOA}^{m}_{g(i,m)}[c] - \mathrm{TDOA}^{m}_{g(j,m)}[c-1] \big|}{\Delta\mathrm{diff}[i,j;c]} \qquad (8)

where now \Delta\mathrm{diff}[i,j;c] = \max_{\forall i,j,m} \big( \big| \mathrm{TDOA}^{m}_{g(i,m)}[c] - \mathrm{TDOA}^{m}_{g(j,m)}[c-1] \big| \big).
This second processing step considers the relationship in
space present between all channels, as they are presumably
steering to the same position. By performing a decoding over
time, it selects the TDOA vector elements according to their
distance to nearby vectors.
In both cases, the transition weights are modified (raised to
a power) to emphasize their effect in the decision of the best
path. This is similar to the use of word-transition penalties in
ASR systems. It will be shown in the experiments section
that a weight of 25 in both cases appears to optimize the
diarization error rate on the development set.
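To ground the description, the sketch below implements the first (single-channel) Viterbi step with log GCC-PHAT emissions and the eq. (6) transition weights raised to the power 25. For brevity it backtracks only the single best path, whereas the actual system keeps the two best non-collapsing paths per channel and then runs the second, cross-channel step over their 2^{M-1} combinations.

```python
# Step-1 Viterbi over the N-best TDOA values of one channel: emissions are
# log GCC-PHAT values, transitions follow Eq. (6) raised to a power of 25.
import numpy as np

def viterbi_step1(tdoa, gcc, trans_pow=25.0):
    """tdoa, gcc: (C, N) N-best delays and GCC-PHAT values for one channel.
    Returns, per segment, the index of the chosen N-best candidate."""
    C, N = tdoa.shape
    score = np.log(np.maximum(gcc[0], 1e-12))   # initial emission scores
    back = np.zeros((C, N), dtype=int)
    for c in range(1, C):
        # Pairwise delay distances between segment c and c-1 candidates.
        dist = np.abs(tdoa[c][:, None] - tdoa[c - 1][None, :])   # (N, N)
        tr = (dist.max() - dist) / max(dist.max(), 1e-12)        # Eq. (6)
        # Accumulate: previous score + powered transition (in log domain).
        cand = score[None, :] + trans_pow * np.log(np.maximum(tr, 1e-12))
        back[c] = np.argmax(cand, axis=1)
        score = cand[np.arange(N), back[c]] + np.log(np.maximum(gcc[c], 1e-12))
    # Backtrace the best state sequence.
    path = [int(np.argmax(score))]
    for c in range(C - 1, 0, -1):
        path.append(int(back[c, path[-1]]))
    return path[::-1]
```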
To illustrate how the two-step Viterbi decoding works on the
TDOA values, let us consider the example in figure 3a. This
