IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 2, FEBRUARY 2012
Speaker Diarization: A Review of Recent Research
Xavier Anguera Miro, Member, IEEE, Simon Bozonnet, Student Member, IEEE, Nicholas Evans, Member, IEEE,
Corinne Fredouille, Gerald Friedland, Member, IEEE, and Oriol Vinyals
Abstract—Speaker diarization is the task of determining “who
spoke when?” in an audio or video recording that contains an
unknown amount of speech and also an unknown number of
speakers. Initially, it was proposed as a research topic related to
automatic speech recognition, where speaker diarization serves
as an upstream processing step. Over recent years, however,
speaker diarization has become an important key technology for
many tasks, such as navigation, retrieval, or higher level inference
on audio data. Accordingly, many important improvements in
accuracy and robustness have been reported in journals and
conferences in the area. The application domains, from broadcast
news, to lectures and meetings, vary greatly and pose different
problems, such as having access to multiple microphones and
multimodal information or overlapping speech. The most recent
review of existing technology dates back to 2006 and focuses on
the broadcast news domain. In this paper, we review the cur-
rent state-of-the-art, focusing on research developed since 2006
that relates predominantly to speaker diarization for conference
meetings. Finally, we present an analysis of speaker diarization
performance as reported through the NIST Rich Transcription
evaluations on meeting data and identify important areas for
future research.
Index Terms—Meetings, rich transcription, speaker diarization.
I. INTRODUCTION
Speaker diarization has emerged as an increasingly im-
portant and dedicated domain of speech research. Whereas
speaker and speech recognition involve, respectively, the recog-
nition of a person’s identity or the transcription of their speech,
speaker diarization relates to the problem of determining “who
spoke when?” More formally, this requires the unsupervised
identification of each speaker within an audio stream and the
intervals during which each speaker is active.
Manuscript received August 19, 2010; revised December 03, 2010; accepted
February 13, 2011. Date of current version January 13, 2012. This work was
supported in part by the joint-national “Adaptable Ambient Living Assistant”
(ALIAS) project funded through the European Ambient Assisted Living (AAL)
program under Agreement AAL-2009-2-049 and in part by the “Annotation
Collaborative pour l’Accessibilité Vidéo” (ACAV) project funded by the French
Ministry of Industry (Innovative Web call) under Contract 09.2.93.0966. The
work of X. Anguera Miro was supported in part by the Torres Quevedo Spanish
program. The associate editor coordinating the review of this manuscript and
approving it for publication was Prof. Sadaoki Furui.
X. Anguera Miro is with the Multimedia Research Group, Telefonica Re-
search, 08021 Barcelona, Spain (e-mail: xanguera@tid.es).
S. Bozonnet and N. Evans are with the Multimedia Communications
Department, EURECOM, 06904 Sophia Antipolis Cedex, France (e-mail:
bozonnet@eurecom.fr).
C. Fredouille is with the University of Avignon, CERI/LIA, F-84911 Avignon
Cedex 9, France (e-mail: corinne.fredouille@univ-avignon.fr).
G. Friedland and O. Vinyals are with the International Computer Science
Institute (ICSI), Berkeley, CA 94704 USA (e-mail: fractor@icsi.berkeley.edu;
evans@eurecom.fr).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2011.2125954
Speaker diarization has utility in a majority of applications
related to audio and/or video document processing, such as
information retrieval. Indeed, it is often the case
that audio and/or video recordings contain more than one
active speaker. This is the case for telephone conversations (for
example stemming from call centers), broadcast news, debates,
shows, movies, meetings, domain-specific videos (such as
surgical operations), or even lecture or conference
recordings including multiple speakers or question/answer
sessions. In all such cases, it can be advantageous to automat-
ically determine the number of speakers involved in addition
to the periods when each speaker is active. Clear examples of
applications for speaker diarization algorithms include speech
and speaker indexing, document content structuring, speaker
recognition (in the presence of multiple or competing speakers),
assistance in speech-to-text transcription (i.e., so-called speaker-at-
tributed speech-to-text), speech translation and, more generally,
Rich Transcription (RT), a community within which the current
state-of-the-art technology has been developed. The most sig-
nificant effort in the Rich Transcription domain comes directly
from the internationally competitive RT evaluations, sponsored
by the National Institute of Standards and Technology (NIST)
in the United States [1]. Initiated originally within the telephony
domain, and subsequently in that of broadcast news, today it is
in the domain of conference meetings that speaker diarization
receives the most attention. Speaker diarization is thus an
extremely important area of speech processing research.
An excellent review of speaker diarization research is pre-
sented in [2], although it predominantly focuses its attention on
speaker diarization for broadcast news. Coupled with the tran-
sition to conference meetings, however, the state-of-the-art has
advanced significantly since then. This paper presents an up-to-
date review of present state-of-the-art systems and reviews the
progress made in the field of speaker diarization since 2005,
including the most recent NIST RT evaluation, held in 2009.
Official evaluations are an important vehicle
for pushing the state-of-the-art forward as it is only with stan-
dard experimental protocols and databases that it is possible to
meaningfully compare different approaches. While we also ad-
dress emerging new research in speaker diarization, in this paper
special emphasis is placed on established technologies within
the context of the NIST RT benchmark evaluations, which have
become a reliable indicator of the current state-of-the-art in
speaker diarization. This paper aims at giving a concise refer-
ence overview of established approaches, both for the general
reader and for those new to the field. Despite rapid gains in
popularity over recent years, the field is relatively embryonic
compared to the mature fields of speech and speaker recogni-
tion. There are outstanding opportunities for contributions and
we hope that this paper serves to encourage others to participate.

Section II presents a brief history of speaker diarization
research and the transition to the conference meeting domain.
We describe the main differences between broadcast news
and conference meetings and present a high-level overview of
current approaches to speaker diarization. In Section III, we
present a more detailed description of the main algorithms that
are common to many speaker diarization systems, including
those recently introduced to make use of information coming
from multiple microphones, namely delay-and-sum beam-
forming. Section IV presents some of the most recent work in
the field including efforts to handle multimodal information
and overlapping speech. We also discuss the use of features
based on inter-channel delay and prosodics and also attempts
to combine speaker diarization systems. In Section V, we
present an overview of the current status in speaker diarization
research. We describe the NIST RT evaluations, the different
datasets and the performance achieved by state-of-the-art sys-
tems. We also identify the remaining problems and highlight
potential solutions in the context of current work. Finally, our
conclusions are presented in Section VI.
II. SPEAKER DIARIZATION
Over recent years, the scientific community has developed
research on speaker diarization in a number of different do-
mains, with the focus usually being dictated by funded research
projects. From early work with telephony data, broadcast
news (BN) became the main focus of research towards the
late 1990s and early 2000s and the use of speaker diariza-
tion was aimed at automatically annotating TV and radio
transmissions that are broadcast daily all over the world. An-
notations included automatic speech transcription and meta
data labeling, including speaker diarization. Interest in the
meeting domain grew extensively from 2002, with the launch
of several related research projects including the European
Union (EU) Multimodal Meeting Manager (M4) project, the
Swiss Interactive Multimodal Information Management (IM2)
project, the EU Augmented Multi-party Interaction (AMI)
project, subsequently continued through the EU Augmented
Multi-party Interaction with Distant Access (AMIDA) project
and, finally, the EU Computers in the Human Interaction
Loop (CHIL) project. All these projects addressed the research
and development of multimodal technologies dedicated to the
enhancement of human-to-human communications (notably in
distant access) by automatically extracting meeting content,
making the information available to meeting participants, or for
archiving purposes.
These technologies have to meet challenging demands such
as content indexing, linking and/or summarization of ongoing
or archived meetings, the inclusion of both verbal and nonverbal
human communication (people movements, emotions, interac-
tions with others, etc.). This is achieved by exploiting several
synchronized data streams, such as audio, video and textual in-
formation (agenda, discussion papers, slides, etc.), that are able
to capture different kinds of information that are useful for the
structuring and analysis of meeting content. Speaker diarization
plays an important role in the analysis of meeting data since it al-
lows for such content to be structured in speaker turns, to which
linguistic content and other metadata can be added (such as the
dominant speakers, the level of interactions, or emotions).
Undertaking benchmarking evaluations has proven to be
an extremely productive means for estimating and comparing
algorithm performance and for verifying genuine technolog-
ical advances. Speaker diarization is no exception and, since
2002, the US National Institute of Standards and Technology
(NIST) has organized official speaker diarization evaluations¹
involving broadcast news (BN) and, more recently, meeting
data. These evaluations have crucially contributed to bringing
researchers together and to stimulating new ideas to advance the
state-of-the-art. While other contrastive sub-domains such as
lecture meetings and coffee breaks have also been considered,
the conference meeting scenario has been the primary focus
of the NIST RT evaluations since 2004. The meeting scenario
is often referred to as “speech recognition complete,” i.e., a
scenario in which all of the problems that arise in any speech
recognition task can be encountered in this domain. Conference
meetings thus pose a number of new challenges to speaker
diarization that typically were less relevant in earlier research.
A. Broadcast News Versus Conference Meetings
With the change of focus of the NIST RT evaluations from BN
to meetings, diarization algorithms had to be adapted according
to the differences in the nature of the data. First, BN speech
data is usually acquired using boom or lapel microphones with
some recordings being made in the studio and others in the
field. Conversely, meetings are usually recorded using desktop
or far-field microphones (single microphones or microphone ar-
rays) which are more convenient for users than head-mounted or
lapel microphones.²
As a result, the signal-to-noise ratio is gen-
erally better for BN data than it is for meeting recordings. Addi-
tionally, differences between meeting room configurations and
microphone placement lead to variations in recording quality,
including background noise, reverberation and variable speech
levels (depending on the distance between speakers and micro-
phones).
Second, BN speech is often read or at least prepared in ad-
vance while meeting speech tends to be more spontaneous in
nature and contains more overlapping speech. Although BN
recordings can contain speech that is overlapped with music,
laughter, or applause (far less common for conference meeting
data), in general, the detection of acoustic events and speakers
tends to be more challenging for conference meeting data than
for BN data.
Finally, the number of speakers is usually larger in BN but
speaker turns occur less frequently than they do in conference
meeting data, resulting in BN having a longer average speaker
turn length. An extensive analysis of BN characteristics is re-
ported in [3] and a comparison of BN and conference meeting
data can be found in [4].
¹Speaker diarization was evaluated prior to 2002 through NIST Speaker
Recognition (SR) evaluation campaigns (focusing on telephone speech) and
not within the RT evaluation campaigns.
²Meeting databases recorded for research purposes usually contain
head-mounted and lapel microphone recordings for ground-truth creation
purposes only.

Fig. 1. General diarization system. (a) Alternative clustering schemas.
(b) General speaker diarization architecture.
B. Main Approaches
Most present state-of-the-art speaker diarization systems
fit into one of two categories: the bottom-up and the top-down
approaches, as illustrated in Fig. 1(a). The top-down approach
is initialized with very few clusters (usually one) whereas the
bottom-up approach is initialized with many clusters (usually
more clusters than expected speakers). In both cases the aim
is to iteratively converge towards an optimum number of clus-
ters. If the final number is higher than the optimum then the
system is said to under-cluster. If it is lower it is said to over-
cluster. Both bottom-up and top-down approaches are generally
based on hidden Markov models (HMMs) where each state is a
Gaussian mixture model (GMM) and corresponds to a speaker.
Transitions between states correspond to speaker turns. In this
section, we briefly outline the standard bottom-up and top-down
approaches as well as two recently proposed alternatives: one
based on information theory and a second based on a nonpara-
metric Bayesian approach. Although these new approaches
have not been reported previously in the context of official NIST
RT evaluations they have shown strong potential on NIST RT
evaluation datasets and are thus included here. Additionally,
some other works propose sequential single-pass segmentation
and clustering approaches [5]–[7], although their performance
tends to fall short of the state-of-the-art.
1) Bottom-Up Approach: The bottom-up approach is by far
the most common in the literature. Also known as agglomer-
ative hierarchical clustering (AHC or AGHC), the bottom-up
approach trains a number of clusters or models and aims at
successively merging and reducing the number of clusters until
only one remains for each speaker. Various initializations have
been studied and, whereas some have investigated k-means clus-
tering, many systems use a uniform initialization, where the
audio stream is divided into a number of equal length abutted
segments. This simpler approach generally leads to equivalent
performance [8]. In all cases the audio stream is initially over-
segmented into a number of segments which exceeds the antic-
ipated maximum number of speakers. The bottom-up approach
then iteratively selects closely matching clusters to merge, hence
reducing the number of clusters by one upon each iteration.
Clusters are generally modeled with a GMM and, upon merging,
a single new GMM is trained on the data that was previously
assigned to the two individual clusters. Standard distance met-
rics, such as those described in Section III-C, are used to iden-
tify the closest clusters. A reassignment of frames to clusters
is usually performed after each cluster merging, via Viterbi re-
alignment for example, and the whole process is repeated itera-
tively, until some stopping criterion is reached, upon which there
should remain only one cluster for each detected speaker. Pos-
sible stopping criteria include thresholded approaches such as
the Bayesian Information Criterion (BIC) [9], Kullback–Leibler
(KL)-based metrics [10], the generalized likelihood ratio (GLR)
[11], or the metric recently proposed in [12]. Bottom-up systems
submitted to the NIST RT evaluations [9], [13] have performed
consistently well.
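To make the merge loop concrete, the following minimal sketch (our illustration, not any particular published system) implements the agglomerative procedure with single full-covariance Gaussians standing in for GMMs and a BIC-based merge and stopping criterion; a real system would interleave Viterbi realignment between merges, as noted above.

```python
import numpy as np

def bic(X, lam=1.0):
    """BIC of one full-covariance Gaussian fitted to X (n frames x d dims)."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)   # regularized covariance
    loglik = -0.5 * n * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1] + d)
    n_params = d + d * (d + 1) / 2                      # mean + covariance terms
    return loglik - 0.5 * lam * n_params * np.log(n)

def merge_gain(Xi, Xj):
    """Positive gain: one joint model beats two separate ones, so merge."""
    return bic(np.vstack([Xi, Xj])) - bic(Xi) - bic(Xj)

def bottom_up_diarization(segments):
    """Agglomerative clustering: merge the best pair until no gain remains."""
    clusters = [np.asarray(s, dtype=float) for s in segments]  # over-segmentation
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: merge_gain(clusters[p[0]], clusters[p[1]]))
        if merge_gain(clusters[i], clusters[j]) <= 0:   # stopping criterion
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters                                     # ideally one per speaker
```

The tunable weight lam plays the same role as the explicit BIC penalty term discussed in Section III-C.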
2) Top-Down Approach: In contrast with the previous ap-
proach, the top-down approach first models the entire audio
stream with a single speaker model and successively adds new
models to it until the full number of speakers is deemed to be
accounted for. A single GMM model is trained on all the speech
segments available, all of which are marked as unlabeled. Using
some selection procedure to identify suitable training data from
the unlabeled segments, new speaker models are iteratively
added to the model one-by-one, with interleaved Viterbi realign-
ment and adaptation. Segments attributed to any one of these
new models are marked as labeled. Stopping criteria similar to
those employed in bottom-up systems may be used to terminate
the process or it can continue until no more relevant unlabeled
segments with which to train new speaker models remain. Top-
down approaches are far less popular than their bottom-up coun-
terparts. Some examples include [14]–[16]. While they are gen-
erally out-performed by the best bottom-up systems, top-down
approaches have performed consistently and respectably well
against the broader field of other bottom-up entries. Top-down
approaches are also extremely computationally efficient and can
be improved through cluster purification [17].
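For contrast, a comparably simplified top-down sketch follows (again our illustration: diagonal Gaussians stand in for adapted GMMs, hard segment assignment stands in for Viterbi realignment, and a fixed speaker budget stands in for the stopping criteria discussed above).

```python
import numpy as np

def fit(X):
    """Diagonal-covariance Gaussian parameters for the frames in X."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def loglik(X, model):
    mu, var = model
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var))

def top_down_diarization(segments, max_speakers=4):
    """Start from one model over all speech; add speaker models one by one."""
    segments = [np.asarray(s, dtype=float) for s in segments]
    models = [fit(np.vstack(segments))]            # single model for everything
    for _ in range(max_speakers - 1):
        # Assign each segment to its best current model (Viterbi stand-in).
        assign = [max(range(len(models)), key=lambda k: loglik(s, models[k]))
                  for s in segments]
        # The worst-fitting segment (per frame) seeds the next speaker model.
        fits = [loglik(s, models[a]) / len(s) for s, a in zip(segments, assign)]
        models.append(fit(segments[int(np.argmin(fits))]))
        # Retrain every model on the segments now assigned to it.
        assign = [max(range(len(models)), key=lambda k: loglik(s, models[k]))
                  for s in segments]
        models = [fit(np.vstack([s for s, a in zip(segments, assign) if a == k]))
                  for k in range(len(models)) if k in assign]
    return models
```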
3) Other Approaches: A recent alternative approach, though
also bottom-up in nature, is inspired by rate-distortion theory
and is based on an information-theoretic framework [18]. It is
completely nonparametric and its results have been shown to
be comparable to those of state-of-the-art parametric systems,
with significant savings in computation. Clustering is based on
mutual information, which measures the mutual dependence
of two variables [19]. Only a single global GMM is tuned for
the full audio stream, and mutual information is computed in
a new space of relevance variables defined by the GMM com-
ponents. The approach aims at minimizing the loss of mutual
information between successive clusterings while preserving as
much information as possible from the original dataset. Two
suitable methods have been reported: the agglomerative infor-
mation bottleneck (aIB) [18] and the sequential information bot-
tleneck (sIB) [19]. While this approach does not outperform
parametric systems, it delivers results comparable to those of
state-of-the-art GMM systems with great savings in computation.
Alternatively, Bayesian machine learning became popular by
the end of the 1990s and has recently been used for speaker
diarization. The key component of Bayesian inference is that
it does not aim at estimating the parameters of a system (i.e.,
to perform point estimates), but rather the parameters of their

ANGUERA MIRO et al.: SPEAKER DIARIZATION: A REVIEW OF RECENT RESEARCH 359
related distribution (hyperparameters). This avoids premature
hard decisions in the diarization problem and automatically
regulates the system according to the observations (e.g.,
the complexity of the model is data dependent). However, the
computation of posterior distributions often requires intractable
integrals and, as a result, the statistics community has developed
approximate inference methods. Markov chain Monte Carlo
(MCMC) methods were first used [20] to provide a systematic approach
to the computation of distributions via sampling, enabling the
deployment of Bayesian methods. However, sampling methods
are generally slow and prohibitive when the amount of data is
large, and they must be run several times as the chains may
get stuck and not converge in a practical number of iterations.
Another alternative approach, known as Variational Bayes,
has been popular since 1993 [21], [22] and aims at providing a
deterministic approximation of the distributions. It enables an
inference problem to be converted to an optimization problem
by approximating the intractable distribution with a tractable
approximation obtained by minimizing the Kullback–Leibler
divergence between them. In [23] a Variational Bayes-EM
algorithm is used to learn a GMM speaker model and optimize
a change detection process and the merging criterion. In [24],
variational Bayes is combined successfully with eigenvoice
modeling, described in [25], for the speaker diarization of
telephone conversations. However, these systems still rely on
classical Viterbi decoding for the classification and
differ from the nonparametric Bayesian systems introduced in
Section IV-F.
Finally, the recently proposed speaker binary keys [26] have
been successfully applied to speaker diarization in meetings
[27] with similar performance to state-of-the-art systems but
also with considerable computational savings (running in
around 0.1 times real-time). Speaker binary keys are small bi-
nary vectors computed from the acoustic data using a universal
background model (UBM)-like model. Once they are computed
all processing tasks take place in the binary domain. Other
works in speaker diarization concerned with speed include [28],
[29] which achieve faster than real-time processing through the
use of several processing tricks applied to a standard bottom-up
approach ([28]) or by parallelizing most of the processing
on a GPU ([29]). The need for efficient diarization sys-
tems is emphasized when processing very large databases or
when using diarization as a preprocessing step to other speech
algorithms.
III. MAIN ALGORITHMS
Fig. 1(b) shows a block diagram of the generic modules which
make up most speaker diarization systems. The data prepro-
cessing step (Fig. 1(b)-i) tends to be somewhat domain spe-
cific. For meeting data, preprocessing usually involves noise re-
duction (such as Wiener filtering for example), multi-channel
acoustic beamforming (see Section III-A), the parameterization
of speech data into acoustic features (such as MFCC, PLP, etc.)
and the detection of speech segments with a speech activity
detection algorithm (see Section III-B). Cluster initialization
(Fig. 1(b)-ii) depends on the approach to diarization, i.e., the
choice of an initial set of clusters in bottom-up clustering [8],
[13], [30] (see Section III-C) or a single segment in top-down
clustering [15], [16]. Next, in Fig. 1(b)-iii/iv, a distance between
clusters and a split/merging mechanism (see Section III-D) is
used to iteratively merge clusters [13], [31] or to introduce new
ones [16]. Optionally, data purification algorithms can be used
to make clusters more discriminant [13], [17], [32]. Finally, as
illustrated in Fig. 1(b)-v, stopping criteria are used to determine
when the optimum number of clusters has been reached [33],
[34].
A. Acoustic Beamforming
The application of speaker diarization to the meeting domain
triggered the need for dealing with multiple microphones which
are often used to record the same meeting from different lo-
cations in the room [35]–[37]. The microphones can have dif-
ferent characteristics: wall-mounted microphones (intended for
speaker localization), lapel microphones, desktop microphones
positioned on the meeting room table or microphone arrays. The
use of different microphone combinations as well as differences
in microphone quality called for new approaches to speaker di-
arization with multiple channels.
The multiple distant microphone (MDM) condition was in-
troduced in the NIST RT’04 (Spring) evaluation. A variety of
algorithms have been proposed to extend mono-channel diariza-
tion systems to handle multiple channels. One option, proposed
in [38], is to perform speaker diarization on each channel inde-
pendently and then to merge the individual outputs. In order to
do so, a two-axis merging algorithm is used which considers the
longest detected speaker segments in each channel and iterates
over the segmentation output. In the same year, a late-stage fu-
sion approach was also proposed [39]. In it, speaker segmen-
tation is performed separately in all channels and diarization
is applied only taking into account the channel whose speech
segments have the best signal-to-noise ratio (SNR). Subsequent
approaches investigated preprocessing to combine the acoustic
signals to obtain a single channel which could then be processed
by a regular mono-channel diarization system. In [40], the mul-
tiple channels are combined with a simple weighted sum ac-
cording to their SNR. Though straightforward to implement, this
approach does not take into account the time difference of arrival between
each microphone channel and might easily lead to a decrease in
performance.
Since the NIST RT’05 evaluation, the most common ap-
proach to multi-channel speaker diarization involves acoustic
beamforming as initially proposed in [41] and described in de-
tail in [42]. Many RT participants use the free and open-source
acoustic beamforming toolkit known as BeamformIt [43]
which consists of an enhanced delay-and-sum algorithm to
correct misalignments due to the time-delay-of-arrival (TDOA)
of speech to each microphone. Speech data can optionally be
preprocessed with Wiener filtering [44] to attenuate noise,
for example using the implementation in [45]. A reference
channel is selected and
the other channels are appropriately aligned and combined with
a standard delay-and-sum algorithm. The contribution made by
each signal channel to the output is then dynamically weighted
according to its SNR or by using a cross-correlation-based
metric. Various additional algorithms are available in the
BeamformIt toolkit to select the optimum reference channel
and to stabilize the TDOA values between channels before the

360 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 2, FEBRUARY 2012
signals are summed. Finally, the TDOA estimates themselves
are made available as outputs and have been used successfully
to improve diarization, as explained in Section IV-A. Note
that, although there are other algorithms that can provide
better beamforming results for some cases, delay-and-sum
beamforming is the most reliable when no information on the
location or nature of each microphone is known a priori.
Among alternative beamforming algorithms we find maximum
likelihood (ML) [46] and generalized sidelobe canceller (GSC)
[47], which adaptively find the optimum parameters, and min-
imum variance distortionless response (MVDR) [48] when
prior information on ambient noise is available. All of these
have higher computational requirements and, in the case of the
adaptive algorithms, there is the danger of converging to inac-
curate parameters, especially when processing microphones of
different types.
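The core delay-and-sum operation is itself compact. The sketch below (our simplification, not BeamformIt) estimates a single TDOA per channel against a reference by cross-correlation and forms the weighted, delay-compensated sum; a real system estimates TDOAs over short windows with GCC-PHAT and smooths them over time.

```python
import numpy as np

def estimate_tdoa(ref, ch, max_lag):
    """Delay (in samples) of `ch` relative to `ref` via cross-correlation."""
    corr = np.correlate(ch, ref, mode="full")
    lags = np.arange(-len(ref) + 1, len(ch))       # lag axis for `full` mode
    valid = np.abs(lags) <= max_lag
    return int(lags[valid][np.argmax(corr[valid])])

def delay_and_sum(channels, max_lag=800, weights=None):
    """Align each equal-length channel to channels[0]; take a weighted sum."""
    channels = [np.asarray(c, dtype=float) for c in channels]
    if weights is None:                            # real systems weight by SNR
        weights = [1.0 / len(channels)] * len(channels)
    out = np.zeros(len(channels[0]))
    for w, ch in zip(weights, channels):
        lag = estimate_tdoa(channels[0], ch, max_lag)
        out += w * np.roll(ch, -lag)               # compensate estimated delay
    return out                                     # (edge wrap-around ignored)
```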
B. Speech Activity Detection
Speech activity detection (SAD) involves the labeling of
speech and nonspeech segments. SAD can have a significant
impact on speaker diarization performance for two reasons.
The first stems directly from the standard speaker diarization
performance metric, namely the diarization error rate (DER),
which takes into account both the false alarm and missed
speaker error rates (see Section VI-A for more details on
evaluation metrics); poor SAD performance will therefore
lead to an increased DER. The second follows from the fact
that nonspeech segments can disturb the speaker diarization
process, and more specifically the acoustic models involved in
the process [49]. Indeed, the inclusion of non-speech segments
in speaker modeling leads to less discriminant models and thus
increased difficulties in segmentation. Consequently, a good
compromise between missed and false alarm speech error rates
has to be found to enhance the quality of the following speaker
diarization process.
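For reference, the DER combines these error types as time-weighted fractions of the total scored speech time; in the usual NIST formulation (details in Section VI-A):

```latex
\mathrm{DER} = \frac{T_{\mathrm{FA}} + T_{\mathrm{miss}} + T_{\mathrm{spkr}}}{T_{\mathrm{total}}}
```

where T_FA is nonspeech falsely labeled as speech, T_miss is speech labeled as nonspeech, T_spkr is speech attributed to the wrong speaker, and T_total is the total scored speech time.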
SAD is a fundamental task in almost all fields of speech
processing (coding, enhancement, and recognition) and many
different approaches and studies have been reported in the
literature [50]. Initial approaches for diarization tried to solve
speech activity detection on the fly, i.e., by having a non-
speech cluster be a by-product of the diarization. However,
it became evident that better results are obtained using a
dedicated speech/nonspeech detector as a preprocessing step.
In the context of meetings nonspeech segments may include
silence, but also ambient noise such as paper shuffling, door
knocks or non-lexical noise such as breathing, coughing, and
laughing, among other background noises. Therefore, highly
variable energy levels can be observed in the nonspeech parts
of the signal. Moreover, differences in microphones or room
configurations may result in variable SNRs from one meeting
to another. Thus, SAD is far from trivial in this context
and typical techniques based on feature extraction (energy,
spectrum divergence between speech and background noise,
and pitch estimation) combined with a threshold-based decision
have proven to be relatively ineffective.
Model-based approaches tend to perform better
and rely on a two-class detector, with models pre-trained with
external speech and nonspeech data [6], [41], [49], [51], [52].
Speech and nonspeech models may optionally be adapted to
specific meeting conditions [15]. Discriminant classifiers such
as linear discriminant analysis (LDA) coupled with Mel fre-
quency cepstrum coefficients (MFCCs) [53] or support vector
machines (SVMs) [54] have also been proposed in the litera-
ture. The main drawback of model-based approaches is their re-
liance on external data for the training of speech and nonspeech
models which makes them less robust to changes in acoustic
conditions. Hybrid approaches have been proposed as a poten-
tial solution. In most cases, an energy-based detection is first ap-
plied in order to label a limited amount of speech and nonspeech
data for which there is high confidence in the classification. In a
second step, the labeled data are used to train meeting-specific
speech and nonspeech models, which are subsequently used in a
model-based detector to obtain the final speech/nonspeech seg-
mentation [9], [55]–[57]. Finally, [58] combines a model-based
with a 4-Hz modulation energy-based detector. Interestingly, in-
stead of being applied as a preprocessing stage, in this system
SAD is incorporated into the speaker diarization process.
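As an illustration of the hybrid scheme just described, here is a minimal sketch (our simplification: percentile thresholds and single Gaussians on frame energies, where a real system would train GMMs on cepstral features and smooth the decisions over time):

```python
import numpy as np

def hybrid_sad(frame_energies, low_pct=20, high_pct=80):
    """Two-pass speech activity detection sketch.

    Pass 1 (energy-based): percentile thresholds select high-confidence
    nonspeech and speech frames. Pass 2 (model-based): Gaussians trained on
    those frames classify every frame. Returns a boolean mask, True = speech.
    """
    e = np.asarray(frame_energies, dtype=float)
    lo, hi = np.percentile(e, [low_pct, high_pct])
    nonspeech, speech = e[e <= lo], e[e >= hi]     # high-confidence bootstrap

    def loglik(x, data):                           # pointwise Gaussian loglik
        mu, var = data.mean(), data.var() + 1e-6
        return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

    return loglik(e, speech) > loglik(e, nonspeech)
```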
C. Segmentation
In the literature, the term “speaker segmentation” is some-
times used to refer to both segmentation and clustering. While
some systems treat each task separately, many present
state-of-the-art systems tackle them simultaneously, as de-
scribed in Section III-E. In these cases the notion of strictly
independent segmentation and clustering modules is less rel-
evant. However, both modules are fundamental to the task of
speaker diarization and some systems, such as that reported in
[6], apply distinctly independent segmentation and clustering
stages. Thus, the segmentation and clustering modules are
described separately here.
Speaker segmentation is core to the diarization process and
aims at splitting the audio stream into speaker-homogeneous
segments or, alternatively, at detecting changes in speaker, also
known as speaker turns. The classical approach to segmentation
performs hypothesis testing using the acoustic segments in
two sliding, possibly overlapping, consecutive windows. For
each considered change point there are two possible hypotheses:
first, that both segments come from the same speaker (H0), and
thus that they can be well represented by a single model; and
second, that there are two different speakers (H1), and thus that
two different models are more appropriate. In practice, models
are estimated from each of the speech windows and some cri-
teria are used to determine whether they are best accounted for
by two separate models (and hence two separate speakers), or by
a single model (and hence the same speaker) by using an empir-
ically determined or dynamically adapted threshold [10], [59].
This is performed across the whole audio stream and a sequence
of speaker turns is extracted.
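The ΔBIC form of this test, discussed in the next paragraph, compares a single Gaussian fitted over both windows against two separate Gaussians. A minimal sketch (our illustration; window sizes and the penalty weight are hypothetical tuning choices):

```python
import numpy as np

def gauss_loglik(X):
    """Maximized log-likelihood of a full-covariance Gaussian fitted to X."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)
    return -0.5 * n * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1] + d)

def change_score(left, right, lam=1.0):
    """Delta-BIC at a candidate change point: positive favours two speakers
    (H1) over a single shared model (H0) after the complexity penalty."""
    d = left.shape[1]
    n_params = d + d * (d + 1) / 2        # extra mean + covariance under H1
    gain = (gauss_loglik(left) + gauss_loglik(right)
            - gauss_loglik(np.vstack([left, right])))
    return gain - 0.5 * lam * n_params * np.log(len(left) + len(right))

def detect_changes(features, win=300, step=100):
    """Slide two adjacent windows over the frame stream (frames x dims)."""
    return [t for t in range(win, len(features) - win + 1, step)
            if change_score(features[t - win:t], features[t:t + win]) > 0]
```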
Many different distance metrics have appeared in the liter-
ature. Next, we review the dominant approaches which have
been used for the NIST RT speaker diarization evaluations
during the last four years. The most common approach is that
of the Bayesian information criterion (BIC) and its associated
ΔBIC metric [33], which has proved to be extremely popular,
e.g., [60]–[62]. The approach requires the setting of an explicit
penalty term which controls the tradeoff between missed turns
and false alarms.

More recently, an approach to the unsupervised discriminant analysis of inter-channel delay features was proposed in [92] and results of approximately 20% DER were reported using delay features alone.