IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 2, FEBRUARY 2012
Speaker Diarization: A Review of Recent Research
Xavier Anguera Miro, Member, IEEE, Simon Bozonnet, Student Member, IEEE, Nicholas Evans, Member, IEEE,
Corinne Fredouille, Gerald Friedland, Member, IEEE, and Oriol Vinyals
Abstract—Speaker diarization is the task of determining “who
spoke when?” in an audio or video recording that contains an
unknown amount of speech and also an unknown number of
speakers. Initially, it was proposed as a research topic related to
automatic speech recognition, where speaker diarization serves
as an upstream processing step. Over recent years, however,
speaker diarization has become an important key technology for
many tasks, such as navigation, retrieval, or higher level inference
on audio data. Accordingly, many important improvements in
accuracy and robustness have been reported in journals and
conferences in the area. The application domains, from broadcast
news, to lectures and meetings, vary greatly and pose different
problems, such as having access to multiple microphones and
multimodal information or overlapping speech. The most recent
review of existing technology dates back to 2006 and focuses on
the broadcast news domain. In this paper, we review the cur-
rent state-of-the-art, focusing on research developed since 2006
that relates predominantly to speaker diarization for conference
meetings. Finally, we present an analysis of speaker diarization
performance as reported through the NIST Rich Transcription
evaluations on meeting data and identify important areas for
future research.
Index Terms—Meetings, rich transcription, speaker diarization.
I. INTRODUCTION
Speaker diarization has emerged as an increasingly im-
portant and dedicated domain of speech research. Whereas
speaker and speech recognition involve, respectively, the recog-
nition of a person’s identity or the transcription of their speech,
speaker diarization relates to the problem of determining “who
spoke when?” More formally, this requires the unsupervised
identification of each speaker within an audio stream and the
intervals during which each speaker is active.
Manuscript received August 19, 2010; revised December 03, 2010; accepted
February 13, 2011. Date of current version January 13, 2012. This work was
supported in part by the joint-national “Adaptable Ambient Living Assistant”
(ALIAS) project funded through the European Ambient Assisted Living (AAL)
program under Agreement AAL-2009-2-049 and in part by the “Annotation
Collaborative pour l’Accessibilité Vidéo” (ACAV) project funded by the French
Ministry of Industry (Innovative Web call) under Contract 09.2.93.0966. The
work of X. Anguera Miro was supported in part by the Torres Quevedo Spanish
program. The associate editor coordinating the review of this manuscript and
approving it for publication was Prof. Sadaoki Furui.
X. Anguera Miro is with the Multimedia Research Group, Telefonica Re-
search, 08021 Barcelona, Spain (e-mail: xanguera@tid.es).
S. Bozonnet and N. Evans are with the Multimedia Communications
Department, EURECOM, 06904 Sophia Antipolis Cedex, France (e-mail:
bozonnet@eurecom.fr).
C. Fredouille is with the University of Avignon, CERI/LIA, F-84911 Avignon
Cedex 9, France (e-mail: corinne.fredouille@univ-avignon.fr).
G. Friedland and O. Vinyals are with the International Computer Science
Institute (ICSI), Berkeley, CA 94704 USA (e-mail: fractor@icsi.berkeley.edu;
evans@eurecom.fr).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2011.2125954
Speaker diarization has utility in a majority of applications
related to audio and/or video document processing, such as
information retrieval. Indeed, it is often the case
that audio and/or video recordings contain more than one
active speaker. This is the case for telephone conversations (for
example stemming from call centers), broadcast news, debates,
shows, movies, meetings, domain-specific videos (such as
surgical operations), or even lecture or conference
recordings including multiple speakers or question/answer
sessions. In all such cases, it can be advantageous to automat-
ically determine the number of speakers involved in addition
to the periods when each speaker is active. Clear examples of
applications for speaker diarization algorithms include speech
and speaker indexing, document content structuring, speaker
recognition (in the presence of multiple or competing speakers),
assistance in speech-to-text transcription (i.e., so-called speaker-at-
tributed speech-to-text), speech translation and, more generally,
Rich Transcription (RT), a community within which the current
state-of-the-art technology has been developed. The most sig-
nificant effort in the Rich Transcription domain comes directly
from the internationally competitive RT evaluations, sponsored
by the National Institute of Standards and Technology (NIST)
in the United States [1]. Initiated originally within the telephony
domain, and subsequently in that of broadcast news, today it is
in the domain of conference meetings that speaker diarization
receives the most attention. Speaker diarization is thus an
extremely important area of speech processing research.
An excellent review of speaker diarization research is pre-
sented in [2], although it predominantly focuses its attention on
speaker diarization for broadcast news. Coupled with the tran-
sition to conference meetings, however, the state-of-the-art has
advanced significantly since then. This paper presents an up-to-
date review of present state-of-the-art systems and reviews the
progress made in the field of speaker diarization since 2005,
including the most recent NIST RT evaluation, held in 2009.
Official evaluations are an important vehicle
for pushing the state-of-the-art forward as it is only with stan-
dard experimental protocols and databases that it is possible to
meaningfully compare different approaches. While we also ad-
dress emerging new research in speaker diarization, in this paper
special emphasis is placed on established technologies within
the context of the NIST RT benchmark evaluations, which have
become a reliable indicator of the current state-of-the-art in
speaker diarization. This paper aims at giving a concise refer-
ence overview of established approaches, both for the general
reader and for those new to the field. Despite rapid gains in
popularity over recent years, the field is relatively embryonic
compared to the mature fields of speech and speaker recogni-
tion. There are outstanding opportunities for contributions and
we hope that this paper serves to encourage others to participate.

Section II presents a brief history of speaker diarization
research and the transition to the conference meeting domain.
We describe the main differences between broadcast news
and conference meetings and present a high-level overview of
current approaches to speaker diarization. In Section III, we
present a more detailed description of the main algorithms that
are common to many speaker diarization systems, including
those recently introduced to make use of information coming
from multiple microphones, namely delay-and-sum beam-
forming. Section IV presents some of the most recent work in
the field including efforts to handle multimodal information
and overlapping speech. We also discuss the use of features
based on inter-channel delay and prosodics and also attempts
to combine speaker diarization systems. In Section V, we
present an overview of the current status in speaker diarization
research. We describe the NIST RT evaluations, the different
datasets and the performance achieved by state-of-the-art sys-
tems. We also identify the remaining problems and highlight
potential solutions in the context of current work. Finally, our
conclusions are presented in Section VI.
II. SPEAKER DIARIZATION
Over recent years, the scientific community has developed
research on speaker diarization in a number of different do-
mains, with the focus usually being dictated by funded research
projects. From early work with telephony data, broadcast
news (BN) became the main focus of research towards the
late 1990s and early 2000s and the use of speaker diariza-
tion was aimed at automatically annotating TV and radio
transmissions that are broadcast daily all over the world. An-
notations included automatic speech transcription and meta
data labeling, including speaker diarization. Interest in the
meeting domain grew extensively from 2002, with the launch
of several related research projects including the European
Union (EU) Multimodal Meeting Manager (M4) project, the
Swiss Interactive Multimodal Information Management (IM2)
project, the EU Augmented Multi-party Interaction (AMI)
project, subsequently continued through the EU Augmented
Multi-party Interaction with Distant Access (AMIDA) project
and, finally, the EU Computers in the Human Interaction
Loop (CHIL) project. All these projects addressed the research
and development of multimodal technologies dedicated to the
enhancement of human-to-human communications (notably in
distant access) by automatically extracting meeting content,
making the information available to meeting participants, or for
archiving purposes.
These technologies have to meet challenging demands such
as content indexing, linking and/or summarization of ongoing
or archived meetings, the inclusion of both verbal and nonverbal
human communication (people movements, emotions, interac-
tions with others, etc.). This is achieved by exploiting several
synchronized data streams, such as audio, video and textual in-
formation (agenda, discussion papers, slides, etc.), that are able
to capture different kinds of information that are useful for the
structuring and analysis of meeting content. Speaker diarization
plays an important role in the analysis of meeting data since it al-
lows for such content to be structured in speaker turns, to which
linguistic content and other metadata can be added (such as the
dominant speakers, the level of interactions, or emotions).
Undertaking benchmarking evaluations has proven to be
an extremely productive means for estimating and comparing
algorithm performance and for verifying genuine technolog-
ical advances. Speaker diarization is no exception and, since
2002, the US National Institute of Standards and Technology
(NIST) has organized official speaker diarization evaluations¹
involving broadcast news (BN) and, more recently, meeting
data. These evaluations have crucially contributed to bringing
researchers together and to stimulating new ideas to advance the
state-of-the-art. While other contrastive sub-domains such as
lecture meetings and coffee breaks have also been considered,
the conference meeting scenario has been the primary focus
of the NIST RT evaluations since 2004. The meeting scenario
is often referred to as “speech recognition complete,” i.e., a
scenario in which all of the problems that arise in any speech
recognition task can be encountered in this domain. Conference
meetings thus pose a number of new challenges to speaker
diarization that typically were less relevant in earlier research.
A. Broadcast News Versus Conference Meetings
With the change of focus of the NIST RT evaluations from BN
to meetings, diarization algorithms had to be adapted according
to the differences in the nature of the data. First, BN speech
data is usually acquired using boom or lapel microphones with
some recordings being made in the studio and others in the
field. Conversely, meetings are usually recorded using desktop
or far-field microphones (single microphones or microphone ar-
rays) which are more convenient for users than head-mounted or
lapel microphones.²
As a result, the signal-to-noise ratio is gen-
erally better for BN data than it is for meeting recordings. Addi-
tionally, differences between meeting room configurations and
microphone placement lead to variations in recording quality,
including background noise, reverberation and variable speech
levels (depending on the distance between speakers and micro-
phones).
Second, BN speech is often read or at least prepared in ad-
vance while meeting speech tends to be more spontaneous in
nature and contains more overlapping speech. Although BN
recordings can contain speech that is overlapped with music,
laughter, or applause (far less common for conference meeting
data), in general, the detection of acoustic events and speakers
tends to be more challenging for conference meeting data than
for BN data.
Finally, the number of speakers is usually larger in BN but
speaker turns occur less frequently than they do in conference
meeting data, resulting in BN having a longer average speaker
turn length. An extensive analysis of BN characteristics is re-
ported in [3] and a comparison of BN and conference meeting
data can be found in [4].
¹Speaker diarization was evaluated prior to 2002 through NIST Speaker
Recognition (SR) evaluation campaigns (focusing on telephone speech) and
not within the RT evaluation campaigns.
²Meeting databases recorded for research purposes usually contain
head-mounted and lapel microphone recordings for ground-truth creation
purposes only.

Fig. 1. General diarization system. (a) Alternative clustering schemas.
(b) General speaker diarization architecture.
B. Main Approaches
Most present state-of-the-art speaker diarization systems
fit into one of two categories: the bottom-up and the top-down
approaches, as illustrated in Fig. 1(a). The top-down approach
is initialized with very few clusters (usually one) whereas the
bottom-up approach is initialized with many clusters (usually
more clusters than expected speakers). In both cases the aim
is to iteratively converge towards an optimum number of clus-
ters. If the final number is higher than the optimum then the
system is said to under-cluster. If it is lower it is said to over-
cluster. Both bottom-up and top-down approaches are generally
based on hidden Markov models (HMMs) where each state is a
Gaussian mixture model (GMM) and corresponds to a speaker.
Transitions between states correspond to speaker turns. In this
section, we briefly outline the standard bottom-up and top-down
approaches as well as two recently proposed alternatives: one
based on information theory and a second based on a nonpara-
metric Bayesian approach. Although these new approaches
have not been reported previously in the context of official NIST
RT evaluations they have shown strong potential on NIST RT
evaluation datasets and are thus included here. Additionally,
some other works propose sequential single-pass segmentation
and clustering approaches [5]–[7], although their performance
tends to fall short of the state-of-the-art.
1) Bottom-Up Approach: The bottom-up approach is by far
the most common in the literature. Also known as agglomer-
ative hierarchical clustering (AHC or AGHC), the bottom-up
approach trains a number of clusters or models and aims at
successively merging and reducing the number of clusters until
only one remains for each speaker. Various initializations have
been studied and, whereas some have investigated k-means clus-
tering, many systems use a uniform initialization, where the
audio stream is divided into a number of equal length abutted
segments. This simpler approach generally leads to equivalent
performance [8]. In all cases the audio stream is initially over-
segmented into a number of segments which exceeds the antic-
ipated maximum number of speakers. The bottom-up approach
then iteratively selects closely matching clusters to merge, hence
reducing the number of clusters by one upon each iteration.
Clusters are generally modeled with a GMM and, upon merging,
a single new GMM is trained on the data that was previously
assigned to the two individual clusters. Standard distance met-
rics, such as those described in Section III-C, are used to iden-
tify the closest clusters. A reassignment of frames to clusters
is usually performed after each cluster merging, via Viterbi re-
alignment for example, and the whole process is repeated itera-
tively, until some stopping criterion is reached, upon which there
should remain only one cluster for each detected speaker. Pos-
sible stopping criteria include thresholded approaches such as
the Bayesian Information Criterion (BIC) [9], Kullback–Leibler
(KL)-based metrics [10], the generalized likelihood ratio (GLR)
[11], or the metric recently proposed in [12]. Bottom-up systems
submitted to the NIST RT evaluations [9], [13] have performed
consistently well.
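To make the merge loop concrete, the following minimal sketch (our illustration, not any particular published system) implements the agglomerative procedure with single full-covariance Gaussians standing in for GMMs and a BIC-based merge and stopping criterion; a real system would interleave Viterbi realignment between merges, as noted above.

```python
import numpy as np

def bic(X, lam=1.0):
    """BIC of one full-covariance Gaussian fitted to X (n frames x d dims)."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)   # regularized covariance
    loglik = -0.5 * n * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1] + d)
    n_params = d + d * (d + 1) / 2                      # mean + covariance terms
    return loglik - 0.5 * lam * n_params * np.log(n)

def merge_gain(Xi, Xj):
    """Positive gain: one joint model beats two separate ones, so merge."""
    return bic(np.vstack([Xi, Xj])) - bic(Xi) - bic(Xj)

def bottom_up_diarization(segments):
    """Agglomerative clustering: merge the best pair until no gain remains."""
    clusters = [np.asarray(s, dtype=float) for s in segments]  # over-segmentation
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: merge_gain(clusters[p[0]], clusters[p[1]]))
        if merge_gain(clusters[i], clusters[j]) <= 0:   # stopping criterion
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters                                     # ideally one per speaker
```

The tunable weight lam plays the same role as the explicit BIC penalty term discussed in Section III-C.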
2) Top-Down Approach: In contrast with the previous ap-
proach, the top-down approach first models the entire audio
stream with a single speaker model and successively adds new
models to it until the full number of speakers is deemed to be
accounted for. A single GMM model is trained on all the speech
segments available, all of which are marked as unlabeled. Using
some selection procedure to identify suitable training data from
the unlabeled segments, new speaker models are iteratively
added to the model one-by-one, with interleaved Viterbi realign-
ment and adaptation. Segments attributed to any one of these
new models are marked as labeled. Stopping criteria similar to
those employed in bottom-up systems may be used to terminate
the process or it can continue until no more relevant unlabeled
segments with which to train new speaker models remain. Top-
down approaches are far less popular than their bottom-up coun-
terparts. Some examples include [14]–[16]. While they are gen-
erally out-performed by the best bottom-up systems, top-down
approaches have performed consistently and respectably well
against the broader field of other bottom-up entries. Top-down
approaches are also extremely computationally efficient and can
be improved through cluster purification [17].
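For contrast, a comparably simplified top-down sketch follows (again our illustration: diagonal Gaussians stand in for adapted GMMs, hard segment assignment stands in for Viterbi realignment, and a fixed speaker budget stands in for the stopping criteria discussed above).

```python
import numpy as np

def fit(X):
    """Diagonal-covariance Gaussian parameters for the frames in X."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def loglik(X, model):
    mu, var = model
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var))

def top_down_diarization(segments, max_speakers=4):
    """Start from one model over all speech; add speaker models one by one."""
    segments = [np.asarray(s, dtype=float) for s in segments]
    models = [fit(np.vstack(segments))]            # single model for everything
    for _ in range(max_speakers - 1):
        # Assign each segment to its best current model (Viterbi stand-in).
        assign = [max(range(len(models)), key=lambda k: loglik(s, models[k]))
                  for s in segments]
        # The worst-fitting segment (per frame) seeds the next speaker model.
        fits = [loglik(s, models[a]) / len(s) for s, a in zip(segments, assign)]
        models.append(fit(segments[int(np.argmin(fits))]))
        # Retrain every model on the segments now assigned to it.
        assign = [max(range(len(models)), key=lambda k: loglik(s, models[k]))
                  for s in segments]
        models = [fit(np.vstack([s for s, a in zip(segments, assign) if a == k]))
                  for k in range(len(models)) if k in assign]
    return models
```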
3) Other Approaches: A recent alternative approach, though
also bottom-up in nature, is inspired by rate-distortion theory
and is based on an information-theoretic framework [18]. It is
completely nonparametric and its results have been shown to
be comparable to those of state-of-the-art parametric systems,
with significant savings in computation. Clustering is based on
mutual information, which measures the mutual dependence
of two variables [19]. Only a single global GMM is tuned for
the full audio stream, and mutual information is computed in
a new space of relevance variables defined by the GMM com-
ponents. The approach aims at minimizing the loss of mutual
information between successive clusterings while preserving as
much information as possible from the original dataset. Two
suitable methods have been reported: the agglomerative infor-
mation bottleneck (aIB) [18] and the sequential information bot-
tleneck (sIB) [19]. While this approach does not outperform
parametric systems, it delivers results comparable to those of
state-of-the-art GMM systems with great savings in computation.
Alternatively, Bayesian machine learning became popular by
the end of the 1990s and has recently been used for speaker
diarization. The key component of Bayesian inference is that
it does not aim at estimating the parameters of a system (i.e.,
to perform point estimates), but rather the parameters of their

ANGUERA MIRO et al.: SPEAKER DIARIZATION: A REVIEW OF RECENT RESEARCH 359
related distribution (hyperparameters). This avoids premature
hard decisions in the diarization problem and automatically
regulates the system according to the observations (e.g.,
the complexity of the model is data dependent). However, the
computation of posterior distributions often requires intractable
integrals and, as a result, the statistics community has developed
approximate inference methods. Markov chain Monte Carlo
(MCMC) methods were first used [20] to provide a systematic approach
to the computation of distributions via sampling, enabling the
deployment of Bayesian methods. However, sampling methods
are generally slow and prohibitive when the amount of data is
large, and they must be run several times as the chains may
get stuck and not converge in a practical number of iterations.
Another alternative approach, known as Variational Bayes,
has been popular since 1993 [21], [22] and aims at providing a
deterministic approximation of the distributions. It enables an
inference problem to be converted to an optimization problem
by approximating the intractable distribution with a tractable
approximation obtained by minimizing the Kullback–Leibler
divergence between them. In [23] a Variational Bayes-EM
algorithm is used to learn a GMM speaker model and optimize
a change detection process and the merging criterion. In [24],
variational Bayes is combined successfully with eigenvoice
modeling, described in [25], for the speaker diarization of
telephone conversations. However, these systems still rely on
classical Viterbi decoding for the classification and
differ from the nonparametric Bayesian systems introduced in
Section IV-F.
Finally, the recently proposed speaker binary keys [26] have
been successfully applied to speaker diarization in meetings
[27] with similar performance to state-of-the-art systems but
also with considerable computational savings (running in
around 0.1 times real-time). Speaker binary keys are small bi-
nary vectors computed from the acoustic data using a universal
background model (UBM)-like model. Once they are computed
all processing tasks take place in the binary domain. Other
works in speaker diarization concerned with speed include [28],
[29] which achieve faster than real-time processing through the
use of several processing tricks applied to a standard bottom-up
approach ([28]) or by parallelizing most of the processing
on a GPU ([29]). The need for efficient diarization sys-
tems is emphasized when processing very large databases or
when using diarization as a preprocessing step to other speech
algorithms.
III. MAIN ALGORITHMS
Fig. 1(b) shows a block diagram of the generic modules which
make up most speaker diarization systems. The data prepro-
cessing step (Fig. 1(b)-i) tends to be somewhat domain spe-
cific. For meeting data, preprocessing usually involves noise re-
duction (such as Wiener filtering for example), multi-channel
acoustic beamforming (see Section III-A), the parameterization
of speech data into acoustic features (such as MFCC, PLP, etc.)
and the detection of speech segments with a speech activity
detection algorithm (see Section III-B). Cluster initialization
(Fig. 1(b)-ii) depends on the approach to diarization, i.e., the
choice of an initial set of clusters in bottom-up clustering [8],
[13], [30] (see Section III-C) or a single segment in top-down
clustering [15], [16]. Next, in Fig. 1(b)-iii/iv, a distance between
clusters and a split/merging mechanism (see Section III-D) is
used to iteratively merge clusters [13], [31] or to introduce new
ones [16]. Optionally, data purification algorithms can be used
to make clusters more discriminant [13], [17], [32]. Finally, as
illustrated in Fig. 1(b)-v, stopping criteria are used to determine
when the optimum number of clusters has been reached [33],
[34].
A. Acoustic Beamforming
The application of speaker diarization to the meeting domain
triggered the need for dealing with multiple microphones which
are often used to record the same meeting from different lo-
cations in the room [35]–[37]. The microphones can have dif-
ferent characteristics: wall-mounted microphones (intended for
speaker localization), lapel microphones, desktop microphones
positioned on the meeting room table or microphone arrays. The
use of different microphone combinations as well as differences
in microphone quality called for new approaches to speaker di-
arization with multiple channels.
The multiple distant microphone (MDM) condition was in-
troduced in the NIST RT’04 (Spring) evaluation. A variety of
algorithms have been proposed to extend mono-channel diariza-
tion systems to handle multiple channels. One option, proposed
in [38], is to perform speaker diarization on each channel inde-
pendently and then to merge the individual outputs. In order to
do so, a two-axis merging algorithm is used which considers the
longest detected speaker segments in each channel and iterates
over the segmentation output. In the same year, a late-stage fu-
sion approach was also proposed [39]. In it, speaker segmen-
tation is performed separately in all channels and diarization
is applied only taking into account the channel whose speech
segments have the best signal-to-noise ratio (SNR). Subsequent
approaches investigated preprocessing to combine the acoustic
signals to obtain a single channel which could then be processed
by a regular mono-channel diarization system. In [40], the mul-
tiple channels are combined with a simple weighted sum ac-
cording to their SNR. Though straightforward to implement, this
approach does not take into account the time difference of arrival between
each microphone channel and might easily lead to a decrease in
performance.
Since the NIST RT’05 evaluation, the most common ap-
proach to multi-channel speaker diarization involves acoustic
beamforming as initially proposed in [41] and described in de-
tail in [42]. Many RT participants use the free and open-source
acoustic beamforming toolkit known as BeamformIt [43]
which consists of an enhanced delay-and-sum algorithm to
correct misalignments due to the time-delay-of-arrival (TDOA)
of speech to each microphone. Speech data can optionally be
preprocessed with Wiener filtering [44] to attenuate noise,
for example using the implementation in [45]. A reference
channel is selected and
the other channels are appropriately aligned and combined with
a standard delay-and-sum algorithm. The contribution made by
each signal channel to the output is then dynamically weighted
according to its SNR or by using a cross-correlation-based
metric. Various additional algorithms are available in the
BeamformIt toolkit to select the optimum reference channel
and to stabilize the TDOA values between channels before the

360 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 2, FEBRUARY 2012
signals are summed. Finally, the TDOA estimates themselves
are made available as outputs and have been used successfully
to improve diarization, as explained in Section IV-A. Note
that, although there are other algorithms that can provide
better beamforming results for some cases, delay-and-sum
beamforming is the most reliable when no information on the
location or nature of each microphone is known a priori.
Among alternative beamforming algorithms we find maximum
likelihood (ML) [46] and generalized sidelobe canceller (GSC)
[47], which adaptively find the optimum parameters, and min-
imum variance distortionless response (MVDR) [48] when
prior information on ambient noise is available. All of these
have higher computational requirements and, in the case of the
adaptive algorithms, there is the danger of converging to inac-
curate parameters, especially when processing microphones of
different types.
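The core delay-and-sum operation is itself compact. The sketch below (our simplification, not BeamformIt) estimates a single TDOA per channel against a reference by cross-correlation and forms the weighted, delay-compensated sum; a real system estimates TDOAs over short windows with GCC-PHAT and smooths them over time.

```python
import numpy as np

def estimate_tdoa(ref, ch, max_lag):
    """Delay (in samples) of `ch` relative to `ref` via cross-correlation."""
    corr = np.correlate(ch, ref, mode="full")
    lags = np.arange(-len(ref) + 1, len(ch))       # lag axis for `full` mode
    valid = np.abs(lags) <= max_lag
    return int(lags[valid][np.argmax(corr[valid])])

def delay_and_sum(channels, max_lag=800, weights=None):
    """Align each equal-length channel to channels[0]; take a weighted sum."""
    channels = [np.asarray(c, dtype=float) for c in channels]
    if weights is None:                            # real systems weight by SNR
        weights = [1.0 / len(channels)] * len(channels)
    out = np.zeros(len(channels[0]))
    for w, ch in zip(weights, channels):
        lag = estimate_tdoa(channels[0], ch, max_lag)
        out += w * np.roll(ch, -lag)               # compensate estimated delay
    return out                                     # (edge wrap-around ignored)
```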
B. Speech Activity Detection
Speech activity detection (SAD) involves the labeling of
speech and nonspeech segments. SAD can have a significant
impact on speaker diarization performance for two reasons.
The first stems directly from the standard speaker diarization
performance metric, namely the diarization error rate (DER),
which takes into account both the false alarm and missed
speaker error rates (see Section VI-A for more details on
evaluation metrics); poor SAD performance will therefore
lead to an increased DER. The second follows from the fact
that nonspeech segments can disturb the speaker diarization
process, and more specifically the acoustic models involved in
the process [49]. Indeed, the inclusion of non-speech segments
in speaker modeling leads to less discriminant models and thus
increased difficulties in segmentation. Consequently, a good
compromise between missed and false alarm speech error rates
has to be found to enhance the quality of the following speaker
diarization process.
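For reference, the DER combines these error types as time-weighted fractions of the total scored speech time; in the usual NIST formulation (details in Section VI-A):

```latex
\mathrm{DER} = \frac{T_{\mathrm{FA}} + T_{\mathrm{miss}} + T_{\mathrm{spkr}}}{T_{\mathrm{total}}}
```

where T_FA is nonspeech falsely labeled as speech, T_miss is speech labeled as nonspeech, T_spkr is speech attributed to the wrong speaker, and T_total is the total scored speech time.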
SAD is a fundamental task in almost all fields of speech
processing (coding, enhancement, and recognition) and many
different approaches and studies have been reported in the
literature [50]. Initial approaches for diarization tried to solve
speech activity detection on the fly, i.e., by having a non-
speech cluster be a by-product of the diarization. However,
it became evident that better results are obtained using a
dedicated speech/nonspeech detector as a preprocessing step.
In the context of meetings nonspeech segments may include
silence, but also ambient noise such as paper shuffling, door
knocks or non-lexical noise such as breathing, coughing, and
laughing, among other background noises. Therefore, highly
variable energy levels can be observed in the nonspeech parts
of the signal. Moreover, differences in microphones or room
configurations may result in variable SNRs from one meeting
to another. Thus, SAD is far from trivial in this context
and typical techniques based on feature extraction (energy,
spectrum divergence between speech and background noise,
and pitch estimation) combined with a threshold-based decision
have proven to be relatively ineffective.
Model-based approaches tend to perform better
and rely on a two-class detector, with models pre-trained with
external speech and nonspeech data [6], [41], [49], [51], [52].
Speech and nonspeech models may optionally be adapted to
specific meeting conditions [15]. Discriminant classifiers such
as linear discriminant analysis (LDA) coupled with Mel fre-
quency cepstrum coefficients (MFCCs) [53] or support vector
machines (SVMs) [54] have also been proposed in the litera-
ture. The main drawback of model-based approaches is their re-
liance on external data for the training of speech and nonspeech
models which makes them less robust to changes in acoustic
conditions. Hybrid approaches have been proposed as a poten-
tial solution. In most cases, an energy-based detection is first ap-
plied in order to label a limited amount of speech and nonspeech
data for which there is high confidence in the classification. In a
second step, the labeled data are used to train meeting-specific
speech and nonspeech models, which are subsequently used in a
model-based detector to obtain the final speech/nonspeech seg-
mentation [9], [55]–[57]. Finally, [58] combines a model-based
with a 4-Hz modulation energy-based detector. Interestingly, in-
stead of being applied as a preprocessing stage, in this system
SAD is incorporated into the speaker diarization process.
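As an illustration of the hybrid scheme just described, here is a minimal sketch (our simplification: percentile thresholds and single Gaussians on frame energies, where a real system would train GMMs on cepstral features and smooth the decisions over time):

```python
import numpy as np

def hybrid_sad(frame_energies, low_pct=20, high_pct=80):
    """Two-pass speech activity detection sketch.

    Pass 1 (energy-based): percentile thresholds select high-confidence
    nonspeech and speech frames. Pass 2 (model-based): Gaussians trained on
    those frames classify every frame. Returns a boolean mask, True = speech.
    """
    e = np.asarray(frame_energies, dtype=float)
    lo, hi = np.percentile(e, [low_pct, high_pct])
    nonspeech, speech = e[e <= lo], e[e >= hi]     # high-confidence bootstrap

    def loglik(x, data):                           # pointwise Gaussian loglik
        mu, var = data.mean(), data.var() + 1e-6
        return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

    return loglik(e, speech) > loglik(e, nonspeech)
```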
C. Segmentation
In the literature, the term “speaker segmentation” is some-
times used to refer to both segmentation and clustering. While
some systems treat each task separately, many present
state-of-the-art systems tackle them simultaneously, as de-
scribed in Section III-E. In these cases the notion of strictly
independent segmentation and clustering modules is less rel-
evant. However, both modules are fundamental to the task of
speaker diarization and some systems, such as that reported in
[6], apply distinctly independent segmentation and clustering
stages. Thus, the segmentation and clustering modules are
described separately here.
Speaker segmentation is core to the diarization process and
aims at splitting the audio stream into speaker-homogeneous
segments or, alternatively, at detecting changes in speaker, also
known as speaker turns. The classical approach to segmentation
performs hypothesis testing using the acoustic segments in
two sliding, possibly overlapping, consecutive windows. For
each considered change point there are two possible hypotheses:
first, that both segments come from the same speaker (H0), and
thus that they can be well represented by a single model; and
second, that there are two different speakers (H1), and thus that
two different models are more appropriate. In practice, models
are estimated from each of the speech windows and some cri-
teria are used to determine whether they are best accounted for
by two separate models (and hence two separate speakers), or by
a single model (and hence the same speaker) by using an empir-
ically determined or dynamically adapted threshold [10], [59].
This is performed across the whole audio stream and a sequence
of speaker turns is extracted.
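The ΔBIC form of this test, discussed in the next paragraph, compares a single Gaussian fitted over both windows against two separate Gaussians. A minimal sketch (our illustration; window sizes and the penalty weight are hypothetical tuning choices):

```python
import numpy as np

def gauss_loglik(X):
    """Maximized log-likelihood of a full-covariance Gaussian fitted to X."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)
    return -0.5 * n * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1] + d)

def change_score(left, right, lam=1.0):
    """Delta-BIC at a candidate change point: positive favours two speakers
    (H1) over a single shared model (H0) after the complexity penalty."""
    d = left.shape[1]
    n_params = d + d * (d + 1) / 2        # extra mean + covariance under H1
    gain = (gauss_loglik(left) + gauss_loglik(right)
            - gauss_loglik(np.vstack([left, right])))
    return gain - 0.5 * lam * n_params * np.log(len(left) + len(right))

def detect_changes(features, win=300, step=100):
    """Slide two adjacent windows over the frame stream (frames x dims)."""
    return [t for t in range(win, len(features) - win + 1, step)
            if change_score(features[t - win:t], features[t:t + win]) > 0]
```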
Many different distance metrics have appeared in the liter-
ature. Next, we review the dominant approaches which have
been used for the NIST RT speaker diarization evaluations
during the last four years. The most common approach is that
of the Bayesian information criterion (BIC) and its associated
ΔBIC metric [33], which has proved to be extremely popular,
e.g., [60]–[62]. The approach requires the setting of an explicit
penalty term which controls the tradeoff between missed turns
and false alarms.

More recently, an approach to the unsupervised discriminant analysis of inter-channel delay features was proposed in [92] and results of approximately 20% DER were reported using delay features alone.