DNN approach to speaker diarisation using speaker channels

This is a repository copy of DNN approach to speaker diarisation using speaker channels.
White Rose Research Online URL for this paper:
http://eprints.whiterose.ac.uk/121245/
Version: Accepted Version
Proceedings Paper:
Milner, R. and Hain, T. orcid.org/0000-0003-0939-3464 (2017) DNN approach to speaker
diarisation using speaker channels. In: 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), March 5-9, 2017, New Orleans, USA.
IEEE, pp. 4925-4929. ISBN 9781509041176
https://doi.org/10.1109/ICASSP.2017.7953093
eprints@whiterose.ac.uk
https://eprints.whiterose.ac.uk/

DNN APPROACH TO SPEAKER DIARISATION USING SPEAKER CHANNELS
Rosanna Milner, Thomas Hain
Speech and Hearing Research Group, University of Sheffield, UK
{rmmilner2, t.hain}@sheffield.ac.uk
ABSTRACT
Speaker diarisation addresses the question of “who speaks
when” in audio recordings, and has been studied extensively
in the context of tasks such as broadcast news, meetings,
etc. Performing diarisation on individual headset microphone
(IHM) channels is sometimes assumed to easily give the de-
sired output of speaker labelled segments with timing infor-
mation. However, it is shown that given imperfect data, such
as speaker channels with heavy crosstalk and overlapping
speech, this is not the case. Deep neural networks (DNNs)
can be trained on features derived from the concatenation of
speaker channel features to detect which is the correct chan-
nel for each frame. Crosstalk features can be calculated and
DNNs trained with or without overlapping speech to combat
problematic data. A simple frame decision metric of counting
occurrences is investigated as well as adding a bias against
selecting nonspeech for a frame. Finally, two different scor-
ing setups are applied to both datasets. The stricter SHEF
setup finds diarisation error rates (DER) of 9.2% on TBL
and 23.2% on RT07 while the NIST setup achieves 5.7% and
15.1% respectively.
Index Terms: speaker diarisation, multi-channel, crosstalk, deep neural networks, speaker channels
1. INTRODUCTION
The task of speaker diarisation is an important prerequisite
task for audio indexing, automatic speech recognition (ASR)
and more [1, 2]. The objective is to split the audio into seg-
ments which are associated with a single speaker, and to iden-
tify among the set of segments those that are spoken by the
same speaker. Diarisation systems generally consist of three
main stages: speech activity detection (SAD), speaker seg-
mentation and speaker clustering. SAD aims to detect speech
segments which are passed to a speaker segmentation stage to
split the segments further at speaker change points (speaker
boundaries). Speaker clustering aims to group speaker seg-
ments together into speaker-homogeneous clusters. The ob-
jective is not only to group the speakers correctly, but also to
find the correct number of clusters (i.e. speakers). Diarisation
has been well studied over the years, and toolkits are available
for this task which are designed to perform well for a specific
type of data [3, 4, 5].
The challenges to multi-channel diarisation differ by do-
main. For conversational telephone speech (CTS) only two
speakers are present. However, channel echo, speaker over-
lap, poor quality phone lines and noise cause errors, despite
independent channels for each speaker [2]. Broadcast news
(BN) data has background noises such as music, but also a
large number of speakers who may only occur very briefly [6,
7]. Meeting data has become the focus for diarisation for con-
siderable time [8]. Speech is conversational with significant
amounts of speaker overlap, as it is for CTS. However, there
are more speakers, and speech may be recorded with distant
or far-field microphones. Multi-channel diarisation operates
in two different modes, depending on the distance between
the microphones and the speakers: using beam-forming to fo-
cus on speakers [9]; or detecting automatically which speaker
is closer and disregarding other speech [10, 11]. The for-
mer case is much harder. It helps beam-forming to know
who speaks and when [12], but knowing where the speech is
coming from can improve speaker segmentation performance
[13, 14], e.g. through the use of inter-channel delay infor-
mation [9]. Work presented here is related to the latter case:
microphones are far apart and assigned to speakers, although
not in close proximity to the speakers mouth.
Deep neural networks (DNNs) have been introduced into
different stages of a diarisation system. Artificial neural net-
works (ANN) have been trained to learn a feature transform
[15] and DNNs can be trained to detect speech/nonspeech in
an SAD stage where adapting the DNN leads to improved
performance [16]. A speaker segmentation stage using auto-
associative neural networks (AANN) was proposed: with a
sliding window, an AANN model is trained on the left half
of the window and tested on the right half, giving a confi-
dence score on how likely the two halves belong to the same
speaker [17]. Finally, DNNs have been applied to the cluster-
ing stage by training speaker separation DNNs and adapting
these to specific recordings [16, 18].
Typically, speaker diarisation is unsupervised meaning no
a priori information or metadata is used to aid a system. The
desired output of a system is speaker labelled segments with
timing information. Whether diarisation is performed unsu-
pervised (ICSI system [19]), semi or lightly supervised (sup-
plementary data such as imperfect transcripts [20]) or super-
vised (known speakers [21]), the desired output remains the

[Figure: DNN diagram with channel combinations as input (e.g. C1C2C3C4, C1C2C4C3, ..., C4C3C2C1 for method (A); pairs C1C2, C1C3, ..., C4C3 for method (B)), hidden layers, and output target classes NS, P1-P4.]
Fig. 1. Feature concatenations and input labels are shown for
methods (A) a fixed number of channels for every recording
and (B) a mixed number of channels across recordings.
same. It will be shown that the obvious method of performing
diarisation on the individual headset microphone (IHM) chan-
nels is not satisfactory given imperfect data, such as chan-
nels containing heavy crosstalk. Thus, two methods are pro-
posed which train DNNs to detect which channel contains the
correct speaker at a given frame. Both methods concatenate
speaker channel features in training and testing. The first con-
catenates all speaker channels from a recording so it requires
each recording in a dataset to contain the same number of
speakers. As this is not portable to datasets which do not have
this trait, a second method is proposed which trains DNNs
on pairs of speaker channels. Furthermore, the problems of
crosstalk and overlapping speech are considered, as well as a
simple counting frame decision metric versus adding a bias
against selecting nonspeech.
2. DNN APPROACH USING SPEAKER CHANNELS
Two methods are presented: the first method is channel de-
tection when the specific number of channels is fixed and the
second is an extension to the first in which the data consists
of a mixed number of channels.
2.1. Fixed number of channels per recording
DNNs are trained on concatenated features from all the
speaker channels. It requires every recording to contain the
same number of speakers. Every combination of the channels
are used for training, as this may help prevent channels being
biased in certain positions. Example (A) in Figure 1 depicts
the ordering of the concatenated features with their equivalent
label file for training. It assumes there are four IHM channels
for every recording. The channels are referred to as C1, C2,
C3, C4 while each speaker-pure segment is labelled as P1, P2,
P3, P4 corresponding to the position of the relevant channel
in the feature concatenation. Nonspeech is referred to as NS.
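As a concrete illustration, the permutation scheme above can be sketched as follows. Function and variable names are hypothetical, and the labelling convention (-1 for nonspeech) is an assumption for the sketch, not taken from the paper.

```python
from itertools import permutations

import numpy as np

def fixed_channel_examples(channel_feats, frame_labels):
    """Build training examples for the fixed-channel method (a sketch).

    channel_feats: list of per-channel feature matrices, each (frames, dims).
    frame_labels:  per-frame index of the active speaker's channel, or -1
                   for nonspeech (an assumed labelling convention).
    Yields one (concatenated_features, targets) pair per channel ordering,
    so that no channel is biased towards a fixed input position.
    """
    n = len(channel_feats)
    for order in permutations(range(n)):              # all n! orderings
        feats = np.concatenate([channel_feats[c] for c in order], axis=1)
        # The target is the *position* (P1..Pn) of the active channel in
        # this particular ordering; position n denotes nonspeech (NS).
        targets = np.array([order.index(l) if l >= 0 else n
                            for l in frame_labels])
        yield feats, targets
```

For four channels this yields 24 orderings, matching the combination count discussed in Section 2.2.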
2.2. Mixed number of channels per recording
The fixed method is not portable to datasets which do not con-
tain the same number of speakers in each recording. A differ-
frame  C1C2C3C4  C1C2C4C3  ...  C4C3C2C1  output
 ...
 204      P1        P1     ...     P4       C1
 205      P1        P1     ...     P4       C1
 206      P1        NS     ...     P4       C1
 207      NS        NS     ...     P4       NS
 208      NS        NS     ...     NS       NS
 209      P3        P4     ...     P2       C3
 210      P2        P2     ...     P3       C2
 ...
Fig. 2. Frame decisions are made considering the decoded
outputs from all combinations of feature concatenations on
the testset. The simple counting method gives the output
displayed.
ent approach is required where pairs of features can be con-
catenated. Example (B) in Figure 1 displays how the channel
pairs are annotated as before, where position labels are nec-
essary to denote which channel contains speech and which is
nonspeech. For instances where the speech segment does not
belong to either channel, a nonspeech label is given.
As well as being applicable to all datasets, this alternative
approach also reduces the amount of data needed for training.
For a single recording in the fixed method, the number of pos-
sible combinations for training is x!, where x is the number
of channels, whereas for this method the number of possible
feature pairs for training becomes x(x-1). For example, if
there are 4 channels then the number of combinations needed
for each method is 24 and 12 respectively.
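The pairwise variant can be sketched in the same style as the fixed method; again the names and the -1 nonspeech convention are assumptions for illustration.

```python
from itertools import permutations

import numpy as np

def paired_channel_examples(channel_feats, frame_labels):
    """Sketch of the mixed-channel method: concatenate ordered *pairs*
    of channels rather than all channels at once.

    Targets (an assumed scheme): 0 if the left channel of the pair is
    active, 1 if the right one is, 2 for nonspeech -- including speech
    that belongs to neither channel of the pair, as described in the text.
    """
    n = len(channel_feats)
    for i, j in permutations(range(n), 2):        # x * (x - 1) ordered pairs
        feats = np.concatenate([channel_feats[i], channel_feats[j]], axis=1)
        targets = np.array([0 if l == i else 1 if l == j else 2
                            for l in frame_labels])
        yield (i, j), feats, targets

# For x = 4 channels: 4! = 24 full orderings vs 4 * 3 = 12 ordered pairs.
```

Because each example only spans two channels, the input dimensionality is also fixed regardless of how many speakers a recording contains, which is what makes the method portable across datasets.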
2.3. Frame decision
All the combinations of feature concatenations are used for
testing and this gives a channel or nonspeech label to every
frame. This results in multiple labels for every frame, across
the different decoded feature concatenations, as shown in Fig-
ure 2. To make a decision on the correct label, one can sim-
ply count the occurrences and select the channel or nonspeech
that has been labelled the most. Alternatively, the occurrences
can be counted as before with a bias for or against nonspeech
applied as a multiplier to increase or reduce the likelihood
of selecting nonspeech. A bias for or against specific chan-
nels could also be applied, for example if a host in a TV pro-
gramme is known to talk more than the guests.
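The counting metric and the nonspeech bias can be sketched together in a few lines; the function name and the representation of labels are hypothetical.

```python
from collections import Counter

def frame_decision(labels_per_combo, ns_bias=1.0, ns_label="NS"):
    """Decide a single label for one frame from the labels produced by
    all decoded feature concatenations (a sketch).

    Occurrences are counted and the nonspeech count is scaled by
    `ns_bias`: values below 1.0 bias the decision *against* selecting
    nonspeech, values above 1.0 favour it, and 1.0 is plain counting.
    """
    counts = Counter(labels_per_combo)
    if ns_label in counts:
        counts[ns_label] *= ns_bias
    return max(counts, key=counts.get)
```

For example, a frame labelled NS by two combinations and C4 by one is decided as NS under plain counting, but as C4 once a bias of 0.25 shrinks the nonspeech count to 0.5.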
3. EXPERIMENTS
3.1. Data
The methods are evaluated with two datasets in different do-
mains. TBL is TV broadcast data which consists of 22 pro-
grammes from a talk show with single distant microphone
(SDM) and IHM channels: four speakers (one host and
three guests). The recordings have been split into a training
set of 12 programmes for DNN training only, and a test set of
10 episodes which has a total of 40 speakers and 8749 seg-
ments in 5.3 hours of speech time. The audio was manually
transcribed to an accuracy of 0.1s.
The second is based on the established testset from the
NIST Rich Transcription evaluation in 2007 [8]. The com-

plete files were also manually transcribed to an accuracy of
0.1s¹, which produces a different reference to the original
testset. This updated reference contains 8 conference meet-
ings with both SDM and IHM channel data and contains 35
speakers and 11144 segments over 8.9 hours of speech time.
Six meetings contain 4 participants, one has 5 and another 6.
3.2. Experimental setup
DNNs require training on concatenated IHM channels and
log-Mel filterbanks of 23 dimension are used as opposed
to Mel frequency cepstral coefficients (MFCCs) as they are
found to yield better performance with DNNs [22]. Crosstalk
features (denoted CT), of 7 dimensions, may help reduce
errors caused by speech on the wrong channel [10]. The
energies are normalised across all N channels by

E_i^{norm}(n) = \frac{E_i(n)}{\sum_{k=1}^{N} E_k(n)}    (1)

where E_i(n) is the current channel i energy at frame n. Fur-
ther features are calculated such as kurtosis [23] and mean
cross-correlation and maximum normalised cross-correlation.
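Eq. (1) amounts to a per-frame division by the summed energy over channels; a minimal sketch (the epsilon guard for all-silent frames is an addition of this sketch, not part of the paper):

```python
import numpy as np

def normalise_energies(energies, eps=1e-12):
    """Eq. (1): divide each channel's frame energy by the sum over all
    N channels at that frame, so a channel carrying only crosstalk gets
    a small normalised energy while the active channel dominates.

    energies: array of shape (N_channels, n_frames).
    """
    total = energies.sum(axis=0, keepdims=True)  # sum over channels, per frame
    return energies / (total + eps)              # eps guards silent frames
```

By construction the normalised energies at each frame sum to (almost exactly) one across channels.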
DNNs for the fixed method are trained on TBL, whereas
DNNs for the mixed method are trained on TBL and the AMI
corpus [24]. The number of input neurons depends on the
number of concatenated channels. For 4 channels, there are
1472 input neurons, increasing to 1920 with CT, two hidden
layers of 1000 hidden units and 5 output neurons, which rep-
resent the 4 channels and nonspeech. For 2 channels, there
are 736 neurons, increasing to 960 with CT, two hidden layers
of 1000 hidden units and 3 output neurons, representing the
2 channels and nonspeech. Training on overlapping speech
may cause DNNs to learn errors and affect the performance
thus DNNs are trained with or without overlapping speech
(denoted OV).
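The quoted input sizes are consistent with a per-channel context window of 16 frames (4 x 23 x 16 = 1472, and 4 x (23 + 7) x 16 = 1920 with CT), although the paper does not state the context width explicitly, so treat that as an inference. A minimal forward-pass sketch of the topology with random weights (the paper does not specify activations; sigmoid hidden units and a softmax output are assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, sizes):
    """Forward pass through a fully connected net with the described
    layer sizes (a sketch with random weights, no training)."""
    for i, (d_in, d_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        w = rng.standard_normal((d_in, d_out)) * 0.01
        x = x @ w
        if i < len(sizes) - 2:
            x = 1.0 / (1.0 + np.exp(-x))          # sigmoid hidden units
    e = np.exp(x - x.max(axis=1, keepdims=True))  # softmax over classes
    return e / e.sum(axis=1, keepdims=True)

# Input sizes quoted in the paper, decomposed under the assumed
# 16-frame context window per channel:
assert 4 * 23 * 16 == 1472 and 4 * (23 + 7) * 16 == 1920   # 4-channel net
assert 2 * 23 * 16 == 736 and 2 * (23 + 7) * 16 == 960     # 2-channel net

# 4-channel topology: 1472 inputs, two 1000-unit hidden layers,
# 5 outputs (4 channel positions + nonspeech).
probs = mlp_forward(np.zeros((5, 1472)), [1472, 1000, 1000, 5])
```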
3.3. Diarisation evaluation
Diarisation error rate (DER) is the standard metric for speaker
diarisation and is the sum of three error values: miss (MS),
false alarm (FA) and speaker error (SE) [25]. The DER does
not consider the segmentation quality in its evaluation of a
system, so all tables depict the number of detected segments
[?]. Two scoring methods are investigated. The standard eval-
uation method for RT07 data is to use a collar of 0.25s and
score specific portions of time only, not complete recordings,
with the NIST reference [8]. This will be referred to as the
NIST setup. For the TBL dataset under the NIST setup, the
collar of 0.25s is also employed; however, the complete
recordings are evaluated with the manually transcribed
reference. The second scoring setup will be referred to as
SHEF. As both datasets have been manually transcribed to an
accuracy of 0.1s, a stricter collar of 0.05s is used, and scoring
occurs on the complete files with this reference.
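To make the three error components concrete, a toy frame-level scorer is sketched below. This is a simplification, not the NIST md-eval tool: it ignores collars and overlapping speech, which is precisely where the NIST and SHEF setups differ.

```python
def frame_der(ref, hyp, ns="NS"):
    """Toy frame-level DER sketch: miss (reference speech, hypothesis
    nonspeech), false alarm (reference nonspeech, hypothesis speech)
    and speaker error (both speech, wrong speaker), as a percentage of
    total reference speech frames."""
    speech = sum(r != ns for r in ref) or 1               # scored speech frames
    ms = sum(r != ns and h == ns for r, h in zip(ref, hyp))
    fa = sum(r == ns and h != ns for r, h in zip(ref, hyp))
    se = sum(r != ns and h != ns and r != h for r, h in zip(ref, hyp))
    return 100.0 * (ms + fa + se) / speech
```

For instance, with reference frames A, A, NS, B and hypothesis A, NS, B, B there is one miss and one false alarm over three speech frames, giving a DER of about 66.7%.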
¹ mini.dcs.shef.ac.uk/resources/dia-improvedrt07reference/
Scoring  Channel  #Segs  #Spkrs  DER%
Data: TBL
NIST     SDM       2030     82    16.6
         IHM       8478     40   393.9
SHEF     SDM       2030     82    27.8
         IHM       8478     40   335.9
Data: RT07
NIST     SDM       2648     72    37.9
         IHM      13070     35   308.1
SHEF     SDM       2648     72    66.4
         IHM      13070     35   371.0
Table 1. Baseline performance for both datasets on the SDM
and IHM channels evaluated using both NIST and SHEF scor-
ing setups, where #Segs represents the number of hypothesis
segments and #Spkrs represents the number of speakers.
3.4. Baseline experiments
The public domain toolkit, LIUM_SpkDiarization [4], is tai-
lored for TV and radio broadcasts and consists of Bayesian in-
formation criterion (BIC) segmentation with cross-likelihood
ratio and integer linear programming and i-vector clustering.
Table 1 displays results for both datasets and a distinction
is made between the two scoring setups as previously de-
scribed: NIST and SHEF. Scoring also occurs on both SDM
and IHM channels. For the SDM results for the TBL dataset,
changing the collar has a dramatic effect on the DER, from
16.6% to 27.8% with the stricter collar. For RT07 SDM, the
NIST scoring gives 37.9% against the SHEF result of 66.4%;
again, a large degradation in DER is seen with stricter scoring.
For the IHM results, the imperfect data has large amounts
of crosstalk which negatively affects the performance and
causes large false alarms from incorrectly detected speech for
both datasets, seen in both scoring setups.
The SHEF setup is a stricter scoring method however ar-
guably more reliable to show the true performance given the
more accurate references. The rest of the paper will use this
scoring method. The NIST setup can be seen as more le-
nient scoring as 0.25s collar around every boundary is a large
portion of time to ignore from evaluation. However, for the
results to be comparable to other papers, the best result will
be scored in the NIST setup at the end.
3.5. Results
Results for the fixed method can be seen in Table 2 for the
TBL dataset, in which there are 4 channels per recording. The
DERs are relatively similar apart from the DNN trained on
TBL+CT where the number of segments detected is dramati-
cally less than the other three. The DNN trained on TBL+OV
achieves the lowest DER of 8.0% with the lowest SE of 1.2%.
Training DNNs with crosstalk features degrades the result
compared to DNNs without.
Table 3 displays the performance when the frame deci-
sion metric involves a bias against the nonspeech (NS) occur-

TRN  OV  CT   #Segs  MS%  FA%  SE%  DER%
Data: TBL
TBL   x        6732  4.3  2.4  1.2   8.0
TBL   x   x    7136  4.3  2.4  1.7   8.4
TBL            7269  4.3  2.5  1.5   8.3
TBL       x    2964  4.6  3.7  1.4   9.7
Table 2. Results for the DNNs trained with 4 fixed channels
across recordings, with the counting frame decision metric.
NS bias #Segs MS% FA% SE% DER%
Data: TBL, DNN: TBL+OV
0.75 6594 4.3 2.6 1.3 8.2
0.5 6571 4.2 2.7 1.3 8.2
0.25 6569 4.2 2.8 1.4 8.3
Table 3. Results when a bias against nonspeech is introduced
for the frame decision metric for 4 channels concatenated,
specifically for DNN TBL+OV.
rences; the multiplier is specified in the table. Errors in the
miss rate are reduced but these seem to be moved to the false
alarm and speaker error, thus increasing the DER by 0.2-0.3%.
Table 4 displays results for the mixed method and two
additional DNNs are trained on AMI data. Comparing the
TBL results to the previous fixed method, more segments are
found here although the performance is worse overall. Train-
ing DNNs with OV does not help performance as it does in
the fixed method. The baseline of 27.8% DER is beaten in
all but two of the trained DNNs. A dramatically higher miss
rate than the false alarm and speaker error is seen across the
trained DNNs. This could imply the counting metric is too
simple as nonspeech is selected over the channels. The best
DNN is trained on TBL+CT and achieves a DER of 10.9%,
the only DNN which improves with CT. The DNNs trained
on AMI more than double the error. For RT07, again a large
amount of miss across the DNNs is seen, implying a non-
speech bias could help. The DERs are high and range from
58.2% to 80.1% which does not seem promising. The DNNs
trained on AMI do not outperform the TBL trained DNNs.
The lowest DER is found with the DNN trained on TBL only.
Based on the miss rates reported in Table 4, it is clear that
nonspeech is selected too often. Table 5 shows the perfor-
mance when a bias against nonspeech is introduced. As the
bias decreases, the likelihood of selecting nonspeech is de-
creased and the amount of missed speech detected is reduced.
For TBL, this is a small gain from 10.9% to 9.2% with a bias
of 0.25. However, a large gain is seen for the RT07 dataset
which jumps from 58.2% to 23.2% DER with the same bias.
These lowest results with the NIST setup would change to
5.7% for TBL and 15.1% for RT07.
4. CONCLUSION
Two methods for training DNNs to detect the correct speaker
channel for the purpose of speaker diarisation are presented.
TRN  OV  CT   #Segs  MS%   FA%  SE%  DER%
Data: TBL
TBL   x        8295  20.3  1.1  0.9  22.4
TBL   x   x   10551  34.8  0.7  1.1  36.5
TBL            8263  17.0  1.4  1.0  19.4
TBL       x    7932   7.7  0.9  1.2  10.9
AMI           10354  16.6  1.0  4.9  22.5
AMI   x        7683  22.9  0.9  5.0  28.8
Data: RT07
TBL   x        7979  60.9  0.8  0.4  62.1
TBL   x   x    4169  79.6  0.4  0.1  80.1
TBL            8430  56.5  1.2  0.4  58.2
TBL       x    5993  59.7  1.3  0.2  61.2
AMI            8791  58.9  0.5  0.1  59.5
AMI   x        6873  62.4  0.5  0.1  63.0
Table 4. Results for the DNNs trained with mixed channels
across recordings, with the counting frame decision metric.
NS bias #Segs MS% FA% SE% DER%
Data: TBL, DNN: TBL+CT
0.75 7950 7.3 1.9 1.3 10.6
0.5 7420 5.4 2.4 1.5 9.4
0.25 7468 4.9 2.6 1.7 9.2
Data: RT07, DNN: TBL
0.75 9940 39.5 1.5 0.6 41.5
0.5 11983 20.3 3.2 0.9 24.4
0.25 13898 14.0 7.4 1.8 23.2
Table 5. Results when a bias against nonspeech is introduced
for the frame decision metric for pairs of channels concate-
nated, specifically for DNN TBL+CT for the TBL dataset and
DNN TBL for the RT07 dataset.
The first requires a fixed number of speaker channels across
recordings and concatenates speaker channel features for
training and testing. The second does not require a fixed
number of speaker channels and concatenates pairs of fea-
tures. These were evaluated using two datasets with the
former finding the best DER for the TBL dataset, however, it
is not applicable to datasets with varying numbers of speaker
channels and requires more training data. The mixed method
performs well for both TBL and RT07 datasets and achieves
best results when a bias against nonspeech is applied, giving
9.2% and 23.2% respectively for the stricter scoring setup.
For the NIST setup, this reduces to 5.7% and 15.1% DER.
5. ACKNOWLEDGEMENTS
The authors would like to thank Jana Eggink and the BBC
for supporting this work and providing the data. This
work was also supported by the EPSRC Programme Grant
EP/I031022/1 Natural Speech Technology. Results are found
here: https://dx.doi.org/10.6084/m9.figshare.4312469.v1
