DNN approach to speaker diarisation using speaker channels

This is a repository copy of DNN approach to speaker diarisation using speaker channels.
White Rose Research Online URL for this paper:
http://eprints.whiterose.ac.uk/121245/
Version: Accepted Version
Proceedings Paper:
Milner, R. and Hain, T. orcid.org/0000-0003-0939-3464 (2017) DNN approach to speaker
diarisation using speaker channels. In: 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), March 5-9, 2017, New Orleans, USA.
IEEE, pp. 4925-4929. ISBN 9781509041176
https://doi.org/10.1109/ICASSP.2017.7953093
eprints@whiterose.ac.uk
https://eprints.whiterose.ac.uk/

DNN APPROACH TO SPEAKER DIARISATION USING SPEAKER CHANNELS
Rosanna Milner, Thomas Hain
Speech and Hearing Research Group, University of Sheffield, UK
{rmmilner2, t.hain}@sheffield.ac.uk
ABSTRACT
Speaker diarisation addresses the question of “who speaks
when” in audio recordings, and has been studied extensively
in the context of tasks such as broadcast news, meetings,
etc. Performing diarisation on individual headset microphone
(IHM) channels is sometimes assumed to easily give the de-
sired output of speaker labelled segments with timing infor-
mation. However, it is shown that given imperfect data, such
as speaker channels with heavy crosstalk and overlapping
speech, this is not the case. Deep neural networks (DNNs)
can be trained on features derived from the concatenation of
speaker channel features to detect which is the correct chan-
nel for each frame. Crosstalk features can be calculated and
DNNs trained with or without overlapping speech to combat
problematic data. A simple frame decision metric of counting
occurrences is investigated as well as adding a bias against
selecting nonspeech for a frame. Finally, two different scor-
ing setups are applied to both datasets. The stricter SHEF
setup finds diarisation error rates (DER) of 9.2% on TBL
and 23.2% on RT07 while the NIST setup achieves 5.7% and
15.1% respectively.
Index Terms: speaker diarisation, multi-channel, crosstalk, deep neural networks, speaker channels
1. INTRODUCTION
The task of speaker diarisation is an important prerequisite
task for audio indexing, automatic speech recognition (ASR)
and more [1, 2]. The objective is to split the audio into seg-
ments which are associated with a single speaker, and to iden-
tify among the set of segments those that are spoken by the
same speaker. Diarisation systems generally consist of three
main stages: speech activity detection (SAD), speaker seg-
mentation and speaker clustering. SAD aims to detect speech
segments which are passed to a speaker segmentation stage to
split the segments further at speaker change points (speaker
boundaries). Speaker clustering aims to group speaker seg-
ments together into speaker-homogeneous clusters. The ob-
jective is not only to group the speakers correctly, but also to
find the correct number of clusters (i.e. speakers). Diarisation
has been well studied over the years, and toolkits are available
for this task which are designed to perform well for a specific
type of data [3, 4, 5].
The challenges to multi-channel diarisation differ by do-
main. For conversational telephone speech (CTS) only two
speakers are present. However, channel echo, speaker over-
lap, poor quality phone lines and noise cause errors, despite
independent channels for each speaker [2]. Broadcast news
(BN) data has background noises such as music, but also a
large number of speakers who may only occur very briefly [6,
7]. Meeting data has become the focus for diarisation for con-
siderable time [8]. Speech is conversational with significant
amounts of speaker overlap, as it is for CTS. However, there
are more speakers, and speech may be recorded with distant
or far-field microphones. Multi-channel diarisation operates
in two different modes, depending on the distance between
the microphones and the speakers: using beam-forming to fo-
cus on speakers [9]; or detecting automatically which speaker
is closer and disregarding other speech [10, 11]. The for-
mer case is much harder. It helps beam-forming to know
who speaks and when [12], but knowing where the speech is
coming from can improve speaker segmentation performance
[13, 14], e.g. through the use of inter-channel delay infor-
mation [9]. Work presented here is related to the latter case:
microphones are far apart and assigned to speakers, although
not in close proximity to the speakers mouth.
Deep neural networks (DNNs) have been introduced into
different stages of a diarisation system. Artificial neural net-
works (ANN) have been trained to learn a feature transform
[15] and DNNs can be trained to detect speech/nonspeech in
an SAD stage where adapting the DNN leads to improved
performance [16]. A speaker segmentation stage using auto-
associative neural networks (AANN) was proposed: with a
sliding window, an AANN model is trained on the left half
of the window and tested on the right half, giving a confi-
dence score on how likely the two halves belong to the same
speaker [17]. Finally, DNNs have been applied to the cluster-
ing stage by training speaker separation DNNs and adapting
these to specific recordings [16, 18].
Typically, speaker diarisation is unsupervised meaning no
a priori information or metadata is used to aid a system. The
desired output of a system is speaker labelled segments with
timing information. Whether diarisation is performed unsu-
pervised (ICSI system [19]), semi or lightly supervised (sup-
plementary data such as imperfect transcripts [20]) or super-
vised (known speakers [21]), the desired output remains the

[Figure: DNN diagram with channel combinations as input (e.g. C1C2C3C4, C1C2C4C3, ..., C4C3C2C1 for method (A); pairs C1C2, C1C3, ..., C4C3 for method (B)), hidden layers, and output target classes NS, P1-P4.]
Fig. 1. Feature concatenations and input labels are shown for
methods (A) a fixed number of channels for every recording
and (B) a mixed number of channels across recordings.
same. It will be shown that the obvious method of performing
diarisation on the individual headset microphone (IHM) chan-
nels is not satisfactory given imperfect data, such as chan-
nels containing heavy crosstalk. Thus, two methods are pro-
posed which train DNNs to detect which channel contains the
correct speaker at a given frame. Both methods concatenate
speaker channel features in training and testing. The first con-
catenates all speaker channels from a recording so it requires
each recording in a dataset to contain the same number of
speakers. As this is not portable to datasets which do not have
this trait, a second method is proposed which trains DNNs
on pairs of speaker channels. Furthermore, the problems of
crosstalk and overlapping speech are considered, as well as a
simple counting frame decision metric versus adding a bias
against selecting nonspeech.
2. DNN APPROACH USING SPEAKER CHANNELS
Two methods are presented: the first method is channel de-
tection when the specific number of channels is fixed and the
second is an extension to the first in which the data consists
of a mixed number of channels.
2.1. Fixed number of channels per recording
DNNs are trained on concatenated features from all the
speaker channels. It requires every recording to contain the
same number of speakers. Every combination of the channels
are used for training, as this may help prevent channels being
biased in certain positions. Example (A) in Figure 1 depicts
the ordering of the concatenated features with their equivalent
label file for training. It assumes there are four IHM channels
for every recording. The channels are referred to as C1, C2,
C3, C4 while each speaker-pure segment is labelled as P1, P2,
P3, P4 corresponding to the position of the relevant channel
in the feature concatenation. Nonspeech is referred to as NS.
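As a concrete illustration, the permutation scheme above can be sketched as follows. Function and variable names are hypothetical, and the labelling convention (-1 for nonspeech) is an assumption for the sketch, not taken from the paper.

```python
from itertools import permutations

import numpy as np

def fixed_channel_examples(channel_feats, frame_labels):
    """Build training examples for the fixed-channel method (a sketch).

    channel_feats: list of per-channel feature matrices, each (frames, dims).
    frame_labels:  per-frame index of the active speaker's channel, or -1
                   for nonspeech (an assumed labelling convention).
    Yields one (concatenated_features, targets) pair per channel ordering,
    so that no channel is biased towards a fixed input position.
    """
    n = len(channel_feats)
    for order in permutations(range(n)):              # all n! orderings
        feats = np.concatenate([channel_feats[c] for c in order], axis=1)
        # The target is the *position* (P1..Pn) of the active channel in
        # this particular ordering; position n denotes nonspeech (NS).
        targets = np.array([order.index(l) if l >= 0 else n
                            for l in frame_labels])
        yield feats, targets
```

For four channels this yields 24 orderings, matching the combination count discussed in Section 2.2.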
2.2. Mixed number of channels per recording
The fixed method is not portable to datasets which do not con-
tain the same number of speakers in each recording. A differ-
frame  C1C2C3C4  C1C2C4C3  ...  C4C3C2C1  output
 ...
 204      P1        P1     ...     P4       C1
 205      P1        P1     ...     P4       C1
 206      P1        NS     ...     P4       C1
 207      NS        NS     ...     P4       NS
 208      NS        NS     ...     NS       NS
 209      P3        P4     ...     P2       C3
 210      P2        P2     ...     P3       C2
 ...
Fig. 2. Frame decisions are made considering the decoded
outputs from all combinations of feature concatenations on
the testset. The simple counting method gives the output
displayed.
ent approach is required where pairs of features can be con-
catenated. Example (B) in Figure 1 displays how the channel
pairs are annotated as before, where position labels are nec-
essary to denote which channel contains speech and which is
nonspeech. For instances where the speech segment does not
belong to either channel, a nonspeech label is given.
As well as being applicable to all datasets, this alternative
approach also reduces the amount of data needed for training.
For a single recording in the fixed method, the number of pos-
sible combinations for training is x!, where x is the number
of channels, whereas for this method the number of possible
feature pairs for training becomes x(x-1). For example, if
there are 4 channels then the number of combinations needed
for each method is 24 and 12 respectively.
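The pairwise variant can be sketched in the same style as the fixed method; again the names and the -1 nonspeech convention are assumptions for illustration.

```python
from itertools import permutations

import numpy as np

def paired_channel_examples(channel_feats, frame_labels):
    """Sketch of the mixed-channel method: concatenate ordered *pairs*
    of channels rather than all channels at once.

    Targets (an assumed scheme): 0 if the left channel of the pair is
    active, 1 if the right one is, 2 for nonspeech -- including speech
    that belongs to neither channel of the pair, as described in the text.
    """
    n = len(channel_feats)
    for i, j in permutations(range(n), 2):        # x * (x - 1) ordered pairs
        feats = np.concatenate([channel_feats[i], channel_feats[j]], axis=1)
        targets = np.array([0 if l == i else 1 if l == j else 2
                            for l in frame_labels])
        yield (i, j), feats, targets

# For x = 4 channels: 4! = 24 full orderings vs 4 * 3 = 12 ordered pairs.
```

Because each example only spans two channels, the input dimensionality is also fixed regardless of how many speakers a recording contains, which is what makes the method portable across datasets.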
2.3. Frame decision
All the combinations of feature concatenations are used for
testing and this gives a channel or nonspeech label to every
frame. This results in multiple labels for every frame, across
the different decoded feature concatenations, as shown in Fig-
ure 2. To make a decision on the correct label, one can sim-
ply count the occurrences and select the channel or nonspeech
that has been labelled the most. Alternatively, the occurrences
can be counted as before with a bias for or against nonspeech
applied as a multiplier to increase or reduce the likelihood
of selecting nonspeech. A bias for or against specific chan-
nels could also be applied, for example if a host in a TV pro-
gramme is known to talk more than the guests.
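The counting metric and the nonspeech bias can be sketched together in a few lines; the function name and the representation of labels are hypothetical.

```python
from collections import Counter

def frame_decision(labels_per_combo, ns_bias=1.0, ns_label="NS"):
    """Decide a single label for one frame from the labels produced by
    all decoded feature concatenations (a sketch).

    Occurrences are counted and the nonspeech count is scaled by
    `ns_bias`: values below 1.0 bias the decision *against* selecting
    nonspeech, values above 1.0 favour it, and 1.0 is plain counting.
    """
    counts = Counter(labels_per_combo)
    if ns_label in counts:
        counts[ns_label] *= ns_bias
    return max(counts, key=counts.get)
```

For example, a frame labelled NS by two combinations and C4 by one is decided as NS under plain counting, but as C4 once a bias of 0.25 shrinks the nonspeech count to 0.5.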
3. EXPERIMENTS
3.1. Data
The methods are evaluated with two datasets in different do-
mains. TBL is TV broadcast data which consists of 22 pro-
grammes from a talk show with single distant microphone
(SDM) and IHM channels: four speakers (one host and
three guests). The recordings have been split into a training
set of 12 programmes for DNN training only, and a test set of
10 episodes which has a total of 40 speakers and 8749 seg-
ments in 5.3 hours of speech time. The audio was manually
transcribed to an accuracy of 0.1s.
The second is based on the established testset from the
NIST Rich Transcription evaluation in 2007 [8]. The com-

plete files were also manually transcribed to an accuracy of
0.1s¹, which produces a different reference to the original
testset. This updated reference contains 8 conference meet-
ings with both SDM and IHM channel data and contains 35
speakers and 11144 segments over 8.9 hours of speech time.
Six meetings contain 4 participants, one has 5 and another 6.
3.2. Experimental setup
DNNs require training on concatenated IHM channels and
log-Mel filterbanks of 23 dimension are used as opposed
to Mel frequency cepstral coefficients (MFCCs) as they are
found to yield better performance with DNNs [22]. Crosstalk
features (denoted CT), of 7 dimensions, may help reduce
errors caused by speech on the wrong channel [10]. The
energies are normalised across all N channels by

E_i^{norm}(n) = \frac{E_i(n)}{\sum_{k=1}^{N} E_k(n)}    (1)

where E_i(n) is the current channel i energy at frame n. Fur-
ther features are calculated such as kurtosis [23] and mean
cross-correlation and maximum normalised cross-correlation.
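Eq. (1) amounts to a per-frame division by the summed energy over channels; a minimal sketch (the epsilon guard for all-silent frames is an addition of this sketch, not part of the paper):

```python
import numpy as np

def normalise_energies(energies, eps=1e-12):
    """Eq. (1): divide each channel's frame energy by the sum over all
    N channels at that frame, so a channel carrying only crosstalk gets
    a small normalised energy while the active channel dominates.

    energies: array of shape (N_channels, n_frames).
    """
    total = energies.sum(axis=0, keepdims=True)  # sum over channels, per frame
    return energies / (total + eps)              # eps guards silent frames
```

By construction the normalised energies at each frame sum to (almost exactly) one across channels.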
DNNs for the fixed method are trained on TBL, whereas
DNNs for the mixed method are trained on TBL and the AMI
corpus [24]. The number of input neurons depends on the
number of concatenated channels. For 4 channels, there are
1472 input neurons, increasing to 1920 with CT, two hidden
layers of 1000 hidden units and 5 output neurons, which rep-
resent the 4 channels and nonspeech. For 2 channels, there
are 736 neurons, increasing to 960 with CT, two hidden layers
of 1000 hidden units and 3 output neurons, representing the
2 channels and nonspeech. Training on overlapping speech
may cause DNNs to learn errors and affect the performance
thus DNNs are trained with or without overlapping speech
(denoted OV).
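The quoted input sizes are consistent with a per-channel context window of 16 frames (4 x 23 x 16 = 1472, and 4 x (23 + 7) x 16 = 1920 with CT), although the paper does not state the context width explicitly, so treat that as an inference. A minimal forward-pass sketch of the topology with random weights (the paper does not specify activations; sigmoid hidden units and a softmax output are assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, sizes):
    """Forward pass through a fully connected net with the described
    layer sizes (a sketch with random weights, no training)."""
    for i, (d_in, d_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        w = rng.standard_normal((d_in, d_out)) * 0.01
        x = x @ w
        if i < len(sizes) - 2:
            x = 1.0 / (1.0 + np.exp(-x))          # sigmoid hidden units
    e = np.exp(x - x.max(axis=1, keepdims=True))  # softmax over classes
    return e / e.sum(axis=1, keepdims=True)

# Input sizes quoted in the paper, decomposed under the assumed
# 16-frame context window per channel:
assert 4 * 23 * 16 == 1472 and 4 * (23 + 7) * 16 == 1920   # 4-channel net
assert 2 * 23 * 16 == 736 and 2 * (23 + 7) * 16 == 960     # 2-channel net

# 4-channel topology: 1472 inputs, two 1000-unit hidden layers,
# 5 outputs (4 channel positions + nonspeech).
probs = mlp_forward(np.zeros((5, 1472)), [1472, 1000, 1000, 5])
```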
3.3. Diarisation evaluation
Diarisation error rate (DER) is the standard metric for speaker
diarisation and is the sum of three error values: miss (MS),
false alarm (FA) and speaker error (SE) [25]. The DER does
not consider the segmentation quality in its evaluation of a
system, so all tables depict the number of detected segments
[?]. Two scoring methods are investigated. The standard eval-
uation method for RT07 data is to use a collar of 0.25s and
score specific portions of time only, not complete recordings,
with the NIST reference [8]. This will be referred to as the
NIST setup. For the TBL dataset under the NIST setup, the
collar of 0.25s is also employed; however, the complete
recordings are evaluated with the manually transcribed
reference. The second scoring setup will be referred to as
SHEF. As both datasets have been manually transcribed to an
accuracy of 0.1s, a stricter collar of 0.05s is used, and scoring
occurs on the complete files with this reference.
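To make the three error components concrete, a toy frame-level scorer is sketched below. This is a simplification, not the NIST md-eval tool: it ignores collars and overlapping speech, which is precisely where the NIST and SHEF setups differ.

```python
def frame_der(ref, hyp, ns="NS"):
    """Toy frame-level DER sketch: miss (reference speech, hypothesis
    nonspeech), false alarm (reference nonspeech, hypothesis speech)
    and speaker error (both speech, wrong speaker), as a percentage of
    total reference speech frames."""
    speech = sum(r != ns for r in ref) or 1               # scored speech frames
    ms = sum(r != ns and h == ns for r, h in zip(ref, hyp))
    fa = sum(r == ns and h != ns for r, h in zip(ref, hyp))
    se = sum(r != ns and h != ns and r != h for r, h in zip(ref, hyp))
    return 100.0 * (ms + fa + se) / speech
```

For instance, with reference frames A, A, NS, B and hypothesis A, NS, B, B there is one miss and one false alarm over three speech frames, giving a DER of about 66.7%.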
¹ mini.dcs.shef.ac.uk/resources/dia-improvedrt07reference/
Scoring  Channel  #Segs  #Spkrs  DER%
Data: TBL
NIST     SDM       2030     82    16.6
         IHM       8478     40   393.9
SHEF     SDM       2030     82    27.8
         IHM       8478     40   335.9
Data: RT07
NIST     SDM       2648     72    37.9
         IHM      13070     35   308.1
SHEF     SDM       2648     72    66.4
         IHM      13070     35   371.0
Table 1. Baseline performance for both datasets on the SDM
and IHM channels evaluated using both NIST and SHEF scor-
ing setups, where #Segs represents the number of hypothesis
segments and #Spkrs represents the number of speakers.
3.4. Baseline experiments
The public domain toolkit, LIUM_SpkDiarization [4], is tai-
lored for TV and radio broadcasts and consists of Bayesian in-
formation criterion (BIC) segmentation with cross-likelihood
ratio and integer linear programming and i-vector clustering.
Table 1 displays results for both datasets and a distinction
is made between the two scoring setups as previously de-
scribed: NIST and SHEF. Scoring also occurs on both SDM
and IHM channels. For the SDM results for the TBL dataset,
changing the collar has a dramatic effect on the DER, from
16.6% to 27.8% with the stricter collar. For RT07 SDM, the
NIST scoring gives 37.9% against the SHEF result of 66.4%;
again, a large degradation in DER is seen with stricter scoring.
For the IHM results, the imperfect data has large amounts
of crosstalk which negatively affects the performance and
causes large false alarms from incorrectly detected speech for
both datasets, seen in both scoring setups.
The SHEF setup is a stricter scoring method however ar-
guably more reliable to show the true performance given the
more accurate references. The rest of the paper will use this
scoring method. The NIST setup can be seen as more le-
nient scoring as 0.25s collar around every boundary is a large
portion of time to ignore from evaluation. However, for the
results to be comparable to other papers, the best result will
be scored in the NIST setup at the end.
3.5. Results
Results for the fixed method can be seen in Table 2 for the
TBL dataset, in which there are 4 channels per recording. The
DERs are relatively similar apart from the DNN trained on
TBL+CT where the number of segments detected is dramati-
cally less than the other three. The DNN trained on TBL+OV
achieves the lowest DER of 8.0% with the lowest SE of 1.2%.
Training DNNs with crosstalk features degrades the result
compared to DNNs without.
Table 3 displays the performance when the frame deci-
sion metric involves a bias against the nonspeech (NS) occur-

TRN  OV  CT   #Segs  MS%  FA%  SE%  DER%
Data: TBL
TBL   x        6732  4.3  2.4  1.2   8.0
TBL   x   x    7136  4.3  2.4  1.7   8.4
TBL            7269  4.3  2.5  1.5   8.3
TBL       x    2964  4.6  3.7  1.4   9.7
Table 2. Results for the DNNs trained with 4 fixed channels
across recordings, with the counting frame decision metric.
NS bias #Segs MS% FA% SE% DER%
Data: TBL, DNN: TBL+OV
0.75 6594 4.3 2.6 1.3 8.2
0.5 6571 4.2 2.7 1.3 8.2
0.25 6569 4.2 2.8 1.4 8.3
Table 3. Results when a bias against nonspeech is introduced
for the frame decision metric for 4 channels concatenated,
specifically for DNN TBL+OV.
rences; the multiplier is specified in the table. Errors in the
miss rate are reduced but these seem to be moved to the false
alarm and speaker error, thus increasing the DER by 0.2-0.3%.
Table 4 displays results for the mixed method and two
additional DNNs are trained on AMI data. Comparing the
TBL results to the previous fixed method, more segments are
found here although the performance is worse overall. Train-
ing DNNs with OV does not help performance as it does in
the fixed method. The baseline of 27.8% DER is beaten in
all but two of the trained DNNs. A dramatically higher miss
rate than the false alarm and speaker error is seen across the
trained DNNs. This could imply the counting metric is too
simple as nonspeech is selected over the channels. The best
DNN is trained on TBL+CT and achieves a DER of 10.9%,
the only DNN which improves with CT. The DNNs trained
on AMI more than double the error. For RT07, again a large
amount of miss across the DNNs is seen, implying a non-
speech bias could help. The DERs are high and range from
58.2% to 80.1% which does not seem promising. The DNNs
trained on AMI do not outperform the TBL trained DNNs.
The lowest DER is found with the DNN trained on TBL only.
Based on the miss rates reported in Table 4, it is clear that
nonspeech is selected too often. Table 5 shows the perfor-
mance when a bias against nonspeech is introduced. As the
bias decreases, the likelihood of selecting nonspeech is de-
creased and the amount of missed speech detected is reduced.
For TBL, this is a small gain from 10.9% to 9.2% with a bias
of 0.25. However, a large gain is seen for the RT07 dataset
which jumps from 58.2% to 23.2% DER with the same bias.
These lowest results with the NIST setup would change to
5.7% for TBL and 15.1% for RT07.
4. CONCLUSION
Two methods for training DNNs to detect the correct speaker
channel for the purpose of speaker diarisation are presented.
TRN  OV  CT   #Segs  MS%   FA%  SE%  DER%
Data: TBL
TBL   x        8295  20.3  1.1  0.9  22.4
TBL   x   x   10551  34.8  0.7  1.1  36.5
TBL            8263  17.0  1.4  1.0  19.4
TBL       x    7932   7.7  0.9  1.2  10.9
AMI           10354  16.6  1.0  4.9  22.5
AMI   x        7683  22.9  0.9  5.0  28.8
Data: RT07
TBL   x        7979  60.9  0.8  0.4  62.1
TBL   x   x    4169  79.6  0.4  0.1  80.1
TBL            8430  56.5  1.2  0.4  58.2
TBL       x    5993  59.7  1.3  0.2  61.2
AMI            8791  58.9  0.5  0.1  59.5
AMI   x        6873  62.4  0.5  0.1  63.0
Table 4. Results for the DNNs trained with mixed channels
across recordings, with the counting frame decision metric.
NS bias #Segs MS% FA% SE% DER%
Data: TBL, DNN: TBL+CT
0.75 7950 7.3 1.9 1.3 10.6
0.5 7420 5.4 2.4 1.5 9.4
0.25 7468 4.9 2.6 1.7 9.2
Data: RT07, DNN: TBL
0.75 9940 39.5 1.5 0.6 41.5
0.5 11983 20.3 3.2 0.9 24.4
0.25 13898 14.0 7.4 1.8 23.2
Table 5. Results when a bias against nonspeech is introduced
for the frame decision metric for pairs of channels concate-
nated, specifically for DNN TBL+CT for the TBL dataset and
DNN TBL for the RT07 dataset.
The first requires a fixed number of speaker channels across
recordings and concatenates speaker channel features for
training and testing. The second does not require a fixed
number of speaker channels and concatenates pairs of fea-
tures. These were evaluated using two datasets with the
former finding the best DER for the TBL dataset, however, it
is not applicable to datasets with varying numbers of speaker
channels and requires more training data. The mixed method
performs well for both TBL and RT07 datasets and achieves
best results when a bias against nonspeech is applied, giving
9.2% and 23.2% respectively for the stricter scoring setup.
For the NIST setup, this reduces to 5.7% and 15.1% DER.
5. ACKNOWLEDGEMENTS
The authors would like to thank Jana Eggink and the BBC
for supporting this work and providing the data. This
work was also supported by the EPSRC Programme Grant
EP/I031022/1 Natural Speech Technology. Results are found
here: https://dx.doi.org/10.6084/m9.figshare.4312469.v1
