
DNN approach to speaker diarisation using speaker channels

07 Mar 2017-pp 4925-4929
TL;DR: It is shown that given imperfect data, such as speaker channels with heavy crosstalk and overlapping speech, this is not the case, and a simple frame decision metric of counting occurrences is investigated.
Abstract: Speaker diarisation addresses the question of “who speaks when” in audio recordings, and has been studied extensively in the context of tasks such as broadcast news, meetings, etc. Performing diarisation on individual headset microphone (IHM) channels is sometimes assumed to easily give the desired output of speaker labelled segments with timing information. However, it is shown that given imperfect data, such as speaker channels with heavy crosstalk and overlapping speech, this is not the case. Deep neural networks (DNNs) can be trained on features derived from the concatenation of speaker channel features to detect which is the correct channel for each frame. Crosstalk features can be calculated and DNNs trained with or without overlapping speech to combat problematic data. A simple frame decision metric of counting occurrences is investigated as well as adding a bias against selecting nonspeech for a frame. Finally, two different scoring setups are applied to both datasets. The stricter SHEF setup finds diarisation error rates (DER) of 9.2% on TBL and 23.2% on RT07 while the NIST setup achieves 5.7% and 15.1% respectively.

Summary (2 min read)

Introduction

  • Speaker diarisation addresses the question of “who speaks when” in audio recordings, and has been studied extensively in the context of tasks such as broadcast news, meetings, etc.
  • Speaker clustering aims to group speaker segments together into speaker-homogeneous clusters.
  • Diarisation has been well studied over the years, and toolkits are available for this task which are designed to perform well for a specific type of data [3, 4, 5].
  • Thus, two methods are proposed which train DNNs to detect which channel contains the correct speaker at a given frame.
  • Furthermore, the problems of crosstalk and overlapping speech are considered, as well as a simple counting frame decision metric vs. adding a bias against selecting nonspeech.

2.1. Fixed number of channels per recording

  • DNNs are trained on concatenated features from all the speaker channels.
  • It requires every recording to contain the same number of speakers.
  • Every combination of the channels is used for training, as this may help prevent channels being biased in certain positions.
  • Example (A) in Figure 1 depicts the ordering of the concatenated features with their equivalent label file for training.
  • The channels are referred to as C1, C2, C3, C4 while each speaker-pure segment is labelled as P1, P2, P3, P4 corresponding to the position of the relevant channel in the feature concatenation.

2.2. Mixed number of channels per recording

  • The fixed method is not portable to datasets which do not contain the same number of speakers in each recording.
  • Example (B) in Figure 1 displays how the channel pairs are annotated as before, where position labels are necessary to denote which channel contains speech and which is nonspeech.
  • As well as being applicable to all datasets, this alternative approach also reduces the amount of data needed for training.
  • For a single recording in the fixed method, the number of possible combinations for training is x!, where x is the number of channels.
  • Whereas for this method, the number of possible feature pairs for training becomes x(x − 1).

2.3. Frame decision

  • All the combinations of feature concatenations are used for testing and this gives a channel or nonspeech label to every frame.
  • To make a decision on the correct label, one can simply count the occurrences and select the channel or nonspeech that has been labelled the most.
3.1. Data

  • The methods are evaluated with two datasets in different domains: TBL, TV broadcast data from a talk show, and a testset based on the NIST Rich Transcription evaluation in 2007 [8].
  • This updated reference contains 8 conference meetings with both SDM and IHM channel data, 35 speakers and 11144 segments over 8.9 hours of speech time.
  • Six meetings contain 4 participants, one has 5 and another 6.

3.2. Experimental setup

  • DNNs require training on concatenated IHM channels, and log-Mel filterbanks of 23 dimensions are used as opposed to Mel frequency cepstral coefficients as they are found to yield better performance with DNNs [22].
  • Crosstalk features (denoted CT), of 7 dimensions, may help reduce errors caused by speech on the wrong channel [10].
  • DNNs for the fixed method are trained on TBL, whereas DNNs for the mixed method are trained on TBL and the AMI corpus [24].
  • For 4 channels, there are 1472 input neurons, increasing to 1920 with CT, two hidden layers of 1000 hidden units and 5 output neurons, which represent the 4 channels and nonspeech.

3.3. Diarisation evaluation

  • Diarisation error rate (DER) is the standard metric for speaker diarisation and is the sum of three error values: miss (MS), false alarm (FA) and speaker error (SE) [25].
  • The standard evaluation method for RT07 data is to use a collar of 0.25s and score specific portions of time only, not complete recordings, with the NIST reference [8].
  • As both datasets have been manually transcribed to an accuracy of 0.1s, a stricter collar of 0.05s is used, and scoring occurs on the complete files with this reference.

3.4. Baseline experiments

  • The public domain toolkit, LIUM SpkrDiarization [4], is tailored for TV and radio broadcasts and consists of Bayesian information criterion (BIC) segmentation with cross-likelihood ratio and integer linear programming and i-vector clustering.
  • Table 1 displays results for both datasets and a distinction is made between the two scoring setups as previously described: NIST and SHEF.
  • Scoring also occurs on both SDM and IHM channels.
  • For the SDM results for the TBL dataset, changing the collar has a dramatic effect on the DER, from 16.6% to 27.8% with the stricter collar.
  • The rest of the paper will use this scoring method.

3.5. Results

  • Results for the fixed method can be seen in Table 2 for the TBL dataset, in which there are 4 channels per recording.
  • The DNNs trained on AMI do not outperform the TBL trained DNNs.


This is a repository copy of “DNN approach to speaker diarisation using speaker channels” (Accepted Version), available from White Rose Research Online: http://eprints.whiterose.ac.uk/121245/

Milner, R. and Hain, T. (orcid.org/0000-0003-0939-3464) (2017) DNN approach to speaker diarisation using speaker channels. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 5-9, 2017, New Orleans, USA. IEEE, pp. 4925-4929. ISBN 9781509041176. https://doi.org/10.1109/ICASSP.2017.7953093

DNN APPROACH TO SPEAKER DIARISATION USING SPEAKER CHANNELS
Rosanna Milner, Thomas Hain
Speech and Hearing Research Group, University of Sheffield, UK
rmmilner2,t.hain@sheffield.ac.uk
ABSTRACT
Speaker diarisation addresses the question of “who speaks
when” in audio recordings, and has been studied extensively
in the context of tasks such as broadcast news, meetings,
etc. Performing diarisation on individual headset microphone
(IHM) channels is sometimes assumed to easily give the de-
sired output of speaker labelled segments with timing infor-
mation. However, it is shown that given imperfect data, such
as speaker channels with heavy crosstalk and overlapping
speech, this is not the case. Deep neural networks (DNNs)
can be trained on features derived from the concatenation of
speaker channel features to detect which is the correct chan-
nel for each frame. Crosstalk features can be calculated and
DNNs trained with or without overlapping speech to combat
problematic data. A simple frame decision metric of counting
occurrences is investigated as well as adding a bias against
selecting nonspeech for a frame. Finally, two different scor-
ing setups are applied to both datasets. The stricter SHEF
setup finds diarisation error rates (DER) of 9.2% on TBL
and 23.2% on RT07 while the NIST setup achieves 5.7% and
15.1% respectively.
Index Terms: speaker diarisation, multi-channel, crosstalk, deep neural networks, speaker channels
1. INTRODUCTION
The task of speaker diarisation is an important prerequisite
task for audio indexing, automatic speech recognition (ASR)
and more [1, 2]. The objective is to split the audio into seg-
ments which are associated with a single speaker, and to iden-
tify among the set of segments those that are spoken by the
same speaker. Diarisation systems generally consist of three
main stages: speech activity detection (SAD), speaker seg-
mentation and speaker clustering. SAD aims to detect speech
segments which are passed to a speaker segmentation stage to
split the segments further at speaker change points (speaker
boundaries). Speaker clustering aims to group speaker seg-
ments together into speaker-homogeneous clusters. The ob-
jective is not only to group the speakers correctly, but also to
find the correct number of clusters (i.e. speakers). Diarisation
has been well studied over the years, and toolkits are available
for this task which are designed to perform well for a specific
type of data [3, 4, 5].
The challenges to multi-channel diarisation differ by do-
main. For conversational telephone speech (CTS) only two
speakers are present. However, channel echo, speaker over-
lap, poor quality phone lines and noise cause errors, despite
independent channels for each speaker [2]. Broadcast news
(BN) data has background noises such as music, but also a
large number of speakers who may only occur very briefly [6,
7]. Meeting data has been the focus of diarisation research for considerable time [8]. Speech is conversational with significant
amounts of speaker overlap, as it is for CTS. However, there
are more speakers, and speech may be recorded with distant
or far-field microphones. Multi-channel diarisation operates
in two different modes, depending on the distance between
the microphones and the speakers: using beam-forming to fo-
cus on speakers [9]; or detecting automatically which speaker
is closer and disregarding other speech [10, 11]. The for-
mer case is much harder. It helps beam-forming to know
who speaks and when [12], but knowing where the speech is
coming from can improve speaker segmentation performance
[13, 14], e.g. through the use of inter-channel delay infor-
mation [9]. Work presented here is related to the latter case:
microphones are far apart and assigned to speakers, although
not in close proximity to the speaker's mouth.
Deep neural networks (DNNs) have been introduced into
different stages of a diarisation system. Artificial neural net-
works (ANN) have been trained to learn a feature transform
[15] and DNNs can be trained to detect speech/nonspeech in
an SAD stage where adapting the DNN leads to improved
performance [16]. A speaker segmentation stage using auto-
associative neural networks (AANN) was proposed in which a
windowing method is used where an AANN model is trained
for the left half of the window and tested on the right to give a
confidence score on how likely each part belongs to the same
speaker [17]. Finally, DNNs have been applied to the cluster-
ing stage by training speaker separation DNNs and adapting
these to specific recordings [16, 18].
Typically, speaker diarisation is unsupervised meaning no
a priori information or metadata is used to aid a system. The
desired output of a system is speaker labelled segments with
timing information. Whether diarisation is performed unsu-
pervised (ICSI system [19]), semi or lightly supervised (sup-
plementary data such as imperfect transcripts [20]) or super-
vised (known speakers [21]), the desired output remains the same.

[Fig. 1. Feature concatenations and input labels are shown for methods (A) a fixed number of channels for every recording (channel orderings C1C2C3C4, C1C2C4C3, ..., C4C3C2C1, with target classes NS and P1-P4) and (B) a mixed number of channels across recordings (channel pairs C1C2, C1C3, ..., C4C3, with target classes NS, P1 and P2).]

It will be shown that the obvious method of performing
diarisation on the individual headset microphone (IHM) chan-
nels is not satisfactory given imperfect data, such as chan-
nels containing heavy crosstalk. Thus, two methods are pro-
posed which train DNNs to detect which channel contains the
correct speaker at a given frame. Both methods concatenate
speaker channel features in training and testing. The first con-
catenates all speaker channels from a recording so it requires
each recording in a dataset to contain the same number of
speakers. As this is not portable to datasets which do not have
this trait, a second method is proposed which trains DNNs
on pairs of speaker channels. Furthermore, the problems of
crosstalk and overlapping speech are considered, as well as
a simple counting frame decision metric vs. adding a bias
against selecting nonspeech.
2. DNN APPROACH USING SPEAKER CHANNELS
Two methods are presented: the first method is channel de-
tection when the specific number of channels is fixed and the
second is an extension to the first in which the data consists
of a mixed number of channels.
2.1. Fixed number of channels per recording
DNNs are trained on concatenated features from all the
speaker channels. It requires every recording to contain the
same number of speakers. Every combination of the channels
is used for training, as this may help prevent channels being
biased in certain positions. Example (A) in Figure 1 depicts
the ordering of the concatenated features with their equivalent
label file for training. It assumes there are four IHM channels
for every recording. The channels are referred to as C1, C2,
C3, C4 while each speaker-pure segment is labelled as P1, P2,
P3, P4 corresponding to the position of the relevant channel
in the feature concatenation. Nonspeech is referred to as NS.
2.2. Mixed number of channels per recording
The fixed method is not portable to datasets which do not con-
tain the same number of speakers in each recording. A different approach is required where pairs of features can be concatenated.

frame  C1C2C3C4  C1C2C4C3  ...  C4C3C2C1  output
 ...      ...       ...     ...    ...      ...
 204      P1        P1      ...    P4       C1
 205      P1        P1      ...    P4       C1
 206      P1        NS      ...    P4       C1
 207      NS        NS      ...    P4       NS
 208      NS        NS      ...    NS       NS
 209      P3        P4      ...    P2       C3
 210      P2        P2      ...    P3       C2
 ...      ...       ...     ...    ...      ...

Fig. 2. Frame decisions are made considering the decoded outputs from all combinations of feature concatenations on the testset. The simple counting method gives the output displayed.

Example (B) in Figure 1 displays how the channel
pairs are annotated as before, where position labels are nec-
essary to denote which channel contains speech and which is
nonspeech. For instances where the speech segment does not
belong to either channel, a nonspeech label is given.
As well as being applicable to all datasets, this alternative
approach also reduces the amount of data needed for training.
For a single recording in the fixed method, the number of pos-
sible combinations for training is x!, where x is the number
of channels, whereas for this method the number of possible
feature pairs for training is x(x − 1). For example, if
there are 4 channels then the amount of combinations needed
for each method is 24 and 12 respectively.
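
To make the combinatorics concrete, here is a minimal Python sketch (illustrative only, not the authors' tooling) that enumerates the orderings each method trains on:

```python
from itertools import permutations

def fixed_combinations(channels):
    # Fixed method: one training example per full ordering of all
    # channels, giving x! combinations for x channels.
    return list(permutations(channels))

def mixed_pairs(channels):
    # Mixed method: one training example per ordered pair of distinct
    # channels, giving x(x - 1) pairs.
    return list(permutations(channels, 2))

channels = ["C1", "C2", "C3", "C4"]
print(len(fixed_combinations(channels)))  # 24 = 4!
print(len(mixed_pairs(channels)))         # 12 = 4 * 3
```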
2.3. Frame decision
All the combinations of feature concatenations are used for
testing and this gives a channel or nonspeech label to every
frame. This results in multiple labels for every frame, across
the different decoded feature concatenations, as shown in Fig-
ure 2. To make a decision on the correct label, one can sim-
ply count the occurrences and select the channel or nonspeech
that has been labelled the most. Alternatively, the occurrences
can be counted as before with a bias for or against nonspeech
applied as a multiplier to increase or reduce the likelihood
of selecting nonspeech. A bias for or against specific chan-
nels could also be applied, for example if a host in a TV pro-
gramme is known to talk more than the guests.
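
A minimal sketch of this frame decision rule, assuming the decoded position labels have already been mapped back to channel identities (function and label names are illustrative):

```python
from collections import Counter

def frame_decision(votes, ns_bias=1.0):
    # votes: labels for one frame across all decoded combinations,
    # e.g. ["C1", "NS", "C1"]. ns_bias scales the nonspeech count,
    # so values below 1 bias the decision against nonspeech.
    counts = Counter(votes)
    if "NS" in counts:
        counts["NS"] = counts["NS"] * ns_bias
    return max(counts, key=counts.get)

# Frame 207 in Figure 2: two nonspeech votes against one channel vote.
print(frame_decision(["NS", "NS", "C1"]))                # NS (simple count)
print(frame_decision(["NS", "NS", "C1"], ns_bias=0.25))  # C1 (biased)
```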
3. EXPERIMENTS
3.1. Data
The methods are evaluated with two datasets in different do-
mains. TBL is TV broadcast data which consists of 22 pro-
grammes from a talk show with single distant microphone
(SDM) and IHM channels: four speakers, one host and
three guests. The recordings have been split into a training
set of 12 programmes for DNN training only, and a test set of
10 episodes which has a total of 40 speakers and 8749 seg-
ments in 5.3 hours of speech time. The audio was manually
transcribed to an accuracy of 0.1s.
The second is based on the established testset from the
NIST Rich Transcription evaluation in 2007 [8]. The com-

plete files were also manually transcribed to an accuracy of 0.1s¹, which produces a different reference to the original
testset. This updated reference contains 8 conference meetings with both SDM and IHM channel data, 35 speakers and 11144 segments over 8.9 hours of speech time.
Six meetings contain 4 participants, one has 5 and another 6.
3.2. Experimental setup
DNNs require training on concatenated IHM channels, and
log-Mel filterbanks of 23 dimensions are used as opposed
to Mel frequency cepstral coefficients (MFCCs) as they are
found to yield better performance with DNNs [22]. Crosstalk
features (denoted CT), of 7 dimensions, may help reduce
errors caused by speech on the wrong channel [10]. The
energies are normalised across all N channels by
    E_i^{\mathrm{norm}}(n) = \frac{E_i(n)}{\sum_{k=1}^{N} E_k(n)}        (1)

where E_i(n) is the energy of the current channel i at frame n.
Further features are calculated, such as kurtosis [23], mean cross-correlation and maximum normalised cross-correlation.
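
A minimal sketch of the normalisation in Eq. (1), assuming per-channel frame energies have already been computed (numpy used purely for illustration):

```python
import numpy as np

def normalise_energies(E):
    # E: array of shape (N, T) holding E_i(n) for N channels, T frames.
    # Returns E_i^norm(n) = E_i(n) / sum_k E_k(n), as in Eq. (1).
    total = E.sum(axis=0, keepdims=True)     # sum over channels per frame
    return E / np.maximum(total, 1e-12)      # guard against all-zero frames

E = np.abs(np.random.randn(4, 100))  # dummy energies: 4 channels, 100 frames
E_norm = normalise_energies(E)
assert np.allclose(E_norm.sum(axis=0), 1.0)  # columns now sum to one
```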
DNNs for the fixed method are trained on TBL, whereas
DNNs for the mixed method are trained on TBL and the AMI
corpus [24]. The number of input neurons depends on the
number of concatenated channels. For 4 channels, there are
1472 input neurons, increasing to 1920 with CT, two hidden
layers of 1000 hidden units and 5 output neurons, which rep-
resent the 4 channels and nonspeech. For 2 channels, there
are 736 neurons, increasing to 960 with CT, two hidden layers
of 1000 hidden units and 3 output neurons, representing the
2 channels and nonspeech. Training on overlapping speech
may cause DNNs to learn errors and degrade performance;
thus, DNNs are trained both with and without overlapping speech
(denoted OV).
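
The stated input sizes are consistent with a 16-frame context window (4 channels × 23 filterbank dimensions × 16 = 1472, or 4 × 30 × 16 = 1920 with the crosstalk features), though the paper does not state the window explicitly. A minimal PyTorch sketch of the described topology, with sigmoid hidden activations assumed since the activation is not stated:

```python
import torch.nn as nn

def channel_dnn(n_channels=4, feat_dim=23, context=16):
    # Input: concatenated per-channel features over a context window;
    # output: one class per channel plus one for nonspeech.
    n_in = n_channels * feat_dim * context
    return nn.Sequential(
        nn.Linear(n_in, 1000), nn.Sigmoid(),   # hidden layer 1
        nn.Linear(1000, 1000), nn.Sigmoid(),   # hidden layer 2
        nn.Linear(1000, n_channels + 1),       # softmax applied in the loss
    )

fixed_dnn = channel_dnn(4)              # 1472 -> 1000 -> 1000 -> 5
mixed_dnn = channel_dnn(2)              # 736 -> 1000 -> 1000 -> 3
fixed_ct = channel_dnn(4, feat_dim=30)  # 1920 inputs with crosstalk features
```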
3.3. Diarisation evaluation
Diarisation error rate (DER) is the standard metric for speaker
diarisation and is the sum of three error values: miss (MS),
false alarm (FA) and speaker error (SE) [25]. The DER does
not consider the segmentation quality in its evaluation of a
system, so all tables depict the number of detected segments
[?]. Two scoring methods are investigated. The standard eval-
uation method for RT07 data is to use a collar of 0.25s and
score specific portions of time only, not complete recordings,
with the NIST reference [8]. This will be referred to as the
NIST setup. In terms of the TBL dataset for the NIST setup,
the collar of 0.25s will be employed; however, the complete
recordings will be evaluated with the manually transcribed
reference. The second scoring setup will be referred to as
SHEF. As both datasets have been manually transcribed to an
accuracy of 0.1s, a stricter collar of 0.05s is used, and scoring
occurs on the complete files with this reference.
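
As a toy illustration of how the metric composes (a sketch only, not the NIST scoring tool, which additionally applies collars and an optimal reference-to-hypothesis speaker mapping):

```python
def der(miss_s, fa_s, spk_err_s, scored_speech_s):
    # DER = (MS + FA + SE) / total scored speech time, as a percentage.
    return 100.0 * (miss_s + fa_s + spk_err_s) / scored_speech_s

# Toy example: 43 s missed, 24 s false alarm and 12 s speaker error
# over 1000 s of scored speech gives a DER of 7.9%.
print(round(der(43.0, 24.0, 12.0, 1000.0), 1))
```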
¹ mini.dcs.shef.ac.uk/resources/dia-improvedrt07reference/
Scoring  Channel  #Segs  #Spkrs   DER%
Data: TBL
NIST     SDM       2030      82   16.6
NIST     IHM       8478      40  393.9
SHEF     SDM       2030      82   27.8
SHEF     IHM       8478      40  335.9
Data: RT07
NIST     SDM       2648      72   37.9
NIST     IHM      13070      35  308.1
SHEF     SDM       2648      72   66.4
SHEF     IHM      13070      35  371.0

Table 1. Baseline performance for both datasets on the SDM and IHM channels evaluated using both NIST and SHEF scoring setups, where #Segs represents the number of hypothesis segments and #Spkrs represents the number of speakers.
3.4. Baseline experiments
The public domain toolkit, LIUM SpkrDiarization [4], is tai-
lored for TV and radio broadcasts and consists of Bayesian in-
formation criterion (BIC) segmentation with cross-likelihood
ratio and integer linear programming and i-vector clustering.
Table 1 displays results for both datasets and a distinction
is made between the two scoring setups as previously de-
scribed: NIST and SHEF. Scoring also occurs on both SDM
and IHM channels. For the SDM results for the TBL dataset,
changing the collar has a dramatic effect on the DER, from
16.6% to 27.8% with the stricter collar. For RT07 SDM, the
NIST scoring gives 37.9% against the SHEF result of 66.4%;
again, a large difference in DER is seen between the setups.
For the IHM results, the imperfect data has large amounts
of crosstalk which negatively affects the performance and
causes large false alarms from incorrectly detected speech for
both datasets, seen in both scoring setups.
The SHEF setup is a stricter scoring method, but it is arguably more reliable in showing the true performance given the more accurate references. The rest of the paper will use this
scoring method. The NIST setup can be seen as more le-
nient scoring, as a 0.25s collar around every boundary is a large
portion of time to ignore from evaluation. However, for the
results to be comparable to other papers, the best result will
be scored in the NIST setup at the end.
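
To see why a 0.25s collar is lenient, here is a toy frame-level sketch of collar exclusion (hypothetical helper; the real tool operates on time intervals):

```python
def scored_mask(boundaries, n_frames, collar):
    # Frames within `collar` frames of any reference boundary are
    # excluded from scoring; everything else is evaluated.
    mask = [True] * n_frames
    for b in boundaries:
        for t in range(max(0, b - collar), min(n_frames, b + collar + 1)):
            mask[t] = False
    return mask

# At 100 frames/s, a 0.25 s collar drops 25 frames either side of each
# boundary; the stricter 0.05 s SHEF collar drops only 5.
print(sum(scored_mask([120, 480], 1000, collar=25)))  # 898 frames scored
print(sum(scored_mask([120, 480], 1000, collar=5)))   # 978 frames scored
```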
3.5. Results
Results for the fixed method can be seen in Table 2 for the
TBL dataset, in which there are 4 channels per recording. The
DERs are relatively similar apart from the DNN trained on
TBL+CT where the number of segments detected is dramati-
cally less than the other three. The DNN trained on TBL+OV
achieves the lowest DER of 8.0% with the lowest SE of 1.2%.
Training DNNs with crosstalk features degrades the result
compared to DNNs without.
Table 3 displays the performance when the frame decision metric involves a bias against the nonspeech (NS) occurrences; the multiplier is specified in the table.

TRN  OV  CT   #Segs  MS%  FA%  SE%  DER%
Data: TBL
TBL   x        6732  4.3  2.4  1.2   8.0
TBL   x   x    7136  4.3  2.4  1.7   8.4
TBL            7269  4.3  2.5  1.5   8.3
TBL       x    2964  4.6  3.7  1.4   9.7

Table 2. Results for the DNNs trained with 4 fixed channels across recordings, with the counting frame decision metric.
NS bias  #Segs  MS%  FA%  SE%  DER%
Data: TBL, DNN: TBL+OV
0.75      6594  4.3  2.6  1.3   8.2
0.5       6571  4.2  2.7  1.3   8.2
0.25      6569  4.2  2.8  1.4   8.3

Table 3. Results when a bias against nonspeech is introduced for the frame decision metric for 4 channels concatenated, specifically for DNN TBL+OV.
Errors in the miss rate are reduced, but these seem to be moved to the false alarm
and speaker error, thus increasing the DERs by 0.2-0.3%.
Table 4 displays results for the mixed method and two
additional DNNs are trained on AMI data. Comparing the
TBL results to the previous fixed method, more segments are
found here although the performance is worse overall. Train-
ing DNNs with OV does not help performance as it does in
the fixed method. The baseline of 27.8% DER is beaten in
all but two of the trained DNNs. A dramatically higher miss
rate than the false alarm and speaker error is seen across the
trained DNNs. This could imply the counting metric is too
simple as nonspeech is selected over the channels. The best
DNN is trained on TBL+CT and achieves a DER of 10.9%,
the only DNN which improves with CT. The DNNs trained
on AMI more than double the error. For RT07, again a large
amount of miss across the DNNs is seen, implying a non-
speech bias could help. The DERs are high and range from
58.2% to 80.1% which does not seem promising. The DNNs
trained on AMI do not outperform the TBL trained DNNs.
The lowest DER is found with the DNN trained on TBL only.
Based on the miss rates reported in Table 4, it can be seen
that nonspeech is detected often. Table 5 shows the perfor-
mance when a bias against nonspeech is introduced. As the
bias decreases, the likelihood of selecting nonspeech is de-
creased and the amount of missed speech detected is reduced.
For TBL, this is a small gain from 10.9% to 9.2% with a bias
of 0.25. However, a large gain is seen for the RT07 dataset
which jumps from 58.2% to 23.2% DER with the same bias.
These lowest results with the NIST setup would change to
5.7% for TBL and 15.1% for RT07.
4. CONCLUSION
Two methods for training DNNs to detect the correct speaker channel for the purposes of speaker diarisation are presented.
TRN  OV  CT   #Segs   MS%  FA%  SE%  DER%
Data: TBL
TBL   x        8295  20.3  1.1  0.9  22.4
TBL   x   x   10551  34.8  0.7  1.1  36.5
TBL            8263  17.0  1.4  1.0  19.4
TBL       x    7932   7.7  0.9  1.2  10.9
AMI           10354  16.6  1.0  4.9  22.5
AMI   x        7683  22.9  0.9  5.0  28.8
Data: RT07
TBL   x        7979  60.9  0.8  0.4  62.1
TBL   x   x    4169  79.6  0.4  0.1  80.1
TBL            8430  56.5  1.2  0.4  58.2
TBL       x    5993  59.7  1.3  0.2  61.2
AMI            8791  58.9  0.5  0.1  59.5
AMI   x        6873  62.4  0.5  0.1  63.0

Table 4. Results for the DNNs trained with mixed channels across recordings, with the counting frame decision metric.
NS bias  #Segs   MS%  FA%  SE%  DER%
Data: TBL, DNN: TBL+CT
0.75      7950   7.3  1.9  1.3  10.6
0.5       7420   5.4  2.4  1.5   9.4
0.25      7468   4.9  2.6  1.7   9.2
Data: RT07, DNN: TBL
0.75      9940  39.5  1.5  0.6  41.5
0.5      11983  20.3  3.2  0.9  24.4
0.25     13898  14.0  7.4  1.8  23.2

Table 5. Results when a bias against nonspeech is introduced for the frame decision metric for pairs of channels concatenated, specifically for DNN TBL+CT for the TBL dataset and DNN TBL for the RT07 dataset.
The first requires a fixed number of speaker channels across
recordings and concatenates speaker channel features for
training and testing. The second does not require a fixed
number of speaker channels and concatenates pairs of fea-
tures. These were evaluated using two datasets, with the former finding the best DER for the TBL dataset; however, it
is not applicable to datasets with varying numbers of speaker
channels and requires more training data. The mixed method
performs well for both TBL and RT07 datasets and achieves
best results when a bias against nonspeech is applied, giving
9.2% and 23.2% respectively for the stricter scoring setup.
For the NIST setup, this reduces to 5.7% and 15.1% DER.
5. ACKNOWLEDGEMENTS
The authors would like to thank Jana Eggink and the BBC
for supporting this work and providing the data. This
work was also supported by the EPSRC Programme Grant
EP/I031022/1 Natural Speech Technology. Results are found
here: https://dx.doi.org/10.6084/m9.figshare.4312469.v1

Citations
Proceedings ArticleDOI
15 Apr 2018
TL;DR: This work investigates single- and multi-talk in a wide range of different crosstalk levels, and improved the detection accuracy towards a standardized voice activity detection overall by 12.89 % absolute.
Abstract: Multichannel recordings of meetings with a (wireless) headset for each person deliver commonly the best audio quality for subsequent analyses. However, still speech portions of other participants can couple into the microphone channel of the associated target speaker. Due to this crosstalk, a speaker activity detection (SAD) is required in order to identify only the speech portions of the target speaker in the related microphone channel. While most solutions are either complex and need a training process, or achieve insufficient results in multi-talk situations, we propose a low complexity method, which can handle both crosstalk and multi-talk situations. We investigate single- and multi-talk in a wide range of different crosstalk levels, and improved the detection accuracy towards a standardized voice activity detection overall by 12.89 % absolute, whereas a state-of-the-art multichannel SAD was exceeded even by 13.76 % absolute.

6 citations


Cites background from "DNN approach to speaker diarisation..."

  • ...State-of-the-art approaches typically use hidden Markov models [8, 9, 12, 13], multilayer perceptron classifiers [14], or approaches with deep neural networks [15]....


Journal ArticleDOI
TL;DR: This work proposes multimodal data fusion and deep learning approach relying on the smartphone’s microphone and accelerometer sensors to estimate occupancy and augments the model with a magnetometer-dependent fingerprinting-based localization scheme to assimilate the volume of location-specific gathering.
Abstract: Occupancy detection helps enable various emerging smart environment applications ranging from opportunistic HVAC (heating, ventilation, and air-conditioning) control, effective meeting management, healthy social gathering, and public event planning and organization. Ubiquitous availability of smartphones and wearable sensors with the users for almost 24 hours helps revitalize a multitude of novel applications. The inbuilt microphone sensor in smartphones plays as an inevitable enabler to help detect the number of people conversing with each other in an event or gathering. A large number of other sensors such as accelerometer and gyroscope help count the number of people based on other signals such as locomotive motion. In this work, we propose multimodal data fusion and deep learning approach relying on the smartphone’s microphone and accelerometer sensors to estimate occupancy. We first demonstrate a novel speaker estimation algorithm for people counting and extend the proposed model using deep nets for handling large-scale fluid scenarios with unlabeled acoustic signals. We augment our occupancy detection model with a magnetometer-dependent fingerprinting-based localization scheme to assimilate the volume of location-specific gathering. We also propose crowdsourcing techniques to annotate the semantic location of the occupant. We evaluate our approach in different contexts: conversational, silence, and mixed scenarios in the presence of 10 people. Our experimental results on real-life data traces in natural settings show that our cross-modal approach can achieve approximately 0.53 error count distance for occupancy detection accuracy on average.

6 citations


Cites background from "DNN approach to speaker diarisation..."

  • ...Milner and Hain [31] concatenated features from all the audio channels and helped train the DNN (deep neural network) model with this mixed feature to predict the number of speakers using audio signals....


Dissertation
16 Dec 2016
TL;DR: A method involving a pretrained Speaker Separation Deep Neural Network (ssDNN) is investigated which performs speaker clustering and speaker segmentation using DNNs successfully for meeting data and with mixed results for broadcast media.
Abstract: Speaker diarisation answers the question “who spoke when?” in an audio recording. The input may vary, but a system is required to output speaker labelled segments in time. Typical stages are Speech Activity Detection (SAD), speaker segmentation and speaker clustering. Early research focussed on Conversational Telephone Speech (CTS) and Broadcast News (BN) domains before the direction shifted to meetings and, more recently, broadcast media. The British Broadcasting Corporation (BBC) supplied data through the Multi-Genre Broadcast (MGB) Challenge in 2015 which showed the difficulties speaker diarisation systems have on broadcast media data. Diarisation is typically an unsupervised task which does not use auxiliary data or information to enhance a system. However, methods which do involve supplementary data have shown promise. Five semi-supervised methods are investigated which use a combination of inputs: different channel types and transcripts. The methods involve Deep Neural Networks (DNNs) for SAD, DNNs trained for channel detection, transcript alignment, and combinations of these approaches. However, the methods are only applicable when datasets contain the required inputs. Therefore, a method involving a pretrained Speaker Separation Deep Neural Network (ssDNN) is investigated which is applicable to every dataset. This technique performs speaker clustering and speaker segmentation using DNNs successfully for meeting data and with mixed results for broadcast media. The task of diarisation focuses on two aspects: accurate segments and speaker labels. The Diarisation Error Rate (DER) does not evaluate the segmentation quality as it does not measure the number of correctly detected segments. Other metrics exist, such as boundary and purity measures, but these also mask the segmentation quality. An alternative metric is presented based on the F-measure which considers the number of hypothesis segments correctly matched to reference segments. A deeper insight into the segment quality is shown through this metric.

4 citations


Cites methods from "DNN approach to speaker diarisation..."

  • ...Finally, the research in (Milner and Hain, 2017) presents Method 3 and 4 from Chapter 5, described in Section 5....


References
Proceedings ArticleDOI
12 May 2008
TL;DR: This work presents the initial work toward developing an overlap detection system for improved meeting diarization, and investigates various features, with a focus on high-precision performance for use in the detector, and examines performance results on a subset of the AMI Meeting Corpus.
Abstract: State-of-the-art speaker diarization systems for meetings are now at a point where overlapped speech contributes significantly to the errors made by the system. However, little if no work has yet been done on detecting overlapped speech. We present our initial work toward developing an overlap detection system for improved meeting diarization. We investigate various features, with a focus on high-precision performance for use in the detector, and examine performance results on a subset of the AMI Meeting Corpus. For the high-quality signal case of a single mixed-headset channel signal, we demonstrate a relative improvement of about 7.4% DER over the baseline diarization system, while for the more challenging case of the single far-field channel signal relative improvement is 3.6%. We also outline steps towards improvement and moving beyond this initial phase.

150 citations

Journal ArticleDOI
TL;DR: This paper summarizes the collaboration of the LIA and CLIPS laboratories on speaker diarization of broadcast news during the spring NIST Rich Transcription 2003 evaluation campaign (NIST-RT03S).

141 citations


"DNN approach to speaker diarisation..." refers background in this paper

  • ...Broadcast news (BN) data has background noises such as music, but also a large number of speakers who may only occur very briefly [6, 7]....


DissertationDOI
21 Nov 2008
TL;DR: In this thesis methods are presented for which no external training data is required for training models, and these novel methods have been implemented in a large vocabulary continuous speech recognition system called SHoUT.
Abstract: In this thesis, research on large vocabulary continuous speech recognition for unknown audio conditions is presented. For automatic speech recognition systems based on statistical methods, it is important that the conditions of the audio used for training the statistical models match the conditions of the audio to be processed. Any mismatch will decrease the accuracy of the recognition. If it is unpredictable what kind of data can be expected, or in other words if the conditions of the audio to be processed are unknown, it is impossible to tune the models. If the material consists of `surprise data' the output of the system is likely to be poor. In this thesis methods are presented for which no external training data is required for training models. These novel methods have been implemented in a large vocabulary continuous speech recognition system called SHoUT. This system consists of three subsystems: speech/non-speech classification, speaker diarization and automatic speech recognition.

129 citations


"DNN approach to speaker diarisation..." refers background in this paper

  • ...Diarisation has been well studied over the years, and toolkits are available for this task which are designed to perform well for a specific type of data [3, 4, 5]....


Journal ArticleDOI
TL;DR: Tests performed on a large corpus of recorded meetings show classification accuracies of up to 96%, and automatic speech recognition performance close to that obtained using ground truth segmentation.
Abstract: The analysis of scenarios in which a number of microphones record the activity of speakers, such as in a round-table meeting, presents a number of computational challenges. For example, if each participant wears a microphone, speech from both the microphone's wearer (local speech) and from other participants (crosstalk) is received. The recorded audio can be broadly classified in four ways: local speech, crosstalk plus local speech, crosstalk alone and silence. We describe two experiments related to the automatic classification of audio into these four classes. The first experiment attempted to optimize a set of acoustic features for use with a Gaussian mixture model (GMM) classifier. A large set of potential acoustic features were considered, some of which have been employed in previous studies. The best-performing features were found to be kurtosis, "fundamentalness," and cross-correlation metrics. The second experiment used these features to train an ergodic hidden Markov model classifier. Tests performed on a large corpus of recorded meetings show classification accuracies of up to 96%, and automatic speech recognition performance close to that obtained using ground truth segmentation.

124 citations


"DNN approach to speaker diarisation..." refers background in this paper

  • ...Multi-channel diarisation operates in two different modes, depending on the distance between the microphones and the speakers: using beam-forming to focus on speakers [9]; or detecting automatically which speaker is closer and disregarding other speech [10, 11]....


Proceedings ArticleDOI
17 Sep 2006
TL;DR: This paper presents a system for the automatic segmentation of multiple-channel individual headset microphone (IHM) meeting recordings for automatic speech recognition that relies on an MLP classifier trained from several meeting room corpora to identify speech/non-speech segments of the recordings.
Abstract: One major research challenge in the domain of the analysis of meeting room data is the automatic transcription of what is spoken during meetings, a task which has gained considerable attention within the ASR research community through the NIST rich transcription evaluations conducted over the last three years. One of the major difficulties in carrying out automatic speech recognition (ASR) on this data is dealing with the challenging recording environment, which has instigated the development of novel audio pre-processing approaches. In this paper we present a system for the automatic segmentation of multiple-channel individual headset microphone (IHM) meeting recordings for automatic speech recognition. The system relies on an MLP classifier trained from several meeting room corpora to identify speech/non-speech segments of the recordings. We give a detailed analysis of the segmentation performance for a number of system configurations, with our best system achieving ASR performance on automatically generated segments within 1.3% (3.7% relative) of a manual segmentation of the data.

70 citations


"DNN approach to speaker diarisation..." refers background in this paper

  • ...Multi-channel diarisation operates in two different modes, depending on the distance between the microphones and the speakers: using beam-forming to focus on speakers [9]; or detecting automatically which speaker is closer and disregarding other speech [10, 11]....


  • ...Crosstalk features (denoted CT), of 7 dimensions, may help reduce errors caused by speech on the wrong channel [10]....

