DNN approach to speaker diarisation using speaker channels
Summary
Introduction
- Speaker diarisation determines “who spoke when” in audio recordings, and has been studied extensively in the context of tasks such as broadcast news, meetings, etc.
- Speaker clustering aims to group speaker segments together into speaker-homogeneous clusters.
- Diarisation has been well studied over the years, and toolkits are available for this task which are designed to perform well for a specific type of data [3, 4, 5].
- Thus, two methods are proposed which train DNNs to detect which channel contains the correct speaker at a given frame.
- Furthermore, the problems of crosstalk and overlapping speech are considered, along with the frame decision strategy: a simple counting metric versus adding a bias against selecting nonspeech.
2.1. Fixed number of channels per recording
- DNNs are trained on concatenated features from all the speaker channels.
- It requires every recording to contain the same number of speakers.
- Every combination of the channels is used for training, as this may help prevent channels from being biased towards certain positions.
- Example (A) in Figure 1 depicts the ordering of the concatenated features with their equivalent label file for training.
- The channels are referred to as C1, C2, C3, C4 while each speaker-pure segment is labelled as P1, P2, P3, P4 corresponding to the position of the relevant channel in the feature concatenation.
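The layout above can be sketched as follows; this is a minimal illustration of the fixed method's training data, not the authors' implementation, and the feature shapes and random features are assumptions made purely for the example.

```python
import itertools
import numpy as np

# Each channel contributes a per-frame feature matrix of shape
# (n_frames, feat_dim); the values here are random placeholders.
rng = np.random.default_rng(0)
n_frames, feat_dim = 100, 23
channels = {f"C{i}": rng.standard_normal((n_frames, feat_dim)) for i in range(1, 5)}

training_examples = []
for order in itertools.permutations(channels):  # every ordering of C1..C4
    feats = np.concatenate([channels[c] for c in order], axis=1)
    # Position labels P1..P4 follow each channel's slot in the
    # concatenation, so a frame where channel order[k] speaks is P(k+1).
    training_examples.append((order, feats))

print(len(training_examples))         # 4! = 24 orderings
print(training_examples[0][1].shape)  # (100, 92): 4 channels x 23 dims
```

Iterating over all 24 orderings is what lets every channel appear in every position of the concatenation.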
2.2. Mixed number of channels per recording
- The fixed method is not portable to datasets which do not contain the same number of speakers in each recording.
- Example (B) in Figure 1 displays how the channel pairs are annotated as before, where position labels are necessary to denote which channel contains speech and which is nonspeech.
- As well as being applicable to all datasets, this alternative approach also reduces the amount of data needed for training.
- For a single recording in the fixed method, the number of possible combinations for training is x!, where x is the number of channels.
- Whereas for this method, the number of possible feature pairs for training becomes x(x − 1).
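The training-size argument can be checked directly: the fixed method needs every full ordering of the x channels, while the mixed method only needs ordered channel pairs.

```python
import itertools
import math

# Compare the number of training concatenations per recording:
# fixed method -> x! full orderings, mixed method -> x(x-1) ordered pairs.
for x in (4, 5, 6):
    channel_ids = [f"C{i}" for i in range(1, x + 1)]
    fixed = list(itertools.permutations(channel_ids))     # full orderings
    mixed = list(itertools.permutations(channel_ids, 2))  # ordered pairs
    assert len(fixed) == math.factorial(x)
    assert len(mixed) == x * (x - 1)
    print(x, len(fixed), len(mixed))  # e.g. 4 -> 24 vs 12
```

The gap widens quickly: at 6 channels the fixed method would need 720 concatenations against only 30 pairs.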
2.3. Frame decision
- All the combinations of feature concatenations are used for testing and this gives a channel or nonspeech label to every frame.
- To make a decision on the correct label, one can simply count the occurrences and select the channel or nonspeech that has been labelled the most.
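The counting decision can be sketched as a per-frame majority vote; the function name and toy labels below are illustrative, and the bias variant mentioned earlier would simply down-weight nonspeech votes before selecting.

```python
from collections import Counter

def majority_vote(per_combination_labels):
    """Pick, for each frame, the label assigned most often across
    the tested feature combinations."""
    n_frames = len(per_combination_labels[0])
    decisions = []
    for t in range(n_frames):
        votes = Counter(labels[t] for labels in per_combination_labels)
        decisions.append(votes.most_common(1)[0][0])
    return decisions

# Toy example: three combinations, three frames ("NS" = nonspeech).
frame_labels = [
    ["C1", "C1", "NS"],
    ["C1", "C2", "NS"],
    ["C1", "C2", "C2"],
]
print(majority_vote(frame_labels))  # ['C1', 'C2', 'NS']
```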
- The second dataset is based on the established test set from the NIST Rich Transcription evaluation in 2007 (RT07) [8].
- This updated reference covers 8 conference meetings with both SDM and IHM channel data, comprising 35 speakers and 11,144 segments over 8.9 hours of speech time.
- Six meetings contain 4 participants, one has 5 and another 6.
3.2. Experimental setup
- DNNs require training on concatenated IHM channels; 23-dimensional log-Mel filterbanks are used rather than Mel frequency cepstral coefficients, as they are found to yield better performance with DNNs [22].
- Crosstalk features (denoted CT), of 7 dimensions, may help reduce errors caused by speech on the wrong channel [10].
- DNNs for the fixed method are trained on TBL, whereas DNNs for the mixed method are trained on TBL and the AMI corpus [24].
- For 4 channels, there are 1472 input neurons (increasing to 1920 with CT), two hidden layers of 1000 units each, and 5 output neurons representing the 4 channels and nonspeech.
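The quoted layer sizes are internally consistent if one assumes a 16-frame context window; the window size is an inference from the arithmetic, not stated in the summary.

```python
# Reconstruct the input-layer sizes from the stated feature dimensions.
# The 16-frame context window is an assumption that makes the numbers match.
n_channels, fbank_dim, ct_dim, context = 4, 23, 7, 16

inputs_plain = n_channels * fbank_dim * context             # 4 * 23 * 16
inputs_ct = n_channels * (fbank_dim + ct_dim) * context     # 4 * 30 * 16
outputs = n_channels + 1  # one neuron per channel plus nonspeech

print(inputs_plain, inputs_ct, outputs)  # 1472 1920 5
```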
3.3. Diarisation evaluation
- Diarisation error rate (DER) is the standard metric for speaker diarisation and is the sum of three error values: miss (MS), false alarm (FA) and speaker error (SE) [25].
- The standard evaluation method for RT07 data is to use a collar of 0.25s and score specific portions of time only, not complete recordings, with the NIST reference [8].
- As both datasets have been manually transcribed to an accuracy of 0.1s, a stricter collar of 0.05s is used, and scoring occurs on the complete files with this reference.
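The DER definition above is just a ratio of summed error durations to scored speech time; the durations in this sketch are made up purely for the arithmetic.

```python
def der(missed, false_alarm, speaker_error, total_scored):
    """Diarisation error rate as a percentage: the sum of miss (MS),
    false alarm (FA) and speaker error (SE) durations, all in seconds,
    over the total scored speech time."""
    return 100.0 * (missed + false_alarm + speaker_error) / total_scored

# Toy durations: 2s missed + 1.5s false alarm + 4.5s speaker error
# over 100s of scored time gives an 8% DER.
print(der(2.0, 1.5, 4.5, 100.0))  # 8.0
```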
3.4. Baseline experiments
- The public domain toolkit LIUM SpkrDiarization [4] is tailored for TV and radio broadcasts and consists of Bayesian information criterion (BIC) segmentation, cross-likelihood ratio clustering, and integer linear programming (ILP) clustering with i-vectors.
- Table 1 displays results for both datasets and a distinction is made between the two scoring setups as previously described: NIST and SHEF.
- Scoring also occurs on both SDM and IHM channels.
- For the SDM results for the TBL dataset, changing the collar has a dramatic effect on the DER, from 16.6% to 27.8% with the stricter collar.
- The rest of the paper will use this scoring method.
3.5. Results
- Results for the fixed method can be seen in Table 2 for the TBL dataset, in which there are 4 channels per recording.
- The DNNs trained on AMI do not outperform the TBL trained DNNs.