Speaker Diarization: A Review of Recent Research
Frequently Asked Questions (13)
Q2. What is the main drawback of model-based approaches?
The main drawback of model-based approaches is their reliance on external data for the training of speech and nonspeech models, which makes them less robust to changes in acoustic conditions.
Q3. What is the effect of including nonspeech segments in speaker modelling?
The inclusion of nonspeech segments in speaker modelling leads to less discriminant models and thus increased difficulties in segmentation.
Q4. How did the initial approaches for diarization work?
Initial approaches for diarization tried to solve speech activity detection on the fly, i.e., by having a nonspeech cluster be a by-product of the diarization.
Q5. What is the common approach to agglomerative hierarchical clustering?
Also known as agglomerative hierarchical clustering (AHC or AGHC), the bottom-up approach trains a number of clusters or models and aims at successively merging and reducing the number of clusters until only one remains for each speaker.
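The bottom-up merging loop described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the Euclidean distance between cluster centroids and the fixed `stop_distance` threshold are assumptions standing in for the model-based merging and stopping criteria that real diarization systems use, and the function names are hypothetical.

```python
import math


def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def agglomerative_cluster(segments, stop_distance):
    """Start with one cluster per segment and repeatedly merge the closest
    pair until no pair is closer than stop_distance (assumed criterion)."""
    clusters = [[seg] for seg in segments]
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters by centroid distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > stop_distance:
            break  # remaining clusters are assumed to be distinct speakers
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy 2-D "segment features": three from one speaker, two from another.
segments = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 0.2]]
clusters = agglomerative_cluster(segments, stop_distance=1.0)
print(len(clusters), [len(c) for c in clusters])
```

On this toy input the loop stops with two clusters, one per simulated speaker. The stopping threshold plays the role of deciding when "only one remains for each speaker"; choosing it badly leads to over- or under-clustering.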
Q6. What is the role of speaker diarization in the analysis of meeting data?
Speaker diarization plays an important role in the analysis of meeting data since it allows for such content to be structured in speaker turns, to which linguistic content and other metadata can be added (such as the dominant speakers, the level of interactions, or emotions).
Q7. What is the way to improve the quality of the speaker diarization process?
A good compromise between missed and false alarm speech error rates has to be found to enhance the quality of the following speaker diarization process.
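To make the two error types concrete, the sketch below computes frame-level missed speech and false alarm rates for a speech activity detector. It assumes binary per-frame labels (1 for speech, 0 for nonspeech) and normalizes both counts by the number of reference speech frames; the function name and the normalization are illustrative assumptions, not taken from the source.

```python
def sad_error_rates(reference, hypothesis):
    """Return (missed_rate, false_alarm_rate) for binary frame labels.

    Both rates are normalized by the number of reference speech frames
    (an assumed convention for this sketch)."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")
    n_speech = sum(reference)
    if n_speech == 0:
        raise ValueError("reference contains no speech frames")
    # Missed speech: reference says speech, detector says nonspeech.
    missed = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    # False alarm: detector says speech, reference says nonspeech.
    false_alarm = sum(1 for r, h in zip(reference, hypothesis) if h and not r)
    return missed / n_speech, false_alarm / n_speech


reference = [0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
hypothesis = [0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
missed_rate, false_alarm_rate = sad_error_rates(reference, hypothesis)
print(missed_rate, false_alarm_rate)
```

A more permissive detector lowers the missed rate at the cost of more false alarms, and vice versa, which is the compromise the answer above refers to.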
Q8. What is the way to avoid an unrealistic assignment of very small consecutive segments to different speaker models?
When performing frame assignment using the Viterbi algorithm, a minimum assignment duration is usually enforced to avoid an unrealistic assignment of very small consecutive segments to different speaker models.
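One way to enforce such a minimum duration is to track, for each speaker, how many consecutive frames have been assigned so far and to allow a speaker change only once that count reaches the minimum. The sketch below does this with a small dynamic program. It assumes per-frame log-likelihoods are already available for each speaker model; the function name, the `switch_penalty` parameter, and the toy scores are hypothetical.

```python
def min_duration_viterbi(loglik, min_dur, switch_penalty=0.0):
    """loglik[t][k] is the log-likelihood of frame t under speaker k.
    Returns one speaker index per frame; every run lasts >= min_dur frames."""
    n_frames, n_spk = len(loglik), len(loglik[0])
    if min_dur < 1 or n_frames < min_dur:
        raise ValueError("need min_dur >= 1 and at least min_dur frames")
    neg_inf = float("-inf")
    last = min_dur - 1  # duration counter value meaning "minimum reached"
    # score[k][d]: best path score ending in speaker k after d+1 frames
    # in the current run (the counter is capped at `last`).
    score = [[neg_inf] * min_dur for _ in range(n_spk)]
    for k in range(n_spk):
        score[k][0] = loglik[0][k]
    backptrs = []
    for t in range(1, n_frames):
        new_score = [[neg_inf] * min_dur for _ in range(n_spk)]
        bp = [[None] * min_dur for _ in range(n_spk)]
        for k in range(n_spk):
            for d in range(min_dur):
                preds = []
                if d > 0:  # continue the current run
                    preds.append((score[k][d - 1], k, d - 1))
                if d == last:  # stay once the minimum is reached
                    preds.append((score[k][d], k, d))
                if d == 0:  # switch in from a speaker whose run is complete
                    preds += [(score[j][last] - switch_penalty, j, last)
                              for j in range(n_spk) if j != k]
                if not preds:
                    continue
                best, pk, pd = max(preds)
                if best > neg_inf:
                    new_score[k][d] = best + loglik[t][k]
                    bp[k][d] = (pk, pd)
        score = new_score
        backptrs.append(bp)
    # The final run must also satisfy the minimum duration.
    k = max(range(n_spk), key=lambda s: score[s][last])
    d = last
    path = [k]
    for bp in reversed(backptrs):
        k, d = bp[k][d]
        path.append(k)
    path.reverse()
    return path


# Frame-wise best speaker contains a one-frame blip at index 3.
framewise = [0, 0, 0, 1, 0, 0, 1, 1, 1]
loglik = [[-1.0 if k == spk else -3.0 for k in range(2)] for spk in framewise]
path = min_duration_viterbi(loglik, min_dur=3)
print(path)
```

With `min_dur=3`, the isolated frame at index 3 is absorbed into the surrounding speaker 0 run, while the genuine three-frame turn at the end is kept.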
Q9. What is the reason for the large variations in DER observed among different meetings?
The large variations in DER observed among the different meetings and meeting sets originate from the large variance of many important factors for speaker diarization, which makes the conference meeting domain not as easily tractable as more formalized settings such as broadcast news, lectures, or court house trials.
Q10. What is the common characteristic of the evaluations?
A common characteristic of these evaluations (see http://nist.gov/speech/tests/rt) is that the only a priori knowledge available to the participants relates to the recording scenario/source (e.g., conference meetings, lectures, or coffee breaks for the meetings domain), the language (English), and the formats of the input and output files.
Q11. What are the main reasons why top-down approaches are so popular?
Top-down approaches are also extremely computationally efficient and can be improved through cluster purification [17]. A recent alternative approach, though also bottom-up in nature, is inspired by rate-distortion theory and is based on an information-theoretic framework [18].
Q12. What is the main focus of the European Union Multimodal Meeting Manager project?
All these projects addressed the research and development of multimodal technologies dedicated to the enhancement of human-to-human communications (notably in distant access) by automatically extracting meeting content, making the information available to meeting participants, or for archiving purposes.
Q13. What DER was reported using delay features alone?
More recently, an approach to the unsupervised discriminant analysis of inter-channel delay features was proposed in [92] and results of approximately 20% DER were reported using delay features alone.