The ICSI RT07s Speaker Diarization System

doi:10.1007/978-3-540-68585-2_47

Book Chapter•DOI•

Chuck Wooters¹, Marijn Huijbregts¹•Institutions (1)

International Computer Science Institute¹

01 Jan 2008-pp 509-519

TL;DR: The ICSI speaker diarization system presented in this paper automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers, using standard speech processing components and techniques such as HMMs, agglomerative clustering, and the Bayesian Information Criterion.

Abstract: In this paper, we present the ICSI speaker diarization system. This system was used in the 2007 National Institute of Standards and Technology (NIST) Rich Transcription evaluation. The ICSI system automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers. Our system uses "standard" speech processing components and techniques such as HMMs, agglomerative clustering, and the Bayesian Information Criterion. However, we have developed the system with an eye towards robustness and ease of portability. Thus, we have avoided the use of any sort of model that requires training on "outside" data, and we have attempted to develop algorithms that require as little tuning as possible. The system is similar to last year's system [1] except for three aspects: we used the most recent available version of the beam-forming toolkit, we implemented a new speech/non-speech detector that does not require models trained on meeting data, and we performed our development on a much larger set of recordings.

Citations

Journal Article•DOI•

Speaker Diarization: A Review of Recent Research

Xavier Anguera Miro¹, Simon Bozonnet², Nicholas Evans², Corinne Fredouille³, Gerald Friedland⁴, Oriol Vinyals⁴•Institutions (4)

Telefónica¹, Institut Eurécom², University of Avignon³, International Computer Science Institute⁴

01 Feb 2012-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This review surveys speaker diarization research since 2006, analyzes performance as reported through the NIST Rich Transcription evaluations on meeting data, and identifies important areas for future research.

Abstract: Speaker diarization is the task of determining “who spoke when?” in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become an important key technology for many tasks, such as navigation, retrieval, or higher level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, from broadcast news, to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper, we review the current state-of-the-art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research.

706 citations

Proceedings Article•DOI•

An HDP-HMM for systems with state persistence

Emily B. Fox¹, Erik B. Sudderth², Michael I. Jordan², Alan S. Willsky¹•Institutions (2)

Massachusetts Institute of Technology¹, University of California, Berkeley²

05 Jul 2008

TL;DR: A sampling algorithm is developed that employs a truncated approximation of the DP to jointly resample the full state sequence, greatly improving mixing rates and demonstrating the advantages of the sticky extension, and the utility of the HDP-HMM in real-world applications.

Abstract: The hierarchical Dirichlet process hidden Markov model (HDP-HMM) is a flexible, nonparametric model which allows state spaces of unknown size to be learned from data. We demonstrate some limitations of the original HDP-HMM formulation (Teh et al., 2006), and propose a sticky extension which allows more robust learning of smoothly varying dynamics. Using DP mixtures, this formulation also allows learning of more complex, multimodal emission distributions. We further develop a sampling algorithm that employs a truncated approximation of the DP to jointly resample the full state sequence, greatly improving mixing rates. Via extensive experiments with synthetic data and the NIST speaker diarization database, we demonstrate the advantages of our sticky extension, and the utility of the HDP-HMM in real-world applications.
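
For orientation, the "sticky extension" mentioned in the abstract is usually written as a change to the prior on each state's transition distribution. The block below is a sketch in the notation commonly used for this model, not a quotation from this page: the symbols \gamma, \alpha, \kappa, \beta, \pi_j and \delta_j are assumptions introduced here for illustration.

```latex
% Global state weights shared by all rows of the transition matrix
\beta \mid \gamma \sim \mathrm{GEM}(\gamma)

% Transition distribution out of state j, with extra mass \kappa on self-transition
\pi_j \mid \alpha, \kappa, \beta \sim
  \mathrm{DP}\!\left(\alpha + \kappa,\;
  \frac{\alpha \beta + \kappa\, \delta_j}{\alpha + \kappa}\right)
```

Setting \kappa = 0 removes the self-transition bias and gives back the non-sticky HDP-HMM prior, while a larger \kappa favours longer stays in the same state, which is the behaviour one wants for speaker turns.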

313 citations

Book Chapter•DOI•

Bayesian Nonparametrics: Hierarchical Bayesian nonparametric models with applications

Yee Whye Teh, Michael I. Jordan

01 Apr 2010

TL;DR: The role of hierarchical modeling in Bayesian nonparametrics is discussed, focusing on models in which the infinite-dimensional parameters are treated hierarchically, and the value of these hierarchical constructions is demonstrated in a wide range of practical applications.

Abstract: Hierarchical modeling is a fundamental concept in Bayesian statistics. The basic idea is that parameters are endowed with distributions which may themselves introduce new parameters, and this construction recurses. In this review we discuss the role of hierarchical modeling in Bayesian nonparametrics, focusing on models in which the infinite-dimensional parameters are treated hierarchically. For example, we consider a model in which the base measure for a Dirichlet process is itself treated as a draw from another Dirichlet process. This yields a natural recursion that we refer to as a hierarchical Dirichlet process. We also discuss hierarchies based on the Pitman-Yor process and on completely random processes. We demonstrate the value of these hierarchical constructions in a wide range of practical applications, in problems in computational biology, computer vision and natural language processing.
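
The recursion described in the abstract, where the base measure of one Dirichlet process is itself drawn from another, can be written compactly. The following is a sketch in standard notation; the symbols \gamma, \alpha, H, G_0, G_j and the group index j are assumptions chosen here, not taken from this page.

```latex
% Top level: a shared discrete measure drawn around the base measure H
G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)

% Group level: each group j draws its own measure centred on G_0
G_j \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0), \qquad j = 1, \dots, J
```

Because G_0 is discrete with probability one, every G_j places its mass on the same set of atoms, which is what lets the groups share mixture components.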

290 citations

Journal Article•DOI•

Behavioral Signal Processing: Deriving Human Behavioral Informatics From Speech and Language

Shrikanth S. Narayanan¹, Panayiotis G. Georgiou¹•Institutions (1)

University of Southern California¹

07 Feb 2013

TL;DR: Behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion are illustrated.

Abstract: The expression and experience of human behavior are complex and multimodal and characterized by individual and contextual heterogeneity and variability. Speech and spoken language communication cues offer an important means for measuring and modeling human behavior. Observational research and practice across a variety of domains from commerce to healthcare rely on speech- and language-based informatics for crucial assessment and diagnostic information and for planning and tracking response to an intervention. In this paper, we describe some of the opportunities as well as emerging methodologies and applications of human behavioral signal processing (BSP) technology and algorithms for quantitatively understanding and modeling typical, atypical, and distressed human behavior with a specific focus on speech- and language-based communicative, affective, and social behavior. We describe the three important BSP components of acquiring behavioral data in an ecologically valid manner across laboratory to real-world settings, extracting and analyzing behavioral cues from measured data, and developing models offering predictive and decision-making support. We highlight both the foundational speech and language processing building blocks as well as the novel processing and modeling opportunities. Using examples drawn from specific real-world applications ranging from literacy assessment and autism diagnostics to psychotherapy for addiction and marital well being, we illustrate behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion.

286 citations

A sticky HDP-HMM with application to speaker diarization

Emily B. Fox¹, Erik B. Sudderth¹, Michael I. Jordan¹, Alan S. Willsky¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Jun 2011

TL;DR: An augmented HDP-HMM is described that provides effective control over the switching rate and makes it possible to treat emission distributions nonparametrically, and a sampling algorithm is developed that employs a truncated approximation of the Dirichlet process to jointly resample the full state sequence.

Funding: United States. Air Force Office of Scientific Research (Grant FA9550-06-1-0324); United States. Army Research Office (Grant W911NF-06-1-0076); United States. Air Force Office of Scientific Research (Grant FA9559-08-1-0180); United States. Defense Advanced Research Projects Agency. Information Processing Techniques Office (Contract FA8750-05-2-0249)

274 citations

References

Book•

Extrapolation, Interpolation, and Smoothing of Stationary Time Series

Norbert Wiener

01 Mar 1964

3,431 citations

Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion

S. Chen

01 Jan 1998

TL;DR: The segmentation algorithm can successfully detect acoustic changes; the clustering algorithm can produce clusters with high purity, leading to improvements in accuracy through unsupervised adaptation as much as the ideal clustering by the true speaker identities.

Abstract: In this paper, we are interested in detecting changes in speaker identity, environmental condition and channel condition; we call this the problem of acoustic change detection. The input audio stream can be modeled as a Gaussian process in the cepstral space. We present a maximum likelihood approach to detect turns of a Gaussian process; the decision of a turn is based on the Bayesian Information Criterion (BIC), a model selection criterion well-known in the statistics literature. The BIC criterion can also be applied as a termination criterion in hierarchical methods for clustering of audio segments: two nodes can be merged only if the merging increases the BIC value. Our experiments on the Hub4 1996 and 1997 evaluation data show that our segmentation algorithm can successfully detect acoustic changes; our clustering algorithm can produce clusters with high purity, leading to improvements in accuracy through unsupervised adaptation as much as the ideal clustering by the true speaker identities.
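
As an illustration of the BIC-based decision described in this abstract, here is a minimal Python sketch. It assumes each segment is modelled by a single full-covariance Gaussian fitted by maximum likelihood; the function name `delta_bic`, the `penalty_weight` parameter, and the sign convention are choices made here for illustration, not the paper's notation or code.

```python
import numpy as np


def delta_bic(x, y, penalty_weight=1.0):
    """BIC of 'two Gaussians' minus BIC of 'one Gaussian' for segments x and y.

    x, y: arrays of shape (n_frames, n_dims), e.g. cepstral feature vectors.
    A positive value favours keeping the segments separate (an acoustic
    change); a non-positive value favours merging them.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    d = x.shape[1]

    def logdet_ml_cov(frames):
        # Maximum-likelihood covariance (divide by n, not n - 1).
        cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
        _, logdet = np.linalg.slogdet(cov)
        return logdet

    # Log-likelihood gain from fitting two Gaussians instead of one.
    gain = 0.5 * (
        n * logdet_ml_cov(np.vstack([x, y]))
        - n1 * logdet_ml_cov(x)
        - n2 * logdet_ml_cov(y)
    )
    # Extra free parameters of the second Gaussian: mean plus covariance.
    n_extra_params = d + d * (d + 1) / 2
    penalty = 0.5 * penalty_weight * n_extra_params * np.log(n)
    return gain - penalty
```

Under this convention, two clusters would be merged when `delta_bic(x, y) <= 0`, which matches the abstract's rule that a merge is allowed only if it increases the BIC value. The `penalty_weight` argument plays the role of the adjustable penalty parameter that later work tries to avoid tuning.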

855 citations

Proceedings Article•DOI•

A robust speaker clustering algorithm

Jitendra Ajmera, Chuck Wooters¹•Institutions (1)

University of California, Berkeley¹

30 Nov 2003

TL;DR: The algorithm automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers and has the following advantages: no threshold adjustment requirements; no need for training/development data; and robustness to different data conditions.

Abstract: In this paper, we present a novel speaker segmentation and clustering algorithm. The algorithm automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers. Our algorithm uses "standard" speech processing components and techniques such as HMM, agglomerative clustering, and the Bayesian information criterion. However, we have combined and modified these so as to produce an algorithm with the following advantages: no threshold adjustment requirements; no need for training/development data; and robustness to different data conditions. This paper also reports the performance of this algorithm on different datasets released by the USA National Institute of Standards and Technology (NIST) with different initial conditions and parameter settings. The consistently low speaker-diarization error rate clearly indicates the robustness and utility of the algorithm.

263 citations

Proceedings Article•DOI•

Approaches and applications of audio diarization

D.A. Reynolds¹, Pedro A. Torres-Carrasquillo¹•Institutions (1)

Massachusetts Institute of Technology¹

18 Mar 2005

TL;DR: An overview of current audio diarization approaches is provided, together with a discussion of potential applications and the performance of current systems as measured in the DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation.

Abstract: Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization has utility in making automatic transcripts more readable and in searching and indexing audio archives. In this paper, we provide an overview of current audio diarization approaches and discuss performance and potential applications. We outline the general framework of diarization systems and present the performance of current systems as measured in the DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation. Lastly, we look at future challenges and directions for diarization research.

191 citations

Journal Article•DOI•

Robust speaker change detection

Jitendra Ajmera, Iain McCowan, Hervé Bourlard

26 Jul 2004-IEEE Signal Processing Letters

TL;DR: The authors present a criterion for identifying speaker changes in an audio stream without tuning a threshold or penalty parameter; it consists of calculating the log likelihood ratio (LLR) of two models with the same number of parameters.

Abstract: Most commonly used criteria for speaker change detection like log likelihood ratio (LLR) and Bayesian information criterion (BIC) have an adjustable threshold/penalty parameter to make speaker change decisions. These parameters are not always robust to different acoustic conditions and have to be tuned. In this letter, we present a criterion which can be used to identify speaker changes in an audio stream without such tuning. The criterion consists of calculating the LLR of two models with the same number of parameters. Results on the Hub4 1997 evaluation set indicate that we achieve a performance comparable to using BIC with optimal penalty term.
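
One way to read the criterion in this abstract is sketched below. The notation is an assumption made here for illustration: D_a and D_b are the data on either side of a candidate change point, \theta_a and \theta_b are their models, \theta is a single model for the pooled data D, \#(\cdot) counts free parameters, N is the total number of frames, and \lambda is the BIC penalty weight.

```latex
% BIC difference between the one-model and two-model hypotheses
\Delta\mathrm{BIC}
  = \log p(D \mid \theta)
  - \log p(D_a \mid \theta_a)
  - \log p(D_b \mid \theta_b)
  - \frac{\lambda}{2}\,\bigl[\#(\theta) - \#(\theta_a) - \#(\theta_b)\bigr]\log N

% If the pooled model is given as many parameters as the two models together,
% \#(\theta) = \#(\theta_a) + \#(\theta_b), the penalty term vanishes:
\Delta\mathrm{BIC}
  = \log p(D \mid \theta)
  - \log p(D_a \mid \theta_a)
  - \log p(D_b \mid \theta_b)
```

With the penalty gone, the decision reduces to the sign of a plain log likelihood ratio, so there is no \lambda left to tune: a non-negative value favours a single speaker, a negative value favours a change.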

168 citations