
Showing papers by "Chuck Wooters published in 2007"


Journal ArticleDOI
TL;DR: Classic acoustic beamforming techniques are proposed together with several novel algorithms to create a complete front-end for speaker diarization in the meeting-room domain; the same techniques also yield improvements in a speech recognition task.
Abstract: When performing speaker diarization on recordings from meetings, multiple microphones of different qualities are usually available and distributed around the meeting room. Although several approaches have been proposed in recent years to take advantage of multiple microphones, they are either too computationally expensive and not easily scalable or they cannot outperform the simpler case of using the best single microphone. In this paper, the use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain. New techniques we are presenting include blind reference-channel selection, two-step time delay of arrival (TDOA) Viterbi postprocessing, and a dynamic output signal weighting algorithm, together with using such TDOA values in the diarization to complement the acoustic information. Tests on speaker diarization show a 25% relative improvement on the test set compared to using a single most centrally located microphone. Additional experimental results show improvements using these techniques in a speech recognition task.

444 citations
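The delay-and-sum beamforming the abstract builds on can be sketched as follows. This is a minimal version with integer-sample delays estimated by plain cross-correlation against one reference channel; the actual front-end uses GCC-PHAT delay estimation, blind reference-channel selection, and Viterbi smoothing of the TDOA tracks. Function names and the `max_lag` parameter are illustrative, not from the paper.

```python
import numpy as np

def estimate_delay(ref, mic, max_lag):
    """Estimate the integer-sample delay of `mic` relative to `ref`
    by locating the peak of their cross-correlation within +/- max_lag."""
    corr = np.correlate(mic, ref, mode="full")
    lags = np.arange(-len(ref) + 1, len(mic))
    window = (lags >= -max_lag) & (lags <= max_lag)
    return lags[window][np.argmax(corr[window])]

def delay_and_sum(reference, others, max_lag=100):
    """Align each channel to the reference and average all channels."""
    out = reference.astype(float).copy()
    for mic in others:
        d = estimate_delay(reference, mic, max_lag)
        out += np.roll(np.asarray(mic, dtype=float), -d)
    return out / (1 + len(others))
```

Averaging the aligned channels reinforces the speech (which is coherent across microphones after alignment) while averaging down uncorrelated room noise.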


Journal ArticleDOI
TL;DR: The correlation between signals coming from multiple microphones is analyzed and an improved method for carrying out speaker diarization for meetings with multiple distant microphones is proposed, improving the Diarization Error Rate (DER) by 15% to 20% relative to previous systems.
Abstract: Human-machine interaction in meetings requires the localization and identification of the speakers interacting with the system as well as the recognition of the words spoken. A seminal step toward this goal is the field of rich transcription research, which includes speaker diarization together with the annotation of sentence boundaries and the elimination of speaker disfluencies. The sub-area of speaker diarization attempts to identify the number of participants in a meeting and create a list of speech time intervals for each such participant. In this paper, we analyze the correlation between signals coming from multiple microphones and propose an improved method for carrying out speaker diarization for meetings with multiple distant microphones. The proposed algorithm makes use of acoustic information and information from the delays between signals coming from the different sources. Using this procedure, we were able to achieve state-of-the-art performance in the NIST spring 2006 rich transcription evaluation, improving the Diarization Error Rate (DER) by 15% to 20% relative to previous systems.

91 citations
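The combination of acoustic information and inter-channel delays described above amounts to weighting two per-frame log-likelihood streams before they enter the diarization decision. A minimal sketch, with a hypothetical 0.9/0.1 split (the paper tunes its own stream weights):

```python
import numpy as np

def fused_log_likelihood(ll_acoustic, ll_delay, weight=0.9):
    """Linearly weight per-frame log-likelihoods from the acoustic
    (e.g. MFCC) stream and the TDOA stream. The 0.9/0.1 default is a
    hypothetical value, not the paper's tuned weight."""
    return (weight * np.asarray(ll_acoustic)
            + (1 - weight) * np.asarray(ll_delay))
```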


Journal Article
10 May 2007-CLEaR
TL;DR: The latest version of the SRI-ICSI meeting and lecture recognition system, as was used in the NIST RT-07 evaluations, is described, highlighting improvements made over the last year, and a new NIST metric designed to evaluate combined speech diarization and recognition is reported.
Abstract: We describe the latest version of the SRI-ICSI meeting and lecture recognition system, as was used in the NIST RT-07 evaluations, highlighting improvements made over the last year. Changes in the acoustic preprocessing include updated beamforming software for processing of multiple distant microphones, and various adjustments to the speech segmenter for close-talking microphones. Acoustic models were improved by the combined use of neural-net-estimated phone posterior features, discriminative feature transforms trained with fMPE-MAP, and discriminative Gaussian estimation using MPE-MAP, as well as model adaptation specifically to nonnative and non-American speakers. The net effect of these enhancements was a 14-16% relative error reduction on distant microphones, and a 16-17% error reduction on close-talking microphones. Also, for the first time, we report results on a new "coffee break" meeting genre, and on a new NIST metric designed to evaluate combined speech diarization and recognition.

62 citations
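The 14-17% figures above are relative error reductions, i.e. the fraction of the baseline error rate that was removed, not an absolute drop in percentage points:

```python
def relative_reduction(before, after):
    """Relative error reduction: fraction of the baseline error removed.
    E.g. going from 25.0% to 21.0% WER is a 16% relative reduction."""
    return (before - after) / before
```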


Proceedings ArticleDOI
01 Dec 2007
TL;DR: A framework to speed up agglomerative-clustering speaker diarization by adopting a computationally cheap method to reduce the hypothesis space of the more expensive and accurate model selection via the Bayesian Information Criterion (BIC).
Abstract: During the past few years, speaker diarization has achieved satisfying accuracy in terms of speaker Diarization Error Rate (DER). The most successful approaches, based on agglomerative clustering, however, exhibit an inherent computational complexity which makes real-time processing, especially in combination with further processing steps, almost impossible. In this article we present a framework to speed up agglomerative clustering speaker diarization. The basic idea is to adopt a computationally cheap method to reduce the hypothesis space of the more expensive and accurate model selection via the Bayesian Information Criterion (BIC). Two strategies, one based on the pitch correlogram and one on the unscented-transform-based approximation of the KL divergence, are used independently as a fast-match approach to select the most likely clusters to merge. We performed the experiments using the existing ICSI speaker diarization system. The new system using the KL-divergence fast-match strategy performs only 14% of the BIC comparisons needed in the baseline system and speeds up the system by 41% without affecting the speaker Diarization Error Rate (DER). The result is a robust, faster-than-real-time speaker diarization system.

39 citations
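The "expensive and accurate" merge test that the fast-match gates is a ΔBIC comparison between two clusters. A sketch of the standard full-covariance Gaussian form is below; note this is the textbook criterion, not ICSI's exact modified BIC, and the penalty weight λ is a tunable parameter:

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC for merging two clusters of d-dimensional feature
    vectors, each modelled by a single full-covariance Gaussian.
    Positive values favour merging; lam weights the model-size penalty."""
    xy = np.vstack([x, y])
    n1, n2, n = len(x), len(y), len(xy)
    d = x.shape[1]

    def logdet(c):
        # log-determinant of the sample covariance of cluster c
        return np.linalg.slogdet(np.cov(c, rowvar=False))[1]

    # penalty: one Gaussian (mean + full covariance) fewer after merging
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n1 * logdet(x) + 0.5 * n2 * logdet(y)
            - 0.5 * n * logdet(xy) + penalty)
```

Because every pass of agglomerative clustering scores all cluster pairs with a test like this, a cheap fast-match that prunes unlikely pairs first cuts the dominant cost of the system.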


Proceedings ArticleDOI
27 Aug 2007
TL;DR: A performance analysis, based on a series of oracle experiments, of a speaker diarization system similar to the one ICSI submitted to the NIST RT06s evaluation benchmark.
Abstract: In this paper we discuss the performance analysis of a speaker diarization system similar to the system that was submitted by ICSI to the NIST RT06s evaluation benchmark. The analysis, which is based on a series of oracle experiments, provides a good understanding of the performance of each system component on a test set of twelve conference meetings used in previous NIST benchmarks. Our analysis shows that the speech activity detection component contributes most to the total diarization error rate (23%). The inability to model overlapping speech is also a large source of errors (22%), followed by the component that creates the initial system models (15%).

38 citations
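The metric whose components are broken down above, DER, is the time-weighted sum of three error types divided by the total scored speech time. A minimal sketch of the definition:

```python
def diarization_error_rate(missed, false_alarm, speaker_error, scored_speech):
    """DER as scored in the NIST evaluations: missed speech time,
    false-alarm speech time, and speaker-confusion time (all in
    seconds), divided by the total scored speech time."""
    return (missed + false_alarm + speaker_error) / scored_speech
```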


Journal Article
01 Jan 2007-CLEaR
TL;DR: This paper used the most recent available version of the beamforming toolkit, implemented a new speech/non-speech detector that does not require models trained on meeting data, and performed the development on a much larger set of recordings.
Abstract: In this paper, we present the ICSI speaker diarization system. This system was used in the 2007 National Institute of Standards and Technology (NIST) Rich Transcription evaluation. The ICSI system automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers. Our system uses "standard" speech processing components and techniques such as HMMs, agglomerative clustering, and the Bayesian Information Criterion. However, we have developed the system with an eye towards robustness and ease of portability. Thus we have avoided the use of any sort of model that requires training on "outside" data and we have attempted to develop algorithms that require as little tuning as possible. The system is similar to last year's system [1] except for three aspects: we used the most recent available version of the beamforming toolkit, we implemented a new speech/non-speech detector that does not require models trained on meeting data, and we performed our development on a much larger set of recordings.

31 citations
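The agglomerative clustering step the abstract mentions can be sketched as a generic merge loop over a BIC-like score. Here `merge_score(a, b)` stands in for any criterion where positive values favour merging; this is an illustrative loop, not ICSI's exact initialisation or stopping rule:

```python
import numpy as np

def agglomerative_diarization(segments, merge_score):
    """Repeatedly merge the pair of clusters with the highest merge
    score until no pair scores above zero; the surviving clusters are
    the hypothesised speakers."""
    clusters = [np.asarray(s) for s in segments]
    while len(clusters) > 1:
        pairs = [(merge_score(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = max(pairs)
        if best <= 0:
            break  # no pair is worth merging: stop
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```

Because the stopping rule is score-based rather than a fixed cluster count, the loop also decides the number of speakers, which is why no prior knowledge of that number is needed.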


Proceedings ArticleDOI
27 Aug 2007
TL;DR: The speech activity detection system that was used for detecting speech regions in the Dutch TRECVID video collection is discussed; it is designed to filter non-speech such as music or sound effects out of the signal without the use of predefined non-speech models.
Abstract: In this paper we discuss the speech activity detection system that we used for detecting speech regions in the Dutch TRECVID video collection. The system is designed to filter non-speech such as music or sound effects out of the signal without the use of predefined non-speech models. Because the system trains its models on-line, it is robust when handling out-of-domain data. The speech activity error rate on an out-of-domain test set, recordings of English conference meetings, was 4.4%. The overall error rate on twelve randomly selected five-minute TRECVID fragments was 11.5%.

26 citations
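The on-line training idea above (no predefined non-speech models) can be illustrated with a toy two-cluster split on frame energies, fitted on the recording itself. This is only a sketch of the principle; the actual system trains HMM speech/non-speech models on-line:

```python
import numpy as np

def energy_sad(frame_energies, iters=10):
    """Toy on-line speech/non-speech split: 2-means on frame energies,
    initialised from the minimum and maximum energy. Frames assigned
    to the higher-energy cluster are labelled speech (True)."""
    e = np.asarray(frame_energies, dtype=float)
    lo, hi = e.min(), e.max()
    for _ in range(iters):
        speech = np.abs(e - hi) < np.abs(e - lo)  # nearer the high centre
        lo, hi = e[~speech].mean(), e[speech].mean()
    return speech
```

Because both cluster centres are re-estimated from the data at hand, nothing about the acoustic domain is baked into the detector, which is the property the abstract credits for the system's out-of-domain robustness.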