
Showing papers by "Chuck Wooters published in 2007"


Journal ArticleDOI
TL;DR: Classic acoustic beamforming techniques are proposed together with several novel algorithms to create a complete front-end for speaker diarization in the meeting-room domain; the same techniques also yield improvements in a speech recognition task.
Abstract: When performing speaker diarization on recordings from meetings, multiple microphones of different qualities are usually available and distributed around the meeting room. Although several approaches have been proposed in recent years to take advantage of multiple microphones, they are either too computationally expensive and not easily scalable or they cannot outperform the simpler case of using the best single microphone. In this paper, the use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain. New techniques we are presenting include blind reference-channel selection, two-step time delay of arrival (TDOA) Viterbi postprocessing, and a dynamic output signal weighting algorithm, together with using such TDOA values in the diarization to complement the acoustic information. Tests on speaker diarization show a 25% relative improvement on the test set compared to using a single most centrally located microphone. Additional experimental results show improvements using these techniques in a speech recognition task.

444 citations
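The delay-and-sum beamforming the abstract builds on can be sketched as follows. This is a minimal version with integer-sample delays estimated by plain cross-correlation against one reference channel; the actual front-end uses GCC-PHAT delay estimation, blind reference-channel selection, and Viterbi smoothing of the TDOA tracks. Function names and the `max_lag` parameter are illustrative, not from the paper.

```python
import numpy as np

def estimate_delay(ref, mic, max_lag):
    """Estimate the integer-sample delay of `mic` relative to `ref`
    by locating the peak of their cross-correlation within +/- max_lag."""
    corr = np.correlate(mic, ref, mode="full")
    lags = np.arange(-len(ref) + 1, len(mic))
    window = (lags >= -max_lag) & (lags <= max_lag)
    return lags[window][np.argmax(corr[window])]

def delay_and_sum(reference, others, max_lag=100):
    """Align each channel to the reference and average all channels."""
    out = reference.astype(float).copy()
    for mic in others:
        d = estimate_delay(reference, mic, max_lag)
        out += np.roll(np.asarray(mic, dtype=float), -d)
    return out / (1 + len(others))
```

Averaging the aligned channels reinforces the speech (which is coherent across microphones after alignment) while averaging down uncorrelated room noise.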


Journal ArticleDOI
TL;DR: The correlation between signals coming from multiple microphones is analyzed and an improved method for carrying out speaker diarization for meetings with multiple distant microphones is proposed, improving the Diarization Error Rate (DER) by 15% to 20% relative to previous systems.
Abstract: Human-machine interaction in meetings requires the localization and identification of the speakers interacting with the system as well as the recognition of the words spoken. A seminal step toward this goal is the field of rich transcription research, which includes speaker diarization together with the annotation of sentence boundaries and the elimination of speaker disfluencies. The sub-area of speaker diarization attempts to identify the number of participants in a meeting and create a list of speech time intervals for each such participant. In this paper, we analyze the correlation between signals coming from multiple microphones and propose an improved method for carrying out speaker diarization for meetings with multiple distant microphones. The proposed algorithm makes use of acoustic information and information from the delays between signals coming from the different sources. Using this procedure, we were able to achieve state-of-the-art performance in the NIST spring 2006 rich transcription evaluation, improving the Diarization Error Rate (DER) by 15% to 20% relative to previous systems.

91 citations
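The combination of acoustic information and inter-channel delays described above amounts to weighting two per-frame log-likelihood streams before they enter the diarization decision. A minimal sketch, with a hypothetical 0.9/0.1 split (the paper tunes its own stream weights):

```python
import numpy as np

def fused_log_likelihood(ll_acoustic, ll_delay, weight=0.9):
    """Linearly weight per-frame log-likelihoods from the acoustic
    (e.g. MFCC) stream and the TDOA stream. The 0.9/0.1 default is a
    hypothetical value, not the paper's tuned weight."""
    return (weight * np.asarray(ll_acoustic)
            + (1 - weight) * np.asarray(ll_delay))
```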


Journal Article
10 May 2007-CLEaR
TL;DR: The latest version of the SRI-ICSI meeting and lecture recognition system, as was used in the NIST RT-07 evaluations, is described, highlighting improvements made over the last year, and a new NIST metric designed to evaluate combined speech diarization and recognition is reported.
Abstract: We describe the latest version of the SRI-ICSI meeting and lecture recognition system, as was used in the NIST RT-07 evaluations, highlighting improvements made over the last year. Changes in the acoustic preprocessing include updated beamforming software for processing of multiple distant microphones, and various adjustments to the speech segmenter for close-talking microphones. Acoustic models were improved by the combined use of neural-net-estimated phone posterior features, discriminative feature transforms trained with fMPE-MAP, and discriminative Gaussian estimation using MPE-MAP, as well as model adaptation specifically to nonnative and non-American speakers. The net effect of these enhancements was a 14-16% relative error reduction on distant microphones, and a 16-17% error reduction on close-talking microphones. Also, for the first time, we report results on a new "coffee break" meeting genre, and on a new NIST metric designed to evaluate combined speech diarization and recognition.

62 citations
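The 14-17% figures above are relative error reductions, i.e. the fraction of the baseline error rate that was removed, not an absolute drop in percentage points:

```python
def relative_reduction(before, after):
    """Relative error reduction: fraction of the baseline error removed.
    E.g. going from 25.0% to 21.0% WER is a 16% relative reduction."""
    return (before - after) / before
```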


Proceedings ArticleDOI
01 Dec 2007
TL;DR: A framework to speed up agglomerative-clustering speaker diarization by adopting a computationally cheap method to reduce the hypothesis space of the more expensive and accurate model selection via the Bayesian Information Criterion (BIC).
Abstract: During the past few years, speaker diarization has achieved satisfying accuracy in terms of speaker Diarization Error Rate (DER). The most successful approaches, based on agglomerative clustering, however, exhibit an inherent computational complexity which makes real-time processing, especially in combination with further processing steps, almost impossible. In this article we present a framework to speed up agglomerative clustering speaker diarization. The basic idea is to adopt a computationally cheap method to reduce the hypothesis space of the more expensive and accurate model selection via the Bayesian Information Criterion (BIC). Two strategies, one based on the pitch correlogram and one on the unscented-transform-based approximation of the KL divergence, are used independently as a fast-match approach to select the most likely clusters to merge. We performed the experiments using the existing ICSI speaker diarization system. The new system using the KL-divergence fast-match strategy performs only 14% of the BIC comparisons needed in the baseline system and speeds up the system by 41% without affecting the speaker Diarization Error Rate (DER). The result is a robust, faster-than-real-time speaker diarization system.

39 citations
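The "expensive and accurate" merge test that the fast-match gates is a ΔBIC comparison between two clusters. A sketch of the standard full-covariance Gaussian form is below; note this is the textbook criterion, not ICSI's exact modified BIC, and the penalty weight λ is a tunable parameter:

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC for merging two clusters of d-dimensional feature
    vectors, each modelled by a single full-covariance Gaussian.
    Positive values favour merging; lam weights the model-size penalty."""
    xy = np.vstack([x, y])
    n1, n2, n = len(x), len(y), len(xy)
    d = x.shape[1]

    def logdet(c):
        # log-determinant of the sample covariance of cluster c
        return np.linalg.slogdet(np.cov(c, rowvar=False))[1]

    # penalty: one Gaussian (mean + full covariance) fewer after merging
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n1 * logdet(x) + 0.5 * n2 * logdet(y)
            - 0.5 * n * logdet(xy) + penalty)
```

Because every pass of agglomerative clustering scores all cluster pairs with a test like this, a cheap fast-match that prunes unlikely pairs first cuts the dominant cost of the system.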


Proceedings ArticleDOI
27 Aug 2007
TL;DR: A performance analysis, based on a series of oracle experiments, of a speaker diarization system similar to the one ICSI submitted to the NIST RT06s evaluation benchmark.
Abstract: In this paper we discuss the performance analysis of a speaker diarization system similar to the system that was submitted by ICSI to the NIST RT06s evaluation benchmark. The analysis, which is based on a series of oracle experiments, provides a good understanding of the performance of each system component on a test set of twelve conference meetings used in previous NIST benchmarks. Our analysis shows that the speech activity detection component contributes most to the total diarization error rate (23%). The inability to model overlapping speech is also a large source of errors (22%), followed by the component that creates the initial system models (15%).

38 citations
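The metric whose components are broken down above, DER, is the time-weighted sum of three error types divided by the total scored speech time. A minimal sketch of the definition:

```python
def diarization_error_rate(missed, false_alarm, speaker_error, scored_speech):
    """DER as scored in the NIST evaluations: missed speech time,
    false-alarm speech time, and speaker-confusion time (all in
    seconds), divided by the total scored speech time."""
    return (missed + false_alarm + speaker_error) / scored_speech
```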


Journal Article
01 Jan 2007-CLEaR
TL;DR: This paper used the most recent available version of the beamforming toolkit, implemented a new speech/non-speech detector that does not require models trained on meeting data, and performed the development on a much larger set of recordings.
Abstract: In this paper, we present the ICSI speaker diarization system. This system was used in the 2007 National Institute of Standards and Technology (NIST) Rich Transcription evaluation. The ICSI system automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers. Our system uses "standard" speech processing components and techniques such as HMMs, agglomerative clustering, and the Bayesian Information Criterion. However, we have developed the system with an eye towards robustness and ease of portability. Thus we have avoided the use of any sort of model that requires training on "outside" data and we have attempted to develop algorithms that require as little tuning as possible. The system is similar to last year's system [1] except for three aspects: we used the most recent available version of the beamforming toolkit, we implemented a new speech/non-speech detector that does not require models trained on meeting data, and we performed our development on a much larger set of recordings.

31 citations
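The agglomerative clustering step the abstract mentions can be sketched as a generic merge loop over a BIC-like score. Here `merge_score(a, b)` stands in for any criterion where positive values favour merging; this is an illustrative loop, not ICSI's exact initialisation or stopping rule:

```python
import numpy as np

def agglomerative_diarization(segments, merge_score):
    """Repeatedly merge the pair of clusters with the highest merge
    score until no pair scores above zero; the surviving clusters are
    the hypothesised speakers."""
    clusters = [np.asarray(s) for s in segments]
    while len(clusters) > 1:
        pairs = [(merge_score(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = max(pairs)
        if best <= 0:
            break  # no pair is worth merging: stop
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```

Because the stopping rule is score-based rather than a fixed cluster count, the loop also decides the number of speakers, which is why no prior knowledge of that number is needed.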


Proceedings ArticleDOI
27 Aug 2007
TL;DR: The speech activity detection system that was used for detecting speech regions in the Dutch TRECVID video collection is discussed; it is designed to filter non-speech such as music or sound effects out of the signal without the use of predefined non-speech models.
Abstract: In this paper we discuss the speech activity detection system that we used for detecting speech regions in the Dutch TRECVID video collection. The system is designed to filter non-speech such as music or sound effects out of the signal without the use of predefined non-speech models. Because the system trains its models on-line, it is robust when handling out-of-domain data. The speech activity error rate on an out-of-domain test set, recordings of English conference meetings, was 4.4%. The overall error rate on twelve randomly selected five-minute TRECVID fragments was 11.5%.

26 citations
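The on-line training idea above (no predefined non-speech models) can be illustrated with a toy two-cluster split on frame energies, fitted on the recording itself. This is only a sketch of the principle; the actual system trains HMM speech/non-speech models on-line:

```python
import numpy as np

def energy_sad(frame_energies, iters=10):
    """Toy on-line speech/non-speech split: 2-means on frame energies,
    initialised from the minimum and maximum energy. Frames assigned
    to the higher-energy cluster are labelled speech (True)."""
    e = np.asarray(frame_energies, dtype=float)
    lo, hi = e.min(), e.max()
    for _ in range(iters):
        speech = np.abs(e - hi) < np.abs(e - lo)  # nearer the high centre
        lo, hi = e[~speech].mean(), e[speech].mean()
    return speech
```

Because both cluster centres are re-estimated from the data at hand, nothing about the acoustic domain is baked into the detector, which is the property the abstract credits for the system's out-of-domain robustness.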