
Showing papers on "Speaker diarisation published in 2006"


Journal ArticleDOI
TL;DR: This work examines the idea of using the GMM supervector in a support vector machine (SVM) classifier and proposes two new SVM kernels based on distance metrics between GMM models that produce excellent classification accuracy in a NIST speaker recognition evaluation task.
Abstract: Gaussian mixture models (GMMs) have proven extremely successful for text-independent speaker recognition. The standard training method for GMM models is to use MAP adaptation of the means of the mixture components based on speech from a target speaker. Recent methods in compensation for speaker and channel variability have proposed the idea of stacking the means of the GMM model to form a GMM mean supervector. We examine the idea of using the GMM supervector in a support vector machine (SVM) classifier. We propose two new SVM kernels based on distance metrics between GMM models. We show that these SVM kernels produce excellent classification accuracy in a NIST speaker recognition evaluation task.
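As a rough illustration of the supervector idea (not the paper's exact kernels, and with toy data), the sketch below MAP-adapts only the means of a UBM to each utterance, stacks them into a supervector, and trains a linear SVM; all names and parameters here are hypothetical:

```python
# Sketch: GMM mean supervectors as SVM inputs (illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def mean_supervector(features, ubm, relevance=16.0):
    """MAP-adapt only the UBM means to one utterance, then stack them."""
    post = ubm.predict_proba(features)            # (T, C) responsibilities
    n_c = post.sum(axis=0)                        # soft counts per component
    f_c = post.T @ features                       # first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]    # adaptation coefficients
    means = alpha * (f_c / np.maximum(n_c[:, None], 1e-8)) + (1 - alpha) * ubm.means_
    return means.ravel()                          # (C * dim,) supervector

rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(rng.normal(size=(2000, 13)))              # toy "background" data

# One supervector per training utterance, labeled by speaker.
X = np.stack([mean_supervector(rng.normal(loc=s, size=(200, 13)), ubm)
              for s in (0.0, 0.0, 1.0, 1.0)])
y = [0, 0, 1, 1]
svm = SVC(kernel="linear").fit(X, y)              # linear kernel on supervectors
```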

1,081 citations


Journal ArticleDOI
TL;DR: An overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, is provided, and their relative merits and limitations are discussed.
Abstract: Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification.

634 citations


01 Jan 2006
TL;DR: A full account of the algorithms needed to carry out a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels and the practical limitations that will be encountered if these algorithms are implemented on very large data sets are discussed.
Abstract: We give a full account of the algorithms needed to carry out a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels, and we discuss the practical limitations that will be encountered if these algorithms are implemented on very large data sets. This article is intended as a companion to (1), where we presented a new type of likelihood ratio statistic for speaker verification which is designed principally to deal with the problem of inter-session variability, that is, the variability among recordings of a given speaker. This likelihood ratio statistic is based on a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels (such as one of the Switchboard II databases). Our purpose in the current article is to give detailed algorithms for carrying out such a factor analysis. Although we have only experimented with applications of this model in speaker recognition, we will also explain how it could serve as an integrated framework for progressive speaker adaptation and on-line channel adaptation of HMM-based speech recognizers operating in situations where speaker identities are known. The joint factor analysis model can be viewed as a Gaussian distribution on speaker- and channel-dependent (or, more accurately, session-dependent) HMM supervectors in which most (but not all) of the variance in the supervector population is assumed to be accounted for by a small number of hidden variables which we refer to as speaker and channel factors. The speaker factors and the channel factors play different roles in that, for a given speaker, the values of the speaker factors are assumed to be the same for all recordings of the speaker but the channel factors are assumed to vary from one recording to another. For example, the Gaussian distribution on speaker-dependent supervectors used in eigenvoice MAP (2) is a special case of the factor analysis model in which there are no channel factors and all of the variance in the speaker-dependent HMM supervectors is assumed to be accounted for by the speaker factors.
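For orientation, the generative model the abstract describes is commonly written as M = m + Vy + Ux, with speaker factors y shared across a speaker's recordings and channel factors x redrawn per recording; the sketch below uses toy dimensions and random matrices:

```python
# Sketch of the joint factor analysis generative model:
# M = m + V*y + U*x, with speaker factors y fixed per speaker
# and channel factors x redrawn for every recording.
# (A diagonal residual term D*z is often added; omitted here.)
import numpy as np

rng = np.random.default_rng(1)
sv_dim, n_spk_factors, n_chan_factors = 120, 10, 5

m = rng.normal(size=sv_dim)                                # UBM mean supervector
V = rng.normal(scale=0.5, size=(sv_dim, n_spk_factors))    # eigenvoices
U = rng.normal(scale=0.2, size=(sv_dim, n_chan_factors))   # eigenchannels

y = rng.normal(size=n_spk_factors)                         # one speaker's factors

def recording_supervector():
    x = rng.normal(size=n_chan_factors)   # new channel factors each session
    return m + V @ y + U @ x

sessions = [recording_supervector() for _ in range(3)]
# The three supervectors share the speaker offset V @ y but differ in U @ x.
```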

440 citations


Journal ArticleDOI
TL;DR: This paper focuses on optimizing vector quantization (VQ) based speaker identification, which reduces the number of test vectors by pre-quantizing the test sequence prior to matching, and the number of speakers by pruning out unlikely speakers during the identification process.
Abstract: In speaker identification, most of the computation originates from the distance or likelihood computations between the feature vectors of the unknown speaker and the models in the database. The identification time depends on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. In this paper, we concentrate on optimizing vector quantization (VQ) based speaker identification. We reduce the number of test vectors by pre-quantizing the test sequence prior to matching, and the number of speakers by pruning out unlikely speakers during the identification process. The best variants are then generalized to Gaussian mixture model (GMM) based modeling. We also apply the algorithms to efficient cohort set search for score normalization in speaker verification. We obtain a speed-up factor of 16:1 in the case of VQ-based modeling with minor degradation in the identification accuracy, and 34:1 in the case of GMM-based modeling. An equal error rate of 7% can be reached in 0.84 s on average when the length of the test utterance is 30.4 s.
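A minimal sketch of the two speed-ups under assumed implementations (the batching scheme, pruning fraction, and all names are illustrative, not the paper's):

```python
# Sketch: pre-quantization of test vectors plus iterative speaker pruning.
import numpy as np
from scipy.cluster.vq import kmeans2

def identify(test_vectors, codebooks, n_prequant=64, batch=16, keep_frac=0.5):
    # Pre-quantize: replace the test sequence by a few k-means centroids.
    reps, _ = kmeans2(test_vectors, n_prequant, seed=0, minit="++")
    alive = list(codebooks)                       # surviving speaker ids
    dist = {s: 0.0 for s in alive}
    for start in range(0, len(reps), batch):
        chunk = reps[start:start + batch]
        for s in alive:                           # nearest-codeword distortion
            d = np.linalg.norm(chunk[:, None, :] - codebooks[s][None], axis=2)
            dist[s] += d.min(axis=1).sum()
        # Prune: drop the worst-scoring half of the remaining speakers.
        alive = sorted(alive, key=dist.get)[:max(1, int(len(alive) * keep_frac))]
    return alive[0]                               # best matching speaker

rng = np.random.default_rng(2)
codebooks = {s: rng.normal(loc=s, size=(32, 12)) for s in range(4)}
print(identify(rng.normal(loc=2, size=(500, 12)), codebooks))
```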

248 citations


Journal ArticleDOI
TL;DR: This paper describes recent advances in speaker diarization with a multistage segmentation and clustering system that incorporates a speaker identification step and builds upon the baseline audio partitioner used in the LIMSI broadcast news transcription system.
Abstract: This paper describes recent advances in speaker diarization with a multistage segmentation and clustering system, which incorporates a speaker identification step. This system builds upon the baseline audio partitioner used in the LIMSI broadcast news transcription system. The baseline partitioner provides a high cluster purity, but has a tendency to split data from speakers with a large quantity of data into several segment clusters. Several improvements to the baseline system have been made. First, the iterative Gaussian mixture model (GMM) clustering has been replaced by a Bayesian information criterion (BIC) agglomerative clustering. Second, an additional clustering stage has been added, using a GMM-based speaker identification method. Finally, a post-processing stage refines the segment boundaries using the output of a transcription system. On the National Institute of Standards and Technology (NIST) RT-04F and ESTER evaluation data, the multistage system reduces the speaker error by over 70% relative to the baseline system, and gives between 40% and 50% reduction relative to a single-stage BIC clustering system.
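The ΔBIC merge test at the heart of such agglomerative clustering can be sketched as follows, in its common single-Gaussian form; the penalty weight λ and all names here are generic assumptions rather than this system's exact implementation:

```python
# Sketch: Delta-BIC test for merging two clusters of feature vectors.
import numpy as np

def log_det_cov(x):
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    return np.linalg.slogdet(cov)[1]

def delta_bic(x, y, lam=1.0):
    """Positive value favors keeping x and y as separate clusters."""
    n, m, d = len(x), len(y), x.shape[1]
    gain = (0.5 * (n + m) * log_det_cov(np.vstack([x, y]))
            - 0.5 * n * log_det_cov(x)
            - 0.5 * m * log_det_cov(y))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n + m)
    return gain - penalty

rng = np.random.default_rng(3)
same = delta_bic(rng.normal(size=(300, 12)), rng.normal(size=(300, 12)))
diff = delta_bic(rng.normal(size=(300, 12)), rng.normal(loc=3.0, size=(300, 12)))
print(same < diff)   # well-separated data should look more separable
```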

217 citations


Journal ArticleDOI
TL;DR: This paper summarizes the collaboration of the LIA and CLIPS laboratories on speaker diarization of broadcast news during the spring NIST Rich Transcription 2003 evaluation campaign (NIST-RT03S).

141 citations


Patent
19 Oct 2006
TL;DR: A method and apparatus for determining whether a speaker uttering an utterance belongs to a predetermined set of known speakers, where a training utterance is available for each known speaker.
Abstract: A method and apparatus for determining whether a speaker uttering an utterance belongs to a predetermined set comprising known speakers, wherein a training utterance is available for each known speaker. The method and apparatus test whether features extracted from the tested utterance provide a score exceeding a threshold when matched against one or more models constructed upon voice samples of each known speaker. The method and system further provide optional enhancements such as determining, using, and updating model normalization parameters, a fast scoring algorithm, summed-calls handling, or quality evaluation for the tested utterance.

133 citations


Proceedings Article
01 Jan 2006
TL;DR: A system for model-based speech separation is described which achieves super-human recognition performance when two talkers speak at similar levels and incorporates a novel method for performing two-talker speaker identification and gain estimation.
Abstract: We describe a system for model based speech separation which achieves super-human recognition performance when two talkers speak at similar levels. The system can separate the speech of two speakers from a single channel recording with remarkable results. It incorporates a novel method for performing two-talker speaker identification and gain estimation. We extend the method of model based high resolution signal reconstruction to incorporate temporal dynamics. We report on two methods for introducing dynamics; the first uses dynamics in the acoustic model space, the second incorporates dynamics based on sentence grammar. The addition of temporal constraints leads to dramatic improvements in the separation performance. Once the signals have been separated they are then recognized using speaker dependent labeling.

119 citations


Dissertation
21 Dec 2006
TL;DR: In this thesis, a hierarchical bottom-up mono-channel speaker diarization system is extended for meeting rooms: acoustic beamforming is used to extract speaker location information and obtain a single enhanced signal from all available microphones, which is then used for speaker segmentation and clustering.
Abstract: This thesis presents research into the topic of speaker diarization for meeting rooms. It looks into the algorithms and the implementation of an offline speaker segmentation and clustering system for meeting recordings where usually more than one microphone is available. The main research and system implementation was done during a two-year visit to the International Computer Science Institute (ICSI, Berkeley, California). Speaker diarization is a well studied topic in the domain of broadcast news recordings. Most of the proposed systems involve some sort of hierarchical clustering of the data into clusters, where the optimum number of speakers and their identities are unknown a priori. A very commonly used method is called bottom-up clustering, where multiple initial clusters are iteratively merged until the optimum number of clusters is reached, according to some stopping criterion. Such systems are based on a single channel input, which does not allow a direct application to the meetings domain. Although some efforts have been made to adapt such systems to multichannel data, at the start of this thesis no effective implementation had been proposed. Furthermore, many of these speaker diarization algorithms involve some sort of model training or parameter tuning using external data, which impedes their usability on data different from what they were adapted to. The implementation proposed in this thesis works towards solving the aforementioned problems. Taking the existing hierarchical bottom-up mono-channel speaker diarization system from ICSI, it first uses flexible acoustic beamforming to extract speaker location information and obtain a single enhanced signal from all available microphones. It then implements a train-free speech/non-speech detection on this signal and processes the resulting speech segments with an improved version of the mono-channel speaker diarization system. This system has been modified to use the speaker location information (now available), and several algorithms have been adapted or newly created to fit the system's behavior to each particular recording by obtaining information directly from the acoustics, making it less dependent on the development data. The resulting system is flexible with respect to the meeting-room layout, regardless of the number of microphones and their placement. It is train-free, making it easy to adapt to different sorts of data and domains of application. Finally, it takes a step forward in the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted, with excellent results, to the RT05s and RT06s NIST Rich Transcription evaluations for meetings, where data from two different subdomains (lectures and conferences) was evaluated. In addition, experiments using the RT datasets from all meetings evaluations were used to test the different proposed algorithms, proving their suitability to the task.
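The beamforming step can be illustrated with a bare-bones delay-and-sum: estimate each channel's delay against a reference by cross-correlation, align, and average. This is a toy sketch under simplifying assumptions (wrap-around shifts, time-invariant delays), not the thesis's far more elaborate implementation:

```python
# Sketch: delay-and-sum beamforming across distant microphones.
import numpy as np

def delay_and_sum(channels, max_lag=800):
    """channels: list of equal-length 1-D signals; the first is the reference."""
    ref = channels[0]
    out = np.array(ref, dtype=float)
    for ch in channels[1:]:
        # Estimate delay as the peak of the cross-correlation within +/- max_lag.
        lags = np.arange(-max_lag, max_lag + 1)
        xc = [np.dot(ref, np.roll(ch, k)) for k in lags]
        delay = lags[int(np.argmax(xc))]
        out += np.roll(ch, delay)   # align (np.roll wraps; fine for a sketch)
    return out / len(channels)      # single enhanced signal
```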

105 citations


Proceedings Article
01 Jan 2006
TL;DR: To apply the Ng-Jordan-Weiss (NJW) spectral clustering algorithm to speaker diarization, some domain-specific solutions to the open issues of this algorithm are proposed: choice of metric, selection of the scaling parameter, and estimation of the number of clusters.
Abstract: In this paper, we present a spectral clustering approach to explore the possibility of discovering structure from audio data. To apply the Ng-Jordan-Weiss (NJW) spectral clustering algorithm to speaker diarization, we propose some domain-specific solutions to the open issues of this algorithm: choice of metric, selection of the scaling parameter, and estimation of the number of clusters. Then, a postprocessing step, "cross EM refinement", is conducted to further improve the performance of spectral learning. In experiments, this approach performs very similarly to traditional hierarchical clustering on the audio data of Japanese Parliament Panel Discussions, but it runs much faster than the latter. Index Terms: Speaker Diarization, Spectral Clustering, Cross EM refinement, BIC.
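For reference, the NJW algorithm itself (affinity matrix, normalized Laplacian, top-k eigenvectors, row normalization, k-means) can be sketched as follows; the Gaussian affinity and fixed scaling parameter below are generic choices, not the paper's domain-specific solutions:

```python
# Sketch: Ng-Jordan-Weiss spectral clustering on segment-level features.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.cluster.vq import kmeans2

def njw_cluster(X, k, sigma=1.0):
    A = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))                # D^-1/2 A D^-1/2
    vals, vecs = np.linalg.eigh(L)
    Y = vecs[:, -k:]                               # top-k eigenvectors
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)  # row-normalize
    _, labels = kmeans2(Y, k, seed=0, minit="++")
    return labels

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 6)), rng.normal(4, 1, (50, 6))])
labels = njw_cluster(X, 2)
print(labels[:5], labels[-5:])   # the two blobs should get distinct labels
```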

93 citations


Journal ArticleDOI
TL;DR: This paper extends the traditional SID framework to cochannel speech and derives a joint objective for sequential grouping and SID, leading to a search for the optimum hypothesis, and proposes a hypothesis pruning algorithm based on speaker models to make the search computationally efficient.
Abstract: A human listener has the ability to follow a speaker's voice while others are speaking simultaneously; in particular, the listener can organize the time-frequency energy of the same speaker across time into a single stream. In this paper, we focus on sequential organization in cochannel speech, or mixtures of two voices. We extract minimally corrupted segments, or usable speech, in cochannel speech using a robust multipitch tracking algorithm. The extracted usable speech is shown to capture speaker characteristics and improves speaker identification (SID) performance across various target-to-interferer ratios. To utilize speaker characteristics for sequential organization, we extend the traditional SID framework to cochannel speech and derive a joint objective for sequential grouping and SID, leading to a problem of search for the optimum hypothesis. Subsequently we propose a hypothesis pruning algorithm based on speaker models in order to make the search computationally efficient. Evaluation results show that the proposed system approaches the ceiling SID performance obtained with prior pitch information and yields significant improvement over alternative approaches to sequential organization.

Proceedings ArticleDOI
17 Sep 2006
TL;DR: This work introduces efficient update methods to train adaptation matrices for the full covariance case and experiments with a simplified technique that works almost as well as the exact method.
Abstract: Full covariance models can give better results for speech recognition than diagonal models, yet they introduce complications for standard speaker adaptation techniques such as MLLR and fMLLR. Here we introduce efficient update methods to train adaptation matrices for the full covariance case. We also experiment with a simplified technique in which we pretend that the full covariance Gaussians are diagonal and obtain adaptation matrices under that assumption. We show that this approximate method works almost as well as the exact method.

Journal ArticleDOI
TL;DR: The performance obtained with the HMM-based polyglot synthesis method is better than that of methods based on phone mapping for both adaptation and synthesis; the method can also be used to create synthesizers for languages where no speech resources are available.

Proceedings Article
01 Jan 2006
TL;DR: It is found that speech with various emotions degrades the verification performance of a GMM-UBM based speaker verification system, and an emotion-dependent score normalization method, borrowed from the idea of Hnorm, is proposed.
Abstract: Besides background noise, channel effects and the speaker's health condition, emotion is another factor which may influence the performance of a speaker verification system. In this paper, the performance of a GMM-UBM based speaker verification system on emotional speech is studied. It is found that speech with various emotions degrades the verification performance. Two reasons for this degradation are analyzed: mismatched emotions between the speaker models and the test utterances, and the articulation styles of certain emotions, which create intense intra-speaker vocal variability. In response to the first reason, an emotion-dependent score normalization method is proposed, borrowed from the idea of Hnorm. Index Terms: speaker verification, emotional speech
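Hnorm-style normalization standardizes a model's score with statistics estimated per condition; an emotion-dependent variant in that spirit (a guess at the shape of the proposal, not its exact form) might look like:

```python
# Sketch: emotion-dependent score normalization in the spirit of Hnorm.
# For each speaker model and each emotion, impostor-score statistics are
# estimated offline; test scores are standardized with the matching pair.
def enorm(raw_score, speaker_id, emotion, stats):
    mu, sigma = stats[(speaker_id, emotion)]
    return (raw_score - mu) / sigma

stats = {("spk1", "angry"): (-0.3, 1.4),   # toy numbers
         ("spk1", "neutral"): (0.0, 1.0)}
print(enorm(1.2, "spk1", "angry", stats))
```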

Proceedings ArticleDOI
17 Sep 2006
TL;DR: Four of the main improvements to the ICSI speaker diarization system submitted for the NIST Rich Transcription evaluation (RT06s) conducted in the meetings environment are introduced: a new training-free speech/non-speech detection algorithm, a new algorithm for system initialization, a frame purification algorithm to increase cluster differentiability, and the use of inter-channel delays as features.
Abstract: In this paper we present the ICSI speaker diarization system submitted for the NIST Rich Transcription evaluation (RT06s) [1] conducted in the meetings environment. This is a set of yearly evaluations which in the last two years have included speaker diarization of two kinds of distinct meetings: conference room and lecture room. The system presented focuses on being robust to changes in the meeting conditions by not using any training data. In this paper we introduce four of the main improvements to the system over last year's submission. The first is a new training-free speech/non-speech detection algorithm. The second is the introduction of a new algorithm for system initialization. The third is the use of a frame purification algorithm to increase cluster differentiability. The last improvement is the use of inter-channel delays as features, greatly improving performance. We show the diarization error rate (DER) of this system on all meeting datasets available to date for the multiple distant microphone (MDM) and single distant microphone (SDM) conditions. Index Terms: Speaker diarization, speaker segmentation and clustering, meetings indexing.

Journal ArticleDOI
TL;DR: The novel method always performed better than the reference vocal tract length normalization method adopted in this work; when unsupervised static speaker adaptation was applied in combination with each of the two speaker normalization methods, different behavior was observed on the two corpora.

Proceedings ArticleDOI
04 Jun 2006
TL;DR: An HMM-based approach and a maximum entropy model for speaker role labeling using Mandarin broadcast news speech are presented; it is found that the maximum entropy model performs slightly better than the HMM, and that the combination of the two outperforms either model alone.
Abstract: Identifying a speaker's role (anchor, reporter, or guest speaker) is important for finding the structural information in broadcast news speech. We present an HMM-based approach and a maximum entropy model for speaker role labeling using Mandarin broadcast news speech. The algorithms achieve classification accuracy of about 80% (compared to the baseline of around 50%) using the human transcriptions and manually labeled speaker turns. We found that the maximum entropy model performs slightly better than the HMM, and that the combination of them outperforms any model alone. The impact of the contextual role information is also examined in this study.
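The maximum entropy classifier in such a setup is equivalent to multinomial logistic regression over turn-level features; a toy sketch with invented features:

```python
# Sketch: maximum-entropy (multinomial logistic regression) role labeling.
from sklearn.linear_model import LogisticRegression

# Hypothetical per-turn features: [turn_duration_s, n_words, is_show_start]
X = [[45.0, 120, 1], [8.0, 20, 0], [30.0, 90, 0], [50.0, 150, 1]]
y = ["anchor", "guest", "reporter", "anchor"]

maxent = LogisticRegression(max_iter=1000).fit(X, y)
print(maxent.predict([[40.0, 110, 1]]))   # likely "anchor" on this toy data
```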

Proceedings Article
01 Jan 2006
TL;DR: This paper develops a method to combine TDOA values with acoustic features by calculating a combined log-likelihood over both sets of vectors.
Abstract: Speaker diarization for recordings made in meetings consists of identifying the number of participants in each meeting and creating a list of speech time intervals for each participant. In recently published work [7] we presented some experiments using only TDOA values (time delays of arrival between different channels) applied to this task. We demonstrated that the information in those values can be used to segment the speakers. In this paper we develop a method to combine the TDOA values with the acoustic features by calculating a combined log-likelihood over both sets of vectors. Using this method we have been able to reduce the DER by 16.34% (relative) for the NIST RT05s set (scored without overlap and with manually transcribed references), by 21% (relative) for our devel06s set (scored with overlap and force-aligned references), and by 15% (relative) for the NIST RT06s set (scored with overlap and manually transcribed references). Index Terms: Speaker diarization, speaker segmentation, meetings recognition.
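The fusion amounts to a weighted sum of per-stream log-likelihoods; a minimal sketch (the stream weight and GMM configurations are assumptions, not the paper's values):

```python
# Sketch: combining acoustic and TDOA log-likelihoods for one cluster.
import numpy as np
from sklearn.mixture import GaussianMixture

def combined_loglik(acoustic, tdoa, gmm_ac, gmm_td, w=0.9):
    """Weighted sum of average per-frame log-likelihoods of the two streams."""
    return (w * gmm_ac.score(acoustic)        # .score = mean log-likelihood
            + (1.0 - w) * gmm_td.score(tdoa))

rng = np.random.default_rng(6)
gmm_ac = GaussianMixture(3).fit(rng.normal(size=(500, 19)))
gmm_td = GaussianMixture(2).fit(rng.normal(size=(500, 4)))
print(combined_loglik(rng.normal(size=(100, 19)), rng.normal(size=(100, 4)),
                      gmm_ac, gmm_td))
```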

Journal Article
TL;DR: In this article, the ICSI-SRI entry in the Rich Transcription 2005 Spring Meeting Recognition Evaluation is described; it uses agglomerative clustering with a modified Bayesian information criterion (BIC) measure to determine when to stop merging clusters and to decide which pairs of clusters to merge.
Abstract: In this paper we describe the ICSI-SRI entry in the Rich Transcription 2005 Spring Meeting Recognition Evaluation. The current system is based on the ICSI-SRI clustering system for Broadcast News (BN), with extra modules to process the different meetings tasks in which we participated. Our base system uses agglomerative clustering with a modified Bayesian Information Criterion (BIC) measure to determine when to stop merging clusters and to decide which pairs of clusters to merge. This approach does not require any pre-trained models, thus increasing robustness and simplifying the port from BN to the meetings domain. For the meetings domain, we have added several features to our baseline clustering system, including a purification module that tries to keep the clusters acoustically homogeneous throughout the clustering process, and a delay&sum beamforming algorithm which enhances signal quality for the multiple distant microphones (MDM) sub-task. In post-evaluation work we further improved the delay&sum algorithm, experimented with a new speech/non-speech detector and proposed a new system for the lecture room environment.
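The "modified BIC" in this line of work is often described as keeping the merged model's complexity equal to the sum of the two originals, so that the penalty term cancels; a sketch of that comparison, with illustrative parameter counts and names:

```python
# Sketch: penalty-free "modified BIC" merge test (equal total complexity).
import numpy as np
from sklearn.mixture import GaussianMixture

def merge_score(x, y, n1=3, n2=3):
    """Merge x and y if an (n1+n2)-component GMM on the pooled data scores
    at least as well as separate n1- and n2-component GMMs."""
    g1 = GaussianMixture(n1, covariance_type="diag", random_state=0).fit(x)
    g2 = GaussianMixture(n2, covariance_type="diag", random_state=0).fit(y)
    pooled = np.vstack([x, y])
    gm = GaussianMixture(n1 + n2, covariance_type="diag",
                         random_state=0).fit(pooled)
    return gm.score(pooled) * len(pooled) - (g1.score(x) * len(x)
                                             + g2.score(y) * len(y))

rng = np.random.default_rng(7)
a, b = rng.normal(size=(400, 12)), rng.normal(size=(400, 12))
print(merge_score(a, b))   # merge the pair when this is >= 0
```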

Proceedings ArticleDOI
14 May 2006
TL;DR: This paper presents two algorithms that aim to purify the clusters: the first assigns conflicting speech segments to a new cluster, and the second detects and eliminates non-speech frames when comparing two clusters.
Abstract: When performing speaker diarization, it is common to use an agglomerative clustering approach where the acoustic data is first split in small pieces and then pairs are merged until reaching a stopping point. When using a purely agglomerative clustering technique, one cluster cannot be split into two. Therefore, errors caused by multiple speakers being assigned to one cluster can be common. Furthermore, clusters often contain non-speech frames, creating problems when deciding which two clusters to merge and when to stop the clustering. In this paper, we present two algorithms that aim to purify the clusters. The first assigns conflicting speech segments to a new cluster, and the second detects and eliminates non-speech frames when comparing two clusters. We show improvements of over 18% relative using three datasets from the most current Rich Transcription (RT) evaluations.

Proceedings ArticleDOI
28 Jun 2006
TL;DR: The investigation of the effect of intentional voice modifications on a state-of-the-art speaker recognition system shows vulnerability in both humans and speaker recognition systems to changed voices, and suggests a potential for collaboration between human analysts and automatic speaker recognition systems to address this phenomenon.
Abstract: We investigate the effect of intentional voice modifications on a state-of-the-art speaker recognition system. The investigation includes data collection, where normal and changed voices are collected from subjects conversing by telephone. For comparison purposes, it also includes an evaluation framework similar to that for NIST extended-data speaker recognition. Results show that the state-of-the-art system gives nearly perfect recognition performance in a clean condition using normal voices. Using the threshold from this condition, it falsely rejects 39% of subjects who change their voices during testing. However, this can be improved to 9% if a threshold from the changed-voice testing condition is used. We also compare machine performance with human performance from a pilot listening experiment. Results show that machine performance is comparable to human performance when normal voices are used for both training and testing. However, the machine outperforms humans when changed voices are used for testing. In general, the results show vulnerability in both humans and speaker recognition systems to changed voices, and suggest a potential for collaboration between human analysts and automatic speaker recognition systems to address this phenomenon.

Proceedings ArticleDOI
14 May 2006
TL;DR: This study calculated over forty features for each of 24 shows from the Broadcast News corpus along the dimensions of speaker count, conversation turn, and speaker and show duration, and observed that the number of speakers, the number of turns, and the do-nothing DER correlated best with "nuttiness".
Abstract: Researchers in the speaker diarization community have observed that some audio files show unusually high diarization error rates (DER) (hard-to-crack "nuts"), and some exhibit hyper-sensitivity to tuning parameters ("flakes"). The goal of this study is to systematically study the features that correlate with such behavior. We calculated over forty features for each of 24 shows from the Broadcast News corpus along the dimensions of speaker count, conversation turn, and speaker and show duration. We observed that the number of speakers, the number of turns, and the do-nothing DER (a measure related to the percentage of time the dominant speaker spoke) correlated best with "nuttiness". The do-nothing DER and the number of speakers were also the best correlates of "flakiness".
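One plausible formalization of the do-nothing DER, consistent with the parenthetical description above but not necessarily the paper's exact definition:

```python
# Sketch: "do-nothing" DER as the error obtained by attributing the
# entire show to the dominant speaker (a guess at the measure).
def do_nothing_der(speaker_durations):
    """speaker_durations: dict speaker -> total reference speaking time (s)."""
    total = sum(speaker_durations.values())
    return 1.0 - max(speaker_durations.values()) / total

print(do_nothing_der({"anchor": 900.0, "reporter": 300.0, "guest": 120.0}))
```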

Proceedings ArticleDOI
14 May 2006
TL;DR: An approach to modelling session variability for GMM-based text-independent speaker verification is presented, incorporating a constrained session variability component in both the training and testing procedures.
Abstract: Presented is an approach to modelling session variability for GMM-based text-independent speaker verification, incorporating a constrained session variability component in both the training and testing procedures. The proposed technique reduces the data labelling requirements, removes the discrete categorisation needed by previous techniques, and provides superior performance. Experiments on Mixer conversational telephony data show improvements of as much as 46% in equal error rate over a baseline system. In this paper the algorithm used for the enrollment procedure is described in detail. Results are also presented investigating the response of the technique to short test utterances and varying session subspace dimension.

Proceedings ArticleDOI
14 May 2006
TL;DR: A system which attempts to find true speaker identities from the text transcription of the audio using lexical pattern matching, and shows the effect on performance when using state-of-the-art speaker clustering and speech-to-text transcription systems instead of manual references.
Abstract: Automatic speaker segmentation and clustering methods have improved considerably over the last few years in the Broadcast News domain. However, these generally still produce locally consistent relative labels (such as spkr1, spkr2) rather than true speaker identities (such as Bill Clinton, Ted Koppel). This paper presents a system which attempts to find these true identities from the text transcription of the audio using lexical pattern matching, and shows the effect on performance when using state-of-the-art speaker clustering and speech-to-text transcription systems instead of manual references.
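The lexical patterns in question are of the flavor "this is <NAME>" or "<NAME>, thank you"; a toy regex sketch with invented patterns, not the paper's rule set:

```python
# Sketch: mining candidate speaker names from a transcript with regexes.
import re

PATTERNS = [
    re.compile(r"\bthis is ([A-Z][a-z]+ [A-Z][a-z]+)\b"),     # self-intro
    re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+), thank you\b"),  # hand-back
]

def candidate_names(transcript):
    names = []
    for pat in PATTERNS:
        names.extend(pat.findall(transcript))
    return names

print(candidate_names("Good evening, this is Jane Doe reporting. "
                      "Jane Doe, thank you."))
```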

Proceedings ArticleDOI
28 Jun 2006
TL;DR: The top-norm method, specifically developed to improve the results of open-set speaker identification systems, is presented, and experiments demonstrate that it outperforms other normalization methods.
Abstract: In open-set speaker identification systems, a known phenomenon is that the false alarm (accept) error rate increases dramatically when increasing the number of registered speakers (models). In this paper, we demonstrate this phenomenon and suggest a solution using a new model-dependent score-normalization technique, called top-norm. The top-norm method was specifically developed to improve the results of open-set speaker identification systems. Also, we suggest a score-normalization parameter adaptation technique. Experiments performed using speaker recognition corpora are described and demonstrate that the new method outperforms other normalization methods.
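A plausible reading of a top-score-based, model-dependent normalization (the exact definition of top-norm may well differ) is to standardize the test score against only the top fraction of each model's impostor-score distribution:

```python
# Sketch: a top-norm-style normalization (assumed form, not the paper's).
import numpy as np

def top_norm(score, impostor_scores, top_frac=0.1):
    s = np.sort(np.asarray(impostor_scores))[::-1]
    top = s[:max(2, int(len(s) * top_frac))]    # top-scoring impostors only
    return (score - top.mean()) / top.std()

rng = np.random.default_rng(8)
print(top_norm(2.5, rng.normal(size=1000)))
```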

Book ChapterDOI
01 May 2006
TL;DR: The LIMSI speaker diarization system for lecture data, presented in the framework of the Rich Transcription 2006 Spring (RT-06S) meeting recognition evaluation, combines agglomerative clustering based on the Bayesian information criterion with a second clustering stage using state-of-the-art speaker identification techniques.
Abstract: This paper presents the LIMSI speaker diarization system for lecture data, in the framework of the Rich Transcription 2006 Spring (RT-06S) meeting recognition evaluation. This system builds upon the baseline diarization system designed for broadcast news data. The baseline system combines agglomerative clustering based on the Bayesian information criterion with a second clustering using state-of-the-art speaker identification techniques. In the RT-04F evaluation, the baseline system provided an overall diarization error of 8.5% on broadcast news data. However, since it has a high missed speech error rate on lecture data, a different speech activity detection approach, based on the log-likelihood ratio between speech and non-speech models trained on the seminar data, was explored. The new speaker diarization system integrating this module provides an overall diarization error of 20.2% on the RT-06S Multiple Distant Microphone (MDM) data.
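The log-likelihood-ratio speech activity detection described amounts to scoring each frame against speech and non-speech models and thresholding; a bare sketch with assumed GMMs and toy training data (smoothing omitted):

```python
# Sketch: log-likelihood-ratio speech activity detection with two GMMs.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(9)
speech_gmm = GaussianMixture(4).fit(rng.normal(loc=1.0, size=(1000, 13)))
nonspeech_gmm = GaussianMixture(4).fit(rng.normal(loc=-1.0, size=(1000, 13)))

def speech_mask(frames, threshold=0.0):
    llr = (speech_gmm.score_samples(frames)
           - nonspeech_gmm.score_samples(frames))   # per-frame LLR
    return llr > threshold

print(speech_mask(rng.normal(loc=1.0, size=(5, 13))))
```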

Proceedings ArticleDOI
01 Dec 2006
TL;DR: This work proposes a new set of features using a complementary filter bank structure which improves the distinguishability of speaker-specific cues present in the higher frequency zone and improves the performance baseline of an MFCC-based system.
Abstract: A state-of-the-art speaker identification (SI) system requires a robust feature extraction unit followed by a speaker modeling scheme for generalized representation of these features. Over the years, Mel-frequency cepstral coefficients (MFCC), modeled on the human auditory system, have been used as a standard acoustic feature set for SI applications. However, due to the structure of its filter bank, MFCC captures vocal tract characteristics more effectively in the lower frequency regions. This work proposes a new set of features using a complementary filter bank structure which improves the distinguishability of speaker-specific cues present in the higher frequency zone. Unlike high-level features that are difficult to extract, the proposed feature set involves little computational burden during the extraction process. When combined with MFCC via a parallel implementation of speaker models, the proposed feature set improves the performance baseline of the MFCC-based system. The proposition is validated by experiments conducted on two different kinds of databases, namely YOHO (microphone speech) and POLYCOST (telephone speech), with a Gaussian mixture model (GMM) as the classifier for various model orders.
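The parallel combination amounts to fusing per-speaker scores from an MFCC-based GMM and a complementary-feature GMM; a sketch of the fusion step, with the weight as an assumption:

```python
# Sketch: fusing MFCC-based and complementary-feature GMM scores per speaker.
def fused_identify(mfcc_scores, comp_scores, alpha=0.5):
    """Both inputs: dict speaker_id -> average log-likelihood."""
    fused = {s: alpha * mfcc_scores[s] + (1 - alpha) * comp_scores[s]
             for s in mfcc_scores}
    return max(fused, key=fused.get)

print(fused_identify({"a": -41.0, "b": -39.5}, {"a": -37.2, "b": -38.8}))
```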

Journal ArticleDOI
TL;DR: PPS signal processing procedures that attempt to align each individual frame to its natural cycle and avoid truncation of pitch cycles while still using constant frame size and frame offset are introduced in an effort to address the above problems.
Abstract: The fine spectral structure related to pitch information is conveyed in Mel cepstral features, with variations in pitch causing variations in the features. For speaker recognition systems, this phenomenon, known as "pitch mismatch" between training and testing, can increase error rates. Likewise, pitch-related variability may potentially increase error rates in speech recognition systems for languages such as English in which pitch does not carry phonetic information. In addition, for both speech recognition and speaker recognition systems, the parsing of the raw speech signal into frames is traditionally performed using a constant frame size and a constant frame offset, without aligning the frames to the natural pitch cycles. As a result, the power spectral estimation that is done as part of the Mel cepstral computation may include artifacts. Pitch synchronous methods have addressed this problem in the past, at the expense of adding some complexity by using a variable frame size and/or offset. This paper introduces pseudo pitch synchronous (PPS) signal processing procedures that attempt to align each individual frame to its natural cycle and avoid truncation of pitch cycles while still using a constant frame size and frame offset, in an effort to address the above problems. Text-independent speaker recognition experiments performed on NIST speaker recognition tasks demonstrate a performance improvement when the scores produced by systems using PPS are fused with traditional speaker recognition scores. In addition, a better distribution of errors across trials may be obtained for similar error rates, and some insight regarding the role of the fundamental frequency in speaker recognition is revealed. Speech recognition experiments run on the Aurora-2 noisy digits task also show improved robustness and better accuracy for extremely low signal-to-noise ratio (SNR) data.

Proceedings ArticleDOI
28 Jun 2006
TL;DR: This paper reports recent improvements to the use of MLLR transforms derived from a speech recognition system as speaker features in a speaker verification system, which has about 27% lower decision cost than a state-of-the-art cepstral GMM speaker system, and 53% lower decision cost when trained on 8 conversation sides per speaker.
Abstract: We previously proposed the use of MLLR transforms derived from a speech recognition system as speaker features in a speaker verification system [1]. In this paper we report recent improvements to this approach. First, we noticed a fundamental problem in our previous implementation that stemmed from a mismatch between male and female recognition models and the model transforms they produce. Although it affects only a small percentage of verification trials (those in which the gender detector commits errors), this mismatch has a large effect on average system performance. We solve this problem by consistently using only one recognition model (either male or female) to compute speaker adaptation transforms, regardless of estimated speaker gender. A further accuracy boost is obtained by combining the feature vectors derived from the male and female models into one larger feature vector. Using 1-conversation-side training, the final system has about 27% lower decision cost than a state-of-the-art cepstral GMM speaker system, and 53% lower decision cost when trained on 8 conversation sides per speaker.
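The core idea is to flatten per-conversation MLLR transform matrices into a fixed-length vector and train an SVM on it; in the sketch below the concatenation of transforms from two recognition models follows the abstract, while dimensions and data are invented:

```python
# Sketch: MLLR adaptation matrices flattened into SVM speaker features.
import numpy as np
from sklearn.svm import SVC

def mllr_feature(transforms):
    """transforms: list of (d, d+1) MLLR matrices [A|b], e.g. one per
    recognition model; concatenating both doubles the feature vector."""
    return np.concatenate([t.ravel() for t in transforms])

rng = np.random.default_rng(10)
d = 39
X = np.stack([mllr_feature([rng.normal(size=(d, d + 1)) for _ in range(2)])
              for _ in range(4)])    # 4 training conversations (toy)
y = [0, 0, 1, 1]                     # target vs impostor labels
svm = SVC(kernel="linear").fit(X, y)
```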

Book ChapterDOI
01 May 2006
TL;DR: Experiments conducted in the framework of the NIST RT'06S evaluation show the ability of the strategy to detect overlapping speech (decreasing the missed speaker error rate), even if an overall gain in speaker diarization performance has not yet been achieved.
Abstract: This paper is concerned with the speaker diarization task in the specific context of meeting room recordings. Firstly, different technical improvements of an E-HMM based system are proposed and evaluated in the framework of the NIST RT'06S evaluation campaign. Related experiments show absolute gains in overall speaker diarization error rate (DER) of 6.4% and 12.9% on the development and evaluation corpora, respectively. Secondly, this paper presents an original strategy to deal with overlapping speech. Indeed, speech overlaps between speakers occur frequently in meetings due to the spontaneous nature of this kind of data, and they are responsible for a decrease in performance of the speaker diarization system if they are not dealt with. Experiments, again conducted in the framework of the NIST RT'06S evaluation, show the ability of the strategy to detect overlapping speech (decreasing the missed speaker error rate), even if an overall gain in speaker diarization performance has not yet been achieved.