
Showing papers on "Speaker diarisation published in 2006"


Journal ArticleDOI
TL;DR: This work examines the idea of using the GMM supervector in a support vector machine (SVM) classifier and proposes two new SVM kernels based on distance metrics between GMM models that produce excellent classification accuracy in a NIST speaker recognition evaluation task.
Abstract: Gaussian mixture models (GMMs) have proven extremely successful for text-independent speaker recognition. The standard training method for GMM models is to use MAP adaptation of the means of the mixture components based on speech from a target speaker. Recent methods in compensation for speaker and channel variability have proposed the idea of stacking the means of the GMM model to form a GMM mean supervector. We examine the idea of using the GMM supervector in a support vector machine (SVM) classifier. We propose two new SVM kernels based on distance metrics between GMM models. We show that these SVM kernels produce excellent classification accuracy in a NIST speaker recognition evaluation task.
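As a rough illustration of the supervector idea (not the paper's exact kernels, and with toy data), the sketch below MAP-adapts only the means of a UBM to each utterance, stacks them into a supervector, and trains a linear SVM; all names and parameters here are hypothetical:

```python
# Sketch: GMM mean supervectors as SVM inputs (illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def mean_supervector(features, ubm, relevance=16.0):
    """MAP-adapt only the UBM means to one utterance, then stack them."""
    post = ubm.predict_proba(features)            # (T, C) responsibilities
    n_c = post.sum(axis=0)                        # soft counts per component
    f_c = post.T @ features                       # first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]    # adaptation coefficients
    means = alpha * (f_c / np.maximum(n_c[:, None], 1e-8)) + (1 - alpha) * ubm.means_
    return means.ravel()                          # (C * dim,) supervector

rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(rng.normal(size=(2000, 13)))              # toy "background" data

# One supervector per training utterance, labeled by speaker.
X = np.stack([mean_supervector(rng.normal(loc=s, size=(200, 13)), ubm)
              for s in (0.0, 0.0, 1.0, 1.0)])
y = [0, 0, 1, 1]
svm = SVC(kernel="linear").fit(X, y)              # linear kernel on supervectors
```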

1,081 citations


Journal ArticleDOI
TL;DR: An overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, is provided, and their relative merits and limitations are discussed.
Abstract: Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification.

634 citations


01 Jan 2006
TL;DR: A full account of the algorithms needed to carry out a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels and the practical limitations that will be encountered if these algorithms are implemented on very large data sets are discussed.
Abstract: We give a full account of the algorithms needed to carry out a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels, and we discuss the practical limitations that will be encountered if these algorithms are implemented on very large data sets. This article is intended as a companion to (1), where we presented a new type of likelihood ratio statistic for speaker verification which is designed principally to deal with the problem of inter-session variability, that is, the variability among recordings of a given speaker. This likelihood ratio statistic is based on a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels (such as one of the Switchboard II databases). Our purpose in the current article is to give detailed algorithms for carrying out such a factor analysis. Although we have only experimented with applications of this model in speaker recognition, we will also explain how it could serve as an integrated framework for progressive speaker adaptation and on-line channel adaptation of HMM-based speech recognizers operating in situations where speaker identities are known. The joint factor analysis model can be viewed as a Gaussian distribution on speaker- and channel-dependent (or, more accurately, session-dependent) HMM supervectors in which most (but not all) of the variance in the supervector population is assumed to be accounted for by a small number of hidden variables which we refer to as speaker and channel factors. The speaker factors and the channel factors play different roles in that, for a given speaker, the values of the speaker factors are assumed to be the same for all recordings of the speaker but the channel factors are assumed to vary from one recording to another. For example, the Gaussian distribution on speaker-dependent supervectors used in eigenvoice MAP (2) is a special case of the factor analysis model in which there are no channel factors and all of the variance in the speaker-dependent HMM supervectors is assumed to be accounted for by the speaker factors.
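For orientation, the generative model the abstract describes is commonly written as M = m + Vy + Ux, with speaker factors y shared across a speaker's recordings and channel factors x redrawn per recording; the sketch below uses toy dimensions and random matrices:

```python
# Sketch of the joint factor analysis generative model:
# M = m + V*y + U*x, with speaker factors y fixed per speaker
# and channel factors x redrawn for every recording.
# (A diagonal residual term D*z is often added; omitted here.)
import numpy as np

rng = np.random.default_rng(1)
sv_dim, n_spk_factors, n_chan_factors = 120, 10, 5

m = rng.normal(size=sv_dim)                                # UBM mean supervector
V = rng.normal(scale=0.5, size=(sv_dim, n_spk_factors))    # eigenvoices
U = rng.normal(scale=0.2, size=(sv_dim, n_chan_factors))   # eigenchannels

y = rng.normal(size=n_spk_factors)                         # one speaker's factors

def recording_supervector():
    x = rng.normal(size=n_chan_factors)   # new channel factors each session
    return m + V @ y + U @ x

sessions = [recording_supervector() for _ in range(3)]
# The three supervectors share the speaker offset V @ y but differ in U @ x.
```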

440 citations


Journal ArticleDOI
TL;DR: This paper focuses on optimizing vector quantization (VQ) based speaker identification, which reduces the number of test vectors by pre-quantizing the test sequence prior to matching, and the number of speakers by pruning out unlikely speakers during the identification process.
Abstract: In speaker identification, most of the computation originates from the distance or likelihood computations between the feature vectors of the unknown speaker and the models in the database. The identification time depends on the number of feature vectors, their dimensionality, the complexity of the speaker models and the number of speakers. In this paper, we concentrate on optimizing vector quantization (VQ) based speaker identification. We reduce the number of test vectors by pre-quantizing the test sequence prior to matching, and the number of speakers by pruning out unlikely speakers during the identification process. The best variants are then generalized to Gaussian mixture model (GMM) based modeling. We also apply the algorithms to efficient cohort set search for score normalization in speaker verification. We obtain a speed-up factor of 16:1 in the case of VQ-based modeling with minor degradation in the identification accuracy, and 34:1 in the case of GMM-based modeling. An equal error rate of 7% can be reached in 0.84 s on average when the length of the test utterance is 30.4 s.
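A minimal sketch of the two speed-ups under assumed implementations (the batching scheme, pruning fraction, and all names are illustrative, not the paper's):

```python
# Sketch: pre-quantization of test vectors plus iterative speaker pruning.
import numpy as np
from scipy.cluster.vq import kmeans2

def identify(test_vectors, codebooks, n_prequant=64, batch=16, keep_frac=0.5):
    # Pre-quantize: replace the test sequence by a few k-means centroids.
    reps, _ = kmeans2(test_vectors, n_prequant, seed=0, minit="++")
    alive = list(codebooks)                       # surviving speaker ids
    dist = {s: 0.0 for s in alive}
    for start in range(0, len(reps), batch):
        chunk = reps[start:start + batch]
        for s in alive:                           # nearest-codeword distortion
            d = np.linalg.norm(chunk[:, None, :] - codebooks[s][None], axis=2)
            dist[s] += d.min(axis=1).sum()
        # Prune: drop the worst-scoring half of the remaining speakers.
        alive = sorted(alive, key=dist.get)[:max(1, int(len(alive) * keep_frac))]
    return alive[0]                               # best matching speaker

rng = np.random.default_rng(2)
codebooks = {s: rng.normal(loc=s, size=(32, 12)) for s in range(4)}
print(identify(rng.normal(loc=2, size=(500, 12)), codebooks))
```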

248 citations


Journal ArticleDOI
TL;DR: This paper describes recent advances in speaker diarization with a multistage segmentation and clustering system that incorporates a speaker identification step and builds upon the baseline audio partitioner used in the LIMSI broadcast news transcription system.
Abstract: This paper describes recent advances in speaker diarization with a multistage segmentation and clustering system, which incorporates a speaker identification step. This system builds upon the baseline audio partitioner used in the LIMSI broadcast news transcription system. The baseline partitioner provides a high cluster purity, but has a tendency to split data from speakers with a large quantity of data into several segment clusters. Several improvements to the baseline system have been made. First, the iterative Gaussian mixture model (GMM) clustering has been replaced by a Bayesian information criterion (BIC) agglomerative clustering. Second, an additional clustering stage has been added, using a GMM-based speaker identification method. Finally, a post-processing stage refines the segment boundaries using the output of a transcription system. On the National Institute of Standards and Technology (NIST) RT-04F and ESTER evaluation data, the multistage system reduces the speaker error by over 70% relative to the baseline system, and gives between 40% and 50% reduction relative to a single-stage BIC clustering system.
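The ΔBIC merge test at the heart of such agglomerative clustering can be sketched as follows, in its common single-Gaussian form; the penalty weight λ and all names here are generic assumptions rather than this system's exact implementation:

```python
# Sketch: Delta-BIC test for merging two clusters of feature vectors.
import numpy as np

def log_det_cov(x):
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    return np.linalg.slogdet(cov)[1]

def delta_bic(x, y, lam=1.0):
    """Positive value favors keeping x and y as separate clusters."""
    n, m, d = len(x), len(y), x.shape[1]
    gain = (0.5 * (n + m) * log_det_cov(np.vstack([x, y]))
            - 0.5 * n * log_det_cov(x)
            - 0.5 * m * log_det_cov(y))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n + m)
    return gain - penalty

rng = np.random.default_rng(3)
same = delta_bic(rng.normal(size=(300, 12)), rng.normal(size=(300, 12)))
diff = delta_bic(rng.normal(size=(300, 12)), rng.normal(loc=3.0, size=(300, 12)))
print(same < diff)   # well-separated data should look more separable
```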

217 citations


Journal ArticleDOI
TL;DR: This paper summarizes the collaboration of the LIA and CLIPS laboratories on speaker diarization of broadcast news during the spring NIST Rich Transcription 2003 evaluation campaign (NIST-RT03S).

141 citations


Patent
19 Oct 2006
TL;DR: A method and apparatus for determining whether a speaker uttering an utterance belongs to a predetermined set of known speakers, where a training utterance is available for each known speaker.
Abstract: A method and apparatus for determining whether a speaker uttering an utterance belongs to a predetermined set comprising known speakers, wherein a training utterance is available for each known speaker. The method and apparatus test whether features extracted from the tested utterance provide a score exceeding a threshold when matched against one or more models constructed upon voice samples of each known speaker. The method and system further provide optional enhancements such as determining, using, and updating model normalization parameters, a fast scoring algorithm, summed-calls handling, or quality evaluation for the tested utterance.

133 citations


Proceedings Article
01 Jan 2006
TL;DR: A system for model-based speech separation is described which achieves super-human recognition performance when two talkers speak at similar levels and incorporates a novel method for performing two-talker speaker identification and gain estimation.
Abstract: We describe a system for model based speech separation which achieves super-human recognition performance when two talkers speak at similar levels. The system can separate the speech of two speakers from a single channel recording with remarkable results. It incorporates a novel method for performing two-talker speaker identification and gain estimation. We extend the method of model based high resolution signal reconstruction to incorporate temporal dynamics. We report on two methods for introducing dynamics; the first uses dynamics in the acoustic model space, the second incorporates dynamics based on sentence grammar. The addition of temporal constraints leads to dramatic improvements in the separation performance. Once the signals have been separated they are then recognized using speaker dependent labeling.

119 citations


Dissertation
21 Dec 2006
TL;DR: In this thesis, a hierarchical bottom-up mono-channel speaker diarization system is extended for meeting rooms: acoustic beamforming is used to extract speaker location information and obtain a single enhanced signal from all available microphones, which is then used for speaker segmentation and clustering.
Abstract: This thesis presents research into the topic of speaker diarization for meeting rooms. It looks into the algorithms and the implementation of an offline speaker segmentation and clustering system for meeting recordings where usually more than one microphone is available. The main research and system implementation was done during a two-year visit to the International Computer Science Institute (ICSI, Berkeley, California). Speaker diarization is a well studied topic in the domain of broadcast news recordings. Most of the proposed systems involve some sort of hierarchical clustering of the data into clusters, where the optimum number of speakers and their identities are unknown a priori. A very commonly used method is called bottom-up clustering, where multiple initial clusters are iteratively merged until the optimum number of clusters is reached, according to some stopping criterion. Such systems are based on a single channel input, which does not allow a direct application to the meetings domain. Although some efforts have been made to adapt such systems to multichannel data, at the start of this thesis no effective implementation had been proposed. Furthermore, many of these speaker diarization algorithms involve some sort of model training or parameter tuning using external data, which impedes their usability on data different from what they were adapted to. The implementation proposed in this thesis works towards solving the aforementioned problems. Taking the existing hierarchical bottom-up mono-channel speaker diarization system from ICSI, it first uses flexible acoustic beamforming to extract speaker location information and obtain a single enhanced signal from all available microphones. It then implements a train-free speech/non-speech detection on this signal and processes the resulting speech segments with an improved version of the mono-channel speaker diarization system. This system has been modified to use the speaker location information (now available), and several algorithms have been adapted or newly created to fit the system's behavior to each particular recording by obtaining information directly from the acoustics, making it less dependent on the development data. The resulting system is flexible with respect to the meeting-room layout, regardless of the number of microphones and their placement. It is train-free, making it easy to adapt to different sorts of data and domains of application. Finally, it takes a step forward in the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted, with excellent results, to the RT05s and RT06s NIST Rich Transcription evaluations for meetings, where data from two different subdomains (lectures and conferences) was evaluated. In addition, experiments using the RT datasets from all meetings evaluations were used to test the different proposed algorithms, proving their suitability to the task.
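The beamforming step can be illustrated with a bare-bones delay-and-sum: estimate each channel's delay against a reference by cross-correlation, align, and average. This is a toy sketch under simplifying assumptions (wrap-around shifts, time-invariant delays), not the thesis's far more elaborate implementation:

```python
# Sketch: delay-and-sum beamforming across distant microphones.
import numpy as np

def delay_and_sum(channels, max_lag=800):
    """channels: list of equal-length 1-D signals; the first is the reference."""
    ref = channels[0]
    out = np.array(ref, dtype=float)
    for ch in channels[1:]:
        # Estimate delay as the peak of the cross-correlation within +/- max_lag.
        lags = np.arange(-max_lag, max_lag + 1)
        xc = [np.dot(ref, np.roll(ch, k)) for k in lags]
        delay = lags[int(np.argmax(xc))]
        out += np.roll(ch, delay)   # align (np.roll wraps; fine for a sketch)
    return out / len(channels)      # single enhanced signal
```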

105 citations


Proceedings Article
01 Jan 2006
TL;DR: To apply the Ng-Jordan-Weiss (NJW) spectral clustering algorithm to speaker diarization, some domain-specific solutions to the open issues of this algorithm are proposed: choice of metric, selection of the scaling parameter, and estimation of the number of clusters.
Abstract: In this paper, we present a spectral clustering approach to explore the possibility of discovering structure from audio data. To apply the Ng-Jordan-Weiss (NJW) spectral clustering algorithm to speaker diarization, we propose some domain-specific solutions to the open issues of this algorithm: choice of metric, selection of the scaling parameter, and estimation of the number of clusters. Then, a postprocessing step, "cross EM refinement", is conducted to further improve the performance of spectral learning. In experiments, this approach performs very similarly to traditional hierarchical clustering on the audio data of Japanese Parliament Panel Discussions, but it runs much faster than the latter. Index Terms: Speaker Diarization, Spectral Clustering, Cross EM refinement, BIC.
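For reference, the NJW algorithm itself (affinity matrix, normalized Laplacian, top-k eigenvectors, row normalization, k-means) can be sketched as follows; the Gaussian affinity and fixed scaling parameter below are generic choices, not the paper's domain-specific solutions:

```python
# Sketch: Ng-Jordan-Weiss spectral clustering on segment-level features.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.cluster.vq import kmeans2

def njw_cluster(X, k, sigma=1.0):
    A = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))                # D^-1/2 A D^-1/2
    vals, vecs = np.linalg.eigh(L)
    Y = vecs[:, -k:]                               # top-k eigenvectors
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)  # row-normalize
    _, labels = kmeans2(Y, k, seed=0, minit="++")
    return labels

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 6)), rng.normal(4, 1, (50, 6))])
labels = njw_cluster(X, 2)
print(labels[:5], labels[-5:])   # the two blobs should get distinct labels
```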

93 citations


Journal ArticleDOI
TL;DR: This paper extends the traditional SID framework to cochannel speech and derives a joint objective for sequential grouping and SID, leading to a search for the optimum hypothesis, and proposes a hypothesis pruning algorithm based on speaker models to make the search computationally efficient.
Abstract: A human listener has the ability to follow a speaker's voice while others are speaking simultaneously; in particular, the listener can organize the time-frequency energy of the same speaker across time into a single stream. In this paper, we focus on sequential organization in cochannel speech, or mixtures of two voices. We extract minimally corrupted segments, or usable speech, in cochannel speech using a robust multipitch tracking algorithm. The extracted usable speech is shown to capture speaker characteristics and improves speaker identification (SID) performance across various target-to-interferer ratios. To utilize speaker characteristics for sequential organization, we extend the traditional SID framework to cochannel speech and derive a joint objective for sequential grouping and SID, leading to a problem of search for the optimum hypothesis. Subsequently we propose a hypothesis pruning algorithm based on speaker models in order to make the search computationally efficient. Evaluation results show that the proposed system approaches the ceiling SID performance obtained with prior pitch information and yields significant improvement over alternative approaches to sequential organization.

Proceedings ArticleDOI
17 Sep 2006
TL;DR: This work introduces efficient update methods to train adaptation matrices for the full covariance case and experiments with a simplified technique that works almost as well as the exact method.
Abstract: Full covariance models can give better results for speech recognition than diagonal models, yet they introduce complications for standard speaker adaptation techniques such as MLLR and fMLLR. Here we introduce efficient update methods to train adaptation matrices for the full covariance case. We also experiment with a simplified technique in which we pretend that the full covariance Gaussians are diagonal and obtain adaptation matrices under that assumption. We show that this approximate method works almost as well as the exact method.

Journal ArticleDOI
TL;DR: The performance obtained with the HMM-based polyglot synthesis method is better than that of methods based on phone mapping for both adaptation and synthesis; the method can also be used to create synthesizers for languages where no speech resources are available.

Proceedings Article
01 Jan 2006
TL;DR: It is found that speech with various emotions degrades the verification performance of a GMM-UBM based speaker verification system, and an emotion-dependent score normalization method, borrowed from the idea of Hnorm, is proposed.
Abstract: Besides background noise, channel effects and the speaker's health condition, emotion is another factor which may influence the performance of a speaker verification system. In this paper, the performance of a GMM-UBM based speaker verification system on emotional speech is studied. It is found that speech with various emotions degrades the verification performance. Two reasons for this degradation are analyzed: mismatched emotions between the speaker models and the test utterances, and the articulation styles of certain emotions, which create intense intra-speaker vocal variability. In response to the first reason, an emotion-dependent score normalization method is proposed, borrowed from the idea of Hnorm. Index Terms: speaker verification, emotional speech
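Hnorm-style normalization standardizes a model's score with statistics estimated per condition; an emotion-dependent variant in that spirit (a guess at the shape of the proposal, not its exact form) might look like:

```python
# Sketch: emotion-dependent score normalization in the spirit of Hnorm.
# For each speaker model and each emotion, impostor-score statistics are
# estimated offline; test scores are standardized with the matching pair.
def enorm(raw_score, speaker_id, emotion, stats):
    mu, sigma = stats[(speaker_id, emotion)]
    return (raw_score - mu) / sigma

stats = {("spk1", "angry"): (-0.3, 1.4),   # toy numbers
         ("spk1", "neutral"): (0.0, 1.0)}
print(enorm(1.2, "spk1", "angry", stats))
```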

Proceedings ArticleDOI
17 Sep 2006
TL;DR: Four of the main improvements to the ICSI speaker diarization system submitted for the NIST Rich Transcription evaluation (RT06s) conducted in the meetings environment are introduced: a new training-free speech/non-speech detection algorithm, a new algorithm for system initialization, a frame purification algorithm to increase cluster differentiability, and the use of inter-channel delays as features.
Abstract: In this paper we present the ICSI speaker diarization system submitted for the NIST Rich Transcription evaluation (RT06s) [1] conducted in the meetings environment. This is a set of yearly evaluations which in the last two years have included speaker diarization of two kinds of distinct meetings: conference room and lecture room. The system presented focuses on being robust to changes in the meeting conditions by not using any training data. In this paper we introduce four of the main improvements to the system over last year's submission. The first is a new training-free speech/non-speech detection algorithm. The second is the introduction of a new algorithm for system initialization. The third is the use of a frame purification algorithm to increase cluster differentiability. The last improvement is the use of inter-channel delays as features, greatly improving performance. We show the diarization error rate (DER) of this system on all meeting datasets available to date for the multiple distant microphone (MDM) and single distant microphone (SDM) conditions. Index Terms: Speaker diarization, speaker segmentation and clustering, meetings indexing.

Journal ArticleDOI
TL;DR: The novel method always performed better than the reference vocal tract length normalization method adopted in this work; when unsupervised static speaker adaptation was applied in combination with each of the two speaker normalization methods, different behavior was observed on the two corpora.

Proceedings ArticleDOI
04 Jun 2006
TL;DR: An HMM-based approach and a maximum entropy model for speaker role labeling using Mandarin broadcast news speech are presented; it is found that the maximum entropy model performs slightly better than the HMM, and that the combination of the two outperforms either model alone.
Abstract: Identifying a speaker's role (anchor, reporter, or guest speaker) is important for finding the structural information in broadcast news speech. We present an HMM-based approach and a maximum entropy model for speaker role labeling using Mandarin broadcast news speech. The algorithms achieve classification accuracy of about 80% (compared to the baseline of around 50%) using the human transcriptions and manually labeled speaker turns. We found that the maximum entropy model performs slightly better than the HMM, and that the combination of them outperforms any model alone. The impact of the contextual role information is also examined in this study.
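The maximum entropy classifier in such a setup is equivalent to multinomial logistic regression over turn-level features; a toy sketch with invented features:

```python
# Sketch: maximum-entropy (multinomial logistic regression) role labeling.
from sklearn.linear_model import LogisticRegression

# Hypothetical per-turn features: [turn_duration_s, n_words, is_show_start]
X = [[45.0, 120, 1], [8.0, 20, 0], [30.0, 90, 0], [50.0, 150, 1]]
y = ["anchor", "guest", "reporter", "anchor"]

maxent = LogisticRegression(max_iter=1000).fit(X, y)
print(maxent.predict([[40.0, 110, 1]]))   # likely "anchor" on this toy data
```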

Proceedings Article
01 Jan 2006
TL;DR: This paper develops a method to combine TDOA values with acoustic features by calculating a combined log-likelihood over both sets of vectors.
Abstract: Speaker diarization for recordings made in meetings consists of identifying the number of participants in each meeting and creating a list of speech time intervals for each participant. In recently published work [7] we presented some experiments using only TDOA values (time delays of arrival between different channels) applied to this task. We demonstrated that the information in those values can be used to segment the speakers. In this paper we develop a method to combine the TDOA values with the acoustic features by calculating a combined log-likelihood over both sets of vectors. Using this method we have been able to reduce the DER by 16.34% (relative) for the NIST RT05s set (scored without overlap and with manually transcribed references), by 21% (relative) for our devel06s set (scored with overlap and force-aligned references), and by 15% (relative) for the NIST RT06s set (scored with overlap and manually transcribed references). Index Terms: Speaker diarization, speaker segmentation, meetings recognition.
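The fusion amounts to a weighted sum of per-stream log-likelihoods; a minimal sketch (the stream weight and GMM configurations are assumptions, not the paper's values):

```python
# Sketch: combining acoustic and TDOA log-likelihoods for one cluster.
import numpy as np
from sklearn.mixture import GaussianMixture

def combined_loglik(acoustic, tdoa, gmm_ac, gmm_td, w=0.9):
    """Weighted sum of average per-frame log-likelihoods of the two streams."""
    return (w * gmm_ac.score(acoustic)        # .score = mean log-likelihood
            + (1.0 - w) * gmm_td.score(tdoa))

rng = np.random.default_rng(6)
gmm_ac = GaussianMixture(3).fit(rng.normal(size=(500, 19)))
gmm_td = GaussianMixture(2).fit(rng.normal(size=(500, 4)))
print(combined_loglik(rng.normal(size=(100, 19)), rng.normal(size=(100, 4)),
                      gmm_ac, gmm_td))
```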

Journal Article
TL;DR: In this article, the ICSI-SRI entry in the Rich Transcription 2005 Spring Meeting Recognition Evaluation is described; it uses agglomerative clustering with a modified Bayesian information criterion (BIC) measure to determine when to stop merging clusters and to decide which pairs of clusters to merge.
Abstract: In this paper we describe the ICSI-SRI entry in the Rich Transcription 2005 Spring Meeting Recognition Evaluation. The current system is based on the ICSI-SRI clustering system for Broadcast News (BN), with extra modules to process the different meetings tasks in which we participated. Our base system uses agglomerative clustering with a modified Bayesian Information Criterion (BIC) measure to determine when to stop merging clusters and to decide which pairs of clusters to merge. This approach does not require any pre-trained models, thus increasing robustness and simplifying the port from BN to the meetings domain. For the meetings domain, we have added several features to our baseline clustering system, including a purification module that tries to keep the clusters acoustically homogeneous throughout the clustering process, and a delay&sum beamforming algorithm which enhances signal quality for the multiple distant microphones (MDM) sub-task. In post-evaluation work we further improved the delay&sum algorithm, experimented with a new speech/non-speech detector and proposed a new system for the lecture room environment.
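The "modified BIC" in this line of work is often described as keeping the merged model's complexity equal to the sum of the two originals, so that the penalty term cancels; a sketch of that comparison, with illustrative parameter counts and names:

```python
# Sketch: penalty-free "modified BIC" merge test (equal total complexity).
import numpy as np
from sklearn.mixture import GaussianMixture

def merge_score(x, y, n1=3, n2=3):
    """Merge x and y if an (n1+n2)-component GMM on the pooled data scores
    at least as well as separate n1- and n2-component GMMs."""
    g1 = GaussianMixture(n1, covariance_type="diag", random_state=0).fit(x)
    g2 = GaussianMixture(n2, covariance_type="diag", random_state=0).fit(y)
    pooled = np.vstack([x, y])
    gm = GaussianMixture(n1 + n2, covariance_type="diag",
                         random_state=0).fit(pooled)
    return gm.score(pooled) * len(pooled) - (g1.score(x) * len(x)
                                             + g2.score(y) * len(y))

rng = np.random.default_rng(7)
a, b = rng.normal(size=(400, 12)), rng.normal(size=(400, 12))
print(merge_score(a, b))   # merge the pair when this is >= 0
```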

Proceedings ArticleDOI
14 May 2006
TL;DR: This paper presents two algorithms that aim to purify the clusters: the first assigns conflicting speech segments to a new cluster, and the second detects and eliminates non-speech frames when comparing two clusters.
Abstract: When performing speaker diarization, it is common to use an agglomerative clustering approach where the acoustic data is first split in small pieces and then pairs are merged until reaching a stopping point. When using a purely agglomerative clustering technique, one cluster cannot be split into two. Therefore, errors caused by multiple speakers being assigned to one cluster can be common. Furthermore, clusters often contain non-speech frames, creating problems when deciding which two clusters to merge and when to stop the clustering. In this paper, we present two algorithms that aim to purify the clusters. The first assigns conflicting speech segments to a new cluster, and the second detects and eliminates non-speech frames when comparing two clusters. We show improvements of over 18% relative using three datasets from the most current Rich Transcription (RT) evaluations.

Proceedings ArticleDOI
28 Jun 2006
TL;DR: The investigation of the effect of intentional voice modifications on a state-of-the-art speaker recognition system shows vulnerability in both humans and speaker recognition systems to changed voices, and suggests a potential for collaboration between human analysts and automatic speaker recognition systems to address this phenomenon.
Abstract: We investigate the effect of intentional voice modifications on a state-of-the-art speaker recognition system. The investigation includes data collection, where normal and changed voices are collected from subjects conversing by telephone. For comparison purposes, it also includes an evaluation framework similar to that for NIST extended-data speaker recognition. Results show that the state-of-the-art system gives nearly perfect recognition performance in a clean condition using normal voices. Using the threshold from this condition, it falsely rejects 39% of subjects who change their voices during testing. However, this can be improved to 9% if a threshold from the changed-voice testing condition is used. We also compare machine performance with human performance from a pilot listening experiment. Results show that machine performance is comparable to human performance when normal voices are used for both training and testing. However, the machine outperforms humans when changed voices are used for testing. In general, the results show vulnerability in both humans and speaker recognition systems to changed voices, and suggest a potential for collaboration between human analysts and automatic speaker recognition systems to address this phenomenon.

Proceedings ArticleDOI
14 May 2006
TL;DR: This study calculated over forty features for each of 24 shows from the Broadcast News corpus along the dimensions of speaker count, conversation turn, and speaker and show duration, and observed that the number of speakers, the number of turns, and the do-nothing DER correlated best with "nuttiness".
Abstract: Researchers in the speaker diarization community have observed that some audio files show unusually high diarization error rates (DER) (hard-to-crack "nuts"), and some exhibit hyper-sensitivity to tuning parameters ("flakes"). The goal of this study is to systematically study the features that correlate with such behavior. We calculated over forty features for each of 24 shows from the Broadcast News corpus along the dimensions of speaker count, conversation turn, and speaker and show duration. We observed that the number of speakers, the number of turns, and the do-nothing DER (a measure related to the percentage of time the dominant speaker spoke) correlated best with "nuttiness". The do-nothing DER and the number of speakers were also the best correlates of "flakiness".
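One plausible formalization of the do-nothing DER, consistent with the parenthetical description above but not necessarily the paper's exact definition:

```python
# Sketch: "do-nothing" DER as the error obtained by attributing the
# entire show to the dominant speaker (a guess at the measure).
def do_nothing_der(speaker_durations):
    """speaker_durations: dict speaker -> total reference speaking time (s)."""
    total = sum(speaker_durations.values())
    return 1.0 - max(speaker_durations.values()) / total

print(do_nothing_der({"anchor": 900.0, "reporter": 300.0, "guest": 120.0}))
```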

Proceedings ArticleDOI
14 May 2006
TL;DR: An approach to modelling session variability for GMM-based text-independent speaker verification is presented, incorporating a constrained session variability component in both the training and testing procedures.
Abstract: Presented is an approach to modelling session variability for GMM-based text-independent speaker verification, incorporating a constrained session variability component in both the training and testing procedures. The proposed technique reduces the data labelling requirements, removes the discrete categorisation needed by previous techniques, and provides superior performance. Experiments on Mixer conversational telephony data show improvements of as much as 46% in equal error rate over a baseline system. In this paper the algorithm used for the enrollment procedure is described in detail. Results are also presented investigating the response of the technique to short test utterances and varying session subspace dimension.

Proceedings ArticleDOI
14 May 2006
TL;DR: A system which attempts to find true speaker identities from the text transcription of the audio using lexical pattern matching, and shows the effect on performance when using state-of-the-art speaker clustering and speech-to-text transcription systems instead of manual references.
Abstract: Automatic speaker segmentation and clustering methods have improved considerably over the last few years in the Broadcast News domain. However, these generally still produce locally consistent relative labels (such as spkr1, spkr2) rather than true speaker identities (such as Bill Clinton, Ted Koppel). This paper presents a system which attempts to find these true identities from the text transcription of the audio using lexical pattern matching, and shows the effect on performance when using state-of-the-art speaker clustering and speech-to-text transcription systems instead of manual references.
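The lexical patterns in question are of the flavor "this is <NAME>" or "<NAME>, thank you"; a toy regex sketch with invented patterns, not the paper's rule set:

```python
# Sketch: mining candidate speaker names from a transcript with regexes.
import re

PATTERNS = [
    re.compile(r"\bthis is ([A-Z][a-z]+ [A-Z][a-z]+)\b"),     # self-intro
    re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+), thank you\b"),  # hand-back
]

def candidate_names(transcript):
    names = []
    for pat in PATTERNS:
        names.extend(pat.findall(transcript))
    return names

print(candidate_names("Good evening, this is Jane Doe reporting. "
                      "Jane Doe, thank you."))
```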

Proceedings ArticleDOI
28 Jun 2006
TL;DR: The top-norm method, specifically developed to improve the results of open-set speaker identification systems, is presented, and experiments demonstrate that it outperforms other normalization methods.
Abstract: In open-set speaker identification systems, a known phenomenon is that the false alarm (accept) error rate increases dramatically when increasing the number of registered speakers (models). In this paper, we demonstrate this phenomenon and suggest a solution using a new model-dependent score-normalization technique, called top-norm. The top-norm method was specifically developed to improve the results of open-set speaker identification systems. Also, we suggest a score-normalization parameter adaptation technique. Experiments performed using speaker recognition corpora are described and demonstrate that the new method outperforms other normalization methods.
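A plausible reading of a top-score-based, model-dependent normalization (the exact definition of top-norm may well differ) is to standardize the test score against only the top fraction of each model's impostor-score distribution:

```python
# Sketch: a top-norm-style normalization (assumed form, not the paper's).
import numpy as np

def top_norm(score, impostor_scores, top_frac=0.1):
    s = np.sort(np.asarray(impostor_scores))[::-1]
    top = s[:max(2, int(len(s) * top_frac))]    # top-scoring impostors only
    return (score - top.mean()) / top.std()

rng = np.random.default_rng(8)
print(top_norm(2.5, rng.normal(size=1000)))
```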

Book ChapterDOI
01 May 2006
TL;DR: The LIMSI speaker diarization system for lecture data, presented in the framework of the Rich Transcription 2006 Spring (RT-06S) meeting recognition evaluation, combines agglomerative clustering based on the Bayesian information criterion with a second clustering stage using state-of-the-art speaker identification techniques.
Abstract: This paper presents the LIMSI speaker diarization system for lecture data, in the framework of the Rich Transcription 2006 Spring (RT-06S) meeting recognition evaluation. This system builds upon the baseline diarization system designed for broadcast news data. The baseline system combines agglomerative clustering based on the Bayesian information criterion with a second clustering using state-of-the-art speaker identification techniques. In the RT-04F evaluation, the baseline system provided an overall diarization error of 8.5% on broadcast news data. However, since it has a high missed speech error rate on lecture data, a different speech activity detection approach, based on the log-likelihood ratio between speech and non-speech models trained on the seminar data, was explored. The new speaker diarization system integrating this module provides an overall diarization error of 20.2% on the RT-06S Multiple Distant Microphone (MDM) data.
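The log-likelihood-ratio speech activity detection described amounts to scoring each frame against speech and non-speech models and thresholding; a bare sketch with assumed GMMs and toy training data (smoothing omitted):

```python
# Sketch: log-likelihood-ratio speech activity detection with two GMMs.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(9)
speech_gmm = GaussianMixture(4).fit(rng.normal(loc=1.0, size=(1000, 13)))
nonspeech_gmm = GaussianMixture(4).fit(rng.normal(loc=-1.0, size=(1000, 13)))

def speech_mask(frames, threshold=0.0):
    llr = (speech_gmm.score_samples(frames)
           - nonspeech_gmm.score_samples(frames))   # per-frame LLR
    return llr > threshold

print(speech_mask(rng.normal(loc=1.0, size=(5, 13))))
```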

Proceedings ArticleDOI
01 Dec 2006
TL;DR: This work proposes a new set of features using a complementary filter bank structure which improves the distinguishability of speaker-specific cues present in the higher frequency zone and improves the performance baseline of an MFCC-based system.
Abstract: A state-of-the-art speaker identification (SI) system requires a robust feature extraction unit followed by a speaker modeling scheme for generalized representation of these features. Over the years, Mel-frequency cepstral coefficients (MFCC), modeled on the human auditory system, have been used as a standard acoustic feature set for SI applications. However, due to the structure of its filter bank, MFCC captures vocal tract characteristics more effectively in the lower frequency regions. This work proposes a new set of features using a complementary filter bank structure which improves the distinguishability of speaker-specific cues present in the higher frequency zone. Unlike high-level features that are difficult to extract, the proposed feature set involves little computational burden during the extraction process. When combined with MFCC via a parallel implementation of speaker models, the proposed feature set improves the performance baseline of the MFCC-based system. The proposition is validated by experiments conducted on two different kinds of databases, namely YOHO (microphone speech) and POLYCOST (telephone speech), with a Gaussian mixture model (GMM) as the classifier for various model orders.
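The parallel combination amounts to fusing per-speaker scores from an MFCC-based GMM and a complementary-feature GMM; a sketch of the fusion step, with the weight as an assumption:

```python
# Sketch: fusing MFCC-based and complementary-feature GMM scores per speaker.
def fused_identify(mfcc_scores, comp_scores, alpha=0.5):
    """Both inputs: dict speaker_id -> average log-likelihood."""
    fused = {s: alpha * mfcc_scores[s] + (1 - alpha) * comp_scores[s]
             for s in mfcc_scores}
    return max(fused, key=fused.get)

print(fused_identify({"a": -41.0, "b": -39.5}, {"a": -37.2, "b": -38.8}))
```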

Journal ArticleDOI
TL;DR: PPS signal processing procedures that attempt to align each individual frame to its natural cycle and avoid truncation of pitch cycles while still using constant frame size and frame offset are introduced in an effort to address the above problems.
Abstract: The fine spectral structure related to pitch information is conveyed in Mel cepstral features, with variations in pitch causing variations in the features. For speaker recognition systems, this phenomenon, known as "pitch mismatch" between training and testing, can increase error rates. Likewise, pitch-related variability may potentially increase error rates in speech recognition systems for languages such as English in which pitch does not carry phonetic information. In addition, for both speech recognition and speaker recognition systems, the parsing of the raw speech signal into frames is traditionally performed using a constant frame size and a constant frame offset, without aligning the frames to the natural pitch cycles. As a result, the power spectral estimation that is done as part of the Mel cepstral computation may include artifacts. Pitch synchronous methods have addressed this problem in the past, at the expense of adding some complexity by using a variable frame size and/or offset. This paper introduces pseudo pitch synchronous (PPS) signal processing procedures that attempt to align each individual frame to its natural cycle and avoid truncation of pitch cycles while still using a constant frame size and frame offset, in an effort to address the above problems. Text-independent speaker recognition experiments performed on NIST speaker recognition tasks demonstrate a performance improvement when the scores produced by systems using PPS are fused with traditional speaker recognition scores. In addition, a better distribution of errors across trials may be obtained for similar error rates, and some insight regarding the role of the fundamental frequency in speaker recognition is revealed. Speech recognition experiments run on the Aurora-2 noisy digits task also show improved robustness and better accuracy for extremely low signal-to-noise ratio (SNR) data.

Proceedings ArticleDOI
28 Jun 2006
TL;DR: This paper reports recent improvements to the use of MLLR transforms derived from a speech recognition system as speaker features in a speaker verification system, which has about 27% lower decision cost than a state-of-the-art cepstral GMM speaker system, and 53% lower decision cost when trained on 8 conversation sides per speaker.
Abstract: We previously proposed the use of MLLR transforms derived from a speech recognition system as speaker features in a speaker verification system [1]. In this paper we report recent improvements to this approach. First, we noticed a fundamental problem in our previous implementation that stemmed from a mismatch between male and female recognition models and the model transforms they produce. Although it affects only a small percentage of verification trials (those in which the gender detector commits errors), this mismatch has a large effect on average system performance. We solve this problem by consistently using only one recognition model (either male or female) to compute speaker adaptation transforms, regardless of estimated speaker gender. A further accuracy boost is obtained by combining the feature vectors derived from the male and female models into one larger feature vector. Using 1-conversation-side training, the final system has about 27% lower decision cost than a state-of-the-art cepstral GMM speaker system, and 53% lower decision cost when trained on 8 conversation sides per speaker.
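The core idea is to flatten per-conversation MLLR transform matrices into a fixed-length vector and train an SVM on it; in the sketch below the concatenation of transforms from two recognition models follows the abstract, while dimensions and data are invented:

```python
# Sketch: MLLR adaptation matrices flattened into SVM speaker features.
import numpy as np
from sklearn.svm import SVC

def mllr_feature(transforms):
    """transforms: list of (d, d+1) MLLR matrices [A|b], e.g. one per
    recognition model; concatenating both doubles the feature vector."""
    return np.concatenate([t.ravel() for t in transforms])

rng = np.random.default_rng(10)
d = 39
X = np.stack([mllr_feature([rng.normal(size=(d, d + 1)) for _ in range(2)])
              for _ in range(4)])    # 4 training conversations (toy)
y = [0, 0, 1, 1]                     # target vs impostor labels
svm = SVC(kernel="linear").fit(X, y)
```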

Book ChapterDOI
01 May 2006
TL;DR: Experiments conducted in the framework of the NIST RT'06S evaluation show the ability of the strategy to detect overlapping speech (decreasing the missed speaker error rate), even if an overall gain in speaker diarization performance has not yet been achieved.
Abstract: This paper is concerned with the speaker diarization task in the specific context of meeting room recordings. Firstly, different technical improvements of an E-HMM based system are proposed and evaluated in the framework of the NIST RT'06S evaluation campaign. Related experiments show absolute gains in overall speaker diarization error rate (DER) of 6.4% and 12.9% on the development and evaluation corpora, respectively. Secondly, this paper presents an original strategy to deal with overlapping speech. Indeed, speech overlaps between speakers occur frequently in meetings due to the spontaneous nature of this kind of data, and they are responsible for a decrease in performance of the speaker diarization system if they are not dealt with. Experiments, again conducted in the framework of the NIST RT'06S evaluation, show the ability of the strategy to detect overlapping speech (decreasing the missed speaker error rate), even if an overall gain in speaker diarization performance has not yet been achieved.