Showing papers on "Speaker diarisation published in 2008"


Journal ArticleDOI
TL;DR: The main findings are that a combination of Gaussian mixture models and monophone HMM models attains near‐100% text‐independent identification accuracy on utterances that are longer than one second, and the sampling rate of 11025 Hz achieves the best performance.
Abstract: This paper reports the results of our experiments on speaker identification in the SCOTUS corpus, which includes oral arguments from the Supreme Court of the United States. Our main findings are as follows: (1) a combination of Gaussian mixture models and monophone HMM models attains near-100% text-independent identification accuracy on utterances that are longer than one second; (2) the sampling rate of 11025 Hz achieves the best performance (higher sampling rates are harmful), and a sampling rate as low as 2000 Hz still achieves more than 90% accuracy; (3) a distance score based on likelihood numbers was used to measure the variability of phones among speakers; we found that the most variable phone is UH (as in good), and that the velar nasal NG is more variable than the other two nasal sounds M and N; (4) our models achieved “perfect” forced alignment on very long speech segments (one hour). These findings and their significance are discussed.
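The GMM half of such a system is compact enough to sketch. The snippet below is a minimal illustration only, assuming synthetic stand-ins for per-frame MFCCs and scikit-learn's GaussianMixture; it is not the authors' SCOTUS setup, which also combines monophone HMMs.

```python
# Minimal sketch of GMM-based text-independent speaker identification.
# Synthetic vectors stand in for MFCC frames; the paper's HMM component is omitted.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Enrollment: one (frames x 13) feature matrix per speaker.
enroll = {spk: rng.normal(loc=spk, scale=1.0, size=(500, 13)) for spk in range(3)}

# One diagonal-covariance GMM per speaker, as in classical speaker identification.
models = {spk: GaussianMixture(n_components=8, covariance_type="diag",
                               random_state=0).fit(feats)
          for spk, feats in enroll.items()}

def identify(test_frames):
    """Return the enrolled speaker whose GMM gives the highest average log-likelihood."""
    scores = {spk: gmm.score(test_frames) for spk, gmm in models.items()}
    return max(scores, key=scores.get)

test_utterance = rng.normal(loc=1, scale=1.0, size=(100, 13))  # frames from "speaker 1"
print(identify(test_utterance))  # expected: 1
```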

585 citations


Journal ArticleDOI
TL;DR: According to event-related potential results, language comprehension takes very rapid account of the social context, and the construction of meaning based on language alone cannot be separated from the social aspects of language use.
Abstract: When do listeners take into account who the speaker is? We asked people to listen to utterances whose content sometimes did not match inferences based on the identity of the speaker (e.g., If only I looked like Britney Spears in a male voice, or I have a large tattoo on my back spoken with an upper-class accent). Event-related brain responses revealed that the speaker's identity is taken into account as early as 200–300 msec after the beginning of a spoken word, and is processed by the same early interpretation mechanism that constructs sentence meaning based on just the words. This finding is difficult to reconcile with standard Gricean models of sentence interpretation in which comprehenders initially compute a local, context-independent meaning for the sentence (semantics) before working out what it really means given the wider communicative context and the particular speaker (pragmatics). Because the observed brain response hinges on voice-based and usually stereotype-dependent inferences about the speaker, it also shows that listeners rapidly classify speakers on the basis of their voices and bring the associated social stereotypes to bear on what is being said. According to our event-related potential results, language comprehension takes very rapid account of the social context, and the construction of meaning based on language alone cannot be separated from the social aspects of language use. The linguistic brain relates the message to the speaker immediately.

368 citations


Book ChapterDOI
01 Jan 2008
TL;DR: The ICSI speaker diarization system as mentioned in this paper automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers, using standard speech processing components and techniques such as HMMs, agglomerative clustering, and the Bayesian Information Criterion.
Abstract: In this paper, we present the ICSI speaker diarization system. This system was used in the 2007 National Institute of Standards and Technology (NIST) Rich Transcription evaluation. The ICSI system automatically performs both speaker segmentation and clustering without any prior knowledge of the identities or the number of speakers. Our system uses "standard" speech processing components and techniques such as HMMs, agglomerative clustering, and the Bayesian Information Criterion. However, we have developed the system with an eye towards robustness and ease of portability. Thus we have avoided the use of any sort of model that requires training on "outside" data, and we have attempted to develop algorithms that require as little tuning as possible. The system is similar to last year's system [1] except for three aspects: we used the most recent available version of the beam-forming toolkit, we implemented a new speech/non-speech detector that does not require models trained on meeting data, and we performed our development on a much larger set of recordings.
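As a rough illustration of the BIC-based merge decision at the heart of agglomerative diarization, the sketch below compares modelling two segments with one full-covariance Gaussian versus two. It is only a toy under a single-Gaussian-per-cluster assumption, not the ICSI system's HMM/GMM cluster models.

```python
# Toy delta-BIC merge test for agglomerative clustering (single full-covariance
# Gaussian per cluster, not the ICSI models): a positive score favours merging.
import numpy as np

def logdet_cov(frames):
    return np.linalg.slogdet(np.cov(frames, rowvar=False))[1]

def merge_score(x, y, lam=1.0):
    """BIC(one Gaussian for x and y pooled) minus BIC(one Gaussian each)."""
    d = x.shape[1]
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Likelihood term: pooling costs log-likelihood when the segments differ.
    gain = 0.5 * (n1 * logdet_cov(x) + n2 * logdet_cov(y)
                  - n * logdet_cov(np.vstack([x, y])))
    # Penalty saved by dropping one Gaussian's parameters (mean + full covariance).
    penalty_saved = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain + penalty_saved

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, (300, 12))
b = rng.normal(0.0, 1.0, (300, 12))   # same "speaker" as a
c = rng.normal(3.0, 1.0, (300, 12))   # different "speaker"
print(merge_score(a, b) > 0, merge_score(a, c) > 0)   # expected: True False
```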

224 citations


Proceedings ArticleDOI
12 May 2008
TL;DR: This work presents the initial work toward developing an overlap detection system for improved meeting diarization, and investigates various features, with a focus on high-precision performance for use in the detector, and examines performance results on a subset of the AMI Meeting Corpus.
Abstract: State-of-the-art speaker diarization systems for meetings are now at a point where overlapped speech contributes significantly to the errors made by the system. However, little if any work has yet been done on detecting overlapped speech. We present our initial work toward developing an overlap detection system for improved meeting diarization. We investigate various features, with a focus on high-precision performance for use in the detector, and examine performance results on a subset of the AMI Meeting Corpus. For the high-quality signal case of a single mixed-headset channel signal, we demonstrate a relative improvement of about 7.4% DER over the baseline diarization system, while for the more challenging case of the single far-field channel signal the relative improvement is 3.6%. We also outline steps towards improvement and moving beyond this initial phase.

150 citations


01 Jan 2008
TL;DR: This paper presents the ALIZE/SpkDet open source software packages for text-independent speaker recognition, based on the well-known UBM/GMM approach, which also include the latest speaker recognition developments such as Latent Factor Analysis (LFA) and unsupervised adaptation.
Abstract: This paper presents the ALIZE/SpkDet open source software packages for text-independent speaker recognition. This software is based on the well-known UBM/GMM approach. It also includes the latest speaker recognition developments such as Latent Factor Analysis (LFA) and unsupervised adaptation. Discriminant classifiers such as SVM supervectors are also provided, linked with the Nuisance Attribute Projection (NAP). The software performance is demonstrated within the framework of the NIST'06 SRE evaluation campaign. Several other applications like speaker diarization, embedded speaker recognition, password dependent speaker recognition and pathological voice assessment are also presented.
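Among the techniques listed, Nuisance Attribute Projection has a particularly compact core operation. The sketch below shows only that projection step on toy-sized supervectors, with a random nuisance basis standing in for one estimated from data; it does not reproduce the ALIZE/SpkDet pipeline.

```python
# Sketch of the Nuisance Attribute Projection step on toy GMM supervectors:
# remove the subspace that captures session/channel variability before SVM scoring.
# The nuisance basis U is random here; in practice it is estimated from
# within-speaker (session) scatter.
import numpy as np

rng = np.random.default_rng(0)
dim, nap_rank = 200, 5                       # supervector size, nuisance subspace rank

U, _ = np.linalg.qr(rng.normal(size=(dim, nap_rank)))   # orthonormal nuisance basis

def nap_project(supervector):
    """Project out the nuisance directions: v - U (U^T v)."""
    return supervector - U @ (U.T @ supervector)

v = rng.normal(size=dim)
cleaned = nap_project(v)
print(np.allclose(U.T @ cleaned, 0.0))       # True: no energy left along the nuisance basis
```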

114 citations


Journal ArticleDOI
TL;DR: This paper proposes a new physiological feature that highlights speaker-individual information for text-independent speaker identification, using a non-uniform subband processing strategy to emphasize the physiological information involved in speech production.

111 citations


Journal ArticleDOI
TL;DR: This survey focuses on two challenging speech processing topics, namely speaker segmentation and speaker clustering; model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed.

108 citations


Book ChapterDOI
01 Jan 2008
TL;DR: The Spring 2007 Rich Transcription Meeting Recognition Evaluation (RT-07) as mentioned in this paper was the fifth in a series of community-wide evaluations of language technologies in the meeting domain.
Abstract: We present the design and results of the Spring 2007 (RT-07) Rich Transcription Meeting Recognition Evaluation, the fifth in a series of community-wide evaluations of language technologies in the meeting domain. For 2007, we supported three evaluation tasks: Speech-To-Text (STT) transcription, "Who Spoke When" Diarization (SPKR), and Speaker Attributed Speech-To-Text (SASTT). The SASTT task, which combines STT and SPKR tasks, was a new evaluation task. The test data consisted of three test sets: Conference Meetings, Lecture Meetings, and Coffee Breaks from lecture meetings. The Coffee Break data was included as a new test set this year. Twenty-one research sites materially contributed to the evaluation by providing data or building systems. The lowest STT word error rates with up to four simultaneous speakers in the multiple distant microphone condition were 40.6%, 49.8%, and 48.4% for the conference, lecture, and coffee break test sets respectively. For the SPKR task, the lowest diarization error rates for all speech in the multiple distant microphone condition were 8.5%, 25.8%, and 25.5% for the conference, lecture, and coffee break test sets respectively. For the SASTT task, the lowest speaker attributed word error rates for segments with up to three simultaneous speakers in the multiple distant microphone condition were 40.3%, 59.3%, and 68.4% for the conference, lecture, and coffee break test sets respectively.
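For reference, the SPKR figures above are scored with the NIST diarization error rate, which (in its standard form, not restated in the abstract) sums missed speech, false-alarm speech, and speaker-confusion time over the scored speech time:

\[
\mathrm{DER} \;=\; \frac{T_{\text{miss}} + T_{\text{false alarm}} + T_{\text{speaker error}}}{T_{\text{scored speech}}}.
\]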

108 citations


Journal ArticleDOI
TL;DR: In this paper, an explicit session term in the Gaussian mixture speaker modeling framework is proposed to model mismatch in speaker recognition by combining the true speaker model with an additional session-dependent offset constrained to lie in a low-dimensional subspace representing session variability.
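In supervector notation, the model described in this summary can be written (using generic symbols, not necessarily the authors' notation) as

\[
\boldsymbol{\mu}_{s,h} \;=\; \boldsymbol{\mu}_{s} + \mathbf{U}\,\mathbf{x}_{h},
\]

where \(\boldsymbol{\mu}_{s}\) is the true speaker supervector, \(\mathbf{U}\) is a low-rank matrix spanning the session-variability subspace, and \(\mathbf{x}_{h}\) is the session-dependent offset coordinate for recording \(h\).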

101 citations


Proceedings ArticleDOI
12 May 2008
TL;DR: A stream-based approach for unsupervised multi-speaker conversational speech segmentation that produces segmentation error rates better than the state of the art ones reported in previous work on the segmentation task in the NIST 2000 Speaker Recognition Evaluation (SRE).
Abstract: This paper presents a stream-based approach for unsupervised multi-speaker conversational speech segmentation. The main idea of this work is to exploit prior knowledge about the speaker space to find a low dimensional vector of speaker factors that summarize the salient speaker characteristics. This new approach produces segmentation error rates that are better than the state of the art ones reported in our previous work on the segmentation task in the NIST 2000 Speaker Recognition Evaluation (SRE). We also show how the performance of a speaker recognition system in the core test of the 2006 NIST SRE is affected, comparing the results obtained using single speaker and automatically segmented test data.

99 citations


Book ChapterDOI
01 Jan 2008
TL;DR: The intrinsic dependence that the lexical content of the password phrase has on the accuracy is demonstrated and several research results will be presented and analyzed to show key techniques used in text-dependent speaker recognition systems from different sites.
Abstract: Text-dependent speaker recognition characterizes a speaker recognition task, such as verification or identification, in which the set of words (or lexicon) used during the testing phase is a subset of the ones present during the enrollment phase. The restricted lexicon enables very short enrollment (or registration) and testing sessions to deliver an accurate solution but, at the same time, represents scientific and technical challenges. Because of the short enrollment and testing sessions, text-dependent speaker recognition technology is particularly well suited for deployment in large-scale commercial applications. These are the bases for presenting an overview of the state of the art in text-dependent speaker recognition as well as emerging research avenues. In this chapter, we will demonstrate the intrinsic dependence that the lexical content of the password phrase has on the accuracy. Several research results will be presented and analyzed to show key techniques used in text-dependent speaker recognition systems from different sites. Among these, we mention multichannel speaker model synthesis and continuous adaptation of speaker models with threshold tracking. Since text-dependent speaker recognition is the most widely used voice biometric in commercial deployments, several

Proceedings ArticleDOI
20 Oct 2008
TL;DR: A realtime system for analyzing group meetings is presented that uses a novel omnidirectional camera-microphone system to automatically discover the visual focus of attention (VFOA), i.e. "who is looking at whom", in addition to speaker diarization; new 3-D visualization schemes for meeting scenes and the results of an analysis are also presented.
Abstract: This paper presents a realtime system for analyzing group meetings that uses a novel omnidirectional camera-microphone system. The goal is to automatically discover the visual focus of attention (VFOA), i.e. "who is looking at whom", in addition to speaker diarization, i.e. "who is speaking and when". First, a novel tabletop sensing device for round-table meetings is presented; it consists of two cameras with two fisheye lenses and a triangular microphone array. Second, from high-resolution omnidirectional images captured with the cameras, the position and pose of people's faces are estimated by STCTracker (Sparse Template Condensation Tracker); it realizes realtime robust tracking of multiple faces by utilizing GPUs (Graphics Processing Units). The face position/pose data output by the face tracker is used to estimate the focus of attention in the group. Using the microphone array, robust speaker diarization is carried out by a VAD (Voice Activity Detection) and a DOA (Direction of Arrival) estimation followed by sound source clustering. This paper also presents new 3-D visualization schemes for meeting scenes and the results of an analysis. Using two PCs, one for vision and one for audio processing, the system runs at about 20 frames per second for 5-person meetings.
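The abstract does not spell out its DOA algorithm; a common choice for a microphone pair is GCC-PHAT time-delay estimation, sketched below purely to illustrate that stage of the VAD → DOA → clustering pipeline (the function and toy signal are ours, not the paper's).

```python
# Sketch of time-delay estimation with GCC-PHAT for one microphone pair.
# The delay estimate is what a DOA stage would convert into an arrival direction.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimated delay (seconds) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12               # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

fs = 16000
rng = np.random.default_rng(0)
clean = rng.normal(size=fs)                          # 1 s of broadband "speech"
delayed = np.concatenate([np.zeros(8), clean[:-8]])  # second mic hears it 8 samples later
print(gcc_phat(delayed, clean, fs))                  # expected: 8 / 16000 = 5e-4 s
```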

Patent
10 Oct 2008
TL;DR: In this paper, a method that automatically recognizes speech received through an input is presented; it detects whether the received speech input matches a speaker model according to an adaptable predetermined criterion and creates a speaker model assigned to a speaker model set when no match occurs.
Abstract: A method automatically recognizes speech received through an input. The method accesses one or more speaker-independent speaker models. The method detects whether the received speech input matches a speaker model according to an adaptable predetermined criterion. The method creates a speaker model assigned to a speaker model set when no match occurs based on the input.

01 Sep 2008
TL;DR: This paper shows the danger of not using different speakers in the training and test sets and demonstrates that lip-reading visual features, when compared with the MFCCs commonly used for audio speech recognition, have inherently small variation within a single speaker across all classes spoken.
Abstract: In speech recognition, the problem of speaker variability has been well studied. Common approaches to dealing with it include normalising for a speaker's vocal tract length and learning a linear transform that moves the speaker-independent models closer to a new speaker. In pure lip-reading (no audio) the problem has been less well studied. Results are often presented that are based on speaker-dependent (single speaker) or multi-speaker (speakers in the test set are also in the training set) data, situations that are of limited use in real applications. This paper shows the danger of not using different speakers in the training and test sets. Firstly, we present classification results on a new single-word database, AVletters 2, which is a high-definition version of the well-known AVletters database. By careful choice of features, we show that it is possible for the performance of visual-only lip-reading to be very close to that of audio-only recognition for the single-speaker and multi-speaker configurations. However, in the speaker-independent configuration, the performance of the visual-only channel degrades dramatically. By applying multidimensional scaling (MDS) to both the audio features and visual features, we demonstrate that lip-reading visual features, when compared with the MFCCs commonly used for audio speech recognition, have inherently small variation within a single speaker across all classes spoken. However, visual features are highly sensitive to the identity of the speaker, whereas audio features are relatively invariant.
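The MDS comparison itself is easy to mimic with off-the-shelf tools. The sketch below embeds two synthetic feature sets in 2-D and measures how far apart the speakers end up; it is an illustration of the analysis idea only, not the AVletters 2 experiment.

```python
# Sketch of the MDS comparison idea: embed "audio-style" and "visual-style" features
# in 2-D and compare how strongly speaker identity separates them. Data are synthetic.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
# Two "speakers", 50 utterances each: visual-style features shift strongly with
# speaker identity, audio-style features much less so.
speaker = np.repeat([0, 1], 50)
audio_feats = rng.normal(size=(100, 20)) + 0.3 * speaker[:, None]
visual_feats = rng.normal(size=(100, 20)) + 3.0 * speaker[:, None]

for name, feats in [("audio", audio_feats), ("visual", visual_feats)]:
    emb = MDS(n_components=2, random_state=0).fit_transform(feats)
    gap = np.linalg.norm(emb[speaker == 0].mean(0) - emb[speaker == 1].mean(0))
    print(name, round(gap, 2))   # the visual embedding separates the speakers far more
```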

Patent
20 Aug 2008
TL;DR: In this article, the method of operating a man-machine interface unit includes classifying at least one utterance of a speaker to be of a first type or of a second type, and automatically adding a new speaker to the speaker data base based on utterances of one of the clusters.
Abstract: The method of operating a man-machine interface unit includes classifying at least one utterance of a speaker to be of a first type or of a second type. If the utterance is classified to be of the first type, the utterance belongs to a known speaker of a speaker data base, and if the utterance is classified to be of the second type, the utterance belongs to an unknown speaker that is not included in the speaker data base. The method also includes storing a set of utterances of the second type, clustering the set of utterances into clusters, wherein each cluster comprises utterances having similar features, and automatically adding a new speaker to the speaker data base based on utterances of one of the clusters.

Proceedings Article
01 Jan 2008
TL;DR: In this paper, the authors compared several alternative procedures for this task with a particular focus on training and testing with short utterances and showed that better performance can be obtained when an independent rather than simultaneous optimisation of the two core variability subspaces is used.
Abstract: Training the speaker and session subspaces is an integral problem in developing a joint factor analysis GMM speaker verification system. This work investigates and compares several alternative procedures for this task with a particular focus on training and testing with short utterances. Experiments show that better performance can be obtained when an independent rather than simultaneous optimisation of the two core variability subspaces is used. It is additionally shown that for verification trials on short utterances it is important for the session subspace to be trained with matched length utterances. Conversely, the speaker transform should always be trained with as much data as possible.

Journal ArticleDOI
TL;DR: In this study, 10 types of disguised voices and normal voices from 20 male college students were used as test samples, and each disguised voice was compared with all normal voices in the database to perform speaker identification and speaker verification.

Book
01 Jan 2008
TL;DR: A study of Acoustic Correlates of Speaker Age, and the impact of Visual and Auditory Cues in Age Estimation on Decoding of Semantic Emotion.

Journal ArticleDOI
TL;DR: A novel alternative stopping method for AHC based on information change rate (ICR) is proposed and is demonstrated to be more robust to data source variation than the BIC-based one.
Abstract: Many current state-of-the-art speaker diarization systems exploit agglomerative hierarchical clustering (AHC) as their speaker clustering strategy, due to its simple processing structure and acceptable level of performance. However, AHC is known to suffer from performance robustness under data source variation. In this paper, we address this problem. We specifically focus on the issues associated with the widely used clustering stopping method based on Bayesian information criterion (BIC) and the merging-cluster selection scheme based on generalized likelihood ratio (GLR). First, we propose a novel alternative stopping method for AHC based on information change rate (ICR). Through experiments on several meeting corpora, the proposed method is demonstrated to be more robust to data source variation than the BIC-based one. The average improvement obtained in diarization error rate (DER) by this method is 8.76% (absolute) or 35.77% (relative). We also introduce a selective AHC (SAHC) in the paper, which first runs AHC with the ICR-based stopping method only on speech segments longer than 3 s and then classifies shorter speech segments into one of the clusters given by the initial AHC. This modified version of AHC is motivated by our previous analysis that the proportion of short speech turns (or segments) in a data source is a significant factor contributing to the robustness problem arising in the GLR-based merging-cluster selection scheme. The additional performance improvement obtained by SAHC is 3.45% (absolute) or 14.08% (relative) in terms of averaged DER.
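For context, the GLR-based merging-cluster selection referred to above compares, for each cluster pair, the likelihood of the pooled data under a single model against the product of the per-cluster likelihoods (generic notation, not the authors'):

\[
\mathrm{GLR}(c_i, c_j) \;=\; \frac{p\!\left(x_i \cup x_j \mid \theta_{ij}\right)}{p\!\left(x_i \mid \theta_i\right)\, p\!\left(x_j \mid \theta_j\right)},
\]

and AHC merges the pair with the largest ratio at each iteration.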

Proceedings Article
01 Jan 2008
TL;DR: A UBM-GMM verification system is applied to semi-automatically extracted formant features and speaker comparisons are expressed as likelihood ratios (the ratio of similarity to typicality), enabling speakers to be distinguished with a low error rate.
Abstract: A new method for speaker verification based on formant features is presented. A UBM-GMM verification system is applied to semi-automatically extracted formant features. Speaker-specific vocal tract configurations, including the speakers’ variability, are incorporated in the speaker models. Speaker comparisons are expressed as likelihood ratios (the ratio of similarity to typicality). F1, F2 and F3 values all enable speakers to be distinguished with a low error rate. The corresponding bandwidths further lower the error rate.
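In the UBM-GMM framing used here, the "ratio of similarity to typicality" is the usual likelihood ratio between the speaker model and the background model (written in generic notation):

\[
\mathrm{LR}(X) \;=\; \frac{p\!\left(X \mid \lambda_{\text{speaker}}\right)}{p\!\left(X \mid \lambda_{\text{UBM}}\right)},
\]

where \(X\) is the set of formant feature vectors extracted from the questioned sample.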

Book ChapterDOI
01 Jan 2008
TL;DR: The SRI-ICSI meeting and lecture recognition system as mentioned in this paper was used in the NIST RT-07 evaluations; improvements made over the last year include updated beamforming software for processing of multiple distant microphones and various adjustments to the speech segmenter for close-talking microphones.
Abstract: We describe the latest version of the SRI-ICSI meeting and lecture recognition system, as was used in the NIST RT-07 evaluations, highlighting improvements made over the last year. Changes in the acoustic preprocessing include updated beamforming software for processing of multiple distant microphones, and various adjustments to the speech segmenter for close-talking microphones. Acoustic models were improved by the combined use of neural-net-estimated phone posterior features, discriminative feature transforms trained with fMPE-MAP, and discriminative Gaussian estimation using MPE-MAP, as well as model adaptation specifically to nonnative and non-American speakers. The net effect of these enhancements was a 14-16% relative error reduction on distant microphones, and a 16-17% error reduction on close-talking microphones. Also, for the first time, we report results on a new "coffee break" meeting genre, and on a new NIST metric designed to evaluate combined speech diarization and recognition.

Proceedings ArticleDOI
01 Mar 2008
TL;DR: Experimental results showed that current standard voice transformation techniques are able to fool the GMM-based system but not the Phonetic speaker identification system, implying that future speaker identification systems should include idiosyncratic knowledge in order to successfully distinguish transformed speech from natural speech and thus be armed against imposter attacks.
Abstract: With the development of voice transformation and speech synthesis technologies, speaker identification systems are likely to face attacks from imposters who use voice transformed or synthesized speech to mimic a particular speaker. Therefore, we investigated in this paper how speaker identification systems perform on voice transformed speech. We conducted experiments with two different approaches, the classical GMM-based speaker identification system and the Phonetic speaker identification system. Our experimental results showed that current standard voice transformation techniques are able to fool the GMM-based system but not the Phonetic speaker identification system. These findings imply that future speaker identification systems should include idiosyncratic knowledge in order to successfully distinguish transformed speech from natural speech and thus be armed against imposter attacks.

01 Jan 2008
TL;DR: In his thesis, Valente showed how the speaker clustering problem could be formulated in a principled way in terms of Bayesian model selection.
Abstract: The speaker diarization problem consists in determining how many speakers there are in a given speech file and in partitioning the speech file into intervals each of which is assigned to one of the speakers. The collection of all intervals assigned to a given speaker is known as a cluster. We assume that the given speech file has already been partitioned into segments, that is, intervals each containing the speech of a single speaker. These segments may be of very short duration and the possibility that the same speaker is talking in two successive segments is not excluded. The problem then is how to cluster the segments so that there is a 1–1 correspondence between speakers and clusters. In his thesis, Valente showed how the speaker clustering problem could be formulated in a principled way in terms of Bayesian model selection [1]. The primary problem, namely determining the number of speakers in the given speech file, can be viewed as one of determining the number of components in a mixture distribution where each mixture component is a speaker model; a Bayesian approach formulates this problem more generally, as one of calculating a posterior probability distribution over the number of mixture components. Similarly, a Bayesian approach to the question of which segments should be assigned to which speakers results in a posterior probability distribution on all possible assignments.
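In equation form, the Bayesian formulation described above places a posterior over the number of speakers \(K\) (generic notation, not Valente's exact formulation):

\[
p(K \mid X) \;\propto\; p(K)\, p(X \mid K)
\;=\; p(K) \int p(X \mid \theta, K)\, p(\theta \mid K)\, d\theta,
\]

with each of the \(K\) mixture components playing the role of a speaker model, and an analogous posterior defined over assignments of segments to speakers.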

Journal ArticleDOI
TL;DR: This paper found that participants recognized the speech acts that speakers performed with their utterances, and that this recognition was automatic and occurred for both written and spoken utterances and for both observers and participants; participants performed either a recognition probe task or a lexical decision task after being exposed to utterances that performed specific speech acts.

Proceedings Article
01 Jan 2008
TL;DR: Experiments using the SRI-FRTIV corpus reveal that “furtive” speech poses a significant challenge, but conversations and interviews, despite stylistic differences, are well matched, and high-effort oration, in contrast to high-effort read speech, shares characteristics with conversational and interview styles.
Abstract: We study the question of how intrinsic variations (associated with the speaker rather than the recording environment) affect text-independent speaker verification performance. Experiments using the SRI-FRTIV corpus, which systematically varies both vocal effort and speaking style, reveal that (1) “furtive” speech poses a significant challenge; (2) conversations and interviews, despite stylistic differences, are well matched; (3) high-effort oration, in contrast to high-effort read speech, shares characteristics with conversational and interview styles; and (4) train/test pairings are generally symmetrical. Implications for further work in the area are discussed.

Journal ArticleDOI
TL;DR: The challenges met while designing a speaker detector for the Microsoft RoundTable distributed meeting device are presented, and a novel boosting-based multimodal speaker detection (BMSD) algorithm is proposed that reduces the error rate of SSL-only approach by 24.6%, and the SSL and MPD fusion approach by 20.9%.
Abstract: Identifying the active speaker in a video of a distributed meeting can be very helpful for remote participants to understand the dynamics of the meeting. A straightforward application of such analysis is to stream a high-resolution video of the speaker to the remote participants. In this paper, we present the challenges we met while designing a speaker detector for the Microsoft RoundTable distributed meeting device, and propose a novel boosting-based multimodal speaker detection (BMSD) algorithm. Instead of separately performing sound source localization (SSL) and multiperson detection (MPD) and subsequently fusing their individual results, the proposed algorithm fuses audio and visual information at feature level by using boosting to select features from a combined pool of both audio and visual features simultaneously. The result is a very accurate speaker detector with extremely high efficiency. In experiments that include hundreds of real-world meetings, the proposed BMSD algorithm reduces the error rate of the SSL-only approach by 24.6%, and that of the SSL and MPD fusion approach by 20.9%. To the best of our knowledge, this is the first real-time multimodal speaker detection algorithm that is deployed in commercial products.
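The central idea of fusing at feature level, letting boosting pick freely from a joint audio-plus-visual pool, can be mimicked with off-the-shelf tools. The toy sketch below uses scikit-learn's AdaBoostClassifier with its default decision-stump learner on synthetic features; it is not the BMSD detector itself.

```python
# Toy feature-level fusion: boosting over one pooled audio+visual feature vector,
# so each selected stump may come from either modality (synthetic data, not BMSD).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
n = 2000
audio = rng.normal(size=(n, 4))      # e.g. sound-source-localization scores per region
visual = rng.normal(size=(n, 6))     # e.g. person-detector / motion features
X = np.hstack([audio, visual])       # combined pool: columns 0-3 audio, 4-9 visual

# Synthetic "active speaker" label driven by one audio and one visual feature.
y = (0.8 * audio[:, 0] + 0.6 * visual[:, 2]
     + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(round(clf.score(X, y), 3))
print(np.argsort(clf.feature_importances_)[-2:])   # expect columns 0 and 6 to dominate
```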

Proceedings ArticleDOI
01 Dec 2008
TL;DR: A branch and bound feature selection algorithm is applied to select a subset of 15 features among the 1379 originally extracted features; the resulting classification results outperform those obtained by state-of-the-art techniques, since perfect classification accuracy is obtained.
Abstract: Gender classification is a challenging problem, which finds applications in speaker indexing, speaker recognition, speaker diarization, annotation and retrieval of multimedia databases, voice synthesis, smart human-computer interaction, biometrics, social robots etc. Although it has been studied for more than thirty years, it is by no means a solved problem. Processing emotional speech in order to identify the speaker's gender makes the problem even more interesting. A large pool of 1379 features is created, including 605 novel features. A branch and bound feature selection algorithm is applied to select a subset of 15 features among the 1379 originally extracted. Support vector machines with various kernels are tested as gender classifiers, when applied to two databases, namely the Berlin database of Emotional Speech and the Danish Emotional Speech database. The reported classification results outperform those obtained by state-of-the-art techniques, since a perfect classification accuracy is obtained.
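A rough analogue of the "select a small subset, then classify with an SVM" recipe is sketched below on synthetic data. scikit-learn has no branch-and-bound selector, so greedy forward selection stands in for it, and the emotional-speech databases and 1379-feature pool are not reproduced.

```python
# Stand-in for the paper's pipeline: feature selection followed by an RBF-SVM
# gender classifier. Forward selection replaces branch-and-bound; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)                  # toy acoustic feature pool
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

selector = SequentialFeatureSelector(svm, n_features_to_select=15, cv=3)
X_sel = selector.fit_transform(X, y)                         # keep 15 of 40 features

print(cross_val_score(svm, X_sel, y, cv=5).mean())           # accuracy on the kept subset
```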

Proceedings ArticleDOI
30 Dec 2008
TL;DR: This paper explored a cross-lingual speaker adaptation technique for HMM-based speech synthesis, where a source voice model for English is transformed into a target speaker model using Mandarin Chinese speech data from the target speaker.
Abstract: This paper explores a cross-lingual speaker adaptation technique for HMM-based speech synthesis, where a source voice model for English is transformed into a target speaker model using Mandarin Chinese speech data from the target speaker. A phone mapping-based method is adopted to map Chinese Initial/Finals into English phonemes and two types of mapping rules, including one-to-one and one-to-sequence mappings, are compared. In order to avoid having to map prosodic features between languages, the adaptation procedure uses regression classes and transforms that are constructed for triphone models, then used to adapt the phonetic-and-prosodic-context-dependent models. From the experimental results, we found that a one-to-sequence phone mapping is better than a one-to-one mapping, and that the similarity between adapted English speech and the target Chinese speaker is reasonable.
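To make the one-to-one versus one-to-sequence distinction concrete, here is a tiny illustrative mapping; the table entries are invented for illustration and are not the paper's actual rules.

```python
# Hypothetical Mandarin Initial/Final -> English phoneme tables, invented for
# illustration only (the paper's actual mapping rules are not reproduced).
one_to_one = {"b": ["B"], "ang": ["AE"], "sh": ["SH"]}
one_to_sequence = {"b": ["B"], "ang": ["AE", "NG"], "sh": ["SH"]}

def map_phones(units, table):
    """Flatten a sequence of Mandarin units into English phonemes."""
    return [p for unit in units for p in table.get(unit, ["<unk>"])]

print(map_phones(["b", "ang"], one_to_one))        # ['B', 'AE']
print(map_phones(["b", "ang"], one_to_sequence))   # ['B', 'AE', 'NG']
```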

Patent
Sudhir Raman Ahuja, Jingdong Chen, Yiteng Arden Huang, Dong Liu, Qiru Zhou
03 Mar 2008
TL;DR: In this paper, a method and apparatus for active speaker selection in teleconferencing applications illustratively comprises a microphone array module, a speaker recognition system, a user interface, and a speech signal selection module.
Abstract: A method and apparatus for performing active speaker selection in teleconferencing applications illustratively comprises a microphone array module, a speaker recognition system, a user interface, and a speech signal selection module. The microphone array module separates the speech signal from each active speaker from those of other active speakers, providing a plurality of individual speaker's speech signals. The speaker recognition system identifies each currently active speaker using conventional speaker recognition/identification techniques. These identities are then transmitted to a remote teleconferencing location for display to remote participants via a user interface. The remote participants may then select one of the identified speakers, and the speech signal selection module then selects for transmission the speech signal associated with the selected identified speaker, thereby enabling the participants at the remote location to listen to the selected speaker and neglect the speech from other active speakers.

01 Jan 2008
TL;DR: In this article, the authors examined combining both relevance MAP and subspace speaker adaptation processes to train speaker models for use in speaker verification systems with a particular focus on short utterance lengths.
Abstract: This paper examines combining both relevance MAP and subspace speaker adaptation processes to train GMM speaker models for use in speaker verification systems with a particular focus on short utterance lengths. The subspace speaker adaptation method involves developing a speaker GMM mean supervector as the sum of a speaker-independent prior distribution and a speaker-dependent offset constrained to lie within a low-rank subspace, and has been shown to provide improvements in accuracy over ordinary relevance MAP when the amount of training data is limited. It is shown through testing on NIST SRE data that combining the two processes provides speaker models which lead to modest improvements in verification accuracy for limited data situations, in addition to improving the performance of the speaker verification system when a larger amount of training data is available.
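Written as supervectors (generic notation, not necessarily the authors'), the combination examined here adds a full-rank relevance-MAP offset to the constrained subspace offset:

\[
\boldsymbol{\mu}_s \;=\; \mathbf{m} + \mathbf{V}\,\mathbf{y}(s) + \mathbf{D}\,\mathbf{z}(s),
\]

where \(\mathbf{m}\) is the speaker-independent prior (UBM) supervector, \(\mathbf{V}\mathbf{y}(s)\) is the speaker offset constrained to a low-rank subspace, and \(\mathbf{D}\mathbf{z}(s)\) is the diagonal relevance-MAP term.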