scispace - formally typeset

Showing papers on "Speaker diarisation published in 1997"


Journal ArticleDOI
01 Sep 1997
TL;DR: A tutorial on the design and development of automatic speaker-recognition systems is presented and a new automatic speaker-recognition system is given that performs with 98.9% correct identification.
Abstract: A tutorial on the design and development of automatic speaker-recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: to identify a particular person or to verify a person's claimed identity. Speech processing and the basic components of automatic speaker-recognition systems are shown and design tradeoffs are discussed. Then, a new automatic speaker-recognition system is given. This recognizer performs with 98.9% correct identification. Last, the performances of various systems are compared.

1,686 citations


PatentDOI
Dimitri Kanevsky, Stephane H. Maes
TL;DR: In this article, a method and apparatus for securing access to a service or facility employing automatic speech recognition, text-independent speaker identification, natural language understanding techniques and additional dynamic and static features is presented.
Abstract: A method and apparatus for securing access to a service or facility employing automatic speech recognition, text-independent speaker identification, natural language understanding techniques and additional dynamic and static features. The method includes the steps of receiving and decoding speech containing indicia of the speaker such as a name, address or customer number; accessing a database containing information on candidate speakers; questioning the speaker based on the information; receiving, decoding and verifying an answer to the question; obtaining a voice sample of the speaker and verifying the voice sample against a model; generating a score based on the answer and the voice sample; and granting access if the score is equal to or greater than a threshold. Alternatively, the method includes the steps of receiving and decoding speech containing indicia of the speaker; generating a sub-list of speaker candidates having indicia substantially matching the speaker; activating databases containing information about the speaker candidates in the sub-list; performing voice classification analysis; eliminating speaker candidates based on the voice classification analysis; questioning the speaker regarding the information; eliminating speaker candidates based on the answer; and iteratively repeating the prior steps until either one speaker candidate remains (in which case the speaker is granted access) or no speaker candidate remains (in which case the speaker is denied access).

474 citations


Book
01 Jan 1997
TL;DR: In this article, the authors discuss the nature of perceptual adjustment to voice, listening to voices in speech perception using an episodic lexicon, and speaker adaptation approaches for articulatory recovery and adaptation.
Abstract: Complex representations used in speech processing - overview of the book; some thoughts on "normalization" in speech perception; words and voices - perception and production in an episodic lexicon; on the nature of perceptual adjustment to voice; listening to voices - theory and practice in voice perception research; talker normalization - phonetic constancy as a cognitive process; normalization of vowels by breath sounds; speech perception without speaker normalization - an exemplar model; speaker modeling for speaker adaptation in automatic speech recognition; overcoming speaker variability in automatic speech recognition - the speaker adaptation approach; vocal tract normalization for articulatory recovery and adaptation.

323 citations


Patent
Stephane H. Maes
TL;DR: In this paper, feature vectors representing each of a plurality of overlapping frames of an arbitrary, text independent speech signal are computed and compared to vector parameters and variances stored as codewords in one or more codebooks corresponding to each of enrolled users to provide speaker dependent information for speech recognition and ambiguity resolution.
Abstract: Feature vectors representing each of a plurality of overlapping frames of an arbitrary, text independent speech signal are computed and compared to vector parameters and variances stored as codewords in one or more codebooks corresponding to each of one or more enrolled users to provide speaker dependent information for speech recognition and/or ambiguity resolution. Other information such as aliases and preferences of each enrolled user may also be enrolled and stored, for example, in a database. Correspondence of the feature vectors may be ranked by closeness of correspondence to a codeword entry and the number of frames corresponding to each codebook are accumulated or counted to identify a potential enrolled speaker. The differences between the parameters of the feature vectors and codewords in the codebooks can be used to identify a new speaker and an enrollment procedure can be initiated. Continuous authorization and access control can be carried out based on any utterance either by verification of the authorization of a speaker of a recognized command or comparison with authorized commands for the recognized speaker. Text independence also permits coherence checks to be carried out for commands to validate the recognition process.
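The frame-counting comparison described above can be illustrated with a minimal sketch (the speaker names, 2-D feature vectors, and plain Euclidean distance below are hypothetical simplifications; the actual codebooks also store variances and rank correspondences by closeness):

```python
import numpy as np

def identify_speaker(frames, codebooks):
    """For each frame, find the enrolled speaker whose nearest codeword
    is closest, and accumulate a count of frames per speaker; the
    speaker claiming the most frames is the candidate identification."""
    counts = {name: 0 for name in codebooks}
    for frame in frames:
        best_name = min(
            codebooks,
            key=lambda name: min(
                np.linalg.norm(frame - cw) for cw in codebooks[name]
            ),
        )
        counts[best_name] += 1
    return max(counts, key=counts.get), counts

# Hypothetical enrolled codebooks of 2-D cepstral-like codewords.
codebooks = {
    "alice": [np.array([0.0, 0.0]), np.array([1.0, 1.0])],
    "bob":   [np.array([5.0, 5.0]), np.array([6.0, 6.0])],
}
frames = [np.array([0.1, 0.2]), np.array([0.9, 1.1]), np.array([5.2, 5.1])]
winner, counts = identify_speaker(frames, codebooks)  # winner: "alice"
```

A large distance between all frames and every enrolled codebook would, in the patent's scheme, signal a new speaker and trigger enrollment.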

161 citations


Patent
30 Apr 1997
TL;DR: In this article, a technique for the generation of garbage models from the very same data used to generate speaker dependent speech recognition models, e.g., word models, is described.
Abstract: Methods and apparatus for the generation of speaker dependent garbage models from the very same data used to generate speaker dependent speech recognition models, e.g., word models, are described. The technique involves processing the data included in the speaker dependent speech recognition models to create one or more speaker dependent garbage models. The speaker dependent garbage model generation technique involves what may be described as distorting or morphing of a speaker dependent speech recognition model to generate a speaker dependent garbage model therefrom. One or more speaker dependent speech recognition models may then be combined with the generated speaker dependent garbage model to produce an updated garbage model. The scoring of speaker dependent garbage models is varied in accordance with the present invention as a function of the number of speech recognition models from which the speaker dependent garbage model was created. In one embodiment, the number of speaker dependent speech recognition models which are used in generating a speaker dependent garbage model is limited to a preselected maximum number which is empirically determined.

156 citations


PatentDOI
TL;DR: In systems where both speaker independent and speaker dependent speech recognition operations are performed independently, in parallel, one or more speaker independent models of words or phrases which are to be recognized by the speaker independent speech recognizer are included as garbage (OOV) models in the speaker dependent speech recognizer.
Abstract: Methods and apparatus for generating and using both speaker dependent and speaker independent garbage models in speaker dependent speech recognition applications are described. The present invention recognizes that in some speech recognition systems, e.g., systems where multiple speech recognition operations are performed on the same signal, it may be desirable to recognize and treat words or phrases in one part of the speech recognition system as garbage or out of vocabulary utterances with the understanding that the very same words or phrases will be recognized and treated as in-vocabulary by another portion of the system. In accordance with the present invention, in systems where both speaker independent and speaker dependent speech recognition operations are performed independently, e.g., in parallel, one or more speaker independent models of words or phrases which are to be recognized by the speaker independent speech recognizer are included as garbage (OOV) models in the speaker dependent speech recognizer. This reduces the risk of obtaining conflicting speech recognition results from the speaker independent and speaker dependent speech recognition circuits. The present invention also provides for the generation of speaker dependent garbage models from the very same data used to generate speaker dependent speech recognition models, e.g., word models. The technique involves processing the data included in the speaker dependent speech recognition models to create one or more speaker dependent garbage models.

139 citations


Proceedings ArticleDOI
21 Apr 1997
TL;DR: Two corpora collected at Lincoln Laboratory for the study of handset transducer effects on the speech signal are described: the handset TIMIT (HTIMIT) corpus and the Lincoln Laboratory Handset Database (LLHDB).
Abstract: This paper describes two corpora collected at Lincoln Laboratory for the study of handset transducer effects on the speech signal: the handset TIMIT (HTIMIT) corpus and the Lincoln Laboratory Handset Database (LLHDB). The goal of these corpora is to minimize all confounding factors and produce speech predominantly differing only in handset transducer effects. The speech is recorded directly from a telephone unit in a sound-booth using prompted text and extemporaneous photograph descriptions. The two corpora allow comparison of speech collected from a person speaking into a handset (LLHDB) versus speech played through a loudspeaker into a handset (HTIMIT). A comparison of analysis and results between the two corpora addresses the realism of artificially creating handset-degraded speech by playing recorded speech through the handsets. The corpora are designed primarily for speaker recognition experimentation (in terms of amount of speech and level of transcription), but since both speaker and speech recognition systems operate on the same acoustic features affected by the handset, the knowledge gleaned is directly transferable to speech recognizers. Initial speaker identification results on these corpora are presented. In addition, the application of HTIMIT in developing a handset detector that was successfully used on a Switchboard speaker verification task is described.

137 citations


Patent
01 Aug 1997
TL;DR: In this article, a call-placement system for telephone services in response to speech is described, which allows a customer to place a call by speaking a person's name which serves as a destination identifier without having to speak an additional command or steering word.
Abstract: Methods and apparatus for activating telephone services in response to speech are described. A directory including names is maintained for each customer. A speaker dependent speech template and a telephone number for each name, is maintained as part of each customer's directory. Speaker independent speech templates are used for recognizing commands. The present invention has the advantage of permitting a customer to place a call by speaking a person's name which serves as a destination identifier without having to speak an additional command or steering word to place the call. This is achieved by treating the receipt of a spoken name in the absence of a command as an implicit command to place a call. Explicit speaker independent commands are used to invoke features or services other than call placement. Speaker independent and speaker dependent speech recognition are performed on a customer's speech in parallel. An arbiter is used to decide which function or service should be performed when an apparent conflict arises as a result of both the speaker dependent and speaker independent speech recognition step outputs. Stochastic grammars, word spotting and/or out-of-vocabulary rejection are used as part of the speech recognition process to provide a user friendly interface which permits the use of spontaneous speech. Voice verification is performed on a selective basis where security is of concern.

110 citations


Proceedings ArticleDOI
21 Apr 1997
TL;DR: Experimental results in the context of batch supervised adaptation demonstrate the effectiveness of the proposed speaker adaptive training method in large vocabulary speech recognition tasks and show that significant reductions in word error rate can be achieved over the common pooled speaker-independent paradigm.
Abstract: This paper describes the speaker adaptive training (SAT) approach for speaker independent (SI) speech recognizers as a method for joint speaker normalization and estimation of the parameters of the SI acoustic models. In SAT, speaker characteristics are modeled explicitly as linear transformations of the SI acoustic parameters. The effect of inter-speaker variability in the training data is reduced, leading to parsimonious acoustic models that represent more accurately the phonetically relevant information of the speech signal. The proposed training method is applied to the Wall Street Journal (WSJ) corpus that consists of multiple training speakers. Experimental results in the context of batch supervised adaptation demonstrate the effectiveness of the proposed method in large vocabulary speech recognition tasks and show that significant reductions in word error rate can be achieved over the common pooled speaker-independent paradigm.

108 citations


Patent
Stephane H. Maes
28 Jan 1997
TL;DR: In this article, a consistency check in the form of a decision tree is provided to accelerate the speaker recognition process and increase the accuracy of the system's recognition. But the consistency check is only applied to the speaker-independent recognition model.
Abstract: Speaker recognition is attempted on input speech signals concurrently with provision of input speech signals to a speech recognition system. If a speaker is recognized, a speaker dependent model which has been trained on an enrolled speaker is supplied to the speech recognition system. If not recognized, then a speaker-independent recognition model is used or, alternatively, the new speaker is enrolled. Other speaker specific information such as a special language model, grammar, vocabulary, a dictionary, a list of names, a language and speaker dependent preferences can also be provided to improve the speech recognition function or even configure or customize the speech recognition system or the response of any system such as a computer or network controlled in response thereto. A consistency check in the form of a decision tree is preferably provided to accelerate the speaker recognition process and increase the accuracy thereof. Further training of a model and/or enrollment of additional speakers may be initiated upon completion of speaker recognition and/or adaptively upon each speaker utterance.

106 citations


PatentDOI
TL;DR: In this paper, a speaker class processing model which is speaker independent within the class may be trained on one or more members of the class and selected for implementation in a speech recognition processor in accordance with the speaker class recognized to further improve speech recognition to level comparable to that of a speaker dependent model.
Abstract: Clusters of quantized feature vectors are processed against each other using a threshold distance value to cluster mean values of sets of parameters contained in speaker specific codebooks to form classes of speakers against which feature vectors computed from an arbitrary input speech signal can be compared to identify a speaker class. The number of codebooks considered in the comparison may thus be reduced to limit mixture elements which engender ambiguity and reduce system response speed when the speaker population becomes large. A speaker class processing model which is speaker independent within the class may be trained on one or more members of the class and selected for implementation in a speech recognition processor in accordance with the speaker class recognized, to further improve speech recognition to a level comparable to that of a speaker dependent model. Formation of speaker classes can be supervised by identification of groups of speakers to be included in the class, and the speaker class dependent model trained on members of a respective group.

Patent
21 Feb 1997
TL;DR: In this article, a speech model is produced for use in determining whether a speaker associated with the speech model produced an unidentified speech sample, without using an external mechanism to monitor the accuracy with which the contents were identified.
Abstract: A speech model is produced for use in determining whether a speaker associated with the speech model produced an unidentified speech sample. First, a sample of speech of a particular speaker is obtained. Next, the contents of the sample of speech are identified using speech recognition. Finally, a speech model associated with the particular speaker is produced using the sample of speech and the identified contents thereof. The speech model is produced without using an external mechanism to monitor the accuracy with which the contents were identified.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: Results on the 1996 NIST Speaker Recognition Evaluation corpus show that using handset-matched background models reduces false acceptances by more than 60% over previously reported (handset-independent) approaches.
Abstract: This paper studies the effects of handset distortion on telephone-based speaker recognition performance, resulting in the following observations: (1) the major factor in speaker recognition errors is whether the handset type (e.g., electret, carbon) is different across training and testing, not whether the telephone lines are mismatched, (2) the distribution of speaker recognition scores for true speakers is bimodal, with one mode dominated by matched handset tests and the other by mismatched handsets, (3) cohort-based normalization methods derive much of their performance gains from implicitly selecting cohorts trained with the same handset type as the claimant, and (4) utilizing a handset-dependent background model which is matched to the handset type of the claimant's training data sharpens and separates the true and false speaker score distributions. Results on the 1996 NIST Speaker Recognition Evaluation corpus show that using handset-matched background models reduces false acceptances (at a 10% miss rate) by more than 60% over previously reported (handset-independent) approaches.

Patent
17 Nov 1997
TL;DR: In this article, a system for establishing an identity of a speaker including a computerized system which includes at least two voice authentication algorithms is presented, each of which is different from one another and serves for independently analyzing a voice of the speaker for obtaining an independent positive or negative authentication of the voice by each of the algorithms.
Abstract: A system for establishing an identity of a speaker including a computerized system which includes at least two voice authentication algorithms. Each of the at least two voice authentication algorithms is different from the others and serves for independently analyzing a voice of the speaker to obtain an independent positive or negative authentication of the voice by each of the algorithms. If every one of the algorithms provides positive authentication, the speaker is positively identified, whereas if at least one of the algorithms provides negative authentication, the speaker is negatively identified.
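The unanimous-decision rule described above reduces to a logical AND over the independent algorithms. A minimal sketch, with two hypothetical threshold-based stand-ins for real voice authentication algorithms:

```python
def authenticate(sample, algorithms):
    """Positively identify the speaker only if every independent voice
    authentication algorithm returns a positive result; any single
    negative result yields a negative identification."""
    return all(alg(sample) for alg in algorithms)

# Hypothetical stand-ins for two different authentication algorithms,
# each thresholding its own score on the analyzed voice sample.
alg_a = lambda s: s["score_a"] > 0.8
alg_b = lambda s: s["score_b"] > 0.7

accepted = authenticate({"score_a": 0.9, "score_b": 0.75}, [alg_a, alg_b])
rejected = authenticate({"score_a": 0.9, "score_b": 0.50}, [alg_a, alg_b])
```

Requiring unanimity trades some false rejections for a lower false-acceptance rate, which is the design goal of combining dissimilar authentication algorithms.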


01 Jan 1997
TL;DR: Development of robust and speaker independent algorithms for mouth location and lip contour extraction is necessary in order to obtain informative features about visual speech (visual front end) and this approach is described.
Abstract: This paper describes the audio-visual database collected at AT&T Labs-Research for the study of bimodal speech recognition. To date, this database consists of two multiple-speaker parts, namely isolated confusable words and connected letters, thus allowing the study of some popular and relatively simple speaker independent audio-visual recognition tasks. In addition, a single-speaker connected digits database is collected to facilitate speedy development and testing of various algorithms. Intentionally, no lip markings are used on the subjects during data collection. Development of robust and speaker independent algorithms for mouth location and lip contour extraction is thus necessary in order to obtain informative features about visual speech (the visual front end). We describe our approach to this problem, and we report our automatic speech-reading and audio-visual speech recognition results on the single-speaker connected digits task.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: It is shown that significant advantage can be gained by performing frequency warping and ML speaker adaptation in a unified framework and a procedure is described which compensates utterances by simultaneously scaling the frequency axis and reshaping the spectral energy contour.
Abstract: Frequency warping approaches to speaker normalization have been proposed and evaluated on various speech recognition tasks. These techniques have been found to significantly improve performance even for speaker independent recognition from short utterances over the telephone network. In maximum likelihood (ML) based model adaptation a linear transformation is estimated and applied to the model parameters in order to increase the likelihood of the input utterance. The purpose of this paper is to demonstrate that significant advantage can be gained by performing frequency warping and ML speaker adaptation in a unified framework. A procedure is described which compensates utterances by simultaneously scaling the frequency axis and reshaping the spectral energy contour. This procedure is shown to reduce the error rate in a telephone based connected digit recognition task by 30-40%.
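The frequency-axis scaling half of this idea can be sketched as a piecewise-linear warp (the breakpoint fraction and bandwidth below are hypothetical illustration values; the paper's actual procedure additionally reshapes the spectral energy contour under a maximum-likelihood criterion):

```python
def warp_frequency(f, alpha, f_max=4000.0, break_frac=0.875):
    """Piecewise-linear frequency warp: scale frequencies by alpha up
    to a breakpoint, then interpolate linearly so that f_max maps to
    f_max, keeping the warped axis inside the analysis bandwidth."""
    fb = break_frac * f_max
    if f <= fb:
        return alpha * f
    # Linear segment from (fb, alpha * fb) up to (f_max, f_max).
    return alpha * fb + (f - fb) * (f_max - alpha * fb) / (f_max - fb)

low = warp_frequency(1000.0, 1.1)   # plain scaling below the breakpoint
edge = warp_frequency(4000.0, 1.1)  # the band edge is preserved
```

In a recognizer, such a warp would be applied to the filterbank center frequencies per speaker, with alpha chosen to maximize the likelihood of the utterance under the acoustic model.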

Patent
29 Dec 1997
TL;DR: In this article, a method and system of adapting speech recognition models to a speaker environment may comprise receiving a spoken password and getting a set of speaker independent (SI) speech recognition model.
Abstract: The method and system of adapting speech recognition models to a speaker environment may comprise receiving a spoken password (52) and getting a set of speaker independent (SI) speech recognition models (54). A mapping sequence may be determined for the spoken password (56). Using the mapping sequence, a speaker ID may be identified (58). A transform may be determined (66) between the SI speech recognition models and the spoken password using the mapping sequence. Speaker adapted (SA) speech recognition models may be generated (68) by applying the transform to SI speech recognition models. A speech input may be recognized (70) by applying the SA speech recognition models.

Patent
Stephane H. Maes
06 May 1997
TL;DR: In this article, fast and detailed match techniques for speaker recognition are combined into a hybrid system in which speakers are associated in groups when potential confusion is detected between a speaker being enrolled and a previously enrolled speaker.
Abstract: Fast and detailed match techniques for speaker recognition are combined into a hybrid system in which speakers are associated in groups when potential confusion is detected between a speaker being enrolled and a previously enrolled speaker. Thus the detailed match techniques are invoked only at the potential onset of saturation of the fast match technique while the detailed match is facilitated by limitation of comparisons to the group and the development of speaker-dependent models which principally function to distinguish between members of a group rather than to more fully characterize each speaker. Thus storage and computational requirements are limited and fast and accurate speaker recognition can be extended over populations of speakers which would degrade or saturate fast match systems and degrade performance of detailed match systems.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: An audio retrieval system which lets Internet users efficiently access a large audio database containing recordings of the proceedings of the United States House of Representatives using a novel method based on speaker identification that has been successfully integrated into a World Wide Web based search and browse system.
Abstract: We report on an audio retrieval system which lets Internet users efficiently access a large audio database containing recordings of the proceedings of the United States House of Representatives. The audio has been temporally aligned to text transcripts of the proceedings (which are manually generated by the US Government) using a novel method based on speaker identification. Speaker sequence and approximate timing information is extracted from the text transcript and used to constrain a Viterbi alignment of speaker models to the observed audio. Speakers are modeled by computing Gaussian statistics of cepstral coefficients extracted from samples of each person's speech. The speaker identification is used to locate speaker transition points in the audio which are then linked to corresponding speaker transitions in the text transcript. The alignment system has been successfully integrated into a World Wide Web based search and browse system as an experimental service on the Internet.
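In its simplest form, modeling each speaker by Gaussian statistics of cepstral coefficients amounts to scoring observed frames by log-likelihood and picking the best-matching speaker per segment. A toy sketch with hypothetical 1-D "cepstral" values (the actual system uses multivariate statistics and embeds these scores in a transcript-constrained Viterbi alignment):

```python
import math

def gaussian_loglik(frames, mean, var):
    """Log-likelihood of 1-D cepstral-like frames under a Gaussian
    speaker model with the given mean and variance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )

# Two hypothetical speaker models (mean, variance) estimated from
# samples of each person's speech, and one observed audio segment.
models = {"rep_a": (0.0, 1.0), "rep_b": (3.0, 1.0)}
segment = [2.7, 3.1, 2.9]
best = max(models, key=lambda m: gaussian_loglik(segment, *models[m]))
```

The Viterbi alignment then constrains which speaker may follow which, using the sequence and approximate timing extracted from the transcript, so isolated misclassifications like this per-segment decision are smoothed out.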


PatentDOI
TL;DR: The voice print system of the present invention is a subword-based, text-dependent automatic speaker verification system that embodies the capability of user-selectable passwords with no constraints on the choice of vocabulary words or the language.
Abstract: The voice print system of the present invention is a subword-based, text-dependent automatic speaker verification system that embodies the capability of user-selectable passwords with no constraints on the choice of vocabulary words or the language. Automatic blind speech segmentation allows speech to be segmented into subword units without any linguistic knowledge of the password. Subword modeling is performed using multiple classifiers. The system also takes advantage of such concepts as multiple classifier fusion and data resampling to successfully boost performance. Key word/key phrase spotting is used to optimally locate the password phrase. Numerous adaptation techniques increase the flexibility of the base system, and include: channel adaptation, fusion adaptation, model adaptation and threshold adaptation.


Proceedings Article
01 Jan 1997
TL;DR: The extent to which the use of formant frequencies can improve recognition accuracy and reduce computational complexity for speaker normalization algorithms using frequency warping is studied.
Abstract: Speaker-dependent automatic speech recognition systems are known to outperform speaker-independent systems when enough training data are available to model acoustical variability among speakers. Speaker normalization techniques modify the spectral representation of incoming speech waveforms in an attempt to reduce variability between speakers. Recent successful speaker normalization algorithms have incorporated a speaker-specific frequency warping to the initial signal processing stages. These algorithms, however, do not make extensive use of acoustic features contained in the incoming speech. In this paper we study the possible benefits of the use of acoustic features in speaker normalization algorithms using frequency warping. We study the extent to which the use of such features, including specifically the use of formant frequencies, can improve recognition accuracy and reduce computational complexity for speaker normalization. We examine the characteristics and limitations of several types of feature sets and warping functions as we compare their performance relative to existing algorithms.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: The presented paper is interested in a speaker identification problem where the attributes representing the voice of a particular speaker are obtained from very short segments of the speech waveform corresponding only to one pitch period of vowels.
Abstract: This paper is concerned with a speaker identification problem. The attributes representing the voice of a particular speaker are obtained from very short segments of the speech waveform, corresponding to only one pitch period of vowels. The patterns formed from the samples of a pitch-period waveform are either matched in the time domain by use of a nonlinear time-warping method, known as dynamic time warping (DTW), or they are converted into cepstral coefficients and compared using the cepstral distance measure. Since an uttered speech signal usually contains many vowels, techniques combining various classifiers and multiple classifier outputs are considered in the decision-making process. Experiments performed for a hundred speakers are described.
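The time-domain matching can be sketched with the standard DTW recursion (toy 1-D sequences here; real patterns would be pitch-period waveform samples or cepstral vectors with a suitable distance):

```python
import math

def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Accumulated cost of the best nonlinear time alignment between
    sequences a and b (classic dynamic time warping recursion)."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Best of insertion, deletion, and match predecessors.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

same = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # stretching in time is free
diff = dtw_distance([1, 2, 3], [2, 3, 4])
```

Because DTW absorbs differences in duration, two pitch-period patterns from the same speaker score near zero even when uttered at different rates.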

Proceedings ArticleDOI
TL;DR: This paper describes a three-stage processing system consisting of a shot boundary detection stage, an audio classification stage, and a speaker identification stage to determine the presence of different actors in isolated shots to show the efficacy of speaker identification for labeling video clips in terms of persons present in them.
Abstract: Video content characterization is a challenging problem in video databases. The aim of such characterization is to generate indices that can describe a video clip in terms of objects and their actions in the clip. Generally, such indices are extracted by performing image analysis on the video clips. Many such indices can also be generated by analyzing the embedded audio information of video clips. Indices pertaining to context, scene emotion, and actors or characters present in a video clip appear especially suitable for generation via audio analysis techniques of keyword spotting, and speech and speaker recognition. In this paper, we examine the potential of speaker identification techniques for characterizing video clips in terms of actors present in them. We describe a three-stage processing system consisting of a shot boundary detection stage, an audio classification stage, and a speaker identification stage to determine the presence of different actors in isolated shots. Experimental results using the movie A Few Good Men are presented to show the efficacy of speaker identification for labeling video clips in terms of persons present in them.

Book ChapterDOI
12 Mar 1997
TL;DR: The selection of the most critical subbands for the speaker recognition task and the choice of an optimal division of the frequency domain are discussed.
Abstract: This paper presents a new method for automatic speaker recognition. The principle is to split the whole spectral domain into partial frequency subbands on which recognizers are independently applied and then recombined to yield a global recognition decision. In this article, we particularly discuss the selection of the most critical subbands for the speaker recognition task and the choice of an optimal division of the frequency domain.
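The split-and-recombine principle can be sketched as weighted fusion of per-subband recognizer scores (the scores and weights below are hypothetical; the paper's contribution lies in selecting the most critical subbands and the division of the frequency domain):

```python
def global_score(subband_scores, weights):
    """Recombine independent per-subband recognizer scores into one
    global score via a weighted average, so that the most
    speaker-discriminant subbands dominate the decision."""
    total = sum(w * s for w, s in zip(weights, subband_scores))
    return total / sum(weights)

def decide(scores_by_speaker, weights):
    """Global recognition decision: the speaker with the best
    recombined score across all subbands."""
    return max(scores_by_speaker,
               key=lambda spk: global_score(scores_by_speaker[spk], weights))

# Hypothetical per-subband scores for two speakers over four subbands,
# with higher weights on the two more discriminant bands.
scores = {"s1": [0.9, 0.4, 0.8, 0.3], "s2": [0.5, 0.6, 0.5, 0.9]}
weights = [2.0, 1.0, 2.0, 1.0]
winner = decide(scores, weights)
```

A practical advantage of this structure is robustness: a subband corrupted by narrowband noise can be down-weighted or dropped without retraining the remaining subband recognizers.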

Patent
23 Dec 1997
TL;DR: In this article, an adaptive speaker compensation system and method, such as for use in a multimedia computer, stores speaker response filter coefficients for each speaker and adaptively compensates received audio for non-linear speaker characteristics based on the stored speaker response filters coefficients.
Abstract: An adaptive speaker compensation system and method, such as for use in a multimedia computer, stores speaker response filter coefficients for each speaker and adaptively compensates received audio for non-linear speaker characteristics based on the stored speaker response filter coefficients. The speaker response filter coefficients preferably represent an inverse response of a speaker response curve for each speaker in the audio system. Preferably a library memory containing prestored speaker characteristic data, such as the speaker response filter coefficients, is selectively accessed by the adaptive speaker compensation system to download the speaker response filter coefficients based on identification of a speaker type and channel for which the speaker is being used.

Patent
18 Apr 1997
TL;DR: In this paper, a normalizing model is matched to a source model based, or dependent, upon an acoustic input device whose transfer characteristics color acoustic characteristics of a source as represented in the source model.
Abstract: Adverse effects of type mismatch between acoustic input devices used during testing and during training in machine-based recognition of the source of acoustic phenomena are minimized. A normalizing model is matched to a source model based, or dependent, upon an acoustic input device whose transfer characteristics color acoustic characteristics of a source as represented in the source model. An application of the present invention is to speaker recognition, i.e., recognition of the identity of a speaker by the speaker's voice.

Patent
TL;DR: In this paper, speech signals from speakers having known identities are used to create sets of acoustic models along with their corresponding identities are stored in a memory, and a plurality of sets of cohort models that characterize the speech signals are selected from the stored sets of models, and linked to the set of models of each identified speaker.
Abstract: Speech signals from speakers having known identities are used to create sets of acoustic models. The acoustic models, along with their corresponding identities, are stored in a memory. A plurality of sets of cohort models that characterize the speech signals are selected from the stored sets of acoustic models and linked to the set of acoustic models of each identified speaker. During a testing session, speech signals produced by an unknown speaker having a claimed identity are processed to generate processed speech signals. The processed speech signals are compared to the set of models of the claimed speaker to produce first scores. The processed speech signals are also compared to the sets of cohort models to produce second scores. A subset of scores is dynamically selected from the second scores according to a predetermined criterion. The unknown speaker is validated as the claimed speaker if the difference between the first scores and a combination of the subset of scores is greater than a predetermined threshold value.
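The validation rule can be sketched directly (toy scores; in a real system the scores would come from acoustic model likelihoods, and the choice of combination, selection criterion, and threshold are design parameters):

```python
def verify(claimed_score, cohort_scores, threshold, top_k=3):
    """Accept the claimed identity if the claimed-speaker score exceeds
    the mean of the top_k dynamically selected cohort scores by more
    than the threshold; otherwise reject."""
    best = sorted(cohort_scores, reverse=True)[:top_k]
    return (claimed_score - sum(best) / len(best)) > threshold

# Hypothetical scores: the claimed speaker's model vs. four cohorts.
accepted = verify(10.0, [5.0, 6.0, 7.0, 1.0], threshold=2.0)
impostor = verify(7.0, [5.0, 6.0, 7.0, 1.0], threshold=2.0)
```

Subtracting a cohort-based score normalizes away channel and utterance effects that raise or lower all model scores together, so the threshold separates true claimants from impostors more reliably than an absolute score would.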