
Showing papers on "Speaker recognition published in 2000"


Journal ArticleDOI
TL;DR: The major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs) are described.

4,673 citations
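
As a rough illustration of the GMM-based verification idea summarized above (a minimal sketch, not the Lincoln Laboratory system; the model sizes, feature extraction, and UBM/MAP details are omitted, and the feature array names are hypothetical), one can train Gaussian mixtures with scikit-learn and score a test utterance by its average log-likelihood ratio:

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=64):
    # features: (n_frames, n_dims) cepstral vectors extracted elsewhere
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           max_iter=200, random_state=0).fit(features)

def verification_score(test_features, speaker_gmm, ubm):
    # Average per-frame log-likelihood ratio of speaker model vs. background model
    return float(np.mean(speaker_gmm.score_samples(test_features)
                         - ubm.score_samples(test_features)))

# Hypothetical usage:
# ubm = train_gmm(background_features)
# spk = train_gmm(enrollment_features)
# accept = verification_score(test_features, spk, ubm) > threshold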


Proceedings Article
01 Jan 2000
TL;DR: A database designed to evaluate the performance of speech recognition algorithms in noisy conditions and recognition results are presented for the first standard DSR feature extraction scheme that is based on a cepstral analysis.
Abstract: This paper describes a database designed to evaluate the performance of speech recognition algorithms in noisy conditions. The database may be used either for the evaluation of front-end feature extraction algorithms using a defined HMM recognition back-end or for complete recognition systems. The source speech for this database is TIdigits, a connected-digits task spoken by American English talkers (downsampled to 8 kHz). A selection of 8 different real-world noises has been added to the speech over a range of signal-to-noise ratios, and special care has been taken to control the filtering of both the speech and the noise. The framework was prepared as a contribution to the ETSI STQ-AURORA DSR Working Group [1]. Aurora is developing standards for Distributed Speech Recognition (DSR), where the speech analysis is done in the telecommunication terminal and the recognition at a central location in the telecom network. The framework is currently being used to evaluate alternative proposals for front-end feature extraction. The database has been made publicly available through ELRA so that other speech researchers can evaluate and compare the performance of noise-robust algorithms. Recognition results are presented for the first standard DSR feature extraction scheme, which is based on a cepstral analysis.

1,909 citations
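
The database described above adds real-world noises to clean digit utterances at controlled signal-to-noise ratios. A minimal sketch of that mixing step (not the Aurora tooling; the signal names and the noise-length handling are assumptions) could look like this:

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Tile or truncate the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# noisy = mix_at_snr(clean_digits_8khz, subway_noise, snr_db=10)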


Journal ArticleDOI
TL;DR: A new model-based speaker adaptation algorithm called the eigenvoice approach, which constrains the adapted model to be a linear combination of a small number of basis vectors obtained offline from a set of reference speakers, and thus greatly reduces the number of free parameters to be estimated from adaptation data.
Abstract: This paper describes a new model-based speaker adaptation algorithm called the eigenvoice approach. The approach constrains the adapted model to be a linear combination of a small number of basis vectors obtained offline from a set of reference speakers, and thus greatly reduces the number of free parameters to be estimated from adaptation data. These "eigenvoice" basis vectors are orthogonal to each other and guaranteed to represent the most important components of variation between the reference speakers. Experimental results for a small-vocabulary task (letter recognition) given in the paper show that the approach yields major improvements in performance for tiny amounts of adaptation data. For instance, we obtained 16% relative improvement in error rate with one letter of supervised adaptation data, and 26% relative improvement with four letters of supervised adaptation data. After a comparison of the eigenvoice approach with other speaker adaptation algorithms, the paper concludes with a discussion of future work.

554 citations
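
To make the eigenvoice idea concrete, here is a small sketch (assuming speaker-dependent models have been stacked into "supervectors"; the shapes and names are hypothetical, and the paper estimates the weights from adaptation data by maximum likelihood rather than setting them by hand):

import numpy as np

def eigenvoice_basis(supervectors, k):
    # supervectors: (n_speakers, d) stacked means of the reference speakers' models
    mean_sv = supervectors.mean(axis=0)
    # Principal directions of inter-speaker variation ("eigenvoices")
    _, _, vt = np.linalg.svd(supervectors - mean_sv, full_matrices=False)
    return mean_sv, vt[:k]                 # shapes (d,) and (k, d)

def adapted_supervector(mean_sv, eigenvoices, weights):
    # The adapted model is constrained to the span of the k eigenvoices,
    # so only k weights need to be estimated from the adaptation data.
    return mean_sv + np.asarray(weights) @ eigenvoices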


Journal ArticleDOI
TL;DR: The performance trade-off of missed detections and false alarms for each system and the effects on performance of training condition, test segment duration, the speakers' sex and the match or mismatch of training and test handsets are presented.

403 citations
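
The trade-off reported above is usually summarized as a detection error trade-off between miss and false-alarm probabilities. A minimal sketch for computing those rates, and the equal error rate, from target and impostor trial scores (assuming higher scores mean "same speaker"):

import numpy as np

def miss_falsealarm_curve(target_scores, impostor_scores):
    # target_scores, impostor_scores: 1-D arrays of detection scores
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    return p_miss, p_fa

def equal_error_rate(target_scores, impostor_scores):
    p_miss, p_fa = miss_falsealarm_curve(target_scores, impostor_scores)
    i = int(np.argmin(np.abs(p_miss - p_fa)))
    return (p_miss[i] + p_fa[i]) / 2.0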


Journal ArticleDOI
TL;DR: This article goes into detail about the BioID system functions, explaining the data acquisition and preprocessing techniques for voice, facial, and lip imagery data and the classification principles used for optical features and the sensor fusion options.
Abstract: Biometric identification systems, which use physical features to check a person's identity, ensure much greater security than password and number systems. Biometric features such as the face or a fingerprint can be stored on a microchip in a credit card, for example. A single feature, however, sometimes fails to be exact enough for identification. Another disadvantage of using only one feature is that the chosen feature is not always readable. Dialog Communication Systems (DCS AG) developed BioID, a multimodal identification system that uses three different features-face, voice, and lip movement-to identify people. With its three modalities, BioID achieves much greater accuracy than single-feature systems. Even if one modality is somehow disturbed-for example, if a noisy environment drowns out the voice-the other two modalities still lead to an accurate identification. This article goes into detail about the system functions, explaining the data acquisition and preprocessing techniques for voice, facial, and lip imagery data. The authors also explain the classification principles used for optical features and the sensor fusion options (the combinations of the three results-face, voice, lip movement-to obtain varying levels of security).

386 citations
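
As a toy illustration of the sensor-fusion step mentioned above (a weighted-sum fusion with a threshold is only one of several options discussed; the weights and threshold below are made up):

def fuse_modalities(face_score, voice_score, lip_score,
                    weights=(0.4, 0.3, 0.3), threshold=0.6):
    # Combine normalized per-modality scores; a higher threshold gives higher security.
    fused = (weights[0] * face_score + weights[1] * voice_score
             + weights[2] * lip_score)
    return fused >= threshold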


Journal ArticleDOI
TL;DR: This paper proposes a new segmentation method, called DISTBIC, which combines two different segmentation techniques and is efficient at detecting speaker turns even close to one another (i.e., separated by a few seconds).

299 citations
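
DISTBIC itself combines a distance-based pass with BIC validation; as a hedged sketch of the BIC part only (the Gaussian full-covariance ΔBIC test commonly used to confirm a candidate speaker turn, with a tunable penalty weight as an assumption):

import numpy as np

def delta_bic(left, right, penalty_weight=1.0):
    # left, right: (n_frames, d) feature blocks on either side of a candidate change point
    both = np.vstack([left, right])
    d = both.shape[1]

    def logdet_cov(block):
        cov = np.cov(block, rowvar=False) + 1e-6 * np.eye(d)   # regularized covariance
        return np.linalg.slogdet(cov)[1]

    n, nl, nr = len(both), len(left), len(right)
    gain = 0.5 * (n * logdet_cov(both) - nl * logdet_cov(left) - nr * logdet_cov(right))
    penalty = 0.5 * penalty_weight * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty      # > 0 suggests the two sides come from different speakers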


Journal ArticleDOI
M.J.F. Gales1
TL;DR: This paper examines an adaptation scheme requiring very few parameters, cluster adaptive training (CAT), which may be viewed as a simple extension to speaker clustering in which a linear interpolation of all the cluster means is used as the mean of a particular speaker.
Abstract: When performing speaker adaptation, there are two conflicting requirements. First, the speaker transform must be powerful enough to represent the speaker. Second, the transform must be quickly and easily estimated for any particular speaker. The most popular adaptation schemes have used many parameters to adapt the models to be representative of an individual speaker. This limits how rapidly the models may be adapted to a new speaker or the acoustic environment. This paper examines an adaptation scheme requiring very few parameters, cluster adaptive training (CAT). CAT may be viewed as a simple extension to speaker clustering. Rather than selecting a single cluster as representative of a particular speaker, a linear interpolation of all the cluster means is used as the mean of the particular speaker. This scheme naturally falls into an adaptive training framework. Maximum likelihood estimates of the interpolation weights are given. Furthermore, simple re-estimation formulae for cluster means, represented both explicitly and by sets of transforms of some canonical mean, are given. On a speaker-independent task CAT reduced the word error rate using very little adaptation data. In addition when combined with other adaptation schemes it gave a 5% reduction in word error rate over adapting a speaker-independent model set.

293 citations
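
A toy sketch of the CAT mean construction described above (the cluster means and weights below are hypothetical; in the paper the interpolation weights are maximum-likelihood estimates, and cluster means may themselves be represented by transforms of a canonical mean):

import numpy as np

def cat_speaker_mean(cluster_means, interpolation_weights):
    # cluster_means: (n_clusters, d) means of one Gaussian component across clusters
    # interpolation_weights: (n_clusters,) weights estimated for the target speaker
    return np.asarray(interpolation_weights, dtype=float) @ np.asarray(cluster_means, dtype=float)

# e.g. cat_speaker_mean([[0., 0.], [1., 1.], [2., 0.]], [0.2, 0.5, 0.3])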


Journal ArticleDOI
TL;DR: My focus here is recognition errors as a problem for spoken-language systems, especially when processing diverse speaker styles or speech produced in noisy field settings; however, when speech is combined with another input mode within a multimodal architecture, recent research has shown that two modes can function better than one alone.
Abstract: My focus here is recognition errors as a problem for spoken-language systems, especially when processing diverse speaker styles or speech produced in noisy field settings. However, when speech is combined with another input mode within a multimodal architecture, recent research has shown that two modes can function better than one alone. I also outline when and why multimodal systems display error-handling advantages. Recent studies on mobile speech and accented speakers have found that:

281 citations


Journal ArticleDOI
TL;DR: By understanding the cognitive processes surrounding human “acoustic memory” and processing, interface designers may be able to integrate speech more effectively and guide users more successfully.
Abstract: Continued research and development should be able to improve certain speech input, output, and dialogue applications. Speech recognition and generation is sometimes helpful for environments that are hands-busy, eyes-busy, mobility-required, or hostile and shows promise for telephone-based services. Dictation input is increasingly accurate, but adoption outside the disabled-user community has been slow compared to visual interfaces. Obvious physical problems include fatigue from speaking continuously and the disruption in an office filled with people speaking. By understanding the cognitive processes surrounding human “acoustic memory” and processing, interface designers may be able to integrate speech more effectively and guide users more successfully. By appreciating the differences between human-human interaction and human-computer interaction, designers may then be able to choose appropriate applications for human use of speech with computers. The key distinction may be the rich emotional content conveyed by prosody, or the pacing, intonation, and amplitude in spoken language. The emotive aspects of prosody are potent for human-human interaction but may be disruptive for human-computer interaction. The syntactic aspects of prosody, such as rising tone for questions, are important for a system’s recognition and generation of sentences. Now consider human acoustic memory and processing. Short-term and working memory are sometimes called acoustic or verbal memory.

277 citations


Proceedings ArticleDOI
11 Dec 2000
TL;DR: A new technique for normalising the polynomial kernel is developed and used to achieve performance comparable to other classifiers on the YOHO database.
Abstract: The performance of the support vector machine (SVM) on a speaker verification task is assessed. Since speaker verification requires binary decisions, support vector machines seem to be a promising candidate to perform the task. A new technique for normalising the polynomial kernel is developed and used to achieve performance comparable to other classifiers on the YOHO database. We also present results on a speaker identification task.

250 citations
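
A hedged sketch of one way to normalize a polynomial kernel for an SVM verifier (scaling so that k(x, x) = 1; this is a common choice and not necessarily the exact normalization developed in the paper):

import numpy as np
from sklearn.svm import SVC

def poly_kernel(a, b, degree=3, c=1.0):
    return (a @ b.T + c) ** degree

def normalized_poly_kernel(a, b, degree=3, c=1.0):
    kab = poly_kernel(a, b, degree, c)
    kaa = np.diag(poly_kernel(a, a, degree, c))
    kbb = np.diag(poly_kernel(b, b, degree, c))
    return kab / np.sqrt(np.outer(kaa, kbb))   # k(x, x) == 1 after normalization

# Hypothetical usage with per-utterance feature vectors X_train, X_test and
# labels y_train (+1 target speaker, -1 impostor):
# svm = SVC(kernel="precomputed").fit(normalized_poly_kernel(X_train, X_train), y_train)
# scores = svm.decision_function(normalized_poly_kernel(X_test, X_train))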



Journal ArticleDOI
TL;DR: A Bayesian interpretation framework (based on the likelihood ratio) represents an adequate solution for the interpretation of the aforementioned evidence in the judicial process and allows speaker recognition to be treated with the same logic as other forensic identification evidence.
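
The likelihood-ratio framework mentioned above can be illustrated with a deliberately simplified numeric sketch, assuming the score distributions under the "same speaker" and "different speaker" hypotheses are modelled as Gaussians (the actual forensic methodology is richer than this):

from scipy.stats import norm

def likelihood_ratio(evidence_score, same_mean, same_std, diff_mean, diff_std):
    # LR = p(evidence | same speaker) / p(evidence | different speakers)
    return (norm.pdf(evidence_score, same_mean, same_std)
            / norm.pdf(evidence_score, diff_mean, diff_std))

# posterior odds = likelihood_ratio(...) * prior odds  (the court provides the priors)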

Journal ArticleDOI
TL;DR: This article summarizes the 1999 NIST Speaker Recognition Evaluation, discussing the overall research objectives, the three task definitions, the development and evaluation data sets, the specified performance measures and their manner of presentation, and the overall quality of the results.

Journal ArticleDOI
TL;DR: A study on the possible merits of such a display for bandlimited speech with respect to intelligibility and talker recognition against a background of competing voices finds no difference between the use of an individualized 3D auditory display and a general display.
Abstract: In a 3D auditory display, sounds are presented over headphones in a way that they seem to originate from virtual sources in a space around the listener. This paper describes a study on the possible merits of such a display for bandlimited speech with respect to intelligibility and talker recognition against a background of competing voices. Different conditions were investigated: speech material (words/sentences), presentation mode (monaural/binaural/3D), number of competing talkers (1–4), and virtual position of the talkers (in 45°-steps around the front horizontal plane). Average results for 12 listeners show an increase of speech intelligibility for 3D presentation for two or more competing talkers compared to conventional binaural presentation. The ability to recognize a talker is slightly better and the time required for recognition is significantly shorter for 3D presentation in the presence of two or three competing talkers. Although absolute localization of a talker is rather poor, spatial separation appears to have a significant effect on communication. For either speech intelligibility, talker recognition, or localization, no difference is found between the use of an individualized 3D auditory display and a general display.

Proceedings ArticleDOI
30 Jul 2000
TL;DR: A method of automatically detecting a talking person using video and audio data from a single microphone using a time-delayed neural network and a spatio-temporal search for a speaking person is described.
Abstract: The visual motion of the mouth and the corresponding audio data generated when a person speaks are highly correlated. This fact has been exploited for lip/speech-reading and for improving speech recognition. We describe a method of automatically detecting a talking person (both spatially and temporally) using video and audio data from a single microphone. The audio-visual correlation is learned using a time-delayed neural network, which is then used to perform a spatio-temporal search for a speaking person. Applications include videoconferencing, video indexing and improving human-computer interaction (HCI). An example HCI application is provided.

Patent
10 May 2000
TL;DR: In this article, a technique for adaptation of a speech recognizing system across multiple remote communication sessions with a speaker is presented; in the preferred embodiment, speech samples are obtained without requiring the speaker to engage in a training session.
Abstract: A technique for adaptation of a speech recognizing system across multiple remote communication sessions with a speaker. The speaker can be a telephone caller. An acoustic model is utilized for recognizing the speaker's speech. Upon initiation of a first remote session with the speaker, the acoustic model is speaker-independent. During the first session, the speaker is uniquely identified and speech samples are obtained from the speaker. In the preferred embodiment, the samples are obtained without requiring the speaker to engage in a training session. The acoustic model is then modified based upon the samples thereby forming a modified model. The model can be modified during the session or after the session is terminated. Upon termination of the session, the modified model is then stored in association with an identification of the speaker. During a subsequent remote session, the speaker is identified and, then, the modified acoustic model is utilized to recognize the speaker's speech. Additional speech samples are obtained during the subsequent session and, then, utilized to further modify the acoustic model. In this manner, an acoustic model utilized for recognizing the speech of a particular speaker is cumulatively modified according to speech samples obtained during multiple sessions with the speaker. As a result, the accuracy of the speech recognizing system improves for the speaker even when the speaker only engages in relatively short remote sessions.

Proceedings Article
01 Jan 2000
TL;DR: The theory and promise of the Missing Data approach to robust Automatic Speech Recognition is developed, and the probability calculation is adapted to use soft reliability estimates as weighting factors for the complementary reliable/unreliable interpretations of each feature vector component.
Abstract: In previous work we have developed the theory and demonstrated the promise of the Missing Data approach to robust Automatic Speech Recognition. This technique is based on hard decisions as to whether each time-frequency "pixel" is either reliable or unreliable. In this paper we replace these discrete decisions with soft estimates of the probability that each "pixel" is reliable. We adapt the probability calculation to use these estimates as weighting factors for the complementary reliable/unreliable interpretations for each feature vector component. Experiments using the TIDigits connected digit recognition task demonstrate that this technique affords significant performance improvements at low SNRs.
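
A much-simplified per-frame sketch of the soft-decision idea (a diagonal-Gaussian observation model where each component's likelihood is a reliability-weighted mix of its "present" and "missing" interpretations; the constant stand-in for the marginalized unreliable case is an assumption, since the paper integrates over the feasible clean values instead):

import numpy as np
from scipy.stats import norm

def soft_missing_data_loglik(frame, reliability, mean, std, missing_density=1e-4):
    # frame: observed spectral features for one frame
    # reliability: per-component probability in [0, 1] that the component is speech-dominated
    present = norm.pdf(frame, mean, std)          # likelihood if the component is reliable
    mixed = reliability * present + (1.0 - reliability) * missing_density
    return float(np.sum(np.log(mixed + 1e-300)))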

Journal ArticleDOI
Chin-Hui Lee1, Qiang Huo
01 Aug 2000
TL;DR: The mathematical framework for Bayesian adaptation of acoustic and language model parameters is first described, and maximum a posteriori point estimation is developed for hidden Markov models and a number of useful parameter densities commonly used in automatic speech recognition and natural language processing.
Abstract: Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevailing training method. However, due to the problems of unknown speech distributions, sparse training data, high spectral and temporal variabilities in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with the changing speakers and speaking conditions in real operational conditions for high-performance speech recognition, such paradigms incorporate a small amount of speaker and environment specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine prior knowledge in an existing collection of general models with a new set of condition-specific adaptation data. In this paper, the mathematical framework for Bayesian adaptation of acoustic and language model parameters is first described. Maximum a posteriori point estimation is then developed for hidden Markov models and a number of useful parameter densities commonly used in automatic speech recognition and natural language processing.
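
As a minimal sketch of the MAP point estimate for a single Gaussian mean (the standard relevance-factor form; the value of tau and the variable names are assumptions, and the paper covers many more parameter densities):

import numpy as np

def map_adapt_mean(prior_mean, frames, posteriors, tau=16.0):
    # frames: (n, d) adaptation features; posteriors: (n,) occupancy of this Gaussian
    soft_count = posteriors.sum()
    ml_mean = (posteriors[:, None] * frames).sum(axis=0) / max(soft_count, 1e-8)
    alpha = soft_count / (soft_count + tau)       # balance between data and prior
    return alpha * ml_mean + (1.0 - alpha) * prior_mean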

Journal ArticleDOI
TL;DR: An approach to voice characteristics conversion for an HMM-based text-to-speech synthesis system using speaker interpolation, which can synthesize speech with various voice qualities without a large database in the synthesis phase.
Abstract: This paper describes an approach to voice characteristics conversion for an HMM-based text-to-speech synthesis system using speaker interpolation. Although most text-to-speech synthesis systems which synthesize speech by concatenating speech units can synthesize speech with acceptable quality, they still cannot synthesize speech with various voice qualities such as speaker individualities and emotions. In order to control speaker individualities and emotions, therefore, they need a large database that records speech units with various voice characteristics in the synthesis phase. On the other hand, our system synthesizes speech with an untrained speaker's voice quality by interpolating HMM parameters among some representative speakers' HMM sets. Accordingly, our system can synthesize speech with various voice qualities without a large database in the synthesis phase. An HMM interpolation technique is derived from a probabilistic similarity measure for HMMs and used to synthesize speech with an untrained speaker's voice quality by interpolating HMM parameters among some representative speakers' HMM sets. The results of subjective experiments show that we can gradually change the voice quality of synthesized speech from one speaker's to the other's by changing the interpolation ratio.
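
A toy sketch of interpolating the output distributions of two (or more) speakers' HMMs by a given ratio (how the weights are derived and how variances are combined in the paper may differ; this is one simple moment-matching combination):

import numpy as np

def interpolate_gaussians(means, variances, weights):
    # means, variances: (n_speakers, d); weights: (n_speakers,) summing to 1
    w = np.asarray(weights, dtype=float)[:, None]
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mean = np.sum(w * means, axis=0)
    var = np.sum(w * (variances + means ** 2), axis=0) - mean ** 2
    return mean, var

# Sweeping the weights, e.g. from [0.9, 0.1] toward [0.1, 0.9], gradually moves the
# synthesized voice quality from one speaker toward the other.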

PatentDOI
TL;DR: In this article, a voice control system is proposed whose microphone array is distributed across different appliances, with the microphone signals transmitted to a central speech recognition unit via a bidirectional network based on an IEEE 1394 bus, since the distances between the microphones within a single consumer electronics appliance are limited by the dimensions of the appliance.
Abstract: Voice control systems are used in diverse technical fields. In this case, the spoken words are detected by one or more microphones and then fed to a speech recognition system. In order to enable voice control even from a relatively great distance, the voice signal must be separated from interfering background signals. This can be effected by spatial separation using microphone arrays comprising two or more microphones. In this case, it is advantageous for the individual microphones of the microphone array to be distributed spatially over the greatest possible distance. In an individual consumer electronics appliance, however, the distances between the individual microphones are limited on account of the dimensions of the appliance. Therefore, the voice control system according to the invention comprises a microphone array having a plurality of microphones which are distributed between different appliances, in which case the signals generated by the microphones can be transmitted to the central speech recognition unit, advantageously via a bidirectional network based on an IEEE 1394 bus.

Proceedings Article
01 Jan 2000
TL;DR: This work proposes a novel statistical modeling and compensation method for robust speaker recognition that yields similar improvements as the HNORM score-based compensation method, but with a fraction of the training time.
Abstract: A novel statistical modeling and compensation method for robust speaker recognition is presented. The method specifically addresses the degradation in speaker verification performance due to the mismatch in channels (e.g., telephone handsets) between enrollment and testing sessions. In mismatched conditions, the new approach uses speaker-independent channel transformations to synthesize a speaker model that corresponds to the channel of the testing session. Effectively, verification is always performed in matched channel conditions. Results on the 1998 NIST Speaker Recognition Evaluation corpus show that the new approach yields performance that matches the best reported results. Specifically, our approach yields improvements similar to those of the HNORM score-based compensation method (a 19.9% reduction in EER compared to CMN alone), but with a fraction of the training time.

Journal ArticleDOI
TL;DR: Face recognition is one of the few biometric methods that combine high accuracy with low intrusiveness: it has the accuracy of a physiological approach without being intrusive.
Abstract: Introduction In today's networked world, the need to maintain the security of information or physical property is becoming both increasingly important and increasingly difficult. From time to time we hear about crimes of credit card fraud, computer break-ins by hackers, or security breaches in a company or government building. In the year 1998, sophisticated cyber crooks caused well over US $100 million in losses (Reuters, 1999). In most of these crimes, the criminals were taking advantage of a fundamental flaw in the conventional access control systems: the systems do not grant access by "who we are", but by "what we have", such as ID cards, keys, passwords, PIN numbers, or mother's maiden name. None of these really defines us; rather, they are merely means to authenticate us. It goes without saying that if someone steals, duplicates, or acquires these identity means, he or she will be able to access our data or our personal property any time they want. Recently, technology became available to allow verification of "true" individual identity. This technology is based in a field called "biometrics". Biometric access control systems are automated methods of verifying or recognizing the identity of a living person on the basis of some physiological characteristics, such as fingerprints or facial features, or some aspects of the person's behavior, like his/her handwriting style or keystroke patterns. Since biometric systems identify a person by biological characteristics, they are difficult to forge. Among the various biometric ID methods, the physiological methods (fingerprint, face, DNA) are more stable than methods in the behavioral category (keystroke, voice print). The reason is that physiological features are often non-alterable except by severe injury. The behavioral patterns, on the other hand, may fluctuate due to stress, fatigue, or illness. However, behavioral IDs have the advantage of being non-intrusive. People are more comfortable signing their names or speaking to a microphone than placing their eyes before a scanner or giving a drop of blood for DNA sequencing. Face recognition is one of the few biometric methods that possess the merits of both high accuracy and low intrusiveness. It has the accuracy of a physiological approach without being intrusive. For this reason, since the early 70's (Kelly, 1970), face recognition has drawn the attention of researchers in fields from security, psychology, and image processing, to computer vision. Numerous algorithms have been proposed for face recognition; for a detailed survey please see Chellappa (1995) and Zhang (1997). While network security and access control are its most widely discussed applications, face recognition has also proven useful in other multimedia information processing areas. Chan et al. (1998) use face recognition techniques to browse video databases to find shots of particular people. Li et al. (1993) code face images with a compact parameterized facial model for low-bandwidth communication applications such as videophone and teleconferencing. Recently, as the technology has matured, commercial products (such as Miros' TrueFace (1999) and Visionics' FaceIt (1999)) have appeared on the market. Despite the commercial success of those face recognition products, a few research issues remain to be explored. In the next section, we will begin our study of face recognition by discussing several metrics to evaluate the recognition performance.
Section 3 provides a framework for a generic face recognition algorithm. Then in Section 4 we discuss the various factors that affect the performance of the face recognition system. In Section 5, we show the reader several well-known face recognition examples, such as eigenfaces and neural networks. Finally, a conclusion is given in Section 6. Performance Evaluation Metrics The two standard biometric measures used to indicate identifying power are the False Rejection Rate (FRR) and the False Acceptance Rate (FAR). …
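
Computing the two metrics named above from genuine and impostor match scores at a chosen operating threshold is straightforward; a small sketch (the score arrays and threshold are hypothetical):

import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    # FAR: fraction of impostor attempts accepted; FRR: fraction of genuine users rejected
    far = float(np.mean(np.asarray(impostor_scores) >= threshold))
    frr = float(np.mean(np.asarray(genuine_scores) < threshold))
    return far, frr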

Journal ArticleDOI
TL;DR: A new technique, verbal information verification (VIV), is proposed, in which spoken utterances of a claimed speaker are automatically verified against the key information in the speaker's registered profile to decide whether the claimed identity should be accepted or rejected.
Abstract: Traditional speaker authentication focuses on speaker verification (SV) and speaker identification, which is accomplished by matching the speaker's voice with his or her registered speech patterns. In this paper, we propose a new technique, verbal information verification (VIV), in which spoken utterances of a claimed speaker are automatically verified against the key (usually confidential) information in the speaker's registered profile to decide whether the claimed identity should be accepted or rejected. Using the proposed sequential procedure involving three question-response turns, we achieved an error-free result in a telephone speaker authentication experiment with 100 speakers. We further propose a speaker authentication system combining VIV with SV. In the system, a user is verified by VIV in the first four to five accesses, usually from different acoustic environments. During these uses, one of the key questions pertains to a pass-phrase for SV. The VIV system collects and verifies the pass-phrase utterance for use as training data for speaker model construction. After a speaker-dependent model is constructed, the system then migrates to SV. This approach avoids the inconvenience of a formal enrollment procedure, ensures the quality of the training data for SV, and mitigates the mismatch caused by different acoustic environments between training and testing. Experiments showed that the proposed system improved the SV performance by over 40% in equal-error rate compared to a conventional SV system.

Journal ArticleDOI
TL;DR: The approach transforms features such as mel-cepstral features, log spectrum, and prosody-based features with a non-linear artificial neural network to maximize speaker recognition performance specifically in the setting of telephone handset mismatch between training and testing.

Journal ArticleDOI
TL;DR: Experimental results show that small EBF networks with basis function parameters estimated by the EM algorithm outperform the large RBF networks trained in the conventional approach.
Abstract: This paper proposes to incorporate full covariance matrices into the radial basis function (RBF) networks and to use the expectation-maximization (EM) algorithm to estimate the basis function parameters. The resulting networks, referred to as elliptical basis function (EBF) networks, are evaluated through a series of text-independent speaker verification experiments involving 258 speakers from a phonetically balanced, continuous speech corpus (TIMIT). We propose a verification procedure using RBF and EBF networks as speaker models and show that the networks are readily applicable to verifying speakers using LP-derived cepstral coefficients as features. Experimental results show that small EBF networks with basis function parameters estimated by the EM algorithm outperform the large RBF networks trained in the conventional approach. The results also show that the equal error rate achieved by the EBF networks is about two-thirds of that achieved by the vector quantization-based speaker models.


Journal ArticleDOI
TL;DR: A specific large speech database in Castilian Spanish called AHUMADA (/aumada/) has been designed and acquired under controlled conditions, and some experimental results involving different speech variability factors are presented.

Patent
TL;DR: In this article, a system and method for quickly improving the accuracy of a speech recognition program are described; the speech recognition program automatically converts a pre-recorded audio file into written text, and corrected text segments are saved and reused to improve subsequent speech-to-text conversion.
Abstract: A system and method for quickly improving the accuracy of a speech recognition program. The system is based on a speech recognition program that automatically converts a pre-recorded audio file into written text. The system parses the written text into segments, each of which is corrected by the system and saved in a retrievable manner in association with the computer. The standard speech files are saved toward improving accuracy in speech-to-text conversion by the speech recognition program. The system further includes facilities to repetitively establish an independent instance of the written text from the pre-recorded audio file using the speech recognition program. This independent instance can then be broken into segments, and each segment in said independent instance replaced with a corrected segment associated with the segment. In this manner, repetitive instruction of a speech recognition program can be facilitated. A system and method for directing pre-recorded audio files to a speech recognition program that does not accept such files is also disclosed. Such system and method are necessary to use the system and method for quickly improving the accuracy of a speech recognition program with some pre-existing speech recognition programs.

01 Jan 2000
TL;DR: In this article, a large speech database in Castilian Spanish called AHUMADA (/aumada/) has been designed and acquired under controlled conditions; together with a detailed description of the database, some experimental results including different speech variability factors are also presented.
Abstract: Speaker recognition is an emerging task in both commercial and forensic applications. Nevertheless, while in certain applications we can estimate, adapt or hypothesize about our working conditions, most of the commercial applications and almost the whole of the forensic approaches to speaker recognition are still open problems, due to several reasons. Some of these reasons can be stated: environmental conditions are (usually) rapidly changing or highly degraded, acquisition processes are not always under control, incriminated people exhibit a low degree of cooperativeness, etc., inducing a wide range of variability sources on speech utterances. In this sense, real approaches to speaker identification necessarily imply taking into account all these variability factors. In order to isolate, analyze and measure the effect of some of the main variability sources that can be found in real commercial and forensic applications, and their influence in automatic recognition systems, a specific large speech database in Castilian Spanish called AHUMADA (/aumada/) has been designed and acquired under controlled conditions. In this paper, together with a detailed description of the database, some experimental results including different speech variability factors are also presented. © 2000 Elsevier Science B.V. All rights reserved.

Proceedings ArticleDOI
05 Jun 2000
TL;DR: A speaker tracking system is built by successively applying a speaker change detector and a speaker verification system, in order to find, in a conversation between several persons, target speakers chosen from a set of enrolled users.
Abstract: A speaker tracking system (STS) is built by successively using a speaker change detector and a speaker verification system. The aim of the STS is to find, in a conversation between several persons (some of whom have already enrolled and others who are totally unknown), target speakers chosen from a set of enrolled users. In a first step, speech is segmented into homogeneous segments containing only one speaker, without any use of a priori knowledge about the speakers. Then, the resulting segments are checked as to whether they belong to one of the target speakers. The system has been used in a NIST evaluation test with satisfactory results.