
Showing papers on "Speaker diarisation published in 2000"


Journal ArticleDOI
TL;DR: The major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs) are described.
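
A minimal sketch of the GMM-UBM scoring idea behind such systems, assuming feature matrices of MFCC-like frames and already-fitted scikit-learn GaussianMixture models for the target speaker and the universal background model (the function names and the zero threshold are illustrative, not the paper's):

    import numpy as np

    def llr_score(features, speaker_gmm, ubm):
        # Average per-frame log-likelihood ratio: speaker model vs. background model.
        # speaker_gmm and ubm are assumed to expose score_samples(), e.g. sklearn GMMs.
        return float(np.mean(speaker_gmm.score_samples(features) - ubm.score_samples(features)))

    def accept_claim(features, speaker_gmm, ubm, threshold=0.0):
        # Accept the identity claim when the ratio exceeds a tuned threshold.
        return llr_score(features, speaker_gmm, ubm) > threshold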

4,673 citations


Journal ArticleDOI
TL;DR: This paper proposes a new segmentation method, called DISTBIC, which combines two different segmentation techniques and is efficient in detecting speaker turns even when they are close to one another (i.e., separated by a few seconds).
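
DISTBIC combines a distance-based pre-selection of candidate turn points with a BIC test. Below is a minimal sketch of the standard ΔBIC criterion for deciding whether two adjacent windows of acoustic features were produced by one speaker or two; the penalty weight λ and the windowing are assumptions, not the paper's exact settings:

    import numpy as np

    def delta_bic(left, right, lam=1.0):
        # Positive values favour a speaker change between the two windows
        # (each window is a frames x dims array of acoustic features).
        both = np.vstack([left, right])
        n1, n2, n = len(left), len(right), len(both)
        d = both.shape[1]
        logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
        return 0.5 * (n * logdet(both) - n1 * logdet(left) - n2 * logdet(right)) - lam * penalty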

299 citations


Proceedings ArticleDOI
11 Dec 2000
TL;DR: A new technique for normalising the polynomial kernel is developed and used to achieve performance comparable to other classifiers on the YOHO database.
Abstract: The performance of the support vector machine (SVM) on a speaker verification task is assessed. Since speaker verification requires binary decisions, support vector machines seem to be a promising candidate to perform the task. A new technique for normalising the polynomial kernel is developed and used to achieve performance comparable to other classifiers on the YOHO database. We also present results on a speaker identification task.
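
One common way to normalise a polynomial kernel is to enforce unit self-similarity for every vector; the sketch below shows that general idea, which is not necessarily the specific normalisation proposed in the paper:

    import numpy as np

    def poly_kernel(x, y, degree=3, c=1.0):
        return (np.dot(x, y) + c) ** degree

    def normalised_poly_kernel(x, y, degree=3, c=1.0):
        # Cosine-style normalisation: K'(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y)),
        # which keeps SVM scores in a comparable range across speakers.
        return poly_kernel(x, y, degree, c) / np.sqrt(
            poly_kernel(x, x, degree, c) * poly_kernel(y, y, degree, c))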

250 citations



Journal ArticleDOI
01 Aug 2000
TL;DR: This paper describes some of the requisite speech and language technologies that would be required and introduces an effort aimed at integrating these technologies into a system, called Rough 'n' Ready, which indexes speech data, creates a structural summarization, and provides tools for browsing the stored data.
Abstract: With the advent of essentially unlimited data storage capabilities and with the proliferation of the use of the Internet, it becomes reasonable to imagine a world in which it would be possible to access any of the stored information at will with a few keystrokes or voice commands. Since much of this data will be in the form of speech from various sources, it becomes important to develop the technologies necessary for indexing and browsing such audio data. This paper describes some of the requisite speech and language technologies that would be required and introduces an effort aimed at integrating these technologies into a system, called Rough 'n' Ready, which indexes speech data, creates a structural summarization, and provides tools for browsing the stored data. The technologies highlighted in the paper include speaker-independent continuous speech recognition, speaker segmentation and identification, name spotting, topic classification, story segmentation, and information retrieval. The system automatically segments the continuous audio input stream by speaker, clusters audio segments from the same speaker, identifies speakers known to the system, and transcribes the spoken words. It also segments the input stream into stories, based on their topic content, and locates the names of persons, places, and organizations. These structural features are stored in a database and are used to construct highly selective search queries for retrieving specific content from large audio archives.

196 citations


Journal ArticleDOI
TL;DR: This article summarizes the 1999 NIST Speaker Recognition Evaluation, covering the overall research objectives, the three task definitions, the development and evaluation data sets, the specified performance measures and their manner of presentation, and the overall quality of the results.

167 citations


Patent
10 May 2000
TL;DR: In this article, a technique for adaptation of a speech recognizing system across multiple remote communication sessions with a speaker is presented. The technique obtains speech samples without requiring the speaker to engage in a training session.
Abstract: A technique for adaptation of a speech recognizing system across multiple remote communication sessions with a speaker. The speaker can be a telephone caller. An acoustic model is utilized for recognizing the speaker's speech. Upon initiation of a first remote session with the speaker, the acoustic model is speaker-independent. During the first session, the speaker is uniquely identified and speech samples are obtained from the speaker. In the preferred embodiment, the samples are obtained without requiring the speaker to engage in a training session. The acoustic model is then modified based upon the samples thereby forming a modified model. The model can be modified during the session or after the session is terminated. Upon termination of the session, the modified model is then stored in association with an identification of the speaker. During a subsequent remote session, the speaker is identified and, then, the modified acoustic model is utilized to recognize the speaker's speech. Additional speech samples are obtained during the subsequent session and, then, utilized to further modify the acoustic model. In this manner, an acoustic model utilized for recognizing the speech of a particular speaker is cumulatively modified according to speech samples obtained during multiple sessions with the speaker. As a result, the accuracy of the speech recognizing system improves for the speaker even when the speaker only engages in relatively short remote sessions.
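
A minimal sketch of the cross-session workflow described above, with a plain dictionary standing in for the stored per-speaker models and a re-fitted scikit-learn GMM standing in for the patent's acoustic-model modification (both stand-ins are assumptions for illustration only):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    stored_features = {}   # speaker_id -> accumulated speech samples (frames x dims)
    stored_models = {}     # speaker_id -> cumulatively adapted model

    def start_session(speaker_id, speaker_independent_model):
        # Use the speaker's adapted model if one exists, otherwise the SI model.
        return stored_models.get(speaker_id, speaker_independent_model)

    def end_session(speaker_id, session_features, n_components=8):
        # Accumulate this session's samples and re-estimate the speaker's model,
        # so accuracy improves even when individual sessions are short.
        prev = stored_features.get(speaker_id)
        data = session_features if prev is None else np.vstack([prev, session_features])
        stored_features[speaker_id] = data
        stored_models[speaker_id] = GaussianMixture(n_components, covariance_type='diag').fit(data)
        return stored_models[speaker_id]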

154 citations


Journal ArticleDOI
TL;DR: An approach to voice characteristics conversion for an HMM-based text-to-speech synthesis system using speaker interpolation, which can synthesize speech with various voice qualities without a large database in the synthesis phase.
Abstract: This paper describes an approach to voice characteristics conversion for an HMM-based text-to-speech synthesis system using speaker interpolation. Although most text-to-speech synthesis systems which synthesize speech by concatenating speech units can synthesize speech with acceptable quality, they still cannot synthesize speech with various voice qualities such as speaker individualities and emotions. In order to control speaker individualities and emotions, therefore, they need a large database, which records speech units with various voice characteristics, in the synthesis phase. On the other hand, our system synthesizes speech with an untrained speaker's voice quality by interpolating HMM parameters among some representative speakers' HMM sets. Accordingly, our system can synthesize speech with various voice qualities without a large database in the synthesis phase. An HMM interpolation technique is derived from a probabilistic similarity measure for HMMs, and used to synthesize speech with an untrained speaker's voice quality by interpolating HMM parameters among some representative speakers' HMM sets. The results of subjective experiments show that we can gradually change the voice quality of synthesized speech from one speaker's to the other's by changing the interpolation ratio.
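
The paper derives its interpolation from a probabilistic similarity measure between HMMs; the sketch below shows only the simplest form of the idea, a convex combination of corresponding Gaussian parameters across the representative speakers' HMM sets (an illustrative simplification, not the paper's exact formula):

    import numpy as np

    def interpolate_output_distributions(means_per_speaker, covs_per_speaker, weights):
        # means_per_speaker / covs_per_speaker: one array per representative speaker,
        # all with identical shapes (e.g. states x mixtures x dims).
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                       # interpolation ratios sum to one
        mean = sum(wi * m for wi, m in zip(w, means_per_speaker))
        cov = sum(wi * c for wi, c in zip(w, covs_per_speaker))
        return mean, cov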

140 citations


Proceedings Article
01 Jan 2000
TL;DR: This work proposes a novel statistical modeling and compensation method for robust speaker recognition that yields improvements similar to those of the HNORM score-based compensation method, but with a fraction of the training time.
Abstract: A novel statistical modeling and compensation method for robust speaker recognition is presented. The method specifically addresses the degradation in speaker verification performance due to the mismatch in channels (e.g., telephone handsets) between enrollment and testing sessions. In mismatched conditions, the new approach uses speaker-independent channel transformations to synthesize a speaker model that corresponds to the channel of the testing session. Effectively verification is always performed in matched channel conditions. Results on the 1998 NIST Speaker Recognition Evaluation corpus show that the new approach yields performance that matches the best reported results. Specifically, our approach yields similar improvements (19.9% reduction in EER compared to CMN alone) as the HNORM score-based compensation method, but with a fraction of the training time.
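
A sketch of the synthesis step under the assumption that the speaker-independent channel transformation can be represented as per-mixture mean offsets between channel-dependent background models; the method actually evaluated may use a richer transformation than this:

    import numpy as np

    def channel_offsets(ubm_means_enroll_channel, ubm_means_test_channel):
        # Speaker-independent transformation, estimated once from channel-labelled data.
        return ubm_means_test_channel - ubm_means_enroll_channel

    def synthesize_speaker_model(speaker_means, offsets):
        # Shift the enrolled speaker's mixture means toward the test-session channel,
        # so that verification is effectively performed in matched conditions.
        return speaker_means + offsets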

125 citations


Journal ArticleDOI
TL;DR: A new technique, verbal information verification (VIV), is proposed, in which spoken utterances of a claimed speaker are automatically verified against the key information in the speaker's registered profile to decide whether the claimed identity should be accepted or rejected.
Abstract: Traditional speaker authentication focuses on speaker verification (SV) and speaker identification, which is accomplished by matching the speaker's voice with his or her registered speech patterns. In this paper, we propose a new technique, verbal information verification (VIV), in which spoken utterances of a claimed speaker are automatically verified against the key (usually confidential) information in the speaker's registered profile to decide whether the claimed identity should be accepted or rejected. Using the proposed sequential procedure involving three question-response turns, we achieved an error-free result in a telephone speaker authentication experiment with 100 speakers. We further propose a speaker authentication system by combining VIV with SV. In the system, a user is verified by VIV in the first four to five accesses, usually from different acoustic environments. During these uses, one of the key questions pertains to a pass-phrase for SV. The VIV system collects and verifies the pass-phrase utterance for use as training data for speaker model construction. After a speaker-dependent model is constructed, the system then migrates to SV. This approach avoids the inconvenience of a formal enrollment procedure, ensures the quality of the training data for SV, and mitigates the mismatch caused by different acoustic environments between training and testing. Experiments showed that the proposed system improved the SV performance by over 40% in equal-error rate compared to a conventional SV system.

116 citations


Proceedings ArticleDOI
30 Jul 2000
TL;DR: An algorithm is implemented that classifies story segments into three Speaker Roles based on several content and duration features, and correctly classifies about 80% of segments when applied to ASR-derived transcriptions of broadcast data.
Abstract: Previous work has shown that providing information about story structure is critical for browsing audio broadcasts. We investigate the hypothesis that Speaker Role is an important cue to story structure. We implement an algorithm that classifies story segments into three Speaker Roles based on several content and duration features. The algorithm correctly classifies about 80% of segments (compared with a baseline frequency of 35.4%) when applied to ASR-derived transcriptions of broadcast data.

Journal ArticleDOI
TL;DR: The approach transforms features such as mel-cepstral features, log spectrum, and prosody-based features with a non-linear artificial neural network to maximize speaker recognition performance specifically in the setting of telephone handset mismatch between training and testing.

Proceedings ArticleDOI
05 Jun 2000
TL;DR: A speaker tracking system is built by using successively a speaker change detector and a speaker verification system to find, in a conversation between several persons, target speakers chosen from a set of enrolled users.
Abstract: A speaker tracking system (STS) is built by using successively a speaker change detector and a speaker verification system. The aim of the STS is to find, in a conversation between several persons (some of them having already enrolled and others being totally unknown), target speakers chosen from a set of enrolled users. In a first step, speech is segmented into homogeneous segments containing only one speaker, without any use of a priori knowledge about the speakers. Then, the resulting segments are checked to determine whether they belong to one of the target speakers. The system has been used in a NIST evaluation test with satisfactory results.
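
A compact sketch of the two-stage tracking idea: segment first, then verify each homogeneous segment against the enrolled target models. The GMM scoring and the threshold here are assumptions consistent with the GMM-UBM sketch earlier in this listing, not the exact system used in the evaluation:

    import numpy as np

    def track_targets(features, change_points, target_models, ubm, threshold=0.0):
        # change_points: frame indices produced by the speaker change detector.
        # target_models / ubm: fitted models exposing score_samples(), e.g. sklearn GMMs.
        bounds = [0] + list(change_points) + [len(features)]
        labels = []
        for start, end in zip(bounds[:-1], bounds[1:]):
            segment = features[start:end]
            scores = {name: float(np.mean(model.score_samples(segment) - ubm.score_samples(segment)))
                      for name, model in target_models.items()}
            best = max(scores, key=scores.get)
            # Unknown speakers get no label when no target passes verification.
            labels.append((start, end, best if scores[best] > threshold else None))
        return labels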

Patent
Fereydoun Maali, Mahesh Viswanathan
26 Apr 2000
TL;DR: In this article, a method and apparatus are disclosed for identifying the speaker of an utterance in an audio-video source using both audio and video information.
Abstract: A method and apparatus are disclosed for identifying a speaker in an audio-video source using both audio and video information. An audio-based speaker identification system identifies one or more potential speakers for a given segment using an enrolled speaker database. A video-based speaker identification system identifies one or more potential speakers for a given segment using a face detector/recognizer and an enrolled face database. An audio-video decision fusion process evaluates the individuals identified by the audio-based and video-based speaker identification systems and determines the speaker of an utterance in accordance with the present invention. A linear variation is imposed on the ranked-lists produced using the audio and video information. The decision fusion scheme of the present invention is based on a linear combination of the audio and the video ranked-lists. The line with the higher slope is assumed to convey more discriminative information. The normalized slopes of the two lines are used as the weight of the respective results when combining the scores from the audio-based and video-based speaker analysis. In this manner, the weights are derived from the data itself.
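
A sketch of the slope-based fusion rule described above: fit a line to each modality's ranked score list and use the normalised slopes as combination weights. The score scaling and tie handling are illustrative assumptions, not the patent's exact procedure:

    import numpy as np

    def fuse_audio_video(audio_scores, video_scores):
        # audio_scores / video_scores: dicts mapping candidate speaker -> score.
        def slope(scores):
            ranked = np.sort(np.array(list(scores.values())))[::-1]   # best first
            return abs(np.polyfit(np.arange(len(ranked)), ranked, 1)[0])
        sa, sv = slope(audio_scores), slope(video_scores)
        wa, wv = sa / (sa + sv), sv / (sa + sv)   # steeper (more discriminative) list gets more weight
        candidates = set(audio_scores) & set(video_scores)
        fused = {c: wa * audio_scores[c] + wv * video_scores[c] for c in candidates}
        return max(fused, key=fused.get)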

Proceedings Article
01 Jan 2000
TL;DR: Experimental results show that pitch information is not necessarily useful for rejection of synthetic speech, and it is required to develop techniques to discriminate synthetic speech from natural speech.
Abstract: This paper describes security of speaker verification systems against imposture using synthetic speech. We propose a text-prompted speaker verification technique which utilizes pitch information in addition to spectral information, and investigate whether synthetic speech is rejected. Experimental results show that pitch information is not necessarily useful for rejection of synthetic speech, and it is required to develop techniques to discriminate synthetic speech from natural speech.

Journal ArticleDOI
TL;DR: The speaker verification performance of human listeners was compared to that of computer algorithms/systems, and human performance in general seemed relatively robust to degradation.

Journal ArticleDOI
TL;DR: Two approaches to detecting and tracking speakers in multispeaker audio are described, using an adapted Gaussian mixture model-universal background model (GMM-UBM) speaker detection system as the core speaker recognition engine and an external segmentation algorithm based on blind clustering.

Proceedings ArticleDOI
05 Jun 2000
TL;DR: It was found that a low LPC order in GSM coding is responsible for most of the performance degradation; by extracting features directly from the encoded bit stream, a speaker recognition system is obtained that is equivalent in performance to the original one, which decodes and reanalyzes speech before performing recognition.
Abstract: This paper investigates the influence of GSM speech coding on text independent speaker recognition performance. The three existing GSM speech coder standards were considered. The whole TIMIT database was passed through these coders, obtaining three transcoded databases. In a first experiment, it was found that the use of GSM coding degrades significantly the identification and verification performance (performance in correspondence with the perceptual speech quality of each coder). In a second experiment, the features for the speaker recognition system were calculated directly from the information available in the encoded bit stream. It was found that a low LPC order in GSM coding is responsible for most performance degradations. By extracting the features directly from the encoded bit-stream, we also managed to obtain a speaker recognition system equivalent in performance to the original one which decodes and reanalyzes speech before performing recognition.
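
Since GSM coders transmit LPC-derived parameters, one way to build recognition features directly from the bit stream is the standard LPC-to-cepstrum recursion. The sketch below assumes predictor coefficients in the convention H(z) = G / (1 - Σ a_k z^{-k}) and is a generic textbook conversion, not the paper's exact front end:

    import numpy as np

    def lpc_to_cepstrum(a, n_ceps):
        # a: LPC coefficients a_1..a_p decoded from the bit stream.
        p = len(a)
        c = np.zeros(n_ceps + 1)
        for n in range(1, n_ceps + 1):
            acc = a[n - 1] if n <= p else 0.0
            for k in range(1, n):
                if n - k <= p:
                    acc += (k / n) * c[k] * a[n - k - 1]
            c[n] = acc
        return c[1:]   # LPC-cepstral coefficients c_1..c_{n_ceps}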

Patent
07 Jun 2000
TL;DR: In this article, a hierarchical speaker tree clustering system was used to identify speakers participating in an audio-video source, whether or not such speakers have been previously registered or enrolled, and a hierarchical enrolled speaker database, including one or more background models for unenrolled speakers, was used to assign a speaker to each identified segment.
Abstract: A method and apparatus are disclosed for identifying speakers participating in an audio-video source, whether or not such speakers have been previously registered or enrolled. A speaker segmentation system separates the speakers and identifies all possible frames where there is a segment boundary between non-homogeneous speech portions. A hierarchical speaker tree clustering system clusters homogeneous segments (generally corresponding to the same speaker), and assigns a cluster identifier to each detected segment, whether or not the actual name of the speaker is known. A hierarchical enrolled speaker database is used that includes one or more background models for unenrolled speakers to assign a speaker to each identified segment. Once speech segments are identified by the segmentation system, the disclosed unknown speaker identification system compares the segment utterances to the enrolled speaker database using a hierarchical approach and finds the “closest” speaker, if any, to assign a speaker label to each identified segment. A speech segment having an unknown speaker is initially assigned a general speaker label from a set of background models for speaker identification, such as “unenrolled male” or “unenrolled female.” The “unenrolled” segment is assigned a cluster identifier and is positioned in the hierarchical tree. Thus, the hierarchical speaker tree clustering system assigns a unique cluster identifier corresponding to a given node, for each speaker to further differentiate the general speaker labels.
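
A minimal stand-in for the clustering step: group homogeneous segments bottom-up by the distance between their mean feature vectors using SciPy. The distance, linkage rule, and stopping threshold are assumptions, not the patent's specific tree construction:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_speaker_segments(segments, distance_threshold):
        # segments: list of (frames x dims) arrays, one per homogeneous segment.
        representatives = np.array([seg.mean(axis=0) for seg in segments])
        tree = linkage(representatives, method='average', metric='euclidean')
        # The returned cluster identifiers play the role of the patent's per-node labels.
        return fcluster(tree, t=distance_threshold, criterion='distance')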

Patent
Roland Kuhn, Olivier Thyes, Patrick Nguyen, Jean-Claude Junqua, Robert C. Boman
05 Jul 2000
TL;DR: In this article, the speaker space can be constructed using training speakers that are entirely separate from the population of client speakers, from client speakers only, or from a mix of training and client speakers.
Abstract: Client speaker locations in a speaker space are used to generate speech models for comparison with test speaker data or test speaker speech models. The speaker space can be constructed using training speakers that are entirely separate from the population of client speakers, or from client speakers, or from a mix of training and client speakers. Reestimation of the speaker space based on client environment information is also provided to improve the likelihood that the client data will fall within the speaker space. During enrollment of the clients into the speaker space, additional client speech can be obtained when predetermined conditions are met. The speaker distribution can also be used in the client enrollment step.
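
One way to realise a speaker space of the kind described here is a principal-component basis over per-speaker supervectors. This eigenvoice-style sketch is an assumption about how such a space could be built and used, not the patent's construction:

    import numpy as np

    def build_speaker_space(supervectors, n_dims):
        # supervectors: (n_speakers x dims) matrix, one concatenated model per training speaker.
        mean = supervectors.mean(axis=0)
        _, _, vt = np.linalg.svd(supervectors - mean, full_matrices=False)
        return mean, vt[:n_dims]          # origin and basis of the speaker space

    def locate_speaker(supervector, mean, basis):
        # Coordinates of a client or test speaker inside the speaker space.
        return basis @ (supervector - mean)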

Journal ArticleDOI
TL;DR: An adaptation technique called speaker cluster weighting (SCW) provides a means for improving upon generic hierarchical speaker clustering techniques; a word error rate reduction of 20% over the baseline speaker-independent (SI) recognition system has been achieved.

Proceedings Article
01 Jan 2000
TL;DR: It is concluded that an evaluation of the promise of training ASV material on emotional speech requires in-depth analyses of the individual differences in vocal reactivity and further exploration of the link between acoustic changes under stress or emotion and verification results.
Abstract: The ongoing work described in this contribution attempts to demonstrate the need to train ASV algorithms on emotional speech, in addition to neutral speech, in order to achieve more robust results in real life verification situations. A computerized induction program with 6 different tasks, producing different types of stressful or emotional speaker states, was developed, pretested, and used to record French, German, and English speaking participants. For a subset of these speakers, physiological data were obtained to determine the degree of physiological arousal produced by the emotion inductions and to determine the correlation between physiological responses and voice production as revealed in acoustic parameters. In collaboration with a commercial ASV provider (Ensigma Ltd.), a standard verification procedure was applied to this speech material. This paper reports the first set of preliminary analyses for the subset of 30 German speakers. It is concluded that an evaluation of the promise of training ASV material on emotional speech requires in-depth analyses of the individual differences in vocal reactivity and further exploration of the link between acoustic changes under stress or emotion and verification results.

Proceedings Article
01 Jan 2000
TL;DR: This paper examines three algorithms for recognizing a speaker's emotion from speech signals, MLB, NN, and HMM, which achieved recognition rates of 68.9%, 69.3%, and 89.1%, respectively, for speaker-dependent and context-independent classification.
Abstract: This paper examines three algorithms to recognize a speaker's emotion using the speech signals. Target emotions are happiness, sadness, anger, fear, boredom and neutral state. MLB (Maximum-Likelihood Bayes), NN (Nearest Neighbor) and HMM (Hidden Markov Model) algorithms are used as the pattern matching techniques. In all cases, pitch and energy are used as the features. The feature vectors for MLB and NN are composed of pitch mean, pitch standard deviation, energy mean, energy standard deviation, etc. For HMM, vectors of delta pitch with delta-delta pitch and delta energy with delta-delta energy are used. A corpus of emotional speech data was recorded and the subjective evaluation of the data was performed by 23 untrained listeners. The subjective recognition result was 56% and was compared with the classifiers' recognition rates. MLB, NN, and HMM classifiers achieved recognition rates of 68.9%, 69.3%, and 89.1%, respectively, for the speaker-dependent and context-independent classification.
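
A small sketch of the feature extraction and nearest-neighbour matching described above, using only the utterance-level pitch and energy statistics named in the abstract (the array shapes and Euclidean distance are assumptions):

    import numpy as np

    def emotion_features(pitch, energy):
        # Utterance-level statistics of the pitch and energy contours.
        return np.array([np.mean(pitch), np.std(pitch), np.mean(energy), np.std(energy)])

    def nearest_neighbor_emotion(test_vector, train_vectors, train_labels):
        # train_vectors: (n_utterances x 4) matrix of emotion_features() outputs.
        distances = np.linalg.norm(train_vectors - test_vector, axis=1)
        return train_labels[int(np.argmin(distances))]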

Journal ArticleDOI
TL;DR: In this article, a speaker recognition task carried out by a close-knit network of speakers (university friends who have lived in shared accommodation with each other for two years) is presented.
Abstract: This article presents results from a speaker recognition task carried out by a close-knit network of speakers (university friends who have lived in shared accommodation with each other for two years). Ten male speakers recorded a scripted message on to an answer machine via a mobile telephone. Two foil speakers from outside the network were also recorded. Samples of between 8 and 10 seconds were extracted from all twelve recordings, and used as stimuli for an open speaker recognition test performed by the network members. Listeners varied widely in their performance, and one listener failed to recognize his own voice. Some of the voices were easy to identify, but several speakers were consistently misidentified, and one speaker was particularly hard to identify. Both of the foil speakers were sometimes mistaken for network members. Auditory analysis of the voices shows, as expected, that speakers with the most distinctive regional accents and other idiosyncratic features were the most consistently identified. Acoustic analysis of F0 was also undertaken. It was found that the speakers who were most consistently identified were those with relatively high and low mean F0 values, as well as those with the widest and narrowest overall F0 range. Speakers with average pitch values and ranges in the middle of the overall group values proved harder to identify. The findings support the view that average pitch is a robust diagnostic of speaker identity, not only for forensic phoneticians, but also for naive listeners. They furthermore demonstrate that naive speaker recognition, even among members of a close-knit social network, is not a task which can be achieved infallibly.


Proceedings ArticleDOI
05 Jun 2000
TL;DR: This paper uses classical adaptation approaches for the incremental training of client models in a speaker verification system using a segmental-EM procedure, and investigates the impact of various scenarios of impostor attacks during the incremental enrollment phase.
Abstract: Classical adaptation approaches are generally used for speaker or environment adaptation of speech recognition systems. In this paper, we use such techniques for the incremental training of client models in a speaker verification system. The initial model is trained on a very limited amount of data and then progressively updated with access data, using a segmental-EM procedure. In supervised mode (i.e. when access utterances are certified), the incremental approach yields performance equivalent to the batch one. We also investigate the impact of various scenarios of impostor attacks during the incremental enrollment phase. All results are obtained with the Picassoft platform, the state-of-the-art speaker verification system developed in the PICASSO project.
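
A sketch of one incremental update of a client GMM's means from newly accepted access data, written as a MAP-style interpolation between new sufficient statistics and the previous model. The paper's segmental-EM procedure is more elaborate, so this is only an illustration of the general idea:

    import numpy as np

    def incremental_mean_update(old_means, responsibilities, frames, relevance=16.0):
        # responsibilities: (n_frames x n_mix) posteriors of the new access data
        # under the current client model; frames: (n_frames x dims) feature vectors.
        soft_counts = responsibilities.sum(axis=0)
        first_order = responsibilities.T @ frames
        new_data_means = first_order / np.maximum(soft_counts, 1e-10)[:, None]
        alpha = (soft_counts / (soft_counts + relevance))[:, None]
        # Mixtures that saw little new data stay close to the previous model.
        return alpha * new_data_means + (1.0 - alpha) * old_means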

Proceedings Article
01 Jan 2000
TL;DR: It is shown that a more detailed modeling of adaptation classes and the use of confidence measures improve the adaptation performance, especially on the VERBMOBIL task, a German conversational speech corpus.
Abstract: Automatic recognition of conversational speech tends to have higher word error rates (WER) than read speech. Improvements gained from unsupervised speaker adaptation methods like Maximum Likelihood Linear Regression (MLLR) [1] are reduced because of their sensitivity to recognition errors in the first pass. We show that a more detailed modeling of adaptation classes and the use of confidence measures improve the adaptation performance. We present experimental results on the VERBMOBIL task, a German conversational speech corpus.

Patent
Stephane H. Maes
12 Apr 2000
TL;DR: In this article, feature vectors representing each of a plurality of overlapping frames of an arbitrary, text-independent speech signal are computed and compared to vector parameters and variances stored as codewords in one or more codebooks corresponding to each enrolled user, to provide speaker-dependent information for speech recognition and ambiguity resolution.
Abstract: Feature vectors representing each of a plurality of overlapping frames of an arbitrary, text independent speech signal are computed and compared to vector parameters and variances stored as codewords in one or more codebooks corresponding to each of one or more enrolled users to provide speaker dependent information for speech recognition and/or ambiguity resolution. Other information such as aliases and preferences of each enrolled user may also be enrolled and stored, for example, in a database. Correspondence of the feature vectors may be ranked by closeness of correspondence to a codeword entry and the number of frames corresponding to each codebook are accumulated or counted to identify a potential enrolled speaker. The differences between the parameters of the feature vectors and codewords in the codebooks can be used to identify a new speaker and an enrollment procedure can be initiated. Continuous authorization and access control can be carried out based on any utterance either by verification of the authorization of a speaker of a recognized command or comparison with authorized commands for the recognized speaker. Text independence also permits coherence checks to be carried out for commands to validate the recognition process.
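
A compact sketch of the frame-counting comparison described in the abstract: each frame's feature vector is matched to the globally nearest codeword, and the codebook owning the most frames identifies the putative enrolled speaker (the distance measure and data layout are assumptions):

    import numpy as np

    def identify_enrolled_speaker(frames, codebooks):
        # frames: (n_frames x dims); codebooks: dict speaker -> (n_codewords x dims).
        names = list(codebooks)
        all_words = np.vstack([codebooks[n] for n in names])
        owners = np.concatenate([[n] * len(codebooks[n]) for n in names])
        nearest = np.argmin(
            np.linalg.norm(frames[:, None, :] - all_words[None, :, :], axis=2), axis=1)
        counts = {n: int(np.sum(owners[nearest] == n)) for n in names}
        return max(counts, key=counts.get)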

PatentDOI
TL;DR: In this paper, a verification unit receives an utterance from a speaker, identifies a command associated with the utterance by performing speaker-independent recognition, and verifies the speaker's identity by comparing the utterance with a speaker verification template associated with the identified command.
Abstract: A method and apparatus for performing speaker verification with speaker verification templates are disclosed. A verification unit receives an utterance from a speaker. The verification unit identifies a command associated with the utterance by performing speaker independent recognition. If a speaker verification template associated with the identified command includes adequate verification data, the verification unit eliminates a prompt for a password and verifies the speaker identity by comparing the utterance with a speaker verification template associated with the identified command.

Patent
20 Mar 2000
TL;DR: In this paper, a method to transmit face images including the steps of: preparing a facial shape estimation unit receiving speech produced by a speaker and outputting a signal estimation the speaker's facial shape when he/she speaks, transmitting the speech from the transmitting side to the receiving side and applying it to the facial shape estimator, and generating a motion picture of the speaker facial shape based on the signal estimator's output.
Abstract: A method to transmit face images including the steps of: preparing a facial shape estimation unit receiving speech produced by a speaker and outputting a signal estimating the speaker's facial shape when he/she speaks; transmitting the speech produced by the speaker from the transmitting side to the receiving side and applying it to the facial shape estimation unit so as to estimate the speaker's facial shape; and generating a motion picture of the speaker's facial shape based on the estimating signal output by the facial shape estimation unit.