
Showing papers on "Speaker diarisation published in 2000"


Journal ArticleDOI
TL;DR: The major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs) are described.
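
A minimal sketch of the GMM-UBM scoring idea behind such systems, assuming feature matrices of MFCC-like frames and already-fitted scikit-learn GaussianMixture models for the target speaker and the universal background model (the function names and the zero threshold are illustrative, not the paper's):

    import numpy as np

    def llr_score(features, speaker_gmm, ubm):
        # Average per-frame log-likelihood ratio: speaker model vs. background model.
        # speaker_gmm and ubm are assumed to expose score_samples(), e.g. sklearn GMMs.
        return float(np.mean(speaker_gmm.score_samples(features) - ubm.score_samples(features)))

    def accept_claim(features, speaker_gmm, ubm, threshold=0.0):
        # Accept the identity claim when the ratio exceeds a tuned threshold.
        return llr_score(features, speaker_gmm, ubm) > threshold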

4,673 citations


Journal ArticleDOI
TL;DR: This paper proposes a new segmentation method, called DISTBIC, which combines two different segmentation techniques and is efficient in detecting speaker turns even when they are close to one another (i.e., separated by a few seconds).
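
DISTBIC combines a distance-based pre-selection of candidate turn points with a BIC test. Below is a minimal sketch of the standard ΔBIC criterion for deciding whether two adjacent windows of acoustic features were produced by one speaker or two; the penalty weight λ and the windowing are assumptions, not the paper's exact settings:

    import numpy as np

    def delta_bic(left, right, lam=1.0):
        # Positive values favour a speaker change between the two windows
        # (each window is a frames x dims array of acoustic features).
        both = np.vstack([left, right])
        n1, n2, n = len(left), len(right), len(both)
        d = both.shape[1]
        logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
        return 0.5 * (n * logdet(both) - n1 * logdet(left) - n2 * logdet(right)) - lam * penalty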

299 citations


Proceedings ArticleDOI
11 Dec 2000
TL;DR: A new technique for normalising the polynomial kernel is developed and used to achieve performance comparable to other classifiers on the YOHO database.
Abstract: The performance of the support vector machine (SVM) on a speaker verification task is assessed. Since speaker verification requires binary decisions, support vector machines seem to be a promising candidate to perform the task. A new technique for normalising the polynomial kernel is developed and used to achieve performance comparable to other classifiers on the YOHO database. We also present results on a speaker identification task.
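
One common way to normalise a polynomial kernel is to enforce unit self-similarity for every vector; the sketch below shows that general idea, which is not necessarily the specific normalisation proposed in the paper:

    import numpy as np

    def poly_kernel(x, y, degree=3, c=1.0):
        return (np.dot(x, y) + c) ** degree

    def normalised_poly_kernel(x, y, degree=3, c=1.0):
        # Cosine-style normalisation: K'(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y)),
        # which keeps SVM scores in a comparable range across speakers.
        return poly_kernel(x, y, degree, c) / np.sqrt(
            poly_kernel(x, x, degree, c) * poly_kernel(y, y, degree, c))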

250 citations



Journal ArticleDOI
01 Aug 2000
TL;DR: This paper describes some of the requisite speech and language technologies that would be required and introduces an effort aimed at integrating these technologies into a system, called Rough 'n' Ready, which indexes speech data, creates a structural summarization, and provides tools for browsing the stored data.
Abstract: With the advent of essentially unlimited data storage capabilities and with the proliferation of the use of the Internet, it becomes reasonable to imagine a world in which it would be possible to access any of the stored information at will with a few keystrokes or voice commands. Since much of this data will be in the form of speech from various sources, it becomes important to develop the technologies necessary for indexing and browsing such audio data. This paper describes some of the requisite speech and language technologies that would be required and introduces an effort aimed at integrating these technologies into a system, called Rough 'n' Ready, which indexes speech data, creates a structural summarization, and provides tools for browsing the stored data. The technologies highlighted in the paper include speaker-independent continuous speech recognition, speaker segmentation and identification, name spotting, topic classification, story segmentation, and information retrieval. The system automatically segments the continuous audio input stream by speaker, clusters audio segments from the same speaker, identifies speakers known to the system, and transcribes the spoken words. It also segments the input stream into stories, based on their topic content, and locates the names of persons, places, and organizations. These structural features are stored in a database and are used to construct highly selective search queries for retrieving specific content from large audio archives.

196 citations


Journal ArticleDOI
TL;DR: This article summarizes the 1999 NIST Speaker Recognition Evaluation, covering the overall research objectives, the three task definitions, the development and evaluation data sets, the specified performance measures and their manner of presentation, and the overall quality of the results.

167 citations


Patent
10 May 2000
TL;DR: In this article, a technique for adaptation of a speech recognizing system across multiple remote communication sessions with a speaker is presented. The technique obtains speech samples without requiring the speaker to engage in a training session.
Abstract: A technique for adaptation of a speech recognizing system across multiple remote communication sessions with a speaker. The speaker can be a telephone caller. An acoustic model is utilized for recognizing the speaker's speech. Upon initiation of a first remote session with the speaker, the acoustic model is speaker-independent. During the first session, the speaker is uniquely identified and speech samples are obtained from the speaker. In the preferred embodiment, the samples are obtained without requiring the speaker to engage in a training session. The acoustic model is then modified based upon the samples thereby forming a modified model. The model can be modified during the session or after the session is terminated. Upon termination of the session, the modified model is then stored in association with an identification of the speaker. During a subsequent remote session, the speaker is identified and, then, the modified acoustic model is utilized to recognize the speaker's speech. Additional speech samples are obtained during the subsequent session and, then, utilized to further modify the acoustic model. In this manner, an acoustic model utilized for recognizing the speech of a particular speaker is cumulatively modified according to speech samples obtained during multiple sessions with the speaker. As a result, the accuracy of the speech recognizing system improves for the speaker even when the speaker only engages in relatively short remote sessions.
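
A minimal sketch of the cross-session workflow described above, with a plain dictionary standing in for the stored per-speaker models and a re-fitted scikit-learn GMM standing in for the patent's acoustic-model modification (both stand-ins are assumptions for illustration only):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    stored_features = {}   # speaker_id -> accumulated speech samples (frames x dims)
    stored_models = {}     # speaker_id -> cumulatively adapted model

    def start_session(speaker_id, speaker_independent_model):
        # Use the speaker's adapted model if one exists, otherwise the SI model.
        return stored_models.get(speaker_id, speaker_independent_model)

    def end_session(speaker_id, session_features, n_components=8):
        # Accumulate this session's samples and re-estimate the speaker's model,
        # so accuracy improves even when individual sessions are short.
        prev = stored_features.get(speaker_id)
        data = session_features if prev is None else np.vstack([prev, session_features])
        stored_features[speaker_id] = data
        stored_models[speaker_id] = GaussianMixture(n_components, covariance_type='diag').fit(data)
        return stored_models[speaker_id]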

154 citations


Journal ArticleDOI
TL;DR: An approach to voice characteristics conversion for an HMM-based text-to-speech synthesis system using speaker interpolation, which can synthesize speech with various voice qualities without a large database in the synthesis phase.
Abstract: This paper describes an approach to voice characteristics conversion for an HMM-based text-to-speech synthesis system using speaker interpolation. Although most text-to-speech synthesis systems which synthesize speech by concatenating speech units can synthesize speech with acceptable quality, they still cannot synthesize speech with various voice qualities such as speaker individualities and emotions. In order to control speaker individualities and emotions, therefore, they need a large database, which records speech units with various voice characteristics, in the synthesis phase. On the other hand, our system synthesizes speech with an untrained speaker's voice quality by interpolating HMM parameters among some representative speakers' HMM sets. Accordingly, our system can synthesize speech with various voice qualities without a large database in the synthesis phase. An HMM interpolation technique is derived from a probabilistic similarity measure for HMMs, and used to synthesize speech with an untrained speaker's voice quality by interpolating HMM parameters among some representative speakers' HMM sets. The results of subjective experiments show that we can gradually change the voice quality of synthesized speech from one speaker's to the other's by changing the interpolation ratio.
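
The paper derives its interpolation from a probabilistic similarity measure between HMMs; the sketch below shows only the simplest form of the idea, a convex combination of corresponding Gaussian parameters across the representative speakers' HMM sets (an illustrative simplification, not the paper's exact formula):

    import numpy as np

    def interpolate_output_distributions(means_per_speaker, covs_per_speaker, weights):
        # means_per_speaker / covs_per_speaker: one array per representative speaker,
        # all with identical shapes (e.g. states x mixtures x dims).
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                       # interpolation ratios sum to one
        mean = sum(wi * m for wi, m in zip(w, means_per_speaker))
        cov = sum(wi * c for wi, c in zip(w, covs_per_speaker))
        return mean, cov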

140 citations


Proceedings Article
01 Jan 2000
TL;DR: This work proposes a novel statistical modeling and compensation method for robust speaker recognition that yields improvements similar to those of the HNORM score-based compensation method, but with a fraction of the training time.
Abstract: A novel statistical modeling and compensation method for robust speaker recognition is presented. The method specifically addresses the degradation in speaker verification performance due to the mismatch in channels (e.g., telephone handsets) between enrollment and testing sessions. In mismatched conditions, the new approach uses speaker-independent channel transformations to synthesize a speaker model that corresponds to the channel of the testing session. Effectively verification is always performed in matched channel conditions. Results on the 1998 NIST Speaker Recognition Evaluation corpus show that the new approach yields performance that matches the best reported results. Specifically, our approach yields similar improvements (19.9% reduction in EER compared to CMN alone) as the HNORM score-based compensation method, but with a fraction of the training time.
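
A sketch of the synthesis step under the assumption that the speaker-independent channel transformation can be represented as per-mixture mean offsets between channel-dependent background models; the method actually evaluated may use a richer transformation than this:

    import numpy as np

    def channel_offsets(ubm_means_enroll_channel, ubm_means_test_channel):
        # Speaker-independent transformation, estimated once from channel-labelled data.
        return ubm_means_test_channel - ubm_means_enroll_channel

    def synthesize_speaker_model(speaker_means, offsets):
        # Shift the enrolled speaker's mixture means toward the test-session channel,
        # so that verification is effectively performed in matched conditions.
        return speaker_means + offsets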

125 citations


Journal ArticleDOI
TL;DR: A new technique, verbal information verification (VIV), is proposed, in which spoken utterances of a claimed speaker are automatically verified against the key information in the speaker's registered profile to decide whether the claimed identity should be accepted or rejected.
Abstract: Traditional speaker authentication focuses on speaker verification (SV) and speaker identification, which is accomplished by matching the speaker's voice with his or her registered speech patterns. In this paper, we propose a new technique, verbal information verification (VIV), in which spoken utterances of a claimed speaker are automatically verified against the key (usually confidential) information in the speaker's registered profile to decide whether the claimed identity should be accepted or rejected. Using the proposed sequential procedure involving three question-response turns, we achieved an error-free result in a telephone speaker authentication experiment with 100 speakers. We further propose a speaker authentication system by combining VIV with SV. In the system, a user is verified by VIV in the first four to five accesses, usually from different acoustic environments. During these uses, one of the key questions pertains to a pass-phrase for SV. The VIV system collects and verifies the pass-phrase utterance for use as training data for speaker model construction. After a speaker-dependent model is constructed, the system then migrates to SV. This approach avoids the inconvenience of a formal enrollment procedure, ensures the quality of the training data for SV, and mitigates the mismatch caused by different acoustic environments between training and testing. Experiments showed that the proposed system improved the SV performance by over 40% in equal-error rate compared to a conventional SV system.

116 citations


Proceedings ArticleDOI
30 Jul 2000
TL;DR: An algorithm is implemented that classifies story segments into three Speaker Roles based on several content and duration features, and correctly classifies about 80% of segments when applied to ASR-derived transcriptions of broadcast data.
Abstract: Previous work has shown that providing information about story structure is critical for browsing audio broadcasts. We investigate the hypothesis that Speaker Role is an important cue to story structure. We implement an algorithm that classifies story segments into three Speaker Roles based on several content and duration features. The algorithm correctly classifies about 80% of segments (compared with a baseline frequency of 35.4%) when applied to ASR-derived transcriptions of broadcast data.

Journal ArticleDOI
TL;DR: The approach transforms features such as mel-cepstral features, log spectrum, and prosody-based features with a non-linear artificial neural network to maximize speaker recognition performance specifically in the setting of telephone handset mismatch between training and testing.

Proceedings ArticleDOI
05 Jun 2000
TL;DR: A speaker tracking system is built by using successively a speaker change detector and a speaker verification system to find, in a conversation between several persons, target speakers chosen from a set of enrolled users.
Abstract: A speaker tracking system (STS) is built by using successively a speaker change detector and a speaker verification system. The aim of the STS is to find, in a conversation between several persons (some of them having already enrolled and others being totally unknown), target speakers chosen from a set of enrolled users. In a first step, speech is segmented into homogeneous segments containing only one speaker, without any use of a priori knowledge about the speakers. Then, the resulting segments are checked to determine whether they belong to one of the target speakers. The system has been used in a NIST evaluation test with satisfactory results.
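
A compact sketch of the two-stage tracking idea: segment first, then verify each homogeneous segment against the enrolled target models. The GMM scoring and the threshold here are assumptions consistent with the GMM-UBM sketch earlier in this listing, not the exact system used in the evaluation:

    import numpy as np

    def track_targets(features, change_points, target_models, ubm, threshold=0.0):
        # change_points: frame indices produced by the speaker change detector.
        # target_models / ubm: fitted models exposing score_samples(), e.g. sklearn GMMs.
        bounds = [0] + list(change_points) + [len(features)]
        labels = []
        for start, end in zip(bounds[:-1], bounds[1:]):
            segment = features[start:end]
            scores = {name: float(np.mean(model.score_samples(segment) - ubm.score_samples(segment)))
                      for name, model in target_models.items()}
            best = max(scores, key=scores.get)
            # Unknown speakers get no label when no target passes verification.
            labels.append((start, end, best if scores[best] > threshold else None))
        return labels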

Patent
Fereydoun Maali, Mahesh Viswanathan
26 Apr 2000
TL;DR: In this article, a method and apparatus are disclosed for identifying the speaker of an utterance in an audio-video source using both audio and video information.
Abstract: A method and apparatus are disclosed for identifying a speaker in an audio-video source using both audio and video information. An audio-based speaker identification system identifies one or more potential speakers for a given segment using an enrolled speaker database. A video-based speaker identification system identifies one or more potential speakers for a given segment using a face detector/recognizer and an enrolled face database. An audio-video decision fusion process evaluates the individuals identified by the audio-based and video-based speaker identification systems and determines the speaker of an utterance in accordance with the present invention. A linear variation is imposed on the ranked-lists produced using the audio and video information. The decision fusion scheme of the present invention is based on a linear combination of the audio and the video ranked-lists. The line with the higher slope is assumed to convey more discriminative information. The normalized slopes of the two lines are used as the weight of the respective results when combining the scores from the audio-based and video-based speaker analysis. In this manner, the weights are derived from the data itself.
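
A sketch of the slope-based fusion rule described above: fit a line to each modality's ranked score list and use the normalised slopes as combination weights. The score scaling and tie handling are illustrative assumptions, not the patent's exact procedure:

    import numpy as np

    def fuse_audio_video(audio_scores, video_scores):
        # audio_scores / video_scores: dicts mapping candidate speaker -> score.
        def slope(scores):
            ranked = np.sort(np.array(list(scores.values())))[::-1]   # best first
            return abs(np.polyfit(np.arange(len(ranked)), ranked, 1)[0])
        sa, sv = slope(audio_scores), slope(video_scores)
        wa, wv = sa / (sa + sv), sv / (sa + sv)   # steeper (more discriminative) list gets more weight
        candidates = set(audio_scores) & set(video_scores)
        fused = {c: wa * audio_scores[c] + wv * video_scores[c] for c in candidates}
        return max(fused, key=fused.get)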

Proceedings Article
01 Jan 2000
TL;DR: Experimental results show that pitch information is not necessarily useful for rejection of synthetic speech, and it is required to develop techniques to discriminate synthetic speech from natural speech.
Abstract: This paper describes security of speaker verification systems against imposture using synthetic speech. We propose a text-prompted speaker verification technique which utilizes pitch information in addition to spectral information, and investigate whether synthetic speech is rejected. Experimental results show that pitch information is not necessarily useful for rejection of synthetic speech, and it is required to develop techniques to discriminate synthetic speech from natural speech.

Journal ArticleDOI
TL;DR: The speaker verification performance of human listeners was compared to that of computer algorithms/systems, and human performance in general seemed relatively robust to degradation.

Journal ArticleDOI
TL;DR: Two approaches to detecting and tracking speakers in multispeaker audio are described, using an adapted Gaussian mixture model-universal background model (GMM-UBM) speaker detection system as the core speaker recognition engine and an external segmentation algorithm based on blind clustering.

Proceedings ArticleDOI
05 Jun 2000
TL;DR: It was found that a low LPC order in GSM coding is responsible for most of the performance degradation; by extracting features directly from the encoded bit stream, a speaker recognition system is obtained that is equivalent in performance to the original one, which decodes and reanalyzes speech before performing recognition.
Abstract: This paper investigates the influence of GSM speech coding on text independent speaker recognition performance. The three existing GSM speech coder standards were considered. The whole TIMIT database was passed through these coders, obtaining three transcoded databases. In a first experiment, it was found that the use of GSM coding degrades significantly the identification and verification performance (performance in correspondence with the perceptual speech quality of each coder). In a second experiment, the features for the speaker recognition system were calculated directly from the information available in the encoded bit stream. It was found that a low LPC order in GSM coding is responsible for most performance degradations. By extracting the features directly from the encoded bit-stream, we also managed to obtain a speaker recognition system equivalent in performance to the original one which decodes and reanalyzes speech before performing recognition.
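
Since GSM coders transmit LPC-derived parameters, one way to build recognition features directly from the bit stream is the standard LPC-to-cepstrum recursion. The sketch below assumes predictor coefficients in the convention H(z) = G / (1 - Σ a_k z^{-k}) and is a generic textbook conversion, not the paper's exact front end:

    import numpy as np

    def lpc_to_cepstrum(a, n_ceps):
        # a: LPC coefficients a_1..a_p decoded from the bit stream.
        p = len(a)
        c = np.zeros(n_ceps + 1)
        for n in range(1, n_ceps + 1):
            acc = a[n - 1] if n <= p else 0.0
            for k in range(1, n):
                if n - k <= p:
                    acc += (k / n) * c[k] * a[n - k - 1]
            c[n] = acc
        return c[1:]   # LPC-cepstral coefficients c_1..c_{n_ceps}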

Patent
07 Jun 2000
TL;DR: In this article, a hierarchical speaker tree clustering system was used to identify speakers participating in an audio-video source, whether or not such speakers have been previously registered or enrolled, and a hierarchical enrolled speaker database, including one or more background models for unenrolled speakers, was used to assign a speaker to each identified segment.
Abstract: A method and apparatus are disclosed for identifying speakers participating in an audio-video source, whether or not such speakers have been previously registered or enrolled. A speaker segmentation system separates the speakers and identifies all possible frames where there is a segment boundary between non-homogeneous speech portions. A hierarchical speaker tree clustering system clusters homogeneous segments (generally corresponding to the same speaker), and assigns a cluster identifier to each detected segment, whether or not the actual name of the speaker is known. A hierarchical enrolled speaker database is used that includes one or more background models for unenrolled speakers to assign a speaker to each identified segment. Once speech segments are identified by the segmentation system, the disclosed unknown speaker identification system compares the segment utterances to the enrolled speaker database using a hierarchical approach and finds the “closest” speaker, if any, to assign a speaker label to each identified segment. A speech segment having an unknown speaker is initially assigned a general speaker label from a set of background models for speaker identification, such as “unenrolled male” or “unenrolled female.” The “unenrolled” segment is assigned a cluster identifier and is positioned in the hierarchical tree. Thus, the hierarchical speaker tree clustering system assigns a unique cluster identifier corresponding to a given node, for each speaker to further differentiate the general speaker labels.
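
A minimal stand-in for the clustering step: group homogeneous segments bottom-up by the distance between their mean feature vectors using SciPy. The distance, linkage rule, and stopping threshold are assumptions, not the patent's specific tree construction:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_speaker_segments(segments, distance_threshold):
        # segments: list of (frames x dims) arrays, one per homogeneous segment.
        representatives = np.array([seg.mean(axis=0) for seg in segments])
        tree = linkage(representatives, method='average', metric='euclidean')
        # The returned cluster identifiers play the role of the patent's per-node labels.
        return fcluster(tree, t=distance_threshold, criterion='distance')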

Patent
Roland Kuhn, Olivier Thyes, Patrick Nguyen, Jean-Claude Junqua, Robert C. Boman
05 Jul 2000
TL;DR: In this article, the speaker space can be constructed using training speakers that are entirely separate from the population of client speakers, from client speakers only, or from a mix of training and client speakers.
Abstract: Client speaker locations in a speaker space are used to generate speech models for comparison with test speaker data or test speaker speech models. The speaker space can be constructed using training speakers that are entirely separate from the population of client speakers, or from client speakers, or from a mix of training and client speakers. Reestimation of the speaker space based on client environment information is also provided to improve the likelihood that the client data will fall within the speaker space. During enrollment of the clients into the speaker space, additional client speech can be obtained when predetermined conditions are met. The speaker distribution can also be used in the client enrollment step.
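
One way to realise a speaker space of the kind described here is a principal-component basis over per-speaker supervectors. This eigenvoice-style sketch is an assumption about how such a space could be built and used, not the patent's construction:

    import numpy as np

    def build_speaker_space(supervectors, n_dims):
        # supervectors: (n_speakers x dims) matrix, one concatenated model per training speaker.
        mean = supervectors.mean(axis=0)
        _, _, vt = np.linalg.svd(supervectors - mean, full_matrices=False)
        return mean, vt[:n_dims]          # origin and basis of the speaker space

    def locate_speaker(supervector, mean, basis):
        # Coordinates of a client or test speaker inside the speaker space.
        return basis @ (supervector - mean)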

Journal ArticleDOI
TL;DR: An adaptation technique called speaker cluster weighting (SCW) provides a means for improving upon generic hierarchical speaker clustering techniques; a word error rate reduction of 20% over the baseline speaker-independent (SI) recognition system has been achieved.

Proceedings Article
01 Jan 2000
TL;DR: It is concluded that an evaluation of the promise of training ASV material on emotional speech requires in-depth analyses of the individual differences in vocal reactivity and further exploration of the link between acoustic changes under stress or emotion and verification results.
Abstract: The ongoing work described in this contribution attempts to demonstrate the need to train ASV algorithms on emotional speech, in addition to neutral speech, in order to achieve more robust results in real life verification situations. A computerized induction program with 6 different tasks, producing different types of stressful or emotional speaker states, was developed, pretested, and used to record French, German, and English speaking participants. For a subset of these speakers, physiological data were obtained to determine the degree of physiological arousal produced by the emotion inductions and to determine the correlation between physiological responses and voice production as revealed in acoustic parameters. In collaboration with a commercial ASV provider (Ensigma Ltd.), a standard verification procedure was applied to this speech material. This paper reports the first set of preliminary analyses for the subset of 30 German speakers. It is concluded that an evaluation of the promise of training ASV material on emotional speech requires in-depth analyses of the individual differences in vocal reactivity and further exploration of the link between acoustic changes under stress or emotion and verification results.

Proceedings Article
01 Jan 2000
TL;DR: This paper examines three algorithms for recognizing a speaker's emotion from speech signals, MLB, NN, and HMM, which achieved recognition rates of 68.9%, 69.3%, and 89.1%, respectively, for speaker-dependent and context-independent classification.
Abstract: This paper examines three algorithms to recognize a speaker's emotion using the speech signals. Target emotions are happiness, sadness, anger, fear, boredom and neutral state. MLB (Maximum-Likelihood Bayes), NN (Nearest Neighbor) and HMM (Hidden Markov Model) algorithms are used as the pattern matching techniques. In all cases, pitch and energy are used as the features. The feature vectors for MLB and NN are composed of pitch mean, pitch standard deviation, energy mean, energy standard deviation, etc. For HMM, vectors of delta pitch with delta-delta pitch and delta energy with delta-delta energy are used. A corpus of emotional speech data was recorded and the subjective evaluation of the data was performed by 23 untrained listeners. The subjective recognition result was 56% and was compared with the classifiers' recognition rates. MLB, NN, and HMM classifiers achieved recognition rates of 68.9%, 69.3%, and 89.1%, respectively, for the speaker-dependent and context-independent classification.
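
A small sketch of the feature extraction and nearest-neighbour matching described above, using only the utterance-level pitch and energy statistics named in the abstract (the array shapes and Euclidean distance are assumptions):

    import numpy as np

    def emotion_features(pitch, energy):
        # Utterance-level statistics of the pitch and energy contours.
        return np.array([np.mean(pitch), np.std(pitch), np.mean(energy), np.std(energy)])

    def nearest_neighbor_emotion(test_vector, train_vectors, train_labels):
        # train_vectors: (n_utterances x 4) matrix of emotion_features() outputs.
        distances = np.linalg.norm(train_vectors - test_vector, axis=1)
        return train_labels[int(np.argmin(distances))]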

Journal ArticleDOI
TL;DR: In this article, a speaker recognition task carried out by a close-knit network of speakers (university friends who have lived in shared accommodation with each other for two years) is presented.
Abstract: This article presents results from a speaker recognition task carried out by a close-knit network of speakers (university friends who have lived in shared accommodation with each other for two years). Ten male speakers recorded a scripted message on to an answer machine via a mobile telephone. Two foil speakers from outside the network were also recorded. Samples of between 8 and 10 seconds were extracted from all twelve recordings, and used as stimuli for an open speaker recognition test performed by the network members. Listeners varied widely in their performance, and one listener failed to recognize his own voice. Some of the voices were easy to identify, but several speakers were consistently misidentified, and one speaker was particularly hard to identify. Both of the foil speakers were sometimes mistaken for network members. Auditory analysis of the voices shows, as expected, that speakers with the most distinctive regional accents and other idiosyncratic features were the most consistently identified. Acoustic analysis of F0 was also undertaken. It was found that the speakers who were most consistently identified were those with relatively high and low mean F0 values, as well as those with the widest and narrowest overall F0 range. Speakers with average pitch values and ranges in the middle of the overall group values proved harder to identify. The findings support the view that average pitch is a robust diagnostic of speaker identity, not only for forensic phoneticians, but also for naive listeners. They furthermore demonstrate that naive speaker recognition, even among members of a close-knit social network, is not a task which can be achieved infallibly.


Proceedings ArticleDOI
05 Jun 2000
TL;DR: This paper uses classical adaptation approaches for the incremental training of client models in a speaker verification system using a segmental-EM procedure, and investigates the impact of various scenarios of impostor attacks during the incremental enrollment phase.
Abstract: Classical adaptation approaches are generally used for speaker or environment adaptation of speech recognition systems. In this paper, we use such techniques for the incremental training of client models in a speaker verification system. The initial model is trained on a very limited amount of data and then progressively updated with access data, using a segmental-EM procedure. In supervised mode (i.e. when access utterances are certified), the incremental approach yields performance equivalent to the batch one. We also investigate the impact of various scenarios of impostor attacks during the incremental enrollment phase. All results are obtained with the Picassoft platform, the state-of-the-art speaker verification system developed in the PICASSO project.
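
A sketch of one incremental update of a client GMM's means from newly accepted access data, written as a MAP-style interpolation between new sufficient statistics and the previous model. The paper's segmental-EM procedure is more elaborate, so this is only an illustration of the general idea:

    import numpy as np

    def incremental_mean_update(old_means, responsibilities, frames, relevance=16.0):
        # responsibilities: (n_frames x n_mix) posteriors of the new access data
        # under the current client model; frames: (n_frames x dims) feature vectors.
        soft_counts = responsibilities.sum(axis=0)
        first_order = responsibilities.T @ frames
        new_data_means = first_order / np.maximum(soft_counts, 1e-10)[:, None]
        alpha = (soft_counts / (soft_counts + relevance))[:, None]
        # Mixtures that saw little new data stay close to the previous model.
        return alpha * new_data_means + (1.0 - alpha) * old_means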

Proceedings Article
01 Jan 2000
TL;DR: It is shown that a more detailed modeling of adaptation classes and the use of confidence measures improve the adaptation performance, especially on the VERBMOBIL task, a German conversational speech corpus.
Abstract: Automatic recognition of conversational speech tends to have higher word error rates (WER) than read speech. Improvements gained from unsupervised speaker adaptation methods like Maximum Likelihood Linear Regression (MLLR) [1] are reduced because of their sensitivity to recognition errors in the first pass. We show that a more detailed modeling of adaptation classes and the use of confidence measures improve the adaptation performance. We present experimental results on the VERBMOBIL task, a German conversational speech corpus.

Patent
Stephane H. Maes
12 Apr 2000
TL;DR: In this article, feature vectors representing each of a plurality of overlapping frames of an arbitrary, text-independent speech signal are computed and compared to vector parameters and variances stored as codewords in one or more codebooks corresponding to each enrolled user, to provide speaker-dependent information for speech recognition and ambiguity resolution.
Abstract: Feature vectors representing each of a plurality of overlapping frames of an arbitrary, text independent speech signal are computed and compared to vector parameters and variances stored as codewords in one or more codebooks corresponding to each of one or more enrolled users to provide speaker dependent information for speech recognition and/or ambiguity resolution. Other information such as aliases and preferences of each enrolled user may also be enrolled and stored, for example, in a database. Correspondence of the feature vectors may be ranked by closeness of correspondence to a codeword entry and the number of frames corresponding to each codebook are accumulated or counted to identify a potential enrolled speaker. The differences between the parameters of the feature vectors and codewords in the codebooks can be used to identify a new speaker and an enrollment procedure can be initiated. Continuous authorization and access control can be carried out based on any utterance either by verification of the authorization of a speaker of a recognized command or comparison with authorized commands for the recognized speaker. Text independence also permits coherence checks to be carried out for commands to validate the recognition process.
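
A compact sketch of the frame-counting comparison described in the abstract: each frame's feature vector is matched to the globally nearest codeword, and the codebook owning the most frames identifies the putative enrolled speaker (the distance measure and data layout are assumptions):

    import numpy as np

    def identify_enrolled_speaker(frames, codebooks):
        # frames: (n_frames x dims); codebooks: dict speaker -> (n_codewords x dims).
        names = list(codebooks)
        all_words = np.vstack([codebooks[n] for n in names])
        owners = np.concatenate([[n] * len(codebooks[n]) for n in names])
        nearest = np.argmin(
            np.linalg.norm(frames[:, None, :] - all_words[None, :, :], axis=2), axis=1)
        counts = {n: int(np.sum(owners[nearest] == n)) for n in names}
        return max(counts, key=counts.get)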

PatentDOI
TL;DR: In this paper, a verification unit receives an utterance from a speaker, identifies a command associated with the utterance by performing speaker-independent recognition, and verifies the speaker's identity by comparing the utterance with a speaker verification template associated with the identified command.
Abstract: A method and apparatus for performing speaker verification with speaker verification templates are disclosed. A verification unit receives an utterance from a speaker. The verification unit identifies a command associated with the utterance by performing speaker independent recognition. If a speaker verification template associated with the identified command includes adequate verification data, the verification unit eliminates a prompt for a password and verifies the speaker identity by comparing the utterance with a speaker verification template associated with the identified command.

Patent
20 Mar 2000
TL;DR: In this paper, a method to transmit face images including the steps of: preparing a facial shape estimation unit receiving speech produced by a speaker and outputting a signal estimation the speaker's facial shape when he/she speaks, transmitting the speech from the transmitting side to the receiving side and applying it to the facial shape estimator, and generating a motion picture of the speaker facial shape based on the signal estimator's output.
Abstract: A method to transmit face images including the steps of: preparing a facial shape estimation unit receiving speech produced by a speaker and outputting a signal estimating the speaker's facial shape when he/she speaks; transmitting the speech produced by the speaker from the transmitting side to the receiving side and applying it to the facial shape estimation unit so as to estimate the speaker's facial shape; and generating a motion picture of the speaker's facial shape based on the estimating signal output by the facial shape estimation unit.