
Showing papers on "Speaker diarisation" published in 1999


Book
01 Aug 1999
TL;DR: Automatic Speech and Speaker Recognition: Advanced Topics groups together in a single volume a number of topics on speech and speaker recognition that are of fundamental importance but not yet covered in detail in existing textbooks.
Abstract: Research in the field of automatic speech and speaker recognition has made a number of significant advances in the last two decades, influenced by advances in signal processing, algorithms, architectures, and hardware. These advances include: the adoption of a statistical pattern recognition paradigm; the use of the hidden Markov modeling framework to characterize both the spectral and the temporal variations in the speech signal; the use of a large set of speech utterance examples from a large population of speakers to train the hidden Markov models of some fundamental speech units; the organization of speech and language knowledge sources into a structural finite state network; and the use of dynamic-programming-based heuristic search methods to find the best word sequence in the lexical network corresponding to the spoken utterance. Automatic Speech and Speaker Recognition: Advanced Topics groups together in a single volume a number of topics on speech and speaker recognition that are of fundamental importance but not yet covered in detail in existing textbooks. Although no explicit partition is given, the book is divided into five parts: Chapters 1-2 are devoted to technology overviews; Chapters 3-12 discuss acoustic modeling of fundamental speech units and lexical modeling of words and pronunciations; Chapters 13-15 address the issues related to flexibility and robustness; Chapters 16-18 concern the theoretical and practical issues of search; Chapters 19-20 give two examples of algorithmic and implementational aspects of recognition system realization. Audience: A reference book for speech researchers and graduate students interested in pursuing potential research on the topic. May also be used as a text for advanced courses on the subject.

242 citations


Patent
Jonathan Foote, Lynn D. Wilcox
11 Mar 1999
TL;DR: In this paper, a method for segmenting audio-video recordings of meetings containing slide presentations by one or more speakers is described, under the assumption that only one person is speaking during an interval when slides are displayed in the video.
Abstract: Methods for segmenting audio-video recordings of meetings containing slide presentations by one or more speakers are described. These segments serve as indexes into the recorded meeting. If an agenda is provided for the meeting, these segments can be labeled using information from the agenda. The system automatically detects intervals of video that correspond to presentation slides. Under the assumption that only one person is speaking during an interval when slides are displayed in the video, possible speaker intervals are extracted from the audio soundtrack. Since the same speaker may talk across multiple slide intervals, the acoustic data from these intervals is clustered to yield an estimate of the number of distinct speakers and their order. Merged clustered audio intervals corresponding to a single speaker are then used as training data for a speaker segmentation system. Using speaker identification techniques, the full video is then segmented into individual presentations based on the extent of each presenter's speech. The speaker identification system optionally includes the construction of a hidden Markov model trained on the audio data from each slide interval. A Viterbi assignment then segments the audio according to speaker.

208 citations
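
The clustering step in the abstract above lends itself to a small sketch. The following Python is illustrative only, not the patent's implementation: the features (e.g. per-interval MFCC means) and the merge threshold are assumptions.

# Illustrative sketch: agglomerative clustering of per-slide-interval audio features
# to estimate the number of distinct speakers. Features and threshold are assumed.
import numpy as np

def cluster_intervals(interval_features, threshold=2.0):
    """interval_features: list of (n_frames, dim) arrays, one per slide interval.
    Returns a cluster label per interval; intervals sharing a label are treated as the
    same speaker and later merged into that speaker's training data."""
    means = [f.mean(axis=0) for f in interval_features]
    clusters = [[i] for i in range(len(means))]            # start: one cluster per interval

    def centroid(c):
        return np.mean([means[i] for i in c], axis=0)

    while len(clusters) > 1:
        # find the closest pair of clusters (Euclidean distance between centroids)
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroid(clusters[a]) - centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:                            # stop when no pair is close enough
            break
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]

    labels = np.empty(len(means), dtype=int)
    for k, c in enumerate(clusters):
        labels[list(c)] = k
    return labels                                          # number of clusters ~ number of speakers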


Journal ArticleDOI
TL;DR: A novel method is proposed which finds accurate alignments between source and target speaker utterances and modifies the utterance of a source speaker to sound like speech from a target speaker.

181 citations


Patent
TL;DR: In this paper, speech models are constructed and trained upon the speech of known client speakers (and also impostor speakers, in the case of speaker verification); parameters from these models are concatenated to define supervectors, and a linear transformation upon these supervectors results in a dimensionality reduction yielding a low-dimensional space called eigenspace.
Abstract: Speech models are constructed and trained upon the speech of known client speakers (and also impostor speakers, in the case of speaker verification). Parameters from these models are concatenated to define supervectors and a linear transformation upon these supervectors results in a dimensionality reduction yielding a low-dimensional space called eigenspace. The training speakers are then represented as points or distributions in eigenspace. Thereafter, new speech data from the test speaker is placed into eigenspace through a similar linear transformation and the proximity in eigenspace of the test speaker to the training speakers serves to authenticate or identify the test speaker.

162 citations
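
The eigenspace idea above can be sketched in a few lines. This is a hedged illustration, not the patent's implementation: per-speaker supervectors are assumed to be given (e.g. concatenated model means), and PCA via SVD stands in for the linear transformation.

# Hedged sketch: concatenate model parameters into supervectors, reduce dimensionality
# with PCA (a linear transformation), then compare the test speaker to training
# speakers by proximity in the low-dimensional eigenspace.
import numpy as np

def build_eigenspace(supervectors, k):
    """supervectors: (n_speakers, D) array, one concatenated parameter vector per speaker."""
    mean = supervectors.mean(axis=0)
    _, _, vt = np.linalg.svd(supervectors - mean, full_matrices=False)
    basis = vt[:k]                                # k eigenvectors spanning the eigenspace
    points = (supervectors - mean) @ basis.T      # training speakers as points in eigenspace
    return mean, basis, points

def identify(test_supervector, mean, basis, points, speaker_ids):
    p = (test_supervector - mean) @ basis.T       # same linear transformation for the test speaker
    dists = np.linalg.norm(points - p, axis=1)
    return speaker_ids[int(np.argmin(dists))], dists.min()   # closest training speaker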


Patent
09 Apr 1999
TL;DR: In this paper, a query search system retrieves information responsive to a textual query containing a text string (one or more keywords) and the identity of a given speaker; the results of content-based and speaker-based audio information retrieval methods are combined to provide references to audio information (and indirectly to video).
Abstract: Methods and apparatus are provided for retrieving audio information based on the audio content as well as the identity of the speaker. The results of content and speaker-based audio information retrieval methods are combined to provide references to audio information (and indirectly to video). A query search system retrieves information responsive to a textual query containing a text string (one or more key words), and the identity of a given speaker. An indexing system transcribes and indexes the audio information to create time-stamped content index file(s) and speaker index file(s). An audio retrieval system uses the generated content and speaker indexes to perform query-document matching based on the audio content and the speaker identity. Documents satisfying the user-specified content and speaker constraints are identified by comparing the start and end times of the document segments in both the content and speaker domains. Documents satisfying the user-specified content and speaker constraints are assigned a combined score that can be used in accordance with the present invention to rank-order the identified documents returned to the user, with the best-matched segments at the top of the list.

127 citations
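
The query-document matching step described above, comparing start and end times in the content and speaker domains and combining the scores, can be sketched as follows. The field names and the linear weighting are assumptions for illustration, not the patent's exact scoring rule.

# Illustrative sketch of combining content and speaker indexes: a segment is returned
# only if a keyword hit and the requested speaker's segment overlap in time, and
# results are ranked by an (assumed) weighted combination of the two scores.
def combined_retrieval(content_hits, speaker_segments, alpha=0.5):
    """content_hits:     list of dicts {'start', 'end', 'score'} from the content index.
    speaker_segments: list of dicts {'start', 'end', 'score'} for the queried speaker."""
    results = []
    for h in content_hits:
        for s in speaker_segments:
            start, end = max(h['start'], s['start']), min(h['end'], s['end'])
            if start < end:                                # the two segments overlap in time
                score = alpha * h['score'] + (1 - alpha) * s['score']
                results.append({'start': start, 'end': end, 'score': score})
    return sorted(results, key=lambda r: r['score'], reverse=True)   # best matches first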


01 Jan 1999
TL;DR: In this project, the vector quantization approach will be used due to its ease of implementation and high accuracy.
Abstract: Speaker identification is a difficult task, and the task has several different approaches. The state of the art for speaker identification techniques includes dynamic time warping (DTW) template matching, hidden Markov modeling (HMM), and codebook schemes based on vector quantization (VQ) [2]. In this project, the vector quantization approach will be used, due to its ease of implementation and high accuracy [7].

124 citations
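
A minimal sketch of the codebook scheme mentioned above: train one VQ codebook per enrolled speaker and identify a test utterance by the codebook with the lowest average quantization distortion. The codebook size and the choice of features are assumptions, not values from the report.

# Minimal sketch of VQ-based speaker identification: one codebook per speaker,
# identification by minimum average quantization distortion. Codebook size is assumed.
import numpy as np
from scipy.cluster.vq import kmeans, vq

def train_codebooks(training_features, codebook_size=64):
    """training_features: dict speaker_id -> (n_frames, dim) array of features (e.g. MFCCs)."""
    return {spk: kmeans(feats.astype(float), codebook_size)[0]
            for spk, feats in training_features.items()}

def identify(test_features, codebooks):
    distortions = {}
    for spk, cb in codebooks.items():
        _, dist = vq(test_features.astype(float), cb)   # distance of each frame to its nearest codeword
        distortions[spk] = dist.mean()
    return min(distortions, key=distortions.get)        # speaker with lowest average distortion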


Journal ArticleDOI
TL;DR: The paper concentrates on two approaches: extracting features that are robust against channel variations, and transforming the speaker models to compensate for channel effects; combined, these resulted in a 38% relative improvement on the closed-set 30-s training, 5-s testing condition of the NIST'95 Evaluation task.
Abstract: This paper addresses the issue of closed-set text-independent speaker identification from samples of speech recorded over the telephone. It focuses on the effects of acoustic mismatches between training and testing data, and concentrates on two approaches: (1) extracting features that are robust against channel variations and (2) transforming the speaker models to compensate for channel effects. First, an experimental study shows that optimizing the front end processing of the speech signal can significantly improve speaker recognition performance. A new filterbank design is introduced to improve the robustness of the speech spectrum computation in the front-end unit. Next, a new feature based on spectral slopes is described. Its ability to discriminate between speakers is shown to be superior to that of the traditional cepstrum. This feature can be used alone or combined with the cepstrum. The second part of the paper presents two model transformation methods that further reduce channel effects. These methods make use of a locally collected stereo database to estimate a speaker-independent variance transformation for each speech feature used by the classifier. The transformations constructed on this stereo database can then be applied to speaker models derived from other databases. Combined, the methods developed in this paper resulted in a 38% relative improvement on the closed-set 30-s training 5-s testing condition of the NIST'95 Evaluation task, after cepstral mean removal.

97 citations
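
The model-transformation idea, estimating a speaker-independent variance transformation from a locally collected stereo database and applying it to speaker models trained elsewhere, can be illustrated as below. The specific form (a per-dimension variance scaling) is an assumption; the paper's transformation may differ.

# Hedged sketch of a speaker-independent variance transformation estimated from a
# stereo database (the same utterances recorded clean and over the telephone channel).
import numpy as np

def estimate_variance_transform(clean_feats, channel_feats):
    """Both arguments: (n_frames, dim) feature arrays from the stereo database."""
    return channel_feats.var(axis=0) / clean_feats.var(axis=0)   # per-dimension scale factors

def transform_speaker_model(means, variances, scale):
    """Apply the channel compensation to a diagonal-covariance speaker model
    that was trained on a different (clean) database."""
    return means, variances * scale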


Proceedings Article
01 Jan 1999
TL;DR: It is shown that significant speed-ups can be obtained while sacrificing surprisingly little accuracy, and it is expected that these techniques, involving lowering model order as well as processing fewer speech frames, will apply equally well to other recognition systems.
Abstract: The Gaussian mixture model-universal background model (GMM-UBM) speaker recognition system has demonstrated very high performance in several NIST evaluations. Such evaluations, however, are concerned only with classification accuracy. In many applications, system effectiveness must be evaluated in light of both accuracy and execution speed. We present here a number of techniques for decreasing computation. Using data from the Switchboard telephone speech corpus, we show that significant speed-ups can be obtained while sacrificing surprisingly little accuracy. We expect that these techniques, involving lowering model order as well as processing fewer speech frames, will apply equally well to other recognition systems.

91 citations
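
The two computation-reduction ideas named in the abstract above, lowering model order and processing fewer speech frames, can be sketched as scoring only every Nth frame against a GMM with fewer mixtures. Diagonal covariances are assumed, and the paper's other speed-up tricks are not reproduced here.

# Sketch of the two speed-up ideas named above: score only every `decimation`-th frame,
# against a GMM whose order (number of mixtures) has been reduced beforehand.
import numpy as np

def gmm_loglik(frames, weights, means, variances, decimation=1):
    """frames: (T, d); weights: (M,); means, variances: (M, d). Returns the average
    frame log-likelihood using only every `decimation`-th frame."""
    x = frames[::decimation]                                  # process fewer speech frames
    # log N(x | mu_m, diag(var_m)) for every frame/mixture pair
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)
    diff = x[:, None, :] - means[None, :, :]                  # (T', M, d)
    exps = -0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_px = np.logaddexp.reduce(np.log(weights) + log_norm + exps, axis=1)
    return log_px.mean()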


Proceedings ArticleDOI
15 Mar 1999
TL;DR: This paper investigates the relative sensitivity of a Gaussian mixture model (GMM) based voice verification algorithm to computer voice-altered imposters.
Abstract: This paper investigates the relative sensitivity of a Gaussian mixture model (GMM) based voice verification algorithm to computer voice-altered imposters. First, a new trainable speech synthesis algorithm based on trajectory models of the speech line spectral frequency (LSF) parameters is presented in order to model the spectral characteristics of a target voice. A GMM based speaker verifier is then constructed for the 138 speaker YOHO database and shown to have an initial equal-error rate (EER) of 1.45% for the case of casual imposter attempts using a single combination-lock phrase test. Next, imposter voices are automatically altered using the synthesis algorithm to mimic the customer's voice. After voice transformation, the false acceptance rate is shown to increase from 1.45% to over 86% if the baseline EER threshold is left unmodified. Furthermore, at a customer false rejection rate of 25%, the false acceptance rate for the voice-altered imposter remains as high as 34.6%.

89 citations
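
The numbers quoted above (an equal-error rate and a false-acceptance rate at a fixed threshold) can be computed from lists of genuine and imposter scores; a small helper for that kind of measurement, with no claim to match the paper's exact evaluation code:

# Equal-error rate of a verifier, and the false-acceptance rate of (e.g. voice-altered)
# imposters when the decision threshold is left at its baseline value.
import numpy as np

def equal_error_rate(genuine_scores, imposter_scores):
    gen = np.asarray(genuine_scores, dtype=float)
    imp = np.asarray(imposter_scores, dtype=float)
    thresholds = np.sort(np.concatenate([gen, imp]))
    eer, eer_threshold, gap = 1.0, thresholds[0], np.inf
    for t in thresholds:
        frr = np.mean(gen < t)        # clients falsely rejected
        far = np.mean(imp >= t)       # imposters falsely accepted
        if abs(far - frr) < gap:      # keep the threshold where FAR and FRR are closest
            gap, eer, eer_threshold = abs(far - frr), (far + frr) / 2.0, t
    return eer, eer_threshold

def far_at_threshold(imposter_scores, threshold):
    return float(np.mean(np.asarray(imposter_scores, dtype=float) >= threshold))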


Proceedings Article
01 Jan 1999
TL;DR: Experimental results show that false acceptance rates for synthetic speech reached over 70% when the synthesis system was trained using only one sentence from each customer, indicating that the current security of HMM-based speaker verification systems against synthetic speech is inadequate.
Abstract: For speaker verification systems, security against imposture is one of the most important problems, and many approaches to reducing false acceptance of impostors as well as false rejection of clients have been investigated. On the other hand, imposture using synthetic speech has not been considered. In this paper, we investigate imposture against speaker verification systems using synthetic speech. We use an HMM-based text-prompted speaker verification system with a false acceptance rate of 0% for human impostors as a reference system, and adopt a trainable HMM-based speech synthesis system for imposture. Experimental results show that false acceptance rates for synthetic speech reached over 70% by training the synthesis system using only 1 sentence from each customer, and current security of HMM-based speaker verification systems against synthetic speech is inadequate.

85 citations


PatentDOI
Kerry A. Ortega, James R. Lewis, Ronald VanBuskirk, Huifang Wang, Stephane H. Maes
TL;DR: In this article, a method and apparatus for transcribing text from multiple speakers in a computer system having a speech recognition application is presented, where the system receives speech from one of a plurality of speakers through a single channel, assigns a speaker ID to the speaker, transcribes the speech into text, and associates the speaker ID with the speech and text.
Abstract: A method and apparatus for transcribing text from multiple speakers in a computer system having a speech recognition application. The system receives speech from one of a plurality of speakers through a single channel, assigns a speaker ID to the speaker, transcribes the speech into text, and associates the speaker ID with the speech and text. In order to detect a speaker change, the system monitors the speech input received through the channel.

Patent
15 Dec 1999
TL;DR: In this paper, a verification of the speaker adaptation performance is proposed to ensure that the recognition rate never decreases (significantly), but only increases or stays at the same level, in order to prevent adaptation to misrecognized words in unsupervised or on-line automatic speech recognition systems.
Abstract: To prevent adaptation to misrecognized words in unsupervised or on-line automatic speech recognition systems, confidence measures are used or the user reaction is interpreted to decide whether a recognized phoneme, several phonemes, a word, several words or a whole utterance should be used for adaptation of the speaker-independent model set to a speaker-adapted model set, and, in case an adaptation is executed, how strongly the adaptation with this recognized utterance or part of this recognized utterance should be performed. Furthermore, a verification of the speaker adaptation performance is proposed to ensure that the recognition rate never decreases (significantly), but only increases or stays at the same level.

PatentDOI
TL;DR: A method and apparatus are disclosed for identifying speakers participating in an audio-video source, whether or not such speakers have been previously registered or enrolled, to assign a speaker label to each identified segment.
Abstract: A method and apparatus are disclosed for identifying speakers participating in an audio-video source, whether or not such speakers have been previously registered or enrolled. The speaker identification system uses an enrolled speaker database that includes background models for unenrolled speakers, such as “unenrolled male” or “unenrolled female,” to assign a speaker label to each identified segment. Speaker labels are identified for each speech segment by comparing the segment utterances to the enrolled speaker database and finding the “closest” speaker, if any. A speech segment having an unknown speaker is initially assigned a general speaker label from the set of background models. The “unenrolled” segment is assigned a segment number and receives a cluster identifier assigned by the clustering system. If a given segment is assigned a temporary speaker label associated with an unenrolled speaker, the user can be prompted by the present invention to identify the speaker. Once the user assigns a speaker label to an audio segment having an unknown speaker, the same speaker name can be automatically assigned to any segments that are assigned to the same cluster and the enrolled speaker database can be automatically updated to enroll the previously unknown speaker.

Patent
16 Mar 1999
TL;DR: In this article, a frequency warping function generator estimates a vocal tract area function of each normalization-target speaker by changing feature quantities of the vocal tract configuration of the standard speaker.
Abstract: In a speaker normalization processor apparatus, a vocal-tract configuration estimator estimates feature quantities of a vocal-tract configuration showing an anatomical configuration of a vocal tract of each normalization-target speaker, by looking up a correspondence between vocal-tract configuration parameters and Formant frequencies previously determined based on a vocal tract model of the standard speaker, based on speech waveform data of each normalization-target speaker. A frequency warping function generator estimates a vocal-tract area function of each normalization-target speaker by changing feature quantities of a vocal-tract configuration of the standard speaker based on the feature quantities of the vocal-tract configuration of each normalization-target speaker estimated by the estimation means and the feature quantities of the vocal-tract configuration of the standard speaker, estimating Formant frequencies of speech uttered by each normalization-target speaker based on the estimated vocal-tract area function of each normalization-target speaker, and generating a frequency warping function showing a correspondence between input speech frequencies and frequencies after frequency warping.

Patent
TL;DR: In this article, a hierarchical speaker model tree is created by merging similar speaker models on a layer by layer basis, and each speaker is also grouped into a cohort of similar speakers.
Abstract: A method for unsupervised environmental normalization for speaker verification using hierarchical clustering is disclosed. Training data (speech samples) are taken from T enrolled (registered) speakers over any one of M channels, e.g., different microphones, communication links, etc. For each speaker, a speaker model is generated, each containing a collection of distributions of audio feature data derived from the speech sample of that speaker. A hierarchical speaker model tree is created, e.g., by merging similar speaker models on a layer by layer basis. Each speaker is also grouped into a cohort of similar speakers. For each cohort, one or more complementary speaker models are generated by merging speaker models outside that cohort. When training data from a new speaker to be enrolled is received over a new channel, the speaker model tree as well as the complementary models are updated. Consequently, adaptation to data from new environments is possible by incorporating such data into the verification model whenever it is encountered.
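
One way to picture the layer-by-layer merging described above is repeated pairwise merging of the most similar speaker models. In this hedged sketch, speaker models are simplified to single diagonal Gaussians and a symmetric KL-style divergence serves as the distance; the patent's actual model form and distance measure are not specified here.

# Hedged sketch of building one layer of a speaker-model tree by merging similar models.
import numpy as np

def sym_divergence(m1, v1, m2, v2):
    """Symmetric KL divergence between two diagonal Gaussians (mean, variance vectors)."""
    return 0.5 * np.sum((v1 / v2 + v2 / v1 - 2.0) + (m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2))

def build_tree_layer(models):
    """models: list of (mean, var, members). Returns the next (coarser) tree layer by
    pairing each model with its nearest unpaired neighbour and merging the pair."""
    layer, used = [], set()
    for i, (m1, v1, mem1) in enumerate(models):
        if i in used:
            continue
        cands = [(sym_divergence(m1, v1, m2, v2), j)
                 for j, (m2, v2, _) in enumerate(models) if j != i and j not in used]
        if not cands:                                       # odd model left over: carry it up
            layer.append((m1, v1, mem1))
            break
        _, j = min(cands)
        m2, v2, mem2 = models[j]
        used.update({i, j})
        merged_mean = (m1 + m2) / 2.0                       # equal-weight merge of the pair
        merged_var = (v1 + v2) / 2.0 + (m1 - m2) ** 2 / 4.0
        layer.append((merged_mean, merged_var, mem1 + mem2))
    return layer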

Proceedings ArticleDOI
15 Mar 1999
TL;DR: The GMM-based system outperformed the HMM system, and using an MLP to weight the phonemes provided a significant improvement in performance for male speakers, but no improvement has yet been achieved for women.
Abstract: This paper compares two approaches to speaker verification, Gaussian mixture models (GMMs) and hidden Markov models (HMMs). The GMM-based system outperformed the HMM system, mainly due to the ability of the GMM to make better use of the training data. The best-scoring GMM frames were strongly correlated with particular phonemes, e.g. vowels and nasals. Two techniques were used to try to exploit the different amounts of discrimination provided by the phonemes to improve the performance of the GMM-based system. Applying linear weighting to the phonemes showed that less than half of the phonemes were contributing to the overall system performance. Using an MLP to weight the phonemes provided a significant improvement in performance for male speakers, but no improvement has yet been achieved for women.

Patent
Aruna Bayya, Dianne L. Steiger
30 Sep 1999
TL;DR: A speech recognition training system that provides for model generation to be used within speaker dependent speech recognition systems requiring very limited training data, including single token training, is presented in this article.
Abstract: A speech recognition training system that provides for model generation to be used within speaker-dependent speech recognition systems requiring very limited training data, including single-token training. The present invention provides a very fast and reliable training method based on the segmentation of a speech signal for subsequent estimation of speaker-dependent word models. In addition, the invention provides a robust method of performing end-point detection of a word contained within a speech utterance or speech signal. The invention is geared ideally for speaker-dependent speech recognition systems that employ word-based speaker-dependent models. The invention provides an end-point detection method that is operable to extract a desired word or phrase from a speech signal recorded in varying degrees of undesirable background noise. In addition, the invention provides a simplified method of building the speaker-dependent models using a simplified hidden Markov modeling method. The invention requires very limited training and is operable within systems having constrained budgets of memory and processing resources. Some examples of candidate application areas include cellular telephones and other devices that would benefit from abbreviated training and have inherently limited memory and processing power.
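
The end-point detection component can be illustrated with a classic short-time-energy scheme: estimate the background level from the first few frames and mark where the energy rises above it. The patent's actual method is more elaborate; the frame length, noise-frame count, and margin below are assumptions.

# Simple energy-based end-point detection sketch (not the patent's method): estimate
# background noise energy from the first frames, then mark where the word starts and ends.
import numpy as np

def detect_endpoints(signal, frame_len=240, noise_frames=10, margin_db=10.0):
    sig = np.asarray(signal, dtype=float)
    n = len(sig) // frame_len
    frames = sig[:n * frame_len].reshape(n, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    threshold = energy_db[:noise_frames].mean() + margin_db   # background level + margin
    speech = np.where(energy_db > threshold)[0]
    if speech.size == 0:
        return None                                           # no word found
    start, end = speech[0] * frame_len, (speech[-1] + 1) * frame_len
    return start, end                                         # sample indices of the word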

PatentDOI
TL;DR: The disclosed audio transcription and speaker classification system includes a speech recognition system, a speaker segmentation system and a speaker identification system for automatically transcribing audio information from an audio-video source and concurrently identifying the speakers.
Abstract: A method and apparatus are disclosed for automatically transcribing audio information from an audio-video source and concurrently identifying the speakers. The disclosed audio transcription and speaker classification system includes a speech recognition system, a speaker segmentation system and a speaker identification system. A common front-end processor computes feature vectors that are processed along parallel branches in a multi-threaded environment by the speech recognition system, speaker segmentation system and speaker identification system, for example, using a shared memory architecture that acts in a server-like manner to distribute the computed feature vectors to a channel associated with each parallel branch. The speech recognition system produces transcripts with time-alignments for each word in the transcript. The speaker segmentation system separates the speakers and identifies all possible frames where there is a segment boundary between non-homogeneous speech portions. The speaker identification system thereafter uses an enrolled speaker database to assign a speaker to each identified segment. The audio information from the audio-video source is concurrently transcribed and segmented to identify segment boundaries. Thereafter, the speaker identification system assigns a speaker label to each portion of the transcribed text.

PatentDOI
Carlos Antonio Franceschi1
TL;DR: In this article, a speech recognition apparatus is used to determine when a speaker wants to spell a first word. But the speaker may also indicate a desire to stop phonetic spelling.
Abstract: Speech recognition apparatus includes means for determining when a speaker desires to spell a first word. The speaker may then say a sequence of words selected from a large vocabulary without being restricted to a pre-specified phonetic alphabet. The apparatus recognizes the spoken words, associates letters with these words and then arranges the letters to form the first word. The speaker may also indicate a desire to stop phonetic spelling. Apparatus may also be used for selecting items from a list.

01 Jan 1999
TL;DR: This new method is specifically designed to lower the complexity of the modeling phase, compared to classical techniques, as well as to decrease the required amount of learning data, making it particularly well-suited to on-line learning (needed for speaker indexation) and use on embedded systems.
Abstract: This paper presents a new approach to speaker recognition and indexation systems, based on non-directly-acoustic processing. This new method is specifically designed to lower the complexity of the modeling phase, compared to classical techniques, as well as to decrease the required amount of learning data, making it particularly well-suited to on-line learning (needed for speaker indexation) and use on embedded systems.

PatentDOI
TL;DR: In one embodiment of the invention, a voice messaging system incorporates a DGMM to identify the speaker who generated a message, if that speaker is a member of a chosen list of target speakers, or to identify the speaker as a “non-target” otherwise.
Abstract: Speaker identification is performed using a single Gaussian mixture model (GMM) for multiple speakers—referred to herein as a Discriminative Gaussian mixture model (DGMM). A likelihood sum of the single GMM is factored into two parts, one of which depends only on the Gaussian mixture model, and the other of which is a discriminative term. The discriminative term allows for the use of a binary classifier, such as a support vector machine (SVM). In one embodiment of the invention, a voice messaging system incorporates a DGMM to identify the speaker who generated a message, if that speaker is a member of a chosen list of target speakers, or to identify the speaker as a “non-target” otherwise.
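
The shared-GMM-plus-binary-classifier idea can be approximated with off-the-shelf tools: fit one GMM on speech from all target speakers, summarize each message by its average component-posterior vector, and train a binary SVM to separate target from non-target messages. This is not the patent's factored-likelihood formulation, only an illustration of pairing a single GMM with a discriminative classifier; scikit-learn names are assumed to be available.

# Illustration (not the patent's exact factorization): a single GMM shared by all target
# speakers plus a discriminative binary classifier over per-message posterior summaries.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def train(target_frames, nontarget_frames, n_components=32):
    """target_frames / nontarget_frames: lists of (n_frames, dim) arrays, one per message."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
    gmm.fit(np.vstack(target_frames + nontarget_frames))       # one GMM for everyone
    def summarize(msg):
        return gmm.predict_proba(msg).mean(axis=0)             # average component posteriors
    X = np.array([summarize(m) for m in target_frames + nontarget_frames])
    y = np.array([1] * len(target_frames) + [0] * len(nontarget_frames))
    svm = LinearSVC().fit(X, y)                                 # the discriminative part
    return gmm, svm

def is_target(message_frames, gmm, svm):
    return bool(svm.predict(gmm.predict_proba(message_frames).mean(axis=0)[None, :])[0])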

ReportDOI
01 Jan 1999
TL;DR: It is shown that, using speech synthesized from the three codecs, GMM-based speaker verification and phone-based language recognition performance generally degrades with coder bit rate, i.e., from GSM to G.729 to G.723.1, relative to an uncoded baseline.
Abstract: In this paper, we investigate the effect of speech coding on speaker and language recognition tasks. Three coders were selected to cover a wide range of quality and bit rates: GSM at 12.2 kb/s, G.729 at 8 kb/s, and G.723.1 at 5.3 kb/s. Our objective is to measure recognition performance from either the synthesized speech or directly from the coder parameters themselves. We show that using speech synthesized from the three codecs, GMM-based speaker verification and phone-based language recognition performance generally degrades with coder bit rate, i.e., from GSM to G.729 to G.723.1, relative to an uncoded baseline. In addition, speaker verification for all codecs shows a performance decrease as the degree of mismatch between training and testing conditions increases, while language recognition exhibited no decrease in performance. We also present initial results in determining the relative importance of codec system components in their direct use for recognition tasks. For the G.729 codec, it is shown that removal of the post-filter in the decoder helps speaker verification performance under the mismatched condition. On the other hand, with use of G.729 LSF-based mel-cepstra, performance decreases under all conditions, indicating the need for a residual contribution to the feature representation.

Proceedings ArticleDOI
01 Jan 1999
TL;DR: Techniques are explored to optimally determine the relative weights of the independent audio-based and video-based decisions, so as to achieve the best combination and improve performance under mismatched conditions.
Abstract: Audio-based speaker identification degrades severely when there is a mismatch between training and test conditions either due to channel or noise. In this paper, we explore various techniques to fuse video based speaker identification with audio-based speaker identification to improve the performance under mismatched conditions. Specifically, we explore techniques to optimally determine the relative weights of the independent decisions based on audio and video to achieve the best combination. Experiments on video broadcast news data suggest that significant improvements can be achieved by the combination in acoustically degraded conditions.
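
The weight-selection step can be sketched as a grid search over a single fusion weight on held-out data. The linear combination of scores and the weight grid below are assumptions for illustration, not the paper's exact procedure.

# Sketch of fusing audio- and video-based speaker identification scores with one
# relative weight chosen on held-out (development) data.
import numpy as np

def fuse(audio_scores, video_scores, w):
    """Scores: (n_utterances, n_speakers) log-likelihood-like matrices."""
    return w * audio_scores + (1.0 - w) * video_scores

def pick_weight(audio_dev, video_dev, labels_dev, grid=np.linspace(0.0, 1.0, 21)):
    best_w, best_acc = 0.5, -1.0
    for w in grid:
        pred = np.argmax(fuse(audio_dev, video_dev, w), axis=1)
        acc = np.mean(pred == labels_dev)
        if acc > best_acc:                       # keep the weight giving best held-out accuracy
            best_w, best_acc = w, acc
    return best_w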

Patent
Yifan Gong1
15 Apr 1999
TL;DR: In this paper, a two-stage model adaptation scheme is presented to deal with variabilities from speaker, microphone channel and background noises in a car environment; the first stage adapts a speaker-independent HMM seed model set to a speaker- and microphone-dependent model set.
Abstract: The recognition of hands-free speech in a car environment has to deal with variabilities from the speaker, the microphone channel and background noises. A two-stage model adaptation scheme is presented. The first stage adapts a speaker-independent HMM seed model set to a speaker- and microphone-dependent model set. The second stage adapts the speaker- and microphone-dependent model set to a speaker-, microphone- and noise-dependent model set, which is then used for speech recognition. Both adaptations are based on maximum-likelihood linear regression (MLLR).
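
MLLR, the adaptation used in both stages above, re-estimates Gaussian means through an affine transform fitted on adaptation data. The following is a heavily simplified sketch assuming hard frame-to-Gaussian alignments and identity covariances, which reduces the maximum-likelihood estimate to ordinary least squares; it is not the paper's implementation.

# Hedged MLLR-style sketch: estimate one global affine transform W so that transformed
# Gaussian means better match the adaptation frames.
import numpy as np

def estimate_mllr(frames, alignments, means):
    """frames: (T, d) adaptation observations; alignments: (T,) Gaussian index per frame;
    means: (G, d) speaker-independent Gaussian means. Returns W of shape (d, d+1)."""
    xi = np.hstack([np.ones((len(frames), 1)), means[alignments]])   # extended means [1, mu]
    # solve min_W  sum_t || o_t - W xi_t ||^2  by least squares
    Wt, *_ = np.linalg.lstsq(xi, frames, rcond=None)                 # (d+1, d)
    return Wt.T

def adapt_means(means, W):
    xi = np.hstack([np.ones((len(means), 1)), means])
    return xi @ W.T                                                  # adapted Gaussian means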

Proceedings ArticleDOI
15 Mar 1999
TL;DR: The APT is exploited to develop a speaker adaptation scheme in which the cepstral means of a speech recognition model are transformed to better match the speech of a given speaker.
Abstract: In previous work, a class of transforms were proposed which achieve a remapping of the frequency axis much like conventional vocal tract length normalization. These mappings, known collectively as all-pass transforms (APT), were shown to produce substantial improvements in the performance of a large vocabulary speech recognition system when used to normalize incoming speech prior to recognition. In this application, the most advantageous characteristic of the APT was its cepstral-domain linearity; this linearity makes speaker normalization simple to implement, and provides for the robust estimation of the parameters characterizing individual speakers. In the current work, we exploit the APT to develop a speaker adaptation scheme in which the cepstral means of a speech recognition model are transformed to better match the speech of a given speaker. In a set of speech recognition experiments conducted on the Switchboard corpus, we report reductions in word error rate of 3.7% absolute.

Proceedings ArticleDOI
10 Jun 1999
TL;DR: The paper proposes a fuzzy approach to the hidden Markov model (HMM) method called the fuzzy HMM for speech and speaker recognition, regarded as an application of the fuzzy expectation-maximisation algorithm to the Baum-Welch algorithm in the HMM.
Abstract: The paper proposes a fuzzy approach to the hidden Markov model (HMM) method called the fuzzy HMM for speech and speaker recognition. The fuzzy HMM algorithm is regarded as an application of the fuzzy expectation-maximisation (EM) algorithm to the Baum-Welch algorithm in the HMM. Speech and speaker recognition experiments using the Texas Instruments (TI46) speech data corpus show better results for fuzzy HMMs compared with conventional HMMs.

Proceedings Article
01 Jan 1999
TL;DR: In this paper, a binary-tree database approach is presented for arranging the trained speaker models based on a distance measure designed for comparing two sets of distributions, and two techniques are presented for creating a model of the complement space to the cohort, which is used for rejection purposes.
Abstract: This paper presents a hierarchical approach to the large-scale speaker recognition problem. Here the authors present a binary-tree database approach for arranging the trained speaker models based on a distance measure designed for comparing two sets of distributions. The combination of this hierarchical structure and the distance measure [1] provides the means for conducting a large-scale verification task. In addition, two techniques are presented for creating a model of the complement space to the cohort, which is used for rejection purposes. Results are presented for the drastic improvements achieved, mainly in reducing the false-acceptance of the speaker verification system without any significant false-rejection degradation.

Patent
TL;DR: In this article, methods for generating and using both speaker-dependent and speaker-independent garbage models in speaker-dependent speech recognition applications are described; words or phrases treated as garbage or out-of-vocabulary utterances in one part of a speech recognition system may be recognized and treated as in-vocabulary by another portion of the system.
Abstract: Methods and apparatus for generating and using both speaker dependent and speaker independent garbage models in speaker dependent speech recognition applications are described. The present invention recognizes that in some speech recognition systems, e.g., systems where multiple speech recognition operations are performed on the same signal, it may be desirable to recognize and treat words or phrases in one part of the speech recognition system as garbage or out of vocabulary utterances with the understanding that the very same words or phrases will be recognized and treated as in-vocabulary by another portion of the system. In accordance with the present invention, in systems where both speaker independent and speaker dependent speech recognition operations are performed independently, e.g., in parallel, one or more speaker independent models of words or phrases which are to be recognized by the speaker independent speech recognizer are included as garbage (OOV) models in the speaker dependent speech recognizer. This reduces the risk of obtaining conflicting speech recognition results from the speaker independent and speaker dependent speech recognition circuits. When an OOV model is recognized, an indication that none of the words represented by the speaker dependent models have been detected may be provided. The present invention also provides for the generation of speaker dependent garbage models from the very same data used to generate speaker dependent speech recognition models, e.g., word models.

Proceedings ArticleDOI
15 Mar 1999
TL;DR: The paper reviews how to find the eigenvoices, gives a maximum-likelihood estimator for the new speaker's eigenvoice coefficients, and summarizes mean adaptation experiments carried out on the Isolet database.
Abstract: Previously, we presented a radically new class of fast adaptation techniques for speech recognition, based on prior knowledge of speaker variation. To obtain this prior knowledge, one applies a dimensionality reduction technique to T vectors of dimension D derived from T speaker-dependent (SD) models. This offline step yields T basis vectors, the eigenvoices. We constrain the model for new speaker S to be located in the space spanned by the first K eigenvoices. Speaker adaptation involves estimating K eigenvoice coefficients for the new speaker; typically, K is very small compared to original dimension D. Here, we review how to find the eigenvoices, give a maximum-likelihood estimator for the new speaker's eigenvoice coefficients, and summarize mean adaptation experiments carried out on the Isolet database. We present new results which assess the impact on performance of changes in training of the SD models. Finally, we interpret the first few eigenvoices obtained.
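
The coefficient-estimation step above can be pictured in simplified form: constrain the new speaker's model to the span of the first K eigenvoices and fit the K coefficients. The paper uses a proper maximum-likelihood estimator over the adaptation data; the ordinary least-squares version below, applied to a rough estimate of the speaker's mean supervector, is only a stand-in.

# Simplified stand-in for eigenvoice adaptation (not the paper's ML estimator): fit the
# K eigenvoice coefficients by least squares and rebuild the constrained speaker model.
import numpy as np

def adapt_eigenvoice(mean_supervector, eigenvoices, speaker_supervector_estimate):
    """eigenvoices: (K, D) basis derived from SD-model supervectors; both supervectors: (D,)."""
    w, *_ = np.linalg.lstsq(eigenvoices.T,
                            speaker_supervector_estimate - mean_supervector, rcond=None)
    return mean_supervector + eigenvoices.T @ w        # adapted model constrained to the eigenspace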

Proceedings ArticleDOI
15 Mar 1999
TL;DR: A new algorithm, the modified-mean cepstral mean normalization with frequency warping (MMCMNFW) method, which improves upon the commonly-employed cepstral mean subtraction method, has been developed and is shown to offer improved recognition rates over other existing channel normalization methods on the databases considered.
Abstract: The performance of automatic speaker recognition systems is significantly degraded by acoustic mismatches between training and testing conditions. Such acoustic mismatches are commonly encountered in systems that operate on speech collected over telephone networks, where different handsets and different network routes impose varying convolutional distortions on the speech signal. A new algorithm, the modified-mean cepstral mean normalization with frequency warping (MMCMNFW) method, which improves upon the commonly-employed cepstral mean subtraction method, has been developed. Experimental results on closed-set speaker identification tasks on a channel-corrupted subset of the TIMIT database and on a subset of the NTIMIT database are presented. The new algorithm is shown to offer improved recognition rates over other existing channel normalization methods on these databases.
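
For reference, the cepstral mean subtraction baseline that the MMCMNFW method improves upon is easy to state; a minimal sketch follows (the modified-mean and frequency-warping refinements described above are not reproduced here).

# Baseline the MMCMNFW method builds on: conventional cepstral mean subtraction,
# which removes stationary convolutional (channel) effects from the features.
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """cepstra: (n_frames, n_coeffs). Subtract the utterance-level mean of each coefficient."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)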