
Showing papers on "Speaker recognition published in 1997"


Journal ArticleDOI
01 Sep 1997
TL;DR: A tutorial on the design and development of automatic speaker-recognition systems is presented, and a new automatic speaker-recognition system is given that performs with 98.9% correct identification.
Abstract: A tutorial on the design and development of automatic speaker-recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: to identify a particular person or to verify a person's claimed identity. Speech processing and the basic components of automatic speaker-recognition systems are shown and design tradeoffs are discussed. Then, a new automatic speaker-recognition system is given. This recognizer performs with 98.9% correct identification. Last, the performances of various systems are compared.

1,686 citations


Proceedings Article
01 Jan 1997
TL;DR: The DET Curve is introduced as a means of representing performance on detection tasks that involve a tradeoff of error types, along with an explanation of why it is likely to produce approximately linear curves.
Abstract: We introduce the DET Curve as a means of representing performance on detection tasks that involve a tradeoff of error types. We discuss why we prefer it to the traditional ROC Curve and offer several examples of its use in speaker recognition and language recognition. We explain why it is likely to produce approximately linear curves. We also note special points that may be included on these curves, how they are used with multiple targets, and possible further applications.

1,516 citations
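
As an illustration of the DET representation described above, here is a hedged sketch (not code from the paper; the function names and the synthetic Gaussian scores are assumptions) that plots miss and false-alarm rates on normal-deviate (probit) axes, the transformation that makes approximately Gaussian score distributions trace straight lines:

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def det_points(target_scores, nontarget_scores):
    """Sweep a decision threshold and return (miss, false-alarm) rate pairs."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    return p_miss, p_fa

# Synthetic Gaussian scores stand in for real detector output.
rng = np.random.default_rng(0)
p_miss, p_fa = det_points(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000))

# Probit-transform both axes; Gaussian score distributions then appear linear.
eps = 1e-4
plt.plot(norm.ppf(np.clip(p_fa, eps, 1 - eps)),
         norm.ppf(np.clip(p_miss, eps, 1 - eps)))
plt.xlabel("false alarm probability (normal deviate)")
plt.ylabel("miss probability (normal deviate)")
plt.show()
```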


Journal ArticleDOI
TL;DR: The issue of speech recognizer training is discussed from a broad perspective rooted in classical Bayes decision theory, and the superiority of the minimum classification error (MCE) method over the distribution estimation method is shown through the results of several key speech recognition experiments.
Abstract: A critical component in the pattern matching approach to speech recognition is the training algorithm, which aims at producing typical (reference) patterns or models for accurate pattern comparison. In this paper, we discuss the issue of speech recognizer training from a broad perspective with root in the classical Bayes decision theory. We differentiate the method of classifier design by way of distribution estimation and the discriminative method of minimizing classification error rate based on the fact that in many realistic applications, such as speech recognition, the real signal distribution form is rarely known precisely. We argue that traditional methods relying on distribution estimation are suboptimal when the assumed distribution form is not the true one, and that "optimality" in distribution estimation does not automatically translate into "optimality" in classifier design. We compare the two different methods in the context of hidden Markov modeling for speech recognition. We show the superiority of the minimum classification error (MCE) method over the distribution estimation method by providing the results of several key speech recognition experiments. In general, the MCE method provides a significant reduction of recognition error rate.

728 citations
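
For readers unfamiliar with MCE, here is a minimal sketch of the standard smoothed error criterion (a misclassification measure passed through a sigmoid); this is the textbook form of the technique, assumed here rather than taken from this paper's exact formulation:

```python
import numpy as np

def mce_loss(scores, true_class, alpha=1.0, eta=2.0):
    """Smoothed classification-error loss for one training token.
    scores[k] is the discriminant (e.g. HMM log-likelihood) of class k."""
    scores = np.asarray(scores, dtype=float)
    g_true = scores[true_class]
    others = np.delete(scores, true_class)
    # Misclassification measure: > 0 when competitors outscore the true class.
    d = -g_true + np.log(np.mean(np.exp(eta * others))) / eta
    # A sigmoid turns the measure into a smooth, differentiable 0/1 error count.
    return 1.0 / (1.0 + np.exp(-alpha * d))
```

Minimizing the sum of this loss over training tokens approximates minimizing the empirical error rate directly, which is the sense in which MCE bypasses distribution estimation.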


PatentDOI
Dimitri Kanevsky1, Stephane H. Maes1
TL;DR: In this article, a method and apparatus for securing access to a service or facility employing automatic speech recognition, text-independent speaker identification, natural language understanding techniques and additional dynamic and static features is presented.
Abstract: A method and apparatus for securing access to a service or facility employing automatic speech recognition, text-independent speaker identification, natural language understanding techniques and additional dynamic and static features. The method includes the steps of receiving and decoding speech containing indicia of the speaker such as a name, address or customer number; accessing a database containing information on candidate speakers; questioning the speaker based on the information; receiving, decoding and verifying an answer to the question; obtaining a voice sample of the speaker and verifying the voice sample against a model; generating a score based on the answer and the voice sample; and granting access if the score is equal to or greater than a threshold. Alternatively, the method includes the steps of receiving and decoding speech containing indicia of the speaker; generating a sub-list of speaker candidates having indicia substantially matching the speaker; activating databases containing information about the speaker candidates in the sub-list; performing voice classification analysis; eliminating speaker candidates based on the voice classification analysis; questioning the speaker regarding the information; eliminating speaker candidates based on the answer; and iteratively repeating the prior steps until either one speaker candidate remains (in which case the speaker is granted access) or no speaker candidate remains (in which case access is denied).

474 citations


Proceedings Article
01 Jan 1997
TL;DR: This paper compares two approaches to background model representation for a text-independent speaker verification task using Gaussian mixture models and describes how Bayesian adaptation can be used to derive claimant speaker models, providing a structure leading to significant computational savings during recognition.
Abstract: This paper compares two approaches to background model representation for a text-independent speaker verification task using Gaussian mixture models. We compare speaker-dependent background speaker sets to the use of a universal, speaker-independent background model (UBM). For the UBM, we describe how Bayesian adaptation can be used to derive claimant speaker models, providing a structure leading to significant computational savings during recognition. Experiments are conducted on the 1996 NIST Speaker Recognition Evaluation corpus, and it is clearly shown that a system using a UBM and Bayesian adaptation of claimant models produces superior performance compared to speaker-dependent background sets or the UBM with independent claimant models. In addition, the creation and use of a telephone handset-type detector and a procedure called hnorm are described, which show further, large improvements in verification performance, especially under the difficult mismatched-handset conditions. This is believed to be the first application of a handset-type detector and explicit handset-type normalization to the speaker verification task.

383 citations
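
The Bayesian adaptation step referred to above is commonly realized as relevance-MAP adaptation of the UBM means. Below is a minimal sketch assuming component posteriors have already been computed against the UBM; only means are adapted, and the relevance factor value is illustrative:

```python
import numpy as np

def map_adapt_means(ubm_means, posteriors, frames, r=16.0):
    """Relevance-MAP adaptation of UBM component means toward claimant data.
    ubm_means: (M, D); posteriors: (T, M) responsibilities; frames: (T, D)."""
    n = posteriors.sum(axis=0)                                   # soft counts per mixture
    ex = posteriors.T @ frames / np.maximum(n, 1e-10)[:, None]   # per-mixture data means
    alpha = (n / (n + r))[:, None]                               # data-dependent mix-in
    # Components that saw little claimant data stay close to the UBM prior.
    return alpha * ex + (1.0 - alpha) * ubm_means
```

Because adapted models keep the UBM's structure, claimant and background scoring can share work on the same components, which is one source of the computational savings mentioned in the abstract.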


Journal ArticleDOI
TL;DR: Recent advances in speaker recognition technology include VQ- and ergodic-HMM-based text-independent recognition methods, a text-prompted recognition method, parameter/distance normalization and model adaptation techniques, and methods of updating models and a priori thresholds in speaker verification.

326 citations


Book
01 Jan 1997
TL;DR: In this book, the authors examine talker variability in speech processing, covering perceptual adjustment to voice, listening to voices, words and voices in an episodic lexicon, talker normalization in speech perception, and speaker adaptation approaches in automatic speech recognition.
Abstract: Chapters include: complex representations used in speech processing - overview of the book; some thoughts on "normalization" in speech perception; words and voices - perception and production in an episodic lexicon; on the nature of perceptual adjustment to voice; listening to voices - theory and practice in voice perception research; talker normalization - phonetic constancy as a cognitive process; normalization of vowels by breath sounds; speech perception without speaker normalization - an exemplar model; speaker modeling for speaker adaptation in automatic speech recognition; overcoming speaker variability in automatic speech recognition - the speaker adaptation approach; vocal tract normalization for articulatory recovery and adaptation.

323 citations


Patent
24 Jul 1997
TL;DR: In this paper, a speech recognition manager receives representations of one or more words from a speech decoding system (106) and interprets the received words based upon the current context state so as to provide extremely accurate, flexible, extendable and scalable speech recognition and interpretation.
Abstract: A speech recognition manager receives representations of one or more words from a speech decoding system (106) and interprets the received words based upon the current context state so as to provide extremely accurate, flexible, extendable and scalable speech recognition and interpretation. The speech recognition manager limits the number of words that the speech decoding system (106) can recognize in a given context state in order to increase the speed and accuracy of the speech recognition process. Whenever the context state changes, the manager loads a new list of words that can be recognized for the new context state into the speech decoding system (106) so that while the speed and accuracy of the speech recognition process is increased, the total grammatical structure recognized can be easily increased as well.

201 citations
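
A toy sketch of the context-state mechanism described above; `load_vocabulary` and the dictionary layout are hypothetical stand-ins for the speech decoding system's interface, not the patent's actual API:

```python
class RecognitionManager:
    """Keeps the decoder's active vocabulary small by loading, on every
    context change, only the words recognizable in the new context state."""

    def __init__(self, decoder, vocab_by_context):
        self.decoder = decoder                      # hypothetical decoder wrapper
        self.vocab_by_context = vocab_by_context    # e.g. {"main_menu": [...], ...}
        self.state = None

    def set_context(self, state):
        if state != self.state:
            self.state = state
            # Small per-context word lists speed up search and raise accuracy,
            # while the union over all contexts can grow without bound.
            self.decoder.load_vocabulary(self.vocab_by_context[state])
```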


Patent
25 Nov 1997
TL;DR: In this paper, an apparatus and method for the robust recognition of speech during a call in a noisy environment is presented, where specific background noise models are created to model various background noises which may interfere with the error-free recognition of speech.
Abstract: An apparatus and method for the robust recognition of speech during a call in a noisy environment is presented. Specific background noise models are created to model various background noises which may interfere with the error-free recognition of speech. These background noise models are then used to determine which noise characteristics a particular call has. Once a determination has been made of the background noise in any given call, speech recognition is carried out using the appropriate background noise model.

194 citations
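
One plausible reading of the model-selection step, sketched below under the assumption that noise-only segments of the call are scored against each background noise model (the scoring callables and their interface are hypothetical):

```python
def select_noise_model(noise_frames, noise_models):
    """Pick the background noise model that best explains this call's noise.
    noise_models maps a model name to a function returning per-frame
    log-likelihoods for the given noise frames (interface assumed)."""
    return max(noise_models,
               key=lambda name: noise_models[name](noise_frames).sum())
```

Recognition would then proceed with acoustic models matched to the selected noise condition.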


Patent
Stephane H. Maes1
TL;DR: In this paper, feature vectors representing each of a plurality of overlapping frames of an arbitrary, text-independent speech signal are computed and compared to vector parameters and variances stored as codewords in one or more codebooks corresponding to each enrolled user, to provide speaker-dependent information for speech recognition and ambiguity resolution.
Abstract: Feature vectors representing each of a plurality of overlapping frames of an arbitrary, text independent speech signal are computed and compared to vector parameters and variances stored as codewords in one or more codebooks corresponding to each of one or more enrolled users to provide speaker dependent information for speech recognition and/or ambiguity resolution. Other information such as aliases and preferences of each enrolled user may also be enrolled and stored, for example, in a database. Correspondence of the feature vectors may be ranked by closeness of correspondence to a codeword entry and the number of frames corresponding to each codebook are accumulated or counted to identify a potential enrolled speaker. The differences between the parameters of the feature vectors and codewords in the codebooks can be used to identify a new speaker and an enrollment procedure can be initiated. Continuous authorization and access control can be carried out based on any utterance either by verification of the authorization of a speaker of a recognized command or comparison with authorized commands for the recognized speaker. Text independence also permits coherence checks to be carried out for commands to validate the recognition process.

161 citations
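
The frame-ranking and counting described above can be sketched as a simple voting scheme; the array shapes, Euclidean distance, and dict layout are assumptions:

```python
import numpy as np

def identify_speaker(frames, codebooks):
    """frames: (T, D) feature vectors; codebooks: dict name -> (K, D) codewords.
    Returns the enrolled speaker whose codebook wins the most frame votes."""
    names = list(codebooks)
    # Per-frame distance to the nearest codeword of each codebook: (S, T).
    dists = np.stack([
        np.linalg.norm(frames[:, None, :] - cb[None, :, :], axis=-1).min(axis=1)
        for cb in codebooks.values()
    ])
    winners = dists.argmin(axis=0)                  # each frame votes once
    counts = np.bincount(winners, minlength=len(names))
    return names[int(counts.argmax())]
```

Large per-frame distances across all codebooks would, in the same spirit, flag a new speaker and trigger enrollment.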


Proceedings ArticleDOI
21 Apr 1997
TL;DR: A novel method to combine different knowledge sources and estimate the confidence in a word hypothesis, via a neural network, is described, and a measure of the joint performance of the recognition and confidence systems is proposed.
Abstract: This paper proposes a probabilistic framework to define and evaluate confidence measures for word recognition. We describe a novel method to combine different knowledge sources and estimate the confidence in a word hypothesis, via a neural network. We also propose a measure of the joint performance of the recognition and confidence systems. The definitions and algorithms are illustrated with results on the Switchboard Corpus.
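
A minimal sketch of combining knowledge sources into a word confidence with a small network; the predictor set and one-hidden-layer shape are assumptions, not the paper's architecture:

```python
import numpy as np

def word_confidence(predictors, W1, b1, w2, b2):
    """predictors: per-word features from different knowledge sources,
    e.g. acoustic score, language-model score, word duration, N-best rank.
    A one-hidden-layer network maps them to a confidence in (0, 1)."""
    h = np.tanh(predictors @ W1 + b1)   # W1: (n_features, n_hidden)
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))
```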

Patent
11 Jul 1997
TL;DR: In this article, a voice recognition device used as a peripheral device for a game machine including a voice input device is described, which is used for recognizing the player's voice by comparing the voice signal output from the voice input devices with data from previously defined voice recognition dictionaries and generating control signals relating to the game on the basis of the recognition result.
Abstract: A voice recognition device used as a peripheral device for a game machine includes a voice input device and a voice recognition section for recognizing the player's voice by comparing the voice signal output from the voice input device with data from previously defined voice recognition dictionaries and generating control signals relating to the game on the basis of the recognition result. The voice recognition section includes a non-specific speaker voice recognition dictionary which is previously defined for unspecified speakers, and a specific speaker voice recognition dictionary which is defined by the player.

Patent
30 Apr 1997
TL;DR: In this article, a technique for the generation of garbage models from the very same data used to generate speaker dependent speech recognition models, e.g., word models, is described.
Abstract: Methods and apparatus for the generation of speaker dependent garbage models from the very same data used to generate speaker dependent speech recognition models, e.g., word models, are described. The technique involves processing the data included in the speaker dependent speech recognition models to create one or more speaker dependent garbage models. The speaker dependent garbage model generation technique involves what may be described as distorting or morphing of a speaker dependent speech recognition model to generate a speaker dependent garbage model therefrom. One or more speaker dependent speech recognition models may then be combined with the generated speaker dependent garbage model to produce an updated garbage model. The scoring of speaker dependent garbage models is varied in accordance with the present invention as a function of the number of speech recognition models from which the speaker dependent garbage model was created. In one embodiment, the number of speaker dependent speech recognition models which are used in generating a speaker dependent garbage model is limited to a preselected maximum number which is empirically determined.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: A new approach to automatic speech recognition based on independent class-conditional probability estimates in several frequency sub-bands is presented, shown to be especially applicable to environments which cause partial corruption of the frequency spectrum of the signal.
Abstract: A new approach to automatic speech recognition based on independent class-conditional probability estimates in several frequency sub-bands is presented. The approach is shown to be especially applicable to environments which cause partial corruption of the frequency spectrum of the signal. Some of the issues involved in the implementation of the approach are also addressed.

PatentDOI
TL;DR: In systems where both speaker independent and speaker dependent speech recognition operations are performed independently, in parallel, one or more speaker independent models of words or phrases which are to be recognized by the speaker independent speech recognizer are included as garbage (OOV) models in the speaker dependent speech recognizer.
Abstract: Methods and apparatus for generating and using both speaker dependent and speaker independent garbage models in speaker dependent speech recognition applications are described. The present invention recognizes that in some speech recognition systems, e.g., systems where multiple speech recognition operations are performed on the same signal, it may be desirable to recognize and treat words or phrases in one part of the speech recognition system as garbage or out of vocabulary utterances with the understanding that the very same words or phrases will be recognized and treated as in-vocabulary by another portion of the system. In accordance with the present invention, in systems where both speaker independent and speaker dependent speech recognition operations are performed independently, e.g., in parallel, one or more speaker independent models of words or phrases which are to be recognized by the speaker independent speech recognizer are included as garbage (OOV) models in the speaker dependent speech recognizer. This reduces the risk of obtaining conflicting speech recognition results from the speaker independent and speaker dependent speech recognition circuits. The present invention also provides for the generation of speaker dependent garbage models from the very same data used to generate speaker dependent speech recognition models, e.g., word models. The technique involves processing the data included in the speaker dependent speech recognition models to create one or more speaker dependent garbage models.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: Two corpora collected at Lincoln Laboratory for the study of handset transducer effects on the speech signal are described: the handset TIMIT (HTIMIT) corpus and the Lincoln Laboratory Handset Database (LLHDB).
Abstract: This paper describes two corpora collected at Lincoln Laboratory for the study of handset transducer effects on the speech signal: the handset TIMIT (HTIMIT) corpus and the Lincoln Laboratory Handset Database (LLHDB). The goal of these corpora is to minimize all confounding factors and produce speech predominately differing only in handset transducer effects. The speech is recorded directly from a telephone unit in a sound-booth using prompted text and extemporaneous photograph descriptions. The two corpora allow comparison of speech collected from a person speaking into a handset (LLHDB) versus speech played through a loudspeaker into a handset (HTIMIT). A comparison of analysis and results between the two corpora addresses the realism of artificially creating handset-degraded speech by playing recorded speech through the handsets. The corpora are designed primarily for speaker recognition experimentation (in terms of amount of speech and level of transcription), but since both speaker and speech recognition systems operate on the same acoustic features affected by the handset, the knowledge gleaned is directly transferable to speech recognizers. Initial speaker identification performance on these corpora is presented. In addition, the application of HTIMIT in developing a handset detector that was successfully used on a Switchboard speaker verification task is described.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: This paper addresses the approach of speaker normalization which aims at normalizing speaker's vocal tract length based on frequency warping (FWP), and investigates the formant-based and ML-based FWP in linear and nonlinear warping modes.
Abstract: In speech recognition, the speaker-dependence of a speech recognition system comes from the speaker-dependence of the speech feature, and the variation of vocal tract shape is the major source of inter-speaker variations of the speech feature, though there are other sources which also contribute. In this paper, we address the approach of speaker normalization, which aims at normalizing the speaker's vocal tract length based on frequency warping (FWP). The FWP is implemented in the front-end preprocessing of our speech recognition system. We investigate the formant-based and ML-based FWP in linear and nonlinear warping modes, and compare them in detail. All experimental results are based on our JANUS3 large vocabulary continuous speech recognition system and the Spanish Spontaneous Scheduling Task (SSST) database.
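
As a concrete example of the linear warping mode, below is a common piecewise-linear vocal-tract-length warping of the frequency axis; the breakpoint at 0.875 of Nyquist and the 8 kHz Nyquist value are assumptions, not this paper's settings:

```python
import numpy as np

def warp_frequency(f, alpha, f_nyq=8000.0, break_frac=0.875):
    """Piecewise-linear VTLN warp: scale frequencies by alpha below a
    breakpoint, then interpolate linearly so f_nyq still maps to f_nyq."""
    f = np.asarray(f, dtype=float)
    fb = break_frac * f_nyq
    upper = alpha * fb + (f_nyq - alpha * fb) * (f - fb) / (f_nyq - fb)
    return np.where(f <= fb, alpha * f, upper)

# Example: warp two frequencies with a 10% scale below the breakpoint.
print(warp_frequency([1000.0, 7500.0], alpha=1.1))   # -> [1100. 7850.]
```

The warp factor alpha would be chosen per speaker, e.g. from formant estimates or by maximum likelihood, as the abstract describes.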

Proceedings ArticleDOI
21 Apr 1997
TL;DR: It is indeed shown that even using quite simple recombination strategies, this subband ASR approach can yield at least comparable performance on clean speech while providing better robustness in the case of narrowband noise.
Abstract: In the framework of hidden Markov models (HMM) or hybrid HMM/artificial neural network (ANN) systems, we present a new approach towards automatic speech recognition (ASR). The general idea is to divide up the full frequency band (represented in terms of critical bands) into several subbands, compute phone probabilities for each subband on the basis of subband acoustic features, perform dynamic programming independently for each band, and merge the subband recognizers (recombining the respective, possibly weighted, scores) at some segmental level corresponding to temporal anchor points. The results presented in this paper confirm some preliminary tests reported earlier. On both isolated word and continuous speech tasks, it is indeed shown that even using quite simple recombination strategies, this subband ASR approach can yield at least comparable performance on clean speech while providing better robustness in the case of narrowband noise.
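
A minimal sketch of one simple recombination strategy in the spirit described above: a weighted merge of subband log scores at a segmental anchor point (the shapes and uniform default weights are assumptions):

```python
import numpy as np

def recombine(subband_log_scores, weights=None):
    """subband_log_scores: (B, N) log scores for N hypotheses from B
    independent subband recognizers, merged at an anchor point."""
    B = subband_log_scores.shape[0]
    if weights is None:
        weights = np.full(B, 1.0 / B)      # equal subband reliability by default
    # Weighted sum in the log domain = weighted product of likelihoods.
    combined = weights @ subband_log_scores
    return int(np.argmax(combined))
```

Down-weighting a subband corrupted by narrowband noise is what gives the approach its robustness.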

Patent
11 Jun 1997
TL;DR: A portable acoustic signal (speech signal) preprocessing (SSP) device for accessing an automatic speech/speaker recognition (ASSR) server comprises a microphone for converting sound including speech, silence and background noise signals to analog signals; an analog-to-digital converter for converting the analog signals to digital signals; a digital signal processor (DSP) for generating feature vector data representing the digitized speech and silence/background noise, and for generating channel characterization signals; and an acoustic coupler for converting the feature vector data and the characterization signals to acoustic signals and coupling the acoustic signals to a communication channel to access the ASSR server.
Abstract: A portable acoustic signal (speech signal) preprocessing (SSP) device for accessing an automatic speech/speaker recognition (ASSR) server comprises a microphone for converting sound including speech, silence and background noise signals to analog signals; an analog-to-digital converter for converting the analog signals to digital signals; a digital signal processor (DSP) for generating feature vector data representing the digitized speech and silence/background noise, and for generating channel characterization signals; and an acoustic coupler for converting the feature vector data and the characterization signals to acoustic signals and coupling the acoustic signals to a communication channel to access the ASSR server to perform speech and speaker recognition at a remote location. The SSP device may also be configured to compress and encrypt data transmitted to the ASSR server via the DSP and encryption keys stored in a memory. The ASSR server receives the preprocessed acoustic signals to perform speech/speaker recognition by setting references, selecting appropriate decoding models and algorithms to decode the acoustic signals by modeling the channel transfer function from the channel characterization signals and processing the silence/background noise data to reduce word error rate for speech recognition and to perform accurate speaker recognition. A client/server system having the portable SSP device and the ASSR server can be used to remotely activate, reset, or change personal identification numbers (PINs) or user passwords for smartcards, magnetic cards, or electronic money cards.

Proceedings Article
01 Jan 1997
TL;DR: A simple speaker normalization algorithm combining frequency warping and spectral shaping introduced in [5] is shown to reduce acoustic variability and improve recognition performance for children speakers, and age-dependent acoustic modeling further reduces word error rate.
Abstract: In this paper, the acoustic and linguistic characteristics of children's speech are investigated in the context of automatic speech recognition. Acoustic variability is identified as a major hurdle in building high performance ASR applications for children. A simple speaker normalization algorithm combining frequency warping and spectral shaping introduced in [5] is shown to reduce acoustic variability and significantly improve recognition performance for children speakers (by 25-45%). Age-dependent acoustic modeling further reduces word error rate by 10%. Piece-wise linear and phoneme-dependent frequency warping algorithms are proposed for reducing acoustic mismatch between the children and adult acoustic spaces.

Patent
01 Aug 1997
TL;DR: In this article, a call-placement system for telephone services in response to speech is described, which allows a customer to place a call by speaking a person's name which serves as a destination identifier without having to speak an additional command or steering word.
Abstract: Methods and apparatus for activating telephone services in response to speech are described. A directory including names is maintained for each customer. A speaker dependent speech template and a telephone number for each name, is maintained as part of each customer's directory. Speaker independent speech templates are used for recognizing commands. The present invention has the advantage of permitting a customer to place a call by speaking a person's name which serves as a destination identifier without having to speak an additional command or steering word to place the call. This is achieved by treating the receipt of a spoken name in the absence of a command as an implicit command to place a call. Explicit speaker independent commands are used to invoke features or services other than call placement. Speaker independent and speaker dependent speech recognition are performed on a customer's speech in parallel. An arbiter is used to decide which function or service should be performed when an apparent conflict arises as a result of both the speaker dependent and speaker independent speech recognition step outputs. Stochastic grammars, word spotting and/or out-of-vocabulary rejection are used as part of the speech recognition process to provide a user friendly interface which permits the use of spontaneous speech. Voice verification is performed on a selective basis where security is of concern.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: Experimental results in the context of batch supervised adaptation demonstrate the effectiveness of the proposed speaker adaptive training method in large vocabulary speech recognition tasks and show that significant reductions in word error rate can be achieved over the common pooled speaker-independent paradigm.
Abstract: This paper describes the speaker adaptive training (SAT) approach for speaker independent (SI) speech recognizers as a method for joint speaker normalization and estimation of the parameters of the SI acoustic models. In SAT, speaker characteristics are modeled explicitly as linear transformations of the SI acoustic parameters. The effect of inter-speaker variability in the training data is reduced, leading to parsimonious acoustic models that represent more accurately the phonetically relevant information of the speech signal. The proposed training method is applied to the Wall Street Journal (WSJ) corpus that consists of multiple training speakers. Experimental results in the context of batch supervised adaptation demonstrate the effectiveness of the proposed method in large vocabulary speech recognition tasks and show that significant reductions in word error rate can be achieved over the common pooled speaker-independent paradigm.
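
A minimal sketch of the core SAT modeling assumption stated above, applying one speaker's affine transform to the SI Gaussian means; the alternating estimation of transforms and SI parameters during training is omitted:

```python
import numpy as np

def adapt_means(si_means, A, b):
    """Map SI Gaussian means (G, D) into speaker s's space: mu_s = A @ mu + b.
    In SAT, the per-speaker transforms (A, b) and the SI means are re-estimated
    alternately, so inter-speaker variability is absorbed by the transforms."""
    return si_means @ A.T + b
```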

Patent
Stephane H. Maes1
28 Jan 1997
TL;DR: In this article, a consistency check in the form of a decision tree is provided to accelerate the speaker recognition process and increase the accuracy of the system's recognition. But the consistency check is only applied to the speaker-independent recognition model.
Abstract: Speaker recognition is attempted on input speech signals concurrently with provision of input speech signals to a speech recognition system. If a speaker is recognized, a speaker dependent model which has been trained on an enrolled speaker is supplied to the speech recognition system. If not recognized, then a speaker-independent recognition model is used or, alternatively, the new speaker is enrolled. Other speaker specific information such as a special language model, grammar, vocabulary, a dictionary, a list of names, a language and speaker dependent preferences can also be provided to improve the speech recognition function or even configure or customize the speech recognition system or the response of any system such as a computer or network controlled in response thereto. A consistency check in the form of a decision tree is preferably provided to accelerate the speaker recognition process and increase the accuracy thereof. Further training of a model and/or enrollment of additional speakers may be initiated upon completion of speaker recognition and/or adaptively upon each speaker utterance.

Patent
30 May 1997
TL;DR: In this article, the authors propose a method to authenticate a voice message recipient's network address by generating a "network file" that includes voice clips and associated network addresses that are extracted from voice messages received across a network.
Abstract: Authentication of voice message recipient network addresses employs generating (102) and storing (104) a 'network file' that includes 'voice clips' and associated network addresses that are extracted from voice messages received across a network (10) from voice message systems (16, 18). A voice clip is the first one to three seconds of voice extracted from each received voice message. Over time, the network file will grow to contain multiple voice clips and associated network voice message addresses. When a voice message originator subsequently enters a recipient's network address (106), the originating voice message system searches (114) the network file for the network address, retrieves the associated voice clip (116), and plays it for the voice message originator to authenticate the recipient's network address. Voice authentication of a voice message originator entails encoding (134) into a 'voice print file', original voice clips and associated network addresses received from positively identified voice message originators. Thereafter, when a questionable voice message is received (138), the voice message system extracts a new voice clip (142), generates a new voice print (144), and compares it with the original voice print associated with the voice message address (148). If the voice prints are substantially the same, the received voice message is annotated with an 'authenticating' message (150).

PatentDOI
TL;DR: In this paper, a speaker class processing model which is speaker independent within the class may be trained on one or more members of the class and selected for implementation in a speech recognition processor in accordance with the speaker class recognized to further improve speech recognition to level comparable to that of a speaker dependent model.
Abstract: Clusters of quantized feature vectors are processed against each other, using a threshold distance value, to cluster the mean values of sets of parameters contained in speaker-specific codebooks, forming classes of speakers against which feature vectors computed from an arbitrary input speech signal can be compared to identify a speaker class. The number of codebooks considered in the comparison may thus be reduced, limiting mixture elements which engender ambiguity and reduce system response speed when the speaker population becomes large. A speaker class processing model which is speaker independent within the class may be trained on one or more members of the class and selected for implementation in a speech recognition processor in accordance with the speaker class recognized, to further improve speech recognition to a level comparable to that of a speaker dependent model. Formation of speaker classes can be supervised by identification of groups of speakers to be included in the class, with the speaker class dependent model trained on members of a respective group.
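
A greedy, single-pass reading of the threshold-based clustering step, sketched below; the assignment order and centroid update are assumptions, not the patent's exact algorithm:

```python
import numpy as np

def cluster_speakers(codebook_means, threshold):
    """codebook_means: (N, D) mean parameter vectors from N speaker-specific
    codebooks. Groups speakers whose means fall within `threshold` of an
    existing class centroid; otherwise a new speaker class is started."""
    classes, centroids = [], []
    for i, m in enumerate(codebook_means):
        for c, cent in enumerate(centroids):
            if np.linalg.norm(m - cent) < threshold:
                classes[c].append(i)
                centroids[c] = codebook_means[classes[c]].mean(axis=0)
                break
        else:
            classes.append([i])
            centroids.append(np.array(m, dtype=float))
    return classes
```

At recognition time, only the codebooks of the identified class need to be searched, which is how the comparison count is reduced.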

PatentDOI
Mazin G. Rahim1
TL;DR: A speech recognition system which effectively recognizes unknown speech from multiple acoustic environments includes a set of secondary models, each associated with one or more particular acoustic environments, integrated with a base set of recognition models.
Abstract: A speech recognition system which effectively recognizes unknown speech from multiple acoustic environments includes a set of secondary models, each associated with one or more particular acoustic environments, integrated with a base set of recognition models. The speech recognition system is trained by making a set of secondary models in a first stage of training, and integrating the set of secondary models with a base set of recognition models in a second stage of training.

Proceedings Article
01 Jan 1997
TL;DR: A statistical model of pitch is developed that allows unbiased estimation of pitch statistics from pitch tracks subject to doubling and/or halving; the authors argue via a simple correlation model, and demonstrate empirically, that "clean" pitch follows a lognormal distribution rather than the often assumed normal distribution.
Abstract: Statistics of pitch have recently been used in speaker recognition systems with good results. The success of such systems depends on robust and accurate computation of pitch statistics in the presence of pitch tracking errors. In this work, we develop a statistical model of pitch that allows unbiased estimation of pitch statistics from pitch tracks which are subject to doubling and/or halving. We first argue by a simple correlation model and empirically demonstrate by QQ plots that “clean” pitch is distributed with a lognormal distribution rather than the often assumed normal distribution. Second, we present a probabilistic model for estimated pitch via a pitch tracker in the presence of doubling/halving, which leads to a mixture of three lognormal distributions with tied means and variances for a total of four free parameters. We use the obtained pitch statistics as features in speaker verification on the March 1996 NIST Speaker Recognition Evaluation data (subset of Switchboard) and report results on the most difficult portion of the database: the “one-session” condition with males only for both the claimant and imposter speakers. Pitch statistics provide 22% reduction in false alarm rate at 1% miss rate and 11% reduction in false alarm rate at 10% miss rate over the cepstrum-only system.
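
In symbols (notation assumed, not the paper's): with $x = \log F_0$, pitch halving and doubling shift $x$ by $-\log 2$ and $+\log 2$ respectively, giving the tied three-component mixture with four free parameters $(\mu, \sigma^2, \lambda_h, \lambda_d)$ that the abstract describes:

```latex
p(x) \;=\; \lambda_{h}\,\mathcal{N}\!\left(x;\ \mu-\log 2,\ \sigma^{2}\right)
\;+\; \left(1-\lambda_{h}-\lambda_{d}\right)\mathcal{N}\!\left(x;\ \mu,\ \sigma^{2}\right)
\;+\; \lambda_{d}\,\mathcal{N}\!\left(x;\ \mu+\log 2,\ \sigma^{2}\right)
```

Fitting this mixture lets $\mu$ and $\sigma^2$ be estimated without bias from the doubling/halving errors, since those errors are modeled explicitly by the shifted components.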

Patent
21 Feb 1997
TL;DR: In this article, a speech model is produced for use in determining whether a speaker associated with the speech model produced an unidentified speech sample, without using an external mechanism to monitor the accuracy with which the contents were identified.
Abstract: A speech model is produced for use in determining whether a speaker associated with the speech model produced an unidentified speech sample. First a sample of speech of a particular speaker is obtained. Next, the contents of the sample of speech are identified using speech recognition. Finally, a speech model associated with the particular speaker is produced using the sample of speech and the identified contents thereof. The speech model is produced without using an external mechanism to monitor the accuracy with which the contents were identified.


01 Jan 1997
TL;DR: A learning-based approach to speech recognition and person recognition from image sequences is presented. It is shown that, besides speech information, the recovered model parameters also contain person-dependent information, and a novel method for person recognition based on these features is presented.
Abstract: This thesis presents a learning based approach to speech recognition and person recognition from image sequences. An appearance based model of the articulators is learned from example images and is used to locate, track, and recover visual speech features. A major difficulty in model based approaches is to develop a scheme which is general enough to account for the large appearance variability of objects but which does not lack in specificity. The method described here decomposes the lip shape and the intensities in the mouth region into weighted sums of basis shapes and basis intensities, respectively, using a Karhunen-Loeve expansion. The intensities deform with the shape model to provide shape independent intensity information. This information is used in image search, which is based on a similarity measure between the model and the image. Visual speech features can be recovered from the tracking results and represent shape and intensity information. A speechreading (lip-reading) system is presented which models these features by Gaussian distributions and their temporal dependencies by hidden Markov models. The models are trained using the EM-algorithm and speech recognition is performed based on maximum posterior probability classification. It is shown that, besides speech information, the recovered model parameters also contain person dependent information and a novel method for person recognition is presented which is based on these features. Talking persons are represented by spatio-temporal models which describe the appearance of the articulators and their temporal changes during speech production. Two different topologies for speaker models are described: Gaussian mixture models and hidden Markov models. The proposed methods were evaluated for lip localisation, lip tracking, speech recognition, and speaker recognition on an isolated digit database of 12 subjects, and on a continuous digit database of 37 subjects. The techniques were found to achieve good performance for all tasks listed above. For an isolated digit recognition task, the speechreading system outperformed previously reported systems and performed slightly better than untrained human speechreaders.
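
The Karhunen-Loeve decomposition of lip shapes described above is, in practice, PCA on flattened contour coordinates; a minimal sketch with an assumed array layout:

```python
import numpy as np

def fit_shape_basis(shapes, k):
    """shapes: (N, 2P) training lip contours, each flattened from P (x, y)
    points. Returns the mean shape and the top-k Karhunen-Loeve basis shapes."""
    mean = shapes.mean(axis=0)
    _, _, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    return mean, vt[:k]

def encode(shape, mean, basis):
    return basis @ (shape - mean)   # weights of the basis shapes

def decode(weights, mean, basis):
    return mean + weights @ basis   # weighted sum of basis shapes
```

The same decomposition applied to mouth-region intensities (deformed to the mean shape) yields the shape-independent intensity weights the thesis uses as visual speech features.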