
Showing papers on "Speaker recognition" published in 1991


Journal ArticleDOI
TL;DR: The role of statistical methods in this powerful technology as applied to speech recognition is addressed and a range of theoretical and practical issues that are as yet unsolved in terms of their importance and their effect on performance for different system implementations are discussed.
Abstract: The use of hidden Markov models for speech recognition has become predominant in the last several years, as evidenced by the number of published papers and talks at major speech conferences. The reasons this method has become so popular are the inherent statistical (mathematically precise) framework; the ease and availability of training algorithms for estimating the parameters of the models from finite training sets of speech data; the flexibility of the resulting recognition system in which one can easily change the size, type, or architecture of the models to suit particular words, sounds, and so forth; and the ease of implementation of the overall recognition system. In this expository article, we address the role of statistical methods in this powerful technology as applied to speech recognition and discuss a range of theoretical and practical issues that are as yet unsolved in terms of their importance and their effect on performance for different system implementations.

1,480 citations
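
The forward-algorithm likelihood evaluation sits at the core of the HMM framework the article surveys. A minimal sketch in Python, assuming a toy discrete-observation model (all parameter values are invented for illustration):

```python
# Minimal sketch of the HMM forward algorithm (discrete observations),
# the likelihood evaluation at the core of HMM-based recognizers.
# All model parameters here are toy values, not from the paper.
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """log P(obs | model) via the scaled forward recursion.

    pi  : (N,)   initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P(j | i)
    B   : (N, M) emission probabilities, B[i, k] = P(symbol k | state i)
    obs : sequence of integer observation symbols
    """
    alpha = pi * B[:, obs[0]]
    log_lik = 0.0
    for t in range(1, len(obs)):
        scale = alpha.sum()        # rescale to avoid underflow
        log_lik += np.log(scale)
        alpha = (alpha / scale) @ A * B[:, obs[t]]
    return log_lik + np.log(alpha.sum())

# Toy 2-state, 3-symbol model: score one observation sequence.
pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward_log_likelihood(pi, A, B, [0, 1, 2, 2]))
```

In an isolated-word recognizer of the kind the article describes, one such model is trained per word and the word whose model yields the highest likelihood wins.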


PatentDOI
TL;DR: A speech recognition apparatus having reference pattern adaptation stores a plurality of reference patterns representing speech to be recognized, each stored reference pattern having associated therewith a quality value representing the effectiveness of that pattern for recognizing an incoming speech utterance.
Abstract: A speech recognition apparatus having reference pattern adaptation stores a plurality of reference patterns representing speech to be recognized, each stored reference pattern having associated therewith a quality value representing the effectiveness of that pattern for recognizing an incoming speech utterance. The method and apparatus provide user correction actions representing the accuracy of a speech recognition, dynamically, during the recognition of unknown incoming speech utterances and after training of the system. The quality values are updated, during the speech recognition process, for at least a portion of those reference patterns used during the speech recognition process. Reference patterns having low quality values, indicative of either inaccurate representation of the unknown speech or non-use, can be deleted so long as the reference pattern is not needed, for example, where the reference pattern is the last instance of a known word or phrase. Various methods and apparatus are provided for determining when reference patterns can be deleted or added, to the reference memory, and when the scores or values associated with a reference pattern should be increased or decreased to represent the "goodness" of the reference pattern in recognizing speech.

263 citations
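
A minimal sketch of the quality-value bookkeeping the patent describes; the data layout, update step, and deletion threshold below are illustrative assumptions, not the patent's actual values:

```python
# Sketch of reference-pattern quality bookkeeping in the spirit of the
# patent. Reward/penalty step and deletion threshold are assumptions.
from collections import defaultdict

class ReferenceStore:
    def __init__(self, delete_below=0.2):
        self.patterns = {}                 # pattern_id -> (word, quality)
        self.instances = defaultdict(set)  # word -> set of pattern_ids
        self.delete_below = delete_below

    def add(self, pattern_id, word, quality=0.5):
        self.patterns[pattern_id] = (word, quality)
        self.instances[word].add(pattern_id)

    def update(self, pattern_id, correct, step=0.1):
        """Raise quality after a confirmed recognition; lower it after
        a user correction action."""
        word, q = self.patterns[pattern_id]
        q = min(1.0, q + step) if correct else max(0.0, q - step)
        self.patterns[pattern_id] = (word, q)

    def prune(self):
        """Delete low-quality patterns, but never the last instance of
        a word (that word would otherwise become unrecognizable)."""
        for pid, (word, q) in list(self.patterns.items()):
            if q < self.delete_below and len(self.instances[word]) > 1:
                del self.patterns[pid]
                self.instances[word].discard(pid)

store = ReferenceStore()
store.add("p1", "open"); store.add("p2", "open"); store.add("p3", "close")
for _ in range(4):
    store.update("p2", correct=False)      # repeated user corrections
store.prune()                              # "p2" goes; "p3" is kept
print(sorted(store.patterns))              # ['p1', 'p3']
```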


Proceedings Article
01 Jan 1991
TL;DR: This paper presents some of the design considerations of BREF, a large read-speech corpus for French designed to provide continuous speech data for the development of dictation machines, for the evaluation of continuous speech recognition systems, and for the study of phonological variations.
Abstract: This paper presents some of the design considerations of BREF, a large read-speech corpus for French. BREF was designed to provide continuous speech data for the development of dictation machines, for the evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and for the study of phonological variations. The texts to be read were selected from 5 million words of the French newspaper, Le Monde. In total, 11,000 texts were selected, with selection criteria that emphasized maximizing the number of distinct triphones. Separate text materials were selected for training and test corpora. Ninety speakers have been recorded, each providing between 5,000 and 10,000 words (approximately 40-70 min.) of speech.

225 citations
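
A greedy selection loop of the kind implied by the triphone-coverage criterion, as a sketch; here characters stand in for phonemes, whereas the real selection would phonemize each candidate text first:

```python
# Greedy text selection maximizing distinct-triphone coverage, a sketch
# of the BREF-style criterion. Characters stand in for phonemes.
def triphones(s):
    s = s.replace(" ", "")
    return {s[i:i + 3] for i in range(len(s) - 2)}

def select_texts(candidates, n):
    covered, chosen = set(), []
    pool = list(candidates)
    for _ in range(n):
        # Pick the text contributing the most triphones not yet covered.
        best = max(pool, key=lambda t: len(triphones(t) - covered))
        chosen.append(best)
        covered |= triphones(best)
        pool.remove(best)
    return chosen, covered

texts = ["le monde entier", "entier ou en partie", "la partie du monde"]
chosen, covered = select_texts(texts, 2)
print(chosen, len(covered))
```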


Proceedings ArticleDOI
19 Feb 1991
TL;DR: DECIPHER as discussed by the authors is a speaker-independent continuous speech recognition system based on hidden Markov model (HMM) technology, which is used in SRI's Air Travel Information Systems (ATIS) and Resource Management systems.
Abstract: This paper describes improvements to DECIPHER, the speech recognition component in SRI's Air Travel Information Systems (ATIS) and Resource Management systems. DECIPHER is a speaker-independent continuous speech recognition system based on hidden Markov model (HMM) technology. We show significant performance improvements in DECIPHER due to (1) the addition of tied-mixture HMM modeling, (2) rejection of out-of-vocabulary speech and background noise while continuing to recognize speech, (3) adaptation to the current speaker, and (4) the implementation of N-gram statistical grammars. Finally, we describe our performance in the February 1991 DARPA Resource Management evaluation (4.8 percent word error) and in the February 1991 DARPA-ATIS speech and SLS evaluations (95 sentences correct, 15 wrong of 140). We show that, for the ATIS evaluation, a well-conceived system integration can be relatively robust to speech recognition errors and to linguistic variability and errors.

172 citations
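
A minimal bigram instance of the N-gram statistical grammars mentioned above, with add-one smoothing; the training sentences and smoothing choice are toy assumptions, not DECIPHER's:

```python
# Minimal bigram statistical grammar with add-one smoothing.
# Training sentences are toy ATIS-flavored examples.
import math
from collections import Counter

sentences = [["show", "flights", "to", "boston"],
             ["show", "fares", "to", "denver"],
             ["list", "flights", "to", "denver"]]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    toks = ["<s>"] + s + ["</s>"]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))

vocab = len(set(unigrams) | {w for _, w in bigrams})

def log_prob(sentence):
    """Smoothed log P(sentence) = sum of log P(w_i | w_{i-1})."""
    toks = ["<s>"] + sentence + ["</s>"]
    lp = 0.0
    for a, b in zip(toks[:-1], toks[1:]):
        lp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
    return lp

print(log_prob(["show", "flights", "to", "denver"]))
```

In recognition, such a grammar score is combined with the acoustic HMM score to rank competing word-sequence hypotheses.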


Journal ArticleDOI
N. Z. Tishby
TL;DR: The results show that even with a short sequence of only four isolated digits, a speaker can be verified with an average equal-error rate of less than 3%, and the small improvement over the vector quantization approach indicates the weakness of the Markovian transition probabilities for characterizing speaker-dependent transitional information.
Abstract: Linear predictive hidden Markov models have proved to be efficient for statistically modeling speech signals. The possible application of such models to statistical characterization of the speaker himself is described and evaluated. The results show that even with a short sequence of only four isolated digits, a speaker can be verified with an average equal-error rate of less than 3%. These results are slightly better than the results obtained using speaker-dependent vector quantizers, with comparable numbers of spectral vectors. The small improvement over the vector quantization approach indicates the weakness of the Markovian transition probabilities for characterizing speaker-dependent transitional information.

121 citations
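
The equal-error rate used to report these results is the operating point where false rejection of true speakers equals false acceptance of impostors. A short sketch that estimates it from two synthetic score distributions:

```python
# Equal-error rate (EER): sweep a decision threshold and find where the
# false-rejection and false-acceptance rates cross. Scores are synthetic.
import numpy as np

def equal_error_rate(genuine, impostor):
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best = (2.0, None)                   # (|FRR - FAR|, EER estimate)
    for t in thresholds:
        frr = np.mean(genuine < t)       # true speakers rejected
        far = np.mean(impostor >= t)     # impostors accepted
        if abs(frr - far) < best[0]:
            best = (abs(frr - far), (frr + far) / 2)
    return best[1]

rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 500)      # scores for true speakers
impostor = rng.normal(0.0, 1.0, 500)     # scores for impostors
print(f"EER ~ {equal_error_rate(genuine, impostor):.3f}")
```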


Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors describe two systems in which neural network classifiers are merged with dynamic programming (DP) time alignment methods to produce high-performance continuous speech recognizers.
Abstract: The authors describe two systems in which neural network classifiers are merged with dynamic programming (DP) time alignment methods to produce high-performance continuous speech recognizers. One system uses the connectionist Viterbi-training (CVT) procedure, in which a neural network with frame-level outputs is trained using guidance from a time alignment procedure. The other system uses multi-state time-delay neural networks (MS-TDNNs), in which embedded DP time alignment allows network training with only word-level external supervision. The CVT results on the TI Digits are 99.1% word accuracy and 98.0% string accuracy. The MS-TDNNs are described in detail, with attention focused on their architecture, the training procedure, and results of applying the MS-TDNNs to continuous speaker-dependent alphabet recognition: on two speakers, word accuracy is respectively 97.5% and 89.7%.

111 citations
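
A sketch of the DP time alignment both systems couple to frame-level network outputs: Viterbi-style alignment of frame scores to a word's state sequence. The score matrix here is random; a trained network would supply it.

```python
# DP time alignment of frame-level classifier outputs to a word's state
# sequence, the coupling used in connectionist Viterbi training.
import numpy as np

def align(log_probs, states):
    """log_probs: (T, C) frame-level log class scores.
    states: the word's class sequence (e.g. its phone states).
    Returns the best path score and frame-to-state assignment."""
    T, S = log_probs.shape[0], len(states)
    D = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    D[0, 0] = log_probs[0, states[0]]
    for t in range(1, T):
        for s in range(S):
            stay = D[t - 1, s]
            advance = D[t - 1, s - 1] if s > 0 else -np.inf
            if advance > stay:                      # enter state s now
                D[t, s] = advance + log_probs[t, states[s]]
                back[t, s] = s - 1
            else:                                   # remain in state s
                D[t, s] = stay + log_probs[t, states[s]]
                back[t, s] = s
    path, s = [S - 1], S - 1                        # backtrace
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        path.append(s)
    return D[-1, -1], path[::-1]

rng = np.random.default_rng(1)
log_probs = np.log(rng.dirichlet(np.ones(4), size=10))  # 10 frames, 4 classes
score, path = align(log_probs, states=[0, 2, 3])
print(score, path)
```

In CVT, the frame-to-state assignment recovered by the backtrace becomes the frame-level training target for the next network pass.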


Journal ArticleDOI
TL;DR: Recent advances in and perspectives of research on speaker-dependent-feature extraction from speech waves, automatic speaker identification and verification, speaker adaptation in speech recognition, and voice conversion techniques are discussed.

108 citations


PatentDOI
TL;DR: A name recognition system used to provide access to a database based on the voice recognition of a proper name spoken by a person who may not know the correct pronunciation of the name.
Abstract: A name recognition system (FIG. 1) used to provide access to a database based on the voice recognition of a proper name spoken by a person who may not know the correct pronunciation of the name. During an enrollment phase (10), for each name-text entered (11) into a text database (12), text-derived recognition models (22) are created for each of a selected number of pronunciations of a name-text, with each recognition model being constructed from a respective sequence of phonetic features (15) generated by a Boltzmann machine (13). During a name recognition phase (20), the spoken input (24, 25) of a name (by a person who may not know the correct pronunciation) is compared (26) with the recognition models (22), looking for a pattern match; selection of a corresponding name-text is made based on a decision rule (28).

95 citations


Proceedings ArticleDOI
Jay G. Wilpon, L.G. Miller, P. Modi
14 Apr 1991
TL;DR: A hidden Markov model based keyword-spotting algorithm developed previously can recognize key words from a predefined vocabulary list spoken in an unconstrained fashion, and improvements in the feature analysis and modeling techniques used to train the system are explored.
Abstract: A hidden Markov model based keyword-spotting algorithm developed previously can recognize key words from a predefined vocabulary list spoken in an unconstrained fashion. Improvements in the feature analysis used to represent the speech signal and modeling techniques used to train the system are explored. The authors discuss several task domain issues which influence evaluation criteria. They present results from extensive evaluations on three speaker-independent databases: the 20-word vocabulary Stonehenge Road Rally database, distributed by the National Security Agency; a five-word vocabulary used to automate operator-assisted calls; and a three-word Spanish vocabulary that is currently being tested in Spain's telephone network. Currently, recognition accuracies range from 99.9% on the Spanish database to 74% (with 8.8 FA/H/W) on the Stonehenge task.

92 citations
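
A much-simplified illustration of the word-spotting decision: score each window of frames by a log-likelihood ratio of a keyword model against a filler/background model and report the best window. Real systems use HMMs for both models; the unit-variance Gaussians and the data below are stand-ins.

```python
# Simplified word-spotting: slide a window over per-frame log-likelihood
# ratios of a keyword model vs. a filler model. Gaussians are stand-ins
# for the HMMs a real spotter would use; all data are synthetic.
import numpy as np

def gauss_logpdf(x, mu, sigma=1.0):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
frames = np.concatenate([rng.normal(0, 1, 40),   # background speech
                         rng.normal(3, 1, 20),   # the "keyword" region
                         rng.normal(0, 1, 40)])  # background speech

llr = gauss_logpdf(frames, 3.0) - gauss_logpdf(frames, 0.0)

win = 20                                         # assumed keyword length
scores = np.convolve(llr, np.ones(win), mode="valid") / win
best = int(scores.argmax())
print(f"best putative hit starts at frame {best}, score {scores[best]:.2f}")
```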


Proceedings ArticleDOI
14 Apr 1991
TL;DR: A speaker verification system using connected word verification phrases has been implemented and studied and the system has been evaluated on a 20-speaker telephone database of connected digit utterances.
Abstract: A speaker verification system using connected word verification phrases has been implemented and studied. Verification utterances are represented as concatenated speaker-dependent whole-word hidden Markov models (HMMs). Verification phrases are specified as strings of words drawn from a small fixed vocabulary, such as the digits. Phrases can either be individualized or randomized for greater security. Training techniques to create speaker-dependent models for verification are used in which initial word models are created by bootstrapping from existing speaker-independent models. The system has been evaluated on a 20-speaker telephone database of connected digit utterances. Using approximately 66 s of connected digit training utterances per speaker, the verification equal-error rate is approximately 3.5% for 1.1 s test utterances and 0.3% for 4.4 s test utterances. In comparison, the performance of a template-based system using the same amount of training data is 6.7% and 1.5%, respectively.

88 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors present a novel technique for obtaining a phonetic transcription for a new word, which is needed to add the new word to the system, using DECtalk's text-to-sound rules.
Abstract: The authors report on the detection of new words for the speaker-dependent and speaker-independent paradigms. A useful operating point in a speaker-dependent paradigm is defined at 71% detection rate and 1% false alarm rate. The authors present a novel technique for obtaining a phonetic transcription for a new word, which is needed to add the new word to the system. The technique utilizes DECtalk's text-to-sound rules to obtain an initial phonetic transcription for the new word. Since these text-to-sound rules are imperfect, a probabilistic transformation technique is used that produces a phonetic pronunciation network of all possible pronunciations given DECtalk's transcription. The network is used to constrain a phonetic recognition process that results in an improved phonetic transcription for the new word. The resulting transcriptions are sufficient for speech recognition purposes.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: Experimental results on a 40-speaker database indicate that the modified neural approach significantly outperforms both a standard multilayer perceptron and a vector quantization based system.
Abstract: A speaker recognition system, using a modified form of feedforward neural network based on radial basis functions (RBFs), is presented. Each person to be recognized has his/her own neural model which is trained to recognize spectral feature vectors representative of his/her speech. Experimental results on a 40-speaker database indicate that the modified neural approach significantly outperforms both a standard multilayer perceptron and a vector quantization based system. The best performance for 4-digit test utterances is obtained from an RBF network with 384 RBF nodes in the hidden layer, giving an 8% true-talker rejection rate for a fixed 1% impostor acceptance rate. Additional advantages include a substantial reduction in training time over an MLP approach, and the ability to readily interpret the resulting model.
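
A minimal RBF-network sketch of this kind of per-speaker model: Gaussian basis functions over spectral vectors with linear output weights fitted by least squares. The data, center count, and kernel width below are illustrative choices, not the paper's.

```python
# Minimal radial-basis-function (RBF) speaker model: Gaussian bases over
# spectral vectors, linear output weights by least squares. All data and
# hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(3)
own = rng.normal(0.5, 1.0, size=(200, 12))      # "target speaker" frames
other = rng.normal(-0.5, 1.0, size=(200, 12))   # impostor frames

X = np.vstack([own, other])
y = np.concatenate([np.ones(200), np.zeros(200)])

centers = X[rng.choice(len(X), 32, replace=False)]  # crude center pick
width = 4.0

def rbf_features(X):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

W, *_ = np.linalg.lstsq(rbf_features(X), y, rcond=None)

test = rng.normal(0.5, 1.0, size=(50, 12))      # more target-speaker frames
score = rbf_features(test) @ W
print(f"mean score: {score.mean():.2f}")        # higher for target speaker
```

Because only the linear output layer is fitted, training reduces to one least-squares solve, which reflects the training-time advantage over an MLP noted in the abstract.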

Proceedings ArticleDOI
14 Apr 1991
TL;DR: Experimental results show that adapting the spectral observation probabilities of each state of the model by the back propagation of errors can correct misclassification errors.
Abstract: Speaker verification is performed by comparing the output probabilities of two Markov models of the same phonetic unit. One of these Markov models is speaker-specific, being built from utterances from the speaker whose identity is to be verified. The second model is built from utterances from a large population of speakers. The performance of the system is improved by treating the pair of models as a connectionist network, an alpha-net, which then allows discriminative training to be carried out. Experimental results show that adapting the spectral observation probabilities of each state of the model by the back propagation of errors can correct misclassification errors. The real-time implementation of the system produced an average digit error rate of 4.5% and only one misclassification in 600 trials using a five-digit sequence.
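
The underlying decision is a likelihood-ratio test between the speaker-specific model and the general-population model of the same unit. A sketch with Gaussian stand-ins for the two Markov models (parameters and threshold are invented):

```python
# Verification by comparing two models of the same unit: accept when the
# speaker-specific model out-scores the general-population model.
# Gaussians stand in for the paper's Markov models.
import numpy as np

def gauss_loglik(x, mu, sigma):
    return np.sum(-0.5 * ((x - mu) / sigma) ** 2
                  - np.log(sigma * np.sqrt(2 * np.pi)))

def verify(utterance, speaker_model, world_model, threshold=0.0):
    llr = (gauss_loglik(utterance, *speaker_model)
           - gauss_loglik(utterance, *world_model))
    return llr > threshold, llr

rng = np.random.default_rng(4)
speaker_model = (1.0, 0.8)    # (mean, std) fitted to the claimed speaker
world_model = (0.0, 1.2)      # fitted to a large speaker population

claimant = rng.normal(1.0, 0.8, 100)   # frames from the true speaker
accepted, llr = verify(claimant, speaker_model, world_model)
print(accepted, f"{llr:.1f}")
```

The alpha-net contribution is to train the two models jointly and discriminatively so this ratio separates true speakers from impostors better than independently trained models do.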

Journal ArticleDOI
TL;DR: A novel algorithm simultaneously performing consonant/vowel (C/V) segmentation and pitch detection is proposed and an improvement of 12% in consonant recognition rate is obtained and the number of recognition candidates is reduced.
Abstract: A novel algorithm simultaneously performing consonant/vowel (C/V) segmentation and pitch detection is proposed. Based on this algorithm, a consonant enhancement method and a hierarchical neural network scheme are explored for Mandarin speech recognition. As a result, an improvement of 12% in consonant recognition rate is obtained and the number of recognition candidates is reduced from 1300 to 63. A series of experiments over all Mandarin syllables (about 1300) is demonstrated in the speaker-dependent mode. Comparisons with the decoder timer waveform algorithm are evaluated to show that the performance is satisfactory. An overall recognition rate of 90.14% is obtained.

Proceedings ArticleDOI
11 Jun 1991
TL;DR: The authors summarize a speaker adaptation algorithm based on codebook mapping from one speaker to a standard speaker to be useful in various kinds of speech recognition systems such as hidden-Markov-model-based, feature-based, and neural-network-based systems.
Abstract: The authors summarize a speaker adaptation algorithm based on codebook mapping from one speaker to a standard speaker. This algorithm has been developed to be useful in various kinds of speech recognition systems such as hidden-Markov-model-based, feature-based, and neural-network-based systems. The codebook mapping speaker adaptation algorithm has been much improved by introducing several ideas based on fuzzy vector quantization. This fuzzy codebook mapping algorithm is also applicable to voice conversion between arbitrary speakers.
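
A sketch of the fuzzy codebook-mapping step: rather than replacing an input spectrum with the single target codeword paired to its nearest source codeword, blend the target speaker's codewords using fuzzy c-means-style memberships. The codebooks and fuzziness exponent below are toy values.

```python
# Fuzzy codebook mapping sketch: blend paired target codewords with
# fuzzy memberships instead of a hard nearest-codeword substitution.
# Codebooks and the fuzziness exponent m are illustrative toys.
import numpy as np

src_codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
tgt_codebook = np.array([[0.2, 0.1], [1.3, 0.9], [2.1, 0.4]])  # paired

def fuzzy_map(x, m=2.0, eps=1e-9):
    d = np.linalg.norm(src_codebook - x, axis=1) + eps
    u = (1.0 / d) ** (2.0 / (m - 1.0))
    u /= u.sum()                    # fuzzy c-means-style memberships
    return u @ tgt_codebook         # weighted blend of target codewords

print(fuzzy_map(np.array([0.9, 0.8])))  # lands near the 2nd target entry
```

The soft blend avoids the quantization noise of hard mapping, which is the gain the abstract attributes to the fuzzy variant.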

Journal ArticleDOI
TL;DR: Comparisons of patterns of confusions for the two tasks supported the notion that voices are remembered in terms of a “prototype” and a set of deviations from that prototype, and that over time the deviations are forgotten so that identification responses converge on the most “typical” sounding voices.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A VQ (vector-quantization)-based text-independent speaker recognition method which is robust against utterance variations, and a normalization method, talker variability normalization (TVN), which normalizes parameter variation taking both inter- and intra-speaker variability into consideration.
Abstract: The authors describe a VQ (vector-quantization)-based text-independent speaker recognition method which is robust against utterance variations. Three techniques are introduced to cope with temporal and text-dependent spectral variations. First, either an ergodic hidden Markov model or a voiced/unvoiced decision is used to classify input speech into broad phonetic classes. Second, a new distance measure, the distortion-intersection measure (DIM), is introduced for calculating VQ distortion of input speech compared to speaker-independent codebooks. Third, a normalization method, talker variability normalization (TVN), is introduced. TVN normalizes parameter variation taking both inter- and intra-speaker variability into consideration. The system was tested using utterances of nine speakers recorded over three years. The combination of the three techniques achieves high speaker identification accuracies of 98.5% using only vocal tract information and 99.0% using both vocal tract and pitch information.
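
The base VQ-distortion decision is: quantize the utterance's frames with each speaker's codebook and pick the speaker with the lowest average distortion. A sketch of that plain measure (not the paper's refined DIM), on synthetic data:

```python
# VQ-distortion speaker identification sketch: score an utterance against
# each speaker's codebook by average quantization distortion. This shows
# the plain measure, not the paper's distortion-intersection refinement.
import numpy as np

rng = np.random.default_rng(5)
codebooks = {                                # per-speaker codebooks (toy)
    "spk_a": rng.normal(0.0, 1.0, (16, 12)),
    "spk_b": rng.normal(1.0, 1.0, (16, 12)),
}

def avg_distortion(frames, codebook):
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).mean()              # nearest codeword per frame

utterance = rng.normal(1.0, 1.0, (120, 12))  # frames from speaker b
scores = {s: avg_distortion(utterance, cb) for s, cb in codebooks.items()}
print(min(scores, key=scores.get), scores)
```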


Journal ArticleDOI
TL;DR: Experimental results using utterances of cerebral palsied persons with an array of articulatory abilities are presented, and an ergodic model is found to outperform a standard left-to-right (Bakis) model structure.

Journal ArticleDOI
TL;DR: Recognition results are as good as those obtained in the time delay neural network system developed by Waibel et al. (1989), and suggest that LVQ could be the basis for a high-performance speech recognition system.
Abstract: A shift-tolerant neural network architecture for phoneme recognition is described. The system is based on algorithms for learning vector quantization (LVQ), recently developed by Kohonen (1986, 1988), which pay close attention to approximating optimal decision lines in a discrimination task. Recognition performances in the 98%-99% correct range were obtained for LVQ networks aimed at speaker-dependent recognition of phonemes in small but ambiguous Japanese phonemic classes. A correct recognition rate of 97.7% was achieved by a large LVQ network covering all Japanese consonants. These recognition results are as good as those obtained in the time delay neural network system developed by Waibel et al. (1989), and suggest that LVQ could be the basis for a high-performance speech recognition system.
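
The LVQ1 rule at the heart of this approach: the winning codebook vector moves toward a training vector when their classes agree and away when they disagree. A sketch on toy 2-D data standing in for spectral vectors:

```python
# LVQ1 sketch: attract the winning prototype to same-class inputs and
# repel it from other-class inputs. Toy 2-D data stand in for spectra.
import numpy as np

rng = np.random.default_rng(6)
proto_x = rng.normal(0, 1, (6, 2))           # codebook (prototype) vectors
proto_y = np.array([0, 0, 0, 1, 1, 1])       # their class labels

X = np.vstack([rng.normal(-1, 0.5, (100, 2)),   # class 0 samples
               rng.normal(+1, 0.5, (100, 2))])  # class 1 samples
y = np.array([0] * 100 + [1] * 100)

alpha = 0.05                                 # learning rate
for _ in range(20):                          # training epochs
    for i in rng.permutation(len(X)):
        w = np.argmin(((proto_x - X[i]) ** 2).sum(1))   # winner
        sign = 1.0 if proto_y[w] == y[i] else -1.0
        proto_x[w] += sign * alpha * (X[i] - proto_x[w])

pred = proto_y[((proto_x[None] - X[:, None]) ** 2).sum(-1).argmin(1)]
print(f"training accuracy: {(pred == y).mean():.2f}")
```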

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The proposed voice conversion algorithm was used with two male speakers and, in terms of speaker identification accuracy, the speech converted by segment-sized units gave a score 20% higher than thespeech converted frame-by-frame.
Abstract: A voice conversion algorithm that uses speech segments as conversion units is proposed. Input speech is decomposed into speech segments by a speech recognition module, and the segments are replaced by speech segments uttered by another speaker. This algorithm makes it possible to convert not only the static characteristics but also the dynamic characteristics of speaker individuality. The proposed voice conversion algorithm was used with two male speakers. Spectrum distortion between target speech and the converted speech was reduced to one-third the natural spectrum distortion between the two speakers. A listening experiment showed that, in terms of speaker identification accuracy, the speech converted by segment-sized units gave a score 20% higher than the speech converted frame-by-frame.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX, and extended it to speaker-dependent speech recognition, which demonstrated a substantial difference between speaker-dependent and -independent systems.
Abstract: The DARPA Resource Management task is used as the domain to investigate the performance of speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX. The error rate for the RM2 test set is 4.3%. They extended SPHINX to speaker-dependent speech recognition. The error rate is reduced to 1.4-2.6% with 600-2400 training sentences for each speaker, which demonstrates a substantial difference between speaker-dependent and -independent systems. Based on speaker-independent models, a study was made of speaker-adaptive speech recognition. With 40 adaptation sentences for each speaker, the error rate can be reduced from 4.3% to 3.1%.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: It is demonstrated that while the static feature gives the best individual performance, multiple linear combinations of feature sets based on regression analysis can reduce error rates.
Abstract: The performance of dynamic features in automatic speaker recognition is examined. Second- and third-order regression analysis examining the performance of the associated feature sets independently, in combination, and in the presence of noise is included. It is shown that each regression order has a clear optimum. These are independent of the analysis order of the static feature from which the dynamic features are derived, and insensitive to low-level noise added to the test speech. It is also demonstrated that while the static feature gives the best individual performance, multiple linear combinations of feature sets based on regression analysis can reduce error rates.
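
The dynamic features examined here are regression coefficients: a least-squares slope fitted to each static coefficient over a short window, applied once for first order and again for second order. A sketch, with the window half-width K as an illustrative choice:

```python
# Regression-based dynamic features: fit a least-squares slope to each
# static coefficient over a +/-K frame window (the usual delta and
# delta-delta construction). K is an illustrative choice.
import numpy as np

def deltas(static, K=2):
    """static: (T, D) static features. Returns first-order regression
    coefficients of the same shape, with edge frames replicated."""
    T = static.shape[0]
    padded = np.pad(static, ((K, K), (0, 0)), mode="edge")
    num = sum(k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
              for k in range(1, K + 1))
    den = 2 * sum(k * k for k in range(1, K + 1))
    return num / den

rng = np.random.default_rng(7)
static = np.cumsum(rng.normal(size=(50, 12)), axis=0)  # smooth-ish features
d1 = deltas(static)                # first-order (velocity) features
d2 = deltas(d1)                    # second-order (acceleration) features
features = np.hstack([static, d1, d2])
print(features.shape)              # (50, 36)
```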

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A combination of a high-performance speaker identification system and an isolated word recognizer is presented, capable of automatically producing speech and speaker identification with a closed set of speakers.
Abstract: A combination of a high-performance speaker identification system and an isolated word recognizer is presented. The front-end text-independent speaker identification system determines the most likely speaker for an input word. The speaker identity is then used to choose the reference word models for the speech recognizer. When used with a closed set of speakers, the combination is capable of automatically producing speech and speaker identification. For an open set of speakers, the speaker recognition system acts as a speaker quantizer which associates the unknown speaker with an acoustically similar speaker. The matching speaker's word models are used in the speech recognizer. The application of this front-end speaker recognizer is described for a DTW and HMM speech recognizer. Results on a combination using a DTW word recognizer are 100% for closed-set experiments.
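
A toy sketch of the two-stage combination: a text-independent speaker scorer picks the closest enrolled speaker, and that speaker's word models are handed to the recognizer. All names and structures below are invented for illustration.

```python
# Sketch of the two-stage combination: a speaker front end selects which
# speaker's word models the recognizer should use. All identifiers and
# data here are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(8)
speaker_codebooks = {"alice": rng.normal(0, 1, (8, 6)),
                     "bob": rng.normal(2, 1, (8, 6))}
word_models = {"alice": {"yes": "alice-yes.mdl", "no": "alice-no.mdl"},
               "bob": {"yes": "bob-yes.mdl", "no": "bob-no.mdl"}}

def closest_speaker(frames):
    def distortion(cb):
        return ((frames[:, None] - cb[None]) ** 2).sum(-1).min(1).mean()
    return min(speaker_codebooks, key=lambda s: distortion(speaker_codebooks[s]))

utt = rng.normal(2, 1, (60, 6))          # an utterance from "bob"
spk = closest_speaker(utt)
print(spk, word_models[spk])             # recognizer now uses bob's models
```

In the open-set case the abstract describes, the selected speaker is merely the acoustically closest enrolled one, so the same selection code serves as the "speaker quantizer".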

Journal ArticleDOI
TL;DR: The methods and motivation for VAA data collection and validation procedures, the current contents of the database, and the results of exploratory research on a 1088-speaker subset of the database are described.

Proceedings ArticleDOI
04 Nov 1991
TL;DR: It is concluded that not only has the VQ technique reduced the amount of computation and storage, but it has also created new ideas for solving various problems in speech/speaker recognition.
Abstract: The author reviews major methods of applying the vector quantization (VQ) technique to speech and speaker recognition. These include speech recognition based on the combination of VQ and the DTW/HMM (dynamic time warping/hidden Markov model) technique; VQ-distortion-based recognition; learning VQ algorithms; speaker adaptation by VQ-codebook mapping; and VQ-distortion-based speaker recognition. It is concluded that not only has the VQ technique reduced the amount of computation and storage, but it has also created new ideas for solving various problems in speech/speaker recognition.
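
Codebook training is the common substrate of all these VQ applications. A k-means sketch of the core iteration of the LBG procedure, on synthetic data:

```python
# VQ codebook training by k-means, the core iteration of the LBG
# procedure underlying the reviewed VQ applications. Data are synthetic.
import numpy as np

def train_codebook(frames, size=8, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), size, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest codeword ...
        nearest = ((frames[:, None] - codebook[None]) ** 2).sum(-1).argmin(1)
        # ... then move each codeword to the centroid of its cell.
        for k in range(size):
            members = frames[nearest == k]
            if len(members):                 # keep empty cells unchanged
                codebook[k] = members.mean(0)
    return codebook

rng = np.random.default_rng(9)
frames = rng.normal(size=(1000, 12))
cb = train_codebook(frames)
dist = ((frames[:, None] - cb[None]) ** 2).sum(-1).min(1).mean()
print(f"average distortion: {dist:.3f}")
```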

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors investigate the use of continuous features derived by perceptual linear predictive (PLP) analysis, examine the effect of adding temporal features, and compare it to the previously studied use of multiframe input.
Abstract: The authors investigate the use of continuous features derived by perceptual linear predictive (PLP) analysis, examine the effect of adding temporal features, and compare it to the previously studied use of multiframe input. Comparisons of the MLP (multilayer perceptron) and conventional Gaussian classifiers are also reported. The speaker-dependent portion of the Resource Management database was used for this test. Additionally, some experiments were performed with a perplexity-2200 speaker-independent recognition task on a subset of the TIMIT database. In each case, the PLP features were used as input to the networks. The experiments show the advantage of continuous PLP features and their first and second temporal derivatives.

PatentDOI
Kazunaga Yoshida, Takao Watanabe
TL;DR: A speech recognition apparatus is adapted to the speech of the particular speaker by converting the reference pattern into a normalized pattern by a neural network unit, internal parameters of which are modified through a learning operation using a normalized feature vector of the training pattern produced by the voice of the particular speaker and normalized on the basis of the reference pattern.
Abstract: A speech recognition apparatus of the speaker adaptation type operates to recognize an inputted speech pattern produced by a particular speaker by using a reference pattern produced by a voice of a standard speaker. The speech recognition apparatus is adapted to the speech of the particular speaker by converting the reference pattern into a normalized pattern by a neural network unit, internal parameters of which are modified through a learning operation using a normalized feature vector of the training pattern produced by the voice of the particular speaker and normalized on the basis of the reference pattern, so that the neural network unit provides an optimum output similar to the corresponding normalized feature vector of the training pattern. In the alternative, the speech recognition apparatus operates to recognize an inputted speech pattern by converting the inputted speech pattern into a normalized speech pattern by the neural network unit, internal parameters of which are modified through a learning operation using a feature vector of the reference pattern normalized on the basis of the training pattern, so that the neural network unit provides an optimum output similar to the corresponding normalized feature vector of the reference pattern and recognizing the normalized speech pattern according to the reference pattern.

Proceedings ArticleDOI
L. Mathan, Laurent Miclet
14 Apr 1991
TL;DR: In this paper, the authors trained multilayer perceptrons to confirm or reject the choice made by a Markov model system during recognition by classifying the trace of the winning model.
Abstract: In isolated-word recognition from everyday speech, a considerable share of the input lies outside the permitted vocabulary, and has to be rejected. The authors trained multilayer perceptrons to confirm or reject the choice made by a Markov model system during recognition by classifying the trace of the winning model. This rejection method is totally independent of the recognition procedure. Results show that performance on a database containing field data is better than with other rejection procedures.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The efficiency of adaptive incremental training using a small number of training tokens extracted from continuous speech was confirmed in the TDNN-LR system, which provides large-vocabulary, continuous speech recognition.
Abstract: An investigation of speech recognition and language processing is described. The speech recognition part consists of the large phonemic time-delay neural networks (TDNNs) which can automatically spot all 24 Japanese phonemes by simply scanning input speech. The language processing part is made up of a predictive LR parser which predicts subsequent phonemes based on the currently proposed phonemes. This TDNN-LR recognition system provides large-vocabulary and continuous speech recognition. Recognition experiments for ATR's conference registration task were performed using the TDNN-LR method. Speaker-dependent phrase recognition rates of 65.1% for the first choices and 88.8% within the fifth choices were attained. Also, efficiency in the adaptive incremental training using a small number of training tokens extracted from continuous speech was confirmed in the TDNN-LR system.