
Showing papers on "Speaker recognition published in 1993"


Book
01 Jan 1993
TL;DR: This book presents the fundamentals of speech recognition, from speech signal analysis and pattern comparison techniques to hidden Markov models, connected-word and large-vocabulary continuous recognition, and task-oriented applications.
Abstract: 1. Fundamentals of Speech Recognition. 2. The Speech Signal: Production, Perception, and Acoustic-Phonetic Characterization. 3. Signal Processing and Analysis Methods for Speech Recognition. 4. Pattern Comparison Techniques. 5. Speech Recognition System Design and Implementation Issues. 6. Theory and Implementation of Hidden Markov Models. 7. Speech Recognition Based on Connected Word Models. 8. Large Vocabulary Continuous Speech Recognition. 9. Task-Oriented Applications of Automatic Speech Recognition.

8,442 citations


Journal ArticleDOI
TL;DR: NOISEX-92 specifies a carefully controlled experiment on artificially noisy speech data, examining performance for a limited digit recognition task but with a relatively wide range of noises and signal-to-noise ratios.

1,960 citations


PatentDOI
TL;DR: A speech recognition interface system capable of handling a plurality of application programs simultaneously, and realizing convenient speech input and output modes which are suitable for the applications in the window systems and the speech mail systems, is presented in this article.
Abstract: A speech recognition interface system capable of handling a plurality of application programs simultaneously, and realizing convenient speech input and output modes which are suitable for the applications in the window systems and the speech mail systems. The system includes a speech recognition unit for carrying out a speech recognition processing for a speech input made by a user to obtain a recognition result; a program management table for managing program management data indicating a speech recognition interface function required by each application program; and a message processing unit for exchanging messages with the plurality of application programs in order to specify an appropriate recognition vocabulary to be used in the speech recognition processing of the speech input to the speech recognition unit, and to transmit the recognition result for the speech input obtained by the speech recognition unit by using the appropriate recognition vocabulary to appropriate ones of the plurality of application programs, according to the program management data managed by the program management table.
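
As a rough illustration of the routing scheme this patent describes, the sketch below keeps a program management table mapping each application to its required recognition vocabulary and dispatches recognition results accordingly. Everything here (class names, the stand-in word-filter "recognizer") is hypothetical, not the patent's implementation.

```python
# Minimal sketch of vocabulary-aware message routing between a speech
# recognizer and multiple applications. All names are illustrative.

class SpeechInterfaceHub:
    def __init__(self):
        # program management table: app id -> required vocabulary
        self.program_table = {}

    def register(self, app_id, vocabulary):
        self.program_table[app_id] = set(vocabulary)

    def recognize(self, audio_words, target_app):
        # Stand-in recognizer: keep only words in the app's vocabulary.
        vocab = self.program_table[target_app]
        return [w for w in audio_words if w in vocab]

    def dispatch(self, audio_words, target_app):
        # Message to the application carrying the recognition result.
        return {"to": target_app,
                "recognized": self.recognize(audio_words, target_app)}

hub = SpeechInterfaceHub()
hub.register("mailer", ["open", "reply", "delete"])
hub.register("editor", ["cut", "paste", "undo"])
print(hub.dispatch(["open", "undo", "reply"], "mailer"))
# {'to': 'mailer', 'recognized': ['open', 'reply']}
```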

255 citations


Proceedings ArticleDOI
27 Apr 1993
TL;DR: Methods that create models specifying both speaker and phonetic information accurately from only a small amount of training data per speaker are investigated, supplemented by a phoneme-independent speaker model that compensates for the lack of speaker information.
Abstract: Methods that create models to specify both speaker and phonetic information accurately by using only a small amount of training data for each speaker are investigated. For a text-dependent speaker recognition method, in which arbitrary key texts are prompted from the recognizer, speaker-specific phoneme models are necessary to identify the key text and recognize the speaker. Two methods of making speaker-specific phoneme models are discussed: phoneme-adaptation of a phoneme-independent speaker model and speaker-adaptation of universal phoneme models. The authors also investigate supplementing these methods by adding a phoneme-independent speaker model to make up for the lack of speaker information. This combination achieves a rejection rate as high as 98.5% for speech that differs from the key text and a speaker verification rate of 100.0%.
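
A hedged sketch of the score combination described above: the best phoneme-specific model score is interpolated with a phoneme-independent speaker-model score. The diagonal-Gaussian scoring and the weight `alpha` are illustrative assumptions, not the paper's method.

```python
import numpy as np

def gauss_loglik(frames, mean, var):
    # Diagonal-Gaussian log-likelihood summed over all frames.
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (frames - mean) ** 2 / var)))

def combined_score(frames, phoneme_models, independent_model, alpha=0.5):
    # Best phoneme-specific score plus the phoneme-independent score.
    phone = max(gauss_loglik(frames, m, v) for m, v in phoneme_models)
    indep = gauss_loglik(frames, *independent_model)
    return alpha * phone + (1 - alpha) * indep

frames = np.random.default_rng(0).normal(1.0, 1.0, 50)
print(combined_score(frames, [(0.0, 1.0), (1.0, 1.0)], (0.5, 2.0)))
```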

189 citations


Journal ArticleDOI
TL;DR: A new class of hidden Markov models is proposed for the acoustic representation of words in an automatic speech recognition system that is more flexible than previously reported fenone-based word models, which lead to an improved capability of modeling variations in pronunciation.
Abstract: A new class of hidden Markov models is proposed for the acoustic representation of words in an automatic speech recognition system. The models, built from combinations of acoustically based sub-word units called fenones, are derived automatically from one or more sample utterances of a word. Because they are more flexible than previously reported fenone-based word models, they lead to an improved capability of modeling variations in pronunciation. They are therefore particularly useful in the recognition of continuous speech. In addition, their construction is relatively simple, because it can be done using the well-known forward-backward algorithm for parameter estimation of hidden Markov models. Appropriate reestimation formulas are derived for this purpose. Experimental results obtained on a 5000-word vocabulary natural language continuous speech recognition task are presented to illustrate the enhanced power of discrimination of the new models.

170 citations


Proceedings Article
23 Sep 1993

125 citations


Journal ArticleDOI
TL;DR: It was found that speaker-adaptive systems outperform both speaker-independent and speaker-dependent systems, suggesting that the most effective system is one that begins with speaker-independent training and continues to adapt to users.
Abstract: The DARPA Resource Management task is used as a domain for investigating the performance of speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. The error rate of the speaker-independent recognition system, SPHINX, was reduced substantially by incorporating between-word triphone models, additional dynamic features, and sex-dependent, semicontinuous hidden Markov models. The error rate for speaker-independent recognition was 4.3%. On speaker-dependent data, the error rate was further reduced to 2.6-1.4% with 600-2400 training sentences for each speaker. Using speaker-independent models, the authors studied speaker-adaptive recognition. Both codebooks and output distributions were considered for adaptation. It was found that speaker-adaptive systems outperform both speaker-independent and speaker-dependent systems, suggesting that the most effective system is one that begins with speaker-independent training and continues to adapt to users.

94 citations



Journal ArticleDOI
TL;DR: An "on the road" evaluation of the performance of an adaptive microphone array for speech enhancement in cars shows a signal-to-noise ratio (SNR) improvement in the range of 10-15 dB over the telephone frequency range with a slight favour for higher frequencies when driving at normal speeds.
Abstract: Handsfree speaker input of mobile telephones is most desirable today in order to enable safe operation in cars. This is also a prerequisite for voice control of car devices. This paper presents an "on the road" evaluation of the performance of an adaptive microphone array for speech enhancement in cars. The results show a signal-to-noise ratio (SNR) improvement in the range of 10-15 dB over the telephone frequency range with a slight favour for higher frequencies when driving at normal speeds.

83 citations


Proceedings ArticleDOI
27 Apr 1993
TL;DR: An algorithm for attributing a sample of unconstrained speech to one of several known speakers is described, based on measurement of the similarity of distributions of features extracted from reference speech samples and from the sample to be attributed.
Abstract: An algorithm for attributing a sample of unconstrained speech to one of several known speakers is described. The algorithm is based on measurement of the similarity of distributions of features extracted from reference speech samples and from the sample to be attributed. The measure of feature distribution similarity employed is not based on any assumed form of the distributions involved. The theoretical basis of the algorithm is examined, and a plausible connection is shown to the divergence statistic of Kullback (1972). Experimental results are presented for the King telephone database and the Switchboard database. The performance of the algorithm is better than that reported for algorithms based on Gaussian modeling and robust discrimination.
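
A minimal sketch of the idea of comparing feature distributions without assuming their form: histogram the reference and test features and compute a symmetric Kullback-style divergence. The binning and smoothing constants are assumptions; the paper's actual similarity measure may differ.

```python
import numpy as np

def divergence(ref_feats, test_feats, bins=32):
    # Non-parametric comparison of two one-dimensional feature samples.
    lo = min(ref_feats.min(), test_feats.min())
    hi = max(ref_feats.max(), test_feats.max())
    p, _ = np.histogram(ref_feats, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(test_feats, bins=bins, range=(lo, hi), density=True)
    p += 1e-10  # avoid log(0) in empty bins
    q += 1e-10
    return float(np.sum((p - q) * np.log(p / q)))  # symmetric divergence

rng = np.random.default_rng(0)
same = divergence(rng.normal(0, 1, 2000), rng.normal(0, 1, 2000))
diff = divergence(rng.normal(0, 1, 2000), rng.normal(1, 1, 2000))
print(same < diff)  # True: closer distributions score lower
```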

78 citations


Journal ArticleDOI
TL;DR: Different architectures for sequence and speech recognition are reviewed, including recurrent networks as well as hybrid systems involving hidden Markov models, sometimes combined with statistical techniques for recognition of sequences of patterns.
Abstract: The task discussed in this paper is that of learning to map input sequences to output sequences. In particular, problems of phoneme recognition in continuous speech are considered, but most of the discussed techniques could be applied to other tasks, such as the recognition of sequences of handwritten characters. The systems considered in this paper are based on connectionist models, or artificial neural networks, sometimes combined with statistical techniques for recognition of sequences of patterns, stressing the integration of prior knowledge and learning. Different architectures for sequence and speech recognition are reviewed, including recurrent networks as well as hybrid systems involving hidden Markov models.

PatentDOI
TL;DR: The improved baseline algorithm addresses the co-channel problem of speaker spotting when plural speech signals are intermixed on the same channel by using a union of reference sets for pairs of speakers as the reference set for a co- channel signal, and/or by conversational state modelling.
Abstract: A speaker recognition apparatus employs a non-parametric baseline algorithm for speaker recognition which characterizes a given speaker's speech patterns by a set of speech feature vectors, and generates match scores which are sums of a ScoreA, set equal to the average of the minimum Euclidean squared distance between the unknown speech frame and all reference frames of a given speaker over all frames of the unknown input, and a ScoreB, set equal to the average of the minimum Euclidean squared distance from each frame of the reference set to all frames of the unknown input. The performance on a queue of talkers is further improved by normalization of reference message match distances. The improved baseline algorithm addresses the co-channel problem of speaker spotting when plural speech signals are intermixed on the same channel by using a union of reference sets for pairs of speakers as the reference set for a co-channel signal, and/or by conversational state modelling.
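
The ScoreA/ScoreB computation above is concrete enough to sketch directly; the sketch below assumes numpy arrays of feature vectors and sums the two averaged minimum squared Euclidean distances. The message normalization and co-channel extensions are omitted.

```python
import numpy as np

def match_score(unknown, reference):
    # Pairwise squared Euclidean distances, shape (n_unknown, n_ref).
    d2 = ((unknown[:, None, :] - reference[None, :, :]) ** 2).sum(axis=2)
    score_a = d2.min(axis=1).mean()  # each unknown frame -> nearest reference
    score_b = d2.min(axis=0).mean()  # each reference frame -> nearest unknown
    return score_a + score_b

rng = np.random.default_rng(1)
ref = rng.normal(size=(200, 12))     # reference feature vectors
unk = ref[:100] + rng.normal(scale=0.1, size=(100, 12))
print(match_score(unk, ref))         # small score = likely the same speaker
```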


Journal ArticleDOI
TL;DR: A method for combining phonetic and fenonic models is presented and results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported.
Abstract: A technique for constructing Markov models for the acoustic representation of words is described. Word models are constructed from models of subword units called fenones. Fenones represent very short speech events and are obtained automatically through the use of a vector quantizer. The fenonic baseform for a word-i.e., the sequence of fenones used to represent the word-is derived automatically from one or more utterances of that word. Since the word models are all composed from a small inventory of subword models, training for large-vocabulary speech recognition systems can be accomplished with a small training script. A method for combining phonetic and fenonic models is presented. Results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported. The results are compared with those for phonetics-based Markov models and template-based dynamic programming (DP) matching.
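
A small sketch of the fenone idea under stated assumptions: a k-means codebook plays the role of the vector quantizer, and a word's fenonic baseform is simply the codeword sequence observed in a sample utterance. The Markov-model construction and forward-backward training are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random arrays stand in for real acoustic frames; codebook size is an
# illustrative assumption.
rng = np.random.default_rng(2)
training_frames = rng.normal(size=(1000, 12))   # pooled acoustic frames
vq = KMeans(n_clusters=16, n_init=10, random_state=0).fit(training_frames)

utterance = rng.normal(size=(40, 12))           # one sample utterance of a word
baseform = vq.predict(utterance)                # sequence of fenone labels
print(baseform[:10])                            # one fenone label per frame
```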

Proceedings ArticleDOI
27 Apr 1993
TL;DR: The authors use speaker identification (SI) for a performance evaluation as it is very sensitive to feature changes, and propose a target for robustness in terms of matched noise conditions, which is found to give the best resilience under cross conditions for a single feature.
Abstract: A variety of features and their sensitivity to noise mismatch between the model and test noise conditions are assessed. The authors use speaker identification (SI) for a performance evaluation as it is very sensitive to feature changes, and propose a target for robustness in terms of matched noise conditions. Two primary features, mel frequency cepstral coefficients (MFCCs) and PLP, are considered along with their RASTA and first-order regression extensions. PLP-RASTA is found to give the best resilience under cross conditions for a single feature, and the linear discriminant analysis (LDA) combination of MFCC and PLP-RASTA gives the best performance overall. Only in combined training are satisfactory results for any feature found.
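
The LDA combination of feature streams can be sketched as concatenate-then-project; random arrays stand in for real MFCC and PLP-RASTA features, and the particular stacking scheme is an assumption rather than the paper's recipe.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
n = 300
mfcc = rng.normal(size=(n, 13))             # stand-in MFCC features
plp_rasta = rng.normal(size=(n, 13))        # stand-in PLP-RASTA features
speaker = rng.integers(0, 5, size=n)        # labels for 5 speakers

combined = np.hstack([mfcc, plp_rasta])     # stack the two streams
lda = LinearDiscriminantAnalysis(n_components=4).fit(combined, speaker)
projected = lda.transform(combined)         # discriminative subspace
print(projected.shape)                      # (300, 4)
```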

PatentDOI
TL;DR: In this paper, a plurality of feature data having a series of coefficient sets is generated, each set having a set of coefficients indicating the short term spectral amplitude of a speech signal.
Abstract: Apparatus and method for speaker recognition includes generating, in response to a speech signal, a plurality of feature data having a series of coefficient sets, each set having a plurality of coefficients indicating the short term spectral amplitude in a plurality of frequency bands. The feature data is compared with predetermined speaker reference data, and recognition of a corresponding speaker is indicated in dependence upon such comparison. The frequency bands are unevenly spaced along the frequency axis, and a long term average spectral magnitude of at least one of said coefficients is derived and used for normalizing the at least one coefficient.
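
A hedged sketch of the feature scheme the patent describes: band amplitudes from unevenly spaced frequency bands (log-spaced here, an assumption), each normalized by its long-term average magnitude.

```python
import numpy as np

def band_features(frames, sample_rate=8000, n_bands=8):
    spectrum = np.abs(np.fft.rfft(frames, axis=1))        # short-term amplitude
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sample_rate)
    edges = np.geomspace(100, sample_rate / 2, n_bands + 1)  # uneven spacing
    feats = np.stack([spectrum[:, (freqs >= lo) & (freqs < hi)].mean(axis=1)
                      for lo, hi in zip(edges[:-1], edges[1:])], axis=1)
    # Divide by the long-term (across-frame) average of each coefficient.
    return feats / feats.mean(axis=0, keepdims=True)

rng = np.random.default_rng(4)
frames = rng.normal(size=(50, 256))     # 50 windowed speech frames
print(band_features(frames).shape)      # (50, 8) coefficient sets
```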

Proceedings ArticleDOI
27 Apr 1993
TL;DR: The authors describe speech segmentation and clustering algorithms based on speaker features, where speakers, the number of speakers, and speech context are unknown.
Abstract: The authors describe speech segmentation and clustering algorithms based on speaker features, where speakers, the number of speakers, and speech context are unknown. Several problems are formulated and their solutions are proposed. As in the simpler case, when speech segmentations are known, the output probability clustering algorithm is proposed. In the case of unknown segmentation, an ergodic HMM (hidden Markov model)-based technique is applicable. Both cases are evaluated using simulated multispeaker dialogue speech data.

Proceedings ArticleDOI
27 Apr 1993
TL;DR: A novel approach to the problems of topic and speaker identification that makes use of large-vocabulary continuous speech recognition and some empirical results on topic and Speaker identification that have been obtained on the extensive Switchboard corpus of telephone conversations are presented.
Abstract: The authors describe a novel approach to the problems of topic and speaker identification that makes use of large-vocabulary continuous speech recognition. A theoretical framework for dealing with these problems in a symmetric way is provided. Some empirical results on topic and speaker identification that have been obtained on the extensive Switchboard corpus of telephone conversations are presented.


Journal Article
TL;DR: A text-independent speaker recognition method using predictive neural networks, nonlinear prediction models based on multilayer perceptrons, gave the highest recognition accuracy of 100.0%, and the effectiveness of the predictive neural networks for representing speaker individuality was clarified.
Abstract: A text-independent speaker recognition method using predictive neural networks is described. The speech production process is regarded as a nonlinear process, so the speaker individuality in the speech signal also includes nonlinearity. Therefore, the predictive neural network, which is a nonlinear prediction model based on multilayer perceptrons, is expected to be a more suitable model for representing speaker individuality. For text-independent speaker recognition, an ergodic model which allows transitions to any other state is adopted as the speaker model and one predictive neural network is assigned to each state. The proposed method was compared to distortion-based methods, hidden Markov model (HMM)-based methods, and a discriminative neural-network-based method through text-independent speaker recognition experiments on 24 female speakers. The proposed method gave the highest recognition accuracy of 100.0% and the effectiveness of the predictive neural networks for representing speaker individuality was clarified.
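
A sketch of prediction-based scoring under stated assumptions: each speaker owns a predictor of the next frame from the current one, and the accumulated prediction error over an utterance is the recognition measure. A least-squares linear predictor stands in for the paper's multilayer perceptron, and the ergodic state structure is omitted.

```python
import numpy as np

def train_predictor(frames):
    x, y = frames[:-1], frames[1:]
    w, *_ = np.linalg.lstsq(x, y, rcond=None)   # predict next frame from current
    return w

def prediction_error(frames, w):
    # Mean squared prediction error accumulated across the utterance.
    return float(((frames[:-1] @ w - frames[1:]) ** 2).mean())

rng = np.random.default_rng(5)
spk_a = np.cumsum(rng.normal(size=(300, 8)), axis=0) * 0.01  # smooth source
spk_b = rng.normal(size=(300, 8))                            # noisy source
models = {"A": train_predictor(spk_a), "B": train_predictor(spk_b)}

test = spk_a[:100]  # an utterance from speaker A
best = min(models, key=lambda s: prediction_error(test, models[s]))
print(best)  # expected: "A" (smallest accumulated prediction error)
```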

Proceedings Article
01 Jan 1993
TL;DR: The ESPRIT WERNICKE project brings together a number of different groups from Europe and the US and focuses on extending the state-of-the-art for hybrid hidden Markov model/connectionist approaches to large vocabulary, continuous speech recognition.
Abstract: This paper describes the research underway for the ESPRIT WERNICKE project. The project brings together a number of different groups from Europe and the US and focuses on extending the state-of-the-art for hybrid hidden Markov model/connectionist approaches to large vocabulary, continuous speech recognition. This paper describes the specific goals of the research and presents the work performed to date. Results are reported for the resource management talker-independent recognition task. The paper concludes with a discussion of the projected future work. Keywords: Recognition, Neural Nets, HMM.

Proceedings Article
01 Jan 1993
TL;DR: It is believed that visual speech does convey personal identity information, and that its use in conjunction with acoustic speech results in improved automatic speaker recognition performance in terms of accuracy, robustness, and protection against impersonation.
Abstract: Humans often rely on multiple senses, particularly hearing and vision, for many recognition tasks. The joint use of acoustic and visual information for reliable automatic speaker recognition is proposed. Each known person is allocated an artificial neural network model: multi-layer perceptrons (MLPs) trained as predictive or discriminative networks are considered (Figure 1: predictive and discriminative classifier modes). For speedier and incremental training, fuzzy ARTMAP models are preferred; each fuzzy ARTMAP model, composed of a pair of fuzzy Adaptive Resonance Theory modules, is trained to make a many-to-one mapping from acoustic speech to visual speech for its allotted person, with the predictive error across an utterance acting as the recognition measure (Figure 2). The expected findings are: an improvement in recognition performance resulting from data fusion for normal input data and for a range of degraded input data conditions; better performance from discriminative models than from predictive models, resulting from the more stringent data alignment (lip sync) requirements and the lack of classifier input information redundancy in the predictive scheme, which is also expected to show less robustness to input data variability; and fuzzy ARTMAP models outperforming MLP predictive models in accuracy, because the fuzzy ARTMAP map field realises a minimax learning rule that conjointly allows the minimisation of predictive error and the maximisation of predictive generalisation [9], a mechanism standard back-propagation learning does not provide, along with tremendous savings in training times. The initial step towards this goal has combined static facial image information with voice, and the results have shown that performance improvements can be achieved even with a relatively simple integration scheme. It is believed that visual speech does convey personal identity information, and that its use in conjunction with acoustic speech results in improved automatic speaker recognition performance in terms of accuracy, robustness, and protection against impersonation.

Journal ArticleDOI
TL;DR: A method for speaker normalization and adaptation using connectionist networks is developed, and the results suggest that rapid speaker adaptation resulting in high classification accuracy can be accomplished by this method.
Abstract: A method for speaker normalization and adaptation using connectionist networks is developed. A speaker-specific linear transformation of observations of the speech signal is computed using second-order network units. Classification is accomplished by a multilayer feedforward network that operates on the normalized speech data. The network is adapted for a new talker by modifying the transformation parameters while leaving the classifier fixed. This is accomplished by backpropagating classification error through the classifier to the second-order transformation units. This method was evaluated for the classification of ten vowels for 76 speakers using the first two formant values of the Peterson-Barney data. The results suggest that rapid speaker adaptation resulting in high classification accuracy can be accomplished by this method.
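
The adaptation scheme lends itself to a short sketch: a frozen classifier operates on a speaker-specific linear transform of the input, and only the transform is updated by gradient descent on the classification error. A fixed softmax layer replaces the paper's multilayer network, and a plain (first-order) linear transform replaces the second-order units; both are simplifying assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(6)
W = rng.normal(size=(3, 2))   # frozen classifier weights (3 vowel classes)
A = np.eye(2)                 # speaker-specific transform, adapted below
x = np.array([0.7, 0.3])      # e.g. a normalized formant pair
y = np.zeros(3); y[1] = 1.0   # target vowel class for the new talker

for _ in range(500):
    p = softmax(W @ (A @ x))
    # Backpropagate classification error through the frozen classifier W
    # to the transform A; only A is updated.
    A -= 0.1 * np.outer(W.T @ (p - y), x)

print(np.argmax(softmax(W @ (A @ x))))  # expected: class 1 after adaptation
```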

Journal ArticleDOI
TL;DR: It is experimentally shown that one can optimize the system and further improve recognition accuracy for speaker-independent recognition by controlling the distance measure's sensitivity to spectral peaks and the spectral tilt and by utilizing the speech dynamic features.
Abstract: Several recently proposed automatic speech recognition (ASR) front-ends are experimentally compared in speaker-dependent and speaker-independent (or cross-speaker) recognition. The perceptually based linear predictive (PLP) front-end, with the root-power sums (RPS) distance measure, yields generally the highest accuracies, especially in cross-speaker recognition. It is experimentally shown that one can optimize the system and further improve recognition accuracy for speaker-independent recognition by controlling the distance measure's sensitivity to spectral peaks and the spectral tilt and by utilizing the speech dynamic features. For a digit vocabulary and five reference templates obtained with a clustering algorithm, the optimization improves recognition accuracy from 97% to 98.1%, with respect to the PLP-RPS front-end.
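
The root-power sums (RPS) distance mentioned above weights cepstral-coefficient differences by their index, which emphasizes spectral peaks over overall tilt. The sketch below adds a weight exponent to suggest the sensitivity control the paper describes; that parameterization is an assumption.

```python
import numpy as np

def rps_distance(c1, c2, power=1.0):
    # Index-weighted cepstral distance: higher-order coefficients,
    # which carry the fine spectral structure, are weighted up.
    k = np.arange(1, len(c1) + 1, dtype=float)
    return float(np.sum((k ** power * (c1 - c2)) ** 2))

a = np.array([1.0, 0.5, 0.2, 0.1])   # cepstral vector of one frame
b = np.array([0.9, 0.6, 0.1, 0.2])   # cepstral vector of another frame
print(rps_distance(a, b))
```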

Proceedings ArticleDOI
21 Mar 1993
TL;DR: It is demonstrated that, counter to intuitions, given a fixed amount of training speech, the number of training speakers has little effect on the accuracy, and how much speech is needed for speaker-independent (SI) recognition in order to achieve the same performance as speaker-dependent (SD) recognition is shown.
Abstract: This paper describes several key experiments in large vocabulary speech recognition. We demonstrate that, counter to our intuitions, given a fixed amount of training speech, the number of training speakers has little effect on the accuracy. We show how much speech is needed for speaker-independent (SI) recognition in order to achieve the same performance as speaker-dependent (SD) recognition. We demonstrate that, though the N-Best Paradigm works quite well up to vocabularies of 5,000 words, it begins to break down with 20,000 words and long sentences. We compare the performance of two feature preprocessing algorithms for microphone independence and we describe a new microphone adaptation algorithm based on selection among several codebook transformations.

Journal ArticleDOI
D.B. Roe, J.G. Wilpon
TL;DR: The dimensions of the speech recognition task, speech feature analysis, pattern classification using hidden Markov models, language processing, and the current accuracy of speech recognition systems are discussed.
Abstract: The fundamentals of speech recognition are reviewed. The dimensions of the speech recognition task, speech feature analysis, pattern classification using hidden Markov models, language processing, and the current accuracy of speech recognition systems are discussed. The applications of speech recognition in telecommunications, voice dictation, speech understanding for data retrieval, and consumer products are examined.

01 Jan 1993
TL;DR: This thesis presents an algorithm for the construction of models that attempt to capture the variation that occurs in the pronunciations of words in spontaneous (i.e., non-read) speech.
Abstract: Over the past 40 years, significant progress has been made in the fields of speech recognition and speech understanding. Current state-of-the-art speech recognition systems are capable of achieving word-level accuracies of 90% to 95% on continuous speech recognition tasks using 5000 words. Even larger systems, capable of recognizing 20,000 words are just now being developed. Speech understanding systems have recently been developed that perform fairly well within a restricted domain. While the size and performance of modern speech recognition and understanding systems are impressive, it is evident to anyone who has used these systems that the technology is primitive compared to our own human ability to understand speech. Some of the difficulties hampering progress in the fields of speech recognition and understanding stem from the many sources of variation that occur during human communication. One of the sources of variation that occurs in human communication is the different ways that words can be pronounced. There are many causes of pronunciation variation, such as: the phonetic environment in which the word occurs, the dialect of the speaker, the speaker's age, the speaker's gender, and the speaking rate. Some researchers have shown improvements in speech recognition performance on a read-speech task through the use of explicit pronunciation modeling, while others have not shown any significant improvements. This thesis presents an algorithm for the construction of models that attempt to capture the variation that occurs in the pronunciations of words in spontaneous (i.e., non-read) speech. A technique for developing alternate pronunciations of words and then estimating the probabilities of the alternate pronunciations is presented. Additionally, we describe the development and implementation of a spoken-language understanding system called the Berkeley Restaurant Project (BeRP). Multiple pronunciation word models constructed using the algorithm proposed in this thesis are evaluated within the context of the BeRP system. The results of this evaluation show that the explicit modeling of variation in the pronunciation of words improves the performance of both the speech recognition and the speech understanding components of the BeRP system.
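
A toy sketch of a multiple-pronunciation lexicon of the kind the thesis evaluates: each word maps to alternate phone sequences with estimated prior probabilities, and a word's score combines an acoustic score with that prior. All entries and the scoring rule are illustrative.

```python
import math

lexicon = {
    "tomato": [(("t", "ah", "m", "ey", "t", "ow"), 0.7),
               (("t", "ah", "m", "aa", "t", "ow"), 0.3)],
}

def word_log_score(word, acoustic_logprob):
    # acoustic_logprob: phone sequence -> log P(audio | pronunciation);
    # score each alternate pronunciation with its prior and keep the best.
    return max(acoustic_logprob(phones) + math.log(p)
               for phones, p in lexicon[word])

# Toy acoustic model that prefers the "ey" variant.
print(word_log_score("tomato",
                     lambda ph: 0.0 if "ey" in ph else -2.0))
```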

Journal ArticleDOI
TL;DR: Relating a speech impairment rating to speech recognition accuracy is a difficult task; a statistical causal model is proposed whose structure supports inference and can therefore be applied to assessments such as the success of automatic recognition of dysarthric speech.
Abstract: The evaluation of the degree of speech impairment and the utility of computer recognition of impaired speech are separately and independently performed. Particular attention is paid to the question of whether there is a relationship between naive listeners' subjective judgments of impaired speech and the performance of a laboratory version of a speech recognition system. Relating a speech impairment rating to speech recognition accuracy is a difficult task. Towards this end, a statistical causal model is proposed. The model's structure is well suited to supporting inference, and it can thus be applied to various assessments such as the success of automatic recognition of dysarthric speech. The application of this model is illustrated with a case study of a dysarthric speaker compared against a normal speaker serving as a control.

Patent
Joseph Desimone, Jian-Tu Hsieh
22 Dec 1993
TL;DR: In this article, a speech signal derived from a user's utterance, and a bio-signal which is indicative of the user's emotional state, are provided to a speech recognition system.
Abstract: The recognition rate of a speech recognition system is improved by compensating for changes in the user's speech that result from factors such as emotion, anxiety or fatigue. A speech signal derived from a user's utterance, and a bio-signal, which is indicative of the user's emotional state, are provided to a speech recognition system. The bio-signal is used to provide a reference frequency that changes when the user's emotional state changes. An utterance is identified by examining the relative magnitudes of its frequency components and the position of the frequency components relative to the reference frequency.
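
A hedged sketch of the patent's idea: locate the utterance's strongest frequency components relative to a bio-signal-derived reference frequency, so that features remain stable when emotional state shifts the voice. The peak picking and the simple ratio normalization are assumptions, not the patent's specification.

```python
import numpy as np

def normalized_spectrum_positions(frame, sample_rate, ref_freq, n_peaks=3):
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    peaks = freqs[np.argsort(spectrum)[-n_peaks:]]   # strongest components
    return np.sort(peaks) / ref_freq                 # positions vs. reference

rng = np.random.default_rng(8)
t = np.arange(1024) / 8000.0
frame = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
print(normalized_spectrum_positions(frame, 8000, ref_freq=110.0))
```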

Proceedings ArticleDOI
21 Mar 1993
TL;DR: A unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods is presented and is shown to be effective for text-independent language, sex, and speaker identification and can enable better and more friendly human-machine interaction.
Abstract: Over the last decade technological advances have been made which enable us to envision real-world applications of speech technologies. It is possible to foresee applications where the spoken query is to be recognized without even prior knowledge of the language being spoken, for example, information centers in public places such as train stations and airports. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or for telephone-based transactions. Ideally, the speaker's identity can be verified continually during the transaction, in a manner completely transparent to the user. With these views in mind, this paper presents a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique is shown to be effective for text-independent language, sex, and speaker identification and can enable better and more friendly human-machine interaction. With 2 s of speech, the language can be identified with better than 99% accuracy. Error in sex identification is about 1% on a per-sentence basis, and speaker identification accuracies of 98.5% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with 2 utterances for both corpora. An experiment using unsupervised adaptation for speaker identification on the 168 TIMIT speakers yielded the same identification accuracies as supervised adaptation.
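
Classification by phone-based acoustic likelihoods reduces to scoring the utterance with each candidate's phone models and taking the argmax; the same loop serves language, sex, or speaker identification. In the sketch below, single Gaussians stand in for phone models and the frame-to-phone alignment is collapsed to a max, both simplifying assumptions.

```python
import numpy as np

def loglik(frames, mean, var):
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (frames - mean) ** 2 / var)))

def identify(frames, models):
    # models: candidate (language / sex / speaker) -> list of (mean, var)
    # phone models; the candidate whose models score highest wins.
    scores = {c: max(loglik(frames, m, v) for m, v in phones)
              for c, phones in models.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(7)
models = {"spk1": [(0.0, 1.0), (2.0, 1.0)],
          "spk2": [(5.0, 1.0), (7.0, 1.0)]}
print(identify(rng.normal(2.0, 1.0, 100), models))  # "spk1"
```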