
Showing papers on "Speaker recognition published in 1989"


BookDOI
01 Jan 1989

528 citations


Proceedings ArticleDOI
23 May 1989
TL;DR: A word-spotting system using Gaussian hidden Markov models is presented and it is observed that performance can be greatly affected by the choice of features used, the covariance structure of the Gaussian models, and transformations based on energy and feature distributions.
Abstract: A word-spotting system using Gaussian hidden Markov models is presented. Several aspects of this problem are investigated. Specifically, results are reported on the use of various signal processing and feature transformation techniques. The authors have observed that performance can be greatly affected by the choice of features used, the covariance structure of the Gaussian models, and transformations based on energy and feature distributions. Due to the open-set nature of the problem, the specific techniques for modeling out-of-vocabulary speech and the choice of scoring metric can have a significant effect on performance.

280 citations
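The frame-level score underlying such Gaussian-HMM word spotting is the log-density of a feature vector under a state's Gaussian. A minimal sketch (diagonal covariance assumed; pure Python, not the paper's implementation):

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian: the per-state,
    per-frame score accumulated when evaluating frames against an HMM."""
    logp = 0.0
    for xi, mu, v in zip(x, mean, var):
        logp += -0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
    return logp

# A frame at the state's mean scores higher than one far from it.
mean, var = [0.0, 0.0], [1.0, 1.0]
at_mean = diag_gaussian_logpdf([0.0, 0.0], mean, var)
far = diag_gaussian_logpdf([3.0, 3.0], mean, var)
```

The covariance structure the authors highlight corresponds here to the choice of `var`: diagonal models only need a variance per feature, while full-covariance models would replace the per-dimension sum with a quadratic form.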


Journal ArticleDOI
TL;DR: It is demonstrated that neural networks are able to extract speech information from the visual images and that this information can be used to improve automatic vowel recognition.
Abstract: Results from a series of experiments that use neural networks to process the visual speech signals of a male talker are presented. In these preliminary experiments, the results are limited to static images of vowels. It is demonstrated that these networks are able to extract speech information from the visual images and that this information can be used to improve automatic vowel recognition. The structure of speech and its corresponding acoustic and visual signals are reviewed. The specific data that was used in the experiments along with the network architectures and algorithms are described. The results of integrating the visual and auditory signals for vowel recognition in the presence of acoustic noise are presented.

212 citations


Proceedings ArticleDOI
23 May 1989
TL;DR: In this paper, an alternative approach to speaker adaptation for a large-vocabulary hidden-Markov-model-based speech recognition system is described, based on the use of a stochastic model representing the different properties of the new speaker and an old speaker for which the full training set of 20 minutes is available.
Abstract: An alternative approach to speaker adaptation for a large-vocabulary hidden-Markov-model-based speech recognition system is described. The goal of this investigation was to train the IBM speech recognition system with only five minutes of speech data from a new speaker instead of the usual 20 minutes without the recognition rate dropping by more than 1-2%. The approach is based on the use of a stochastic model representing the different properties of the new speaker and an old speaker for which the full training set of 20 minutes is available. It is called a speaker Markov model. It is shown how the parameters of such a model can be derived and how it can be used for transforming the training set of the old speaker in order to use it in addition to the short training set of the new speaker. The adaptation algorithm was tested with 12 speakers. The average recognition rate dropped from 96.4% to 95.2% for a 5000-word vocabulary task. The decoding time increased by a factor of 1.35; this factor is often 3-5 if other adaptation algorithms are used.

180 citations


Proceedings ArticleDOI
23 May 1989
TL;DR: The authors have developed a splitting procedure which initializes each new cluster (statistical model) by splitting off all tokens in the training set which were poorly represented by the current set of models, which gives excellent recognition performance in connected-word tasks.
Abstract: The authors describe an HMM (hidden Markov model) clustering procedure and discuss its application to connected-word systems and to large-vocabulary recognition based on phonelike units. It is shown that the conventional approach of maximizing likelihood is easily implemented but does not work well in practice, as it tends to give improved models of tokens for which the initial model was generally quite good, but does not improve tokens which are poorly represented by the initial model. The authors have developed a splitting procedure which initializes each new cluster (statistical model) by splitting off all tokens in the training set which were poorly represented by the current set of models. This procedure is highly efficient and gives excellent recognition performance in connected-word tasks. In particular, for speaker-independent connected-digit recognition, using two HMM-clustered models, the recognition performance is as good as or better than previous results using 4-6 models/digit obtained from template-based clustering.

108 citations


Proceedings ArticleDOI
23 May 1989
TL;DR: The authors present results of speaker-verification technology development for use over long-distance telephone lines, comparing a template-based dynamic-time-warping algorithm with hidden Markov modeling, and discuss discriminant analysis techniques that improve the discrimination between true speakers and imposters.
Abstract: The authors present the results of speaker-verification technology development for use over long-distance telephone lines. A description is given of two large speech databases that were collected to support the development of new speaker verification algorithms. Also discussed are the results of discriminant analysis techniques which improve the discrimination between true speakers and imposters. A comparison is made of the performance of two speaker-verification algorithms, one using template-based dynamic time warping, and the other, hidden Markov modeling.

108 citations
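The template-based side of the comparison rests on dynamic time warping, which aligns two feature sequences of different lengths by dynamic programming. A minimal sketch on scalar features with absolute difference as the local cost (an illustration, not the paper's system):

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two scalar feature sequences.
    D[i][j] holds the cheapest alignment cost of a[:i] against b[:j]."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allow insertion, deletion, or diagonal match moves.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

In a verification system each frame would be a spectral vector rather than a scalar, and the distance to a claimant's reference template would be thresholded to accept or reject.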


Proceedings ArticleDOI
Stephen Cox, John S. Bridle
23 May 1989
TL;DR: A general approach to speaker adaptation in speech recognition is described, in which speaker differences are treated as arising from a parameterized transformation.
Abstract: A general approach to speaker adaptation in speech recognition is described, in which speaker differences are treated as arising from a parameterized transformation. Given some unlabeled data from a particular speaker, a process is described which maximizes the likelihood of this data by estimating the transformation parameters at the same time as refining estimates of the labels. The technique is illustrated using isolated vowel spectra and phonetically motivated linear spectrum transformations and is shown to give significantly better performance than nonadaptive classification.

70 citations


Proceedings ArticleDOI
23 May 1989
TL;DR: A shift-tolerant neural network architecture for phoneme recognition is presented, based on LVQ2, an algorithm which pays close attention to approximating the optimal Bayes decision line in a discrimination task; the results suggest LVQ2 could be the basis for a successful speech recognition system.
Abstract: The authors describe a shift-tolerant neural network architecture for phoneme recognition. The system is based on LVQ2, an algorithm which pays close attention to approximating the optimal Bayes decision line in a discrimination task. Recognition performances in the 98-99% correct range were obtained for LVQ2 networks aimed at speaker-dependent recognition of phonemes in small but ambiguous Japanese phonemic classes. A correct recognition rate of 97.7% was achieved by a single, larger LVQ2 network covering all Japanese consonants. These recognition results are at least as high as those obtained in the time delay neural network system and suggest that LVQ2 could be the basis for a successful speech recognition system.

66 citations
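The LVQ2 rule mentioned above adjusts prototype vectors only when a training sample falls near the boundary between a wrongly classified nearest prototype and a correctly classified runner-up. A sketch of one update step (Euclidean distance; the learning rate and window width are assumed values, not the paper's settings):

```python
def lvq2_update(codebook, labels, x, y, lr=0.1, window=0.3):
    """One LVQ2 step: if the two nearest prototypes straddle the decision
    boundary (one wrong, one correct) and x lies inside the window, push
    the wrong prototype away from x and pull the correct one toward it."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) ** 0.5 for c in codebook]
    order = sorted(range(len(codebook)), key=dists.__getitem__)
    i, j = order[0], order[1]          # nearest and second nearest
    di, dj = dists[i], dists[j]
    # Kohonen's window test on the ratio of the two distances.
    in_window = min(di / max(dj, 1e-12), dj / max(di, 1e-12)) > (1 - window) / (1 + window)
    if in_window and labels[i] != y and labels[j] == y:
        for k in range(len(x)):
            codebook[i][k] -= lr * (x[k] - codebook[i][k])  # repel wrong class
            codebook[j][k] += lr * (x[k] - codebook[j][k])  # attract correct class
    return codebook
```

Concentrating updates in this window is what lets LVQ2 approximate the Bayes decision boundary rather than the class means.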


PatentDOI
TL;DR: A method and apparatus for real-time speech recognition, with and without speaker dependency, is presented.
Abstract: A method and apparatus for real time speech recognition with and without speaker dependency which includes the following steps. Converting the speech signals into a series of primitive sound spectrum parameter frames; detecting the beginning and ending of speech according to the primitive sound spectrum parameter frame, to determine the sound spectrum parameter frame series; performing non-linear time domain normalization on the sound spectrum parameter frame series using sound stimuli, to obtain speech characteristic parameter frame series with predefined lengths on the time domain; performing amplitude quantization normalization on the speech characteristic parameter frames; comparing the speech characteristic parameter frame series with the reference samples, to determine the reference sample which most closely matches the speech characteristic parameter frame series; and determining the recognition result according to the most closely matched reference sample.

54 citations


Proceedings ArticleDOI
23 May 1989
TL;DR: A unified framework for creating effective basic models of speech is discussed, and the relative advantages of whole-word, phoneme-like, and acoustic segment units are pointed out based on the results of a series of recognition experiments.
Abstract: The problem of how to select and construct a set of fundamental unit statistical models suitable for speech recognition is addressed. A unified framework is discussed which can be used to accomplish the goal of creating effective basic models of speech. The performances of three types of fundamental units, namely whole word, phoneme-like, and acoustic segment units, in a 1109-word vocabulary speech recognition task are compared. The authors point out the relative advantages of each type of speech unit based on the results of a series of recognition experiments.

47 citations


Proceedings ArticleDOI
23 May 1989
TL;DR: The authors describe a system for speaker-dependent speech recognition based on acoustic subword units that showed results comparable to those of whole-word-based systems.
Abstract: The authors describe a system for speaker-dependent speech recognition based on acoustic subword units. Several strategies for automatic generation of an acoustic lexicon are outlined. Preliminary tests have been performed on a small vocabulary. In these tests, the proposed system showed results comparable to those of whole-word-based systems.

Journal ArticleDOI
TL;DR: An automatic speaker adaptation algorithm for speech recognition is proposed, in which a small amount of training material of unspecified text can be used; adaptation reduces the mean word recognition error rate from 4.9% to 2.9%.
Abstract: The author proposes an automatic speaker adaptation algorithm for speech recognition, in which a small amount of training material of unspecified text can be used. The algorithm is easily applied to vector-quantization (VQ) speech recognition systems consisting of a VQ codebook and a word dictionary in which each word is represented as a sequence of codebook entries. In the adaptation algorithm, the VQ codebook is modified for each new speaker, whereas the word dictionary is universally used for all speakers. The important feature of this algorithm is that a set of spectra in training frames and the codebook entries are clustered hierarchically. Based on the vectors representing deviation between centroids of the training frame clusters and the corresponding codebook clusters, adaptation is performed hierarchically from small to large numbers of clusters. The spectral resolution of the adaptation process is improved accordingly. Results of recognition experiments using utterances of 100 Japanese city names show that adaptation reduces the mean word recognition error rate from 4.9% to 2.9%. Since the error rate for speaker-dependent recognition is 2.2%, the adaptation method is highly effective.
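A one-level, non-hierarchical version of the codebook-shifting idea can be sketched as follows: assign each adaptation frame to its nearest codebook entry, then move each entry to the centroid of the frames it attracted (the paper's method applies this deviation hierarchically over cluster levels; this is a simplified illustration):

```python
def adapt_codebook(codebook, frames):
    """Shift each VQ codebook entry toward the centroid of the adaptation
    frames it attracts; entries with no evidence are left unchanged."""
    assigned = {i: [] for i in range(len(codebook))}
    for f in frames:
        i = min(range(len(codebook)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(f, codebook[k])))
        assigned[i].append(f)
    new_cb = []
    for i, entry in enumerate(codebook):
        if assigned[i]:
            centroid = [sum(col) / len(assigned[i]) for col in zip(*assigned[i])]
            new_cb.append(centroid)
        else:
            new_cb.append(list(entry))
    return new_cb
```

The hierarchical variant in the paper addresses exactly the weakness of this sketch: with few adaptation frames, many entries receive no evidence, so deviations estimated from coarse clusters are propagated down to finer ones.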

Proceedings ArticleDOI
23 May 1989
TL;DR: The author introduces the methodological novelties that allowed for progress along three axes: from isolated-word recognition to continuous speech, from speaker-dependent recognition to speaker-independent, and from small vocabularies to large vocabularies.
Abstract: An overview is given of recent advances in the domain of speech recognition. The author focuses on speech recognition, but also mentions some progress in other areas of speech processing (speaker recognition, speech synthesis, speech analysis and coding) using similar methodologies. The problems related to automatic speech processing are identified, and the initial approaches that have been followed in order to address those problems are described. The author then introduces the methodological novelties that allowed for progress along three axes: from isolated-word recognition to continuous speech, from speaker-dependent recognition to speaker-independent, and from small vocabularies to large vocabularies. Special emphasis centers on the improvements made possible by Markov models and, more recently, by connectionist models, resulting in improved performance for difficult vocabularies or in more robust systems. Some specialized hardware is described, as are efforts aimed at assessing speech-recognition systems.

Proceedings ArticleDOI
15 Oct 1989
TL;DR: A preliminary investigation of techniques that automatically detect when the speaker has used a word that is not in the vocabulary is described, including a technique that uses a general model for the acoustics of any word to recognize the existence of new words.
Abstract: In practical large vocabulary speech recognition systems, it is nearly impossible for a speaker to remember which words are in the vocabulary. The probability of the speaker using words outside the vocabulary can be quite high. For the case when a speaker uses a new word, current systems will always recognize other words within the vocabulary in place of the new word, and the speaker wouldn't know what the problem is. In this paper, we describe a preliminary investigation of techniques that automatically detect when the speaker has used a word that is not in the vocabulary. We developed a technique that uses a general model for the acoustics of any word to recognize the existence of new words. Using this general word model, we measure the correct detection of new words versus the false alarm rate. Experiments were run using the DARPA 1000-word Resource Management Database for continuous speech recognition. The recognition system used is the BBN BYBLOS continuous speech recognition system (Chow et al., 1987). The preliminary results indicate a detection rate of 74% with a false alarm rate of 3.4%.

Journal ArticleDOI
TL;DR: Although speech recognition applications for disabled people are well within the capacity of available technology, it is primarily a lack of human factors work which is impeding developments in this field.

Proceedings ArticleDOI
23 May 1989
TL;DR: The authors propose a speaker adaptation algorithm which does not depend on speech recognition algorithms and is applied to hidden Markov models and neural networks and evaluated using a database of 216 phonetically balanced words and 5240 important Japanese words uttered by three speakers.
Abstract: The authors propose a speaker adaptation algorithm which does not depend on speech recognition algorithms. The proposed spectral mapping algorithm is based on three ideas: (1) accurate representation of the input vector by separate vector quantization and fuzzy vector quantization, (2) continuous spectral mapping from one speaker to another by fuzzy mapping, and (3) accurate establishment of spectral correspondence based on the fuzzy relationship of the membership function obtained from supervised training. The spectrum dynamic features are also utilized. The algorithm is applied to hidden Markov models (HMMs) and neural networks and evaluated using a database of 216 phonetically balanced words and 5240 important Japanese words uttered by three speakers. The HMM speaker-adapted recognition rate for /b,d,g/ is 79.5%. The average recognition rate for the top-three choices is about 91%. The algorithm was applied to neural networks and resulted in almost the same performance. The algorithm was also applied to voice conversion, and a preference score of 65.6% was obtained.

PatentDOI
Masayuki Sakanishi, Hiroki Yoshida, Takaaki Ishii, Hiroshi Sato, Makoto Hoshino
TL;DR: A speech recognition system that detects a similarity between each speech pattern which has already been registered in the system and a speech pattern newly generated in response to the user's utterance made while the system is in a registration mode.
Abstract: A speech recognition system that detects a similarity between each speech pattern which has already been registered in the system and a speech pattern newly generated in response to the user's utterance made while the system is in a registration mode. The system further provides to the user in the registration mode information representing the detected similarity. The speech recognition system may be incorporated into a telephone apparatus or a radio telephone apparatus, in which a call origination may be automatically made in response to the user's utterance.

Proceedings ArticleDOI
Amano, Aritsuka, Hataoka, Ichikawa
01 Jan 1989
TL;DR: About 80% of the errors occurring in conventional template matching, which the discrimination rules were designed to recover, were in fact recovered, and this confirms the effectiveness of the proposed phoneme recognition method.
Abstract: A rule-based phoneme recognition method is proposed. This method uses neural networks for acoustic feature detection and fuzzy logic for the decision procedure. Rules for phoneme recognition are prepared for each pair of phonemes (pair-discrimination rules). Recognition experiments were performed using Japanese city names uttered by two male speakers. About 80% of the errors occurring in conventional template matching, which the discrimination rules were designed to recover, were in fact recovered (an improvement in recognition rate of 4.0 to 8.0%). This confirms the effectiveness of the proposed method.

Proceedings ArticleDOI
15 Oct 1989
TL;DR: This work discusses refinements of the stochastic segment model, an alternative to hidden Markov models for representation of the acoustic variability of phonemes, and focuses on mechanisms for better modelling time correlation of features across an entire segment.
Abstract: The heart of a speech recognition system is the acoustic model of sub-word units (e.g., phonemes). In this work we discuss refinements of the stochastic segment model, an alternative to hidden Markov models for representation of the acoustic variability of phonemes. We concentrate on mechanisms for better modelling time correlation of features across an entire segment. Results are presented for speaker-independent phoneme classification in continuous speech based on the TIMIT database.

PatentDOI
TL;DR: A speaker verification system receives input speech from a speaker of unknown identity; the speech undergoes linear predictive coding analysis and a transformation that maximizes separability between true speakers and impostors when compared to reference speech parameters which have been similarly transformed.
Abstract: A speaker verification system receives input speech from a speaker of unknown identity. The speech undergoes linear predictive coding (LPC) analysis and transformation to maximize separability between true speakers and impostors when compared to reference speech parameters which have been similarly transformed. The transformation incorporates an "inter-class" covariance matrix of successful impostors within a database.

Proceedings Article
01 Jan 1989
TL;DR: The dynamic-programming algorithm for continuous-speech recognition is modified to produce a list of the top-N sentence hypotheses instead of the usual single sentence, based on a generalization of Bellman's principle of optimality.
Abstract: In this paper, the dynamic-programming algorithm for continuous-speech recognition is modified in order to obtain a top-N sentence-hypotheses list instead of the usual one sentence only. The theoretical basis of this extension is a generalization of Bellman's principle of optimality. Due to the computational complexity of the new algorithm, a sub-optimal variant is proposed, and experimental results within the SPICOS system are presented.
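The top-N idea can be illustrated with a simple best-first enumeration of the n cheapest paths through a small acyclic word lattice (a sketch only; the paper's algorithm is a dynamic-programming generalization, not this search):

```python
import heapq

def n_best_paths(graph, start, goal, n):
    """Enumerate the n lowest-cost start-to-goal paths in an acyclic
    lattice, where graph maps a node to a list of (next_node, cost)."""
    results = []
    heap = [(0.0, start, [start])]
    while heap and len(results) < n:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            results.append((cost, path))  # paths pop in cost order
            continue
        for nxt, c in graph.get(node, []):
            heapq.heappush(heap, (cost + c, nxt, path + [nxt]))
    return results
```

Keeping N hypotheses instead of one lets a later, more expensive knowledge source (e.g. a language model) rescore the list, which is the motivation for such extensions.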

Proceedings ArticleDOI
23 May 1989
TL;DR: The authors have applied connectionist learning procedures to speaker-independent continuous recognition, creating a system which has achieved 97% word accuracy and 91% sentence accuracy in preliminary tests on the TI/NBS connected-digits database.
Abstract: The authors have applied connectionist learning procedures to speaker-independent continuous recognition, creating a system which has achieved 97% word accuracy and 91% sentence accuracy in preliminary tests on the TI/NBS connected-digits database. The system uses a four-layer back-propagation network with recurrent connections to generate and refine hypotheses about the identity of an utterance over successive intervals. The hypotheses generated by the network are used as input to a Markov-chain-based Viterbi recognizer which produces a final identification of the entire utterance.

PatentDOI
TL;DR: A speech recognition method and apparatus take into account a system transfer function between the speaker and the recognition apparatus, updating a signal representing the transfer function on a periodic basis during actual speech recognition.
Abstract: A speech recognition method and apparatus take into account a system transfer function between the speaker and the recognition apparatus. The method and apparatus update a signal representing the transfer function on a periodic basis during actual speech recognition. The transfer function representing signal is updated about every fifty words as determined by the speech recognition apparatus. The method and apparatus generate an initial transfer function representing signal and generate from the speech input, successive input frames which are employed for modifying the value of the current transfer function signal so as to eliminate error and distortion. The error and distortion occur, for example, as a speaker changes the direction of his profile relative to a microphone, as the speaker's voice changes or as other effects occur that alter the spectra of the input speech frames. The method is automatic and does not require the knowledge of the input words or text.
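Periodic updating of a transfer-function estimate is often realized as an exponentially weighted average of incoming frame spectra; a minimal sketch of that idea (the smoothing rate and per-bin list representation are assumptions for illustration, not the patent's specifics):

```python
def update_transfer_estimate(current, frame_spectrum, alpha=0.02):
    """Blend the current per-bin transfer-function estimate with a new
    frame spectrum; small alpha tracks slow channel drift without
    reacting to individual frames."""
    return [(1 - alpha) * h + alpha * f for h, f in zip(current, frame_spectrum)]
```

Dividing (or, in the log domain, subtracting) each incoming frame by such an estimate compensates for microphone position and channel changes of the kind the patent describes.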

Proceedings Article
01 Jan 1989
TL;DR: An algorithm for recognition of connected words has been adapted to an application for mobile radio telephony and several manners of generating feature vectors were evaluated using two databases collected in a small car moving at about 120 km/h.

Proceedings ArticleDOI
01 Jan 1989
TL;DR: Connectionist learning procedures are applied to the task of speaker-independent continuous speech recognition, creating a system which has achieved a recognition rate of 97% correct in preliminary tests on the Texas Instruments/National Bureau of Standards Connected Digits Database.
Abstract: Connectionist learning procedures are applied to the task of speaker-independent continuous speech recognition, creating a system which has achieved a recognition rate of 97% correct in preliminary tests on the Texas Instruments/National Bureau of Standards Connected Digits Database. Two versions of the system were implemented, both of which used four-layer backpropagation networks. One used a static (nonrecurrent) network with a history mechanism, in which the input weights were slaved together, as they are in time-delay neural networks (TDNNs), and the other used a recurrent connection structure similar to that proposed by J.L. Elman (Tech. Rep., Univ. of California, San Diego, April 1988). The final recognition accuracies produced by the two approaches were not significantly different. The networks generated and refined hypotheses about the identity of utterances over successive intervals. The hypotheses generated by the networks were used as input to a Markov-chain-based Viterbi recognizer which produced a final identification of the entire utterance.
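The second stage of such hybrid systems, a Viterbi pass over per-frame network scores, can be sketched as follows (log-domain scores and a dense transition matrix assumed; not the authors' implementation):

```python
def viterbi(obs_scores, trans, init):
    """Best state path given per-frame, per-state log-scores (e.g. network
    outputs), log transition scores trans[p][s], and initial log-scores."""
    n_states = len(init)
    delta = [init[s] + obs_scores[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(obs_scores)):
        new_delta, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: delta[p] + trans[p][s])
            ptr.append(best_prev)
            new_delta.append(delta[best_prev] + trans[best_prev][s] + obs_scores[t][s])
        delta, back = new_delta, back + [ptr]
    # Trace the best final state back through the stored pointers.
    state = max(range(n_states), key=delta.__getitem__)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

The network supplies `obs_scores` frame by frame; the Markov chain's transition structure then enforces temporal consistency on the final utterance identification.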

01 Jan 1989
TL;DR: An alternative approach to speaker adaptation for a large-vocabulary hidden-Markov-model-based speech recognition system is described, based on the use of a stochastic model representing the different properties of the new speaker and an old speaker for which the full training set of 20 minutes is available.
Abstract: This paper describes an alternative approach to speaker adaptation for a large vocabulary Hidden Markov Model based speech recognition system. The goal of this investigation was to train the IBM speech recognition system with only 5 minutes of speech data from a new speaker instead of the usual 20 minutes. At the same time, the recognition rate should not drop by more than 1-2%. The approach is based on the use of a stochastic model representing the different properties of the new speaker and an old speaker for which the full training set of 20 minutes is available. Such a model can be called a "Speaker Markov Model". It is shown how the parameters of such a model can be derived and how it can be used for transforming the training set of the old speaker in order to use it in addition to the short training set of the new speaker. The adaptation algorithm was tested with 12 speakers, including male and female speakers as well as speakers with foreign accents. The average recognition rate dropped only from 96.4% to 95.2% for a 5000 word vocabulary task if the adaptation was used instead of the full training. Most important is that the decoding time increases only by a factor of 1.35, while this factor is often 3-5 if other adaptation algorithms are used.

PatentDOI
TL;DR: In this article, a speech processing apparatus was proposed that enables processor elements (403a to 403r) each comprising at least one nonlinear oscillator circuit (621) to be used as band pass filters by using the entrainment taking place in each of the processor elements.
Abstract: A speech processing apparatus of the present invention enables processor elements (403a to 403r) each comprising at least one nonlinear oscillator circuit (621) to be used as band pass filters by using the entrainment taking place in each of the processor elements, whereby the speech of a particular talker in the speech of a plurality of talkers can be recognized.

Proceedings ArticleDOI
Kammerer, Kupper
01 Jan 1989
TL;DR: Several design strategies for feedforward networks are examined within the scope of pattern classification and a hierarchical structure with pairwise training of two-class models is superior to a single uniform network for speaker-independent word recognition.
Abstract: Several design strategies for feedforward networks are examined within the scope of pattern classification. Single- and two-layer perceptron models are adapted for experiments in isolated-word recognition. Direct (one-step) classification and several hierarchical (two-step) schemes have been considered. For a vocabulary of 20 English words spoken repeatedly by 11 speakers, the word classes are found to be separable by hyperplanes in the chosen feature space. Since for speaker-dependent word recognition the underlying database contains only a small training set, an automatic expansion of the training material improves the generalization properties of the networks. This method accounts for a wide variety of observable temporal structures for each word and gives a better overall estimate of the network parameters, which leads to a recognition rate of 99.5%. For speaker-independent word recognition, a hierarchical structure with pairwise training of two-class models is superior to a single uniform network (98% average recognition rate).

Proceedings ArticleDOI
08 May 1989
TL;DR: An algorithm is presented for adaptation and self-learning of the hidden Markov model (HMM) that makes the HMM-based speech recognition robust, so that well-trained models can be adapted to new speaking conditions or a new speaker.
Abstract: An algorithm is presented for adaptation and self-learning of the hidden Markov model (HMM). It makes the HMM-based speech recognition robust, so that well-trained models can be adapted to new speaking conditions or a new speaker. The self-learning consists of the fact that, during recognition, all test tokens can be used to augment the current model. Both procedures increase the size of the training set. The algorithm was tested on a speaker-dependent speech recognition system for the whole Chinese vocabulary and a speaker-independent system for 0-9 digits. Experiments show that the algorithm is very successful, both for new-speaker adaptation and for variations of speech in a single speaker under various conditions.

Proceedings ArticleDOI
Joseph Picone
23 May 1989
TL;DR: A clustering algorithm is introduced that allows clustering of HMMs (hidden Markov models) directly, and high-performance speaker-independent digit recognition on a studio-quality connected-digit database is demonstrated.
Abstract: A clustering algorithm is introduced that allows clustering of HMMs (hidden Markov models) directly. This clustering algorithm determines the appropriate duration profile for a recognition unit. High-performance speaker-independent digit recognition on a studio-quality connected-digit database is demonstrated using this algorithm.