
Showing papers on "Speaker recognition published in 1985"


Proceedings ArticleDOI
26 Apr 1985
TL;DR: A vector quantization (VQ) codebook was used as an efficient means of characterizing the short-time spectral features of a speaker, and a set of such codebooks was then used to recognize the identity of an unknown speaker from his/her unlabelled spoken utterances based on a minimum-distance (distortion) classification rule.
Abstract: In this study a vector quantization (VQ) codebook was used as an efficient means of characterizing the short-time spectral features of a speaker. A set of such codebooks was then used to recognize the identity of an unknown speaker from his/her unlabelled spoken utterances based on a minimum distance (distortion) classification rule. A series of speaker recognition experiments was performed using a 100-talker (50 male and 50 female) telephone recording database consisting of isolated digit utterances. For ten random but different isolated digits, over 98% speaker identification accuracy was achieved. The effects on performance of different system parameters, such as codebook sizes, the number of test digits, phonetic richness of the text, and difference in recording sessions, were also studied in detail.
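The minimum-distance (distortion) rule described above can be sketched in a few lines: each speaker's codebook quantizes the test frames, and the utterance is assigned to the speaker whose codebook yields the lowest average distortion. This is a toy illustration with made-up 2-D features and two hypothetical speakers, not the paper's 100-talker system.

```python
def distortion(frame, codebook):
    """Squared-error distortion of one frame against its nearest codeword."""
    return min(sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook)

def identify(utterance, codebooks):
    """Return the speaker whose codebook gives minimum average distortion."""
    def avg_distortion(speaker):
        cb = codebooks[speaker]
        return sum(distortion(fr, cb) for fr in utterance) / len(utterance)
    return min(codebooks, key=avg_distortion)

# Toy codebooks for two hypothetical speakers (illustrative values only).
codebooks = {
    "spk_a": [(0.0, 0.0), (1.0, 1.0)],
    "spk_b": [(5.0, 5.0), (6.0, 6.0)],
}
frames = [(0.9, 1.1), (0.1, -0.2), (1.2, 0.8)]   # resembles spk_a
print(identify(frames, codebooks))                # spk_a
```

In a real system the codebooks would be trained per speaker with a clustering procedure (e.g. k-means/LBG) on labelled training frames; here they are simply written out by hand.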

493 citations


Proceedings ArticleDOI
Richard Schwartz1, Y. L. Chow1, Owen Kimball1, S. Roucos1, M. Krasner1, John Makhoul1 
26 Apr 1985
TL;DR: The combination of general spectral information and specific acoustic-phonetic features is shown to result in more accurate phonetic recognition than either representation by itself.
Abstract: This paper describes the results of our work in designing a system for phonetic recognition of unrestricted continuous speech. We describe several algorithms used to recognize phonemes using context-dependent Hidden Markov Models of the phonemes. We present results for several variations of the parameters of the algorithms. In addition, we propose a technique that makes it possible to integrate traditional acoustic-phonetic features into a hidden Markov process. The categorical decisions usually associated with heuristic acoustic-phonetic algorithms are replaced by automated training techniques and global search strategies. The combination of general spectral information and specific acoustic-phonetic features is shown to result in more accurate phonetic recognition than either representation by itself.

367 citations


Journal ArticleDOI
01 Nov 1985
TL;DR: A discussion of inherent performance limitations, along with a review of the performance achieved by listening, visual examination of spectrograms, and automatic computer techniques, attempts to provide a perspective with which to evaluate the potential of speaker recognition and productive directions for research into and application of speaker recognition technology.
Abstract: The usefulness of identifying a person from the characteristics of his voice is increasing with the growing importance of automatic information processing and telecommunications. This paper reviews the voice characteristics and identification techniques used in recognizing people by their voices. A discussion of inherent performance limitations, along with a review of the performance achieved by listening, visual examination of spectrograms, and automatic computer techniques, attempts to provide a perspective with which to evaluate the potential of speaker recognition and productive directions for research into and application of speaker recognition technology.

350 citations


PatentDOI
TL;DR: A system is disclosed for recognizing a pattern in a collection of data given a context of one or more other patterns previously identified, which enables an operator to confirm the system's best guess as to the spoken word merely by speaking another word.
Abstract: A system is disclosed for recognizing a pattern in a collection of data given a context of one or more other patterns previously identified. Preferably the system is a speech recognition system, the patterns are words and the collection of data is a sequence of acoustic frames. During the processing of each of a plurality of frames, for each word in an active vocabulary, the system updates a likelihood score representing a probability of a match between the word and the frame, combines a language model score based on one or more previously recognized words with that likelihood score, and prunes the word from the active vocabulary if the combined score is below a threshold. A rapid match is made between the frames and each word of an initial vocabulary to determine which words should originally be placed in the active vocabulary. Preferably the system enables an operator to confirm the system's best guess as to the spoken word merely by speaking another word, to indicate that an alternate guess by the system is correct by typing a key associated with that guess, and to indicate that neither the best guess nor the alternate guesses was correct by typing yet another key. The system includes other features, including ones for determining where among the frames to look for the start of speech, and a special hardware processor for computing likelihood scores.
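The per-frame pruning step described above (combine each active word's acoustic likelihood with a language-model score based on the previously recognized word, then drop words whose combined score falls below a threshold) can be sketched as follows. All words, scores, and the beam width are invented for illustration; this is not the patent's actual scoring.

```python
def prune(active, lm_logprob, prev_word, beam=5.0):
    """Keep only words whose combined score is within `beam` of the best.

    active      -- {word: acoustic log-likelihood so far}
    lm_logprob  -- {(previous_word, word): language-model log-probability}
    """
    combined = {w: ll + lm_logprob[(prev_word, w)] for w, ll in active.items()}
    best = max(combined.values())
    return {w: s for w, s in combined.items() if s >= best - beam}

# Hypothetical active vocabulary after some frame, given previous word "the".
active = {"cat": -10.0, "hat": -11.0, "zebra": -25.0}
lm = {("the", "cat"): -1.0, ("the", "hat"): -2.0, ("the", "zebra"): -8.0}
survivors = prune(active, lm, "the")
print(sorted(survivors))   # ['cat', 'hat']  -- 'zebra' is pruned
```

This beam-style pruning keeps the active vocabulary small from frame to frame, which is what makes evaluating a large vocabulary in real time feasible.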

208 citations


Journal ArticleDOI
01 Nov 1985
TL;DR: The nature of variabilities in speech is discussed, the kinds of speech knowledge that may help us understand those variabilities are described, and specific procedures are advocated for the increased utilization of speech knowledge in automatic speech recognition.
Abstract: In automatic speech recognition, the acoustic signal is the only tangible connection between the talker and the machine. While the signal conveys linguistic information, this information is often encoded in such a complex manner that the signal exhibits a great deal of variability. In addition, variations in environment and speaker can introduce further distortions that are linguistically irrelevant. This paper has three aims: 1) to discuss the nature of variabilities; 2) to describe the kinds of speech knowledge that may help us understand variabilities; and 3) to advocate and suggest specific procedures for the increased utilization of speech knowledge in automatic speech recognition.

186 citations


Journal Article
TL;DR: The technique of electroglottography is reviewed from the perspective of a laboratory instrument for assessing laryngeal function, a device to assist speech and speaker recognition, and a potential diagnostic aid in the clinic.
Abstract: The technique of electroglottography is reviewed from the perspective of a laboratory instrument for assessing laryngeal function, a device to assist speech and speaker recognition, and as a potential diagnostic aid in the clinic. A description of the electronic functioning of the electroglottograph (EGG) is provided. Considerable emphasis is given to contemporary research which has focused on laryngeal assessment using the EGG. Methods for validating and aiding the interpretation or reading of the EGG are discussed, including photoglottography, stroboscopy, ultrahigh-speed laryngeal cinematography, and others. The relationship of the EGG to glottal area and glottal volume velocity estimated by inverse filtering is presented. An elementary model of the EGG is described and used to predict characteristic features of the EGG waveform. Clinical data as well as data obtained from subjects with a normal functioning larynx are analyzed. Applications of the EGG to speech processing are outlined, including real-time detection of voicing, voiced and unvoiced speech segments, and silence intervals. The EGG device has potential for assisting speech and speaker recognition systems in certain applications.

177 citations


Journal ArticleDOI
TL;DR: Tape recordings of 24 speakers conversing over an unprocessed channel and over an LPC voice processing system were subjected to listening tests, suggesting that frequently voiced concerns about speaker recognition over narrow‐band voice communication systems may not be justified.
Abstract: Tape recordings of 24 speakers conversing over an unprocessed channel and over an LPC voice processing system were subjected to listening tests. The listeners were 24 co‐workers who attempted to identify each speaker from a group of about 40 people working in the same branch. Prior to the recognition test, each of the listeners also rated his or her familiarity with each of the speakers and the distinctiveness of each speaker’s voice. There was some loss in voice recognition over LPC, but the recognition accuracy was still quite high (69% vs 88% for unprocessed voices), suggesting that frequently voiced concerns about speaker recognition over narrow‐band voice communication systems may not be justified. Talker familiarity was significantly correlated with correct identifications. There was no significant correlation between the rated distinctiveness of the speaker and correct identifications. However, familiarity and distinctiveness ratings were highly correlated. This suggests that people consider a familiar voice to be distinctive regardless of whatever characteristics might make that particular voice stand out in a crowd.

45 citations


Journal Article
TL;DR: Results indicate that the voice recognition system might be appropriate for rehabilitation programs though further technologic refinement of the device would increase its effectiveness.

Proceedings ArticleDOI
Frederick Jelinek1
01 Apr 1985
TL;DR: The architecture of an experimental, real-time, isolated-word, speech recognition system with a 5,000-word vocabulary which can be used for dictating office correspondence is described and some recent experimental results obtained are given.
Abstract: The Speech Recognition Group at IBM, Yorktown Heights, has recently completed the implementation of an experimental, real-time, isolated-word, speech recognition system with a 5,000-word vocabulary which can be used for dictating office correspondence. Typical recognition accuracy is greater than 94% correct word recognition for words within the vocabulary. We first describe the architecture of this system, and then give some recent experimental results obtained with it for read and spontaneously dictated speech from five speakers.

Proceedings ArticleDOI
01 Apr 1985
TL;DR: The methods found to be most effective rely on the training process to incorporate channel variability, and it is shown that the direct approach, of using simple channel-invariant features, can discard much speaker dependent information.
Abstract: In this paper, we examine several methods for text-independent speaker identification of telephone speech with limited-duration data. The issue addressed is the assessment of channel characteristics, especially linear aspects, and methods for improving speaker identification performance when the speaker to be identified is on a different telephone channel than the data used for training. We show experimental evidence illustrating the cross-channel problem and also show that the direct approach, of using simple channel-invariant features, can discard much speaker-dependent information. The methods we have found to be most effective rely on the training process to incorporate channel variability.

Proceedings ArticleDOI
01 Apr 1985
TL;DR: An application of source coding to speaker recognition is described, where each speaker is represented by a sequence of vector quantization codebooks; known input utterances are classified using these codebook sequences and the resulting classification distortion is compared to a rejection threshold.
Abstract: An application of source coding to speaker recognition is described. The method is text-dependent - the text spoken is known, and the problem is to determine who said it. Each speaker is represented by a sequence of vector quantization codebooks; known input utterances are classified using these codebook sequences and the resulting classification distortion is compared to a rejection threshold. On a 16 speaker test population with an additional 111 imposters, this method achieved a false rejection rate of 0.8%, an imposter acceptance rate of 1.8%, and within the 16 speakers, an identification error rate of 0.0%.
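The accept/reject decision described above can be sketched as follows. For simplicity a single codebook stands in for the paper's per-utterance sequence of codebooks, and all vectors and the rejection threshold are illustrative values, not the paper's.

```python
def avg_distortion(frames, codebook):
    """Average squared-error distortion of frames against nearest codewords."""
    def d(frame):
        return min(sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook)
    return sum(d(fr) for fr in frames) / len(frames)

def verify(frames, codebook, threshold=0.5):
    """Accept the identity claim only if the distortion stays under threshold."""
    return avg_distortion(frames, codebook) <= threshold

claimed = [(0.0, 0.0), (1.0, 1.0)]     # hypothetical codebook of claimed speaker
genuine = [(0.1, 0.1), (0.9, 1.0)]     # frames close to the claimed codebook
imposter = [(4.0, 4.0), (5.0, 5.0)]    # frames far from it
print(verify(genuine, claimed), verify(imposter, claimed))  # True False
```

The threshold trades off the two error rates the paper reports: lowering it reduces imposter acceptances at the cost of more false rejections of true speakers.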

Proceedings ArticleDOI
01 Apr 1985
TL;DR: A method for speaker dependent connected speech recognition based on phonemic units is described, in which each phoneme is characterized by a very simple 3-state Hidden Markov Model which is trained on connected speech by a Viterbi algorithm.
Abstract: In this paper, a method for speaker dependent connected speech recognition based on phonemic units is described. In this recognition system, each phoneme is characterized by a very simple 3-state Hidden Markov Model (HMM) which is trained on connected speech by a Viterbi algorithm. Each state has associated with it a continuous (Gaussian) or discrete probability density function (pdf). With the phonemic models so obtained, the recognition is then performed either directly at word level (by the reconstruction of reference words from the models of the constituting phonemes) or via a phonemic labelling. Good results are obtained both with a German ten-digit vocabulary (20 phonemes) and with a French 80-word vocabulary (36 phonemes).
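The Viterbi alignment at the heart of such a 3-state left-to-right phoneme model can be sketched with discrete emission pdfs. All transition and emission probabilities below are invented for illustration, and the training step (re-estimating the pdfs from the aligned frames) is omitted.

```python
import math

def viterbi(obs, trans, emit, n_states=3):
    """Best state path through a left-to-right HMM, in log-prob arithmetic."""
    delta = [emit[0][obs[0]]] + [-math.inf] * (n_states - 1)  # must start in state 0
    paths = [[0]] + [[] for _ in range(n_states - 1)]
    for o in obs[1:]:
        new_delta, new_paths = [], []
        for s in range(n_states):
            cands = [(delta[s] + trans[s][s], s)]                       # self-loop
            if s > 0:
                cands.append((delta[s - 1] + trans[s - 1][s], s - 1))   # advance
            best, prev = max(cands)
            new_delta.append(best + emit[s][o])
            new_paths.append(paths[prev] + [s])
        delta, paths = new_delta, new_paths
    return paths[-1]   # best path ending in the final state

log = math.log
# Left-to-right transitions: each state loops or moves one state forward.
trans = [[log(0.5), log(0.5), -math.inf],
         [-math.inf, log(0.5), log(0.5)],
         [-math.inf, -math.inf, log(1.0)]]
# Discrete emission log-probs per state for the symbols 'a' and 'b'.
emit = [{"a": log(0.9), "b": log(0.1)},
        {"a": log(0.5), "b": log(0.5)},
        {"a": log(0.1), "b": log(0.9)}]
print(viterbi(["a", "a", "b", "b"], trans, emit))   # [0, 1, 2, 2]
```

Viterbi training then alternates this alignment with re-estimation of each state's pdf from the frames assigned to it.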

Journal ArticleDOI
TL;DR: This article pointed out the logical fallacy in Douglas and Gibbins' argument that self-other and acquaintance-other recognition errors are often instances of self-deception, and they presented no evidence that either type of recognition error was not an instance of self deception.
Abstract: Douglas and Gibbins (1983) recently argued that our demonstration that errors in self-other recognition are often instances of self-deception was inadequate. In their study, they found that both self-other and acquaintance-other recognition errors met two of the four criteria we had offered as necessary and sufficient for ascribing self-deception. They presented no evidence that either type of recognition error was not an instance of self-deception. Here we describe the original basis of our demonstration and point out the logical fallacy in Douglas and Gibbins' argument.

Proceedings ArticleDOI
01 Apr 1985
TL;DR: A speaker-independent recognition method for telephone speech response systems is described; the category of the nearest reference pattern is taken as the result, and the recognition accuracy on non-training voice data was 95.8% with automatic segmentation.
Abstract: This paper describes the recognition method, reference pattern generation method, and evaluation of speaker-independent recognition for telephone speech response systems. The input utterance is analyzed by 19-channel BPFs. The power and vocal cord source characteristics are normalized. The time normalization is realized by linearly compressing or expanding to 32 frames. The speech pattern undergoes pattern matching with male and female reference patterns, and the category of the nearest reference pattern is taken as the result. It is necessary to optimize the reference patterns so that the speech can be correctly recognized in spite of differences in formant frequencies and slight segmentation errors. To optimize the reference patterns, the recognition of the training patterns and updating of the reference patterns are repeated. A total of 256 male and female reference patterns were generated. The speech recognition accuracy of this method in recognizing non-training voice data was 95.8% with automatic segmentation.

Journal ArticleDOI
TL;DR: A text‐independent speaker clustering approach to speaker‐independent speaker recognition through vector quantization (VQ) was investigated, where the distortion value was used as a clustering measure.
Abstract: A text‐independent speaker clustering approach to speaker‐independent speaker recognition through vector quantization (VQ) was investigated, where the distortion value was used as a clustering measure. To show the possibility of the text‐independent speaker clustering, speaker recognition experiments were carried out using the Harvard sentence database. Nine male speakers uttered ten different Harvard sentences each. Codebooks were generated from the first five sentences for each speaker using the Weighted Likelihood Ratio (WLR) measure through LPC analysis. Using 128 vectors in each codebook, a speaker recognition rate of 98% was attained on the latter five Harvard sentences. Effects of codebook size and input length are also discussed. The above approach based on framewise VQ only utilizes the static distribution of LPC spectra. VQ for multiframe codebooks was used to represent the coarticulation units. The results of speaker recognition experiments based on multi‐frame codebooks will be compared with fixed-length VQ approaches.

Proceedings ArticleDOI
01 Apr 1985
TL;DR: This paper describes a speaker-independent large-vocabulary speech recognition system based on phoneme recognition, which employs LPC cepstrum coefficients as the feature parameter and a statistical distance measure between an input pattern and phoneme reference templates.
Abstract: This paper describes a speaker-independent large-vocabulary speech recognition system based on phoneme recognition. Phoneme recognition employs LPC cepstrum coefficients as the feature parameter and a statistical distance measure between an input pattern and a phoneme reference template. Using power dips of the low and high frequency ranges, similarity to an unvoiced feature, and similarity to a nasal feature, the consonant segments are detected. The discrimination of phonemes is performed individually for vowels, semi-vowels and consonants. The phoneme sequence resulting from phoneme recognition is matched with each item of the word dictionary, and the item with the highest similarity in the dictionary is output as the recognition result. The average phoneme recognition score is 81.4% for 212 words uttered by forty speakers including males and females; 90.6% for vowels, 78.0% for semivowels and 71.9% for consonants. The average score of word recognition is 95.6% for 274 Japanese city names uttered by forty speakers.

Proceedings ArticleDOI
01 Apr 1985
TL;DR: A very positive experience is reported with a system based on very short sub-word units, called "diphones"; the problems related to storage requirements, discrimination of similar words, and training time are much alleviated.
Abstract: Almost all CSR systems presently in practical use are based on whole-word template matching. Although their performances are quite high, a few problems arise due to the use of whole words as basic units. They are related to storage requirements, coarticulation effects at the junction between words, discrimination of similar words and training time, especially for large vocabularies. In this paper we report a very positive experience which is being made with a system based on very short sub-word units, called "diphones". With the approach described in this paper, the above mentioned problems are much alleviated, with no penalty on performance. The first part is devoted to the presentation and discussion of the peculiar characteristics of the diphones, of the language model based on them and of the overall recognition system. Then a set of procedures used for training the system on a new application and for extracting the diphone templates for any new speaker are briefly described. Finally we report and discuss the results of different tests performed on various recognition tasks.

Proceedings ArticleDOI
D. Mergel1, Hermann Ney
01 Apr 1985
TL;DR: A variant of the Markov source modelling of entire words based on automatically determined subword units is described, applied to speaker-dependent and independent recognition of the German digits (telephone speech).
Abstract: A variant of the Markov source modelling of entire words based on automatically determined subword units is described. Each word of the vocabulary is modelled as a linear sequence of phoneme segments given by a phonetic transcription. For every phoneme a minimum and maximum duration are to be specified. Matching an utterance to the models must be performed within these absolute durational constraints. This is achieved by a dynamic programming time alignment different from the conventional ones. The acoustic emission is defined by means of phonetically labelled prototype vectors. The parameters of the models are automatically trained by an iterative procedure similar to the Viterbi algorithm. The method is applied to speaker-dependent and independent recognition of the German digits (telephone speech).

Journal ArticleDOI
TL;DR: An experimental investigation to determine the human speaker recognition performance of LPC voice processors indicates the importance of high-frequency data bandwidth for speaker recognition.
Abstract: Immediate identification of speakers' voices can be highly important to efficient communication in certain applications. This correspondence describes an experimental investigation to determine the human speaker recognition performance of LPC voice processors. A small group of coworkers were used as the test subjects. The test results indicate the importance of high-frequency data bandwidth for speaker recognition.

01 Jul 1985
TL;DR: A study was conducted to determine potential commercial aircraft flight deck applications and implementation guidelines for voice recognition and synthesis; the potential voice recognition applications fell into five general categories: programming, interrogation, data entry, switch and mode selection, and continuous/time-critical action control.
Abstract: A study was conducted to determine potential commercial aircraft flight deck applications and implementation guidelines for voice recognition and synthesis. At first, a survey of voice recognition and synthesis technology was undertaken to develop a working knowledge base. Then, numerous potential aircraft and simulator flight deck voice applications were identified and each proposed application was rated on a number of criteria in order to achieve an overall payoff rating. The potential voice recognition applications fell into five general categories: programming, interrogation, data entry, switch and mode selection, and continuous/time-critical action control. The ratings of the first three categories showed the most promise of being beneficial to flight deck operations. Possible applications of voice synthesis systems were categorized as automatic or pilot selectable, and many were rated as being potentially beneficial. In addition, voice system implementation guidelines and pertinent performance criteria are proposed. Finally, the findings of this study are compared with those made in a recent NASA study of a 1995 transport concept.

Proceedings ArticleDOI
19 Dec 1985
TL;DR: A speech recognition time warping algorithm is adapted to picture analysis to recognize patterns despite variations in scale and orientation, so that objects may be recognized whether they are embedded in other parts or distorted.
Abstract: The aim of this study is to adapt a speech recognition time warping algorithm to picture analysis. Our goal is to recognize patterns despite variations in scale and orientation. Objects may be recognized regardless of whether they are embedded in other parts or are distorted. The programs input real pictures, extract the contours, and then encode and compare them to a pattern dictionary. The computer time is particularly short for such a recognition process.
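The time-warping algorithm being adapted here is standard dynamic time warping (DTW), the classic speech-recognition alignment technique. A generic 1-D version (not the paper's contour-matching variant) can be sketched as:

```python
def dtw(a, b):
    """Minimum cumulative |a_i - b_j| cost over monotone alignments of a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = best cost of aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],       # stretch a
                                 D[i][j - 1],       # stretch b
                                 D[i - 1][j - 1])   # match and advance both
    return D[n][m]

print(dtw([1, 2, 3], [1, 1, 2, 3]))   # 0.0 -- same shape at a different tempo
print(dtw([1, 2, 3], [3, 2, 1]) > 0)  # True -- reversed shape has nonzero cost
```

For contour matching, the scalar samples would be replaced by encoded contour elements and the local cost by a distance tolerant to scale and orientation changes.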

Journal ArticleDOI
TL;DR: Quality evaluation tests are reported which show that this type of coder, operating at 7.2 kbps, allows the transmission of telephone speech with communications quality and is a good candidate for telephony applications such as digital trunk transmissions, satellite speech communications, secure voice communications, and audio distribution systems.
Abstract: In this paper, we discuss the implementation of a medium-bit-rate linear prediction baseband coder on an IBM bipolar signal processor prototype having a high processing capacity. We show that the implementation of our algorithm requires a processing load of 5 MIPS, with a program size of 5K instructions. We then discuss the application of our coder in a normal telephone environment, which requires mu-law to linear PCM conversion and other signal processing functions such as voice activity detection, automatic gain control, echo control, and error recovery. Quality evaluation tests are also reported which show that this type of coder, operating at 7.2 kbps, allows the transmission of telephone speech with communications quality. Moreover, obtained intelligibility scores and speaker recognition levels are high enough to demonstrate that this coder is a good candidate for telephony applications such as digital trunk transmissions, satellite speech communications, secure voice communications, and audio distribution systems.


Journal ArticleDOI
TL;DR: In this article, a weighted cepstral distance measure using LPC derived cepstrum coefficient variability was tested in a speaker-independent English digit recognition system using standard DTW alignment techniques.
Abstract: The cepstral distance has been one of the most efficient spectral distance measures in speech and speaker recognition [S. Furui, IEEE Trans. Acoust. Speech Signal Process. ASSP‐29, 254–272 (1981)]. A new weighted cepstral distance measure using LPC derived cepstrum coefficient variability was tested in a speaker‐independent English digit recognition system using standard DTW alignment techniques [L. R. Rabiner, S. E. Levinson, A. E. Rosenberg, and J. G. Wilpon, IEEE Trans. Acoust. Speech Signal Process. ASSP‐27, 134–141 (1979)]. The results show a recognition accuracy of > 99% for the digits [K. L. Shipley, A. E. Rosenberg, and D. E. Bock, J. Acoust. Soc. Am. Suppl. 1 72, S80 (1982)]. Recognition results using the same database and the log likelihood LPC distance are about 97.4%. Hence there is a large improvement in performance with the new weighted cepstral distance.
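A variability-weighted cepstral distance of the kind described can be sketched as below. The abstract does not give the exact weighting, so inverse-variance weights (down-weighting coefficients that vary a lot across training data) are an assumption, and all numbers are illustrative.

```python
def weighted_cepstral_distance(c1, c2, variances):
    """Squared cepstral distance with each term scaled by 1/variance."""
    return sum((a - b) ** 2 / v for a, b, v in zip(c1, c2, variances))

ref = [1.0, 0.5, 0.2]                 # hypothetical reference cepstrum
test = [1.2, 0.4, 0.6]                # hypothetical test cepstrum
variances = [0.1, 0.1, 1.0]           # third coefficient is highly variable
d = weighted_cepstral_distance(ref, test, variances)
print(round(d, 3))   # 0.66
```

With uniform weights the third coefficient's large deviation would dominate; inverse-variance weighting suppresses it, which is the intuition behind the improved digit recognition accuracy reported above.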

Proceedings ArticleDOI
01 Apr 1985
TL;DR: This paper describes an approach for enhancing the performance of a speaker-dependent, discrete word recognition system in a noisy environment by means of cepstral subtraction techniques applied iteratively, which results in a significant improvement in speech recognition accuracy.
Abstract: This paper describes the approach for enhancing the performance of a speaker-dependent, discrete word recognition system in a noisy environment by means of cepstral subtraction techniques applied iteratively. A series of experiments have shown that these iterative methods provide enhanced performance in the word boundary detector which results in a significant improvement in speech recognition accuracy.
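One plausible form of the iterative cepstral-subtraction idea (the paper's exact iteration is not given in the abstract) is sketched below: a noise cepstrum estimated from leading non-speech frames is subtracted from every frame, and the noise estimate is refined over a few iterations using frames a simple energy test classifies as non-speech. Frame values, the energy threshold, and the iteration count are all illustrative assumptions.

```python
def cepstral_subtract(frames, n_noise_frames=2, iterations=3):
    """Iteratively remove an estimated noise bias from cepstral frames."""
    dim = len(frames[0])
    # Initial noise estimate: mean of the leading (presumed non-speech) frames.
    noise = [sum(f[k] for f in frames[:n_noise_frames]) / n_noise_frames
             for k in range(dim)]
    cleaned = frames
    for _ in range(iterations):
        cleaned = [[f[k] - noise[k] for k in range(dim)] for f in frames]
        # Refine the noise estimate from low-energy (presumed non-speech) frames.
        quiet = [f for f, c in zip(frames, cleaned)
                 if sum(x * x for x in c) < 0.1] or frames[:n_noise_frames]
        noise = [sum(f[k] for f in quiet) / len(quiet) for k in range(dim)]
    return cleaned

frames = [[1.0, 1.0], [1.0, 1.0], [3.0, 2.0]]   # two noise frames, one speech frame
out = cepstral_subtract(frames)
print(out[2])   # speech frame with the noise bias removed -> [2.0, 1.0]
```

Removing the noise bias in this way sharpens the energy contrast between speech and silence, which is how it helps the word boundary detector mentioned above.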

Patent
27 Sep 1985
TL;DR: Keyword occurrences in continuously-spoken speech are detected by evaluating both the keyword hypothesis and the alternative hypothesis that the observed speech is not a keyword; the alternative hypothesis is evaluated with a general language model in which arbitrary utterances are approximated by concatenations of a set of filler templates.
Abstract: A system employing a method that detects the occurrence of keywords in continuously-spoken speech evaluates both the keyword hypothesis, and the alternative hypothesis that the observed speech is not a keyword. A general language model is used to evaluate the latter hypothesis. Arbitrary utterances of the language, according to this model, are approximated by concatenations of a set of filler templates. The system allows for automatic detection of the occurrence of keywords in unrestricted natural speech. The system can be trained by a particular speaker, or can function independently of the speaker.

Patent
11 Mar 1985
TL;DR: In this article, the output from a voice recognition system is synthesized by a voice synthesizer into a voice corresponding with the switch code signal and announced through a speaker to confirm the operator of the command.
Abstract: PURPOSE: To control vehicle loads easily by voice, by providing means for recognizing spoken words, means for controlling each load on the basis of the voice recognition result, and means for identifying the driver's voice alone. CONSTITUTION: The driver speaks a control command into a microphone 1, the voice recognition sensor provided in the cabin, and the command is recognized by a voice recognition system 2. The output of system 2 is converted by a voice synthesizer 3 into speech corresponding to the switch code signal and announced through a speaker 4 so that the operator can confirm the command; alternatively, it is converted by a display driver 5 into characters and shown on a display 6. If the command contains no error, each load is controlled by an operating-system controller 7 and a non-operating-system controller 8 after a predetermined time has elapsed. The operator's voice has been stored in a voice identifier to prevent control of the operating-system loads by any voice other than the operator's.