
Showing papers on "Speaker recognition published in 1983"


Book
28 Oct 1983
TL;DR: This volume clearly demonstrates that any valid theory of speaker recognition must integrate the approaches of a number of disciplines and it is itself an important step towards that integration.
Abstract: List of tables List of figures Acknowledgements Introduction 1. Perspectives on speaker recognition 2. The bases of between-speaker differences 3. Short term parameters: segments and co-articulation 4. Long term quality 5. Conclusions References Index.

235 citations


Journal ArticleDOI
TL;DR: The results of a new method based on rate-distortion speech coding (speech coding by vector quantization), minimum cross-entropy pattern classification, and information-theoretic spectral distortion measures for discrete utterance speech recognition are presented.
Abstract: The results of a new method are presented for discrete utterance speech recognition. The method is based on rate-distortion speech coding (speech coding by vector quantization), minimum cross-entropy pattern classification, and information-theoretic spectral distortion measures. Separate vector quantization code books are designed from training sequences for each word in the recognition vocabulary. Inputs from outside the training sequence are classified by performing vector quantization and finding the code book that achieves the lowest average distortion per speech frame. The new method obviates time alignment. It achieves 99 percent accuracy for speaker-dependent recognition of a 20-word vocabulary that includes the ten digits, with higher accuracy for recognition of the digit subset. For speaker-independent recognition, the method achieves 88 percent accuracy for the 20-word vocabulary and 95 percent for the digit subset. Background of the method, detailed empirical results, and an analysis of computational requirements are presented.

92 citations
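The per-word codebook classification described above can be sketched in a few lines. The following is a toy illustration, not the paper's implementation: it assumes 2-D feature frames and uses a naive k-means codebook trainer.

```python
import numpy as np

def train_codebook(frames, k=4, iters=20, seed=0):
    """Toy VQ codebook: k-means over a word's training frames."""
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest code vector, then update centroids.
        d = np.linalg.norm(frames[:, None] - codebook[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = frames[labels == j].mean(axis=0)
    return codebook

def avg_distortion(frames, codebook):
    """Average per-frame distortion of an utterance against one codebook."""
    d = np.linalg.norm(frames[:, None] - codebook[None], axis=2)
    return d.min(axis=1).mean()

def classify(frames, codebooks):
    """Pick the word whose codebook yields the lowest average distortion.
    No time alignment is needed, as the abstract notes."""
    return min(codebooks, key=lambda w: avg_distortion(frames, codebooks[w]))
```

Because only the average distortion matters, utterances of different lengths are compared directly, which is what lets the method skip time alignment.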


Patent
27 Jan 1983
TL;DR: In this article, an individual verification apparatus consisting of a verification data file (20), a speech input section (10), a data memory (30), speech recognition unit (40), and a speaker verification unit (50) is described.
Abstract: An individual verification apparatus comprises a verification data file (20), a speech input section (10), a data memory (30), a speech recognition unit (40), and a speaker verification unit (50). In the verification data file key codes set by customers and corresponding reference data for individual verification are registered. Speech of the key code spoken by a customer is processed by the speech input section (10) and the result is stored in the data memory (30). The speech recognition unit (40) recognizes the input key code based on the key code data stored in the data memory (30). The speaker verification unit (50) verifies the customer by comparing the key code data with speech reference data of customers having the recognized key code.

68 citations


Proceedings ArticleDOI
14 Apr 1983
TL;DR: A new technique for text-independent speaker recognition is proposed which uses a statistical model of the speaker's vector quantized speech which retains text-independent properties while allowing considerably shorter test utterances than comparable speaker recognition systems.
Abstract: A new technique for text-independent speaker recognition is proposed which uses a statistical model of the speaker's vector quantized speech. The technique retains text-independent properties while allowing considerably shorter test utterances than comparable speaker recognition systems. The frequently occurring vectors or characters form a model of multiple points in the n-dimensional speech space instead of the usual single-point models. The speaker recognition depends on the statistical distribution of the distances between the speech frames from the unknown speaker and the closest points in the model. Models were generated with 100 seconds of conversational training speech for each of 11 male speakers. The system was able to identify 11 speakers with 96%, 87%, and 79% accuracy from sections of unknown speech of durations of 10, 5, and 3 seconds, respectively. Accurate recognition was also obtained even when there were variations in channels over which the training and testing data were obtained. A real-time demonstration system has been implemented including both training and recognition processes.

66 citations


Journal ArticleDOI
TL;DR: A new model for recognizing the speaker's intended meaning in determining a response is presented, which makes use of the speaker's plan, his beliefs about the domain and about the hearer's relevant capacities.
Abstract: Human conversational participants depend upon the ability of their partners to recognize their intentions, so that those partners may respond appropriately. In such interactions, the speaker encodes his intentions about the hearer's response in a variety of sentence types. Instead of telling the hearer what to do, the speaker may just state his goals, and expect a response that meets these goals at least part way. This paper presents a new model for recognizing the speaker's intended meaning in determining a response. It shows that this recognition makes use of the speaker's plan, his beliefs about the domain and about the hearer's relevant capacities.

58 citations


Proceedings ArticleDOI
01 Apr 1983
TL;DR: A new technique for use in a word recognition system where word templates are represented as sequences of discrete phoneme-like (pseudo-phoneme) templates which are automatically determined from a training set of word utterances by a clustering technique.
Abstract: This paper describes a new technique for use in a word recognition system. This recognition system is especially effective in speaker-dependent large-vocabulary word recognition based on multiple reference templates. In this system, word templates are represented as sequences of discrete phoneme-like (pseudo-phoneme) templates which are automatically determined from a training set of word utterances by a clustering technique. In speaker-dependent word recognition experiments on 641 city names, 96.3% recognition accuracy was obtained using 256 phoneme-like templates.

56 citations


Journal ArticleDOI
TL;DR: Two experiments investigated how subjects remember paralinguistic speaker-voice information without apparent intent; with the subjects' stated task being only to remember the sentences, incidental memory for which speaker spoke which sentences was facilitated by fabricated personal histories of the speakers.

53 citations


Proceedings ArticleDOI
01 Apr 1983
TL;DR: It is demonstrated that by using Bayesian techniques, prior knowledge derived from speaker-independent data can be combined with speaker-dependent training data to improve system performance.
Abstract: In order to achieve state-of-the-art performance in a speaker-dependent speech recognition task, it is necessary to collect a large number of acoustic data samples during the training process. Providing these samples to the system can be a long and tedious process for users. One way to attack this problem is to make use of extra information from a data bank representing a large population of speakers. In this paper we demonstrate that by using Bayesian techniques, prior knowledge derived from speaker-independent data can be combined with speaker-dependent training data to improve system performance.

35 citations
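The Bayesian idea above can be illustrated with a minimal sketch (an assumed conjugate-prior parameterization, not necessarily the authors' formulation): a template mean is estimated by shrinking the speaker-dependent sample mean toward a speaker-independent prior, with the prior weighted as a count of "virtual" observations.

```python
import numpy as np

def map_mean(prior_mean, prior_weight, samples):
    """MAP-style estimate of a template mean: blend a speaker-independent
    prior mean with the speaker-dependent sample mean. prior_weight plays
    the role of a virtual observation count (hypothetical parameter name)."""
    n = len(samples)
    sample_mean = np.mean(samples, axis=0)
    return (prior_weight * prior_mean + n * sample_mean) / (prior_weight + n)
```

With few enrollment samples the estimate stays near the population prior; as speaker-dependent data accumulates, it converges to the speaker's own sample mean, which is the effect the paper exploits to shorten training.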


Proceedings ArticleDOI
01 Apr 1983
TL;DR: Results indicate that discrimination between similar sounding words can be greatly improved, and an alternative DTW approach which is able to focus its attention on those parts of a speech pattern which serve to distinguish it from similar patterns is presented.
Abstract: Whole-word pattern matching using dynamic time-warping (DTW) has achieved considerable success as an algorithm for automatic speech recognition. However, the performance of such an algorithm is ultimately limited by its inability to discriminate between similar sounding words. The problem arises because all differences between speech patterns are treated as being equally important, hence the algorithm is particularly susceptible to confusions caused by irrelevant differences. This paper presents an alternative DTW approach which is able to focus its attention on those parts of a speech pattern which serve to distinguish it from similar patterns. A network-type data structure is derived from reference speech patterns, and the separate paths through the network determine the regions where recognition takes place. Results indicate that discrimination between similar sounding words can be greatly improved.

23 citations
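For reference, the baseline whole-word DTW match that this paper improves upon can be written as the standard recurrence. This is a minimal sketch of plain DTW, not the network-based variant the paper proposes.

```python
import numpy as np

def dtw_distance(a, b):
    """Standard dynamic time warping: cumulative cost of the best
    monotonic alignment between two frame sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between frames (Euclidean; frames may be scalars).
            cost = float(np.linalg.norm(np.atleast_1d(a[i - 1]) - np.atleast_1d(b[j - 1])))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

Note how every frame contributes equally to the total cost: this is exactly the weakness the paper identifies, since irrelevant differences between similar-sounding words count as much as the discriminating regions.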


Proceedings ArticleDOI
01 Apr 1983
TL;DR: Recognition results on sentences from a 5000-word vocabulary drawn from office correspondence are presented, which comprises the 5000 most frequently occurring words in a data-base of 14,000 office memoranda and letters, and has a perplexity of 90.
Abstract: Recognition results on sentences from a 5000-word vocabulary drawn from office correspondence are presented. The sentences were read with pauses between the words. The vocabulary comprises the 5000 most frequently occurring words in a data-base of 14,000 office memoranda and letters, and has a perplexity of 90, measured from a trigram language model. Experiments were carried out with 6 speakers (4 male, 2 female) in an office environment using a close-talking microphone. The recognition system was automatically trained to each speaker by having the speaker read 100 typical sentences from the office correspondence data-base. Recognition was carried out for each speaker on 20 test sentences, consisting of 299 words. The recognition rate (% words correct) averaged across the 6 speakers was 94.5%.

22 citations
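Perplexity, as quoted above, is the geometric mean of the inverse probabilities the language model assigns to each word. A minimal sketch (the probabilities below are made up for illustration):

```python
import math

def perplexity(probs):
    """Perplexity of a word sequence given per-word model probabilities:
    exp of the average negative log-probability."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)
```

A perplexity of 90 means the model faces, on average, the same uncertainty as choosing uniformly among 90 equally likely words at each position, even though the vocabulary holds 5000.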



Journal ArticleDOI
TL;DR: A speaker-independent segmentation procedure which automatically adapts the classifier to the speaker-dependent effects of coarticulation and is well suited for a speech input where the number of words in a word string is not known to the recognition system.
Abstract: Recognition of connected words can be performed by segmenting the word string automatically into single-word components which are then classified by a single-word recognition system. We propose and investigate a speaker-independent segmentation procedure which is based completely on statistical principles. An estimation algorithm, adapted to the statistical data of the signal parameters, determines the word boundaries. The statistical data are computed from vocabulary-dependent speech samples of different speakers. The segmentation procedure, which operates independently of the single-word recognizer, has been tested with connected digits. The results show that an estimation algorithm based on quadratic polynomials yields a very reliable segmentation. The segmentation procedure is also well suited for a speech input where the number of words in a word string is not known to the recognition system. Based on the above segmentation procedure, we have carried out several recognition experiments on two-to-four-digit strings. The investigations show that the proposed segmentation algorithm provides an efficient tool to tackle the effects of coarticulation between adjacent words. We present a training procedure which automatically adapts the classifier to the speaker-dependent effects of coarticulation.
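One illustrative reading of the quadratic-polynomial idea (an assumption for the sketch, not the authors' statistical estimation algorithm): a candidate word boundary at a local energy minimum can be refined to sub-frame precision by fitting a parabola through the three surrounding samples.

```python
def refine_boundary(energy, i):
    """Refine a candidate boundary at frame i by fitting a quadratic through
    (i-1, i, i+1) and returning the parabola's vertex position."""
    y0, y1, y2 = energy[i - 1], energy[i], energy[i + 1]
    denom = y0 - 2 * y1 + y2
    if denom == 0:
        return float(i)  # Degenerate (flat) case: keep the integer frame.
    return i + 0.5 * (y0 - y2) / denom
```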

Proceedings ArticleDOI
T. Iwata, H. Ishizuka, M. Watari, T. Hoshi, M. Mizuno 
01 Jan 1983
TL;DR: A single chip implementing a distance calculator, a dynamic programming equation calculator, and pipelined operations for use in speech recognition; up to 340 isolated words or 40 connected words can be recognized in real time.
Abstract: This report discusses a single chip implementing a distance calculator, a dynamic programming equation calculator, and pipelined operations for use in speech recognition. Up to 340 isolated words or 40 connected words can be recognized in real time.

Proceedings ArticleDOI
14 Apr 1983
TL;DR: The results of the simulation indicate that performance estimates from recognition experiments should be allowed wide error tolerances, and they illustrate the danger of trying too many features on the same database.
Abstract: Experiments are described in automatic, text-independent speaker recognition using three databases: good quality read speech, conversations over simulated telephone links, and conversations over real telephone links. A recognition system is evaluated on this material using a set of features which were believed to have some resistance to transmission degradations, namely, F 0 statistics and statistics of low-order cepstrum coefficient variation. Performance is reasonable on the first two databases but poor on the telephone speech. A new set of features based on the frequencies of peaks in the short-term smoothed spectrum is found to perform better on the telephone speech, presumably because of its greater resistance to noise and nonlinear distortions. A computer simulation of the recognition experiments is described. The results of the simulation indicate that performance estimates from recognition experiments should be allowed wide error tolerances, and they illustrate the danger of trying too many features on the same database.

Proceedings ArticleDOI
01 Apr 1983
TL;DR: A methodology is described to obtain a set of segments and rules that represents adequately the speech performance of a given speaker and how such a segment data base can be used for speech coding at very low bit rate, synthesis from unrestricted text, and continuous speech recognition.
Abstract: A methodology is described to obtain a set of segments and rules that adequately represents the speech performance of a given speaker. This methodology proceeds from an initial set of diphones extracted from a neutral context and modifies this set with larger and/or smaller segments depending on the match with natural utterances. Each segment is stored as a sequence of frames coded using LPC coefficients. An estimate of the likelihood of timescale distortion is associated with each frame. It represents knowledge on temporal variability that can be used by synthesis rules and/or pattern matching algorithms. It is then shown how such a segment database can be used for 1) speech coding at very low bit rate (~400 bits/s), 2) synthesis from unrestricted text, and 3) continuous speech recognition.


Proceedings ArticleDOI
J. Wolf1, M. Krasner, K. Karnofsky, Richard Schwartz, S. Roucos 
01 Apr 1983
TL;DR: Preliminary results show that the probabilistic methods perform significantly better than a minimum-distance classifier for the multi-session paradigm.
Abstract: In this paper, we present the preliminary performance of four methods for text-independent speaker identification using speech transmitted over radio channels. In a previous paper [1], we showed that for both laboratory-quality and simulated noisy-channel data in a single-session paradigm, new probabilistic classifiers yielded performance superior to that of a minimum distance classifier. We have recently compiled a speech database consisting of speech transmissions over a radio-channel. The lower quality and higher variability of this database differ markedly from the laboratory-quality databases often used in speech processing research. We present preliminary results with the same four methods of text-independent speaker identification using the radio-channel database with several experimental paradigms including multi-session paradigms. These results show that the probabilistic methods perform significantly better than a minimum-distance classifier for the multi-session paradigm.

Proceedings ArticleDOI
01 Apr 1983
TL;DR: This paper addresses the use of linear frequency warping for template normalization and describes both a technique for estimating the long-term distribution of the frequencies of a talker's formants and a techniques for automatically predicting an optimal linear frequency warp.
Abstract: In a template-based, speaker-independent, speech recognition system, stored templates may be used in matching the speech of new users. For optimal results, templates should be carefully selected and proper normalization algorithms should be applied for each new talker. This paper addresses the use of linear frequency warping for template normalization and describes both a technique for estimating the long-term distribution of the frequencies of a talker's formants and a technique for automatically predicting an optimal linear frequency warp.
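A linear frequency warp of a magnitude spectrum, plus a brute-force search for the warp factor that best matches a reference template, can be sketched as follows. This is a toy illustration: the paper predicts the warp automatically from formant statistics rather than searching.

```python
import numpy as np

def warp_spectrum(spectrum, alpha):
    """Resample a magnitude spectrum along a linearly scaled frequency axis:
    warped bin k takes the value at original bin k / alpha."""
    bins = np.arange(len(spectrum), dtype=float)
    return np.interp(bins / alpha, bins, spectrum)

def best_warp(spectrum, reference, alphas):
    """Brute-force search for the warp factor minimizing spectral distance
    to a reference template."""
    return min(alphas,
               key=lambda a: np.linalg.norm(warp_spectrum(spectrum, a) - reference))
```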




Proceedings Article
08 Aug 1983
TL;DR: The FOPHO (Foreign Phonetician) speech recognition project concerns the development of a system to produce a reasonably high-quality phonetic transcription output from continuous speech input.
Abstract: The FOPHO (Foreign Phonetician) speech recognition project concerns the development of a system to produce a reasonably high-quality phonetic transcription output from continuous speech input. The system is developed to perform in a way which approximates the actions of a phonetician trying to transcribe a foreign tongue (in the case of FOPHO, Australian English). Because of this central philosophy, FOPHO is a very interactive system and has facilities for automatic learning and analysis of its own performance. Good-quality recognition is achieved through algorithms which are very context-dependent and which are sensitive to a variety of possible productions of similar sounds, even though the system itself is speaker-independent.

Journal ArticleDOI
TL;DR: This system, which is designed to provide three major functions — intelligent interface, knowledge-base management, and problem solving — will provide communications with the computer in a form natural to humans, particularly via speech and graphics.
Abstract: Work in the area of speech recognition by computer has been taking place for about 30 years — almost since the inception of the computer itself. The results of these efforts can be seen today in a number of products capable of various degrees of speech recognition. Although the systems used in these products exhibit limitations when compared to people's ability to recognize and respond to speech, the use of speech in our everyday work is so natural and desirable that these systems are finding numerous applications. Speech recognition has even become important on a global level, as can be seen by the integral role it will play in the proposed fifth-generation Japanese computer system. This system, which is designed to provide three major functions — intelligent interface, knowledge-base management, and problem solving — will provide communications with the computer in a form natural to humans, particularly via speech and graphics.

Proceedings ArticleDOI
E. Bronson1
14 Apr 1983
TL;DR: A discrete utterance recognition technique which applies formal language theory to a symbol string derived from the speech input using stored context-free grammars for the allowed vocabulary is described.
Abstract: This paper describes a discrete utterance recognition technique which applies formal language theory to a symbol string derived from the speech input. Analysis is performed to obtain a representation of the input utterance in terms of acoustically consistent labeled regions. Syntactic pattern recognition is then used to parse this representation of the input word using stored context-free grammars for the allowed vocabulary. Preliminary results are reported.

Proceedings ArticleDOI
Aaron E. Rosenberg1
14 Apr 1983
TL;DR: A probabilistic model is developed to account for the error rate behavior of isolated word speech recognition systems and results indicate that two-way mixture distributions account quite well for the experimental performance results.
Abstract: A probabilistic model is developed to account for the error rate behavior of isolated word speech recognition systems. Two kinds of error are examined, confusion error, an a priori characterization of a recognizer which measures differences between words, and recognition rank error, an a posteriori characterization, which, in addition to taking into account differences between words, accounts for differences between different tokens of the same word. It is shown that these kinds of error can be modelled by describing recognition trials as Bernoulli trials. Good models of error rate behavior as a function of vocabulary size can be obtained if the distributions of confusion or rank number are considered to be mixtures of binomial distributions. The data obtained from a recent experiment in isolated word recognition with a large vocabulary, (1109 words), are used to evaluate the model. Model functions based on mixture distributions are fit by means of an optimization algorithm to experimental error rate functions obtained from each of six talkers and three partitions of the vocabulary. The results indicate that two-way mixture distributions account quite well for the experimental performance results.
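One simple reading of the binomial-mixture idea (an illustrative parameterization, not necessarily the paper's exact functional form): each of the V-1 competing words independently out-scores the correct word with some confusion probability p, and accuracy as a function of vocabulary size is a weighted mixture over two such probabilities.

```python
def mixture_accuracy(v, weights, confusion_probs):
    """Recognition accuracy vs. vocabulary size v under a mixture of
    Bernoulli confusion probabilities: a trial succeeds only if none of
    the v - 1 competitors out-scores the correct word."""
    return sum(w * (1.0 - p) ** (v - 1)
               for w, p in zip(weights, confusion_probs))
```

A single binomial decays too uniformly with vocabulary size; the two-way mixture lets a small "hard" component dominate the error rate at large vocabularies, which matches the paper's finding that two-way mixtures fit the 1109-word data well.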

Journal ArticleDOI
TL;DR: A low cost, microcomputer-based voice recognition device makes a convenient input channel for an interactive model of a manufacturing system and potential exists for useful voice control of simulations in the near future.
Abstract: A low cost, microcomputer-based voice recognition device makes a convenient input channel for an interactive model of a manufacturing system. The problems with current hardware are its limited capabilities and unreliable operation. However, the potential exists for useful voice control of simulations in the near future

Journal ArticleDOI
TL;DR: The goal of the research is to develop an automatic typewriter that will automatically edit and type text under voice control and an application of the composition dynamic programming method for the solution of basic problems in the recognition and understanding of speech.
Abstract: This article discusses the automatic processing of speech signals with the aim of finding a sequence of words (speech recognition) or a concept (speech understanding) being transmitted by the speech signal. The goal of the research is to develop an automatic typewriter that will automatically edit and type text under voice control. A dynamic programming method is proposed in which all possible class signals are stored, after which the presented signal is compared to all the stored signals during the recognition phase. Topics considered include element-by-element recognition of words of speech, learning speech recognition, phoneme-by-phoneme speech recognition, the recognition of connected speech, understanding connected speech, and prospects for designing speech recognition and understanding systems. An application of the composition dynamic programming method for the solution of basic problems in the recognition and understanding of speech is presented.

01 Sep 1983
TL;DR: The results of tests for both the speaker-dependent and speaker-independent cases indicate that phase may be an important feature to consider in the development of word recognition systems.
Abstract: The use of phase-only representations of speech for isolated word recognition is explored. Until recently the ear was thought to be short-term phase insensitive. However, short-term phase-only reconstructed speech has been shown to retain much of the intelligibility of the original signal. Using cepstral and analytic signal processing techniques, a system for isolated word recognition is developed. The results of tests for both the speaker-dependent and speaker-independent cases indicate that phase may be an important feature to consider in the development of word recognition systems.
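A phase-only representation of the kind described can be sketched with an FFT: keep each bin's phase, flatten the magnitude to unity, and invert. This is a toy illustration of the representation only, not the study's cepstral/analytic-signal pipeline.

```python
import numpy as np

def phase_only(signal):
    """Phase-only reconstruction: discard the magnitude spectrum, keep the
    phase, and invert back to the time domain."""
    spec = np.fft.rfft(signal)
    flat = np.exp(1j * np.angle(spec))  # unit magnitude, original phase
    return np.fft.irfft(flat, n=len(signal))
```

An impulse at time zero already has unit magnitude and zero phase in every bin, so it passes through unchanged, which makes a convenient sanity check.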

Proceedings ArticleDOI
14 Apr 1983
TL;DR: In this work several distance classifiers are evaluated for use in text-independent speaker identification and it is found that both the maximum a posteriori probability criterion and the correlation distance measure yield extremely poor results.
Abstract: A survey of research efforts in the area of speaker recognition indicates that for the same choice of speaker-dependent speech parameters the recognition accuracy is significantly affected by the distance measure used. In this work several distance classifiers are evaluated for use in text-independent speaker identification. The four distance measures investigated are the Mahalanobis distance, maximum a posteriori probability, the nearest neighbor criterion, and the correlation distance measure. It is found that both the maximum a posteriori probability criterion and the correlation distance measure yield extremely poor results. The Mahalanobis distance and the nearest neighbor criterion yield relatively poor results (error ~20-30%), with the former consistently superior to the latter. It is shown that these scores can be improved through a proposed variation of the nearest neighbor method.
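Of the four measures compared, the Mahalanobis distance can be sketched as follows, with hypothetical speaker models given as (mean, covariance) pairs; the paper's actual feature set and model estimation are not reproduced here.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of a feature vector x from a speaker model,
    i.e. Euclidean distance rescaled by the model's covariance."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def identify(x, models):
    """Minimum-distance speaker identification: pick the speaker whose
    model is nearest in Mahalanobis distance."""
    return min(models, key=lambda s: mahalanobis(x, *models[s]))
```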

Journal ArticleDOI
TL;DR: In this paper, a Speaker Recognizability Test (SRT) was designed which tries to establish how well a given communications system preserves a speaker's identity, and no attempt was made to identify the cues used by listeners for speaker recognition.
Abstract: A Speaker Recognizability Test (SRT) has been designed which tries to establish how well a given communications system preserves a speaker's identity. Contrary to previous efforts, no attempt is made to identify the cues used by listeners for speaker recognition. Instead, listeners are asked directly to identify a speaker who says an utterance. The test is constructed as follows: Several sentences are collected from five male and five female speakers. One sentence from each speaker is used as reference. The listening team consists of ten listeners. Each listener is presented 20 different sentences and is asked to identify the speaker of each one of them by comparing it with the ten reference sentences. Among the issues considered in the design of the test is the choice of speakers, the use of reference sentences from the same or different sessions of data collection, and the use of processed or unprocessed speech for reference.