
Showing papers on "Speaker recognition published in 1981"


Journal ArticleDOI
TL;DR: This paper discusses word recognition as a classical pattern-recognition problem and shows how some fundamental concepts of signal processing, information theory, and computer science can be combined to give us the capability of robust recognition of isolated words and simple connected word sequences.
Abstract: The art and science of speech recognition have been advanced to the state where it is now possible to communicate reliably with a computer by speaking to it in a disciplined manner using a vocabulary of moderate size. It is the purpose of this paper to outline two aspects of speech-recognition research. First, we discuss word recognition as a classical pattern-recognition problem and show how some fundamental concepts of signal processing, information theory, and computer science can be combined to give us the capability of robust recognition of isolated words and simple connected word sequences. We then describe methods whereby these principles, augmented by modern theories of formal language and semantic analysis, can be used to study some of the more general problems in speech recognition. It is anticipated that these methods will ultimately lead to accurate mechanical recognition of fluent speech under certain controlled conditions.

246 citations


Journal ArticleDOI
TL;DR: The results of the experiments show that there is only a slight difference between the recognition accuracies for statistical features and dynamic features over the long term, and it is more efficient to use statistical features than dynamic features.
Abstract: This paper describes results of speaker recognition experiments using statistical features and dynamic features of speech spectra extracted from fixed Japanese word utterances. The speech wave is transformed into a set of time functions of log area ratios and a fundamental frequency. In the case of statistical features, a mean value and a standard deviation for each time function and a correlation matrix between these functions are calculated in the voiced portion of each word, and after a feature selection procedure, they are compared with reference features. In the case of dynamic features, the time functions are brought into time registration with reference functions. The results of the experiments show that there is only a slight difference between the recognition accuracies for statistical features and dynamic features over the long term. Since the amount of calculation necessary for recognition using statistical features is only about one-tenth of that for recognition using dynamic features, it is more efficient to use statistical features than dynamic features. When training utterances are recorded over ten months for each customer and spectral equalization is applied, 99.5 percent and 96.3 percent verification accuracies can be obtained for input utterances ten months and five years later, respectively, using statistical features extracted from two words. Combination of dynamic features with statistical features can reduce the error rate to half that obtained with either one alone.

131 citations
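The statistical-feature scheme described in the abstract above (a per-track mean and standard deviation plus an inter-track correlation matrix, compared against stored reference features) can be sketched as follows. This is a minimal illustration; the Euclidean comparison rule and all function names are assumptions, not the paper's exact procedure.

```python
import math

def stat_features(tracks):
    """Summarize a set of parameter time functions (e.g. log area
    ratios over the voiced portion of a word): per-track mean and
    standard deviation, plus the inter-track correlation matrix."""
    means = [sum(t) / len(t) for t in tracks]
    stds = [math.sqrt(sum((x - m) ** 2 for x in t) / len(t))
            for t, m in zip(tracks, means)]
    n = len(tracks[0])
    corr = [[sum((a - ma) * (b - mb) for a, b in zip(ta, tb)) / (n * sa * sb)
             for tb, mb, sb in zip(tracks, means, stds)]
            for ta, ma, sa in zip(tracks, means, stds)]
    return means, stds, corr

def feature_distance(f1, f2):
    """Euclidean distance between two flattened feature sets -- a
    stand-in for the paper's (unspecified) comparison rule."""
    v1 = f1[0] + f1[1] + [c for row in f1[2] for c in row]
    v2 = f2[0] + f2[1] + [c for row in f2[2] for c in row]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
```

The appeal noted in the abstract is visible here: once the summary statistics are computed, each utterance is a single fixed-length vector, so comparison is one distance evaluation rather than a full time alignment.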


Patent
27 Mar 1981
TL;DR: In this paper, a set of signals representative of the correspondence of the identified speaker's features with the feature templates of said reference words is generated, and an unknown speaker is analyzed and the reference word sequence of the utterance is identified.
Abstract: In a speaker recognition and verification arrangement, acoustic feature templates are stored for predetermined reference words. Each template is a standardized set of acoustic features for one word, formed for example by averaging the values of acoustic features from a plurality of speakers. Responsive to the utterances of identified speakers, a set of signals representative of the correspondence of the identified speaker's features with said feature templates of said reference words is generated. An utterance of an unknown speaker is analyzed and the reference word sequence of the utterance is identified. A set of signals representative of the correspondence of the unknown speaker's utterance features and the stored templates for the recognized words is generated. The unknown speaker is identified jointly responsive to the correspondence signals of the identified speakers and unknown speaker.

65 citations




Proceedings ArticleDOI
01 Apr 1981
TL;DR: Improvements in discriminability among similar words can be achieved by modifying the pattern similarity algorithm so that the recognition decision is made in two passes.
Abstract: One of the major drawbacks of the standard pattern recognition approach to isolated word recognition is that poor performance is generally achieved for word vocabularies with acoustically similar words. This poor performance is related to the pattern similarity (distance) algorithms that are generally used, in which a global distance between the test pattern and each reference pattern is computed. Since acoustically similar words are, by definition, globally similar, it is difficult to reliably discriminate such words, and a high error rate is obtained. By modifying the pattern similarity algorithm so that the recognition decision is made in two passes, improvements in discriminability among similar words can be achieved. In particular, on the first pass the recognizer provides a set of global distance scores which are used to decide a class (or a set of possible classes) in which the spoken word is estimated to belong. On the second pass a locally weighted distance is used to provide optimal separation among words in the chosen class (or classes) and the recognition decision is made on the basis of these local distance scores. For a highly complex vocabulary (letters of the alphabet, digits, and 3 command words) recognition improvements of 3 to 7 percent were obtained using the two-pass recognition strategy.

32 citations
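The two-pass decision strategy described above can be sketched as follows. The distance functions, the weights, and the class layout are illustrative placeholders, not the paper's exact algorithm.

```python
def two_pass_recognize(test, references, classes, local_weights):
    """Two-pass decision: pass 1 picks the class whose best member
    has the smallest global distance; pass 2 re-scores only that
    class with a locally weighted distance."""
    def global_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def weighted_dist(a, b, w):
        return sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b))

    # Pass 1: coarse class decision from global distance scores.
    best_class = min(classes, key=lambda c: min(
        global_dist(test, references[w]) for w in classes[c]))

    # Pass 2: fine decision inside the chosen class, using weights
    # that emphasize the locally discriminative regions.
    return min(classes[best_class], key=lambda w: weighted_dist(
        test, references[w], local_weights[best_class]))
```

For an acoustically confusable set such as the letters "b" and "d", the second-pass weights would concentrate on the initial consonant region, which is exactly where a uniform global distance dilutes the evidence.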


PatentDOI
TL;DR: In this paper, a similarity measure is calculated from comparing selected feature vectors among an input speech signal sequence of feature vectors (A) and a selected sequence (B) of reference vectors selected from a plurality of pre-stored reference sequences.
Abstract: Speaker recognition is decided by a similarity measure (D) calculated from comparing selected feature vectors among an input speech signal sequence of feature vectors (A) and a selected sequence (B) of reference vectors selected from a plurality of pre-stored reference sequences. Prior to comparison of the input and reference vector sequences, the two sequences are time normalized to align corresponding feature vectors. A significant sound specifying signal (V) including a time sequence of elementary signals is generated in synchronism with one of the input and reference sequences and indicates which feature vectors in that one of the input and reference sequences are considered to represent significant sound. The similarity measure (D) is then calculated in accordance with the comparison of those feature vectors in the one sequence which are indicated by the significant sound specifying signal as representing significant sound and the corresponding feature vectors of the other sequence.

29 citations
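A minimal sketch of the patent's masked similarity measure, assuming the two sequences have already been time normalized so that frame i of one corresponds to frame i of the other. The squared-Euclidean frame distance and the averaging are assumptions, not the patent's exact formula.

```python
def masked_similarity(input_seq, ref_seq, significant):
    """Similarity measure computed only over frames flagged by the
    significant sound specifying signal (V). Frames where V is 0
    (silence or non-significant sound) are excluded entirely."""
    total, count = 0.0, 0
    for a, b, v in zip(input_seq, ref_seq, significant):
        if v:  # only significant-sound frames contribute
            total += sum((x - y) ** 2 for x, y in zip(a, b))
            count += 1
    return total / count if count else float("inf")
```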


Journal ArticleDOI
TL;DR: Results indicate that fundamental-frequency mean, formant mean and formant bandwidth are the most important parameters, among those investigated, for speaker recognition, and although listeners differ in the average score recorded, they may be treated as reacting identically to changes in the factors.
Abstract: Many previous experimenters, by manipulating parameters in isolation, have examined the potentiality of these parameters as speaker-characterizing features, not their relative habitual importance for speaker recognition in everyday life. The two experiments reported here investigate this relative importance by the simultaneous manipulation of parameters. Using synthetic speech in a voice similarity judgment format, the first experiment employs eight factors in a restricted factorial design, and the second a subset of four of these factors in a full factorial design. Results indicate that (i) fundamental-frequency mean, formant mean and formant bandwidth are the most important parameters, among those investigated, for speaker recognition, and (ii) although listeners differ in the average score recorded, they may be treated as reacting identically to changes in the factors. Implications for perception theory are outlined.

27 citations


Proceedings ArticleDOI
Hermann Ney1
01 Apr 1981
TL;DR: An optimization technique for locating the initial and final points of utterances by means of dynamic programming and results are presented for end-point detection in a speaker recognition system using only the speech intensity as acoustic parameter.
Abstract: This paper describes an optimization technique for locating the initial and final points of utterances. Acoustic parameters extracted from each signal segment are converted into a cost function versus time. An overall cost for the presence of a speech signal is introduced and is to be optimized with respect to the unknown initial and final points. The optimization is carried out by means of dynamic programming. The computation grows linearly with the number of segments. In a second stage, the locations of the obtained endpoints are refined by matching transition templates against the input signal. Results are presented for end-point detection in a speaker recognition system using only the speech intensity as the acoustic parameter.

21 citations
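The linear-time endpoint optimization described above can be illustrated with a Kadane-style dynamic program over per-segment costs. The cost convention (speech-like segments carry negative cost, e.g. threshold minus intensity) is an assumption made for this sketch, not necessarily the paper's formulation.

```python
def find_endpoints(costs):
    """Locate the segment interval [s, e] whose summed cost is
    minimal, in one linear pass over the segments. With negative
    costs for speech-like segments, the optimal interval brackets
    the utterance."""
    best, best_span = float("inf"), (0, 0)
    run, start = 0.0, 0
    for i, c in enumerate(costs):
        if run > 0:          # a positive prefix never helps; restart
            run, start = 0.0, i
        run += c
        if run < best:
            best, best_span = run, (start, i)
    return best_span
```

The single pass matches the paper's claim that computation grows linearly with the number of segments; the second-stage template refinement is not sketched here.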


Journal ArticleDOI
TL;DR: The speaker-independent performance of word-based speech recognition systems is improved by rapidly and automatically deducing general characteristics of the current speaker and using them to derive speaker-normalizing transforms.
Abstract: This work is aimed at enhancing the speaker-independent performance of word-based speech recognition systems by rapidly and automatically deducing general characteristics of the current speaker and using them to derive speaker-normalizing transforms. DP matching is used to align and compare corresponding frames of the incoming speech and reference vocabulary. A single transform is then computed for all voiced speech and another for all unvoiced speech. The transform consists of a linear filtering component and, optionally, a constrained frequency shift. Experiments have been carried out with twenty male and female, native and non-native English speakers, each producing 150 digits. Adaptation on all 150 digits reduces recognition errors by a factor of three (4.5% to 1.5%). With adaptation on just three randomly selected digits, the reduction factor is two. Frequency shifting is useful only when the amount of adaptation material is large and the reference speech is not exclusively from the same sex as the cur...

16 citations


Proceedings ArticleDOI
01 Apr 1981
TL;DR: The paper describes isolated-word recognition experiments on a multi-speaker speech recognition system that uses Redundant Hash Addressing for fast comparison of the phonemic transcriptions with referent strings stored in a dictionary.
Abstract: The paper describes isolated-word recognition experiments on a multi-speaker speech recognition system. The system is organized in two main stages. At the phonemic recognition stage the phonemic transcription of the speech waveform is produced by simultaneous segmentation and labeling accomplished by the Learning Subspace Method. It directly produces an approximately correct number of phonemes. At the word recognition stage Redundant Hash Addressing is used for fast comparison of the phonemic transcriptions with referent strings stored in a dictionary. The average word recognition accuracy in a 200-word experiment with five speakers was about 95 per cent.

10 citations
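The dictionary lookup stage can be illustrated with a simplified n-gram voting index in the spirit of Redundant Hash Addressing: many redundant fragments of an (error-prone) phonemic transcription all address dictionary entries, and the entry collecting the most votes wins. This is a sketch, not the original data structure.

```python
def build_index(dictionary, n=2):
    """Map each phoneme n-gram to the set of words containing it.
    dictionary maps a word to its reference phoneme string."""
    index = {}
    for word, phonemes in dictionary.items():
        for i in range(len(phonemes) - n + 1):
            index.setdefault(phonemes[i:i + n], set()).add(word)
    return index

def lookup(index, transcription, n=2):
    """Score dictionary words by how many n-grams of the (possibly
    erroneous) transcription vote for them; return the top word."""
    votes = {}
    for i in range(len(transcription) - n + 1):
        for word in index.get(transcription[i:i + n], ()):
            votes[word] = votes.get(word, 0) + 1
    return max(votes, key=votes.get) if votes else None
```

Because matching is by fragment voting rather than exact string equality, a transcription with a wrong or missing phoneme can still retrieve the correct word, which is what makes this style of lookup tolerant of phonemic recognition errors.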


Proceedings ArticleDOI
01 Apr 1981
TL;DR: In this study it is hypothesized that distributions of template distance scores are reasonably consistent for individual speakers and vary characteristically from speaker to speaker.
Abstract: One method for providing speaker independent word recognition capability is to construct a small set of templates for each vocabulary word that typifies and spans individual speaker word reference templates over a large population of speakers. Word recognition decision functions are based on combinations of template distance scores obtained by processing an unknown input utterance and comparing it with the ensemble of reference templates. In this study it is hypothesized that distributions of template distance scores are reasonably consistent for individual speakers and vary characteristically from speaker to speaker. This property is exploited to provide a speaker recognition capability in combination with word recognition. It is shown that good speaker recognition performance depends on the input of a sequence of distinct words. For a 20-speaker population, on the average, the correct speaker is in the top 1% of the candidates in the identification made over a sequence of seven distinct words.
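The hypothesis above, that template distance-score distributions are consistent per speaker and vary from speaker to speaker, suggests a decision rule like the following sketch. The histogram representation and L1 comparison are illustrative choices, not the paper's exact method.

```python
def identify_speaker(score_histogram, speaker_profiles):
    """Pick the speaker whose stored distance-score distribution is
    closest (L1 distance between normalized histograms) to the one
    observed over the unknown word sequence."""
    def normalize(h):
        s = sum(h)
        return [x / s for x in h]

    obs = normalize(score_histogram)
    return min(speaker_profiles, key=lambda spk: sum(
        abs(a - b) for a, b in zip(obs, normalize(speaker_profiles[spk]))))
```

Accumulating the histogram over several distinct words, as the paper requires for good performance, makes the observed distribution a more stable estimate before it is compared against the stored profiles.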

Proceedings ArticleDOI
01 Apr 1981
TL;DR: A method to identify vowels in continuous speech of unspecified speakers is discussed and it is seen that rather simple algorithms in the articulatory domain are effective to deal with coarticulation effects in various contexts.
Abstract: An effective method to estimate articulatory movements has been developed and applied to speech recognition. Movements of articulatory parameters estimated from speech waves are useful features in speech recognition, especially for continuous speech. In this paper, a method to identify vowels in continuous speech of unspecified speakers is discussed. First, results for discrete utterances of the five Japanese vowels are shown. A normalization procedure to eliminate speaker differences is useful in vowel recognition. Second, results on vowel discrimination in continuous speech are reported. It is seen that rather simple algorithms in the articulatory domain are effective in dealing with coarticulation effects in various contexts.

Proceedings ArticleDOI
01 Apr 1981
TL;DR: The results indicate that the parameters comprising the optimal set chosen are speaker-dependent, and a technique using dynamic programming was used to select a subset of the k best features from the entire set of N.
Abstract: The main objective of this work was to investigate the effectiveness of long-term averages of the orthogonal linear prediction parameters in text-independent speaker recognition. To investigate the possibility of feature selection, a technique using dynamic programming (1) was used to select a subset of the k best features from the entire set of N. The results indicate that the parameters comprising the optimal set chosen are speaker-dependent. Verification accuracies of 96.5% were obtained using the selected optimal 8-parameter (out of 12) feature set for each speaker in a verification scheme, in which the reference parameters were generated from 100 seconds of time-spaced voiced speech and the test parameters were generated from 5 seconds of voiced speech.
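The feature-selection idea can be illustrated with an exhaustive search over k-subsets scored by a per-feature F-ratio (between-speaker variance over within-speaker variance). The paper uses a dynamic-programming procedure and its own criterion, so everything here is an assumed stand-in for illustration.

```python
from itertools import combinations

def f_ratio(values_per_speaker):
    """Between-speaker variance over mean within-speaker variance
    for one feature -- a common speaker-discrimination criterion."""
    means = [sum(v) / len(v) for v in values_per_speaker]
    grand = sum(means) / len(means)
    between = sum((m - grand) ** 2 for m in means) / len(means)
    within = sum(sum((x - m) ** 2 for x in v) / len(v)
                 for v, m in zip(values_per_speaker, means)) / len(means)
    return between / within if within else float("inf")

def select_features(data, k):
    """Pick the k features with the best combined criterion by
    exhaustive search. data[f][s] lists feature f's values for
    speaker s."""
    return max(combinations(range(len(data)), k),
               key=lambda subset: sum(f_ratio(data[f]) for f in subset))
```

With an additive criterion like this, exhaustive search reduces to taking the top-k features; dynamic programming becomes worthwhile when the criterion couples features, as in the paper's setting.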


Proceedings ArticleDOI
01 Apr 1981
TL;DR: The results of this investigation clearly show that Markel's technique is superior for applications using very short speech segments for both the speaker models and the recognition trials.
Abstract: This paper describes the design and implementation of a realtime speaker recognition system. The system performs text independent, closed set speaker recognition with up to 30 talkers in realtime. In addition, the reference speech used to characterize the 30 talkers can be extracted from as little as 10 seconds of speech from each talker, and the actual recognition performed with less than one minute of speech from the unknown talker. Two speaker recognition algorithms previously developed by Markel and Pfeifer were investigated for use in the realtime system. The results of this investigation clearly show that Markel's technique is superior for applications using very short speech segments for both the speaker models and the recognition trials. Markel's technique was implemented in realtime in a high speed programmable signal processor. A test of this implementation with a set of 30 male speakers resulted in recognition accuracies of 93-100% for models generated with only 10 seconds of speech, and recognition trials using only 10 seconds of unknown speech.

Proceedings ArticleDOI
01 Apr 1981
TL;DR: Structure and performance in an acoustic processor in the conversational speech recognition system, called Voice Q-A System II, are described, together with a comparison with the old system.
Abstract: Structure and performance of an acoustic processor in the conversational speech recognition system, called Voice Q-A System II, are described, together with a comparison with the old system. The acoustic processor adopts the LPC peak-weighted spectral matching measure. The acoustic processor is much improved on phoneme segmentation and vowel recognition. The task is a train seat reservation service, which contains 112 words. Input is conversational speech with a short pause between adjacent phrases. The recognition test was made on the acoustic processor, which is connected with the linguistic processor. Results show a 96.9% phrase recognition rate and a 66.8% phoneme recognition rate on the average for nine male speakers.

Journal ArticleDOI
TL;DR: The authors presented an algorithm which chooses a reference template for each word in the vocabulary from a set of N exemplars, which minimizes the worst matching behavior and total error over the N sets of exemplars.
Abstract: Presented here, for a speaker-dependent system, is an algorithm which chooses a reference template for each word in the vocabulary from a set of N exemplars. The goal of the algorithm is to produce a reference set that minimizes the worst matching behavior and total error over the N sets of exemplars. The results of the experiments presented here show a reduction in the average error rate from 16.4% to 10.2% over a set of 4 male speakers and 4 female speakers.
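A minimal minimax reading of the template-selection goal above: pick the exemplar whose worst distance to its peers is smallest, breaking ties by total distance. The distance function is supplied by the caller, and the tie-breaking rule is an assumption, not the paper's published algorithm.

```python
def choose_template(exemplars, dist):
    """From N exemplars of one word, pick the one with the smallest
    worst-case distance to the remaining exemplars, breaking ties
    by the smallest total distance."""
    def score(i):
        others = [dist(exemplars[i], e)
                  for j, e in enumerate(exemplars) if j != i]
        return (max(others), sum(others))

    return exemplars[min(range(len(exemplars)), key=score)]
```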

Proceedings ArticleDOI
Hermann Ney1
01 Apr 1981
TL;DR: A speaker recognition system is investigated which operates on telephone speech and performs speech analysis by means of the clipped autocorrelation function, and the time warping method based on dynamic programming is used to bring sample utterances into time registration with reference utterances.
Abstract: A speaker recognition system is investigated which operates on telephone speech and performs speech analysis by means of the clipped autocorrelation function. The advantages of the clipped autocorrelation function are its simple computation and its reduced dynamic variability as compared to the standard autocorrelation function. Utterances are represented by time contours of clipped autocorrelation coefficients. The time warping method based on dynamic programming is used to bring sample utterances into time registration with reference utterances. Different methods of preprocessing the time contours are studied with respect to speaker discrimination. For cooperative speakers, verification error rates of 3% and less than 2% were obtained using speaker independent and speaker individual thresholds, respectively.
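The clipped autocorrelation analysis can be sketched as follows: hard-clip each sample to -1/0/+1, then correlate, which bounds the dynamic range and reduces the arithmetic to counting, matching the advantages the abstract cites. The normalization and threshold handling are assumptions made for this sketch.

```python
def clipped_autocorr(signal, max_lag, clip=0.0):
    """Autocorrelation of the hard-clipped signal. Each sample is
    reduced to -1, 0, or +1 (center clipping at +/- clip), so the
    products are trivial and the coefficients are bounded. The
    result is normalized by the zero-lag value."""
    s = [1 if x > clip else -1 if x < -clip else 0 for x in signal]
    n = len(s)
    r = [sum(s[i] * s[i + k] for i in range(n - k))
         for k in range(max_lag + 1)]
    r0 = r[0] or 1  # guard against an all-zero clipped signal
    return [rk / r0 for rk in r]
```

In the paper's system, utterances are represented by time contours of such coefficients, which are then aligned against reference contours by dynamic-programming time warping.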

Book ChapterDOI
01 Jan 1981
TL;DR: In this paper, the authors describe a programme of research, funded by the Home Office, into the voice identification performance of human listeners in criminal contexts such as obscene telephone calls and kidnap cases.
Abstract: Most criminal identifications are made using visual cues, but there are some instances when both visual and verbal information is available, and others when only verbal cues exist. In obscene telephone calls, for example, often the only possible method of identifying the speaker is by his voice. Similarly, in kidnap cases information concerning the kidnapper’s voice is sometimes made available by him over the telephone. In such situations, when a suspect is in the hands of the police a voice-matching exercise may be undertaken. Here a witness may be asked whether the suspect’s voice resembles the criminal’s voice, or a machine may be employed in an attempt to answer these questions if a record (e.g. a tape) of the original criminal voice is available. In the Psychology Department of the North East London Polytechnic, Brian Clifford and I have recently begun a programme of research (funded by the Home Office) concerning the voice identification performance of human listeners in this kind of context.

01 Jan 1981
TL;DR: The results of the Voice Command System (VCS) flight experiment on the five-day STS-41 mission were presented in this paper, where two mission specialists, Bill Shepherd and Bruce Melnick, used the speaker-dependent system to evaluate the operational effectiveness of using voice to control a spacecraft system.
Abstract: This report presents the results of the Voice Command System (VCS) flight experiment on the five-day STS-41 mission. Two mission specialists, Bill Shepherd and Bruce Melnick, used the speaker-dependent system to evaluate the operational effectiveness of using voice to control a spacecraft system. In addition, data was gathered to analyze the effects of microgravity on speech recognition performance.

Journal ArticleDOI
TL;DR: In this paper, each word to be recognized was modeled as a sequence of segments with each segment being variable length and having a uniform spectrum with added noise, and the model was then used to produce synthetic tokens for testing the recognizer.
Abstract: The performance of automatic speech recognition systems is commonly measured by using large quantities of natural speech as a benchmark. For recognizers which accept large vocabularies, or when many alternative vocabularies are of interest, this method requires an unreasonably large natural speech corpus. Synthetic speech which incorporates variations in pronunciation is an alternative in these cases. This approach was used to evaluate the performance of a speaker trained, isolated word speech recognition system [H. Murveit, M. Lowy, and R. W. Brodersen, J. Acoust. Soc. Am. Suppl. 1 69, S8 (1981)]. Each word to be recognized was modeled as a sequence of segments with each segment being variable length and having a uniform spectrum with added noise. A few tokens of natural speech were used to evaluate the parameters of the model. The model was then used to produce synthetic tokens for testing the recognizer. Good correlation was obtained between the confusion matrices for synthetic and natural speech when recognized. [Work supported in part by DARPA.]

Proceedings ArticleDOI
01 Apr 1981
TL;DR: In the absence of perfect speaker normalization techniques, speaker-independent recognition using spectral pattern matching is improved significantly by using multiple lexical patterns for each word.
Abstract: In the absence of perfect speaker normalization techniques, speaker-independent recognition using spectral pattern matching is improved significantly by using multiple lexical patterns for each word.


23 Jul 1981
TL;DR: An algorithm for compressing the spectral representation of an utterance along the time axis while keeping the main features intact is described to save template storage space and to reduce the time required for recognition.
Abstract: This paper describes an algorithm for compressing the spectral representation of an utterance along the time axis while keeping the main features intact. The goal of the algorithm is to save template storage space and to reduce the time required for recognition. For 8 speakers, 5 data sets each, the results indicated that we can save about 40% of the template space and 35% of the recognition time with only a slightly higher error rate.
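The time-axis compression idea, collapsing runs of spectrally similar adjacent frames while preserving transitions, can be sketched as follows. The squared-Euclidean frame distance and the fixed threshold are illustrative choices, not the report's algorithm.

```python
def compress_frames(frames, threshold):
    """Collapse each run of similar adjacent spectral frames into
    its average, shrinking the template along the time axis while
    keeping the spectral transitions intact."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    out, run = [], [frames[0]]
    for f in frames[1:]:
        if d2(f, run[-1]) < threshold:
            run.append(f)          # still inside a steady region
        else:
            out.append([sum(c) / len(run) for c in zip(*run)])
            run = [f]
    out.append([sum(c) / len(run) for c in zip(*run)])
    return out
```

Steady vowel regions, which contribute many nearly identical frames, compress heavily, while rapid transitions survive, consistent with the roughly 40% template-space saving the report quotes.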