
Showing papers on "Speaker recognition published in 1981"


Journal ArticleDOI
TL;DR: This paper discusses word recognition as a classical pattern-recognition problem and shows how some fundamental concepts of signal processing, information theory, and computer science can be combined to give us the capability of robust recognition of isolated words and simple connected word sequences.
Abstract: The art and science of speech recognition have been advanced to the state where it is now possible to communicate reliably with a computer by speaking to it in a disciplined manner using a vocabulary of moderate size. It is the purpose of this paper to outline two aspects of speech-recognition research. First, we discuss word recognition as a classical pattern-recognition problem and show how some fundamental concepts of signal processing, information theory, and computer science can be combined to give us the capability of robust recognition of isolated words and simple connected word sequences. We then describe methods whereby these principles, augmented by modern theories of formal language and semantic analysis, can be used to study some of the more general problems in speech recognition. It is anticipated that these methods will ultimately lead to accurate mechanical recognition of fluent speech under certain controlled conditions.

246 citations


Journal ArticleDOI
TL;DR: The results of the experiments show that there is only a slight difference between the recognition accuracies for statistical features and dynamic features over the long term, and it is more efficient to use statistical features than dynamic features.
Abstract: This paper describes results of speaker recognition experiments using statistical features and dynamic features of speech spectra extracted from fixed Japanese word utterances. The speech wave is transformed into a set of time functions of log area ratios and a fundamental frequency. In the case of statistical features, a mean value and a standard deviation for each time function and a correlation matrix between these functions are calculated in the voiced portion of each word, and after a feature selection procedure, they are compared with reference features. In the case of dynamic features, the time functions are brought into time registration with reference functions. The results of the experiments show that there is only a slight difference between the recognition accuracies for statistical features and dynamic features over the long term. Since the amount of calculation necessary for recognition using statistical features is only about one-tenth of that for recognition using dynamic features, it is more efficient to use statistical features than dynamic features. When training utterances are recorded over ten months for each customer and spectral equalization is applied, 99.5 percent and 96.3 percent verification accuracies can be obtained for input utterances ten months and five years later, respectively, using statistical features extracted from two words. Combination of dynamic features with statistical features can reduce the error rate to half that obtained with either one alone.

131 citations
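The statistical-feature scheme described in the abstract above (a per-track mean and standard deviation plus an inter-track correlation matrix, compared against stored reference features) can be sketched as follows. This is a minimal illustration; the Euclidean comparison rule and all function names are assumptions, not the paper's exact procedure.

```python
import math

def stat_features(tracks):
    """Summarize a set of parameter time functions (e.g. log area
    ratios over the voiced portion of a word): per-track mean and
    standard deviation, plus the inter-track correlation matrix."""
    means = [sum(t) / len(t) for t in tracks]
    stds = [math.sqrt(sum((x - m) ** 2 for x in t) / len(t))
            for t, m in zip(tracks, means)]
    n = len(tracks[0])
    corr = [[sum((a - ma) * (b - mb) for a, b in zip(ta, tb)) / (n * sa * sb)
             for tb, mb, sb in zip(tracks, means, stds)]
            for ta, ma, sa in zip(tracks, means, stds)]
    return means, stds, corr

def feature_distance(f1, f2):
    """Euclidean distance between two flattened feature sets -- a
    stand-in for the paper's (unspecified) comparison rule."""
    v1 = f1[0] + f1[1] + [c for row in f1[2] for c in row]
    v2 = f2[0] + f2[1] + [c for row in f2[2] for c in row]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
```

The appeal noted in the abstract is visible here: once the summary statistics are computed, each utterance is a single fixed-length vector, so comparison is one distance evaluation rather than a full time alignment.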


Patent
27 Mar 1981
TL;DR: In this paper, a set of signals representative of the correspondence of the identified speaker's features with the feature templates of said reference words is generated, and an unknown speaker is analyzed and the reference word sequence of the utterance is identified.
Abstract: In a speaker recognition and verification arrangement, acoustic feature templates are stored for predetermined reference words. Each template is a standardized set of acoustic features for one word, formed for example by averaging the values of acoustic features from a plurality of speakers. Responsive to the utterances of identified speakers, a set of signals representative of the correspondence of the identified speaker's features with said feature templates of said reference words is generated. An utterance of an unknown speaker is analyzed and the reference word sequence of the utterance is identified. A set of signals representative of the correspondence of the unknown speaker's utterance features and the stored templates for the recognized words is generated. The unknown speaker is identified jointly responsive to the correspondence signals of the identified speakers and unknown speaker.

65 citations




Proceedings ArticleDOI
01 Apr 1981
TL;DR: Improvements in discriminability among similar words can be achieved by modifying the pattern similarity algorithm so that the recognition decision is made in two passes.
Abstract: One of the major drawbacks of the standard pattern recognition approach to isolated word recognition is that poor performance is generally achieved for word vocabularies with acoustically similar words. This poor performance is related to the pattern similarity (distance) algorithms that are generally used, in which a global distance between the test pattern and each reference pattern is computed. Since acoustically similar words are, by definition, globally similar, it is difficult to reliably discriminate such words, and a high error rate is obtained. By modifying the pattern similarity algorithm so that the recognition decision is made in two passes, improvements in discriminability among similar words can be achieved. In particular, on the first pass the recognizer provides a set of global distance scores which are used to decide a class (or a set of possible classes) in which the spoken word is estimated to belong. On the second pass a locally weighted distance is used to provide optimal separation among words in the chosen class (or classes) and the recognition decision is made on the basis of these local distance scores. For a highly complex vocabulary (letters of the alphabet, digits, and 3 command words) recognition improvements of 3 to 7 percent were obtained using the two-pass recognition strategy.

32 citations
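The two-pass decision strategy described above can be sketched as follows. The distance functions, the weights, and the class layout are illustrative placeholders, not the paper's exact algorithm.

```python
def two_pass_recognize(test, references, classes, local_weights):
    """Two-pass decision: pass 1 picks the class whose best member
    has the smallest global distance; pass 2 re-scores only that
    class with a locally weighted distance."""
    def global_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def weighted_dist(a, b, w):
        return sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b))

    # Pass 1: coarse class decision from global distance scores.
    best_class = min(classes, key=lambda c: min(
        global_dist(test, references[w]) for w in classes[c]))

    # Pass 2: fine decision inside the chosen class, using weights
    # that emphasize the locally discriminative regions.
    return min(classes[best_class], key=lambda w: weighted_dist(
        test, references[w], local_weights[best_class]))
```

For an acoustically confusable set such as the letters "b" and "d", the second-pass weights would concentrate on the initial consonant region, which is exactly where a uniform global distance dilutes the evidence.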


PatentDOI
TL;DR: In this paper, a similarity measure is calculated from comparing selected feature vectors among an input speech signal sequence of feature vectors (A) and a selected sequence (B) of reference vectors selected from a plurality of pre-stored reference sequences.
Abstract: Speaker recognition is decided by a similarity measure (D) calculated from comparing selected feature vectors among an input speech signal sequence of feature vectors (A) and a selected sequence (B) of reference vectors selected from a plurality of pre-stored reference sequences. Prior to comparison of the input and reference vector sequences, the two sequences are time normalized to align corresponding feature vectors. A significant sound specifying signal (V) including a time sequence of elementary signals is generated in synchronism with one of the input and reference sequences and indicates which feature vectors in that one of the input and reference sequences are considered to represent significant sound. The similarity measure (D) is then calculated in accordance with the comparison of those feature vectors in the one sequence which are indicated by the significant sound specifying signal as representing significant sound and the corresponding feature vectors of the other sequence.

29 citations
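A minimal sketch of the patent's masked similarity measure, assuming the two sequences have already been time normalized so that frame i of one corresponds to frame i of the other. The squared-Euclidean frame distance and the averaging are assumptions, not the patent's exact formula.

```python
def masked_similarity(input_seq, ref_seq, significant):
    """Similarity measure computed only over frames flagged by the
    significant sound specifying signal (V). Frames where V is 0
    (silence or non-significant sound) are excluded entirely."""
    total, count = 0.0, 0
    for a, b, v in zip(input_seq, ref_seq, significant):
        if v:  # only significant-sound frames contribute
            total += sum((x - y) ** 2 for x, y in zip(a, b))
            count += 1
    return total / count if count else float("inf")
```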


Journal ArticleDOI
TL;DR: Results indicate that fundamental-frequency mean, formant mean and formant bandwidth are the most important parameters, among those investigated, for speaker recognition, and although listeners differ in the average score recorded, they may be treated as reacting identically to changes in the factors.
Abstract: Many previous experimenters, by manipulating parameters in isolation, have examined the potentiality of these parameters as speaker-characterizing features, not their relative habitual importance for speaker recognition in everyday life. The two experiments reported here investigate this relative importance by the simultaneous manipulation of parameters. Using synthetic speech in a voice similarity judgment format, the first experiment employs eight factors in a restricted factorial design, and the second a subset of four of these factors in a full factorial design. Results indicate that (i) fundamental-frequency mean, formant mean and formant bandwidth are the most important parameters, among those investigated, for speaker recognition, and (ii) although listeners differ in the average score recorded, they may be treated as reacting identically to changes in the factors. Implications for perception theory are outlined.

27 citations


Proceedings ArticleDOI
Hermann Ney1
01 Apr 1981
TL;DR: An optimization technique for locating the initial and final points of utterances by means of dynamic programming and results are presented for end-point detection in a speaker recognition system using only the speech intensity as acoustic parameter.
Abstract: This paper describes an optimization technique for locating the initial and final points of utterances. Acoustic parameters extracted from each signal segment are converted into a cost function versus time. An overall cost for the presence of a speech signal is introduced and is to be optimized with respect to the unknown initial and final points. The optimization is carried out by means of dynamic programming. The computation grows linearly with the number of segments. In a second stage, the locations of the obtained endpoints are refined by matching transition templates against the input signal. Results are presented for end-point detection in a speaker recognition system using only the speech intensity as the acoustic parameter.

21 citations
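The linear-time endpoint optimization described above can be illustrated with a Kadane-style dynamic program over per-segment costs. The cost convention (speech-like segments carry negative cost, e.g. threshold minus intensity) is an assumption made for this sketch, not necessarily the paper's formulation.

```python
def find_endpoints(costs):
    """Locate the segment interval [s, e] whose summed cost is
    minimal, in one linear pass over the segments. With negative
    costs for speech-like segments, the optimal interval brackets
    the utterance."""
    best, best_span = float("inf"), (0, 0)
    run, start = 0.0, 0
    for i, c in enumerate(costs):
        if run > 0:          # a positive prefix never helps; restart
            run, start = 0.0, i
        run += c
        if run < best:
            best, best_span = run, (start, i)
    return best_span
```

The single pass matches the paper's claim that computation grows linearly with the number of segments; the second-stage template refinement is not sketched here.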


Journal ArticleDOI
TL;DR: The speaker-independent performance of word-based speech recognition systems is improved by rapidly and automatically deducing general characteristics of the current speaker and using them to derive speaker-normalizing transforms.
Abstract: This work is aimed at enhancing the speaker-independent performance of word-based speech recognition systems by rapidly and automatically deducing general characteristics of the current speaker and using them to derive speaker-normalizing transforms. DP matching is used to align and compare corresponding frames of the incoming speech and reference vocabulary. A single transform is then computed for all voiced speech and another for all unvoiced speech. The transform consists of a linear filtering component and, optionally, a constrained frequency shift. Experiments have been carried out with twenty male and female, native and non-native English speakers, each producing 150 digits. Adaptation on all 150 digits reduces recognition errors by a factor of three (4.5% to 1.5%). With adaptation on just three randomly selected digits, the reduction factor is two. Frequency shifting is useful only when the amount of adaptation material is large and the reference speech is not exclusively from the same sex as the cur...

16 citations


Proceedings ArticleDOI
01 Apr 1981
TL;DR: The paper describes isolated-word recognition experiments on a multi-speaker speech recognition system that uses Redundant Hash Addressing for fast comparison of the phonemic transcriptions with referent strings stored in a dictionary.
Abstract: The paper describes isolated-word recognition experiments on a multi-speaker speech recognition system. The system is organized in two main stages. At the phonemic recognition stage the phonemic transcription of the speech waveform is produced by simultaneous segmentation and labeling accomplished by the Learning Subspace Method. It directly produces an approximately correct number of phonemes. At the word recognition stage Redundant Hash Addressing is used for fast comparison of the phonemic transcriptions with referent strings stored in a dictionary. The average word recognition accuracy in a 200-word experiment with five speakers was about 95 per cent.

10 citations
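The dictionary lookup stage can be illustrated with a simplified n-gram voting index in the spirit of Redundant Hash Addressing: many redundant fragments of an (error-prone) phonemic transcription all address dictionary entries, and the entry collecting the most votes wins. This is a sketch, not the original data structure.

```python
def build_index(dictionary, n=2):
    """Map each phoneme n-gram to the set of words containing it.
    dictionary maps a word to its reference phoneme string."""
    index = {}
    for word, phonemes in dictionary.items():
        for i in range(len(phonemes) - n + 1):
            index.setdefault(phonemes[i:i + n], set()).add(word)
    return index

def lookup(index, transcription, n=2):
    """Score dictionary words by how many n-grams of the (possibly
    erroneous) transcription vote for them; return the top word."""
    votes = {}
    for i in range(len(transcription) - n + 1):
        for word in index.get(transcription[i:i + n], ()):
            votes[word] = votes.get(word, 0) + 1
    return max(votes, key=votes.get) if votes else None
```

Because matching is by fragment voting rather than exact string equality, a transcription with a wrong or missing phoneme can still retrieve the correct word, which is what makes this style of lookup tolerant of phonemic recognition errors.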


Proceedings ArticleDOI
01 Apr 1981
TL;DR: In this study it is hypothesized that distributions of template distance scores are reasonably consistent for individual speakers and vary characteristically from speaker to speaker.
Abstract: One method for providing speaker independent word recognition capability is to construct a small set of templates for each vocabulary word that typifies and spans individual speaker word reference templates over a large population of speakers. Word recognition decision functions are based on combinations of template distance scores obtained by processing an unknown input utterance and comparing it with the ensemble of reference templates. In this study it is hypothesized that distributions of template distance scores are reasonably consistent for individual speakers and vary characteristically from speaker to speaker. This property is exploited to provide a speaker recognition capability in combination with word recognition. It is shown that good speaker recognition performance depends on the input of a sequence of distinct words. For a 20-speaker population, on the average, the correct speaker is in the top 1% of the candidates in the identification made over a sequence of seven distinct words.
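The hypothesis above, that template distance-score distributions are consistent per speaker and vary from speaker to speaker, suggests a decision rule like the following sketch. The histogram representation and L1 comparison are illustrative choices, not the paper's exact method.

```python
def identify_speaker(score_histogram, speaker_profiles):
    """Pick the speaker whose stored distance-score distribution is
    closest (L1 distance between normalized histograms) to the one
    observed over the unknown word sequence."""
    def normalize(h):
        s = sum(h)
        return [x / s for x in h]

    obs = normalize(score_histogram)
    return min(speaker_profiles, key=lambda spk: sum(
        abs(a - b) for a, b in zip(obs, normalize(speaker_profiles[spk]))))
```

Accumulating the histogram over several distinct words, as the paper requires for good performance, makes the observed distribution a more stable estimate before it is compared against the stored profiles.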

Proceedings ArticleDOI
01 Apr 1981
TL;DR: A method to identify vowels in continuous speech of unspecified speakers is discussed and it is seen that rather simple algorithms in the articulatory domain are effective to deal with coarticulation effects in various contexts.
Abstract: An effective method to estimate articulatory movements has been developed and applied to speech recognition. Movements of articulatory parameters estimated from speech waves are useful features in speech recognition, especially for continuous speech. In this paper, a method to identify vowels in continuous speech of unspecified speakers is discussed. First, results for discrete utterances of the five Japanese vowels are shown. A normalization procedure to eliminate speaker differences is useful in vowel recognition. Second, results on vowel discrimination in continuous speech are reported. It is seen that rather simple algorithms in the articulatory domain are effective in dealing with coarticulation effects in various contexts.

Proceedings ArticleDOI
01 Apr 1981
TL;DR: The results indicate that the parameters comprising the optimal set chosen are speaker-dependent, and a technique using dynamic programming was used to select a subset of the k best features from the entire set of N.
Abstract: The main objective of this work was to investigate the effectiveness of long-term averages of the orthogonal linear prediction parameters in text-independent speaker recognition. To investigate the possibility of feature selection, a technique using dynamic programming (1) was used to select a subset of the k best features from the entire set of N. The results indicate that the parameters comprising the optimal set chosen are speaker-dependent. Verification accuracies of 96.5% were obtained using the selected optimal 8-parameter (out of 12) feature set for each speaker in a verification scheme, in which the reference parameters were generated from 100 seconds of time-spaced voiced speech and the test parameters were generated from 5 seconds of voiced speech.
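The feature-selection idea can be illustrated with an exhaustive search over k-subsets scored by a per-feature F-ratio (between-speaker variance over within-speaker variance). The paper uses a dynamic-programming procedure and its own criterion, so everything here is an assumed stand-in for illustration.

```python
from itertools import combinations

def f_ratio(values_per_speaker):
    """Between-speaker variance over mean within-speaker variance
    for one feature -- a common speaker-discrimination criterion."""
    means = [sum(v) / len(v) for v in values_per_speaker]
    grand = sum(means) / len(means)
    between = sum((m - grand) ** 2 for m in means) / len(means)
    within = sum(sum((x - m) ** 2 for x in v) / len(v)
                 for v, m in zip(values_per_speaker, means)) / len(means)
    return between / within if within else float("inf")

def select_features(data, k):
    """Pick the k features with the best combined criterion by
    exhaustive search. data[f][s] lists feature f's values for
    speaker s."""
    return max(combinations(range(len(data)), k),
               key=lambda subset: sum(f_ratio(data[f]) for f in subset))
```

With an additive criterion like this, exhaustive search reduces to taking the top-k features; dynamic programming becomes worthwhile when the criterion couples features, as in the paper's setting.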


Proceedings ArticleDOI
01 Apr 1981
TL;DR: The results of this investigation clearly show that Markel's technique is superior for applications using very short speech segments for both the speaker models and the recognition trials.
Abstract: This paper describes the design and implementation of a realtime speaker recognition system. The system performs text independent, closed set speaker recognition with up to 30 talkers in realtime. In addition, the reference speech used to characterize the 30 talkers can be extracted from as little as 10 seconds of speech from each talker, and the actual recognition performed with less than one minute of speech from the unknown talker. Two speaker recognition algorithms previously developed by Markel and Pfeifer were investigated for use in the realtime system. The results of this investigation clearly show that Markel's technique is superior for applications using very short speech segments for both the speaker models and the recognition trials. Markel's technique was implemented in realtime in a high speed programmable signal processor. A test of this implementation with a set of 30 male speakers resulted in recognition accuracies of 93-100% for models generated with only 10 seconds of speech, and recognition trials using only 10 seconds of unknown speech.

Proceedings ArticleDOI
01 Apr 1981
TL;DR: Structure and performance in an acoustic processor in the conversational speech recognition system, called Voice Q-A System II, are described, together with a comparison with the old system.
Abstract: Structure and performance of an acoustic processor in the conversational speech recognition system, called Voice Q-A System II, are described, together with a comparison with the old system. The acoustic processor adopts the LPC peak-weighted spectral matching measure. The acoustic processor is much improved on phoneme segmentation and vowel recognition. The task is a train seat reservation service, which contains 112 words. Input is conversational speech with a short pause between adjacent phrases. The recognition test was made on the acoustic processor, which is connected with the linguistic processor. Results show a 96.9% phrase recognition rate and a 66.8% phoneme recognition rate on the average for nine male speakers.

Journal ArticleDOI
TL;DR: The authors presented an algorithm which chooses a reference template for each word in the vocabulary from a set of N exemplars, which minimizes the worst matching behavior and total error over the N sets of exemplars.
Abstract: Presented here, for a speaker-dependent system, is an algorithm which chooses a reference template for each word in the vocabulary from a set of N exemplars. The goal of the algorithm is to produce a reference set that minimizes the worst matching behavior and total error over the N sets of exemplars. The results of the experiments presented here show a reduction in the average error rate from 16.4% to 10.2% over a set of 4 male speakers and 4 female speakers.
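A minimal minimax reading of the template-selection goal above: pick the exemplar whose worst distance to its peers is smallest, breaking ties by total distance. The distance function is supplied by the caller, and the tie-breaking rule is an assumption, not the paper's published algorithm.

```python
def choose_template(exemplars, dist):
    """From N exemplars of one word, pick the one with the smallest
    worst-case distance to the remaining exemplars, breaking ties
    by the smallest total distance."""
    def score(i):
        others = [dist(exemplars[i], e)
                  for j, e in enumerate(exemplars) if j != i]
        return (max(others), sum(others))

    return exemplars[min(range(len(exemplars)), key=score)]
```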

Proceedings ArticleDOI
Hermann Ney1
01 Apr 1981
TL;DR: A speaker recognition system is investigated which operates on telephone speech and performs speech analysis by means of the clipped autocorrelation function, and the time warping method based on dynamic programming is used to bring sample utterances into time registration with reference utterances.
Abstract: A speaker recognition system is investigated which operates on telephone speech and performs speech analysis by means of the clipped autocorrelation function. The advantages of the clipped autocorrelation function are its simple computation and its reduced dynamic variability as compared to the standard autocorrelation function. Utterances are represented by time contours of clipped autocorrelation coefficients. The time warping method based on dynamic programming is used to bring sample utterances into time registration with reference utterances. Different methods of preprocessing the time contours are studied with respect to speaker discrimination. For cooperative speakers, verification error rates of 3% and less than 2% were obtained using speaker independent and speaker individual thresholds, respectively.
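The clipped autocorrelation analysis can be sketched as follows: hard-clip each sample to -1/0/+1, then correlate, which bounds the dynamic range and reduces the arithmetic to counting, matching the advantages the abstract cites. The normalization and threshold handling are assumptions made for this sketch.

```python
def clipped_autocorr(signal, max_lag, clip=0.0):
    """Autocorrelation of the hard-clipped signal. Each sample is
    reduced to -1, 0, or +1 (center clipping at +/- clip), so the
    products are trivial and the coefficients are bounded. The
    result is normalized by the zero-lag value."""
    s = [1 if x > clip else -1 if x < -clip else 0 for x in signal]
    n = len(s)
    r = [sum(s[i] * s[i + k] for i in range(n - k))
         for k in range(max_lag + 1)]
    r0 = r[0] or 1  # guard against an all-zero clipped signal
    return [rk / r0 for rk in r]
```

In the paper's system, utterances are represented by time contours of such coefficients, which are then aligned against reference contours by dynamic-programming time warping.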

Book ChapterDOI
01 Jan 1981
TL;DR: In this paper, the authors describe a programme of research, funded by the Home Office, into the voice identification performance of human listeners in criminal contexts such as obscene telephone calls and kidnap cases.
Abstract: Most criminal identifications are made using visual cues, but there are some instances when both visual and verbal information is available, and others when only verbal cues exist. In obscene telephone calls, for example, often the only possible method of identifying the speaker is by his voice. Similarly, in kidnap cases information concerning the kidnapper’s voice is sometimes made available by him over the telephone. In such situations, when a suspect is in the hands of the police a voice-matching exercise may be undertaken. Here a witness may be asked whether the suspect’s voice resembles the criminal’s voice, or a machine may be employed in an attempt to answer these questions if a record (e.g. a tape) of the original criminal voice is available. In the Psychology Department of the North East London Polytechnic, Brian Clifford and I have recently begun a programme of research (funded by the Home Office) concerning the voice identification performance of human listeners in this kind of context.

01 Jan 1981
TL;DR: The results of the Voice Command System (VCS) flight experiment on the five-day STS-41 mission were presented in this paper, where two mission specialists, Bill Shepherd and Bruce Melnick, used the speaker-dependent system to evaluate the operational effectiveness of using voice to control a spacecraft system.
Abstract: This report presents the results of the Voice Command System (VCS) flight experiment on the five-day STS-41 mission. Two mission specialists, Bill Shepherd and Bruce Melnick, used the speaker-dependent system to evaluate the operational effectiveness of using voice to control a spacecraft system. In addition, data was gathered to analyze the effects of microgravity on speech recognition performance.

Journal ArticleDOI
TL;DR: In this paper, each word to be recognized was modeled as a sequence of segments with each segment being variable length and having a uniform spectrum with added noise, and the model was then used to produce synthetic tokens for testing the recognizer.
Abstract: The performance of automatic speech recognition systems is commonly measured by using large quantities of natural speech as a benchmark. For recognizers which accept large vocabularies, or when many alternative vocabularies are of interest, this method requires an unreasonably large natural speech corpus. Synthetic speech which incorporates variations in pronunciation is an alternative in these cases. This approach was used to evaluate the performance of a speaker trained, isolated word speech recognition system [H. Murveit, M. Lowy, and R. W. Brodersen, J. Acoust. Soc. Am. Suppl. 1 69, S8 (1981)]. Each word to be recognized was modeled as a sequence of segments with each segment being variable length and having a uniform spectrum with added noise. A few tokens of natural speech were used to evaluate the parameters of the model. The model was then used to produce synthetic tokens for testing the recognizer. Good correlation was obtained between the confusion matrices for synthetic and natural speech when recognized. [Work supported in part by DARPA.]

Proceedings ArticleDOI
01 Apr 1981
TL;DR: In the absence of perfect speaker normalization techniques, speaker-independent recognition using spectral pattern matching is improved significantly by using multiple lexical patterns for each word.
Abstract: In the absence of perfect speaker normalization techniques, speaker-independent recognition using spectral pattern matching is improved significantly by using multiple lexical patterns for each word.


23 Jul 1981
TL;DR: An algorithm for compressing the spectral representation of an utterance along the time axis while keeping the main features intact is described to save template storage space and to reduce the time required for recognition.
Abstract: This paper describes an algorithm for compressing the spectral representation of an utterance along the time axis while keeping the main features intact. The goal of the algorithm is to save template storage space and to reduce the time required for recognition. For 8 speakers, 5 data sets each, the results indicated that we can save about 40% of the template space and 35% of the recognition time with only a slightly higher error rate.
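The time-axis compression idea, collapsing runs of spectrally similar adjacent frames while preserving transitions, can be sketched as follows. The squared-Euclidean frame distance and the fixed threshold are illustrative choices, not the report's algorithm.

```python
def compress_frames(frames, threshold):
    """Collapse each run of similar adjacent spectral frames into
    its average, shrinking the template along the time axis while
    keeping the spectral transitions intact."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    out, run = [], [frames[0]]
    for f in frames[1:]:
        if d2(f, run[-1]) < threshold:
            run.append(f)          # still inside a steady region
        else:
            out.append([sum(c) / len(run) for c in zip(*run)])
            run = [f]
    out.append([sum(c) / len(run) for c in zip(*run)])
    return out
```

Steady vowel regions, which contribute many nearly identical frames, compress heavily, while rapid transitions survive, consistent with the roughly 40% template-space saving the report quotes.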