
Showing papers on "Speaker recognition published in 1982"


Book ChapterDOI
TL;DR: This chapter focuses on Continuous Speech Recognition (CSR) and summarizes acoustic processing techniques and describes an elegant linguistic decoder based on dynamic programming that is practical under certain conditions.
Abstract: Publisher Summary Speech recognition research can be distinguished into three areas: isolated word recognition, where words are separated by distinct pauses; continuous speech recognition, where sentences are produced continuously in a natural manner; and speech understanding, where the aim is not transcription but understanding, in the sense that the system responds correctly to a spoken instruction or request. This chapter focuses on Continuous Speech Recognition (CSR) and summarizes acoustic processing techniques. The chapter introduces Markov models of speech processes and describes an elegant linguistic decoder based on dynamic programming that is practical under certain conditions. It discusses the practical aspects of the sentence hypothesis search conducted by the linguistic decoder and introduces algorithms for extracting model parameter values automatically from the data. Methods of assessing the performance of CSR systems and the relative difficulty of recognition tasks are discussed. The chapter illustrates the capabilities of present recognition systems by describing the results of certain recognition experiments.

66 citations


Journal ArticleDOI
TL;DR: In this article, a set of studies explored the nature of the 7-month-old infant's perception of human voices and found that infants learned to respond discriminatively to groups of male vs. female voices.
Abstract: The present set of studies explored the nature of the 7-month-old infant's perception of human voices. In Experiment I, infants learned to respond discriminatively to groups of male vs. female voices. That this was evidence of male/female categorization was supported in Experiment II, in which it was shown that infants did not learn to respond discriminatively to the same voices when they were randomly organized into “categories” containing both male and female voices. The extent to which fundamental frequency may have contributed to this male/female classification was investigated in Experiment III. The combined results of these three studies suggested that, although pitch is possibly one cue to which infants are attending when classifying these voices, it could not account fully for this ability. It remains for future research to identify other cues which may contribute to male/female categorization, as well as to investigate the developmental course of speaker recognition and classification in general.

45 citations


PatentDOI
TL;DR: In this article, a method and apparatus for recognizing an unknown speaker from a plurality of speaker candidates is presented, where portions of speech from the speaker candidates and from the unknown speaker are sampled and digitized.
Abstract: A method and apparatus for recognizing an unknown speaker from a plurality of speaker candidates. Portions of speech from the speaker candidates and from the unknown speaker are sampled and digitized. The digitized samples are converted into frames of speech, each frame representing a point in an LPC-12 multi-dimensional speech space. Using a character covering algorithm, a set of frames of speech, called characters, is selected from the frames of speech of all speaker candidates. The speaker candidates' portions of speech are divided into smaller portions called segments. A smaller plurality of model characters for each speaker candidate is selected from the character set. For each set of model characters, the distance from each speaker candidate's frame of speech to the closest character in the model set is determined and stored in a model histogram. When a model histogram is completed for a segment, a distance D is found such that at least a majority of frames have distances greater than D. The mean value and variance of D across all segments, for both speaker and impostor, are then calculated. These values are added to the set of model characters to form the speaker model. To perform recognition, the frames of the unknown speaker are buffered as they are received and compared with the sets of model characters to form model histograms for each speaker. A likelihood ratio is formed, and the speaker candidate with the highest likelihood ratio is chosen as the unknown speaker.
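As a rough illustration of the histogram step above, the sketch below computes nearest-character distances for one segment and a distance D below which only a minority of frames fall. The function names and the 0.5 fraction are illustrative, not from the patent, and the character-covering, variance, and likelihood-ratio stages are omitted.

```python
import math

def min_distances(frames, characters):
    """Distance from each speech frame to the closest model character."""
    return [min(math.dist(f, c) for c in characters) for f in frames]

def segment_distance_d(frames, characters, fraction=0.5):
    """A distance D such that at least `fraction` of the frames have
    nearest-character distances greater than D (a low quantile of the
    sorted distance list)."""
    d = sorted(min_distances(frames, characters))
    k = max(0, math.ceil(len(d) * (1.0 - fraction)) - 1)
    return d[k]
```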

44 citations


Proceedings ArticleDOI
01 May 1982
TL;DR: This paper develops the use of probability density function (pdf) estimation for text-independent speaker identification and compares the performance of two parametric and one non-parametric pdf estimation methods to one distance classification method that uses the Mahalanobis distance.
Abstract: Most text-independent speaker identification methods to date depend on the use of some distance metric for classification. In this paper we develop the use of probability density function (pdf) estimation for text-independent speaker identification. We compare the performance of two parametric and one non-parametric pdf estimation methods to one distance classification method that uses the Mahalanobis distance. Under all conditions tested, the pdf estimation methods performed substantially better than the Mahalanobis distance method. The best method is a non-parametric pdf estimation method.
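The paper's estimators and data are not reproduced here; a minimal one-dimensional sketch of the two competing ideas (Mahalanobis-style distance classification versus non-parametric Parzen-window density estimation) might look like this, with all names illustrative:

```python
import math

def mahalanobis_1d(x, mean, var):
    """One-dimensional Mahalanobis distance to a class model."""
    return abs(x - mean) / math.sqrt(var)

def parzen_pdf(x, samples, h=0.5):
    """Non-parametric Parzen-window (Gaussian kernel) density estimate."""
    n = len(samples)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) \
        / (n * h * math.sqrt(2 * math.pi))

def classify_distance(x, models):
    # models: {speaker: (mean, var)} -- pick the smallest distance
    return min(models, key=lambda s: mahalanobis_1d(x, *models[s]))

def classify_pdf(x, training):
    # training: {speaker: [samples]} -- pick the highest estimated density
    return max(training, key=lambda s: parzen_pdf(x, training[s]))
```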

35 citations


Journal ArticleDOI
TL;DR: A method for speaker independent connected word recognition is described, based on a syntax-directed dynamic programming algorithm which matches the isolated word templates to sentence length utterances of a 100 speaker population.
Abstract: A method for speaker independent connected word recognition is described. Speaker independence is achieved by clustering isolated word utterances of a 100 speaker population. Connected word recognition is based on a syntax-directed dynamic programming algorithm which matches the isolated word templates to sentence length utterances. The method has been tested on an artificial task-oriented language based on a 127 word vocabulary. Four subjects, two men and two women, spoke a total of 209 sentences comprising 1750 words. At an average speaking rate of 171 words/min over dialed-up telephone lines, a correct word recognition rate of 97 percent was observed.
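The syntax-directed decoder itself is more involved, but the dynamic-programming template match at its core can be sketched as plain DTW over scalar feature sequences (a simplification: the actual system matches multi-dimensional frames under syntactic constraints):

```python
def dtw(template, utterance, dist=lambda a, b: abs(a - b)):
    """Dynamic-programming time alignment: minimum cumulative frame
    distance over all monotonic warping paths between two sequences."""
    INF = float("inf")
    n, m = len(template), len(utterance)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(template[i - 1], utterance[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

A stretched utterance that repeats a frame still aligns to the template at zero cost, which is exactly the time-normalization property the matcher relies on.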

27 citations


Proceedings ArticleDOI
01 May 1982
TL;DR: This paper examines several template-based recognition techniques using isolated utterances and highly ambiguous vocabularies in a speaker-dependent recognition system and concludes that a system which combined both featural and template information led to the best performance for six out of eight speakers.
Abstract: Template-based recognition systems overcome errors in the short-term matching process by comparing whole sequences of acoustic events. In many vocabularies, each word has a highly distinctive sequence. Some vocabularies have confusable words with very similar sequences, leading to poor recognition performance. Improvements in discriminability among similar words may be achieved by altering the matching algorithm, or by improving the reference template set. Both techniques are instances of multi-exemplar learning techniques which improve recognition performance through automatic evaluation of training data. This paper examines several such techniques using isolated utterances and highly ambiguous vocabularies (e.g., the "E" set: 3, B, C, D, E, G, P, V, T, Z) in a speaker-dependent recognition system. A system which combined both featural and template information led to the best performance for six out of eight speakers. Using this technique, E-set error rates improved from 37% to 10%.

18 citations


PatentDOI
TL;DR: In a speech recognition system, similarity calculations between speech feature patterns are reduced by stopping similarity calculations for any one reference pattern when a frame in the pattern fails to exceed a corresponding similarity threshold as discussed by the authors.
Abstract: In a speech recognition system, similarity calculations between speech feature patterns are reduced by stopping similarity calculations for any one reference pattern when a frame in the pattern fails to exceed a corresponding similarity threshold.
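A toy sketch of this early-abandon idea follows; the names and the similarity function are illustrative, not from the patent:

```python
def similarity_with_pruning(input_frames, reference, thresholds, sim):
    """Accumulate frame similarities, abandoning the reference pattern as
    soon as one frame fails its threshold (returns None when pruned)."""
    total = 0.0
    for frame, ref, th in zip(input_frames, reference, thresholds):
        s = sim(frame, ref)
        if s <= th:
            return None  # stop early: skip the remaining frames
        total += s
    return total
```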

17 citations


Proceedings ArticleDOI
01 May 1982
TL;DR: A new, fast method for discrete utterance recognition of telephone bandwidth speech that obviates time normalization and uses approximately 6000 bits to represent each utterance in the recognition vocabulary is presented.
Abstract: We present a new, fast method for discrete utterance recognition of telephone bandwidth speech. The method is based on speech coding by vector quantization and minimum cross-entropy pattern classification. Separate vector quantization codebooks are designed from training sequences for each word in the recognition vocabulary. Inputs from outside the training sequence are classified by performing vector quantization and finding the codebook that achieves the lowest average distortion per speech frame. The new method obviates time normalization and uses approximately 6000 bits to represent each utterance in the recognition vocabulary. Preliminary limited testing on speaker dependent digit recognition has demonstrated excellent performance. Detailed tests are now in progress.
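The classification rule lends itself to a compact sketch; codebook design from the training sequences (e.g., by a clustering algorithm) is omitted, and the names are illustrative:

```python
import math

def avg_distortion(frames, codebook):
    """Mean distance from each frame to its nearest codeword."""
    return sum(min(math.dist(f, c) for c in codebook) for f in frames) \
        / len(frames)

def classify(frames, codebooks):
    """Pick the word whose codebook quantizes the utterance with the
    lowest average distortion per frame -- no time normalization needed."""
    return min(codebooks, key=lambda w: avg_distortion(frames, codebooks[w]))
```

Because the score is an average over frames, utterance length drops out of the comparison, which is why the method needs no time alignment.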

16 citations


Proceedings ArticleDOI
01 May 1982
TL;DR: A general linear matrix transformation, restricted to a band about the main diagonal, adapts a new speaker's spectra to a reference speaker; since phoneme class-specific adaptation is costly, a grouping of phonemes is proposed so that one adaptation parameter set is used for all phonemes that belong to any one group.
Abstract: Speaker dependence of automatic speech recognition systems can be reduced by applying speaker-specific transformations to adapt the speech signal of a new speaker to that of the reference speaker. Initial investigations showed that speaker adaptation can be performed by transformations using spectral weighting and spectral warping. These heuristic methods can be substituted by a general linear matrix transformation, the parameters of which are determined by mean square error optimisation. The improvement of the recognition rate achievable by this matrix transformation is very high, but the method needs a large learning set. This can be reduced by restricting the matrix to a band including the main diagonal in the middle. This banded matrix yields results close to those of the general matrix. Adaptation can be performed speaker-specifically as well as speaker- and class-specifically. As the cost of phoneme class-specific adaptation is very high, a grouping of phonemes is proposed so that one adaptation parameter set is used for all phonemes that belong to any one group.
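As an illustration of fitting such a transform by mean-square-error optimisation, the sketch below handles only the extreme bandwidth-zero (purely diagonal) case, with one independently fitted weight per spectral channel; the general banded solve is omitted and the names are illustrative:

```python
def fit_diagonal_transform(new_spectra, ref_spectra):
    """Least-squares per-channel weights mapping the new speaker's
    spectra onto the reference speaker's (diagonal special case of the
    banded matrix transformation)."""
    dims = len(new_spectra[0])
    w = []
    for k in range(dims):
        num = sum(x[k] * y[k] for x, y in zip(new_spectra, ref_spectra))
        den = sum(x[k] * x[k] for x in new_spectra)
        w.append(num / den if den else 0.0)
    return w

def adapt(frame, w):
    """Apply the fitted weights to one spectral frame."""
    return [wk * xk for wk, xk in zip(w, frame)]
```

Widening the band lets each output channel draw on its spectral neighbours, at the cost of more parameters and hence more learning data, which is the trade-off the abstract describes.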

15 citations


Journal ArticleDOI
TL;DR: A new approach to text‐independent speaker recognition, developed to perform with short unknown utterances, models the spectral traits of a speaker with multiple sub‐models rather than using a single statistical distribution as done with previous approaches.
Abstract: This paper presents a new approach to text‐independent speaker recognition. The technique, developed to perform with short unknown utterances, models the spectral traits of a speaker with multiple sub‐models rather than using a single statistical distribution as done with previous approaches. The recognition is based on the statistical distribution of the distances between the unknown speaker and each of the speaker models. Only frames that are close to one of the speaker's sub‐models are considered in the recognition decision, so that speech events not encountered in the training data do not bias the recognition. The technique has been tested on a conversational data base. Models were generated using 100 s of speech from each of 11 male talkers. Unknown speech was obtained one week after the model data. Recognition accuracies of 96%, 87%, and 79% were obtained for unknown speech durations of 10, 5, and 3 s, respectively. The use of multiple sub‐models to characterize spectral traits results in improved discrimination between speakers, particularly when short speech segments are recognized. [Work supported by U. S. Air Force, Rome Air Development Center.]
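A sketch of the frame-filtering idea above: sub-model training (e.g., by clustering) and the distance-distribution statistics are omitted, and `accept_radius` is an illustrative name, not the paper's.

```python
import math

def nearest_submodel_distance(frame, submodels):
    """Distance from a frame to the closest of a speaker's sub-models."""
    return min(math.dist(frame, m) for m in submodels)

def score_speaker(frames, submodels, accept_radius):
    """Average nearest-sub-model distance over only those frames that fall
    within accept_radius of some sub-model; frames far from every
    sub-model (speech events unseen in training) are ignored."""
    d = [nearest_submodel_distance(f, submodels) for f in frames]
    kept = [x for x in d if x <= accept_radius]
    return sum(kept) / len(kept) if kept else float("inf")
```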

12 citations


Journal ArticleDOI
TL;DR: The development of a high accuracy (about 99%) text-independent speaker recognition system is discussed in this paper. Any two parameter sets from the first-stage tests are combined logically to obtain a significantly higher recognition accuracy than is possible with any single speaker-sensitive parameter set.


Proceedings ArticleDOI
Akio Komatsu1, Akira Ichikawa1, Kazuo Nakata1, Yoshiaki Asakawa1, H. Matsuzaka1 
03 May 1982
TL;DR: An algorithm for phoneme recognition in continuous speech is presented; a continuous matching process is employed to bypass the segmentation problem, and a hierarchical recognition algorithm is proposed to realize feasible matching in real time.
Abstract: An algorithm for phoneme recognition in continuous speech is presented. A continuous matching process is employed to bypass the segmentation problem. A large set of standard patterns is used to solve the allophonic variation problem. Also, a hierarchical recognition algorithm is proposed to realize feasible matching in real time. In the first stage of the hierarchical recognition algorithm, vowels in speech are spotted. To optimize accuracy in vowel spotting, each standard pattern is carefully selected, constraints on the "phoneme chain" of continuous speech are utilized, and partial standard pattern matching is employed for detailed phoneme analysis. The second stage recognizes consonants between vowels. Experimental results show a 91% vowel recognition rate and an 80% consonant recognition rate for a specified speaker.

Proceedings ArticleDOI
T. Nitta1, T. Murata, Harumi Tsuboi, Koichi Takeda, T. Kawada, S. Watanabe 
01 May 1982
TL;DR: A newly developed voice-activated word processor and a two-stage recognition method to achieve a precise recognition of isolated monosyllables are described.
Abstract: This paper describes a newly developed voice-activated word processor and a two-stage recognition method to achieve a precise recognition of isolated monosyllables. At the first stage, the recognizer segments a monosyllable into an initial consonantal part and a final part (i.e., the vowel region), and computes similarities between the input speech and orthonormal mode functions of each consonantal segment, which are designed from multiple speakers using K-L expansion and adapted to a new speaker (Adaptive Multiple Similarity Method). At the second stage, frame-by-frame similarity scores, extracted at the phoneme recognizer using the Multiple Similarity Method, are applied to candidate monosyllables to make a final decision. The average monosyllable recognition accuracy with six speakers was about 95%.

Proceedings ArticleDOI
Hermann Ney1, R. Gierloff
01 May 1982
TL;DR: The experiments indicate that feature weighting and feature selection can reduce the error rates by a factor of two or more both for speaker identification and speaker verification.
Abstract: This paper describes a technique for increasing the ability of a text-dependent speaker recognition system to discriminate between speaker classes; this technique is to be performed in conjunction with the nonlinear time alignment between a reference pattern and a test pattern. Unlike the standard approach, where the training of the recognition system merely consists of storing and averaging or selecting the time normalized training patterns separately for each class, the training phase of the system is extended in that a weight is determined for each individual feature component of the complete reference pattern according to the ability of the feature to distinguish between speaker classes. The weights depend on the time axis as well as on the frequency axis. The overall distance computed after nonlinear time alignment between a reference pattern and a test pattern thus becomes a function of the given set of weights of the reference class considered. For each class, the optimum weights result from the ideal criterion of minimum error rate. Instead of this criterion, the closely related but mathematically more convenient Fisher criterion is used, which leads to a closed-form solution for the unknown weights. Based on these weights, the selection of subsets of effective features is studied in order to further improve the class discrimination. The feature weighting and selection techniques are tested using a data base of utterances recorded off dialed-up telephone lines. The experiments indicate that feature weighting and feature selection can reduce the error rates by a factor of two or more, both for speaker identification and speaker verification.
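The per-feature Fisher criterion can be sketched as the ratio of between-class scatter of a feature to its pooled within-class scatter; this is a simplified version that ignores the time-axis dependence of the paper's weights, and the names are illustrative:

```python
def fisher_weights(class_samples):
    """Per-feature Fisher ratio: spread of the class means over the pooled
    within-class variance. class_samples: {label: [feature vectors]}."""
    dims = len(next(iter(class_samples.values()))[0])
    all_vecs = [v for vs in class_samples.values() for v in vs]
    grand = [sum(v[k] for v in all_vecs) / len(all_vecs) for k in range(dims)]
    weights = []
    for k in range(dims):
        between = within = 0.0
        for vs in class_samples.values():
            mu = sum(v[k] for v in vs) / len(vs)
            between += len(vs) * (mu - grand[k]) ** 2
            within += sum((v[k] - mu) ** 2 for v in vs)
        weights.append(between / within if within else 0.0)
    return weights
```

A feature whose class means are far apart relative to its within-class spread gets a large weight, which is exactly the discriminability the weighted distance is meant to emphasize.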

Journal ArticleDOI
TL;DR: A low cost speaker-dependent speech recognition unit using the Walsh-Hadamard transform (WHT) is described; a WHT LSI has been developed to reduce the cost and size of the recognition unit, and a high rate of recognition has been obtained.
Abstract: Speech recognition systems are coming to a practical stage thanks to the recent progress of semiconductor technology. We have developed a low cost speaker-dependent speech recognition unit using the Walsh-Hadamard transform (WHT). A WHT LSI has been developed to reduce the cost and the space of the recognition unit, and a high rate of recognition has been obtained. The speech recognition algorithm and the LSI are described in this paper.
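The transform itself, which the LSI implements in hardware, is easy to sketch in software: a standard in-place fast Walsh-Hadamard transform over a power-of-two-length frame (natural Hadamard ordering; this is the generic algorithm, not the chip's exact circuit):

```python
def fwht(x):
    """Fast Walsh-Hadamard transform; the input length must be a power
    of two. Returns the transformed sequence (unnormalized)."""
    a = list(x)
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                # butterfly: sum and difference of paired elements
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a
```

The transform is its own inverse up to a factor of the length, and it needs only additions and subtractions, which is what makes it attractive for a low-cost LSI.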


Book ChapterDOI
Patrick Corsi1
01 Jan 1982
TL;DR: This paper presents a unified discussion of the scientific and practical issues in the field of speaker recognition, and distinguishes between the Verification and Identification tasks.
Abstract: This paper presents a unified discussion of the scientific and practical issues in the field of speaker recognition. Besides some background on speaker recognition by listening and by visual analysis of spectrograms, we survey the computer recognition methods and briefly discuss some technical aspects of various speaker recognizers. Methods for selecting an efficient set of features, and examples of results of experimental studies, are also presented. We then differentiate between the Verification and Identification tasks.

Book ChapterDOI
Stephen E. Levinson1
01 Jan 1982
TL;DR: A method for speaker independent connected word recognition is described, based on a syntax-directed dynamic programming algorithm which matches the isolated word templates to sentence length utterances of a 100 speaker population.
Abstract: A method for speaker independent connected word recognition is described. Speaker independence is achieved by clustering isolated word utterances of a 100 speaker population. Connected word recognition is based on a syntax-directed dynamic programming algorithm which matches the isolated word templates to sentence length utterances. The method has been tested on a task oriented English-like language based on a 127 word vocabulary. Four subjects, two men and two women, spoke a total of 209 sentences comprising 1750 words. At an average speaking rate of 171 words per minute over dialed-up telephone lines, a correct word recognition rate of 97% was observed.


Journal ArticleDOI
01 Mar 1982
TL;DR: Reports on major areas of technological development in the field of automatic speech recognition and identifies its major characteristics, device construction, and applications for speech processing.
Abstract: Reports on major areas of technological development in the field of automatic speech recognition. Identifies its major characteristics, device construction, and applications for speech processing.

Proceedings ArticleDOI
01 May 1982
TL;DR: This work addresses the development of a reliable, high accuracy text-independent speaker recognition system for a small population, with the reference parameters characterizing each speaker obtained from short segments of speech.
Abstract: This work addresses the development of a reliable, high accuracy text-independent speaker recognition system for a small population, with the reference parameters characterizing each speaker obtained from short segments of speech. Initially the potential for speaker discrimination of several different vocal parameter sets was investigated. These included the LPC, Reflection, Cepstrum and Log Area Ratio coefficients, speech power spectrum parameters and the inverse filter spectral coefficients. It was then decided to use any two parameter sets in a composite decision-making scheme. A "repeat feature" was incorporated into the speaker recognition system, whereby a speaker was asked to read a fresh test speech segment if the decisions made by using the two different parameter sets individually were not coincident. Test results indicate that a significant improvement in accuracy is realizable.
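The "repeat feature" reduces to a simple agreement loop; the classifier functions and segment stream here are illustrative stand-ins for the two parameter-set decisions described above:

```python
def composite_decision(decide_a, decide_b, segments):
    """Run both classifiers on successive fresh test segments until their
    decisions coincide (the "repeat feature"); if the segments run out,
    fall back to the first classifier's last answer."""
    a = None
    for seg in segments:
        a, b = decide_a(seg), decide_b(seg)
        if a == b:
            return a
    return a
```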


01 Dec 1982
TL;DR: An experiment to determine the possibilities of obtaining some speaker independence using speaker dependent voice recognition equipment revealed about 99% accuracy when the user's speech templates were in memory along with those of four other users.
Abstract: This report discusses the results of an experiment to determine the possibilities of obtaining some speaker independence using speaker-dependent voice recognition equipment. The results revealed about 99% accuracy when the user's speech templates were in memory along with those of four other users. If the user's voice patterns were not in memory but those of the four other users still were, recognition accuracy still hovered around 95%.

Proceedings ArticleDOI
03 May 1982
TL;DR: The final goal of this work is to provide the deaf person with additional information (or "keys") which disambiguates the labial image.
Abstract: Lip-reading is widely used by profoundly deaf individuals for the reception of spoken language. This is a very difficult task because the labial image is ambiguous. The final goal of this work is to provide the deaf person with additional information (or "keys") which disambiguates the labial image. Phoneme recognition in continuous speech is used to produce the keys. To allow complete freedom in running speech, and in order to provide keys synchronously with speech production, no lexical, syntactic, or semantic information is used. Algorithms are adapted to a given speaker through a learning phase in which prototypes are built for the phonetic units to be recognized. The recognition algorithms are a combination of segmentation and centisecond labeling. The keys system is optimized taking into account the confusions made by the recognition programs. Recognition scores for multiple speakers are given both at the phonetic level and at the keys level.

Proceedings ArticleDOI
Guy Mercier1, A. Callec, J. Monne, M. Querre, O. Trevarain 
01 May 1982
TL;DR: The acoustic-phonetic recognizer which performs the early stages of analysis in the KEAL system is described, which is to transform the continuous speech signal representing the uttered sentence into a string of lower units.
Abstract: This paper describes the acoustic-phonetic recognizer which performs the early stages of analysis in the KEAL system. The objective of this module is to transform the continuous speech signal representing the uttered sentence into a string of lower-level units. Four main linguistic units have been considered: phones, phonemes, syllables and words. The KEAL acoustic-phonetic recognizer consists of components for carrying out three main tasks: acoustic analysis, labelling and training. A syllabic segmentation accuracy of 95%, an average phonemic recognition rate of 61% and a word recognition accuracy of 93% are obtained using 26 phonemic classes, isolated words (digits and operators: +, -, *, ...) and continuous speech from different speakers. Preliminary results on number recognition (each number being composed of several digits spoken without insertion of pauses) give an accuracy of 90% after speaker adaptation.

Proceedings ArticleDOI
Y. Nara1, K. Iwata, Y. Kijima, A. Kobayashi, S. Kimura, S. Sasaki, J. Tanahashi 
01 May 1982
TL;DR: A new matching algorithm for large vocabulary spoken word recognition is proposed, which gives a recognition score comparable to that of the traditional DP matching algorithm, but requires less than 1/10 as much calculation.
Abstract: We propose a new matching algorithm for large vocabulary spoken word recognition, which gives a recognition score comparable to that of the traditional DP matching algorithm, but requires less than 1/10 as much calculation. In a computer simulation of 1,000 categories in speaker-dependent recognition of speech samples uttered by five male adult speakers, an average recognition score of 95.8% was obtained. We have constructed a real-time speaker-dependent speech recognizer using our algorithm. We are now examining the application of this recognizer to Japanese text input.

Proceedings Article
01 Sep 1982
TL;DR: In this article, a low cost voice recognition system for isolated words and small vocabularies (typically 15 words) is described, where the two main features of the system are: possibility of integration in a small size CMOS chip (typically 35 mm2) having minimum power consumption (less than 200?W at 3V) and automatic adaptation to the speaker without any tedious training mode.
Abstract: A low cost voice recognition system for isolated words and small vocabularies (typically 15 words) is described. The two main features of the system are: possibility of integration in a small CMOS chip (typically 35 mm²) having minimum power consumption (less than 200 µW at 3 V), and automatic adaptation to the speaker without any tedious training mode.

Journal ArticleDOI
Hermann Ney1
TL;DR: New techniques for automatic speaker recognition from telephone speech are described, based on spectral analysis of fixed sentence-long utterances, which is carried out by a dynamic programming algorithm which minimizes timing differences between corresponding speech events.

Journal ArticleDOI
TL;DR: A speaker‐independent isolated word recognition system which accepts telephone line speech which gets the recognition accuracy greater than 96% with 12 words spoken by 130 talkers and the same result was also obtained in the recognition test of the prototype machine.
Abstract: This paper describes a speaker‐independent isolated word recognition system which accepts telephone line speech. A recognition method is named selective weighted matching (SWM) which uses a weighted distance measure. The input speech signal is frequency‐analyzed every 10 ms by a filter bank. The individual glottal characteristic is normalized frame by frame using a least‐square‐fit line of the speech spectrum. Each reference pattern has a specific region in the time‐frequency domain. In the matching process of that region, the weighted distance computation is carried out under the predetermined condition. In the computer simulation of telephone line speech, we got the recognition accuracy greater than 96% with 12 words (digits and two command words in Japanese) spoken by 130 talkers. The same result was also obtained in the recognition test of the prototype machine.
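The frame-by-frame glottal normalization can be sketched as subtracting a least-squares regression line, fitted over channel index, from each log-spectral frame; this is an interpretation of the description above, not the system's exact code:

```python
def normalize_frame(log_spectrum):
    """Remove the least-squares line over channel index from one
    log-spectral frame (per-frame spectral-tilt normalization)."""
    n = len(log_spectrum)
    xs = range(n)
    mx = (n - 1) / 2.0
    my = sum(log_spectrum) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, log_spectrum))
    slope = sxy / sxx
    # residual after subtracting the fitted line my + slope * (x - mx)
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, log_spectrum)]
```

A frame whose log spectrum is exactly linear in channel index, i.e. pure tilt, normalizes to all zeros, so the individual glottal slope no longer influences the weighted distance.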