
Showing papers on "Speaker recognition published in 1986"



Proceedings ArticleDOI
01 Apr 1986
TL;DR: Speaker adaptation algorithms based on vector quantization (VQ), a technique that drastically reduces computation and memory requirements, are proposed in order to improve speaker-independent recognition.
Abstract: Vector quantization (VQ) is a technique that reduces the computation amount and memory size drastically. In this paper, speaker adaptation algorithms through VQ are proposed in order to improve speaker-independent recognition. The speaker adaptation algorithms use VQ codebooks of a reference speaker and an input speaker. Speaker adaptation is performed by substituting vectors in the codebook of a reference speaker for vectors of the input speaker's codebook, or vice versa. To confirm the effectiveness of these algorithms, word recognition experiments are carried out using the IBM office correspondence task uttered by 11 speakers. The total number of words is 1174 for each speaker, and the number of different words is 422. The average word recognition rate using different speaker's reference through speaker adaptation is 80.9%, and the rate within the second choice is 92.0%.
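The codebook-substitution idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: `train_codebook` and `adapt_codebook` are hypothetical names, and plain k-means stands in for whatever codebook training procedure the authors used.

```python
import numpy as np

def train_codebook(frames, size, iters=20, seed=0):
    # Build a VQ codebook with plain k-means (an LBG-style sketch).
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), size, replace=False)].copy()
    for _ in range(iters):
        # Assign each feature frame to its nearest codeword.
        dists = np.linalg.norm(frames[:, None] - codebook[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(size):
            members = frames[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def adapt_codebook(reference_cb, input_cb):
    # Adaptation by substitution: replace each reference codeword
    # with the input speaker's nearest codeword.
    dists = np.linalg.norm(reference_cb[:, None] - input_cb[None], axis=2)
    return input_cb[dists.argmin(axis=1)]
```

After adaptation, the reference speaker's templates are quantized with the adapted codebook, so they sound "in the voice" of the input speaker's feature space.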

269 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: The experimental results show that the instantaneous and transitional representations are relatively uncorrelated, thus providing complementary information for speaker recognition; simple transmission channel variations are shown to affect the instantaneous spectral representations and the corresponding recognition performance significantly, while the transitional representations and performance are relatively resistant.
Abstract: The use of instantaneous and transitional spectral representations of spoken utterances for speaker recognition is investigated. LPC-derived cepstral coefficients are used to represent instantaneous spectral information, and best linear fits of each cepstral coefficient over a specified time window are used to represent transitional information. An evaluation has been carried out using a data base of isolated digit utterances spoken by 10 talkers over dialed-up telephone lines. Two vector quantization (VQ) codebooks, instantaneous and transitional, are constructed from training utterances for each speaker. The experimental results show that the instantaneous and transitional representations are relatively uncorrelated, thus providing complementary information for speaker recognition. A rectangular window of approximately 100-150 ms duration provides an effective estimate of spectral transitions for speaker recognition. Also, simple transmission channel variations are shown to affect the instantaneous spectral representations and the corresponding recognition performance significantly, while the transitional representations and performance are relatively resistant.
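The transitional representation, the best linear fit of each cepstral coefficient over a time window, reduces to a regression slope per coefficient. A minimal sketch of that computation (window length and edge padding here are assumptions, not the paper's exact settings):

```python
import numpy as np

def delta_cepstra(cepstra, window=5):
    # Transitional features: slope of the least-squares linear fit of
    # each cepstral coefficient over 2*window+1 frames centered at t.
    ks = np.arange(-window, window + 1)
    denom = np.sum(ks ** 2)
    # Pad edges by repeating the first/last frame.
    padded = np.pad(cepstra, ((window, window), (0, 0)), mode="edge")
    deltas = np.zeros_like(cepstra, dtype=float)
    for t in range(len(cepstra)):
        frame_window = padded[t:t + 2 * window + 1]   # frames t-K .. t+K
        deltas[t] = ks @ frame_window / denom         # regression slope
    return deltas
```

With a 10 ms frame shift, `window=5` spans roughly 110 ms, in the 100-150 ms range the paper found effective.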

228 citations


Proceedings ArticleDOI
07 Apr 1986
TL;DR: Results are given which show that HMMs provide a versatile pattern matching tool suitable for some image processing tasks as well as speech processing problems.
Abstract: A handwritten script recognition system is presented which uses Hidden Markov Models (HMM), a technique widely used in speech recognition. The script is encoded as templates in the form of a sequence of quantised inclination angles of short equal length vectors together with some additional features. A HMM is created for each written word from a set of training data. Incoming templates are recognised by calculating which model has the highest probability for producing that template. The task chosen to test the system is that of handwritten word recognition, where the words are digits written by one person. Results are given which show that HMMs provide a versatile pattern matching tool suitable for some image processing tasks as well as speech processing problems.
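The "pick the model with the highest probability" step can be sketched with the standard scaled forward algorithm for discrete HMMs. The model parameters below are illustrative only, not taken from the paper:

```python
import numpy as np

def forward_log_prob(obs, pi, A, B):
    # Log-likelihood of a discrete observation sequence under an HMM,
    # computed with the forward algorithm and per-step scaling to
    # avoid numerical underflow on long templates.
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
        s = alpha.sum()
        log_p += np.log(s)
        alpha /= s
    return log_p

def recognize(obs, models):
    # Recognition: the word whose HMM best explains the template wins.
    return max(models, key=lambda w: forward_log_prob(obs, *models[w]))
```

Here each entry of `models` maps a word to its `(pi, A, B)` parameters (initial, transition, and emission probabilities), trained from that word's templates.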

124 citations


Journal ArticleDOI
TL;DR: This paper focuses on the long-term intra-speaker variability of feature parameters as one of the most crucial problems in speaker recognition, and presents an investigation into methods for reducing the effects of long-term spectral variability on recognition accuracy.

79 citations


PatentDOI
TL;DR: In this article, a continuous speech recognition system with a speech processor and a word recognition computer subsystem is described, which is characterized by an element for developing a graph for confluent links between confluent nodes.
Abstract: A continuous speech recognition system having a speech processor and a word recognition computer subsystem, characterized by an element for developing a graph of confluent links between confluent nodes; an element for developing a graph of boundary links between adjacent words; an element for storing an inventory of confluent links and boundary links as a coding inventory; and an element for converting an unknown utterance into an encoded sequence of confluent links and boundary links corresponding to recognition sequences stored in the word recognition subsystem's recognition vocabulary. The invention also includes a method for achieving continuous speech recognition by characterizing speech as a sequence of confluent links which are matched with candidate words. The invention applies to isolated word speech recognition as well as continuous speech recognition, except that in the isolated word case there are no boundary links.

68 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper describes the results of the work in designing a system for large-vocabulary word recognition of continuous speech, and generalizes the use of context-dependent Hidden Markov Models of phonemes to take into account word-dependent coarticulatory effects.
Abstract: This paper describes the results of our work in designing a system for large-vocabulary word recognition of continuous speech. We generalize the use of context-dependent Hidden Markov Models (HMM) of phonemes to take into account word-dependent coarticulatory effects. Robustness is assured by smoothing the detailed word-dependent models with less detailed but more robust models. We describe training and recognition algorithms for HMMs of phonemes-in-context. On a task with a 334-word vocabulary and no grammar (i.e., a branching factor of 334), in speaker-dependent mode, we show an average reduction in word error rate from 24% using context-independent phoneme models to 10% when using robust context-dependent phoneme models.
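The smoothing of detailed models with robust ones is, in spirit, an interpolation whose weight grows with the amount of training data behind the detailed model. A hedged sketch of that idea (the weight formula and the constant `k` are assumptions; the paper's actual smoothing scheme may differ):

```python
import numpy as np

def smooth_model(detailed_params, robust_params, count, k=10.0):
    # Interpolate a detailed context-dependent model with a robust
    # context-independent one. With few training examples the robust
    # model dominates; with many, the detailed model does.
    # k is a hypothetical tuning constant.
    w = count / (count + k)
    return w * detailed_params + (1.0 - w) * robust_params
```

Applied to each parameter vector of a phoneme-in-context model, this keeps rarely seen contexts from being modeled by unreliable estimates.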

59 citations


Journal ArticleDOI
TL;DR: Improved recognition performance is demonstrated by explicitly modeling the correlation between spectral measurements of adjacent frames; and using a distance measure which is a function of the recognition reference frame being used.
Abstract: The performance of current speaker independent speech recognition technology is limited by the inadequacy of the measures of the speech data to discriminate between different speech sounds. In particular, two critical assumptions that underlie and limit most current recognition techniques are that: 1) speech data from different frames are statistically independent (e.g., there are no between-frame interactions); and 2) speech data statistics are independent of phonetic events (e.g., distance measures are fixed and independent of input or reference speech). In the context of speaker independent isolated digit recognition, improved recognition performance is demonstrated by: 1) explicitly modeling the correlation between spectral measurements of adjacent frames; and 2) using a distance measure which is a function of the recognition reference frame being used. A statistical model was created from a 2464 token database (2 tokens of each of 11 words "zero" through "nine" and "oh") for 112 speakers. Primary features include energy and filter bank amplitudes. Interspeaker variability was estimated by time aligning all training tokens and creating an ensemble of 224 feature vectors for each reference frame. Normal distributions were then estimated individually for each frame jointly with its neighbors. Testing was performed on a multidialect database of 2486 spoken digit tokens collected from 113 (different) speakers using maximum-likelihood decision methods. The substitution rate dropped from 1.7 to 1.4 percent with incorporation of between-frame statistics, and further to 0.6 percent with incorporation of frame-specific statistics in the likelihood model.
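Modeling the correlation between spectral measurements of adjacent frames amounts to fitting a joint distribution over frame pairs rather than independent single frames. A minimal Gaussian sketch of that idea (function names are hypothetical; the paper fits normal distributions per reference frame, which this simplifies to a single model):

```python
import numpy as np

def fit_joint_gaussian(frames):
    # Between-frame statistics: fit a Gaussian to the concatenation
    # of each frame with its successor, capturing their correlation.
    pairs = np.hstack([frames[:-1], frames[1:]])   # shape (T-1, 2*dim)
    mean = pairs.mean(axis=0)
    cov = np.cov(pairs, rowvar=False)
    return mean, cov

def log_likelihood(frames, mean, cov):
    # Score a test utterance under the joint adjacent-frame model.
    pairs = np.hstack([frames[:-1], frames[1:]])
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    d = pairs - mean
    quad = np.einsum("ti,ij,tj->t", d, inv, d)
    k = pairs.shape[1]
    return np.sum(-0.5 * (quad + logdet + k * np.log(2 * np.pi)))
```

The off-diagonal blocks of `cov` are exactly the between-frame correlations that a frame-independence assumption discards.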

44 citations


Proceedings ArticleDOI
Richard V. Cox1, D. Bock, K. Bauer, James D. Johnston, J. Snyder 
01 Apr 1986
TL;DR: The underlying principles of the AVPS algorithm, its implementation, and laboratory test results are described, and the quality of the decrypted speech is considered very natural, and speaker recognition is retained — a significant advantage over digital vocoders.
Abstract: The Analog Voice Privacy System is based on individual sample permutation of the output samples of a sub-band coder analysis filterbank. The system has a large number of digital keys, giving it the strength of a digital encryption system, while retaining the good quality characteristics of analog scramblers. It has been implemented in a real-time hardware prototype designed for evaluation in the field. The units work with any modular telephone and standard 120 V AC power. The device contains two circuit boards, one for analog and one for digital processing, which together contain four digital signal processors. There are 125! possible permutation keys. These prototypes were designed to be tested in real telephone environments. To date, the device has been successfully tested over long distance telephone connections, several different analog and digital PBXs and telephone switches, and a channel simulator. The quality of the decrypted speech is considered very natural, and in particular, speaker recognition is retained. This is a significant advantage over digital vocoders. This paper describes the underlying principles of the algorithm, the details of its implementation, and laboratory test results.
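The permutation principle (125! keys, because each block of 125 samples can be reordered arbitrarily) can be illustrated with a toy scrambler. This sketch operates on plain sample lists; the real AVPS permutes sub-band filterbank outputs, and its key handling is of course not a seeded PRNG:

```python
import random

def make_key(seed, block=125):
    # One of block! possible keys: a permutation of block positions.
    rng = random.Random(seed)
    perm = list(range(block))
    rng.shuffle(perm)
    return perm

def scramble(samples, perm):
    # Reorder every block of samples with the key permutation.
    block = len(perm)
    assert len(samples) % block == 0, "pad input to a whole number of blocks"
    out = []
    for i in range(0, len(samples), block):
        chunk = samples[i:i + block]
        out.extend(chunk[p] for p in perm)
    return out

def descramble(samples, perm):
    # Applying the inverse permutation undoes the scramble exactly.
    inverse = [0] * len(perm)
    for i, p in enumerate(perm):
        inverse[p] = i
    return scramble(samples, inverse)
```

Because descrambling recovers the samples exactly, the speech quality, and with it speaker identity, survives the privacy transform, unlike a lossy vocoder.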

43 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: Methods for text-independent speaker identification that deal with the variability in the data introduced by unknown telephone channels including probabilistic channel modeling, a channel-invariant model and a modified-Gaussian model are considered.
Abstract: We consider methods for text-independent speaker identification that deal with the variability in the data introduced by unknown telephone channels. The methods investigated include probabilistic channel modeling, a channel-invariant model and a modified-Gaussian model. The methods are described and then evaluated with experiments conducted with a twenty speaker database of long distance telephone calls.

36 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: It is shown that in speaker-dependent recognition of the alpha-numeric vocabulary, the PLP method in VQ-based ASR yields recognition scores similar to those of the standard ASR system.
Abstract: The perceptually based linear predictive (PLP) speech analysis method is applied to isolated word automatic speech recognition (ASR). The low dimensionality of the PLP analysis vector, which is otherwise identical in form to the standard linear predictive (LP) analysis vector, allows for computational and storage savings in ASR. We show that in speaker-dependent recognition of the alpha-numeric vocabulary, the PLP method in VQ-based ASR yields recognition scores similar to those of the standard ASR system. The main focus of the paper is on cross-speaker ASR. We demonstrate in experiments with vowel centroids of two male speakers and one female speaker that the PLP speech representation is more consistent with the underlying phonetic information than the standard LP method. Conclusions from the experiments are confirmed by the superior performance of the PLP method in cross-speaker isolated word recognition.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: This paper describes a technique of spectral transformation for improved adaptation of a knowledge data base or reference templates to new speakers in automatic speech recognition (ASR).
Abstract: This paper describes a technique of spectral transformation for improved adaptation of a knowledge base or reference templates to new speakers in automatic speech recognition (ASR). Based on a statistical analysis tool (canonical correlation analysis), the proposed method makes it possible to improve speaker independence in large-vocabulary ASR. Applied to an isolated word recognizer, it improved the correct-recognition score from 70% to 87%.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A new method based on template matching utilizes temporal information to advantage, performs text-dependent recognition as a special case, and is compared with similar recently developed methods.
Abstract: Text-independent speaker recognition methods have been based on measurements of long-term statistics of individual speech frames. These methods are not capable of modeling speaker-dependent speech dynamics. In this paper, we describe a new method, based on template matching, that utilizes temporal information to advantage. The template-matching method performs text-dependent recognition as a special case. Performance of the template-matching method is compared with that of similar recently-developed methods.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A database of spoken Japanese words has been collected for use in designing and evaluating algorithms for automatic speech recognition, and part of the database has been distributed among the members of the committee.
Abstract: A database of spoken Japanese words has been collected for use in designing and evaluating algorithms for automatic speech recognition. The database is composed of 323 words. It is a special feature of this database that all samples are uttered four times by each speaker (i.e. four tokens per word). Data from 75 male and 75 female speakers were collected at 15 recording sites. Speaker data include sex, age, height, etc. Fifteen research institutions and private enterprises engaged in speech research and development have taken part in the data collection. This is the result of four years' effort by a committee supported by JEIDA (Japan Electronic Industry Development Association). Part of the database has been distributed among the members of the committee.

Journal ArticleDOI
01 Apr 1986
TL;DR: The algorithms proposed here are composed of simple image processing; experiments show they work well, and future progress in image processing will make real-time implementation possible.
Abstract: Though technology in speech recognition has progressed recently, Automatic Speech Recognition (ASR) is vulnerable to noise. Lip information is thought to be useful for speech recognition in noisy situations, such as in a factory or in a car. This paper describes speech recognition enhancement by lip information. Two types of usage are dealt with. One is the detection of the start and stop of speech from lip information; this is the simplest usage. The other is lip-pattern recognition, which is used for speech recognition together with sound information. Algorithms for both usages are proposed, and the experimental system shows they work well. The algorithms proposed here are composed of simple image processing. Future progress in image processing will make it possible to realize them in real time.

Journal ArticleDOI
TL;DR: The state of the art in speech coding, text-to-speech synthesis, speech recognition, and speaker recognition is discussed, with a focus on solving the problem of continuous speech recognition for large vocabularies and verifying talkers' identities from a limited amount of spoken text.
Abstract: This paper presents an overview of the current activities in speech research. We will discuss the state of the art in speech coding, text-to-speech synthesis, speech recognition, and speaker recognition. In the speech coding area, current algorithms perform well at bit rates down to 9.6 kb/s, and the research is directed at bringing the rate for high-quality speech coding down to 2.4 kb/s. In text-to-speech synthesis, what we currently are able to produce is very intelligible but not yet completely natural. Current research aims at providing higher quality and intelligibility to the synthetic speech that these systems produce. Finally, today's systems for speech and speaker recognition provide excellent performance on limited tasks; i.e., limited vocabulary, modest syntax, small talker populations, constrained inputs, etc. Current research is directed at solving the problem of continuous speech recognition for large vocabularies, and at verifying talkers' identities from a limited amount of spoken text.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A framework for developing a phonetically based speech recognition system is discussed; the recognition task is the class of sounds known as the semivowels.
Abstract: A phonetically based approach to speech recognition uses speech specific knowledge obtained from phonotactics, phonology and acoustic phonetics to capture relevant phonetic information. Thus, a recognition system based on this approach can make broad classifications as well as detailed phonetic distinctions. This paper discusses a framework for developing a phonetically based recognition system. The recognition task is the class of sounds known as the semivowels. The recognition results reported, though incomplete, are encouraging.

Journal ArticleDOI
TL;DR: In this paper, the effect of linear predictive coding (LPC) on the recognition of previously unfamiliar speakers was investigated. And the results showed that the more distinctive male speakers and the females were better recognized than the less distinctive males for unprocessed speech.
Abstract: The effect of narrow‐band digital processing, using a linear predictive coding (LPC) algorithm at 2400 bits/s, on the recognition of previously unfamiliar speakers was investigated. In two experiments, rated voice distinctiveness was used to select three sets of five speakers (two sets of males and one set of females). The more distinctive male speakers and the females were better recognized than the less distinctive males for unprocessed speech. With LPC processed speech, there were large losses in speaker recognition for the more distinctive males and the females, whereas the less distinctive males showed no recognition loss. This interaction is discouraging to prospects for developing practical procedures for comparing speaker recognition over various voice communication systems.



Proceedings ArticleDOI
01 Apr 1986
TL;DR: Four unsupervised speaker adaptation methods for vowel templates are described and evaluated and show that these methods work well and that the top-down approach is better than the bottom-up one.
Abstract: Four unsupervised speaker adaptation methods for vowel templates are described and evaluated. There are two approaches to automatically obtaining information on vowel classification and location: one is based on feature parameters and the other on the results of input speech recognition. Here, the former is referred to as a bottom-up approach and the latter as a top-down one. Two adaptation techniques are also presented: the first is template selection from pre-stored sets and the second is template modification. Combining these approaches and techniques, four adaptation methods are derived. These four methods are evaluated in terms of spectral distortion and word recognition rate. They are then compared considering performance, required calculation, rate of correctly used vowels, and type of input speech. The results show that these methods work well and that the top-down approach is better than the bottom-up one. They also show that the modification technique is better than the selection technique.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper uses an extension of the well-known hidden Markov models in order to model more accurately the properties of the phonetic labeling stage, and presents experimental results which were computed speaker-independently.
Abstract: This paper addresses the problem of generating word hypotheses in continuous German speech. It uses an extension of the well-known hidden Markov models in order to model more accurately the properties of the phonetic labeling stage. A powerful scoring function is derived. Experimental results are presented which were computed speaker-independently.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: An automatic incremental network generation algorithm for speaker independent isolated word recognition is described, which is possible to add new words to the network at any time; because of its complete freedom from human intervention, it is language and vocabulary independent.
Abstract: It is well known that a network representation of templates has many advantages; however, generating a network by hand is an impossible task for a large vocabulary database. This paper describes an automatic incremental network generation algorithm for speaker independent isolated word recognition. Because of its incremental nature, it is possible to add new words to the network at any time; because of its complete freedom from human intervention, it is language and vocabulary independent. By applying this technique to speaker-independent recognition, a recognition accuracy of 99% was obtained for the digits and 91.92% for the alphabet.


Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper shows how a semi-automatic design of a speech recognition system can be done as a planning activity and results in the recognition of connectedly spoken letters pronounced by 70 new speakers are presented.
Abstract: This paper shows how a semi-automatic design of a speech recognition system can be done as a planning activity. Recognition performances are used for deciding plan refinement. Inductive learning is performed for setting action preconditions. Experimental results in the recognition of connectedly spoken letters pronounced by 70 new speakers are presented.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: A learning method in which the syllable templates are automatically optimized, based on a speaker-dependent recognition system, showed an average syllable recognition accuracy of 71.0% without and 82.5% with automatic learning.
Abstract: In this speaker-dependent recognition system, recognition is based on syllable template matching and each syllable has several templates. In the initial training for each speaker, 590 templates for 111 syllables are made, each including various contextual variations. The authors studied a learning method in which the syllable templates are automatically optimized. It is judged whether or not an input syllable should be learned according to the recent recognition condition. If it should be learned, the input syllable pattern replaces the template that contributes the least to recognition in the templates segmented from the same context and in the same syllable category. Automatic learning was evaluated on recognition of speech data obtained by reading Japanese sentences at a rate of about 4 to 5 syllables per second. The results over eight speakers showed an average syllable recognition accuracy of 71.0% without and 82.5% with automatic learning. Further, by increasing the maximum number of templates to 1024, it rose to 84.8%.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: This work defines novel distance measures for speech recognition which are specifically designed to differentiate between confusable speech sounds.
Abstract: This work defines novel distance measures for speech recognition which: 1. model the statistical interaction between adjacent speech frames, 2. model the statistical characteristics of different speech sounds individually, and 3. are specifically designed to differentiate between confusable speech sounds. Speaker independent recognition tests performed on the Texas Instruments multi-dialect isolated digit data base give substitution rates as low as 0.6% with a vocabulary of 11 digits.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: The possibility of using an automatic speech recognition system as a front end to a computer for Chinese-character processing is explored, and some preliminary experiments are reported which indicate that the syllable inventory of spoken Standard Chinese belongs in the category of "difficult" vocabularies.
Abstract: The possibility of using an automatic speech recognition system as a front end to a computer for Chinese-character processing is explored in this paper. Aspects of the Chinese language are discussed in relation to the capabilities of current state-of-the-art isolated-word recognition systems. Some preliminary experiments are reported which indicate that the syllable inventory of spoken Standard Chinese belongs in the category of "difficult" vocabularies. The vocabulary size is of the order of 350 syllables with a large number of similar word pairs. Recognition rates using linear predictive coding, Itakura distance measures and dynamic time warping are of the order of 25-30%.
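The dynamic time warping behind these recognition rates can be sketched as the classic cumulative-distance recursion. For simplicity this sketch uses a Euclidean frame distance in place of the Itakura distance the paper used:

```python
import numpy as np

def dtw_distance(a, b, dist=lambda x, y: np.linalg.norm(x - y)):
    # Dynamic time warping: minimum cumulative frame distance over all
    # monotonic alignments of sequence a against sequence b.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Best of: insertion, deletion, or diagonal match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Recognition then reduces to computing this distance between the input utterance and every syllable template and choosing the nearest template.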

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A system for automatic recognition of continuous speech is proposed, utilizing both words and syllables as units of recognition, and using both local and global coherence in reducing word candidates as well as in modifying or renewing word and syllable templates.
Abstract: Critical re-examination of the premises of conventional systems for continuous speech recognition has led to a study of human processes of speech perception. It was found that deletion of a syllable is often not noticed by a listener, suggesting that the basic unit of continuous speech perception is larger than the syllable. Further experiments are described and discussed on the size of actual units of perception, effects of syntactic roles, unit organization and access to the mental lexicon, effects of context, as well as effects of repeated listening. A system for automatic recognition of continuous speech is then proposed on the basis of these results and considerations, utilizing both words and syllables as units of recognition, and using both local and global coherence in reducing word candidates as well as in modifying or renewing word and syllable templates.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper describes a speaker-independent isolated word recognition algorithm for telephone voice and its recognition performance, which consists of dynamic time warping and statistical word discrimination.
Abstract: This paper describes a speaker-independent isolated word recognition algorithm for telephone voice and its recognition performance. The recognition algorithm consists of two processes: dynamic time warping and statistical word discrimination. In the first process, input speech is compared with each word template using the dynamic time warping technique. Multiple word templates are used to deal with speech variations among speakers, where each word template is represented by a sequence of phoneme-like templates. To attain high recognition ability, a new technique for generating word templates is proposed. In the second process, statistical word discrimination is carried out for word candidates which have relatively low reliability in the first process. Discrimination functions are calculated based on statistics of transition tendencies of speech characteristics between adjacent frames, and the final word decision is made. The system was trained using utterances from 1305 speakers and tested with utterances from 259 speakers. An average recognition rate of 96.5% was obtained for a 16-word Japanese vocabulary set.