
Showing papers on "Speaker recognition published in 1987"


PatentDOI
TL;DR: In this paper, confusion coefficients between the labels of the label alphabet for initial training and those for adaptation are determined by alignment of adaptation speech with the corresponding initially trained Markov model.

Abstract: For circumstance adaptation, for example speaker adaptation, confusion coefficients between the labels of the label alphabet for initial training and those for adaptation are determined by aligning adaptation speech with the corresponding initially trained Markov model. That is, each piece of adaptation speech is aligned with a corresponding initially trained Markov model by the Viterbi algorithm, and each label in the adaptation speech is mapped onto one of the states of the Markov models. For each adaptation label ID, the parameter values for each initial training label of the states mapped onto the adaptation label in question are accumulated and normalized to generate a confusion coefficient between each initial training label and each adaptation label. The parameter table of each Markov model is then rewritten in terms of the adaptation label alphabet using the confusion coefficients.

204 citations
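The accumulate-and-normalize step described in the abstract above can be sketched as follows. This is only an illustration: the function name, the use of precomputed alignment triples, and the per-pair weights are assumptions, since the patent's exact parameter-accumulation procedure is not spelled out here.

```python
import numpy as np

# Illustrative sketch (not the patent's exact procedure): accumulate
# evidence that initial-training label i co-occurs with adaptation
# label j, given (train_label, adapt_label, weight) triples obtained
# from a Viterbi alignment that is assumed to be precomputed.
def confusion_coefficients(alignment_triples, n_train_labels, n_adapt_labels):
    counts = np.zeros((n_train_labels, n_adapt_labels))
    for train_label, adapt_label, weight in alignment_triples:
        counts[train_label, adapt_label] += weight
    # Normalize per adaptation label, so each column gives a
    # distribution over initial-training labels.
    col_sums = counts.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0.0] = 1.0
    return counts / col_sums
```

Each column of the result is then usable as the confusion coefficients for rewriting a parameter table in terms of the adaptation label alphabet.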


Proceedings ArticleDOI
06 Apr 1987
TL;DR: Three different approaches for automatically segmenting speech into phonetic units are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on a constrained-clustering vector quantization approach.

Abstract: For large vocabulary and continuous speech recognition, the sub-word-unit-based approach is a viable alternative to the whole-word-unit-based approach. For preparing a large inventory of subword units, automatic segmentation is preferable to manual segmentation, as it substantially reduces the work associated with the generation of templates and gives more consistent results. In this paper we discuss some methods for automatically segmenting speech into phonetic units. Three different approaches are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on a constrained-clustering vector quantization approach. An evaluation of the performance of the automatic segmentation methods is given.

156 citations


Journal ArticleDOI
TL;DR: This work has shown that the speech material used for speaker recognition can be either text dependent (constrained text) or text independent (free text).
Abstract: Automatic speaker recognition has long been an interesting and challenging problem to speech researchers.1−10 The problem, depending on the nature of the final task, can be classified into two different categories: speaker verification and speaker identification. In a speaker verification task, the recognizer is asked to verify an identity claim made by an unknown speaker and a decision to reject or accept the identity claim is made. In a speaker identification task, the recognizer is asked to decide which out of a population of N speakers is best classified as the unknown speaker. The decision may include a choice of “no classification” (i.e., a choice that the specific speaker is not in a given closed set of speakers). The input speech material used for speaker recognition can be either text dependent (constrained text) or text independent (free text).

138 citations
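The two task categories described above reduce to a simple pair of decision rules; this is a minimal sketch, with the function names and the optional rejection threshold for the "no classification" choice as illustrative assumptions.

```python
import numpy as np

def verify(score, threshold):
    """Speaker verification: accept or reject an identity claim by
    comparing a match score against a threshold (higher = better)."""
    return score >= threshold

def identify(scores, rejection_threshold=None):
    """Closed-set speaker identification over N speakers, with an
    optional 'no classification' outcome as described above."""
    best = int(np.argmax(scores))
    if rejection_threshold is not None and scores[best] < rejection_threshold:
        return None  # the unknown speaker is judged outside the set
    return best
```

Passing a `rejection_threshold` turns the closed-set identifier into the open-set variant the abstract mentions, where the speaker may be declared outside the enrolled population.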


PatentDOI
TL;DR: Speaker verification is performed by computing principal components of a fixed text statement comprising a speaker identification code and a two-word phrase, and principal spectral components of a random word phrase.
Abstract: Speaker verification is performed by computing principal components of a fixed text statement comprising a speaker identification code and a two-word phrase, and principal spectral components of a random word phrase. A multi-phrase strategy is utilized in access control to allow successive verification attempts in a single session, if the speaker fails initial attempts. Based upon a verification attempt, the system produces a verification score which is compared with a threshold value. On successive attempts, the criterion for acceptance is changed, and one of a number of criteria must be satisfied for acceptance in subsequent attempts. A speaker normalization function can also be invoked to modify the verification score of persons enrolled with the system who inherently produce scores which result in denial of access. Accuracy of the verification system is enhanced by updating the reference template which then more accurately symbolizes the person's speech signature.

79 citations
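A toy sketch of the multi-phrase acceptance strategy above: each successive attempt in a session is judged against its own criterion. The per-attempt thresholds and the higher-score-is-better convention are illustrative stand-ins, since the patent's actual criteria are not detailed in the abstract.

```python
def multi_attempt_verify(scores, thresholds):
    """Accept as soon as any attempt's verification score meets that
    attempt's criterion (higher score = better match here); otherwise
    deny access after all attempts are exhausted."""
    for attempt, (score, threshold) in enumerate(zip(scores, thresholds), start=1):
        if score >= threshold:
            return True, attempt  # accepted on this attempt
    return False, len(scores)
```

Relaxing or combining the later thresholds reproduces the idea of a changing criterion for acceptance on subsequent attempts.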


Proceedings ArticleDOI
01 Apr 1987
TL;DR: A new algorithm is introduced that transforms hidden Markov models of speech derived from one "prototype" speaker so that they model the speech of a new speaker in the form of a probabilistic spectral mapping.
Abstract: This paper deals with rapid speaker adaptation for speech recognition. We introduce a new algorithm that transforms hidden Markov models of speech derived from one "prototype" speaker so that they model the speech of a new speaker. Speaker normalization is accomplished by a probabilistic spectral mapping from one speaker to another. For a 350-word task with a grammar, and using only 15 seconds of speech for normalization, the recognition accuracy is 97% averaged over 6 speakers. This accuracy would normally require over 5 minutes of speaker-dependent training. We derive the probabilistic spectral transformation of HMMs, describe an algorithm to estimate the transformation, and present recognition results.

76 citations
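The core transformation can be stated compactly: each HMM state's output distribution over the prototype speaker's spectral labels is pushed through a speaker-to-speaker probability matrix. This sketch assumes discrete output distributions and a row-stochastic mapping matrix; the paper's estimation algorithm for that matrix is not reproduced here.

```python
import numpy as np

def transform_output_distributions(b_proto, spectral_map):
    """b_proto:     (n_states, K) output distributions of the
                    prototype speaker's HMM states
    spectral_map:   (K, K') row-stochastic matrix; entry [k, k'] ~
                    P(new speaker's label k' | prototype label k)
    returns:        (n_states, K') transformed distributions,
                    b_new[s, k'] = sum_k b_proto[s, k] * P(k' | k)"""
    return b_proto @ spectral_map
```

Because the mapping matrix is row-stochastic, each transformed state distribution remains a valid probability distribution.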


PatentDOI
TL;DR: In this article, an apparatus operates to identify the speech signal of an unknown speaker as one of a finite number of speakers, each speaker is modeled and recognized with any example of their speech, and the output is a list of scores that measure how similar the input speaker is to each of the speakers whose models are stored in the system.
Abstract: An apparatus operates to identify the speech signal of an unknown speaker as one of a finite number of speakers. Each speaker is modeled and recognized with any example of their speech. The input to the system is analog speech and the output is a list of scores that measure how similar the input speaker is to each of the speakers whose models are stored in the system. The system includes front end processing means which is responsive to the speech signal to provide digitized samples of the speech signal at an output which are stored in a memory. The stored digitized samples are then retrieved and divided into frames. The frames are processed to provide a series of speech parameters indicative of the nature of the speech content in each of the frames. The processor for producing the speech parameters is coupled to either a speaker modeling means, whereby a model for each speaker is provided and consequently stored, or a speaker recognition mode, whereby the speech parameters are again processed with current parameters and compared with the stored parameters during each speech frame. The comparison is accomplished over a predetermined number of frames whereby a favorable comparison is indicative of a known speaker for which a model is stored.

65 citations


Journal ArticleDOI
TL;DR: Several vector quantization approaches to the problem of text-dependent speaker verification are described, in which a source codebook is designed to represent a particular speaker saying a particular utterance; the same utterance is later spoken by a speaker to be verified and is encoded in the source codebook representing the speaker whose identity was claimed.
Abstract: Several vector quantization approaches to the problem of text-dependent speaker verification are described. In each of these approaches, a source codebook is designed to represent a particular speaker saying a particular utterance. Later, this same utterance is spoken by a speaker to be verified and is encoded in the source codebook representing the speaker whose identity was claimed. The speaker is accepted if the verification utterance's quantization distortion is less than a prespecified speaker-specific threshold. The best approach achieved a 0.7 percent false acceptance rate and a 0.6 percent false rejection rate on a speaker population comprising 16 admissible speakers and 111 casual imposters. The approaches are described, and detailed experimental results are presented and discussed.

58 citations
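The accept/reject rule above is simple to state in code. This sketch assumes Euclidean distortion and feature frames supplied as numpy arrays, both assumptions, since the paper's exact features and distance measure are not given in the abstract.

```python
import numpy as np

def vq_distortion(frames, codebook):
    """Average quantization distortion when encoding feature frames
    (n_frames, dim) in a speaker's codebook (n_codewords, dim)."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.min(axis=1).mean()  # nearest-codeword distortion per frame

def vq_verify(frames, claimed_codebook, threshold):
    """Accept the identity claim if the utterance's distortion in the
    claimed speaker's codebook is below that speaker-specific threshold."""
    return vq_distortion(frames, claimed_codebook) < threshold
```

An impostor's frames quantize poorly in the claimed speaker's codebook, raising the distortion above the threshold.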


Journal ArticleDOI
M. Bush1, G. Kopec
TL;DR: A system for speaker-independent connected digit recognition is described in which explicit acoustic-phonetic features and constraints play a significant role and the best configurations of the recognizer achieve string recognition accuracies.
Abstract: A system for speaker-independent connected digit recognition is described in which explicit acoustic-phonetic features and constraints play a significant role. The digit vocabulary is modeled using a finite-state pronunciation network whose branches correspond to meaningful acoustic-phonetic units. Each branch is associated with an acoustic pattern matcher which employs a combination of whole-spectrum and feature-based metrics. The system has been evaluated using 17,000 utterances from the Texas Instruments (TI) multidialect, connected digits database. The best configurations of the recognizer achieve string recognition accuracies of approximately 96 and 97 percent when the length of the input string is unknown and known, respectively, and when different talkers are used for training and testing.

46 citations


Proceedings ArticleDOI
01 Apr 1987
TL;DR: The problem addressed by this study is the suppression of an undesired talker when two talkers are communicating simultaneously on the same monophonic channel (co-channel speech).
Abstract: The problem addressed by this study is the suppression of an undesired talker when two talkers are communicating simultaneously on the same monophonic channel (co-channel speech). Two different applications are considered: improved intelligibility for human listeners, and improved performance for automatic speech and speaker recognition (ASR) systems. For the human intelligibility problem, the desired talker is the weaker of the two signals, with voice-to-voice power ratios (Power desired / Power interference), or VVRs, as low as -18 dB. For ASR applications, the desired talker is the stronger of the two signals, with VVRs as low as 5 dB. Signal analysis algorithms have been developed which attempt to separate the co-channel spectrum into components due to the two different (stronger and weaker) talkers.

42 citations
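The voice-to-voice ratio used above is just a power ratio expressed in decibels:

```python
import math

def voice_to_voice_ratio_db(power_desired, power_interference):
    """VVR = 10 * log10(P_desired / P_interference), as defined above.
    A VVR of -18 dB means the desired talker is much weaker than the
    interferer; +5 dB means the desired talker is somewhat stronger."""
    return 10.0 * math.log10(power_desired / power_interference)
```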


Journal ArticleDOI
TL;DR: A set of dynamic adaptation procedures for updating expected feature values during recognition using maximum a posteriori probability (MAP) estimation techniques to update the mean vectors of sets of feature values on a speaker-by-speaker basis.
Abstract: In this paper, we describe efforts to improve the performance of FEATURE, the Carnegie-Mellon University speaker-independent speech recognition system that classifies isolated letters of the English alphabet, by enabling the system to learn the acoustical characteristics of individual speakers. Even when features are designed to be speaker-independent, it is frequently observed that feature values may vary more from speaker to speaker for a single letter than they vary from letter to letter. In these cases, it is necessary to adjust the system's statistical description of the features of individual speakers to obtain improved recognition performance. This paper describes a set of dynamic adaptation procedures for updating expected feature values during recognition. The algorithm uses maximum a posteriori probability (MAP) estimation techniques to update the mean vectors of sets of feature values on a speaker-by-speaker basis. The MAP estimation algorithm makes use of both knowledge of the observations input to the system from an individual speaker and the relative variability of the features' means within and across all speakers. In addition, knowledge of the covariance of the features' mean vectors across the various letters enables the system to adapt its representation of similar-sounding letters after any one of them is presented to the classifier. The use of dynamic speaker adaptation improves classification performance of FEATURE by 49 percent after four presentations of the alphabet, when the system is provided with supervised training indicating which specific utterance had been presented to the classifier from a particular user. Performance can be improved by as much as 31 percent when the system is allowed to adapt passively, in an unsupervised learning mode, without any information from individual users.

41 citations
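The MAP mean update described above has a standard conjugate-Gaussian form; this sketch shows that form under the assumption of known variances. The paper's formulation, which also couples similar-sounding letters through the cross-letter covariance of the mean vectors, is richer than this single-feature version.

```python
def map_mean_update(prior_mean, prior_var, obs_mean, obs_var, n_obs):
    """MAP estimate of a feature's mean for one speaker.

    prior_mean, prior_var: across-speaker mean and variance of the
                           feature's mean (the speaker-independent prior)
    obs_mean, obs_var:     sample mean and per-observation variance of
                           this speaker's n_obs observations
    As n_obs grows, the estimate moves from the prior toward the
    speaker's own sample mean."""
    precision_prior = 1.0 / prior_var
    precision_obs = n_obs / obs_var
    return (precision_prior * prior_mean + precision_obs * obs_mean) / (
        precision_prior + precision_obs
    )
```

With zero observations the estimate is the speaker-independent prior mean, matching the system's behavior before any adaptation data arrives.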


Journal ArticleDOI
TL;DR: A semiautomatic design of a speech recognition system can be done as a planning activity and results in the recognition of connected letters spoken by 100 speakers are presented.
Abstract: This paper shows how a semiautomatic design of a speech recognition system can be done as a planning activity. Recognition performances are used for deciding plan refinement. Inductive learning is performed for setting action preconditions. Experimental results in the recognition of connected letters spoken by 100 speakers are presented.

Journal ArticleDOI
TL;DR: In this paper, a Fourcin laryngograph was used to make recordings of three male speakers and the Lx signals were presented to a group of eight listeners, who performed both an AX discrimination and a speaker identification test.
Abstract: Using a Fourcin laryngograph, Lx recordings of three male speakers were made. After manipulation, the Lx signals were presented to a group of eight listeners, who performed both an AX discrimination and a speaker identification test. The results show that the listeners made use of the three parameters varied in the listening tests, viz. speech rhythm, F0 contour and F0 height. Furthermore, the data suggest that the relevance of these different parameters for speaker recognition is speaker-dependent rather than absolute.

Journal ArticleDOI
Lawrence R. Rabiner1, Jay G. Wilpon1
TL;DR: Algorithms based on both template matching (via dynamic time warping (DTW) procedures) and hidden Markov models (HMMs) have been developed which yield high accuracy on several standard vocabularies, including the 10 digits and the set of 26 letters of the English alphabet.

Journal ArticleDOI
TL;DR: The underlying principles of the AVPS algorithm, its implementation, and laboratory test results are described, and the quality of the decrypted speech is considered very natural, and speaker recognition is retained — a significant advantage over digital vocoders.
Abstract: The Analog Voice Privacy System (AVPS) is a voice scrambler that permutes individual output samples from a subband coder analysis filterbank. The system has 125! possible permutation keys, giving it the cryptanalytical strength of a digital encryption system. However, it retains the good voice-quality characteristics of analog scramblers. The AVPS has been implemented in a real-time hardware prototype designed for evaluation in telephone environments and works with any modular telephone and standard 120V ac electrical power. The unit contains two circuit boards — one for analog and one for digital processing — that each use four digital signal processors. To date, we have successfully tested it over long-distance telephone connections, several analog and digital PBXs and telephone switches, and a channel simulator. The quality of the decrypted speech is considered very natural, and speaker recognition is retained — a significant advantage over digital vocoders. This paper describes the underlying principles of the AVPS algorithm, its implementation, and laboratory test results.
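A toy illustration of keyed sample permutation, the core idea above. The real AVPS permutes subband-coder output samples over a block large enough to give a 125!-sized key space; this miniature uses a seeded shuffle purely for illustration.

```python
import random

def scramble(samples, key):
    """Permute a block of samples under a key-derived permutation;
    returns the scrambled block and the permutation for descrambling."""
    perm = list(range(len(samples)))
    random.Random(key).shuffle(perm)
    return [samples[i] for i in perm], perm

def descramble(scrambled, perm):
    """Invert the permutation, recovering the original sample order."""
    out = [None] * len(scrambled)
    for pos, src in enumerate(perm):
        out[src] = scrambled[pos]
    return out
```

Because scrambling only reorders samples, the signal stays in the analog domain's amplitude range, which is why such schemes keep analog-like voice quality while a wrong key yields an unintelligible ordering.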


Proceedings ArticleDOI
01 Apr 1987
TL;DR: The results of an extensive evaluation of a speaker verification system for access control using a 200 speaker population and over 40,000 impostor attempts, both performed on line, over a 4-month period are presented.
Abstract: The results of an extensive evaluation of a speaker verification system for access control are presented. The system employs an algorithm based on the Principal Spectral Components representation derived from the short term spectrum of the speech signal. This system, designed for access control applications, has been evaluated using a 200 speaker population and a total of over 13,000 true speaker attempts and over 40,000 impostor attempts, both performed on line, over a 4-month period. A true speaker rejection rate of less than 1% and an impostor acceptance rate of less than 0.1% have been obtained.

Journal ArticleDOI
TL;DR: A computer-controlled testing system and a set of standard tests are developed to assess the performance of speech recognition devices sold by Texas Instruments, Votan, Dragon, IBM, Interstate, and NEC, demonstrating several reliable performance differences among these systems.

Proceedings ArticleDOI
01 Apr 1987
TL;DR: The results of several evaluations of the utility of the SRB metric as a substitute for human judgement of the goodness of articulation of a whole word are presented.
Abstract: The Indiana Speech Training Aid project (ISTRA) is evaluating the use of speaker-dependent speech recognition to provide feedback for deaf speakers or normal-hearing misarticulating children, to assist them in improving their speech. Ongoing clinical trials of the ISTRA system have demonstrated effective improvement in speech production. The theoretical approach is first to form templates from a child's current best productions of a word and then to use the score generated by matching new utterances to these templates as feedback to indicate the goodness of articulation. This paper presents the results of several evaluations of the utility of the SRB metric as a substitute for human judgement of the goodness of articulation of a whole word. Also, the confusion matrices resulting from recognition of acoustically similar words are discussed in terms of possible modifications of the algorithms.

Proceedings Article
23 Aug 1987
TL;DR: A paradigm for automatic speech recognition using networks of actions performing variable depth analysis is presented and preliminary results in the recognition of isolated letters and digits are presented.
Abstract: A paradigm for automatic speech recognition using networks of actions performing variable depth analysis is presented. The paradigm produces descriptions of speech properties that are related to speech units through Markov models representing system performance. Preliminary results in the recognition of isolated letters and digits are presented.

Proceedings ArticleDOI
01 Apr 1987
TL;DR: Comparison of performance of the two methods shows that a new speaker's codebook is not necessary to represent the new speaker, and a vector quantization approach to speaker adaptation is evaluated.
Abstract: In view of designing a speaker-independent large vocabulary recognition system, we evaluate a vector quantization approach to speaker adaptation. Only one speaker (the reference speaker) pronounces the application vocabulary. He also pronounces a small vocabulary called the adaptation vocabulary. Each new speaker then merely pronounces the adaptation vocabulary. Two adaptation methods are investigated, establishing a correspondence between the codebooks of these two speakers. This allows us to transform the reference utterances of the reference speaker into suitable references for the new speaker. Method I uses a transposed codebook to represent the new speaker during the recognition process whereas Method II uses a codebook which is obtained by clustering on the new speaker's pronunciation of the adaptation vocabulary. Experiments were carried out on a 20-speaker database (10 male, 10 female). The adaptation vocabulary contains 136 words; the application one has 104 words. The mean recognition error rate without adaptation is 22.3% for inter-speaker experiments; after one of the two methods has been implemented the mean recognition error rate is 10.5%. Comparison of performance of the two methods shows that a new speaker's codebook is not necessary to represent the new speaker.
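Method I's "transposed codebook" can be sketched as follows. The frame-level correspondence between the two speakers' adaptation utterances is assumed to be precomputed (e.g. by DTW) and supplied as frame pairs; that is an assumption on our part, since the abstract only summarizes the correspondence procedure.

```python
import numpy as np

def transpose_codebook(ref_codebook, aligned_pairs):
    """Replace each reference codeword with the average of new-speaker
    frames aligned to reference frames that quantize to that codeword.

    ref_codebook:  (K, dim) reference speaker's codebook
    aligned_pairs: iterable of (ref_frame, new_frame) vector pairs"""
    sums = np.zeros(ref_codebook.shape)
    counts = np.zeros(len(ref_codebook))
    for ref_frame, new_frame in aligned_pairs:
        # Nearest reference codeword for this reference frame.
        k = int(np.argmin(((ref_codebook - ref_frame) ** 2).sum(axis=1)))
        sums[k] += new_frame
        counts[k] += 1
    out = np.array(ref_codebook, dtype=float, copy=True)
    seen = counts > 0
    out[seen] = sums[seen] / counts[seen][:, None]
    return out  # codewords never aligned keep their reference values
```

The transposed codebook then represents the new speaker during recognition, which is why, as the abstract concludes, no codebook trained on the new speaker is strictly necessary.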

01 Jan 1987
TL;DR: This study examines the set of CV, VC, CVC and some CCVC sequences which are non-occurring in monomorphemic words in a 20,000 word lexicon and suggests that many sequences in which the prevocalic and postvocalic consonants are similar, or identical, are excluded.
Abstract: This study examines the set of CV, VC, CVC and some CCVC sequences which are non-occurring in monomorphemic words in a 20,000 word lexicon. A preliminary analysis suggests that many sequences in which the prevocalic and postvocalic consonants are similar, or identical, are excluded. The sequences are discussed in relation to 'reduced forms' characteristic of fast speech, word boundary assimilation, and lexical access.

Book ChapterDOI
01 May 1987
TL;DR: A Speech Recognition Methodology is proposed which is based on the general assumption of ‘fuzzyness’ of both speech-data and knowledge-sources and on other fundamental assumptions which are also the bases of the proposed methodology.
Abstract: In this paper a Speech Recognition Methodology is proposed which is based on the general assumption of ‘fuzzyness’ of both speech-data and knowledge-sources. Besides this general principle, there are other fundamental assumptions which are also the bases of the proposed methodology: ‘Modularity’ in the knowledge organization, ‘Homogeneity’ in the representation of data and knowledge, ‘Passiveness’ of the ‘understanding flow’ (no backtracking or feedback), and ‘Parallelism’ in the recognition activity.



Journal ArticleDOI
TL;DR: A novel isolated-word recognition system for monosyllabic tonal languages is proposed which depends on the energy-time profiles of the utterances at different frequency bands and a mean accuracy of 97-99% was achieved for speaker-dependent recognition over the ten Cantonese digits.
Abstract: A novel isolated-word recognition system for monosyllabic tonal languages is proposed which depends on the energy-time profiles (ETP) of the utterances at different frequency bands. Training procedures, together with the classification strategy, will be discussed. A mean accuracy of 97-99% was achieved for speaker-dependent recognition over the ten Cantonese digits.
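The energy-time-profile idea can be sketched with a plain FFT filterbank. The frame length, band edges, and rectangular window below are all illustrative choices, since the abstract does not specify the paper's actual filterbank.

```python
import numpy as np

def energy_time_profiles(signal, frame_len, bands, sample_rate):
    """Per-frame energy in each frequency band.

    signal:  1-D array of samples
    bands:   list of (lo_hz, hi_hz) band edges
    returns: (n_frames, n_bands) array of band energies"""
    n_frames = len(signal) // frame_len
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    profiles = np.empty((n_frames, len(bands)))
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        for j, (lo, hi) in enumerate(bands):
            profiles[i, j] = spectrum[(freqs >= lo) & (freqs < hi)].sum()
    return profiles
```

For a tonal language, the trajectory of these band energies over time carries both segmental and tone information, which is what the classifier operates on.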

Journal ArticleDOI
TL;DR: Several important areas need substantial clarification or expansion before the reported findings of Koenig, ‘‘Spectrographic voice identification: A forensic survey’’ can be readily accepted.
Abstract: Several important areas need substantial clarification or expansion before the reported findings of Koenig, ‘‘Spectrographic voice identification: A forensic survey’’ [J. Acoust. Soc. Am. 79, 2088–2090 (1986)], can be readily accepted. They are: (1) the method of ‘‘voiceprint’’ analysis used, (2) ‘‘voiceprint’’ examiners’ qualifications, and (3) the means for determining the FBI’s correct identification.

01 Jan 1987
TL;DR: This paper proposes the use of synthetic speech as a means of handling the collection of reference data and speaker normalization in large-vocabulary speech recognition.
Abstract: A major problem in large-vocabulary speech recognition is the collection of reference data and speaker normalization. In this paper we propose the use of synthetic speech as a means of handling this problem. An experimental scheme for such a system will be described.



Proceedings ArticleDOI
M. Codogno1, L. Fissore
01 Apr 1987
TL;DR: Two different approaches are exploited to obtain sets of models in which the state duration is characterized by suitable probability density functions, and two difficult speaker-dependent recognition tasks have been carried out to evaluate them.

Abstract: Classical first-order Hidden Markov Models with continuous probability density functions (HMMCs) seem to be a promising tool for speech modelling, for both isolated word and continuous speech recognition tasks. However, these models have a strong limitation: they capture duration information poorly, and duration is sometimes the most important feature for distinguishing between similar sounds. In this paper two different approaches are exploited to obtain sets of models in which the state duration is characterized by suitable probability density functions. To evaluate the performance of both model sets, two difficult speaker-dependent recognition tasks have been carried out. We have also tested the feasibility of using a limited-size training lexicon for a new speaker and merging these duration models with those obtained from other speakers.