
Showing papers on "Speaker diarisation published in 1987"


Journal ArticleDOI
TL;DR: This work has shown that the input speech material used for speaker recognition can be either text dependent (constrained text) or text independent (free text).
Abstract: Automatic speaker recognition has long been an interesting and challenging problem to speech researchers. The problem, depending on the nature of the final task, can be classified into two different categories: speaker verification and speaker identification. In a speaker verification task, the recognizer is asked to verify an identity claim made by an unknown speaker and a decision to reject or accept the identity claim is made. In a speaker identification task, the recognizer is asked to decide which out of a population of N speakers is best classified as the unknown speaker. The decision may include a choice of “no classification” (i.e., a choice that the specific speaker is not in a given closed set of speakers). The input speech material used for speaker recognition can be either text dependent (constrained text) or text independent (free text).
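The identification decision described above, including the optional “no classification” outcome for a speaker outside the closed set, can be sketched as follows. The similarity scores and the rejection threshold here are hypothetical, not taken from the paper:

```python
def identify(scores, reject_threshold):
    """scores[i]: similarity of the unknown speaker to enrolled speaker i."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] < reject_threshold:
        return None  # "no classification": the speaker is outside the closed set
    return best

print(identify([0.2, 0.9, 0.4], 0.5))  # index of the best-matching speaker
print(identify([0.2, 0.3, 0.1], 0.5))  # None: no score clears the threshold
```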

138 citations


PatentDOI
TL;DR: Speaker verification is performed by computing principal components of a fixed text statement comprising a speaker identification code and a two-word phrase, and principal spectral components of a random word phrase.
Abstract: Speaker verification is performed by computing principal components of a fixed text statement comprising a speaker identification code and a two-word phrase, and principal spectral components of a random word phrase. A multi-phrase strategy is utilized in access control to allow successive verification attempts in a single session, if the speaker fails initial attempts. Based upon a verification attempt, the system produces a verification score which is compared with a threshold value. On successive attempts, the criterion for acceptance is changed, and one of a number of criteria must be satisfied for acceptance in subsequent attempts. A speaker normalization function can also be invoked to modify the verification score of persons enrolled with the system who inherently produce scores which result in denial of access. Accuracy of the verification system is enhanced by updating the reference template which then more accurately symbolizes the person's speech signature.
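The multi-attempt access-control strategy described above might be sketched as follows. The scores, thresholds, and the particular way the acceptance criterion is relaxed on later attempts are invented for illustration; the patent does not specify them:

```python
def verify_session(scores, thresholds):
    """Allow successive verification attempts in one session; the acceptance
    criterion (here, a per-attempt threshold) changes on each attempt."""
    for attempt, (score, threshold) in enumerate(zip(scores, thresholds), start=1):
        if score >= threshold:
            return True, attempt   # accepted on this attempt
    return False, len(scores)      # all attempts failed: deny access

# First attempt fails the strict threshold; the relaxed second one accepts.
print(verify_session([0.40, 0.70], [0.80, 0.60]))
```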

79 citations


Proceedings ArticleDOI
01 Apr 1987
TL;DR: A new algorithm is introduced that transforms hidden Markov models of speech derived from one "prototype" speaker so that they model the speech of a new speaker in the form of a probabilistic spectral mapping.
Abstract: This paper deals with rapid speaker adaptation for speech recognition. We introduce a new algorithm that transforms hidden Markov models of speech derived from one "prototype" speaker so that they model the speech of a new speaker. Speaker normalization is accomplished by a probabilistic spectral mapping from one speaker to another. For a 350 word task with a grammar and using only 15 seconds of speech for normalization, the recognition accuracy is 97% averaged over 6 speakers. This accuracy would normally require over 5 minutes of speaker dependent training. We derive the probabilistic spectral transformation of HMMs, describe an algorithm to estimate the transformation, and present recognition results.
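The core of the probabilistic spectral mapping can be sketched as a linear map on discrete HMM output distributions: each prototype-speaker distribution over quantized spectra is pushed through an estimated matrix of conditional probabilities. The matrix values below are invented; only the shape of the operation reflects the abstract:

```python
import numpy as np

def transform_output_distribution(b_prototype, spectral_map):
    """b_prototype: (K,) output distribution over prototype-speaker spectra.
    spectral_map[k, l] = P(new speaker emits spectrum k | prototype spectrum l),
    i.e. a column-stochastic matrix estimated from the adaptation speech."""
    return spectral_map @ b_prototype

b = np.array([0.5, 0.5])                       # prototype output distribution
m = np.array([[0.9, 0.2], [0.1, 0.8]])         # hypothetical spectral mapping
print(transform_output_distribution(b, m))     # adapted distribution, sums to 1
```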

76 citations


PatentDOI
TL;DR: In this article, an apparatus operates to identify the speech signal of an unknown speaker as one of a finite number of speakers, each speaker is modeled and recognized with any example of their speech, and the output is a list of scores that measure how similar the input speaker is to each of the speakers whose models are stored in the system.
Abstract: An apparatus operates to identify the speech signal of an unknown speaker as one of a finite number of speakers. Each speaker is modeled and recognized with any example of their speech. The input to the system is analog speech and the output is a list of scores that measure how similar the input speaker is to each of the speakers whose models are stored in the system. The system includes front end processing means which is responsive to the speech signal to provide digitized samples of the speech signal at an output which are stored in a memory. The stored digitized samples are then retrieved and divided into frames. The frames are processed to provide a series of speech parameters indicative of the nature of the speech content in each of the frames. The processor for producing the speech parameters is coupled to either a speaker modeling means, whereby a model for each speaker is provided and consequently stored, or a speaker recognition mode, whereby the speech parameters are again processed with current parameters and compared with the stored parameters during each speech frame. The comparison is accomplished over a predetermined number of frames whereby a favorable comparison is indicative of a known speaker for which a model is stored.

65 citations


Journal ArticleDOI
TL;DR: Several vector quantization approaches to the problem of text-dependent speaker verification are described in this paper, where a source codebook is designed to represent a particular speaker saying a particular utterance, and this same utterance is spoken by a speaker to be verified and is encoded in the source codebook representing the speaker whose identity was claimed.
Abstract: Several vector quantization approaches to the problem of text-dependent speaker verification are described. In each of these approaches, a source codebook is designed to represent a particular speaker saying a particular utterance. Later, this same utterance is spoken by a speaker to be verified and is encoded in the source codebook representing the speaker whose identity was claimed. The speaker is accepted if the verification utterance's quantization distortion is less than a prespecified speaker-specific threshold. The best approach achieved a 0.7 percent false acceptance rate and a 0.6 percent false rejection rate on a speaker population comprising 16 admissible speakers and 111 casual imposters. The approaches are described, and detailed experimental results are presented and discussed.
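The accept/reject rule described above can be sketched directly: encode the verification utterance in the claimed speaker's codebook and accept when the average quantization distortion stays below a speaker-specific threshold. The feature frames, codebook, and threshold below are invented for illustration:

```python
import numpy as np

def vq_distortion(frames, codebook):
    """Average squared distance from each frame to its nearest codeword."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()

def verify(frames, codebook, threshold):
    """Accept the identity claim if the encoding distortion is low enough."""
    return vq_distortion(frames, codebook) < threshold

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # claimed speaker's codebook
frames = np.array([[0.1, 0.0], [0.9, 1.0]])     # verification utterance
print(verify(frames, codebook, threshold=0.1))
```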

58 citations


Journal ArticleDOI
TL;DR: A set of dynamic adaptation procedures for updating expected feature values during recognition is described; the algorithm uses maximum a posteriori probability (MAP) estimation techniques to update the mean vectors of sets of feature values on a speaker-by-speaker basis.
Abstract: In this paper, we describe efforts to improve the performance of FEATURE, the Carnegie-Mellon University speaker-independent speech recognition system that classifies isolated letters of the English alphabet, by enabling the system to learn the acoustical characteristics of individual speakers. Even when features are designed to be speaker-independent, it is frequently observed that feature values may vary more from speaker to speaker for a single letter than they vary from letter to letter. In these cases, it is necessary to adjust the system's statistical description of the features of individual speakers to obtain improved recognition performance. This paper describes a set of dynamic adaptation procedures for updating expected feature values during recognition. The algorithm uses maximum a posteriori probability (MAP) estimation techniques to update the mean vectors of sets of feature values on a speaker-by-speaker basis. The MAP estimation algorithm makes use of both knowledge of the observations input to the system from an individual speaker and the relative variability of the features' means within and across all speakers. In addition, knowledge of the covariance of the features' mean vectors across the various letters enables the system to adapt its representation of similar-sounding letters after any one of them is presented to the classifier. The use of dynamic speaker adaptation improves classification performance of FEATURE by 49 percent after four presentations of the alphabet, when the system is provided with supervised training indicating which specific utterance had been presented to the classifier from a particular user. Performance can be improved by as much as 31 percent when the system is allowed to adapt passively, in an unsupervised learning mode, without any information from individual users.
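The heart of MAP mean adaptation, in the scalar Gaussian case, is a convex combination of the speaker-independent prior mean and the sample mean of the new speaker's observations, weighted by their relative variances. This is a minimal sketch of that standard update, not the paper's full vector algorithm; the variances are illustrative:

```python
def map_update_mean(prior_mean, prior_var, obs_var, observations):
    """MAP estimate of a Gaussian mean with a Gaussian prior:
    the more observations, the more weight the new speaker's data gets."""
    n = len(observations)
    sample_mean = sum(observations) / n
    w = n * prior_var / (n * prior_var + obs_var)  # weight on the new data
    return w * sample_mean + (1 - w) * prior_mean

# One observation with equal variances: the estimate moves halfway.
print(map_update_mean(0.0, 1.0, 1.0, [2.0]))
```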

41 citations


Proceedings ArticleDOI
01 Apr 1987
TL;DR: Comparison of performance of the two methods shows that a new speaker's codebook is not necessary to represent the new speaker, and a vector quantization approach to speaker adaptation is evaluated.
Abstract: In view of designing a speaker-independent large vocabulary recognition system, we evaluate a vector quantization approach to speaker adaptation. Only one speaker (the reference speaker) pronounces the application vocabulary. He also pronounces a small vocabulary called the adaptation vocabulary. Each new speaker then merely pronounces the adaptation vocabulary. Two adaptation methods are investigated, establishing a correspondence between the codebooks of these two speakers. This allows us to transform the reference utterances of the reference speaker into suitable references for the new speaker. Method I uses a transposed codebook to represent the new speaker during the recognition process whereas Method II uses a codebook which is obtained by clustering on the new speaker's pronunciation of the adaptation vocabulary. Experiments were carried out on a 20-speaker database (10 male, 10 female). The adaptation vocabulary contains 136 words; the application one has 104 words. The mean recognition error rate without adaptation is 22.3% for inter-speaker experiments; after one of the two methods has been implemented the mean recognition error rate is 10.5%. Comparison of performance of the two methods shows that a new speaker's codebook is not necessary to represent the new speaker.
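One plausible reading of Method I's "transposed codebook" can be sketched as follows: each reference-speaker codeword is replaced by the mean of the new speaker's adaptation frames that quantize to it, establishing the codebook correspondence. The exact correspondence procedure in the paper may differ; the data here are invented:

```python
import numpy as np

def transpose_codebook(ref_codebook, new_speaker_frames):
    """Map each reference codeword to the centroid of the new speaker's
    adaptation frames nearest to it (codewords with no frames are kept)."""
    d = ((new_speaker_frames[:, None, :] - ref_codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)  # reference codeword chosen by each frame
    out = ref_codebook.copy()
    for k in range(len(ref_codebook)):
        hits = new_speaker_frames[nearest == k]
        if len(hits):
            out[k] = hits.mean(axis=0)
    return out

ref = np.array([[0.0, 0.0], [10.0, 10.0]])
frames = np.array([[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]])
print(transpose_codebook(ref, frames))
```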

13 citations



Proceedings ArticleDOI
01 Apr 1987
TL;DR: An automatic speech recognition system for the Italian language has been developed at the IBM Italy Scientific Center in Rome; it recognizes in real time natural-language sentences, composed of words from a 6500-item dictionary, dictated by a speaker with short pauses between words.
Abstract: An automatic speech recognition system for the Italian language has been developed at the IBM Italy Scientific Center in Rome. It is able to recognize in real time natural-language sentences, composed of words from a dictionary of 6500 items, dictated by a speaker with short pauses between them. The system is speaker dependent: before using it, the speaker has to perform a training stage, reading a predefined text 15--20 minutes long. It runs on an architecture composed of an IBM 3090 mainframe and a PC/AT-based workstation with signal-processing equipment.

5 citations


Journal ArticleDOI
Mike Talbot1
TL;DR: It was found that a modified means of template formation, giving rise to more representative templates, could improve recognition figures, especially for female speakers.
Abstract: Many automatic speech recognisers work on the principle of matching incoming utterances to a library of stored voice templates. There are two main shortcomings of this approach, which can potentially be overcome by careful interface design. Firstly, the templates, collected under strictly controlled conditions, are not necessarily representative of the speaker's normal voice. Secondly, although the speaker's voice is likely to alter during the course of using the speech recogniser, the templates representing that voice will remain unchanged. This will result in a gradual lessening of the similarity of template and utterance. In the context of an information-retrieval task using fully automatic speech recognition, attempts were made to overcome the above problems. It was found that a modified means of template formation, giving rise to more representative templates, could improve recognition figures, especially for female speakers. However, attempts at constantly updating the templates in accordance with drifts in the speaker's diction were ineffectual in this instance. This latter result conflicts with the results of earlier, comparable studies.

5 citations


Proceedings ArticleDOI
01 Apr 1987
TL;DR: An adaptation algorithm using Parzen estimation and interpolation of the emission densities between the new and the old speaker models was investigated and is able to give satisfactory recognition rates adapting the HMMs on the basis of only 40 training words uttered by the new speaker.
Abstract: The main problems with HMMs of sub-word units are the large amount of training data and the computer time needed for estimating the parameters of the models. In some applications it is not practical for a new speaker to utter many hundreds of words to train the system, so interest arises in quick adaptation based on a few tens of training utterances. Two bounds are given for comparison with the results of the speaker adaptation, namely the recognition rates of speaker-dependent and cross-speaker recognition. Speaker-dependent recognition is achieved by training the HMMs with nearly 1000 words uttered by the same speaker used in the tests. Cross-speaker recognition, which gives a lower bound on the performance, concerns experiments in which the models were trained by a speaker different from the one who uttered the test sentences. An adaptation algorithm using Parzen estimation and interpolation of the emission densities between the new and the old speaker models was investigated. It gives satisfactory recognition rates when adapting the HMMs on the basis of only 40 training words uttered by the new speaker.
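The adaptation idea above can be sketched in one dimension: build a Parzen (kernel) density estimate from the new speaker's few adaptation samples, then interpolate it with the old speaker-dependent emission density. The Gaussian kernel, the bandwidth `h`, and the interpolation weight `lam` are assumptions for illustration; the paper does not specify them here:

```python
import math

def parzen(samples, h):
    """Parzen density estimate with a Gaussian kernel of bandwidth h."""
    def density(x):
        return sum(math.exp(-((x - s) / h) ** 2 / 2) for s in samples) / (
            len(samples) * h * math.sqrt(2 * math.pi))
    return density

def interpolated_density(old_density, new_samples, h=0.5, lam=0.6):
    """Emission density interpolated between the old speaker model and a
    Parzen estimate from the new speaker's adaptation data."""
    new_density = parzen(new_samples, h)
    return lambda x: lam * new_density(x) + (1 - lam) * old_density(x)
```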

Journal ArticleDOI
TL;DR: In this article, evidence for and against these various possibilities from a large number of psycholinguistic studies, particularly those dealing with various sex-specific effects in a variety of languages, is surveyed, and the implications for the design of automatic speech recognition and understanding systems are discussed.
Abstract: If the same phonetic string is spoken by a male speaker and a female speaker, both drawn from a linguistically homogeneous population, then a listener perceives that both speakers are saying the same sounds but generally knows that the speakers are different and, in particular, that the speakers are of a different sex. What is not well understood is whether the listener has to determine the identity of the sex of the speaker in order to determine the identity of the sounds, or whether the two pieces of information are essentially independent and can be determined concurrently, or whether the identity of the sounds can be roughly determined without knowledge of the speaker's sex but fine recognition is possible only after the speaker's sex is known. In this presentation, evidence for and against these various possibilities from a large number of psycholinguistic studies, particularly those dealing with various sex‐specific effects in a variety of languages, is surveyed and the implications for the design of automatic speech recognition and understanding systems are discussed.

Proceedings ArticleDOI
01 Jan 1987
TL;DR: By taking advantage of the four-tone structure in the pitch contour of Mandarin speech, text-independent speaker identification using orthogonal pitch parameters is described; combining pitch-contour parameters with vocal-tract parameters outperforms using either alone.
Abstract: By taking advantage of the four-tone structure in the pitch contour of Mandarin speech, text-independent speaker identification using orthogonal pitch parameters is described. The slope, mean, and duration of the pitch contour of each word in an utterance are taken as recognition features. An 85% identification rate is achieved using parameters of the pitch contour only. When pitch-contour parameters are combined with vocal-tract parameters, the system outperforms one using either set of parameters alone: a recognition rate of 99.2% is reached.
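The per-word pitch features named above (slope, mean, duration) can be sketched from an F0 track as follows. The least-squares fit for "slope" and the 10 ms frame period are assumptions, not taken from the paper:

```python
def pitch_features(f0_track, frame_period=0.01):
    """Slope (least-squares fit, Hz/frame), mean (Hz), and duration (s)
    of one word's pitch contour, given F0 values at a fixed frame rate."""
    n = len(f0_track)
    mean = sum(f0_track) / n
    t_mean = (n - 1) / 2
    num = sum((i - t_mean) * (f - mean) for i, f in enumerate(f0_track))
    den = sum((i - t_mean) ** 2 for i in range(n))
    slope = num / den if den else 0.0
    duration = n * frame_period
    return slope, mean, duration

# A rising contour: 100 -> 110 -> 120 Hz over three 10 ms frames.
print(pitch_features([100.0, 110.0, 120.0]))
```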


Dissertation
01 Jan 1987
TL;DR: A recognition system is proposed, which overcomes difficulties by employing vector quantization techniques to reduce the storage of reference patterns, and eliminating the need for dynamic time warping which reduces the computational complexity of the system.
Abstract: The work presented in this thesis concerns the recognition of isolated words using a pattern matching approach. In such a system, an unknown speech utterance, which is to be identified, is transformed into a pattern of characteristic features. These features are then compared with a set of pre-stored reference patterns that were generated from the vocabulary words. The unknown word is identified as the vocabulary word for which the reference pattern gives the best match. One of the major difficulties in the pattern comparison process is that speech patterns obtained from the same word exhibit non-linear temporal fluctuations and thus a high degree of redundancy. The initial part of this thesis considers various dynamic time warping techniques used for normalizing the temporal differences between speech patterns. Redundancy removal methods are also considered, and their effect on the recognition accuracy is assessed. Although the use of dynamic time warping algorithms provides considerable improvement in the accuracy of isolated word recognition schemes, the performance is ultimately limited by their poor ability to discriminate between acoustically similar words. Methods for enhancing the identification rate among acoustically similar words, by using common pattern features for similar-sounding regions, are investigated. Pattern-matching-based, speaker-independent systems can only operate with a high recognition rate by using multiple reference patterns for each of the words included in the vocabulary. These patterns are obtained from the utterances of a group of speakers. The use of multiple reference patterns not only leads to a large increase in the memory requirements of the recognizer, but also an increase in the computational load.
A recognition system is proposed in this thesis which overcomes these difficulties by (i) employing vector quantization techniques to reduce the storage of reference patterns, and (ii) eliminating the need for dynamic time warping, which reduces the computational complexity of the system. Finally, a method of identifying the acoustic structure of an utterance in terms of voiced, unvoiced, and silence segments by using fuzzy set theory is proposed. The acoustic structure is then employed to enhance the recognition accuracy of a conventional isolated word recognizer.
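A fuzzy voiced/unvoiced/silence labelling of the kind proposed above might look like the following sketch. The choice of frame energy and zero-crossing rate as inputs, the piecewise-linear membership functions, and every threshold are invented for illustration; the thesis's actual memberships will differ:

```python
def ramp(x, lo, hi):
    """Piecewise-linear membership: 0 below lo, 1 above hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def vus_memberships(energy, zcr):
    """Fuzzy memberships of one frame in the voiced, unvoiced, and silence
    classes, from normalized frame energy and zero-crossing rate."""
    silence = 1.0 - ramp(energy, 0.01, 0.1)
    voiced = ramp(energy, 0.05, 0.3) * (1.0 - ramp(zcr, 0.2, 0.5))
    unvoiced = ramp(zcr, 0.2, 0.5) * ramp(energy, 0.01, 0.1)
    return {"voiced": voiced, "unvoiced": unvoiced, "silence": silence}

# High energy, low zero-crossing rate: clearly voiced.
print(vus_memberships(0.5, 0.1))
```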

01 Jan 1987
TL;DR: The speaker verification technique was evaluated with over 5000 words from 13 speakers employing telephone handsets and accessing a verification simulation over the switched telephone network; results are significantly degraded if the prompted words are selected at random from the entire training vocabulary.
Abstract: The speaker verification technique was evaluated with over 5000 words from 13 speakers employing telephone handsets and accessing a verification simulation over the switched telephone network. A total of 977 verification attempts and 21421 impersonation attempts were simulated. A false acceptance rate of 0.1% was attained after an average of 1.2 words input. The same conditions gave rise to a false rejection rate of 0.5% after an average of 2 words input. The above results are significantly degraded if the prompted words are selected at random from the entire training vocabulary and not according to their power to discriminate among the enrolled speakers.



Proceedings ArticleDOI
01 Apr 1987
TL;DR: It is shown that the distribution of vowels in eigenvector space can be used for speaker identification; the analysis is based on a linear transformation of the principal components in the three-formant space, and comparison with a vowel space model is also investigated.
Abstract: Phonetic components of vocalic segments are primarily contained in the three lowest formants in speech signals. However, different vocal tracts and speaking habits do show formant pattern differences for different speakers. This study intends to analyze or to separate phonetic and speaker characteristics into a set of orthogonal dimensions. Statistical analysis by a linear transformation of the principal components in the three-formant space is the basis of the study. Experiments on the vocalic segments of a conversational database, and vowel portions of specific digits in a connected digit database are reported. Experimental recognition results and a statistical interpretation are presented. Comparison with a vowel space model is also investigated. It is shown that the distribution of vowels in eigenvector space can be used for speaker identification.
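The principal-component analysis of the three-formant space described above can be sketched as follows on invented formant data: eigenvectors of the sample covariance give the orthogonal dimensions in which vowel distributions of different speakers can be compared.

```python
import numpy as np

def formant_pca(formants):
    """formants: (n_segments, 3) array of (F1, F2, F3) values in Hz.
    Returns eigenvalues (descending), eigenvectors, and projected data."""
    centered = formants - formants.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = eigvals.argsort()[::-1]          # sort by decreasing variance
    return eigvals[order], eigvecs[:, order], centered @ eigvecs[:, order]

# Hypothetical vocalic segments (F1, F2, F3) in Hz.
formants = np.array([[500.0, 1500.0, 2500.0],
                     [700.0, 1100.0, 2600.0],
                     [300.0, 2300.0, 2400.0],
                     [600.0, 1700.0, 2550.0]])
vals, vecs, projections = formant_pca(formants)
print(vals)  # variance captured along each orthogonal dimension
```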