
Showing papers on "Speaker diarisation published in 1990"


Proceedings ArticleDOI
03 Apr 1990
TL;DR: A technique for using the speech of multiple reference speakers as a basis for speaker adaptation in large-vocabulary continuous-speech recognition is introduced, and the usual probabilistic spectrum transformation can be applied to the reference HMM to model a new speaker.
Abstract: A technique for using the speech of multiple reference speakers as a basis for speaker adaptation in large-vocabulary continuous-speech recognition is introduced. In contrast to other methods that use a pooled reference model, this technique normalizes the training speech from multiple reference speakers to a single common feature space before pooling it. The normalized and pooled speech is then treated as if it came from a single reference speaker for training the reference hidden Markov model (HMM). The usual probabilistic spectrum transformation can be applied to the reference HMM to model a new speaker. Preliminary experimental results are reported from applying this approach to over 100 reference speakers from the speaker-independent portion of the DARPA 1000-Word Resource Management Database.
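A minimal sketch of the pre-pooling normalization idea, using per-speaker mean/variance scaling as a stand-in for the paper's probabilistic spectrum transformation (the details here are illustrative, not the authors' implementation):

```python
import numpy as np

def normalize_to_common_space(speaker_feats):
    """Map each reference speaker's features to a shared space before pooling.

    speaker_feats: list of (n_frames_i, dim) arrays, one per reference speaker.
    Per-speaker mean/variance normalization is an assumed simplification.
    """
    pooled = []
    for feats in speaker_feats:
        mu = feats.mean(axis=0)
        sigma = feats.std(axis=0) + 1e-8   # avoid division by zero
        pooled.append((feats - mu) / sigma)
    # Treat the normalized pool as if it came from one reference speaker.
    return np.vstack(pooled)
```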

179 citations


Journal ArticleDOI
TL;DR: The task of speaker verification, a subset of the general problem of speaker recognition, is defined and the feature selection and pattern matching steps of the recognition procedure are examined.
Abstract: The task of speaker verification, a subset of the general problem of speaker recognition, is defined. The feature selection and pattern matching steps of the recognition procedure are examined. Speaker verification system design and performance are discussed, and databases for evaluating them are briefly considered. An example of a speaker verification system is described. An overview of industry research in this area is given.

146 citations


Proceedings ArticleDOI
03 Apr 1990
TL;DR: An acoustic-class-dependent technique for text-independent speaker identification on very short utterances is described, based on maximum-likelihood estimation of a Gaussian mixture model representation of speaker identity.
Abstract: An acoustic-class-dependent technique for text-independent speaker identification on very short utterances is described. The technique is based on maximum-likelihood estimation of a Gaussian mixture model representation of speaker identity. Gaussian mixtures are noted for their robustness as a parametric model and their ability to form smooth estimates of rather arbitrary underlying densities. Speaker model parameters are estimated using a special case of the iterative expectation-maximization (EM) algorithm, and a number of techniques are investigated for improving model robustness. The system is evaluated using a population of 12 reference speakers from a conversational speech database. It achieves 80% average text-independent speaker identification performance for a 1-s test utterance length.
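The core pipeline maps directly onto a short sketch. Here scikit-learn's GaussianMixture stands in for the paper's EM training; the component count and feature layout are placeholders:

```python
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(enroll_feats, n_components=8):
    """Fit one Gaussian mixture per speaker with the EM algorithm.
    enroll_feats: dict speaker_id -> (n_frames, dim) feature array."""
    return {spk: GaussianMixture(n_components, covariance_type="diag").fit(X)
            for spk, X in enroll_feats.items()}

def identify(test_feats, gmms):
    """Return the speaker whose model maximizes the average frame log-likelihood."""
    return max(gmms, key=lambda spk: gmms[spk].score(test_feats))
```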

122 citations


PatentDOI
TL;DR: In this article, an enrollment process creates a set of speaker-specific enrollment parameters for normalizing analysis parameters including the speaker's pitch, the frequency spectrum of the speech as a function of time, and certain measurements of the speech signal in the time domain.
Abstract: The present invention processes an independent body of speech during an enrollment process and creates a set of speaker-specific enrollment parameters for normalizing analysis parameters including the speaker's pitch, the frequency spectrum of the speech as a function of time, and certain measurements of the speech signal in the time domain. A particular objective of the invention is to make these analysis parameters have the same meaning from speaker to speaker. Thus, after the pre-processing performed by this invention, the parameters would look much the same for the same word independent of speaker. In this manner, variations in the speech signal caused by the physical makeup of a speaker's throat, mouth, lips, teeth, and nasal cavity would be, at least in part, reduced by the pre-processing.
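A hedged sketch of the enrollment/normalization flow described above; the concrete parameter set (median pitch, spectral mean and deviation) is an assumption, since the patent's exact parameters are not given here:

```python
import numpy as np

def enroll(speech_frames, pitch_track):
    """Derive speaker-specific enrollment parameters from an independent
    body of speech (illustrative parameter choices)."""
    return {
        "pitch_ref": np.median(pitch_track),           # speaker's typical pitch
        "spec_mean": speech_frames.mean(axis=0),       # average spectrum shape
        "spec_std":  speech_frames.std(axis=0) + 1e-8,
    }

def normalize(frames, pitch_track, params):
    """Make analysis parameters 'mean the same thing' across speakers."""
    pitch = pitch_track / params["pitch_ref"]          # relative pitch
    spec = (frames - params["spec_mean"]) / params["spec_std"]
    return spec, pitch
```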

92 citations


Journal ArticleDOI
TL;DR: In this article, five approaches that can be used to control and simplify the speech recognition task are examined: isolated words, speaker-dependent systems, limited vocabulary size, a tightly constrained grammar, and quiet and controlled environmental conditions.
Abstract: Five approaches that can be used to control and simplify the speech recognition task are examined. They entail the use of isolated words, speaker-dependent systems, limited vocabulary size, a tightly constrained grammar, and quiet and controlled environmental conditions. The five components of a speech recognition system are described: a speech capture device, a digital signal processing module, preprocessed signal storage, reference speech patterns, and a pattern-matching algorithm. Current speech recognition systems are reviewed and categorized. Speaker recognition approaches and systems are also discussed.
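Of the five components, the pattern-matching algorithm is the most self-contained. A generic dynamic-time-warping matcher of the kind used in isolated-word recognizers of this era (not any specific system reviewed in the article) looks like this:

```python
import numpy as np

def dtw_distance(ref, test):
    """Dynamic-time-warping distance between two feature sequences."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - test[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(test, templates):
    """Pick the reference word whose stored pattern warps to the input most cheaply."""
    return min(templates, key=lambda w: dtw_distance(templates[w], test))
```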

87 citations


Proceedings Article
John S. Bridle, Stephen Cox
01 Oct 1990
TL;DR: A method of training this network to "tune in" the speaker parameters to a particular speaker is outlined, based on a trick for converting a supervised network to an unsupervised mode; the results indicate an improvement over speaker-independent performance and, for unlabelled data, a performance close to that achieved on labelled data.
Abstract: A particular form of neural network is described, which has terminals for acoustic patterns, class labels and speaker parameters. A method of training this network to "tune in" the speaker parameters to a particular speaker is outlined, based on a trick for converting a supervised network to an unsupervised mode. We describe experiments using this approach in isolated word recognition based on whole-word hidden Markov models. The results indicate an improvement over speaker-independent performance and, for unlabelled data, a performance close to that achieved on labelled data.

65 citations


Proceedings ArticleDOI
03 Apr 1990
TL;DR: It is shown that different classes of phonemes are not equally effective in discriminating between speakers and that verification performance can be considerably improved by separately classifying speech segments representing each broad phonetic category as belonging to an impostor or as belonging to the true speaker.
Abstract: A text-independent speaker verification system based on an adaptive vocal tract model which emulates the vocal tract of the speaker is described. Each speaker is represented by a set of feature vectors derived from speech segments belonging to different classes of phonemes. Linear predictive hidden Markov modeling and maximum-likelihood Viterbi decoding are applied to a speech utterance to obtain different classes of phonemes pronounced by a speaker. It is shown that different classes of phonemes are not equally effective in discriminating between speakers and that verification performance can be considerably improved by separately classifying speech segments representing each broad phonetic category as belonging to an impostor or as belonging to the true speaker. A weighted linear combination of scores for individual categories can be used as the final verification score. The weights are chosen to reflect the effectiveness of particular classes of phonemes in discriminating between speakers and are adjusted to maximize the verification performance.
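The final scoring step, a weighted linear combination of per-phonetic-class scores compared against a threshold, can be sketched directly; the class names, weights, and threshold below are placeholders, not the paper's tuned values:

```python
def verification_score(class_scores, weights):
    """Weighted linear combination of per-phoneme-class scores."""
    total = sum(weights[c] * class_scores[c] for c in class_scores)
    return total / sum(weights[c] for c in class_scores)

def verify(class_scores, weights, threshold=0.5):
    """Accept the claimed identity if the combined score clears the threshold."""
    return verification_score(class_scores, weights) >= threshold

# Example with hypothetical classes and weights:
scores  = {"vowels": 0.9, "nasals": 0.7, "fricatives": 0.4}
weights = {"vowels": 1.0, "nasals": 0.8, "fricatives": 0.3}
print(verify(scores, weights))
```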

60 citations


PatentDOI
TL;DR: In this paper, a feature extracting part extracts features of an unknown speaker for every segmented block by using time-series acoustic parameters, and a distance calculating part calculates a distance between the features extracted by the feature extracting part and reference features stored in a memory.
Abstract: In a speaker verification system, a detecting part detects a speech section of an input speech signal by using time-series acoustic parameters thereof. A segmentation part calculates individuality information for segmentation by using the time-series acoustic parameters within the speech section, and segments the input speech section into a plurality of blocks based on the individuality information. A feature extracting part extracts features of an unknown speaker for every segmented block by using the time-series acoustic parameters. A distance calculating part calculates a distance between the features of the speaker extracted by the feature extracting part and reference features stored in a memory. A decision part decides whether or not the unknown speaker is the real speaker by comparing the calculated distance with a predetermined threshold value. Segmentation is performed by calculating a primary moment of the spectrum over a block and finding successive values which satisfy a predetermined criterion.
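The segmentation criterion, a primary (first) moment of the spectrum computed over successive frames, can be sketched as follows; the concrete criterion function is an assumption, since the patent does not fix one here:

```python
import numpy as np

def spectral_first_moment(power_spectrum, freqs):
    """Primary (first) moment of the spectrum: the spectral centroid."""
    return (freqs * power_spectrum).sum() / (power_spectrum.sum() + 1e-12)

def segment_blocks(frames, freqs, criterion=lambda m: m > 1000.0):
    """Group successive frames whose first moment satisfies a criterion
    into blocks (the default criterion is a placeholder)."""
    blocks, current = [], []
    for frame in frames:
        if criterion(spectral_first_moment(frame, freqs)):
            current.append(frame)
        elif current:
            blocks.append(np.array(current))
            current = []
    if current:
        blocks.append(np.array(current))
    return blocks
```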

47 citations


Patent
16 Nov 1990
TL;DR: In this paper, various functions associated with some of the words or instructions recognizable by a speaker independent voice recognition device are presented to an operator via one or more menus (200a-200d) so that the operator may select any of several functions by using a limited set of speaker independent commands.
Abstract: Various functions (or portions thereof) are associated with some of the words or instructions recognizable by a speaker independent voice recognition device (128). This association is presented to an operator via one or more menus (200a-200d) so that the operator may select any of several functions by use of a limited set of speaker independent commands.
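The mechanism reduces to a table binding a small, fixed command vocabulary to arbitrary functions; a toy sketch (the command words and functions below are invented for illustration):

```python
# Menu binding a limited speaker-independent vocabulary to functions.
MENU = {
    "one":   ("Redial",        lambda: print("redialing")),
    "two":   ("Volume up",     lambda: print("volume up")),
    "three": ("Battery level", lambda: print("battery ok")),
}

def show_menu():
    """Present the association to the operator."""
    for word, (label, _) in MENU.items():
        print(f"Say '{word}' for {label}")

def on_recognized(word):
    """Dispatch whatever function the current menu associates with the word."""
    if word in MENU:
        MENU[word][1]()
```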

46 citations


Proceedings ArticleDOI
24 Jun 1990
TL;DR: Recent efforts to further improve the performance of the Sphinx system for speaker-independent continuous speech recognition are reported, with incorporation of additional dynamic features, semi-continuous hidden Markov models, and speaker clustering.
Abstract: The paper reports recent efforts to further improve the performance of the Sphinx system for speaker-independent continuous speech recognition. The recognition error rate is significantly reduced with the incorporation of additional dynamic features, semi-continuous hidden Markov models, and speaker clustering. For the June 1990 (RM2) evaluation test set, the error rates of our current system are 4.3% and 19.9% for the word-pair grammar and no grammar, respectively.
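Of the three improvements, the "additional dynamic features" are the easiest to make concrete. Below is one standard construction, delta coefficients by linear regression over a short window; the exact Sphinx feature set may differ:

```python
import numpy as np

def add_dynamic_features(cepstra, window=2):
    """Append first-order delta (dynamic) features computed by linear
    regression over a +/- `window` frame neighborhood."""
    T, _ = cepstra.shape
    padded = np.pad(cepstra, ((window, window), (0, 0)), mode="edge")
    num = sum(k * (padded[window + k: window + k + T] -
                   padded[window - k: window - k + T])
              for k in range(1, window + 1))
    den = 2 * sum(k * k for k in range(1, window + 1))
    return np.hstack([cepstra, num / den])
```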

26 citations


Proceedings ArticleDOI
Biing-Hwang Juang, F.K. Soong
03 Apr 1990
TL;DR: It is found that incorporation of memory in source coders in general enhances the speaker recognition accuracy but that more remarkable improvements can be accomplished by properly including potential source variations in the coder design/training.
Abstract: The use of nonmemoryless source coders in speaker recognition problems is studied, and the effects of source variations, including speaking inconsistency and channel mismatch, in source coder designs for the intended application are discussed. It is found that incorporation of memory in source coders in general enhances the speaker recognition accuracy but that more remarkable improvements can be accomplished by properly including potential source variations in the coder design/training. An experiment with a 100-speaker database shows a 99.5% recognition accuracy.
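As a point of reference, the memoryless-VQ baseline that the paper improves on can be sketched in a few lines; the paper's contribution, adding memory to the coder and designing for source variation, is not captured by this sketch:

```python
import numpy as np

def vq_distortion(feats, codebook):
    """Average distance from each frame to its nearest codeword."""
    d = np.linalg.norm(feats[:, None, :] - codebook[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def recognize_speaker(feats, codebooks):
    """Pick the speaker whose codebook quantizes the test speech
    with the least distortion."""
    return min(codebooks, key=lambda spk: vq_distortion(feats, codebooks[spk]))
```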



Proceedings ArticleDOI
24 Jun 1990
TL;DR: A new paradigm for speaker-independent (SI) training of hidden Markov models (HMM) is presented, which uses a large amount of speech from a few speakers instead of the traditional practice of using a little speech from many speakers.
Abstract: This paper reports on two contributions to large vocabulary continuous speech recognition. First, we present a new paradigm for speaker-independent (SI) training of hidden Markov models (HMM), which uses a large amount of speech from a few speakers instead of the traditional practice of using a little speech from many speakers. In addition, combination of the training speakers is done by averaging the statistics of independently trained models rather than the usual pooling of all the speech data from many speakers prior to training. With only 12 training speakers for SI recognition, we achieved a 7.5% word error rate on a standard grammar and test set from the DARPA Resource Management corpus. This performance is comparable to our best condition for this test suite, using 109 training speakers.

Second, we show a significant improvement for speaker adaptation (SA) using the new SI corpus and a small amount of speech from the new (target) speaker. A probabilistic spectral mapping is estimated independently for each training (reference) speaker and the target speaker. Each reference model is transformed to the space of the target speaker and combined by averaging. Using only 40 utterances from the target speaker for adaptation, the error rate dropped to 4.1%, a 45% reduction in error compared to the SI result.
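The model-averaging idea, combining sufficient statistics of independently trained models instead of pooling raw speech, can be sketched for Gaussian state outputs; the statistics layout below is an assumed simplification of the paper's HMM estimation:

```python
def average_models(models):
    """Combine per-speaker models by averaging sufficient statistics.
    models: list of dicts, state -> {"count", "mean", "second_moment"},
    one dict per independently trained speaker model."""
    combined = {}
    for state in models[0]:
        n  = sum(m[state]["count"] for m in models)
        mu = sum(m[state]["count"] * m[state]["mean"] for m in models) / n
        m2 = sum(m[state]["count"] * m[state]["second_moment"] for m in models) / n
        combined[state] = {"mean": mu, "var": m2 - mu ** 2}
    return combined
```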

Proceedings ArticleDOI
03 Apr 1990
TL;DR: The principle of trajectory space comparison for text-independent speaker recognition and some solutions to the space comparison problem based on vector quantization are presented and the comparison of the recognition rates of different solutions is reported.
Abstract: The principle of trajectory space comparison for text-independent speaker recognition and some solutions to the space comparison problem based on vector quantization are presented. The comparison of the recognition rates of different solutions is reported. The experimental system achieves a 99.5% text-independent speaker recognition rate for 23 speakers, using five phrases for training and five for test. A speaker-independent continuous speech recognition system is built in which this principle is used for speaker adaptation.
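One plausible realization of trajectory-space comparison via vector quantization is a symmetric codebook-to-codebook distance; the paper evaluates several variants, so the version below is only an assumed illustration:

```python
import numpy as np

def space_distance(cb_a, cb_b):
    """Symmetric average distance from each codeword of one speaker's
    codebook to the nearest codeword of the other's."""
    d = np.linalg.norm(cb_a[:, None, :] - cb_b[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```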


Proceedings ArticleDOI
Stephen Cox, J.S. Bridle
03 Apr 1990
TL;DR: Results of using this technique with whole-word hidden Markov models (HMMs) indicate an improvement over speaker-independent performance and, for unlabeled data, a performance close to that achieved on labeled data.
Abstract: A particular form of neural network is described which has terminals for acoustic patterns, class labels, and speaker parameters. A method of training this network to tune in the speaker parameters to a new speaker is outlined. This process can also be viewed from a Bayesian perspective as maximizing the likelihood of the speaker's data by optimizing the model and speaker parameters. A method for doing this when the data are labeled is described. Results of using this technique with whole-word hidden Markov models (HMMs) indicate an improvement over speaker-independent performance and, for unlabeled data, a performance close to that achieved on labeled data.
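The Bayesian view, maximizing the likelihood of the speaker's unlabeled data over speaker parameters, can be illustrated with a toy model in which the speaker parameter is a single feature-space offset estimated by EM (the paper's network and whole-word HMMs are far richer than this sketch):

```python
import numpy as np

def tune_speaker_bias(X, means, priors, n_iter=10):
    """Unsupervised 'tuning in': estimate an offset b that maximizes the
    likelihood of unlabeled frames X under class models N(mu_k + b, I)."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # E-step: soft class posteriors under the current offset
        diff = X[:, None, :] - (means + b)[None, :, :]
        logp = -0.5 * (diff ** 2).sum(-1) + np.log(priors)[None, :]
        logp -= logp.max(axis=1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: the offset is the mean residual after class explanation
        b = (X - gamma @ means).mean(axis=0)
    return b
```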

Book ChapterDOI
01 Jan 1990
TL;DR: This paper presents a connectionist approach to automatic speaker identification, based for the first time on the LVQ (Learning Vector Quantization) algorithm, and reports the results obtained for different combinations of parameters.
Abstract: This paper presents a connectionist approach to automatic speaker identification, based for the first time on the LVQ (Learning Vector Quantization) algorithm. For each “subscriber” to the identification system, a fixed number of references is stored. The algorithm is based on a nearest-neighbor principle, with adaptation through learning. Identification is realized by comparing the distance of the unknown utterance to the nearest reference against a given threshold. Preliminary tests run on a 10-speaker set show an identification rate of 97% for MFC coefficients. We present the identification system and database used, and indicate the results obtained for different combinations of parameters. We further evaluate our system by comparing its performance with that of a Bayesian system.
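A generic LVQ1 sketch of the adaptation-through-learning step and the thresholded nearest-reference decision; the paper's exact LVQ variant, learning schedule, and distance measure are not specified here, so these are assumptions:

```python
import numpy as np

def lvq1_epoch(refs, labels, X, y, lr=0.05):
    """One LVQ1 pass: pull the nearest reference toward same-speaker frames,
    push it away from other-speaker frames."""
    for x, spk in zip(X, y):
        i = np.argmin(np.linalg.norm(refs - x, axis=1))   # nearest reference
        sign = 1.0 if labels[i] == spk else -1.0
        refs[i] += sign * lr * (x - refs[i])
    return refs

def identify(refs, labels, utterance, threshold):
    """Accept the nearest reference's speaker if it is close enough.
    Averaging the utterance frames is a simplification."""
    d = np.linalg.norm(refs - utterance.mean(axis=0), axis=1)
    i = np.argmin(d)
    return labels[i] if d[i] <= threshold else None
```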



Proceedings ArticleDOI
03 Apr 1990
TL;DR: A method of dealing with articulatory speaker variations in hidden Markov models (HMMs) for speaker adaptation is proposed; on the /b,d,g/ recognition task it achieves 82.5% recognition accuracy, better than the rates of other adaptation methods.
Abstract: A method of dealing with articulatory speaker variations in hidden Markov models (HMMs) for speaker adaptation is proposed. Speech data from many speakers are spectrally mapped onto a standard speaker. These data are used to teach the HMM the interspeaker articulatory variations that subsist across the spectral mapping. The proposed method is compared to other adaptation methods on the /b,d,g/ recognition task. The results show 82.5% recognition accuracy, which is better than the rates of the other methods. Evaluation experiments on an all-phoneme Japanese recognition task and a continuous-speech recognition task are reported. Average recognition rates over all Japanese phonemes are 71.3% and 93.2% for the best candidate and the top three candidates, respectively. These are 0.7% and 1.5% higher than the rates of the basic spectrum-mapping method. In the continuous-speech recognition experiment, average phrase recognition rates are 74.9% and 96.2% for the best candidate and the top five candidates, respectively.
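The first step, spectrally mapping many speakers' data onto a standard speaker, can be sketched as an affine least-squares mapping over time-aligned frame pairs; the paper's actual mapping estimator may differ:

```python
import numpy as np

def fit_spectral_mapping(src_frames, std_frames):
    """Least-squares affine map from a training speaker's spectra onto the
    standard speaker's, assuming frame pairs are already time-aligned
    (e.g. by DTW)."""
    X = np.hstack([src_frames, np.ones((len(src_frames), 1))])  # affine term
    W, *_ = np.linalg.lstsq(X, std_frames, rcond=None)
    return W

def apply_mapping(frames, W):
    return np.hstack([frames, np.ones((len(frames), 1))]) @ W
```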

Proceedings ArticleDOI
03 Apr 1990
TL;DR: A method is described for adapting the IBM speech recognition system in the situation where the system is already trained for a new speaker and one tries to further adapt and improve it while it is actually being used by that speaker in recognition mode.
Abstract: A method for adaptation of the IBM speech recognition system is described for the situation where the system is already trained for the new speaker and one tries to further adapt and improve the system while it is actually being used by the new speaker in recognition mode. A special kind of adaptation is investigated where the emphasis is not on the adaptation of the statistical parameters of the Markov models but on the adaptation of the structure of these models. This structure is defined by the baseforms describing the composition of word models from phone models in the system. Therefore, baseform adaptation corresponds directly to the adaptation of the system to the personal speaker characteristics of the new user. Several different baseform adaptation schemes are investigated, and it is demonstrated that for a speaker who has already trained the system and achieves a 95.2% recognition performance, the performance can be further improved to 96.3%.
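A deliberately simple stand-in for baseform adaptation: among candidate baseforms for each word, keep the one that accumulated the best acoustic score on the new speaker's recognized utterances (the paper's schemes are more elaborate than this):

```python
from collections import defaultdict

def adapt_baseforms(decoded, scores):
    """Pick, for each word, the phone-string baseform that scored best
    on the new speaker's utterances in recognition mode.
    decoded: list of (word, baseform) pairs observed during recognition
    scores:  dict (word, baseform) -> accumulated acoustic log-score"""
    best = defaultdict(lambda: (None, -float("inf")))
    for word, bf in decoded:
        s = scores[(word, bf)]
        if s > best[word][1]:
            best[word] = (bf, s)
    return {word: bf for word, (bf, _) in best.items()}
```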

Proceedings Article
01 Nov 1990
TL;DR: The results show that adaptation based on the fuzzy histogram algorithm yields the highest accuracy in an HMM-based speech recognition system.
Abstract: In this paper, we compare the performance of two-stage speaker adaptation methods for an HMM-based speech recognition system. We compare three kinds of VQ adaptation methods which may be used in the first stage to reduce the distortion error for a new speaker: label prototype adaptation, adaptation with a codebook built from the adaptation speech itself, and adaptation with a mapped codebook. We then compare the performance of four kinds of HMM parameter adaptation methods which may be used in the second stage to transform HMM parameters for a new speaker: adaptation by the Viterbi algorithm, by the DTW algorithm, by the iterative alignment algorithm, and by the fuzzy histogram algorithm. The results show that adaptation based on the fuzzy histogram algorithm yields the highest accuracy in an HMM-based speech recognition system.
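The "mapped codebook" option in the first stage admits a compact sketch: each speaker-independent codeword moves to the mean of the adaptation frames assigned to it (a common construction; the paper's mapping may differ):

```python
import numpy as np

def mapped_codebook(si_codebook, adapt_frames):
    """First-stage VQ adaptation: re-center each codeword of the
    speaker-independent codebook on the adaptation frames it quantizes."""
    d = np.linalg.norm(adapt_frames[:, None, :] - si_codebook[None, :, :], axis=-1)
    assign = d.argmin(axis=1)
    new_cb = si_codebook.copy()
    for k in range(len(si_codebook)):
        hits = adapt_frames[assign == k]
        if len(hits):
            new_cb[k] = hits.mean(axis=0)
    return new_cb
```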



Proceedings ArticleDOI
03 Apr 1990
TL;DR: A statistical method for recognizing phonemes in continuous speech is presented, featuring a parametric expression of speaker individuality and an effective calculation of phoneme likelihood, especially for consonants in various phoneme environments.
Abstract: A statistical method for recognizing phonemes in continuous speech is presented. Two aspects of the system are discussed. The first is speaker adaptation to improve the recognition rate. A parametric expression of speaker individuality is used, which is calculated from the spectral distortion in the vector quantization. Each acoustic feature space is divided according to the speaker individuality parameter. The second is the effective calculation of phoneme likelihood, especially for consonants in various phoneme environments. Since the acoustic features of consonants depend strongly on the surrounding phonemes, phoneme segments which have high scores are extracted first. The remaining parts are then discriminated under the assumption that the reliable parts of phonemes really exist in the utterance string. At the frame level, the correct recognition rate over all phoneme categories reaches 78.3% in a multispeaker experiment (six males) and 72.7% in a completely speaker-independent experiment.