Showing papers on "Speaker diarisation" published in 1989


Proceedings ArticleDOI
23 May 1989
TL;DR: In this paper, an alternative approach to speaker adaptation for a large-vocabulary hidden-Markov-model-based speech recognition system is described, based on the use of a stochastic model representing the different properties of the new speaker and an old speaker for which the full training set of 20 minutes is available.
Abstract: An alternative approach to speaker adaptation for a large-vocabulary hidden-Markov-model-based speech recognition system is described. The goal of this investigation was to train the IBM speech recognition system with only five minutes of speech data from a new speaker instead of the usual 20 minutes without the recognition rate dropping by more than 1-2%. The approach is based on the use of a stochastic model representing the different properties of the new speaker and an old speaker for which the full training set of 20 minutes is available. It is called a speaker Markov model. It is shown how the parameters of such a model can be derived and how it can be used for transforming the training set of the old speaker in order to use it in addition to the short training set of the new speaker. The adaptation algorithm was tested with 12 speakers. The average recognition rate dropped from 96.4% to 95.2% for a 5000-word vocabulary task. The decoding time increased by a factor of 1.35; this factor is often 3-5 if other adaptation algorithms are used.

180 citations


Proceedings ArticleDOI
23 May 1989
TL;DR: The authors present the results of speaker-verification technology development for use over long-distance telephone lines, covering discriminant analysis techniques that improve the discrimination between true speakers and imposters and a comparison of template-based dynamic time warping with hidden Markov modeling.
Abstract: The authors present the results of speaker-verification technology development for use over long-distance telephone lines. A description is given of two large speech databases that were collected to support the development of new speaker verification algorithms. Also discussed are the results of discriminant analysis techniques which improve the discrimination between true speakers and imposters. A comparison is made of the performance of two speaker-verification algorithms, one using template-based dynamic time warping, and the other, hidden Markov modeling.

108 citations


Proceedings ArticleDOI
Stephen Cox1, John S. Bridle
23 May 1989
TL;DR: A general approach to speaker adaptation in speech recognition is described, in which speaker differences are treated as arising from a parameterized transformation.
Abstract: A general approach to speaker adaptation in speech recognition is described, in which speaker differences are treated as arising from a parameterized transformation. Given some unlabeled data from a particular speaker, a process is described which maximizes the likelihood of this data by estimating the transformation parameters at the same time as refining estimates of the labels. The technique is illustrated using isolated vowel spectra and phonetically motivated linear spectrum transformations and is shown to give significantly better performance than nonadaptive classification.

70 citations


Journal ArticleDOI
TL;DR: An automatic speaker adaptation algorithm for speech recognition, in which a small amount of training material of unspecified text can be used, which reduces the mean word recognition error rate from 4.9 to 2.9%.
Abstract: The author proposes an automatic speaker adaptation algorithm for speech recognition, in which a small amount of training material of unspecified text can be used. The algorithm is easily applied to vector-quantization (VQ) speech recognition systems consisting of a VQ codebook and a word dictionary in which each word is represented as a sequence of codebook entries. In the adaptation algorithm, the VQ codebook is modified for each new speaker, whereas the word dictionary is universally used for all speakers. The important feature of this algorithm is that a set of spectra in training frames and the codebook entries are clustered hierarchically. Based on the vectors representing deviation between centroids of the training frame clusters and the corresponding codebook clusters, adaptation is performed hierarchically from small to large numbers of clusters. The spectral resolution of the adaptation process is improved accordingly. Results of recognition experiments using utterances of 100 Japanese city names show that adaptation reduces the mean word recognition error rate from 4.9 to 2.9%. Since the error rate for speaker-dependent recognition is 2.2%, the adaptation method is highly effective.

41 citations


Proceedings ArticleDOI
15 Oct 1989
TL;DR: A preliminary investigation of techniques that automatically detect when the speaker has used a word that is not in the vocabulary, and develops a technique that uses a general model for the acoustics of any word to recognize the existence of new words.
Abstract: In practical large vocabulary speech recognition systems, it is nearly impossible for a speaker to remember which words are in the vocabulary. The probability of the speaker using words outside the vocabulary can be quite high. For the case when a speaker uses a new word, current systems will always recognize other words within the vocabulary in place of the new word, and the speaker wouldn't know what the problem is. In this paper, we describe a preliminary investigation of techniques that automatically detect when the speaker has used a word that is not in the vocabulary. We developed a technique that uses a general model for the acoustics of any word to recognize the existence of new words. Using this general word model, we measure the correct detection of new words versus the false alarm rate. Experiments were run using the DARPA 1000-word Resource Management Database for continuous speech recognition. The recognition system used is the BBN BYBLOS continuous speech recognition system (Chow et al., 1987). The preliminary results indicate a detection rate of 74% with a false alarm rate of 3.4%.

40 citations
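
As a rough illustration of the two figures quoted in the abstract above, the sketch below computes a detection rate and a false alarm rate from per-word labels. The per-word definitions used here (detections divided by true new words, false alarms divided by in-vocabulary words) are an assumption; the paper may score these differently, and the function name is hypothetical.

```python
def detection_and_false_alarm(is_new_word, flagged_as_new):
    """Return (detection_rate, false_alarm_rate) for parallel lists of booleans.

    is_new_word[i]    -- True if word i is really outside the vocabulary
    flagged_as_new[i] -- True if the detector marked word i as a new word
    The per-word scoring is an assumption made for illustration.
    """
    pairs = list(zip(is_new_word, flagged_as_new))
    hits = sum(1 for truth, flag in pairs if truth and flag)
    false_alarms = sum(1 for truth, flag in pairs if not truth and flag)
    n_new = sum(1 for truth, _ in pairs if truth)
    n_known = len(pairs) - n_new
    detection_rate = hits / n_new if n_new else 0.0
    false_alarm_rate = false_alarms / n_known if n_known else 0.0
    return detection_rate, false_alarm_rate


truth = [True, True, False, False, False, False]
flags = [True, False, True, False, False, False]
print(detection_and_false_alarm(truth, flags))
```

Sweeping the detector's decision threshold and recomputing both rates would trace out the detection versus false-alarm trade-off the abstract refers to.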


Proceedings ArticleDOI
23 May 1989
TL;DR: The authors propose a speaker adaptation algorithm which does not depend on speech recognition algorithms and is applied to hidden Markov models and neural networks and evaluated using a database of 216 phonetically balanced words and 5240 important Japanese words uttered by three speakers.
Abstract: The authors propose a speaker adaptation algorithm which does not depend on speech recognition algorithms. The proposed spectral mapping algorithm is based on three ideas: (1) accurate representation of the input vector by separate vector quantization and fuzzy vector quantization, (2) continuous spectral mapping from one speaker to another by fuzzy mapping, and (3) accurate establishment of spectral correspondence based on the fuzzy relationship of the membership function obtained from supervised training. The spectrum dynamic features are also utilized. The algorithm is applied to hidden Markov models (HMMs) and neural networks and evaluated using a database of 216 phonetically balanced words and 5240 important Japanese words uttered by three speakers. The HMM speaker adapted recognition rate for /b,d,g/ is 79.5%. The average recognition rate for the top-three choices is about 91%. The algorithm was applied to neural networks and resulted in almost the same performance. The algorithm was also applied to voice conversion, and a preference score of 65.6% was obtained.

36 citations


PatentDOI
TL;DR: A speaker verification system receives input speech from a speaker of unknown identity and undergoes linear predictive coding analysis and transformation to maximize separability between true speakers and impostors when compared to reference speech parameters which have been similarly transformed.
Abstract: A speaker verification system receives input speech from a speaker of unknown identity. The speech undergoes linear predictive coding (LPC) analysis and transformation to maximize separability between true speakers and impostors when compared to reference speech parameters which have been similarly transformed. The transformation incorporates an "inter-class" covariance matrix of successful impostors within a database.

32 citations


PatentDOI
TL;DR: A speech recognition method and apparatus take into account a system transfer function between the speaker and the recognition apparatus, which update a signal representing the transfer function on a periodic basis during actual speech recognition.
Abstract: A speech recognition method and apparatus take into account a system transfer function between the speaker and the recognition apparatus. The method and apparatus update a signal representing the transfer function on a periodic basis during actual speech recognition. The transfer function representing signal is updated about every fifty words as determined by the speech recognition apparatus. The method and apparatus generate an initial transfer function representing signal and generate from the speech input, successive input frames which are employed for modifying the value of the current transfer function signal so as to eliminate error and distortion. The error and distortion occur, for example, as a speaker changes the direction of his profile relative to a microphone, as the speaker's voice changes or as other effects occur that alter the spectra of the input speech frames. The method is automatic and does not require the knowledge of the input words or text.

24 citations


PatentDOI
Masao Watari1
TL;DR: In this article, a plurality of control reference patterns similar to the verification reference pattern are determined from among the control reference pattern candidates, and the speaker to be verified is judged as the registered speaker on the basis of first and second dissimilarities.
Abstract: Control reference pattern candidates corresponding to a verification reference pattern of a registered speaker are synthesized by connecting unit speech patterns of a plurality of speakers. A plurality of control reference patterns similar to the verification reference pattern are determined from among the control reference pattern candidates. First dissimilarity between an input pattern of a speaker to be verified and the verification reference pattern specified by the registered speaker and second dissimilarity between the input pattern and the control reference patterns specified by the registered speaker are calculated. The speaker to be verified is judged as the registered speaker on the basis of the first and second dissimilarities.

21 citations


01 Jan 1989
TL;DR: An alternative approach to speaker adaptation for a large-vocabulary hidden-Markov-model-based speech recognition system is described, based on the use of a stochastic model representing the different properties of the new speaker and an old speaker for which the full training set of 20 minutes is available.
Abstract: This paper describes an alternative approach to speaker adaptation for a large vocabulary Hidden Markov Model based speech recognition system. The goal of this investigation was to train the IBM speech recognition system with only 5 minutes of speech data from a new speaker instead of the usual 20 minutes. At the same time the recognition rate should not drop by more than 1-2%. The approach is based on the use of a stochastic model representing the different properties of the new speaker and an old speaker for which the full training set of 20 minutes is available. Such a model can be called a “Speaker Markov Model”. It is shown how the parameters of such a model can be derived and how it can be used for transforming the training set of the old speaker in order to use it in addition to the short training set of the new speaker. The adaptation algorithm was tested with 12 speakers including male and female speakers as well as speakers with foreign accents. The average recognition rate dropped only from 96.4% to 95.2% for a 5000 word vocabulary task if the adaptation was used instead of the full training. Most important, the decoding time increases only by a factor of 1.35, while this factor is often 3-5 if other adaptation algorithms are used.

19 citations


PatentDOI
TL;DR: In this article, a speech processing apparatus was proposed that enables processor elements (403a to 403r) each comprising at least one nonlinear oscillator circuit (621) to be used as band pass filters by using the entrainment taking place in each of the processor elements.
Abstract: A speech processing apparatus of the present invention enables processor elements (403a to 403r) each comprising at least one nonlinear oscillator circuit (621) to be used as band pass filters by using the entrainment taking place in each of the processor elements, whereby the speech of a particular talker in the speech of a plurality of talkers can be recognized.

Proceedings ArticleDOI
08 May 1989
TL;DR: An algorithm is presented for adaptation and self-learning of the hidden Markov model (HMM) that makes the HMM-based speech recognition robust, so that well-trained models can be adapted to new speaking conditions or a new speaker.
Abstract: An algorithm is presented for adaptation and self-learning of the hidden Markov model (HMM). It makes the HMM-based speech recognition robust, so that well-trained models can be adapted to new speaking conditions or a new speaker. The self-learning consists of the fact that, during recognition, all test tokens can be used to augment the current model. Both procedures increase the size of the training set. The algorithm was tested on a speaker-dependent speech recognition system for the whole Chinese vocabulary and a speaker-independent system for 0-9 digits. Experiments show that the algorithm is very successful, both for new-speaker adaptation and for variations of speech in a single speaker under various conditions.

Proceedings Article
01 Jan 1989
TL;DR: This work attempts to combine neural networks with knowledge from speech science to build a speaker independent speech recognition system and combines delays, copies of activations of hidden and output units at the input level, and Back-Propagation for Sequences (BPS), a learning algorithm for networks with local self-loops.
Abstract: We attempt to combine neural networks with knowledge from speech science to build a speaker independent speech recognition system. This knowledge is utilized in designing the preprocessing, input coding, output coding, output supervision and architectural constraints. To handle the temporal aspect of speech we combine delays, copies of activations of hidden and output units at the input level, and Back-Propagation for Sequences (BPS), a learning algorithm for networks with local self-loops. This strategy is demonstrated in several experiments, in particular a nasal discrimination task for which the application of a speech theory hypothesis dramatically improved generalization.

Patent
17 Nov 1989
TL;DR: In this article, the mean frequency characteristics and mean pitch frequency of a voice were used as an input for speaker recognition using a neural network and the output of the neural network 20 is inputted to a decision circuit 30 and identified or matched.
Abstract: PURPOSE: To reduce deterioration in recognition rate with time and to easily perform real-time processing by using the mean frequency characteristics and mean pitch frequency of a voice as an input for speaker recognition which uses a neural network. CONSTITUTION: The input voice is divided equally into blocks with time and the voice waveform is passed through band-pass filters 10 of plural channels; and the respective obtained blocks, i.e. frequency characteristics of constant time intervals and respective blocks obtained by passing the voice waveform through a pitch extraction part in parallel to said processing, i.e. pitch frequencies of constant time intervals are averaged by an averaging circuit 15 in block units and inputted to the neural network 20. Then the output of the neural network 20 is inputted to a decision circuit 30 and identified or matched. Consequently, the deterioration in recognition rate with time is small and the real-time processing is easily performed.
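
The block-averaging step described in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the band-pass filter outputs and the pitch track are taken as already computed per frame, the function names are hypothetical, and the neural network and decision circuit are not modelled.

```python
def block_means(frames, n_blocks):
    """Split a list of equal-length vectors into n_blocks equal time blocks
    and return the per-dimension mean of each block.
    Assumes len(frames) >= n_blocks; leftover trailing frames are dropped."""
    size = len(frames) // n_blocks
    means = []
    for b in range(n_blocks):
        block = frames[b * size:(b + 1) * size]
        dim = len(block[0])
        means.append([sum(f[d] for f in block) / len(block) for d in range(dim)])
    return means


def network_input(filterbank_frames, pitch_frames, n_blocks):
    """Concatenate, block by block, the mean filter-bank outputs and the
    mean pitch frequency into one flat input vector."""
    spectrum = block_means(filterbank_frames, n_blocks)
    pitch = block_means([[p] for p in pitch_frames], n_blocks)
    return [value for s, p in zip(spectrum, pitch) for value in s + p]


filterbank = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 2 channels, 4 frames
pitch = [100.0, 110.0, 120.0, 130.0]                            # Hz, one value per frame
print(network_input(filterbank, pitch, n_blocks=2))
```

The resulting vector has a fixed length regardless of utterance duration, which is one plausible reason such averaged features are convenient as neural-network input.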

Proceedings Article
01 Jan 1989
TL;DR: Experimental results show that 10% or more of speech acts as little more than noise, interfering in the task of speaker recognition, and a classifier is developed here, and it is shown that the spoken digit 'nine' is good, while 'six' is bad.
Abstract: A recent approach to speaker identification is based on personalised codebooks. The algorithm compares incoming test features with a set of N codebooks, one for each valid member of the user population, and the codebook which gives rise to the smallest accumulated distance for the full test feature sequence is assumed to identify the speaker. Results from this inherently text-independent approach have highlighted the performance variations for different test utterances: the spoken digit 'nine' is good, while 'six' is bad. This observation has led to the idea of classifying speech, via a text- and speaker-independent codebook, according to empirical discriminating properties in the recognition task. Such a classifier is developed here, and experimental results show that 10% or more of speech acts as little more than noise, interfering in the task of speaker recognition.
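
The codebook comparison described in the abstract above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance between feature vectors and codebooks stored as plain lists; the distance measure, the function names, and the toy data are assumptions, not details from the paper.

```python
import math


def accumulated_distance(frames, codebook):
    """Sum, over all test frames, of the distance to the nearest codebook entry."""
    return sum(min(math.dist(frame, code) for code in codebook) for frame in frames)


def identify_speaker(frames, codebooks):
    """Return the speaker whose codebook yields the smallest accumulated distance.

    codebooks maps a speaker label to that speaker's list of code vectors.
    """
    return min(codebooks, key=lambda spk: accumulated_distance(frames, codebooks[spk]))


codebooks = {
    "speaker_a": [(0.0, 0.0), (1.0, 0.0)],
    "speaker_b": [(5.0, 5.0), (6.0, 5.0)],
}
test_frames = [(0.1, 0.0), (0.9, 0.1), (0.0, 0.2)]
print(identify_speaker(test_frames, codebooks))
```

Because each frame is scored independently of its position, the comparison does not depend on what was said, which matches the text-independent character noted in the abstract.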

Proceedings ArticleDOI
26 Mar 1989
TL;DR: Development of a fixed limited vocabulary automatic speaker recognition system based on extraction of cepstral features from single isolated word utterances by various speakers is described.
Abstract: Development of a fixed limited vocabulary automatic speaker recognition system is described. The operation of the system is based on extraction of cepstral features from single isolated word utterances by various speakers. A dynamic time warping algorithm is used in the comparison stage to bring the feature vectors being compared into time alignment. A nearest neighbor rule is used to determine the identity of the speaker.
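
The comparison stage described in the abstract above (dynamic time warping followed by a nearest neighbor rule) can be sketched as follows. This is a textbook DTW with unconstrained symmetric steps and Euclidean local distance; the paper's actual local constraints, cepstral front end, and normalisation are not given in the abstract, so those choices and all names here are assumptions.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_neighbor_speaker(test, references):
    """references is a list of (speaker_label, feature_sequence) pairs;
    return the label of the reference with the smallest DTW distance."""
    return min(references, key=lambda ref: dtw_distance(test, ref[1]))[0]


references = [
    ("alice", [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]),
    ("bob", [(5.0, 5.0), (6.0, 6.0)]),
]
utterance = [(0.0, 0.1), (1.0, 1.0), (1.0, 1.0), (2.0, 2.1)]
print(nearest_neighbor_speaker(utterance, references))
```

The warping lets a slowly spoken utterance match a shorter reference of the same word, so the nearest neighbor decision reflects spectral differences rather than speaking rate.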

Patent
Kazunaga Yoshida1, Takao Watanabe1
17 May 1989
TL;DR: In this paper, a speech recognition system is adapted to a particular speaker by converting the reference pattern to a normalized pattern through a learning operation using a training pattern produced provisionally by the particular speaker.
Abstract: A speech recognition apparatus of the speaker adaptation type operates to recognize inputted speech pattern produced by a particular speaker according to reference pattern produced by a standard speaker. The speech recognition apparatus is adapted to the particular speaker by converting the reference pattern to normalized pattern through learning operation using training pattern produced provisionally by the particular speaker. In the alternative, the speech recognition apparatus is trained through conversion of the training pattern with reference to the reference pattern. The speech recognition apparatus operates to convert inputted speech pattern into normalized speech pattern on a real-time basis according to the result of learning operation and to recognize the normalized speech pattern according to the reference pattern.


Proceedings ArticleDOI
23 May 1989
TL;DR: A recognition system based on a reference library of synthetic phoneme prototypes is described, and speaker-independent recognition results are given for male speakers on isolated words and connected digits.
Abstract: A recognition system based on a reference library of synthetic phoneme prototypes is described. The phoneme templates are specified in terms of formant synthesis parameters. The vocabulary and grammar are described in a finite-state network where each node represents a phoneme. A transition between two phonemes in the net is expanded to a number of new nodes using interpolation on the synthesis parameters or at the spectrum level. For each node, a 16-channel filter bank section is computed from the synthesis parameters. Adaptation to each speaker's individual voice source spectrum is performed during recognition. Auditory forward masking is incorporated. Speaker-independent recognition results are given for male speakers on isolated words and connected digits. Future improvements include coarticulation and reduction rules and speaker adaptation of phoneme parameters. The method could also be used in combination with hidden Markov models to provide reference data in cases not covered by the training material.

Proceedings Article
01 Jan 1989
TL;DR: The author has developed a method of speaker identification based on representation of speakers by some LPC-coded vowels, and investigates a modification that uses a noise canceller to identify speakers under noise conditions.
Abstract: The author has developed a method of speaker identification, based on representation of speakers by some LPC-coded vowels. The minimum cumulated spectral difference between corresponding test and reference samples was the decision criterion in the recognition task. The experiments reported here investigated the ability of the modified method to identify speakers under noise conditions. The modification of the method consisted in the use of a noise canceller.

Proceedings ArticleDOI
11 Apr 1989
TL;DR: It is shown that, by means of adaptation procedures based on statistical correlation analysis, error rates as low as those of a speaker-dependent recognition system can be achieved after an extremely short training phase with any new speaker.
Abstract: Algorithms for a fast speaker adaptation in a speech-recognition system are described. The techniques aim at transformations of the feature vectors, which have to be optimized with respect to some constraints. The methods transform every feature vector, computed in a 10-ms frame rate, into a speaker-normalized vector. The advantage of adaptation by transforming the feature vectors is that this procedure can be applied no matter which classification scheme is used. It is shown that, by means of adaptation procedures based on statistical correlation analysis, error rates as low as those of a speaker-dependent recognition system can be achieved after an extremely short training phase with any new speaker. The key is that the feature vectors are extended nonlinearly to a polynomial vector of second or higher order. Since the algorithms necessary for calculating the transformation matrices are typical for signal processing, a real-time implementation on digital signal processors appears feasible.
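
The polynomial extension mentioned in the abstract above can be sketched as follows: each feature vector is expanded into all monomials up to a given order, and a transformation matrix is then applied to the expanded vector. The matrix values below are placeholders; in the paper they are derived by statistical correlation analysis, which is not shown here, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement


def polynomial_expand(x, order=2):
    """Return [1, x_i, x_i*x_j, ...] containing every monomial up to `order`."""
    terms = [1.0]
    for degree in range(1, order + 1):
        for indices in combinations_with_replacement(range(len(x)), degree):
            product = 1.0
            for i in indices:
                product *= x[i]
            terms.append(product)
    return terms


def normalize_frame(x, matrix, order=2):
    """Map a feature vector to a speaker-normalized vector: matrix @ expand(x).

    Each row of `matrix` must have as many entries as the expanded vector."""
    z = polynomial_expand(x, order)
    return [sum(w * t for w, t in zip(row, z)) for row in matrix]


frame = [2.0, 3.0]
print(polynomial_expand(frame))  # six terms: 1, x0, x1, x0*x0, x0*x1, x1*x1
placeholder_matrix = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # copies x0
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],  # x1 plus a constant offset
]
print(normalize_frame(frame, placeholder_matrix))
```

Since the expansion is fixed and the mapping is a single matrix product per 10-ms frame, the per-frame cost is small, which is consistent with the abstract's remark about real-time implementation on signal processors.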

Proceedings ArticleDOI
15 Oct 1989
TL;DR: This paper describes the baseline (single reference speaker) speaker-adaptation system and gives current performance results from a recent formal evaluation of the system, and describes the proposal for adapting from multiple reference speakers.
Abstract: We introduce a new technique for using the speech of multiple reference speakers as a basis for speaker adaptation in large vocabulary continuous speech recognition. In contrast to other methods that use a pooled reference model, this technique normalizes the training speech from multiple reference speakers to a single common feature space before pooling it. The normalized and pooled speech can then be treated as if it came from a single reference speaker for training the reference hidden Markov model (HMM). Our usual probabilistic spectrum transformation can be applied to the reference HMM to model a new (target) speaker. In this paper, we describe our baseline (single reference speaker) speaker-adaptation system and give current performance results from a recent formal evaluation of the system. We also describe our proposal for adapting from multiple reference speakers and report on recent preliminary experimental results in support of the proposed technique.



Proceedings ArticleDOI
21 Feb 1989
TL;DR: The BBN BYBLOS continuous speech recognition system has been used to develop a method of speaker adaptation from limited training and techniques employed to accomplish this transformation are reviewed and experimental results conducted on the DARPA Resource Management database are presented.
Abstract: The BBN BYBLOS continuous speech recognition system has been used to develop a method of speaker adaptation from limited training. The key step in the method is the estimation of a probabilistic spectral mapping between a prototype speaker, for whom there exists a well-trained speaker-dependent hidden Markov model (HMM), and a target speaker for whom there is only a small amount of training speech available. The mapping defines a set of transformation matrices which are used to modify the parameters of the prototype model. The resulting transformed model is then used as an approximation to a well-trained model for the target speaker. We review the techniques employed to accomplish this transformation and present experimental results conducted on the DARPA Resource Management database.
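
One way to picture the transformation step described in the abstract above is sketched below. It assumes the prototype HMM has discrete output distributions over codewords and that the mapping is a row-stochastic matrix of conditional probabilities; both are assumptions made for illustration, as the abstract does not spell out the parameterisation, and the names are hypothetical.

```python
def transform_distribution(prototype_probs, mapping):
    """Map a prototype-speaker output distribution to the target speaker.

    prototype_probs[k] -- P(prototype codeword k | state)
    mapping[k][j]      -- assumed P(target codeword j | prototype codeword k)
    Returns a list whose entry j is sum_k prototype_probs[k] * mapping[k][j].
    """
    n_target = len(mapping[0])
    return [
        sum(prototype_probs[k] * mapping[k][j] for k in range(len(prototype_probs)))
        for j in range(n_target)
    ]


state_distribution = [0.5, 0.5]
mapping = [
    [1.0, 0.0],  # prototype codeword 0 always maps to target codeword 0
    [0.5, 0.5],  # prototype codeword 1 splits evenly
]
print(transform_distribution(state_distribution, mapping))
```

If every row of the mapping sums to one, the transformed distribution also sums to one, so the transformed model remains a valid HMM without renormalisation.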


01 Jan 1989
TL;DR: Two large speech databases that were collected to support the development of new speaker verification algorithms and the results of discriminant analysis techniques which improve the discrimination between true speakers and impostors are described.
Abstract: In this paper we present the results of speaker verification technology development for use over long distance telephone lines. We describe two large speech databases that were collected to support the development of new speaker verification algorithms. We discuss the results of discriminant analysis techniques which improve the discrimination between true speakers and impostors. We compare the performance of two speaker verification algorithms, one using template based Dynamic Time Warping (DTW) and the other, Hidden Markov Modeling (HMM).