
Showing papers on "Speaker diarisation published in 1991"


Journal ArticleDOI
TL;DR: The system described here is capable of accurately verifying an individual’s claimed identity from a short sample of his or her speech, and a rationale was developed for determining the size of the test required to allow hypotheses regarding the system's true error rates to be tested with stated confidence levels.

230 citations


Proceedings Article
01 Jan 1991
TL;DR: This paper presents some of the design considerations of BREF, a large read-speech corpus for French designed to provide continuous speech data for the development of dictation machines, for the evaluation of continuous speech recognition systems, and for the study of phonological variations.
Abstract: This paper presents some of the design considerations of BREF, a large read-speech corpus for French. BREF was designed to provide continuous speech data for the development of dictation machines, for the evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and for the study of phonological variations. The texts to be read were selected from 5 million words of the French newspaper Le Monde. In total, 11,000 texts were selected, with selection criteria that emphasized maximizing the number of distinct triphones. Separate text materials were selected for training and test corpora. Ninety speakers have been recorded, each providing between 5,000 and 10,000 words (approximately 40-70 min) of speech.

225 citations


Journal ArticleDOI
TL;DR: Recent advances in and perspectives of research on speaker-dependent-feature extraction from speech waves, automatic speaker identification and verification, speaker adaptation in speech recognition, and voice conversion techniques are discussed.

108 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: A speaker verification system using connected word verification phrases has been implemented and studied, and the system has been evaluated on a 20-speaker telephone database of connected digit utterances.
Abstract: A speaker verification system using connected word verification phrases has been implemented and studied. Verification utterances are represented as concatenated speaker-dependent whole-word hidden Markov models (HMMs). Verification phrases are specified as strings of words drawn from a small fixed vocabulary, such as the digits. Phrases can either be individualized or randomized for greater security. Training techniques to create speaker-dependent models for verification are used in which initial word models are created by bootstrapping from existing speaker-independent models. The system has been evaluated on a 20-speaker telephone database of connected digit utterances. Using approximately 66 s of connected digit training utterances per speaker, the verification equal-error rate is approximately 3.5% for 1.1 s test utterances and 0.3% for 4.4 s test utterances. In comparison, the performance of a template-based system using the same amount of training data is 6.7% and 1.5%, respectively.
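The equal-error rate quoted above is the operating point where the false-accept and false-reject rates coincide. As a minimal illustration (with made-up verification scores, not the paper's data), it can be computed by sweeping a decision threshold:

```python
# Hypothetical scores: higher means more likely the claimed speaker.
genuine = [0.9, 0.8, 0.7, 0.6, 0.35]    # true-speaker trials
impostor = [0.5, 0.4, 0.3, 0.2, 0.1]    # impostor trials

def equal_error_rate(genuine, impostor):
    """Sweep a threshold and return the point where false-accept
    and false-reject rates are (approximately) equal."""
    best = None
    for t in sorted(genuine + impostor):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

print(equal_error_rate(genuine, impostor))  # 0.2 for these toy scores
```

Real evaluations interpolate the FAR/FRR curves rather than averaging at the closest threshold, but the crossing-point idea is the same.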

88 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: Experimental results show that adapting the spectral observation probabilities of each state of the model by the back propagation of errors can correct misclassification errors.
Abstract: Speaker verification is performed by comparing the output probabilities of two Markov models of the same phonetic unit. One of these Markov models is speaker-specific, being built from utterances from the speaker whose identity is to be verified. The second model is built from utterances from a large population of speakers. The performance of the system is improved by treating the pair of models as a connectionist network, an alpha-net, which then allows discriminative training to be carried out. Experimental results show that adapting the spectral observation probabilities of each state of the model by the back propagation of errors can correct misclassification errors. The real-time implementation of the system produced an average digit error rate of 4.5% and only one misclassification in 600 trials using a five-digit sequence.
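The accept/reject decision described above compares the output probabilities of a speaker-specific model against a population (background) model. A toy sketch of that likelihood-ratio test, using single 1-D Gaussians with hypothetical parameters in place of the paper's Markov models:

```python
import math

def gauss_loglik(x, mean, var):
    """Log-likelihood of a scalar feature x under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def verify(frames, spk, bkg, threshold=0.0):
    """Accept if the average log-likelihood ratio between the
    speaker-specific and background models exceeds the threshold."""
    llr = sum(gauss_loglik(x, *spk) - gauss_loglik(x, *bkg) for x in frames)
    return llr / len(frames) > threshold

spk_model = (1.0, 0.5)   # (mean, var) fit on the claimed speaker
bkg_model = (0.0, 2.0)   # (mean, var) fit on a large speaker population
print(verify([0.9, 1.1, 1.2], spk_model, bkg_model))     # True: near speaker
print(verify([-2.0, -1.5, -2.5], spk_model, bkg_model))  # False: far away
```

The alpha-net contribution is that the two models are then trained discriminatively as one network, rather than estimated independently as here.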

74 citations


Proceedings ArticleDOI
11 Jun 1991
TL;DR: The authors summarize a speaker adaptation algorithm based on codebook mapping from one speaker to a standard speaker, developed to be useful in various kinds of speech recognition systems such as hidden-Markov-model-based, feature-based, and neural-network-based systems.
Abstract: The authors summarize a speaker adaptation algorithm based on codebook mapping from one speaker to a standard speaker. This algorithm has been developed to be useful in various kinds of speech recognition systems such as hidden-Markov-model-based, feature-based, and neural-network-based systems. The codebook mapping speaker adaptation algorithm has been much improved by introducing several ideas based on fuzzy vector quantization. This fuzzy codebook mapping algorithm is also applicable to voice conversion between arbitrary speakers.
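The fuzzy codebook mapping idea can be sketched as follows: instead of mapping an input vector through its single nearest codeword, fuzzy memberships to all source codewords weight a sum over the paired target codewords. The 1-D codebooks and fuzziness exponent `m` below are illustrative, not from the paper:

```python
# Hypothetical paired 1-D codebooks: src[i] corresponds to tgt[i].
src = [0.0, 1.0, 2.0]       # input speaker's codewords
tgt = [0.2, 1.5, 2.9]       # standard speaker's paired codewords

def fuzzy_map(x, src, tgt, m=2.0, eps=1e-12):
    """Map x through fuzzy memberships to all source codewords,
    then take the membership-weighted sum of target codewords."""
    d = [abs(x - c) + eps for c in src]          # distances to codewords
    w = [dd ** (-2.0 / (m - 1.0)) for dd in d]   # fuzzy c-means memberships
    s = sum(w)
    return sum(wi / s * t for wi, t in zip(w, tgt))

print(fuzzy_map(1.0, src, tgt))  # ~1.5: exact hit on a codeword
print(fuzzy_map(0.5, src, tgt))  # between-codeword inputs blend targets
```

Hard VQ mapping would return exactly `tgt[nearest]`; the fuzzy version interpolates smoothly, which is what reduces quantization artifacts in adaptation and voice conversion.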

63 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: A VQ (vector-quantization)-based text-independent speaker recognition method which is robust against utterance variations, and a normalization method, talker variability normalization (TVN), which normalizes parameter variation taking both inter- and intra-speaker variability into consideration.
Abstract: The authors describe a VQ (vector-quantization)-based text-independent speaker recognition method which is robust against utterance variations. Three techniques are introduced to cope with temporal and text-dependent spectral variations. First, either an ergodic hidden Markov model or a voiced/unvoiced decision is used to classify input speech into broad phonetic classes. Second, a new distance measure, the distortion-intersection measure (DIM), is introduced for calculating VQ distortion of input speech compared to speaker-independent codebooks. Third, a normalization method, talker variability normalization (TVN), is introduced. TVN normalizes parameter variation taking both inter- and intra-speaker variability into consideration. The system was tested using utterances of nine speakers recorded over three years. The combination of the three techniques achieves high speaker identification accuracies of 98.5% using only vocal tract information and 99.0% using both vocal tract and pitch information.
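VQ-distortion-based identification, the backbone of the method above, scores each speaker by the average distance from the input frames to that speaker's nearest codewords (the DIM is a refinement of this basic measure). A minimal 1-D sketch with invented codebooks:

```python
def vq_distortion(frames, codebook):
    """Average distance from each input frame to its nearest codeword."""
    return sum(min(abs(f - c) for c in codebook) for f in frames) / len(frames)

def identify(frames, codebooks):
    """Return the speaker whose codebook yields the smallest distortion."""
    return min(codebooks, key=lambda spk: vq_distortion(frames, codebooks[spk]))

# Hypothetical per-speaker codebooks (real systems use spectral vectors).
books = {"alice": [0.0, 1.0, 2.0], "bob": [5.0, 6.0, 7.0]}
print(identify([0.9, 2.1, 1.2], books))  # 'alice': frames lie near her codewords
```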

59 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX, and extended it to speaker-dependent speech recognition, which demonstrated a substantial difference between speaker-dependent and speaker-independent systems.
Abstract: The DARPA Resource Management task is used as the domain to investigate the performance of speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX. The error rate for the RM2 test set is 4.3%. They extended SPHINX to speaker-dependent speech recognition. The error rate is reduced to 1.4-2.6% with 600-2400 training sentences for each speaker, which demonstrated a substantial difference between speaker-dependent and -independent systems. Based on speaker-independent models, a study was made of speaker-adaptive speech recognition. With 40 adaptation sentences for each speaker, the error rate can be reduced from 4.3% to 3.1%.

41 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: A combination of a high-performance speaker identification system and an isolated word recognizer is presented, capable of automatically producing speech and speaker identification with a closed set of speakers.
Abstract: A combination of a high-performance speaker identification system and an isolated word recognizer is presented. The front-end text-independent speaker identification system determines the most likely speaker for an input word. The speaker identity is then used to choose the reference word models for the speech recognizer. When used with a closed set of speakers, the combination is capable of automatically producing speech and speaker identification. For an open set of speakers, the speaker recognition system acts as a speaker quantizer which associates the unknown speaker with an acoustically similar speaker. The matching speaker's word models are used in the speech recognizer. The application of this front-end speaker recognizer is described for a DTW and HMM speech recognizer. Results on a combination using a DTW word recognizer are 100% for closed-set experiments.

33 citations


Journal ArticleDOI
TL;DR: The methods and motivation for VAA data collection and validation procedures, the current contents of the database, and the results of exploratory research on a 1088-speaker subset of the database are described.

31 citations


Proceedings ArticleDOI
04 Nov 1991
TL;DR: It is concluded that not only has the VQ technique reduced the amount of computation and storage, but it has also created new ideas for solving various problems in speech/speaker recognition.
Abstract: The author reviews major methods of applying the vector quantization (VQ) technique to speech and speaker recognition. These include speech recognition based on the combination of VQ and the DTW/HMM (dynamic time warping/hidden Markov model) technique, VQ-distortion-based recognition, learning VQ algorithms, speaker adaptation by VQ-codebook mapping, and VQ-distortion-based speaker recognition. It is concluded that not only has the VQ technique reduced the amount of computation and storage, but it has also created new ideas for solving various problems in speech/speaker recognition.

PatentDOI
Kazunaga Yoshida, Takao Watanabe
TL;DR: A speech recognition apparatus is adapted to the speech of the particular speaker by converting the reference pattern into a normalized pattern by a neural network unit, internal parameters of which are modified through a learning operation using a normalized feature vector of the training pattern produced by the voice of the particular speaker and normalized on the basis of the reference pattern.
Abstract: A speech recognition apparatus of the speaker adaptation type operates to recognize an inputted speech pattern produced by a particular speaker by using a reference pattern produced by a voice of a standard speaker. The speech recognition apparatus is adapted to the speech of the particular speaker by converting the reference pattern into a normalized pattern by a neural network unit, internal parameters of which are modified through a learning operation using a normalized feature vector of the training pattern produced by the voice of the particular speaker and normalized on the basis of the reference pattern, so that the neural network unit provides an optimum output similar to the corresponding normalized feature vector of the training pattern. In the alternative, the speech recognition apparatus operates to recognize an inputted speech pattern by converting the inputted speech pattern into a normalized speech pattern by the neural network unit, internal parameters of which are modified through a learning operation using a feature vector of the reference pattern normalized on the basis of the training pattern, so that the neural network unit provides an optimum output similar to the corresponding normalized feature vector of the reference pattern and recognizing the normalized speech pattern according to the reference pattern.

Patent
17 Sep 1991
TL;DR: In this paper, a speech segment correspondence unit makes a dynamic programming (DP) based correspondence between the obtained speech segments and training speech data of the target speaker, thereby making a speech segment correspondence table.
Abstract: Input speech of a reference speaker, who wants to convert his/her voice quality, and speech of a target speaker are converted into a digital signal by an analog to digital (A/D) converter. The digital signal is then subjected to speech analysis by a linear predictive coding (LPC) analyzer. Speech data of the reference speaker is processed into speech segments by a speech segmentation unit. A speech segment correspondence unit makes a dynamic programming (DP) based correspondence between the obtained speech segments and training speech data of the target speaker, thereby making a speech segment correspondence table. A speaker individuality conversion is made on the basis of the speech segment correspondence table by a speech individuality conversion and synthesis unit.
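The DP-based correspondence step can be illustrated with the classic dynamic-time-warping recursion over two feature sequences; scalar sequences stand in here for the real spectral vectors:

```python
def dtw(a, b):
    """Dynamic-programming alignment cost between two scalar sequences,
    as used to pair speech segments of two speakers."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: insertion, deletion, or diagonal match.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the repeated frame aligns freely
print(dtw([1, 2, 3], [1, 2, 4]))     # 1.0: one frame differs by 1
```

Backtracking through `D` (omitted here) recovers which segments of the reference speaker align with which segments of the target speaker, which is what populates the correspondence table.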

Proceedings ArticleDOI
14 Apr 1991
TL;DR: An attempt was made to enhance the performance of a DTW (dynamic time warping) speech recognizer by preprocessing speech parameters with a neural network transformation: a multilayer perceptron trained on speech utterances of a single speaker.
Abstract: An attempt was made to enhance the performance of a DTW (dynamic time warping) speech recognizer by preprocessing speech parameters using a neural network transformation. A multilayer perceptron trained with speech utterances of a single speaker has been used in front of a DTW recognizer. Results show an improvement of about 15% in the recognition rate in all cases, even with a speaker that was not used for training. If the network is not completely speaker independent, a dynamic adaptation to the speaker could be performed.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The integrated noise model was noted for having a noise suppression characteristic that arises naturally from the statistical model, which is important if one wishes to avoid the ad hoc fine tuning of thresholds required in the noise processing approach implemented.
Abstract: The use of probabilistic mixture densities for text-independent speaker identification in a noisy telephone channel environment is investigated. Two techniques for noise compensation are considered. In the first approach, a background noise model is integrated directly into the model for speech. In the second approach, noise preprocessing techniques are used to compensate noisy observations before passing them along to the speaker identification system. Both approaches are evaluated on conversational utterances collected over long-distance telephone channels from ten speakers. The integrated noise model was noted for having a noise suppression characteristic that arises naturally from the statistical model, which is important if one wishes to avoid the ad hoc fine tuning of thresholds required in the noise preprocessing approach implemented.
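A probabilistic-mixture speaker identifier of the kind investigated above scores each speaker's Gaussian mixture over all observation frames and picks the highest total log-likelihood. The 1-D mixtures below are invented for illustration:

```python
import math

def gmm_loglik(x, mixture):
    """Log-likelihood of x under a 1-D Gaussian mixture [(weight, mean, var), ...]."""
    return math.log(sum(
        w / math.sqrt(2 * math.pi * v) * math.exp(-0.5 * (x - m) ** 2 / v)
        for w, m, v in mixture))

def identify(frames, models):
    """Sum per-frame log-likelihoods under each speaker's mixture; return the best."""
    return max(models, key=lambda s: sum(gmm_loglik(x, models[s]) for x in frames))

# Two hypothetical speakers modeled by two-component mixtures.
models = {
    "low":  [(0.5, 0.0, 1.0), (0.5, 1.0, 1.0)],
    "high": [(0.5, 4.0, 1.0), (0.5, 5.0, 1.0)],
}
print(identify([0.2, 0.8, 1.1], models))  # 'low': frames sit near its components
```

The paper's integrated-noise variant would add a noise component to each mixture rather than denoising the frames first.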


Proceedings ArticleDOI
D.A. Gaganelis, E. Frangoulis
14 Apr 1991
TL;DR: In this method, K-means clustering is used during training to obtain robust speaker reference templates; classification through the Fourier-Bessel functions allows the combination of multiple feature sets in a single classification test.
Abstract: The authors report on a novel method for telephone-based speaker verification. In this method, K-means clustering is used during training for robust speaker reference templates. The classification is made through the use of the Fourier-Bessel functions, which transform the original problem to a multidimensional detection problem. This technique allows the combination of multiple feature sets in a single classification test. Experiments with a number of speakers and words over the telephone network show the potential benefits of the new techniques using the standard and multivariate Gaussian classifiers.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A speaker model using a neural network is proposed for reference speaker clustering on speaker independent speech recognition and neural prediction modeling by multilayer perceptron and learning matrix vector-quantization are considered for the speaker modeling.
Abstract: A speaker model using a neural network is proposed for reference speaker clustering in speaker-independent speech recognition. Speaker individuality is embedded not only in a static short-time spectrum and a pitch frequency, but also in a dynamic spectral pattern and pitch pattern. In conventional modeling, speaker individuality is based on the former static features. The authors try to capture the latter dynamic features of a speaker with a neural speaker model. Two methods, neural prediction modeling by multilayer perceptron and learning matrix vector quantization, are considered for the speaker modeling. Using the measures of speaker modeling, speaker clustering of the reference patterns based on mutual information is carried out for speaker-independent speech recognition.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A speaker-adaptive speech recognition method using a stochastic speaker classifier and four integrated 9-state ergodic speaker hidden Markov models estimated from the command words uttered by 116 training speakers is proposed.
Abstract: A speaker-adaptive speech recognition method using a stochastic speaker classifier is proposed. The stochastic speaker classifier decides which spectral feature subspace is suitable for the input speaker by using integrated speaker Markov models. In the acoustic HMMs (hidden Markov models), the observation emission probabilities are presented as joint probabilities for speaker individuality obtained from the speaker classifier and feature vectors from the acoustic preprocessor. Evaluation experiments are performed using a telephone speech database of 50 command words and 10 Japanese digits. Using four integrated 9-state ergodic speaker hidden Markov models estimated from the command words uttered by 116 training speakers, the best word recognition accuracy of 98.1% is achieved for the 10 digits uttered by 116 test speakers. This is an improvement of 2% over the conventional pooled training method.

Proceedings ArticleDOI
30 Sep 1991
TL;DR: A codeword-dependent neural network is presented as a nonlinear mapping function to transform speech data between two speakers; it significantly reduced the error rate and made full use of dynamic information.
Abstract: Speaker normalization may have a significant impact on both speaker-adaptive and speaker-independent speech recognition. In this paper, a codeword-dependent neural network (CDNN) is presented for speaker normalization. The network is used as a nonlinear mapping function to transform speech data between two speakers. The mapping function is characterized by two important properties. First, the assembly of mapping functions enhances overall mapping quality. Second, multiple input vectors are used simultaneously in the transformation. This not only makes full use of dynamic information but also alleviates possible errors in the supervision data. Large-vocabulary continuous speech recognition is chosen to study the effect of speaker normalization. Using speaker-dependent semi-continuous hidden Markov models, performance evaluation over 360 testing sentences from new speakers showed that speaker normalization significantly reduced the error rate from 41.9% to 5.0% when only 40 speaker-dependent sentences were used to estimate CDNN parameters.
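One of the two properties above, using multiple input vectors simultaneously, amounts to feeding the mapping network a context window of consecutive frames rather than a single frame. A sketch of that windowing step (the network itself is omitted; scalars stand in for feature vectors):

```python
def context_windows(frames, left=1, right=1):
    """Stack each frame with its neighbours so the mapping network sees
    dynamic (multi-frame) information, padding the edges by repetition."""
    padded = [frames[0]] * left + list(frames) + [frames[-1]] * right
    return [padded[i:i + left + 1 + right] for i in range(len(frames))]

print(context_windows([1, 2, 3]))  # [[1, 1, 2], [1, 2, 3], [2, 3, 3]]
```

Each window becomes one input to the speaker-transformation network, whose target is the corresponding single frame from the other speaker.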

Journal ArticleDOI
TL;DR: The DragonDictate recognizer was tested with two texts that differed greatly in vocabulary and style and performance for all three speakers was better than the performance for the reference speaker on unadapted models.

02 Sep 1991
TL;DR: An automatic text-independent speaker recognition system is presented, which is suitable for identification as well as for verification purposes, based on spotting the stable part of the vowel phonemes of the test utterances, extracting parameter vectors and classifying them to a speaker-dependent vowel reference database.
Abstract: An automatic text-independent speaker recognition system is presented, which is suitable for identification as well as for verification purposes. The system is based on spotting the stable part of the vowel phonemes of the test utterances, extracting parameter vectors and classifying them against a speaker-dependent vowel reference database. The system was tested over a period of four months with a population of 12 male and female speakers with non-correlated training and test data. The accuracy of the system as measured by experimentation is satisfactory considering that the training utterances per speaker do not exceed 50 s and the test utterances 1 s on average.



Proceedings ArticleDOI
Stephan Euler, J. Zinke
14 Apr 1991
TL;DR: The authors discuss the extension and adaptation of a speaker-independent, small-vocabulary, isolated word recognition system based on tied density hidden Markov models and compare different algorithms to avoid zero probabilities in the word models due to insufficient data.
Abstract: The authors discuss the extension and adaptation of a speaker-independent, small-vocabulary, isolated word recognition system based on tied density hidden Markov models. In the proposed approach, the density functions are trained from a basic set of words using acoustic segmentation, position-dependent segment labeling, and clustering of the segment specific densities. Then the parameters of the word models are estimated by means of a Viterbi update procedure. With a given set of densities the Viterbi update can also be used to generate models for words not included in the basic set. The dependency between the recognition performance and the amount of reference data both for speaker-independent and speaker-dependent experiments is examined in detail. The authors compare different algorithms to avoid zero probabilities in the word models due to insufficient data.



Proceedings ArticleDOI
Marco Ferretti, A.M. Mazza
14 Apr 1991
TL;DR: A set of techniques to perform fast speaker adaptation for a large-vocabulary, natural-language speech recognition system is presented; the experimentation has been carried out using a 20000-word, real-time, natural-language speech recognizer for the Italian language.
Abstract: A set of techniques to perform fast speaker adaptation for a large vocabulary, natural-language, speech recognition system are presented. The experimentation has been carried out using a 20000-word, real-time, natural-language speech recognizer for the Italian language. To perform speaker adaptation within the framework of the probabilistic approach to speech recognition two different problems must be addressed: codebook adaptation and hidden Markov model parameters adaptation. The basic idea is to use a set of data collected from several different speakers as a source of a priori knowledge with a small speech sample provided by the new speaker to perform the adaptation task. Several different techniques for codebook adaptation have been tried and discussed.
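Codebook adaptation, the first of the two problems mentioned, can be sketched as nudging each codeword toward the new speaker's samples that quantize to it; the shift `rate` and the 1-D codebook below are illustrative only, not the paper's technique:

```python
def adapt_codebook(codebook, samples, rate=0.5):
    """Move each codeword toward the mean of the new speaker's samples
    that quantize to it; codewords with no assigned samples stay put."""
    assigned = {i: [] for i in range(len(codebook))}
    for x in samples:
        i = min(range(len(codebook)), key=lambda k: abs(x - codebook[k]))
        assigned[i].append(x)
    return [
        c + rate * (sum(xs) / len(xs) - c) if (xs := assigned[i]) else c
        for i, c in enumerate(codebook)
    ]

# Two codewords; three adaptation samples from the (hypothetical) new speaker.
print(adapt_codebook([0.0, 10.0], [1.0, 1.0, 12.0]))  # [0.5, 11.0]
```

With only a small adaptation sample, a partial shift (`rate` < 1) keeps the prior codebook as a priori knowledge instead of overwriting it, matching the paper's idea of combining multi-speaker priors with a small new-speaker sample.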

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A speaker adaptation method for HMM (hidden Markov model) based speaker-independent speech recognition without supervision is presented. It reduces the confusion between models caused by training on large amounts of data by controlling the influence of the training samples used in HMM training according to the similarity of speaker individuality.
Abstract: A speaker adaptation method for HMM (hidden Markov model) based speaker-independent speech recognition without supervising is presented. This method reduces the confusion between models, which is caused by training using large-size training data, by controlling the influences of the training samples used in HMM training by considering the similarity of speaker individuality. A Markov model and a hidden Markov model are used to represent an input speaker's individuality. These models are compared through their entropy and /b, d, g, m, n, N/ recognition task. The results show that a hidden Markov model is more suitable than a Markov model. >