
Showing papers on "Speaker diarisation published in 1992"


Patent
12 Feb 1992
TL;DR: In this article, a speaker voice verification system uses temporal decorrelation linear transformation and includes a collector for receiving speech inputs from an unknown speaker claiming a specific identity, a word-level speech features calculator operable to use a temporal decorrelation linear transformation for generating word-level speech feature vectors from such speech inputs, and word-level speech feature storage for storing word-level feature vectors known to belong to a speaker with the specific identity.
Abstract: A speaker voice verification system uses temporal decorrelation linear transformation and includes a collector for receiving speech inputs from an unknown speaker claiming a specific identity, a word-level speech features calculator operable to use a temporal decorrelation linear transformation for generating word-level speech feature vectors from such speech inputs, word-level speech feature storage for storing word-level speech feature vectors known to belong to a speaker with the specific identity, a word-level vector scorer for generating a similarity score by comparing word-level speech feature vectors received from the unknown speaker with those retrieved from the word-level speech feature storage, and speaker verification decision circuitry for determining, based on the similarity score, whether the unknown speaker's identity is the same as that claimed. The word-level vector scorer further includes concatenation circuitry as well as a word-specific orthogonalizing linear transformer. Other systems and methods are also disclosed.
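The patent stays at block-diagram level. As a rough, hypothetical sketch (not the patented circuitry), the following concatenates a word's frame features, applies a generic decorrelating linear transform, and thresholds a distance-based similarity score; the transform matrix, dimensions, and threshold are placeholder assumptions.

```python
import numpy as np

def word_level_vector(frames: np.ndarray, transform: np.ndarray) -> np.ndarray:
    """Concatenate the per-frame features of one word (frames is T x d)
    and apply a decorrelating linear transform, yielding a single
    word-level feature vector."""
    x = frames.reshape(-1)        # "concatenation circuitry"
    return transform @ x          # "orthogonalizing linear transformer"

def verify(test_vec, stored_vecs, threshold):
    """Accept the claimed identity when the test vector lies close
    enough to any stored word-level vector for that identity."""
    score = -min(np.linalg.norm(test_vec - v) for v in stored_vecs)
    return score >= threshold
```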

143 citations



PatentDOI
TL;DR: In this article, a speaker verification system which accepts or rejects the claimed identity of an individual based on analysis and measurements of the speaker's utterances is presented; the utterances are elicited by prompting the individual seeking identification to read test phrases, chosen at random by the verification system, composed of words from a small vocabulary.
Abstract: A speaker verification system which accepts or rejects the claimed identity of an individual based on analysis and measurements of the speaker's utterances. The utterances are elicited by prompting the individual seeking identification to read test phrases chosen at random by the verification system composed of words from a small vocabulary. Nearest-neighbor distances between speech frames derived from such spoken test phrases and speech frames of corresponding vocabulary "words" from previously stored utterances of the speaker seeking identification are computed along with distances between such spoken test phrases and corresponding vocabulary words for a set of reference speakers. The claim for identification is accepted or rejected based on the relationship among such distances and a predetermined threshold value.
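A minimal sketch of the nearest-neighbor scoring idea follows; the exact features, alignment, and decision rule are not given above, so the cohort-comparison rule and `margin` parameter are assumptions.

```python
import numpy as np

def nn_distance(test_frames: np.ndarray, ref_frames: np.ndarray) -> float:
    """Mean distance from each test frame to its nearest neighbor among
    the reference utterance's frames. Both arrays are (T, d)."""
    d2 = ((test_frames[:, None, :] - ref_frames[None, :, :]) ** 2).sum(-1)
    return float(np.sqrt(d2.min(axis=1)).mean())

def accept(test, claimant_ref, cohort_refs, margin=0.0):
    """Accept when the claimant's distance beats the best reference
    speaker's distance by at least `margin` (a stand-in decision rule)."""
    d_claim = nn_distance(test, claimant_ref)
    d_cohort = min(nn_distance(test, r) for r in cohort_refs)
    return d_claim + margin < d_cohort
```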

46 citations


Proceedings ArticleDOI
23 Mar 1992
TL;DR: Using the original method developed by Laforia, a series of text-independent speaker recognition experiments, characterized by long-term multivariate autoregressive modeling, gives first-rate results without using more than one sentence.
Abstract: Two models of the spectral evolution of speech signals, temporal decomposition and multivariate linear prediction, capable of processing some aspects of speech variability are presented. A series of acoustic-phonetic decoding experiments, characterized by the use of spectral targets of the temporal decomposition technique and a speaker-dependent mode, gives good results compared to a reference system (i.e., 70% vs. 60% for the first choice). Using the original method developed by Laforia, a series of text-independent speaker recognition experiments, characterized by long-term multivariate autoregressive modeling, gives first-rate results (i.e., a 98.4% recognition rate for 420 speakers) without using more than one sentence. Taking into account the interpretation of the models, these results show how interesting kinematic models are for obtaining a reduced variability of the speech signal representation.
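The abstract does not spell out the AR formulation. As a first-order sketch (the actual systems use higher-order multivariate AR models of the spectral-vector sequence), a model per speaker can be fit by least squares and test utterances scored by their one-step prediction residual:

```python
import numpy as np

def fit_var1(X: np.ndarray) -> np.ndarray:
    """Least-squares fit of a first-order vector AR model
    x_t ~ A @ x_{t-1} to a sequence of spectral vectors X (T, d)."""
    W, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)   # X[:-1] @ W ~ X[1:]
    return W.T                                           # A = W.T

def residual_score(X: np.ndarray, A: np.ndarray) -> float:
    """Mean squared one-step prediction error of a speaker's model on a
    test sequence; the model with the smallest error picks the speaker."""
    pred = X[:-1] @ A.T
    return float(np.mean(np.sum((X[1:] - pred) ** 2, axis=1)))
```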

30 citations


Proceedings ArticleDOI
23 Mar 1992
TL;DR: In this article, a text-independent speaker recognition method using predictive neural networks is described, where an ergodic model which allows transitions to any other state is adopted as the speaker model and one predictive neural network is assigned to each state.
Abstract: A text-independent speaker recognition method using predictive neural networks is described. The speech production process is regarded as a nonlinear process, so the speaker individuality in the speech signal also includes nonlinearity. Therefore, the predictive neural network, which is a nonlinear prediction model based on multilayer perceptrons, is expected to be a more suitable model for representing speaker individuality. For text-independent speaker recognition, an ergodic model which allows transitions to any other state is adopted as the speaker model and one predictive neural network is assigned to each state. The proposed method was compared to distortion-based methods, hidden Markov model (HMM)-based methods, and a discriminative neural-network-based method through text-independent speaker recognition experiments on 24 female speakers. The proposed method gave the highest recognition accuracy of 100.0% and the effectiveness of the predictive neural networks for representing speaker individuality was clarified.
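A minimal sketch of a predictive neural network follows, assuming one-frame prediction with a single hidden layer; the paper assigns one trained net per ergodic-model state, whereas this toy class is untrained and standalone.

```python
import numpy as np

rng = np.random.default_rng(0)

class PredictiveNet:
    """One-hidden-layer perceptron that predicts frame x_t from x_{t-1}.
    One such net would be assigned to each state of an ergodic speaker
    model; a single net is shown here for brevity (weights untrained)."""
    def __init__(self, dim: int, hidden: int = 16):
        self.W1 = rng.normal(0.0, 0.1, (hidden, dim))
        self.W2 = rng.normal(0.0, 0.1, (dim, hidden))

    def predict(self, x: np.ndarray) -> np.ndarray:
        return self.W2 @ np.tanh(self.W1 @ x)

    def error(self, frames: np.ndarray) -> float:
        """Accumulated prediction error over an utterance; the speaker
        whose nets yield the smallest total error is selected."""
        return float(sum(np.sum((self.predict(frames[t - 1]) - frames[t]) ** 2)
                         for t in range(1, len(frames))))
```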

28 citations


Proceedings ArticleDOI
23 Mar 1992
TL;DR: An approach to text-independent speaker verification is presented that uses a two-stage classifier: the first stage is a speaker-independent phoneme detector trained to recognize a phoneme that is distinctive from speaker to speaker.
Abstract: Text-independent speaker verification systems typically depend upon averaging over a long utterance to obtain a feature set for classification. However, not all speech is equally suited to the task of speaker verification. An approach to text-independent speaker verification that uses a two-stage classifier is presented. The first stage consists of a speaker-independent phoneme detector trained to recognize a phoneme that is distinctive from speaker to speaker. The second stage is trained to recognize the frames of speech from the target speaker that are admitted by the phoneme detector. A common feature vector based on the linear predictive coding (LPC) cepstrum is projected in different directions for each of these pattern recognition tasks. Results of tests using the described speaker verification system are shown.
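A hypothetical sketch of the two-stage decision follows; the detector and scorer are stand-in callables, and the 0.5 gate and averaging rule are assumptions not taken from the paper.

```python
import numpy as np

def verify(frames, P_det, P_spk, phoneme_detector, speaker_scorer, theta):
    """Two-stage verification: each LPC-cepstrum frame is seen through a
    task-specific linear projection; only frames the phoneme detector
    admits are scored for the target speaker. `phoneme_detector` and
    `speaker_scorer` are stand-in callables returning probabilities."""
    admitted = [f for f in frames if phoneme_detector(P_det @ f) > 0.5]
    if not admitted:
        return False                         # no distinctive phoneme found
    score = np.mean([speaker_scorer(P_spk @ f) for f in admitted])
    return score > theta
```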

26 citations


Proceedings ArticleDOI
Jerome R. Bellegarda, P.V. de Souza, A. Nadas, David Nahamoo, Michael Picheny, Lalit R. Bahl
23 Mar 1992
TL;DR: An adaptation strategy based on a piecewise linear mapping between the feature space of a new speaker and that of a reference speaker is described, which results in a robust speaker adaptation procedure which allows for a drastic reduction in the amount of training data required from the new speaker.
Abstract: In a large vocabulary speech recognition system, it is desirable to make use of previously acquired speech data when encountering new speakers. The authors describe an adaptation strategy based on a piecewise linear mapping between the feature space of a new speaker and that of a reference speaker. This speaker-normalizing mapping is used to transform the previously acquired parameters of the reference speaker onto the space of the new speaker. This results in a robust speaker adaptation procedure which allows for a drastic reduction in the amount of training data required from the new speaker. The performance of this method is illustrated on an isolated utterance speech recognition task with a vocabulary of 20000 words.
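As an illustrative sketch of a piecewise linear mapping (not the authors' exact estimation procedure), the feature space can be partitioned by nearest prototype and one least-squares linear map fitted per region, assuming time-aligned feature pairs are already available:

```python
import numpy as np

def fit_piecewise_map(ref_feats, new_feats, prototypes):
    """Fit one least-squares linear map per region of the reference
    feature space; regions are defined here by nearest prototype, and
    ref_feats/new_feats are assumed already time-aligned (T, d)."""
    region = np.argmin(
        ((ref_feats[:, None, :] - prototypes[None]) ** 2).sum(-1), axis=1)
    maps = {}
    for k in range(len(prototypes)):
        R, N = ref_feats[region == k], new_feats[region == k]
        if len(R) >= R.shape[1]:             # enough frames for a stable fit
            maps[k], *_ = np.linalg.lstsq(R, N, rcond=None)
    return maps

def map_frame(x, prototypes, maps):
    """Transform one reference-speaker frame into the new speaker's space."""
    k = int(np.argmin(((prototypes - x) ** 2).sum(-1)))
    return x @ maps[k] if k in maps else x
```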

24 citations



01 Jan 1992
TL;DR: New perceptually based features were found which, unfortunately, did not outperform traditional speech production features with respect to speaker identification errors; the main contribution is a new information-theoretic shape measure between line spectrum pair (LSP) frequency features.
Abstract: Scope and method of study. This work derives and demonstrates new and powerful features and measures for automatic speaker recognition and compares them with traditional ones. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. Speaker recognition systems can identify a particular person or verify a person's claimed identity. The scope of this study is limited to speech collected from cooperative users in office environments and without adverse microphone or channel impairments. The success of these systems depends directly upon the power of the features and measures used to discriminate among people. The focus of this research is to discover powerful features and measures for speaker verification. After a thorough literature review, concepts were synthesized from such diverse fields as signal processing, information theory, pattern recognition, physiology, and speech production and perception. The most promising innovations were then compared analytically and by computer simulation. Findings and conclusions. New perceptually based features were found which, unfortunately, did not outperform traditional speech production features with respect to speaker identification errors. Powerful new production features and measures for speaker verification were discovered. The main contribution is a new information theoretic shape measure between line spectrum pair (LSP) frequency features. This new measure, the divergence shape, can be interpreted geometrically as the shape of an information theoretic measure called divergence. The LSPs were found to be very effective features in this divergence shape measure. The experimental results show this combination yields 0.05% speaker identification error, which is superior by over an order of magnitude to the performance of any other claim reported in the literature.
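The divergence shape itself is not written out above; the sketch below uses one standard formulation, the covariance-only term of the symmetric Gaussian divergence, estimated directly from two sets of LSP vectors (taking this formulation to be the intended one is an assumption):

```python
import numpy as np

def divergence_shape(X1: np.ndarray, X2: np.ndarray) -> float:
    """Covariance-only ("shape") term of the symmetric Gaussian
    divergence between two sets of LSP feature vectors (rows = frames):
    0.5 * tr[(C1 - C2) (C2^-1 - C1^-1)]."""
    C1 = np.cov(X1, rowvar=False)
    C2 = np.cov(X2, rowvar=False)
    return 0.5 * float(np.trace(
        (C1 - C2) @ (np.linalg.inv(C2) - np.linalg.inv(C1))))
```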

17 citations


Proceedings Article
01 Jan 1992
TL;DR: A new text-independent speaker recognition method is proposed that uses a model of the spectral evolution of the speech signals capable of processing some aspects of the inter-speaker variability: the AR-Vector models.
Abstract: In this paper, a new text-independent speaker recognition method is proposed. This method uses a model of the spectral evolution of the speech signals which is capable of processing some aspects of the inter-speaker variability: the AR-Vector models. Some inter-speaker measures are presented and their advantages and drawbacks are discussed. A training technique to learn discriminant AR-Vector models is proposed. The evaluation of this method is carried out on the TIMIT database, recorded by cooperative speakers without any impostors. A series of text-independent speaker identification experiments is described. There is no specific corpus for the training sentences, and the training corpus is different from the test corpus. Two speech qualities are tested (i.e., good quality and phone quality). The experiments with good speech quality give first-rate results (i.e., an identification rate of 100% for 420 speakers) without using more than two sentences for each test.

17 citations


Proceedings ArticleDOI
23 Mar 1992
TL;DR: The CPAM approach is shown to perform better than a vector quantization based approach in text-independent speaker recognition, and as well as the text-dependent, conventional, continuous mixture HMM approach with significant representation efficiency.
Abstract: A continuous probabilistic acoustic map (CPAM) approach to speaker recognition is investigated. In the CPAM formulation, the speech input of a speaker is parameterized as a mixture of tied, universal probability density functions (PDFs), with either a CPAM model alone for text-independent operation or a CPAM-based hidden Markov model (HMM) for text-dependent operation. A continuously spoken digit database of 20 speakers (10 M, 10 F) is used to evaluate the CPAM approach in both identification and verification performance. The CPAM approach is shown to perform better than a vector quantization based approach in text-independent speaker recognition, and as well as the text-dependent, conventional, continuous mixture HMM approach with significantly greater representation efficiency. In particular, the CPAM-based HMM achieves an identification error rate of 1.7% and a verification equal-error rate of 4.0% with a CPAM of 128 PDFs, while a conventional continuous mixture HMM needs 400 PDFs to achieve corresponding error rates of 1.9% and 4.0% using the same combined cepstral features and three-digit test utterances.
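A minimal sketch of scoring under tied, universal PDFs follows, assuming diagonal-covariance Gaussians; in a CPAM-style model only the mixture weights are speaker-specific, so identification reduces to comparing this log-likelihood across speakers' weight vectors:

```python
import numpy as np

def tied_mixture_loglik(X, means, variances, weights):
    """Log-likelihood of frames X (T, d) under a mixture of tied,
    universal diagonal-covariance Gaussians: means/variances (K, d) are
    shared across speakers, only the weights (K,) are speaker-specific."""
    diff2 = ((X[:, None, :] - means[None]) ** 2 / variances[None]).sum(-1)
    log_comp = (-0.5 * np.log(2 * np.pi * variances).sum(-1)[None]
                - 0.5 * diff2 + np.log(weights)[None])        # (T, K)
    m = log_comp.max(axis=1, keepdims=True)                   # logsumexp
    return float((m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))).sum())
```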

PatentDOI
Sanada Toru, Shinta Kimura
TL;DR: In this article, a speaker-adapted speech recognition system is described comprising a plurality of speakers' acoustic templates for managing correspondence between acoustic features of speech and its content, a converting portion for converting the acoustic features managed by the acoustic templates according to a set parameter, and a learning portion for learning the parameter at which the acoustic features of the acoustic template, as converted by the converting portion, approximately coincide with those of a corresponding speech input for learning.
Abstract: A speaker adapted speech recognition system achieving a high recognition rate for an unknown speaker comprises a plurality of speakers' acoustic templates for managing correspondence between an acoustic feature of the speech and a content of the speech; a converting portion for converting the acoustic feature of the speech managed by the acoustic templates according to a set parameter; a learning portion for learning the parameter at which the acoustic feature of the acoustic template, as converted by the converting portion, is approximately coincident with the acoustic feature of a corresponding speech input for learning, when the speech input for learning is provided; and a selection portion for selecting one or more of the acoustic templates whose converted acoustic features are closest to that of a speech input for selection, by comparing the corresponding acoustic feature of the speech input for selection with the acoustic features converted by the converting portion, when the speech input for selection is provided. An acoustic template for the unknown speaker is created by converting, with the converter, the acoustic features of the acoustic templates of the speakers selected by the selection portion, and the content of the unknown speaker's speech input is recognized using the created acoustic template.

Proceedings ArticleDOI
07 Jun 1992
TL;DR: It was shown that the two-way classifiers can be combined to achieve 100% speaker identification performance for large speaker populations.
Abstract: The N-way speaker identification task is partitioned into N*(N-1)/2 binary-pair classifications. The binary-pair classifications are performed with small neural nets, each trained to make independent binary decisions on small fragments of speech data. Three issues were investigated concerning optimally combining a large number of fragmentary binary decisions into a single N-way decision: (1) incorporating speech energy and phonetic content information to compute an improved probability measure at the individual speech frame level; (2) combining binary frame-level decisions into a binary segment-level decision; and (3) combining the binary segment-level decisions into a single N-way segment level decision. It was shown that the two-way classifiers can be combined to achieve 100% speaker identification performance for large speaker populations.
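The paper investigates several combination rules; the sketch below shows one simple, assumed variant that sums per-pair log-probabilities into speaker-level evidence:

```python
import numpy as np

def nway_decision(pair_probs: dict, n: int) -> int:
    """Combine the N*(N-1)/2 binary-pair outputs into one N-way
    decision. pair_probs[(i, j)] is the segment-level probability that
    the speech came from speaker i rather than j (i < j); summing
    log-probabilities per speaker is one simple combination rule."""
    score = np.zeros(n)
    for (i, j), p in pair_probs.items():
        p = float(np.clip(p, 1e-6, 1.0 - 1e-6))
        score[i] += np.log(p)
        score[j] += np.log(1.0 - p)
    return int(np.argmax(score))
```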

Proceedings ArticleDOI
23 Feb 1992
TL;DR: A speaker-independent normalization network is constructed such that speaker variation effects can be minimized; performance evaluation showed that the speaker-normalized front end reduced the error rate by 15% on the DARPA Resource Management speaker-independent speech recognition task.
Abstract: For speaker-independent speech recognition, speaker variation is one of the major error sources. In this paper, a speaker-independent normalization network is constructed such that speaker variation effects can be minimized. To achieve this goal, multiple speaker clusters are constructed from the speaker-independent training database. A codeword-dependent neural network is associated with each speaker cluster. The cluster that contains the largest number of speakers is designated as the golden cluster. The objective function is to minimize distortions between acoustic data in each cluster and the golden speaker cluster. Performance evaluation showed that the speaker-normalized front end reduced the error rate by 15% for the DARPA Resource Management speaker-independent speech recognition task.

Proceedings ArticleDOI
23 Mar 1992
TL;DR: A procedure for text-independent speaker identification in noisy environments where the interfering background signals cannot be characterized using traditional broadband or impulsive noise models is examined.
Abstract: A procedure for text-independent speaker identification in noisy environments where the interfering background signals cannot be characterized using traditional broadband or impulsive noise models is examined. In the procedure, both the speaker and the background processes are modeled using mixtures of Gaussians. Speaker and background models are integrated into a unified statistical framework allowing the decoupling of the underlying speech process from the noise corrupted observations via the expectation-maximization algorithm. Using this formalism, speaker model parameters are estimated in the presence of the background process, and a scoring procedure is implemented for computing the speaker likelihood in the noise corrupted environment. The performance was evaluated using a 16-speaker conversational speech database with both speech babble and white noise background processes.
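As a crude, assumed stand-in for the paper's EM-based decoupling (which estimates speaker parameters jointly with the background process), the sketch below scores each frame under whichever of the speaker or background GMM explains it better:

```python
import numpy as np

def gmm_frame_loglik(X, means, variances, weights):
    """Per-frame log-density under a diagonal-covariance GMM."""
    diff2 = ((X[:, None, :] - means[None]) ** 2 / variances[None]).sum(-1)
    log_comp = (-0.5 * np.log(2 * np.pi * variances).sum(-1)[None]
                - 0.5 * diff2 + np.log(weights)[None])
    m = log_comp.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))

def score_with_background(X, speaker_gmm, background_gmm):
    """Let each frame be explained by whichever model (speaker or
    background) fits it better -- a rough approximation of decoupling
    the speech process from the noise-corrupted observations."""
    ls = gmm_frame_loglik(X, *speaker_gmm)
    lb = gmm_frame_loglik(X, *background_gmm)
    return float(np.maximum(ls, lb).sum())
```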

Journal ArticleDOI
TL;DR: This paper investigates text-independent speaker verification, which involves determining whether or not a test utterance belongs to a specific reference speaker; the information that must be stored in the reference templates differs from the text-dependent case.


Book ChapterDOI
01 Jan 1992
TL;DR: A large vocabulary continuous speech recognition system developed at AT&T Bell Laboratories is described, and the methods used to provide high word recognition accuracy are discussed, focusing on the techniques adopted to select the set of fundamental speech units and to provide the acoustic models of these sub-word units based on a continuous density HMM (CDHMM) framework.
Abstract: The field of large vocabulary continuous speech recognition has advanced to the point where there are several systems capable of providing greater than 95% word accuracy for speaker independent recognition, of a 1000 word vocabulary, spoken fluently for a task with a perplexity of about 60. There are several factors which account for the high performance achieved by these systems, including the use of effective feature analysis, the use of hidden Markov model (HMM) methodology, the use of context-dependent sub-word units to capture intra-word and inter-word phonemic variations, and the use of corrective training techniques to emphasize differences between acoustically similar words in the vocabulary. In this paper we describe a large vocabulary continuous speech recognition system developed at AT&T Bell Laboratories, and discuss the methods used to provide high word recognition accuracy. In particular we focus our discussion on the techniques adopted to select the set of fundamental speech units and to provide the acoustic models of these sub-word units based on a continuous density HMM (CDHMM) framework. Different modeling approaches, such as a discrete HMM and a tied-mixture HMM, will also be discussed and compared to the CDHMM approach.

Patent
11 Jul 1992
TL;DR: In this article, a speech recognition method adapts to a new, unknown speaker using statistical modelling of word sub-units (hidden Markov model recognition) by transforming characteristic vectors of the new speaker and a reference speaker into a common space.
Abstract: The speech recognition method adapting to a new unknown speaker uses statistical modelling of word sub-units (hidden Markov model recognition). The method is carried out by transformation of characteristic vectors of the new speaker and a reference speaker. Multi-dimensional distribution functions are used in place of quantised character vectors of a reference speaker. The characteristic vectors of the new speaker and a reference speaker are transformed into a common characteristic space. To calculate the necessary transformation matrices, the new speaker repeats some predetermined words in a training phase. USE/ADVANTAGE - Speech recognition via telephone, e.g. for automatic recognition systems in vehicles. Quick recognition for large vocabularies.


Proceedings ArticleDOI
23 Mar 1992
TL;DR: The authors address the problem of speaker recognition using very short utterances, both for training and for recognition, using a nonlinear vectorial interpolation technique to exploit speaker-specific correlations between two suitably defined parameter vector sequences.
Abstract: The authors address the problem of speaker recognition using very short utterances, both for training and for recognition. The authors propose to exploit speaker-specific correlations between two suitably defined parameter vector sequences. A nonlinear vectorial interpolation technique is used to capture speaker-specific information, through least-square-error minimization. The experiments show the feasibility of recognizing a speaker among a population of about 100 persons using only an utterance of one word both for training and for recognition.

Proceedings Article
30 Nov 1992
TL;DR: A Gender Dependent Neural Network (GDNN) is discussed which can be tuned for each gender while sharing most of the speaker-independent parameters; it uses a classification network to help generate gender-dependent phonetic probabilities for a statistical (HMM) recognition system.
Abstract: We would like to incorporate speaker-dependent consistencies, such as gender, in an otherwise speaker-independent speech recognition system. In this paper we discuss a Gender Dependent Neural Network (GDNN) which can be tuned for each gender, while sharing most of the speaker independent parameters. We use a classification network to help generate gender-dependent phonetic probabilities for a statistical (HMM) recognition system. The gender classification net predicts the gender with high accuracy, 98.3% on a Resource Management test set. However, the integration of the GDNN into our hybrid HMM-neural network recognizer provided an improvement in the recognition score that is not statistically significant on a Resource Management test set.

Proceedings ArticleDOI
23 Mar 1992
TL;DR: Using the trained model and a brief, unconstrained sample of a new speaker's voice, the system produces a speaker voice code that can be used to adapt a recognition system to the new speaker without retraining.
Abstract: SVCnet, a system for modeling speaker variability, is presented. Encoder neural networks specialized for each speech sound produce low-dimensionality models of acoustical variation, and these models are further combined into an overall model of voice variability. A training procedure is described which minimizes the dependence of this model on which sounds have been uttered. Using the trained model (SVCnet) and a brief, unconstrained sample of a new speaker's voice, the system produces a speaker voice code that can be used to adapt a recognition system to the new speaker without retraining. A system which combines SVCnet with a MS-TDNN recognizer is described.


01 Jan 1992
TL;DR: A modification of semi-continuous codebook updating is introduced which allows rapid speaker adaptation, based on the idea that phonetic information already incorporated in a trained model should be used to update the codebook.
Abstract: This paper presents a new approach to speaker adaptation based on semi-continuous hidden Markov models (SCHMM). We introduce a modification of the semi-continuous codebook updating which allows rapid speaker adaptation. The approach is based on the idea that phonetic information already incorporated in a trained model should be used to update the codebook. Thus the different acoustic representation of a new speaker is learned while the connection between codebook entries and model states remains the same. Several experiments were carried out with a small speech sample. It is possible to demonstrate that the new codebook updating performs better than conventional SCHMM codebook updating and that a speech sample comprising about 40 seconds of adaptation speech is enough to achieve 50 percent of the difference in performance between full speaker-dependent training and no adaptation at all.
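A minimal sketch of the codebook updating idea follows, assuming frame-level posteriors `gamma` computed with the existing trained model; only the codebook means move, so the connection between codebook entries and model states is preserved:

```python
import numpy as np

def update_codebook(means, frames, gamma, min_count=1e-3):
    """Re-estimate SCHMM codebook means from adaptation speech.
    gamma[t, k] is the posterior of codebook entry k at frame t,
    computed with the existing trained model, so the association
    between codebook entries and HMM states is left untouched."""
    counts = gamma.sum(axis=0)                        # (K,)
    new_means = (gamma.T @ frames) / np.maximum(counts, min_count)[:, None]
    unseen = counts < min_count                       # keep old means where
    new_means[unseen] = means[unseen]                 # no data was observed
    return new_means
```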



01 Jan 1992
TL;DR: A speaker-independent continuous speech recognition system based on phoneme-level hidden Markov models is described, configured to recognise continuously spoken airborne reconnaissance reports, a task which involves a vocabulary of approximately 500 words.
Abstract: This memorandum describes the development of a speaker-independent continuous speech recognition system based on phoneme-level hidden Markov models. The system is configured to recognise continuously spoken airborne reconnaissance reports, a task which involves a vocabulary of approximately 500 words. On a test set of speech from 80 male subjects, the final system achieves a word accuracy of 74.1% with no explicit syntactic constraints.

Proceedings ArticleDOI
30 Aug 1992
TL;DR: This paper reports on a speaker-independent continuous speech recognition system with speaker adaptation, based on continuous HMMs with mixture Gaussian distributions; several fast algorithms are applied for real-time calculation.
Abstract: Reports on a speaker-independent continuous speech recognition system with speaker adaptation. This system is based on continuous HMMs with mixture Gaussian distributions, and several fast algorithms are applied for real-time calculation. To reduce the number of HMM states, the system uses a mono-state context-dependent model, called the acoustic phonetic segment. The calculation of mixture Gaussian distributions is reduced by varying the number of mixtures dynamically. A fast Viterbi calculation algorithm with duration control is used. The system has been successfully implemented as a man-machine interface for a plant control expert system, achieving a sentence accuracy of 98.7%.