
Showing papers on "Speaker diarisation published in 1995"


Journal ArticleDOI
TL;DR: The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.
Abstract: This paper introduces and motivates the use of Gaussian mixture models (GMM) for robust text-independent speaker identification. The individual Gaussian components of a GMM are shown to represent some general speaker-dependent spectral shapes that are effective for modeling speaker identity. The focus of this work is on applications which require high identification rates using short utterances from unconstrained conversational speech, and robustness to degradations produced by transmission over a telephone channel. A complete experimental evaluation of the Gaussian mixture speaker model is conducted on a 49-speaker conversational telephone speech database. The experiments examine algorithmic issues (initialization, variance limiting, model order selection), spectral variability robustness techniques, large population performance, and comparisons to other speaker modeling techniques (uni-modal Gaussian, VQ codebook, tied Gaussian mixture, and radial basis functions). The Gaussian mixture speaker model attains 96.8% identification accuracy using 5 second clean speech utterances and 80.8% accuracy using 15 second telephone speech utterances with a 49 speaker population, and is shown to outperform the other speaker modeling techniques on an identical 16 speaker telephone speech task.

3,134 citations
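
The modeling recipe above maps directly onto modern tooling. Below is a minimal sketch of per-speaker GMM training and maximum-likelihood identification, assuming MFCC-style feature matrices are already extracted; the component count, diagonal covariances, and the reg_covar floor (standing in for the paper's variance limiting) are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of GMM-based text-independent speaker identification
# (hedged: data shapes and hyperparameters are illustrative).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_per_speaker, n_components=32):
    """features_per_speaker: dict mapping speaker id -> (n_frames, n_ceps) array."""
    models = {}
    for spk, feats in features_per_speaker.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              reg_covar=1e-3)  # plays the role of variance limiting
        gmm.fit(feats)
        models[spk] = gmm
    return models

def identify(models, test_feats):
    """Pick the speaker whose GMM gives the highest average frame log-likelihood."""
    scores = {spk: gmm.score(test_feats) for spk, gmm in models.items()}
    return max(scores, key=scores.get)
```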


Journal ArticleDOI
TL;DR: A scheme for developing a voice conversion system that converts the speech signal uttered by a source speaker to a speech signal having the voice characteristics of the target speaker using formants and a formant vocoder is proposed.

207 citations


Proceedings ArticleDOI
09 May 1995
TL;DR: The data augmentation technique is based on the metamorphic algorithm first proposed in Bellegarda et al.
Abstract: Speaker adaptation typically involves customizing some existing (reference) models in order to account for the characteristics of a new speaker. This work considers the slightly different paradigm of customizing some reference data for the purpose of populating the new speaker's space, and then using the resulting (augmented) data to derive the customized models. The data augmentation technique is based on the metamorphic algorithm first proposed in Bellegarda et al. [1992], assuming that a relatively modest amount of data (100 sentences) is available from each new speaker. This constraint requires that reference speakers be selected with some care. The performance of this method is illustrated on a portion of the Wall Street Journal task.

165 citations


Proceedings Article
01 Jan 1995
TL;DR: This paper explores supervised speaker adaptation and normalization in the MLP component of a hybrid hidden Markov model / multilayer perceptron version of SRI's DECIPHER™ speech recognition system.
Abstract: In speaker-independent, large-vocabulary continuous speech recognition systems, recognition accuracy varies considerably from speaker to speaker, and performance may be significantly degraded for outlier speakers such as nonnative talkers. In this paper, we explore supervised speaker adaptation and normalization in the MLP component of a hybrid hidden Markov model / multilayer perceptron version of SRI's DECIPHER™ speech recognition system. Normalization is implemented through an additional transformation network that preprocesses the cepstral input to the MLP. Adaptation is accomplished through incremental retraining of the MLP weights on adaptation data. Our approach combines both adaptation and normalization in a single, consistent manner, works with limited adaptation data, and is text-independent. We show significant improvement in recognition accuracy.

95 citations
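
The two mechanisms described, a transformation network on the cepstral input versus incremental retraining of the recognizer's own weights, can be sketched with a toy PyTorch model. This is an illustration of the idea only: the layer sizes, 13-dimensional cepstra, 40 phone classes, and optimizer settings are assumptions, and the real system is SRI's HMM/MLP hybrid, not this toy.

```python
# Toy sketch of normalization vs. adaptation (hedged: all sizes are assumptions).
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(13, 128), nn.Sigmoid(), nn.Linear(128, 40))

# Normalization: train only a small transformation network that preprocesses
# the cepstral input; the reference MLP stays frozen.
transform = nn.Linear(13, 13)
for p in mlp.parameters():
    p.requires_grad = False

opt = torch.optim.SGD(transform.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def normalization_step(cepstra, phone_targets):
    """cepstra: (N, 13) float tensor; phone_targets: (N,) long tensor."""
    opt.zero_grad()
    loss = loss_fn(mlp(transform(cepstra)), phone_targets)
    loss.backward()  # gradient reaches only the transform, not the frozen MLP
    opt.step()
    return loss.item()

# Adaptation, by contrast, would re-enable the MLP's gradients and retrain
# its weights incrementally on the same supervised adaptation data.
```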


Proceedings ArticleDOI
09 May 1995
TL;DR: In this article, a modular system for flexible human-computer interaction via speech is presented, which integrates acoustic and visual information (automatic lip-reading) improving overall recognition, especially in noisy environments.
Abstract: We present the development of a modular system for flexible human-computer interaction via speech. The speech recognition component integrates acoustic and visual information (automatic lip-reading) improving overall recognition, especially in noisy environments. The image of the lips, constituting the visual input, is automatically extracted from the camera picture of the speaker's face by the lip locator module. Finally, the speaker's face is automatically acquired and followed by the face tracker sub-system. Integration of the three functions results in the first bi-modal speech recognizer allowing the speaker reasonable freedom of movement within a possibly noisy room while continuing to communicate with the computer via voice. Compared to audio-alone recognition, the combined system achieves a 20 to 50 percent error rate reduction for various signal/noise conditions.

94 citations


PatentDOI
TL;DR: In this article, a neural network is trained to transform distant-talking cepstrum coefficients, derived from a microphone array receiving speech from a distant speaker, into a form substantially similar to the close-talking coefficients that would be derived from a microphone close to the speaker, providing robust hands-free speech and speaker recognition in adverse practical environments with existing speech and speaker recognition systems that have been trained on close-talking speech.
Abstract: A neural network is trained to transform distant-talking cepstrum coefficients, derived from a microphone array receiving speech from a speaker distant therefrom, into a form substantially similar to close-talking cepstrum coefficients that would be derived from a microphone close to the speaker, for providing robust hands-free speech and speaker recognition in adverse practical environments with existing speech and speaker recognition systems which have been trained on close-talking speech.

81 citations
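
A hedged sketch of the patent's core mapping, with scikit-learn's MLPRegressor standing in for the trained neural network; the patent does not specify this architecture, and the array names and sizes are illustrative.

```python
# Regression net mapping distant-talking cepstra toward close-talking cepstra
# (a sketch of the idea, not the patent's actual network).
from sklearn.neural_network import MLPRegressor

def train_mapper(distant_ceps, close_ceps):
    """Both: (n_frames, n_ceps) arrays from time-aligned distant/close takes."""
    net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
    net.fit(distant_ceps, close_ceps)
    return net

# At recognition time, net.predict(distant_ceps) feeds an existing recognizer
# that was trained on close-talking speech, unchanged.
```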


Journal ArticleDOI
TL;DR: A speech spectrum transformation method that interpolates multiple speakers' spectral patterns, using a multi-functional representation with Radial Basis Function networks, to generate new spectrum patterns close to those of the target speaker.

63 citations


Journal ArticleDOI
TL;DR: A discriminative training approach is used which takes into account the models of other competing speakers and formulates the optimization criterion such that speaker separation is enhanced and speaker recognition error rate on the training data is directly minimized.
Abstract: The use of discriminative training to construct hidden Markov models of speakers for verification and identification is studied. As opposed to conventional maximum likelihood training, which estimates a speaker's model based only on the training utterances from the same speaker, a discriminative training approach is used which takes into account the models of other competing speakers and formulates the optimization criterion such that speaker separation is enhanced and speaker recognition error rate on the training data is directly minimized. The optimization solution is obtained with a probabilistic descent algorithm. For all experiments, an isolated digit database consisting of 100 speakers is used. For speaker identification, the resulting discriminative speaker models reduce the identification error rate by more than 25% over the results obtained with the conventional training algorithm. A new normalized score function is proposed which makes the verification formulation consistent with the minimum error training objective. When combining the proposed verification score function with discriminative training, an average equal error rate of 0.8% is achieved using only one-digit test utterances. This represents an error rate reduction of over 80% from an average equal error rate of 6.1% when using the conventional algorithm for training and the unnormalized score function for testing.

63 citations
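
The training objective the abstract invokes is, in the standard minimum-classification-error formulation of Juang and Katagiri (the paper's exact definitions may differ), a smoothed error count built from a misclassification measure:

```latex
% Misclassification measure for speaker i (g_j = log-likelihood under
% speaker j's HMM), and the sigmoid loss minimized by probabilistic
% (gradient) descent; \eta and \gamma are smoothing constants.
d_i(x) = -g_i(x;\lambda_i)
       + \frac{1}{\eta}\log\!\Big[\frac{1}{M-1}\sum_{j\neq i} e^{\eta\, g_j(x;\lambda_j)}\Big],
\qquad
\ell_i(x) = \frac{1}{1+e^{-\gamma\, d_i(x)}}.
```

Driving the loss down pushes the target speaker's score away from its strongest competitors, which is exactly the "speaker separation" the abstract describes.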


Proceedings ArticleDOI
09 May 1995
TL;DR: This paper investigates features that are based on amplitude and frequency modulations of speech formants, high resolution measurement of fundamental frequency and location of "secondary pulses", measured using a high-resolution energy operator.
Abstract: The performance of systems for speaker identification (SID) can be quite good with clean speech, though much lower with degraded speech. Thus it is useful to search for new features for SID, particularly features that are robust over a degraded channel. This paper investigates features that are based on amplitude and frequency modulations of speech formants, high resolution measurement of fundamental frequency and location of "secondary pulses", measured using a high-resolution energy operator. When these features are added to traditional features using an existing SID system with a 168 speaker telephone speech database, SID performance improved by as much as 4% for male speakers and 8.2% for female speakers.

57 citations
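
The "high-resolution energy operator" here is plausibly of the Teager-Kaiser family commonly used for AM-FM analysis of formants (an assumption; the paper may use a variant). The discrete operator itself is a few lines of NumPy:

```python
# Discrete Teager-Kaiser energy operator, the usual tool behind AM-FM
# formant analysis (assumed here to be representative of the paper's
# "high-resolution energy operator").
import numpy as np

def teager(x):
    """Psi[n] = x[n]^2 - x[n-1]*x[n+1]; tracks instantaneous AM-FM energy."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi
```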


PatentDOI
TL;DR: In this paper, a system and method for adaptation of a speaker independent speech recognition system for use by a particular user is presented, where a test speaker's acoustic characterization is compared with acoustic characterization data generated for a plurality of training speakers.
Abstract: A system and method for adaptation of a speaker independent speech recognition system for use by a particular user. The system and method gather acoustic characterization data from a test speaker and compare the data with acoustic characterization data generated for a plurality of training speakers. A match score is computed between the test speaker's acoustic characterization for a particular acoustic subspace and each training speaker's acoustic characterization for the same acoustic subspace. The training speakers are ranked for the subspace according to their scores and a new acoustic model is generated for the test speaker based upon the test speaker's acoustic characterization data and the acoustic characterization data of the closest matching training speakers. The process is repeated for each acoustic subspace.

56 citations
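
A hedged sketch of the subspace-ranking idea: score each training speaker against the test speaker per acoustic subspace, keep the closest ones, and pool their data for the new model. The scoring function and per-subspace statistics below are stand-ins; the patent does not pin these down.

```python
# Rank training speakers by closeness to the test speaker in one subspace
# (illustrative: a negative Euclidean distance stands in for the match score).
import numpy as np

def rank_training_speakers(test_stats, train_stats, k=5):
    """test_stats: (n_dims,) summary vector for one acoustic subspace;
    train_stats: dict speaker -> (n_dims,) vector for the same subspace."""
    scores = {spk: -np.linalg.norm(test_stats - v)  # higher = closer
              for spk, v in train_stats.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Repeat per subspace, then build the test speaker's model from their own
# data plus the data of the selected nearest training speakers.
```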


Patent
29 Sep 1995
TL;DR: In this article, a technique for improving speech recognition in low-cost, speech-interactive devices is proposed, which calls for implementing a speaker-specific word enrollment and detection unit in parallel with a word detection unit, permitting comprehension of spoken commands or messages via binary questions when no recognizable words are found.
Abstract: A technique for improving speech recognition in low-cost, speech-interactive devices. This technique calls for implementing a speaker-specific word enrollment and detection unit in parallel with a word detection unit, permitting comprehension of spoken commands or messages via binary questions when no recognizable words are found. Preferably, specific-speaker detection will be based on the speaker's own personal list of words or expressions. Other facets include complementing non-specific pre-registered word characteristic information with individual, speaker-specific verbal characteristics, improving recognition in cases where the speaker has unusual speech mannerisms or an accent, and response alteration, in which speaker-specific registration functions are leveraged to provide access and permit changes to a predefined response table according to user needs and tastes.

PatentDOI
TL;DR: In this article, a speaker recognition method and system which applies adaptive component weighting to each frame of speech for attenuating non-vocal tract components and normalizing speech components is presented.
Abstract: The present invention relates to a speaker recognition method and system which applies adaptive component weighting to each frame of speech, attenuating non-vocal-tract components and normalizing speech components. A linear predictive all-pole model is used to form a new transfer function having a moving average component. A normalized spectrum, defined to have improved characteristics for the speech components, is determined from the new transfer function. From these improved speech components, improved speaker recognition over a channel is obtained.

Proceedings Article
01 Jan 1995
TL;DR: Experimental application of the speaker recognition method based on hidden Markov model composition to text-independent speaker identification and verification in various kinds of noisy environments demonstrated considerable improvement in speaker recognition for speech utterances of male speakers.
Abstract: This paper investigates a speaker recognition method that is robust against background noise. In noisy environments, one important issue is how to create a model for each speaker so as to compensate for noise. The method described here is based on hidden Markov model (HMM) composition, which combines a speaker HMM and a noise-source HMM into a noise-added speaker HMM with a particular signal-to-noise ratio (SNR). Since it is difficult to measure the SNR of input speech with non-stationary noise exactly, this method creates several noise-added speaker HMMs with various SNRs. The HMM that has the highest likelihood value for the input speech is selected, and a speaker decision is made using this likelihood value. Experimental application of this method to text-independent speaker identification and verification in various kinds of noisy environments demonstrated considerable improvement in speaker recognition for speech utterances of male speakers.
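
The multi-SNR selection step can be illustrated with a simplified analogue. In the sketch below, GMMs stand in for HMMs, and noise is mixed into training audio rather than composed in model space, so this only demonstrates the idea of scoring against several noise-added models and keeping the best, not the paper's HMM composition itself.

```python
# Simplified analogue of multi-SNR noise-added speaker modeling.
import numpy as np
from sklearn.mixture import GaussianMixture

def add_noise(speech, noise, snr_db):
    """Scale noise so that speech + noise has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)
    gain = np.sqrt(np.mean(speech ** 2) /
                   (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def noise_added_models(clean_speech, noise, extract_feats, snrs=(0, 10, 20)):
    """extract_feats is a hypothetical frame-feature extractor (e.g. cepstra)."""
    return {snr: GaussianMixture(n_components=8, covariance_type="diag")
                 .fit(extract_feats(add_noise(clean_speech, noise, snr)))
            for snr in snrs}

def speaker_score(models, test_feats):
    # Select whichever SNR-matched model explains the input best, and use
    # that likelihood for the speaker decision, as the paper describes.
    return max(m.score(test_feats) for m in models.values())
```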

Journal ArticleDOI
TL;DR: A technique is presented for adapting all the speech models to a new speaker's voice when the speaker has provided only an incomplete set of the vocabulary, based upon using the training set to obtain estimates of correlations between sounds.


Proceedings ArticleDOI
09 May 1995
TL;DR: A new system is presented for text-dependent speaker verification that uses data fusion concepts to combine the results of distortion-based and discriminant-based classifiers, and yields an equal error rate of two percent for this task, which is better than the individual performance of either classifier.
Abstract: A new system is presented for text-dependent speaker verification. The system uses data fusion concepts to combine the results of distortion-based and discriminant-based classifiers. Hence, both intraspeaker and interspeaker information are utilized in the final decision. The distortion- and discriminant-based classifiers are based on dynamic time warping (DTW) and the neural tree network (NTN), respectively. The system is evaluated with several hundred two-word utterances collected over a telephone channel. The combined classifier yields an equal error rate of two percent for this task, which is better than the individual performance of either classifier.
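
A hedged sketch of the fusion step: a DTW distortion (lower is better) and an NTN discriminant score (higher is better) are combined into one accept/reject decision. The fusion rule and constants below are illustrative; the paper does not specify them here.

```python
# Combine a distortion score and a discriminant score for verification
# (illustrative linear fusion; weights/threshold come from development data).
def fused_decision(dtw_dist, ntn_score, w=0.5, threshold=0.0):
    # Flip the distortion's sign so both scores point the same way, then
    # take a convex combination (scores assumed pre-normalized, e.g. to
    # zero mean / unit variance on a development set).
    combined = w * (-dtw_dist) + (1.0 - w) * ntn_score
    return combined > threshold  # True = accept the claimed speaker
```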



Proceedings Article
01 Jan 1995
TL;DR: The results tend to show that the speaker-dependent information captured by long-term second-order statistics is consistently common to all phonetic classes, and that the homogeneity of the test material may improve the quality of the estimates.
Abstract: Second-order statistical methods show very good results for automatic speaker identification in controlled recording conditions [2]. These approaches are generally used on the entire speech material available. In this paper, we study the influence of the content of the test speech material on the performance of such methods, i.e. under a more analytical approach [3]. The goal is to investigate the kind of information which is used by these methods, and where it is located in the speech signal. Liquids and glides together, vowels, and more particularly nasal vowels and nasal consonants, are found to be particularly speaker specific: test utterances of 1 second, composed in majority of acoustic material from one of these classes, provide better speaker identification results than phonetically balanced test utterances, even though the training is done, in both cases, with 15 seconds of phonetically balanced speech. Nevertheless, results with other phoneme classes are never dramatically poor. These results tend to show that the speaker-dependent information captured by long-term second-order statistics is consistently common to all phonetic classes, and that the homogeneity of the test material may improve the quality of the estimates.

Journal Article
TL;DR: In this paper, five approaches that can be used to control and simplify the speech recognition task are examined: isolated words, speaker-dependent systems, limited vocabulary size, a tightly constrained grammar, and quiet and controlled environmental conditions.
Abstract: Five approaches that can be used to control and simplify the speech recognition task are examined. They entail the use of isolated words, speaker-dependent systems, limited vocabulary size, a tightly constrained grammar, and quiet and controlled environmental conditions. The five components of a speech recognition system are described: a speech capture device, a digital signal processing module, preprocessed signal storage, reference speech patterns, and a pattern-matching algorithm. Current speech recognition systems are reviewed and categorized. Speaker recognition approaches and systems are also discussed.


Proceedings Article
01 Jan 1995
TL;DR: This chapter discusses several techniques for identifying segment transitions in an audio stream and a novel speaker discrimination is described that makes segmentation decisions when a continuously updated model of the current speaker suddenly ceases to sufficiently account for the input data.
Abstract: This chapter discusses several techniques for identifying segment transitions in an audio stream. Gross features are first identified that control more detailed and computationally expensive analysis down stream. The immediate goal of the audio processing is to identify transition points between segments and to do an initial content oriented labeling of the segments. The technique illustrated is a combination of signal processing techniques for feature extraction and intelligent symbolic level processing for decision making. The symbolic processing includes knowledge about characteristics of some of the basic signal types that can be encountered. Pitch is tracked using some basic streaming principles and then used as one cue to speaker transitions. A novel speaker discrimination is also described that makes segmentation decisions when a continuously updated model of the current speaker suddenly ceases to sufficiently account for the input data. Segment transition decisions in audio are based on less temporally localized information than are video transition decisions.
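
The drift-based discrimination described above can be sketched simply: keep a running Gaussian model of the current speaker's features and declare a segment boundary when new frames become too unlikely under it. The thresholds and window sizes below are illustrative assumptions, not the chapter's values.

```python
# Online speaker-change detection via model drift (hedged sketch).
import numpy as np

class SpeakerDrift:
    def __init__(self, thresh=-60.0):
        self.buf, self.thresh = [], thresh

    def step(self, frame):
        """frame: (n_dims,) feature vector; returns True at a boundary."""
        if len(self.buf) > 20:  # enough data to trust the current model
            mu = np.mean(self.buf, axis=0)
            var = np.var(self.buf, axis=0) + 1e-6
            loglik = -0.5 * np.sum(np.log(2 * np.pi * var)
                                   + (frame - mu) ** 2 / var)
            if loglik < self.thresh:   # model no longer accounts for input
                self.buf = [frame]     # start modeling the new speaker
                return True
        self.buf.append(frame)
        return False
```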

PatentDOI
TL;DR: A method for recognizing spoken utterances of a speaker is disclosed, the method comprising the steps of providing a database of labeled speech data and providing a prototype of a Hidden Markov Model (HMM) definition to define the characteristics of the HMM.
Abstract: A method for recognizing spoken utterances of a speaker is disclosed, the method comprising the steps of providing a database of labeled speech data; providing a prototype of a Hidden Markov Model (HMM) definition to define the characteristics of the HMM; and parameterizing speech utterances according to either linear prediction parameters or Mel-scale filter bank parameters. The method further includes selecting a frame period for accommodating the parameters, and generating HMMs and decoding specified speech utterances by having the user utter predefined training speech utterances for each HMM. The method then statistically combines the generated HMMs with the prototype HMM to provide a set of fully trained HMMs for each utterance indicative of the speaker. The trained HMMs are used for recognizing a speaker by computing Laplacian distances via distance table lookup for utterances of the speaker during the selected frame period, and by iteratively decoding node transitions corresponding to the spoken utterances during the selected frame period to determine which predefined utterance is present.


01 Jan 1995
TL;DR: This paper describes recent efforts by the CMU speech group to improve the recognition of speech found in long sections of the broadcast news show Marketplace, and compares the recognition accuracy of the SPHINX-II system for different environmental and speaker conditions.
Abstract: Practical applications of continuous speech recognition in realistic environments place increasing demands for speaker and environment independence. Until recently, this robustness has been measured using evaluation procedures where speaker and environment boundaries are known, with utterances containing complete or nearly complete sentences. This paper describes recent efforts by the CMU speech group to improve the recognition of speech found in long sections of the broadcast news show Marketplace. Most of our effort was concentrated in two areas: the automatic segmentation and classification of environments, and the construction of a suitable lexicon and language model. We review the extensions to SPHINX-II that were necessary to enable it to process continuous broadcast news and we compare the recognition accuracy of the SPHINX-II system for different environmental and speaker conditions.

Proceedings Article
01 Jan 1995
TL;DR: Comparing continuous density hidden Markov models, dynamic time warping (DTW) and distortion-based vector quantisation (VQ) for speaker recognition across incremental amounts of training data shows TD to be superior to TI architecture for speaker recognition, and TD digit performance illustrates 0, 1 and 9 to be good discriminators.
Abstract: This paper evaluates continuous density hidden Markov models (CDHMM), dynamic time warping (DTW) and distortion-based vector quantisation (VQ) for speaker recognition, across incremental amounts of training data. In comparing VQ and CDHMMs for text-independent (TI) speaker recognition, it is shown that VQ performs better than an equivalent CDHMM with one training version, but is outperformed by the CDHMM when trained with ten training versions. In text-dependent (TD) experiments, a comparison of DTW, VQ and CDHMMs shows that DTW outperforms VQ and CDHMMs for sparse amounts of training data, but with more data, the performance of each model is indistinguishable. Further analysis shows TD to be superior to TI architecture for speaker recognition, and TD digit performance illustrates 0, 1 and 9 to be good discriminators.
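
For reference, the DTW distance used in such text-dependent comparisons is the standard dynamic-programming alignment cost; the sketch below is the textbook algorithm, with the paper's particular local constraints and features left unspecified.

```python
# Minimal dynamic time warping distance between two feature sequences.
import numpy as np

def dtw(a, b):
    """a, b: (n_frames, n_dims) arrays; returns the cumulative alignment cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```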

Proceedings ArticleDOI
07 Mar 1995
TL;DR: This paper deals with the problem of unsupervised speaker classification, where no a priori speaker information is available, and proposes an algorithm that accepts multi-speaker dialogue speech data, estimates the number of speakers and assigns each speech segment to its speaker.
Abstract: Speaker recognition and verification has been used in a variety of commercial, forensic and military applications. The classical problem is that of supervised recognition, in which there is sufficient a priori information on the speakers to be identified. In such cases, the recognition system has speaker models, estimated during training sessions. This paper deals with the problem of unsupervised speaker classification, where no a priori speaker information is available. The algorithm accepts multi-speaker dialogue speech data, estimates the number of speakers and assigns each speech segment to its speaker. Preliminary results are described.
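
One common way to realize the unsupervised setting, sketched below with loud hedging since the paper's actual algorithm is not given here, is to cluster per-segment feature summaries and let a distance threshold (tuned on held-out dialogues) determine the number of speakers.

```python
# Unsupervised speaker clustering over pre-cut speech segments (illustrative).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_segments(segment_feats, threshold=25.0):
    """segment_feats: list of (n_frames, n_dims) arrays, one per segment."""
    X = np.stack([f.mean(axis=0) for f in segment_feats])
    clus = AgglomerativeClustering(n_clusters=None,
                                   distance_threshold=threshold)
    labels = clus.fit_predict(X)
    return labels, labels.max() + 1  # speaker label per segment, #speakers
```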

Patent
10 Mar 1995
TL;DR: In this paper, a spectrum mapping processing section 22 quantizes the acoustic feature parameters of the voice of a selected speaker stored in a voice data-base 10 based on the inputted character string to be voice synthesized employing the code book of the speaker.
Abstract: PURPOSE: To allow learning with a small amount of learning data and to perform a tone quality conversion with high precision by generating and outputting voices signals of a target speaker corresponding to a character string based on the acoustic feature parameters of the voice signals of the target speaker. CONSTITUTION: A spectrum mapping processing section 22 quantizes the acoustic feature parameters of the voice of a selected speaker stored in a voice data-base 10 based on the inputted character string to be voice synthesized employing the code book of the speaker. Moreover, based on the corresponding relationship between the speaker's code book and the mapping code book, the acoustic parameters of the voice signals of the speaker corresponding to the character string are generated by the section 22. Furthermore, a voice synthesis section 24 generates and outputs the voice signals of the speaker corresponding to the character string based on the acoustic feature parameters of the voice signals of the speaker generated by the section 22. Therefore, the voices for a voice tone quality conversion are allowed to be different and the voice tone quality conversion from learning voices, Japanese and words to English words is accomplished.