Showing papers on "Speaker diarisation" published in 1981


Patent
27 Mar 1981
TL;DR: A speaker recognition and verification arrangement that stores acoustic feature templates for reference words, generates signals representing how each identified speaker's features correspond to those templates, and identifies an unknown speaker by comparing the correspondence signals of the unknown speaker's utterance with those of the identified speakers.
Abstract: In a speaker recognition and verification arrangement, acoustic feature templates are stored for predetermined reference words. Each template is a standardized set of acoustic features for one word, formed for example by averaging the values of acoustic features from a plurality of speakers. Responsive to the utterances of identified speakers, a set of signals representative of the correspondence of the identified speaker's features with said feature templates of said reference words is generated. An utterance of an unknown speaker is analyzed and the reference word sequence of the utterance is identified. A set of signals representative of the correspondence of the unknown speaker's utterance features and the stored templates for the recognized words is generated. The unknown speaker is identified jointly responsive to the correspondence signals of the identified speakers and unknown speaker.

65 citations


PatentDOI
TL;DR: A system that generates demisyllable templates from a reference first speaker using both manual and automatic analysis; the analysis for a second speaker is simplified and automated by comparison with the first speaker's templates.
Abstract: A system for generating speech pattern templates for use with either speech recognition or speech synthesis. Reference demisyllable templates are first generated from a reference first speaker using both manual and automatic analysis. The analysis for a second speaker is simplified and automated by comparing with the first speaker's templates. The second speaker speaks the same words at a rate time-warped to match the first speaker's rate and template. We define a demisyllable as each of the two halves of a syllable, assuming a syllable starts and ends with a noisy consonant, and the syllable is split at its vowel center, thereby simplifying concatenation and comparison. Key features of the invention include generating a set of signals representative of the time alignment between the first and second speaker's templates, and the time-of-occurrence boundaries of each syllable in a word.

28 citations
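
The time alignment between two speakers' templates mentioned in the abstract above can be illustrated with dynamic time warping (DTW). The following is only a minimal sketch, not the patent's method: the function name `dtw_align`, the use of scalar features, and the absolute-difference local cost are all assumptions made for illustration.

```python
def dtw_align(ref, new):
    """Align two 1-D feature sequences with dynamic time warping.

    Hypothetical helper: returns (total_cost, path), where path is a list of
    (ref_index, new_index) pairs describing the time alignment.
    """
    inf = float("inf")
    n, m = len(ref), len(new)
    # cost[i][j] = minimal cumulative cost of aligning ref[:i] with new[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(ref[i - 1] - new[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # both sequences advance
                cost[i - 1][j],      # only the reference advances
                cost[i][j - 1],      # only the new speaker advances
            )
    # Backtrack from the end to recover the alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Example: the second sequence holds one value for an extra frame.
total, path = dtw_align([1, 2, 3], [1, 2, 2, 3])
```

Here the path maps both of the second speaker's repeated frames to the same reference frame, which is the kind of alignment signal that could be used to transfer segment boundaries from one speaker's template to another's.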


Proceedings ArticleDOI
01 Apr 1981
TL;DR: In this study it is hypothesized that distributions of template distance scores are reasonably consistent for individual speakers and vary characteristically from speaker to speaker.
Abstract: One method for providing speaker independent word recognition capability is to construct a small set of templates for each vocabulary word that typifies and spans individual speaker word reference templates over a large population of speakers. Word recognition decision functions are based on combinations of template distance scores obtained by processing an unknown input utterance and comparing it with the ensemble of reference templates. In this study it is hypothesized that distributions of template distance scores are reasonably consistent for individual speakers and vary characteristically from speaker to speaker. This property is exploited to provide a speaker recognition capability in combination with word recognition. It is shown that good speaker recognition performance depends on the input of a sequence of distinct words. For a 20-speaker population, on the average, the correct speaker is in the top 1% of the candidates in the identification made over a sequence of seven distinct words.

10 citations
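
The idea in the abstract above, that template distance scores vary characteristically from speaker to speaker and that evidence accumulates over a sequence of distinct words, can be sketched as follows. This is an illustrative assumption, not the paper's decision function: the profile layout, the squared-difference score, and the name `rank_speakers` are all hypothetical.

```python
def rank_speakers(profiles, observed):
    """Rank candidate speakers for a sequence of recognized words.

    profiles: {speaker: {word: [typical distance score per reference template]}}
    observed: list of (word, [distance score per reference template]) pairs
              measured on the unknown speaker's utterances.
    Returns speakers ordered from best to worst match (assumed scoring rule).
    """
    totals = {}
    for speaker, by_word in profiles.items():
        total = 0.0
        for word, scores in observed:
            typical = by_word[word]
            # Accumulate how far the observed scores are from this speaker's
            # typical scores for the same word.
            total += sum((s - t) ** 2 for s, t in zip(scores, typical))
        totals[speaker] = total
    return sorted(totals, key=totals.get)


# Made-up numbers purely for illustration.
profiles = {
    "A": {"one": [1.0, 4.0], "two": [2.0, 3.0]},
    "B": {"one": [3.0, 2.0], "two": [4.0, 1.0]},
}
observed = [("one", [1.2, 3.8]), ("two", [2.1, 3.1])]
ranking = rank_speakers(profiles, observed)
```

Each additional distinct word adds another term to every speaker's total, which is one simple way to see why a longer word sequence can separate speakers more reliably.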


Proceedings ArticleDOI
01 Apr 1981
TL;DR: The results of this investigation clearly show that Markel's technique is superior for applications using very short speech segments for both the speaker models and the recognition trials.
Abstract: This paper describes the design and implementation of a realtime speaker recognition system. The system performs text independent, closed set speaker recognition with up to 30 talkers in realtime. In addition, the reference speech used to characterize the 30 talkers can be extracted from as little as 10 seconds of speech from each talker, and the actual recognition performed with less than one minute of speech from the unknown talker. Two speaker recognition algorithms previously developed by Markel and Pfeifer were investigated for use in the realtime system. The results of this investigation clearly show that Markel's technique is superior for applications using very short speech segments for both the speaker models and the recognition trials. Markel's technique was implemented in realtime in a high speed programmable signal processor. A test of this implementation with a set of 30 male speakers resulted in recognition accuracies of 93-100% for models generated with only 10 seconds of speech, and recognition trials using only 10 seconds of unknown speech.

7 citations


Proceedings ArticleDOI
01 Apr 1981
TL;DR: A procedure for adapting a phonetic recognition system to a new speaker, in which the reference for each phoneme (the cepstrum of the autoregressive model) is transformed linearly to optimally fit the utterances of the new speaker.
Abstract: A procedure for the adaptation of a phonetic recognition system to a new speaker is described. The reference for each phoneme (cepstrum of the autoregressive model) is transformed linearly in order to fit optimally the utterances of a new speaker. In a first step, samples uttered by the new speaker are mapped onto corresponding utterances (reference) through a dynamic comparison. In a second step, the linear transformation is computed through canonical correlation analysis of the samples. Experimental simulations provided satisfactory results.

6 citations


Journal ArticleDOI
TL;DR: An algorithm for a speaker-dependent system that chooses a reference template for each word in the vocabulary from a set of N exemplars, with the goal of minimizing the worst matching behavior and the total error over the N sets of exemplars.
Abstract: Presented here, for a speaker-dependent system, is an algorithm that chooses a reference template for each word in the vocabulary from a set of N exemplars. The goal of the algorithm is to produce a reference set that minimizes the worst matching behavior and total error over the N sets of exemplars. The results of the experiments presented here show a reduction in the average error rate from 16.4% to 10.2% over a set of 4 male speakers and 4 female speakers.

5 citations
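
One way to read the selection goal in the abstract above is as a minimax choice among the exemplars of a single word. The sketch below is an assumption about how such a criterion could look, not the paper's algorithm: `choose_reference` and the tie-break on total distance are hypothetical, and the caller supplies the distance function.

```python
def choose_reference(exemplars, distance):
    """Pick the index of the exemplar to use as the reference template.

    Assumed criterion: minimize the worst (largest) distance to the other
    exemplars, breaking ties by the smallest total distance.
    """
    if len(exemplars) == 1:
        return 0  # nothing to compare against
    best_index, best_key = None, None
    for i, candidate in enumerate(exemplars):
        dists = [
            distance(candidate, other)
            for j, other in enumerate(exemplars)
            if j != i
        ]
        key = (max(dists), sum(dists))  # (worst match, total error)
        if best_key is None or key < best_key:
            best_index, best_key = i, key
    return best_index


# Toy example with scalar "templates" and absolute difference as distance.
index = choose_reference([1.0, 2.0, 2.5, 6.0], lambda a, b: abs(a - b))
```

In this toy example the third exemplar (index 2) is chosen because its largest distance to any other exemplar is the smallest. In practice the distance would be a template-matching score such as a time-warped spectral distance.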