
Showing papers on "Speaker diarisation published in 1986"



Proceedings ArticleDOI
01 Apr 1986
TL;DR: Speaker adaptation algorithms based on vector quantization (VQ), a technique that drastically reduces computation and memory requirements, are proposed in order to improve speaker-independent recognition.
Abstract: Vector quantization (VQ) is a technique that reduces the computation amount and memory size drastically. In this paper, speaker adaptation algorithms through VQ are proposed in order to improve speaker-independent recognition. The speaker adaptation algorithms use VQ codebooks of a reference speaker and an input speaker. Speaker adaptation is performed by substituting vectors in the codebook of a reference speaker for vectors of the input speaker's codebook, or vice versa. To confirm the effectiveness of these algorithms, word recognition experiments are carried out using the IBM office correspondence task uttered by 11 speakers. The total number of words is 1174 for each speaker, and the number of different words is 422. The average word recognition rate using different speaker's reference through speaker adaptation is 80.9%, and the rate within the second choice is 92.0%.
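The codebook-substitution idea can be sketched in a few lines. This is a toy illustration under loud assumptions, not the paper's implementation: `train_codebook` is a hypothetical k-means-style trainer standing in for a real LBG codebook builder, and synthetic Gaussian vectors stand in for spectral feature frames.

```python
import numpy as np

def train_codebook(frames, size, iters=10, seed=0):
    """Hypothetical k-means-style codebook trainer (toy stand-in for LBG)."""
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), size, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest codeword, then recompute centroids.
        labels = np.argmin(
            ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(size):
            members = frames[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def adapt_codebook(reference_cb, input_cb):
    """Substitute each reference codeword with the nearest codeword from the
    input speaker's codebook (one reading of the substitution idea)."""
    dists = ((reference_cb[:, None, :] - input_cb[None, :, :]) ** 2).sum(-1)
    return input_cb[np.argmin(dists, axis=1)]

# Synthetic "feature frames": the input speaker is a shifted copy of the
# reference speaker, mimicking a speaker-dependent spectral offset.
rng = np.random.default_rng(1)
ref_frames = rng.normal(0.0, 1.0, (500, 8))
inp_frames = ref_frames + 2.0

ref_cb = train_codebook(ref_frames, 16)
inp_cb = train_codebook(inp_frames, 16)
adapted = adapt_codebook(ref_cb, inp_cb)
```

Every row of `adapted` is drawn from the input speaker's codebook, so the adapted reference now quantizes the input speaker's feature space.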

269 citations


Journal ArticleDOI
TL;DR: This paper focuses on the long-term intra-speaker variability of feature parameters as one of the most crucial problems in speaker recognition, and presents an investigation into methods for reducing the effects of long-term spectral variability on recognition accuracy.

79 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: Methods for text-independent speaker identification that deal with the variability introduced by unknown telephone channels are considered, including probabilistic channel modeling, a channel-invariant model, and a modified-Gaussian model.
Abstract: We consider methods for text-independent speaker identification that deal with the variability in the data introduced by unknown telephone channels. The methods investigated include probabilistic channel modeling, a channel-invariant model and a modified-Gaussian model. The methods are described and then evaluated with experiments conducted with a twenty speaker database of long distance telephone calls.
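The paper's three channel-compensation models are not spelled out in this snippet, so here is a sketch of a different but standard channel-invariance trick, cepstral mean subtraction, which illustrates the underlying problem: a fixed linear telephone channel is approximately an additive constant in the cepstral domain.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance mean of each cepstral coefficient.
    A fixed linear channel adds a constant offset in the cepstral
    domain, so subtracting the mean cancels it."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 12))      # toy cepstral frames
channel = rng.normal(size=12)           # unknown constant channel offset
observed = clean + channel              # channel-distorted frames

normalized = cepstral_mean_subtraction(observed)
reference = cepstral_mean_subtraction(clean)
# After subtraction, the channel offset is gone (up to rounding).
residual = np.abs(normalized - reference).max()
```

This is not one of the paper's three methods; it is shown only to make the "unknown channel" variability concrete.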

36 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: A new method, based on template matching, that exploits temporal information is described; it performs text-dependent recognition as a special case, and its performance is compared with that of similar recently developed methods.
Abstract: Text-independent speaker recognition methods have been based on measurements of long-term statistics of individual speech frames. These methods are not capable of modeling speaker-dependent speech dynamics. In this paper, we describe a new method, based on template matching, that utilizes temporal information to advantage. The template-matching method performs text-dependent recognition as a special case. Performance of the template-matching method is compared with that of similar recently-developed methods.
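The template-matching machinery such methods build on is dynamic time warping. A minimal sketch, assuming a Euclidean local cost and the symmetric step pattern (the paper's exact matching procedure is not given in this snippet):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two frame sequences,
    with Euclidean local cost and symmetric steps."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A time-warped copy of a template should score far better than noise.
t = np.linspace(0, 2 * np.pi, 40)
template = np.stack([np.sin(t), np.cos(t)], axis=1)
warped = template[np.repeat(np.arange(40), 2)[::3]]   # warped copy
noise = np.random.default_rng(0).normal(size=(40, 2))
```

This captures why templates retain speaker-dependent speech dynamics that long-term frame statistics discard: the warping path preserves temporal ordering.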

24 citations


Journal ArticleDOI
TL;DR: A technique for learning spectral transformations, based on a statistical-analysis tool (canonical correlation analysis), adapts a standard dictionary to arbitrary speakers and should make it possible to improve speaker independence in large-vocabulary ASR.
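A minimal numerical sketch of canonical correlation analysis itself, the statistical tool named above (the paper's dictionary-adaptation procedure is not reproduced here). The `cca` helper and the synthetic paired-spectra setup are illustrative assumptions: whiten each feature space, then take an SVD of the cross-covariance.

```python
import numpy as np

def cca(X, Y):
    """Minimal canonical correlation analysis via whitening + SVD.
    Returns projections Wx, Wy and the canonical correlations."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Sxx = X.T @ X / n + 1e-8 * np.eye(X.shape[1])   # small ridge for stability
    Syy = Y.T @ Y / n + 1e-8 * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    return Kx @ U, Ky @ Vt.T, s

# Paired "spectra": the new speaker is an orthogonal warp of the
# standard speaker plus a little noise, so all correlations are high.
rng = np.random.default_rng(0)
standard = rng.normal(size=(300, 6))
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))
new_spk = standard @ Q + 0.01 * rng.normal(size=(300, 6))

Wx, Wy, corrs = cca(new_spk, standard)
```

The canonical directions pair up maximally correlated axes of the two speakers' spectral spaces; a transformation learned along them is what such adaptation schemes exploit.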

22 citations


Journal ArticleDOI
01 Sep 1986-Language

13 citations


Proceedings ArticleDOI
07 Apr 1986
TL;DR: This work defines novel distance measures for speech recognition which are specifically designed to differentiate between confusable speech sounds.
Abstract: This work defines novel distance measures for speech recognition which: 1. Model the statistical interaction between adjacent speech frames, 2. Model the statistical characteristics of different speech sounds individually, 3. Are specifically designed to differentiate between confusable speech sounds. Speaker-independent recognition tests performed on the Texas Instruments multi-dialect isolated-digit database give substitution rates as low as 0.6% with a vocabulary of 11 digits.
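One hedged reading of points 1 and 2 above, not the paper's actual measure: fit a Gaussian per speech sound over concatenated adjacent-frame pairs, so the resulting Mahalanobis-style distance reflects both frame-to-frame interaction and sound-specific statistics.

```python
import numpy as np

class SoundModel:
    """Per-sound Gaussian over pairs of adjacent frames, giving a
    class-specific Mahalanobis-style distance (an illustrative sketch,
    not the measure defined in the paper)."""

    def fit(self, frames):
        pairs = np.hstack([frames[:-1], frames[1:]])   # adjacent-frame pairs
        self.mean = pairs.mean(0)
        cov = np.cov(pairs, rowvar=False) + 1e-6 * np.eye(pairs.shape[1])
        self.prec = np.linalg.inv(cov)
        return self

    def distance(self, frames):
        pairs = np.hstack([frames[:-1], frames[1:]]) - self.mean
        # Mean Mahalanobis distance of the frame pairs to this sound's model.
        return float(np.mean(np.einsum('ij,jk,ik->i', pairs, self.prec, pairs)))

rng = np.random.default_rng(0)
sound_a = rng.normal(0.0, 1.0, (400, 4))   # toy frames of "sound A"
sound_b = rng.normal(3.0, 1.0, (400, 4))   # toy frames of "sound B"
model_a = SoundModel().fit(sound_a)
test_a = rng.normal(0.0, 1.0, (50, 4))
test_b = rng.normal(3.0, 1.0, (50, 4))
```

Frames drawn from sound A score a much smaller distance under A's model than frames of sound B, which is the discrimination the per-sound statistics buy.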

8 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: A speaker independent isolated Arabic word recognition system has been successfully implemented and tested and the speaker independent recognition rate is about 96% for ten Arabic words.
Abstract: A speaker-independent isolated Arabic word recognition system has been successfully implemented and tested. End-point detection and voiced/unvoiced segmentation algorithms are based on energy and zero-crossing rate (ZCR) using an adaptive ZCR threshold. Formants are extracted using ZCI histograms obtained from the filtered speech signal [1]. The speaker-independent recognition rate is about 96% for ten Arabic words.
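The energy side of the end-point detection step can be sketched as follows. This toy version thresholds at a fixed fraction of peak frame energy and omits the paper's ZCR cues and adaptive threshold:

```python
import numpy as np

def detect_endpoints(signal, frame_len=160, energy_ratio=0.1):
    """Toy endpoint detector: mark frames whose short-time energy exceeds
    a fraction of the peak frame energy, return first/last active frame."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    active = np.where(energy > energy_ratio * energy.max())[0]
    return int(active[0]), int(active[-1])

# Silence, a burst of "speech", silence again.
rng = np.random.default_rng(0)
sig = np.concatenate([
    0.01 * rng.normal(size=1600),        # leading silence (frames 0-9)
    np.sin(np.linspace(0, 200, 3200)),   # voiced segment (frames 10-29)
    0.01 * rng.normal(size=1600)])       # trailing silence (frames 30-39)

start, end = detect_endpoints(sig)
```

A real implementation would add the ZCR test so unvoiced fricatives at word edges, which have low energy but high zero-crossing rates, are not clipped.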

3 citations


Journal ArticleDOI
TL;DR: In this paper, the same VQ codebook is used for constructing a transformation, or more precisely, a mapping, between the feature space of a new speaker and that of a standard speaker.
Abstract: A speaker‐specific vector quantization (VQ) codebook was proposed and applied successfully to both text‐independent and text‐dependent speaker recognition applications [e.g., F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B‐H. Juang, ICASSP‐85 (1985) and A. E. Rosenberg and F. K. Soong, ICASSP‐86 (1986)]. In this talk the same VQ codebook is used for constructing a transformation, or more precisely, a mapping, between the feature space of a new speaker and that of a standard speaker. The mapping is constructed by using standard dynamic programming (DP) procedure to align training tokens spoken by a new speaker with tokens of the same text spoken by the standard speaker. Along the optimal alignment paths, a correspondence between the spectral feature vectors of the new speaker and the VQ codebook indices of the standard speaker is established and, for each VQ codebook index, a centroid is computed as the “average” of all the corresponding feature vectors of the new speaker. A new VQ codebook for the new ...
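The mapping construction described above can be sketched end to end: DTW-align a new-speaker token against the standard speaker's token of the same text, quantize the standard frames with the standard codebook, and average the aligned new-speaker frames per codeword. A toy illustration with synthetic vectors, not the authors' code:

```python
import numpy as np

def dtw_path(a, b):
    """DTW alignment path between two frame sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:                 # trace back the optimal path
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def map_codebook(new_tokens, std_tokens, std_codebook):
    """For each standard-speaker codeword, average the new-speaker frames
    that DTW aligns against standard frames quantized to that codeword."""
    sums = np.zeros_like(std_codebook)
    counts = np.zeros(len(std_codebook))
    for new_tok, std_tok in zip(new_tokens, std_tokens):
        idx = np.argmin(
            ((std_tok[:, None] - std_codebook[None]) ** 2).sum(-1), axis=1)
        for i_new, j_std in dtw_path(new_tok, std_tok):
            sums[idx[j_std]] += new_tok[i_new]
            counts[idx[j_std]] += 1
    counts = np.maximum(counts, 1)
    return sums / counts[:, None]

std_cb = np.random.default_rng(0).normal(size=(4, 3))  # standard codebook
seq = np.arange(30) % 4
std_tok = std_cb[seq]                  # standard speaker's training token
new_tok = std_tok + 1.5                # "new speaker": shifted spectra
mapped = map_codebook([new_tok], [std_tok], std_cb)
```

The per-codeword centroids of `mapped` form the new speaker's codebook in correspondence with the standard speaker's codeword indices, which is the mapping the talk describes.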

1 citation



01 Jun 1986
TL;DR: The research experiment consisted of constructing a system for identifying a natural language sentence using only speaker-independent phonemes as input; the conclusions are that such a system can be built and that the useful vocabulary must be expandable as the recognition system becomes more frequently used.
Abstract: Computerized processing of human speech input may be accomplished by (1) recognizing the phoneme sounds in the speech signals, (2) correctly identifying the words in each spoken sentence, (3) interpreting the meaning of the sentence, and (4) generating proper responses for each utterance. Individual speakers talk differently, and even an individual's enunciation patterns change with differing environments and discourse domains. These differences are called speaker idiosyncrasies. Regional speech dialects are included as a speaker idiosyncrasy. The source of such speaker differences has been identified as an individual's pronunciations of the vowel phonemes. Computerized speech processors treat these speaker idiosyncrasies as errors when the input phoneme sounds are unrecognizable. A generalized speech recognition system must accommodate such speech errors and, more particularly, ignore the speaker-dependent pronunciations of phonemes. This implies that vowel phoneme pronunciations, the source of speaker idiosyncrasies and speech processing errors, should be overlooked during recognition of vocalized sentences. The research experiment consisted of constructing a system for identifying a natural language sentence using only speaker-independent phonemes as the input. The motivating hypothesis for the experiment is that spoken sentences can be recognized from limited phoneme input. The research system accepts only strings of consonant phonemes, which are recognizable in a speaker-independent environment. The original 'spoken' sentence is reproduced from the consonant phonemes and formatted as a word sequence for subsequent transmission to a natural language processing system. The system uses a vocabulary of general words and an expandable dictionary of domain-specific words during the sentence reconstruction process.
The research conclusions are that such a system can be built, and that the useful vocabulary must be expandable as the recognition system becomes more frequently used. The research system is intended as an interface between existing acoustic phoneme recognizers and existing natural language processors. The system accomplishes word recognition using only the consonant phonemes from continuous speech sentences, and generates word sequences in sentence form for output to an existing natural language processor. The domain specific vocabulary subsets used by the system facilitate its use as a sentence pre-processor especially with natural language understanding systems which rely on scripts, and the associated domain specific vocabularies, for semantic processing of topic oriented sentence groups.
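The consonant-skeleton lookup at the heart of the system can be illustrated with a toy orthographic sketch. This is a loose analogy under stated assumptions: real input would be consonant phonemes from a recognizer rather than letters, and the hypothetical `candidates` helper ignores the general/domain vocabulary split and the sentence-level context the system uses to disambiguate.

```python
VOWELS = set("AEIOU")

def consonant_skeleton(word):
    """Strip vowels, keeping a speaker-independent consonant frame
    (letters stand in for consonant phonemes in this toy)."""
    return "".join(c for c in word.upper() if c.isalpha() and c not in VOWELS)

def candidates(skeleton, vocabulary):
    """All vocabulary words whose consonant skeleton matches."""
    return [w for w in vocabulary if consonant_skeleton(w) == skeleton]

vocab = ["speech", "speak", "spoke", "system", "sentence", "phoneme"]
# "SPK" is ambiguous between 'speak' and 'spoke'; a domain-specific
# vocabulary and sentence context would be needed to choose between them.
matches = candidates("SPK", vocab)
```

The ambiguity in `matches` is exactly why the research system leans on expandable domain-specific vocabularies and downstream natural language processing to reconstruct the intended sentence.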