
Showing papers on "Speaker recognition" published in 1978


Journal ArticleDOI
TL;DR: A tutorial survey of techniques for using contextual information in pattern recognition is presented, with emphasis on the problems of image classification and text recognition, where the text is in the form of machine and handprinted characters, cursive script, and speech.

119 citations


PatentDOI
TL;DR: A speech recognition method for detecting and recognizing one or more keywords in a continuous audio signal is disclosed, where each keyword is represented by a keyword template representing one or more target patterns, and each target pattern comprises statistics of each of at least one spectrum selected from plural short-term spectra generated according to a predetermined system for processing of the incoming audio.
Abstract: A speech recognition method for detecting and recognizing one or more keywords in a continuous audio signal is disclosed. Each keyword is represented by a keyword template representing one or more target patterns, and each target pattern comprises statistics of each of at least one spectrum selected from plural short-term spectra generated according to a predetermined system for processing of the incoming audio. The spectra are processed by a frequency equalization and normalizing method to enhance the separation between the spectral pattern classes during later analysis. The processed audio spectra are grouped into spectral patterns, are transformed to reduce dimensionality of the patterns, and are compared by means of likelihood statistics with the target patterns of the keyword templates. A concatenation technique employing a loosely set detection threshold makes it very unlikely that a correct pattern will be rejected.

78 citations


Journal ArticleDOI
TL;DR: Dynamic programming is applied to the selection of feature subsets in text-independent speaker identification, showing a lower average identification error in comparison to that of the "knock-out" strategy, the cepstral coefficients, and the PARCOR coefficients.
Abstract: Dynamic programming is applied to the selection of feature subsets in text-independent speaker identification. Each feature is long-term averaged in order to reduce its variability to text information. The resulting subset of features shows a lower average identification error in comparison to that of the "knock-out" strategy, the cepstral coefficients, and the PARCOR coefficients.

35 citations


Journal ArticleDOI
TL;DR: For automatic recognition, spectral models that contain zeros are found to be particularly effective, and their parameters are shown to be sufficient for the complete separation of /s/ and /ʃ/ samples in CV and VCV utterances.
Abstract: A model is described for the spectral characteristics of voiceless fricative consonants of Japanese, based on an equivalent circuit representation of their generation mechanism. The model, together with its three simplified versions, is then evaluated from the point of view of automatic recognition as well as of synthesis of speech. For automatic recognition, spectral models that contain zeros are found to be particularly effective, and their parameters are shown to be sufficient for the complete separation of /s/ and /ʃ/ samples in CV and VCV utterances. On the other hand, perceptual experiments using synthetic stimuli reveal considerably smaller differences between models with spectral zeros and those without zeros.

21 citations


PatentDOI
TL;DR: In this article, the authors recover the time waveform envelope of a human voice signal by rectifying the voice envelope and then obtaining slope information about the original waveform to detect peaks which are successively held by sample and hold circuitry.
Abstract: This invention recovers the time waveform envelope of a human voice signal without the loss of subtleties providing speaker recognition cues and voice quality. This is accomplished by rectifying the voice envelope and then obtaining slope information about the original time waveform to detect peaks which are successively held by sample and hold circuitry.

10 citations


Journal ArticleDOI
Kashyap, Mittal
TL;DR: A method of recognizing isolated words and phrases from a given vocabulary spoken by any member in a given group of speakers, the identity of the speaker being unknown to the system is described.
Abstract: We describe a method of recognizing isolated words and phrases from a given vocabulary spoken by any member in a given group of speakers, the identity of the speaker being unknown to the system. The word utterance is divided into 20-30 nearly equal frames, frame boundaries being aligned with glottal pulses for voiced speech. A constant number of pitch periods is included in each frame. Statistical decision rules are used to determine the phoneme in each frame. Using the string of phonemes from all the frames of the utterance, a word decision is obtained using (phonological) syntactic rules. The syntactic rules used here are of 2 types, namely, 1) those obtained from the theory of word construction from phonemes in English as applied to our vocabulary, 2) those used to correct possible errors in phonemic decisions obtained earlier based on the decisions of neighboring segments. In our experiment, the vocabulary had 40 words, consisting of many pairs of words which are phonemically close to each other. The number of speakers was 6. The identity of the speaker is not known to the system. In testing 400 word utterances, the recognition rate was about 80 percent for phonemes (for 11 phonemes) but the word recognition was 98.1 percent correct. Phonological-syntactic rules played an important role in upgrading the word recognition rate over the phoneme recognition rate.

8 citations


Journal ArticleDOI
TL;DR: One of the most difficult problems in speaker recognition is that the feature parameters frequently vary after a long time interval. This effect is examined for two kinds of speaker recognition: one uses the time pattern of both the fundamental frequency and log‐area‐ratio parameters and the other uses several kinds of statistical features derived from them.
Abstract: One of the most difficult problems in speaker recognition is that the feature parameters frequently vary after a long time interval. We examined this effect on two kinds of speaker recognition; one uses the time pattern of both the fundamental frequency and log‐area‐ratio parameters and the other uses several kinds of statistical features derived from them. Results of speaker recognition experiments revealed that the long‐term variation effects have a great influence on both recognition methods, but are more evident in recognition using statistical parameters. In order to reduce the error rate after a long interval, it is desirable to collect learning samples of each speaker over a long period and measure the weighted distance based on the long‐term variability of the feature parameters. When the learning samples are collected over a short period, it is effective to apply spectral equalization using the spectrum averaged over all the voiced portions of the input speech. By this method, an accuracy of 95% can be obtained in speaker verification even after five years using statistical parameters of a spoken word.

7 citations
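The spectral equalization mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that each frame is a list of log-spectral values and that equalization means subtracting the average voiced-frame spectrum from every frame; the abstract does not state either detail, and the function name `equalize` and the toy data are hypothetical.

```python
def equalize(log_spectra, voiced):
    """Subtract the spectrum averaged over voiced frames from every frame.

    log_spectra: list of frames, each a list of log-spectral values (assumed).
    voiced: list of booleans marking which frames are voiced.
    """
    voiced_frames = [frame for frame, is_voiced in zip(log_spectra, voiced) if is_voiced]
    if not voiced_frames:
        # No voiced speech to average over: return the frames unchanged.
        return [list(frame) for frame in log_spectra]
    count = len(voiced_frames)
    # Average each frequency band over the voiced frames only.
    average = [sum(band) / count for band in zip(*voiced_frames)]
    return [[value - mean for value, mean in zip(frame, average)]
            for frame in log_spectra]


# Toy input: two voiced frames and one unvoiced frame, two bands each.
frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
flags = [True, True, False]
equalized = equalize(frames, flags)
```

Subtracting in the log domain corresponds to dividing out a fixed spectral shape, which is one common way to compensate for slowly varying channel or voice characteristics.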


Proceedings ArticleDOI
01 Apr 1978
TL;DR: This paper describes a method for text-independent speaker identification in which multiple personal factor spaces are constructed by applying canonical discriminant analysis to predetermined subspaces of the observation space, using 21-dimensional observation vectors obtained from each 40-msec voiced segment.
Abstract: This paper describes a method for text-independent speaker identification. In this method, in order to utilize phoneme-dependent personal information in addition to personal information common to all phonemes, multiple personal factor spaces are constructed by applying canonical discriminant analysis to the predetermined subspaces in the observation space. The decision is based on a likelihood measure derived from a posteriori probabilities in all the factor spaces. Using the 21-dimensional observation vectors obtained from each 40-msec voiced segment, the methods of construction of the subspaces and others were examined. An identification accuracy comparable to human listeners was achieved.

5 citations


Proceedings ArticleDOI
01 Apr 1978
TL;DR: A new mml (minimum-maximum-locating) method for time-normalization is presented together with an overview of techniques used for feature-extraction and results of recognition experiments obtained with the AUROS system.
Abstract: In the speaker-verification task speakers are assumed to be cooperative and are therefore willing to pronounce a pre-arranged code sentence. This paper deals with the extraction of speaker-specific features from a time-normalized parametric description of the code sentence. A new mml (minimum-maximum-locating) method for time-normalization is presented together with an overview of techniques used for feature-extraction. Results of recognition experiments are discussed, which have been obtained with the AUROS system (Automatic Recognition of Speakers by Computers).

3 citations


Proceedings ArticleDOI
01 Apr 1978
TL;DR: A method for text-independent speaker identification has been developed which utilizes vowel sounds as the basis for extracting speaker characteristics, and it was found that vowel recognition is not necessary.
Abstract: A method for text-independent speaker identification has been developed which utilizes vowel sounds as the basis for extracting speaker characteristics. Using 63 minutes of conversational speech data from 20 speakers, it was found that vowel recognition is not necessary. Instead, the vowel samples can be pooled such that they represent each person's vowel space, which is expected to be very speaker-dependent. A sequential analysis process has improved the decision procedure by allowing vowel samples to be tested until a specified level of confidence is reached in the identification. This dynamic decision procedure is similar to a human perception process where we can quickly identify a unique voice, but listen longer when there is uncertainty.

3 citations
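The sequential decision procedure described in the abstract above, where vowel samples are tested until a specified confidence is reached, can be sketched as follows. The abstract does not give the test statistic, so this sketch makes several assumptions: each speaker is modelled by a hypothetical one-dimensional Gaussian `(mean, variance)`, speakers have equal prior probability, and the stopping rule is a threshold on the posterior probability of the leading speaker.

```python
import math


def sequential_identify(samples, speaker_models, confidence=0.99):
    """Test samples one at a time until one speaker is confident enough.

    speaker_models maps a speaker name to (mean, variance) of an assumed
    1-D Gaussian. Returns (speaker or None, number of samples used).
    """
    log_likelihood = {name: 0.0 for name in speaker_models}
    for used, x in enumerate(samples, start=1):
        for name, (mean, var) in speaker_models.items():
            # Accumulate the Gaussian log-likelihood of this sample.
            log_likelihood[name] += -0.5 * (
                math.log(2 * math.pi * var) + (x - mean) ** 2 / var
            )
        # Posterior under equal priors, computed stably via the maximum.
        peak = max(log_likelihood.values())
        weights = {n: math.exp(v - peak) for n, v in log_likelihood.items()}
        total = sum(weights.values())
        best = max(weights, key=weights.get)
        if weights[best] / total >= confidence:
            return best, used
    # Confidence level never reached with the available samples.
    return None, len(samples)


models = {"A": (0.0, 1.0), "B": (5.0, 1.0)}
clear_case = sequential_identify([0.1, -0.2], models)
ambiguous_case = sequential_identify([2.5, 2.4, 0.0], models)
```

With these toy models, a sample close to one speaker's mean ends the procedure immediately, while samples midway between the two speakers leave the decision open until more distinctive evidence arrives.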


Book ChapterDOI
01 Jan 1978
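TL;DR: During the last two decades the emphasis in research in automatic speech recognition has gradually shifted from the "phonetic typewriter," a hypothetical device which types any words spoken into it, to "speech understanding systems" which extract the intended meaning from the sounds and carry out some appropriate action.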
Abstract: Automatic speech recognition may be defined as any process which decodes the acoustic signal produced by the human voice into a sequence of linguistic units which contain the message that the speaker wishes to convey. At one extreme this includes the “phonetic typewriter,” a hypothetical device which types any words spoken into it, and at the other, “speech understanding systems” which extract the intended meaning from the sounds and carry out some appropriate action such as replying to a question or controlling a robot. During the last two decades the emphasis in research in automatic speech recognition has gradually shifted from the former type of device to the latter.

Proceedings ArticleDOI
01 Apr 1978
TL;DR: The results obtained in an identification experiment showed the superiority of the Atal approach in both recognition rate and sensitivity to intra-speaker variability.
Abstract: Over the past years many different methods for automatic speaker recognition have been successfully proposed and tested. The aim of the paper is to present the results of a comparison carried out on two such approaches, viz. Atal /1/ and Sambur /2/, both using, but in a different manner, the parametric representation of speech derived from the linear prediction model. The experiment was carried out on 500 phrase-length utterances of 10 speakers recorded over a three-month period. The results obtained in an identification experiment showed the superiority of the Atal approach in both recognition rate and sensitivity to intra-speaker variability.

01 Apr 1978
TL;DR: The development of a system for recognizing connected speech in real time using a commercially available speech preprocessor, a minicomputer and programs written in FORTRAN is described.
Abstract: This report describes the development of a system for recognizing connected speech in real time using a commercially available speech preprocessor, a minicomputer and programs written in FORTRAN. The system was tested on two speakers using the digits and the word 'point' with inconclusive results. Recognition accuracy of 86% was achieved for one speaker whereas accuracy for the other speaker was lower (39%) due to an anomalous difference between training and test data for that speaker's voice. (Author)

01 Jan 1978
TL;DR: To limit recognition time for large vocabularies, the candidate words are reduced by pre-matching on both local and global features of a spoken word, which eliminates the most dissimilar group of candidates from the vocabulary list.
Abstract: If the vocabulary of a word recognition system is enlarged to several hundred words, the recognition time becomes very long because of the increased amount of processing, and the recognition rate also decreases. To cope with these weaknesses, we adopted a method that reduces the candidate words in the vocabulary by pre-matching on both local and global features of a spoken word. That is, eliminating the most dissimilar group of candidates from the vocabulary list, using measurements of both features, reduces the recognition time, and this operation also removes misleading candidates, which increases the recognition rate. Furthermore, adding the measurement to the final judgement increased the recognition rate. Moreover, in order to absorb the influence of speaker differences, we added a learning capability to the system. In an experiment on name recognition using 100 Japanese city names, the system recognized the names correctly at a rate of 83% for unspecific speakers and 93% after learning, using a minicomputer in real time. Pre-matching reduced the number of candidate words to one tenth.

Proceedings ArticleDOI
01 Apr 1978
TL;DR: It is shown that the nearest neighbour decision rule gives significant improvement in classification score for vowel and digit recognition schemes.
Abstract: Minimum distance to mean is usually used as a classification rule in speech and speaker recognition studies. In this paper it is shown that the nearest neighbour decision rule gives significant improvement in classification score for vowel and digit recognition schemes. Autocorrelation coefficients of lags two to five sampling instants are used to form the feature vector. Four samples per class have been used. Minimum squared Euclidean distance of the test vector from the nearest reference is chosen as the classification rule. For sustained vowels the recognition score is 100%. For the same feature, the minimum distance to mean gives a 70% recognition score. When the reference samples of a given speaker are tested over the vowels spoken by different speakers (up to 10), this scheme gives a recognition score of about 95%. For digits without any time warping, a recognition score of about 86% to 92% is obtained.
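The two decision rules compared in the abstract above can be sketched side by side. This is an illustration under stated assumptions, not the authors' implementation: the reference vectors below are hypothetical two-dimensional toy data (the paper uses four autocorrelation coefficients per vector), and the function names are invented for the example. Only the squared Euclidean distance and the four samples per class follow the abstract.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_mean(samples):
    """Component-wise mean of a list of feature vectors."""
    count = len(samples)
    return [sum(column) / count for column in zip(*samples)]


def classify_min_distance_to_mean(test, references):
    """Pick the class whose mean vector is closest to the test vector."""
    return min(
        references,
        key=lambda label: squared_distance(test, class_mean(references[label])),
    )


def classify_nearest_neighbour(test, references):
    """Pick the class that owns the single closest reference vector."""
    return min(
        references,
        key=lambda label: min(squared_distance(test, ref) for ref in references[label]),
    )


# Hypothetical references, four samples per class: class "a" is widely
# spread, class "b" is a tight cluster near one corner of class "a".
references = {
    "a": [(0, 0), (0, 10), (10, 0), (10, 10)],
    "b": [(7, 7), (8, 7), (7, 8), (8, 8)],
}
test_vector = (9.5, 9.5)
by_neighbour = classify_nearest_neighbour(test_vector, references)
by_mean = classify_min_distance_to_mean(test_vector, references)
```

On this toy data the two rules disagree: the test vector lies next to a reference of the spread-out class, but closer to the mean of the tight class. The nearest neighbour rule keeps the individual references, so it can follow class shapes that a single mean vector cannot represent.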

Book ChapterDOI
V. Vemuri
01 Jan 1978
TL;DR: This chapter describes the recognition of patterns as a basic attribute of living organisms and divides acts of recognition into two major types: recognition of concrete items, which covers spatial and temporal patterns, and recognition of abstract items, which covers conceptual patterns.
Abstract: This chapter describes the recognition of patterns. Recognition can be regarded as a basic attribute of living organisms. A pattern is a description of an object. Recognition of patterns is a basic activity of all living organisms. The ability to recognize patterns is a necessary part of survival. In the animal kingdom, one's survival depends upon his ability to recognize a friend and a foe. An infant learns to recognize its mother's face and voice at an early age. One can recognize people from an analysis of their hand-writing, fingerprints, and voice prints. Acts of recognition can be divided into two major types: recognition of concrete items and recognition of abstract items. Recognition of spatial and temporal patterns using one's visual and aural sensory apparatus belongs to the former type. Examples of spatial patterns are alphanumeric characters, fingerprints, and pictures. Temporal patterns include speech waveforms, electrocardiograms, time series, and target signatures. Recognition of conceptual patterns, such as the proof of a theorem, belongs to the latter type. The subject of pattern recognition spans a number of disciplines.

Book ChapterDOI
01 Jan 1978
TL;DR: Continuous Speech Recognition is an attempt to develop a “voice-actuated typewriter” that automatically transcribes naturally spoken utterances into correct English orthography.
Abstract: Continuous Speech Recognition is an attempt to develop a “voice-actuated typewriter” that automatically transcribes naturally spoken utterances into correct English orthography. The current IBM objective is to transform (not necessarily in real time) speech signals recorded by a known speaker in a high fidelity environment into reasonably error-free text.