Showing papers on "Speaker recognition" published in 1975


Journal ArticleDOI
O. Fujimura
TL;DR: Irregularities in phonetic manifestations of phonemes are discussed and it is argued that the syllable, phonologically redefined, will serve as the effective minimal unit in the time domain.
Abstract: Basic problems involved in automatic recognition of continuous speech are discussed with reference to the recently developed template matching technique using dynamic programming. Irregularities in phonetic manifestations of phonemes are discussed and it is argued that the syllable, phonologically redefined, will serve as the effective minimal unit in the time domain. English syllable structures are discussed from this point of view using the notions of "syllable features" and "vowel affinity."

110 citations
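The abstract above refers to template matching using dynamic programming without describing it. The following is a minimal sketch, assuming the technique resembles what is now commonly called dynamic time warping; the function name `dtw_distance`, the scalar per-frame features, and the absolute-difference local cost are all illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def dtw_distance(template: Sequence[float], utterance: Sequence[float]) -> float:
    """Cumulative cost of the best monotonic alignment between two sequences.

    Hypothetical illustration: frames are single numbers and the local cost is
    the absolute difference. A real recognizer would use feature vectors.
    """
    inf = float("inf")
    n, m = len(template), len(utterance)
    # cost[i][j] = best cumulative cost aligning the first i template frames
    # with the first j utterance frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - utterance[j - 1])
            # Allowed moves: advance the template, advance the utterance, or both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

In such a scheme, an unknown utterance would be assigned to the stored template with the smallest alignment cost, so a slower or faster rendition of the same pattern can still match closely.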


Patent
Marvin Robert Sambur
31 Dec 1975
TL;DR: This speaker recognition system compares the mean and variance of an unknown speaker's Orthogonal Parameter signals with those of previously stored known speakers.
Abstract: This speaker recognition system offers improved recognition by comparing the mean and variance of an unknown (test) speaker's Orthogonal Parameter signals versus those of previously-stored known (reference) speakers. The unknown speaker's Orthogonal Parameters represent his hypothesized identity because his original speech linear prediction coefficients are transformed into his set of Orthogonal Parameters using the stored (reference) transformation coefficients of each of the previously-recorded known speakers.

27 citations


Book ChapterDOI
01 Jan 1975
TL;DR: Liberman, Mattingly and Turvey contrast speech codes and memory codes, suggesting that improving memory codes is likely to require a recoding into a form like a language code which is the class of codes to which speech belongs.
Abstract: Nowadays the study of coding processes in human memory is a large central topic of psychology. How is information perceived, selected, transformed, abstracted or elaborated and finally put in memory? A recent review of the work on encoding processes has been given by Melton and Martin (1972). In this book Liberman, Mattingly and Turvey contrast speech codes and memory codes. For them speech is a special process having a code which is complex, lawful, natural, efficient and resistant to distortion. Memory codes are not special but are relatively arbitrary and generally inefficient. Liberman et al. suggest that improving memory codes is likely to require a recoding into a form like a language code which is the class of codes to which speech belongs. An interesting question which arises is, how hard or easy is it to tell one speech-encoding system, or voice, from another? If a voice which will later be heard again is stripped of its semantical, grammatical and contextual constraints so as to lose its specialness of speech except as a carrier, are its abstracted properties laid down in a speech code or a memory code? The answer is important in the biology of survival, and also in our own human society which is held together by voice communication. The day is near when men and machines will talk fluently to each other.

27 citations


Journal ArticleDOI
TL;DR: In this paper, a mathematical formulation for each of several zero-crossing feature extraction techniques is derived and related (where possible) to each of the other zero-crossing methods.
Abstract: Zero-crossing analysis techniques have long been applied to speech analysis, to automatic speech recognition, and to many other signal-processing and pattern-recognition tasks. In this paper, a mathematical formulation for each of several zero-crossing feature extraction techniques is derived and related (where possible) to each of the other zero-crossing methods. Based upon this mathematical formulation, a physical interpretation of each analysis technique is effected, as is a discussion of the properties of each method. It is shown that four of these methods are a description of a short-time waveform in which essentially the same information is preserved. Each turns out to be a particular normalization of a count of zero-crossing intervals method. The effects of the various forms of normalization are discussed. A fifth method is shown to be a different type of measure; one which preserves information concerning the duration of zero-crossing intervals rather than their absolute number. Although reference is made as to how each of the zero-crossing methods has been applied to automatic speech recognition, an attempt is made to enumerate general characteristics of each of the techniques so as to make the mathematical analysis generally applicable.

25 citations
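The zero-crossing abstract above distinguishes measures based on a normalized count of zero-crossings from a measure that preserves the durations of the intervals between them. The sketch below illustrates that distinction only; it is not one of the paper's five formulations, and the function names and the normalization by frame length are assumptions made for illustration.

```python
from typing import List, Sequence


def zero_crossing_indices(samples: Sequence[float]) -> List[int]:
    """Indices i at which the sign differs between samples[i - 1] and samples[i]."""
    crossings = []
    for i in range(1, len(samples)):
        # Zero is treated as non-negative so a sample exactly at zero
        # is not counted as two crossings.
        if (samples[i - 1] >= 0) != (samples[i] >= 0):
            crossings.append(i)
    return crossings


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Count of crossings normalized by the number of adjacent sample pairs (assumed normalization)."""
    if len(samples) < 2:
        return 0.0
    return len(zero_crossing_indices(samples)) / (len(samples) - 1)


def zero_crossing_intervals(samples: Sequence[float]) -> List[int]:
    """Durations, in samples, between successive zero-crossings."""
    idx = zero_crossing_indices(samples)
    return [later - earlier for earlier, later in zip(idx, idx[1:])]
```

The rate discards where the crossings fall within the frame, while the interval list keeps their spacing, which is the kind of information the abstract attributes to the fifth method.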


Journal ArticleDOI
TL;DR: A lattice representation of the segmentation is devised which allows for multiple choices that can be sorted out by higher level processes, in order to deal effectively with acoustic recognition errors.
Abstract: Errors in acoustic-phonetic recognition occur not only because of the limited scope of the recognition algorithm, but also because certain ambiguities are inherent in analyzing the speech signal. Examples of such ambiguities in segmentation and labeling (feature extraction) are given. In order to allow for these phenomena and to deal effectively with acoustic recognition errors, we have devised a lattice representation of the segmentation which allows for multiple choices that can be sorted out by higher level processes. A description of the current acoustic-phonetic recognition program in the Bolt Beranek and Newman (BBN) Speech Understanding System is given, along with a specification of the parameters used in the recognition.

23 citations


Journal ArticleDOI
M. Sambur
TL;DR: The effectiveness of a set of speaker recognition features is usually characterized in terms of the ratio of the interspeaker variability of the feature to its intraspeaker variability (F‐ratio); by an appropriate eigenvector analysis, a set of orthogonal parameters can be obtained that is essentially constant across an utterance for a given speaker.
Abstract: The effectiveness of a set of speaker recognition features is usually characterized in terms of the ratio of the interspeaker variability of the feature to its intraspeaker variability (F‐ratio). A recent experiment in speech synthesis [M.R. Sambur, “An Efficient LPC Vocoder,” Bell Syst. Tech. J. (to be published)] has shown that by an appropriate eigenvector analysis, a set of orthogonal parameters can be obtained that is essentially constant across an utterance for a given speaker (i.e., zero intraspeaker variability). If the same eigenvector analysis is applied to the same utterance spoken by another speaker, the resulting values of the orthogonal parameters are, however, different. These orthogonal parameters were therefore examined for their ability to differentiate different speakers. They were formally tested in a speaker recognition experiment involving 21 speakers. The speech data consisted of six repetitions of the same sentence spoken by each speaker on six separate occasions. The identificatio...

14 citations
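The F‐ratio mentioned in the abstract above is the ratio of interspeaker to intraspeaker variability of a feature. A minimal sketch of one common way to compute it follows, assuming the variance of the per-speaker means as the numerator and the average within-speaker variance as the denominator; the exact estimator used in the paper is not given in the abstract, and `f_ratio` and its input layout are hypothetical.

```python
from statistics import mean, pvariance
from typing import Dict, Sequence


def f_ratio(feature_by_speaker: Dict[str, Sequence[float]]) -> float:
    """Interspeaker variance of a feature divided by its mean intraspeaker variance.

    feature_by_speaker maps a speaker label to that speaker's measurements
    of a single feature (an assumed layout, for illustration only).
    """
    speaker_means = [mean(values) for values in feature_by_speaker.values()]
    between = pvariance(speaker_means)
    within = mean([pvariance(values) for values in feature_by_speaker.values()])
    if within == 0:
        # A feature that never varies within a speaker, the limiting case
        # the abstract describes, gives an unbounded ratio.
        return float("inf")
    return between / within
```

A larger value indicates a feature that separates speakers well relative to how much it fluctuates for any one speaker.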


Journal ArticleDOI
TL;DR: The results of the study indicate that retention of meaning involving the speaker's predictions, opinions, etc., is influenced by the listener's perception of the speaker.
Abstract: Results are reported for an experiment which examined the influence of listener perception of speaker intention on sentence recognition. Given the same passage and recognition sentences, subjects displayed different false recognition patterns of test items depending on which of two speakers with opposing viewpoints the passage was attributed to. It is argued that the reconstructive process of memory is based on information from the context (e.g., the speaker's perceived intentions) as well as on the actual words used. Retention of different aspects of a message is seen to rely on information from different sources. Specifically, the results of the study indicate that retention of meaning involving the speaker's predictions, opinions, etc., is influenced by the listener's perception of the speaker.

5 citations


Proceedings Article
03 Sep 1975
TL;DR: The term "pragmatics" is used here to mean the procedure which applies knowledge about the speaker, the previous dialogue, and the domain of discourse to interpret utterances and respond appropriately.
Abstract: When a person speaks he is using words to achieve a goal, whether that be to gain information, to threaten, to promise, or to reassure. Recognition of that goal is an essential part of speech understanding, both in determining what was said and in deciding what was meant. The term "pragmatics" is used here to mean the procedure which applies knowledge about the speaker, the previous dialogue, and the domain of discourse to interpret utterances and respond appropriately. This procedure invokes definitions of intents (speech acts) and modes of interaction to recognize the goal of a speaker and consequently to understand his utterance.

4 citations


Journal ArticleDOI
TL;DR: In this article, the error energy of the linear prediction inverse filter is used as a criterion for speech and speaker recognition, tested on ten American vowels in the context of “hVd” produced by nine male and seven female speakers.
Abstract: The error energy of the linear prediction inverse filter seems to give an effective criterion for speech and speaker recognition. In this task, reference sounds are stored in terms of their filter coefficients and the error energy of their optimal inverse filters. The error energy of an unknown sound from each reference filter is compared with the reference error energy (Method A). Alternatively, the error energy of the optimal inverse filter for an unknown sound is compared with the error energy of the same sound from the reference filters (Method B). Method A has its advantage in requiring only the autocorrelation function of unknown sounds. The recognition task was conducted with ten American vowels in the context of “hVd” produced by nine male and seven female speakers. A single speaker system averaged 99% recognition, whereas a multispeaker system in which one of the speakers was chosen as the reference averaged 46%. A proper threshold for the criterion suggested that the method is effective for speak

4 citations
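The abstract above notes that the error energy of an unknown sound passed through a reference inverse filter can be obtained from the unknown sound's autocorrelation function alone. The sketch below shows why, assuming the standard autocorrelation method of linear prediction: the residual energy is the quadratic form of the filter coefficients with the Toeplitz autocorrelation matrix. All function names are hypothetical, and scoring Method B as a ratio is an assumption, since the abstract does not say how the two energies are compared.

```python
from typing import List, Sequence, Tuple


def autocorrelation(x: Sequence[float], order: int) -> List[float]:
    """Autocorrelation values r[0..order] of a finite frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]


def levinson_durbin(r: Sequence[float], order: int) -> Tuple[List[float], float]:
    """Optimal inverse filter a = [1, a1, ..., ap] and its residual energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0:
            break  # Degenerate frame: no residual energy left to reduce.
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err  # Reflection coefficient for stage m.
        updated = a[:]
        for j in range(1, m):
            updated[j] = a[j] + k * a[m - j]
        updated[m] = k
        a = updated
        err *= 1.0 - k * k
    return a, err


def residual_energy(inverse_filter: Sequence[float], r: Sequence[float]) -> float:
    """Energy left after filtering a sound with autocorrelation r: a^T R a."""
    p = len(inverse_filter)
    return sum(
        inverse_filter[i] * inverse_filter[j] * r[abs(i - j)]
        for i in range(p)
        for j in range(p)
    )


def method_b_ratio(reference_filter: Sequence[float], unknown: Sequence[float]) -> float:
    """Residual energy under the reference filter relative to the unknown's own optimum.

    Assumes the unknown frame has nonzero optimal residual energy.
    """
    order = len(reference_filter) - 1
    r = autocorrelation(unknown, order)
    _, optimal = levinson_durbin(r, order)
    return residual_energy(reference_filter, r) / optimal
```

Because the optimal filter minimizes the residual energy for its own sound, the ratio is at least 1 and approaches 1 when the reference filter fits the unknown sound well.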


01 Jul 1975
TL;DR: The VICI is an isolated word recognition system capable of recognizing the English digits and four control words, CANCEL, ERASE, VERIFY and TERMINATE, which will accept these words independent of speaker for a large population of General American males.
Abstract: This report describes the development, operation and performance characteristics of an Advanced Development Model of a Voice Input Code Identifier (VICI). The VICI is an isolated word recognition system capable of recognizing the English digits and four control words, CANCEL, ERASE, VERIFY and TERMINATE. The system will accept these words independent of speaker for a large population of General American males. No training of the system by a speaker is necessary. By the use of an alphanumeric output display, a speaker using the system can verify that each digit spoken into the system was correctly recognized. Errors can be corrected through the use of the control words. To confirm system performance several final tests were held, two of which included live inputs rather than tape recordings. The VICI system is based upon the VIP-100 isolated word recognition system which normally requires the input of training data by each talker who uses the system.

3 citations


Journal ArticleDOI
TL;DR: In this article, spontaneous speech samples of 46 male speakers between the ages of 25 and 70 years were played to 40 untrained listeners who estimated the speakers' ages; the perceptual features identified for each perceived age decade should be useful for defining the criteria of "normal" aging speech and in traditional speaker recognition research.
Abstract: Spontaneous speech samples of 46 male speakers between the ages of 25 and 70 years were played to 40 untrained listeners who estimated the speakers' ages. Samples which showed agreement among untrained listeners were played to 20 trained listeners who described the perceptual features of the given perceived ages via an a posteriori schema. Results showed characteristic perceptual features for four perceived age decades which could be classified to pitch, rate of speech, quality, and articulation. It was concluded that the features identified should be useful for defining the criteria of “normal” aging speech, and in traditional speaker recognition research.