
Showing papers on "Speaker recognition published in 1984"


Proceedings ArticleDOI
R. Leonard1
19 Mar 1984
TL;DR: A large speech database has been collected for use in designing and evaluating algorithms for speaker independent recognition of connected digit sequences and formal human listening tests on this database provided certification of the labelling of the digit sequences.
Abstract: A large speech database has been collected for use in designing and evaluating algorithms for speaker independent recognition of connected digit sequences. This dialect balanced database consists of more than 25 thousand digit sequences spoken by over 300 men, women, and children. The data were collected in a quiet environment and digitized at 20 kHz. Formal human listening tests on this database provided certification of the labelling of the digit sequences, and also provided information about human recognition performance and the inherent recognizability of the data.

599 citations


01 Jan 1984
TL;DR: An automatic lipreading system which has been developed and the combination of the acoustic and visual recognition candidates is shown to yield a final recognition accuracy which greatly exceeds the acoustic recognition accuracy alone.
Abstract: Automatic recognition of the acoustic speech signal alone is inaccurate and computationally expensive. Additional sources of speech information, such as lipreading (or speechreading), should enhance automatic speech recognition, just as lipreading is used by humans to enhance speech recognition when the acoustic signal is degraded. This paper describes an automatic lipreading system which has been developed. A commercial device performs the acoustic speech recognition independently of the lipreading system. The recognition domain is restricted to isolated utterances and speaker dependent recognition. The speaker faces a solid state camera which sends digitized video to a minicomputer system with custom video processing hardware. The video data is sampled during an utterance and then reduced to a template consisting of visual speech parameter time sequences. The distances between the incoming template and all of the trained templates for each utterance in the vocabulary are computed and a visual recognition candidate is obtained. The combination of the acoustic and visual recognition candidates is shown to yield a final recognition accuracy which greatly exceeds the acoustic recognition accuracy alone. Practical considerations and the possible enhancement of speaker independent and continuous speech recognition systems are also discussed.

389 citations
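The final step of the lipreading system above, combining acoustic and visual recognition candidates, can be sketched as a late-fusion score combination. The equal weighting and the per-word score dictionaries are assumptions for illustration; the paper does not specify its combination rule:

```python
def fuse_candidates(acoustic, visual, w=0.5):
    # Late fusion sketch: combine per-word acoustic and visual match
    # scores (lower = better) with a fixed weight w, then pick the
    # best-scoring word. The 50/50 weighting is an assumption.
    fused = {word: w * acoustic[word] + (1 - w) * visual[word]
             for word in acoustic}
    return min(fused, key=fused.get)
```

Here a word that scores poorly acoustically can still win when the visual evidence strongly favors it, which is the effect the paper reports.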


Journal ArticleDOI
TL;DR: In this article, subjects listened to a series of recorded voice samples obtained from unfamiliar speakers and were then given a two-alternative forced-choice recognition test; voice learning was found to be inferior to face learning.
Abstract: Subjects listened to a series of recorded voice samples obtained from unfamiliar speakers and were then given a two-alternative forced-choice recognition test. Recognition performance improved when the voice-sample duration was increased from 6 to 60 s, when the target set size was reduced from 20 to 5 voices, and when slides of faces provided context information. Recognition performance was not significantly different for retention intervals of 15 min and 10 days. For the conditions of our experiments, voice learning was inferior to face learning.

93 citations


PatentDOI
TL;DR: In this paper, a speech recognition apparatus includes a speech signal analyzing circuit for time-sequentially generating acoustic parameter patterns representing the phonetic features of speech signals, and phoneme reference memories each storing a plurality of reference parameter pattern vectors.
Abstract: A speech recognition apparatus includes a speech signal analyzing circuit for time-sequentially generating acoustic parameter patterns representing the phonetic features of speech signals, and phoneme reference memories each storing a plurality of reference parameter pattern vectors. A phoneme pattern vector from the speech signal analyzing circuit is compared with each of the reference pattern vectors stored in the phoneme reference memories in order to recognize an input speech. The speech signal analyzing circuit has a parameter extraction circuit for time-sequentially extracting acoustic parameter patterns representing the speech signal, a first phoneme pattern vector memory for storing a phoneme pattern vector including an acoustic parameter pattern of each frame from the parameter extraction circuit, and a second phoneme pattern vector memory for storing a phoneme pattern vector including a plurality of parameter patterns from the parameter extraction circuit.

69 citations


Patent
31 Dec 1984
TL;DR: In this paper, a method and system for speaker enrollment, as well as for speaker recognition, is described, where each candidate speaker is assigned a set of short acoustic segments of phonemic duration.
Abstract: The invention provides a method and system for speaker enrollment, as well as for speaker recognition. Speaker enrollment creates for each candidate speaker a set of short acoustic segments, or templates, of phonemic duration. An equal number of templates is derived from every candidate speaker's training utterance. A speaker's template set serves as a model for that speaker. Recognition is accomplished by employing a continuous speech recognition (CSR) system to match the recognition utterance with each speaker's template set in turn. The system selects the speaker whose templates match the recognition utterance most closely, that is, the speaker whose CSR match score is lowest. The method of the invention incorporates the entire training utterance in each speaker model, and explains the entire test utterance. The method of the invention models individual short segments of the speech utterances as well as their long-term statistics. Both static and dynamic speaker characteristics are captured in the speaker models.

28 citations
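The selection rule in the patent above, choosing the speaker whose template set yields the lowest match score, can be sketched as follows. The scalar "segments" and the absolute-difference distance are toy stand-ins for real acoustic features and a CSR match:

```python
def match_score(utterance, templates):
    # Toy stand-in for a CSR match: each utterance segment is scored
    # against its nearest template and segment scores are summed
    # (lower total = closer match).
    return sum(min(abs(seg - t) for t in templates) for seg in utterance)

def identify_speaker(utterance, speaker_templates):
    # Select the enrolled speaker whose template set matches the
    # recognition utterance most closely, i.e. whose score is lowest.
    scores = {spk: match_score(utterance, tpl)
              for spk, tpl in speaker_templates.items()}
    return min(scores, key=scores.get)
```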


Proceedings ArticleDOI
01 Mar 1984
TL;DR: The paper describes an automatic method, called Automatic Diphone Bootstrapping (or A.D.B.), for template extraction for Speaker-Adaptive Continuous Speech Recognition using "diphones" as speech units, which operates without any manual intervention and performed very well for all the speakers on which it was tested.
Abstract: The paper describes an automatic method, called Automatic Diphone Bootstrapping (or A.D.B.), for template extraction for Speaker-Adaptive Continuous Speech Recognition using "diphones" as speech units. Diphones have proved to be very suitable for C.S.R. as they meet the main requirements of phonetic units: invariance with the context and economy. Furthermore the performance of diphone-based speaker dependent C.S.R. systems is very high. For a long time manual extraction has been presented in the literature as the only completely reliable method for sub-word template creation for any speaker (see [1] as an example). Recently some automatic techniques for reference pattern extraction were developed [2,3], but they also require some manual corrections. The A.D.B. procedure operates without any manual intervention and performed very well for all the speakers on which it was tested. In a connected digit recognition task, a W.R.R. of 98.79% was achieved by using the speaker-adaptive templates created by the A.D.B. procedure.

14 citations


Proceedings ArticleDOI
19 Mar 1984
TL;DR: An artificial speech recognition experiment is introduced as a convenient means of assessing alignment accuracy, and alignment accuracy is found to be improved considerably by applying certain speaker adaptation transformations to the synthetic speech.
Abstract: A capacity to carry out reliable automatic time alignment of synthetic speech to naturally produced speech offers potential benefits in speech recognition and speaker recognition as well as in synthesis itself. Phrase alignment experiments are described that indicate that alignment to synthetic speech is more difficult than alignment of speech from two natural speakers. An artificial speech recognition experiment is introduced as a convenient means of assessing alignment accuracy. By this measure, alignment accuracy is found to be improved considerably by applying certain speaker adaptation transformations to the synthetic speech, by modifying the spectrum similarity metric, and by generating the synthetic spectra directly from the control parameters using simplified excitation spectra. The improvements seem to limit, however, at a level below that found between natural speakers. It is conjectured that further improvement requires modifications to the synthesis rules themselves.

13 citations
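Automatic time alignment of the kind studied above is classically done with dynamic time warping. A minimal one-dimensional sketch (real systems align spectral frame vectors under a spectrum similarity metric, not scalars):

```python
import math

def dtw(ref, test, dist=lambda a, b: abs(a - b)):
    # Classic dynamic time warping: accumulate the cheapest monotonic
    # alignment path between two sequences.
    n, m = len(ref), len(test)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(ref[i - 1], test[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # skip a ref frame
                                 D[i][j - 1],      # skip a test frame
                                 D[i - 1][j - 1])  # match frames
    return D[n][m]
```

A stretched-but-identical sequence aligns at zero cost, which is exactly the tolerance to timing differences that alignment between a synthesizer and a natural speaker requires.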


Journal ArticleDOI
R. Pieraccini1
TL;DR: In this work three different pattern compression techniques are compared on the basis of efficiency as well as recognition performance when applied to pattern matching by means of dynamic programming in a speaker dependent context.

10 citations


Proceedings ArticleDOI
01 Mar 1984
TL;DR: The basic idea is to reduce the number of word candidates for the recognition by looking for robust phonetic features computed from the input signal, and it is possible to design a multiprocessor structure in order to reduce the overall recognition time.
Abstract: Our group has been designing for the past twelve years several speech recognition systems, from isolated vocabulary pattern matching systems to continuous speech understanding systems. The experiments we carried out showed us that the systems designed for restricted-vocabulary tasks were not readily extensible to large vocabularies. We therefore started some years ago implementing a 200 word recognition system using a phonetic approach. This system was tested successfully in 1980. In continuation of this research we decided to extend our approach to a 1000 word vocabulary. This paper describes the principles involved in this system together with the preliminary results already obtained. The basic idea is to reduce the number of word candidates for the recognition by looking for robust phonetic features computed from the input signal. These features are used as a key for accessing the lexicon. Since the determination of the features is carried out in parallel with the phonetic decoding of the input word, it is possible to design a multiprocessor structure in order to reduce the overall recognition time. The determination of crude phonetic features is described together with the organization of the lexicon. Some preliminary results are finally presented and discussed.

8 citations


Proceedings ArticleDOI
01 Mar 1984
TL;DR: An algorithm for speaker-independent connected digit recognition for telephone use, and its experimental results are described, which shows the average correct recognition score to be 94% for each Japanese digit in their connected utterances through actual telephone lines.
Abstract: An algorithm for speaker-independent connected digit recognition for telephone use, and its experimental results, are described. The main features of this algorithm are the use of multiple reference templates assigned to each speaker class, a continuous DP matching process for word spotting, and partial reference templates to confirm spotted digits. The K-nearest neighbor decision rule and pair-comparison judgement are used to obtain the final result from spotted digit sequences. Experimental results show the average correct recognition score to be 94% for each Japanese digit in their connected utterances through actual telephone lines.

8 citations
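The K-nearest neighbor decision rule named in the abstract above can be sketched as a majority vote among the k closest reference templates for one spotted digit position. The (distance, digit) candidate pairs are an assumed input format for illustration:

```python
from collections import Counter

def knn_digit(candidates, k=3):
    # candidates: (distance, digit) pairs for one spotted position.
    # Take the k references nearest to the spotted segment and vote;
    # the majority digit wins (K-nearest-neighbor decision rule).
    nearest = sorted(candidates)[:k]
    votes = Counter(digit for _, digit in nearest)
    return votes.most_common(1)[0][0]
```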


Proceedings ArticleDOI
01 Mar 1984
TL;DR: A system is proposed for automatic speech recognition using syllable templates that is based on an overall likelihood measure calculated for each item stored in the lexicon.
Abstract: A system is proposed for automatic speech recognition using syllable templates. In this system, input speech signal is analyzed and matched against syllable templates and converted into parameters characterizing each candidate syllable. Word recognition is based on an overall likelihood measure calculated for each item stored in the lexicon. A method is also developed for the optimization of syllable templates. The validity of the proposed method was tested in a preliminary recognition experiment in which a lexicon consisting of 1000 city names was used to recognize utterances of 100 city names by a female speaker. The rate of correct recognition was 96.5%.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: This paper describes the speaker-independent spoken word recognition system for a large size vocabulary and results are obtained for the training samples in the 212 words uttered by 10 male and 10 female speakers.
Abstract: This paper describes the speaker-independent spoken word recognition system for a large size vocabulary. Speech is analyzed by the filter bank, from whose logarithmic spectrum the 11 features are extracted every 10 ms. Using the features the speech is first segmented and the primary phoneme recognition is carried out for every segment using the Bayes decision method. After correcting errors in segmentation and phoneme recognition, the secondary recognition of part of the consonants is carried out and the phonemic sequence is determined. The word dictionary item having maximum likelihood to the sequence is chosen as the recognition output. The 75.9% score for the phoneme recognition and the 92.4% score for the word recognition are obtained for the training samples in the 212 words uttered by 10 male and 10 female speakers. For the same words uttered by 30 male and 20 female speakers different from the above speakers, the 88.1% word recognition score is obtained.
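The Bayes decision step used for the primary phoneme recognition above can be sketched with one-dimensional Gaussian phoneme models. Real systems would use the 11-dimensional filter-bank features; the single scalar feature here is a simplification:

```python
import math

def gaussian_loglike(x, mean, var):
    # Log likelihood of scalar feature x under a 1-D Gaussian model.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def bayes_decide(x, models, priors):
    # Bayes decision: pick the phoneme maximizing
    # log prior + log likelihood.
    return max(models, key=lambda p: math.log(priors[p]) +
                                     gaussian_loglike(x, *models[p]))
```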

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A Speaker Recognizability Test (SRT) is presented, which tries to establish how well a given communications system preserves a speaker's identity.
Abstract: Speech intelligibility and quality are the two most often tested features of speech coding systems. However, another feature of interest in store-and-forward applications is the preservation of a speaker's identity. Here, a Speaker Recognizability Test (SRT) is presented, which tries to establish how well a given communications system preserves a speaker's identity. Contrary to previous efforts, no attempt is made to identify the cues used by listeners for speaker recognition. Instead, listeners are asked directly to identify a speaker who says an utterance by comparing the uttered sentence with reference sentences, one from each speaker. Among the issues considered in the design of the test is the choice of speakers, the use of reference sentences from the same or different sessions of data collection, and the use of processed or unprocessed speech for reference.

Proceedings ArticleDOI
19 Mar 1984
TL;DR: A phrase unit speech recognition system is discussed, which is applicable for a large vocabulary and is independent of the task, and a technique to recognize phrases based on the phoneme recognition is introduced.
Abstract: A phrase unit speech recognition system is discussed, which is applicable for a large vocabulary and is independent of the task. In the case of large vocabulary, it is desirable to express the words in the dictionary by the sequence of phonemes or phoneme-like units. Therefore, the recognition of phonemes in continuous speech is essential to achieve a flexible speech understanding system. In this paper, a technique to recognize phrases based on the phoneme recognition is introduced. The system is composed of the phoneme recognition part and the phrase recognition part. In the phoneme recognition part, the features in the articulatory domain are extracted and applied to compensate coarticulation. In the phrase recognition part, a word sequence corresponding to the phoneme sequence is determined by using two-level DP matching with automaton control, in which words are processed symbolically to attain the acceptable processing speed.

Journal ArticleDOI
TL;DR: Word classification based on the Mahalanobis distance metric, and using templates derived from cluster analysis of the training inputs, was found to give results superior to the other strategies studied, and the principle of clustering was successfully applied to produce an adaptive system which tracked changes in the user's voice.
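Classification by Mahalanobis distance to cluster-derived templates can be sketched as follows; a diagonal covariance is assumed here for simplicity (the paper's clusters may use full covariances):

```python
import math

def mahalanobis_diag(x, mean, var):
    # Mahalanobis distance under a diagonal covariance: each feature
    # difference is scaled by that feature's variance before summing.
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

def classify_word(x, clusters):
    # clusters: {word: (mean, var)} templates from cluster analysis
    # of the training inputs; pick the nearest word.
    return min(clusters, key=lambda w: mahalanobis_diag(x, *clusters[w]))
```

Scaling by per-feature variance is what distinguishes this metric from plain Euclidean distance: features the training data shows to be variable count for less.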

Proceedings ArticleDOI
19 Mar 1984
TL;DR: Two new methods by which the CMU feature-based recognition system can learn the acoustical characteristics of individual speakers without feedback from the user are described.
Abstract: This paper describes two new methods by which the CMU feature-based recognition system can learn the acoustical characteristics of individual speakers without feedback from the user. We have previously described how the system uses MAP techniques to update its estimates of the mean values of features used by the classifier in recognizing the letters of the English alphabet on the basis of a priori information and labelled observations. In the first of the new procedures described in this paper the system assumes a correct decision every time it classifies a new utterance with a sufficiently high confidence level. In the second new procedure the system adjusts its estimates of the means on the basis of their correlation with the average values of the features over all utterances. Experiments were conducted on two confusable sets of letters using both speaker adaptation procedures. In each case classification performance using the unsupervised estimation procedures could equal that obtained using speaker adaptation with feedback from the user, although which method provided the better performance depended on which set of letters was being classified.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A speaker adaptation method that follows two steps -- selection of "persons" who have voices similar to the user's and generation of a speaker-adapted dictionary from their dictionaries is studied.
Abstract: A speaker-trained voice recognition system with a large vocabulary has a serious weak point, that is, the user must register a large number of words prior to its use. To be freed from this problem, the authors have studied a speaker adaptation method. This method follows two steps -- 1) selection of "persons" who have voices similar to the user's and 2) generation of a speaker-adapted dictionary from their dictionaries. Results of simulation using 1000-word speech samples by 40 male speakers (20 for standard dictionaries and 20 for performance evaluation) are reported. The results indicated the advantage of this method. The speaker-trained dictionary gave 90.1% recognition accuracy, the speaker-independent dictionary gave 83.6%, and the speaker-adapted dictionary which required only 10% of the vocabulary for training gave 85.7%.
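The two adaptation steps above can be sketched directly; the scalar per-word templates and the averaging rule in step 2 are illustrative assumptions:

```python
def select_similar(user_words, dictionaries, n=3):
    # Step 1: rank stored speakers by how closely their templates for
    # the user's few enrolled words match, keeping the n closest.
    def dist(d):
        return sum(abs(user_words[w] - d[w]) for w in user_words)
    return sorted(dictionaries, key=lambda s: dist(dictionaries[s]))[:n]

def adapted_dictionary(vocab, dictionaries, chosen):
    # Step 2: build the speaker-adapted dictionary by averaging the
    # chosen speakers' templates for every word in the vocabulary,
    # including words the user never uttered.
    return {w: sum(dictionaries[s][w] for s in chosen) / len(chosen)
            for w in vocab}
```

The payoff reported in the abstract follows from step 2: the user enrolls only a fraction of the vocabulary, yet receives adapted templates for all of it.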

Proceedings Article
06 Aug 1984
TL;DR: A planning system for recognizing connected letters is described and some preliminary experimental results are reported.
Abstract: A planning system for recognizing connected letters is described and some preliminary experimental results are reported.




Proceedings ArticleDOI
01 Mar 1984
TL;DR: It is shown that this type of coder operating at 7.2 kbps, provides a good communications quality, an intelligibility which is sufficient for most of telephony applications, and a perfect speaker recognition (natural voice).
Abstract: In this paper, we discuss the implementation of a low bit-rate linear prediction base-band coder on a bipolar signal processor having a processing capacity of 10 million instructions per second (MIPS). We show that the implementation of our algorithm requires less than 5 MIPS, with a ROS occupancy less than 5 K instructions. Some quality evaluation tests are also reported, and show that this type of coder operating at 7.2 kbps provides a good communications quality, an intelligibility which is sufficient for most telephony applications, and a perfect speaker recognition (natural voice).

Proceedings ArticleDOI
01 Mar 1984
TL;DR: The Mark II system provides both speaker dependent and multiple speaker recognition of up to a 32 isolated word active vocabulary in real-time on a 2 MHz 6502, with no custom hardware, except an inexpensive microphone, pre-amp, and 8-bit A/D converter.
Abstract: Recent developments have made it possible to implement high performance speech recognition with much less computation than traditional techniques, thereby enabling real-time computation on standard microprocessors. Concepts such as time-domain acoustic-phonetic speech signal processing as well as efficient adaptations of hidden Markov models can provide this type of capability. The Mark II system provides both speaker dependent and multiple speaker recognition of an active vocabulary of up to 32 isolated words in real-time on a 2 MHz 6502, with no custom hardware except an inexpensive microphone, pre-amp, and 8-bit A/D converter. On an initial test of 5120 test utterances (Texas Instruments isolated word database, Spectrum, Sept., 1981), the Mark II achieved an error rate of only 0.67% (34 errors).

Journal ArticleDOI
TL;DR: A method of speaker-independent connected-word recognition by robust segmentation for speaker variation by varying the matching path adaptively with respect to each phoneme, at the dynamic-programming word-matching level is proposed.

Proceedings ArticleDOI
01 Jan 1984
TL;DR: This study uses operational evaluation techniques to model a system which processes human speech to verify the identity of persons seeking access to a facility resource, and decides whether the speaker is valid or an imposter based on the degree of similarity observed.
Abstract: This study uses operational evaluation techniques to model a system which processes human speech to verify the identity of persons seeking access to a facility resource. The system consists of hardware and software for accepting analog speech; extracting time, frequency, and amplitude characteristics; producing compact digital templates containing the features for speaker identification; and cross-referencing templates with reference patterns to establish the degree of similarity between an utterance and a set of utterances for the person whose identity is being claimed. A decision algorithm is implemented to determine whether the speaker is valid or an imposter based on the degree of similarity observed. A conceptual model has been tested and used to simulate variations in system attributes in order to optimize system performance. Performance is evaluated in terms of the number of imposters who can defeat the system, and the number of rejected valid speakers.
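The two performance figures the study trades off, imposters who defeat the system versus rejected valid speakers, can be sketched as a threshold sweep over similarity scores. The score lists and threshold are illustrative inputs:

```python
def error_rates(valid_scores, imposter_scores, threshold):
    # Accept an identity claim when similarity >= threshold.
    # Returns (false-rejection rate, false-acceptance rate): the
    # fraction of valid speakers rejected and of imposters accepted.
    frr = sum(s < threshold for s in valid_scores) / len(valid_scores)
    far = sum(s >= threshold for s in imposter_scores) / len(imposter_scores)
    return frr, far
```

Raising the threshold lowers the false-acceptance rate at the cost of rejecting more valid speakers, which is exactly the optimization the simulated model explores.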


Patent
02 Oct 1984
TL;DR: In this paper, a plurality of speech feature vectors are generated from the time series of the speech feature parameters for the input speech pattern, by taking account of knowledge concerning the variation tendencies of speech patterns, and the learning (preparation) of reference pattern vectors for speech recognition is carried out by the use of the feature vectors thus generated.
Abstract: In the learning method of reference pattern vectors for speech recognition in accordance with the present invention, a plurality of speech feature vectors are generated (block 20) from the time series of speech feature parameters for the input speech pattern, by taking account of knowledge concerning the variation tendencies of the speech patterns, and the learning (preparation) of reference pattern vectors for speech recognition is carried out (block 22) by the use of the speech feature vectors thus generated. In particular, the method according to the present invention will become effective when it is combined with a statistical pattern recognition method that can absorb wide variations in the speech patterns.


16 Jul 1984
TL;DR: A commonly cited drawback of narrowband systems such as the DoD standard linear predictive coding (LPC) algorithm is that speaker recognition is poor, yet it is the opinion of many users that they frequently recognize the speaker.
Abstract: A commonly cited drawback of narrowband systems such as the DoD standard linear predictive coding (LPC) algorithm is that speaker recognition is poor. Yet it is the opinion of many users that they frequently recognize the speaker. Tape recordings of 24 speakers conversing over an unprocessed channel and over an LPC voice processing system were subjected to listening tests. Twenty-four co-workers listened to the tapes and attempted to identify each speaker from a list of about 40 people in the same branch. Prior to the recognition tests, each of the listeners also rated his or her familiarity with each of the speakers and the distinctiveness of each speaker's voice. There was some loss in voice recognition over LPC, but the recognition rate was still quite high. Unprocessed voices were correctly identified 88% of the time, whereas the same people talking over the LPC system were correctly identified 69% of the time. Talker familiarity was significantly correlated with correct identifications. There was no significant correlation between the rated distinctiveness of the speaker and correct identifications. However, familiarity and distinctiveness ratings were highly correlated.