
Showing papers on "Speaker recognition published in 1986"



Proceedings ArticleDOI
01 Apr 1986
TL;DR: Speaker adaptation algorithms based on vector quantization (VQ), a technique that drastically reduces computation and memory requirements, are proposed in order to improve speaker-independent recognition.
Abstract: Vector quantization (VQ) is a technique that reduces the computation amount and memory size drastically. In this paper, speaker adaptation algorithms through VQ are proposed in order to improve speaker-independent recognition. The speaker adaptation algorithms use VQ codebooks of a reference speaker and an input speaker. Speaker adaptation is performed by substituting vectors in the codebook of a reference speaker for vectors of the input speaker's codebook, or vice versa. To confirm the effectiveness of these algorithms, word recognition experiments are carried out using the IBM office correspondence task uttered by 11 speakers. The total number of words is 1174 for each speaker, and the number of different words is 422. The average word recognition rate using different speaker's reference through speaker adaptation is 80.9%, and the rate within the second choice is 92.0%.
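The codebook-substitution idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: `train_codebook` and `adapt_codebook` are hypothetical names, and plain k-means stands in for whatever codebook training procedure the authors used.

```python
import numpy as np

def train_codebook(frames, size, iters=20, seed=0):
    # Build a VQ codebook with plain k-means (an LBG-style sketch).
    rng = np.random.default_rng(seed)
    codebook = frames[rng.choice(len(frames), size, replace=False)].copy()
    for _ in range(iters):
        # Assign each feature frame to its nearest codeword.
        dists = np.linalg.norm(frames[:, None] - codebook[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(size):
            members = frames[labels == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def adapt_codebook(reference_cb, input_cb):
    # Adaptation by substitution: replace each reference codeword
    # with the input speaker's nearest codeword.
    dists = np.linalg.norm(reference_cb[:, None] - input_cb[None], axis=2)
    return input_cb[dists.argmin(axis=1)]
```

After adaptation, the reference speaker's templates are quantized with the adapted codebook, so they sound "in the voice" of the input speaker's feature space.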

269 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: The experimental results show that the instantaneous and transitional representations are relatively uncorrelated, thus providing complementary information for speaker recognition; simple transmission channel variations are shown to affect the instantaneous spectral representations and the corresponding recognition performance significantly, while the transitional representations and performance are relatively resistant.
Abstract: The use of instantaneous and transitional spectral representations of spoken utterances for speaker recognition is investigated. LPC-derived cepstral coefficients are used to represent instantaneous spectral information, and best linear fits of each cepstral coefficient over a specified time window are used to represent transitional information. An evaluation has been carried out using a data base of isolated digit utterances spoken by 10 talkers over dialed-up telephone lines. Two vector quantization (VQ) codebooks, instantaneous and transitional, are constructed from training utterances for each speaker. The experimental results show that the instantaneous and transitional representations are relatively uncorrelated, thus providing complementary information for speaker recognition. A rectangular window of approximately 100-150 ms duration provides an effective estimate of spectral transitions for speaker recognition. Also, simple transmission channel variations are shown to affect the instantaneous spectral representations and the corresponding recognition performance significantly, while the transitional representations and performance are relatively resistant.
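The transitional representation, the best linear fit of each cepstral coefficient over a time window, reduces to a regression slope per coefficient. A minimal sketch of that computation (window length and edge padding here are assumptions, not the paper's exact settings):

```python
import numpy as np

def delta_cepstra(cepstra, window=5):
    # Transitional features: slope of the least-squares linear fit of
    # each cepstral coefficient over 2*window+1 frames centered at t.
    ks = np.arange(-window, window + 1)
    denom = np.sum(ks ** 2)
    # Pad edges by repeating the first/last frame.
    padded = np.pad(cepstra, ((window, window), (0, 0)), mode="edge")
    deltas = np.zeros_like(cepstra, dtype=float)
    for t in range(len(cepstra)):
        frame_window = padded[t:t + 2 * window + 1]   # frames t-K .. t+K
        deltas[t] = ks @ frame_window / denom         # regression slope
    return deltas
```

With a 10 ms frame shift, `window=5` spans roughly 110 ms, in the 100-150 ms range the paper found effective.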

228 citations


Proceedings ArticleDOI
07 Apr 1986
TL;DR: Results are given which show that HMMs provide a versatile pattern matching tool suitable for some image processing tasks as well as speech processing problems.
Abstract: A handwritten script recognition system is presented which uses Hidden Markov Models (HMM), a technique widely used in speech recognition. The script is encoded as templates in the form of a sequence of quantised inclination angles of short equal length vectors together with some additional features. A HMM is created for each written word from a set of training data. Incoming templates are recognised by calculating which model has the highest probability for producing that template. The task chosen to test the system is that of handwritten word recognition, where the words are digits written by one person. Results are given which show that HMMs provide a versatile pattern matching tool suitable for some image processing tasks as well as speech processing problems.
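The "pick the model with the highest probability" step can be sketched with the standard scaled forward algorithm for discrete HMMs. The model parameters below are illustrative only, not taken from the paper:

```python
import numpy as np

def forward_log_prob(obs, pi, A, B):
    # Log-likelihood of a discrete observation sequence under an HMM,
    # computed with the forward algorithm and per-step scaling to
    # avoid numerical underflow on long templates.
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
        s = alpha.sum()
        log_p += np.log(s)
        alpha /= s
    return log_p

def recognize(obs, models):
    # Recognition: the word whose HMM best explains the template wins.
    return max(models, key=lambda w: forward_log_prob(obs, *models[w]))
```

Here each entry of `models` maps a word to its `(pi, A, B)` parameters (initial, transition, and emission probabilities), trained from that word's templates.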

124 citations


Journal ArticleDOI
TL;DR: This paper focuses on the long-term intra-speaker variability of feature parameters as one of the most crucial problems in speaker recognition, and presents an investigation into methods for reducing the effects of long-term spectral variability on recognition accuracy.

79 citations


PatentDOI
TL;DR: In this article, a continuous speech recognition system with a speech processor and a word recognition computer subsystem is described, which is characterized by an element for developing a graph for confluent links between confluent nodes.
Abstract: A continuous speech recognition system having a speech processor and a word recognition computer subsystem, characterized by an element for developing a graph of confluent links between confluent nodes; an element for developing a graph of boundary links between adjacent words; an element for storing an inventory of confluent links and boundary links as a coding inventory; and an element for converting an unknown utterance into an encoded sequence of confluent links and boundary links corresponding to recognition sequences stored in the word recognition subsystem's recognition vocabulary. The invention also includes a method for achieving continuous speech recognition by characterizing speech as a sequence of confluent links which are matched with candidate words. The invention applies to isolated word speech recognition as well as continuous speech recognition, except that in the isolated word case there are no boundary links.

68 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper describes the results of the work in designing a system for large-vocabulary word recognition of continuous speech, and generalizes the use of context-dependent Hidden Markov Models of phonemes to take into account word-dependent coarticulatory effects.
Abstract: This paper describes the results of our work in designing a system for large-vocabulary word recognition of continuous speech. We generalize the use of context-dependent Hidden Markov Models (HMM) of phonemes to take into account word-dependent coarticulatory effects. Robustness is assured by smoothing the detailed word-dependent models with less detailed but more robust models. We describe training and recognition algorithms for HMMs of phonemes-in-context. On a task with a 334-word vocabulary and no grammar (i.e., a branching factor of 334), in speaker-dependent mode, we show an average reduction in word error rate from 24% using context-independent phoneme models to 10% when using robust context-dependent phoneme models.
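The smoothing of detailed models with robust ones is, in spirit, an interpolation whose weight grows with the amount of training data behind the detailed model. A hedged sketch of that idea (the weight formula and the constant `k` are assumptions; the paper's actual smoothing scheme may differ):

```python
import numpy as np

def smooth_model(detailed_params, robust_params, count, k=10.0):
    # Interpolate a detailed context-dependent model with a robust
    # context-independent one. With few training examples the robust
    # model dominates; with many, the detailed model does.
    # k is a hypothetical tuning constant.
    w = count / (count + k)
    return w * detailed_params + (1.0 - w) * robust_params
```

Applied to each parameter vector of a phoneme-in-context model, this keeps rarely seen contexts from being modeled by unreliable estimates.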

59 citations


Journal ArticleDOI
TL;DR: Improved recognition performance is demonstrated by explicitly modeling the correlation between spectral measurements of adjacent frames; and using a distance measure which is a function of the recognition reference frame being used.
Abstract: The performance of current speaker independent speech recognition technology is limited by the inadequacy of the measures of the speech data to discriminate between different speech sounds. In particular, two critical assumptions that underlie and limit most current recognition techniques are that: 1) speech data from different frames are statistically independent (e.g., there are no between-frame interactions); and 2) speech data statistics are independent of phonetic events (e.g., distance measures are fixed and independent of input or reference speech). In the context of speaker independent isolated digit recognition, improved recognition performance is demonstrated by: 1) explicitly modeling the correlation between spectral measurements of adjacent frames; and 2) using a distance measure which is a function of the recognition reference frame being used. A statistical model was created from a 2464 token database (2 tokens of each of 11 words "zero" through "nine" and "oh") for 112 speakers. Primary features include energy and filter bank amplitudes. Interspeaker variability was estimated by time aligning all training tokens and creating an ensemble of 224 feature vectors for each reference frame. Normal distributions were then estimated individually for each frame jointly with its neighbors. Testing was performed on a multidialect database of 2486 spoken digit tokens collected from 113 (different) speakers using maximum-likelihood decision methods. The substitution rate dropped from 1.7 to 1.4 percent with incorporation of between-frame statistics, and further to 0.6 percent with incorporation of frame-specific statistics in the likelihood model.
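Modeling the correlation between spectral measurements of adjacent frames amounts to fitting a joint distribution over frame pairs rather than independent single frames. A minimal Gaussian sketch of that idea (function names are hypothetical; the paper fits normal distributions per reference frame, which this simplifies to a single model):

```python
import numpy as np

def fit_joint_gaussian(frames):
    # Between-frame statistics: fit a Gaussian to the concatenation
    # of each frame with its successor, capturing their correlation.
    pairs = np.hstack([frames[:-1], frames[1:]])   # shape (T-1, 2*dim)
    mean = pairs.mean(axis=0)
    cov = np.cov(pairs, rowvar=False)
    return mean, cov

def log_likelihood(frames, mean, cov):
    # Score a test utterance under the joint adjacent-frame model.
    pairs = np.hstack([frames[:-1], frames[1:]])
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    d = pairs - mean
    quad = np.einsum("ti,ij,tj->t", d, inv, d)
    k = pairs.shape[1]
    return np.sum(-0.5 * (quad + logdet + k * np.log(2 * np.pi)))
```

The off-diagonal blocks of `cov` are exactly the between-frame correlations that a frame-independence assumption discards.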

44 citations


Proceedings ArticleDOI
Richard V. Cox1, D. Bock, K. Bauer, James D. Johnston, J. Snyder 
01 Apr 1986
TL;DR: The underlying principles of the AVPS algorithm, its implementation, and laboratory test results are described, and the quality of the decrypted speech is considered very natural, and speaker recognition is retained — a significant advantage over digital vocoders.
Abstract: The Analog Voice Privacy System is based on individual sample permutation of the output samples of a sub-band coder analysis filterbank. The system has a large number of digital keys, giving it the strength of a digital encryption system, while retaining the good quality characteristics of analog scramblers. It has been implemented in a real-time hardware prototype designed for evaluation in the field. The units work with any modular telephone and standard 120 V AC power. The device contains two circuit boards, one for analog and one for digital processing, which together contain four digital signal processors. There are 125! possible permutation keys. These prototypes were designed to be tested in real telephone environments. To date, the device has been successfully tested over long distance telephone connections, several different analog and digital PBXs and telephone switches, and a channel simulator. The quality of the decrypted speech is considered very natural, and in particular, speaker recognition is retained. This is a significant advantage over digital vocoders. This paper describes the underlying principles of the algorithm, the details of its implementation, and laboratory test results.
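The permutation principle (125! keys, because each block of 125 samples can be reordered arbitrarily) can be illustrated with a toy scrambler. This sketch operates on plain sample lists; the real AVPS permutes sub-band filterbank outputs, and its key handling is of course not a seeded PRNG:

```python
import random

def make_key(seed, block=125):
    # One of block! possible keys: a permutation of block positions.
    rng = random.Random(seed)
    perm = list(range(block))
    rng.shuffle(perm)
    return perm

def scramble(samples, perm):
    # Reorder every block of samples with the key permutation.
    block = len(perm)
    assert len(samples) % block == 0, "pad input to a whole number of blocks"
    out = []
    for i in range(0, len(samples), block):
        chunk = samples[i:i + block]
        out.extend(chunk[p] for p in perm)
    return out

def descramble(samples, perm):
    # Applying the inverse permutation undoes the scramble exactly.
    inverse = [0] * len(perm)
    for i, p in enumerate(perm):
        inverse[p] = i
    return scramble(samples, inverse)
```

Because descrambling recovers the samples exactly, the speech quality, and with it speaker identity, survives the privacy transform, unlike a lossy vocoder.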

43 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: Methods for text-independent speaker identification that deal with the variability in the data introduced by unknown telephone channels including probabilistic channel modeling, a channel-invariant model and a modified-Gaussian model are considered.
Abstract: We consider methods for text-independent speaker identification that deal with the variability in the data introduced by unknown telephone channels. The methods investigated include probabilistic channel modeling, a channel-invariant model and a modified-Gaussian model. The methods are described and then evaluated with experiments conducted with a twenty speaker database of long distance telephone calls.

36 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: It is shown that in speaker-dependent recognition of the alpha-numeric vocabulary, the PLP method in VQ-based ASR yields recognition scores similar to those of the standard ASR system.
Abstract: The perceptually based linear predictive (PLP) speech analysis method is applied to isolated word automatic speech recognition (ASR). The low dimensionality of the PLP analysis vector, which is otherwise identical in form to the standard linear predictive (LP) analysis vector, allows for computational and storage savings in ASR. We show that in speaker-dependent recognition of the alpha-numeric vocabulary, the PLP method in VQ-based ASR yields recognition scores similar to those of the standard ASR system. The main focus of the paper is on cross-speaker ASR. We demonstrate in experiments with vowel centroids of two male speakers and one female speaker that the PLP speech representation is more consistent with the underlying phonetic information than the standard LP method. Conclusions from the experiments are confirmed by the superior performance of the PLP method in cross-speaker isolated word recognition.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: This paper describes a technique of spectral transformation for improved adaptation of a knowledge data base or reference templates to new speakers in automatic speech recognition (ASR).
Abstract: This paper describes a technique of spectral transformation for improved adaptation of a knowledge base or reference templates to new speakers in automatic speech recognition (ASR). Based on a statistical analysis tool (canonical correlation analysis), the proposed method makes it possible to improve speaker independence in large-vocabulary ASR. Applied to an isolated word recognizer, it improved the correct-recognition score from 70% to 87%.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A new method based on template matching utilizes temporal information to advantage, performs text-dependent recognition as a special case, and is compared with similar recently developed methods.
Abstract: Text-independent speaker recognition methods have been based on measurements of long-term statistics of individual speech frames. These methods are not capable of modeling speaker-dependent speech dynamics. In this paper, we describe a new method, based on template matching, that utilizes temporal information to advantage. The template-matching method performs text-dependent recognition as a special case. Performance of the template-matching method is compared with that of similar recently-developed methods.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A database of spoken Japanese words has been collected for use in designing and evaluating algorithms for automatic speech recognition, and part of the database has been distributed among the members of the committee.
Abstract: A database of spoken Japanese words has been collected for use in designing and evaluating algorithms for automatic speech recognition. The database is composed of 323 words. It is a special feature of this database that all samples are uttered four times by each speaker (i.e. four tokens per word). Data from 75 male and 75 female speakers were collected at 15 recording sites. Speaker data include sex, age, height, etc. Fifteen research institutions and private enterprises engaged in speech research and development have taken part in the data collection. This is the result of four years' effort by a committee supported by JEIDA (Japan Electronic Industry Development Association). Part of the database has been distributed among the members of the committee.

Journal ArticleDOI
01 Apr 1986
TL;DR: The algorithms proposed here are composed of simple image processing; experiments show they work well, and future progress in image processing will make real-time implementation possible.
Abstract: Though technology in speech recognition has progressed recently, Automatic Speech Recognition (ASR) is vulnerable to noise. Lip information is thought to be useful for speech recognition in noisy situations, such as in a factory or in a car. This paper describes speech recognition enhancement by lip information. Two types of usage are dealt with. One is the detection of the start and stop of speech from lip information; this is the simplest usage. The other is lip-pattern recognition, which is used for speech recognition together with sound information. Algorithms for both usages are proposed, and the experimental system shows they work well. The algorithms proposed here are composed of simple image processing. Future progress in image processing will make it possible to realize them in real time.

Journal ArticleDOI
TL;DR: The state of the art in speech coding, text-to-speech synthesis, speech recognition, and speaker recognition is discussed, with a focus on solving the problem of continuous speech recognition for large vocabularies and verifying talkers' identities from a limited amount of spoken text.
Abstract: This paper presents an overview of the current activities in speech research. We will discuss the state of the art in speech coding, text-to-speech synthesis, speech recognition, and speaker recognition. In the speech coding area, current algorithms perform well at bit rates down to 9.6 kb/s, and the research is directed at bringing the rate for high-quality speech coding down to 2.4 kb/s. In text-to-speech synthesis, what we currently are able to produce is very intelligible but not yet completely natural. Current research aims at providing higher quality and intelligibility to the synthetic speech that these systems produce. Finally, today's systems for speech and speaker recognition provide excellent performance on limited tasks; i.e., limited vocabulary, modest syntax, small talker populations, constrained inputs, etc. Current research is directed at solving the problem of continuous speech recognition for large vocabularies, and at verifying talkers' identities from a limited amount of spoken text.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A framework for developing a phonetically based speech recognition system is discussed; the recognition task is the class of sounds known as the semivowels.
Abstract: A phonetically based approach to speech recognition uses speech specific knowledge obtained from phonotactics, phonology and acoustic phonetics to capture relevant phonetic information. Thus, a recognition system based on this approach can make broad classifications as well as detailed phonetic distinctions. This paper discusses a framework for developing a phonetically based recognition system. The recognition task is the class of sounds known as the semivowels. The recognition results reported, though incomplete, are encouraging.

Journal ArticleDOI
TL;DR: In this paper, the effect of linear predictive coding (LPC) on the recognition of previously unfamiliar speakers was investigated. And the results showed that the more distinctive male speakers and the females were better recognized than the less distinctive males for unprocessed speech.
Abstract: The effect of narrow‐band digital processing, using a linear predictive coding (LPC) algorithm at 2400 bits/s, on the recognition of previously unfamiliar speakers was investigated. In two experiments, rated voice distinctiveness was used to select three sets of five speakers (two sets of males and one set of females). The more distinctive male speakers and the females were better recognized than the less distinctive males for unprocessed speech. With LPC processed speech, there were large losses in speaker recognition for the more distinctive males and the females, whereas the less distinctive males showed no recognition loss. This interaction is discouraging to prospects for developing practical procedures for comparing speaker recognition over various voice communication systems.



Proceedings ArticleDOI
01 Apr 1986
TL;DR: Four unsupervised speaker adaptation methods for vowel templates are described and evaluated and show that these methods work well and that the top-down approach is better than the bottom-up one.
Abstract: Four unsupervised speaker adaptation methods for vowel templates are described and evaluated. There are two approaches to automatically obtaining information on vowel classification and location: one is based on feature parameters and the other on the results of input speech recognition. Here, the former is referred to as a bottom-up approach and the latter as a top-down one. Two adaptation techniques are also presented: the first is template selection from pre-stored sets and the second is template modification. Combining these approaches and techniques, four adaptation methods are derived. These four methods are evaluated in terms of spectral distortion and word recognition rate. They are then compared considering performance, required calculation, rate of correctly used vowels, and type of input speech. The results show that these methods work well and that the top-down approach is better than the bottom-up one. They also show that the modification technique is better than the selection technique.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper uses an extension of the well-known hidden Markov models in order to model more accurately the properties of the phonetic labeling stage, and presents experimental results which were computed speaker-independently.
Abstract: This paper addresses the problem of generating word hypotheses in continuous German speech. It uses an extension of the well-known hidden Markov models in order to model more accurately the properties of the phonetic labeling stage. A powerful scoring function is derived. Experimental results are presented which were computed speaker-independently.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: An automatic incremental network generation algorithm for speaker independent isolated word recognition is described, which is possible to add new words to the network at any time; because of its complete freedom from human intervention, it is language and vocabulary independent.
Abstract: It is well known that a network representation of templates has many advantages; however, generating a network by hand is an impossible task for a large vocabulary database. This paper describes an automatic incremental network generation algorithm for speaker independent isolated word recognition. Because of its incremental nature, it is possible to add new words to the network at any time; because of its complete freedom from human intervention, it is language and vocabulary independent. By applying this technique to speaker-independent recognition, a recognition accuracy of 99% was obtained for the digits and 91.92% for the alphabet.


Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper shows how a semi-automatic design of a speech recognition system can be done as a planning activity and results in the recognition of connectedly spoken letters pronounced by 70 new speakers are presented.
Abstract: This paper shows how a semi-automatic design of a speech recognition system can be done as a planning activity. Recognition performances are used for deciding plan refinement. Inductive learning is performed for setting action preconditions. Experimental results in the recognition of connectedly spoken letters pronounced by 70 new speakers are presented.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: A learning method in which the syllable templates are automatically optimized, based on a speaker-dependent recognition system, showed an average syllable recognition accuracy of 71.0% without and 82.5% with automatic learning.
Abstract: In this speaker-dependent recognition system, recognition is based on syllable template matching and each syllable has several templates. In the initial training for each speaker, 590 templates for 111 syllables are made, each including various contextual variations. The authors studied a learning method in which the syllable templates are automatically optimized. It is judged whether or not an input syllable should be learned according to the recent recognition condition. If it should be learned, the input syllable pattern replaces the template that contributes the least to recognition in the templates segmented from the same context and in the same syllable category. Automatic learning was evaluated on recognition of speech data obtained by reading Japanese sentences at a rate of about 4 to 5 syllables per second. The results over eight speakers showed an average syllable recognition accuracy of 71.0% without and 82.5% with automatic learning. Further, by increasing the maximum number of templates to 1024, it rose to 84.8%.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: This work defines novel distance measures for speech recognition which are specifically designed to differentiate between confusable speech sounds.
Abstract: This work defines novel distance measures for speech recognition which: 1. model the statistical interaction between adjacent speech frames, 2. model the statistical characteristics of different speech sounds individually, and 3. are specifically designed to differentiate between confusable speech sounds. Speaker independent recognition tests performed on the Texas Instruments multi-dialect isolated digit data base give substitution rates as low as 0.6% with a vocabulary of 11 digits.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: The possibility of using an automatic speech recognition system as a front end to a computer for Chinese-character processing is explored, and some preliminary experiments are reported which indicate that the syllable inventory of spoken Standard Chinese belongs in the category of "difficult" vocabularies.
Abstract: The possibility of using an automatic speech recognition system as a front end to a computer for Chinese-character processing is explored in this paper. Aspects of the Chinese language are discussed in relation to the capabilities of current state-of-the-art isolated-word recognition systems. Some preliminary experiments are reported which indicate that the syllable inventory of spoken Standard Chinese belongs in the category of "difficult" vocabularies. The vocabulary size is of the order of 350 syllables with a large number of similar word pairs. Recognition rates using linear predictive coding, Itakura distance measures and dynamic time warping are of the order of 25-30%.
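The dynamic time warping behind these recognition rates can be sketched as the classic cumulative-distance recursion. For simplicity this sketch uses a Euclidean frame distance in place of the Itakura distance the paper used:

```python
import numpy as np

def dtw_distance(a, b, dist=lambda x, y: np.linalg.norm(x - y)):
    # Dynamic time warping: minimum cumulative frame distance over all
    # monotonic alignments of sequence a against sequence b.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Best of: insertion, deletion, or diagonal match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Recognition then reduces to computing this distance between the input utterance and every syllable template and choosing the nearest template.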

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A system for automatic recognition of continuous speech is proposed, utilizing both words and syllables as units of recognition, and using both local and global coherence in reducing word candidates as well as in modifying or renewing word and syllable templates.
Abstract: Critical re-examination of the premises of conventional systems for continuous speech recognition has led to a study of human processes of speech perception. It was found that deletion of a syllable is often not noticed by a listener, suggesting that the basic unit of continuous speech perception is larger than the syllable. Further experiments are described and discussed on the size of actual units of perception, effects of syntactic roles, unit organization and access to the mental lexicon, effects of context, as well as effects of repeated listening. A system for automatic recognition of continuous speech is then proposed on the basis of these results and considerations, utilizing both words and syllables as units of recognition, and using both local and global coherence in reducing word candidates as well as in modifying or renewing word and syllable templates.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper describes a speaker-independent isolated word recognition algorithm for telephone voice and its recognition performance, which consists of dynamic time warping and statistical word discrimination.
Abstract: This paper describes a speaker-independent isolated word recognition algorithm for telephone voice and its recognition performance. The recognition algorithm consists of two processes: dynamic time warping and statistical word discrimination. In the first process, input speech is compared with each word template using the dynamic time warping technique. Multiple word templates are used to deal with speech variations among speakers, where each word template is represented by a sequence of phoneme-like templates. To attain high recognition ability, a new technique for generating word templates is proposed. In the second process, statistical word discrimination is carried out for word candidates which have relatively low reliability in the first process. Discrimination functions are calculated based on statistics of transition tendencies of speech characteristics between adjacent frames, and the final word decision is made. The system was trained using utterances from 1305 speakers and tested with utterances from 259 speakers. An average recognition rate of 96.5% was obtained for a 16-word Japanese vocabulary set.