
Showing papers on "Word error rate published in 1986"


Journal ArticleDOI
TL;DR: This paper proposes a new isolated word recognition technique based on a combination of instantaneous and dynamic features of the speech spectrum that is shown to be highly effective in speaker-independent speech recognition.
Abstract: This paper proposes a new isolated word recognition technique based on a combination of instantaneous and dynamic features of the speech spectrum. This technique is shown to be highly effective in speaker-independent speech recognition. Spoken utterances are represented by time sequences of cepstrum coefficients and energy. Regression coefficients for these time functions are extracted for every frame over an approximately 50 ms period. Time functions of regression coefficients extracted for cepstrum and energy are combined with time functions of the original cepstrum coefficients, and used with a staggered array DP matching algorithm to compare multiple templates and input speech. Speaker-independent isolated word recognition experiments using a vocabulary of 100 Japanese city names indicate that a recognition error rate of 2.4 percent can be obtained with this method. Using only the original cepstrum coefficients the error rate is 6.2 percent.
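As a rough illustration of the instantaneous-plus-dynamic feature idea (not the authors' code), the sketch below computes least-squares regression ("delta") coefficients over a sliding window and appends them to the cepstrum; the ±2-frame window and the implied 10 ms frame shift are assumptions.

```python
import numpy as np

def delta_features(cepstra, k=2):
    """cepstra: (T, D) array of cepstrum (or energy) values, one row per frame.
    Returns the least-squares slope of each coefficient over a 2k+1 frame window."""
    T, D = cepstra.shape
    padded = np.pad(cepstra, ((k, k), (0, 0)), mode="edge")
    denom = 2.0 * sum(i * i for i in range(1, k + 1))
    deltas = np.zeros_like(cepstra)
    for t in range(T):
        acc = np.zeros(D)
        for i in range(1, k + 1):
            acc += i * (padded[t + k + i] - padded[t + k - i])
        deltas[t] = acc / denom
    return deltas

# Combined observation: instantaneous cepstrum plus its regression coefficients,
# as would then be fed to the DP matching stage.
cep = np.random.randn(100, 10)            # 100 frames, 10 cepstral coefficients (dummy data)
obs = np.hstack([cep, delta_features(cep)])
```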

812 citations


Journal ArticleDOI
TL;DR: This work provides simple estimates for the downward bias of the apparent error rate of logistic regression on binary data, with error rates measured by the proportion of misclassified cases.
Abstract: A regression model is fitted to an observed set of data. How accurate is the model for predicting future observations? The apparent error rate tends to underestimate the true error rate because the data have been used twice, both to fit the model and to check its accuracy. We provide simple estimates for the downward bias of the apparent error rate. The theory applies to general exponential family linear models and general measures of prediction error. Special attention is given to the case of logistic regression on binary data, with error rates measured by the proportion of misclassified cases. Several connected ideas are compared: Mallows's Cp , cross-validation, generalized cross-validation, the bootstrap, and Akaike's information criterion.
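A minimal sketch of the bootstrap correction for the optimism (downward bias) of the apparent error rate in logistic regression, in the spirit of the paper; the use of scikit-learn's LogisticRegression and the number of bootstrap replicates are conveniences assumed here, not details from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def misclassification_rate(model, X, y):
    return np.mean(model.predict(X) != y)

def bias_corrected_error(X, y, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    full_fit = LogisticRegression().fit(X, y)
    apparent = misclassification_rate(full_fit, X, y)   # optimistic: data used twice
    optimism, n = 0.0, len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                     # resample cases with replacement
        fit_b = LogisticRegression().fit(X[idx], y[idx])
        # optimism = error on the original data minus error on the bootstrap sample
        optimism += (misclassification_rate(fit_b, X, y)
                     - misclassification_rate(fit_b, X[idx], y[idx]))
    return apparent + optimism / n_boot                 # bias-corrected estimate of true error

X = np.random.randn(200, 3)
y = (X @ np.array([1.0, -0.5, 0.2]) + 0.3 * np.random.randn(200) > 0).astype(int)
print(bias_corrected_error(X, y))
```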

630 citations


PatentDOI
TL;DR: In this paper, a first speech recognition method receives an acoustic description of an utterance to be recognized and scores a portion of that description against each of a plurality of cluster models representing similar sounds from different words.
Abstract: A first speech recognition method receives an acoustic description of an utterance to be recognized and scores a portion of that description against each of a plurality of cluster models representing similar sounds from different words. The resulting score for each cluster is used to calculate a word score for each word represented by that cluster. Preferably these word scores are used to prefilter vocabulary words, and the description of the utterance includes a succession of acoustic descriptions which are compared by linear time alignment against a succession of acoustic models. A second speech recognition method is also provided which matches an acoustic model with each of a succession of acoustic descriptions of an utterance to be recognized. Each of these models has a probability score for each vocabulary word. The probability scores for each word associated with the matching acoustic models are combined to form a total score for that word. The preferred speech recognition method calculates two separate word scores for each currently active vocabulary word from a common succession of sounds. Preferably the first score is calculated by a time alignment method, while the second score is calculated by a time-independent method. Preferably this calculation of two separate word scores is used in one of multiple word-selecting phases of a recognition process, such as in the prefiltering phase.

194 citations


PatentDOI
TL;DR: A speech recognition system which can perform multiple recognition passes on each word, which may also be used as an interactive transcription system for prerecorded speech and can operate on either discrete utterances or continuous speech.
Abstract: A speech recognition system which can perform multiple recognition passes on each word. If the recognizer is correct in its first pass, the operator may abort later passes by either pressing a key or speaking the next word. Otherwise, the operator may either wait for a second recognition pass to be performed against a larger vocabulary, or may specify one or more initial letters causing the second recognition pass to be performed against a vocabulary substantially restricted to words starting with those initial letters. Each time the user adds an additional letter to the initial string, any previous recognition is aborted and the re-recognition process is started anew with the new string. If the user types a control character after the initial string, then the string itself is used as the output of the recognizer. In one embodiment, a language model limits a relatively small vocabulary used in the first pass to the words most likely to occur given the language context of the dictated word. The system may also be used as an interactive transcription system for prerecorded speech and can operate on either discrete utterances or continuous speech. When used with prerecorded speech, the system displays the best scoring words of a recognition to the user, and, when the user chooses a desired word from such a display, the system employs the portion of prerecorded speech matched against the chosen word to help determine where in that prerecorded speech the system should look for the next word to recognize.

171 citations


Journal ArticleDOI
TL;DR: Different types of error rates are described and the state of the art of error rate estimation at the time of Toussaint's survey is briefly summarised; the two major advances, namely the bootstrap and average conditional error rate estimation methods, together with their extensions, are described in detail.

141 citations


Patent
Robert W. Downes, Wilson E. Smith
09 Oct 1986
TL;DR: In this article, the ring error monitor analyzes the report and calculates and stores weighted error counts for stations in an error domain, and compared with a threshold value normal for a communications network, operating at acceptable error rate.
Abstract: Stations of a communications network maintain a set of counters which measure the frequency of occurrence of soft errors in said network. Periodically, each station generates and transmits an error report containing the error counts to a ring error monitor provided in one of the stations. The ring error monitor analyzes the report and calculates and stores weighted error counts for stations in an error domain. The stored error counts are integrated over a selected time interval and compared with a threshold value normal for a communications network operating at an acceptable error rate. The comparison sets error flags if the limits are exceeded, indicating possible future station failures.
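A purely illustrative sketch of the monitoring logic described above, with hypothetical counter names, weights, and threshold: weight the reported soft-error counts, integrate them per station over an interval, and flag stations that exceed the limit.

```python
WEIGHTS = {"line_error": 2, "burst_error": 4, "lost_frame": 1}   # assumed weights
THRESHOLD = 64                                                    # assumed per-interval limit

def weighted_count(report):
    """report: dict mapping error-counter names to counts for one station."""
    return sum(WEIGHTS.get(name, 1) * count for name, count in report.items())

class RingErrorMonitor:
    def __init__(self):
        self.integrated = {}          # station address -> integrated weighted count

    def accept_report(self, station, report):
        self.integrated[station] = self.integrated.get(station, 0) + weighted_count(report)

    def end_of_interval(self):
        flagged = [s for s, c in self.integrated.items() if c > THRESHOLD]
        self.integrated.clear()       # start a new integration interval
        return flagged                # stations whose error rate suggests impending failure

monitor = RingErrorMonitor()
monitor.accept_report("station-07", {"line_error": 30, "burst_error": 5})
print(monitor.end_of_interval())
```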

106 citations


Proceedings ArticleDOI
07 Apr 1986
TL;DR: The development and application of a new voicing algorithm used in the 2400 bit per second U.S. Government's Enhanced Linear Predictive Coder (LPC-10E) that improves upon other 2400 bps LPC voicing algorithms by providing higher quality synthesized speech.
Abstract: This paper describes the development and application of a new voicing algorithm used in the 2400 bit per second U.S. Government's Enhanced Linear Predictive Coder (LPC-10E). Correct voicing is crucial to perceived quality and naturalness of LPC systems and therefore to user acceptance of LPC systems. This new voicing algorithm uses a smoothed adaptive linear discriminator to classify the signal as voiced or unvoiced speech. The classifier was determined using Fisher's method of linear discriminant analysis. The voicing decision smoother is a modified median smoother that uses both the linear discriminant and speech onsets to determine its smoothing. The voicing classifier adapts to various acoustic noise levels and features a powerful new set of signal measurements: biased zero crossing rate, energy measures, reflection coefficients, and prediction gains. The LPC-10E voicing algorithm improves upon other 2400 bps LPC voicing algorithms by providing higher quality synthesized speech. Higher quality is due to halving of the error rate and graceful degradation in the presence of acoustic noise.
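The sketch below illustrates the two ingredients named above, Fisher's linear discriminant and a median smoother over frame-level decisions, on made-up features; the feature set, window width, and training data are assumptions, and this is not the LPC-10E implementation.

```python
import numpy as np

def fisher_discriminant(X_voiced, X_unvoiced):
    """Return weights w and threshold b of Fisher's linear discriminant."""
    mu_v, mu_u = X_voiced.mean(axis=0), X_unvoiced.mean(axis=0)
    Sw = np.cov(X_voiced, rowvar=False) + np.cov(X_unvoiced, rowvar=False)
    w = np.linalg.solve(Sw, mu_v - mu_u)
    b = 0.5 * w @ (mu_v + mu_u)          # midpoint between projected class means
    return w, b

def classify(frames, w, b):
    return (frames @ w > b).astype(int)  # 1 = voiced, 0 = unvoiced

def median_smooth(decisions, width=5):
    k = width // 2
    padded = np.pad(decisions, k, mode="edge")
    return np.array([int(np.median(padded[i:i + width])) for i in range(len(decisions))])

# Dummy training frames; the columns might stand for zero-crossing rate, energy,
# first reflection coefficient, and prediction gain (an assumed feature set).
rng = np.random.default_rng(0)
Xv = rng.normal([0.1, 5.0, 0.9, 10.0], 0.5, size=(200, 4))
Xu = rng.normal([0.6, 1.0, 0.1, 2.0], 0.5, size=(200, 4))
w, b = fisher_discriminant(Xv, Xu)
test = np.vstack([Xv[:20], Xu[:20]])
print(median_smooth(classify(test, w, b)))
```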

102 citations


Proceedings ArticleDOI
07 Apr 1986
TL;DR: Speaker-independent word recognition experiments using time functions of the dynamics-emphasized cepstrum and the polynomial coefficient for energy indicate that the error rate can be greatly reduced by this method.
Abstract: A new speech analysis technique applicable to speech recognition is proposed considering the auditory mechanism of speech perception which emphasizes spectral dynamics and which compensates for the spectral undershoot associated with coarticulation. A speech wave is represented by the LPC cepstrum and logarithmic energy sequences, and the time sequences over short periods are expanded by the first- and second-order polynomial functions at every frame period. The dynamics of the cepstrum sequences are then emphasized by the linear combination of their polynomial expansion coefficients, that is, derivatives, and their instantaneous values. Speaker-independent word recognition experiments using time functions of the dynamics-emphasized cepstrum and the polynomial coefficient for energy indicate that the error rate can be greatly reduced by this method.
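As a hedged illustration of the polynomial-expansion step (not the authors' code), the sketch below fits a second-order polynomial to each coefficient's trajectory over a short window and combines the derivative term with the instantaneous value; the window length and mixing weight are assumptions.

```python
import numpy as np

def polynomial_expansion(track, half_width=4):
    """track: (T,) time sequence of one cepstral coefficient or log energy.
    Returns a (T, 2) array of first- and second-order expansion coefficients."""
    T = len(track)
    n = np.arange(-half_width, half_width + 1)
    padded = np.pad(track, half_width, mode="edge")
    out = np.zeros((T, 2))
    for t in range(T):
        # polyfit returns highest order first: [a2, a1, a0] for a2*n^2 + a1*n + a0
        a2, a1, _ = np.polyfit(n, padded[t:t + 2 * half_width + 1], deg=2)
        out[t] = (a1, a2)
    return out

cep_track = np.sin(np.linspace(0, 3, 80))               # dummy trajectory of one coefficient
expanded = polynomial_expansion(cep_track)
dynamics_emphasized = cep_track + 3.0 * expanded[:, 0]   # assumed mixing weight for emphasis
```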

99 citations


PatentDOI
TL;DR: In this article, a continuous speech recognition system with a speech processor and a word recognition computer subsystem is described, which is characterized by an element for developing a graph for confluent links between confluent nodes.
Abstract: A continuous speech recognition system having a speech processor and a word recognition computer subsystem, characterized by an element for developing a graph of confluent links between confluent nodes; an element for developing a graph of boundary links between adjacent words; an element for storing an inventory of confluent links and boundary links as a coding inventory; an element for converting an unknown utterance into an encoded sequence of confluent links and boundary links corresponding to recognition sequences stored in the word recognition subsystem recognition vocabulary for speech recognition. The invention also includes a method for achieving continuous speech recognition by characterizing speech as a sequence of confluent links which are matched with candidate words. The invention also applies to isolated word speech recognition as with continuous speech recognition, except that in such case there are no boundary links.

68 citations


Journal ArticleDOI
TL;DR: The derivation and analysis of optimum multiuser detectors for additive-rate and additive-light Poisson multiple-access channels are studied and a particular case of these results, namely, the single-user finite-length intersymbol interference problem, solves the error rate analysis of optimal direct-detection systems for dispersive optical fibers.
Abstract: The derivation and analysis of optimum multiuser detectors for additive-rate and additive-light Poisson multiple-access channels are studied. The observed point process models the output of an ideal photodetector illuminated by several synchronous or asynchronous users who modulate coherent light of the same frequency. Dynamic programming-based decision rules for the asynchronous multiple-access channel exhibit the same computational complexity as their synchronous counterparts and are shown to be optimum under the criteria of minimum error probability and maximum likelihood sequence detection. Upper and lower bounds on the minimum uncoded bit error rate achievable with arbitrary signal constellations are obtained in terms of the error probability of binary hypothesis testing problems. A particular case of these results, namely, the single-user finite-length intersymbol interference problem, solves the error rate analysis of optimum direct-detection systems for dispersive optical fibers.

62 citations


Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper describes the results of the work in designing a system for large-vocabulary word recognition of continuous speech, and generalizes the use of context-dependent Hidden Markov Models of phonemes to take into account word-dependent coarticulatory effects.
Abstract: This paper describes the results of our work in designing a system for large-vocabulary word recognition of continuous speech. We generalize the use of context-dependent Hidden Markov Models (HMM) of phonemes to take into account word-dependent coarticulatory effects. Robustness is assured by smoothing the detailed word-dependent models with less detailed but more robust models. We describe training and recognition algorithms for HMMs of phonemes-in-context. On a task with a 334-word vocabulary and no grammar (i.e., a branching factor of 334), in speaker-dependent mode, we show an average reduction in word error rate from 24% using context-independent phoneme models, to 10% when using robust context-dependent phoneme models.
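A toy sketch of the smoothing idea: interpolate a detailed word-dependent (or context-dependent) output distribution with a robust context-independent one, trusting the detailed model more as it sees more training data. The count-based interpolation weight is an assumed scheme, not necessarily the paper's exact method.

```python
import numpy as np

def smoothed_distribution(counts_detailed, dist_context_independent, k=50.0):
    """counts_detailed: observation counts for one state of the detailed model.
    k: assumed 'equivalent sample size' controlling how quickly the detailed model is trusted."""
    n = counts_detailed.sum()
    lam = n / (n + k)                                   # more data -> trust the detailed model more
    dist_detailed = counts_detailed / n if n > 0 else np.zeros_like(counts_detailed)
    return lam * dist_detailed + (1.0 - lam) * dist_context_independent

counts = np.array([12.0, 3.0, 0.0, 1.0])                # sparse counts from a rare context
backoff = np.array([0.4, 0.3, 0.2, 0.1])                # robust context-independent distribution
print(smoothed_distribution(counts, backoff))
```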

PatentDOI
TL;DR: In this article, the authors describe combining (322) contiguous acoustically similar frames derived from the previous input word or words into representative frames to form a corresponding reduced word template.
Abstract: Arrangement and method for processing speech information in a speech recognition system. The invention is best employed in a system where the speech information is depicted as words, each word representing a sequence of frames, and where the recognition system has means for comparing present input speech to a word template stored in template memory (160) and derived from one or more previous input words. The invention describes combining (322) contiguous acoustically similar frames derived from the previous input word or words into representative frames to form a corresponding reduced word template, storing the reduced word template in template memory (160) in an efficient manner, and comparing (326) frames of the present input speech to the representative frames of the reduced word template according to the number of frames combined in the representative frames of the reduced word template. In doing so, a measure of similarity between the present input speech and the word template is generated.
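The following sketch illustrates the frame-reduction idea: runs of contiguous, acoustically similar frames are merged into one representative frame together with a count of how many frames it stands for, which the matching stage can then use as a weight. The Euclidean distance and the threshold are assumptions.

```python
import numpy as np

def reduce_template(frames, threshold=1.0):
    """frames: (T, D) word template. Returns a list of (representative_frame, n_merged)."""
    reduced = []
    mean, count = np.array(frames[0], dtype=float), 1
    for frame in frames[1:]:
        if np.linalg.norm(frame - mean) < threshold:    # acoustically similar: fold into the run
            mean = (mean * count + frame) / (count + 1)
            count += 1
        else:                                           # dissimilar: close the run, start a new one
            reduced.append((mean, count))
            mean, count = np.array(frame, dtype=float), 1
    reduced.append((mean, count))
    return reduced

template = np.vstack([np.tile([1.0, 1.0], (5, 1)), np.tile([4.0, 0.0], (3, 1))])
for rep, n in reduce_template(template):
    print(rep, "stands for", n, "original frames")
```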

Journal ArticleDOI
TL;DR: A survey of 2000 voice identification comparisons made by Federal Bureau of Investigation (FBI) examiners was used to determine the observed error rate of the spectrographic voice identification technique under actual forensic conditions.
Abstract: A survey of 2000 voice identification comparisons made by Federal Bureau of Investigation (FBI) examiners was used to determine the observed error rate of the spectrographic voice identification technique under actual forensic conditions. The qualifications of the examiners and the comparison procedures are set forth. The survey revealed that decisions were made in 34.8% of the comparisons with a 0.31% false identification error rate and a 0.53% false elimination error rate. These error rates are expected to represent the minimum error rates under actual forensic conditions.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: It is found that measurements of speech spectral envelopes are prone to statistical variations due to window position fluctuations, excitation interference, measurement noise, etc. and may possess spurious characteristics because of analysis model constraints and that a statistical model can be established to predict the variances of the cepstral coefficient measurements.
Abstract: In this paper, we extend the interpretation of distortion measures, based upon the observation that measurements of speech spectral envelopes (as normally obtained from analysis procedures) are prone to statistical variations due to window position fluctuations, excitation interference, measurement noise, etc. and may possess spurious characteristics because of analysis model constraints. We have found that these undesirable spectral measurement variations can be controlled (i.e. reduced in the level of variation) through proper cepstral processing and that a statistical model can be established to predict the variances of the cepstral coefficient measurements. The findings lead to the use of a bandpass "liftering" process aimed at reducing the variability of the statistical components of spectral measurements. We have applied this liftering process to various speech recognition problems; in particular, vowel recognition and isolated word recognition. With the liftering process, we have been able to achieve an average digit error rate of 1%, which is about half of the previously reported best results, with dynamic time warping in a speaker-independent isolated digit test.
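The sketch below shows a bandpass lifter of the kind discussed, weighting cepstral coefficient c[n] by 1 + (L/2)·sin(πn/L) so that both the low-order coefficients (sensitive to spectral tilt) and the high-order ones (noisy) are de-emphasized; treating L as the number of retained coefficients is an assumption.

```python
import numpy as np

def bandpass_lifter(cepstra, L=None):
    """cepstra: (T, D) frames of cepstral coefficients c[1..D]."""
    D = cepstra.shape[1]
    L = D if L is None else L
    n = np.arange(1, D + 1)
    w = 1.0 + (L / 2.0) * np.sin(np.pi * n / L)   # raised-sine weighting of each quefrency
    return cepstra * w

frames = np.random.randn(50, 12)          # dummy cepstra: 50 frames, 12 coefficients
liftered = bandpass_lifter(frames)        # use these in the distortion / DTW computation
```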

Proceedings ArticleDOI
07 Apr 1986
TL;DR: In a series of experiments on isolated-word recognition, hidden Markov models with multivariate Gaussian output densities, with the best models obtained at offsets of 75 or 90 ms, improved on previous algorithms.
Abstract: Hidden Markov modeling has become an increasingly popular technique in automatic speech recognition. Recently, attention has been focused on the application of these models to talker-independent, isolated-word recognition. Initial results using models with discrete output densities for isolated-digit recognition were later improved using models based on continuous output densities. In a series of experiments on isolated-word recognition, we applied hidden Markov models with multivariate Gaussian output densities to the problem. Speech data was represented by feature vectors consisting of eight log area ratios and the log LPC error. A weak measure of vocal-tract dynamics was included in the observations by appending to the feature vector observed at time t, the vector observed at time t-δ, for some fixed offset δ. The best models were obtained with offsets of 75 or 90 msecs. When a comparison is made on a common data base, the resulting error rate of 0.2% for isolated-digit recognition improves on previous algorithms.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: It is shown that in speaker-dependent recognition of the alpha-numeric vocabulary, the PLP method in VQ-based ASR yields recognition scores similar to those of the standard ASR system.
Abstract: The perceptually based linear predictive (PLP) speech analysis method is applied to isolated word automatic speech recognition (ASR). Low dimensionality of the PLP analysis vector, which is otherwise identical in form to the standard linear predictive (LP) analysis vector, allows for computational and storage savings in ASR. We show that in speaker-dependent recognition of the alpha-numeric vocabulary, the PLP method in VQ-based ASR yields recognition scores similar to those of the standard ASR system. The main focus of the paper is on cross-speaker ASR. We demonstrate in experiments with vowel centroids of two male and one female speakers that PLP speech representation is more consistent with the underlying phonetic information than the standard LP method. Conclusions from the experiments are confirmed by superior performance of the PLP method in cross-speaker isolated word recognition.

Patent
27 Mar 1986
TL;DR: In this article, a method for determining probability values for probability items by biasing at least some of the stored values to enhance the likelihood that outputs generated in response to communication of a known word input are produced by the baseform for the known word relative to the respective likelihood of the generated outputs being produced by the baseform for at least one other word.
Abstract: In a word, or speech, recognition system for decoding a vocabulary word from outputs selected from an alphabet of outputs in response to a communicated word input wherein each word in the vocabulary is represented by a baseform of at least one probabilistic finite state model and wherein each probabilistic model has transition probability items and output probability items and wherein a value is stored for each of at least some probability items, the present invention relates to apparatus and method for determining probability values for probability items by biasing at least some of the stored values to enhance the likelihood that outputs generated in response to communication of a known word input are produced by the baseform for the known word relative to the respective likelihood of the generated outputs being produced by the baseform for at least one other word. Specifically, the current values of counts (from which probability items are derived) are adjusted by uttering a known word and determining how often probability events occur relative to (a) the model corresponding to the known uttered "correct" word and (b) the model of at least one other "incorrect" word. The current count values are increased based on the event occurrences relating to the correct word and are reduced based on the event occurrences relating to the incorrect word or words.
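A toy sketch of the count-adjustment idea: after a known utterance, event counts accumulated against the correct word's model are raised and those accumulated against a competing incorrect word's model are lowered. The step size, flooring, and data structures are assumptions, not the patent's specification.

```python
def adjust_counts(counts_correct, counts_incorrect, events_correct, events_incorrect, step=1.0):
    """counts_*: dict mapping a probability-item id to its current count.
    events_*: dict mapping the same ids to how often each event occurred when the
    known utterance was aligned against the correct / incorrect word model."""
    for item, occurrences in events_correct.items():
        counts_correct[item] = counts_correct.get(item, 0.0) + step * occurrences
    for item, occurrences in events_incorrect.items():
        # keep counts positive so the derived probabilities stay well defined
        counts_incorrect[item] = max(counts_incorrect.get(item, 0.0) - step * occurrences, 0.01)
    return counts_correct, counts_incorrect

correct = {"t1->t2": 5.0}
wrong = {"t1->t2": 5.0}
print(adjust_counts(correct, wrong, {"t1->t2": 2}, {"t1->t2": 1}))
```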

Journal ArticleDOI
TL;DR: Attention is focussed on the formation of improved estimates, mainly through appropriate bias correction of the apparent error rate, and the role of the bootstrap, a computer-based methodology, is highlighted.
Abstract: The problem of estimating the error rates of a sample-based rule on the basis of the same sample used in its construction is considered. The apparent error rate is an obvious nonparametric estimate of the conditional error rate of a sample rule, but unfortunately it provides too optimistic an assessment. Attention is focussed on the formation of improved estimates, mainly through appropriate bias correction of the apparent error rate. In this respect the role of the bootstrap, a computer-based methodology, is highlighted.

Journal ArticleDOI
TL;DR: This paper attempts to improve current practice by providing an introduction to the essential quantities required for performing a power analysis (sample size, effect size, type 1 and type 2 error rates), and shows how to modify these tables to perform power analyses for multiple comparisons in univariate and some multivariate designs.
Abstract: Statistical power is neglected in much psychiatric research, with the consequence that many studies do not provide a reasonable chance of detecting differences between groups if they exist in the population. This paper attempts to improve current practice by providing an introduction to the essential quantities required for performing a power analysis (sample size, effect size, type 1 and type 2 error rates). We provide simplified tables for estimating the sample size required to detect a specified size of effect with a type 1 error rate of α and a type 2 error rate of β, and for estimating the power provided by a given sample size for detecting a specified size of effect with a type 1 error rate of α. We show how to modify these tables to perform power analyses for multiple comparisons in univariate and some multivariate designs. Power analyses for each of these types of design are illustrated by examples.
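As a worked example of the basic calculation behind such tables, the snippet below gives the per-group sample size for a two-sided, two-sample comparison under the usual normal approximation; the effect size and error rates are illustrative inputs, not values from the paper.

```python
from scipy.stats import norm

def n_per_group(d, alpha=0.05, beta=0.20):
    """Sample size per group to detect standardized effect size d with
    two-sided type 1 error alpha and type 2 error beta (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(1 - beta)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

print(round(n_per_group(0.5)))   # medium effect: about 63 per group for 80% power
```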

Journal ArticleDOI
Pierre A. Devijver
TL;DR: It is stated, without proof, that the total fraction of samples discarded by the Multiedit algorithm is bounded from above by 2E1, where E1 is the 1-NNR error rate for the initial distributions.

Journal ArticleDOI
TL;DR: This paper applies two parallel processes to the case of 9600 bit/s modems and describes a complete discrimination algorithm which is fit for microprocessor technology; simulation results indicate a very low error rate.
Abstract: This paper concerns the design of a speech/data discriminator with the speed and accuracy required for statistical multiplexing of speech and high-speed baseband data on a single telephone line. The structure proposed for the discriminator combines the information from two parallel processes. The first process, even though slow, accurately recognizes the speech/data nature of the signal. This nature is determined by statistical pattern classification applied to simple zero-crossing-type parameters extracted from the signal filtered in a PCM representation. The second process has temporal accuracy in detecting transition events, from speech to data and vice versa. This detection is provided by special parameters which supply tentative transition markers. This paper applies these processes to the case of 9600 bit/s modems and describes a complete discrimination algorithm which is fit for microprocessor technology. The simulation results provided indicate a very low error rate and demonstrate that transitions are detected exactly or within one or two sampling intervals.

Book ChapterDOI
TL;DR: This chapter presents a system for handwriting recognition that uses a word segmenter and a word recognizer, and discusses the differences between the recognition systems for the Chinese and English writings.
Abstract: Publisher Summary This chapter presents a system for handwriting recognition. The system has two components: a word segmenter and a word recognizer. Different levels of segmentation are required for different handwriting types. Training, updating, and recognition modes of operation are carried out by the system. The chapter also discusses the differences between the recognition systems for the Chinese and English writings. The recognition system uses two procedures. The external segmentation procedure separates words from one another in all writings except spaced discrete characters where it separates characters from one another. The recognition procedure then operates on one word at a time. Within the processing of a word, the character segmentation and recognition steps are combined into an overall scheme. The character recognition results are sorted and combined so that the character sequences having the best cumulative distance scores are obtained as the best word choices. For a particular word choice, the corresponding character segmentation is simply the segment combinations that resulted in the chosen characters.

Journal ArticleDOI
TL;DR: Recommendations for applications based on available evaluations of robust estimators are made, and important unresolved issues are identified.
Abstract: Recent work on robust error rate estimation in classification analysis is summarized. First, the perspective for the error rate estimation problem is established, and the parameters that are referred to as error rates are described. Next, the bases for comparison of error rate estimators are reviewed and a mean-square error criterion recommended. Then several approaches to robust error rate estimation are introduced. Finally, recommendations for applications based on available evaluations of robust estimators are made, and important unresolved issues are identified.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: In view of the automatic recognition of a very large or possibly unlimited vocabulary, it is necessary to choose recognition units that are smaller than word size; syllable templates extracted from words are chosen to take coarticulation effects between syllables into account.
Abstract: In view of the automatic recognition of a very large or possibly unlimited vocabulary, it is necessary to choose recognition units that are smaller than word size. A series of experiments has been carried out to evaluate the use of the syllable as a concatenative unit for large vocabularies. For isolated word recognition from syllable units, a reference pattern was created for each word of the lexicon by concatenating isolated syllable templates. The test utterance is then matched to each word template through the use of dynamic programming. To build the word reference pattern, the syllable templates were adjusted by using the VLTS procedure and down sampling the beginning and end parts of the syllable. The syllable approach was compared to the classical whole word one on the same 10,400-word vocabulary. The storage required for the syllable dictionary is one sixth of that necessary for the whole word dictionary. For a trained speaker, the recognition error rate obtained using the syllable approach was 12% compared to 6% using the whole word approach. This difference may be reduced by using syllable templates extracted from words to take coarticulation effects between syllables into account.
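A sketch of the concatenative scheme described above: a word reference pattern is built by chaining syllable templates and the test utterance is scored against it with dynamic programming. The local distance, path constraints, and dummy templates are assumptions, and the VLTS adjustment step is omitted.

```python
import numpy as np

def concatenate_syllables(syllable_templates, word_syllables):
    """word_syllables: list of syllable names making up the word, e.g. ['to', 'kyo']."""
    return np.vstack([syllable_templates[s] for s in word_syllables])

def dtw_distance(ref, test):
    """Symmetric DTW with Euclidean local distance; returns the normalized path cost."""
    R, T = len(ref), len(test)
    D = np.full((R + 1, T + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, R + 1):
        for j in range(1, T + 1):
            d = np.linalg.norm(ref[i - 1] - test[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[R, T] / (R + T)

syllables = {"to": np.random.randn(12, 10), "kyo": np.random.randn(15, 10)}  # dummy templates
word_template = concatenate_syllables(syllables, ["to", "kyo"])
print(dtw_distance(word_template, np.random.randn(30, 10)))
```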


01 Jan 1986
TL;DR: It is hypothesized that a single feature, such as [± strong], which corresponds to a global tensing of the articulators at word beginning, accounts for a number of apparently unrelated phenomena characteristic of word onset, such as more precise articulation, a higher position of the velum, aspiration and glottalisation phenomena, and devoicing.
Abstract: What is variant and what is invariant when a word is spoken? The first part deals with the definition of the term "word", as a graphic word, a unit of meaning or an acoustic unit. Function words tend to be shorter in duration, lower in fundamental frequency and less intense. Meaningful words typically have one dominant syllable, with an F0 peak, final lengthening and a strengthening of the word-initial syllable. A sequence of words representing a single unit of meaning tends to be superimposed with the prosodic profile of a single word, and the process is continuous. How should the phoneme in a word be represented in a symbolic form reflecting the way it is uttered? Contrary to what is generally done, the phonemic representation and the representation of the phoneme by contrastive features should include "prosodic" features, related to the position of the phoneme relative to the word boundaries and to stress. Some of these prosodic features, related to word-final lengthening and word-initial strengthening, are shared by unrelated languages. It is hypothesized that a single feature, such as [± strong], which corresponds to a global tensing of the articulators at word beginning, accounts for a number of apparently unrelated phenomena characteristic of word onset, such as more precise articulation, a higher position of the velum, aspiration and glottalisation phenomena, and devoicing.

Proceedings ArticleDOI
T. Ukita, T. Nitta, S. Watanabe
01 Apr 1986
TL;DR: A speaker independent recognition algorithm for connected words is described which uses a word boundary hypothesizer to reduce computational cost, as well as a robust word classifier and an effective scoring strategy that is superior to a method which regularly skips some frames for boundary hypothesis.
Abstract: A speaker independent recognition algorithm for connected words is described which uses a word boundary hypothesizer to reduce computational cost, as well as a robust word classifier and an effective scoring strategy. The word boundary hypothesizer predicts possible candidates for word boundaries at a variable rate which is controlled by a difference in adjacent frame spectra, obtained by bandpass filters. It reduces computational cost of the algorithm to about one-tenth, compared with a conventional approach. The word classifier uses a statistical pattern recognition technique to calculate word similarities and discriminate a word for a provisional interval between hypothesized boundaries. As the scoring strategy for evaluating possible word strings, word scores are calculated and accumulated for continuous intervals. A word score is calculated from a word similarity by an equation which models the a posteriori probability of correct word position. An experiment was performed for 35 four-connected digits uttered by ten male speakers. The string recognition rate was 93.9% (word rate = 98.4%). It was also shown that the algorithm is superior to a method which regularly skips some frames for boundary hypothesis.
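The sketch below illustrates the boundary-hypothesis step: boundary candidates are placed wherever the spectral difference between adjacent filterbank frames exceeds a threshold, so candidates appear at a variable rate rather than every k-th frame. The distance measure, threshold, and minimum gap are assumptions.

```python
import numpy as np

def boundary_candidates(filterbank_frames, threshold=1.5, min_gap=3):
    """filterbank_frames: (T, B) log energies from B bandpass filters."""
    diffs = np.linalg.norm(np.diff(filterbank_frames, axis=0), axis=1)
    candidates, last = [], -min_gap
    for t, d in enumerate(diffs, start=1):
        if d > threshold and t - last >= min_gap:   # large spectral change -> possible boundary
            candidates.append(t)
            last = t
    return candidates

frames = np.random.randn(200, 16)                    # dummy filterbank spectra
print(boundary_candidates(frames))                   # provisional intervals lie between candidates
```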

Proceedings ArticleDOI
01 Oct 1986
TL;DR: In this article, a Reed-Solomon error correction codec was developed and integrated into a satellite TDMA system, where the coding overhead was only about 6%, and a dramatic error rate improvement was achieved for channel error rates less than 5×10-4.
Abstract: In a recent experimental project, a long, high rate Reed-Solomon error correction codec was developed and integrated into a satellite TDMA system. Although the coding overhead was only about 6%, a dramatic error rate improvement was achieved for channel error rates less than 5×10-4. This paper describes the design and implementation of the codec, its integration into a TDMA format, and the results of the experimental testing.
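As a back-of-the-envelope check of the quoted overhead and improvement, the snippet below uses an assumed high-rate code, RS(255, 239) over GF(256), which is not necessarily the codec described in the paper, to compute the coding overhead and the post-decoding block error probability at the quoted channel error rate.

```python
from math import comb

n, k, m = 255, 239, 8                 # n symbols per codeword, k data symbols, m bits per symbol
t = (n - k) // 2                      # symbol-error correcting capability (t = 8)
print("overhead:", (n - k) / n)       # ~0.063, i.e. about 6%

def block_error_prob(channel_ber):
    """Probability that more than t symbols of a codeword are hit, assuming
    independent bit errors; a symbol is bad if any of its m bits is bad."""
    ps = 1 - (1 - channel_ber) ** m
    return sum(comb(n, i) * ps**i * (1 - ps) ** (n - i) for i in range(t + 1, n + 1))

print(block_error_prob(5e-4))         # decoded block error probability at the quoted channel BER
```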

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A high-speed speaker-independent speech recognition system for a large vocabulary of Japanese words, based on phoneme recognition, implemented in small size by a pair of digital signal processors and a general purpose micro-processor on three printed circuit boards.
Abstract: This paper presents the development of a high-speed speaker-independent speech recognition system for a large vocabulary of Japanese words, based on phoneme recognition. The system is implemented in small size by a pair of digital signal processors and a general purpose micro-processor on three printed circuit boards. An experimental result using 212 Japanese word samples, which have been pronounced by 20 males and 20 females, showed an average word recognition rate of 95.5%. And the word recognition process completes within 0.8 second after the end of utterance.