Showing papers on "Word error rate published in 1988"


Book
31 Oct 1988
TL;DR: This book describes the SPHINX speech recognition system, covering hidden Markov modeling of speech, the addition of knowledge sources, the search for a good unit of speech, and learning and adaptation to new speakers.
Abstract: 1. Introduction.- 2. Hidden Markov Modeling of Speech.- 3. Task and Databases.- 4. The Baseline SPHINX System.- 5. Adding Knowledge.- 6. Finding a Good Unit of Speech.- 7. Learning and Adaptation.- 8. Summary of Results.- 9. Conclusion.- Appendix I. Evaluating Speech Recognizers.- I.1. Perplexity.- I.2. Computing Error Rate.- Appendix II. The Resource Management Task.- II.1. The Vocabulary and the SPHINX Pronunciation Dictionary.- II.2. The Grammar.- II.3. Training and Test Speakers.- Appendix III. Examples of SPHINX Recognition.- References.
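Appendix I of the book covers perplexity and error-rate computation. As a rough illustration of how word error rate is conventionally computed (a minimal sketch, not code from the book), the hypothesis is aligned to the reference transcript by dynamic programming, and substitutions, deletions, and insertions are counted against the number of reference words:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed via word-level Levenshtein alignment (a standard sketch)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("show all alerts", "show a alerts"))  # one substitution: 0.333...
```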

462 citations


PatentDOI
TL;DR: A method for creating word models for a large vocabulary, natural language dictation system that may be used for connected speech as well as for discrete utterances.
Abstract: A method for creating word models for a large vocabulary, natural language dictation system. A user with limited typing skills can create documents with little or no advance training of word models. As the user dictates, the user speaks a word which may or may not already be in the active vocabulary. The system displays a list of the words in the active vocabulary which best match the spoken word. By keyboard or voice command, the user may choose the correct word from the list or may choose to edit a similar word if the correct word is not on the list. Alternatively, the user may type or speak the initial letters of the word; the recognition algorithm is then run again, restricted to words matching those initial letters, and the choices are displayed again. A word list is then also displayed from a large backup vocabulary. The best words to display from the backup vocabulary are chosen using a statistical language model and, optionally, word models derived from a phonemic dictionary. When the correct word is chosen by the user, the speech sample is used to create or update an acoustic model for the word, without further intervention by the user. As the system is used, it also constantly updates its statistical language model, so it accumulates word models and its performance keeps improving with use. The system may be used for connected speech as well as for discrete utterances.

284 citations


Journal ArticleDOI
TL;DR: A method of fault tree analysis of human errors based on the concept of 'error possibility' instead of the error rate is proposed, and it is shown that the proposed method yields useful information.

245 citations


Journal ArticleDOI
TL;DR: A study of talker-stress-induced intraword variability and an algorithm that compensates for the systematic changes observed are presented and the functional form of the compensation is shown to correspond to the equalization of spectral tilts.
Abstract: A study of talker-stress-induced intraword variability and an algorithm that compensates for the systematic changes observed are presented. The study is based on hidden Markov models trained by speech tokens spoken in various talking styles. The talking styles include normal speech, fast speech, loud speech, soft speech, and talking with noise injected through earphones; the styles are designed to simulate speech produced under real stressful conditions. Cepstral coefficients are used as the parameters in the hidden Markov models. The stress compensation algorithm compensates for the variations in the cepstral coefficients in a hypothesis-driven manner. The functional form of the compensation is shown to correspond to the equalization of spectral tilts. Substantial reduction of error rates has been achieved when the cepstral domain compensation techniques were tested on the simulated-stress speech database. The hypothesis-driven compensation technique reduced the average error rate from 13.9% to 6.2%. When a more sophisticated recognizer was used, it reduced the error rate from 2.5% to 1.9%.
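Because spectral tilt is carried almost entirely by the low-order cepstrum, tilt equalization can be pictured as a shift of the first cepstral coefficient toward a reference value. A minimal sketch under that assumption (the paper's actual compensation is hypothesis-driven and trained; the names here are illustrative):

```python
import numpy as np

def equalize_spectral_tilt(cepstra: np.ndarray, reference_tilt: float) -> np.ndarray:
    """Crude first-order tilt compensation: shift c1 (which dominates
    spectral tilt) toward a reference value. `cepstra` has shape
    (n_frames, n_coeffs). An assumed simplification, not the paper's method."""
    out = cepstra.copy()
    out[:, 1] += reference_tilt - cepstra[:, 1].mean()
    return out
```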

114 citations


Journal ArticleDOI
Chin-Hui Lee
TL;DR: A robust linear prediction (LP) algorithm is proposed that minimizes the sum of appropriately weighted residuals and takes into account the non-Gaussian nature of the excitations for voiced speech, giving a more efficient and less biased estimate of the prediction coefficients than conventional methods.
Abstract: A robust linear prediction (LP) algorithm is proposed that minimizes the sum of appropriately weighted residuals. The weight is a function of the prediction residual, and the cost function is selected to give more weight to the bulk of small residuals while deemphasizing the small portion of large residuals. In contrast, the conventional LP procedure weights all prediction residuals equally. The robust algorithm takes into account the non-Gaussian nature of the excitations for voiced speech and gives a more efficient (lower variance) and less biased estimate of the prediction coefficients than conventional methods. The algorithm can be used in the front-end feature extractor for a speech recognition system and as an analyzer for a speech coding system. Testing on synthetic vowel data demonstrates that the robust LP procedure is able to reduce the formant and bandwidth error rate by more than an order of magnitude compared to the conventional LP procedures and is relatively insensitive to the placement of the LPC (LP coding) analysis window and to the value of the pitch period, for a given section of speech signal.
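Minimizing a sum of residual-dependent weighted residuals is commonly done by iteratively reweighted least squares; the sketch below uses Huber-style weights as an assumed choice (the paper selects its own cost function, and these names are illustrative):

```python
import numpy as np

def robust_lp(x: np.ndarray, order: int = 10, n_iter: int = 5, c: float = 1.5) -> np.ndarray:
    """Robust LP coefficients via iteratively reweighted least squares.
    Large residuals are downweighted so the bulk of small residuals
    dominates the fit, unlike ordinary LP, which weights all equally."""
    n = len(x)
    # Regression form of LP: x[t] ~ sum_k a[k] * x[t-1-k]
    X = np.column_stack([x[order - 1 - k : n - 1 - k] for k in range(order)])
    y = x[order:]
    w = np.ones_like(y)
    for _ in range(n_iter):
        sw = np.sqrt(w)
        a, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        r = y - X @ a
        scale = np.median(np.abs(r)) / 0.6745 + 1e-12      # robust residual scale
        u = np.maximum(np.abs(r) / scale, 1e-12)
        w = np.minimum(1.0, c / u)                          # Huber-style weights
    return a
```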

112 citations


Journal ArticleDOI
A. Nadas, David Nahamoo, Michael Picheny
TL;DR: For minimizing the decoding error rate of the (optimal) maximum a posteriori probability (MAP) decoder, it is shown that the CMLE (or maximum mutual information estimate, MMIE) may be preferable when the model is incorrect.
Abstract: Training methods for designing better decoders are compared. The training problem is considered as a statistical parameter estimation problem. In particular, the conditional maximum likelihood estimate (CMLE), which estimates the parameter values that maximize the conditional probability of words given acoustics during training, is compared to the maximum-likelihood estimate, which is obtained by maximizing the joint probability of the words and acoustics. For minimizing the decoding error rate of the (optimal) maximum a posteriori probability (MAP) decoder, it is shown that the CMLE (or maximum mutual information estimate, MMIE) may be preferable when the model is incorrect. In this sense, the CMLE/MMIE appears more robust than the MLE.
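In symbols, with word sequences $W_i$, acoustics $A_i$, and model parameters $\theta$, the two training criteria compared are

$$
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \prod_i P_\theta(W_i, A_i),
\qquad
\hat{\theta}_{\mathrm{CMLE}} = \arg\max_{\theta} \prod_i P_\theta(W_i \mid A_i),
$$

and the CMLE coincides with the maximum mutual information estimate when the language model $P(W)$ is held fixed.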

94 citations


Journal ArticleDOI
TL;DR: It is found that for a given error rate, error patterns having zero correlation between successive transmissions generally fare better than those with negative correlation, and that error patterns with positive correlation fare better still.
Abstract: A formula for the go-back-N ARQ (automatic repeat request) scheme applicable to Markov error patterns is derived. It is a generalization of the well-known efficiency formula p/(p+m(1-p)) (where m is the round-trip delay in number of block durations and p is the block transmission success probability), and it has been successfully validated against simulation measurements. It is found that for a given error rate, error patterns having zero correlation between successive transmissions generally fare better than those with negative correlation, and that error patterns with positive correlation fare better still. It is shown that the present analysis can be extended in a straightforward manner to cope with error patterns of a more complex nature. Simple procedures for numerical evaluation of efficiency under quite general error structures are presented.
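The quoted baseline formula is straightforward to evaluate; for instance:

```python
def go_back_n_efficiency(p: float, m: int) -> float:
    """Throughput efficiency of go-back-N ARQ with independent block
    errors: p is the block success probability, m the round-trip delay
    in block durations (the paper's p/(p + m(1-p)) special case)."""
    return p / (p + m * (1 - p))

# 1% block error rate with a 10-block round-trip delay:
print(go_back_n_efficiency(0.99, 10))  # ~0.908
```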

84 citations


Proceedings ArticleDOI
Eva Ejerhed
09 Feb 1988
TL;DR: A comparison of the error rates of the two parsing methods in the recognition of basic clauses showed a 13% error rate for the regular expression method and a 6.5% error rate for the stochastic method.
Abstract: The paper presents and compares two different methods of parsing, a regular expression method and a stochastic method, with respect to their success in identifying basic clauses in unrestricted English text. These methods of parsing were developed in order to be applied to the task of improving the detection of large prosodic units in the Bell Labs text-to-speech system, and were so applied experimentally. The paper also discusses the notion of basic clause that was defined as the parsing target. The result of a comparison of the error rates of the two parsing methods in the recognition of basic clauses showed that there was a 13% error rate for the regular expression method and a 6.5% error rate for the stochastic method.

54 citations


Proceedings ArticleDOI
11 Apr 1988
TL;DR: This system extends an earlier robust continuous observation HMM IWR system to continuous speech using the DARPA-robust (multi-condition with a pilot's facemask) database.
Abstract: Most speech recognizers are sensitive to the speech style and the speaker's environment. This system extends an earlier robust continuous-observation HMM isolated-word recognition (IWR) system to continuous speech using the DARPA-robust (multi-condition with a pilot's facemask) database. Performance on a 207-word, perplexity-14 task is a 0.9% word error rate under office conditions, and 2.5% (best speaker) and 5% (4-speaker average) for the normal test condition of the database.

54 citations


Book
01 Dec 1988
TL;DR: In this article, a selective-repeat ARQ scheme with a finite receiver buffer and a finite range of sequence numbers is proposed, and the throughput performance is analyzed and simulated based on the assumption that the channel errors are randomly distributed and the return channel is noiseless.
Abstract: In this paper, we investigate a selective-repeat ARQ scheme which operates with a finite receiver buffer and a finite range of sequence numbers. The throughput performance of the proposed scheme is analyzed and simulated based on the assumption that the channel errors are randomly distributed and the return channel is noiseless. Both analytical and simulation results show that it significantly outperforms the go-back-N ARQ scheme, particularly for channels with large round-trip delay and high data rate. It provides high throughput efficiency over a wide range of bit error rates. The throughput remains in a usable range even for very high error rate conditions. The proposed scheme is capable of handling data and/or acknowledgment loss. Furthermore, when buffer overflow occurs at the receiver, the transmitter is capable of detecting it and backs up to the proper location of the input queue to retransmit the correct data blocks.
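The advantage over go-back-N comes from resending only the errored blocks. A hedged comparison using the idealized infinite-buffer formulas (not the paper's finite-buffer analysis):

```python
def gbn_throughput(p: float, m: int) -> float:
    # Go-back-N with independent errors: every loss costs m block times.
    return p / (p + m * (1 - p))

def sr_throughput(p: float) -> float:
    # Ideal selective repeat: only the errored block is resent.
    return p

for p in (0.999, 0.99, 0.9):  # block success probabilities
    print(f"p={p}: GBN={gbn_throughput(p, 100):.3f}  SR={sr_throughput(p):.3f}")
```

At a 10% block error rate and a 100-block round-trip delay, go-back-N drops to roughly 8% efficiency while ideal selective repeat stays near 90%, which mirrors the paper's point about channels with large round-trip delay and high data rate.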

50 citations


PatentDOI
Annedore Paeseler
TL;DR: In this article, the recognition process is achieved by comparing the input sequence of speech signals to reference values and summing those which are syntactically permissible until they form a valid word.
Abstract: Continuous speech recognition assigns predetermined words to syntactic categories and defines the syntactic categories which can follow and precede each predetermined word. The recognition process is achieved by comparing the input sequence of speech signals to reference values and summing those which are syntactically permissible until they form a valid word. Subsequent speech values are compared to reference values listed in the syntactic categories which can follow previously calculated valid words. For each word, values are updated indicating the current word's sequence number, syntax category, cumulative comparison sum, and the current list of compared words. Values are also stored for each word which identify the previous word, the following word, and their syntax categories. This process is repeated until all input values have been processed. The results are then checked to verify valid syntax, and the words with the closest match are read out.

Journal ArticleDOI
TL;DR: It is suggested that the fuzzy expression of human reliability is defined by all the factors required for the task: the error rate, the time required, and so on.

Proceedings Article
01 Jan 1988
TL;DR: A new probabilistic spectral mapping procedure is investigated to estimate the speaker-adaptation transformation, and it is found that significant improvement in recognition has been achieved compared to the previous adaptation algorithm.
Abstract: The goal of speaker adaptation is to minimize the amount of speech needed to model a new speaker while retaining high recognition accuracy. We investigate a new probabilistic spectral mapping procedure to estimate the transformation. To evaluate the algorithm, recognition experiments are carried out on the 1000-word resource management continuous speech database using a grammar with perplexity 60. The results show that significant improvement in recognition has been achieved compared to our previous adaptation algorithm. The average word error rate of speaker-adapted models using 2 minutes of training speech is 11.3%, compared to 7.1% for speaker-dependent models using 20-28 minutes of training speech.

PatentDOI
TL;DR: A low-cost speech recognition system generates frames of received speech having binary feature components; the received speech frames are compared with reference templates, and error values representing the difference between the received speech and the reference templates are generated.
Abstract: A low cost speech recognition system generates frames of received speech having binary feature components. The received speech frames are compared (18) with reference templates (22), and error values representing the difference between the received speech and the reference templates (22) are generated. At the end of an utterance, if one template resulted in a sufficiently small error value, the word represented by that template is selected (26) as the recognized word.
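With binary feature components, the per-frame error value reduces to a Hamming distance between bit vectors, which is what keeps such a system cheap. A minimal sketch under that assumption (the frame packing and names are illustrative, not from the patent):

```python
def frame_error(received: int, template: int, width: int = 16) -> int:
    """Hamming distance between two binary feature frames packed into
    integers (an assumed encoding)."""
    return bin((received ^ template) & ((1 << width) - 1)).count("1")

def utterance_error(received_frames, template_frames) -> int:
    # Sum of frame errors; a real recognizer would time-align first.
    return sum(frame_error(r, t) for r, t in zip(received_frames, template_frames))
```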

Proceedings ArticleDOI
11 Apr 1988
TL;DR: Analysis parameters and various distance measures are investigated for a template matching scheme for speaker identity verification (SIV) and performance varies significantly across vocabulary, and average performance is approximately 5% EER for the better algorithms on telephone speech.
Abstract: Analysis parameters and various distance measures are investigated for a template matching scheme for speaker identity verification (SIV). Two parameters are systematically varied: the length of the signal analysis window, and the order of the linear predictive coding/cepstrum analysis. Computational costs associated with the choice of parameters are also considered. The distance measures tested are the Euclidean, inverse variance weighting, differential mean weighting, Kahn's simplified weighting, the Mahalanobis distance, and the Fisher linear discriminant. Using the equal error rate (EER) of pairwise utterance dissimilarity distributions, performance is estimated for prespecified and (a simulation of) user-determined input vocabulary. Performance varies significantly across vocabulary, and average performance is approximately 5% EER for the better algorithms on telephone speech.
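The equal error rate is the threshold operating point where the false-reject and false-accept rates coincide. A hedged sketch of estimating it from pairwise dissimilarity scores (the names are illustrative):

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Sweep a distance threshold over the pooled scores and return the
    error rate where false rejects (genuine pairs above the threshold)
    and false accepts (impostor pairs at or below it) are closest."""
    best_gap, eer = 1.0, 0.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        frr = np.mean(genuine > t)
        far = np.mean(impostor <= t)
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer
```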

Proceedings ArticleDOI
11 Apr 1988
TL;DR: A novel text-dependent probabilistic spectral mapping method is presented for rapid speaker adaptation that yields significantly better performance than previous algorithms, with a word error rate less than twice that of speaker-dependent training.
Abstract: A novel text-dependent probabilistic spectral mapping method is presented for rapid speaker adaptation. The algorithm has been tested on the DARPA 1000-word resource management database with a grammar perplexity of 60. It results in significantly better performance than the previous algorithms and, using two minutes of adaptation speech, provides a word error rate less than twice that of speaker-dependent training.

Journal ArticleDOI
Bernard Merialdo
TL;DR: This paper describes a new organization of the recognition process, Multilevel Decoding (MLD), that allows the system to support a Very-Large-Size Dictionary (VLSD)—one comprising over 100,000 words, which significantly surpasses the capacity of previous speech-recognition systems.
Abstract: An important concern in the field of speech recognition is the size of the vocabulary that a recognition system is able to support. Large vocabularies introduce difficulties involving the amount of computation the system must perform and the number of ambiguities it must resolve. But, for practical applications in general and for dictation tasks in particular, large vocabularies are required, because of the difficulties and inconveniences involved in restricting the speaker to the use of a limited vocabulary. This paper describes a new organization of the recognition process, Multilevel Decoding (MLD), that allows the system to support a Very-Large-Size Dictionary (VLSD)—one comprising over 100,000 words. This significantly surpasses the capacity of previous speech-recognition systems. With MLD, the effect of dictionary size on the accuracy of recognition can be studied. In this paper, recognition experiments using 10,000- and 200,000-word dictionaries are compared. They indicate that recognition using a 200,000-word dictionary is more accurate than recognition using a 10,000-word dictionary (when unrecognized words are included in the error rate).

Proceedings ArticleDOI
Masafumi Nishimura, K. Sugawara
11 Apr 1988
TL;DR: The authors describe a speaker adaptation method consisting of two stages: in the first stage, label prototypes, which represent spectral features, are modified to reduce the total distortion error of vector quantization for a new speaker.
Abstract: The authors describe a speaker adaptation method consisting of two stages. In the first stage, label prototypes, which represent spectral features, are modified to reduce the total distortion error of vector quantization for a new speaker. In the second stage, well-trained hidden Markov model (HMM) parameters are transformed by using a linear mapping function. This is estimated by counting the correspondences along the alignment between a state sequence of an HMM and a label sequence of a new speaker utterance. This adaptation procedure was tested in an isolated word recognition task using 150 confusable Japanese words. The original label prototypes and HMM parameters were estimated for a male speaker, who spoke each word 10 times. When the adaptation procedure was applied with 25 words, the average error rate for another seven male speakers was reduced from 25.0% to 5.6%, which was roughly the same as that for the original speaker. This procedure was also effective for adaptation between male and female speakers.

Proceedings ArticleDOI
11 Apr 1988
TL;DR: A speech recognition system is reported that comprises an acoustic/phonetic decoder, a lexical access mechanism and a syntax analyzer, based on a continuously variable duration hidden Markov model and a context-free covering grammar of English.
Abstract: Experiments with a speech recognition system are reported. The system comprises an acoustic/phonetic decoder, a lexical access mechanism and a syntax analyzer. The acoustic, phonetic and lexical processing are based on a continuously variable duration hidden Markov model (CVDHMM). The syntactic component is based on the Cocke-Kasami-Young (CKY) parser and a context-free covering grammar of English. Lexical items are represented in terms of the 43 phonetic units. In recognition tests conducted on a separate data set, a 70% correct recognition rate on phonetic units in fluent speech was observed. In two additional tests on isolated words, a 40% word recognition rate was observed with the complete 52,000-word lexicon. When the vocabulary size was reduced to 1040 words, the recognition rate improved to 80%. After syntax analysis the word recognition rate rose to 90%.

Proceedings ArticleDOI
11 Apr 1988
TL;DR: Several recently proposed automatic speech recognition (ASR) front-ends are experimentally compared for speaker-dependent and cross-speaker ASR and the perceptually based linear predictive front-end yields the highest accuracies.
Abstract: Several recently proposed automatic speech recognition (ASR) front-ends are experimentally compared for speaker-dependent and cross-speaker ASR. The perceptually based linear predictive front-end yields the highest accuracies. By modifying its sensitivity to spectral peaks and to spectral tilt and by utilizing the speech dynamics the authors further improve, by about 10%, its error rate in speaker-independent ASR.

Proceedings ArticleDOI
11 Apr 1988
TL;DR: An algorithm based on hidden Markov models is applied to the task of speaker-independent continuous-speech recognition for a vocabulary of 1000 words with no syntactic constraints, and it was found that the use of several different acoustic features and the use of word-specific phonetic modeling, where possible, improved system performance.
Abstract: An algorithm based on hidden Markov models is applied to the task of speaker-independent continuous-speech recognition for a vocabulary of 1000 words with no syntactic constraints. The signal is limited to 4000 Hz. Word models were built from three-state representations of phonetic units, concatenated according to entries in a lexicon. Performance as measured on DARPA's resource management database was 40% correct word recognition. It was found that the use of several different acoustic features and the use of word-specific phonetic modeling, where possible, improved system performance.

Proceedings ArticleDOI
11 Apr 1988
TL;DR: Simulations show that the commonly used dynamic programming word-sequence matching algorithm has serious shortcomings as an evaluation method at low performance levels, though it is generally reliable at high performance levels and a method using word end-point information provides precise, detailed performance analyses.
Abstract: Outputs of connected-word recognizers may contain substitution, deletion and insertion errors, and their interpretation is not trivial. Simulations show that the commonly used dynamic programming word-sequence matching algorithm has serious shortcomings as an evaluation method at low performance levels, though it is generally reliable at high performance levels. The strategy of comparing input and output words in strict sequence is found to have little to recommend it. A method using word end-point information, which provides precise, detailed performance analyses, is described. Tests with real data confirm the reliability of the end-point method and the presence of positive bias in performance estimates from the word-sequence matching method.

01 Jan 1988
TL;DR: The Dispersion Frame Technique (DFT) was developed from the observation that electromechanical devices experience a period of deteriorating performance usually in the form of increasing error rate prior to catastrophic failure.
Abstract: Projections indicate that the use of personal computing environments and distributed networks will increase by an order of magnitude by the turn of the century. In addition, personal workstations will continue to increase in complexity through the use of VLSI hardware. Therefore an efficient and flexible means for maintaining high availability in workstations becomes vitally important. Thus fault handling, as well as the collection and analysis of data produced by faults must be automated. A distributed on-line monitoring and predictive diagnostic system has been developed. The diagnostic system integrates error logging, monitoring, and control functions. The hybrid architecture implementation, where diagnosability is integrated in both the centralized diagnostic server and the individual file server, ensures synchronization and robustness of the communication between the diagnostic system and the file server machines. Data collected from the file servers over the last twenty-two months was analyzed. Twenty-nine permanent faults were identified in the operator's log and were shown to follow an exponential failure distribution. The error log was shown to contain events which are caused by a mixture of transient and intermittent faults. The failure distribution of the transient faults can be characterized by the Weibull function with a decreasing error rate, whereas that of the intermittent faults exhibits an increasing error rate. The failure distribution of the entire error log also follows a Weibull distribution with a decreasing error rate. The parameters of the entire error log distribution are a function of the relationship between transient and intermittent faults as summarized by the ratios of the shape parameters and the relative frequency of error occurrences. It is shown that 25 faults are typically required in this study to give an accurate estimate of the Weibull parameters. Studying the average number of faults before repair activities shows that users will not tolerate such a large number of errors, and subsequent system crashes, prior to an attempted repair. Hence the Dispersion Frame Technique (DFT) was developed from the observation that electromechanical devices experience a period of deteriorating performance usually in the form of increasing error rate prior to catastrophic failure. (Abstract shortened with permission of author.)
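The decreasing and increasing error rates reported correspond to the shape parameter of the Weibull hazard function,

$$
h(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1},
$$

which decreases in $t$ for shape $\beta < 1$ (the transient faults and the entire error log) and increases for $\beta > 1$ (the intermittent faults); $\eta$ is the scale parameter.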

Proceedings ArticleDOI
11 Apr 1988
TL;DR: The Boltzmann machine algorithm and the error back propagation algorithm were used to learn to recognize the place of articulation of vowels (front, center or back), represented by a static description of spectral lines, which shows a fault tolerant property of the neural nets.
Abstract: The Boltzmann machine algorithm and the error back propagation algorithm were used to learn to recognize the place of articulation of vowels (front, center or back), represented by a static description of spectral lines. The error rate is shown to depend on the coding. Results are comparable or better than those obtained by us on the same data using hidden Markov models. The authors also show a fault tolerant property of the neural nets, i.e. that the error on the test set increases slowly and gradually when an increasing number of nodes fail.

Proceedings ArticleDOI
11 Apr 1988
TL;DR: A tight integration between the two steps rather than a hierarchical approach has been investigated and the hypothesization and the verification modules are implemented as processes running in parallel.
Abstract: Recently, a two-step strategy for large vocabulary isolated word recognition has been successfully demonstrated. The first step consists of the hypothesization of a reduced set of word candidates on the basis of broad bottom-up features, while the second is the verification of the hypotheses using more detailed phonetic knowledge. This paper deals with its extension to continuous speech. A tight integration between the two steps, rather than a hierarchical approach, has been investigated. The hypothesization and the verification modules are implemented as processes running in parallel. Both processes represent lexical knowledge by a tree. Each node of the hypothesization tree is labeled by one of 6 broad phonetic classes. The nodes of the verification tree are, instead, the states of sub-word HMMs. The two processes cooperate to detect word hypotheses along the sentence.

DOI
01 Oct 1988
TL;DR: In this paper, the influence of IF-filtering on the bit error rate floor in optical DPSK systems is investigated, and it is shown that IF-filtering leads to a reduction of the error rate floor, so that the linewidth requirements are reduced by a factor of 0.68.
Abstract: We investigate the influence of IF-filtering on the bit error rate floor in optical DPSK systems; this influence is usually neglected. We show that IF-filtering leads to a reduction of the error rate floor, so that the linewidth requirements are reduced by a factor of 0.68.

Journal ArticleDOI
TL;DR: A comprehensive performance analysis method that models, at bit level, the error performance of individual links in an end-to-end connection is presented and the utility and power of the model are illustrated with the help of an example connection.
Abstract: A comprehensive performance analysis method that models, at bit level, the error performance of individual links in an end-to-end connection is presented. The link model accounts for the burst-error behaviour of each individual link. A method to concatenate several individual links and extract a model for the end-to-end connection is given. This resulting end-to-end model can be used to calculate performance measures such as bit error rate and block error rate for any given block size. A procedure to compute the probability distribution of errors within a specific block is also developed. Finally, a method to compute the probability distribution of blocks having a certain error rate over a given period of time is presented. The utility and power of the model are illustrated with the help of an example connection.
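For the special case of independent errors on each link, concatenation has a simple closed form, since a bit is wrong end-to-end only when it is flipped an odd number of times. A hedged sketch of that baseline (the paper's Markov link models capture burst errors, which this ignores):

```python
def end_to_end_ber(link_bers) -> float:
    """BER of a tandem of links modeled as independent binary
    symmetric channels: p = (1 - prod(1 - 2*p_i)) / 2."""
    prod = 1.0
    for p in link_bers:
        prod *= 1.0 - 2.0 * p
    return (1.0 - prod) / 2.0

print(end_to_end_ber([1e-4, 1e-5, 1e-4]))  # ~2.1e-4, close to the sum of the BERs
```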

Proceedings ArticleDOI
11 Apr 1988
TL;DR: An HMM-based isolated-word recognition system that dynamically adapts word model parameters to new speakers and to stress-induced speech variations that produces results comparable to multistyle-trained systems.
Abstract: The authors describe an HMM-based isolated-word recognition system that dynamically adapts word model parameters to new speakers and to stress-induced speech variations. During recognition all input tokens presented to the system can be used to augment the current word model parameters. New tokens can be weighted so that adaptation simply increases the size of the training set, or tracks systematic changes by exponentially weighting all previously seen data. This system was tested on the 35-word, 10,710-token Lincoln stressed speech database. Speaker adaptation experiments produced error rates equivalent to speaker-trained systems after the presentation of only a single new token per vocabulary word. Stress condition adaptation experiments produced results comparable to multistyle-trained systems after the presentation of several new tokens per vocabulary word.

Journal ArticleDOI
TL;DR: The design of a multifont character recognizer is described which uses a binary decision tree to classify a character on the basis of 197 geometric features; error rates were highly sensitive to typeface and varied between 10 percent and 0.1 percent.
Abstract: An optical character reader for processing typeset documents must be able to handle proportional spacing, the presence of touching characters and a wide variety of type fonts. This paper describes the design of a multifont character recognizer which uses a binary decision tree to classify a character on the basis of 197 geometric features. The algorithm for designing the decision tree is based upon an entropy minimization procedure, and makes no assumptions on the distribution or independence of the binary features. The decision tree classifier provides confidence measures which may be used to reduce the substitution error rate at the expense of higher rejection rates. Methods of reducing the overall error rate by combining the decision tree classifier with other classifiers were examined. In particular, the paper evaluates the performance of a classifier using a combination of multiple decision trees, template matching and contextual post-processing. Error rates were highly sensitive to typeface and varied between 10 percent and 0.1 percent. Computer processing times for the various stages of the system are presented.
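Entropy-minimization tree design chooses, at each node, the binary feature whose split yields the lowest expected class entropy, without assuming feature independence. A minimal sketch of that node-splitting criterion (names and data layout are illustrative, not the paper's implementation):

```python
import math
from collections import Counter

def entropy(labels) -> float:
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(samples, labels, n_features: int):
    """Pick the binary feature minimizing the weighted entropy of the
    two child nodes; `samples` are sequences of 0/1 feature values."""
    best_f, best_h = None, float("inf")
    for f in range(n_features):
        left = [y for x, y in zip(samples, labels) if x[f] == 0]
        right = [y for x, y in zip(samples, labels) if x[f] == 1]
        if not left or not right:
            continue  # degenerate split carries no information
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if h < best_h:
            best_f, best_h = f, h
    return best_f
```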

PatentDOI
Ira A. Gerson
TL;DR: The invention is intended to be implemented in a system which has word templates stored in template memory, with the system being capable of accumulating distance measures for states within each word template.
Abstract: Word spotting in a speech recognition system without predetermining the endpoints of the input speech. The invention is intended to be implemented in a system which has word templates stored in template memory, with the system being capable of accumulating distance measures for states within each word template. The following steps are used to generate a measure of similarity between a subset of the input frames and a word template: a) recording a beginning input frame number for each state to identify the potential beginning of the word; b) accumulating distance measures for at least one state for each input frame; c) normalizing the distance measures by subtracting a normalization amount from each distance measure; d) recording normalization information corresponding to the normalization amount for each input frame; and e) determining a similarity measure between the word template and a subset of input frames after a given input frame has been processed. The subset is identified from the beginning input frame number corresponding to an end state of the template, through the given input frame number. The similarity measure is based on the normalized distance measure recorded for the end state and the normalization information.