
Showing papers on "Word error rate" published in 1991


Book
01 Jan 1991
TL;DR: In this book, the authors discuss the importance of unbiased error rate estimation, finding the right complexity fit when estimating the true performance of a learning system, and the expected patterns of classifier behavior.
Abstract: Preface
1 Overview of Learning Systems 1.1 What is a Learning System? 1.2 Motivation for Building Learning Systems 1.3 Types of Practical Empirical Learning Systems 1.3.1 Common Theme: The Classification Model 1.3.2 Let the Data Speak 1.4 What's New in Learning Methods 1.4.1 The Impact of New Technology 1.5 Outline of the Book 1.6 Bibliographical and Historical Remarks
2 How to Estimate the True Performance of a Learning System 2.1 The Importance of Unbiased Error Rate Estimation 2.2 What is an Error? 2.2.1 Costs and Risks 2.3 Apparent Error Rate Estimates 2.4 Too Good to Be True: Overspecialization 2.5 True Error Rate Estimation 2.5.1 The Idealized Model for Unlimited Samples 2.5.2 Train-and-Test Error Rate Estimation 2.5.3 Resampling Techniques 2.5.4 Finding the Right Complexity Fit 2.6 Getting the Most Out of the Data 2.7 Classifier Complexity and Feature Dimensionality 2.7.1 Expected Patterns of Classifier Behavior 2.8 What Can Go Wrong? 2.8.1 Poor Features, Data Errors, and Mislabeled Classes 2.8.2 Unrepresentative Samples 2.9 How Close to the Truth? 2.10 Common Mistakes in Performance Analysis 2.11 Bibliographical and Historical Remarks
3 Statistical Pattern Recognition 3.1 Introduction and Overview 3.2 A Few Sample Applications 3.3 Bayesian Classifiers 3.3.1 Direct Application of the Bayes Rule 3.4 Linear Discriminants 3.4.1 The Normality Assumption and Discriminant Functions 3.4.2 Logistic Regression 3.5 Nearest Neighbor Methods 3.6 Feature Selection 3.7 Error Rate Analysis 3.8 Bibliographical and Historical Remarks
4 Neural Nets 4.1 Introduction and Overview 4.2 Perceptrons 4.2.1 Least Mean Square Learning Systems 4.2.2 How Good Is a Linear Separation Network? 4.3 Multilayer Neural Networks 4.3.1 Back-Propagation 4.3.2 The Practical Application of Back-Propagation 4.4 Error Rate and Complexity Fit Estimation 4.5 Improving on Standard Back-Propagation 4.6 Bibliographical and Historical Remarks
5 Machine Learning: Easily Understood Decision Rules 5.1 Introduction and Overview 5.2 Decision Trees 5.2.1 Finding the Perfect Tree 5.2.2 The Incredible Shrinking Tree 5.2.3 Limitations of Tree Induction Methods 5.3 Rule Induction 5.3.1 Predictive Value Maximization 5.4 Bibliographical and Historical Remarks
6 Which Technique is Best? 6.1 What's Important in Choosing a Classifier? 6.1.1 Prediction Accuracy 6.1.2 Speed of Learning and Classification 6.1.3 Explanation and Insight 6.2 So, How Do I Choose a Learning System? 6.3 Variations on the Standard Problem 6.3.1 Missing Data 6.3.2 Incremental Learning 6.4 Future Prospects for Improved Learning Methods 6.5 Bibliographical and Historical Remarks
7 Expert Systems 7.1 Introduction and Overview 7.1.1 Why Build Expert Systems? New vs. Old Knowledge 7.2 Estimating Error Rates for Expert Systems 7.3 Complexity of Knowledge Bases 7.3.1 How Many Rules Are Too Many? 7.4 Knowledge Base Example 7.5 Empirical Analysis of Knowledge Bases 7.6 Future: Combined Learning and Expert Systems 7.7 Bibliographical and Historical Remarks
References, Author Index, Subject Index

813 citations


BookDOI
01 May 1991
TL;DR: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment, including the SNR-Dependent Cepstral Normalization (SDCN) and the Codeword-Dependent Cepstral Normalization (CDCN).
Abstract: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment. These algorithms attempt to improve the recognition accuracy of speech recognition systems when they are trained and tested in different acoustical environments, and when a desk-top microphone (rather than a close-talking microphone) is used for speech input. Without such processing, mismatches between training and testing conditions produce an unacceptable degradation in recognition accuracy. Two kinds of environmental variability are introduced by the use of desk-top microphones and different training and testing conditions: additive noise and spectral tilt introduced by linear filtering. An important attribute of the novel compensation algorithms described in this thesis is that they provide joint rather than independent compensation for these two types of degradation. Acoustical compensation is applied in our algorithms as an additive correction in the cepstral domain. This allows a higher degree of integration within SPHINX, the Carnegie Mellon speech recognition system, which uses the cepstrum as its feature vector. Therefore, these algorithms can be implemented very efficiently. Processing in many of these algorithms is based on instantaneous signal-to-noise ratio (SNR), as the appropriate compensation represents a form of noise suppression at low SNRs and spectral equalization at high SNRs. The compensation vectors for additive noise and spectral transformations are estimated by minimizing the differences between speech feature vectors obtained from a "standard" training corpus of speech and feature vectors that represent the current acoustical environment. In our work this is accomplished by minimizing the distortion of vector-quantized cepstra that are produced by the feature extraction module in SPHINX. In this dissertation we describe several algorithms including the SNR-Dependent Cepstral Normalization (SDCN) and the Codeword-Dependent Cepstral Normalization (CDCN). With CDCN, the accuracy of SPHINX when trained on speech recorded with a close-talking microphone and tested on speech recorded with a desk-top microphone is essentially the same as that obtained when the system is trained and tested on speech from the desk-top microphone. An algorithm for frequency normalization has also been proposed in which the parameter of the bilinear transformation that is used by the signal-processing stage to produce frequency warping is adjusted for each new speaker and acoustical environment. The optimum value of this parameter is again chosen to minimize the vector-quantization distortion between the standard environment and the current one. In preliminary studies, use of this frequency normalization produced a moderate additional decrease in the observed error rate.
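As a rough illustration of the SNR-dependent additive correction described above (not the thesis code; the correction table and frame values are invented), a compensation step of this kind can look like:

import numpy as np

# Hypothetical correction vectors indexed by SNR bucket (dB); in SDCN these
# would be learned by comparing recordings from the two microphones.
snr_edges = np.array([0.0, 10.0, 20.0])          # bucket boundaries in dB
corrections = np.array([
    [1.2, -0.4, 0.1],   # correction applied at low SNR (noise-dominated frames)
    [0.6, -0.2, 0.0],   # mid SNR
    [0.2, -0.1, 0.0],   # high SNR (mostly spectral-tilt compensation)
    [0.0,  0.0, 0.0],   # very high SNR: no correction needed
])

def compensate(cepstra, frame_snr_db):
    """Add an SNR-dependent correction vector to each cepstral frame."""
    buckets = np.searchsorted(snr_edges, frame_snr_db)   # one bucket per frame
    return cepstra + corrections[buckets]

# Toy example: three frames of 3-dimensional cepstra with different SNRs.
cepstra = np.array([[0.5, 0.2, -0.1],
                    [1.0, 0.0,  0.3],
                    [0.8, -0.3, 0.2]])
print(compensate(cepstra, np.array([5.0, 15.0, 30.0])))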

474 citations


Proceedings ArticleDOI
18 Jun 1991
TL;DR: A statistical technique for assigning senses to words is described; when incorporated into the authors' statistical machine translation system, it decreased the error rate of the system by thirteen percent.
Abstract: We describe a statistical technique for assigning senses to words. An instance of a word is assigned a sense by asking a question about the context in which the word appears. The question is constructed to have high mutual information with the translation of that instance in another language. When we incorporated this method of assigning senses into our statistical machine translation system, the error rate of the system decreased by thirteen percent.
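A minimal sketch of the core idea, selecting the binary context question with the highest mutual information with a word's translation; the toy corpus and candidate questions below are invented for illustration:

import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information (in bits) between two discrete variables given (x, y) samples."""
    n = len(pairs)
    px, py, pxy = Counter(), Counter(), Counter(pairs)
    for x, y in pairs:
        px[x] += 1
        py[y] += 1
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy data: instances of the French word "prendre" with context features and
# the English translation acting as the sense label.
instances = [
    ({"next_is_decision": True,  "next_is_train": False}, "make"),
    ({"next_is_decision": True,  "next_is_train": False}, "make"),
    ({"next_is_decision": False, "next_is_train": True},  "take"),
    ({"next_is_decision": False, "next_is_train": True},  "take"),
    ({"next_is_decision": False, "next_is_train": False}, "take"),
]

questions = ["next_is_decision", "next_is_train"]
best = max(questions,
           key=lambda q: mutual_information([(ctx[q], sense) for ctx, sense in instances]))
print("most informative question:", best)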

436 citations


Journal ArticleDOI
TL;DR: The authors combine two algorithms for recognizing unconstrained isolated handwritten numerals, utilizing features derived from the profile of the character in a structural configuration.

299 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: Several algorithms are presented that increase the robustness of SPHINX, the CMU (Carnegie Mellon University) continuous-speech speaker-independent recognition system, by normalizing the acoustic space via minimization of the overall VQ distortion.
Abstract: Several algorithms are presented that increase the robustness of SPHINX, the CMU (Carnegie Mellon University) continuous-speech speaker-independent recognition system, by normalizing the acoustic space via minimization of the overall VQ distortion. The authors propose an affine transformation of the cepstrum in which a matrix multiplication performs frequency normalization and a vector addition attempts environment normalization. The algorithms for environment normalization are efficient and improve the recognition accuracy when the system is tested on a microphone other than the one on which it was trained. The frequency normalization algorithm applies a different warping on the frequency axis to different speakers and achieves a 10% decrease in error rate.
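The affine form described in the abstract is y = Ax + b, where the matrix A performs frequency normalization and the offset b attempts environment normalization. A shape-only sketch with made-up parameter values (real systems would estimate A and b by minimizing VQ distortion against the training environment):

import numpy as np

def affine_normalize(cepstra, A, b):
    """Apply y = A x + b to each cepstral frame (one frame per row)."""
    return cepstra @ A.T + b

# Hypothetical 3x3 warping matrix and environment offset, for illustration only.
A = np.array([[1.0, 0.05, 0.0],
              [0.0, 0.95, 0.1],
              [0.0, 0.0,  1.0]])
b = np.array([0.3, -0.1, 0.05])

frames = np.random.randn(4, 3)          # four toy cepstral frames
print(affine_normalize(frames, A, b))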

229 citations


Patent
25 Jun 1991
TL;DR: In this patent, an N-best search is conducted to find the N most likely sentence hypotheses in a spoken language system, where word theories are distinguished based only on the one previous word.
Abstract: As a step in finding the one most likely word sequence in a spoken language system, an N-best search is conducted to find the N most likely sentence hypotheses. During the search, word theories are distinguished based only on the one previous word. At each state within a word, the total probability is calculated for each of a few previous words. At the end of each word, the probability score is recorded for each previous word theory, together with the name of the previous word. At the end of the sentence, a recursive traceback is performed to derive the list of the N best sentences.
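A simplified sketch of the idea, with word theories distinguished only by the one previous word and a traceback that recovers the N best sentences; it uses invented scores and fixed word slots rather than HMM state-level search:

# Toy 3-slot utterance: per-slot acoustic log-scores for a tiny vocabulary,
# plus bigram language-model log-scores. All numbers are invented.
vocab = ["show", "all", "ships", "flights"]
acoustic = [
    {"show": -1.0, "all": -4.0, "ships": -5.0, "flights": -5.0},
    {"show": -4.0, "all": -1.2, "ships": -2.0, "flights": -3.0},
    {"show": -6.0, "all": -4.0, "ships": -1.5, "flights": -1.7},
]
bigram = {("<s>", "show"): -0.5, ("show", "all"): -0.7, ("show", "ships"): -2.0,
          ("show", "flights"): -1.5, ("all", "ships"): -0.9, ("all", "flights"): -1.1,
          ("ships", "flights"): -3.0, ("flights", "ships"): -3.0}

def lm(prev, word):
    return bigram.get((prev, word), -8.0)    # crude back-off for unseen pairs

# Forward pass: a theory at slot t is (word, previous word); keep its best score
# and its best grand-predecessor, exactly the "one previous word" distinction.
theories = [{(w, "<s>"): (acoustic[0][w] + lm("<s>", w), None) for w in vocab}]
for t in range(1, len(acoustic)):
    layer = {}
    for w in vocab:
        for prev in vocab:
            # best way to have ended the previous slot with word `prev`
            best_pp, (best_score, _) = max(
                ((pp, sv) for (pw, pp), sv in theories[t - 1].items() if pw == prev),
                key=lambda kv: kv[1][0])
            layer[(w, prev)] = (best_score + lm(prev, w) + acoustic[t][w], best_pp)
    theories.append(layer)

def traceback(word, prev, t):
    if prev == "<s>":
        return [word]
    return traceback(prev, theories[t][(word, prev)][1], t - 1) + [word]

# Rank the end-of-sentence theories and trace back the N best hypotheses.
N = 3
finals = sorted(theories[-1].items(), key=lambda kv: kv[1][0], reverse=True)[:N]
for (w, prev), (score, _) in finals:
    print(round(score, 2), " ".join(traceback(w, prev, len(theories) - 1)))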

203 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors introduce a new, more efficient algorithm, the word-dependent N-best algorithm, for finding multiple sentence hypotheses, based on the assumption that the beginning time of a word depends only on the preceding word.
Abstract: The authors introduce a new, more efficient algorithm, the word-dependent N-best algorithm, for finding multiple sentence hypotheses. The proposed algorithm is based on the assumption that the beginning time of a word depends only on the preceding word. The authors compare this algorithm with two other algorithms for finding the N-best hypotheses: the exact sentence-dependent method and a computationally efficient lattice N-best method. Although the word-dependent algorithm is computationally much less expensive than the exact algorithm, it appears to result in the same accuracy. The lattice method, which is still more efficient, has a significantly higher error rate. It is demonstrated that algorithms that use Viterbi scoring have significantly higher error rates than those that use total likelihood scoring. >

198 citations


Journal ArticleDOI
TL;DR: These results on a large, high input dimensional problem demonstrate that practical constraints including training time, memory usage, and classification time often constrain classifier selection more strongly than small differences in overall error rate.
Abstract: Results of recent research suggest that carefully designed multilayer neural networks with local receptive fields and shared weights may be unique in providing low error rates on handwritten digit recognition tasks. This study, however, demonstrates that these networks, radial basis function (RBF) networks, and k nearest-neighbor (kNN) classifiers, all provide similar low error rates on a large handwritten digit database. The backpropagation network is overall superior in memory usage and classification time but can provide false positive classifications when the input is not a digit. The backpropagation network also has the longest training time. The RBF classifier requires more memory and more classification time, but less training time. When high accuracy is warranted, the RBF classifier can generate a more effective confidence judgment for rejecting ambiguous inputs. The simple kNN classifier can also perform handwritten digit recognition, but requires a prohibitively large amount of memory and is much slower at classification. Nevertheless, the simplicity of the algorithm and fast training characteristics make the kNN classifier an attractive candidate in hardware-assisted classification tasks. These results on a large, high input dimensional problem demonstrate that practical constraints including training time, memory usage, and classification time often constrain classifier selection more strongly than small differences in overall error rate.
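A minimal nearest-neighbor classifier of the kind compared in the study; synthetic two-class data stands in for the handwritten digit database, which is not reproduced here:

import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training vectors."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

rng = np.random.default_rng(0)
# Two synthetic "digit" classes in a 16-dimensional feature space.
class0 = rng.normal(0.0, 1.0, size=(100, 16))
class1 = rng.normal(1.5, 1.0, size=(100, 16))
train_x = np.vstack([class0, class1])
train_y = np.array([0] * 100 + [1] * 100)

test = rng.normal(1.5, 1.0, size=16)       # a sample drawn from class 1
print("predicted class:", knn_predict(train_x, train_y, test, k=5))

# Note the memory cost the paper highlights: kNN must keep every training
# vector, whereas a trained network only stores its weights.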

187 citations


Journal ArticleDOI
P. Balaban1, J. Salz1
TL;DR: The performance of digital data transmission over frequency-selective fading channels is investigated, and estimates of average attainable error rates and outage probabilities as functions of system parameters are provided.
Abstract: The performance of digital data transmission over frequency-selective fading channels is investigated. For statistically independent diversity paths, estimates of average attainable error rates and outage probabilities as functions of system parameters are provided. The dependences among the important system parameters are exhibited graphically for several examples, including quaternary phase-shift keying (QPSK). In the optimized uncoded QPSK with 1.5 b/s/Hz, two orders of magnitude in outage probability can be gained by diversity reception. When one compares the uncoded average probability of error for the optimized mean squared error (MSE) systems one finds at most an order-of-magnitude difference among the different equalizers investigated except for the zero-forcing equalizer, whose performance is drastically inferior to the others. Again, dual diversity can provide two orders of magnitude improvement in the average error rate or in outage probability for the uncoded optimized systems. >

170 citations


Journal ArticleDOI
TL;DR: A speaker-independent phoneme and word recognition system based on a recurrent error propagation network trained on the TIMIT database is described; analysis of the phoneme recognition results shows that information available from bigram and durational constraints is adequately handled within the network, allowing for efficient parsing of the network output.

170 citations


Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors present a novel technique for obtaining a phonetic transcription for a new word, which is needed to add the new word to the system, using DECtalk's text-to-sound rules.
Abstract: The authors report on the detection of new words for the speaker-dependent and speaker-independent paradigms. A useful operating point in a speaker-dependent paradigm is defined at 71% detection rate and 1% false alarm rate. The authors present a novel technique for obtaining a phonetic transcription for a new word, which is needed to add the new word to the system. The technique utilizes DECtalk's text-to-sound rules to obtain an initial phonetic transcription for the new word. Since these text-to-sound rules are imperfect, a probabilistic transformation technique is used that produces a phonetic pronunciation network of all possible pronunciations given DECtalk's transcription. The network is used to constrain a phonetic recognition process that results in an improved phonetic transcription for the new word. The resulting transcriptions are sufficient for speech recognition purposes. >

Proceedings ArticleDOI
14 Apr 1991
TL;DR: A corrective MMIE training algorithm is introduced, which, when applied to the TI/NIST connected digit database, has made it possible to reduce the string error rate by close to 50%.
Abstract: Recently, Gopalakrishnan et al. (1989) introduced a reestimation formula for discrete HMMs (hidden Markov models) which applies to rational objective functions like the MMIE (maximum mutual information estimation) criterion. The authors analyze the formula and show how its convergence rate can be substantially improved. They introduce a corrective MMIE training algorithm, which, when applied to the TI/NIST connected digit database, has made it possible to reduce the string error rate by close to 50%. Gopalakrishnan's result is extended to the continuous case by proposing a new formula for estimating the mean and variance parameters of diagonal Gaussian densities.
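The reestimation formula referred to above is a growth transform for rational objective functions over probability vectors, of the general form p_i' proportional to p_i(dF/dp_i + C). A small numerical sketch with a toy objective and an arbitrarily chosen constant C (not the paper's corrective training):

import numpy as np

def growth_transform(p, grad, C):
    """One reestimation step: p_i' proportional to p_i * (dF/dp_i + C)."""
    unnorm = p * (grad + C)
    return unnorm / unnorm.sum()

# Toy rational objective F(p) = (a.p) / (b.p), standing in for the MMIE criterion.
a = np.array([2.0, 1.0, 0.5])
b = np.array([1.0, 1.0, 1.0])

def F(p):
    return a @ p / (b @ p)

def grad_F(p):
    return a / (b @ p) - (a @ p) * b / (b @ p) ** 2

p = np.array([1 / 3, 1 / 3, 1 / 3])
for step in range(5):
    # C must be large enough to keep the updated probabilities positive; larger C
    # gives smaller, safer steps (the convergence-rate issue the paper studies).
    p = growth_transform(p, grad_F(p), C=2.0)
    print(step, np.round(p, 3), round(F(p), 4))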

Proceedings ArticleDOI
19 Feb 1991
TL;DR: An investigation into the use of Bayesian learning of the parameters of a multivariate Gaussian mixture density has been carried out and preliminary results applying to HMM parameter smoothing, speaker adaptation, and speaker clustering are given.
Abstract: An investigation into the use of Bayesian learning of the parameters of a multivariate Gaussian mixture density has been carried out. In a continuous density hidden Markov model (CDHMM) framework, Bayesian learning serves as a unified approach for parameter smoothing, speaker adaptation, speaker clustering, and corrective training. The goal of this study is to enhance model robustness in a CDHMM-based speech recognition system so as to improve performance. Our approach is to use Bayesian learning to incorporate prior knowledge into the CDHMM training process in the form of prior densities of the HMM parameters. The theoretical basis for this procedure is presented and preliminary results applying to HMM parameter smoothing, speaker adaptation, and speaker clustering are given. Performance improvements were observed on tests using the DARPA RM task. For speaker adaptation, under a supervised learning mode with 2 minutes of speaker-specific training data, a 31% reduction in word error rate was obtained compared to speaker-independent results. Using Bayesian learning for HMM parameter smoothing and sex-dependent modeling, a 21% error reduction was observed on the FEB91 test.
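A simplified sketch of the Bayesian (MAP) idea applied to a single Gaussian mean: the speaker-independent prior mean is interpolated with the speaker-specific sample mean, weighted by the amount of adaptation data. The prior weight tau and the data below are illustrative:

import numpy as np

def map_adapt_mean(prior_mean, tau, adaptation_frames):
    """MAP estimate of a Gaussian mean: the prior is pulled toward the adaptation data."""
    n = len(adaptation_frames)
    sample_mean = adaptation_frames.mean(axis=0)
    return (tau * prior_mean + n * sample_mean) / (tau + n)

rng = np.random.default_rng(1)
prior_mean = np.zeros(3)                          # from speaker-independent training
true_speaker_mean = np.array([0.8, -0.5, 0.3])    # what this speaker really sounds like

for n_frames in (5, 50, 500):
    frames = rng.normal(true_speaker_mean, 1.0, size=(n_frames, 3))
    adapted = map_adapt_mean(prior_mean, tau=20.0, adaptation_frames=frames)
    print(n_frames, np.round(adapted, 2))
# With little data the estimate stays near the prior; with more data it converges
# to the speaker-specific mean, which is the smoothing behavior the abstract describes.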

Journal ArticleDOI
TL;DR: Diversity combining and majority-logic decoding are combined to create a simple but powerful hybrid automatic repeat request (ARQ) error control scheme that is an ideal choice for high-data-rate error control over both stationary and nonstationary channels.
Abstract: Diversity combining and majority-logic decoding are combined to create a simple but powerful hybrid automatic repeat request (ARQ) error control scheme. Forward-error-correcting (FEC) majority-logic decoders are modified for use in type-I hybrid-ARQ protocols through the identification of reliability information within the decoding process. Diversity combining is then added to reduce the number of retransmissions and their consequent impact on throughput performance. Packet combining has the added benefit of adapting the effective code rate to channel conditions. Excellent reliability performance coupled with a simple high-speed implementation makes the majority-logic system an ideal choice for high-data-rate error control over both stationary and nonstationary channels.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: Several techniques that substantially reduced SPHINX's error rate are presented, which include dynamic features, semicontinuous hidden Markov models, speaker clustering, and the shared distribution modeling.
Abstract: The authors report recent efforts to further improve the performance of the SPHINX system for speaker-independent continuous speech recognition. They adhere to the basic architecture of the SPHINX system and use the DARPA resource management task and training corpus. The improvements are evaluated on the 600 sentences that comprise the DARPA February and October 1989 test sets. Several techniques that substantially reduced SPHINX's error rate are presented. These techniques include dynamic features, semicontinuous hidden Markov models, speaker clustering, and the shared distribution modeling. The error rate of the baseline system was reduced by 45%. >

PatentDOI
TL;DR: In this patent, a method and apparatus for natural language processing using morphology and rhyming are presented, employing a hybrid of dictionary-based and rule-based approaches for both speech synthesis and speech recognition.
Abstract: A method and apparatus for natural language processing using morphology and rhyming. The method and apparatus employ a hybrid of dictionary-based and rule-based approaches for both speech synthesis and speech recognition. In an illustrative embodiment of the present invention the pronunciation of a word is determined by rhyming the word, or components of the word, with a reference word, or components of the reference word. In another illustrative embodiment of the present invention, the spelling of a word is determined by rhyming the word, or components of the word, with a reference word, or components of the reference word.

Journal ArticleDOI
TL;DR: A computer system for recognizing printed Arabic text in multiple fonts is described, containing four components: acquisition, segmentation, character recognition, and word recognition; the proposed morphologically based method increases the word recognition rate by 6.97%.
Abstract: We describe a computer system for recognizing printed Arabic text in multiple fonts. The system contains four components: acquisition, segmentation, character recognition and word recognition. The word recognition component includes two sub-systems: a morphological spelling checker/corrector, and a morphological word recognizer. With a 95.5% character recognition rate corresponding to an 81.58% word recognition rate, tests have resulted in an increase in the rate of word recognition by 6.97% using the proposed morphologically based method.

Journal ArticleDOI
TL;DR: It is demonstrated that the approximations of perfect optical-to-electrical conversion and Gaussian-distributed MAI yield a poor approximation to the minimum error rate and an underestimate of the optimal threshold; arbitrarily tight bounds are developed on the error rate for unequal energies per bit.
Abstract: A model noncoherent, optical, asynchronous, CDMA system is described. The error rate for a single-user matched-filter receiver that is valid for arbitrary photomultipliers and signature sequence sets, adheres to the semiclassical model of light, and does not depend on approximations for large user groups, strong received optical fields, or chip synchronism is analyzed. The exact minimum probability of error and optimal threshold are compared to those obtained with user-synchronism and multiple-access interference (MAI) distribution approximations. For the special case of unity-gain photodetectors and prime sequences, it is shown that the approximation of chip synchronism yields a weak upper bound on the exact error rate. It is demonstrated that the approximations of perfect optical-to-electrical conversion and Gaussian-distributed MAI yield a poor approximation to the minimum error rate and an underestimate of the optimal threshold. Arbitrarily tight bounds are developed on the error rate for unequal energies per bit. In the case when the signal energies coincide, these bounding expressions are considerably easier to compute than the exact error rate. >

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX, and extended it to speaker-dependent speech recognition, which demonstrated a substantial difference between speaker-dependent and speaker-independent systems.
Abstract: The DARPA Resource Management task is used as the domain to investigate the performance of speaker-independent, speaker-dependent, and speaker-adaptive speech recognition. The authors already have a state-of-the-art speaker-independent speech recognition system, SPHINX. The error rate for the RM2 test set is 4.3%. They extended SPHINX to speaker-dependent speech recognition. The error rate is reduced to 1.4-2.6% with 600-2400 training sentences for each speaker, which demonstrated a substantial difference between speaker-dependent and speaker-independent systems. Based on speaker-independent models, a study was made of speaker-adaptive speech recognition. With 40 adaptation sentences for each speaker, the error rate can be reduced from 4.3% to 3.1%.

Proceedings ArticleDOI
04 Nov 1991
TL;DR: A novel technique for estimating a robust time-varying spectrum, RASTA-PLP (relative spectral-perceptual linear predictive), is described, which shows an order-of-magnitude improvement in error rate over conventional spectral estimation techniques such as LPC (linear prediction coding) or PLP.
Abstract: A novel technique for estimating a robust time-varying spectrum, RASTA-PLP (relative spectral-perceptual linear predictive), is described. A large test was conducted on a speaker-independent telephone digit recognition task using speech that had been corrupted with convolutional noise. Results from this test show an order-of-magnitude improvement in error rate over conventional spectral estimation techniques such as LPC (linear prediction coding) or PLP. Results from similar tests with large vocabulary continuous speech recognition show a halving of error rate with no tuning for the new problem. >
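A schematic sketch of the relative-spectral idea: each log-spectral band trajectory is band-pass filtered over time, so a fixed convolutional distortion (a constant offset in the log domain) is largely removed. The filter coefficients below are the commonly cited RASTA values and should be treated as approximate:

import numpy as np
from scipy.signal import lfilter

# Approximate RASTA band-pass filter: suppresses very slow and very fast
# changes in each log-spectral trajectory.
b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
a = np.array([1.0, -0.98])

def rasta_filter(log_spectra):
    """Filter each spectral band's trajectory over time (axis 0 = frames)."""
    return lfilter(b, a, log_spectra, axis=0)

rng = np.random.default_rng(2)
frames = rng.normal(size=(200, 8))            # toy log band energies, 200 frames x 8 bands
tilted = frames + np.linspace(0.0, 3.0, 8)    # a fixed spectral tilt, i.e. convolutional noise

# After the filter's start-up transient, the constant per-band offset is largely
# removed, so clean and channel-distorted features become nearly identical.
diff = np.abs(rasta_filter(tilted) - rasta_filter(frames))
print("residual difference in the final frames:", float(diff[-20:].max()))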

Proceedings ArticleDOI
14 Apr 1991
TL;DR: Improvements to an HMM (hidden Markov model) based continuous speech recognition system lead to a recognition system which gives a 95% word accuracy and 75% sentence accuracy for speaker-independent recognition of the 1000-word DARPA resource management task using the standard word pair grammar.
Abstract: The authors report recent improvements to an HMM (hidden Markov model) based continuous speech recognition system. These advances, which include the incorporation of interword context-dependent units and position-dependent units and an improved feature analysis, lead to a recognition system which gives a 95% word accuracy and 75% sentence accuracy for speaker-independent recognition of the 1000-word DARPA resource management task using the standard word pair grammar (with a perplexity of about 60). With the improved acoustic modeling of subword units, the overall error rate reduction was over 42% compared with the performance results reported in the baseline system. The best results obtained so far using the word pair grammar gave 95.2% average word accuracy for the three DARPA evaluation sets.

Journal ArticleDOI
Charles C. Tappert1
TL;DR: An on-line character recognizer was improved by making trade-offs on computation speed, recognition accuracy, and flexibility, and the error rate was halved by increasing computation precision at the expense of speed.
Abstract: An on-line character recognizer was improved by making trade-offs on computation speed, recognition accuracy, and flexibility. Firstly, the error rate was halved by increasing computation precision at the expense of speed, achieving a character recognition accuracy of 97.3 percent. Secondly, without loss of accuracy, computation speed was increased by an order of magnitude over the original speed, to 85 characters/second on the IBM System 370/Model 3081. This was done by using a fast linear match to narrow the character choices for a slow but accurate elastic (non-linear) match. Some loss in flexibility resulted, however, because the linear match requires the same number of handwritten strokes for input and prototype characters. Also, the error rate of elastic matching was found to be about half that of linear matching.
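A generic dynamic-time-warping sketch of an elastic match on toy stroke profiles (not Tappert's implementation), contrasted with the cheap linear match that requires equal lengths:

import numpy as np

def linear_match(a, b):
    """Cheap linear match: point-by-point distance (requires equal lengths)."""
    return float(np.abs(a - b).sum())

def elastic_match(a, b):
    """Elastic (DTW-style) match: allows points to stretch or compress in time."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# Toy 1-D stroke profiles: the input is a time-stretched version of the prototype.
prototype = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
written = np.array([0.0, 0.5, 1.0, 2.0, 2.0, 1.0, 0.0])
same_length = np.array([0.0, 1.0, 2.5, 1.0, 0.0])

print("linear match cost:", linear_match(same_length, prototype))
print("elastic match cost:", elastic_match(written, prototype))
# A fast linear pre-match can shortlist prototypes with the same stroke count
# before running the expensive elastic match, the speed/accuracy trade-off
# described in the abstract.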

Proceedings ArticleDOI
19 Feb 1991
TL;DR: This paper presents speech recognition test results from the BBN BYBLOS system on the Feb 91 DARPA benchmarks in both the Resource Management (RM) and the Air Travel Information System (ATIS) domains.
Abstract: This paper presents speech recognition test results from the BBN BYBLOS system on the Feb 91 DARPA benchmarks in both the Resource Management (RM) and the Air Travel Information System (ATIS) domains. In the RM test, we report on speaker-independent (SI) recognition performance for the standard training condition using 109 speakers and for our recently proposed SI model made from only 12 training speakers. Surprisingly, the 12-speaker model performs as well as the one made from 109 speakers. Also within the RM domain, we demonstrate that state-of-the-art SI models perform poorly for speakers with strong dialects. But we show that this degradation can be overcome by using speaker adaptation from multiple-reference speakers. For the ATIS benchmarks, we ran a new system configuration which first produced a rank-ordered list of the N-best word-sequence hypotheses. The list of hypotheses was then reordered using more detailed acoustic and language models. In the ATIS benchmarks, we report SI recognition results on two conditions. The first is a baseline condition using only training data available from NIST on CD-ROM and a word-based statistical bi-gram grammar developed at MIT/Lincoln. In the second condition, we added training data from speakers collected at BBN and used a 4-gram class grammar. These changes reduced the word error rate by 25%.

Journal ArticleDOI
TL;DR: A method based on calculation of the similarity error is shown for estimating the probable character recognition error rate related to the optical transfer functions and sampling performance of the scanner, with the aim of characterizing the whole image transfer process.
Abstract: A new definition of picture degradation error caused by the sampling-transferring process in image scanners is described. The aim of this new error definition is to characterize the whole image transfer process. A method is shown based on calculation of this similarity error (which could also be called spectral distortion error) to estimate the possible character recognition error rate related to the optical transfer functions and sampling performance of the scanner. The increase of the probable reading error can be sufficiently estimated by the similarity error. Theoretical and experimental results are in good agreement. The different character types, scanners, and printer outputs can be characterized by this method, as well.

Proceedings ArticleDOI
14 Apr 1991
TL;DR: The theoretical basis is presented and various aspects of using these models are discussed; a large number of experiments on a 104-talker British English E-set database illustrate the utility of the method on a difficult speech recognition task.
Abstract: Models similar to Doddington's (1989, 1990) hidden Markov models (HMMs) that use phonetically sensitive discriminants are discussed. In this style of HMM, each state models a subspace of the overall acoustic vector; the subspace is chosen to increase discrimination between the in-class and potentially confusable out-of-class utterances. The theoretical basis is presented and various aspects of using these models are discussed, such as the method of gathering confusion statistics; obtaining the correct normalization for the subspace Gaussian distribution and the effects of this term; and the computational requirements for the method. A large number of experiments on a 104-talker British English E-set database were performed that illustrate the utility of the method on a difficult speech recognition task. The experiments give a best speaker-independent error rate of 7.9%, and a best multiple-speaker error rate of 3.8%.

Proceedings ArticleDOI
19 Feb 1991
TL;DR: The CMU Phoenix system is an experiment in understanding spontaneous speech; it uses a bigram language model with the Sphinx speech recognition system and applies grammatical constraints at the phrase level, using semantic rather than lexical grammars.
Abstract: The CMU Phoenix system is an experiment in understanding spontaneous speech. It has been implemented for the Air Travel Information Service task. In this task, casual users are asked to obtain information from a database of air travel information. Users are not given a vocabulary, grammar or set of sentences to read. They compose queries themselves in a spontaneous manner. This task presents speech recognizers with many new problems compared to the Resource Management task. Not only is the speech not fluent, but the vocabulary and grammar are open. Also, the task is not just to produce a transcription, but to produce an action: retrieving data from the database. Taking such actions requires parsing and "understanding" the utterance. Word error rate is not as important as utterance understanding rate. Phoenix attempts to deal with phenomena that occur in spontaneous speech. Unknown words, restarts, repeats, and poorly formed or unusual grammar are common in spontaneous speech and are very disruptive to standard recognizers. These events lead to misrecognitions which often cause a total parse failure. Our strategy is to apply grammatical constraints at the phrase level and to use semantic rather than lexical grammars. Semantics provide more constraint than parts of speech and must ultimately be dealt with in order to take actions. Applying constraints at the phrase level is more flexible than recognizing sentences as a whole while providing much more constraint than word-spotting. Restarts and repeats are most often between phrase occurrences, so individual phrases can still be recognized correctly. Poorly constructed grammar often consists of well-formed phrases, and is often semantically well-formed. It is only syntactically incorrect. We associate phrases by frame-based semantics. Phrases represent word strings that can fill slots in frames. The slots represent information which the frame is able to act on. The current Phoenix system uses a bigram language model with the Sphinx speech recognition system. The top-scoring word string is passed to a flexible frame-based parser. The parser assigns phrases (word strings) from the input to slots in frames. The slots represent information content needed for the frame. A beam of frame hypotheses is produced and the best scoring one is used to produce an SQL query.
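A very small sketch of the frame-and-slot idea: phrase patterns are matched anywhere in the possibly ill-formed word string and fill slots in a frame, so restarts and stray words between phrases do not break the parse. The patterns and frame below are invented, not the Phoenix grammars:

import re

# Hypothetical semantic phrase patterns; each match fills one slot of a frame.
slot_patterns = {
    "origin":      re.compile(r"from (boston|denver|dallas)"),
    "destination": re.compile(r"to (boston|denver|dallas)"),
    "depart_time": re.compile(r"(morning|afternoon|evening)"),
}

def parse_to_frame(word_string):
    """Fill frame slots from any phrases found in the word string."""
    frame = {}
    for slot, pattern in slot_patterns.items():
        match = pattern.search(word_string)
        if match:
            frame[slot] = match.group(1)
    return frame

# A spontaneous, partly ill-formed utterance with a restart ("i want i need"):
utterance = "uh i want i need a flight from boston to denver in the morning please"
print(parse_to_frame(utterance))
# -> {'origin': 'boston', 'destination': 'denver', 'depart_time': 'morning'}
# A later stage would turn the filled frame into an SQL query over the flight database.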

Journal ArticleDOI
TL;DR: An automatic threshold setting algorithm is presented, and a paper on single-utterance word boundary detection in a quiet environment is corrected, promising great improvement in the accuracy of fully automatic word recognition and assistance in the hand labeling of endpoints.
Abstract: Robust word boundary detection remains an unsolved problem. An automatic threshold setting algorithm is presented, and a paper on single-utterance word boundary detection in a quiet environment is corrected, promising great improvement in the accuracy of fully automatic word recognition and assistance in the hand labeling of endpoints. The original paper was by L.F. Lamel et al. (see IEEE Trans. Acoust. Speech Signal Process., vol.29, no.4, p.777-85, 1981). >
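A generic energy-based endpoint detector with its threshold set automatically from the first few frames, assumed to contain only background noise; this is only a schematic illustration of automatic threshold setting, not the corrected algorithm itself:

import numpy as np

def detect_word_boundaries(signal, frame_len=160, noise_frames=10, k=3.0):
    """Return (start, end) frame indices where energy exceeds an automatic threshold."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    # Threshold derived from the leading frames, assumed to be background noise only.
    noise = energy[:noise_frames]
    threshold = noise.mean() + k * noise.std()
    above = np.where(energy > threshold)[0]
    if len(above) == 0:
        return None
    return int(above[0]), int(above[-1])

rng = np.random.default_rng(3)
sr = 8000
noise = rng.normal(0.0, 0.01, size=sr)                          # background noise
word = 0.3 * np.sin(2 * np.pi * 440 * np.arange(sr // 2) / sr)  # a 0.5 s toy "word"
signal = np.concatenate([noise[:4000], word + noise[:4000], noise[:4000]])

print("word spans frames:", detect_word_boundaries(signal))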

Journal ArticleDOI
TL;DR: In the data analyzed, no trend has been observed in the Fritchman parameter as a function of the type of area or of vehicle speed.
Abstract: A statistical characterization of transmission errors for digital mobile radio links at 910 MHz in different types of urban areas is presented. Error sequences for messages of 2-s duration at a nominal rate of 20-kb/s for binary-phase-shift-keying (BPSK) demodulated transmissions have been obtained and characterized by Fritchman's channel model, including one error state and two to four error-free states. In the data analyzed, no trend has been observed in the Fritchman parameter as a function of the type of area or of vehicle speed. Typical Fritchman's parameters are established for low and intermediate error rate channels (on the order of 10/sup -3/ and 10/sup -2/ error rates) and for higher error rate channels. Results are also presented for a limited sample of quadrature-phase-shift-keying (QPSK) demodulated 40-kb/s transmissions. >
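A small simulation of a Fritchman-style partitioned Markov model with one error state and two error-free states; the transition probabilities below are invented rather than the measured parameters from the paper:

import numpy as np

# States 0 and 1 are error-free; state 2 is the single error state.
# Rows of P are hypothetical transition probabilities between states.
P = np.array([
    [0.990, 0.005, 0.005],   # a long-lived "good" error-free state
    [0.050, 0.900, 0.050],   # a shorter-lived error-free state
    [0.300, 0.200, 0.500],   # the error state (bursty: tends to persist)
])
error_state = 2

def simulate_error_sequence(n_bits, rng):
    """Generate a 0/1 error sequence from the partitioned Markov chain."""
    state, errors = 0, np.zeros(n_bits, dtype=int)
    for i in range(n_bits):
        errors[i] = 1 if state == error_state else 0
        state = rng.choice(3, p=P[state])
    return errors

rng = np.random.default_rng(4)
errors = simulate_error_sequence(40000, rng)   # 2 s of a 20 kb/s link, as in the paper
print("simulated bit error rate:", errors.mean())
# Bursty error sequences like this are what the Fritchman parameters summarize
# for the measured 910 MHz mobile radio channels.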

Proceedings ArticleDOI
25 Feb 1991
TL;DR: A multiple neural network system (MNNS) for image-based character recognition that minimizes the number of free parameters which must be determined by the training set, leading to rapid training and robust recognition.
Abstract: A multiple neural network system (MNNS) for image-based character recognition is presented. The architecture employs network designs consisting of two levels of fixed feature extraction, followed by a three-layer feedforward perceptron for classification. A multiple network architecture is used to combine network responses. This design minimizes the number of free parameters which must be determined by the training set, leading to rapid training and robust recognition. In comparison to a single network trained with back propagation on zip code digits, the MNNS performs significantly better in terms of error rate and reject rate. >