
Showing papers on "Word error rate published in 1999"


Journal ArticleDOI
TL;DR: A novel classification method for face recognition, called the nearest feature line (NFL), classifies a query feature point by its nearest distance to each feature line (FL) and achieves the lowest error rate reported for the ORL face database.
Abstract: We propose a classification method, called the nearest feature line (NFL), for face recognition. Any two feature points of the same class (person) are generalized by the feature line (FL) passing through the two points. The derived FL can capture more variations of face images than the original points and thus expands the capacity of the available database. The classification is based on the nearest distance from the query feature point to each FL. With a combined face database, the NFL error rate is about 43.7-65.4% of that of the standard eigenface method. Moreover, the NFL achieves the lowest error rate reported to date for the ORL face database.
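
A minimal sketch of the NFL decision rule described above, assuming feature vectors (e.g. eigenface projections) have already been extracted and each class has at least two prototype points; the helper names and toy data are illustrative, not from the paper:

```python
import numpy as np

def point_to_line_distance(x, x1, x2):
    """Distance from query point x to the feature line through prototypes x1, x2.

    The projection p = x1 + t * (x2 - x1) may fall outside the segment
    (t outside [0, 1]), which is how the FL generalizes beyond the two
    stored points.
    """
    d = x2 - x1
    t = np.dot(x - x1, d) / np.dot(d, d)
    return np.linalg.norm(x - (x1 + t * d))

def nfl_classify(x, prototypes):
    """prototypes: dict mapping class label -> array of shape (n_i, dim)."""
    best_label, best_dist = None, np.inf
    for label, pts in prototypes.items():
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                dist = point_to_line_distance(x, pts[i], pts[j])
                if dist < best_dist:
                    best_label, best_dist = label, dist
    return best_label

# Toy usage with made-up 2-D features.
prototypes = {
    "person_A": np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.1]]),
    "person_B": np.array([[5.0, 0.0], [6.0, -1.0], [7.0, -2.0]]),
}
print(nfl_classify(np.array([3.0, 3.0]), prototypes))  # -> person_A
```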

555 citations


Proceedings Article
01 Jan 1999
TL;DR: A new algorithm for finding the hypothesis in a recognition lattice that is expected to minimize the word error rate (WER) is described, which overcomes the mismatch between the word-based performance metric and the standard MAP scoring paradigm that is sentence-based.
Abstract: We describe a new algorithm for finding the hypothesis in a recognition lattice that is expected to minimize the word error rate (WER). Our approach thus overcomes the mismatch between the word-based performance metric and the standard MAP scoring paradigm that is sentence-based, and that can lead to sub-optimal recognition results. To this end we first find a complete alignment of all words in the recognition lattice, identifying mutually supporting and competing word hypotheses. Finally, a new sentence hypothesis is formed by concatenating the words with maximal posterior probabilities. Experimentally, this approach leads to a significant WER reduction in a large vocabulary recognition task.
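
The lattice alignment is the involved part of the algorithm; the sketch below, under the simplifying assumption that competing word hypotheses have already been aligned into slots with posterior probabilities (slot contents invented), only illustrates the final word-selection step:

```python
# Each slot holds competing word hypotheses and their posteriors; "" marks a
# possible deletion. The minimum-expected-WER hypothesis is formed by taking
# the highest-posterior entry of every slot, rather than the single best path.
aligned_slots = [
    {"the": 0.90, "a": 0.10},
    {"meeting": 0.55, "meetings": 0.40, "": 0.05},
    {"starts": 0.60, "started": 0.40},
]

hypothesis = [max(slot, key=slot.get) for slot in aligned_slots]
print(" ".join(w for w in hypothesis if w))  # -> "the meeting starts"
```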

268 citations


Proceedings ArticleDOI
15 Mar 1999
TL;DR: This work analyzes how the interaction between the recognition process and the translation process can be modelled and suggests two new methods, the local averaging approximation and the monotone alignments.
Abstract: In speech translation, we are faced with the problem of how to couple the speech recognition process and the translation process. Starting from the Bayes decision rule for speech translation, we analyze how the interaction between the recognition process and the translation process can be modelled. In the light of this decision rule, we discuss the already existing approaches to speech translation. None of the existing approaches seems to have addressed this direct interaction. We suggest two new methods, the local averaging approximation and the monotone alignments.

194 citations


Journal ArticleDOI
TL;DR: This work mainly deals with the various methods that were proposed to realize the core of recognition in a word recognition system, and classifies the field into three categories: segmentation-free methods, which compare a sequence of observations derived from a word image with similar references of words in the lexicon; segmentation-based methods, which look for the best match between consecutive sequences of primitive segments and letters of a possible word; and perception-oriented methods, which perform a human-like reading technique.
Abstract: We review the field of offline cursive word recognition. We mainly deal with the various methods that were proposed to realize the core of recognition in a word recognition system. These methods are discussed in view of the two most important properties of such a system: the size and nature of the lexicon involved, and whether or not a segmentation stage is present. We classify the field into three categories: segmentation-free methods, which compare a sequence of observations derived from a word image with similar references of words in the lexicon; segmentation-based methods, that look for the best match between consecutive sequences of primitive segments and letters of a possible word; and the perception-oriented approach, that relates to methods that perform a human-like reading technique, in which anchor features found all over the word are used to boot-strap a few candidates for a final evaluation phase.

184 citations


Book ChapterDOI
01 Jan 1999
TL;DR: Because word_align and char_align were designed to work robustly on texts that are smaller and more noisy than the Hansards, it has been possible to successfully deploy the programs at AT&T Language Line Services, a commercial translation service, to help them with difficult terminology.
Abstract: We have developed a new program called word_align for aligning parallel text, text such as the Canadian Hansards that are available in two or more languages. The program takes the output of char_align (Church, 1993), a robust alternative to sentence-based alignment programs, and applies word-level constraints using a version of Brown et al.’s Model 2 (Brown et al., 1993), modified and extended to deal with robustness issues. Word_align was tested on a subset of Canadian Hansards supplied by Simard (Simard et al., 1992). The combination of word_align plus char_align reduces the variance (average square error) by a factor of 5 over char_align alone. More importantly, because word_align and char_align were designed to work robustly on texts that are smaller and more noisy than the Hansards, it has been possible to successfully deploy the programs at AT&T Language Line Services, a commercial translation service, to help them with difficult terminology.

163 citations


Journal ArticleDOI
TL;DR: These new techniques are shown to simultaneously achieve a multiplicative data-rate advantage and lower error rate as compared to conventional coded orthogonal-frequency division multiplexing in Rayleigh fading.
Abstract: This paper explores the improvement in information capacity and practical data rate that is possible with adaptive antenna technology applied to wireless-multipath communication channels. Whereas the conventional view is that multipath-signal propagation is an impediment to reliable communication, this paper shows that multipath can actually multiply the achievable data rate for wireless channels provided that the appropriate communication structure is employed. Multivariate discrete multitone (MDMT) combined with multivariate trellis-coded modulation (MTCM) is proposed and analyzed as a practical means of realizing a multiplicative-rate advantage in the case where channel-state information is not available at the transmitter. In Rayleigh fading, these new techniques are shown to simultaneously achieve a multiplicative data-rate advantage and lower error rate as compared to conventional coded orthogonal-frequency division multiplexing. Optimal minimum mean square error (MMSE), adaptive MDMT channel-estimation techniques are derived. The effects of channel-estimation error on MTCM are analyzed.

163 citations


Patent
Joseph N Butler, Mark Anthony Hughes, Neil O Fanning, Eugene O'Neill, Una Quinlan
09 Nov 1999
TL;DR: In this article, an onboard detector for detecting an error rate for each channel is used to select the channel having the lowest error rate as the control channel and optionally to render at least the channel with the highest error rate inactive.
Abstract: A high speed link between chips, comprising a multiplicity of synchronous serial data channels, includes an onboard detector for detecting an error rate for each channel. The transmitter and the receiver chips are configured in response to the detector to select the channel having the lowest error rate as the control channel and, optionally, to render at least the channel with the highest error rate inactive.
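
A toy software sketch of the selection logic claimed above, with invented channel statistics (in the patent this is done by on-chip configuration logic, not code):

```python
def configure_link(error_rates, deactivate_worst=True):
    """error_rates: dict mapping channel id -> measured error rate.

    Returns (control_channel, active_channels): the cleanest channel becomes
    the control channel; optionally the noisiest channel is taken out of service.
    """
    control = min(error_rates, key=error_rates.get)
    active = set(error_rates)
    if deactivate_worst and len(active) > 1:
        active.discard(max(error_rates, key=error_rates.get))
    return control, active

print(configure_link({0: 1e-9, 1: 3e-7, 2: 5e-12}))  # -> (2, {0, 2})
```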

133 citations


Dissertation
01 Jan 1999
TL;DR: Evidence is presented indicating that recognition performance can be significantly improved through a contrasting approach using more detailed and more diverse acoustic measurements, referred to as heterogeneous measurements; the thesis also aims to increase understanding of the weaknesses of current automatic phonetic classification systems.
Abstract: The acoustic-phonetic modeling component of most current speech recognition systems calculates a small set of homogeneous frame-based measurements at a single, fixed time-frequency resolution. This thesis presents evidence indicating that recognition performance can be significantly improved through a contrasting approach using more detailed and more diverse acoustic measurements, which we refer to as heterogeneous measurements. This investigation has three principal goals. The first goal is to develop heterogeneous acoustic measurements to increase the amount of acoustic-phonetic information extracted from the speech signal. Diverse measurements are obtained by varying the time-frequency resolution, the spectral representation, the choice of temporal basis vectors, and other aspects of the preprocessing of the speech waveform. The second goal is to develop classifier systems for successfully utilizing high-dimensional heterogeneous acoustic measurement spaces. This is accomplished through hierarchical and committee-based techniques for combining multiple classifiers. The third goal is to increase understanding of the weaknesses of current automatic phonetic classification systems. This is accomplished through perceptual experiments on stop consonants which facilitate comparisons between humans and machines. Systems using heterogeneous measurements and multiple classifiers were evaluated in phonetic classification, phonetic recognition, and word recognition tasks. On the TIMIT core test set, these systems achieved error rates of 18.3% and 24.4% for context-independent phonetic classification and context-dependent phonetic recognition, respectively. These results are the best that we have seen reported on these tasks. Word recognition experiments using the corpus associated with the JUPITER telephone-based weather information system showed 10–16% word error rate reduction, thus demonstrating that these techniques generalize to word recognition in a telephone-bandwidth acoustic environment. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

115 citations


Journal ArticleDOI
TL;DR: Performance benefits have been demonstrated from incorporating a linear trajectory description and additionally from modelling variability in the mid-point parameter, and theoretical and experimental comparisons between different types of PTSHMMs, simpler SHMMs and conventional HMMs are presented.

108 citations


Proceedings ArticleDOI
15 Mar 1999
TL;DR: A faster accent classification approach using phoneme-class models is proposed and it is shown how to rapidly transform a native accent pronunciation dictionary to that for accented speech by simply using knowledge of the native language of the foreign speaker.
Abstract: The performance of speech recognition systems degrades when speaker accent is different from that in the training set. Accent-independent or accent-dependent recognition both require collection of more training data. In this paper, we propose a faster accent classification approach using phoneme-class models. We also present our findings in acoustic features sensitive to a Cantonese accent, and possibly other Asian language accents. In addition, we show how we can rapidly transform a native accent pronunciation dictionary to that for accented speech by simply using knowledge of the native language of the foreign speaker. The use of this accent-adapted dictionary reduces recognition error rate by 13.5%, similar to the results obtained from a longer, data-driven process.

106 citations


Journal ArticleDOI
TL;DR: It is established that biometric identification systems can be used in populations of 100 million people by deriving equations for false-match and false-nonmatch error-rate prediction under the simplifying but limiting assumption of statistical independence of all errors.
Abstract: We derive equations for false-match and false-nonmatch error-rate prediction for the general M-to-N biometric identification system, under the simplifying, but limiting, assumption of statistical independence of all errors. For systems with large N, error rates are shown to be linked to the hardware processing speed through the system penetration coefficient and the throughput equation. These equations are somewhat limited in their ability to handle sample-dependent decision policies and are shown to be consistent with previously published cases for verification and identification. Applying parameters consistent with the Philippine Social Security System benchmark test results for AFIS vendors, we establish that biometric identification systems can be used in populations of 100 million people. Development of more generalized equations, accounting for error correlation and general sample-dependent thresholds, establishing confidence bounds, and substituting the inter-template for the impostor distribution under the template generating policy remain for future study.
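
The paper's own equations are not reproduced in the abstract; the sketch below uses the standard independence-based approximation (the same simplifying assumption the abstract names) to show how the single-comparison false-match rate, the penetration coefficient, and the database size interact at the 100-million scale. All numbers are illustrative:

```python
# Under independence, a one-to-N search that penetrates a fraction P of an
# N-record database makes about P*N comparisons, so the system-level
# false-match rate is FMR_system = 1 - (1 - fmr)**(P * N).
def system_false_match_rate(fmr, penetration, n_records):
    return 1.0 - (1.0 - fmr) ** (penetration * n_records)

fmr = 1e-7           # single-comparison false-match rate (illustrative)
penetration = 0.05   # fraction of the database actually searched (binning)
n = 100_000_000      # population of 100 million

print(f"system FMR ≈ {system_false_match_rate(fmr, penetration, n):.3f}")
# ≈ 0.393: even a 1e-7 per-comparison FMR gives roughly a 39% chance of at
# least one false match per query at this scale, which is why very low
# single-comparison error rates and low penetration coefficients matter.
```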

Proceedings ArticleDOI
21 Sep 1999
TL;DR: A new formulation for the pair-wise error probability for any coherently demodulated system in flat Rayleigh fading is provided, finding that the resulting error rate expression is a polynomial function of the eigenvalues of a 'signal' matrix.
Abstract: This paper provides a new formulation for the pair-wise error probability for any coherently demodulated system in flat Rayleigh fading. The novelty of the result is that the resulting error rate expression is a polynomial function of the eigenvalues of a 'signal' matrix. This view also enables a simple new asymptotically tight bound on the pair-wise error probability. Examples of single and multiple transmit antenna systems are considered.
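
For orientation only (the paper's new polynomial expression is not reproduced here): the familiar Chernoff-style bound for coherent detection in flat Rayleigh fading with a single receive antenna already shows how the nonzero eigenvalues of the codeword-difference ('signal') matrix govern the pairwise error probability, with E_s the symbol energy and N_0 the noise spectral density:

```latex
P(\mathbf{c} \rightarrow \mathbf{e})
  \;\le\; \prod_{i=1}^{r} \Bigl(1 + \lambda_i \tfrac{E_s}{4N_0}\Bigr)^{-1}
  \;\approx\; \Bigl(\prod_{i=1}^{r} \lambda_i\Bigr)^{-1}
              \Bigl(\tfrac{E_s}{4N_0}\Bigr)^{-r}
  \quad \text{at high SNR},
```

where r is the rank of the signal matrix; the diversity order r and the eigenvalue product are exactly the quantities that the polynomial formulation makes explicit.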

Proceedings ArticleDOI
15 Mar 1999
TL;DR: The 1998 HTK large vocabulary speech recognition system for conversational telephone speech, used in the NIST 1998 Hub5E evaluation, includes reduced bandwidth analysis, side-based cepstral feature normalisation, vocal tract length normalisation (VTLN), triphone and quinphone hidden Markov models (HMMs) built using speaker adaptive training (SAT), maximum likelihood linear regression (MLLR) speaker adaptation, and confidence-score-based system combination.
Abstract: This paper describes the 1998 HTK large vocabulary speech recognition system for conversational telephone speech as used in the NIST 1998 Hub5E evaluation. Front-end and language modelling experiments conducted using various training and test sets from both the Switchboard and Callhome English corpora are presented. Our complete system includes reduced bandwidth analysis, side-based cepstral feature normalisation, vocal tract length normalisation (VTLN), triphone and quinphone hidden Markov models (HMMs) built using speaker adaptive training (SAT), maximum likelihood linear regression (MLLR) speaker adaptation and a confidence score based system combination. A detailed description of the complete system together with experimental results for each stage of our multi-pass decoding scheme is presented. The word error rate obtained is almost 20% better than our 1997 system on the development set.

Proceedings Article
01 Jan 1999
TL;DR: This paper reports on data collected during a study of three commercially available ASR systems, showing that initial users of speech systems tend to fixate on a single strategy for error correction; this tendency, coupled with application assumptions about how error correction features will be used, makes for a very frustrating and unsatisfying user experience.
Abstract: Automatic Speech Recognition (ASR) systems have improved greatly over the last three decades. However, even with 98% reported accuracy, error correction still consumes a significant portion of user effort in text creation tasks. We report on data collected during a study of three commercially available ASR systems that show how initial users of speech systems tend to fixate on a single strategy for error correction. This tendency, coupled with application assumptions about how error correction features will be used, combines to make a very frustrating and unsatisfying user experience. We observe two distinct error correction patterns: spiral depth (Oviatt & van Gent, 1996) and cascades. In contrast, users with more extensive experience learn to switch correction strategies more quickly.

Journal ArticleDOI
TL;DR: This study proposes a new approach which combines stress classification and speech recognition functions into one algorithm by generalizing the one-dimensional (1-D) hidden Markov model to an N-channel hidden Markov model (N-channel HMM).
Abstract: Robust speech recognition systems must address variations due to perceptually induced stress in order to maintain acceptable levels of performance in adverse conditions. One approach for addressing these variations is to utilize front-end stress classification to direct a stress dependent recognition algorithm which separately models each speech production domain. This study proposes a new approach which combines stress classification and speech recognition functions into one algorithm. This is accomplished by generalizing the one-dimensional (1-D) hidden Markov model to an N-channel hidden Markov model (N-channel HMM). Here, each stressed speech production style under consideration is allocated a dimension in the N-channel HMM to model each perceptually induced stress condition. It is shown that this formulation better integrates perceptually induced stress effects for stress independent recognition. This is due to the sub-phoneme (state level) stress classification that is implicitly performed by the algorithm. The proposed N-channel stress independent HMM method is compared to a previously established one-channel stress dependent isolated word recognition system yielding a 73.8% reduction in error rate. In addition, an 82.7% reduction in error rate is observed compared to the common one-channel neutral trained recognition approach.

Book ChapterDOI
01 Jan 1999
TL;DR: This chapter provides an analytical framework to quantify the improvements in classification results due to combining, and derives expressions that indicate how much the median, the maximum and in general the ith order statistic can improve classifier performance.
Abstract: Several researchers have experimentally shown that substantial improvements can be obtained in difficult pattern recognition problems by combining or integrating the outputs of multiple classifiers. This chapter provides an analytical framework to quantify the improvements in classification results due to combining. The results apply to both linear combiners and order statistics combiners. We first show that to a first order approximation, the error rate obtained over and above the Bayes error rate, is directly proportional to the variance of the actual decision boundaries around the Bayes optimum boundary. Combining classifiers in output space reduces this variance, and hence reduces the “added” error. If N unbiased classifiers are combined by simple averaging, the added error rate can be reduced by a factor of N if the individual errors in approximating the decision boundaries are uncorrelated. Expressions are then derived for linear combiners which are biased or correlated, and the effect of output correlations on ensemble performance is quantified. For order statistics based non-linear combiners, we derive expressions that indicate how much the median, the maximum and in general the ith order statistic can improve classifier performance. The analysis presented here facilitates the understanding of the relationships among error rates, classifier boundary distributions, and combining in output space. Experimental results on several public domain data sets are provided to illustrate the benefits of combining and to support the analytical results.
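
A small simulation of the central claim, under the chapter's idealized assumptions of unbiased, uncorrelated boundary errors (toy 1-D problem, numbers invented): averaging N estimates of the decision boundary shrinks their variance, and hence the "added" error, by roughly a factor of N:

```python
import numpy as np

rng = np.random.default_rng(0)

true_boundary = 0.0   # Bayes-optimal threshold in a 1-D toy problem
noise_std = 0.5       # per-classifier error in locating the boundary
n_trials = 20_000

for n_classifiers in (1, 5, 25):
    # Each classifier's estimate = truth + independent zero-mean noise;
    # the combiner simply averages the estimates.
    estimates = true_boundary + noise_std * rng.standard_normal((n_trials, n_classifiers))
    combined = estimates.mean(axis=1)
    print(f"N={n_classifiers:2d}  boundary variance = {combined.var():.4f}"
          f"  (theory: {noise_std**2 / n_classifiers:.4f})")
```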

01 Jan 1999
TL;DR: This paper explores the effects of word error rate, loss of textual clues, amount of training data, changes in guidelines, and out-of-vocabulary errors in the context of the Hub4e-IE evaluation.
Abstract: In this paper, we contrast the two tasks of named entity extraction from speech and text both qualitatively and quantitatively in the context of the DARPA 1998 Hub4e-IE evaluation. We will present some top level observations and a detailed engineering analysis of our system’s failures and successes. We explore the effects of word error rate, loss of textual clues, amount of training data, changes in guidelines, and out-of-vocabulary errors.

Proceedings ArticleDOI
15 Mar 1999
TL;DR: There appears to be an optimal ratio of training patterns to parameters of around 25:1 in these circumstances, and doubling the training data and system size appears to provide diminishing returns of error rate reduction for the largest systems.
Abstract: We have trained and tested a number of large neural networks for the purpose of emission probability estimation in large vocabulary continuous speech recognition. In particular, the problem under test is the DARPA Broadcast News task. Our goal here was to determine the relationship between training time, word error rate, size of the training set, and size of the neural network. In all cases, the network architecture was quite simple, comprising a single large hidden layer with an input window consisting of feature vectors from 9 frames around the current time, with a single output for each of 54 phonetic categories. Thus far, simultaneous increases to the size of the training set and the neural network improve performance; in other words, more data helps, as does the training of more parameters. We continue to be surprised that such a simple system works as well as it does for complex tasks. Given a limitation in training time, however, there appears to be an optimal ratio of training patterns to parameters of around 25:1 in these circumstances. Additionally, doubling the training data and system size appears to provide diminishing returns of error rate reduction for the largest systems.
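
A back-of-the-envelope illustration of the reported 25:1 ratio; the 9-frame input window and 54 outputs are from the abstract, but the per-frame feature dimension and hidden-layer width below are assumptions:

```python
# Single-hidden-layer MLP: (9 frames x feature_dim) inputs -> hidden -> 54 outputs.
feature_dim = 20   # e.g. a PLP/MFCC-style vector per frame (assumption)
inputs = 9 * feature_dim
hidden = 4000      # illustrative hidden-layer size
outputs = 54

params = (inputs * hidden + hidden) + (hidden * outputs + outputs)
print(f"parameters              ≈ {params:,}")            # ≈ 0.94M weights
print(f"training frames at 25:1 ≈ {25 * params:,}")       # ≈ 23.5M frames
print(f"speech needed           ≈ {25 * params / (100 * 3600):.0f} h at 100 frames/s")
```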

Journal ArticleDOI
TL;DR: A set of cross-word decision-tree state-clustered context-dependent hidden Markov models is used to define a set of subphone units to be used in a concatenation synthesizer, which produces speech that is both natural sounding and highly intelligible.

Journal ArticleDOI
TL;DR: Ways in which to quantify the performance of confidence measures in terms of their discrimination power and bias are discussed, and two different performance metrics are analyzed: the classification equal error rate and the normalized mutual information metric.

Proceedings ArticleDOI
15 Mar 1999
TL;DR: This paper investigates if AdaBoost can be used to improve a hybrid HMM/neural network continuous speech recognizer and reports results on the Numbers 95 corpus and compares them with other classifier combination techniques.
Abstract: "Boosting" is a general method for improving the performance of almost any learning algorithm. A previously proposed and very promising boosting algorithm is AdaBoost. In this paper we investigate if AdaBoost can be used to improve a hybrid HMM/neural network continuous speech recognizer. Boosting significantly improves the word error rate from 6.3% to 5.3% on a test set of the OGI Numbers 95 corpus, a medium size continuous numbers recognition task. These results compare favorably with other combining techniques using several different feature representations or additional information from longer time spans. In summary, we can say that the reasons for the impressive success of AdaBoost are still not completely understood. To the best of our knowledge, an application of AdaBoost to a real world problem has not yet been reported in the literature either. In this paper we investigate if AdaBoost can be applied to boost the performance of a continuous speech recognition system. In this domain we have to deal with large amounts of data (often more than 1 million training examples) and inherently noisy phoneme labels. The paper is organized as follows. We summarize the AdaBoost algorithm and our baseline speech recognizer. We show how AdaBoost can be applied to this task and we report results on the Numbers 95 corpus and compare them with other classifier combination techniques. The paper finishes with a conclusion and perspectives for future work.

Journal ArticleDOI
TL;DR: A method is presented for upgrading initially simple pronunciation models to new models that can explain several pronunciation variants of each word; the introduction of such variants in a segment-based recognizer significantly improves the recognition accuracy.

Proceedings Article
01 Jan 1999
TL;DR: A new approach to penalize word hypotheses that are inconsistent with prosodic features such as duration and pitch is investigated, and the language model is modified to represent hidden events such as sentence boundaries and various forms of disfluency.
Abstract: We investigate a new approach for using speech prosody as a knowledge source for speech recognition. The idea is to penalize word hypotheses that are inconsistent with prosodic features such as duration and pitch. To model the interaction between words and prosody we modify the language model to represent hidden events such as sentence boundaries and various forms of disfluency, and combine with it decision trees that predict such events from prosodic features. N-best rescoring experiments on the Switchboard corpus show a small but consistent reduction of word error as a result of this modeling. We conclude with a preliminary analysis of the types of errors that are corrected by the prosodically informed model.
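
A minimal sketch of the N-best rescoring step, assuming acoustic, language-model, and prosody scores are available per hypothesis as log-probabilities; the hypotheses, scores, and weights are invented for illustration:

```python
# The prosody term penalizes word sequences whose implied boundaries and
# disfluencies conflict with the observed duration and pitch features.
hypotheses = [
    {"words": "so we can go",   "am": -120.3, "lm": -18.2, "prosody": -2.1},
    {"words": "so we can grow", "am": -119.0, "lm": -17.9, "prosody": -10.0},
]

lm_weight, prosody_weight = 12.0, 4.0   # tuned on held-out data in practice

def combined_score(h):
    return h["am"] + lm_weight * h["lm"] + prosody_weight * h["prosody"]

best = max(hypotheses, key=combined_score)
print(best["words"])
# -> "so we can go": without the prosody term the second hypothesis would
# have scored higher (-333.8 vs -338.7); the prosody penalty flips the choice.
```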

Proceedings ArticleDOI
15 Mar 1999
TL;DR: A compact language model which incorporates local dependencies in the form of N-grams and long distance dependencies through dynamic topic conditional constraints is presented and extends easily to incorporate other forms of statistical dependencies such as syntactic word-pair relationships or hierarchical topic constraints.
Abstract: A compact language model which incorporates local dependencies in the form of N-grams and long distance dependencies through dynamic topic conditional constraints is presented. These constraints are integrated using the maximum entropy principle. Issues in assigning a topic to a test utterance are investigated. Recognition results on the Switchboard corpus are presented showing that with a very small increase in the number of model parameters, reduction in word error rate and language model perplexity are achieved over trigram models. Some analysis follows, demonstrating that the gains are even larger on content-bearing words. The results are compared with those obtained by interpolating topic-independent and topic-specific N-gram models. The framework presented here extends easily to incorporate other forms of statistical dependencies such as syntactic word-pair relationships or hierarchical topic constraints.

Journal ArticleDOI
TL;DR: The article introduces the search problem, discusses in detail a typical implementation of a search engine, and demonstrates the efficacy of this approach on a range of problems; the approach is scalable across a wide range of applications.
Abstract: Large vocabulary continuous speech recognition (LVCSR) systems have advanced significantly due to the ability to handle extremely large problem spaces in fairly small amounts of memory. The article introduces the search problem, discusses in detail a typical implementation of a search engine, and demonstrates the efficacy of this approach on a range of problems. The approach presented is scalable across a wide range of applications. It is designed to address research needs, where a premium is placed on the flexibility of the system architecture, and the needs of application prototypes, which require near-real-time speed without a great sacrifice in word error rate (WER). One major area of focus for researchers is the development of real-time systems. With only minor degradations in performance (typically, no more than a 25% increase in WER), the systems described in this article can be transformed into systems that operate at 10×RT or less. There are four active areas of research related to this problem. First, more intelligent pruning algorithms that prune the search space more heavily are required. Look-ahead and N-best strategies at all levels of the system are key to achieving such large reductions in the search space. Second, multi-pass systems that perform a quick search using a simple system, and then rescore only the N-best resulting hypotheses using better models are very popular for real-time implementation. Third, since much of the computation in these systems is devoted to acoustic model processing, fast-matching strategies within the acoustic model are important. Finally, since Gaussian evaluation at each state in the system is a major consumer of CPU time, vector quantization-like approaches that enable one to compute only a small number of Gaussians per frame are proven to be successful. In some sense, the Viterbi (1967) based system presented represents only one path through this continuum of recognition search strategies.
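
A toy sketch of the beam-pruning idea behind the first research direction above: per frame, hypotheses whose accumulated log score falls more than a beam below the current best are dropped. States and scores are invented:

```python
def beam_prune(active_hyps, beam_width):
    """active_hyps: dict mapping hypothesis state -> accumulated log score."""
    best = max(active_hyps.values())
    return {state: score for state, score in active_hyps.items()
            if score >= best - beam_width}

frame_hyps = {"s1": -210.0, "s2": -214.5, "s3": -260.0, "s4": -212.1}
print(beam_prune(frame_hyps, beam_width=10.0))
# -> {'s1': -210.0, 's2': -214.5, 's4': -212.1}; 's3' falls outside the beam
```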

Proceedings ArticleDOI
T. Tan, Hong Yan
15 Mar 1999
TL;DR: By utilizing the intrinsic properties of block-wise self-similar transformations in fractal image coding, the authors perform face recognition; the contractivity factor and the encoding scheme of the fractal encoder are shown to affect recognition rates.
Abstract: In this paper, we propose a new method for computerized human face recognition using fractal transformations. We show that by utilizing the intrinsic properties of block-wise self-similar transformations in fractal image coding we can use it to perform face recognition. The contractivity factor and the encoding scheme of the fractal encoder are shown to affect recognition rates. Using this method, an average error rate of 1.75% was obtained on the ORL face database.

Proceedings ArticleDOI
21 Nov 1999
TL;DR: A hybrid error control architecture which takes into account the use of hierarchical video coding is developed; results show that pure FEC offers the worst overall performance, while the ARQ scheme offers the best performance under low bit error rate and short round trip time.
Abstract: Considering the limited bandwidth of the wireless link, it is important that the error control mechanism in wireless networks be spectrally efficient. Towards this end, we develop a hybrid error control architecture which takes into account the use of hierarchical video coding. Additionally, the architecture also includes a module which estimates the packet error rate and round trip time observed by the receiver and adjusts the level of redundancy used based on the estimate. By choosing different options in the architecture, we get different error control schemes. In this paper, we investigate the performance of five different error control schemes, namely ARQ, pure FEC, hybrid FEC/ARQ, hybrid FEC/ARQ with priority-dependent redundancy, and adaptive hybrid FEC/ARQ with priority-dependent redundancy. We evaluate the performance of these schemes using MPEG-2 video traces. The results show that pure FEC offers the worst overall performance. The ARQ scheme offers the best performance under low bit error rate and short round trip time, while the priority-aware hybrid FEC/ARQ with or without FEC adaptation offers the best performance under other conditions.
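
A simplified sketch of the adaptation step: given the receiver's estimated packet error rate, pick the smallest number of parity packets in a k-data, h-parity erasure-coded block (recoverable if at most h packets are lost) that meets a residual-loss target. Block size, target, and estimates are invented, and a real scheduler would also weigh the round-trip-time estimate:

```python
from math import comb

def block_loss_prob(p, k, h):
    """Probability that more than h of the k+h packets are lost (i.i.d. loss rate p)."""
    n = k + h
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(h + 1, n + 1))

def choose_redundancy(p_est, k=16, target=1e-3, h_max=16):
    """Smallest parity count whose residual block-loss probability meets the target."""
    for h in range(h_max + 1):
        if block_loss_prob(p_est, k, h) <= target:
            return h
    return h_max

for p_est in (0.01, 0.05, 0.10):
    print(f"estimated loss {p_est:.2f} -> {choose_redundancy(p_est)} parity packets")
```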

01 Jan 1999
TL;DR: A survey of the design, implementation, and study of interfaces for correcting errors made by error-prone recognition-based input technologies is presented.
Abstract: Interfaces which support natural inputs such as handwriting and speech are becoming more prevalent and this is a desirable trend. However, these recognition-based interface techniques are error prone. Despite research efforts to improve recognition rates, a certain amount of error will never be removed. Suitable research efforts should attend to the problem of correction techniques for these error prone techniques. Humans have developed countless ways to correct errors in understanding or clarify ambiguous statements. It is time for interface designers to focus on ways for computers to do the same. We present a survey of the design, implementation, and study of interfaces for correcting error prone input technologies. Previous work by others and our own research into flexible pen-based note-taking environments grounds our research into interface techniques for handling errors in recognition systems.

Journal ArticleDOI
TL;DR: The notion of a critic associated with each classifier, whose objective is to predict the classifier's errors, is introduced and is found to achieve significant performance gains over alternative methods on a number of benchmark data sets.
Abstract: We develop new rules for combining the estimates obtained from each classifier in an ensemble, in order to address problems involving multiple (>2) classes. A variety of techniques have been previously suggested, including averaging probability estimates from each classifier, as well as hard (0-1) voting schemes. In this work, we introduce the notion of a critic associated with each classifier, whose objective is to predict the classifier's errors. Since the critic only tackles a two class problem, its predictions are generally more reliable than those of the classifier and, thus, can be used as the basis for improved combination rules. Several such rules are suggested here. While previous techniques are only effective when the individual classifier error rate is p<0.5, the new approach is successful, as proved under an independence assumption, even when this condition is violated-in particular, so long as p+q<1, with q the critic's error rate. More generally, critic-driven combining is found to achieve significant performance gains over alternative methods on a number of benchmark data sets. We also propose a new analytical tool for modeling ensemble performance, based on dependence between experts. This approach is substantially more accurate than the analysis based on independence that is often used to justify ensemble methods.

Proceedings ArticleDOI
15 Mar 1999
TL;DR: This work describes four system prototypes for voice-dialing, Internet information retrieval (called InfoPhone), voice e-mail, and car navigation; the latter three use a client-server architecture with the client designed to be resident on a phone or other hand-held device.
Abstract: With the advances in speech recognition and wireless communications, the possibilities for information access in the automobile have expanded significantly. We describe four system prototypes for (i) voice-dialing, (ii) Internet information retrieval (called InfoPhone), (iii) voice e-mail, and (iv) car navigation. These systems are designed primarily for hands-busy, eyes-busy conditions, use speaker-independent speech recognizers, and can be used with a restricted display or no display at all. The voice-dialing prototype incorporates our hands-free speech recognition engine that is very robust in noisy car environments (1% WER and 3% string error rate on the continuous digit recognition task at 0 dB SNR). The InfoPhone, voice e-mail, and car navigation prototypes use a client-server architecture with the client designed to be resident on a phone or other hand-held device.