
Showing papers on "Word error rate published in 1995"


Proceedings ArticleDOI
Reinhard Kneser1, Hermann Ney
09 May 1995
TL;DR: This paper proposes to use distributions that are specifically optimized for the task of backing off; these turn out to be quite different from the probability distributions usually used for backing-off.
Abstract: In stochastic language modeling, backing-off is a widely used method to cope with the sparse data problem. In case of unseen events this method backs off to a less specific distribution. In this paper we propose to use distributions which are especially optimized for the task of backing-off. Two different theoretical derivations lead to distributions which are quite different from the probability distributions that are usually used for backing-off. Experiments show an improvement of about 10% in terms of perplexity and 5% in terms of word error rate.
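
A minimal sketch of the underlying idea on a toy bigram corpus: instead of backing off to raw unigram relative frequencies, the back-off distribution is built from continuation counts (how many distinct contexts a word follows), one of the optimized back-off distributions derived in this line of work. The corpus and variable names are illustrative assumptions.

```python
from collections import defaultdict

def backoff_distributions(bigrams):
    """Compare the usual unigram back-off distribution with a
    continuation-count distribution (optimized for backing off)."""
    unigram_counts = defaultdict(int)
    left_contexts = defaultdict(set)   # distinct predecessors of each word
    for (w1, w2), c in bigrams.items():
        unigram_counts[w2] += c
        left_contexts[w2].add(w1)

    total_tokens = sum(unigram_counts.values())
    total_types = sum(len(s) for s in left_contexts.values())

    usual = {w: c / total_tokens for w, c in unigram_counts.items()}
    continuation = {w: len(left_contexts[w]) / total_types for w in unigram_counts}
    return usual, continuation

# Toy corpus: "francisco" is frequent, so its unigram probability is high,
# but it follows only one word, so its continuation probability is low.
bigrams = {("san", "francisco"): 20, ("new", "york"): 5,
           ("a", "house"): 3, ("the", "house"): 4, ("my", "house"): 2}
usual, continuation = backoff_distributions(bigrams)
print(usual["francisco"], continuation["francisco"])
print(usual["house"], continuation["house"])
```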

1,768 citations


Journal ArticleDOI
TL;DR: A constrained estimation technique for Gaussian mixture densities in speech recognition; with adaptation, nonnative speakers approach the speaker-independent accuracy achieved for native speakers, and native speakers reach the accuracy of speaker-dependent systems that use six times as much training data.
Abstract: A trend in automatic speech recognition systems is the use of continuous mixture-density hidden Markov models (HMMs). Despite the good recognition performance that these systems achieve on average in large vocabulary applications, there is a large variability in performance across speakers. Performance degrades dramatically when the user is radically different from the training population. A popular technique that can improve the performance and robustness of a speech recognition system is adapting speech models to the speaker, and more generally to the channel and the task. In continuous mixture-density HMMs the number of component densities is typically very large, and it may not be feasible to acquire a sufficient amount of adaptation data for robust maximum-likelihood estimates. To solve this problem, the authors propose a constrained estimation technique for Gaussian mixture densities. The algorithm is evaluated on the large-vocabulary Wall Street Journal corpus for both native and nonnative speakers of American English. For nonnative speakers, the recognition error rate is approximately halved with only a small amount of adaptation data, and it approaches the speaker-independent accuracy achieved for native speakers. For native speakers, the recognition performance after adaptation improves to the accuracy of speaker-dependent systems that use six times as much training data.
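
As a rough illustration of why scarce adaptation data is a problem and how prior information helps, the sketch below shrinks a maximum-likelihood mean estimate toward the speaker-independent mean. This is a generic MAP-style update, not the authors' constrained estimation algorithm, and the weight tau is an assumed hyperparameter.

```python
import numpy as np

def adapt_mean(prior_mean, adaptation_frames, tau=10.0):
    """MAP-style interpolation between a speaker-independent Gaussian mean
    and the maximum-likelihood mean of (possibly scarce) adaptation data.

    With few frames the estimate stays close to the prior; with many frames
    it approaches the speaker-specific ML estimate."""
    x = np.asarray(adaptation_frames)
    n = len(x)
    ml_mean = x.mean(axis=0) if n > 0 else prior_mean
    return (tau * prior_mean + n * ml_mean) / (tau + n)

prior = np.array([0.0, 0.0])
frames = np.random.default_rng(0).normal(loc=[1.0, -1.0], scale=0.5, size=(5, 2))
print(adapt_mean(prior, frames))                     # still pulled toward the prior
print(adapt_mean(prior, np.tile(frames, (40, 1))))   # dominated by the data
```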

439 citations


Proceedings Article
01 Jan 1995

306 citations


Journal ArticleDOI
TL;DR: This paper showed that competition between simultaneously active word candidates can modulate the size of prosodic effects, which suggests that spoken-word recognition must be sensitive both to prosodic structure and to the effects of competition.
Abstract: Spoken utterances contain few reliable cues to word boundaries, but listeners nonetheless experience little difficulty identifying words in continuous speech. The authors present data and simulations that suggest that this ability is best accounted for by a model of spoken-word recognition combining competition between alternative lexical candidates and sensitivity to prosodic structure. In a word-spotting experiment, stress pattern effects emerged most clearly when there were many competing lexical candidates for part of the input. Thus, competition between simultaneously active word candidates can modulate the size of prosodic effects, which suggests that spoken-word recognition must be sensitive both to prosodic structure and to the effects of competition. A version of the Shortlist model (D. G. Norris, 1994b) incorporating the Metrical Segmentation Strategy (A. Cutler & D. Norris, 1988) accurately simulates the results using a lexicon of more than 25,000 words.

267 citations


Journal ArticleDOI
TL;DR: A combinatorial analysis is presented to derive a closed-form expression for the number of transmission errors that occur in a block transmitted through a Gilbert channel that simplifies the computations needed to investigate the tradeoffs among the decoding error probability, degree of interleaving, and the error-correction ability of a code.
Abstract: Presents a combinatorial analysis to derive a closed-form expression for the number of transmission errors that occur in a block transmitted through a Gilbert channel. This expression simplifies the computations needed to investigate the tradeoffs among the decoding error probability, degree of interleaving, and the error-correction ability of a code. The authors illustrate how a designer may apply the method to determine different combinations of the degree of interleaving and error correction ability to achieve a specified decoding error rate.
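
A hedged simulation of the setting rather than the paper's closed-form expression: a two-state Gilbert channel produces bursty bit errors, and interleaving spreads a burst across several codewords, so a code correcting t errors per block fails less often. The state-transition and error probabilities below are illustrative only.

```python
import random

def gilbert_errors(n_bits, p_gb=0.02, p_bg=0.3, e_good=0.001, e_bad=0.3, seed=1):
    """Generate a bursty error pattern from a two-state Gilbert channel."""
    rng = random.Random(seed)
    state_bad = False
    errors = []
    for _ in range(n_bits):
        state_bad = rng.random() < (1 - p_bg if state_bad else p_gb)
        errors.append(rng.random() < (e_bad if state_bad else e_good))
    return errors

def block_failure_rate(errors, block_len=63, t=2, depth=1):
    """Fraction of codewords with more than t errors.  With interleaving
    depth D, consecutive channel symbols belong to D different codewords,
    so an error burst is spread over several codewords."""
    super_len = block_len * depth
    n_super = len(errors) // super_len
    failures = total = 0
    for s in range(n_super):
        chunk = errors[s * super_len:(s + 1) * super_len]
        for cw in range(depth):
            weight = sum(chunk[cw + i * depth] for i in range(block_len))
            failures += weight > t
            total += 1
    return failures / max(total, 1)

bits = gilbert_errors(200_000)
for depth in (1, 4, 16):
    print(depth, block_failure_rate(bits, block_len=63, t=2, depth=depth))
```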

231 citations


Proceedings ArticleDOI
09 May 1995
TL;DR: The results show that integration of duration models that take into account context and speaking rate can improve the word accuracy of the baseline recognition system.
Abstract: This paper presents a study of different methods for phoneme duration modeling in large vocabulary speech recognition. We investigate the employment of phoneme duration and the effect of context, speaking rate and lexical stress in the duration of phoneme segments in a large vocabulary speech recognition system. The duration models are used in a postprocessing phase of BYBLOS, our baseline HMM-based recognition system, to rescore the N-Best hypotheses. We describe experiments with the 5 K word ARPA Wall Street Journal (WSJ) corpus. The results show that integration of duration models that take into account context and speaking rate can improve the word accuracy of the baseline recognition system.
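
A minimal sketch of N-best rescoring with a duration score, assuming each hypothesis carries its recognizer log-score and a phone-level alignment; the per-phone Gaussian duration models and the weight are placeholders, not the BYBLOS models.

```python
import math

def duration_log_score(alignment, duration_models):
    """Sum log-probabilities of observed phone durations under simple
    per-phone Gaussian duration models: {phone: (mean_ms, std_ms)}."""
    total = 0.0
    for phone, dur_ms in alignment:
        mean, std = duration_models.get(phone, (80.0, 40.0))
        total += -0.5 * ((dur_ms - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))
    return total

def rescore_nbest(nbest, duration_models, weight=0.1):
    """Re-rank N-best hypotheses by recognizer score plus a weighted duration score."""
    rescored = [(h["score"] + weight * duration_log_score(h["alignment"], duration_models),
                 h["words"]) for h in nbest]
    return max(rescored)[1]

nbest = [
    {"words": "wall street journal", "score": -120.0,
     "alignment": [("w", 60), ("ao", 90), ("l", 70)]},
    {"words": "wall street colonel", "score": -119.5,
     "alignment": [("w", 15), ("ao", 300), ("l", 10)]},
]
models = {"w": (55, 20), "ao": (95, 30), "l": (65, 25)}
print(rescore_nbest(nbest, models))   # implausible durations push the second hypothesis down
```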

199 citations


Patent
25 May 1995
TL;DR: In this article, a method for recognizing handwritten characters in response to an input signal from a handwriting transducer is described, which relies on static or shape information, wherein the temporal order in which points are captured by an electronic tablet may be disregarded.
Abstract: Methods and apparatus are disclosed for recognizing handwritten characters in response to an input signal from a handwriting transducer. A feature extraction and reduction procedure is disclosed that relies on static or shape information, wherein the temporal order in which points are captured by an electronic tablet may be disregarded. A method of the invention generates and processes the tablet data with three independent sets of feature vectors which encode the shape information of the input character information. These feature vectors include horizontal (x-axis) and vertical (y-axis) slices of a bit-mapped image of the input character data, and an additional feature vector to encode an absolute y-axis displacement from a baseline of the bit-mapped image. It is shown that the recognition errors that result from the spatial or static processing are quite different from those resulting from temporal or dynamic processing. Furthermore, it is shown that these differences complement one another. As a result, a combination of these two sources of feature vector information provides a substantial reduction in an overall recognition error rate. Methods to combine probability scores from dynamic and the static character models are also disclosed.
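
A rough sketch of static, shape-based feature extraction in the spirit described: normalized horizontal and vertical slice profiles of a character bitmap plus a baseline-displacement value. Array sizes and the baseline convention are assumptions, not the patent's exact encoding.

```python
import numpy as np

def static_features(bitmap, baseline_row):
    """Shape features from a binary character bitmap (rows x cols):
    - y-axis slices: fraction of inked pixels in each row
    - x-axis slices: fraction of inked pixels in each column
    - displacement of the ink centroid from the baseline row (normalized)
    Temporal (stroke-order) information is deliberately ignored."""
    img = np.asarray(bitmap, dtype=float)
    rows, _ = img.shape
    y_slices = img.mean(axis=1)                    # one value per row
    x_slices = img.mean(axis=0)                    # one value per column
    row_weights = img.sum(axis=1)
    centroid_row = ((row_weights * np.arange(rows)).sum() / row_weights.sum()
                    if row_weights.sum() else baseline_row)
    y_displacement = (baseline_row - centroid_row) / rows
    return np.concatenate([x_slices, y_slices, [y_displacement]])

# Tiny 5x4 bitmap of a mark sitting above the baseline (row 4).
bitmap = [[0, 1, 1, 0],
          [1, 0, 0, 1],
          [1, 1, 1, 1],
          [1, 0, 0, 1],
          [0, 0, 0, 0]]
print(static_features(bitmap, baseline_row=4))
```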

197 citations


Book ChapterDOI
09 Jul 1995
TL;DR: A “wrapper” method that uses best-first search and cross-validation to wrap around the basic induction algorithm to find the parameter settings that will result in optimal performance of a given learning algorithm using a particular dataset as training data.
Abstract: We address the problem of finding the parameter settings that will result in optimal performance of a given learning algorithm using a particular dataset as training data. We describe a “wrapper” method, considering determination of the best parameters as a discrete function optimization problem. The method uses best-first search and cross-validation to wrap around the basic induction algorithm: the search explores the space of parameter values, running the basic algorithm many times on training and holdout sets produced by cross-validation to get an estimate of the expected error of each parameter setting. Thus, the final selected parameter settings are tuned for the specific induction algorithm and dataset being studied. We report experiments with this method on 33 datasets selected from the UCI and StatLog collections using C4.5 as the basic induction algorithm. At a 90% confidence level, our method improves the performance of C4.5 on nine domains, degrades performance on one, and is statistically indistinguishable from C4.5 on the rest. On the sample of datasets used for comparison, our method yields an average 13% relative decrease in error rate. We expect to see similar performance improvements when using our method with other machine learning algorithms.
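
A simplified sketch of the wrapper idea, using exhaustive search over a small grid instead of best-first search; the estimator interface and error measure are assumptions, and any classifier-training callable stands in for C4.5.

```python
import random

def cv_error(train_fn, params, data, labels, k=5, seed=0):
    """k-fold cross-validation error of a classifier trained with `params`.
    `train_fn(X, y, **params)` must return an object with .predict(X)."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for f in folds:
        fold = set(f)
        train_idx = [i for i in idx if i not in fold]
        model = train_fn([data[i] for i in train_idx], [labels[i] for i in train_idx], **params)
        preds = model.predict([data[i] for i in f])
        errors += sum(p != labels[i] for p, i in zip(preds, f))
    return errors / len(data)

def wrapper_select(train_fn, param_grid, data, labels):
    """Pick the parameter setting with the lowest cross-validated error.
    (The paper uses best-first search; plain grid search is shown here.)"""
    return min(param_grid, key=lambda p: cv_error(train_fn, p, data, labels))

# Toy classifier standing in for the basic induction algorithm.
class MajorityClassifier:
    def __init__(self, label): self.label = label
    def predict(self, X): return [self.label] * len(X)

def train_majority(X, y, flip=False):
    lab = max(set(y), key=y.count)
    return MajorityClassifier((not lab) if flip else lab)

data = [[i] for i in range(20)]
labels = [i < 15 for i in range(20)]
print(wrapper_select(train_majority, [{"flip": False}, {"flip": True}], data, labels))
```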

192 citations


Patent
13 Nov 1995
TL;DR: In this article, a word tagging and editing system for speech recognition receives recognized speech text from a speech recognition engine, and creates tagging information that follows the speech text as it is received by a word processing program or other program.
Abstract: A word tagging and editing system for speech recognition receives recognized speech text from a speech recognition engine, and creates tagging information that follows the speech text as it is received by a word processing program or other program. The body of text to be edited in connection with the word processing program may be selected and cut and pasted and otherwise manipulated, and the tags follow the speech text. A word may be selected by a user, and the tag information used to point to a sound bite within the audio data file created initially by the speech recognition engine. The sound bite may be replayed to the user through a speaker. The practical results include that the user may confirm the correctness of a particular recognized word, in real time whilst editing text in the word processor. If the recognition is manually corrected, the correction information may be supplied to the engine for use in updating a user profile for the user who dictated the audio that was recognized. Particular tagging approaches are employed depending on the particular word processor being used.
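
A rough sketch of the tagging idea, assuming the recognition engine supplies per-word audio offsets: each recognized word carries a tag that points into the original audio, the tags travel with the text through edits, and selecting a word retrieves its sound bite for replay. The data layout below is an assumption, not the patent's format.

```python
from dataclasses import dataclass

@dataclass
class TaggedWord:
    text: str
    audio_start_ms: int   # offset of the word's sound bite in the audio file
    audio_end_ms: int

def tag_recognized_text(words, offsets):
    """Attach audio-offset tags to recognized words."""
    return [TaggedWord(w, s, e) for w, (s, e) in zip(words, offsets)]

def sound_bite_for(tagged_words, index):
    """Return the (start, end) audio range for the selected word so the
    corresponding sound bite can be replayed to the user."""
    w = tagged_words[index]
    return (w.audio_start_ms, w.audio_end_ms)

doc = tag_recognized_text(["please", "send", "the", "memo"],
                          [(0, 420), (420, 800), (800, 950), (950, 1500)])
edited = [doc[3], doc[2], doc[0]]        # cut-and-paste: tags travel with the words
print(edited[0].text, sound_bite_for(edited, 0))
```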

188 citations


Proceedings ArticleDOI
27 Nov 1995
TL;DR: First experiments along highways in the Netherlands show that the CLPR-system has an error rate of 0.02% at a recognition rate of 98.51%.
Abstract: A car license plate recognition system (CLPR-system) has been developed to identify vehicles by the contents of their license plate for speed-limit enforcement. This type of application puts high demands on the reliability of the CLPR-system. A combination of neural and fuzzy techniques is used to guarantee a very low error rate at an acceptable recognition rate. First experiments along highways in the Netherlands show that the system has an error rate of 0.02% at a recognition rate of 98.51%. These results are also compared with other published CLPR-systems.
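
The tradeoff the authors exploit, rejecting uncertain plates so the error rate stays very low at a still-acceptable recognition rate, can be illustrated generically; the confidence scores below are fabricated, and this is not the paper's neural/fuzzy classifier.

```python
def rates_at_threshold(results, threshold):
    """results: list of (confidence, is_correct) for classified plates.
    Plates below the threshold are rejected (neither recognized nor wrong).
    Returns (recognition_rate, error_rate) over all plates."""
    accepted = [(c, ok) for c, ok in results if c >= threshold]
    recognition_rate = sum(ok for _, ok in accepted) / len(results)
    error_rate = sum(not ok for _, ok in accepted) / len(results)
    return recognition_rate, error_rate

results = [(0.99, True)] * 90 + [(0.95, True)] * 8 + [(0.60, False)] * 2
for thr in (0.5, 0.9):
    print(thr, rates_at_threshold(results, thr))   # raising the threshold removes the errors
```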

180 citations


Proceedings ArticleDOI
09 May 1995
TL;DR: A new scoring algorithm has been developed for generating wordspotting hypotheses and their associated scores that uses a large-vocabulary continuous speech recognition system to generate the N-best answers along with their Viterbi alignments.
Abstract: A new scoring algorithm has been developed for generating wordspotting hypotheses and their associated scores. This technique uses a large-vocabulary continuous speech recognition (LVCSR) system to generate the N-best answers along with their Viterbi alignments. The score for a putative hit is computed by summing the likelihoods for all hypotheses that contain the keyword normalized by dividing by the sum of all hypothesis likelihoods in the N-best list. Using a test set of conversational speech from Switchboard Credit Card conversations, we achieved an 81% figure of merit (FOM). Our word recognition error rate on this same test set is 54.7%.
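
The scoring rule described lends itself to a short sketch: the putative-hit score is the posterior probability mass of N-best hypotheses containing the keyword, computed from per-hypothesis log-likelihoods with a log-sum-exp for numerical stability. Variable names are illustrative.

```python
import math

def keyword_posterior(keyword, nbest):
    """Score a putative hit as the sum of likelihoods of N-best hypotheses
    containing the keyword, normalized by the sum of all hypothesis
    likelihoods.  `nbest` is a list of (word_sequence, log_likelihood)."""
    logliks = [ll for _, ll in nbest]
    hit_logliks = [ll for words, ll in nbest if keyword in words.split()]
    if not hit_logliks:
        return 0.0
    m = max(logliks)                      # log-sum-exp normalization
    denom = sum(math.exp(ll - m) for ll in logliks)
    num = sum(math.exp(ll - m) for ll in hit_logliks)
    return num / denom

nbest = [("i lost my credit card", -100.0),
         ("i lost my credit cart", -101.2),
         ("a cost my credit card", -103.0)]
print(keyword_posterior("card", nbest))   # most of the mass contains "card"
print(keyword_posterior("cart", nbest))   # much smaller posterior
```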

Journal ArticleDOI
TL;DR: MFB cepstra significantly outperform LPC cepstra under noisy conditions, and techniques using an optimal linear combination of features for data reduction were evaluated.
Abstract: This paper compares the word error rate of a speech recognizer using several signal processing front ends based on auditory properties. Front ends were compared with a control mel filter bank (MFB) based cepstral front end in clean speech and with speech degraded by noise and spectral variability, using the TI-105 isolated word database. MFB recognition error rates ranged from 0.5 to 26.9% in noise, depending on the SNR, and auditory models provided error rates as much as four percentage points lower. With speech degraded by linear filtering, MFB error rates ranged from 0.5 to 3.1%, and the reduction in error rates provided by auditory models was less than 0.5 percentage points. Some earlier studies that demonstrated considerably more improvement with auditory models used linear predictive coding (LPC) based control front ends. This paper shows that MFB cepstra significantly outperform LPC cepstra under noisy conditions. Techniques using an optimal linear combination of features for data reduction were also evaluated.

Journal ArticleDOI
TL;DR: Direct-detection optical synchronous code-division multiple-access systems with M-ary pulse-position modulation (PPM) signaling are investigated and it is shown that under average power and bit error rate constraints, there always exists a pulse position multiplicity that permits all the subscribers to communicate simultaneously.
Abstract: Direct-detection optical synchronous code-division multiple-access (CDMA) systems with M-ary pulse-position modulation (PPM) signaling are investigated. Optical orthogonal codes are used as the signature sequences of our system. A union upper bound on the bit error rate is derived taking into account the effect of the background noise, multiple-user interference, and receiver shot noise. The performance characteristics are then discussed for a variety of system parameters. Another upper bound on the probability of error is also obtained (based on Chernoff inequality). This bound is utilized to derive achievable expressions for both the maximum number of users that can communicate simultaneously with asymptotically zero error rate and the channel capacity. Our results show that under average power and bit error rate constraints, there always exists a pulse position multiplicity that permits all the subscribers to communicate simultaneously.

Proceedings ArticleDOI
09 May 1995
TL;DR: An algorithm for using a probabilistic Earley parser and a stochastic context-free grammar (SCFG) to generate word transition probabilities at each frame for a Viterbi decoder and it is shown that using an SCFG as a language model improves the word error rate.
Abstract: This paper describes a number of experiments in adding new grammatical knowledge to the Berkeley Restaurant Project (BeRP), our medium-vocabulary (1300 word), speaker-independent, spontaneous continuous-speech understanding system. We describe an algorithm for using a probabilistic Earley parser and a stochastic context-free grammar (SCFG) to generate word transition probabilities at each frame for a Viterbi decoder. We show that using an SCFG as a language model improves the word error rate from 34.6% (bigram) to 29.6% (SCFG), and the semantic sentence recognition error from 39.0% (bigram) to 34.1% (SCFG). In addition, we get a further reduction to 28.8% word error by mixing the bigram and SCFG LMs. We also report on our preliminary results from using discourse-context information in the LM.
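
The reported gain from mixing the bigram and SCFG language models can be illustrated with a simple linear interpolation of the two word-transition probabilities; the mixture weight and the probability tables below are placeholders, not BeRP's actual models.

```python
def mixed_lm_prob(word, history, scfg_prob, bigram_prob, lam=0.5):
    """Linear interpolation of an SCFG-derived word-transition probability
    with a bigram probability:  P = lam * P_scfg + (1 - lam) * P_bigram."""
    return lam * scfg_prob(word, history) + (1 - lam) * bigram_prob(word, history)

# Toy probability tables standing in for the parser and the bigram model.
scfg_table = {("restaurant", "cheap"): 0.08, ("restaurants", "cheap"): 0.30}
bigram_table = {("restaurant", "cheap"): 0.20, ("restaurants", "cheap"): 0.10}

scfg = lambda w, h: scfg_table.get((w, h), 1e-4)
bigram = lambda w, h: bigram_table.get((w, h), 1e-4)

for w in ("restaurant", "restaurants"):
    print(w, mixed_lm_prob(w, "cheap", scfg, bigram))
```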

Patent
Hsiao-Wuen Hon1, Yen-Lu Chow1
04 Oct 1995
TL;DR: In this paper, a method for reducing recognition errors in a speech recognition system that has a user interface, which instructs the user to invoke a new word acquisition module upon a predetermined condition, and that improves the recognition accuracy for poorly recognized words.
Abstract: A method for reducing recognition errors in a speech recognition system that has a user interface, which instructs the user to invoke a new word acquisition module upon a predetermined condition, and that improves the recognition accuracy for poorly recognized words. The user interface of the present invention suggests to a user which unrecognized words may be new words that should be added to the recognition program lexicon. The user interface advises the user to enter into a new word lexicon any words that fail to appear in the alternative word list for two consecutive tries. A method to improve the recognition accuracy for poorly recognized words via language model adaptation is also provided by the present invention. The present invention increases the unigram probability of an unrecognized word in proportion to the score difference between the unrecognized word and the top one word to guarantee recognition of the same word in a subsequent try. In the event that the score of the unrecognized word is unknown (i.e., not in the alternative word list), the present invention increases the unigram probability of the unrecognized word in proportion to the difference between the top one word score and the smallest score in the alternative list.
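
A hedged sketch of the described language-model adaptation: the unrecognized word's unigram probability is raised in proportion to a score gap, either its own gap to the top-one word or, if its score is unknown, the gap between the top-one word and the worst word in the alternative list. The proportionality constant and the renormalization step are assumptions.

```python
def boost_unigram(unigram, word, top_score, alt_scores, alpha=0.01):
    """Increase the unigram probability of `word` so it is more likely to be
    recognized on the next try.

    unigram    : dict word -> probability
    top_score  : recognizer score of the (wrong) top-one word
    alt_scores : dict word -> score for the alternative word list
    alpha      : assumed proportionality constant"""
    if word in alt_scores:                       # gap to the top-one word
        gap = top_score - alt_scores[word]
    else:                                        # word absent from the list
        gap = top_score - min(alt_scores.values())
    unigram[word] = unigram.get(word, 1e-6) + alpha * gap
    total = sum(unigram.values())                # keep it a distribution
    return {w: p / total for w, p in unigram.items()}

lm = {"male": 0.4, "mail": 0.3, "nail": 0.3}
print(boost_unigram(lm, "mail", top_score=-50.0,
                    alt_scores={"mail": -58.0, "nail": -70.0}))
```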

Proceedings ArticleDOI
09 May 1995
TL;DR: It is suggested that phone rate is a more meaningful measure of speech rate than the more common word rate, and it is found that when data sets are clustered according to the phone rate metric, recognition errors increase when the phone rate is more than 1 standard deviation greater than the mean.
Abstract: It is well known that a higher-than-normal speech rate will cause the rate of recognition errors in large vocabulary automatic speech recognition (ASR) systems to increase. In this paper we attempt to identify and correct for errors due to fast speech. We first suggest that phone rate is a more meaningful measure of speech rate than the more common word rate. We find that when data sets are clustered according to the phone rate metric, recognition errors increase when the phone rate is more than 1 standard deviation greater than the mean. We propose three methods to improve the recognition accuracy of fast speech, each addressing different aspects of performance degradation. The first method is an implementation of Baum-Welch codebook adaptation. The second method is based on the adaptation of HMM state-transition probabilities. In the third method, the pronunciation dictionaries are modified using rule-based techniques and compound words are added. We compare improvements in recognition accuracy for each method using data sets clustered according to the phone rate metric. Adaptation of the HMM state-transition probabilities to fast speech improves recognition of fast speech by a relative amount of 4 to 6 percent.
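
A small sketch of the phone-rate metric and the clustering criterion mentioned above: phone rate is phones per second of speech, and an utterance is flagged as fast when its rate exceeds the data-set mean by more than one standard deviation. Field names are assumptions.

```python
import statistics

def phone_rate(utterance):
    """Phones per second, given an utterance dict with the number of phones
    in its transcription and its speech duration in seconds."""
    return utterance["num_phones"] / utterance["duration_s"]

def flag_fast_utterances(utterances):
    """Return utterances whose phone rate is more than one standard
    deviation above the mean phone rate of the data set."""
    rates = [phone_rate(u) for u in utterances]
    mean, std = statistics.mean(rates), statistics.pstdev(rates)
    return [u for u, r in zip(utterances, rates) if r > mean + std]

utterances = [
    {"id": "a", "num_phones": 42, "duration_s": 3.5},   # ~12 phones/s
    {"id": "b", "num_phones": 38, "duration_s": 3.4},
    {"id": "c", "num_phones": 65, "duration_s": 3.2},   # fast talker
    {"id": "d", "num_phones": 40, "duration_s": 3.6},
]
print([u["id"] for u in flag_fast_utterances(utterances)])
```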

Journal ArticleDOI
TL;DR: A new representation of the residue is proposed and its corresponding recognition performance is analysed through experiments in the context of text-independent speaker verification, which suggests the possibility of an improvement over current speaker recognition approaches based on nothing but the usual synthesis filter features.

Proceedings ArticleDOI
09 May 1995
TL;DR: It is shown that sizable gains can be achieved by either batch or incremental adaptation for large vocabulary recognition of native speakers, and that good improvements in performance are realized when instantaneous adaptation is used for recognition of non-native speakers.
Abstract: We present a framework for maximum a posteriori adaptation of large scale HMM speech recognizers. In this framework, we introduce mechanisms that take advantage of correlations present among HMM parameters in order to maximize the number of parameters that can be adapted by a limited number of observations. We are also separately exploring the feasibility of instantaneous adaptation techniques. Instantaneous adaptation attempts to improve recognition on a single sentence, the same sentence that is used to estimate the adaptation. We show that sizable gains (20-40% reduction in error rate) can be achieved by either batch or incremental adaptation for large vocabulary recognition of native speakers. The same techniques cut the error rate for recognition of non-native speakers by factors of 2 to 4, bringing their performance much closer to the native speaker performance. We also demonstrate that good improvements in performance (25-30%) are realized when instantaneous adaptation is used for recognition of non-native speakers.

Journal ArticleDOI
TL;DR: This paper showed that an overwhelming majority (84%) of polysyllables have shorter words embedded within them and that these embeddings are most common at the onsets of the longer word.
Abstract: Several models of spoken word recognition postulate that recognition is achieved via a process of competition between lexical hypotheses. Competition not only provides a mechanism for isolated word recognition, it also assists in continuous speech recognition, since it offers a means of segmenting continuous input into individual words. We present statistics on the pattern of occurrence of words embedded in the polysyllabic words of the English vocabulary, showing that an overwhelming majority (84%) of polysyllables have shorter words embedded within them. Positional analyses show that these embeddings are most common at the onsets of the longer word. Although both phonological and syntactic constraints could rule out some embedded words, they do not remove the problem. Lexical competition provides a means of dealing with lexical embedding. It is also supported by a growing body of experimental evidence. We present results which indicate that competition operates both between word candidates that...
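
The embedding statistic can be approximated with a short script: for each polysyllabic word, check whether any shorter dictionary word occurs inside it and whether the embedding starts at the word's onset. The paper works with phonological forms of the full English vocabulary; the orthographic toy lexicon below is only illustrative.

```python
def embedding_stats(polysyllables, lexicon):
    """For each longer word, find shorter lexicon words embedded in it and
    report the fraction of longer words with at least one embedding, plus
    counts of onset and non-onset embeddings."""
    lexicon = set(lexicon)
    with_embedding = 0
    onset_embeddings = other_embeddings = 0
    for long_word in polysyllables:
        embedded = [w for w in lexicon
                    if w != long_word and len(w) < len(long_word) and w in long_word]
        if embedded:
            with_embedding += 1
        for w in embedded:
            if long_word.startswith(w):
                onset_embeddings += 1
            else:
                other_embeddings += 1
    return with_embedding / len(polysyllables), onset_embeddings, other_embeddings

lexicon = ["car", "pet", "carpet", "can", "don", "key", "donkey", "sea", "send"]
polysyllables = ["carpet", "donkey", "second"]
print(embedding_stats(polysyllables, lexicon))
```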

Book ChapterDOI
William W. Cohen1
09 Jul 1995
TL;DR: It is shown that FOIL usually forms classifiers with lower error rates and higher rates of precision and recall with a relational encoding than with a propositional encoding, and its performance can be improved by relation selection, a first order analog of feature selection.
Abstract: We evaluate the first order learning system FOIL on a series of text categorization problems. It is shown that FOIL usually forms classifiers with lower error rates and higher rates of precision and recall with a relational encoding than with a propositional encoding. We show that FOIL's performance can be improved by relation selection, a first order analog of feature selection. Relation selection improves FOIL's performance as measured by any of recall, precision, F-measure, or error rate. With an appropriate level of relation selection, FOIL appears to be competitive with or superior to existing propositional techniques.
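
For reference, the evaluation measures mentioned (precision, recall, F-measure, and error rate) can be computed as below for a binary text-categorization decision; this is standard bookkeeping, not FOIL itself.

```python
def categorization_metrics(predicted, actual):
    """Precision, recall, F1 and error rate for binary category labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    errors = sum(p != a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1, errors / len(actual)

predicted = [True, True, False, False, True]
actual    = [True, False, False, True, True]
print(categorization_metrics(predicted, actual))
```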

Journal ArticleDOI
TL;DR: The authors report 4 lexical decision experiments in which case type, word frequency, and exposure duration were varied, which indicated that there is a larger mixed-case disadvantage for nonwords than for words for longer duration presentations of targets.
Abstract: The authors report 4 lexical decision experiments in which case type, word frequency, and exposure duration were varied. These data indicated that there is a larger mixed-case disadvantage for nonwords than for words for longer duration presentations of targets. However, when targets were presented for 100 ms (followed by a postdisplay pattern mask), a larger mixed-case disadvantage occurred for words than for nonwords. For word frequency, the data from Experiments 1, 2, and 3 revealed a slightly larger mixed-case disadvantage for higher frequency words than for lower frequency words. (There was additivity between word frequency and case type for Experiment 4.) These results are consistent with a holistically biased, hybrid model of visual word recognition but inconsistent with analytically biased, hybrid models of word recognition, such as the process model (Besner & Johnston, 1989) and the interactive-activation model (McClelland & Rumelhart, 1981).

Proceedings ArticleDOI
09 May 1995
TL;DR: A general recognition system for large vocabulary, writer independent, unconstrained handwritten text, that performs recognition in real-time on 486 class PC platforms without the large amounts of memory required for traditional HMM based systems.
Abstract: We address the problem of automatic recognition of unconstrained handwritten text. Statistical methods, such as hidden Markov models (HMMs) have been used successfully for speech recognition and they have been applied to the problem of handwriting recognition as well. We discuss a general recognition system for large vocabulary, writer independent, unconstrained handwritten text. "Unconstrained" implies that the user may write in any style e.g. printed, cursive or in any combination of styles. This is more representative of typical handwritten text where one seldom encounters purely printed or purely cursive forms. Furthermore, a key characteristic of the system is that it performs recognition in real-time on 486 class PC platforms without the large amounts of memory required for traditional HMM based systems. We focus mainly on the writer independent task. Some initial writer dependent results are also reported. An error rate of 18.9% is achieved for a writer-independent 21,000 word vocabulary task in the absence of any language models.

Journal ArticleDOI
Yu-Dong Yao1
TL;DR: An effective go-back-N ARQ scheme is proposed which estimates the channel state in a simple manner, and adaptively switches its operation mode in a channel where error rates vary slowly.
Abstract: In nonstationary channels, error rates vary considerably. The author proposes an effective go-back-N ARQ scheme which estimates the channel state in a simple manner, and adaptively switches its operation mode in a channel where error rates vary slowly. It provides higher throughput than other comparable ARQ schemes under a wide variety of error rate conditions.
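
A heavily simplified sketch of the adaptation idea, not the author's specific scheme: the sender tracks the fraction of negative acknowledgements over a sliding window as a channel-state estimate and switches between a "good-channel" and a "bad-channel" operating mode when that estimate crosses thresholds.

```python
from collections import deque

class AdaptiveGoBackN:
    """Toy channel-state estimator for an adaptive go-back-N ARQ sender.
    Only the mode switching is shown; retransmission bookkeeping is omitted."""

    def __init__(self, window=100, bad_threshold=0.1, good_threshold=0.02):
        self.history = deque(maxlen=window)   # recent ACK (True) / NAK (False)
        self.bad_threshold = bad_threshold
        self.good_threshold = good_threshold
        self.mode = "good-channel"

    def record(self, ack):
        self.history.append(ack)
        nak_rate = 1 - sum(self.history) / len(self.history)
        # Hysteresis: switch modes only when the estimate clearly changes.
        if self.mode == "good-channel" and nak_rate > self.bad_threshold:
            self.mode = "bad-channel"
        elif self.mode == "bad-channel" and nak_rate < self.good_threshold:
            self.mode = "good-channel"
        return self.mode

sender = AdaptiveGoBackN(window=20)
for ack in [True] * 15 + [False] * 5 + [True] * 40:
    mode = sender.record(ack)
print(mode)
```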

Journal ArticleDOI
TL;DR: Two methods for creating a phoneme- and speaker-independent model that greatly reduce the amount of calculation needed for similarity (or likelihood) normalization in speaker verification are proposed.

PatentDOI
Jukka Ranta1
TL;DR: In this paper, a speech recognition method and a system for a speech-controllable telephone is presented, in which a value is computed (2) for a reference word with a speech recognizer (8) on the basis of a word uttered by a user, and a recognition resolution (6a, 6b) is made based on said value.
Abstract: The present invention relates to a speech recognition method and a system for a speech-controllable telephone in which a value is computed (2) for a reference word with a speech recognizer (8) on the basis of a word uttered by a user, and a recognition resolution (6a, 6b) is made on the basis of said value. Prior to making said recognition resolution, it is determined (3) whether the uttered word is a repetition of a previous word, and if so, a new value is computed (5) for the reference word on the basis of the value computed by the speech recognizer and of a value in the memory, computed earlier for the reference word, and a recognition resolution (6a, 6b) is made on the basis of said computed new value.
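
A minimal sketch of the described flow, assuming the combination of the new and stored reference-word values is a simple average (the summary leaves the exact combination rule open); names and the threshold are illustrative.

```python
def recognize_with_repetition(reference_word, recognizer_value, memory,
                              threshold=0.7, is_repetition=False):
    """Combine the recognizer's value for a reference word with a previously
    stored value when the utterance repeats the previous word, then make the
    accept/reject resolution against a threshold."""
    if is_repetition and reference_word in memory:
        value = 0.5 * (recognizer_value + memory[reference_word])  # assumed rule
    else:
        value = recognizer_value
    memory[reference_word] = value
    return ("accept" if value >= threshold else "reject"), value

memory = {}
print(recognize_with_repetition("call home", 0.62, memory))                      # first try
print(recognize_with_repetition("call home", 0.85, memory, is_repetition=True))  # repetition
```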

Patent
02 Oct 1995
TL;DR: In this paper, a translation word learning scheme for machine translation is proposed that can learn translation words for each lexical rule separately and easily.
Abstract: A translation word learning scheme for a machine translation capable of learning translation words for each lexical rule separately and easily. In this scheme, a translation word for each original word is obtained by a machine translation using a translation dictionary storing headwords in the first language, a plurality of lexical rules for each headword, and at least one candidate translation word in the second language corresponding to each lexical rule. Then, a change of a translation word from that obtained by the machine translation to another translation word specified by a user is learned by registering learning data indicating a headword, a top candidate translation word corresponding to a lexical rule applied in translating this headword, and the specified translation word. This specified translation word is used in subsequent translations only when an original word and a top candidate translation word for this original word obtained by the machine translation coincide with the headword and the top candidate translation word indicated in the learning data.
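
A small sketch of the learning-data mechanism: registration keys on the headword and the top candidate translation under the applied lexical rule, and lookup applies the user's specified translation only when both match. Function and field names are assumptions.

```python
def register_learning(learning_data, headword, top_candidate, specified):
    """Record that the user changed the translation of `headword` (whose top
    candidate under the applied lexical rule was `top_candidate`) to `specified`."""
    learning_data[(headword, top_candidate)] = specified

def translate_word(original, machine_top_candidate, learning_data):
    """Use the learned translation only when both the original word and the
    machine translation's top candidate match the registered learning data."""
    return learning_data.get((original, machine_top_candidate), machine_top_candidate)

learning = {}
register_learning(learning, "bank", "riverbank", "financial institution")
print(translate_word("bank", "riverbank", learning))   # learned translation applies
print(translate_word("bank", "shore", learning))       # different top candidate: unchanged
```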

Proceedings ArticleDOI
09 May 1995
TL;DR: This work proposes a constrained estimation technique for Gaussian mixture densities, and combines it with Bayesian techniques to improve its asymptotic properties, and evaluates the algorithms on the large-vocabulary Wall Street Journal corpus for nonnative speakers of American English.
Abstract: The performance and robustness of a speech recognition system can be improved by adapting the speech models to the speaker, the channel and the task. In continuous mixture-density hidden Markov models the number of component densities is typically very large, and it may not be feasible to acquire a large amount of adaptation data for robust maximum-likelihood estimates. To solve this problem, we propose a constrained estimation technique for Gaussian mixture densities, and combine it with Bayesian techniques to improve its asymptotic properties. We evaluate our algorithms on the large-vocabulary Wall Street Journal corpus for nonnative speakers of American English. The recognition error rate is comparable to the speaker-independent accuracy achieved for native speakers.

Journal ArticleDOI
TL;DR: This paper focuses on the speech recognition advances made through better speech modeling techniques, chiefly through more accurate mathematical modeling of speech sounds.
Abstract: In the past decade, tremendous advances in the state of the art of automatic speech recognition by machine have taken place. A reduction in the word error rate by more than a factor of 5 and an increase in recognition speeds by several orders of magnitude (brought about by a combination of faster recognition search algorithms and more powerful computers), have combined to make high-accuracy, speaker-independent, continuous speech recognition for large vocabularies possible in real time, on off-the-shelf workstations, without the aid of special hardware. These advances promise to make speech recognition technology readily available to the general public. This paper focuses on the speech recognition advances made through better speech modeling techniques, chiefly through more accurate mathematical modeling of speech sounds.

Patent
19 Jun 1995
TL;DR: In this paper, a morphological find and replace editing tool for a word processor replaces inflected forms of a user-specified find word in a text document with the inflectional forms of the user specified replacement word having matching parts of speech, by selecting a single set of word forms with a common root word for each of the find and replacement words.
Abstract: A morphological find and replace editing tool for a word processor replaces inflected forms of a user-specified find word in a text document with inflected forms of a user-specified replacement word having matching parts of speech. The tool retrieves sets of word forms having a same root word as the find and replacement words, respectively, from a word forms database. The tool selects a single set of word forms with a common root word for each of the find and replacement words such that the find and replacement words are matching parts of speech. Where word forms in the find word's set are found in the text document, they are replaced with a word form from the replacement word's set with a best matching part of speech.
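
A toy sketch of the matching rule: each word-form set maps parts of speech to inflected forms, and a found form is replaced by the replacement word's form with the same part of speech. The tiny table below stands in for the word forms database.

```python
word_forms = {
    "run":  {"VB": "run", "VBZ": "runs", "VBD": "ran", "VBG": "running"},
    "walk": {"VB": "walk", "VBZ": "walks", "VBD": "walked", "VBG": "walking"},
}

def morphological_replace(text, find_root, replace_root):
    """Replace every inflected form of `find_root` in `text` with the form of
    `replace_root` that has the matching part of speech."""
    find_set, replace_set = word_forms[find_root], word_forms[replace_root]
    form_to_pos = {form: pos for pos, form in find_set.items()}
    out = []
    for token in text.split():
        pos = form_to_pos.get(token.lower())
        out.append(replace_set.get(pos, token) if pos else token)
    return " ".join(out)

print(morphological_replace("she ran home and he runs daily", "run", "walk"))
# -> "she walked home and he walks daily"
```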

01 Jan 1995
TL;DR: Kamal M. Ali and Michael J. Pazzani provide empirical evidence that there is a linear relationship between the degree of error reduction and the degree to which the patterns of errors made by individual models are uncorrelated.
Abstract: Recent work has shown that learning an ensemble consisting of multiple models and then making classifications by combining the classifications of the models often leads to more accurate classifications than those based on a single model learned from the same data. However, the amount of error reduction achieved varies from data set to data set. This paper provides empirical evidence that there is a linear relationship between the degree of error reduction and the degree to which patterns of errors made by individual models are uncorrelated. Ensemble error rate is most reduced in ensembles whose constituents make individual errors in a less correlated manner. The second result of the work is that some of the greatest error reductions occur on domains for which many ties in information gain occur during learning. The third result is that ensembles consisting of models that make errors in a dependent but "negatively correlated" manner will have lower ensemble error rates than ensembles whose constituents make errors in an uncorrelated manner. Previous work has aimed at learning models that make errors in an uncorrelated manner rather than those that make errors in a "negatively correlated" manner. Taken together, these results help provide an understanding of why the multiple models approach yields great error reduction in some domains but little in others.