
Showing papers on "Word error rate published in 2003"


Proceedings ArticleDOI
07 Jul 2003
TL;DR: It is shown that significantly better results can often be obtained if the final evaluation criterion is taken directly into account as part of the training procedure.
Abstract: Often, the training procedure for statistical machine translation models is based on maximum likelihood or related criteria. A general problem of this approach is that there is only a loose relation to the final translation quality on unseen text. In this paper, we analyze various training criteria which directly optimize translation quality. These training criteria make use of recently proposed automatic evaluation metrics. We describe a new algorithm for efficient training of an unsmoothed error count. We show that significantly better results can often be obtained if the final evaluation criterion is taken directly into account as part of the training procedure.
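
For illustration, a minimal sketch of the idea behind minimum-error-rate training (this is a brute-force toy, not the paper's efficient line-search algorithm; all scores and error counts below are invented):

```python
# Toy minimum-error-rate tuning: pick the log-linear weight w that
# minimizes the corpus error count over fixed n-best lists. The paper
# computes the piecewise-constant error surface exactly; here we simply
# grid-search over candidate weights.

def tune_weight(nbest_lists, weights_to_try):
    """nbest_lists: per sentence, a list of (lm_score, tm_score, errors)."""
    best_w, best_err = None, float("inf")
    for w in weights_to_try:
        total_err = 0
        for hyps in nbest_lists:
            # Decode with the current weight, then count the chosen
            # hypothesis's errors toward the corpus total.
            chosen = max(hyps, key=lambda h: h[0] + w * h[1])
            total_err += chosen[2]
        if total_err < best_err:
            best_w, best_err = w, total_err
    return best_w, best_err

# Two sentences, three hypotheses each: (lm_score, tm_score, error_count).
nbest = [
    [(-1.0, -2.0, 3), (-1.5, -0.5, 1), (-2.0, -0.2, 2)],
    [(-0.8, -1.0, 0), (-1.2, -0.4, 2), (-0.9, -0.9, 1)],
]
w, err = tune_weight(nbest, [i / 10 for i in range(31)])
print(f"best weight={w:.1f}, corpus errors={err}")
```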

3,259 citations


Journal ArticleDOI
TL;DR: This paper proposes a kernel machine-based discriminant analysis method, which deals with the nonlinearity of the face patterns' distribution and effectively solves the so-called "small sample size" (SSS) problem, which exists in most FR tasks.
Abstract: Techniques that can introduce low-dimensional feature representation with enhanced discriminatory power are of paramount importance in face recognition (FR) systems. It is well known that the distribution of face images, under a perceivable variation in viewpoint, illumination or facial expression, is highly nonlinear and complex. It is, therefore, not surprising that linear techniques, such as those based on principal component analysis (PCA) or linear discriminant analysis (LDA), cannot provide reliable and robust solutions to those FR problems with complex face variations. In this paper, we propose a kernel machine-based discriminant analysis method, which deals with the nonlinearity of the face patterns' distribution. The proposed method also effectively solves the so-called "small sample size" (SSS) problem, which exists in most FR tasks. The new algorithm has been tested, in terms of classification error rate performance, on the multiview UMIST face database. Results indicate that the proposed methodology is able to achieve excellent performance with only a very small set of features being used, and its error rate is approximately 34% and 48% of those of two other commonly used kernel FR approaches, the kernel-PCA (KPCA) and the generalized discriminant analysis (GDA), respectively.
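
As a hedged stand-in for the kernel discriminant idea (the paper's algorithm handles the SSS problem differently), the sketch below chains kernel PCA with ordinary LDA on synthetic data; with real faces, each row of X would be a flattened image:

```python
# Kernel-PCA-plus-LDA stand-in for kernel discriminant analysis: the RBF
# kernel supplies the nonlinearity, LDA the discriminant projection.
# All data here is synthetic and the hyperparameters are illustrative.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Three "subjects", 20 samples each, 50-dimensional feature vectors.
X = np.vstack([rng.normal(i, 1.0, size=(20, 50)) for i in range(3)])
y = np.repeat([0, 1, 2], 20)

clf = make_pipeline(KernelPCA(n_components=30, kernel="rbf", gamma=1e-3),
                    LinearDiscriminantAnalysis())
clf.fit(X, y)
print("training error rate:", 1.0 - clf.score(X, y))
```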

651 citations


Proceedings ArticleDOI
05 Apr 2003
TL;DR: To overcome weaknesses in two statistics recently introduced to measure accuracy in text entry evaluations, a new framework for error analysis is developed and demonstrated that combines the analysis of the presented text, input stream (keystrokes), and transcribed text.
Abstract: We describe and identify shortcomings in two statistics recently introduced to measure accuracy in text entry evaluations: the minimum string distance (MSD) error rate and keystrokes per character (KSPC). To overcome the weaknesses, a new framework for error analysis is developed and demonstrated. It combines the analysis of the presented text, input stream (keystrokes), and transcribed text. New statistics include a unified total error rate that combines two constituent error rates: the corrected error rate (errors committed but corrected) and the not-corrected error rate (errors left in the transcribed text). The framework also includes measures of error correction efficiency, participant conscientiousness, utilised bandwidth, and wasted bandwidth. A text entry study demonstrating the new methodology is described.
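
A minimal sketch of the combined error rates on toy data (assumptions: uncorrected errors are approximated by the minimum string distance, corrected errors by a backspace count; the actual framework classifies every keystroke):

```python
# Simplified text-entry error metrics: INF = incorrect-not-fixed errors,
# IF = incorrect-fixed errors, C = correct characters (proxy values here).

def msd(a, b):
    """Minimum string distance (Levenshtein) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

presented    = "the quick brown fox"
input_stream = "the quv<ick brown fx"   # '<' marks a backspace
transcribed  = "the quick brown fx"

INF = msd(presented, transcribed)       # errors left in transcribed text
IF  = input_stream.count("<")           # errors committed but corrected
C   = len(transcribed) - INF            # correct characters (proxy)

total_error_rate = (INF + IF) / (C + INF + IF)
print(f"total={total_error_rate:.1%} "
      f"uncorrected={INF / (C + INF + IF):.1%} "
      f"corrected={IF / (C + INF + IF):.1%}")
```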

383 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: The SuperSID project as mentioned in this paper used prosodic dynamics, pitch and duration features, phone streams, and conversational interactions to improve the accuracy of automatic speaker recognition using a defined NIST evaluation corpus and task.
Abstract: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project (http://www.clsp.jhu.edu/ws2002/groups/supersid/) was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. The paper provides an overview of the structure, data, task, tools, and accomplishments of this project. Wide-ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. We show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST extended data task to 0.2% - a 71% relative reduction in error over the previous state of the art.

256 citations


Patent
10 Oct 2003
TL;DR: In this paper, the error rate of a pronunciation guesser that guesses the phonetic spelling of words used in speech recognition is improved by causing its training to weigh letter-to-phoneme mappings used as data in such training as a function of the frequency of the words in which such mappings occur.
Abstract: The error rate of a pronunciation guesser that guesses the phonetic spelling of words used in speech recognition is improved by causing its training to weigh the letter-to-phoneme mappings used as training data as a function of the frequency of the words in which those mappings occur. Preferably, the ratio of the weight to word frequency increases as word frequency decreases. Acoustic phoneme models for use in speech recognition with phonetic spellings generated by a pronunciation guesser that makes errors are trained against word models whose phonetic spellings have been generated by a pronunciation guesser that makes similar errors. As a result, the acoustic models represent blends of phoneme sounds that reflect the spelling errors made by the pronunciation guessers. Speech-recognition-enabled systems are made by storing in them both a pronunciation guesser and a corresponding set of such blended acoustic models.
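
The patent only claims that the weight-to-frequency ratio should grow as word frequency falls; a logarithmic weighting (an assumption here, not the patent's formula) has exactly that property:

```python
# One weighting with the claimed property: weight grows sublinearly in
# word frequency, so weight/frequency increases as frequency decreases.
# The log form is an illustrative assumption, not the patent's function.
import math

def mapping_weight(word_frequency):
    return math.log1p(word_frequency)

for f in (1, 10, 100, 10000):
    w = mapping_weight(f)
    print(f"freq={f:<6} weight={w:5.2f} weight/freq={w / f:.4f}")
```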

241 citations


Journal ArticleDOI
TL;DR: Through an analysis of age-related acoustic characteristics of children's speech in the context of automatic speech recognition (ASR), effects such as frequency scaling of spectral envelope parameters are demonstrated, and a speaker normalization algorithm that combines frequency warping and model transformation is shown to reduce acoustic variability.
Abstract: Developmental changes in speech production introduce age-dependent spectral and temporal variability in the speech signal produced by children. Such variabilities pose challenges for robust automatic recognition of children's speech. Through an analysis of age-related acoustic characteristics of children's speech in the context of automatic speech recognition (ASR), effects such as frequency scaling of spectral envelope parameters are demonstrated. Recognition experiments using acoustic models trained on adult speech and tested against speech from children of various ages clearly show performance degradation with decreasing age. On average, the word error rates are two to five times worse for children's speech than for adult speech. Various techniques for improving ASR performance on children's speech are reported. A speaker normalization algorithm that combines frequency warping and model transformation is shown to reduce acoustic variability and significantly improve ASR performance for child speakers (by 25-45% under various model training and testing conditions). The use of age-dependent acoustic models further reduces the word error rate by 10%. The potential of using piecewise linear and phoneme-dependent frequency warping algorithms for reducing the variability in the acoustic feature space of children is also investigated.
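
A sketch of one piecewise-linear warping function of the kind used for frequency warping in VTLN (the breakpoint fraction and warp factors are illustrative, not the paper's values):

```python
# Piecewise-linear VTLN warping: frequencies below a breakpoint are
# scaled by the warp factor alpha; above it, the mapping is linear so
# that the Nyquist frequency maps to itself. The knee fraction (0.85)
# is an illustrative choice.

def warp(freq_hz, alpha, f_nyquist=8000.0, knee=0.85):
    f0 = knee * f_nyquist                      # breakpoint on warped axis
    if freq_hz <= f0 / alpha:
        return alpha * freq_hz                 # linear scaling region
    # Connect (f0/alpha, f0) to (f_nyquist, f_nyquist) linearly.
    slope = (f_nyquist - f0) / (f_nyquist - f0 / alpha)
    return f0 + slope * (freq_hz - f0 / alpha)

for a in (0.88, 1.0, 1.12):                    # example factors around 1.0
    print(a, [round(warp(f, a)) for f in (500, 2000, 6000, 8000)])
```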

213 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: Two approaches that use the fundamental frequency and energy trajectories to capture long-term information are proposed that can achieve a 77% relative improvement over a system based on short-term pitch and energy features alone.
Abstract: Most current state-of-the-art automatic speaker recognition systems extract speaker-dependent features by looking at short-term spectral information. This approach ignores long-term information that can convey supra-segmental information, such as prosodics and speaking style. We propose two approaches that use the fundamental frequency and energy trajectories to capture long-term information. The first approach uses bigram models to model the dynamics of the fundamental frequency and energy trajectories for each speaker. The second approach uses the fundamental frequency trajectories of a predefined set of words as the speaker templates and then, using dynamic time warping, computes the distance between the templates and the words from the test message. The results presented in this work are on Switchboard I using the NIST Extended Data evaluation design. We show that these approaches can achieve an equal error rate of 3.7%, which is a 77% relative improvement over a system based on short-term pitch and energy features alone.
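
A minimal dynamic time warping sketch of the second approach, comparing a stored F0 template with a test-word trajectory (the trajectories here are invented numbers, not Switchboard data):

```python
# DTW alignment of two F0 trajectories: returns the length-normalized
# accumulated distance between a speaker template and a test word.

def dtw(template, test):
    n, m = len(template), len(test)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(template[i - 1] - test[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m] / (n + m)

template_f0 = [120, 125, 135, 150, 140, 130]   # claimed speaker's template
test_f0     = [118, 124, 133, 152, 149, 138, 128]
print(f"DTW distance: {dtw(template_f0, test_f0):.2f}")
```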

212 citations


Proceedings Article
01 Jan 2003
TL;DR: This paper describes and evaluates three techniques that have been applied to the language identification problem: phone recognition, Gaussian mixture modeling, and support vector machine classification and an approach to dealing with the problem of out-of-set data.
Abstract: Formal evaluations conducted by NIST in 1996 demonstrated that systems that used parallel banks of tokenizer-dependent language models produced the best language identification performance. Since that time, other approaches to language identification have been developed that match or surpass the performance of phone-based systems. This paper describes and evaluates three techniques that have been applied to the language identification problem: phone recognition, Gaussian mixture modeling, and support vector machine classification. A recognizer that fuses the scores of three systems that employ these techniques produces a 2.7% equal error rate (EER) on the 1996 NIST evaluation set and a 2.8% EER on the NIST 2003 primary condition evaluation set. An approach to dealing with the problem of out-of-set data is also discussed.
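
A sketch of the Gaussian mixture modeling technique on synthetic stand-ins for cepstral features (sklearn's GaussianMixture replaces the evaluation-grade models; the "languages" below are toy distributions):

```python
# GMM language ID sketch: fit one mixture per language, then pick the
# language whose model gives the test utterance the highest average
# log-likelihood per frame.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = {
    "english": rng.normal(0.0, 1.0, size=(500, 13)),
    "spanish": rng.normal(0.5, 1.2, size=(500, 13)),
}
models = {lang: GaussianMixture(n_components=4, random_state=0).fit(x)
          for lang, x in train.items()}

test_utt = rng.normal(0.5, 1.2, size=(200, 13))   # "spanish-like" frames
scores = {lang: m.score(test_utt) for lang, m in models.items()}
print(max(scores, key=scores.get), scores)
```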

176 citations


Proceedings ArticleDOI
30 Nov 2003
TL;DR: It is conventional wisdom in the speech community that better speech recognition accuracy is a good indicator of better spoken language understanding accuracy, but the findings in this work reveal that this is not always the case.
Abstract: It is conventional wisdom in the speech community that better speech recognition accuracy is a good indicator of better spoken language understanding accuracy, given a fixed understanding component. The findings in this work reveal that this is not always the case. More important than word error rate reduction, the language model for recognition should be trained to match the optimization objective for understanding. In this work, we applied a spoken language understanding model as the language model in speech recognition. The model was obtained with an example-based learning algorithm that optimized the understanding accuracy. Although the speech recognition word error rate is 46% higher than that of the trigram model, the overall slot understanding error can be reduced by as much as 17%.

159 citations


Proceedings Article
09 Dec 2003
TL;DR: A new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches is introduced, and a new kernel based upon a linearization of likelihood ratio scoring is derived.
Abstract: A recent area of significant progress in speaker recognition is the use of high level features—idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.
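
A sketch of the term-frequency idea on toy phone streams: bigram frequencies are scaled by 1/sqrt(background probability), a scaling suggested by linearizing the likelihood ratio, and a linear SVM separates target from impostor vectors (the flat background and tiny streams are assumptions for illustration):

```python
# Phone-sequence SVM sketch: map each phone stream to a scaled bigram
# frequency vector, then train a linear SVM. Data is illustrative.
import numpy as np
from itertools import product
from sklearn.svm import LinearSVC

phones = ["aa", "t", "s", "k"]
bigrams = list(product(phones, repeat=2))

def tf_vector(phone_stream, background):
    counts = {bg: 0 for bg in bigrams}
    for bg in zip(phone_stream, phone_stream[1:]):
        counts[bg] += 1
    total = max(1, sum(counts.values()))
    # 1/sqrt(p) scaling comes from linearizing the log-likelihood ratio.
    return np.array([(counts[bg] / total) / np.sqrt(background[bg])
                     for bg in bigrams])

background = {bg: 1.0 / len(bigrams) for bg in bigrams}  # flat prior here
target   = [tf_vector(["aa", "t", "aa", "t", "s"], background)]
impostor = [tf_vector(["k", "s", "k", "s", "t"], background),
            tf_vector(["s", "k", "t", "k", "s"], background)]

svm = LinearSVC().fit(target + impostor, [1, 0, 0])
test = tf_vector(["aa", "t", "aa", "s", "t"], background)
print(svm.decision_function([test]))   # positive score favors the target
```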

150 citations


Journal ArticleDOI
TL;DR: A new method for recognition of prokaryotic promoter regions with transcription startpoints, based on a Sequence Alignment Kernel (a function reflecting the quantitative measure of match between two sequences), is proposed and shown to perform well.
Abstract: In this paper we propose a new method for recognition of prokaryotic promoter regions with startpoints of transcription. The method is based on the Sequence Alignment Kernel, a function reflecting the quantitative measure of match between two sequences. This kernel function is further used in a dual SVM, which performs the recognition. Several recognition methods have been trained and tested on a positive data set, consisting of 669 σ70-promoter regions with known transcription startpoints of Escherichia coli, and two negative data sets of 709 examples each, taken from coding and non-coding regions of the same genome. The results show that our method performs well, achieving a 16.5% average error rate on positive & coding-negative data and an 18.6% average error rate on positive & non-coding-negative data. Availability: The demo version of our method is accessible from our website http://mendel.cs.rhul.ac.uk/

Patent
14 May 2003
TL;DR: In this paper, a method and apparatus for generating speech that sounds more natural is presented, where word prominence and latent semantic analysis are used to generate more natural sounding speech, and a word prominence is assigned to a word in the current sentence in accordance with the information determination.
Abstract: A method and apparatus is provided for generating speech that sounds more natural. In one embodiment, word prominence and latent semantic analysis are used to generate more natural-sounding speech. A method for generating speech that sounds more natural may comprise generating synthesized speech having certain word prominence characteristics and applying a semantically-driven word prominence assignment model to specify word prominence consistent with the way humans assign word prominence. Speech representative of the current sentence is generated. A determination is made whether information in the current sentence is new or previously given, in accordance with a semantic relationship between the current sentence and a number of preceding sentences. A word prominence is assigned to a word in the current sentence in accordance with this determination.

Patent
29 Jan 2003
TL;DR: In this article, a method for associating words and word strings in a language by analyzing word formations around a word or word string to identify other words or word strings that are equivalent or near equivalent semantically is presented.
Abstract: A method for creating and using a cross-idea association database that includes a method for associating words and word strings in a language by analyzing word formations around a word or word string to identify other words or word strings that are equivalent or near equivalent semantically. One method for associating words and word strings includes querying a collection of documents with a user-supplied word or word string, determining a user-defined number of words or word strings to the left and right of the query string, determining the frequency of occurrence of words or word strings located on the left and right of the query string, and ranking the located words.
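
A minimal sketch of the left/right context counting the claim describes, over a toy document collection (the corpus and window size are invented):

```python
# Count and rank the words appearing immediately left and right of each
# occurrence of a query string across a document collection.
from collections import Counter

docs = ["the big dog ran fast", "a big dog barked", "the big cat ran"]
query, window = "big dog", 1

left, right = Counter(), Counter()
q = query.split()
for doc in docs:
    words = doc.split()
    for i in range(len(words) - len(q) + 1):
        if words[i:i + len(q)] == q:
            left.update(words[max(0, i - window):i])
            right.update(words[i + len(q):i + len(q) + window])

print("left:", left.most_common(), "right:", right.most_common())
```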

Journal ArticleDOI
TL;DR: The experimental results demonstrate the crucial importance of the newly introduced iterations in improving the earlier stochastic approximation technique, and show the sensitivity of the noise estimation algorithm's performance to the forgetting factor embedded in the algorithm.
Abstract: We describe a novel algorithm for recursive estimation of nonstationary acoustic noise which corrupts clean speech, and a successful application of the algorithm in the speech feature enhancement framework of noise-normalized SPLICE for robust speech recognition. The noise estimation algorithm makes use of a nonlinear model of the acoustic environment in the cepstral domain. Central to the algorithm is an innovative iterative stochastic approximation technique that improves the piecewise linear approximation to the nonlinearity involved and thereby increases the accuracy of noise estimation. We report comprehensive experiments on SPLICE-based, noise-robust speech recognition for the AURORA2 task using the results of iterative stochastic approximation. The effectiveness of the new technique is demonstrated in comparison with a more traditional MMSE noise estimation algorithm under otherwise identical conditions. The word error rate reduction achieved by iterative stochastic approximation for recursive noise estimation in the framework of noise-normalized SPLICE is 27.9% for the multicondition training mode and 67.4% for the clean-only training mode, compared with the results using the standard cepstra with no speech enhancement and the baseline HMM supplied by AURORA2. These represent the best performance in the clean-training category of the September 2001 AURORA2 evaluation. The relative error rate reduction achieved with the same noise estimate increases to 48.40% and 76.86%, respectively, for the two training modes when a better-designed HMM system is used. The experimental results demonstrate the crucial importance of the newly introduced iterations in improving the earlier stochastic approximation technique, and show the sensitivity of the noise estimation algorithm's performance to the forgetting factor embedded in the algorithm.
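
As a much-simplified stand-in for the recursive estimator (this is plain first-order recursive averaging, not the paper's iterative stochastic approximation), the sketch shows the role a forgetting factor plays when tracking a nonstationary noise level:

```python
# First-order recursive noise tracking with a forgetting factor: the
# estimate is updated only on frames judged non-speech. A smaller
# forgetting factor adapts faster but yields a noisier estimate.
import numpy as np

def track_noise(frames, is_speech, forgetting=0.95):
    noise = frames[0].copy()                  # initialize from first frame
    history = [noise.copy()]
    for frame, speech in zip(frames[1:], is_speech[1:]):
        if not speech:                        # update only on non-speech
            noise = forgetting * noise + (1 - forgetting) * frame
        history.append(noise.copy())
    return np.array(history)

rng = np.random.default_rng(1)
frames = rng.normal(loc=-10.0, scale=0.5, size=(100, 13))  # log-spectra
frames[50:] += 3.0                            # noise level shifts mid-stream
vad = [False] * 100                           # all frames non-speech here
est = track_noise(frames, vad, forgetting=0.9)
print(est[0][:3], est[-1][:3])                # estimate drifts to new level
```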

Journal ArticleDOI
TL;DR: The experimental results show that computational reduction by a factor of 17 can be achieved with 5% relative reduction in equal error rate (EER) compared with the baseline, and the SGMM-SBM shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.
Abstract: We present an integrated system with structural Gaussian mixture models (SGMMs) and a neural network for purposes of achieving both computational efficiency and high accuracy in text-independent speaker verification. A structural background model (SBM) is constructed first by hierarchically clustering all Gaussian mixture components in a universal background model (UBM). In this way the acoustic space is partitioned into multiple regions in different levels of resolution. For each target speaker, a SGMM can be generated through multilevel maximum a posteriori (MAP) adaptation from the SBM. During test, only a small subset of Gaussian mixture components are scored for each feature vector in order to reduce the computational cost significantly. Furthermore, the scores obtained in different layers of the tree-structured models are combined via a neural network for final decision. Different configurations are compared in the experiments conducted on the telephony speech data used in the NIST speaker verification evaluation. The experimental results show that computational reduction by a factor of 17 can be achieved with 5% relative reduction in equal error rate (EER) compared with the baseline. The SGMM-SBM also shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.
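
A two-level toy version of the tree-structured scoring idea (the real system uses multilevel MAP-adapted mixtures and a neural network combiner; the structures and counts below are synthetic):

```python
# Coarse-to-fine Gaussian scoring: score a frame against a few cluster
# centers first, then evaluate only the winning cluster's components,
# cutting cost roughly by the branching factor (8 of 32 here).
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(3)
dim, comps_per_cluster = 4, 8
clusters = []
for c in range(4):                            # 4 coarse acoustic regions
    center = rng.normal(3 * c, 1.0, size=dim)
    comps = [center + rng.normal(0, 0.5, size=dim)
             for _ in range(comps_per_cluster)]
    clusters.append((center, comps))

def fast_score(frame):
    # Level 1: pick the best coarse cluster.
    best = max(clusters,
               key=lambda cl: mvn.logpdf(frame, cl[0], np.eye(dim) * 2))
    # Level 2: score only that cluster's components in full.
    return max(mvn.logpdf(frame, m, np.eye(dim) * 0.25) for m in best[1])

print(fast_score(rng.normal(3.0, 1.0, size=dim)))
```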

Proceedings Article
01 Jan 2003
TL;DR: The issue of just how much data might be required in order to bring the performance of an automatic speech recognition system up to that of a human listener is addressed.
Abstract: Since the introduction of hidden Markov modelling there has been an increasing emphasis on data-driven approaches to automatic speech recognition. This derives from the fact that systems trained on substantial corpora readily outperform those that rely on more phonetic or linguistic priors. Similarly, extra training data almost always results in a reduction in word error rate - "there's no data like more data". However, despite this progress, contemporary systems are not able to fulfill the requirements demanded by many potential applications, and performance is still significantly short of the capabilities exhibited by human listeners. For these reasons, the R&D community continues to call for even greater quantities of data in order to train their systems. This paper addresses the issue of just how much data might be required in order to bring the performance of an automatic speech recognition system up to that of a human listener.

Proceedings ArticleDOI
30 Nov 2003
TL;DR: This work proposes a novel representation of the temporal envelope in different frequency bands by exploring the dual of conventional linear prediction (LPC) when applied in the transform domain and shows a relatively large word error rate improvement in a recognizer trained on general conversational telephone speech and tested on a small-vocabulary spontaneous numbers task.
Abstract: Current speech recognition systems uniformly employ short-time spectral analysis, usually over windows of 10-30 ms, as the basis for their acoustic representations. Any detail below this timescale is lost, and even temporal structures above this level are usually only weakly represented in the form of deltas etc. We address this limitation by proposing a novel representation of the temporal envelope in different frequency bands by exploring the dual of conventional linear prediction (LPC) when applied in the transform domain. With this technique of frequency-domain linear prediction (FDLP), the 'poles' of the model describe temporal, rather than spectral, peaks. By using analysis windows on the order of hundreds of milliseconds, the procedure automatically decides how to distribute poles to model the temporal structure best within the window. While this approach offers many possibilities for novel speech features, we experiment with one particular form, an index describing the 'sharpness' of individual poles within a window, and show a relatively large word error rate improvement from 4.97% to 3.81% in a recognizer trained on general conversational telephone speech and tested on a small-vocabulary spontaneous numbers task. We analyze this improvement in terms of the confusion matrices and suggest how the newly-modeled fine temporal structure may be helping.
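
A sketch of the FDLP duality on a synthetic burst (the window length, model order, and test signal are all illustrative): ordinary autocorrelation-method LPC is fitted to the DCT of the signal, so the resulting all-pole "spectrum" traces the temporal envelope of the window:

```python
# Frequency-domain linear prediction (FDLP) sketch: fit LPC to the DCT
# of the time signal; the all-pole model then describes temporal, not
# spectral, peaks.
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """Autocorrelation-method LPC via the Yule-Walker equations."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return np.concatenate(([1.0], -a))       # prediction polynomial A(z)

fs = 8000
t = np.arange(int(0.5 * fs)) / fs            # one 500 ms analysis window
sig = np.exp(-((t - 0.3) ** 2) / 0.002) * np.sin(2 * np.pi * 500 * t)

a = lpc(dct(sig, norm="ortho"), order=20)    # LPC in the transform domain
env = 1.0 / np.abs(np.fft.rfft(a, n=4096))   # 0..pi maps onto 0..500 ms
peak_time = np.argmax(env) / (len(env) - 1) * t[-1]
print(f"temporal envelope peaks near {peak_time:.2f} s (burst at 0.30 s)")
```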

Proceedings Article
01 Jan 2003
TL;DR: It is shown how novel features and classifiers provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST Extended Data Task to 0.2%—a 71% relative reduction in error over the previous state of the art.
Abstract: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have produced low error rates, they ignore higher levels of information beyond low-level acoustics that convey speaker information. Recently published work has demonstrated that such high-level information can be used successfully in automatic speaker recognition systems, improving accuracy and potentially increasing robustness. Wide-ranging high-level feature-based approaches using pronunciation models, prosodic dynamics, pitch gestures, phone streams, and conversational interactions were explored and developed under the SuperSID project at the 2002 JHU CLSP Summer Workshop (WS2002): http://www.clsp.jhu.edu/ws2002/groups/supersid/. In this paper, we show how these novel features and classifiers provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST Extended Data Task to 0.2% - a 71% relative reduction in error over the previous state of the art.

Proceedings ArticleDOI
30 Nov 2003
TL;DR: It is shown that a word error rate of 42.6% is achieved on the presented children's story summarization task after using unsupervised MAPLR (maximum a posteriori linear regression) adaptation and VTLN (vocal tract length normalization) to compensate for inter-speaker acoustic variability.
Abstract: We present initial work towards development of a children's speech recognition system for use within an interactive reading and comprehension training system. We first describe the Colorado Literacy Tutor project and two corpora collected for children's speech recognition research. Next, baseline speech recognition experiments are performed to illustrate the degree of acoustic mismatch for children in grades K through 5. It is shown that an 11.2% relative reduction in word error rate can be achieved through vocal tract normalization applied to children's speech. Finally, we describe our baseline system for automatic recognition of spontaneously spoken story summaries. It is shown that a word error rate of 42.6% is achieved on the presented children's story summarization task after using unsupervised MAPLR (maximum a posteriori linear regression) adaptation and VTLN (vocal tract length normalization) to compensate for inter-speaker acoustic variability. Based on this result, we point to promising directions for further study.

Journal ArticleDOI
TL;DR: This work proposes using a merged morpheme as the recognition unit and pronunciation-dependent entries in the language model (LM) to reduce these difficulties, and incorporates the between-word phonology rule into the decoding algorithm of a Korean LVCSR system.

Patent
Nils Klarlund1, Michael Riley1
26 Mar 2003
TL;DR: In this article, a QWERTY-based cluster keyboard is described, which consists of fourteen alphabet keys arranged such that all the letters in the alphabet are distributed in three rows of keys.
Abstract: A QWERTY-based cluster keyboard is disclosed. In the preferred embodiment, the keyboard comprises fourteen alphabet keys arranged such that all the letters in the alphabet are distributed in three rows of keys and in the standard QWERTY positions. Stochastic language models are used to reduce the error rate for typing on the keyboards. The language models consist of probability estimates of occurrences of n-grams (sequences of n consecutive words), wherein n is preferably 1, 2 or 3. A delay parameter d, which is related to the period of time the system displays the predicted intended word upon entry of a word boundary, is preferably zero to immediately display the primary word choice at a word boundary and provide the user the option to select the secondary candidate if necessary. Two disambiguation keys enable the user to identify which letter is intended as a secondary option to the language model predictions.
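
A sketch of n-gram disambiguation for a cluster keyboard, reduced to unigrams over a toy vocabulary (the clusters below are invented, not the patent's fourteen-key layout):

```python
# Cluster-keyboard disambiguation: each key covers several letters, so a
# key sequence expands to many letter strings; a unigram model ranks the
# in-vocabulary candidates and the top one is displayed first.
from itertools import product

clusters = {1: "qw", 2: "er", 3: "ty", 4: "as", 5: "df", 6: "gh"}
unigram = {"were": 0.004, "tree": 0.002}      # toy language model

def candidates(key_seq):
    expansions = ("".join(chars)
                  for chars in product(*(clusters[k] for k in key_seq)))
    scored = ((w, unigram[w]) for w in expansions if w in unigram)
    return sorted(scored, key=lambda c: c[1], reverse=True)

print(candidates([1, 2, 2, 2]))   # keys covering w/q, e/r, ... -> "were"
```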

Proceedings ArticleDOI
06 Apr 2003
TL;DR: The MAPLR adaptation method is shown to outperform single and multiple regression class MLLR on the SPINE task, and methods for unsupervised speaker and environment adaptation from limited data are investigated.
Abstract: We report on recent improvements in the University of Colorado system for the DARPA/NRL Speech in Noisy Environments (SPINE) task. In particular, we describe our efforts on improving acoustic and language modeling for the task and investigate methods for unsupervised speaker and environment adaptation from limited data. We show that the MAPLR adaptation method outperforms single and multiple regression class MLLR on the SPINE task. Our current SPINE system uses the Sonic speech recognition engine that was developed at the University of Colorado. This system is shown to have a word error rate of 31.5% on the SPINE-2 evaluation data. These improvements amount to a 16% reduction in relative word error rate compared to our previous SPINE-2 system fielded in the November 2001 DARPA/NRL evaluation.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: The authors conclude that articulatory features can be recognized across languages and that using detectors from many languages can improve the classification accuracy of the feature detectors on a single language.
Abstract: Speech recognition systems based on or aided by articulatory features, such as place and manner of articulation, have been shown to be useful under varying circumstances. Recognizers based on such features compensate better for channel and noise variability. We show that it is also possible to compensate for inter-language variability using articulatory feature detectors. We come to the conclusion that articulatory features can be recognized across languages and that using detectors from many languages can improve the classification accuracy of the feature detectors on a single language. We further demonstrate how these multilingual and cross-lingual detectors can support an HMM-based recognizer and thereby significantly reduce the word error rate by up to 12.3% relative. We expect that, with the use of multilingual articulatory features, it will be possible to support the rapid deployment of recognition systems for new target languages.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: Speech recognition failed completely without normalization on the highway dataset, whereas the word error rate could be reduced to 17% using an online setup and to 10% with an offline setup.
Abstract: We study the effect of different feature space normalization techniques in adverse acoustic conditions. Recognition tests are reported for cepstral mean and variance normalization, histogram normalization, feature space rotation, and vocal tract length normalization on a German isolated word recognition task with large acoustic mismatch. The training data was recorded in a clean office environment and the test data in cars. Speech recognition failed completely without normalization on the highway dataset, whereas the word error rate could be reduced to 17% using an online setup and to 10% with an offline setup.
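
For reference, a minimal sketch of the simplest of these techniques, offline cepstral mean and variance normalization over a whole utterance (an online variant would use running estimates instead):

```python
# Offline CMVN: shift every feature dimension to zero mean and scale it
# to unit variance over the utterance, removing stationary channel
# offsets such as the office-vs-car mismatch described above.
import numpy as np

def cmvn(features):
    """features: (num_frames, num_coeffs) cepstral matrix."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8         # guard against constant dims
    return (features - mean) / std

utt = np.random.default_rng(2).normal(3.0, 2.0, size=(300, 13))
norm = cmvn(utt)
print(norm.mean(axis=0).round(6)[:3], norm.std(axis=0).round(3)[:3])
```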

Journal ArticleDOI
TL;DR: Experimental results in speaker-independent, continuous speech recognition over Italian digit-strings validate the novel hybrid framework, allowing for improved recognition performance over HMMs with mixtures of Gaussian components, as well as over Bourlard and Morgan's paradigm.
Abstract: Acoustic modeling in state-of-the-art speech recognition systems usually relies on hidden Markov models (HMMs) with Gaussian emission densities. HMMs suffer from intrinsic limitations, mainly due to their arbitrary parametric assumption. Artificial neural networks (ANNs) appear to be a promising alternative in this respect, but they historically failed as a general solution to the acoustic modeling problem. This paper introduces algorithms based on a gradient-ascent technique for global training of a hybrid ANN/HMM system, in which the ANN is trained for estimating the emission probabilities of the states of the HMM. The approach is related to the major hybrid systems proposed by Bourlard and Morgan and by Bengio, with the aim of combining their benefits within a unified framework and to overcome their limitations. Several viable solutions to the "divergence problem"-that may arise when training is accomplished over the maximum-likelihood (ML) criterion-are proposed. Experimental results in speaker-independent, continuous speech recognition over Italian digit-strings validate the novel hybrid framework, allowing for improved recognition performance over HMMs with mixtures of Gaussian components, as well as over Bourlard and Morgan's paradigm. In particular, it is shown that the maximum a posteriori (MAP) version of the algorithm yields a 46.34% relative word error rate reduction with respect to standard HMMs.

Proceedings ArticleDOI
Christoph Tillmann1
11 Jul 2003
TL;DR: A phrase-based unigram model for statistical machine translation that uses a much simpler set of model parameters than similar phrase-based models is described and successfully tested on a Chinese-English and an Arabic-English translation task.
Abstract: In this paper, we describe a phrase-based unigram model for statistical machine translation that uses a much simpler set of model parameters than similar phrase-based models. The units of translation are blocks -- pairs of phrases. During decoding, we use a block unigram model and a word-based trigram language model. During training, the blocks are learned from source interval projections using an underlying high-precision word alignment. The system performance is significantly increased by applying a novel block extension algorithm using an additional high-recall word alignment. The blocks are further filtered using unigram-count selection criteria. The system has been successfully tested on a Chinese-English and an Arabic-English translation task.

Proceedings ArticleDOI
27 May 2003
TL;DR: A generative probabilistic optical character recognition model is introduced that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system.
Abstract: In this paper, we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in order to make it more useful for NLP tasks. We present an implementation of the model based on finite-state models, demonstrate the model's ability to significantly reduce character and word error rate, and provide evaluation results involving automatic extraction of translation lexicons from printed text.
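
A toy noisy-channel decoder in the same spirit (the paper trains finite-state channel and source models; here the channel is a crude per-edit penalty and the source a hand-set unigram):

```python
# Noisy-channel OCR correction: choose the true word w maximizing
# log P(w) + log P(observed | w), with the channel approximated by an
# exponential penalty in edit distance. Vocabulary and penalties are toy.
def edit_distance(a, b):
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        nd = [i]
        for j, cb in enumerate(b, 1):
            nd.append(min(d[j] + 1, nd[j - 1] + 1, d[j - 1] + (ca != cb)))
        d = nd
    return d[-1]

vocab_logp = {"translation": -3.0, "translations": -4.2, "nation": -5.0}

def correct(observed, edit_logp=-2.3):        # log-prob per channel edit
    return max(vocab_logp,
               key=lambda w: vocab_logp[w]
               + edit_logp * edit_distance(observed, w))

print(correct("translaton"))                  # -> "translation"
```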

Proceedings Article
01 Jan 2003
TL;DR: A method based on the Minimum Description Length principle is used to split words statistically into subword units allowing efficient language modeling and unlimited vocabulary and the resulting model outperforms both word and syllable based trigram models.
Abstract: We study continuous speech recognition based on sub-word units found in an unsupervised fashion. For agglutinative languages like Finnish, traditional word-based n-gram language modeling does not work well due to the huge number of different word forms. We use a method based on the Minimum Description Length principle to split words statistically into subword units allowing efficient language modeling and unlimited vocabulary. The perplexity and speech recognition experiments on Finnish speech data show that the resulting model outperforms both word and syllable based trigram models. Compared to the word trigram model, the out-of-vocabulary rate is reduced from 20% to 0% and the word error rate from 56% to 32%.

Journal ArticleDOI
TL;DR: The Hidden-Articulator Markov Model (HAMM), an extension of the articulatory-feature model introduced by Erler in 1996, integrates articulatory information into speech recognition.

Journal ArticleDOI
TL;DR: The overall error rate in laboratory medicine was found to be 20.0%, indicating that error reduction is desirable on the clinical side as well, especially in the requesting of laboratory investigations.