
Showing papers on "Word error rate published in 2002"


Journal ArticleDOI
TL;DR: The calculation of the q-value, the pFDR analogue of the p-value, is discussed; it eliminates the need to set the error rate beforehand, as is traditionally done, and the proposed approach can yield an increase of over eight times in power compared with the Benjamini–Hochberg FDR method.
Abstract: Multiple-hypothesis testing involves guarding against much more complicated errors than single-hypothesis testing. Whereas we typically control the type I error rate for a single-hypothesis test, a compound error rate is controlled for multiple-hypothesis tests. For example, controlling the false discovery rate (FDR) traditionally involves intricate sequential p-value rejection methods based on the observed data. Whereas a sequential p-value method fixes the error rate and estimates its corresponding rejection region, we propose the opposite approach: we fix the rejection region and then estimate its corresponding error rate. This new approach offers increased applicability, accuracy and power. We apply the methodology to both the positive false discovery rate (pFDR) and the FDR, and provide evidence for its benefits. It is shown that the pFDR is probably the quantity of interest over the FDR. Also discussed is the calculation of the q-value, the pFDR analogue of the p-value, which eliminates the need to set the error rate beforehand as is traditionally done. Some simple numerical examples are presented that show that this new approach can yield an increase of over eight times in power compared with the Benjamini–Hochberg FDR method.

5,414 citations
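
As a rough illustration of the fixed-rejection-region idea above, the Python sketch below estimates q-values from a vector of p-values in the style of Storey's pFDR approach; the tuning parameter lam and all names are illustrative choices, not the paper's notation.

```python
import numpy as np

def qvalues(p, lam=0.5):
    """Estimate q-values from p-values (a Storey-style sketch).

    pi0 estimates the proportion of true nulls from the roughly
    uniform right tail of the p-value distribution; q[i] approximates
    the minimum pFDR over rejection regions that contain p[i].
    """
    p = np.asarray(p, dtype=float)
    m = p.size
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))
    order = np.argsort(p)
    q = np.empty(m)
    running = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank, i in zip(range(m, 0, -1), order[::-1]):
        running = min(running, pi0 * p[i] * m / rank)
        q[i] = running
    return q

rng = np.random.default_rng(0)
# 900 null p-values plus 100 small "signal" p-values.
p = np.concatenate([rng.uniform(size=900), rng.uniform(0, 0.001, size=100)])
print("discoveries at q < 0.05:", int((qvalues(p) < 0.05).sum()))
```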


Proceedings ArticleDOI
13 May 2002
TL;DR: The Minimum Phone Error (MPE) and Minimum Word Error (MWE) criteria are smoothed approximations to the phone and word error rate, respectively; the paper also introduces I-smoothing, a novel technique for smoothing discriminative training criteria using statistics for maximum likelihood estimation (MLE).
Abstract: In this paper we introduce the Minimum Phone Error (MPE) and Minimum Word Error (MWE) criteria for the discriminative training of HMM systems. The MPE/MWE criteria are smoothed approximations to the phone and word error rate, respectively. We also discuss I-smoothing, a novel technique for smoothing discriminative training criteria using statistics for maximum likelihood estimation (MLE). Experiments have been performed on the Switchboard/CallHome corpora of telephone conversations with up to 265 hours of training data. It is shown that for the maximum mutual information estimation (MMIE) criterion, I-smoothing reduces the word error rate (WER) by 0.4% absolute over the MMIE baseline. The combination of MPE and I-smoothing gives an improvement of 1% over MMIE and a total reduction in WER of 4.8% absolute over the original MLE system.

758 citations
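
Since MPE and MWE are smoothed approximations to the error rate that such systems ultimately report, a minimal reference sketch of the word error rate itself may help; it is the standard Levenshtein alignment over word tokens, with illustrative names.

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed by dynamic-programming edit distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # sub/match
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return d[len(r)][len(h)] / max(len(r), 1)

# One insertion against a three-word reference: WER = 1/3.
print(word_error_rate("the cat sat", "the cat sat down"))
```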


Book ChapterDOI
TL;DR: A translation model that is based on bilingual phrases to explicitly model the local context is presented and it is shown that this model performs better than the single-word based model.
Abstract: This paper is based on the work carried out in the framework of the VERBMOBIL project, a limited-domain speech translation task (German-English). In the final evaluation, the statistical approach was found to perform best among five competing approaches. In this paper, we further investigate the statistical translation models used. A shortcoming of the single-word based model is that it does not take contextual information into account for the translation decisions. We present a translation model that is based on bilingual phrases to explicitly model the local context, and we show that this model performs better than the single-word based model. We compare monotone and non-monotone search for this model and investigate the benefit of using the sum criterion instead of the maximum approximation.

408 citations


Journal ArticleDOI
TL;DR: It is shown that HMMs trained with MMIE benefit as much as MLE-trained HMMs from model adaptation using maximum likelihood linear regression (MLLR), which has allowed the straightforward integration of MMIE-trained HMMs into complex multi-pass systems for the transcription of conversational telephone speech.

360 citations


Proceedings ArticleDOI
Robert Baumann1
08 Dec 2002
TL;DR: Memory and logic scaling trends are discussed along with a method for determining logic SER; since the soft error rate (SER) of advanced CMOS devices exceeds all other reliability mechanisms combined, SER in logic may limit future product reliability.
Abstract: The soft error rate (SER) of advanced CMOS devices is higher than all other reliability mechanisms combined. Memories can be protected with error correction circuitry but SER in logic may limit future product reliability. Memory and logic scaling trends are discussed along with a method for determining logic SER.

336 citations


Proceedings Article
01 Jan 2002
TL;DR: A new true error bound for classifiers with a margin which is simpler, functionally tighter, and more data-dependent than all previous bounds is shown.
Abstract: We show two related things: (1) Given a classifier which consists of a weighted sum of features with a large margin, we can construct a stochastic classifier with negligibly larger training error rate. The stochastic classifier has a future error rate bound that depends on the margin distribution and is independent of the size of the base hypothesis class. (2) A new true error bound for classifiers with a margin which is simpler, functionally tighter, and more data-dependent than all previous bounds.

209 citations


Journal ArticleDOI
TL;DR: This work shows how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear in the length of W, which leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries.
Abstract: The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein automata and leads to even greater efficiency. Evaluation results are given that also address variants of both methods based on modified Levenshtein distances, where further primitive edit operations (transpositions, merges and splits) are used.

192 citations
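
To make the accepted language concrete without reproducing the paper's automaton construction, the following sketch answers the same membership question ("is the Levenshtein distance between V and W at most n?") with a banded dynamic program; this is a simplification for illustration, not the authors' method.

```python
def within_distance(w, v, n):
    """True iff the Levenshtein distance between w and v is at most n.

    Only a band of width 2n+1 around the diagonal of the usual DP
    table is filled in, mirroring the bounded lookahead that makes a
    degree-n Levenshtein automaton possible.
    """
    if abs(len(w) - len(v)) > n:
        return False
    INF = n + 1                      # any value > n behaves like "infinity"
    prev = [j if j <= n else INF for j in range(len(v) + 1)]
    for i in range(1, len(w) + 1):
        cur = [INF] * (len(v) + 1)
        if i <= n:
            cur[0] = i
        for j in range(max(1, i - n), min(len(v), i + n) + 1):
            cur[j] = min(prev[j - 1] + (w[i - 1] != v[j - 1]),  # sub/match
                         prev[j] + 1,                            # deletion
                         cur[j - 1] + 1)                         # insertion
        prev = cur
    return prev[len(v)] <= n

print(within_distance("word", "wird", 1))   # True: one substitution
print(within_distance("word", "worlds", 1)) # False: two edits needed
```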


Journal ArticleDOI
TL;DR: It is shown that articulatory feature (AF) systems are capable of achieving a superior performance at high noise levels and that the combination of acoustic and AFs consistently leads to a significant reduction of word error rate across all acoustic conditions.

180 citations


Journal ArticleDOI
Dietrich Klakow1, Jochen Peters1
TL;DR: This paper first presents theoretical arguments for a close relationship between perplexity and word error rate; the notion of the uncertainty of a measurement is then introduced and used to test the hypothesis that word error rate and perplexity are related by a power law.

180 citations
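
The hypothesized power law WER ≈ a · PPL^b is linear in log-log space, so it can be checked with ordinary least squares; the (perplexity, WER) pairs below are invented purely for illustration.

```python
import numpy as np

# Hypothetical (perplexity, WER in %) pairs from a family of language models.
ppl = np.array([80.0, 100.0, 130.0, 170.0, 220.0])
wer = np.array([28.1, 30.0, 32.4, 35.1, 38.0])

# A power law WER = a * PPL^b becomes a line in log-log coordinates:
# log WER = b * log PPL + log a.
b, log_a = np.polyfit(np.log(ppl), np.log(wer), deg=1)
print(f"WER ~ {np.exp(log_a):.2f} * PPL^{b:.3f}")
```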


Journal ArticleDOI
TL;DR: This paper presents an approach to recognition confidence scoring and a set of techniques for integrating confidence scores into the understanding and dialogue components of a speech understanding system and demonstrates a relative reduction in concept error rate.

179 citations


Proceedings ArticleDOI
06 Jul 2002
TL;DR: A method is described for constructing a word graph that represents alternative hypotheses efficiently, so that these hypotheses can be rescored using a refined language or translation model.
Abstract: Statistical machine translation systems usually compute the single sentence that has the highest probability according to the models that are trained on data. We describe a method for constructing a word graph to represent alternative hypotheses in an efficient way. The advantage is that these hypotheses can be rescored using a refined language or translation model. Results are presented on the German-English Verbmobil corpus.

Proceedings ArticleDOI
13 May 2002
TL;DR: The connectionist language model is being evaluated on the DARPA HUB5 conversational telephone speech recognition task and preliminary results show consistent improvements in both perplexity and word error rate.
Abstract: This paper describes ongoing work on a new approach to language modeling for large vocabulary continuous speech recognition. Almost all state-of-the-art systems use statistical n-gram language models estimated on text corpora. One principal problem with such language models is that many of the n-grams are never observed even in very large training corpora, so it is common to back off to a lower-order model. In this paper we propose to address this problem by carrying out the estimation task in a continuous space, enabling a smooth interpolation of the probabilities. A neural network is used to learn the projection of the words onto a continuous space and to estimate the n-gram probabilities. The connectionist language model is being evaluated on the DARPA HUB5 conversational telephone speech recognition task, and preliminary results show consistent improvements in both perplexity and word error rate.
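
A minimal numpy sketch of the continuous-space idea, assuming a two-word history, a toy vocabulary and random untrained weights: each history word is projected into a continuous space and a softmax over the output layer yields a smooth distribution over the vocabulary. All dimensions and names are placeholders, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, context = 1000, 32, 64, 2      # vocabulary, embedding, hidden, n-1

# Untrained parameters: word projection, hidden layer, output layer.
E = rng.normal(scale=0.1, size=(V, d))  # maps word ids into continuous space
W1 = rng.normal(scale=0.1, size=(context * d, h))
W2 = rng.normal(scale=0.1, size=(h, V))

def ngram_probs(history):
    """P(w | history) for every word w, from the continuous-space model."""
    x = np.concatenate([E[w] for w in history])  # project and concatenate
    z = np.tanh(x @ W1) @ W2
    z -= z.max()                                 # numerically stable softmax
    p = np.exp(z)
    return p / p.sum()

p = ngram_probs([12, 407])
print(p.shape, p.sum())  # (1000,) 1.0 -- a smooth distribution over the vocab
```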

Journal ArticleDOI
TL;DR: It is shown that the overall error rate can be expressed by a single integral whose integrand is nonnegative and exponentially decaying, and bit-error rates (BERs) are obtained to any desired accuracy with minimal computational complexity.
Abstract: A binary direct-sequence spread-spectrum multiple-access system with random sequences in flat Rayleigh fading is considered. A new explicit closed-form expression is obtained for the characteristic function of the multiple-access interference signals. It is shown that the overall error rate can be expressed by a single integral whose integrand is nonnegative and exponentially decaying. Bit-error rates (BERs) are obtained with this expression to any desired accuracy with minimal computational complexity. The dependence of the system BER on the number of transitions in the target user signature chip sequence is explicitly derived. The results are used to examine definitively the validity of three Gaussian approximations and to compare the performances of synchronous systems to asynchronous systems.

Proceedings ArticleDOI
13 May 2002
TL;DR: It is shown that one of the recent techniques used for speaker recognition, feature warping can be formulated within the framework of Gaussianization, and around 20% relative improvement in both equal error rate (EER) and minimum detection cost function (DCF) is obtained on NIST 2001 cellular phone data evaluation.
Abstract: In this paper, a novel approach for robust speaker verification, namely short-time Gaussianization, is proposed. Short-time Gaussianization is initiated by a global linear transformation of the features, followed by a short-time windowed cumulative distribution function (CDF) matching. First, the linear transformation in the feature space leads to local independence or decorrelation. Then the CDF matching is applied to segments of speech localized in time and warps each feature so that its CDF matches a normal distribution. It is shown that one of the recent techniques used for speaker recognition, feature warping [1], can be formulated within the framework of Gaussianization. Compared to the baseline system with cepstral mean subtraction (CMS), around 20% relative improvement in both equal error rate (EER) and minimum detection cost function (DCF) is obtained on the NIST 2001 cellular phone data evaluation.
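
A sketch of the short-time CDF-matching step under stated assumptions: within a sliding window, each sample of a one-dimensional feature track is replaced by the standard-normal quantile of its empirical mid-rank, which is essentially feature warping. The window length and names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def short_time_gaussianize(x, win=301):
    """Warp each sample of a 1-D feature track so that, within a local
    window, its empirical CDF matches a standard normal distribution."""
    half = win // 2
    y = np.empty(len(x))
    for t in range(len(x)):
        seg = x[max(0, t - half):t + half + 1]
        rank = np.sum(seg < x[t]) + 0.5      # mid-rank of x[t] in its window
        y[t] = norm.ppf(rank / len(seg))     # map empirical CDF to N(0, 1)
    return y

x = np.random.default_rng(0).gamma(2.0, size=2000)      # skewed "feature" track
print(round(float(short_time_gaussianize(x).std()), 2))  # close to 1.0
```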

01 Sep 2002
TL;DR: This paper proposes the use of the syllable as the acoustic unit for spoken name recognition and shows how pronunciation variation modeling by syllables can help in improving recognition performance and reducing the system perplexity.
Abstract: Recognition of spoken names is a challenging task for speech recognition systems because of the large variations in speaking styles, linguistic origins and pronunciations found in names. The complex linguistic nature of names makes it difficult to automatically generate pronunciation variations. For many applications the list of names tends to be of the order of several hundred thousand entries, making spoken name recognition a high-perplexity task. Using multiple pronunciations to account for the variations in names further increases the perplexity of the recognition system substantially. In this paper we propose the use of the syllable as the acoustic unit for spoken name recognition and show how pronunciation variation modeling with syllables can help in improving recognition performance and reducing the system perplexity. We present results comparing systems that use context-dependent phones with syllable-based systems, and demonstrate that a significant increase in recognition accuracy and speed can be achieved by using the syllable as the acoustic unit for spoken name recognition. With a finite state grammar network for spoken name recognition, the observed recognition error rate for the syllable-based system was 40% less than for the phone-based system. For syllable-bigram based information retrieval schemes the observed recognition error rate was about 60% less than for the corresponding phone system.

Journal ArticleDOI
TL;DR: The proposed algorithm, called structural MAPLR (SMAPLR), has been evaluated on the Spoke3 1993 test set of the WSJ task and it is shown that SMAPLR reduces the risk of overtraining and exploits the adaptation data much more efficiently than MLLR, leading to a significant reduction of the word error rate for any amount of adaptation data.

Proceedings ArticleDOI
23 Oct 2002
TL;DR: This system employs noninvasive, inexpensive and fully automated measures of vocal tract characteristics and excitation information, achieving an 8% detection error rate improvement over the best performing classifier that uses carefully measured features prevalent in the state of the art in pathological speech analysis.
Abstract: This study focuses on a robust, rapid and accurate system for automatic detection of normal and pathological speech. The system employs noninvasive, inexpensive and fully automated measures of vocal tract characteristics and excitation information. Mel-frequency filterbank cepstral coefficients and measures of pitch dynamics were modeled by Gaussian mixtures in a hidden Markov model (HMM) classifier. The method was evaluated using sustained phoneme /a/ data obtained from over 700 subjects, both normal and with various pathologies, from the Massachusetts Eye and Ear Infirmary (MEEI) database. The method attained a 99.44% correct classification rate for discriminating normal from pathological speech on sustained /a/. This represents an 8% detection error rate improvement over the best performing classifier that uses carefully measured features prevalent in the state of the art in pathological speech analysis.

Journal ArticleDOI
TL;DR: A version of the HTK Broadcast News transcription system was developed that ran in less than 10 times real time with only a small increase in error rate; it has been used for the bulk transcription of broadcast news for information retrieval from audio data.

Journal Article
TL;DR: In this article, automatic recognition of German continuous sign language is presented; the statistical approach is based on the Bayes decision rule for minimum error rate, and the use of subunits rather than whole-sign models can reduce the amount of necessary training material.
Abstract: This paper is concerned with the automatic recognition of German continuous sign language. For maximum user-friendliness, only a single color video camera is used for image recording. The statistical approach is based on the Bayes decision rule for minimum error rate. Following speech recognition system designs, which are in general based on subunits, the idea of an automatic sign language recognition system using subunits rather than models for whole signs is outlined here. The advantage of such a system will be a future reduction of the necessary training material. Furthermore, a simplified enlargement of the existing vocabulary is expected. Since it is difficult to define subunits for sign language, this approach employs totally self-organized subunits called fenones. The k-means algorithm is used for the definition of such fenones. The software prototype of the system is currently being evaluated in experiments.

Journal ArticleDOI
TL;DR: This work uses writer-independent writing style models (lexemes) to identify the styles present in a particular writer's training data and updates these models using the writer's data, demonstrating the feasibility of this approach on both isolated handwritten character recognition and unconstrained word recognition tasks.
Abstract: Writer adaptation is the process of converting a writer-independent handwriting recognition system into a writer-dependent system. It can greatly increase recognition accuracy, given adequate writer models. The limited amount of data a writer provides during training constrains the models' complexity. We show how appropriate use of writer-independent models is important for the adaptation. Our approach uses writer-independent writing style models (lexemes) to identify the styles present in a particular writer's training data. These models are then updated using the writer's data. Lexemes in the writer's data for which an inadequate number of training examples is available are replaced with the writer-independent models. We demonstrate the feasibility of this approach on both isolated handwritten character recognition and unconstrained word recognition tasks. Our results show an average reduction in error rate of 16.3 percent for lowercase characters as compared with representing each of the writer's character classes with a single model. In addition, an average error rate reduction of 9.2 percent is shown on handwritten words using only a small amount of data for adaptation.

Proceedings ArticleDOI
13 May 2002
TL;DR: An approach to close the gap between text-dependent and text-independent speaker verification performance is presented; results on the 2001 NIST extended data task show this approach can produce an equal error rate below 1%.
Abstract: In this paper we present an approach to close the gap between text-dependent and text-independent speaker verification performance. Text-constrained GMM-UBM systems are created using word segmentations produced by an LVCSR system on conversational speech, allowing the system to focus on speaker differences over a constrained set of acoustic units. Results on the 2001 NIST extended data task show this approach can be used to produce an equal error rate of less than 1%.
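
For orientation, here is a much-simplified GMM-UBM verification sketch using scikit-learn as a stand-in: a background model is trained on a speaker-independent pool and the score is the average per-frame log-likelihood ratio. Real systems MAP-adapt the UBM to the speaker, and the paper additionally constrains scoring to LVCSR word segments; the retraining shortcut and all data below are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-ins for cepstral features: background pool, enrollment, test segment.
background = rng.normal(size=(5000, 13))
enroll = rng.normal(loc=0.3, size=(500, 13))
test = rng.normal(loc=0.3, size=(200, 13))

# Universal background model trained on the speaker-independent pool.
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(background)

# Simplified speaker model, retrained starting from the UBM's means.
# (Real GMM-UBM systems MAP-adapt the UBM instead of retraining.)
spk = GaussianMixture(n_components=8, covariance_type="diag",
                      means_init=ubm.means_, random_state=0).fit(enroll)

# Verification score: average per-frame log-likelihood ratio.
llr = spk.score(test) - ubm.score(test)
print(f"LLR = {llr:.3f} (accept if above a threshold tuned on held-out data)")
```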

01 Jan 2002
TL;DR: Bionic Pattern Recognition uses neural networks that act by covering the high-dimensional geometrical distribution of the sample set in the feature space, exploiting the continuity in feature space of samples of any given class.
Abstract: A new model of pattern recognition principles, based on "matter cognition" instead of the "matter classification" of traditional statistical pattern recognition, is proposed. This new model is closer to the way humans recognize objects than traditional statistical pattern recognition, which takes "optimal separation" as its main principle. The new model is therefore called Bionic Pattern Recognition. Its mathematical basis is the topological analysis of the sample set in the high-dimensional feature space, so it is also called Topological Pattern Recognition. The basic idea of this model rests on the fact that samples of any given class are continuous in the feature space. We performed experiments on the recognition of omnidirectionally oriented rigid objects on the same level, using Bionic Pattern Recognition with neural networks that cover the high-dimensional geometrical distribution of the sample set in the feature space. Many animal and vehicle models (even with rather similar shapes) were recognized omnidirectionally thousands of times. Over a total of 8,800 tests, the correct recognition rate was 99.75%; the error rate and the rejection rate were 0% and 0.25%, respectively.

Journal ArticleDOI
TL;DR: The minimum attainable error rate of a device discriminating between three particularly chosen pure qubit states is calculated with the help of the proposed algorithm.
Abstract: We propose a numerical algorithm for finding optimal measurements for quantum-state discrimination. The theory of semidefinite programming provides a simple check of the optimality of the numerically obtained results. With the help of our algorithm we calculate the minimum attainable error rate of a device discriminating between three particularly chosen pure qubit states.
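
The semidefinite program in question can be written down directly; the sketch below uses cvxpy with equal priors and three example "trine" pure qubit states (these specific states and priors are assumptions, not necessarily the paper's choice).

```python
import numpy as np
import cvxpy as cp

def rho(theta):
    """Density matrix of the pure qubit state cos(t)|0> + sin(t)|1>."""
    v = np.array([np.cos(theta), np.sin(theta)])
    return np.outer(v, v)

states = [rho(t) for t in (0.0, np.pi / 3, 2 * np.pi / 3)]  # "trine" states
priors = [1 / 3, 1 / 3, 1 / 3]

# POVM elements: Hermitian, positive semidefinite, summing to the identity.
M = [cp.Variable((2, 2), hermitian=True) for _ in states]
constraints = [m >> 0 for m in M] + [sum(M) == np.eye(2)]

# Maximize the success probability; the minimum error rate is 1 - p_success.
p_success = sum(p * cp.real(cp.trace(m @ r))
                for p, m, r in zip(priors, M, states))
problem = cp.Problem(cp.Maximize(p_success), constraints)
problem.solve()
print(f"minimum error rate = {1 - problem.value:.4f}")  # 1/3 for the trine
```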

Proceedings ArticleDOI
13 May 2002
TL;DR: Improvements to an innovative high-performance speaker recognition system are described, incorporating gender-dependent phone models, pre-processing the speech files to remove cross-talk, and developing more sophisticated fusion techniques for the multi-language likelihood scores.
Abstract: This paper describes improvements to an innovative high-performance speaker recognition system. Recent experiments showed that, with sufficient training data, phone strings from multiple languages are exceptional features for speaker recognition. The prototype phonetic speaker recognition system used phone sequences from six languages to produce an equal error rate of 11.5% on Switchboard-I audio files. The improved system described in this paper reduces the equal error rate to less than 4%. This is accomplished by incorporating gender-dependent phone models, pre-processing the speech files to remove cross-talk, and developing more sophisticated fusion techniques for the multi-language likelihood scores.

Proceedings ArticleDOI
10 Dec 2002
TL;DR: The idea of an automatic sign language recognition system using subunits rather than models for whole signs is outlined, which will be a future reduction of necessary training material and a simplified enlargement of the existing vocabulary.
Abstract: This paper deals with the automatic recognition of German signs. The statistical approach is based on the Bayes decision rule for minimum error rate. Following speech recognition system designs, which are in general based on phonemes, the idea of an automatic sign language recognition system using subunits rather than models for whole signs is outlined here. The advantage of such a system will be a future reduction of the necessary training material. Furthermore, a simplified enlargement of the existing vocabulary is expected, as new signs can be added to the vocabulary database without re-training the existing hidden Markov models (HMMs) for subunits. Since it is difficult to define subunits for sign language, this approach employs totally self-organized subunits. In initial experiments, a recognition accuracy of 92.5% was achieved for 100 previously trained signs. For 50 new signs, an accuracy of 81% was achieved without retraining of the subunit HMMs.

Proceedings ArticleDOI
Luhong Liang1, Xiaoxing Liu1, Yibao Zhao1, Xiaobo Pi1, Ara V. Nefian1 
07 Nov 2002
TL;DR: The speaker-independent audio-visual continuous speech recognition system presented relies on a robust set of visual features obtained from accurate detection and tracking of the mouth region, and integrates the audio and visual observations using a coupled hidden Markov model (CHMM).
Abstract: The increase in the number of multimedia applications that require robust speech recognition systems determined a large interest in the study of audio-visual speech recognition (AVSR) systems. The use of visual features in AVSR is justified by both the audio and visual modality of the speech generation and the need for features that are invariant to acoustic noise perturbation. The speaker independent audio-visual continuous speech recognition system presented relies on a robust set of visual features obtained from the accurate detection and tracking of the mouth region. Further, the visual and acoustic observation sequences are integrated using a coupled hidden Markov (CHMM) model. The statistical properties of the CHMM can model the audio and visual state asynchrony while preserving their natural correlation over time. The experimental results show that the current system tested on the XM2VTS database reduces by over 55% the error rate of the audio only speech recognition system at SNR of 0 dB.

Patent
13 May 2002
TL;DR: A linear transformation of parallel multiple input, multiple output (MIMO) encoded streams is described, together with space-time diversity and asymmetrical symbol mapping of parallel streams.
Abstract: A linear transformation of parallel multiple input, multiple output (MIMO) encoded streams; also, space-time diversity and asymmetrical symbol mapping of parallel streams. Separately or together, these improve error rate performance as well as system throughput. Preferred embodiments include CDMA wireless systems with multiple antennas.

Journal ArticleDOI
TL;DR: This study examines several key issues in system combination for the word sense disambiguation task, ranging from algorithmic structure to parameter estimation, and demonstrates that the combination system obtains a significantly lower error rate than other systems participating in the SENSEVAL2 exercise.
Abstract: Classifier combination is an effective and broadly useful method of improving system performance. This article investigates in depth a large number of both well-established and novel classifier combination approaches for the word sense disambiguation task, studied over a diverse classifier pool which includes feature-enhanced Naive Bayes, Cosine, Decision List, Transformation-based Learning and MMVC classifiers. Each classifier has access to the same rich feature space, comprised of distance weighted bag-of-lemmas, local ngram context and specific syntactic relations, such as Verb-Object and Noun-Modifier. This study examines several key issues in system combination for the word sense disambiguation task, ranging from algorithmic structure to parameter estimation. Experiments using the standard SENSEVAL2 lexical-sample data sets in four languages (English, Spanish, Swedish and Basque) demonstrate that the combination system obtains a significantly lower error rate when compared with other systems participating in the SENSEVAL2 exercise, yielding state-of-the-art performance on these data sets.
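
As a minimal illustration of one strategy from this family, the sketch below performs a weighted majority vote over per-classifier sense predictions; the classifier names, weights and senses are invented, and the article itself studies considerably richer combination schemes.

```python
from collections import Counter

def combine(predictions, weights=None):
    """Weighted majority vote over per-classifier sense predictions.

    predictions: {classifier_name: predicted_sense}
    weights:     {classifier_name: vote weight}, e.g. held-out accuracy.
    """
    weights = weights or {name: 1.0 for name in predictions}
    tally = Counter()
    for name, sense in predictions.items():
        tally[sense] += weights[name]
    return tally.most_common(1)[0][0]

votes = {"naive_bayes": "bank/river", "cosine": "bank/finance",
         "decision_list": "bank/river", "tbl": "bank/river"}
accs = {"naive_bayes": 0.71, "cosine": 0.65, "decision_list": 0.69, "tbl": 0.66}
print(combine(votes, accs))  # -> "bank/river"
```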

Journal ArticleDOI
TL;DR: In this paper, a new technique is presented for searching digital audio at the word/phrase level, which combines high speed and accuracy, supports open vocabulary, imposes low penalty for new words, permits phonetic and inexact spelling, enables user-determined depth of search, and is amenable to parallel execution for highly scalable deployment.
Abstract: A new technique is presented for searching digital audio at the word/phrase level. Unlike previous methods based upon Large Vocabulary Continuous Speech Recognition (LVCSR, with inherent problems of closed vocabulary and high word error rate), phonetic searching combines high speed and accuracy, supports open vocabulary, imposes low penalty for new words, permits phonetic and inexact spelling, enables user-determined depth of search, and is amenable to parallel execution for highly scalable deployment. A detailed comparison of accuracy between phonetic searching and one popular embodiment of LVCSR is presented along with other operating characteristics of the new technique. The current implementation for Digital Media Asset Management (DMAM) is described along with suggested applications in other domains.

Proceedings Article
01 Sep 2002
TL;DR: Stereo-based Piecewise Linear Compensation for Environments (SPLICE) is a general framework for removing distortions from noisy speech cepstra that contains a non-parametric model for cepstral corruption, which is learned from two channels of training data.
Abstract: Stereo-based Piecewise Linear Compensation for Environments (SPLICE) is a general framework for removing distortions from noisy speech cepstra. It contains a non-parametric model for cepstral corruption, which is learned from two channels of training data. We evaluate SPLICE on both the Aurora 2 and 3 tasks. These tasks consist of digit sequences in five European languages. Noise corruption is both synthetic (Aurora 2) and realistic (Aurora 3). For both the Aurora 2 and 3 tasks, we use the same training and testing procedure provided with the corpora. By holding the back-end constant, we ensure that any increase in word accuracy is due to our front-end processing techniques. In the Aurora 2 task, we achieve a 76.86% average decrease in word error rate with clean acoustic models, and an overall improvement of 62.63%. For the Aurora 3 task, we achieve a 75.06% average decrease in word error rate for the high-mismatch experiment, and an overall improvement of 47.19%.