Topic

Word error rate

About: Word error rate is a research topic. Over its lifetime, 11,939 publications have been published within this topic, receiving 298,031 citations.


Papers
Proceedings ArticleDOI
12 May 1998
TL;DR: An approach that estimates the confidence in a hypothesized word as its posterior probability given all acoustic feature vectors of the utterance, computed as the sum of the probabilities of all word hypotheses that represent the occurrence of the same word in roughly the same segment of time.
Abstract: Estimates of confidence for the output of a speech recognition system can be used in many practical applications of speech recognition technology. They can be employed for detecting possible errors and can help to avoid undesirable verification turns in automatic inquiry systems. We propose to estimate the confidence in a hypothesized word as its posterior probability, given all acoustic feature vectors of the speaker utterance. The basic idea of our approach is to estimate the posterior word probabilities as the sum of all word hypothesis probabilities which represent the occurrence of the same word in more or less the same segment of time. The word hypothesis probabilities are approximated by paths in a word graph and are computed using a simplified forward-backward algorithm. We present experimental results on the North American Business (NAB'94) and the German Verbmobil recognition tasks.
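To make the idea concrete, here is a minimal sketch of lattice-based word confidence in the spirit of this paper: edge posteriors are computed with a forward-backward pass over a word graph, and a word's confidence is the summed posterior of all edges carrying the same word at overlapping times. The tiny lattice and its scores are invented for illustration, not taken from the paper.

```python
# Toy word-graph confidence estimation: forward-backward edge posteriors,
# then pooling of same-word, time-overlapping edges.
import math
from collections import defaultdict

# Each edge: (from_node, to_node, word, log_score); node ids double as times.
edges = [
    (0, 10, "i",     -4.0),
    (0, 10, "hi",    -5.0),
    (10, 25, "want",  -3.0),
    (10, 25, "won't", -6.0),
    (25, 40, "tea",   -2.5),
    (25, 40, "t",     -4.5),
]
start, end = 0, 40

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Forward pass: log-sum over all partial paths reaching each node.
alpha = defaultdict(lambda: float("-inf"), {start: 0.0})
for f, t, w, s in sorted(edges):                  # topological by node time
    alpha[t] = logsumexp([alpha[t], alpha[f] + s])

# Backward pass: log-sum over all partial paths from each node to the end.
beta = defaultdict(lambda: float("-inf"), {end: 0.0})
for f, t, w, s in sorted(edges, reverse=True):
    beta[f] = logsumexp([beta[f], s + beta[t]])

total = alpha[end]                                 # log-sum over all full paths

def confidence(word, t0, t1):
    # Sum posteriors of all edges with the same word and overlapping span.
    posts = [alpha[f] + s + beta[t] - total
             for f, t, w, s in edges
             if w == word and f < t1 and t > t0]
    return math.exp(logsumexp(posts)) if posts else 0.0

for f, t, w, s in edges:
    print(f"{w:>6s} [{f:3d},{t:3d}]  confidence = {confidence(w, f, t):.3f}")
```

High-posterior words ("want", "tea") come out near 1.0, while competing hypotheses in the same time slots ("won't", "t") receive correspondingly low confidence.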

127 citations

01 Jan 2007
TL;DR: A probabilistic model based on context-dependent phonetic rewrite rules is developed to derive a list of possible pronunciations for all words or sequences of words, and each variant is annotated with an observation probability to reduce the confusability of this expanded dictionary.
Abstract: To provide rapid access to meetings between human beings, transcription, tracking, retrieval and summarization of ongoing human-to-human conversation must be achieved. In DARPA- and DoD-sponsored work (projects GENOA and CLARITY) we aim to develop strategies to transcribe human discourse and provide rapid access to the structure and content of this human exchange. The system consists of four major components: 1) the speech transcription engine, based on the JANUS recognition toolkit; 2) the summarizer, a statistical tool that attempts to find salient and novel turns in the exchange; 3) the discourse component, which attempts to identify the speech acts; and 4) the non-verbal structure, including speaker types and non-verbal visual cues. The meeting browser also attempts to identify the speech acts found in the turns of the meeting and to track topics. The browser is implemented in Java and also includes video capture of the individuals in the meeting. It attempts to identify the speakers and their focus of attention from acoustic and visual cues.

1. THE MEETING RECOGNITION ENGINE

The speech recognition component of the meeting browser is based on the JANUS Switchboard recognizer trained for the 1997 NIST Hub-5E evaluation [3]. The gender-independent, vocal-tract-length-normalized, large-vocabulary recognizer features dynamic, speaking-mode-adaptive acoustic and pronunciation models [2], which allow for robust recognition of conversational speech as observed in human-to-human dialogs.

1.1 Speaking Mode Dependent Pronunciation Modeling

In spontaneous conversational human-to-human speech as observed in meetings, there is a large amount of variability due to accents, speaking styles and speaking rates (also known as the speaking mode [6]). Because current recognition systems usually use only a relatively small number of pronunciation variants for the words in their dictionaries, the amount of variability that can be modeled is limited. Increasing the number of variants per dictionary entry may seem the obvious solution, but doing so actually results in an increase in error rate, explained by the greater confusability between dictionary entries, particularly for short, reduced words. We developed a probabilistic model based on context-dependent phonetic rewrite rules to derive a list of possible pronunciations for all words or sequences of words [2][4]. To reduce the confusability of this expanded dictionary, each variant of a word is annotated with an observation probability. To this end, we automatically retranscribe the corpus based on all allowable variants using flexible utterance transcription graphs (Flexible Transcription Alignment, FTA [5]) and speaker-adapted models. The alignments are then used to train a model of how likely each form of variation (i.e., rule) is, and how likely a variant is to be observed in a given context (acoustic, word, speaking mode or dialogue). For decoding, the probability of encountering pronunciation variants is then defined as a function of the speaking style (phonetic context, linguistic context, speaking rate and duration). The probability function is learned with decision trees from rule-generated pronunciation variants as observed on the Switchboard corpus [2].

1.2 Experimental Setup

To date, we have experimented with three different meeting environments and tasks to assess performance in terms of word accuracy and summarization quality: i) Switchboard human-to-human telephone conversations, ii) research group meetings recorded in the Interactive Systems labs, and iii) simulated crisis-management meetings (3 participants), which also include video capture of the individuals. We report results from speech recognition experiments in the first two conditions.

1) Human-to-Human Telephone. The test set used to evaluate the flexible transcription alignment approach consisted of the Switchboard and CallHome partitions of the 1996 NIST Hub-5E evaluation set. All test runs were carried out using a Switchboard recognizer trained with the JANUS Recognition Toolkit (JRTk) [4]. The preprocessing of the system begins by extracting MFCC-based feature vectors every 10 ms. A truncated LDA transformation is performed over a concatenation of the MFCCs and their first- and second-order derivatives. Vocal tract length normalization and cepstral mean subtraction are applied to reduce speaker and channel differences. The rule-based expanded dictionary used in these tests contained 1.78 pronunciation variants per word, compared to 1.13 in the baseform dictionary (PronLex).

The first set of results in Table 1 is based on a recognizer whose polyphonic decision trees were still trained on Viterbi alignments from the unexpanded dictionary. We compare a baseline system trained on the base dictionary with an FTA-trained system using the expanded dictionary, tested in two ways: with the base dictionary and with the expanded one. FTA training reduces the word error rate significantly, which means that we improved the quality of the transcriptions through FTA and pronunciation modeling. Due to the added confusability of the expanded dictionary, testing with the large dictionary without any weighting of the variants yields slightly worse results than testing with the baseline dictionary.

Table 1: Recognition results using flexible transcription alignment training and label boosting. The test using the expanded dictionary was done without weighting the variants.

  Condition                             SWB WER   CH WER
  Baseline                              32.2%     43.7%
  FTA training, test w/ base dict       30.7%     41.9%
  FTA training, test w/ expanded dict   31.1%     42.5%

Adding vowel-stress-related questions to the phonetic clustering procedure and regrowing the polyphonic decision tree based on FTA labels improved performance by 2.6% absolute on Switchboard and 2.2% absolute on CallHome. Table 2 shows results for mode-dependent pronunciation weighting; we gain an additional ~2% absolute by weighting the pronunciations based on mode-related features.

Table 2: Results using different pronunciation variant weighting schemes.

  Condition           SWB WER   CH WER
  Unweighted          28.7%     38.6%
  Weighted p(r|w)     27.1%     36.7%
  Weighted p(r|w,m)   26.7%     36.1%

2) Research Group Meetings. In a second experiment we used recordings made during internal group meetings at our lab. We placed lapel microphones on three out of ten participants and recorded the signals on those three channels. Each meeting was approximately one hour in length, for a total of three hours of speech on which to adapt and test. Since we had no additional training data collected in this particular environment, the following unsupervised adaptation techniques were used to adapt a read-speech, clean-environment Wall Street Journal dictation recognizer to the meeting conditions:

1. MLLR-based adaptation: Our system employs a regression tree, constructed using an acoustic similarity criterion for the definition of regression classes. The tree is pruned as necessary to ensure sufficient adaptation data at each leaf. For each leaf node we calculate a linear transformation that maximizes the likelihood of the adaptation data; the number of transformations is determined automatically.

2. Iterative batch-mode unsupervised adaptation: The quality of adaptation depends directly on the quality of the hypotheses on which the alignments are based. We iterate the adaptation procedure, improving both the acoustic models and the hypotheses they produce. Significant gains were observed over two iterations, after which performance converges.

3. Adaptation with confidence measures: Confidence measures were used to automatically select the best candidates for adaptation. We used the stability of a hypothesis in a lattice as an indicator of confidence: if, when rescoring the lattice with a variety of language model weights and insertion penalties, a word appears in every possible top-1 hypothesis, acoustic stability is indicated, and such words are often good candidates for adaptation. Using only these words in the adaptation procedure yields 1-2% gains in word accuracy over blind adaptation [9].

The baseline performance of the JRTk-based WSJ recognizer on the Hub4-Nov94 test set is about 7% WER. These preliminary experiments suggest that, due to the effects of spontaneous human-to-human speech, significant differences in recording conditions, significant crosstalk on the recorded channels, significantly different microphone characteristics, and inappropriate language models, the error rate on meetings is in the range of 40-50% WER.

Table 3: Error rates (WER, %) for three speakers in a research group meeting, using JRTk trained on WSJ dictation data, over successive adaptation iterations.

  Speaker   Iter. 0   Iter. 1   Iter. 2   Adaptation Gain
  maxl      51.7      45.3      45.2      12%
  fdmg      48.4      43.8      44.9      9%
  flsl      63.8      59.5      59.6      7%
  Total     54.8      49.6      49.9
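All of the tables above report results as word error rate, the topic of this page. For reference, WER is conventionally computed from a Levenshtein alignment between the reference and hypothesis word sequences, as (substitutions + deletions + insertions) / reference length. A minimal sketch (function name and example strings are mine):

```python
# Word error rate via dynamic-programming edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("i want tea", "i want a tea"))  # one insertion over 3 words: 0.333
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why the 40-50% meeting figures above, while high, are still far from the worst case.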

127 citations

Journal ArticleDOI
TL;DR: This work shows how to perform the processing associated with an n×n lattice of qubits, each being manipulated in a realistic, fault-tolerant manner, in O(n²) average time per round of error correction.
Abstract: The surface code is unarguably the leading quantum error correction code for 2D nearest neighbor architectures, featuring a high threshold error rate of approximately 1%, low overhead implementations of the entire Clifford group, and flexible, arbitrarily long-range logical gates. These highly desirable features come at the cost of significant classical processing complexity. We show how to perform the processing associated with an n×n lattice of qubits, each being manipulated in a realistic, fault-tolerant manner, in O(n²) average time per round of error correction. We also describe how to parallelize the algorithm to achieve O(1) average processing per round, using only constant computing resources per unit area and local communication. Both of these complexities are optimal.
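The abstract is about the cost of this classical processing rather than its mechanics. As a loose illustration of what "processing per round of error correction" involves, here is a toy decoder for a 1D repetition code: it is emphatically not the paper's O(n²) surface-code matching algorithm, just the same measure-syndrome-then-infer-correction loop in miniature, with minimum-weight decoding over the two consistent error patterns.

```python
# Toy classical processing for one round of error correction on a
# 1D repetition code (a stand-in for the 2D surface code).
import random

n, p = 15, 0.08                                    # qubits, bit-flip rate
errors = [random.random() < p for _ in range(n)]   # hidden physical errors

# Syndrome: parity of neighboring qubits; defects mark ends of flip chains.
syndrome = [errors[i] != errors[i + 1] for i in range(n - 1)]

# Exactly two error patterns are consistent with this syndrome; they differ
# by a logical (all-qubit) flip. Enumerate both and keep the lighter one.
def candidate(first_qubit_flipped):
    pattern, flipped = [], first_qubit_flipped
    for i in range(n):
        pattern.append(flipped)
        if i < n - 1 and syndrome[i]:
            flipped = not flipped
    return pattern

a, b = candidate(False), candidate(True)
correction = a if sum(a) <= sum(b) else b          # minimum-weight decoding

# The residual is all-or-nothing: empty (success) or a logical flip (failure).
residual = [e != c for e, c in zip(errors, correction)]
print("logical error:", any(residual))
```

For the surface code the analogous step is minimum-weight matching of defects on a 2D lattice, which is where the O(n²)-per-round processing cost the paper optimizes comes from.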

127 citations

Journal ArticleDOI
Yanan Zhong, Jianshi Tang, Xinyi Li, Bin Gao, He Qian, Huaqiang Wu
TL;DR: In this paper, a parallel dynamic memristor-based reservoir computing system was proposed by applying a controllable mask process, in which the critical parameters, including state richness, feedback strength and input scaling, can be tuned by changing the mask length and the range of input signal.
Abstract: Reservoir computing is a highly efficient network for processing temporal signals due to its low training cost compared to standard recurrent neural networks, and generating rich reservoir states is critical in the hardware implementation. In this work, we report a parallel dynamic memristor-based reservoir computing system built by applying a controllable mask process, in which the critical parameters, including state richness, feedback strength and input scaling, can be tuned by changing the mask length and the range of the input signal. Our system achieves a low word error rate of 0.4% in spoken-digit recognition and a low normalized root mean square error of 0.046 in time-series prediction of the Henon map, which outperforms most existing hardware-based reservoir computing systems and also software-based ones in the Henon map prediction task. Our work could pave the way toward high-efficiency memristor-based reservoir computing systems that handle more complex temporal tasks in the future.
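For intuition, the pipeline the authors realize in memristor hardware can be sketched in software as an echo state network: a fixed random recurrent network generates rich states, and only a linear readout is trained. The sketch below does one-step Henon map prediction and reports NRMSE; all sizes and hyperparameters are illustrative choices of mine, not the paper's.

```python
# Echo-state-network sketch of reservoir computing on the Henon map.
import numpy as np

rng = np.random.default_rng(0)

# Henon map as a delayed scalar series: x[t+1] = 1 - 1.4*x[t]^2 + 0.3*x[t-1]
T = 3000
x = np.zeros(T)
x[1] = 0.1
for t in range(1, T - 1):
    x[t + 1] = 1.0 - 1.4 * x[t] ** 2 + 0.3 * x[t - 1]

# Fixed random reservoir; only the linear readout below is trained.
N = 300
W_in = rng.uniform(-0.5, 0.5, N)                   # input scaling
W = rng.normal(0.0, 1.0, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))    # feedback strength

states = np.zeros((T, N))
for t in range(T - 1):
    states[t + 1] = np.tanh(W @ states[t] + W_in * x[t])

# states[t] has seen x[0..t-1]; train a ridge readout to predict x[t].
warm, split = 100, 2000
X_tr, y_tr = states[warm:split], x[warm:split]
X_te, y_te = states[split:], x[split:]
w = np.linalg.solve(X_tr.T @ X_tr + 1e-6 * np.eye(N), X_tr.T @ y_tr)

pred = X_te @ w
nrmse = np.sqrt(np.mean((pred - y_te) ** 2)) / np.std(y_te)
print(f"one-step NRMSE on the Henon map: {nrmse:.4f}")
```

The memristor system replaces the random recurrent matrix with the physical dynamics of the devices under the mask process; the cheap-to-train linear readout is the part both share.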

126 citations

Journal ArticleDOI
TL;DR: This article introduces and evaluates several different word-level confidence measures for machine translation, which provide a method for labeling each word in an automatically generated translation as correct or incorrect.
Abstract: This article introduces and evaluates several different word-level confidence measures for machine translation. These measures provide a method for labeling each word in an automatically generated translation as correct or incorrect. All approaches to confidence estimation presented here are based on word posterior probabilities. Different concepts of word posterior probabilities, as well as different ways of calculating them, will be introduced and compared. They can be divided into two categories: system-based methods that exploit knowledge provided by the translation system that generated the translations, and direct methods that are independent of the translation system. The system-based techniques make use of system output, such as word graphs or N-best lists. The word posterior probability is determined by summing the probabilities of the sentences in the translation hypothesis space that contain the target word. The direct confidence measures take other knowledge sources, such as word or phrase lexica, into account. They can be applied to output from nonstatistical machine translation systems as well. Experimental assessment of the different confidence measures on various translation tasks and in several language pairs will be presented. Moreover, the application of confidence measures for rescoring of translation hypotheses will be investigated.
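As a rough illustration of the system-based, N-best flavor of word posterior probabilities described here (ignoring the position and alignment refinements the article compares), a word's confidence is the normalized probability mass of the hypotheses containing it. The scores, sentences and decision threshold below are invented for the example.

```python
# Word posteriors from an N-best list: P(word) = mass of hypotheses
# containing the word / total mass of the list.
import math

# (log score, translation) pairs, as an MT system might emit.
nbest = [
    (-2.0, "the house is small"),
    (-2.3, "the house is little"),
    (-3.1, "the home is small"),
    (-4.0, "a house is small"),
]

def word_posterior(word: str, nbest) -> float:
    m = max(s for s, _ in nbest)                    # for numerical stability
    z = sum(math.exp(s - m) for s, _ in nbest)      # total list mass
    mass = sum(math.exp(s - m) for s, sent in nbest
               if word in sent.split())             # hypotheses with the word
    return mass / z

for w in "the house is small".split():
    p = word_posterior(w, nbest)
    tag = "likely correct" if p > 0.7 else "suspect"
    print(f"{w:>6s}  p = {p:.2f}  -> {tag}")
```

Words on which the N-best list agrees ("is") score near 1.0, while words that competing hypotheses replace ("the", "small") score lower, which is exactly the signal used to label words as correct or incorrect.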

126 citations


Network Information
Related Topics (5)
Deep learning
79.8K papers, 2.1M citations
88% related
Feature extraction
111.8K papers, 2.1M citations
86% related
Convolutional neural network
74.7K papers, 2M citations
85% related
Artificial neural network
207K papers, 4.5M citations
84% related
Cluster analysis
146.5K papers, 2.9M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    271
2022    562
2021    640
2020    643
2019    633
2018    528