
Word error rate

About: Word error rate is a research topic. Over the lifetime, 11939 publications have been published within this topic receiving 298031 citations.


Papers
01 Jan 1998
TL;DR: A hidden Markov model is used to extract information from broadcast news, with the encouraging result that a language-independent, trainable information extraction algorithm degraded on speech input by at most the word error rate of the recognizer.
Abstract: We report results using a hidden Markov model to extract information from broadcast news. IdentiFinderTM was trained on the broadcast news corpus and tested on both the 1996 HUB-4 development test data and the 1997 HUB-4 evaluation test data with respect to the named entity (NE) task, extracting:
• names of locations, persons, and organizations;
• dates and times;
• monetary amounts and percentages.
Evaluation is based on automatic word alignment of the speech recognition output (the NIST algorithm) followed by the MUC-6/MUC-7 scorer for NE on text, since MUC scoring assumes identical text in the system output and in the answer key. Additionally, we used the experimental MITRE scoring metric (Burger et al., 1998). The most encouraging result is that a language-independent, trainable information extraction algorithm degraded on speech input by at most the word error rate of the recognizer.

1. MOTIVATING FACTORS
One of the reasons behind this effort is to go beyond speech transcription (e.g., beyond the dictation problem) to address (at least) shallow understanding of speech. As a result of this effort, we believe that evaluating named entity (NE) extraction from speech offers a measure complementary to word error rate (WER) and represents a measure of understanding. The scores for NE from speech seem to track the quality of speech recognition proportionally, i.e., NE performance degrades at worst linearly with word error rate. A second motivation is the fact that NE is the first information extraction task from text showing success, with error rates on newswire of less than 10%. The named entity problem has generated much interest, as evidenced by its inclusion as an understanding task to be evaluated in both the Sixth and Seventh Message Understanding Conferences (MUC-6 and MUC-7), in the First and Second Multilingual Entity Task evaluations (MET-1 and MET-2), and as a planned track in the next broadcast news evaluation.
Furthermore, at least one commercial product has emerged: NameTagTM from IsoQuest. NE is defined by a set of annotation guidelines, an evaluation metric, and example data (Chinchor, 1997).

2. THE NAMED ENTITY PROBLEM FOR SPEECH
The named entity task is to identify all named locations, named persons, named organizations, dates, times, monetary amounts, and percentages. Though this sounds clear, enough special cases arise to require lengthy guidelines, e.g., when is The Wall Street Journal an artifact, and when is it an organization? When is White House an organization, and when a location? Are branch offices of a bank an organization? Is a street name a location? Should yesterday and last Tuesday be labeled dates? Is mid-morning a time? For human annotator consistency, guidelines with numerous special cases have been defined for the Seventh Message Understanding Conference, MUC-7 (Chinchor, 1997). In training data, the boundaries of an expression and its type must be marked via SGML. Various GUIs support manual preparation of training data and reference answers. Though the problem is relatively easy in mixed-case English prose, it is not solvable solely by recognizing capitalization in English. Though capitalization does indicate proper nouns in English, the type of the entity (person, organization, location, or none of those) must be identified. Many proper noun categories are not to be marked, e.g., nationalities, product names, and book titles. Named entity recognition is a challenge where case does not signal proper nouns, e.g., in Chinese, Japanese, German, or non-text modalities (e.g., speech). Since the task was generalized to other languages in the multilingual entity task (MET), the task definition is no longer dependent on the use of mixed case in English. Broadcast news presents significant challenges, as illustrated in Table 1. Not having mixed case removes information useful for recognizing names in English.
Automatically transcribed speech, even with no recognition errors, is harder due to the lack of punctuation, the spelling of numbers out as words, and the all-upper-case SNOR (Speech Normalized Orthographic Representation) format.

3. OVERVIEW OF HMM IN IDENTIFINDERTM
A full description of our HMM for named entity extraction appears in Bikel et al., 1997. By definition of the task, only a single label can be assigned to a word in context. Therefore, to every word, the HMM will assign either one of the desired classes (e.g., person, organization, etc.) or the label NOT-A-NAME (to represent "none of the desired classes"). We organize the states into regions, one region for each desired class plus one for NOT-A-NAME. See Figure 1. The HMM will have a model of each desired class and of the other text. The implementation is not confined to the seven classes of NE; in fact, it determines the set of classes from the SGML labels in the training data. Additionally, there are two special states, the START-OF-SENTENCE and END-OF-SENTENCE states. Within each of the regions, we use a statistical bigram language model, and emit exactly one word upon entering each state. Therefore, the number of states in each of the name-class regions is equal to the vocabulary size, V. The generation of words and name-classes proceeds in the following steps:
1. Select a name-class NC, conditioning on the previous name-class and the previous word.
2. Generate the first word inside that name-class, conditioning on the current and previous name-classes.
3. Generate all subsequent words inside the current name-class, where each subsequent word is conditioned on its immediate predecessor.
4. If not at the end of a sentence, go to 1.
Using the Viterbi algorithm, we search the entire space of all possible name-class assignments, maximizing Pr(W, NC). This model allows each type of "name" to have its own language, with separate bigram probabilities for generating its words.
This reflects our intuition that:
• There is generally predictive internal evidence regarding the class of a desired entity. Consider the following evidence: organization names tend to be stereotypical for airlines, utilities, law firms, insurance companies, other corporations, and government organizations. Organizations tend to select names that suggest the purpose or type of the organization. For person names, first names are stereotypical in many cultures; in Chinese, family names are stereotypical. In Chinese and Japanese, special characters are used to transliterate foreign names. Monetary amounts typically include a unit term, e.g., Taiwan dollars, yen, German marks, etc.
• Local evidence often suggests the boundaries and class of one of the desired expressions. Titles signal the beginnings of person names. Closed-class words, such as determiners, pronouns, and prepositions, often signal a boundary. Corporate designators (Inc., Ltd., Corp., etc.) often end a corporation name.
While the number of word-states within each name-class is equal to V, this "interior" bigram language model is ergodic.

Mixed Case: The crash was the second of a 757 in less than two months. On Dec. 20, an American Airlines jet crashed in the mountains near Cali, Colombia, killing 160 of the 164 people on board. The cause of that crash is still under investigation.
UPPER CASE: THE CRASH WAS THE SECOND OF A 757 IN LESS THAN TWO MONTHS. ON DEC. 20, AN AMERICAN AIRLINES JET CRASHED IN THE MOUNTAINS NEAR CALI, COLOMBIA, KILLING 160 OF THE 164 PEOPLE ON BOARD. THE CAUSE OF THAT CRASH IS STILL UNDER INVESTIGATION.
SNOR: THE CRASH WAS THE SECOND OF A SEVEN FIFTY SEVEN IN LESS THAN TWO MONTHS ON DECEMBER TWENTY AN AMERICAN AIRLINES JET CRASHED IN THE MOUNTAINS NEAR CALI COLOMBIA KILLING ONE HUNDRED SIXTY OF THE ONE HUNDRED SIXTY FOUR PEOPLE ON BOARD THE CAUSE OF THAT CRASH IS STILL UNDER INVESTIGATION
Table 1: Illustration of difficulties presented by speech recognition output (SNOR).
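The four-step generation procedure and Viterbi decoding described above can be sketched with a toy decoder. This is a simplified illustration, not IdentiFinder itself: the per-class bigram word models are collapsed into flat per-class unigram emissions, the class set is reduced to three, and all probabilities are invented for the example.

```python
import math

# Toy name-class HMM: a simplified sketch of the IdentiFinder idea.
CLASSES = ["PERSON", "ORGANIZATION", "NOT-A-NAME"]

# Hypothetical probabilities, invented purely for illustration.
trans = {  # P(class_i | class_{i-1})
    "START": {"PERSON": 0.2, "ORGANIZATION": 0.2, "NOT-A-NAME": 0.6},
    "PERSON": {"PERSON": 0.5, "ORGANIZATION": 0.1, "NOT-A-NAME": 0.4},
    "ORGANIZATION": {"PERSON": 0.1, "ORGANIZATION": 0.5, "NOT-A-NAME": 0.4},
    "NOT-A-NAME": {"PERSON": 0.15, "ORGANIZATION": 0.15, "NOT-A-NAME": 0.7},
}
emit = {  # P(word | class); unseen words get a small probability floor
    "PERSON": {"mr": 0.3, "smith": 0.4},
    "ORGANIZATION": {"acme": 0.4, "corp": 0.4},
    "NOT-A-NAME": {"said": 0.2, "that": 0.2, "the": 0.2},
}
FLOOR = 1e-4

def viterbi(words):
    """Return the maximum-probability name-class sequence for the words."""
    # delta[c] = best log-probability of any path ending in class c
    delta = {c: math.log(trans["START"][c]) + math.log(emit[c].get(words[0], FLOOR))
             for c in CLASSES}
    backpointers = []
    for w in words[1:]:
        new_delta, back = {}, {}
        for c in CLASSES:
            best_prev = max(CLASSES, key=lambda p: delta[p] + math.log(trans[p][c]))
            new_delta[c] = (delta[best_prev] + math.log(trans[best_prev][c])
                            + math.log(emit[c].get(w, FLOOR)))
            back[c] = best_prev
        delta = new_delta
        backpointers.append(back)
    # Trace back from the best final class to recover the full sequence
    seq = [max(CLASSES, key=delta.get)]
    for back in reversed(backpointers):
        seq.append(back[seq[-1]])
    return list(reversed(seq))

print(viterbi(["mr", "smith", "said", "that", "acme", "corp"]))
# → ['PERSON', 'PERSON', 'NOT-A-NAME', 'NOT-A-NAME', 'ORGANIZATION', 'ORGANIZATION']
```

As in the paper's model, decoding searches over all class assignments jointly rather than labeling each word independently, so context (the previous class) influences every decision.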

100 citations

Proceedings ArticleDOI
05 Aug 1996
TL;DR: A Chinese word segmentation algorithm based on forward maximum matching and word binding force is proposed; it plays a key role in post-processing the output of a character or speech recognizer by determining the proper word sequence corresponding to an input line of character images or a speech waveform.
Abstract: A Chinese word segmentation algorithm based on forward maximum matching and word binding force is proposed in this paper. This algorithm plays a key role in post-processing the output of a character or speech recognizer by determining the proper word sequence corresponding to an input line of character images or a speech waveform. To support this algorithm, a text corpus of over 63 million characters is employed to enrich an 80,000-word lexicon in terms of its word entries and word binding forces. As it stands now, given an input line of text, the word segmentor can process on average 210,000 characters per second when running on an IBM RISC System/6000 3BT workstation, with a correct word identification rate of 99.74%.
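The forward maximum matching step can be sketched as follows. This is a minimal illustration: the toy lexicon stands in for the paper's 80,000-word lexicon, and the word-binding-force disambiguation is omitted.

```python
# Forward maximum matching: at each position, greedily take the longest
# lexicon word that matches; unmatched single characters pass through.
def fmm_segment(text, lexicon, max_word_len=4):
    """Segment `text` by forward maximum matching against `lexicon`."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)  # length-1 fallback always matches
                i += length
                break
    return words

# Toy lexicon, invented for illustration.
lexicon = {"北京", "大学", "北京大学", "生活"}
print(fmm_segment("北京大学生活", lexicon))  # → ['北京大学', '生活']
```

The greedy longest-match heuristic is what makes the segmentor fast; the paper's word binding forces would then be used to resolve cases where the greedy choice is ambiguous or wrong.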

100 citations

Patent
26 Oct 2001
TL;DR: A method is described that corrects incorrect text associated with recognition errors in computer-implemented speech recognition; it includes performing speech recognition on an utterance to produce a recognition result for the utterance.
Abstract: A method (1400, 1435) is described that corrects incorrect text associated with recognition errors in computer-implemented speech recognition. The method includes the step of performing speech recognition on an utterance to produce a recognition result (1405) for the utterance. The recognition result may be a command that includes a word and a phrase (1500). The method includes determining whether the word closely corresponds to a portion of the phrase (1505). A speech recognition result is produced if the word closely corresponds to a portion of the phrase (1520, 1525).
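The patent abstract does not specify what "closely corresponds" means, so the following is only a hedged sketch of one plausible reading: a character-level similarity ratio (Python's `difflib`) compared against a threshold, with both the measure and the 0.8 cutoff being arbitrary illustrative choices.

```python
from difflib import SequenceMatcher

# Hypothetical "closely corresponds" test: does the recognized word
# approximately match any word of the command phrase?
def closely_corresponds(word, phrase, threshold=0.8):
    """Return True if `word` closely matches some word of `phrase`."""
    return any(
        SequenceMatcher(None, word.lower(), part.lower()).ratio() >= threshold
        for part in phrase.split()
    )

print(closely_corresponds("recognise", "correct the recognized word"))  # → True
print(closely_corresponds("banana", "open the file"))                   # → False
```

A real dictation system would more likely compare phoneme sequences than spellings, but the thresholded-similarity structure is the same.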

100 citations

Patent
04 Feb 2008
TL;DR: A method and a system for creating or updating entries in a speech recognition lexicon of a speech recognition system, mapping speech recognition (SR) phoneme sequences to words, is presented.
Abstract: A method and a system (20) are described for creating or updating entries in a speech recognition (SR) lexicon (7) of a speech recognition system, the entries mapping SR phoneme sequences to words. A respective word is entered and, if it is a new word to be added to the SR lexicon, at least one associated SR phoneme sequence is also entered through input means (26). The SR phoneme sequence associated with the word is then converted into speech by phoneme-to-speech conversion means (4.4), and the speech is played back by playback means (28), to verify the match between the phoneme sequence and the word.

100 citations

Proceedings ArticleDOI
07 May 2020
TL;DR: ContextNet incorporates global context information into convolution layers by adding squeeze-and-excitation modules, and a simple scaling method is proposed that scales the widths of ContextNet to achieve a good trade-off between computation and accuracy.
Abstract: Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet, achieving a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system at 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
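The WER figures quoted above are computed in the standard way: the word-level edit distance (substitutions, insertions, and deletions) between hypothesis and reference transcripts, normalized by the reference length. A minimal implementation:

```python
# Word error rate via dynamic-programming edit distance over words.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to reach an empty hypothesis prefix
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions to build the hypothesis prefix
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words: WER = 1/6 ≈ 0.167
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.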

99 citations


Network Information
Related Topics (5)
- Deep learning: 79.8K papers, 2.1M citations (88% related)
- Feature extraction: 111.8K papers, 2.1M citations (86% related)
- Convolutional neural network: 74.7K papers, 2M citations (85% related)
- Artificial neural network: 207K papers, 4.5M citations (84% related)
- Cluster analysis: 146.5K papers, 2.9M citations (83% related)
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  271
2022  562
2021  640
2020  643
2019  633
2018  528