
Word error rate

About: Word error rate is a research topic. Over the lifetime, 11939 publications have been published within this topic receiving 298031 citations.


Papers
01 Jan 1998
TL;DR: A hidden Markov model is used to extract information from broadcast news, with the encouraging result that a language-independent, trainable information extraction algorithm degraded on speech input by at most the word error rate of the recognizer.
Abstract: We report results using a hidden Markov model to extract information from broadcast news. IdentiFinderTM was trained on the broadcast news corpus and tested on both the 1996 HUB-4 development test data and the 1997 HUB-4 evaluation test data with respect to the named entity (NE) task, extracting:
• names of locations, persons, and organizations;
• dates and times;
• monetary amounts and percentages.
Evaluation is based on automatic word alignment of the speech recognition output (the NIST algorithm) followed by the MUC-6/MUC-7 scorer for NE on text, since MUC scoring assumes identical text in the system output and in the answer key. Additionally, we used the experimental MITRE scoring metric (Burger et al., 1998). The most encouraging result is that a language-independent, trainable information extraction algorithm degraded on speech input by at most the word error rate of the recognizer.

1. MOTIVATING FACTORS
One of the reasons behind this effort is to go beyond speech transcription (e.g., beyond the dictation problem) to address (at least) shallow understanding of speech. As a result of this effort, we believe that evaluating named entity (NE) extraction from speech offers a measure complementary to word error rate (WER) and represents a measure of understanding. The scores for NE from speech seem to track the quality of speech recognition proportionally, i.e., NE performance degrades at worst linearly with word error rate. A second motivation is the fact that NE is the first information extraction task from text showing success, with error rates on newswire of less than 10%. The named entity problem has generated much interest, as evidenced by its inclusion as an understanding task to be evaluated in both the Sixth and Seventh Message Understanding Conferences (MUC-6 and MUC-7), in the First and Second Multilingual Entity Task evaluations (MET-1 and MET-2), and as a planned track in the next broadcast news evaluation.
Furthermore, at least one commercial product has emerged: NameTagTM from IsoQuest. NE is defined by a set of annotation guidelines, an evaluation metric, and example data (Chinchor, 1997).

2. THE NAMED ENTITY PROBLEM FOR SPEECH
The named entity task is to identify all named locations, named persons, named organizations, dates, times, monetary amounts, and percentages. Though this sounds clear, enough special cases arise to require lengthy guidelines, e.g., when is The Wall Street Journal an artifact, and when is it an organization? When is White House an organization, and when a location? Are branch offices of a bank an organization? Is a street name a location? Should yesterday and last Tuesday be labeled dates? Is mid-morning a time? For human annotator consistency, guidelines with numerous special cases have been defined for the Seventh Message Understanding Conference, MUC-7 (Chinchor, 1997). In training data, the boundaries of an expression and its type must be marked via SGML. Various GUIs support manual preparation of training data and reference answers. Though the problem is relatively easy in mixed-case English prose, it is not solvable solely by recognizing capitalization in English. Though capitalization does indicate proper nouns in English, the type of the entity (person, organization, location, or none of those) must be identified. Many proper noun categories are not to be marked, e.g., nationalities, product names, and book titles. Named entity recognition is a challenge where case does not signal proper nouns, e.g., in Chinese, Japanese, German, or non-text modalities (e.g., speech). Since the task was generalized to other languages in the multilingual entity task (MET), the task definition is no longer dependent on the use of mixed case in English. Broadcast news presents significant challenges, as illustrated in Table 1. Not having mixed case removes information useful for recognizing names in English.
Automatically transcribed speech, even with no recognition errors, is harder due to the lack of punctuation, the spelling of numbers out as words, and the all-upper-case SNOR (Speech Normalized Orthographic Representation) format.

3. OVERVIEW OF HMM IN IDENTIFINDERTM
A full description of our HMM for named entity extraction appears in Bikel et al., 1997. By definition of the task, only a single label can be assigned to a word in context. Therefore, to every word, the HMM will assign either one of the desired classes (e.g., person, organization, etc.) or the label NOT-A-NAME (to represent "none of the desired classes"). We organize the states into regions, one region for each desired class plus one for NOT-A-NAME. See Figure 1. The HMM will have a model of each desired class and of the other text. The implementation is not confined to the seven classes of NE; in fact, it determines the set of classes from the SGML labels in the training data. Additionally, there are two special states, the START-OF-SENTENCE and END-OF-SENTENCE states. Within each of the regions, we use a statistical bigram language model, and emit exactly one word upon entering each state. Therefore, the number of states in each of the name-class regions is equal to the vocabulary size, V. The generation of words and name-classes proceeds in the following steps:
1. Select a name-class NC, conditioning on the previous name-class and the previous word.
2. Generate the first word inside that name-class, conditioning on the current and previous name-classes.
3. Generate all subsequent words inside the current name-class, where each subsequent word is conditioned on its immediate predecessor.
4. If not at the end of a sentence, go to 1.
Using the Viterbi algorithm, we search the entire space of all possible name-class assignments, maximizing Pr(W, NC). This model allows each type of "name" to have its own language, with separate bigram probabilities for generating its words.
This reflects our intuition that:
• There is generally predictive internal evidence regarding the class of a desired entity. Consider the following evidence: organization names tend to be stereotypical for airlines, utilities, law firms, insurance companies, other corporations, and government organizations. Organizations tend to select names that suggest the purpose or type of the organization. For person names, first names are stereotypical in many cultures; in Chinese, family names are stereotypical. In Chinese and Japanese, special characters are used to transliterate foreign names. Monetary amounts typically include a unit term, e.g., Taiwan dollars, yen, German marks, etc.
• Local evidence often suggests the boundaries and class of one of the desired expressions. Titles signal the beginnings of person names. Closed-class words, such as determiners, pronouns, and prepositions, often signal a boundary. Corporate designators (Inc., Ltd., Corp., etc.) often end a corporation name.
While the number of word-states within each name-class is equal to V, this "interior" bigram language model is ergodic.

Mixed Case: The crash was the second of a 757 in less than two months. On Dec. 20, an American Airlines jet crashed in the mountains near Cali, Colombia, killing 160 of the 164 people on board. The cause of that crash is still under investigation.
UPPER CASE: THE CRASH WAS THE SECOND OF A 757 IN LESS THAN TWO MONTHS. ON DEC. 20, AN AMERICAN AIRLINES JET CRASHED IN THE MOUNTAINS NEAR CALI, COLOMBIA, KILLING 160 OF THE 164 PEOPLE ON BOARD. THE CAUSE OF THAT CRASH IS STILL UNDER INVESTIGATION.
SNOR: THE CRASH WAS THE SECOND OF A SEVEN FIFTY SEVEN IN LESS THAN TWO MONTHS ON DECEMBER TWENTY AN AMERICAN AIRLINES JET CRASHED IN THE MOUNTAINS NEAR CALI COLOMBIA KILLING ONE HUNDRED SIXTY OF THE ONE HUNDRED SIXTY FOUR PEOPLE ON BOARD THE CAUSE OF THAT CRASH IS STILL UNDER INVESTIGATION
Table 1: Illustration of difficulties presented by speech recognition output (SNOR).
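The four-step generation procedure and Viterbi decoding described above can be sketched with a toy decoder. This is a simplified illustration, not IdentiFinder itself: the per-class bigram word models are collapsed into flat per-class unigram emissions, the class set is reduced to three, and all probabilities are invented for the example.

```python
import math

# Toy name-class HMM: a simplified sketch of the IdentiFinder idea.
CLASSES = ["PERSON", "ORGANIZATION", "NOT-A-NAME"]

# Hypothetical probabilities, invented purely for illustration.
trans = {  # P(class_i | class_{i-1})
    "START": {"PERSON": 0.2, "ORGANIZATION": 0.2, "NOT-A-NAME": 0.6},
    "PERSON": {"PERSON": 0.5, "ORGANIZATION": 0.1, "NOT-A-NAME": 0.4},
    "ORGANIZATION": {"PERSON": 0.1, "ORGANIZATION": 0.5, "NOT-A-NAME": 0.4},
    "NOT-A-NAME": {"PERSON": 0.15, "ORGANIZATION": 0.15, "NOT-A-NAME": 0.7},
}
emit = {  # P(word | class); unseen words get a small probability floor
    "PERSON": {"mr": 0.3, "smith": 0.4},
    "ORGANIZATION": {"acme": 0.4, "corp": 0.4},
    "NOT-A-NAME": {"said": 0.2, "that": 0.2, "the": 0.2},
}
FLOOR = 1e-4

def viterbi(words):
    """Return the maximum-probability name-class sequence for the words."""
    # delta[c] = best log-probability of any path ending in class c
    delta = {c: math.log(trans["START"][c]) + math.log(emit[c].get(words[0], FLOOR))
             for c in CLASSES}
    backpointers = []
    for w in words[1:]:
        new_delta, back = {}, {}
        for c in CLASSES:
            best_prev = max(CLASSES, key=lambda p: delta[p] + math.log(trans[p][c]))
            new_delta[c] = (delta[best_prev] + math.log(trans[best_prev][c])
                            + math.log(emit[c].get(w, FLOOR)))
            back[c] = best_prev
        delta = new_delta
        backpointers.append(back)
    # Trace back from the best final class to recover the full sequence
    seq = [max(CLASSES, key=delta.get)]
    for back in reversed(backpointers):
        seq.append(back[seq[-1]])
    return list(reversed(seq))

print(viterbi(["mr", "smith", "said", "that", "acme", "corp"]))
# → ['PERSON', 'PERSON', 'NOT-A-NAME', 'NOT-A-NAME', 'ORGANIZATION', 'ORGANIZATION']
```

As in the paper's model, decoding searches over all class assignments jointly rather than labeling each word independently, so context (the previous class) influences every decision.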

100 citations

Proceedings ArticleDOI
05 Aug 1996
TL;DR: A Chinese word segmentation algorithm based on forward maximum matching and word binding force is proposed; it plays a key role in post-processing the output of a character or speech recognizer by determining the proper word sequence corresponding to an input line of character images or a speech waveform.
Abstract: A Chinese word segmentation algorithm based on forward maximum matching and word binding force is proposed in this paper. This algorithm plays a key role in post-processing the output of a character or speech recognizer by determining the proper word sequence corresponding to an input line of character images or a speech waveform. To support this algorithm, a text corpus of over 63 million characters is employed to enrich an 80,000-word lexicon in terms of its word entries and word binding forces. As it stands now, given an input line of text, the word segmentor can process on average 210,000 characters per second when running on an IBM RISC System/6000 3BT workstation, with a correct word identification rate of 99.74%.
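The forward maximum matching step can be sketched as follows. This is a minimal illustration: the toy lexicon stands in for the paper's 80,000-word lexicon, and the word-binding-force disambiguation is omitted.

```python
# Forward maximum matching: at each position, greedily take the longest
# lexicon word that matches; unmatched single characters pass through.
def fmm_segment(text, lexicon, max_word_len=4):
    """Segment `text` by forward maximum matching against `lexicon`."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)  # length-1 fallback always matches
                i += length
                break
    return words

# Toy lexicon, invented for illustration.
lexicon = {"北京", "大学", "北京大学", "生活"}
print(fmm_segment("北京大学生活", lexicon))  # → ['北京大学', '生活']
```

The greedy longest-match heuristic is what makes the segmentor fast; the paper's word binding forces would then be used to resolve cases where the greedy choice is ambiguous or wrong.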

100 citations

Patent
26 Oct 2001
TL;DR: A method is described that corrects incorrect text associated with recognition errors in computer-implemented speech recognition; it includes performing speech recognition on an utterance to produce a recognition result for the utterance.
Abstract: A method (1400, 1435) is described that corrects incorrect text associated with recognition errors in computer-implemented speech recognition. The method includes the step of performing speech recognition on an utterance to produce a recognition result (1405) for the utterance. The recognition result may be a command that includes a word and a phrase (1500). The method includes determining whether the word closely corresponds to a portion of the phrase (1505). A speech recognition result is produced if the word closely corresponds to a portion of the phrase (1520, 1525).
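The patent abstract does not specify what "closely corresponds" means, so the following is only a hedged sketch of one plausible reading: a character-level similarity ratio (Python's `difflib`) compared against a threshold, with both the measure and the 0.8 cutoff being arbitrary illustrative choices.

```python
from difflib import SequenceMatcher

# Hypothetical "closely corresponds" test: does the recognized word
# approximately match any word of the command phrase?
def closely_corresponds(word, phrase, threshold=0.8):
    """Return True if `word` closely matches some word of `phrase`."""
    return any(
        SequenceMatcher(None, word.lower(), part.lower()).ratio() >= threshold
        for part in phrase.split()
    )

print(closely_corresponds("recognise", "correct the recognized word"))  # → True
print(closely_corresponds("banana", "open the file"))                   # → False
```

A real dictation system would more likely compare phoneme sequences than spellings, but the thresholded-similarity structure is the same.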

100 citations

Patent
04 Feb 2008
TL;DR: A method and a system for creating or updating entries in a speech recognition lexicon of a speech recognition system, mapping speech recognition (SR) phoneme sequences to words, is presented.
Abstract: A method and a system (20) are described for creating or updating entries in a speech recognition (SR) lexicon (7) of a speech recognition system, the entries mapping SR phoneme sequences to words. A respective word is entered and, if it is a new word to be added to the SR lexicon, at least one associated SR phoneme sequence is also entered through input means (26). The SR phoneme sequence associated with the word is then converted into speech by phoneme-to-speech conversion means (4.4), and the speech is played back by playback means (28), to verify the match between the phoneme sequence and the word.

100 citations

Proceedings ArticleDOI
07 May 2020
TL;DR: ContextNet incorporates global context information into convolution layers by adding squeeze-and-excitation modules, and a simple scaling method is proposed that scales the widths of ContextNet to achieve a good trade-off between computation and accuracy.
Abstract: Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet, achieving a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system at 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
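The WER figures quoted above are computed in the standard way: the word-level edit distance (substitutions, insertions, and deletions) between hypothesis and reference transcripts, normalized by the reference length. A minimal implementation:

```python
# Word error rate via dynamic-programming edit distance over words.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions to reach an empty hypothesis prefix
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions to build the hypothesis prefix
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words: WER = 1/6 ≈ 0.167
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.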

99 citations


Network Information
Related Topics (5)
- Deep learning: 79.8K papers, 2.1M citations (88% related)
- Feature extraction: 111.8K papers, 2.1M citations (86% related)
- Convolutional neural network: 74.7K papers, 2M citations (85% related)
- Artificial neural network: 207K papers, 4.5M citations (84% related)
- Cluster analysis: 146.5K papers, 2.9M citations (83% related)
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  271
2022  562
2021  640
2020  643
2019  633
2018  528