scispace - formally typeset
Topic

Word error rate

About: Word error rate is a research topic. Over its lifetime, 11,939 publications have been published within this topic, receiving 298,031 citations.
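For reference, word error rate is the word-level Levenshtein (edit) distance between a hypothesis transcript and a reference transcript, normalized by the reference length. A minimal sketch using only the standard library:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference
    words, computed with word-level Levenshtein distance via dynamic
    programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is an error *rate* rather than an accuracy complement.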


Papers
Journal ArticleDOI
TL;DR: A dereverberation method reduces reverberation prior to recognition, and a parametric model for variance adaptation with static and dynamic components realizes an appropriate interconnection between the dereverberation preprocessor and the speech recognizer.
Abstract: The performance of automatic speech recognition is severely degraded in the presence of noise or reverberation. Much research has been undertaken on noise robustness. In contrast, the problem of the recognition of reverberant speech has received far less attention and remains very challenging. In this paper, we use a dereverberation method to reduce reverberation prior to recognition. Such a preprocessor may remove most reverberation effects. However, it often introduces distortion, causing a dynamic mismatch between speech features and the acoustic model used for recognition. Model adaptation could be used to reduce this mismatch. However, conventional model adaptation techniques assume a static mismatch and may therefore not cope well with a dynamic mismatch arising from dereverberation. This paper proposes a novel adaptation scheme that is capable of managing both static and dynamic mismatches. We introduce a parametric model for variance adaptation that includes static and dynamic components in order to realize an appropriate interconnection between dereverberation and a speech recognizer. The model parameters are optimized using adaptive training implemented with the expectation maximization algorithm. An experiment using the proposed method with reverberant speech for a reverberation time of 0.5 s revealed that it was possible to achieve an 80% reduction in the relative error rate compared with the recognition of dereverberated speech (word error rate of 31%), and the final error rate was 5.4%, which was obtained by combining the proposed variance compensation and MLLR adaptation.
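The relative error-rate reduction quoted in this abstract is simple to check: it is the fraction of the baseline's errors that the new system removes. Using the figures given (31% WER after dereverberation alone, 5.4% after adding variance compensation and MLLR):

```python
def relative_wer_reduction(baseline: float, improved: float) -> float:
    """Relative error-rate reduction: the fraction of baseline errors removed.
    E.g. going from 31% to 5.4% WER removes about 83% of the errors."""
    return (baseline - improved) / baseline

# Figures quoted in the abstract above
print(round(relative_wer_reduction(0.31, 0.054), 3))  # ~0.826, i.e. roughly 80%
```

This is consistent with the abstract's "80% reduction in the relative error rate" claim.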

63 citations

Journal ArticleDOI
TL;DR: Two experiments on selecting utterances from lists of responses indicate that the decoding process can be improved by optimizing the language model and the acoustic models, reducing the utterance error rate from 29–26% to 10–8%.
Abstract: Computer-Assisted Language Learning (CALL) applications for improving the oral skills of low-proficient learners have to cope with non-native speech that is particularly challenging. Since unconstrained non-native ASR is still problematic, a possible solution is to elicit constrained responses from the learners. In this paper, we describe experiments aimed at selecting utterances from lists of responses. The first experiment on utterance selection indicates that the decoding process can be improved by optimizing the language model and the acoustic models, thus reducing the utterance error rate from 29-26% to 10-8%. Since giving feedback on incorrectly recognized utterances is confusing, we verify the correctness of the utterance before providing feedback. The results of the second experiment on utterance verification indicate that combining duration-related features with a likelihood ratio (LR) yields an equal error rate (EER) of 10.3%, which is significantly better than the EER for the other measures in isolation.
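The equal error rate (EER) reported in the second experiment is the operating point where the false-acceptance and false-rejection rates coincide. A small sketch that locates the closest crossing by sweeping thresholds over two illustrative score lists (the scores below are made up, not from the paper):

```python
def equal_error_rate(genuine, impostor):
    """Sweep decision thresholds and return the error rate at the point where
    the false-rejection rate (genuine scores below the threshold) is closest
    to the false-acceptance rate (impostor scores at or above it)."""
    best = None
    for t in sorted(set(genuine + impostor)):
        frr = sum(g < t for g in genuine) / len(genuine)
        far = sum(i >= t for i in impostor) / len(impostor)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2)
    return best[1]

# Hypothetical verification scores: higher means "more likely correct"
print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))  # 0.25
```

With finite score lists the two rates rarely cross exactly, so reporting the midpoint at the closest crossing is a common convention.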

63 citations

Patent
11 Sep 2000
TL;DR: A speech recognition apparatus includes a speech input device, a storage device that stores recognition words indicating the pronunciation of a word to be recognized, and a processing device that performs recognition by comparing audio data obtained through the input device with speech recognition data created in correspondence to the recognition words.
Abstract: A speech recognition apparatus includes: a speech input device; a storage device that stores a recognition word indicating a pronunciation of a word to undergo speech recognition; and a speech recognition processing device that performs speech recognition processing by comparing audio data obtained through the speech input device with speech recognition data created in correspondence to the recognition word. The storage device stores both a first recognition word corresponding to the pronunciation of the entirety of the word and a second recognition word corresponding to the pronunciation of only a starting portion, of a predetermined length, of the word.
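The patent's scheme of storing both the full pronunciation and a fixed-length starting portion can be sketched as follows (the function names and the prefix length are illustrative, not taken from the patent):

```python
def build_recognition_words(word, prefix_len):
    """Store two recognition forms per vocabulary word, as in the patent's
    scheme: the full pronunciation, plus a starting portion of a
    predetermined length when the word is longer than that portion."""
    forms = [word]
    if len(word) > prefix_len:
        forms.append(word[:prefix_len])
    return forms

def recognize(spoken, vocabulary):
    """Return the vocabulary word whose stored recognition forms match the
    input, or None if nothing matches."""
    for word, forms in vocabulary.items():
        if spoken in forms:
            return word
    return None

vocab = {w: build_recognition_words(w, 4) for w in ["navigation", "map", "music"]}
print(recognize("navi", vocab))  # the stored prefix of "navigation" matches
```

The practical benefit is that a user can speak only the beginning of a long command word and still be recognized, at the cost of some added ambiguity between words sharing a prefix.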

62 citations

Proceedings ArticleDOI
25 Oct 2020
TL;DR: Modifications to the RNN-T model are proposed that allow the model to utilize additional metadata text, with the objective of improving word error rate on named entities (WER-NE) for videos with related metadata.
Abstract: End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS) blend the individual components of a traditional hybrid ASR system - acoustic model, language model, pronunciation model - into a single neural network. While this has some nice advantages, it limits the system to be trained using only paired audio and text. Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names. In this paper, we propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words. We evaluate our approach on an in-house dataset sampled from de-identified public social media videos, which represent an open domain ASR task. By using an attention model and a biasing model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 16% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata.
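The paper's biasing component is a trained attention model inside the RNN-T; as a loose, entirely hypothetical illustration of why contextual metadata helps with rare entity names, here is a toy n-best rescoring that simply boosts hypotheses containing metadata tokens (this is not the paper's method):

```python
def rescore_with_metadata(nbest, metadata_tokens, boost=0.5):
    """Toy contextual rescoring: add a fixed bonus to a hypothesis score for
    every metadata token it contains, then return the best-scoring pair.
    Real contextual-biasing systems learn this interaction end to end."""
    rescored = []
    for text, score in nbest:
        bonus = boost * sum(tok in text.split() for tok in metadata_tokens)
        rescored.append((text, score + bonus))
    return max(rescored, key=lambda pair: pair[1])

# The acoustically likelier hypothesis misspells the entity; metadata from
# the video ("beatles") tips the decision the other way.
nbest = [("play songs by the beetles", -1.0),
         ("play songs by the beatles", -1.2)]
print(rescore_with_metadata(nbest, {"beatles"}))
```

A fixed additive boost is crude (it can over-trigger on common words), which is one motivation for the learned attention-based biasing the paper proposes instead.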

62 citations

Proceedings ArticleDOI
27 Apr 1993
TL;DR: Although the proposed discriminative feature extraction approach is a direct and simple extension of MCE/GPD, it is a significant departure from conventional approaches, providing a comprehensive basis for the entire system design.
Abstract: A novel approach to pattern recognition which comprehensively optimizes both a feature extraction process and a classification process is introduced. Assuming that the best features for recognition are the ones that yield the lowest classification error rate over unknown data, an overall recognizer, consisting of a feature extractor module and a classifier module, is trained using the minimum classification error (MCE)/generalized probabilistic descent (GPD) method. Although the proposed discriminative feature extraction approach is a direct and simple extension of MCE/GPD, it is a significant departure from conventional approaches, providing a comprehensive basis for the entire system design. Experimental results are presented for the simple example of optimally designing a cepstrum representation for vowel recognition. The results clearly demonstrate the effectiveness of the proposed method.
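The MCE criterion at the heart of MCE/GPD replaces the non-differentiable 0/1 classification error with a smooth surrogate so that the whole recognizer can be trained by gradient descent. A minimal sketch of that loss (a common textbook form, not the paper's full joint feature-extractor/classifier training):

```python
import math

def mce_loss(scores, label, alpha=1.0):
    """Smoothed minimum classification error loss: the misclassification
    measure d = -g_label + (best competing score) is passed through a
    sigmoid, giving a loss near 0 for confident correct decisions and
    near 1 for confident errors."""
    competitor = max(s for k, s in enumerate(scores) if k != label)
    d = -scores[label] + competitor
    return 1.0 / (1.0 + math.exp(-alpha * d))

# Correctly classified with a wide margin -> loss near 0
print(round(mce_loss([5.0, 1.0, 0.0], label=0), 3))  # 0.018
# Confidently misclassified -> loss near 1
print(round(mce_loss([1.0, 5.0, 0.0], label=0), 3))  # 0.982
```

Because the sigmoid is differentiable, its gradient can be propagated back through the classifier and into the feature extractor, which is what lets the approach optimize both modules jointly.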

62 citations


Network Information
Related Topics (5)
- Deep learning: 79.8K papers, 2.1M citations (88% related)
- Feature extraction: 111.8K papers, 2.1M citations (86% related)
- Convolutional neural network: 74.7K papers, 2M citations (85% related)
- Artificial neural network: 207K papers, 4.5M citations (84% related)
- Cluster analysis: 146.5K papers, 2.9M citations (83% related)
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    271
2022    562
2021    640
2020    643
2019    633
2018    528