Topic

Word error rate

About: Word error rate is a research topic. Over the lifetime, 11,939 publications have been published within this topic, receiving 298,031 citations.
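For readers new to the topic: word error rate (WER) is computed from a Levenshtein (minimum edit distance) alignment between a reference transcript and a recognizer's hypothesis, as (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein alignment over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat the mat"))  # one deletion over six words
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error rate rather than a bounded accuracy measure.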


Papers
More filters
Proceedings ArticleDOI
09 May 1977
TL;DR: A novel approach to the voiced-unvoiced-silence detection problem is proposed in which a spectral characterization of each of the 3 classes of signal is obtained during a training session, and an LPC distance metric and an energy distance are nonlinearly combined to make the final discrimination.
Abstract: One of the most difficult problems in speech analysis is reliable discrimination among silence, unvoiced speech, and voiced speech which has been transmitted over a telephone line. Although several methods have been proposed for making this 3-level decision, these schemes have met with only modest success. In this paper a novel approach to the voiced-unvoiced-silence detection problem is proposed in which a spectral characterization of each of the 3 classes of signal is obtained during a training session, and an LPC distance metric and an energy distance are nonlinearly combined to make the final discrimination. This algorithm has been tested over conventional switched telephone lines, across a variety of speakers, and has been found to have an error rate of about 5%, with the majority of the errors (about 2/3) occurring at the boundaries between signal classes. The algorithm is currently being used in a speaker independent word recognition system.
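The decision rule described above can be sketched as follows: per-class LPC and energy distances (measured against class templates learned in the training session) are combined nonlinearly, and the closest class wins. The weighting and the square-root combination here are illustrative choices, not the paper's exact rule:

```python
import numpy as np

def vus_decision(lpc_dists, energy_dists, w=0.5):
    """Toy voiced/unvoiced/silence decision: each dict maps a class name to
    that frame's distance from the class template; the frame is assigned to
    the class with the smallest combined distance."""
    classes = ("silence", "unvoiced", "voiced")
    scores = {c: np.sqrt(lpc_dists[c] ** 2 + w * energy_dists[c] ** 2)
              for c in classes}
    return min(scores, key=scores.get)

# Example: a frame whose distances are smallest to the "voiced" templates
print(vus_decision({"silence": 4.0, "unvoiced": 2.5, "voiced": 0.8},
                   {"silence": 3.0, "unvoiced": 1.5, "voiced": 0.4}))  # voiced
```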

128 citations

Proceedings ArticleDOI
13 Jun 2016
TL;DR: TRIVET, a deep TransfeR NIR-VIS heterogeneous facE recognition neTwork, uses a deep convolutional neural network with ordinal measures to learn discriminative models and achieves state-of-the-art recognition performance on the most challenging CASIA NIR-VIS 2.0 Face Database.
Abstract: One task of heterogeneous face recognition is to match a near infrared (NIR) face image to a visible light (VIS) image. In practice, there are often only a few pairwise NIR-VIS face images, but it is easy to collect lots of VIS face images. Therefore, how to use these unpaired VIS images to improve the NIR-VIS recognition accuracy is an ongoing issue. This paper presents a deep TransfeR NIR-VIS heterogeneous facE recognition neTwork (TRIVET) for NIR-VIS face recognition. First, to utilize large numbers of unpaired VIS face images, we employ a deep convolutional neural network (CNN) with ordinal measures to learn discriminative models. The ordinal activation function (Max-Feature-Map) is used to select discriminative features and make the models robust and lightweight. Second, we transfer these models to the NIR-VIS domain by fine-tuning with two types of NIR-VIS triplet loss. The triplet loss not only reduces intra-class NIR-VIS variations but also augments the number of positive training sample pairs. It makes fine-tuning deep models on a small dataset possible. The proposed method achieves state-of-the-art recognition performance on the most challenging CASIA NIR-VIS 2.0 Face Database. It achieves a new record on rank-1 accuracy of 95.74% and verification rate of 91.03% at FAR=0.001. It cuts the error rate in comparison with the best previous accuracy [27] by 69%.
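The NIR-VIS triplet loss mentioned above follows the standard triplet formulation: pull an anchor embedding toward a same-identity positive and push it away from a different-identity negative by at least a margin. A minimal numpy sketch, with the margin value chosen for illustration:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on embedding vectors: loss is zero once the
    negative is farther from the anchor than the positive by `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to same identity
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to other identity
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.0, 0.1])   # same identity, nearby
n_far = np.array([1.0, 0.0])
n_near = np.array([0.0, 0.2])
print(triplet_loss(a, p, n_far))   # negative already far enough: zero loss
print(triplet_loss(a, p, n_near))  # negative too close: positive loss
```

In the paper's setting, mining triplets across NIR and VIS images is also what multiplies the number of usable positive training pairs from a small paired dataset.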

128 citations

Proceedings ArticleDOI
01 Dec 2013
TL;DR: The accuracy of a network recognising speech from a single distant microphone can approach that of a multi-microphone setup when the network is trained with data from other microphones.
Abstract: We investigate the application of deep neural network (DNN)-hidden Markov model (HMM) hybrid acoustic models for far-field speech recognition of meetings recorded using microphone arrays. We show that the hybrid models achieve significantly better accuracy than conventional systems based on Gaussian mixture models (GMMs). We observe up to 8% absolute word error rate (WER) reduction from a discriminatively trained GMM baseline when using a single distant microphone, and between 4% and 6% absolute WER reduction when using beamforming on various combinations of array channels. By training the networks on audio from multiple channels, we find the networks can recover a significant part of the accuracy difference between the single distant microphone and beamformed configurations. Finally, we show that the accuracy of a network recognising speech from a single distant microphone can approach that of a multi-microphone setup by training with data from other microphones.

127 citations

Proceedings ArticleDOI
23 Oct 2002
TL;DR: This system employs noninvasive, inexpensive and fully automated measures of vocal tract characteristics and excitation information, representing an 8% detection error rate improvement over the best-performing classifier using carefully measured features prevalent in state-of-the-art pathological speech analysis.
Abstract: This study focuses on a robust, rapid and accurate system for automatic detection of normal and pathological speech. This system employs noninvasive, inexpensive and fully automated measures of vocal tract characteristics and excitation information. Mel-frequency filterbank cepstral coefficients and measures of pitch dynamics were modeled by Gaussian mixtures in a hidden Markov model (HMM) classifier. The method was evaluated using the sustained phoneme /a/ data obtained from over 700 subjects of normal and different pathological cases from the Massachusetts Eye and Ear Infirmary (MEEI) database. This method attained 99.44% correct classification rates for discrimination of normal and pathological speech for sustained /a/. This represents an 8% detection error rate improvement over the best-performing classifier using carefully measured features prevalent in state-of-the-art pathological speech analysis.
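The classification step above can be sketched as a likelihood comparison: score the feature frames under each class model and pick the higher-scoring class. This sketch uses a single diagonal Gaussian per class as a stand-in for the paper's Gaussian-mixture HMMs, and all parameter values are illustrative:

```python
import numpy as np

def gaussian_loglik(frames, mean, var):
    """Total log-likelihood of feature frames under a diagonal Gaussian."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (frames - mean) ** 2 / var)))

def classify(frames, models):
    """models maps class name -> (mean, var); return the class whose
    model assigns the frames the highest log-likelihood."""
    return max(models, key=lambda c: gaussian_loglik(frames, *models[c]))

# Illustrative 2-dimensional "cepstral" frames near the normal-class mean
models = {"normal": (np.zeros(2), np.ones(2)),
          "pathological": (np.full(2, 3.0), np.ones(2))}
frames = np.zeros((5, 2))
print(classify(frames, models))  # normal
```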

127 citations

Proceedings Article
01 Jan 2006
TL;DR: The gains seen with cross-system adaptation and system combination methods are demonstrated, and it is shown that sequences of adaptation and decoding make it possible to incrementally improve the performance of the recognition system.
Abstract: Cross-system adaptation and system combination methods, such as ROVER and confusion network combination, are known to lower the word error rate of speech recognition systems. They require the training of systems that are reasonably close in performance but at the same time produce output that differs in its errors. This provides complementary information which leads to performance improvements. In this paper we demonstrate the gains we have seen with cross-system adaptation and system combination on the English EPPS and RT-05S lecture meeting task. We obtained the necessary varying systems by using different acoustic front-ends and phoneme sets on which our models are based. In a set of contrastive experiments we show the influence that the exchange of the components has on adaptation and system combination. Index Terms: automatic speech recognition, system combination, cross adaptation, EPPS, RT-05S. 1. Introduction. In state-of-the-art speech recognition systems it is common practice to use multi-pass systems with adaptation of the acoustic model in-between passes. The adaptation aims at better fitting the system to the speakers and/or acoustic environments found in the test data. It is usually performed on a by-speaker basis, obtained either from manual speaker labels or automatic clustering methods. Common adaptation methods try to transform either the models used in a system or the features to which the models are applied. Three adaptation methods that can be found in many state-of-the-art systems are Maximum Likelihood Linear Regression (MLLR) [1], a model transformation, Vocal Tract Length Normalization (VTLN) [2] and feature-space constrained MLLR (fMLLR) [3], two feature-transformation methods. Adaptation is performed in an unsupervised manner, such that the error-prone hypotheses obtained from the previous decoding pass are taken as the necessary reference for adaptation.
Generally, the word error rates of the hypotheses obtained from the adapted systems are lower than those for the hypotheses on which the adaptation was performed. These sequences of adaptation and decoding make it possible to incrementally improve the performance of the recognition system. Unfortunately, this loop of adaptation and decoding does not always lead to significant improvements. Often, after two or three stages of adapting a system on its own output, no more gains can be obtained. This problem can be overcome by adapting a system
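The word-level voting idea behind ROVER can be illustrated with a toy sketch that assumes the hypotheses are already aligned word-for-word; real ROVER first builds a word transition network by dynamic-programming alignment (optionally weighting votes by confidence scores) before voting:

```python
from collections import Counter

def rover_vote(hypotheses):
    """Toy ROVER-style combination: given hypotheses of equal length that
    are already aligned word-for-word, pick the majority word at each
    position. Ties fall to the first hypothesis encountered."""
    aligned = [h.split() for h in hypotheses]
    combined = [Counter(words).most_common(1)[0][0]
                for words in zip(*aligned)]
    return " ".join(combined)

# Each system makes a different error; voting recovers the common answer
print(rover_vote(["the cat sat", "the bat sat", "the cat mat"]))  # the cat sat
```

This is why the paper stresses systems that are "reasonably close in performance but produce output that differs in its errors": voting only helps when the errors of the component systems disagree.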

127 citations


Network Information
Related Topics (5)
Deep learning: 79.8K papers, 2.1M citations (88% related)
Feature extraction: 111.8K papers, 2.1M citations (86% related)
Convolutional neural network: 74.7K papers, 2M citations (85% related)
Artificial neural network: 207K papers, 4.5M citations (84% related)
Cluster analysis: 146.5K papers, 2.9M citations (83% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    271
2022    562
2021    640
2020    643
2019    633
2018    528