Word error rate

About: Word error rate is a research topic. Over its lifetime, 11,939 publications have been published within this topic, receiving 298,031 citations.
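Word error rate is conventionally computed from a minimum-edit alignment between a reference transcript and a recognizer's hypothesis: WER = (S + D + I) / N, where S, D and I count word substitutions, deletions and insertions, and N is the number of reference words. The Python sketch below assumes plain whitespace tokenization and no text normalization (casing, punctuation, and so on), which real scoring tools typically handle; the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, using a word-level Levenshtein distance."""
    ref = reference.split()  # assumption: whitespace tokenization, no normalization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = minimum edits turning the first i-1 reference words into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution (cost 1) or match (cost 0)
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` finds one substitution and one deletion, giving 2/6 ≈ 0.33. Because insertions are counted, WER can exceed 1.0.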

Papers

Proceedings Article

Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription

Frank Seide, Gang Li, Xie Chen, Dong Yu

Microsoft

01 Dec 2011

TL;DR: This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.

Abstract: We investigate the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective. Recently, we had shown that for speaker-independent transcription of phone calls (NIST RT03S Fisher data), CD-DNN-HMMs reduced the word error rate by as much as one third—from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs with HLDA features, to 18.5%—using 300+ hours of training data (Switchboard), 9000+ tied triphone states, and up to 9 hidden network layers.
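The "one third" in the abstract is a relative reduction, (27.4 - 18.5) / 27.4 ≈ 0.325, rather than an absolute drop (which is 8.9 percentage points). A quick check in Python, using only the two figures quoted above; the helper name is illustrative.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error-rate reduction: (baseline - improved) / baseline."""
    return (baseline - improved) / baseline


gmm_hmm_wer = 27.4     # % WER, discriminatively trained GMM-HMM with HLDA features
cd_dnn_hmm_wer = 18.5  # % WER, CD-DNN-HMM
print(f"{relative_reduction(gmm_hmm_wer, cd_dnn_hmm_wer):.1%}")  # prints 32.5%
```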

702 citations

Journal Article

Spoken word recognition processes and the gating paradigm

François Grosjean

Northeastern University

01 Jul 1980 - Attention, Perception, & Psychophysics

TL;DR: The authors found that some delay appears to exist between the moment a word is isolated from other word candidates and the moment it is recognized; word candidates differ in number and in type from one context to another; and, like syntactic processing, word recognition is strewn with garden paths.

Abstract: Words varying in length (one, two, and three syllables) and in frequency (high and low) were presented to subjects in isolation, in a short context, and in a long context. Each word was presented repeatedly, and its presentation time (duration from the onset of the word) increased at each successive pass. After each pass, subjects were asked to write down the word being presented and to indicate how confident they were about each guess. In addition to replicating a frequency, a context, and a word-length effect, this “gating” paradigm allowed us to study more closely the narrowing-in process employed by listeners in the isolation and recognition of words: Some delay appears to exist between the moment a word is isolated from other word candidates and the moment it is recognized; word candidates differ in number and in type from one context to the other; and, like syntactic processing, word recognition is strewn with garden paths. The active direct access model proposed by Marslen-Wilson and Welsh is discussed in light of these findings.

694 citations

Journal Article

A Study of Interspeaker Variability in Speaker Verification

Patrick Kenny, Pierre Ouellet, Najim Dehak, Vishwa Gupta, Pierre Dumouchel

01 Jul 2008 - IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: It is shown that when a large joint factor analysis model is trained with the proposed hyperparameter estimation technique and tested on the core condition, the extended data condition and the cross-channel condition, it performs at least as well as fusions of multiple systems of other types.

Abstract: We propose a new approach to the problem of estimating the hyperparameters which define the interspeaker variability model in joint factor analysis. We tested the proposed estimation technique on the NIST 2006 speaker recognition evaluation data and obtained 10%-15% reductions in error rates on the core condition and the extended data condition (as measured both by equal error rates and the NIST detection cost function). We show that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types. (The comparisons are based on the best results on these tasks that have been reported in the literature.) In the case of the cross-channel condition, a factor analysis model with 300 speaker factors and 200 channel factors can achieve equal error rates of less than 3.0%. This is a substantial improvement over the best results that have previously been reported on this task.
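The equal error rate quoted above is the operating point at which the false-acceptance rate on non-target (different-speaker) trials equals the false-rejection rate on target (same-speaker) trials. A minimal sketch, assuming higher scores mean "same speaker" and approximating the EER by sweeping thresholds over the observed scores without ROC interpolation; the NIST detection cost function, which weights the two error types by costs and a target prior, is not shown, and the function name is illustrative.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping candidate thresholds over the observed scores."""
    if not target_scores or not nontarget_scores:
        raise ValueError("need both target and non-target scores")
    best_gap, best_eer = float("inf"), None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score >= t.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR at the closest crossing found.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

With target scores `[0.9, 0.8, 0.7, 0.4]` and non-target scores `[0.6, 0.3, 0.2, 0.1]`, the two rates cross at threshold 0.6, where one trial of each kind is misclassified, so the function returns 0.25.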

671 citations

Journal Article

Face recognition using kernel direct discriminant analysis algorithms

Juwei Lu, Konstantinos N. Plataniotis, Anastasios N. Venetsanopoulos

University of Toronto

01 Jan 2003 - IEEE Transactions on Neural Networks

TL;DR: This paper proposes a kernel machine-based discriminant analysis method, which deals with the nonlinearity of the face patterns' distribution and effectively solves the so-called "small sample size" (SSS) problem, which exists in most FR tasks.

Abstract: Techniques that can introduce low-dimensional feature representations with enhanced discriminatory power are of paramount importance in face recognition (FR) systems. It is well known that the distribution of face images, under a perceivable variation in viewpoint, illumination or facial expression, is highly nonlinear and complex. It is, therefore, not surprising that linear techniques, such as those based on principal component analysis (PCA) or linear discriminant analysis (LDA), cannot provide reliable and robust solutions to those FR problems with complex face variations. In this paper, we propose a kernel machine-based discriminant analysis method, which deals with the nonlinearity of the face patterns' distribution. The proposed method also effectively solves the so-called "small sample size" (SSS) problem, which exists in most FR tasks. The new algorithm has been tested, in terms of classification error rate performance, on the multiview UMIST face database. Results indicate that the proposed methodology is able to achieve excellent performance with only a very small set of features being used, and its error rate is approximately 34% and 48% of those of two other commonly used kernel FR approaches, the kernel-PCA (KPCA) and the generalized discriminant analysis (GDA), respectively.

651 citations

Proceedings Article

Joint CTC-attention based end-to-end speech recognition using multi-task learning

Suyoun Kim, Takaaki Hori, Shinji Watanabe

Mitsubishi Electric Research Laboratories

05 Mar 2017

TL;DR: This paper proposes a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder framework for end-to-end speech recognition, trained with multi-task learning to improve robustness and achieve fast convergence.

Abstract: Recently, there has been increasing interest in end-to-end speech recognition, which directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework, which learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to outperform another end-to-end approach, Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target character without any conditional independence assumptions. However, we observed that the attention model performs poorly in noisy conditions and is hard to train in the initial stage with long input sequences. This is because the attention model is too flexible to predict proper alignments in such cases, lacking the left-to-right constraints used in CTC. This paper presents a novel method for end-to-end speech recognition that improves robustness and achieves fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue. Experiments on the WSJ and CHiME-4 tasks demonstrate its advantages over both the CTC and attention-based encoder-decoder baselines, showing 5.4–14.6% relative improvements in Character Error Rate (CER).
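A multi-task objective of this kind is commonly written as a weighted sum of the CTC loss and the attention-decoder loss, both computed on the output of a shared encoder. A minimal sketch of that combination, assuming per-batch losses are already available as numbers and that the interpolation weight, here given the hypothetical name `lam`, lies in [0, 1]. In a real system the two losses would be autodiff tensors and this sum is what would be backpropagated; the weight values the paper used are not stated in the abstract above.

```python
def joint_ctc_attention_loss(ctc_loss: float, attention_loss: float, lam: float) -> float:
    """Multi-task objective: lam * CTC loss + (1 - lam) * attention loss.

    lam = 1.0 reduces to pure CTC training; lam = 0.0 to a pure attention model.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * attention_loss


# Illustrative loss values only (not taken from the paper).
print(joint_ctc_attention_loss(ctc_loss=2.0, attention_loss=1.0, lam=0.2))
```

The CTC term contributes the monotonic, left-to-right alignment constraint that the abstract says the attention model lacks, while the attention term keeps modeling the label history without conditional independence assumptions.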

645 citations

Metrics

Papers: 12,777
Citations: 335,740

No. of papers in the topic in previous years
Year	Papers
2023	271
2022	562
2021	640
2020	643
2019	633
2018	528