Open Access Proceedings Article

Joint CTC/attention decoding for end-to-end speech recognition

TLDR
This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively utilizes the advantages of both in decoding.
Abstract
End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as pronunciation dictionaries, tokenization, and context-dependency trees, leading to a greatly simplified model-building process. There are two major types of end-to-end architectures for ASR: attention-based methods, which use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC), which uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively utilizes the advantages of both in decoding. We have applied the proposed method to two ASR benchmarks (spontaneous Japanese and Mandarin Chinese) and show performance comparable to conventional state-of-the-art DNN/HMM ASR systems without linguistic resources.
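
The abstract does not spell out how the two scores are combined. As a minimal sketch, assume a rescoring setup in which each complete hypothesis from the attention decoder is re-ranked by a log-linear interpolation of its attention score and its CTC score, with the CTC score computed by the standard forward (dynamic programming) recursion. The function names, the weight `lam`, and the toy inputs are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_log_prob(log_probs, labels, blank=0):
    """CTC forward algorithm: log p(labels | frames).

    log_probs[t][k] is the frame-level log posterior of symbol k at frame t.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S = len(ext)

    # Initialisation: a path may start with a blank or the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]                  # stay on the same symbol
            if s >= 1:
                cands.append(alpha[s - 1])      # advance by one position
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new

    # A valid path ends on the last label or the trailing blank.
    return alpha[0] if S == 1 else logsumexp(alpha[-1], alpha[-2])

def joint_rescore(hyps, log_probs, lam=0.3):
    """Pick the hypothesis maximising lam*log p_ctc + (1-lam)*log p_att.

    hyps is a list of (labels, attention_log_prob) pairs; lam is an
    assumed interpolation weight with 0 < lam < 1.
    """
    def score(hyp):
        labels, att_lp = hyp
        return lam * ctc_log_prob(log_probs, labels) + (1.0 - lam) * att_lp
    return max(hyps, key=score)[0]

# Toy input: 2 frames, 2 symbols (0 = blank, 1 = a label), uniform posteriors.
frames = [[math.log(0.5), math.log(0.5)]] * 2
# The attention scores prefer [1, 1], but CTC needs at least 3 frames to emit
# a repeated label, so that hypothesis gets probability zero under CTC.
best = joint_rescore([([1], math.log(0.4)), ([1, 1], math.log(0.6))], frames)
```

In this toy case the CTC term vetoes a hypothesis that cannot be aligned monotonically to the frames, while the attention term ranks the remaining candidates.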

Citations
Journal Article

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

TL;DR: The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks, and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.
Journal Article

Recent progresses in deep learning based acoustic models

TL;DR: In this paper, the authors summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques, and further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation.
Proceedings Article

Iterative Alignment Network for Continuous Sign Language Recognition

TL;DR: The framework consists of a 3D convolutional residual network for feature learning and an encoder-decoder network with connectionist temporal classification (CTC) for sequence modelling that is optimized in an alternate way for weakly supervised continuous sign language recognition.
Proceedings Article

Language independent end-to-end architecture for joint language identification and speech recognition

TL;DR: This paper presents a model that can recognize speech in 10 different languages, by directly performing grapheme (character/chunked-character) based speech recognition, based on the hybrid attention/connectionist temporal classification (CTC) architecture.
Proceedings Article

End-to-end Speech Recognition With Word-Based RNN Language Models

TL;DR: A novel word-based RNN-LM is proposed that allows decoding with only the word-based LM, which provides look-ahead word probabilities to predict the next characters in place of a character-based LM, leading to competitive accuracy with less computation than the multi-level LM.
References
Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Posted Content

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Journal Article

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Posted Content

ADADELTA: An Adaptive Learning Rate Method

Matthew D. Zeiler, 22 Dec 2012
TL;DR: A novel per-dimension learning rate method for gradient descent called ADADELTA that dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent is presented.
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.