Joint CTC/attention decoding for end-to-end speech recognition
Takaaki Hori, Shinji Watanabe, John R. Hershey, et al.
pp. 518-529
TL;DR: This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively utilizes the advantages of both during decoding.
Abstract: End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as pronunciation dictionaries, tokenization, and context-dependency trees, leading to a greatly simplified model-building process. There are two major types of end-to-end architectures for ASR: attention-based methods, which use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC), which uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively utilizes the advantages of both during decoding. We have applied the proposed method to two ASR benchmarks (spontaneous Japanese and Mandarin Chinese), showing performance comparable to conventional state-of-the-art DNN/HMM ASR systems without linguistic resources.
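To make the joint decoding idea concrete, here is a minimal sketch of how a beam-search hypothesis can be rescored by interpolating CTC and attention log-probabilities with a weight lambda. The function names and the assumption that per-hypothesis scores are already computed are illustrative, not the paper's actual implementation:

```python
def joint_score(ctc_logprob: float, att_logprob: float, lam: float = 0.3) -> float:
    """Interpolate CTC and attention scores for one hypothesis (log domain)."""
    return lam * ctc_logprob + (1.0 - lam) * att_logprob

def rescore_beam(hyps, lam=0.3, beam=10):
    """hyps: list of (tokens, ctc_logprob, att_logprob) partial hypotheses.

    Returns the top `beam` hypotheses ranked by the joint CTC/attention score.
    """
    scored = [(joint_score(c, a, lam), toks) for toks, c, a in hyps]
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:beam]
```

In the hybrid architecture, combining the two scores lets the monotonic-alignment bias of CTC prune hypotheses whose attention alignments wander, while the attention decoder supplies richer label dependencies.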
Citations
Journal Article
Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
TL;DR: The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks and exhibits performance comparable to conventional DNN/HMM ASR systems, exploiting the advantages of both multi-objective learning and joint decoding, without linguistic resources.
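The multi-objective learning mentioned in this TL;DR is commonly written as an interpolation of the two training losses; the notation below follows the standard formulation rather than being quoted from the paper:

```latex
\mathcal{L}_{\mathrm{MTL}} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{att}}, \qquad 0 \le \lambda \le 1.
```

The CTC loss acts as a regularizer that encourages monotonic frame-to-symbol alignments in the shared encoder, while the attention loss trains the decoder.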
Journal Article
Recent progresses in deep learning based acoustic models
TL;DR: In this paper, the authors summarize recent progress in deep-learning-based acoustic models and the motivation and insights behind the surveyed techniques; they further illustrate robustness issues in speech recognition systems and discuss acoustic model adaptation, speech enhancement, and separation.
Proceedings Article
Iterative Alignment Network for Continuous Sign Language Recognition
TL;DR: The framework consists of a 3D convolutional residual network for feature learning and an encoder-decoder network with connectionist temporal classification (CTC) for sequence modelling, optimized in an alternating manner for weakly supervised continuous sign language recognition.
Proceedings Article
Language independent end-to-end architecture for joint language identification and speech recognition
TL;DR: This paper presents a model that can recognize speech in 10 different languages, by directly performing grapheme (character/chunked-character) based speech recognition, based on the hybrid attention/connectionist temporal classification (CTC) architecture.
Proceedings Article
End-to-end Speech Recognition With Word-Based Rnn Language Models
TL;DR: A novel word-based RNN-LM is proposed, which allows decoding with only the word-based LM: it provides look-ahead word probabilities to predict the next characters instead of using a character-based LM, yielding competitive accuracy with less computation compared to the multi-level LM.
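A hedged sketch of the look-ahead idea described in this TL;DR: each candidate next character is scored by massing the word-level LM probabilities of all vocabulary words consistent with the in-progress character prefix. The function and variable names are illustrative, not taken from the paper:

```python
def lookahead_char_scores(word_probs: dict, prefix: str) -> dict:
    """word_probs: word -> probability from a word-level LM, given the history.

    Returns unnormalized scores for each possible next character, obtained by
    summing the probabilities of all words that extend the current prefix.
    """
    scores = {}
    for word, p in word_probs.items():
        if word.startswith(prefix) and len(word) > len(prefix):
            c = word[len(prefix)]
            scores[c] = scores.get(c, 0.0) + p
    return scores
```

In practice this prefix-sum would be computed over a prefix tree of the vocabulary rather than a linear scan, but the scoring principle is the same.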
References
Proceedings Article
Neural Machine Translation by Jointly Learning to Align and Translate
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
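The soft-search described here is the additive attention mechanism that the hybrid CTC/attention decoder also relies on. Below is a minimal NumPy sketch of one attention step; the parameter shapes and names are assumptions for illustration, not the reference's exact parameterization:

```python
import numpy as np

def additive_attention(query, keys, W_q, W_k, v):
    """Bahdanau-style additive attention.

    query: (d_q,) current decoder state; keys: (T, d_k) encoder states.
    W_q: (d, d_q), W_k: (d, d_k), v: (d,) learned projection parameters.
    Returns the attention weights over the T source positions and the context vector.
    """
    # e_t = v^T tanh(W_q q + W_k k_t) for each source position t
    energies = np.tanh(keys @ W_k.T + query @ W_q.T) @ v   # (T,)
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()                                # softmax over T
    context = weights @ keys                                # (d_k,) weighted sum
    return weights, context
```

Because the weights are recomputed at every output step, the decoder can attend to different input regions per symbol instead of compressing the whole input into one fixed-length vector.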
Journal Article
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew W. Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, Brian Kingsbury, et al.
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Posted Content
ADADELTA: An Adaptive Learning Rate Method
TL;DR: A novel per-dimension learning rate method for gradient descent called ADADELTA that dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent is presented.
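For reference, the per-dimension ADADELTA update summarized above accumulates decayed averages of squared gradients and squared updates, so no global learning rate is needed. A minimal sketch (the `state` dictionary layout is an assumption for illustration):

```python
import numpy as np

def adadelta_step(grad, state, rho=0.95, eps=1e-6):
    """One ADADELTA update; returns the parameter delta to apply.

    state holds running averages 'Eg2' (of g^2) and 'Edx2' (of dx^2),
    each an array with the same shape as grad, initialized to zeros.
    """
    state['Eg2'] = rho * state['Eg2'] + (1 - rho) * grad**2
    rms_dx = np.sqrt(state['Edx2'] + eps)   # RMS of past updates
    rms_g = np.sqrt(state['Eg2'] + eps)     # RMS of gradients
    dx = -(rms_dx / rms_g) * grad           # per-dimension step
    state['Edx2'] = rho * state['Edx2'] + (1 - rho) * dx**2
    return dx
```

The ratio of the two RMS terms gives each dimension its own effective step size, which is why the method only uses first-order information with minimal overhead.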
Proceedings Article
The Kaldi Speech Recognition Toolkit
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Kumar Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, Karel Vesely, et al.
TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Related Papers (5)
Towards End-To-End Speech Recognition with Recurrent Neural Networks
Alex Graves, Navdeep Jaitly, et al.