Proceedings ArticleDOI

Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition

TLDR
The Speech-Transformer is presented: a no-recurrence sequence-to-sequence model that relies entirely on attention mechanisms to learn positional dependencies and can therefore be trained faster and more efficiently. A 2D-Attention mechanism is also proposed, which jointly attends to the time and frequency axes of the 2-dimensional speech inputs, providing more expressive representations for the Speech-Transformer.
Abstract
Recurrent sequence-to-sequence models using the encoder-decoder architecture have made great progress in speech recognition. However, they suffer from slow training because internal recurrence limits training parallelization. In this paper, we present the Speech-Transformer, a no-recurrence sequence-to-sequence model that relies entirely on attention mechanisms to learn positional dependencies and can be trained faster and more efficiently. We also propose a 2D-Attention mechanism, which jointly attends to the time and frequency axes of the 2-dimensional speech inputs, thus providing more expressive representations for the Speech-Transformer. Evaluated on the Wall Street Journal (WSJ) speech recognition dataset, our best model achieves a competitive word error rate (WER) of 10.9%, while the whole training process takes only 1.2 days on one GPU, significantly faster than published results for recurrent sequence-to-sequence models.
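The abstract's key architectural claim is the 2D-Attention mechanism attending jointly over time and frequency. As a rough illustration of that idea, here is a minimal NumPy sketch that applies scaled dot-product self-attention along each axis of a spectrogram-like input and concatenates the results. This is a simplification under stated assumptions, not the paper's exact module (which uses convolutional queries, keys, and values and multiple heads); all names and shapes below are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Standard attention: softmax(q k^T / sqrt(d)) v over the last axis."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def attention_2d(x):
    """x: (time, freq) spectrogram. Self-attend across time steps and
    across frequency bins, then concatenate along the feature axis."""
    over_time = scaled_dot_product_attention(x, x, x)        # (time, freq)
    xt = x.T                                                 # (freq, time)
    over_freq = scaled_dot_product_attention(xt, xt, xt).T   # (time, freq)
    return np.concatenate([over_time, over_freq], axis=-1)   # (time, 2*freq)

spec = np.random.randn(100, 80)   # 100 frames, 80 mel filterbank bins
print(attention_2d(spec).shape)   # (100, 160)
```

Attending along both axes lets each frame exploit structure across frequency bins as well as across time, which is the intuition behind the claimed gain in expressiveness.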


Citations
Posted Content

Conformer: Convolution-augmented Transformer for Speech Recognition

TL;DR: This work proposes the convolution-augmented Transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracy.
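For readers unfamiliar with the Conformer design, a hedged PyTorch sketch of its central idea follows: a depthwise convolution module for local context sandwiched between self-attention (global context) and feed-forward modules, each wrapped in a residual connection. Module details and hyperparameters below are simplifications, not the published configuration.

```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, kernel_size=15):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d_model),
                                  nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                  nn.Linear(4 * d_model, d_model))
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d_model)
        # Depthwise 1-D convolution captures local context along time.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        self.ffn2 = nn.Sequential(nn.LayerNorm(d_model),
                                  nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                  nn.Linear(4 * d_model, d_model))
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)             # half-step feed-forward
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a)[0]          # global context via attention
        c = self.norm_conv(x).transpose(1, 2)  # (batch, d_model, time)
        x = x + self.conv(c).transpose(1, 2)   # local context via convolution
        x = x + 0.5 * self.ffn2(x)
        return self.norm_out(x)

x = torch.randn(2, 50, 256)
print(ConformerBlockSketch()(x).shape)  # torch.Size([2, 50, 256])
```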
Proceedings ArticleDOI

A Comparative Study on Transformer vs RNN in Speech Applications

TL;DR: A comparative study of Transformer and RNN sequence-to-sequence models across speech applications, including automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS), reporting the surprising superiority of Transformer in 13 of 15 ASR benchmarks.
Proceedings ArticleDOI

Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

TL;DR: An end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system; the full-attention version of the model beats state-of-the-art accuracy on the LibriSpeech benchmarks.
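The streamable property comes from restricting each frame's self-attention to a limited window of past (and optionally a few future) frames. A minimal NumPy sketch of that masking idea, with illustrative window sizes rather than the paper's actual settings:

```python
import numpy as np

def streaming_attention_mask(n_frames, left_context, right_context):
    """mask[i, j] is True where frame i may attend to frame j."""
    idx = np.arange(n_frames)
    rel = idx[None, :] - idx[:, None]   # j - i for every (i, j) pair
    return (rel >= -left_context) & (rel <= right_context)

# With no right context, no frame depends on the future, so the encoder
# can run incrementally as audio arrives.
print(streaming_attention_mask(6, left_context=2, right_context=0).astype(int))
```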
Proceedings ArticleDOI

Transformer-Based Acoustic Modeling for Hybrid Speech Recognition

TL;DR: This article proposes and evaluates transformer-based acoustic models (AMs) for hybrid speech recognition, covering various positional embedding methods and an iterated loss that enables training deep transformers.
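As background for the positional embedding methods such papers compare, here is the classic sinusoidal encoding from the original Transformer, which injects the order information that pure attention otherwise lacks. It is shown as one representative example, not as this article's preferred variant.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(n_positions)[:, None]         # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even feature dims
    pe[:, 1::2] = np.cos(angles)                  # odd feature dims
    return pe

print(sinusoidal_positional_encoding(100, 64).shape)  # (100, 64)
```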
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
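The Adam update itself is compact enough to write out. The sketch below mirrors the moment estimates and bias correction of the published Algorithm 1, using the paper's default hyperparameters, with a toy quadratic to show usage.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad         # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = x^2 from x = 5; the gradient is 2x.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.1)
print(theta)  # close to 0, up to a small residual oscillation
```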
Posted Content

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

TL;DR: Advanced recurrent units that implement a gating mechanism are evaluated empirically on sequence modeling tasks; the recently proposed gated recurrent unit (GRU) is found to be comparable to the long short-term memory (LSTM) unit.
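The GRU compared in this evaluation can be written in a few lines: an update gate interpolates between the previous hidden state and a candidate state computed with a reset gate. The NumPy sketch below follows the formulation in this paper; the random weights are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde          # gated interpolation

d_in, d_h = 4, 8
rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal(shape)
           for shape in [(d_h, d_in), (d_h, d_h)] * 3]  # Wz,Uz,Wr,Ur,Wh,Uh
h = np.zeros(d_h)
for _ in range(5):                            # run a few timesteps
    h = gru_cell(rng.standard_normal(d_in), h, *weights)
print(h.shape)  # (8,)
```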
Posted Content

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

TL;DR: GNMT, Google's Neural Machine Translation system, is presented; it attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
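Wordpieces are the mechanism behind that character/word balance: frequent strings stay intact while rare words split into subword units. Below is a hedged sketch of greedy longest-match segmentation over a toy vocabulary; the "##" continuation marker follows the common WordPiece convention rather than GNMT's exact markers, and a real vocabulary is learned from data.

```python
def wordpiece_segment(word, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):  # longest match first
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            return ["<unk>"]                     # no match: unknown token
    return pieces

vocab = {"trans", "##form", "##er", "speech", "##s"}
print(wordpiece_segment("transformer", vocab))   # ['trans', '##form', '##er']
print(wordpiece_segment("speechs", vocab))       # ['speech', '##s']
```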
Posted Content

Speech Recognition with Deep Recurrent Neural Networks

TL;DR: In this paper, deep recurrent neural networks (RNNs) combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs.