Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer

doi:10.1109/ASRU.2017.8268935

Open Access, Proceedings Article

pp. 193-199

TLDR

This paper investigates the recurrent neural network transducer (RNN-T), a streaming end-to-end speech recognition model that jointly learns acoustic and language model components from transcribed acoustic data. Its best system performs comparably to a state-of-the-art baseline on voice-search and voice-dictation tasks.

Abstract:

We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We explore various model architectures and demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an ‘encoder’, which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a ‘decoder’, which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units (‘wordpieces’), which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets, achieves a word error rate of 8.5% on voice-search and 5.2% on voice-dictation tasks and is comparable to a state-of-the-art baseline at 8.3% on voice-search and 5.4% on voice-dictation.
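
The word error rates quoted in the abstract are conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the recognized transcript and the reference, divided by the number of reference words. The following is a minimal sketch of that standard metric, not the paper's scoring pipeline: the function name `word_error_rate` and the example strings are illustrative assumptions, and any text normalization applied before scoring is not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; whitespace tokenization is an
    assumption, not something stated in the abstract.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("call mom on mobile", "call mom mobile"))
```

Because insertions count as errors while the denominator is the reference length, this ratio can exceed 1.0 for very poor hypotheses.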

Citations

Posted Content

Conformer: Convolution-augmented Transformer for Speech Recognition

Anmol Gulati, +10 more

16 May 2020

arXiv: Audio and Speech Processing

TL;DR: This work proposes the convolution-augmented Transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracies.

Proceedings Article

Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

Qian Zhang, +6 more

TL;DR: An end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system; the full-attention version of the model beats the state-of-the-art accuracy on the LibriSpeech benchmarks.

Proceedings Article

Improved Training of End-to-end Attention Models for Speech Recognition

Albert Zeyer, +3 more

TL;DR: In this article, a sequence-to-sequence attention-based model on subword units was proposed to achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks.

Proceedings Article

CommanderSong: a systematic approach for practical adversarial voice recognition

Xuejing Yuan, +9 more

TL;DR: Novel techniques are developed that address a key technical challenge: integrating the commands into a song in a way that can be effectively recognized by ASR through the air, in the presence of background noise, while not being detected by a human listener.

References

Journal Article

Long short-term memory

Sepp Hochreiter, +1 more

01 Nov 1997

Neural Computation

TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

Proceedings Article

Sequence to Sequence Learning with Neural Networks

Ilya Sutskever, +2 more

TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.

Posted Content

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Martín Abadi, +39 more

01 Jan 2015

arXiv: Distributed, Parallel, and Cluste...

TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.

Proceedings Article

Speech recognition with deep recurrent neural networks

Alex Graves, +2 more

TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.

Posted Content

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, +30 more

26 Sep 2016

arXiv: Computation and Language

TL;DR: GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.

Related Papers (5)

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Attention is All you Need

Speech recognition with deep recurrent neural networks