Open Access · Proceedings Article · DOI

Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition

TLDR
In this article, the long-range history context is distilled into an augmented memory bank to reduce the computational complexity of self-attention, and a cache mechanism saves the computation of the key and value in self-attention for the left context.
Abstract
This paper proposes Emformer, an efficient memory transformer for low-latency streaming speech recognition. In Emformer, the long-range history context is distilled into an augmented memory bank to reduce the computational complexity of self-attention. A cache mechanism saves the computation of the key and value in self-attention for the left context. Emformer applies parallelized block processing in training to support low-latency models. We carry out experiments on the benchmark LibriSpeech data. Under an average latency of 960 ms, Emformer achieves a WER of 2.50% on test-clean and 5.62% on test-other. Compared with a strong baseline, the augmented memory transformer (AM-TRF), Emformer achieves a 4.6-fold training speedup and an 18% relative real-time factor (RTF) reduction in decoding, with relative WER reductions of 17% on test-clean and 9% on test-other. For a low-latency scenario with an average latency of 80 ms, Emformer achieves a WER of 3.01% on test-clean and 7.09% on test-other. Compared with an LSTM baseline of the same latency and model size, Emformer achieves relative WER reductions of 9% and 16% on test-clean and test-other, respectively.
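The abstract names three mechanisms: an augmented memory bank that summarizes long-range history, a cache that reuses left-context keys and values across blocks, and parallelized block processing. The sketch below illustrates the first two on a single attention layer. It is a minimal illustration in plain PyTorch, not the paper's implementation: projection matrices and right-context frames are omitted, the memory summary is a simple mean-pool, and the names (emformer_block_attention, left_cache_len, mem_len) are placeholders.

```python
import torch
import torch.nn.functional as F

def emformer_block_attention(blocks, d_model, left_cache_len=2, mem_len=4):
    """Sketch of one Emformer-style attention layer over a chunked utterance.

    blocks: list of [block_len, d_model] tensors (one chunk per streaming step).
    Projections (W_q, W_k, W_v) are omitted; in a real layer the cache would
    hold the projected keys/values so they are never recomputed.
    """
    key_cache, val_cache = [], []   # cached K/V from previous blocks (left context)
    mem_bank = []                   # augmented memory: one summary vector per old block
    outputs = []
    scale = d_model ** -0.5
    for blk in blocks:
        q = blk  # queries come from the current block only
        # keys/values: memory bank + cached left context + current block
        k = torch.cat(mem_bank + key_cache[-left_cache_len:] + [blk], dim=0)
        v = torch.cat(mem_bank + val_cache[-left_cache_len:] + [blk], dim=0)
        attn = F.softmax((q @ k.T) * scale, dim=-1)
        outputs.append(attn @ v)
        # cache this block's K/V so the next step reuses them instead of recomputing
        key_cache.append(blk)
        val_cache.append(blk)
        # distill the block into a single vector and append it to the memory bank
        mem_bank = (mem_bank + [blk.mean(dim=0, keepdim=True)])[-mem_len:]
    return outputs
```

Because each block attends to a bounded number of memory slots and cached left-context blocks rather than the full history, the per-block cost stays roughly constant as the utterance grows, which is the complexity reduction the abstract refers to.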


Citations
Proceedings Article · DOI

Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset

TL;DR: In this article, the authors explored the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset.
Journal Article · DOI

Recent Advances in End-to-End Automatic Speech Recognition

TL;DR: In this paper, the authors overview recent advances in E2E models, focusing on technologies that address the remaining challenges from the industry's perspective.
Proceedings Article · DOI

Improving the Fusion of Acoustic and Text Representations in RNN-T

TL;DR: Experimental results on a multilingual ASR setting for voice search over nine languages show that the joint use of the proposed methods can result in 4%–5% relative word error rate reductions with only a few million extra parameters.
Proceedings Article · DOI

Improving The Latency And Quality Of Cascaded Encoders

TL;DR: In this paper, the authors explore reducing the computational latency of the two-pass cascaded encoder model by shrinking the causal first-pass encoder and adding capacity to the non-causal second pass.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
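For reference, the update rule this TL;DR summarizes fits in a few lines. The moment estimates and bias corrections follow the published algorithm; numpy is used here purely for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad        # first moment: moving average of gradients
    v = b2 * v + (1 - b2) * grad**2     # second moment: moving average of squared gradients
    m_hat = m / (1 - b1**t)             # bias correction (estimates are initialized at zero)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```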
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
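The building block behind this architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal numpy rendering for single-head, unbatched inputs:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: [n_q, d_k], K: [n_k, d_k], V: [n_k, d_v] -> output [n_q, d_v]."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V
```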
Proceedings Article · DOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
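The "jointly conditioning on both left and right context" in this TL;DR comes from the masked-LM pre-training objective: roughly 15% of input tokens are selected as prediction targets and corrupted with the paper's 80%/10%/10% scheme ([MASK] / random token / unchanged). A minimal sketch of that corruption step; special-token handling is omitted, and the -100 ignore-label is a convention borrowed from common PyTorch training loops, not part of the paper:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Corrupt a batch of token ids for masked-LM training."""
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100  # loss is computed on selected positions only
    corrupted = input_ids.clone()
    # 80% of selected positions become [MASK]
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    corrupted[masked] = mask_token_id
    # half of the remainder (10% overall) become a random token; the rest stay unchanged
    random_pos = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    corrupted[random_pos] = torch.randint(vocab_size, input_ids.shape)[random_pos]
    return corrupted, labels
```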
Proceedings Article · DOI

Neural Machine Translation of Rare Words with Subword Units

TL;DR: This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
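The subword units here are learned with byte-pair encoding over a word-frequency dictionary: repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. The sketch below stays close to the reference code published in the paper, using its example vocabulary ('</w>' marks a word boundary):

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocabulary with the given pair fused into one symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):               # learn 10 merge operations
    pairs = get_pair_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
```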
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.