Open Access · Proceedings Article · DOI

Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition

TLDR
In this article, the long-range history context is distilled into an augmented memory bank to reduce the computational complexity of self-attention, and a cache mechanism saves the computation of the key and value in self-attention for the left context.
Abstract
This paper proposes Emformer, an efficient memory transformer for low-latency streaming speech recognition. In Emformer, the long-range history context is distilled into an augmented memory bank to reduce the computational complexity of self-attention. A cache mechanism saves the computation of the key and value in self-attention for the left context. Emformer applies parallelized block processing in training to support low-latency models. We carry out experiments on the benchmark LibriSpeech data. Under an average latency of 960 ms, Emformer achieves a WER of 2.50% on test-clean and 5.62% on test-other. Compared with a strong baseline, the augmented memory transformer (AM-TRF), Emformer achieves a 4.6-fold training speedup and an 18% relative real-time factor (RTF) reduction in decoding, with relative WER reductions of 17% on test-clean and 9% on test-other. For a low-latency scenario with an average latency of 80 ms, Emformer achieves a WER of 3.01% on test-clean and 7.09% on test-other. Compared with an LSTM baseline of the same latency and model size, Emformer achieves relative WER reductions of 9% and 16% on test-clean and test-other, respectively.
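The abstract names three mechanisms: an augmented memory bank that summarizes long-range history, a cache that reuses left-context keys and values across blocks, and parallelized block processing. The sketch below illustrates the first two on a single attention layer. It is a minimal illustration in plain PyTorch, not the paper's implementation: projection matrices and right-context frames are omitted, the memory summary is a simple mean-pool, and the names (emformer_block_attention, left_cache_len, mem_len) are placeholders.

```python
import torch
import torch.nn.functional as F

def emformer_block_attention(blocks, d_model, left_cache_len=2, mem_len=4):
    """Sketch of one Emformer-style attention layer over a chunked utterance.

    blocks: list of [block_len, d_model] tensors (one chunk per streaming step).
    Projections (W_q, W_k, W_v) are omitted; in a real layer the cache would
    hold the projected keys/values so they are never recomputed.
    """
    key_cache, val_cache = [], []   # cached K/V from previous blocks (left context)
    mem_bank = []                   # augmented memory: one summary vector per old block
    outputs = []
    scale = d_model ** -0.5
    for blk in blocks:
        q = blk  # queries come from the current block only
        # keys/values: memory bank + cached left context + current block
        k = torch.cat(mem_bank + key_cache[-left_cache_len:] + [blk], dim=0)
        v = torch.cat(mem_bank + val_cache[-left_cache_len:] + [blk], dim=0)
        attn = F.softmax((q @ k.T) * scale, dim=-1)
        outputs.append(attn @ v)
        # cache this block's K/V so the next step reuses them instead of recomputing
        key_cache.append(blk)
        val_cache.append(blk)
        # distill the block into a single vector and append it to the memory bank
        mem_bank = (mem_bank + [blk.mean(dim=0, keepdim=True)])[-mem_len:]
    return outputs
```

Because each block attends to a bounded number of memory slots and cached left-context blocks rather than the full history, the per-block cost stays roughly constant as the utterance grows, which is the complexity reduction the abstract refers to.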


Citations
Proceedings Article · DOI

Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset

TL;DR: In this article, the authors explored the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset.
Journal Article · DOI

Recent Advances in End-to-End Automatic Speech Recognition

TL;DR: In this paper, the authors overview recent advances in E2E models, focusing on technologies that address the remaining challenges from the industry's perspective.
Proceedings Article · DOI

Improving the Fusion of Acoustic and Text Representations in RNN-T

TL;DR: Experimental results on a multilingual ASR setting for voice search over nine languages show that the joint use of the proposed methods can result in 4%–5% relative word error rate reductions with only a few million extra parameters.
Proceedings Article · DOI

Improving The Latency And Quality Of Cascaded Encoders

TL;DR: In this paper, the authors explore reducing the computational latency of the two-pass cascaded encoder model by shrinking the causal first-pass encoder and adding capacity to the non-causal second pass.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
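For reference, the update rule this TL;DR summarizes fits in a few lines. The moment estimates and bias corrections follow the published algorithm; numpy is used here purely for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad        # first moment: moving average of gradients
    v = b2 * v + (1 - b2) * grad**2     # second moment: moving average of squared gradients
    m_hat = m / (1 - b1**t)             # bias correction (estimates are initialized at zero)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```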
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
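The building block behind this architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal numpy rendering for single-head, unbatched inputs:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: [n_q, d_k], K: [n_k, d_k], V: [n_k, d_v] -> output [n_q, d_v]."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V
```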
Proceedings Article · DOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
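The "jointly conditioning on both left and right context" in this TL;DR comes from the masked-LM pre-training objective: roughly 15% of input tokens are selected as prediction targets and corrupted with the paper's 80%/10%/10% scheme ([MASK] / random token / unchanged). A minimal sketch of that corruption step; special-token handling is omitted, and the -100 ignore-label is a convention borrowed from common PyTorch training loops, not part of the paper:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Corrupt a batch of token ids for masked-LM training."""
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100  # loss is computed on selected positions only
    corrupted = input_ids.clone()
    # 80% of selected positions become [MASK]
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    corrupted[masked] = mask_token_id
    # half of the remainder (10% overall) become a random token; the rest stay unchanged
    random_pos = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    corrupted[random_pos] = torch.randint(vocab_size, input_ids.shape)[random_pos]
    return corrupted, labels
```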
Proceedings Article · DOI

Neural Machine Translation of Rare Words with Subword Units

TL;DR: This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
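The subword units here are learned with byte-pair encoding over a word-frequency dictionary: repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. The sketch below stays close to the reference code published in the paper, using its example vocabulary ('</w>' marks a word boundary):

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocabulary with the given pair fused into one symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):               # learn 10 merge operations
    pairs = get_pair_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
```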
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.