Open Access Proceedings Article

Attention is All you Need

TLDR
This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
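For a concrete picture of the attention mechanism the abstract refers to, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; the function and variable names are illustrative and not taken from any official implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled to keep the softmax well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Toy self-attention over 4 positions with 8-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```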



Citations
Posted Content

GLU Variants Improve Transformer.

Noam Shazeer
12 Feb 2020
TL;DR: Gated Linear Units (GLU) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function; the paper finds that some GLU variants yield quality improvements over the typically used ReLU or GELU activations.
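As a rough illustration of the gating described above, a GLU layer can be written as (xW + b) ⊙ σ(xV + c), the element-wise product of one linear projection with the sigmoid of another. The NumPy sketch below is a minimal, assumption-laden version of that definition, not the paper's code.

```python
import numpy as np

def glu(x, W, V, b, c):
    """Gated Linear Unit: (xW + b) * sigmoid(xV + c), applied component-wise."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return (x @ W + b) * sigmoid(x @ V + c)

# Toy usage: project a 16-dim input to an 8-dim gated output.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
W, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
b, c = np.zeros(8), np.zeros(8)
print(glu(x, W, V, b, c).shape)  # (4, 8)
```

The GLU variants studied in the paper swap the sigmoid gate for other activations; the structure of the layer stays the same.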
Posted Content

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation.

TL;DR: This paper presents a workflow for 8-bit quantization that is able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are more difficult to quantize, such as MobileNets and BERT-large.
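To make the idea of 8-bit quantization concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, where the scale is chosen from the tensor's maximum absolute value. It illustrates the general principle only and is not the specific workflow evaluated in the paper.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization to int8."""
    scale = np.abs(x).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3)).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())    # worst-case quantization error
```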
Posted Content

ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

TL;DR: This paper proposes a simple method that scales the widths of ContextNet to achieve a good trade-off between computation and accuracy, and demonstrates that ContextNet achieves a word error rate of 2.1%/4.6% on the widely used LibriSpeech benchmark.
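The width scaling mentioned above can be pictured as multiplying every layer's channel count by a single factor alpha. The snippet below is a generic, hypothetical sketch of that idea (the layer sizes are made up), not ContextNet's actual configuration.

```python
def scale_widths(base_channels, alpha):
    """Scale every layer's channel count by a single width multiplier alpha."""
    return [max(1, int(round(c * alpha))) for c in base_channels]

# Hypothetical base configuration; alpha trades accuracy for computation.
base = [256, 256, 512, 512, 640]
print(scale_widths(base, 0.5))  # smaller, cheaper model
print(scale_widths(base, 2.0))  # larger, more accurate model
```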
Proceedings ArticleDOI

Strategies for Structuring Story Generation

TL;DR: This paper explores coarse-to-fine models for creating narrative texts of several hundred words and introduces new models that decompose stories by abstracting over actions and entities, which helps improve the diversity and coherence of events and entities in generated stories.
Proceedings ArticleDOI

DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting

TL;DR: This paper proposes a dual self-attention network (DSANet) for highly efficient multivariate time series forecasting, especially for dynamic-period or nonperiodic series, and shows that it is effective and outperforms baseline methods.