Open Access Proceedings Article

Attention is All you Need

TLDR
This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single-model state of the art by 0.7 BLEU, achieving a BLEU score of 41.1.
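The abstract's central building block is attention applied in place of recurrence. Below is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, using NumPy; the toy shapes and the helper names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, q_len, k_len)
    weights = softmax(scores, axis=-1)                # attention weights per query
    return weights @ V                                # weighted sum of values

# Toy usage: batch of 2 sequences, 5 positions, width 8 (illustrative sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 5, 8))
K = rng.normal(size=(2, 5, 8))
V = rng.normal(size=(2, 5, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 5, 8)
```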



Citations
Posted Content

LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

TL;DR: Proposes new pretrained contextualized representations of words and entities based on the bidirectional transformer, together with an entity-aware self-attention mechanism that considers the types of tokens (words or entities) when computing attention scores.
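To make "entity-aware" attention concrete, here is a hedged sketch in which attention scores depend on whether the query and key tokens are words or entities, via a separate query projection per type pair. The function and matrix names are illustrative assumptions, not LUKE's actual API.

```python
import numpy as np

def entity_aware_scores(X, token_types, W_q, W_k, d_k):
    """token_types[i] is 0 for a word token, 1 for an entity token.
    W_q maps (query_type, key_type) -> (d_model, d_k) query projection."""
    K = X @ W_k                         # shared key projection
    n = X.shape[0]
    scores = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # Choose the query projection by the (query, key) token-type pair.
            q = X[i] @ W_q[(token_types[i], token_types[j])]
            scores[i, j] = q @ K[j] / np.sqrt(d_k)
    return scores

# Toy usage: 3 word tokens followed by 1 entity token (illustrative sizes).
rng = np.random.default_rng(0)
d_model, d_k = 8, 8
X = rng.normal(size=(4, d_model))
W_q = {(a, b): rng.normal(size=(d_model, d_k)) for a in (0, 1) for b in (0, 1)}
W_k = rng.normal(size=(d_model, d_k))
print(entity_aware_scores(X, [0, 0, 0, 1], W_q, W_k, d_k).shape)  # (4, 4)
```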
Proceedings Article DOI

Entangled Transformer for Image Captioning

TL;DR: A Transformer-based sequence modeling framework built only with attention layers and feedforward layers, which enables the Transformer to exploit semantic and visual information simultaneously and achieves state-of-the-art performance on the MSCOCO image captioning dataset.
Posted Content

DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations

TL;DR: Inspired by recent advances in deep metric learning (DML), this work carefully designs a self-supervised objective for learning universal sentence embeddings that does not require labelled training data, closing the performance gap between unsupervised and supervised pretraining for universal sentence encoders.
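The objective described here is contrastive: embeddings of text spans from the same document are pulled together while other in-batch spans act as negatives. The sketch below shows a generic InfoNCE-style loss under that assumption; it is not DeCLUTR's exact implementation, and the names and temperature value are illustrative.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """anchors, positives: (batch, dim) L2-normalized span embeddings,
    where positives[i] comes from the same document as anchors[i]."""
    sims = anchors @ positives.T / temperature            # (batch, batch) similarities
    # Cross-entropy with the matching pair (the diagonal) as the target class.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: positives are noisy copies of anchors, so the loss should be small.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 16)); a /= np.linalg.norm(a, axis=1, keepdims=True)
p = a + 0.05 * rng.normal(size=(4, 16)); p /= np.linalg.norm(p, axis=1, keepdims=True)
print(info_nce_loss(a, p))
```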
Posted Content

Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation.

TL;DR: This paper factorizes 2D self-attention into two 1D self-attentions, a novel building block that can be stacked to form axial-attention models for image classification and dense prediction, and achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
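The factorization itself is simple to sketch: instead of attending over all H*W positions at once, attention is applied along the height axis and then along the width axis. The code below shows only that factorization, assuming plain self-attention per axis; the position-sensitive terms used by Axial-DeepLab are omitted, and the function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_1d(X):
    """Self-attention over the second-to-last axis of X, shape (..., length, dim)."""
    d = X.shape[-1]
    scores = X @ np.swapaxes(X, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ X

def axial_attention(feature_map):
    """feature_map: (H, W, C). Attend along columns (height), then rows (width)."""
    x = np.transpose(feature_map, (1, 0, 2))   # (W, H, C): each column is a length-H sequence
    x = attention_1d(x)
    x = np.transpose(x, (1, 0, 2))             # back to (H, W, C)
    return attention_1d(x)                     # each row is a length-W sequence

print(axial_attention(np.random.default_rng(0).normal(size=(4, 6, 8))).shape)  # (4, 6, 8)
```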