Open Access Proceedings Article

Attention is All you Need

TL;DR
This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
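The attention mechanism at the heart of the proposed architecture is scaled dot-product attention: queries are compared against keys, the scores are scaled by the square root of the key dimension and normalized with a softmax, and the result weights the values. A minimal NumPy sketch is shown below for illustration only; the function name, the optional mask argument, and the toy shapes are assumptions made for this example, not the paper's reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, q_len, k_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # block masked positions
    scores -= scores.max(axis=-1, keepdims=True)       # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                 # (batch, q_len, d_v)

# Toy usage: one sequence of 4 positions with model width 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 4, 8))
K = rng.normal(size=(1, 4, 8))
V = rng.normal(size=(1, 4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (1, 4, 8)
```

In the full Transformer this operation is applied in parallel over several learned projections of the queries, keys, and values (multi-head attention), which is what lets the model relate all positions to one another without recurrence or convolutions.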



Citations
Proceedings Article

Global Relational Models of Source Code

TL;DR: This work bridges the divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers, which bias traditional Transformers with relational information from graph edge types.
Proceedings Article

Training Millions of Personalized Dialogue Agents

TL;DR: This article introduced a new dataset providing 5 million personas and 700 million persona-based dialogues and showed that training using personas still improves the performance of end-to-end dialogue models.
Proceedings Article

Dependency Graph Enhanced Dual-transformer Structure for Aspect-based Sentiment Classification.

TL;DR: A dual-transformer structure is devised in DGEDT to support mutual reinforcement between flat representation learning and graph-based representation learning, allowing the dependency graph to guide the representation learning of the transformer encoder and vice versa.
Proceedings Article

Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model

TL;DR: This work proposes a simple yet effective weakly supervised pretraining objective, which explicitly forces the model to incorporate knowledge about real-world entities, and consistently outperforms BERT on four entity-related question answering datasets.
Posted Content

Retrospective Reader for Machine Reading Comprehension

TL;DR: Inspired by how humans solve reading comprehension questions, a retrospective reader (Retro-Reader) is proposed that integrates two stages of reading and verification strategies: 1) sketchy reading, which briefly investigates the overall interactions of the passage and question and yields an initial judgment; 2) intensive reading, which verifies the answer and gives the final prediction.