Open Access Proceedings Article

Attention is All you Need

TL;DR
This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
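The attention mechanism at the heart of the proposed architecture is scaled dot-product attention: queries are compared against keys, the scores are scaled by the square root of the key dimension and normalized with a softmax, and the result weights the values. A minimal NumPy sketch is shown below for illustration only; the function name, the optional mask argument, and the toy shapes are assumptions made for this example, not the paper's reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, q_len, k_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # block masked positions
    scores -= scores.max(axis=-1, keepdims=True)       # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                 # (batch, q_len, d_v)

# Toy usage: one sequence of 4 positions with model width 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 4, 8))
K = rng.normal(size=(1, 4, 8))
V = rng.normal(size=(1, 4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (1, 4, 8)
```

In the full Transformer this operation is applied in parallel over several learned projections of the queries, keys, and values (multi-head attention), which is what lets the model relate all positions to one another without recurrence or convolutions.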



Citations
Proceedings Article

Global Relational Models of Source Code

TL;DR: This work bridges the divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers, which bias traditional Transformers with relational information from graph edge types.
Proceedings Article

Training Millions of Personalized Dialogue Agents

TL;DR: This article introduced a new dataset providing 5 million personas and 700 million persona-based dialogues and showed that training using personas still improves the performance of end-to-end dialogue models.
Proceedings Article

Dependency Graph Enhanced Dual-transformer Structure for Aspect-based Sentiment Classification.

TL;DR: A dual-transformer structure is devised in DGEDT to support mutual reinforcement between flat representation learning and graph-based representation learning, allowing the dependency graph to guide the representation learning of the transformer encoder and vice versa.
Proceedings Article

Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model

TL;DR: This work proposes a simple yet effective weakly supervised pretraining objective, which explicitly forces the model to incorporate knowledge about real-world entities, and consistently outperforms BERT on four entity-related question answering datasets.
Posted Content

Retrospective Reader for Machine Reading Comprehension

TL;DR: Inspired by how humans solve reading comprehension questions, a retrospective reader (Retro-Reader) is proposed that integrates two stages of reading and verification strategies: 1) sketchy reading, which briefly investigates the overall interactions of the passage and question and yields an initial judgment; 2) intensive reading, which verifies the answer and gives the final prediction.