Open Access · Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- Vol. 30, pp. 5998-6008
TLDR
This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
Abstract:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
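The attention mechanism the abstract refers to is scaled dot-product attention: each query is compared against all keys, the scaled similarities are normalized with a softmax, and the result weights a sum over the values. A minimal pure-Python sketch of that operation (function and variable names here are illustrative, not from the paper's code):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    queries, keys: lists of d_k-dimensional vectors (plain Python lists);
    values: list of d_v-dimensional vectors, one per key.
    Returns one output vector per query: a softmax-weighted mix of the values.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Convex combination of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# The query is aligned with the first key, so the output leans toward the first value.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = scaled_dot_product_attention(queries, keys, values)
```

Because the softmax weights sum to 1, each output vector is a convex combination of the values; the 1/sqrt(d_k) scaling keeps the dot products from saturating the softmax as the key dimension grows. The full model stacks many of these attention operations (multi-head, in matrix form) rather than looping as this sketch does.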
Citations
Proceedings Article
Global Relational Models of Source Code
TL;DR: This work bridges the divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers, which bias traditional Transformers with relational information from graph edge types.
Proceedings Article (DOI)
Training Millions of Personalized Dialogue Agents
TL;DR: This article introduced a new dataset providing 5 million personas and 700 million persona-based dialogues and showed that training using personas still improves the performance of end-to-end dialogue models.
Proceedings Article (DOI)
Dependency Graph Enhanced Dual-transformer Structure for Aspect-based Sentiment Classification.
TL;DR: A dual-transformer structure is devised in DGEDT to support mutual reinforcement between flat representation learning and graph-based representation learning, and to allow the dependency graph to guide the representation learning of the transformer encoder and vice versa.
Proceedings Article
Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model
TL;DR: This work proposes a simple yet effective weakly supervised pretraining objective, which explicitly forces the model to incorporate knowledge about real-world entities, and consistently outperforms BERT on four entity-related question answering datasets.
Posted Content
Retrospective Reader for Machine Reading Comprehension
TL;DR: Inspired by how humans solve reading comprehension questions, a retrospective reader (Retro-Reader) is proposed that integrates two stages of reading and verification strategies: 1) sketchy reading, which briefly investigates the overall interactions of passage and question and yields an initial judgment; 2) intensive reading, which verifies the answer and gives the final prediction.