Open Access Proceedings Article

Attention is All you Need

TLDR
This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
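The core operation behind the proposed architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Below is a minimal NumPy sketch of that formula; the array shapes and variable names are our own illustration, not the paper's reference code:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax over keys
    return weights @ V                             # weighted sum of value vectors

# toy example: 3 query positions attending over 4 key/value positions
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```

The 1/√d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.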



Citations
Proceedings Article

Semantic Graph Convolutional Networks for 3D Human Pose Regression

TL;DR: The paper proposes Semantic Graph Convolutional Networks (SemGCN), a novel neural network architecture for regression tasks on graph-structured data that learns to capture semantic information, such as local and global node relationships, not explicitly represented in the graph.
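SemGCN extends graph convolutions with learned, per-edge "semantic" weights; the exact formulation is in the paper. For reference, here is a minimal sketch of the vanilla graph-convolution layer it builds on, applied to a toy chain-skeleton adjacency of our own choosing:

```python
import numpy as np

def gcn_layer(X, A, W):
    """One vanilla graph-convolution layer: H = ReLU(D^{-1/2} (A+I) D^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)  # ReLU

# toy skeleton: 4 joints in a chain, mapping 2-D features to 3-D features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(1).normal(size=(4, 2))
W = np.random.default_rng(2).normal(size=(2, 3))
print(gcn_layer(X, A, W).shape)  # (4, 3)
```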
Proceedings Article

Social IQa: Commonsense Reasoning about Social Interactions

TL;DR: Social IQa is a large-scale benchmark for commonsense reasoning about social situations, containing 38,000 multiple-choice questions that probe emotional and social intelligence in a variety of everyday situations.
Posted Content

Green AI

TL;DR: Making AI research more computationally efficient will decrease its carbon footprint and increase its inclusivity, since deep learning research should not require the deepest pockets.
Posted Content

Patient Knowledge Distillation for BERT Model Compression

TL;DR: The authors propose a Patient Knowledge Distillation approach that compresses an original large model (teacher) into an equally effective lightweight shallow network (student), enabling the student to patiently learn from and imitate the teacher through a multi-layer distillation process.
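The distillation objective combines a standard cross-entropy term, a softened teacher-student KL term, and a "patience" term that matches normalized hidden states at selected teacher layers. A minimal NumPy sketch under those assumptions; the weighting coefficients, temperature, and layer choices below are illustrative, not the paper's tuned values:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def patient_kd_loss(student_logits, teacher_logits, labels,
                    student_hidden, teacher_hidden,
                    alpha=0.5, beta=10.0, T=2.0):
    # cross-entropy with the true labels
    p_s = softmax(student_logits)
    ce = -np.log(p_s[np.arange(len(labels)), labels]).mean()
    # soft-target term: KL between softened teacher and student distributions
    p_t = softmax(teacher_logits, T)
    kd = (p_t * (np.log(p_t) - np.log(softmax(student_logits, T)))).sum(axis=-1).mean()
    # "patient" term: MSE between L2-normalized hidden states at selected layers
    def norm(h):
        return h / np.linalg.norm(h, axis=-1, keepdims=True)
    pt = np.mean([(norm(hs) - norm(ht)) ** 2
                  for hs, ht in zip(student_hidden, teacher_hidden)])
    return (1 - alpha) * ce + alpha * kd + beta * pt

# toy usage: batch of 8, 3 classes, two matched hidden layers of width 16
rng = np.random.default_rng(0)
s_logits, t_logits = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))
labels = rng.integers(0, 3, size=8)
s_hid = [rng.normal(size=(8, 16)) for _ in range(2)]
t_hid = [rng.normal(size=(8, 16)) for _ in range(2)]
print(patient_kd_loss(s_logits, t_logits, labels, s_hid, t_hid))
```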
Proceedings Article

SR-LSTM: State Refinement for LSTM Towards Pedestrian Trajectory Prediction

TL;DR: Zhang et al. propose a data-driven state refinement module for the LSTM network (SR-LSTM) that activates the utilization of the current intention of neighbors and jointly and iteratively refines the current states of all participants in the crowd through a message-passing mechanism.
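SR-LSTM's actual refinement module uses a motion gate and pedestrian-wise attention; as a rough illustration only, here is a single message-passing step in which each pedestrian's hidden state is updated from distance-weighted neighbor states. The Gaussian weighting and 50/50 state mixing are our simplifications, not the paper's design:

```python
import numpy as np

def refine_states(H, positions, sigma=1.0):
    """One message-passing refinement step over pedestrians' hidden states.

    Each pedestrian aggregates neighbors' states with weights that decay
    with squared distance, then mixes the message into its own state.
    """
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))             # pairwise proximity weights
    np.fill_diagonal(w, 0.0)                       # no self-messages
    w /= w.sum(axis=1, keepdims=True) + 1e-9       # normalize per pedestrian
    messages = w @ H                               # aggregate neighbor states
    return 0.5 * H + 0.5 * messages                # simple fixed mixing step

# toy usage: 5 pedestrians with 32-D hidden states and 2-D positions
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 32))
pos = rng.normal(size=(5, 2))
print(refine_states(H, pos).shape)  # (5, 32)
```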