Open Access Proceedings Article

Attention is All you Need

TLDR
This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
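The core operation behind the proposed architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Below is a minimal NumPy sketch of that formula; the array shapes and variable names are our own illustration, not the paper's reference code:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax over keys
    return weights @ V                             # weighted sum of value vectors

# toy example: 3 query positions attending over 4 key/value positions
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```

The 1/√d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.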



Citations
Proceedings Article

Semantic Graph Convolutional Networks for 3D Human Pose Regression

TL;DR: The paper proposes Semantic Graph Convolutional Networks (SemGCN), a novel neural network architecture for regression tasks on graph-structured data that learns to capture semantic information, such as local and global node relationships, not explicitly represented in the graph.
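SemGCN extends graph convolutions with learned, per-edge "semantic" weights; the exact formulation is in the paper. For reference, here is a minimal sketch of the vanilla graph-convolution layer it builds on, applied to a toy chain-skeleton adjacency of our own choosing:

```python
import numpy as np

def gcn_layer(X, A, W):
    """One vanilla graph-convolution layer: H = ReLU(D^{-1/2} (A+I) D^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)  # ReLU

# toy skeleton: 4 joints in a chain, mapping 2-D features to 3-D features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(1).normal(size=(4, 2))
W = np.random.default_rng(2).normal(size=(2, 3))
print(gcn_layer(X, A, W).shape)  # (4, 3)
```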
Proceedings Article

Social IQa: Commonsense Reasoning about Social Interactions

TL;DR: Social IQa is a large-scale benchmark for commonsense reasoning about social situations, containing 38,000 multiple-choice questions that probe emotional and social intelligence in a variety of everyday situations.
Posted Content

Green AI

TL;DR: Making AI research more computationally efficient will decrease its carbon footprint and increase its inclusivity, since deep learning research should not require the deepest pockets.
Posted Content

Patient Knowledge Distillation for BERT Model Compression

TL;DR: The authors propose a Patient Knowledge Distillation approach that compresses an original large model (teacher) into an equally effective lightweight shallow network (student), enabling the student to patiently learn from and imitate the teacher through a multi-layer distillation process.
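The distillation objective combines a standard cross-entropy term, a softened teacher-student KL term, and a "patience" term that matches normalized hidden states at selected teacher layers. A minimal NumPy sketch under those assumptions; the weighting coefficients, temperature, and layer choices below are illustrative, not the paper's tuned values:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def patient_kd_loss(student_logits, teacher_logits, labels,
                    student_hidden, teacher_hidden,
                    alpha=0.5, beta=10.0, T=2.0):
    # cross-entropy with the true labels
    p_s = softmax(student_logits)
    ce = -np.log(p_s[np.arange(len(labels)), labels]).mean()
    # soft-target term: KL between softened teacher and student distributions
    p_t = softmax(teacher_logits, T)
    kd = (p_t * (np.log(p_t) - np.log(softmax(student_logits, T)))).sum(axis=-1).mean()
    # "patient" term: MSE between L2-normalized hidden states at selected layers
    def norm(h):
        return h / np.linalg.norm(h, axis=-1, keepdims=True)
    pt = np.mean([(norm(hs) - norm(ht)) ** 2
                  for hs, ht in zip(student_hidden, teacher_hidden)])
    return (1 - alpha) * ce + alpha * kd + beta * pt

# toy usage: batch of 8, 3 classes, two matched hidden layers of width 16
rng = np.random.default_rng(0)
s_logits, t_logits = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))
labels = rng.integers(0, 3, size=8)
s_hid = [rng.normal(size=(8, 16)) for _ in range(2)]
t_hid = [rng.normal(size=(8, 16)) for _ in range(2)]
print(patient_kd_loss(s_logits, t_logits, labels, s_hid, t_hid))
```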
Proceedings Article

SR-LSTM: State Refinement for LSTM Towards Pedestrian Trajectory Prediction

TL;DR: Zhang et al. propose a data-driven state refinement module for the LSTM network (SR-LSTM) that activates the utilization of the current intention of neighbors and jointly and iteratively refines the current states of all participants in the crowd through a message-passing mechanism.
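SR-LSTM's actual refinement module uses a motion gate and pedestrian-wise attention; as a rough illustration only, here is a single message-passing step in which each pedestrian's hidden state is updated from distance-weighted neighbor states. The Gaussian weighting and 50/50 state mixing are our simplifications, not the paper's design:

```python
import numpy as np

def refine_states(H, positions, sigma=1.0):
    """One message-passing refinement step over pedestrians' hidden states.

    Each pedestrian aggregates neighbors' states with weights that decay
    with squared distance, then mixes the message into its own state.
    """
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))             # pairwise proximity weights
    np.fill_diagonal(w, 0.0)                       # no self-messages
    w /= w.sum(axis=1, keepdims=True) + 1e-9       # normalize per pedestrian
    messages = w @ H                               # aggregate neighbor states
    return 0.5 * H + 0.5 * messages                # simple fixed mixing step

# toy usage: 5 pedestrians with 32-D hidden states and 2-D positions
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 32))
pos = rng.normal(size=(5, 2))
print(refine_states(H, pos).shape)  # (5, 32)
```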