Open Access Proceedings Article

Attention is All you Need

TLDR
This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single-model state of the art by 0.7 BLEU, achieving a BLEU score of 41.1.
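The abstract's central building block is attention applied in place of recurrence. Below is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, using NumPy; the toy shapes and the helper names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, q_len, k_len)
    weights = softmax(scores, axis=-1)                # attention weights per query
    return weights @ V                                # weighted sum of values

# Toy usage: batch of 2 sequences, 5 positions, width 8 (illustrative sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 5, 8))
K = rng.normal(size=(2, 5, 8))
V = rng.normal(size=(2, 5, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 5, 8)
```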



Citations
Posted Content

LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

TL;DR: Proposes new pretrained contextualized representations of words and entities based on the bidirectional transformer, together with an entity-aware self-attention mechanism that considers the types of tokens (words or entities) when computing attention scores.
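To make "entity-aware" attention concrete, here is a hedged sketch in which attention scores depend on whether the query and key tokens are words or entities, via a separate query projection per type pair. The function and matrix names are illustrative assumptions, not LUKE's actual API.

```python
import numpy as np

def entity_aware_scores(X, token_types, W_q, W_k, d_k):
    """token_types[i] is 0 for a word token, 1 for an entity token.
    W_q maps (query_type, key_type) -> (d_model, d_k) query projection."""
    K = X @ W_k                         # shared key projection
    n = X.shape[0]
    scores = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # Choose the query projection by the (query, key) token-type pair.
            q = X[i] @ W_q[(token_types[i], token_types[j])]
            scores[i, j] = q @ K[j] / np.sqrt(d_k)
    return scores

# Toy usage: 3 word tokens followed by 1 entity token (illustrative sizes).
rng = np.random.default_rng(0)
d_model, d_k = 8, 8
X = rng.normal(size=(4, d_model))
W_q = {(a, b): rng.normal(size=(d_model, d_k)) for a in (0, 1) for b in (0, 1)}
W_k = rng.normal(size=(d_model, d_k))
print(entity_aware_scores(X, [0, 0, 0, 1], W_q, W_k, d_k).shape)  # (4, 4)
```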
Proceedings Article DOI

Entangled Transformer for Image Captioning

TL;DR: A Transformer-based sequence modeling framework built only with attention layers and feedforward layers, which enables the Transformer to exploit semantic and visual information simultaneously and achieves state-of-the-art performance on the MSCOCO image captioning dataset.
Posted Content

DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations

TL;DR: Inspired by recent advances in deep metric learning (DML), this work carefully designs a self-supervised objective for learning universal sentence embeddings that does not require labelled training data, closing the performance gap between unsupervised and supervised pretraining for universal sentence encoders.
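The objective described here is contrastive: embeddings of text spans from the same document are pulled together while other in-batch spans act as negatives. The sketch below shows a generic InfoNCE-style loss under that assumption; it is not DeCLUTR's exact implementation, and the names and temperature value are illustrative.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """anchors, positives: (batch, dim) L2-normalized span embeddings,
    where positives[i] comes from the same document as anchors[i]."""
    sims = anchors @ positives.T / temperature            # (batch, batch) similarities
    # Cross-entropy with the matching pair (the diagonal) as the target class.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: positives are noisy copies of anchors, so the loss should be small.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 16)); a /= np.linalg.norm(a, axis=1, keepdims=True)
p = a + 0.05 * rng.normal(size=(4, 16)); p /= np.linalg.norm(p, axis=1, keepdims=True)
print(info_nce_loss(a, p))
```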
Posted Content

Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation.

TL;DR: This paper factorizes 2D self-attention into two 1D self-attentions, a novel building block that can be stacked to form axial-attention models for image classification and dense prediction, and achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
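The factorization itself is simple to sketch: instead of attending over all H*W positions at once, attention is applied along the height axis and then along the width axis. The code below shows only that factorization, assuming plain self-attention per axis; the position-sensitive terms used by Axial-DeepLab are omitted, and the function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_1d(X):
    """Self-attention over the second-to-last axis of X, shape (..., length, dim)."""
    d = X.shape[-1]
    scores = X @ np.swapaxes(X, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ X

def axial_attention(feature_map):
    """feature_map: (H, W, C). Attend along columns (height), then rows (width)."""
    x = np.transpose(feature_map, (1, 0, 2))   # (W, H, C): each column is a length-H sequence
    x = attention_1d(x)
    x = np.transpose(x, (1, 0, 2))             # back to (H, W, C)
    return attention_1d(x)                     # each row is a length-W sequence

print(axial_attention(np.random.default_rng(0).normal(size=(4, 6, 8))).shape)  # (4, 6, 8)
```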