Open Access Proceedings Article

Attention is All you Need

TLDR
This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single-model state of the art by 0.7 BLEU, achieving a BLEU score of 41.1.
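The "attention mechanism" the abstract centers on is, in the paper, scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of that operation for illustration; the shapes, variable names, and toy self-attention usage are assumptions, not the authors' reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: arrays of shape (seq_len, d_k); V: (seq_len, d_v).
    mask: optional boolean array (seq_len, seq_len); True marks positions to hide.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # similarity of each query to each key
    if mask is not None:
        scores = np.where(mask, -1e9, scores)    # block masked positions before softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                           # weighted sum of value vectors

# Toy usage: 4 tokens, width 8; self-attention sets Q = K = V = x
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)                                 # (4, 8)
```

Multi-head attention in the paper runs several such attention operations in parallel over learned linear projections of Q, K, and V and concatenates the results.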


Citations
Journal Article (DOI)

Competition-level code generation with AlphaCode

Authors: Yujia Li*, David Choi*, Junyoung Chung*, Nate Kushman*, Julian Schrittwieser*, Rémi Leblond*, Tom Eccles*, James Keeling*, Felix Gimeno*, Agustin Dal Lago*, Thomas Hubert*, Peter Choy*, Cyprien de Masson d'Autume*, Igor Babuschkin, Xinyun Chen
Posted Content

Fine-tune BERT for Extractive Summarization.

TL;DR: BERTSUM, a simple variant of BERT for extractive summarization, is described; it achieves state of the art on the CNN/DailyMail dataset, outperforming the previous best-performing system by 1.65 ROUGE-L.
Proceedings Article (DOI)

GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training

TL;DR: Graph Contrastive Coding (GCC) is a self-supervised graph neural network pre-training framework designed to capture universal network topological properties across multiple networks, leveraging contrastive learning so that graph neural networks learn intrinsic, transferable structural representations.
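As a rough illustration of the contrastive-learning step this summary refers to, the sketch below computes an InfoNCE-style loss over paired instance embeddings; the variable names and the use of random vectors in place of GNN-encoded subgraphs are illustrative assumptions, not GCC's actual code.

```python
import numpy as np

def info_nce_loss(queries, keys, temperature=0.07):
    """InfoNCE: each query's positive key sits at the same row index;
    all other keys in the batch act as negatives."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                      # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                  # positives on the diagonal

# Stand-ins for embeddings of two augmented views of the same graph instances
rng = np.random.default_rng(1)
z1 = rng.normal(size=(16, 64))
z2 = z1 + 0.1 * rng.normal(size=(16, 64))               # correlated "second view"
print(info_nce_loss(z1, z2))
```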
Journal Article (DOI)

DASNet: Dual Attentive Fully Convolutional Siamese Networks for Change Detection in High-Resolution Satellite Images

TL;DR: A weighted double-margin contrastive loss is proposed to address the sample imbalance problem in change detection, i.e., unchanged samples are far more abundant than changed samples, which is one of the main causes of pseudo-changes.
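For illustration only, here is a generic NumPy sketch of a class-weighted, double-margin contrastive loss in the spirit of what this summary describes; the margin values, the inverse-frequency weighting, and the per-pixel distance inputs are assumptions, not DASNet's exact formulation.

```python
import numpy as np

def weighted_double_margin_contrastive_loss(dist, changed,
                                            m_changed=2.0, m_unchanged=0.3):
    """dist: per-pixel feature distances; changed: 1 for changed, 0 for unchanged.
    Changed pixels are pushed beyond m_changed, unchanged pixels pulled under
    m_unchanged; per-class weights counteract the dominance of unchanged pixels."""
    changed = changed.astype(float)
    n_changed = changed.sum()
    n_unchanged = changed.size - n_changed
    # Illustrative choice: inverse class-frequency weights
    w_changed = changed.size / (2.0 * max(n_changed, 1.0))
    w_unchanged = changed.size / (2.0 * max(n_unchanged, 1.0))
    loss_changed = w_changed * changed * np.maximum(m_changed - dist, 0.0) ** 2
    loss_unchanged = w_unchanged * (1.0 - changed) * np.maximum(dist - m_unchanged, 0.0) ** 2
    return (loss_changed + loss_unchanged).mean()

rng = np.random.default_rng(2)
dist = rng.uniform(0.0, 3.0, size=(64, 64))               # toy distance map
labels = (rng.uniform(size=(64, 64)) < 0.05).astype(int)  # ~5% changed pixels
print(weighted_double_margin_contrastive_loss(dist, labels))
```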
Proceedings Article (DOI)

Modeling Point Clouds With Self-Attention and Gumbel Subset Sampling

TL;DR: This work develops Point Attention Transformers (PATs), using a parameter-efficient Group Shuffle Attention (GSA) to replace the costly Multi-Head Attention, and proposes an end-to-end learnable and task-agnostic sampling operation, named Gumbel Subset Sampling (GSS), to select a representative subset of input points.
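As a hedged illustration of the differentiable sampling idea behind an operation like GSS, the sketch below draws soft subset elements with the Gumbel-Softmax trick; treating the subset as k independent Gumbel-Softmax draws over per-point scores is a simplification for exposition, not the paper's exact formulation.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=0.5, rng=None):
    """One soft one-hot sample over points: argmax(logits + Gumbel noise)
    relaxed into a differentiable softmax at the given temperature."""
    if rng is None:
        rng = np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + gumbel) / temperature
    y = np.exp(y - y.max())
    return y / y.sum()

def soft_subset(points, scores, k, temperature=0.5, seed=0):
    """Select a soft subset of k points as convex combinations of the input
    points, weighted by Gumbel-Softmax samples over per-point scores."""
    rng = np.random.default_rng(seed)
    return np.stack([gumbel_softmax_sample(scores, temperature, rng) @ points
                     for _ in range(k)])

rng = np.random.default_rng(3)
pts = rng.normal(size=(1024, 3))              # a toy point cloud
scores = rng.normal(size=1024)                # per-point selection logits
print(soft_subset(pts, scores, k=128).shape)  # (128, 3)
```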