Open Access Proceedings Article

Attention is All you Need

TLDR
This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
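As context for the abstract's claim that the architecture relies solely on attention, the following is a minimal, illustrative sketch of the scaled dot-product attention operation at its core, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The function name and toy shapes below are our own; multi-head projections, masking, and the encoder-decoder stack of the full model are omitted.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                   # weighted sum of values

# Toy usage: 3 queries attending over 5 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))
print(scaled_dot_product_attention(Q, K, V).shape)       # (3, 16)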



Citations
Journal Article

Contrastive Representation Learning: A Framework and Review

TL;DR: A general Contrastive Representation Learning framework is proposed that simplifies and unifies many different contrastive learning methods, and a taxonomy of each of the components is provided in order to summarise contrastive learning and distinguish it from other forms of machine learning.
Journal Article

Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism.

TL;DR: A new graph neural network architecture called Attentive FP is proposed for molecular representation; it uses a graph attention mechanism to learn from relevant drug discovery datasets, achieves state-of-the-art predictive performance on a variety of datasets, and what it learns is interpretable.
Posted Content

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

TL;DR: This work proposes the first model for abstractive summarization of single, longer-form documents (e.g., research papers); it consists of a new hierarchical encoder that models the discourse structure of a document and an attentive discourse-aware decoder to generate the summary.
Proceedings Article

Neural Text Generation With Unlikelihood Training

TL;DR: It is shown that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences containing repeats and frequent words, unlike those from the human training distribution; the proposed unlikelihood training thus provides a strong alternative to existing techniques.
Posted Content

End-to-End Video Instance Segmentation with Transformers

TL;DR: A new video instance segmentation framework built upon Transformers, termed VisTR, is proposed; it views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem, and it achieves the highest speed among all existing VIS models and the best result among methods using a single model on the YouTube-VIS dataset.