Open Access Proceedings Article

Attention is All you Need

TLDR
This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
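As context for the abstract's claim that the architecture relies solely on attention, the following is a minimal, illustrative sketch of the scaled dot-product attention operation at its core, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The function name and toy shapes below are our own; multi-head projections, masking, and the encoder-decoder stack of the full model are omitted.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                   # weighted sum of values

# Toy usage: 3 queries attending over 5 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))
print(scaled_dot_product_attention(Q, K, V).shape)       # (3, 16)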



Citations
Journal Article

Contrastive Representation Learning: A Framework and Review

TL;DR: A general Contrastive Representation Learning framework is proposed that simplifies and unifies many different contrastive learning methods, and a taxonomy of each of the components is provided in order to summarise contrastive learning and distinguish it from other forms of machine learning.
Journal Article

Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism.

TL;DR: A new graph neural network architecture called Attentive FP is proposed for molecular representation; it uses a graph attention mechanism to learn from relevant drug discovery datasets, achieves state-of-the-art predictive performance on a variety of datasets, and what it learns is interpretable.
Posted Content

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

TL;DR: This work proposes the first model for abstractive summarization of single, longer-form documents (e.g., research papers); it consists of a new hierarchical encoder that models the discourse structure of a document and an attentive discourse-aware decoder to generate the summary.
Proceedings Article

Neural Text Generation With Unlikelihood Training

TL;DR: It is shown that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences containing repeats and frequent words, unlike those from the human training distribution; the proposed unlikelihood training thus provides a strong alternative to existing techniques.
Posted Content

End-to-End Video Instance Segmentation with Transformers

TL;DR: A new video instance segmentation framework built upon Transformers, termed VisTR, is proposed; it views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem, and it achieves the highest speed among all existing VIS models and the best result among methods using a single model on the YouTube-VIS dataset.