Open Access · Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Advances in Neural Information Processing Systems, Vol. 30, pp. 5998–6008
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

Abstract:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
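The core operation of the architecture the abstract describes is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch of that formula (the helper name and toy shapes are illustrative, not from the paper's code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # convex combination of value rows

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries of width d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 4))   # 3 values
out = scaled_dot_product_attention(Q, K, V)
```

Because the softmax weights are non-negative and sum to one, each output row is a convex combination of the value rows, which is what makes the operation cheap to parallelize compared with a recurrent step.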
Citations
Journal Article · DOI
Contrastive Representation Learning: A Framework and Review
TL;DR: A general Contrastive Representation Learning framework is proposed that simplifies and unifies many different contrastive learning methods, and a taxonomy of its components is provided in order to summarise contrastive representation learning and distinguish it from other forms of machine learning.
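Most of the methods such a framework unifies train with some variant of the InfoNCE objective: each anchor embedding is pulled toward its own positive and pushed away from the other samples in the batch. A hedged NumPy sketch (function name and temperature value are illustrative assumptions, not from the review):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE: classify each anchor's own positive against all others in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)     # cosine space
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (n, n) pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)     # stabilise the log-softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # correct pairs sit on the diagonal

x = np.eye(4)
aligned = info_nce(x, x)                        # each anchor matches its own positive
shuffled = info_nce(x, np.roll(x, 1, axis=0))   # mismatched anchor/positive pairs
```

A well-aligned pairing yields a much lower loss than a mismatched one, which is the signal contrastive methods train on.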
Journal Article · DOI
Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism.
Zhaoping Xiong, Dingyan Wang, Xiaohong Liu, Feisheng Zhong, Xiaozhe Wan, Xutong Li, Zhaojun Li, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, Mingyue Zheng
TL;DR: A new graph neural network architecture for molecular representation, called Attentive FP, uses a graph attention mechanism to learn from relevant drug discovery datasets; it achieves state-of-the-art predictive performance on a variety of datasets, and what it learns is interpretable.
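The graph attention mechanism such architectures build on (GAT-style) scores each edge with a small learned attention function and aggregates neighbour features with the resulting softmax weights. A minimal single-head sketch, assuming GAT's additive attention with a LeakyReLU; all names, shapes, and the toy graph are illustrative, not the Attentive FP implementation:

```python
import numpy as np

def gat_layer(H, A, W, a_src, a_dst, slope=0.2):
    """One GAT-style attention head.
    H: (n, f) node features; A: (n, n) adjacency with self-loops;
    W: (f, f2) projection; a_src, a_dst: (f2,) halves of the attention vector."""
    Z = H @ W                                        # project node features
    s, t = Z @ a_src, Z @ a_dst                      # per-node attention terms
    logits = s[:, None] + t[None, :]                 # additive edge scores
    logits = np.where(logits > 0, logits, slope * logits)  # LeakyReLU
    logits = np.where(A > 0, logits, -1e9)           # mask non-edges
    logits -= logits.max(axis=1, keepdims=True)
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)        # softmax over each node's neighbours
    return alpha @ Z                                 # weighted neighbour aggregation

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 3))
A = np.eye(4) + np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)  # path graph + self-loops
W = rng.normal(size=(3, 5))
out = gat_layer(H, A, W, rng.normal(size=5), rng.normal(size=5))
```

Masking non-edges with a large negative logit before the softmax is what restricts attention to a node's chemical neighbourhood.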
Posted Content
A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, Nazli Goharian
TL;DR: This work proposes the first model for abstractive summarization of single, longer-form documents (e.g., research papers), consisting of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary.
Proceedings Article
Neural Text Generation With Unlikelihood Training
TL;DR: It is shown that the likelihood objective itself is at fault, producing models that assign too much probability to sequences containing repeats and frequent words, unlike sequences from the human training distribution; the proposed unlikelihood training objective thus provides a strong alternative to existing techniques.
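The paper's token-level remedy adds a penalty term that pushes down the probability of negative candidates (e.g. tokens already generated) alongside the usual maximum-likelihood term. A hedged sketch of one decoding step's loss over a toy three-token vocabulary (function name and numbers are illustrative):

```python
import numpy as np

def unlikelihood_step_loss(log_probs, target, negatives, alpha=1.0):
    """MLE term -log p(target) plus an unlikelihood term
    -log(1 - p(c)) for each negative candidate c."""
    probs = np.exp(log_probs)
    mle = -log_probs[target]
    ul = -sum(np.log(1.0 - probs[c] + 1e-9) for c in negatives)
    return mle + alpha * ul

log_p = np.log(np.array([0.7, 0.2, 0.1]))   # toy next-token distribution
low = unlikelihood_step_loss(log_p, target=0, negatives=[2])   # rare negative: small penalty
high = unlikelihood_step_loss(log_p, target=0, negatives=[1])  # probable negative: larger penalty
```

The penalty grows as the model puts more mass on a negative candidate, which directly counteracts the over-probable repeats the TL;DR describes.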
Posted Content
End-to-End Video Instance Segmentation with Transformers
TL;DR: A new video instance segmentation framework built upon Transformers, termed VisTR, views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem; it achieves the highest speed among all existing VIS models and the best result among single-model methods on the YouTube-VIS dataset.