Open Access Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Vol. 30, pp. 5998–6008
TLDR
This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
Abstract:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
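The attention mechanism the abstract refers to is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch of that single operation (the toy shapes and inputs below are illustrative, not values from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights                     # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 queries,  d_k = 8
K = rng.normal(size=(6, 8))    # 6 keys,     d_k = 8
V = rng.normal(size=(6, 16))   # 6 values,   d_v = 16
out, w = scaled_dot_product_attention(Q, K, V)  # out: (4, 16), w rows sum to 1
```

The full model stacks many of these in parallel heads with learned projections; this sketch shows only the core computation.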
Citations
Journal Article
Competition-level code generation with AlphaCode
Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James L. Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, Oriol Vinyals
Posted Content
Fine-tune BERT for Extractive Summarization.
TL;DR: BERTSUM, a simple variant of BERT for extractive summarization, is described; it is the state of the art on the CNN/DailyMail dataset, outperforming the previous best-performing system by 1.65 ROUGE-L.
Proceedings Article
GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training
TL;DR: Graph Contrastive Coding (GCC), a self-supervised graph neural network pre-training framework, is designed to capture universal network topological properties across multiple networks, leveraging contrastive learning to empower graph neural networks to learn intrinsic and transferable structural representations.
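The contrastive objective behind frameworks like GCC is typically an InfoNCE-style loss: a query embedding should score high against one positive key and low against many negatives. A generic sketch of that loss (the vectors, temperature value, and function name are illustrative, not taken from the GCC paper):

```python
import numpy as np

def info_nce_loss(q, keys, pos_idx, temperature=0.07):
    """InfoNCE: -log( exp(q.k+ / t) / sum_i exp(q.k_i / t) ) on L2-normalized embeddings."""
    q = q / np.linalg.norm(q)
    keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = keys @ q / temperature     # similarity of q to every key
    logits -= logits.max()              # stability shift before exponentiating
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[pos_idx])      # cross-entropy with the positive key as the label

q = np.array([1.0, 0.0])                               # query embedding
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # one positive, two negatives
loss_aligned = info_nce_loss(q, keys, pos_idx=0)        # positive agrees with q: small loss
loss_misaligned = info_nce_loss(q, keys, pos_idx=2)     # "positive" opposes q: large loss
```

In GCC the query and keys would be embeddings of sampled subgraph instances; here plain vectors stand in for them.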
Journal Article
DASNet: Dual Attentive Fully Convolutional Siamese Networks for Change Detection in High-Resolution Satellite Images
TL;DR: A weighted double-margin contrastive loss is proposed to address the serious sample-imbalance problem in change detection (unchanged samples are much more abundant than changed samples), which is one of the main reasons for pseudo-changes.
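A generic sketch of what such a weighted double-margin contrastive loss can look like (the margins, weights, and function name below are placeholder assumptions, not the DASNet paper's values): unchanged pixel pairs are penalized only when their feature distance exceeds a small margin, changed pairs only when it falls below a larger one, and the unchanged term is down-weighted to counter the class imbalance.

```python
import numpy as np

def weighted_double_margin_loss(dist, changed, m_changed=2.0, m_unchanged=0.5,
                                w_changed=1.0, w_unchanged=0.1):
    """Two margins, two weights: unchanged pairs (changed == 0) pay when
    dist > m_unchanged; changed pairs (changed == 1) pay when dist < m_changed.
    w_unchanged < w_changed down-weights the over-abundant unchanged class."""
    dist = np.asarray(dist, dtype=float)
    changed = np.asarray(changed, dtype=float)
    loss_u = w_unchanged * (1 - changed) * np.maximum(dist - m_unchanged, 0.0) ** 2
    loss_c = w_changed * changed * np.maximum(m_changed - dist, 0.0) ** 2
    return (loss_u + loss_c).mean()

# Pair 1 is unchanged with distance 0.0, pair 2 is changed with distance 3.0:
# both respect their margins, so the loss vanishes.
no_violation = weighted_double_margin_loss([0.0, 3.0], [0, 1])
# Swap the distances and both pairs violate their margins.
violation = weighted_double_margin_loss([2.0, 0.0], [0, 1])
```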
Proceedings Article
Modeling Point Clouds With Self-Attention and Gumbel Subset Sampling
TL;DR: This work develops Point Attention Transformers (PATs), using a parameter-efficient Group Shuffle Attention (GSA) to replace the costly Multi-Head Attention, and proposes an end-to-end learnable and task-agnostic sampling operation, named Gumbel Subset Sampling (GSS), to select a representative subset of input points.