Open Access Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Advances in Neural Information Processing Systems, Vol. 30, pp. 5998-6008
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single-model state-of-the-art by 0.7 BLEU, achieving a BLEU score of 41.1.
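The attention the abstract refers to is the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. Below is a minimal single-head NumPy sketch; the absence of masking and of the multi-head split are simplifications of the mechanism the paper actually uses.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Single-head attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # query/key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # weighted sum of values

# toy usage: 5 positions, d_k = d_v = 8; self-attention sets q = k = v = x
x = np.random.randn(5, 8)
out = scaled_dot_product_attention(x, x, x)         # shape (5, 8)
```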
Citations
Proceedings Article
Exploring Self-Attention for Image Recognition
TL;DR: This work considers two forms of self-attention, pairwise and patchwise: pairwise self-attention generalizes standard dot-product attention and is fundamentally a set operator, while patchwise self-attention is strictly more powerful than convolution.
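As a rough illustration of why pairwise self-attention is a set operator, here is a toy NumPy sketch of vector attention over an unordered feature set. The subtraction relation follows the paper's general form, but the mapping that the paper applies to the relation (an MLP) is omitted, and the projection matrices are illustrative assumptions.

```python
import numpy as np

def pairwise_vector_attention(x, w_phi, w_psi, w_beta):
    # x: (n, d) is an unordered set of features; permuting its rows
    # permutes the output identically, so this is a set operator.
    q, k, v = x @ w_phi, x @ w_psi, x @ w_beta   # (n, d) each
    rel = q[:, None, :] - k[None, :, :]          # (n, n, d) pairwise relation
    rel -= rel.max(axis=1, keepdims=True)        # numerical stability
    w = np.exp(rel)
    w /= w.sum(axis=1, keepdims=True)            # per-channel softmax over j
    return (w * v[None, :, :]).sum(axis=1)       # y_i = sum_j w_ij * v_j
```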
Proceedings Article
Attention Augmented Convolutional Networks
TL;DR: This work concatenates convolutional feature maps with a set of feature maps produced via a novel relative self-attention mechanism, which attends jointly to both features and spatial locations while preserving translation equivariance.
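A hedged PyTorch sketch of the channel-concatenation idea: it uses plain multi-head self-attention over flattened spatial positions rather than the paper's two-dimensional relative self-attention, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class AttnAugmentedConv(nn.Module):
    """Concatenate conv feature maps with self-attention feature maps."""
    def __init__(self, c_in, c_conv, c_attn, heads=4):
        super().__init__()          # c_attn must be divisible by heads
        self.conv = nn.Conv2d(c_in, c_conv, kernel_size=3, padding=1)
        self.proj = nn.Conv2d(c_in, c_attn, kernel_size=1)
        self.attn = nn.MultiheadAttention(c_attn, heads, batch_first=True)

    def forward(self, x):
        b, _, h, w = x.shape
        conv_out = self.conv(x)
        # flatten spatial positions into a sequence for self-attention
        seq = self.proj(x).flatten(2).transpose(1, 2)   # (B, H*W, c_attn)
        attn_out, _ = self.attn(seq, seq, seq)
        attn_out = attn_out.transpose(1, 2).reshape(b, -1, h, w)
        return torch.cat([conv_out, attn_out], dim=1)   # channel concat

# toy usage: output has c_conv + c_attn = 48 channels
y = AttnAugmentedConv(16, 32, 16)(torch.randn(2, 16, 8, 8))
```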
Proceedings Article
Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets.
TL;DR: The Biomedical Language Understanding Evaluation (BLUE) benchmark is introduced to facilitate research on pre-trained language representations in the biomedical domain; the BERT model pre-trained on PubMed abstracts and MIMIC-III clinical notes achieves the best results.
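In practice such an evaluation boils down to fine-tuning a pre-trained encoder per BLUE task. A rough Hugging Face sketch follows; the checkpoint name is a generic placeholder (one would substitute a BERT pre-trained on PubMed and MIMIC-III), and the two-label head and example sentence are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# placeholder checkpoint; swap in a biomedical BERT pre-trained on
# PubMed abstracts and MIMIC-III notes for the paper's best setting
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

inputs = tokenizer("aspirin inhibits platelet aggregation",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits   # (1, 2) scores before fine-tuning
```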
Posted Content
Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks
Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, Zheng Zhang
TL;DR: DGL distills the computational patterns of GNNs into a few generalized sparse tensor operations suitable for extensive parallelization and allows users to easily port and leverage the existing components across multiple deep learning frameworks.
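In the simplest case, the generalized sparse tensor operations the TL;DR refers to reduce to sparse-dense matrix multiplication (SpMM). A NumPy/SciPy toy sketch of one GCN-style aggregation step follows; the graph, feature sizes, and layer are illustrative assumptions, not DGL's API.

```python
import numpy as np
import scipy.sparse as sp

# toy graph: 4 nodes, directed edges src -> dst forming a cycle
src = np.array([0, 1, 2, 3])
dst = np.array([1, 2, 3, 0])
n = 4
adj = sp.coo_matrix((np.ones(len(src)), (dst, src)), shape=(n, n)).tocsr()

h = np.random.randn(n, 8)   # node features
w = np.random.randn(8, 8)   # layer weight

# one GCN-style layer as a single SpMM: aggregate neighbor messages,
# then transform and apply a ReLU nonlinearity
h_next = np.maximum(adj @ h @ w, 0.0)   # shape (4, 8)
```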
Posted Content
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis E.H. Tay, Jiashi Feng, Shuicheng Yan
TL;DR: T2T-ViT proposes a Tokens-to-Token transformation that progressively transforms the image to tokens by recursively aggregating neighboring tokens into one token, so that local structure represented by surrounding tokens can be modeled and the token length can be reduced.
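A short PyTorch sketch of one such aggregation step: tokens are laid back out as an image and each k x k neighborhood of tokens is merged into a single longer token, shrinking the sequence. The function name, kernel size, and shapes are illustrative; in the paper a linear projection then maps the widened tokens back down before the next transformer layer.

```python
import torch
import torch.nn.functional as F

def tokens_to_token(tokens, h, w, k=3, stride=2, pad=1):
    # tokens: (B, h*w, C) -> regroup as an image, then merge each
    # k x k neighborhood of tokens into one longer token
    b, n, c = tokens.shape
    img = tokens.transpose(1, 2).reshape(b, c, h, w)
    patches = F.unfold(img, kernel_size=k, stride=stride, padding=pad)
    return patches.transpose(1, 2)   # (B, L, C*k*k), with L < n

# toy usage: 64 tokens of width 16 -> 16 tokens of width 144
x = torch.randn(2, 64, 16)
y = tokens_to_token(x, h=8, w=8)     # y.shape == (2, 16, 144)
```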