Open Access Proceedings Article

Attention is All you Need

TL;DR
This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
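The attention mechanism at the core of the proposed architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch for illustration only; the function name, shapes, and toy data are our own and not taken from the paper's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k) queries; K: (n_k, d_k) keys; V: (n_k, d_v) values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n_q, n_k) similarity logits
    # Row-wise softmax, shifted for numerical stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # (n_q, d_v) attention-weighted values

# Toy usage (hypothetical sizes): 3 queries attending over 4 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 16)
```

Because every query attends to every key in a single matrix product, the computation parallelizes across sequence positions, which is the source of the training-time advantage over recurrent models claimed in the abstract.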



Citations
Journal Article (DOI)

Self-attention for raw optical Satellite Time Series Classification

TL;DR: This work compares recent deep learning models for crop type classification on raw and preprocessed Sentinel-2 data and qualitatively shows how self-attention scores focus selectively on a few classification-relevant observations.
Journal Article (DOI)

Artificial Neural Networks for Neuroscientists: A Primer.

TL;DR: This pedagogical Primer introduces artificial neural networks and demonstrates how they have been fruitfully deployed to study neuroscientific questions, and details how to customize the analysis, structure, and learning of ANNs to better address a wide range of challenges in brain research.
Proceedings Article (DOI)

DN-DETR: Accelerate DETR Training by Introducing Query DeNoising

TL;DR: A novel denoising training method that speeds up DETR (DEtection TRansformer) training and offers a deeper understanding of the slow convergence of DETR-like methods with a ResNet-50 backbone.
Posted Content

Understanding Knowledge Distillation in Non-autoregressive Machine Translation

TL;DR: Knowledge distillation is found to reduce the complexity of datasets and help NAT models capture the variations in the output data; a strong correlation is observed between the capacity of an NAT model and the optimal complexity of the distilled data for the best translation quality.
Posted Content (DOI)

A data-driven drug repositioning framework discovered a potential therapeutic agent targeting COVID-19

TL;DR: In silico screening followed by wet-lab validation indicated that a poly-ADP-ribose polymerase 1 (PARP1) inhibitor, CVL218, currently in a Phase I clinical trial, may be repurposed to treat COVID-19; several possible mechanisms are proposed to explain the antiviral activity of PARP1 inhibitors against SARS-CoV-2.