Open Access Proceedings Article

Attention is All you Need

TL;DR
This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single-model state of the art by 0.7 BLEU, achieving a BLEU score of 41.1.
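For context, the core operation the abstract refers to is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of that formula, not the authors' implementation; the function name, array shapes, and example sizes are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Sketch of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Shapes and names are illustrative, not taken from the paper's code.
    """
    d_k = Q.shape[-1]
    # Similarity of every query to every key, scaled by sqrt(d_k)
    # to keep the softmax in a well-conditioned regime.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V

# Example: 4 queries attending over 6 key/value pairs.
rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(4, 8)),
                                   rng.normal(size=(6, 8)),
                                   rng.normal(size=(6, 16)))
print(out.shape)  # (4, 16)
```

In the full architecture this operation is applied in parallel across multiple heads and stacked in both the encoder and decoder, which is what removes the need for recurrence.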



Citations
Proceedings Article

From Audio to Semantics: Approaches to End-to-End Spoken Language Understanding

TL;DR: This paper formulates audio-to-semantic understanding as a sequence-to-sequence problem, and proposes and compares various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner.
Journal Article

Deep Entity Matching with Pre-Trained Language Models

TL;DR: This paper proposes Ditto, a novel entity matching system based on pre-trained Transformer-based language models, which casts entity matching (EM) as a sequence-pair classification problem so that such models can be fine-tuned with a simple architecture.
Proceedings Article

Multi-hop Selector Network for Multi-turn Response Selection in Retrieval-based Chatbots.

TL;DR: This paper analyzes the side effect of using too many context utterances and proposes a multi-hop selector network (MSN) to alleviate the problem; results show that MSN outperforms several state-of-the-art methods on three public multi-turn dialogue datasets.
Proceedings Article

MaskGIT: Masked Generative Image Transformer

TL;DR: The proposed MaskGIT is a novel image synthesis paradigm using a bidirectional transformer decoder that significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, and accelerates autoregressive decoding by up to 48x.
Posted Content

Support-set bottlenecks for video-text representation learning

TL;DR: This paper proposes a novel method that leverages a generative model to naturally push related samples together, resulting in representations that explicitly encode semantics shared between samples, unlike noise contrastive learning.