Open Access Proceedings Article

Attention is All you Need

TL;DR
This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single-model state of the art by 0.7 BLEU, achieving a BLEU score of 41.1.
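For context, the core operation the abstract refers to is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of that formula, not the authors' implementation; the function name, array shapes, and example sizes are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Sketch of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Shapes and names are illustrative, not taken from the paper's code.
    """
    d_k = Q.shape[-1]
    # Similarity of every query to every key, scaled by sqrt(d_k)
    # to keep the softmax in a well-conditioned regime.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V

# Example: 4 queries attending over 6 key/value pairs.
rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(4, 8)),
                                   rng.normal(size=(6, 8)),
                                   rng.normal(size=(6, 16)))
print(out.shape)  # (4, 16)
```

In the full architecture this operation is applied in parallel across multiple heads and stacked in both the encoder and decoder, which is what removes the need for recurrence.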



Citations
Proceedings Article

From Audio to Semantics: Approaches to End-to-End Spoken Language Understanding

TL;DR: This paper formulates audio-to-semantic understanding as a sequence-to-sequence problem, and proposes and compares various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner.
Journal Article

Deep Entity Matching with Pre-Trained Language Models

TL;DR: This paper proposes Ditto, a novel entity matching system based on pre-trained Transformer-based language models, which casts entity matching (EM) as a sequence-pair classification problem so that such models can be fine-tuned with a simple architecture.
Proceedings Article

Multi-hop Selector Network for Multi-turn Response Selection in Retrieval-based Chatbots.

TL;DR: This paper analyzes the side effect of using too many context utterances and proposes a multi-hop selector network (MSN) to alleviate the problem; results show that MSN outperforms several state-of-the-art methods on three public multi-turn dialogue datasets.
Proceedings Article

MaskGIT: Masked Generative Image Transformer

TL;DR: The proposed MaskGIT is a novel image synthesis paradigm using a bidirectional transformer decoder that significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, and accelerates autoregressive decoding by up to 48x.
Posted Content

Support-set bottlenecks for video-text representation learning

TL;DR: This paper proposes a novel method that leverages a generative model to naturally push related samples together, resulting in representations that explicitly encode semantics shared between samples, unlike noise contrastive learning.