Open Access Proceedings Article

Long Range Arena: A Benchmark for Efficient Transformers

TL;DR
The Long Range Arena benchmark, as discussed by the authors, is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural and synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning.
Abstract
Transformers do not scale very well to long sequence lengths, largely because of the quadratic complexity of self-attention. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To date, however, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, Long Range Arena, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural and synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. Long Range Arena paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle.
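
To make the quadratic-cost motivation concrete, below is a minimal NumPy sketch (not taken from the benchmark's own codebase) of vanilla scaled dot-product self-attention: the n-by-n score matrix is what makes memory and compute grow quadratically with sequence length, which is exactly the regime the 1K to 16K token tasks stress. All names and sizes here are illustrative.

```python
# Minimal NumPy sketch (not the authors' code): vanilla scaled dot-product
# self-attention, showing where the quadratic cost comes from.
import numpy as np

def vanilla_self_attention(x, wq, wk, wv):
    """x: (n, d) token embeddings; wq, wk, wv: (d, d) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])            # (n, n): quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # (n, d)

rng = np.random.default_rng(0)
n, d = 2048, 64                                       # LRA sequences range from 1K to 16K tokens
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
print(vanilla_self_attention(x, wq, wk, wv).shape)    # (2048, 64)
# At n = 2048 the score matrix already has ~4.2M entries; at n = 16K it has ~268M.
```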


Citations
Posted Content

Rethinking Attention with Performers

TL;DR: Introduces Performers, Transformer architectures that can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, using only linear space and time complexity and without relying on any priors such as sparsity or low-rankness.
Proceedings Article

Rethinking Attention with Performers

TL;DR: Performers, as presented in this paper, use Fast Attention Via positive Orthogonal Random features (FAVOR+) to approximate softmax attention kernels; they can estimate regular (softmax) full-rank attention Transformers with provable accuracy while using only linear (as opposed to quadratic) space and time complexity.
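
As an illustration of the idea summarized above, the following is a minimal NumPy sketch of linear attention with positive random features in the spirit of FAVOR+. It omits the orthogonalization and feature redrawing described in the paper, and the function names and sizes are illustrative rather than the authors' implementation.

```python
# Minimal NumPy sketch of Performer-style linear attention with positive random
# features (FAVOR+ flavour). Orthogonalization and redrawing of the random
# projections from the paper are omitted; sizes are illustrative.
import numpy as np

def positive_random_features(x, omega):
    """phi(x) = exp(omega^T x - ||x||^2 / 2) / sqrt(m), an unbiased estimator of exp(q.k)."""
    m = omega.shape[1]
    return np.exp(x @ omega - 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

def performer_attention(q, k, v, omega):
    q = q / q.shape[-1] ** 0.25                # fold the usual 1/sqrt(d) scaling
    k = k / k.shape[-1] ** 0.25                # into the queries and keys
    qp, kp = positive_random_features(q, omega), positive_random_features(k, omega)
    kv = kp.T @ v                              # (m, d): the (n, n) matrix is never formed
    normalizer = qp @ kp.sum(axis=0)           # (n,)
    return (qp @ kv) / normalizer[:, None]     # cost is linear in sequence length

rng = np.random.default_rng(0)
n, d, m = 4096, 64, 256
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
omega = rng.standard_normal((d, m))            # random projections, omega ~ N(0, I)
print(performer_attention(q, k, v, omega).shape)   # (4096, 64)
```
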
Posted Content

FNet: Mixing Tokens with Fourier Transforms

TL;DR: This paper proposes replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform for text classification.
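
As a rough illustration of the mixing operation described above (not the authors' implementation), the sketch below swaps the attention sublayer's output for the real part of a 2D Fourier transform over the sequence and hidden dimensions; the surrounding residual, layer-norm, and feed-forward components are left out.

```python
# Minimal NumPy sketch of FNet-style token mixing: the attention sublayer is
# replaced by the real part of a 2D Fourier transform. Residual connections,
# layer norm, and the feed-forward sublayer are omitted.
import numpy as np

def fourier_mixing(x):
    """x: (n, d) token representations; no learned parameters, O(n log n) mixing."""
    return np.fft.fft2(x).real    # FFT over the sequence and hidden dimensions

rng = np.random.default_rng(0)
x = rng.standard_normal((4096, 64))
print(fourier_mixing(x).shape)    # (4096, 64), same shape as the attention output it replaces
```
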
Proceedings Article

Random Feature Attention

TL;DR: The authors propose RFA, a linear time and space attention mechanism that uses random feature methods to approximate the softmax function, and explore its application in Transformers, where it can be used as a drop-in replacement for conventional softmax attention.
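
For contrast with the positive features above, here is a minimal sketch of attention built on classical trigonometric random Fourier features, with l2-normalized queries and keys standing in for the softmax. Details such as the paper's variance/temperature handling and optional gating are omitted, and the names and sizes are illustrative.

```python
# Minimal NumPy sketch of random-feature attention with trigonometric random
# Fourier features; queries and keys are l2-normalized so the Gaussian kernel
# stands in for the softmax. Temperature handling and gating are omitted.
import numpy as np

def trig_features(x, w):
    """Random Fourier features: E[phi(x).phi(y)] = exp(-||x - y||^2 / 2) for w ~ N(0, I)."""
    proj = x @ w                                                   # (n, m)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(w.shape[1])

def random_feature_attention(q, k, v, w):
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)              # l2-normalize queries
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)              # and keys
    qp, kp = trig_features(q, w), trig_features(k, w)
    out = qp @ (kp.T @ v)                                          # linear in sequence length
    return out / (qp @ kp.sum(axis=0))[:, None]

rng = np.random.default_rng(0)
n, d, m = 4096, 64, 256
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
w = rng.standard_normal((d, m))
print(random_feature_attention(q, k, v, w).shape)                  # (4096, 64)
```
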
Posted Content

Perceiver: General Perception with Iterative Attention

TL;DR: The Perceiver, as presented in this paper, builds upon Transformers and hence makes few architectural assumptions about the relationships among its inputs, yet scales to hundreds of thousands of inputs, like ConvNets.
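
A minimal sketch of the mechanism that gives the Perceiver this scaling (not the authors' implementation): a small learned latent array cross-attends to the full input array, so the attention cost grows with the number of inputs rather than quadratically in it. Multi-head attention, iterative cross-attention, and positional encodings are omitted; all sizes are illustrative.

```python
# Minimal NumPy sketch of the Perceiver's core mechanism: a small latent array
# cross-attends to a large input array, so attention cost grows linearly with
# the number of inputs.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, inputs, wq, wk, wv):
    """latents: (L, d) with L << N; inputs: (N, c) raw input elements."""
    q = latents @ wq                          # (L, d)
    k, v = inputs @ wk, inputs @ wv           # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (L, N) rather than (N, N)
    return softmax(scores) @ v                # (L, d): the latent array is updated, not the input

rng = np.random.default_rng(0)
N, c, L, d = 20_000, 3, 256, 64               # e.g. 20K input elements, 256 latents
inputs = rng.standard_normal((N, c))
latents = rng.standard_normal((L, d))
wq = rng.standard_normal((d, d))
wk, wv = rng.standard_normal((c, d)), rng.standard_normal((c, d))
print(cross_attention(latents, inputs, wq, wk, wv).shape)   # (256, 64)
```
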
References
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT, as presented in this paper, pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can then be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Dissertation

Learning Multiple Layers of Features from Tiny Images

TL;DR: This dissertation describes how to train a multi-layer generative model of natural images using a dataset of millions of tiny colour images.
Proceedings Article

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

TL;DR: Introduces a Sentiment Treebank with fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences, which presents new challenges for sentiment compositionality, along with the Recursive Neural Tensor Network to address them.
Proceedings Article

Learning Word Vectors for Sentiment Analysis

TL;DR: This work presents a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term-document information as well as rich sentiment content, and finds that it outperforms several previously introduced methods for sentiment classification.