Open Access · Posted Content

MLP-Mixer: An all-MLP Architecture for Vision

TLDR
MLP-Mixer, as discussed by the authors, is an architecture based exclusively on multi-layer perceptrons (MLPs). It contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). It attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
Abstract
Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
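As a concrete illustration of the two layer types the abstract describes, here is a minimal NumPy sketch of a single Mixer block: a token-mixing MLP applied across patches, then a channel-mixing MLP applied to each patch independently, each preceded by LayerNorm and wrapped in a residual connection. The shapes, hidden widths, and parameter names below are illustrative choices, not the paper's reference code.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with a GELU nonlinearity, applied along the last axis."""
    h = x @ w1 + b1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # tanh GELU
    return h @ w2 + b2

def layer_norm(z, eps=1e-6):
    return (z - z.mean(-1, keepdims=True)) / np.sqrt(z.var(-1, keepdims=True) + eps)

def mixer_block(x, token_params, channel_params):
    """x: (num_patches, channels). Token mixing, then channel mixing, with residuals."""
    # Token mixing: transpose so the MLP acts across the patch dimension.
    x = x + mlp(layer_norm(x).T, *token_params).T   # "mixing" spatial information
    # Channel mixing: the MLP acts on each patch's feature vector independently.
    x = x + mlp(layer_norm(x), *channel_params)     # "mixing" per-location features
    return x

rng = np.random.default_rng(0)
S, C, Ds, Dc = 16, 8, 32, 16            # patches, channels, hidden widths (arbitrary)
token = (rng.normal(0, 0.02, (S, Ds)), np.zeros(Ds),
         rng.normal(0, 0.02, (Ds, S)), np.zeros(S))
channel = (rng.normal(0, 0.02, (C, Dc)), np.zeros(Dc),
           rng.normal(0, 0.02, (Dc, C)), np.zeros(C))
print(mixer_block(rng.normal(size=(S, C)), token, channel).shape)  # (16, 8)
```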


Citations
Posted Content

Attention Mechanisms in Computer Vision: A Survey.

TL;DR: This article presents a comprehensive review of attention mechanisms in computer vision and categorizes them by approach: channel attention, spatial attention, temporal attention, and branch attention.
Posted Content

Synthesizer: Rethinking Self-Attention in Transformer Models

TL;DR: This paper investigates the true importance and contribution of the dot-product self-attention mechanism to the performance of Transformer models, and proposes Synthesizer, a model that learns synthetic attention weights without token-token interactions.
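A rough sketch of the idea behind the dense variant, under the assumption that a per-token MLP predicts each token's row of attention logits directly, so no query-key dot products are computed; all names and shapes here are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def dense_synthesizer(x, w1, b1, w2, b2, wv):
    """Synthetic attention: each token alone predicts its attention row; no Q.K."""
    logits = np.maximum(x @ w1 + b1, 0) @ w2 + b2  # (seq, seq) from a per-token MLP
    return softmax(logits) @ (x @ wv)              # mix values with synthetic weights

rng = np.random.default_rng(0)
L, d, h = 6, 8, 16
y = dense_synthesizer(rng.normal(size=(L, d)),
                      rng.normal(0, 0.1, (d, h)), np.zeros(h),
                      rng.normal(0, 0.1, (h, L)), np.zeros(L),
                      rng.normal(0, 0.1, (d, d)))
print(y.shape)  # (6, 8)
```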
Posted Content

ResMLP: Feedforward networks for image classification with data-efficient training

TL;DR: ResMLP is an architecture built entirely upon multi-layer perceptrons for image classification; it achieves surprisingly good accuracy/complexity trade-offs on ImageNet using heavy data augmentation and optional distillation.
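For comparison with the Mixer block above, a rough sketch of a ResMLP-style block: cross-patch mixing is a single linear layer rather than an MLP, and a learned affine rescaling stands in for LayerNorm. Sharing one set of affine parameters and using ReLU are simplifications for brevity.

```python
import numpy as np

def resmlp_block(x, alpha, beta, w_patch, w1, b1, w2, b2):
    """x: (patches, channels). Affine norm, linear cross-patch mixing, per-patch MLP."""
    def affine(z):                    # learned element-wise scale/shift, no statistics
        return alpha * z + beta       # (real model: separate params per sublayer)
    x = x + w_patch @ affine(x)       # cross-patch communication: one linear layer
    h = np.maximum(affine(x) @ w1 + b1, 0)
    return x + h @ w2 + b2            # per-patch feed-forward over channels

rng = np.random.default_rng(0)
S, C, D = 16, 8, 32
print(resmlp_block(rng.normal(size=(S, C)),
                   np.ones(C), np.zeros(C),
                   rng.normal(0, 0.02, (S, S)),
                   rng.normal(0, 0.02, (C, D)), np.zeros(D),
                   rng.normal(0, 0.02, (D, C)), np.zeros(C)).shape)  # (16, 8)
```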
Posted Content

On the Opportunities and Risks of Foundation Models.

Rishi Bommasani, +113 more
16 Aug 2021
TL;DR: The authors provide a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications.
Posted Content

FNet: Mixing Tokens with Fourier Transforms

TL;DR: This article proposes FNet, which replaces the self-attention sublayers in a Transformer encoder with a standard, unparameterized Fourier Transform, and evaluates it on text classification tasks.
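The mixing operation is simple enough to sketch directly; assuming the standard 2D DFT formulation with only the real part kept, an FNet-style encoder layer in NumPy might look like:

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    return (z - z.mean(-1, keepdims=True)) / np.sqrt(z.var(-1, keepdims=True) + eps)

def fnet_layer(x, w1, b1, w2, b2):
    """One FNet-style encoder layer: Fourier mixing instead of self-attention."""
    # Mixing sublayer: 2D DFT over (sequence, hidden); keep the real part. No parameters.
    x = layer_norm(x + np.fft.fft2(x).real)
    # Standard position-wise feed-forward sublayer with a residual connection.
    h = np.maximum(x @ w1 + b1, 0)
    return layer_norm(x + h @ w2 + b2)

rng = np.random.default_rng(0)
L, d, dff = 6, 8, 32
y = fnet_layer(rng.normal(size=(L, d)),
               rng.normal(0, 0.1, (d, dff)), np.zeros(dff),
               rng.normal(0, 0.1, (dff, d)), np.zeros(d))
print(y.shape)  # (6, 8)
```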
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously; it won 1st place in the ILSVRC 2015 classification task.
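The core idea fits in a few lines; below is a fully-connected stand-in for the paper's convolutional residual block (illustrative, not the reference implementation):

```python
import numpy as np

def residual_block(x, w1, w2):
    """Residual learning: layers fit a residual F(x); the block outputs F(x) + x."""
    h = np.maximum(x @ w1, 0)         # first weight layer + ReLU
    return np.maximum(x + h @ w2, 0)  # identity shortcut added before the final ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w1 = rng.normal(0, 0.1, (8, 8))
w2 = rng.normal(0, 0.1, (8, 8))
print(residual_block(x, w1, w2).shape)  # (4, 8)
```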
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification, as discussed by the authors.
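The layer schedule the summary describes, written out as data for reference; filter counts, kernel sizes, and strides follow the paper, while the interleaved max-pooling and normalization layers are omitted for brevity:

```python
# ("conv", out_channels, kernel, stride) or ("fc", units); pooling layers omitted.
ALEXNET = [
    ("conv", 96, 11, 4),
    ("conv", 256, 5, 1),
    ("conv", 384, 3, 1),
    ("conv", 384, 3, 1),
    ("conv", 256, 3, 1),
    ("fc", 4096),
    ("fc", 4096),
    ("fc", 1000),   # final 1000-way softmax over the ImageNet classes
]
print(sum(1 for l in ALEXNET if l[0] == "conv"), "conv layers,",
      sum(1 for l in ALEXNET if l[0] == "fc"), "fully-connected layers")
```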
Proceedings Article

Attention is All you Need

TL;DR: This paper proposed the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
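The core operation, scaled dot-product attention, is compact enough to sketch in NumPy:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, the paper's core operation."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(-1, keepdims=True)   # for numerical stability
    w = np.exp(scores)
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
print(scaled_dot_product_attention(q, k, v).shape)  # (5, 8)
```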
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.