An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Open Access Proceedings Article

TLDR

The Vision Transformer (ViT) applies a pure transformer directly to sequences of image patches and, when pre-trained on large amounts of data, attains excellent image classification results on ImageNet, CIFAR-100, VTAB, and other benchmarks compared to state-of-the-art convolutional networks.

Abstract:

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
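To make "sequences of image patches" concrete, here is a minimal sketch of the patch-splitting step only. It assumes the 16x16 patch size suggested by the title and a hypothetical 224x224 RGB input; the `patchify` helper and the input resolution are illustrative assumptions and are not taken from the abstract. The sketch stops at the raw patch sequence and leaves out the embedding and transformer layers.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened patches.

    Returns a list of N = (H // P) * (W // P) patches, each a flat list of
    P * P * C values, ordered row by row (raster order).
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    patch.extend(pixel)  # append all C channel values
            patches.append(patch)
    return patches


# Hypothetical 224 x 224 RGB input filled with zeros (an assumption).
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch_size=16)
print(len(tokens), len(tokens[0]))  # 196 768
```

Under these assumptions the image becomes a sequence of 196 tokens, each holding 768 raw values, which plays the role that a sequence of word tokens plays in a language model.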

Citations

Posted Content

An Empirical Study of Training Self-Supervised Vision Transformers

Xinlei Chen, +2 more

- 05 Apr 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This work investigates the effects of several fundamental components for training self-supervised ViT, and finds that instability is a major issue that degrades accuracy and can be hidden by apparently good results; such results are in fact partial failures and improve when training is made more stable.

Posted Content

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Wenhai Wang, +8 more

- 24 Feb 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This work proposes Pyramid Vision Transformer (PVT), a simple backbone network useful for many dense prediction tasks without convolutions, and reports state-of-the-art performance on the COCO dataset.

Posted Content

Natural Adversarial Examples

Dan Hendrycks, +4 more

- 16 Jul 2019 -

arXiv: Learning

TL;DR: This work introduces two challenging datasets that reliably cause machine learning model performance to substantially degrade and curates an adversarial out-of-distribution detection dataset called IMAGENET-O, which is the first out-of-distribution detection dataset created for ImageNet models.

Posted Content

An Attentive Survey of Attention Models

Sneha Chaudhari, +3 more

- 05 Apr 2019 -

arXiv: Learning

TL;DR: A taxonomy that groups existing techniques into coherent categories in attention models is proposed, and how attention has been used to improve the interpretability of neural networks is described.

Posted Content

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Chun-Fu Chen, +2 more

- 27 Mar 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This work proposes a dual-branch transformer that combines image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features, and reports promising image classification results compared to convolutional neural networks.

References

Posted Content

VisualBERT: A Simple and Performant Baseline for Vision and Language.

Liunian Harold Li, +4 more

- 09 Aug 2019 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

Proceedings Article

Cats and dogs

Omkar M. Parkhi, +3 more

TL;DR: These models are very good: they beat all previously published results on the challenging ASIRRA test (cat vs dog discrimination). When applied to the task of discriminating the 37 different breeds of pets, they obtain an average accuracy of about 59%, a very encouraging result considering the difficulty of the problem.

Proceedings Article

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu, +3 more

TL;DR: ViLBERT extends the BERT architecture to a multi-modal two-stream model that processes visual and textual inputs in separate streams, which interact through co-attentional transformer layers.

Posted Content

Generating Long Sequences with Sparse Transformers.

Rewon Child, +3 more

- 23 Apr 2019 -

arXiv: Learning

TL;DR: This paper introduces sparse factorizations of the attention matrix which reduce the time and memory cost of self-attention from quadratic in the sequence length to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.

Proceedings Article

Relation Networks for Object Detection

Han Hu, +4 more

TL;DR: This work proposes an object relation module that models relations between objects and is shown to be effective at improving the object recognition and duplicate removal steps of the modern object detection pipeline.

Related Papers (5)

Attention is All you Need

Deep Residual Learning for Image Recognition

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

ImageNet: A large-scale hierarchical image database

Microsoft COCO: Common Objects in Context