Open Access Proceedings Article

Training data-efficient image transformers & distillation through attention

About
This article was published at the International Conference on Machine Learning on 2021-07-18 and is currently open access. It has received 143 citations to date.


Citations
Posted Content

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

TL;DR: This paper proposes the Pyramid Vision Transformer (PVT), a simple, convolution-free backbone network useful for many dense prediction tasks, and reports state-of-the-art performance on the COCO dataset.
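The pyramid structure the summary alludes to can be illustrated with a short, hypothetical sketch: each stage re-embeds the current feature map into fewer, wider tokens before running transformer blocks, producing multi-scale feature maps for dense prediction. All module and parameter names below are illustrative, not PVT's code, and standard attention stands in for PVT's spatial-reduction attention.

```python
# Hypothetical sketch of a pyramid of transformer stages (not the paper's code).
import torch
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    """Non-overlapping patchify via unfold + linear projection (no convolutions)."""
    def __init__(self, in_dim: int, out_dim: int, patch: int):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(in_dim * patch * patch, out_dim)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        patches = nn.functional.unfold(x, self.patch, stride=self.patch)  # (B, C*p*p, N)
        tokens = self.proj(patches.transpose(1, 2))                       # (B, N, out_dim)
        return tokens, H // self.patch, W // self.patch

class PyramidStageSketch(nn.Module):
    def __init__(self, in_dim, out_dim, patch, heads, depth=2):
        super().__init__()
        self.embed = PatchEmbedSketch(in_dim, out_dim, patch)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(out_dim, heads, dim_feedforward=4 * out_dim,
                                       batch_first=True)
            for _ in range(depth))

    def forward(self, x):
        tokens, h, w = self.embed(x)
        for blk in self.blocks:
            tokens = blk(tokens)
        # reshape back to a feature map so the next stage can patchify it again
        return tokens.transpose(1, 2).reshape(x.size(0), -1, h, w)

x = torch.randn(2, 3, 224, 224)
feats = []
for stage in [PyramidStageSketch(3, 64, 4, 1),
              PyramidStageSketch(64, 128, 2, 2),
              PyramidStageSketch(128, 256, 2, 4)]:
    x = stage(x)
    feats.append(x)                  # multi-scale maps: 56x56, 28x28, 14x14
print([f.shape[-1] for f in feats])  # [56, 28, 14]
```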
Posted Content

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

TL;DR: CrossViT is a dual-branch transformer that combines image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features, achieving promising image-classification results compared with convolutional neural networks.
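A minimal sketch of the dual-branch fusion idea, assuming a PyTorch-style implementation: the CLS token of one branch cross-attends to the patch tokens of the other branch. Names such as `CrossBranchFusion` are illustrative, not the paper's code.

```python
# Minimal sketch (assumption): cross-attention between one branch's CLS token
# and the other branch's patch tokens in a dual-branch ViT.
import torch
import torch.nn as nn

class CrossBranchFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cls_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # cls_a: (B, 1, D) CLS token of branch A; tokens_b: (B, N, D) patch tokens of branch B
        fused, _ = self.attn(query=cls_a, key=tokens_b, value=tokens_b)
        return self.norm(cls_a + fused)  # residual + norm

# Usage: two branches with different patch sizes produce token sequences of
# different lengths; after projecting to a shared width D, each CLS token
# gathers information from the other branch through cross-attention.
B, D = 2, 192
cls_small = torch.randn(B, 1, D)       # CLS from the small-patch branch
tokens_large = torch.randn(B, 49, D)   # patch tokens from the large-patch branch
fused_cls = CrossBranchFusion(D)(cls_small, tokens_large)
print(fused_cls.shape)                 # torch.Size([2, 1, 192])
```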
Posted Content

Transformer in Transformer

TL;DR: Transformer iN Transformer (TNT) is a new kind of neural architecture that encodes the input as powerful features via the attention mechanism: the visual transformer first divides the input image into several local patches and then computes both their representations and their relationships.
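A hedged sketch of that patch-within-patch structure: an inner transformer models the sub-patches inside each patch, and its output is folded back into the outer patch embeddings before the outer transformer runs. Module names and dimensions below are assumptions for illustration, not the TNT reference code.

```python
# Minimal sketch (assumption): an "inner" transformer over sub-patches inside
# each patch and an "outer" transformer over patch embeddings.
import torch
import torch.nn as nn

class TNTBlockSketch(nn.Module):
    def __init__(self, inner_dim: int = 24, outer_dim: int = 192,
                 sub_tokens: int = 16, heads: int = 4):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(inner_dim, heads,
                                                dim_feedforward=4 * inner_dim,
                                                batch_first=True)
        self.outer = nn.TransformerEncoderLayer(outer_dim, heads,
                                                dim_feedforward=4 * outer_dim,
                                                batch_first=True)
        # project the flattened inner tokens back into each outer embedding
        self.proj = nn.Linear(sub_tokens * inner_dim, outer_dim)

    def forward(self, inner_tokens, outer_tokens):
        # inner_tokens: (B * num_patches, sub_tokens, inner_dim)
        # outer_tokens: (B, num_patches, outer_dim)
        B, P, _ = outer_tokens.shape
        inner_tokens = self.inner(inner_tokens)
        fused = self.proj(inner_tokens.reshape(B, P, -1))   # fold sub-patch info in
        outer_tokens = self.outer(outer_tokens + fused)
        return inner_tokens, outer_tokens

blk = TNTBlockSketch()
inner = torch.randn(2 * 196, 16, 24)   # 196 patches, 16 sub-patches each
outer = torch.randn(2, 196, 192)
inner, outer = blk(inner, outer)
print(outer.shape)                     # torch.Size([2, 196, 192])
```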
Posted Content

Scalable Visual Transformers with Hierarchical Pooling

TL;DR: A Hierarchical Visual Transformer (HVT) is proposed that progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature-map downsampling in convolutional neural networks (CNNs).
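The token-pooling idea can be sketched directly: a 1D max-pool over the sequence axis between blocks roughly halves the number of tokens at each stage, mirroring feature-map downsampling in CNNs. The sketch below uses standard PyTorch layers and illustrative dimensions, not the paper's code.

```python
# Minimal sketch (assumption): max-pool the token dimension between stages
# so later attention layers see fewer tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledStage(nn.Module):
    def __init__(self, dim: int = 192, heads: int = 3, depth: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                       batch_first=True)
            for _ in range(depth))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            tokens = blk(tokens)
        # pool along the token (sequence) axis: (B, N, D) -> (B, ~N/2, D)
        tokens = F.max_pool1d(tokens.transpose(1, 2), kernel_size=3,
                              stride=2, padding=1).transpose(1, 2)
        return tokens

x = torch.randn(2, 196, 192)
for stage in [PooledStage(), PooledStage(), PooledStage()]:
    x = stage(x)
print(x.shape)   # sequence length shrinks 196 -> 98 -> 49 -> 25
```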
Posted Content

With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations

TL;DR: Nearest-Neighbor Contrastive Learning of Visual Representations (NNCLR) samples the nearest neighbors of each sample from the dataset in the latent space and treats them as positives, providing more semantic variation than pre-defined transformations.
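The nearest-neighbour positive can be written down compactly: each embedding is swapped for its nearest neighbour in a support queue before an InfoNCE-style loss is computed against the other augmented view. The function below is a simplified sketch; the encoder, projection head, and queue maintenance are omitted, and all names are illustrative.

```python
# Minimal sketch (assumption): nearest-neighbour positives from a support queue
# plugged into a cross-entropy (InfoNCE-style) contrastive loss.
import torch
import torch.nn.functional as F

def nnclr_loss(z1: torch.Tensor, z2: torch.Tensor,
               support: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # z1, z2: (B, D) embeddings of two augmented views; support: (Q, D) queue
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    support = F.normalize(support, dim=1)
    # replace z1 by its nearest neighbour in the support set (the "positive")
    nn_idx = (z1 @ support.t()).argmax(dim=1)
    positives = support[nn_idx]                      # (B, D)
    logits = positives @ z2.t() / temperature        # (B, B)
    labels = torch.arange(z1.size(0))                # diagonal pairs match
    return F.cross_entropy(logits, labels)

loss = nnclr_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(1024, 128))
print(loss.item())
```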