Open Access Proceedings Article

Training data-efficient image transformers & distillation through attention

About
This article was published at the International Conference on Machine Learning on 2021-07-18 and is currently open access. It has received 143 citations to date.


Citations
Posted Content

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

TL;DR: This paper proposes the Pyramid Vision Transformer (PVT), a simple, convolution-free backbone network useful for many dense prediction tasks, and reports state-of-the-art performance on the COCO dataset.
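The pyramid structure the summary alludes to can be illustrated with a short, hypothetical sketch: each stage re-embeds the current feature map into fewer, wider tokens before running transformer blocks, producing multi-scale feature maps for dense prediction. All module and parameter names below are illustrative, not PVT's code, and standard attention stands in for PVT's spatial-reduction attention.

```python
# Hypothetical sketch of a pyramid of transformer stages (not the paper's code).
import torch
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    """Non-overlapping patchify via unfold + linear projection (no convolutions)."""
    def __init__(self, in_dim: int, out_dim: int, patch: int):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(in_dim * patch * patch, out_dim)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        patches = nn.functional.unfold(x, self.patch, stride=self.patch)  # (B, C*p*p, N)
        tokens = self.proj(patches.transpose(1, 2))                       # (B, N, out_dim)
        return tokens, H // self.patch, W // self.patch

class PyramidStageSketch(nn.Module):
    def __init__(self, in_dim, out_dim, patch, heads, depth=2):
        super().__init__()
        self.embed = PatchEmbedSketch(in_dim, out_dim, patch)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(out_dim, heads, dim_feedforward=4 * out_dim,
                                       batch_first=True)
            for _ in range(depth))

    def forward(self, x):
        tokens, h, w = self.embed(x)
        for blk in self.blocks:
            tokens = blk(tokens)
        # reshape back to a feature map so the next stage can patchify it again
        return tokens.transpose(1, 2).reshape(x.size(0), -1, h, w)

x = torch.randn(2, 3, 224, 224)
feats = []
for stage in [PyramidStageSketch(3, 64, 4, 1),
              PyramidStageSketch(64, 128, 2, 2),
              PyramidStageSketch(128, 256, 2, 4)]:
    x = stage(x)
    feats.append(x)                  # multi-scale maps: 56x56, 28x28, 14x14
print([f.shape[-1] for f in feats])  # [56, 28, 14]
```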
Posted Content

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

TL;DR: CrossViT is a dual-branch transformer that combines image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features, achieving promising image-classification results compared with convolutional neural networks.
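A minimal sketch of the dual-branch fusion idea, assuming a PyTorch-style implementation: the CLS token of one branch cross-attends to the patch tokens of the other branch. Names such as `CrossBranchFusion` are illustrative, not the paper's code.

```python
# Minimal sketch (assumption): cross-attention between one branch's CLS token
# and the other branch's patch tokens in a dual-branch ViT.
import torch
import torch.nn as nn

class CrossBranchFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cls_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # cls_a: (B, 1, D) CLS token of branch A; tokens_b: (B, N, D) patch tokens of branch B
        fused, _ = self.attn(query=cls_a, key=tokens_b, value=tokens_b)
        return self.norm(cls_a + fused)  # residual + norm

# Usage: two branches with different patch sizes produce token sequences of
# different lengths; after projecting to a shared width D, each CLS token
# gathers information from the other branch through cross-attention.
B, D = 2, 192
cls_small = torch.randn(B, 1, D)       # CLS from the small-patch branch
tokens_large = torch.randn(B, 49, D)   # patch tokens from the large-patch branch
fused_cls = CrossBranchFusion(D)(cls_small, tokens_large)
print(fused_cls.shape)                 # torch.Size([2, 1, 192])
```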
Posted Content

Transformer in Transformer

TL;DR: Transformer iN Transformer (TNT) is a new kind of neural architecture that encodes the input as powerful features via the attention mechanism: the visual transformer first divides the input image into several local patches and then computes both their representations and their relationships.
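A hedged sketch of that patch-within-patch structure: an inner transformer models the sub-patches inside each patch, and its output is folded back into the outer patch embeddings before the outer transformer runs. Module names and dimensions below are assumptions for illustration, not the TNT reference code.

```python
# Minimal sketch (assumption): an "inner" transformer over sub-patches inside
# each patch and an "outer" transformer over patch embeddings.
import torch
import torch.nn as nn

class TNTBlockSketch(nn.Module):
    def __init__(self, inner_dim: int = 24, outer_dim: int = 192,
                 sub_tokens: int = 16, heads: int = 4):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(inner_dim, heads,
                                                dim_feedforward=4 * inner_dim,
                                                batch_first=True)
        self.outer = nn.TransformerEncoderLayer(outer_dim, heads,
                                                dim_feedforward=4 * outer_dim,
                                                batch_first=True)
        # project the flattened inner tokens back into each outer embedding
        self.proj = nn.Linear(sub_tokens * inner_dim, outer_dim)

    def forward(self, inner_tokens, outer_tokens):
        # inner_tokens: (B * num_patches, sub_tokens, inner_dim)
        # outer_tokens: (B, num_patches, outer_dim)
        B, P, _ = outer_tokens.shape
        inner_tokens = self.inner(inner_tokens)
        fused = self.proj(inner_tokens.reshape(B, P, -1))   # fold sub-patch info in
        outer_tokens = self.outer(outer_tokens + fused)
        return inner_tokens, outer_tokens

blk = TNTBlockSketch()
inner = torch.randn(2 * 196, 16, 24)   # 196 patches, 16 sub-patches each
outer = torch.randn(2, 196, 192)
inner, outer = blk(inner, outer)
print(outer.shape)                     # torch.Size([2, 196, 192])
```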
Posted Content

Scalable Visual Transformers with Hierarchical Pooling

TL;DR: A Hierarchical Visual Transformer (HVT) is proposed that progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature-map downsampling in convolutional neural networks (CNNs).
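The token-pooling idea can be sketched directly: a 1D max-pool over the sequence axis between blocks roughly halves the number of tokens at each stage, mirroring feature-map downsampling in CNNs. The sketch below uses standard PyTorch layers and illustrative dimensions, not the paper's code.

```python
# Minimal sketch (assumption): max-pool the token dimension between stages
# so later attention layers see fewer tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledStage(nn.Module):
    def __init__(self, dim: int = 192, heads: int = 3, depth: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                       batch_first=True)
            for _ in range(depth))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            tokens = blk(tokens)
        # pool along the token (sequence) axis: (B, N, D) -> (B, ~N/2, D)
        tokens = F.max_pool1d(tokens.transpose(1, 2), kernel_size=3,
                              stride=2, padding=1).transpose(1, 2)
        return tokens

x = torch.randn(2, 196, 192)
for stage in [PooledStage(), PooledStage(), PooledStage()]:
    x = stage(x)
print(x.shape)   # sequence length shrinks 196 -> 98 -> 49 -> 25
```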
Posted Content

With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations

TL;DR: Nearest-Neighbor Contrastive Learning of Visual Representations (NNCLR) samples the nearest neighbors of each sample from the dataset in the latent space and treats them as positives, providing more semantic variation than pre-defined transformations.
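The nearest-neighbour positive can be written down compactly: each embedding is swapped for its nearest neighbour in a support queue before an InfoNCE-style loss is computed against the other augmented view. The function below is a simplified sketch; the encoder, projection head, and queue maintenance are omitted, and all names are illustrative.

```python
# Minimal sketch (assumption): nearest-neighbour positives from a support queue
# plugged into a cross-entropy (InfoNCE-style) contrastive loss.
import torch
import torch.nn.functional as F

def nnclr_loss(z1: torch.Tensor, z2: torch.Tensor,
               support: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # z1, z2: (B, D) embeddings of two augmented views; support: (Q, D) queue
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    support = F.normalize(support, dim=1)
    # replace z1 by its nearest neighbour in the support set (the "positive")
    nn_idx = (z1 @ support.t()).argmax(dim=1)
    positives = support[nn_idx]                      # (B, D)
    logits = positives @ z2.t() / temperature        # (B, B)
    labels = torch.arange(z1.size(0))                # diagonal pairs match
    return F.cross_entropy(logits, labels)

loss = nnclr_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(1024, 128))
print(loss.item())
```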