Open Access · Proceedings Article
Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou
pp. 10347–10357
About:
This article was published in the International Conference on Machine Learning on 2021-07-18 and is currently open access. It has received 143 citations to date.
Citations
Posted Content
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
TL;DR: The authors propose the Pyramid Vision Transformer (PVT), a simple convolution-free backbone network useful for many dense prediction tasks, which achieves state-of-the-art performance on the COCO dataset.
Posted Content
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
TL;DR: The authors propose a dual-branch transformer that combines image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features, achieving promising results on image classification compared to convolutional neural networks.
Posted Content
Transformer in Transformer
TL;DR: Transformer in Transformer (TNT) is a new kind of neural architecture that encodes the input data into powerful features via the attention mechanism: the visual transformer first divides the input image into several local patches, then computes representations for both the patches and their relationships.
Posted Content
Scalable Visual Transformers with Hierarchical Pooling
TL;DR: A Hierarchical Visual Transformer (HVT) is proposed that progressively pools visual tokens to shrink the sequence length and hence reduce computational cost, analogous to feature-map downsampling in Convolutional Neural Networks (CNNs).
Posted Content
With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations
TL;DR: Nearest-Neighbor Contrastive Learning of visual Representations (NNCLR) samples the nearest neighbors of each image from the dataset in latent space and treats them as positives, providing more semantic variation than pre-defined transformations.