Open Access Posted Content

Incorporating Convolution Designs into Visual Transformers.

TLDR
CeiT as discussed by the authors combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies, which reduces the training cost significantly.
Abstract
Motivated by the success of Transformers in natural language processing (NLP) tasks, there have been attempts (e.g., ViT and DeiT) to apply Transformers to the vision domain. However, pure Transformer architectures often require a large amount of training data or extra supervision to obtain performance comparable with convolutional neural networks (CNNs). To overcome these limitations, we analyze the potential drawbacks when directly borrowing Transformer architectures from NLP. Then we propose a new Convolution-enhanced image Transformer (CeiT) which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies. Three modifications are made to the original Transformer: 1) instead of the straightforward tokenization of raw input images, we design an Image-to-Tokens (I2T) module that extracts patches from generated low-level features; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer that promotes the correlation among neighboring tokens in the spatial dimension; 3) a Layer-wise Class token Attention (LCA) is attached at the top of the Transformer to utilize the multi-level representations. Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data or extra CNN teachers. Besides, CeiT models also demonstrate better convergence with 3x fewer training iterations, which can reduce the training cost significantly. (Code and models will be released upon acceptance.)
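To make the second modification concrete, below is a minimal sketch of a Locally-enhanced Feed-Forward layer in the spirit of the abstract: patch tokens are expanded, restored to their 2D layout, mixed locally, and projected back, while the class token bypasses the spatial operation. The use of a depth-wise convolution, the hidden-size ratio, the kernel size, and the placement of activations are assumptions for illustration, not the authors' released implementation (which the abstract notes is not yet public).

```python
import torch
import torch.nn as nn

class LeFF(nn.Module):
    """Sketch of a Locally-enhanced Feed-Forward layer (illustrative sizes only)."""
    def __init__(self, dim=192, expand_ratio=4, kernel_size=3):
        super().__init__()
        hidden = dim * expand_ratio
        self.proj_in = nn.Linear(dim, hidden)     # expand channels, as in a standard FFN
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size,
                                padding=kernel_size // 2, groups=hidden)  # local mixing of neighbors
        self.proj_out = nn.Linear(hidden, dim)    # project back to the token dimension
        self.act = nn.GELU()

    def forward(self, x):
        # x: (batch, 1 + N, dim) -- a class token followed by N patch tokens
        cls_tok, patches = x[:, :1], x[:, 1:]
        b, n, _ = patches.shape
        h = w = int(n ** 0.5)                     # assumes a square grid of patch tokens

        patches = self.act(self.proj_in(patches))
        patches = patches.transpose(1, 2).reshape(b, -1, h, w)  # restore the 2D spatial layout
        patches = self.act(self.dwconv(patches))                 # correlate neighboring tokens
        patches = patches.flatten(2).transpose(1, 2)
        patches = self.proj_out(patches)

        return torch.cat([cls_tok, patches], dim=1)  # class token skips the convolution

# Example: 14x14 = 196 patch tokens plus one class token.
tokens = torch.randn(2, 197, 192)
print(LeFF()(tokens).shape)  # torch.Size([2, 197, 192])
```

The I2T module and the LCA head described in the abstract are omitted here; the sketch only illustrates how locality can be injected into the feed-forward path.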


Citations
Journal Article

A Survey on Vision Transformer

TL;DR: Transformer as discussed by the authors is a type of deep neural network mainly based on the self-attention mechanism; first applied to natural language processing, it has been shown to perform similarly to or better than other types of networks such as convolutional and recurrent neural networks.
Posted Content

Transformers in Vision: A Survey

TL;DR: Transformer networks as mentioned in this paper enable modeling of long-range dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as Long Short-Term Memory (LSTM).
Proceedings Article

Uformer: A General U-Shaped Transformer for Image Restoration

TL;DR: Wang et al. as discussed by the authors propose an effective and efficient Transformer-based architecture for image restoration, in which a hierarchical encoder-decoder network is built using the Transformer block.
Journal Article

ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond

TL;DR: Zhang et al. as discussed by the authors leverage two intrinsic inductive biases (IBs), modeling local visual structures and dealing with scale variance, which vanilla Transformers instead learn implicitly from large-scale training data with longer training schedules.
Proceedings Article

CMT: Convolutional Neural Networks Meet Vision Transformers

TL;DR: CMT as mentioned in this paper is a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to extract local information.
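As a generic illustration of that hybrid idea (not the CMT architecture; all layer sizes below are invented for the example), a convolutional stem can extract local features that are then flattened into tokens and mixed globally by a Transformer encoder:

```python
import torch
import torch.nn as nn

class HybridBlockSketch(nn.Module):
    """Toy conv-plus-Transformer hybrid: convolution for locality, self-attention for global context."""
    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),  # local feature extraction
            nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # global mixing via self-attention

    def forward(self, x):
        feat = self.stem(x)                       # (B, dim, H/4, W/4) local features
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W/16, dim) token sequence
        return self.encoder(tokens)

print(HybridBlockSketch()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 256, 256])
```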
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting networks won first place on the ILSVRC 2015 classification task.
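The core of that framework is the identity shortcut, y = F(x) + x, so each block only learns a residual correction. A minimal basic block is sketched below (strided/projection shortcuts and the full ResNet stage layout are omitted; channel count is illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal basic residual block: output is F(x) + x."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut eases optimization of very deep stacks

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock()(x).shape)  # torch.Size([1, 64, 56, 56])
```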
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network as discussed by the authors achieved state-of-the-art performance on ImageNet classification; it consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
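A compact sketch of that layout follows, assuming torchvision-style channel widths rather than the original paper's; dropout, local response normalization, and the original two-GPU split are omitted:

```python
import torch
import torch.nn as nn

# Rough AlexNet-style layout: five conv layers (some followed by max-pooling),
# then three fully-connected layers producing 1000-way class scores.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),  # softmax is applied in the loss function
)

logits = alexnet_like(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```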
Proceedings Article

Attention is All you Need

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
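The building block of that architecture is scaled dot-product attention; a minimal single-head version is sketched below (the paper's multi-head projections, masking, and positional encodings are omitted, and the tensor sizes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Every position attends to every other, so long-range dependencies
    are modeled without recurrence or convolution."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # pairwise similarities
    weights = scores.softmax(dim=-1)                          # attention distribution per query
    return weights @ v                                        # weighted sum of values

# Example: a batch of 2 sequences, 10 tokens each, 64-dimensional q/k/v.
q = k = v = torch.randn(2, 10, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 10, 64])
```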
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
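Depth is increased in that design by stacking uniform stages of small 3x3 convolutions; the sketch below follows the VGG-16 channel progression but omits the classifier head:

```python
import torch
import torch.nn as nn

def vgg_stage(in_ch, out_ch, num_convs):
    """VGG-style stage: a stack of 3x3 convolutions followed by 2x2 max-pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1), nn.ReLU()]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# Thirteen conv layers (2+2+3+3+3) as in VGG-16; the three FC layers are omitted.
features = nn.Sequential(
    vgg_stage(3, 64, 2), vgg_stage(64, 128, 2), vgg_stage(128, 256, 3),
    vgg_stage(256, 512, 3), vgg_stage(512, 512, 3),
)
print(features(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])
```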
Journal Article

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Trending Questions (1)
How does the amount of data contribute to the choice between convolutional and Transformer models?

The amount of training data influences the choice between convolutional and Transformer models: pure Transformer architectures typically require large datasets or extra supervision, whereas convolution-enhanced designs such as CeiT achieve competitive performance without requiring extensive training data.