Open Access
Posted Content

Pre-Trained Image Processing Transformer

TLDR
To maximally excavate the capability of the transformer, the IPT model is presented: it utilizes the well-known ImageNet benchmark to generate a large number of corrupted image pairs, and contrastive learning is introduced so that the model adapts well to different image processing tasks.
Abstract
As the computing power of modern hardware increases rapidly, pre-trained deep learning models (e.g., BERT, GPT-3) learned on large-scale datasets have shown their effectiveness over conventional methods. This significant progress is mainly attributed to the representation ability of the transformer and its variant architectures. In this paper, we study low-level computer vision tasks (e.g., denoising, super-resolution and deraining) and develop a new pre-trained model, namely, the image processing transformer (IPT). To maximally excavate the capability of the transformer, we propose to utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs. The IPT model is trained on these images with multi-heads and multi-tails. In addition, contrastive learning is introduced so that the model adapts well to different image processing tasks. The pre-trained model can therefore be efficiently employed on a desired task after fine-tuning. With only one pre-trained model, IPT outperforms the current state-of-the-art methods on various low-level benchmarks. Code is available at this https URL and this https URL
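The abstract's multi-head/multi-tail layout can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the authors' implementation: the channel width, patch size, and encoder depth are illustrative assumptions, and super-resolution is shown at a fixed resolution for simplicity. Each task gets its own convolutional head and tail, while the transformer body is shared.

```python
# Minimal sketch of the multi-head / multi-tail IPT layout (illustrative
# sizes, not the paper's configuration).
import torch
import torch.nn as nn

class IPTSketch(nn.Module):
    def __init__(self, tasks=("denoise", "sr", "derain"), dim=64, patch=4):
        super().__init__()
        # One lightweight convolutional head per task maps the corrupted
        # input into a shared feature space.
        self.heads = nn.ModuleDict({t: nn.Conv2d(3, dim, 3, padding=1) for t in tasks})
        # Shared transformer body operating on flattened feature patches.
        layer = nn.TransformerEncoderLayer(d_model=dim * patch * patch, nhead=8,
                                           batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=4)
        # One convolutional tail per task reconstructs the restored image.
        self.tails = nn.ModuleDict({t: nn.Conv2d(dim, 3, 3, padding=1) for t in tasks})
        self.patch = patch

    def forward(self, x, task):
        f = self.heads[task](x)                        # (B, C, H, W)
        B, C, H, W = f.shape
        p = self.patch
        # Cut the feature map into p x p patches and flatten to tokens.
        t = f.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
        t = t.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        t = self.body(t)
        # Fold the tokens back into a feature map.
        f = t.reshape(B, H // p, W // p, C, p, p).permute(0, 3, 1, 4, 2, 5)
        f = f.reshape(B, C, H, W)
        return self.tails[task](f)

model = IPTSketch()
out = model(torch.randn(1, 3, 48, 48), task="sr")      # (1, 3, 48, 48)
```

During pre-training, batches from each corruption type would route through their own head and tail while updating the shared body; fine-tuning keeps only the head and tail of the desired task.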


Citations
Posted Content

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

TL;DR: T2T-ViT, as mentioned in this paper, progressively transforms the image into tokens by recursively aggregating neighboring tokens into one token (Tokens-to-Token), so that local structure represented by surrounding tokens can be modeled and the token length can be reduced.
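The recursive aggregation the summary describes can be sketched in a few lines. Below is a hedged illustration of one tokens-to-token step, with kernel, stride, and padding chosen for illustration; the actual T2T-ViT interleaves these merges with transformer layers and linear projections, which are omitted here.

```python
# One tokens-to-token (T2T) step: re-assemble tokens into a 2-D map, then
# merge overlapping k x k neighborhoods back into single tokens, so the
# sequence shrinks while local structure gets mixed in.
import torch
import torch.nn as nn

def t2t_step(tokens, h, w, kernel=3, stride=2, padding=1):
    """tokens: (B, N, C) with N == h * w."""
    B, N, C = tokens.shape
    fmap = tokens.transpose(1, 2).reshape(B, C, h, w)
    patches = nn.functional.unfold(fmap, kernel_size=kernel,
                                   stride=stride, padding=padding)
    return patches.transpose(1, 2)          # (B, N', C * kernel * kernel)

x = torch.randn(2, 14 * 14, 64)
y = t2t_step(x, 14, 14)                     # (2, 49, 576): fewer, richer tokens
```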
Journal Article

Adding Conditional Control to Text-to-Image Diffusion Models

Lvmin Zhang, +1 more
10 Feb 2023
TL;DR: ControlNet, as discussed by the authors, learns task-specific conditions in an end-to-end way; the learning is robust even when the training dataset is small (<50k samples), and the model can be trained on personal devices.
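The robustness the summary mentions comes largely from how the condition is injected. The sketch below is a hedged paraphrase of the zero-convolution scheme, with a single convolution standing in for a diffusion-model encoder block: because the 1×1 convolutions start at zero, the controlled block initially reproduces the frozen pretrained output exactly.

```python
# Hedged sketch of a ControlNet-style block: frozen original weights, a
# trainable copy fed with the condition, joined through zero-initialized
# 1x1 convolutions. The stand-in block and sizes are assumptions.
import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    def __init__(self, pretrained_block, channels):
        super().__init__()
        self.trainable = copy.deepcopy(pretrained_block)   # tuned copy
        self.locked = pretrained_block                     # kept frozen
        for p in self.locked.parameters():
            p.requires_grad = False
        # Zero convolutions: at initialization they output zeros, so the
        # sum in forward() equals the locked block's output.
        self.zero_in = nn.Conv2d(channels, channels, 1)
        self.zero_out = nn.Conv2d(channels, channels, 1)
        for conv in (self.zero_in, self.zero_out):
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, x, condition):
        y = self.locked(x)
        c = self.trainable(x + self.zero_in(condition))
        return y + self.zero_out(c)

block = ControlledBlock(nn.Conv2d(64, 64, 3, padding=1), channels=64)
out = block(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```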
Posted Content

LocalViT: Bringing Locality to Vision Transformers

TL;DR: This work adds locality to vision transformers by introducing depth-wise convolution into the feed-forward network, and successfully applies the same locality mechanism to four vision transformers, which shows the generality of the locality concept.
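The mechanism is compact enough to show directly. Below is a minimal sketch of a locality-augmented feed-forward network in the spirit described above, with the expansion ratio and activation as illustrative assumptions: tokens are reshaped to a 2-D map so a depth-wise 3×3 convolution can mix neighboring positions between the two pointwise projections.

```python
# Feed-forward network with a depth-wise convolution in the middle
# (illustrative sizes; class tokens would be handled separately).
import torch
import torch.nn as nn

class LocalFeedForward(nn.Module):
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),          # pointwise expand
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # depth-wise
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),          # pointwise project
        )

    def forward(self, tokens, h, w):
        """tokens: (B, N, C) with N == h * w."""
        B, N, C = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(B, C, h, w)
        fmap = self.net(fmap)
        return fmap.reshape(B, C, N).transpose(1, 2)

ffn = LocalFeedForward(dim=96)
out = ffn(torch.randn(2, 14 * 14, 96), 14, 14)   # (2, 196, 96)
```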
Proceedings Article

Scaling Up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs

TL;DR: It is demonstrated that using a few large convolutional kernels instead of a stack of small kernels can be a more powerful paradigm, and RepLKNet is proposed: a pure CNN architecture whose kernel size is as large as 31×31, in contrast to the commonly used 3×3.
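A hedged sketch of the large-kernel recipe follows: a depth-wise 31×31 convolution trained alongside a parallel small-kernel branch that can later be folded into the big kernel (structural reparameterization). Channel counts and the 5×5 companion size here are assumptions, and RepLKNet's surrounding block structure is omitted.

```python
# Depth-wise large-kernel convolution with a mergeable small-kernel branch.
import torch
import torch.nn as nn

class LargeKernelDW(nn.Module):
    def __init__(self, channels, big=31, small=5):
        super().__init__()
        self.big = nn.Conv2d(channels, channels, big, padding=big // 2,
                             groups=channels)
        self.small = nn.Conv2d(channels, channels, small, padding=small // 2,
                               groups=channels)

    def forward(self, x):
        if self.small is not None:
            return self.big(x) + self.small(x)   # training-time parallel branches
        return self.big(x)                       # merged, inference-time path

    @torch.no_grad()
    def merge(self):
        """Fold the small kernel into the big one: the sum of two same-stride
        depth-wise convolutions equals one convolution with summed kernels."""
        pad = (self.big.kernel_size[0] - self.small.kernel_size[0]) // 2
        self.big.weight += nn.functional.pad(self.small.weight, [pad] * 4)
        self.big.bias += self.small.bias
        self.small = None

m = LargeKernelDW(32)
x = torch.randn(1, 32, 56, 56)
y_train = m(x)
m.merge()
assert torch.allclose(m(x), y_train, atol=1e-5)   # merged path matches
```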
Journal Article

Remote Sensing Image Change Detection with Transformers

TL;DR: The authors propose a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain, where the high-level concepts of the change of interest can be represented by a few visual words.
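The "few visual words" idea can be sketched compactly. The following hedged illustration tokenizes each temporal feature map with learned spatial attention and lets a transformer encoder relate the two token sets; BIT's projection of the enriched tokens back onto pixel space and its prediction head are omitted, and all sizes are assumptions.

```python
# Summarize two temporal feature maps into a few tokens and relate them.
import torch
import torch.nn as nn

class BitemporalTokens(nn.Module):
    def __init__(self, dim=32, num_tokens=4):
        super().__init__()
        self.attn = nn.Conv2d(dim, num_tokens, 1)   # one attention map per token
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=1)

    def tokenize(self, f):
        a = self.attn(f).flatten(2).softmax(dim=-1)     # (B, L, H*W)
        return a @ f.flatten(2).transpose(1, 2)         # (B, L, C) "visual words"

    def forward(self, f1, f2):
        tokens = torch.cat([self.tokenize(f1), self.tokenize(f2)], dim=1)
        tokens = self.encoder(tokens)       # context across both dates
        half = tokens.shape[1] // 2
        return tokens[:, :half], tokens[:, half:]

m = BitemporalTokens()
t1, t2 = m(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16))
```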
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: In this paper, the authors present a residual learning framework to ease the training of networks that are substantially deeper than those used previously; it won 1st place in the ILSVRC 2015 classification task.
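The core of the framework is the residual block: the stacked layers learn a correction F(x) and the input is added back through an identity shortcut, which is what eases optimization at depth. A minimal sketch of the basic two-layer block (stride-1, equal-channel case):

```python
# Basic residual block: output = ReLU(F(x) + x).
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)           # identity shortcut

block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))       # same shape as the input
```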
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: As discussed by the authors, state-of-the-art performance was achieved with a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
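The layer recipe in the summary maps directly to code. Below is a sketch following the widely cited AlexNet configuration for 224×224 inputs (channel counts per the original paper; normalization layers and the two-GPU split are omitted):

```python
# Five conv layers (three followed by max-pooling), then three FC layers
# producing logits for a 1000-way softmax.
import torch
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 1000),                  # softmax applied in the loss
)
logits = alexnet(torch.randn(1, 3, 224, 224))   # (1, 1000)
```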
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small (3×3) convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
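The design point is easy to show: depth comes from stacking very small 3×3 convolutions, doubling channels after each pooling stage. A sketch of the 16-layer configuration's convolutional stages (the three fully-connected layers are omitted):

```python
# VGG-16 feature stages: 2-2-3-3-3 stacked 3x3 convolutions.
import torch
import torch.nn as nn

def vgg_stage(cin, cout, n_convs):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                   nn.ReLU()]
    layers.append(nn.MaxPool2d(2, 2))
    return layers

features = nn.Sequential(
    *vgg_stage(3, 64, 2), *vgg_stage(64, 128, 2), *vgg_stage(128, 256, 3),
    *vgg_stage(256, 512, 3), *vgg_stage(512, 512, 3),
)
out = features(torch.randn(1, 3, 224, 224))     # (1, 512, 7, 7)
```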
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieving state-of-the-art performance on English-to-French translation.
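The mechanism at the heart of the architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A single-head sketch (the paper runs several such heads in parallel with learned projections, omitted here):

```python
# Scaled dot-product attention for a single head.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k) tensors."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, len_q, len_k)
    return scores.softmax(dim=-1) @ v                   # weighted sum of values

q = k = v = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(q, k, v)             # (2, 10, 64)
```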