Open Access · Posted Content
Pre-Trained Image Processing Transformer
Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, Wen Gao
TL;DR: To maximally excavate the capability of the transformer, the IPT model utilizes the well-known ImageNet benchmark to generate a large number of corrupted image pairs, and contrastive learning is introduced to adapt well to different image processing tasks.
Abstract:
As the computing power of modern hardware increases rapidly, pre-trained deep learning models (e.g., BERT, GPT-3) learned on large-scale datasets have shown their effectiveness over conventional methods. This progress is mainly attributed to the representation ability of the transformer and its variant architectures. In this paper, we study low-level computer vision tasks (e.g., denoising, super-resolution, and deraining) and develop a new pre-trained model, namely the image processing transformer (IPT). To maximally excavate the capability of the transformer, we propose to utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs. The IPT model is trained on these images with multiple heads and multiple tails. In addition, contrastive learning is introduced to adapt well to different image processing tasks. The pre-trained model can therefore be efficiently employed on the desired task after fine-tuning. With only one pre-trained model, IPT outperforms the current state-of-the-art methods on various low-level benchmarks. Code is available at this https URL and this https URL.
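The pre-training data pipeline the abstract describes (synthesizing corrupted inputs from clean images for tasks such as denoising and super-resolution) can be sketched as follows. This is an illustrative NumPy sketch, not the paper's actual code: the function names and noise level are assumptions, and a random array stands in for an ImageNet crop.

```python
import numpy as np

def make_denoise_pair(clean, sigma=30.0, rng=None):
    """Corrupt a clean image with additive Gaussian noise (denoising task)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = clean + rng.normal(0.0, sigma, clean.shape)
    return np.clip(noisy, 0, 255), clean

def make_sr_pair(clean, scale=2):
    """Subsample the clean image to form a low-resolution input
    (super-resolution task); real pipelines use bicubic downsampling."""
    low = clean[::scale, ::scale]
    return low, clean

rng = np.random.default_rng(0)
# Stand-in for a clean ImageNet crop (grayscale, 48x48).
clean = rng.integers(0, 256, size=(48, 48)).astype(np.float64)
noisy, target = make_denoise_pair(clean, sigma=30.0, rng=rng)
low, hr = make_sr_pair(clean, scale=2)
```

Each (corrupted, clean) pair then serves as an (input, target) training example for the corresponding task-specific head and tail.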
Citations
Posted Content
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis E. H. Tay, Jiashi Feng, Shuicheng Yan
TL;DR: T2T-ViT, as mentioned in this paper, proposes a tokens-to-token transformation that progressively transforms the image to tokens by recursively aggregating neighboring tokens into one token (Tokens-to-Token), so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced.
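The aggregation step summarized above can be illustrated with a minimal sketch: tokens are re-structurized to a 2-D grid, and each k×k neighborhood (stride s) is concatenated into a single token, shrinking the token count while growing the per-token dimension. This is a hedged NumPy illustration of the idea only; the actual T2T-ViT uses a soft split (unfold with overlap) interleaved with transformer layers.

```python
import numpy as np

def tokens_to_token(tokens, h, w, k=3, s=2):
    """Reshape a (h*w, c) token sequence to an h x w grid, then merge
    each k x k neighborhood (stride s) into one concatenated token."""
    c = tokens.shape[-1]
    grid = tokens.reshape(h, w, c)
    out_h = (h - k) // s + 1
    out_w = (w - k) // s + 1
    merged = np.stack([
        grid[i * s:i * s + k, j * s:j * s + k].reshape(-1)  # flatten k*k*c patch
        for i in range(out_h) for j in range(out_w)
    ])
    return merged, out_h, out_w

tokens = np.arange(8 * 8 * 4, dtype=float).reshape(64, 4)  # 64 tokens, dim 4
merged, oh, ow = tokens_to_token(tokens, 8, 8)
# Token length shrinks 64 -> 9; embedding dim grows 4 -> 36.
```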
Journal ArticleDOI
Adding Conditional Control to Text-to-Image Diffusion Models
Lvmin Zhang, Maneesh Agrawala, et al.
TL;DR: ControlNet, as discussed by the authors, learns task-specific conditions in an end-to-end way; the learning is robust even when the training dataset is small (<50k), and the model can be trained on personal devices.
Posted Content
LocalViT: Bringing Locality to Vision Transformers
TL;DR: This work adds locality to vision transformers by introducing depth-wise convolution into the feed-forward network, and successfully applies the same locality mechanism to four vision transformers, showing the generalization of the locality concept.
Proceedings ArticleDOI
Scaling Up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs
TL;DR: It is demonstrated that using a few large convolutional kernels instead of a stack of small kernels can be a more powerful paradigm, and RepLKNet is proposed, a pure CNN architecture whose kernel size is as large as 31×31, in contrast to the commonly used 3×3.
Journal ArticleDOI
Remote Sensing Image Change Detection with Transformers
Hao Chen, Zipeng Qi, Zhenwei Shi
TL;DR: The authors propose a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain, where the high-level concepts of the change of interest can be represented by a few visual words.
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the method won 1st place on the ILSVRC 2015 classification task.
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
TL;DR: A deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art image classification performance, as discussed by the authors.
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, showing that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
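The core operation of that architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy rendering of a single head, with no masking and illustrative shapes, follows; this is a sketch of the formula, not the paper's full multi-head implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 queries, dim 8
K = rng.normal(size=(6, 8))  # 6 keys
V = rng.normal(size=(6, 8))  # 6 values
out, w = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, weighted by query-key similarity.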