Transformer in Transformer

Citations

PDF

Open Access

More filters

Posted Content•

Attention Mechanisms in Computer Vision: A Survey.

[...]

Meng-Hao Guo, Tian-Xing Xu, Jiangjiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang, Ralph R. Martin, Ming-Ming Cheng, Shi-Min Hu¹ - Show less +6 more•Institutions (1)

Tsinghua University¹

15 Nov 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: A comprehensive review of attention mechanisms in computer vision can be found in this article, which categorizes them according to approach, such as channel attention, spatial attention, temporal attention and branch attention.

...read moreread less

Abstract: Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them according to approach, such as channel attention, spatial attention, temporal attention and branch attention; a related repository this https URL is dedicated to collecting related work. We also suggest future directions for attention mechanism research.

...read moreread less

243 citations

Posted Content•

Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation.

[...]

Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, Manning Wang - Show less +3 more

12 May 2021-arXiv: Image and Video Processing

TL;DR: Wang et al. as mentioned in this paper proposed a pure transformer-based U-shaped Encoder-Decoder architecture with skip-connections for local-global semantic feature learning for medical image segmentation.

...read moreread less

Abstract: In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. Especially, the deep neural networks based on U-shaped architecture and skip-connections have been widely applied in a variety of medical image tasks. However, although CNN has achieved excellent performance, it cannot learn global and long-range semantic information interaction well due to the locality of the convolution operation. In this paper, we propose Swin-Unet, which is an Unet-like pure Transformer for medical image segmentation. The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture with skip-connections for local-global semantic feature learning. Specifically, we use hierarchical Swin Transformer with shifted windows as the encoder to extract context features. And a symmetric Swin Transformer-based decoder with patch expanding layer is designed to perform the up-sampling operation to restore the spatial resolution of the feature maps. Under the direct down-sampling and up-sampling of the inputs and outputs by 4x, experiments on multi-organ and cardiac segmentation tasks demonstrate that the pure Transformer-based U-shaped Encoder-Decoder network outperforms those methods with full-convolution or the combination of transformer and convolution. The codes and trained models will be publicly available at this https URL.

...read moreread less

34 citations

Posted Content•

A Survey of Transformers.

[...]

Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu

08 Jun 2021-arXiv: Learning

TL;DR: A comprehensive review of various X-formers can be found in this article, where the vanilla Transformer is briefly introduced and then a new taxonomy of X-forms is proposed.

...read moreread less

Abstract: Transformers have achieved great success in many artificial intelligence fields, such as natural language processing, computer vision, and audio processing. Therefore, it is natural to attract lots of interest from academic and industry researchers. Up to the present, a great variety of Transformer variants (a.k.a. X-formers) have been proposed, however, a systematic and comprehensive literature review on these Transformer variants is still missing. In this survey, we provide a comprehensive review of various X-formers. We first briefly introduce the vanilla Transformer and then propose a new taxonomy of X-formers. Next, we introduce the various X-formers from three perspectives: architectural modification, pre-training, and applications. Finally, we outline some potential directions for future research.

...read moreread less

19 citations

Posted Content•

MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens.

[...]

Jiemin Fang, Lingxi Xie, Xinggang Wang, Xiaopeng Zhang, Wenyu Liu, Qi Tian - Show less +2 more

31 May 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Zhang et al. as discussed by the authors proposed a specialized token for each region that serves as a messenger (MSG), which can flexibly exchange visual information across regions and the computational complexity is reduced.

...read moreread less

Abstract: Transformers have offered a new methodology of designing neural networks for visual recognition. Compared to convolutional networks, Transformers enjoy the ability of referring to global features at each stage, yet the attention module brings higher computational overhead that obstructs the application of Transformers to process high-resolution visual data. This paper aims to alleviate the conflict between efficiency and flexibility, for which we propose a specialized token for each region that serves as a messenger (MSG). Hence, by manipulating these MSG tokens, one can flexibly exchange visual information across regions and the computational complexity is reduced. We then integrate the MSG token into a multi-scale architecture named MSG-Transformer. In standard image classification and object detection, MSG-Transformer achieves competitive performance and the inference on both GPU and CPU is accelerated. The code will be available at this https URL.

...read moreread less

8 citations

Posted Content•

Fully Transformer Networks for Semantic Image Segmentation

[...]

Sitong Wu, Tianyi Wu, Fangjian Lin, Shengwei Tian, Guodong Guo - Show less +1 more

08 Jun 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Huang et al. as mentioned in this paper proposed a Fully Transformer Network (FTN) for semantic image segmentation, which is an encoder-decoder based fully transformer network (FTE).

...read moreread less

Abstract: Transformers have shown impressive performance in various natural language processing and computer vision tasks, due to the capability of modeling long-range dependencies. Recent progress has demonstrated to combine such transformers with CNN-based semantic image segmentation models is very promising. However, it is not well studied yet on how well a pure transformer based approach can achieve for image segmentation. In this work, we explore a novel framework for semantic image segmentation, which is encoder-decoder based Fully Transformer Networks (FTN). Specifically, we first propose a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features, while reducing the computation complexity of the standard visual transformer(ViT). Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation. Surprisingly, this simple baseline can achieve new state-of-the-art results on multiple challenging semantic segmentation benchmarks, including PASCAL Context, ADE20K and COCO-Stuff. The source code will be released upon the publication of this work.

...read moreread less

3 citations

Transformer in Transformer

Citations

Related Papers (5)

Trending Questions (2)