(Open Access) Coordinate Attention for Efficient Mobile Network Design. (2021) | Qibin Hou

Citations

Posted Content

DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation.

Ailiang Lin, Bingzhi Chen¹, Jiayu Xu, Zheng Zhang¹, Guangming Lu

Harbin Institute of Technology¹

12 Jun 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Lin et al. propose dual-scale encoder subnetworks based on Swin Transformer to extract coarse and fine-grained feature representations at different semantic scales.

Abstract: Automatic medical image segmentation has made great progress, benefiting from the development of deep learning. However, most existing methods are based on convolutional neural networks (CNNs), which fail to build long-range dependencies and global context connections due to the limited receptive field of the convolution operation. Inspired by the success of Transformer in modeling long-range contextual information, some researchers have expended considerable effort in designing robust variants of Transformer-based U-Net. Moreover, the patch division used in vision transformers usually ignores the pixel-level intrinsic structural features inside each patch. To alleviate these problems, we propose a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet), which might be the first attempt to concurrently incorporate the advantages of hierarchical Swin Transformer into both the encoder and decoder of the standard U-shaped architecture to enhance the semantic segmentation quality of varying medical images. Unlike many prior Transformer-based solutions, the proposed DS-TransUNet first adopts dual-scale encoder subnetworks based on Swin Transformer to extract the coarse and fine-grained feature representations of different semantic scales. As the core component of our DS-TransUNet, a well-designed Transformer Interactive Fusion (TIF) module is proposed to effectively establish global dependencies between features of different scales through the self-attention mechanism. Furthermore, we also introduce the Swin Transformer block into the decoder to further explore the long-range contextual information during the up-sampling process. Extensive experiments across four typical tasks for medical image segmentation demonstrate the effectiveness of DS-TransUNet, and show that our approach significantly outperforms the state-of-the-art methods.

59 citations

Posted Content

DeepViT: Towards Deeper Vision Transformer

Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng

19 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Zhou et al. propose Re-attention, which re-generates the attention maps to increase their diversity at different layers with negligible computation and memory cost, making it feasible to train deeper ViTs with consistent performance improvements via minor modifications to existing ViT models.

Abstract: Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting the expected performance gain. Based on the above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modifications to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet. Code will be made publicly available.

48 citations
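
The DeepViT abstract above describes Re-attention only as re-generating the attention maps to increase their diversity. The following is a minimal pure-Python sketch of one way to read that idea, assuming the maps are re-generated by mixing the per-head attention maps with a small head-by-head matrix. The names `attention_maps`, `re_attention`, and `theta`, and the toy values, are assumptions made for illustration and are not taken from the abstract; any normalization or training of the mixing matrix is omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention_maps(scores):
    """scores[h][i][j]: raw attention logits per head -> row-wise softmax."""
    return [[softmax(row) for row in head] for head in scores]

def re_attention(maps, theta):
    """Mix attention maps across heads (hypothetical sketch).

    new_map[g] = sum over h of theta[h][g] * maps[h], so every output head
    is a weighted combination of all input heads.
    """
    heads = len(maps)
    n = len(maps[0])
    return [
        [
            [sum(theta[h][g] * maps[h][i][j] for h in range(heads))
             for j in range(n)]
            for i in range(n)
        ]
        for g in range(heads)
    ]

# Toy input: 2 heads, 2 tokens (values are illustrative only).
scores = [[[0.0, 1.0], [2.0, 0.5]],
          [[1.0, 1.0], [0.0, 3.0]]]
maps = attention_maps(scores)

# Assumed mixing matrix; each column sums to 1 so rows stay normalized.
theta = [[0.7, 0.2],
         [0.3, 0.8]]
mixed = re_attention(maps, theta)
print(mixed)
```

With an identity `theta` the sketch returns the original maps unchanged, which matches the abstract's claim that the method is a minor modification to existing ViT models; the mixing step adds only a number of weights quadratic in the head count.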

Posted Content

Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition.

Yi-Fan Song, Zhang Zhang, Caifeng Shan, Liang Wang

29 Jun 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Song et al. design a compound scaling strategy that expands the model's width and depth synchronously, obtaining a family of efficient GCN baselines with high accuracy and few trainable parameters, termed EfficientGCN-Bx, where "x" denotes the scaling coefficient.

Abstract: One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the recent State-Of-The-Art (SOTA) models for this task tend to be exceedingly sophisticated and over-parameterized. The low efficiency in model training and inference has increased the validation costs of model architectures in large-scale datasets. To address the above issue, recent advanced separable convolutional layers are embedded into an early fused Multiple Input Branches (MIB) network, constructing an efficient Graph Convolutional Network (GCN) baseline for skeleton-based action recognition. In addition, based on this baseline, we design a compound scaling strategy to expand the model's width and depth synchronously, and eventually obtain a family of efficient GCN baselines with high accuracies and small amounts of trainable parameters, termed EfficientGCN-Bx, where "x" denotes the scaling coefficient. On two large-scale datasets, i.e., NTU RGB+D 60 and 120, the proposed EfficientGCN-B4 baseline outperforms other SOTA methods, e.g., achieving 91.7% accuracy on the cross-subject benchmark of NTU 60 dataset, while being 3.15x smaller and 3.21x faster than MS-G3D, which is one of the best SOTA methods. The source code in PyTorch version and the pretrained models are available at this https URL.

41 citations

Posted Content

FcaNet: Frequency Channel Attention Networks

Zequn Qin¹, Pengyi Zhang², Fei Wu², Xi Li²

Northwestern Polytechnical University¹, Zhejiang University²

22 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: Based on frequency analysis, the authors mathematically prove that conventional GAP is a special case of feature decomposition in the frequency domain and propose FcaNet with a novel multi-spectral channel attention.

Abstract: Attention mechanism, especially channel attention, has gained great success in the computer vision field. Many works focus on how to design efficient channel attention mechanisms while ignoring a fundamental problem, i.e., using global average pooling (GAP) as the unquestionable pre-processing method. In this work, we start from a different view and rethink channel attention using frequency analysis. Based on the frequency analysis, we mathematically prove that the conventional GAP is a special case of the feature decomposition in the frequency domain. With the proof, we naturally generalize the pre-processing of channel attention mechanism in the frequency domain and propose FcaNet with novel multi-spectral channel attention. The proposed method is simple but effective. We can change only one line of code in the calculation to implement our method within existing channel attention methods. Moreover, the proposed method achieves state-of-the-art results compared with other channel attention methods on image classification, object detection, and instance segmentation tasks. Our method could improve by 1.8% in terms of Top-1 accuracy on ImageNet compared with the baseline SENet-50, with the same number of parameters and the same computational cost. Our code and models are publicly available at this https URL

19 citations
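
The FcaNet abstract above states that global average pooling (GAP) is a special case of feature decomposition in the frequency domain. A small numerical check of that statement, assuming an unnormalized 2D DCT-II basis: at frequency (0, 0) every basis value is 1, so the coefficient equals the sum of the feature map, i.e. H·W times GAP. The helper names and the per-channel frequency assignment in `multi_spectral_descriptor` are hypothetical and only illustrate what "multi-spectral" could mean here.

```python
import math

def dct2_coefficient(x, u, v):
    """Unnormalized 2D DCT-II coefficient of an H x W map at frequency (u, v)."""
    H, W = len(x), len(x[0])
    total = 0.0
    for i in range(H):
        for j in range(W):
            basis = (math.cos(math.pi * u * (i + 0.5) / H)
                     * math.cos(math.pi * v * (j + 0.5) / W))
            total += x[i][j] * basis
    return total

def global_average_pool(x):
    """Mean of all spatial positions of one channel."""
    H, W = len(x), len(x[0])
    return sum(sum(row) for row in x) / (H * W)

def multi_spectral_descriptor(channels, frequencies):
    """One scalar per channel, each taken at its own (u, v) frequency.

    The channel-to-frequency assignment is an assumption for this sketch.
    """
    return [dct2_coefficient(ch, u, v)
            for ch, (u, v) in zip(channels, frequencies)]

# Toy 2 x 3 feature map (values are illustrative only).
feature_map = [[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]]

lowest = dct2_coefficient(feature_map, 0, 0)
gap = global_average_pool(feature_map)
print(lowest, 2 * 3 * gap)  # the two values agree up to rounding

# Two channels, the first pooled at (0, 0) and the second at (0, 1).
descriptor = multi_spectral_descriptor([feature_map, feature_map],
                                       [(0, 0), (0, 1)])
print(descriptor)
```

Swapping the GAP call for a DCT coefficient at a chosen frequency is a small change in code, which is consistent with the abstract's remark that the method can be implemented by changing one line in existing channel attention methods.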

Journal Article

Monocular 3D multi-person pose estimation via predicting factorized correction factors

Yu Guo¹, Lichen Ma¹, Zhi Li², Xuan Wang³, Fei Wang¹

Xi'an Jiaotong University¹, Max Planck Society², Tencent³

01 Dec 2021-Computer Vision and Image Understanding

TL;DR: The proposed pipeline consists of human detection, absolute 3D human root localization, and root-relative 3D single-person pose estimation modules; a data augmentation strategy is presented to tackle occlusions, so that the model can effectively estimate the root localization from incomplete bounding boxes.

6 citations
