Open Access
Posted Content

Coordinate Attention for Efficient Mobile Network Design.

TLDR
The proposed coordinate attention factorizes channel attention into two 1D feature encoding processes that aggregate features along the two spatial directions, so that long-range dependencies are captured along one spatial direction while precise positional information is preserved along the other.
Abstract
Recent studies on mobile network design have demonstrated the remarkable effectiveness of channel attention (e.g., the Squeeze-and-Excitation attention) for lifting model performance, but they generally neglect the positional information, which is important for generating spatially selective attention maps. In this paper, we propose a novel attention mechanism for mobile networks by embedding positional information into channel attention, which we call "coordinate attention". Unlike channel attention that transforms a feature tensor to a single feature vector via 2D global pooling, the coordinate attention factorizes channel attention into two 1D feature encoding processes that aggregate features along the two spatial directions, respectively. In this way, long-range dependencies can be captured along one spatial direction and meanwhile precise positional information can be preserved along the other spatial direction. The resulting feature maps are then encoded separately into a pair of direction-aware and position-sensitive attention maps that can be complementarily applied to the input feature map to augment the representations of the objects of interest. Our coordinate attention is simple and can be flexibly plugged into classic mobile networks, such as MobileNetV2, MobileNeXt, and EfficientNet with nearly no computational overhead. Extensive experiments demonstrate that our coordinate attention is not only beneficial to ImageNet classification but more interestingly, behaves better in down-stream tasks, such as object detection and semantic segmentation. Code is available at this https URL.
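As a rough illustration of the mechanism described in the abstract, below is a minimal PyTorch sketch of a coordinate attention block. The module name, reduction ratio, and ReLU nonlinearity are assumptions made for illustration; the authors' released code at the linked URL is the authoritative implementation.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch: pool along H and W separately, encode the two descriptors
    jointly, then split into two direction-aware attention maps."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)   # assumed nonlinearity for this sketch
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.size()
        # 1D average pooling along each spatial direction
        x_h = x.mean(dim=3, keepdim=True)                       # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (n, c, w, 1)
        # joint encoding of the two directional descriptors
        y = torch.cat([x_h, x_w], dim=2)                        # (n, c, h + w, 1)
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        y_w = y_w.permute(0, 1, 3, 2)
        # two direction-aware, position-sensitive attention maps
        a_h = torch.sigmoid(self.conv_h(y_h))                   # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w))                   # (n, c, 1, w)
        return x * a_h * a_w

# usage: out = CoordinateAttention(64)(torch.randn(2, 64, 56, 56))
```

Because the two attention maps broadcast along complementary axes, the block can be dropped after any convolutional stage of a mobile backbone with negligible extra cost.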


Citations
Posted Content

DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation.

TL;DR: The authors propose dual-scale encoder subnetworks based on the Swin Transformer to extract coarse- and fine-grained feature representations at different semantic scales.
Posted Content

DeepViT: Towards Deeper Vision Transformer

TL;DR: The authors propose to re-generate the attention maps to increase their diversity at different layers, at negligible computation and memory cost, making it feasible to train deeper ViTs with consistent performance improvements via minor modifications to existing ViT models.
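To make the re-attention idea concrete, here is a hedged PyTorch sketch that mixes the per-head attention maps with a learnable head-to-head matrix followed by a normalization. The parameter names, initialization, and the choice of BatchNorm here are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Sketch: re-generate attention maps by mixing information across heads."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # learnable head-to-head transformation (initialized near identity)
        self.theta = nn.Parameter(torch.eye(num_heads) +
                                  0.01 * torch.randn(num_heads, num_heads))
        self.norm = nn.BatchNorm2d(num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, d // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)                    # (b, heads, n, n)
        # mix attention maps across heads to increase their diversity
        attn = torch.einsum('hg,bgij->bhij', self.theta, attn)
        attn = self.norm(attn)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```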
Posted Content

Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition.

TL;DR: The authors design a compound scaling strategy that expands the model's width and depth synchronously, yielding a family of efficient GCN baselines, indexed by a scaling coefficient x, with high accuracy and few trainable parameters.
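For intuition only, a generic compound-scaling helper is sketched below; the EfficientNet-style growth factors alpha and beta are placeholders and are not taken from the cited paper.

```python
def compound_scale(base_width, base_depth, x, alpha=1.2, beta=1.35):
    """Generic sketch: grow depth and width together, controlled by a single
    scaling coefficient x (alpha/beta are illustrative placeholders)."""
    depth = round(base_depth * alpha ** x)
    width = round(base_width * beta ** x)
    return width, depth

# usage: width, depth = compound_scale(base_width=64, base_depth=4, x=2)
```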
Posted Content

FcaNet: Frequency Channel Attention Networks

TL;DR: Based on frequency analysis, the authors mathematically prove that conventional global average pooling (GAP) is a special case of feature decomposition in the frequency domain and propose FcaNet with a novel multi-spectral channel attention.
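A hedged sketch of multi-spectral channel attention follows: GAP is replaced by weighted spatial sums against a few 2D DCT basis functions (the (0, 0) frequency reduces to GAP up to a constant). The particular frequency indices and reduction ratio are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

def dct_basis_2d(u, v, h, w):
    """2D DCT-II basis function with frequency indices (u, v) on an h x w grid."""
    ys = torch.arange(h).float()
    xs = torch.arange(w).float()
    by = torch.cos(math.pi * (ys + 0.5) * u / h)
    bx = torch.cos(math.pi * (xs + 0.5) * v / w)
    return by[:, None] * bx[None, :]                   # (h, w)

class MultiSpectralChannelAttention(nn.Module):
    """Sketch: per-group DCT 'pooling' followed by SE-style channel gating."""
    def __init__(self, channels, h, w,
                 freqs=((0, 0), (0, 1), (1, 0), (1, 1)), reduction=16):
        super().__init__()
        assert channels % len(freqs) == 0
        basis = torch.stack([dct_basis_2d(u, v, h, w) for u, v in freqs])  # (g, h, w)
        self.register_buffer('basis', basis)
        self.group = channels // len(freqs)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        xg = x.view(b, self.basis.size(0), self.group, h, w)
        # frequency-domain "pooling": weighted spatial sum per channel group
        z = (xg * self.basis[None, :, None]).sum(dim=(-2, -1)).view(b, c)
        return x * self.fc(z).view(b, c, 1, 1)
```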
Journal Article

Monocular 3D multi-person pose estimation via predicting factorized correction factors

TL;DR: A pipeline consisting of human detection, absolute 3D human root localization, and root-relative 3D single-person pose estimation modules is presented, along with a data augmentation strategy for tackling occlusions, so that the model can effectively estimate root localization from incomplete bounding boxes.
References
Journal Article

Squeeze-and-Excitation Networks

TL;DR: This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
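For reference, a minimal PyTorch sketch of an SE block is shown below; the reduction ratio of 16 is the conventional default rather than a requirement.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling ("squeeze") followed by
    a two-layer bottleneck ("excitation") that rescales each channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: (b, c)
        w = self.fc(s).view(b, c, 1, 1)   # excitation: per-channel weights
        return x * w
```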
Posted Content

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy, and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes, and large-scale geo-localization.
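As an illustration of the building block behind MobileNets, here is a sketch of a depthwise separable convolution with a width multiplier alpha (one of the two hyper-parameters); the resolution multiplier is applied to the input image and is not shown here.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1, alpha=1.0):
    """MobileNet-style block: a depthwise 3x3 conv followed by a pointwise
    1x1 conv; alpha is the width multiplier that thins every layer."""
    in_ch, out_ch = int(in_ch * alpha), int(out_ch * alpha)
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
```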
Proceedings Article

Pyramid Scene Parsing Network

TL;DR: This paper exploits global context information through different-region-based context aggregation with a pyramid pooling module in the proposed pyramid scene parsing network (PSPNet), producing good-quality results on the scene parsing task.
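A minimal sketch of the pyramid pooling module, assuming the commonly used bin sizes (1, 2, 3, 6):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Sketch: pool the feature map to several coarse grids, project with
    1x1 convs, upsample, and concatenate with the input features."""
    def __init__(self, channels, bins=(1, 2, 3, 6)):
        super().__init__()
        out = channels // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(channels, out, 1, bias=False),
                          nn.BatchNorm2d(out), nn.ReLU(inplace=True))
            for b in bins])

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x] + [F.interpolate(stage(x), size=(h, w),
                                     mode='bilinear', align_corners=False)
                       for stage in self.stages]
        return torch.cat(feats, dim=1)   # global context concatenated with local features
```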
Proceedings Article

MobileNetV2: Inverted Residuals and Linear Bottlenecks

TL;DR: MobileNetV2 is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers, while the intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity.
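A minimal PyTorch sketch of the inverted residual block with a linear bottleneck, assuming the standard expansion factor of 6:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch: 1x1 expansion, depthwise 3x3, and a linear (no activation) 1x1
    projection back to a thin bottleneck; the shortcut links the thin ends."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        mid = in_ch * expand
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))  # linear bottleneck

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y
```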