Open Access · Posted Content
Coordinate Attention for Efficient Mobile Network Design.
TLDR
The coordinate attention proposed by the authors factorizes channel attention into two 1D feature encoding processes that aggregate features along the two spatial directions, capturing long-range dependencies along one spatial direction while preserving precise positional information along the other.
Abstract:
Recent studies on mobile network design have demonstrated the remarkable effectiveness of channel attention (e.g., the Squeeze-and-Excitation attention) for lifting model performance, but they generally neglect positional information, which is important for generating spatially selective attention maps. In this paper, we propose a novel attention mechanism for mobile networks by embedding positional information into channel attention, which we call "coordinate attention". Unlike channel attention, which transforms a feature tensor to a single feature vector via 2D global pooling, coordinate attention factorizes channel attention into two 1D feature encoding processes that aggregate features along the two spatial directions, respectively. In this way, long-range dependencies can be captured along one spatial direction while precise positional information is preserved along the other. The resulting feature maps are then encoded separately into a pair of direction-aware and position-sensitive attention maps that can be complementarily applied to the input feature map to augment the representations of the objects of interest. Our coordinate attention is simple and can be flexibly plugged into classic mobile networks, such as MobileNetV2, MobileNeXt, and EfficientNet, with nearly no computational overhead. Extensive experiments demonstrate that our coordinate attention is not only beneficial to ImageNet classification but, more interestingly, behaves better in downstream tasks such as object detection and semantic segmentation. Code is available at this https URL.
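As a hedged illustration of the mechanism the abstract describes, below is a minimal PyTorch sketch of coordinate attention; the module name, reduction ratio, and layer choices are assumptions for exposition, not the authors' released code.

import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    # Sketch: two 1D pooled encodings -> a pair of direction-aware attention maps.
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        # Shared 1x1 conv encodes the concatenated directional features.
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        # Separate 1x1 convs produce height-wise and width-wise attention maps.
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # 1D pooling along each spatial direction instead of 2D global pooling.
        x_h = x.mean(dim=3, keepdim=True)                      # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                       # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # (n, c, 1, w)
        # The two maps are applied complementarily to the input feature map.
        return x * a_h * a_w

The two attention maps broadcast over width and height respectively, so the extra cost is a handful of 1x1 convolutions, consistent with the "nearly no computational overhead" claim above.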
Citations
Posted Content
DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation.
TL;DR: The authors propose dual-scale encoder subnetworks based on Swin Transformer that extract coarse- and fine-grained feature representations at different semantic scales.
Posted Content
DeepViT: Towards Deeper Vision Transformer
Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng +7 more
TL;DR: Zhou et al. propose to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost, which makes it feasible to train deeper ViTs with consistent performance improvements via minor modifications to existing ViT models.
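A hedged sketch of that re-attention idea, assuming the regeneration is a learnable mixing of the per-head attention maps (the parameter name theta and the normalization choice are illustrative, not the paper's exact code):

import torch
import torch.nn as nn

class ReAttention(nn.Module):
    # Sketch: mix per-head attention maps with a learnable head-to-head matrix.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Learnable cross-head transformation that re-generates the attention maps.
        self.theta = nn.Parameter(torch.eye(num_heads) + 0.01 * torch.randn(num_heads, num_heads))
        self.norm = nn.BatchNorm2d(num_heads)  # normalizes the mixed maps
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, d // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (b, heads, n, d_head)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        # Re-attention: each new head map is a linear combination of all head maps,
        # which increases map diversity across layers at negligible cost.
        attn = self.norm(torch.einsum('hg,bgij->bhij', self.theta, attn))
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)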
Posted Content
Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition.
TL;DR: The authors design a compound scaling strategy to expand the model's width and depth synchronously, eventually obtaining a family of efficient GCN baselines with high accuracy and small numbers of trainable parameters, where x denotes the scaling coefficient in each baseline's name.
Posted Content
FcaNet: Frequency Channel Attention Networks
TL;DR: Based on frequency analysis, the authors mathematically prove that conventional global average pooling (GAP) is a special case of feature decomposition in the frequency domain and propose FcaNet with a novel multi-spectral channel attention.
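To make the claim concrete: the (0, 0) 2D DCT basis is constant, so pooling with it reproduces GAP up to a scale factor, and pooling different channel groups with different frequencies yields the multi-spectral variant. A hedged PyTorch sketch follows; the frequency set and reduction ratio are assumptions (the paper selects frequencies empirically):

import math
import torch
import torch.nn as nn

def dct_basis(u, v, h, w):
    # 2D DCT basis; (u, v) = (0, 0) is constant, i.e. GAP up to scale.
    ys = torch.arange(h).float()
    xs = torch.arange(w).float()
    by = torch.cos(math.pi * u * (ys + 0.5) / h)
    bx = torch.cos(math.pi * v * (xs + 0.5) / w)
    return by[:, None] * bx[None, :]  # (h, w)

class MultiSpectralAttention(nn.Module):
    # Sketch: channel groups pooled with different DCT frequencies, then SE-style gating.
    def __init__(self, channels, h, w, freqs=((0, 0), (0, 1), (1, 0), (1, 1)), reduction=16):
        super().__init__()
        assert channels % len(freqs) == 0
        basis = torch.stack([dct_basis(u, v, h, w) for u, v in freqs])  # (f, h, w)
        self.register_buffer('basis', basis)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        n, c, h, w = x.shape
        f = self.basis.shape[0]
        xg = x.view(n, f, c // f, h, w)
        # Frequency pooling: weighted spatial sum with each group's DCT basis.
        pooled = (xg * self.basis[None, :, None]).sum(dim=(-2, -1)).reshape(n, c)
        return x * self.fc(pooled).view(n, c, 1, 1)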
Journal ArticleDOI
Monocular 3D multi-person pose estimation via predicting factorized correction factors
TL;DR: A pipeline consisting of human detection, absolute 3D human root localization, and root-relative 3D single-person pose estimation modules is presented, together with a data augmentation strategy that tackles occlusions so the model can effectively estimate the root localization from incomplete bounding boxes.
References
Journal ArticleDOI
Squeeze-and-Excitation Networks
TL;DR: This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
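The SE block is compact enough to sketch directly. A minimal PyTorch version of the squeeze (global pooling) and excitation (bottleneck gating) steps, with the reduction ratio as the usual hyper-parameter:

import torch.nn as nn

class SEBlock(nn.Module):
    # Sketch of Squeeze-and-Excitation: global pooling -> bottleneck MLP -> channel gates.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: 2D global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())  # excitation: per-channel gates in (0, 1)

    def forward(self, x):
        n, c, _, _ = x.shape
        gates = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * gates  # recalibrate channel-wise responses

Note that the squeeze step discards all spatial structure, which is exactly the positional information the coordinate attention above sets out to retain.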
Posted Content
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Andrew Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, M. Andreetto, Hartwig Adam +7 more
TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy, and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes, and large-scale geo-localization.
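A sketch of the layer those hyper-parameters act on: MobileNets are built from depthwise separable convolutions, and the width multiplier (alpha below) uniformly thins every layer, while the second hyper-parameter, the resolution multiplier, simply scales the input image. The function name and defaults are illustrative:

import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1, alpha=1.0):
    # Depthwise 3x3 followed by pointwise 1x1; alpha is the width multiplier.
    in_ch, out_ch = int(in_ch * alpha), int(out_ch * alpha)
    return nn.Sequential(
        # Depthwise: one 3x3 filter per channel (groups == channels).
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        # Pointwise: 1x1 conv mixes information across channels.
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

# Example: a block thinned to 75% width: depthwise_separable(32, 64, stride=2, alpha=0.75)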
Proceedings ArticleDOI
Pyramid Scene Parsing Network
TL;DR: This paper exploits global context information through different-region-based context aggregation with a pyramid pooling module, which, together with the proposed pyramid scene parsing network (PSPNet), produces high-quality results on the scene parsing task.
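A hedged sketch of the pyramid pooling module as described; the bin sizes (1, 2, 3, 6) follow the paper, while the projection width is an assumption:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    # Sketch: pool at several grid sizes, project, upsample, and concatenate.
    def __init__(self, channels, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(channels, channels // len(bins), 1, bias=False),
                          nn.BatchNorm2d(channels // len(bins)),
                          nn.ReLU(inplace=True))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[2:]
        # Each stage aggregates context over a different sub-region granularity.
        feats = [F.interpolate(stage(x), size=(h, w), mode='bilinear', align_corners=False)
                 for stage in self.stages]
        return torch.cat([x] + feats, dim=1)  # doubles the channel count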
Proceedings Article
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala +20 more
TL;DR: This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
Proceedings ArticleDOI
MobileNetV2: Inverted Residuals and Linear Bottlenecks
TL;DR: MobileNetV2 is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers, while the intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity.
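A hedged sketch of that block: expand with a 1x1 convolution, filter with a 3x3 depthwise convolution, then project linearly back to a thin bottleneck, with the shortcut connecting the bottlenecks. The expansion factor of 6 follows the paper; the class name is illustrative:

import torch.nn as nn

class InvertedResidual(nn.Module):
    # Sketch: expand (1x1) -> depthwise (3x3) -> linear project (1x1).
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        mid = in_ch * expand
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            # Linear bottleneck: no non-linearity after the projection.
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out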