Open Access · Posted Content

Coordinate Attention for Efficient Mobile Network Design.

TLDR
Coordinate attention, as discussed by the authors, factorizes channel attention into two 1D feature encoding processes that aggregate features along the two spatial directions, capturing long-range dependencies along one spatial direction while preserving precise positional information along the other.
Abstract
Recent studies on mobile network design have demonstrated the remarkable effectiveness of channel attention (e.g., the Squeeze-and-Excitation attention) for lifting model performance, but they generally neglect positional information, which is important for generating spatially selective attention maps. In this paper, we propose a novel attention mechanism for mobile networks by embedding positional information into channel attention, which we call "coordinate attention". Unlike channel attention, which transforms a feature tensor to a single feature vector via 2D global pooling, coordinate attention factorizes channel attention into two 1D feature encoding processes that aggregate features along the two spatial directions, respectively. In this way, long-range dependencies can be captured along one spatial direction while precise positional information is preserved along the other. The resulting feature maps are then encoded separately into a pair of direction-aware and position-sensitive attention maps that can be complementarily applied to the input feature map to augment the representations of the objects of interest. Our coordinate attention is simple and can be flexibly plugged into classic mobile networks, such as MobileNetV2, MobileNeXt, and EfficientNet, with nearly no computational overhead. Extensive experiments demonstrate that our coordinate attention is not only beneficial to ImageNet classification but, more interestingly, also behaves better in downstream tasks, such as object detection and semantic segmentation. Code is available at this https URL.
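The mechanism in the abstract maps naturally to a few lines of code. Below is a minimal PyTorch sketch of a coordinate attention block: the two directional poolings, the split into a pair of attention maps, and the final elementwise reweighting follow the abstract, while the shared 1x1 convolution with a reduction ratio `r` and the h-swish activation are assumptions based on common practice, not details stated on this page.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of coordinate attention as described in the abstract:
    two 1D poolings (along H and along W), a shared transform, and a
    pair of direction-aware sigmoid attention maps. The reduction
    ratio `r` and h-swish activation are assumed, common choices."""
    def __init__(self, channels, r=32):
        super().__init__()
        mid = max(8, channels // r)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # 1D feature encoding: pool along one spatial direction at a time.
        x_h = x.mean(dim=3, keepdim=True)                       # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (n, c, w, 1)
        # Shared transform over the concatenated directional features.
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = y.split([h, w], dim=2)
        # Direction-aware, position-sensitive attention maps.
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        # Complementary application to the input feature map.
        return x * a_h * a_w
```

A block like `CoordinateAttention(64)` can be dropped after any feature map of shape `(N, 64, H, W)`, e.g. inside a MobileNetV2 inverted residual, which is what makes the mechanism pluggable.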


Citations
Posted Content

DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation.

TL;DR: Wang et al., as discussed by the authors, proposed dual-scale encoder subnetworks based on the Swin Transformer to extract coarse- and fine-grained feature representations at different semantic scales.
Posted Content

DeepViT: Towards Deeper Vision Transformer

TL;DR: Zhang et al., as mentioned in this paper, propose re-generating the attention maps to increase their diversity at different layers with negligible computation and memory cost, making it feasible to train deeper ViTs with consistent performance improvements via minor modifications to existing ViT models.
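The summary above hinges on one operation: mixing per-head attention maps so that deeper layers stay diverse. A minimal sketch of that idea follows; the identity initialization and the simple row re-normalization are illustrative choices, not details from the paper.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Minimal sketch of the DeepViT idea of re-generating attention
    maps: blend the per-head maps with a learnable head-to-head
    matrix. Shapes and normalization here are assumptions."""
    def __init__(self, num_heads):
        super().__init__()
        self.theta = nn.Parameter(torch.eye(num_heads))  # (H, H) mixing matrix

    def forward(self, attn):
        # attn: (batch, heads, tokens, tokens) softmax-normalized maps.
        mixed = torch.einsum('gh,bhij->bgij', self.theta, attn)
        # Re-normalize rows so each token's weights still sum to 1
        # (an illustrative choice; the paper uses its own normalization).
        return mixed / mixed.sum(dim=-1, keepdim=True).clamp_min(1e-6)
```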
Posted Content

Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition.

TL;DR: Wang et al., as discussed by the authors, designed a compound scaling strategy to expand the model's width and depth synchronously, eventually obtaining a family of efficient GCN baselines with high accuracy and small numbers of trainable parameters, where "x" denotes the scaling coefficient.
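As a rough illustration of what a single scaling coefficient buys, here is a hedged sketch of compound scaling; the growth factors `alpha` and `beta` are placeholders for illustration, not the paper's values.

```python
def compound_scale(base_width, base_depth, x, alpha=1.2, beta=1.35):
    """Illustrative compound scaling: expand width and depth
    synchronously from one coefficient x. The growth factors are
    assumed placeholders, not values taken from the paper."""
    width = int(round(base_width * alpha ** x))
    depth = int(round(base_depth * beta ** x))
    return width, depth
```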
Posted Content

FcaNet: Frequency Channel Attention Networks

TL;DR: Based on frequency analysis, the authors mathematically prove that conventional global average pooling (GAP) is a special case of feature decomposition in the frequency domain and propose FcaNet with a novel multi-spectral channel attention.
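The GAP-as-special-case claim is easy to verify numerically: the (0, 0) basis function of the 2D DCT is constant, so its coefficient is the channel mean up to a fixed scale. A small NumPy/SciPy check (my illustration, not code from the paper):

```python
import numpy as np
from scipy.fft import dctn

x = np.random.rand(7, 5)              # one single-channel feature map
lowest = dctn(x, norm='ortho')[0, 0]  # DCT-II coefficient at frequency (0, 0)
gap = x.mean()
# The (0, 0) DCT basis is constant, so the lowest-frequency coefficient
# equals the mean up to a fixed scale (sqrt(H*W) under 'ortho' scaling).
assert np.isclose(lowest, gap * np.sqrt(x.size))
```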
Journal Article · DOI

Monocular 3D multi-person pose estimation via predicting factorized correction factors

TL;DR: A pipeline consisting of human detection, absolute 3D human root localization, and root-relative 3D single-person pose estimation modules is presented, along with a data augmentation strategy to tackle occlusions, so that the model can effectively estimate the root localization even with incomplete bounding boxes.
References
Proceedings Article · DOI

MnasNet: Platform-Aware Neural Architecture Search for Mobile

TL;DR: In this article, the authors propose an automated mobile neural architecture search (MNAS) approach, which explicitly incorporates model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency.
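The latency-aware objective can be sketched in a few lines; the weighted-product form and the exponent of -0.07 follow the commonly cited setting from the MnasNet paper, so treat them as assumptions rather than details from this page.

```python
def mnas_reward(acc, latency_ms, target_ms, alpha=-0.07, beta=-0.07):
    """Sketch of a latency-aware search objective in the spirit of
    MnasNet: soften the accuracy reward by how far measured latency
    deviates from the target. alpha/beta = -0.07 are the commonly
    cited values; treat them as assumptions."""
    w = alpha if latency_ms <= target_ms else beta
    return acc * (latency_ms / target_ms) ** w
```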
Proceedings Article

Recurrent Models of Visual Attention

TL;DR: A novel recurrent neural network model is presented that extracts information from an image or video by adaptively selecting a sequence of regions or locations and processing only the selected regions at high resolution.
Proceedings Article · DOI

Semantic contours from inverse detectors

TL;DR: A simple yet effective method for combining generic object detectors with bottom-up contours to identify object contours is presented, and a principled way of combining information from different part detectors and across categories is provided.
Book Chapter · DOI

Progressive Neural Architecture Search

TL;DR: In this article, a sequential model-based optimization (SMBO) strategy is proposed to search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space.
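A schematic of that loop, with `expand`, `evaluate`, and `surrogate` as hypothetical callables standing in for cell expansion, proxy training, and the learned predictor:

```python
def smbo_search(expand, evaluate, surrogate, seeds, steps, k):
    """Sketch of sequential model-based optimization as summarized
    above: grow architectures in order of increasing complexity and
    use a learned surrogate to pick which candidates to train.
    `expand`, `evaluate`, `surrogate` are hypothetical callables."""
    history = [(a, evaluate(a)) for a in seeds]  # train the simplest models first
    surrogate.fit(history)
    population = list(seeds)
    for _ in range(steps):
        candidates = [c for a in population for c in expand(a)]  # grow complexity
        candidates.sort(key=surrogate.predict, reverse=True)     # cheap ranking
        population = candidates[:k]                              # train only top-k
        history += [(a, evaluate(a)) for a in population]
        surrogate.fit(history)                                   # refit predictor
    return max(history, key=lambda pair: pair[1])
```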
Proceedings Article

ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware

TL;DR: ProxylessNAS is presented, which can directly learn architectures for large-scale target tasks and target hardware platforms; the authors apply ProxylessNAS to specialize neural architectures for hardware using direct hardware metrics (e.g., latency) and provide insights for efficient CNN architecture design.
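One way such direct metrics can enter training, sketched under the assumption of a pre-profiled per-op latency table, is an expected-latency term that stays differentiable in the architecture parameters:

```python
import torch

def expected_latency(op_probs, op_latencies):
    """Sketch of making latency differentiable: model a layer's
    latency as the expectation over candidate ops, so a latency term
    can be added to the training loss. The per-op latency table is
    an assumed, pre-profiled lookup, not data from this page."""
    # op_probs: (num_ops,) architecture-parameter softmax for one layer.
    # op_latencies: (num_ops,) measured latency of each candidate op.
    return torch.dot(op_probs, op_latencies)
```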