Open Access Proceedings Article
Deformable DETR: Deformable Transformers for End-to-End Object Detection
TL;DR: Deformable DETR restricts each attention module to a small set of key sampling points around a reference point, and achieves better performance than DETR with 10× fewer training epochs.
Abstract:
DETR was recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, owing to the limitations of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules attend only to a small set of key sampling points around a reference point. Deformable DETR achieves better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code shall be released.
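The sampling idea in the abstract can be illustrated with a minimal sketch. It is a deliberately simplified single-head, single-scale version in NumPy, not the authors' implementation: the function names (`bilinear_sample`, `deformable_attention`), the projection matrices (`W_off`, `W_attn`, `W_val`), and the use of pixel-unit reference coordinates are all assumptions made for illustration. The query predicts K sampling offsets and K attention weights, and the output is a weighted sum of K bilinearly interpolated values instead of a sum over every pixel.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample feat (H, W, C) at continuous pixel coords (x, y).
    Positions outside the map contribute zero (zero padding)."""
    H, W, C = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = np.zeros(C)
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < W and 0 <= yi < H:
                w = (1 - abs(x - xi)) * (1 - abs(y - yi))
                out += w * feat[yi, xi]
    return out

def deformable_attention(query, feat, ref_xy, W_off, W_attn, W_val):
    """Hypothetical single-head, single-scale sketch.
    query: (C,)  feat: (H, W, C)  ref_xy: (2,) reference point in pixels
    W_off: (C, 2K)  W_attn: (C, K)  W_val: (C, C_out)"""
    K = W_attn.shape[1]
    # Offsets and attention weights are both predicted from the query alone.
    offsets = (query @ W_off).reshape(K, 2)
    logits = query @ W_attn
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()              # softmax over the K sampling points
    values = feat @ W_val                 # project the feature map once
    out = np.zeros(W_val.shape[1])
    for k in range(K):
        x, y = ref_xy + offsets[k]        # sampling location near the reference
        out += weights[k] * bilinear_sample(values, x, y)
    return out, weights, offsets

# Usage with random, untrained projections (illustrative values only).
rng = np.random.default_rng(0)
C, K = 8, 4
feat = rng.normal(size=(16, 16, C))
query = rng.normal(size=C)
out, weights, offsets = deformable_attention(
    query, feat, np.array([5.0, 7.0]),
    0.5 * rng.normal(size=(C, 2 * K)),
    rng.normal(size=(C, K)),
    rng.normal(size=(C, C)),
)
print(out.shape, weights.sum())
```

In this sketch the cost per query scales with K rather than with the number of pixels H·W, which is the property the abstract relies on when it contrasts the approach with standard Transformer attention over image feature maps.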
Citations
Posted Content
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.
TL;DR: This work proposes Swin Transformer, a hierarchical vision Transformer whose representation is computed with shifted windows to address the differences between the language and vision domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.
Posted Content
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
TL;DR: Wang et al. propose Pyramid Vision Transformer (PVT), a simple convolution-free backbone network useful for many dense prediction tasks, and report state-of-the-art performance on the COCO dataset.
Posted Content
An Attentive Survey of Attention Models
TL;DR: This survey proposes a taxonomy that groups existing attention-model techniques into coherent categories and describes how attention has been used to improve the interpretability of neural networks.
Posted Content
Attention Mechanisms in Computer Vision: A Survey.
Meng-Hao Guo, Tian-Xing Xu, Jiangjiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang, Ralph R. Martin, Ming-Ming Cheng, Shi-Min Hu
TL;DR: A comprehensive review of attention mechanisms in computer vision can be found in this article, which categorizes them according to approach, such as channel attention, spatial attention, temporal attention and branch attention.
Journal Article
Remote Sensing Image Change Detection with Transformers
Hao Chen, Zipeng Qi, Zhenwei Shi
TL;DR: Chen et al. propose a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain, in which the high-level concepts of the change of interest can be represented by a few visual words.
References
Proceedings Article
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
TL;DR: This work expresses self-attention as a linear dot-product of kernel feature maps and uses the associativity property of matrix products to reduce the complexity.
Proceedings Article
Sparse Sinkhorn Attention
TL;DR: This work introduces a meta sorting network that learns to generate latent permutations over sequences and is able to compute quasi-global attention with only local windows, improving the memory efficiency of the attention module.
Book Chapter
Deep Feature Pyramid Reconfiguration for Object Detection
TL;DR: This work reformulates feature pyramid construction as a feature reconfiguration process and proposes a novel reconfiguration architecture that combines low-level representations with high-level semantic features in a highly nonlinear yet efficient way.
Journal Article
Efficient Content-Based Sparse Attention with Routing Transformers
TL;DR: The Routing Transformer learns dynamic sparse attention patterns that avoid allocating computation and memory to content unrelated to the query of interest, by combining self-attention with a sparse routing module based on online k-means.
Proceedings Article
Revisiting the Sibling Head in Object Detector
TL;DR: In this article, a simple operator called task-aware spatial disentanglement (TSD) is proposed to solve the spatial misalignment between the two object functions in the sibling head.