Open Access · Proceedings Article
Deformable DETR: Deformable Transformers for End-to-End Object Detection
TL;DR: Deformable DETR restricts each attention module to a small set of key sampling points around a reference, and achieves better performance than DETR with 10× fewer training epochs.
Abstract:
DETR was recently proposed to eliminate the need for many hand-designed components in object detection while still demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, both caused by the limitations of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules attend only to a small set of key sampling points around a reference. Deformable DETR achieves better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code shall be released.
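The idea of attending only to a small set of key sampling points around a reference can be illustrated with a minimal sketch. The abstract states only that the attention is restricted to such points. Everything else here is an assumption made for illustration: a single head and a single feature scale, offsets and attention logits predicted from the query by linear projections, bilinear interpolation at fractional positions, and all function and parameter names (`deformable_attention`, `W_off`, `W_attn`, `W_val`).

```python
import numpy as np


def bilinear_sample(feat, x, y):
    """Sample feat (H, W, C) at continuous pixel coords (x, y); zero outside the map."""
    H, W, _ = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = np.zeros(feat.shape[-1])
    # Blend the four neighbouring pixels with bilinear weights.
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < W and 0 <= yi < H:
                out += wx * wy * feat[yi, xi]
    return out


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def deformable_attention(query, ref_xy, feat, W_off, W_attn, W_val):
    """Single-head, single-scale sketch (hypothetical names and shapes).

    query:  (C,)      query feature
    ref_xy: (2,)      reference point in pixel coordinates (x, y)
    feat:   (H, W, C) image feature map
    W_off:  (C, 2K)   predicts K (dx, dy) sampling offsets from the query
    W_attn: (C, K)    predicts K attention logits from the query
    W_val:  (C, D)    value projection
    """
    K = W_attn.shape[1]
    offsets = (query @ W_off).reshape(K, 2)
    weights = softmax(query @ W_attn)  # normalised over the K sampled points only
    out = np.zeros(W_val.shape[1])
    for k in range(K):
        x, y = ref_xy + offsets[k]
        out += weights[k] * (bilinear_sample(feat, x, y) @ W_val)
    return out


# Usage: each query touches K points instead of all H * W positions.
rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 4, 4
feat = rng.normal(size=(H, W, C))
query = rng.normal(size=C)
out = deformable_attention(
    query,
    np.array([3.5, 4.0]),
    feat,
    0.1 * rng.normal(size=(C, 2 * K)),
    rng.normal(size=(C, K)),
    np.eye(C),
)
print(out.shape)  # (4,)
```

In this sketch the cost per query grows with the number of sampled points K rather than with the size of the feature map, which is the property the abstract relies on when it contrasts the approach with standard Transformer attention over image features.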