Deformable DETR: Deformable Transformers for End-to-End Object Detection

Open Access | Proceedings Article

TLDR

Deformable DETR's attention modules attend only to a small set of key sampling points around a reference point; the model achieves better performance than DETR with 10× fewer training epochs.

Abstract:

DETR has recently been proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitations of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules attend only to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code will be released.
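The sampling idea in the abstract can be illustrated with a minimal NumPy sketch. It assumes a single attention head, a single feature scale, and no output projection; the names `bilinear_sample`, `deformable_attention`, `W_off`, `W_attn`, and `W_val` are hypothetical and do not come from the authors' code. Each query predicts K offsets and K attention weights by linear projection, and the output is the weighted sum of features sampled at the reference point plus each offset.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample feat (H, W, C) at continuous pixel coords (x, y).
    Neighbours that fall outside the map contribute zero."""
    H, W, C = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = np.zeros(C)
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < W and 0 <= yi < H:
                w = (1 - abs(x - xi)) * (1 - abs(y - yi))
                out += w * feat[yi, xi]
    return out

def deformable_attention(query, ref_point, feat, W_off, W_attn, W_val):
    """Single-head, single-scale sketch (an illustrative assumption).

    query:     (C,) query feature
    ref_point: (x, y) reference location in pixel coordinates
    feat:      (H, W, C) feature map
    W_off:     (C, 2K) projection giving K (dx, dy) sampling offsets
    W_attn:    (C, K) projection giving K attention logits
    W_val:     (C, D) value projection
    """
    K = W_attn.shape[1]
    offsets = (query @ W_off).reshape(K, 2)
    logits = query @ W_attn
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()  # softmax over the K sampling points only
    out = np.zeros(W_val.shape[1])
    for k in range(K):
        x = ref_point[0] + offsets[k, 0]
        y = ref_point[1] + offsets[k, 1]
        out += weights[k] * (bilinear_sample(feat, x, y) @ W_val)
    return out, weights
```

In this sketch the cost per query depends on K rather than on the H × W size of the feature map, which is the property the abstract relies on when it contrasts the approach with standard Transformer attention.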

Citations

Posted Content

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

Ze Liu, +7 more

- 25 Mar 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Liu et al. propose Swin Transformer, a hierarchical vision Transformer whose representation is computed with shifted windows. The design addresses differences between language and vision, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

Posted Content

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Wenhai Wang, +8 more

- 24 Feb 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Wang et al. propose the Pyramid Vision Transformer (PVT), a simple convolution-free backbone network useful for many dense prediction tasks, and report state-of-the-art performance on the COCO dataset.

Posted Content

An Attentive Survey of Attention Models

Sneha Chaudhari, +3 more

- 05 Apr 2019 -

arXiv: Learning

TL;DR: This survey proposes a taxonomy that groups existing attention-model techniques into coherent categories, and describes how attention has been used to improve the interpretability of neural networks.

Posted Content

Attention Mechanisms in Computer Vision: A Survey.

Meng-Hao Guo, +9 more

- 15 Nov 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This survey comprehensively reviews attention mechanisms in computer vision and categorizes them by approach, such as channel attention, spatial attention, temporal attention, and branch attention.

Journal Article

Remote Sensing Image Change Detection with Transformers

Hao Chen, +2 more

- 27 Feb 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: Chen et al. propose a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain, on the premise that the high-level concepts of the change of interest can be represented by a few visual words.

References

Posted Content

Big Bird: Transformers for Longer Sequences

Manzil Zaheer, +10 more

- 28 Jul 2020 -

arXiv: Learning

TL;DR: It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.

Posted Content

Linformer: Self-Attention with Linear Complexity

Sinong Wang, +4 more

- 08 Jun 2020 -

arXiv: Learning

TL;DR: This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism that reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.

Posted Content

NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection

Golnaz Ghiasi, +3 more

- 16 Apr 2019 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: NAS-FPN consists of a combination of top-down and bottom-up connections that fuse features across scales, and achieves a better accuracy and latency tradeoff than state-of-the-art object detection models.

Proceedings Article

Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection

Shifeng Zhang, +4 more

TL;DR: Zhang et al. propose Adaptive Training Sample Selection (ATSS), which automatically selects positive and negative samples according to the statistical characteristics of objects. It significantly improves the performance of anchor-based and anchor-free detectors and bridges the gap between them.

Posted Content

Efficient Transformers: A Survey

Yi Tay, +3 more

- 14 Sep 2020 -

arXiv: Learning

TL;DR: This paper characterizes a large and thoughtful selection of recent efficiency-flavored “X-former” models, providing an organized and comprehensive overview of existing work and models across multiple domains.

Related Papers (5)

Attention is All you Need

Deep Residual Learning for Image Recognition

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Microsoft COCO: Common Objects in Context

ImageNet: A large-scale hierarchical image database