Deformable DETR: Deformable Transformers for End-to-End Object Detection

Open Access Proceedings Article


TLDR

Deformable DETR attends only to a small set of key sampling points around a reference, and can achieve better performance than DETR with 10× fewer training epochs.

Abstract:

DETR has recently been proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitations of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules attend only to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code shall be released.
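To make the idea of attending to "a small set of key sampling points around a reference" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head and a single feature scale, and it omits the output projection. The function names (`bilinear_sample`, `deformable_attention`) and the parameter names (`W_off`, `W_attn`, `W_val`) are hypothetical, chosen only for illustration. The query predicts K offsets and K attention weights; the values are read at the K offset locations by bilinear interpolation and averaged with those weights.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample feat (H, W, C) at continuous pixel coords (x, y).
    Neighbours that fall outside the map contribute zero."""
    H, W, C = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = np.zeros(C)
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < W and 0 <= yi < H:
                out += wx * wy * feat[yi, xi]
    return out

def deformable_attention(query, ref_point, feat, W_off, W_attn, W_val):
    """Single-head, single-scale sketch (assumed shapes):
    query (C,), ref_point (x, y) in pixels, feat (H, W, C),
    W_off (C, 2K), W_attn (C, K), W_val (C, C)."""
    K = W_attn.shape[1]
    # The query alone predicts K 2-D offsets and K attention logits.
    offsets = (query @ W_off).reshape(K, 2)
    logits = query @ W_attn
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()              # softmax over the K sampling points
    values = feat @ W_val                 # value projection, shape (H, W, C)
    out = np.zeros(W_val.shape[1])
    for k in range(K):
        x = ref_point[0] + offsets[k, 0]
        y = ref_point[1] + offsets[k, 1]
        out += weights[k] * bilinear_sample(values, x, y)
    return out, weights

rng = np.random.default_rng(0)
C, K = 4, 3
feat = rng.normal(size=(5, 6, C))
query = rng.normal(size=C)
out, weights = deformable_attention(
    query, (2.0, 3.0), feat,
    rng.normal(scale=0.5, size=(C, 2 * K)),
    rng.normal(size=(C, K)),
    np.eye(C),
)
print(out.shape, weights.sum())
```

In this sketch the cost per query depends on K rather than on the number of pixels H × W, which is the property the abstract relies on when it contrasts the approach with standard Transformer attention over full feature maps.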

Citations


Posted Content

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

Ze Liu, +7 more

- 25 Mar 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper proposes a new vision Transformer called Swin Transformer, whose representation is computed with shifted windows to address the differences between the language and vision domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.


Posted Content

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Wenhai Wang, +8 more

- 24 Feb 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper proposes Pyramid Vision Transformer (PVT), a simple backbone network that is useful for many dense prediction tasks without convolutions and achieves state-of-the-art performance on the COCO dataset.


Posted Content

An Attentive Survey of Attention Models

Sneha Chaudhari, +3 more

- 05 Apr 2019 -

arXiv: Learning

TL;DR: This survey proposes a taxonomy that groups existing attention-model techniques into coherent categories, and describes how attention has been used to improve the interpretability of neural networks.


Posted Content

Attention Mechanisms in Computer Vision: A Survey.

Meng-Hao Guo, +9 more

- 15 Nov 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: A comprehensive review of attention mechanisms in computer vision can be found in this article, which categorizes them according to approach, such as channel attention, spatial attention, temporal attention and branch attention.


Journal Article

Remote Sensing Image Change Detection with Transformers

Hao Chen, +2 more

- 27 Feb 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper proposes a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain, where the high-level concepts of the change of interest can be represented by a few visual words.


References


Proceedings Article

Stand-Alone Self-Attention in Vision Models

Prajit Ramachandran, +5 more

TL;DR: The results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox and is especially impactful when used in later layers.


Proceedings Article

Generating Wikipedia by Summarizing Long Sequences

Peter J. Liu, +6 more

TL;DR: This work uses extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article; the model can extract relevant factual information, as reflected in perplexity, ROUGE scores and human evaluations.


Proceedings Article

Reformer: The Efficient Transformer

Nikita Kitaev, +2 more

TL;DR: This work replaces dot-product attention by one that uses locality-sensitive hashing and uses reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of several times, making the model much more memory-efficient and much faster on long sequences.


Journal Article

M2Det: A Single-Shot Object Detector Based on Multi-Level Feature Pyramid Network

Qijie Zhao, +6 more

TL;DR: A powerful end-to-end one-stage object detector called M2Det is designed and trained by integrating the Multi-Level Feature Pyramid Network into the architecture of SSD, and it achieves better detection performance than state-of-the-art one-stage detectors.


Proceedings Article

Local Relation Networks for Image Recognition

Han Hu, +3 more

TL;DR: A network built with local relation layers, called the Local Relation Network (LR-Net), is found to provide greater modeling capacity than its counterpart built with regular convolution on large-scale recognition tasks such as ImageNet classification.


Related Papers (5)

Attention is All you Need

Deep Residual Learning for Image Recognition

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Microsoft COCO: Common Objects in Context

ImageNet: A large-scale hierarchical image database