Deformable DETR: Deformable Transformers for End-to-End Object Detection

Open Access, Proceedings Article

TLDR

Deformable DETR restricts each attention module to a small set of key sampling points around a reference point, and achieves better performance than DETR with 10× fewer training epochs.

Abstract:

DETR has recently been proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitations of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules attend only to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code shall be released.
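The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head, a single feature scale, and scalar features, and it takes the sampling offsets and attention logits as plain inputs, whereas in the paper they are predicted from the query feature by learned linear layers. The function names (`bilinear_sample`, `softmax`, `deformable_attention`) are hypothetical and chosen for illustration.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a 2D grid feat[row][col] at fractional (x, y).

    Grid cells outside the map contribute zero (an assumption of this sketch).
    """
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < w and 0 <= yi < h:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * feat[yi][xi]
    return value


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention(feat, reference, offsets, logits):
    """Weighted sum of K sampled values around a reference point.

    reference: (x, y) location of the query on the feature map.
    offsets:   K (dx, dy) pairs; assumed here to be given, not learned.
    logits:    K unnormalized attention scores, normalized with softmax.
    """
    weights = softmax(logits)
    rx, ry = reference
    return sum(
        a * bilinear_sample(feat, rx + dx, ry + dy)
        for a, (dx, dy) in zip(weights, offsets)
    )


if __name__ == "__main__":
    # Toy 3x3 feature map whose value at (x, y) is 3*y + x.
    feat = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
    offsets = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
    print(deformable_attention(feat, (1.0, 1.0), offsets, [0.0] * 4))
```

In this toy example the query reads 4 sampled locations instead of all 9 grid cells. The number of sampled points stays fixed as the feature map grows, which is the property the abstract credits for making higher-resolution feature maps practical.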

Citations

Posted Content

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

Yue Wang, +5 more

- 13 Oct 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper proposes a top-down approach to multi-camera 3D object detection that extracts 2D features from multiple camera images and then uses a sparse set of 3D object queries to index into these features, linking 3D positions to multi-view images using camera transformation matrices.

Posted Content

Suppress-and-Refine Framework for End-to-End 3D Object Detection.

Zili Liu, +7 more

- 18 Mar 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: SRDet is a suppress-and-refine framework that removes the handcrafted components used to eliminate redundant boxes, and achieves state-of-the-art performance on the challenging ScanNetV2 and SUN RGB-D datasets.

Posted Content

SwinTrack: A Simple and Strong Baseline for Transformer Tracking

Liting Lin, +3 more

- 02 Dec 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper proposes SwinTrack, a fully attention-based Transformer tracking algorithm that uses Transformers for both feature extraction and feature fusion, allowing full interaction between the target object and the search region.

Posted Content

3D Object Tracking with Transformer.

Yubo Cui, +4 more

- 28 Oct 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper proposes a feature fusion network based on the Transformer architecture, which captures the inter- and intra-relations among different regions of the point cloud and includes target object information to make similarity computation more efficient.

Posted Content

Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)

Yunzhong Hou, +1 more

- 12 Aug 2021 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper proposes MVDeTr, a multiview detector that uses a newly introduced shadow transformer to aggregate multi-view information; the shadow transformer attends differently at different positions and cameras to deal with various shadow-like distortions.

References

Proceedings Article

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: This paper proposes a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting networks won 1st place in the ILSVRC 2015 classification task.

Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

Proceedings Article

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

TL;DR: This paper introduces ImageNet, a large-scale ontology of images built upon the backbone of the WordNet structure, which is much larger in scale and diversity and much more accurate than the image datasets available at the time.

Book Chapter

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

TL;DR: This paper presents a new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding. The dataset gathers images of complex everyday scenes containing common objects in their natural context.
