Open Access Proceedings Article

Deformable DETR: Deformable Transformers for End-to-End Object Detection

TLDR
Deformable DETR proposes attention modules that attend to only a small set of key sampling points around a reference, achieving better performance than DETR with 10× fewer training epochs.
Abstract
DETR has recently been proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code shall be released.
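The core idea in the abstract — each query attends to a small, learned set of sampling points around a reference point instead of to every position in the feature map — can be sketched in PyTorch as follows. This is a minimal single-scale, single-head sketch, not the authors' released implementation (which uses multi-scale feature maps, multiple heads, and a custom CUDA kernel); the module and parameter names here (e.g. `num_points`, `offset_proj`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    """Illustrative single-scale, single-head deformable attention."""

    def __init__(self, d_model=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        # Per query, predict K 2-D sampling offsets and K attention weights.
        self.offset_proj = nn.Linear(d_model, num_points * 2)
        self.weight_proj = nn.Linear(d_model, num_points)
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, queries, ref_points, feat_map):
        # queries:    (N, Lq, C) query embeddings
        # ref_points: (N, Lq, 2) reference points, (x, y) normalized to [0, 1]
        # feat_map:   (N, C, H, W) image feature map
        N, Lq, C = queries.shape
        H, W = feat_map.shape[-2:]
        value = self.value_proj(feat_map.flatten(2).transpose(1, 2))  # (N, HW, C)
        value = value.transpose(1, 2).reshape(N, C, H, W)

        offsets = self.offset_proj(queries).view(N, Lq, self.num_points, 2)
        weights = self.weight_proj(queries).softmax(-1)               # (N, Lq, K)

        # Sampling locations = reference point + predicted offsets,
        # mapped to grid_sample's [-1, 1] coordinate range.
        locs = ref_points.unsqueeze(2) + offsets                      # (N, Lq, K, 2)
        sampled = F.grid_sample(value, 2.0 * locs - 1.0,
                                align_corners=False)                  # (N, C, Lq, K)

        # Each query's output is a weighted sum over its K sampled values,
        # rather than attention over all H*W positions.
        out = (sampled * weights.unsqueeze(1)).sum(-1)                # (N, C, Lq)
        return self.out_proj(out.transpose(1, 2))                     # (N, Lq, C)
```

Because each query touches only K sampled locations, the cost per query is O(K) rather than O(HW), which is what allows high-resolution (and, in the full model, multi-scale) feature maps to be processed efficiently.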



Citations
Posted Content

TransVG: End-to-End Visual Grounding with Transformers

TL;DR: TransVG proposes a transformer-based framework for visual grounding that establishes multi-modal correspondence, and empirically shows that complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
Posted Content

Twins: Revisiting the Design of Spatial Attention in Vision Transformers

TL;DR: This paper revisits the design of the spatial attention mechanism in vision transformers and proposes two architectures, Twins-PCPVT and Twins-SVT.
Posted Content

Going Beyond Linear Transformers with Recurrent Fast Weight Programmers

TL;DR: This paper proposes Recurrent Fast Weight Programmers (RFWPs), which add recurrence to the slow and fast networks of linear Transformers with linearised attention, yielding a model more general than linear Transformers.
Posted Content

BoxeR: Box-Attention for 2D and 3D Transformers.

TL;DR: This paper proposes a simple attention mechanism, called Box-Attention, which enables spatial interaction between grid features sampled from boxes of interest and improves the learning capability of transformers for several vision tasks.
Posted Content

A Survey of Visual Transformers

TL;DR: This paper provides a comprehensive review of over one hundred visual Transformers for three fundamental computer vision tasks (classification, detection, and segmentation), proposing a taxonomy that organizes these methods by motivation, structure, and usage scenario.
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: The authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; it won 1st place on the ILSVRC 2015 classification task.
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes the Transformer, a network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings Article

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Book Chapter

Microsoft COCO: Common Objects in Context

TL;DR: This paper introduces a new dataset aimed at advancing the state of the art in object recognition by placing it within the broader question of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.