Open Access, Proceedings Article

Deformable DETR: Deformable Transformers for End-to-End Object Detection

TLDR
Deformable DETR's attention modules attend only to a small set of key sampling points around a reference, which lets it achieve better performance than DETR with 10× fewer training epochs.
Abstract
DETR was recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitations of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules attend only to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code will be released.
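
The idea of attending only to a small set of key sampling points around a reference can be illustrated with a minimal sketch. It is not the authors' implementation: it covers a single head on a single feature map, omits the value and output projections, and takes the sampling offsets and attention logits as plain arguments (in a trained model they would be predicted from the query feature). The function names `bilinear_sample` and `deformable_attention` are hypothetical, chosen for this illustration.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a grid of feature vectors at fractional (x, y).

    feature_map[row][col] is a list of channel values. Grid cells that fall
    outside the map contribute zeros.
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * channels
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < width and 0 <= yi < height:
                for c in range(channels):
                    out[c] += wx * wy * feature_map[yi][xi][c]
    return out


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention(feature_map, reference, offsets, logits):
    """Weighted sum of K features sampled around `reference` = (x, y).

    offsets: K (dx, dy) pairs added to the reference point.
    logits:  K unnormalized attention scores, one per sampling point.
    Both are assumed inputs here rather than learned predictions.
    """
    weights = softmax(logits)
    channels = len(feature_map[0][0])
    out = [0.0] * channels
    for (dx, dy), w in zip(offsets, weights):
        sample = bilinear_sample(feature_map, reference[0] + dx, reference[1] + dy)
        for c in range(channels):
            out[c] += w * sample[c]
    return out


# Toy 4x4 map with one channel whose value is x + 10 * y.
toy_map = [[[float(x + 10 * y)] for x in range(4)] for y in range(4)]
result = deformable_attention(
    toy_map, reference=(1.0, 1.0), offsets=[(0.5, 0.0), (0.0, 0.5)], logits=[0.0, 0.0]
)
print(result)  # [13.75]: the mean of the samples 11.5 and 16.0
```

The cost of one query in this sketch depends only on the number of sampling points K, not on the size of the feature map, whereas dense attention would touch every location.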


Citations
Posted Content

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

TL;DR: Swin Transformer is a hierarchical vision Transformer whose representation is computed with shifted windows, designed to address differences between the language and vision domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.
Posted Content

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

TL;DR: Pyramid Vision Transformer (PVT) is a simple backbone network useful for many dense prediction tasks without convolutions, and is reported to achieve state-of-the-art performance on the COCO dataset.
Posted Content

An Attentive Survey of Attention Models

TL;DR: Proposes a taxonomy that groups existing attention-model techniques into coherent categories, and describes how attention has been used to improve the interpretability of neural networks.
Posted Content

Attention Mechanisms in Computer Vision: A Survey.

TL;DR: A comprehensive review of attention mechanisms in computer vision, categorized by approach: channel attention, spatial attention, temporal attention, and branch attention.
Journal Article

Remote Sensing Image Change Detection with Transformers

TL;DR: Proposes a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain, where the high-level concepts of the change of interest can be represented by a few visual words.
References
Proceedings Article

EfficientDet: Scalable and Efficient Object Detection

TL;DR: EfficientDet proposes a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion, and a compound scaling method that uniformly scales the resolution, depth, and width of the backbone, feature network, and box/class prediction networks at the same time.
Proceedings Article

Deformable Convolutional Networks

TL;DR: Deformable convolutional networks augment the spatial sampling locations in their modules with additional offsets and learn the offsets from the target tasks without additional supervision; the new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard backpropagation.
Proceedings Article

FCOS: Fully Convolutional One-Stage Object Detection

TL;DR: For the first time, a much simpler and flexible detection framework achieving improved detection accuracy is demonstrated, and it is hoped that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks.
Book Chapter

End-to-End Object Detection with Transformers

TL;DR: DETR proposes a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture that directly outputs the final set of predictions in parallel.
Journal Article

Deep Learning for Generic Object Detection: A Survey

TL;DR: A comprehensive survey of the recent achievements in this field brought about by deep learning techniques, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics.