Deformable DETR: Deformable Transformers for End-to-End Object Detection

Home
/
Papers
/
Deformable DETR: Deformable Transformers for End-to-End Object Detection

Proceedings Article•

Deformable DETR: Deformable Transformers for End-to-End Object Detection

Xizhou Zhu, Weijie Su¹, Lewei Lu², Bin Li¹, Xiaogang Wang³, Jifeng Dai⁴ - Show less +2 more•Institutions (4)

University of Science and Technology of China¹, Microsoft², The Chinese University of Hong Kong³, SenseTime⁴

03 May 2021-

TL;DR: Deformable DETR as discussed by the authors proposes to only attend to a small set of key sampling points around a reference, which can achieve better performance than DETR with 10× less training epochs.

read less

Abstract: DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code shall be released.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Posted Content•

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

[...]

Ze Liu¹, Yutong Lin¹, Yue Cao¹, Han Hu¹, Yixuan Wei¹, Zheng Zhang¹, Stephen Lin¹, Baining Guo¹ - Show less +4 more•Institutions (1)

Microsoft¹

25 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Wang et al. as mentioned in this paper proposed a new vision Transformer called Swin Transformer, which is computed with shifted windows to address the differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

...read moreread less

Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at~\url{this https URL}.

...read moreread less

3,518 citations

Posted Content•

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

[...]

Wenhai Wang¹, Enze Xie², Xiang Li³, Deng-Ping Fan⁴, Kaitao Song⁵, Ding Liang⁶, Tong Lu¹, Ping Luo², Ling Shao⁷ - Show less +5 more•Institutions (7)

Nanjing University¹, University of Hong Kong², New York University Abu Dhabi³, Nankai University⁴, Nanjing University of Science and Technology⁵, SenseTime⁶, Seoul National University⁷

24 Feb 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Huang et al. as discussed by the authors proposed Pyramid Vision Transformer (PVT), which is a simple backbone network useful for many dense prediction tasks without convolutions, and achieved state-of-the-art performance on the COCO dataset.

...read moreread less

Abstract: Although using convolutional neural networks (CNNs) as backbones achieves great successes in computer vision, this work investigates a simple backbone network useful for many dense prediction tasks without convolutions. Unlike the recently-proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose Pyramid Vision Transformer~(PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to prior arts. (1) Different from ViT that typically has low-resolution outputs and high computational and memory cost, PVT can be not only trained on dense partitions of the image to achieve high output resolution, which is important for dense predictions but also using a progressive shrinking pyramid to reduce computations of large feature maps. (2) PVT inherits the advantages from both CNN and Transformer, making it a unified backbone in various vision tasks without convolutions by simply replacing CNN backbones. (3) We validate PVT by conducting extensive experiments, showing that it boosts the performance of many downstream tasks, e.g., object detection, semantic, and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, surpassing RetinNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future researches. Code is available at this https URL.

...read moreread less

845 citations

Posted Content•

An Attentive Survey of Attention Models

[...]

Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, Varun Mithal

05 Apr 2019-arXiv: Learning

TL;DR: A taxonomy that groups existing techniques into coherent categories in attention models is proposed, and how attention has been used to improve the interpretability of neural networks is described.

...read moreread less

Abstract: Attention Model has now become an important concept in neural networks that has been researched within diverse application domains. This survey provides a structured and comprehensive overview of the developments in modeling attention. In particular, we propose a taxonomy which groups existing techniques into coherent categories. We review salient neural architectures in which attention has been incorporated, and discuss applications in which modeling attention has shown a significant impact. We also describe how attention has been used to improve the interpretability of neural networks. Finally, we discuss some future research directions in attention. We hope this survey will provide a succinct introduction to attention models and guide practitioners while developing approaches for their applications.

...read moreread less

355 citations

Posted Content•

Attention Mechanisms in Computer Vision: A Survey.

[...]

Meng-Hao Guo, Tian-Xing Xu, Jiangjiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang, Ralph R. Martin, Ming-Ming Cheng, Shi-Min Hu¹ - Show less +6 more•Institutions (1)

Tsinghua University¹

15 Nov 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: A comprehensive review of attention mechanisms in computer vision can be found in this article, which categorizes them according to approach, such as channel attention, spatial attention, temporal attention and branch attention.

...read moreread less

Abstract: Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them according to approach, such as channel attention, spatial attention, temporal attention and branch attention; a related repository this https URL is dedicated to collecting related work. We also suggest future directions for attention mechanism research.

...read moreread less

243 citations

Journal Article•DOI•

Remote Sensing Image Change Detection with Transformers

[...]

Hao Chen¹, Zipeng Qi¹, Zhenwei Shi¹•Institutions (1)

Beihang University¹

27 Feb 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: Wang et al. as discussed by the authors proposed a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain, where the high-level concepts of the change of interest can be represented by a few visual words.

...read moreread less

Abstract: Modern change detection (CD) has achieved remarkable success by the powerful discriminative ability of deep convolutions. However, high-resolution remote sensing CD remains challenging due to the complexity of objects in the scene. Objects with the same semantic concept may show distinct spectral characteristics at different times and spatial locations. Most recent CD pipelines using pure convolutions are still struggling to relate long-range concepts in space-time. Non-local self-attention approaches show promising performance via modeling dense relations among pixels, yet are computationally inefficient. Here, we propose a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain. Our intuition is that the high-level concepts of the change of interest can be represented by a few visual words, i.e., semantic tokens. To achieve this, we express the bitemporal image into a few tokens, and use a transformer encoder to model contexts in the compact token-based space-time. The learned context-rich tokens are then feedback to the pixel-space for refining the original features via a transformer decoder. We incorporate BIT in a deep feature differencing-based CD framework. Extensive experiments on three CD datasets demonstrate the effectiveness and efficiency of the proposed method. Notably, our BIT-based model significantly outperforms the purely convolutional baseline using only 3 times lower computational costs and model parameters. Based on a naive backbone (ResNet18) without sophisticated structures (e.g., FPN, UNet), our model surpasses several state-of-the-art CD methods, including better than four recent attention-based methods in terms of efficiency and accuracy. Our code is available at this https URL\_CD.

...read moreread less

205 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

Collapse

References

PDF

Open Access

More filters

Proceedings Article•

Pay Less Attention with Lightweight and Dynamic Convolutions

[...]

Felix Wu¹, Angela Fan², Alexei Baevski², Yann N. Dauphin³, Michael Auli² - Show less +1 more•Institutions (3)

NEC¹, Facebook², Google³

29 Jan 2019

TL;DR: This article introduced dynamic convolutions, which are simpler and more efficient than self-attention, and achieved state-of-the-art performance on the WMT'14 English-German test set.

...read moreread less

Abstract: Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.

...read moreread less

391 citations

Posted Content•

Axial Attention in Multidimensional Transformers

[...]

Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans

25 Sep 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: Axial Transformers is proposed, a self-attention-based autoregressive model for images and other data organized as high dimensional tensors that maintains both full expressiveness over joint distributions over data and ease of implementation with standard deep learning frameworks, while requiring reasonable memory and computation.

...read moreread less

Abstract: We propose Axial Transformers, a self-attention-based autoregressive model for images and other data organized as high dimensional tensors. Existing autoregressive models either suffer from excessively large computational resource requirements for high dimensional data, or make compromises in terms of distribution expressiveness or ease of implementation in order to decrease resource requirements. Our architecture, by contrast, maintains both full expressiveness over joint distributions over data and ease of implementation with standard deep learning frameworks, while requiring reasonable memory and computation and achieving state-of-the-art results on standard generative modeling benchmarks. Our models are based on axial attention, a simple generalization of self-attention that naturally aligns with the multiple dimensions of the tensors in both the encoding and the decoding settings. Notably the proposed structure of the layers allows for the vast majority of the context to be computed in parallel during decoding without introducing any independence assumptions. This semi-parallel structure goes a long way to making decoding from even a very large Axial Transformer broadly applicable. We demonstrate state-of-the-art results for the Axial Transformer on the ImageNet-32 and ImageNet-64 image benchmarks as well as on the BAIR Robotic Pushing video benchmark. We open source the implementation of Axial Transformers.

...read moreread less

318 citations

Proceedings Article•DOI•

An Empirical Study of Spatial Attention Mechanisms in Deep Networks

[...]

Xizhou Zhu¹, Dazhi Cheng², Zheng Zhang³, Stephen Lin³, Jifeng Dai³ - Show less +1 more•Institutions (3)

University of Science and Technology of China¹, Beijing Institute of Technology², Microsoft³

01 Oct 2019

TL;DR: In this article, the authors present an empirical study that ablates various spatial attention elements within a generalized attention formulation, encompassing the dominant Transformer attention as well as the prevalent deformable convolution and dynamic convolution modules.

...read moreread less

Abstract: Attention mechanisms have become a popular component in deep neural networks, yet there has been little examination of how different influencing factors and methods for computing attention from these factors affect performance. Toward a better general understanding of attention mechanisms, we present an empirical study that ablates various spatial attention elements within a generalized attention formulation, encompassing the dominant Transformer attention as well as the prevalent deformable convolution and dynamic convolution modules. Conducted on a variety of applications, the study yields significant findings about spatial attention in deep networks, some of which run counter to conventional understanding. For example, we find that the query and key content comparison in Transformer attention is negligible for self-attention, but vital for encoder-decoder attention. A proper combination of deformable convolution with key content only saliency achieves the best accuracy-efficiency tradeoff in self-attention. Our results suggest that there exists much room for improvement in the design of attention mechanisms.

...read moreread less

249 citations

Proceedings Article•DOI•

ETC: Encoding Long and Structured Inputs in Transformers

[...]

Joshua Ainslie¹, Santiago Ontañón², Chris Alberti¹, Vaclav Cvicek, Zachary Fisher³, Philip Pham¹, Anirudh Ravula¹, Sumit Sanghai¹, Qifan Wang¹, Li Yang¹ - Show less +6 more•Institutions (3)

Google¹, Drexel University², Perimeter Institute for Theoretical Physics³

17 Apr 2020

TL;DR: Extended Transformer Construction (ETC) as mentioned in this paper introduces a novel global-local attention mechanism between global tokens and regular input tokens to scale attention to longer inputs and achieves state-of-the-art results on four natural language datasets.

...read moreread less

Abstract: Transformer models have advanced the state of the art in many Natural Language Processing (NLP) tasks. In this paper, we present a new Transformer architecture, "Extended Transformer Construction" (ETC), that addresses two key challenges of standard Transformer architectures, namely scaling input length and encoding structured inputs. To scale attention to longer inputs, we introduce a novel global-local attention mechanism between global tokens and regular input tokens. We also show that combining global-local attention with relative position encodings and a "Contrastive Predictive Coding" (CPC) pre-training objective allows ETC to encode structured inputs. We achieve state-of-the-art results on four natural language datasets requiring long and/or structured inputs.

...read moreread less

248 citations

Proceedings Article•DOI•

Auto-FPN: Automatic Network Architecture Adaptation for Object Detection Beyond Classification

[...]

Hang Xu¹, Lewei Yao¹, Zhenguo Li¹, Xiaodan Liang², Wei Zhang¹ - Show less +1 more•Institutions (2)

Huawei¹, Sun Yat-sen University²

01 Oct 2019

TL;DR: This paper proposes an architecture search framework named Auto-FPN specifically designed for detection beyond simply searching a classification backbone, and proposes two auto search modules for detection: Auto-fusion to search a better fusion of the multi-level features; Auto-head to searchA better structure for classification and bounding-box(bbox) regression.

...read moreread less

Abstract: Neural architecture search (NAS) has shown great potential in automating the manual process of designing a good CNN architecture for image classification. In this paper, we study NAS for object detection, a core computer vision task that classifies and localizes object instances in an image. Existing works focus on transferring the searched architecture from classification task (ImageNet) to the detector backbone, while the rest of the architecture of the detector remains unchanged. However, this pipeline is not task-specific or data-oriented network search which cannot guarantee optimal adaptation to any dataset. Therefore, we propose an architecture search framework named Auto-FPN specifically designed for detection beyond simply searching a classification backbone. Specifically, we propose two auto search modules for detection: Auto-fusion to search a better fusion of the multi-level features; Auto-head to search a better structure for classification and bounding-box(bbox) regression. Instead of searching for one repeatable cell structure, we relax the constraint and allow different cells. The search space of both modules covers many popular designs of detectors and allows efficient gradient-based architecture search with resource constraint (2 days for COCO on 8 GPU cards). Extensive experiments on Pascal VOC, COCO, BDD, VisualGenome and ADE demonstrate the effectiveness of the proposed method, e.g. achieving around 5% improvement than FPN in terms of mAP while requiring around 50% fewer parameters on the searched modules.

...read moreread less

210 citations