Proceedings Article

Deformable DETR: Deformable Transformers for End-to-End Object Detection

TL;DR: Deformable DETR attends only to a small set of key sampling points around a reference, achieving better performance than DETR with 10× fewer training epochs.
Abstract: DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code shall be released.
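
The idea of attending only to a few sampling points around a reference can be illustrated with a minimal single-head sketch in Python. This is not the authors' implementation: the function names are hypothetical, the feature map holds scalars rather than vectors, and the offsets and attention scores are passed in directly as inputs, since the abstract does not say how they are produced.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at fractional (x, y).
    Locations outside the grid contribute zero."""
    h, w = len(feature_map), len(feature_map[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < w and 0 <= yi < h:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * feature_map[yi][xi]
    return value

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_attention(feature_map, reference, offsets, scores):
    """Weighted sum over len(offsets) sampled points near `reference`.
    `offsets` and `scores` are hypothetical inputs for illustration."""
    weights = softmax(scores)
    rx, ry = reference
    return sum(
        a * bilinear_sample(feature_map, rx + dx, ry + dy)
        for a, (dx, dy) in zip(weights, offsets)
    )

# A 4x4 map has 16 positions, but only 3 sampled points are read here.
fmap = [[x + 10 * y for x in range(4)] for y in range(4)]
print(sparse_attention(fmap, (1.5, 1.5), [(0, 0), (0.5, 0), (-0.5, 0.5)], [0.0, 0.0, 0.0]))
```

The cost per query depends on the number of sampled points rather than on the size of the feature map, which is the property the abstract attributes to the method.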


Citations
Posted Content
TL;DR: TadTR is an end-to-end framework for temporal action detection that maps a set of learnable embeddings to action instances in parallel by selectively attending to a sparse set of snippets in a video.
Abstract: Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding and significant progress has been made. Previous methods involve multiple stages or networks and hand-designed rules or operations, which fall short in efficiency and flexibility. In this paper, we propose an end-to-end framework for TAD upon Transformer, termed TadTR, which maps a set of learnable embeddings to action instances in parallel. TadTR is able to adaptively extract temporal context information required for making action predictions, by selectively attending to a sparse set of snippets in a video. As a result, it simplifies the pipeline of TAD and requires lower computation cost than previous detectors, while preserving remarkable detection performance. TadTR achieves state-of-the-art performance on HACS Segments (+3.35% average mAP). As a single-network detector, TadTR runs 10× faster than its comparable competitor. It outperforms existing single-network detectors by a large margin on THUMOS14 (+5.0% average mAP) and ActivityNet (+7.53% average mAP). When combined with other detectors, it reports 54.1% mAP at IoU=0.5 on THUMOS14, and 34.55% average mAP on ActivityNet-1.3. Our code will be released at this https URL.

16 citations

Proceedings ArticleDOI
07 Jun 2021
TL;DR: In this paper, the authors explored two disparate state-of-the-art detection frameworks: DETR as representative of the novel transformer-based architectures for computer vision, and You Only Look Once v5 as an expedited PyTorch port of YOLOv4 with explicit mobile-focus.
Abstract: Diabetic feet are a long-term effect of diabetes mellitus that are at risk of ulceration due to neuropathy and ischemia. Early ulcer stages show subtle changes hard to recognize by the human eye, especially on darker skin types. Acquired ulcers may become chronic for various reasons, requiring extensive documentation to monitor healing progression. For early stage detection and documentation support, object detection algorithms are a key technology for prevention and care improvement. However, attendant symptoms like malformed toenails, hyperkeratosis, and rhagades display challenges regarding faulty detections. The research at hand explores two disparate state-of-the-art detection frameworks: Detection Transformer (DETR) as representative of the novel transformer-based architectures for computer vision, and You Only Look Once v5 (YOLOv5) as an expedited PyTorch port of YOLOv4 with explicit mobile-focus. Both are compared on a recently released dataset for diabetic foot ulcer detection with images typical for common wound care documentation. In addition, effects of self-training for performance improvement are investigated. Achieved results outperform those of other state-of-the-art methods. These are discussed highlighting differences and potential for further optimization.

15 citations

Posted Content
TL;DR: This paper proposes a general-purpose building block for multi-head context aggregation that can exploit long-range interactions as Transformers do while retaining the inductive bias of the local convolution operation, leading to faster convergence.
Abstract: Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers, originally introduced in natural language processing, have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered as completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions à la Transformers while still exploiting the inductive bias of the local convolution operation, leading to the faster convergence speeds often seen in CNNs. In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask-RCNN to obtain an impressive detection mAP of 38.9, 43.8, 45.1 and mask mAP of 41.3, providing large improvements of 6.6, 7.3, 6.9 and 6.6 pts respectively, compared to a ResNet-50 backbone with a comparable compute and parameter size. Our method also achieves promising results on self-supervised learning compared to DeiT on the DINO framework.

14 citations

Journal ArticleDOI
Haoqi Hu1, Xiaofeng Lu1, Xinpeng Zhang1, Tianxing Zhang1, Guangling Sun1 
TL;DR: Hu et al. propose an inheritance attention matrix-based UAP (IAM-UAP), built on an inheritance attention weight matrix (IAM) that represents the integration of global information of the input patch sequence, together with a perturbation optimization objective based on IAM to confuse the global information integration of Vision Transformers.
Abstract: Universal Adversarial Perturbations (UAPs), a type of image-agnostic adversarial attack, have been deeply investigated for Convolutional Neural Networks due to their high efficiency. Meanwhile, as an architecture based on the self-attention mechanism, Vision Transformers (ViTs) have boomed and been widely applied to various computer vision problems in recent years. In this letter, we delve into the robustness of ViTs against universal adversarial attacks and propose an inheritance attention matrix-based UAP (IAM-UAP). Specifically, we introduce the inheritance attention weight matrix (IAM), which represents the integration of global information of the input patch sequence. Further, we propose a perturbation optimization objective based on IAM to confuse the global information integration of ViTs. The empirical results confirm the attacking capability of IAM-UAP on ViTs with a moderate attacking rate. In addition, we also disclose that the patch size of ViTs is a latent factor influencing the robustness against the universal attack.

13 citations

Posted Content
29 Mar 2021
TL;DR: This paper formulates pose estimation as a sequence prediction problem that can be solved effectively by transformers, and proposes a human pose estimation framework that solves the task in a regression-based fashion.
Abstract: We propose a human pose estimation framework that solves the task in a regression-based fashion. Unlike previous regression-based methods, which often fall behind state-of-the-art methods, we formulate the pose estimation task as a sequence prediction problem that can effectively be solved by transformers. Our framework is simple and direct, bypassing the drawbacks of heatmap-based pose estimation. Moreover, with the attention mechanism in transformers, our proposed framework is able to adaptively attend to the features most relevant to the target keypoints, which largely overcomes the feature misalignment issue of previous regression-based methods and considerably improves the performance. Importantly, our framework can inherently take advantage of the structured relationship between keypoints. Experiments on the MS-COCO and MPII datasets demonstrate that our method can significantly improve the state of the art of regression-based pose estimation and perform comparably with the best heatmap-based pose estimation methods.

11 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations
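
The residual reformulation described in this abstract, where a layer learns F(x) and outputs F(x) + x, can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the two-layer form of F, the use of plain matrix multiplication instead of convolutions, and all function names are hypothetical choices, not details stated in the abstract.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Return F(x) + x, where F(x) = w2 * relu(w1 * x) is the learned residual."""
    fx = linear(w2, relu(linear(w1, x)))
    return [f + a for f, a in zip(fx, x)]

# With all-zero weights the residual F(x) vanishes and the block is the identity.
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block([1.0, -2.0], zeros, zeros))
```

Because the shortcut carries x through unchanged, a block only has to learn the difference from the identity mapping, which is the sense in which the layers are "referenced" to their inputs.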

Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

111,197 citations
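
The "adaptive estimates of lower-order moments" mentioned in this abstract can be made concrete with a scalar sketch of one Adam update in Python. The function name and the scalar setting are illustrative assumptions; the default hyper-parameter values shown are the commonly cited ones and should be checked against the paper before being relied on.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter and return the new state."""
    t += 1
    m = beta1 * m + (1 - beta1) * grad          # running estimate of the first moment
    v = beta2 * v + (1 - beta2) * grad * grad   # running estimate of the second moment
    m_hat = m / (1 - beta1 ** t)                # correct the bias from zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v, t

# Minimize f(x) = x**2, whose gradient is 2 * x, starting from x = 1.0.
x, m, v, t = 1.0, 0.0, 0.0, 0
for _ in range(100):
    x, m, v, t = adam_step(x, 2 * x, m, v, t, lr=0.1)
print(x)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's scale, which illustrates the invariance to gradient rescaling that the abstract claims.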

Proceedings Article
12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single-model state of the art by 0.7 BLEU, achieving a BLEU score of 41.1.

52,856 citations
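
The attention mechanism this abstract builds on can be sketched as scaled dot-product attention, softmax(QKᵀ/√d_k)V, in plain Python. This is a minimal single-head illustration with hypothetical function names; the abstract itself does not spell out the formula, and multi-head projections, masking and batching are deliberately left out.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return the attention-weighted average of `values`."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * val[j] for w, val in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Two identical keys receive equal weight, so the output averages the two values.
print(scaled_dot_product_attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]]))
```

Every query is compared with every key, so the cost grows with the product of the two sequence lengths.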

Proceedings ArticleDOI
Jia Deng1, Wei Dong1, Richard Socher1, Li-Jia Li1, Kai Li1, Li Fei-Fei1 
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations

Book ChapterDOI
06 Sep 2014
TL;DR: This work presents a new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

30,462 citations