Open Access · Proceedings Article
Deformable DETR: Deformable Transformers for End-to-End Object Detection
TL;DR: Deformable DETR restricts each attention module to a small set of key sampling points around a reference point, and achieves better performance than DETR with 10× fewer training epochs.
Abstract:
DETR was recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitations of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code shall be released.
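The idea of attending only to a few sampled points around a reference can be illustrated with a minimal NumPy sketch. It is a simplified single-head, single-scale illustration written for this summary, not the authors' implementation. The function names and the projection matrices `W_off`, `W_attn`, and `W_val` are hypothetical, and the use of bilinear interpolation for fractional sampling locations is an assumption of the sketch.

```python
import numpy as np


def bilinear_sample(feat, x, y):
    """Interpolate feat (H, W, C) at continuous pixel coords (x, y).

    Neighbours that fall outside the map contribute zero.
    """
    H, W, _ = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = np.zeros(feat.shape[-1])
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < W and 0 <= yi < H:
                out += wx * wy * feat[yi, xi]
    return out


def deformable_attention(query, ref_point, feat, W_off, W_attn, W_val):
    """Single-head, single-scale sketch of sparse sampled attention.

    query:     (C,)      query feature
    ref_point: (2,)      reference location as (x, y) in pixels
    feat:      (H, W, C) image feature map
    W_off:     (C, 2K)   hypothetical projection predicting K (dx, dy) offsets
    W_attn:    (C, K)    hypothetical projection predicting K attention logits
    W_val:     (C, C)    hypothetical value projection
    """
    K = W_attn.shape[1]
    offsets = (query @ W_off).reshape(K, 2)      # K sampling offsets
    logits = query @ W_attn
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                     # softmax over the K points
    values = feat @ W_val                        # projected feature map
    out = np.zeros(W_val.shape[1])
    for k in range(K):
        x, y = ref_point + offsets[k]
        out += weights[k] * bilinear_sample(values, x, y)
    return out, offsets, weights


# Usage with random inputs (shapes chosen only for illustration).
rng = np.random.default_rng(0)
C, K = 4, 3
feat = rng.normal(size=(8, 8, C))
query = rng.normal(size=C)
out, offsets, weights = deformable_attention(
    query,
    np.array([2.0, 3.0]),
    feat,
    0.1 * rng.normal(size=(C, 2 * K)),
    rng.normal(size=(C, K)),
    rng.normal(size=(C, C)),
)
print(out.shape, offsets.shape, weights.sum())
```

In this sketch each query reads K sampled locations instead of all H × W positions of the feature map, which is the source of the reduced cost compared with dense attention over every pixel.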
Citations
Posted Content
ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
TL;DR: The authors propose ViTAE, a Vision Transformer Advanced by Exploring intrinsic inductive bias from convolutions. It uses several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context, applying multiple convolutions with different dilation rates.
Posted Content
Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries
TL;DR: The Part-and-Sum detection Transformer (PST) uses tensor-based part queries and vector-based sum queries to model the joint part and sum hypotheses/interactions.
Posted Content
Transformer Transforms Salient Object Detection and Camouflaged Object Detection
Yuxin Mao, Jing Zhang, Zhexiong Wan, Yuchao Dai, Aixuan Li, Yunqiu Lv, Xinyu Tian, Deng-Ping Fan, Nick Barnes
TL;DR: The authors adopt a dense transformer backbone for fully supervised RGB image-based salient object detection (SOD), RGB-D image-pair-based SOD, and weakly supervised SOD within a unified framework, based on the observation that the transformer backbone can provide accurate structure modeling.
Posted Content
SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation
TL;DR: The authors propose a sparse convolution-transformer network (SCTN) that transfers irregular point clouds into locally consistent flow features for estimating continuous and consistent motions within an object or local object part.
Posted Content
Hierarchical Modular Network for Video Captioning
TL;DR: A hierarchical modular network is proposed to bridge video representations and linguistic semantics at three levels before generating captions. The network is composed of: (I) Entity level, which highlights objects that are most likely to be mentioned in captions.
References
Proceedings Article
Deep Residual Learning for Image Recognition
TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; it won 1st place in the ILSVRC 2015 classification task.
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings Article
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Book Chapter
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick
TL;DR: A new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding. It gathers images of complex everyday scenes containing common objects in their natural context.