Showing papers on "Object (computer science) published in 2021"

PDF

Open Access

Posted Content•

Learning Transferable Visual Models From Natural Language Supervision

[...]

Alec Radford¹, Jong Wook Kim¹, Chris Hallacy¹, Aditya Ramesh¹, Gabriel Goh¹, Sandhini Agarwal¹, Girish Sastry¹, Amanda Askell, Pamela Mishkin¹, Jack Clark¹, Gretchen Krueger¹, Ilya Sutskever¹ - Show less +8 more•Institutions (1)

OpenAI¹

26 Feb 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a pre-training task of predicting which caption goes with which image is used to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.

...read moreread less

Abstract: State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.

...read moreread less

403 citations

Proceedings Article•DOI•

Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

[...]

Peize Sun¹, Rufeng Zhang², Yi Jiang, Tao Kong, Chenfeng Xu³, Wei Zhan³, Masayoshi Tomizuka³, Lei Li, Zehuan Yuan, Changhu Wang, Ping Luo¹ - Show less +7 more•Institutions (3)

University of Hong Kong¹, Tongji University², University of California, Berkeley³

01 Jun 2021

TL;DR: Sun et al. as mentioned in this paper proposed sparse R-CNN, a purely sparse method for object detection in images, which completely avoids all efforts related to object candidates design and many-to-one label assignment.

...read moreread less

Abstract: We present Sparse R-CNN, a purely sparse method for object detection in images. Existing works on object detection heavily rely on dense object candidates, such as k anchor boxes pre-defined on all grids of image feature map of size H × W. In our method, however, a fixed sparse set of learned object proposals, total length of N, are provided to object recognition head to perform classification and location. By eliminating HWk (up to hundreds of thousands) hand-designed object candidates to N (e.g. 100) learnable proposals, Sparse R-CNN completely avoids all efforts related to object candidates design and many-to-one label assignment. More importantly, final predictions are directly output without non-maximum suppression post-procedure. Sparse R-CNN demonstrates accuracy, run-time and training convergence performance on par with the well-established detector baselines on the challenging COCO dataset, e.g., achieving 45.0 AP in standard 3× training schedule and running at 22 fps using ResNet-50 FPN model. We hope our work could inspire re-thinking the convention of dense prior in object detectors. The code is available at: https://github.com/PeizeSun/SparseR-CNN.

...read moreread less

256 citations

Proceedings Article•DOI•

Towards Open World Object Detection

[...]

K J Joseph¹, Salman Khan², Fahad Shahbaz Khan², Vineeth N Balasubramanian¹•Institutions (2)

Indian Institute of Technology, Hyderabad¹, Zayed University²

03 Mar 2021

TL;DR: In this paper, the authors propose a novel computer vision problem called "Open World Object Detection", where a model is tasked to identify objects that have not been introduced to it as "unknown" and incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received.

...read moreread less

Abstract: Humans have a natural instinct to identify unknown object instances in their environments. The intrinsic curiosity about these unknown instances aids in learning about them, when the corresponding knowledge is eventually available. This motivates us to propose a novel computer vision problem called: ‘Open World Object Detection’, where a model is tasked to: 1) identify objects that have not been introduced to it as ‘unknown’, without explicit supervision to do so, and 2) incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received. We formulate the problem, introduce a strong evaluation protocol and provide a novel solution, which we call ORE: Open World Object Detector, based on contrastive clustering and energy based unknown identification. Our experimental evaluation and ablation studies analyse the efficacy of ORE in achieving Open World objectives. As an interesting by-product, we find that identifying and characterising unknown instances helps to reduce confusion in an incremental object detection setting, where we achieve state-of-the-art performance, with no extra methodological effort. We hope that our work will attract further research into this newly identified, yet crucial research direction.1

...read moreread less

248 citations

Journal Article•DOI•

Deep Affinity Network for Multiple Object Tracking

[...]

Shijie Sun¹, Naveed Akhtar², Huansheng Song¹, Ajmal Mian², Mubarak Shah³ - Show less +1 more•Institutions (3)

Chang'an University¹, University of Western Australia², University of Central Florida³

01 Jan 2021-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The proposed Deep Affinity Network (DAN) learns compact, yet comprehensive features of pre-detected objects at several levels of abstraction, and performs exhaustive pairing permutations of those features in any two frames to infer object affinities.

...read moreread less

Abstract: Multiple Object Tracking (MOT) plays an important role in solving many fundamental problems in video analysis and computer vision. Most MOT methods employ two steps: Object Detection and Data Association. The first step detects objects of interest in every frame of a video, and the second establishes correspondence between the detected objects in different frames to obtain their tracks. Object detection has made tremendous progress in the last few years due to deep learning. However, data association for tracking still relies on hand crafted constraints such as appearance, motion, spatial proximity, grouping etc. to compute affinities between the objects in different frames. In this paper, we harness the power of deep learning for data association in tracking by jointly modeling object appearances and their affinities between different frames in an end-to-end fashion. The proposed Deep Affinity Network (DAN) learns compact, yet comprehensive features of pre-detected objects at several levels of abstraction, and performs exhaustive pairing permutations of those features in any two frames to infer object affinities. DAN also accounts for multiple objects appearing and disappearing between video frames. We exploit the resulting efficient affinity computations to associate objects in the current frame deep into the previous frames for reliable on-line tracking. Our technique is evaluated on popular multiple object tracking challenges MOT15, MOT17 and UA-DETRAC. Comprehensive benchmarking under twelve evaluation metrics demonstrates that our approach is among the best performing techniques on the leader board for these challenges. The open source implementation of our work is available at https://github.com/shijieS/SST.git .

...read moreread less

239 citations

Proceedings Article•DOI•

Dynamic Head: Unifying Object Detection Heads with Attentions

[...]

Xiyang Dai¹, Yinpeng Chen¹, Bin Xiao¹, Dongdong Chen¹, Mengchen Liu¹, Lu Yuan¹, Lei Zhang¹ - Show less +3 more•Institutions (1)

Microsoft¹

15 Jun 2021

TL;DR: In this article, a dynamic head framework is proposed to unify object detection heads with attentions, by coherently combining multiple self-attention mechanisms between feature levels for scale-awareness, among spatial locations for spatial-awareness and within output channels for task-awareness.

...read moreread less

Abstract: The complex nature of combining localization and classification in object detection has resulted in the flourished development of methods. Previous works tried to improve the performance in various object detection heads but failed to present a unified view. In this paper, we present a novel dynamic head framework to unify object detection heads with attentions. By coherently combining multiple self-attention mechanisms between feature levels for scale-awareness, among spatial locations for spatial-awareness, and within output channels for task-awareness, the proposed approach significantly improves the representation ability of object detection heads without any computational overhead. Further experiments demonstrate that the effectiveness and efficiency of the proposed dynamic head on the COCO benchmark. With a standard ResNeXt-101-DCN backbone, we largely improve the performance over popular object detectors and achieve a new state-of-the-art at 54.0 AP. The code will be released at https://github.com/microsoft/DynamicHead.

...read moreread less

230 citations

Proceedings Article•DOI•

3D Object Detection with Pointformer

[...]

Xuran Pan¹, Zhuofan Xia¹, Shiji Song¹, Li Erran Li², Gao Huang¹ - Show less +1 more•Institutions (2)

Tsinghua University¹, Columbia University²

01 Jun 2021

TL;DR: Pointformer as mentioned in this paper proposes a Transformer backbone for 3D point clouds to learn features effectively, where a Local Transformer module is employed to model interactions among points in a local region, which learns context-dependent region features at an object level.

...read moreread less

Abstract: Feature learning for 3D object detection from point clouds is very challenging due to the irregularity of 3D point cloud data. In this paper, we propose Pointformer, a Transformer backbone designed for 3D point clouds to learn features effectively. Specifically, a Local Transformer module is employed to model interactions among points in a local region, which learns context-dependent region features at an object level. A Global Transformer is designed to learn context-aware representations at the scene level. To further capture the dependencies among multi-scale representations, we propose Local-Global Transformer to integrate local features with global features from higher resolution. In addition, we introduce an efficient coordinate refinement module to shift down-sampled points closer to object centroids, which improves object proposal generation. We use Pointformer as the backbone for state-of-the-art object detection models and demonstrate significant improvements over original models on both indoor and outdoor datasets.

...read moreread less

218 citations

Journal Article•DOI•

LayerCAM: Exploring Hierarchical Class Activation Maps for Localization

[...]

Peng-Tao Jiang¹, Chang-Bin Zhang¹, Qibin Hou, Ming-Ming Cheng¹, Yunchao Wei² - Show less +1 more•Institutions (2)

Nankai University¹, Beijing Jiaotong University²

22 Jun 2021-IEEE Transactions on Image Processing

TL;DR: Li et al. as mentioned in this paper proposed a simple yet effective method, called LayerCAM, to generate more fine-grained object localization information from the class activation maps to locate the target objects more accurately.

...read moreread less

Abstract: The class activation maps are generated from the final convolutional layer of CNN. They can highlight discriminative object regions for the class of interest. These discovered object regions have been widely used for weakly-supervised tasks. However, due to the small spatial resolution of the final convolutional layer, such class activation maps often locate coarse regions of the target objects, limiting the performance of weakly-supervised tasks that need pixel-accurate object locations. Thus, we aim to generate more fine-grained object localization information from the class activation maps to locate the target objects more accurately. In this paper, by rethinking the relationships between the feature maps and their corresponding gradients, we propose a simple yet effective method, called LayerCAM. It can produce reliable class activation maps for different layers of CNN. This property enables us to collect object localization information from coarse (rough spatial localization) to fine (precise fine-grained details) levels. We further integrate them into a high-quality class activation map, where the object-related pixels can be better highlighted. To evaluate the quality of the class activation maps produced by LayerCAM, we apply them to weakly-supervised object localization and semantic segmentation. Experiments demonstrate that the class activation maps generated by our method are more effective and reliable than those by the existing attention methods. The code will be made publicly available.

...read moreread less

206 citations

Proceedings Article•DOI•

FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding

[...]

Bo Sun¹, Banghuai Li, Shengcai Cai, Ye Yuan, Chi Zhang - Show less +1 more•Institutions (1)

University of Southern California¹

20 Jun 2021

TL;DR: In this article, a contrastive proposal encoding loss (CPE loss) was proposed to improve the performance of few-shot object detection by learning contrastive-aware object proposal encodings that facilitate the classification of detected objects.

...read moreread less

Abstract: Emerging interests have been brought to recognize previously unseen objects given very few training examples, known as few-shot object detection (FSOD). Recent researches demonstrate that good feature embedding is the key to reach favorable few-shot learning performance. We observe object proposals with different Intersection-of-Union (IoU) scores are analogous to the intra-image augmentation used in contrastive visual representation learning. And we exploit this analogy and incorporate supervised contrastive learning to achieve more robust objects representations in FSOD. We present Few-Shot object detection via Contrastive proposals Encoding (FSCE), a simple yet effective approach to learning contrastive-aware object proposal encodings that facilitate the classification of detected objects. We notice the degradation of average precision (AP) for rare objects mainly comes from misclassifying novel instances as confusable classes. And we ease the misclassification issues by promoting instance level intraclass compactness and inter-class variance via our contrastive proposal encoding loss (CPE loss). Our design outperforms current state-of-the-art works in any shot and all data splits, with up to +8.8% on standard benchmark PASCAL VOC and +2.7% on challenging COCO benchmark. Code is available at: https://github.com/MegviiDetection/FSCE.

...read moreread less

184 citations

Posted Content•

A Survey of Modern Deep Learning based Object Detection Models

[...]

Syed Sahil Abbas Zaidi, Mohammad Samar Ansari, Asra Aslam, Nadia Kanwal¹, Mamoona Naveed Asghar¹, Brian Lee¹ - Show less +2 more•Institutions (1)

Athlone Institute of Technology¹

24 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a survey of recent developments in deep learning based object detectors is presented along with some of the prominent backbone architectures used in recognition tasks and compared the performances of these architectures on multiple metrics.

...read moreread less

Abstract: Object Detection is the task of classification and localization of objects in an image or video. It has gained prominence in recent years due to its widespread applications. This article surveys recent developments in deep learning based object detectors. Concise overview of benchmark datasets and evaluation metrics used in detection is also provided along with some of the prominent backbone architectures used in recognition tasks. It also covers contemporary lightweight classification models used on edge devices. Lastly, we compare the performances of these architectures on multiple metrics.

...read moreread less

174 citations

Proceedings Article•DOI•

HOTR: End-to-End Human-Object Interaction Detection with Transformers

[...]

Bumsoo Kim, Junhyun Lee¹, Jaewoo Kang¹, Eun-Sol Kim, Hyunwoo Kim¹ - Show less +1 more•Institutions (1)

Korea University¹

20 Jun 2021

TL;DR: Zhang et al. as mentioned in this paper presented a novel framework, referred by HOTR, which directly predicts a set of human, object, interaction triplets from an image based on a transformer encoder-decoder architecture.

...read moreread less

Abstract: Human-Object Interaction (HOI) detection is a task of identifying "a set of interactions" in an image, which involves the i) localization of the subject (i.e., humans) and target (i.e., objects) of interaction, and ii) the classification of the interaction labels. Most existing methods have indirectly addressed this task by detecting human and object instances and individually inferring every pair of the detected instances. In this paper, we present a novel framework, referred by HOTR, which directly predicts a set of 〈human, object, interaction〉 triplets from an image based on a transformer encoder-decoder architecture. Through the set prediction, our method effectively exploits the inherent semantic relationships in an image and does not require time-consuming post-processing which is the main bottleneck of existing methods. Our proposed algorithm achieves the state-of-the-art performance in two HOI detection benchmarks with an inference time under 1 ms after object detection.

...read moreread less

160 citations

Journal Article•DOI•

Concealed Object Detection.

[...]

Deng-Ping Fan¹, Ge-Peng Ji², Ming-Ming Cheng¹, Ling Shao•Institutions (2)

Nankai University¹, Wuhan University²

01 Jun 2021-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Li et al. as discussed by the authors presented the first systematic study on concealed object detection (COD), which aims to identify objects that are?perfectly? embedded in their background, and designed a simple but strong baseline for COD, termed the Search Identification Network (SINet).

...read moreread less

Abstract: We present the first systematic study on concealed object detection (COD), which aims to identify objects that are ?perfectly? embedded in their background. The high intrinsic similarities between the concealed objects and their background make COD far more challenging than traditional object detection/segmentation. To better understand this task, we collect a large-scale dataset, called COD10K, which consists of 10,000 images covering concealed objects in diverse real-world scenarios from 78 object categories. Further, we provide rich annotations including object categories, object boundaries, challenging attributes, object-level labels, and instance-level annotations. Our COD10K enables comprehensive concealed object understanding and can even be used to help progress several other vision tasks, such as detection, segmentation, classification etc. We also design a simple but strong baseline for COD, termed the Search Identification Network (SINet). Without any bells and whistles, SINet outperform 12 cutting-edge baselines on all datasets tested, making them robust, general architectures that could serve as catalysts for future research in COD. Finally, we provide some interesting findings, and highlight several potential applications and future directions. To spark research in this new field, our code, dataset, and online demo are available at our project page: http://mmcheng.net/cod.

...read moreread less

Journal Article•DOI•

OCNet: Object Context for Semantic Segmentation

[...]

Yuhui Yuan¹, Yuhui Yuan², Lang Huang³, Jianyuan Guo³, Chao Zhang³, Xilin Chen¹, Jingdong Wang² - Show less +3 more•Institutions (3)

Chinese Academy of Sciences¹, Microsoft², Peking University³

24 May 2021-International Journal of Computer Vision

TL;DR: This paper proposes an efficient interlaced sparse self-attention scheme to model the dense relations between any two of all pixels via the combination of two sparse relation matrices and empirically shows the advantages of this approach with competitive performances on five challenging benchmarks.

...read moreread less

Abstract: In this paper, we address the semantic segmentation task with a new context aggregation scheme named object context, which focuses on enhancing the role of object information. Motivated by the fact that the category of each pixel is inherited from the object it belongs to, we define the object context for each pixel as the set of pixels that belong to the same category as the given pixel in the image. We use a binary relation matrix to represent the relationship between all pixels, where the value one indicates the two selected pixels belong to the same category and zero otherwise. We propose to use a dense relation matrix to serve as a surrogate for the binary relation matrix. The dense relation matrix is capable to emphasize the contribution of object information as the relation scores tend to be larger on the object pixels than the other pixels. Considering that the dense relation matrix estimation requires quadratic computation overhead and memory consumption w.r.t. the input size, we propose an efficient interlaced sparse self-attention scheme to model the dense relations between any two of all pixels via the combination of two sparse relation matrices. To capture richer context information, we further combine our interlaced sparse self-attention scheme with the conventional multi-scale context schemes including pyramid pooling (Zhao et al. 2017) and atrous spatial pyramid pooling (Chen et al. 2018). We empirically show the advantages of our approach with competitive performances on five challenging benchmarks including: Cityscapes, ADE20K, LIP, PASCAL-Context and COCO-Stuff.

...read moreread less

Journal Article•DOI•

Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review

[...]

Guoguang Du, Kai Wang, Shiguo Lian, Kaiyong Zhao

01 Mar 2021-Artificial Intelligence Review

TL;DR: Three key tasks during vision-based robotic grasping are concluded, which are object localization, object pose estimation and grasp estimation, which include 2D planar grasp methods and 6DoF grasp methods.

...read moreread less

Abstract: This paper presents a comprehensive survey on vision-based robotic grasping. We conclude three key tasks during vision-based robotic grasping, which are object localization, object pose estimation and grasp estimation. In detail, the object localization task contains object localization without classification, object detection and object instance segmentation. This task provides the regions of the target object in the input data. The object pose estimation task mainly refers to estimating the 6D object pose and includes correspondence-based methods, template-based methods and voting-based methods, which affords the generation of grasp poses for known objects. The grasp estimation task includes 2D planar grasp methods and 6DoF grasp methods, where the former is constrained to grasp from one direction. These three tasks could accomplish the robotic grasping with different combinations. Lots of object pose estimation methods need not object localization, and they conduct object localization and object pose estimation jointly. Lots of grasp estimation methods need not object localization and object pose estimation, and they conduct grasp estimation in an end-to-end manner. Both traditional methods and latest deep learning-based methods based on the RGB-D image inputs are reviewed elaborately in this survey. Related datasets and comparisons between state-of-the-art methods are summarized as well. In addition, challenges about vision-based robotic grasping and future directions in addressing these challenges are also pointed out.

...read moreread less

Proceedings Article•DOI•

Objects are Different: Flexible Monocular 3D Object Detection

[...]

Yunpeng Zhang¹, Jiwen Lu¹, Jie Zhou¹•Institutions (1)

Tsinghua University¹

06 Apr 2021

TL;DR: Zhang et al. as discussed by the authors propose a flexible framework for monocular 3D object detection which explicitly decouples the truncated objects and adaptively combines multiple approaches for object depth estimation.

...read moreread less

Abstract: The precise localization of 3D objects from a single image without depth information is a highly challenging problem. Most existing methods adopt the same approach for all objects regardless of their diverse distributions, leading to limited performance for truncated objects. In this paper, we propose a flexible framework for monocular 3D object detection which explicitly decouples the truncated objects and adaptively combines multiple approaches for object depth estimation. Specifically, we decouple the edge of the feature map for predicting long-tail truncated objects so that the optimization of normal objects is not influenced. Furthermore, we formulate the object depth estimation as an uncertainty-guided ensemble of directly regressed object depth and solved depths from different groups of keypoints. Experiments demonstrate that our method outperforms the state-of-the-art method by relatively 27% for the moderate level and 30% for the hard level in the test set of KITTI benchmark while maintaining real-time efficiency. Code will be available at https://github.com/zhangyp15/MonoFlex.

...read moreread less

Journal Article•DOI•

RGB-D salient object detection: A survey.

[...]

Tao Zhou, Deng-Ping Fan, Ming-Ming Cheng¹, Jianbing Shen, Ling Shao - Show less +1 more•Institutions (1)

Nankai University¹

07 Jan 2021-Computational Visual Media

TL;DR: Li et al. as discussed by the authors provided a comprehensive survey of RGB-D based salient object detection models from various perspectives, and reviewed related benchmark datasets in detail, and carried out a comprehensive attribute-based evaluation of several representative RGBD-based saliency detection models.

...read moreread less

Abstract: Salient object detection, which simulates human visual perception in locating the most significant object(s) in a scene, has been widely applied to various computer vision tasks. Now, the advent of depth sensors means that depth maps can easily be captured; this additional spatial information can boost the performance of salient object detection. Although various RGB-D based salient object detection models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this field remains lacking. In this paper, we provide a comprehensive survey of RGB-D based salient object detection models from various perspectives, and review related benchmark datasets in detail. Further, as light fields can also provide depth maps, we review salient object detection models and popular benchmark datasets from this domain too. Moreover, to investigate the ability of existing models to detect salient objects, we have carried out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models. Finally, we discuss several challenges and open directions of RGB-D based salient object detection for future research. All collected models, benchmark datasets, datasets constructed for attribute-based evaluation, and related code are publicly available at https://github.com/taozh2017/RGBD-SODsurvey.

...read moreread less

Journal Article•DOI•

From user-generated data to data-driven innovation: A research agenda to understand user privacy in digital markets

[...]

Jose Ramon Saura¹, Domingo Ribeiro-Soriano², Daniel Palacios-Marqués³•Institutions (3)

King Juan Carlos University¹, University of Valencia², Polytechnic University of Valencia³

01 Oct 2021-International Journal of Information Management

TL;DR: The present study aims to provide a comprehensive understanding of the main challenges related to user privacy that affect DDI, and identifies 14 topics related to the study of DDI and UGD strategies.

...read moreread less

Journal Article•DOI•

PoseRBPF: A Rao–Blackwellized Particle Filter for 6-D Object Pose Tracking

[...]

Xinke Deng¹, Arsalan Mousavian², Yu Xiang², Fei Xia³, Timothy Bretl¹, Dieter Fox² - Show less +2 more•Institutions (3)

University of Illinois at Urbana–Champaign¹, Nvidia², Stanford University³

25 Feb 2021-IEEE Transactions on Robotics

TL;DR: This work forms the 6D object pose tracking problem in the Rao-Blackwellized particle filtering framework, where the 3D rotation and the3D translation of an object are decoupled, and achieves state-of-the-art results on two 6D pose estimation benchmarks.

...read moreread less

Abstract: Tracking 6-D poses of objects from videos provides rich information to a robot in performing different tasks such as manipulation and navigation. In this article, we formulate the 6-D object pose tracking problem in the Rao–Blackwellized particle filtering framework, where the 3-D rotation and the 3-D translation of an object are decoupled. This factorization allows our approach, called PoseRBPF, to efficiently estimate the 3-D translation of an object along with the full distribution over the 3-D rotation. This is achieved by discretizing the rotation space in a fine-grained manner and training an autoencoder network to construct a codebook of feature embeddings for the discretized rotations. As a result, PoseRBPF can track objects with arbitrary symmetries while still maintaining adequate posterior distributions. Our approach achieves state-of-the-art results on two 6-D pose estimation benchmarks. We open-source our implementation at https://github.com/NVlabs/PoseRBPF .

...read moreread less

Proceedings Article•DOI•

Uncertainty-aware Joint Salient Object and Camouflaged Object Detection

[...]

Aixuan Li¹, Jing Zhang², Yunqiu Lv¹, Bowen Liu¹, Tong Zhang³, Yuchao Dai¹ - Show less +2 more•Institutions (3)

Northwestern Polytechnical University¹, Australian National University², École Polytechnique Fédérale de Lausanne³

01 Jun 2021

TL;DR: Zhang et al. as discussed by the authors leveraged the contradictory information to enhance the detection ability of both salient object detection and camouflaged object detection, and proposed an adversarial learning network to achieve both higher order similarity measure and network confidence estimation.

...read moreread less

Abstract: Visual salient object detection (SOD) aims at finding the salient object(s) that attract human attention, while camouflaged object detection (COD) on the contrary intends to discover the camouflaged object(s) that hidden in the surrounding. In this paper, we propose a paradigm of lever-aging the contradictory information to enhance the detection ability of both salient object detection and camouflaged object detection. We start by exploiting the easy positive samples in the COD dataset to serve as hard positive samples in the SOD task to improve the robustness of the SOD model. Then, we introduce a "similarity measure" module to explicitly model the contradicting attributes of these two tasks. Furthermore, considering the uncertainty of labeling in both tasks’ datasets, we propose an adversarial learning network to achieve both higher order similarity measure and network confidence estimation. Experimental results on benchmark datasets demonstrate that our solution leads to state-of-the-art (SOTA) performance for both tasks1.

...read moreread less

Journal Article•DOI•

A Review and Comparative Study on Probabilistic Object Detection in Autonomous Driving

[...]

Di Feng¹, Ali Harakeh, Steven L. Waslander, Klaus Dietmayer•Institutions (1)

University of Ulm¹

12 Jul 2021-IEEE Transactions on Intelligent Transportation Systems

TL;DR: An overview of generic uncertainty estimation in deep learning is provided, and a strict comparative study on existing probabilistic object detection methods for autonomous driving applications is presented.

...read moreread less

Abstract: Capturing uncertainty in object detection is indispensable for safe autonomous driving. In recent years, deep learning has become the de-facto approach for object detection, and many probabilistic object detectors have been proposed. However, there is no summary on uncertainty estimation in deep object detection, and existing methods are not only built with different network architectures and uncertainty estimation methods, but also evaluated on different datasets with a wide range of evaluation metrics. As a result, a comparison among methods remains challenging, as does the selection of a model that best suits a particular application. This paper aims to alleviate this problem by providing a review and comparative study on existing probabilistic object detection methods for autonomous driving applications. First, we provide an overview of generic uncertainty estimation in deep learning, and then systematically survey existing methods and evaluation metrics for probabilistic object detection. Next, we present a strict comparative study for probabilistic object detection based on an image detector and three public autonomous driving datasets. Finally, we present a discussion of the remaining challenges and future works. Code has been made available at this https URL

...read moreread less

Journal Article•DOI•

Learning Deep Multi-Level Similarity for Thermal Infrared Object Tracking

[...]

Qiao Liu¹, Xin Li¹, Zhenyu He¹, Nana Fan¹, Di Yuan¹, Hongpeng Wang¹ - Show less +2 more•Institutions (1)

Harbin Institute of Technology¹

01 Jan 2021-IEEE Transactions on Multimedia

TL;DR: A multi-level similarity model under a Siamese framework for robust TIR object tracking is proposed and extensive experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.

...read moreread less

Abstract: Existing deep Thermal InfraRed (TIR) trackers only use semantic features to represent the TIR object, which lack the sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is only trained on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network. One of them focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple while effective relative entropy based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork can adaptive learn the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and the largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits the training for TIR object tracking but also can be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.

...read moreread less

Journal Article•DOI•

Deep learning in multi-object detection and tracking: state of the art

[...]

Sankar K. Pal¹, Anima Pramanik², Jhareswar Maiti², Pabitra Mitra²•Institutions (2)

Indian Statistical Institute¹, Indian Institute of Technology Kharagpur²

09 Apr 2021-Applied Intelligence

TL;DR: In this article, the authors provide a comprehensive overview of object detection and tracking using deep learning (DL) networks and compare the performance of different object detectors and trackers, including the recent development in granulated DL models.

...read moreread less

Abstract: Object detection and tracking is one of the most important and challenging branches in computer vision, and have been widely applied in various fields, such as health-care monitoring, autonomous driving, anomaly detection, and so on. With the rapid development of deep learning (DL) networks and GPU’s computing power, the performance of object detectors and trackers has been greatly improved. To understand the main development status of object detection and tracking pipeline thoroughly, in this survey, we have critically analyzed the existing DL network-based methods of object detection and tracking and described various benchmark datasets. This includes the recent development in granulated DL models. Primarily, we have provided a comprehensive overview of a variety of both generic object detection and specific object detection models. We have enlisted various comparative results for obtaining the best detector, tracker, and their combination. Moreover, we have listed the traditional and new applications of object detection and tracking showing its developmental trends. Finally, challenging issues, including the relevance of granular computing, in the said domain are elaborated as a future scope of research, together with some concerns. An extensive bibliography is also provided.

...read moreread less

Journal Article•DOI•

Review of multi-view 3D object recognition methods based on deep learning

[...]

Shaohua Qi¹, Shaohua Qi², Xin Ning², Guowei Yang¹, Liping Zhang², Peng Long², Weiwei Cai³, Weijun Li² - Show less +4 more•Institutions (3)

Qingdao University¹, Chinese Academy of Sciences², Central South University Forestry and Technology³

01 Sep 2021-Displays

TL;DR: A comprehensive review and classification of the latest developments in the deep learning methods for multi-view 3D object recognition is presented, which summarizes the results of these methods on a few mainstream datasets, provides an insightful summary, and puts forward enlightening future research directions.

...read moreread less

Posted Content•

TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization.

[...]

Wei Gao¹, Fang Wan¹, Xingjia Pan¹, Zhiliang Peng, Qi Tian², Zhenjun Han¹, Bolei Zhou³, Qixiang Ye¹ - Show less +4 more•Institutions (3)

Chinese Academy of Sciences¹, Huawei², The Chinese University of Hong Kong³

27 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper introduces the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in visual transformer for long-range dependency extraction and achieves state-of-the-art performance.

...read moreread less

Abstract: Weakly supervised object localization (WSOL) is a challenging problem when given image category labels but requires to learn object localization models. Optimizing a convolutional neural network (CNN) for classification tends to activate local discriminative regions while ignoring complete object extent, causing the partial activation issue. In this paper, we argue that partial activation is caused by the intrinsic characteristics of CNN, where the convolution operations produce local receptive fields and experience difficulty to capture long-range feature dependency among pixels. We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in visual transformer for long-range dependency extraction. TS-CAM first splits an image into a sequence of patch tokens for spatial embedding, which produce attention maps of long-range visual dependency to avoid partial activation. TS-CAM then re-allocates category-related semantics for patch tokens, enabling each of them to be aware of object categories. TS-CAM finally couples the patch tokens with the semantic-agnostic attention map to achieve semantic-aware localization. Experiments on the ILSVRC/CUB-200-2011 datasets show that TS-CAM outperforms its CNN-CAM counterparts by 7.1%/27.1% for WSOL, achieving state-of-the-art performance.

...read moreread less

Proceedings Article•DOI•

DexYCB: A Benchmark for Capturing Hand Grasping of Objects

[...]

Yu-Wei Chao¹, Wei Yang¹, Yu Xiang¹, Pavlo Molchanov¹, Ankur Handa¹, Jonathan Tremblay¹, Yashraj S. Narang¹, Karl Van Wyk¹, Umar Iqbal¹, Stan Birchfield¹, Jan Kautz¹, Dieter Fox¹ - Show less +8 more•Institutions (1)

Nvidia¹

01 Jun 2021

TL;DR: The DexYCB dataset as mentioned in this paper is a dataset for capturing hand grasping of objects, including 2D object and keypoint detection, 6D object pose estimation, and 3D hand pose estimation.

...read moreread less

Abstract: We introduce DexYCB, a new dataset for capturing hand grasping of objects. We first compare DexYCB with a related one through cross-dataset evaluation. We then present a thorough benchmark of state-of-the-art approaches on three relevant tasks: 2D object and keypoint detection, 6D object pose estimation, and 3D hand pose estimation. Finally, we evaluate a new robotics-relevant task: generating safe robot grasps in human-to-robot object handover. 1

...read moreread less

Proceedings Article•DOI•

Dense Relation Distillation with Context-aware Aggregation for Few-Shot Object Detection

[...]

Hanzhe Hu¹, Shuai Bai², Aoxue Li¹, Jinshi Cui¹, Liwei Wang¹ - Show less +1 more•Institutions (2)

Peking University¹, Beijing University of Posts and Telecommunications²

20 Jun 2021

TL;DR: DCNet as mentioned in this paper proposes Dense Relation Distillation with Context-aware Aggregation (DCNet) to tackle the few-shot detection problem, which learns to adapt to novel classes with only a few annotated examples.

...read moreread less

Abstract: Conventional deep learning based methods for object detection require a large amount of bounding box annotations for training, which is expensive to obtain such high quality annotated data. Few-shot object detection, which learns to adapt to novel classes with only a few annotated examples, is very challenging since the fine-grained feature of novel object can be easily overlooked with only a few data available. In this work, aiming to fully exploit features of annotated novel object and capture fine-grained features of query object, we propose Dense Relation Distillation with Context-aware Aggregation (DCNet) to tackle the few-shot detection problem. Built on the meta-learning based framework, Dense Relation Distillation module targets at fully exploiting support features, where support features and query feature are densely matched, covering all spatial locations in a feed-forward fashion. The abundant usage of the guidance information endows model the capability to handle common challenges such as appearance changes and occlusions. Moreover, to better capture scale-aware features, Context-aware Aggregation module adaptively harnesses features from different scales for a more comprehensive feature representation. Extensive experiments illustrate that our proposed approach achieves state-of-the-art results on PASCAL VOC and MS COCO datasets. Code will be made available at https://github.com/hzhupku/DCNet.

...read moreread less

Journal Article•DOI•

Activated Gradients for Deep Neural Networks.

[...]

Mei Liu¹, Liangming Chen¹, Xiaohao Du², Long Jin¹, Mingsheng Shang¹ - Show less +1 more•Institutions (2)

Chinese Academy of Sciences¹, Lanzhou University²

01 Sep 2021-IEEE Transactions on Neural Networks

TL;DR: In this article, a novel method by acting the gradient activation function (GAF) on the gradient is proposed to handle the saddle point problem, which enlarges the tiny gradients and restricts the large gradient.

...read moreread less

Abstract: Deep neural networks often suffer from poor performance or even training failure due to the ill-conditioned problem, the vanishing/exploding gradient problem, and the saddle point problem. In this article, a novel method by acting the gradient activation function (GAF) on the gradient is proposed to handle these challenges. Intuitively, the GAF enlarges the tiny gradients and restricts the large gradient. Theoretically, this article gives conditions that the GAF needs to meet and, on this basis, proves that the GAF alleviates the problems mentioned above. In addition, this article proves that the convergence rate of SGD with the GAF is faster than that without the GAF under some assumptions. Furthermore, experiments on CIFAR, ImageNet, and PASCAL visual object classes confirm the GAF's effectiveness. The experimental results also demonstrate that the proposed method is able to be adopted in various deep neural networks to improve their performance. The source code is publicly available at https://github.com/LongJin-lab/Activated-Gradients-for-Deep-Neural-Networks.

...read moreread less

Proceedings Article•DOI•

ST3D: Self-training for Unsupervised Domain Adaptation on 3D Object Detection

[...]

Jihan Yang¹, Shaoshuai Shi², Zhe Wang³, Hongsheng Li², Xiaojuan Qi¹ - Show less +1 more•Institutions (3)

University of Hong Kong¹, The Chinese University of Hong Kong², SenseTime³

01 Jun 2021

TL;DR: ST3D as discussed by the authors proposed a domain adaptive self-training pipeline, named ST3D, for unsupervised domain adaptation on 3D object detection from point clouds by pre-training the 3D detector on the source domain with a proposed random object scaling strategy for mitigating the negative effects of source domain bias.

...read moreread less

Abstract: We present a new domain adaptive self-training pipeline, named ST3D, for unsupervised domain adaptation on 3D object detection from point clouds. First, we pre-train the 3D detector on the source domain with our proposed random object scaling strategy for mitigating the negative effects of source domain bias. Then, the detector is iteratively improved on the target domain by alternatively conducting two steps, which are the pseudo label updating with the developed quality-aware triplet memory bank and the model training with curriculum data augmentation. These specific designs for 3D object detection enable the detector to be trained with consistent and high-quality pseudo labels and to avoid overfitting to the large number of easy examples in pseudo labeled data. Our ST3D achieves state-of-the-art performance on all evaluated datasets and even surpasses fully supervised results on KITTI 3D object detection benchmark. Code will be available at https://github.com/CVMI-Lab/ST3D.

...read moreread less

Proceedings Article•DOI•

Scale-Aware Graph Neural Network for Few-Shot Semantic Segmentation

[...]

Jie Liu¹, Huan Xiong¹, Ling Shao•Institutions (1)

Zayed University¹

01 Jun 2021

TL;DR: Zhang et al. as discussed by the authors proposed an end-to-end scale-aware graph neural network (SAGNN) by reasoning the cross-scale relations among the support-query images for few-shot semantic segmentation.

...read moreread less

Abstract: Few-shot semantic segmentation (FSS) aims to segment unseen class objects given very few densely-annotated support images from the same class. Existing FSS methods find the query object by using support prototypes or by directly relying on heuristic multi-scale feature fusion. However, they fail to fully leverage the high-order appearance relationships between multi-scale features among the support-query image pairs, thus leading to an inaccurate localization of the query objects. To tackle the above challenge, we propose an end-to-end scale-aware graph neural network (SAGNN) by reasoning the cross-scale relations among the support-query images for FSS. Specifically, a scale-aware graph is first built by taking support-induced multi-scale query features as nodes and, meanwhile, each edge is modeled as the pairwise interaction of its connected nodes. By progressive message passing over this graph, SAGNN is capable of capturing cross-scale relations and overcoming object variations (e.g., appearance, scale and location), and can thus learn more precise node embeddings. This in turn enables it to predict more accurate foreground objects. Moreover, to make full use of the location relations across scales for the query image, a novel self-node collaboration mechanism is proposed to enrich the current node, which endows SAGNN the ability of perceiving different resolutions of the same objects. Extensive experiments on PASCAL-5i and COCO-20i show that SAGNN achieves state-of-the-art results.

...read moreread less

Proceedings Article•DOI•

Efficient Regional Memory Network for Video Object Segmentation

[...]

Haozhe Xie¹, Hongxun Yao¹, Shangchen Zhou², Shengping Zhang¹, Wenxiu Sun³ - Show less +1 more•Institutions (3)

Harbin Institute of Technology¹, Nanyang Technological University², SenseTime³

01 Jun 2021

TL;DR: In this paper, the precise regional memory is constructed by memorizing local regions where the target objects appear in the past frames, and the query regions are tracked and predicted based on the optical flow estimated from the previous frame.

...read moreread less

Abstract: Recently, several Space-Time Memory based networks have shown that the object cues (e.g. video frames as well as the segmented object masks) from the past frames are useful for segmenting objects in the current frame. However, these methods exploit the information from the memory by global-to-global matching between the current and past frames, which lead to mismatching to similar objects and high computational complexity. To address these problems, we propose a novel local-to-local matching solution for semi-supervised VOS, namely Regional Memory Network (RMNet). In RMNet, the precise regional memory is constructed by memorizing local regions where the target objects appear in the past frames. For the current query frame, the query regions are tracked and predicted based on the optical flow estimated from the previous frame. The proposed local-to-local matching effectively alleviates the ambiguity of similar objects in both memory and query frames, which allows the information to be passed from the regional memory to the query region efficiently and effectively. Experimental results indicate that the proposed RM-Net performs favorably against state-of-the-art methods on the DAVIS and YouTube-VOS datasets.

...read moreread less

Proceedings Article•DOI•

BBAM: Bounding Box Attribution Map for Weakly Supervised Semantic and Instance Segmentation

[...]

Jungbeom Lee¹, Jihun Yi¹, Chaehun Shin¹, Sungroh Yoon¹•Institutions (1)

Seoul National University¹

01 Jun 2021

TL;DR: In this paper, a bounding-box attribution map (BBAM) was proposed to identify the target object in its bounding box and thus serve as pseudo ground truth for weakly supervised semantic and instance segmentation.

...read moreread less

Abstract: Weakly supervised segmentation methods using bounding box annotations focus on obtaining a pixel-level mask from each box containing an object. Existing methods typically depend on a class-agnostic mask generator, which operates on the low-level information intrinsic to an image. In this work, we utilize higher-level information from the behavior of a trained object detector, by seeking the smallest areas of the image from which the object detector produces almost the same result as it does from the whole image. These areas constitute a bounding-box attribution map (BBAM), which identifies the target object in its bounding box and thus serves as pseudo ground-truth for weakly supervised semantic and instance segmentation. This approach significantly outperforms recent comparable techniques on both the PASCAL VOC and MS COCO benchmarks in weakly supervised semantic and instance segmentation. In addition, we provide a detailed analysis of our method, offering deeper insight into the behavior of the BBAM.

...read moreread less

Collapse