Open Access · Posted Content

3D Bounding Box Estimation Using Deep Learning and Geometry

TL;DR
Although conceptually simple, this method outperforms more complex and computationally expensive approaches that leverage semantic segmentation, instance-level segmentation, and flat ground priors, and produces state-of-the-art results for 3D viewpoint estimation on the Pascal 3D+ dataset.
Abstract
We present a method for 3D object detection and pose estimation from a single image. In contrast to current techniques that only regress the 3D orientation of an object, our method first regresses relatively stable 3D object properties using a deep convolutional neural network and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D bounding box. The first network output estimates the 3D object orientation using a novel hybrid discrete-continuous loss, which significantly outperforms the L2 loss. The second output regresses the 3D object dimensions, which have relatively little variance compared to alternatives and can often be predicted for many object types. These estimates, combined with the geometric constraints on translation imposed by the 2D bounding box, enable us to recover a stable and accurate 3D object pose. We evaluate our method on the challenging KITTI object detection benchmark, both on the official metric of 3D orientation estimation and on the accuracy of the obtained 3D bounding boxes. Although conceptually simple, our method outperforms more complex and computationally expensive approaches that leverage semantic segmentation, instance-level segmentation, flat ground priors, and sub-category detection. Our discrete-continuous loss also produces state-of-the-art results for 3D viewpoint estimation on the Pascal 3D+ dataset.
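As a rough sketch of the hybrid discrete-continuous ("MultiBin") orientation loss described above: the angle range is split into bins, a classifier picks the bin containing the ground-truth angle, and a residual rotation relative to the bin center is regressed via its sine and cosine. The code below is a minimal reading of that idea, not the authors' implementation; the function and argument names (multibin_loss, bin_logits, bin_sin_cos, w) are assumptions, and it simplifies the paper's overlapping bins by regressing the residual for the single nearest bin only.

```python
import torch
import torch.nn.functional as F

def multibin_loss(bin_logits, bin_sin_cos, gt_angle, bin_centers, w=1.0):
    """Hybrid discrete-continuous orientation loss (MultiBin-style sketch).

    bin_logits:  (B, n_bins) confidence that the angle falls in each bin
    bin_sin_cos: (B, n_bins, 2) predicted (sin, cos) of the residual per bin
    gt_angle:    (B,) ground-truth orientation in radians
    bin_centers: (n_bins,) center angle of each bin
    """
    # Discrete part: classify which bin the ground-truth angle falls into
    # (simplification: only the nearest bin, ignoring bin overlap).
    delta = gt_angle[:, None] - bin_centers[None, :]
    delta = torch.atan2(torch.sin(delta), torch.cos(delta))  # wrap to [-pi, pi]
    gt_bin = delta.abs().argmin(dim=1)
    conf_loss = F.cross_entropy(bin_logits, gt_bin)

    # Continuous part: maximize cos(residual - predicted residual) for the
    # ground-truth bin, i.e. minimize the negative cosine distance.
    idx = torch.arange(gt_angle.size(0), device=gt_angle.device)
    sin_cos = F.normalize(bin_sin_cos[idx, gt_bin], dim=1)   # unit (sin, cos)
    residual = delta[idx, gt_bin]
    loc_loss = -(sin_cos[:, 0] * torch.sin(residual) +
                 sin_cos[:, 1] * torch.cos(residual)).mean()

    return conf_loss + w * loc_loss
```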


Citations
Posted Content

nuScenes: A multimodal dataset for autonomous driving

TL;DR: nuScenes is the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars, and 1 lidar, all with a full 360-degree field of view.
Posted Content

Objects as Points

TL;DR: The center-point-based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding-box-based detectors; it performs competitively with sophisticated multi-stage methods while running in real time.
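A frequently cited detail of the center-point formulation is that detection reduces to finding local maxima in a per-class center heatmap, with a 3x3 max-pool standing in for non-maximum suppression. The sketch below illustrates that decoding step; the function name and top-k interface are illustrative assumptions, not the CenterNet reference code.

```python
import torch
import torch.nn.functional as F

def extract_center_peaks(heatmap, k=100):
    """Pick the top-k local maxima of a center heatmap of shape (B, C, H, W).

    A 3x3 max-pool acts as non-maximum suppression: a location survives
    only if it equals the maximum of its own neighborhood.
    """
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()
    scores, flat_idx = peaks.flatten(1).topk(k)  # top-k over all C*H*W cells
    _, _, H, W = heatmap.shape
    cls = flat_idx // (H * W)          # class channel of each peak
    ys = flat_idx % (H * W) // W       # row in the feature map
    xs = flat_idx % W                  # column in the feature map
    return scores, cls, ys, xs
```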
Journal ArticleDOI

SECOND: Sparsely Embedded Convolutional Detection

TL;DR: An improved sparse convolution method for voxel-based 3D convolutional networks is investigated, which significantly increases the speed of both training and inference; a new form of angle loss regression is also introduced to improve orientation estimation performance.
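The angle loss mentioned above is commonly summarized as regressing the sine of the angle error, which makes orientations that differ by pi (a box flipped front-to-back) cost nothing; the paper pairs this with a separate direction classifier. A minimal sketch of the regression term, with an illustrative function name:

```python
import torch
import torch.nn.functional as F

def sine_angle_loss(pred_angle, gt_angle):
    """Smooth-L1 on sin(pred - gt): zero loss for orientations that differ
    by pi, so the regression target is unambiguous for symmetric boxes."""
    return F.smooth_l1_loss(torch.sin(pred_angle - gt_angle),
                            torch.zeros_like(pred_angle))
```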
Posted Content

PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud

TL;DR: Extensive experiments on the 3D detection benchmark of the KITTI dataset show that the proposed architecture outperforms state-of-the-art methods by remarkable margins while using only the point cloud as input.
Proceedings ArticleDOI

SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again

TL;DR: In this paper, a novel method for detecting 3D model instances and estimating their 6D pose from RGB data in a single shot is presented, which outperforms state-of-the-art methods that leverage RGBD data on multiple challenging datasets.
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings ArticleDOI

You Only Look Once: Unified, Real-Time Object Detection

TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Posted Content

Rich feature hierarchies for accurate object detection and semantic segmentation

TL;DR: This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.
Book ChapterDOI

SSD: Single Shot MultiBox Detector

TL;DR: SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature-map location, and combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.
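To make "default boxes over different aspect ratios and scales" concrete, here is a simplified generator for a single square feature map. It is a sketch under assumptions (one scale per map; the extra box SSD adds for aspect ratio 1 is omitted), and the names are illustrative:

```python
import itertools
import math

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """SSD-style default boxes as (cx, cy, w, h) in [0, 1] coordinates,
    one box per aspect ratio at every cell of a square feature map."""
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size  # cell center
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes
```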