
3D Bounding Box Estimation Using Deep Learning and Geometry

TL;DR: A hybrid discrete-continuous loss is used to regress the 3D orientation of an object from a single image; combined with regressed 3D box dimensions and the geometric constraints imposed by the 2D detection box, this yields a complete and stable 3D bounding box.
Abstract: We present a method for 3D object detection and pose estimation from a single image. In contrast to current techniques that only regress the 3D orientation of an object, our method first regresses relatively stable 3D object properties using a deep convolutional neural network and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D bounding box. The first network output estimates the 3D object orientation using a novel hybrid discrete-continuous loss, which significantly outperforms the L2 loss. The second output regresses the 3D object dimensions, which have relatively little variance compared to alternatives and can often be predicted for many object types. These estimates, combined with the geometric constraints on translation imposed by the 2D bounding box, enable us to recover a stable and accurate 3D object pose. We evaluate our method on the challenging KITTI object detection benchmark [2] both on the official metric of 3D orientation estimation and also on the accuracy of the obtained 3D bounding boxes. Although conceptually simple, our method outperforms more complex and computationally expensive approaches that leverage semantic segmentation, instance level segmentation and flat ground priors [4] and sub-category detection [23][24]. Our discrete-continuous loss also produces state of the art results for 3D viewpoint estimation on the Pascal 3D+ dataset[26].


3D Bounding Box Estimation Using Deep Learning and Geometry
Arsalan Mousavian
George Mason University
amousavi@gmu.edu
Dragomir Anguelov
Zoox, Inc.
drago@zoox.com
John Flynn
Zoox, Inc.
john.flynn@zoox.com
Jana Košecká
George Mason University
kosecka@gmu.edu
Abstract
We present a method for 3D object detection and pose
estimation from a single image. In contrast to current tech-
niques that only regress the 3D orientation of an object, our
method first regresses relatively stable 3D object properties
using a deep convolutional neural network and then com-
bines these estimates with geometric constraints provided
by a 2D object bounding box to produce a complete 3D
bounding box. The first network output estimates the 3D
object orientation using a novel hybrid discrete-continuous
loss, which significantly outperforms the L2 loss. The sec-
ond output regresses the 3D object dimensions, which have
relatively little variance compared to alternatives and can
often be predicted for many object types. These estimates,
combined with the geometric constraints on translation im-
posed by the 2D bounding box, enable us to recover a stable
and accurate 3D object pose. We evaluate our method on
the challenging KITTI object detection benchmark [2] both on the official metric of 3D orientation estimation and also on the accuracy of the obtained 3D bounding boxes. Although conceptually simple, our method outperforms more complex and computationally expensive approaches that leverage semantic segmentation, instance level segmentation and flat ground priors [4] and sub-category detection [23][24]. Our discrete-continuous loss also produces state of the art results for 3D viewpoint estimation on the Pascal 3D+ dataset [26].
1. Introduction
The problem of 3D object detection is of particular im-
portance in robotic applications that require decision mak-
ing or interactions with objects in the real world. 3D ob-
ject detection recovers both the 6 DoF pose and the dimensions of an object from an image.
Work done as an intern at Zoox, Inc.
Figure 1. Our method takes the 2D detection bounding box and estimates a 3D bounding box.
While recently developed
2D detection algorithms are capable of handling large varia-
tions in viewpoint and clutter, accurate 3D object detection
largely remains an open problem despite some promising
recent work. The existing efforts to integrate pose estima-
tion with state-of-the-art object detectors focus mostly on
viewpoint estimation. They exploit the observation that the
appearance of objects changes as a function of viewpoint
and that discretization of viewpoints (parametrized by az-
imuth and elevation) gives rise to sub-categories which can
be trained discriminatively [23]. In more restrictive driving scenarios alternatives to full 3D pose estimation explore exhaustive sampling and scoring of all hypotheses [4] using a
variety of contextual and semantic cues.
In this work, we propose a method that estimates the
pose (R, T) ∈ SE(3) and the dimensions of an object's
3D bounding box from a 2D bounding box and the sur-
rounding image pixels. Our simple and efficient method
is suitable for many real world applications including self-
driving vehicles. The main contribution of our approach is
in the choice of the regression parameters and the associated
objective functions for the problem. We first regress the
orientation and object dimensions before combining these
estimates with geometric constraints to produce a final 3D
pose. This is in contrast to previous techniques that attempt

to directly regress to pose.
A state of the art 2D object detector [3] is extended
by training a deep convolutional neural network (CNN) to
regress the orientation of the object’s 3D bounding box and
its dimensions. Given estimated orientation and dimensions
and the constraint that the projection of the 3D bounding
box fits tightly into the 2D detection window, we recover
the translation and the object’s 3D bounding box. Although
conceptually simple, our method is based on several im-
portant insights. We show that a novel MultiBin discrete-
continuous formulation of the orientation regression signif-
icantly outperforms a more traditional L2 loss. Further con-
straining the 3D box by regressing to vehicle dimensions
proves especially effective, since they are relatively low-
variance and result in stable final 3D box estimates.
We evaluate our method on the KITTI [2] and Pascal 3D+ [26] datasets. On the KITTI dataset, we perform an in-depth comparison of our estimated 3D boxes to the results of other state-of-the-art 3D object detection algorithms [24, 4]. The official KITTI benchmark for 3D bounding box
estimation only evaluates the 3D box orientation estimate.
We introduce three additional performance metrics measur-
ing the 3D box accuracy: distance to center of box, distance
to the center of the closest bounding box face, and the over-
all bounding box overlap with the ground truth box, mea-
sured using 3D Intersection over Union (3D IoU) score. We
demonstrate that given sufficient training data, our method
is superior to the state of the art on all the above 3D metrics.
Since the Pascal 3D+ dataset does not have the physical di-
mensions annotated and the intrinsic camera parameters are
approximate, we only evaluate viewpoint estimation accu-
racy showing that our MultiBin module achieves state-of-
the-art results there as well.
In summary, the main contributions of our paper include:
1) A method to estimate an object’s full 3D pose and di-
mensions from a 2D bounding box using the constraints
provided by projective geometry and estimates of the ob-
ject’s orientation and size regressed using a deep CNN. In
contrast to other methods, our approach does not require
any preprocessing stages or 3D object models. 2) A novel
discrete-continuous CNN architecture called MultiBin re-
gression for estimation of the object’s orientation. 3) Three
new metrics for evaluating 3D boxes beyond their orienta-
tion accuracy for the KITTI dataset. 4) An experimental
evaluation demonstrating the effectiveness of our approach
for KITTI cars, which also illustrates the importance of the
specific choice of regression parameters within our 3D pose
estimation framework. 5) Viewpoint evaluation on the Pas-
cal 3D+ dataset.
2. Related Work
The classical problem of 6 DoF pose estimation of an
object instance from a single 2D image has been consid-
ered previously as a purely geometric problem known as
the perspective n-point problem (PnP). Several closed form
and iterative solutions assuming correspondences between
2D keypoints in the image and a 3D model of the object can
be found in [10] and references therein. Other methods fo-
cus on constructing 3D models of the object instances and
then finding the 3D pose in the image that best matches the
model [19, 6].
With the introduction of new challenging datasets [2, 26, 25, 12], 3D pose estimation has been extended to object cat-
egories, which requires handling both the appearance vari-
ations due to pose changes and the appearance variations
within the category [9, 15]. In [16, 26] the object detec-
tion framework of discriminative part based models (DPMs)
is used to tackle the problem of pose estimation formu-
lated jointly as a structured prediction problem, where each
mixture component represents a different azimuth section.
However, such approaches predict only an Euler angle sub-
set with respect to the canonical object frame, while object
dimensions and position are not estimated.
An alternative direction is to exploit the availability of
3D shape models and use those for 3D hypothesis sampling
and refinement. For example, Mottaghi et al. [13] sample
the object viewpoint, position and size and then measure
the similarity between rendered 3D CAD models of the ob-
ject and the detection window using HOG features. A sim-
ilar method for estimating the pose using the projection of
CAD model object instances has been explored by [29] in
a robotics table-top setting where the detection problem is
less challenging. Given the coarse pose estimate obtained
from a DPM-based detector, the continuous 6 DoF pose is
refined by estimating the correspondences between the pro-
jected 3D model and the image contours. The evaluation
was carried out on PASCAL3D+ or simple table top set-
tings with limited clutter or scale variations. An extension
of these methods to more challenging scenarios with signifi-
cant occlusion has been explored in [22], which uses dictio-
naries of 3D voxel patterns learned from 3D CAD models
that characterize both the object’s shape and commonly en-
countered occlusion patterns.
Recently, deep convolutional neural networks (CNN)
have dramatically improved the performance of 2D object
detection and several extensions have been proposed to in-
clude 3D pose estimation. In [21] R-CNN [7] is used to detect objects and the resulting detected regions are passed as input to a pose estimation network. The pose network is initialized with VGG [20] and fine-tuned for pose estimation using ground truth annotations from Pascal 3D+. This approach is similar to [8], with the distinction of using
separate pose weights for each category and a large num-
ber of synthetic images with pose annotation ground truth
for training. In [17], Poirson et al. discretize the object
viewpoint and train a deep convolutional network to jointly

perform viewpoint estimation and 2D detection. The net-
work shares the pose parameter weights across all classes.
In [21], Tulsiani et al. explore the relationship between coarse viewpoint estimation, followed by keypoint detection, localization and pose estimation. Pavlakos et al. [14] used a CNN to localize keypoints and then used the keypoints and their 3D coordinates from meshes to recover the
pose. However, their approach required training data with
annotated keypoints.
Several recent methods have explored 3D bounding box
detection for driving scenarios and are most closely related
to our method. Xiang et al. [23, 24] cluster the set of possible object poses into viewpoint-dependent subcategories. These subcategories are obtained by clustering 3D voxel patterns introduced previously [22]; 3D CAD models are required to learn the pattern dictionaries. The subcategories capture shape, viewpoint and occlusion patterns and are subsequently classified discriminatively [24] using deep CNNs. Another related approach by Chen et al. [4] ad-
dresses the problem by sampling 3D boxes in the physical
world assuming the flat ground plane constraint. The boxes
are scored using high level contextual, shape and category
specific features. All of the above approaches require com-
plicated preprocessing including high level features such as
segmentation or 3D shape repositories and may not be suit-
able for robots with limited computational resources.
3. 3D Bounding Box Estimation
In order to leverage the success of existing work on 2D
object detection for 3D bounding box estimation, we use
the fact that the perspective projection of a 3D bounding
box should fit tightly within its 2D detection window. We
assume that the 2D object detector has been trained to pro-
duce boxes that correspond to the bounding box of the pro-
jected 3D box. The 3D bounding box is described by its center T = [t_x, t_y, t_z]^T, dimensions D = [d_x, d_y, d_z], and orientation R(θ, φ, α), here parameterized by the azimuth, elevation and roll angles. Given the pose of the object in the camera coordinate frame (R, T) ∈ SE(3) and the camera intrinsics matrix K, the projection of a 3D point X_o = [X, Y, Z, 1]^T in the object's coordinate frame into the image x = [x, y, 1]^T is:

x = K [R  T] X_o    (1)
Assuming that the origin of the object coordinate frame
is at the center of the 3D bounding box and the ob-
ject dimensions D are known, the coordinates of the 3D bounding box vertices can be described simply by X_1 = [d_x/2, d_y/2, d_z/2]^T, X_2 = [-d_x/2, d_y/2, d_z/2]^T, ..., X_8 = [-d_x/2, -d_y/2, -d_z/2]^T. The constraint that the 3D bounding box fits tightly into the 2D detection window requires that each side of the 2D bounding box be touched by the projection of at least one of the 3D box corners.
For example, consider the projection of one 3D corner X_0 = [d_x/2, -d_y/2, d_z/2]^T that touches the left side of the 2D bounding box with coordinate x_min. This point-to-side correspondence constraint results in the equation:

x_min = ( K [R  T] [d_x/2, -d_y/2, d_z/2, 1]^T )_x    (2)

where (.)_x refers to the x coordinate from the perspective projection. Similar equations can be derived for the remaining 2D box side parameters x_max, y_min, y_max. In total the
sides of the 2D bounding box provide four constraints on
the 3D bounding box. This is not enough to constrain the
nine degrees of freedom (DoF) (three for translation, three
for rotation, and three for box dimensions). There are sev-
eral different geometric properties we could estimate from
the visual appearance of the box to further constrain the 3D
box. The main criterion is that they should be tied strongly to the visual appearance and further constrain the final 3D box.
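For illustration, the following minimal NumPy sketch applies Eq. (1) to the eight corners X_1, ..., X_8 and reads off the tight 2D box of their projections; the function name and the corner enumeration order are ours, not part of the paper.

import numpy as np
from itertools import product

def project_box_corners(K, R, T, dims):
    """Project the 8 corners of a 3D box via Eq. (1) and return their tight 2D box.

    K: 3x3 intrinsics, R: 3x3 rotation, T: translation (3-vector),
    dims: (d_x, d_y, d_z) box dimensions. All names are illustrative.
    """
    dx, dy, dz = dims
    # Corners in the object frame: all sign combinations of the half-dimensions.
    corners = np.array(list(product([dx / 2, -dx / 2],
                                    [dy / 2, -dy / 2],
                                    [dz / 2, -dz / 2])))            # (8, 3)
    cam = R @ corners.T + np.asarray(T).reshape(3, 1)               # camera frame, (3, 8)
    img = K @ cam                                                   # homogeneous image coordinates
    img = img[:2] / img[2]                                          # perspective divide
    # Tight 2D bounding box of the projected corners.
    x_min, y_min = img.min(axis=1)
    x_max, y_max = img.max(axis=1)
    return (x_min, y_min, x_max, y_max), img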
3.1. Choice of Regression Parameters
The first set of parameters that have a strong effect on
the 3D bounding box is the orientation around each axis
(θ, φ, α). Apart from them, we choose to regress the box
dimensions D rather than translation T because the vari-
ance of the dimension estimate is typically smaller (e.g.
cars tend to be roughly the same size) and does not vary
as the object orientation changes: a desirable property if
we are also regressing orientation parameters. Furthermore,
the dimension estimate is strongly tied to the appearance of
a particular object subcategory and is likely to be accurately
recovered if we can classify that subcategory.
3.2. Correspondence Constraints
Using the dimensions and orientation of the 3D box regressed by the CNN, together with the 2D detection box, we can solve for the translation T that minimizes the reprojection error with respect to the initial 2D detection box constraints in Equation 2. Details of how to solve for translation are included in the supplementary material [1]. Each side of the 2D detection box can correspond to any of the eight corners of the 3D box, which results in 8^4 = 4096 configurations. Each different configuration involves solving an over-constrained system of linear equations which is computationally fast and can be done in parallel. In many scenarios the objects can
can be done in parallel. In many scenarios the objects can
be assumed to be always upright. In this case, the 2D box
top and bottom correspond only to the projection of ver-
tices from the top and bottom of the 3D box, respectively,
which reduces the number of correspondences to 1024.
Figure 2. Correspondence between the 3D box and 2D bounding box: Each figure shows a 3D bbox that surrounds an object. The front face is shown in blue and the rear face is in red. The 3D points that are active constraints in each of the images are shown with a circle (best viewed in color).
Furthermore, when the relative object roll is close to zero, the vertical 2D box side coordinates x_min and x_max can only correspond to projections of points from vertical 3D box sides. Similarly, y_min and y_max can only correspond to point projections from the horizontal 3D box sides. Consequently, each vertical side of the 2D detection box can correspond to [±d_x/2, . , ±d_z/2] and each horizontal side of the 2D bounding box corresponds to [. , ±d_y/2, ±d_z/2], yielding 4^4 = 256 possible configurations. In the KITTI dataset, object pitch and roll angles are both zero, which further reduces the number of configurations to 64. Fig. 2 visualizes some of the possible correspondences between 2D box sides and 3D box points that can occur.
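As a rough sketch of the procedure described above (the exact solver is given in the supplementary material [1]), one can enumerate corner-to-side assignments and solve each resulting over-constrained linear system for T in least squares, keeping the assignment with the smallest residual. For simplicity the sketch uses the full 8^4 enumeration rather than the reduced 64 KITTI configurations, and all names are illustrative.

import numpy as np
from itertools import product

def solve_translation(K, R, dims, box_2d):
    """Recover T by brute-force enumeration of corner-to-side assignments.

    box_2d = (x_min, y_min, x_max, y_max). Each assignment of a 3D corner to a
    2D box side gives one linear equation in T via Eq. (2); the 4x3 system is
    solved in least squares and the lowest-residual assignment wins.
    """
    dx, dy, dz = dims
    corners = [np.array(c) for c in product([dx / 2, -dx / 2],
                                            [dy / 2, -dy / 2],
                                            [dz / 2, -dz / 2])]
    x_min, y_min, x_max, y_max = box_2d
    # (side value, image coordinate row): x-sides use row 0 of K, y-sides row 1.
    sides = [(x_min, 0), (y_min, 1), (x_max, 0), (y_max, 1)]
    KR = K @ R
    best_T, best_err = None, np.inf
    for assignment in product(corners, repeat=4):        # 8^4 = 4096 candidates
        A, b = [], []
        for corner, (value, row) in zip(assignment, sides):
            proj = KR @ corner                           # K R X_o part of Eq. (1)
            # Constraint: (K[row] - value * K[2]) . T = value * proj[2] - proj[row]
            A.append(K[row] - value * K[2])
            b.append(value * proj[2] - proj[row])
        A, b = np.array(A), np.array(b)
        T, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
        err = np.sum((A @ T - b) ** 2)                   # reprojection residual
        if err < best_err:
            best_T, best_err = T, err
    return best_T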
4. CNN Regression of 3D Box Parameters
In this section, we describe our approach for regressing
the 3D bounding box orientation and dimensions.
4.1. MultiBin Orientation Estimation
Estimating the global object orientation R ∈ SO(3) in the camera reference frame from only the contents of the detection window crop is not possible, as the location of the crop within the image plane is also required. Consider the rotation R(θ) parametrized only by azimuth θ (yaw). Fig. 4 shows an example of a car moving in a straight line. Although the global orientation R(θ) of the car (its 3D bounding box) does not change, its local orientation θ_l with respect to the ray through the crop center does, and generates changes in the appearance of the cropped image.
We thus regress to this local orientation θ_l. Fig. 4 shows an example, where the local orientation angle θ_l and the ray angle change in such a way that their combined effect is a constant global orientation of the car.
Figure 3. Left: Car dimensions, the height of the car equals d_y. Right: Illustration of local orientation θ_l and global orientation of a car θ. The local orientation is computed with respect to the ray that goes through the center of the crop. The center ray of the crop is indicated by the blue arrow. Note that the center of the crop may not go through the actual center of the object. The orientation of the car θ is equal to θ_ray + θ_l. The network is trained to estimate the local orientation θ_l.
Figure 4. Left: cropped images of a car passing by. Right: images of the whole scene. As shown, the car in the cropped images rotates while the direction of the car is constant across all the rows.
Given intrinsic camera parameters, the ray direction at a particular pixel is
trivial to compute. At inference time we combine this ray
direction at the crop center with the estimated local orienta-
tion in order to compute the global orientation of the object.
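A minimal sketch of this combination step, assuming the convention θ = θ_ray + θ_l of Fig. 3 and a pinhole model with focal length f_x and principal point c_x (function and argument names are ours):

import numpy as np

def global_yaw_from_local(theta_local, crop_center_u, K):
    """Combine the regressed local orientation with the ray angle of the crop.

    theta_local: local yaw predicted for the crop, crop_center_u: horizontal
    pixel coordinate of the crop center, K: 3x3 camera intrinsics.
    """
    f_x, c_x = K[0, 0], K[0, 2]
    theta_ray = np.arctan2(crop_center_u - c_x, f_x)   # angle of the ray through the crop center
    return theta_ray + theta_local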
Figure 5. Proposed architecture for MultiBin estimation of orientation and dimensions. It consists of three branches. The left branch estimates the dimensions of the object of interest. The other branches compute the confidence of each bin as well as cos(∆θ) and sin(∆θ) for each bin.
It is known that using the L2 loss is not a good fit for many complex multi-modal regression problems. The L2 loss encourages the network to minimize the average loss across all modes, which results in an estimate that may be poor for any single mode. This has been observed in the context of the image colorization problem, where the L2 norm produces unrealistic average colors for items like clothing [27]. Similarly, object detectors such as Faster R-CNN [18] and SSD [11] do not regress the bounding
boxes directly: instead they divide the space of the bound-
ing boxes into several discrete modes called anchor boxes
and then estimate the continuous offsets that need to be ap-
plied to each anchor box.
We use a similar idea in our proposed MultiBin architec-
ture for orientation estimation. We first discretize the orien-
tation angle and divide it into n overlapping bins. For each bin, the CNN network estimates both a confidence probability c_i that the output angle lies inside the i-th bin and the residual rotation correction that needs to be applied to the orientation of the center ray of that bin in order to obtain the output angle. The residual rotation is represented by two numbers, for the sine and the cosine of the angle. This results in 3 outputs for each bin i: (c_i, cos(∆θ_i), sin(∆θ_i)). Valid cosine and sine values are obtained by applying an L2 normalization layer on top of a 2-dimensional input. The total loss for the MultiBin orientation is thus:

L_θ = L_conf + w × L_loc    (3)
The confidence loss L_conf is equal to the softmax loss of the confidences of each bin. L_loc is the loss that tries to minimize the difference between the estimated angle and the ground truth angle in each of the bins that covers the ground truth angle, with adjacent bins having overlapping coverage. In the localization loss L_loc, all the bins that cover the ground truth angle are forced to estimate the correct angle. The localization loss tries to minimize the difference between the ground truth and all the bins that cover that value, which is equivalent to maximizing the cosine distance, as shown in the supplementary material [1]. The localization loss L_loc is computed as follows:

L_loc = -(1/n_θ*) Σ cos(θ* - c_i - ∆θ_i)    (4)

where n_θ* is the number of bins that cover the ground truth angle θ*, c_i is the angle of the center of bin i and ∆θ_i is the change that needs to be applied to the center of bin i.
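The following NumPy sketch illustrates Eqs. (3)-(4) for a single example. The bin-coverage test and the choice of the best-covering bin as the softmax target are our assumptions (the paper leaves those details to the supplementary material [1]); all names are illustrative.

import numpy as np

def multibin_orientation_loss(conf_logits, sin_cos, theta_gt, bin_centers,
                              bin_width, w=1.0):
    """Sketch of the MultiBin loss of Eqs. (3)-(4).

    conf_logits: (n,) bin confidences, sin_cos: (n, 2) L2-normalized (cos, sin)
    residuals per bin, theta_gt: ground-truth local angle, bin_centers: (n,)
    bin center angles, bin_width: angular width of each (overlapping) bin.
    """
    # Wrap-around angular difference to every bin center.
    diff = np.arctan2(np.sin(theta_gt - bin_centers), np.cos(theta_gt - bin_centers))
    covered = np.abs(diff) < bin_width / 2          # bins overlap, so at least one covers theta_gt

    # Confidence loss: softmax cross-entropy, here with the best-covering bin as target.
    target = int(np.argmin(np.abs(diff)))
    log_probs = conf_logits - np.log(np.sum(np.exp(conf_logits)))
    l_conf = -log_probs[target]

    # Localization loss: maximize cos(theta_gt - c_i - delta_i) over the covering bins.
    delta = np.arctan2(sin_cos[:, 1], sin_cos[:, 0])
    l_loc = -np.mean(np.cos(theta_gt - bin_centers[covered] - delta[covered]))

    return l_conf + w * l_loc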
During inference, the bin with maximum confidence is selected and the final output is computed by applying the estimated ∆θ of that bin to the center of that bin. The MultiBin module has 2 branches: one for computing the confidences c_i and the other for computing the cosine and sine of ∆θ. As a result, 3n parameters need to be estimated for n bins.
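A corresponding inference-time decoding sketch (illustrative, in NumPy):

import numpy as np

def decode_multibin(conf_logits, sin_cos, bin_centers):
    """Pick the most confident bin and apply its residual to the bin center."""
    i = int(np.argmax(conf_logits))
    delta = np.arctan2(sin_cos[i, 1], sin_cos[i, 0])   # residual angle of bin i
    theta = bin_centers[i] + delta
    return np.arctan2(np.sin(theta), np.cos(theta))    # wrap to (-pi, pi]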
In the KITTI dataset cars, vans, trucks, and buses are all
different categories and the distribution of the object dimen-
sions for category instances is low-variance and unimodal.
For example, the dimension variance for cars and cyclists
is on the order of several centimeters. Therefore, rather
than using a discrete-continuous loss like the MultiBin loss
above, we directly use the L2 loss. As is standard, for each dimension we estimate the residual relative to the mean parameter value computed over the training dataset. The loss for dimension estimation L_dims is computed as follows:

L_dims = (1/n) Σ (D* - D̄ - δ)^2,    (5)

where D* are the ground truth dimensions of the box, D̄ are the mean dimensions for objects of a certain category and δ is the estimated residual with respect to the mean that the network predicts.
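A short NumPy sketch of Eq. (5), with illustrative argument names:

import numpy as np

def dimension_loss(pred_residual, gt_dims, mean_dims):
    """L2 loss of Eq. (5): the network regresses a residual w.r.t. per-class mean dimensions.

    pred_residual: (3,) predicted delta, gt_dims: (3,) ground-truth (d_x, d_y, d_z),
    mean_dims: (3,) class means computed over the training set.
    """
    return float(np.mean((gt_dims - mean_dims - pred_residual) ** 2))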
The CNN architecture of our parameter estimation module is shown in Figure 5. There are three branches: two branches for orientation estimation and one branch for dimension estimation. All of the branches are derived from the same shared convolutional features and the total loss is the weighted combination L = α × L_dims + L_θ.
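For concreteness, a hypothetical PyTorch sketch of such a three-branch head is given below. The FC widths follow Sec. 5.1 (256 for each orientation branch, 512 for the dimension branch); the layer count, activations and class name are assumptions, not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Box3DHead(nn.Module):
    """Sketch of the three-branch head of Fig. 5 on top of shared conv features."""

    def __init__(self, feat_dim, n_bins):
        super().__init__()
        self.dims = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                  nn.Linear(512, 3))               # (d_x, d_y, d_z) residuals
        self.conf = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                  nn.Linear(256, n_bins))          # bin confidences
        self.angle = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 2 * n_bins))     # (cos, sin) per bin

    def forward(self, feats):                                      # feats: (B, feat_dim)
        dims = self.dims(feats)
        conf = self.conf(feats)
        sincos = self.angle(feats).view(-1, conf.shape[1], 2)
        sincos = F.normalize(sincos, dim=2)                        # L2-normalize each (cos, sin) pair
        return dims, conf, sincos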
5. Experiments and Discussions
5.1. Implementation Details
We performed our experiments on the KITTI [2] and Pascal 3D+ [26] datasets.
KITTI dataset: The KITTI dataset has a total of 7481
training images. We train the MS-CNN [3] object detec-
tor to produce 2D boxes and then estimate 3D boxes from
2D detection boxes whose scores exceed a threshold. For
regressing 3D parameters, we use a pretrained VGG network [20] without its FC layers and add our 3D box module, which is shown in Fig. 5. In the module, the first FC layers
in each of the orientation branches have 256 dimensions,
while the first FC layer for dimension regression has a di-
mension of 512. During training, each ground truth crop is
resized to 224x224. In order to make the network more ro-
bust to viewpoint changes and occlusions, the ground truth
boxes are jittered and the ground truth θ_l is changed to ac-
count for the movement of the center ray of the crop. In
addition, we added color distortions and also applied mir-
roring to images at random. The network is trained with

Citations
Proceedings ArticleDOI
18 Jun 2018
TL;DR: This work directly operates on raw point clouds by popping up RGBD scans and leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects.
Abstract: In this work, we study 3D object detection from RGBD data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefited from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability.

1,947 citations

Posted Content
TL;DR: The center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors and performs competitively with sophisticated multi-stage methods and runs in real-time.
Abstract: Detection identifies objects as axis-aligned boxes in an image. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each. This is wasteful, inefficient, and requires additional post-processing. In this paper, we take a different approach. We model an object as a single point --- the center point of its bounding box. Our detector uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose. Our center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors. CenterNet achieves the best speed-accuracy trade-off on the MS COCO dataset, with 28.1% AP at 142 FPS, 37.4% AP at 52 FPS, and 45.1% AP with multi-scale testing at 1.4 FPS. We use the same approach to estimate 3D bounding box in the KITTI benchmark and human pose on the COCO keypoint dataset. Our method performs competitively with sophisticated multi-stage methods and runs in real-time.

1,899 citations

Proceedings ArticleDOI
14 Jun 2020
TL;DR: nuScenes is the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view.
Abstract: Robust detection and tracking of objects is crucial for the deployment of autonomous vehicle technology. Image based benchmark datasets have driven development in computer vision tasks such as object detection, tracking and segmentation of agents in the environment. Most autonomous vehicles, however, carry a combination of cameras and range sensors such as lidar and radar. As machine learning based methods for detection and tracking become more prevalent, there is a need to train and evaluate such methods on datasets containing range sensor data along with images. In this work we present nuTonomy scenes (nuScenes), the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 lidar, all with full 360 degree field of view. nuScenes comprises 1000 scenes, each 20s long and fully annotated with 3D bounding boxes for 23 classes and 8 attributes. It has 7x as many annotations and 100x as many images as the pioneering KITTI dataset. We define novel 3D detection and tracking metrics. We also provide careful dataset analysis as well as baselines for lidar and image based detection and tracking. Data, development kit and more information are available online.

1,378 citations

Proceedings ArticleDOI
15 Jun 2019
TL;DR: PointRCNN generates 3D object proposals from raw point clouds in a bottom-up manner by segmenting the point cloud of the whole scene into foreground and background points.
Abstract: In this paper, we propose PointRCNN for 3D object detection from raw point cloud. The whole framework is composed of two stages: stage-1 for the bottom-up 3D proposal generation and stage-2 for refining proposals in the canonical coordinates to obtain the final detection results. Instead of generating proposals from RGB image or projecting point cloud to bird's view or voxels as previous methods do, our stage-1 sub-network directly generates a small number of high-quality 3D proposals from point cloud in a bottom-up manner via segmenting the point cloud of the whole scene into foreground points and background. The stage-2 sub-network transforms the pooled points of each proposal to canonical coordinates to learn better local spatial features, which is combined with global semantic features of each point learned in stage-1 for accurate box refinement and confidence prediction. Extensive experiments on the 3D detection benchmark of KITTI dataset show that our proposed architecture outperforms state-of-the-art methods with remarkable margins by using only point cloud as input. The code is available at https://github.com/sshaoshuai/PointRCNN.

1,218 citations

References
Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Abstract: We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

27,256 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

21,729 citations

Book ChapterDOI
08 Oct 2016
TL;DR: The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.
Abstract: We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For \(300 \times 300\) input, SSD achieves 74.3 % mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for \(512 \times 512\) input, SSD achieves 76.9 % mAP, outperforming a comparable state of the art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.

19,543 citations

Proceedings ArticleDOI
16 Jun 2012
TL;DR: The autonomous driving platform is used to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM and 3D object detection, revealing that methods ranking high on established datasets such as Middlebury perform below average when being moved outside the laboratory to the real world.
Abstract: Today, visual recognition systems are still rarely employed in robotics applications. Perhaps one of the main reasons for this is the lack of demanding benchmarks that mimic such scenarios. In this paper, we take advantage of our autonomous driving platform to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM and 3D object detection. Our recording platform is equipped with four high resolution video cameras, a Velodyne laser scanner and a state-of-the-art localization system. Our benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences of 39.2 km length, and more than 200k 3D object annotations captured in cluttered scenarios (up to 15 cars and 30 pedestrians are visible per image). Results from state-of-the-art algorithms reveal that methods ranking high on established datasets such as Middlebury perform below average when being moved outside the laboratory to the real world. Our goal is to reduce this bias by providing challenging benchmarks with novel difficulties to the computer vision community. Our benchmarks are available online at: www.cvlibs.net/datasets/kitti

11,283 citations