3D Bounding Box Estimation Using Deep Learning and Geometry
Arsalan Mousavian
George Mason University
amousavi@gmu.edu
Dragomir Anguelov
Zoox, Inc.
drago@zoox.com
John Flynn
Zoox, Inc.
john.flynn@zoox.com
Jana Košecká
George Mason University
kosecka@gmu.edu
Abstract
We present a method for 3D object detection and pose
estimation from a single image. In contrast to current tech-
niques that only regress the 3D orientation of an object, our
method first regresses relatively stable 3D object properties
using a deep convolutional neural network and then com-
bines these estimates with geometric constraints provided
by a 2D object bounding box to produce a complete 3D
bounding box. The first network output estimates the 3D
object orientation using a novel hybrid discrete-continuous
loss, which significantly outperforms the L2 loss. The sec-
ond output regresses the 3D object dimensions, which have
relatively little variance compared to alternatives and can
often be predicted for many object types. These estimates,
combined with the geometric constraints on translation im-
posed by the 2D bounding box, enable us to recover a stable
and accurate 3D object pose. We evaluate our method on the challenging KITTI object detection benchmark [2] both on the official metric of 3D orientation estimation and also on the accuracy of the obtained 3D bounding boxes. Although conceptually simple, our method outperforms more complex and computationally expensive approaches that leverage semantic segmentation, instance-level segmentation and flat ground priors [4] and sub-category detection [23, 24]. Our discrete-continuous loss also produces state-of-the-art results for 3D viewpoint estimation on the Pascal 3D+ dataset [26].
1. Introduction
The problem of 3D object detection is of particular im-
portance in robotic applications that require decision mak-
ing or interactions with objects in the real world. 3D ob-
ject detection recovers both the 6 DoF pose and the dimensions of an object from an image.
(Footnote: Work done as an intern at Zoox, Inc.)
Figure 1. Our method takes the 2D detection bounding box and estimates a 3D bounding box.
While recently developed
2D detection algorithms are capable of handling large varia-
tions in viewpoint and clutter, accurate 3D object detection
largely remains an open problem despite some promising
recent work. The existing efforts to integrate pose estima-
tion with state-of-the-art object detectors focus mostly on
viewpoint estimation. They exploit the observation that the
appearance of objects changes as a function of viewpoint
and that discretization of viewpoints (parametrized by az-
imuth and elevation) gives rise to sub-categories which can
be trained discriminatively [23]. In more restrictive driving scenarios, alternatives to full 3D pose estimation explore exhaustive sampling and scoring of all hypotheses [4] using a
variety of contextual and semantic cues.
In this work, we propose a method that estimates the
pose $(R, T) \in SE(3)$ and the dimensions of an object’s
3D bounding box from a 2D bounding box and the sur-
rounding image pixels. Our simple and efficient method
is suitable for many real world applications including self-
driving vehicles. The main contribution of our approach is
in the choice of the regression parameters and the associated
objective functions for the problem. We first regress the
orientation and object dimensions before combining these
estimates with geometric constraints to produce a final 3D
pose. This is in contrast to previous techniques that attempt
to directly regress to pose.
A state-of-the-art 2D object detector [3] is extended
by training a deep convolutional neural network (CNN) to
regress the orientation of the object’s 3D bounding box and
its dimensions. Given estimated orientation and dimensions
and the constraint that the projection of the 3D bounding
box fits tightly into the 2D detection window, we recover
the translation and the object’s 3D bounding box. Although
conceptually simple, our method is based on several im-
portant insights. We show that a novel MultiBin discrete-
continuous formulation of the orientation regression signif-
icantly outperforms a more traditional L2 loss. Further con-
straining the 3D box by regressing to vehicle dimensions
proves especially effective, since they are relatively low-
variance and result in stable final 3D box estimates.
We evaluate our method on the KITTI [2] and Pascal 3D+ [26] datasets. On the KITTI dataset, we perform an in-depth comparison of our estimated 3D boxes to the results of other state-of-the-art 3D object detection algorithms [24, 4]. The official KITTI benchmark for 3D bounding box
estimation only evaluates the 3D box orientation estimate.
We introduce three additional performance metrics measur-
ing the 3D box accuracy: distance to center of box, distance
to the center of the closest bounding box face, and the over-
all bounding box overlap with the ground truth box, mea-
sured using 3D Intersection over Union (3D IoU) score. We
demonstrate that given sufficient training data, our method
is superior to the state of the art on all the above 3D metrics.
Since the Pascal 3D+ dataset does not have the physical di-
mensions annotated and the intrinsic camera parameters are
approximate, we only evaluate viewpoint estimation accu-
racy showing that our MultiBin module achieves state-of-
the-art results there as well.
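To make the distance-based KITTI metrics above concrete, the following NumPy sketch computes the first two of them for a pair of boxes. The exact face-matching convention is not spelled out in this section, so the helper names and the minimum-over-pairs definition are our assumptions, not the paper's.

```python
import numpy as np

def center_distance(t_pred, t_gt):
    """Metric 1: Euclidean distance between predicted and ground-truth 3D box centers."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))

def closest_face_center_distance(pred_face_centers, gt_face_centers):
    """Metric 2: distance between the centers of the closest faces of the two boxes.

    Both inputs are (6, 3) arrays of face centers; taking the minimum over all
    pairs is one plausible reading of the metric, not necessarily the paper's.
    """
    d = np.linalg.norm(pred_face_centers[:, None, :] - gt_face_centers[None, :, :], axis=2)
    return float(d.min())
```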
In summary, the main contributions of our paper include:
1) A method to estimate an object’s full 3D pose and di-
mensions from a 2D bounding box using the constraints
provided by projective geometry and estimates of the ob-
ject’s orientation and size regressed using a deep CNN. In
contrast to other methods, our approach does not require
any preprocessing stages or 3D object models. 2) A novel
discrete-continuous CNN architecture called MultiBin re-
gression for estimation of the object’s orientation. 3) Three
new metrics for evaluating 3D boxes beyond their orienta-
tion accuracy for the KITTI dataset. 4) An experimental
evaluation demonstrating the effectiveness of our approach
for KITTI cars, which also illustrates the importance of the
specific choice of regression parameters within our 3D pose
estimation framework. 5) Viewpoint evaluation on the Pas-
cal 3D+ dataset.
2. Related Work
The classical problem of 6 DoF pose estimation of an
object instance from a single 2D image has been consid-
ered previously as a purely geometric problem known as
the perspective n-point problem (PnP). Several closed form
and iterative solutions assuming correspondences between
2D keypoints in the image and a 3D model of the object can
be found in [10] and references therein. Other methods fo-
cus on constructing 3D models of the object instances and
then finding the 3D pose in the image that best matches the
model [19, 6].
With the introduction of new challenging datasets [2, 26, 25, 12], 3D pose estimation has been extended to object cat-
egories, which requires handling both the appearance vari-
ations due to pose changes and the appearance variations
within the category [9, 15]. In [16, 26] the object detec-
tion framework of discriminative part based models (DPMs)
is used to tackle the problem of pose estimation formu-
lated jointly as a structured prediction problem, where each
mixture component represents a different azimuth section.
However, such approaches predict only an Euler angle sub-
set with respect to the canonical object frame, while object
dimensions and position are not estimated.
An alternative direction is to exploit the availability of
3D shape models and use those for 3D hypothesis sampling
and refinement. For example, Mottaghi et al. [13] sample
the object viewpoint, position and size and then measure
the similarity between rendered 3D CAD models of the ob-
ject and the detection window using HOG features. A sim-
ilar method for estimating the pose using the projection of
CAD model object instances has been explored by [29] in
a robotics table-top setting where the detection problem is
less challenging. Given the coarse pose estimate obtained
from a DPM-based detector, the continuous 6 DoF pose is
refined by estimating the correspondences between the pro-
jected 3D model and the image contours. The evaluation
was carried out on PASCAL3D+ or simple table top set-
tings with limited clutter or scale variations. An extension
of these methods to more challenging scenarios with signifi-
cant occlusion has been explored in [22], which uses dictio-
naries of 3D voxel patterns learned from 3D CAD models
that characterize both the object’s shape and commonly en-
countered occlusion patterns.
Recently, deep convolutional neural networks (CNN)
have dramatically improved the performance of 2D object
detection and several extensions have been proposed to in-
clude 3D pose estimation. In [21] R-CNN [7] is used to
detect objects and the resulting detected regions are passed
as input to a pose estimation network. The pose network
is initialized with VGG [20] and fine-tuned for pose es-
timation using ground truth annotations from Pascal 3D+.
This approach is similar to [8], with the distinction of using
separate pose weights for each category and a large num-
ber of synthetic images with pose annotation ground truth
for training. In [17], Poirson et al. discretize the object
viewpoint and train a deep convolutional network to jointly
perform viewpoint estimation and 2D detection. The net-
work shares the pose parameter weights across all classes.
In [21], Tulsiani et al. explore the relationship between
coarse viewpoint estimation, followed by keypoint detec-
tion, localization and pose estimation. Pavlakos et al. [14] used a CNN to localize the keypoints and then used the keypoints and their 3D coordinates from meshes to recover the pose. However, their approach required training data with annotated keypoints.
Several recent methods have explored 3D bounding box
detection for driving scenarios and are most closely related
to our method. Xiang et al. [23, 24] cluster the set of pos-
sible object poses into viewpoint-dependent subcategories.
These subcategories are obtained by clustering 3D voxel
patterns introduced previously [22]; 3D CAD models are
required to learn the pattern dictionaries. The subcategories
capture shape, viewpoint and occlusion patterns and are subsequently classified discriminatively [24] using deep
CNNs. Another related approach by Chen et al. [4] ad-
dresses the problem by sampling 3D boxes in the physical
world assuming the flat ground plane constraint. The boxes
are scored using high level contextual, shape and category
specific features. All of the above approaches require com-
plicated preprocessing including high level features such as
segmentation or 3D shape repositories and may not be suit-
able for robots with limited computational resources.
3. 3D Bounding Box Estimation
In order to leverage the success of existing work on 2D
object detection for 3D bounding box estimation, we use
the fact that the perspective projection of a 3D bounding
box should fit tightly within its 2D detection window. We
assume that the 2D object detector has been trained to pro-
duce boxes that correspond to the bounding box of the pro-
jected 3D box. The 3D bounding box is described by its
center $T = [t_x, t_y, t_z]^T$, dimensions $D = [d_x, d_y, d_z]$, and orientation $R(\theta, \phi, \alpha)$, here parameterized by the azimuth, elevation and roll angles. Given the pose of the object in the camera coordinate frame $(R, T) \in SE(3)$ and the camera intrinsics matrix $K$, the projection of a 3D point $X_o = [X, Y, Z, 1]^T$ in the object's coordinate frame into the image $x = [x, y, 1]^T$ is:
$$x = K \begin{bmatrix} R & T \end{bmatrix} X_o \qquad (1)$$
Assuming that the origin of the object coordinate frame
is at the center of the 3D bounding box and the ob-
ject dimensions D are known, the coordinates of the 3D
bounding box vertices can be described simply by $X_1 = [d_x/2, d_y/2, d_z/2]^T$, $X_2 = [-d_x/2, d_y/2, d_z/2]^T$, $\ldots$, $X_8 = [-d_x/2, -d_y/2, -d_z/2]^T$, i.e. all eight sign combinations of $\pm d_x/2$, $\pm d_y/2$, $\pm d_z/2$. The constraint that the 3D bounding box fits tightly into the 2D detection window requires that each side of the 2D bounding box be touched by the projection of at least one of the 3D box corners.
For example, consider the projection of one 3D corner $X_0 = [d_x/2, -d_y/2, d_z/2]^T$ that touches the left side of the 2D bounding box with coordinate $x_{min}$. This point-to-side correspondence constraint results in the equation:
$$x_{min} = \left( K \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} d_x/2 \\ -d_y/2 \\ d_z/2 \\ 1 \end{bmatrix} \right)_x \qquad (2)$$
where $(\cdot)_x$ refers to the $x$ coordinate of the perspective projection. Similar equations can be derived for the remaining 2D box side parameters $x_{max}$, $y_{min}$, $y_{max}$. In total the
sides of the 2D bounding box provide four constraints on
the 3D bounding box. This is not enough to constrain the
nine degrees of freedom (DoF) (three for translation, three
for rotation, and three for box dimensions). There are sev-
eral different geometric properties we could estimate from
the visual appearance of the box to further constrain the 3D
box. The main criterion is that they should be tied strongly
to the visual appearance and further constrain the final 3D
box.
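As a concrete illustration of Eq. (1) and the tight-fit constraint behind Eq. (2), the minimal NumPy sketch below builds the eight object-frame corners from the dimensions D, projects them with K[R T], and takes the bounding box of the projections; under the paper's assumption this projected box should coincide with the 2D detection window. The function names and array conventions are ours, not the authors'.

```python
import numpy as np

def box_corners(dims):
    """Eight corners of a box centered at the object-frame origin.

    dims = (dx, dy, dz); the corners are all sign combinations of +/- d/2.
    """
    dx, dy, dz = dims
    signs = np.array([[sx, sy, sz] for sx in (1, -1)
                                   for sy in (1, -1)
                                   for sz in (1, -1)])
    return 0.5 * signs * np.array([dx, dy, dz])      # shape (8, 3)

def project(K, R, T, X_obj):
    """Perspective projection of object-frame points (Eq. 1): x = K [R T] X_o."""
    X_cam = X_obj @ R.T + T          # object frame -> camera frame
    x = X_cam @ K.T                  # homogeneous image coordinates
    return x[:, :2] / x[:, 2:3]      # divide by depth

def projected_2d_box(K, R, T, dims):
    """Tight 2D bounding box of the projected 3D box (the quantity Eq. 2 constrains)."""
    uv = project(K, R, T, box_corners(dims))
    xmin, ymin = uv.min(axis=0)
    xmax, ymax = uv.max(axis=0)
    return xmin, ymin, xmax, ymax
```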
3.1. Choice of Regression Parameters
The first set of parameters that have a strong effect on
the 3D bounding box is the orientation around each axis
(θ, φ, α). Apart from them, we choose to regress the box
dimensions D rather than translation T because the vari-
ance of the dimension estimate is typically smaller (e.g.
cars tend to be roughly the same size) and does not vary
as the object orientation changes: a desirable property if
we are also regressing orientation parameters. Furthermore,
the dimension estimate is strongly tied to the appearance of
a particular object subcategory and is likely to be accurately
recovered if we can classify that subcategory.
3.2. Correspondence Constraints
Using the dimensions and orientation of the 3D box regressed by the CNN together with the 2D detection box, we can solve for the translation T that minimizes the reprojection error with respect to the initial 2D detection box constraints in Equation 2. Details of how to solve for the translation are included in the supplementary material [1]. Each side of the 2D detection box can correspond to any of the eight corners of the 3D box, which results in $8^4 = 4096$ configurations. Each differ-
ent configuration involves solving an over-constrained sys-
tem of linear equations which is computationally fast and
can be done in parallel. In many scenarios the objects can
be assumed to be always upright. In this case, the 2D box
top and bottom correspond only to the projection of ver-
tices from the top and bottom of the 3D box, respectively,
which reduces the number of correspondences to 1024. Fur-
thermore, when the relative object roll is close to zero, the
vertical 2D box side coordinates $x_{min}$ and $x_{max}$ can only correspond to projections of points from vertical 3D box sides. Similarly, $y_{min}$ and $y_{max}$ can only correspond to point projections from the horizontal 3D box sides. Consequently, each vertical side of the 2D detection box can correspond to $[\pm d_x/2, \cdot, \pm d_z/2]$ and each horizontal side of the 2D bounding box corresponds to $[\cdot, \pm d_y/2, \pm d_z/2]$, yielding $4^4 = 256$ possible configurations. In the KITTI dataset, object pitch and roll angles are both zero, which further reduces the number of configurations to 64. Fig. 2 visualizes some of the possible correspondences between 2D box sides and 3D box points that can occur.
Figure 2. Correspondence between the 3D box and 2D bounding box: Each figure shows a 3D bounding box that surrounds an object. The front face is shown in blue and the rear face is in red. The 3D points that are active constraints in each of the images are shown with a circle (best viewed in color).
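The sketch below illustrates one way to implement the enumeration described above: for a hypothesized assignment of one 3D corner to each of the four 2D box sides, each side gives one equation linear in T (written as $(K_{row} - coord \cdot K_2) \cdot (RX + T) = 0$), and the four equations are solved in the least-squares sense; the assignment with the lowest residual wins. This is our reading of the procedure, not the authors' released solver; the candidate-corner lists are how the 4096/1024/256/64 reductions would be expressed.

```python
import itertools
import numpy as np

def solve_translation(K, R, corners, box2d, side_candidates):
    """Enumerate corner-to-side assignments and solve for the translation T.

    K: 3x3 intrinsics, R: 3x3 rotation, corners: (8, 3) object-frame box corners,
    box2d: (xmin, ymin, xmax, ymax), side_candidates: four lists of candidate
    corner indices, one per side in the order (xmin, xmax, ymin, ymax).
    """
    xmin, ymin, xmax, ymax = box2d
    # Each side constrains either the u coordinate (row 0 of K) or v (row 1).
    sides = [(xmin, 0), (xmax, 0), (ymin, 1), (ymax, 1)]

    best_T, best_err = None, np.inf
    for assignment in itertools.product(*side_candidates):
        A, b = [], []
        for (coord, row), corner_idx in zip(sides, assignment):
            X = corners[corner_idx]
            # (K[row] - coord * K[2]) . (R X + T) = 0 is linear in T.
            a = K[row] - coord * K[2]
            A.append(a)
            b.append(-a @ (R @ X))
        A, b = np.array(A), np.array(b)
        T = np.linalg.lstsq(A, b, rcond=None)[0]   # 4 equations, 3 unknowns
        err = np.linalg.norm(A @ T - b)
        if err < best_err:
            best_T, best_err = T, err
    return best_T
```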
4. CNN Regression of 3D Box Parameters
In this section, we describe our approach for regressing
the 3D bounding box orientation and dimensions.
4.1. MultiBin Orientation Estimation
Estimating the global object orientation $R \in SO(3)$ in the camera reference frame from only the contents of the detection window crop is not possible, as the location of the crop within the image plane is also required. Consider the rotation $R(\theta)$ parametrized only by azimuth $\theta$ (yaw). Fig. 4
shows an example of a car moving in a straight line. Al-
though the global orientation R(θ) of the car (its 3D bound-
ing box) does not change, its local orientation $\theta_l$ with respect to the ray through the crop center does, and generates changes in the appearance of the cropped image.
We thus regress to this local orientation $\theta_l$. Fig. 4 shows an example, where the local orientation angle $\theta_l$ and the ray angle change in such a way that their combined effect is a constant global orientation of the car.
Figure 3. Left: Car dimensions; the height of the car equals $d_y$. Right: Illustration of the local orientation $\theta_l$ and the global orientation of a car $\theta$. The local orientation is computed with respect to the ray that goes through the center of the crop. The center ray of the crop is indicated by the blue arrow. Note that the center of the crop may not go through the actual center of the object. The orientation of the car $\theta$ is equal to $\theta_{ray} + \theta_l$. The network is trained to estimate the local orientation $\theta_l$.
Figure 4. Left: cropped images of a car passing by. Right: images of the whole scene. As shown, the car in the cropped images appears to rotate while the direction of the car is constant across the different rows.
Given intrinsic
camera parameters, the ray direction at a particular pixel is
trivial to compute. At inference time we combine this ray
direction at the crop center with the estimated local orienta-
tion in order to compute the global orientation of the object.
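A minimal sketch of this conversion follows, assuming a pinhole model with focal length and principal point taken from K and measuring the ray angle in the camera's x-z plane; sign conventions differ across datasets, so treat this as illustrative rather than the authors' exact formula.

```python
import numpy as np

def global_yaw(K, u_center, theta_local):
    """Combine the regressed local orientation with the ray angle of the crop
    center to obtain the global yaw: theta = theta_ray + theta_l."""
    fx, cx = K[0, 0], K[0, 2]
    theta_ray = np.arctan2(u_center - cx, fx)   # angle of the ray through the crop center
    theta = theta_ray + theta_local
    return np.arctan2(np.sin(theta), np.cos(theta))  # wrap to [-pi, pi)
```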
It is known that using the L2 loss is not a good fit for
many complex multi-modal regression problems. The L2
loss encourages the network to minimize the average loss
across all modes, which results in an estimate that may
be poor for any single mode. This has been observed in
the context of the image colorization problem, where the
L2 norm produces unrealistic average colors for items like
clothing [27]. Similarly, object detectors such as Faster R-CNN [18] and SSD [11] do not regress the bounding boxes directly: instead they divide the space of the bounding boxes into several discrete modes called anchor boxes and then estimate the continuous offsets that need to be applied to each anchor box.
Figure 5. Proposed architecture for MultiBin estimation of orientation and dimensions. It consists of three branches. The left branch estimates the dimensions of the object of interest. The other branches compute the confidence for each bin as well as the $\cos(\Delta\theta)$ and $\sin(\Delta\theta)$ of each bin.
We use a similar idea in our proposed MultiBin architec-
ture for orientation estimation. We first discretize the orien-
tation angle and divide it into n overlapping bins. For each
bin, the CNN estimates both a confidence probability $c_i$ that the output angle lies inside the $i$-th bin and the residual rotation correction that needs to be applied to the orientation of the center ray of that bin in order to obtain the output angle. The residual rotation is represented by two numbers, for the sine and the cosine of the angle. This results in 3 outputs for each bin $i$: $(c_i, \cos(\Delta\theta_i), \sin(\Delta\theta_i))$.
Valid cosine and sine values are obtained by applying an L2
normalization layer on top of a 2-dimensional input. The
total loss for the MultiBin orientation is thus:
$$L_\theta = L_{conf} + w \times L_{loc} \qquad (3)$$
The confidence loss $L_{conf}$ is equal to the softmax loss over the confidences of each bin. $L_{loc}$ is the loss that tries to minimize the difference between the estimated angle and the ground truth angle in each of the bins that cover the ground truth angle, with adjacent bins having overlapping coverage. In other words, all the bins that cover the ground truth angle are forced to estimate the correct angle; minimizing this difference is equivalent to maximizing the cosine of the difference, as shown in the supplementary material [1]. The localization loss $L_{loc}$ is computed as follows:
$$L_{loc} = -\frac{1}{n_\theta} \sum \cos(\theta^* - c_i - \Delta\theta_i) \qquad (4)$$
where $n_\theta$ is the number of bins that cover the ground truth angle $\theta^*$, $c_i$ is the angle of the center of bin $i$, and $\Delta\theta_i$ is the change that needs to be applied to the center of bin $i$.
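A NumPy sketch of the combined loss for a single example is given below. The bin overlap fraction and the choice of the single closest bin as the softmax target are assumptions on our part; the paper only specifies the overall form of Eq. (3)-(4).

```python
import numpy as np

def multibin_targets(theta_gt, n_bins, overlap=0.1):
    """Bin centers, which bins cover the ground-truth angle, and the wrapped
    differences theta_gt - center_i (the overlap fraction is an assumption)."""
    centers = np.arange(n_bins) * 2.0 * np.pi / n_bins
    half_width = (1.0 + overlap) * np.pi / n_bins
    diff = np.arctan2(np.sin(theta_gt - centers), np.cos(theta_gt - centers))
    covered = np.abs(diff) <= half_width
    return centers, covered, diff

def multibin_loss(conf_logits, cos_sin, theta_gt, w=1.0):
    """L_theta = L_conf + w * L_loc (Eq. 3-4) for one example.

    conf_logits: (n_bins,) raw bin confidences; cos_sin: (n_bins, 2) L2-normalized
    (cos, sin) of the per-bin residual rotation.
    """
    n_bins = conf_logits.shape[0]
    centers, covered, diff = multibin_targets(theta_gt, n_bins)

    # Confidence loss: softmax cross-entropy against the bin closest to theta_gt.
    z = conf_logits - conf_logits.max()
    log_softmax = z - np.log(np.sum(np.exp(z)))
    l_conf = -log_softmax[int(np.argmin(np.abs(diff)))]

    # Localization loss: maximize cos(theta_gt - c_i - delta_theta_i) over covering bins.
    delta = np.arctan2(cos_sin[:, 1], cos_sin[:, 0])
    l_loc = -np.mean(np.cos(theta_gt - centers[covered] - delta[covered]))
    return l_conf + w * l_loc
```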
During inference, the bin with maximum confidence is selected and the final output is computed by applying the estimated $\Delta\theta$ of that bin to the center of that bin. The MultiBin module has 2 branches: one for computing the confidences $c_i$ and the other for computing the cosine and sine of $\Delta\theta_i$. As a result, $3n$ parameters need to be estimated for $n$ bins.
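A matching inference-time decode is sketched below (again an illustration; the bin layout must match whatever was used during training):

```python
import numpy as np

def decode_local_orientation(conf_logits, cos_sin):
    """Pick the most confident bin and add its predicted residual to the bin
    center; theta_ray is added separately to obtain the global orientation."""
    n_bins = conf_logits.shape[0]
    centers = np.arange(n_bins) * 2.0 * np.pi / n_bins
    i = int(np.argmax(conf_logits))
    delta = np.arctan2(cos_sin[i, 1], cos_sin[i, 0])
    theta_local = centers[i] + delta
    return np.arctan2(np.sin(theta_local), np.cos(theta_local))  # wrap to [-pi, pi)
```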
In the KITTI dataset, cars, vans, trucks, and buses are all
different categories and the distribution of the object dimen-
sions for category instances is low-variance and unimodal.
For example, the dimension variance for cars and cyclists
is on the order of several centimeters. Therefore, rather
than using a discrete-continuous loss like the MultiBin loss
above, we directly use the L2 loss. As is standard, for each dimension we estimate the residual relative to the mean parameter value computed over the training dataset. The loss for dimension estimation $L_{dims}$ is computed as follows:
$$L_{dims} = \frac{1}{n} \sum (D^* - \bar{D} - \delta)^2, \qquad (5)$$
where $D^*$ are the ground truth dimensions of the box, $\bar{D}$ are the mean dimensions for objects of a certain category, and $\delta$ is the estimated residual with respect to the mean that the network predicts.
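In code, Eq. (5) and the corresponding decode are essentially one line each; this is a sketch, and the per-category means would be computed from the training set.

```python
import numpy as np

def dims_loss(pred_residual, gt_dims, mean_dims):
    """L_dims (Eq. 5): squared error between the ground-truth dimensions and
    the category mean plus the predicted residual."""
    gt, mean, res = map(np.asarray, (gt_dims, mean_dims, pred_residual))
    return float(np.mean((gt - mean - res) ** 2))

def decode_dims(pred_residual, mean_dims):
    """Recover absolute box dimensions from the regressed residual."""
    return np.asarray(mean_dims) + np.asarray(pred_residual)
```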
The CNN architecture of our parameter estimation mod-
ule is shown in Figure 5. There are three branches: two branches for orientation estimation and one branch for dimension estimation. All of the branches are derived from the same shared convolutional features and the total loss is the weighted combination $L = \alpha \times L_{dims} + L_\theta$.
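A PyTorch-style sketch of the three-branch head is given below. The FC widths (256 for the orientation branches, 512 for the dimension branch) follow Sec. 5.1, but the backbone, the flattened feature size, and the default bin count are placeholders rather than the authors' exact configuration. Its outputs would feed the losses sketched above, combined as L = α·L_dims + L_θ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Box3DHead(nn.Module):
    """Three branches on top of shared convolutional features (cf. Fig. 5):
    per-bin confidences, per-bin (cos, sin) residuals, and dimension residuals.
    A sketch only; feat_dim and n_bins are illustrative defaults."""

    def __init__(self, feat_dim=512 * 7 * 7, n_bins=2):
        super().__init__()
        self.n_bins = n_bins
        self.conf = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                  nn.Linear(256, n_bins))
        self.angle = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 2 * n_bins))
        self.dims = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                  nn.Linear(512, 3))

    def forward(self, feats):
        x = feats.flatten(1)
        conf = self.conf(x)                                  # per-bin confidences
        cos_sin = self.angle(x).view(-1, self.n_bins, 2)
        cos_sin = F.normalize(cos_sin, dim=2)                # L2-normalize each (cos, sin)
        dims = self.dims(x)                                  # residual w.r.t. mean dimensions
        return conf, cos_sin, dims
```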
5. Experiments and Discussions
5.1. Implementation Details
We performed our experiments on the KITTI [2] and Pascal 3D+ [26] datasets.
KITTI dataset: The KITTI dataset has a total of 7481
training images. We train the MS-CNN [3] object detec-
tor to produce 2D boxes and then estimate 3D boxes from
2D detection boxes whose scores exceed a threshold. For
regressing 3D parameters, we use a pretrained VGG net-
work [20] without its FC layers and add our 3D box module, which is shown in Fig. 5. In the module, the first FC layers
in each of the orientation branches have 256 dimensions,
while the first FC layer for dimension regression has a di-
mension of 512. During training, each ground truth crop is
resized to 224x224. In order to make the network more ro-
bust to viewpoint changes and occlusions, the ground truth
boxes are jittered and the ground truth $\theta_l$ is changed to ac-
count for the movement of the center ray of the crop. In
addition, we added color distortions and also applied mir-
roring to images at random. The network is trained with
References
Very Deep Convolutional Networks for Large-Scale Image Recognition.
You Only Look Once: Unified, Real-Time Object Detection.
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
SSD: Single Shot MultiBox Detector.
Are we ready for autonomous driving? The KITTI vision benchmark suite.