
A Point Set Generation Network for
3D Object Reconstruction from a Single Image
Haoqiang Fan
Institute for Interdisciplinary
Information Sciences
Tsinghua University
fanhqme@gmail.com
Hao Su
Leonidas Guibas
Computer Science Department
Stanford University
{haosu,guibas}@cs.stanford.edu
Abstract
Generation of 3D data by deep neural networks has
been attracting increasing attention in the research com-
munity. The majority of extant works resort to regular
representations such as volumetric grids or collections of
images; however, these representations obscure the natural
invariance of 3D shapes under geometric transformations,
and also suffer from a number of other issues. In this paper
we address the problem of 3D reconstruction from a single
image, generating a straightforward form of output: point
cloud coordinates. Along with this problem arises a unique
and interesting issue, that the groundtruth shape for an
input image may be ambiguous. Driven by this unorthodox
output form and the inherent ambiguity in groundtruth, we
design architecture, loss function and learning paradigm
that are novel and effective. Our final solution is a
conditional shape sampler, capable of predicting multiple
plausible 3D point clouds from an input image. In
experiments not only can our system outperform state-of-
the-art methods on single-image-based 3D reconstruction
benchmarks, but it also shows strong performance for 3D
shape completion and promising ability in making multiple
plausible predictions.
1. Introduction
As we try to duplicate the successes of current deep
convolutional architectures in the 3D domain, we face a
fundamental representational issue. Extant deep net archi-
tectures for both discriminative and generative learning in
the signal domain are well-suited to data that is regularly
sampled, such as images, audio, or video. However,
most common 3D geometry representations, such as 2D
meshes or point clouds, are not regular structures and do
not easily fit into architectures that exploit such regularity
for weight sharing, etc. That is why the majority of extant works on using deep nets for 3D data resort to either volumetric grids or collections of images (2D views of the geometry). Such representations, however, lead to difficult trade-offs between sampling resolution and net efficiency. Furthermore, they enshrine quantization artifacts that obscure natural invariances of the data under rigid motions, etc.

Figure 1. A 3D point cloud of the complete object can be reconstructed from a single image (shown: input, reconstructed 3D point cloud). Each point is visualized as a small sphere. The reconstruction is viewed from two viewpoints (0° and 90° along azimuth). A segmentation mask is used to indicate the scope of the object in the image.

* Equal contribution.
In this paper we address the problem of generating the
3D geometry of an object based on a single image of that
object. We explore generative networks for 3D geometry
based on a point cloud representation. A point cloud
representation may not be as efficient in representing the
underlying continuous 3D geometry as compared to a CAD
model using geometric primitives or even a simple mesh,
but for our purposes it has many advantages. A point cloud
is a simple, uniform structure that is easier to learn, as
it does not have to encode multiple primitives or combi-
natorial connectivity patterns. In addition, a point cloud
allows simple manipulation when it comes to geometric
transformations and deformations, as connectivity does not
have to be updated. Our pipeline infers the point positions in
a 3D frame determined by the input image and the inferred
viewpoint position.
Given this unorthodox network output, one of our chal-
lenges is how to measure loss during training, as the same
geometry may admit different point cloud representations
at the same degree of approximation. Unlike the usual
$L_2$-type losses, we use the solution of a transportation
problem based on the Earth Mover’s distance (EMD),
effectively solving an assignment problem. We exploit an
approximation to the EMD to provide speed as well as
ensure differentiability for end-to-end training.
Our approach effectively attempts to solve the ill-posed
problem of 3D structure recovery from a single projection
using certain learned priors. The network has to estimate
depth for the visible parts of the image and hallucinate the
rest of the object geometry, assessing the plausibility of sev-
eral different completions. From a statistical perspective, it
would be ideal if we could fully characterize the landscape
of the ground truth space, or be able to sample plausible
candidates accordingly. If we view this as a regression
problem, then it has a rather unique and interesting feature
arising from inherent object ambiguities in certain views.
These are situations where there are multiple, equally good
3D reconstructions of a 2D image, making our problem very
different from classical regression/classification settings,
where each training sample has a unique ground truth
annotation. In such settings the proper loss definition can
be crucial in getting the most meaningful result.
Our final algorithm is a conditional sampler, which
samples plausible 3D point clouds from the estimated
ground truth space given an input image. Experiments on
both synthetic and real world data verify the effectiveness
of our method. Our contributions can be summarized as
follows:
• We use deep learning techniques to study the point set generation problem;
• On the task of 3D reconstruction from a single image, we apply our point set generation network and significantly outperform the state of the art;
• We systematically explore issues in the architecture and loss function design for the point generation network;
• We discuss and address the ground-truth ambiguity issue for the task of 3D reconstruction from a single image.
Source code demonstrating our system can be obtained from https://github.com/fanhqme/PointSetGeneration.
2. Related Work
3D reconstruction from single images. While most research focuses on multi-view geometry such as SfM and SLAM [10, 9], ideally one expects that 3D can be reconstructed from the abundant single-view images.
Under this setting, however, the problem is ill-posed and priors must be incorporated. Early work such as ShapeFromX [12, 1] made strong assumptions about the shape or the environment lighting conditions. [11, 18] pioneered the use of learning-based approaches for simple geometric structures. Coarse correspondences in an image collection can also be used for rough 3D shape estimation [14, 3]. As commodity 3D sensors become popular, RGBD databases have been built and used to train learning-based systems [6, 8]. Though great progress has been made, these methods still cannot robustly reconstruct complete and high-quality shapes from single images. Stronger shape priors are missing.
Recently, large-scale repositories of 3D CAD models, such as ShapeNet [4], have been introduced. They have great potential for 3D reconstruction tasks. For example, [19, 13] proposed to deform and reassemble existing shapes into a new model to fit the observed image. These systems rely on high-quality image-shape correspondence, which is a challenging and ill-posed problem in itself.
More relevant to our work is [5]. Given a single image, they use a neural network to predict the underlying 3D object as a 3D volume. There are two key differences between our work and [5]. First, the predicted object in [5] is a 3D volume, whereas ours is a point cloud. As demonstrated and analyzed in Sec. 5.2, a point set forms a nicer shape space for neural networks, so the predicted shapes tend to be more complete and natural. Second, we allow multiple reconstruction candidates for a single input image. This design reflects the fact that a single image cannot fully determine the reconstruction of a 3D shape.
Deep learning for geometric object synthesis. In general, predicting geometry in an end-to-end fashion remains largely unexplored. In particular, our output, a 3D point set, is still not a typical object in the deep learning community. A point set contains orderless samples from a metric-measure space. Therefore, equivalence classes are defined up to a permutation; in addition, the ground distance must be taken into consideration. To our knowledge, no prior deep learning system has been able to predict such objects.
3. Problem and Notations
Our goal is to reconstruct the complete 3D shape of
an object from a single 2D image (RGB or RGB-D). We
represent the 3D shapes in the form of an unordered point set $S = \{(x_i, y_i, z_i)\}_{i=1}^{N}$, where $N$ is a predefined constant. We observed that for most objects, using $N = 1024$ is sufficient to preserve the major structures.
Figure 2. PointOutNet structure: an encoder followed by a predictor (built from conv, deconv, fully connected, set union, and concatenation blocks), shown in a vanilla version and a two-prediction-branch version.
One advantage of the point set comes from its unorderedness: unlike 2D-based representations such as depth maps, no topological constraint is put on the represented object. Compared to 3D grids, the point set enjoys higher efficiency by encoding only the points on the surface. Also, the coordinate values $(x_i, y_i, z_i)$ undergo only simple linear transformations when the object is rotated or scaled, in contrast to the case of volumetric representations.
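To make this contrast concrete, the following minimal NumPy sketch (an illustration, not code from the paper) shows that rotating or scaling a shape acts on the (N, 3) coordinate array through a single linear map per point, with no resampling step:

```python
import numpy as np

# A point set is just an (N, 3) array of coordinates.
points = np.random.rand(1024, 3)           # stand-in for a shape with N = 1024 points

theta = np.pi / 6                          # 30-degree rotation about the z axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])

rotated = points @ R.T                     # same 1024 points, linearly transformed coordinates
scaled = 0.5 * points                      # uniform scaling is likewise a linear map
```

A voxel grid, by contrast, would have to be resampled and requantized after the same transformation.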
To model the problem’s uncertainty, we define the
groundtruth as a probability distribution P(·|I) over the
shapes conditioned on the input I. In training we have
access to one sample from P(·|I) for each image I.
We train a neural network G as a conditional sampler
from P(·|I):
$$S = G(I, r; \Theta) \quad (1)$$
where $\Theta$ denotes the network parameters and $r \sim N(0, I)$ is a random variable used to perturb the input, similar to the conditional generative adversarial network [15]. During test time, multiple samples of $r$ can be used to generate different predictions.
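As a usage sketch, assuming a trained PyTorch implementation of $G$ (here a hypothetical module `net` mapping an image batch plus a noise vector to a (B, N, 3) point cloud; the noise dimension of 1 is an assumption), drawing multiple plausible shapes amounts to re-sampling $r$:

```python
import torch

def sample_predictions(net, image, k=8):
    """Draw k plausible point clouds for one image by re-sampling the noise r (Eq. 1).
    `net` is a hypothetical trained conditional generator: net(image, r) -> (B, N, 3)."""
    net.eval()
    with torch.no_grad():
        return [net(image, torch.randn(image.size(0), 1)) for _ in range(k)]
```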
4. Approach
4.1. Overview
Our task of building a conditional generative network
for point sets is challenging, due to the unordered form of
representation and the inherent ambiguity of groundtruth.
These challenges have pushed us to invent new architecture,
loss function, and learning paradigm. Specifically, we have
to address three subproblems:
Point set generator architecture: Networks that predict point sets are barely studied in the literature, leaving a huge open
space for us to explore the design choices. Ideally, a
network should make the best use of its data statistics
and possess enough representation power. We propose a
network with two prediction branches, one enjoys high
flexibility in capturing complicated structures and the other
exploits geometric continuity. See Sec 4.2.
Loss function for point set comparison: For our novel
type of prediction, a point set, it is unclear how to measure the distance between the prediction and the groundtruth. We propose two distance metrics for point sets, the Chamfer distance and the Earth Mover's distance. We show that both metrics are differentiable almost everywhere and can be used as the loss function, but have different properties in capturing the shape space. See Sec. 4.3.
Modeling the uncertainty of groundtruth: Our problem
of 3D structural recovery from a single image is ill-posed; thus the ambiguity of the groundtruth arises at both training and test time. It is fundamentally important to characterize the ambiguity of the groundtruth for a given input, and practically desirable to be able to generate multiple predictions. Surprisingly, this goal can be achieved tactfully by simply using the min function as a wrapper around the above proposed loss, or by a conditional variational autoencoder. See Sec. 4.4.
4.2. Point Set Prediction Network
The task of building a network for point set prediction
is new. We design a network with the goal of possessing
strong representation power for complicated structures, and
make the best use of the statistics of geometric data. To
introduce our network progressively, we start from a simple
version and gradually add components.
As in Fig. 2 (top), our network has an encoder stage and
a predictor stage. The encoder maps the input pair of an
image I and a random vector r into an embedding space.
The predictor outputs a shape as an N × 3 matrix M, each
row containing the coordinates of one point.
The encoder is a composition of convolution and ReLU
layers; in addition, a random vector r is subsumed so
that it perturbs the prediction from the image I. We
postpone the explanation of how r is used to Sec. 4.4. The
predictor generates the coordinates of N points through
a fully connected network. Though simple, this version
works reasonably well in practice.
We further improve the design of the predictor branch to
better accommodate large and smooth surfaces which are
common in natural objects. The fully connected predictor
as above cannot make full use of such natural geometric
statistics, since each point is predicted independently. The
improved predictor in Fig 2 (middle) exploits this geometric
smoothness property.
This version has two parallel predictor branches: a fully-connected (fc) branch and a deconvolution (deconv) branch. The fc branch predicts $N_1$ points as before. The deconv branch predicts a 3-channel image of size $H \times W$,
of which the three values at each pixel are the coordinates
of a point, giving another H × W points. Their predictions
are later merged together to form the whole set of points in
M. Multiple skip links are added to boost information flow
across encoder and predictor.
With the fc branch, our model enjoys high flexibility,
showing good performance at describing intricate struc-
tures. With the deconvolution branch, our model becomes not only more parsimonious in parameters through weight sharing, but also more friendly to large smooth surfaces, due to the spatial continuity induced by the deconv and conv operations. Refer to Sec. 5.5 for experimental evidence.
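The sketch below outlines this two-branch design in PyTorch. It is a simplified stand-in rather than the released architecture: the channel widths, layer counts, omitted skip links, and the way the random vector r is injected (tiled here as an extra input channel) are all assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchPointNetSketch(nn.Module):
    """Encoder + (fc branch, deconv branch) predictor, loosely following Fig. 2 (middle)."""

    def __init__(self, n_fc_points=256):
        super().__init__()
        # Encoder: conv + ReLU stack; input is RGB plus one tiled noise channel (assumption).
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Fully connected branch: predicts n_fc_points points independently.
        self.fc_branch = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, n_fc_points * 3),
        )
        # Deconvolution branch: predicts a 3-channel "coordinate image";
        # the three values at each pixel are the (x, y, z) of one point.
        self.deconv_branch = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1),
        )

    def forward(self, image, r):
        # Inject the random vector by tiling it into an extra channel (one simple choice).
        b, _, h, w = image.shape
        noise = r.view(b, 1, 1, 1).expand(b, 1, h, w)
        feat = self.encoder(torch.cat([image, noise], dim=1))
        pts_fc = self.fc_branch(feat).view(b, -1, 3)        # (B, N1, 3)
        grid = self.deconv_branch(feat)                      # (B, 3, H', W')
        pts_deconv = grid.flatten(2).transpose(1, 2)         # (B, H'*W', 3)
        return torch.cat([pts_fc, pts_deconv], dim=1)        # set union of both branches


# Example: a 128x128 RGB input yields 256 fc points plus a 32x32 grid of deconv points.
net = TwoBranchPointNetSketch()
points = net(torch.randn(2, 3, 128, 128), torch.randn(2, 1))   # shape (2, 256 + 1024, 3)
```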
The above introduces the design of our network G in Eq. (1). To train this network, however, we still need to design a proper loss function for point set prediction, and to enable r to play its role in generating multiple prediction candidates. We explain these in the next two sections.
4.3. Distance Metric between Point Sets
A critical challenge is to design a good loss function for
comparing the predicted point cloud and the groundtruth.
To plug in a neural network, a suitable distance must satisfy
at least three conditions: 1) differentiable with respect to
point locations; 2) efficient to compute, as data will be
forwarded and back-propagated many times; 3) robust against a small number of outlier points in the sets (e.g., the Hausdorff distance would fail).
We seek a distance $d$ between subsets of $\mathbb{R}^3$, so that the loss function $L(\{S_i^{pred}\}, \{S_i^{gt}\})$ takes the form
$$L(\{S_i^{pred}\}, \{S_i^{gt}\}) = \sum_i d(S_i^{pred}, S_i^{gt}), \quad (2)$$
where $i$ indexes training samples, and $S_i^{pred}$ and $S_i^{gt}$ are the prediction and groundtruth of each sample, respectively.
We propose two candidates: the Chamfer distance (CD) and the Earth Mover's distance (EMD) [17].
Chamfer distance. We define the Chamfer distance between $S_1, S_2 \subseteq \mathbb{R}^3$ as:
$$d_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2 \quad (3)$$
In the strict sense, $d_{CD}$ is not a distance function because the triangle inequality does not hold. We nevertheless use
triangle inequality does not hold. We nevertheless use
the term “distance” to refer to any non-negative function
defined on point set pairs. For each point, the algorithm
of CD finds the nearest neighbor in the other set and sums
the squared distances up. Viewed as a function of point
locations in $S_1$ and $S_2$, CD is continuous and piecewise
smooth. The range search for each point is independent,
thus trivially parallelizable. Also, spatial data structures
like KD-tree can be used to accelerate nearest neighbor
search. Though simple, CD produces reasonably high-quality results in practice.
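As a reference-style sketch (NumPy/SciPy, not the paper's GPU implementation), the Chamfer distance of Eq. (3) can be computed with two nearest-neighbor queries:

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(s1, s2):
    """Chamfer distance of Eq. (3): sum of squared nearest-neighbor distances, both ways.
    s1: (N, 3) array, s2: (M, 3) array; the two sets need not have equal size."""
    d12, _ = cKDTree(s2).query(s1)   # for each point of s1, distance to its nearest point in s2
    d21, _ = cKDTree(s1).query(s2)   # and symmetrically for s2
    return float(np.sum(d12 ** 2) + np.sum(d21 ** 2))
```

For training, the same computation would instead be expressed with differentiable tensor operations (a pairwise distance matrix followed by a min over one axis) so that gradients flow to the predicted coordinates.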
Earth Mover's distance. Consider $S_1, S_2 \subseteq \mathbb{R}^3$ of equal size $s = |S_1| = |S_2|$. The EMD between $S_1$ and $S_2$ is defined as:
$$d_{EMD}(S_1, S_2) = \min_{\phi: S_1 \to S_2} \sum_{x \in S_1} \|x - \phi(x)\|_2 \quad (4)$$
where $\phi: S_1 \to S_2$ is a bijection.
The EMD solves an optimization problem, namely the assignment problem. For all but a zero-
measure subset of point set pairs, the optimal bijection φ
is unique and invariant under infinitesimal movement of the
points. Thus EMD is differentiable almost everywhere. In
practice, exact computation of EMD is too expensive for
deep learning, even on graphics hardware. We therefore
implement a (1 + ε) approximation scheme given by [2]. We allocate a fixed amount of time for each instance and incrementally adjust the allowable error ratio to ensure termination. For typical inputs, the algorithm gives highly accurate results (approximation error on the order of 1%). The algorithm is easily parallelizable on the GPU.
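For intuition, an exact (but slow, $O(s^3)$) version of Eq. (4) can be obtained with the Hungarian algorithm; this is a sketch of the definition, not the paper's (1 + ε) approximation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_exact(s1, s2):
    """Exact EMD of Eq. (4) for equal-size point sets, via optimal assignment."""
    assert len(s1) == len(s2), "EMD as defined here requires |S1| == |S2|"
    cost = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)  # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)                          # optimal bijection phi
    return float(cost[rows, cols].sum())
```

The cubic cost of exact assignment is precisely what motivates the approximate, GPU-friendly scheme used during training.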
Shape space Despite remarkable expressive power em-
bedded in the deep layers, neural networks inevitably
encounter uncertainty in predicting the precise geometry
of an object. Such uncertainty could arise from limited
network capacity, insufficient use of input resolution, or the
ambiguity of groundtruth due to information loss in 3D-2D
projection. Facing the inherent inability to resolve the shape
precisely, neural networks tend to predict a “mean” shape
averaging out the space of uncertainty. The mean shape
carries the characteristics of the distance itself.
In Figure 3, we illustrate the distinct mean-shape behavior of EMD and CD on synthetic shape distributions, by minimizing $\mathbb{E}_{s \sim \mathcal{S}}[L(x, s)]$ through stochastic gradient descent, where $\mathcal{S}$ is a given shape distribution and $L$ is one of the two distance functions.
In the first and the second case, there is a single continuously changing hidden variable, namely the radius of the circle in (a) and the location of the arc in (b). EMD roughly captures the shape corresponding to the mean value of the hidden variable. In contrast, CD induces a splashy shape that blurs the shape's geometric structure. In the latter two cases, there are categorical hidden variables: which corner the square is located at (c) and whether there is a circle beside the bar (d). To address the uncertain presence of the varying part, the minimizer of CD distributes some points outside the main body at the correct locations, while the minimizer of EMD is considerably distorted.

Figure 3. Mean-shape behavior of EMD and CD. The shape distributions are (a) a circle with varying radius; (b) a spiky arc moving along the diagonal; (c) a rectangular bar, with a square-shaped attachment allocated randomly on one of the four corners; (d) a bar, with a circular disk appearing next to it with probability 0.5. The red dots plot the mean shape calculated according to EMD and CD, respectively.

Figure 4. System structure. By plugging in a distributional modeling module (MoN or VAE), our system is capable of generating multiple predictions.
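The mean-shape experiment of Fig. 3 can be reproduced in spirit with a few lines of stochastic gradient descent. The sketch below assumes a user-supplied `sample_shape()` that draws one point set from the synthetic distribution and a differentiable `distance_fn` (CD or EMD) built from tensor operations; neither comes from the paper's code.

```python
import torch

def fit_mean_shape(sample_shape, distance_fn, n_points=256, steps=2000, lr=1e-2):
    """Minimize E_{s~S}[L(x, s)] over a free point set x by SGD (cf. Fig. 3)."""
    x = torch.nn.Parameter(torch.rand(n_points, 2))   # 2D toy shapes, as in Fig. 3
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        s = sample_shape()                            # one sample from the shape distribution
        loss = distance_fn(x, s)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```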
4.4. Generation of Multiple Plausible Shapes
To better model the uncertainty or inherent ambiguity
(e.g. unseen parts in the single view), we need to enable
the system to generate distributional output. We expect that
the random variable $r$ passed to $G$ (see Eq. (1)) would help it explore the groundtruth distribution. However, naively plugging $G$ from Eq. (1) into Loss (2) to predict $S_i^{pred}$ won't work, as the loss minimization will nullify the randomness.
In practice, we find a simple and effective method for uncertainty modeling: the MoN (min of N) loss:
$$\min_{\Theta} \; \sum_k \; \min_{\substack{r_j \sim N(0,I) \\ 1 \le j \le n}} d\big(G(I_k, r_j; \Theta),\, S_k^{gt}\big) \quad (5)$$
By giving the network $n$ chances to minimize the distance, it learns to spread its predictions upon receiving different random vectors. In practice, we find that setting $n = 2$ already enables our method to explore the groundtruth space.
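A minimal training-step sketch of the MoN loss (Eq. 5), assuming a network `net(images, r) -> (B, N, 3)` and a differentiable per-sample set distance `distance_fn` (CD or approximate EMD); the batching scheme and noise shape are assumptions:

```python
import torch

def mon_loss(net, images, gt_points, distance_fn, n=2):
    """Min-of-N loss (Eq. 5): draw n noise vectors per image and keep, for each
    training sample, only the smallest distance to its groundtruth point set."""
    best = None
    for _ in range(n):
        r = torch.randn(images.size(0), 1, device=images.device)
        pred = net(images, r)                                          # (B, N, 3)
        d = torch.stack([distance_fn(pred[b], gt_points[b])
                         for b in range(images.size(0))])              # (B,)
        best = d if best is None else torch.minimum(best, d)
    return best.mean()
```

The per-sample minimum mirrors the inner min over $r_j$ in Eq. (5); with $n = 2$, the overhead is one extra forward pass per training step.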
In principle, to model uncertainty we should use generative frameworks like the conditional GAN (CGAN) [15] or the variational auto-encoder (VAE). One key element in these methods is a complementary network (the discriminator in a GAN, or the encoder in a VAE) that consumes input in the target modality (the 3D point set in our case) to generate predictions or distribution parameters. However, how to feed a 3D point set to a deep neural network was still an open problem at the time of writing. Our point set representation will greatly benefit from future advances in this direction.

Figure 5. Visual comparison to 3D-R2N2 (columns: input image, ours, ours post-processed, ground truth, 3D-R2N2). Our method better preserves thin structures of the objects.

Figure 6. Quantitative comparison to 3D-R2N2. (a) Point-set based metrics CD and EMD. (b) Volumetric-representation based metric 1 - IoU. Lower bars indicate smaller errors. Our method gives better results on all three metrics.
5. Experiment
5.1. Training Data Generation by Synthesis
To start, we introduce our training data preparation. We take the approach of rendering 2D views from CAD object models. Our models are from the ShapeNet dataset [4], which contains a large volume of manually cleaned 3D object models with textures. Concretely, we used a subset of 220K models covering 2,000 object categories. The use of synthesized data has been adopted in a number of existing works [5, 16].
References
• Conditional Generative Adversarial Nets. Mehdi Mirza et al., 2014.
• The Earth Mover's Distance as a Metric for Image Retrieval.
• Depth Map Prediction from a Single Image using a Multi-Scale Deep Network.
• Make3D: Learning 3D Scene Structure from a Single Still Image.
• 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction.