
A Point Set Generation Network for
3D Object Reconstruction from a Single Image
Haoqiang Fan
Institute for Interdisciplinary
Information Sciences
Tsinghua University
fanhqme@gmail.com
Hao Su
Leonidas Guibas
Computer Science Department
Stanford University
{haosu,guibas}@cs.stanford.edu
Abstract
Generation of 3D data by deep neural networks has
been attracting increasing attention in the research com-
munity. The majority of extant works resort to regular
representations such as volumetric grids or collections of
images; however, these representations obscure the natural
invariance of 3D shapes under geometric transformations,
and also suffer from a number of other issues. In this paper
we address the problem of 3D reconstruction from a single
image, generating a straightforward form of output: point
cloud coordinates. Along with this problem arises a unique
and interesting issue, that the groundtruth shape for an
input image may be ambiguous. Driven by this unorthodox
output form and the inherent ambiguity in groundtruth, we
design architecture, loss function and learning paradigm
that are novel and effective. Our final solution is a
conditional shape sampler, capable of predicting multiple
plausible 3D point clouds from an input image. In
experiments not only can our system outperform state-of-
the-art methods on single-image-based 3D reconstruction
benchmarks, but it also shows strong performance for 3D
shape completion and promising ability in making multiple
plausible predictions.
1. Introduction
As we try to duplicate the successes of current deep
convolutional architectures in the 3D domain, we face a
fundamental representational issue. Extant deep net archi-
tectures for both discriminative and generative learning in
the signal domain are well-suited to data that is regularly
sampled, such as images, audio, or video. However,
most common 3D geometry representations, such as 2D
meshes or point clouds, are not regular structures and do
not easily fit into architectures that exploit such regularity
for weight sharing, etc. That is why the majority of extant works on using deep nets for 3D data resort to either volumetric grids or collections of images (2D views of the geometry). Such representations, however, lead to difficult trade-offs between sampling resolution and net efficiency. Furthermore, they enshrine quantization artifacts that obscure natural invariances of the data under rigid motions, etc.

Figure 1. A 3D point cloud of the complete object can be reconstructed from a single image (shown: input, reconstructed 3D point cloud). Each point is visualized as a small sphere. The reconstruction is viewed from two viewpoints (0° and 90° along azimuth). A segmentation mask is used to indicate the scope of the object in the image.

* Equal contribution.
In this paper we address the problem of generating the
3D geometry of an object based on a single image of that
object. We explore generative networks for 3D geometry
based on a point cloud representation. A point cloud
representation may not be as efficient in representing the
underlying continuous 3D geometry as compared to a CAD
model using geometric primitives or even a simple mesh,
but for our purposes it has many advantages. A point cloud
is a simple, uniform structure that is easier to learn, as
it does not have to encode multiple primitives or combi-
natorial connectivity patterns. In addition, a point cloud
allows simple manipulation when it comes to geometric
transformations and deformations, as connectivity does not
have to be updated. Our pipeline infers the point positions in
a 3D frame determined by the input image and the inferred
viewpoint position.
Given this unorthodox network output, one of our chal-
lenges is how to measure loss during training, as the same
geometry may admit different point cloud representations
at the same degree of approximation. Unlike the usual
$L_2$-type losses, we use the solution of a transportation
problem based on the Earth Mover’s distance (EMD),
effectively solving an assignment problem. We exploit an
approximation to the EMD to provide speed as well as
ensure differentiability for end-to-end training.
Our approach effectively attempts to solve the ill-posed
problem of 3D structure recovery from a single projection
using certain learned priors. The network has to estimate
depth for the visible parts of the image and hallucinate the
rest of the object geometry, assessing the plausibility of sev-
eral different completions. From a statistical perspective, it
would be ideal if we could fully characterize the landscape
of the ground truth space, or be able to sample plausible
candidates accordingly. If we view this as a regression
problem, then it has a rather unique and interesting feature
arising from inherent object ambiguities in certain views.
These are situations where there are multiple, equally good
3D reconstructions of a 2D image, making our problem very
different from classical regression/classification settings,
where each training sample has a unique ground truth
annotation. In such settings the proper loss definition can
be crucial in getting the most meaningful result.
Our final algorithm is a conditional sampler, which
samples plausible 3D point clouds from the estimated
ground truth space given an input image. Experiments on
both synthetic and real world data verify the effectiveness
of our method. Our contributions can be summarized as
follows:
• We use deep learning techniques to study the point set generation problem;
• On the task of 3D reconstruction from a single image, we apply our point set generation network and significantly outperform the state of the art;
• We systematically explore issues in the architecture and loss function design for the point generation network;
• We discuss and address the ground-truth ambiguity issue for the task of 3D reconstruction from a single image.
Source code demonstrating our system can be obtained from https://github.com/fanhqme/PointSetGeneration.
2. Related Work
3D reconstruction from single images. While most research focuses on multi-view geometry such as SfM and SLAM [10, 9], ideally one expects that 3D can be reconstructed from the abundant single-view images.
Under this setting, however, the problem is ill-posed and priors must be incorporated. Early work such as ShapeFromX [12, 1] made strong assumptions about the shape or the environment lighting conditions. [11, 18] pioneered the use of learning-based approaches for simple geometric structures. Coarse correspondences in an image collection can also be used for rough 3D shape estimation [14, 3]. As commodity 3D sensors become popular, RGBD databases have been built and used to train learning-based systems [6, 8]. Though great progress has been made, these methods still cannot robustly reconstruct complete and high-quality shapes from single images. Stronger shape priors are missing.
Recently, large-scale repositories of 3D CAD models, such as ShapeNet [4], have been introduced. They have great potential for 3D reconstruction tasks. For example, [19, 13] proposed to deform and reassemble existing shapes into a new model to fit the observed image. These systems rely on high-quality image-shape correspondence, which is a challenging and ill-posed problem in itself.
More relevant to our work is [5]. Given a single image, they use a neural network to predict the underlying 3D object as a 3D volume. There are two key differences between our work and [5]. First, the predicted object in [5] is a 3D volume, whereas ours is a point cloud. As demonstrated and analyzed in Sec. 5.2, a point set forms a nicer shape space for neural networks, so the predicted shapes tend to be more complete and natural. Second, we allow multiple reconstruction candidates for a single input image. This design reflects the fact that a single image cannot fully determine the reconstruction of a 3D shape.
Deep learning for geometric object synthesis. In general, predicting geometry in an end-to-end fashion remains largely unexplored. In particular, our output, a 3D point set, is still not a typical object in the deep learning community. A point set contains orderless samples from a metric-measure space. Therefore, equivalence classes are defined up to a permutation; in addition, the ground distance must be taken into consideration. To our knowledge, no prior deep learning system has been able to predict such objects.
3. Problem and Notations
Our goal is to reconstruct the complete 3D shape of
an object from a single 2D image (RGB or RGB-D). We
represent the 3D shapes in the form of an unordered point set $S = \{(x_i, y_i, z_i)\}_{i=1}^{N}$, where $N$ is a predefined constant. We observed that for most objects, using $N = 1024$ is sufficient to preserve the major structures.
Figure 2. PointOutNet structure: an encoder followed by a predictor (built from conv, deconv, fully connected, set union, and concatenation blocks), shown in a vanilla version and a two-prediction-branch version.
One advantage of the point set comes from its unorderedness: unlike 2D-based representations such as depth maps, no topological constraint is put on the represented object. Compared to 3D grids, the point set enjoys higher efficiency by encoding only the points on the surface. Also, the coordinate values $(x_i, y_i, z_i)$ undergo only simple linear transformations when the object is rotated or scaled, in contrast to the case of volumetric representations.
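To make this contrast concrete, the following minimal NumPy sketch (an illustration, not code from the paper) shows that rotating or scaling a shape acts on the (N, 3) coordinate array through a single linear map per point, with no resampling step:

```python
import numpy as np

# A point set is just an (N, 3) array of coordinates.
points = np.random.rand(1024, 3)           # stand-in for a shape with N = 1024 points

theta = np.pi / 6                          # 30-degree rotation about the z axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])

rotated = points @ R.T                     # same 1024 points, linearly transformed coordinates
scaled = 0.5 * points                      # uniform scaling is likewise a linear map
```

A voxel grid, by contrast, would have to be resampled and requantized after the same transformation.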
To model the problem’s uncertainty, we define the
groundtruth as a probability distribution P(·|I) over the
shapes conditioned on the input I. In training we have
access to one sample from P(·|I) for each image I.
We train a neural network G as a conditional sampler
from P(·|I):
$$S = G(I, r; \Theta) \quad (1)$$
where $\Theta$ denotes the network parameters and $r \sim N(0, I)$ is a random variable used to perturb the input, similar to the conditional generative adversarial network [15]. During test time, multiple samples of $r$ can be used to generate different predictions.
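As a usage sketch, assuming a trained PyTorch implementation of $G$ (here a hypothetical module `net` mapping an image batch plus a noise vector to a (B, N, 3) point cloud; the noise dimension of 1 is an assumption), drawing multiple plausible shapes amounts to re-sampling $r$:

```python
import torch

def sample_predictions(net, image, k=8):
    """Draw k plausible point clouds for one image by re-sampling the noise r (Eq. 1).
    `net` is a hypothetical trained conditional generator: net(image, r) -> (B, N, 3)."""
    net.eval()
    with torch.no_grad():
        return [net(image, torch.randn(image.size(0), 1)) for _ in range(k)]
```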
4. Approach
4.1. Overview
Our task of building a conditional generative network
for point sets is challenging, due to the unordered form of
representation and the inherent ambiguity of groundtruth.
These challenges have pushed us to invent new architecture,
loss function, and learning paradigm. Specifically, we have
to address three subproblems:
Point set generator architecture: Networks that predict point sets are barely studied in the literature, leaving a huge open
space for us to explore the design choices. Ideally, a
network should make the best use of its data statistics
and possess enough representation power. We propose a
network with two prediction branches, one enjoys high
flexibility in capturing complicated structures and the other
exploits geometric continuity. See Sec 4.2.
Loss function for point set comparison: For our novel
type of prediction, a point set, it is unclear how to measure the distance between the prediction and the groundtruth. We propose two distance metrics for point sets, the Chamfer distance and the Earth Mover's distance. We show that both metrics are differentiable almost everywhere and can be used as the loss function, but have different properties in capturing the shape space. See Sec. 4.3.
Modeling the uncertainty of groundtruth: Our problem
of 3D structural recovery from a single image is ill-posed; thus the ambiguity of the groundtruth arises at both training and test time. It is fundamentally important to characterize the ambiguity of the groundtruth for a given input, and practically desirable to be able to generate multiple predictions. Surprisingly, this goal can be achieved tactfully by simply using the min function as a wrapper around the above proposed loss, or by a conditional variational autoencoder. See Sec. 4.4.
4.2. Point Set Prediction Network
The task of building a network for point set prediction
is new. We design a network with the goal of possessing
strong representation power for complicated structures, and
make the best use of the statistics of geometric data. To
introduce our network progressively, we start from a simple
version and gradually add components.
As in Fig. 2 (top), our network has an encoder stage and
a predictor stage. The encoder maps the input pair of an
image I and a random vector r into an embedding space.
The predictor outputs a shape as an N × 3 matrix M, each
row containing the coordinates of one point.
The encoder is a composition of convolution and ReLU
layers; in addition, a random vector r is subsumed so
that it perturbs the prediction from the image I. We
postpone the explanation of how r is used to Sec. 4.4. The
predictor generates the coordinates of N points through
a fully connected network. Though simple, this version
works reasonably well in practice.
We further improve the design of the predictor branch to
better accommodate large and smooth surfaces which are
common in natural objects. The fully connected predictor
as above cannot make full use of such natural geometric
statistics, since each point is predicted independently. The
improved predictor in Fig 2 (middle) exploits this geometric
smoothness property.
This version has two parallel predictor branches: a fully-connected (fc) branch and a deconvolution (deconv) branch. The fc branch predicts $N_1$ points as before. The deconv branch predicts a 3-channel image of size $H \times W$,
of which the three values at each pixel are the coordinates
of a point, giving another H × W points. Their predictions
are later merged together to form the whole set of points in
M. Multiple skip links are added to boost information flow
across encoder and predictor.
With the fc branch, our model enjoys high flexibility,
showing good performance at describing intricate struc-
tures. With the deconvolution branch, our model becomes not only more parsimonious in parameters through weight sharing, but also more friendly to large smooth surfaces, due to the spatial continuity induced by the deconv and conv operations. Refer to Sec. 5.5 for experimental evidence.
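The sketch below outlines this two-branch design in PyTorch. It is a simplified stand-in rather than the released architecture: the channel widths, layer counts, omitted skip links, and the way the random vector r is injected (tiled here as an extra input channel) are all assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchPointNetSketch(nn.Module):
    """Encoder + (fc branch, deconv branch) predictor, loosely following Fig. 2 (middle)."""

    def __init__(self, n_fc_points=256):
        super().__init__()
        # Encoder: conv + ReLU stack; input is RGB plus one tiled noise channel (assumption).
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Fully connected branch: predicts n_fc_points points independently.
        self.fc_branch = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, n_fc_points * 3),
        )
        # Deconvolution branch: predicts a 3-channel "coordinate image";
        # the three values at each pixel are the (x, y, z) of one point.
        self.deconv_branch = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1),
        )

    def forward(self, image, r):
        # Inject the random vector by tiling it into an extra channel (one simple choice).
        b, _, h, w = image.shape
        noise = r.view(b, 1, 1, 1).expand(b, 1, h, w)
        feat = self.encoder(torch.cat([image, noise], dim=1))
        pts_fc = self.fc_branch(feat).view(b, -1, 3)        # (B, N1, 3)
        grid = self.deconv_branch(feat)                      # (B, 3, H', W')
        pts_deconv = grid.flatten(2).transpose(1, 2)         # (B, H'*W', 3)
        return torch.cat([pts_fc, pts_deconv], dim=1)        # set union of both branches


# Example: a 128x128 RGB input yields 256 fc points plus a 32x32 grid of deconv points.
net = TwoBranchPointNetSketch()
points = net(torch.randn(2, 3, 128, 128), torch.randn(2, 1))   # shape (2, 256 + 1024, 3)
```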
The above introduces the design of our network G in Eq. (1). To train this network, however, we still need to design a proper loss function for point set prediction, and to enable r to play its role in generating multiple prediction candidates. We explain these in the next two sections.
4.3. Distance Metric between Point Sets
A critical challenge is to design a good loss function for
comparing the predicted point cloud and the groundtruth.
To plug in a neural network, a suitable distance must satisfy
at least three conditions: 1) differentiable with respect to
point locations; 2) efficient to compute, as data will be
forwarded and back-propagated many times; 3) robust against a small number of outlier points in the sets (e.g., the Hausdorff distance would fail).
We seek a distance $d$ between subsets of $\mathbb{R}^3$, so that the loss function $L(\{S_i^{pred}\}, \{S_i^{gt}\})$ takes the form
$$L(\{S_i^{pred}\}, \{S_i^{gt}\}) = \sum_i d(S_i^{pred}, S_i^{gt}), \quad (2)$$
where $i$ indexes training samples, and $S_i^{pred}$ and $S_i^{gt}$ are the prediction and groundtruth of each sample, respectively.
We propose two candidates: the Chamfer distance (CD) and the Earth Mover's distance (EMD) [17].
Chamfer distance. We define the Chamfer distance between $S_1, S_2 \subseteq \mathbb{R}^3$ as:
$$d_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2 \quad (3)$$
In the strict sense, $d_{CD}$ is not a distance function because the triangle inequality does not hold. We nevertheless use
triangle inequality does not hold. We nevertheless use
the term “distance” to refer to any non-negative function
defined on point set pairs. For each point, the algorithm
of CD finds the nearest neighbor in the other set and sums
the squared distances up. Viewed as a function of point
locations in $S_1$ and $S_2$, CD is continuous and piecewise
smooth. The range search for each point is independent,
thus trivially parallelizable. Also, spatial data structures
like KD-tree can be used to accelerate nearest neighbor
search. Though simple, CD produces reasonably high-quality results in practice.
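As a reference-style sketch (NumPy/SciPy, not the paper's GPU implementation), the Chamfer distance of Eq. (3) can be computed with two nearest-neighbor queries:

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(s1, s2):
    """Chamfer distance of Eq. (3): sum of squared nearest-neighbor distances, both ways.
    s1: (N, 3) array, s2: (M, 3) array; the two sets need not have equal size."""
    d12, _ = cKDTree(s2).query(s1)   # for each point of s1, distance to its nearest point in s2
    d21, _ = cKDTree(s1).query(s2)   # and symmetrically for s2
    return float(np.sum(d12 ** 2) + np.sum(d21 ** 2))
```

For training, the same computation would instead be expressed with differentiable tensor operations (a pairwise distance matrix followed by a min over one axis) so that gradients flow to the predicted coordinates.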
Earth Mover's distance. Consider $S_1, S_2 \subseteq \mathbb{R}^3$ of equal size $s = |S_1| = |S_2|$. The EMD between $S_1$ and $S_2$ is defined as:
$$d_{EMD}(S_1, S_2) = \min_{\phi: S_1 \to S_2} \sum_{x \in S_1} \|x - \phi(x)\|_2 \quad (4)$$
where $\phi: S_1 \to S_2$ is a bijection.
The EMD solves an optimization problem, namely the assignment problem. For all but a zero-
measure subset of point set pairs, the optimal bijection φ
is unique and invariant under infinitesimal movement of the
points. Thus EMD is differentiable almost everywhere. In
practice, exact computation of EMD is too expensive for
deep learning, even on graphics hardware. We therefore
implement a (1 + ε) approximation scheme given by [2]. We allocate a fixed amount of time for each instance and incrementally adjust the allowable error ratio to ensure termination. For typical inputs, the algorithm gives highly accurate results (approximation error on the order of 1%). The algorithm is easily parallelizable on the GPU.
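For intuition, an exact (but slow, $O(s^3)$) version of Eq. (4) can be obtained with the Hungarian algorithm; this is a sketch of the definition, not the paper's (1 + ε) approximation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_exact(s1, s2):
    """Exact EMD of Eq. (4) for equal-size point sets, via optimal assignment."""
    assert len(s1) == len(s2), "EMD as defined here requires |S1| == |S2|"
    cost = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)  # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)                          # optimal bijection phi
    return float(cost[rows, cols].sum())
```

The cubic cost of exact assignment is precisely what motivates the approximate, GPU-friendly scheme used during training.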
Shape space Despite remarkable expressive power em-
bedded in the deep layers, neural networks inevitably
encounter uncertainty in predicting the precise geometry
of an object. Such uncertainty could arise from limited
network capacity, insufficient use of input resolution, or the
ambiguity of groundtruth due to information loss in 3D-2D
projection. Facing the inherent inability to resolve the shape
precisely, neural networks tend to predict a “mean” shape
averaging out the space of uncertainty. The mean shape
carries the characteristics of the distance itself.
In Figure 3, we illustrate the distinct mean-shape behavior of EMD and CD on synthetic shape distributions, by minimizing $\mathbb{E}_{s \sim \mathcal{S}}[L(x, s)]$ through stochastic gradient descent, where $\mathcal{S}$ is a given shape distribution and $L$ is one of the two distance functions.
In the first and the second case, there is a single continuously changing hidden variable, namely the radius of the circle in (a) and the location of the arc in (b). EMD roughly captures the shape corresponding to the mean value of the hidden variable. In contrast, CD induces a splashy shape that blurs the shape's geometric structure. In the latter two cases, there are categorical hidden variables: which corner the square is located at (c) and whether there is a circle beside the bar (d). To address the uncertain presence of the varying part, the minimizer of CD distributes some points outside the main body at the correct locations, while the minimizer of EMD is considerably distorted.

Figure 3. Mean-shape behavior of EMD and CD. The shape distributions are (a) a circle with varying radius; (b) a spiky arc moving along the diagonal; (c) a rectangular bar, with a square-shaped attachment allocated randomly on one of the four corners; (d) a bar, with a circular disk appearing next to it with probability 0.5. The red dots plot the mean shape calculated according to EMD and CD, respectively.

Figure 4. System structure. By plugging in a distributional modeling module (MoN or VAE), our system is capable of generating multiple predictions.
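The mean-shape experiment of Fig. 3 can be reproduced in spirit with a few lines of stochastic gradient descent. The sketch below assumes a user-supplied `sample_shape()` that draws one point set from the synthetic distribution and a differentiable `distance_fn` (CD or EMD) built from tensor operations; neither comes from the paper's code.

```python
import torch

def fit_mean_shape(sample_shape, distance_fn, n_points=256, steps=2000, lr=1e-2):
    """Minimize E_{s~S}[L(x, s)] over a free point set x by SGD (cf. Fig. 3)."""
    x = torch.nn.Parameter(torch.rand(n_points, 2))   # 2D toy shapes, as in Fig. 3
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        s = sample_shape()                            # one sample from the shape distribution
        loss = distance_fn(x, s)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```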
4.4. Generation of Multiple Plausible Shapes
To better model the uncertainty or inherent ambiguity
(e.g. unseen parts in the single view), we need to enable
the system to generate distributional output. We expect that
the random variable $r$ passed to $G$ (see Eq. (1)) would help it explore the groundtruth distribution. However, naively plugging $G$ from Eq. (1) into Loss (2) to predict $S_i^{pred}$ won't work, as the loss minimization will nullify the randomness.
In practice, we find a simple and effective method for uncertainty modeling: the MoN (min of N) loss:
$$\min_{\Theta} \; \sum_k \; \min_{\substack{r_j \sim N(0,I) \\ 1 \le j \le n}} d\big(G(I_k, r_j; \Theta),\, S_k^{gt}\big) \quad (5)$$
By giving the network $n$ chances to minimize the distance, it learns to spread its predictions upon receiving different random vectors. In practice, we find that setting $n = 2$ already enables our method to explore the groundtruth space.
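A minimal training-step sketch of the MoN loss (Eq. 5), assuming a network `net(images, r) -> (B, N, 3)` and a differentiable per-sample set distance `distance_fn` (CD or approximate EMD); the batching scheme and noise shape are assumptions:

```python
import torch

def mon_loss(net, images, gt_points, distance_fn, n=2):
    """Min-of-N loss (Eq. 5): draw n noise vectors per image and keep, for each
    training sample, only the smallest distance to its groundtruth point set."""
    best = None
    for _ in range(n):
        r = torch.randn(images.size(0), 1, device=images.device)
        pred = net(images, r)                                          # (B, N, 3)
        d = torch.stack([distance_fn(pred[b], gt_points[b])
                         for b in range(images.size(0))])              # (B,)
        best = d if best is None else torch.minimum(best, d)
    return best.mean()
```

The per-sample minimum mirrors the inner min over $r_j$ in Eq. (5); with $n = 2$, the overhead is one extra forward pass per training step.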
In principle, to model uncertainty we should use generative frameworks like the conditional GAN (CGAN) [15] or the variational auto-encoder (VAE). One key element in these methods is a complementary network (the discriminator in a GAN, or the encoder in a VAE) that consumes input in the target modality (the 3D point set in our case) to generate predictions or distribution parameters. However, how to feed a 3D point set to a deep neural network was still an open problem at the time of writing. Our point set representation will greatly benefit from future advances in this direction.

Figure 5. Visual comparison to 3D-R2N2 (columns: input image, ours, ours post-processed, ground truth, 3D-R2N2). Our method better preserves thin structures of the objects.

Figure 6. Quantitative comparison to 3D-R2N2. (a) Point-set based metrics CD and EMD. (b) Volumetric-representation based metric 1 - IoU. Lower bars indicate smaller errors. Our method gives better results on all three metrics.
5. Experiment
5.1. Training Data Generation by Synthesis
To start, we introduce our training data preparation. We take the approach of rendering 2D views from CAD object models. Our models are from the ShapeNet dataset [4], which contains a large volume of manually cleaned 3D object models with textures. Concretely, we used a subset of 220K models covering 2,000 object categories. The use of synthesized data has been adopted in a number of existing works [5, 16].
References
• Conditional Generative Adversarial Nets. Mehdi Mirza et al., 2014.
• The Earth Mover's Distance as a Metric for Image Retrieval.
• Depth Map Prediction from a Single Image using a Multi-Scale Deep Network.
• Make3D: Learning 3D Scene Structure from a Single Still Image.
• 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction.