3D Shape Segmentation with Projective Convolutional Networks
Evangelos Kalogerakis¹   Melinos Averkiou²   Subhransu Maji¹   Siddhartha Chaudhuri³
¹University of Massachusetts Amherst   ²University of Cyprus   ³IIT Bombay
Abstract
This paper introduces a deep architecture for segmenting 3D objects into their labeled semantic parts. Our architecture combines image-based Fully Convolutional Networks (FCNs) and surface-based Conditional Random Fields (CRFs) to yield coherent segmentations of 3D shapes. The image-based FCNs are used for efficient view-based reasoning about 3D object parts. Through a special projection layer, FCN outputs are effectively aggregated across multiple views and scales, then are projected onto the 3D object surfaces. Finally, a surface-based CRF combines the projected outputs with geometric consistency cues to yield coherent segmentations. The whole architecture (multi-view FCNs and CRF) is trained end-to-end. Our approach significantly outperforms the existing state-of-the-art methods in the currently largest segmentation benchmark (ShapeNet). Finally, we demonstrate promising segmentation results on noisy 3D shapes acquired from consumer-grade depth cameras.
1. Introduction
In recent years there has been an explosion of 3D shape data on the web. In addition to the increasing number of community-curated CAD models, depth sensors deployed on a wide range of platforms are able to acquire 3D geometric representations of objects in the form of polygon meshes or point clouds. Although there have been significant advances in analyzing color images, in particular through deep networks, existing semantic reasoning techniques for 3D geometric shape data mostly rely on heuristic processing stages and hand-tuned geometric descriptors.
Our work focuses on the task of segmenting 3D shapes into labeled semantic parts. Compositional part-based reasoning for 3D shapes has been shown to be effective for a large number of vision, robotics and virtual reality applications, such as cross-modal analysis of 3D shapes and color images [60, 24], skeletal tracking [42], object detection in images [11, 30, 36], 3D object reconstruction from images and line drawings [54, 24, 21], interactive assembly-based 3D modeling [5, 4], generating 3D shapes from a small number of examples [25], style transfer between 3D objects [33], and robot navigation and grasping [40, 8], to name a few.

The shape segmentation task, while fundamental, is challenging because of the variety and ambiguity of shape parts that must be assigned the same semantic label; because accurately detecting boundaries between parts can involve extremely subtle cues; because local and global features must be jointly examined; and because the analysis must be robust to noise and undersampling.
We propose a deep architecture for segmenting and labeling 3D shapes that simply and effectively addresses these challenges, and significantly outperforms prior methods. The key insights of our technique are to repurpose image-based deep networks for view-based reasoning, and aggregate their outputs onto the surface representation of the shape in a geometrically consistent manner. We make no geometric, topological or orientation assumptions about the shape, nor exploit any hand-tuned geometric descriptors.
Our view-based approach is motivated by the success of deep networks on image segmentation tasks. Using rendered shapes lets us initialize our network with layers that have been trained on large image datasets, allowing better generalization. Since images depict shapes of photographed objects (along with texture), we expect such pre-trained layers to already encode some information about parts and their relationships. Recent work on view-based 3D shape classification [47, 38] and RGB-D recognition [15, 46] has shown the benefits of transferring learned representations from color images to geometric and depth data.
A view-based approach to 3D shape segmentation must overcome several technical obstacles. First, views must be selected such that they together cover the shape surface as much as possible and minimize occlusions. Second, shape parts can be visible in more than one view, thus our method must effectively consolidate information across multiple views. Third, we must guarantee that the segmentation is complete and coherent. This means all the surface area, including any heavily occluded portions, should be labeled, and neighboring surface areas should likely have the same label unless separated by a strong boundary feature.
Our approach, shown in Figure 1, systematically addresses these difficulties using a single feed-forward network. Given a raw 3D polygon mesh as input, our method generates a set of images from multiple views that are automatically selected for optimal surface coverage. These images are fed into the network, which outputs confidence maps per part via image processing layers. The confidence maps are fused and projected onto the shape surface representation through a projection layer. Finally, our architecture incorporates a surface-based Conditional Random Field (CRF) layer that promotes consistent labeling of the entire surface. The whole network, including the CRF, is trained in an end-to-end manner to achieve optimal performance.

[Figure 1 overview: input 3D shape and selected viewpoints → shaded and depth images → FCN modules (shared weights) → per-label confidence maps → Image2Surface projection layer (using surface reference images of triangle IDs) → surface-based CRF layer → labeled 3D shape; arrows indicate the forward pass/inference and backpropagation/learning. Example part labels: wing, fuselage, vertical stabilizer, horizontal stabilizer.]
Figure 1. Pipeline and architecture of our method for 3D shape segmentation and labeling. Given an input shape, a set of viewpoints is computed at different scales such that the viewed shape surface is maximally covered (left). Shaded and depth images from these viewpoints are processed through our architecture (here we show images for three viewpoints, corresponding to 3 different scales). Our architecture employs image-based Fully Convolutional Network (FCN) modules with shared parameters to process the input images. The modules output image-based part label confidences per view. Here we show confidence maps for the wing label (the redder the color, the higher the confidence). The confidences are aggregated and projected onto the shape surface through a special projection layer. Then they are further processed through a surface-based CRF that promotes consistent labeling of the entire surface (right).
Our main contribution is the introduction of a deep architecture for compositional part-based reasoning on 3D shape representations without the use of hand-engineered geometry processing stages or hand-tuned descriptors. We demonstrate significant improvements over the state of the art. For complex objects, such as aircraft, motor vehicles, and furniture, our method increases part labeling accuracy by a remarkable 8% over the state of the art on the currently largest 3D shape segmentation dataset.
2. Related work
Our work is related to learning methods for segmentation of images (including RGB-D data) and 3D shapes.
Image-based segmentation. There is a vast literature on segmenting images into objects and their parts. Most recent techniques are based on variants of random forest classifiers or convolutional networks. An example of the former is the remarkably fast and accurate human-pose estimator that uses depth data from Kinect sensors for labeling human parts [42]. Our work builds on the success of convolutional networks for material segmentation, scene labeling, and object part-labeling tasks. These approaches use image classification networks repurposed for dense image labeling, commonly a fully-convolutional network (FCN) [32], to obtain an initial labeling. Several strategies for improving these initial estimates have been proposed, including techniques based on top-down region-based reasoning [10, 16], CRFs [6, 31], atrous convolutional layers [6, 57], deconvolutional layers [35], recurrent networks [59], or a multi-scale analysis [34, 17]. Several works [29, 1, 2] have also focused on learning feature representations from RGB-D data (e.g. those captured using a Kinect sensor) for object-level recognition and detection in scenes. Recently, Gupta et al. [15] showed that image-based networks can be repurposed for extracting depth representations for object detection and segmentation. Recent works [14, 45, 18] have applied a similar strategy for indoor scene recognition tasks.
In contrast to the above methods, our work aims to segment geometric representations of 3D objects, in the form of polygon meshes, created through 3D modeling tools or reconstruction techniques. The 3D models of these objects often do not contain texture or color information. Segmenting these 3D objects into parts requires architectures that are capable of operating on their geometric representations.
Learning 3D shape representations from images. A few recent methods attempt to learn volumetric representations of shapes from images via convolutional networks that employ special layers to model shape projections onto images [55, 39]. Alternatively, mesh-based representations can also be learned from images by assuming a fixed number of mesh vertices [39]. In contrast to these works, our architecture discriminatively learns view-based shape representations along with a surface-based CRF such that the view projections match an input surface signal (part labels). Our 3D-2D projection mechanism is differentiable, parameter-free, and sparse, since it operates only on the shape surface rather than its volume. In contrast to the mesh representations of [39], we do not assume that meshes have a fixed number of vertices, which does not hold true for general 3D models. Our method is more related to methods that learn view-based shape representations [47, 38]. However, these methods only learn global representations for shape classification and rely on fixed sets of views. Our method instead learns view-based shape representations for part-based reasoning through adaptively selected views. It also uses a CRF to resolve inconsistencies or missing surface information in the view representations.
3D geometric shape segmentation. The most common learning-based approach to shape segmentation is to assign part labels to geometric elements of the shape representation, such as polygons, points, or patches [53]. This is often done through various processing stages: first, hand-engineered geometric descriptors of these elements are extracted (e.g. surface curvature, shape diameter, local histograms of point or normal distributions, surface eigenfunctions, etc.); then, a clustering method or classifier infers part labels for elements based on their descriptors; and finally (optionally) a separate graph cuts step is employed to smooth out the surface labeling [26, 41, 43, 19, 58]. Recently, a convolutional network has been proposed as an alternative element classifier [13], yet it operates on hand-engineered geometric descriptors organized in a 2D matrix lacking spatially coherent structure for conventional convolution. Another variant is to use two-layer networks which transform the input by randomized kernels, in the form of so-called “Extreme Learning Machines” [52], but these offer no better performance than standard shallow classifiers. Other approaches segment shapes by employing non-rigid alignment steps through deformable part templates [27, 20], or transfer labels through surface correspondences and functional maps between 3D shapes [48, 22, 50, 27, 23]. These correspondence and alignment methods rely on hand-engineered geometric descriptors and deformation steps. Wang et al. [51] segment 3D shapes by warping and matching binary images of their projected views with segmented 2D images through Hausdorff distances. However, the matching procedure is hand-tuned, while potentially useful surface information, such as depth and normals, is ignored.
In contrast to all the above approaches, we propose a view-based deep architecture for shape segmentation with four main advantages. First, our architecture adopts image processing layers learned on large-scale image datasets, which are orders of magnitude larger than existing 3D datasets. As we show in this work, the deep stack of several layers extracts feature representations that can be successfully adapted to the task of shape segmentation. We note that such transfer has also been observed recently for shape recognition [47, 38]. Second, our architecture produces shape segmentations without the use of hand-engineered geometric descriptors or processing stages that are prone to degeneracies in the shape representation (i.e. surface noise, sampling artifacts, irregular mesh tessellation, mesh degeneracies, and so on). Third, we employ adaptive viewpoint selection to effectively capture all surface parts for analysis. Finally, our architecture is trained end-to-end, including all image and surface processing stages. As a result of these contributions, our method achieves better performance than prior work on big and complex datasets by a large margin.
3. Method
Given an input 3D shape, the goal of our method is to segment it into labeled parts. We designed a projective convolutional network to this end. Our network architecture is visualized in Figure 1. It takes as input a set of images from multiple views optimized for maximal surface coverage; extracts part-based confidence maps through image processing layers (pre-trained on large image datasets); combines and projects these maps onto the surface through a projection layer; and finally incorporates a surface-based Conditional Random Field (CRF) that favors coherent labeling of the input surface. The whole network, including the CRF, is trained end-to-end. In the following sections, we discuss the input to our network, its layers, and the training procedure.
Input. The input to our algorithm is a 3D shape represented as a polygon mesh. As a preprocessing step, the shape surface is sampled with uniformly distributed points (1024 in our implementation). Our algorithm first determines an overcomplete collection of viewpoints such that nearly every point of the surface is visible from at least K viewpoints (in our implementation, K = 3). For each sampled surface point, we place viewpoints at different distances from it along its surface normal (distances are set to 0.5, 1.0 and 1.5 of the shape's bounding sphere radius). In this manner, the surface is depicted at different scales (Figure 1, left). We then determine a compact set of informative viewpoints that maximally cover the shape surface. For each viewpoint, the shape is rasterized under a perspective projection to a binary image, where we associate every "on" pixel with the sampled surface point closest to it. The coverage of the viewpoint is measured as the fraction of surface points visible from it, estimated by aggregating surface point references from the image. For each of the scales (camera distances), the viewpoint with largest coverage is inserted into a list. We then re-estimate coverages at this scale, omitting points already covered by the selected viewpoint, and the viewpoint with the next largest coverage is added to the list. The process is repeated until all surface points are covered at this scale. In our experiments, with man-made shapes and at our selected scales, approximately 20 viewpoints were enough to cover the vast majority of the surface area per scale.
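The greedy coverage step described above can be summarized by the following sketch (illustrative Python, not the authors' implementation). The boolean visibility matrix relating candidate viewpoints to sampled surface points is an assumed input produced by the rasterization pass, and the loop runs independently per scale.

```python
# Greedy viewpoint selection sketch. `visible[v, p]` is True if sampled
# surface point p is seen from candidate viewpoint v at the current scale;
# both the matrix and the candidate set are assumed inputs.
import numpy as np

def select_viewpoints(visible: np.ndarray) -> list:
    """Greedily pick viewpoints until every (coverable) surface point is covered."""
    num_views, num_points = visible.shape
    uncovered = np.ones(num_points, dtype=bool)    # points not yet seen
    selected = []
    while uncovered.any():
        # Coverage gain = number of still-uncovered points each view would add.
        gain = (visible & uncovered).sum(axis=1)
        best = int(np.argmax(gain))
        if gain[best] == 0:                        # remaining points are invisible
            break                                  # from every candidate view
        selected.append(best)
        uncovered &= ~visible[best]                # mark newly covered points
    return selected
```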
After determining our viewpoint collection, we render the shape to shaded images and depth images. For each viewpoint, we place a camera pointing towards the surface point used to generate that viewpoint, and rotate its up-vector 4 times at 90-degree intervals (i.e., we use 4 in-plane rotations). For each of these 4 camera rotations, we render a shaded, greyscale 512 × 512 image using a typical computer graphics shader (Phong reflection model [37]) and a depth image, which are concatenated into a single two-channel image. These images are fed as input to the image processing module (FCN) of our network, described below.
We found that both shaded and depth images are useful inputs. In early experiments, labeling accuracy dropped 2.5% using depth alone. This might be attributed to the more “photo-realistic” appearance of shaded images, which better match the statistics of real images used to pretrain our architecture. We note that shaded images directly encode surface normals relative to view direction (shading is computed from the angle between normals and view direction).
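As a small illustration of why a shaded rendering encodes normals relative to the view direction, consider a light placed at the camera: a diffuse term then reduces to the cosine of the angle between the surface normal and the view vector. The sketch below is a simplified stand-in for the Phong shader cited above, not the exact rendering code.

```python
# Minimal "headlight" diffuse shading: grayscale intensity depends only on
# the angle between the per-pixel surface normal and the view direction.
import numpy as np

def headlight_shading(normals: np.ndarray, view_dirs: np.ndarray,
                      ambient: float = 0.1) -> np.ndarray:
    """normals, view_dirs: (H, W, 3) unit vectors per pixel -> grayscale in [0, 1]."""
    cos_theta = np.clip(np.sum(normals * view_dirs, axis=-1), 0.0, 1.0)
    return np.clip(ambient + (1.0 - ambient) * cos_theta, 0.0, 1.0)
```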
In addition to the shaded and depth images, for each selected camera setting, we rasterize the shape into another image where each pixel stores the ID of the polygon whose projection is closest to the pixel center. These images, which we call “surface reference” images, are fed into the “projection layer” of our network (Figure 1).
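For clarity, the per-view data assembled for the network can be pictured as follows. This is an assumed layout used for illustration, not the authors' exact implementation: a two-channel float image feeds the FCN branch, and a parallel integer surface reference image of polygon IDs feeds the projection layer.

```python
# Assumed per-view data layout (placeholders to be filled by the renderer).
import numpy as np

H = W = 512
shaded = np.zeros((H, W), dtype=np.float32)      # greyscale Phong rendering (placeholder)
depth = np.zeros((H, W), dtype=np.float32)       # normalized depth buffer (placeholder)
fcn_input = np.stack([shaded, depth], axis=0)    # shape (2, 512, 512)

# Polygon ID per pixel; -1 marks background and silhouette pixels whose
# references are discarded as unreliable (see the projection layer below).
surface_reference = np.full((H, W), -1, dtype=np.int64)
```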
FCN module. The two-channel images produced in the previous step are processed through identical image-based Fully Convolutional Network (FCN) modules (Figure 1). Each FCN module outputs L confidence maps of size 512 × 512 for each input image, where L is the number of part labels. Specifically, in our implementation we employ the FCN architecture suggested in [57], which adopted the VGG-16 network [44] for dense prediction by removing its last two pooling and striding layers and using dilated convolutions. We perform two additional modifications to this FCN architecture. First, since our input is a 2-channel image, we use 2-channel 3 × 3 filters instead of 3-channel (BGR) ones. We also adapted these filters to handle greyscale rather than color images during our training procedure. Second, we modified the output of the original FCN module. The original FCN outputs L confidence maps of size 64 × 64. These are then converted into L probability maps through a softmax operation. Instead, we upsample the confidence maps to size 512 × 512 through a transposed convolutional (“deconvolution”) layer with learned parameters and stride 8. The confidences are later converted into probabilities through our CRF layer.
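A minimal PyTorch sketch of these two modifications is given below. The dilated VGG-16 trunk is treated as a given module, and the class name is hypothetical; the original implementation is in Caffe, so this is only an illustration of the idea.

```python
# FCN head sketch: 2-channel (shaded + depth) input and a learned 8x
# transposed-convolution upsampling of the L coarse confidence maps.
import torch
import torch.nn as nn

class FCNHeadSketch(nn.Module):
    def __init__(self, backbone: nn.Module, num_labels: int):
        super().__init__()
        # 2-channel input instead of 3-channel BGR.
        self.conv1 = nn.Conv2d(2, 64, kernel_size=3, padding=1)
        self.backbone = backbone                    # dilated VGG-16 trunk (assumed to
                                                    # output (B, L, 64, 64) confidences)
        # Learned 8x upsampling: (64 - 1) * 8 - 2 * 4 + 16 = 512.
        self.upsample = nn.ConvTranspose2d(num_labels, num_labels,
                                           kernel_size=16, stride=8, padding=4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv1(x)              # (B, 64, 512, 512)
        x = self.backbone(x)           # (B, L, 64, 64) coarse confidences
        return self.upsample(x)        # (B, L, 512, 512); softmax deferred to the CRF
```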
Image2Surface projection layer. The goal of this layer is to aggregate the confidence maps across multiple views, and project the result back onto the 3D surface. We note that both the locations and the number of optimal viewpoints can vary from shape to shape, and they are not ordered in any manner. Even if the optimal viewpoints were the same for different shapes, the views would still not necessarily be ordered, since we do not assume that shapes are oriented consistently. As a result, the projection layer should be invariant to the input image ordering. Given M_s input images of an input shape s, the L confidence maps extracted from the FCN module are stacked into an M_s × 512 × 512 × L image. The projection layer takes as input this 4D image. In addition, it takes as input the surface reference (polygon ID) images, also stacked into a 3D M_s × 512 × 512 image. The layer outputs an F_s × L array, where F_s is the number of polygons of the shape s. The projection is done through a view-pooling operation. For each surface polygon f and part category label l, we assign a confidence C̃(f, l) equal to the maximum label confidence across all pixels and input images that map to that polygon according to the surface reference images. Mathematically, this projection operation is formulated as:

\tilde{C}(f, l) = \max_{m,i,j:\ I(m,i,j)=f} C(m, i, j, l) \quad (1)

where C(m, i, j, l) is the confidence of label l at pixel (i, j) of image m; I(m, i, j) stores the polygon ID at pixel (i, j) of the corresponding reference image m; and C̃(f, l) is the output confidence of label l at polygon f. We note that the surface reference images omit polygon references at and near the shape silhouette, since an excessively large, nearly occluded, portion of the surface tends to be mapped onto the silhouette, and thus the projection becomes unreliable there. An alternative aggregation strategy would be to use the average instead of the maximum, but we observed that this results in slightly lower performance (about 1% in our experiments).
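The view-pooling operation of Eq. (1) can be sketched as follows, using NumPy as a stand-in for the custom Image2Surface layer and assumed array shapes. Silhouette and background pixels are marked with a polygon ID of -1 and skipped; polygons never referenced keep zero confidence.

```python
# View-pooling projection sketch: confidences is (M, H, W, L) over M views,
# reference is (M, H, W) polygon IDs (-1 for silhouette/background), and the
# result is (F, L) per-polygon confidences (max over all referencing pixels).
import numpy as np

def image2surface(confidences: np.ndarray, reference: np.ndarray,
                  num_faces: int) -> np.ndarray:
    num_labels = confidences.shape[-1]
    surface_conf = np.zeros((num_faces, num_labels), dtype=confidences.dtype)
    valid = reference >= 0                          # drop silhouette pixels
    face_ids = reference[valid]                     # (N,) polygon index per pixel
    pixel_conf = confidences[valid]                 # (N, L)
    # Scatter-max: surface_conf[face_ids] = max(surface_conf[face_ids], pixel_conf)
    np.maximum.at(surface_conf, face_ids, pixel_conf)
    return surface_conf
```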
Surface CRF. Some small surface areas may be highly occluded and hence unobserved by any of the selected viewpoints, or not included in any of the reference images. For any such polygons, the label confidences are set to zero. The rest of the surface should propagate label confidences to these polygons. In addition, due to upsampling in the FCN module, there might be bleeding across surface convexities or concavities that are likely to be segmentation boundaries. We define a CRF operating on the surface representation to deal with the above issues. Specifically, each polygon f is assigned a random variable R_f representing its label. The CRF includes a unary factor for each such variable, which is set according to the confidences produced in the projection layer: φ_unary(R_f = l) = exp(C̃(f, l)). The CRF also encodes pairwise interactions between these variables based on surface proximity and curvature. For each pair of neighboring polygons (f, f'), we define a factor that favors the same label for polygons which share normals (e.g. on a flat surface), and different labels otherwise. Given the angle ω_{f,f'} between their normals (ω_{f,f'} is divided by π to map it to [0, 1]), the factor is defined as follows:
\phi_{\mathrm{adj}}(R_f = l, R_{f'} = l') =
\begin{cases}
\exp\left(-w_{\mathrm{adj}} \, w_{l,l'} \, \omega_{f,f'}^2\right), & l = l' \\
\exp\left(-w_{\mathrm{adj}} \, w_{l,l'} \, (1 - \omega_{f,f'}^2)\right), & l \neq l'
\end{cases}
where w_adj and w_{l,l'} are learned factor- and label-dependent weights. We also define factors that favor similar labels for polygons f, f' which are spatially close to each other according to the geodesic distance d_{f,f'} between them. These factors are defined for pairs of polygons whose geodesic distance is less than 10% of the bounding sphere radius in our implementation. This makes our CRF relatively dense and more sensitive to long-range interactions between surface variables. We note that for small meshes or point clouds, all pairs could be considered instead. The geodesic distance-based factors are defined as follows:
\phi_{\mathrm{dist}}(R_f = l, R_{f'} = l') =
\begin{cases}
\exp\left(-w_{\mathrm{dist}} \, w_{l,l'} \, d_{f,f'}^2\right), & l = l' \\
\exp\left(-w_{\mathrm{dist}} \, w_{l,l'} \, (1 - d_{f,f'}^2)\right), & l \neq l'
\end{cases}
where the factor-dependent weight w_dist and label-dependent weights w_{l,l'} are learned parameters, and d_{f,f'} represents the geodesic distance between f and f'. Distances are normalized to [0, 1].

[Figure 2 panels: unary factor only; CRF without the distance-based factor; CRF without the adjacency factor; full CRF; ground truth. Part labels shown: frame, wheel, handle, seat, tank, headlight.]
Figure 2. Labeled segmentation results for alternative versions of our CRF (best viewed in color).
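The two pairwise factors can be sketched as follows (illustrative Python). The negative exponents are an assumption chosen so that, with positive weights, flat adjacent polygons and geodesically close polygons prefer the same label, matching the description above; inputs are scalars already normalized to [0, 1].

```python
# Pairwise CRF factor sketch: omega = angle between face normals divided by
# pi, dist = geodesic distance normalized to [0, 1], weights are learned.
import math

def phi_adj(same_label: bool, omega: float, w_adj: float, w_ll: float) -> float:
    x = omega ** 2 if same_label else 1.0 - omega ** 2
    return math.exp(-w_adj * w_ll * x)

def phi_dist(same_label: bool, dist: float, w_dist: float, w_ll: float) -> float:
    x = dist ** 2 if same_label else 1.0 - dist ** 2
    return math.exp(-w_dist * w_ll * x)

# e.g. two coplanar neighbours (omega ~ 0) favour equal labels:
# phi_adj(True, 0.0, 1.0, 1.0) = 1.0  >  phi_adj(False, 0.0, 1.0, 1.0) = exp(-1)
```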
Based on the above factors, our CRF is defined over all surface random variables R_s = {R_1, R_2, . . . , R_{F_s}} of the shape s as follows:

P(\mathbf{R}_s) = \frac{1}{Z_s} \prod_{f} \phi_{\mathrm{unary}}(R_f) \prod_{\mathrm{adj}\ f,f'} \phi_{\mathrm{adj}}(R_f, R_{f'}) \prod_{f,f'} \phi_{\mathrm{dist}}(R_f, R_{f'}) \quad (2)
where Z_s is a normalization constant. Exact inference is intractable, thus we resort to mean-field inference to approximate the most likely joint assignment to all random variables as well as their marginal probabilities. Our mean-field approximation uses distributions over single variables as messages (i.e. the posterior is approximated in a fully factorized form; see Algorithm 11.7 of [28]). Figure 2 shows how segmentation results degrade for alternative versions of our CRF, and when the unary term is used alone.
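A schematic version of this mean-field procedure is sketched below. It is a simplified, fully factorized update rather than the authors' Caffe layer; the pair list with per-pair log-factor matrices is an assumed pre-computed input that folds in both the adjacency and geodesic factors.

```python
# Mean-field sketch for the CRF in Eq. (2). `unary` is the (F, L) array of
# projected confidences C~(f, l) (i.e. log of the unary factors); `pairs` is
# a list of (f, f2, log_phi) with log_phi an (L, L) matrix of log factor
# values for that pair. Returns approximate marginals P(R_f = l).
import numpy as np

def mean_field(unary: np.ndarray, pairs, num_iters: int = 20) -> np.ndarray:
    Q = np.exp(unary - unary.max(axis=1, keepdims=True))
    Q /= Q.sum(axis=1, keepdims=True)               # initialize from the unaries
    for _ in range(num_iters):
        msg = unary.copy()
        for f, f2, log_phi in pairs:                # accumulate expected log-factors
            msg[f] += log_phi @ Q[f2]
            msg[f2] += log_phi.T @ Q[f]
        Q = np.exp(msg - msg.max(axis=1, keepdims=True))
        Q /= Q.sum(axis=1, keepdims=True)           # renormalize the marginals
    return Q
```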
Training procedure. The FCN module is initialized with filters pre-trained on image processing tasks [57]. Since the input to our network consists of rendered grayscale (colorless) images, we average the BGR channel weights of the pre-trained filters of the first convolutional layer, i.e. the 3 × 3 × 3 filters are converted to color-insensitive 3 × 3 × 1 filters. Then, we replicate the weights twice to yield 3 × 3 × 2 filters that can accept our 2-channel input images. The CRF weights are initialized to 1.
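The first-layer weight surgery described above amounts to the following sketch, assuming the common (out_channels, in_channels, kH, kW) tensor layout.

```python
# Average the pre-trained BGR channels into one grayscale channel, then
# replicate it so the filters accept our 2-channel shaded + depth input.
import numpy as np

def adapt_first_layer(w_bgr: np.ndarray) -> np.ndarray:
    """w_bgr: (64, 3, 3, 3) pre-trained filters -> (64, 2, 3, 3)."""
    w_gray = w_bgr.mean(axis=1, keepdims=True)      # (64, 1, 3, 3), color-insensitive
    return np.repeat(w_gray, 2, axis=1)             # (64, 2, 3, 3) for shaded + depth
```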
Given an input training dataset S of 3D shapes, we first generate their depth, shaded, and reference images using our rendering procedure. Then, our algorithm fine-tunes the FCN module filter parameters θ and learns the CRF weights w_adj, w_dist, {w_{l,l'}} to maximize their log-likelihood plus a small regularization term:

L = \frac{1}{|S|} \sum_{s \in S} \log P(\mathbf{R}_s = \mathbf{T}_s) + \lambda \, \|\theta\|^2 \quad (3)

where T_s are the ground-truth labels per surface variable for the training shape s, and λ is a regularization parameter (weight decay) set to 10^{-3} in our experiments.
To maximize the above objective, we must compute its gradient w.r.t. the FCN module outputs, as required for backpropagation:

\frac{\partial L}{\partial C(m,i,j,l)} =
\begin{cases}
1 - P(R_f = l), & \text{if } l = T_f \text{ and } I(m,i,j) = f \\
-P(R_f = l), & \text{if } l \neq T_f \text{ and } I(m,i,j) = f \\
0, & \text{otherwise}
\end{cases} \quad (4)

Computing the gradient requires estimation of the marginal probabilities P(R_f). We use mean-field inference to estimate the marginals (the same inference procedure is used for training and testing). We observed that after 20 iterations, mean-field often converges (i.e. marginals change very little). We also need to compute the gradient of the objective function w.r.t. the CRF weights. Since our CRF has the form of a log-linear model, gradients can be easily derived.
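The backward pass implied by Eq. (4) can be sketched as follows (assumed shapes, not the authors' Caffe layer): the mean-field marginals are turned into per-face gradients and scattered to every pixel that references the corresponding face.

```python
# Eq. (4) gradient sketch: marginals is (F, L), gt_labels is (F,) integer
# ground-truth labels, reference is (M, H, W) polygon IDs (-1 = silhouette).
import numpy as np

def image2surface_backward(marginals: np.ndarray, gt_labels: np.ndarray,
                           reference: np.ndarray) -> np.ndarray:
    F, L = marginals.shape
    grad_faces = -marginals                          # -P(R_f = l) for all labels
    grad_faces[np.arange(F), gt_labels] += 1.0       # 1 - P(R_f = l) at l = T_f
    M, H, W = reference.shape
    grad_pixels = np.zeros((M, H, W, L), dtype=marginals.dtype)
    valid = reference >= 0                           # silhouette pixels get no gradient
    grad_pixels[valid] = grad_faces[reference[valid]]
    return grad_pixels                               # dL/dC(m, i, j, l)
```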
Given the estimated gradients, we can train our network through backpropagation. Backpropagation can send error messages towards any FCN branch, i.e., any input image (Figure 1). One strategy to train our network would be to set up as many FCN branches as the largest number of rendered images across all training models. However, the number of selected viewpoints varies per model, thus the number of rendered images per model also varies, ranging from a few tens to a few hundreds in our datasets. Maintaining hundreds of FCN branches would exceed the memory capacity of current GPUs. Instead, during training, our strategy is to pick a random subset of 24 images per model, i.e. we keep 24 FCN branches with shared parameters in GPU memory. For each batch, a different random subset per model is selected (i.e. no fixed set of views is used for training). We note that the order of rendered images does not matter; our view pooling is invariant to the input image ordering. Our training strategy is reminiscent of the DropConnect technique [49], which tends to reduce overfitting.
At test time, all rendered images per model are used to make predictions. The forward pass does not require all the input images to be processed at once (i.e., not all FCN branches need to be set up). At test time, the image label confidences are sequentially projected onto the surface, which produces the same results as projecting all of them at once.
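This equivalence holds because the view-pooling of Eq. (1) is an elementwise maximum, which is associative, so views can be folded in one at a time. A minimal sketch with assumed array shapes is shown below.

```python
# Sequential projection sketch: each view contributes a (H, W, L) confidence
# map and a (H, W) reference image of polygon IDs (-1 for silhouette).
import numpy as np

def project_sequentially(per_view_conf, per_view_ref, num_faces: int) -> np.ndarray:
    num_labels = per_view_conf[0].shape[-1]
    surface_conf = np.zeros((num_faces, num_labels), dtype=np.float32)
    for conf, ref in zip(per_view_conf, per_view_ref):
        valid = ref >= 0
        # Fold this single view into the running per-polygon maximum.
        np.maximum.at(surface_conf, ref[valid], conf[valid])
    return surface_conf
```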
Implementation. Our network is implemented using C++ and Caffe¹. Optimization is done through stochastic gradient descent with learning rate 10^{-3} and momentum 0.9. We implemented a new Image2Surface layer in Caffe for projecting image-based confidences onto the shape surface. We also created a CRF layer that handles mean-field inference during the forward pass, and estimates the required gradients during backpropagation.
4. Evaluation
We now present experimental validations and analysis of our approach.
Datasets. We evaluated our method on manually-labeled segmentations available from the ShapeNetCore [56], Labeled-PSB (L-PSB) [7, 26], and COSEG datasets [50]. The dataset from ShapeNetCore currently contains 17,773 “expert-verified” segmentations of 3D models across 16 categories. The 3D models of this dataset are gathered

¹ Our source code, results and datasets are available on the project page: http://people.cs.umass.edu/kalo/papers/shapepfcn/

References

[28] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 2015.
[44] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015.
[56] A. X. Chang et al. ShapeNet: An information-rich 3D model repository. arXiv preprint, 2015.
[57] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In Proc. ICLR, 2016.