3D Shape Segmentation with Projective Convolutional Networks
Evangelos Kalogerakis¹   Melinos Averkiou²   Subhransu Maji¹   Siddhartha Chaudhuri³
¹University of Massachusetts Amherst   ²University of Cyprus   ³IIT Bombay
Abstract
This paper introduces a deep architecture for segmenting 3D objects into their labeled semantic parts. Our architecture combines image-based Fully Convolutional Networks (FCNs) and surface-based Conditional Random Fields (CRFs) to yield coherent segmentations of 3D shapes. The image-based FCNs are used for efficient view-based reasoning about 3D object parts. Through a special projection layer, FCN outputs are effectively aggregated across multiple views and scales, then are projected onto the 3D object surfaces. Finally, a surface-based CRF combines the projected outputs with geometric consistency cues to yield coherent segmentations. The whole architecture (multi-view FCNs and CRF) is trained end-to-end. Our approach significantly outperforms the existing state-of-the-art methods in the currently largest segmentation benchmark (ShapeNet). Finally, we demonstrate promising segmentation results on noisy 3D shapes acquired from consumer-grade depth cameras.
1. Introduction
In recent years there has been an explosion of 3D shape data on the web. In addition to the increasing number of community-curated CAD models, depth sensors deployed on a wide range of platforms are able to acquire 3D geometric representations of objects in the form of polygon meshes or point clouds. Although there have been significant advances in analyzing color images, in particular through deep networks, existing semantic reasoning techniques for 3D geometric shape data mostly rely on heuristic processing stages and hand-tuned geometric descriptors.
Our work focuses on the task of segmenting 3D shapes into labeled semantic parts. Compositional part-based reasoning for 3D shapes has been shown to be effective for a large number of vision, robotics and virtual reality applications, such as cross-modal analysis of 3D shapes and color images [60, 24], skeletal tracking [42], object detection in images [11, 30, 36], 3D object reconstruction from images and line drawings [54, 24, 21], interactive assembly-based 3D modeling [5, 4], generating 3D shapes from a small number of examples [25], style transfer between 3D objects [33], and robot navigation and grasping [40, 8], to name a few.

The shape segmentation task, while fundamental, is challenging because of the variety and ambiguity of shape parts that must be assigned the same semantic label; because accurately detecting boundaries between parts can involve extremely subtle cues; because local and global features must be jointly examined; and because the analysis must be robust to noise and undersampling.
We propose a deep architecture for segmenting and labeling 3D shapes that simply and effectively addresses these challenges, and significantly outperforms prior methods. The key insights of our technique are to repurpose image-based deep networks for view-based reasoning, and aggregate their outputs onto the surface representation of the shape in a geometrically consistent manner. We make no geometric, topological or orientation assumptions about the shape, nor exploit any hand-tuned geometric descriptors.
Our view-based approach is motivated by the success of deep networks on image segmentation tasks. Using rendered shapes lets us initialize our network with layers that have been trained on large image datasets, allowing better generalization. Since images depict shapes of photographed objects (along with texture), we expect such pre-trained layers to already encode some information about parts and their relationships. Recent work on view-based 3D shape classification [47, 38] and RGB-D recognition [15, 46] has shown the benefits of transferring learned representations from color images to geometric and depth data.
A view-based approach to 3D shape segmentation must overcome several technical obstacles. First, views must be selected such that they together cover the shape surface as much as possible and minimize occlusions. Second, shape parts can be visible in more than one view, thus our method must effectively consolidate information across multiple views. Third, we must guarantee that the segmentation is complete and coherent. This means all the surface area, including any heavily occluded portions, should be labeled, and neighboring surface areas should likely have the same label unless separated by a strong boundary feature.
Our approach, shown in Figure 1, systematically addresses these difficulties using a single feed-forward network. Given a raw 3D polygon mesh as input, our method generates a set of images from multiple views that are automatically selected for optimal surface coverage. These images are fed into the network, which outputs confidence maps per part via image processing layers. The confidence maps are fused and projected onto the shape surface representation through a projection layer. Finally, our architecture incorporates a surface-based Conditional Random Field (CRF) layer that promotes consistent labeling of the entire surface. The whole network, including the CRF, is trained in an end-to-end manner to achieve optimal performance.

[Figure 1 overview: input 3D shape and selected viewpoints → shaded and depth images → FCN modules (shared weights) → per-label confidence maps → Image2Surface projection layer (using surface reference images of triangle IDs) → surface-based CRF layer → labeled 3D shape; arrows indicate the forward pass/inference and backpropagation/learning. Example part labels: wing, fuselage, vertical stabilizer, horizontal stabilizer.]
Figure 1. Pipeline and architecture of our method for 3D shape segmentation and labeling. Given an input shape, a set of viewpoints is computed at different scales such that the viewed shape surface is maximally covered (left). Shaded and depth images from these viewpoints are processed through our architecture (here we show images for three viewpoints, corresponding to 3 different scales). Our architecture employs image-based Fully Convolutional Network (FCN) modules with shared parameters to process the input images. The modules output image-based part label confidences per view. Here we show confidence maps for the wing label (the redder the color, the higher the confidence). The confidences are aggregated and projected onto the shape surface through a special projection layer. Then they are further processed through a surface-based CRF that promotes consistent labeling of the entire surface (right).
Our main contribution is the introduction of a deep architecture for compositional part-based reasoning on 3D shape representations without the use of hand-engineered geometry processing stages or hand-tuned descriptors. We demonstrate significant improvements over the state of the art. For complex objects, such as aircraft, motor vehicles, and furniture, our method increases part labeling accuracy by a remarkable 8% over the state of the art on the currently largest 3D shape segmentation dataset.
2. Related work
Our work is related to learning methods for segmentation of images (including RGB-D data) and 3D shapes.
Image-based segmentation. There is a vast literature on segmenting images into objects and their parts. Most recent techniques are based on variants of random forest classifiers or convolutional networks. An example of the former is the remarkably fast and accurate human-pose estimator that uses depth data from Kinect sensors for labeling human parts [42]. Our work builds on the success of convolutional networks for material segmentation, scene labeling, and object part-labeling tasks. These approaches use image classification networks repurposed for dense image labeling, commonly a fully-convolutional network (FCN) [32], to obtain an initial labeling. Several strategies for improving these initial estimates have been proposed, including techniques based on top-down region-based reasoning [10, 16], CRFs [6, 31], atrous convolutional layers [6, 57], deconvolutional layers [35], recurrent networks [59], or a multi-scale analysis [34, 17]. Several works [29, 1, 2] have also focused on learning feature representations from RGB-D data (e.g. those captured using a Kinect sensor) for object-level recognition and detection in scenes. Recently, Gupta et al. [15] showed that image-based networks can be repurposed for extracting depth representations for object detection and segmentation. Recent works [14, 45, 18] have applied a similar strategy for indoor scene recognition tasks.
In contrast to the above methods, our work aims to segment geometric representations of 3D objects, in the form of polygon meshes, created through 3D modeling tools or reconstruction techniques. The 3D models of these objects often do not contain texture or color information. Segmenting these 3D objects into parts requires architectures that are capable of operating on their geometric representations.
Learning 3D shape representations from images. A few recent methods attempt to learn volumetric representations of shapes from images via convolutional networks that employ special layers to model shape projections onto images [55, 39]. Alternatively, mesh-based representations can also be learned from images by assuming a fixed number of mesh vertices [39]. In contrast to these works, our architecture discriminatively learns view-based shape representations along with a surface-based CRF such that the view projections match an input surface signal (part labels). Our 3D-2D projection mechanism is differentiable, parameter-free, and sparse, since it operates only on the shape surface rather than its volume. In contrast to the mesh representations of [39], we do not assume that meshes have a fixed number of vertices, which does not hold true for general 3D models. Our method is more related to methods that learn view-based shape representations [47, 38]. However, these methods only learn global representations for shape classification and rely on fixed sets of views. Our method instead learns view-based shape representations for part-based reasoning through adaptively selected views. It also uses a CRF to resolve inconsistencies or missing surface information in the view representations.
3D geometric shape segmentation. The most common learning-based approach to shape segmentation is to assign part labels to geometric elements of the shape representation, such as polygons, points, or patches [53]. This is often done through various processing stages: first, hand-engineered geometric descriptors of these elements are extracted (e.g. surface curvature, shape diameter, local histograms of point or normal distributions, surface eigenfunctions, etc.); then, a clustering method or classifier infers part labels for elements based on their descriptors; and finally (optionally) a separate graph cuts step is employed to smooth out the surface labeling [26, 41, 43, 19, 58]. Recently, a convolutional network has been proposed as an alternative element classifier [13], yet it operates on hand-engineered geometric descriptors organized in a 2D matrix lacking spatially coherent structure for conventional convolution. Another variant is to use two-layer networks which transform the input by randomized kernels, in the form of so-called “Extreme Learning Machines” [52], but these offer no better performance than standard shallow classifiers. Other approaches segment shapes by employing non-rigid alignment steps through deformable part templates [27, 20], or transfer labels through surface correspondences and functional maps between 3D shapes [48, 22, 50, 27, 23]. These correspondence and alignment methods rely on hand-engineered geometric descriptors and deformation steps. Wang et al. [51] segment 3D shapes by warping and matching binary images of their projected views with segmented 2D images through Hausdorff distances. However, the matching procedure is hand-tuned, while potentially useful surface information, such as depth and normals, is ignored.
In contrast to all the above approaches, we propose a view-based deep architecture for shape segmentation with four main advantages. First, our architecture adopts image processing layers learned on large-scale image datasets, which are orders of magnitude larger than existing 3D datasets. As we show in this work, the deep stack of several layers extracts feature representations that can be successfully adapted to the task of shape segmentation. We note that such transfer has also been observed recently for shape recognition [47, 38]. Second, our architecture produces shape segmentations without the use of hand-engineered geometric descriptors or processing stages that are prone to degeneracies in the shape representation (i.e. surface noise, sampling artifacts, irregular mesh tessellation, mesh degeneracies, and so on). Third, we employ adaptive viewpoint selection to effectively capture all surface parts for analysis. Finally, our architecture is trained end-to-end, including all image and surface processing stages. As a result of these contributions, our method achieves better performance than prior work on big and complex datasets by a large margin.
3. Method
Given an input 3D shape, the goal of our method is to segment it into labeled parts. We designed a projective convolutional network to this end. Our network architecture is visualized in Figure 1. It takes as input a set of images from multiple views optimized for maximal surface coverage; extracts part-based confidence maps through image processing layers (pre-trained on large image datasets); combines and projects these maps onto the surface through a projection layer; and finally incorporates a surface-based Conditional Random Field (CRF) that favors coherent labeling of the input surface. The whole network, including the CRF, is trained end-to-end. In the following sections, we discuss the input to our network, its layers, and the training procedure.
Input. The input to our algorithm is a 3D shape represented as a polygon mesh. As a preprocessing step, the shape surface is sampled with uniformly distributed points (1024 in our implementation). Our algorithm first determines an overcomplete collection of viewpoints such that nearly every point of the surface is visible from at least K viewpoints (in our implementation, K = 3). For each sampled surface point, we place viewpoints at different distances from it along its surface normal (distances are set to 0.5, 1.0 and 1.5 of the shape's bounding sphere radius). In this manner, the surface is depicted at different scales (Figure 1, left). We then determine a compact set of informative viewpoints that maximally cover the shape surface. For each viewpoint, the shape is rasterized under a perspective projection to a binary image, where we associate every "on" pixel with the sampled surface point closest to it. The coverage of the viewpoint is measured as the fraction of surface points visible from it, estimated by aggregating surface point references from the image. For each of the scales (camera distances), the viewpoint with largest coverage is inserted into a list. We then re-estimate coverages at this scale, omitting points already covered by the selected viewpoint, and the viewpoint with the next largest coverage is added to the list. The process is repeated until all surface points are covered at this scale. In our experiments, with man-made shapes and at our selected scales, approximately 20 viewpoints were enough to cover the vast majority of the surface area per scale.
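The greedy coverage step described above can be summarized by the following sketch (illustrative Python, not the authors' implementation). The boolean visibility matrix relating candidate viewpoints to sampled surface points is an assumed input produced by the rasterization pass, and the loop runs independently per scale.

```python
# Greedy viewpoint selection sketch. `visible[v, p]` is True if sampled
# surface point p is seen from candidate viewpoint v at the current scale;
# both the matrix and the candidate set are assumed inputs.
import numpy as np

def select_viewpoints(visible: np.ndarray) -> list:
    """Greedily pick viewpoints until every (coverable) surface point is covered."""
    num_views, num_points = visible.shape
    uncovered = np.ones(num_points, dtype=bool)    # points not yet seen
    selected = []
    while uncovered.any():
        # Coverage gain = number of still-uncovered points each view would add.
        gain = (visible & uncovered).sum(axis=1)
        best = int(np.argmax(gain))
        if gain[best] == 0:                        # remaining points are invisible
            break                                  # from every candidate view
        selected.append(best)
        uncovered &= ~visible[best]                # mark newly covered points
    return selected
```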
After determining our viewpoint collection, we render the shape to shaded images and depth images. For each viewpoint, we place a camera pointing towards the surface point used to generate that viewpoint, and rotate its up-vector 4 times at 90-degree intervals (i.e., we use 4 in-plane rotations). For each of these 4 camera rotations, we render a shaded, greyscale 512 × 512 image using a typical computer graphics shader (Phong reflection model [37]) and a depth image, which are concatenated into a single two-channel image. These images are fed as input to the image processing module (FCN) of our network, described below.
We found that both shaded and depth images are useful inputs. In early experiments, labeling accuracy dropped 2.5% using depth alone. This might be attributed to the more “photo-realistic” appearance of shaded images, which better match the statistics of real images used to pretrain our architecture. We note that shaded images directly encode surface normals relative to view direction (shading is computed from the angle between normals and view direction).
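As a small illustration of why a shaded rendering encodes normals relative to the view direction, consider a light placed at the camera: a diffuse term then reduces to the cosine of the angle between the surface normal and the view vector. The sketch below is a simplified stand-in for the Phong shader cited above, not the exact rendering code.

```python
# Minimal "headlight" diffuse shading: grayscale intensity depends only on
# the angle between the per-pixel surface normal and the view direction.
import numpy as np

def headlight_shading(normals: np.ndarray, view_dirs: np.ndarray,
                      ambient: float = 0.1) -> np.ndarray:
    """normals, view_dirs: (H, W, 3) unit vectors per pixel -> grayscale in [0, 1]."""
    cos_theta = np.clip(np.sum(normals * view_dirs, axis=-1), 0.0, 1.0)
    return np.clip(ambient + (1.0 - ambient) * cos_theta, 0.0, 1.0)
```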
In addition to the shaded and depth images, for each selected camera setting, we rasterize the shape into another image where each pixel stores the ID of the polygon whose projection is closest to the pixel center. These images, which we call “surface reference” images, are fed into the “projection layer” of our network (Figure 1).
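For clarity, the per-view data assembled for the network can be pictured as follows. This is an assumed layout used for illustration, not the authors' exact implementation: a two-channel float image feeds the FCN branch, and a parallel integer surface reference image of polygon IDs feeds the projection layer.

```python
# Assumed per-view data layout (placeholders to be filled by the renderer).
import numpy as np

H = W = 512
shaded = np.zeros((H, W), dtype=np.float32)      # greyscale Phong rendering (placeholder)
depth = np.zeros((H, W), dtype=np.float32)       # normalized depth buffer (placeholder)
fcn_input = np.stack([shaded, depth], axis=0)    # shape (2, 512, 512)

# Polygon ID per pixel; -1 marks background and silhouette pixels whose
# references are discarded as unreliable (see the projection layer below).
surface_reference = np.full((H, W), -1, dtype=np.int64)
```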
FCN module. The two-channel images produced in the previous step are processed through identical image-based Fully Convolutional Network (FCN) modules (Figure 1). Each FCN module outputs L confidence maps of size 512 × 512 for each input image, where L is the number of part labels. Specifically, in our implementation we employ the FCN architecture suggested in [57], which adopted the VGG-16 network [44] for dense prediction by removing its last two pooling and striding layers and using dilated convolutions. We perform two additional modifications to this FCN architecture. First, since our input is a 2-channel image, we use 2-channel 3 × 3 filters instead of 3-channel (BGR) ones. We also adapted these filters to handle greyscale rather than color images during our training procedure. Second, we modified the output of the original FCN module. The original FCN outputs L confidence maps of size 64 × 64. These are then converted into L probability maps through a softmax operation. Instead, we upsample the confidence maps to size 512 × 512 through a transposed convolutional (“deconvolution”) layer with learned parameters and stride 8. The confidences are later converted into probabilities through our CRF layer.
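A minimal PyTorch sketch of these two modifications is given below. The dilated VGG-16 trunk is treated as a given module, and the class name is hypothetical; the original implementation is in Caffe, so this is only an illustration of the idea.

```python
# FCN head sketch: 2-channel (shaded + depth) input and a learned 8x
# transposed-convolution upsampling of the L coarse confidence maps.
import torch
import torch.nn as nn

class FCNHeadSketch(nn.Module):
    def __init__(self, backbone: nn.Module, num_labels: int):
        super().__init__()
        # 2-channel input instead of 3-channel BGR.
        self.conv1 = nn.Conv2d(2, 64, kernel_size=3, padding=1)
        self.backbone = backbone                    # dilated VGG-16 trunk (assumed to
                                                    # output (B, L, 64, 64) confidences)
        # Learned 8x upsampling: (64 - 1) * 8 - 2 * 4 + 16 = 512.
        self.upsample = nn.ConvTranspose2d(num_labels, num_labels,
                                           kernel_size=16, stride=8, padding=4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv1(x)              # (B, 64, 512, 512)
        x = self.backbone(x)           # (B, L, 64, 64) coarse confidences
        return self.upsample(x)        # (B, L, 512, 512); softmax deferred to the CRF
```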
Image2Surface projection layer. The goal of this layer is to aggregate the confidence maps across multiple views, and project the result back onto the 3D surface. We note that both the locations and the number of optimal viewpoints can vary from shape to shape, and they are not ordered in any manner. Even if the optimal viewpoints were the same for different shapes, the views would still not necessarily be ordered, since we do not assume that shapes are oriented consistently. As a result, the projection layer should be invariant to the input image ordering. Given M_s input images of an input shape s, the L confidence maps extracted from the FCN module are stacked into an M_s × 512 × 512 × L image. The projection layer takes as input this 4D image. In addition, it takes as input the surface reference (polygon ID) images, also stacked into a 3D M_s × 512 × 512 image. The layer outputs an F_s × L array, where F_s is the number of polygons of the shape s. The projection is done through a view-pooling operation. For each surface polygon f and part category label l, we assign a confidence C̃(f, l) equal to the maximum label confidence across all pixels and input images that map to that polygon according to the surface reference images. Mathematically, this projection operation is formulated as:

\tilde{C}(f, l) = \max_{m,i,j:\ I(m,i,j)=f} C(m, i, j, l) \quad (1)

where C(m, i, j, l) is the confidence of label l at pixel (i, j) of image m; I(m, i, j) stores the polygon ID at pixel (i, j) of the corresponding reference image m; and C̃(f, l) is the output confidence of label l at polygon f. We note that the surface reference images omit polygon references at and near the shape silhouette, since an excessively large, nearly occluded, portion of the surface tends to be mapped onto the silhouette, and thus the projection becomes unreliable there. An alternative aggregation strategy would be to use the average instead of the maximum, but we observed that this results in slightly lower performance (about 1% in our experiments).
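The view-pooling operation of Eq. (1) can be sketched as follows, using NumPy as a stand-in for the custom Image2Surface layer and assumed array shapes. Silhouette and background pixels are marked with a polygon ID of -1 and skipped; polygons never referenced keep zero confidence.

```python
# View-pooling projection sketch: confidences is (M, H, W, L) over M views,
# reference is (M, H, W) polygon IDs (-1 for silhouette/background), and the
# result is (F, L) per-polygon confidences (max over all referencing pixels).
import numpy as np

def image2surface(confidences: np.ndarray, reference: np.ndarray,
                  num_faces: int) -> np.ndarray:
    num_labels = confidences.shape[-1]
    surface_conf = np.zeros((num_faces, num_labels), dtype=confidences.dtype)
    valid = reference >= 0                          # drop silhouette pixels
    face_ids = reference[valid]                     # (N,) polygon index per pixel
    pixel_conf = confidences[valid]                 # (N, L)
    # Scatter-max: surface_conf[face_ids] = max(surface_conf[face_ids], pixel_conf)
    np.maximum.at(surface_conf, face_ids, pixel_conf)
    return surface_conf
```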
Surface CRF. Some small surface areas may be highly occluded and hence unobserved by any of the selected viewpoints, or not included in any of the reference images. For any such polygons, the label confidences are set to zero. The rest of the surface should propagate label confidences to these polygons. In addition, due to upsampling in the FCN module, there might be bleeding across surface convexities or concavities that are likely to be segmentation boundaries. We define a CRF operating on the surface representation to deal with the above issues. Specifically, each polygon f is assigned a random variable R_f representing its label. The CRF includes a unary factor for each such variable, which is set according to the confidences produced in the projection layer: φ_unary(R_f = l) = exp(C̃(f, l)). The CRF also encodes pairwise interactions between these variables based on surface proximity and curvature. For each pair of neighboring polygons (f, f'), we define a factor that favors the same label for polygons which share normals (e.g. on a flat surface), and different labels otherwise. Given the angle ω_{f,f'} between their normals (ω_{f,f'} is divided by π to map it to [0, 1]), the factor is defined as follows:
\phi_{\mathrm{adj}}(R_f = l, R_{f'} = l') =
\begin{cases}
\exp\left(-w_{\mathrm{adj}} \, w_{l,l'} \, \omega_{f,f'}^2\right), & l = l' \\
\exp\left(-w_{\mathrm{adj}} \, w_{l,l'} \, (1 - \omega_{f,f'}^2)\right), & l \neq l'
\end{cases}
where w_adj and w_{l,l'} are learned factor- and label-dependent weights. We also define factors that favor similar labels for polygons f, f' which are spatially close to each other according to the geodesic distance d_{f,f'} between them. These factors are defined for pairs of polygons whose geodesic distance is less than 10% of the bounding sphere radius in our implementation. This makes our CRF relatively dense and more sensitive to long-range interactions between surface variables. We note that for small meshes or point clouds, all pairs could be considered instead. The geodesic distance-based factors are defined as follows:
\phi_{\mathrm{dist}}(R_f = l, R_{f'} = l') =
\begin{cases}
\exp\left(-w_{\mathrm{dist}} \, w_{l,l'} \, d_{f,f'}^2\right), & l = l' \\
\exp\left(-w_{\mathrm{dist}} \, w_{l,l'} \, (1 - d_{f,f'}^2)\right), & l \neq l'
\end{cases}
where the factor-dependent weight w_dist and label-dependent weights w_{l,l'} are learned parameters, and d_{f,f'} represents the geodesic distance between f and f'. Distances are normalized to [0, 1].

[Figure 2 panels: unary factor only; CRF without the distance-based factor; CRF without the adjacency factor; full CRF; ground truth. Part labels shown: frame, wheel, handle, seat, tank, headlight.]
Figure 2. Labeled segmentation results for alternative versions of our CRF (best viewed in color).
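The two pairwise factors can be sketched as follows (illustrative Python). The negative exponents are an assumption chosen so that, with positive weights, flat adjacent polygons and geodesically close polygons prefer the same label, matching the description above; inputs are scalars already normalized to [0, 1].

```python
# Pairwise CRF factor sketch: omega = angle between face normals divided by
# pi, dist = geodesic distance normalized to [0, 1], weights are learned.
import math

def phi_adj(same_label: bool, omega: float, w_adj: float, w_ll: float) -> float:
    x = omega ** 2 if same_label else 1.0 - omega ** 2
    return math.exp(-w_adj * w_ll * x)

def phi_dist(same_label: bool, dist: float, w_dist: float, w_ll: float) -> float:
    x = dist ** 2 if same_label else 1.0 - dist ** 2
    return math.exp(-w_dist * w_ll * x)

# e.g. two coplanar neighbours (omega ~ 0) favour equal labels:
# phi_adj(True, 0.0, 1.0, 1.0) = 1.0  >  phi_adj(False, 0.0, 1.0, 1.0) = exp(-1)
```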
Based on the above factors, our CRF is defined over all surface random variables R_s = {R_1, R_2, . . . , R_{F_s}} of the shape s as follows:

P(\mathbf{R}_s) = \frac{1}{Z_s} \prod_{f} \phi_{\mathrm{unary}}(R_f) \prod_{\mathrm{adj}\ f,f'} \phi_{\mathrm{adj}}(R_f, R_{f'}) \prod_{f,f'} \phi_{\mathrm{dist}}(R_f, R_{f'}) \quad (2)
where Z_s is a normalization constant. Exact inference is intractable, thus we resort to mean-field inference to approximate the most likely joint assignment to all random variables as well as their marginal probabilities. Our mean-field approximation uses distributions over single variables as messages (i.e. the posterior is approximated in a fully factorized form; see Algorithm 11.7 of [28]). Figure 2 shows how segmentation results degrade for alternative versions of our CRF, and when the unary term is used alone.
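A schematic version of this mean-field procedure is sketched below. It is a simplified, fully factorized update rather than the authors' Caffe layer; the pair list with per-pair log-factor matrices is an assumed pre-computed input that folds in both the adjacency and geodesic factors.

```python
# Mean-field sketch for the CRF in Eq. (2). `unary` is the (F, L) array of
# projected confidences C~(f, l) (i.e. log of the unary factors); `pairs` is
# a list of (f, f2, log_phi) with log_phi an (L, L) matrix of log factor
# values for that pair. Returns approximate marginals P(R_f = l).
import numpy as np

def mean_field(unary: np.ndarray, pairs, num_iters: int = 20) -> np.ndarray:
    Q = np.exp(unary - unary.max(axis=1, keepdims=True))
    Q /= Q.sum(axis=1, keepdims=True)               # initialize from the unaries
    for _ in range(num_iters):
        msg = unary.copy()
        for f, f2, log_phi in pairs:                # accumulate expected log-factors
            msg[f] += log_phi @ Q[f2]
            msg[f2] += log_phi.T @ Q[f]
        Q = np.exp(msg - msg.max(axis=1, keepdims=True))
        Q /= Q.sum(axis=1, keepdims=True)           # renormalize the marginals
    return Q
```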
Training procedure. The FCN module is initialized with filters pre-trained on image processing tasks [57]. Since the input to our network consists of rendered grayscale (colorless) images, we average the BGR channel weights of the pre-trained filters of the first convolutional layer, i.e. the 3 × 3 × 3 filters are converted to color-insensitive 3 × 3 × 1 filters. Then, we replicate the weights twice to yield 3 × 3 × 2 filters that can accept our 2-channel input images. The CRF weights are initialized to 1.
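The first-layer weight surgery described above amounts to the following sketch, assuming the common (out_channels, in_channels, kH, kW) tensor layout.

```python
# Average the pre-trained BGR channels into one grayscale channel, then
# replicate it so the filters accept our 2-channel shaded + depth input.
import numpy as np

def adapt_first_layer(w_bgr: np.ndarray) -> np.ndarray:
    """w_bgr: (64, 3, 3, 3) pre-trained filters -> (64, 2, 3, 3)."""
    w_gray = w_bgr.mean(axis=1, keepdims=True)      # (64, 1, 3, 3), color-insensitive
    return np.repeat(w_gray, 2, axis=1)             # (64, 2, 3, 3) for shaded + depth
```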
Given an input training dataset S of 3D shapes, we first generate their depth, shaded, and reference images using our rendering procedure. Then, our algorithm fine-tunes the FCN module filter parameters θ and learns the CRF weights w_adj, w_dist, {w_{l,l'}} to maximize their log-likelihood plus a small regularization term:

L = \frac{1}{|S|} \sum_{s \in S} \log P(\mathbf{R}_s = \mathbf{T}_s) + \lambda \, \|\theta\|^2 \quad (3)

where T_s are the ground-truth labels per surface variable for the training shape s, and λ is a regularization parameter (weight decay) set to 10^{-3} in our experiments.
To maximize the above objective, we must compute its gradient w.r.t. the FCN module outputs, as required for backpropagation:

\frac{\partial L}{\partial C(m,i,j,l)} =
\begin{cases}
1 - P(R_f = l), & \text{if } l = T_f \text{ and } I(m,i,j) = f \\
-P(R_f = l), & \text{if } l \neq T_f \text{ and } I(m,i,j) = f \\
0, & \text{otherwise}
\end{cases} \quad (4)

Computing the gradient requires estimation of the marginal probabilities P(R_f). We use mean-field inference to estimate the marginals (the same inference procedure is used for training and testing). We observed that after 20 iterations, mean-field often converges (i.e. marginals change very little). We also need to compute the gradient of the objective function w.r.t. the CRF weights. Since our CRF has the form of a log-linear model, gradients can be easily derived.
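The backward pass implied by Eq. (4) can be sketched as follows (assumed shapes, not the authors' Caffe layer): the mean-field marginals are turned into per-face gradients and scattered to every pixel that references the corresponding face.

```python
# Eq. (4) gradient sketch: marginals is (F, L), gt_labels is (F,) integer
# ground-truth labels, reference is (M, H, W) polygon IDs (-1 = silhouette).
import numpy as np

def image2surface_backward(marginals: np.ndarray, gt_labels: np.ndarray,
                           reference: np.ndarray) -> np.ndarray:
    F, L = marginals.shape
    grad_faces = -marginals                          # -P(R_f = l) for all labels
    grad_faces[np.arange(F), gt_labels] += 1.0       # 1 - P(R_f = l) at l = T_f
    M, H, W = reference.shape
    grad_pixels = np.zeros((M, H, W, L), dtype=marginals.dtype)
    valid = reference >= 0                           # silhouette pixels get no gradient
    grad_pixels[valid] = grad_faces[reference[valid]]
    return grad_pixels                               # dL/dC(m, i, j, l)
```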
Given the estimated gradients, we can train our network through backpropagation. Backpropagation can send error messages towards any FCN branch, i.e., any input image (Figure 1). One strategy to train our network would be to set up as many FCN branches as the largest number of rendered images across all training models. However, the number of selected viewpoints varies per model, thus the number of rendered images per model also varies, ranging from a few tens to a few hundreds in our datasets. Maintaining hundreds of FCN branches would exceed the memory capacity of current GPUs. Instead, during training, our strategy is to pick a random subset of 24 images per model, i.e. we keep 24 FCN branches with shared parameters in GPU memory. For each batch, a different random subset per model is selected (i.e. no fixed set of views is used for training). We note that the order of rendered images does not matter; our view pooling is invariant to the input image ordering. Our training strategy is reminiscent of the DropConnect technique [49], which tends to reduce overfitting.
At test time, all rendered images per model are used to make predictions. The forward pass does not require all the input images to be processed at once (i.e., not all FCN branches need to be set up). At test time, the image label confidences are sequentially projected onto the surface, which produces the same results as projecting all of them at once.
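This equivalence holds because the view-pooling of Eq. (1) is an elementwise maximum, which is associative, so views can be folded in one at a time. A minimal sketch with assumed array shapes is shown below.

```python
# Sequential projection sketch: each view contributes a (H, W, L) confidence
# map and a (H, W) reference image of polygon IDs (-1 for silhouette).
import numpy as np

def project_sequentially(per_view_conf, per_view_ref, num_faces: int) -> np.ndarray:
    num_labels = per_view_conf[0].shape[-1]
    surface_conf = np.zeros((num_faces, num_labels), dtype=np.float32)
    for conf, ref in zip(per_view_conf, per_view_ref):
        valid = ref >= 0
        # Fold this single view into the running per-polygon maximum.
        np.maximum.at(surface_conf, ref[valid], conf[valid])
    return surface_conf
```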
Implementation. Our network is implemented using C++ and Caffe¹. Optimization is done through stochastic gradient descent with learning rate 10^{-3} and momentum 0.9. We implemented a new Image2Surface layer in Caffe for projecting image-based confidences onto the shape surface. We also created a CRF layer that handles mean-field inference during the forward pass, and estimates the required gradients during backpropagation.
4. Evaluation
We now present experimental validations and analysis of our approach.
Datasets. We evaluated our method on manually-labeled segmentations available from the ShapeNetCore [56], Labeled-PSB (L-PSB) [7, 26], and COSEG datasets [50]. The dataset from ShapeNetCore currently contains 17,773 “expert-verified” segmentations of 3D models across 16 categories. The 3D models of this dataset are gathered

¹ Our source code, results and datasets are available on the project page: http://people.cs.umass.edu/kalo/papers/shapepfcn/

References

[28] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 2015.
[44] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015.
[56] A. X. Chang et al. ShapeNet: An information-rich 3D model repository. arXiv preprint, 2015.
[57] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In Proc. ICLR, 2016.