Proceedings ArticleDOI

FGCN: Deep Feature-Based Graph Convolutional Network for Semantic Segmentation of Urban 3D Point Clouds

TL;DR: A more stable and effective end-to-end architecture to classify raw 3D point clouds from indoor and outdoor scenes is introduced; it achieves results on par with or better than the state of the art on tasks like semantic scene parsing, part segmentation and urban classification on three standard benchmark datasets.
Abstract: Directly processing 3D point clouds using convolutional neural networks (CNNs) is a highly challenging task, primarily due to the lack of an explicit neighborhood relationship between points in 3D space. Several researchers have tried to cope with this problem using a preprocessing step of voxelization. Although this allows existing CNN architectures to be translated to 3D point clouds, in addition to computational and memory constraints it introduces quantization artifacts that limit the accurate inference of the underlying object's structure in the illuminated scene. In this paper, we have introduced a more stable and effective end-to-end architecture to classify raw 3D point clouds from indoor and outdoor scenes. In the proposed methodology, we encode the spatial arrangement of neighbouring 3D points inside an undirected symmetrical graph, which is passed along with features extracted from a 2D CNN to a Graph Convolutional Network (GCN) that contains three layers of localized graph convolutions to generate a complete segmentation map. The proposed network achieves results on par with or better than the state of the art on tasks like semantic scene parsing, part segmentation and urban classification on three standard benchmark datasets.

Summary (3 min read)

1. Introduction

  • With recent successes of convolutional neural network (CNN) architectures in processing 2D structured data, there is growing interest among researchers in developing similar architectures to directly process 3D point clouds.
  • Furthermore, many approaches [18, 29, 23] transform the 3D datasets into regular 3D structures like voxels and meshes to apply convolution, but the transformed regular structures lose most of the spatial information that lies between neighbouring points and thus struggle to obtain the local feature representations that can improve the overall classification results [33].
  • Bruna et al. [4] provided evidence of the possible generalizations of CNNs to signals in other domains without taking 3D translational factors into account.
  • Therefore, their proposed architecture learns the complete local structure embedded in the graph to achieve faster convergence and better classification results.
  • For reference, Figure 1 provides the visualization of two different outdoor scenes.

3. Proposed Methodology

  • In their proposed methodology, the authors extend the traditional graph-based convolutions [26, 37], which work on latent graph signals to output a global signature that is then used for classification.
  • Most of these architectures overlook the underlying spatial information between points inside a 3D space, which plays a crucial role in identifying objects.
  • Keeping in mind the importance of local features, the authors propose a unified architecture that jointly uses both local and global features to give a more stable and reliable network for semantic segmentation of 3D point clouds.
  • Using the global feature extractor before the graph convolutional network summarizes most of the information and provides geometric invariance [22], which in turn increases the overall performance of their network.
  • In the following sections, the authors explain the key components of their proposed architecture and provide evidence as to how using both local and global features can give better results.

3.1. Transforming 3D Point Sets to Weighted Graph

  • Using the normalized Laplacian with eigendecomposition has a high computational cost compared to ChebyNet [7].
  • Furthermore, Defferrard et al. [6] demonstrated the effectiveness of using the Chebyshev graph filtering approximation (graph convolution) on homogeneous graphs, for tasks like image classification and 2D scene understanding.
  • The authors adapt a similar approach to [7], using the Chebyshev polynomials as a graph filtering method, but in their approach they apply the convolution on heterogeneous graphs with global features (extracted from 2D convolutional layers) as input; the underlying point-to-graph transformation is sketched below.
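
The point-to-graph transformation this section refers to can be made concrete with a short sketch. The numpy/scipy snippet below is a minimal illustration of building the undirected weighted graph with k-nearest neighbours and the Gaussian edge weighting the paper describes in Equation 1; the values of k, sigma and kappa are illustrative, not the authors' settings.

```python
import numpy as np
from scipy.spatial import cKDTree  # fast k-nearest-neighbour queries


def build_weighted_graph(points, k=16, sigma=1.0, kappa=2.0):
    """Build the symmetric weighted adjacency matrix described in Sec. 3.1.

    points : (N, 3) array of XYZ coordinates.
    Returns a dense (N, N) matrix W with
    W[i, j] = exp(-||v_i - v_j||^2 / (2 * sigma^2)) for neighbours closer
    than kappa, and 0 otherwise (cf. Equation 1 of the paper).
    """
    n = points.shape[0]
    tree = cKDTree(points)
    dists, idx = tree.query(points, k=k + 1)  # the closest "neighbour" is the point itself
    W = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dists[i, 1:], idx[i, 1:]):  # skip the self-match
            if d < kappa:
                w = np.exp(-d ** 2 / (2.0 * sigma ** 2))
                W[i, j] = w
                W[j, i] = w  # keep the graph undirected / symmetric
    return W
```

For blocks of 4096 points a sparse adjacency matrix would normally be preferred; the dense form is kept here only for readability.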

3.2. Model Architecture

  • The architecture diagram can be visualized in Figure 2. For 3D feature extraction, many techniques have been developed to obtain global feature descriptors for 3D point sets [13, 22, 14, 8].
  • Johnson et al. [14] developed a method to extract local feature descriptors from 3D point sets called spin images.
  • The most recent work that employs CNNs to extract global features from raw 3D point clouds is PointNet [22].
  • Using the global feature extraction with the graph convolutional network speeds up the training process and increases the overall performance of their network, which is demonstrated in Sections 4 and 5.
  • Furthermore, using the Laplacian normalization, the eigenvalues of L lie in the range [−1, 1].

3.3. Training

  • The authors have used a batch size of 16 and dropout regularization of 0.8 for GCN layers and 0.4 for fully connected layers to prevent overfitting.
  • The model performs optimally at K = 1; as the authors increase the order K, the size of T_i(L) increases, which diminishes the speed and increases the time required to train the network.

4. Performance Measures

  • The authors have evaluated their architecture on a variety of benchmark datasets, including S3DIS, containing indoor 3D scenes [1], ShapeNet part segmentation [35] and the Semantic3D benchmark dataset [10].
  • The authors' methodology outperforms the existing architectures on all the benchmarked datasets, and most of the performance gain is due to encoding the local spatial features of the 3D point cloud inside a graph model.

4.1. Semantic Scene Parsing

  • In their first experiment, the authors have used the Stanford 3D dataset [1], which contains 3D scans from 6 different areas and 271 rooms, collectively acquired using an individual Matterport scanner.
  • The authors first divide the areas into rooms and then split the points in each room using 1m by 1m blocks (a sketch of this block splitting is given below).
  • Furthermore, each point contains a 9-dimensional vector containing XYZ coordinates, RGB color channels and a normal or an equirectangular projection per room.
  • The authors train their model using a point size N of 4096 per training example and a batch size of 16, where each point contains only the XYZ coordinates.
  • The comparison between their architecture and the existing architectures on the S3DIS dataset is shown in Table 1, and the results can be visualized in Figure 3.
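
As a rough illustration of the data preparation described above (1m by 1m blocks, 4096 points per training example), the sketch below partitions a room on the XY plane and samples a fixed number of points per block. It is an assumption about the preprocessing, not the authors' released code, and the resampling strategy is illustrative.

```python
import numpy as np


def split_room_into_blocks(room_points, block_size=1.0, n_points=4096):
    """Split a room's point cloud (M, C) into 1 m x 1 m blocks on the XY plane
    and sample n_points per block, as in the S3DIS setup of Sec. 4.1."""
    xy_min = room_points[:, :2].min(axis=0)
    cell_ids = np.floor((room_points[:, :2] - xy_min) / block_size).astype(int)
    blocks = []
    for cell in np.unique(cell_ids, axis=0):
        mask = np.all(cell_ids == cell, axis=1)
        pts = room_points[mask]
        # sample with replacement when a block holds fewer than n_points points
        choice = np.random.choice(len(pts), n_points, replace=len(pts) < n_points)
        blocks.append(pts[choice])
    return np.stack(blocks)  # (num_blocks, n_points, C)
```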

4.2. ShapeNet Part Segmentation

  • ShapeNet [35] provides a large-scale repository that contains richly annotated 3D shapes.
  • The ShapeNet part dataset from [35] contains 16,881 3D shapes from 16 different categories, labelled with 50 parts in total.
  • Furthermore, for a fair comparison the authors have used the same evaluation metric as PointNet [22].
  • The authors compute the intersection-over-union (IOU) over each object category and then compute the mIOU by averaging the IOUs of the individual categories (a sketch of this metric is given below).
  • The authors have compared their methodology with existing architectures that directly consume raw 3D point clouds, and have achieved a class-average mIOU of 83.1, which is on par with the state of the art.
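
A compact sketch of the metric referred to above is given below: per-part intersection-over-union for a single shape, which would then be averaged over shapes and categories to obtain the reported mIOU. The handling of parts that are absent from a shape varies between implementations, so this is a simplified illustration rather than the exact PointNet evaluation protocol.

```python
import numpy as np


def shape_part_iou(pred, target, num_parts):
    """Per-part IoU for one shape; pred and target are (N,) integer part labels."""
    ious = []
    for part in range(num_parts):
        intersection = np.sum((pred == part) & (target == part))
        union = np.sum((pred == part) | (target == part))
        if union == 0:          # part not present in this shape: skip (conventions differ)
            continue
        ious.append(intersection / union)
    return float(np.mean(ious))
```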

4.3. Semantic3D Benchmark

  • There has been a long tradition of benchmark evaluation in the geospatial data domain, particularly within ISPRS.
  • One example is the ISPRS-EuroSDR benchmark on High Density Aerial Image Matching, which evaluates dense matching algorithms [9, 5] on aerial imagery.
  • The authors have used the Semantic3D benchmark dataset [10] for evaluating their architecture.
  • It contains nearly 4 billion points collected with 30 terrestrial laser scanners across Central Europe, depicting European architecture in most of its scenes.
  • Additionally, the Semantic3D [10] benchmark proposed a baseline 3D-CNN architecture for 3D point cloud classification that takes as input 3D voxel grids per scan point at 5 different resolutions.

5. Architecture Design Goals

  • The authors evaluate the performance of their architecture with respect to speed and stability using the S3DIS [1] dataset.
  • The authors also show the effect of using local feature extraction and how adding the global features to their network gives the best performance.
  • Consider Figure 4, which shows the fluctuations in test loss during training on the S3DIS dataset [1] caused by sensitivity to the initial weights.
  • On the other hand, their final architecture uses both global features (which also provide geometric invariance [22]) and local point features, and thus has a relatively faster convergence rate and is more stable towards the unstructured nature of 3D point clouds.
  • This adds to the overall stability and reliability of their model across different scenes with objects of varying geometries.

6. Conclusion

  • The authors have presented FGCN, a novel feature based graph convolutional network for semantic segmentation of 3D point clouds.
  • The authors have shown the importance of using local features and how using the spatial position of points can increase the overall performance of the segmentation task when it comes to identifying objects in 3D scenes.
  • In addition to increased performance, the proposed architecture is invariant to geometric distortions and preserves the local structures of objects using the graph models.
  • Although the proposed network achieves better results in terms of accuracy, it requires a larger memory footprint compared to the existing architectures.


FGCN: Deep Feature-based Graph Convolutional Network for Semantic Segmentation of Urban 3D Point Clouds

Saqib Ali Khan¹, Yilei Shi², Muhammad Shahzad¹, Xiao Xiang Zhu³,⁴

¹ School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan
² Chair of Remote Sensing Technology (LMF), Technical University of Munich (TUM), Munich, Germany
³ Signal Processing in Earth Observation (SiPEO), Technical University of Munich (TUM), Munich, Germany
⁴ Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), Wessling, Germany

{sakhan.bscs16seecs; muhammad.shehzad}@seecs.edu.pk, yilei.shi@tum.de, xiaoxiang.zhu@dlr.de
Abstract

Directly processing 3D point clouds using convolutional neural networks (CNNs) is a highly challenging task, primarily due to the lack of an explicit neighborhood relationship between points in 3D space. Several researchers have tried to cope with this problem using a preprocessing step of voxelization. Although this allows existing CNN architectures to be translated to 3D point clouds, in addition to computational and memory constraints it introduces quantization artifacts that limit the accurate inference of the underlying object's structure in the illuminated scene. In this paper, we have introduced a more stable and effective end-to-end architecture to classify raw 3D point clouds from indoor and outdoor scenes. In the proposed methodology, we encode the spatial arrangement of neighbouring 3D points inside an undirected symmetrical graph, which is passed along with features extracted from a 2D CNN to a Graph Convolutional Network (GCN) that contains three layers of localized graph convolutions to generate a complete segmentation map. The proposed network achieves results on par with or better than the state of the art on tasks like semantic scene parsing, part segmentation and urban classification on three standard benchmark datasets.
1. Introduction

With recent successes of convolutional neural network (CNN) architectures in processing 2D structured data, there is growing interest among researchers in developing similar architectures to directly process 3D point clouds. For instance, there have been many attempts to extend traditional CNNs [18, 22, 24, 27], which are best suited to data that lie in a structured Euclidean space, to 3D point clouds. However, 3D datasets do not lie on a regular grid and thus lack an implicit neighborhood relationship. Owing to this, there does not exist a single well-defined notion that enables convolution on unstructured 3D data. Furthermore, many approaches [18, 29, 23] transform the 3D datasets into regular 3D structures like voxels and meshes to apply convolution, but the transformed regular structures lose most of the spatial information that lies between neighbouring points and thus struggle to obtain the local feature representations that can improve the overall classification results [33].

Figure 1. Examples of outdoor scenes from the Semantic3D benchmark dataset [10]. Our architecture assigns a correct semantic label to each object with accuracy on par with the state of the art. The results are visualized using the PPTK viewer. Best viewed in color.

To encode the neighbourhood relationships, a few researchers have used graph representations to capture the local features more effectively. In this context, Bronstein et al. [3] first used the term geometric deep learning and gave an overview of deep learning methods for datasets that lie in a non-Euclidean domain. However, the first prominent research that defines a convolutional GNN in a spectral domain was given by Bruna et al. [4]. They provided evidence of possible generalizations of CNNs to signals in other domains without taking 3D translational factors into account. Defferrard et al. [6] proposed a generalized formulation of CNNs for spectral graphs. Their approach used the recursive form of Chebyshev polynomials to propose a fast convolution for high-dimensional unstructured datasets such as social networks or protein-interaction networks. Furthermore, it is sometimes desirable to use a kernel-based approach [17, 30]. Using graph kernels is favourable because the local structure of the graph contains meaningful information. However, kernel-based approaches are computationally expensive and have quadratic training complexity.

Inspired by the idea of graph-based representations to propagate local features, we use a Graph Convolutional Network (GCN) to encode spatial information, or local neighbourhood features, into symmetrical graph models. In the proposed 3D representation, each point is represented by three coordinates (x, y, z). In addition to our local feature encoder, the GCN, we use a global feature extractor similar to [22] that extracts a vector of high-dimensional features by taking the raw point cloud as input. Using the global features summarizes most of the information and provides geometric invariance [22], which increases the overall performance and reliability of our network (see Section 5 for details). The graph convolution refines these high-order features using the local spatial features from the graph representation and outputs a global signature summarizing each point inside the graph. Therefore, our proposed architecture learns the complete local structure embedded in the graph to achieve faster convergence and better classification results. Our GCN, or spatial-temporal graph neural network [33], achieves results on par with or better than state-of-the-art architectures. Specifically, the main contributions of this work are the following:

  • A novel graph-based convolutional network is proposed that uses both local and global features for semantic segmentation of 3D point clouds;
  • It is shown how using the spatial information in the local neighbourhood of points in 3D space offers stability and increased performance;
  • The proposed architecture is compared with state-of-the-art approaches and achieves competitive performance on three standard benchmark datasets, including S3DIS [1], ShapeNet [35], and Semantic3D [10]. For reference, Figure 1 provides the visualization of two different outdoor scenes.

2. Related Work

Deep Learning on 3D Point Clouds. Many approaches utilize 3D shapes to apply deep learning; for example, Volumetric CNNs [23, 38, 21] are the pioneering works that apply 3D convolutions on voxelized shapes. However, Volumetric CNNs have a higher computational cost due to the sparsity of 3D data in volumetric representations. This problem has been addressed through careful engineering of CNNs [20, 31]; however, it still persists due to significantly sparse volumes in very large point clouds. Multiview CNNs [28] integrate multiple views of a 3D point cloud and apply 2D convolution for classification. With efficient 2D convolutions, they can process very high resolution data. Furthermore, these architectures can achieve state-of-the-art results in object classification on datasets like ModelNet [38], but they cannot be extended to more complex tasks like 3D scene understanding.

Recently, many new approaches have been proposed that directly consume raw 3D point clouds and are used for tasks like semantic segmentation, object classification and detection. PointNet [22] is the pioneering work that applies deep learning to raw 3D point clouds with significant improvements in performance. However, PointNet does not generalize well to complex scenes due to its inability to capture the local structure induced by the 3D space. The local structure is exploited by PointNet++ [24], an extension of PointNet, which captures local features with increasing contextual scales. SPLATNet [27], a sparse lattice network, uses bilateral convolutions as building blocks to apply 3D convolution only on the occupied parts of the lattice, which reduces memory and computational cost. PointConv [32] uses dynamic filters to apply convolution on point clouds, treating convolutional kernels as non-linear functions of the point coordinates comprised of density and weight functions.

Deep Learning on Graphs. Spectral CNNs were first introduced by [4] and extended by [6]. Many approaches like ours that apply convolution in a spectral domain use ideas from graph signal processing [26] to apply localized filters on graphs. Recently, many approaches [6, 15, 37] approximate the spectral convolution using Chebyshev polynomials, because transforming the signal back and forth between spectral domains can be expensive. Our approach uses Chebyshev polynomials for spectral convolutions in a similar way as [37, 26].

3. Proposed Methodology

Suppose we are given a set of m training examples {X_m, Y_m} with X_i = {P_j | j = 1 … n}, where n is the number of points P ∈ ℝ³ in X_i, and Y_m = {1 … n} is the associated semantic label of each point P_j in the i-th training example. Furthermore, each point P_j in X_i consists of a vector of 3D coordinates (x, y, z).

In our proposed methodology, we extend the traditional graph-based convolutions [26, 37], which work on latent graph signals to output a global signature that is then used for classification. Most of these architectures overlook the underlying spatial information between points inside a 3D space, which plays a crucial role in identifying objects. Keeping in mind the importance of local features, we propose a unified architecture that jointly uses both local and global features to give a more stable and reliable network for semantic segmentation of 3D point clouds. Using the global feature extractor before the graph convolutional network summarizes most of the information and provides geometric invariance [22], which in turn increases the overall performance of our network. In the following sections, we explain the key components of our proposed architecture and provide evidence as to how using both local and global features can give better results.

3.1. Transforming 3D Point Sets to Weighted Graph Signals

A graph convolutional network performs convolution on input that is supported on a graph G = {V, E, W}, with a finite number of nodes v_i ∈ V, edges e_ij = {v_i, v_j} ∈ E, and W_{i,j} ∈ W corresponding to the weighted graph signal, i.e. an entry of the adjacency matrix indicating a connection between v_i and v_j. In order to find the value of W_{i,j}, we find all the neighbouring nodes of node i using k-nearest neighbours, and then use a Gaussian kernel to weight the edge e_{i,j} connecting node i and a neighbouring node j:

    W_{i,j} = \begin{cases} \exp\left(-\dfrac{\lVert v_i - v_j \rVert^2}{2\sigma^2}\right) & \text{if } \lVert v_i - v_j \rVert < \kappa \\ 0 & \text{otherwise} \end{cases}    (1)

for some value of σ > 0 and parameter κ. In Equation 1, ‖v_i − v_j‖ represents the Euclidean distance between the feature vectors of node v_i = {x_i, y_i, z_i} and node v_j = {x_j, y_j, z_j}, with node v_j a neighbour of node v_i.

Given the undirected graph with adjacency matrix W ∈ ℝ^{N×N}, we apply graph filtering techniques [15, 37] using the normalized Laplacian matrix L = I_n − D^{−1/2} W D^{−1/2}, where D is the diagonal degree matrix with D_{ii} = Σ_j W_{i,j}. The normalized Laplacian matrix can also be interpreted through its eigendecomposition L = U Λ U^T, where U is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues. Let us restate our graph mapping function f(x) with input x as a linear graph filter transformation function with coefficients µ_1, µ_2, …, µ_n:

    f(x) = g_\mu(L)\,x = \sum_{i=0}^{K} \mu_i L^i x    (2)

The mapping function f(x) can also be approximated using the eigendecomposition form of the normalized Laplacian matrix with eigenvalues Λ:

    f(x) = g_\mu(L)\,x = U g_\mu(\Lambda) U^T x    (3)

Spectral graph filtering methods [12, 7, 26] also use Chebyshev polynomials to approximate graph filters. ChebyNet [7] uses the diagonal matrix of eigenvalues:

    f(x) = g_\theta(L)\,x = \sum_{i=0}^{K} \theta_i T_i(L)\,x    (4)

Additionally, Equation 4 can also be defined recursively, with T_0(x) = 1 and T_1(x) = x, as

    T_i(x) = 2x\,T_{i-1}(x) - T_{i-2}(x)    (5)

The goal of the graph convolutional layer is to learn a set of graph filtering coefficients {µ} or {θ} using any type of graph filtering method. However, using the normalized Laplacian with eigendecomposition has a high computational cost compared to ChebyNet [7]. Furthermore, Defferrard et al. [6] demonstrated the effectiveness of using the Chebyshev graph filtering approximation (graph convolution) on homogeneous graphs, for tasks like image classification and 2D scene understanding. We adapt a similar approach to [7], using the Chebyshev polynomials as a graph filtering method, but in our approach we apply the convolution on heterogeneous graphs with global features (extracted from 2D convolutional layers) as input.

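As a concrete illustration of the filtering in Equations 4 and 5, the sketch below applies a Chebyshev graph filter g_θ(L)x with the recurrence T_0(L)x = x, T_1(L)x = Lx, T_i(L)x = 2L T_{i−1}(L)x − T_{i−2}(L)x. It assumes L has already been normalized and rescaled so that its eigenvalues lie in [−1, 1]; the coefficients stand in for the learned parameters and are not the authors' values.

```python
import numpy as np


def chebyshev_filter(L, x, theta):
    """Apply g_theta(L) x = sum_{i=0..K} theta_i * T_i(L) x   (Equation 4).

    L     : (N, N) rescaled normalized Laplacian, eigenvalues in [-1, 1].
    x     : (N, D) input graph signal (e.g. the global feature matrix).
    theta : sequence of K+1 filter coefficients, one per Chebyshev order.
    """
    t_prev, t_curr = x, L @ x                 # T_0(L) x  and  T_1(L) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for i in range(2, len(theta)):
        t_next = 2.0 * (L @ t_curr) - t_prev  # recurrence of Equation 5
        out = out + theta[i] * t_next
        t_prev, t_curr = t_curr, t_next
    return out
```
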
3.2. Model Architecture

Our segmentation network consists of three main modules: 1) feature extraction, which takes the N × 3 dimensional point coordinate vector as input and outputs an N × D dimensional global feature vector; 2) graph signal processing, which also takes an N × 3 dimensional coordinate vector as input and outputs a weighted graph in the form of an adjacency matrix W; and 3) a graph convolutional network with learnable parameter θ of order K, which takes as input the N × D dimensional feature vector along with the weighted graph signals W and extracts the local features corresponding to the spatial arrangement of nodes in the graph; these are then passed to fully connected layers for per-point classification. The architecture diagram can be visualized in Figure 2.

Figure 2. Network Architecture: The network takes as input N points with coordinates (x, y, z). The input is passed to the graph signal processing module to generate a re-scaled normalized graph vector, and is also passed to deep convolutional feature extraction layers to output an N × D global feature vector. Both the normalized weighted graph and the global features go as input to the graph convolutional network, which outputs a global feature signature that is passed to a fully connected layer; this scales down the features and assigns one of k output classes to each point. The GCN uses the ReLU activation function and dropout regularization after each layer.

3D Feature Extraction. Many techniques have been developed in order to obtain global feature descriptors for 3D point sets [13, 22, 14, 8]. Johnson et al. [14] developed a method to extract local feature descriptors from 3D point sets called spin images. The coordinates (α, β) of a neighbouring point q in the spin image of a feature point p with surface normal n are given by α = n · (p − q) and β = √(‖p − q‖² − α²). The final spin image contains the neighbours of feature points accumulated in a discontinuous 2D bin, which is robust to occlusion and clutter. Flint et al. [8] propose a method called THRIFT that extends feature extraction techniques applied to 2D images, like SIFT, and propose a 3D feature descriptor that successfully identifies keypoints in range data.

Recently, convolutional neural networks have been used in general for feature extraction in both 2D and 3D domains. The most recent work that employs CNNs to extract global features from raw 3D point clouds is PointNet [22]. The PointNet architecture uses a stack of 2D convolutional layers for feature transformation and ensures invariance to permutations and geometric transformations, and it also considers the interaction among points using a localized convolution operation. PointNet outperformed all the existing methods used for classification of 3D points, which either required conversion to other irreversible representations [23, 38, 21] or used raw 3D point clouds [18].

In this paper, we take motivation from PointNet [22] and extend our graph convolutional network to be more robust using global features. So, instead of taking the point coordinates (x_i, y_i, z_i) as input feature vectors [37], we use 2D convolutional layers to output a global feature vector {x_i^(1), x_i^(2), …, x_i^(D)} ∈ ℝ^{N×D}, where D represents the number of features per point.

Using the global feature extraction with the graph convolutional network speeds up the training process and increases the overall performance of our network, as demonstrated in Sections 4 and 5.

Graph Convolutional Network (GCN). The GCN takes as input the feature vector {x_i^(1), x_i^(2), …, x_i^(D)} ∈ ℝ^{N×D}, where D corresponds to the number of features, along with the weighted graph signals W ∈ ℝ^{N×N}, and its goal is to learn a set of K trainable graph-filter coefficients. Moreover, a GCN learns a mapping function that can translate the input graph signals to capture the local features corresponding to the relative position of points in 3D space. So, a GCN can be written as a non-linear function σ of the input graph signals W^(l) and X^(l), where l corresponds to the activations of the l-th layer:

    f(X^{(l)}, W) = \sigma\left(\theta^{(l)} X^{(l)} W\right)    (6)

where the learnable parameter θ is of order K. The mapping function in Equation 6 contains an unnormalized graph representation W; because the range of values can vary for heterogeneous graphs, the unnormalized GCN cannot generalize well on graphs that lie in different spectral domains [33]. In order to overcome this problem, the input graph signal is normalized in such a way that all the rows of W sum to one [15]. In our proposed methodology, we have used a graph Laplacian L = I − D^{−1/2} W D^{−1/2} with the diagonal matrix D such that D_{ii} = Σ_j W_{ij} for symmetric normalization:

    f(X^{(l)}, W) = \sigma\left(\theta^{(l)} \hat{D}^{-1/2} \hat{W} \hat{D}^{-1/2} X^{(l)}\right)    (7)

where Ŵ = W + I, and I is the identity matrix. Furthermore, using the Laplacian normalization, the eigenvalues of L lie in the range [−1, 1].

In order to obtain the local features at each layer l, we use the Chebyshev polynomials of Equation 4 and take as input the global feature vector {x_i^(1), x_i^(2), …, x_i^(D)} for the first layer. Furthermore, in order to define a single graph convolution operation between the input feature vector x_i and a graph signal g, we use the inverse graph Fourier transform [33]:

    x \ast_G g = U\left(U^T x \odot U^T g\right)    (8)

where U is the matrix of eigenvectors and ⊙ represents the pointwise product of the inverse graph Fourier transforms of x and g, i.e. U^T x and U^T g.

In our proposed architecture, we have used the Chebyshev graph filtering representation given by Equation 4, with a K-neighbourhood at each point, to learn the localized feature maps with three layers of graph convolutions.

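A minimal numpy sketch of the symmetrically normalized propagation rule of Equation 7 is given below, with the renormalized adjacency Ŵ = W + I. The ReLU activation follows the architecture description (Figure 2); the placement of the weight matrix on the feature dimension and its shape are illustrative choices, not the authors' exact implementation.

```python
import numpy as np


def gcn_layer(X, W, Theta):
    """One graph convolution, f(X, W) = ReLU(D^-1/2 (W + I) D^-1/2 X Theta),
    following the symmetric normalization of Equation 7.

    X     : (N, D_in) node features.
    W     : (N, N) weighted adjacency matrix from Sec. 3.1.
    Theta : (D_in, D_out) learnable filter weights.
    """
    W_hat = W + np.eye(W.shape[0])                        # add self-connections
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W_hat.sum(axis=1)))
    W_norm = D_inv_sqrt @ W_hat @ D_inv_sqrt              # D^-1/2 (W + I) D^-1/2
    return np.maximum(W_norm @ X @ Theta, 0.0)            # ReLU activation
```
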
3.3. Training

The architecture is trained using the Adam optimizer with a learning rate that starts at 1 × 10⁻³ and is halved after every 20 epochs, but always stays in the range [1 × 10⁻⁷, 1 × 10⁻³]. We have used a batch size of 16 and dropout regularization of 0.8 for GCN layers and 0.4 for fully connected layers to prevent overfitting. Our network uses four 2D convolutional layers with kernel sizes [64, 64, 128, 1024], respectively. Furthermore, to avoid additional complexity in our model, we have used a weight decay of magnitude 2 × 10⁻⁴.

The speed and stability of the GCN depend heavily on the order K of the Chebyshev polynomial in Equation 4. The model performs optimally at K = 1; as we increase the order K, the size of T_i(L) increases, which diminishes the speed and increases the time required to train the network.

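To make the schedule above concrete, the following small sketch reproduces the stated learning-rate rule (start at 1 × 10⁻³, halve after every 20 epochs, never below 1 × 10⁻⁷). The stand-alone function is illustrative; in practice it would be attached to the Adam optimizer together with the weight decay of 2 × 10⁻⁴ mentioned above.

```python
def learning_rate(epoch, base_lr=1e-3, floor=1e-7, halve_every=20):
    """Learning-rate schedule from Sec. 3.3: halve base_lr every 20 epochs,
    clamped to the range [floor, base_lr]."""
    return max(base_lr * (0.5 ** (epoch // halve_every)), floor)


# epochs 0, 40 and 400 -> 0.001, 0.00025, 1e-07 (floored)
print(learning_rate(0), learning_rate(40), learning_rate(400))
```
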
4. Performance Measures

We have evaluated our architecture on a variety of benchmark datasets, including S3DIS, containing indoor 3D scenes [1], ShapeNet part segmentation [35] and the Semantic3D benchmark dataset [10]. Our methodology outperforms the existing architectures on all the benchmarked datasets, and most of the performance gain is due to encoding the local spatial features of the 3D point cloud inside a graph model.

Table 1. Results of semantic scene parsing on the Stanford 3D dataset. The mIOU is calculated as an average over the IOUs of all 13 classes containing indoor structural objects.

  Method            mean IOU   mean Accuracy
  PointNet [22]     47.71      48.98
  SEGCloud [29]     48.92      57.35
  Ours (GCN Only)   47.22      56.44
  Ours (FGCN)       52.17      63.22

Table 2. Results on ShapeNet part segmentation. The metric is mIOU, similar to the one used by PointNet [22]. We have compared our architecture with existing architectures on ShapeNet part segmentation; our network achieves slightly better results than the state of the art.

  Method             class average mIOU
  SSCNN [36]         82.0
  Kd-net [16]        77.4
  PointNet [22]      80.4
  PointNet++ [24]    81.9
  SpiderCNN [34]     82.4
  SPLATNet_3D [27]   82.0
  PointConv [32]     82.8
  Ours (GCN Only)    78.2
  Ours (FGCN)        83.1

4.1. Semantic Scene Parsing

In our first experiment, we have used the Stanford 3D dataset [1], which contains 3D scans from 6 different areas and 271 rooms, collectively acquired using an individual Matterport scanner. The dataset contains 13 classes, so each point can be assigned 1 out of 13 semantic labels.

In order to split the data into training and testing sets, we have used the same method and statistics as used by PointNet [22]. We first divide the areas into rooms and then split the points in each room using 1m by 1m blocks. Furthermore, each point contains a 9-dimensional vector containing XYZ coordinates, RGB color channels and a normal or an equirectangular projection per room.

We train our model using a point size N of 4096 per training example and a batch size of 16, where each point contains only the XYZ coordinates. The comparison between our architecture and existing architectures on the S3DIS dataset is shown in Table 1, and the results can be visualized in Figure 3. Our methodology outperforms the existing architectures by a significant margin.

4.2. ShapeNet Part Segmentation

ShapeNet [35] provides a large-scale repository that contains richly annotated 3D shapes. The ShapeNet part dataset from [35] contains 16,881 3D shapes from 16 different categories, labelled with 50 parts in total. In object’s part seg-

Citations
Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, the authors proposed a gradual receptive field component reasoning (RFCR) method, where target Receptive Field Component Codes (RFCCs) are designed to record categories within receptive fields for hidden units in the encoder.
Abstract: Hidden features in neural network usually fail to learn informative representation for 3D segmentation as supervisions are only given on output prediction, while this can be solved by omni-scale supervision on intermediate layers. In this paper, we bring the first omni-scale supervision method to point cloud segmentation via the proposed gradual Receptive Field Component Reasoning (RFCR), where target Receptive Field Component Codes (RFCCs) are designed to record categories within receptive fields for hidden units in the encoder. Then, target RFCCs will supervise the decoder to gradually infer the RFCCs in a coarse-to-fine categories reasoning manner, and finally obtain the semantic labels. Because many hidden features are inactive with tiny magnitude and make minor contributions to RFCC prediction, we propose a Feature Densification with a centrifugal potential to obtain more unambiguous features, and it is in effect equivalent to entropy regularization over features. More active features can further unleash the potential of our omni-supervision method. We embed our method into four prevailing backbones and test on three challenging benchmarks. Our method can significantly improve the backbones in all three datasets. Specifically, our method brings new state-of-the-art performances for S3DIS as well as Semantic3D and ranks the 1st in the ScanNet benchmark among all the point-based methods. Code is publicly available at https://github.com/azuki-miho/RFCR.

34 citations

Journal ArticleDOI
TL;DR: The association mechanism and the subsequent information transfer are presented, which are cornerstones for multi-modal scene analysis and facilitate to train machine learning algorithms and to semantically segment any of these data representations.
Abstract: The automatic semantic segmentation of the huge amount of acquired remote sensing data has become an important task in the last decade. Images and Point Clouds (PCs) are fundamental data representations, particularly in urban mapping applications. Textured 3D meshes integrate both data representations geometrically by wiring the PC and texturing the surface elements with available imagery. We present a mesh-centered holistic geometry-driven methodology that explicitly integrates entities of imagery, PC and mesh. Due to its integrative character, we choose the mesh as the core representation that also helps to solve the visibility problem for points in imagery. Utilizing the proposed multi-modal fusion as the backbone and considering the established entity relationships, we enable the sharing of information across the modalities imagery, PC and mesh in a twofold manner: (i) feature transfer and (ii) label transfer. By these means, we achieve to enrich feature vectors to multi-modal feature vectors for each representation. Concurrently, we achieve to label all representations consistently while reducing the manual label effort to a single representation. Consequently, we facilitate to train machine learning algorithms and to semantically segment any of these data representations – both in a multi-modal and single-modal sense. The paper presents the association mechanism and the subsequent information transfer, which we believe are cornerstones for multi-modal scene analysis. Furthermore, we discuss the preconditions and limitations of the presented approach in detail. We demonstrate the effectiveness of our methodology on the ISPRS 3D semantic labeling contest (Vaihingen 3D) and a proprietary data set (Hessigheim 3D).

8 citations


Cites methods from "FGCN: Deep Feature-Based Graph Conv..."

  • ...Ali Khan et al. (2020) transform PCs to an undirected symmetrically weighted graph encoding the spatial neighborhood and apply a Graph Convolutional Network....


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed an automated scan-to-BIM method considering both the geometry and material of building objects, which can contribute to the improvement of the as-built BIM model usability.
Abstract: Conventional scan to building information modeling (BIM) automation mainly deals with geometry. However, one of its limitations is the time it takes and the costs in generating material. Therefore, this study proposes an automated scan-to-BIM method considering both the geometry and material of building objects. It recognizes the geometry from a point cloud and the material from panorama images through deep learning–based semantic segmentation. The two extracted pieces of data are merged, and the BIM objects with material are automatically generated by using Dynamo. Here, the object–space relationships were applied to increase the accuracy of the material data to be included in the BIM object. As the result, the accuracy was improved by 48.66% compared with before the application. The proposed method can contribute to the improvement of the as-built BIM model usability because it can automatically generate a BIM model by reflecting the material, as well as the geometry of the existing building.
Proceedings ArticleDOI
26 Jul 2022
TL;DR: Experimental results show that the proposed upsampling algorithm is superior to the widely applied traditional interpolation algorithms when used for point cloud semantic segmentation.
Abstract: The point cloud semantic segmentation network based on point-wise multi-layer perceptron (MLP) has been widely applied with its end-to-end advantages. Normally, such networks use the traditional upsampling algorithm to recover the details of point clouds in the decoding stage. However, the point cloud has rich 3D geometric information. The traditional interpolation algorithm does not consider the geometric correlation in the process of recovering the details of the point cloud, resulting in the inaccurate output point features. To this end, a learnable upsampling algorithm is proposed in this paper. This upsampling algorithm is implemented by utilizing moving least squares (MLS) and radial basis function (RBF), which can fully exploit the local geometric features of point clouds and accurately restore the details of scenarios. The validity of the proposed upsampling operator is verified on the Semantic3D dataset. Experimental results show that the proposed upsampling algorithm is superior to the widely applied traditional interpolation algorithms when used for point cloud semantic segmentation.
Posted Content
TL;DR: In this article, the authors proposed a gradual receptive field component reasoning (RFCR) method, where target Receptive field component codes (RFCCs) are designed to record categories within receptive fields for hidden units in the encoder.
Abstract: Hidden features in neural network usually fail to learn informative representation for 3D segmentation as supervisions are only given on output prediction, while this can be solved by omni-scale supervision on intermediate layers. In this paper, we bring the first omni-scale supervision method to point cloud segmentation via the proposed gradual Receptive Field Component Reasoning (RFCR), where target Receptive Field Component Codes (RFCCs) are designed to record categories within receptive fields for hidden units in the encoder. Then, target RFCCs will supervise the decoder to gradually infer the RFCCs in a coarse-to-fine categories reasoning manner, and finally obtain the semantic labels. Because many hidden features are inactive with tiny magnitude and make minor contributions to RFCC prediction, we propose a Feature Densification with a centrifugal potential to obtain more unambiguous features, and it is in effect equivalent to entropy regularization over features. More active features can further unleash the potential of our omni-supervision method. We embed our method into four prevailing backbones and test on three challenging benchmarks. Our method can significantly improve the backbones in all three datasets. Specifically, our method brings new state-of-the-art performances for S3DIS as well as Semantic3D and ranks the 1st in the ScanNet benchmark among all the point-based methods. Code will be publicly available at this https URL.
References
Posted Content
TL;DR: A scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs which outperforms related methods by a significant margin.
Abstract: We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.

15,696 citations


"FGCN: Deep Feature-Based Graph Conv..." refers background or methods in this paper

  • ...Recently, many approaches [6, 15, 37] approximate the spectral convolution using Chebyshev polynomials, because transforming the signal back and forth between spectral domains can be expensive....


  • ...Given the undirected graph with adjacency matrix W ∈ ℝ^{N×N}, we apply graph filtering techniques [15, 37] using the normalized Laplacian matrix L = I_n − D^{−1/2} W D^{−1/2}, where D corresponds to the diagonal matrix in which D_{ii} = Σ_j W_{i,j}....


Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper designs a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input and provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing.
Abstract: Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.

9,457 citations


"FGCN: Deep Feature-Based Graph Conv..." refers background or methods in this paper

  • ...In this paper, we take motivation from PointNet [22] and extend our graph convolutional network to be more robust using global features....


  • ...Results on ShapeNet part segmentation: The metric is mIOU similar to the one used by PointNet [22]....


  • ...PointNet [22] is the pioneer work that applies deep learning on raw 3D point clouds with significant improvements in performance....


  • ...In order to split the data into training and testing sets, we have used the same method and statistics as used by PointNet [22]....


  • ...For instance, there has been many attempts to extend the traditional CNNs [18, 22, 24, 27], that are best fit for data that lie in a structured Euclidean space to 3D Figure 1....


Posted Content
TL;DR: A hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set and proposes novel set learning layers to adaptively combine features from multiple scales to learn deep point set features efficiently and robustly.
Abstract: Few prior works study deep learning on point sets. PointNet by Qi et al. is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. With further observation that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network called PointNet++ is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.

4,802 citations


"FGCN: Deep Feature-Based Graph Conv..." refers background in this paper

  • ...This problem has been addressed through careful engineering of CNNs [20, 31]....


  • ...Deep Learning on Graphs or spectral CNNs were first introduced by [4] and extended by [6]....


  • ...The local structure is exploited by PointNet++ [24], which is an extension of PointNet....


  • ...Defferrard et al. [6] proposed a generalized formulation of CNNs for spectral graphs....


  • ...Directly processing 3D point clouds using convolutional neural networks (CNNs) is a highly challenging task primarily due to the lack of explicit neighborhood relationship between points in 3D space....


Journal ArticleDOI
TL;DR: This article provides a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields and proposes a new taxonomy to divide the state-of-the-art GNNs into four categories, namely, recurrent GNNS, convolutional GNN’s, graph autoencoders, and spatial–temporal Gnns.
Abstract: Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications, where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on the existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this article, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art GNNs into four categories, namely, recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial–temporal GNNs. We further discuss the applications of GNNs across various domains and summarize the open-source codes, benchmark data sets, and model evaluation of GNNs. Finally, we propose potential research directions in this rapidly growing field.

4,584 citations

Posted Content
TL;DR: In this article, a spectral graph theory formulation of convolutional neural networks (CNNs) was proposed to learn local, stationary, and compositional features on graphs, and the proposed technique offers the same linear computational complexity and constant learning complexity as classical CNNs while being universal to any graph structure.
Abstract: In this work, we are interested in generalizing convolutional neural networks (CNNs) from low-dimensional regular grids, where image, video and speech are represented, to high-dimensional irregular domains, such as social networks, brain connectomes or words' embedding, represented by graphs. We present a formulation of CNNs in the context of spectral graph theory, which provides the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs. Importantly, the proposed technique offers the same linear computational complexity and constant learning complexity as classical CNNs, while being universal to any graph structure. Experiments on MNIST and 20NEWS demonstrate the ability of this novel deep learning system to learn local, stationary, and compositional features on graphs.

4,562 citations

Frequently Asked Questions (18)
Q1. What are the contributions in "FGCN: Deep Feature-Based Graph Convolutional Network for Semantic Segmentation of Urban 3D Point Clouds"?

In this paper, the authors have introduced a more stable and effective end-to-end architecture to classify raw 3D point clouds from indoor and outdoor scenes. In the proposed methodology, the authors encode the spatial arrangement of neighbouring 3D points inside an undirected symmetrical graph, which is passed along with features extracted from a 2D CNN to a Graph Convolutional Network (GCN) that contains three layers of localized graph convolutions to generate a complete segmentation map.

SPLATNet [27], sparse lattice networks, used bilateral convolutions as building blocks to apply 3D convolution only on the occupied parts of the lattice that reduces memory and computational cost. 

PointNet architecture uses a stack of 2D convolutional layers for feature transformation and ensures invariance to permutations, geometric transformations and also considers the interaction among points using a localized convolution operation. 

Flint et al. [8] propose a method called THRIFT that extends the feature extraction techniques applied to 2D images like SIFT and propose a 3D feature descriptor that successfully identifies keypoints in range data. 

In order to evaluate their model on the ShapeNet part dataset, the authors pre-compute the graph filters using Chebyshev polynomials (Equation 4) and train their model on each of the 16 object categories.

Instead of taking the point coordinates (x_i, y_i, z_i) as input feature vectors [37], the authors use 2D convolutional layers to output a global feature vector {x_i^(1), x_i^(2), …, x_i^(D)} ∈ ℝ^{N×D}, where D represents the number of features per point.

Many approaches utilize 3D shapes to apply deep learning; for example, Volumetric CNNs [23, 38, 21] are the pioneering works that apply 3D convolutions on voxelized shapes.

PointNet outperformed all the existing methods used for classification of 3D points which either required conversion to other irreversible representations [23, 38, 21] or used raw 3D point clouds [18]. 

On the other hand, their final architecture uses both global features (that also provides geometric invariance [22]) and local point features and thus has a relatively faster convergence rate and is more stable towards the unstructured nature of 3D point clouds. 

The authors have evaluated their architecture on a variety of benchmark datasets, including S3DIS, containing indoor 3D scenes [1], ShapeNet part segmentation [35] and the Semantic3D benchmark dataset [10].

Following are the main contributions proposed in this work: a novel graph-based convolutional network that uses both local and global features for semantic segmentation of 3D point clouds; …

In this work, the authors have shown the importance of using local features and how using the spatial position of points can increase the overall performance of the segmentation task when it comes to identifying objects in 3D scenes. 

The interest is towards consuming the point clouds directly [22, 24, 32, 27], but many of these architectures try hard to improve the local feature extractor by applying convolution directly to the unstructured point cloud.

For instance, there have been many attempts to extend the traditional CNNs [18, 22, 24, 27], which are best suited to data that lie in a structured Euclidean space, to 3D point clouds.

Let's restate the graph mapping function f(x) with input x as a linear graph filter transformation function with coefficients µ_1, µ_2, …, µ_n: f(x) = g_µ(L)x = Σ_{i=0}^{K} µ_i L^i x (2). The mapping function f(x) can also be approximated using the eigendecomposition form of the normalized Laplacian matrix with eigenvalues Λ: f(x) = g_µ(L)x = U g_µ(Λ) U^T x (3). Spectral-based graph filtering methods [12, 7, 26] also use Chebyshev polynomials to approximate graph filters; ChebyNet [7] uses the diagonal matrix of eigenvalues, as in Equation 4.

their final architecture reforms the raw 3D point cloud to a vector of high dimensional features before passing it on to the graph convolutional network. 

In addition to their local feature encoder or GCN, the authors have used a global feature extractor similar to [22], that extracts a vector of high dimensional features by taking the raw point cloud as input. 

Although the proposed network achieves better results in terms of accuracy, it requires a larger memory footprint compared to the existing architectures.