Proceedings ArticleDOI

FGCN: Deep Feature-Based Graph Convolutional Network for Semantic Segmentation of Urban 3D Point Clouds

TL;DR: A more stable and effective end-to-end architecture to classify raw 3D point clouds from indoor and outdoor scenes is introduced; it achieves results on par with or better than the state of the art on tasks like semantic scene parsing, part segmentation and urban classification on three standard benchmark datasets.
Abstract: Directly processing 3D point clouds using convolutional neural networks (CNNs) is a highly challenging task, primarily due to the lack of an explicit neighborhood relationship between points in 3D space. Several researchers have tried to cope with this problem using a preprocessing step of voxelization. Although this allows existing CNN architectures to be translated to 3D point clouds, in addition to computational and memory constraints it introduces quantization artifacts that limit the accurate inference of the underlying object's structure in the illuminated scene. In this paper, we have introduced a more stable and effective end-to-end architecture to classify raw 3D point clouds from indoor and outdoor scenes. In the proposed methodology, we encode the spatial arrangement of neighbouring 3D points inside an undirected symmetrical graph, which is passed along with features extracted from a 2D CNN to a Graph Convolutional Network (GCN) that contains three layers of localized graph convolutions to generate a complete segmentation map. The proposed network achieves results on par with or better than the state of the art on tasks like semantic scene parsing, part segmentation and urban classification on three standard benchmark datasets.

Summary (3 min read)

1. Introduction

  • With recent successes of convolutional neural network (CNN) architectures in processing 2D structured data, there is growing interest among researchers in developing similar architectures to directly process 3D point clouds.
  • Furthermore, many approaches [18, 29, 23] transform the 3D datasets into regular 3D structures like voxels and meshes to apply convolution, but the transformed regular structures lose most of the spatial information that lies between neighbouring points and thus struggle to obtain the local feature representations that can improve the overall classification results [33].
  • Bruna et al. [4] provided evidence of the possible generalizations of CNNs to signals in other domains without taking 3D translational factors into account.
  • Therefore, their proposed architecture learns the complete local structure embedded in the graph to achieve faster convergence and better classification results.
  • For reference, Figure 1 provides the visualization of two different outdoor scenes.

3. Proposed Methodology

  • In their proposed methodology, the authors extend the traditional graph-based convolutions [26, 37], which work on latent graph signals to output a global signature that is then used for classification.
  • Most of these architectures overlook the underlying spatial information between points inside a 3D space, which plays a crucial role in identifying objects.
  • Keeping in mind the importance of local features, the authors propose a unified architecture that jointly uses both local and global features to give a more stable and reliable network for semantic segmentation of 3D point clouds.
  • Using the global feature extractor before the graph convolutional network summarizes most of the information and provides geometric invariance [22], which in turn increases the overall performance of their network.
  • In the following sections, the authors explain the key components of their proposed architecture and provide evidence as to how using both local and global features can give better results.

3.1. Transforming 3D Point Sets to Weighted Graph

  • Using the normalized Laplacian with eigendecomposition has a high computational cost compared to ChebyNet [7].
  • Furthermore, Defferrard et al. [6] demonstrated the effectiveness of using the Chebyshev graph filtering approximation (graph convolution) on homogeneous graphs, for tasks like image classification and 2D scene understanding.
  • The authors adapt a similar approach to [7], using the Chebyshev polynomials as a graph filtering method, but in their approach they apply the convolution on heterogeneous graphs with global features (extracted from 2D convolutional layers) as input; the underlying point-to-graph transformation is sketched below.
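
The point-to-graph transformation this section refers to can be made concrete with a short sketch. The numpy/scipy snippet below is a minimal illustration of building the undirected weighted graph with k-nearest neighbours and the Gaussian edge weighting the paper describes in Equation 1; the values of k, sigma and kappa are illustrative, not the authors' settings.

```python
import numpy as np
from scipy.spatial import cKDTree  # fast k-nearest-neighbour queries


def build_weighted_graph(points, k=16, sigma=1.0, kappa=2.0):
    """Build the symmetric weighted adjacency matrix described in Sec. 3.1.

    points : (N, 3) array of XYZ coordinates.
    Returns a dense (N, N) matrix W with
    W[i, j] = exp(-||v_i - v_j||^2 / (2 * sigma^2)) for neighbours closer
    than kappa, and 0 otherwise (cf. Equation 1 of the paper).
    """
    n = points.shape[0]
    tree = cKDTree(points)
    dists, idx = tree.query(points, k=k + 1)  # the closest "neighbour" is the point itself
    W = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dists[i, 1:], idx[i, 1:]):  # skip the self-match
            if d < kappa:
                w = np.exp(-d ** 2 / (2.0 * sigma ** 2))
                W[i, j] = w
                W[j, i] = w  # keep the graph undirected / symmetric
    return W
```

For blocks of 4096 points a sparse adjacency matrix would normally be preferred; the dense form is kept here only for readability.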

3.2. Model Architecture

  • The architecture diagram can be visualized in Figure 2. For 3D feature extraction, many techniques have been developed to obtain global feature descriptors for 3D point sets [13, 22, 14, 8].
  • Johnson et al. [14] developed a method to extract local feature descriptors from 3D point sets called spin images.
  • The most recent work that employs CNNs to extract global features from raw 3D point clouds is PointNet [22].
  • Using the global feature extraction with the graph convolutional network speeds up the training process and increases the overall performance of their network, which is demonstrated in Sections 4 and 5.
  • Furthermore, using the Laplacian normalization, the eigenvalues of L lie in the range [−1, 1].

3.3. Training

  • The authors have used a batch size of 16 and dropout regularization of 0.8 for GCN layers and 0.4 for fully connected layers to prevent overfitting.
  • The model performs optimally at K = 1; as the authors increase the order K, the size of T_i(L) increases, which diminishes the speed and increases the time required to train the network.

4. Performance Measures

  • The authors have evaluated their architecture on a variety of benchmark datasets, including S3DIS, containing indoor 3D scenes [1], ShapeNet part segmentation [35] and the Semantic3D benchmark dataset [10].
  • The authors' methodology outperforms the existing architectures on all the benchmarked datasets, and most of the performance gain is due to encoding the local spatial features of the 3D point cloud inside a graph model.

4.1. Semantic Scene Parsing

  • In their first experiment, the authors have used the Stanford 3D dataset [1], which contains 3D scans from 6 different areas and 271 rooms, collectively acquired using an individual Matterport scanner.
  • The authors first divide the areas into rooms and then split the points in each room using 1m by 1m blocks (a sketch of this block splitting is given below).
  • Furthermore, each point contains a 9-dimensional vector containing XYZ coordinates, RGB color channels and a normal or an equirectangular projection per room.
  • The authors train their model using a point size N of 4096 per training example and a batch size of 16, where each point contains only the XYZ coordinates.
  • The comparison between their architecture and the existing architectures on the S3DIS dataset is shown in Table 1, and the results can be visualized in Figure 3.
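
As a rough illustration of the data preparation described above (1m by 1m blocks, 4096 points per training example), the sketch below partitions a room on the XY plane and samples a fixed number of points per block. It is an assumption about the preprocessing, not the authors' released code, and the resampling strategy is illustrative.

```python
import numpy as np


def split_room_into_blocks(room_points, block_size=1.0, n_points=4096):
    """Split a room's point cloud (M, C) into 1 m x 1 m blocks on the XY plane
    and sample n_points per block, as in the S3DIS setup of Sec. 4.1."""
    xy_min = room_points[:, :2].min(axis=0)
    cell_ids = np.floor((room_points[:, :2] - xy_min) / block_size).astype(int)
    blocks = []
    for cell in np.unique(cell_ids, axis=0):
        mask = np.all(cell_ids == cell, axis=1)
        pts = room_points[mask]
        # sample with replacement when a block holds fewer than n_points points
        choice = np.random.choice(len(pts), n_points, replace=len(pts) < n_points)
        blocks.append(pts[choice])
    return np.stack(blocks)  # (num_blocks, n_points, C)
```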

4.2. ShapeNet Part Segmentation

  • ShapeNet [35] provides a large-scale repository that contains richly annotated 3D shapes.
  • The ShapeNet part dataset from [35] contains 16,881 3D shapes from 16 different categories, labelled with 50 parts in total.
  • Furthermore, for a fair comparison the authors have used the same evaluation metric as PointNet [22].
  • The authors compute the intersection-over-union (IOU) over each object category and then compute the mIOU by averaging the IOUs of the individual categories (a sketch of this metric is given below).
  • The authors have compared their methodology with existing architectures that directly consume raw 3D point clouds, and have achieved a class-average mIOU of 83.1, which is on par with the state of the art.
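
A compact sketch of the metric referred to above is given below: per-part intersection-over-union for a single shape, which would then be averaged over shapes and categories to obtain the reported mIOU. The handling of parts that are absent from a shape varies between implementations, so this is a simplified illustration rather than the exact PointNet evaluation protocol.

```python
import numpy as np


def shape_part_iou(pred, target, num_parts):
    """Per-part IoU for one shape; pred and target are (N,) integer part labels."""
    ious = []
    for part in range(num_parts):
        intersection = np.sum((pred == part) & (target == part))
        union = np.sum((pred == part) | (target == part))
        if union == 0:          # part not present in this shape: skip (conventions differ)
            continue
        ious.append(intersection / union)
    return float(np.mean(ious))
```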

4.3. Semantic3D Benchmark

  • There has been a long tradition of benchmark evaluation in the geospatial data domain, particularly within ISPRS.
  • One example is the ISPRS-EuroSDR benchmark on High Density Aerial Image Matching, which evaluates dense matching algorithms [9, 5] on aerial imagery.
  • The authors have used the Semantic3D benchmark dataset [10] for evaluating their architecture.
  • It contains nearly 4 billion points collected with 30 terrestrial laser scanners across Central Europe, depicting European architecture in most of its scenes.
  • Additionally, the Semantic3D [10] benchmark proposed a baseline 3D-CNN architecture for 3D point cloud classification that takes as input 3D voxel grids per scan point at 5 different resolutions.

5. Architecture Design Goals

  • The authors evaluate the performance of their architecture with respect to speed and stability using the S3DIS [1] dataset.
  • The authors also show the effect of using local feature extraction and how adding the global features to their network gives the best performance.
  • Consider Figure 4, which shows the fluctuations in test loss during training on the S3DIS dataset [1] caused by sensitivity to the initial weights.
  • On the other hand, their final architecture uses both global features (which also provide geometric invariance [22]) and local point features, and thus has a relatively faster convergence rate and is more stable towards the unstructured nature of 3D point clouds.
  • This adds to the overall stability and reliability of their model across different scenes with objects of varying geometries.

6. Conclusion

  • The authors have presented FGCN, a novel feature based graph convolutional network for semantic segmentation of 3D point clouds.
  • The authors have shown the importance of using local features and how using the spatial position of points can increase the overall performance of the segmentation task when it comes to identifying objects in 3D scenes.
  • In addition to increased performance, the proposed architecture is invariant to geometric distortions and preserves the local structures of objects using the graph models.
  • Although the proposed network achieves better results in terms of accuracy, it requires a larger memory footprint compared to the existing architectures.


FGCN: Deep Feature-based Graph Convolutional Network for Semantic Segmentation of Urban 3D Point Clouds

Saqib Ali Khan¹, Yilei Shi², Muhammad Shahzad¹, Xiao Xiang Zhu³,⁴

¹ School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan
² Chair of Remote Sensing Technology (LMF), Technical University of Munich (TUM), Munich, Germany
³ Signal Processing in Earth Observation (SiPEO), Technical University of Munich (TUM), Munich, Germany
⁴ Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), Wessling, Germany

{sakhan.bscs16seecs; muhammad.shehzad}@seecs.edu.pk, yilei.shi@tum.de, xiaoxiang.zhu@dlr.de
Abstract

Directly processing 3D point clouds using convolutional neural networks (CNNs) is a highly challenging task, primarily due to the lack of an explicit neighborhood relationship between points in 3D space. Several researchers have tried to cope with this problem using a preprocessing step of voxelization. Although this allows existing CNN architectures to be translated to 3D point clouds, in addition to computational and memory constraints it introduces quantization artifacts that limit the accurate inference of the underlying object's structure in the illuminated scene. In this paper, we have introduced a more stable and effective end-to-end architecture to classify raw 3D point clouds from indoor and outdoor scenes. In the proposed methodology, we encode the spatial arrangement of neighbouring 3D points inside an undirected symmetrical graph, which is passed along with features extracted from a 2D CNN to a Graph Convolutional Network (GCN) that contains three layers of localized graph convolutions to generate a complete segmentation map. The proposed network achieves results on par with or better than the state of the art on tasks like semantic scene parsing, part segmentation and urban classification on three standard benchmark datasets.
1. Introduction

With recent successes of convolutional neural network (CNN) architectures in processing 2D structured data, there is growing interest among researchers in developing similar architectures to directly process 3D point clouds. For instance, there have been many attempts to extend traditional CNNs [18, 22, 24, 27], which are best suited to data that lie in a structured Euclidean space, to 3D point clouds. However, 3D datasets do not lie on a regular grid and thus lack an implicit neighborhood relationship. Owing to this, there does not exist a single well-defined notion that enables convolution on unstructured 3D data. Furthermore, many approaches [18, 29, 23] transform the 3D datasets into regular 3D structures like voxels and meshes to apply convolution, but the transformed regular structures lose most of the spatial information that lies between neighbouring points and thus struggle to obtain the local feature representations that can improve the overall classification results [33].

Figure 1. Examples of outdoor scenes from the Semantic3D benchmark dataset [10]. Our architecture assigns a correct semantic label to each object with accuracy on par with the state of the art. The results are visualized using the PPTK viewer. Best viewed in color.

To encode the neighbourhood relationships, a few researchers have used graph representations to capture the local features more effectively. In this context, Bronstein et al. [3] first used the term geometric deep learning and gave an overview of deep learning methods for datasets that lie in a non-Euclidean domain. However, the first prominent research that defines a convolutional GNN in a spectral domain was given by Bruna et al. [4]. They provided evidence of possible generalizations of CNNs to signals in other domains without taking 3D translational factors into account. Defferrard et al. [6] proposed a generalized formulation of CNNs for spectral graphs. Their approach used the recursive form of Chebyshev polynomials to propose a fast convolution for high-dimensional unstructured datasets such as social networks or protein-interaction networks. Furthermore, it is sometimes desirable to use a kernel-based approach [17, 30]. Using graph kernels is favourable because the local structure of the graph contains meaningful information. However, kernel-based approaches are computationally expensive and have quadratic training complexity.

Inspired by the idea of graph-based representations to propagate local features, we use a Graph Convolutional Network (GCN) to encode spatial information, or local neighbourhood features, into symmetrical graph models. In the proposed 3D representation, each point is represented by three coordinates (x, y, z). In addition to our local feature encoder, the GCN, we use a global feature extractor similar to [22] that extracts a vector of high-dimensional features by taking the raw point cloud as input. Using the global features summarizes most of the information and provides geometric invariance [22], which increases the overall performance and reliability of our network (see Section 5 for details). The graph convolution refines these high-order features using the local spatial features from the graph representation and outputs a global signature summarizing each point inside the graph. Therefore, our proposed architecture learns the complete local structure embedded in the graph to achieve faster convergence and better classification results. Our GCN, or spatial-temporal graph neural network [33], achieves results on par with or better than state-of-the-art architectures. Specifically, the main contributions of this work are the following:

  • A novel graph-based convolutional network is proposed that uses both local and global features for semantic segmentation of 3D point clouds;
  • It is shown how using the spatial information in the local neighbourhood of points in 3D space offers stability and increased performance;
  • The proposed architecture is compared with state-of-the-art approaches and achieves competitive performance on three standard benchmark datasets, including S3DIS [1], ShapeNet [35], and Semantic3D [10]. For reference, Figure 1 provides the visualization of two different outdoor scenes.

2. Related Work

Deep Learning on 3D Point Clouds. Many approaches utilize 3D shapes to apply deep learning; for example, Volumetric CNNs [23, 38, 21] are the pioneering works that apply 3D convolutions on voxelized shapes. However, Volumetric CNNs have a higher computational cost due to the sparsity of 3D data in volumetric representations. This problem has been addressed through careful engineering of CNNs [20, 31]; however, it still persists due to significantly sparse volumes in very large point clouds. Multiview CNNs [28] integrate multiple views of a 3D point cloud and apply 2D convolution for classification. With efficient 2D convolutions, they can process very high resolution data. Furthermore, these architectures can achieve state-of-the-art results in object classification on datasets like ModelNet [38], but they cannot be extended to more complex tasks like 3D scene understanding.

Recently, many new approaches have been proposed that directly consume raw 3D point clouds and are used for tasks like semantic segmentation, object classification and detection. PointNet [22] is the pioneering work that applies deep learning to raw 3D point clouds with significant improvements in performance. However, PointNet does not generalize well to complex scenes due to its inability to capture the local structure induced by the 3D space. The local structure is exploited by PointNet++ [24], an extension of PointNet, which captures local features with increasing contextual scales. SPLATNet [27], a sparse lattice network, uses bilateral convolutions as building blocks to apply 3D convolution only on the occupied parts of the lattice, which reduces memory and computational cost. PointConv [32] uses dynamic filters to apply convolution on point clouds, treating convolutional kernels as non-linear functions of the point coordinates comprised of density and weight functions.

Deep Learning on Graphs. Spectral CNNs were first introduced by [4] and extended by [6]. Many approaches like ours that apply convolution in a spectral domain use ideas from graph signal processing [26] to apply localized filters on graphs. Recently, many approaches [6, 15, 37] approximate the spectral convolution using Chebyshev polynomials, because transforming the signal back and forth between spectral domains can be expensive. Our approach uses Chebyshev polynomials for spectral convolutions in a similar way as [37, 26].

3. Proposed Methodology

Suppose we are given a set of m training examples {X_m, Y_m} with X_i = {P_j | j = 1 … n}, where n is the number of points P ∈ ℝ³ in X_i, and Y_m = {1 … n} is the associated semantic label of each point P_j in the i-th training example. Furthermore, each point P_j in X_i consists of a vector of 3D coordinates (x, y, z).

In our proposed methodology, we extend the traditional graph-based convolutions [26, 37], which work on latent graph signals to output a global signature that is then used for classification. Most of these architectures overlook the underlying spatial information between points inside a 3D space, which plays a crucial role in identifying objects. Keeping in mind the importance of local features, we propose a unified architecture that jointly uses both local and global features to give a more stable and reliable network for semantic segmentation of 3D point clouds. Using the global feature extractor before the graph convolutional network summarizes most of the information and provides geometric invariance [22], which in turn increases the overall performance of our network. In the following sections, we explain the key components of our proposed architecture and provide evidence as to how using both local and global features can give better results.

3.1. Transforming 3D Point Sets to Weighted Graph Signals

A graph convolutional network performs convolution on input that is supported on a graph G = {V, E, W}, with a finite number of nodes v_i ∈ V, edges e_ij = {v_i, v_j} ∈ E, and W_{i,j} ∈ W corresponding to the weighted graph signal, i.e. an entry of the adjacency matrix indicating a connection between v_i and v_j. In order to find the value of W_{i,j}, we find all the neighbouring nodes of node i using k-nearest neighbours, and then use a Gaussian kernel to weight the edge e_{i,j} connecting node i and a neighbouring node j:

    W_{i,j} = \begin{cases} \exp\left(-\dfrac{\lVert v_i - v_j \rVert^2}{2\sigma^2}\right) & \text{if } \lVert v_i - v_j \rVert < \kappa \\ 0 & \text{otherwise} \end{cases}    (1)

for some value of σ > 0 and parameter κ. In Equation 1, ‖v_i − v_j‖ represents the Euclidean distance between the feature vectors of node v_i = {x_i, y_i, z_i} and node v_j = {x_j, y_j, z_j}, with node v_j a neighbour of node v_i.

Given the undirected graph with adjacency matrix W ∈ ℝ^{N×N}, we apply graph filtering techniques [15, 37] using the normalized Laplacian matrix L = I_n − D^{−1/2} W D^{−1/2}, where D is the diagonal degree matrix with D_{ii} = Σ_j W_{i,j}. The normalized Laplacian matrix can also be interpreted through its eigendecomposition L = U Λ U^T, where U is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues. Let us restate our graph mapping function f(x) with input x as a linear graph filter transformation function with coefficients µ_1, µ_2, …, µ_n:

    f(x) = g_\mu(L)\,x = \sum_{i=0}^{K} \mu_i L^i x    (2)

The mapping function f(x) can also be approximated using the eigendecomposition form of the normalized Laplacian matrix with eigenvalues Λ:

    f(x) = g_\mu(L)\,x = U g_\mu(\Lambda) U^T x    (3)

Spectral graph filtering methods [12, 7, 26] also use Chebyshev polynomials to approximate graph filters. ChebyNet [7] uses the diagonal matrix of eigenvalues:

    f(x) = g_\theta(L)\,x = \sum_{i=0}^{K} \theta_i T_i(L)\,x    (4)

Additionally, Equation 4 can also be defined recursively, with T_0(x) = 1 and T_1(x) = x, as

    T_i(x) = 2x\,T_{i-1}(x) - T_{i-2}(x)    (5)

The goal of the graph convolutional layer is to learn a set of graph filtering coefficients {µ} or {θ} using any type of graph filtering method. However, using the normalized Laplacian with eigendecomposition has a high computational cost compared to ChebyNet [7]. Furthermore, Defferrard et al. [6] demonstrated the effectiveness of using the Chebyshev graph filtering approximation (graph convolution) on homogeneous graphs, for tasks like image classification and 2D scene understanding. We adapt a similar approach to [7], using the Chebyshev polynomials as a graph filtering method, but in our approach we apply the convolution on heterogeneous graphs with global features (extracted from 2D convolutional layers) as input.

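As a concrete illustration of the filtering in Equations 4 and 5, the sketch below applies a Chebyshev graph filter g_θ(L)x with the recurrence T_0(L)x = x, T_1(L)x = Lx, T_i(L)x = 2L T_{i−1}(L)x − T_{i−2}(L)x. It assumes L has already been normalized and rescaled so that its eigenvalues lie in [−1, 1]; the coefficients stand in for the learned parameters and are not the authors' values.

```python
import numpy as np


def chebyshev_filter(L, x, theta):
    """Apply g_theta(L) x = sum_{i=0..K} theta_i * T_i(L) x   (Equation 4).

    L     : (N, N) rescaled normalized Laplacian, eigenvalues in [-1, 1].
    x     : (N, D) input graph signal (e.g. the global feature matrix).
    theta : sequence of K+1 filter coefficients, one per Chebyshev order.
    """
    t_prev, t_curr = x, L @ x                 # T_0(L) x  and  T_1(L) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for i in range(2, len(theta)):
        t_next = 2.0 * (L @ t_curr) - t_prev  # recurrence of Equation 5
        out = out + theta[i] * t_next
        t_prev, t_curr = t_curr, t_next
    return out
```
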
3.2. Model Architecture

Our segmentation network consists of three main modules: 1) feature extraction, which takes the N × 3 dimensional point coordinate vector as input and outputs an N × D dimensional global feature vector; 2) graph signal processing, which also takes an N × 3 dimensional coordinate vector as input and outputs a weighted graph in the form of an adjacency matrix W; and 3) a graph convolutional network with learnable parameter θ of order K, which takes as input the N × D dimensional feature vector along with the weighted graph signals W and extracts the local features corresponding to the spatial arrangement of nodes in the graph; these are then passed to fully connected layers for per-point classification. The architecture diagram can be visualized in Figure 2.

Figure 2. Network Architecture: The network takes as input N points with coordinates (x, y, z). The input is passed to the graph signal processing module to generate a re-scaled normalized graph vector, and is also passed to deep convolutional feature extraction layers to output an N × D global feature vector. Both the normalized weighted graph and the global features go as input to the graph convolutional network, which outputs a global feature signature that is passed to a fully connected layer; this scales down the features and assigns one of k output classes to each point. The GCN uses the ReLU activation function and dropout regularization after each layer.

3D Feature Extraction. Many techniques have been developed in order to obtain global feature descriptors for 3D point sets [13, 22, 14, 8]. Johnson et al. [14] developed a method to extract local feature descriptors from 3D point sets called spin images. The coordinates (α, β) of a neighbouring point q in the spin image of a feature point p with surface normal n are given by α = n · (p − q) and β = √(‖p − q‖² − α²). The final spin image contains the neighbours of feature points accumulated in a discontinuous 2D bin, which is robust to occlusion and clutter. Flint et al. [8] propose a method called THRIFT that extends feature extraction techniques applied to 2D images, like SIFT, and propose a 3D feature descriptor that successfully identifies keypoints in range data.

Recently, convolutional neural networks have been used in general for feature extraction in both 2D and 3D domains. The most recent work that employs CNNs to extract global features from raw 3D point clouds is PointNet [22]. The PointNet architecture uses a stack of 2D convolutional layers for feature transformation and ensures invariance to permutations and geometric transformations, and it also considers the interaction among points using a localized convolution operation. PointNet outperformed all the existing methods used for classification of 3D points, which either required conversion to other irreversible representations [23, 38, 21] or used raw 3D point clouds [18].

In this paper, we take motivation from PointNet [22] and extend our graph convolutional network to be more robust using global features. So, instead of taking the point coordinates (x_i, y_i, z_i) as input feature vectors [37], we use 2D convolutional layers to output a global feature vector {x_i^(1), x_i^(2), …, x_i^(D)} ∈ ℝ^{N×D}, where D represents the number of features per point.

Using the global feature extraction with the graph convolutional network speeds up the training process and increases the overall performance of our network, as demonstrated in Sections 4 and 5.

Graph Convolutional Network (GCN). The GCN takes as input the feature vector {x_i^(1), x_i^(2), …, x_i^(D)} ∈ ℝ^{N×D}, where D corresponds to the number of features, along with the weighted graph signals W ∈ ℝ^{N×N}, and its goal is to learn a set of K trainable graph-filter coefficients. Moreover, a GCN learns a mapping function that can translate the input graph signals to capture the local features corresponding to the relative position of points in 3D space. So, a GCN can be written as a non-linear function σ of the input graph signals W^(l) and X^(l), where l corresponds to the activations of the l-th layer:

    f(X^{(l)}, W) = \sigma\left(\theta^{(l)} X^{(l)} W\right)    (6)

where the learnable parameter θ is of order K. The mapping function in Equation 6 contains an unnormalized graph representation W; because the range of values can vary for heterogeneous graphs, the unnormalized GCN cannot generalize well on graphs that lie in different spectral domains [33]. In order to overcome this problem, the input graph signal is normalized in such a way that all the rows of W sum to one [15]. In our proposed methodology, we have used a graph Laplacian L = I − D^{−1/2} W D^{−1/2} with the diagonal matrix D such that D_{ii} = Σ_j W_{ij} for symmetric normalization:

    f(X^{(l)}, W) = \sigma\left(\theta^{(l)} \hat{D}^{-1/2} \hat{W} \hat{D}^{-1/2} X^{(l)}\right)    (7)

where Ŵ = W + I, and I is the identity matrix. Furthermore, using the Laplacian normalization, the eigenvalues of L lie in the range [−1, 1].

In order to obtain the local features at each layer l, we use the Chebyshev polynomials of Equation 4 and take as input the global feature vector {x_i^(1), x_i^(2), …, x_i^(D)} for the first layer. Furthermore, in order to define a single graph convolution operation between the input feature vector x_i and a graph signal g, we use the inverse graph Fourier transform [33]:

    x \ast_G g = U\left(U^T x \odot U^T g\right)    (8)

where U is the matrix of eigenvectors and ⊙ represents the pointwise product of the inverse graph Fourier transforms of x and g, i.e. U^T x and U^T g.

In our proposed architecture, we have used the Chebyshev graph filtering representation given by Equation 4, with a K-neighbourhood at each point, to learn the localized feature maps with three layers of graph convolutions.

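A minimal numpy sketch of the symmetrically normalized propagation rule of Equation 7 is given below, with the renormalized adjacency Ŵ = W + I. The ReLU activation follows the architecture description (Figure 2); the placement of the weight matrix on the feature dimension and its shape are illustrative choices, not the authors' exact implementation.

```python
import numpy as np


def gcn_layer(X, W, Theta):
    """One graph convolution, f(X, W) = ReLU(D^-1/2 (W + I) D^-1/2 X Theta),
    following the symmetric normalization of Equation 7.

    X     : (N, D_in) node features.
    W     : (N, N) weighted adjacency matrix from Sec. 3.1.
    Theta : (D_in, D_out) learnable filter weights.
    """
    W_hat = W + np.eye(W.shape[0])                        # add self-connections
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W_hat.sum(axis=1)))
    W_norm = D_inv_sqrt @ W_hat @ D_inv_sqrt              # D^-1/2 (W + I) D^-1/2
    return np.maximum(W_norm @ X @ Theta, 0.0)            # ReLU activation
```
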
3.3. Training

The architecture is trained using the Adam optimizer with a learning rate that starts at 1 × 10⁻³ and is halved after every 20 epochs, but always stays in the range [1 × 10⁻⁷, 1 × 10⁻³]. We have used a batch size of 16 and dropout regularization of 0.8 for GCN layers and 0.4 for fully connected layers to prevent overfitting. Our network uses four 2D convolutional layers with kernel sizes [64, 64, 128, 1024], respectively. Furthermore, to avoid additional complexity in our model, we have used a weight decay of magnitude 2 × 10⁻⁴.

The speed and stability of the GCN depend heavily on the order K of the Chebyshev polynomial in Equation 4. The model performs optimally at K = 1; as we increase the order K, the size of T_i(L) increases, which diminishes the speed and increases the time required to train the network.

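To make the schedule above concrete, the following small sketch reproduces the stated learning-rate rule (start at 1 × 10⁻³, halve after every 20 epochs, never below 1 × 10⁻⁷). The stand-alone function is illustrative; in practice it would be attached to the Adam optimizer together with the weight decay of 2 × 10⁻⁴ mentioned above.

```python
def learning_rate(epoch, base_lr=1e-3, floor=1e-7, halve_every=20):
    """Learning-rate schedule from Sec. 3.3: halve base_lr every 20 epochs,
    clamped to the range [floor, base_lr]."""
    return max(base_lr * (0.5 ** (epoch // halve_every)), floor)


# epochs 0, 40 and 400 -> 0.001, 0.00025, 1e-07 (floored)
print(learning_rate(0), learning_rate(40), learning_rate(400))
```
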
4. Performance Measures

We have evaluated our architecture on a variety of benchmark datasets, including S3DIS, containing indoor 3D scenes [1], ShapeNet part segmentation [35] and the Semantic3D benchmark dataset [10]. Our methodology outperforms the existing architectures on all the benchmarked datasets, and most of the performance gain is due to encoding the local spatial features of the 3D point cloud inside a graph model.

Table 1. Results of semantic scene parsing on the Stanford 3D dataset. The mIOU is calculated as an average over the IOUs of all 13 classes containing indoor structural objects.

  Method            mean IOU   mean Accuracy
  PointNet [22]     47.71      48.98
  SEGCloud [29]     48.92      57.35
  Ours (GCN Only)   47.22      56.44
  Ours (FGCN)       52.17      63.22

Table 2. Results on ShapeNet part segmentation. The metric is mIOU, similar to the one used by PointNet [22]. We have compared our architecture with existing architectures on ShapeNet part segmentation; our network achieves slightly better results than the state of the art.

  Method             class average mIOU
  SSCNN [36]         82.0
  Kd-net [16]        77.4
  PointNet [22]      80.4
  PointNet++ [24]    81.9
  SpiderCNN [34]     82.4
  SPLATNet_3D [27]   82.0
  PointConv [32]     82.8
  Ours (GCN Only)    78.2
  Ours (FGCN)        83.1

4.1. Semantic Scene Parsing

In our first experiment, we have used the Stanford 3D dataset [1], which contains 3D scans from 6 different areas and 271 rooms, collectively acquired using an individual Matterport scanner. The dataset contains 13 classes, so each point can be assigned 1 out of 13 semantic labels.

In order to split the data into training and testing sets, we have used the same method and statistics as used by PointNet [22]. We first divide the areas into rooms and then split the points in each room using 1m by 1m blocks. Furthermore, each point contains a 9-dimensional vector containing XYZ coordinates, RGB color channels and a normal or an equirectangular projection per room.

We train our model using a point size N of 4096 per training example and a batch size of 16, where each point contains only the XYZ coordinates. The comparison between our architecture and existing architectures on the S3DIS dataset is shown in Table 1, and the results can be visualized in Figure 3. Our methodology outperforms the existing architectures by a significant margin.

4.2. ShapeNet Part Segmentation

ShapeNet [35] provides a large-scale repository that contains richly annotated 3D shapes. The ShapeNet part dataset from [35] contains 16,881 3D shapes from 16 different categories, labelled with 50 parts in total. In object’s part seg-

Citations
Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, the authors proposed a gradual receptive field component reasoning (RFCR) method, where target Receptive Field Component Codes (RFCCs) are designed to record categories within receptive fields for hidden units in the encoder.
Abstract: Hidden features in neural network usually fail to learn informative representation for 3D segmentation as supervisions are only given on output prediction, while this can be solved by omni-scale supervision on intermediate layers. In this paper, we bring the first omni-scale supervision method to point cloud segmentation via the proposed gradual Receptive Field Component Reasoning (RFCR), where target Receptive Field Component Codes (RFCCs) are designed to record categories within receptive fields for hidden units in the encoder. Then, target RFCCs will supervise the decoder to gradually infer the RFCCs in a coarse-to-fine categories reasoning manner, and finally obtain the semantic labels. Because many hidden features are inactive with tiny magnitude and make minor contributions to RFCC prediction, we propose a Feature Densification with a centrifugal potential to obtain more unambiguous features, and it is in effect equivalent to entropy regularization over features. More active features can further unleash the potential of our omni-supervision method. We embed our method into four prevailing backbones and test on three challenging benchmarks. Our method can significantly improve the backbones in all three datasets. Specifically, our method brings new state-of-the-art performances for S3DIS as well as Semantic3D and ranks the 1st in the ScanNet benchmark among all the point-based methods. Code is publicly available at https://github.com/azuki-miho/RFCR.

34 citations

Journal ArticleDOI
TL;DR: The association mechanism and the subsequent information transfer are presented, which are cornerstones for multi-modal scene analysis and facilitate to train machine learning algorithms and to semantically segment any of these data representations.
Abstract: The automatic semantic segmentation of the huge amount of acquired remote sensing data has become an important task in the last decade. Images and Point Clouds (PCs) are fundamental data representations, particularly in urban mapping applications. Textured 3D meshes integrate both data representations geometrically by wiring the PC and texturing the surface elements with available imagery. We present a mesh-centered holistic geometry-driven methodology that explicitly integrates entities of imagery, PC and mesh. Due to its integrative character, we choose the mesh as the core representation that also helps to solve the visibility problem for points in imagery. Utilizing the proposed multi-modal fusion as the backbone and considering the established entity relationships, we enable the sharing of information across the modalities imagery, PC and mesh in a twofold manner: (i) feature transfer and (ii) label transfer. By these means, we achieve to enrich feature vectors to multi-modal feature vectors for each representation. Concurrently, we achieve to label all representations consistently while reducing the manual label effort to a single representation. Consequently, we facilitate to train machine learning algorithms and to semantically segment any of these data representations – both in a multi-modal and single-modal sense. The paper presents the association mechanism and the subsequent information transfer, which we believe are cornerstones for multi-modal scene analysis. Furthermore, we discuss the preconditions and limitations of the presented approach in detail. We demonstrate the effectiveness of our methodology on the ISPRS 3D semantic labeling contest (Vaihingen 3D) and a proprietary data set (Hessigheim 3D).

8 citations


Cites methods from "FGCN: Deep Feature-Based Graph Conv..."

  • ...Ali Khan et al. (2020) transform PCs to an undirected symmetrically weighted graph encoding the spatial neighborhood and apply a Graph Convolutional Network....


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed an automated scan-to-BIM method considering both the geometry and material of building objects, which can contribute to the improvement of the as-built BIM model usability.
Abstract: Conventional scan to building information modeling (BIM) automation mainly deals with geometry. However, one of its limitations is the time it takes and the costs in generating material. Therefore, this study proposes an automated scan-to-BIM method considering both the geometry and material of building objects. It recognizes the geometry from a point cloud and the material from panorama images through deep learning–based semantic segmentation. The two extracted pieces of data are merged, and the BIM objects with material are automatically generated by using Dynamo. Here, the object–space relationships were applied to increase the accuracy of the material data to be included in the BIM object. As the result, the accuracy was improved by 48.66% compared with before the application. The proposed method can contribute to the improvement of the as-built BIM model usability because it can automatically generate a BIM model by reflecting the material, as well as the geometry of the existing building.
Proceedings ArticleDOI
26 Jul 2022
TL;DR: Experimental results show that the proposed upsampling algorithm is superior to the widely applied traditional interpolation algorithms when used for point cloud semantic segmentation.
Abstract: The point cloud semantic segmentation network based on point-wise multi-layer perceptron (MLP) has been widely applied with its end-to-end advantages. Normally, such networks use the traditional upsampling algorithm to recover the details of point clouds in the decoding stage. However, the point cloud has rich 3D geometric information. The traditional interpolation algorithm does not consider the geometric correlation in the process of recovering the details of the point cloud, resulting in the inaccurate output point features. To this end, a learnable upsampling algorithm is proposed in this paper. This upsampling algorithm is implemented by utilizing moving least squares (MLS) and radial basis function (RBF), which can fully exploit the local geometric features of point clouds and accurately restore the details of scenarios. The validity of the proposed upsampling operator is verified on the Semantic3D dataset. Experimental results show that the proposed upsampling algorithm is superior to the widely applied traditional interpolation algorithms when used for point cloud semantic segmentation.
Posted Content
TL;DR: In this article, the authors proposed a gradual receptive field component reasoning (RFCR) method, where target Receptive field component codes (RFCCs) are designed to record categories within receptive fields for hidden units in the encoder.
Abstract: Hidden features in neural network usually fail to learn informative representation for 3D segmentation as supervisions are only given on output prediction, while this can be solved by omni-scale supervision on intermediate layers. In this paper, we bring the first omni-scale supervision method to point cloud segmentation via the proposed gradual Receptive Field Component Reasoning (RFCR), where target Receptive Field Component Codes (RFCCs) are designed to record categories within receptive fields for hidden units in the encoder. Then, target RFCCs will supervise the decoder to gradually infer the RFCCs in a coarse-to-fine categories reasoning manner, and finally obtain the semantic labels. Because many hidden features are inactive with tiny magnitude and make minor contributions to RFCC prediction, we propose a Feature Densification with a centrifugal potential to obtain more unambiguous features, and it is in effect equivalent to entropy regularization over features. More active features can further unleash the potential of our omni-supervision method. We embed our method into four prevailing backbones and test on three challenging benchmarks. Our method can significantly improve the backbones in all three datasets. Specifically, our method brings new state-of-the-art performances for S3DIS as well as Semantic3D and ranks the 1st in the ScanNet benchmark among all the point-based methods. Code will be publicly available at this https URL.
References
Posted Content
TL;DR: A scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs which outperforms related methods by a significant margin.
Abstract: We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.

15,696 citations


"FGCN: Deep Feature-Based Graph Conv..." refers background or methods in this paper

  • ...Recently, many approaches [6, 15, 37] approximate the spectral convolution using Chebyshev polynomials, because transforming the signal back and forth between spectral domains can be expensive....


  • ...Given the undirected graph with adjacency matrix W ∈ ℝ^{N×N}, we apply graph filtering techniques [15, 37] using the normalized Laplacian matrix L = I_n − D^{−1/2} W D^{−1/2}, where D corresponds to the diagonal matrix in which D_{ii} = Σ_j W_{i,j}....


Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper designs a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input and provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing.
Abstract: Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.

9,457 citations


"FGCN: Deep Feature-Based Graph Conv..." refers background or methods in this paper

  • ...In this paper, we take motivation from PointNet [22] and extend our graph convolutional network to be more robust using global features....


  • ...Results on ShapeNet part segmentation: The metric is mIOU similar to the one used by PointNet [22]....


  • ...PointNet [22] is the pioneer work that applies deep learning on raw 3D point clouds with significant improvements in performance....


  • ...In order to split the data into training and testing sets, we have used the same method and statistics as used by PointNet [22]....


  • ...For instance, there has been many attempts to extend the traditional CNNs [18, 22, 24, 27], that are best fit for data that lie in a structured Euclidean space to 3D Figure 1....


Posted Content
TL;DR: A hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set and proposes novel set learning layers to adaptively combine features from multiple scales to learn deep point set features efficiently and robustly.
Abstract: Few prior works study deep learning on point sets. PointNet by Qi et al. is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. With further observation that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network called PointNet++ is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.

4,802 citations


"FGCN: Deep Feature-Based Graph Conv..." refers background in this paper

  • ...This problem has been addressed through careful engineering of CNNs [20, 31]....


  • ...Deep Learning on Graphs or spectral CNNs were first introduced by [4] and extended by [6]....


  • ...The local structure is exploited by PointNet++ [24], which is an extension of PointNet....


  • ...Defferrard et al. [6] proposed a generalized formulation of CNNs for spectral graphs....


  • ...Directly processing 3D point clouds using convolutional neural networks (CNNs) is a highly challenging task primarily due to the lack of explicit neighborhood relationship between points in 3D space....


Journal ArticleDOI
TL;DR: This article provides a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields and proposes a new taxonomy to divide the state-of-the-art GNNs into four categories, namely, recurrent GNNS, convolutional GNN’s, graph autoencoders, and spatial–temporal Gnns.
Abstract: Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications, where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on the existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this article, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art GNNs into four categories, namely, recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial–temporal GNNs. We further discuss the applications of GNNs across various domains and summarize the open-source codes, benchmark data sets, and model evaluation of GNNs. Finally, we propose potential research directions in this rapidly growing field.

4,584 citations

Posted Content
TL;DR: In this article, a spectral graph theory formulation of convolutional neural networks (CNNs) was proposed to learn local, stationary, and compositional features on graphs, and the proposed technique offers the same linear computational complexity and constant learning complexity as classical CNNs while being universal to any graph structure.
Abstract: In this work, we are interested in generalizing convolutional neural networks (CNNs) from low-dimensional regular grids, where image, video and speech are represented, to high-dimensional irregular domains, such as social networks, brain connectomes or words' embedding, represented by graphs. We present a formulation of CNNs in the context of spectral graph theory, which provides the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs. Importantly, the proposed technique offers the same linear computational complexity and constant learning complexity as classical CNNs, while being universal to any graph structure. Experiments on MNIST and 20NEWS demonstrate the ability of this novel deep learning system to learn local, stationary, and compositional features on graphs.

4,562 citations

Frequently Asked Questions (18)
Q1. What are the contributions in "FGCN: Deep Feature-Based Graph Convolutional Network for Semantic Segmentation of Urban 3D Point Clouds"?

In this paper, the authors have introduced a more stable and effective end-to-end architecture to classify raw 3D point clouds from indoor and outdoor scenes. In the proposed methodology, the authors encode the spatial arrangement of neighbouring 3D points inside an undirected symmetrical graph, which is passed along with features extracted from a 2D CNN to a Graph Convolutional Network (GCN) that contains three layers of localized graph convolutions to generate a complete segmentation map.

SPLATNet [27], sparse lattice networks, used bilateral convolutions as building blocks to apply 3D convolution only on the occupied parts of the lattice that reduces memory and computational cost. 

PointNet architecture uses a stack of 2D convolutional layers for feature transformation and ensures invariance to permutations, geometric transformations and also considers the interaction among points using a localized convolution operation. 

Flint et al. [8] propose a method called THRIFT that extends the feature extraction techniques applied to 2D images like SIFT and propose a 3D feature descriptor that successfully identifies keypoints in range data. 

In order to evaluate their model on the ShapeNet part dataset, the authors pre-compute the graph filters using Chebyshev polynomials (Equation 4) and train their model on each of the 16 object categories.

Instead of taking the point coordinates (x_i, y_i, z_i) as input feature vectors [37], the authors use 2D convolutional layers to output a global feature vector {x_i^(1), x_i^(2), …, x_i^(D)} ∈ ℝ^{N×D}, where D represents the number of features per point.

Many approaches utilize 3D shapes to apply deep learning; for example, Volumetric CNNs [23, 38, 21] are the pioneering works that apply 3D convolutions on voxelized shapes.

PointNet outperformed all the existing methods used for classification of 3D points which either required conversion to other irreversible representations [23, 38, 21] or used raw 3D point clouds [18]. 

On the other hand, their final architecture uses both global features (that also provides geometric invariance [22]) and local point features and thus has a relatively faster convergence rate and is more stable towards the unstructured nature of 3D point clouds. 

The authors have evaluated their architecture on a variety of benchmark datasets, including S3DIS, containing indoor 3D scenes [1], ShapeNet part segmentation [35] and the Semantic3D benchmark dataset [10].

Following are the main contributions proposed in this work: a novel graph-based convolutional network that uses both local and global features for semantic segmentation of 3D point clouds; …

In this work, the authors have shown the importance of using local features and how using the spatial position of points can increase the overall performance of the segmentation task when it comes to identifying objects in 3D scenes. 

The interest is towards consuming the point clouds directly [22, 24, 32, 27], but many of these architectures try hard to improve the local feature extractor by applying convolution directly to the unstructured point cloud.

For instance, there have been many attempts to extend the traditional CNNs [18, 22, 24, 27], which are best suited to data that lie in a structured Euclidean space, to 3D point clouds.

Let's restate the graph mapping function f(x) with input x as a linear graph filter transformation function with coefficients µ_1, µ_2, …, µ_n: f(x) = g_µ(L)x = Σ_{i=0}^{K} µ_i L^i x (2). The mapping function f(x) can also be approximated using the eigendecomposition form of the normalized Laplacian matrix with eigenvalues Λ: f(x) = g_µ(L)x = U g_µ(Λ) U^T x (3). Spectral-based graph filtering methods [12, 7, 26] also use Chebyshev polynomials to approximate graph filters; ChebyNet [7] uses the diagonal matrix of eigenvalues, as in Equation 4.

their final architecture reforms the raw 3D point cloud to a vector of high dimensional features before passing it on to the graph convolutional network. 

In addition to their local feature encoder or GCN, the authors have used a global feature extractor similar to [22], that extracts a vector of high dimensional features by taking the raw point cloud as input. 

Although the proposed network achieves better results in terms of accuracy, it requires a larger memory footprint compared to the existing architectures.