FGCN: Deep Feature-Based Graph Convolutional Network for Semantic Segmentation of Urban 3D Point Clouds
Summary (3 min read)
1. Introduction
- With the recent success of convolutional neural network (CNN) architectures on 2D structured data, there is growing interest in developing similar architectures that directly process 3D point clouds.
- Furthermore, many approaches [18, 29, 23] transform 3D datasets into regular structures such as voxels and meshes in order to apply convolution, but these regular structures lose most of the spatial information between neighbouring points and thus struggle to obtain the local feature representations that improve overall classification results [33].
- They have provided evidence of the possible generalizations of CNNs to signals in other domains without taking 3D translational factors into account.
- Therefore, their proposed architecture learns the complete local structure embedded in the graph to achieve faster convergence and better classification results.
- For reference, Figure 1 provides the visualization of two different outdoor scenes.
3. Proposed Methodology
- In their proposed methodology, the authors extend traditional graph-based convolutions [26, 37], which operate on latent graph signals to output a global signature that is then used for classification.
- Most of these architectures overlook the underlying spatial information between points in 3D space, which plays a crucial role in identifying objects.
- Keeping in mind the importance of local features, the authors propose a unified architecture that jointly uses both local and global features, giving a more stable and reliable network for semantic segmentation of 3D point clouds.
- Using the global feature extractor before the graph convolutional network summarizes most of the information and provides geometric invariance [22], which in turn increases the overall performance of their network.
- In the following sections, the authors will explain the key components of their proposed architecture and will provide evidence as to how using both local and global features can give better results.
3.1. Transforming 3D Point Sets to Weighted Graph
- Using the normalized Laplacian with a full eigendecomposition has a high computational cost compared to ChebyNet [7].
- Furthermore, Defferrard et al. [6] demonstrated the effectiveness of the Chebyshev graph-filtering approximation (graph convolution) on homogeneous graphs for tasks such as image classification and 2D scene understanding.
- The authors adopt an approach similar to [7], using Chebyshev polynomials as the graph filtering method, but apply the convolution on heterogeneous graphs with global features (extracted from 2D convolutional layers) as input.
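To make the graph construction concrete, here is a minimal NumPy sketch of one common way to turn a point set into a symmetrically weighted graph and its normalized Laplacian. The Gaussian kNN kernel and its parameters are illustrative assumptions, not necessarily the paper's exact construction.

```python
import numpy as np

def knn_weighted_graph(points, k=8, sigma2=1.0):
    """Symmetric weighted adjacency from 3D points.

    Assumption: Gaussian kernel on Euclidean distance over k nearest
    neighbours -- a common point-cloud graph, not necessarily the
    paper's exact kernel.
    """
    n = points.shape[0]
    # Pairwise squared Euclidean distances.
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(d2[i])[1:k + 1]       # skip the point itself
        W[i, idx] = np.exp(-d2[i, idx] / sigma2)
    return np.maximum(W, W.T)                   # symmetrize

def normalized_laplacian(W):
    """L = I_n - D^{-1/2} W D^{-1/2}; its eigenvalues lie in [0, 2]."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

pts = np.random.default_rng(0).random((64, 3))
L = normalized_laplacian(knn_weighted_graph(pts))
```

The Laplacian L produced here is the operator the Chebyshev filters would act on after rescaling.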
3.2. Model Architecture
- The architecture diagram can be visualized in Figure 2.
- 3D Feature Extraction: Many techniques have been developed to obtain global feature descriptors for 3D point sets [13, 22, 14, 8].
- Johnson et al. [14] developed a method to extract local feature descriptors from 3D point sets called spin images.
- The most recent work that employs CNNs to extract global features from raw 3D point clouds is PointNet [22].
- Using the global feature extraction with graph convolutional network speeds up the training process and increases the overall performance of their network which is demonstrated in Sections 4 and 5.
- Furthermore, after the Chebyshev-style rescaling of the normalized Laplacian, its eigenvalues lie in the range [−1, 1] (the normalized Laplacian itself has eigenvalues in [0, 2]).
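The [−1, 1] range corresponds to the standard ChebyNet-style rescaling L̃ = 2L/λ_max − I_n, since the normalized Laplacian itself has spectrum in [0, 2]. A quick numerical check on a toy graph:

```python
import numpy as np

# Toy example: normalized Laplacian of a 4-node path graph.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = W.sum(axis=1)
L = np.eye(4) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]

lam_max = np.linalg.eigvalsh(L).max()        # at most 2
L_tilde = (2.0 / lam_max) * L - np.eye(4)    # ChebyNet-style rescaling

ev = np.linalg.eigvalsh(L_tilde)
print(round(ev.min(), 6), round(ev.max(), 6))  # -1.0 1.0
```

The smallest eigenvalue of a (connected-graph) normalized Laplacian is 0, so it maps to exactly −1, and λ_max maps to 1.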
3.3. Training
- The authors used a batch size of 16, with dropout regularization of 0.8 for the GCN layers and 0.4 for the fully connected layers to prevent overfitting.
- The model performs best at K = 1; as the order K increases, the cost of computing Ti(L) grows, which slows training.
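The cost growth with K follows directly from the Chebyshev recurrence T_0(L̃)x = x, T_1(L̃)x = L̃x, T_i(L̃)x = 2L̃T_{i−1}(L̃)x − T_{i−2}(L̃)x: each extra order adds one more matrix product. A minimal sketch (illustrative, not the authors' implementation):

```python
import numpy as np

def chebyshev_filter(L_tilde, x, mu):
    """f(x) = sum_{i=0}^{K} mu_i T_i(L~) x via the three-term recurrence.

    Each order beyond i = 1 costs one additional matrix product with
    L_tilde, which is why training slows as K grows.
    """
    K = len(mu) - 1
    t_prev, t_curr = x, L_tilde @ x
    out = mu[0] * t_prev
    if K >= 1:
        out = out + mu[1] * t_curr
    for i in range(2, K + 1):
        t_prev, t_curr = t_curr, 2 * (L_tilde @ t_curr) - t_prev
        out = out + mu[i] * t_curr
    return out

# Tiny demo: with mu = [1, 1], f(x) = x + L~ x.
y = chebyshev_filter(np.eye(3) * 0.5, np.ones((3, 1)), [1.0, 1.0])
print(y.ravel())  # [1.5 1.5 1.5]
```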
4. Performance Measures
- The authors have evaluated their architecture on a variety of benchmark datasets, including S3DIS, which contains indoor 3D scenes [1], the ShapeNet part segmentation dataset [35], and the Semantic3D benchmark dataset [10].
- The authors' methodology outperforms existing architectures on all the benchmark datasets, and most of the performance gain comes from encoding the local spatial features of the 3D point cloud in a graph model.
4.1. Semantic Scene Parsing
- In their first experiment, the authors used the Stanford 3D dataset (S3DIS) [1], which contains 3D scans of 271 rooms across 6 different areas, acquired using a Matterport scanner.
- The authors first divide the areas into rooms and then split the points in each room into 1m by 1m blocks.
- Furthermore, each point is represented by a 9-dimensional vector containing the XYZ coordinates, the RGB color channels, and the point's normalized location within the room.
- The authors train their model using a point size N of 4096 per training example and a batch size of 16, where each point contains only the XYZ coordinates.
- The comparison between their architecture and existing architectures on the S3DIS dataset is shown in Table 1, and qualitative results are visualized in Figure 3.
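The room-to-block preprocessing described above can be sketched as follows; the XY-plane block assignment and sampling with replacement for small blocks are assumptions about details the summary does not spell out:

```python
import numpy as np

def split_into_blocks(points, block=1.0, n_sample=4096, seed=0):
    """Partition a room's points into block x block columns in the XY
    plane, then draw a fixed number of points per non-empty block
    (with replacement when a block holds fewer than n_sample points),
    mirroring the 1m x 1m split with N = 4096 points per example.
    """
    rng = np.random.default_rng(seed)
    keys = np.floor(points[:, :2] / block).astype(int)  # block index per point
    blocks = {}
    for n, key in enumerate(map(tuple, keys)):
        blocks.setdefault(key, []).append(n)
    out = []
    for idx in blocks.values():
        idx = np.asarray(idx)
        pick = rng.choice(idx, size=n_sample, replace=len(idx) < n_sample)
        out.append(points[pick])
    return out
```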
4.2. ShapeNet Part Segmentation
- ShapeNet [35] provides a large-scale repository that contains richly annotated 3D shapes.
- The ShapeNet part dataset from [35] contains 16,881 3D shapes from 16 different categories, labelled with 50 parts in total.
- Furthermore, for a fair comparison the authors have used the same evaluation metric as used by PointNet [22].
- The authors compute the intersection-over-union (IoU) for each object category and then compute the mIoU by averaging the per-category IoUs.
- The authors compared their methodology with existing architectures that directly consume raw 3D point clouds and achieved a class average of 83.1, which is on par with the state of the art.
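The metric can be sketched as below; this is the standard per-category IoU definition, with the handling of empty categories as an assumption (the text states only that the protocol follows PointNet):

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Per-class intersection-over-union and their unweighted mean (mIoU)."""
    ious = []
    for c in range(n_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        # Assumed convention: a class absent from both pred and gt scores 1.
        ious.append(1.0 if union == 0 else inter / union)
    return ious, float(np.mean(ious))

pred = np.array([0, 0, 1, 1])
gt = np.array([0, 1, 1, 1])
ious, miou = mean_iou(pred, gt, 2)  # ious approx. [0.5, 0.667]
```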
4.3. Semantic3D Benchmark
- There is a long tradition of benchmark evaluation in the geospatial domain, particularly through ISPRS.
- One example is the ISPRS-EuroSDR benchmark on High Density Aerial Image Matching, which evaluates dense matching algorithms [9, 5] on aerial imagery.
- The authors have used the Semantic3D benchmark dataset [10] for evaluating their architecture.
- It contains nearly 4 billion points collected in 30 terrestrial laser scans across Central Europe, depicting European architecture in most of its scenes.
- Additionally, Semantic3D [10] benchmark proposed a baseline 3D-CNN architecture for 3D point cloud classification that takes as input 3D voxel-grids per scan point at 5 different resolutions.
5. Architecture Design Goals
- The authors evaluate the performance of their architecture with respect to speed and stability using S3DIS [1] dataset.
- The authors also show the effect of local feature extraction and how adding global features gives their network its best performance.
- Consider Figure 4, which shows fluctuations in test loss during training on the S3DIS dataset [1] caused by sensitivity to the initial weights.
- On the other hand, their final architecture uses both global features (which also provide geometric invariance [22]) and local point features, and thus converges faster and is more robust to the unstructured nature of 3D point clouds.
- This adds to the overall stability and reliability of their model across different scenes with objects of varying geometries.
6. Conclusion
- The authors have presented FGCN, a novel feature based graph convolutional network for semantic segmentation of 3D point clouds.
- The authors have shown the importance of using local features and how using the spatial position of points can increase the overall performance of the segmentation task when it comes to identifying objects in 3D scenes.
- In addition to increased performance, the proposed architecture is invariant to geometric distortions and preserves the local structures of objects using the graph models.
- Although the proposed network achieves better accuracy, it has a larger memory footprint than existing architectures.
Frequently Asked Questions (18)
Q2. What is the common method of applying convolution on point clouds?
SPLATNet [27] (sparse lattice networks) uses bilateral convolutions as building blocks to apply 3D convolution only on the occupied parts of the lattice, which reduces memory and computational cost.
Q3. What is the main idea of the graph convolutional network?
PointNet architecture uses a stack of 2D convolutional layers for feature transformation and ensures invariance to permutations, geometric transformations and also considers the interaction among points using a localized convolution operation.
Q4. What is the main idea of Flint et al.?
Flint et al. [8] propose a method called THRIFT that extends the feature extraction techniques applied to 2D images like SIFT and propose a 3D feature descriptor that successfully identifies keypoints in range data.
Q5. How do the authors evaluate the model on ShapeNet part dataset?
In order to evaluate their model on the ShapeNet part dataset, the authors pre-compute the graph filters using Chebyshev polynomials (Eq. 4) and train their model on each of the 16 object categories.
Q6. What is the simplest way to extract local features from 3D point sets?
Instead of taking the point coordinates (x_i, y_i, z_i) as input feature vectors [37], the authors use 2D convolutional layers to output a global feature matrix {x_i^(1), x_i^(2), …, x_i^(D)} ∈ R^(N×D), where D represents the number of features per point.
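As a minimal sketch of such a per-point feature lift: a 1×1 convolution over an N×3 point set is just a shared linear map plus a nonlinearity applied to every point, producing the N×D feature matrix. The weights and dimensions here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1024, 64                        # hypothetical sizes
points = rng.random((N, 3))            # raw XYZ coordinates
W1 = rng.standard_normal((3, D)) * 0.1 # shared weights (illustrative)
b1 = np.zeros(D)

# Shared per-point MLP (equivalent to a 1x1 conv) with ReLU.
features = np.maximum(points @ W1 + b1, 0.0)
print(features.shape)  # (1024, 64)
```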
Q7. What is the way to apply deep learning to 3D point clouds?
Many approaches utilize 3D shapes to apply deep learning; for example, Volumetric CNNs [23, 38, 21] are pioneering works that apply 3D convolutions on voxelized shapes.
Q8. What is the method for obtaining local feature descriptors from 3D point sets?
PointNet outperformed all the existing methods used for classification of 3D points which either required conversion to other irreversible representations [23, 38, 21] or used raw 3D point clouds [18].
Q9. What is the effect of local feature extraction on the performance of the architecture?
On the other hand, their final architecture uses both global features (which also provide geometric invariance [22]) and local point features, and thus converges faster and is more robust to the unstructured nature of 3D point clouds.
Q10. How many benchmark datasets have the authors evaluated?
The authors have evaluated their architecture on a variety of benchmark datasets, including S3DIS, which contains indoor 3D scenes [1], the ShapeNet part segmentation dataset [35], and the Semantic3D benchmark dataset [10].
Q11. What is the main contribution to the proposed graph convolutional network?
The following are the main contributions proposed in this work: a novel graph-based convolutional network that uses both local and global features for semantic segmentation of 3D point clouds; …
Q12. What is the importance of using local features?
In this work, the authors have shown the importance of using local features and how using the spatial position of points can increase the overall performance of the segmentation task when it comes to identifying objects in 3D scenes.
Q13. What is the main reason why many architectures are trying to improve the local feature extractor?
The interest is towards consuming point clouds directly [22, 24, 32, 27], but many of these architectures try hard to improve the local feature extractor by applying convolution directly to the unstructured point cloud.
Q14. What is the way to extend the traditional CNNs?
For instance, there have been many attempts to extend traditional CNNs [18, 22, 24, 27], which are best suited for data lying in a structured Euclidean space, to 3D point clouds.
Q15. What is the simplest way to approximate graph filters?
Let's restate their graph mapping function f(x) with input x as a linear graph filter transformation with coefficients µ_0, µ_1, …, µ_K:
f(x) = g_µ(L)x = Σ_{i=0}^{K} µ_i L^i x    (2)
The mapping function f(x) can also be approximated using the eigendecomposition form of the normalized Laplacian matrix with eigenvalues Λ:
f(x) = g_µ(L)x = U g_µ(Λ) U^T x    (3)
Spectral graph filtering methods [12, 7, 26] also use Chebyshev polynomials to approximate graph filters.
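The agreement between the polynomial form (Eq. 2) and the spectral form (Eq. 3) can be checked numerically; the matrix below is an arbitrary symmetric stand-in for L:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
L = (A + A.T) / 2                      # symmetric stand-in for the Laplacian
x = rng.standard_normal(6)
mu = [0.3, -1.2, 0.7]                  # filter coefficients mu_0..mu_2

# Eq. 2: polynomial in L applied to x.
poly = sum(m * np.linalg.matrix_power(L, i) @ x for i, m in enumerate(mu))

# Eq. 3: same filter evaluated on the eigenvalues, with L = U diag(lam) U^T.
lam, U = np.linalg.eigh(L)
g_lam = sum(m * lam ** i for i, m in enumerate(mu))
spectral = U @ (g_lam * (U.T @ x))

print(np.allclose(poly, spectral))  # True
```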
Q16. What is the effect of a feature based graph convolutional network?
Their final architecture transforms the raw 3D point cloud into a vector of high-dimensional features before passing it on to the graph convolutional network.
Q17. What is the way to extract the features from the point cloud?
In addition to their local feature encoder (the GCN), the authors use a global feature extractor similar to [22], which extracts a vector of high-dimensional features from the raw point cloud.
Q18. What is the way to use the proposed network?
Although the proposed network achieves better accuracy, it has a larger memory footprint than existing architectures.