3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions
Andy Zeng¹   Shuran Song¹   Matthias Nießner²   Matthew Fisher²,⁴   Jianxiong Xiao³   Thomas Funkhouser¹
¹Princeton University   ²Stanford University   ³AutoX   ⁴Adobe Systems
http://3dmatch.cs.princeton.edu
Abstract
Matching local geometric features on real-world depth images is a challenging task due to the noisy, low-resolution, and incomplete nature of 3D scan data. These difficulties limit the performance of current state-of-the-art methods, which are typically based on histograms over geometric properties. In this paper, we present 3DMatch, a data-driven model that learns a local volumetric patch descriptor for establishing correspondences between partial 3D data. To amass training data for our model, we propose a self-supervised feature learning method that leverages the millions of correspondence labels found in existing RGB-D reconstructions. Experiments show that our descriptor is not only able to match local geometry in new scenes for reconstruction, but also generalize to different tasks and spatial scales (e.g. instance-level object model alignment for the Amazon Picking Challenge, and mesh surface correspondence). Results show that 3DMatch consistently outperforms other state-of-the-art approaches by a significant margin. Code, data, benchmarks, and pre-trained models are available online at http://3dmatch.cs.princeton.edu.
1. Introduction
Matching 3D geometry has a long history starting in
the early days of computer graphics and vision. With the
rise of commodity range sensing technologies, this research
has become paramount to many applications including ob-
ject pose estimation, object retrieval, 3D reconstruction, and
camera localization.
However, matching local geometric features in low-resolution, noisy, and partial 3D data is still a challenging task, as shown in Fig. 1. While there is a wide range of low-level hand-crafted geometric feature descriptors that can be used for this task, they are mostly based on signatures derived from histograms over static geometric properties [18, 21, 27]. They work well for 3D models with complete surfaces, but are often unstable or inconsistent in real-world partial surfaces from 3D scanning data and difficult to adapt to new datasets. As a result, state-of-the-art 3D reconstruction methods using these descriptors for matching geometry require significant algorithmic effort to handle outliers and establish global correspondences [5].

Figure 1. In this work, we present a data-driven local descriptor 3DMatch that establishes correspondences (green) to match geometric features in noisy and partial 3D scanning data. This figure illustrates an example of bringing two RGB-D scans into alignment using 3DMatch on depth information only. Color images are for visualization only.
In response to these difficulties, and inspired by the re-
cent success of neural networks, we formulate a data-driven
method to learn a local geometric descriptor for establish-
ing correspondences between partial 3D data. The idea is
that by learning from example, data-driven models can suf-
ficiently address the difficulties of establishing correspon-
dences between partial surfaces in 3D scanning data. To this
end, we present a 3D convolutional neural network (Con-
vNet), called 3DMatch, that takes in the local volumetric
region (or 3D patch) around an arbitrary interest point on a
3D surface and computes a feature descriptor for that point,
where a smaller distance between two descriptors indicates
a higher likelihood of correspondence.
However, optimizing a 3D ConvNet-based descriptor for
this task requires massive amounts of training data (i.e.,
ground truth matches between local 3D patches).

Figure 2. Learning 3DMatch from reconstructions. From existing RGB-D reconstructions (a), we extract local 3D patches and correspondence labels from scans of different views (b). We collect pairs of matching and non-matching local 3D patches and convert them into a volumetric representation (c) to train a 3D ConvNet-based descriptor (d). This geometric descriptor can be used to establish correspondences for matching 3D geometry in various applications (e), such as reconstruction, model alignment, and surface correspondence.

Obtaining this training data with manual annotations is a challenging
endeavor. Unlike 2D image labels, which can be effectively
crowd-sourced or parsed from the web, acquiring ground
truth correspondences by manually clicking keypoint pairs
on 3D partial data is not only time consuming but also prone
to errors.
Our key idea is to amass training data by leveraging cor-
respondence labels found in existing RGB-D scene recon-
structions. Due to the importance of 3D reconstructions,
there has been much research on designing algorithms and
systems that can build high-fidelity reconstructions from
RGB-D data [24, 25, 8]. Although these reconstructions
have been used for high-level reasoning about the environ-
ment [38, 39], it is often overlooked that they can also serve
as a massive source of labeled correspondences between 3D
surfaces of aligned depth frames. By training on corre-
spondences from multiple existing RGB-D reconstruction
datasets, each with its own properties of sensor noise, oc-
clusion patterns, variance of geometric structures, and va-
riety of camera viewpoints, we can optimize 3DMatch to
generalize and robustly match local geometry in real-world
partial 3D data.
In this paper, we train 3DMatch over 8 million corre-
spondences from a collection of 62 RGB-D scene recon-
structions [36, 30, 39, 20, 15] and demonstrate its abil-
ity to match 3D data in several applications. Results
show that 3DMatch is considerably better than state-of-
the-art methods at matching keypoints, and outperforms
other algorithms for geometric registration when combined
with standard RANSAC. Furthermore, we demonstrate that
3DMatch can also generalize to different tasks and spatial
resolutions. For example, we utilize 3DMatch to obtain
instance-level model alignments for 6D object pose esti-
mation as well as to find surface correspondences in 3D
meshes. To facilitate further research in the area of 3D
keypoint matching and geometric registration, we provide
a correspondence matching benchmark as well as a surface
registration benchmark similar to [5], but with real-world
scan data.
2. Related Work
Learning local geometric descriptors for matching 3D
data lies at the intersection of computer vision and graph-
ics. We briefly review the related work in both domains.
Hand-crafted 3D Local Descriptors. Many geometric
descriptors have been proposed, including Spin Images [18], Geometry Histograms [12], Signatures of Histograms [34], and Feature Histograms [28]. Many of these descriptors are now available in the Point Cloud Library [3]. While
these methods have made significant progress, they still
struggle to handle noisy, low-resolution, and incomplete
real-world data from commodity range sensors. Further-
more, since they are manually designed for specific appli-
cations or 3D data types, it is often difficult for them to gen-
eralize to new data modalities. The goal of our work is to
provide a new local 3D descriptor that directly learns from
data to provide more robust and accurate geometric feature
matching results in a variety of settings.
Learned 2D Local Descriptors. The recent availability
of large-scale labeled image data has opened up new op-
portunities to use data-driven approaches for designing 2D
local image patch descriptors. For instance, various works
[32, 31, 40, 16, 41, 16] learn non-linear mappings from local
image patches to feature descriptors. Many of these prior
works are trained on data generated from multi-view stereo
datasets [4]. However, in addition to being limited to 2D
correspondences on images, multi-view stereo is difficult to scale up in practice and is prone to error from missing corre-
spondences on textureless or non-Lambertian surfaces, so it
is not suitable for learning a 3D surface descriptor. A more
recent work [29] uses RGB-D reconstructions to train a 2D
descriptor, while we train a 3D geometric descriptor.
Learned 3D Global Descriptors. There has also been
rapid progress in learning geometric representations on 3D
data. 3D ShapeNets [38] introduced 3D deep learning for modeling 3D shapes, and several recent works [22, 11, 33]
also compute deep features from 3D data for the task of ob-
ject retrieval and classification. While these works are in-
spiring, their focus is centered on extracting features from
complete 3D object models at a global level. In contrast,
our descriptor focuses on learning geometric features for
real-world RGB-D scanning data at a local level, to provide
more robustness when dealing with partial data suffering
from various occlusion patterns and viewpoint differences.
Learned 3D Local Descriptors. More closely related to
this work is Guo et al. [14], which uses a 2D ConvNet de-
scriptor to match local geometric features for mesh label-
ing. However, their approach operates only on synthetic
and complete 3D models, while using ConvNets over input
patches of concatenated feature vectors that do not have any
kind of spatial correlation. In contrast, our work not only
tackles the harder problem of matching real-world partial
3D data, but also properly leverages 3D ConvNets on volu-
metric data in a spatially coherent way.
Self-supervised Deep Learning. Recently, there has
been significant interest in learning powerful deep models
using automatically-obtained labels. For example, recent
works show that the temporal information from videos can
be used as a plentiful source of supervision to learn em-
beddings that are useful for various tasks [13, 26]. Other
works show that deep features learned from egomotion su-
pervision perform better than features using class-labels as
supervision for many tasks [2]. Analogous to these recent
works in self-supervised learning, our method of extract-
ing training data and correspondence labels from existing
RGB-D reconstructions online is fully automatic, and does
not require any manual labor or human supervision.
3. Learning From Reconstructions
In this paper, our goal is to create a function ψ that maps
the local volumetric region (or 3D patch) around a point on
a 3D surface to a descriptor vector. Given any two points, an
ideal function ψ maps their local 3D patches to two descrip-
tors, where a smaller ℓ2 distance between the descriptors indicates a higher likelihood of correspondence. We learn the
function ψ by making use of data from existing high quality
RGB-D scene reconstructions.
The advantage of this approach is threefold: First, re-
construction datasets can provide large amounts of train-
ing correspondences since each reconstruction contains mil-
lions of points that are observed from multiple different
scanning views. Each observation pair provides a training
example for matching local geometry. Between different
observations of the same interest point, its local 3D patches
can look very different due to sensor noise, viewpoint vari-
ance, and occlusion patterns. This helps to provide a large
and diverse correspondence training set. Second, recon-
structions can leverage domain knowledge such as temporal
information and well-engineered global optimization meth-
ods, which can facilitate wide baseline registrations (loop
closures). We can use the correspondences from these chal-
lenging registrations to train a powerful descriptor that can
be used for other tasks where the aforementioned domain
knowledge is unavailable. Third, by learning from mul-
tiple reconstruction datasets, we can optimize 3DMatch to
generalize and robustly match local geometry in real-world
partial 3D data under a variety of conditions. Specifically,
we use a total of over 200K RGB-D images of 62 differ-
ent scenes collected from Analysis-by-Synthesis [36], 7-Scenes [30], SUN3D [39], RGB-D Scenes v.2 [20], and Halber et al. [15]. 54 scenes are used for training and 8
scenes for testing. Each of the reconstruction datasets is
captured in different environments with different local ge-
ometries at varying scales and built with different recon-
struction algorithms.
3.1. Generating Training Correspondences
To obtain training 3D patches and their ground truth cor-
respondence labels (match or non-match), we extract local
3D patches from different scanning views around interest
points randomly sampled from reconstructions. To find cor-
respondences for an interest point, we map its 3D position
in the reconstruction into all RGB-D frames for which the
3D point lies within the frame’s camera view frustum and
is not occluded. The locations of the cameras from which
the RGB-D frames are taken are enforced to be at least
1m apart, so that the views between observation pairs are
sufficiently wide-baselined. We then extract two local 3D
patches around the interest point from two of these RGB-D
frames, and use them as a matching pair. To obtain non-
matching pairs, we extract local 3D patches from randomly
picked depth frames of two interest points (at least 0.1m
apart) randomly sampled from the surface of the reconstruc-
tion. Each local 3D patch is converted into a volumetric
representation as described in Sec. 4.1.
Due to perturbations from depth sensor noise and im-
perfections in reconstruction results, the sampled interest
points and their surrounding local 3D patches can experi-
ence some minor amounts of drift. We see this jitter as an
opportunity for our local descriptor to learn small amounts of translation invariance. Since we are learning from RGB-
D reconstruction datasets using different sensors and algo-
rithms, the jitter is not consistent, which enables the de-
scriptor to generalize and be more robust to it.
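To make the sampling procedure above concrete, the following sketch outlines it in Python. It is an illustration, not the released pipeline: the `recon` container, its `visible_frames` and `extract_patch` helpers, and the `rng` (a numpy random generator) are hypothetical stand-ins for whatever data structures a given reconstruction format provides; only the 1 m camera-baseline and 0.1 m non-match thresholds come from the text.

    import numpy as np

    def sample_training_pairs(recon, rng, min_cam_dist=1.0, min_nonmatch_dist=0.1):
        """Sample one matching and one non-matching pair of local 3D patches.

        `recon` is a hypothetical container exposing:
          recon.surface_points      : (N, 3) points on the reconstructed surface
          recon.visible_frames(p)   : RGB-D frames whose frustum sees p un-occluded,
                                      each with a .camera_center attribute
          recon.extract_patch(f, p) : local 3D patch around p observed in frame f
        """
        # Matching pair: one interest point observed from two wide-baseline views.
        while True:
            p = recon.surface_points[rng.integers(len(recon.surface_points))]
            frames = recon.visible_frames(p)
            # Keep frame pairs whose camera centers are at least 1 m apart.
            pairs = [(fa, fb)
                     for i, fa in enumerate(frames) for fb in frames[i + 1:]
                     if np.linalg.norm(fa.camera_center - fb.camera_center) >= min_cam_dist]
            if pairs:
                fa, fb = pairs[rng.integers(len(pairs))]
                match = (recon.extract_patch(fa, p), recon.extract_patch(fb, p))
                break

        # Non-matching pair: two different surface points at least 0.1 m apart,
        # each observed from a randomly picked depth frame.
        while True:
            q1 = recon.surface_points[rng.integers(len(recon.surface_points))]
            q2 = recon.surface_points[rng.integers(len(recon.surface_points))]
            if np.linalg.norm(q1 - q2) >= min_nonmatch_dist:
                break
        f1 = recon.visible_frames(q1)[rng.integers(len(recon.visible_frames(q1)))]
        f2 = recon.visible_frames(q2)[rng.integers(len(recon.visible_frames(q2)))]
        non_match = (recon.extract_patch(f1, q1), recon.extract_patch(f2, q2))
        return match, non_match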
4. Learning A Local Geometric Descriptor
We use a 3D ConvNet to learn the mapping from a vol-
umetric 3D patch to a 512-dimensional feature represen-
tation that serves as the descriptor for that local region.
During training, we optimize this mapping (i.e., updating
the weights of the ConvNet) by minimizing the ℓ2 distance between descriptors generated from corresponding interest points (matches), and maximizing the ℓ2 distance between descriptors generated from non-corresponding interest points
(non-matches). This is equivalent to training a ConvNet
with two streams (i.e., Siamese Style ConvNets [6]) that
takes in two local 3D patches and predicts whether or not
they correspond to each other.
4.1. 3D Data Representation
For each interest point, we first extract a 3D volumet-
ric representation for the local region surrounding it. Each
3D region is converted from its original representation (sur-
face mesh, point cloud, or depth map) into a volumetric
30 × 30 × 30 voxel grid of Truncated Distance Function
(TDF) values. Analogous to 2D pixel image patches, we
refer to these TDF voxel grids as local 3D patches. In our
experiments, these local 3D patches spatially span 0.3 m × 0.3 m × 0.3 m, with a voxel size of 0.01 m. The voxel grid is aligned with
respect to the camera view. If camera information is un-
available (i.e. for pre-scanned 3D models), the voxel grid
is aligned to the object coordinates. The TDF value of each
voxel indicates the distance between the center of that voxel
to the nearest 3D surface. These TDF values are truncated,
normalized and then flipped to be between 1 (on surface)
and 0 (far from surface). This form of 3D representation is
cross-compatible with 3D meshes, point-clouds, and depth
maps. Analogous to 2D RGB pixel matrices for color im-
ages, 3D TDF voxel grids also provide a natural volumetric
encoding of 3D space that is suitable as input to a 3D Con-
vNet.
The TDF representation holds several advantages over its
signed alternative TSDF [7], which encodes occluded space
(values near -1) in addition to the surface (values near 0) and
free space (values near 1). By removing the sign, the TDF
loses the distinction between free space and occluded space,
but gains a new property that is crucial to the robustness of
our descriptor on partial data: the largest gradients between
voxel values are concentrated around the surfaces rather
than in the shadow boundaries between free space and oc-
cluded space. Furthermore, the TDF representation reduces
the ambiguity of determining what is occluded space on 3D
data where camera view is unavailable.
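As a concrete illustration of this representation, the sketch below converts the surface points around an interest point into a 30×30×30 TDF grid with 0.01 m voxels, using a KD-tree query for the nearest surface point. The 0.05 m truncation margin and the nearest-point query are assumptions; only the grid size, voxel size, and the flipped [0, 1] range are taken from the text.

    import numpy as np
    from scipy.spatial import cKDTree

    def compute_tdf(surface_points, center, voxel_size=0.01, grid_dim=30, trunc_margin=0.05):
        """Voxelize the local region around `center` into a grid_dim^3 TDF grid.

        surface_points : (N, 3) points from a depth map, point cloud, or mesh surface
        center         : (3,) interest point; the grid is centered on it
        Returns a float32 (grid_dim, grid_dim, grid_dim) array with values in [0, 1],
        where 1 means "on the surface" and 0 means "far from the surface".
        """
        tree = cKDTree(surface_points)
        # Coordinates of the voxel centers, in the local (camera- or object-aligned) frame.
        half = grid_dim * voxel_size / 2.0
        coords = (np.arange(grid_dim) + 0.5) * voxel_size - half
        gx, gy, gz = np.meshgrid(coords, coords, coords, indexing="ij")
        voxel_centers = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) + np.asarray(center)
        # Distance from each voxel center to the nearest surface point,
        # truncated, normalized to [0, 1], and flipped.
        dist, _ = tree.query(voxel_centers)
        tdf = 1.0 - np.minimum(dist, trunc_margin) / trunc_margin
        return tdf.reshape(grid_dim, grid_dim, grid_dim).astype(np.float32)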
Figure 3. t-SNE embedding of 3DMatch descriptors for local
3D patches from the RedKitchen test scene of 7-Scenes [30]. This
embedding suggests that our 3DMatch ConvNet is able to cluster
local 3D patches based on local geometric features such as edges
(a,f), planes (e), corners (c,d), and other geometric structures (g,
b, h) in the face of noisy and partial data.
4.2. Network Architecture
3DMatch is a standard 3D ConvNet, inspired by AlexNet [9]. Given a 30×30×30 TDF voxel grid of a local 3D patch
around an interest point, we use eight convolutional layers
(each with a rectified linear unit activation function for non-
linearity) and a pooling layer to compute a 512-dimensional
feature representation, which serves as the feature descrip-
tor. Since the dimensions of the initial input voxel grid
are small, we only include one layer of pooling to avoid
a substantial loss of information. Convolution parameters
are shown in Fig. 2 as (kernel size, number of filters).
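A minimal PyTorch sketch of such a network is shown below: eight 3×3×3 convolutions, each followed by a ReLU, and a single max-pooling layer, mapping a 1×30×30×30 TDF grid to a 512-dimensional descriptor. The filter counts are assumptions chosen so the spatial extent collapses to 1×1×1; consult Fig. 2 for the actual (kernel size, number of filters) used by 3DMatch.

    import torch
    import torch.nn as nn

    class ThreeDMatchNet(nn.Module):
        """Sketch of a 3DMatch-style descriptor network: eight 3x3x3 convolutions,
        each followed by a ReLU, and one max-pooling layer, mapping a 1x30x30x30
        TDF voxel grid to a 512-dimensional descriptor."""

        def __init__(self):
            super().__init__()
            def conv(cin, cout):
                return nn.Sequential(nn.Conv3d(cin, cout, kernel_size=3),
                                     nn.ReLU(inplace=True))
            self.features = nn.Sequential(
                conv(1, 64), conv(64, 64),        # 30 -> 28 -> 26
                nn.MaxPool3d(kernel_size=2),      # 26 -> 13
                conv(64, 128), conv(128, 128),    # 13 -> 11 -> 9
                conv(128, 256), conv(256, 256),   #  9 ->  7 -> 5
                conv(256, 512), conv(512, 512),   #  5 ->  3 -> 1
            )

        def forward(self, x):                     # x: (B, 1, 30, 30, 30) TDF grids
            return self.features(x).flatten(1)    # (B, 512) descriptors

    # descriptors = ThreeDMatchNet()(torch.rand(4, 1, 30, 30, 30))  # shape (4, 512)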
4.3. Network Training
During training, our objective is to optimize the local de-
scriptors generated by the ConvNet such that they are sim-
ilar for 3D patches corresponding to the same point, and
dissimilar otherwise. To this end, we train our ConvNet
with two streams in a Siamese fashion where each stream
independently computes a descriptor for a different local 3D
patch. The first stream takes in the local 3D patch around
a surface point p1, while the second stream takes in a second local 3D patch around a surface point p2. Both streams
share the same architecture and underlying weights. We use
the ℓ2 norm as a similarity metric between descriptors, modeled during training with the contrastive loss function [6]. This loss minimizes the ℓ2 distance between descriptors of corresponding 3D point pairs (matches), while pulling apart the ℓ2 distance between descriptors of non-corresponding
3D point pairs. During training, we feed the network with
a balanced 1:1 ratio of matches to non-matches, a strategy
which has been shown to be effective for efficiently learning discriminative descriptors [16, 31, 40]. Fig. 3 shows a t-SNE embedding [37] of local 3D patches based on their 3DMatch
descriptors, which demonstrates the ConvNet’s ability to
cluster local 3D patches based on their geometric structure
as well as local context.
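The sketch below illustrates this two-stream setup with a contrastive loss, reusing the ThreeDMatchNet sketch from Sec. 4.2. The margin, optimizer, and learning rate are placeholders, not the paper's training configuration.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(desc_a, desc_b, is_match, margin=1.0):
        """Contrastive loss over descriptor pairs; `is_match` is 1 for corresponding
        interest points and 0 otherwise. The margin value is a placeholder."""
        d = F.pairwise_distance(desc_a, desc_b)                     # l2 distance per pair
        loss_match = is_match * d.pow(2)                            # pull matches together
        loss_nonmatch = (1 - is_match) * F.relu(margin - d).pow(2)  # push non-matches apart
        return (loss_match + loss_nonmatch).mean()

    # One training step of the two-stream (Siamese) setup; both streams share `net`.
    # net = ThreeDMatchNet()
    # optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
    # patches_a, patches_b: (B, 1, 30, 30, 30) TDF grids; labels: (B,), 1:1 match ratio
    # loss = contrastive_loss(net(patches_a), net(patches_b), labels)
    # optimizer.zero_grad(); loss.backward(); optimizer.step()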

Figure 4. Which 3D patches are matched by 3DMatch? On
the left, we show two fused fragments (A and B) taken at different
scan view angles, as well as their registration result using 3DMatch
+ RANSAC. On the right, each row shows a local 3D patch from
fragment A, followed by three nearest neighbor local 3D patches
from fragment B found by 3DMatch descriptors. The bounding
boxes are color coded to the keypoints illustrated on fragment A.
5. Evaluation
In this section, we first evaluate how well our learned lo-
cal 3D descriptor (3DMatch) can match local 3D patches of
interest point pairs (Sec. 5.1). We then evaluate its practical
use as part of geometric registration for matching 3D data
in several applications, such as scene reconstruction (Sec.
5.2) and 6D object pose estimation (Sec. 5.3).
5.1. Keypoint Matching
Our first set of experiments measure the quality of a 3D
local descriptor by testing its ability to distinguish between
matching and non-matching local 3D patches of keypoint
pairs. Using the sampling algorithm described in Sec. 3, we
construct a correspondence benchmark, similar to the Photo
Tourism dataset [4] but with local 3D patches extracted
from depth frames. The benchmark contains a collection
of 30,000 3D patches, with a 1:1 ratio between matches and non-matches. As in [4, 16], our evaluation metric is the
false-positive rate (error) at 95% recall, the lower the better.
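For clarity, the sketch below shows how this metric can be computed from descriptor distances and ground-truth labels: pick the distance threshold that accepts 95% of the true matches, then report the fraction of non-matches that also fall under it. This is an illustration of the metric as described, not the benchmark's official evaluation code.

    import numpy as np

    def error_at_95_recall(dist, is_match):
        """False-positive rate (%) at 95% recall.

        dist     : (N,) descriptor distances for the benchmark pairs
        is_match : (N,) boolean ground-truth labels (True = matching pair)
        """
        match_dist = np.sort(dist[is_match])
        # Smallest distance threshold that still accepts 95% of the true matches.
        threshold = match_dist[int(np.ceil(0.95 * len(match_dist))) - 1]
        false_positives = np.count_nonzero((dist <= threshold) & ~is_match)
        return 100.0 * false_positives / np.count_nonzero(~is_match)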
Is our descriptor better than others? We compare our
descriptor to several other state-of-the-art geometric de-
scriptors on this correspondence benchmark. For Johnson
et al. (Spin-Images) [18] and Rusu et al. (Fast Point Feature Histograms) [27], we use the implementation provided in
the Point Cloud Library (PCL). While 3DMatch uses local
TDF voxel grids computed from only a single depth frame,
we run Johnson et al. and Rusu et al. on meshes fused from
50 nearby depth frames to boost their performance on this
benchmark, since these algorithms failed to produce reason-
able results on single depth frames. Nevertheless, 3DMatch
outperforms these methods by a significant margin.
Method                               Error
Johnson et al. (Spin-Images) [18]    83.7
Rusu et al. (FPFH) [27]              61.3
2D ConvNet on Depth                  38.5
Ours (3DMatch)                       35.3
Table 1. Keypoint matching task error (%) at 95% recall.
What’s the benefit of 3D volumes vs. 2D depth patches?
We use TDF voxel grids to represent 3D data, not only be-
cause it is an intermediate representation that can be eas-
ily converted from meshes or point clouds, but also because
this 3D representation allows reasoning over real-world spa-
tial scale and occluded regions, which cannot be directly
encoded in 2D depth patches. To evaluate the advantages
of this 3D TDF encoding over 2D depth, we train a variant
of our method using a 2D ConvNet on depth patches. The
depth patches are extracted from a 0.3 m × 0.3 m × 0.3 m crop and resized to 64×64 patches. For a fair comparison, the architecture
of the 2D ConvNet is similar to our 3D ConvNet with two
extra convolution layers to achieve a similar number of pa-
rameters as the 3D ConvNet. As shown in Table
1, this 2D
ConvNet yields a higher error rate (38.5 vs. 35.3).
Should we use a metric network? Recent work [16] proposes the joint learning of a descriptor and similarity metric
with ConvNets to optimize matching accuracy. To explore
this idea, we replace our contrastive loss layer with three
fully connected layers, followed by a Softmax layer for bi-
nary classification of "match" vs. "non-match". We evaluate
the performance of this network on our keypoint matching
benchmark, where we see an error of 33.1% (2.2% improve-
ment). However, as noted by Yi et al. [40], descriptors that
require a learned metric have a limited range of applica-
bility due to the O(n²) comparison behaviour at test time
since they cannot be directly combined with metric-based
acceleration structures such as KD-trees. To maintain run-
time within practical limits, we use the version of 3DMatch
trained with an ℓ2 metric in the following sections.
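Because the descriptors live in a plain ℓ2 metric space, nearest-neighbor lookups can be handed to standard acceleration structures, as the text notes. The sketch below uses SciPy's KD-tree to find, for every keypoint descriptor in one fragment, its nearest descriptor in another, avoiding the exhaustive O(n²) comparisons a learned metric network would require; variable names are illustrative.

    import numpy as np
    from scipy.spatial import cKDTree

    def match_descriptors(desc_a, desc_b):
        """For each keypoint descriptor in fragment A, find its nearest neighbor
        (by l2 distance) among the descriptors of fragment B.

        desc_a : (Na, 512) descriptors of keypoints sampled from fragment A
        desc_b : (Nb, 512) descriptors of keypoints sampled from fragment B
        Returns (dists, indices) arrays of length Na.
        """
        tree = cKDTree(desc_b)              # usable because the metric is plain l2
        dists, indices = tree.query(desc_a)
        return dists, indices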
5.2. Geometric Registration
To evaluate the practical use of our descriptor, we com-
bine 3DMatch with a RANSAC search algorithm for geo-
metric registration, and measure its performance on stan-
dard benchmarks. More specifically, given two 3D point
clouds from scanning data, we first randomly sample n
keypoints from each point cloud. Using the local 3D
30 × 30 × 30 TDF patches around each keypoint (aligned to
the camera axes, which may be different per point cloud),
