Real-time 3D Reconstruction at Scale using Voxel Hashing
Matthias Nießner (1,3)    Michael Zollhöfer (1)    Shahram Izadi (2)    Marc Stamminger (1)

(1) University of Erlangen-Nuremberg    (2) Microsoft Research Cambridge    (3) Stanford University
Figure 1: Example output from our reconstruction system without any geometry post-processing. The scene is about 20m wide and 4m high, and was captured online in less than 5 minutes with live feedback of the reconstruction.
Abstract

Online 3D reconstruction is gaining newfound interest due to the availability of real-time consumer depth cameras. The basic problem takes live overlapping depth maps as input and incrementally fuses these into a single 3D model. This is challenging particularly when real-time performance is desired without trading quality or scale. We contribute an online system for large and fine scale volumetric reconstruction based on a memory and speed efficient data structure. Our system uses a simple spatial hashing scheme that compresses space, and allows for real-time access and updates of implicit surface data, without the need for a regular or hierarchical grid data structure. Surface data is only stored densely where measurements are observed. Additionally, data can be streamed efficiently in or out of the hash table, allowing for further scalability during sensor motion. We show interactive reconstructions of a variety of scenes, reconstructing both fine-grained details and large scale environments. We illustrate how all parts of our pipeline from depth map pre-processing, camera pose estimation, depth map fusion, and surface rendering are performed at real-time rates on commodity graphics hardware. We conclude with a comparison to current state-of-the-art online systems, illustrating improved performance and reconstruction quality.
CR Categories: I.3.3 [Computer Graphics]: Picture/Image Generation—Digitizing and Scanning

Keywords: real-time reconstruction, scalable, data structure, GPU

Links: DL PDF
1 Introduction
While 3D reconstruction is an established field in computer vision and graphics, it is now gaining newfound momentum due to the wide availability of depth cameras (such as the Microsoft Kinect and Asus Xtion). Since these devices output live but noisy depth maps, a particular focus of recent work is online surface reconstruction using such consumer depth cameras. The ability to obtain reconstructions in real-time opens up various interactive applications including: augmented reality (AR) where real-world geometry can be fused with 3D graphics and rendered live to the user; autonomous guidance for robots to reconstruct and respond rapidly to their environment; or even to provide immediate feedback to users during 3D scanning.

Online reconstruction requires incremental fusion of many overlapping depth maps into a single 3D representation that is continuously refined. This is challenging particularly when real-time performance is required without trading fine-quality reconstructions and spatial scale. Many state-of-the-art online techniques therefore employ different types of underlying data structures accelerated using graphics hardware. These however have particular trade-offs in terms of reconstruction speed, scale, and quality.
Point-based methods (e.g., [Rusinkiewicz et al. 2002; Weise et al. 2009]) use simple unstructured representations that closely map to range and depth sensor input, but lack the ability to directly reconstruct connected surfaces. High-quality online scanning of small objects has been demonstrated [Weise et al. 2009], but larger-scale reconstructions clearly trade quality and/or speed [Henry et al. 2012; Stückler and Behnke 2012]. Height-map based representations [Pollefeys et al. 2008; Gallup et al. 2010] support efficient compression of connected surface data, and can scale efficiently to larger scenes, but fail to reconstruct complex 3D structures.
For active sensors, implicit volumetric approaches, in particular the method of Curless and Levoy [1996], have demonstrated compelling results [Curless and Levoy 1996; Levoy et al. 2000; Zhou and Koltun 2013], even at real-time rates [Izadi et al. 2011; Newcombe et al. 2011]. However, these rely on memory-inefficient regular voxel grids, in turn restricting scale. This has led to either moving volume variants [Roth and Vona 2012; Whelan et al. 2012], which stream voxel data out-of-core as the sensor moves but still constrain the size of the active volume, or hierarchical data structures that subdivide space more effectively but do not parallelize efficiently given the added computational complexity [Zeng et al. 2012; Chen et al. 2013].

We contribute a new real-time surface reconstruction system which
supports fine-quality reconstructions at scale. Our approach carries
the benefits of volumetric approaches, but does not require either a
memory constrained voxel grid or the computational overheads of a
hierarchical data structure. Our method is based on a simple memory
and speed efficient spatial hashing technique that compresses space,
and allows for real-time fusion of referenced implicit surface data,
without the need for a hierarchical data structure. Surface data
is only stored densely in cells where measurements are observed.
Additionally, data can be streamed efficiently in or out of the hash
table, allowing for further scalability during sensor motion.
While these types of efficient spatial hashing techniques have been proposed for a variety of rendering and collision detection tasks [Teschner et al. 2003; Lefebvre and Hoppe 2006; Bastos and Celes 2008; Alcantara et al. 2009; Pan and Manocha 2011; García et al. 2011], we describe the use of such data structures for surface reconstruction, where the underlying data needs to be continuously updated. We show interactive reconstructions of a variety of scenes, reconstructing both fine-grained and large-scale environments. We illustrate how all parts of our pipeline from depth map pre-processing, sensor pose estimation, depth map fusion, and surface rendering are performed at real-time rates on commodity graphics hardware. We conclude with a comparison to current state-of-the-art systems, illustrating improved performance and reconstruction quality.
2 Related Work

There is over three decades of research on 3D reconstruction. In this section we review relevant systems, with a focus on online reconstruction methods and active sensors. Unlike systems that focus on reconstruction from a complete set of 3D points [Hoppe et al. 1992; Kazhdan et al. 2006], online methods require incremental fusion of many overlapping depth maps into a single 3D representation that is continuously refined. Typically methods first register or align sequential depth maps using variants of the Iterative Closest Point (ICP) algorithm [Besl and McKay 1992; Chen and Medioni 1992].

Parametric methods [Chen and Medioni 1992; Higuchi et al. 1995] simply average overlapping samples, and connect points by assuming a simple surface topology (such as a cylinder or a sphere) to locally fit polygons. Extensions such as mesh zippering [Turk and Levoy 1994] select one depth map per surface region, remove redundant triangles in overlapping regions, and stitch meshes. These methods handle some denoising by local averaging of points, but are fragile in the presence of outliers and areas with high curvature. These challenges associated with working directly with polygon meshes have led to many other reconstruction methods.
Point-based methods perform reconstruction by merging overlapping points, and avoid inferring connectivity. Rendering the final model is performed using point-based rendering techniques [Gross and Pfister 2007]. Given that the output from most depth sensors is 3D point samples, it is natural for reconstruction methods to work directly with such data. Examples include in-hand scanning systems [Rusinkiewicz et al. 2002; Weise et al. 2009], which support reconstruction of only single small objects. At this small scale, high-quality reconstructions have been achieved [Weise et al. 2009]. Larger scenes have been reconstructed by trading real-time speed and quality [Henry et al. 2012; Stückler and Behnke 2012]. These methods lack the ability to directly model connected surfaces, requiring additional expensive and often offline steps to construct surfaces; e.g., using volumetric data structures [Rusinkiewicz et al. 2002].
Height-map based representations explore the use of more compact 2.5D continuous surface representations for reconstruction [Pollefeys et al. 2008; Gallup et al. 2010]. These techniques are particularly useful for modeling large buildings with floors and walls, since these appear as clear discontinuities in the height-map. Multi-layered height-maps have been explored to support reconstruction of more complex 3D shapes such as balconies, doorways, and arches [Gallup et al. 2010]. While these methods support more efficient compression of surface data, the 2.5D representation fails to reconstruct many types of complex 3D structures.
An alternative method is to use a fully volumetric data structure to implicitly store samples of a continuous function [Hilton et al. 1996; Curless and Levoy 1996; Wheeler et al. 1998]. In these methods, depth maps are converted into signed distance fields and cumulatively averaged into a regular voxel grid. The final surface is extracted as the zero-level set of the implicit function using isosurface polygonisation (e.g., [Lorensen and Cline 1987]) or raycasting. A well-known example is the method of Curless and Levoy [1996], which, for active triangulation-based sensors such as laser range scanners and structured light cameras, can generate very high quality results [Curless and Levoy 1996; Levoy et al. 2000; Zhou and Koltun 2013]. KinectFusion [Newcombe et al. 2011; Izadi et al. 2011] recently adopted this volumetric method and demonstrated compelling real-time reconstructions using a commodity GPU.

While shown to be a high quality reconstruction method, particularly given the computational cost, this approach suffers from one major limitation: the use of a regular voxel grid imposes a large memory footprint, representing both empty space and surfaces densely, and thus fails to reconstruct larger scenes without compromising quality.
Scaling-up Volumetric Fusion
Recent work begins to address this spatial limitation of volumetric methods in different ways. [Keller et al. 2013] use a point-based representation that captures qualities of volumetric fusion but removes the need for a spatial data structure. While demonstrating compelling scalable real-time reconstructions, the quality is not on-par with true volumetric methods.

Moving volume methods [Roth and Vona 2012; Whelan et al. 2012] extend the GPU-based pipeline of KinectFusion. While still operating on a very restricted regular grid, these methods stream out voxels from the GPU based on camera motion, freeing space for new data to be stored. In these methods the streaming is one-way and lossy. Surface data is compressed to a mesh, and once moved to the host it cannot be streamed back to the GPU. While offering a simple approach for scalability, at their core these systems still use a regular grid structure, which means that the active volume must remain small to ensure fine-quality reconstructions. This limits reconstructions to scenes with close-by geometric structures, and cannot utilize the full range of data for active sensors such as the Kinect.
This limit of regular grids has led researchers to investigate more efficient volumetric data structures. This is a well studied topic in the volume rendering literature, with efficient methods based on sparse voxel octrees [Laine and Karras 2011; Kämpe et al. 2013], simpler multi-level hierarchies and adaptive data structures [Kraus and Ertl 2002; Lefebvre et al. 2005; Bastos and Celes 2008; Reichl et al. 2012], and out-of-core streaming architectures for large datasets [Hadwiger et al. 2012; Crassin et al. 2009]. These approaches have begun to be explored in the context of online reconstruction, where the need to support real-time updates of the underlying data adds a fundamentally new challenge.
For example, [Zhou et al. 2011] demonstrate a GPU-based octree which can perform Poisson surface reconstruction on 300K vertices at interactive rates. [Zeng et al. 2012] implement a 9- to 10-level octree on the GPU, which extends the KinectFusion pipeline to a larger 8m × 8m × 2m indoor office space. The method however requires a complex octree structure to be implemented, with additional computational complexity and pointer overhead, and with only limited gains in scale.

In an octree, the resolution in each dimension increases by a factor of two at each subdivision level. This results in the need for a deep tree structure for efficient subdivision, which conversely impacts performance, in particular on GPUs where tree traversal leads to thread divergence. The rendering literature has proposed many alternative hierarchical data structures [Lefebvre et al. 2005; Kraus and Ertl 2002; Laine and Karras 2011; Kämpe et al. 2013; Reichl et al. 2012]. In [Chen et al. 2013] an N³ hierarchy [Lefebvre et al. 2005] was adopted for 3D reconstruction at scale, and the optimal tree depth and branching factor were empirically derived (showing that large branching factors and a shallow tree optimize GPU performance). While avoiding the use of an octree, the system still carries computational overheads in realizing such a hierarchical data structure on the GPU. As such this leads to performance that is only real-time on specific scenes, and on very high-end graphics hardware.
3 Algorithm Overview
We extend the volumetric method of Curless and Levoy [1996] to reconstruct high-quality 3D surfaces in real-time and at scale, by incrementally fusing noisy depth maps into a memory and speed efficient data structure. The method of Curless and Levoy has proven to produce compelling results given a simple cumulative average of samples. It supports incremental updates, makes no topological assumptions regarding surfaces, and approximates the noise characteristics of triangulation-based sensors effectively. Further, while an implicit representation, stored isosurfaces can be readily extracted. Our method addresses the main drawback of Curless and Levoy: supporting efficient scalability. Next, we review the Curless and Levoy method, before describing our new approach.
Implicit Volumetric Fusion
Curless and Levoy's method is based on storing an implicit signed distance field (SDF) within a volumetric data structure. Let us consider a regular dense voxel grid, and assume the input is a sequence of depth maps. The depth sensor is initialized at some origin relative to this grid (typically the center of the grid). First, the rigid six degree-of-freedom (6DoF) ego-motion of the sensor is estimated, typically using variants of ICP [Besl and McKay 1992; Chen and Medioni 1992].

Each voxel in the grid contains two values: a signed distance and a weight. For a single depth map, data is integrated into the grid by uniformly sweeping through the volume, culling voxels outside of the view frustum, projecting all voxel centers into the depth map, and updating stored SDF values. All voxels that project onto the same pixel are considered part of the depth sample's footprint. At each of these voxels a signed distance from the voxel center to the observed surface measurement is stored, with positive distances in front, negative behind, and nearing zero at the surface interface.

To reduce computational cost, support sensor motion, and approximate sensor noise, Curless and Levoy introduce the notion of a truncated SDF (TSDF) which only stores the signed distance in a region around the observed surface. This region can be adapted in size, approximating sensor noise as a Gaussian with variance based on depth [Chang et al. 1994; Nguyen et al. 2012]. Only TSDF values stored in voxels within these regions are updated using a weighted average to obtain an estimate of the surface. Finally, voxels (in front of the surface) that are part of each depth sample's footprint, but outside of the truncation region, are explicitly marked as free space. This allows removal of outliers based on free-space violations.
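Concretely, the weighted average is maintained as a running update per voxel v: given a new per-sample distance d_i(v) with weight w_i(v), the stored distance D(v) and weight W(v) are updated as in Curless and Levoy [1996]:

D(v) ← (W(v) · D(v) + w_i(v) · d_i(v)) / (W(v) + w_i(v)),    W(v) ← W(v) + w_i(v)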
Voxel Hashing
Given that Curless and Levoy truncate SDFs around the surface, the majority of data stored in the regular voxel grid is marked either as free space or as unobserved space, rather than surface data. The key challenge becomes how to design a data structure that exploits this underlying sparsity in the TSDF representation. Our approach specifically avoids the use of a dense or hierarchical data structure, removing the need for a memory intensive regular grid or computationally complex hierarchy for volumetric fusion. Instead, we use a simple hashing scheme to compactly store, access, and update an implicit surface representation.
In the graphics community, efficient spatial hashing methods have been explored in the context of a variety of 2D/3D rendering and collision detection tasks [Teschner et al. 2003; Lefebvre and Hoppe 2006; Bastos and Celes 2008; Alcantara et al. 2009; Pan and Manocha 2011; García et al. 2011]. Sophisticated methods have been proposed for efficient GPU-based hashing that greatly reduce the number of hash entry collisions.
Our goal is to build a real-time system that employs a spatial hashing scheme for scalable volumetric reconstruction. This is non-trivial for 3D reconstruction as the geometry is unknown ahead of time and continually changing. Therefore, our hashing technique must support dynamic allocations and updates, while minimizing and resolving potential hash entry collisions, without requiring a-priori knowledge of the contained surface geometry. In approaching the design of our data structure, we have purposefully chosen and extended a simple hashing scheme [Teschner et al. 2003], and while more sophisticated methods exist, we show empirically that our method is efficient in terms of speed, quality, and scalability.
The hash table sparsely and efficiently stores and updates TSDFs. In the following we describe the data structure in more detail, and demonstrate how it can be efficiently implemented on the GPU. We highlight some of the core features of our data structure, including:

- The ability to efficiently compress volumetric TSDFs, while maintaining surface resolution, without the need for a hierarchical spatial data structure.
- Fusing new TSDF samples efficiently into the hash table, based on insertions and updates, while minimizing collisions.
- Removal and garbage collection of voxel blocks, without requiring costly reorganization of the data structure.
- Lightweight bidirectional streaming of voxel blocks between host and GPU, allowing unbounded reconstructions.
- Extraction of isosurfaces from the data structure efficiently using standard raycasting or polygonization operations, for rendering and camera pose estimation.
System Pipeline
Our pipeline is depicted in Fig. 2. Central is a hash table data structure that stores sub-blocks containing SDFs, called voxel blocks. Each occupied entry in our hash table refers to an allocated voxel block. At each voxel we store a TSDF, weight, and an additional color value. The hash table is unstructured; i.e., neighboring voxel blocks are not stored spatially, but can be in different parts of the hash table. Our hashing function allows an efficient look-up of voxel blocks, using specified (integer rounded) world coordinates. Our hash function aims to minimize the number of collisions and ensures no duplicates exist in the table.

Given a new input depth map, we begin by performing fusion (also referred to as integration). We first allocate new voxel blocks and insert block descriptors into the hash table, based on the input depth map. Only occupied voxels are allocated and empty space is not stored. Next we sweep each allocated voxel block to update the SDF, color, and weight of each contained voxel, based on the input depth and color samples. In addition, we garbage collect voxel blocks which are too far from the isosurface and contain no weight. This involves freeing allocated memory as well as removing the voxel block entry from the hash table. These steps ensure that our data structure remains sparse over time.
Figure 2: Pipeline overview.
After integration, we raycast the implicit surface from the current estimated camera pose to extract the isosurface, including associated colors. This extracted depth and color buffer is used as input for camera pose estimation: given the next input depth map, a projective point-plane ICP [Chen and Medioni 1992] is performed to estimate the new 6DoF camera pose. This ensures that pose estimation is performed frame-to-model rather than frame-to-frame, mitigating some of the issues of drift (particularly for small scenes) [Newcombe et al. 2011]. Finally, our algorithm performs bidirectional streaming between GPU and host. Hash entries (and associated voxel blocks) are streamed to the host as their world positions exit the estimated camera view frustum. Previously streamed-out voxel blocks can also be streamed back to the GPU data structure when revisiting areas.
4 Data Structure
Fig. 3 shows our voxel hashing data structure. Conceptually, an infinite uniform grid subdivides the world into voxel blocks. Each block is a small regular voxel grid. In our current implementation a voxel block is composed of 8³ voxels. Each voxel stores a TSDF, color, and weight, and requires 8 bytes of memory:
struct Voxel {
    float sdf;          // truncated signed distance to the nearest surface
    uchar colorRGB[3];  // accumulated color
    uchar weight;       // accumulated integration weight
};
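At 8 bytes per voxel, a single 8³ voxel block thus occupies 8³ · 8 B = 4 KB of GPU memory; as described next, this cost is only incurred for blocks near observed surface geometry.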
To exploit sparsity, voxel blocks are only allocated around reconstructed surface geometry. We use an efficient GPU accelerated hash table to manage allocation and retrieval of voxel blocks. The hash table stores hash entries, each containing a pointer to an allocated voxel block. Voxel blocks can be retrieved from the hash table using integer world coordinates (x, y, z). Finding the coordinates for a 3D point in world space is achieved by simple multiplication and rounding. We map from a world coordinate (x, y, z) to hash value H(x, y, z) using the following hashing function:

H(x, y, z) = (x · p1 ⊕ y · p2 ⊕ z · p3) mod n

where ⊕ denotes the bitwise XOR, p1, p2, and p3 are large prime numbers (in our case 73856093, 19349669, and 83492791 respectively, based on [Teschner et al. 2003]), and n is the hash table size. In addition to storing a pointer to the voxel block, each hash entry also contains the associated world position, and an offset pointer to handle collisions efficiently (described in the next section).
struct HashEntry {
    short position[3];  // integer world position of the voxel block
    short offset;       // relative offset to the next entry when buckets overflow
    int pointer;        // pointer to the allocated voxel block data
};
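For illustration, a minimal C++ sketch of this mapping, from world space to a block coordinate and then to a hash slot; the names worldToBlock and hashBlock, and treating n as the number of buckets, are illustrative choices rather than details from our implementation:

#include <cmath>
#include <cstdint>

static const uint32_t p1 = 73856093u, p2 = 19349669u, p3 = 83492791u;

struct BlockCoord { int x, y, z; };

// World-space position -> integer voxel block coordinate.
// blockExtent = 8 * voxelSize, since each block spans 8^3 voxels.
BlockCoord worldToBlock(float wx, float wy, float wz, float blockExtent) {
    return { (int)std::floor(wx / blockExtent),
             (int)std::floor(wy / blockExtent),
             (int)std::floor(wz / blockExtent) };
}

// H(x, y, z) = (x * p1 xor y * p2 xor z * p3) mod n, with n the table size.
uint32_t hashBlock(const BlockCoord& b, uint32_t n) {
    return (((uint32_t)b.x * p1) ^ ((uint32_t)b.y * p2) ^
            ((uint32_t)b.z * p3)) % n;
}

Note that negative block coordinates simply wrap under the unsigned multiplication, which keeps the hash deterministic.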
Figure 3: Our voxel hashing data structure. Conceptually, an infinite uniform grid partitions the world. Using our hash function, we map from integer world coordinates to hash buckets, which store a small array of pointers to regular grid voxel blocks. Each voxel block contains an 8³ grid of SDF values. When information for the red block gets added, a collision appears, which is resolved by using the second element in the hash bucket.
4.1 Resolving Collisions
Collisions appear if multiple allocated blocks are mapped to the
same hash value (see red block in Fig. 3). We handle collisions by
uniformly organizing the hash table into buckets, one per unique
hash value. Each bucket sequentially stores a small number of hash
entries. When a collision occurs, we store the block pointer in the
next available sequential entry in the bucket (see Fig. 4). To find
the voxel block for a particular world position, we first evaluate our
hash function, and lookup and traverse the associated bucket until
our block entry is found. This is achieved by simply comparing the
stored hash entry world position with the query position.
With a reasonable selection of the hash table and bucket size (see
later), rarely will a bucket overflow. However, if this happens, we
append a linked list entry, filling up other free spots in the next
available buckets. The (relative) pointers for the linked lists are
stored in the offset field of the hash table entries. Such a list is
appended to a full bucket by setting the offset pointer for the last
entry in the bucket. All following entries are then chained using
the offset field. In order to create additional links for a bucket, we
linearly search across the hash table for a free slot to store our entry,
appending to the link list accordingly. We avoid the last entry in
each bucket, as this is locally reserved for the link list head.
As shown later, we choose a table and bucket size that keeps the number of collisions, and therefore appended linked lists, to a minimum for most scenes, so as not to impact overall performance.
4.2 Hashing Operations
Insertion
To insert new hash entries, we first evaluate the hash
function and determine the target bucket. We then iterate over all
bucket elements including possible lists attached to the last entry.
If we find an element with the same world space position we can
immediately return a reference. Otherwise, we look for the first
empty position within the bucket. If a position in the bucket is
available, we insert the new hash entry. If the bucket is full, we
append an element to its linked list (see Fig. 4).
To avoid race conditions when inserting hash entries in parallel, we
lock a bucket atomically for writing when a suitable empty position

is found. This eliminates duplicate entries and ensures linked list
consistency. If a bucket is locked for writing, all other allocations
for the same bucket are staggered until the next frame is processed.
This may delay some allocations marginally. However, in practice
this causes no degradation in reconstruction quality (as observed in
the results and supplementary video), particularly as the Curless and
Levoy method supports order independent updates.
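A sequential C++ sketch of this insertion path, using the HashEntry struct and hashBlock function above; FREE_ENTRY is an illustrative sentinel for an unallocated slot, and the bucket-overflow linked list and atomic bucket locking described above are omitted:

static const int FREE_ENTRY = -1; // illustrative sentinel for an empty slot

// Scans the whole bucket first, since deletions can leave gaps before an
// existing entry (fragmentation). Returns the slot index, or -1 if the
// bucket is full (the overflow list path is not shown).
int insertEntry(HashEntry* table, uint32_t n, uint32_t bucketSize,
                const BlockCoord& b, int voxelBlockPtr) {
    const uint32_t base = hashBlock(b, n) * bucketSize;
    int firstFree = -1;
    for (uint32_t i = 0; i < bucketSize; ++i) {
        HashEntry& e = table[base + i];
        if (e.pointer != FREE_ENTRY && e.position[0] == b.x &&
            e.position[1] == b.y && e.position[2] == b.z)
            return (int)(base + i);          // entry already exists
        if (e.pointer == FREE_ENTRY && firstFree < 0)
            firstFree = (int)(base + i);     // remember the first gap
    }
    if (firstFree >= 0) {
        HashEntry& e = table[firstFree];
        e.position[0] = (short)b.x;
        e.position[1] = (short)b.y;
        e.position[2] = (short)b.z;
        e.offset  = 0;
        e.pointer = voxelBlockPtr;
    }
    return firstFree;
}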
Retrieval
To read the hash entry for a query position, we compute
the hash value and perform a linear search within the correspond-
ing bucket. If no entry is found, and the bucket has a linked list
associated (the offset value of the last entry is set), we also have to
traverse this list. Note that we do not require a bucket to be filled
from left to right. As described below, removing values can lead
to fragmentation, so traversal does not stop when empty entries are
found in the bucket.
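A matching retrieval sketch under the same assumptions; the whole bucket is scanned before the overflow list is followed, and offsets are assumed to be relative to the current slot (bounds and wrap-around handling omitted):

// Returns the entry for block b, or nullptr if the block is not allocated.
const HashEntry* retrieveEntry(const HashEntry* table, uint32_t n,
                               uint32_t bucketSize, const BlockCoord& b) {
    const uint32_t base = hashBlock(b, n) * bucketSize;
    for (uint32_t i = 0; i < bucketSize; ++i) {
        const HashEntry& e = table[base + i];
        if (e.pointer != FREE_ENTRY && e.position[0] == b.x &&
            e.position[1] == b.y && e.position[2] == b.z)
            return &e;
    }
    uint32_t idx = base + bucketSize - 1;    // the list head lives here
    while (table[idx].offset != 0) {
        idx += table[idx].offset;            // follow the relative offset
        const HashEntry& e = table[idx];
        if (e.pointer != FREE_ENTRY && e.position[0] == b.x &&
            e.position[1] == b.y && e.position[2] == b.z)
            return &e;
    }
    return nullptr;
}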
Figure 4: The hash table is broken down into a set of buckets. Each slot is either unallocated (white) or contains an entry (blue) storing the query world position, pointer to surface data, and an offset pointer for dealing with bucket overflow. Example hashing operations: for illustration, we insert and remove four entries that all map to hash = 1 and update entries and pointers accordingly.
Deletion
Deleting a hash entry is similar to insertion. For a given
world position we first compute the hash and then linearly search
the corresponding hash bucket including list traversal. If we have
found the matching entry without list traversal we can simply delete
it. If it is the last element of the bucket and there was a non-zero
offset stored (i.e., the element is a list head), we copy the hash
entry pointed to by the offset into the last element of the bucket,
and delete it from its current position. Otherwise if the entry is a
(non-head) element in the linked list, we delete it and correct list
pointers accordingly (see Fig. 4). Synchronization is not required
for deletion directly within the bucket. However, in the case we need
to modify the linked list, we lock the bucket atomically and stagger
further list operations for this bucket until the next frame.
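A sketch of the simple in-bucket case only; the list-head copy-up and in-list unlinking described above are intentionally left out:

// Deletes the entry for block b when it sits directly in the bucket and is
// not a linked-list head. Returns false when the full procedure is needed.
bool deleteEntry(HashEntry* table, uint32_t n, uint32_t bucketSize,
                 const BlockCoord& b) {
    const uint32_t base = hashBlock(b, n) * bucketSize;
    for (uint32_t i = 0; i < bucketSize; ++i) {
        HashEntry& e = table[base + i];
        if (e.pointer != FREE_ENTRY && e.position[0] == b.x &&
            e.position[1] == b.y && e.position[2] == b.z &&
            e.offset == 0) {        // skip list heads (handled separately)
            e.pointer = FREE_ENTRY; // the voxel block itself is returned to
            return true;            // the heap free list elsewhere
        }
    }
    return false;
}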
5 Voxel Block Allocation
Before integration of new TSDFs, voxel blocks must be allocated
that fall within the footprint of each input depth sample, and are also
within the truncation region of the surface measurement. We process
depth samples in parallel, inserting hash entries and allocating voxel
blocks within the truncation region around the observed surface. The
size of the truncation is adapted based on the variance of depth to
compensate for larger uncertainty in distant measurements [Chang
et al. 1994; Nguyen et al. 2012].
For each input depth sample, we instantiate a ray with an interval
bound to the truncation region. Given the predefined voxel resolu-
tion and block size, we use DDA [Amanatides and Woo 1987] to
determine all the voxel blocks that intersect with the ray. For each
candidate found, we insert a new voxel block entry into the hash
table. In an idealized case, each depth sample would be modeled as
an entire frustum rather than a single ray. We would then allocate all
voxel blocks within the truncation region that intersect with this frus-
tum. In practice however, this leads to degradation in performance
(currently 10-fold). Our ray-based approximation provides a balance
between performance and precision. Given the continuous nature of
the reconstruction, the frame rate of the sensor, and the mobility of
the user, this in practice leads to no holes appearing between voxel
blocks at larger distances (see results and accompanying video).
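The sketch below enumerates candidate blocks along such a ray segment from the near truncation bound p to the far bound q. It substitutes dense sampling for true DDA stepping [Amanatides and Woo 1987], so it is only a simplified stand-in: sampling at a third of the block extent can still miss blocks the segment merely clips at a corner, which proper DDA would catch.

#include <vector>

std::vector<BlockCoord> blocksOnSegment(const float p[3], const float q[3],
                                        float blockExtent) {
    std::vector<BlockCoord> out;
    const float d[3] = { q[0] - p[0], q[1] - p[1], q[2] - p[2] };
    const float len = std::sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
    const int steps = (int)(3.0f * len / blockExtent) + 1;
    for (int i = 0; i <= steps; ++i) {
        const float t = (float)i / (float)steps;
        BlockCoord b = worldToBlock(p[0] + t*d[0], p[1] + t*d[1],
                                    p[2] + t*d[2], blockExtent);
        if (out.empty() || b.x != out.back().x || b.y != out.back().y ||
            b.z != out.back().z)
            out.push_back(b);   // deduplicate consecutive hits
    }
    return out;
}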
Once we have successfully inserted an entry into the hash table,
we allocate a portion of preallocated heap memory on the GPU to
store voxel block data. The heap is a linear array of memory, allo-
cated once upon initialization. It is divided into contiguous blocks
(mapping to the size of voxel blocks), and managed by maintaining
a list of available blocks. This list is a linear buffer with indices
to all unallocated blocks. A new block is allocated using the last
index in the list. If a voxel block is subsequently freed, its index is
appended to the end of the list. Since the list is accessed in parallel,
synchronization is necessary, by incrementing or decrementing the
end of list pointer using an atomic operation.
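A CPU-side sketch of this heap, with std::atomic standing in for the GPU atomics; as in the pipeline above, allocation and freeing are assumed to run in separate passes rather than concurrently with each other:

#include <atomic>
#include <vector>

struct BlockHeap {
    std::vector<Voxel> voxels;    // all voxel storage, allocated once
    std::vector<int>   freeList;  // indices of unallocated blocks
    std::atomic<int>   freeCount; // number of valid entries in freeList

    explicit BlockHeap(int numBlocks)
        : voxels((size_t)numBlocks * 512), // 512 = 8^3 voxels per block
          freeList(numBlocks), freeCount(numBlocks) {
        for (int i = 0; i < numBlocks; ++i) freeList[i] = i;
    }
    int alloc() {                            // pop the last free index
        int n = freeCount.fetch_sub(1);
        return n > 0 ? freeList[n - 1] : -1; // -1: heap exhausted
    }
    void release(int blockIdx) {             // push the index back
        int n = freeCount.fetch_add(1);
        freeList[n] = blockIdx;
    }
};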
6 Voxel Block Integration
We update all allocated voxel blocks that are currently within the camera view frustum. After the previous step (see Section 5), all voxel blocks in the truncation region of the visible surface are allocated. However, a large fraction of the hash table will be empty (i.e., not refer to any voxel blocks). Further, a significant number of voxel blocks will be outside the viewing frustum. Given these observations, TSDF integration can be done very efficiently by only selecting allocated blocks inside the current camera frustum.
Voxel Block Selection
To select voxel blocks for integration, we first access all hash table entries in parallel, storing a binary flag in an array: one for an occupied and visible voxel block, zero otherwise. We then scan this array using a parallel prefix sum technique [Harris et al. 2007]. To facilitate large scan sizes (our hash table can have millions of entries) we use a three-level up- and down-sweep. Using the scan results we compact the hash table into another buffer, which contains all hash entries that point to voxel blocks within the view frustum (see Fig. 5). Note that voxel blocks are not copied, just their associated hash entries.
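A sequential sketch of the same scan-and-scatter compaction, using C++17 std::exclusive_scan; the paper's three-level GPU scan [Harris et al. 2007] computes the same offsets in parallel:

#include <numeric>
#include <vector>

// visible[i] is 1 if entry i refers to an occupied voxel block inside the
// view frustum, 0 otherwise.
std::vector<HashEntry> compactVisible(const std::vector<HashEntry>& table,
                                      const std::vector<int>& visible) {
    std::vector<int> offsets(table.size());
    std::exclusive_scan(visible.begin(), visible.end(), offsets.begin(), 0);
    const size_t count = table.empty() ? 0
                       : (size_t)(offsets.back() + visible.back());
    std::vector<HashEntry> compacted(count);
    for (size_t i = 0; i < table.size(); ++i)
        if (visible[i]) compacted[offsets[i]] = table[i]; // scatter by offset
    return compacted;
}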
Implicit Surface Update
The generated list of hash entries is
then processed in parallel to update TSDF values. A single GPGPU
kernel is executed for each of the associated blocks, with one thread
allocated per voxel. That means that a voxel block will be processed
on a single GPU multiprocessor, thus maximizing cache hits and
minimizing code divergence. In practice, this is more efficient than
assigning a single thread to process an entire voxel block.
Updating voxel blocks involves re-computation of the associated TSDFs, weights, and colors. Distance values are integrated using a running average, as in Curless and Levoy [1996].
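A per-voxel sketch of this running average; SDF truncation and clamping, free-space handling, and the GPGPU kernel launch are omitted, and sampleWeight is assumed to be at least 1. The uchar typedef mirrors the Voxel struct above.

#include <algorithm>

typedef unsigned char uchar; // as used in the Voxel struct above

void integrateVoxel(Voxel& v, float sdfSample, float sampleWeight,
                    const uchar sampleColor[3]) {
    const float wOld = (float)v.weight;
    const float wSum = wOld + sampleWeight;
    v.sdf = (v.sdf * wOld + sdfSample * sampleWeight) / wSum;
    for (int c = 0; c < 3; ++c)  // colors averaged with the same weights
        v.colorRGB[c] = (uchar)((v.colorRGB[c] * wOld +
                                 sampleColor[c] * sampleWeight) / wSum);
    v.weight = (uchar)std::min(wSum, 255.0f); // weight saturates at 8 bits
}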

References

Besl, P. J., and McKay, N. D. 1992. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2), 239-256.

Curless, B., and Levoy, M. 1996. A volumetric method for building complex models from range images. In Proceedings of SIGGRAPH 96, 303-312.

Hoppe, H., DeRose, T., Duchamp, T., McDonald, J., and Stuetzle, W. 1992. Surface reconstruction from unorganized points. In Proceedings of SIGGRAPH 92, 71-78.

Lorensen, W. E., and Cline, H. E. 1987. Marching cubes: A high resolution 3D surface construction algorithm. In Proceedings of SIGGRAPH 87, 163-169.

Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges, S., and Fitzgibbon, A. 2011. KinectFusion: Real-time dense surface mapping and tracking. In Proceedings of ISMAR 2011, 127-136.