Real-time 3D Reconstruction at Scale using Voxel Hashing
Matthias Nießner (1,3)    Michael Zollhöfer (1)    Shahram Izadi (2)    Marc Stamminger (1)

(1) University of Erlangen-Nuremberg    (2) Microsoft Research Cambridge    (3) Stanford University
Figure 1: Example output from our reconstruction system without any geometry post-processing. The scene is about 20m wide and 4m high, and was captured online in less than 5 minutes with live feedback of the reconstruction.
Abstract

Online 3D reconstruction is gaining newfound interest due to the availability of real-time consumer depth cameras. The basic problem takes live overlapping depth maps as input and incrementally fuses these into a single 3D model. This is challenging particularly when real-time performance is desired without trading quality or scale. We contribute an online system for large and fine scale volumetric reconstruction based on a memory and speed efficient data structure. Our system uses a simple spatial hashing scheme that compresses space, and allows for real-time access and updates of implicit surface data, without the need for a regular or hierarchical grid data structure. Surface data is only stored densely where measurements are observed. Additionally, data can be streamed efficiently in or out of the hash table, allowing for further scalability during sensor motion. We show interactive reconstructions of a variety of scenes, reconstructing both fine-grained details and large scale environments. We illustrate how all parts of our pipeline from depth map pre-processing, camera pose estimation, depth map fusion, and surface rendering are performed at real-time rates on commodity graphics hardware. We conclude with a comparison to current state-of-the-art online systems, illustrating improved performance and reconstruction quality.
CR Categories: I.3.3 [Computer Graphics]: Picture/Image Generation—Digitizing and Scanning

Keywords: real-time reconstruction, scalable, data structure, GPU

Links: DL PDF
1 Introduction
While 3D reconstruction is an established field in computer vision and graphics, it is now gaining newfound momentum due to the wide availability of depth cameras (such as the Microsoft Kinect and Asus Xtion). Since these devices output live but noisy depth maps, a particular focus of recent work is online surface reconstruction using such consumer depth cameras. The ability to obtain reconstructions in real-time opens up various interactive applications including: augmented reality (AR) where real-world geometry can be fused with 3D graphics and rendered live to the user; autonomous guidance for robots to reconstruct and respond rapidly to their environment; or even to provide immediate feedback to users during 3D scanning.

Online reconstruction requires incremental fusion of many overlapping depth maps into a single 3D representation that is continuously refined. This is challenging particularly when real-time performance is required without trading fine-quality reconstructions and spatial scale. Many state-of-the-art online techniques therefore employ different types of underlying data structures accelerated using graphics hardware. These however have particular trade-offs in terms of reconstruction speed, scale, and quality.
Point-based methods (e.g., [Rusinkiewicz et al. 2002; Weise et al. 2009]) use simple unstructured representations that closely map to range and depth sensor input, but lack the ability to directly reconstruct connected surfaces. High-quality online scanning of small objects has been demonstrated [Weise et al. 2009], but larger-scale reconstructions clearly trade quality and/or speed [Henry et al. 2012; Stückler and Behnke 2012]. Height-map based representations [Pollefeys et al. 2008; Gallup et al. 2010] support efficient compression of connected surface data, and can scale efficiently to larger scenes, but fail to reconstruct complex 3D structures.
For active sensors, implicit volumetric approaches, in particular the method of Curless and Levoy [1996], have demonstrated compelling results [Curless and Levoy 1996; Levoy et al. 2000; Zhou and Koltun 2013], even at real-time rates [Izadi et al. 2011; Newcombe et al. 2011]. However, these rely on memory-inefficient regular voxel grids, in turn restricting scale. This has led to either moving volume variants [Roth and Vona 2012; Whelan et al. 2012], which stream voxel data out-of-core as the sensor moves but still constrain the size of the active volume, or hierarchical data structures that subdivide space more effectively but do not parallelize efficiently given the added computational complexity [Zeng et al. 2012; Chen et al. 2013].

We contribute a new real-time surface reconstruction system which
supports fine-quality reconstructions at scale. Our approach carries
the benefits of volumetric approaches, but does not require either a
memory constrained voxel grid or the computational overheads of a
hierarchical data structure. Our method is based on a simple memory
and speed efficient spatial hashing technique that compresses space,
and allows for real-time fusion of referenced implicit surface data,
without the need for a hierarchical data structure. Surface data
is only stored densely in cells where measurements are observed.
Additionally, data can be streamed efficiently in or out of the hash
table, allowing for further scalability during sensor motion.
While these types of efficient spatial hashing techniques have been proposed for a variety of rendering and collision detection tasks [Teschner et al. 2003; Lefebvre and Hoppe 2006; Bastos and Celes 2008; Alcantara et al. 2009; Pan and Manocha 2011; García et al. 2011], we describe the use of such data structures for surface reconstruction, where the underlying data needs to be continuously updated. We show interactive reconstructions of a variety of scenes, reconstructing both fine-grained and large-scale environments. We illustrate how all parts of our pipeline from depth map pre-processing, sensor pose estimation, depth map fusion, and surface rendering are performed at real-time rates on commodity graphics hardware. We conclude with a comparison to current state-of-the-art systems, illustrating improved performance and reconstruction quality.
2 Related Work

There is over three decades of research on 3D reconstruction. In this section we review relevant systems, with a focus on online reconstruction methods and active sensors. Unlike systems that focus on reconstruction from a complete set of 3D points [Hoppe et al. 1992; Kazhdan et al. 2006], online methods require incremental fusion of many overlapping depth maps into a single 3D representation that is continuously refined. Typically methods first register or align sequential depth maps using variants of the Iterative Closest Point (ICP) algorithm [Besl and McKay 1992; Chen and Medioni 1992].

Parametric methods [Chen and Medioni 1992; Higuchi et al. 1995] simply average overlapping samples, and connect points by assuming a simple surface topology (such as a cylinder or a sphere) to locally fit polygons. Extensions such as mesh zippering [Turk and Levoy 1994] select one depth map per surface region, remove redundant triangles in overlapping regions, and stitch meshes. These methods handle some denoising by local averaging of points, but are fragile in the presence of outliers and areas with high curvature. These challenges associated with working directly with polygon meshes have led to many other reconstruction methods.
Point-based methods perform reconstruction by merging overlapping points, and avoid inferring connectivity. Rendering the final model is performed using point-based rendering techniques [Gross and Pfister 2007]. Given that the output from most depth sensors is 3D point samples, it is natural for reconstruction methods to work directly with such data. Examples include in-hand scanning systems [Rusinkiewicz et al. 2002; Weise et al. 2009], which support reconstruction of only single small objects. At this small scale, high-quality reconstructions have been achieved [Weise et al. 2009]. Larger scenes have been reconstructed by trading real-time speed and quality [Henry et al. 2012; Stückler and Behnke 2012]. These methods lack the ability to directly model connected surfaces, requiring additional expensive and often offline steps to construct surfaces; e.g., using volumetric data structures [Rusinkiewicz et al. 2002].
Height-map based representations explore the use of more compact 2.5D continuous surface representations for reconstruction [Pollefeys et al. 2008; Gallup et al. 2010]. These techniques are particularly useful for modeling large buildings with floors and walls, since these appear as clear discontinuities in the height-map. Multi-layered height-maps have been explored to support reconstruction of more complex 3D shapes such as balconies, doorways, and arches [Gallup et al. 2010]. While these methods support more efficient compression of surface data, the 2.5D representation fails to reconstruct many types of complex 3D structures.
An alternative method is to use a fully volumetric data structure to implicitly store samples of a continuous function [Hilton et al. 1996; Curless and Levoy 1996; Wheeler et al. 1998]. In these methods, depth maps are converted into signed distance fields and cumulatively averaged into a regular voxel grid. The final surface is extracted as the zero-level set of the implicit function using isosurface polygonisation (e.g., [Lorensen and Cline 1987]) or raycasting. A well-known example is the method of Curless and Levoy [1996], which, for active triangulation-based sensors such as laser range scanners and structured light cameras, can generate very high quality results [Curless and Levoy 1996; Levoy et al. 2000; Zhou and Koltun 2013]. KinectFusion [Newcombe et al. 2011; Izadi et al. 2011] recently adopted this volumetric method and demonstrated compelling real-time reconstructions using a commodity GPU.

While shown to be a high quality reconstruction method, particularly given the computational cost, this approach suffers from one major limitation: the use of a regular voxel grid imposes a large memory footprint, representing both empty space and surfaces densely, and thus fails to reconstruct larger scenes without compromising quality.
Scaling-up Volumetric Fusion
Recent work begins to address this spatial limitation of volumetric methods in different ways. [Keller et al. 2013] use a point-based representation that captures qualities of volumetric fusion but removes the need for a spatial data structure. While demonstrating compelling scalable real-time reconstructions, the quality is not on-par with true volumetric methods.

Moving volume methods [Roth and Vona 2012; Whelan et al. 2012] extend the GPU-based pipeline of KinectFusion. While still operating on a very restricted regular grid, these methods stream out voxels from the GPU based on camera motion, freeing space for new data to be stored. In these methods the streaming is one-way and lossy. Surface data is compressed to a mesh, and once moved to the host it cannot be streamed back to the GPU. While offering a simple approach for scalability, at their core these systems still use a regular grid structure, which means that the active volume must remain small to ensure fine-quality reconstructions. This limits reconstructions to scenes with close-by geometric structures, and cannot utilize the full range of data for active sensors such as the Kinect.
This limit of regular grids has led researchers to investigate more efficient volumetric data structures. This is a well studied topic in the volume rendering literature, with efficient methods based on sparse voxel octrees [Laine and Karras 2011; Kämpe et al. 2013], simpler multi-level hierarchies and adaptive data structures [Kraus and Ertl 2002; Lefebvre et al. 2005; Bastos and Celes 2008; Reichl et al. 2012], and out-of-core streaming architectures for large datasets [Hadwiger et al. 2012; Crassin et al. 2009]. These approaches have begun to be explored in the context of online reconstruction, where the need to support real-time updates of the underlying data adds a fundamentally new challenge.
For example, [Zhou et al. 2011] demonstrate a GPU-based octree which can perform Poisson surface reconstruction on 300K vertices at interactive rates. [Zeng et al. 2012] implement a 9- to 10-level octree on the GPU, which extends the KinectFusion pipeline to a larger 8m × 8m × 2m indoor office space. The method however requires a complex octree structure to be implemented, with additional computational complexity and pointer overhead, and with only limited gains in scale.

In an octree, the resolution in each dimension increases by a factor of two at each subdivision level. This results in the need for a deep tree structure for efficient subdivision, which conversely impacts performance, in particular on GPUs where tree traversal leads to thread divergence. The rendering literature has proposed many alternative hierarchical data structures [Lefebvre et al. 2005; Kraus and Ertl 2002; Laine and Karras 2011; Kämpe et al. 2013; Reichl et al. 2012]. In [Chen et al. 2013] an N³ hierarchy [Lefebvre et al. 2005] was adopted for 3D reconstruction at scale, and the optimal tree depth and branching factor were empirically derived (showing that large branching factors and a shallow tree optimize GPU performance). While avoiding the use of an octree, the system still carries computational overheads in realizing such a hierarchical data structure on the GPU. As such this leads to performance that is only real-time on specific scenes, and on very high-end graphics hardware.
3 Algorithm Overview
We extend the volumetric method of Curless and Levoy [1996] to reconstruct high-quality 3D surfaces in real-time and at scale, by incrementally fusing noisy depth maps into a memory and speed efficient data structure. The method of Curless and Levoy has proven to produce compelling results given a simple cumulative average of samples. It supports incremental updates, makes no topological assumptions regarding surfaces, and approximates the noise characteristics of triangulation-based sensors effectively. Further, while an implicit representation, stored isosurfaces can be readily extracted. Our method addresses the main drawback of Curless and Levoy: supporting efficient scalability. Next, we review the Curless and Levoy method, before describing our new approach.
Implicit Volumetric Fusion
Curless and Levoy's method is based on storing an implicit signed distance field (SDF) within a volumetric data structure. Let us consider a regular dense voxel grid, and assume the input is a sequence of depth maps. The depth sensor is initialized at some origin relative to this grid (typically the center of the grid). First, the rigid six degree-of-freedom (6DoF) ego-motion of the sensor is estimated, typically using variants of ICP [Besl and McKay 1992; Chen and Medioni 1992].

Each voxel in the grid contains two values: a signed distance and a weight. For a single depth map, data is integrated into the grid by uniformly sweeping through the volume, culling voxels outside of the view frustum, projecting all voxel centers into the depth map, and updating stored SDF values. All voxels that project onto the same pixel are considered part of the depth sample's footprint. At each of these voxels a signed distance from the voxel center to the observed surface measurement is stored, with positive distances in front, negative behind, and nearing zero at the surface interface.

To reduce computational cost, support sensor motion, and approximate sensor noise, Curless and Levoy introduce the notion of a truncated SDF (TSDF) which only stores the signed distance in a region around the observed surface. This region can be adapted in size, approximating sensor noise as a Gaussian with variance based on depth [Chang et al. 1994; Nguyen et al. 2012]. Only TSDF values stored in voxels within these regions are updated using a weighted average to obtain an estimate of the surface. Finally, voxels (in front of the surface) that are part of each depth sample's footprint, but outside of the truncation region, are explicitly marked as free space. This allows removal of outliers based on free-space violations.
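Concretely, the weighted average is maintained as a running update per voxel v: given a new per-sample distance d_i(v) with weight w_i(v), the stored distance D(v) and weight W(v) are updated as in Curless and Levoy [1996]:

D(v) ← (W(v) · D(v) + w_i(v) · d_i(v)) / (W(v) + w_i(v)),    W(v) ← W(v) + w_i(v)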
Voxel Hashing
Given that Curless and Levoy truncate SDFs around the surface, the majority of data stored in the regular voxel grid is marked either as free space or as unobserved space, rather than surface data. The key challenge becomes how to design a data structure that exploits this underlying sparsity in the TSDF representation. Our approach specifically avoids the use of a dense or hierarchical data structure, removing the need for a memory intensive regular grid or computationally complex hierarchy for volumetric fusion. Instead, we use a simple hashing scheme to compactly store, access, and update an implicit surface representation.
In the graphics community, efficient spatial hashing methods have been explored in the context of a variety of 2D/3D rendering and collision detection tasks [Teschner et al. 2003; Lefebvre and Hoppe 2006; Bastos and Celes 2008; Alcantara et al. 2009; Pan and Manocha 2011; García et al. 2011]. Sophisticated methods have been proposed for efficient GPU-based hashing that greatly reduce the number of hash entry collisions.
Our goal is to build a real-time system that employs a spatial hashing scheme for scalable volumetric reconstruction. This is non-trivial for 3D reconstruction as the geometry is unknown ahead of time and continually changing. Therefore, our hashing technique must support dynamic allocations and updates, while minimizing and resolving potential hash entry collisions, without requiring a-priori knowledge of the contained surface geometry. In approaching the design of our data structure, we have purposefully chosen and extended a simple hashing scheme [Teschner et al. 2003], and while more sophisticated methods exist, we show empirically that our method is efficient in terms of speed, quality, and scalability.
The hash table sparsely and efficiently stores and updates TSDFs. In the following we describe the data structure in more detail, and demonstrate how it can be efficiently implemented on the GPU. We highlight some of the core features of our data structure, including:

- The ability to efficiently compress volumetric TSDFs, while maintaining surface resolution, without the need for a hierarchical spatial data structure.
- Fusing new TSDF samples efficiently into the hash table, based on insertions and updates, while minimizing collisions.
- Removal and garbage collection of voxel blocks, without requiring costly reorganization of the data structure.
- Lightweight bidirectional streaming of voxel blocks between host and GPU, allowing unbounded reconstructions.
- Extraction of isosurfaces from the data structure efficiently using standard raycasting or polygonization operations, for rendering and camera pose estimation.
System Pipeline
Our pipeline is depicted in Fig. 2. Central is a hash table data structure that stores sub-blocks containing SDFs, called voxel blocks. Each occupied entry in our hash table refers to an allocated voxel block. At each voxel we store a TSDF, weight, and an additional color value. The hash table is unstructured; i.e., neighboring voxel blocks are not stored spatially, but can be in different parts of the hash table. Our hashing function allows an efficient look-up of voxel blocks, using specified (integer rounded) world coordinates. Our hash function aims to minimize the number of collisions and ensures no duplicates exist in the table.

Given a new input depth map, we begin by performing fusion (also referred to as integration). We first allocate new voxel blocks and insert block descriptors into the hash table, based on the input depth map. Only occupied voxels are allocated and empty space is not stored. Next we sweep each allocated voxel block to update the SDF, color, and weight of each contained voxel, based on the input depth and color samples. In addition, we garbage collect voxel blocks which are too far from the isosurface and contain no weight. This involves freeing allocated memory as well as removing the voxel block entry from the hash table. These steps ensure that our data structure remains sparse over time.
Figure 2: Pipeline overview.
After integration, we raycast the implicit surface from the current estimated camera pose to extract the isosurface, including associated colors. This extracted depth and color buffer is used as input for camera pose estimation: given the next input depth map, a projective point-plane ICP [Chen and Medioni 1992] is performed to estimate the new 6DoF camera pose. This ensures that pose estimation is performed frame-to-model rather than frame-to-frame, mitigating some of the issues of drift (particularly for small scenes) [Newcombe et al. 2011]. Finally, our algorithm performs bidirectional streaming between GPU and host. Hash entries (and associated voxel blocks) are streamed to the host as their world positions exit the estimated camera view frustum. Previously streamed-out voxel blocks can also be streamed back to the GPU data structure when revisiting areas.
4 Data Structure
Fig. 3 shows our voxel hashing data structure. Conceptually, an infinite uniform grid subdivides the world into voxel blocks. Each block is a small regular voxel grid. In our current implementation a voxel block is composed of 8³ voxels. Each voxel stores a TSDF, color, and weight, and requires 8 bytes of memory:
struct Voxel {
    float sdf;          // truncated signed distance to the nearest surface
    uchar colorRGB[3];  // accumulated color
    uchar weight;       // accumulated integration weight
};
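At 8 bytes per voxel, a single 8³ voxel block thus occupies 8³ · 8 B = 4 KB of GPU memory; as described next, this cost is only incurred for blocks near observed surface geometry.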
To exploit sparsity, voxel blocks are only allocated around reconstructed surface geometry. We use an efficient GPU accelerated hash table to manage allocation and retrieval of voxel blocks. The hash table stores hash entries, each containing a pointer to an allocated voxel block. Voxel blocks can be retrieved from the hash table using integer world coordinates (x, y, z). Finding the coordinates for a 3D point in world space is achieved by simple multiplication and rounding. We map from a world coordinate (x, y, z) to hash value H(x, y, z) using the following hashing function:

H(x, y, z) = (x · p1 ⊕ y · p2 ⊕ z · p3) mod n

where ⊕ denotes the bitwise XOR, p1, p2, and p3 are large prime numbers (in our case 73856093, 19349669, and 83492791 respectively, based on [Teschner et al. 2003]), and n is the hash table size. In addition to storing a pointer to the voxel block, each hash entry also contains the associated world position, and an offset pointer to handle collisions efficiently (described in the next section).
struct HashEntry {
    short position[3];  // integer world position of the voxel block
    short offset;       // relative offset to the next entry when buckets overflow
    int pointer;        // pointer to the allocated voxel block data
};
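For illustration, a minimal C++ sketch of this mapping, from world space to a block coordinate and then to a hash slot; the names worldToBlock and hashBlock, and treating n as the number of buckets, are illustrative choices rather than details from our implementation:

#include <cmath>
#include <cstdint>

static const uint32_t p1 = 73856093u, p2 = 19349669u, p3 = 83492791u;

struct BlockCoord { int x, y, z; };

// World-space position -> integer voxel block coordinate.
// blockExtent = 8 * voxelSize, since each block spans 8^3 voxels.
BlockCoord worldToBlock(float wx, float wy, float wz, float blockExtent) {
    return { (int)std::floor(wx / blockExtent),
             (int)std::floor(wy / blockExtent),
             (int)std::floor(wz / blockExtent) };
}

// H(x, y, z) = (x * p1 xor y * p2 xor z * p3) mod n, with n the table size.
uint32_t hashBlock(const BlockCoord& b, uint32_t n) {
    return (((uint32_t)b.x * p1) ^ ((uint32_t)b.y * p2) ^
            ((uint32_t)b.z * p3)) % n;
}

Note that negative block coordinates simply wrap under the unsigned multiplication, which keeps the hash deterministic.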
Figure 3: Our voxel hashing data structure. Conceptually, an infinite uniform grid partitions the world. Using our hash function, we map from integer world coordinates to hash buckets, which store a small array of pointers to regular grid voxel blocks. Each voxel block contains an 8³ grid of SDF values. When information for the red block gets added, a collision appears, which is resolved by using the second element in the hash bucket.
4.1 Resolving Collisions
Collisions appear if multiple allocated blocks are mapped to the
same hash value (see red block in Fig. 3). We handle collisions by
uniformly organizing the hash table into buckets, one per unique
hash value. Each bucket sequentially stores a small number of hash
entries. When a collision occurs, we store the block pointer in the
next available sequential entry in the bucket (see Fig. 4). To find
the voxel block for a particular world position, we first evaluate our
hash function, and lookup and traverse the associated bucket until
our block entry is found. This is achieved by simply comparing the
stored hash entry world position with the query position.
With a reasonable selection of the hash table and bucket size (see
later), rarely will a bucket overflow. However, if this happens, we
append a linked list entry, filling up other free spots in the next
available buckets. The (relative) pointers for the linked lists are
stored in the offset field of the hash table entries. Such a list is
appended to a full bucket by setting the offset pointer for the last
entry in the bucket. All following entries are then chained using
the offset field. In order to create additional links for a bucket, we
linearly search across the hash table for a free slot to store our entry,
appending to the link list accordingly. We avoid the last entry in
each bucket, as this is locally reserved for the link list head.
As shown later, we choose a table and bucket size that keeps the number of collisions, and therefore appended linked lists, to a minimum for most scenes, so as not to impact overall performance.
4.2 Hashing Operations
Insertion
To insert new hash entries, we first evaluate the hash
function and determine the target bucket. We then iterate over all
bucket elements including possible lists attached to the last entry.
If we find an element with the same world space position we can
immediately return a reference. Otherwise, we look for the first
empty position within the bucket. If a position in the bucket is
available, we insert the new hash entry. If the bucket is full, we
append an element to its linked list (see Fig. 4).
To avoid race conditions when inserting hash entries in parallel, we
lock a bucket atomically for writing when a suitable empty position

is found. This eliminates duplicate entries and ensures linked list
consistency. If a bucket is locked for writing, all other allocations
for the same bucket are staggered until the next frame is processed.
This may delay some allocations marginally. However, in practice
this causes no degradation in reconstruction quality (as observed in
the results and supplementary video), particularly as the Curless and
Levoy method supports order independent updates.
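A sequential C++ sketch of this insertion path, using the HashEntry struct and hashBlock function above; FREE_ENTRY is an illustrative sentinel for an unallocated slot, and the bucket-overflow linked list and atomic bucket locking described above are omitted:

static const int FREE_ENTRY = -1; // illustrative sentinel for an empty slot

// Scans the whole bucket first, since deletions can leave gaps before an
// existing entry (fragmentation). Returns the slot index, or -1 if the
// bucket is full (the overflow list path is not shown).
int insertEntry(HashEntry* table, uint32_t n, uint32_t bucketSize,
                const BlockCoord& b, int voxelBlockPtr) {
    const uint32_t base = hashBlock(b, n) * bucketSize;
    int firstFree = -1;
    for (uint32_t i = 0; i < bucketSize; ++i) {
        HashEntry& e = table[base + i];
        if (e.pointer != FREE_ENTRY && e.position[0] == b.x &&
            e.position[1] == b.y && e.position[2] == b.z)
            return (int)(base + i);          // entry already exists
        if (e.pointer == FREE_ENTRY && firstFree < 0)
            firstFree = (int)(base + i);     // remember the first gap
    }
    if (firstFree >= 0) {
        HashEntry& e = table[firstFree];
        e.position[0] = (short)b.x;
        e.position[1] = (short)b.y;
        e.position[2] = (short)b.z;
        e.offset  = 0;
        e.pointer = voxelBlockPtr;
    }
    return firstFree;
}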
Retrieval
To read the hash entry for a query position, we compute
the hash value and perform a linear search within the correspond-
ing bucket. If no entry is found, and the bucket has a linked list
associated (the offset value of the last entry is set), we also have to
traverse this list. Note that we do not require a bucket to be filled
from left to right. As described below, removing values can lead
to fragmentation, so traversal does not stop when empty entries are
found in the bucket.
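A matching retrieval sketch under the same assumptions; the whole bucket is scanned before the overflow list is followed, and offsets are assumed to be relative to the current slot (bounds and wrap-around handling omitted):

// Returns the entry for block b, or nullptr if the block is not allocated.
const HashEntry* retrieveEntry(const HashEntry* table, uint32_t n,
                               uint32_t bucketSize, const BlockCoord& b) {
    const uint32_t base = hashBlock(b, n) * bucketSize;
    for (uint32_t i = 0; i < bucketSize; ++i) {
        const HashEntry& e = table[base + i];
        if (e.pointer != FREE_ENTRY && e.position[0] == b.x &&
            e.position[1] == b.y && e.position[2] == b.z)
            return &e;
    }
    uint32_t idx = base + bucketSize - 1;    // the list head lives here
    while (table[idx].offset != 0) {
        idx += table[idx].offset;            // follow the relative offset
        const HashEntry& e = table[idx];
        if (e.pointer != FREE_ENTRY && e.position[0] == b.x &&
            e.position[1] == b.y && e.position[2] == b.z)
            return &e;
    }
    return nullptr;
}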
Figure 4: The hash table is broken down into a set of buckets. Each slot is either unallocated (white) or contains an entry (blue) storing the query world position, pointer to surface data, and an offset pointer for dealing with bucket overflow. Example hashing operations: for illustration, we insert and remove four entries that all map to hash = 1 and update entries and pointers accordingly.
Deletion
Deleting a hash entry is similar to insertion. For a given
world position we first compute the hash and then linearly search
the corresponding hash bucket including list traversal. If we have
found the matching entry without list traversal we can simply delete
it. If it is the last element of the bucket and there was a non-zero
offset stored (i.e., the element is a list head), we copy the hash
entry pointed to by the offset into the last element of the bucket,
and delete it from its current position. Otherwise if the entry is a
(non-head) element in the linked list, we delete it and correct list
pointers accordingly (see Fig. 4). Synchronization is not required
for deletion directly within the bucket. However, in the case we need
to modify the linked list, we lock the bucket atomically and stagger
further list operations for this bucket until the next frame.
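A sketch of the simple in-bucket case only; the list-head copy-up and in-list unlinking described above are intentionally left out:

// Deletes the entry for block b when it sits directly in the bucket and is
// not a linked-list head. Returns false when the full procedure is needed.
bool deleteEntry(HashEntry* table, uint32_t n, uint32_t bucketSize,
                 const BlockCoord& b) {
    const uint32_t base = hashBlock(b, n) * bucketSize;
    for (uint32_t i = 0; i < bucketSize; ++i) {
        HashEntry& e = table[base + i];
        if (e.pointer != FREE_ENTRY && e.position[0] == b.x &&
            e.position[1] == b.y && e.position[2] == b.z &&
            e.offset == 0) {        // skip list heads (handled separately)
            e.pointer = FREE_ENTRY; // the voxel block itself is returned to
            return true;            // the heap free list elsewhere
        }
    }
    return false;
}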
5 Voxel Block Allocation
Before integration of new TSDFs, voxel blocks must be allocated
that fall within the footprint of each input depth sample, and are also
within the truncation region of the surface measurement. We process
depth samples in parallel, inserting hash entries and allocating voxel
blocks within the truncation region around the observed surface. The
size of the truncation is adapted based on the variance of depth to
compensate for larger uncertainty in distant measurements [Chang
et al. 1994; Nguyen et al. 2012].
For each input depth sample, we instantiate a ray with an interval
bound to the truncation region. Given the predefined voxel resolu-
tion and block size, we use DDA [Amanatides and Woo 1987] to
determine all the voxel blocks that intersect with the ray. For each
candidate found, we insert a new voxel block entry into the hash
table. In an idealized case, each depth sample would be modeled as
an entire frustum rather than a single ray. We would then allocate all
voxel blocks within the truncation region that intersect with this frus-
tum. In practice however, this leads to degradation in performance
(currently 10-fold). Our ray-based approximation provides a balance
between performance and precision. Given the continuous nature of
the reconstruction, the frame rate of the sensor, and the mobility of
the user, this in practice leads to no holes appearing between voxel
blocks at larger distances (see results and accompanying video).
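The sketch below enumerates candidate blocks along such a ray segment from the near truncation bound p to the far bound q. It substitutes dense sampling for true DDA stepping [Amanatides and Woo 1987], so it is only a simplified stand-in: sampling at a third of the block extent can still miss blocks the segment merely clips at a corner, which proper DDA would catch.

#include <vector>

std::vector<BlockCoord> blocksOnSegment(const float p[3], const float q[3],
                                        float blockExtent) {
    std::vector<BlockCoord> out;
    const float d[3] = { q[0] - p[0], q[1] - p[1], q[2] - p[2] };
    const float len = std::sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
    const int steps = (int)(3.0f * len / blockExtent) + 1;
    for (int i = 0; i <= steps; ++i) {
        const float t = (float)i / (float)steps;
        BlockCoord b = worldToBlock(p[0] + t*d[0], p[1] + t*d[1],
                                    p[2] + t*d[2], blockExtent);
        if (out.empty() || b.x != out.back().x || b.y != out.back().y ||
            b.z != out.back().z)
            out.push_back(b);   // deduplicate consecutive hits
    }
    return out;
}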
Once we have successfully inserted an entry into the hash table,
we allocate a portion of preallocated heap memory on the GPU to
store voxel block data. The heap is a linear array of memory, allo-
cated once upon initialization. It is divided into contiguous blocks
(mapping to the size of voxel blocks), and managed by maintaining
a list of available blocks. This list is a linear buffer with indices
to all unallocated blocks. A new block is allocated using the last
index in the list. If a voxel block is subsequently freed, its index is
appended to the end of the list. Since the list is accessed in parallel,
synchronization is necessary, by incrementing or decrementing the
end of list pointer using an atomic operation.
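A CPU-side sketch of this heap, with std::atomic standing in for the GPU atomics; as in the pipeline above, allocation and freeing are assumed to run in separate passes rather than concurrently with each other:

#include <atomic>
#include <vector>

struct BlockHeap {
    std::vector<Voxel> voxels;    // all voxel storage, allocated once
    std::vector<int>   freeList;  // indices of unallocated blocks
    std::atomic<int>   freeCount; // number of valid entries in freeList

    explicit BlockHeap(int numBlocks)
        : voxels((size_t)numBlocks * 512), // 512 = 8^3 voxels per block
          freeList(numBlocks), freeCount(numBlocks) {
        for (int i = 0; i < numBlocks; ++i) freeList[i] = i;
    }
    int alloc() {                            // pop the last free index
        int n = freeCount.fetch_sub(1);
        return n > 0 ? freeList[n - 1] : -1; // -1: heap exhausted
    }
    void release(int blockIdx) {             // push the index back
        int n = freeCount.fetch_add(1);
        freeList[n] = blockIdx;
    }
};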
6 Voxel Block Integration
We update all allocated voxel blocks that are currently within the camera view frustum. After the previous step (see Section 5), all voxel blocks in the truncation region of the visible surface are allocated. However, a large fraction of the hash table will be empty (i.e., not refer to any voxel blocks). Further, a significant number of voxel blocks will be outside the viewing frustum. Given these observations, TSDF integration can be done very efficiently by only selecting allocated blocks inside the current camera frustum.
Voxel Block Selection
To select voxel blocks for integration, we first access all hash table entries in parallel, storing a binary flag in an array: one for an occupied and visible voxel block, zero otherwise. We then scan this array using a parallel prefix sum technique [Harris et al. 2007]. To facilitate large scan sizes (our hash table can have millions of entries) we use a three-level up- and down-sweep. Using the scan results we compact the hash table into another buffer, which contains all hash entries that point to voxel blocks within the view frustum (see Fig. 5). Note that voxel blocks are not copied, just their associated hash entries.
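A sequential sketch of the same scan-and-scatter compaction, using C++17 std::exclusive_scan; the paper's three-level GPU scan [Harris et al. 2007] computes the same offsets in parallel:

#include <numeric>
#include <vector>

// visible[i] is 1 if entry i refers to an occupied voxel block inside the
// view frustum, 0 otherwise.
std::vector<HashEntry> compactVisible(const std::vector<HashEntry>& table,
                                      const std::vector<int>& visible) {
    std::vector<int> offsets(table.size());
    std::exclusive_scan(visible.begin(), visible.end(), offsets.begin(), 0);
    const size_t count = table.empty() ? 0
                       : (size_t)(offsets.back() + visible.back());
    std::vector<HashEntry> compacted(count);
    for (size_t i = 0; i < table.size(); ++i)
        if (visible[i]) compacted[offsets[i]] = table[i]; // scatter by offset
    return compacted;
}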
Implicit Surface Update
The generated list of hash entries is
then processed in parallel to update TSDF values. A single GPGPU
kernel is executed for each of the associated blocks, with one thread
allocated per voxel. That means that a voxel block will be processed
on a single GPU multiprocessor, thus maximizing cache hits and
minimizing code divergence. In practice, this is more efficient than
assigning a single thread to process an entire voxel block.
Updating voxel blocks involves re-computation of the associated TSDFs, weights, and colors. Distance values are integrated using a running average, as in Curless and Levoy [1996].
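A per-voxel sketch of this running average; SDF truncation and clamping, free-space handling, and the GPGPU kernel launch are omitted, and sampleWeight is assumed to be at least 1. The uchar typedef mirrors the Voxel struct above.

#include <algorithm>

typedef unsigned char uchar; // as used in the Voxel struct above

void integrateVoxel(Voxel& v, float sdfSample, float sampleWeight,
                    const uchar sampleColor[3]) {
    const float wOld = (float)v.weight;
    const float wSum = wOld + sampleWeight;
    v.sdf = (v.sdf * wOld + sdfSample * sampleWeight) / wSum;
    for (int c = 0; c < 3; ++c)  // colors averaged with the same weights
        v.colorRGB[c] = (uchar)((v.colorRGB[c] * wOld +
                                 sampleColor[c] * sampleWeight) / wSum);
    v.weight = (uchar)std::min(wSum, 255.0f); // weight saturates at 8 bits
}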

References

Besl, P. J., and McKay, N. D. 1992. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2), 239-256.

Curless, B., and Levoy, M. 1996. A volumetric method for building complex models from range images. In Proceedings of SIGGRAPH 96, 303-312.

Hoppe, H., DeRose, T., Duchamp, T., McDonald, J., and Stuetzle, W. 1992. Surface reconstruction from unorganized points. In Proceedings of SIGGRAPH 92, 71-78.

Lorensen, W. E., and Cline, H. E. 1987. Marching cubes: A high resolution 3D surface construction algorithm. In Proceedings of SIGGRAPH 87, 163-169.

Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges, S., and Fitzgibbon, A. 2011. KinectFusion: Real-time dense surface mapping and tracking. In Proceedings of ISMAR 2011, 127-136.