DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time
Richard A. Newcombe
newcombe@cs.washington.edu
Dieter Fox
fox@cs.washington.edu
University of Washington, Seattle
Steven M. Seitz
seitz@cs.washington.edu
Figure 1: Real-time reconstructions of a moving scene with DynamicFusion; both the person and the camera are moving. The initially
noisy and incomplete model is progressively denoised and completed over time (left to right).
Abstract
We present the first dense SLAM system capable of re-
constructing non-rigidly deforming scenes in real-time, by
fusing together RGBD scans captured from commodity sen-
sors. Our DynamicFusion approach reconstructs scene ge-
ometry whilst simultaneously estimating a dense volumet-
ric 6D motion field that warps the estimated geometry into
a live frame. Like KinectFusion, our system produces in-
creasingly denoised, detailed, and complete reconstructions
as more measurements are fused, and displays the updated
model in real time. Because we do not require a template
or other prior scene model, the approach is applicable to a
wide range of moving objects and scenes.
3D scanning traditionally involves separate capture and
off-line processing phases, requiring very careful planning
of the capture to make sure that every surface is cov-
ered. In practice, it’s very difficult to avoid holes, requir-
ing several iterations of capture, reconstruction, identifying
holes, and recapturing missing regions to ensure a complete
model. Real-time 3D reconstruction systems like KinectFu-
sion [18, 10] represent a major advance, by providing users
the ability to instantly see the reconstruction and identify
regions that remain to be scanned. KinectFusion spurred a
flurry of follow up research aimed at robustifying the track-
ing [9, 32] and expanding its spatial mapping capabilities to
larger environments [22, 19, 34, 31, 9].
However, as with all traditional SLAM and dense re-
construction systems, the most basic assumption behind
KinectFusion is that the observed scene is largely static.
The core question we tackle in this paper is: How can we
generalise KinectFusion to reconstruct and track dynamic,
non-rigid scenes in real-time? To that end, we introduce
DynamicFusion, an approach based on solving for a vol-
umetric flow field that transforms the state of the scene at
each time instant into a fixed, canonical frame. In the case
of a moving person, for example, this transformation un-
does the person’s motion, warping each body configuration
into the pose of the first frame. Following these warps, the
scene is effectively rigid, and standard KinectFusion up-
dates can be used to obtain a high quality, denoised recon-
struction. This progressively denoised reconstruction can
then be transformed back into the live frame using the in-
verse map; each point in the canonical frame is transformed
to its location in the live frame (see Figure 1).
Defining a canonical “rigid” space for a dynamically
moving scene is not straightforward. A key contribution
of our work is an approach for non-rigid transformation and
fusion that retains the optimality properties of volumetric
scan fusion [5], developed originally for rigid scenes. The
main insight is that undoing the scene motion to enable fu-
sion of all observations into a single fixed frame can be
achieved efficiently by computing the inverse map alone.
Under this transformation, each canonical point projects
along a line of sight in the live camera frame. Since the
optimality arguments of [5] (developed for rigid scenes) de-
pend only on lines of sight, we can generalize their optimal-
ity results to the non-rigid case.
Our second key contribution is to represent this volumet-
ric warp efficiently, and compute it in real time. Indeed,
even a relatively low resolution, 256^3 deformation volume
would require 100 million transformation variables to be
computed at frame-rate. Our solution depends on a com-
bination of adaptive, sparse, hierarchical volumetric basis
functions, and innovative algorithmic work to ensure a real-

time solution on commodity hardware. As a result, Dynam-
icFusion is the first system capable of real-time dense recon-
struction in dynamic scenes using a single depth camera.
The remainder of this paper is structured as follows. Af-
ter discussing related work, we present an overview of Dy-
namicFusion in Section 2 and provide technical details in
Section 3. We provide experimental results in Section 4 and
conclude in Section 5.
1. Related Work
While no prior work achieves real-time, template-free,
non-rigid reconstruction, there are two categories of closely
related work: 1) real-time non-rigid tracking algorithms,
and 2) offline dynamic reconstruction techniques.
Real-time non-rigid template tracking. The vast ma-
jority of non-rigid tracking research focuses on human body
parts, for which specialised shape and motion templates are
learnt or manually designed. The best of these demonstrate
high accuracy, real-time performance capture for tracking
faces [16, 3], hands [21, 20], complete bodies [27], or gen-
eral articulated objects [23, 33].
Other techniques directly track and deform more gen-
eral mesh models. [12] demonstrated the ability to track
a statically acquired low resolution shape template and up-
grade its appearance with high frequency geometric details
not present in the original model. Recently, [37] demon-
strated an impressive real-time version of a similar tech-
nique, using GPU accelerated optimisations. In that sys-
tem, a dense surface model of the subject is captured while
remaining static, yielding a template for use in their real-
time tracking pipeline. This separation into template gen-
eration and tracking limits the system to objects and scenes
that are completely static during the geometric reconstruc-
tion phase, precluding reconstruction of things that won’t
reliably hold still (e.g., children or pets).
Offline simultaneous tracking and reconstruction of
dynamic scenes. There is a growing literature on offline
non-rigid tracking and reconstruction techniques. Several
researchers have extended ICP to enable small non-rigid
deformations, e.g., [1, 2]. Practical advancements to pair-
wise 3D shape and scan alignment over larger deformations
make use of reduced deformable model parametrisations
[14, 4]. In particular, embedded deformation graphs [25]
use a sparsely sampled set of transformation basis func-
tions that can be efficiently and densely interpolated over
space. Quasi-rigid reconstruction has also been demon-
strated [15, 35] and hybrid systems, making use of a known
kinematic structure (e.g., a human body), are able to per-
form non-rigid shape denoising [36]. Other work combines
non-rigid mesh template tracking and temporal denoising
and completion [13], but does not obtain a single consistent
representation of the scene.
More closely related to our work are template-free tech-
niques. An intriguing approach to template-free non-rigid
alignment, introduced in [17] and [26], treats each non-
rigid scan as a view from a 4D geometric observation and
performs 4D shape reconstruction. [30, 29] reconstruct
a fixed topology geometry by performing pair-wise scan
alignment. [24] use a space-time solid incompressible flow
prior that results in water tight reconstructions and is ef-
fective against noisy input point-cloud data. [28] intro-
duce animation cartography that also estimates shape and
a per frame deformation by developing a dense correspon-
dence matching scheme that is seeded with sparse landmark
matches. Recent work using multiple fixed Kinect cameras [8, 7] demonstrates larger scale non-rigid reconstruction by
densely tracking and fusing all depth map data into a novel
directional distance function representation.
All of these techniques require three to four orders of
magnitude more time than is available within a real-time
setting.
2. DynamicFusion Overview
DynamicFusion decomposes a non-rigidly deforming scene into a latent geometric surface, reconstructed into a rigid canonical space S ⊆ R^3, and a per-frame volumetric warp field that transforms that surface into the live frame. There are three core algorithmic components to the system that are performed in sequence on arrival of each new depth frame:
1. Estimation of the volumetric model-to-frame warp
field parameters (Section 3.3)
2. Fusion of the live frame depth map into the canonical
space via the estimated warp field (Section 3.2)
3. Adaptation of the warp-field structure to capture newly
added geometry (Section 3.4)
Figure 2 provides an overview; a schematic sketch of this per-frame loop is given below.
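The following Python-style sketch is a minimal illustration of that per-frame loop. The stage functions are passed in as callables and their names are placeholders for the operations of Sections 3.3, 3.2 and 3.4, not the authors' GPU implementation.

```python
from typing import Callable, Tuple

def dynamicfusion_step(depth_frame,
                       canonical_tsdf,
                       warp_field,
                       estimate_warp_field: Callable,   # Section 3.3
                       fuse_depth_map: Callable,        # Section 3.2
                       extend_warp_field: Callable      # Section 3.4
                       ) -> Tuple[object, object]:
    """One DynamicFusion iteration on arrival of a new depth frame (schematic)."""
    # 1. Estimate the volumetric model-to-frame warp field parameters.
    warp_field = estimate_warp_field(warp_field, canonical_tsdf, depth_frame)
    # 2. Fuse the live depth map into the canonical space via the estimated warp field.
    canonical_tsdf = fuse_depth_map(canonical_tsdf, warp_field, depth_frame)
    # 3. Adapt the warp-field structure to capture newly added geometry.
    warp_field = extend_warp_field(warp_field, canonical_tsdf)
    return canonical_tsdf, warp_field
```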
3. Technical Details
We will now describe the components of DynamicFusion
in detail. First, we describe our dense volumetric warp-field
parametrisation. This allows us to model per-frame defor-
mations in the scene. The warp-field is the key extension
over static state space representations used in traditional re-
construction and SLAM systems, and its estimation is the
enabler of both non-rigid tracking and scene reconstruction.
3.1. Dense Non-rigid Warp Field
We represent dynamic scene motion through a volumetric warp-field, providing a per-point 6D transformation W : S → SE(3). Whereas a dense 3D translation field would be sufficient to describe time-varying geometry, we have found that representing the real-world transformation of objects with both translation and rotation results in significantly better tracking and reconstruction. For each canonical point v_c ∈ S, T_lc = W(v_c) transforms that point from canonical space into the live, non-rigidly deformed frame of reference.

Figure 2: (a) Initial frame at t = 0s; (b) raw (noisy) depth maps for frames at t = 1s, 10s, 15s, 20s; (c) node distance; (d) canonical model; (e) canonical model warped into its live frame; (f) model normals. DynamicFusion takes an online stream of noisy depth maps (a, b) and outputs a real-time dense reconstruction of the moving scene (d, e). To achieve this, we estimate a volumetric warp (motion) field that transforms the canonical model space into the live frame, enabling the scene motion to be undone, and all depth maps to be densely fused into a single rigid TSDF reconstruction (d, f). Simultaneously, the structure of the warp field is constructed as a set of sparse 6D transformation nodes that are smoothly interpolated through a k-nearest node average in the canonical frame (c). The resulting per-frame warp field estimate enables the progressively denoised and completed scene geometry to be transformed into the live frame in real-time (e). In (e) we also visualise motion trails for a sub-sample of model vertices over the last 1 second of scene motion, together with a coordinate frame showing the rigid body component of the scene motion. In (c) we render the nearest-node-to-model-surface distance, where increased distance is mapped to a lighter value.
Since we will need to estimate the warp function for each new frame, W_t, its representation must be efficiently optimisable. One possibility is to densely sample the volume, e.g. representing a quantised SE(3) field at the resolution of the truncated signed distance function (TSDF) geometric representation. However, a typical TSDF volume reconstruction at a relatively low resolution of 256^3 voxels would require the solution of 6 × 256^3 parameters per frame, about 10 million times more than in the original KinectFusion algorithm, which only estimates a single rigid transformation. Clearly, a completely dense parametrisation of the warp function is infeasible. In reality, surfaces tend to move smoothly in space, and so we can instead use a sparse set of transformations as bases and define the dense volumetric warp function through interpolation. Due to its computational efficiency and high quality interpolation capability we use dual-quaternion blending (DQB) [11] to define our warp function:

    W(x_c) ≡ SE3(DQB(x_c)),    (1)
where the weighted average over unit dual quaternion transformations is simply

    DQB(x_c) ≡ ( Σ_{k ∈ N(x_c)} w_k(x_c) q̂_kc ) / ‖ Σ_{k ∈ N(x_c)} w_k(x_c) q̂_kc ‖,

with each unit dual-quaternion q̂_kc ∈ R^8. Here, N(x) are the k-nearest transformation nodes to the point x, w_k : R^3 → R defines a weight that alters the radius of influence of each node, and SE3(·) converts from quaternions back to an SE(3) transformation matrix. The state of the warp-field W_t at time t is defined by the values of a set of n deformation nodes N^t_warp = {dg_v, dg_w, dg_se3}_t. Each of the i = 1..n nodes has a position in the canonical frame dg^i_v ∈ R^3, its associated transformation T_ic = dg^i_se3, and a radial basis weight dg^i_w that controls the extent of the transformation:

    w_i(x_c) = exp( −‖dg^i_v − x_c‖^2 / (2 (dg^i_w)^2) ).
Each radius parameter dg^i_w is set to ensure the node's influence overlaps with neighbouring nodes, dependent on the sampling sparsity of nodes, which we describe in detail in Section 3.4. Since the warp function defines a rigid body transformation for all supported space, both position and any associated orientation of space is transformed, e.g., the vertex v_c from a surface with orientation or normal n_c is transformed into the live frame as (v_t^⊤, 1)^⊤ = W_t(v_c)(v_c^⊤, 1)^⊤ and (n_t^⊤, 0)^⊤ = W_t(v_c)(n_c^⊤, 0)^⊤. We note that scaling of space can also be represented with this warp function, since compression and expansion of space are represented by neighbouring points moving in converging and diverging directions. Finally, we note that we can factor out any rigid body transformation common to all points in the volume, e.g., due to camera motion. We therefore introduce the explicit warped-model-to-live-camera transform, T_lw, and compose it onto the volumetric warp function; our complete warp-field is then given as:

    W_t(x_c) = T_lw SE3(DQB(x_c)).    (2)
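As a concrete illustration of Equations (1) and (2), the NumPy sketch below evaluates the warp at a single canonical point. It assumes the nodes are stored as arrays of positions dg_v, radii dg_w and unit dual quaternions (real part first), and it uses the standard dual-quaternion blending approximation that normalises the weighted sum by the norm of its real part; it is not the authors' GPU implementation.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions stored as [w, x, y, z]."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ])

def quat_conj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def quat_to_rot(q):
    """Rotation matrix from a unit quaternion [w, x, y, z]."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def warp_point(x_c, node_pos, node_radius, node_dq, T_lw, k=4):
    """Evaluate W_t(x_c) = T_lw * SE3(DQB(x_c)) at one canonical point.

    node_pos    : (n, 3) node positions dg_v
    node_radius : (n,)   radial-basis widths dg_w
    node_dq     : (n, 8) unit dual quaternions for the node transforms dg_se3
    T_lw        : (4, 4) common rigid (camera) transform
    """
    # k-nearest deformation nodes N(x_c) and their Gaussian weights w_i(x_c).
    d2 = np.sum((node_pos - x_c) ** 2, axis=1)
    knn = np.argsort(d2)[:k]
    w = np.exp(-d2[knn] / (2.0 * node_radius[knn] ** 2))

    # Dual-quaternion blending: weighted sum, normalised by its real part.
    dq = (w[:, None] * node_dq[knn]).sum(axis=0)
    q_r, q_d = dq[:4], dq[4:]
    norm = np.linalg.norm(q_r)
    q_r, q_d = q_r / norm, q_d / norm

    # Convert the blended dual quaternion to an SE(3) matrix.
    R = quat_to_rot(q_r)
    t = 2.0 * quat_mul(q_d, quat_conj(q_r))[1:]   # translation from the dual part
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T_lw @ T
```

Transforming a surface normal uses only the rotational part of the returned matrix, as noted above for (n_t^⊤, 0)^⊤.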
3.2. Dense Non-Rigid Surface Fusion
We now describe how, given the model-to-frame warp field W_t, we update our canonical model geometry. Our reconstruction into the canonical space S is represented by the sampled TSDF V : S → R^2 within a voxel domain S ⊂ N^3. For each voxel x ∈ S, corresponding to the sampled point x_c, the sampled function holds a tuple V(x) → [v(x) ∈ R, w(x) ∈ R]^⊤: a weighted average of all projective TSDF values observed for that point so far, v(x), together with the sum of all associated weights, w(x).
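As one possible in-memory layout for V, the sketch below stores v(x) and w(x) as two dense arrays; the resolution, voxel size and indexing convention are illustrative assumptions, not values from the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TSDFVolume:
    """Sampled TSDF V : S -> R^2, holding [v(x), w(x)] per voxel."""
    resolution: int = 256        # voxels per axis (illustrative)
    voxel_size: float = 0.004    # metres per voxel (illustrative)
    tsdf: np.ndarray = None      # weighted-average projective TSDF values v(x)
    weight: np.ndarray = None    # accumulated fusion weights w(x)

    def __post_init__(self):
        shape = (self.resolution,) * 3
        if self.tsdf is None:
            self.tsdf = np.zeros(shape, dtype=np.float32)
        if self.weight is None:
            self.weight = np.zeros(shape, dtype=np.float32)

    def voxel_to_point(self, x):
        """dc(x): continuous canonical-space centre of the discrete voxel x."""
        return (np.asarray(x, dtype=np.float32) + 0.5) * self.voxel_size
```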
We extend the projective TSDF fusion approach originally introduced by [6] to operate over non-rigidly deforming scenes. Given the live depth image D_t, we transform each voxel center x_c ∈ S by its estimated warp into the live frame, (x_t^⊤, 1)^⊤ = W_t(x_c)(x_c^⊤, 1)^⊤, and carry through the TSDF surface fusion operation by directly projecting the warped center into the depth frame. This allows the TSDF for a point in the canonical frame to be updated by computing the projective TSDF in the deforming frame without having to resample a warped TSDF in the live frame. The projective signed distance at the warped canonical point is:

    psdf(x_c) = [ K^{-1} D_t(u_c) [u_c^⊤, 1]^⊤ ]_z − [x_t]_z,    (3)
where u_c = π(K x_t) is the pixel into which the voxel center projects. We compute distance along the optical (z) axis of the camera frame using the z component, denoted [·]_z. K is the known 3 × 3 camera intrinsic matrix, and π performs perspective projection. For each voxel x, we update the TSDF to incorporate the projective SDF observed in the warped frame using TSDF fusion:

    V(x)_t = [v'(x), w'(x)]^⊤   if psdf(dc(x)) > −τ,
             V(x)_{t−1}          otherwise,    (4)

where dc(·) transforms a discrete voxel point into the continuous TSDF domain. The truncation distance is τ > 0, and the updated TSDF value is given by the weighted averaging scheme [5], with the weight truncation introduced in [18]:

    v'(x) = ( v(x)_{t−1} w(x)_{t−1} + min(ρ, τ) w(x) ) / ( w(x)_{t−1} + w(x) ),   ρ = psdf(dc(x)),
    w'(x) = min( w(x)_{t−1} + w(x), w_max ).    (5)
Unlike the static fusion scenario, where the weight w(x) encodes the uncertainty of the depth value observed at the projected pixel in the depth frame, we also account for uncertainty associated with the warp function at x_c. In the case of the single rigid transformation in original TSDF fusion, we are certain that observed surface regions, free space, and unobserved regions transform equivalently. In our non-rigid case, the further away the point x_c is from an already mapped and observable surface region, the less certain we can be about its transformation. We use the average distance from x_c to its k-nearest deformation nodes as a proxy for this increase in uncertainty and scale: w(x) ∝ (1/k) Σ_{i ∈ N(x_c)} ‖dg^i_v − x_c‖_2. We note that our non-rigid fusion generalises the static reconstruction case used in KinectFusion, replacing the single (rigid) model-to-camera transform with a per-voxel warp that transforms the associated space into the live (non-rigid) frame (see Figure 3). This technique greatly simplifies the non-rigid reconstruction process over methods where all frames are explicitly warped into a canonical frame. Furthermore, given a correct warp field, then, since all TSDF updates are computed using distances in the camera frame, the non-rigid projective TSDF fusion approach maintains the optimality guarantees for surface reconstruction from noisy observations originally proved for the static reconstruction case in [6].
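A minimal per-voxel sketch of this fusion step follows, assuming the TSDFVolume layout and a warp evaluator such as warp_point sketched above; the truncation distance, weight values and rounding of the projected pixel are illustrative choices, and the warp-uncertainty scaling of the fusion weight described above is left out for brevity.

```python
import numpy as np

def fuse_voxel(volume, x, depth, K, warp_point_fn, tau=0.01, w_new=1.0, w_max=64.0):
    """Sketch of the non-rigid TSDF update for one voxel (Equations (3)-(5)).

    volume        : TSDFVolume-like object with .tsdf, .weight and .voxel_to_point()
    x             : voxel index as a tuple (ix, iy, iz)
    depth         : live depth map D_t in metres, shape (H, W)
    K             : 3x3 camera intrinsic matrix
    warp_point_fn : callable returning the 4x4 warp W_t(x_c) at a canonical point
    w_new         : fusion weight of this observation; the paper additionally scales
                    it using the average distance from x_c to its k-nearest
                    deformation nodes as a warp-uncertainty proxy (omitted here)
    """
    x_c = volume.voxel_to_point(x)                          # dc(x): voxel centre in canonical space
    x_t = (warp_point_fn(x_c) @ np.append(x_c, 1.0))[:3]    # warp the centre into the live frame

    # u_c = pi(K x_t): pixel into which the warped voxel centre projects.
    u = K @ x_t
    px, py = int(round(u[0] / u[2])), int(round(u[1] / u[2]))
    if not (0 <= py < depth.shape[0] and 0 <= px < depth.shape[1]) or depth[py, px] <= 0:
        return

    # Projective signed distance along the optical axis, Equation (3).
    x_obs = depth[py, px] * (np.linalg.inv(K) @ np.array([px, py, 1.0]))
    psdf = x_obs[2] - x_t[2]
    if psdf <= -tau:                                        # far behind the surface: leave V(x) unchanged
        return

    # Weighted-average update with weight truncation, Equations (4)-(5).
    v_old, w_old = volume.tsdf[x], volume.weight[x]
    volume.tsdf[x] = (v_old * w_old + min(psdf, tau) * w_new) / (w_old + w_new)
    volume.weight[x] = min(w_old + w_new, w_max)
```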
3.3. Estimating the Warp-field State W_t

We estimate the current values of the transformations dg_se3 in W_t, given a newly observed depth map D_t and the current reconstruction V, by constructing an energy function that is minimised by our desired parameters:

    E(W_t, V, D_t, E) = Data(W_t, V, D_t) + λ Reg(W_t, E).    (6)
Our data term consists of a dense model-to-frame ICP cost Data(W_t, V, D_t), which is coupled with a regularisation term Reg(W_t, E) that penalises non-smooth motion fields and ensures as-rigid-as-possible deformation between transformation nodes connected by the edge set E. The coupling of a data term formed from linearly blended transformations with an as-rigid-as-possible graph-based regularisation is a form of the embedded deformation graph model introduced in [25]. The regularisation parameter λ enables a trade-off between relaxing rigidity over the field when given high quality data, and ensuring a smooth, consistent deformation of unobserved or noisily observed regions of space. We define these terms in the next subsections.
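Schematically, Equation (6) combines the two terms as below; data_term and reg_term are placeholders for the costs defined in Sections 3.3.1 and 3.3.2 (sketched later) and are passed in as callables rather than fixed to any particular implementation.

```python
def warp_energy(params, data_term, reg_term, lam):
    """E(W_t, V, D_t, E) = Data(W_t, V, D_t) + lambda * Reg(W_t, E), Equation (6).

    params    : current warp-field parameters (node transforms dg_se3 and T_lw)
    data_term : callable(params) -> dense model-to-frame ICP cost (Section 3.3.1)
    reg_term  : callable(params) -> as-rigid-as-possible graph cost (Section 3.3.2)
    lam       : regularisation trade-off weight lambda
    """
    return data_term(params) + lam * reg_term(params)
```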
3.3.1 Dense Non-Rigid ICP Data-term
Our aim is to estimate all non-rigid transformation parameters T_ic and T_lw that warp the canonical volume into the live frame. We achieve this by performing a dense non-rigid alignment of the current surface reconstruction, extracted from the canonical volume's zero level set, into the live frame's depth map.
Surface Prediction and Data-Association: The current zero level set of the TSDF V is extracted by marching cubes and stored as a polygon mesh with point-normal pairs in the canonical frame: V̂_c ≡ {V_c, N_c}. We non-rigidly transform this mesh into the live frame using the current warp field W_t, resulting in the warped point-normals V̂_w.

Figure 3: An illustration of how each point in the canonical frame maps, through a correct warp field, onto a ray in the live camera frame when observing a deforming scene. Left group (non-rigid scene deformation): (a) live frame t = 0; (b) live frame t = 1; (c) canonical → live. Right group (introducing an occlusion): (d) live frame t = 0; (e) live frame t = 1; (f) canonical → live. In (a) the first view of a dynamic scene is observed. In the corresponding canonical frame, the warp is initialized to the identity transform and the three rays shown in the live frame also map as straight lines in the canonical frame. As the scene deforms in the live frame (b), the warp function transforms each point from the canonical frame into the corresponding live frame location, causing the corresponding rays to bend (c). Note that this warp can be achieved with two 6D deformation nodes (shown as circles), where the left node applies a clockwise twist. In (d) we show a new scene that includes a cube that is about to occlude the bar. In the live frame (e), as the cube occludes a portion of the bar, the points in the canonical frame (f) are warped to correctly pass through the cube.
We obtain an initial estimate for data-association (correspondence) between the model geometry and the live frame by rendering the warped surface V̂_w into the live frame, shaded with the canonical frame vertex positions, using a rasterizing rendering pipeline. This results in a prediction of the canonical frame geometry that is currently predicted to be visible in the live frame: P(V̂_c). We store this prediction as a pair of images {v, n} : Ω → P(V̂_c), where Ω is the pixel domain of the predicted images, storing the rendered canonical frame vertices and normals.
Given optimal transformation parameters for the current time frame, the predicted-to-be-visible geometry should transform close, modulo observation noise, to the live surface vl : Ω → R^3, formed by back-projection of the depth image, vl(u) = K^{-1} D_t(u) [u^⊤, 1]^⊤. This can be quantified by a per-pixel dense model-to-frame point-plane error, which we compute under the robust Tukey penalty function ψ_data, summed over the predicted image domain Ω:

    Data(W, V, D_t) ≡ Σ_{u ∈ Ω} ψ_data( n̂_u^⊤ ( v̂_u − vl(ũ) ) ).    (7)

The warp applied at the model vertex v(u) is simply T̃_u = W(v(u)), producing the current canonical-to-live-frame point-normal predictions v̂_u = T̃_u v(u) and n̂_u = T̃_u n(u); data-association of that model point-normal with a live frame point-normal is made through perspective projection into the pixel ũ = π(K v̂_u).
We note that, ignoring the negligible cost of rendering the geometry V̂_w, the ability to extract, predict, and perform projective data association with the currently visible canonical geometry leads to a data-term evaluation whose computational complexity is bounded by the number of pixels in the observation image. Furthermore, each data-term summand depends only on a subset of the n transformations when computing W, and the region over which each node has a numerically significant impact on the error function is compact. In practice, the result is a computational cost similar to a single rigid-body dense projective point-plane data-term evaluation (as used in KinectFusion).
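The sketch below evaluates this point-plane cost on the CPU for illustration. It assumes the predicted canonical vertex and normal maps {v, n} have already been rendered, that a warp evaluator such as the warp_point sketch above is available, and that the Tukey width is an arbitrary illustrative value.

```python
import numpy as np

def tukey(r, c):
    """Tukey biweight penalty psi_data applied to a scalar residual r."""
    if abs(r) <= c:
        return (c * c / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return c * c / 6.0

def data_term(pred_vertex, pred_normal, valid, depth, K, warp_point_fn, c=0.01):
    """Dense model-to-frame point-plane cost of Equation (7) (illustrative sketch).

    pred_vertex, pred_normal : (H, W, 3) canonical-frame vertex/normal maps v(u), n(u)
    valid                    : (H, W) mask of pixels with a rendered prediction
    depth                    : (H, W) live depth map D_t in metres
    K                        : 3x3 camera intrinsics
    warp_point_fn            : callable returning the 4x4 warp W_t(v) at a canonical point
    c                        : Tukey width (illustrative)
    """
    K_inv = np.linalg.inv(K)
    H, W = depth.shape
    energy = 0.0
    for u in zip(*np.nonzero(valid)):
        T = warp_point_fn(pred_vertex[u])                   # T~_u = W_t(v(u))
        v_hat = (T @ np.append(pred_vertex[u], 1.0))[:3]    # warped model vertex
        n_hat = T[:3, :3] @ pred_normal[u]                  # warped model normal

        # Projective data association: u~ = pi(K v_hat).
        p = K @ v_hat
        px, py = int(round(p[0] / p[2])), int(round(p[1] / p[2]))
        if not (0 <= py < H and 0 <= px < W) or depth[py, px] <= 0:
            continue

        # Back-project the associated live depth pixel: vl(u~).
        vl = depth[py, px] * (K_inv @ np.array([px, py, 1.0]))

        # Point-plane residual under the robust Tukey penalty.
        energy += tukey(float(n_hat @ (v_hat - vl)), c)
    return energy
```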
3.3.2 Warp-field Regularization
It is crucial for our non-rigid TSDF fusion technique to es-
timate a deformation not only of currently visible surfaces,
but over all space within S. This enables reconstruction
of new regions of the scene surface that are about to come
into view. However, nodes affecting canonical space within
which no currently observed surface resides will have no
associated data term. In any case, noise, missing data and
insufficient geometric texture in the live frame (an analogue to the aperture problem in optical flow) will result in the optimisation of the transform parameters being ill-posed.
How should we constrain the motion of non-observed ge-
ometry? Whilst the fully correct motion depends on object
dynamics and, where applicable, the subject’s volition, we
make use of a simpler model of unobserved geometry: that
it deforms in a piece-wise smooth way.
We use a deformation graph based regularization defined between transformation nodes, where an edge in the graph between nodes i and j adds a rigid-as-possible regularisation term to the total error being minimized, under the discontinuity preserving Huber penalty ψ_reg. The total regularisation term sums over all pair-wise connected nodes:

    Reg(W, E) ≡ Σ_{i=0}^{n} Σ_{j ∈ E(i)} α_ij ψ_reg( T_ic dg^j_v − T_jc dg^j_v ),    (8)

where E defines the regularisation graph topology, and α_ij defines the weight associated with the edge, which we set to α_ij = max(dg^i_w, dg^j_w).
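A direct transcription of Equation (8) is sketched below, with the node transforms, positions, radii and edge set passed in explicitly; the Huber threshold is an illustrative assumption.

```python
import numpy as np

def reg_term(node_transforms, node_pos, node_radius, edges, huber_c=0.01):
    """As-rigid-as-possible graph regularisation of Equation (8) (illustrative sketch).

    node_transforms : (n, 4, 4) current node transforms T_ic
    node_pos        : (n, 3) node positions dg_v
    node_radius     : (n,)   node radii dg_w, used for the edge weights alpha_ij
    edges           : iterable of (i, j) index pairs defining the graph topology E
    huber_c         : Huber threshold (illustrative)
    """
    def huber(r, c):
        # Discontinuity-preserving Huber penalty psi_reg on the residual norm r.
        return 0.5 * r * r if r <= c else c * (r - 0.5 * c)

    energy = 0.0
    for i, j in edges:
        alpha = max(node_radius[i], node_radius[j])
        dg_j = np.append(node_pos[j], 1.0)
        # Residual: how differently nodes i and j transform node j's position.
        r = (node_transforms[i] @ dg_j)[:3] - (node_transforms[j] @ dg_j)[:3]
        energy += alpha * huber(np.linalg.norm(r), huber_c)
    return energy
```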

References (selected)
R. A. Newcombe et al. KinectFusion: Real-time dense surface mapping and tracking. In Proc. ISMAR, 2011.
B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proc. SIGGRAPH, 1996.
S. Izadi et al. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proc. UIST, 2011.
I. Oikonomidis et al. Efficient model-based 3D tracking of hand articulations using Kinect. In Proc. BMVC, 2011.
M. Nießner et al. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics, 2013.
In the preceding subsections the authors defined how the canonical space can be deformed through W (Section 3.1), introduced the optimisation required to estimate warp-field state through time (3.3), and showed how, given an estimated warp field, the authors can incrementally update the canonical surface geometry (3.2).