DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time
Richard A. Newcombe
newcombe@cs.washington.edu
Dieter Fox
fox@cs.washington.edu
University of Washington, Seattle
Steven M. Seitz
seitz@cs.washington.edu
Figure 1: Real-time reconstructions of a moving scene with DynamicFusion; both the person and the camera are moving. The initially
noisy and incomplete model is progressively denoised and completed over time (left to right).
Abstract
We present the first dense SLAM system capable of re-
constructing non-rigidly deforming scenes in real-time, by
fusing together RGBD scans captured from commodity sen-
sors. Our DynamicFusion approach reconstructs scene ge-
ometry whilst simultaneously estimating a dense volumet-
ric 6D motion field that warps the estimated geometry into
a live frame. Like KinectFusion, our system produces in-
creasingly denoised, detailed, and complete reconstructions
as more measurements are fused, and displays the updated
model in real time. Because we do not require a template
or other prior scene model, the approach is applicable to a
wide range of moving objects and scenes.
3D scanning traditionally involves separate capture and
off-line processing phases, requiring very careful planning
of the capture to make sure that every surface is cov-
ered. In practice, it’s very difficult to avoid holes, requir-
ing several iterations of capture, reconstruction, identifying
holes, and recapturing missing regions to ensure a complete
model. Real-time 3D reconstruction systems like KinectFu-
sion [18, 10] represent a major advance, by providing users
the ability to instantly see the reconstruction and identify
regions that remain to be scanned. KinectFusion spurred a
flurry of follow up research aimed at robustifying the track-
ing [9, 32] and expanding its spatial mapping capabilities to
larger environments [22, 19, 34, 31, 9].
However, as with all traditional SLAM and dense re-
construction systems, the most basic assumption behind
KinectFusion is that the observed scene is largely static.
The core question we tackle in this paper is: How can we
generalise KinectFusion to reconstruct and track dynamic,
non-rigid scenes in real-time? To that end, we introduce
DynamicFusion, an approach based on solving for a vol-
umetric flow field that transforms the state of the scene at
each time instant into a fixed, canonical frame. In the case
of a moving person, for example, this transformation un-
does the person’s motion, warping each body configuration
into the pose of the first frame. Following these warps, the
scene is effectively rigid, and standard KinectFusion up-
dates can be used to obtain a high quality, denoised recon-
struction. This progressively denoised reconstruction can
then be transformed back into the live frame using the in-
verse map; each point in the canonical frame is transformed
to its location in the live frame (see Figure 1).
Defining a canonical “rigid” space for a dynamically
moving scene is not straightforward. A key contribution
of our work is an approach for non-rigid transformation and
fusion that retains the optimality properties of volumetric
scan fusion [5], developed originally for rigid scenes. The
main insight is that undoing the scene motion to enable fu-
sion of all observations into a single fixed frame can be
achieved efficiently by computing the inverse map alone.
Under this transformation, each canonical point projects
along a line of sight in the live camera frame. Since the
optimality arguments of [5] (developed for rigid scenes) de-
pend only on lines of sight, we can generalize their optimal-
ity results to the non-rigid case.
Our second key contribution is to represent this volumet-
ric warp efficiently, and compute it in real time. Indeed,
even a relatively low resolution, 256^3 deformation volume
would require 100 million transformation variables to be
computed at frame-rate. Our solution depends on a com-
bination of adaptive, sparse, hierarchical volumetric basis
functions, and innovative algorithmic work to ensure a real-

time solution on commodity hardware. As a result, Dynam-
icFusion is the first system capable of real-time dense recon-
struction in dynamic scenes using a single depth camera.
The remainder of this paper is structured as follows. Af-
ter discussing related work, we present an overview of Dy-
namicFusion in Section 2 and provide technical details in
Section 3. We provide experimental results in Section 4 and
conclude in Section 5.
1. Related Work
While no prior work achieves real-time, template-free,
non-rigid reconstruction, there are two categories of closely
related work: 1) real-time non-rigid tracking algorithms,
and 2) offline dynamic reconstruction techniques.
Real-time non-rigid template tracking. The vast ma-
jority of non-rigid tracking research focuses on human body
parts, for which specialised shape and motion templates are
learnt or manually designed. The best of these demonstrate
high accuracy, real-time performance capture for tracking
faces [16, 3], hands [21, 20], complete bodies [27], or gen-
eral articulated objects [23, 33].
Other techniques directly track and deform more gen-
eral mesh models. [12] demonstrated the ability to track
a statically acquired low resolution shape template and up-
grade its appearance with high frequency geometric details
not present in the original model. Recently, [37] demon-
strated an impressive real-time version of a similar tech-
nique, using GPU accelerated optimisations. In that sys-
tem, a dense surface model of the subject is captured while
remaining static, yielding a template for use in their real-
time tracking pipeline. This separation into template gen-
eration and tracking limits the system to objects and scenes
that are completely static during the geometric reconstruc-
tion phase, precluding reconstruction of things that won’t
reliably hold still (e.g., children or pets).
Offline simultaneous tracking and reconstruction of
dynamic scenes. There is a growing literature on offline
non-rigid tracking and reconstruction techniques. Several
researchers have extended ICP to enable small non-rigid
deformations, e.g., [1, 2]. Practical advancements to pair-
wise 3D shape and scan alignment over larger deformations
make use of reduced deformable model parametrisations
[14, 4]. In particular, embedded deformation graphs [25]
use a sparsely sampled set of transformation basis func-
tions that can be efficiently and densely interpolated over
space. Quasi-rigid reconstruction has also been demon-
strated [15, 35] and hybrid systems, making use of a known
kinematic structure (e.g., a human body), are able to per-
form non-rigid shape denoising [36]. Other work combines
non-rigid mesh template tracking and temporal denoising
and completion [13], but does not obtain a single consistent
representation of the scene.
More closely related to our work are template-free tech-
niques. An intriguing approach to template-free non-rigid
alignment, introduced in [17] and [26], treats each non-
rigid scan as a view from a 4D geometric observation and
performs 4D shape reconstruction. [30, 29] reconstruct
a fixed topology geometry by performing pair-wise scan
alignment. [24] use a space-time solid incompressible flow
prior that results in water tight reconstructions and is ef-
fective against noisy input point-cloud data. [28] intro-
duce animation cartography that also estimates shape and
a per frame deformation by developing a dense correspon-
dence matching scheme that is seeded with sparse landmark
matches. Recent work using multiple fixed Kinect cameras [8, 7] demonstrates larger scale non-rigid reconstruction by
densely tracking and fusing all depth map data into a novel
directional distance function representation.
All of these techniques require three to four orders of
magnitude more time than is available within a real-time
setting.
2. DynamicFusion Overview
DynamicFusion decomposes a non-rigidly deforming scene into a latent geometric surface, reconstructed into a rigid canonical space S ⊆ R^3, and a per-frame volumetric warp field that transforms that surface into the live frame. There are three core algorithmic components to the system that are performed in sequence on arrival of each new depth frame:
1. Estimation of the volumetric model-to-frame warp
field parameters (Section 3.3)
2. Fusion of the live frame depth map into the canonical
space via the estimated warp field (Section 3.2)
3. Adaptation of the warp-field structure to capture newly
added geometry (Section 3.4)
Figure 2 provides an overview; a schematic sketch of this per-frame loop is given below.
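The following Python-style sketch is a minimal illustration of that per-frame loop. The stage functions are passed in as callables and their names are placeholders for the operations of Sections 3.3, 3.2 and 3.4, not the authors' GPU implementation.

```python
from typing import Callable, Tuple

def dynamicfusion_step(depth_frame,
                       canonical_tsdf,
                       warp_field,
                       estimate_warp_field: Callable,   # Section 3.3
                       fuse_depth_map: Callable,        # Section 3.2
                       extend_warp_field: Callable      # Section 3.4
                       ) -> Tuple[object, object]:
    """One DynamicFusion iteration on arrival of a new depth frame (schematic)."""
    # 1. Estimate the volumetric model-to-frame warp field parameters.
    warp_field = estimate_warp_field(warp_field, canonical_tsdf, depth_frame)
    # 2. Fuse the live depth map into the canonical space via the estimated warp field.
    canonical_tsdf = fuse_depth_map(canonical_tsdf, warp_field, depth_frame)
    # 3. Adapt the warp-field structure to capture newly added geometry.
    warp_field = extend_warp_field(warp_field, canonical_tsdf)
    return canonical_tsdf, warp_field
```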
3. Technical Details
We will now describe the components of DynamicFusion
in detail. First, we describe our dense volumetric warp-field
parametrisation. This allows us to model per-frame defor-
mations in the scene. The warp-field is the key extension
over static state space representations used in traditional re-
construction and SLAM systems, and its estimation is the
enabler of both non-rigid tracking and scene reconstruction.
3.1. Dense Non-rigid Warp Field
We represent dynamic scene motion through a volumetric warp-field, providing a per-point 6D transformation W : S → SE(3). Whereas a dense 3D translation field would be sufficient to describe time-varying geometry, we have found that representing the real-world transformation of objects with both translation and rotation results in significantly better tracking and reconstruction. For each canonical point v_c ∈ S, T_lc = W(v_c) transforms that point from canonical space into the live, non-rigidly deformed frame of reference.

Figure 2: (a) Initial frame at t = 0s; (b) raw (noisy) depth maps for frames at t = 1s, 10s, 15s, 20s; (c) node distance; (d) canonical model; (e) canonical model warped into its live frame; (f) model normals. DynamicFusion takes an online stream of noisy depth maps (a, b) and outputs a real-time dense reconstruction of the moving scene (d, e). To achieve this, we estimate a volumetric warp (motion) field that transforms the canonical model space into the live frame, enabling the scene motion to be undone, and all depth maps to be densely fused into a single rigid TSDF reconstruction (d, f). Simultaneously, the structure of the warp field is constructed as a set of sparse 6D transformation nodes that are smoothly interpolated through a k-nearest node average in the canonical frame (c). The resulting per-frame warp field estimate enables the progressively denoised and completed scene geometry to be transformed into the live frame in real-time (e). In (e) we also visualise motion trails for a sub-sample of model vertices over the last 1 second of scene motion, together with a coordinate frame showing the rigid body component of the scene motion. In (c) we render the nearest-node-to-model-surface distance, where increased distance is mapped to a lighter value.
Since we will need to estimate the warp function for each new frame, W_t, its representation must be efficiently optimisable. One possibility is to densely sample the volume, e.g. representing a quantised SE(3) field at the resolution of the truncated signed distance function (TSDF) geometric representation. However, a typical TSDF volume reconstruction at a relatively low resolution of 256^3 voxels would require the solution of 6 × 256^3 parameters per frame, about 10 million times more than in the original KinectFusion algorithm, which only estimates a single rigid transformation. Clearly, a completely dense parametrisation of the warp function is infeasible. In reality, surfaces tend to move smoothly in space, and so we can instead use a sparse set of transformations as bases and define the dense volumetric warp function through interpolation. Due to its computational efficiency and high quality interpolation capability we use dual-quaternion blending (DQB) [11] to define our warp function:

    W(x_c) ≡ SE3(DQB(x_c)),    (1)
where the weighted average over unit dual quaternion transformations is simply

    DQB(x_c) ≡ ( Σ_{k ∈ N(x_c)} w_k(x_c) q̂_kc ) / ‖ Σ_{k ∈ N(x_c)} w_k(x_c) q̂_kc ‖,

with each unit dual-quaternion q̂_kc ∈ R^8. Here, N(x) are the k-nearest transformation nodes to the point x, w_k : R^3 → R defines a weight that alters the radius of influence of each node, and SE3(·) converts from quaternions back to an SE(3) transformation matrix. The state of the warp-field W_t at time t is defined by the values of a set of n deformation nodes N^t_warp = {dg_v, dg_w, dg_se3}_t. Each of the i = 1..n nodes has a position in the canonical frame dg^i_v ∈ R^3, its associated transformation T_ic = dg^i_se3, and a radial basis weight dg^i_w that controls the extent of the transformation:

    w_i(x_c) = exp( −‖dg^i_v − x_c‖^2 / (2 (dg^i_w)^2) ).
Each radius parameter dg^i_w is set to ensure the node's influence overlaps with neighbouring nodes, dependent on the sampling sparsity of nodes, which we describe in detail in Section 3.4. Since the warp function defines a rigid body transformation for all supported space, both position and any associated orientation of space is transformed, e.g., the vertex v_c from a surface with orientation or normal n_c is transformed into the live frame as (v_t^⊤, 1)^⊤ = W_t(v_c)(v_c^⊤, 1)^⊤ and (n_t^⊤, 0)^⊤ = W_t(v_c)(n_c^⊤, 0)^⊤. We note that scaling of space can also be represented with this warp function, since compression and expansion of space are represented by neighbouring points moving in converging and diverging directions. Finally, we note that we can factor out any rigid body transformation common to all points in the volume, e.g., due to camera motion. We therefore introduce the explicit warped-model-to-live-camera transform, T_lw, and compose it onto the volumetric warp function; our complete warp-field is then given as:

    W_t(x_c) = T_lw SE3(DQB(x_c)).    (2)
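As a concrete illustration of Equations (1) and (2), the NumPy sketch below evaluates the warp at a single canonical point. It assumes the nodes are stored as arrays of positions dg_v, radii dg_w and unit dual quaternions (real part first), and it uses the standard dual-quaternion blending approximation that normalises the weighted sum by the norm of its real part; it is not the authors' GPU implementation.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions stored as [w, x, y, z]."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ])

def quat_conj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def quat_to_rot(q):
    """Rotation matrix from a unit quaternion [w, x, y, z]."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def warp_point(x_c, node_pos, node_radius, node_dq, T_lw, k=4):
    """Evaluate W_t(x_c) = T_lw * SE3(DQB(x_c)) at one canonical point.

    node_pos    : (n, 3) node positions dg_v
    node_radius : (n,)   radial-basis widths dg_w
    node_dq     : (n, 8) unit dual quaternions for the node transforms dg_se3
    T_lw        : (4, 4) common rigid (camera) transform
    """
    # k-nearest deformation nodes N(x_c) and their Gaussian weights w_i(x_c).
    d2 = np.sum((node_pos - x_c) ** 2, axis=1)
    knn = np.argsort(d2)[:k]
    w = np.exp(-d2[knn] / (2.0 * node_radius[knn] ** 2))

    # Dual-quaternion blending: weighted sum, normalised by its real part.
    dq = (w[:, None] * node_dq[knn]).sum(axis=0)
    q_r, q_d = dq[:4], dq[4:]
    norm = np.linalg.norm(q_r)
    q_r, q_d = q_r / norm, q_d / norm

    # Convert the blended dual quaternion to an SE(3) matrix.
    R = quat_to_rot(q_r)
    t = 2.0 * quat_mul(q_d, quat_conj(q_r))[1:]   # translation from the dual part
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T_lw @ T
```

Transforming a surface normal uses only the rotational part of the returned matrix, as noted above for (n_t^⊤, 0)^⊤.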
3.2. Dense Non-Rigid Surface Fusion
We now describe how, given the model-to-frame warp field W_t, we update our canonical model geometry. Our reconstruction into the canonical space S is represented by the sampled TSDF V : S → R^2 within a voxel domain S ⊂ N^3. For each voxel x ∈ S, corresponding to the sampled point x_c, the sampled function holds a tuple V(x) → [v(x) ∈ R, w(x) ∈ R]^⊤: a weighted average of all projective TSDF values observed for that point so far, v(x), together with the sum of all associated weights, w(x).
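As one possible in-memory layout for V, the sketch below stores v(x) and w(x) as two dense arrays; the resolution, voxel size and indexing convention are illustrative assumptions, not values from the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TSDFVolume:
    """Sampled TSDF V : S -> R^2, holding [v(x), w(x)] per voxel."""
    resolution: int = 256        # voxels per axis (illustrative)
    voxel_size: float = 0.004    # metres per voxel (illustrative)
    tsdf: np.ndarray = None      # weighted-average projective TSDF values v(x)
    weight: np.ndarray = None    # accumulated fusion weights w(x)

    def __post_init__(self):
        shape = (self.resolution,) * 3
        if self.tsdf is None:
            self.tsdf = np.zeros(shape, dtype=np.float32)
        if self.weight is None:
            self.weight = np.zeros(shape, dtype=np.float32)

    def voxel_to_point(self, x):
        """dc(x): continuous canonical-space centre of the discrete voxel x."""
        return (np.asarray(x, dtype=np.float32) + 0.5) * self.voxel_size
```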
We extend the projective TSDF fusion approach originally introduced by [6] to operate over non-rigidly deforming scenes. Given the live depth image D_t, we transform each voxel center x_c ∈ S by its estimated warp into the live frame, (x_t^⊤, 1)^⊤ = W_t(x_c)(x_c^⊤, 1)^⊤, and carry through the TSDF surface fusion operation by directly projecting the warped center into the depth frame. This allows the TSDF for a point in the canonical frame to be updated by computing the projective TSDF in the deforming frame without having to resample a warped TSDF in the live frame. The projective signed distance at the warped canonical point is:

    psdf(x_c) = [ K^{-1} D_t(u_c) [u_c^⊤, 1]^⊤ ]_z − [x_t]_z,    (3)
where u_c = π(K x_t) is the pixel into which the voxel center projects. We compute distance along the optical (z) axis of the camera frame using the z component, denoted [·]_z. K is the known 3 × 3 camera intrinsic matrix, and π performs perspective projection. For each voxel x, we update the TSDF to incorporate the projective SDF observed in the warped frame using TSDF fusion:

    V(x)_t = [v'(x), w'(x)]^⊤   if psdf(dc(x)) > −τ,
             V(x)_{t−1}          otherwise,    (4)

where dc(·) transforms a discrete voxel point into the continuous TSDF domain. The truncation distance is τ > 0, and the updated TSDF value is given by the weighted averaging scheme [5], with the weight truncation introduced in [18]:

    v'(x) = ( v(x)_{t−1} w(x)_{t−1} + min(ρ, τ) w(x) ) / ( w(x)_{t−1} + w(x) ),   ρ = psdf(dc(x)),
    w'(x) = min( w(x)_{t−1} + w(x), w_max ).    (5)
Unlike the static fusion scenario, where the weight w(x) encodes the uncertainty of the depth value observed at the projected pixel in the depth frame, we also account for uncertainty associated with the warp function at x_c. In the case of the single rigid transformation in original TSDF fusion, we are certain that observed surface regions, free space, and unobserved regions transform equivalently. In our non-rigid case, the further away the point x_c is from an already mapped and observable surface region, the less certain we can be about its transformation. We use the average distance from x_c to its k-nearest deformation nodes as a proxy for this increase in uncertainty and scale: w(x) ∝ (1/k) Σ_{i ∈ N(x_c)} ‖dg^i_v − x_c‖_2. We note that our non-rigid fusion generalises the static reconstruction case used in KinectFusion, replacing the single (rigid) model-to-camera transform with a per-voxel warp that transforms the associated space into the live (non-rigid) frame (see Figure 3). This technique greatly simplifies the non-rigid reconstruction process over methods where all frames are explicitly warped into a canonical frame. Furthermore, given a correct warp field, then, since all TSDF updates are computed using distances in the camera frame, the non-rigid projective TSDF fusion approach maintains the optimality guarantees for surface reconstruction from noisy observations originally proved for the static reconstruction case in [6].
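A minimal per-voxel sketch of this fusion step follows, assuming the TSDFVolume layout and a warp evaluator such as warp_point sketched above; the truncation distance, weight values and rounding of the projected pixel are illustrative choices, and the warp-uncertainty scaling of the fusion weight described above is left out for brevity.

```python
import numpy as np

def fuse_voxel(volume, x, depth, K, warp_point_fn, tau=0.01, w_new=1.0, w_max=64.0):
    """Sketch of the non-rigid TSDF update for one voxel (Equations (3)-(5)).

    volume        : TSDFVolume-like object with .tsdf, .weight and .voxel_to_point()
    x             : voxel index as a tuple (ix, iy, iz)
    depth         : live depth map D_t in metres, shape (H, W)
    K             : 3x3 camera intrinsic matrix
    warp_point_fn : callable returning the 4x4 warp W_t(x_c) at a canonical point
    w_new         : fusion weight of this observation; the paper additionally scales
                    it using the average distance from x_c to its k-nearest
                    deformation nodes as a warp-uncertainty proxy (omitted here)
    """
    x_c = volume.voxel_to_point(x)                          # dc(x): voxel centre in canonical space
    x_t = (warp_point_fn(x_c) @ np.append(x_c, 1.0))[:3]    # warp the centre into the live frame

    # u_c = pi(K x_t): pixel into which the warped voxel centre projects.
    u = K @ x_t
    px, py = int(round(u[0] / u[2])), int(round(u[1] / u[2]))
    if not (0 <= py < depth.shape[0] and 0 <= px < depth.shape[1]) or depth[py, px] <= 0:
        return

    # Projective signed distance along the optical axis, Equation (3).
    x_obs = depth[py, px] * (np.linalg.inv(K) @ np.array([px, py, 1.0]))
    psdf = x_obs[2] - x_t[2]
    if psdf <= -tau:                                        # far behind the surface: leave V(x) unchanged
        return

    # Weighted-average update with weight truncation, Equations (4)-(5).
    v_old, w_old = volume.tsdf[x], volume.weight[x]
    volume.tsdf[x] = (v_old * w_old + min(psdf, tau) * w_new) / (w_old + w_new)
    volume.weight[x] = min(w_old + w_new, w_max)
```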
3.3. Estimating the Warp-field State W_t

We estimate the current values of the transformations dg_se3 in W_t, given a newly observed depth map D_t and the current reconstruction V, by constructing an energy function that is minimised by our desired parameters:

    E(W_t, V, D_t, E) = Data(W_t, V, D_t) + λ Reg(W_t, E).    (6)
Our data term consists of a dense model-to-frame ICP cost Data(W_t, V, D_t), which is coupled with a regularisation term Reg(W_t, E) that penalises non-smooth motion fields and ensures as-rigid-as-possible deformation between transformation nodes connected by the edge set E. The coupling of a data term formed from linearly blended transformations with an as-rigid-as-possible graph-based regularisation is a form of the embedded deformation graph model introduced in [25]. The regularisation parameter λ enables a trade-off between relaxing rigidity over the field when given high quality data, and ensuring a smooth, consistent deformation of unobserved or noisily observed regions of space. We define these terms in the next subsections.
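Schematically, Equation (6) combines the two terms as below; data_term and reg_term are placeholders for the costs defined in Sections 3.3.1 and 3.3.2 (sketched later) and are passed in as callables rather than fixed to any particular implementation.

```python
def warp_energy(params, data_term, reg_term, lam):
    """E(W_t, V, D_t, E) = Data(W_t, V, D_t) + lambda * Reg(W_t, E), Equation (6).

    params    : current warp-field parameters (node transforms dg_se3 and T_lw)
    data_term : callable(params) -> dense model-to-frame ICP cost (Section 3.3.1)
    reg_term  : callable(params) -> as-rigid-as-possible graph cost (Section 3.3.2)
    lam       : regularisation trade-off weight lambda
    """
    return data_term(params) + lam * reg_term(params)
```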
3.3.1 Dense Non-Rigid ICP Data-term
Our aim is to estimate all non-rigid transformation parameters T_ic and T_lw that warp the canonical volume into the live frame. We achieve this by performing a dense non-rigid alignment of the current surface reconstruction, extracted from the canonical volume's zero level set, into the live frame's depth map.
Surface Prediction and Data-Association: The current zero level set of the TSDF V is extracted by marching cubes and stored as a polygon mesh with point-normal pairs in the canonical frame: V̂_c ≡ {V_c, N_c}. We non-rigidly transform this mesh into the live frame using the current warp field W_t, resulting in the warped point-normals V̂_w.

Figure 3: An illustration of how each point in the canonical frame maps, through a correct warp field, onto a ray in the live camera frame when observing a deforming scene. Left group (non-rigid scene deformation): (a) live frame t = 0; (b) live frame t = 1; (c) canonical → live. Right group (introducing an occlusion): (d) live frame t = 0; (e) live frame t = 1; (f) canonical → live. In (a) the first view of a dynamic scene is observed. In the corresponding canonical frame, the warp is initialized to the identity transform and the three rays shown in the live frame also map as straight lines in the canonical frame. As the scene deforms in the live frame (b), the warp function transforms each point from the canonical frame into the corresponding live frame location, causing the corresponding rays to bend (c). Note that this warp can be achieved with two 6D deformation nodes (shown as circles), where the left node applies a clockwise twist. In (d) we show a new scene that includes a cube that is about to occlude the bar. In the live frame (e), as the cube occludes a portion of the bar, the points in the canonical frame (f) are warped to correctly pass through the cube.
We obtain an initial estimate for data-association (correspondence) between the model geometry and the live frame by rendering the warped surface V̂_w into the live frame, shaded with the canonical frame vertex positions, using a rasterizing rendering pipeline. This results in a prediction of the canonical frame geometry that is currently predicted to be visible in the live frame: P(V̂_c). We store this prediction as a pair of images {v, n} : Ω → P(V̂_c), where Ω is the pixel domain of the predicted images, storing the rendered canonical frame vertices and normals.
Given optimal transformation parameters for the current time frame, the predicted-to-be-visible geometry should transform close, modulo observation noise, to the live surface vl : Ω → R^3, formed by back-projection of the depth image, vl(u) = K^{-1} D_t(u) [u^⊤, 1]^⊤. This can be quantified by a per-pixel dense model-to-frame point-plane error, which we compute under the robust Tukey penalty function ψ_data, summed over the predicted image domain Ω:

    Data(W, V, D_t) ≡ Σ_{u ∈ Ω} ψ_data( n̂_u^⊤ ( v̂_u − vl(ũ) ) ).    (7)

The warp applied at the model vertex v(u) is simply T̃_u = W(v(u)), producing the current canonical-to-live-frame point-normal predictions v̂_u = T̃_u v(u) and n̂_u = T̃_u n(u); data-association of that model point-normal with a live frame point-normal is made through perspective projection into the pixel ũ = π(K v̂_u).
We note that, ignoring the negligible cost of rendering the geometry V̂_w, the ability to extract, predict, and perform projective data association with the currently visible canonical geometry leads to a data-term evaluation whose computational complexity is bounded by the number of pixels in the observation image. Furthermore, each data-term summand depends only on a subset of the n transformations when computing W, and the region over which each node has a numerically significant impact on the error function is compact. In practice, the result is a computational cost similar to a single rigid-body dense projective point-plane data-term evaluation (as used in KinectFusion).
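The sketch below evaluates this point-plane cost on the CPU for illustration. It assumes the predicted canonical vertex and normal maps {v, n} have already been rendered, that a warp evaluator such as the warp_point sketch above is available, and that the Tukey width is an arbitrary illustrative value.

```python
import numpy as np

def tukey(r, c):
    """Tukey biweight penalty psi_data applied to a scalar residual r."""
    if abs(r) <= c:
        return (c * c / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return c * c / 6.0

def data_term(pred_vertex, pred_normal, valid, depth, K, warp_point_fn, c=0.01):
    """Dense model-to-frame point-plane cost of Equation (7) (illustrative sketch).

    pred_vertex, pred_normal : (H, W, 3) canonical-frame vertex/normal maps v(u), n(u)
    valid                    : (H, W) mask of pixels with a rendered prediction
    depth                    : (H, W) live depth map D_t in metres
    K                        : 3x3 camera intrinsics
    warp_point_fn            : callable returning the 4x4 warp W_t(v) at a canonical point
    c                        : Tukey width (illustrative)
    """
    K_inv = np.linalg.inv(K)
    H, W = depth.shape
    energy = 0.0
    for u in zip(*np.nonzero(valid)):
        T = warp_point_fn(pred_vertex[u])                   # T~_u = W_t(v(u))
        v_hat = (T @ np.append(pred_vertex[u], 1.0))[:3]    # warped model vertex
        n_hat = T[:3, :3] @ pred_normal[u]                  # warped model normal

        # Projective data association: u~ = pi(K v_hat).
        p = K @ v_hat
        px, py = int(round(p[0] / p[2])), int(round(p[1] / p[2]))
        if not (0 <= py < H and 0 <= px < W) or depth[py, px] <= 0:
            continue

        # Back-project the associated live depth pixel: vl(u~).
        vl = depth[py, px] * (K_inv @ np.array([px, py, 1.0]))

        # Point-plane residual under the robust Tukey penalty.
        energy += tukey(float(n_hat @ (v_hat - vl)), c)
    return energy
```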
3.3.2 Warp-field Regularization
It is crucial for our non-rigid TSDF fusion technique to es-
timate a deformation not only of currently visible surfaces,
but over all space within S. This enables reconstruction
of new regions of the scene surface that are about to come
into view. However, nodes affecting canonical space within
which no currently observed surface resides will have no
associated data term. In any case, noise, missing data and
insufficient geometric texture in the live frame (an analogue to the aperture problem in optical flow) will result in the optimisation of the transform parameters being ill-posed.
How should we constrain the motion of non-observed ge-
ometry? Whilst the fully correct motion depends on object
dynamics and, where applicable, the subject’s volition, we
make use of a simpler model of unobserved geometry: that
it deforms in a piece-wise smooth way.
We use a deformation graph based regularization defined between transformation nodes, where an edge in the graph between nodes i and j adds a rigid-as-possible regularisation term to the total error being minimized, under the discontinuity preserving Huber penalty ψ_reg. The total regularisation term sums over all pair-wise connected nodes:

    Reg(W, E) ≡ Σ_{i=0}^{n} Σ_{j ∈ E(i)} α_ij ψ_reg( T_ic dg^j_v − T_jc dg^j_v ),    (8)

where E defines the regularisation graph topology, and α_ij defines the weight associated with the edge, which we set to α_ij = max(dg^i_w, dg^j_w).
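A direct transcription of Equation (8) is sketched below, with the node transforms, positions, radii and edge set passed in explicitly; the Huber threshold is an illustrative assumption.

```python
import numpy as np

def reg_term(node_transforms, node_pos, node_radius, edges, huber_c=0.01):
    """As-rigid-as-possible graph regularisation of Equation (8) (illustrative sketch).

    node_transforms : (n, 4, 4) current node transforms T_ic
    node_pos        : (n, 3) node positions dg_v
    node_radius     : (n,)   node radii dg_w, used for the edge weights alpha_ij
    edges           : iterable of (i, j) index pairs defining the graph topology E
    huber_c         : Huber threshold (illustrative)
    """
    def huber(r, c):
        # Discontinuity-preserving Huber penalty psi_reg on the residual norm r.
        return 0.5 * r * r if r <= c else c * (r - 0.5 * c)

    energy = 0.0
    for i, j in edges:
        alpha = max(node_radius[i], node_radius[j])
        dg_j = np.append(node_pos[j], 1.0)
        # Residual: how differently nodes i and j transform node j's position.
        r = (node_transforms[i] @ dg_j)[:3] - (node_transforms[j] @ dg_j)[:3]
        energy += alpha * huber(np.linalg.norm(r), huber_c)
    return energy
```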

References (selected)
R. A. Newcombe et al. KinectFusion: Real-time dense surface mapping and tracking. In Proc. ISMAR, 2011.
B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proc. SIGGRAPH, 1996.
S. Izadi et al. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proc. UIST, 2011.
I. Oikonomidis et al. Efficient model-based 3D tracking of hand articulations using Kinect. In Proc. BMVC, 2011.
M. Nießner et al. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics, 2013.
In the preceding subsections the authors defined how the canonical space can be deformed through W (Section 3.1), introduced the optimisation required to estimate warp-field state through time (3.3), and showed how, given an estimated warp field, the authors can incrementally update the canonical surface geometry (3.2).