Proceedings ArticleDOI

Image Guided Depth Upsampling Using Anisotropic Total Generalized Variation

01 Dec 2013, pp. 993-1000
TL;DR: This work formulates a convex optimization problem using higher order regularization for depth image upsampling, and derives a numerical algorithm based on a primal-dual formulation that is efficiently parallelized and runs at multiple frames per second.
Abstract: In this work we present a novel method for the challenging problem of depth image upsampling. Modern depth cameras such as Kinect or Time-of-Flight cameras deliver dense, high quality depth measurements but are limited in their lateral resolution. To overcome this limitation we formulate a convex optimization problem using higher order regularization for depth image upsampling. In this optimization an anisotropic diffusion tensor, calculated from a high resolution intensity image, is used to guide the upsampling. We derive a numerical algorithm based on a primal-dual formulation that is efficiently parallelized and runs at multiple frames per second. We show that this novel upsampling clearly outperforms state of the art approaches in terms of speed and accuracy on the widely used Middlebury 2007 datasets. Furthermore, we introduce novel datasets with highly accurate ground truth, which, for the first time, enable benchmarking of depth upsampling methods using real sensor data.

Summary (2 min read)

1. Introduction

  • Accurate, high resolution depth sensing is a fundamental challenge in computer vision.
  • Upsampling of a low resolution depth image (a) using an additional high resolution intensity image (b) through image guided anisotropic Total Generalized Variation (c).
  • A small packet size and a low energy consumption make ToF sensors applicable in mobile devices.

3.1. Depth Mapping

  • Since the low resolution depth map DL and the high resolution intensity image IH stem from different cameras, a mapping can only be established when intrinsic and extrinsic parameters are known (see Section 4.2).
  • In their setup the authors define the intensity camera as the world coordinate center.
  • Each depth measurement d_{i,j} at pixel position x_{i,j} = [i, j, 1]^T is projected into the high resolution intensity image space Ω_H.
  • Therewith, the authors minimize the error which can occur due to this averaging in the high resolution space.
  • Through the regularization term, introduced in Section 3.2, the area between the projected depth pixels is implicitly interpolated.

3.2. Depth Image Upsampling

  • The authors' upsampling method increases the resolution of measured depth data from a low resolution depth sensor by adding edge cues from a high resolution intensity image.
  • Their formulation (Eq. 2) is composed of the data term G(u, DS) that measures the fidelity of the argument u to the input depth measurements DS and the regularization term F(u) that reflects prior knowledge of the smoothness of the solution.
  • F and G are convex lower semi-continuous functions.
  • Because the TGV regularizer is convex, it allows computing a globally optimal solution.
  • Including an anisotropic diffusion tensor in their TGV model, the authors can penalize high depth discontinuities in homogeneous regions and allow sharp depth edges at corresponding texture differences.

3.3. Primal-Dual Optimization

  • The proposed optimization problem (6) is convex but non-smooth due to the TGV regularization term and the zeros in the weighting operator w.
  • To find a fast, globally optimal solution for their problem the authors use the primal-dual energy minimization scheme, as proposed in [2, 6].
  • The authors reformulate the non-smooth problem as a convex-concave saddle-point problem by applying the Legendre-Fenchel transform (LF).
  • First, the dual variables are updated using gradient ascent; second, the primal variables are updated using gradient descent; third, the primal variables are refined in an over-relaxation step.

4. Evaluation

  • The authors show a quantitative and qualitative evaluation of their upsampling method.
  • In the visual comparison (Figure 3), their upsampling method using image guided anisotropic TGV (f) removes noise while preserving sharp object edges.
  • The non-local means result (e) removes noise but suffers from edge bleeding, especially at small structure boundaries.

4.1. Middlebury Benchmark Evaluation

  • An exhaustive evaluation of their method in terms of quantitative and qualitative comparison is made using input images from the Middlebury datasets [10, 20].
  • The authors use the disparity image as groundtruth and the original RGB intensity image as input for their anisotropic diffusion tensor.
  • This experiment gives an objective comparison of the robustness, accuracy and speed of a variety of different algorithms.
  • While the Middlebury datasets are popular to evaluate depth upsampling methods, they neglect some important properties of real acquisition setups.
  • Typically, depth and intensity data do not originate from the same sensor and are therefore not aligned.

4.2. Benchmarking based on Real Sensor Data

  • The evaluation on real acquisitions is made using different scenes acquired with a Time of Flight (ToF) and an intensity camera simultaneously.
  • The rotation and translation between intensity and ToF camera is estimated by establishing a geometric correspondence through the feature points on the planar target.
  • Through a comparison of the very accurate 3D measurements of the calibration points and the measured ToF depth points a dependence between the acquired IR amplitude image and the measurement error can be established, as shown in Figure 4.
  • Using their depth calibration, the authors can compensate for that error (see green/dashed box).
  • In the visual and numerical results it can be seen that their method delivers high quality upsampling results at multiple frames per second for an approximate upsampling factor of ×6.25.

5. Conclusion

  • The authors presented a depth upsampling method that combines a low cost 3D sensor with an additional high resolution 2D sensor.
  • The upsampling is formulated as a global energy optimization problem using Total Generalized Variation (TGV) regularization.
  • For fast numerical optimization the authors use a first order primal-dual algorithm, which is efficiently parallelized resulting in high frame rates.
  • In a quantitative evaluation using widespread datasets the authors show that their method clearly outperforms existing state of the art methods in terms of speed and quality.
  • The authors further provide benchmarking datasets of real world scenes providing a highly accurate groundtruth that, for the first time, enable a real quality comparison of depth image upsampling methods.


Image Guided Depth Upsampling using Anisotropic Total Generalized Variation
David Ferstl, Christian Reinbacher, Rene Ranftl, Matthias Rüther and Horst Bischof
Graz University of Technology
Institute for Computer Graphics and Vision
Inffeldgasse 16, 8010 Graz, AUSTRIA
{ferstl,reinbacher,ranftl,ruether,bischof}@icg.tugraz.at
Abstract
In this work we present a novel method for the challenging problem of depth image upsampling. Modern depth cameras such as Kinect or Time of Flight cameras deliver dense, high quality depth measurements but are limited in their lateral resolution. To overcome this limitation we formulate a convex optimization problem using higher order regularization for depth image upsampling. In this optimization an anisotropic diffusion tensor, calculated from a high resolution intensity image, is used to guide the upsampling. We derive a numerical algorithm based on a primal-dual formulation that is efficiently parallelized and runs at multiple frames per second. We show that this novel upsampling clearly outperforms state of the art approaches in terms of speed and accuracy on the widely used Middlebury 2007 datasets. Furthermore, we introduce novel datasets with highly accurate groundtruth, which, for the first time, enable benchmarking of depth upsampling methods using real sensor data.
1. Introduction
Accurate, high resolution depth sensing is a fundamental challenge in computer vision. It is used in a variety of different applications including object reconstruction, robotic navigation and automotive driver assistance. Traditional computer vision approaches calculate the scene depth through computationally exhaustive stereo calculations or expensive laser range measurements.

Recently, Time of Flight (ToF) range sensors became a popular alternative for dense depth sensing. A per-pixel depth is measured actively through the runtime of light. The measurement is independent from scene texture and largely independent from environmental lighting conditions. It delivers a dense depth map even at very close ranges [12, 21]. No additional calculations are necessary, which results in depth measurements at high frame rates. Recently, ToF sensors have become affordable in the mass market, and a small packet size and a low energy consumption make them applicable in mobile devices. However, their main disadvantages are a low resolution caused by chip size limitations and acquisition noise due to limited active illumination energy.

Figure 1. Upsampling of a low resolution depth image (a) using an additional high resolution intensity image (b) through image guided anisotropic Total Generalized Variation (c). Depth maps are color coded for better visualization.
In this work, we propose a method to drastically increase the lateral measurement resolution by a novel depth map upsampling approach, as shown in Figure 1. To increase both quality and resolution, we add information from a high resolution intensity camera in a variational optimization framework. We build on the observation that textural edges are more likely to appear at high depth discontinuities, whereas homogeneously textured regions correspond to homogeneous surface parts [23]. Fusing both low resolution but very robust depth and high resolution intensity in a spatial sense results in a dense depth map with increased lateral resolution and visual quality.

We formulate the upsampling as a convex optimization problem [2, 6]. The energy is composed of two terms. First, the data term forces the solution to be similar to the input depth measurements. Second, the higher order regularization term enforces a piecewise affine solution, preserving sharp edges according to the texture, while compensating acquisition noise. This term is modeled as a second order Total Generalized Variation (TGV) regularization and is weighted according to the intensity image texture by an anisotropic diffusion tensor.

The main contributions of this work are two-fold: (1) We propose a novel method for fast depth image upsampling by combining a low resolution depth image with high resolution texture information in a variational energy optimization framework. The employed higher order regularization is well suited to model the image acquisition process of modern depth cameras and leads to an improved quality of the upsampled depth maps, compared to state of the art methods. (2) We propose benchmarking datasets that enable a quantitative comparison of depth image upsampling methods, providing real ToF and intensity camera acquisitions together with a highly accurate groundtruth measurement. To encourage further comparison and future work, these novel datasets and MATLAB code of our method are available at our website (http://rvlab.icg.tugraz.at/tofmark).

In our experiments we demonstrate the upsampling quality by a numerical and visual comparison on synthetic and real benchmarking datasets. Compared to state of the art methods, our method is superior in terms of speed and accuracy on all test sets.
2. Related Work
There are many ways to increase the resolution and the accuracy of depth measurements. In general, they can be separated into three main classes: (1) fusion of multiple depth sensors, (2) temporal and spatial fusion and (3) upsampling by combining depth and intensity sensors.

Multiple Depth Sensor Fusion. Recent works addressed the fusion of different depth sensing techniques to increase resolution and quality. Gudmundsson et al. [8] presented a method for stereo and Time of Flight (ToF) depth map fusion in a dynamic programming approach. Similar work has been proposed by Zhu et al. [26] using an accurate depth calibration and fusing the measurements in a Markov Random Field (MRF) framework. In addition to this spatial fusion, a temporal fusion was also performed by measuring the frame-to-frame displacement acquired with high speed intensity cameras.
Temporal and Spatial Upsampling. A common way to improve the resolution and quality of depth information is to fuse multiple depth measurements into one depth map. Schuon et al. [22] proposed a method to fuse ToF acquisitions of slightly moved viewpoints. It uses a bilateral regularization in an MRF optimization framework, also incorporating the ToF sensor characteristics. Based on this work, Cui et al. [4] used a set of fused depth maps with larger displacements. To create whole volumes of depth data, Newcombe et al. [14] proposed a method for simultaneous camera localization and depth fusion in real time.

Depth Upsampling through Intensity Information. This class of approaches uses additional intensity information as a depth cue for image upsampling. Yang et al. [24] used bilateral filtering of a depth cost volume and an RGB image in an iterative refinement process. Chan et al. [3] used a noise aware joint bilateral filter to increase the resolution and to reduce depth map errors at multiple frames per second. Diebel and Thrun [5] performed an upsampling using an MRF formulation, where the smoothness term is weighted according to texture derivatives. A more complex approach was proposed by Park et al. [15]. They used a combination of different weighting terms in a least squares optimization including segmentation, image gradients, edge saliency and non-local means for depth upsampling. The combination of intensity and depth data in a Bayesian framework was proposed by Li et al. [13].
Discussion. While the methods for multiple sensor fusion deliver accurate depth results, their quality relies on high calibration effort. Further, most sensor fusion techniques have to calculate a depth map from passive stereo in a preprocessing step before the actual fusion is able to start. In contrast, temporal and spatial fusion approaches rely on multiple acquisitions from a single depth sensor. The major drawback of these methods is that changing environments during these acquisitions will harm the fusion result.

To overcome these limitations, we chose the combination of a low resolution depth and a high resolution intensity sensor to increase the natural depth sensor resolution. The upsampling is calculated on a per image basis without the need for complex preprocessing. Existing approaches, such as [3, 24], calculate this depth upsampling by bilateral filtering. While bilateral filtering techniques can operate at high frame rates, they have a drawback in oversmoothing fine details. In contrast, our method builds on the success of recently introduced upsampling methods using MRF and least squares optimization [5, 15]. Unlike them, our approach incorporates a higher order regularization, which avoids surface flattening. Furthermore, we use an anisotropic diffusion tensor based on the intensity image. This tensor not only weights the depth gradient but also orients the gradient direction during the optimization process.

3. Method
Our upsampling approach generates a high quality and high resolution depth map $D_H$ out of a high resolution intensity image $I_H$ and a low resolution and noisy depth map $D_L$, where $I_H, D_H$ are defined on $\Omega_H \subset \mathbb{R}^2$ and $D_L$ on $\Omega_L \subset \mathbb{R}^2$. The methodology of this approach can be divided into three main areas: (1) registering the low resolution depth measurements and the high resolution intensity information in one common coordinate system (Section 3.1), (2) formulating the depth upsampling problem as a convex energy functional (Section 3.2), and (3) solving the optimization problem with a first-order primal-dual optimization scheme (Section 3.3).
3.1. Depth Mapping
Since the low resolution depth map $D_L$ and the high resolution intensity image $I_H$ stem from different cameras, a mapping can only be established when intrinsic and extrinsic parameters are known (see Section 4.2). In our setup we define the intensity camera as the world coordinate center. Each depth measurement $d_{i,j}$ at pixel position $x_{i,j} = [i, j, 1]^T$ is projected into the high resolution intensity image space $\Omega_H$. This projection is calculated as

$$X_{i,j} = C_L + d_{i,j} \frac{P_L^\dagger x_{i,j}}{\|P_L^\dagger x_{i,j}\|}, \qquad \tilde{x}_{i,j} = P_H X_{i,j} \quad \forall\, i,j \in \Omega_L, \tag{1}$$

where $P_L^\dagger$ is the pseudoinverse of the depth camera projection matrix, $C_L$ the camera center and $X_{i,j}$ the 3D point. Each 3D point is back projected by multiplication with the projection matrix of the intensity camera $P_H$. Hence, we get a projected depth image $D_S$ consisting of a sparse set of base depth points at positions $\tilde{x}_{i,j}$ in the intensity image space $\Omega_H$, where the depth value is given by the distance to the 3D point $X_{i,j}$ (see Figure 2).
Figure 2. Projection from a low resolution depth map $D_L$ to a high resolution sparse depth map $D_S$ in the intensity camera coordinate system.
Although one low resolution sensor pixel $D_L(i,j)$ measures the average depth of multiple pixels in the high resolution space, we only project it to one central pixel $D_S(i,j)$ at position $\tilde{x}_{i,j}$. Therewith, we minimize the error which can occur due to this averaging in the high resolution space. Through the regularization term, introduced in Section 3.2, the area between the projected depth pixels is implicitly interpolated.
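The mapping in Eq. (1) is straightforward to prototype. Below is a minimal NumPy sketch of this projection step (our own illustration, not the authors' MATLAB code); the calibration inputs (`P_L_pinv`, a Euclidean camera center `C_L`, `P_H`) and all function and variable names are assumptions, and for clarity the ray direction is formed by subtracting the camera center from the back-projected point rather than normalizing $P_L^\dagger x$ directly:

```python
import numpy as np

def project_depth_to_intensity(D_L, P_L_pinv, C_L, P_H, H, W):
    """Sketch of Eq. (1): map low resolution depth measurements into the
    high resolution intensity image space, yielding a sparse depth map D_S
    and a binary confidence map w (1 on base points, 0 elsewhere)."""
    D_S = np.zeros((H, W))
    w = np.zeros((H, W))
    h_lo, w_lo = D_L.shape
    for i in range(h_lo):
        for j in range(w_lo):
            d = D_L[i, j]
            if d <= 0:                       # skip invalid measurements
                continue
            x = np.array([j, i, 1.0])        # homogeneous pixel coordinate
            X_ray = P_L_pinv @ x             # point on the viewing ray (4-vector)
            ray = X_ray[:3] / X_ray[3] - C_L           # ray direction
            X = C_L + d * ray / np.linalg.norm(ray)    # 3D point at depth d
            x_h = P_H @ np.append(X, 1.0)    # project into intensity camera
            c, r = x_h[0] / x_h[2], x_h[1] / x_h[2]
            r, c = int(round(r)), int(round(c))
            if 0 <= r < H and 0 <= c < W:
                # depth value = distance to the 3D point (the intensity
                # camera is the world coordinate center in this setup)
                D_S[r, c] = np.linalg.norm(X)
                w[r, c] = 1.0
    return D_S, w
```

Each low resolution measurement lands on a single central high resolution pixel, matching the sparse projection described above; the gaps between base points are left to the regularizer.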
3.2. Depth Image Upsampling
Our upsampling method increases the resolution of measured depth data from a low resolution depth sensor by adding edge cues from a high resolution intensity image. To be able to use both sources of information, we map the depth measurements to the intensity camera coordinate system as described in Section 3.1. With this mapping we get a depth map $D_S$ of a sparse set of base depth measurements from the low resolution depth sensor.

The high resolution depth map $D_H$ is given by

$$D_H = \arg\min_u \left\{ G(u, D_S) + \alpha F(u) \right\}. \tag{2}$$
This formulation is composed of the data term $G(u, D_S)$ that measures the fidelity of the argument $u$ to the input depth measurements $D_S$ and the regularization term $F(u)$ that reflects prior knowledge of the smoothness of our solution. $F$ and $G$ are convex lower semi-continuous functions. The scalar $\alpha$ is used to balance the relative weight between the data term and the regularization.

The data term in our energy model is designed to ensure data consistency to the base depth points $D_S$ from the depth camera. Additionally, we allow to weight the depth measurements with a weighting operator $w \in [0, 1]^{\Omega_H}$, which is zero at unmapped image points and between zero and one on the base points according to some application specific confidence. Hence, the data term results in

$$G(u, D_S) = \int_{\Omega_H} w\,|u - D_S|^2 \, dx, \tag{3}$$

which penalizes deviations of the resulting depth from the measured depth.
The regularization term has to meet the challenges of producing a high resolution depth map out of a sparse set of depth points. Most currently utilized regularization terms are based on the first order smoothness assumption [19], e.g. the Total Variation semi-norm, which results in $F(u) = \|\nabla u\|_1$. While this simple model with an L1 norm is well suited for intensity image denoising, it has a disadvantage when used for range data regularization. Through its gradient penalization it favors constant solutions. This prevents the depth map from becoming a piecewise smooth surface, resulting in piecewise fronto-parallel depth reconstructions. Hence, we use a more generalized regularization model, namely the Total Generalized Variation (TGV) introduced by Bredies et al. [1]. The TGV is composed of polynomials of arbitrary order, which allows to reconstruct piecewise polynomial functions. An order of $k$ favors solutions composed of polynomials of order $k-1$. For depth upsampling, it turns out that the second order TGV is sufficient, since most objects can be well approximated by piecewise affine surfaces. The primal definition of the second order TGV is formulated as

$$\mathrm{TGV}_\alpha^2(u) = \min_v \left\{ \alpha_1 \int_\Omega |\nabla u - v| \, dx + \alpha_0 \int_\Omega |\nabla v| \, dx \right\}, \tag{4}$$
where the scalars $\alpha_0$ and $\alpha_1$ are used to weight each order. Because the TGV regularizer is convex, it allows computing a globally optimal solution.
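To make the piecewise affine preference in (4) concrete, here is a short sanity check (our own illustration, not from the paper): for an affine depth ramp $u(x) = a^\top x + b$ we have $\nabla u \equiv a$, so choosing the auxiliary field $v = a$ gives

$$\alpha_1 \int_\Omega |\nabla u - v| \, dx = 0 \quad \text{and} \quad \alpha_0 \int_\Omega |\nabla v| \, dx = 0, \qquad \text{hence } \mathrm{TGV}_\alpha^2(u) = 0,$$

whereas plain TV charges $\int_\Omega |\nabla u| \, dx = |a|\,|\Omega| > 0$ for any non-zero slope $a$. Sloped surfaces are therefore free under the second order TGV prior, which is exactly what avoids the fronto-parallel bias described above.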
Assuming that texture edges most likely correspond to depth discontinuities, we use the high resolution intensity data to produce a more accurate upsampling result. Henceforth, we include an anisotropic diffusion tensor $T^{1/2}$. This tensor is calculated as

$$T^{1/2} = \exp\left(-\beta\, |\nabla I_H|^{\gamma}\right) n n^T + n^{\perp} {n^{\perp}}^T, \tag{5}$$

where $n$ is the normalized direction of the image gradient, $n = \frac{\nabla I_H}{|\nabla I_H|}$, $n^{\perp}$ is the normal vector to the gradient, and the scalars $\beta, \gamma$ adjust the magnitude and the sharpness of the tensor. The anisotropic diffusion tensor not only weights the first order depth gradient but also orients the gradient direction during the optimization process.
Including this tensor in our TGV model, we can penalize high depth discontinuities in homogeneous regions and allow sharp depth edges at corresponding texture differences. A similar combination of TGV and weighting was used by Ranftl et al. [18] for passive stereo reconstruction. With the additional edge tensor information the optimization result leads to sharper and more defined edges in our solution. Further, the regions where the depth data is interpolated are filled out more reasonably.
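As a concrete reference, here is a compact NumPy sketch of the tensor in Eq. (5), assuming a grayscale guidance image and simple finite-difference gradients; the function name and the default values for $\beta$ and $\gamma$ are our own placeholders, not values from the paper:

```python
import numpy as np

def anisotropic_diffusion_tensor(I_H, beta=9.0, gamma=0.85, eps=1e-8):
    """Per-pixel 2x2 tensor T^(1/2) from Eq. (5); returns shape (H, W, 2, 2)."""
    gy, gx = np.gradient(I_H.astype(np.float64))
    mag = np.sqrt(gx ** 2 + gy ** 2)
    # normalized gradient direction n and its orthogonal n_perp
    n = np.stack([gx, gy], axis=-1) / (mag[..., None] + eps)
    n_perp = np.stack([-n[..., 1], n[..., 0]], axis=-1)
    # edge weight: small across strong intensity edges, ~1 in flat regions
    w_edge = np.exp(-beta * mag ** gamma)
    # T^(1/2) = w_edge * n n^T + n_perp n_perp^T  (outer products per pixel)
    T = (w_edge[..., None, None] * n[..., :, None] * n[..., None, :]
         + n_perp[..., :, None] * n_perp[..., None, :])
    return T
```

Across an intensity edge the tensor shrinks the penalty on depth changes along $n$ (allowing a depth discontinuity there) while keeping the full penalty along $n^{\perp}$, which is exactly the anisotropic guidance described above.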
The final energy is defined as a combination of the data term (3) and the TGV term (4) with anisotropic diffusion (5):

$$\min_{u,v} \left\{ \alpha_1 \int_{\Omega_H} |T^{1/2}(\nabla u - v)| \, dx + \alpha_0 \int_{\Omega_H} |\nabla v| \, dx + \int_{\Omega_H} w\,|u - D_S|^2 \, dx \right\}. \tag{6}$$
3.3. Primal-Dual Optimization
The proposed optimization problem (6) is convex but non-smooth due to the TGV regularization term and the zeros in the weighting operator $w$. To find a fast, globally optimal solution for our problem we use the primal-dual energy minimization scheme, as proposed in [2, 6]. We reformulate the non-smooth problem as a convex-concave saddle-point problem by applying the Legendre-Fenchel transform (LF). The optimization problem can then be efficiently minimized through gradient descent. The transformed saddle-point problem of our energy functional (6) is given by

$$\min_{u \in \mathbb{R}^{MN},\, v \in \mathbb{R}^{2MN}} \; \max_{p \in P,\, q \in Q} \left\{ \alpha_1 \langle T^{1/2}(\nabla u - v), p \rangle + \alpha_0 \langle \nabla v, q \rangle + \sum_{i,j \in \Omega_H} w_{i,j} (u_{i,j} - D_{S\,i,j})^2 \right\}, \tag{7}$$

introducing the dual variables $p$ and $q$. The feasible sets of these variables are defined by

$$P = \left\{ p : \Omega_H \to \mathbb{R}^2 \;\middle|\; \|p_{i,j}\| \le 1 \;\forall\, i,j \in \Omega_H \right\}, \tag{8}$$

$$Q = \left\{ q : \Omega_H \to \mathbb{R}^4 \;\middle|\; \|q_{i,j}\| \le 1 \;\forall\, i,j \in \Omega_H \right\}. \tag{9}$$
This formulation is used in the primal-dual algorithm, where the primal and dual variables are iteratively optimized for the individual pixels in three steps. First, the dual variables $p$ and $q$ are updated using gradient ascent. Second, the primal variables are updated using gradient descent. Third, the primal variables are refined in an over-relaxation step. The step sizes are chosen such that $u^0 = D_S$, $v^0, p^0, q^0 = 0$, $\sigma_p > 0$, $\sigma_q > 0$, $\tau_u > 0$ and $\tau_v > 0$. For any iteration $n \ge 0$ the steps are calculated according to
$$\begin{aligned}
p^{n+1} &= \mathcal{P}_p \left\{ p^n + \sigma_p \alpha_1 \left( T^{1/2} (\nabla \bar{u}^n - \bar{v}^n) \right) \right\} \\
q^{n+1} &= \mathcal{P}_q \left\{ q^n + \sigma_q \alpha_0 \nabla \bar{v}^n \right\} \\
u^{n+1} &= \frac{u^n + \tau_u \left( \alpha_1 \nabla^T T^{1/2} p^{n+1} + w D_S \right)}{1 + \tau_u w} \\
v^{n+1} &= v^n + \tau_v \left( \alpha_0 \nabla^T q^{n+1} + \alpha_1 T^{1/2} p^{n+1} \right) \\
\bar{u}^{n+1} &= u^{n+1} + \theta (u^{n+1} - \bar{u}^n) \\
\bar{v}^{n+1} &= v^{n+1} + \theta (v^{n+1} - \bar{v}^n)
\end{aligned} \tag{10}$$
until a stopping criterion is reached. To fulfill the convex optimality condition in the dual update step, the projection operators $\mathcal{P}_p$ and $\mathcal{P}_q$ for $p$ and $q$ are calculated through

$$\mathcal{P}_p\{\tilde{p}_{i,j}\} = \frac{\tilde{p}_{i,j}}{\max(1, |\tilde{p}_{i,j}|)}, \qquad \mathcal{P}_q\{\tilde{q}_{i,j}\} = \frac{\tilde{q}_{i,j}}{\max(1, |\tilde{q}_{i,j}|)}. \tag{11}$$
In practice the relaxation parameter $\theta$ is updated in every iteration, according to [2], and the optimal step sizes are calculated using preconditioning, as proposed in [17]. Therewith, we achieve a fast and guaranteed convergence to the globally optimal solution for different tensor conditions. The gradient and divergence operators are approximated using forward/backward differences with Neumann and Dirichlet boundary conditions, respectively.
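For concreteness, the following is a minimal, unoptimized NumPy sketch of the iteration (10) with the reprojections (11). It uses fixed scalar step sizes rather than the preconditioned steps of [17], the standard Chambolle-Pock extrapolation against $u^n$ (Eq. (10) as printed extrapolates against $\bar{u}^n$), and writes $\nabla^T$ as the negative divergence; all names are our own:

```python
import numpy as np

def grad(u):
    """Forward differences, Neumann boundary; (H, W) -> (H, W, 2)."""
    g = np.zeros(u.shape + (2,))
    g[:, :-1, 0] = u[:, 1:] - u[:, :-1]
    g[:-1, :, 1] = u[1:, :] - u[:-1, :]
    return g

def div(p):
    """Negative adjoint of grad (backward differences, Dirichlet boundary)."""
    d = np.zeros(p.shape[:2])
    d[:, 0] += p[:, 0, 0]
    d[:, 1:-1] += p[:, 1:-1, 0] - p[:, :-2, 0]
    d[:, -1] -= p[:, -2, 0]
    d[0, :] += p[0, :, 1]
    d[1:-1, :] += p[1:-1, :, 1] - p[:-2, :, 1]
    d[-1, :] -= p[-2, :, 1]
    return d

def proj_unit(x):
    """Pointwise reprojection x / max(1, |x|), Eq. (11)."""
    n = np.sqrt((x ** 2).sum(axis=tuple(range(2, x.ndim)), keepdims=True))
    return x / np.maximum(1.0, n)

def tgv2_upsample(D_S, w, T, alpha0, alpha1, sigma=0.125, tau=0.125,
                  theta=1.0, iters=1000):
    """Primal-dual minimization of the energy (6), following Eq. (10)."""
    u = D_S.copy(); u_bar = u.copy()
    v = np.zeros(D_S.shape + (2,)); v_bar = v.copy()
    p = np.zeros_like(v)
    q = np.zeros(D_S.shape + (2, 2))
    for _ in range(iters):
        # dual ascent with reprojection
        p = proj_unit(p + sigma * alpha1 *
                      np.einsum('...ij,...j->...i', T, grad(u_bar) - v_bar))
        grad_v = np.stack([grad(v_bar[..., 0]), grad(v_bar[..., 1])], axis=2)
        q = proj_unit(q + sigma * alpha0 * grad_v)
        # primal descent (div = -grad^T)
        Tp = np.einsum('...ij,...j->...i', T, p)
        u_new = (u + tau * (alpha1 * div(Tp) + w * D_S)) / (1.0 + tau * w)
        div_q = np.stack([div(q[..., 0, :]), div(q[..., 1, :])], axis=-1)
        v_new = v + tau * (alpha0 * div_q + alpha1 * Tp)
        # over-relaxation
        u_bar = u_new + theta * (u_new - u)
        v_bar = v_new + theta * (v_new - v)
        u, v = u_new, v_new
    return u
```

Here `T` is the per-pixel tensor from the previous sketch and `w`, `D_S` come from the projection step. Every update is an independent per-pixel operation, which is what makes the GPU parallelization and the frame rates reported in the paper possible.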
4. Evaluation
In this section, we show a quantitative and qualitative evaluation of our upsampling method. For an extensive evaluation we investigate the performance compared to state of the art approaches on the simulated Middlebury 2007 datasets [10, 20] in terms of speed and accuracy. Beyond these simulations, we evaluate our method on real data with highly accurate groundtruth measurements.

Method                 Art                      Books                    Moebius                 Avg. Time [s]
                       x2    x4    x8    x16   x2    x4    x8    x16   x2    x4    x8    x16
Nearest                4.65  5.01  5.71  7.10  4.30  4.68  4.85  5.23  5.08  5.20  5.31  5.65      -
Bilinear               3.09  3.59  4.39  5.91  2.91  3.12  3.34  3.71  3.21  3.45  3.62  4.00      -
Yang et al. [24]       1.36  1.93  2.45  4.52  1.12  1.47  1.81  2.92  1.25  1.63  2.06  3.21      -
He et al. [9]          1.92  2.40  3.32  5.08  1.60  1.82  2.31  3.06  1.77  2.03  2.60  3.34   23.89
Diebel and Thrun [5]   1.62  2.24  3.85  5.70  1.34  2.08  2.85  3.54  1.47  2.29  3.09  3.81      -
Chan et al. [3]        1.83  2.90  4.75  7.70  1.04  1.36  1.94  3.07  1.17  1.55  2.28  3.55    3.02*
Park et al. [15]       1.24  1.82  2.78  4.17  0.99  1.43  1.98  3.04  1.03  1.49  2.13  3.09   24.05
OURS                   0.84  1.29  2.06  3.56  0.51  0.75  1.16  1.89  0.57  0.90  1.38  2.15    1.94

* Extrapolated from the runtime the authors report on images of size 800 × 600.

Table 1. Quantitative comparison on the Middlebury 2007 datasets with added noise. The error is measured as RMSE of the pixel disparity for four different magnification factors (×2, ×4, ×8, ×16). In the original paper the best result for each dataset and upscaling factor is highlighted and the second best is underlined; in this text version, note that OURS has the lowest value in every column.
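For reference, the error metric in Table 1 is the standard root mean squared error over all pixels of the disparity map; a minimal version (our own helper, not from the paper's code) looks like:

```python
import numpy as np

def rmse(pred, gt):
    """Root mean squared error between predicted and groundtruth disparity."""
    return float(np.sqrt(np.mean((pred.astype(np.float64) - gt) ** 2)))
```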
Figure 3. Visual comparison of ×8 upsampling on a snippet of the Middlebury Art dataset including fine structures. (a) RGB intensity image, (b) low resolution input image (enlarged using nearest neighbor upsampling). (c) Upsampling using the MRF proposed by Diebel and Thrun [5]. (d) Adaptive bilateral upsampling proposed by Chan et al. [3]. (e) Non-local means upsampling proposed by Park et al. [15]. (f) Our upsampling method using image guided anisotropic TGV. (g) Groundtruth. The results in (c) and (d) still suffer from noise. (e) removes noise but suffers from edge bleeding especially at small structure boundaries. Our method removes noise and preserves sharp object edges.
In our experiments we use a 2 × 2 gradient operator to calculate the intensity image gradients. The tensor parameters $\beta$ and $\gamma$ as well as the TGV parameters $\alpha_0$ and $\alpha_1$ are manually set once for each upsampling factor and are kept constant in the synthetic and the real world evaluations.
4.1. Middlebury Benchmark Evaluation
An exhaustive evaluation of our method in terms of quantitative and qualitative comparison is made using input images from the Middlebury datasets [10, 20]. We use the disparity image as groundtruth and the original RGB intensity image as input for our anisotropic diffusion tensor. Park et al. [15] provide low resolution input depth images with different downsampling factors (×2, ×4, ×8, ×16). To simulate the acquisition process, these input images contain additional Gaussian noise with a standard deviation that increases with the disparity. Using these datasets we are able to compare our results with the Markov Random Field (MRF) based approach of Diebel and Thrun [5], the bilateral filtering with cost volume refinement of Yang et al. [24], the guided image filtering approach of He et al. [9], the noise-aware bilateral filter approach by Chan et al. [3] and the non-local means filtering by Park et al. [15]. Further, we compare the results to common interpolation methods. The confidence measure $w$ in our functional is set to 1 for all depth points. The parameters $\alpha_0$ and $\alpha_1$ have been kept fixed for all datasets and have been empirically chosen for ×2 / ×4 / ×8 / ×16 as 0.154, 0.023 / 0.05, 0.0056 / 0.267, 0.03 / 0.267, 0.03.
This experiment gives an objective comparison of the robustness, accuracy and speed of a variety of different algorithms. The numerical results for this experiment in terms of the root mean squared error (RMSE) and computation time are shown in Table 1. A visual comparison of the different methods is given in Figure 3. Further quantitative comparisons to other depth upsampling methods on the Middlebury 2003 and 2007 datasets can be found in the supplemental material.
Discussion. What can be clearly seen is that our method delivers an upsampling quality that is superior to state of the art methods at a lower computation time.

Citations
Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper proposes a simple yet effective sparse convolution layer which explicitly considers the location of missing data during the convolution operation, and demonstrates the benefits of the proposed network architecture in synthetic and real experiments with respect to various baseline approaches.
Abstract: In this paper, we consider convolutional neural networks operating on sparse inputs with an application to depth upsampling from sparse laser scan data. First, we show that traditional convolutional networks perform poorly when applied to sparse data even when the location of missing data is provided to the network. To overcome this problem, we propose a simple yet effective sparse convolution layer which explicitly considers the location of missing data during the convolution operation. We demonstrate the benefits of the proposed network architecture in synthetic and real experiments with respect to various baseline approaches. Compared to dense baselines, the proposed sparse convolution network generalizes well to novel datasets and is invariant to the level of sparsity in the data. For our evaluation, we derive a novel dataset from the KITTI benchmark, comprising 93k depth annotated RGB images. Our dataset allows for training and evaluating depth upsampling and depth prediction techniques in challenging real-world settings and will be made available upon publication.

518 citations


Cites background or methods or result from "Image Guided Depth Upsampling Using..."

  • ...[12] perform slightly better than our method on very sparse data but require a dense high-resolution RGB image for guidance....


  • ...Note that in contrast to other techniques [12, 52] which artificially upsample the input (e....


  • ...We compare our unguided approach to several baselines [1, 12, 30, 58] which leverage RGB guidance for upsampling and two standard convolutional neural networks with and without valid mask concatenated to the input....


  • ...More advanced approaches are based on global energy minimization [1, 6, 12, 49, 51], compressive sensing [22], or incorporate semantics for improved performance [58]....


Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, a deep network is trained to predict surface normals and occlusion boundaries, which are then combined with raw depth observations provided by the RGB-D camera to solve for all pixels, including those missing in the original observation.
Abstract: The goal of our work is to complete the depth channel of an RGB-D image. Commodity-grade depth cameras often fail to sense depth for shiny, bright, transparent, and distant surfaces. To address this problem, we train a deep network that takes an RGB image as input and predicts dense surface normals and occlusion boundaries. Those predictions are then combined with raw depth observations provided by the RGB-D camera to solve for depths for all pixels, including those missing in the original observation. This method was chosen over others (e.g., inpainting depths directly) as the result of extensive experiments with a new depth completion benchmark dataset, where holes are filled in training data through the rendering of surface reconstructions created from multiview RGB-D scans. Experiments with different network inputs, depth representations, loss functions, optimization methods, inpainting methods, and deep depth estimation networks show that our proposed approach provides better depth completions than these alternatives.

353 citations

Book ChapterDOI
08 Oct 2016
TL;DR: A novel algorithm for edge-aware smoothing that combines the flexibility and speed of simple filtering approaches with the accuracy of domain-specific optimization algorithms, fast, robust, straightforward to generalize to new domains, and simple to integrate into deep learning pipelines.
Abstract: We present the bilateral solver, a novel algorithm for edge-aware smoothing that combines the flexibility and speed of simple filtering approaches with the accuracy of domain-specific optimization algorithms. Our technique is capable of matching or improving upon state-of-the-art results on several different computer vision tasks (stereo, depth superresolution, colorization, and semantic segmentation) while being 10–1000\(\times \) faster than baseline techniques with comparable accuracy, and producing lower-error output than techniques with comparable runtimes. The bilateral solver is fast, robust, straightforward to generalize to new domains, and simple to integrate into deep learning pipelines.

336 citations


Cites background or methods from "Image Guided Depth Upsampling Using..."

  • ...Optimization algorithms of this nature have been used in global stereo [30], semantic segmentation [7, 20, 25, 38], depth superresolution [8, 17, 22, 24, 26, 27], and colorization [23]....


  • ...With the advent of consumer depth sensors, techniques have been proposed for the task of upsampling the noisy depth maps produced by these sensors using a highresolution RGB reference image [8, 17, 22, 24, 26, 27]....


  • ...The runtimes we report in Table 3 were either produced by ourselves (on a 2012 HP Z420 workstation) or taken from past work [8, 22]....


  • ...To evaluate our model, we use a depth superresolution benchmark [8] which is based on the Middlebury stereo dataset [30]....


  • ...Table 3: Performance on the depth superresolution task [8]....


Book ChapterDOI
08 Oct 2016
TL;DR: A new method to address the problem of depth map super resolution in which a high-resolution (HR) depth map is inferred from a LR depth map and an additional HR intensity image of the same scene is presented.
Abstract: Depth boundaries often lose sharpness when upsampling from low-resolution (LR) depth maps especially at large upscaling factors. We present a new method to address the problem of depth map super resolution in which a high-resolution (HR) depth map is inferred from a LR depth map and an additional HR intensity image of the same scene. We propose a Multi-Scale Guided convolutional network (MSG-Net) for depth map super resolution. MSG-Net complements LR depth features with HR intensity features using a multi-scale fusion strategy. Such a multi-scale guidance allows the network to better adapt for upsampling of both fine- and large-scale structures. Specifically, the rich hierarchical HR intensity features at different levels progressively resolve ambiguity in depth map upsampling. Moreover, we employ a high-frequency domain training method to not only reduce training time but also facilitate the fusion of depth and intensity features. With the multi-scale guidance, MSG-Net achieves state-of-art performance for depth map upsampling.

317 citations


Cites methods from "Image Guided Depth Upsampling Using..."

  • ...Figure S2 shows the convergence curves using f2 ∈ (3, 9, 11)....


Journal ArticleDOI
Jingyu Yang, Xinchen Ye, Kun Li, Chunping Hou, Yao Wang
TL;DR: Being able to handle various types of depth degradations, the proposed method is versatile for mainstream depth sensors, time-of-flight camera, and Kinect, as demonstrated by experiments on real systems.
Abstract: This paper proposes an adaptive color-guided autoregressive (AR) model for high quality depth recovery from low quality measurements captured by depth cameras. We observe and verify that the AR model tightly fits depth maps of generic scenes. The depth recovery task is formulated into a minimization of AR prediction errors subject to measurement consistency. The AR predictor for each pixel is constructed according to both the local correlation in the initial depth map and the nonlocal similarity in the accompanied high quality color image. We analyze the stability of our method from a linear system point of view, and design a parameter adaptation scheme to achieve stable and accurate depth recovery. Quantitative and qualitative evaluation compared with ten state-of-the-art schemes show the effectiveness and superiority of our method. Being able to handle various types of depth degradations, the proposed method is versatile for mainstream depth sensors, time-of-flight camera, and Kinect, as demonstrated by experiments on real systems.

300 citations


Cites methods from "Image Guided Depth Upsampling Using..."

  • ...In applications using depth cameras, these complex systematic errors are usually calibrated and compensated as a preprocessing step before subsequent processing [44], [45]....


References
Journal ArticleDOI
TL;DR: In this article, a constrained optimization type of numerical algorithm for removing noise from images is presented, where the total variation of the image is minimized subject to constraints involving the statistics of the noise.

15,225 citations


Additional excerpts

  • ...the steps are calculated according to the primal-dual update scheme in Eq. (10) until a stopping criterion is reached....


Journal ArticleDOI
Zhengyou Zhang
TL;DR: A flexible technique to easily calibrate a camera that only requires the camera to observe a planar pattern shown at a few (at least two) different orientations is proposed, advancing 3D computer vision one more step from laboratory environments to real world use.
Abstract: We propose a flexible technique to easily calibrate a camera. It only requires the camera to observe a planar pattern shown at a few (at least two) different orientations. Either the camera or the planar pattern can be freely moved. The motion need not be known. Radial lens distortion is modeled. The proposed procedure consists of a closed-form solution, followed by a nonlinear refinement based on the maximum likelihood criterion. Both computer simulation and real data have been used to test the proposed technique and very good results have been obtained. Compared with classical techniques which use expensive equipment such as two or three orthogonal planes, the proposed technique is easy to use and flexible. It advances 3D computer vision one more step from laboratory environments to real world use.

13,200 citations


"Image Guided Depth Upsampling Using..." refers background in this paper

  • ...As a future perspective, it will be extended to incorporate a temporal coherence in a consistent way, eventually leading to depth reconstructions with even higher accuracy....


Journal ArticleDOI
TL;DR: The guided filter is a novel explicit image filter derived from a local linear model that can be used as an edge-preserving smoothing operator like the popular bilateral filter, but it has better behaviors near edges.
Abstract: In this paper, we propose a novel explicit image filter called guided filter. Derived from a local linear model, the guided filter computes the filtering output by considering the content of a guidance image, which can be the input image itself or another different image. The guided filter can be used as an edge-preserving smoothing operator like the popular bilateral filter [1], but it has better behaviors near edges. The guided filter is also a more generic concept beyond smoothing: It can transfer the structures of the guidance image to the filtering output, enabling new filtering applications like dehazing and guided feathering. Moreover, the guided filter naturally has a fast and nonapproximate linear time algorithm, regardless of the kernel size and the intensity range. Currently, it is one of the fastest edge-preserving filters. Experiments show that the guided filter is both effective and efficient in a great variety of computer vision and computer graphics applications, including edge-aware smoothing, detail enhancement, HDR compression, image matting/feathering, dehazing, joint upsampling, etc.

4,730 citations


"Image Guided Depth Upsampling Using..." refers methods in this paper

  • ...The proposed method is not limited to single image upsampling....


Journal ArticleDOI
TL;DR: A first-order primal-dual algorithm for non-smooth convex optimization problems with known saddle-point structure can achieve O(1/N²) convergence on problems where the primal or the dual objective is uniformly convex, and it can show linear convergence, i.e. O(ω^N) for some ω ∈ (0,1), on smooth problems.
Abstract: In this paper we study a first-order primal-dual algorithm for non-smooth convex optimization problems with known saddle-point structure. We prove convergence to a saddle-point with rate O(1/N) in finite dimensions for the complete class of problems. We further show accelerations of the proposed algorithm to yield improved rates on problems with some degree of smoothness. In particular we show that we can achieve O(1/N²) convergence on problems where the primal or the dual objective is uniformly convex, and we can show linear convergence, i.e. O(ω^N) for some ω ∈ (0,1), on smooth problems. The wide applicability of the proposed algorithm is demonstrated on several imaging problems such as image denoising, image deconvolution, image inpainting, motion estimation and multi-label image segmentation.

4,487 citations


"Image Guided Depth Upsampling Using..." refers background or methods in this paper

  • ...We formulate the upsampling as a convex optimization problem [2, 6]....


  • ...The gradient and divergence operators are approximated using forward/backward differences with Neumann and Dirichlet boundary conditions, respectively....


Proceedings ArticleDOI
26 Oct 2011
TL;DR: A system for accurate real-time mapping of complex and arbitrary indoor scenes in variable lighting conditions, using only a moving low-cost depth camera and commodity graphics hardware, which fuse all of the depth data streamed from a Kinect sensor into a single global implicit surface model of the observed scene in real- time.
Abstract: We present a system for accurate real-time mapping of complex and arbitrary indoor scenes in variable lighting conditions, using only a moving low-cost depth camera and commodity graphics hardware. We fuse all of the depth data streamed from a Kinect sensor into a single global implicit surface model of the observed scene in real-time. The current sensor pose is simultaneously obtained by tracking the live depth frame relative to the global model using a coarse-to-fine iterative closest point (ICP) algorithm, which uses all of the observed depth data available. We demonstrate the advantages of tracking against the growing full surface model compared with frame-to-frame tracking, obtaining tracking and mapping results in constant time within room sized scenes with limited drift and high accuracy. We also show both qualitative and quantitative results relating to various aspects of our tracking and mapping system. Modelling of natural scenes, in real-time with only commodity sensor and GPU hardware, promises an exciting step forward in augmented reality (AR), in particular, it allows dense surfaces to be reconstructed in real-time, with a level of detail and robustness beyond any solution yet presented using passive computer vision.

4,184 citations


"Image Guided Depth Upsampling Using..." refers methods in this paper

  • ...To create whole volumes of depth data Newcombe et al. ...

    [...]