
Image-Guided ToF Depth Upsampling: A Survey
Iván Eichhardt · Dmitry Chetverikov · Zsolt Jankó

I. Eichhardt: Eötvös Loránd University and MTA SZTAKI, Budapest, Hungary
D. Chetverikov: Eötvös Loránd University and MTA SZTAKI, Budapest
Z. Jankó: MTA SZTAKI, Budapest

Received: date / Accepted: date
Abstract Recently, there has been remarkable growth of in-
terest in the development and applications of Time-of-Flight
(ToF) depth cameras. However, despite the steady improvement of their characteristics, the practical applicability of ToF cameras is still limited by the low resolution and quality of depth measurements. This has motivated many
researchers to combine ToF cameras with other sensors in
order to enhance and upsample depth images. In this paper,
we review the approaches that couple ToF depth images with
high-resolution optical images. Other classes of upsampling
methods are also briefly discussed. Finally, we provide an
overview of performance evaluation tests presented in the
related studies.
Keywords ToF cameras · depth images · optical images ·
depth upsampling · survey
1 Introduction
Image-based 3D reconstruction of static [111,121,49] and
dynamic [125] objects and scenes is a core problem of com-
puter vision. In the early years of computer vision, it was believed that visual information is sufficient for a computer to solve the problem, as humans can perceive dynamic 3D scenes based on their vision. However, humans do not need to build precise 3D models of an environment to be able to act in it, while numerous applications of computer vision require precise 3D reconstruction.
Today, different sensors and approaches are often com-
bined to achieve the goal of building a detailed, geometri-
cally correct and properly textured 3D or 4D (spatio-tempo-
ral) model of an object or a scene. Visual and non-visual
sensor data are fused to cope with atmospheric haze [112],
varying illumination, surface properties [56], motion and oc-
clusion. This requires good calibration and registration of
the modalities such as colour and infrared images, laser-
measured data (LIDAR, hand-held scanners, Kinect), or ToF
depth cameras. The output is typically a point cloud, a depth
image, or a depth image with a colour value assigned to each
pixel (RGBD).
A calibrated stereo rig is a widespread, classical device
to acquire depth information based on visual data [111].
Since its baseline, i.e., the distance between the two cam-
eras, is usually narrow, the resulting depth accuracy is lim-
ited. (By depth accuracy we mean the overall accuracy of
depth measurement.) Wide-baseline stereo [121] can pro-
vide a better accuracy at the expense of more frequent occlu-
sions and partial loss of spatial data. A collection of different-
size, uncalibrated images of an object (or a video) can also
be used for 3D reconstruction. However, this requires dense
point correspondences (or dense feature tracking) across im-
ages/frames, which is not always possible.
Photometric stereo [49] applies a camera and several
light sources to acquire the surface normals. The normal
vectors are integrated to reconstruct the surface. The method
provides fine surface details but suffers from less robust glo-
bal geometry [92]. The latter is better captured by stereo
methods that can be combined with photometric stereo [92]
to obtain precise local and global geometry.
Shape acquisition systems using structured light [109,
26] contain one or two cameras and a projector that casts a
specific, fixed or programmable, pattern onto the shape sur-
face. Systems with a programmable light pattern can achieve
high precision of surface measurement.

The approaches to image-based 3D reconstruction listed
above are the most widely used in practice. A number of
other approaches to ‘Shape-from-X’ exist [124,126], such
as Shape-from-Texture, Shape-from-Shading, Shape-from-
Focus, or Structure-from-Motion. These approaches are usu-
ally less precise and robust. They can be applied when high
precision is not required, or as additional shape cues in com-
bination with other methods. For example, Shape-from-Sha-
ding can be used to enhance fine shape details in RGBD
data [144,95,44].
Among the non-visual sensors, the popular Kinect [148]
can be used for real-time dense 3D reconstruction, track-
ing and interaction [57,93]. The original device, Kinect I,
combines a colour camera with a depth sensor projecting
invisible structured light. In the Kinect II, the depth sensor
is a ToF camera coupled with a colour camera. Currently,
Kinect’s resolution and precision are somewhat limited but
still sufficient for applications in the game industry and human-
computer interaction (HCI). (See the study [94] for Kinect
sensor noise analysis resulting in improved depth measure-
ment.)
Different LIDAR devices [10,38] have numerous appli-
cations in various areas including robot vision, autonomous
vehicles, traffic monitoring, as well as scanning and 3D re-
construction of indoor and outdoor scenes, buildings and
complete residential areas. They deliver point clouds with
a measure of surface reflectivity assigned to each point.
Last but not least, ToF depth cameras [28,113,45] ac-
quire low-resolution, registered depth and reflectance im-
ages at rates suitable for real-time robot vision, navigation, obstacle avoidance, the game industry and HCI.
This paper is devoted to a specific but critical aspect of
ToF image processing, namely, to depth image upsampling.
The upsampling can be performed in different ways. We
give a survey of the methods that combine a low-resolution
ToF depth image with a registered high-resolution optical
image in order to increase the depth image resolution, typically
by a factor of 4 to 16.
The rest of the paper is structured as follows. In Sec-
tion 2, we discuss an important class of ToF cameras and
compare their features to the features of three main image-
based methods. Although our survey is devoted to image-
guided depth upsampling, for the sake of completeness Sec-
tion 3 gives a brief overview of upsampling with stereo and
with multiple measurements, as well. Section 4 is a survey of
depth upsampling based on a single optical image. In Sec-
tion 5, we discuss the performance evaluation test results
presented in the reviewed literature on depth upsampling. Fi-
nally, Section 6 provides further discussion, conclusion and
outlook.
2 Time-of-Flight cameras
A recent survey [28] offers a comprehensive summary of
the operation principles, advantages and limitations of ToF
cameras. The survey [28] focuses on lock-in ToF cameras
which are widely used in numerous applications, while the
other category of ToF cameras, the pulse-based, is still rarely
used. Our survey is also devoted to lock-in ToF cameras; for
simplicity we will omit the term ‘lock-in’.
ToF cameras [113,102,37] are small, compact, lightweight, low-power devices that emit infrared light and measure its time of flight to the observed object in order to calculate the distance to the object, usually called the depth. Contrary to LIDAR devices, ToF cameras have no moving parts, and they capture depth images rather than point clouds. In ad-
dition to depth, ToF cameras deliver registered reflectance
images of the same size and reliability values of depth mea-
surements.
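
As background for the limitations discussed next, the following minimal sketch shows how a lock-in pixel converts its raw measurements into depth. It assumes the common four-bucket demodulation of an amplitude-modulated signal; the function and parameter names are illustrative, and sign conventions vary between camera models.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def tof_depth(a0, a1, a2, a3, f_mod=20e6):
    """Depth from four samples of a lock-in ToF pixel's correlation
    function, taken at phase offsets of 0, 90, 180 and 270 degrees.
    f_mod is the modulation frequency; 20 MHz is a typical value."""
    phase = np.mod(np.arctan2(a3 - a1, a0 - a2), 2 * np.pi)  # phase shift
    depth = C * phase / (4 * np.pi * f_mod)       # 4*pi: round-trip path
    amplitude = 0.5 * np.hypot(a3 - a1, a0 - a2)  # reflectance / reliability
    return depth, amplitude
```

The amplitude is what the camera returns as the registered reflectance image and reliability value, and the unambiguous range is c/(2 f_mod), about 7.5 m at 20 MHz, beyond which the measured phase wraps around.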
The main disadvantages of ToF cameras are their low
resolution and significant acquisition noise. Although both
resolution and quality are gradually improving, they are in-
herently limited by chip size and small active illumination
energy, respectively. The highest currently available ToF ca-
mera resolution is QVGA (320 × 240), with VGA (640 ×
480) being a target of future development. (See [89] for a
systematic analysis of ground truth datasets and evaluation
methods to assess the quality of ToF imaging data.)

Table 1 Comparison of four techniques for depth measurement.

                              stereo vision      photometric stereo   structured light   ToF camera
  correspondence              yes                no                   yes                no
  extrinsic calibration       yes                yes                  yes                no
  active illumination         no                 yes                  yes                yes
  weak texture performance    weak               good                 good               good
  strong texture performance  good               medium               medium             medium
  low light performance       weak               good                 good               good
  bright light performance    good               weak                 medium/weak        medium
  outdoor scene               yes                no                   no                 yes?
  dynamic scene               yes                no                   yes                yes
  image resolution            camera dependent   camera dependent     camera dependent   low
  depth accuracy              mm to cm           mm                   µm to cm           mm to cm
Table 1 compares ToF cameras to three main image-
based methods in terms of basic features. Stereo vision (SV)
and structured light (SL) need to solve the correspondence,
or matching, problem; the other two methods, photometric stereo (PS) and ToF, are correspondence-free. Of the
four techniques, only ToF does not require extrinsic calibra-
tion. SV is a passive method; the rest use active illumination.
This allows them to work with textureless surfaces when SV
fails. On the other hand, distinct, strong textures facilitate
the operation of SV but can deteriorate the performance of
the active methods, especially when different textures cover
the surface and its reflectance varies.
The active methods operate well when scene illumination is poor. Not surprisingly, pas-
sive stereo fails when visibility is low. The situation reverses
for bright lighting that can prevent the operation of PS and
reduce the performance of SL and ToF. In particular, bright
lighting can increase ambient light noise in ToF [28] if am-
bient light contains the same wavelength as camera light. (A
more recent report [75] claims that the bright lighting per-
formance of ToF is good.) High-reflectivity surfaces can be
a problem for all of the methods.
PS is effective for neither outdoor nor dynamic scenes.
SL can cope with time-varying surfaces, but currently it is
not applied in outdoor conditions. Both SV and ToF can be used outdoors and applied to dynamic scenes, although the
outdoor applicability of ToF cameras can be limited by their
illumination energy and range [22,16], as well as by ambi-
ent light. Image resolution of the first three techniques de-
pends on the camera and can be high, contrary to ToF cam-
eras whose resolution is low. Depth accuracy of SV depends
on the baseline and is comparable to that of ToF [75]. The
other two techniques, especially SL, can yield higher accu-
racy.
From the comparison in Table 1, we observe that ToF
cameras and passive stereo vision have complementary fea-
tures. In particular, the influence of surface texture and illu-
mination on the performance of the two techniques is just
the opposite. As discussed in Section 4, this complementar-
ity of ToF sensing and stereo has motivated researchers to
combine the two sources of depth data in order to enhance
applicability, accuracy and robustness of 3D vision systems.
ToF cameras have numerous applications. The related
surveys [29,28] conclude that the most exploited feature of
the cameras is their ability to operate without moving parts
while providing depth maps at high frame rates. This ca-
pability greatly simplifies the solution of a critical task of
3D vision, the foreground-background separation. ToF cam-
eras are exploited in robot vision [55] for navigation [135,
21,128,145] and 3D pose estimation and mapping [101,85,
34].
Further important application areas are 3D reconstruc-
tion of objects and environments [17,27,6,31,67,63], com-
puter graphics [122,103,65] and 3DTV [120,118,133,134,
78]. (See the study [116] for a recent survey of depth sensing
for 3D television.) ToF cameras are applied in various tasks
related to recognition and tracking of people [40,7,64] and
parts of human body: hand [79,91], head [35] and face [91,
108]. Alenya et al. [1] use colour and ToF camera data to
build 3D models of leaves for automated plant measure-
ment. Additional applications are discussed in the recent
book [37].
3 Upsampling with stereo and with multiple
measurements
Low resolution and low signal-to-noise ratio are the two
main disadvantages of ToF depth imagery. The goal of depth
image upsampling is to increase the resolution and simul-
taneously improve image quality, in particular, near depth
edges where surface discontinuities tend to result in erro-
neous or missing measurements [28]. In some applications,
such as mixed reality, game industry and 3DTV, the depth
edge areas are especially important because they determine
occlusion and disocclusion of moving actors.
Approaches to depth upsampling can be categorised into
three main classes [24]. In this survey, we discuss image-
guided upsampling when a high-resolution optical image
registered with a low-resolution depth image is used to refine
the depth. However, for completeness we will now briefly
discuss the other two classes, as well.
Note that most of the ToF depth upsampling methods
surveyed in this paper deal with lateral depth enhancement.
As already mentioned, some techniques for RGBD data pro-
cessing [144,95,44] enhance fine shape details by calculat-
ing surface normals.
ToF–stereo fusion combines ToF depth with multicam-
era stereo data. A recent survey of this type of depth upsam-
pling is available in [90]. Hansard et al. [45] discuss some
variants of this approach and provide a comparative eval-
uation of several methods. The important issue of register-
ing the ToF camera and the stereo data is also addressed.
By mapping ToF depth values to the disparities of a high-
resolution camera pair, it is possible to simultaneously up-
sample the depth values and improve the quality of the dis-
parities [39]. Kim et al. [63] address the problem of sparsely
textured surfaces and self-occlusions in stereo vision by fus-
ing multicamera stereo data with multiview ToF sensor mea-
surements. The method yields dense and detailed 3D models
of scenes challenging for stereo alone while enhancing the
ToF depth images. Zhu et al. [150,149,151] also explore the

complementary features of ToF cameras and stereo in order
to improve accuracy and robustness.
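
For reference, the depth-to-disparity mapping mentioned above is the standard relation for a rectified camera pair. With baseline b and focal length f in pixels (our symbols, stated for completeness rather than taken from [39]), a ToF depth measurement Z and its uncertainty σ_Z propagate as

```latex
d = \frac{f\,b}{Z},
\qquad
\sigma_d \approx \left|\frac{\partial d}{\partial Z}\right|\sigma_Z
         = \frac{f\,b}{Z^{2}}\,\sigma_Z ,
```

so each ToF measurement yields a disparity hypothesis with an uncertainty band that can restrict or reweight the stereo matching search.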
Yang et al. [141] present a setup that combines a ToF
depth camera with three stereo cameras and report on GPU-
based, fast stereo depth frame grabbing and real-time ToF
depth upsampling. The system fails in large surface regions
of dark (e.g., black) colour that cause trouble for both stereo and ToF cameras. Bartczak and Koch [5] combine multiple
high-resolution colour views with a ToF camera to obtain
dense depth maps of a scene. Similar input data are used by
Li et al. [73] who present a joint learning-based method ex-
ploiting differential features of the observed surface. Kang
and Ho [60,51] report on a system that contains multiple
depth and colour cameras.
Hahne and Alexa [41,42] claim that the combination of a ToF camera and stereo vision can provide enhanced depth data
even without precise calibration. Kuhnert and Stommel [67]
fuse ToF depth data with stereo data for real-time indoor
3D environment reconstruction in mobile robotics. Further
methods are discussed in the recent survey [90]. A drawback
of ToF–stereo is that it still inherits critical problems of pas-
sive stereo vision: the correspondence problem, the problem
of textureless surfaces, and the problem of occlusions.
A natural way to improve resolution is to combine mul-
tiple measurements of an object. In optical imaging, numer-
ous studies are devoted to super-resolution [131,129] or up-
sampling [23] of colour images. Fusing multiple ToF depth
measurements into one image is sometimes referred to as
temporal and spatial upsampling [24]. This type of depth
upsampling is less widespread than ToF–stereo fusion and
image-guided methods.
Hahne and Alexa [43] obtain enhanced depth images
by adaptively combining several images taken with differ-
ent exposure (integration) times. Their method is inspired
by techniques applied in high dynamic range (HDR) imag-
ing where different measures of image quality are used as
weights for adaptive colour image fusion. For depth image
fusion, the method [43] uses local measures of depth con-
trast, well-exposedness, surface smoothness, and uncertainty
defined via signal entropy.
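
The sketch below conveys the flavour of such exposure-weighted fusion. It is not the algorithm of [43]: as simplified stand-ins for the four measures listed above, it weights each measurement by its amplitude (a proxy for well-exposedness) and by the inverse of the local depth variance (a proxy for surface smoothness).

```python
import numpy as np
from scipy.ndimage import generic_filter

def fuse_exposures(depths, amplitudes, eps=1e-6):
    """Per-pixel weighted average of depth maps captured with different
    integration times from the same viewpoint, favouring high-amplitude,
    locally smooth measurements."""
    num = np.zeros_like(depths[0], dtype=np.float64)
    den = np.zeros_like(depths[0], dtype=np.float64)
    for d, a in zip(depths, amplitudes):
        var = generic_filter(d, np.var, size=3)  # local depth variance
        w = a / (var + eps)                      # per-pixel quality weight
        num += w * d
        den += w
    return num / (den + eps)
```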
In [115,15], the authors acquire multiple depth images
of a static scene from different viewpoints and merge them
into a single depth map of higher resolution. An advantage
of such approaches is that they do not need a sensor of another type. Working with depth images only allows one to
avoid the so-called ‘texture copying problem’ of sensor fusion, when high-contrast image textures tend to ‘imprint’ onto the
upsampled depth image. This negative effect will be dis-
cussed later in relation to image-guided upsampling. A limi-
tation of the methods [115,15] is that only static objects can
be measured.
Mac Aodha et al. [83] use a training dataset of high-
resolution depth images for patch-based upsampling of a
low-resolution depth image. Although theoretically attrac-
tive, the method is too time-consuming for most applica-
tions. A somewhat similar patch-based approach is presented
by Hornacek et al. [52] who exploit patch-wise self-similarity
of a scene and search for patch correspondences within the
input depth image. The method [52] aims at single image
based upsampling while the algorithm [83] needs a large
collection of high-resolution exemplars to search in. A draw-
back of the method [52] is that it relies on patch correspon-
dences which may be difficult to obtain, especially for less
characteristic surface regions.
Riegler et al. [104] use a deep network for single depth
map super-resolution. The same problem is addressed in [3]
using the Discrete Wavelet Transform and in [84] using sub-
dictionaries of exemplars constructed from example depth
maps. Finally, the patent [61] describes a method for com-
bined depth filtering and resolution refinement.
4 Image-guided depth upsampling
In this section, we provide a survey of depth upsampling
based on a single optical image assuming calibrated and
fused depth and colour data. As discussed later, precise cali-
bration and sensor fusion are essential for good upsampling.
Similarly to the ToF-stereo fusion survey [90], we classify
the methods as local or global. The former category applies
local filtering while the latter uses global optimisation. The
approaches that fall in neither of the two classes are dis-
cussed separately.
We start the presentation of the methods by illustrating
the upsampling problem, discussing its difficulties and in-
troducing the notation. Fig. 1 shows an example of success-
ful upsampling of a high-quality depth image of low resolu-
tion. The input depth and colour images are from the Mid-
dlebury stereo dataset [110]. The original high-resolution
depth image was acquired with structured light, then artifi-
cially downsampled to get the low-resolution image shown
in Fig. 1. Small parts of depth data (dark regions) are lost.
The upsampled depth is smooth and very similar to the orig-
inal high-resolution data used as the ground truth. In the
Middlebury data, depth discontinuities match well the cor-
responding edges of the colour image. This dataset is of-
ten used for quantitative comparative evaluation of image-
guided upsampling techniques.

Fig. 1 Sample Middlebury data, the upsampled depth and the ground truth. (Panels, left to right: input depth and colour images, upsampled depth, ground-truth depth.)
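
A typical quantitative protocol on such data can be sketched as follows. This is a hedged outline: the exact downsampling scheme, invalid-pixel handling and error metric differ between the papers compared in Section 5.

```python
import numpy as np
import cv2

def evaluate(gt_depth, upsample_fn, factor=8):
    """Downsample the ground-truth depth, upsample it back with the
    method under test, and report RMSE against the ground truth."""
    h, w = gt_depth.shape
    low = gt_depth[::factor, ::factor]   # simulated low-resolution input
    est = upsample_fn(low, (h, w))       # method under test
    valid = gt_depth > 0                 # ignore missing measurements
    return float(np.sqrt(np.mean((est[valid] - gt_depth[valid]) ** 2)))

# Guide-free bicubic interpolation is a common baseline; note that
# cv2.resize expects the target size as (width, height).
bicubic = lambda low, size: cv2.resize(
    low, (size[1], size[0]), interpolation=cv2.INTER_CUBIC)
```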
For real-world data, the upsampling problem is more
complicated than for the Middlebury data. Fig. 2 illustrates
the negative features of depth images captured by ToF cameras (data courtesy of Zinemath Zrt [152]). The original depth resolution is very low compared to that of the colour image. When resized to the size of the colour image, the depth image clearly shows its deficiencies: a part of the data is lost due to low resolution; some
shapes, e.g., the heads, are distorted. Despite the calibration,
the contours of the depth image do not always coincide with
those of the colour image. There are erroneous and missing
measurements along the depth edges, in the dark region on
the top, and in the background between the chair and the
poster.
To use a high-resolution image for depth upsampling,
one needs to relate image features to depth features. A basic
assumption exploited by most upsampling methods is that
image edges are related to depth edges, that is, to surface discontinuities. It is often assumed [18,33,81,97,98,74,24]
that smooth depth regions exhibit themselves as smooth in-
tensity/colour regions, while depth edges underlie intensity
edges. We will call this condition the depth-intensity edge
coincidence assumption.
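
One common way this assumption enters the global optimisation methods mentioned above, written here as an illustrative formulation rather than the objective of any single paper (Ω(p) and Ĩ are defined at the end of this section), is a guide-weighted smoothness energy:

```latex
\hat{D} = \arg\min_{D}
  \sum_{p \in \mathcal{M}} \bigl(D_p - D^{\mathrm{in}}_p\bigr)^{2}
  + \lambda \sum_{p} \sum_{q \in \Omega(p)}
      w_{pq}\,\bigl(D_p - D_q\bigr)^{2},
\qquad
w_{pq} = \exp\!\left(-\frac{\|\tilde{I}_p - \tilde{I}_q\|^{2}}
                           {2\sigma_r^{2}}\right),
```

where M is the set of pixels carrying a depth measurement and D^in is the input depth. Smoothing is strong between pixels of similar guide colour and suppressed across intensity edges, so recovered depth edges are pulled towards image edges.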
Clearly, the assumption is violated in the regions of high-
contrast texture and on the border of a strong shadow. Some
studies [139,123] relax it in order to circumvent the prob-
lems discussed below and avoid the resulting artefacts. How-
ever, depth edges are in any case a sensitive issue. Since im-
age features are the only data available for upsampling, one
has to find a balance between the edge coincidence assump-
tion and other priors. This balance is data-dependent, which
may necessitate adaptive parameter tuning of an upsampling
algorithm.
Precise camera calibration is crucial for the applications
that require good-quality depth images, in general, and accu-
rate depth discontinuities, in particular. Techniques and en-
gineering tools used to calibrate ToF cameras and enhance
their quality are discussed in numerous studies [77,50,45,
99,102,72,58]. Procedures for joint calibration of a ToF cam-
era and an intensity camera are described in [97,98,24,132].
Many researchers apply the well-known calibration meth-
od [147]. A ToF camera calibration toolbox implementing
the method presented in [69] is available at the web site [68].
Inaccurate registration of depth and intensity images due
to imprecise calibration results in deterioration of the up-
sampled depth. Schwarz et al. [117] propose an error mod-
el for ToF sensor fusion and analyse the relation between the
model and inaccuracies in camera calibration and depth mea-
surements. Xu et al. [137] address the problem of misalign-
ment correction in the context of depth image-based render-
ing. Fig. 3 illustrates the effect of misalignment on depth up-
sampling. The discrepancy between the depth and intensity
images is artificially introduced by a relative shift of two,
five and ten pixels. As the shift grows, the depth borders be-
come blurred and coarse.
Because of the optical radial distortion typical for many
cameras, the discrepancy between the input images tends to
grow with the distance from image centre. Fig. 4 shows an
example of this phenomenon. The shape of the person in
the centre of the scene in Fig. 4a is quite precise, with even
fine details such as fingers being upsampled correctly. When
the person moves to the periphery of the scene (Fig. 4b),
his shape, e.g., in the region of the neck, becomes visibly
distorted due to the growing misalignment.
Avoiding depth blur to preserve sharp depth edges is a major issue for upsampling methods. Because of the
depth-intensity edge coincidence assumption, this issue is
related to the texture copying (transfer) problem. High-contrast image textures tend to show up in the upsampled depth image, as illustrated in Fig. 5, where textured regions
cause visible perturbation in the refined depth. This disturb-
ing phenomenon and possible remedies are discussed in the
papers [139,123]. Further typical problems of image-guided
depth upsampling are mentioned in Section 6.
In the sequel, we use the following notation:

D             Input (depth) image.
D̂             Filtered / upsampled image.
∇D            Gradient image.
Ĩ             Guide / reference image.
p, q, . . .   2D pixel coordinates.
‖p − q‖       Distance between pixels p and q.
p↓, q↓, . . . Low-resolution coordinates, possibly fractional.
Ω(p)          A window around pixel p.
D_q           The D value of pixel q.
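
Using this notation, the joint (cross) bilateral upsampling filter that underlies many of the local methods can be sketched as follows. This is a minimal, unoptimised illustration of the general idea, assuming a grayscale guide image, rather than the exact filter of any one paper.

```python
import numpy as np

def joint_bilateral_upsample(D, I, sigma_s=2.0, sigma_r=10.0, radius=2):
    """Estimate D̂ at each high-resolution pixel p as a weighted average
    of low-resolution depth values over a window Ω(p↓); spatial weights
    live in low-resolution coordinates, range weights come from the
    high-resolution guide image I."""
    h, w = I.shape            # high-resolution (guide) size
    lh, lw = D.shape          # low-resolution (depth) size
    fy, fx = h / lh, w / lw   # upsampling factors
    D_hat = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            yd, xd = y / fy, x / fx   # p↓, possibly fractional
            acc, ws = 0.0, 0.0
            for qy in range(max(0, int(yd) - radius),
                            min(lh, int(yd) + radius + 1)):
                for qx in range(max(0, int(xd) - radius),
                                min(lw, int(xd) + radius + 1)):
                    # spatial weight: distance ‖p↓ − q‖ in low-res pixels
                    g_s = np.exp(-((qy - yd) ** 2 + (qx - xd) ** 2)
                                 / (2 * sigma_s ** 2))
                    # range weight: guide difference at the corresponding
                    # high-res positions (edge coincidence assumption)
                    qyh = min(h - 1, int(round(qy * fy)))
                    qxh = min(w - 1, int(round(qx * fx)))
                    diff = float(I[y, x]) - float(I[qyh, qxh])
                    g_r = np.exp(-diff ** 2 / (2 * sigma_r ** 2))
                    acc += g_s * g_r * D[qy, qx]
                    ws += g_s * g_r
            D_hat[y, x] = acc / ws
    return D_hat
```

The texture copying problem discussed above arises directly from the range weight: a strong intensity edge inside a smooth depth region still modulates g_r and perturbs the estimated depth.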
