
Image-Guided ToF Depth Upsampling: A Survey
Iván Eichhardt · Dmitry Chetverikov · Zsolt Jankó

I. Eichhardt: Eötvös Loránd University and MTA SZTAKI, Budapest, Hungary
D. Chetverikov: Eötvös Loránd University and MTA SZTAKI, Budapest
Z. Jankó: MTA SZTAKI, Budapest

Received: date / Accepted: date
Abstract Recently, there has been remarkable growth of in-
terest in the development and applications of Time-of-Flight
(ToF) depth cameras. However, despite the steady improvement of their characteristics, the practical applicability of ToF cameras is still limited by the low resolution and quality of depth measurements. This has motivated many
researchers to combine ToF cameras with other sensors in
order to enhance and upsample depth images. In this paper,
we review the approaches that couple ToF depth images with
high-resolution optical images. Other classes of upsampling
methods are also briefly discussed. Finally, we provide an
overview of performance evaluation tests presented in the
related studies.
Keywords ToF cameras · depth images · optical images ·
depth upsampling · survey
1 Introduction
Image-based 3D reconstruction of static [111,121,49] and
dynamic [125] objects and scenes is a core problem of com-
puter vision. In the early years of computer vision, it was believed that visual information is sufficient for a computer to solve the problem, as humans can perceive dynamic 3D scenes based on their vision. However, humans do not need to build precise 3D models of an environment to be able to act in it, while numerous applications of computer vision require precise 3D reconstruction.
Today, different sensors and approaches are often com-
bined to achieve the goal of building a detailed, geometri-
cally correct and properly textured 3D or 4D (spatio-tempo-
ral) model of an object or a scene. Visual and non-visual
sensor data are fused to cope with atmospheric haze [112],
varying illumination, surface properties [56], motion and oc-
clusion. This requires good calibration and registration of
the modalities such as colour and infrared images, laser-
measured data (LIDAR, hand-held scanners, Kinect), or ToF
depth cameras. The output is typically a point cloud, a depth
image, or a depth image with a colour value assigned to each
pixel (RGBD).
A calibrated stereo rig is a widespread, classical device
to acquire depth information based on visual data [111].
Since its baseline, i.e., the distance between the two cam-
eras, is usually narrow, the resulting depth accuracy is lim-
ited. (By depth accuracy we mean the overall accuracy of
depth measurement.) Wide-baseline stereo [121] can pro-
vide a better accuracy at the expense of more frequent occlu-
sions and partial loss of spatial data. A collection of different-
size, uncalibrated images of an object (or a video) can also
be used for 3D reconstruction. However, this requires dense
point correspondences (or dense feature tracking) across im-
ages/frames, which is not always possible.
Photometric stereo [49] applies a camera and several
light sources to acquire the surface normals. The normal
vectors are integrated to reconstruct the surface. The method
provides fine surface details but suffers from less robust glo-
bal geometry [92]. The latter is better captured by stereo
methods that can be combined with photometric stereo [92]
to obtain precise local and global geometry.
Shape acquisition systems using structured light [109,
26] contain one or two cameras and a projector that casts a
specific, fixed or programmable, pattern onto the shape sur-
face. Systems with a programmable light pattern can achieve
high precision of surface measurement.

The approaches to image-based 3D reconstruction listed
above are the most widely used in practice. A number of
other approaches to ‘Shape-from-X’ exist [124,126], such
as Shape-from-Texture, Shape-from-Shading, Shape-from-
Focus, or Structure-from-Motion. These approaches are usu-
ally less precise and robust. They can be applied when high
precision is not required, or as additional shape cues in com-
bination with other methods. For example, Shape-from-Sha-
ding can be used to enhance fine shape details in RGBD
data [144,95,44].
Among the non-visual sensors, the popular Kinect [148]
can be used for real-time dense 3D reconstruction, track-
ing and interaction [57,93]. The original device, Kinect I,
combines a colour camera with a depth sensor projecting
invisible structured light. In the Kinect II, the depth sensor
is a ToF camera coupled with a colour camera. Currently,
Kinect’s resolution and precision are somewhat limited but
still sufficient for applications in the game industry and human-
computer interaction (HCI). (See the study [94] for Kinect
sensor noise analysis resulting in improved depth measure-
ment.)
Different LIDAR devices [10,38] have numerous appli-
cations in various areas including robot vision, autonomous
vehicles, traffic monitoring, as well as scanning and 3D re-
construction of indoor and outdoor scenes, buildings and
complete residential areas. They deliver point clouds with
a measure of surface reflectivity assigned to each point.
Last but not least, ToF depth cameras [28,113,45] ac-
quire low-resolution, registered depth and reflectance im-
ages at rates suitable for real-time robot vision, navigation, obstacle avoidance, the game industry and HCI.
This paper is devoted to a specific but critical aspect of
ToF image processing, namely, to depth image upsampling.
The upsampling can be performed in different ways. We
give a survey of the methods that combine a low-resolution
ToF depth image with a registered high-resolution optical
image in order to increase the depth image resolution, typically
by a factor of 4 to 16.
The rest of the paper is structured as follows. In Sec-
tion 2, we discuss an important class of ToF cameras and
compare their features to the features of three main image-
based methods. Although our survey is devoted to image-
guided depth upsampling, for the sake of completeness Sec-
tion 3 gives a brief overview of upsampling with stereo and
with multiple measurements, as well. Section 4 is a survey of
depth upsampling based on a single optical image. In Sec-
tion 5, we discuss the performance evaluation test results
presented in the reviewed literature on depth upsampling. Fi-
nally, Section 6 provides further discussion, conclusion and
outlook.
2 Time-of-Flight cameras
A recent survey [28] offers a comprehensive summary of
the operation principles, advantages and limitations of ToF
cameras. The survey [28] focuses on lock-in ToF cameras
which are widely used in numerous applications, while the
other category of ToF cameras, the pulse-based, is still rarely
used. Our survey is also devoted to lock-in ToF cameras; for
simplicity we will omit the term ‘lock-in’.
ToF cameras [113,102,37] are small, compact, lightweight, low-power devices that emit infrared light and measure its time of flight to the observed object in order to calculate the distance to the object, usually called the depth. Contrary to LIDAR devices, ToF cameras have no moving parts, and they capture depth images rather than point clouds. In ad-
dition to depth, ToF cameras deliver registered reflectance
images of the same size and reliability values of depth mea-
surements.
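
As background for the limitations discussed next, the following minimal sketch shows how a lock-in pixel converts its raw measurements into depth. It assumes the common four-bucket demodulation of an amplitude-modulated signal; the function and parameter names are illustrative, and sign conventions vary between camera models.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def tof_depth(a0, a1, a2, a3, f_mod=20e6):
    """Depth from four samples of a lock-in ToF pixel's correlation
    function, taken at phase offsets of 0, 90, 180 and 270 degrees.
    f_mod is the modulation frequency; 20 MHz is a typical value."""
    phase = np.mod(np.arctan2(a3 - a1, a0 - a2), 2 * np.pi)  # phase shift
    depth = C * phase / (4 * np.pi * f_mod)       # 4*pi: round-trip path
    amplitude = 0.5 * np.hypot(a3 - a1, a0 - a2)  # reflectance / reliability
    return depth, amplitude
```

The amplitude is what the camera returns as the registered reflectance image and reliability value, and the unambiguous range is c/(2 f_mod), about 7.5 m at 20 MHz, beyond which the measured phase wraps around.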
The main disadvantages of ToF cameras are their low
resolution and significant acquisition noise. Although both
resolution and quality are gradually improving, they are in-
herently limited by chip size and small active illumination
energy, respectively. The highest currently available ToF ca-
mera resolution is QVGA (320 × 240), with VGA (640 ×
480) being a target of future development. (See [89] for a
systematic analysis of ground truth datasets and evaluation
methods to assess the quality of ToF imaging data.)

Table 1 Comparison of four techniques for depth measurement.

                              stereo vision      photometric stereo   structured light   ToF camera
  correspondence              yes                no                   yes                no
  extrinsic calibration       yes                yes                  yes                no
  active illumination         no                 yes                  yes                yes
  weak texture performance    weak               good                 good               good
  strong texture performance  good               medium               medium             medium
  low light performance       weak               good                 good               good
  bright light performance    good               weak                 medium/weak        medium
  outdoor scene               yes                no                   no                 yes?
  dynamic scene               yes                no                   yes                yes
  image resolution            camera dependent   camera dependent     camera dependent   low
  depth accuracy              mm to cm           mm                   µm to cm           mm to cm
Table 1 compares ToF cameras to three main image-
based methods in terms of basic features. Stereo vision (SV)
and structured light (SL) need to solve the correspondence,
or matching, problem; the other two methods, photometric stereo (PS) and ToF, are correspondence-free. Of the
four techniques, only ToF does not require extrinsic calibra-
tion. SV is a passive method; the rest use active illumination.
This allows them to work with textureless surfaces when SV
fails. On the other hand, distinct, strong textures facilitate
the operation of SV but can deteriorate the performance of
the active methods, especially when different textures cover
the surface and its reflectance varies.
The active methods operate well when scene illumination is poor. Not surprisingly, pas-
sive stereo fails when visibility is low. The situation reverses
for bright lighting that can prevent the operation of PS and
reduce the performance of SL and ToF. In particular, bright
lighting can increase ambient light noise in ToF [28] if am-
bient light contains the same wavelength as camera light. (A
more recent report [75] claims that the bright lighting per-
formance of ToF is good.) High-reflectivity surfaces can be
a problem for all of the methods.
PS is effective for neither outdoor nor dynamic scenes.
SL can cope with time-varying surfaces, but currently it is
not applied in outdoor conditions. Both SV and ToF can be used outdoors and applied to dynamic scenes, although the
outdoor applicability of ToF cameras can be limited by their
illumination energy and range [22,16], as well as by ambi-
ent light. Image resolution of the first three techniques de-
pends on the camera and can be high, contrary to ToF cam-
eras whose resolution is low. Depth accuracy of SV depends
on the baseline and is comparable to that of ToF [75]. The
other two techniques, especially SL, can yield higher accu-
racy.
From the comparison in Table 1, we observe that ToF
cameras and passive stereo vision have complementary fea-
tures. In particular, the influence of surface texture and illu-
mination on the performance of the two techniques is just
the opposite. As discussed in Section 4, this complementar-
ity of ToF sensing and stereo has motivated researchers to
combine the two sources of depth data in order to enhance
applicability, accuracy and robustness of 3D vision systems.
ToF cameras have numerous applications. The related
surveys [29,28] conclude that the most exploited feature of
the cameras is their ability to operate without moving parts
while providing depth maps at high frame rates. This ca-
pability greatly simplifies the solution of a critical task of
3D vision, the foreground-background separation. ToF cam-
eras are exploited in robot vision [55] for navigation [135,
21,128,145] and 3D pose estimation and mapping [101,85,
34].
Further important application areas are 3D reconstruc-
tion of objects and environments [17,27,6,31,67,63], com-
puter graphics [122,103,65] and 3DTV [120,118,133,134,
78]. (See the study [116] for a recent survey of depth sensing
for 3D television.) ToF cameras are applied in various tasks
related to recognition and tracking of people [40,7,64] and
parts of human body: hand [79,91], head [35] and face [91,
108]. Alenya et al. [1] use colour and ToF camera data to
build 3D models of leaves for automated plant measure-
ment. Additional applications are discussed in the recent
book [37].
3 Upsampling with stereo and with multiple
measurements
Low resolution and low signal-to-noise ratio are the two
main disadvantages of ToF depth imagery. The goal of depth
image upsampling is to increase the resolution and simul-
taneously improve image quality, in particular, near depth
edges where surface discontinuities tend to result in erro-
neous or missing measurements [28]. In some applications,
such as mixed reality, game industry and 3DTV, the depth
edge areas are especially important because they determine
occlusion and disocclusion of moving actors.
Approaches to depth upsampling can be categorised into
three main classes [24]. In this survey, we discuss image-
guided upsampling when a high-resolution optical image
registered with a low-resolution depth image is used to refine
the depth. However, for completeness we will now briefly
discuss the other two classes, as well.
Note that most of the ToF depth upsampling methods
surveyed in this paper deal with lateral depth enhancement.
As already mentioned, some techniques for RGBD data pro-
cessing [144,95,44] enhance fine shape details by calculat-
ing surface normals.
ToF–stereo fusion combines ToF depth with multicam-
era stereo data. A recent survey of this type of depth upsam-
pling is available in [90]. Hansard et al. [45] discuss some
variants of this approach and provide a comparative eval-
uation of several methods. The important issue of register-
ing the ToF camera and the stereo data is also addressed.
By mapping ToF depth values to the disparities of a high-
resolution camera pair, it is possible to simultaneously up-
sample the depth values and improve the quality of the dis-
parities [39]. Kim et al. [63] address the problem of sparsely
textured surfaces and self-occlusions in stereo vision by fus-
ing multicamera stereo data with multiview ToF sensor mea-
surements. The method yields dense and detailed 3D models
of scenes challenging for stereo alone while enhancing the
ToF depth images. Zhu et al. [150,149,151] also explore the

complementary features of ToF cameras and stereo in order
to improve accuracy and robustness.
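
For reference, the depth-to-disparity mapping mentioned above is the standard relation for a rectified camera pair. With baseline b and focal length f in pixels (our symbols, stated for completeness rather than taken from [39]), a ToF depth measurement Z and its uncertainty σ_Z propagate as

```latex
d = \frac{f\,b}{Z},
\qquad
\sigma_d \approx \left|\frac{\partial d}{\partial Z}\right|\sigma_Z
         = \frac{f\,b}{Z^{2}}\,\sigma_Z ,
```

so each ToF measurement yields a disparity hypothesis with an uncertainty band that can restrict or reweight the stereo matching search.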
Yang et al. [141] present a setup that combines a ToF
depth camera with three stereo cameras and report on GPU-
based, fast stereo depth frame grabbing and real-time ToF
depth upsampling. The system fails in large surface regions
of dark (e.g., black) colour that cause trouble for both stereo and ToF cameras. Bartczak and Koch [5] combine multiple
high-resolution colour views with a ToF camera to obtain
dense depth maps of a scene. Similar input data are used by
Li et al. [73] who present a joint learning-based method ex-
ploiting differential features of the observed surface. Kang
and Ho [60,51] report on a system that contains multiple
depth and colour cameras.
Hahne and Alexa [41,42] claim that the combination of a ToF camera and stereo vision can provide enhanced depth data
even without precise calibration. Kuhnert and Stommel [67]
fuse ToF depth data with stereo data for real-time indoor
3D environment reconstruction in mobile robotics. Further
methods are discussed in the recent survey [90]. A drawback
of ToF–stereo is that it still inherits critical problems of pas-
sive stereo vision: the correspondence problem, the problem
of textureless surfaces, and the problem of occlusions.
A natural way to improve resolution is to combine mul-
tiple measurements of an object. In optical imaging, numer-
ous studies are devoted to super-resolution [131,129] or up-
sampling [23] of colour images. Fusing multiple ToF depth
measurements into one image is sometimes referred to as
temporal and spatial upsampling [24]. This type of depth
upsampling is less widespread than ToF–stereo fusion and
image-guided methods.
Hahne and Alexa [43] obtain enhanced depth images
by adaptively combining several images taken with differ-
ent exposure (integration) times. Their method is inspired
by techniques applied in high dynamic range (HDR) imag-
ing where different measures of image quality are used as
weights for adaptive colour image fusion. For depth image
fusion, the method [43] uses local measures of depth con-
trast, well-exposedness, surface smoothness, and uncertainty
defined via signal entropy.
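
The sketch below conveys the flavour of such exposure-weighted fusion. It is not the algorithm of [43]: as simplified stand-ins for the four measures listed above, it weights each measurement by its amplitude (a proxy for well-exposedness) and by the inverse of the local depth variance (a proxy for surface smoothness).

```python
import numpy as np
from scipy.ndimage import generic_filter

def fuse_exposures(depths, amplitudes, eps=1e-6):
    """Per-pixel weighted average of depth maps captured with different
    integration times from the same viewpoint, favouring high-amplitude,
    locally smooth measurements."""
    num = np.zeros_like(depths[0], dtype=np.float64)
    den = np.zeros_like(depths[0], dtype=np.float64)
    for d, a in zip(depths, amplitudes):
        var = generic_filter(d, np.var, size=3)  # local depth variance
        w = a / (var + eps)                      # per-pixel quality weight
        num += w * d
        den += w
    return num / (den + eps)
```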
In [115,15], the authors acquire multiple depth images
of a static scene from different viewpoints and merge them
into a single depth map of higher resolution. An advantage
of such approaches is that they do not need a sensor of another type. Working with depth images only allows one to
avoid the so-called ‘texture copying problem’ of sensor fusion, when high-contrast image textures tend to ‘imprint’ onto the
upsampled depth image. This negative effect will be dis-
cussed later in relation to image-guided upsampling. A limi-
tation of the methods [115,15] is that only static objects can
be measured.
Mac Aodha et al. [83] use a training dataset of high-
resolution depth images for patch-based upsampling of a
low-resolution depth image. Although theoretically attrac-
tive, the method is too time-consuming for most applica-
tions. A somewhat similar patch-based approach is presented
by Hornacek et al. [52] who exploit patch-wise self-similarity
of a scene and search for patch correspondences within the
input depth image. The method [52] aims at single image
based upsampling while the algorithm [83] needs a large
collection of high-resolution exemplars to search in. A draw-
back of the method [52] is that it relies on patch correspon-
dences which may be difficult to obtain, especially for less
characteristic surface regions.
Riegler et al. [104] use a deep network for single depth
map super-resolution. The same problem is addressed in [3]
using the Discrete Wavelet Transform and in [84] using sub-
dictionaries of exemplars constructed from example depth
maps. Finally, the patent [61] describes a method for com-
bined depth filtering and resolution refinement.
4 Image-guided depth upsampling
In this section, we provide a survey of depth upsampling
based on a single optical image assuming calibrated and
fused depth and colour data. As discussed later, precise cali-
bration and sensor fusion are essential for good upsampling.
Similarly to the ToF-stereo fusion survey [90], we classify
the methods as local or global. The former category applies
local filtering while the latter uses global optimisation. The
approaches that fall in neither of the two classes are dis-
cussed separately.
We start the presentation of the methods by illustrating
the upsampling problem, discussing its difficulties and in-
troducing the notation. Fig. 1 shows an example of success-
ful upsampling of a high-quality depth image of low resolu-
tion. The input depth and colour images are from the Mid-
dlebury stereo dataset [110]. The original high-resolution
depth image was acquired with structured light, then artifi-
cially downsampled to get the low-resolution image shown
in Fig. 1. Small parts of depth data (dark regions) are lost.
The upsampled depth is smooth and very similar to the orig-
inal high-resolution data used as the ground truth. In the
Middlebury data, depth discontinuities match well the cor-
responding edges of the colour image. This dataset is of-
ten used for quantitative comparative evaluation of image-
guided upsampling techniques.

Fig. 1 Sample Middlebury data, the upsampled depth and the ground truth. (Panels, left to right: input depth and colour images, upsampled depth, ground-truth depth.)
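
A typical quantitative protocol on such data can be sketched as follows. This is a hedged outline: the exact downsampling scheme, invalid-pixel handling and error metric differ between the papers compared in Section 5.

```python
import numpy as np
import cv2

def evaluate(gt_depth, upsample_fn, factor=8):
    """Downsample the ground-truth depth, upsample it back with the
    method under test, and report RMSE against the ground truth."""
    h, w = gt_depth.shape
    low = gt_depth[::factor, ::factor]   # simulated low-resolution input
    est = upsample_fn(low, (h, w))       # method under test
    valid = gt_depth > 0                 # ignore missing measurements
    return float(np.sqrt(np.mean((est[valid] - gt_depth[valid]) ** 2)))

# Guide-free bicubic interpolation is a common baseline; note that
# cv2.resize expects the target size as (width, height).
bicubic = lambda low, size: cv2.resize(
    low, (size[1], size[0]), interpolation=cv2.INTER_CUBIC)
```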
For real-world data, the upsampling problem is more
complicated than for the Middlebury data. Fig. 2 illustrates
the negative features of depth images captured by ToF cameras (data courtesy of Zinemath Zrt [152]). The original depth resolution is very low compared to that of the colour image. When resized to the size of the colour image, the depth image clearly shows its deficiencies: a part of the data is lost due to low resolution; some
shapes, e.g., the heads, are distorted. Despite the calibration,
the contours of the depth image do not always coincide with
those of the colour image. There are erroneous and missing
measurements along the depth edges, in the dark region on
the top, and in the background between the chair and the
poster.
To use a high-resolution image for depth upsampling,
one needs to relate image features to depth features. A basic
assumption exploited by most upsampling methods is that
image edges are related to depth edges, that is, to surface discontinuities. It is often assumed [18,33,81,97,98,74,24]
that smooth depth regions exhibit themselves as smooth in-
tensity/colour regions, while depth edges underlie intensity
edges. We will call this condition the depth-intensity edge
coincidence assumption.
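
One common way this assumption enters the global optimisation methods mentioned above, written here as an illustrative formulation rather than the objective of any single paper (Ω(p) and Ĩ are defined at the end of this section), is a guide-weighted smoothness energy:

```latex
\hat{D} = \arg\min_{D}
  \sum_{p \in \mathcal{M}} \bigl(D_p - D^{\mathrm{in}}_p\bigr)^{2}
  + \lambda \sum_{p} \sum_{q \in \Omega(p)}
      w_{pq}\,\bigl(D_p - D_q\bigr)^{2},
\qquad
w_{pq} = \exp\!\left(-\frac{\|\tilde{I}_p - \tilde{I}_q\|^{2}}
                           {2\sigma_r^{2}}\right),
```

where M is the set of pixels carrying a depth measurement and D^in is the input depth. Smoothing is strong between pixels of similar guide colour and suppressed across intensity edges, so recovered depth edges are pulled towards image edges.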
Clearly, the assumption is violated in the regions of high-
contrast texture and on the border of a strong shadow. Some
studies [139,123] relax it in order to circumvent the prob-
lems discussed below and avoid the resulting artefacts. How-
ever, depth edges are in any case a sensitive issue. Since im-
age features are the only data available for upsampling, one
has to find a balance between the edge coincidence assump-
tion and other priors. This balance is data-dependent, which
may necessitate adaptive parameter tuning of an upsampling
algorithm.
Precise camera calibration is crucial for the applications
that require good-quality depth images, in general, and accu-
rate depth discontinuities, in particular. Techniques and en-
gineering tools used to calibrate ToF cameras and enhance
their quality are discussed in numerous studies [77,50,45,
99,102,72,58]. Procedures for joint calibration of a ToF cam-
era and an intensity camera are described in [97,98,24,132].
Many researchers apply the well-known calibration meth-
od [147]. A ToF camera calibration toolbox implementing
the method presented in [69] is available at the web site [68].
Inaccurate registration of depth and intensity images due
to imprecise calibration results in deterioration of the up-
sampled depth. Schwarz et al. [117] propose an error mod-
el for ToF sensor fusion and analyse the relation between the
model and inaccuracies in camera calibration and depth mea-
surements. Xu et al. [137] address the problem of misalign-
ment correction in the context of depth image-based render-
ing. Fig. 3 illustrates the effect of misalignment on depth up-
sampling. The discrepancy between the depth and intensity
images is artificially introduced by a relative shift of two,
five and ten pixels. As the shift grows, the depth borders be-
come blurred and coarse.
Because of the optical radial distortion typical for many
cameras, the discrepancy between the input images tends to
grow with the distance from image centre. Fig. 4 shows an
example of this phenomenon. The shape of the person in
the centre of the scene in Fig. 4a is quite precise, with even
fine details such as fingers being upsampled correctly. When
the person moves to the periphery of the scene (Fig. 4b),
his shape, e.g., in the region of the neck, becomes visibly
distorted due to the growing misalignment.
Avoiding depth blur to preserve sharp depth edges is a major issue for upsampling methods. Because of the
depth-intensity edge coincidence assumption, this issue is
related to the texture copying (transfer) problem. High-contrast image textures tend to show up in the upsampled depth image, as illustrated in Fig. 5, where textured regions
cause visible perturbation in the refined depth. This disturb-
ing phenomenon and possible remedies are discussed in the
papers [139,123]. Further typical problems of image-guided
depth upsampling are mentioned in Section 6.
In the sequel, we use the following notation:

D             Input (depth) image.
D̂             Filtered / upsampled image.
∇D            Gradient image.
Ĩ             Guide / reference image.
p, q, . . .   2D pixel coordinates.
‖p − q‖       Distance between pixels p and q.
p↓, q↓, . . . Low-resolution coordinates, possibly fractional.
Ω(p)          A window around pixel p.
D_q           The D value of pixel q.
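
Using this notation, the joint (cross) bilateral upsampling filter that underlies many of the local methods can be sketched as follows. This is a minimal, unoptimised illustration of the general idea, assuming a grayscale guide image, rather than the exact filter of any one paper.

```python
import numpy as np

def joint_bilateral_upsample(D, I, sigma_s=2.0, sigma_r=10.0, radius=2):
    """Estimate D̂ at each high-resolution pixel p as a weighted average
    of low-resolution depth values over a window Ω(p↓); spatial weights
    live in low-resolution coordinates, range weights come from the
    high-resolution guide image I."""
    h, w = I.shape            # high-resolution (guide) size
    lh, lw = D.shape          # low-resolution (depth) size
    fy, fx = h / lh, w / lw   # upsampling factors
    D_hat = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            yd, xd = y / fy, x / fx   # p↓, possibly fractional
            acc, ws = 0.0, 0.0
            for qy in range(max(0, int(yd) - radius),
                            min(lh, int(yd) + radius + 1)):
                for qx in range(max(0, int(xd) - radius),
                                min(lw, int(xd) + radius + 1)):
                    # spatial weight: distance ‖p↓ − q‖ in low-res pixels
                    g_s = np.exp(-((qy - yd) ** 2 + (qx - xd) ** 2)
                                 / (2 * sigma_s ** 2))
                    # range weight: guide difference at the corresponding
                    # high-res positions (edge coincidence assumption)
                    qyh = min(h - 1, int(round(qy * fy)))
                    qxh = min(w - 1, int(round(qx * fx)))
                    diff = float(I[y, x]) - float(I[qyh, qxh])
                    g_r = np.exp(-diff ** 2 / (2 * sigma_r ** 2))
                    acc += g_s * g_r * D[qy, qx]
                    ws += g_s * g_r
            D_hat[y, x] = acc / ws
    return D_hat
```

The texture copying problem discussed above arises directly from the range weight: a strong intensity edge inside a smooth depth region still modulates g_r and perturbs the estimated depth.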
