Semi-Supervised Deep Learning for Monocular Depth Map Prediction
Yevhen Kuznietsov, Jörg Stückler, Bastian Leibe
Computer Vision Group, Visual Computing Institute, RWTH Aachen University
yevhen.kuznietsov@rwth-aachen.de, {stueckler|leibe}@vision.rwth-aachen.de
Abstract
Supervised deep learning often suffers from the lack of sufficient training data. Specifically in the context of monocular depth map prediction, it is barely possible to determine dense ground truth depth images in realistic dynamic outdoor environments. When using LiDAR sensors, for instance, noise is present in the distance measurements, the calibration between sensors cannot be perfect, and the measurements are typically much sparser than the camera images. In this paper, we propose a novel approach to depth map prediction from monocular images that learns in a semi-supervised way. While we use sparse ground-truth depth for supervised learning, we also enforce our deep network to produce photoconsistent dense depth maps in a stereo setup using a direct image alignment loss. In experiments we demonstrate superior performance in depth map prediction from single images compared to the state-of-the-art methods.
1. Introduction
Estimating depth from single images is an ill-posed problem which cannot be solved directly from bottom-up geometric cues in general. Instead, a-priori knowledge about the typical appearance, layout and size of objects needs to be used, or further cues such as shape from shading or focus have to be employed which are difficult to model in realistic settings. In recent years, supervised deep learning approaches have demonstrated promising results for single image depth prediction. These learning approaches appear to capture the statistical relationship between appearance and distance to objects well.
Supervised deep learning, however, requires vast amounts of training data in order to achieve high accuracy and to generalize well to novel scenes. Supplementary depth sensors are typically used to capture ground truth. In the indoor setting, active RGB-D cameras can be used. Outdoors, 3D laser scanners are a popular choice to capture depth measurements. However, using such sensing devices bears several shortcomings. Firstly, the sensors have their own error and noise characteristics, which will be learned by the network. In addition, when using 3D lasers, the measurements are typically much sparser than the images and do not capture high detail depth variations visible in the images well. Finally, accurate extrinsic and intrinsic calibration of the sensors is required. Ground truth data could alternatively be generated through synthetic rendering of depth maps. The rendered images, however, do not fully realistically display the scene and do not incorporate real image noise characteristics.

Figure 1. We concurrently train a CNN from unsupervised and supervised depth cues to achieve state-of-the-art performance in single image depth prediction. For supervised training we use (sparse) ground-truth depth readings from a supplementary sensing cue such as a 3D laser. Unsupervised direct image alignment complements the ground-truth measurements with a training signal that is purely based on the stereo images and the predicted depth map for an image.
Very recently, unsupervised methods have been introduced [6, 9] that learn to predict depth maps directly from the intensity images in a stereo setup, without the need for an additional supplementary modality for capturing the ground truth. One drawback of these approaches is the well-known fact that stereo depth reconstruction based on image matching is an ill-posed problem on its own. To this end, common regularization schemes can be used which impose priors on the depth, such as small depth gradient norms, which may not be fully satisfied in the real environment.
In this paper, we propose a semi-supervised learning approach that makes use of supervised as well as unsupervised training cues to incorporate the best of both worlds. Our method benefits from ground-truth measurements as an unambiguous (but noisy and sparse) cue for the actual depth in the scene. Unsupervised image alignment complements the ground-truth by a huge amount of additional training data which is much simpler to obtain and counteracts the deficiencies of the ground-truth depth measurements. By combining both methods, we achieve significant improvements over the state-of-the-art in single image depth map prediction, which we evaluate on the popular KITTI dataset [7] in urban street scenes. We base our approach on a state-of-the-art deep residual network in an encoder-decoder architecture for this task [16] and augment it with long skip connections between corresponding layers in encoder and decoder to predict high detail output depth maps. Our network converges quickly to a good model from little supervised training data, mainly due to the use of pretrained encoder weights (on the ImageNet [22] classification task) and unsupervised training. The use of supervised training also simplifies unsupervised learning significantly. For instance, a tedious coarse-to-fine image alignment loss as in previous unsupervised learning approaches [6] is not required in our semi-supervised approach.
In summary, we make the following contributions: 1) We propose a novel semi-supervised deep learning approach to single image depth map prediction that uses supervised as well as unsupervised learning cues. 2) Our deep learning approach demonstrates state-of-the-art performance in challenging outdoor scenes on the KITTI benchmark.
2. Related Work
Over the last years, several learning-based approaches to single image depth reconstruction have been proposed that are trained in a supervised way. Often, measured depth from RGB-D cameras or 3D laser scanners is used as ground-truth for training. Saxena et al. [24] proposed one of the first supervised learning-based approaches to single image depth map prediction. They model depth prediction in a Markov random field and use multi-scale texture features that have been hand-crafted. The method also combines monocular cues with stereo correspondences within the MRF.
Many recent approaches learn image features using deep learning techniques. Eigen et al. [5] propose a CNN architecture that integrates coarse-scale depth prediction with fine-scale prediction. The approach of Li et al. [17] combines deep learning features on image patches with hierarchical CRFs defined on a superpixel segmentation of the image. They use pretrained AlexNet [14] features of image patches to predict depth at the center of the superpixels. A hierarchical CRF refines the depth across individual pixels. Liu et al. [20] also propose a deep structured learning approach that avoids hand-crafted features. Their deep convolutional neural fields allow for training CNN features of unary and pairwise potentials end-to-end, exploiting continuous depth and Gaussian assumptions on the pairwise potentials. Very recently, Laina et al. [16] proposed to use a ResNet-based encoder-decoder architecture to produce dense depth maps. They demonstrate the approach to predict depth maps in indoor scenes using RGB-D images for training. Further lines of research in supervised training of depth map prediction use the idea of depth transfer from example images [13, 12, 21], or integrate depth map prediction with semantic segmentation [15, 19, 4, 26, 18].
Only a few very recent methods attempt to learn depth map prediction in an unsupervised way. Garg et al. [6] propose an encoder-decoder architecture similar to FlowNet [3] which is trained to predict single image depth maps on an image alignment loss. The method only requires images of a corresponding camera in a stereo setup. The loss quantifies the photometric error of the input image warped into its corresponding stereo image using the predicted depth. The loss is linearized using first-order Taylor approximation and hence requires coarse-to-fine training. Xie et al. [27] do not regress the depth maps directly, but produce probability maps for different disparity levels. A selection layer then reconstructs the right image using the left image and these probability maps. The network is trained to minimize pixel-wise reconstruction error. Godard et al. [9] also use an image alignment loss in a convolutional encoder-decoder architecture but additionally enforce left-right consistency of the predicted disparities in the stereo pair. Our semi-supervised approach simplifies the use of unsupervised cues and does not require multi-scale depth map prediction in our network architecture. We also do not explicitly enforce left-right consistency, but use both images in the stereo pair equivalently to define our loss function. The semi-supervised method of Chen et al. [1] incorporates the side-task of depth ranking of pairs of pixels for training a CNN on single image depth prediction. For the ranking task, ground-truth is much easier to obtain but only indirectly provides information on continuous depth values. Our approach uses image alignment as a geometric cue which does not require manual annotations.
3. Approach
We base our approach on supervised as well as unsupervised principles for learning single image depth map prediction (see Fig. 1). A straight-forward approach is to use a supplementary measuring device such as a 3D laser in order to capture ground-truth depth readings for supervised training. This process typically requires an accurate extrinsic calibration between the 3D laser sensor and the camera. Furthermore, the laser measurements have several shortcomings. Firstly, they are affected by erroneous readings and noise. They are also typically much sparser than the camera images when projected into the image. Finally, the centers of projection of laser and camera do not coincide. This causes depth readings of objects that are occluded from the view point of the camera to project into the camera image. To counteract these drawbacks, we make use of two-view geometry principles to learn depth prediction directly from the stereo camera images in an unsupervised way. We achieve this by direct image alignment of one stereo image to the other. This process only requires a known camera calibration and the depth map predicted by the CNN. Our semi-supervised approach learns from supervised and unsupervised cues concurrently.

Figure 2. Components and inputs of our novel semi-supervised loss function.
We train the CNN to predict the inverse depth ρ(x) at each pixel x from the RGB image I. According to the ground truth, the predicted inverse depth should correspond to the LiDAR depth measurement Z(x) that projects to the same pixel, i.e.,

ρ(x)^{-1} = Z(x) .   (1)

However, the laser measurements only project to a sparse subset Ω_Z of the pixels in the image.

As the unsupervised training signal, we assume photoconsistency between the left and right stereo images, i.e.,

I_1(x) = I_2(ω(x, ρ(x))) .   (2)
In our calibrated stereo setup, the warping function can be defined as

ω(x, ρ(x)) := x − f b ρ(x)   (3)

on the rectified images, where f is the focal length and b is the baseline. This image alignment constraint holds at every pixel in the image.
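To make Eq. (3) concrete, the following is a minimal Python sketch of the warp under the stated assumptions (rectified images, inverse depth in 1/m, focal length in pixels); the function and argument names are illustrative, not taken from the authors' implementation.

```python
import numpy as np

def warp_x(x_coords, inv_depth, focal, baseline):
    """Sketch of the rectified-stereo warp omega(x, rho(x)) = x - f * b * rho(x).

    x_coords  : pixel x-coordinates (array), in pixels
    inv_depth : predicted inverse depth rho(x) at those pixels, in 1/m
    focal     : focal length f in pixels
    baseline  : stereo baseline b in metres
    f * b * rho(x) is the disparity in pixels; on rectified images only the
    horizontal coordinate changes.
    """
    disparity = focal * baseline * np.asarray(inv_depth, dtype=np.float64)
    return np.asarray(x_coords, dtype=np.float64) - disparity
```

The opposite warping direction uses the opposite sign of the disparity; the sign shown here follows the reconstruction of Eq. (3).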
We additionally make use of the interchangeability of the stereo images. We quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images. We also constrain the depth estimate between the left and right stereo images to be consistent implicitly by enforcing photoconsistency based on the inverse depth prediction for both images, i.e.,

I_left(x) = I_right(ω(x, ρ_left(x))) ,
I_right(x) = I_left(ω(x, ρ_right(x))) .   (4)
Finally, in textureless regions without ground truth depth readings, the depth map prediction problem is ill-posed and an adequate regularization needs to be imposed.
3.1. Loss function
We formulate a single loss function that incorporates both types of constraints that arise from supervised and unsupervised cues seamlessly,

L_θ(I_l, I_r, Z_l, Z_r) = λ_t L^S_θ(I_l, I_r, Z_l, Z_r) + γ L^U_θ(I_l, I_r) + L^R_θ(I_l, I_r) ,   (5)

where λ_t and γ are trade-off parameters between the supervised loss L^S_θ, the unsupervised loss L^U_θ, and a regularization term L^R_θ. With θ we denote the CNN network parameters that generate the inverse depth maps ρ_{r/l}.
Supervised loss. The supervised loss term measures the deviation of the predicted depth map from the available ground truth at the pixels,

L^S_θ = Σ_{x ∈ Ω_{Z,l}} ‖ρ_{l,θ}(x)^{-1} − Z_l(x)‖_δ + Σ_{x ∈ Ω_{Z,r}} ‖ρ_{r,θ}(x)^{-1} − Z_r(x)‖_δ .   (6)

We use the berHu norm ‖·‖_δ as introduced in [16] to focus training on larger depth residuals during CNN training,

‖d‖_δ = |d| if |d| ≤ δ, and ‖d‖_δ = (d² + δ²) / (2δ) if |d| > δ .   (7)

We adaptively set

δ = 0.2 max_{x ∈ Ω_Z} |ρ(x)^{-1} − Z(x)| .   (8)
Note that noise in the ground-truth measurements could be modelled as well, for instance by weighting each residual with the inverse of the measurement variance.
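As an illustration, here is a minimal PyTorch sketch of the supervised term for one image of the stereo pair, combining the berHu norm of Eq. (7) with the adaptive threshold of Eq. (8); the tensor names, the clamping of the inverse depth, and the use of a boolean validity mask are assumptions, not details taken from the paper.

```python
import torch

def berhu_supervised_loss(pred_inv_depth, lidar_depth, valid_mask):
    """Sketch of the supervised loss (Eqs. 6-8) for one image of the stereo pair.

    pred_inv_depth: predicted inverse depth rho(x), shape (B, 1, H, W)
    lidar_depth:    sparse ground-truth depth Z(x), same shape
    valid_mask:     boolean mask of pixels with a projected LiDAR measurement
                    (assumed to contain at least one True entry)
    """
    pred_depth = 1.0 / pred_inv_depth.clamp(min=1e-6)   # rho(x)^-1
    residual = (pred_depth - lidar_depth)[valid_mask]    # only supervised pixels
    abs_res = residual.abs()

    # Adaptive threshold: 20% of the largest residual (Eq. 8); detached so the
    # threshold itself is not differentiated through.
    delta = 0.2 * abs_res.max().detach()

    # berHu norm (Eq. 7): L1 below delta, scaled quadratic above it.
    berhu = torch.where(abs_res <= delta,
                        abs_res,
                        (residual ** 2 + delta ** 2) / (2.0 * delta))
    return berhu.sum()
```

The full supervised term of Eq. (6) evaluates this once per stereo image and sums the two contributions.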
Unsupervised loss. The unsupervised part of our loss quantifies the direct image alignment error in both directions,

L^U_θ = Σ_{x ∈ Ω_{U,l}} |(G_σ ∗ I_l)(x) − (G_σ ∗ I_r)(ω(x, ρ_{l,θ}(x)))| + Σ_{x ∈ Ω_{U,r}} |(G_σ ∗ I_r)(x) − (G_σ ∗ I_l)(ω(x, ρ_{r,θ}(x)))| ,   (9)

with a Gaussian smoothing kernel G_σ with a standard deviation of σ = 1 px. We found this small amount of Gaussian smoothing to be beneficial, presumably due to reducing image noise. We evaluate the direct image alignment loss at the sets of image pixels Ω_{U,l/r} of the reconstructed images that warp to a valid location in the second image. We use linear interpolation for subpixel-level warping.
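The following PyTorch sketch illustrates one direction of Eq. (9): both images are smoothed with a σ = 1 px Gaussian, the second image is warped into the first using the predicted inverse depth with bilinear (linear subpixel) interpolation, and the error is summed over pixels that warp to a valid location. Tensor names, the kernel size, and the sign convention of the warp are assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(img, sigma=1.0, ksize=5):
    """Separable Gaussian smoothing G_sigma * I (sigma = 1 px as in the paper)."""
    half = ksize // 2
    x = torch.arange(ksize, dtype=img.dtype, device=img.device) - half
    k = torch.exp(-(x ** 2) / (2.0 * sigma ** 2))
    k = (k / k.sum()).view(1, 1, 1, ksize)
    c = img.shape[1]
    img = F.conv2d(img, k.repeat(c, 1, 1, 1), padding=(0, half), groups=c)
    img = F.conv2d(img, k.transpose(2, 3).repeat(c, 1, 1, 1), padding=(half, 0), groups=c)
    return img

def alignment_loss_one_direction(img_a, img_b, inv_depth_a, focal, baseline):
    """One direction of the direct image alignment loss (Eq. 9)."""
    b, _, h, w = img_a.shape
    device, dtype = img_a.device, img_a.dtype
    ys, xs = torch.meshgrid(torch.arange(h, device=device, dtype=dtype),
                            torch.arange(w, device=device, dtype=dtype),
                            indexing="ij")
    # Horizontal warp omega(x, rho(x)) = x - f*b*rho(x) on rectified images (Eq. 3).
    x_warp = xs.unsqueeze(0) - focal * baseline * inv_depth_a.squeeze(1)
    y_warp = ys.unsqueeze(0).expand_as(x_warp)
    # Normalize to [-1, 1] for grid_sample (bilinear = subpixel linear interpolation).
    grid = torch.stack((2.0 * x_warp / (w - 1) - 1.0,
                        2.0 * y_warp / (h - 1) - 1.0), dim=-1)
    smooth_a = gaussian_blur(img_a)
    warped_b = F.grid_sample(gaussian_blur(img_b), grid,
                             mode="bilinear", align_corners=True)
    valid = (x_warp >= 0) & (x_warp <= w - 1)   # Omega_U: pixels warping inside img_b
    per_pixel = (smooth_a - warped_b).abs().mean(dim=1)
    return per_pixel[valid].sum()
```

The full term of Eq. (9) evaluates this expression in both directions, swapping the roles of the two images and using the respective inverse depth prediction.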
Regularization loss. As suggested in [9], the smoothness term penalizes depth changes at pixels with low intensity variation. In order to allow for depth discontinuities at object contours, we downscale the regularization term anisotropically according to the intensity variation:

L^R_θ = Σ_{i ∈ {l,r}} Σ_x |φ(∇I_i(x))^T ∇ρ_i(x)|   (10)

with φ(g) = (exp(−η |g_x|), exp(−η |g_y|))^T and η = 1/255.
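A minimal PyTorch sketch of this regularizer follows, assuming forward differences for the gradients, intensities in [0, 255] (so that η = 1/255 is meaningful), and weights that decrease with the image gradient magnitude as described above; the function and tensor names are illustrative.

```python
import torch

def smoothness_loss(img, inv_depth, eta=1.0 / 255.0):
    """Sketch of the edge-aware regularizer (Eq. 10) for one image.

    img:       intensity/RGB image in [0, 255], shape (B, C, H, W)
    inv_depth: predicted inverse depth rho, shape (B, 1, H, W)
    Depth gradients are penalized less where the image gradient is large,
    which keeps depth discontinuities at object contours cheap.
    """
    # Horizontal and vertical forward differences of image and inverse depth.
    img_gx = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    img_gy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    rho_gx = (inv_depth[:, :, :, 1:] - inv_depth[:, :, :, :-1]).abs()
    rho_gy = (inv_depth[:, :, 1:, :] - inv_depth[:, :, :-1, :]).abs()

    # phi(g) = (exp(-eta*|g_x|), exp(-eta*|g_y|)): downscale the penalty at edges.
    loss_x = (torch.exp(-eta * img_gx) * rho_gx).sum()
    loss_y = (torch.exp(-eta * img_gy) * rho_gy).sum()
    return loss_x + loss_y
```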
Supervised, unsupervised, and regularization terms are seamlessly combined within our novel semi-supervised loss function formulation (see Fig. 2). In contrast to previous methods, our approach treats both cameras in the stereo setup equivalently. All three loss components are formulated in a symmetric way for the cameras, which implicitly enforces consistency in the predicted depth maps between the cameras.
3.2. Network Architecture
We use a deep residual network architecture in an encoder-decoder scheme, similar to the supervised approach in [16] (see Fig. 3). Taking inspiration from non-residual architectures such as FlowNet [3], our architecture includes long skip connections between the encoder and decoder to facilitate fine detail predictions at the output resolution. Table 1 details the various layers in our network.

Layer | Channels I/O | Scaling | Inputs
conv1 (7x7, stride 2) | 3 / 64 | 2 | RGB
max pool1 (3x3, stride 2) | 64 / 64 | 4 | conv1
res block1 (type 2, stride 1) | 64 / 256 | 4 | max pool1
res block2 (type 1, stride 1) | 256 / 256 | 4 | res block1
res block3 (type 1, stride 1) | 256 / 256 | 4 | res block2
res block4 (type 2, stride 2) | 256 / 512 | 8 | res block3
res block5 (type 1, stride 1) | 512 / 512 | 8 | res block4
res block6 (type 1, stride 1) | 512 / 512 | 8 | res block5
res block7 (type 1, stride 1) | 512 / 512 | 8 | res block6
res block8 (type 2, stride 2) | 512 / 1024 | 16 | res block7
res block9 (type 1, stride 1) | 1024 / 1024 | 16 | res block8
res block10 (type 1, stride 1) | 1024 / 1024 | 16 | res block9
res block11 (type 1, stride 1) | 1024 / 1024 | 16 | res block10
res block12 (type 1, stride 1) | 1024 / 1024 | 16 | res block11
res block13 (type 1, stride 1) | 1024 / 1024 | 16 | res block12
res block14 (type 2, stride 2) | 1024 / 2048 | 32 | res block13
res block15 (type 1, stride 1) | 2048 / 2048 | 32 | res block14
res block16 (type 1, stride 1) | 2048 / 2048 | 32 | res block15
conv2 (1x1, stride 1) | 2048 / 1024 | 32 | res block16
upproject1 | 1024 / 512 | 16 | conv2
upproject2 | 512 / 256 | 8 | upproject1, res block13
upproject3 | 256 / 128 | 4 | upproject2, res block7
upproject4 | 128 / 64 | 2 | upproject3, res block3
conv3 (3x3, stride 1) | 64 / 1 | 2 | upproject4

Table 1. Layers in our deep residual encoder-decoder architecture. We input the final output layers at each resolution of the encoder at the respective decoder layers (long skip connections). This facilitates the prediction of fine detailed depth maps by the CNN.
Input to our network is the RGB camera image. The encoder resembles a ResNet-50 [11] architecture (without the final fully connected layer) and successively extracts low-resolution high-dimensional features from the input image. The encoder subsamples the input image in 5 stages, the first stage convolving the image to half input resolution and each successive stage stacking multiple residual blocks. The decoder upprojects the output of the encoder using residual blocks. We found that adding long skip connections between corresponding layers in encoder and decoder to this architecture slightly improves the performance on all metrics without affecting convergence. Moreover, the network is able to predict more detailed depth maps than without skip connections.

Figure 3. Illustration of our deep residual encoder-decoder architecture (c1, c3, mp1 abbreviate conv1, conv3, and max pool1, respectively).
Skip connections from corresponding encoder layers to the decoder facilitate fine detailed depth map prediction.
Figure 4. Type 1 residual block resblock^1_s with stride s = 1. The residual is obtained from 3 successive convolutions. The residual has the same number of channels as the input.
Figure 5. Type 2 residual block resblock^2_s with stride s. The residual is obtained from 3 successive convolutions, while the first convolution applies stride s. An additional convolution applies the same stride s and projects the input to the number of channels of the residual.
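For concreteness, here is a minimal PyTorch sketch of the two residual block types of Figs. 4 and 5, assuming the standard ResNet-style bottleneck layout (1x1, 3x3, 1x1 convolutions with batch normalization); only the stride and channel behaviour follows the captions, the filter sizes inside the blocks are an assumption.

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, k, s):
    """k x k convolution with stride s followed by batch normalization."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch))

class ResBlockType1(nn.Module):
    """Type 1 block (Fig. 4): 3 convolutions, identity skip, stride 1."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 4                     # bottleneck width (assumption)
        self.body = nn.Sequential(
            conv_bn(channels, mid, 1, 1), nn.ReLU(inplace=True),
            conv_bn(mid, mid, 3, 1), nn.ReLU(inplace=True),
            conv_bn(mid, channels, 1, 1))
        self.relu = nn.ReLU(inplace=True)       # ReLU applied after the sum

    def forward(self, x):
        return self.relu(x + self.body(x))

class ResBlockType2(nn.Module):
    """Type 2 block (Fig. 5): first convolution and the projection use stride s."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        mid = out_ch // 4
        self.body = nn.Sequential(
            conv_bn(in_ch, mid, 1, stride), nn.ReLU(inplace=True),
            conv_bn(mid, mid, 3, 1), nn.ReLU(inplace=True),
            conv_bn(mid, out_ch, 1, 1))
        self.project = conv_bn(in_ch, out_ch, 1, stride)   # projects the input
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.project(x) + self.body(x))
```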
We denote a convolution of filter size k × k and stride s by conv^k_s. The same notation applies to pooling layers, e.g., max_pool^k_s. Each convolution layer is followed by batch normalization with exception of the last layer in the network. Furthermore, we use ReLU activation functions on the output of the convolutions except at the inputs to the sum operation of the residual blocks, where the ReLU comes after the sum operation. resblock^i_s denotes the residual block of type i with stride s at its first convolution layer; see Figs. 4 and 5 for details on each type of residual block. Smaller feature blocks consist of 16s maps, while larger blocks contain 4 times more feature maps, where s is the output scale of the residual block. Lastly, upproject is the upprojection layer proposed by Laina et al. [16]. We use the fast implementation of upprojection layers, but for better illustration we visualize upprojection by its "naive" version (see Fig. 6).

Figure 6. Schematic illustration of the upprojection residual block. It unpools the input by a factor of 2 and applies a residual block which reduces the number of channels by a factor of 2.
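Below is a minimal PyTorch sketch of the naive upprojection block of Fig. 6: the input is unpooled by a factor of 2 and a residual block halves the number of channels. The 5x5/3x3 filter sizes follow the up-projection design of Laina et al. [16], and replacing zero-interleaved unpooling with nearest-neighbour upsampling is a simplification for brevity.

```python
import torch.nn as nn
import torch.nn.functional as F

class UpProject(nn.Module):
    """Sketch of a 'naive' upprojection block: 2x unpooling, then a residual
    block that reduces the number of channels by a factor of 2."""
    def __init__(self, in_ch):
        super().__init__()
        out_ch = in_ch // 2
        self.branch = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 5, padding=2), nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch))
        self.project = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 5, padding=2), nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Unpool by a factor of 2 (nearest-neighbour upsampling as a stand-in
        # for the zero-interleaved unpooling of the original layer).
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        return self.relu(self.branch(x) + self.project(x))
```

In the decoder of Table 1 such blocks are stacked four times, and the long skip connections feed the corresponding encoder outputs into the upprojection stages.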
4. Experiments
We evaluate our approach on the raw sequences of the KITTI benchmark [7], which is a popular dataset for single image depth map prediction. The sequences contain stereo imagery taken from a driving car in an urban scenario. The dataset also provides 3D laser measurements from a Velodyne laser scanner that we use as ground-truth measurements (projected into the stereo images using the given intrinsics and extrinsics in KITTI). This dataset has been used to train and evaluate the state-of-the-art methods and allows for quantitative comparison.

We evaluate our approach on the KITTI Raw split into 28 testing scenes as proposed by Eigen et al. [5]. We decided to use the remaining sequences of the KITTI Raw dataset for training and validation. We obtained a training set from 28 sequences in which we even out the sequence distribution with 450 frames per sequence. This results in 7346 unique frames and 12600 frames in total for training. We also created a validation set by sampling every tenth frame from the remaining 5 sequences with little image motion. All these sequences are urban, so we additionally select those frames from the training sequences that are in the middle between 2 training images with a distance of at least 20 frames. In total we obtain a validation set of 100 urban and 144 residential area images.
4.1. Implementation Details
We initialize the encoder part of our network with ResNet-50 [11] weights pretrained for the ImageNet classification task. The convolution filter weights in the decoder part are initialized randomly according to the approach of Glorot and Bengio [8]. We also tried the initialization by He et al. [10] but did not notice any performance difference.
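A sketch of this initialization scheme using torchvision's ImageNet-pretrained ResNet-50 for the encoder and Glorot (Xavier) initialization for the decoder convolutions; the encoder/decoder module names and the strict=False weight mapping are placeholders, not the authors' code.

```python
import torch.nn as nn
from torchvision import models

def init_network(encoder, decoder):
    """Copy ImageNet-pretrained ResNet-50 weights into the encoder and apply
    Glorot (Xavier) initialization to the decoder convolutions (placeholder
    module names; the real layer mapping depends on the architecture)."""
    # pretrained=True still works; newer torchvision versions use the weights= argument.
    pretrained = models.resnet50(pretrained=True).state_dict()
    encoder.load_state_dict(pretrained, strict=False)   # skip e.g. the fc layer

    for m in decoder.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.xavier_uniform_(m.weight)            # Glorot and Bengio [8]
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```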


References
[7] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] O. Russakovsky et al. ImageNet large scale visual recognition challenge. IJCV, 2015.