Unsupervised Monocular Depth Estimation with Left-Right Consistency
Clément Godard    Oisin Mac Aodha    Gabriel J. Brostow
University College London
http://visual.cs.ucl.ac.uk/pubs/monoDepth/
Abstract
Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.
1. Introduction
Depth estimation from images has a long history in computer vision. Fruitful approaches have relied on structure from motion, shape-from-X, binocular, and multi-view stereo. However, most of these techniques rely on the assumption that multiple observations of the scene of interest are available. These can come in the form of multiple viewpoints, or observations of the scene under different lighting conditions. To overcome this limitation, there has recently been a surge in the number of works that pose the task of monocular depth estimation as a supervised learning problem [32, 10, 36]. These methods attempt to directly predict the depth of each pixel in an image using models that have been trained offline on large collections of ground truth depth data. While these methods have enjoyed great success, to date they have been restricted to scenes where large image collections and their corresponding pixel depths are available.

Figure 1. Our depth prediction results on KITTI 2015. Top to bottom: input image, ground truth disparities, and our result. Our method is able to estimate depth for thin structures such as street signs and poles.
Understanding the shape of a scene from a single image, independent of its appearance, is a fundamental problem in machine perception. There are many applications such as synthetic object insertion in computer graphics [29], synthetic depth of field in computational photography [3], grasping in robotics [34], using depth as a cue in human body pose estimation [48], robot assisted surgery [49], and automatic 2D to 3D conversion in film [53]. Accurate depth data from one or more cameras is also crucial for self-driving cars, where expensive laser-based systems are often used.
Humans perform well at monocular depth estimation by exploiting cues such as perspective, scaling relative to the known size of familiar objects, appearance in the form of lighting and shading and occlusion [24]. This combination of both top-down and bottom-up cues appears to link full scene understanding with our ability to accurately estimate depth. In this work, we take an alternative approach and treat automatic depth estimation as an image reconstruction problem during training. Our fully convolutional model does not require any depth data, and is instead trained to synthesize depth as an intermediate. It learns to predict the pixel-level correspondence between pairs of rectified stereo images that have a known camera baseline. There are some existing methods that also address the same problem, but with several limitations. For example they are not fully differentiable, making training suboptimal [16], or have image formation models that do not scale to large output resolutions [53]. We improve upon these methods with a novel training objective and enhanced network architecture that significantly increases the quality of our final results. An example result from our algorithm is illustrated in Fig. 1. Our method is fast and only takes on the order of 35 milliseconds to predict a dense depth map for a 512×256 image on a modern GPU. Specifically, we propose the following contributions:
1) A network architecture that performs end-to-end unsupervised monocular depth estimation with a novel training loss that enforces left-right depth consistency inside the network.
2) An evaluation of several training losses and image formation models highlighting the effectiveness of our approach.
3) In addition to showing state of the art results on a challenging driving dataset, we also show that our model generalizes to three different datasets, including a new outdoor urban dataset that we have collected ourselves, which we make openly available.
2. Related Work
There is a large body of work that focuses on depth estimation from images, either using pairs [46], several overlapping images captured from different viewpoints [14], temporal sequences [44], or assuming a fixed camera, static scene, and changing lighting [52, 2]. These approaches are typically only applicable when there is more than one input image available of the scene of interest. Here we focus on works related to monocular depth estimation, where there is only a single input image, and no assumptions about the scene geometry or types of objects present are made.
Learning-Based Stereo
The vast majority of stereo estimation algorithms have a data term which computes the similarity between each pixel in the first image and every other pixel in the second image. Typically the stereo pair is rectified and thus the problem of disparity (i.e. scaled inverse depth) estimation can be posed as a 1D search problem for each pixel. Recently, it has been shown that instead of using hand defined similarity measures, treating the matching as a supervised learning problem and training a function to predict the correspondences produces far superior results [54, 31]. It has also been shown that posing this binocular correspondence search as a multi-class classification problem has advantages both in terms of quality of results and speed [38]. Instead of just learning the matching function, Mayer et al. [39] introduced a fully convolutional [47] deep network called DispNet that directly computes the correspondence field between two images. At training time, they attempt to directly predict the disparity for each pixel by minimizing a regression training loss. DispNet has a similar architecture to their previous end-to-end deep optical flow network [12].

The above methods rely on having large amounts of accurate ground truth disparity data and stereo image pairs at training time. This type of data can be difficult to obtain for real world scenes, so these approaches typically use synthetic data for training. Synthetic data is becoming more realistic, e.g. [15], but still requires the manual creation of new content for every new application scenario.
Supervised Single Image Depth Estimation
Single-view, or monocular, depth estimation refers to the problem setup where only a single image is available at test time. Saxena et al. [45] proposed a patch-based model known as Make3D that first over-segments the input image into patches and then estimates the 3D location and orientation of local planes to explain each patch. The predictions of the plane parameters are made using a linear model trained offline on a dataset of laser scans, and the predictions are then combined together using an MRF. The disadvantage of this method, and other planar based approximations, e.g. [22], is that they can have difficulty modeling thin structures and, as predictions are made locally, lack the global context required to generate realistic outputs. Instead of hand-tuning the unary and pairwise terms, Liu et al. [36] use a convolutional neural network (CNN) to learn them. In another local approach, Ladicky et al. [32] incorporate semantics into their model to improve their per pixel depth estimation. Karsch et al. [28] attempt to produce more consistent image level predictions by copying whole depth images from a training set. A drawback of this approach is that it requires the entire training set to be available at test time.

Eigen et al. [10, 9] showed that it was possible to produce dense pixel depth estimates using a two scale deep network trained on images and their corresponding depth values. Unlike most other previous work in single image depth estimation, they do not rely on hand crafted features or an initial over-segmentation and instead learn a representation directly from the raw pixel values. Several works have built upon the success of this approach using techniques such as CRFs to improve accuracy [35], changing the loss from regression to classification [5], using other more robust loss functions [33], and incorporating strong scene priors in the case of the related problem of surface normal estimation [50]. Again, like the previous stereo methods, these approaches rely on having high quality, pixel aligned, ground truth depth at training time. We too perform single image depth estimation, but train with an added binocular color image, instead of requiring ground truth depth.
Unsupervised Depth Estimation
Recently, a small number of deep network based methods for novel view synthesis and depth estimation have been proposed, which do not require ground truth depth at training time. Flynn et al. [13] introduced a novel image synthesis network called DeepStereo that generates new views by selecting pixels from nearby images. During training, the relative pose of multiple cameras is used to predict the appearance of a held-out nearby image. Then the most appropriate depths are selected to sample colors from the neighboring images, based on plane sweep volumes. At test time, image synthesis is performed on small overlapping patches. As it requires several nearby posed images at test time, DeepStereo is not suitable for monocular depth estimation.
The Deep3D network of Xie et al. [53] also addresses the problem of novel view synthesis, where their goal is to generate the corresponding right view from an input left image (i.e. the source image) in the context of binocular pairs. Again using an image reconstruction loss, their method produces a distribution over all the possible disparities for each pixel. The resulting synthesized right image pixel values are a combination of the pixels on the same scan line from the left image, weighted by the probability of each disparity. The disadvantage of their image formation model is that increasing the number of candidate disparity values greatly increases the memory consumption of the algorithm, making it difficult to scale their approach to bigger output resolutions. In this work, we perform a comparison to the Deep3D image formation model, and show that our algorithm produces superior results.
Closest to our model in spirit is the concurrent work of Garg et al. [16]. Like Deep3D and our method, they train a network for monocular depth estimation using an image reconstruction loss. However, their image formation model is not fully differentiable. To compensate, they perform a Taylor approximation to linearize their loss, resulting in an objective that is more challenging to optimize. Similar to other recent work, e.g. [43, 56, 57], our model overcomes this problem by using bilinear sampling [27] to generate images, resulting in a fully (sub-)differentiable training loss.
We propose a fully convolutional deep neural network loosely inspired by the supervised DispNet architecture of Mayer et al. [39]. By posing monocular depth estimation as an image reconstruction problem, we can solve for the disparity field without requiring ground truth depth. However, only minimizing a photometric loss can result in good quality image reconstructions but poor quality depth. Among other terms, our fully differentiable training loss includes a left-right consistency check to improve the quality of our synthesized depth images. This type of consistency check is commonly used as a post-processing step in many stereo methods, e.g. [54], but we incorporate it directly into our network.
3. Method
This section describes our single image depth prediction
network. We introduce a novel depth estimation training loss,
featuring an inbuilt left-right consistency check, which enables
us to train on image pairs without requiring supervision in the
form of ground truth depth.
3.1. Depth Estimation as Image Reconstruction
Given a single image $I$ at test time, our goal is to learn a function $f$ that can predict the per-pixel scene depth, $\hat{d} = f(I)$. Most existing learning based approaches treat this as a supervised learning problem, where they have color input images and their corresponding target depth values at training. It is presently not practical to acquire such ground truth depth data for a large variety of scenes. Even expensive hardware, such as laser scanners, can be imprecise in natural scenes featuring movement and reflections. As an alternative, we instead pose depth estimation as an image reconstruction problem during training. The intuition here is that, given a calibrated pair of binocular cameras, if we can learn a function that is able to reconstruct one image from the other, then we have learned something about the 3D shape of the scene that is being imaged.

Figure 2. Our loss module outputs left and right disparity maps, $d^l$ and $d^r$. The loss combines smoothness, reconstruction, and left-right disparity consistency terms. This same module is repeated at each of the four different output scales. C: Convolution, UC: Up-Convolution, S: Bilinear Sampling, US: Up-Sampling, SC: Skip Connection.
Specifically, at training time, we have access to two images $I^l$ and $I^r$, corresponding to the left and right color images from a calibrated stereo pair, captured at the same moment in time. Instead of trying to directly predict the depth, we attempt to find the dense correspondence field $d^r$ that, when applied to the left image, would enable us to reconstruct the right image. We will refer to the reconstructed image $I^l(d^r)$ as $\tilde{I}^r$. Similarly, we can also estimate the left image given the right one, $\tilde{I}^l = I^r(d^l)$. Assuming that the images are rectified [19], $d$ corresponds to the image disparity - a scalar value per pixel that our model will learn to predict. Given the baseline distance $b$ between the cameras and the camera focal length $f$, we can then trivially recover the depth $\hat{d}$ from the predicted disparity, $\hat{d} = bf/d$.
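For illustration, a minimal NumPy sketch of this disparity-to-depth conversion is given below; the function name and the calibration values in the example are placeholders, not taken from the paper.

```python
import numpy as np

def disparity_to_depth(disp_pixels, baseline_m, focal_px, eps=1e-6):
    """Recover depth via d_hat = b * f / d (Sec. 3.1).

    disp_pixels : HxW array of predicted disparities in pixels.
    baseline_m  : stereo baseline b in metres.
    focal_px    : horizontal focal length f in pixels.
    """
    return baseline_m * focal_px / np.maximum(disp_pixels, eps)

# Example with made-up calibration values (not from the paper):
disp = np.full((256, 512), 20.0)          # a constant 20 px disparity
depth = disparity_to_depth(disp, baseline_m=0.5, focal_px=700.0)
print(depth[0, 0])                        # 0.5 * 700 / 20 = 17.5 m
```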
3.2. Depth Estimation Network
At a high level, our network estimates depth by inferring the disparities that warp the left image to match the right one. The key insight of our method is that we can simultaneously infer both disparities (left-to-right and right-to-left), using only the left input image, and obtain better depths by enforcing them to be consistent with each other.

Figure 3. Sampling strategies for backward mapping. With naïve sampling the CNN produces a disparity map aligned with the target instead of the input. No LR corrects for this, but suffers from artifacts. Our approach uses the left image to produce disparities for both images, improving quality by enforcing mutual consistency.

Our network generates the predicted image with backward mapping using a bilinear sampler, resulting in a fully differentiable image formation model. As illustrated in Fig. 3, naïvely learning to generate the right image by sampling from the left one will produce disparities aligned with the right image (target). However, we want the output disparity map to align with the input left image, meaning the network has to sample from the right image. We could instead train the network to generate the left view by sampling from the right image, thus creating a left view aligned disparity map (No LR in Fig. 3). While this alone works, the inferred disparities exhibit ‘texture-copy’ artifacts and errors at depth discontinuities as seen in Fig. 5. We solve this by training the network to predict the disparity maps for both views by sampling from the opposite input images. This still only requires a single left image as input to the convolutional layers and the right image is only used during training (Ours in Fig. 3). Enforcing consistency between both disparity maps using this novel left-right consistency cost leads to more accurate results.
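For illustration, the backward mapping described above can be sketched as 1D bilinear sampling along scan lines; the function name, the sign convention of the horizontal shift, and the NumPy implementation are assumptions of this sketch, not the released code.

```python
import numpy as np

def bilinear_sample_1d(img, disp, sign=-1.0):
    """Reconstruct a view by sampling the opposite image along scan lines.

    img  : HxWxC source image (e.g. the right view) as a float array.
    disp : HxW disparity map, in pixels, aligned with the *output* view.
    sign : direction of the horizontal shift; shifting left by the
           disparity is an illustrative assumption.
    Each output pixel is a weighted sum of two horizontal neighbours,
    so the operation is (sub-)differentiable with respect to disp.
    """
    h, w, _ = img.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0).astype(np.float64)
    sample_x = np.clip(xs + sign * disp, 0.0, w - 1.0)

    x0 = np.floor(sample_x).astype(int)          # left neighbour
    x1 = np.clip(x0 + 1, 0, w - 1)               # right neighbour
    frac = (sample_x - x0)[..., None]            # interpolation weight

    rows = np.arange(h)[:, None].repeat(w, axis=1)
    return (1.0 - frac) * img[rows, x0] + frac * img[rows, x1]

# e.g. reconstruct the left view from the right view and a left-aligned disparity:
# I_l_tilde = bilinear_sample_1d(I_r, d_l)
```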
Our fully convolutional architecture is inspired by DispNet [39], but features several important modifications that enable us to train without requiring ground truth depth. Our network is composed of two main parts - an encoder (from cnv1 to cnv7b) and a decoder (from upcnv7); please see the supplementary material for a detailed description. The decoder uses skip connections [47] from the encoder's activation blocks, enabling it to resolve higher resolution details. We output disparity predictions at four different scales (disp4 to disp1), which double in spatial resolution at each of the subsequent scales. Even though it only takes a single image as input, our network predicts two disparity maps at each output scale - left-to-right and right-to-left.
3.3. Training Loss
We define a loss $C_s$ at each output scale $s$, forming the total loss as the sum $C = \sum_{s=1}^{4} C_s$. Our loss module (Fig. 2) computes $C_s$ as a combination of three main terms,

$$C_s = \alpha_{ap}(C^l_{ap} + C^r_{ap}) + \alpha_{ds}(C^l_{ds} + C^r_{ds}) + \alpha_{lr}(C^l_{lr} + C^r_{lr}), \quad (1)$$

where $C_{ap}$ encourages the reconstructed image to appear similar to the corresponding training input, $C_{ds}$ enforces smooth disparities, and $C_{lr}$ prefers the predicted left and right disparities to be consistent. Each of the main terms contains both a left and a right image variant, but only the left image is fed through the convolutional layers.

Next, we present each component of our loss in terms of the left image (e.g. $C^l_{ap}$). The right image versions, e.g. $C^r_{ap}$, require swapping left for right and sampling in the opposite direction.
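A schematic sketch of how the per-scale terms of Eq. (1) could be combined is shown below; the dictionary layout and function name are illustrative assumptions, the individual loss terms are assumed to be computed elsewhere (they are sketched under the corresponding headings that follow), and the weights match the values reported in Sec. 4.1.

```python
def total_loss(scales, alpha_ap=1.0, alpha_lr=1.0, alpha_ds_base=0.1):
    """Sum the per-scale losses C_s of Eq. (1) over the four output scales.

    `scales` is a list of dicts, one per scale, each holding the left/right
    appearance (ap), disparity smoothness (ds) and left-right consistency (lr)
    terms already computed for that scale, plus its downscaling factor r.
    A schematic illustration of Eq. (1), not the released code.
    """
    C = 0.0
    for s in scales:
        alpha_ds = alpha_ds_base / s["r"]    # alpha_ds = 0.1 / r, see Sec. 4.1
        C += (alpha_ap * (s["ap_l"] + s["ap_r"])
              + alpha_ds * (s["ds_l"] + s["ds_r"])
              + alpha_lr * (s["lr_l"] + s["lr_r"]))
    return C
```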
Appearance Matching Loss
During training, the network learns to generate an image by sampling pixels from the opposite stereo image. Our image formation model uses the image sampler from the spatial transformer network (STN) [27] to sample the input image using a disparity map. The STN uses bilinear sampling where the output pixel is the weighted sum of four input pixels. In contrast to alternative approaches [16, 53], the bilinear sampler used is locally fully differentiable and integrates seamlessly into our fully convolutional architecture. This means that we do not require any simplification or approximation of our cost function.

Inspired by [55], we use a combination of an L1 and single scale SSIM [51] term as our photometric image reconstruction cost $C_{ap}$, which compares the input image $I^l_{ij}$ and its reconstruction $\tilde{I}^l_{ij}$, where $N$ is the number of pixels,

$$C^l_{ap} = \frac{1}{N} \sum_{i,j} \alpha \, \frac{1 - \mathrm{SSIM}(I^l_{ij}, \tilde{I}^l_{ij})}{2} + (1 - \alpha) \left| I^l_{ij} - \tilde{I}^l_{ij} \right|. \quad (2)$$

Here, we use a simplified SSIM with a $3 \times 3$ block filter instead of a Gaussian, and set $\alpha = 0.85$.
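For reference, Eq. (2) can be written out as the following NumPy sketch for a single-channel image pair; the SSIM stabilising constants and the single-channel simplification are assumptions of this illustration rather than details taken from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim_block(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-scale SSIM with a 3x3 block filter instead of a Gaussian.
    c1 and c2 are the usual stabilising constants for images in [0, 1]."""
    mu_x = uniform_filter(x, size=3)
    mu_y = uniform_filter(y, size=3)
    sigma_x = uniform_filter(x * x, size=3) - mu_x ** 2
    sigma_y = uniform_filter(y * y, size=3) - mu_y ** 2
    sigma_xy = uniform_filter(x * y, size=3) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def appearance_loss(I, I_rec, alpha=0.85):
    """C_ap of Eq. (2): alpha * (1 - SSIM)/2 + (1 - alpha) * |I - I_rec|,
    averaged over pixels. Images are HxW floats in [0, 1]; the paper's model
    evaluates this per colour channel inside the network graph."""
    ssim = np.clip(ssim_block(I, I_rec), -1.0, 1.0)
    l1 = np.abs(I - I_rec)
    return np.mean(alpha * (1.0 - ssim) / 2.0 + (1.0 - alpha) * l1)
```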
Disparity Smoothness Loss
We encourage disparities to be locally smooth with an L1 penalty on the disparity gradients $\partial d$. As depth discontinuities often occur at image gradients, similar to [21], we weight this cost with an edge-aware term using the image gradients $\partial I$,

$$C^l_{ds} = \frac{1}{N} \sum_{i,j} \left| \partial_x d^l_{ij} \right| e^{-\left\| \partial_x I^l_{ij} \right\|} + \left| \partial_y d^l_{ij} \right| e^{-\left\| \partial_y I^l_{ij} \right\|}. \quad (3)$$
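A NumPy sketch of Eq. (3) follows; the use of forward differences and of the channel-wise mean absolute gradient as the norm are assumptions of the illustration.

```python
import numpy as np

def smoothness_loss(disp, img):
    """C_ds of Eq. (3): edge-aware L1 penalty on disparity gradients.

    disp : HxW disparity map.
    img  : HxWxC image aligned with disp; its gradients down-weight the
           penalty at edges. Forward differences are used for brevity.
    """
    dx_d = np.abs(disp[:, 1:] - disp[:, :-1])
    dy_d = np.abs(disp[1:, :] - disp[:-1, :])
    # e^{-||dI||}: mean absolute image gradient over channels
    dx_i = np.mean(np.abs(img[:, 1:] - img[:, :-1]), axis=-1)
    dy_i = np.mean(np.abs(img[1:, :] - img[:-1, :]), axis=-1)
    return np.mean(dx_d * np.exp(-dx_i)) + np.mean(dy_d * np.exp(-dy_i))
```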
Left-Right Disparity Consistency Loss
To produce more accurate disparity maps, we train our network to predict both the left and right image disparities, while only being given the left view as input to the convolutional part of the network. To ensure coherence, we introduce an L1 left-right disparity consistency penalty as part of our model. This cost attempts to make the left-view disparity map be equal to the projected right-view disparity map,

$$C^l_{lr} = \frac{1}{N} \sum_{i,j} \left| d^l_{ij} - d^r_{ij + d^l_{ij}} \right|. \quad (4)$$

Like all the other terms, this cost is mirrored for the right-view disparity map and is evaluated at all of the output scales.
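For illustration, Eq. (4) can be sketched as follows; sampling the right disparity map with linear interpolation along the scan line and the sign of the shift are assumptions of this sketch.

```python
import numpy as np

def lr_consistency_loss(d_l, d_r, sign=-1.0):
    """C_lr of Eq. (4): |d_l(x) - d_r(x shifted by d_l(x))|, averaged over pixels.

    The right disparity map is sampled at the location the left disparity
    points to, then compared with the left disparity itself.
    """
    h, w = d_l.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0).astype(np.float64)
    sample_x = np.clip(xs + sign * d_l, 0.0, w - 1.0)
    x0 = np.floor(sample_x).astype(int)
    x1 = np.clip(x0 + 1, 0, w - 1)
    frac = sample_x - x0
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    d_r_proj = (1.0 - frac) * d_r[rows, x0] + frac * d_r[rows, x1]
    return np.mean(np.abs(d_l - d_r_proj))
```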

Method | Dataset | Abs Rel | Sq Rel | RMSE | RMSE log | D1-all | δ<1.25 | δ<1.25² | δ<1.25³
Ours with Deep3D [53] | K | 0.412 | 16.37 | 13.693 | 0.512 | 66.85 | 0.690 | 0.833 | 0.891
Ours with Deep3Ds [53] | K | 0.151 | 1.312 | 6.344 | 0.239 | 59.64 | 0.781 | 0.931 | 0.976
Ours No LR | K | 0.123 | 1.417 | 6.315 | 0.220 | 30.318 | 0.841 | 0.937 | 0.973
Ours | K | 0.124 | 1.388 | 6.125 | 0.217 | 30.272 | 0.841 | 0.936 | 0.975
Ours | CS | 0.699 | 10.060 | 14.445 | 0.542 | 94.757 | 0.053 | 0.326 | 0.862
Ours | CS + K | 0.104 | 1.070 | 5.417 | 0.188 | 25.523 | 0.875 | 0.956 | 0.983
Ours pp | CS + K | 0.100 | 0.934 | 5.141 | 0.178 | 25.077 | 0.878 | 0.961 | 0.986
Ours resnet pp | CS + K | 0.097 | 0.896 | 5.093 | 0.176 | 23.811 | 0.879 | 0.962 | 0.986
Ours Stereo | K | 0.068 | 0.835 | 4.392 | 0.146 | 9.194 | 0.942 | 0.978 | 0.989

Table 1. Comparison of different image formation models. Results on the KITTI 2015 stereo 200 training set disparity images [17]. For training, K is the KITTI dataset [17] and CS is Cityscapes [8]. Lower is better for Abs Rel, Sq Rel, RMSE, RMSE log, and D1-all; higher is better for the δ columns. Our model with left-right consistency performs the best, and is further improved with the addition of the Cityscapes data. The last row shows the result of our model trained and tested with two input images instead of one (see Sec. 4.3).
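The error measures in Table 1 follow the standard monocular depth evaluation protocol of Eigen et al. [10]; the sketch below writes them out for reference (D1-all, the KITTI stereo outlier rate, is not included), assuming flattened arrays of valid depth values, which is an assumption of this illustration.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth error metrics as reported in Table 1.
    `pred` and `gt` are 1D arrays of valid (positive) depths in metres."""
    thresh = np.maximum(gt / pred, pred / gt)
    return {
        "Abs Rel":  np.mean(np.abs(gt - pred) / gt),
        "Sq Rel":   np.mean((gt - pred) ** 2 / gt),
        "RMSE":     np.sqrt(np.mean((gt - pred) ** 2)),
        "RMSE log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "d<1.25":   np.mean(thresh < 1.25),
        "d<1.25^2": np.mean(thresh < 1.25 ** 2),
        "d<1.25^3": np.mean(thresh < 1.25 ** 3),
    }
```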
At test time, our network predicts the disparity at the finest scale level for the left image $d^l$, which has the same resolution as the input image. Using the known camera baseline and focal length from the training set, we then convert from the disparity map to a depth map. While we also estimate the right disparity $d^r$ during training, it is not used at test time.
4. Results
Here we compare the performance of our approach to both supervised and unsupervised single view depth estimation methods. We train on rectified stereo image pairs, and do not require any supervision in the form of ground truth depth. Existing single image datasets, such as [41, 45], that lack stereo pairs, are not suitable for evaluation. Instead we evaluate our approach using the popular KITTI 2015 [17] dataset. To evaluate our image formation model, we compare to a variant of our algorithm that uses the original Deep3D [53] image formation model and a modified one, Deep3Ds, with an added smoothness constraint. We also evaluate our approach with and without the left-right consistency constraint.
4.1. Implementation Details
The network, which is implemented in TensorFlow [1], contains 31 million trainable parameters, and takes on the order of 25 hours to train using a single Titan X GPU on a dataset of 30 thousand images for 50 epochs. Inference is fast and takes less than 35 ms, or more than 28 frames per second, for a 512×256 image, including transfer times to and from the GPU. Please see the supplementary material and our code¹ for more details.
During optimization, we set the weighting of the different loss components to $\alpha_{ap} = 1$ and $\alpha_{lr} = 1$. The possible output disparities are constrained to be between $0$ and $d_{max}$ using a scaled sigmoid non-linearity, where $d_{max} = 0.3 \times$ the image width at a given output scale. As a result of our multi-scale output, the typical disparity of neighboring pixels will differ by a factor of two between each scale (as we are upsampling the output by a factor of two). To correct for this, we scale the disparity smoothness term $\alpha_{ds}$ with $r$ for each scale to get equivalent smoothing at each level. Thus $\alpha_{ds} = 0.1/r$, where $r$ is the downscaling factor of the corresponding layer with respect to the resolution of the input image that is passed into the network.

¹ Available at https://github.com/mrharicot/monodepth
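A small sketch of these two per-scale settings (the scaled sigmoid on the output disparities and the 0.1/r smoothness weighting) is given below; the function names and the example scale factors r = 1, 2, 4, 8 for a 512-wide input are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def constrain_disparity(raw, scale_width, d_max_frac=0.3):
    """Map unconstrained network outputs to disparities in [0, d_max],
    where d_max = 0.3 x the image width at the given output scale.
    `raw` stands in for the last convolution's output; the name is illustrative."""
    return d_max_frac * scale_width * sigmoid(raw)

def smoothness_weight(r, base=0.1):
    """alpha_ds = 0.1 / r, where r is the downscaling factor of the scale."""
    return base / r

# e.g. for a 512-wide input and scales with r = 1, 2, 4, 8:
# d_max = 153.6, 76.8, 38.4, 19.2 pixels and alpha_ds = 0.1, 0.05, 0.025, 0.0125
```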
For the non-linearities in the network, we used exponential linear units [7] instead of the commonly used rectified linear units (ReLU) [40]. We found that ReLUs tended to prematurely fix the predicted disparities at intermediate scales to a single value, making subsequent improvement difficult. Following [42], we replaced the usual deconvolutions with a nearest neighbor upsampling followed by a convolution. We trained our model from scratch for 50 epochs, with a batch size of 8 using Adam [30], where $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. We used an initial learning rate of $\lambda = 10^{-4}$ which we kept constant for the first 30 epochs before halving it every 10 epochs until the end. We initially experimented with progressive update schedules, as in [39], where lower resolution image scales were optimized first. However, we found that optimizing all four scales at once led to more stable convergence. Similarly, we use an identical weighting for the loss of each scale as we found that weighting them differently led to unstable convergence. We experimented with batch normalization [26], but found that it did not produce a significant improvement, and ultimately excluded it.
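The learning rate schedule described above can be written as a small helper, shown here as an illustrative sketch; the epoch boundaries follow the text (constant for 30 epochs, then halved every 10 epochs of the 50-epoch run), the rest is assumed.

```python
def learning_rate(epoch, base_lr=1e-4):
    """Constant for the first 30 epochs, then halved every 10 epochs."""
    if epoch < 30:
        return base_lr
    return base_lr * 0.5 ** ((epoch - 30) // 10 + 1)

# epochs 0-29 -> 1e-4, epochs 30-39 -> 5e-5, epochs 40-49 -> 2.5e-5
```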
Data augmentation is performed on the fly. We flip the input images horizontally with a 50% chance, taking care to also swap both images so they are in the correct position relative to each other. We also added color augmentations, with a 50% chance, where we performed random gamma, brightness, and color shifts by sampling from uniform distributions in the ranges [0.8, 1.2] for gamma, [0.5, 2.0] for brightness, and [0.8, 1.2] for each color channel separately.
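A sketch of this augmentation pipeline in NumPy is given below; applying identical colour parameters to both images of a pair and clipping to [0, 1] are assumptions of the illustration.

```python
import numpy as np

def augment_pair(img_l, img_r, rng=np.random):
    """On-the-fly augmentation for a rectified stereo pair.
    Images are HxWx3 floats in [0, 1]."""
    # Horizontal flip with 50% chance: flip both images and swap them so
    # they remain a valid left/right pair.
    if rng.rand() < 0.5:
        img_l, img_r = img_r[:, ::-1].copy(), img_l[:, ::-1].copy()
    else:
        img_l, img_r = img_l.copy(), img_r.copy()

    # Colour augmentation with 50% chance: random gamma, brightness and
    # per-channel colour shifts drawn from the stated uniform ranges.
    if rng.rand() < 0.5:
        gamma = rng.uniform(0.8, 1.2)
        brightness = rng.uniform(0.5, 2.0)
        colour = rng.uniform(0.8, 1.2, size=3)
        for img in (img_l, img_r):
            img[:] = np.clip((img ** gamma) * brightness * colour, 0.0, 1.0)
    return img_l, img_r
```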
Resnet50
For the sake of completeness, and similar to [33], we also show a variant of our model using Resnet50 [20] as the encoder, the rest of the architecture, parameters and training procedure staying identical. This variant contains 48 million trainable parameters and is indicated by resnet in result tables.
Post-processing
In order to reduce the effect of stereo disocclusions which create disparity ramps on both the left side of the image and of the occluders, a final post-processing step is performed on the output. For an input image $I$ at test time, we also

References
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[26] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.
[30] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
[51] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 2004.