Unsupervised Monocular Depth Estimation with Left-Right Consistency
Clément Godard    Oisin Mac Aodha    Gabriel J. Brostow
University College London
http://visual.cs.ucl.ac.uk/pubs/monoDepth/
Abstract
Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.
1. Introduction
Depth estimation from images has a long history in computer vision. Fruitful approaches have relied on structure from motion, shape-from-X, binocular, and multi-view stereo. However, most of these techniques rely on the assumption that multiple observations of the scene of interest are available. These can come in the form of multiple viewpoints, or observations of the scene under different lighting conditions. To overcome this limitation, there has recently been a surge in the number of works that pose the task of monocular depth estimation as a supervised learning problem [32, 10, 36]. These methods attempt to directly predict the depth of each pixel in an image using models that have been trained offline on large collections of ground truth depth data. While these methods have enjoyed great success, to date they have been restricted to scenes where large image collections and their corresponding pixel depths are available.

Figure 1. Our depth prediction results on KITTI 2015. Top to bottom: input image, ground truth disparities, and our result. Our method is able to estimate depth for thin structures such as street signs and poles.
Understanding the shape of a scene from a single image, independent of its appearance, is a fundamental problem in machine perception. There are many applications such as synthetic object insertion in computer graphics [29], synthetic depth of field in computational photography [3], grasping in robotics [34], using depth as a cue in human body pose estimation [48], robot assisted surgery [49], and automatic 2D to 3D conversion in film [53]. Accurate depth data from one or more cameras is also crucial for self-driving cars, where expensive laser-based systems are often used.
Humans perform well at monocular depth estimation by exploiting cues such as perspective, scaling relative to the known size of familiar objects, appearance in the form of lighting and shading and occlusion [24]. This combination of both top-down and bottom-up cues appears to link full scene understanding with our ability to accurately estimate depth. In this work, we take an alternative approach and treat automatic depth estimation as an image reconstruction problem during training. Our fully convolutional model does not require any depth data, and is instead trained to synthesize depth as an intermediate. It learns to predict the pixel-level correspondence between pairs of rectified stereo images that have a known camera baseline. There are some existing methods that also address the same problem, but with several limitations. For example they are not fully differentiable, making training suboptimal [16], or have image formation models that do not scale to large output resolutions [53]. We improve upon these methods with a novel training objective and enhanced network architecture that significantly increases the quality of our final results. An example result from our algorithm is illustrated in Fig. 1. Our method is fast and only takes on the order of 35 milliseconds to predict a dense depth map for a 512×256 image on a modern GPU. Specifically, we propose the following contributions:
1) A network architecture that performs end-to-end unsupervised monocular depth estimation with a novel training loss that enforces left-right depth consistency inside the network.
2) An evaluation of several training losses and image formation models highlighting the effectiveness of our approach.
3) In addition to showing state of the art results on a challenging driving dataset, we also show that our model generalizes to three different datasets, including a new outdoor urban dataset that we have collected ourselves, which we make openly available.
2. Related Work
There is a large body of work that focuses on depth estimation from images, either using pairs [46], several overlapping images captured from different viewpoints [14], temporal sequences [44], or assuming a fixed camera, static scene, and changing lighting [52, 2]. These approaches are typically only applicable when there is more than one input image available of the scene of interest. Here we focus on works related to monocular depth estimation, where there is only a single input image, and no assumptions about the scene geometry or types of objects present are made.
Learning-Based Stereo
The vast majority of stereo estimation algorithms have a data term which computes the similarity between each pixel in the first image and every other pixel in the second image. Typically the stereo pair is rectified and thus the problem of disparity (i.e. scaled inverse depth) estimation can be posed as a 1D search problem for each pixel. Recently, it has been shown that instead of using hand defined similarity measures, treating the matching as a supervised learning problem and training a function to predict the correspondences produces far superior results [54, 31]. It has also been shown that posing this binocular correspondence search as a multi-class classification problem has advantages both in terms of quality of results and speed [38]. Instead of just learning the matching function, Mayer et al. [39] introduced a fully convolutional [47] deep network called DispNet that directly computes the correspondence field between two images. At training time, they attempt to directly predict the disparity for each pixel by minimizing a regression training loss. DispNet has a similar architecture to their previous end-to-end deep optical flow network [12].

The above methods rely on having large amounts of accurate ground truth disparity data and stereo image pairs at training time. This type of data can be difficult to obtain for real world scenes, so these approaches typically use synthetic data for training. Synthetic data is becoming more realistic, e.g. [15], but still requires the manual creation of new content for every new application scenario.
Supervised Single Image Depth Estimation
Single-view, or monocular, depth estimation refers to the problem setup where only a single image is available at test time. Saxena et al. [45] proposed a patch-based model known as Make3D that first over-segments the input image into patches and then estimates the 3D location and orientation of local planes to explain each patch. The predictions of the plane parameters are made using a linear model trained offline on a dataset of laser scans, and the predictions are then combined together using an MRF. The disadvantage of this method, and other planar based approximations, e.g. [22], is that they can have difficulty modeling thin structures and, as predictions are made locally, lack the global context required to generate realistic outputs. Instead of hand-tuning the unary and pairwise terms, Liu et al. [36] use a convolutional neural network (CNN) to learn them. In another local approach, Ladicky et al. [32] incorporate semantics into their model to improve their per pixel depth estimation. Karsch et al. [28] attempt to produce more consistent image level predictions by copying whole depth images from a training set. A drawback of this approach is that it requires the entire training set to be available at test time.

Eigen et al. [10, 9] showed that it was possible to produce dense pixel depth estimates using a two scale deep network trained on images and their corresponding depth values. Unlike most other previous work in single image depth estimation, they do not rely on hand crafted features or an initial over-segmentation and instead learn a representation directly from the raw pixel values. Several works have built upon the success of this approach using techniques such as CRFs to improve accuracy [35], changing the loss from regression to classification [5], using other more robust loss functions [33], and incorporating strong scene priors in the case of the related problem of surface normal estimation [50]. Again, like the previous stereo methods, these approaches rely on having high quality, pixel aligned, ground truth depth at training time. We too perform single image depth estimation, but train with an added binocular color image, instead of requiring ground truth depth.
Unsupervised Depth Estimation
Recently, a small number of deep network based methods for novel view synthesis and depth estimation have been proposed, which do not require ground truth depth at training time. Flynn et al. [13] introduced a novel image synthesis network called DeepStereo that generates new views by selecting pixels from nearby images. During training, the relative pose of multiple cameras is used to predict the appearance of a held-out nearby image. Then the most appropriate depths are selected to sample colors from the neighboring images, based on plane sweep volumes. At test time, image synthesis is performed on small overlapping patches. As it requires several nearby posed images at test time, DeepStereo is not suitable for monocular depth estimation.
The Deep3D network of Xie et al. [53] also addresses the problem of novel view synthesis, where their goal is to generate the corresponding right view from an input left image (i.e. the source image) in the context of binocular pairs. Again using an image reconstruction loss, their method produces a distribution over all the possible disparities for each pixel. The resulting synthesized right image pixel values are a combination of the pixels on the same scan line from the left image, weighted by the probability of each disparity. The disadvantage of their image formation model is that increasing the number of candidate disparity values greatly increases the memory consumption of the algorithm, making it difficult to scale their approach to bigger output resolutions. In this work, we perform a comparison to the Deep3D image formation model, and show that our algorithm produces superior results.
Closest to our model in spirit is the concurrent work of Garg et al. [16]. Like Deep3D and our method, they train a network for monocular depth estimation using an image reconstruction loss. However, their image formation model is not fully differentiable. To compensate, they perform a Taylor approximation to linearize their loss, resulting in an objective that is more challenging to optimize. Similar to other recent work, e.g. [43, 56, 57], our model overcomes this problem by using bilinear sampling [27] to generate images, resulting in a fully (sub-)differentiable training loss.
We propose a fully convolutional deep neural network loosely inspired by the supervised DispNet architecture of Mayer et al. [39]. By posing monocular depth estimation as an image reconstruction problem, we can solve for the disparity field without requiring ground truth depth. However, only minimizing a photometric loss can result in good quality image reconstructions but poor quality depth. Among other terms, our fully differentiable training loss includes a left-right consistency check to improve the quality of our synthesized depth images. This type of consistency check is commonly used as a post-processing step in many stereo methods, e.g. [54], but we incorporate it directly into our network.
3. Method
This section describes our single image depth prediction
network. We introduce a novel depth estimation training loss,
featuring an inbuilt left-right consistency check, which enables
us to train on image pairs without requiring supervision in the
form of ground truth depth.
3.1. Depth Estimation as Image Reconstruction
Given a single image $I$ at test time, our goal is to learn a function $f$ that can predict the per-pixel scene depth, $\hat{d} = f(I)$. Most existing learning based approaches treat this as a supervised learning problem, where they have color input images and their corresponding target depth values at training. It is presently not practical to acquire such ground truth depth data for a large variety of scenes. Even expensive hardware, such as laser scanners, can be imprecise in natural scenes featuring movement and reflections. As an alternative, we instead pose depth estimation as an image reconstruction problem during training. The intuition here is that, given a calibrated pair of binocular cameras, if we can learn a function that is able to reconstruct one image from the other, then we have learned something about the 3D shape of the scene that is being imaged.

Figure 2. Our loss module outputs left and right disparity maps, $d^l$ and $d^r$. The loss combines smoothness, reconstruction, and left-right disparity consistency terms. This same module is repeated at each of the four different output scales. C: Convolution, UC: Up-Convolution, S: Bilinear Sampling, US: Up-Sampling, SC: Skip Connection.
Specifically, at training time, we have access to two images $I^l$ and $I^r$, corresponding to the left and right color images from a calibrated stereo pair, captured at the same moment in time. Instead of trying to directly predict the depth, we attempt to find the dense correspondence field $d^r$ that, when applied to the left image, would enable us to reconstruct the right image. We will refer to the reconstructed image $I^l(d^r)$ as $\tilde{I}^r$. Similarly, we can also estimate the left image given the right one, $\tilde{I}^l = I^r(d^l)$. Assuming that the images are rectified [19], $d$ corresponds to the image disparity - a scalar value per pixel that our model will learn to predict. Given the baseline distance $b$ between the cameras and the camera focal length $f$, we can then trivially recover the depth $\hat{d}$ from the predicted disparity, $\hat{d} = bf/d$.
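For illustration, a minimal NumPy sketch of this disparity-to-depth conversion is given below; the function name and the calibration values in the example are placeholders, not taken from the paper.

```python
import numpy as np

def disparity_to_depth(disp_pixels, baseline_m, focal_px, eps=1e-6):
    """Recover depth via d_hat = b * f / d (Sec. 3.1).

    disp_pixels : HxW array of predicted disparities in pixels.
    baseline_m  : stereo baseline b in metres.
    focal_px    : horizontal focal length f in pixels.
    """
    return baseline_m * focal_px / np.maximum(disp_pixels, eps)

# Example with made-up calibration values (not from the paper):
disp = np.full((256, 512), 20.0)          # a constant 20 px disparity
depth = disparity_to_depth(disp, baseline_m=0.5, focal_px=700.0)
print(depth[0, 0])                        # 0.5 * 700 / 20 = 17.5 m
```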
3.2. Depth Estimation Network
At a high level, our network estimates depth by inferring the disparities that warp the left image to match the right one. The key insight of our method is that we can simultaneously infer both disparities (left-to-right and right-to-left), using only the left input image, and obtain better depths by enforcing them to be consistent with each other.

Figure 3. Sampling strategies for backward mapping. With naïve sampling the CNN produces a disparity map aligned with the target instead of the input. No LR corrects for this, but suffers from artifacts. Our approach uses the left image to produce disparities for both images, improving quality by enforcing mutual consistency.

Our network generates the predicted image with backward mapping using a bilinear sampler, resulting in a fully differentiable image formation model. As illustrated in Fig. 3, naïvely learning to generate the right image by sampling from the left one will produce disparities aligned with the right image (target). However, we want the output disparity map to align with the input left image, meaning the network has to sample from the right image. We could instead train the network to generate the left view by sampling from the right image, thus creating a left view aligned disparity map (No LR in Fig. 3). While this alone works, the inferred disparities exhibit ‘texture-copy’ artifacts and errors at depth discontinuities as seen in Fig. 5. We solve this by training the network to predict the disparity maps for both views by sampling from the opposite input images. This still only requires a single left image as input to the convolutional layers and the right image is only used during training (Ours in Fig. 3). Enforcing consistency between both disparity maps using this novel left-right consistency cost leads to more accurate results.
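For illustration, the backward mapping described above can be sketched as 1D bilinear sampling along scan lines; the function name, the sign convention of the horizontal shift, and the NumPy implementation are assumptions of this sketch, not the released code.

```python
import numpy as np

def bilinear_sample_1d(img, disp, sign=-1.0):
    """Reconstruct a view by sampling the opposite image along scan lines.

    img  : HxWxC source image (e.g. the right view) as a float array.
    disp : HxW disparity map, in pixels, aligned with the *output* view.
    sign : direction of the horizontal shift; shifting left by the
           disparity is an illustrative assumption.
    Each output pixel is a weighted sum of two horizontal neighbours,
    so the operation is (sub-)differentiable with respect to disp.
    """
    h, w, _ = img.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0).astype(np.float64)
    sample_x = np.clip(xs + sign * disp, 0.0, w - 1.0)

    x0 = np.floor(sample_x).astype(int)          # left neighbour
    x1 = np.clip(x0 + 1, 0, w - 1)               # right neighbour
    frac = (sample_x - x0)[..., None]            # interpolation weight

    rows = np.arange(h)[:, None].repeat(w, axis=1)
    return (1.0 - frac) * img[rows, x0] + frac * img[rows, x1]

# e.g. reconstruct the left view from the right view and a left-aligned disparity:
# I_l_tilde = bilinear_sample_1d(I_r, d_l)
```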
Our fully convolutional architecture is inspired by DispNet [39], but features several important modifications that enable us to train without requiring ground truth depth. Our network is composed of two main parts - an encoder (from cnv1 to cnv7b) and a decoder (from upcnv7); please see the supplementary material for a detailed description. The decoder uses skip connections [47] from the encoder's activation blocks, enabling it to resolve higher resolution details. We output disparity predictions at four different scales (disp4 to disp1), which double in spatial resolution at each of the subsequent scales. Even though it only takes a single image as input, our network predicts two disparity maps at each output scale - left-to-right and right-to-left.
3.3. Training Loss
We define a loss $C_s$ at each output scale $s$, forming the total loss as the sum $C = \sum_{s=1}^{4} C_s$. Our loss module (Fig. 2) computes $C_s$ as a combination of three main terms,

$$C_s = \alpha_{ap}(C^l_{ap} + C^r_{ap}) + \alpha_{ds}(C^l_{ds} + C^r_{ds}) + \alpha_{lr}(C^l_{lr} + C^r_{lr}), \quad (1)$$

where $C_{ap}$ encourages the reconstructed image to appear similar to the corresponding training input, $C_{ds}$ enforces smooth disparities, and $C_{lr}$ prefers the predicted left and right disparities to be consistent. Each of the main terms contains both a left and a right image variant, but only the left image is fed through the convolutional layers.

Next, we present each component of our loss in terms of the left image (e.g. $C^l_{ap}$). The right image versions, e.g. $C^r_{ap}$, require swapping left for right and sampling in the opposite direction.
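A schematic sketch of how the per-scale terms of Eq. (1) could be combined is shown below; the dictionary layout and function name are illustrative assumptions, the individual loss terms are assumed to be computed elsewhere (they are sketched under the corresponding headings that follow), and the weights match the values reported in Sec. 4.1.

```python
def total_loss(scales, alpha_ap=1.0, alpha_lr=1.0, alpha_ds_base=0.1):
    """Sum the per-scale losses C_s of Eq. (1) over the four output scales.

    `scales` is a list of dicts, one per scale, each holding the left/right
    appearance (ap), disparity smoothness (ds) and left-right consistency (lr)
    terms already computed for that scale, plus its downscaling factor r.
    A schematic illustration of Eq. (1), not the released code.
    """
    C = 0.0
    for s in scales:
        alpha_ds = alpha_ds_base / s["r"]    # alpha_ds = 0.1 / r, see Sec. 4.1
        C += (alpha_ap * (s["ap_l"] + s["ap_r"])
              + alpha_ds * (s["ds_l"] + s["ds_r"])
              + alpha_lr * (s["lr_l"] + s["lr_r"]))
    return C
```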
Appearance Matching Loss
During training, the network learns to generate an image by sampling pixels from the opposite stereo image. Our image formation model uses the image sampler from the spatial transformer network (STN) [27] to sample the input image using a disparity map. The STN uses bilinear sampling where the output pixel is the weighted sum of four input pixels. In contrast to alternative approaches [16, 53], the bilinear sampler used is locally fully differentiable and integrates seamlessly into our fully convolutional architecture. This means that we do not require any simplification or approximation of our cost function.

Inspired by [55], we use a combination of an L1 and single scale SSIM [51] term as our photometric image reconstruction cost $C_{ap}$, which compares the input image $I^l_{ij}$ and its reconstruction $\tilde{I}^l_{ij}$, where $N$ is the number of pixels,

$$C^l_{ap} = \frac{1}{N} \sum_{i,j} \alpha \, \frac{1 - \mathrm{SSIM}(I^l_{ij}, \tilde{I}^l_{ij})}{2} + (1 - \alpha) \left| I^l_{ij} - \tilde{I}^l_{ij} \right|. \quad (2)$$

Here, we use a simplified SSIM with a $3 \times 3$ block filter instead of a Gaussian, and set $\alpha = 0.85$.
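For reference, Eq. (2) can be written out as the following NumPy sketch for a single-channel image pair; the SSIM stabilising constants and the single-channel simplification are assumptions of this illustration rather than details taken from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim_block(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-scale SSIM with a 3x3 block filter instead of a Gaussian.
    c1 and c2 are the usual stabilising constants for images in [0, 1]."""
    mu_x = uniform_filter(x, size=3)
    mu_y = uniform_filter(y, size=3)
    sigma_x = uniform_filter(x * x, size=3) - mu_x ** 2
    sigma_y = uniform_filter(y * y, size=3) - mu_y ** 2
    sigma_xy = uniform_filter(x * y, size=3) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def appearance_loss(I, I_rec, alpha=0.85):
    """C_ap of Eq. (2): alpha * (1 - SSIM)/2 + (1 - alpha) * |I - I_rec|,
    averaged over pixels. Images are HxW floats in [0, 1]; the paper's model
    evaluates this per colour channel inside the network graph."""
    ssim = np.clip(ssim_block(I, I_rec), -1.0, 1.0)
    l1 = np.abs(I - I_rec)
    return np.mean(alpha * (1.0 - ssim) / 2.0 + (1.0 - alpha) * l1)
```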
Disparity Smoothness Loss
We encourage disparities to be locally smooth with an L1 penalty on the disparity gradients $\partial d$. As depth discontinuities often occur at image gradients, similar to [21], we weight this cost with an edge-aware term using the image gradients $\partial I$,

$$C^l_{ds} = \frac{1}{N} \sum_{i,j} \left| \partial_x d^l_{ij} \right| e^{-\left\| \partial_x I^l_{ij} \right\|} + \left| \partial_y d^l_{ij} \right| e^{-\left\| \partial_y I^l_{ij} \right\|}. \quad (3)$$
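A NumPy sketch of Eq. (3) follows; the use of forward differences and of the channel-wise mean absolute gradient as the norm are assumptions of the illustration.

```python
import numpy as np

def smoothness_loss(disp, img):
    """C_ds of Eq. (3): edge-aware L1 penalty on disparity gradients.

    disp : HxW disparity map.
    img  : HxWxC image aligned with disp; its gradients down-weight the
           penalty at edges. Forward differences are used for brevity.
    """
    dx_d = np.abs(disp[:, 1:] - disp[:, :-1])
    dy_d = np.abs(disp[1:, :] - disp[:-1, :])
    # e^{-||dI||}: mean absolute image gradient over channels
    dx_i = np.mean(np.abs(img[:, 1:] - img[:, :-1]), axis=-1)
    dy_i = np.mean(np.abs(img[1:, :] - img[:-1, :]), axis=-1)
    return np.mean(dx_d * np.exp(-dx_i)) + np.mean(dy_d * np.exp(-dy_i))
```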
Left-Right Disparity Consistency Loss
To produce more accurate disparity maps, we train our network to predict both the left and right image disparities, while only being given the left view as input to the convolutional part of the network. To ensure coherence, we introduce an L1 left-right disparity consistency penalty as part of our model. This cost attempts to make the left-view disparity map be equal to the projected right-view disparity map,

$$C^l_{lr} = \frac{1}{N} \sum_{i,j} \left| d^l_{ij} - d^r_{ij + d^l_{ij}} \right|. \quad (4)$$

Like all the other terms, this cost is mirrored for the right-view disparity map and is evaluated at all of the output scales.
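For illustration, Eq. (4) can be sketched as follows; sampling the right disparity map with linear interpolation along the scan line and the sign of the shift are assumptions of this sketch.

```python
import numpy as np

def lr_consistency_loss(d_l, d_r, sign=-1.0):
    """C_lr of Eq. (4): |d_l(x) - d_r(x shifted by d_l(x))|, averaged over pixels.

    The right disparity map is sampled at the location the left disparity
    points to, then compared with the left disparity itself.
    """
    h, w = d_l.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0).astype(np.float64)
    sample_x = np.clip(xs + sign * d_l, 0.0, w - 1.0)
    x0 = np.floor(sample_x).astype(int)
    x1 = np.clip(x0 + 1, 0, w - 1)
    frac = sample_x - x0
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    d_r_proj = (1.0 - frac) * d_r[rows, x0] + frac * d_r[rows, x1]
    return np.mean(np.abs(d_l - d_r_proj))
```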

Method | Dataset | Abs Rel | Sq Rel | RMSE | RMSE log | D1-all | δ<1.25 | δ<1.25² | δ<1.25³
Ours with Deep3D [53] | K | 0.412 | 16.37 | 13.693 | 0.512 | 66.85 | 0.690 | 0.833 | 0.891
Ours with Deep3Ds [53] | K | 0.151 | 1.312 | 6.344 | 0.239 | 59.64 | 0.781 | 0.931 | 0.976
Ours No LR | K | 0.123 | 1.417 | 6.315 | 0.220 | 30.318 | 0.841 | 0.937 | 0.973
Ours | K | 0.124 | 1.388 | 6.125 | 0.217 | 30.272 | 0.841 | 0.936 | 0.975
Ours | CS | 0.699 | 10.060 | 14.445 | 0.542 | 94.757 | 0.053 | 0.326 | 0.862
Ours | CS + K | 0.104 | 1.070 | 5.417 | 0.188 | 25.523 | 0.875 | 0.956 | 0.983
Ours pp | CS + K | 0.100 | 0.934 | 5.141 | 0.178 | 25.077 | 0.878 | 0.961 | 0.986
Ours resnet pp | CS + K | 0.097 | 0.896 | 5.093 | 0.176 | 23.811 | 0.879 | 0.962 | 0.986
Ours Stereo | K | 0.068 | 0.835 | 4.392 | 0.146 | 9.194 | 0.942 | 0.978 | 0.989

Table 1. Comparison of different image formation models. Results on the KITTI 2015 stereo 200 training set disparity images [17]. For training, K is the KITTI dataset [17] and CS is Cityscapes [8]. Lower is better for Abs Rel, Sq Rel, RMSE, RMSE log, and D1-all; higher is better for the δ columns. Our model with left-right consistency performs the best, and is further improved with the addition of the Cityscapes data. The last row shows the result of our model trained and tested with two input images instead of one (see Sec. 4.3).
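The error measures in Table 1 follow the standard monocular depth evaluation protocol of Eigen et al. [10]; the sketch below writes them out for reference (D1-all, the KITTI stereo outlier rate, is not included), assuming flattened arrays of valid depth values, which is an assumption of this illustration.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth error metrics as reported in Table 1.
    `pred` and `gt` are 1D arrays of valid (positive) depths in metres."""
    thresh = np.maximum(gt / pred, pred / gt)
    return {
        "Abs Rel":  np.mean(np.abs(gt - pred) / gt),
        "Sq Rel":   np.mean((gt - pred) ** 2 / gt),
        "RMSE":     np.sqrt(np.mean((gt - pred) ** 2)),
        "RMSE log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "d<1.25":   np.mean(thresh < 1.25),
        "d<1.25^2": np.mean(thresh < 1.25 ** 2),
        "d<1.25^3": np.mean(thresh < 1.25 ** 3),
    }
```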
At test time, our network predicts the disparity at the finest scale level for the left image $d^l$, which has the same resolution as the input image. Using the known camera baseline and focal length from the training set, we then convert from the disparity map to a depth map. While we also estimate the right disparity $d^r$ during training, it is not used at test time.
4. Results
Here we compare the performance of our approach to both supervised and unsupervised single view depth estimation methods. We train on rectified stereo image pairs, and do not require any supervision in the form of ground truth depth. Existing single image datasets, such as [41, 45], that lack stereo pairs, are not suitable for evaluation. Instead we evaluate our approach using the popular KITTI 2015 [17] dataset. To evaluate our image formation model, we compare to a variant of our algorithm that uses the original Deep3D [53] image formation model and a modified one, Deep3Ds, with an added smoothness constraint. We also evaluate our approach with and without the left-right consistency constraint.
4.1. Implementation Details
The network, which is implemented in TensorFlow [1], contains 31 million trainable parameters, and takes on the order of 25 hours to train using a single Titan X GPU on a dataset of 30 thousand images for 50 epochs. Inference is fast and takes less than 35 ms, or more than 28 frames per second, for a 512×256 image, including transfer times to and from the GPU. Please see the supplementary material and our code¹ for more details.
During optimization, we set the weighting of the different loss components to $\alpha_{ap} = 1$ and $\alpha_{lr} = 1$. The possible output disparities are constrained to be between $0$ and $d_{max}$ using a scaled sigmoid non-linearity, where $d_{max} = 0.3 \times$ the image width at a given output scale. As a result of our multi-scale output, the typical disparity of neighboring pixels will differ by a factor of two between each scale (as we are upsampling the output by a factor of two). To correct for this, we scale the disparity smoothness term $\alpha_{ds}$ with $r$ for each scale to get equivalent smoothing at each level. Thus $\alpha_{ds} = 0.1/r$, where $r$ is the downscaling factor of the corresponding layer with respect to the resolution of the input image that is passed into the network.

¹ Available at https://github.com/mrharicot/monodepth
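A small sketch of these two per-scale settings (the scaled sigmoid on the output disparities and the 0.1/r smoothness weighting) is given below; the function names and the example scale factors r = 1, 2, 4, 8 for a 512-wide input are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def constrain_disparity(raw, scale_width, d_max_frac=0.3):
    """Map unconstrained network outputs to disparities in [0, d_max],
    where d_max = 0.3 x the image width at the given output scale.
    `raw` stands in for the last convolution's output; the name is illustrative."""
    return d_max_frac * scale_width * sigmoid(raw)

def smoothness_weight(r, base=0.1):
    """alpha_ds = 0.1 / r, where r is the downscaling factor of the scale."""
    return base / r

# e.g. for a 512-wide input and scales with r = 1, 2, 4, 8:
# d_max = 153.6, 76.8, 38.4, 19.2 pixels and alpha_ds = 0.1, 0.05, 0.025, 0.0125
```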
For the non-linearities in the network, we used exponential linear units [7] instead of the commonly used rectified linear units (ReLU) [40]. We found that ReLUs tended to prematurely fix the predicted disparities at intermediate scales to a single value, making subsequent improvement difficult. Following [42], we replaced the usual deconvolutions with a nearest neighbor upsampling followed by a convolution. We trained our model from scratch for 50 epochs, with a batch size of 8 using Adam [30], where $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. We used an initial learning rate of $\lambda = 10^{-4}$ which we kept constant for the first 30 epochs before halving it every 10 epochs until the end. We initially experimented with progressive update schedules, as in [39], where lower resolution image scales were optimized first. However, we found that optimizing all four scales at once led to more stable convergence. Similarly, we use an identical weighting for the loss of each scale as we found that weighting them differently led to unstable convergence. We experimented with batch normalization [26], but found that it did not produce a significant improvement, and ultimately excluded it.
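The learning rate schedule described above can be written as a small helper, shown here as an illustrative sketch; the epoch boundaries follow the text (constant for 30 epochs, then halved every 10 epochs of the 50-epoch run), the rest is assumed.

```python
def learning_rate(epoch, base_lr=1e-4):
    """Constant for the first 30 epochs, then halved every 10 epochs."""
    if epoch < 30:
        return base_lr
    return base_lr * 0.5 ** ((epoch - 30) // 10 + 1)

# epochs 0-29 -> 1e-4, epochs 30-39 -> 5e-5, epochs 40-49 -> 2.5e-5
```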
Data augmentation is performed on the fly. We flip the input images horizontally with a 50% chance, taking care to also swap both images so they are in the correct position relative to each other. We also added color augmentations, with a 50% chance, where we performed random gamma, brightness, and color shifts by sampling from uniform distributions in the ranges [0.8, 1.2] for gamma, [0.5, 2.0] for brightness, and [0.8, 1.2] for each color channel separately.
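A sketch of this augmentation pipeline in NumPy is given below; applying identical colour parameters to both images of a pair and clipping to [0, 1] are assumptions of the illustration.

```python
import numpy as np

def augment_pair(img_l, img_r, rng=np.random):
    """On-the-fly augmentation for a rectified stereo pair.
    Images are HxWx3 floats in [0, 1]."""
    # Horizontal flip with 50% chance: flip both images and swap them so
    # they remain a valid left/right pair.
    if rng.rand() < 0.5:
        img_l, img_r = img_r[:, ::-1].copy(), img_l[:, ::-1].copy()
    else:
        img_l, img_r = img_l.copy(), img_r.copy()

    # Colour augmentation with 50% chance: random gamma, brightness and
    # per-channel colour shifts drawn from the stated uniform ranges.
    if rng.rand() < 0.5:
        gamma = rng.uniform(0.8, 1.2)
        brightness = rng.uniform(0.5, 2.0)
        colour = rng.uniform(0.8, 1.2, size=3)
        for img in (img_l, img_r):
            img[:] = np.clip((img ** gamma) * brightness * colour, 0.0, 1.0)
    return img_l, img_r
```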
Resnet50
For the sake of completeness, and similar to [33], we also show a variant of our model using Resnet50 [20] as the encoder, the rest of the architecture, parameters and training procedure staying identical. This variant contains 48 million trainable parameters and is indicated by resnet in result tables.
Post-processing
In order to reduce the effect of stereo disocclusions which create disparity ramps on both the left side of the image and of the occluders, a final post-processing step is performed on the output. For an input image $I$ at test time, we also

References
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[26] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.
[30] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
[51] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 2004.