Semi-Supervised Deep Learning for Monocular Depth Map Prediction
Yevhen Kuznietsov, Jörg Stückler, Bastian Leibe
Computer Vision Group, Visual Computing Institute, RWTH Aachen University
yevhen.kuznietsov@rwth-aachen.de, {stueckler|leibe}@vision.rwth-aachen.de
Abstract
Supervised deep learning often suffers from the lack of sufficient training data. Specifically in the context of monocular depth map prediction, it is barely possible to determine dense ground truth depth images in realistic dynamic outdoor environments. When using LiDAR sensors, for instance, noise is present in the distance measurements, the calibration between sensors cannot be perfect, and the measurements are typically much sparser than the camera images. In this paper, we propose a novel approach to depth map prediction from monocular images that learns in a semi-supervised way. While we use sparse ground-truth depth for supervised learning, we also enforce our deep network to produce photoconsistent dense depth maps in a stereo setup using a direct image alignment loss. In experiments we demonstrate superior performance in depth map prediction from single images compared to the state-of-the-art methods.
1. Introduction
Estimating depth from single images is an ill-posed problem which cannot be solved directly from bottom-up geometric cues in general. Instead, a-priori knowledge about the typical appearance, layout and size of objects needs to be used, or further cues such as shape from shading or focus have to be employed which are difficult to model in realistic settings. In recent years, supervised deep learning approaches have demonstrated promising results for single image depth prediction. These learning approaches appear to capture the statistical relationship between appearance and distance to objects well.
Supervised deep learning, however, requires vast amounts of training data in order to achieve high accuracy and to generalize well to novel scenes. Supplementary depth sensors are typically used to capture ground truth. In the indoor setting, active RGB-D cameras can be used. Outdoors, 3D laser scanners are a popular choice to capture depth measurements. However, using such sensing devices bears several shortcomings. Firstly, the sensors have their own error and noise characteristics, which will be learned by the network. In addition, when using 3D lasers, the measurements are typically much sparser than the images and do not capture high detail depth variations visible in the images well. Finally, accurate extrinsic and intrinsic calibration of the sensors is required. Ground truth data could alternatively be generated through synthetic rendering of depth maps. The rendered images, however, do not fully realistically display the scene and do not incorporate real image noise characteristics.

Figure 1. We concurrently train a CNN from unsupervised and supervised depth cues to achieve state-of-the-art performance in single image depth prediction. For supervised training we use (sparse) ground-truth depth readings from a supplementary sensing cue such as a 3D laser. Unsupervised direct image alignment complements the ground-truth measurements with a training signal that is purely based on the stereo images and the predicted depth map for an image.
Very recently, unsupervised methods have been introduced [6, 9] that learn to predict depth maps directly from the intensity images in a stereo setup, without the need for an additional supplementary modality for capturing the ground truth. One drawback of these approaches is the well-known fact that stereo depth reconstruction based on image matching is an ill-posed problem on its own. To this end, common regularization schemes can be used which impose priors on the depth, such as small depth gradient norms, which may not be fully satisfied in the real environment.
In this paper, we propose a semi-supervised learning approach that makes use of supervised as well as unsupervised training cues to incorporate the best of both worlds. Our method benefits from ground-truth measurements as an unambiguous (but noisy and sparse) cue for the actual depth in the scene. Unsupervised image alignment complements the ground-truth by a huge amount of additional training data which is much simpler to obtain and counteracts the deficiencies of the ground-truth depth measurements. By combining both methods, we achieve significant improvements over the state-of-the-art in single image depth map prediction, which we evaluate on the popular KITTI dataset [7] in urban street scenes. We base our approach on a state-of-the-art deep residual network in an encoder-decoder architecture for this task [16] and augment it with long skip connections between corresponding layers in encoder and decoder to predict high detail output depth maps. Our network converges quickly to a good model from little supervised training data, mainly due to the use of pretrained encoder weights (on the ImageNet [22] classification task) and unsupervised training. The use of supervised training also simplifies unsupervised learning significantly. For instance, a tedious coarse-to-fine image alignment loss as in previous unsupervised learning approaches [6] is not required in our semi-supervised approach.
In summary, we make the following contributions: 1) We propose a novel semi-supervised deep learning approach to single image depth map prediction that uses supervised as well as unsupervised learning cues. 2) Our deep learning approach demonstrates state-of-the-art performance in challenging outdoor scenes on the KITTI benchmark.
2. Related Work
Over the last years, several learning-based approaches to single image depth reconstruction have been proposed that are trained in a supervised way. Often, measured depth from RGB-D cameras or 3D laser scanners is used as ground-truth for training. Saxena et al. [24] proposed one of the first supervised learning-based approaches to single image depth map prediction. They model depth prediction in a Markov random field and use multi-scale texture features that have been hand-crafted. The method also combines monocular cues with stereo correspondences within the MRF.
Many recent approaches learn image features using deep learning techniques. Eigen et al. [5] propose a CNN architecture that integrates coarse-scale depth prediction with fine-scale prediction. The approach of Li et al. [17] combines deep learning features on image patches with hierarchical CRFs defined on a superpixel segmentation of the image. They use pretrained AlexNet [14] features of image patches to predict depth at the center of the superpixels. A hierarchical CRF refines the depth across individual pixels. Liu et al. [20] also propose a deep structured learning approach that avoids hand-crafted features. Their deep convolutional neural fields allow for training CNN features of unary and pairwise potentials end-to-end, exploiting continuous depth and Gaussian assumptions on the pairwise potentials. Very recently, Laina et al. [16] proposed to use a ResNet-based encoder-decoder architecture to produce dense depth maps. They demonstrate the approach to predict depth maps in indoor scenes using RGB-D images for training. Further lines of research in supervised training of depth map prediction use the idea of depth transfer from example images [13, 12, 21], or integrate depth map prediction with semantic segmentation [15, 19, 4, 26, 18].
Only a few very recent methods attempt to learn depth map prediction in an unsupervised way. Garg et al. [6] propose an encoder-decoder architecture similar to FlowNet [3] which is trained to predict single image depth maps on an image alignment loss. The method only requires images of a corresponding camera in a stereo setup. The loss quantifies the photometric error of the input image warped into its corresponding stereo image using the predicted depth. The loss is linearized using first-order Taylor approximation and hence requires coarse-to-fine training. Xie et al. [27] do not regress the depth maps directly, but produce probability maps for different disparity levels. A selection layer then reconstructs the right image using the left image and these probability maps. The network is trained to minimize pixel-wise reconstruction error. Godard et al. [9] also use an image alignment loss in a convolutional encoder-decoder architecture but additionally enforce left-right consistency of the predicted disparities in the stereo pair. Our semi-supervised approach simplifies the use of unsupervised cues and does not require multi-scale depth map prediction in our network architecture. We also do not explicitly enforce left-right consistency, but use both images in the stereo pair equivalently to define our loss function. The semi-supervised method of Chen et al. [1] incorporates the side-task of depth ranking of pairs of pixels for training a CNN on single image depth prediction. For the ranking task, ground-truth is much easier to obtain but only indirectly provides information on continuous depth values. Our approach uses image alignment as a geometric cue which does not require manual annotations.
3. Approach
We base our approach on supervised as well as unsupervised principles for learning single image depth map prediction (see Fig. 1). A straight-forward approach is to use a supplementary measuring device such as a 3D laser in order to capture ground-truth depth readings for supervised training. This process typically requires an accurate extrinsic calibration between the 3D laser sensor and the camera. Furthermore, the laser measurements have several shortcomings. Firstly, they are affected by erroneous readings and noise. They are also typically much sparser than the camera images when projected into the image. Finally, the centers of projection of laser and camera do not coincide. This causes depth readings of objects that are occluded from the view point of the camera to project into the camera image. To counteract these drawbacks, we make use of two-view geometry principles to learn depth prediction directly from the stereo camera images in an unsupervised way. We achieve this by direct image alignment of one stereo image to the other. This process only requires a known camera calibration and the depth map predicted by the CNN. Our semi-supervised approach learns from supervised and unsupervised cues concurrently.

Figure 2. Components and inputs of our novel semi-supervised loss function.
We train the CNN to predict the inverse depth ρ(x) at each pixel x from the RGB image I. According to the ground truth, the predicted inverse depth should correspond to the LiDAR depth measurement Z(x) that projects to the same pixel, i.e.,

ρ(x)^{-1} = Z(x) .   (1)

However, the laser measurements only project to a sparse subset Ω_Z of the pixels in the image.

As the unsupervised training signal, we assume photoconsistency between the left and right stereo images, i.e.,

I_1(x) = I_2(ω(x, ρ(x))) .   (2)
In our calibrated stereo setup, the warping function can be defined as

ω(x, ρ(x)) := x − f b ρ(x)   (3)

on the rectified images, where f is the focal length and b is the baseline. This image alignment constraint holds at every pixel in the image.
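To make Eq. (3) concrete, the following is a minimal Python sketch of the warp under the stated assumptions (rectified images, inverse depth in 1/m, focal length in pixels); the function and argument names are illustrative, not taken from the authors' implementation.

```python
import numpy as np

def warp_x(x_coords, inv_depth, focal, baseline):
    """Sketch of the rectified-stereo warp omega(x, rho(x)) = x - f * b * rho(x).

    x_coords  : pixel x-coordinates (array), in pixels
    inv_depth : predicted inverse depth rho(x) at those pixels, in 1/m
    focal     : focal length f in pixels
    baseline  : stereo baseline b in metres
    f * b * rho(x) is the disparity in pixels; on rectified images only the
    horizontal coordinate changes.
    """
    disparity = focal * baseline * np.asarray(inv_depth, dtype=np.float64)
    return np.asarray(x_coords, dtype=np.float64) - disparity
```

The opposite warping direction uses the opposite sign of the disparity; the sign shown here follows the reconstruction of Eq. (3).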
We additionally make use of the interchangeability of the stereo images. We quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images. We also constrain the depth estimate between the left and right stereo images to be consistent implicitly by enforcing photoconsistency based on the inverse depth prediction for both images, i.e.,

I_left(x) = I_right(ω(x, ρ_left(x))) ,
I_right(x) = I_left(ω(x, ρ_right(x))) .   (4)
Finally, in textureless regions without ground truth depth readings, the depth map prediction problem is ill-posed and an adequate regularization needs to be imposed.
3.1. Loss function
We formulate a single loss function that incorporates both types of constraints that arise from supervised and unsupervised cues seamlessly,

L_θ(I_l, I_r, Z_l, Z_r) = λ_t L^S_θ(I_l, I_r, Z_l, Z_r) + γ L^U_θ(I_l, I_r) + L^R_θ(I_l, I_r) ,   (5)

where λ_t and γ are trade-off parameters between the supervised loss L^S_θ, the unsupervised loss L^U_θ, and a regularization term L^R_θ. With θ we denote the CNN network parameters that generate the inverse depth maps ρ_{r/l}.
Supervised loss. The supervised loss term measures the deviation of the predicted depth map from the available ground truth at the pixels,

L^S_θ = Σ_{x ∈ Ω_{Z,l}} ‖ρ_{l,θ}(x)^{-1} − Z_l(x)‖_δ + Σ_{x ∈ Ω_{Z,r}} ‖ρ_{r,θ}(x)^{-1} − Z_r(x)‖_δ .   (6)

We use the berHu norm ‖·‖_δ as introduced in [16] to focus training on larger depth residuals during CNN training,

‖d‖_δ = |d| if |d| ≤ δ, and ‖d‖_δ = (d² + δ²) / (2δ) if |d| > δ .   (7)

We adaptively set

δ = 0.2 max_{x ∈ Ω_Z} |ρ(x)^{-1} − Z(x)| .   (8)
Note that noise in the ground-truth measurements could be modelled as well, for instance by weighting each residual with the inverse of the measurement variance.
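As an illustration, here is a minimal PyTorch sketch of the supervised term for one image of the stereo pair, combining the berHu norm of Eq. (7) with the adaptive threshold of Eq. (8); the tensor names, the clamping of the inverse depth, and the use of a boolean validity mask are assumptions, not details taken from the paper.

```python
import torch

def berhu_supervised_loss(pred_inv_depth, lidar_depth, valid_mask):
    """Sketch of the supervised loss (Eqs. 6-8) for one image of the stereo pair.

    pred_inv_depth: predicted inverse depth rho(x), shape (B, 1, H, W)
    lidar_depth:    sparse ground-truth depth Z(x), same shape
    valid_mask:     boolean mask of pixels with a projected LiDAR measurement
                    (assumed to contain at least one True entry)
    """
    pred_depth = 1.0 / pred_inv_depth.clamp(min=1e-6)   # rho(x)^-1
    residual = (pred_depth - lidar_depth)[valid_mask]    # only supervised pixels
    abs_res = residual.abs()

    # Adaptive threshold: 20% of the largest residual (Eq. 8); detached so the
    # threshold itself is not differentiated through.
    delta = 0.2 * abs_res.max().detach()

    # berHu norm (Eq. 7): L1 below delta, scaled quadratic above it.
    berhu = torch.where(abs_res <= delta,
                        abs_res,
                        (residual ** 2 + delta ** 2) / (2.0 * delta))
    return berhu.sum()
```

The full supervised term of Eq. (6) evaluates this once per stereo image and sums the two contributions.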
Unsupervised loss. The unsupervised part of our loss quantifies the direct image alignment error in both directions,

L^U_θ = Σ_{x ∈ Ω_{U,l}} |(G_σ ∗ I_l)(x) − (G_σ ∗ I_r)(ω(x, ρ_{l,θ}(x)))| + Σ_{x ∈ Ω_{U,r}} |(G_σ ∗ I_r)(x) − (G_σ ∗ I_l)(ω(x, ρ_{r,θ}(x)))| ,   (9)

with a Gaussian smoothing kernel G_σ with a standard deviation of σ = 1 px. We found this small amount of Gaussian smoothing to be beneficial, presumably due to reducing image noise. We evaluate the direct image alignment loss at the sets of image pixels Ω_{U,l/r} of the reconstructed images that warp to a valid location in the second image. We use linear interpolation for subpixel-level warping.
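The following PyTorch sketch illustrates one direction of Eq. (9): both images are smoothed with a σ = 1 px Gaussian, the second image is warped into the first using the predicted inverse depth with bilinear (linear subpixel) interpolation, and the error is summed over pixels that warp to a valid location. Tensor names, the kernel size, and the sign convention of the warp are assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(img, sigma=1.0, ksize=5):
    """Separable Gaussian smoothing G_sigma * I (sigma = 1 px as in the paper)."""
    half = ksize // 2
    x = torch.arange(ksize, dtype=img.dtype, device=img.device) - half
    k = torch.exp(-(x ** 2) / (2.0 * sigma ** 2))
    k = (k / k.sum()).view(1, 1, 1, ksize)
    c = img.shape[1]
    img = F.conv2d(img, k.repeat(c, 1, 1, 1), padding=(0, half), groups=c)
    img = F.conv2d(img, k.transpose(2, 3).repeat(c, 1, 1, 1), padding=(half, 0), groups=c)
    return img

def alignment_loss_one_direction(img_a, img_b, inv_depth_a, focal, baseline):
    """One direction of the direct image alignment loss (Eq. 9)."""
    b, _, h, w = img_a.shape
    device, dtype = img_a.device, img_a.dtype
    ys, xs = torch.meshgrid(torch.arange(h, device=device, dtype=dtype),
                            torch.arange(w, device=device, dtype=dtype),
                            indexing="ij")
    # Horizontal warp omega(x, rho(x)) = x - f*b*rho(x) on rectified images (Eq. 3).
    x_warp = xs.unsqueeze(0) - focal * baseline * inv_depth_a.squeeze(1)
    y_warp = ys.unsqueeze(0).expand_as(x_warp)
    # Normalize to [-1, 1] for grid_sample (bilinear = subpixel linear interpolation).
    grid = torch.stack((2.0 * x_warp / (w - 1) - 1.0,
                        2.0 * y_warp / (h - 1) - 1.0), dim=-1)
    smooth_a = gaussian_blur(img_a)
    warped_b = F.grid_sample(gaussian_blur(img_b), grid,
                             mode="bilinear", align_corners=True)
    valid = (x_warp >= 0) & (x_warp <= w - 1)   # Omega_U: pixels warping inside img_b
    per_pixel = (smooth_a - warped_b).abs().mean(dim=1)
    return per_pixel[valid].sum()
```

The full term of Eq. (9) evaluates this expression in both directions, swapping the roles of the two images and using the respective inverse depth prediction.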
Regularization loss. As suggested in [9], the smoothness term penalizes depth changes at pixels with low intensity variation. In order to allow for depth discontinuities at object contours, we downscale the regularization term anisotropically according to the intensity variation:

L^R_θ = Σ_{i ∈ {l,r}} Σ_x |φ(∇I_i(x))^T ∇ρ_i(x)|   (10)

with φ(g) = (exp(−η |g_x|), exp(−η |g_y|))^T and η = 1/255.
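A minimal PyTorch sketch of this regularizer follows, assuming forward differences for the gradients, intensities in [0, 255] (so that η = 1/255 is meaningful), and weights that decrease with the image gradient magnitude as described above; the function and tensor names are illustrative.

```python
import torch

def smoothness_loss(img, inv_depth, eta=1.0 / 255.0):
    """Sketch of the edge-aware regularizer (Eq. 10) for one image.

    img:       intensity/RGB image in [0, 255], shape (B, C, H, W)
    inv_depth: predicted inverse depth rho, shape (B, 1, H, W)
    Depth gradients are penalized less where the image gradient is large,
    which keeps depth discontinuities at object contours cheap.
    """
    # Horizontal and vertical forward differences of image and inverse depth.
    img_gx = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    img_gy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    rho_gx = (inv_depth[:, :, :, 1:] - inv_depth[:, :, :, :-1]).abs()
    rho_gy = (inv_depth[:, :, 1:, :] - inv_depth[:, :, :-1, :]).abs()

    # phi(g) = (exp(-eta*|g_x|), exp(-eta*|g_y|)): downscale the penalty at edges.
    loss_x = (torch.exp(-eta * img_gx) * rho_gx).sum()
    loss_y = (torch.exp(-eta * img_gy) * rho_gy).sum()
    return loss_x + loss_y
```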
Supervised, unsupervised, and regularization terms are seamlessly combined within our novel semi-supervised loss function formulation (see Fig. 2). In contrast to previous methods, our approach treats both cameras in the stereo setup equivalently. All three loss components are formulated in a symmetric way for the cameras, which implicitly enforces consistency in the predicted depth maps between the cameras.
3.2. Network Architecture
We use a deep residual network architecture in an encoder-decoder scheme, similar to the supervised approach in [16] (see Fig. 3). Taking inspiration from non-residual architectures such as FlowNet [3], our architecture includes long skip connections between the encoder and decoder to facilitate fine detail predictions at the output resolution. Table 1 details the various layers in our network.

Layer | Channels I/O | Scaling | Inputs
conv1 (7x7, stride 2) | 3 / 64 | 2 | RGB
max pool1 (3x3, stride 2) | 64 / 64 | 4 | conv1
res block1 (type 2, stride 1) | 64 / 256 | 4 | max pool1
res block2 (type 1, stride 1) | 256 / 256 | 4 | res block1
res block3 (type 1, stride 1) | 256 / 256 | 4 | res block2
res block4 (type 2, stride 2) | 256 / 512 | 8 | res block3
res block5 (type 1, stride 1) | 512 / 512 | 8 | res block4
res block6 (type 1, stride 1) | 512 / 512 | 8 | res block5
res block7 (type 1, stride 1) | 512 / 512 | 8 | res block6
res block8 (type 2, stride 2) | 512 / 1024 | 16 | res block7
res block9 (type 1, stride 1) | 1024 / 1024 | 16 | res block8
res block10 (type 1, stride 1) | 1024 / 1024 | 16 | res block9
res block11 (type 1, stride 1) | 1024 / 1024 | 16 | res block10
res block12 (type 1, stride 1) | 1024 / 1024 | 16 | res block11
res block13 (type 1, stride 1) | 1024 / 1024 | 16 | res block12
res block14 (type 2, stride 2) | 1024 / 2048 | 32 | res block13
res block15 (type 1, stride 1) | 2048 / 2048 | 32 | res block14
res block16 (type 1, stride 1) | 2048 / 2048 | 32 | res block15
conv2 (1x1, stride 1) | 2048 / 1024 | 32 | res block16
upproject1 | 1024 / 512 | 16 | conv2
upproject2 | 512 / 256 | 8 | upproject1, res block13
upproject3 | 256 / 128 | 4 | upproject2, res block7
upproject4 | 128 / 64 | 2 | upproject3, res block3
conv3 (3x3, stride 1) | 64 / 1 | 2 | upproject4

Table 1. Layers in our deep residual encoder-decoder architecture. We input the final output layers at each resolution of the encoder at the respective decoder layers (long skip connections). This facilitates the prediction of fine detailed depth maps by the CNN.
Input to our network is the RGB camera image. The encoder resembles a ResNet-50 [11] architecture (without the final fully connected layer) and successively extracts low-resolution high-dimensional features from the input image. The encoder subsamples the input image in 5 stages, the first stage convolving the image to half input resolution and each successive stage stacking multiple residual blocks. The decoder upprojects the output of the encoder using residual blocks. We found that adding long skip connections between corresponding layers in encoder and decoder to this architecture slightly improves the performance on all metrics without affecting convergence. Moreover, the network is able to predict more detailed depth maps than without skip connections.

Figure 3. Illustration of our deep residual encoder-decoder architecture (c1, c3, mp1 abbreviate conv1, conv3, and max pool1, respectively).
Skip connections from corresponding encoder layers to the decoder facilitate fine detailed depth map prediction.
Figure 4. Type 1 residual block resblock^1_s with stride s = 1. The residual is obtained from 3 successive convolutions. The residual has the same number of channels as the input.
Figure 5. Type 2 residual block resblock^2_s with stride s. The residual is obtained from 3 successive convolutions, while the first convolution applies stride s. An additional convolution applies the same stride s and projects the input to the number of channels of the residual.
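For concreteness, here is a minimal PyTorch sketch of the two residual block types of Figs. 4 and 5, assuming the standard ResNet-style bottleneck layout (1x1, 3x3, 1x1 convolutions with batch normalization); only the stride and channel behaviour follows the captions, the filter sizes inside the blocks are an assumption.

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, k, s):
    """k x k convolution with stride s followed by batch normalization."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch))

class ResBlockType1(nn.Module):
    """Type 1 block (Fig. 4): 3 convolutions, identity skip, stride 1."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 4                     # bottleneck width (assumption)
        self.body = nn.Sequential(
            conv_bn(channels, mid, 1, 1), nn.ReLU(inplace=True),
            conv_bn(mid, mid, 3, 1), nn.ReLU(inplace=True),
            conv_bn(mid, channels, 1, 1))
        self.relu = nn.ReLU(inplace=True)       # ReLU applied after the sum

    def forward(self, x):
        return self.relu(x + self.body(x))

class ResBlockType2(nn.Module):
    """Type 2 block (Fig. 5): first convolution and the projection use stride s."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        mid = out_ch // 4
        self.body = nn.Sequential(
            conv_bn(in_ch, mid, 1, stride), nn.ReLU(inplace=True),
            conv_bn(mid, mid, 3, 1), nn.ReLU(inplace=True),
            conv_bn(mid, out_ch, 1, 1))
        self.project = conv_bn(in_ch, out_ch, 1, stride)   # projects the input
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.project(x) + self.body(x))
```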
We denote a convolution of filter size k × k and stride s by conv^k_s. The same notation applies to pooling layers, e.g., max_pool^k_s. Each convolution layer is followed by batch normalization with exception of the last layer in the network. Furthermore, we use ReLU activation functions on the output of the convolutions except at the inputs to the sum operation of the residual blocks, where the ReLU comes after the sum operation. resblock^i_s denotes the residual block of type i with stride s at its first convolution layer; see Figs. 4 and 5 for details on each type of residual block. Smaller feature blocks consist of 16s maps, while larger blocks contain 4 times more feature maps, where s is the output scale of the residual block. Lastly, upproject is the upprojection layer proposed by Laina et al. [16]. We use the fast implementation of upprojection layers, but for better illustration we visualize upprojection by its "naive" version (see Fig. 6).

Figure 6. Schematic illustration of the upprojection residual block. It unpools the input by a factor of 2 and applies a residual block which reduces the number of channels by a factor of 2.
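Below is a minimal PyTorch sketch of the naive upprojection block of Fig. 6: the input is unpooled by a factor of 2 and a residual block halves the number of channels. The 5x5/3x3 filter sizes follow the up-projection design of Laina et al. [16], and replacing zero-interleaved unpooling with nearest-neighbour upsampling is a simplification for brevity.

```python
import torch.nn as nn
import torch.nn.functional as F

class UpProject(nn.Module):
    """Sketch of a 'naive' upprojection block: 2x unpooling, then a residual
    block that reduces the number of channels by a factor of 2."""
    def __init__(self, in_ch):
        super().__init__()
        out_ch = in_ch // 2
        self.branch = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 5, padding=2), nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch))
        self.project = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 5, padding=2), nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Unpool by a factor of 2 (nearest-neighbour upsampling as a stand-in
        # for the zero-interleaved unpooling of the original layer).
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        return self.relu(self.branch(x) + self.project(x))
```

In the decoder of Table 1 such blocks are stacked four times, and the long skip connections feed the corresponding encoder outputs into the upprojection stages.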
4. Experiments
We evaluate our approach on the raw sequences of the KITTI benchmark [7], which is a popular dataset for single image depth map prediction. The sequences contain stereo imagery taken from a driving car in an urban scenario. The dataset also provides 3D laser measurements from a Velodyne laser scanner that we use as ground-truth measurements (projected into the stereo images using the given intrinsics and extrinsics in KITTI). This dataset has been used to train and evaluate the state-of-the-art methods and allows for quantitative comparison.

We evaluate our approach on the KITTI Raw split into 28 testing scenes as proposed by Eigen et al. [5]. We decided to use the remaining sequences of the KITTI Raw dataset for training and validation. We obtained a training set from 28 sequences in which we even out the sequence distribution with 450 frames per sequence. This results in 7346 unique frames and 12600 frames in total for training. We also created a validation set by sampling every tenth frame from the remaining 5 sequences with little image motion. All these sequences are urban, so we additionally select those frames from the training sequences that are in the middle between 2 training images with a distance of at least 20 frames. In total we obtain a validation set of 100 urban and 144 residential area images.
4.1. Implementation Details
We initialize the encoder part of our network with ResNet-50 [11] weights pretrained for the ImageNet classification task. The convolution filter weights in the decoder part are initialized randomly according to the approach of Glorot and Bengio [8]. We also tried the initialization by He et al. [10] but did not notice any performance difference.
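A sketch of this initialization scheme using torchvision's ImageNet-pretrained ResNet-50 for the encoder and Glorot (Xavier) initialization for the decoder convolutions; the encoder/decoder module names and the strict=False weight mapping are placeholders, not the authors' code.

```python
import torch.nn as nn
from torchvision import models

def init_network(encoder, decoder):
    """Copy ImageNet-pretrained ResNet-50 weights into the encoder and apply
    Glorot (Xavier) initialization to the decoder convolutions (placeholder
    module names; the real layer mapping depends on the architecture)."""
    # pretrained=True still works; newer torchvision versions use the weights= argument.
    pretrained = models.resnet50(pretrained=True).state_dict()
    encoder.load_state_dict(pretrained, strict=False)   # skip e.g. the fc layer

    for m in decoder.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.xavier_uniform_(m.weight)            # Glorot and Bengio [8]
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```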


References
[7] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] O. Russakovsky et al. ImageNet large scale visual recognition challenge. IJCV, 2015.