Learning Non-Lambertian Object Intrinsics Across ShapeNet Categories

TL;DR: In this paper, the authors focus on the non-Lambertian object-level intrinsic problem of recovering diffuse albedo, shading, and specular highlights from a single image of an object.
Abstract: We focus on the non-Lambertian object-level intrinsic problem of recovering diffuse albedo, shading, and specular highlights from a single image of an object. Based on existing 3D models in the ShapeNet database, a large-scale object intrinsics database is rendered with HDR environment maps. Millions of synthetic images of objects and their corresponding albedo, shading, and specular ground-truth images are used to train an encoder-decoder CNN, which can decompose an image into the product of albedo and shading components along with an additive specular component. Our CNN delivers accurate and sharp results in this classical inverse problem of computer vision. Evaluated on our realistic synthetic dataset, our method consistently outperforms the state-of-the-art by a large margin. We train and test our CNN across different object categories. Perhaps surprisingly, especially from the CNN classification perspective, our intrinsics CNN generalizes very well across categories. Our analysis shows that feature learning at the encoder stage is more crucial for developing a universal representation across categories. We apply our model to real images and videos from the Internet, and observe robust and realistic intrinsics results. Quality non-Lambertian intrinsics could open up many interesting applications such as realistic product search based on material properties and image-based albedo/specular editing.


Learning Non-Lambertian Object Intrinsics across ShapeNet Categories
Jian Shi
Institute of Software, Chinese Academy of Sciences
University of Chinese Academy of Sciences
shij@ios.ac.cn
Yue Dong
Microsoft Research Asia
yuedong@microsoft.com
Hao Su
Stanford University
haosu@cs.stanford.edu
Stella X. Yu
UC Berkeley / ICSI
stellayu@berkeley.edu
Abstract
We consider the non-Lambertian object intrinsic problem of recovering diffuse albedo, shading, and specular highlights from a single image of an object.

We build a large-scale object intrinsics database based on existing 3D models in the ShapeNet database. Rendered with realistic environment maps, millions of synthetic images of objects and their corresponding albedo, shading, and specular ground-truth images are used to train an encoder-decoder CNN. Once trained, the network can decompose an image into the product of albedo and shading components, along with an additive specular component.

Our CNN delivers accurate and sharp results in this classical inverse problem of computer vision, with the sharp details attributed to skip layer connections at corresponding resolutions from the encoder to the decoder. Benchmarked on our ShapeNet intrinsics and the MIT intrinsics datasets, our model consistently outperforms the state-of-the-art by a large margin.

We train and test our CNN on different object categories. Perhaps surprisingly, especially from the CNN classification perspective, our intrinsics CNN generalizes very well across categories. Our analysis shows that feature learning at the encoder stage is more crucial for developing a universal representation across categories.

We apply our synthetic-data-trained model to images and videos downloaded from the Internet, and observe robust and realistic intrinsics results. Quality non-Lambertian intrinsics could open up many interesting applications such as image-based albedo and specular editing.
1. Introduction
Specular reflection is common to objects encountered in our daily life. However, existing intrinsic image decomposition algorithms, e.g. SIRFS [3] or Direct Intrinsics (DI) [20], only deal with Lambertian or diffuse reflection. Such a mismatch between the reality of images and the model assumption often leads to large errors in the intrinsic image decomposition of real images (Fig. 1).

Figure 1: Specularity is everywhere on objects around us and is essential for our material perception. Our task is to decompose an image of a single object into its non-Lambertian intrinsics components, which include not only albedo and shading but also specular highlights. We build a large-scale object non-Lambertian intrinsics database based on ShapeNet, and render millions of synthetic images with specular materials and environment maps. We train an encoder-decoder CNN that delivers much sharper and more accurate results than the prior art of direct intrinsics (DI). Our network consistently outperforms the state-of-the-art, especially for non-Lambertian objects, and enables realistic applications to image-based albedo and specular editing.
Our goal is to solve non-Lambertian object intrinsics from a single image. According to optical imaging physics, the old Lambertian model can be extended to a non-Lambertian model with the specular component as an additive residue term:

old : image I = albedo A × shading S                 (1)
new : image I = albedo A × shading S + specular R    (2)
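To make the decomposition concrete, here is a minimal sketch (Python/NumPy) of how an image is recomposed from the three components under the new model; the array names and the [0, 1] display clamp are illustrative assumptions, not fixed by the paper.

```python
import numpy as np

def compose_non_lambertian(albedo, shading, specular):
    """Recompose an image from its non-Lambertian intrinsics per Eq. (2):
    I = A * S + R. All inputs are float arrays of identical shape (H, W, 3);
    names and value ranges are illustrative assumptions."""
    image = albedo * shading + specular
    return np.clip(image, 0.0, 1.0)   # clamp only for display

# Setting the specular residue R to zero recovers the Lambertian model, Eq. (1).
```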
We take a data-driven deep learning approach, inspired by
DI [20], to learn the associations between the image and its
albedo, shading and specular components simultaneously.
The immediate challenge of our non-Lambertian object intrinsics task is the lack of ground-truth data, especially for the non-Lambertian case, where human annotation is simply impossible. Existing intrinsics datasets are not only Lambertian in nature, with only albedo and shading components, but also have their own individual caveats. The widely used MIT Intrinsic Images dataset [12] is very small by today's standard, with only 20 object instances under 11 illumination conditions. The MPI Sintel intrinsics dataset [8], used by Direct Intrinsics, is too artificial, with 18 cartoon-like scenes at 50 frames each. Intrinsics in the Wild (IIW) [5] is the first large-scale intrinsics dataset of real-world images, but it provides only sparse pairwise human ranking judgements on albedo, an inadequate measure for full intrinsic image decomposition.
Another major challenge is how to learn a multiple-image regression task at a pixel- and intensity-accurate level. Deep learning has been tremendously successful for image classification and somewhat successful for semantic segmentation and depth regression. The main differences lie in the spatial and tonal resolution demanded at the output: it is full-image and 1-bit for classification, more for segmentation and depth regression, and most for intrinsics tasks. The state-of-the-art DI CNN model [20] is adapted from a depth regression CNN with a coarse native spatial resolution. Their results are not only blurry, but also contain false structure variations in the output intrinsics where the input image has none. While benchmark scores for many CNN intrinsics models [21, 34, 35, 20, 22] are improving, the visual quality of the results remains poor compared with that of traditional approaches based on hand-crafted features and multitudes of priors [6].
Our work addresses these challenges and makes the following contributions.
1. New non-Lambertian object intrinsics dataset. We develop a new rendering-based, object-centric intrinsics dataset with specular reflection, built on ShapeNet, a large-scale 3D shape dataset.
2. New CNN with accurate and sharp results. Our approach not only significantly outperforms the state-of-the-art by every error metric, but also produces much sharper and more detailed visual results.
3. Analysis of cross-category generalization. Surprisingly from the deep learning perspective on classification or segmentation, our intrinsics CNN shows remarkable generalization across categories: networks trained only on chairs also obtain reasonable performance on other categories such as cars. Our analysis of cross-category training and testing results reveals that the features learned at the encoder stage are the key to developing a universal representation across categories.
Our model delivers solid non-Lambertian intrinsics results on real images and videos, closing the gap between intrinsic image algorithm development and practical applications.
2. Related Work
Intrinsic Image Decomposition. Much effort has been devoted to this long-standing, ill-posed problem [4] of decomposing an image into a reflectance layer and a shading layer. Land and McCann [18] observe that large gradients in images usually correspond to changes in reflectance and small gradients to smooth shading variations. To tackle this ill-posed problem, where two outputs are sought from a single input, many priors that constrain the solution space have been explored, such as reflectance sparsity [28, 30], non-local texture [29], and shape and illumination [3]. Another line of approaches seeks additional information from the input, such as image sequences [33], depth [2, 10], and user strokes [7]. A major challenge in intrinsics research is the lack of ground-truth datasets. Grosse et al. [12] capture the first real-image dataset in a lab setting, with limited size and variation. Bell et al. [5] use crowdsourcing to obtain human judgements on pairs of pixels.
Deep Learning. Narihira et al. [21] are the first to use deep learning to learn albedo from IIW's sparse human judgement data. Zhou et al. [34] and Zoran et al. [35] extend the IIW-CRF model with a CNN learning component. Direct Intrinsics [20] is the first entirely deep-learning model that outputs intrinsics predictions; it is based on the depth regression CNN of [11] and trained on the synthetic MPI Sintel intrinsics dataset. Their results are blurry, due to downsampling convolutions followed by deconvolutions, and poor due to training on artificial scenes. Our work builds upon the success of skip layer connections used in deep CNNs for classification [14] and segmentation [25, 27]. We propose so-called mirror links that forward early encoder features to later decoder layers to generate sharp details.
Reflectance Estimation. Multiple images are usually required for an accurate estimation of surface albedo. Aittala et al. [1] propose a learning-based method for single-image inputs, assuming that the surface contains only stochastic textures and is lit by known lighting directions. Most methods work on homogeneous objects lit by distant light sources, with surface reflectance and environment lighting estimated via blind deconvolution [26] or trained regression networks [25]. Our work aims at general intrinsic image decomposition from a single image, without constraints on material or lighting distributions. Our model predicts spatially varying albedo maps and supports general lighting conditions.
Learning from Rendered Images. Images rendered from 3D models are widely used in deep learning, e.g. [31, 19, 13, 23] for training object detectors and viewpoint classifiers. [32] obtains state-of-the-art results for viewpoint estimation by adapting CNNs trained on synthetic images to real ones. ShapeNet [9] provides 330,000 annotated models from over 4,000 categories, with rich texture information from artists. We build our non-Lambertian intrinsics dataset and algorithms based on ShapeNet, rendering and learning from photorealistic images of many varieties of common objects.
3. Intrinsic Image with Specular Reflectance
We derive our non-Lambertian intrinsic decomposition equation from physics-based rendering. Given an input image, the observed outgoing radiance I at each pixel can be formulated as the product integral of the incident lighting L and the surface reflectance ρ via the rendering equation [16]:

$$I = \int_{\Omega^+} \rho(\omega_i, \omega_o)\,(N \cdot \omega_i)\, L(\omega_i)\, d\omega_i. \quad (3)$$

Here, $\omega_o$ is the viewing direction, $\omega_i$ the lighting direction from the upper hemisphere domain $\Omega^+$, and $N$ the surface normal direction of the object.
Surface reflectance ρ is a 4D function usually defined as the bidirectional reflectance distribution function (BRDF). Various BRDF models have been proposed, all sharing a similar structure with a diffuse term $\rho_d$, a specular term $\rho_s$, and coefficients $\alpha_d$, $\alpha_s$:

$$\rho = \alpha_d \cdot \rho_d + \alpha_s \cdot \rho_s \quad (4)$$

For the diffuse component, light scatters multiple times and produces a view-independent, low-frequency smooth appearance. By contrast, for the specular component, light scatters at the surface point only once and produces a shiny appearance. The scope of reflection is modeled by the diffuse albedo $\alpha_d$ and the specular albedo $\alpha_s$.
Combining the reflection equation (4) with the rendering equation (3), we have the following image formation model:

$$I = \alpha_d \int_{\Omega^+} \rho_d(\omega_i, \omega_o)\,(N \cdot \omega_i)\, L(\omega_i)\, d\omega_i \;+\; \alpha_s \int_{\Omega^+} \rho_s(\omega_i, \omega_o)\,(N \cdot \omega_i)\, L(\omega_i)\, d\omega_i \;=\; \alpha_d\, s_d + \alpha_s\, s_s, \quad (5)$$

where $s_d$ and $s_s$ are the diffuse and specular shading, respectively. Traditional intrinsics models consider diffuse shading only, decomposing the input image I as the product of diffuse albedo A and shading S. However, it is only proper to model the diffuse and specular components separately, since their albedos have different values and spatial distributions. The usual decomposition I = A × S is only a crude approximation.

Figure 2: Our mirror-link CNN architecture has one shared encoder and three separate decoders for the albedo, shading, and specular components. Mirror links connect the encoder and decoder layers of the same spatial resolution, providing visual details. The height of the layers in this figure indicates their spatial resolution.
Specular reflectance $\alpha_s s_s$ has characteristics very different from diffuse reflectance $\alpha_d s_d$: both the specular albedo and the specular shading have high-frequency spatial distributions and color variations, making the decomposition more ambiguous. We thus choose to model specular reflectance as a single residual term R, resulting in the non-Lambertian extension I = A × S + R, where the input image I is decomposed into diffuse albedo A, diffuse shading S, and specular reflectance R.
Our image formation model is developed based on physics-based rendering and the physical properties of diffuse and specular reflection, and it does not assume any specific BRDF model. Simple BRDF models (e.g. Phong) can be used for rendering efficiency, and complex models (e.g. Cook-Torrance) for higher photo-realism.
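As an illustration of Equations (3)-(5), the sketch below evaluates a discretized version of the image formation model at a single surface point, using a Phong-style specular lobe. The sampling scheme, lack of normalization, and function names are simplifying assumptions for exposition, not the renderer actually used.

```python
import numpy as np

def shade_point(normal, view, light_dirs, light_rgb,
                rho_d, alpha_d, alpha_s, n_s):
    """Discretized sketch of Eq. (5) at one surface point.

    normal, view: unit 3-vectors; light_dirs: (K, 3) unit directions sampled
    over the upper hemisphere; light_rgb: (K, 3) incident radiance from the
    environment map (solid-angle weights assumed folded in). A Phong lobe
    stands in for rho_s; normalization constants are omitted."""
    s_d = np.zeros(3)
    s_s = np.zeros(3)
    for w_i, L_i in zip(light_dirs, light_rgb):
        cos_i = max(float(np.dot(normal, w_i)), 0.0)    # (N . w_i)
        s_d += rho_d * cos_i * L_i                      # diffuse shading integrand
        r = 2.0 * np.dot(normal, w_i) * normal - w_i    # mirror direction
        spec = max(float(np.dot(r, view)), 0.0) ** n_s  # Phong specular lobe
        s_s += spec * cos_i * L_i                       # specular shading integrand
    return alpha_d * s_d + alpha_s * s_s                # I = a_d*s_d + a_s*s_s
```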
4. Learning Intrinsics
We develop our CNN model and training procedure for
non-Lambertian intrinsics.
Mirror-Link CNN. Fig. 2 illustrates our encoder-decoder CNN architecture. The encoder progressively extracts and down-samples features, while the decoder up-samples and combines them to construct the output intrinsic components. The sizes of the feature maps (including input/output) are exactly mirrored in our network. We link early encoder features to the corresponding decoder layers at the same spatial resolution, in order to obtain the local sharp details preserved in early encoder layers. We share the same encoder and use separate decoders for A, S, and R.
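The PyTorch sketch below illustrates the mirror-link idea: a shared convolutional encoder, three decoders, and skip links that concatenate encoder features into each decoder at the matching resolution. Channel widths, depth, normalization, and output activations are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def down(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def up(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class MirrorLinkNet(nn.Module):
    def __init__(self, widths=(16, 32, 64, 128)):
        super().__init__()
        ins = (3,) + widths[:-1]
        self.encoder = nn.ModuleList([down(i, o) for i, o in zip(ins, widths)])

        def make_decoder():
            layers = [up(widths[-1], widths[-2])]          # deepest features only
            for k in range(len(widths) - 2, 0, -1):
                # doubled input channels: previous decoder output + mirror link
                layers.append(up(widths[k] * 2, widths[k - 1]))
            layers.append(up(widths[0] * 2, 3))            # full-resolution output
            return nn.ModuleList(layers)

        # One decoder per intrinsic component: albedo A, shading S, specular R.
        self.decoders = nn.ModuleDict(
            {name: make_decoder() for name in ("albedo", "shading", "specular")})

    def forward(self, x):
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)                                # features at each scale
        out = {}
        for name, dec in self.decoders.items():
            h = dec[0](skips[-1])
            for j, layer in enumerate(dec[1:], start=1):
                mirror = skips[len(skips) - 1 - j]         # same spatial resolution
                h = layer(torch.cat([h, mirror], dim=1))   # mirror link
            out[name] = h
        return out

# e.g. MirrorLinkNet()(torch.randn(1, 3, 256, 256)) -> dict of three (1, 3, 256, 256) maps
```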

Figure 3: Environment maps are employed in our rendering for realistic appearance; both outdoor and indoor scenes are included. An environment map not only represents the dominant light sources in the scene (e.g. sun, lamp, and window) but also includes correct information about the surroundings (e.g. sky, wall, and building). Although a dominant light might be sufficient for shading a Lambertian surface, the detailed surroundings provide the details in the specular component.
Our mirror links are similar to the skip connections in Deep Reflectance Map (DRM) [25] and U-Net [27]. However, our goal is entirely different: DRM solves an interpolation problem from high-resolution sparse inputs to low-resolution dense map outputs in the geometry space, ignoring the spatial inhomogeneity of reflectance, whereas U-Net deals with image segmentation rather than image-wise regression.
Edge-sensitive loss. Human vision is sensitive to edges, but standard loss functions such as MSE treat pixel errors equally. To obtain more precise and sharper edges, we re-weight pixel errors with image gradients.

Scale-invariant loss. There is an inherent scale ambiguity between albedo and shading, as only their product matters in the intrinsic image decomposition. [20] employs a weighted combination of MSE loss and scale-invariant MSE loss for training their intrinsics networks. The scaling ambiguity also exists in our formulation, and we combine these loss functions with our edge-sensitive weighting for training our network.
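A possible implementation of these two ideas is sketched below. The exact gradient-based weighting scheme and the combination weight are not specified in this section, so the formulas, hyper-parameters, and function names here are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def edge_weights(image, base=1.0, scale=10.0):
    """Per-pixel weights from image gradient magnitude (assumed scheme)."""
    gx = image[..., :, 1:] - image[..., :, :-1]
    gy = image[..., 1:, :] - image[..., :-1, :]
    mag = F.pad(gx.abs(), (0, 1)).mean(1, keepdim=True) + \
          F.pad(gy.abs(), (0, 0, 0, 1)).mean(1, keepdim=True)
    return base + scale * mag                      # heavier penalty near edges

def weighted_mse(pred, target, w):
    return (w * (pred - target) ** 2).mean()

def scale_invariant_mse(pred, target, w, eps=1e-6):
    """MSE after applying the best single scale factor a to the prediction."""
    a = (w * pred * target).sum() / ((w * pred * pred).sum() + eps)
    return weighted_mse(a * pred, target, w)

def intrinsics_loss(pred, gt, image, lam=0.5):
    """Edge-weighted MSE + scale-invariant MSE, summed over A, S, R."""
    w = edge_weights(image)
    total = 0.0
    for key in ("albedo", "shading", "specular"):
        total = total + weighted_mse(pred[key], gt[key], w) \
                      + lam * scale_invariant_mse(pred[key], gt[key], w)
    return total
```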
ShapeNet-Intrinsics Dataset. We obtain the geometry and albedo texture of 3D shapes from ShapeNet, a large-scale, richly-annotated 3D shape repository [9]. We pick 31,072 models from several common categories: car, chair, bus, sofa, airplane, bench, container, vessel, etc. These objects often have specular reflections.
Environment maps. To generate photo-realistic images, we collect 98 HDR environment maps from online public resources (http://www.hdrlabs.com/sibl/archive.html). Indoor and outdoor scenes with various illumination conditions are included, as shown in Fig. 3.
Rendering. We use the open-source renderer Mitsuba [15] to render the object models with various environment maps and random viewpoints sampled from the upper hemisphere. A modified Phong reflectance model [24, 17] is assigned to the objects to generate photo-realistic shading and specular effects. Since the original ShapeNet models come with reliable diffuse albedo only, we use random distributions for the specular parameters, with $k_s \in (0, 0.3)$ and $N_s \in (0, 300)$, which covers the range from purely diffuse to highly specular appearance (Fig. 1). We render albedo, shading and specular layers, and then synthesize images according to Equation 5.
Training. We split our dataset at the object level, in order to avoid images of the same object appearing in both the training and testing sets. We use an 80/20 split, resulting in 24,932 models for training and 6,240 for testing. All 98 environment maps are used to render 2,443,336 images for the training set. For the testing set, we randomly pick one image per testing model.

More implementation details can be found in the supplementary material.
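As a small illustration of the object-level split described above, the following sketch partitions model IDs before any rendering, so no view of a held-out object can leak into training; the function name, seed, and exact shuffling are hypothetical.

```python
import random

def object_level_split(model_ids, train_frac=0.8, seed=0):
    """Split ShapeNet model ids at the object level (80/20 here), so that no
    rendering of a test object ever appears in the training set."""
    ids = sorted(model_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

# Training pairs every training model with all 98 environment maps;
# testing uses one randomly picked image per held-out model.
```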
5. Evaluation
Our method is evaluated and compared with SIRFS [3], IIW [5], and Direct Intrinsics (DI) [20]. We also train DI using our ShapeNet intrinsics dataset and denote that model DI*. We adopt the usual metrics, MSE, LMSE and DSSIM, for quantitative evaluation.
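For reference, a sketch of the three metrics as commonly defined (MSE; LMSE as a locally scale-invariant MSE following Grosse et al.; DSSIM = (1 - SSIM)/2, computed here via scikit-image) is given below. The window and step sizes are assumptions, not necessarily the exact evaluation protocol.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mse(pred, gt):
    return float(np.mean((pred - gt) ** 2))

def lmse(pred, gt, window=64, step=32, eps=1e-8):
    """Scale-invariant MSE averaged over local windows (sizes are illustrative)."""
    errs = []
    h, w = pred.shape[:2]
    for y in range(0, h - window + 1, step):
        for x in range(0, w - window + 1, step):
            p = pred[y:y + window, x:x + window]
            g = gt[y:y + window, x:x + window]
            a = (p * g).sum() / ((p * p).sum() + eps)   # best local scale
            errs.append(np.mean((a * p - g) ** 2))
    return float(np.mean(errs)) if errs else mse(pred, gt)

def dssim(pred, gt):
    """DSSIM = (1 - SSIM) / 2 for multichannel images in [0, 1]."""
    s = structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)
    return (1.0 - s) / 2.0
```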
5.1. Synthetic Data
Table 1 shows the numerical evaluation on the synthetic testing set. Our algorithm consistently performs better than the others on the synthetic dataset; compared to off-the-shelf solutions, our method provides a 40-50% performance gain on the DSSIM error. Also note that DI*, i.e. DI trained with our dataset, produces the second-best results across almost all error metrics, demonstrating the advantage of our ShapeNet intrinsics dataset.

Numerical error metrics may not be fully indicative of visual quality, e.g. the naive baseline also produces low errors in some cases. Figure 4 provides visual comparisons against the ground truths.

Table 1: Evaluation on our synthetic (ShapeNet intrinsics) dataset. For the baseline, we set its albedo to be the input image and its shading to be 1.0. Our specular component scores MSE 0.0042, LMSE 0.0578, DSSIM 0.0831.

            MSE               LMSE              DSSIM
            albedo   shading  albedo   shading  albedo   shading
Baseline    0.0232   0.0153   0.0789   0.0231   0.2273   0.2341
SIRFS       0.0211   0.0227   0.0693   0.0324   0.2038   0.1356
IIW         0.0147   0.0149   0.0481   0.0228   0.1649   0.1367
DI          0.0252   0.0245   0.0711   0.0275   0.1984   0.1454
DI*         0.0115   0.0066   0.0470   0.0115   0.1655   0.0996
Ours        0.0083   0.0055   0.0353   0.0097   0.0939   0.0622

Figure 4: Results for the synthetic dataset; columns show Input, SIRFS, IIW, DI, DI*, Ours, Specular, and GT. Our baselines include SIRFS, IIW, Direct Intrinsics with the model released by its authors (DI), and the model trained by ourselves on our synthetic dataset (DI*). The top row of each group is albedo, and the bottom is shading. The Specular column shows the ground truth (top) and our result (bottom). We observe that specularity has essentially been removed from albedo/shading, especially for cars. Even for the sofa (last row), which has little specular reflection, our method still produces a good visual result. See more results in our supplementary material.
For objects with strong specular reflectance, e.g. cars, specular reflection violates the Lambertian assumption made by traditional intrinsics algorithms. These algorithms, SIRFS or IIW, simply cannot handle such specular components. Learning-based approaches, DI, DI*, or our method, can still learn from the data and perform better in these cases. For DI, the network trained on our ShapeNet dataset also has significantly better visual quality than their released model trained on the Sintel dataset. However, their results are blurry, a consequence of their deep convolution and deconvolution structure without our skip layer connections. Our model produces sharper images that preserve many visual details, such as boundaries in the albedo and specular images. Large specular areas on the bodies of cars are also extracted well in the specular output component, revealing the environment illumination. Such specular areas would confuse earlier algorithms and bring serious artifacts into albedo/shading.

5.2. MIT Intrinsics Dataset

We also test the performance of our network on the MIT intrinsics dataset [12]. Unlike our color environment lighting model designed for common real-world images, the MIT intrinsics dataset uses a lab-capture-oriented lighting model with a single grayscale directional light source, a scenario that is not included in our synthetic dataset. The lighting-model differences lead to dramatic visual differences and cause domain-shift problems for learning-based approaches [20]. We also follow [20] to fine-tune our network on the MIT dataset. Table 2 lists benchmark errors and Fig. 5 provides sample results.

Table 2: Evaluation on the MIT intrinsics dataset. Ours* is our ShapeNet-trained model fine-tuned on MIT.

         MSE               LMSE              DSSIM
         albedo   shading  albedo   shading  albedo   shading
SIRFS    0.0147   0.0083   0.0416   0.0168   0.1238   0.0985
DI       0.0277   0.0154   0.0585   0.0295   0.1526   0.1328
Ours     0.0468   0.0194   0.0752   0.0318   0.1825   0.1667
Ours*    0.0278   0.0126   0.0503   0.0240   0.1465   0.1200

Figure 5: Results on the MIT dataset; columns show Input, SIRFS, DI, Ours, Ours*, and GT. Ours* is our ShapeNet-trained model fine-tuned on MIT, with data generated by the GenMIT approach used in DI [20].

Citations

Journal ArticleDOI
TL;DR: This work demonstrates that it can recover non-Lambertian, spatially-varying BRDFs and complex geometry belonging to any arbitrary shape class, from a single RGB image captured under a combination of unknown environment illumination and flash lighting.
Abstract: Reconstructing shape and reflectance properties from images is a highly under-constrained problem, and has previously been addressed by using specialized hardware to capture calibrated data or by assuming known (or highly constrained) shape or reflectance. In contrast, we demonstrate that we can recover non-Lambertian, spatially-varying BRDFs and complex geometry belonging to any arbitrary shape class, from a single RGB image captured under a combination of unknown environment illumination and flash lighting. We achieve this by training a deep neural network to regress shape and reflectance from the image. Our network is able to address this problem because of three novel contributions: first, we build a large-scale dataset of procedurally generated shapes and real-world complex SVBRDFs that approximate real world appearance well. Second, single image inverse rendering requires reasoning at multiple scales, and we propose a cascade network structure that allows this in a tractable manner. Finally, we incorporate an in-network rendering layer that aids the reconstruction task by handling global illumination effects that are important for real-world scenes. Together, these contributions allow us to tackle the entire inverse rendering problem in a holistic manner and produce state-of-the-art results on both synthetic and real data.

244 citations

Proceedings Article
01 Jan 2017
TL;DR: This work proposes MarrNet, an end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape and derives differentiable projective functions from 3D shape to 2.5D sketches.
Abstract: 3D object reconstruction from a single image is a highly under-determined problem, requiring strong prior knowledge of plausible 3D shapes. This introduces challenge for learning-based approaches, as 3D object annotations in real images are scarce. Previous work chose to train on synthetic data with ground truth 3D information, but suffered from the domain adaptation issue when tested on real data. In this work, we propose an end-to-end trainable framework, sequentially estimating 2.5D sketches and 3D object shapes. Our disentangled, two-step formulation has three advantages. First, compared to full 3D shape, 2.5D sketches are much easier to be recovered from a 2D image, and to transfer from synthetic to real data. Second, for 3D reconstruction from the 2.5D sketches, we can easily transfer the learned model on synthetic data to real images, as rendered 2.5D sketches are invariant to object appearance variations in real images, including lighting, texture, etc. This further relieves the domain adaptation problem. Third, we derive differentiable projective functions from 3D shape to 2.5D sketches, making the framework end-to-end trainable on real images, requiring no real-image annotations. Our framework achieves state-of-the-art performance on 3D shape reconstruction.

219 citations

Posted Content
TL;DR: A neural reflectance decomposition (NeRD) technique that uses physically-based rendering to decompose the scene into spatially varying BRDF material properties enabling fast real-time rendering with novel illuminations.
Abstract: Decomposing a scene into its shape, reflectance, and illumination is a challenging but essential problem in computer vision and graphics. This problem is inherently more challenging when the illumination is not a single light source under laboratory conditions but is instead an unconstrained environmental illumination. Though recent work has shown that implicit representations can be used to model the radiance field of an object, these techniques only enable view synthesis and not relighting. Additionally, evaluating these radiance fields is resource and time-intensive. By decomposing a scene into explicit representations, any rendering framework can be leveraged to generate novel views under any illumination in real-time. NeRD is a method that achieves this decomposition by introducing physically-based rendering to neural radiance fields. Even challenging non-Lambertian reflectances, complex geometry, and unknown illumination can be decomposed into high-quality models. The datasets and code is available on the project page: this https URL

211 citations

Book ChapterDOI
08 Sep 2018
TL;DR: CGIntrinsics, a new, large-scale dataset of physically-based rendered images of scenes with full ground truth decompositions, is presented, demonstrating the surprising effectiveness of carefully-rendered synthetic data for the intrinsic images task.
Abstract: Intrinsic image decomposition is a challenging, long-standing computer vision problem for which ground truth data is very difficult to acquire. We explore the use of synthetic data for training CNN-based intrinsic image decomposition models, then applying these learned models to real-world images. To that end, we present CGIntrinsics, a new, large-scale dataset of physically-based rendered images of scenes with full ground truth decompositions. The rendering process we use is carefully designed to yield high-quality, realistic images, which we find to be crucial for this problem domain. We also propose a new end-to-end training method that learns better decompositions by leveraging CGIntrinsics, and optionally IIW and SAW, two recent datasets of sparse annotations on real-world images. Surprisingly, we find that a decomposition network trained solely on our synthetic data outperforms the state-of-the-art on both IIW and SAW, and performance improves even further when IIW and SAW data is added during training. Our work demonstrates the surprising effectiveness of carefully-rendered synthetic data for the intrinsic images task.

143 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations

Book ChapterDOI
05 Oct 2015
TL;DR: Ronneberger et al. proposed a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently, which can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.
Abstract: There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .

49,590 citations


Posted Content
TL;DR: ShapeNet contains 3D models from a multitude of semantic categories and organizes them under the WordNet taxonomy, a collection of datasets providing many semantic annotations for each 3D model such as consistent rigid alignments, parts and bilateral symmetry planes, physical sizes, keywords, as well as other planned annotations.
Abstract: We present ShapeNet: a richly-annotated, large-scale repository of shapes represented by 3D CAD models of objects. ShapeNet contains 3D models from a multitude of semantic categories and organizes them under the WordNet taxonomy. It is a collection of datasets providing many semantic annotations for each 3D model such as consistent rigid alignments, parts and bilateral symmetry planes, physical sizes, keywords, as well as other planned annotations. Annotations are made available through a public web-based interface to enable data visualization of object attributes, promote data-driven geometric analysis, and provide a large-scale quantitative benchmark for research in computer graphics and vision. At the time of this technical report, ShapeNet has indexed more than 3,000,000 models, 220,000 models out of which are classified into 3,135 categories (WordNet synsets). In this report we describe the ShapeNet effort as a whole, provide details for all currently available datasets, and summarize future plans.

3,707 citations

Journal ArticleDOI
TL;DR: The mathematics of a lightness scheme that generates lightness numbers, the biologic correlate of reflectance, independent of the flux from objects is described.
Abstract: Sensations of color show a strong correlation with reflectance, even though the amount of visible light reaching the eye depends on the product of reflectance and illumination. The visual system must achieve this remarkable result by a scheme that does not measure flux. Such a scheme is described as the basis of retinex theory. This theory assumes that there are three independent cone systems, each starting with a set of receptors peaking, respectively, in the long-, middle-, and short-wavelength regions of the visible spectrum. Each system forms a separate image of the world in terms of lightness that shows a strong correlation with reflectance within its particular band of wavelengths. These images are not mixed, but rather are compared to generate color sensations. The problem then becomes how the lightness of areas in these separate images can be independent of flux. This article describes the mathematics of a lightness scheme that generates lightness numbers, the biologic correlate of reflectance, independent of the flux from objects

3,480 citations