Learning Non-Lambertian Object Intrinsics across ShapeNet Categories
Jian Shi
Institute of Software, Chinese Academy of Sciences
University of Chinese Academy of Sciences
shij@ios.ac.cn
Yue Dong
Microsoft Research Asia
yuedong@microsoft.com
Hao Su
Stanford University
haosu@cs.stanford.edu
Stella X. Yu
UC Berkeley / ICSI
stellayu@berkeley.edu
Abstract
We consider the non-Lambertian object intrinsic prob-
lem of recovering diffuse albedo, shading, and specular
highlights from a single image of an object.
We build a large-scale object intrinsics database based
on existing 3D models in the ShapeNet database. Ren-
dered with realistic environment maps, millions of synthetic
images of objects and their corresponding albedo, shad-
ing, and specular ground-truth images are used to train an
encoder-decoder CNN. Once trained, the network can de-
compose an image into the product of albedo and shading
components, along with an additive specular component.
Our CNN delivers accurate and sharp results in this
classical inverse problem of computer vision, with the sharp
details attributed to skip layer connections at corresponding
resolutions from the encoder to the decoder. Benchmarked on
our ShapeNet and MIT intrinsics datasets, our model
consistently outperforms the state-of-the-art by a large margin.
We train and test our CNN on different object categories.
Perhaps surprisingly, especially from the CNN classification
perspective, our intrinsics CNN generalizes very
well across categories. Our analysis shows that feature
learning at the encoder stage is more crucial for developing
a universal representation across categories.
We apply our synthetic data trained model to images and
videos downloaded from the internet, and observe robust
and realistic intrinsics results. Quality non-Lambertian in-
trinsics could open up many interesting applications such
as image-based albedo and specular editing.
1. Introduction
Specular reflection is common to objects encountered in
our daily life. However, existing intrinsic image decomposition
algorithms, e.g. SIRFS [3] or Direct Intrinsics (DI) [20],
only deal with Lambertian or diffuse reflection. Such a
mismatch between the reality of images and the model
assumption often leads to large errors in the intrinsic image
decomposition of real images (Fig. 1).
Figure 1: Specularity is everywhere on objects around us
and is essential for our material perception. Our task is
to decompose an image of a single object into its
non-Lambertian intrinsics components that include not only
albedo and shading, but also specular highlights. We build a
large-scale object non-Lambertian intrinsics database based
on ShapeNet, and render millions of synthetic images with
specular materials and environment maps. We train an
encoder-decoder CNN that delivers much sharper and more
accurate results than the prior art of direct intrinsics (DI).
Our network consistently outperforms the state of the art,
especially for non-Lambertian objects, and enables realistic
applications to image-based albedo and specular editing.
arXiv:1612.08510v1 [cs.CV] 27 Dec 2016
Our goal is to solve non-Lambertian object intrinsics
from a single image. According to optical imaging physics,
the old Lambertian model can be extended to a
non-Lambertian model with the specular component as an
additive residue term:
old : image I = albedo A × shading S (1)
new : image I = albedo A × shading S + specular R (2)
We take a data-driven deep learning approach, inspired by
DI [20], to learn the associations between the image and its
albedo, shading and specular components simultaneously.
The immediate challenge of our non-Lambertian object
intrinsics task is the lack of ground-truth data, especially
for the non-Lambertian case, and human annotation is
simply infeasible. Existing intrinsics datasets are not only
Lambertian in nature, with only albedo and shading
components, but also have their own individual caveats. The
widely used MIT Intrinsic Images dataset [12] is very small
by today’s standards, with only 20 object instances under 11
illumination conditions. The MPI Sintel [8] intrinsics dataset,
used by Direct Intrinsics, is too artificial, with 18
cartoon-like scenes at 50 frames each. Intrinsics in the Wild (IIW)
[5] is the first large-scale intrinsics dataset of real-world
images, but it provides sparse pairwise human ranking
judgements on albedo only, an inadequate measure for full
intrinsic image decomposition.
Another major challenge is how to learn a full
multiple-image regression task at a pixel- and intensity-accurate
level. Deep learning has been tremendously successful for
image classification and somewhat successful for semantic
segmentation and depth regression. The main differences
lie in the spatial and tonal resolution demanded at
the output: it is full image and 1 bit for classification,
more for segmentation and depth regression, and most for
intrinsics tasks. The state-of-the-art DI CNN model [20] is
adapted from a depth regression CNN with a coarse native
spatial resolution. Its results are not only blurry but
also contain false structures, i.e. variations in the output
intrinsics that arise from no structure in the input image.
While benchmark scores for many CNN intrinsics models
[21, 34, 35, 20, 22] are improving, the visual quality of
results remains poor compared with those from traditional
approaches based on hand-crafted features and multitudes
of priors [6].
Our work addresses these challenges and makes the
following contributions.
1. New non-Lambertian object intrinsics dataset. We de-
velop a new rendering-based object-centric intrinsics
dataset with specular reflection based on ShapeNet, a
large-scale 3D shape dataset.
2. New CNN with accurate and sharp results. Our
approach not only significantly outperforms the
state-of-the-art by every error metric, but also produces much
sharper and more detailed visual results.
3. Analysis on cross-category generalization. Surprisingly,
from the deep learning perspective on classification or
segmentation, our intrinsics CNN shows remarkable
generalization across categories: networks trained only
on chairs also obtain reasonable performance on other
categories such as cars. Our analysis of cross-category
training and testing results reveals that features learned
at the encoder stage are the key to developing a
universal representation across categories.
Our model delivers solid non-Lambertian intrinsics results
on real images and videos, closing the gap between intrinsic
image algorithm development and practical applications.
2. Related Work
Intrinsic Image Decomposition. Much effort has been
devoted to this long-standing ill-posed problem [4] of
decomposing an image into a reflectance layer and a shading
layer. Land and McCann [18] observe that large gradients
in images usually correspond to changes in reflectance and
small gradients to smooth shading variations. To tackle this
ill-posed problem where two outputs are sought out of a
single input, many priors that constrain the solution space
have been explored, such as reflectance sparsity [28, 30],
non-local texture [29], shape and illumination [3], etc.
Another line of approaches seeks additional information from
the input, such as image sequences [33], depth [2, 10] and
user strokes [7]. A major challenge in intrinsics research is
the lack of ground-truth datasets. Grosse et al. [12]
capture the first real image dataset in a lab setting, with limited
size and variations. Bell et al. [5] use crowdsourcing to
obtain human judgements on pairs of pixels.
Deep Learning. Narihira et al. [21] are the first to use
deep learning to learn albedo from IIW’s sparse human
judgement data. Zhou et al. [34] and Zoran et al. [35]
extend the IIW-CRF model with a CNN learning component.
Direct Intrinsics [20] is the first entirely deep learning
model that outputs intrinsics predictions, based on the depth
regression CNN by [11] and trained on the synthetic MPI
Sintel intrinsics dataset. Their results are blurry, due to
down-sampling and convolutions followed by deconvolutions, and
poor, due to training on artificial scenes. Our work builds
upon the success of skip layer connections used in deep
CNNs for classification [14] and segmentation [25, 27]. We
propose so-called mirror-links to forward early encoder
features to later decoder layers to generate sharp details.
Reflectance Estimation. Multiple images are usually
required for an accurate estimation of surface albedo.
Aittala et al. [1] propose a learning-based method for
single image inputs, assuming that the surface only contains
stochastic textures and is lit by known lighting directions.
Most methods work on homogeneous objects lit by distant
light sources, with surface reflectance and environment
lighting estimated via blind deconvolution [26] or trained
regression networks [25]. Our work aims at general
intrinsic image decomposition from a single image, without
constraints on material or lighting distributions. Our model
predicts spatially varying albedo maps and supports general
lighting conditions.
Learning from Rendered Images. Images rendered
from 3D models are widely used in deep learning, e.g.
[31, 19, 13, 23] for training object detectors and viewpoint
classifiers. [32] obtains state-of-the-art results for viewpoint
estimation by adapting CNNs trained from synthetic images
to real ones. ShapeNet [9] provides 330,000 annotated
models from over 4,000 categories, with rich texture information
from artists. We build our non-Lambertian intrinsics dataset
and algorithms based on ShapeNet, rendering and learning
from photorealistic images on many varieties of common
objects.
3. Intrinsic Image with Specular Reflectance
We derive our non-Lambertian intrinsic decomposition
equation based on physics-based rendering. Given an
input image, the observed outgoing radiance $I$ at each pixel
can be formulated as the product integral between incident
lighting $L$ and surface reflectance $\rho$ via this rendering
equation [16]:
$$I = \int_{\Omega^+} \rho(\omega_i, \omega_o)\,(N \cdot \omega_i)\,L(\omega_i)\,d\omega_i. \qquad (3)$$
Here, $\omega_o$ is the viewing direction, $\omega_i$ the lighting direction
from the upper hemisphere domain $\Omega^+$, and $N$ the surface
normal direction of the object.
Surface reflectance $\rho$ is a 4D function usually defined as
the bi-directional reflectance distribution function (BRDF).
Various BRDF models have been proposed, all sharing a
similar structure with a diffuse term $\rho_d$ and a specular term
$\rho_s$, and coefficients $\alpha_d$, $\alpha_s$:
$$\rho = \alpha_d \cdot \rho_d + \alpha_s \cdot \rho_s \qquad (4)$$
For the diffuse component, light scatters multiple times and
produces a view-independent, low-frequency, smooth
appearance. By contrast, for the specular component, light
scatters at the surface point only once and produces a shiny
appearance. The scope of reflection is modeled by diffuse
albedo $\alpha_d$ and specular albedo $\alpha_s$.
Figure 2: Our mirror-link CNN architecture has one shared
encoder and three decoders for albedo, shading, specular
components separately. Mirror links connect the encoder
and decoder layers of the same spatial resolution, providing
visual details. The height of layers in this figure indicates
the spatial resolution.
Combining reflection equation (4) and rendering equation (3),
we have the following image formation model:
$$\begin{aligned}
I &= \alpha_d \int_{\Omega^+} \rho_d(\omega_i, \omega_o)\,(N \cdot \omega_i)\,L(\omega_i)\,d\omega_i
   + \alpha_s \int_{\Omega^+} \rho_s(\omega_i, \omega_o)\,(N \cdot \omega_i)\,L(\omega_i)\,d\omega_i \\
  &= \alpha_d s_d + \alpha_s s_s,
\end{aligned} \qquad (5)$$
where $s_d$ and $s_s$ are the diffuse and specular shading,
respectively. Traditional intrinsics models consider diffuse
shading only, by decomposing the input image $I$ as a
product of diffuse albedo $A$ and shading $S$. However, it is
only proper to model diffuse and specular components
separately, since their albedos have different values and spatial
distributions. The usual decomposition of $I = A \times S$ is
only a crude approximation.
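The shading integrals in the image formation model can be checked numerically. The following is a minimal sketch, not the paper's renderer: it assumes a Lambertian diffuse term $\rho_d = 1/\pi$, an energy-normalized Phong lobe for $\rho_s$, a surface normal fixed at +z, and uniform hemisphere sampling. The function names (`sample_upper_hemisphere`, `shading_terms`, `pixel_radiance`) and the `light` callback are hypothetical.

```python
import numpy as np

def sample_upper_hemisphere(n, rng):
    """Draw n unit directions uniformly over the hemisphere z >= 0."""
    z = rng.uniform(0.0, 1.0, n)            # uniform cos(theta) gives uniform solid angle
    phi = rng.uniform(0.0, 2.0 * np.pi, n)
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def shading_terms(light, w_o, shininess, n_samples=200_000, seed=0):
    """Monte Carlo estimates of (s_d, s_s) for a surface with normal N = +z."""
    rng = np.random.default_rng(seed)
    normal = np.array([0.0, 0.0, 1.0])
    w_o = np.asarray(w_o, dtype=float)
    w_i = sample_upper_hemisphere(n_samples, rng)
    cos_i = w_i @ normal                     # (N . w_i)
    radiance = light(w_i)                    # L(w_i), one value per sample
    pdf = 1.0 / (2.0 * np.pi)                # density of uniform hemisphere sampling

    rho_d = 1.0 / np.pi                      # assumed Lambertian diffuse term
    mirror = 2.0 * (w_o @ normal) * normal - w_o
    lobe = np.clip(w_i @ mirror, 0.0, None) ** shininess
    rho_s = (shininess + 2.0) / (2.0 * np.pi) * lobe   # assumed normalized Phong lobe

    s_d = np.mean(rho_d * cos_i * radiance) / pdf
    s_s = np.mean(rho_s * cos_i * radiance) / pdf
    return float(s_d), float(s_s)

def pixel_radiance(alpha_d, alpha_s, s_d, s_s):
    """Image formation model: I = alpha_d * s_d + alpha_s * s_s."""
    return alpha_d * s_d + alpha_s * s_s
```

Under constant unit lighting and a head-on view (`w_o = [0, 0, 1]`), both estimates should come out close to 1, because both assumed lobes integrate to 1 against the cosine term.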
Specular reflectance $\alpha_s s_s$ has characteristics very
different from diffuse reflectance $\alpha_d s_d$: both specular albedo and
specular shading have high-frequency spatial distributions
and color variations, making decomposition more ambiguous.
We thus choose to model specular reflectance as a single
residual term $R$, resulting in the non-Lambertian
extension $I = A \times S + R$, where the input image $I$ is decomposed
into diffuse albedo $A$, diffuse shading $S$, and specular
reflectance $R$, respectively.
Our image formation model is based on physics-based
rendering and the physical properties of diffuse
and specular reflection, and it does not assume any
specific BRDF model. Simple BRDF models (e.g. Phong) can
be used for rendering efficiency, and complex models (e.g.
Cook-Torrance) for higher photo-realism.
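The decomposition $I = A \times S + R$ can be illustrated with a few lines of array code. This is a small sketch under stated assumptions: the layer shapes (RGB albedo and specular, single-channel shading broadcast over color) and the helper names `compose` and `specular_residual` are illustrative choices, not taken from the paper.

```python
import numpy as np

def compose(albedo, shading, specular):
    """Non-Lambertian model: I = A * S + R (element-wise)."""
    return albedo * shading + specular

def specular_residual(image, albedo, shading):
    """What a purely Lambertian fit I = A * S leaves unexplained."""
    return image - albedo * shading

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 0.9, (4, 4, 3))     # diffuse albedo (RGB)
S = rng.uniform(0.2, 1.0, (4, 4, 1))     # diffuse shading, broadcast over RGB (assumption)
R = np.zeros((4, 4, 3))
R[1, 2] = 0.6                            # a single white highlight
I = compose(A, S, R)
```

With the true albedo and shading, the residual recovers exactly the specular layer, which is why treating specularity as an additive term keeps it out of the albedo and shading estimates.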
4. Learning Intrinsics
We develop our CNN model and training procedure for
non-Lambertian intrinsics.
Mirror-Link CNN. Fig. 2 illustrates our encoder-decoder
CNN architecture. The encoder progressively extracts
and down-samples features, while the decoder
up-samples and combines them to construct the output
intrinsic components. The sizes of feature maps (including
input/output) are exactly mirrored in our network. We link
early encoder features to the corresponding decoder layers
at the same spatial resolution, in order to obtain local sharp
details preserved in early encoder layers. We share the same
encoder and use separate decoders for A, S, R.
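The data flow of a mirror-link network can be sketched without any learned weights. The toy below only tracks feature-map shapes: average pooling stands in for strided convolutions, nearest-neighbour upsampling for deconvolutions, and random 1x1 channel mixing for learned layers. The depth, channel counts, and whether the raw input itself is linked to the last decoder layer are all assumptions made for illustration; none of this is the authors' actual layer configuration.

```python
import numpy as np

def downsample(x):
    """2x2 average pooling on a (C, H, W) array; stands in for a strided conv."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Nearest-neighbour 2x upsampling; stands in for a deconvolution."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def mix(x, out_channels, rng):
    """Random 1x1 convolution + ReLU; a placeholder for learned layers."""
    w = rng.standard_normal((out_channels, x.shape[0])) / np.sqrt(x.shape[0])
    return np.maximum(np.tensordot(w, x, axes=1), 0.0)

def encode(image, rng, depth=3):
    """Return features at resolutions H, H/2, H/4, ... (index 0 is the input)."""
    feats = [image]
    for level in range(depth):
        feats.append(mix(downsample(feats[-1]), 8 * 2 ** level, rng))
    return feats

def decode(feats, rng):
    """Upsample step by step, concatenating the encoder feature of equal size."""
    x = feats[-1]
    for skip in reversed(feats[:-1]):
        x = upsample(x)
        x = np.concatenate([x, skip], axis=0)   # mirror link at this resolution
        x = mix(x, skip.shape[0], rng)
    return x

rng = np.random.default_rng(0)
image = rng.uniform(size=(3, 32, 32))
feats = encode(image, rng)                      # shared encoder
albedo, shading, specular = (decode(feats, rng) for _ in range(3))  # separate decoders
```

Each decoder output has the same size as the input image, and every decoder stage sees the encoder feature map of matching resolution, which is the mechanism the text credits for sharp details.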
Figure 3: Environment maps are employed in our rendering
for realistic appearance; both outdoor and indoor scenes
are included. The environment map not only represents the
dominant light sources in the scene (e.g. sun, lamp and
window) but also includes correct information on the
surroundings (e.g. sky, wall and building). Although a dominant light
might be sufficient for shading a Lambertian surface,
detailed surroundings provide the details in the specular component.
Our mirror links are similar to skip connections in Deep
Reflectance Map (DRM) [25] and UNet [27]. However,
our goal is entirely different: DRM solves an interpolation
problem from high resolution sparse inputs to low resolution
dense map outputs in the geometry space, ignoring the
spatial inhomogeneity of reflectance, whereas UNet deals
with image segmentation rather than image-wise regression.
Edge sensitive loss. Human vision is sensitive to edges,
but standard loss functions such as MSE treat all pixel
errors equally. To get more precise and sharp edges, we
re-weight pixel errors with image gradients.
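The text does not give the exact weighting formula, so the following is only one plausible reading: each squared pixel error is scaled by `1 + edge_gain * |gradient|` of a guide image. The function name `edge_weighted_mse`, the `guide` argument, the `edge_gain` value, and the normalization by the total weight are all assumptions.

```python
import numpy as np

def edge_weighted_mse(pred, target, guide, edge_gain=4.0):
    """MSE in which errors near edges of `guide` (a 2D array) count more."""
    gy, gx = np.gradient(guide)                      # finite-difference image gradients
    weight = 1.0 + edge_gain * np.hypot(gx, gy)      # assumed weighting scheme
    return float(np.sum(weight * (pred - target) ** 2) / np.sum(weight))
```

For a step-edge target, an error of the same magnitude is penalized more when it sits on the edge than when it sits in a flat region, which is the behaviour the paragraph above asks for.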
Scale invariant loss. There is an inherent scale
ambiguity between albedo and shading, as only their product
matters in the intrinsic image decomposition. [20] employs
a weighted combination of MSE loss and scale-invariant
MSE loss for training their intrinsic networks. Scaling
ambiguity also exists in our formulation, and we combine these
loss functions with our edge-sensitive weighting for training
our network.
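A scale-invariant error can be sketched with a least-squares rescaling of the prediction before comparison. This is one common definition and may differ from the exact form used in [20]; the mixing weight `si_weight` and the function names are hypothetical.

```python
import numpy as np

def mse(pred, target):
    """Plain mean squared error."""
    return float(np.mean((pred - target) ** 2))

def scale_invariant_mse(pred, target):
    """MSE after rescaling pred by the scalar alpha that best fits target."""
    alpha = np.sum(pred * target) / max(np.sum(pred * pred), 1e-12)
    return mse(alpha * pred, target)

def combined_loss(pred, target, si_weight=0.5):
    """Weighted mix of scale-invariant MSE and plain MSE (weight is an assumption)."""
    return si_weight * scale_invariant_mse(pred, target) + (1.0 - si_weight) * mse(pred, target)
```

A prediction that is correct up to a global factor, e.g. `2 * target`, has essentially zero scale-invariant error but a nonzero plain MSE, so the mix keeps some pressure on absolute scale while tolerating the albedo/shading ambiguity.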
ShapeNet-Intrinsics Dataset. We obtain the geometry
and albedo texture of 3D shapes from ShapeNet, a
large-scale, richly-annotated 3D shape repository [9]. We pick
31,072 models from several common categories: car, chair,
bus, sofa, airplane, bench, container, vessel, etc. These
objects often have specular reflections.
Environment maps. To generate photo-realistic images,
we collect 98 HDR environment maps from online public
resources (http://www.hdrlabs.com/sibl/archive.html). Indoor
and outdoor scenes with various illumination conditions are
included, as shown in Fig. 3.
Rendering. We use the open-source renderer Mitsuba [15]
to render object models with various environment
maps and random viewpoints sampled from the upper
hemisphere. A modified Phong reflectance model [24, 17]
is assigned to objects to generate photo-realistic shading
and specular effects. Since original models in ShapeNet
are only provided with reliable diffuse albedo, we use a
random distribution for the specular parameters, with
$k_s \in (0, 0.3)$ and $N_s \in (0, 300)$, which covers the range
from pure diffuse to high specular appearance (Fig. 1).
We render albedo, shading and specular layers, and then
synthesize images according to Equation 5.
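The sampling of render settings described above can be sketched as follows. The ranges for `k_s` and `n_s` and the count of 98 environment maps come from the text; the uniform distributions, drawing fresh specular parameters for every image, the dictionary layout, and the function names are assumptions for illustration only.

```python
import math
import random

NUM_ENV_MAPS = 98  # number of HDR environment maps reported in the text

def sample_view_direction(rng):
    """Uniform random direction on the upper hemisphere (z >= 0)."""
    z = rng.random()                          # uniform cos(elevation)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(phi), r * math.sin(phi), z)

def render_configs(model_id, rng):
    """One render configuration per environment map for a single 3D model."""
    configs = []
    for env_map in range(NUM_ENV_MAPS):
        configs.append({
            "model": model_id,
            "env_map": env_map,
            "view_dir": sample_view_direction(rng),
            "k_s": rng.uniform(0.0, 0.3),     # specular coefficient (uniform is assumed)
            "n_s": rng.uniform(0.0, 300.0),   # Phong exponent (uniform is assumed)
        })
    return configs
```

Calling `render_configs("chair_0001", random.Random(0))` yields 98 settings, one per environment map, each of which would be handed to the renderer to produce albedo, shading and specular layers.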
Training. We split our dataset at the object level in
order to avoid images of the same object appearing in both
training and testing sets. We use an 80/20 split, resulting in
24,932 models for training and 6,240 for testing. All
98 environment maps are used to render 2,443,336
images for the training set. For the testing set, we randomly
pick 1 image per testing model.
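An object-level split can be sketched in a few lines. The helper `split_by_object`, the seed, and the toy model identifiers are hypothetical; only the 80/20 ratio and the one-image-per-environment-map count for training come from the text.

```python
import random

def split_by_object(model_ids, train_fraction=0.8, seed=0):
    """Split unique model ids so that no object appears in both sets."""
    ids = sorted(set(model_ids))
    random.Random(seed).shuffle(ids)
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]

train_ids, test_ids = split_by_object([f"model_{i:05d}" for i in range(1000)])
images_per_train_model = 98   # one rendering per environment map
```

The reported training-set size is consistent with this scheme: 24,932 training models times 98 environment maps gives 2,443,336 images.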
More implementation details can be found in the supple-
mentary materials.
5. Evaluation
Our method is evaluated and compared with SIRFS [3],
IIW [5], and Direct Intrinsics (DI) [20]. We also train DI
using our ShapeNet intrinsics dataset and denote the model as
DI*. We adopt the usual metrics, MSE, LMSE and DSSIM,
for quantitative evaluation.
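For orientation, simplified versions of the three metrics are sketched below for single-channel images. These are not the benchmark implementations: LMSE here averages a least-squares-rescaled error over sliding windows, and DSSIM is computed as (1 - SSIM) / 2 from global image statistics rather than local windows. The window size, step, and SSIM constants are assumed defaults.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error over all pixels."""
    return float(np.mean((pred - target) ** 2))

def _scaled_error(pred, target):
    """MSE after the best scalar rescaling of pred (per window)."""
    alpha = np.sum(pred * target) / max(np.sum(pred * pred), 1e-12)
    return float(np.mean((alpha * pred - target) ** 2))

def lmse(pred, target, window=20, step=10):
    """Local MSE: mean rescaled error over overlapping square windows."""
    h, w = target.shape
    errors = [
        _scaled_error(pred[i:i + window, j:j + window],
                      target[i:i + window, j:j + window])
        for i in range(0, h - window + 1, step)
        for j in range(0, w - window + 1, step)
    ]
    return float(np.mean(errors))

def dssim(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """Structural dissimilarity (1 - SSIM) / 2 from global statistics."""
    mp, mt = pred.mean(), target.mean()
    cov = np.mean((pred - mp) * (target - mt))
    ssim = ((2 * mp * mt + c1) * (2 * cov + c2)) / (
        (mp ** 2 + mt ** 2 + c1) * (pred.var() + target.var() + c2))
    return float((1.0 - ssim) / 2.0)
```

All three are error measures, so lower is better; LMSE forgives a per-window scale mismatch, while DSSIM responds to structural differences rather than absolute intensity.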
5.1. Synthetic Data
Table 1 shows the numeric evaluation on the synthetic
testing set. Our algorithm performs consistently better than
the others on the synthetic dataset; compared to
off-the-shelf solutions, our method provides a 40-50%
performance gain on the DSSIM error. Also note that DI*,
i.e. DI trained with our dataset, produces the second-best
results across almost all the error metrics, demonstrating the
advantage of our ShapeNet intrinsics dataset.
Numerical error metrics may not be fully indicative of
visual quality, e.g. the naive baseline also produces low
errors in some cases. Figure 4 provides visual comparisons
against ground truths.
For objects with strong specular reflectance, e.g. cars,
specular reflection violates the Lambertian condition
assumed by traditional intrinsics algorithms. These
algorithms, SIRFS or IIW, simply cannot handle such specular
components. Learning-based approaches, DI, DI*, or our
method, could still learn from the data and perform better
in these cases. For DI, the network trained on our ShapeNet
dataset also has significantly better visual quality,
compared with their released model trained on the Sintel dataset.
However, their results are blurry, as a consequence of
their deep convolution and deconvolution structures without
our skip layer connections. Our model produces sharper
images preserving many visual details, such as boundaries in
the albedo and specular images. Large specular areas on the
body of cars are also extracted well in the specular output
component, revealing the environment illumination. Such
specular areas would confuse earlier algorithms and bring
serious artifacts to albedo/shading.

ShapeNet          MSE               LMSE              DSSIM
intrinsics        albedo  shading   albedo  shading   albedo  shading
Baseline          0.0232  0.0153    0.0789  0.0231    0.2273  0.2341
SIRFS             0.0211  0.0227    0.0693  0.0324    0.2038  0.1356
IIW               0.0147  0.0149    0.0481  0.0228    0.1649  0.1367
DI                0.0252  0.0245    0.0711  0.0275    0.1984  0.1454
DI*               0.0115  0.0066    0.0470  0.0115    0.1655  0.0996
Ours              0.0083  0.0055    0.0353  0.0097    0.0939  0.0622
Ours (specular)   0.0042            0.0578            0.0831
Table 1: Evaluation on our synthetic dataset. For the
baseline, we set its albedo to be the input image and its shading
to be 1.0. The last row lists our specular error.

[Figure 4 columns: Input, SIRFS, IIW, DI, DI*, Ours, Specular, GT]
Figure 4: Results for the synthetic dataset. Our baselines
include SIRFS, IIW, Direct Intrinsics with the model released
by the authors (DI), and the model trained by ourselves on our
synthetic dataset (DI*). The top row of each group is
albedo, and the bottom is shading. The Specular column
shows the ground-truth (top) and our result (bottom). We
observe that specularity has basically been removed from
albedo/shading, especially for cars. Even for the sofa (last
row) with little specular, our method still produces good
visual results. See more results in our supplementary material.

MIT               MSE               LMSE              DSSIM
intrinsic         albedo  shading   albedo  shading   albedo  shading
SIRFS             0.0147  0.0083    0.0416  0.0168    0.1238  0.0985
DI                0.0277  0.0154    0.0585  0.0295    0.1526  0.1328
Ours              0.0468  0.0194    0.0752  0.0318    0.1825  0.1667
Ours*             0.0278  0.0126    0.0503  0.0240    0.1465  0.1200
Table 2: Evaluation on MIT intrinsics dataset.

[Figure 5 columns: Input, SIRFS, DI, Ours, Ours*, GT]
Figure 5: Results on the MIT dataset. Ours* is our
ShapeNet trained model fine-tuned on MIT, with data
generated by the GenMIT approach used in DI [20].
5.2. MIT Intrinsics Dataset
We also test the performance of our network on the MIT
intrinsics dataset [12]. Unlike our color environment light
model designed for common real-world images, the MIT
intrinsics dataset uses a lab-capture-oriented lighting model
with a single grayscale directional light source, a scenario
that is not included in our synthetic dataset. The light model
differences lead to dramatic visual differences and cause
domain shift problems for learning-based approaches [20]. We
also follow [20] to fine-tune our network on the MIT dataset.
Table 2 lists benchmark errors and Fig. 5 provides sample