Learning Non-Lambertian Object Intrinsics Across ShapeNet Categories

doi:10.1109/CVPR.2017.619

Jian Shi

Institute of Software, Chinese Academy of Sciences

University of Chinese Academy of Sciences

shij@ios.ac.cn

Yue Dong

Microsoft Research Asia

yuedong@microsoft.com

Hao Su

Stanford University

haosu@cs.stanford.edu

Stella X. Yu

UC Berkeley / ICSI

stellayu@berkeley.edu

Abstract

We consider the non-Lambertian object intrinsic problem of recovering diffuse albedo, shading, and specular highlights from a single image of an object.

We build a large-scale object intrinsics database based on existing 3D models in the ShapeNet database. Rendered with realistic environment maps, millions of synthetic images of objects and their corresponding albedo, shading, and specular ground-truth images are used to train an encoder-decoder CNN. Once trained, the network can decompose an image into the product of albedo and shading components, along with an additive specular component.

Our CNN delivers accurate and sharp results in this classical inverse problem of computer vision, with the sharp details attributed to skip layer connections at corresponding resolutions from the encoder to the decoder. Benchmarked on our ShapeNet and MIT intrinsics datasets, our model consistently outperforms the state-of-the-art by a large margin.

We train and test our CNN on different object categories. Perhaps surprisingly, especially from the CNN classification perspective, our intrinsics CNN generalizes very well across categories. Our analysis shows that feature learning at the encoder stage is more crucial for developing a universal representation across categories.

We apply our model, trained on synthetic data, to images and videos downloaded from the internet, and observe robust and realistic intrinsics results. Quality non-Lambertian intrinsics could open up many interesting applications such as image-based albedo and specular editing.

1. Introduction

Specular reflection is common to objects encountered in our daily life. However, existing intrinsic image decomposition algorithms, e.g. SIRFS [3] or Direct Intrinsics (DI) [20], only deal with Lambertian or diffuse reflection. This mismatch between the reality of images and the model assumption often leads to large errors in the intrinsic image decomposition of real images (Fig. 1).

Figure 1: Specularity is everywhere on objects around us and is essential for our material perception. Our task is to decompose an image of a single object into its non-Lambertian intrinsics components that include not only albedo and shading, but also specular highlights. We build a large-scale object non-Lambertian intrinsics database based on ShapeNet, and render millions of synthetic images with specular materials and environment maps. We train an encoder-decoder CNN that delivers much sharper and more accurate results than the prior art of direct intrinsics (DI). Our network consistently outperforms the state-of-the-art, especially for non-Lambertian objects, and enables realistic applications to image-based albedo and specular editing.

arXiv:1612.08510v1 [cs.CV] 27 Dec 2016

Our goal is to solve non-Lambertian object intrinsics from a single image. According to optical imaging physics, the old Lambertian model can be extended to a non-Lambertian model with the specular component as an additive residue term:

old : image I = albedo A × shading S (1)

new : image I = albedo A × shading S + specular R (2)
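To make the difference between Equations (1) and (2) concrete, the following minimal sketch composes a toy image from albedo, shading, and specular layers and then shows what a Lambertian inversion does to a highlight. The 2x2 layer values and the helper name `compose` are illustrative assumptions, not part of the paper.

```python
def compose(albedo, shading, specular):
    """Per-pixel image formation I = A * S + R (Eq. 2) on nested lists."""
    return [
        [a * s + r for a, s, r in zip(row_a, row_s, row_r)]
        for row_a, row_s, row_r in zip(albedo, shading, specular)
    ]

# Toy 2x2 single-channel layers (values are illustrative only).
albedo = [[0.8, 0.8], [0.2, 0.2]]
shading = [[0.5, 1.0], [0.5, 1.0]]
specular = [[0.0, 0.3], [0.0, 0.0]]

image = compose(albedo, shading, specular)

# A Lambertian model (Eq. 1) has no R term, so even with the true shading
# the highlight leaks into the recovered albedo at the top-right pixel.
lambertian_albedo = [
    [i / s for i, s in zip(row_i, row_s)]
    for row_i, row_s in zip(image, shading)
]
```

Pixels without a highlight recover the true albedo, while the highlighted pixel is overestimated, which is the kind of error the additive residual R is meant to absorb.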

We take a data-driven deep learning approach, inspired by

DI [20], to learn the associations between the image and its

albedo, shading and specular components simultaneously.

The immediate challenge of our non-Lambertian object intrinsics task is the lack of ground-truth data, and human annotation is simply infeasible. Existing intrinsics datasets are not only Lambertian in nature, with only albedo and shading components, but also have their own individual caveats. The widely used MIT Intrinsic Images dataset [12] is very small by today's standards, with only 20 object instances under 11 illumination conditions. The MPI Sintel [8] intrinsics dataset, used by Direct Intrinsics, is too artificial, with 18 cartoon-like scenes at 50 frames each. Intrinsics in the Wild (IIW) [5] is the first large-scale intrinsics dataset of real-world images, but it provides sparse pairwise human ranking judgements on albedo only, an inadequate measure of full intrinsic image decomposition.

Another major challenge is how to learn a full multiple-image regression task at a pixel- and intensity-accurate level. Deep learning has been tremendously successful for image classification and somewhat successful for semantic segmentation and depth regression. The main differences lie in the spatial and tonal resolution demanded at the output: it is full image and 1 bit for classification, more for segmentation and depth regression, and most for intrinsics tasks. The state-of-the-art DI CNN model [20] is adapted from a depth regression CNN with a coarse native spatial resolution. Its results are not only blurry but also contain false structures, i.e. variations in the output intrinsics that correspond to no structure in the input image. While benchmark scores for many CNN intrinsics models [21, 34, 35, 20, 22] are improving, the visual quality of their results remains poor compared with those from traditional approaches based on hand-crafted features and multitudes of priors [6].

Our work addresses these challenges and makes the following contributions.

1. New non-Lambertian object intrinsics dataset. We develop a new rendering-based object-centric intrinsics dataset with specular reflection based on ShapeNet, a large-scale 3D shape dataset.

2. New CNN with accurate and sharp results. Our approach not only significantly outperforms the state-of-the-art by every error metric, but also produces much sharper and more detailed visual results.

3. Analysis on cross-category generalization. Surprisingly, from the deep learning perspective on classification or segmentation, our intrinsics CNN shows remarkable generalization across categories: networks trained only on chairs also obtain reasonable performance on other categories such as cars. Our analysis of cross-category training and testing results reveals that features learned at the encoder stage are the key to developing a universal representation across categories.

Our model delivers solid non-Lambertian intrinsics results on real images and videos, closing the gap between intrinsic image algorithm development and practical applications.

2. Related Work

Intrinsic Image Decomposition. Much effort has been devoted to this long-standing ill-posed problem [4] of decomposing an image into a reflectance layer and a shading layer. Land and McCann [18] observe that large gradients in images usually correspond to changes in reflectance and small gradients to smooth shading variations. To tackle this ill-posed problem, where two outputs are sought from a single input, many priors that constrain the solution space have been explored, such as reflectance sparsity [28, 30], non-local texture [29], shape and illumination [3], etc. Another line of approaches seeks additional information from the input, such as image sequences [33], depth [2, 10], and user strokes [7]. A major challenge in intrinsics research is the lack of ground-truth datasets. Grosse et al. [12] capture the first real image dataset in a lab setting, with limited size and variations. Bell et al. [5] use crowdsourcing to obtain human judgements on pairs of pixels.

Deep Learning. Narihira et al. [21] are the first to use deep learning to learn albedo from IIW's sparse human judgement data. Zhou et al. [34] and Zoran et al. [35] extend the IIW-CRF model with a CNN learning component. Direct Intrinsics [20] is the first entirely deep learning model that outputs intrinsics predictions, based on the depth regression CNN by [11] and trained on the synthetic MPI Sintel intrinsics dataset. Their results are blurry, owing to down-sampling and convolutions followed by deconvolutions, and poor due to training on artificial scenes. Our work builds upon the success of skip layer connections used in deep CNNs for classification [14] and segmentation [25, 27]. We propose so-called mirror-links to forward early encoder features to later decoder layers to generate sharp details.

Reflectance Estimation. Multiple images are usually required for an accurate estimation of surface albedo. Aittala et al. [1] propose a learning-based method for single image inputs, assuming that the surface only contains stochastic textures and is lit by known lighting directions. Most methods work on homogeneous objects lit by distant light sources, with surface reflectance and environment lighting estimated via blind deconvolution [26] or trained regression networks [25]. Our work aims at general intrinsic image decomposition from a single image, without constraints on material or lighting distributions. Our model predicts spatially varying albedo maps and supports general lighting conditions.

Learning from Rendered Images. Images rendered from 3D models are widely used in deep learning, e.g. [31, 19, 13, 23] for training object detectors and viewpoint classifiers. [32] obtains state-of-the-art results for viewpoint estimation by adapting CNNs trained from synthetic images to real ones. ShapeNet [9] provides 330,000 annotated models from over 4,000 categories, with rich texture information from artists. We build our non-Lambertian intrinsics dataset and algorithms based on ShapeNet, rendering and learning from photorealistic images on many varieties of common objects.

3. Intrinsic Image with Specular Reﬂectance

We derive our non-Lambertian intrinsic decomposition equation from physics-based rendering. Given an input image, the observed outgoing radiance I at each pixel can be formulated as the product integral between incident lighting L and surface reflectance ρ via the rendering equation [16]:

$$I = \int_{\Omega_+} \rho(\omega_i, \omega_o)\,(N \cdot \omega_i)\,L(\omega_i)\,d\omega_i. \tag{3}$$

Here, $\omega_o$ is the viewing direction, $\omega_i$ the lighting direction from the upper hemisphere domain $\Omega_+$, and N the surface normal direction of the object.
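As a numerical illustration of Equation (3), the sketch below estimates the integral by Monte Carlo sampling of the upper hemisphere. It assumes the surface normal is the +z axis (so the sampled hemisphere coincides with $\Omega_+$), uniform direction sampling, and user-supplied `brdf` and `light` callables; all function names are hypothetical. For a Lambertian BRDF of albedo/π under constant unit lighting, the integral should evaluate to the albedo.

```python
import math
import random

def sample_hemisphere(rng):
    """Uniformly sample a unit direction on the upper hemisphere (z >= 0)."""
    z = rng.random()                      # cos(theta) is uniform under uniform sampling
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(max(0.0, 1.0 - z * z))
    return (r * math.cos(phi), r * math.sin(phi), z)

def render_pixel(brdf, light, normal, view, n_samples=20000, seed=0):
    """Monte Carlo estimate of Eq. 3 with uniform sampling (pdf = 1 / (2 * pi))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w_i = sample_hemisphere(rng)
        cos_term = max(0.0, sum(n * w for n, w in zip(normal, w_i)))
        total += brdf(w_i, view) * cos_term * light(w_i)
    return total * (2.0 * math.pi) / n_samples

# Lambertian surface with albedo 0.6 under constant white lighting.
up = (0.0, 0.0, 1.0)
estimate = render_pixel(lambda w_i, w_o: 0.6 / math.pi, lambda w_i: 1.0, up, up)
```

Up to sampling noise, the estimate approaches 0.6, since the cosine term integrates to π over the hemisphere.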

Surface reflectance ρ is a 4D function usually defined as the bi-directional reflectance distribution function (BRDF). Various BRDF models have been proposed, all sharing a similar structure with a diffuse term $\rho_d$ and a specular term $\rho_s$, and coefficients $\alpha_d$, $\alpha_s$:

$$\rho = \alpha_d \cdot \rho_d + \alpha_s \cdot \rho_s \tag{4}$$

For the diffuse component, light scatters multiple times and produces a view-independent, low-frequency, smooth appearance. By contrast, for the specular component, light scatters at the surface point only once and produces a shiny appearance. The scope of reflection is modeled by the diffuse albedo $\alpha_d$ and the specular albedo $\alpha_s$.

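Combining reflection equation (4) and rendering equation (3), we have the following image formation model:

$$
\begin{aligned}
I &= \alpha_d \int_{\Omega_+} \rho_d(\omega_i, \omega_o)\,(N \cdot \omega_i)\,L(\omega_i)\,d\omega_i
   + \alpha_s \int_{\Omega_+} \rho_s(\omega_i, \omega_o)\,(N \cdot \omega_i)\,L(\omega_i)\,d\omega_i \\
  &= \alpha_d s_d + \alpha_s s_s,
\end{aligned}
\tag{5}
$$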

where $s_d$ and $s_s$ are the diffuse and specular shading, respectively. Traditional intrinsics models consider diffuse shading only, by decomposing the input image I as a product of diffuse albedo A and shading S. However, it is only proper to model diffuse and specular components separately, since their albedos have different values and spatial distributions. The usual decomposition of I = A × S is only a crude approximation.

Specular reflectance $\alpha_s s_s$ has characteristics very different from diffuse reflectance $\alpha_d s_d$: both specular albedo and specular shading have high-frequency spatial distributions and color variations, making decomposition more ambiguous. We thus choose to model specular reflectance as a single residual term R, resulting in the non-Lambertian extension I = A × S + R, where the input image I is decomposed into diffuse albedo A, diffuse shading S, and specular reflectance R.

Our image formation model is developed based on physics-based rendering and the physical properties of diffuse and specular reflection, and it does not assume any specific BRDF model. Simple BRDF models (e.g. Phong) can be used for rendering efficiency, and complex models (e.g. Cook-Torrance) for higher photo-realism.

Figure 2: Our mirror-link CNN architecture has one shared encoder and three decoders for albedo, shading, specular components separately. Mirror links connect the encoder and decoder layers of the same spatial resolution, providing visual details. The height of layers in this figure indicates the spatial resolution.

4. Learning Intrinsics

We develop our CNN model and training procedure for non-Lambertian intrinsics.

Mirror-Link CNN. Fig. 2 illustrates our encoder-decoder CNN architecture. The encoder progressively extracts and down-samples features, while the decoder up-samples and combines them to construct the output intrinsic components. The sizes of feature maps (including input/output) are exactly mirrored in our network. We link early encoder features to the corresponding decoder layers at the same spatial resolution, in order to obtain local sharp details preserved in early encoder layers. We share the same encoder and use separate decoders for A, S, R.
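The mirrored layout described above can be sketched as simple resolution bookkeeping. This is only an illustration under assumed settings: a 256-pixel input, four stride-2 encoder stages, and the helper name `mirror_link_plan` are hypothetical, and the paper's actual layer counts and channel widths are given in its supplementary material rather than here.

```python
def mirror_link_plan(input_size=256, depth=4):
    """Return encoder/decoder resolutions and the same-resolution layer pairs."""
    # Each encoder stage halves the resolution (assumed stride 2).
    encoder = [input_size // (2 ** k) for k in range(1, depth + 1)]
    # The decoder mirrors the encoder, up-sampling back toward the input size.
    decoder = list(reversed(encoder))
    # Encoder stage e and decoder stage depth-1-e share a resolution, so
    # features can be forwarded between them without resizing.
    links = [(e, depth - 1 - e, encoder[e]) for e in range(depth)]
    return encoder, decoder, links

encoder, decoder, links = mirror_link_plan()

# One shared encoder feeds three decoders, each reusing the same link pattern.
decoders = {name: links for name in ("albedo", "shading", "specular")}
```

Because every link joins layers of equal resolution, forwarded encoder features can be combined with decoder features directly, which is what lets early, detail-rich features reach the output.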

Figure 3: Environment maps are employed in our rendering for realistic appearance; both outdoor and indoor scenes are included. The environment map not only represents the dominant light sources in the scene (e.g. sun, lamp and window) but also includes correct information on the surroundings (e.g. sky, wall and building). Although a dominant light might be sufficient for shading a Lambertian surface, detailed surroundings provide the details in the specular component.

Our mirror links are similar to skip connections in Deep Reflectance Map (DRM) [25] and UNet [27]. However, our goal is entirely different: DRM solves an interpolation problem from high-resolution sparse inputs to low-resolution dense map outputs in the geometry space, ignoring the spatial inhomogeneity of reflectance, whereas UNet deals with image segmentation rather than image-wise regression.

Edge sensitive loss. Human vision is sensitive to edges, whereas standard loss functions such as MSE treat all pixel errors equally. To get more precise and sharp edges, we re-weight pixel errors with image gradients.
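The text does not spell out the exact weighting, so the following is only one plausible sketch of a gradient-weighted MSE. It assumes that the weights come from forward-difference gradients of the ground-truth layer and take the form 1 + edge_gain · |∇target|; both the form and the `edge_gain` parameter are assumptions for illustration.

```python
def gradient_magnitude(img):
    """Forward-difference |dx| + |dy| of a 2D nested list (zero past the border)."""
    h, w = len(img), len(img[0])
    grad = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = img[y][x + 1] - img[y][x] if x + 1 < w else 0.0
            dy = img[y + 1][x] - img[y][x] if y + 1 < h else 0.0
            grad[y][x] = abs(dx) + abs(dy)
    return grad

def edge_weighted_mse(pred, target, edge_gain=4.0):
    """Weighted MSE where pixels on strong target edges count more."""
    grad = gradient_magnitude(target)
    num = den = 0.0
    for row_p, row_t, row_g in zip(pred, target, grad):
        for p, t, g in zip(row_p, row_t, row_g):
            weight = 1.0 + edge_gain * g
            num += weight * (p - t) ** 2
            den += weight
    return num / den

target = [[0.0, 0.0, 1.0, 1.0]]            # a step edge between columns 1 and 2
error_on_edge = [[0.0, 0.1, 1.0, 1.0]]     # 0.1 error at the edge pixel
error_on_flat = [[0.1, 0.0, 1.0, 1.0]]     # same error in a flat region
```

With this weighting, the same absolute error is penalized more when it falls on an edge than when it falls in a flat region.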

Scale invariant loss. There is an inherent scale ambiguity between albedo and shading, as only their product matters in the intrinsic image decomposition. [20] employs a weighted combination of MSE loss and scale-invariant MSE loss for training their intrinsic networks. Scaling ambiguity also exists in our formulation, and we combine these loss functions with our edge-sensitive weighting for training our network.
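A minimal sketch of the two loss terms follows, using flat lists of pixel values. The scale-invariant MSE here rescales the prediction by the least-squares optimal scalar before measuring the error, which is a common definition; the mixing weight `si_weight` is a hypothetical parameter, and the actual weights used in [20] or in this paper are not stated in the text.

```python
def mse(pred, target):
    """Plain mean squared error over flat lists of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def scale_invariant_mse(pred, target):
    """MSE after rescaling pred by the scalar that best fits target."""
    denom = sum(p * p for p in pred)
    scale = sum(p * t for p, t in zip(pred, target)) / denom if denom > 0 else 0.0
    return mse([scale * p for p in pred], target)

def combined_loss(pred, target, si_weight=0.5):
    """Weighted mix of MSE and scale-invariant MSE (weight is an assumption)."""
    return (1.0 - si_weight) * mse(pred, target) + si_weight * scale_invariant_mse(pred, target)

target = [0.2, 0.4, 0.8]
half_scale = [0.5 * t for t in target]     # correct up to a global factor
```

A prediction that is correct up to a global factor incurs no scale-invariant error but still pays the plain MSE term, so the mix discourages arbitrary rescaling without making the absolute scale the only thing that matters.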

ShapeNet-Intrinsics Dataset. We obtain the geometry and albedo texture of 3D shapes from ShapeNet, a large-scale, richly-annotated 3D shape repository [9]. We pick 31,072 models from several common categories: car, chair, bus, sofa, airplane, bench, container, vessel, etc. These objects often have specular reflections.

Environment maps. To generate photo-realistic images, we collect 98 HDR environment maps from online public resources (http://www.hdrlabs.com/sibl/archive.html). Indoor and outdoor scenes with various illumination conditions are included, as shown in Fig. 3.

Rendering. We use an open-source renderer, Mitsuba [15], to render object models with various environment maps and random viewpoints sampled from the upper hemisphere. A modified Phong reflectance model [24, 17] is assigned to objects to generate photo-realistic shading and specular effects. Since original models in ShapeNet are only provided with reliable diffuse albedo, we use a random distribution for the specular parameters, with $k_s \in (0, 0.3)$ and $N_s \in (0, 300)$, which covers the range from pure diffuse to highly specular appearance (Fig. 1). We render albedo, shading and specular layers, and then synthesize images according to Equation 5.
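To illustrate the role of the two specular parameters, the sketch below evaluates an energy-normalized Phong BRDF, in which a diffuse lobe k_d/π is added to a specular lobe k_s (N_s + 2)/(2π) cos^{N_s} α around the mirror direction. Whether the renderer uses exactly this normalization is an assumption, as is the uniform sampling of k_s and N_s; the text only says a random distribution over the stated ranges was used. All function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reflect(w_i, normal):
    """Mirror the incoming direction w_i about the unit surface normal."""
    d = dot(w_i, normal)
    return tuple(2.0 * d * n - w for n, w in zip(normal, w_i))

def modified_phong(w_i, w_o, normal, k_d, k_s, n_s):
    """Diffuse lobe plus a normalized cosine-power specular lobe."""
    cos_alpha = max(0.0, dot(reflect(w_i, normal), w_o))
    diffuse = k_d / math.pi
    specular = k_s * (n_s + 2.0) / (2.0 * math.pi) * cos_alpha ** n_s
    return diffuse + specular

def sample_specular_params(rng):
    """Draw (k_s, N_s) from the quoted ranges; uniform sampling is an assumption."""
    return rng.uniform(0.0, 0.3), rng.uniform(0.0, 300.0)

rng = random.Random(0)
k_s, n_s = sample_specular_params(rng)
normal = (0.0, 0.0, 1.0)
w_i = (0.0, math.sqrt(0.5), math.sqrt(0.5))     # light arriving at 45 degrees
w_mirror = reflect(w_i, normal)                 # view along the mirror direction
peak = modified_phong(w_i, w_mirror, normal, k_d=0.5, k_s=k_s, n_s=n_s)
off_peak = modified_phong(w_i, normal, normal, k_d=0.5, k_s=k_s, n_s=n_s)
```

Small k_s values leave the material nearly diffuse, while a large N_s concentrates the specular lobe into a tight highlight around the mirror direction.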

Training. We split our dataset at the object level in order to avoid images of the same object appearing in both training and testing sets. We use an 80/20 split, resulting in 24,932 models for training and 6,240 for testing. All 98 environment maps are used to render 2,443,336 images for the training set. For the testing set, we randomly pick 1 image per testing model.
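An object-level split can be sketched as follows: the model identifiers are shuffled and partitioned first, and every rendered image then follows its model. The identifier format, the toy counts, and the helper name `split_by_object` are illustrative assumptions.

```python
import random

def split_by_object(model_ids, train_fraction=0.8, seed=0):
    """Partition unique model ids so no object lands in both sets."""
    ids = sorted(set(model_ids))
    random.Random(seed).shuffle(ids)
    cut = int(round(train_fraction * len(ids)))
    return set(ids[:cut]), set(ids[cut:])

# Toy data: 10 hypothetical models, each rendered under 3 environment maps.
renders = [(f"model_{m:03d}", env) for m in range(10) for env in range(3)]

train_ids, test_ids = split_by_object([model for model, _ in renders])
train_set = [r for r in renders if r[0] in train_ids]
test_set = [r for r in renders if r[0] in test_ids]
```

Splitting images directly instead would let different renders of the same object appear on both sides, which could inflate test scores.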

More implementation details can be found in the supplementary materials.

5. Evaluation

Our method is evaluated and compared with SIRFS [3], IIW [5], and Direct Intrinsics (DI) [20]. We also train DI using our ShapeNet intrinsics dataset and denote the model as DI*. We adopt the usual metrics, MSE, LMSE and DSSIM, for quantitative evaluation.
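For reference, DSSIM is commonly defined as (1 − SSIM)/2. The sketch below computes it from a single global SSIM window over flat lists of pixel values, which is a simplification: standard SSIM averages over local windows, and the evaluation code used for the tables is not given in the text. The constants follow the usual SSIM defaults, and the function names are hypothetical.

```python
def ssim_global(x, y, dynamic_range=1.0):
    """SSIM computed over one global window (simplified, no local averaging)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2      # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2      # stabilizes the contrast/structure term
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def dssim(x, y, dynamic_range=1.0):
    """Structural dissimilarity: 0 for identical images, larger is worse."""
    return (1.0 - ssim_global(x, y, dynamic_range)) / 2.0

reference = [0.1, 0.5, 0.9, 0.3]
flattened = [0.4, 0.5, 0.6, 0.45]         # same mean trend, much lower contrast
```

Unlike MSE, this measure responds to loss of contrast and structure, which is why it is reported alongside MSE and LMSE.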

5.1. Synthetic Data

Table 1 shows the numeric evaluation on the synthetic testing set. Our algorithm performs consistently better than the others on the synthetic dataset. Compared to off-the-shelf solutions, our method provides a 40-50% performance gain on the DSSIM error. Also note that DI*, i.e. DI trained with our dataset, produces the second best results across almost all the error metrics, demonstrating the advantage of our ShapeNet intrinsics dataset.

Numerical error metrics may not be fully indicative of visual quality, e.g. the naive baseline also produces low errors in some cases. Figure 4 provides visual comparisons against ground truths.

For objects with strong specular reflectance, e.g. cars, specular reflection violates the Lambertian condition assumed by traditional intrinsics algorithms. These algorithms, SIRFS or IIW, simply cannot handle such specular components. Learning-based approaches, DI, DI*, or our method, can still learn from the data and perform better in these cases. For DI, the network trained on our ShapeNet dataset also has significantly better visual quality, compared with their released model trained on the Sintel dataset. However, their results are blurry, as a consequence of their deep convolution and deconvolution structures without our skip layer connections. Our model produces sharper images preserving many visual details, such as boundaries in the albedo and specular images. Large specular areas on the body of cars are also extracted well in the specular output component, revealing the environment illumination. Such specular areas would confuse earlier algorithms and bring serious artifacts to albedo/shading.

ShapeNet              MSE                  LMSE                 DSSIM
intrinsics      albedo   shading     albedo   shading     albedo   shading
Baseline        0.0232   0.0153      0.0789   0.0231      0.2273   0.2341
SIRFS           0.0211   0.0227      0.0693   0.0324      0.2038   0.1356
IIW             0.0147   0.0149      0.0481   0.0228      0.1649   0.1367
DI              0.0252   0.0245      0.0711   0.0275      0.1984   0.1454
DI*             0.0115   0.0066      0.0470   0.0115      0.1655   0.0996
Ours            0.0083   0.0055      0.0353   0.0097      0.0939   0.0622
Ours, specular       0.0042               0.0578               0.0831

Table 1: Evaluation on our synthetic dataset. For the baseline, we set its albedo to be the input image and its shading to be 1.0. The last row lists our specular error.

Figure 4: Results for the synthetic dataset (columns: Input, SIRFS, IIW, DI, DI*, Ours, Specular, GT). Our baselines include SIRFS, IIW, Direct Intrinsics with the model released by the authors (DI), and the model trained by ourselves on our synthetic dataset (DI*). The top row of each group is albedo, and the bottom is shading. The Specular column shows the ground-truth (top) and our result (bottom). We observe that specularity has basically been removed from albedo/shading, especially for cars. Even for the sofa (last row) with little specular reflection, our method still produces good visual results. See more results in our supplementary material.

MIT                   MSE                  LMSE                 DSSIM
intrinsic       albedo   shading     albedo   shading     albedo   shading
SIRFS           0.0147   0.0083      0.0416   0.0168      0.1238   0.0985
DI              0.0277   0.0154      0.0585   0.0295      0.1526   0.1328
Ours            0.0468   0.0194      0.0752   0.0318      0.1825   0.1667
Ours*           0.0278   0.0126      0.0503   0.0240      0.1465   0.1200

Table 2: Evaluation on MIT intrinsics dataset.

Figure 5: Results on the MIT dataset (columns: Input, SIRFS, DI, Ours, Ours*, GT). Ours* is our ShapeNet trained model fine-tuned on MIT, with data generated by the GenMIT approach used in DI [20].

5.2. MIT Intrinsics Dataset

We also test the performance of our network on the MIT intrinsics dataset [12]. Unlike our color environment light model designed for common real-world images, the MIT intrinsics dataset uses a lab-capture-oriented lighting model with a single grayscale directional light source, a scenario that is not included in our synthetic dataset. The differences in light model lead to dramatic visual differences and cause domain shift problems for learning-based approaches [20]. We also follow [20] to fine-tune our network on the MIT dataset.

Table 2 lists benchmark errors and Fig. 5 provides sample
