Eurographics Symposium on Rendering 2017
P. Sander and M. Zwicker
(Guest Editors)
Volume 36 (2017), Number 4
Decomposing Single Images for Layered Photo Retouching
Carlo Innamorati Tobias Ritschel Tim Weyrich Niloy J. Mitra
University College London
[Figure 1 image panels. Labels: “Without our method” / “With our method”; editing times: 30 sec / 2 min (car example) and 30 sec / 1 min (statue example).]
Figure 1:
Appearance manipulation of a single photograph (top images) when using off-the-shelf software like Photoshop directly (left arrow)
and when using the same in combination with our new layering (right arrow). For the car example, the image was decomposed into layers
(albedo, irradiance, specular, and ambient occlusion), which were then manipulated individually: specular highlights were strengthened and
blurred; irradiance and ambient occlusion were darkened and given added contrast; the albedo color was changed. While the image generated
without our decomposition took much more effort (selections, adjustments with curves, and feathered image areas), the result is still inferior.
For the statue example, a different decomposition splitting the original image into light directions was used. The light coming from the left
was changed to become more blue, while light coming from the right was changed to become more red. A similar effect is hard to achieve in
Photoshop even after one order of magnitude more effort. (Please try the edits yourself using the supplementary PSD files.)
Abstract
Photographers routinely compose multiple manipulated photos of the same scene into a single image, producing a fidelity
difficult to achieve using any individual photo. Alternatively, 3D artists set up rendering systems to produce layered images to
isolate individual aspects of the light transport, which are composed into the final result in post-production. Regrettably, these
approaches either take considerable time and effort to capture, or remain limited to synthetic scenes. In this paper, we suggest a
method to decompose a single image into multiple layers that approximates effects such as shadow, diffuse illumination, albedo,
and specular shading. To this end, we extend the idea of intrinsic images along two axes: first, by complementing shading
and reflectance with specularity and occlusion, and second, by introducing directional dependence. We do so by training a
convolutional neural network (CNN) with synthetic data. Such decompositions can then be manipulated in any off-the-shelf image
manipulation software and composited back. We demonstrate the effectiveness of our decomposition on synthetic (i. e., rendered)
and real data (i. e., photographs), and use them for photo manipulations, which are otherwise impossible to perform based on
single images. We provide comparisons with state-of-the-art methods and also evaluate the quality of our decompositions via a
user study measuring the effectiveness of the resultant photo retouching setup. Supplementary material and code are available
for research use at geometry.cs.ucl.ac.uk/projects/2017/layered-retouching.
1. Introduction
Professional photographers regularly compose multiple photos of
the same scene into one image, giving themselves more flexibility
and artistic freedom than achievable by capturing a single photo.
They do so by ‘decomposing’ the scene into individual layers, e. g., by changing the scene’s physical illumination, manipulating the individual layers (typically using software such as Adobe Photoshop), and then composing them into a single image.
A typical manipulation is changing a layer’s transparency (or
‘weight’): if a layer holds illumination from a specific light direction,
this is a direct and easy way to control illumination. Other editing
operations include adjustment of hues, blur, sharpening, etc. These
operations are applied selectively to some layers, leaving the others
unaffected. While the results produced by layered editing could, in
principle, also be produced by editing without layers, the separation
allows artists, and even novice users, to direct their edits to specific
aspects of an image without the need for tediously selecting image
regions based on color or shape, resulting in higher efficacy. The
key to success is to have a plausible and organized separation into
layers available.
Unfortunately, acquiring layers requires either taking multiple photos [CCD03] or using (layered) rendering [Hec90]. The first option is to capture photos in a studio setup, which requires significant effort but produces realistic inputs. The other option is to use layered rendering, which is relatively straightforward and well supported, but the results can be limited in realism.
In this work, we set out to devise a system that combines the strengths of both approaches: the ability to directly work on real
photos, combined with a separation into layers. Starting from a
single photograph, our system produces a decomposition into layers,
which can then be individually manipulated and recombined into
the desired image using off-the-shelf image manipulation software.
Fig. 1 shows two examples: in one, specular highlights and albedo were adjusted on the input car image; in the other, directional light-based manipulations were applied to a single input photograph. (Please refer to the supplementary for recorded edit
sessions and accompanying PSD files.)
While many decompositions are possible, we suggest a specific
layering model that works along two axes: intrinsic features and
direction. This is inspired by how many artists as well as practical
contemporary rendering systems (e. g., in interactive applications
such as computer games) work: first, decomposition into ambient
occlusion, diffuse illumination, albedo, and specular shading and
second, a decomposition into light directions. Both axes are op-
tional but can be seamlessly combined. Note that this model is not
physical. However, it is simple and intuitive for artists and, as we
will show, its inverse model is effectively learnable. To invert this
model, we employ a deep convolutional neural network (CNN) that
is trained using synthetic (rendered) data, for which the ground truth
decomposition of a photo into layers is known. While CNNs have
recently been used for intrinsic decompositions such as reflectance
and shading, we address the novel problem of refined decomposi-
tion into ambient occlusion and specular as well as into directions,
which is critical for the layered image manipulation workflow. Our
contributions are:
1. a workflow, in which a single input photo is automatically decomposed into layered components that are suited for post-capture appearance manipulation within standard image editing software;
2. two plausible appearance decompositions, particularly suited for plausible appearance editing: i) advanced intrinsics, including specular and ambient occlusion, and ii) direction; and
3. a flexible, CNN-based approach to obtain a given type of decomposition, leveraging the state-of-the-art in deep learning.
We evaluate our approach by demonstrating non-trivial appearance
edits based on our decompositions and a preliminary user study.
We further demonstrate the efficacy of our CNN architecture by
applying it to the well-established intrinsic images problem, where
it compares favourably to the state-of-the-art methods.
2. Previous Work
Combining multiple photos (also referred to as a “stack” [CCD03])
of a scene where one aspect has changed in each layer is routinely
used in computer graphics. For example, NVIDIA IRay actively
supports rendered LPE layers (light path expressions [Hec90]) to
be individually edited to simplify post-processing towards artistic
effects without resorting to solving the inverse rendering problem.
One aspect to change is illumination, such as flash-no-flash photography [ED04] or exposure levels [MKVR09]. More advanced effects involve the direction of light [ALK∗03, RBD06, FAR07], eventually resulting in a more sophisticated user interface [BPB13]. All these approaches require specialized capture of multiple images, obtained by making invasive changes to the scene, which limits their practical use for changing an image post-capture. On-line video and photo communities hold many examples of DIY instructions to set up such studio configurations.
For single images, a more classic approach is to perform intrinsic decomposition into shading (irradiance) and diffuse reflectance (albedo) [BT78, GMLMG12, BBS14], possibly supported by a dedicated UI for images [BPD09, BBPD12], using annotated data [BBS14, ZKE15, ZIKF15], or videos [BST∗14, YGL∗14]. Recently, CNNs have been successfully applied to this task, producing state-of-the-art results [NMY15, SBD15]. For CNNs, a recent idea is to combine estimation of intrinsic properties and depth [SBD15, KPSL16]. We will jointly infer intrinsic properties and normals to allow for a directional illumination decomposition. Also, the relation between intrinsic images and filtering is receiving considerable attention [BHY15, FWHC17]. We also use a data-driven, CNN-based approach to go beyond classic intrinsic image decomposition layers, with a further separation into occlusion and specular components, as well as directions, which are routinely used in layered image editing (see Sec. 4 and supplementary materials).
In other related efforts, researchers have looked into factorizing components, such as specular [TNI04, MZBK06] from single images, or ambient occlusion (AO) from single [YJL∗15] or multiple captures [HWBS13]. We show that our approach can solve this problem at comparable quality, but requires only a single photo and, in combination, yields a further separation of diffuse shading and albedo without requiring a specialized method.
Despite the advances in recovering reflectance (e. g., with two captures and a stationarity assumption [AWL15], or with dedicated UIs [DTPG11]), illumination (e. g., Lalonde et al. [LEN09] estimate sky environment maps and Rematas et al. [RRF∗16] reflectance maps), and CNN-based depth [EPF14] from photographs, no system doing a practical joint decomposition is known. Most relevant to our effort is SIRFS [BM15], which builds data-driven priors for shape, reflectance, and illumination, and uses them in an optimization setup to recover the most likely shape, reflectance, and illumination under these priors (see Sec. 4 for an explicit comparison).
In the context of image manipulations, specialized solutions exist:
Oh et al. [OCDD01] represent a scene as a layered collection of color
and depth to enable distortion-free copying of parts of a photograph,
and allow discounting effect of illumination on uniformly textured
areas using bilateral filtering; Khan et al. [KRFB06] enable automat-
ically replacing one material with another (e. g., increase/decrease
specularity, transparency, etc.) starting from a single high dynamic
range image by exploiting our ‘blindness’ to certain physical in-
accuracies; Carroll et al. [CRA11] achieve consistent manipulation
of inter-reflections; or the system of Karsch et al. [KHFH11] that
combines many of the above towards compelling image synthesis.
Splitting into light path layers is typical in rendering inspired
by the classic light path notation [Hec90]. In this work, different
from Heckbert’s physical E(S|D)∗L formalism, we use a more edit-friendly factorization into ambient occlusion, diffuse light, diffuse albedo, and specular, instead of separating direct and indirect effects.
While all of the above operates on photos, it has been acknowledged that rendering beyond the laws of physics can be useful to achieve different artistic goals [TABI07, VPB∗09, RTD∗10, RLMB∗14, DDTP15, SPN∗15]. Our approach naturally supports this option, allowing users to freely change layers, using any image-level software of their choice, also beyond what is physically correct. For example, the StyLit system proposed by Fišer et al. [FJL∗16] correlates artistic style with light transport expressions, but requires pixels in the image to be labeled with light path information, e. g., by rendering and aligning. Hence, it can take our factorized output to stylize single photographs without being restricted to rendered content.
3. Editable Layers From Single Photographs
Our approach has two main parts: an imaging model that describes
a decomposition of a single photo into layers for individual editing
and a method to perform this decomposition.
Model. The imaging model (Sec. 3.1) is motivated by the requirements of a typical layered workflow (Sec. 3.4): the layers have to be intuitive, they have to be independent, they should only use blend modes available in (linear) off-the-shelf image editing software, and they should be positive and low-dynamic-range (LDR). This motivates a model that can decompose along two axes, intrinsics and directionality. These axes can be combined and use a new directional basis we propose.
Decomposition.
The decomposition has two main steps: (i) produc-
ing training data (Sec. 3.2) and (ii) a convolutional neural network
to decompose single images into editable layers (Sec. 3.3). The
training data (Sec. 3.2) is produced by rendering a large number
of 3D scenes into image tuples, where the first is the composed
image, while the other images are the layers. This step needs to be
performed only once and the training data will be made available
upon publication. The decomposition (Sec. 3.3) is done using a
CNN that consumes a photo and outputs all its layers. This CNN
is trained using the (training) data from the previous step. We se-
lected a convolution-deconvolution architecture that is only to be
trained once, can be executed efficiently on new input images, and
its definition will be made publicly available upon publication.
3.1. Model
We propose an image formation model that can decompose the image along one or two independent axes: intrinsic features or directionality (Fig. 2).
Non-directional model. We model the color $C$ of a pixel as
$C = O_a\,(\rho \cdot E + S)$,  (1)
where $O_a \in [0,1] \subset \mathbb{R}$ denotes the ambient occlusion, which is the fraction of directions in the upper hemisphere that is blocked from the light; the variable $\rho \in [0,1]^3 \subset \mathbb{R}^3$ describes the diffuse albedo, i. e., the intrinsic color of the surface itself; the variable $E \in [0,1]^3 \subset \mathbb{R}^3$ denotes the diffuse illumination (irradiance), i. e., the color of the total light received; and finally, $S \in [0,1]^3 \subset \mathbb{R}^3$ is the specular shading in units of radiance, where we do not separate between the reflectance and the illumination (see Fig. 2, top row).
Figure 2: The components of our two imaging models. The first row is the intrinsic axis, the second row the directional axis, and the third row shows how one directional element can subsequently be also decomposed into its intrinsics.
This model is a generalization of typical intrinsic images [BKPB17], which only model shading and reflectance, to include specular and occlusion. While in principle occlusion acts differently on diffuse and specular components, we follow Kozlowski and Kautz [KK07], who show that jointly attenuating diffuse and specular reflectance by the same occlusion term is a good approximation under natural lighting, by using $O_a$ as a joint multiplier of $\rho \cdot E$ and $S$, thus keeping the user-visible reflectance components to a minimum.
In summary, this decomposition produces four layers from each
input image that can be combined with simple blending operations
in typical image retouching software.
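To make the recombination concrete, the following NumPy sketch composites the four layers back into an image according to Eq. 1; the array names and file-loading calls are illustrative assumptions, not part of our pipeline, and the same operations correspond to ‘multiply’ and ‘add’ blend modes in a layered image editor.

    import numpy as np

    # Illustrative layers in physically linear units (file names are hypothetical):
    # occlusion O_a is H x W x 1, the other layers are H x W x 3, all in [0, 1].
    occlusion  = np.load("occlusion.npy")    # O_a
    albedo     = np.load("albedo.npy")       # rho
    irradiance = np.load("irradiance.npy")   # E
    specular   = np.load("specular.npy")     # S

    # Eq. 1: C = O_a * (rho * E + S); '*' is a per-pixel, per-channel product,
    # i.e., a 'multiply' blend, and '+' an 'add' (linear dodge) blend.
    recombined = occlusion * (albedo * irradiance + specular)

    # Example edit: strengthen the specular layer by 50 %, then recomposite.
    edited = np.clip(occlusion * (albedo * irradiance + 1.5 * specular), 0.0, 1.0)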
Directional model. The directional model is a generalization of the above. We express pixel color as a generalized intrinsic image as before, but with diffuse illumination depending on the surface normal, and specular shading depending on the reflection vector:
$C = O_a \sum_{i=1}^{N} \left( \rho \cdot b_i(n)\, E + b_i(r)\, S \right)$,  (2)
where $b_i : S^2 \to \mathbb{R}^3$, $i \in 1 \ldots N$, are basis functions of an $N$-dimensional lighting basis, parameterized by the surface orientation $n$ and the reflected orientation $r := 2\,\langle v, n \rangle\, n - v$, respectively. All directions are in view space, so assuming a distant viewer the view direction is $v = (0, 0, 1)^{\top}$ by construction.
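As a small worked example of the quantities entering Eq. 2, the sketch below computes the per-pixel view-space reflection vector $r = 2\langle v, n\rangle n - v$ for a distant viewer; the normal map is an assumed input (in our pipeline it is predicted by the network, Sec. 3.3).

    import numpy as np

    def reflect_view(normals, view=np.array([0.0, 0.0, 1.0])):
        """Reflected direction r = 2 <v, n> n - v per pixel (view space).

        normals: H x W x 3 array of unit surface normals.
        view:    constant view direction v of a distant viewer (assumption: +z).
        """
        n_dot_v = np.sum(normals * view, axis=-1, keepdims=True)   # <v, n>
        r = 2.0 * n_dot_v * normals - view
        # Guard against slightly non-unit inputs before reusing r for lookups.
        return r / np.maximum(np.linalg.norm(r, axis=-1, keepdims=True), 1e-8)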
Especially for diffuse lighting, a commonly used lighting basis would be first-order spherical harmonics (SH), i. e., $(b^{\mathrm{SH}}_i) = (Y_0^0, Y_1^{-1}, Y_1^0, Y_1^1)$. That basis is shown to capture diffuse reflectance at high accuracy [RH01b]; however, as we aim for a decomposition amenable to be used in traditional photo processing software, which typically quantizes and clamps any layer calculations to $[0,1]$, the negative lobes of SH would be lost when stored in image layers.
A common positive-only reparameterization would use the six generator functions,
$\tilde{b}^{\mathrm{SH}}_{1/2} = \tfrac{1}{2} \pm \tfrac{1}{2}\, Y_1^{-1}, \quad \tilde{b}^{\mathrm{SH}}_{3/4} = \tfrac{1}{2} \pm \tfrac{1}{2}\, Y_1^{0}, \quad \tilde{b}^{\mathrm{SH}}_{5/6} = \tfrac{1}{2} \pm \tfrac{1}{2}\, Y_1^{1}$,  (3)
with $Y_0^0 = \tilde{b}^{\mathrm{SH}}_1 + \tilde{b}^{\mathrm{SH}}_2$, $Y_1^{-1} = \tilde{b}^{\mathrm{SH}}_2 - \tilde{b}^{\mathrm{SH}}_1$, $Y_1^0 = \tilde{b}^{\mathrm{SH}}_4 - \tilde{b}^{\mathrm{SH}}_3$, and $Y_1^1 = \tilde{b}^{\mathrm{SH}}_6 - \tilde{b}^{\mathrm{SH}}_5$. Initial experiments with that lighting basis, however, showed that the necessary blending calculations between the corresponding editing layers lead to excessive quantization in an 8-bit image processing workflow, and even using Photoshop's 16-bit mode mitigated the problem only partially. Moreover, direct editing of the basis function images turned out unintuitive, because
1. the effect of editing pixels corresponding to negative SH contributions is not easily evident to the user;
2. the strong overlap between basis functions makes it difficult to apply desired edits for individual spatial directions only.
This led us to propose a sparser positive-only basis, where spatial directions are mostly decoupled. After experimentation with various bases, we settled on a normalized variant of $\tilde{b}^{\mathrm{SH}}$ as:
$b_i(\omega) = \frac{\tilde{b}^{\mathrm{SH}}_i(\omega)^p}{\sum_{j=1}^{6} \tilde{b}^{\mathrm{SH}}_j(\omega)^p} = \frac{(\langle \omega, c_i \rangle + 1)^p}{\sum_{j=1}^{6} (\langle \omega, c_j \rangle + 1)^p}$,  (4)
using the dot-product-based formulation of 1st-order SH, denoting with $c_i$ the six main spatial directions; the normalization term ensures partition of unity, i. e., $\sum_i^N b_i(\omega) = 1$. Empirically, we found $p = 5$ to offer the best compromise between separation of illumination functions and smoothness. A polar surface plot of the six overlapping basis functions is shown in Figure 3.
Figure 3: Directional bases. Left: $\tilde{b}^{\mathrm{SH}}_i$, a positive-only reparameterization of the 1st-order SH basis, exhibits strong overlap between neighboring lobes (drawn as opaque surface plots), and with it strong cross-talk of edits of the associated editing layers. Right: our $b_i$ (Equation (4)) remains smooth while lobes are separated much more strongly. Note that $\tilde{b}^{\mathrm{SH}}_i$ has been uniformly rescaled to be a partition of unity; the difference in amplitude (see axis labels) further documents the sparser energy distribution in our basis.
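Equation (4) is cheap to evaluate; the sketch below does so for a single direction, taking the six main spatial directions $c_i$ to be the positive and negative coordinate axes (our reading of "main spatial directions") and checking the partition-of-unity property.

    import numpy as np

    # Assumed choice of the six main spatial directions c_i: the +/- axes.
    C_DIRS = np.array([[ 1, 0, 0], [-1, 0, 0],
                       [ 0, 1, 0], [ 0, -1, 0],
                       [ 0, 0, 1], [ 0, 0, -1]], dtype=np.float64)

    def directional_basis(omega, p=5.0):
        """Evaluate b_i(omega) of Eq. 4 for one unit direction omega, shape (3,).

        Returns six non-negative weights; p = 5 as found empirically above."""
        lobes = (C_DIRS @ omega + 1.0) ** p       # (<omega, c_i> + 1)^p
        return lobes / lobes.sum()                # normalization of Eq. 4

    # Partition of unity: the weights sum to one for any direction.
    omega = np.array([0.3, -0.5, 0.81])
    omega /= np.linalg.norm(omega)
    assert abs(directional_basis(omega).sum() - 1.0) < 1e-12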
Figure 4: Samples from our set of synthetic training data (columns: Image, Occlusion, Albedo, Irradiance, Specular).
Using this basis, Equation (2) produces 14 layers from an input image, where twelve are directionally-dependent and two are not ($\rho$ and $O_a$), that can be combined using any compositing software. As shown in the second and third rows of Fig. 2, the 14 output layers can be either collapsed onto 6 directional layers or kept as a combination of both intrinsic and directional decomposition.
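A vectorized sketch of this directional layering is given below; it assembles the six directional layers of Eq. 2 from the non-directional layers and a normal map (all assumed to be given, e. g., by the rendering step of Sec. 3.2 or by the network of Sec. 3.3), and their sum reproduces the composite $C$ because the basis is a partition of unity.

    import numpy as np

    C_DIRS = np.array([[ 1, 0, 0], [-1, 0, 0],
                       [ 0, 1, 0], [ 0, -1, 0],
                       [ 0, 0, 1], [ 0, 0, -1]], dtype=np.float64)

    def basis_weights(dirs, p=5.0):
        """b_i of Eq. 4, vectorized: dirs is H x W x 3, result is H x W x 6."""
        lobes = (np.einsum('hwc,ic->hwi', dirs, C_DIRS) + 1.0) ** p
        return lobes / lobes.sum(axis=-1, keepdims=True)

    def directional_layers(occlusion, albedo, irradiance, specular, normals):
        """Six directional layers whose sum gives back C (Eq. 2).

        occlusion: H x W x 1; albedo, irradiance, specular: H x W x 3;
        normals: H x W x 3 unit view-space normals. All inputs assumed given."""
        view = np.array([0.0, 0.0, 1.0])                    # distant viewer
        r = 2.0 * np.sum(normals * view, axis=-1, keepdims=True) * normals - view
        b_n, b_r = basis_weights(normals), basis_weights(r)
        return [occlusion * (albedo * b_n[..., i:i+1] * irradiance
                             + b_r[..., i:i+1] * specular)
                for i in range(6)]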
3.2. Training Data
There are many values of $O_a$, $E$, $\rho$, and $S$ that explain an observed color $C$, so the decomposition is not unique. In the same way, many normals are possible for a pixel. Inverting this mapping from a single observation is likely to be impossible. At the same time, humans have an intuition of how to infer reflectance on familiar objects [KVDCL96]. One explanation can be that they rely on a context $x$, on the spatial statistics of multiple observations $C(x)$, such that a decomposition into layers becomes possible. In other words, simply not all arrangements of decompositions are equally likely. As described next, we employ a CNN to similarly learn such a decomposition. The training data comprises synthetic images that show a random shape, with partially random reflectance, shaded by random environment map illumination.
Shape.
Surface geometry consists of about 2,000 random instances
from ShapeNet [C∗15] coming from the top-level classes, selected from ShapeNetCore semi-automatically. Specifically, ShapeNetCore has 48 top-level classes, of which we use 27. We discarded
classes that had either very few models or that were considered
uncommon (e. g., birdhouse). We then randomly sampled a tenth of
the total models from each class resulting in 1,930 models. These
models were also manually filtered to be free of meshing artifacts.
Shapes were rendered under random orientation while maintaining
the up direction intrinsic to each model.
Reflectance. Reflectance using the physically-corrected Phong model [LW94] was sampled as follows: the diffuse colors come directly from the ShapeNet models. The specular component $k_s$ is assumed to be a single color. A random decision is made whether the material is assumed to be electric or dielectric. If it is electric, we choose the specular color to be the average color of the diffuse texture. Otherwise, we choose it to be a uniform random grey value. Glossiness is set as $n = 3.0 \cdot 10^{\xi}$, where $\xi \sim U[0,1]$.
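A sketch of this material sampling follows; the 50/50 split between the electric and dielectric cases is an assumption, as the text only states that a random decision is made.

    import numpy as np

    def sample_phong_material(diffuse_texture, rng=np.random.default_rng()):
        """Sample the specular color k_s and glossiness n as described above.

        diffuse_texture: H x W x 3 diffuse texture of the ShapeNet model."""
        if rng.random() < 0.5:                  # 'electric' (assumed 50/50 split)
            k_s = diffuse_texture.reshape(-1, 3).mean(axis=0)   # mean texture color
        else:                                   # 'dielectric'
            k_s = np.full(3, rng.random())      # uniform random grey value
        xi = rng.random()                       # xi ~ U[0, 1]
        n = 3.0 * 10.0 ** xi                    # glossiness n = 3.0 * 10^xi
        return k_s, n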
Illumination.
Illumination is sampled from a set of 90 high-
dynamic-range (HDR) environment maps in resolution
512×256
that have an uncalibrated absolute range of values but are represen-
tative for typical lighting settings: indoor, outdoor, as well as studio
lights. Illumination was randomly oriented around the vertical axis.
Rendering.
After fixing shape, material, and illumination, we synthesize a single image from a random view at a random angle around the vertical axis. To produce $C$, we compute four individual components that can be composed into Eq. 1, or further into directions according to Eq. 2, as per-pixel normals are known at render time. Due to the large amount of training data required, we use efficient, GPU-based rendering algorithms. The occlusion term $O_a$ is computed using screen-space occlusion [RGS09]. The diffuse shading $E$ is computed using pre-convolved irradiance environment maps [RH01a]. Similarly, specular shading is the product of the specular color $k_s$, selected according to the above protocol, and a pre-convolved illumination map for gloss level $n$. No indirect illumination or local interactions are rendered.
While this image synthesis is far from being physically accurate,
it can be produced easily, systematically and for a very large number
of images, making it suitable for learning the layer statistics. Overall
we produce 300,000 unique samples in a resolution of
256×256
(ca. 14 GB) in eight hours on a current PC with a higher-end GPU.
A fraction of the images, totalling about 30,000, was withheld to check for convergence (and detect over-fitting). We also used dropout to prevent over-fitting.
Units. Care has to be taken regarding the color space in which training data is processed and learned. As the illumination is HDR, the resulting image is an HDR rendering. However, as our input images will be LDR at deployment time, the HDR images need to be tone-mapped to match their range. To this end, automatic exposure control is used to map those values into the LDR range, by selecting the 0.95 luminance percentile of a random subset of the pixels and scaling all values such that this value maps to 1. The rendered result $C$ is stored after gamma-correction. All other components are stored in physically linear units ($\gamma = 1.0$) and are processed in physically linear units by the CNN and the end application using the layers. Applying the final gamma-correction is consequently up to the application using the layers later on (as shown in our edit examples).
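The exposure and storage conventions above can be summarized by the sketch below; the Rec. 709 luminance weights, the sample size, and the gamma of 2.2 are our assumptions, since the text only specifies the 0.95 percentile and "gamma-correction".

    import numpy as np

    def auto_expose(hdr, percentile=95.0, samples=4096,
                    rng=np.random.default_rng()):
        """Scale an HDR rendering so its 0.95 luminance percentile maps to 1."""
        pixels = hdr.reshape(-1, 3)
        idx = rng.choice(len(pixels), size=min(samples, len(pixels)), replace=False)
        lum = pixels[idx] @ np.array([0.2126, 0.7152, 0.0722])   # Rec. 709 (assumed)
        scale = 1.0 / max(np.percentile(lum, percentile), 1e-8)
        return np.clip(hdr * scale, 0.0, 1.0)

    def store_composite(hdr_c, gamma=2.2):
        """C is stored gamma-corrected (LDR, photo-like); all other layers
        stay in physically linear units (gamma = 1.0)."""
        return auto_expose(hdr_c) ** (1.0 / gamma)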
3.3. Learning a Decomposition
We perform the decomposition using a CNN [LBBH98, KSH12] trained on the data produced as described above. Input to the network is a single image such as a photograph. The output for the non-directional variant is five images (occlusion, diffuse illumination, albedo, specular shading, and normals), where occlusion is scalar and the others are three-vector-valued. Note that normals and intrinsic properties are estimated jointly, as done before for albedo and depth [SBD15, KPSL16]. The normals are not presented to the user, but only used to perform the directional decomposition.
We have also experimented with letting the CNN directly compute the directional decomposition, but found that an explicit decomposition using the normal and reflected direction is easier to train and produces better results.
This design follows the convolution-deconvolution idea with cross-links, resulting in an encoder-decoder scheme [RFB15]. The network is fully convolutional. We start at a resolution of 256×256 that is reduced down to 2×2 through stride-two convolutions. We then perform two stride-one convolutions and increase the number of feature layers in accordance with the required number of output layers (i. e., quadruple for the layers, while the whole step is skipped for normal estimation). The deconvolution part of the network consists of blocks performing a resize-convolution (upsampling followed by a stride-one convolution), cross-linking, and a stride-one convolution. Every convolution in the network is followed by a ReLU [NH10] non-linearity, except for the last layer, for which a sigmoid non-linearity is used instead. This is done to normalize the output to the range [0,1]. Images with an uneven aspect ratio are appropriately cropped and/or padded with white pixels to be square. All receptive fields are 3×3 pixels in size, except for the first and last two layers, which are 5×5. No filter weights are shared between layers. Overall, this network has about 8.5 M trainable parameters.
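For illustration, a schematic TensorFlow/Keras version of such an encoder-decoder with cross-links is sketched below; it mirrors the description above (stride-two downsampling from 256×256 to 2×2, two stride-one convolutions at the bottleneck, resize-convolutions with cross-links on the way up, ReLUs and a final sigmoid), but the filter counts and the 13 output channels (1 occlusion + 3 albedo + 3 irradiance + 3 specular + 3 normals) are placeholders rather than the exact architecture from our supplementary material.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_decomposer(out_channels=13, base=32):
        """Encoder-decoder with cross-links; filter counts are illustrative."""
        inp = layers.Input((256, 256, 3))
        x, skips = inp, []
        for i in range(7):                          # 256x256 down to 2x2
            skips.append(x)                         # cross-link source
            k = 5 if i == 0 else 3                  # first layer uses 5x5 kernels
            x = layers.Conv2D(min(base * 2 ** i, 512), k, strides=2,
                              padding='same', activation='relu')(x)
        x = layers.Conv2D(512, 3, padding='same', activation='relu')(x)
        x = layers.Conv2D(512, 3, padding='same', activation='relu')(x)
        for skip in reversed(skips):                # resize-conv + cross-link
            x = layers.UpSampling2D()(x)            # upsample, then stride-one conv
            x = layers.Conv2D(max(skip.shape[-1], base), 3, padding='same',
                              activation='relu')(x)
            x = layers.Concatenate()([x, skip])     # cross-link
            x = layers.Conv2D(max(skip.shape[-1], base), 3, padding='same',
                              activation='relu')(x)
        out = layers.Conv2D(out_channels, 5, padding='same',
                            activation='sigmoid')(x)   # outputs in [0, 1]
        return tf.keras.Model(inp, out)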
For the loss function, we combine a per-layer L2 loss with a novel three-fold recombination loss that encourages the network to produce combinations that result in the input image and fulfill the following requirements: (i) the layers have to produce the input, so $C = O_a (E \cdot \rho + S)$; (ii) the components should explain the image without AO, i. e., $C / O_a = E\rho + S$; and (iii) diffuse reflected light should explain the image without AO and specular, so $C / O_a - S = E\rho$. Note that if the network were able to always perform a perfect decomposition, a single L2 loss alone would be sufficient. As it makes errors in practice, the additional loss expressions bias those errors to at least happen in such a way that the combined result does not deviate as much from the input. All losses are in the same RGB-difference range and are weighted equally.
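Under the output layout assumed in the architecture sketch above, the three recombination terms could be written as follows (the epsilon guard against division by zero and the channel ordering are our assumptions); the per-layer L2 loss against the ground-truth layers is added separately.

    import tensorflow as tf

    def recombination_loss(pred, image, eps=1e-4):
        """Recombination terms (i)-(iii); pred is B x H x W x 13, image is the
        input photo C in linear units (B x H x W x 3). Channel order assumed."""
        occ  = pred[..., 0:1]     # O_a
        rho  = pred[..., 1:4]     # albedo
        irr  = pred[..., 4:7]     # diffuse illumination E
        spec = pred[..., 7:10]    # specular S
        c_hat  = occ * (irr * rho + spec)          # (i)  C = O_a (E rho + S)
        c_noao = image / tf.maximum(occ, eps)      # image with AO divided out
        l_i   = tf.reduce_mean(tf.square(c_hat - image))
        l_ii  = tf.reduce_mean(tf.square(irr * rho + spec - c_noao))     # (ii)
        l_iii = tf.reduce_mean(tf.square(irr * rho - (c_noao - spec)))   # (iii)
        return l_i + l_ii + l_iii                  # equally weighted, as above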
In Tbl. 1, we numerically evaluate the recombination error (i.e.,
the differences between the original and recombined images) by
progressively adding each of the three additional losses to a standard
L2 loss. While a positive trend can be observed with the DSSIM
metric, these benefits are not as evident on the NRMSE metric.
Overall, the network is a rather standard modern design, but trained to solve a novel task (layers) on a novel kind of training data (synthesized, directionally-dependent information). We used TensorFlow [A∗15] as our implementation platform, and each model requires only several hours on an NVIDIA Titan X GPU with 12 GB on-board RAM to train (both models were trained for 12 hours). We used stochastic gradient descent to solve for the network, which we ran for 6 epochs with batches of size 16. A more detailed description of the network’s architecture can be found in the supplementary materials.
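For completeness, a minimal training loop under those settings might look as follows; the dataset pipeline, the learning rate, and the reuse of the sketches above are assumptions.

    import tensorflow as tf

    # Hypothetical tf.data pipeline yielding (photo, ground-truth layers) batches.
    train_ds = tf.data.Dataset.load("layered_training_set").shuffle(4096).batch(16)

    model = build_decomposer()                          # architecture sketch above
    opt = tf.keras.optimizers.SGD(learning_rate=0.01)   # plain SGD, assumed rate

    for epoch in range(6):                              # 6 epochs, batches of size 16
        for photo, gt_layers in train_ds:
            with tf.GradientTape() as tape:
                pred = model(photo, training=True)
                loss = tf.reduce_mean(tf.square(pred - gt_layers))  # per-layer L2
                loss += recombination_loss(pred, photo)             # see above
            grads = tape.gradient(loss, model.trainable_variables)
            opt.apply_gradients(zip(grads, model.trainable_variables))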
Table 1: Comparing different steps of our recombination loss (rows) in terms of two metrics (columns): NRMSE and DSSIM on our validation set.

Loss                      NRMSE             DSSIM
L2                        0.2549 ± 0.0807   0.0234 ± 0.0129
L2 + (iii)                0.2598 ± 0.0833   0.0229 ± 0.0126
L2 + (iii) + (ii)         0.2588 ± 0.0799   0.0229 ± 0.0122
L2 + (iii) + (ii) + (i)   0.2460 ± 0.0787   0.0210 ± 0.0119