Eurographics Symposium on Rendering 2017
P. Sander and M. Zwicker
(Guest Editors)
Volume 36 (2017), Number 4
Decomposing Single Images for Layered Photo Retouching
Carlo Innamorati Tobias Ritschel Tim Weyrich Niloy J. Mitra
University College London
[Figure 1 image panels. Labels: “Without our method” / “With our method”; editing times: 30 sec / 2 min (car example) and 30 sec / 1 min (statue example).]
Figure 1:
Appearance manipulation of a single photograph (top images) when using off-the-shelf software like Photoshop directly (left arrow)
and when using the same in combination with our new layering (right arrow). For the car example, the image was decomposed into layers
(albedo, irradiance, specular, and ambient occlusion), which were then manipulated individually: specular highlights were strengthened and
blurred; irradiance and ambient occlusion were darkened and given added contrast; the albedo color was changed. While the image generated
without our decomposition took much more effort (selections, adjustments with curves, and feathered image areas), the result is still inferior.
For the statue example, a different decomposition splitting the original image into light directions was used. The light coming from the left
was changed to become more blue, while light coming from the right was changed to become more red. A similar effect is hard to achieve in
Photoshop even after one order of magnitude more effort. (Please try the edits yourself using the supplementary PSD files.)
Abstract
Photographers routinely compose multiple manipulated photos of the same scene into a single image, producing a fidelity
difficult to achieve using any individual photo. Alternatively, 3D artists set up rendering systems to produce layered images to
isolate individual aspects of the light transport, which are composed into the final result in post-production. Regrettably, these
approaches either take considerable time and effort to capture, or remain limited to synthetic scenes. In this paper, we suggest a
method to decompose a single image into multiple layers that approximates effects such as shadow, diffuse illumination, albedo,
and specular shading. To this end, we extend the idea of intrinsic images along two axes: first, by complementing shading
and reflectance with specularity and occlusion, and second, by introducing directional dependence. We do so by training a
convolutional neural network (CNN) with synthetic data. Such decompositions can then be manipulated in any off-the-shelf image
manipulation software and composited back. We demonstrate the effectiveness of our decomposition on synthetic (i. e., rendered)
and real data (i. e., photographs), and use them for photo manipulations, which are otherwise impossible to perform based on
single images. We provide comparisons with state-of-the-art methods and also evaluate the quality of our decompositions via a
user study measuring the effectiveness of the resultant photo retouching setup. Supplementary material and code are available
for research use at geometry.cs.ucl.ac.uk/projects/2017/layered-retouching.
1. Introduction
Professional photographers regularly compose multiple photos of
the same scene into one image, giving themselves more flexibility
and artistic freedom than achievable by capturing a single photo.
They do so by ‘decomposing’ the scene into individual layers, e. g., by changing the scene’s physical illumination, manipulating the individual layers (typically using software such as Adobe Photoshop), and then composing them into a single image.
A typical manipulation is changing a layer’s transparency (or
‘weight’): if a layer holds illumination from a specific light direction,
this is a direct and easy way to control illumination. Other editing
operations include adjustment of hues, blur, sharpening, etc. These
operations are applied selectively to some layers, leaving the others
unaffected. While the results produced by layered editing could, in
principle, also be produced by editing without layers, the separation
allows artists, and even novice users, to direct their edits to specific
aspects of an image without the need for tediously selecting image
regions based on color or shape, resulting in higher efficacy. The
key to success is to have a plausible and organized separation into
layers available.
Unfortunately, acquiring layers requires either taking multiple photos [CCD03] or using (layered) rendering [Hec90]. The first option is to capture photos in a studio setup, which requires significant effort but produces realistic inputs. The other option is to use layered rendering, which is relatively straightforward and well supported, but the results can be limited in realism.
In this work, we set out to devise a system that combines the strengths of both approaches: the ability to directly work on real
photos, combined with a separation into layers. Starting from a
single photograph, our system produces a decomposition into layers,
which can then be individually manipulated and recombined into
the desired image using off-the-shelf image manipulation software.
Fig. 1 shows two examples: in one, specular highlights and albedo were adjusted on the input car image; in the other, directional light-based manipulations were applied to a single input photograph. (Please refer to the supplementary for recorded edit
sessions and accompanying PSD files.)
While many decompositions are possible, we suggest a specific
layering model that works along two axes: intrinsic features and
direction. This is inspired by how many artists as well as practical
contemporary rendering systems (e. g., in interactive applications
such as computer games) work: first, decomposition into ambient
occlusion, diffuse illumination, albedo, and specular shading and
second, a decomposition into light directions. Both axes are op-
tional but can be seamlessly combined. Note that this model is not
physical. However, it is simple and intuitive for artists and, as we
will show, its inverse model is effectively learnable. To invert this
model, we employ a deep convolutional neural network (CNN) that
is trained using synthetic (rendered) data, for which the ground truth
decomposition of a photo into layers is known. While CNNs have
recently been used for intrinsic decompositions such as reflectance
and shading, we address the novel problem of refined decomposi-
tion into ambient occlusion and specular as well as into directions,
which is critical for the layered image manipulation workflow. Our
contributions are:
1. a workflow, in which a single input photo is automatically decomposed into layered components that are suited for post-capture appearance manipulation within standard image editing software;
2. two plausible appearance decompositions, particularly suited for plausible appearance editing: i) advanced intrinsics, including specular and ambient occlusion, and ii) direction; and
3. a flexible, CNN-based approach to obtain a given type of decomposition, leveraging the state-of-the-art in deep learning.
We evaluate our approach by demonstrating non-trivial appearance
edits based on our decompositions and a preliminary user study.
We further demonstrate the efficacy of our CNN architecture by
applying it to the well-established intrinsic images problem, where
it compares favourably to the state-of-the-art methods.
2. Previous Work
Combining multiple photos (also referred to as a “stack” [CCD03])
of a scene where one aspect has changed in each layer is routinely
used in computer graphics. For example, NVIDIA IRay actively
supports rendered LPE layers (light path expressions [Hec90]) to
be individually edited to simplify post-processing towards artistic
effects without resorting to solving the inverse rendering problem.
One aspect to change is illumination, such as flash-no-flash photography [ED04] or exposure levels [MKVR09]. More advanced effects involve the direction of light [ALK∗03, RBD06, FAR07], eventually resulting in a more sophisticated user interface [BPB13]. All these approaches require specialized capture of multiple images, obtained by making invasive changes to the scene, which limits their practical use for changing an image post-capture. On-line video and photo communities hold many examples of DIY instructions to set up such studio configurations.
For single images, a more classic approach is to perform intrinsic decomposition into shading (irradiance) and diffuse reflectance (albedo) [BT78, GMLMG12, BBS14], possibly supported by a dedicated UI for images [BPD09, BBPD12], using annotated data [BBS14, ZKE15, ZIKF15], or videos [BST∗14, YGL∗14]. Recently, CNNs have been successfully applied to this task, producing state-of-the-art results [NMY15, SBD15]. For CNNs, a recent idea is to combine estimation of intrinsic properties and depth [SBD15, KPSL16]. We will jointly infer intrinsic properties and normals to allow for a directional illumination decomposition. Also, the relation between intrinsic images and filtering is receiving considerable attention [BHY15, FWHC17]. We also use a data-driven, CNN-based approach to go beyond classic intrinsic image decomposition layers, with a further separation into occlusion and specular components, as well as directions, which are routinely used in layered image editing (see Sec. 4 and supplementary materials).
In other related efforts, researchers have looked into factorizing components, such as specular [TNI04, MZBK06] from single images, or ambient occlusion (AO) from single [YJL∗15] or multiple captures [HWBS13]. We show that our approach can solve this problem at comparable quality, but requires only a single photo and, in combination, yields a further separation of diffuse shading and albedo without requiring a specialized method.
Despite the advances in recovering reflectance (e. g., with two captures and a stationarity assumption [AWL15], or with dedicated UIs [DTPG11]), illumination (e. g., Lalonde et al. [LEN09] estimate sky environment maps and Rematas et al. [RRF∗16] reflectance maps), and CNN-based depth [EPF14] from photographs, no system doing a practical joint decomposition is known. Most relevant to our effort is SIRFS [BM15], which builds data-driven priors for shape, reflectance, and illumination, and uses them in an optimization setup to recover the most likely shape, reflectance, and illumination under these priors (see Sec. 4 for an explicit comparison).
In the context of image manipulations, specialized solutions exist:
Oh et al. [OCDD01] represent a scene as a layered collection of color
and depth to enable distortion-free copying of parts of a photograph,
and allow discounting effect of illumination on uniformly textured
areas using bilateral filtering; Khan et al. [KRFB06] enable automat-
ically replacing one material with another (e. g., increase/decrease
specularity, transparency, etc.) starting from a single high dynamic
range image by exploiting our ‘blindness’ to certain physical in-
accuracies; Carroll et al. [CRA11] achieve consistent manipulation
of inter-reflections; or the system of Karsch et al. [KHFH11] that
combines many of the above towards compelling image synthesis.
Splitting into light path layers is typical in rendering inspired
by the classic light path notation [Hec90]. In this work, different
from Heckbert’s physical E(S|D)∗L formalism, we use a more edit-friendly factorization into ambient occlusion, diffuse light, diffuse albedo, and specular, instead of separating direct and indirect effects.
While all of the above operates on photos, it has been acknowledged that rendering beyond the laws of physics can be useful to achieve different artistic goals [TABI07, VPB∗09, RTD∗10, RLMB∗14, DDTP15, SPN∗15]. Our approach naturally supports this option, allowing users to freely change layers, using any image-level software of their choice, also beyond what is physically correct. For example, the StyLit system proposed by Fišer et al. [FJL∗16] correlates artistic style with light transport expressions, but requires pixels in the image to be labeled with light path information, e. g., by rendering and aligning. Hence, it can take our factorized output to stylize single photographs without being restricted to rendered content.
3. Editable Layers From Single Photographs
Our approach has two main parts: an imaging model that describes
a decomposition of a single photo into layers for individual editing
and a method to perform this decomposition.
Model. The imaging model (Sec. 3.1) is motivated by the requirements of a typical layered workflow (Sec. 3.4): the layers have to be intuitive, they have to be independent, they should only use blend modes available in (linear) off-the-shelf image editing software, and they should be positive and low-dynamic-range (LDR). This motivates a model that can decompose along two axes, intrinsics and directionality. These axes can be combined and use a new directional basis we propose.
Decomposition.
The decomposition has two main steps: (i) produc-
ing training data (Sec. 3.2) and (ii) a convolutional neural network
to decompose single images into editable layers (Sec. 3.3). The
training data (Sec. 3.2) is produced by rendering a large number
of 3D scenes into image tuples, where the first is the composed
image, while the other images are the layers. This step needs to be
performed only once and the training data will be made available
upon publication. The decomposition (Sec. 3.3) is done using a
CNN that consumes a photo and outputs all its layers. This CNN
is trained using the (training) data from the previous step. We se-
lected a convolution-deconvolution architecture that is only to be
trained once, can be executed efficiently on new input images, and
its definition will be made publicly available upon publication.
3.1. Model
We propose an image formation model that can decompose the image along one or two independent axes: intrinsic features or directionality (Fig. 2).
Non-directional model. We model the color $C$ of a pixel as
$C = O_a\,(\rho \cdot E + S)$,  (1)
where $O_a \in [0,1] \subset \mathbb{R}$ denotes the ambient occlusion, which is the fraction of directions in the upper hemisphere that is blocked from the light; the variable $\rho \in [0,1]^3 \subset \mathbb{R}^3$ describes the diffuse albedo, i. e., the intrinsic color of the surface itself; the variable $E \in [0,1]^3 \subset \mathbb{R}^3$ denotes the diffuse illumination (irradiance), i. e., the color of the total light received; and finally, $S \in [0,1]^3 \subset \mathbb{R}^3$ is the specular shading in units of radiance, where we do not separate between the reflectance and the illumination (see Fig. 2, top row).
Figure 2: The components of our two imaging models. The first row is the intrinsic axis, the second row the directional axis, and the third row shows how one directional element can subsequently be also decomposed into its intrinsics.
This model is a generalization of typical intrinsic images [BKPB17], which only model shading and reflectance, to include specular and occlusion. While in principle occlusion acts differently on diffuse and specular components, we follow Kozlowski and Kautz [KK07], who show that jointly attenuating diffuse and specular reflectance by the same occlusion term is a good approximation under natural lighting, by using $O_a$ as a joint multiplier of $\rho \cdot E$ and $S$, thus keeping the user-visible reflectance components to a minimum.
In summary, this decomposition produces four layers from each
input image that can be combined with simple blending operations
in typical image retouching software.
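To make the recombination concrete, the following NumPy sketch composites the four layers back into an image according to Eq. 1; the array names and file-loading calls are illustrative assumptions, not part of our pipeline, and the same operations correspond to ‘multiply’ and ‘add’ blend modes in a layered image editor.

    import numpy as np

    # Illustrative layers in physically linear units (file names are hypothetical):
    # occlusion O_a is H x W x 1, the other layers are H x W x 3, all in [0, 1].
    occlusion  = np.load("occlusion.npy")    # O_a
    albedo     = np.load("albedo.npy")       # rho
    irradiance = np.load("irradiance.npy")   # E
    specular   = np.load("specular.npy")     # S

    # Eq. 1: C = O_a * (rho * E + S); '*' is a per-pixel, per-channel product,
    # i.e., a 'multiply' blend, and '+' an 'add' (linear dodge) blend.
    recombined = occlusion * (albedo * irradiance + specular)

    # Example edit: strengthen the specular layer by 50 %, then recomposite.
    edited = np.clip(occlusion * (albedo * irradiance + 1.5 * specular), 0.0, 1.0)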
Directional model. The directional model is a generalization of the above. We express pixel color as a generalized intrinsic image as before, but with diffuse illumination depending on the surface normal, and specular shading depending on the reflection vector:
$C = O_a \sum_{i=1}^{N} \left( \rho \cdot b_i(n)\, E + b_i(r)\, S \right)$,  (2)
where $b_i : S^2 \to \mathbb{R}^3$, $i \in 1 \ldots N$, are basis functions of an $N$-dimensional lighting basis, parameterized by the surface orientation $n$ and the reflected orientation $r := 2\,\langle v, n \rangle\, n - v$, respectively. All directions are in view space, so assuming a distant viewer the view direction is $v = (0, 0, 1)^{\top}$ by construction.
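As a small worked example of the quantities entering Eq. 2, the sketch below computes the per-pixel view-space reflection vector $r = 2\langle v, n\rangle n - v$ for a distant viewer; the normal map is an assumed input (in our pipeline it is predicted by the network, Sec. 3.3).

    import numpy as np

    def reflect_view(normals, view=np.array([0.0, 0.0, 1.0])):
        """Reflected direction r = 2 <v, n> n - v per pixel (view space).

        normals: H x W x 3 array of unit surface normals.
        view:    constant view direction v of a distant viewer (assumption: +z).
        """
        n_dot_v = np.sum(normals * view, axis=-1, keepdims=True)   # <v, n>
        r = 2.0 * n_dot_v * normals - view
        # Guard against slightly non-unit inputs before reusing r for lookups.
        return r / np.maximum(np.linalg.norm(r, axis=-1, keepdims=True), 1e-8)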
Especially for diffuse lighting, a commonly used lighting basis would be first-order spherical harmonics (SH), i. e., $(b^{\mathrm{SH}}_i) = (Y_0^0, Y_1^{-1}, Y_1^0, Y_1^1)$. That basis is shown to capture diffuse reflectance at high accuracy [RH01b]; however, as we aim for a decomposition amenable to be used in traditional photo processing software, which typically quantizes and clamps any layer calculations to $[0,1]$, the negative lobes of SH would be lost when stored in image layers.
A common positive-only reparameterization would use the six generator functions,
$\tilde{b}^{\mathrm{SH}}_{1/2} = \tfrac{1}{2} \pm \tfrac{1}{2}\, Y_1^{-1}, \quad \tilde{b}^{\mathrm{SH}}_{3/4} = \tfrac{1}{2} \pm \tfrac{1}{2}\, Y_1^{0}, \quad \tilde{b}^{\mathrm{SH}}_{5/6} = \tfrac{1}{2} \pm \tfrac{1}{2}\, Y_1^{1}$,  (3)
with $Y_0^0 = \tilde{b}^{\mathrm{SH}}_1 + \tilde{b}^{\mathrm{SH}}_2$, $Y_1^{-1} = \tilde{b}^{\mathrm{SH}}_2 - \tilde{b}^{\mathrm{SH}}_1$, $Y_1^0 = \tilde{b}^{\mathrm{SH}}_4 - \tilde{b}^{\mathrm{SH}}_3$, and $Y_1^1 = \tilde{b}^{\mathrm{SH}}_6 - \tilde{b}^{\mathrm{SH}}_5$. Initial experiments with that lighting basis, however, showed that the necessary blending calculations between the corresponding editing layers lead to excessive quantization in an 8-bit image processing workflow, and even using Photoshop's 16-bit mode mitigated the problem only partially. Moreover, direct editing of the basis function images turned out unintuitive, because
1. the effect of editing pixels corresponding to negative SH contributions is not easily evident to the user;
2. the strong overlap between basis functions makes it difficult to apply desired edits for individual spatial directions only.
This led us to propose a sparser positive-only basis, where spatial directions are mostly decoupled. After experimentation with various bases, we settled on a normalized variant of $\tilde{b}^{\mathrm{SH}}$ as:
$b_i(\omega) = \frac{\tilde{b}^{\mathrm{SH}}_i(\omega)^p}{\sum_{j=1}^{6} \tilde{b}^{\mathrm{SH}}_j(\omega)^p} = \frac{(\langle \omega, c_i \rangle + 1)^p}{\sum_{j=1}^{6} (\langle \omega, c_j \rangle + 1)^p}$,  (4)
using the dot-product-based formulation of 1st-order SH, denoting with $c_i$ the six main spatial directions; the normalization term ensures partition of unity, i. e., $\sum_i^N b_i(\omega) = 1$. Empirically, we found $p = 5$ to offer the best compromise between separation of illumination functions and smoothness. A polar surface plot of the six overlapping basis functions is shown in Figure 3.
Figure 3: Directional bases. Left: $\tilde{b}^{\mathrm{SH}}_i$, a positive-only reparameterization of the 1st-order SH basis, exhibits strong overlap between neighboring lobes (drawn as opaque surface plots), and with it strong cross-talk of edits of the associated editing layers. Right: our $b_i$ (Equation (4)) remains smooth while lobes are separated much more strongly. Note that $\tilde{b}^{\mathrm{SH}}_i$ has been uniformly rescaled to be a partition of unity; the difference in amplitude (see axis labels) further documents the sparser energy distribution in our basis.
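Equation (4) is cheap to evaluate; the sketch below does so for a single direction, taking the six main spatial directions $c_i$ to be the positive and negative coordinate axes (our reading of "main spatial directions") and checking the partition-of-unity property.

    import numpy as np

    # Assumed choice of the six main spatial directions c_i: the +/- axes.
    C_DIRS = np.array([[ 1, 0, 0], [-1, 0, 0],
                       [ 0, 1, 0], [ 0, -1, 0],
                       [ 0, 0, 1], [ 0, 0, -1]], dtype=np.float64)

    def directional_basis(omega, p=5.0):
        """Evaluate b_i(omega) of Eq. 4 for one unit direction omega, shape (3,).

        Returns six non-negative weights; p = 5 as found empirically above."""
        lobes = (C_DIRS @ omega + 1.0) ** p       # (<omega, c_i> + 1)^p
        return lobes / lobes.sum()                # normalization of Eq. 4

    # Partition of unity: the weights sum to one for any direction.
    omega = np.array([0.3, -0.5, 0.81])
    omega /= np.linalg.norm(omega)
    assert abs(directional_basis(omega).sum() - 1.0) < 1e-12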
Figure 4: Samples from our set of synthetic training data (columns: Image, Occlusion, Albedo, Irradiance, Specular).
Using this basis, Equation (2) produces 14 layers from an input image, where twelve are directionally-dependent and two are not ($\rho$ and $O_a$), that can be combined using any compositing software. As shown in the second and third rows of Fig. 2, the 14 output layers can be either collapsed onto 6 directional layers or kept as a combination of both intrinsic and directional decomposition.
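A vectorized sketch of this directional layering is given below; it assembles the six directional layers of Eq. 2 from the non-directional layers and a normal map (all assumed to be given, e. g., by the rendering step of Sec. 3.2 or by the network of Sec. 3.3), and their sum reproduces the composite $C$ because the basis is a partition of unity.

    import numpy as np

    C_DIRS = np.array([[ 1, 0, 0], [-1, 0, 0],
                       [ 0, 1, 0], [ 0, -1, 0],
                       [ 0, 0, 1], [ 0, 0, -1]], dtype=np.float64)

    def basis_weights(dirs, p=5.0):
        """b_i of Eq. 4, vectorized: dirs is H x W x 3, result is H x W x 6."""
        lobes = (np.einsum('hwc,ic->hwi', dirs, C_DIRS) + 1.0) ** p
        return lobes / lobes.sum(axis=-1, keepdims=True)

    def directional_layers(occlusion, albedo, irradiance, specular, normals):
        """Six directional layers whose sum gives back C (Eq. 2).

        occlusion: H x W x 1; albedo, irradiance, specular: H x W x 3;
        normals: H x W x 3 unit view-space normals. All inputs assumed given."""
        view = np.array([0.0, 0.0, 1.0])                    # distant viewer
        r = 2.0 * np.sum(normals * view, axis=-1, keepdims=True) * normals - view
        b_n, b_r = basis_weights(normals), basis_weights(r)
        return [occlusion * (albedo * b_n[..., i:i+1] * irradiance
                             + b_r[..., i:i+1] * specular)
                for i in range(6)]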
3.2. Training Data
There are many values of $O_a$, $E$, $\rho$, and $S$ that explain an observed color $C$, so the decomposition is not unique. In the same way, many normals are possible for a pixel. Inverting this mapping from a single observation is likely to be impossible. At the same time, humans have an intuition of how to infer reflectance on familiar objects [KVDCL96]. One explanation can be that they rely on a context $x$, on the spatial statistics of multiple observations $C(x)$, such that a decomposition into layers becomes possible. In other words, simply not all arrangements of decompositions are equally likely. As described next, we employ a CNN to similarly learn such a decomposition. The training data comprises synthetic images that show a random shape, with partially random reflectance, shaded by random environment map illumination.
Shape.
Surface geometry consists of about 2,000 random instances
from ShapeNet [C∗15] coming from the top-level classes, selected from ShapeNetCore semi-automatically. Specifically, ShapeNetCore has 48 top-level classes, of which we use 27. We discarded
classes that had either very few models or that were considered
uncommon (e. g., birdhouse). We then randomly sampled a tenth of
the total models from each class resulting in 1,930 models. These
models were also manually filtered to be free of meshing artifacts.
Shapes were rendered under random orientation while maintaining
the up direction intrinsic to each model.
Reflectance. Reflectance using the physically-corrected Phong model [LW94] was sampled as follows: the diffuse colors come directly from the ShapeNet models. The specular component $k_s$ is assumed to be a single color. A random decision is made whether the material is assumed to be electric or dielectric. If it is electric, we choose the specular color to be the average color of the diffuse texture. Otherwise, we choose it to be a uniform random grey value. Glossiness is set as $n = 3.0 \cdot 10^{\xi}$, where $\xi \sim U[0,1]$.
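A sketch of this material sampling follows; the 50/50 split between the electric and dielectric cases is an assumption, as the text only states that a random decision is made.

    import numpy as np

    def sample_phong_material(diffuse_texture, rng=np.random.default_rng()):
        """Sample the specular color k_s and glossiness n as described above.

        diffuse_texture: H x W x 3 diffuse texture of the ShapeNet model."""
        if rng.random() < 0.5:                  # 'electric' (assumed 50/50 split)
            k_s = diffuse_texture.reshape(-1, 3).mean(axis=0)   # mean texture color
        else:                                   # 'dielectric'
            k_s = np.full(3, rng.random())      # uniform random grey value
        xi = rng.random()                       # xi ~ U[0, 1]
        n = 3.0 * 10.0 ** xi                    # glossiness n = 3.0 * 10^xi
        return k_s, n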
Illumination.
Illumination is sampled from a set of 90 high-
dynamic-range (HDR) environment maps in resolution
512×256
that have an uncalibrated absolute range of values but are represen-
tative for typical lighting settings: indoor, outdoor, as well as studio
lights. Illumination was randomly oriented around the vertical axis.
Rendering.
After fixing shape, material, and illumination, we synthesize a single image from a random view at a random angle around the vertical axis. To produce $C$, we compute four individual components that can be composed into Eq. 1, or further into directions according to Eq. 2, as per-pixel normals are known at render time. Due to the large amount of training data required, we use efficient, GPU-based rendering algorithms. The occlusion term $O_a$ is computed using screen-space occlusion [RGS09]. The diffuse shading $E$ is computed using pre-convolved irradiance environment maps [RH01a]. Similarly, specular shading is the product of the specular color $k_s$, selected according to the above protocol, and a pre-convolved illumination map for gloss level $n$. No indirect illumination or local interactions are rendered.
While this image synthesis is far from being physically accurate,
it can be produced easily, systematically and for a very large number
of images, making it suitable for learning the layer statistics. Overall
we produce 300,000 unique samples in a resolution of
256×256
(ca. 14 GB) in eight hours on a current PC with a higher-end GPU.
A fraction of the images, totalling about 30,000, was withheld to check for convergence (and detect over-fitting). We also used dropout to prevent over-fitting.
Units. Care has to be taken regarding the color space in which training data is processed and learned. As the illumination is HDR, the resulting image is an HDR rendering. However, as our input images will be LDR at deployment time, the HDR images need to be tone-mapped to match their range. To this end, automatic exposure control is used to map those values into the LDR range, by selecting the 0.95 luminance percentile of a random subset of the pixels and scaling all values such that this value maps to 1. The rendered result $C$ is stored after gamma-correction. All other components are stored in physically linear units ($\gamma = 1.0$) and are processed in physically linear units by the CNN and the end application using the layers. Applying the final gamma-correction is consequently up to the application using the layers later on (as shown in our edit examples).
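The exposure and storage conventions above can be summarized by the sketch below; the Rec. 709 luminance weights, the sample size, and the gamma of 2.2 are our assumptions, since the text only specifies the 0.95 percentile and "gamma-correction".

    import numpy as np

    def auto_expose(hdr, percentile=95.0, samples=4096,
                    rng=np.random.default_rng()):
        """Scale an HDR rendering so its 0.95 luminance percentile maps to 1."""
        pixels = hdr.reshape(-1, 3)
        idx = rng.choice(len(pixels), size=min(samples, len(pixels)), replace=False)
        lum = pixels[idx] @ np.array([0.2126, 0.7152, 0.0722])   # Rec. 709 (assumed)
        scale = 1.0 / max(np.percentile(lum, percentile), 1e-8)
        return np.clip(hdr * scale, 0.0, 1.0)

    def store_composite(hdr_c, gamma=2.2):
        """C is stored gamma-corrected (LDR, photo-like); all other layers
        stay in physically linear units (gamma = 1.0)."""
        return auto_expose(hdr_c) ** (1.0 / gamma)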
3.3. Learning a Decomposition
We perform the decomposition using a CNN [LBBH98, KSH12] trained on the data produced as described above. Input to the network is a single image such as a photograph. The output for the non-directional variant is five images (occlusion, diffuse illumination, albedo, specular shading, and normals), where occlusion is scalar and the others are three-vector-valued. Note that normals and intrinsic properties are estimated jointly, as done before for albedo and depth [SBD15, KPSL16]. The normals are not presented to the user, but only used to perform the directional decomposition.
We have also experimented with letting the CNN directly compute the directional decomposition, but found that an explicit decomposition using the normal and reflected direction is easier to train and produces better results.
This design follows the convolution-deconvolution idea with cross-links, resulting in an encoder-decoder scheme [RFB15]. The network is fully convolutional. We start at a resolution of 256×256 that is reduced down to 2×2 through stride-two convolutions. We then perform two stride-one convolutions and increase the number of feature layers in accordance with the required number of output layers (i. e., quadruple for the layers, while the whole step is skipped for normal estimation). The deconvolution part of the network consists of blocks performing a resize-convolution (upsampling followed by a stride-one convolution), cross-linking, and a stride-one convolution. Every convolution in the network is followed by a ReLU [NH10] non-linearity, except for the last layer, for which a sigmoid non-linearity is used instead. This is done to normalize the output to the range [0,1]. Images with an uneven aspect ratio are appropriately cropped and/or padded with white pixels to be square. All receptive fields are 3×3 pixels in size, except for the first and last two layers, which are 5×5. No filter weights are shared between layers. Overall, this network has about 8.5 M trainable parameters.
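For illustration, a schematic TensorFlow/Keras version of such an encoder-decoder with cross-links is sketched below; it mirrors the description above (stride-two downsampling from 256×256 to 2×2, two stride-one convolutions at the bottleneck, resize-convolutions with cross-links on the way up, ReLUs and a final sigmoid), but the filter counts and the 13 output channels (1 occlusion + 3 albedo + 3 irradiance + 3 specular + 3 normals) are placeholders rather than the exact architecture from our supplementary material.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_decomposer(out_channels=13, base=32):
        """Encoder-decoder with cross-links; filter counts are illustrative."""
        inp = layers.Input((256, 256, 3))
        x, skips = inp, []
        for i in range(7):                          # 256x256 down to 2x2
            skips.append(x)                         # cross-link source
            k = 5 if i == 0 else 3                  # first layer uses 5x5 kernels
            x = layers.Conv2D(min(base * 2 ** i, 512), k, strides=2,
                              padding='same', activation='relu')(x)
        x = layers.Conv2D(512, 3, padding='same', activation='relu')(x)
        x = layers.Conv2D(512, 3, padding='same', activation='relu')(x)
        for skip in reversed(skips):                # resize-conv + cross-link
            x = layers.UpSampling2D()(x)            # upsample, then stride-one conv
            x = layers.Conv2D(max(skip.shape[-1], base), 3, padding='same',
                              activation='relu')(x)
            x = layers.Concatenate()([x, skip])     # cross-link
            x = layers.Conv2D(max(skip.shape[-1], base), 3, padding='same',
                              activation='relu')(x)
        out = layers.Conv2D(out_channels, 5, padding='same',
                            activation='sigmoid')(x)   # outputs in [0, 1]
        return tf.keras.Model(inp, out)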
For the loss function, we combine a per-layer L2 loss with a novel three-fold recombination loss that encourages the network to produce combinations that result in the input image and fulfill the following requirements: (i) the layers have to produce the input, so $C = O_a (E \cdot \rho + S)$; (ii) the components should explain the image without AO, i. e., $C / O_a = E\rho + S$; and (iii) diffuse reflected light should explain the image without AO and specular, so $C / O_a - S = E\rho$. Note that if the network were able to always perform a perfect decomposition, a single L2 loss alone would be sufficient. As it makes errors in practice, the additional loss expressions bias those errors to at least happen in such a way that the combined result does not deviate as much from the input. All losses are in the same RGB-difference range and are weighted equally.
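Under the output layout assumed in the architecture sketch above, the three recombination terms could be written as follows (the epsilon guard against division by zero and the channel ordering are our assumptions); the per-layer L2 loss against the ground-truth layers is added separately.

    import tensorflow as tf

    def recombination_loss(pred, image, eps=1e-4):
        """Recombination terms (i)-(iii); pred is B x H x W x 13, image is the
        input photo C in linear units (B x H x W x 3). Channel order assumed."""
        occ  = pred[..., 0:1]     # O_a
        rho  = pred[..., 1:4]     # albedo
        irr  = pred[..., 4:7]     # diffuse illumination E
        spec = pred[..., 7:10]    # specular S
        c_hat  = occ * (irr * rho + spec)          # (i)  C = O_a (E rho + S)
        c_noao = image / tf.maximum(occ, eps)      # image with AO divided out
        l_i   = tf.reduce_mean(tf.square(c_hat - image))
        l_ii  = tf.reduce_mean(tf.square(irr * rho + spec - c_noao))     # (ii)
        l_iii = tf.reduce_mean(tf.square(irr * rho - (c_noao - spec)))   # (iii)
        return l_i + l_ii + l_iii                  # equally weighted, as above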
In Tbl. 1, we numerically evaluate the recombination error (i.e.,
the differences between the original and recombined images) by
progressively adding each of the three additional losses to a standard
L2 loss. While a positive trend can be observed with the DSSIM
metric, these benefits are not as evident on the NRMSE metric.
Overall, the network is a rather standard modern design, but trained to solve a novel task (layers) on a novel kind of training data (synthesized, directionally-dependent information). We used TensorFlow [A∗15] as our implementation platform, and each model requires only several hours on an NVIDIA Titan X GPU with 12 GB on-board RAM to train (both models were trained for 12 hours). We used stochastic gradient descent to solve for the network, which we ran for 6 epochs with batches of size 16. A more detailed description of the network’s architecture can be found in the supplementary materials.
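For completeness, a minimal training loop under those settings might look as follows; the dataset pipeline, the learning rate, and the reuse of the sketches above are assumptions.

    import tensorflow as tf

    # Hypothetical tf.data pipeline yielding (photo, ground-truth layers) batches.
    train_ds = tf.data.Dataset.load("layered_training_set").shuffle(4096).batch(16)

    model = build_decomposer()                          # architecture sketch above
    opt = tf.keras.optimizers.SGD(learning_rate=0.01)   # plain SGD, assumed rate

    for epoch in range(6):                              # 6 epochs, batches of size 16
        for photo, gt_layers in train_ds:
            with tf.GradientTape() as tape:
                pred = model(photo, training=True)
                loss = tf.reduce_mean(tf.square(pred - gt_layers))  # per-layer L2
                loss += recombination_loss(pred, photo)             # see above
            grads = tape.gradient(loss, model.trainable_variables)
            opt.apply_gradients(zip(grads, model.trainable_variables))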
Table 1: Comparing different steps of our recombination loss (rows) in terms of two metrics (columns): NRMSE and DSSIM on our validation set.

Loss                      NRMSE             DSSIM
L2                        0.2549 ± 0.0807   0.0234 ± 0.0129
L2 + (iii)                0.2598 ± 0.0833   0.0229 ± 0.0126
L2 + (iii) + (ii)         0.2588 ± 0.0799   0.0229 ± 0.0122
L2 + (iii) + (ii) + (i)   0.2460 ± 0.0787   0.0210 ± 0.0119