
Multi-view Relighting using a Geometry-Aware Network
JULIEN PHILIP, Université Côte d’Azur and Inria
MICHAËL GHARBI, Adobe
TINGHUI ZHOU, UC Berkeley
ALEXEI A. EFROS, UC Berkeley
GEORGE DRETTAKIS, Université Côte d’Azur and Inria
Fig. 1. Two applications of our multi-view relighting system. (a) We show five different frames from a drone video (copyright Namyeska, youtu.be/JHeDP7_YBos, used with permission) relit with a “time-lapse” effect of a rotating sun (see supplemental for the full video). A user can also relight a single-view photograph of a known landmark (b) to different target lighting conditions (c); for this, we applied our algorithm to a collection of 50 internet images of the same location, from which we built the proxy geometry.
We propose the rst learning-based algorithm that can relight images in
a plausible and controllable manner given multiple views of an outdoor
scene. In particular, we introduce a geometry-aware neural network that
utilizes multiple geometry cues (normal maps, specular direction, etc.) and
source and target shadow masks computed from a noisy proxy geometry
obtained by multi-view stereo. Our model is a three-stage pipeline: two sub-
networks rene the source and target shadow masks, and a third performs
the nal relighting. Furthermore, we introduce a novel representation for the
shadow masks, which we call RGB shadow images. They reproject the colors
from all views into the shadowed pixels and enable our network to cope
with inacuraccies in the proxy and the non-locality of the shadow casting
interactions. Acquiring large-scale multi-view relighting datasets for real
scenes is challenging, so we train our network on photorealistic synthetic
data. At train time, we also compute a noisy stereo-based geometric proxy,
this time from the synthetic renderings. This allows us to bridge the gap
between the real and synthetic domains. Our model generalizes well to real
scenes. It can alter the illumination of drone footage, image-based renderings,
textured mesh reconstructions, and even internet photo collections.
Authors’ addresses: Julien Philip, Université Côte d’Azur and Inria, julien.philip@inria.fr; Michaël Gharbi, Adobe, mgharbi@adobe.com; Tinghui Zhou, UC Berkeley, tinghuiz@eecs.berkeley.edu; Alexei A. Efros, UC Berkeley, efros@eecs.berkeley.edu; George Drettakis, Université Côte d’Azur and Inria, George.Drettakis@inria.fr.
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
0730-0301/2019/7-ART78 $15.00
https://doi.org/10.1145/3306346.3323013
CCS Concepts: • Computing methodologies → Image manipulation.
Additional Key Words and Phrases: Image relighting, Multi-view, Deep Learning
ACM Reference Format:
Julien Philip, Michaël Gharbi, Tinghui Zhou, Alexei A. Efros, and George Drettakis. 2019. Multi-view Relighting using a Geometry-Aware Network. ACM Trans. Graph. 38, 4, Article 78 (July 2019), 14 pages. https://doi.org/10.1145/3306346.3323013
1 INTRODUCTION
Changing the illumination of an outdoor image is a notoriously difficult problem that requires the lighting to be modified consistently across the image, and shadows to be removed and resynthesized for the new sun position [Duchêne et al. 2015; Tchou et al. 2004; Yu et al. 1999]. Cast shadows are particularly challenging because an occluder can be arbitrarily far from the point it shadows, or even out of view.
The basic premise of our approach is to use multi-view information and approximate 3D geometry to reason about non-local lighting interactions and guide the relighting task. We introduce the first learning-based algorithm that can relight multi-view datasets of outdoor scenes (Fig. 1), which have become a commodity thanks to smartphone cameras, large-scale internet photo collections and drone cinematography. Our model uses a neural network designed to exploit geometric cues. It includes a careful treatment of cast shadows and is trained solely on realistic synthetic renderings.
[Figure 2 panels: (a) multi-view dataset; (b) 3D proxy & source/target lighting parameters; (c) reference input; (d) illumination buffers; (e) RGB shadow images; (f) refined shadow masks; (g) relit output.]
Fig. 2. Overview of our approach. Left: We use off-the-shelf stereo to create a 3D geometric proxy of the scene (b). The geometry is encoded as illumination buffers (d) and used to create RGB shadow images (e) that are independently refined by two networks (f), helping the final relighting network remove and re-synthesize shadows, and change the illumination (g) according to the desired novel lighting condition. Right: We train our model with synthetic data, including accurate ground truth geometry and renderings and an approximate proxy, created using synthetic renderings instead of photos. These two representations of the training scene allow the network to accurately refine shadows, enabling plausible relighting.
Our method has several applications: it allows automatic creation of a “time-lapse” effect by dynamically relighting drone videos (Fig. 1(a)). Or, if we only have a single photo, we can access online photos of the same place to relight the input photo (Fig. 1(b)). We can also relight images in traditional multi-view pipelines, e.g., Image-Based Rendering (IBR) or photogrammetry (Fig. 15).
Previous methods have difficulty with the type of input we target. Inverse-illumination methods [Loscos et al. 1999; Yu et al. 1999] cannot handle the approximate geometry of the proxy, while single-image relighting solutions struggle with cast shadows [Luan et al. 2017; Shih et al. 2013]. Finally, our solution significantly outperforms neural-network baselines (Sec. 5.2).
Method Overview. Given a set of images captured from multiple viewpoints (Fig. 2a), we start by building an approximate representation of the scene’s geometry (a proxy) using off-the-shelf stereo [RealityCapture 2016; Snavely et al. 2006] (Fig. 2b). We can relight any reference view of that scene (Fig. 2c); this could be one of the input images or a novel view obtained by IBR. The user provides a target illumination by specifying a sun direction vector and a scalar “cloudiness” level (or a sequence of such parameters for “time-lapse” effects). From the proxy, we then compute image-space buffers (Fig. 2d: normal maps, specular reflection direction, etc.) and shadow masks for the source and target illuminations. We perform relighting by training a neural network to map from the reference image, with extra buffers and shadow masks, to the novel lighting condition.
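To make the lighting interface concrete, the sketch below shows one way to represent the user-specified target illumination and to generate a parameter sequence for the “time-lapse” effect. This is an illustrative Python sketch under our own assumptions: the dataclass fields and the azimuth-sweep parameterization are hypothetical, not the paper’s actual API.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TargetLighting:
    sun_direction: np.ndarray  # unit 3-vector toward the sun
    cloudiness: float          # scalar level: 0 = clear, 1 = fully overcast

def timelapse_sequence(n_frames, elevation_deg=35.0):
    """Yield sun directions sweeping in azimuth at a fixed elevation,
    one TargetLighting per output frame of the "time-lapse" effect."""
    el = np.radians(elevation_deg)
    for az in np.linspace(0.0, np.pi, n_frames):
        d = np.array([np.cos(az) * np.cos(el),   # x (e.g., east)
                      np.sin(az) * np.cos(el),   # y (e.g., north)
                      np.sin(el)])               # z (up)
        yield TargetLighting(sun_direction=d, cloudiness=0.0)
```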
The importance of accurate shadow estimation for shadow removal has been previously demonstrated [Duchêne et al. 2015; Gryka et al. 2015; Guo et al. 2011]. But reconstruction errors in the proxy often lead to inaccurate masks a network cannot trust. This motivates our network design: we decompose our model into three sub-networks (Fig. 2). Two modules refine the source (resp. target) shadow masks (Fig. 2f) while the third implements the final relighting (Fig. 2g). The sub-networks are trained jointly but with different supervision: respectively ground truth shadow masks and ground truth relit images. Furthermore, instead of computing standard shadow masks from the proxy, we introduce RGB shadow images (Fig. 2e). These shadow images re-project colors from the shadow-casting geometry from all viewpoints into pixels in shadow, helping the network identify erroneously reconstructed shadow casters from the reprojected color (Fig. 4, 5).
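The joint supervision can be summarized as the sum of three losses, L = L_relight + L_src + L_tgt (Eq. 2 of the paper): the relighting term is supervised by ground truth relit images, the two refinement terms by ground truth shadow masks. The sketch below uses plain L1 distances as a stand-in; the paper’s exact per-term losses are not reproduced here.

```python
import torch.nn.functional as F

def joint_loss(pred_relit, gt_relit,
               pred_s_src, gt_s_src,
               pred_s_tgt, gt_s_tgt):
    # L = L_relight + L_src + L_tgt (Eq. 2); using L1 for each term is an
    # illustrative assumption, not the paper's stated choice.
    return (F.l1_loss(pred_relit, gt_relit)
            + F.l1_loss(pred_s_src, gt_s_src)
            + F.l1_loss(pred_s_tgt, gt_s_tgt))
```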
For supervised training, we need data corresponding to different lighting conditions of the exact same views, which is hard to capture with real photos. Instead, we use professionally-modeled, realistic synthetic scenes to generate physically-based renderings with many different viewpoints and lighting conditions. We introduce a flexible compositing methodology to generate a large variety of illuminations on-the-fly at training time. This avoids the combinatorial explosion in the number of images to render. Synthetic scenes also give us ground truth shadow masks.
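Because light transport is additive, renderings decomposed into per-light layers can be recombined linearly at training time to produce new illuminations without re-rendering. The sketch below illustrates this idea with a sun layer and a sky layer scaled by new colors; the authors’ actual layer set and scaling scheme may differ.

```python
import numpy as np

def composite_illumination(sun_layer, sky_layer, sun_color, sky_color):
    """Recombine pre-rendered lighting layers into a new illumination.

    sun_layer, sky_layer: (H, W, 3) linear-radiance renderings of the same
    view lit only by the sun and only by the sky, respectively.
    sun_color, sky_color: length-3 RGB scale factors for the target
    condition (e.g., dimming the sun and tinting the sky for dusk).
    """
    return sun_layer * np.asarray(sun_color) + sky_layer * np.asarray(sky_color)
```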
To train the shadow refinement, real data is impossible to capture, and we cannot directly use the ground truth shadows cast by the synthetic geometry. A model trained with these perfectly accurate shadows would not generalize to real photographs, since it would never have seen the reconstruction errors of the stereo-based proxy. Instead, we generate an approximate 3D proxy for each synthetic training scene using stereo on renderings, from which we obtain the input illumination buffers and (inaccurate) RGB shadow images. The ground truth shadow masks are used as targets to supervise the refinement sub-networks. This approach makes our model robust to 3D reconstruction errors at test time and limits the generalization gap between real and synthetic data.
Contributions. In summary, we make the following contributions:
- An end-to-end learning method for multi-view relighting of outdoor scenes, guided by image-space buffers, namely shadow masks and illumination buffers, that are computed from a geometry proxy.
- A learning-based shadow refinement solution to remove and resynthesize shadows. It uses the input images as well as our newly-introduced RGB shadow images to overcome reconstruction errors in the proxy.
- A training procedure that uses realistic synthetic scenes to flexibly generate multiple lighting conditions. Critically, we create a stereo-based proxy for each training scene which, together with the ground truth geometry, enables supervised learning for shadow refinement.
Although it is entirely trained on synthetic images, our algorithm generalizes to real multi-view datasets, and can modify the lighting in a much wider range of illumination conditions than previous methods (e.g., [Duchêne et al. 2015]). We evaluate our approach on real multi-view datasets, and show a variety of applications (Fig. 1, 13, 16).
2 RELATED WORK
Our method builds on several different areas. We first discuss traditional methods for single-image and multi-view relighting. One major challenge for relighting is the careful treatment of shadows. Our method removes and re-synthesizes shadows; we thus review the shadow removal literature. We also briefly review some aspects of image-to-image transformation research related to our solution.
2.1 Image-Based Relighting
Image-based relighting methods try to change the lighting conditions of an input image or a set of images. Early work [Loscos et al. 1999; Marschner and Greenberg 1997; Yu et al. 1999] used laser scans or early user-assisted reconstruction algorithms to estimate geometry, reflectance, and/or environment lighting. Inverse global illumination is then used for relighting. More involved capture setups such as the Light Stage [Debevec et al. 2000; Wenger et al. 2005] allow for production-quality relighting, with wide-ranging applications in the film industry. In contrast, we target casual capture with a single camera (DSLR, phone or drone), providing approximate 3D geometry, which is most often unsuitable for inverse rendering methods.
Estimating the lighting environment in an image is an important step in relighting, with many proposed solutions (e.g., [Debevec 2002; Hold-Geoffroy et al. 2017; Lalonde et al. 2009a; Stumpfel et al. 2004]). Similarly, several reflectance estimation techniques have been proposed to assist relighting [Masselus et al. 2003, 2004]. Webcam sequences have also been used for relighting [Lalonde et al. 2009b; Sunkavalli et al. 2007], although cast shadows often require manual layering. Alternatively, online digital terrain and urban models registered to images can be used for approximate relighting [Kopf et al. 2008]. None of these methods satisfies all our requirements, i.e., plausible multi-view relighting including cast shadows for outdoor scenes using casual capture.
Another widely developed area of image relighting focuses on images of faces (e.g., [Peers et al. 2007; Wang et al. 2009; Wen et al. 2003]). The specific nature of face geometry and reflectance results in solutions that are not well adapted to the outdoor scenes we target.
Some methods target realistic object editing or compositing in single images [Karsch et al. 2011; Kholgade et al. 2014]. These methods give good results, but they do not address major lighting changes, like editing cast shadows. They also require significant effort from the user to annotate the scene.
Several methods for multi-view image relighting have been developed, both for the case of multiple images sharing a single lighting condition [Duchêne et al. 2015], and for images of the same location with multiple lighting conditions (typically from internet photo collections) [Laffont et al. 2012; Xu et al. 2018]. For the single lighting condition, Duchêne et al. [2015] first perform shadow classification and intrinsic decomposition using separate optimization steps. Despite impressive results, artifacts remain, especially around shadow boundaries, and the relighting method fails beyond limited shadow motion. Our learning solution avoids the pitfalls of these optimization methods, and allows much larger sun motion (Section 5.3) as well as treating video sequences.
2.2 Intrinsic images, shadow estimation and removal
Intrinsic image decomposition and shadow removal methods are closely related to relighting. The classic Retinex work [Land and McCann 1971] inspired the intrinsic decomposition method of Weiss [2001], which used time-lapse sequences to compute shadow-free reflectance images. Many previous methods exist to explicitly detect and remove shadows, both in graphics and computer vision. See Sanin et al. [2012] for a survey. Most such methods operate on a single image, for example the work of Finlayson et al. [2006], which works well on shadows of relatively simple isolated objects. Other approaches include Lalonde et al. [2010], which uses Conditional Random Fields to detect the shadow, or Mohan et al. [2007], which is a gradient-based solution for shadow removal. These methods typically do not address relighting, which is our main goal. User-assisted methods have also been developed [Shor and Lischinski 2008; Wu et al. 2007], but our automated approach is more practical for multi-view datasets.
Even before the massive adoption of deep CNNs, learning methods were proposed to remove shadows from images. The method of Guo et al. [2011] detects pairs of points in shadow/light using a learning approach, and subsequently removes shadows with an optimization. More recently, deep learning has been used for shadow removal [Qu et al. 2017], using pretrained features and global and local information. Generative Adversarial Networks (GANs) have also been used for shadow detection and removal, e.g., conditional GANs [Wang et al. 2018], where a first GAN learns to generate the shadow mask, which is then used by a second network to remove shadows. As with previous shadow removal methods, relighting is not addressed in this work. Recent deep learning methods achieve good results for shadow removal, but most often do not address moving shadows (especially cast shadows) or changing the overall lighting conditions. Handling such changes in lighting is a much more complex problem; our solution uses geometry and synthetic training data, achieving plausible relighting with cast shadows. We provide comparisons with baseline methods using such solutions in Section 5.2.
2.3 Deep learning for image-to-image transformations
The Pix2Pix method [Isola et al. 2017] uses a U-net [Ronneberger et al. 2015] to perform many different image transformation tasks with remarkable success, even though the quantity of training data is quite low compared to other methods. Similarly, ResNet-like architectures [He et al. 2016] have been particularly successful in large image transformation tasks [Zhu et al. 2017], thanks to the residual blocks that preserve useful information in the network. There has been a body of work on transforming images, including day-to-night changes [Liu et al. 2017]. While impressive, the results of these methods, typically generated by GANs, are lacking in consistency and ease of control. Finally, there has also been work on face or body relighting using deep learning (e.g., [Kanamori and Endo 2018; Shu et al. 2017]); as with older methods, the specific technical choices for faces or bodies result in methods that are not necessarily adapted to relighting of outdoor scenes, especially since the extent of outdoor scenes results in many more non-local effects.
3 GEOMETRY-AWARE RELIGHTING NETWORK
Our relighting solution is built around a neural network that takes one image from a multi-view dataset and a set of corresponding image-space buffers as input, and produces a new image with the lighting altered. We identified three key difficulties in successfully implementing this image transformation: modeling the illumination changes (color, intensity, etc.), and removing and resynthesizing cast shadows.
To overcome these difficulties, our learning solution exploits a geometric 3D proxy which we obtain by first calibrating the input virtual cameras using structure from motion (SfM) [Snavely et al. 2006], then running a Multi-View Stereo algorithm [Goesele et al. 2007; RealityCapture 2016]. Fig. 3 illustrates this procedure.
Because our CNN operates in the image domain, we encode the geometry and lighting parameters as image-space illumination buffers B. These include normal maps, per-pixel specular reflection direction, etc. (see Section 3.3). In our ablation study, we found these buffers to be instrumental in synthesizing plausible novel illuminations (Section 5.5).
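As an illustration of one such buffer, the sketch below computes a per-pixel mirror reflection of the view ray about the surface normal, r = v - 2(v·n)n. Whether this matches the paper’s exact definition of its specular-direction buffer is an assumption; the formula itself is the standard reflection identity.

```python
import numpy as np

def specular_direction_buffer(normals, view_dirs):
    """Per-pixel mirror reflection r = v - 2 (v . n) n.

    normals:   (H, W, 3) unit surface normals from the proxy.
    view_dirs: (H, W, 3) unit directions from the camera to the surface.
    Returns an (H, W, 3) buffer of unit reflection directions.
    """
    dot = np.sum(view_dirs * normals, axis=-1, keepdims=True)
    r = view_dirs - 2.0 * dot * normals
    return r / np.maximum(np.linalg.norm(r, axis=-1, keepdims=True), 1e-8)
```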
Furthermore, the proxy gives us a particularly powerful means to guide the shadow removal and re-synthesis process. We use it to obtain two shadow masks, S_src and S_tgt, corresponding to the source and target sun directions respectively, by running a shadow-casting algorithm. If the geometry were perfect, these masks would tell the network precisely which pixels to brighten (resp. darken). However, because of errors in the stereo reconstruction, the masks typically contain significant artifacts and misalignments with respect to the actual shadows in the image.
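A shadow-casting pass over the proxy can be as simple as testing, for each pixel’s 3D point, whether the ray toward the sun is blocked by proxy geometry. The brute-force sketch below (Möller-Trumbore ray/triangle tests) illustrates the idea; the paper’s exact algorithm is not specified here, and a practical implementation would use an acceleration structure or GPU shadow mapping.

```python
import numpy as np

def ray_hits_triangle(origin, direction, tri, eps=1e-8):
    """Moller-Trumbore ray/triangle intersection test (True on hit)."""
    v0, v1, v2 = tri
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:
        return False                       # ray parallel to triangle
    inv = 1.0 / det
    t = origin - v0
    u = np.dot(t, p) * inv
    if u < 0.0 or u > 1.0:
        return False
    q = np.cross(t, e1)
    v = np.dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return False
    return np.dot(e2, q) * inv > eps       # hit strictly in front of origin

def shadow_mask(points, sun_dir, proxy_triangles):
    """Binary shadow mask for per-pixel 3D points (H, W, 3): a pixel is in
    shadow if its ray toward the sun hits any proxy triangle."""
    h, w, _ = points.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            mask[y, x] = any(
                ray_hits_triangle(points[y, x], sun_dir, tri)
                for tri in proxy_triangles)
    return mask
```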
While coarse masks are better than no shadow mask at all (see Section 5.5), we found that the success of the shadow removal procedure strongly depends on the quality of S_src. Similarly, the shadow re-synthesis suffers from errors in S_tgt. This led us to build an explicit shadow refinement step within our pipeline. We guide the refinement step by introducing RGB shadow images. These maps use color information from all images in the multi-view dataset to provide hints to the CNN on reconstruction inaccuracies.
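For reference, the paper weights the contribution of view i to the color of a pixel in the RGB shadow image as w_i = 1 / (||x_o - p_i(x_o)||² · |1 + c_iᵀ d_sun|² + ε), where c_i is the unit direction from camera i to the point x_o, p_i(x_o) is the first intersection of that camera ray with the proxy, and ε = 1e-5 (its Eq. 1). A direct transcription in Python; the surrounding gather loop is left out:

```python
import numpy as np

def shadow_image_weight(x_o, cam_center, p_i, d_sun, eps=1e-5):
    """Weight of view i at point x_o (Eq. 1 of the paper):
    w_i = 1 / (||x_o - p_i(x_o)||^2 * |1 + c_i . d_sun|^2 + eps).

    x_o:        (3,) 3D point whose reprojected color is being gathered.
    cam_center: (3,) optical center of camera i.
    p_i:        (3,) first intersection of the ray from camera i through
                x_o with the proxy.
    d_sun:      (3,) unit sun direction.
    """
    c_i = x_o - cam_center
    c_i = c_i / np.linalg.norm(c_i)          # unit direction camera -> x_o
    dist2 = float(np.sum((x_o - p_i) ** 2))  # squared reprojection distance
    return 1.0 / (dist2 * (1.0 + np.dot(c_i, d_sun)) ** 2 + eps)
```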
Fig. 3. (a) Our method takes as input a set of photos of an outdoor scene, shot from varying viewpoints (in this example 140). (b) We calibrate the cameras (shown in green) and build a 3D proxy of the scene using MVS. This reconstruction is approximate, as can be seen from the multiple holes (white) and erroneous over-reconstruction (e.g., blobs around palm trees with reconstructed sky). Our model learns to account for this uncertainty and generalizes well at test time.
Our overall model can thus be divided into three sub-components (Fig. 2). Two sub-networks independently refine the shadow masks S_src and S_tgt, and a third implements the final relighting given the illumination buffers and the refined shadow masks. The three components are trained jointly in an end-to-end, supervised fashion, using a training set of synthetic scenes. Our dataset contains ground truth source/target images, and approximate/ground truth shadow mask pairs.
3.1 Overall architecture
At a high level, our network is the composition of three sub-networks: two for the source (resp. target) shadow refinement tasks and one for relighting (Fig. 2). The refinement networks both take the RGB shadow images (Section 3.2.1) and the input images, and predict refined greyscale shadow masks. These two refined shadow masks, along with the illumination buffers, are sent to the relighting sub-network, which infers the target sun condition image and an overcast image. This 3-step approach is supported by recent results (e.g., [Wang et al. 2018]) showing that decomposing shadow detection and removal into two consecutive subtasks within the same network greatly improves quality. The overall architecture of our network is shown in Fig. 2; we use a ResNet [He et al. 2016; Johnson et al. 2016] for the shadow refinement and the relighting modules [Zhu et al. 2017]. We also experimented with a U-net-like architecture [Isola et al. 2017], which gave marginally inferior results. Our network outputs two images: the relit target image, and a “cloudy” rendering which we use to produce different degrees of overcast lighting conditions (Section 5.6.1).
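The composition just described can be sketched structurally as follows. The channel counts and the `make_resnet` generator factory are placeholders (assumptions), not the authors’ published configuration; the sketch only mirrors the described dataflow of two refinement networks feeding one relighting network.

```python
import torch
import torch.nn as nn

class RelightingModel(nn.Module):
    """Structural sketch of the three-stage pipeline. `make_resnet(in_ch,
    out_ch)` should build a Johnson-style ResNet generator; its definition
    is omitted here."""

    def __init__(self, make_resnet, n_buffer_ch=9):
        super().__init__()
        # Shadow refinement: image (3) + RGB shadow image (3) -> greyscale mask (1).
        self.refine_src = make_resnet(in_ch=6, out_ch=1)
        self.refine_tgt = make_resnet(in_ch=6, out_ch=1)
        # Relighting: image (3) + two refined masks (2) + illumination buffers.
        self.relight = make_resnet(in_ch=5 + n_buffer_ch, out_ch=6)

    def forward(self, image, rgb_shadow_src, rgb_shadow_tgt, buffers):
        s_src = self.refine_src(torch.cat([image, rgb_shadow_src], dim=1))
        s_tgt = self.refine_tgt(torch.cat([image, rgb_shadow_tgt], dim=1))
        out = self.relight(torch.cat([image, s_src, s_tgt, buffers], dim=1))
        relit, cloudy = out[:, :3], out[:, 3:]  # relit target + "cloudy" image
        return relit, cloudy, s_src, s_tgt
```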
3.2 Shadow refinement with RGB shadow images
Strong shadow cues are central to the shadow removal and re-synthesis process (see Section 5.5 for a comparison).
References
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In CVPR.
Related Papers (5)
Frequently Asked Questions (12)
Q1. How long does it take to create a supervised shadow refinement subnetwork?

The use of synthetic data allows generation of highly diverse ground truth data, and the creation of a proxy representation in addition to ground truth geometry for each synthetic scene allows supervised training for shadow refinement. 

The ground truth geometry and materials are used to render the sun and sky layers, and to create the ground truth greyscale shadow masks. 

The basic premise of their approach is to use multi-view information and approximate 3D geometry to reason about non-local lighting interactions and guide the relighting task. 

4.2 Photo-realistic rendering, layer decomposition and compositing-based data augmentationPath-tracing complex outdoors sceneswith a physically-based sun/sky model is expensive: rendering a converged image at 1024×768 takes about 10 minutes on their 400-core cluster. 

For shadow refinement to be successful at test time, the network needs to learn the mapping between approximate proxy shadows and ground truth shadows at training time. 

The weight for the contribution of a given image i to the color of a pixel in the RGB shadow image is computed as:1 | |xo − pi (xo)| |22 · |1 + c ᵀ i dsun |2 + ϵ , (1)where ci is a unit vector giving the direction from camera i to xo, pi (xo) ∈ R3 is the first intersection of the camera ray defined by ci with the proxy (Fig. 5) and ϵ = 1e−5. 

Since the authors want to change the lighting, the target masks are generally not aligned with the shadows in the input image, making the problem inherently more ambiguous. 

The source shadow refinement process uses the actual boundary in the input image, giving better overall results compared to the target shadow refinement (Fig. 4, (e)).3.2.2 RGB shadow images. 

The three sub-modules of their network are trained jointly in a supervised manner to minimize the sum of three losses:L = Lrelight + Lsrc + Ltgt. (2) These loss functions compare the accuracy of their network’s predictions (the final relit image as well as both intermediate refined shadow masks) to synthetic ground truth, which the authors detail in Section 4. 

These shadow images re-project colors from the shadow-casting geometry from all viewpoints into pixels in shadow, helping the network identify erroneously reconstructed shadow casters from the reprojected color (Fig. 4,5). 

Other approaches include Lalonde et al. [2010] which uses Conditional Random Fields to detect the shadow, or Mohan et al. [2007] which is a gradient-based solution for shadow removal. 

To bypass these issues, the authors use synthetic training data and render photo-realistic images using the Mitsuba [Jakob 2010] pathtracer.