
Multi-view Relighting using a Geometry-Aware Network
JULIEN PHILIP, Université Côte d’Azur and Inria
MICHAËL GHARBI, Adobe
TINGHUI ZHOU, UC Berkeley
ALEXEI A. EFROS, UC Berkeley
GEORGE DRETTAKIS, Université Côte d’Azur and Inria
Fig. 1. Two applications of our multi-view relighting system. (a) We show five different frames from a drone video (copyright Namyeska, youtu.be/JHeDP7_YBos, used with permission) relit with a “time-lapse” effect of a rotating sun (see supplemental for the full video). A user can also relight a single-view photograph of a known landmark (b) to different target lighting conditions (c); for this, we applied our algorithm to a collection of 50 internet images of the same location, from which we built the proxy geometry.
We propose the rst learning-based algorithm that can relight images in
a plausible and controllable manner given multiple views of an outdoor
scene. In particular, we introduce a geometry-aware neural network that
utilizes multiple geometry cues (normal maps, specular direction, etc.) and
source and target shadow masks computed from a noisy proxy geometry
obtained by multi-view stereo. Our model is a three-stage pipeline: two sub-
networks rene the source and target shadow masks, and a third performs
the nal relighting. Furthermore, we introduce a novel representation for the
shadow masks, which we call RGB shadow images. They reproject the colors
from all views into the shadowed pixels and enable our network to cope
with inacuraccies in the proxy and the non-locality of the shadow casting
interactions. Acquiring large-scale multi-view relighting datasets for real
scenes is challenging, so we train our network on photorealistic synthetic
data. At train time, we also compute a noisy stereo-based geometric proxy,
this time from the synthetic renderings. This allows us to bridge the gap
between the real and synthetic domains. Our model generalizes well to real
scenes. It can alter the illumination of drone footage, image-based renderings,
textured mesh reconstructions, and even internet photo collections.
Authors’ addresses: Julien Philip, Université Côte d’Azur and Inria, julien.philip@inria.fr; Michaël Gharbi, Adobe, mgharbi@adobe.com; Tinghui Zhou, UC Berkeley, tinghuiz@eecs.berkeley.edu; Alexei A. Efros, UC Berkeley, efros@eecs.berkeley.edu; George Drettakis, Université Côte d’Azur and Inria, George.Drettakis@inria.fr.
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
0730-0301/2019/7-ART78 $15.00
https://doi.org/10.1145/3306346.3323013
CCS Concepts: • Computing methodologies → Image manipulation.
Additional Key Words and Phrases: Image relighting, Multi-view, Deep Learning
ACM Reference Format:
Julien Philip, Michaël Gharbi, Tinghui Zhou, Alexei A. Efros, and George Drettakis. 2019. Multi-view Relighting using a Geometry-Aware Network. ACM Trans. Graph. 38, 4, Article 78 (July 2019), 14 pages. https://doi.org/10.1145/3306346.3323013
1 INTRODUCTION
Changing the illumination of an outdoor image is a notoriously difficult problem that requires the lighting to be modified consistently across the image, and shadows to be removed and resynthesized for the new sun position [Duchêne et al. 2015; Tchou et al. 2004; Yu et al. 1999]. Cast shadows are particularly challenging because an occluder can be arbitrarily far from the point it shadows, or even out of view.
The basic premise of our approach is to use multi-view information and approximate 3D geometry to reason about non-local lighting interactions and guide the relighting task. We introduce the first learning-based algorithm that can relight multi-view datasets of outdoor scenes (Fig. 1), which have become a commodity thanks to smartphone cameras, large-scale internet photo collections and drone cinematography. Our model uses a neural network designed to exploit geometric cues. It includes a careful treatment of cast shadows and is trained solely on realistic synthetic renderings.
[Figure 2 panels: (a) multi-view dataset; (b) 3D proxy & source/target lighting parameters; (c) reference input; (d) illumination buffers; (e) RGB shadow images; (f) refined shadow masks; (g) relit output.]
Fig. 2. Overview of our approach. Left: We use off-the-shelf stereo to create a 3D geometric proxy of the scene (b). The geometry is encoded as illumination buffers (d) and used to create RGB shadow images (e) that are independently refined by two networks (f), helping the final relighting network remove and re-synthesize shadows, and change the illumination (g) according to the desired novel lighting condition. Right: We train our model with synthetic data, including accurate ground truth geometry and renderings and an approximate proxy, created using synthetic renderings instead of photos. These two representations of the training scene allow the network to accurately refine shadows, enabling plausible relighting.
Our method has several applications: it allows automatic creation of a “time-lapse” effect by dynamically relighting drone videos (Fig. 1(a)). Or, if we only have a single photo, we can access online photos of the same place to relight the input photo (Fig. 1(b)). We can also relight images in traditional multi-view pipelines, e.g., Image-Based Rendering (IBR) or photogrammetry (Fig. 15).
Previous methods have difficulty with the type of input we target. Inverse-illumination methods [Loscos et al. 1999; Yu et al. 1999] cannot handle the approximate geometry of the proxy, while single-image relighting solutions struggle with cast shadows [Luan et al. 2017; Shih et al. 2013]. Finally, our solution significantly outperforms neural-network baselines (Sec. 5.2).
Method Overview. Given a set of images captured from multiple viewpoints (Fig. 2a), we start by building an approximate representation of the scene’s geometry (a proxy) using off-the-shelf stereo [RealityCapture 2016; Snavely et al. 2006] (Fig. 2b). We can relight any reference view of that scene (Fig. 2c); this could be one of the input images or a novel view obtained by IBR. The user provides a target illumination by specifying a sun direction vector and a scalar “cloudiness” level (or a sequence of such parameters for “time-lapse” effects). From the proxy, we then compute image-space buffers (Fig. 2d: normal maps, specular reflection direction, etc.) and shadow masks for the source and target illuminations. We perform relighting by training a neural network to map from the reference image, with extra buffers and shadow masks, to the novel lighting condition.
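To make the lighting interface concrete, the sketch below shows one way to represent the user-specified target illumination and to generate a parameter sequence for the “time-lapse” effect. This is an illustrative Python sketch under our own assumptions: the dataclass fields and the azimuth-sweep parameterization are hypothetical, not the paper’s actual API.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TargetLighting:
    sun_direction: np.ndarray  # unit 3-vector toward the sun
    cloudiness: float          # scalar level: 0 = clear, 1 = fully overcast

def timelapse_sequence(n_frames, elevation_deg=35.0):
    """Yield sun directions sweeping in azimuth at a fixed elevation,
    one TargetLighting per output frame of the "time-lapse" effect."""
    el = np.radians(elevation_deg)
    for az in np.linspace(0.0, np.pi, n_frames):
        d = np.array([np.cos(az) * np.cos(el),   # x (e.g., east)
                      np.sin(az) * np.cos(el),   # y (e.g., north)
                      np.sin(el)])               # z (up)
        yield TargetLighting(sun_direction=d, cloudiness=0.0)
```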
The importance of accurate shadow estimation for shadow removal has been previously demonstrated [Duchêne et al. 2015; Gryka et al. 2015; Guo et al. 2011]. But reconstruction errors in the proxy often lead to inaccurate masks a network cannot trust. This motivates our network design: we decompose our model into three sub-networks (Fig. 2). Two modules refine the source (resp. target) shadow masks (Fig. 2f) while the third implements the final relighting (Fig. 2g). The sub-networks are trained jointly but with different supervision: respectively ground truth shadow masks and ground truth relit images. Furthermore, instead of computing standard shadow masks from the proxy, we introduce RGB shadow images (Fig. 2e). These shadow images re-project colors from the shadow-casting geometry from all viewpoints into pixels in shadow, helping the network identify erroneously reconstructed shadow casters from the reprojected color (Fig. 4, 5).
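The joint supervision can be summarized as the sum of three losses, L = L_relight + L_src + L_tgt (Eq. 2 of the paper): the relighting term is supervised by ground truth relit images, the two refinement terms by ground truth shadow masks. The sketch below uses plain L1 distances as a stand-in; the paper’s exact per-term losses are not reproduced here.

```python
import torch.nn.functional as F

def joint_loss(pred_relit, gt_relit,
               pred_s_src, gt_s_src,
               pred_s_tgt, gt_s_tgt):
    # L = L_relight + L_src + L_tgt (Eq. 2); using L1 for each term is an
    # illustrative assumption, not the paper's stated choice.
    return (F.l1_loss(pred_relit, gt_relit)
            + F.l1_loss(pred_s_src, gt_s_src)
            + F.l1_loss(pred_s_tgt, gt_s_tgt))
```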
For supervised training, we need data corresponding to different lighting conditions of the exact same views, which is hard to capture with real photos. Instead, we use professionally-modeled, realistic synthetic scenes to generate physically-based renderings with many different viewpoints and lighting conditions. We introduce a flexible compositing methodology to generate a large variety of illuminations on-the-fly at training time. This avoids the combinatorial explosion in the number of images to render. Synthetic scenes also give us ground truth shadow masks.
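Because light transport is additive, renderings decomposed into per-light layers can be recombined linearly at training time to produce new illuminations without re-rendering. The sketch below illustrates this idea with a sun layer and a sky layer scaled by new colors; the authors’ actual layer set and scaling scheme may differ.

```python
import numpy as np

def composite_illumination(sun_layer, sky_layer, sun_color, sky_color):
    """Recombine pre-rendered lighting layers into a new illumination.

    sun_layer, sky_layer: (H, W, 3) linear-radiance renderings of the same
    view lit only by the sun and only by the sky, respectively.
    sun_color, sky_color: length-3 RGB scale factors for the target
    condition (e.g., dimming the sun and tinting the sky for dusk).
    """
    return sun_layer * np.asarray(sun_color) + sky_layer * np.asarray(sky_color)
```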
To train the shadow refinement, real data is impossible to capture, and we cannot directly use the ground truth shadows cast by the synthetic geometry. A model trained with these perfectly accurate shadows would not generalize to real photographs, since it would never have seen the reconstruction errors of the stereo-based proxy. Instead, we generate an approximate 3D proxy for each synthetic training scene using stereo on renderings, from which we obtain the input illumination buffers and (inaccurate) RGB shadow images. The ground truth shadow masks are used as targets to supervise the refinement sub-networks. This approach makes our model robust to 3D reconstruction errors at test time and limits the generalization gap between real and synthetic data.
Contributions. In summary, we make the following contributions:
- An end-to-end learning method for multi-view relighting of outdoor scenes, guided by image-space buffers, namely shadow masks and illumination buffers, that are computed from a geometry proxy.
- A learning-based shadow refinement solution to remove and resynthesize shadows. It uses the input images as well as our newly-introduced RGB shadow images to overcome reconstruction errors in the proxy.
- A training procedure that uses realistic synthetic scenes to flexibly generate multiple lighting conditions. Critically, we create a stereo-based proxy for each training scene which, together with the ground truth geometry, enables supervised learning for shadow refinement.
Although it is entirely trained on synthetic images, our algorithm generalizes to real multi-view datasets, and can modify the lighting in a much wider range of illumination conditions than previous methods (e.g., [Duchêne et al. 2015]). We evaluate our approach on real multi-view datasets, and show a variety of applications (Fig. 1, 13, 16).
2 RELATED WORK
Our method builds on several different areas. We first discuss traditional methods for single-image and multi-view relighting. One major challenge for relighting is the careful treatment of shadows. Our method removes and re-synthesizes shadows; we thus review the shadow removal literature. We also briefly review some aspects of image-to-image transformation research related to our solution.
2.1 Image-Based Relighting
Image-based relighting methods try to change the lighting conditions of an input image or a set of images. Early work [Loscos et al. 1999; Marschner and Greenberg 1997; Yu et al. 1999] used laser scans or early user-assisted reconstruction algorithms to estimate geometry, reflectance, and/or environment lighting. Inverse global illumination is then used for relighting. More involved capture setups such as the Light Stage [Debevec et al. 2000; Wenger et al. 2005] allow for production-quality relighting, with wide-ranging applications in the film industry. In contrast, we target casual capture with a single camera (DSLR, phone or drone), providing approximate 3D geometry, which is most often unsuitable for inverse rendering methods.
Estimating the lighting environment in an image is an important step in relighting, with many proposed solutions (e.g., [Debevec 2002; Hold-Geoffroy et al. 2017; Lalonde et al. 2009a; Stumpfel et al. 2004]). Similarly, several reflectance estimation techniques have been proposed to assist relighting [Masselus et al. 2003, 2004]. Webcam sequences have also been used for relighting [Lalonde et al. 2009b; Sunkavalli et al. 2007], although cast shadows often require manual layering. Alternatively, online digital terrain and urban models registered to images can be used for approximate relighting [Kopf et al. 2008]. None of these methods satisfies all our requirements, i.e., plausible multi-view relighting including cast shadows for outdoor scenes using casual capture.
Another widely developed area of image relighting focuses on images of faces (e.g., [Peers et al. 2007; Wang et al. 2009; Wen et al. 2003]). The specific nature of face geometry and reflectance results in solutions that are not well adapted to the outdoor scenes we target.
Some methods target realistic object editing or compositing in single images [Karsch et al. 2011; Kholgade et al. 2014]. These methods give good results, but they do not address major lighting changes, like editing cast shadows. They also require significant effort from the user to annotate the scene.
Several methods for multi-view image relighting have been developed, both for the case of multiple images sharing a single lighting condition [Duchêne et al. 2015], and for images of the same location with multiple lighting conditions (typically from internet photo collections) [Laffont et al. 2012; Xu et al. 2018]. For the single lighting condition, Duchêne et al. [2015] first perform shadow classification and intrinsic decomposition using separate optimization steps. Despite impressive results, artifacts remain, especially around shadow boundaries, and the relighting method fails beyond limited shadow motion. Our learning solution avoids the pitfalls of these optimization methods, and allows much larger sun motion (Section 5.3) as well as treating video sequences.
2.2 Intrinsic images, shadow estimation and removal
Intrinsic image decomposition and shadow removal methods are closely related to relighting. The classic Retinex work [Land and McCann 1971] inspired the intrinsic decomposition method of Weiss [2001], which used time-lapse sequences to compute shadow-free reflectance images. Many previous methods exist to explicitly detect and remove shadows, both in graphics and computer vision. See Sanin et al. [2012] for a survey. Most such methods operate on a single image, for example the work of Finlayson et al. [2006], which works well on shadows of relatively simple isolated objects. Other approaches include Lalonde et al. [2010], which uses Conditional Random Fields to detect the shadow, or Mohan et al. [2007], which is a gradient-based solution for shadow removal. These methods typically do not address relighting, which is our main goal. User-assisted methods have also been developed [Shor and Lischinski 2008; Wu et al. 2007], but our automated approach is more practical for multi-view datasets.
Even before the massive adoption of deep CNNs, learning methods were proposed to remove shadows from images. The method of Guo et al. [2011] detects pairs of points in shadow/light using a learning approach, and subsequently removes shadows with an optimization. More recently, deep learning has been used for shadow removal [Qu et al. 2017], using pretrained features and global and local information. Generative Adversarial Networks (GANs) have also been used for shadow detection and removal, e.g., conditional GANs [Wang et al. 2018], where a first GAN learns to generate the shadow mask, which is then used by a second network to remove shadows. As with previous shadow removal methods, relighting is not addressed in this work. Recent deep learning methods achieve good results for shadow removal, but most often do not address moving shadows (especially cast shadows) or changing the overall lighting conditions. Handling such changes in lighting is a much more complex problem; our solution uses geometry and synthetic training data, achieving plausible relighting with cast shadows. We provide comparisons with baseline methods using such solutions in Section 5.2.
2.3 Deep learning for image-to-image transformations
The Pix2Pix method [Isola et al. 2017] uses a U-net [Ronneberger et al. 2015] to perform many different image transformation tasks with remarkable success, even though the quantity of training data is quite low compared to other methods. Similarly, ResNet-like architectures [He et al. 2016] have been particularly successful in large image transformation tasks [Zhu et al. 2017], thanks to the residual blocks that preserve useful information in the network. There has been a body of work on transforming images, including day-to-night changes [Liu et al. 2017]. While impressive, the results of these methods, typically generated by GANs, are lacking in consistency and ease of control. Finally, there has also been work on face or body relighting using deep learning (e.g., [Kanamori and Endo 2018; Shu et al. 2017]); as with older methods, the specific technical choices for faces or bodies result in methods that are not necessarily adapted to relighting of outdoor scenes, especially since the extent of outdoor scenes results in many more non-local effects.
3 GEOMETRY-AWARE RELIGHTING NETWORK
Our relighting solution is built around a neural network that takes one image from a multi-view dataset and a set of corresponding image-space buffers as input, and produces a new image with the lighting altered. We identified three key difficulties in successfully implementing this image transformation: modeling the illumination changes (color, intensity, etc.), and removing and resynthesizing cast shadows.
To overcome these difficulties, our learning solution exploits a geometric 3D proxy which we obtain by first calibrating the input virtual cameras using structure from motion (SfM) [Snavely et al. 2006], then running a Multi-View Stereo algorithm [Goesele et al. 2007; RealityCapture 2016]. Fig. 3 illustrates this procedure.
Because our CNN operates in the image domain, we encode the geometry and lighting parameters as image-space illumination buffers B. These include normal maps, per-pixel specular reflection direction, etc. (see Section 3.3). In our ablation study, we found these buffers to be instrumental in synthesizing plausible novel illuminations (Section 5.5).
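As an illustration of one such buffer, the sketch below computes a per-pixel mirror reflection of the view ray about the surface normal, r = v - 2(v·n)n. Whether this matches the paper’s exact definition of its specular-direction buffer is an assumption; the formula itself is the standard reflection identity.

```python
import numpy as np

def specular_direction_buffer(normals, view_dirs):
    """Per-pixel mirror reflection r = v - 2 (v . n) n.

    normals:   (H, W, 3) unit surface normals from the proxy.
    view_dirs: (H, W, 3) unit directions from the camera to the surface.
    Returns an (H, W, 3) buffer of unit reflection directions.
    """
    dot = np.sum(view_dirs * normals, axis=-1, keepdims=True)
    r = view_dirs - 2.0 * dot * normals
    return r / np.maximum(np.linalg.norm(r, axis=-1, keepdims=True), 1e-8)
```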
Furthermore, the proxy gives us a particularly powerful means to guide the shadow removal and re-synthesis process. We use it to obtain two shadow masks, S_src and S_tgt, corresponding to the source and target sun directions respectively, by running a shadow-casting algorithm. If the geometry were perfect, these masks would tell the network precisely which pixels to brighten (resp. darken). However, because of errors in the stereo reconstruction, the masks typically contain significant artifacts and misalignments with respect to the actual shadows in the image.
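A shadow-casting pass over the proxy can be as simple as testing, for each pixel’s 3D point, whether the ray toward the sun is blocked by proxy geometry. The brute-force sketch below (Möller-Trumbore ray/triangle tests) illustrates the idea; the paper’s exact algorithm is not specified here, and a practical implementation would use an acceleration structure or GPU shadow mapping.

```python
import numpy as np

def ray_hits_triangle(origin, direction, tri, eps=1e-8):
    """Moller-Trumbore ray/triangle intersection test (True on hit)."""
    v0, v1, v2 = tri
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:
        return False                       # ray parallel to triangle
    inv = 1.0 / det
    t = origin - v0
    u = np.dot(t, p) * inv
    if u < 0.0 or u > 1.0:
        return False
    q = np.cross(t, e1)
    v = np.dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return False
    return np.dot(e2, q) * inv > eps       # hit strictly in front of origin

def shadow_mask(points, sun_dir, proxy_triangles):
    """Binary shadow mask for per-pixel 3D points (H, W, 3): a pixel is in
    shadow if its ray toward the sun hits any proxy triangle."""
    h, w, _ = points.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            mask[y, x] = any(
                ray_hits_triangle(points[y, x], sun_dir, tri)
                for tri in proxy_triangles)
    return mask
```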
While coarse masks are better than no shadow mask at all (see Section 5.5), we found that the success of the shadow removal procedure strongly depends on the quality of S_src. Similarly, the shadow re-synthesis suffers from errors in S_tgt. This led us to build an explicit shadow refinement step within our pipeline. We guide the refinement step by introducing RGB shadow images. These maps use color information from all images in the multi-view dataset to provide hints to the CNN on reconstruction inaccuracies.
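For reference, the paper weights the contribution of view i to the color of a pixel in the RGB shadow image as w_i = 1 / (||x_o - p_i(x_o)||² · |1 + c_iᵀ d_sun|² + ε), where c_i is the unit direction from camera i to the point x_o, p_i(x_o) is the first intersection of that camera ray with the proxy, and ε = 1e-5 (its Eq. 1). A direct transcription in Python; the surrounding gather loop is left out:

```python
import numpy as np

def shadow_image_weight(x_o, cam_center, p_i, d_sun, eps=1e-5):
    """Weight of view i at point x_o (Eq. 1 of the paper):
    w_i = 1 / (||x_o - p_i(x_o)||^2 * |1 + c_i . d_sun|^2 + eps).

    x_o:        (3,) 3D point whose reprojected color is being gathered.
    cam_center: (3,) optical center of camera i.
    p_i:        (3,) first intersection of the ray from camera i through
                x_o with the proxy.
    d_sun:      (3,) unit sun direction.
    """
    c_i = x_o - cam_center
    c_i = c_i / np.linalg.norm(c_i)          # unit direction camera -> x_o
    dist2 = float(np.sum((x_o - p_i) ** 2))  # squared reprojection distance
    return 1.0 / (dist2 * (1.0 + np.dot(c_i, d_sun)) ** 2 + eps)
```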
Fig. 3. (a) Our method takes as input a set of photos of an outdoor scene, shot from varying viewpoints (in this example 140). (b) We calibrate the cameras (shown in green) and build a 3D proxy of the scene using MVS. This reconstruction is approximate, as can be seen from the multiple holes (white) and erroneous over-reconstruction (e.g., blobs around palm trees with reconstructed sky). Our model learns to account for this uncertainty and generalizes well at test time.
Our overall model can thus be divided into three sub-components (Fig. 2). Two sub-networks independently refine the shadow masks S_src and S_tgt, and a third implements the final relighting given the illumination buffers and the refined shadow masks. The three components are trained jointly in an end-to-end, supervised fashion, using a training set of synthetic scenes. Our dataset contains ground truth source/target images, and approximate/ground truth shadow mask pairs.
3.1 Overall architecture
At a high level, our network is the composition of three sub-networks: two for the source (resp. target) shadow refinement tasks and one for relighting (Fig. 2). The refinement networks both take the RGB shadow images (Section 3.2.1) and the input images, and predict refined greyscale shadow masks. These two refined shadow masks, along with the illumination buffers, are sent to the relighting sub-network, which infers the target sun condition image and an overcast image. This 3-step approach is supported by recent results (e.g., [Wang et al. 2018]) showing that decomposing shadow detection and removal into two consecutive subtasks within the same network greatly improves quality. The overall architecture of our network is shown in Fig. 2; we use a ResNet [He et al. 2016; Johnson et al. 2016] for the shadow refinement and the relighting modules [Zhu et al. 2017]. We also experimented with a U-net-like architecture [Isola et al. 2017], which gave marginally inferior results. Our network outputs two images: the relit target image, and a “cloudy” rendering which we use to produce different degrees of overcast lighting conditions (Section 5.6.1).
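The composition just described can be sketched structurally as follows. The channel counts and the `make_resnet` generator factory are placeholders (assumptions), not the authors’ published configuration; the sketch only mirrors the described dataflow of two refinement networks feeding one relighting network.

```python
import torch
import torch.nn as nn

class RelightingModel(nn.Module):
    """Structural sketch of the three-stage pipeline. `make_resnet(in_ch,
    out_ch)` should build a Johnson-style ResNet generator; its definition
    is omitted here."""

    def __init__(self, make_resnet, n_buffer_ch=9):
        super().__init__()
        # Shadow refinement: image (3) + RGB shadow image (3) -> greyscale mask (1).
        self.refine_src = make_resnet(in_ch=6, out_ch=1)
        self.refine_tgt = make_resnet(in_ch=6, out_ch=1)
        # Relighting: image (3) + two refined masks (2) + illumination buffers.
        self.relight = make_resnet(in_ch=5 + n_buffer_ch, out_ch=6)

    def forward(self, image, rgb_shadow_src, rgb_shadow_tgt, buffers):
        s_src = self.refine_src(torch.cat([image, rgb_shadow_src], dim=1))
        s_tgt = self.refine_tgt(torch.cat([image, rgb_shadow_tgt], dim=1))
        out = self.relight(torch.cat([image, s_src, s_tgt, buffers], dim=1))
        relit, cloudy = out[:, :3], out[:, 3:]  # relit target + "cloudy" image
        return relit, cloudy, s_src, s_tgt
```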
3.2 Shadow refinement with RGB shadow images
Strong shadow cues are central to the shadow removal and re-synthesis process (see Section 5.5 for a comparison).
References
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In CVPR.
Related Papers (5)
Frequently Asked Questions (12)
Q1. How long does it take to create a supervised shadow refinement subnetwork?

The use of synthetic data allows generation of highly diverse ground truth data, and the creation of a proxy representation in addition to ground truth geometry for each synthetic scene allows supervised training for shadow refinement. 

The ground truth geometry and materials are used to render the sun and sky layers, and to create the ground truth greyscale shadow masks. 

The basic premise of their approach is to use multi-view information and approximate 3D geometry to reason about non-local lighting interactions and guide the relighting task. 

4.2 Photo-realistic rendering, layer decomposition and compositing-based data augmentationPath-tracing complex outdoors sceneswith a physically-based sun/sky model is expensive: rendering a converged image at 1024×768 takes about 10 minutes on their 400-core cluster. 

For shadow refinement to be successful at test time, the network needs to learn the mapping between approximate proxy shadows and ground truth shadows at training time. 

The weight for the contribution of a given image i to the color of a pixel in the RGB shadow image is computed as:1 | |xo − pi (xo)| |22 · |1 + c ᵀ i dsun |2 + ϵ , (1)where ci is a unit vector giving the direction from camera i to xo, pi (xo) ∈ R3 is the first intersection of the camera ray defined by ci with the proxy (Fig. 5) and ϵ = 1e−5. 

Since the authors want to change the lighting, the target masks are generally not aligned with the shadows in the input image, making the problem inherently more ambiguous. 

The source shadow refinement process uses the actual boundary in the input image, giving better overall results compared to the target shadow refinement (Fig. 4, (e)).3.2.2 RGB shadow images. 

The three sub-modules of their network are trained jointly in a supervised manner to minimize the sum of three losses:L = Lrelight + Lsrc + Ltgt. (2) These loss functions compare the accuracy of their network’s predictions (the final relit image as well as both intermediate refined shadow masks) to synthetic ground truth, which the authors detail in Section 4. 

These shadow images re-project colors from the shadow-casting geometry from all viewpoints into pixels in shadow, helping the network identify erroneously reconstructed shadow casters from the reprojected color (Fig. 4,5). 

Other approaches include Lalonde et al. [2010] which uses Conditional Random Fields to detect the shadow, or Mohan et al. [2007] which is a gradient-based solution for shadow removal. 

To bypass these issues, the authors use synthetic training data and render photo-realistic images using the Mitsuba [Jakob 2010] pathtracer.