
SESAME: Semantic Editing of Scenes by
Adding, Manipulating or Erasing Objects
Evangelos Ntavelis (1,2), Andrés Romero (1), Iason Kastanis (2), Luc Van Gool (1,3), and Radu Timofte (1)

(1) Computer Vision Lab, ETH Zurich, Switzerland
(2) Robotics and Machine Learning, CSEM SA, Switzerland
(3) PSI, ESAT, KU Leuven, Belgium
Fig. 1. We assess SESAME on three tasks: (a) image editing with free-form semantic drawings (first row), (b) semantic layout driven semantic editing (second row), and (c) layout-to-image generation with the SESAME discriminator (third row).
Abstract. Recent advances in image generation gave rise to powerful tools for semantic image editing. However, existing approaches can either operate on a single image or require an abundance of additional information. They are not capable of handling the complete set of editing operations, that is, addition, manipulation or removal of semantic concepts. To address these limitations, we propose SESAME, a novel generator-discriminator pair for Semantic Editing of Scenes by Adding, Manipulating or Erasing objects. In our setup, the user provides the semantic labels of the areas to be edited and the generator synthesizes the corresponding pixels. In contrast to previous methods that employ a discriminator that trivially concatenates semantics and image as an input, the SESAME discriminator is composed of two input streams that independently process the image and its semantics, using the latter to manipulate the results of the former. We evaluate our model on a diverse set of datasets and report state-of-the-art performance on two tasks: (a) image manipulation and (b) image generation conditioned on semantic labels.
arXiv:2004.04977v1 [cs.CV] 10 Apr 2020

Keywords: Generative Adversarial Networks, Interactive Image Editing, Image Synthesis
1 Introduction
Image editing is a challenging task that has received increasing attention in the media, movies and social networks. Since the early 90s, tools like Gimp [38] and Photoshop [36] have been extensively utilized for this task. Yet, both require high-level expertise and are labour intensive. Generative Adversarial Networks (GANs) [10] provide a learning-based alternative able to assist non-experts in expressing their creativity when retouching photographs. GANs have been able to produce results of high photo-realistic quality [19,20]. Despite their success in image synthesis, their applicability to image editing is still not fully explored. Being able to manipulate images is a crucial task for many applications such as autonomous driving [15] and industrial imaging [7], where data augmentation boosts the generalization capabilities of neural networks [1,49,9].
Image manipulation has been used in the literature to refer to various tasks. In this paper, we follow the formulation of Bau et al. [3] and define the task of semantic image editing as the process of adding, altering and removing instances of certain classes or semantic concepts in a scene. Examples of such manipulations include, but are not limited to: removing a car from a road scene, changing the size of the eyes of a person, adding clouds to the sky, etc. We use the term semantic concepts to refer to various class labels that cannot be identified as objects, e.g., mountains, grass, etc.
Training neural networks for visual editing is not a trivial task. It requires a high level of understanding of the scene, the objects, and their interconnections [45]. Any region of an image that is added or removed should look realistic and should also fit harmoniously with the rest of the scene. In contrast to image generation, the co-existence of real and fake pixels makes the fake pixels more detectable, as the network cannot take the "easy route" of generating simple textures and shapes or even omit a whole class of objects [4]. Moreover, the lack of natural image datasets in which a scene is captured both with and without an object makes it impossible to train such models in a supervised manner.
One way to circumvent this problem is by inpainting the regions of an image we seek to edit. Following this scheme, we mask out and remove all the pixels we want to manipulate. Recent works [55,32,37,16] improve upon this approach by incorporating sketch and color inputs to further guide the generation of the missing areas and thus provide higher-level control. However, inpainting can only tackle some aspects of semantic editing. To address this limitation, Hong et al. [12] manipulate the semantic layout of an image and subsequently utilize it for inpainting the image. Yet, this approach requires access to the full semantic information of the image, which is costly to acquire.
To this end, we propose SESAME, a novel semantic editing architecture based
on adversarial learning, able to manipulate images based on a semantic input. In
particular, our method is able to edit images with pixel-level guidance of semantic labels, permitting full control over the output. Note that our method requires the semantics only for the regions to be edited. The generator seeks to synthesize an
altered image such that the synthesized pixels comply with both the context and
the user input. Moreover, we propose a new approach for semantics-conditioned
discrimination, by utilizing two independent streams to process the input image
and the corresponding semantics. We use the output of the semantics stream
to manipulate the output of the image stream. We employ visual results along
with quantitative analysis and a human study to validate the performance and
flexibility of the proposed approach.
2 Related Work
Generative Adversarial Networks [10] have completely revolutionized a great variety of computer vision tasks such as image generation [20,19,30], super resolution [48,27], image attribute manipulation [28,40] and image editing [12,3]. While in their original formulation GANs were only capable of generating samples drawn from a random distribution [10], multiple models soon emerged that are able to perform conditional image synthesis [29,33]. This gave rise to approaches that employ a generative model for producing outputs conditioned on different types of information. These approaches target multiple levels of abstraction and locality of the features we seek to encapsulate in the output. For example, [29,31,57,5] focus on representing images characterized by a single label. In a different setting, [39,58,59,52] employ a text-to-image pipeline to provide a high-level description of the corresponding image. Recently, many methods utilize information from a scene graph [18,2] or sketches with color [42] to represent where objects should be positioned in the output image.
A more fine-grained approach aims to translate semantic maps, which carry pixel-wise information, into realistic-looking images [14,47,35,22]. For all the aforementioned models, the user can control the output image by altering the conditional information. Nonetheless, they are not suitable for manipulating an existing image, as they do not take an image as an input.
User-guided semantic image editing is the task where the user is able to semantically edit an image by adding, manipulating or removing semantic concepts [3]. Both GANPaint [3] and SinGAN [43] are able to perform such operations: GANPaint [3] by manipulating neuron activations and SinGAN [43] by learning the internal statistics of a single image. However, both are trained on a single image and require retraining in order to be applied to another, while our model is able to handle manipulation of multiple images without retraining.
Another line of work is inpainting [13,56,25], where the user masks a region of the image for removal and the network fills it in according to the image context. This can be interpreted as a simple form of editing, but the user has no control over the generated pixels. To address this, other works guide the generation of the missing areas using edge [55,32] and/or color [37,16] information.

Recently, researchers have shifted their attention to more semantics-aware approaches for inpainting, focusing on object addition and removal. Shetty et al. [44], for instance, propose a two-stage architecture to address removal operations, using an auxiliary network that predicts the masks of the objects during training, while, at inference, users provide their own masks. Note that their model cannot handle the generation of new objects. Another line of work tackles object synthesis by utilizing semantic layout information, which provides fine-grained guidance over the manipulation of an image. Yet, a subset of these methods is limited to generating objects from a single class [34,51] or to placing prior, fixed objects on the semantics plane [23]. Hong et al. [12] are able to handle both addition and removal, but require the full semantic information of the scene to produce even the smallest change to an image. In contrast, our method requires only the semantics of the region to be edited.
The majority of the aforementioned works rely on adversarial learning to tackle the problem of image editing, primarily focusing on adjusting the generator. Most recent models use a PatchGAN variant [14], which is able to discriminate based on the high frequencies of the image. This is a desirable property, as conventional losses like Mean Squared Error and Mean Absolute Error can only convey information about the lower frequencies to the generator. PatchGAN can also be used for the generation of images conditioned on semantic maps, similar to our case. Previous works targeting a similar problem concatenate the semantic information to the image and use the result as input to the discriminator. However, the conditional generation literature suggests that concatenation is not the optimal approach for conditional discrimination [39,33,31]. In this work, we extend PatchGAN to better incorporate conditional information by processing it separately from the image input. At a later stage of the network, the two processed streams are merged to produce the final output of the discriminator.
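For concreteness, the sketch below illustrates the conventional concatenation-based conditioning referred to above: the semantics are simply stacked onto the image along the channel axis before being passed to a standard PatchGAN discriminator. The function name and tensor shapes are illustrative assumptions, not code from the paper.

```python
import torch


def concat_conditional_input(image, sem_onehot):
    """Baseline conditioning: stack image and one-hot semantics along the
    channel axis and feed the result to an unmodified PatchGAN discriminator."""
    # image: (B, 3, H, W), sem_onehot: (B, C, H, W)
    return torch.cat([image, sem_onehot], dim=1)  # (B, 3 + C, H, W)
```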
3 SESAME
In this work we describe a deep learning pipeline for semantically editing images using conditional Generative Adversarial Networks (cGANs). Given an image $I_{real}$ and a semantic guideline of the regions that should be altered by the network, denoted by $M_{sem}$, we want to produce a realistic output $I_{out}$. The real pixel values corresponding to $M_{sem}$ are removed from the input image. The generated pixels in their place should be both true to the semantics dictated by the mask and coherent with the rest of the pixels of $I_{real}$. In order to achieve this, our network is trained end-to-end in an adversarial manner. The generator is an encoder-decoder architecture with dilated convolutions [53] and SPADE [35] layers, and the discriminator is a two-stream patch discriminator; both are described in this section.
SESAME Generator. Semantically editing a scene is an image-to-image translation problem. We want to transform an image in which the RGB pixels of the regions to be edited have been substituted with a one-hot semantics vector.

Fig. 2. The SESAME generator aims to generate the pixels designated by the semantic mask so that they are both (1) true to their label and (2) a natural fit to the rest of the picture. It is an encoder-decoder architecture with dilated convolutions to increase the receptive field, as well as SPADE layers in the decoder to guide in-class generation.

From the generator's output, only the pixels in the masked-out regions are retained, while the rest are retrieved from the original image:
$$I_{gen} = G(I_m, M, M_{sem}), \qquad (1)$$

$$I_{out} = I_{gen} \cdot M + I_{real} \cdot (1 - M). \qquad (2)$$
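As a minimal illustration of Eqs. (1) and (2), the PyTorch-style sketch below assembles the masked input and composites generated and real pixels; the function and variable names are hypothetical and the actual SESAME implementation may differ.

```python
def composite_edit(generator, img_real, mask, sem_onehot):
    """Sketch of Eqs. (1)-(2): blend generated and real pixels.

    img_real:   (B, 3, H, W) real image I_real
    mask:       (B, 1, H, W) binary editing mask M (1 = region to synthesize)
    sem_onehot: (B, C, H, W) one-hot semantics M_sem of the edited region
    generator:  any callable implementing G(I_m, M, M_sem) -> I_gen
    """
    # Masked input I_m: the real pixels inside the edited region are removed.
    img_masked = img_real * (1.0 - mask)

    # Eq. (1): synthesize pixels from the masked image, the mask and the semantics.
    img_gen = generator(img_masked, mask, sem_onehot)

    # Eq. (2): keep generated pixels inside the mask, real pixels outside.
    img_out = img_gen * mask + img_real * (1.0 - mask)
    return img_out
```

Because only the pixels under the mask come from the generator, everything outside the edited region remains identical to the input image.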
This architecture should accomplish two goals: the generated pixels should (1) be coherent with their real neighboring ones and (2) be true to the semantic input. To achieve these goals we adapt our generator from the network proposed by Johnson et al. [17] to fill the gaps: two down-sampling layers, a semantic core made of multiple residual layers, and two up-sampling layers.

We conceptually divide our architecture into two parts: the encoder and the decoder. In the encoder we aim to extract the contextual information of the pixels we want to synthesize. In the decoder we inject the semantic information using Spatially-Adaptive De-normalization (SPADE) [35] blocks at every layer. As the area to be edited can span a large region, we would like the receptive field of our network to be relatively large. Thus, we use dilated convolutions in the last layers of the encoder and the first layers of the decoder. A scheme of our SESAME generator can be seen in Fig. 2; for further details refer to the supplementary materials.
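A rough, simplified sketch of such an encoder-decoder is given below: downsampling layers, a dilated core, and SPADE-conditioned upsampling layers. The exact layer counts, channel widths, normalization choices, the placement of the dilated convolutions and the SPADE block itself are assumptions and not the authors' published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSPADE(nn.Module):
    """Simplified SPADE layer: the semantic map predicts per-pixel scale/shift."""

    def __init__(self, channels, num_classes, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(num_classes, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, sem):
        sem = F.interpolate(sem, size=x.shape[2:], mode="nearest")
        h = self.shared(sem)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)


class SketchGenerator(nn.Module):
    """Encoder-decoder sketch: dilated core, SPADE-guided decoder."""

    def __init__(self, num_classes, base=64):
        super().__init__()
        in_ch = 3 + 1 + num_classes  # masked image + mask + partial semantics
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, base, 7, padding=3), nn.ReLU(),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(),      # down x2
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU(),  # down x2
            # dilated convolutions enlarge the receptive field over the edited region
            nn.Conv2d(base * 4, base * 4, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(base * 4, base * 4, 3, padding=4, dilation=4), nn.ReLU(),
        )
        self.spade1 = SimpleSPADE(base * 4, num_classes)
        self.up1 = nn.ConvTranspose2d(base * 4, base * 2, 3, stride=2, padding=1, output_padding=1)
        self.spade2 = SimpleSPADE(base * 2, num_classes)
        self.up2 = nn.ConvTranspose2d(base * 2, base, 3, stride=2, padding=1, output_padding=1)
        self.to_rgb = nn.Conv2d(base, 3, 7, padding=3)

    def forward(self, img_masked, mask, sem):
        x = self.encoder(torch.cat([img_masked, mask, sem], dim=1))
        x = F.relu(self.up1(self.spade1(x, sem)))
        x = F.relu(self.up2(self.spade2(x, sem)))
        return torch.tanh(self.to_rgb(x))
```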
SESAME Discriminator. Layout-to-image editing can be seen as a sub-task of label-to-image translation. Inspired by Pix2Pix [14], more recent approaches [47,35] employ a variation of the PatchGAN discriminator. The Markovian discriminator, as it is also called, was a paradigm shift that made the discriminator focus on the higher frequencies by limiting its attention to local patches, producing a separate fake/real prediction for each of them. The subsequent methods added a multi-scale discrimination
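The excerpt stops short of the full discriminator description, so the sketch below only illustrates the two-stream idea stated earlier: the image and its semantics are processed by independent streams, and the semantics features are then used to modulate the image features before a patch-wise real/fake prediction. The fusion mechanism, depths and channel widths are assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn


class TwoStreamPatchDiscriminator(nn.Module):
    """Sketch: separate image/semantics streams merged into patch-wise scores."""

    def __init__(self, num_classes, base=64):
        super().__init__()

        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            )

        self.image_stream = stream(3)
        self.sem_stream = stream(num_classes)
        self.head = nn.Conv2d(base * 4, 1, 4, padding=1)  # patch-wise prediction head

    def forward(self, image, sem):
        f_img = self.image_stream(image)
        f_sem = self.sem_stream(sem)
        # One possible fusion: the semantics features gate the image features,
        # so the semantics steer but do not replace the image evidence.
        fused = f_img + f_img * torch.sigmoid(f_sem)
        return self.head(fused)  # (B, 1, ~H/8, ~W/8) patch logits
```

Keeping the two streams separate until a late fusion is the design choice that distinguishes this setup from the concatenation-based conditioning sketched in Section 2.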

Citations
Posted Content

You Only Need Adversarial Supervision for Semantic Image Synthesis

TL;DR: This work proposes a novel, simplified GAN model, which needs only adversarial supervision to achieve high quality results, and re-designs the discriminator as a semantic segmentation network, directly using the given semantic label maps as the ground truth for training.
Proceedings ArticleDOI

Image Inpainting Guided by Coherence Priors of Semantics and Textures

TL;DR: Zhang et al. as discussed by the authors introduce coherence priors between the semantics and textures, which make it possible to concentrate on completing separate textures in a semantic-wise manner; they adopt a multi-scale joint optimization framework to first model the coherence priors and then alternately optimize image inpainting and semantic segmentation in a coarse-to-fine manner.
Posted Content

AIM 2020 Challenge on Image Extreme Inpainting

TL;DR: This paper reviews the AIM 2020 challenge on extreme image inpainting and presents the proposed solutions and results for two different tracks: classical image inpainting and semantically guided image inpainting.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Journal ArticleDOI

Image quality assessment: from error visibility to structural similarity

TL;DR: In this article, a structural similarity index is proposed for image quality assessment based on the degradation of structural information, which can be applied to both subjective ratings and objective methods on a database of images compressed with JPEG and JPEG2000.
Journal ArticleDOI

Generative Adversarial Nets

TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are trained simultaneously: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Proceedings ArticleDOI

Image-to-Image Translation with Conditional Adversarial Networks

TL;DR: Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems and it is demonstrated that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.
Journal ArticleDOI

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

TL;DR: This work addresses the task of semantic image segmentation with deep learning and proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales; it improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
Frequently Asked Questions (13)
Q1. What contributions have the authors mentioned in the paper "Sesame: semantic editing of scenes by adding, manipulating or erasing objects" ?

To address these limitations, the authors propose SESAME, a novel generator-discriminator pair for Semantic Editing of Scenes by Adding, Manipulating or Erasing objects. In their setup, the user provides the semantic labels of the areas to be edited and the generator synthesizes the corresponding pixels. The authors evaluate their model on a diverse set of datasets and report state-of-the-art performance on two tasks: (a) image manipulation and (b) image generation conditioned on semantic labels.

As a future research direction, the authors plan to extend this work; image generation conditioned on other types of information, e.g., scene graphs, could also benefit from their two-stream discriminator. The authors will open-source the code and the models under the repository name OpenSESAME.

The authors train the Generator in an adversarial manner using the following losses: Perceptual Loss [17], Feature Matching Loss [41] and Hinge Loss [24,46,30] as the Adversarial Loss. 

For training the authors are using the Two Time-Scale Update Rule [11] to determine the scale between the learning rate of the generator and the discriminators, with lrgen = 0.0001 and lrdisc = 0.0004. 

The dataset contains 3,000 street-level view images of 50 different cities in Europe for the training set and 500 images for the validation set. 

In particular, their method is able to edit images with pixel-level guidance of semantic labels, permitting full control over the output.

The generator is an encoder-decoder architecture, with dilated convolutions [53] and SPADE [35] layers, and the discriminator is a two-stream patch discriminator; both are described in Section 3.

There are many approaches targeting multiple levels of abstraction and locality of features that the authors seek to encapsulate in the output. 

In order to showcase the benefits of their approach the authors ablate the performance of their architecture by varying (a) the generator architecture, (b) the discriminator architecture and (c) the available semantics, by utilizing either the Full semantic layout or the semantics of the rectangular region the authors want to edit, which the authors refer to as BBox Semantics. 

In this paper, the authors follow the formulation of Bau et al. [3], and define the task of semantic image editing as the process of adding, altering and removing instances of certain classes or semantic concepts in a scene.

The authors argue that the components of their proposed method work better together, as the large receptive field provided by the dilated convolutions in the generator synergizes well with the highly focused gradient flow coming from the discriminator.

Another line of work is tackling object synthesis by utilizing semantic layout information, which provides a fine-grained guidance over the manipulation of an image. 

To achieve these goals the authors adapt their generator from the network proposed by Johnson et al. [17] to fill the gaps: two down-sampling layers, a semantic core made of multiple residual layers and two up-sampling ones.