
SESAME: Semantic Editing of Scenes by
Adding, Manipulating or Erasing Objects
Evangelos Ntavelis (1,2), Andrés Romero (1), Iason Kastanis (2), Luc Van Gool (1,3), and Radu Timofte (1)

(1) Computer Vision Lab, ETH Zurich, Switzerland
(2) Robotics and Machine Learning, CSEM SA, Switzerland
(3) PSI, ESAT, KU Leuven, Belgium
Fig. 1. We assess SESAME on three tasks: (a) image editing with free-form semantic drawings (first row), (b) semantic layout driven semantic editing (second row), and (c) layout-to-image generation with the SESAME discriminator (third row).
Abstract. Recent advances in image generation gave rise to powerful tools for semantic image editing. However, existing approaches can either operate on a single image or require an abundance of additional information. They are not capable of handling the complete set of editing operations, that is, addition, manipulation or removal of semantic concepts. To address these limitations, we propose SESAME, a novel generator-discriminator pair for Semantic Editing of Scenes by Adding, Manipulating or Erasing objects. In our setup, the user provides the semantic labels of the areas to be edited and the generator synthesizes the corresponding pixels. In contrast to previous methods that employ a discriminator that trivially concatenates semantics and image as an input, the SESAME discriminator is composed of two input streams that independently process the image and its semantics, using the latter to manipulate the results of the former. We evaluate our model on a diverse set of datasets and report state-of-the-art performance on two tasks: (a) image manipulation and (b) image generation conditioned on semantic labels.
arXiv:2004.04977v1 [cs.CV] 10 Apr 2020

Keywords: Generative Adversarial Networks, Interactive Image Editing, Image Synthesis
1 Introduction
Image editing is a challenging task that has received increasing attention in the media, movies and social networks. Since the early 90s, tools like Gimp [38] and Photoshop [36] have been extensively utilized for this task. Yet, both require high-level expertise and are labour intensive. Generative Adversarial Networks (GANs) [10] provide a learning-based alternative able to assist non-experts in expressing their creativity when retouching photographs. GANs have been able to produce results of high photo-realistic quality [19,20]. Despite their success in image synthesis, their applicability to image editing is still not fully explored. Being able to manipulate images is a crucial task for many applications such as autonomous driving [15] and industrial imaging [7], where data augmentation boosts the generalization capabilities of neural networks [1,49,9].
Image manipulation has been used in the literature to refer to various tasks. In this paper, we follow the formulation of Bau et al. [3] and define the task of semantic image editing as the process of adding, altering and removing instances of certain classes or semantic concepts in a scene. Examples of such manipulations include, but are not limited to: removing a car from a road scene, changing the size of the eyes of a person, adding clouds to the sky, etc. We use the term semantic concepts to refer to various class labels that cannot be identified as objects, e.g., mountains, grass, etc.
Training neural networks for visual editing is not a trivial task. It requires a high level of understanding of the scene, the objects, and their interconnections [45]. Any region of an image that is added or removed should look realistic and should also fit harmoniously with the rest of the scene. In contrast to image generation, the co-existence of real and fake pixels makes the fake pixels more detectable, as the network cannot take the "easy route" of generating simple textures and shapes or even omit a whole class of objects [4]. Moreover, the lack of natural image datasets in which a scene is captured both with and without an object makes it impossible to train such models in a supervised manner.
One way to circumvent this problem is by inpainting the regions of an image we seek to edit. Following this scheme, we mask out and remove all the pixels we want to manipulate. Recent works [55,32,37,16] improve upon this approach by incorporating sketch and color inputs to further guide the generation of the missing areas and thus provide higher-level control. However, inpainting can only tackle some aspects of semantic editing. To address this limitation, Hong et al. [12] manipulate the semantic layout of an image and subsequently utilize it for inpainting the image. Yet, this approach requires access to the full semantic information of the image, which is costly to acquire.
To this end, we propose SESAME, a novel semantic editing architecture based
on adversarial learning, able to manipulate images based on a semantic input. In
particular, our method is able to edit images with pixel-level guidance of semantic labels, permitting full control over the output. Note that our method requires the semantics only for the regions to be edited. The generator seeks to synthesize an
altered image such that the synthesized pixels comply with both the context and
the user input. Moreover, we propose a new approach for semantics-conditioned
discrimination, by utilizing two independent streams to process the input image
and the corresponding semantics. We use the output of the semantics stream
to manipulate the output of the image stream. We employ visual results along
with quantitative analysis and a human study to validate the performance and
flexibility of the proposed approach.
2 Related Work
Generative Adversarial Networks [10] have completely revolutionized a great variety of computer vision tasks such as image generation [20,19,30], super resolution [48,27], image attribute manipulation [28,40] and image editing [12,3]. While in their original formulation GANs were only capable of generating samples drawn from a random distribution [10], multiple models soon emerged that are able to perform conditional image synthesis [29,33]. This gave rise to approaches that employ a generative model for producing outputs conditioned on different types of information. These approaches target multiple levels of abstraction and locality of the features we seek to encapsulate in the output. For example, [29,31,57,5] focus on representing images characterized by a single label. In a different setting, [39,58,59,52] employ a text-to-image pipeline to provide a high-level description of the corresponding image. Recently, many methods utilize information from a scene graph [18,2] or sketches with color [42] to represent where objects should be positioned in the output image.
A more fine-grained approach aims to translate semantic maps, which carry pixel-wise information, into realistic-looking images [14,47,35,22]. For all the aforementioned models, the user can control the output image by altering the conditional information. Nonetheless, they are not suitable for manipulating an existing image, as they do not take an image as an input.
User-guided semantic image editing is the task where the user is able to semantically edit an image by adding, manipulating or removing semantic concepts [3]. Both GANPaint [3] and SinGAN [43] are able to perform such operations: GANPaint [3] by manipulating neuron activations and SinGAN [43] by learning the internal statistics of a single image. However, both are trained on a single image and require retraining in order to be applied to another, while our model is able to handle manipulation of multiple images without retraining.
Another line of work is inpainting [13,56,25], where the user masks a region of the image for removal and the network fills it in according to the image context. This can be interpreted as a simple form of editing, but the user has no control over the generated pixels. To address this, other works guide the generation of the missing areas using edge [55,32] and/or color [37,16] information.

Recently, researchers have shifted their attention to more semantics-aware approaches for inpainting, focusing on object addition and removal. Shetty et al. [44], for instance, propose a two-stage architecture to address removal operations, using an auxiliary network that predicts the masks of the objects during training, while, at inference, users provide their own masks. Note that their model cannot handle the generation of new objects. Another line of work tackles object synthesis by utilizing semantic layout information, which provides fine-grained guidance over the manipulation of an image. Yet, a subset of these methods is limited to generating objects from a single class [34,51] or to placing prior, fixed objects on the semantics plane [23]. Hong et al. [12] are able to handle both addition and removal, but require the full semantic information of the scene to produce even the smallest change to an image. In contrast, our method requires only the semantics of the region to be edited.
The majority of the aforementioned works rely on adversarial learning to tackle the problem of image editing, primarily focusing on adjusting the generator. Most recent models use a PatchGAN variant [14], which is able to discriminate based on the high frequencies of the image. This is a desirable property, as conventional losses like Mean Squared Error and Mean Absolute Error can only convey information about the lower frequencies to the generator. PatchGAN can also be used for the generation of images conditioned on semantic maps, similar to our case. Previous works targeting a similar problem concatenate the semantic information to the image and use the result as input to the discriminator. However, the conditional generation literature suggests that concatenation is not the optimal approach for conditional discrimination [39,33,31]. In this work, we extend PatchGAN to better incorporate conditional information by processing it separately from the image input. At a later stage of the network, the two processed streams are merged to produce the final output of the discriminator.
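For concreteness, the sketch below illustrates the conventional concatenation-based conditioning referred to above: the semantics are simply stacked onto the image along the channel axis before being passed to a standard PatchGAN discriminator. The function name and tensor shapes are illustrative assumptions, not code from the paper.

```python
import torch


def concat_conditional_input(image, sem_onehot):
    """Baseline conditioning: stack image and one-hot semantics along the
    channel axis and feed the result to an unmodified PatchGAN discriminator."""
    # image: (B, 3, H, W), sem_onehot: (B, C, H, W)
    return torch.cat([image, sem_onehot], dim=1)  # (B, 3 + C, H, W)
```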
3 SESAME
In this work we describe a deep learning pipeline for semantically editing images using conditional Generative Adversarial Networks (cGANs). Given an image $I_{real}$ and a semantic guideline of the regions that should be altered by the network, denoted by $M_{sem}$, we want to produce a realistic output $I_{out}$. The real pixel values corresponding to $M_{sem}$ are removed from the input image. The generated pixels in their place should be both true to the semantics dictated by the mask and coherent with the rest of the pixels of $I_{real}$. In order to achieve this, our network is trained end-to-end in an adversarial manner. The generator is an encoder-decoder architecture with dilated convolutions [53] and SPADE [35] layers, and the discriminator is a two-stream patch discriminator; both are described in this section.
SESAME Generator. Semantically editing a scene is an image-to-image translation problem. We want to transform an image in which the RGB pixels of the regions to be edited have been substituted with a one-hot semantics vector.

Fig. 2. The SESAME generator aims to generate the pixels designated by the semantic mask so that they are both (1) true to their label and (2) a natural fit to the rest of the picture. It is an encoder-decoder architecture with dilated convolutions to increase the receptive field, as well as SPADE layers in the decoder to guide in-class generation.

From the generator's output, only the pixels in the masked-out regions are retained, while the rest are retrieved from the original image:
$$I_{gen} = G(I_m, M, M_{sem}), \qquad (1)$$

$$I_{out} = I_{gen} \cdot M + I_{real} \cdot (1 - M). \qquad (2)$$
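As a minimal illustration of Eqs. (1) and (2), the PyTorch-style sketch below assembles the masked input and composites generated and real pixels; the function and variable names are hypothetical and the actual SESAME implementation may differ.

```python
def composite_edit(generator, img_real, mask, sem_onehot):
    """Sketch of Eqs. (1)-(2): blend generated and real pixels.

    img_real:   (B, 3, H, W) real image I_real
    mask:       (B, 1, H, W) binary editing mask M (1 = region to synthesize)
    sem_onehot: (B, C, H, W) one-hot semantics M_sem of the edited region
    generator:  any callable implementing G(I_m, M, M_sem) -> I_gen
    """
    # Masked input I_m: the real pixels inside the edited region are removed.
    img_masked = img_real * (1.0 - mask)

    # Eq. (1): synthesize pixels from the masked image, the mask and the semantics.
    img_gen = generator(img_masked, mask, sem_onehot)

    # Eq. (2): keep generated pixels inside the mask, real pixels outside.
    img_out = img_gen * mask + img_real * (1.0 - mask)
    return img_out
```

Because only the pixels under the mask come from the generator, everything outside the edited region remains identical to the input image.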
This architecture should accomplish two goals: the generated pixels should (1) be coherent with their real neighboring ones and (2) be true to the semantic input. To achieve these goals we adapt our generator from the network proposed by Johnson et al. [17] to fill the gaps: two down-sampling layers, a semantic core made of multiple residual layers, and two up-sampling layers.

We conceptually divide our architecture into two parts: the encoder and the decoder. In the encoder we aim to extract the contextual information of the pixels we want to synthesize. In the decoder we inject the semantic information using Spatially-Adaptive De-normalization (SPADE) [35] blocks at every layer. As the area to be edited can span a large region, we would like the receptive field of our network to be relatively large. Thus, we use dilated convolutions in the last layers of the encoder and the first layers of the decoder. A scheme of our SESAME generator can be seen in Fig. 2; for further details refer to the supplementary materials.
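A rough, simplified sketch of such an encoder-decoder is given below: downsampling layers, a dilated core, and SPADE-conditioned upsampling layers. The exact layer counts, channel widths, normalization choices, the placement of the dilated convolutions and the SPADE block itself are assumptions and not the authors' published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSPADE(nn.Module):
    """Simplified SPADE layer: the semantic map predicts per-pixel scale/shift."""

    def __init__(self, channels, num_classes, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(num_classes, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, sem):
        sem = F.interpolate(sem, size=x.shape[2:], mode="nearest")
        h = self.shared(sem)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)


class SketchGenerator(nn.Module):
    """Encoder-decoder sketch: dilated core, SPADE-guided decoder."""

    def __init__(self, num_classes, base=64):
        super().__init__()
        in_ch = 3 + 1 + num_classes  # masked image + mask + partial semantics
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, base, 7, padding=3), nn.ReLU(),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(),      # down x2
            nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1), nn.ReLU(),  # down x2
            # dilated convolutions enlarge the receptive field over the edited region
            nn.Conv2d(base * 4, base * 4, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(base * 4, base * 4, 3, padding=4, dilation=4), nn.ReLU(),
        )
        self.spade1 = SimpleSPADE(base * 4, num_classes)
        self.up1 = nn.ConvTranspose2d(base * 4, base * 2, 3, stride=2, padding=1, output_padding=1)
        self.spade2 = SimpleSPADE(base * 2, num_classes)
        self.up2 = nn.ConvTranspose2d(base * 2, base, 3, stride=2, padding=1, output_padding=1)
        self.to_rgb = nn.Conv2d(base, 3, 7, padding=3)

    def forward(self, img_masked, mask, sem):
        x = self.encoder(torch.cat([img_masked, mask, sem], dim=1))
        x = F.relu(self.up1(self.spade1(x, sem)))
        x = F.relu(self.up2(self.spade2(x, sem)))
        return torch.tanh(self.to_rgb(x))
```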
SESAME Discriminator. Layout-to-image editing can be seen as a sub-task of label-to-image translation. Inspired by Pix2Pix [14], more recent approaches [47,35] employ a variation of the PatchGAN discriminator. The Markovian discriminator, as it is also called, was a paradigm shift that made the discriminator focus on the higher frequencies by limiting its attention to local patches, producing a separate fake/real prediction for each of them. The subsequent methods added a multi-scale discrimination
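The excerpt stops short of the full discriminator description, so the sketch below only illustrates the two-stream idea stated earlier: the image and its semantics are processed by independent streams, and the semantics features are then used to modulate the image features before a patch-wise real/fake prediction. The fusion mechanism, depths and channel widths are assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn


class TwoStreamPatchDiscriminator(nn.Module):
    """Sketch: separate image/semantics streams merged into patch-wise scores."""

    def __init__(self, num_classes, base=64):
        super().__init__()

        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            )

        self.image_stream = stream(3)
        self.sem_stream = stream(num_classes)
        self.head = nn.Conv2d(base * 4, 1, 4, padding=1)  # patch-wise prediction head

    def forward(self, image, sem):
        f_img = self.image_stream(image)
        f_sem = self.sem_stream(sem)
        # One possible fusion: the semantics features gate the image features,
        # so the semantics steer but do not replace the image evidence.
        fused = f_img + f_img * torch.sigmoid(f_sem)
        return self.head(fused)  # (B, 1, ~H/8, ~W/8) patch logits
```

Keeping the two streams separate until a late fusion is the design choice that distinguishes this setup from the concatenation-based conditioning sketched in Section 2.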

Citations
Posted Content

You Only Need Adversarial Supervision for Semantic Image Synthesis

TL;DR: This work proposes a novel, simplified GAN model, which needs only adversarial supervision to achieve high quality results, and re-designs the discriminator as a semantic segmentation network, directly using the given semantic label maps as the ground truth for training.
Proceedings ArticleDOI

Image Inpainting Guided by Coherence Priors of Semantics and Textures

TL;DR: Zhang et al. as discussed by the authors introduce coherence priors between the semantics and textures, which make it possible to concentrate on completing separate textures in a semantic-wise manner; they adopt a multi-scale joint optimization framework to first model the coherence priors and then alternately optimize image inpainting and semantic segmentation in a coarse-to-fine manner.
Posted Content

AIM 2020 Challenge on Image Extreme Inpainting

TL;DR: This paper reviews the AIM 2020 challenge on extreme image inpainting and presents the proposed solutions and results for two different tracks: classical image inpainting and semantically guided image inpainting.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Journal ArticleDOI

Image quality assessment: from error visibility to structural similarity

TL;DR: In this article, a structural similarity index is proposed for image quality assessment based on the degradation of structural information, which can be applied to both subjective ratings and objective methods on a database of images compressed with JPEG and JPEG2000.
Journal ArticleDOI

Generative Adversarial Nets

TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are trained simultaneously: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Proceedings ArticleDOI

Image-to-Image Translation with Conditional Adversarial Networks

TL;DR: Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems and it is demonstrated that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.
Journal ArticleDOI

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

TL;DR: This work addresses the task of semantic image segmentation with deep learning and proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales; it improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
Frequently Asked Questions (13)
Q1. What contributions have the authors mentioned in the paper "Sesame: semantic editing of scenes by adding, manipulating or erasing objects" ?

To address these limitations, the authors propose SESAME, a novel generator-discriminator pair for Semantic Editing of Scenes by Adding, Manipulating or Erasing objects. In their setup, the user provides the semantic labels of the areas to be edited and the generator synthesizes the corresponding pixels. The authors evaluate their model on a diverse set of datasets and report state-of-the-art performance on two tasks: (a) image manipulation and (b) image generation conditioned on semantic labels.

As a future research direction, the authors plan to extend this work; image generation conditioned on other types of information, e.g., scene graphs, could also benefit from their two-stream discriminator. The authors will open-source the code and the models under the repository name OpenSESAME.

The authors train the Generator in an adversarial manner using the following losses: Perceptual Loss [17], Feature Matching Loss [41] and Hinge Loss [24,46,30] as the Adversarial Loss. 

For training the authors are using the Two Time-Scale Update Rule [11] to determine the scale between the learning rate of the generator and the discriminators, with lrgen = 0.0001 and lrdisc = 0.0004. 

The dataset contains 3,000 street-level view images of 50 different cities in Europe for the training set and 500 images for the validation set. 

In particular, their method is able to edit images with pixel-level guidance of semantic labels, permitting full control over the output.

The generator is an encoder-decoder architecture, with dilated convolutions [53] and SPADE [35] layers, and the discriminator is a two-stream patch discriminator; both are described in Section 3.

There are many approaches targeting multiple levels of abstraction and locality of features that the authors seek to encapsulate in the output. 

In order to showcase the benefits of their approach the authors ablate the performance of their architecture by varying (a) the generator architecture, (b) the discriminator architecture and (c) the available semantics, by utilizing either the Full semantic layout or the semantics of the rectangular region the authors want to edit, which the authors refer to as BBox Semantics. 

In this paper, the authors follow the formulation of Bau et al. [3], and define the task of semantic image editing as the process of adding, altering and removing instances of certain classes or semantic concepts in a scene.

The authors argue that the components of their proposed method work better together, as the large receptive field provided by the dilated convolutions in the generator synergizes well with the highly focused gradient flow coming from the discriminator.

Another line of work is tackling object synthesis by utilizing semantic layout information, which provides a fine-grained guidance over the manipulation of an image. 

To achieve these goals the authors adapt their generator from the network proposed by Johnson et al. [17] to fill the gaps: two down-sampling layers, a semantic core made of multiple residual layers and two up-sampling ones.