Controlling Perceptual Factors in Neural Style Transfer

Leon A. Gatys¹, Alexander S. Ecker¹, Matthias Bethge¹, Aaron Hertzmann², Eli Shechtman²

¹University of Tübingen    ²Adobe Research
Figure 1: Overview of our control methods. (a) Content image, with spatial mask inset. (b) Spatial Control. The sky is stylised using the sky of Style II from Fig. 2(c). The ground is stylised using Style I from Fig. 4(b). (c) Colour Control. The colour of the content image is preserved using the luminance-only style transfer described in Section 5.1. (d) Scale Control. The fine scale is stylised using Style I from Fig. 4(b) and the coarse scale is stylised using Style III from Fig. 4(b). Colour is preserved using the colour matching described in Section 5.2.
Abstract

Neural Style Transfer has shown very exciting results enabling new forms of image manipulation. Here we extend the existing method to introduce control over spatial location, colour information and across spatial scale¹ ². We demonstrate how this enhances the method by allowing high-resolution controlled stylisation and helps to alleviate common failure cases such as applying ground textures to sky regions. Furthermore, by decomposing style into these perceptual factors we enable the combination of style information from multiple sources to generate new, perceptually appealing styles from existing ones. We also describe how these methods can be used to more efficiently produce large size, high-quality stylisation. Finally we show how the introduced control measures can be applied in recent methods for Fast Neural Style Transfer.

¹ Code: github.com/leongatys/NeuralImageSynthesis
² Supplement: bethgelab.org/media/uploads/stylecontrol/supplement/
1. Introduction

Example-based style transfer is a major way to create new, perceptually appealing images from existing ones. It takes two images $x_S$ and $x_C$ as input, and produces a new image $\hat{x}$ applying the style of $x_S$ to the content of $x_C$. The concepts of "style" and "content" are both expressed in terms of image statistics; for example, two images are said to have the same style if they embody the same correlations of specific image features. To provide intuitive control over this process, one must identify ways to access perceptual factors in these statistics.
In order to identify these factors, we observe some of the
different ways that one might describe an artwork such as
Vincent van Gogh’s A Wheatfield with Cypresses (Fig. 2(c)).
First, one might separately describe different styles in dif-
ferent regions, such as in the sky as compared to the ground.
Second, one might describe the colour palette, and how
it relates to the underlying scene, separately from factors
like image composition or brush stroke texture. Third, one
might describe fine-scale spatial structures, such as brush
stroke shape and texture, separately from coarse-scale struc-
tures like the arrangements of strokes and the swirly struc-
ture in the sky of the painting. These observations motivate
our hypothesis: image style can be perceptually factorised
into style in different spatial regions, colour and luminance
information, and across spatial scales, making them mean-
ingful control dimensions for image stylisation.
Here we build on this hypothesis to introduce meaning-
ful control to a recent image stylisation method known as
Neural Style Transfer [8] in which the image statistics that
capture content and style are defined on feature responses in
a Convolutional Neural Network (CNN) [22]. Namely, we
introduce methods for controlling image stylisation inde-
pendently in different spatial regions (Fig. 1(b)), for colour
and luminance information (Fig. 1(c)) as well as on different
spatial scales (Fig. 1(d)). We show how they can be applied
to improve Neural Style Transfer and to alleviate some of its
common failure cases. Moreover, we demonstrate how the
factorisation of style into these aspects can gracefully com-
bine style information from multiple images and thus en-
able the creation of new, perceptually interesting styles. We
also show a method for efficiently rendering high-resolution
stylisations using a coarse-to-fine approach that reduces optimisation time by an approximate factor of 2.5. Finally, we show that in addition to the original optimisation-based style transfer, these control methods can also be applied to recent fast approximations of Neural Style Transfer [13, 23].
2. Related Work
There is a large body of work on image stylisation
techniques. The first example-based technique was Image
Analogies [12], which built on patch-based texture synthe-
sis techniques [4, 26]. This method introduced stylisation
based on an example painting, as well as ways to preserve
colour, and to control stylisation of different regions sep-
arately. The method used a coarse-to-fine texture synthe-
sis procedure for speed [26]. Since then, improvements
to the optimisation method and new applications [20, 6]
have been proposed. Patch-based methods have also been
used with CNN features [16, 2], leading to improved tex-
ture representations and stylisation results. Scale control
has been developed for patch-based texture synthesis [9]
and many other techniques have been developed for trans-
ferring colour style [5]. There are also many procedural
stylisation techniques that provide extensive user control in
the non-photorealistic rendering literature, e.g., [1, 15, 18].
These procedural methods provide separate controls for ad-
justing spatial variation in styles, colour transformation, and
brush stroke style, but cannot work from training data.
More recently, Neural Style Transfer [8] has demon-
strated impressive results in example-based image stylisa-
tion. The method is based on a parametric texture model
[14, 10, 19] defined by summary statistics on CNN re-
sponses [7] and appears to have several advantages over
patch-based synthesis. Most prominently, during the styli-
sation it displays a greater flexibility to create new image
structures that are not already present in the source images
[16].
However, the representation of image style within the
parametric neural texture model [7] allows far less intuitive
control over the stylisation outcome than patch-based meth-
ods. The texture parameters can be used to influence the
stylisation but their interplay is extremely complex due to
the complexity of the deep representations they are defined
on. Therefore it is difficult to predict their perceptual effect
on the stylisation result. Our main goal in this work is to
introduce intuitive ways to control Neural Style Transfer to
combine the advantages of that method with the more fine-
grained user control of earlier stylisation methods. Note
that concurrent work [27] independently developed a simi-
lar approach for spatial control as presented here.
3. Neural Style Transfer
The Neural Style Transfer method [8] works as follows. We define a content image $x_C$ and a style image $x_S$ with corresponding feature representations $F_{\ell}(x_C)$ and $F_{\ell}(x_S)$ in layer $\ell$ of a CNN. Each column of $F_{\ell}(x)$ is a vectorised feature map and thus $F_{\ell}(x) \in \mathbb{R}^{M_{\ell}(x) \times N_{\ell}}$, where $N_{\ell}$ is the number of feature maps in layer $\ell$ and $M_{\ell}(x) = H_{\ell}(x) \times W_{\ell}(x)$ is the product of height and width of each feature map. Note that while $N_{\ell}$ is independent of the input image, $M_{\ell}(x)$ depends on the size of the input image.

Neural Style Transfer generates a new image $\hat{x}$ that depicts the content of image $x_C$ in the style of image $x_S$ by minimising the following loss function with respect to $\hat{x}$:

$\mathcal{L}_{\mathrm{total}} = \alpha \mathcal{L}_{\mathrm{content}} + \beta \mathcal{L}_{\mathrm{style}}$    (1)

where the content term compares feature maps at a single layer $\ell_C$:

$\mathcal{L}_{\mathrm{content}} = \frac{1}{N_{\ell_C} M_{\ell_C}(x_C)} \sum_{ij} \left( F_{\ell_C}(\hat{x}) - F_{\ell_C}(x_C) \right)^2_{ij}$    (2)

and the style term compares a set of summary statistics:

$\mathcal{L}_{\mathrm{style}} = \sum_{\ell} w_{\ell} E_{\ell}$    (3)

$E_{\ell} = \frac{1}{4 N_{\ell}^2} \sum_{ij} \left( G_{\ell}(\hat{x}) - G_{\ell}(x_S) \right)^2_{ij}$    (4)

where $G_{\ell}(x) = \frac{1}{M_{\ell}(x)} F_{\ell}(x)^T F_{\ell}(x)$ is the Gram Matrix of the feature maps in layer $\ell$ in response to image $x$. As in the original work [8], we use the VGG-19 Network and include "conv4_2" as the layer $\ell_C$ for the image content and Gram Matrices from layers "conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1" as the image statistics that model style.
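To make the loss concrete, the following is a minimal Python/PyTorch sketch of Eqs. (1)-(4), not the released implementation. It operates on feature tensors assumed to have been extracted from a CNN such as VGG-19; the tensor shapes and the layer and loss weights are illustrative assumptions.

    import torch

    def gram_matrix(feats):
        # feats: feature maps of one layer, shape (N, H, W); returns the N x N
        # Gram matrix G = F^T F / M with M = H * W, as defined after Eq. (4).
        n, h, w = feats.shape
        f = feats.reshape(n, h * w)          # each row is one vectorised feature map
        return f @ f.t() / (h * w)

    def content_loss(f_hat, f_c):
        # Eq. (2): squared feature difference at the content layer, normalised by N * M.
        n, h, w = f_c.shape
        return ((f_hat - f_c) ** 2).sum() / (n * h * w)

    def style_loss(feats_hat, feats_style, layer_weights):
        # Eqs. (3)-(4): weighted sum of Gram-matrix differences over the style layers.
        loss = 0.0
        for f_hat, f_s, w_l in zip(feats_hat, feats_style, layer_weights):
            n = f_hat.shape[0]
            e_l = ((gram_matrix(f_hat) - gram_matrix(f_s)) ** 2).sum() / (4 * n ** 2)
            loss = loss + w_l * e_l
        return loss

    # Illustrative usage with random "features"; in practice these would be CNN
    # responses of x_hat, x_C and x_S, and x_hat would be optimised on L_total.
    feats_hat = [torch.randn(8, 32, 32) for _ in range(3)]
    feats_sty = [torch.randn(8, 32, 32) for _ in range(3)]
    total = (1.0 * content_loss(feats_hat[0], torch.randn(8, 32, 32))
             + 1e3 * style_loss(feats_hat, feats_sty, [1.0, 1.0, 1.0]))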
4. Spatial Control
We first introduce ways to spatially control Neural Style
Transfer. Our goal is to control which region of the style
image is used to stylise each region in the content image.
For example, we would like to apply one style to the sky re-
gion and another to the ground region of an image to either
avoid artefacts (Fig. 2(d),(e)) or to generate new combina-
tions of styles from multiple sources (Fig. 2(f)). We take as input $R$ spatial guidance channels $T^r$ for both the content and style image (small insets in Fig. 2(a)-(c)). Each of these is an image map of values in $[0, 1]$ specifying which styles should be applied where: regions where the $r$-th content guidance channel is equal to 1 should get the style from
regions where the $r$-th style guidance channel is 1. When there are multiple style images, the regions index over all the example images. The guidance channels are propagated to the CNN to produce guidance channels $T^r_{\ell}$ for each layer.
This can be done by simple re-sampling or more involved
methods as we explain later in this section. We first discuss
algorithms for synthesis given the guidance maps.
4.1. Guided Gram Matrices
In the first method we propose, we multiply the feature maps of each layer included in the style features with $R$ guidance channels $T^r_{\ell}$ and compute one spatially guided Gram Matrix for each of the $R$ regions in the style image. Formally we define a spatially guided feature map as

$F^r_{\ell}(x)_{[:,i]} = T^r_{\ell} \circ F_{\ell}(x)_{[:,i]}$    (5)

Here $F^r_{\ell}(x)_{[:,i]}$ is the $i$-th column vector of $F^r_{\ell}(x)$, $r \in R$, and $\circ$ denotes element-wise multiplication. The guidance channel $T^r_{\ell}$ is vectorised and can be either a binary mask for hard guidance or real-valued for soft guidance. We normalise $T^r_{\ell}$ such that $\sum_i (T^r_{\ell})^2_i = 1$. The guided Gram Matrix is then

$G^r_{\ell}(x) = F^r_{\ell}(x)^T F^r_{\ell}(x)$    (6)

Each guided Gram Matrix is used as the optimisation target for the corresponding region of the content image. The contribution of layer $\ell$ to the style loss is then:

$E_{\ell} = \frac{1}{4 N_{\ell}^2} \sum_{r=1}^{R} \sum_{ij} \lambda_r \left( G^r_{\ell}(\hat{x}) - G^r_{\ell}(x_S) \right)^2_{ij}$    (7)

where $\lambda_r$ is a weighting factor that controls the stylisation strength in the corresponding region $r$.
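As an illustration, a small sketch of the guided Gram matrices of Eqs. (5)-(7) is given below, assuming the guidance channels have already been propagated to the layer's resolution; the normalisation of $T^r_{\ell}$ follows the description above, and all tensor shapes are assumptions.

    import torch

    def guided_gram(feats, guide):
        # feats: (N, H, W) feature maps; guide: (H, W) guidance channel in [0, 1].
        n, h, w = feats.shape
        t = guide.reshape(-1)
        t = t / t.pow(2).sum().sqrt().clamp_min(1e-8)  # normalise so sum_i (T_r)_i^2 = 1
        f = feats.reshape(n, h * w) * t                # Eq. (5): element-wise weighting
        return f @ f.t()                               # Eq. (6): guided Gram matrix

    def guided_style_term(feats_hat, feats_style, guides_hat, guides_style, region_weights):
        # Eq. (7) for a single layer: sum squared Gram differences over the R regions.
        n = feats_hat.shape[0]
        e = 0.0
        for g_hat, g_s, lam in zip(guides_hat, guides_style, region_weights):
            diff = guided_gram(feats_hat, g_hat) - guided_gram(feats_style, g_s)
            e = e + lam * diff.pow(2).sum()
        return e / (4 * n ** 2)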
An important use for guidance channels is to ensure that
style is transferred between regions of similar scene con-
tent in the content and style image. For example, Figure 2
shows an example in which the sky in the content image has
bright clouds, whereas the sky in the style image has grey-
ish clouds; as a result, the original style transfer stylises the
sky with a bright part of the ground that does not match the
appearance of the sky. We address this by dividing both
images into a sky and a ground region (Fig. 2(a),(b) small
insets) and requiring that the sky and ground regions from the
painting are used to stylise the respective regions in the pho-
tograph (Fig. 2(e)).
Given the input guidance channel $T^r$, we need to first propagate this channel to produce guidance channels $T^r_{\ell}$ for each layer. The most obvious approach would be to downsample $T^r$ to the dimensions of each layer's feature map.
However, we often find that doing so fails to keep the de-
sired separation of styles by region, e.g., ground texture still
appears in the sky. This is because neurons near the bound-
aries of a guidance region can have large receptive fields that overlap into the other region. Instead we use an eroded version of the spatial guidance channels. We enforce spatial guidance only on the neurons whose receptive field is entirely inside the guidance region and add another global guidance channel that is constant over the entire image. We found that this soft spatial guidance usually yields better results. For further details on the creation of guidance channels, see the Supplementary Material, section 1.1.

Figure 2: Spatial guidance in Neural Style Transfer. (a) Content image. (b) Style image I. (c) Style image II. The spatial mask separating the image into sky and ground is shown in the top right corner. (d) Output from Neural Style Transfer without spatial control [8]. The clouds are stylised with image structures from the ground. (e) Output with spatial guidance. (f) Output from spatially combining the ground-style from (b) and the sky-style from (c).
Another application of this method is to generate a new
style by combining the styles from multiple example im-
ages. Figure 2(f) shows an example in which the region
guidance is used to apply the sky style from one image and
the ground style from another. This example demonstrates
the potential of spatial guidance to combine many example
styles together to produce new stylisations.
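One way to obtain the per-layer guidance channels described earlier in this section is sketched below: the image-resolution mask is resized to a layer's spatial dimensions and then eroded, so that guidance is only enforced on neurons whose receptive fields lie inside the region. The per-layer sizes and erosion radii here are illustrative assumptions, not the values used for the figures.

    import numpy as np
    from scipy.ndimage import binary_erosion, zoom

    def propagate_guidance(mask, layer_size, erosion_radius):
        # mask: (H, W) binary array at image resolution; layer_size: (h, w) of the
        # target feature map. Returns a float guidance channel for that layer.
        h, w = layer_size
        resized = zoom(mask.astype(float), (h / mask.shape[0], w / mask.shape[1]), order=1)
        hard = resized > 0.5
        if erosion_radius > 0:
            hard = binary_erosion(hard, iterations=erosion_radius)
        return hard.astype(np.float32)

    # Illustrative use: deeper layers are smaller and get stronger erosion because
    # their neurons have larger receptive fields (sizes and radii are made up).
    sky_mask = np.zeros((512, 512), dtype=bool)
    sky_mask[:256, :] = True
    guides = {"conv1_1": propagate_guidance(sky_mask, (512, 512), 1),
              "conv3_1": propagate_guidance(sky_mask, (128, 128), 2),
              "conv5_1": propagate_guidance(sky_mask, (32, 32), 3)}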
4.2. Guided Sums
Alternatively, instead of computing a Gram Matrix for each guidance channel, we can also just stack the guidance channels with the feature maps, as it is done in [2] to spatially guide neural patches [16]. The feature representation of image $x$ in layer $\ell$ is then $F_{\ell}(x) = \left[ F_{\ell}(x), T^1_{\ell}, T^2_{\ell}, \dots, T^R_{\ell} \right]$ with $F_{\ell}(x) \in \mathbb{R}^{M_{\ell}(x) \times (N_{\ell}+R)}$. Now the Gram Matrix $G_{\ell}(x) = \frac{1}{M_{\ell}(x)} F_{\ell}(x)^T F_{\ell}(x)$ includes correlations of the image features with the non-zero entries of the guidance channels and therefore encourages that the features in region $r$ of the style image are used to stylise region $r$ in the content image. The contribution of layer $\ell$ to the style loss is simply

$E_{\ell} = \frac{1}{4 N_{\ell}^2} \sum_{ij} \left( G_{\ell}(\hat{x}) - G_{\ell}(x_S) \right)^2_{ij}$    (8)
This is clearly more efficient than the method presented in
Section 4.1. Instead of computing and matching R Gram
Matrices one only has to compute one Gram Matrix with R
additional channels. Nevertheless, this gain in efficiency
comes at the expense of texture quality. The additional
channels in the new Gram Matrix are the sums over each
feature map spatially weighted by the guidance channel.
$G_{\ell}(x_S)_{i,\,N_{\ell}+r} = \sum_j \left( T^r_{\ell} \circ F_{\ell}(x_S)_{[:,i]} \right)_j$    (9)
Hence this method actually interpolates between matching
the original global Gram Matrix stylisation and the spatially
weighted sums over the feature maps. While the feature
map sums also give a non-trivial texture model, their ca-
pacity to model complex textures is limited [7]. In practice
we find that this method can often give decent results but
also does not quite capture the texture of the style image
as would be expected from the inferior texture model. Re-
sults and comparisons can be found in the Supplementary
Material, section 1.2.
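For comparison, a sketch of this guided-sums variant (Eqs. 8-9) is shown below: the $R$ guidance channels are stacked onto the feature maps and a single Gram matrix is computed on the stacked tensor. Shapes and normalisation are assumptions consistent with the definitions above.

    import torch

    def stacked_gram(feats, guides):
        # feats: (N, H, W) feature maps; guides: (R, H, W) guidance channels.
        # Returns the (N+R) x (N+R) Gram matrix of the stacked representation.
        n, h, w = feats.shape
        stacked = torch.cat([feats, guides], dim=0).reshape(n + guides.shape[0], h * w)
        return stacked @ stacked.t() / (h * w)

    def guided_sum_style_term(feats_hat, feats_style, guides_hat, guides_style):
        # Eq. (8): a single Gram-matrix comparison per layer; the extra rows/columns
        # hold the guidance-weighted feature sums of Eq. (9).
        n = feats_hat.shape[0]
        diff = stacked_gram(feats_hat, guides_hat) - stacked_gram(feats_style, guides_style)
        return diff.pow(2).sum() / (4 * n ** 2)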
5. Colour Control
The colour information of an image is an important per-
ceptual aspect of its style. At the same time it is largely
independent of other style aspects such as the type of brush
strokes used or dominating geometric shapes. Therefore it
is desirable to independently control the colour information
in Neural Style Transfer. A prominent use case for such
control is colour preservation during style transfer. When
stylising an image using Neural Style Transfer, the output
also copies the colour distribution of the style image, which
might be undesirable in many cases (Fig. 3(c)). For exam-
ple, the stylised farmhouse has the colours of the original
van Gogh painting (Fig. 3(c)), whereas one might prefer
the output painting to preserve the colours of the farmhouse
photograph. In particular, one might imagine that the artist
would have used the colours of the scene if they were to
paint the farmhouse. Here we present two simple methods
to preserve the colours of the source image during Neural
Style Transfer; in other words, to transfer the style without transferring the colours. We compare two different ap-
proaches to colour preservation: colour histogram matching
and luminance-only transfer (Fig. 3(d,e)).
5.1. Luminance-only transfer
In the first method we perform style transfer only in the
luminance channel, as done in Image Analogies [12]. This
is motivated by the observation that visual perception is far
more sensitive to changes in luminance than in colour [25].
The modification is simple. The luminance channels $L_S$ and $L_C$ are first extracted from the style and content images. Then the Neural Style Transfer algorithm is applied to these images to produce an output luminance image $\hat{L}$. Using a colour space that separates luminance and colour information, the colour information of the content image is combined with $\hat{L}$ to produce the final colour output image (Fig. 3(d)).
If there is a substantial mismatch between the luminance
histogram of the style and the content image, it can be help-
ful to match the histogram of the style luminance channel
$L_S$ to that of the content image $L_C$ before transferring the style. For that we simply match mean and variance of the content luminance. Let $\mu_S$ and $\mu_C$ be the mean luminances of the two images, and $\sigma_S$ and $\sigma_C$ be their standard deviations. Then each luminance pixel in the style image is updated as:

$L_S' = \frac{\sigma_C}{\sigma_S} \left( L_S - \mu_S \right) + \mu_C$    (10)
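A minimal sketch of the luminance pipeline follows: luminance matching as in Eq. (10) and recombination of the stylised luminance with the content's colour channels. YIQ is used here as the colour space that separates luminance from colour; the paper does not prescribe a particular space, so this choice is an assumption.

    import numpy as np

    _YIQ = np.array([[0.299, 0.587, 0.114],
                     [0.596, -0.274, -0.322],
                     [0.211, -0.523, 0.312]])

    def rgb_to_yiq(rgb):
        # rgb: float array in [0, 1], shape (H, W, 3); channel 0 of the result is luminance.
        return rgb @ _YIQ.T

    def yiq_to_rgb(yiq):
        return yiq @ np.linalg.inv(_YIQ).T

    def match_luminance(lum_style, lum_content):
        # Eq. (10): shift and scale the style luminance to the content's mean and std.
        mu_s, sigma_s = lum_style.mean(), lum_style.std()
        mu_c, sigma_c = lum_content.mean(), lum_content.std()
        return (sigma_c / max(sigma_s, 1e-8)) * (lum_style - mu_s) + mu_c

    def recombine(lum_stylised, content_rgb):
        # Keep the content's colour (I, Q) channels and replace only the luminance
        # with the result of luminance-only style transfer.
        yiq = rgb_to_yiq(content_rgb)
        yiq[..., 0] = lum_stylised
        return np.clip(yiq_to_rgb(yiq), 0.0, 1.0)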
5.2. Colour histogram matching
The second method we present works as follows. Given the style image $x_S$ and the content image $x_C$, the style image's colours are transformed to match the colours of the content image. This produces a new style image $x_S'$ that replaces $x_S$ as input to the Neural Style Transfer algorithm. The algorithm is otherwise unchanged.

Figure 3: Colour preservation in Neural Style Transfer. (a) Content image. (b) Style image. (c) Output from Neural Style Transfer [8]. The colour scheme is copied from the painting. (d) Output using style transfer in the luminance domain to preserve colours. (e) Output using colour transfer to preserve colours.
The one choice to be made is the colour transfer proce-
dure. There are many colour transformation algorithms to
choose from; see [5] for a survey. Here we use linear meth-
ods, which are simple and effective for colour style transfer.
Given the style image, each RGB pixel $p_S$ is transformed as:

$p_S' = A\, p_S + b$    (11)

where $A$ is a $3 \times 3$ matrix and $b$ is a 3-vector. This transformation is chosen so that the mean and covariance of the RGB values in the new style image $p_S'$ match those of $p_C$ [11] (Appendix B). In general, we find that the colour matching method works reasonably well with Neural Style Transfer (Fig. 3(e)), whereas it gave poor synthesis results for Image Analogies [11]. Furthermore, the colour histogram
matching method can also be used to better preserve the
colours of the style image. This can substantially improve
results for cases in which there is a strong mismatch in
colour but one rather wants to keep the colour distribution
of the style image (for example with pencil drawings or line
art styles). Examples of this application can be found in the
Supplementary Material, section 2.2.
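The linear transform of Eq. (11) can be realised as sketched below: $A$ and $b$ are chosen so that the transformed style pixels take on the mean and covariance of the content pixels. A symmetric matrix square root via eigendecomposition is used here; the paper defers the exact construction to [11] (Appendix B), so this particular factorisation is an assumption.

    import numpy as np

    def _sqrtm(cov):
        # Symmetric positive semi-definite matrix square root via eigendecomposition.
        vals, vecs = np.linalg.eigh(cov)
        return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

    def match_colour(style_rgb, content_rgb):
        # style_rgb, content_rgb: float arrays of shape (H, W, 3) in [0, 1].
        xs = style_rgb.reshape(-1, 3)
        xc = content_rgb.reshape(-1, 3)
        mu_s, mu_c = xs.mean(axis=0), xc.mean(axis=0)
        cov_s = np.cov(xs, rowvar=False) + 1e-8 * np.eye(3)
        cov_c = np.cov(xc, rowvar=False) + 1e-8 * np.eye(3)
        A = _sqrtm(cov_c) @ np.linalg.inv(_sqrtm(cov_s))   # so that A cov_s A^T = cov_c
        b = mu_c - A @ mu_s
        matched = xs @ A.T + b                             # Eq. (11), applied to every pixel
        return np.clip(matched, 0.0, 1.0).reshape(style_rgb.shape)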
5.3. Comparison
In conclusion, both methods give perceptually-
interesting results but have different advantages and
disadvantages. The colour-matching method is naturally
limited by how well the colour transfer from the content
image onto the style image works. The colour distribution
often cannot be matched perfectly, leading to a mismatch
between the colours of the output image and that of the
content image.
In contrast, the luminance-only transfer method pre-
serves the colours of the content image perfectly. However,
dependencies between the luminance and the colour chan-
nels are lost in the output image. While we found that this is
usually very difficult to spot, it can be a problem for styles
with prominent brushstrokes since a single brushstroke can
change colour in an unnatural way. In comparison, when
using full style transfer and colour matching, the output im-
age really consists of strokes which are blotches of paint,
not just variations of light and dark. For a more detailed
discussion of colour preservation in Neural Style Transfer
we refer the reader to the Supplementary Material, section
2.1.
6. Scale Control
In this section, we describe methods for mixing differ-
ent styles at different scales and efficiently generating high-
resolution output with style at desired scales.
6.1. Scale control for style mixing
First we introduce a method to control the stylisation
independently on different spatial scales. Our goal is to
pick separate styles for different scales. For example, we
want to combine the fine-scale brushstrokes of one painting
(Fig. 4(b), Style I) with the coarse-scale angular geometric
shapes of another image (Fig. 4(b), Style II).

References

K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. ECCV, 2016.
L. A. Gatys, A. S. Ecker, and M. Bethge. Image Style Transfer Using Convolutional Neural Networks. CVPR, 2016.
D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint, 2016.
J. Portilla and E. P. Simoncelli. A Parametric Texture Model Based on Joint Statistics of Complex Wavelet Coefficients. International Journal of Computer Vision, 2000.