Learning Residual Images for Face Attribute Manipulation
Wei Shen Rujie Liu
Fujitsu Research & Development Center, Beijing, China.
{shenwei, rjliu}@cn.fujitsu.com
Abstract
Face attributes are interesting due to their detailed description of human faces. Unlike prior research on attribute prediction, we address an inverse and more challenging problem called face attribute manipulation, which aims at modifying a face image according to a given attribute value. Instead of manipulating the whole image, we propose to learn the corresponding residual image, defined as the difference between images before and after the manipulation. In this way, the manipulation can be performed efficiently with modest pixel modification. The framework of our approach is based on the Generative Adversarial Network. It consists of two image transformation networks and a discriminative network. The transformation networks are responsible for the attribute manipulation and its dual operation, and the discriminative network is used to distinguish the generated images from real images. We also apply dual learning to allow the transformation networks to learn from each other. Experiments show that residual images can be effectively learned and used for attribute manipulations. The generated images retain most of the details in attribute-irrelevant areas.
1. Introduction
Considerable progress has been made on face image processing, such as age analysis [22][26], emotion detection [1][5] and attribute classification [4][20][15][18]. Most of these studies concentrate on inferring attributes from images. Here, we raise the inverse question of whether we can manipulate a face image towards a desired attribute value (i.e. face attribute manipulation). Some examples are shown in Fig. 1.
Generative models such as generative adversarial networks (GANs) [7] and variational autoencoders (VAEs) [14] are powerful models capable of generating images. Images generated from GAN models are sharp and realistic. However, GANs cannot encode images, since it is random noise that is used for image generation. Compared to GAN models, VAE models are able to encode the given image to a latent representation. Nevertheless, passing images through the encoder-decoder pipeline often harms the quality of the reconstruction and loses fine details. In the scenario of face attribute manipulation, those details can be identity-related, and their loss will cause undesired changes. Thus, it is difficult to directly apply GAN models or VAE models to face attribute manipulation.

Figure 1: Illustration of face attribute manipulation. From top to bottom: (a) glasses: remove and add the glasses; (b) mouth open: close and open the mouth; (c) no beard: add and remove the beard.
An alternative way is to view face attribute manipulation as a transformation process which takes original images as input and outputs transformed images without explicit embedding. Such a transformation process can be efficiently implemented by a feed-forward convolutional neural network (CNN). When manipulating face attributes, the feed-forward network is required to modify the attribute-specific area and keep irrelevant areas unchanged, both of which are challenging.

In this paper, we propose a novel method based on residual image learning for face attribute manipulation. The method combines the generative power of the GAN model with the efficiency of the feed-forward network (see Fig. 2). We model the manipulation operation as learning the residual image, which is defined as the difference between the original input image and the desired manipulated image. Compared to learning the whole manipulated image, learning only the residual image avoids the redundant attribute-irrelevant information by concentrating on the essential attribute-specific knowledge. To improve the efficiency of manipulation learning, we adopt two CNNs to model the two inverse manipulations (e.g. removing glasses as the primal manipulation and adding glasses as the dual manipulation, Fig. 2) and apply the strategy of dual learning during the training phase. Our contributions can be summarized as follows.
1. We propose to learn residual images for face attribute manipulation. The proposed method focuses on the attribute-specific face area instead of the entire face, which contains many redundant irrelevant details.

2. We devise a dual learning scheme to learn two inverse attribute manipulations (one as the primal manipulation and the other as the dual manipulation) simultaneously. We demonstrate that the dual learning process is helpful for generating high quality images.

3. Though it is difficult to assess the manipulated images quantitatively, we adopt the landmark detection accuracy gain as a metric to quantitatively show the effectiveness of the proposed method for glasses removal.
2. Related Work
Many techniques for image generation have been proposed in recent years [23][2][17][8][3][14]. Radford et al. [23] applied deep convolutional generative adversarial networks (DCGANs) to learn a hierarchy of representations from object parts to scenes for general image generation. Chen et al. [2] introduced an information-theoretic extension to the GAN that is able to learn disentangled representations. Larsen et al. [17] combined the VAE with the GAN to learn an embedding in which high-level abstract visual features can be modified using simple arithmetic.
Our work was developed independently of [19], in which Li et al. proposed a deep convolutional network model for identity-aware transfer of facial attributes. The differences between our work and [19] are noticeable in three aspects. (1) Our method generates manipulated images via residual images, unlike [19]. (2) Our method models two inverse manipulations within a single architecture by sharing the same discriminator, while the work in [19] treats each manipulation independently. (3) Our method does not require post-processing, which is essential in [19].
3. Learning the Residual Image
The architecture of the proposed method is presented in Fig. 2. For each face attribute manipulation, it contains two image transformation networks $G_0$ and $G_1$ and a discriminative network D. $G_0$ and $G_1$ simulate the primal and the dual manipulation respectively. D classifies the reference images and generated images into three categories. The following sections first give a brief introduction to the generative adversarial network and then a detailed description of the proposed method.
3.1. Generative Adversarial Networks
The generative adversarial network was introduced by Goodfellow et al. [7]. It is an unsupervised framework containing a generative model G and a discriminative model D. The two models play a minimax two-player game in which G tries to capture the training data distribution and fool D into making a mistake about whether a sample comes from the realistic data distribution or from G. Given a data prior $p_{\text{data}}$ on data $x$ and a prior $p_z(z)$ on the input noise variable $z$, the formal expression of the minimax game is as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \quad (1)$$

The parameters of both G and D are updated iteratively during the training process. The GAN framework provides an effective way to learn the data distribution of realistic images and makes it possible to generate images with a desired attribute. Based on the GAN framework, we redesign the generator and the discriminator for face attribute manipulation in the following sections.
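As a concrete illustration of the alternating optimization implied by Eq. (1), the sketch below performs one discriminator update and one generator update. It is a minimal, hypothetical PyTorch example, not the authors' code: netG, netD, the optimizers, and real_batch are placeholder names, and the generator update uses the common non-saturating form of the objective.

```python
import torch
import torch.nn.functional as F

def gan_step(netG, netD, opt_g, opt_d, real_batch, z_dim=100):
    """One alternating update for the vanilla GAN objective (Eq. 1).
    netG maps noise z to an image; netD returns one logit per image."""
    z = torch.randn(real_batch.size(0), z_dim, device=real_batch.device)

    # Update D: maximize log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    d_real = netD(real_batch)
    d_fake = netD(netG(z).detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    loss_d.backward()
    opt_d.step()

    # Update G: non-saturating variant, i.e. maximize log D(G(z))
    opt_g.zero_grad()
    d_fake = netD(netG(z))
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```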
3.2. Image Transformation Networks
The motivation of our approach is that face attribute manipulation usually requires only modest modification of the attribute-specific face area while other parts remain unchanged. For example, when removing a pair of glasses from a face image, only the area of the glasses should be replaced with face skin or eyes, while other face parts such as the mouth, nose, and hair should not change. Thus, we model the manipulation as learning a residual image targeted at the attribute-specific area.
As shown in Fig. 2, the image transformation networks $G_0$ and $G_1$ are used to simulate the manipulation and its dual operation. Given an input face image $x_0$ with a negative attribute value and an input face image $x_1$ with a positive attribute value, the learned networks $G_0$ and $G_1$ apply the manipulation transformations to yield the residual images $r_0$ and $r_1$. The input images are then added to the residual images to form the final outputs $\tilde{x}_0$ and $\tilde{x}_1$:

$$\tilde{x}_i = x_i + r_i = x_i + G_i(x_i), \quad i = 0, 1. \quad (2)$$

Figure 2: The architecture of the proposed method. Two image transformation networks $G_0$ and $G_1$ perform the inverse attribute manipulations (i.e. adding glasses and removing glasses). Both $G_0$ and $G_1$ produce residual images with reference to the input images. The final output images are the pixel-wise addition of the residual images and the input images. The discriminative network D is a three-category classifier that classifies images from different categories (i.e. images generated from $G_0$ and $G_1$, images with positive attribute labels, and images with negative attribute labels).
In order to encourage the residual image to be sparse, we apply an L1-norm regularization:

$$\ell_{\text{pix}}(r_i) = \| r_i \|_1, \quad i = 0, 1. \quad (3)$$
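The residual formulation of Eqs. (2) and (3) is simple to express in code. The following is an illustrative sketch, not the authors' implementation: G_i stands for either transformation network, and averaging the L1 norm over pixels (rather than summing) is an assumption made for scale convenience.

```python
import torch

def manipulate(G_i, x_i):
    """Produce the manipulated image (Eq. 2) and the sparsity term of Eq. (3).
    G_i is an image-to-image network whose output has the same shape as x_i."""
    r_i = G_i(x_i)                 # residual image
    x_tilde = x_i + r_i            # pixel-wise addition gives the output image
    loss_pix = r_i.abs().mean()    # L1 sparsity term (Eq. 3 uses the plain L1 norm)
    return x_tilde, loss_pix
```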
3.3. The Discriminative Network
Given the real images $x_0$ and $x_1$ with known attribute labels 0 and 1, we regard the transformed images $\tilde{x}_0$ and $\tilde{x}_1$ as an extra category with label 2. The classification loss is:

$$\ell_{\text{cls}}(t, p) = -\log(p_t), \quad t = 0, 1, 2, \quad (4)$$

where $t$ is the label of the image and $p_t$ is the softmax probability of the $t$-th label. A similar strategy for constructing the GAN loss is also adopted in [25].
Perceptual loss is widely used to measure the content difference between images [10][6][17]. We also apply this loss to encourage the transformed image to have content similar to that of the input face image. Let $\phi(x)$ be the activation of the third layer of D. The perceptual loss is defined as:

$$\ell_{\text{per}}(x, \tilde{x}) = \| \phi(x) - \phi(\tilde{x}) \|_1. \quad (5)$$
Given the discriminative network D, the GAN loss for the image transformation networks $G_0$ and $G_1$ is

$$\ell_{\text{GAN}} = \begin{cases} -\log(D(G_i(x_i))), & i = 0, \\ -\log(1 - D(G_i(x_i))), & i = 1. \end{cases} \quad (6)$$
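To make the three-category setup concrete, the sketch below implements Eq. (4) for D, the perceptual loss of Eq. (5), and the generator GAN loss of Eq. (6). It is an illustrative PyTorch sketch under stated assumptions, not the paper's code: netD is assumed to return raw 3-way logits, phi to expose its third-layer activation, and D(·) in Eq. (6) is read as the probability that D assigns to the positive-attribute class.

```python
import torch
import torch.nn.functional as F

def cls_loss(netD, x0, x1, x0_tilde, x1_tilde):
    """Eq. (4): three-category cross-entropy for D.
    Label 0: real negative-attribute, 1: real positive-attribute, 2: generated."""
    logits = torch.cat([netD(x0), netD(x1),
                        netD(x0_tilde.detach()), netD(x1_tilde.detach())])
    labels = torch.cat([torch.zeros(len(x0)), torch.ones(len(x1)),
                        torch.full((len(x0_tilde) + len(x1_tilde),), 2.0)]).long()
    return F.cross_entropy(logits, labels.to(logits.device))

def perceptual_loss(phi, x, x_tilde):
    """Eq. (5): L1 distance between intermediate activations of D."""
    return (phi(x) - phi(x_tilde)).abs().mean()

def gan_loss(netD, x_tilde, i, eps=1e-6):
    """Eq. (6), reading D(.) as the softmax probability of the positive class:
    G_0's output (i = 0) should look positive, G_1's output (i = 1) negative."""
    p_pos = F.softmax(netD(x_tilde), dim=1)[:, 1].clamp(eps, 1 - eps)
    return (-torch.log(p_pos) if i == 0 else -torch.log(1 - p_pos)).mean()
```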
3.4. Dual Learning
In addition to applying adversarial learning during model training, we also adopt dual learning, which has been successfully applied in machine translation [27]. A brief introduction is as follows. Any machine translation task has a dual task, i.e. the source language to the target language (primal) and the target language to the source language (dual). The mechanism of dual learning can be viewed as a two-player communication game. The first player translates a message from language A to language B and sends it to the second player. The second player checks whether it is natural in language B and notifies the first player, then translates the message back to language A and sends it to the first player. The first player checks whether the received message is consistent with the original message and notifies the second player. The feedback signals from both players benefit each other through this closed loop.

Figure 3: The dual learning process in this work.
The dual learning process in this work is implemented as shown in Fig. 3. For a given image $x_0$ with a negative attribute value, we pass it through the transformation network $G_0$. The obtained image $\tilde{x}_0 = G_0(x_0)$ is then fed to the transformation network $G_1$. The yielded image is $\hat{x}_0 = G_1(\tilde{x}_0) = G_1(G_0(x_0))$. Since $G_0$ and $G_1$ perform the primal task and the dual task respectively, $\hat{x}_0$ is expected to have the same attribute value as $x_0$. A similar process is applied for $x_1$.
The loss function for the transformation networks in this phase is expressed as:

$$\ell_{\text{dual}}(\tilde{x}_i) = \begin{cases} -\log(1 - D(G_{1-i}(\tilde{x}_i))), & i = 0, \\ -\log(D(G_{1-i}(\tilde{x}_i))), & i = 1. \end{cases} \quad (7)$$
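The full primal-dual pass can be sketched as below, under the same assumptions as the previous snippet (residual outputs as in Eq. (2), D(·) read as the positive-class probability); the function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def dual_losses(G0, G1, netD, x0, x1, eps=1e-6):
    """Dual-learning pass of Fig. 3 and Eq. (7).
    G0: negative -> positive manipulation; G1: its dual. Each network returns
    a residual, so output images are formed as input + residual (Eq. 2)."""
    x0_tilde = x0 + G0(x0)               # primal pass on x0
    x1_tilde = x1 + G1(x1)               # primal pass on x1
    x0_hat = x0_tilde + G1(x0_tilde)     # dual pass: should look negative again
    x1_hat = x1_tilde + G0(x1_tilde)     # dual pass: should look positive again

    def p_pos(img):
        # probability that D assigns to the positive-attribute class
        return F.softmax(netD(img), dim=1)[:, 1].clamp(eps, 1 - eps)

    loss_i0 = -torch.log(1 - p_pos(x0_hat)).mean()   # i = 0 branch of Eq. (7)
    loss_i1 = -torch.log(p_pos(x1_hat)).mean()       # i = 1 branch of Eq. (7)
    return loss_i0 + loss_i1
```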
3.5. Loss Function
Taking all the loss functions together, we have the following loss function for $G_0$/$G_1$:

$$\ell_G = \ell_{\text{GAN}} + \ell_{\text{dual}} + \alpha \ell_{\text{pix}} + \beta \ell_{\text{per}}, \quad (8)$$

where $\alpha$ and $\beta$ are constant weights for the regularization terms. For D, the loss function is

$$\ell_D = \ell_{\text{cls}}. \quad (9)$$
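Assembling Eqs. (8) and (9) from the pieces above is then a one-liner per network. The sketch below uses the weights reported in Section 5 for local attributes and is, again, only an illustration of how the terms combine.

```python
def total_generator_loss(l_gan, l_dual, l_pix, l_per, alpha=5e-4, beta=None):
    """Eq. (8); the paper keeps beta = 0.1 * alpha."""
    beta = 0.1 * alpha if beta is None else beta
    return l_gan + l_dual + alpha * l_pix + beta * l_per

def total_discriminator_loss(l_cls):
    """Eq. (9): D is trained only with the three-category classification loss."""
    return l_cls
```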
4. Datasets
We adopt two datasets in our experiments, i.e. the CelebA dataset [21] and the Labeled Faces in the Wild (LFW) dataset [9]. The CelebA dataset contains more than 200K celebrity images, each annotated with 40 binary attributes. We pick 6 of them, i.e. glasses, mouth open, smile, no beard, young, and male, to evaluate the proposed method. The center part of each aligned image in the CelebA dataset is cropped and scaled to 128×128. Although there is a large number of images in the dataset, the attribute labels are highly biased. Thus, for each attribute, 1,000 images from the attribute-positive class and 1,000 images from the attribute-negative class are randomly selected for testing. From the remaining images, we select all images belonging to the minority class and an equal number of images from the majority class to form a balanced training set. The LFW dataset is used only for testing the generalization of our method. Note that there are no ground-truth manipulated images in the CelebA dataset for training the transformation networks.
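The split described above can be reproduced with a few lines of NumPy. This is an illustrative sketch of the protocol, not the authors' preprocessing script; the function name and the random seed are arbitrary.

```python
import numpy as np

def balanced_split(labels, n_test_per_class=1000, seed=0):
    """Split image indices for one binary attribute: 1,000 + 1,000 test images,
    then a balanced training set (all minority-class images plus an equal
    number sampled from the majority class). `labels` is a 0/1 array."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(labels == 1), np.flatnonzero(labels == 0)
    test = np.concatenate([rng.choice(pos, n_test_per_class, replace=False),
                           rng.choice(neg, n_test_per_class, replace=False)])
    pos_rest, neg_rest = np.setdiff1d(pos, test), np.setdiff1d(neg, test)
    n = min(len(pos_rest), len(neg_rest))
    train = np.concatenate([rng.choice(pos_rest, n, replace=False),
                            rng.choice(neg_rest, n, replace=False)])
    return train, test
```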
5. Implementation Details
The detailed architectures of $G_0$, $G_1$ and D are specified in Tab. 1. We keep $\beta = 0.1\alpha$ and set $\alpha$ = 5e-4 for local face attribute (i.e. glasses, no beard, mouth open, smile) manipulation and $\alpha$ = 1e-6 for global face attribute (i.e. male, young) manipulation. The weights of all the networks are initialized from a zero-centered normal distribution with standard deviation 0.02. The Adam optimizer [13] is used in the training phase. The learning rate is the same, 2e-4, for both the transformation networks and the discriminative network. Both $G_0$ and $G_1$ are trained at the same time without any staging.
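These reported settings translate directly into a short setup routine. The sketch below is hypothetical: Adam's beta parameters, for instance, are not given in the text and are left at PyTorch defaults.

```python
import torch
import torch.nn as nn

def init_weights(m):
    """Zero-centered normal initialization with std 0.02, as reported above."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.BatchNorm2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)

def make_optimizers(G0, G1, D, lr=2e-4):
    """Adam with the same learning rate for the transformation networks and D."""
    for net in (G0, G1, D):
        net.apply(init_weights)
    opt_g = torch.optim.Adam(list(G0.parameters()) + list(G1.parameters()), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    return opt_g, opt_d
```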
6. Experiments
6.1. Local and Global Attribute Manipulation
Among the six attributes, we group glasses, mouth open, smile and no beard as local attributes, since these manipulations only operate on a local face area. The other two attributes, male and young, are treated as global attributes. We compare the results of our method with those of the state-of-the-art VAE-GAN method [17] on the CelebA dataset (Fig. 4). The results on the LFW dataset are presented in Fig. 5.
We first give an overall impression of the results. As shown in Fig. 4, the VAE-GAN model [17] changes many details, such as hair style, skin color and background objects. In contrast, the results from our method retain most of the details. Comparing the original images in the first row with the transformed images in the third row, we find that details in the original face images mostly remain the same in their manipulated counterparts except for the areas corresponding to the target attributes. This observation is also confirmed by the residual images in the last row. For local attribute manipulation, the strong responses on the residual images concentrate mainly in local areas. For example, when adding sunglasses to a face image, the strongest response on the residual image is the black sunglasses. Similarly, removing glasses causes the residual image to enhance the eyes and remove any hint of glasses present in the original face image.
Local face attribute manipulations are straightforward and easy to notice. We further investigate more interesting cases such as the mouth open and smile manipulations. Both manipulations cause a “movement” of the chin. From Fig. 4(c,d), we can observe that this movement is captured by the image transformation networks: when performing the mouth open manipulation, the network “lowers” the chin, and when performing the mouth close manipulation, the network “lifts” the chin.

Table 1: The network architectures of the image transformation networks $G_0$/$G_1$ and the discriminative network D.

Image transformation networks $G_0$/$G_1$                   | Discriminative network D
Input 128×128 color images                                  | Input 128×128 color images
5×5 conv. 64, leaky ReLU, stride 1, batchnorm               | 4×4 conv. 64, leaky ReLU, stride 2, batchnorm
4×4 conv. 128, leaky ReLU, stride 2, batchnorm              | 4×4 conv. 128, leaky ReLU, stride 2, batchnorm
4×4 conv. 256, leaky ReLU, stride 2, batchnorm              | 4×4 conv. 256, leaky ReLU, stride 2, batchnorm
3×3 conv. 128, leaky ReLU, stride 1, upsampling, batchnorm  | 4×4 conv. 512, leaky ReLU, stride 2, batchnorm
3×3 conv. 64, leaky ReLU, stride 1, upsampling, batchnorm   | 4×4 conv. 1024, leaky ReLU, stride 2, batchnorm
4×4 conv. 3                                                 | 4×4 conv. 1
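For readers who prefer code, the layer listing in Table 1 corresponds roughly to the following PyTorch modules. This is a hedged sketch: padding values, the upsampling mode, and the leaky-ReLU slope are not specified in the table and are assumptions here, chosen so that a 128×128 input yields a 128×128 residual.

```python
import torch.nn as nn

def block(in_ch, out_ch, k, stride, pad, upsample=False):
    """conv + batchnorm + leaky ReLU, optionally preceded by 2x upsampling.
    Exact padding values are assumptions; Table 1 does not specify them."""
    layers = [nn.Upsample(scale_factor=2)] if upsample else []
    layers += [nn.Conv2d(in_ch, out_ch, k, stride, pad),
               nn.BatchNorm2d(out_ch),
               nn.LeakyReLU(0.2, inplace=True)]
    return layers

# Image transformation network G0/G1 (Table 1, left column): 128x128 image -> 3-channel residual.
generator = nn.Sequential(
    *block(3, 64, 5, 1, 2),
    *block(64, 128, 4, 2, 1),
    *block(128, 256, 4, 2, 1),
    *block(256, 128, 3, 1, 1, upsample=True),
    *block(128, 64, 3, 1, 1, upsample=True),
    nn.ZeroPad2d((1, 2, 1, 2)),          # asymmetric pad so the 4x4 conv keeps 128x128
    nn.Conv2d(64, 3, 4, 1),
)

# Discriminative network D (Table 1, right column): 128x128 image -> 1x1 output map.
discriminator = nn.Sequential(
    *block(3, 64, 4, 2, 1),
    *block(64, 128, 4, 2, 1),
    *block(128, 256, 4, 2, 1),
    *block(256, 512, 4, 2, 1),
    *block(512, 1024, 4, 2, 1),
    nn.Conv2d(1024, 1, 4, 1),            # final 4x4 conv as listed in Table 1
)
```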
The most challenging task is manipulating the global attributes young and male. The networks have to learn subtle changes such as wrinkles, hair color, beard, etc. In Fig. 4(f), changing from young to old produces more wrinkles, and the dual operation darkens the hair color. From the residual images in Fig. 4(e), we observe that the main differences between male and female faces are the beard, the color of the lips and the eyes. The strong responses in the residual images for these two manipulations are scattered over the entire image rather than restricted to a local area.
6.2. Ablation Study
Our model consists of two pivotal components: residual image learning and dual learning. In this section, we further validate their effectiveness. We modify the proposed model to obtain two additional models. One breaks the identity mapping in the transformation networks, forcing the networks to generate the entire image. The other breaks the data-feed loop (i.e. the outputs of $G_0$ and $G_1$ are no longer fed to each other). Other network settings are kept the same as those of the proposed model. We use the glasses manipulation as an example, and the results are shown in Fig. 6. We observe that without residual image learning, the model produces images of much lower quality, introducing considerable noise and correlated features (e.g. the wrongly added beard in the second and third columns), which indicates that the task has become more challenging. Dropping dual learning also deteriorates image quality: we notice changes in hair color, caused by the performance degradation of the transformation networks. The effectiveness of dual learning can be explained from two aspects. 1) Images generated from both generators increase the number of training samples. 2) During the dual learning phase, the ground truth images for $G_1(G_0(x_0))$ and $G_0(G_1(x_1))$ are known, which eases the training of both generators. Thus, we argue that combining residual image learning with dual learning leads to better manipulation results.
Figure 6: Validation of residual image learning and dual learning in the glasses manipulation. First row: the original input images. Second row: the result images from the proposed model. Third row: the result images from the model without residual image learning. Last row: the result images from the model without dual learning.
6.3. Visual Feature Decorrelation
Training a classifier in an end-to-end way does not ensure that the classifier precisely identifies the target visual features, especially when the dataset is highly biased. Blind spots of predictive models are observed in [16]. For example, if the training data consist only of black dog images and white cat images, a predictive model trained on these data will incorrectly label a white dog as a cat with high confidence at test time. When analyzing the CelebA dataset, we find that male and no beard are highly correlated (the Pearson correlation is -0.5222). This is not surprising, since only male faces have beards.
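The quoted value is the ordinary Pearson coefficient computed over the binary attribute annotations; a minimal sketch (with hypothetical variable names) is:

```python
import numpy as np

def attribute_correlation(attr_a, attr_b):
    """Pearson correlation between two binary attribute vectors,
    e.g. the 'male' and 'no beard' columns of the CelebA annotations."""
    attr_a = np.asarray(attr_a, dtype=float)
    attr_b = np.asarray(attr_b, dtype=float)
    return np.corrcoef(attr_a, attr_b)[0, 1]
```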
Classifiers trained on correlated features may also propagate these blind spots back to the generative models, which may cause the generators to produce correlated visual features. To demonstrate that the proposed method can learn less correlated features, we choose to add beard to female faces.

References

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. In NIPS, 2014.
D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
A. Radford, L. Metz, and S. Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434, 2015.
J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In ECCV, 2016.