Learning Residual Images for Face Attribute Manipulation
Wei Shen Rujie Liu
Fujitsu Research & Development Center, Beijing, China.
{shenwei, rjliu}@cn.fujitsu.com
Abstract
Face attributes are interesting due to their detailed description of human faces. Unlike prior research on attribute prediction, we address an inverse and more challenging problem called face attribute manipulation, which aims at modifying a face image according to a given attribute value. Instead of manipulating the whole image, we propose to learn the corresponding residual image, defined as the difference between images before and after the manipulation. In this way, the manipulation can be performed efficiently with modest pixel modification. The framework of our approach is based on the Generative Adversarial Network. It consists of two image transformation networks and a discriminative network. The transformation networks are responsible for the attribute manipulation and its dual operation, and the discriminative network is used to distinguish the generated images from real images. We also apply dual learning to allow the transformation networks to learn from each other. Experiments show that residual images can be effectively learned and used for attribute manipulations. The generated images retain most of the details in attribute-irrelevant areas.
1. Introduction
Considerable progress has been made on face image processing, such as age analysis [22][26], emotion detection [1][5] and attribute classification [4][20][15][18]. Most of these studies concentrate on inferring attributes from images. Here, we raise the inverse question of whether we can manipulate a face image towards a desired attribute value (i.e. face attribute manipulation). Some examples are shown in Fig. 1.
Generative models such as generative adversarial networks (GANs) [7] and variational autoencoders (VAEs) [14] are powerful models capable of generating images. Images generated from GAN models are sharp and realistic. However, GANs cannot encode images, since it is random noise that is used for image generation. Compared to GAN models, VAE models are able to encode the given image to a latent representation. Nevertheless, passing images through the encoder-decoder pipeline often harms the quality of the reconstruction and loses fine details. In the scenario of face attribute manipulation, those details can be identity-related, and their loss will cause undesired changes. Thus, it is difficult to directly apply GAN models or VAE models to face attribute manipulation.

Figure 1: Illustration of face attribute manipulation. From top to bottom: (a) glasses: remove and add the glasses; (b) mouth open: close and open the mouth; (c) no beard: add and remove the beard.
An alternative way is to view face attribute manipulation as a transformation process which takes original images as input and outputs transformed images without explicit embedding. Such a transformation process can be efficiently implemented by a feed-forward convolutional neural network (CNN). When manipulating face attributes, the feed-forward network is required to modify the attribute-specific area and keep irrelevant areas unchanged, both of which are challenging.

In this paper, we propose a novel method based on residual image learning for face attribute manipulation. The method combines the generative power of the GAN model with the efficiency of the feed-forward network (see Fig. 2). We model the manipulation operation as learning the residual image, which is defined as the difference between the original input image and the desired manipulated image. Compared to learning the whole manipulated image, learning only the residual image avoids the redundant attribute-irrelevant information by concentrating on the essential attribute-specific knowledge. To improve the efficiency of manipulation learning, we adopt two CNNs to model the two inverse manipulations (e.g. removing glasses as the primal manipulation and adding glasses as the dual manipulation, Fig. 2) and apply the strategy of dual learning during the training phase. Our contributions can be summarized as follows.
1. We propose to learn residual images for face attribute manipulation. The proposed method focuses on the attribute-specific face area instead of the entire face, which contains many redundant irrelevant details.

2. We devise a dual learning scheme to learn two inverse attribute manipulations (one as the primal manipulation and the other as the dual manipulation) simultaneously. We demonstrate that the dual learning process is helpful for generating high quality images.

3. Though it is difficult to assess the manipulated images quantitatively, we adopt the landmark detection accuracy gain as a metric to quantitatively show the effectiveness of the proposed method for glasses removal.
2. Related Work
Many techniques for image generation have been proposed in recent years [23][2][17][8][3][14]. Radford et al. [23] applied deep convolutional generative adversarial networks (DCGANs) to learn a hierarchy of representations from object parts to scenes for general image generation. Chen et al. [2] introduced an information-theoretic extension to the GAN that is able to learn disentangled representations. Larsen et al. [17] combined the VAE with the GAN to learn an embedding in which high-level abstract visual features can be modified using simple arithmetic.
Our work was developed independently of [19], in which Li et al. proposed a deep convolutional network model for identity-aware transfer of facial attributes. The differences between our work and [19] are noticeable in three aspects. (1) Our method generates manipulated images via residual images, unlike [19]. (2) Our method models two inverse manipulations within a single architecture by sharing the same discriminator, while the work in [19] treats each manipulation independently. (3) Our method does not require post-processing, which is essential in [19].
3. Learning the Residual Image
The architecture of the proposed method is presented in Fig. 2. For each face attribute manipulation, it contains two image transformation networks $G_0$ and $G_1$ and a discriminative network D. $G_0$ and $G_1$ simulate the primal and the dual manipulation respectively. D classifies the reference images and generated images into three categories. The following sections first give a brief introduction to the generative adversarial network and then a detailed description of the proposed method.
3.1. Generative Adversarial Networks
The generative adversarial network was introduced by Goodfellow et al. [7]. It is an unsupervised framework containing a generative model G and a discriminative model D. The two models play a minimax two-player game in which G tries to capture the training data distribution and fool D into making a mistake about whether a sample comes from the realistic data distribution or from G. Given a data prior $p_{\text{data}}$ on data $x$ and a prior $p_z(z)$ on the input noise variable $z$, the formal expression of the minimax game is as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \quad (1)$$

The parameters of both G and D are updated iteratively during the training process. The GAN framework provides an effective way to learn the data distribution of realistic images and makes it possible to generate images with a desired attribute. Based on the GAN framework, we redesign the generator and the discriminator for face attribute manipulation in the following sections.
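As a concrete illustration of the alternating optimization implied by Eq. (1), the sketch below performs one discriminator update and one generator update. It is a minimal, hypothetical PyTorch example, not the authors' code: netG, netD, the optimizers, and real_batch are placeholder names, and the generator update uses the common non-saturating form of the objective.

```python
import torch
import torch.nn.functional as F

def gan_step(netG, netD, opt_g, opt_d, real_batch, z_dim=100):
    """One alternating update for the vanilla GAN objective (Eq. 1).
    netG maps noise z to an image; netD returns one logit per image."""
    z = torch.randn(real_batch.size(0), z_dim, device=real_batch.device)

    # Update D: maximize log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    d_real = netD(real_batch)
    d_fake = netD(netG(z).detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    loss_d.backward()
    opt_d.step()

    # Update G: non-saturating variant, i.e. maximize log D(G(z))
    opt_g.zero_grad()
    d_fake = netD(netG(z))
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```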
3.2. Image Transformation Networks
The motivation of our approach is that face attribute manipulation usually requires only modest modification of the attribute-specific face area while other parts remain unchanged. For example, when removing a pair of glasses from a face image, only the area of the glasses should be replaced with face skin or eyes, while other face parts such as the mouth, nose, and hair should not change. Thus, we model the manipulation as learning a residual image targeted at the attribute-specific area.
As shown in Fig. 2, the image transformation networks $G_0$ and $G_1$ are used to simulate the manipulation and its dual operation. Given an input face image $x_0$ with a negative attribute value and an input face image $x_1$ with a positive attribute value, the learned networks $G_0$ and $G_1$ apply the manipulation transformations to yield the residual images $r_0$ and $r_1$. The input images are then added to the residual images to form the final outputs $\tilde{x}_0$ and $\tilde{x}_1$:

$$\tilde{x}_i = x_i + r_i = x_i + G_i(x_i), \quad i = 0, 1. \quad (2)$$

Figure 2: The architecture of the proposed method. Two image transformation networks $G_0$ and $G_1$ perform the inverse attribute manipulations (i.e. adding glasses and removing glasses). Both $G_0$ and $G_1$ produce residual images with reference to the input images. The final output images are the pixel-wise addition of the residual images and the input images. The discriminative network D is a three-category classifier that classifies images from different categories (i.e. images generated from $G_0$ and $G_1$, images with positive attribute labels, and images with negative attribute labels).
In order to encourage the residual image to be sparse, we apply an L1-norm regularization:

$$\ell_{\text{pix}}(r_i) = \| r_i \|_1, \quad i = 0, 1. \quad (3)$$
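The residual formulation of Eqs. (2) and (3) is simple to express in code. The following is an illustrative sketch, not the authors' implementation: G_i stands for either transformation network, and averaging the L1 norm over pixels (rather than summing) is an assumption made for scale convenience.

```python
import torch

def manipulate(G_i, x_i):
    """Produce the manipulated image (Eq. 2) and the sparsity term of Eq. (3).
    G_i is an image-to-image network whose output has the same shape as x_i."""
    r_i = G_i(x_i)                 # residual image
    x_tilde = x_i + r_i            # pixel-wise addition gives the output image
    loss_pix = r_i.abs().mean()    # L1 sparsity term (Eq. 3 uses the plain L1 norm)
    return x_tilde, loss_pix
```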
3.3. The Discriminative Network
Given the real images $x_0$ and $x_1$ with known attribute labels 0 and 1, we regard the transformed images $\tilde{x}_0$ and $\tilde{x}_1$ as an extra category with label 2. The classification loss is:

$$\ell_{\text{cls}}(t, p) = -\log(p_t), \quad t = 0, 1, 2, \quad (4)$$

where $t$ is the label of the image and $p_t$ is the softmax probability of the $t$-th label. A similar strategy for constructing the GAN loss is also adopted in [25].
Perceptual loss is widely used to measure the content difference between images [10][6][17]. We also apply this loss to encourage the transformed image to have content similar to that of the input face image. Let $\phi(x)$ be the activation of the third layer of D. The perceptual loss is defined as:

$$\ell_{\text{per}}(x, \tilde{x}) = \| \phi(x) - \phi(\tilde{x}) \|_1. \quad (5)$$
Given the discriminative network D, the GAN loss for the image transformation networks $G_0$ and $G_1$ is

$$\ell_{\text{GAN}} = \begin{cases} -\log(D(G_i(x_i))), & i = 0, \\ -\log(1 - D(G_i(x_i))), & i = 1. \end{cases} \quad (6)$$
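To make the three-category setup concrete, the sketch below implements Eq. (4) for D, the perceptual loss of Eq. (5), and the generator GAN loss of Eq. (6). It is an illustrative PyTorch sketch under stated assumptions, not the paper's code: netD is assumed to return raw 3-way logits, phi to expose its third-layer activation, and D(·) in Eq. (6) is read as the probability that D assigns to the positive-attribute class.

```python
import torch
import torch.nn.functional as F

def cls_loss(netD, x0, x1, x0_tilde, x1_tilde):
    """Eq. (4): three-category cross-entropy for D.
    Label 0: real negative-attribute, 1: real positive-attribute, 2: generated."""
    logits = torch.cat([netD(x0), netD(x1),
                        netD(x0_tilde.detach()), netD(x1_tilde.detach())])
    labels = torch.cat([torch.zeros(len(x0)), torch.ones(len(x1)),
                        torch.full((len(x0_tilde) + len(x1_tilde),), 2.0)]).long()
    return F.cross_entropy(logits, labels.to(logits.device))

def perceptual_loss(phi, x, x_tilde):
    """Eq. (5): L1 distance between intermediate activations of D."""
    return (phi(x) - phi(x_tilde)).abs().mean()

def gan_loss(netD, x_tilde, i, eps=1e-6):
    """Eq. (6), reading D(.) as the softmax probability of the positive class:
    G_0's output (i = 0) should look positive, G_1's output (i = 1) negative."""
    p_pos = F.softmax(netD(x_tilde), dim=1)[:, 1].clamp(eps, 1 - eps)
    return (-torch.log(p_pos) if i == 0 else -torch.log(1 - p_pos)).mean()
```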
3.4. Dual Learning
In addition to applying adversarial learning during model training, we also adopt dual learning, which has been successfully applied in machine translation [27]. A brief introduction is as follows. Any machine translation task has a dual task, i.e. the source language to the target language (primal) and the target language to the source language (dual). The mechanism of dual learning can be viewed as a two-player communication game. The first player translates a message from language A to language B and sends it to the second player. The second player checks whether it is natural in language B and notifies the first player, then translates the message back to language A and sends it to the first player. The first player checks whether the received message is consistent with the original message and notifies the second player. The feedback signals from both players benefit each other through this closed loop.

Figure 3: The dual learning process in this work.
The dual learning process in this work is implemented as shown in Fig. 3. For a given image $x_0$ with a negative attribute value, we pass it through the transformation network $G_0$. The obtained image $\tilde{x}_0 = G_0(x_0)$ is then fed to the transformation network $G_1$. The yielded image is $\hat{x}_0 = G_1(\tilde{x}_0) = G_1(G_0(x_0))$. Since $G_0$ and $G_1$ perform the primal task and the dual task respectively, $\hat{x}_0$ is expected to have the same attribute value as $x_0$. A similar process is applied for $x_1$.
The loss function for the transformation networks in this phase is expressed as:

$$\ell_{\text{dual}}(\tilde{x}_i) = \begin{cases} -\log(1 - D(G_{1-i}(\tilde{x}_i))), & i = 0, \\ -\log(D(G_{1-i}(\tilde{x}_i))), & i = 1. \end{cases} \quad (7)$$
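The full primal-dual pass can be sketched as below, under the same assumptions as the previous snippet (residual outputs as in Eq. (2), D(·) read as the positive-class probability); the function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def dual_losses(G0, G1, netD, x0, x1, eps=1e-6):
    """Dual-learning pass of Fig. 3 and Eq. (7).
    G0: negative -> positive manipulation; G1: its dual. Each network returns
    a residual, so output images are formed as input + residual (Eq. 2)."""
    x0_tilde = x0 + G0(x0)               # primal pass on x0
    x1_tilde = x1 + G1(x1)               # primal pass on x1
    x0_hat = x0_tilde + G1(x0_tilde)     # dual pass: should look negative again
    x1_hat = x1_tilde + G0(x1_tilde)     # dual pass: should look positive again

    def p_pos(img):
        # probability that D assigns to the positive-attribute class
        return F.softmax(netD(img), dim=1)[:, 1].clamp(eps, 1 - eps)

    loss_i0 = -torch.log(1 - p_pos(x0_hat)).mean()   # i = 0 branch of Eq. (7)
    loss_i1 = -torch.log(p_pos(x1_hat)).mean()       # i = 1 branch of Eq. (7)
    return loss_i0 + loss_i1
```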
3.5. Loss Function
Taking all the loss functions together, we have the following loss function for $G_0$/$G_1$:

$$\ell_G = \ell_{\text{GAN}} + \ell_{\text{dual}} + \alpha \ell_{\text{pix}} + \beta \ell_{\text{per}}, \quad (8)$$

where $\alpha$ and $\beta$ are constant weights for the regularization terms. For D, the loss function is

$$\ell_D = \ell_{\text{cls}}. \quad (9)$$
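Assembling Eqs. (8) and (9) from the pieces above is then a one-liner per network. The sketch below uses the weights reported in Section 5 for local attributes and is, again, only an illustration of how the terms combine.

```python
def total_generator_loss(l_gan, l_dual, l_pix, l_per, alpha=5e-4, beta=None):
    """Eq. (8); the paper keeps beta = 0.1 * alpha."""
    beta = 0.1 * alpha if beta is None else beta
    return l_gan + l_dual + alpha * l_pix + beta * l_per

def total_discriminator_loss(l_cls):
    """Eq. (9): D is trained only with the three-category classification loss."""
    return l_cls
```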
4. Datasets
We adopt two datasets in our experiments, i.e. the CelebA dataset [21] and the Labeled Faces in the Wild (LFW) dataset [9]. The CelebA dataset contains more than 200K celebrity images, each annotated with 40 binary attributes. We pick 6 of them, i.e. glasses, mouth open, smile, no beard, young, and male, to evaluate the proposed method. The center part of each aligned image in the CelebA dataset is cropped and scaled to 128×128. Although there is a large number of images in the dataset, the attribute labels are highly biased. Thus, for each attribute, 1,000 images from the attribute-positive class and 1,000 images from the attribute-negative class are randomly selected for testing. From the remaining images, we select all images belonging to the minority class and an equal number of images from the majority class to form a balanced training set. The LFW dataset is used only for testing the generalization of our method. Note that there are no ground-truth manipulated images in the CelebA dataset for training the transformation networks.
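The split described above can be reproduced with a few lines of NumPy. This is an illustrative sketch of the protocol, not the authors' preprocessing script; the function name and the random seed are arbitrary.

```python
import numpy as np

def balanced_split(labels, n_test_per_class=1000, seed=0):
    """Split image indices for one binary attribute: 1,000 + 1,000 test images,
    then a balanced training set (all minority-class images plus an equal
    number sampled from the majority class). `labels` is a 0/1 array."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(labels == 1), np.flatnonzero(labels == 0)
    test = np.concatenate([rng.choice(pos, n_test_per_class, replace=False),
                           rng.choice(neg, n_test_per_class, replace=False)])
    pos_rest, neg_rest = np.setdiff1d(pos, test), np.setdiff1d(neg, test)
    n = min(len(pos_rest), len(neg_rest))
    train = np.concatenate([rng.choice(pos_rest, n, replace=False),
                            rng.choice(neg_rest, n, replace=False)])
    return train, test
```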
5. Implementation Details
The detailed architectures of $G_0$, $G_1$ and D are specified in Tab. 1. We keep $\beta = 0.1\alpha$ and set $\alpha$ = 5e-4 for local face attribute (i.e. glasses, no beard, mouth open, smile) manipulation and $\alpha$ = 1e-6 for global face attribute (i.e. male, young) manipulation. The weights of all the networks are initialized from a zero-centered normal distribution with standard deviation 0.02. The Adam optimizer [13] is used in the training phase. The learning rate is the same, 2e-4, for both the transformation networks and the discriminative network. Both $G_0$ and $G_1$ are trained at the same time without any staging.
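These reported settings translate directly into a short setup routine. The sketch below is hypothetical: Adam's beta parameters, for instance, are not given in the text and are left at PyTorch defaults.

```python
import torch
import torch.nn as nn

def init_weights(m):
    """Zero-centered normal initialization with std 0.02, as reported above."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.BatchNorm2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)

def make_optimizers(G0, G1, D, lr=2e-4):
    """Adam with the same learning rate for the transformation networks and D."""
    for net in (G0, G1, D):
        net.apply(init_weights)
    opt_g = torch.optim.Adam(list(G0.parameters()) + list(G1.parameters()), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    return opt_g, opt_d
```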
6. Experiments
6.1. Local and Global Attribute Manipulation
Among the six attributes, we group glasses, mouth open, smile and no beard as local attributes, since these manipulations only operate on a local face area. The other two attributes, male and young, are treated as global attributes. We compare the results of our method with those of the state-of-the-art VAE-GAN method [17] on the CelebA dataset (Fig. 4). The results on the LFW dataset are presented in Fig. 5.
We first give an overall impression of the results. As shown in Fig. 4, the VAE-GAN model [17] changes many details, such as hair style, skin color and background objects. In contrast, the results from our method retain most of the details. Comparing the original images in the first row with the transformed images in the third row, we find that details in the original face images mostly remain the same in their manipulated counterparts except for the areas corresponding to the target attributes. This observation is also confirmed by the residual images in the last row. For local attribute manipulation, the strong responses on the residual images concentrate mainly in local areas. For example, when adding sunglasses to a face image, the strongest response on the residual image is the black sunglasses. Similarly, removing glasses causes the residual image to enhance the eyes and remove any hint of glasses present in the original face image.
Local face attribute manipulations are straightforward and easy to notice. We further investigate more interesting cases such as the mouth open and smile manipulations. Both manipulations cause a “movement” of the chin. From Fig. 4(c,d), we can observe that this movement is captured by the image transformation networks: when performing the mouth open manipulation, the network “lowers” the chin, and when performing the mouth close manipulation, the network “lifts” the chin.

Table 1: The network architectures of the image transformation networks $G_0$/$G_1$ and the discriminative network D.

Image transformation networks $G_0$/$G_1$                   | Discriminative network D
Input 128×128 color images                                  | Input 128×128 color images
5×5 conv. 64, leaky ReLU, stride 1, batchnorm               | 4×4 conv. 64, leaky ReLU, stride 2, batchnorm
4×4 conv. 128, leaky ReLU, stride 2, batchnorm              | 4×4 conv. 128, leaky ReLU, stride 2, batchnorm
4×4 conv. 256, leaky ReLU, stride 2, batchnorm              | 4×4 conv. 256, leaky ReLU, stride 2, batchnorm
3×3 conv. 128, leaky ReLU, stride 1, upsampling, batchnorm  | 4×4 conv. 512, leaky ReLU, stride 2, batchnorm
3×3 conv. 64, leaky ReLU, stride 1, upsampling, batchnorm   | 4×4 conv. 1024, leaky ReLU, stride 2, batchnorm
4×4 conv. 3                                                 | 4×4 conv. 1
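For readers who prefer code, the layer listing in Table 1 corresponds roughly to the following PyTorch modules. This is a hedged sketch: padding values, the upsampling mode, and the leaky-ReLU slope are not specified in the table and are assumptions here, chosen so that a 128×128 input yields a 128×128 residual.

```python
import torch.nn as nn

def block(in_ch, out_ch, k, stride, pad, upsample=False):
    """conv + batchnorm + leaky ReLU, optionally preceded by 2x upsampling.
    Exact padding values are assumptions; Table 1 does not specify them."""
    layers = [nn.Upsample(scale_factor=2)] if upsample else []
    layers += [nn.Conv2d(in_ch, out_ch, k, stride, pad),
               nn.BatchNorm2d(out_ch),
               nn.LeakyReLU(0.2, inplace=True)]
    return layers

# Image transformation network G0/G1 (Table 1, left column): 128x128 image -> 3-channel residual.
generator = nn.Sequential(
    *block(3, 64, 5, 1, 2),
    *block(64, 128, 4, 2, 1),
    *block(128, 256, 4, 2, 1),
    *block(256, 128, 3, 1, 1, upsample=True),
    *block(128, 64, 3, 1, 1, upsample=True),
    nn.ZeroPad2d((1, 2, 1, 2)),          # asymmetric pad so the 4x4 conv keeps 128x128
    nn.Conv2d(64, 3, 4, 1),
)

# Discriminative network D (Table 1, right column): 128x128 image -> 1x1 output map.
discriminator = nn.Sequential(
    *block(3, 64, 4, 2, 1),
    *block(64, 128, 4, 2, 1),
    *block(128, 256, 4, 2, 1),
    *block(256, 512, 4, 2, 1),
    *block(512, 1024, 4, 2, 1),
    nn.Conv2d(1024, 1, 4, 1),            # final 4x4 conv as listed in Table 1
)
```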
The most challenging task is manipulating the global attributes young and male. The networks have to learn subtle changes such as wrinkles, hair color, beard, etc. In Fig. 4(f), changing from young to old produces more wrinkles, and the dual operation darkens the hair color. From the residual images in Fig. 4(e), we observe that the main differences between male and female faces are the beard, the color of the lips and the eyes. The strong responses in the residual images for these two manipulations are scattered over the entire image rather than restricted to a local area.
6.2. Ablation Study
Our model consists of two pivotal components: residual image learning and dual learning. In this section, we further validate their effectiveness. We modify the proposed model to obtain two additional models. One breaks the identity mapping in the transformation networks, forcing the networks to generate the entire image. The other breaks the data-feed loop (i.e. the outputs of $G_0$ and $G_1$ are no longer fed to each other). Other network settings are kept the same as those of the proposed model. We use the glasses manipulation as an example, and the results are shown in Fig. 6. We observe that without residual image learning, the model produces images of much lower quality, introducing considerable noise and correlated features (e.g. the wrongly added beard in the second and third columns), which indicates that the task has become more challenging. Dropping dual learning also deteriorates image quality: we notice changes in hair color, caused by the performance degradation of the transformation networks. The effectiveness of dual learning can be explained from two aspects. 1) Images generated from both generators increase the number of training samples. 2) During the dual learning phase, the ground truth images for $G_1(G_0(x_0))$ and $G_0(G_1(x_1))$ are known, which eases the training of both generators. Thus, we argue that combining residual image learning with dual learning leads to better manipulation results.
Figure 6: Validation of residual image learning and dual learning in the glasses manipulation. First row: the original input images. Second row: the result images from the proposed model. Third row: the result images from the model without residual image learning. Last row: the result images from the model without dual learning.
6.3. Visual Feature Decorrelation
Training a classifier in an end-to-end way does not ensure that the classifier precisely identifies the target visual features, especially when the dataset is highly biased. Blind spots of predictive models are observed in [16]. For example, if the training data consist only of black dog images and white cat images, a predictive model trained on these data will incorrectly label a white dog as a cat with high confidence at test time. When analyzing the CelebA dataset, we find that male and no beard are highly correlated (the Pearson correlation is -0.5222). This is not surprising, since only male faces have beards.
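The quoted value is the ordinary Pearson coefficient computed over the binary attribute annotations; a minimal sketch (with hypothetical variable names) is:

```python
import numpy as np

def attribute_correlation(attr_a, attr_b):
    """Pearson correlation between two binary attribute vectors,
    e.g. the 'male' and 'no beard' columns of the CelebA annotations."""
    attr_a = np.asarray(attr_a, dtype=float)
    attr_b = np.asarray(attr_b, dtype=float)
    return np.corrcoef(attr_a, attr_b)[0, 1]
```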
Classifiers trained on correlated features may also propagate these blind spots back to the generative models, which may cause the generators to produce correlated visual features. To demonstrate that the proposed method can learn less correlated features, we choose to add beard to female faces.

References

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. In NIPS, 2014.
D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
A. Radford, L. Metz, and S. Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434, 2015.
J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In ECCV, 2016.