
An Eye for an Eye: A Single Camera Gaze-Replacement Method
Lior Wolf
The Blavatnik School of Computer Science
Tel-Aviv University
Ziv Freund, Shai Avidan
Department of Electrical Engineering-Systems
Faculty of Engineering
Tel-Aviv University
Abstract
The camera in video conference systems is typically po-
sitioned above, or below, the screen, causing the gaze of
the users to appear misplaced. We propose an effective so-
lution to this problem that is based on replacing the eyes
of the user. This replacement, when done accurately, is
enough to achieve a natural looking video. At an initializa-
tion stage the user is asked to look straight at the camera.
We store these frames, then track the eyes accurately in the
video sequence and replace the eyes, taking care of illumi-
nation and ghosting artifacts. We have tested the system on
a large number of videos demonstrating the effectiveness of
the proposed solution.
1. Introduction
Videoconferencing systems hold the promise of allowing
natural interpersonal communication at a distance. Recent
advances in video quality and the adoption of large high-
definition screens are contributing to a more impressive user
experience. However, to achieve the desired impact of be-
ing in the same room, the problem of gaze offset must be
addressed.
This gaze problem arises because the user is looking at
the screen, while the camera(s) capturing the user are posi-
tioned elsewhere. As a result, even when the user is looking
straight into the image of his call partner, the gaze, as per-
ceived at the other side, does not meet the partner’s eyes.
Typically, the camera is located on top of the screen, and
the effect is interpreted as looking down.
Our system solves this problem by replacing the eyes of
a person in the video with eyes that look straight ahead. We
use an example based synthesis that is based on capturing
the eyes at an initial training stage. During the videoconfer-
encing session, we find and track the eyes. In every frame,
the eyes captured during training are accurately pasted to
create an illusion of straight-looking gaze. Somewhat sur-
prisingly, the resulting effect (Figure 1) of replacing the
eyes alone looks natural.
Figure 1. Two of the four images contain artificially replaced eyes
to create an effect of a person looking straight forward. These eyes
were automatically placed by our system. The other two images
were untouched. Can you tell which images were modified?
The answer is in footnote 1.
2. Previous work
The importance of natural gaze, and the problem of gaze
offset was presented in depth by Gemmell et al. [7], where
a synthesis framework was presented. The solved problem
is somewhat more general than what we aim to solve and
includes turning the entire head by means of rendering a 3D
head model.
Footnote 1: Images (a) and (d) are real; images (b) and (c) are modified.
With regard to the eyes, it was suggested that the location
of the eyes and the gaze direction would be estimated
by a “vision component”. The eyes would then be replaced
with a synthetic pair of eyes gazing in the desired direction.
Writing in the year 2000, the authors conclude that the
main difficulty they face is that the vision component is slow
and inaccurate, and suggest using an infrared-based vision
system until computer vision “comes up to speed”.
Despite great advances in object detection and tracking
in the last decade, the accurate commercial eye trackers that
are in existence are still based on infrared images in which
the corneal reflections and the center of the pupils are easily
detected.
In our work we use an eye model similar to the one pro-
posed by [16]. This model contains parameters such as the
center of the eye, the radius of the iris and so on. The au-
thors define an energy function which relates image features
to the geometric structure of the eye contours. Localization
is performed by the steepest descent method.
In [8] an automatic initialization method is proposed
based on the corners of the eye and the computation of the
model fitting process is sped up. In [9] a model for blinking
is introduced, and a KLT tracker is employed to track the
eye model over time.
Recently, [13] proposed a multi-stage process in which
the iris, upper eyelid and lower eyelid are detected sequen-
tially. A very accurate model is detected by Ding and
Martinez [4], who observe that classifier-based approaches
by themselves may be unsuitable for the task due to the
large variability in the shape of facial features.
There are many contributions in which the corners of the
eyes are detected; however, a detailed model is not recov-
ered. In this work we employ the method of Everingham et
al. [5] in order to obtain an initial localization for the eyes.
An alternative solution for the problem of gaze manipu-
lation is view synthesis from multiple cameras. In [3] dy-
namic programming based disparity estimation was used to
generate a middle view from two cameras that were posi-
tioned on the left and right sides of the screen. The ad-
vantage of such a method is that the generated view cor-
responds to a true view, while our solution generates only
a natural looking fake. The disadvantage, of course, is the
need to use multiple cameras positioned at locations that are
suitable for this purpose.
3. System overview
The core of our system is an accurate eye detector that
takes an image of a face and returns an accurate position of
the eye. Once we have the eye position we can replace it
with an image of an eye with a proper gaze direction.
To achieve this goal we learn a regression function that
maps face images to the eye model parameters, using a
database of annotated images. This stage is not accurate
enough on its own, so we follow it with a refinement stage.
Figure 2. An illustration of the eye model employed in the gaze
replacement system.
The system consists of three stages. An offline stage
where a rough eye pose estimation is learned from a
database of annotated images. This database is a general
database, unrelated to any particular videoconferencing
session, and we refer to it below as the offline database. A
second stage occurs at the beginning of the video confer-
encing session where the user looks straight at the camera.
This allows the system to construct a database of direct gaze
images. We refer to this as an online database. Finally, in
run-time the system constantly detects the location of the
eye in the current frame, finds the most similar image in the
database and replaces it.
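To make this flow concrete, here is a minimal sketch of the per-frame runtime loop; the function and field names are illustrative only (none are taken from the paper), and each step is detailed in the sections below.

def process_frame(frame, online_db, state):
    # state carries the eye models of the first frame and of the previous frame
    if state.is_first_frame or state.just_left_blink:
        init = best_correlated_eyes(frame, online_db)   # Section 5.1: first-frame detection
    else:
        init = state.previous_eye_models                # Section 5.2: tracking
    models, ok = refine_eye_models(frame, init)         # local search over the six parameters
    if not ok:                                          # search failed: treat as a blink
        state.enter_blink()
        return frame                                    # leave the frame untouched
    replacement = online_db.closest_pair(models)        # straight-gaze eyes from the online database
    output = paste_eyes(frame, replacement, models)     # warp, relight, feather (Section 5.3)
    state.update(models)
    return output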
The eye model we use is a simplified model based on
the deformable template of [16, 9]. It is described by six
parameters, as depicted in Figure 2: two coordinates for the
center of the eye, x_c and y_c; the width of the eye, w; the
height of the eye above the line connecting the corners, h_1;
the maximal height below this line, h_2; and, last, the angle θ
between the eye and the image scan lines. The coordinates
of the center are given relative to the leftmost corner of the eye.
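For concreteness, the six-parameter model can be held in a small record; the following is an illustrative Python sketch (the field names are not taken from the paper).

from dataclasses import dataclass

@dataclass
class EyeModel:
    xc: float     # eye-center x, relative to the leftmost eye corner
    yc: float     # eye-center y, relative to the leftmost eye corner
    w: float      # eye width (distance between the two eye corners)
    h1: float     # maximal height above the corner-to-corner line
    h2: float     # maximal height below the corner-to-corner line
    theta: float  # in-plane angle between the eye and the image scan lines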
3.1. Offline Training stage
The direct gaze models are learned based on previous
annotated examples. These examples are stored in an offline
database that is used to bootstrap the runtime system. It
consists of a separate set of 14 individuals for whom videos
were captured and manually annotated in accordance with
the six-parameter model.
3.2. Online Database construction
Given images of a new person looking directly into the
camera, the closest eye in the offline database is retrieved
and then serves as an initial guess as to the eye parameters
in the new image. These parameters are then adapted to
provide a more accurate estimation.

3.3. Runtime eye localization
At runtime, the parameters of the eyes in the video are
estimated by matching the image to the eyes of the person
when looking straight. The illumination is corrected, and a
replacement is performed.
4. Training phase
Given a video of a person looking directly into the cam-
era we aim to find the eye parameters. This is done for each
frame independently, thereby collecting pairs of straight
looking eyes of that person.
The first stage in the processing of each frame is the lo-
calization of the corners of the eyes. This is done using the
method of [5], which describes the geometric distribution of
facial features as a tree-structured mixture of Gaussians [6],
and captures appearance by Haar-wavelet like features [14].
After the corners of the eyes are localized, a rectangular
region of interest which approximately captures the eye
region is constructed for each eye. Let w be the distance
between the two corners of the eye. The rectangle of each
eye is a combination of a strip of height h_1 = 0.4w above
the line connecting the two corners and a strip of height
h_2 = 0.3w below this line.
Each region of interest (ROI) is scaled to an image of
120 × 80 pixels. Then, SIFT [11] descriptors are computed
at 24 evenly spaced points in the stretched image.
From the offline database of manually annotated eyes,
we select the left and right eyes with the closest appear-
ance descriptor. This yields an approximate model. In
Leave-One-Person-Out experiments conducted on the of-
fline database, the average absolute error was found to be
about 3 pixels for the center of the eye and 6 pixels for the
width of the eye.
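A minimal sketch of this retrieval step, assuming OpenCV's SIFT implementation, the 120 × 80 ROI, and a 6 × 4 grid for the 24 sample points (the grid layout and the database representation are assumptions, not specified above):

import cv2
import numpy as np

ROI_W, ROI_H = 120, 80

def eye_descriptor(gray_frame, roi_rect):
    # roi_rect = (x, y, w, h): rectangle derived from the detected eye corners
    x, y, w, h = roi_rect
    roi = cv2.resize(gray_frame[y:y + h, x:x + w], (ROI_W, ROI_H))
    # SIFT descriptors at 24 evenly spaced points (here a 6 x 4 grid)
    xs = np.linspace(10, ROI_W - 10, 6)
    ys = np.linspace(10, ROI_H - 10, 4)
    keypoints = [cv2.KeyPoint(float(px), float(py), 16) for py in ys for px in xs]
    _, desc = cv2.SIFT_create().compute(roi, keypoints)
    return desc.ravel()

def nearest_offline_eye(desc, offline_descs, offline_models):
    # offline_descs: N x D matrix of descriptors of the annotated offline eyes
    # offline_models: the corresponding six-parameter eye models
    dists = np.linalg.norm(offline_descs - desc, axis=1)
    return offline_models[int(np.argmin(dists))]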
Table 1 presents the regression results for the left eye,
and Table 2 presents the results obtained for the right eye.
We compare two representations: the SIFT representation
and the vector of gray values in the eye rectangle, resized to
120 pixels times 80 pixels. We also compare two machine
learning algorithms: Support Vector Regression and Near-
est Neighbor. SIFT obtains better performance, especially
for the Nearest Neighbor predictor.
This initial model is then refined by performing a local
search in the space of the 6 parameters. The search is con-
ducted in a range of twice the average error in each param-
eter, and in a coarse to fine manner.
For each set of candidate parameter values, the closest
eye in the offline database is translated, rotated and
stretched in accordance with the difference in parameters:
the center is shifted by the difference in x_c and y_c, the width
is scaled by the ratio of the w values, and the regions above
and below the line connecting the two corners are stretched
in accordance with h_1 and h_2, respectively.
Table 1. Mean (± standard deviation) error in pixels for each parameter
of the eye model for the experiments of the left eye. The errors were
estimated in a leave-one-person-out fashion on the offline dataset. Two
image representations (gray values and SIFT) and two learning algorithms
(Nearest Neighbor and Support Vector Regression) are presented.

Parameter   Gray values, NN   Gray values, SVR   SIFT, NN       SIFT, SVR
x_c         4.26 ± 3.62       3.08 ± 2.3         3.02 ± 2.47    3.08 ± 2.3
y_c         4.3 ± 2.96        3.84 ± 2.92        2.73 ± 1.8     3.23 ± 2.72
w           7.83 ± 6.16       6.86 ± 6.51        6.95 ± 6.52    6.47 ± 6.79
h_1         3.79 ± 2.25       3.35 ± 2.28        3.58 ± 3.04    3.35 ± 2.28
h_2         3.22 ± 2.73       2.93 ± 2.51        2.45 ± 1.72    2.68 ± 2.56
θ           0.10 ± 0.06       0.08 ± 0.04        0.08 ± 0.06    0.07 ± 0.05
Table 2. Mean (± standard deviation) error in pixels for each parameter
of the eye model for the experiments of the right eye. See Table 1 for
details.

Parameter   Gray values, NN   Gray values, SVR   SIFT, NN       SIFT, SVR
x_c         3.69 ± 2.99       7.49 ± 4.66        3.76 ± 3.06    5.72 ± 4.42
y_c         3.62 ± 2.89       3.01 ± 2.62        2.84 ± 2.01    2.91 ± 2.54
w           8.03 ± 6.26       6.13 ± 4.7         5.24 ± 4.48    5.81 ± 4.89
h_1         3.28 ± 2.77       2.94 ± 2.4         2.35 ± 1.89    2.89 ± 2.37
h_2         2.4 ± 1.88        2.05 ± 1.71        2.28 ± 2.04    2.05 ± 1.71
θ           0.07 ± 0.05       0.06 ± 0.05        0.076 ± 0.05   0.06 ± 0.05
Finally, the database eye image is rotated by the difference
in the θ values between the database image and the candidate
parameter value.
As a matching score we use the normalized cross correla-
tion measure between the warped database eye and the eye
in the new directly looking video. The ROI for this com-
parison is the region of the eye, slightly enlarged, and not a
rectangular frame.
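The refinement can be viewed as a coarse-to-fine grid search over the six parameters, scored by normalized cross-correlation; the sketch below is schematic, and the step schedule and the warp_to_params helper (and its mask output) are assumptions rather than details given in the paper.

import numpy as np
from itertools import product

def ncc(a, b):
    # normalized cross-correlation between two equally sized pixel vectors
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def refine(init_params, db_eye_img, target_img, ranges, levels=3):
    # init_params: dict of the six parameters; ranges: per-parameter search
    # half-range (roughly twice the average regression error)
    best, best_score = dict(init_params), -np.inf
    keys = list(init_params)
    for level in range(levels):
        step = {k: ranges[k] / (2 ** level) for k in keys}
        center = dict(best)
        for offsets in product(*[(-step[k], 0.0, step[k]) for k in keys]):
            cand = {k: center[k] + d for k, d in zip(keys, offsets)}
            # warp_to_params (hypothetical): translate, rotate and stretch the
            # database eye so that its model matches the candidate parameters
            warped, mask = warp_to_params(db_eye_img, cand)
            score = ncc(warped[mask], target_img[mask])
            if score > best_score:
                best, best_score = cand, score
    return best, best_score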
A threshold is used to determine cases in which the
method of searching for the eye parameters failed to pro-
duce good database-to-image matches. Typically, the pro-
cess produces one pair of eyes for 80% of the frames in the
training video.
5. Runtime System
The runtime system replaces the eyes whenever they are
open. During blinks no replacement is done. Right after
the blink, the system is reinitialized and starts similarly to
the first frame. Remarkably, the lack of intervention dur-
ing blink frames does not seem to hurt the quality of the
resulting video.

Figure 3. (a-c) Samples of successful facial feature point detection obtained
using the method of [5]. The ROIs around the eyes are marked by black
rectangles. (d) An example of a failure case in detection.
Figure 4. Initially fitted model (red) and refined model obtained by the
search procedure (green). (a) During the online database construction, the
initial guess is the nearest neighbor (for each eye separately) in the SIFT
feature space from among the training examples of the offline database.
(b) In the first frame of the videoconferencing session, the initial guess for
each eye is the highest-correlated example in the online database. (c) In the
tracking phase, the initial guess is the eye model of the previous frame.
(d) The two eyes (from two different individuals) from the offline database
that were used as the initial guess for the eye model of (a). (e) The initial
models for (b) are taken from this frame of the online database. (f) The
initial model for (c) is taken from the previous video frame, shown here.
Table 3. Mean (± standard deviation) error in pixels for each parameter of
the eye model after the refinement stage. The errors were estimated in a
leave-one-out fashion on the offline dataset.

Parameter   Left eye       Right eye
x_c         2.04 ± 1.80    1.75 ± 1.52
y_c         1.86 ± 1.72    1.60 ± 2.02
w           6.66 ± 5.03    4.29 ± 3.70
h_1         3.68 ± 3.42    2.54 ± 1.93
h_2         2.22 ± 1.95    2.20 ± 1.83
θ           0.08 ± 0.06    0.06 ± 0.04
5.1. Eye detection
In the first frame, the best-matching set of eyes is
searched for. This is done using normalized correlation
measurements to compare the learned eye models to an
eye-shaped ROI situated between the corners of the eyes
and with a width of 0.7w. Notice that we do not use SIFT
descriptors here, since our goal is to find eyes that are sim-
ilar in appearance. Such eyes are more likely to produce
minimal artifacts during replacement.
After the closest direct-looking pair of eyes is found, it is
morphed in order to better fit the new frame. This is
done by a search process over the six eye parameters, simi-

lar to the search done during training. Here, unlike the train-
ing phase, the normalized cross correlation matching score
is used.
Figure 4(b) shows the model eyes and how they were fit
to the new frame before and after the search stage.
5.2. Eye tracking
The eye parameters must be estimated for every frame.
Given a new frame, the eye parameters of the previous
frame serve as an initial guess, and a search (i.e. refine-
ment) process is once again conducted. The matching score
is composed of two components: one considers the sum of
squared differences (SSD) between the eye in the current
frame and the eye in the previous frame, where the latter
is warped to the new frame in accordance with the differ-
ence in the eye parameters; the second considers the SSD
between the current eye and the warped eye from the first
frame.
SSD is used since illumination changes between consec-
utive frames are expected to be small and since it is conve-
nient to combine multiple SSD scores. The combined cost
term minimizes drift over time. Without it, small tracking
errors would accumulate; a noticeable example would be
the eye gradually shrinking (see Figure 5).
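A sketch of the combined tracking cost under this description; the warping helpers are hypothetical, and the relative weight of the two terms is an assumption (the weighting of the two SSD components is not specified above).

import numpy as np

def tracking_cost(cand_params, frame, prev_frame, prev_params,
                  first_frame, first_params, alpha=0.5):
    # Eye pixels of the current frame under the candidate parameters
    cur = extract_eye(frame, cand_params)                 # hypothetical helper
    # Previous-frame and first-frame eyes, warped to the candidate parameters
    prev_w = warp_eye(prev_frame, prev_params, cand_params)
    first_w = warp_eye(first_frame, first_params, cand_params)
    ssd_prev = np.sum((cur - prev_w) ** 2)                # frame-to-frame term
    ssd_first = np.sum((cur - first_w) ** 2)              # anti-drift term
    return alpha * ssd_prev + (1.0 - alpha) * ssd_first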
When the search process performed during tracking fails,
a blink is declared, and the system enters a blink mode.
While in this mode, the eye parameters do not adapt, and
no eye replacement takes place. During every blink frame,
tracking based on the last detected eye model is attempted.
Once this tracking is successful for both eyes for at least 2
consecutive frames, the blink state is terminated.
The first frame after the blink mode is treated as the first
frame of the video sequence, in order to allow the system to
move to a more suitable set of eyes. In case the eye corner
detector fails, the last model of an open eye is used to ini-
tialize the eye model search process. Although the pair of
replacement eyes does not change in between blinks, this is
unnoticeable in the output.
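The blink handling can be summarized as a small two-state machine; the sketch below paraphrases the behavior described above, with the two-frame exit condition from the text.

OPEN, BLINK = "open", "blink"

class BlinkState:
    def __init__(self, frames_to_exit=2):
        self.state = OPEN
        self.frames_to_exit = frames_to_exit
        self.good_frames = 0

    def update(self, search_succeeded, both_eyes_tracked):
        if self.state == OPEN and not search_succeeded:
            self.state, self.good_frames = BLINK, 0       # refinement failed: blink begins
        elif self.state == BLINK:
            # during a blink, keep trying to track with the last open-eye model
            self.good_frames = self.good_frames + 1 if both_eyes_tracked else 0
            if self.good_frames >= self.frames_to_exit:
                self.state = OPEN                         # next frame is treated like the first
        return self.state == OPEN                         # replace eyes only while open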
5.3. Eye replacement
The eye replacement is done by pasting the warped set of
model eyes onto the eye location in the image. The warping
is done in order to adjust the parameters of the model eyes
to the parameters of the actual eyes in the video frame.
In order to eliminate artifacts caused by this pasting
operation, several measures are taken. First, the eyes are
enlarged by a factor of 10% in the vertical direction. This
compensates for underestimation of the height of the eye due
to the change in gaze between the model eyes and the actual
eyes, and, in addition, makes sure that there are no residual
pixels from the original eye.
A second measure done in order to ensure smooth inte-
gration is the adjustment of image illumination. To this end,
a low pass quotient image [12, 10, 15] is applied, similar to
what is done in [2] for swapping entire faces in images.
For both the original pixels to be replaced and the new
eye to be pasted, we estimate the illumination by fitting a
third order polynomial to the image data [1].
Denote the red value of pixel (x, y) in the original video
frame as I_R^{(1)}(x, y), and the value in the pasted eye image
as I_R^{(2)}(x, y). Let \hat{I}_R^{(1)}(x, y) and \hat{I}_R^{(2)}(x, y) be the
corresponding values of the fitted low-order polynomials:

\hat{I}_R^{(1)}(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3-i} \beta_{R,ij}^{(1)} x^i y^j

\hat{I}_R^{(2)}(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3-i} \beta_{R,ij}^{(2)} x^i y^j
The fitting is done for each of the three channels (R, G, B)
separately using a least-squares system of equations (10
unknown β coefficients per image per channel).
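A sketch of the per-channel least-squares fit (10 monomials x^i y^j with i + j ≤ 3); the normalization of pixel coordinates is an implementation choice not specified above.

import numpy as np

def fit_illumination(channel, mask):
    # channel: 2-D array with one color channel; mask: boolean eye-region mask
    h, w = channel.shape
    ys, xs = np.nonzero(mask)
    x, y = xs / w, ys / h                     # normalized coordinates (implementation choice)
    # monomials x^i * y^j with i + j <= 3 -> 10 coefficients per channel
    A = np.stack([x**i * y**j for i in range(4) for j in range(4 - i)], axis=1)
    beta, *_ = np.linalg.lstsq(A, channel[ys, xs].astype(np.float64), rcond=None)
    # evaluate the fitted polynomial over the whole patch: a smooth (low-pass)
    # estimate of the illumination
    yy, xx = np.mgrid[0:h, 0:w]
    T = np.stack([(xx / w)**i * (yy / h)**j for i in range(4) for j in range(4 - i)],
                 axis=-1)
    return T @ beta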
Using quotient-image reillumination, the new image values
\breve{I}_R^{(1)}(x, y) are given by:

\breve{I}_R^{(1)}(x, y) = I_R^{(2)}(x, y) \frac{\hat{I}_R^{(1)}(x, y)}{\hat{I}_R^{(2)}(x, y)}
To further ensure unnoticeable pasting, we use a simple
feathering technique: on a strip of 10 pixels around the
replaced region, blending with linear weights is performed
between the old image and the pasted eye.
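Putting the two correction steps together, here is a sketch of the quotient-image reillumination followed by feathering, reusing the fit_illumination sketch above; feather_weights is a hypothetical helper (e.g. a clipped distance transform of the region mask) producing weights in [0, 1] that fall from 1 inside the region to 0 over the 10-pixel strip.

import numpy as np

def relight_and_blend(frame, pasted_eye, region_mask):
    # frame, pasted_eye: H x W x 3 float arrays aligned on the same eye patch
    relit = np.empty_like(pasted_eye, dtype=np.float64)
    for c in range(3):                                    # R, G, B handled separately
        hat_orig = fit_illumination(frame[..., c], region_mask)
        hat_new = fit_illumination(pasted_eye[..., c], region_mask)
        # quotient-image reillumination of the pasted eye
        relit[..., c] = pasted_eye[..., c] * hat_orig / np.maximum(hat_new, 1e-6)
    # linear feathering over a 10-pixel strip around the replaced region
    w = feather_weights(region_mask, width=10)[..., None]  # hypothetical helper
    return w * relit + (1.0 - w) * frame.astype(np.float64)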
6. Results
We have tested our system on a number of sequences.
All sequences are captured at 1280 × 1024 pixels at 25
frames per second. The offline database consists of 200
images of 14 individuals with manually specified eye mod-
els. At the beginning of the videoconferencing session, we
asked each user to look straight at the camera for a couple of
seconds. These frames became the online database that was
used for the remainder of that individual's session.
Fig. 7 shows frames from some of the video sequences
we processed. We properly detect the eye and seamlessly
replace it with a direct gaze eye image. In addition, the sys-
tem can automatically detect blinks. During blinks the input
image remains unaltered. Also demonstrated are some of
the limitations of the system. For example, in the third row,
right image, we observe that the person is looking sideways,
but not down. Still, our system replaces the eye with a direct
gaze image. Also, in the bottom right image, the person is
twitching his left eye. We correctly replace the eye, but also
affect some of the surrounding skin.
The system also maintains a very high level of temporal
consistency with no flickering (i.e., the eye does not change

References

D. G. Lowe. Distinctive image features from scale-invariant keypoints.
P. Viola and M. J. Jones. Robust real-time face detection.
P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition.
R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces.
A. L. Yuille, P. W. Hallinan, and D. S. Cohen. Feature extraction from faces using deformable templates.