
An Eye for an Eye: A Single Camera Gaze-Replacement Method
Lior Wolf
The Blavatnik School of Computer Science
Tel-Aviv University
Ziv Freund, Shai Avidan
Department of Electrical Engineering-Systems
Faculty of Engineering
Tel-Aviv University
Abstract
The camera in video conference systems is typically po-
sitioned above, or below, the screen, causing the gaze of
the users to appear misplaced. We propose an effective so-
lution to this problem that is based on replacing the eyes
of the user. This replacement, when done accurately, is
enough to achieve a natural looking video. At an initializa-
tion stage the user is asked to look straight at the camera.
We store these frames, then track the eyes accurately in the
video sequence and replace the eyes, taking care of illumi-
nation and ghosting artifacts. We have tested the system on
a large number of videos demonstrating the effectiveness of
the proposed solution.
1. Introduction
Videoconferencing systems hold the promise of allowing
natural interpersonal communication at a distance. Recent
advances in video quality and the adoption of large high-
definition screens are contributing to a more impressive user
experience. However, to achieve the desired impact of be-
ing in the same room, the problem of gaze offset must be
addressed.
This gaze problem arises because the user is looking at
the screen, while the camera(s) capturing the user are posi-
tioned elsewhere. As a result, even when the user is looking
straight into the image of his call partner, the gaze, as per-
ceived at the other side, does not meet the partner’s eyes.
Typically, the camera is located on top of the screen, and
the effect is interpreted as looking down.
Our system solves this problem by replacing the eyes of
a person in the video with eyes that look straight ahead. We
use an example based synthesis that is based on capturing
the eyes at an initial training stage. During the videoconfer-
encing session, we find and track the eyes. In every frame,
the eyes captured during training are accurately pasted to
create an illusion of straight-looking gaze. Somewhat sur-
prisingly, the resulting effect (Figure 1) of replacing the
eyes alone looks natural.
Figure 1. Two of the four images contain artificially replaced eyes
to create an effect of a person looking straight forward. These eyes
were automatically placed by our system. The other two images
were untouched. Can you tell which images were modified?
The answer is in footnote 1.
2. Previous work
The importance of natural gaze, and the problem of gaze
offset was presented in depth by Gemmell et al. [7], where
a synthesis framework was presented. The solved problem
is somewhat more general than what we aim to solve and
includes turning the entire head by means of rendering a 3D
head model.
Footnote 1: Images (a) and (d) are real; images (b) and (c) are modified.
With regard to the eyes, it was suggested that the location
of the eyes and the gaze direction would be estimated
by a “vision component”. The eyes would then be replaced
with a synthetic pair of eyes gazing in the desired direction.
Writing in the year 2000, the authors conclude that the
main difficulty they face is that the vision component is slow
and inaccurate, and suggest using an infrared-based vision
system until computer vision “comes up to speed”.
Despite great advances in object detection and tracking
in the last decade, the accurate commercial eye trackers that
are in existence are still based on infrared images in which
the corneal reflections and the center of the pupils are easily
detected.
In our work we use an eye model similar to the one pro-
posed by [16]. This model contains parameters such as the
center of the eye, the radius of the iris and so on. The au-
thors define an energy function which relates image features
to the geometric structure of the eye contours. Localization
is performed by the steepest descent method.
In [8] an automatic initialization method is proposed
based on the corners of the eye and the computation of the
model fitting process is sped up. In [9] a model for blinking
is introduced, and a KLT tracker is employed to track the
eye model over time.
Recently, [13] proposed a multi-stage process in which
the iris, upper eyelid and lower eyelid are detected sequen-
tially. A very accurate model is detected by Ding and
Martinez [4], who observe that classifier-based approaches
by themselves may be unsuitable for the task due to the
large variability in the shape of facial features.
There are many contributions in which the corners of the
eyes are detected; however, a detailed model is not recov-
ered. In this work we employ the method of Everingham et
al. [5] in order to obtain an initial localization for the eyes.
An alternative solution for the problem of gaze manipu-
lation is view synthesis from multiple cameras. In [3] dy-
namic programming based disparity estimation was used to
generate a middle view from two cameras that were posi-
tioned on the left and right sides of the screen. The ad-
vantage of such a method is that the generated view cor-
responds to a true view, while our solution generates only
a natural looking fake. The disadvantage, of course, is the
need to use multiple cameras positioned at locations that are
suitable for this purpose.
3. System overview
The core of our system is an accurate eye detector that
takes an image of a face and returns an accurate position of
the eye. Once we have the eye position we can replace it
with an image of an eye with a proper gaze direction.
To achieve this goal we learn a regression function that
maps face images to the eye model parameters, using a
database of annotated images. This stage is not accurate
enough on its own, so we follow it with a refinement stage.
Figure 2. An illustration of the eye model employed in the gaze
replacement system.
The system consists of three stages. An offline stage
where a rough eye pose estimation is learned from a
database of annotated images. This database is a general
database, unrelated to any particular videoconferencing
session, and we refer to it below as the offline database. A
second stage occurs at the beginning of the video confer-
encing session where the user looks straight at the camera.
This allows the system to construct a database of direct gaze
images. We refer to this as an online database. Finally, in
run-time the system constantly detects the location of the
eye in the current frame, finds the most similar image in the
database and replaces it.
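To make this flow concrete, here is a minimal sketch of the per-frame runtime loop; the function and field names are illustrative only (none are taken from the paper), and each step is detailed in the sections below.

def process_frame(frame, online_db, state):
    # state carries the eye models of the first frame and of the previous frame
    if state.is_first_frame or state.just_left_blink:
        init = best_correlated_eyes(frame, online_db)   # Section 5.1: first-frame detection
    else:
        init = state.previous_eye_models                # Section 5.2: tracking
    models, ok = refine_eye_models(frame, init)         # local search over the six parameters
    if not ok:                                          # search failed: treat as a blink
        state.enter_blink()
        return frame                                    # leave the frame untouched
    replacement = online_db.closest_pair(models)        # straight-gaze eyes from the online database
    output = paste_eyes(frame, replacement, models)     # warp, relight, feather (Section 5.3)
    state.update(models)
    return output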
The eye model we use is a simplified model based on
the deformable template of [16, 9]. It is described by six
parameters, as depicted in Figure 2: two coordinates for the
center of the eye, x_c and y_c; the width of the eye, w; the
height of the eye above the line connecting the corners, h_1;
the maximal height below this line, h_2; and, last, the angle θ
between the eye and the image scan lines. The coordinates
of the center are given relative to the leftmost corner of the eye.
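For concreteness, the six-parameter model can be held in a small record; the following is an illustrative Python sketch (the field names are not taken from the paper).

from dataclasses import dataclass

@dataclass
class EyeModel:
    xc: float     # eye-center x, relative to the leftmost eye corner
    yc: float     # eye-center y, relative to the leftmost eye corner
    w: float      # eye width (distance between the two eye corners)
    h1: float     # maximal height above the corner-to-corner line
    h2: float     # maximal height below the corner-to-corner line
    theta: float  # in-plane angle between the eye and the image scan lines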
3.1. Offline Training stage
The direct gaze models are learned based on previous
annotated examples. These examples are stored in an offline
database that is used to bootstrap the runtime system. It
consists of a separate set of 14 individuals for whom videos
were captured and manually annotated in accordance with
the six-parameter model.
3.2. Online Database construction
Given images of a new person looking directly into the
camera, the closest eye in the offline database is retrieved
and then serves as an initial guess as to the eye parameters
in the new image. These parameters are then adapted to
provide a more accurate estimation.

3.3. Runtime eye localization
At runtime, the parameters of the eyes in the video are
estimated by matching the image to the eyes of the person
when looking straight. The illumination is corrected, and a
replacement is performed.
4. Training phase
Given a video of a person looking directly into the cam-
era we aim to find the eye parameters. This is done for each
frame independently, thereby collecting pairs of straight
looking eyes of that person.
The first stage in the processing of each frame is the lo-
calization of the corners of the eyes. This is done using the
method of [5], which describes the geometric distribution of
facial features as a tree-structured mixture of Gaussians [6],
and captures appearance by Haar-wavelet like features [14].
After the corners of the eyes are localized, a rectangular
region of interest which approximately captures the eye
region is constructed for each eye. Let w be the distance
between the two corners of the eye. The rectangle of each
eye is a combination of a strip of height h_1 = 0.4w above
the line connecting the two corners and a strip of height
h_2 = 0.3w below this line.
Each region of interest (ROI) is scaled to an image of
120 × 80 pixels. Then, SIFT [11] descriptors are computed
at 24 evenly spaced points in the stretched image.
From the offline database of manually annotated eyes,
we select the left and right eyes with the closest appear-
ance descriptor. This yields an approximate model. In
Leave-One-Person-Out experiments conducted on the of-
fline database, the average absolute error was found to be
about 3 pixels for the center of the eye and 6 pixels for the
width of the eye.
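A minimal sketch of this retrieval step, assuming OpenCV's SIFT implementation, the 120 × 80 ROI, and a 6 × 4 grid for the 24 sample points (the grid layout and the database representation are assumptions, not specified above):

import cv2
import numpy as np

ROI_W, ROI_H = 120, 80

def eye_descriptor(gray_frame, roi_rect):
    # roi_rect = (x, y, w, h): rectangle derived from the detected eye corners
    x, y, w, h = roi_rect
    roi = cv2.resize(gray_frame[y:y + h, x:x + w], (ROI_W, ROI_H))
    # SIFT descriptors at 24 evenly spaced points (here a 6 x 4 grid)
    xs = np.linspace(10, ROI_W - 10, 6)
    ys = np.linspace(10, ROI_H - 10, 4)
    keypoints = [cv2.KeyPoint(float(px), float(py), 16) for py in ys for px in xs]
    _, desc = cv2.SIFT_create().compute(roi, keypoints)
    return desc.ravel()

def nearest_offline_eye(desc, offline_descs, offline_models):
    # offline_descs: N x D matrix of descriptors of the annotated offline eyes
    # offline_models: the corresponding six-parameter eye models
    dists = np.linalg.norm(offline_descs - desc, axis=1)
    return offline_models[int(np.argmin(dists))]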
Table 1 presents the regression results for the left eye,
and Table 2 presents the results obtained for the right eye.
We compare two representations: the SIFT representation
and the vector of gray values in the eye rectangle, resized to
120 pixels times 80 pixels. We also compare two machine
learning algorithms: Support Vector Regression and Near-
est Neighbor. SIFT obtains better performance, especially
for the Nearest Neighbor predictor.
This initial model is then refined by performing a local
search in the space of the 6 parameters. The search is con-
ducted in a range of twice the average error in each param-
eter, and in a coarse to fine manner.
For each set of candidate parameter values, the closest
eye in the offline database is translated, rotated and
stretched in accordance with the difference in parameters:
the center is shifted by the difference in x_c and y_c, the width
is scaled by the ratio of the w values, and the regions above
and below the line connecting the two corners are stretched
in accordance with h_1 and h_2, respectively.
Table 1. Mean (± standard deviation) error in pixels for each parameter
of the eye model for the experiments of the left eye. The errors were
estimated in a leave-one-person-out fashion on the offline dataset. Two
image representations (gray values and SIFT) and two learning algorithms
(Nearest Neighbor and Support Vector Regression) are presented.

Parameter   Gray values, NN   Gray values, SVR   SIFT, NN       SIFT, SVR
x_c         4.26 ± 3.62       3.08 ± 2.3         3.02 ± 2.47    3.08 ± 2.3
y_c         4.3 ± 2.96        3.84 ± 2.92        2.73 ± 1.8     3.23 ± 2.72
w           7.83 ± 6.16       6.86 ± 6.51        6.95 ± 6.52    6.47 ± 6.79
h_1         3.79 ± 2.25       3.35 ± 2.28        3.58 ± 3.04    3.35 ± 2.28
h_2         3.22 ± 2.73       2.93 ± 2.51        2.45 ± 1.72    2.68 ± 2.56
θ           0.10 ± 0.06       0.08 ± 0.04        0.08 ± 0.06    0.07 ± 0.05
Table 2. Mean (± standard deviation) error in pixels for each parameter
of the eye model for the experiments of the right eye. See Table 1 for
details.

Parameter   Gray values, NN   Gray values, SVR   SIFT, NN       SIFT, SVR
x_c         3.69 ± 2.99       7.49 ± 4.66        3.76 ± 3.06    5.72 ± 4.42
y_c         3.62 ± 2.89       3.01 ± 2.62        2.84 ± 2.01    2.91 ± 2.54
w           8.03 ± 6.26       6.13 ± 4.7         5.24 ± 4.48    5.81 ± 4.89
h_1         3.28 ± 2.77       2.94 ± 2.4         2.35 ± 1.89    2.89 ± 2.37
h_2         2.4 ± 1.88        2.05 ± 1.71        2.28 ± 2.04    2.05 ± 1.71
θ           0.07 ± 0.05       0.06 ± 0.05        0.076 ± 0.05   0.06 ± 0.05
Finally, the database eye image is rotated by the difference
in the θ values between the database image and the candidate
parameter value.
As a matching score we use the normalized cross correla-
tion measure between the warped database eye and the eye
in the new directly looking video. The ROI for this com-
parison is the region of the eye, slightly enlarged, and not a
rectangular frame.
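The refinement can be viewed as a coarse-to-fine grid search over the six parameters, scored by normalized cross-correlation; the sketch below is schematic, and the step schedule and the warp_to_params helper (and its mask output) are assumptions rather than details given in the paper.

import numpy as np
from itertools import product

def ncc(a, b):
    # normalized cross-correlation between two equally sized pixel vectors
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def refine(init_params, db_eye_img, target_img, ranges, levels=3):
    # init_params: dict of the six parameters; ranges: per-parameter search
    # half-range (roughly twice the average regression error)
    best, best_score = dict(init_params), -np.inf
    keys = list(init_params)
    for level in range(levels):
        step = {k: ranges[k] / (2 ** level) for k in keys}
        center = dict(best)
        for offsets in product(*[(-step[k], 0.0, step[k]) for k in keys]):
            cand = {k: center[k] + d for k, d in zip(keys, offsets)}
            # warp_to_params (hypothetical): translate, rotate and stretch the
            # database eye so that its model matches the candidate parameters
            warped, mask = warp_to_params(db_eye_img, cand)
            score = ncc(warped[mask], target_img[mask])
            if score > best_score:
                best, best_score = cand, score
    return best, best_score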
A threshold is used to determine cases in which the
method of searching for the eye parameters failed to pro-
duce good database-to-image matches. Typically, the pro-
cess produces one pair of eyes for 80% of the frames in the
training video.
5. Runtime System
The runtime system replaces the eyes whenever they are
open. During blinks no replacement is done. Right after
the blink, the system is reinitialized and starts similarly to
the first frame. Remarkably, the lack of intervention dur-
ing blink frames does not seem to hurt the quality of the
resulting video.

Figure 3. (a-c) Samples of successful facial feature point detection obtained
using the method of [5]. The ROIs around the eyes are marked by black
rectangles. (d) An example of a failure case in detection.
Figure 4. Initially fitted model (red) and refined model obtained by the
search procedure (green). (a) During the online database construction, the
initial guess is the nearest neighbor (for each eye separately) in the SIFT
feature space from among the training examples of the offline database.
(b) In the first frame of the videoconferencing session, the initial guess for
each eye is the highest-correlated example in the online database. (c) In the
tracking phase, the initial guess is the eye model of the previous frame.
(d) The two eyes (from two different individuals) from the offline database
that were used as the initial guess for the eye model of (a). (e) The initial
models for (b) are taken from this frame of the online database. (f) The
initial model for (c) is taken from the previous video frame, shown here.
Table 3. Mean (± standard deviation) error in pixels for each parameter of
the eye model after the refinement stage. The errors were estimated in a
leave-one-out fashion on the offline dataset.

Parameter   Left eye       Right eye
x_c         2.04 ± 1.80    1.75 ± 1.52
y_c         1.86 ± 1.72    1.60 ± 2.02
w           6.66 ± 5.03    4.29 ± 3.70
h_1         3.68 ± 3.42    2.54 ± 1.93
h_2         2.22 ± 1.95    2.20 ± 1.83
θ           0.08 ± 0.06    0.06 ± 0.04
5.1. Eye detection
In the first frame, the best-matching set of eyes is
searched for. This is done using normalized correlation
measurements to compare the learned eye models to an
eye-shaped ROI situated between the corners of the eyes
and with a width of 0.7w. Notice that we do not use SIFT
descriptors here, since our goal is to find eyes that are sim-
ilar in appearance. Such eyes are more likely to produce
minimal artifacts during replacement.
After the closest direct-looking pair of eyes is found, it is
morphed in order to better fit the new frame. This is
done by a search process over the six eye parameters, simi-

lar to the search done during training. Here, unlike the train-
ing phase, the normalized cross correlation matching score
is used.
Figure 4(b) shows the model eyes and how they were fit
to the new frame before and after the search stage.
5.2. Eye tracking
The eye parameters must be estimated for every frame.
Given a new frame, the eye parameters of the previous
frame serve as an initial guess, and a search (i.e. refine-
ment) process is once again conducted. The matching score
is composed of two components: one considers the sum of
squared differences (SSD) between the eye in the current
frame and the eye in the previous frame, where the latter
is warped to the new frame in accordance with the differ-
ence in the eye parameters; the second considers the SSD
between the current eye and the warped eye from the first
frame.
SSD is used since illumination changes between consec-
utive frames are expected to be small and since it is conve-
nient to combine multiple SSD scores. The combined cost
term minimizes drift over time. Without it, small tracking
errors would accumulate; a noticeable example would be
the eye gradually shrinking (see Figure 5).
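A sketch of the combined tracking cost under this description; the warping helpers are hypothetical, and the relative weight of the two terms is an assumption (the weighting of the two SSD components is not specified above).

import numpy as np

def tracking_cost(cand_params, frame, prev_frame, prev_params,
                  first_frame, first_params, alpha=0.5):
    # Eye pixels of the current frame under the candidate parameters
    cur = extract_eye(frame, cand_params)                 # hypothetical helper
    # Previous-frame and first-frame eyes, warped to the candidate parameters
    prev_w = warp_eye(prev_frame, prev_params, cand_params)
    first_w = warp_eye(first_frame, first_params, cand_params)
    ssd_prev = np.sum((cur - prev_w) ** 2)                # frame-to-frame term
    ssd_first = np.sum((cur - first_w) ** 2)              # anti-drift term
    return alpha * ssd_prev + (1.0 - alpha) * ssd_first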
When the search process performed during tracking fails,
a blink is declared, and the system enters a blink mode.
While in this mode, the eye parameters do not adapt, and
no eye replacement takes place. During every blink frame,
tracking based on the last detected eye model is attempted.
Once this tracking is successful for both eyes for at least 2
consecutive frames, the blink state is terminated.
The first frame after the blink mode is treated as the first
frame of the video sequence, in order to allow the system to
move to a more suitable set of eyes. In case the eye corner
detector fails, the last model of an open eye is used to ini-
tialize the eye model search process. Although the pair of
replacement eyes does not change in between blinks, this is
unnoticeable in the output.
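The blink handling can be summarized as a small two-state machine; the sketch below paraphrases the behavior described above, with the two-frame exit condition from the text.

OPEN, BLINK = "open", "blink"

class BlinkState:
    def __init__(self, frames_to_exit=2):
        self.state = OPEN
        self.frames_to_exit = frames_to_exit
        self.good_frames = 0

    def update(self, search_succeeded, both_eyes_tracked):
        if self.state == OPEN and not search_succeeded:
            self.state, self.good_frames = BLINK, 0       # refinement failed: blink begins
        elif self.state == BLINK:
            # during a blink, keep trying to track with the last open-eye model
            self.good_frames = self.good_frames + 1 if both_eyes_tracked else 0
            if self.good_frames >= self.frames_to_exit:
                self.state = OPEN                         # next frame is treated like the first
        return self.state == OPEN                         # replace eyes only while open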
5.3. Eye replacement
The eye replacement is done by pasting the warped set of
model eyes onto the eye location in the image. The warping
is done in order to adjust the parameters of the model eyes
to the parameters of the actual eyes in the video frame.
In order to eliminate artifacts caused by this pasting
operation, several measures are taken. First, the eyes are
enlarged by a factor of 10% in the vertical direction. This
compensates for underestimation of the height of the eye due
to the change in gaze between the model eyes and the actual
eyes, and, in addition, makes sure that there are no residual
pixels from the original eye.
A second measure done in order to ensure smooth inte-
gration is the adjustment of image illumination. To this end,
a low pass quotient image [12, 10, 15] is applied, similar to
what is done in [2] for swapping entire faces in images.
For both the original pixels to be replaced and the new
eye to be pasted, we estimate the illumination by fitting a
third order polynomial to the image data [1].
Denote the red value of pixel (x, y) in the original video
frame as I_R^{(1)}(x, y), and the value in the pasted eye image
as I_R^{(2)}(x, y). Let \hat{I}_R^{(1)}(x, y) and \hat{I}_R^{(2)}(x, y) be the
corresponding values of the fitted low-order polynomials:

\hat{I}_R^{(1)}(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3-i} \beta_{R,ij}^{(1)} x^i y^j

\hat{I}_R^{(2)}(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3-i} \beta_{R,ij}^{(2)} x^i y^j
The fitting is done for each of the three channels (R, G, B)
separately using a least-squares system of equations (10
unknown β coefficients per image per channel).
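A sketch of the per-channel least-squares fit (10 monomials x^i y^j with i + j ≤ 3); the normalization of pixel coordinates is an implementation choice not specified above.

import numpy as np

def fit_illumination(channel, mask):
    # channel: 2-D array with one color channel; mask: boolean eye-region mask
    h, w = channel.shape
    ys, xs = np.nonzero(mask)
    x, y = xs / w, ys / h                     # normalized coordinates (implementation choice)
    # monomials x^i * y^j with i + j <= 3 -> 10 coefficients per channel
    A = np.stack([x**i * y**j for i in range(4) for j in range(4 - i)], axis=1)
    beta, *_ = np.linalg.lstsq(A, channel[ys, xs].astype(np.float64), rcond=None)
    # evaluate the fitted polynomial over the whole patch: a smooth (low-pass)
    # estimate of the illumination
    yy, xx = np.mgrid[0:h, 0:w]
    T = np.stack([(xx / w)**i * (yy / h)**j for i in range(4) for j in range(4 - i)],
                 axis=-1)
    return T @ beta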
Using quotient-image reillumination, the new image values
\breve{I}_R^{(1)}(x, y) are given by:

\breve{I}_R^{(1)}(x, y) = I_R^{(2)}(x, y) \frac{\hat{I}_R^{(1)}(x, y)}{\hat{I}_R^{(2)}(x, y)}
To further ensure unnoticeable pasting, we use a simple
feathering technique: on a strip of 10 pixels around the
replaced region, blending with linear weights is performed
between the old image and the pasted eye.
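Putting the two correction steps together, here is a sketch of the quotient-image reillumination followed by feathering, reusing the fit_illumination sketch above; feather_weights is a hypothetical helper (e.g. a clipped distance transform of the region mask) producing weights in [0, 1] that fall from 1 inside the region to 0 over the 10-pixel strip.

import numpy as np

def relight_and_blend(frame, pasted_eye, region_mask):
    # frame, pasted_eye: H x W x 3 float arrays aligned on the same eye patch
    relit = np.empty_like(pasted_eye, dtype=np.float64)
    for c in range(3):                                    # R, G, B handled separately
        hat_orig = fit_illumination(frame[..., c], region_mask)
        hat_new = fit_illumination(pasted_eye[..., c], region_mask)
        # quotient-image reillumination of the pasted eye
        relit[..., c] = pasted_eye[..., c] * hat_orig / np.maximum(hat_new, 1e-6)
    # linear feathering over a 10-pixel strip around the replaced region
    w = feather_weights(region_mask, width=10)[..., None]  # hypothetical helper
    return w * relit + (1.0 - w) * frame.astype(np.float64)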
6. Results
We have tested our system on a number of sequences.
All sequences are captured at 1280 × 1024 pixels at 25
frames per second. The offline database consists of 200
images of 14 individuals with manually specified eye mod-
els. At the beginning of the videoconferencing session, we
asked each user to look straight at the camera for a couple of
seconds. These frames became the online database that was
used for the remainder of that individual's session.
Fig. 7 shows frames from some of the video sequences
we processed. We properly detect the eye and seamlessly
replace it with a direct gaze eye image. In addition, the sys-
tem can automatically detect blinks. During blinks the input
image remains unaltered. Also demonstrated are some of
the limitations of the system. For example, in the third row,
right image, we observe that the person is looking sideways,
but not down. Still, our system replaces the eye with a direct
gaze image. Also, in the bottom right image, the person is
twitching his left eye. We correctly replace the eye, but also
affect some of the surrounding skin.
The system also maintains a very high level of temporal
consistency with no flickering (i.e., the eye does not change

References

D. G. Lowe. Distinctive image features from scale-invariant keypoints.
P. Viola and M. J. Jones. Robust real-time face detection.
P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition.
R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces.
A. L. Yuille, P. W. Hallinan, and D. S. Cohen. Feature extraction from faces using deformable templates.