UvA-DARE (Digital Academic Repository), University of Amsterdam (https://dare.uva.nl)

Citation (APA): Dibeklioğlu, H., Salah, A. A., & Gevers, T. (2013). Like Father, Like Son: Facial Expression Dynamics for Kinship Verification. In 2013 IEEE International Conference on Computer Vision (ICCV 2013), 1-8 December 2013, Sydney, NSW, Australia (pp. 1497-1504). IEEE Computer Society. https://doi.org/10.1109/ICCV.2013.189

Document version: author accepted manuscript.

Like Father, Like Son: Facial Expression Dynamics for Kinship Verification
Hamdi Dibeklioğlu¹,², Albert Ali Salah³, and Theo Gevers¹

¹ Intelligent Systems Lab Amsterdam, University of Amsterdam, Amsterdam, The Netherlands
² Pattern Recognition & Bioinformatics Group, Delft University of Technology, Delft, The Netherlands
³ Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey

h.dibeklioglu@tudelft.nl, salah@boun.edu.tr, th.gevers@uva.nl
Abstract
Kinship verification from facial appearance is a difficult problem. This paper explores the possibility of employing facial expression dynamics in this problem. By using features that describe facial dynamics and spatio-temporal appearance over smile expressions, we show that it is possible to improve the state of the art in this problem, and verify that it is indeed possible to recognize kinship by resemblance of facial expressions. The proposed method is tested on different kin relationships. On average, 72.89% verification accuracy is achieved on spontaneous smiles.
1. Introduction
Automatic detection of kinship from facial appearance is a difficult problem with several applications, including social media analysis [20, 21], finding missing children and child adoption cases [9], and coaching for imitation and personification. Kinship is a genetic relationship between two family members, including parent-child, sibling-sibling, and grandparent-grandchild relations. Since a genetic test may not always be available for checking kinship, an unobtrusive and rapid computer vision solution is potentially very useful. This paper proposes such a novel approach for kinship detection.

Kinship may need to be verified between people of different sex and different ages (e.g. father-daughter), which makes this problem especially challenging. Humans use an aggregate of different features to judge kinship from facial images [1]. Furthermore, depending on the age of the person assessed for kinship, humans use different sets of features, consistent with the expected aging-related changes in facial form. For example, upper-face cues are used more prominently for children, as the lower face does not fully form until adulthood [13]. Automatic kinship detection methods also employ aggregate sets of features, including color, geometry, and appearance. In Section 2 we summarize the recent related work in this area.
All the methods proposed so far to verify kinship work with images. In contrast, in this paper we propose a method that uses facial dynamics to verify kinship from videos. Our approach makes intuitive sense: we all know people who do not look like their parents until they smile. Furthermore, the findings of [14] show that the appearance of spontaneous facial expressions of born-blind people and their sighted relatives is similar. However, the resemblance between facial expressions depends not only on the appearance of the expression but also on its dynamics, as each expression is created by a combination of voluntary and involuntary muscle movements. This is the key insight behind this paper. We verify this insight empirically, and show that dynamic features obtained during facial expressions have discriminatory power for kinship verification. This is the first work that uses dynamic features for kinship detection. By combining dynamic and spatio-temporal features, we approach the problem of automatic kinship verification. We use the recently collected UvA-NEMO Smile Database [3] in our experiments, compare our method with three recent approaches from the literature [8, 9, 21], and report state-of-the-art results.
2. Related work
In one of the first works on kinship verification, Fang et al. used skin, hair, and eye color, facial geometry measures, as well as holistic texture features computed on texture gradients of the whole face [8]. They selected the most discriminative inherited features. Color-based features performed better than the other features in general, since a good registration between individual face images was largely lacking in their approach. In the present study, we use their approach as a baseline under controlled registration conditions.
Different feature descriptors have been evaluated for the kinship verification problem in the literature. In [9], eye, mouth, and nose regions are matched via DAISY descriptors. During matching, good matches are not expected on all features but only on some; therefore, typically, the top few matching features are used for verification. In [21], Gabor-based Gradient Orientation Pyramid (GGOP) descriptors are proposed and used to model facial appearance for kinship verification. Support vector machines (SVM) with radial basis function kernels are used as the classifier. A mean accuracy of around 70% is reported on 800 image pairs, which is well within the range of human kinship estimation. In [11], the Self Similarity Representation of Weber face (SSRW) algorithm is proposed. Each face is represented by only its reflectance, and difference-of-Gaussian filters are used to select keypoints that represent each face. SVM classifiers with different kernel functions are contrasted, and a linear kernel is found to be the most suitable. While SVM seems to be the classifier of choice for kinship verification, in [12] a metric learning approach is adopted: samples that have the kinship relation are pulled close, and other samples are pushed apart. In this space, the transformation is complemented by defining a margin for kinship.
The evaluation protocols used for the kinship verification problem typically make use of pairs of photographs, where each pair is either a positive sample (i.e. kin) or a negative one. In [9], 100 face pairs with kinship and 100 pairs without are selected from family photos; there is no decomposition of results into specific kinship categories. In [8], [21], and [20], photos of celebrities were downloaded from the Internet. In these studies, as well as in [12], four kinship relations (Father-Son, Father-Daughter, Mother-Son, and Mother-Daughter) are analyzed separately. The largest database reported in the literature so far is the KinFaceW-II image database, with 250 pairs of kinship relations for each of these four categories.
In [14], Peleg et al. analyze the spontaneous facial expressions of born-blind people and their sighted relatives. They show that such expressions carry a unique family signature. Occurrences of a set of facial movements are used to classify the families of blind subjects. Results show 64% correct classification on average, with 60% for joy expressions. These results justify our motivation. Although [14] focused on facial movements for this task, it did not analyze the dynamics of expressions in terms of duration, intensity, speed, and acceleration, which is an empirical contribution of this paper.
3. Method
In this paper, we propose to combine spatio-temporal facial features and facial expression dynamics for kinship verification. To this end, videos of enjoyment smiles are used. Our system analyzes the entire duration of a smile, starting from a moderately frontal and neutral face, through the unfolding of the smile, to the return to the neutral state. Unlike other approaches proposed in the literature, our method works with videos of faces rather than images. This is the first approach using videos for kinship verification.
Figure 1. (a) The facial feature points used in this study with their indices, (b) the 3D mesh model and visualization of the amplitude signals, which are defined as the mean of the left/right amplitude signals on the face. For simplicity, visualizations are shown on a single side of the face.
We summarize the proposed method here. Our approach starts with face detection in the first frame and the localization of 17 facial landmarks, which are subsequently tracked during the rest of the video. Using the tracked landmarks, displacement signals of the eyebrows, eyelids, cheeks, and lip corners are computed. Afterwards, the mean displacement signal of the lip corners is analyzed and the three main temporal phases of the smile (i.e. onset, apex, and offset) are estimated. Then, facial expression dynamics of the eyebrows, eyelids, cheeks, and lip corners are extracted from each phase separately. To describe the change in appearance between the neutral and the expressive face (i.e. the apex of the expression), temporal Completed Local Binary Pattern (CLBP) descriptors are computed from the eye, cheek, and lip regions. After a feature selection step, the most informative dynamic features are identified and combined with the temporal CLBP features. Finally, the resulting features are classified using SVMs. In the rest of this section we provide more detailed information on each of these steps.
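For concreteness, a minimal sketch of the final classification step could look as follows. The absolute-difference pairing of the two subjects' feature vectors and the RBF kernel are assumptions made only for illustration; they are not details taken from this paper.

```python
# Minimal sketch (not the authors' exact pipeline): verify kinship for a pair of
# smile videos by combining the per-video feature vectors and classifying the
# pair with an SVM. The absolute-difference pairing and the RBF kernel are
# assumptions; the paper only states that the selected dynamic features are
# concatenated with temporal CLBP features and classified using SVMs.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def pair_features(dyn_a, clbp_a, dyn_b, clbp_b):
    """Combine the features of two subjects into one pair descriptor."""
    feat_a = np.concatenate([dyn_a, clbp_a])
    feat_b = np.concatenate([dyn_b, clbp_b])
    return np.abs(feat_a - feat_b)  # assumed pairing scheme

def train_verifier(X, y):
    """X: one row per video pair, y: 1 = kin, 0 = not kin."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    clf.fit(X, y)
    return clf
```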
3.1. Landmark detection and tracking
Both the correct detection and the accurate tracking of facial landmarks are crucial for normalizing and aligning faces, and for extracting consistent dynamic features. In the first frame of the input video, 17 facial landmarks (i.e. centers of eyebrows, eyebrow corners, eye corners, centers of upper eyelids, cheek centers, nose tip, and lip corners) are detected using a recent landmarking approach [4] (see Fig. 1(a)). This method models Gabor wavelet features of a neighborhood of the landmarks using incremental mixtures of factor analyzers, and uses a shape prior to ensure the integrity of the landmark constellation. It follows a coarse-to-fine strategy: landmarks are initially detected at a coarse level and then fine-tuned at higher resolution. These points are then tracked by a piecewise Bézier volume deformation (PBVD) tracker [18] during the rest of the video.

Initially, the PBVD tracker warps a generic 3D mesh model of the face (see Fig. 1(b)) to fit the facial landmarks in the first frame of the image sequence. The generic face model is formed by 16 surface patches. These patches are embedded in Bézier volumes to guarantee the continuity and smoothness of the model. A point x(u, v, w) in a Bézier volume can be defined as

x(u, v, w) = \sum_{i=0}^{n} \sum_{j=0}^{m} \sum_{k=0}^{l} b_{i,j,k} \, B_i^n(u) \, B_j^m(v) \, B_k^l(w),    (1)

where the control points are denoted by b_{i,j,k}, and the mesh variables 0 < {u, v, w} < 1 control the shape of the volume. B_i^n(u) denotes a Bernstein polynomial, and can be written as

B_i^n(u) = \binom{n}{i} u^i (1 - u)^{n-i}.    (2)

Once the face model is fitted, the 3D motion of the head, as well as the individual motions of the facial landmarks, can be tracked based on the movements of the mesh points. 2D movements on the face (estimated by template matching between frames, at different resolutions) are modeled as a projection of the 3D movement onto the image plane. The 3D movement is then calculated from the projective motion of several points.
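Equations (1) and (2) can be evaluated directly. The short sketch below computes a single Bézier-volume point from its control points; the (n+1, m+1, l+1, 3) control-point layout is an assumption for illustration, and the PBVD tracker itself is not reproduced here.

```python
# Minimal numeric sketch of Eqs. (1)-(2): evaluate one point of a Bezier volume
# from its control points. Control-point array layout (n+1, m+1, l+1, 3) is an
# assumption for illustration only.
import numpy as np
from math import comb

def bernstein(n, i, u):
    """B_i^n(u) = C(n, i) * u^i * (1 - u)^(n - i)."""
    return comb(n, i) * (u ** i) * ((1.0 - u) ** (n - i))

def bezier_volume_point(ctrl, u, v, w):
    """x(u, v, w) = sum_{i,j,k} b_{i,j,k} B_i^n(u) B_j^m(v) B_k^l(w)."""
    n, m, l = (s - 1 for s in ctrl.shape[:3])
    x = np.zeros(3)
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(l + 1):
                x += ctrl[i, j, k] * bernstein(n, i, u) * bernstein(m, j, v) * bernstein(l, k, w)
    return x
```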
3.2. Registration
Faces in each frame need to be aligned before the feature extraction step. To this end, the 3D pose of the face is estimated and normalized using the tracked 3D landmarks \tilde{l}_i (see Fig. 1(a)). Since a plane can be constructed from three non-collinear points, three stable landmarks (the eye centers and the nose tip) are used to define a normalizing plane P. The eye centers c_1 = (\tilde{l}_7 + \tilde{l}_9)/2 and c_2 = (\tilde{l}_{10} + \tilde{l}_{12})/2 are the middle points between the inner and outer eye corners. Then, the angles between the positive normal vector of P and the unit vectors on the X (horizontal), Y (vertical), and Z (perpendicular) axes give the relative head pose. The computed angles \theta_z and \theta_y give the exact roll and yaw angles of the face with respect to the camera, respectively. Nevertheless, the estimated pitch angle \theta_x is a subject-dependent measure, since it depends on the constellation of the eye corners and the nose tip. If the face in the first frame is assumed to be approximately frontal, the actual pitch angle \theta'_x can be calculated by subtracting this initial value. After estimating the pose of the head, the tracked landmarks are normalized with respect to rotation, scale, and translation. The aligned points l'_i are defined as

l'_i = \left( \tilde{l}_i - \frac{c_1 + c_2}{2} \right) R(-\theta'_x, -\theta_y, -\theta_z) \, \frac{100}{\rho(c_1, c_2)},    (3)

R(\theta_x, \theta_y, \theta_z) = R_x(\theta_x) R_y(\theta_y) R_z(\theta_z),    (4)

where R_x, R_y, and R_z are the 3D rotation matrices for the given angles, and \rho denotes the Euclidean distance between the given points. On the normalized face, the middle point between the eye centers is located at the origin and the inter-ocular distance (the distance between the eye centers) is set to 100 pixels. Since the normalized face is approximately frontal with respect to the camera, we ignore the depth (Z) values of the normalized feature points l'_i and denote them as l_i.
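A minimal sketch of the alignment in Eqs. (3)-(4) is given below; the row-vector convention (points multiplied on the left of R) is an assumption made for illustration.

```python
# Sketch of Eqs. (3)-(4): center the tracked 3D landmarks on the midpoint of the
# eye centers, undo the estimated head rotation, and scale so that the
# inter-ocular distance becomes 100 pixels.
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def align_landmarks(pts, c1, c2, theta_x, theta_y, theta_z):
    """pts: (17, 3) tracked landmarks; c1, c2: 3D eye centers; angles: head pose."""
    R = rot_x(-theta_x) @ rot_y(-theta_y) @ rot_z(-theta_z)  # R(-θ'x, -θy, -θz)
    centered = pts - (c1 + c2) / 2.0
    scale = 100.0 / np.linalg.norm(c1 - c2)                  # 100 / ρ(c1, c2)
    aligned = centered @ R * scale
    # The normalized face is roughly frontal, so depth values are discarded.
    return aligned[:, :2]
```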
3.3. Temporal segmentation
In the proposed method, dynamic and spatio-temporal features are extracted from videos of smiling persons. We choose to use the smile, since it is the most frequently performed facial expression and is used to convey several different meanings such as enjoyment, politeness, fear, and embarrassment [5]. A smile can be defined as the upward movement of the lip corners, which corresponds to Action Unit 12 in the facial action coding system (FACS) [6]. Anatomically, the zygomatic major muscle contracts and raises the corners of the lips during a smile [7].

Most facial expressions are composed of three non-overlapping phases, namely the onset, apex, and offset. The onset is the initial phase of a facial expression and spans the duration from the neutral to the expressive state. The apex is the stable peak period (which may be very short) of the expression between onset and offset. Likewise, the offset is the final phase from the expressive back to the neutral state. Following the normalization step, we detect these three temporal phases of the smiles.

For this purpose, the amplitude signal S of the smile is estimated as the mean (Euclidean) distance of the lip corners to the lip center during the smile. The computed amplitude signal is then normalized by the length of the lip. Since the faces are normalized, the center and the length of the lip are calculated only once, in the first frame. Afterwards, the longest continuous increase in S is defined as the onset phase. Similarly, the offset phase is detected as the longest continuous decrease in S. The phase between the last frame of the onset and the first frame of the offset defines the apex.
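A minimal sketch of this segmentation, assuming the smile amplitude signal S is given as a smoothed 1-D array with one value per frame:

```python
# Sketch of the temporal segmentation: the onset is the longest continuous
# increase in the smile amplitude S, the offset is the longest continuous
# decrease, and the apex is the span in between.
import numpy as np

def longest_run(S, increasing=True):
    """Return (start, end) frame indices of the longest monotone run in S."""
    best = (0, 0)
    start = 0
    for t in range(1, len(S)):
        ok = S[t] > S[t - 1] if increasing else S[t] < S[t - 1]
        if not ok:
            start = t
        if t - start > best[1] - best[0]:
            best = (start, t)
    return best

def segment_smile(S):
    onset = longest_run(S, increasing=True)
    offset = longest_run(S, increasing=False)
    apex = (onset[1], offset[0])  # last onset frame to first offset frame
    return onset, apex, offset
```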
3.4. Features
We extract two types of features from the faces. What we call dynamic features are based on the movement of landmark points in the registered faces over the expression duration; they do not contain appearance information. In contrast, what we call spatio-temporal features denotes appearance features obtained from multiple frames jointly, and thus contain both spatial and temporal appearance information. These features are explained in detail next.

3.4.1 Extraction of dynamic features
To describe the smile dynamics, we use the horizontal and vertical movements of the tracked landmarks and extract a set of dynamic features separately from different face regions. Vertical and horizontal amplitude signals are computed from the movements of the eyebrows, eyelids, cheeks, and lip corners. The (normalized) eye aperture D_eyelid and the displacements of the eyebrows D_eyebrow, cheeks D_cheek, and lip corners D_lip are estimated as follows:

D_{eyelid}(t) = \frac{\left\| \frac{l^t_7 + l^t_9}{2} - l^t_8 \right\|}{2\rho(l^t_7, l^t_9)} + \frac{\left\| \frac{l^t_{10} + l^t_{12}}{2} - l^t_{11} \right\|}{2\rho(l^t_{10}, l^t_{12})},    (5)

D_{eyebrow}(t) = \frac{\left\| \frac{l^1_1 + l^1_2 + l^1_3}{3} - l^t_2 \right\|}{2\rho(l^1_1, l^1_3)} + \frac{\left\| \frac{l^1_4 + l^1_5 + l^1_6}{3} - l^t_5 \right\|}{2\rho(l^1_4, l^1_6)},    (6)

D_{cheek}(t) = \frac{\left\| \frac{l^1_{13} + l^1_{14}}{2} - l^t_{13} \right\| + \left\| \frac{l^1_{13} + l^1_{14}}{2} - l^t_{14} \right\|}{2\rho(l^1_{13}, l^1_{14})},    (7)

D_{lip}(t) = \frac{\left\| \frac{l^1_{16} + l^1_{17}}{2} - l^t_{16} \right\| + \left\| \frac{l^1_{16} + l^1_{17}}{2} - l^t_{17} \right\|}{2\rho(l^1_{16}, l^1_{17})},    (8)

where l^t_i denotes the 2D location of the i-th point in frame t. Then, the vertical (y) components of D_eyebrow, D_eyelid, D_cheek, and D_lip, and the horizontal (x) components of D_cheek and D_lip are extracted (see Fig. 1(b)). The extracted sequences are smoothed by the 4253H-twice method [19]. These estimates are hereafter referred to as amplitude signals. Finally, the amplitude signals are split into the three phases (onset, apex, and offset), which have been previously defined using the smile amplitude S.
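A minimal sketch of Eqs. (5)-(8), assuming the aligned 2D landmarks are stored as an array of shape (frames, 17, 2) indexed as in Fig. 1(a):

```python
# Sketch of the displacement signals in Eqs. (5)-(8). `L` is assumed to have
# shape (T, 17, 2); the 1-based landmark labels of Fig. 1(a) are shifted to
# 0-based array indices.
import numpy as np

def dist(a, b):
    return np.linalg.norm(a - b, axis=-1)

def displacement_signals(L):
    l = lambda i: L[:, i - 1, :]    # landmark i over all frames
    l1 = lambda i: L[0, i - 1, :]   # landmark i in the first frame

    d_eyelid = (dist((l(7) + l(9)) / 2, l(8)) / (2 * dist(l(7), l(9)))
                + dist((l(10) + l(12)) / 2, l(11)) / (2 * dist(l(10), l(12))))
    d_eyebrow = (dist((l1(1) + l1(2) + l1(3)) / 3, l(2)) / (2 * dist(l1(1), l1(3)))
                 + dist((l1(4) + l1(5) + l1(6)) / 3, l(5)) / (2 * dist(l1(4), l1(6))))
    mid_cheek = (l1(13) + l1(14)) / 2
    d_cheek = (dist(mid_cheek, l(13)) + dist(mid_cheek, l(14))) / (2 * dist(l1(13), l1(14)))
    mid_lip = (l1(16) + l1(17)) / 2
    d_lip = (dist(mid_lip, l(16)) + dist(mid_lip, l(17))) / (2 * dist(l1(16), l1(17)))
    return d_eyelid, d_eyebrow, d_cheek, d_lip
```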
The proposed dynamic features and their definitions are given in Table 1. It is important to note that the defined features are extracted separately from each phase of the smile. As a result, we obtain three feature sets for each of the six amplitude signals (see Fig. 1(b)). For a more detailed analysis, the corresponding speed V(t) = \frac{dD}{dt} and acceleration A(t) = \frac{d^2 D}{dt^2} signals are computed in addition to the amplitudes. In Table 1, signals marked with superscript (+) and (−) denote the increasing and decreasing segments of the related signal, respectively. For example, D^+ pools the increasing segments of D. η gives the length (number of frames) of a given signal, and ω is the frame rate of the video. For each amplitude signal, three 15-dimensional feature vectors (one per phase) are generated by concatenating these features; the combination of all the feature vectors forms the joint dynamic feature vector. In some cases, features cannot be calculated. For example, if we extract features from the amplitude signal of the lip corners D_lip using the onset phase, then the decreasing segments will be an empty set (η(D^−) = 0). For such exceptions, all the features describing the related segments are set to zero. This is done to have a generic feature vector format which has the same features for the different phases of each amplitude signal.

Table 1. Definitions of the extracted features.

Feature                  Definition
Duration:                η(D^+)/ω,  η(D^−)/ω,  η(D)/ω
Duration Ratio:          η(D^+)/η(D),  η(D^−)/η(D)
Maximum Amplitude:       max(D)
Mean Amplitude:          ΣD / η(D)
Maximum Speed:           max(V^+),  max(|V^−|)
Mean Speed:              ΣV^+ / η(V^+),  Σ|V^−| / η(V^−)
Maximum Acceleration:    max(A^+),  max(|A^−|)
Mean Acceleration:       ΣA^+ / η(A^+),  Σ|A^−| / η(A^−)
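A minimal sketch of the 15 per-phase features of Table 1 is given below; the finite-difference scheme used for the speed and acceleration signals is an assumption, as the paper does not specify it here.

```python
# Sketch of the 15 per-phase features of Table 1, computed for one phase of one
# amplitude signal D at frame rate omega. Empty increasing/decreasing segments
# are filled with zeros, as described in the text. D is assumed to have at
# least two frames.
import numpy as np

def _pos_neg(x):
    """Samples belonging to the increasing (+) and decreasing (-) segments of x."""
    dx = np.diff(x)
    return x[1:][dx > 0], x[1:][dx < 0]

def _safe(fn, seg):
    return fn(seg) if len(seg) else 0.0

def phase_features(D, omega):
    V = np.gradient(D) * omega   # speed (assumed finite-difference scheme)
    A = np.gradient(V) * omega   # acceleration
    D_inc, D_dec = _pos_neg(D)
    V_pos, V_neg = V[V > 0], V[V < 0]
    A_pos, A_neg = A[A > 0], A[A < 0]
    n = len(D)
    return np.array([
        len(D_inc) / omega, len(D_dec) / omega, n / omega,   # durations
        len(D_inc) / n, len(D_dec) / n,                      # duration ratios
        np.max(D), np.sum(D) / n,                            # max / mean amplitude
        _safe(np.max, V_pos), _safe(lambda s: np.max(np.abs(s)), V_neg),
        _safe(np.mean, V_pos), _safe(lambda s: np.mean(np.abs(s)), V_neg),
        _safe(np.max, A_pos), _safe(lambda s: np.max(np.abs(s)), A_neg),
        _safe(np.mean, A_pos), _safe(lambda s: np.mean(np.abs(s)), A_neg),
    ])
```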
3.4.2 Extraction of spatio-temporal features
To describe the temporal changes in the appearance of faces, we employ a recently proposed spatio-temporal local texture descriptor, namely Completed Local Binary Patterns from Three Orthogonal Planes (CLBP-TOP) [16]. CLBP-TOP is a straightforward extension of the Completed Local Binary Patterns (CLBP) operator [10] to dynamic textures (image sequences): CLBP histograms are extracted from the three orthogonal planes XY, XT, and YT individually and concatenated into a single feature vector. Here, X and Y refer to the spatial extent of the image, and T denotes time. CLBP-TOP regards the face sequence as a volume, and the neighborhood of each pixel is defined in a three-dimensional space, whereas CLBP uses only the X and Y dimensions of a single image. The difference between CLBP and the original LBP operator is that, in addition to the sign of the local difference, CLBP encodes the center pixel of the local neighborhood and the magnitude of the difference.
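The sketch below illustrates only the "three orthogonal planes" idea with a simplified, sign-only LBP code on the central XY, XT, and YT slices of a block; the actual CLBP-TOP descriptor used here additionally encodes the center pixel and the magnitude of the local differences, and is computed over the full volume with more neighbors per plane.

```python
# Simplified illustration of the three-orthogonal-planes idea: a 4-neighbour,
# sign-only LBP code (CLBP_S-like) on the central XY, XT and YT slices of a
# grey-level block, concatenated as normalized histograms. This is NOT the full
# CLBP-TOP descriptor of the paper.
import numpy as np

def lbp_sign_codes(plane):
    """4-neighbour sign-LBP codes for the interior pixels of a 2-D array."""
    c = plane[1:-1, 1:-1]
    neighbours = [plane[:-2, 1:-1], plane[2:, 1:-1], plane[1:-1, :-2], plane[1:-1, 2:]]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, nb in enumerate(neighbours):
        codes |= ((nb >= c).astype(np.uint8) << bit)
    return codes

def lbp_top_histogram(block):
    """block: (T, H, W) grey-level volume; returns concatenated XY/XT/YT histograms."""
    T, H, W = block.shape
    planes = [block[T // 2],        # XY plane at the central frame
              block[:, H // 2, :],  # XT plane at the central row
              block[:, :, W // 2]]  # YT plane at the central column
    hists = [np.bincount(lbp_sign_codes(p).ravel(), minlength=16) for p in planes]
    return np.concatenate([h / max(h.sum(), 1) for h in hists])
```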
We extract CLBP-TOP features from the previously detected smile onsets, since the onset phase shows the change from a neutral to an expressive face. On the selected frames, faces are normalized with respect to roll rotation using the eye centers c_1 and c_2. Then, each face is resized and cropped as shown in Fig. 2(a). For scaling and normalization, the inter-ocular distance d_io is set to 50 pixels. The resulting normalized face images have a resolution of 125 × 100 pixels. To provide more comparable onset du-
