
Proceedings ArticleDOI

A unified framework for context assisted face clustering

16 Apr 2013-pp 9-16


A Unified Framework for Context Assisted Face Clustering
Liyan Zhang Dmitri V. Kalashnikov Sharad Mehrotra
Department of Computer Science
University of California, Irvine
ABSTRACT
Automatic face clustering, which aims to group faces referring to the same people together, is a key component for face tagging and image management. Standard face clustering approaches that are based on analyzing facial features can already achieve high-precision results. However, they often suffer from low recall due to the large variation of faces in pose, expression, illumination, occlusion, etc. To improve the clustering recall without reducing the high precision, we leverage heterogeneous context information to iteratively merge the clusters referring to the same entities. We first investigate appropriate methods to utilize the context information at the cluster level, including the use of "common scene", people co-occurrence, human attributes, and clothing. We then propose a unified framework that employs bootstrapping to automatically learn adaptive rules to integrate this heterogeneous contextual information, along with facial features, together. Experimental results on two personal photo collections and one real-world surveillance dataset demonstrate the effectiveness of the proposed approach in improving recall while maintaining very high precision of face clustering.
Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval—Clustering
Keywords
Face Clustering, Context Information, Bootstrapping
1. INTRODUCTION
With the explosion of massive media data, the problem of image organization, management and retrieval has become an important issue [11] [21]. Naturally, the focus in
many image collections is people. To better understand and
This work was supported in part by NSF grants CNS-1118114,
CNS-1059436, CNS-1063596. It is part of NSF supported project
Sherlock @ UCI (http://sherlock.ics.uci.edu): a UC Irvine
project on Data Quality and Entity Resolution [1].
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ICMR’13, April 16–20, 2013, Dallas, Texas, USA.
Copyright 2013 ACM 978-1-4503-2033-7/13/04 ...$10.00.
Figure 1: Example of Face Clusters by Picasa
manage human-centered photos, face tagging, which aims to help users associate people's names with faces, becomes an essential task. The fundamental problem underlying face tagging and management is face clustering, which aims to group faces that refer to the same people together.
Clustering faces based on facial appearance features is the most conventional approach. It has been extensively studied and significant progress has been achieved in the last two decades [2] [6] [7]. These standard techniques have already been employed in several commercial systems such as Google Picasa, Apple iPhoto, and Microsoft EasyAlbum. These systems usually produce face clusters that have high precision (faces in each cluster refer to the same person), but low recall (faces of a single person fall into different clusters). In addition, a large number of small/singleton face clusters are often returned, which places a heavy burden on users to label all the faces in the album. Fig. 1 illustrates an example of a face clustering result, where faces of a single person fall into six different (pure) clusters, instead of one. One reason for low recall is the large variation of faces in pose, expression, illumination, occlusion, etc. That makes it challenging to group faces correctly using standard techniques that focus primarily on facial features and largely ignore the context. Another reason is that when systems like Picasa ask for manual feedback from the user, users most often prefer to merge pure (high-precision) clusters rather than manually clean contaminated (low-recall) ones. Consequently, such systems are often tuned to strongly prefer precision over recall. The goal of our work is to leverage heterogeneous context information to improve the recall of cluster results without reducing the high precision.
Prior research efforts have extensively explored using contextual features to improve the quality of face clustering [16] [17] [19] [20]. In general, in contrast to our work, such techniques often aim at exploiting just one (or a few) contextual feature types, with the merging decision often made at the image level only. We, however, develop a unified framework that integrates heterogeneous context information to improve the performance of face clustering. The framework learns the roles and importance of different feature types from data. It can take into account the time decay of features and makes the merging decision at both image and cluster levels. Examples of types of contextual cues that have been used in the past include geo-location and image capture time [21], people co-occurrence [14] [16] [17], social norms and conventional positioning [10], human attributes [13], text or other linked information [4] [18], clothing [9] [20], etc. For instance, [13] proposes to employ human attributes as additional features. However, the authors do not explore the different roles that each attribute type plays in identifying different people. Social context, such as people co-occurrence, has been investigated in [14] [16] [17], but these approaches do not deal with cluster-level co-occurrence information. Clothing information has been used extensively in face clustering [9] [20]. However, these techniques do not employ the important time decay factor when leveraging clothing information.
The overall unified framework is illustrated in Figure 2. We start with the initial set of clusters generated by the standard approach for the given photo collection. The initial clusters have high precision but low recall. We iteratively merge the clusters that are likely to refer to the same entities to achieve higher recall. We use contextual and facial features in two regards: for computing similarities (how similar two clusters are) and for defining constraints (which clusters cannot refer to the same person). The framework then uses bootstrapping to learn the importance of different heterogeneous feature types directly from data. To achieve higher quality, this learning is done adaptively per cluster in a photo collection, because the importance of different features can change from person to person and across photo collections. For example, clothing is a good distinguishing feature in a photo album where people's clothes are distinct, but a weak feature in a photo collection where people wear uniforms. We employ the ideas of bootstrapping to partially label any given dataset in an automated fashion without any human input. These labels then allow us to learn the importance of various features directly from the given photo collection. Clusters are then merged iteratively, based on the learned feature importance and computed similarity, to produce a higher-quality clustering.
The rest of this paper is organized as follows. We start
by formally defining the problem in Section 2. In Section 3,
we describe how to leverage the context information at the
cluster level, including common scene, people co-occurrence,
human attributes, and clothing. In Section 4, we propose the
unified framework which automatically learns rules to inte-
grate heterogeneous context information together to itera-
tively merge clusters. The proposed approach is empirically
evaluated in Section 5. Finally, we conclude in Section 6 by
highlighting key points of our work.
2. PROBLEM DEFINITION
Suppose that a human-centered photo album P_h contains K images {I_1, I_2, ..., I_K}, see Figure 2. Assume that n faces are detected in P_h, with each face denoted as f_i for i = 1, 2, ..., n, or f_i^{I_k} (that is, f_i is extracted from image I_k). Suppose that by applying the standard algorithm, which is based on facial features, we obtain N clusters {C_1, C_2, ..., C_N}, where each cluster is assumed to be pure, but multiple clusters could refer to the same entity. Our goal is to leverage heterogeneous context information to merge clusters such that we still get very high precision clusters but also improve the recall.
Figure 2: The General Framework (Photo Collection → Detected Faces → Initial Clusters: High Precision, Low Recall → Iterative Merging → Final Clusters: High Precision, High Recall)
There have been many studies that analyze the behavior of different metrics for measuring the quality of clustering. A recent prominent study by Artiles et al. suggests that B-cubed precision, recall and F-measure is one of the best combinations of metrics to use according to many criteria [3]. Let C(f_i) be the cluster that f_i is put into by a clustering algorithm. Let L(f_i) be the real category/label (person) f_i refers to in the ground truth. Given two faces f_i and f_j, the correctness Correct(f_i, f_j) is defined as:

Correct(f_i, f_j) = 1 if L(f_i) = L(f_j) and C(f_i) = C(f_j); 0 otherwise.

B-cubed precision of an item f_i is computed as the proportion of correctly related items in its cluster (including itself):

Pre(f_i) = ( Σ_{f_j : C(f_i) = C(f_j)} Correct(f_i, f_j) ) / ||{f_j | C(f_i) = C(f_j)}||.

The overall B-cubed precision is the averaged precision of all items: Pre = (1/n) Σ_{i=1}^{n} Pre(f_i). Similarly, B-cubed recall of f_i is the proportion of correctly related items in its category:

Rec(f_i) = ( Σ_{f_j : L(f_i) = L(f_j)} Correct(f_i, f_j) ) / ||{f_j | L(f_i) = L(f_j)}||.

The overall recall is then: Rec = (1/n) Σ_{i=1}^{n} Rec(f_i). The F-measure is then defined as the harmonic mean of the precision and recall.
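In code, the B-cubed metrics above reduce to a few counting operations. The following is a minimal sketch; the `bcubed` helper and its list-based inputs are our own illustration, not part of the paper:

```python
from collections import Counter

def bcubed(clusters, labels):
    """Compute B-cubed precision, recall and F-measure.

    clusters[i] and labels[i] are the cluster id C(f_i) and the
    ground-truth person id L(f_i) of face i, as defined above.
    """
    n = len(clusters)
    cluster_sizes = Counter(clusters)      # ||{f_j | C(f_i) = C(f_j)}||
    label_sizes = Counter(labels)          # ||{f_j | L(f_i) = L(f_j)}||
    # correct[(c, l)] = number of faces in cluster c carrying label l
    correct = Counter(zip(clusters, labels))
    pre = rec = 0.0
    for c, l in zip(clusters, labels):
        # faces sharing both this cluster and this label (includes the face itself)
        both = correct[(c, l)]
        pre += both / cluster_sizes[c]     # Pre(f_i)
        rec += both / label_sizes[l]       # Rec(f_i)
    pre, rec = pre / n, rec / n
    f = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f
```

For instance, splitting one person's three faces across two otherwise pure clusters keeps precision at 1 while recall drops, which is exactly the failure mode the paper targets.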
3. CONTEXT FEATURE EXTRACTION
Most prior research efforts focus on leveraging context features directly at the face level [9] [13] [14]. That is, the similarity is computed between two faces rather than between two clusters. In this section, we describe how to utilize context features at the cluster level. Context features are not only able to provide additional contextual similarity information to link clusters that co-refer (refer to the same entity), but can also generate constraints that identify clusters that cannot co-refer (cannot refer to the same entity).
3.1 Context Similarities
3.1.1 Common Scene
It is common for a photographer to take multiple photos of the same "scene" in a relatively short period of time. This happens, for example, when the photographer wants to ensure that at least some of the pictures taken will be of acceptable quality, or when people pose for photos

Figure 3: Example of Common Scene
and change their poses somewhat in the sequence of common scene photos. Common scene photos are often taken within small intervals of time from each other, and they contain almost the same background and almost the same group of people in each photo. Surprisingly, we are not aware of much existing work that uses common scene detection to improve face-clustering performance. However, common scene detection can provide additional evidence to link clusters describing the same entity, since images in a common scene often contain the same people.
To divide images into common scene clusters, EXIF information (such as image capture time, geo-location, camera model, etc.), image visual features (color, texture, shape), and the image file name can be leveraged. Suppose that in a photo album P_h containing K images {I_1, I_2, ..., I_K}, the algorithm finds M common scene clusters. Let CS(I_k) denote the common scene of image I_k. Based on the assumption that two images forming a common scene might describe the same entities, two faces even with dissimilar facial appearances might be linked by the common scene.
For example, as shown in Figure 3, C_1 and C_2 are two initial face clusters based on facial appearance. Face f_1^{I_1}, extracted from image I_1, belongs to cluster C_1, and face f_4^{I_2}, extracted from image I_2, is put into C_2. Since images I_1 and I_2 share the common scene CS(I_1) = CS(I_2), it is possible they describe the same entities. Thus faces f_1^{I_1} and f_4^{I_2} have some possibility of being the same, and the two face clusters C_1 and C_2 are linked to each other via the common scene.
Thus the context similarity S_cs(C_m, C_n) of two face clusters C_m and C_n based on common scene is defined as the number of distinct common scenes shared between pairs of images from each cluster:

μ_mn^cs = {CS(I_k) | CS(I_k) = CS(I_l), f_i^{I_k} ∈ C_m, f_j^{I_l} ∈ C_n}   (1)

S_cs(C_m, C_n) = ||μ_mn^cs||   (2)

Thus μ_mn^cs is the set of common scenes shared across the two face clusters C_m and C_n, and S_cs(C_m, C_n) is the cardinality of the set μ_mn^cs. The larger the value of S_cs(C_m, C_n), the higher the likelihood that C_m and C_n refer to the same entity.
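Once every image has been assigned a common-scene id, Eq. (2) reduces to a set intersection. A small sketch, where the function name and the dict-based inputs are illustrative assumptions rather than the paper's interfaces:

```python
def common_scene_similarity(images_m, images_n, scene_of):
    """S_cs(C_m, C_n) from Eq. (2): count the distinct common scenes
    shared between images contributing faces to the two clusters.

    images_m, images_n: iterables of image ids, one per face in the cluster.
    scene_of: dict mapping image id -> common-scene id CS(I_k)
              (an assumed representation, not from the paper).
    """
    scenes_m = {scene_of[img] for img in images_m}
    scenes_n = {scene_of[img] for img in images_n}
    return len(scenes_m & scenes_n)   # ||mu_mn^cs||
```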
3.1.2 People Co-occurrence
The surrounding faces can provide vital evidence for recognizing the identity of a given face in an image. Suppose that "Rose" and "John" are good friends and often take photos together; then the identity of one person will probably imply the other. In [17], Wu et al. investigated the people co-occurrence feature and proposed a social context similarity measure that counts the common co-occurring single clusters between two clusters. However, this measure can be greatly improved, because single-cluster linkage alone is not strong evidence. In this section, we propose a
Figure 4: Example of People Co-occurrence (faces f_1–f_8 from images I_1 and I_2, and the resulting cluster co-occurrence graph with unit edge weights)
new social context similarity measure, which uses the common cluster-group as evidence to link clusters. Experiments reveal that cluster-group linkage is more reliable than single-cluster linkage.
Cluster co-occurrence. First, let us define the co-occurrence relationship between two clusters. We will say that clusters C_m and C_n co-occur in/via image I_k if I_k contains at least two faces such that one is from C_m and the other one is from C_n. In general, the co-occurrence measure Co(C_m, C_n) returns the number of distinct images in which C_m and C_n co-occur:

Co(C_m, C_n) = ||{I_k | ∃ f_i^{I_k}, f_j^{I_k} s.t. f_i^{I_k} ∈ C_m, f_j^{I_k} ∈ C_n}||
The co-occurrence relationship between three or more face clusters has a similar definition. Consider the faces in Figure 4 as an example. There, C_1, C_2, C_3, C_4 are four initial face clusters. Since there exists an image I_1 that contains three faces f_1, f_4 and f_6 such that f_1 ∈ C_1, f_4 ∈ C_2, f_6 ∈ C_3, we have Co(C_1, C_2, C_3) = 1. Similarly, for the clusters C_1, C_2, C_4 it holds that Co(C_1, C_2, C_4) = 1. Based on common sense, we know that a person cannot co-occur with himself in an image unless the image is doctored or contains a reflection, e.g., in a mirror. Consequently, clusters connected via a non-zero co-occurrence relationship should refer to different entities. This property will be used later on by the framework to generate context constraints.
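The measure Co can be sketched as a count over images, assuming we hold an image-to-faces map and represent clusters as sets of face ids (these representations are our simplification, not the paper's):

```python
def co_occurrence(images, *clusters):
    """Co(C_1, ..., C_k): number of distinct images containing at least
    one face from every given cluster.

    images: dict mapping image id -> set of face ids detected in it
            (an assumed representation for illustration).
    Each cluster is given as a set of face ids.
    """
    return sum(
        1 for faces in images.values()
        if all(faces & c for c in clusters)   # one face from each cluster
    )
```

With the Figure 4 layout, an image holding faces from C_1, C_2 and C_3 contributes 1 to Co(C_1, C_2, C_3), matching the worked example above.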
Co-occurrence graph. The co-occurrence of two face clusters reveals the social relationship between them and between the people they correspond to. We now describe how to construct the cluster co-occurrence graph. Observe that if two face clusters have similar co-occurrence relationships, then the two face clusters might refer to the same entity. This is because people tend to appear with the same group of people in photos, e.g., the same friends. In the example in Figure 4, both C_3 and C_4 co-occur with C_1 and C_2. Such co-occurrence can serve as extra evidence that C_3 and C_4 possibly refer to the same entity. To demonstrate this graphically, we can represent C_3 and C_4 as nodes in a graph, both of which are linked together via a different node that corresponds to C_1 and C_2 as a single cluster-group.

To analyze the various co-occurrences among clusters, we construct the cluster co-occurrence graph G = (V, E). G is a labeled undirected graph. The set of nodes V in the graph consists of two types of nodes: V = V^c ∪ V^g. A node v_i^c ∈ V^c corresponds to each single face cluster C_i. A node v_j^g ∈ V^g corresponds to each face cluster-group found in an image. The group nodes are constructed as follows. For each image I_k that contains at least two faces, let Φ_{I_k} denote the set of all the clusters that contain faces present in I_k. We construct ||Φ_{I_k}|| cluster-groups, where each group is a set of clusters Φ_{I_k} \ {C_j} for each C_j ∈ Φ_{I_k}. For example, if image I_1 has faces from three clusters Φ_{I_1} = {C_1, C_2, C_3}, then the groups are going to be {C_1, C_2}, {C_1, C_3}, and {C_2, C_3}. A node v_j^g is created once per distinct group. An edge e_ij ∈ E is created between nodes v_i^c and v_j^g only when v_i^c occurs in the context of group v_j^g at least once, that is, when there exists at least one image I_k such that v_i^c ∪ v_j^g = Φ_{I_k}. Edge e_ij is labeled with the number of such images, i.e., edge weight w_ij = ||{I_k | v_i^c ∪ v_j^g = Φ_{I_k}}||.
Consider Figure 4 as an example. For images I_1 and I_2 we have Φ_{I_1} = {C_1, C_2, C_3} and Φ_{I_2} = {C_1, C_2, C_4}. Thus we construct four V^c nodes for C_1, C_2, C_3, C_4, and five V^g nodes for {C_1, C_2}, {C_1, C_3}, {C_2, C_3}, {C_1, C_4}, {C_2, C_4}. Edges are created accordingly.
From the cluster co-occurrence graph, we observe that if two V^c nodes v_m^c and v_n^c connect to the same V^g node v_k^g, then v_m^c and v_n^c possibly refer to the same entity. For instance, in Figure 4, both C_3 and C_4 connect with {C_1, C_2}, so C_3 and C_4 are possibly the same. The context similarity from cluster co-occurrence S_co(C_m, C_n) for C_m and C_n can then be defined as the flow between these two clusters:

S_co(C_m, C_n) = Σ_{v_k^g : v_k^g ~ v_m^c, v_k^g ~ v_n^c} min(w_mk, w_kn)   (3)

In general, the co-occurrence similarity between two clusters can be measured as the sum of weights of paths that link them through V^g nodes. The larger the number/weight of paths that link C_m and C_n, the higher the likelihood that C_m and C_n refer to the same entity.
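Eq. (3) can be computed without materializing the graph explicitly, by grouping the cluster-to-group edge weights per group node. The sketch below assumes dict-based inputs and is illustrative only:

```python
from collections import Counter
from itertools import combinations

def co_occurrence_similarity(images, face_cluster):
    """S_co from Eq. (3) for every cluster pair, via the cluster
    co-occurrence graph.

    images: dict image id -> set of face ids in that image.
    face_cluster: dict face id -> cluster id.
    Returns {(C_m, C_n): S_co(C_m, C_n)} for pairs with non-zero flow.
    """
    # Edge weights w_ij: for each image, every cluster in Phi_{I_k}
    # co-occurs once with the group formed by the remaining clusters.
    weights = Counter()
    for faces in images.values():
        phi = {face_cluster[f] for f in faces}       # Phi_{I_k}
        if len(phi) < 2:
            continue
        for c in phi:
            group = frozenset(phi - {c})             # Phi_{I_k} \ {c}
            weights[(c, group)] += 1
    # Flow between two cluster nodes through each shared group node.
    by_group = {}
    for (c, g), w in weights.items():
        by_group.setdefault(g, []).append((c, w))
    sim = Counter()
    for members in by_group.values():
        for (cm, wm), (cn, wn) in combinations(members, 2):
            sim[tuple(sorted((cm, cn)))] += min(wm, wn)
    return dict(sim)
```

On the Figure 4 example, only C_3 and C_4 share the group node {C_1, C_2}, so they are the only pair with non-zero flow.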
3.1.3 Human Attributes
Human attributes, such as gender, age, ethnicity, facial traits, etc., are important evidence for identifying a person. By considering attributes, many uncertainties and errors in face clustering can be avoided, such as confusing "men" with "women", or "adults" with "children". To get attribute values for a given face, we use the attribute system of [13]. It returns values for 73 types of attributes, such as "black hair", "big nose", or "wearing eyeglasses". Thus, with each face f_i we associate a 73-D attribute vector denoted as A_{f_i}.
In [13], Kumar et al. suggest that attributes can be used to help face verification by choosing some measure (e.g., cosine similarity) to compute attribute similarities. However, the importance of each type of attribute usually differs when identifying different entities. For example, in a photo album containing just one baby, age is an important factor for identifying this baby; while if several babies exist in an album, then age is not a strongly discriminative feature. Thus, it is essential to determine the importance of attributes for identifying a given entity in the photo collection.
To achieve this, we learn the importance of attributes from the face cluster itself, by leveraging bootstrapping. Here, bootstrapping refers to the process of automatically labeling part of the data, without any human input, and then using these labels to train a classifier. The learned classifier is then used to label the remaining data.
[Footnote to Section 3.1.2: In general, there could be different models for assigning weights to paths in addition to the flow model considered in the paper. For example, paths that go through larger group nodes could be assigned higher weight, since larger groups of people tend to be better context than smaller ones.]
Figure 5: Example of Human Attributes (the attribute training set for C_1: faces in C_1 labeled C_1, faces from co-occurring clusters labeled ~C_1, used to train an SVM)
One of
the main challenges in applying bootstrapping is to be able to provide these partial labels. The general idea of our solution is that faces that belong to one face cluster are very likely to refer to the same entity due to the purity of the initial clusters; hence they can form the positive samples. In turn, faces from two clusters that co-occur in the same image most likely refer to different people (since a person cannot co-occur with himself in a photo), which can be used to construct the negative samples.
Based on the above discussion, the training dataset can be constructed for each cluster. Figure 5 illustrates the attribute training dataset for identifying C_1 from the example in Figure 4. Three faces f_1, f_2, f_3 fall into C_1, so the attributes of these three faces A_{f_1}, A_{f_2}, A_{f_3} are labeled as C_1. Since the other three clusters C_2, C_3, C_4 have a co-occurrence relationship with C_1, they are considered to describe different entities. Thus the attributes of faces from the other three clusters can be treated as negative samples. In this way, the attribute training dataset can be constructed automatically for each cluster.
After the attribute training dataset is constructed, a classifier, such as an SVM, can be learned for each cluster C_m. Given a 73-D attribute feature A_{f_i} for any face f_i, the task of the classifier is to output whether this face f_i belongs to C_m. In addition to outputting a binary yes/no decision, modern classifiers can also output the probability that f_i belongs to C_m, denoted as P_A(f_i ∈ C_m). Thus, by applying the classifier learned for C_m to each face in an unknown face cluster C_n, we can compute the average probability that C_n belongs to C_m, denoted as S_A(C_n → C_m):

S_A(C_n → C_m) = (1 / ||C_n||) Σ_{f_i ∈ C_n} P_A(f_i ∈ C_m)   (4)

The attribute similarity between C_m and C_n is defined as:

S_attr(C_m, C_n) = ( S_A(C_n → C_m) + S_A(C_m → C_n) ) / 2   (5)

That is, the attribute-based similarity S_attr(C_m, C_n) between two clusters is the average of the average probabilities of one cluster belonging to the other.
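Given per-cluster classifiers that expose a probability P_A (for instance, the probability output of a trained SVM), Eqs. (4)-(5) are simple averages. In this hypothetical sketch the classifiers are passed in as callables; the function name and input shapes are our assumptions:

```python
def attribute_similarity(cluster_m, cluster_n, prob_m, prob_n):
    """S_attr(C_m, C_n) from Eqs. (4)-(5).

    cluster_m, cluster_n: lists of 73-D attribute vectors A_{f_i}.
    prob_m(a): probability P_A(f in C_m) for a face with attributes a,
    produced by the classifier learned for C_m; likewise prob_n.
    """
    # S_A(C_n -> C_m): average probability that faces of C_n belong to C_m.
    s_n_to_m = sum(prob_m(a) for a in cluster_n) / len(cluster_n)
    # S_A(C_m -> C_n): the symmetric direction.
    s_m_to_n = sum(prob_n(a) for a in cluster_m) / len(cluster_m)
    return (s_n_to_m + s_m_to_n) / 2
```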
3.1.4 Clothing Information
Clothing information can be a strong feature for determining the identity of a person. However, clothing is a time-sensitive feature, since people change their clothes. Clothing has been considered in previous work on face clustering, e.g., in [20], but not as the time-sensitive feature described next.
In this section, we introduce a time decay factor to control the effect of clothing in identifying people. We propose that the clothing similarity between f_i and f_j should be a function of time:

S_c(f_i, f_j) = sim(ch_{f_i}, ch_{f_j}) × e^{−Δt/(2s²)}   (6)

In the above formula, sim(ch_{f_i}, ch_{f_j}) refers to the clothing similarity computed only on visual features, and Δt refers to the capture-time difference between the two faces. By construction, the above time-decay function incorporates the relationship between Δt and the effectiveness of clothing features. The smaller Δt is, the more effective the clothing feature is. As the time difference Δt grows, the effectiveness of the clothing feature decreases. When the time difference Δt is much larger than the time slot threshold s, the clothing feature becomes ineffective.
To compute the clothing similarity, the first step is to detect the location of clothing for the given face, which can be implemented by leveraging the techniques from [9] or simply by using a bounding box below the detected face. After that, low-level image features (color, texture) can be extracted to represent the clothing information, and then similarities can be computed.
To obtain the cluster similarity from clothing information, we compute the clothing similarity between each pair of faces and then choose the maximum value:

S_cloth(C_m, C_n) = max_{f_i ∈ C_m, f_j ∈ C_n} S_c(f_i, f_j)   (7)

Thus the clothing similarity between C_m and C_n is computed by selecting the maximum clothing similarity over all pairs of faces drawn from the two face clusters.
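Eqs. (6)-(7) can be sketched as follows; the tuple representation of faces and the pluggable visual-similarity function are our assumptions, and the exponent follows the reconstructed Eq. (6):

```python
import math

def clothing_face_similarity(visual_sim, dt, s):
    """S_c from Eq. (6): visual clothing similarity damped by time decay.

    visual_sim: sim(ch_{f_i}, ch_{f_j}) computed on visual features alone.
    dt: capture-time difference between the two faces (same unit as s).
    s: time-slot threshold controlling how fast clothing evidence decays.
    """
    return visual_sim * math.exp(-dt / (2 * s * s))

def clothing_cluster_similarity(faces_m, faces_n, sim, s):
    """S_cloth from Eq. (7): maximum pairwise time-decayed clothing
    similarity between the faces of the two clusters.

    faces_m, faces_n: lists of (clothing_features, capture_time) pairs
    (an illustrative representation); sim computes the visual similarity.
    """
    return max(
        clothing_face_similarity(sim(ch_i, ch_j), abs(t_i - t_j), s)
        for ch_i, t_i in faces_m
        for ch_j, t_j in faces_n
    )
```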
3.2 Context Constraints
In the previous section we explained how context features can be used as extra positive evidence for computing similarity between clusters. Context features, such as people co-occurrence and human attributes, can also provide constraints, or negative evidence, which can be used to identify clusters that should refer to different entities.
From the cluster co-occurrence relationship, we can derive that two face clusters with Co(C_m, C_n) > 0 should refer to definitely different entities, because a person cannot co-occur with himself (in normal cases). Thus we define that if Co(C_m, C_n) > 0, the context dissimilarity from the co-occurrence feature is 1, denoted as D_co(C_m, C_n) = 1.
From human attributes, we can derive that two clusters with vastly different attribute values, such as age, gender, and ethnicity, should refer to different entities. Thus we define that if two clusters C_m and C_n have distinct age, gender, or ethnicity attribute values, then the context dissimilarity from the human attributes feature is 1, denoted as D_attr(C_m, C_n) = 1. We can then define the context dissimilarity measure between two clusters as follows:

D(C_m, C_n) = 1 if D_co(C_m, C_n) = 1 or D_attr(C_m, C_n) = 1; 0 otherwise.

Thus D(C_m, C_n) = 1 means C_m and C_n are most likely different, while D(C_m, C_n) = 0 means that the dissimilarity measure cannot tell whether C_m and C_n are different or not. The context constraints will be leveraged to implement the bootstrapping ideas explained in the following section.
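The combined constraint D can be sketched as below; representing the attribute comparison as a simple tuple equality is a deliberate simplification of the "vastly different attribute values" test, and the function name is our own:

```python
def context_dissimilarity(co_mn, demog_m, demog_n):
    """D(C_m, C_n): 1 if the clusters cannot co-refer, else 0.

    co_mn: Co(C_m, C_n), the cluster co-occurrence count.
    demog_m, demog_n: (age, gender, ethnicity) tuples per cluster,
    a simplified stand-in for the attribute-based comparison.
    """
    d_co = 1 if co_mn > 0 else 0                 # D_co(C_m, C_n)
    d_attr = 1 if demog_m != demog_n else 0      # D_attr (simplified)
    return 1 if d_co or d_attr else 0
```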
4. THE UNIFIED FRAMEWORK
In the previous section we discussed how to leverage the context information from two aspects: computing context similarities (S_cs, S_co, S_attr, S_cloth) and context constraints (D_co, D_attr).
Figure 6: Example of Bootstrapping Process (automatically labeled cluster pairs marked Same/Diff, and the predictions produced by the learned rules)
In this section, we develop an approach for integrating these heterogeneous context features
together to facilitate face clustering.
One possible solution for aggregating these context features is to compute the overall similarity as a weighted linear sum of the context similarities. The overall similarity can then be used to merge clusters that do not violate the context constraints. However, this basic solution has several limitations: it is too coarse-grained, and it could be difficult to set weights that would work best for all possible photo collections. Alternatively, the other option is to automatically learn rules that combine these context features to make a merging decision. If the rules are satisfied, the two face clusters can be merged. For example, a rule could be: if S_cs(C_m, C_n) > 3 and S_co(C_m, C_n) > 4, then merge C_m and C_n. The experiments reveal that if the rules are defined appropriately, significantly better merging results can be achieved compared to the basic solution.
Nevertheless, it is hard to define fixed rules that would work well for all possible photo albums. Instead, rules that are automatically tuned to each photo collection naturally perform better. This is because the importance of each type of context feature usually varies due to the diversity of image datasets. For example, clothing might be important evidence in a photo album where people's clothing is distinct, but it loses its effect in a photo collection where people wear uniforms. Thus, inspired by [5] [12] [15], we propose a unified framework that can automatically learn and adapt the rules to achieve high-quality face clustering.
4.1 Construction of Training Dataset
To automatically learn the rules, a training dataset is often required. However, since we are trying to automatically learn and tune the rules for each photo collection, it is unlikely that training data will be available, as it will not accompany each given collection. Nevertheless, such rules can be learned by leveraging bootstrapping and semi-supervised learning techniques. To apply those techniques, we need to automatically and partially label the dataset. The constructed training dataset should contain positive samples (cluster pairs referring to the same person) and negative samples (cluster pairs referring to different people). The key challenge is to be able to automatically, without any human input, label the positive and negative samples for part of the data.
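One way to sketch this automatic labeling step: pairs that violate a context constraint become negative samples, while pairs whose facial similarity is extremely high become positive samples. The threshold value and the data representations below are hypothetical illustrations, not values or interfaces from the paper:

```python
def bootstrap_cluster_pairs(clusters, face_sim, dissim, high_sim=0.9):
    """Automatically label some cluster pairs without human input.

    clusters: dict cluster id -> representative facial feature
              (a float here for simplicity).
    face_sim: facial-similarity function on those features.
    dissim: function D(C_m, C_n) built from the context constraints.
    high_sim: hypothetical facial-similarity threshold for positives.
    Returns (positive_pairs, negative_pairs); unlabeled pairs are skipped.
    """
    positives, negatives = [], []
    ids = sorted(clusters)
    for i, m in enumerate(ids):
        for n in ids[i + 1:]:
            if dissim(m, n) == 1:
                # constraint violated: the two clusters cannot co-refer
                negatives.append((m, n))
            elif face_sim(clusters[m], clusters[n]) >= high_sim:
                # extremely similar faces: very likely the same person
                positives.append((m, n))
    return positives, negatives
```

The labeled pairs can then serve as the training set from which the merging rules are learned, leaving the remaining pairs to be decided by those rules.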

Citations
More filters

Journal ArticleDOI
TL;DR: The proposed algorithm is called robust tensor clustering (RTC), which firstly finds a lower-rank approximation of the original tensor data using a L1 norm optimization function, and compute high-order singular value decomposition of this approximate tensor to obtain the final clustering results.
Abstract: Face clustering is a key component either in image managements or video analysis. Wild human faces vary with the poses, expressions, and illumination changes. All kinds of noises, like block occlusions, random pixel corruptions, and various disguises may also destroy the consistency of faces referring to the same person. This motivates us to develop a robust face clustering algorithm that is less sensitive to these noises. To retain the underlying structured information within facial images, we use tensors to represent faces, and then accomplish the clustering task based on the tensor data. The proposed algorithm is called robust tensor clustering (RTC), which firstly finds a lower-rank approximation of the original tensor data using a L1 norm optimization function. Because L1 norm does not exaggerate the effect of noises compared with L2 norm, the minimization of the L1 norm approximation function makes RTC robust. Then, we compute high-order singular value decomposition of this approximate tensor to obtain the final clustering results. Different from traditional algorithms solving the approximation function with a greedy strategy, we utilize a nongreedy strategy to obtain a better solution. Experiments conducted on the benchmark facial datasets and gait sequences demonstrate that RTC has better performance than the state-of-the-art clustering algorithms and is more robust to noises.

62 citations


Cites background from "A unified framework for context ass..."

  • ...[31] integrated some heterogeneous contexts into a unified framework to jointly cluster faces....


Journal ArticleDOI
01 Sep 2013
TL;DR: A novel Query-Driven Approach (QDA) is developed that performs only the minimal number of cleaning steps necessary to answer a given selection query correctly; a comprehensive empirical evaluation demonstrates its significant efficiency advantage over traditional techniques for query-driven applications.
Abstract: This paper explores "on-the-fly" data cleaning in the context of a user query. A novel Query-Driven Approach (QDA) is developed that performs a minimal number of cleaning steps that are only necessary to answer a given selection query correctly. The comprehensive empirical evaluation of the proposed approach demonstrates its significant advantage in terms of efficiency over traditional techniques for query-driven applications.

49 citations


Cites background from "A unified framework for context ass..."

  • ...Recently new approaches exploit new information sources such as analyzing context [4, 9, 29], exploiting relationships between entities [20], domain/integrity constraints [14], behaviors of entities [28], and external knowledge bases such as ontologies and web search engines [12, 21, 24]....



Book ChapterDOI
08 Oct 2016
TL;DR: Experiments demonstrate that the proposed joint face representation adaptation and clustering approach generates character clusters with high purity compared to existing video face clustering methods, which are either based on deep face representation (without adaptation) or carefully engineered features.
Abstract: Clustering faces in movies or videos is extremely challenging since characters’ appearance can vary drastically under different scenes. In addition, the various cinematic styles make it difficult to learn a universal face representation for all videos. Unlike previous methods that assume fixed handcrafted features for face clustering, in this work, we formulate a joint face representation adaptation and clustering approach in a deep learning framework. The proposed method allows face representation to gradually adapt from an external source domain to a target video domain. The adaptation of deep representation is achieved without any strong supervision but through iteratively discovered weak pairwise identity constraints derived from potentially noisy face clustering result. Experiments on three benchmark video datasets demonstrate that our approach generates character clusters with high purity compared to existing video face clustering methods, which are either based on deep face representation (without adaptation) or carefully engineered features.

47 citations


Cites background or methods from "A unified framework for context ass..."

  • ...In addition to the inherent pairwise constraint, recent works on video face clustering also incorporate contextual information [1]....


  • ...In particular, we employ the B-cubed precision and recall [1,33] to compute one series of score pairs for the tested methods given different numbers of clusters....



Proceedings ArticleDOI
Le An, Xiaojing Chen, Mehran Kafai, Songfan Yang, et al.
01 Oct 2013
TL;DR: Improving the re-identification performance by reranking the returned results based on soft biometric attributes, such as gender, which can describe probe and gallery subjects at a higher level is aimed at.
Abstract: The problem of person re-identification is to recognize a target subject across non-overlapping distributed cameras at different times and locations. The applications of person re-identification include security, surveillance, multi-camera tracking, etc. In a real-world scenario, person re-identification is challenging due to the dramatic changes in a subject's appearance in terms of pose, illumination, background, and occlusion. Existing approaches either try to design robust features to identify a subject across different views or learn distance metrics to maximize the similarity between different views of the same person and minimize the similarity between different views of different persons. In this paper, we aim at improving the re-identification performance by reranking the returned results based on soft biometric attributes, such as gender, which can describe probe and gallery subjects at a higher level. During reranking, the soft biometric attributes are detected and attribute-based distance scores are calculated between pairs of images by using a regression model. These distance scores are used for reranking the initially returned matches. Experiments on a benchmark database with different baseline re-identification methods show that reranking improves the recognition accuracy by moving upwards the returned matches from the gallery that share the same soft biometric attributes as the probe subject.
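The reranking step described above can be sketched as a weighted fusion of the baseline matcher's distance with an attribute distance; the fusion weight `alpha` is an illustrative choice, not the paper's learned regression model.

```python
def rerank(base_dist, attr_dist, alpha=0.3):
    """Rerank gallery ids by fusing appearance and soft-biometric distances.

    base_dist: {gallery_id: distance from the baseline re-id matcher}
    attr_dist: {gallery_id: distance between probe and gallery attributes}
    alpha:     weight of the attribute term (illustrative, not learned here)
    """
    fused = {g: (1 - alpha) * base_dist[g] + alpha * attr_dist[g]
             for g in base_dist}
    return sorted(fused, key=fused.get)   # best (smallest distance) first
```

A gallery subject whose soft biometrics match the probe gets a small attribute distance and is moved up relative to the baseline ranking.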

43 citations


Cites methods from "A unified framework for context ass..."

  • ...For instance, SB attributes are used for face clustering in [20]....



Journal ArticleDOI
01 Oct 2014
TL;DR: This work proposes a framework that leverages heterogeneous contextual information together with facial features to handle the problem of person identification for low-quality data and applies it to a real-world dataset consisting of several weeks of surveillance videos.
Abstract: Smart video surveillance (SVS) applications enhance situational awareness by allowing domain analysts to focus on the events of higher priority. SVS approaches operate by trying to extract and interpret higher "semantic" level events that occur in video. One of the key challenges of SVS is that of person identification, where the task is, for each subject that occurs in a video shot, to identify the person it corresponds to. The problem of person identification is especially challenging in resource-constrained environments where transmission delay, bandwidth restriction, and packet loss may prevent the capture of high-quality data. Conventional person identification approaches, which are primarily based on analyzing facial features, are often not sufficient to deal with poor-quality data. To address this challenge, we propose a framework that leverages heterogeneous contextual information together with facial features to handle the problem of person identification for low-quality data. We first investigate the appropriate methods to utilize heterogeneous context features including clothing, activity, human attributes, gait, people co-occurrence, and so on. We then propose a unified approach for person identification that builds on top of our generic entity resolution framework called RelDC, which can integrate all these context features to improve the quality of person identification. This work thus links one well-known problem of person identification from the computer vision research area (that deals with video/images) with another well-recognized challenge known as entity resolution from the database and AI/ML areas (that deals with textual data). We apply the proposed solution to a real-world dataset consisting of several weeks of surveillance videos. The results demonstrate the effectiveness and efficiency of our approach even on low-quality video data.

30 citations


References

Proceedings ArticleDOI
Navneet Dalal, Bill Triggs
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
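A minimal, unoptimized HOG-style descriptor (per-cell orientation histograms; the block normalization the authors found critical is omitted for brevity) might look like:

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Per-cell histograms of gradient orientations (simplified HOG).

    img: 2-D grayscale array whose sides are multiples of `cell`.
    Returns an (H/cell, W/cell, bins) array of unnormalized histograms.
    """
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180        # unsigned orientation
    H, W = img.shape
    bin_idx = np.minimum((ang / (180 / bins)).astype(int), bins - 1)
    out = np.zeros((H // cell, W // cell, bins))
    for i in range(H // cell):
        for j in range(W // cell):
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            # Magnitude-weighted vote into orientation bins.
            out[i, j] = np.bincount(b, weights=m, minlength=bins)
    return out
```

Full HOG additionally L2-normalizes overlapping 2x2 blocks of cells, which is the contrast-normalization step the abstract highlights as important.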

28,803 citations


Journal ArticleDOI
Brendan J. Frey, Delbert Dueck
16 Feb 2007-Science
TL;DR: A method called “affinity propagation,” which takes as input measures of similarity between pairs of data points, which found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.
Abstract: Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such "exemplars" can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called "affinity propagation," which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.
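The responsibility/availability message-passing updates at the heart of affinity propagation can be sketched in numpy (with damping, as in the original formulation); the input is a similarity matrix whose diagonal holds the "preferences".

```python
import numpy as np

def affinity_propagation(S, damping=0.5, iters=200):
    """Minimal affinity propagation on an n x n similarity matrix S.

    Diagonal of S holds preferences; returns the exemplar index per point.
    """
    n = S.shape[0]
    R = np.zeros((n, n))   # responsibilities r(i, k)
    A = np.zeros((n, n))   # availabilities  a(i, k)
    for _ in range(iters):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * Rnew
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())          # keep r(k,k) as-is
        Anew = Rp.sum(axis=0)[None, :] - Rp
        diag = Anew.diagonal().copy()               # a(k,k) has no min(0, .)
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, diag)
        A = damping * A + (1 - damping) * Anew
    return np.argmax(A + R, axis=1)   # exemplar chosen by each point
```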

5,696 citations


Journal ArticleDOI
TL;DR: This paper presents a novel and efficient facial image representation based on local binary pattern (LBP) texture features that is assessed in the face recognition problem under different challenges.
Abstract: This paper presents a novel and efficient facial image representation based on local binary pattern (LBP) texture features. The face image is divided into several regions from which the LBP feature distributions are extracted and concatenated into an enhanced feature vector to be used as a face descriptor. The performance of the proposed method is assessed in the face recognition problem under different challenges. Other applications and several extensions are also discussed.
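A basic 8-neighbor LBP operator and its regional histogram can be sketched as follows (a simplified variant; the paper's descriptor concatenates histograms from several face regions and typically restricts codes to "uniform" patterns):

```python
import numpy as np

def lbp(img):
    """Basic 8-neighbor local binary pattern codes for interior pixels.

    A neighbor >= center contributes a 1-bit; bits are weighted 1..128
    clockwise from the top-left neighbor. Returns an (H-2, W-2) code map.
    """
    c = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy: img.shape[0] - 1 + dy,
                 1 + dx: img.shape[1] - 1 + dx]
        codes |= (nb >= c).astype(int) << bit
    return codes

def lbp_histogram(img, bins=256):
    """Normalized histogram of LBP codes: the per-region texture descriptor."""
    h = np.bincount(lbp(img).ravel(), minlength=bins).astype(float)
    return h / h.sum()
```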

5,237 citations


Journal ArticleDOI
TL;DR: The discriminatory power of various human facial features is studied, and a new scheme for Automatic Face Recognition (AFR) is proposed based on efficient projection-based feature extraction and classification.
Abstract: In this paper the discriminatory power of various human facial features is studied and a new scheme for Automatic Face Recognition (AFR) is proposed. Using Linear Discriminant Analysis (LDA) of different aspects of human faces in spatial domain, we first evaluate the significance of visual information in different parts/features of the face for identifying the human subject. The LDA of faces also provides us with a small set of features that carry the most relevant information for classification purposes. The features are obtained through eigenvector analysis of scatter matrices with the objective of maximizing between-class and minimizing within-class variations. The result is an efficient projection-based feature extraction and classification scheme for AFR. Soft decisions made based on each of the projections are combined, using probabilistic or evidential approaches to multisource data analysis. For medium-sized databases of human faces, good classification accuracy is achieved using very low-dimensional feature vectors.
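The scatter-matrix construction and eigenvector analysis described above amount to classical Fisher LDA, which can be sketched as:

```python
import numpy as np

def lda_projection(X, y, k):
    """Fisher LDA: top-k directions maximizing between- vs within-class scatter.

    X: (n, d) samples; y: (n,) integer class labels. Returns a (d, k) matrix
    whose columns span the low-dimensional discriminant subspace.
    """
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))   # within-class scatter (to be minimized)
    Sb = np.zeros((d, d))   # between-class scatter (to be maximized)
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * diff @ diff.T
    # Eigenvectors of Sw^-1 Sb (small ridge added for numerical stability).
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:k]]
```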

874 citations


Journal ArticleDOI
TL;DR: This article defines a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering are captured by different metric families, and proposes a modified version of Bcubed that avoids the problems found with other metrics.
Abstract: There is a wide set of evaluation metrics available to compare the quality of text clustering algorithms. In this article, we define a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering are captured by different metric families. These formal constraints are validated in an experiment involving human assessments, and compared with other constraints proposed in the literature. Our analysis of a wide range of metrics shows that only BCubed satisfies all formal constraints. We also extend the analysis to the problem of overlapping clustering, where items can simultaneously belong to more than one cluster. As Bcubed cannot be directly applied to this task, we propose a modified version of Bcubed that avoids the problems found with other metrics.
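BCubed precision and recall average, over all items, how pure each item's cluster is and how completely it covers the item's true entity. A minimal sketch:

```python
from collections import defaultdict

def bcubed(pred, truth):
    """B-cubed precision and recall, averaged per item.

    pred[i] / truth[i]: predicted cluster id and true entity id of item i.
    Returns (precision, recall).
    """
    clusters, entities = defaultdict(list), defaultdict(list)
    for i, (c, t) in enumerate(zip(pred, truth)):
        clusters[c].append(i)
        entities[t].append(i)
    p = r = 0.0
    for i, (c, t) in enumerate(zip(pred, truth)):
        same = sum(1 for j in clusters[c] if truth[j] == t)
        p += same / len(clusters[c])    # purity of i's cluster w.r.t. i
        r += same / len(entities[t])    # coverage of i's true entity
    n = len(pred)
    return p / n, r / n
```

This is the non-overlapping case; the extension for overlapping clusters proposed in the article replaces the set counts with multiplicity-aware comparisons.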

665 citations


Performance Metrics

No. of citations received by the paper in previous years:

  Year    Citations
  2020        4
  2019        2
  2018        2
  2017        6
  2016        8
  2015        5