
Understanding Images of Groups of People
Andrew C. Gallagher
Carnegie Mellon University
Pittsburgh, Pennsylvania
agallagh@cmu.edu
Tsuhan Chen
Carnegie Mellon University
Pittsburgh, Pennsylvania
tsuhan@ece.cornell.edu
Abstract
In many social settings, images of groups of people are
captured. The structure of this group provides meaningful
context for reasoning about individuals in the group, and
about the structure of the scene as a whole. For exam-
ple, men are more likely to stand on the edge of an image
than women. Instead of treating each face independently
from all others, we introduce contextual features that en-
capsulate the group structure locally (for each person in
the group) and globally (the overall structure of the group).
This “social context” allows us to accomplish a variety of tasks, such as demographic recognition, calculating scene and camera parameters, and even event recognition.
We perform human studies to show this context aids recog-
nition of demographic information in images of strangers.
1. Introduction
It is a common occurrence at social gatherings to capture
a photo of a group of people. The subjects arrange them-
selves in the scene and the image is captured, as shown for
example in Figure 1. Many factors (both social and physi-
cal) play a role in the positioning of people in a group shot.
For example, physical attributes are considered, and phys-
ically taller people (often males) tend to stand in the back
rows of the scene. Sometimes a person of honor (e.g. a
grandparent) is placed closer to the center of the image as a
result of social factors or norms. To best understand group
images of people, the factors related to how people position
themselves in a group must be understood and modeled.
We contend that computer vision algorithms benefit by considering social context, a context that describes people, their culture, and the social aspects of their interactions. In this paper, we describe contextual features from groups of people, one aspect of social context. There are several justifications for this approach. First, the topic of the spacing between people during their interactions has been thoroughly studied in the fields of anthropology [14] and social psychology [2]. A comfortable spacing between people depends on social relationship, social situation, gender and culture. This concept, called proxemics, is considered in architectural design [16, 26], and we suggest computer vision can benefit as well. In our work, we show experimental results that our contextual features from group images improve understanding. In addition, we show that human visual perception exploits similar contextual clues in interpreting people images.

Figure 1. Just as birds naturally space themselves on a wire (Upper Left), people position themselves in a group image. We extract contextual features that capture the structure of the group of people. The nearest face (Upper Right) and minimum spanning tree (Lower Left) both capture contextual information. Among several applications, we use this context to determine the gender of the persons in the image (Lower Right).
We propose contextual features that capture the structure of a group of people, and the position of individuals within the group. A traditional approach to this problem might be to detect faces and independently analyze each face by extracting features and performing classification. In our approach, we consider context provided by the global structure defined by the collection of people in the group. This allows us to perform or improve several tasks, such as identifying the demographics (ages and genders) of people in the image, estimating the camera and scene parameters, and classifying the image into an event type.
1.1. Related Work
A large body of research addresses understanding images of humans, covering issues such as recognizing an individual, recognizing age and gender from facial appearance, and determining the structure of the human body. The vast majority of this work treats each face as an independent problem. However, there are some notable exceptions. In [5], names from captions are associated with faces from images or video in a mutually exclusive manner (each face can only be assigned one name). Similar constraints are employed in research devoted to solving the face recognition problem for consumer image collections.
In [10, 19, 24], co-occurrences between individuals in labeled images are considered for reasoning about the identities of groups of people (instead of one person at a time). However, the co-occurrence does not consider any aspect of the spatial arrangement of the people in the image. In [23], people are matched between multiple images of the same person group, but only appearance features are used. Facial arrangement was considered in [1], but only as a way to measure the similarity between images.
Our use of contextual features from people images is motivated by the use of context for object detection and recognition. Hoiem et al. [15], and Torralba and Sinha [25], describe the context (in 3D and 2D, respectively) of a scene and the relationship between context and object detection. Researchers recognize that recognition performance is improved by learning reasonable object priors, encapsulating the idea that cars are on the road and cows stand on grass (not trees). Learning these co-occurrence, relative co-location, and scale models improves object recognition [11, 20, 21, 22]. These approaches are successful because the real world is highly structured, and objects are not randomly scattered throughout an image. Similarly, there is structure to the positions of people in a scene that can be modeled and used to aid our interpretation of the image.
Our contribution is a new approach for analyzing images
of multiple people. We propose features that relate to the
structure of a group of people and demonstrate that they
contain useful information. The features provide social con-
text that allows us to reason effectively in different problem
domains, such as estimating person demographics, estimat-
ing parameters related to scene structure, and even catego-
rizing the event in the image. In Section 2, we describe our
image collection. In Section 3, we introduce contextual per-
son features, and we detail their performance for classifying
person demographics. We introduce the concept of a face
plane and demonstrate its relationship to the scene structure
and event semantics (Section 5). Finally, in Section 6 we describe experiments related to human perception based on cues related to social context.

          0-2    3-7   8-12  13-19  20-36  37-65   66+
Female    439    771    378    956   7767   3604   644
Male      515    824    494    736   7281   3213   609
Total     954   1595    872   1692  15048   6817  1253
Table 1. The distribution of the ages and genders of the 28,231 people in our image collection.
2. Images and Labeling
We built a collection of people images from Flickr images. As Flickr does not explicitly allow searches based on the number of people in the image, we created search terms likely to yield images of multiple people. The following three searches were conducted:
“wedding+bride+groom+portrait”
“group shot” or “group photo” or “group portrait”
“family portrait”
A standard set of negative query terms was used to remove undesirable images. To prevent a single photographer’s images from being over-represented, a maximum of 100 images are returned for any given image capture day, and this search is repeated for 270 different days.
In each image, we labeled the gender and the age category for each person. As we are not studying face detection, we manually add missed faces, but 86% of the faces are automatically found. We labeled each face as being in one of seven age categories: 0-2, 3-7, 8-12, 13-19, 20-36, 37-65, and 66+, roughly corresponding to different life stages. In all, 5,080 images containing 28,231 faces are labeled with age and gender (see Table 1), making this what we believe is the largest dataset of its kind [3]. Many faces have low resolution. The median face has only 18.5 pixels between the eye centers, and 25% of the faces have under 12.5 pixels.

As is expected with Flickr images, there is a great deal of variety. In some images, people are sitting, lying, or standing on elevated surfaces. People often have dark glasses, face occlusions, or unusual facial expressions. Is there useful information in the structure and arrangement of people in the image? The rest of the paper is devoted to answering this question in the affirmative.
3. Contextual Features from People Images
A face detector and an Active Shape Model [7] are used to detect faces and locate the left and right eye positions. The position $p = [x_i \; y_i]^T$ of a face $f$ is the two-dimensional centroid of the left and right eye center positions $l = [x_l \; y_l]^T$ and $r = [x_r \; y_r]^T$:

$$p = \frac{1}{2} l + \frac{1}{2} r \quad (1)$$

The distance between the two eye center positions is the size $e = \|l - r\|$ of the face. To capture the structure of the people image, and to allow the structure of the group to represent context for each face, we compute the following features and represent each face as a 12-dimensional contextual feature vector $f_x$:
Absolute Position: The absolute position of each face, $p$, normalized by the image width and height, represents two dimensions. A third dimension in this category is the angle of the face relative to horizontal.
Relative Features: The centroid of all the faces in an image is found. Then, the relative position of a particular face is the position of the face relative to the centroid, normalized by the mean face size:

$$r = \frac{p - p_\mu}{e_\mu} \quad (2)$$

where $r$ is the relative position of the face, $p_\mu$ is the centroid of all faces in the image, and $e_\mu$ is the mean size of all faces from the image. The third dimension in this category is the ratio of the face size to the mean face size:

$$e_r = \frac{e}{e_\mu} \quad (3)$$
When three or more faces are found in the image, a linear model is fit to the image to model face size as a function of y-axis position in the image. This is described in more detail in Section 4.2. Using (9), the predicted size of the face compared with the actual face size is the last feature:

$$e_p = \frac{e}{\alpha_1 y_i + \alpha_2} \quad (4)$$
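To make these features concrete, here is a minimal numpy sketch of Eqs. (2)-(4); the function and argument names are ours, and the $(\alpha_1, \alpha_2)$ pair stands in for the linear size model of (9):

```python
import numpy as np

def relative_features(positions, sizes, alpha=None):
    """Relative-position features, Eqs. (2)-(4).

    positions: (N, 2) array of face positions p; sizes: (N,) inter-eye
    distances e; alpha: optional (alpha_1, alpha_2) of the linear size
    model (our stand-in for Eq. (9) of Section 4.2).
    """
    p_mu = positions.mean(axis=0)            # centroid of all faces
    e_mu = sizes.mean()                      # mean face size
    r = (positions - p_mu) / e_mu            # Eq. (2): relative position
    e_r = sizes / e_mu                       # Eq. (3): relative size
    feats = [r, e_r[:, None]]
    if alpha is not None:
        a1, a2 = alpha
        feats.append((sizes / (a1 * positions[:, 1] + a2))[:, None])  # Eq. (4)
    return np.hstack(feats)
```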
Minimal Spanning Tree: A complete graph $G = (V, E)$ is constructed, where each face $f_n$ is represented by a vertex $v_n \in V$, and each edge $(v_n, v_m) \in E$ connects vertices $v_n$ and $v_m$. Each edge has a corresponding weight $w(v_n, v_m)$ equal to the Euclidean distance between the face positions $p_n$ and $p_m$. The minimal spanning tree of the graph, $MST(G)$, is found using Prim’s algorithm. The minimal spanning tree reveals the structure of the people image; if people are arranged linearly, the minimal spanning tree $MST(G)$ contains no vertices of degree three or greater. For each face $f_n$, the degree of the vertex $v_n$ is a feature $\deg(v_n)$. An example tree is shown in Figure 1.
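As an illustration, a short sketch of the MST degree feature, assuming face positions are given as an (N, 2) numpy array; this is a textbook implementation of Prim’s algorithm, not the authors’ code:

```python
import numpy as np

def mst_degrees(positions):
    """Degree of each face's vertex in the Euclidean minimum spanning
    tree, built with Prim's algorithm over the complete graph."""
    n = len(positions)
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    best = np.full(n, np.inf)       # cheapest edge from the tree to each vertex
    parent = np.full(n, -1)
    degree = np.zeros(n, dtype=int)
    best[0] = 0.0                   # start growing the tree from face 0
    for _ in range(n):
        u = int(np.argmin(np.where(in_tree, np.inf, best)))
        in_tree[u] = True
        if parent[u] >= 0:          # edge (parent[u], u) joins the tree
            degree[u] += 1
            degree[parent[u]] += 1
        closer = ~in_tree & (dist[u] < best)
        best[closer] = dist[u][closer]
        parent[closer] = u
    return degree
```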
Nearest Neighbor: The K nearest neighbors are found, based again on the Euclidean distance between face positions $p$. As we will see, the relative juxtaposition of neighboring faces reveals information about the social relationship between them. Using the nearest neighbor face, the relative position, size, and in-plane face tilt angle are calculated, for a total of four dimensions.
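The paper does not spell out the normalization for these four dimensions; the sketch below is one plausible reading, scaling the neighbor’s offset by the face’s own size:

```python
import numpy as np

def nearest_neighbor_features(positions, sizes, tilts):
    """For each face: relative position (2), size ratio (1), and tilt
    difference (1) of its nearest neighbor -- four dimensions in all.
    Normalizing by each face's own size is our assumption."""
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a face is not its own neighbor
    nn = dist.argmin(axis=1)                # index of each face's nearest neighbor
    rel_pos = (positions[nn] - positions) / sizes[:, None]
    return np.column_stack([rel_pos, sizes[nn] / sizes, tilts[nn] - tilts])
```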
Figure 2 ((a) All, (b) Female-Male, (c) Baby-Other). The position of the nearest face to a given face depends on the social relationship between the pair. (a) The relative position of two nearest neighbors, where the red dot represents the first face, and lighter areas are more likely positions of the nearest neighbor. The red circle represents a radius of 1.5 feet (457 mm). (b) When nearest neighbors are male and female, the male tends to be above and to the side of the female (represented by the red dot). (c) The position of the nearest neighbor to a baby. The baby face tends to be spatially beneath the neighbor, and, incidentally, the nearest neighbor to a baby is female with probability 63%.
Figure 3 ((a) $P(p)$, (b) $P(f_g = \text{male} \mid p)$, (c) $P(f_a < 8 \mid p)$). The absolute position of a face in the image provides clues about age and gender. Each of the three images represents a normalized image. (a) The density of all 28,231 faces in the collection. (b) $P(f_g = \text{male} \mid p)$: a face near the image edge or top is likely to be male. (c) $P(f_a < 8 \mid p)$: a face near the bottom is likely to be a child.
The feature vector $f_x$ captures both the pairwise relationships between faces and a sense of the person’s position relative to the global structure of all people in the image (three absolute dimensions, four relative dimensions, one minimal spanning tree degree, and four nearest-neighbor dimensions).
3.1. Evidence of Social Context
It is evident that the contextual feature $f_x$ captures information related to demographics. Figure 2 shows the spatial distributions between nearest neighbors. The relative position is dependent on gender (b) and age (c). Using the fact that the distance between human adult eye centers is 61±3 mm [9], the mean distance between a person and her nearest neighbor is 306 mm, about five inter-eye distances. This is smaller than the 18-inch (457 mm) radius “personal space” of [2], but perhaps subjects suspend their need for space for the sake of capturing an image.
Figure 3 shows maps of $P(f_a \mid p)$ and $P(f_g \mid p)$, the probability that a face has a particular age or gender given its absolute position. Intuitively, physically taller men are more likely to stand in the group’s back row and appear closer to the image top. Regarding the degree $\deg(v_n)$ of a face in $MST(G)$, females tend to be more centrally located in a group, and consequently have a higher mean degree in $MST(G)$. For faces with $\deg(v_n) > 2$, the probability that the face is female is 62.5%.

[Figure 4 confusion matrices: rows and columns are the seven age categories 0-2, 3-7, 8-12, 13-19, 20-36, 37-65, 66+; panels (a) Context, (b) Appearance, (c) Both.]
Figure 4. The structure of people in an image provides context for estimating age. Shown are the confusion matrices for classifying age using (a) context alone (no facial appearance), (b) content (facial appearance) alone, and (c) both context and facial appearance. Context improves over content alone.
                     Gender   Age (exact)   Age (±1)
Random Baseline       50.0%      14.3%       38.8%
Absolute Position     62.5%      25.7%       56.3%
Relative Position     66.8%      28.5%       60.5%
Min. Spanning Tree    55.3%      21.4%       47.2%
Nearest Neighbor      64.3%      26.7%       56.3%
Combined $f_x$        66.9%      32.9%       64.4%
Table 2. Predicting age and gender from context features $f_x$ alone. The first age column is the accuracy for an exact match; the second allows an error of one age category (e.g., a 3-7 year-old classified as 8-12).
3.2. Demographics from Context and Content
The interesting research question we address is this: How much does the structure of the people in an image tell us about the people? We estimate demographic information about a person using $f_x$. The goal is to estimate each face’s age $f_a$ and gender $f_g$. We show that age and gender can be predicted with accuracy significantly greater than random by considering only the context provided by $f_x$ and no appearance features. In addition, the context has utility when combined with existing appearance-based age and gender discrimination algorithms.
3.2.1 Classifying Age and Gender with Context
Each face in the person image is described with a contextual feature vector $f_x$ that captures local pairwise information (from the nearest neighbor) and global position. We trained classifiers for discriminating age and gender. In each case, we use a Gaussian Maximum Likelihood (GML) classifier to learn $P(f_a \mid f_x)$ and $P(f_g \mid f_x)$. The distribution of each class (7 classes for age, 2 for gender) is learned by fitting a multivariate Gaussian to the distributions $P(f_x \mid f_a)$ and $P(f_x \mid f_g)$. Other classifiers (AdaBoost, decision forests, SVM) yield similar results on this problem, but GML has the advantage that the posterior is easy to estimate directly.
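A minimal sketch of a GML classifier of this kind, assuming numpy/scipy; a uniform class prior is used, which matches the balanced training sets described next (the class structure and names are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

class GMLClassifier:
    """One multivariate Gaussian per class, e.g. P(f_x | f_a); the
    posterior follows from Bayes' rule under a uniform class prior."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.models = [
            multivariate_normal(X[y == c].mean(axis=0),
                                np.cov(X[y == c], rowvar=False),
                                allow_singular=True)
            for c in self.classes
        ]
        return self

    def posterior(self, x):
        lik = np.array([m.pdf(x) for m in self.models])  # P(f_x | class)
        return lik / lik.sum()                           # P(class | f_x)
```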
The age classifier is trained from a random selection of 3500 faces, selected such that each age category has an equal number of samples. Testing is performed on an independent (also uniformly distributed) set of 1050 faces. Faces for test images are selected to achieve a roughly even distribution over the number of people in the image. The prior for gender is roughly even in our collection, so we use a larger training set of 23218 images and test on 1881 faces. For classifying age, our contextual features have an accuracy more than double random chance (14.3%), and gender is correctly classified about two-thirds of the time. Again, we emphasize that no appearance features are considered. Table 2 shows the performance of our classifiers for the different components of the contextual person feature $f_x$. The strongest single component is Relative Position, but the inclusion of all features is best. Babies are recognized with good accuracy, mainly because their faces are smaller and positioned lower than others in the image.
3.2.2 Combining Context with Content
We trained appearance-based age and gender classifiers. These content-based classifiers provide probability estimates $P(f_g \mid f_A)$ and $P(f_a \mid f_A)$ that the face has a particular gender and age category, given the visual appearance $f_A$. Our gender and age classifiers were motivated by the works of [12, 13], where a low-dimensional manifold is learned for the age data. Using cropped and scaled faces (61×49 pixels, with the scaling so the eye centers are 24 pixels apart) from the age training set, two linear projections ($W_a$ for age and $W_g$ for gender) are learned. Each column of $W_a$ is a vector learned by finding the projection that maximizes the ratio of interclass to intraclass variation (by linear discriminant analysis) for a pair of age categories, resulting in 21 columns for $W_a$. A similar approach is used to learn a linear subspace for gender, $W_g$. Instead of learning a single vector from two gender classes, a set of seven projections is learned by learning a single projection that maximizes gender separability for each age range.
The distance $d_{ij}$ between two faces is measured as:

$$d_{ij} = (f_i - f_j) W W^T (f_i - f_j)^T \quad (5)$$
For classification of both age and gender, the nearest $N$ training samples (we use $N = 25$) are found in the space defined by $W_a$ for age or $W_g$ for gender. The class labels of the neighbors are used to estimate $P(f_a \mid f_A)$ and $P(f_g \mid f_A)$ by MLE counts. One benefit of this approach is that a common algorithm and training set are used for both tasks; only the class labels and the pairings for learning the discriminative projections are modified.
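A sketch of this step, under the same hedges as before; note that Euclidean distance in the projected space is rank-equivalent to Eq. (5), since $d_{ij} = \|(f_i - f_j)W\|^2$:

```python
import numpy as np

def knn_posterior(W, X_train, y_train, f_query, classes, N=25):
    """MLE estimate of P(class | appearance): label frequencies among
    the N nearest training faces in the subspace defined by W."""
    d = np.linalg.norm((X_train - f_query) @ W, axis=1)
    nearest = y_train[np.argsort(d)[:N]]
    return np.array([(nearest == c).mean() for c in classes])
```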
                      Gender   Age (exact)   Age (±1)
Context $f_x$          66.9%      32.9%       64.4%
Appearance $f_A$       69.6%      38.3%       71.3%
Combined $f_x, f_A$    74.1%      42.9%       78.1%
Table 3. In images of multiple people, age and gender estimates are improved by considering both appearance and the social context provided by our features. The first age column is exact age category accuracy; the second allows errors of one age category.
                      Gender   Age (exact)   Age (±1)
Context $f_x$          65.1%      27.5%       63.5%
Appearance $f_A$       67.4%      30.2%       65.9%
Combined $f_x, f_A$    73.4%      36.5%       74.6%
Table 4. For smaller faces (≤ 18 pixels between eye centers), classification suffers. However, the gain provided by combining context with content increases.
The performance of both classifiers seems reasonable given the difficulty of this collection. The gender classifier is correct about 70% of the time. This is lower than others [4], but our collection contains a substantial number of children, small faces, and difficult expressions. For people aged 20-65, gender classification is correct 75% of the time, but for ages 0-19, performance is a poorer 60%, as facial gender differences are not as apparent. For age, the classifier is correct 38% of the time, and if a one-category error is allowed, the performance is 71%. These classifiers may not be state-of-the-art, but they are sufficient to illustrate our approach. We are interested in the benefit that can be achieved by modeling the social context.
Using the Naïve Bayes assumption, the final estimate for the class (for example, gender $f_g$) given all available features (both content $f_A$ and context $f_x$) is:

$$P(f_g \mid f_A, f_x) = P(f_g \mid f_A) \, P(f_g \mid f_x) \quad (6)$$
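In code, the fusion is a per-class product of the two posteriors; the renormalization below is our addition so the fused scores again sum to one:

```python
import numpy as np

def fuse_posteriors(p_context, p_appearance):
    """Eq. (6): combine context and appearance posteriors per class."""
    fused = np.asarray(p_context) * np.asarray(p_appearance)
    return fused / fused.sum()
```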
Table 3 shows that both gender and age estimates are improved by incorporating both content (appearance) and context (the structure of the person image). Gender recognition improves by 4.5% by considering person context. Exact age category recognition improves by 4.6%, and when the adjacent age category is also considered correct, the improvement is 6.8%. Figure 5 shows the results of gender classification in image form, with discussion. Accuracy suffers on smaller faces, but the benefit provided by context increases, as shown in Table 4. For example, context now improves gender accuracy by 6%. This corroborates [20] in that the importance of context increases as resolution decreases.
4. Scene Geometry and Semantics from Faces
The position of people in an image provides clues about the geometry of the scene. As shown in [18], camera calibration can be achieved from a video of a walking human, under some reasonable assumptions (that the person walks on the ground plane and the head and feet are visible). By making broader assumptions, we can model the geometry of the scene from a group of face images. First, we assume the faces approximately define a plane we call the face plane, a world plane that passes through the heads (i.e., the centroids of the eye centers) of the people in the person image. Second, we assume that head sizes are roughly similar. Third, we assume the camera has no roll with respect to the face plane. This ensures the face plane horizon is level. In typical group shots, this is approximately accomplished when the photographer adjusts the camera to capture the group.

Criminisi et al. [8] and Hoiem et al. [15] describe the measurement of objects rooted on the ground plane. In contrast, the face plane is not necessarily parallel to the ground, and many times people are either sitting or are not on the ground plane at all. However, since the true face sizes of people are relatively similar, we can compute the face horizon, the vanishing line associated with the face plane.
4.1. Modeling the Face Plane
From the set of faces in the image, we compute the face horizon and the camera height (the distance from the camera to the face plane measured along the face plane normal), not the height of the camera from the ground. Substituting the face plane for the ground plane in Hoiem et al. [15], we have:

$$E_i = \frac{e_i Y_c}{y_i - y_o} \quad (7)$$

where $E_i$ is the face inter-eye distance in the world (61 mm for the average adult), $e_i$ is the face inter-eye distance in the image, $Y_c$ is the camera height, $y_i$ is the y-coordinate of the face center $p$, and $y_o$ is the y-coordinate of the face horizon.
Each of the $N$ face instances in the image provides one equation. The face horizon $y_o$ and camera height $Y_c$ are solved using least squares by linearizing (7) and writing it in matrix form:

$$\begin{bmatrix} E_1 & e_1 \\ E_2 & e_2 \\ \vdots & \vdots \\ E_N & e_N \end{bmatrix} \begin{bmatrix} y_o \\ Y_c \end{bmatrix} = \begin{bmatrix} y_1 E_1 \\ y_2 E_2 \\ \vdots \\ y_N E_N \end{bmatrix} \quad (8)$$
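A sketch of the solve, assuming numpy; e and y hold the per-face image eye distances and face y-coordinates, and E is the assumed world inter-eye distance:

```python
import numpy as np

def face_horizon_and_camera_height(e, y, E=61.0):
    """Least-squares solution of Eq. (8) for the face horizon y_o and
    the camera height Y_c (E = 61 mm for an average adult)."""
    e = np.asarray(e, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.full_like(e, E), e])  # rows [E, e_i]
    b = E * y                                     # rows E * y_i
    (y_o, Y_c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return y_o, Y_c
```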
Reasonable face vanishing lines and camera height estimates are produced, although it should be noted that the camera focal length is not in general recovered. A degenerate case occurs when the face plane and image plane are parallel (e.g., a group shot of standing people of different heights): the face vanishing line is at infinity, and the camera height (in this case, the distance from the camera to the group) cannot be recovered.

To quantify the performance of the camera geometry estimates, we consider a set of 18 images where the face vanishing plane and ground plane are parallel and therefore share a common vanishing line, the horizon. The horizon is

References (partial)
T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models—their training and application.
J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation.
D. Hoiem, A. A. Efros, and M. Hebert. Putting Objects in Perspective.
E. T. Hall. A System for the Notation of Proxemic Behavior.