
Understanding Images of Groups of People
Andrew C. Gallagher
Carnegie Mellon University
Pittsburgh, Pennsylvania
agallagh@cmu.edu
Tsuhan Chen
Carnegie Mellon University
Pittsburgh, Pennsylvania
tsuhan@ece.cornell.edu
Abstract
In many social settings, images of groups of people are
captured. The structure of this group provides meaningful
context for reasoning about individuals in the group, and
about the structure of the scene as a whole. For exam-
ple, men are more likely to stand on the edge of an image
than women. Instead of treating each face independently
from all others, we introduce contextual features that en-
capsulate the group structure locally (for each person in
the group) and globally (the overall structure of the group).
This “social context” allows us to accomplish a variety of tasks, such as demographic recognition, calculating scene and camera parameters, and even event recognition.
We perform human studies to show this context aids recog-
nition of demographic information in images of strangers.
1. Introduction
It is a common occurrence at social gatherings to capture
a photo of a group of people. The subjects arrange them-
selves in the scene and the image is captured, as shown for
example in Figure 1. Many factors (both social and physi-
cal) play a role in the positioning of people in a group shot.
For example, physical attributes are considered, and phys-
ically taller people (often males) tend to stand in the back
rows of the scene. Sometimes a person of honor (e.g. a
grandparent) is placed closer to the center of the image as a
result of social factors or norms. To best understand group
images of people, the factors related to how people position
themselves in a group must be understood and modeled.
We contend that computer vision algorithms benefit by considering social context, a context that describes people, their culture, and the social aspects of their interactions. In this paper, we describe contextual features from groups of people, one aspect of social context. There are several justifications for this approach. First, the topic of the spacing between people during their interactions has been thoroughly studied in the fields of anthropology [14] and social psychology [2]. A comfortable spacing between people depends on social relationship, social situation, gender and culture. This concept, called proxemics, is considered in architectural design [16, 26], and we suggest computer vision can benefit as well. In our work, we show experimental results that our contextual features from group images improve understanding. In addition, we show that human visual perception exploits similar contextual clues in interpreting people images.

Figure 1. Just as birds naturally space themselves on a wire (Upper Left), people position themselves in a group image. We extract contextual features that capture the structure of the group of people. The nearest face (Upper Right) and minimum spanning tree (Lower Left) both capture contextual information. Among several applications, we use this context to determine the gender of the persons in the image (Lower Right).
We propose contextual features that capture the structure of a group of people, and the position of individuals within the group. A traditional approach to this problem might be to detect faces and independently analyze each face by extracting features and performing classification. In our approach, we consider context provided by the global structure defined by the collection of people in the group. This allows us to perform or improve several tasks, such as identifying the demographics (ages and genders) of people in the image, estimating the camera and scene parameters, and classifying the image into an event type.
1.1. Related Work
A large body of research addresses understanding images of humans, covering issues such as recognizing an individual, recognizing age and gender from facial appearance, and determining the structure of the human body. The vast majority of this work treats each face as an independent problem. However, there are some notable exceptions. In [5], names from captions are associated with faces from images or video in a mutually exclusive manner (each face can only be assigned one name). Similar constraints are employed in research devoted to solving the face recognition problem for consumer image collections.
In [10, 19, 24], co-occurrences between individuals in labeled images are considered for reasoning about the identities of groups of people (instead of one person at a time). However, the co-occurrence does not consider any aspect of the spatial arrangement of the people in the image. In [23], people are matched between multiple images of the same person group, but only appearance features are used. Facial arrangement was considered in [1], but only as a way to measure the similarity between images.
Our use of contextual features from people images is motivated by the use of context for object detection and recognition. Hoiem et al. [15], and Torralba and Sinha [25], describe the context (in 3D and 2D, respectively) of a scene and the relationship between context and object detection. Researchers recognize that recognition performance is improved by learning reasonable object priors, encapsulating the idea that cars are on the road and cows stand on grass (not trees). Learning these co-occurrence, relative co-location, and scale models improves object recognition [11, 20, 21, 22]. These approaches are successful because the real world is highly structured, and objects are not randomly scattered throughout an image. Similarly, there is structure to the positions of people in a scene that can be modeled and used to aid our interpretation of the image.
Our contribution is a new approach for analyzing images
of multiple people. We propose features that relate to the
structure of a group of people and demonstrate that they
contain useful information. The features provide social con-
text that allows us to reason effectively in different problem
domains, such as estimating person demographics, estimat-
ing parameters related to scene structure, and even catego-
rizing the event in the image. In Section 2, we describe our
image collection. In Section 3, we introduce contextual per-
son features, and we detail their performance for classifying
person demographics. We introduce the concept of a face
plane and demonstrate its relationship to the scene structure
and event semantics (Section 5). Finally, in Section 6 we describe experiments related to human perception based on cues related to social context.

          0-2    3-7   8-12  13-19  20-36  37-65   66+
Female    439    771    378    956   7767   3604   644
Male      515    824    494    736   7281   3213   609
Total     954   1595    872   1692  15048   6817  1253
Table 1. The distribution of the ages and genders of the 28,231 people in our image collection.
2. Images and Labeling
We built a collection of people images from Flickr images. As Flickr does not explicitly allow searches based on the number of people in the image, we created search terms likely to yield images of multiple people. The following three searches were conducted:
“wedding+bride+groom+portrait”
“group shot” or “group photo” or “group portrait”
“family portrait”
A standard set of negative query terms was used to remove undesirable images. To prevent a single photographer’s images from being over-represented, a maximum of 100 images are returned for any given image capture day, and this search is repeated for 270 different days.
In each image, we labeled the gender and the age category for each person. As we are not studying face detection, we manually add missed faces, but 86% of the faces are automatically found. We labeled each face as being in one of seven age categories: 0-2, 3-7, 8-12, 13-19, 20-36, 37-65, and 66+, roughly corresponding to different life stages. In all, 5,080 images containing 28,231 faces are labeled with age and gender (see Table 1), making this what we believe is the largest dataset of its kind [3]. Many faces have low resolution. The median face has only 18.5 pixels between the eye centers, and 25% of the faces have under 12.5 pixels.

As is expected with Flickr images, there is a great deal of variety. In some images, people are sitting, lying, or standing on elevated surfaces. People often have dark glasses, face occlusions, or unusual facial expressions. Is there useful information in the structure and arrangement of people in the image? The rest of the paper is devoted to answering this question in the affirmative.
3. Contextual Features from People Images
A face detector and an Active Shape Model [7] are used to detect faces and locate the left and right eye positions. The position $p = [x_i \; y_i]^T$ of a face $f$ is the two-dimensional centroid of the left and right eye center positions $l = [x_l \; y_l]^T$ and $r = [x_r \; y_r]^T$:

$$p = \frac{1}{2} l + \frac{1}{2} r \quad (1)$$

The distance between the two eye center positions is the size $e = \|l - r\|$ of the face. To capture the structure of the people image, and to allow the structure of the group to represent context for each face, we compute the following features and represent each face as a 12-dimensional contextual feature vector $f_x$:
Absolute Position: The absolute position of each face, $p$, normalized by the image width and height, represents two dimensions. A third dimension in this category is the angle of the face relative to horizontal.
Relative Features: The centroid of all the faces in an image is found. Then, the relative position of a particular face is the position of the face relative to the centroid, normalized by the mean face size:

$$r = \frac{p - p_\mu}{e_\mu} \quad (2)$$

where $r$ is the relative position of the face, $p_\mu$ is the centroid of all faces in the image, and $e_\mu$ is the mean size of all faces from the image. The third dimension in this category is the ratio of the face size to the mean face size:

$$e_r = \frac{e}{e_\mu} \quad (3)$$
When three or more faces are found in the image, a linear model is fit to the image to model face size as a function of y-axis position in the image. This is described in more detail in Section 4.2. Using (9), the predicted size of the face compared with the actual face size is the last feature:

$$e_p = \frac{e}{\alpha_1 y_i + \alpha_2} \quad (4)$$
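To make these features concrete, here is a minimal numpy sketch of Eqs. (2)-(4); the function and argument names are ours, and the $(\alpha_1, \alpha_2)$ pair stands in for the linear size model of (9):

```python
import numpy as np

def relative_features(positions, sizes, alpha=None):
    """Relative-position features, Eqs. (2)-(4).

    positions: (N, 2) array of face positions p; sizes: (N,) inter-eye
    distances e; alpha: optional (alpha_1, alpha_2) of the linear size
    model (our stand-in for Eq. (9) of Section 4.2).
    """
    p_mu = positions.mean(axis=0)            # centroid of all faces
    e_mu = sizes.mean()                      # mean face size
    r = (positions - p_mu) / e_mu            # Eq. (2): relative position
    e_r = sizes / e_mu                       # Eq. (3): relative size
    feats = [r, e_r[:, None]]
    if alpha is not None:
        a1, a2 = alpha
        feats.append((sizes / (a1 * positions[:, 1] + a2))[:, None])  # Eq. (4)
    return np.hstack(feats)
```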
Minimal Spanning Tree: A complete graph $G = (V, E)$ is constructed, where each face $f_n$ is represented by a vertex $v_n \in V$, and each edge $(v_n, v_m) \in E$ connects vertices $v_n$ and $v_m$. Each edge has a corresponding weight $w(v_n, v_m)$ equal to the Euclidean distance between the face positions $p_n$ and $p_m$. The minimal spanning tree of the graph, $MST(G)$, is found using Prim’s algorithm. The minimal spanning tree reveals the structure of the people image; if people are arranged linearly, the minimal spanning tree $MST(G)$ contains no vertices of degree three or greater. For each face $f_n$, the degree of the vertex $v_n$ is a feature $\deg(v_n)$. An example tree is shown in Figure 1.
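As an illustration, a short sketch of the MST degree feature, assuming face positions are given as an (N, 2) numpy array; this is a textbook implementation of Prim’s algorithm, not the authors’ code:

```python
import numpy as np

def mst_degrees(positions):
    """Degree of each face's vertex in the Euclidean minimum spanning
    tree, built with Prim's algorithm over the complete graph."""
    n = len(positions)
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    best = np.full(n, np.inf)       # cheapest edge from the tree to each vertex
    parent = np.full(n, -1)
    degree = np.zeros(n, dtype=int)
    best[0] = 0.0                   # start growing the tree from face 0
    for _ in range(n):
        u = int(np.argmin(np.where(in_tree, np.inf, best)))
        in_tree[u] = True
        if parent[u] >= 0:          # edge (parent[u], u) joins the tree
            degree[u] += 1
            degree[parent[u]] += 1
        closer = ~in_tree & (dist[u] < best)
        best[closer] = dist[u][closer]
        parent[closer] = u
    return degree
```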
Nearest Neighbor: The K nearest neighbors are found, based again on the Euclidean distance between face positions $p$. As we will see, the relative juxtaposition of neighboring faces reveals information about the social relationship between them. Using the nearest neighbor face, the relative position, size, and in-plane face tilt angle are calculated, for a total of four dimensions.
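The paper does not spell out the normalization for these four dimensions; the sketch below is one plausible reading, scaling the neighbor’s offset by the face’s own size:

```python
import numpy as np

def nearest_neighbor_features(positions, sizes, tilts):
    """For each face: relative position (2), size ratio (1), and tilt
    difference (1) of its nearest neighbor -- four dimensions in all.
    Normalizing by each face's own size is our assumption."""
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a face is not its own neighbor
    nn = dist.argmin(axis=1)                # index of each face's nearest neighbor
    rel_pos = (positions[nn] - positions) / sizes[:, None]
    return np.column_stack([rel_pos, sizes[nn] / sizes, tilts[nn] - tilts])
```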
Figure 2 ((a) All, (b) Female-Male, (c) Baby-Other). The position of the nearest face to a given face depends on the social relationship between the pair. (a) The relative position of two nearest neighbors, where the red dot represents the first face, and lighter areas are more likely positions of the nearest neighbor. The red circle represents a radius of 1.5 feet (457 mm). (b) When nearest neighbors are male and female, the male tends to be above and to the side of the female (represented by the red dot). (c) The position of the nearest neighbor to a baby. The baby face tends to be spatially beneath the neighbor, and, incidentally, the nearest neighbor to a baby is female with probability 63%.
Figure 3 ((a) $P(p)$, (b) $P(f_g = \text{male} \mid p)$, (c) $P(f_a < 8 \mid p)$). The absolute position of a face in the image provides clues about age and gender. Each of the three images represents a normalized image. (a) The density of all 28,231 faces in the collection. (b) $P(f_g = \text{male} \mid p)$: a face near the image edge or top is likely to be male. (c) $P(f_a < 8 \mid p)$: a face near the bottom is likely to be a child.
The feature vector $f_x$ captures both the pairwise relationships between faces and a sense of the person’s position relative to the global structure of all people in the image (three absolute dimensions, four relative dimensions, one minimal spanning tree degree, and four nearest-neighbor dimensions).
3.1. Evidence of Social Context
It is evident that the contextual feature $f_x$ captures information related to demographics. Figure 2 shows the spatial distributions between nearest neighbors. The relative position is dependent on gender (b) and age (c). Using the fact that the distance between human adult eye centers is 61±3 mm [9], the mean distance between a person and her nearest neighbor is 306 mm, about five inter-eye distances. This is smaller than the 18-inch (457 mm) radius “personal space” of [2], but perhaps subjects suspend their need for space for the sake of capturing an image.
Figure 3 shows maps of $P(f_a \mid p)$ and $P(f_g \mid p)$, the probability that a face has a particular age or gender given its absolute position. Intuitively, physically taller men are more likely to stand in the group’s back row and appear closer to the image top. Regarding the degree $\deg(v_n)$ of a face in $MST(G)$, females tend to be more centrally located in a group, and consequently have a higher mean degree in $MST(G)$. For faces with $\deg(v_n) > 2$, the probability that the face is female is 62.5%.

[Figure 4 confusion matrices: rows and columns are the seven age categories 0-2, 3-7, 8-12, 13-19, 20-36, 37-65, 66+; panels (a) Context, (b) Appearance, (c) Both.]
Figure 4. The structure of people in an image provides context for estimating age. Shown are the confusion matrices for classifying age using (a) context alone (no facial appearance), (b) content (facial appearance) alone, and (c) both context and facial appearance. Context improves over content alone.
                     Gender   Age (exact)   Age (±1)
Random Baseline       50.0%      14.3%       38.8%
Absolute Position     62.5%      25.7%       56.3%
Relative Position     66.8%      28.5%       60.5%
Min. Spanning Tree    55.3%      21.4%       47.2%
Nearest Neighbor      64.3%      26.7%       56.3%
Combined $f_x$        66.9%      32.9%       64.4%
Table 2. Predicting age and gender from context features $f_x$ alone. The first age column is the accuracy for an exact match; the second allows an error of one age category (e.g., a 3-7 year-old classified as 8-12).
3.2. Demographics from Context and Content
The interesting research question we address is this: How much does the structure of the people in an image tell us about the people? We estimate demographic information about a person using $f_x$. The goal is to estimate each face’s age $f_a$ and gender $f_g$. We show that age and gender can be predicted with accuracy significantly greater than random by considering only the context provided by $f_x$ and no appearance features. In addition, the context has utility when combined with existing appearance-based age and gender discrimination algorithms.
3.2.1 Classifying Age and Gender with Context
Each face in the person image is described with a contextual feature vector $f_x$ that captures local pairwise information (from the nearest neighbor) and global position. We trained classifiers for discriminating age and gender. In each case, we use a Gaussian Maximum Likelihood (GML) classifier to learn $P(f_a \mid f_x)$ and $P(f_g \mid f_x)$. The distribution of each class (7 classes for age, 2 for gender) is learned by fitting a multivariate Gaussian to the distributions $P(f_x \mid f_a)$ and $P(f_x \mid f_g)$. Other classifiers (AdaBoost, decision forests, SVM) yield similar results on this problem, but GML has the advantage that the posterior is easy to estimate directly.
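A minimal sketch of a GML classifier of this kind, assuming numpy/scipy; a uniform class prior is used, which matches the balanced training sets described next (the class structure and names are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

class GMLClassifier:
    """One multivariate Gaussian per class, e.g. P(f_x | f_a); the
    posterior follows from Bayes' rule under a uniform class prior."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.models = [
            multivariate_normal(X[y == c].mean(axis=0),
                                np.cov(X[y == c], rowvar=False),
                                allow_singular=True)
            for c in self.classes
        ]
        return self

    def posterior(self, x):
        lik = np.array([m.pdf(x) for m in self.models])  # P(f_x | class)
        return lik / lik.sum()                           # P(class | f_x)
```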
The age classifier is trained from a random selection of 3500 faces, selected such that each age category has an equal number of samples. Testing is performed on an independent (also uniformly distributed) set of 1050 faces. Faces for test images are selected to achieve a roughly even distribution over the number of people in the image. The prior for gender is roughly even in our collection, so we use a larger training set of 23218 images and test on 1881 faces. For classifying age, our contextual features have an accuracy more than double random chance (14.3%), and gender is correctly classified about two-thirds of the time. Again, we emphasize that no appearance features are considered. Table 2 shows the performance of our classifiers for the different components of the contextual person feature $f_x$. The strongest single component is Relative Position, but the inclusion of all features is best. Babies are recognized with good accuracy, mainly because their faces are smaller and positioned lower than others in the image.
3.2.2 Combining Context with Content
We trained appearance-based age and gender classifiers. These content-based classifiers provide probability estimates $P(f_g \mid f_A)$ and $P(f_a \mid f_A)$ that the face has a particular gender and age category, given the visual appearance $f_A$. Our gender and age classifiers were motivated by the works of [12, 13], where a low-dimensional manifold is learned for the age data. Using cropped and scaled faces (61×49 pixels, with the scaling so the eye centers are 24 pixels apart) from the age training set, two linear projections ($W_a$ for age and $W_g$ for gender) are learned. Each column of $W_a$ is a vector learned by finding the projection that maximizes the ratio of interclass to intraclass variation (by linear discriminant analysis) for a pair of age categories, resulting in 21 columns for $W_a$. A similar approach is used to learn a linear subspace for gender, $W_g$. Instead of learning a single vector from two gender classes, a set of seven projections is learned by learning a single projection that maximizes gender separability for each age range.
The distance $d_{ij}$ between two faces is measured as:

$$d_{ij} = (f_i - f_j) W W^T (f_i - f_j)^T \quad (5)$$
For classification of both age and gender, the nearest $N$ training samples (we use $N = 25$) are found in the space defined by $W_a$ for age or $W_g$ for gender. The class labels of the neighbors are used to estimate $P(f_a \mid f_A)$ and $P(f_g \mid f_A)$ by MLE counts. One benefit of this approach is that a common algorithm and training set are used for both tasks; only the class labels and the pairings for learning the discriminative projections are modified.
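A sketch of this step, under the same hedges as before; note that Euclidean distance in the projected space is rank-equivalent to Eq. (5), since $d_{ij} = \|(f_i - f_j)W\|^2$:

```python
import numpy as np

def knn_posterior(W, X_train, y_train, f_query, classes, N=25):
    """MLE estimate of P(class | appearance): label frequencies among
    the N nearest training faces in the subspace defined by W."""
    d = np.linalg.norm((X_train - f_query) @ W, axis=1)
    nearest = y_train[np.argsort(d)[:N]]
    return np.array([(nearest == c).mean() for c in classes])
```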
                      Gender   Age (exact)   Age (±1)
Context $f_x$          66.9%      32.9%       64.4%
Appearance $f_A$       69.6%      38.3%       71.3%
Combined $f_x, f_A$    74.1%      42.9%       78.1%
Table 3. In images of multiple people, age and gender estimates are improved by considering both appearance and the social context provided by our features. The first age column is exact age category accuracy; the second allows errors of one age category.
                      Gender   Age (exact)   Age (±1)
Context $f_x$          65.1%      27.5%       63.5%
Appearance $f_A$       67.4%      30.2%       65.9%
Combined $f_x, f_A$    73.4%      36.5%       74.6%
Table 4. For smaller faces (≤ 18 pixels between eye centers), classification suffers. However, the gain provided by combining context with content increases.
The performance of both classifiers seems reasonable given the difficulty of this collection. The gender classifier is correct about 70% of the time. This is lower than others [4], but our collection contains a substantial number of children, small faces, and difficult expressions. For people aged 20-65, gender classification is correct 75% of the time, but for ages 0-19, performance is a poorer 60%, as facial gender differences are not as apparent. For age, the classifier is correct 38% of the time, and if a one-category error is allowed, the performance is 71%. These classifiers may not be state-of-the-art, but they are sufficient to illustrate our approach. We are interested in the benefit that can be achieved by modeling the social context.
Using the Naïve Bayes assumption, the final estimate for the class (for example, gender $f_g$) given all available features (both content $f_A$ and context $f_x$) is:

$$P(f_g \mid f_A, f_x) = P(f_g \mid f_A) \, P(f_g \mid f_x) \quad (6)$$
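In code, the fusion is a per-class product of the two posteriors; the renormalization below is our addition so the fused scores again sum to one:

```python
import numpy as np

def fuse_posteriors(p_context, p_appearance):
    """Eq. (6): combine context and appearance posteriors per class."""
    fused = np.asarray(p_context) * np.asarray(p_appearance)
    return fused / fused.sum()
```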
Table 3 shows that both gender and age estimates are improved by incorporating both content (appearance) and context (the structure of the person image). Gender recognition improves by 4.5% by considering person context. Exact age category recognition improves by 4.6%, and when the adjacent age category is also considered correct, the improvement is 6.8%. Figure 5 shows the results of gender classification in image form, with discussion. Accuracy suffers on smaller faces, but the benefit provided by context increases, as shown in Table 4. For example, context now improves gender accuracy by 6%. This corroborates [20] in that the importance of context increases as resolution decreases.
4. Scene Geometry and Semantics from Faces
The position of people in an image provides clues about the geometry of the scene. As shown in [18], camera calibration can be achieved from a video of a walking human, under some reasonable assumptions (that the person walks on the ground plane and the head and feet are visible). By making broader assumptions, we can model the geometry of the scene from a group of face images. First, we assume the faces approximately define a plane we call the face plane, a world plane that passes through the heads (i.e., the centroids of the eye centers) of the people in the person image. Second, we assume that head sizes are roughly similar. Third, we assume the camera has no roll with respect to the face plane. This ensures the face plane horizon is level. In typical group shots, this is approximately accomplished when the photographer adjusts the camera to capture the group.

Criminisi et al. [8] and Hoiem et al. [15] describe the measurement of objects rooted on the ground plane. In contrast, the face plane is not necessarily parallel to the ground, and many times people are either sitting or are not on the ground plane at all. However, since the true face sizes of people are relatively similar, we can compute the face horizon, the vanishing line associated with the face plane.
4.1. Modeling the Face Plane
From the set of faces in the image, we compute the face horizon and the camera height (the distance from the camera to the face plane measured along the face plane normal), not the height of the camera from the ground. Substituting the face plane for the ground plane in Hoiem et al. [15], we have:

$$E_i = \frac{e_i Y_c}{y_i - y_o} \quad (7)$$

where $E_i$ is the face inter-eye distance in the world (61 mm for the average adult), $e_i$ is the face inter-eye distance in the image, $Y_c$ is the camera height, $y_i$ is the y-coordinate of the face center $p$, and $y_o$ is the y-coordinate of the face horizon.
Each of the $N$ face instances in the image provides one equation. The face horizon $y_o$ and camera height $Y_c$ are solved using least squares by linearizing (7) and writing it in matrix form:

$$\begin{bmatrix} E_1 & e_1 \\ E_2 & e_2 \\ \vdots & \vdots \\ E_N & e_N \end{bmatrix} \begin{bmatrix} y_o \\ Y_c \end{bmatrix} = \begin{bmatrix} y_1 E_1 \\ y_2 E_2 \\ \vdots \\ y_N E_N \end{bmatrix} \quad (8)$$
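A sketch of the solve, assuming numpy; e and y hold the per-face image eye distances and face y-coordinates, and E is the assumed world inter-eye distance:

```python
import numpy as np

def face_horizon_and_camera_height(e, y, E=61.0):
    """Least-squares solution of Eq. (8) for the face horizon y_o and
    the camera height Y_c (E = 61 mm for an average adult)."""
    e = np.asarray(e, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.full_like(e, E), e])  # rows [E, e_i]
    b = E * y                                     # rows E * y_i
    (y_o, Y_c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return y_o, Y_c
```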
Reasonable face vanishing lines and camera height estimates are produced, although it should be noted that the camera focal length is not in general recovered. A degenerate case occurs when the face plane and image plane are parallel (e.g., a group shot of standing people of different heights): the face vanishing line is at infinity, and the camera height (in this case, the distance from the camera to the group) cannot be recovered.

To quantify the performance of the camera geometry estimates, we consider a set of 18 images where the face vanishing plane and ground plane are parallel and therefore share a common vanishing line, the horizon. The horizon is

References (partial)
T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models—their training and application.
J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation.
D. Hoiem, A. A. Efros, and M. Hebert. Putting Objects in Perspective.
E. T. Hall. A System for the Notation of Proxemic Behavior.