What are the characteristics of the traits that are used in the biometric approach?

Soft biometric techniques use a mixture of categorical metrics (e.g. Ethnicity) and value metrics (e.g. Height) to represent their traits.

What is the effect of the hair colour on the construction of binary silhouettes?

The construction of binary silhouettes is undoubtedly affected by hair colour when compared to the background, and as such the average silhouette images retain hair colour as brightness in the head region.

How do the authors account for the anchoring of terms?

The authors have accounted for anchoring of terms gathered for individual traits by setting the default term of a trait to a neutral “Unsure” rather than any concept of “Average”.

What is the problem with the manual annotation of videos?

The manual annotation of videos is a laborious [7][16] process, too slow for effective use in real-time CCTV footage and vulnerable to various sources of human error (subject variables, anchoring, etc.).

What are the common ethnic terms?

Their ethnic terms encompass the three categories mentioned most often and an extra two categories (Indian and Middle Eastern) matching the UK census.

What is the popular approach to extracting gait information from video?

In their experiments, the videos used are from camera set-up “a” during which subjects walk at a natural pace side on to the plane of the camera view and walking either towards the left or right.

How many annotators have been used to analyse gait videos?

Each subject has been annotated by at least two separate annotators, though 10 have been annotated by 40 annotators as part of a previous, more rigorous, though smaller scale experiment [35].

(Open Access) Performing content-based retrieval of humans using gait biometrics (2010) | Sina Samangooei

Q: What are the contributions in "Performing content-based retrieval of humans using gait biometrics"?

The authors introduce a set of semantic traits discernible by humans at a distance, outlining their psychological validity. Working under the premise that similarity of the chosen gait signature implies similarity of certain semantic traits the authors perform a set of semantic retrieval experiments using popular latent semantic analysis techniques from the information retrieval community.

Performing Content-Based Retrieval of Humans Using Gait Biometrics

Sina Samangooei and Mark S. Nixon

School of Electronics and Computer Science, Southampton University, Southampton, SO17 1BJ, United Kingdom

{ss06r,msn}@ecs.soton.ac.uk

Abstract. In order to analyse surveillance video, we need to efficiently explore large datasets containing videos of walking humans. At surveillance-image resolution, the human walk (their gait) can be determined automatically, and more readily than other features such as the face. Effective analysis of such data relies on retrieval of video data which has been enriched using semantic annotations. A manual annotation process is time-consuming and prone to error due to subject bias. We explore the content-based retrieval of videos containing walking subjects, using semantic queries. We evaluate current biometric research using gait, unique in its effectiveness at recognising people at a distance. We introduce a set of semantic traits discernible by humans at a distance, outlining their psychological validity. Working under the premise that similarity of the chosen gait signature implies similarity of certain semantic traits, we perform a set of semantic retrieval experiments using popular latent semantic analysis techniques from the information retrieval community.

1 Introduction

In 2006 it was reported that around 4 million CCTV cameras were installed in the UK [4]. This results in 1Mb of video data per second per camera, using relatively conservative estimates¹. Analysis of this huge volume of data has motivated the development of a host of interesting automated techniques, as summarised in [7][16], whose aim is to facilitate effective use of these large quantities of surveillance data. Most techniques primarily concentrate on the description of human behaviour and activities. Some approaches concentrate on low level action features, such as trajectory and direction, whilst others include detection of more complex concepts such as actor goals and scenario detection. Efforts have also been made to analyse non-human elements, including automatic detection of exits and entrances, vehicle monitoring, etc.

Efficient use of large collections of images and videos by humans, such as CCTV footage, can be achieved more readily if media items are meaningfully semantically transcoded or annotated. Semantic and natural language description has been discussed [16][41] as an open area of interest in surveillance. This includes a mapping between behaviours and the semantic concepts which encapsulate them. In essence, automated techniques suffer from issues presented by the multimedia semantic gap [44], between semantic queries which users readily express and which systems cannot answer.

¹ 25 frames per second using 352 × 288 CIF images compressed using MPEG4 (http://www.info4security.com/story.asp?storyCode=3093501)

D. Duke et al. (Eds.): SAMT 2008, LNCS 5392, pp. 105–120, 2008. © Springer-Verlag Berlin Heidelberg 2008

Although some efforts have attempted to bridge this gap for behavioural descriptions, an area which has received little attention is semantic appearance descriptions, especially in surveillance. Semantic whole body descriptions (Height, Figure, etc.) and global descriptions (Sex, Ethnicity, Age, etc.) are a natural way to describe individuals. Their use is abundant in character description in narrative, helping readers put characters in a richer context with a few key words such as slender or stout. In a more practical capacity, stable physical descriptions are of key importance in eyewitness crime reports, a scenario where human descriptions are paramount as high detail images of assailants are not always available. Many important semantic features are readily discernible from surveillance videos by humans, and yet are challenging to extract and analyse automatically. Unfortunately, the manual annotation of videos is a laborious [7][16] process, too slow for effective use in real-time CCTV footage and vulnerable to various sources of human error (subject variables, anchoring, etc.). Automatic analysis of the way people walk [29] (their gait) is an efficient and effective approach to describing human features at a distance. Yet automatic gait analysis techniques do not necessarily generate signatures which are immediately comprehensible by humans. We show that Latent Semantic Analysis techniques, as used successfully by the image retrieval community, can be used to associate semantic physical descriptions with automatically extracted gait features. In doing so, we contend that retrieval tasks involving semantic physical descriptions could be readily facilitated.

The rest of this paper is organised in the following way. In Section 2 we describe Latent Semantic Analysis, the technique chosen to bridge the gap between semantic physical descriptions and gait signatures. In Section 3 we introduce the semantic physical traits and their associated terms, justifying their psychological validity. In Section 4 we briefly summarise modern gait analysis techniques and the gait signature chosen for our experiments. In Section 5 we outline the source of our experiment’s description data, using it in Section 6 where we outline the testing methodology and show that our novel approach allows for content-based video retrieval based on gait. Finally, in Section 7 we discuss the final results and future work.

2 Latent Semantic Analysis

2.1 The Singular Value Decomposition

In text retrieval, Cross Language Latent Semantic Indexing (CL-LSI) [20], itself an extension of LSI [9], is a technique which statistically relates contextual usage of terms in large corpora of text documents. In our approach, LSI is used to construct a Linear-Algebraic Semantic Space from multimedia sources [14][37] within which documents and terms sharing similar meaning also have similar spatial location.

We start by constructing an occurrence matrix $O$ whose values represent the presence of terms in documents (columns represent documents and rows represent terms). In our scenario documents are videos. Semantic features and automatic features are considered terms. The “occurrence” of an automatic feature signifies the magnitude of that portion of the automatic feature vector, while the “occurrence” of a semantic term signifies its semantic relevance to the subject in the video. Our goal is the production of a rank reduced factorisation of the observation matrix consisting of a term matrix $T$ and document matrix $D$, such that:

$$ O \approx TD, \tag{1} $$

where the vectors in $T$ and $D$ represent the location of individual terms and documents respectively within some shared space.

$T$ and $D$ can be efficiently calculated using the singular value decomposition (SVD), which is defined as:

$$ O = U \Sigma V^{T} \tag{2} $$

such that $T = U$ and $D = \Sigma V^{T}$, where the rows of $U$ represent the positions of terms and the columns of $\Sigma V^{T}$ represent the positions of documents. The diagonal entries of $\Sigma$ are equal to the singular values of $O$. The columns of $U$ and $V$ are, respectively, left- and right-singular vectors for the corresponding singular values in $\Sigma$. The singular values of any $m \times n$ matrix $O$ are defined as values $\{\sigma_1, \ldots, \sigma_r\}$ such that:

$$ O v_i = \sigma_i u_i, \tag{3} $$

and

$$ O^{T} u_i = \sigma_i v_i, \tag{4} $$

where $v_i$ and $u_i$ are defined as the right and left singular vectors respectively. It can be shown that $v_i$ and $u_i$ are in fact the eigenvectors, with corresponding eigenvalues $\{\lambda_1 = \sigma_1^2, \ldots, \lambda_r = \sigma_r^2\}$, of the square symmetric matrices $O^{T}O$ and $OO^{T}$ respectively, referred to as the co-occurrence matrices. The matrix $U$ contains all the eigenvectors of $OO^{T}$ as its columns while $V$ contains all the eigenvectors of $O^{T}O$ as its columns, and $\Sigma$ contains the square roots of the eigenvalues (the singular values) along its diagonal. Subsequently:

$$ O^{T}O = V \Sigma^{T} U^{T} U \Sigma V^{T} = V \Sigma^{T} \Sigma V^{T}, \tag{5} $$

$$ OO^{T} = U \Sigma V^{T} V \Sigma^{T} U^{T} = U \Sigma \Sigma^{T} U^{T}. \tag{6} $$
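The factorisation above can be checked numerically. The following is a minimal sketch using NumPy, not the authors' implementation: the toy occurrence matrix and its values are invented for illustration, and the variable names simply mirror the symbols in the text. It forms $T = U$ and $D = \Sigma V^{T}$, measures how closely $TD$ reproduces $O$, and compares the eigenvalues of $O^{T}O$ with the squared singular values.

```python
import numpy as np

# Toy occurrence matrix: rows are terms, columns are documents (videos).
# The values are made up purely for illustration.
O = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.8, 0.1, 0.9, 0.2],
    [0.2, 0.9, 0.1, 0.7],
    [0.5, 0.5, 0.4, 0.6],
])

# Thin SVD: O = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(O, full_matrices=False)

T = U                # term positions, one row per term
D = np.diag(s) @ Vt  # document positions, one column per document

# Largest absolute difference between O and its factorisation T @ D.
reconstruction_error = np.abs(O - T @ D).max()

# Eigenvalues of the symmetric matrix O^T O, sorted to line up with s**2.
eigvals = np.sort(np.linalg.eigvalsh(O.T @ O))[::-1]
eig_gap = np.abs(eigvals - s**2).max()

print(reconstruction_error, eig_gap)
```

Both printed quantities should be at the level of floating-point rounding, which is the numerical counterpart of equations (2), (5) and (6).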

To appreciate the importance of SVD and the eigenvector matrices $V$ and $U$ for information retrieval purposes, consider the meaning of the respective co-occurrence matrices:

$$ T_{CO} = OO^{T}, \tag{7} $$

$$ D_{CO} = O^{T}O. \tag{8} $$

The magnitudes of the values in $T_{CO}$ relate to how often a particular term appears with every other term throughout all documents, and therefore give some concept of the “relatedness” of terms. The values in $D_{CO}$ relate to how many terms every document shares with every other document, and therefore the “relatedness” of documents.

By definition, the matrices of eigenvectors $U$ and $V$ of the two matrices $T_{CO}$ and $D_{CO}$ respectively form two bases for the co-occurrence spaces, i.e. the combinations of terms (or documents) which the entire space of term co-occurrence can be projected into without information loss.

In a similar strategy to Principal Components Analysis (PCA), LSA works on the premise that the eigenvectors represent underlying latent concepts encoded by the co-occurrence matrix and, by extension, the original data. It is helpful to think of these latent concepts as mixtures (or weightings) of terms or documents. Making such an assumption allows for some interesting mathematical conclusions. Firstly, the eigenvectors with the largest corresponding eigenvalues can be thought of as the most representative latent concepts of the space. This means that by using only the most relevant components of $T$ and $D$ (as ordered by the singular values), less meaningful underlying concepts can be ignored and higher accuracy achieved. Also, as both the document and term co-occurrence matrices represent the same data, their latent concepts must be identical and subsequently comparable². Therefore the positions of terms and documents projected into the latent space are similar if the terms and documents in fact share similar meaning.
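Keeping only the most relevant components can be sketched as a truncated SVD. The example below is illustrative only: the helper name `rank_reduce`, the matrix sizes, the noise level and the choice of $k = 2$ are all assumptions, not values taken from the experiments. The synthetic data is generated from two hidden concepts plus a small amount of noise, so a rank-2 factorisation should retain almost all of the structure.

```python
import numpy as np

def rank_reduce(O, k):
    """Return (T_k, D_k) built from the k largest singular values of O."""
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    T_k = U[:, :k]                    # term positions in the k-dim latent space
    D_k = np.diag(s[:k]) @ Vt[:k, :]  # document positions in the same space
    return T_k, D_k

rng = np.random.default_rng(0)

# Hypothetical data: 6 terms and 8 documents driven by 2 hidden concepts,
# with a small amount of additive noise.
concepts = rng.random((6, 2))
weights = rng.random((2, 8))
O = concepts @ weights + 0.01 * rng.standard_normal((6, 8))

T2, D2 = rank_reduce(O, k=2)

# Relative Frobenius-norm error of the rank-2 approximation O ≈ T2 @ D2.
relative_error = np.linalg.norm(O - T2 @ D2) / np.linalg.norm(O)
print(relative_error)
```

In practice the number of retained components would have to be chosen empirically; nothing here suggests a particular value for real gait data.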

2.2 Using SVD

With this insight, our task becomes the choice of semantic and visual terms to be observed from each subject for the generation of an observation matrix. Once this matrix is generated, content-based retrieval by semantic query of unannotated documents can be achieved by exploiting the projection of partially observed vectors into the eigenspace represented by either $T$ or $D$.

Assume we have two subject-video collections: a fully annotated training collection and a test collection lacking semantic annotations. A matrix $O_{\text{train}}$ is constructed such that training documents are held in its columns. Both visual and semantic terms are fully observed for each training document, i.e. a term is set to a non-zero value encoding its existence or relevance to a particular video. Using the process described in Section 2.1 we can obtain $T_{\text{train}}$ and $D_{\text{train}}$ for the training matrix $O_{\text{train}}$.
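One way to picture the training matrix is as a stack of visual terms on top of semantic terms, one column per video. The sketch below is a hypothetical layout rather than the encoding used in the experiments: the term vocabulary `SEMANTIC_TERMS`, the helper `build_column`, the three-dimensional gait signatures and the use of 1.0 for a present term are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical semantic vocabulary ("Trait:Term"); not the paper's full list.
SEMANTIC_TERMS = ["Height:Short", "Height:Tall", "Sex:Female", "Sex:Male"]

def build_column(gait_features, annotations):
    """Stack the visual terms on top of the semantic terms for one video."""
    semantic = np.array(
        [1.0 if term in annotations else 0.0 for term in SEMANTIC_TERMS]
    )
    return np.concatenate([np.asarray(gait_features, dtype=float), semantic])

# Two made-up training videos, each with a 3-dimensional gait signature
# and a set of semantic annotations.
training = [
    ([0.9, 0.1, 0.4], {"Height:Tall", "Sex:Male"}),
    ([0.2, 0.8, 0.5], {"Height:Short", "Sex:Female"}),
]

# Rows are terms (3 visual + 4 semantic), columns are training documents.
O_train = np.column_stack([build_column(g, a) for g, a in training])
print(O_train.shape)
```

A test document would use the same row layout with the semantic rows left at zero, and a semantic query would do the opposite.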

² It can also be shown that the two sets of eigenvectors are in fact in the same vector space [37] and are subsequently directly comparable.

Content-Based Retrieval. To retrieve the set of unannotated subjects based on their visual gait components alone, a new partially observed document matrix $O_{\text{test}}$ is constructed such that visual gait terms are prescribed and semantic terms are set to zero. For retrieval by semantic terms, a query document matrix is constructed where all visual and non-relevant semantic terms are set to zero while relevant semantic terms are given a non-zero value (usually 1.0); this query matrix is $O_{\text{query}}$. The query and test matrices are projected into the latent space in the following manner:

$$ D_{\text{test}} = T_{\text{train}}^{T} O_{\text{test}}, \tag{9} $$

$$ D_{\text{query}} = T_{\text{train}}^{T} O_{\text{query}}. \tag{10} $$

Projected test documents held in $D_{\text{test}}$ are simply ordered according to their cosine distance from query documents in $D_{\text{query}}$ for retrieval. This process readily allows for automatic annotation, though exploration in this area is beyond the scope of this report. We postulate that annotation could be achieved by finding the distance of $D_{\text{test}}$ to each term in $T_{\text{train}}$. A document is annotated with a term if that term is the closest compared to others belonging to the same physical trait (discussed in more detail in Section 3).

We show results for retrieval experiments in Section 6.
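The projection and ranking steps can be sketched as follows. This is a toy illustration under stated assumptions, not the experimental pipeline: the function names `project`, `cosine_distance` and `rank_documents` are hypothetical, the matrices contain invented values with two visual terms and two semantic terms ("A" and "B"), and two latent components are kept arbitrarily.

```python
import numpy as np

def project(T_train, O_partial):
    """Project partially observed documents into the latent space: D = T^T O."""
    return T_train.T @ O_partial

def cosine_distance(a, b):
    """One minus the cosine of the angle between two latent vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_documents(T_train, O_test, o_query):
    """Return test-document indices ordered from closest to furthest."""
    D_test = project(T_train, O_test)
    d_query = project(T_train, o_query)
    distances = [
        cosine_distance(D_test[:, j], d_query) for j in range(D_test.shape[1])
    ]
    return np.argsort(distances)

# Toy training matrix: rows are terms, columns are four annotated videos.
O_train = np.array([
    [1.0, 0.9, 0.1, 0.0],  # visual term 1
    [0.0, 0.1, 0.9, 1.0],  # visual term 2
    [1.0, 1.0, 0.0, 0.0],  # semantic term A
    [0.0, 0.0, 1.0, 1.0],  # semantic term B
])
U, s, Vt = np.linalg.svd(O_train, full_matrices=False)
T_train = U[:, :2]  # keep two latent components (an arbitrary choice)

# Two unannotated test videos: visual terms observed, semantic terms zero.
O_test = np.array([
    [0.10, 0.95],
    [0.90, 0.05],
    [0.00, 0.00],
    [0.00, 0.00],
])

# Query for semantic term A only: all other terms are zero.
o_query = np.array([0.0, 0.0, 1.0, 0.0])

ranking = rank_documents(T_train, O_test, o_query)
print(ranking)
```

In this toy setup the second test video (index 1), whose visual terms resemble those of the training videos annotated with term A, is ranked ahead of the first. The annotation idea postulated above could reuse `cosine_distance` between a projected test document and the rows of `T_train`, but that step is not shown here.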

3 Human Physical Descriptions

The description of humans based on their physical features has been explored for several purposes including medicine [34], eyewitness analysis and human identification³. Descriptions chosen differ in levels of granularity and include features both visibly measurable but also those only measurable through use of specialised tools. One of the first attempts to systematically describe people for identification based on their physical traits was the anthropometric system developed by Bertillon [5] in 1896. His system used eleven precisely measured traits of the human body including height, length of right ear and width of cheeks. This system was quickly surpassed by other forms of forensic analysis such as fingerprints. More recently, physical descriptions have also been used in biometric techniques as an ancillary data source where they are referred to as soft biometrics [28], as opposed to primary biometric sources such as iris, face or gait. In behaviour analysis, several model based techniques [1] attempt the automatic extraction of individual body components as a source of behavioural information. Though the information about the individual components is not used directly, these techniques provide some insight into the level of granularity at which body features are still discernible at a distance.

When choosing the features that should be considered for semantic retrieval of surveillance media, two major questions must be answered. Firstly, which human traits should be described and secondly, how should these traits be represented. The following sections outline and justify the traits chosen and outline the semantic terms chosen for each physical trait.

Physical Traits

To match the advantages of automatic surveillance media, one of our primary concerns was to choose traits that are discernible by humans at a distance. To

³ Interpol. Disaster Victim Identification Form (Yellow). Booklet, 2008.
