
Performing content-based retrieval of humans using gait biometrics

Sina Samangooei, Mark S. Nixon
01 Aug 2010, Vol. 49, Iss. 1, pp. 195-212
TLDR
A set of semantic traits discernible by humans at a distance are introduced, outlining their psychological validity and working under the premise that similarity of the chosen gait signature implies similarity of certain semantic traits.
Abstract
In order to analyse surveillance video, we need to efficiently explore large datasets containing videos of walking humans. Effective analysis of such data relies on retrieval of video data which has been enriched using semantic annotations. A manual annotation process is time-consuming and prone to error due to subject bias; however, at surveillance-image resolution, the human walk (their gait) can be analysed automatically. We explore the content-based retrieval of videos containing walking subjects, using semantic queries. We evaluate current research in gait biometrics, unique in its effectiveness at recognising people at a distance. We introduce a set of semantic traits discernible by humans at a distance, outlining their psychological validity. Working under the premise that similarity of the chosen gait signature implies similarity of certain semantic traits, we perform a set of semantic retrieval experiments using popular Latent Semantic Analysis techniques. We perform experiments on a dataset of 2000 videos of people walking in laboratory conditions and achieve promising retrieval results for features such as Sex (mAP = 14% above random), Age (mAP = 10% above random) and Ethnicity (mAP = 9% above random).


Performing Content-Based Retrieval of Humans
Using Gait Biometrics
Sina Samangooei and Mark S. Nixon
School of Electronics and Computer Science, Southampton University, Southampton,
SO17 1BJ, United Kingdom
{ss06r,msn}@ecs.soton.ac.uk
Abstract. In order to analyse surveillance video, we need to efficiently
explore large datasets containing videos of walking humans. At
surveillance-image resolution, the human walk (their gait) can be determined
automatically, and more readily than other features such as the face.
Effective analysis of such data relies on retrieval of video data which has
been enriched using semantic annotations. A manual annotation process
is time-consuming and prone to error due to subject bias. We explore the
content-based retrieval of videos containing walking subjects, using
semantic queries. We evaluate current biometric research using gait, unique
in its effectiveness at recognising people at a distance. We introduce
a set of semantic traits discernible by humans at a distance, outlining
their psychological validity. Working under the premise that similarity
of the chosen gait signature implies similarity of certain semantic traits
we perform a set of semantic retrieval experiments using popular latent
semantic analysis techniques from the information retrieval community.
1 Introduction
In 2006 it was reported that around 4 million CCTV cameras were installed in the UK [4]. This results in 1Mb of video data per second per camera, using relatively conservative estimates¹. Analysis of this huge volume of data has motivated the development of a host of interesting automated techniques, as summarised in [7][16], whose aim is to facilitate effective use of these large quantities of surveillance data. Most techniques primarily concentrate on the description of human behaviour and activities. Some approaches concentrate on low level action features, such as trajectory and direction, whilst others include detection of more complex concepts such as actor goals and scenario detection. Efforts have also been made to analyse non-human elements, including automatic detection of exits and entrances, vehicle monitoring, etc.
Efficient use of large collections of images and videos by humans, such as CCTV footage, can be achieved more readily if media items are meaningfully semantically transcoded or annotated. Semantic and natural language description has been discussed [16][41] as an open area of interest in surveillance. This includes a mapping between behaviours and the semantic concepts which encapsulate them. In essence, automated techniques suffer from issues presented by the multimedia semantic gap [44], between semantic queries which users readily express and which systems cannot answer.
¹ 25 frames per second using 352 × 288 CIF images compressed using MPEG4 (http://www.info4security.com/story.asp?storyCode=3093501)
D. Duke et al. (Eds.): SAMT 2008, LNCS 5392, pp. 105–120, 2008. © Springer-Verlag Berlin Heidelberg 2008
Although some efforts have attempted to bridge this gap for behavioural descriptions, an area which has received little attention is semantic appearance descriptions, especially in surveillance. Semantic whole body descriptions (Height, Figure etc.) and global descriptions (Sex, Ethnicity, Age, etc.) are a natural way to describe individuals. Their use is abundant in character description in narrative, helping readers put characters in a richer context with a few key words such as slender or stout. In a more practical capacity, stable physical descriptions are of key importance in eyewitness crime reports, a scenario where human descriptions are paramount as high detail images of assailants are not always available. Many important semantic features are readily discernible from surveillance videos by humans, and yet are challenging to extract and analyse automatically. Unfortunately, the manual annotation of videos is a laborious [7][16] process, too slow for effective use in real time CCTV footage and vulnerable to various sources of human error (subject variables, anchoring etc.). Automatic analysis of the way people walk [29] (their gait) is an efficient and effective approach to describing human features at a distance. Yet automatic gait analysis techniques do not necessarily generate signatures which are immediately comprehensible by humans. We show that Latent Semantic Analysis techniques, as used successfully by the image retrieval community, can be used to associate semantic physical descriptions with automatically extracted gait features. In doing so, we contend that retrieval tasks involving semantic physical descriptions could be readily facilitated.
The rest of this paper is organised in the following way. In Section 2 we describe Latent Semantic Analysis, the technique chosen to bridge the gap between semantic physical descriptions and gait signatures. In Section 3 we introduce the semantic physical traits and their associated terms, justifying their psychological validity. In Section 4 we briefly summarise modern gait analysis techniques and the gait signature chosen for our experiments. In Section 5 we outline the source of our experiment's description data, using it in Section 6 where we outline the testing methodology and show that our novel approach allows for content-based video retrieval based on gait. Finally, in Section 7 we discuss the final results and future work.
2 Latent Semantic Analysis
2.1 The Singular Value Decomposition
In text retrieval, Cross-Language Latent Semantic Indexing (CL-LSI) [20], itself an extension of LSI [9], is a technique which statistically relates the contextual usage of terms in large corpora of text documents. In our approach, LSI is used to construct a linear-algebraic semantic space from multimedia sources [14][37] within which documents and terms sharing similar meaning also have similar spatial location.
We start by constructing an occurrence matrix $O$ whose values represent the presence of terms in documents (columns represent documents and rows represent terms). In our scenario documents are videos. Semantic features and automatic features are considered terms. The “occurrence” of an automatic feature signifies the magnitude of that portion of the automatic feature vector, while the “occurrence” of a semantic term signifies its semantic relevance to the subject in the video. Our goal is the production of a rank-reduced factorisation of the observation matrix consisting of a term matrix $T$ and document matrix $D$, such that:
$$O \approx TD, \tag{1}$$
where the vectors in $T$ and $D$ represent the location of individual terms and documents respectively within some shared space.
$T$ and $D$ can be efficiently calculated using the singular value decomposition (SVD), which is defined as:
$$O = U \Sigma V^{\top}, \tag{2}$$
such that $T = U$ and $D = \Sigma V^{\top}$, and the rows of $U$ represent positions of terms and the columns of $\Sigma V^{\top}$ represent the positions of documents. The diagonal entries of $\Sigma$ are equal to the singular values of $O$. The columns of $U$ and $V$ are, respectively, left- and right-singular vectors for the corresponding singular values in $\Sigma$. The singular values of any $m \times n$ matrix $O$ are defined as values $\{\sigma_1, \ldots, \sigma_r\}$ such that:
$$O v_i = \sigma_i u_i, \tag{3}$$
and
$$O^{\top} u_i = \sigma_i v_i, \tag{4}$$
where $v_i$ and $u_i$ are defined as the right and left singular vectors respectively.
It can be shown that $v_i$ and $u_i$ are in fact the eigenvectors, with corresponding eigenvalues $\{\lambda_1 = \sigma_1^2, \ldots, \lambda_r = \sigma_r^2\}$, of the square symmetric matrices $O^{\top}O$ and $OO^{\top}$ respectively, referred to as the co-occurrence matrices. The matrix $U$ contains the eigenvectors of $OO^{\top}$ as its columns, while $V$ contains the eigenvectors of $O^{\top}O$ as its columns, and $\Sigma$ contains the singular values (the square roots of the eigenvalues) along its diagonal. Subsequently:
$$O^{\top}O = V \Sigma^{\top} U^{\top} U \Sigma V^{\top} = V \Sigma^{\top} \Sigma V^{\top}, \tag{5}$$
$$OO^{\top} = U \Sigma V^{\top} V \Sigma^{\top} U^{\top} = U \Sigma \Sigma^{\top} U^{\top}. \tag{6}$$
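The relations in Eqs. (2) to (6) can be checked numerically. The following is a minimal sketch using NumPy, assuming a small random matrix as a stand-in for the occurrence matrix; the matrix size and the variable names are illustrative only and are not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical occurrence matrix: 6 terms (rows) x 4 documents (columns).
O = rng.random((6, 4))

# Thin SVD: O = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(O, full_matrices=False)
V = Vt.T

# Eq. (2): the factorisation reconstructs O.
reconstruction_ok = np.allclose(O, U @ np.diag(s) @ Vt)

# Eqs. (3) and (4): O v_i = sigma_i u_i and O^T u_i = sigma_i v_i.
# Multiplying by s scales column i by sigma_i.
right_ok = np.allclose(O @ V, U * s)
left_ok = np.allclose(O.T @ U, V * s)

# Eqs. (5) and (6): the columns of V and U are eigenvectors of the
# co-occurrence matrices, with eigenvalues sigma_i squared.
doc_eig_ok = np.allclose(O.T @ O @ V, V * s**2)
term_eig_ok = np.allclose(O @ O.T @ U, U * s**2)

print(reconstruction_ok, right_ok, left_ok, doc_eig_ok, term_eig_ok)
```

Note that the singular vectors appear as the columns of `U` and `V`, while row $i$ of `U` gives the latent-space position of term $i$.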
To appreciate the importance of SVD and the eigenvector matrices $V$ and $U$ for information retrieval purposes, consider the meaning of the respective co-occurrence matrices:
$$T_{co} = OO^{\top}, \tag{7}$$
$$D_{co} = O^{\top}O. \tag{8}$$

The magnitude of the values in $T_{co}$ relates to how often a particular term appears with every other term throughout all documents, and therefore gives some concept of the relatedness of terms. The values in $D_{co}$ relate to how many terms every document shares with every other document, and therefore the “relatedness” of documents.
By definition the matrices of eigenvectors $U$ and $V$ of the two matrices $T_{co}$ and $D_{co}$ respectively form two bases for the co-occurrence spaces, i.e. the combinations of terms (or documents) which the entire space of term (or document) co-occurrence can be projected into without information loss.
In a similar strategy to Principal Components Analysis (PCA), LSA works on the premise that the eigenvectors represent underlying latent concepts encoded by the co-occurrence matrix and by extension the original data. It is helpful to think of these latent concepts as mixtures (or weightings) of terms or documents. Making such an assumption allows for some interesting mathematical conclusions. Firstly, the eigenvectors with the largest corresponding eigenvalues can be thought of as the most representative latent concepts of the space. This means that by using only the most relevant components of $T$ and $D$ (as ordered by the singular values), less meaningful underlying concepts can be ignored and higher accuracy achieved. Also, as both the document and term co-occurrence matrices represent the same data, their latent concepts must be identical and subsequently comparable². Therefore the positions of terms and documents projected into the latent space are similar if the terms and documents in fact share similar meaning.
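Keeping only the leading components can be illustrated as follows. This is a minimal sketch, assuming NumPy and a random stand-in matrix; the function name `rank_reduced_factors` and the choice `k=2` are hypothetical. It also checks a standard property of the truncated SVD: the Frobenius norm of the approximation error equals the root of the sum of the squared discarded singular values.

```python
import numpy as np

def rank_reduced_factors(O, k):
    """Return T (terms x k), D (k x documents) and all singular values of O."""
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    T = U[:, :k]                      # term positions, as in T = U
    D = np.diag(s[:k]) @ Vt[:k, :]    # document positions, as in D = Sigma V^T
    return T, D, s

rng = np.random.default_rng(1)
# Hypothetical occurrence matrix: 8 terms x 5 documents.
O = rng.random((8, 5))

T, D, s = rank_reduced_factors(O, k=2)

# Error of the rank-2 approximation O ~ T D (Frobenius norm).
error = np.linalg.norm(O - T @ D)
# The same quantity computed from the discarded singular values.
expected_error = np.sqrt(np.sum(s[2:] ** 2))

print(T.shape, D.shape, np.isclose(error, expected_error))
```

How many components to keep is a design choice that this sketch does not address.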
2.2 Using SVD
With this insight, our task becomes the choice of semantic and visual terms to be observed from each subject for the generation of an observation matrix. Once this matrix is generated, content-based retrieval by semantic query of unannotated documents can be achieved by exploiting the projection of partially observed vectors into the eigenspace represented by either $T$ or $D$.
Assume we have two subject-video collections, a fully annotated training collection and a test collection, lacking semantic annotations. A matrix $O_{train}$ is constructed such that training documents are held in its columns. Both visual and semantic terms are fully observed for each training document, i.e. a term is set to a non-zero value encoding its existence or relevance to a particular video. Using the process described in Section 2.1 we can obtain $T_{train}$ and $D_{train}$ for the training matrix $O_{train}$.
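One way to assemble such a training matrix is sketched below, assuming NumPy. The visual feature values, the term names (`"Sex:Male"` and so on) and the helper `build_occurrence_matrix` are all hypothetical and chosen only to show the layout: visual terms stacked above semantic terms, with one column per training video.

```python
import numpy as np

# Hypothetical semantic vocabulary (one row per term).
semantic_terms = ["Sex:Male", "Sex:Female", "Age:Young", "Age:Old"]

# Hypothetical gait signatures: one row per video, three visual dimensions.
visual = np.array([
    [0.9, 0.1, 0.4],
    [0.2, 0.8, 0.5],
    [0.7, 0.3, 0.6],
])

# Hypothetical annotations: the semantic terms observed for each video.
annotations = [
    {"Sex:Male", "Age:Young"},
    {"Sex:Female", "Age:Old"},
    {"Sex:Male", "Age:Old"},
]

def build_occurrence_matrix(visual, annotations, semantic_terms):
    """Stack visual terms over semantic terms; columns are documents (videos)."""
    semantic = np.array([
        [1.0 if term in video_terms else 0.0 for video_terms in annotations]
        for term in semantic_terms
    ])
    return np.vstack([visual.T, semantic])

O_train = build_occurrence_matrix(visual, annotations, semantic_terms)
print(O_train.shape)  # (7, 3): 3 visual + 4 semantic terms, 3 videos
```

A test matrix can be built with the same helper by passing empty annotation sets, which leaves every semantic row at zero.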
² It can also be shown that the two sets of eigenvectors are in fact in the same vector space [37] and are subsequently directly comparable.
Content-Based Retrieval. To retrieve the set of unannotated subjects based on their visual gait components alone, a new partially observed document matrix $O_{test}$ is constructed such that visual gait terms are prescribed and semantic terms are set to zero. For retrieval by semantic terms, a query document matrix $O_{query}$ is constructed where all visual and non-relevant semantic terms are set to zero while relevant semantic terms are given a non-zero value (usually 1.0). The query and test matrices are projected into the latent space in the following manner:
$$D_{test} = T_{train}^{\top} O_{test}, \tag{9}$$
$$D_{query} = T_{train}^{\top} O_{query}. \tag{10}$$
Projected test documents held in $D_{test}$ are simply ordered according to their cosine distance from query documents in $D_{query}$ for retrieval. This process readily allows for automatic annotation, though exploration in this area is beyond the scope of this report. We postulate that annotation could be achieved by finding the distance of $D_{test}$ to each term in $T_{train}$. A document is annotated with a term if that term is the closest compared to others belonging to the same physical trait (discussed in more detail in Section 3).
We show results for retrieval experiments in Section 6.
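The retrieval procedure of Section 2.2 can be sketched end to end on toy data. This is a minimal illustration assuming NumPy; the data, the two-term visual and semantic vocabularies, the choice of two latent dimensions and the function names are all hypothetical. Documents are ranked by descending cosine similarity, which is equivalent to ascending cosine distance.

```python
import numpy as np

def fit_term_space(O_train, k):
    """T_train: the k leading left singular vectors of the training matrix."""
    U, _, _ = np.linalg.svd(O_train, full_matrices=False)
    return U[:, :k]

def rank_by_query(T_train, O_test, O_query):
    """Project with Eqs. (9) and (10), then rank test documents per query."""
    D_test = T_train.T @ O_test
    D_query = T_train.T @ O_query
    # Normalise columns so the dot product is the cosine similarity.
    # (Assumes no document projects to the zero vector.)
    D_test = D_test / np.linalg.norm(D_test, axis=0, keepdims=True)
    D_query = D_query / np.linalg.norm(D_query, axis=0, keepdims=True)
    similarity = D_query.T @ D_test          # queries x test documents
    return np.argsort(-similarity, axis=1)   # most similar first

# Rows: visual_0, visual_1, semantic_A, semantic_B. Columns: training videos.
# In this toy set, visual_0 always co-occurs with A and visual_1 with B.
O_train = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
])

# Test videos: visual terms observed, semantic terms set to zero.
O_test = np.array([
    [0.9, 0.1, 0.8],
    [0.1, 0.9, 0.3],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
])

# Query for semantic term A: only that term is non-zero.
O_query = np.array([[0.0], [0.0], [1.0], [0.0]])

T_train = fit_term_space(O_train, k=2)
ranking = rank_by_query(T_train, O_test, O_query)[0]
print(ranking)  # [0 2 1]
```

In this toy case the two test videos dominated by `visual_0` are ranked ahead of the one dominated by `visual_1`, because `visual_0` and semantic term A share a latent direction in the training data.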
3 Human Physical Descriptions
The description of humans based on their physical features has been explored for several purposes including medicine [34], eyewitness analysis and human identification³. Descriptions chosen differ in levels of granularity and include features both visibly measurable but also those only measurable through use of specialised tools. One of the first attempts to systematically describe people for identification based on their physical traits was the anthropometric system developed by Bertillon [5] in 1896. His system used eleven precisely measured traits of the human body including height, length of right ear and width of cheeks. This system was quickly surpassed by other forms of forensic analysis such as fingerprints. More recently, physical descriptions have also been used in biometric techniques as an ancillary data source where they are referred to as soft biometrics [28], as opposed to primary biometric sources such as iris, face or gait. In behaviour analysis, several model based techniques [1] attempt the automatic extraction of individual body components as a source of behavioural information. Though the information about the individual components is not used directly, these techniques provide some insight into the level of granularity at which body features are still discernible at a distance.
When choosing the features that should be considered for semantic retrieval of surveillance media, two major questions must be answered. Firstly, which human traits should be described? Secondly, how should these traits be represented? The following sections outline and justify the traits chosen and outline the semantic terms chosen for each physical trait.
Physical Traits
To match the advantages of automatic surveillance media, one of our primary
concerns was to choose traits that are discernible by humans at a distance. To
³ Interpol. Disaster Victim Identification Form (Yellow). Booklet, 2008.

References

Indexing by Latent Semantic Analysis
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.

Social Psychology of Intergroup Relations
TL;DR: In this paper, the scope and range of ethnocentrism in group behavior is discussed. But the focus is on the individual and not on the group as a whole, rather than the entire group.

An introduction to biometric recognition
TL;DR: A brief overview of the field of biometrics is given and some of its advantages, disadvantages, strengths, limitations, and related privacy concerns are summarized.

Visual perception of biological motion and a model for its analysis
TL;DR: The kinetic-geometric model for visual vector analysis originally developed in the study of perception of motion combinations of the mechanical type was applied to biological motion patterns and the results turned out to be highly positive.

A survey on visual surveillance of object motion and behaviors
TL;DR: This paper reviews recent developments and general strategies of the processing framework of visual surveillance in dynamic scenes, and analyzes possible research directions, e.g., occlusion handling, a combination of two and three-dimensional tracking, and fusion of information from multiple sensors, and remote surveillance.