Re-identification by Relative Distance Comparison
Wei-Shi Zheng, Member, IEEE, Shaogang Gong, and Tao Xiang
Abstract: Matching people across non-overlapping camera views at different locations and different times, known as person
re-identification, is both a hard and important problem for
associating behaviour of people observed in a large distributed
space over a prolonged period of time. Person re-identification is
fundamentally challenging because of the large visual appearance
changes caused by variations in view angle, lighting, background
clutter and occlusion. To address these challenges, most previous
approaches aim to model and extract distinctive and reliable
visual features. However, seeking an optimal and robust similarity
measure that quantifies a wide range of features against realistic
viewing conditions from a distance is still an open and unsolved
problem for person re-identification. In this paper, we formulate
person re-identification as a relative distance comparison learning
problem in order to learn the optimal similarity measure between
a pair of person images. This approach avoids treating all features
indiscriminately and does not assume the existence of some
universally distinctive and reliable features. To that end, a novel
relative distance comparison (RDC) model is introduced. The
model is formulated to maximise the likelihood of a pair of true
matches having a relatively smaller distance than that of a wrong
match pair in a soft discriminant manner. Moreover, in order to
maintain the tractability of the model in large scale learning, we
further develop an ensemble RDC model. Extensive experiments
on three publicly available benchmarking datasets are carried
out to demonstrate the clear superiority of the proposed RDC
models over related popular person re-identification techniques.
The results also show that the new RDC models are more robust
against visual appearance changes and less susceptible to model
over-fitting compared to other related existing models.
Index Terms: Person re-identification, feature quantification, feature selection, relative distance comparison
I. INTRODUCTION
For understanding behaviour of people in a large area of public
space covered by multiple non-overlapping (disjoint) cameras, it is
critical that when a target disappears from one view, he/she can be
re-identified in another view at a different location among a crowd
of people. Solving this inter-camera people association problem,
known as re-identification, enables tracking of the same person
through different camera views located at different physical sites
[26], [15], [32], [17], [8].
Despite the best efforts from computer vision researchers in
the past five years, the person re-identification problem remains
largely unsolved. This is due to a number of reasons. First, in
a busy uncontrolled environment monitored by cameras from a
distance, person verification relying upon biometrics such as face
and gait is infeasible and unreliable. Second, as the transition time between disjoint cameras¹ varies greatly and with uncertainty from individual to individual, it is hard to impose accurate temporal and spatial constraints. The person re-identification problem is therefore made harder still, as a model can rely mostly on appearance features alone. Third, the visual appearance features, extracted mainly from the clothing and shape of people, are intrinsically indistinctive for matching people (e.g. most people in winter wear dark clothes). In addition, a person's appearance often undergoes large variations across non-overlapping camera views due to significant changes in view angle, lighting, background clutter and occlusion (see Fig. 1), resulting in different people appearing more alike than images of the same person across different camera views (see Figs. 6 and 7).

Fig. 1. Typical examples of appearance changes caused by cross-view variations in view angle, lighting, background clutter and occlusion. Each column shows two images of the same person from two different camera views.

Wei-Shi Zheng is now with the School of Information Science and Technology, Sun Yat-sen University, China, and was with the School of Electronic Engineering and Computer Science, Queen Mary University of London, UK (wszheng@ieee.org).
Shaogang Gong and Tao Xiang are with the School of Electronic Engineering and Computer Science, Queen Mary University of London, UK ({sgg,txiang}@eecs.qmul.ac.uk).
¹ The time gap between a person disappearing in one camera view and re-appearing in another.
Given a query image of a person, in order to find the correct
match among a large number of candidate images captured from
different camera views, two steps need to be taken. First, a feature
representation is computed from both the query and each of
the gallery images. Second, the distance between each pair of
potential matches is measured, which is then used to determine
whether a gallery image contains the same person as the query
image. Most existing studies have focused on the first step, that
is, seeking a more distinctive and reliable feature representation
of people’s appearance, ranging widely from colour histogram
[26], [15], graph model [10], spatial co-occurrence representation
model [32], principal axis [17], rectangle region histogram [6],
part-based models [1], [4] to combinations of multiple features
[15], [8]. After feature extraction, these methods simply choose
a standard distance measure such as the l_1-norm [32], an l_2-norm based distance [17], or the Bhattacharyya distance [15]. However, under
severe changes in viewing conditions that can cause significant
appearance variations (e.g. view angle and lighting condition
changes, occlusion), computing a set of features that are both
distinctive and reliable is extremely hard if not implausible.
Moreover, given that certain features could be more reliable than
others under a certain condition, applying a standard distance
measure is undesirable as it essentially treats all features equally
without discarding bad features selectively in each individual
matching circumstance.
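To make this two-step pipeline concrete, the following minimal sketch (illustrative only; extract_features and distance stand in for whatever representation and similarity measure are chosen, and are not from the paper) ranks the gallery images by their distance to the query:

import numpy as np

def rank_gallery(query_img, gallery_imgs, extract_features, distance):
    # Step 1: compute a feature representation for the query and each gallery image.
    q = extract_features(query_img)
    feats = [extract_features(g) for g in gallery_imgs]
    # Step 2: measure the distance of each potential match and rank the gallery.
    dists = np.array([distance(q, f) for f in feats])
    return np.argsort(dists)  # indices of gallery images, best match first

The quality of the final ranking therefore depends both on the features (step one) and on the distance measure (step two).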
In this paper, we focus on the second step of person re-identification. That is, given a set of features extracted from
each person image, we seek to quantify and differentiate these
features by learning the optimal distance measure that is most
likely to give correct matches. This is significantly different from
most existing approaches in that it requires model learning from
a set of training data. In essence, images of each person in a
training set form a class. This learning problem can be framed as
a distance learning problem which always searches for a distance
that minimises intra-class distances while maximising inter-class
distances. However, the person re-identification problem has four
characteristics: (1) The intra-class variation can be large and more
importantly can vary significantly for different classes as it is
caused by large and unpredictable viewing condition changes (see
Fig. 1). (2) The inter-class variation also varies drastically across
different pairs of classes and there are often severe overlappings
between classes in a feature space due to similar appearance
(e.g. clothing) of different people. (3) The training set for learning
the model consists of images of matched people across different
camera views. In order to capture the large intra- and inter-
variations, the number of classes is necessarily large, typically
in the order of hundreds. This represents a large scale learning
problem that challenges existing machine learning algorithms. (4)
Annotating a large number of matched people across camera
views is not only tedious, but also inherently limited in its
usefulness. Typically each annotated class contains only a handful
of images of a person from different camera views, i.e. the
data are inherently under-sampled for building a representative
class distribution. Due to these intrinsic characteristics of the re-
identification problem, especially the problem of large number of
under-sampled classes, a learning model could easily be over-
fitted and/or intractable if it is learned by minimising intra-
class distance and maximising inter-class distance simultaneously
by brute-force, as typically done by existing popular distance
learning techniques.
To alleviate this inherently ill-posed distance learning problem
in person re-identification, we formulate the problem as a relative
distance comparison problem. That is, we perform feature quantification by learning a relative distance comparison model. More
specifically, a novel relative distance comparison (RDC) model is
formulated in order to differentiate the similarity score of a pair of true match (i.e. two images of person A) from that of a pair of related wrong match (i.e. two images of different people A and B respectively) so that the latter is always smaller. In other
words, the model aims to learn an optimal distance in the sense
that for a given query image, the true match is desired to be ranked
higher than the wrong matches among the gallery image set. The
model cares less about how large the absolute distance between the pair of images of the true match is. This differs conceptually from a conventional distance learning approach, which aims to minimise intra-class variation in an absolute sense (i.e. making all images of person A more similar, or closer in a feature space) whilst maximising inter-class variation (i.e. making two images of persons A and B more dissimilar). A conventional approach thus attempts to maximise the margin between two classes, or in the context of person re-identification, enforces a harder discriminant constraint that the true match is not only ranked higher but also has as small a distance to the query image as possible compared
to that of wrong matches. One of the key advantages of our
relative distance comparison based method is that our model is
not easily biased by large variations across many under-sampled
classes, as it aims to seek an optimised individual comparison
between any two data points rather than comparison among data
distribution boundaries or among clusters of data. This alleviates
the over-fitting problem in person re-identification given under-
sampled training data.
Computationally, learning the proposed relative distance com-
parison model can be a non-convex optimisation problem. It is
also a large scale learning problem even given a moderate training
data size. This is because the distance between each pair of images in a training set needs to be compared exhaustively during
model learning and the feature space for person re-identification
is typically of high dimension. To address this problem, a novel
iterative optimisation algorithm is developed in this work for
learning the RDC model. The algorithm is theoretically validated
and its convergence is guaranteed.
Furthermore, in order to alleviate the large space complexity (memory usage cost) and the local optimum learning problem caused by the proposed iterative algorithm for solving a high-order non-linear optimisation criterion, we develop an ensemble RDC
in this work. The aim is to learn a set of weak RDC models each
computed on a small subset of data and then combine them into
a stronger RDC using ensemble learning.
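As a rough illustration of this idea (the actual ensemble construction and weighting are described in Sec. IV and may differ), a set of weak RDC models learned on small data subsets could be combined by simply averaging the distances they assign to a difference vector; the averaging rule below is an assumption made purely for illustration.

import numpy as np

def ensemble_rdc_distance(weak_models, x):
    # weak_models: list of (q, L_h) matrices W_h, each a weak RDC learned on a data subset.
    # x: difference vector between a query/gallery image pair.
    # Simple averaging is an illustrative assumption, not necessarily the paper's weighting.
    return float(np.mean([np.sum((W.T @ x) ** 2) for W in weak_models]))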
Extensive experiments are conducted on three publicly available large person re-identification datasets, including the ETHZ
[7], i-LIDS [37] and VIPeR [14] datasets. The results demonstrate
that (1) by formulating the person re-identification problem as a
relative distance comparison learning problem based on logistic
function modelling, significant improvement in matching accuracy can be obtained against related popular person re-identification
techniques; and (2) our RDC models outperform not only related
distance learning methods but also related learning methods based
on boosting and rank support vector machines (SVMs), both in
terms of matching accuracy and tractability.
II. RELATED WORKS
The problem of matching people across disjoint camera views
has received increasing attention in recent years. Existing works
predominantly focus on the problem of feature extraction and
representation with a bag-of-word representation of colour and
texture features being the most common choice. Table I summarises the features and representations employed by existing
methods reported in the literature. In addition to matching based
on similarity of visual appearance, contextual cues can also be
exploited. A brightness transfer function has been introduced to explicitly compensate for the lighting condition changes between cameras [3], [27], [18]. However, to learn a brightness transfer function one has to not only annotate a set of matched people but also segment each person from the image, which significantly increases the
already large annotation cost. The temporal relationships between
camera views can be exploited for object tagging. By modelling
the transition time between two camera views one can reduce
the number of potential matches while also using the probability
distribution of transition time as a feature [12], [25], [24], [22].
However, transition time information can be unreliable when camera views are significantly disjoint or feature a large number of moving objects. Nevertheless, when it can be obtained reliably, it has been exploited to good effect (see Table I, column 4). Such contextual constraints can also easily be incorporated into the proposed RDC models, either as part of the representation or as a postprocessing step.
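As a purely illustrative sketch of how such temporal context might be used (this is not part of the proposed method), gallery candidates whose observed transition time is implausible under a learned transition-time distribution can simply be discarded before appearance matching; the threshold and function names below are hypothetical.

def filter_by_transition_time(candidates, transition_prob, min_prob=0.01):
    # candidates: list of (gallery_id, observed_transition_time) pairs.
    # transition_prob: callable giving p(transition_time) for the camera pair.
    # min_prob is an illustrative threshold.
    return [(gid, t) for gid, t in candidates if transition_prob(t) >= min_prob]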
Authors | Year | Image Features | Using Temporal Information | Representation
Javed et al. [19] | 2005 | colour | Yes | colour appearance with colour brightness transform
Gilbert et al. [11] | 2006 | colour | Yes | consensus-colour conversion of Munsell colour space with colour transformation matrix
Gheissari et al. [10] | 2006 | colour and shape | Yes | graph partition based representation
Hu et al. [17] | 2007 | geometry | Yes | principal axis with segmentation
Wang et al. [32] | 2007 | colour, gradient, and shape | No | co-occurrence spatial context
Chen et al. [3] & Prosser et al. [27] | 2008 | colour | Yes | colour appearance with temporal colour brightness transform and spatial information
Javed et al. [18] | 2008 | colour | Yes | colour appearance with spatial-temporal colour brightness transform and spatial information
Gray and Tao [15] | 2008 | colour, gradient, filters | No | selected histogram features by AdaBoost
Zheng et al. [37] | 2009 | colour and gradient | No | grouping as dynamic spatial context
Bak et al. [1] & Cheng et al. [4] | 2010/2011 | colour | No | covariance matrix between parts or pictorial structures modelling
Prosser et al. [28] | 2010 | colour, gradient, filters | No | quantified histogram feature by RankSVM
Farenzena et al. [8] | 2010 | colour and structure | No | symmetry-based ensemble of local features with background subtraction
TABLE I
MAIN DEVELOPMENT OF PERSON RE-IDENTIFICATION.
Since not all features are equally reliable and informative for
person re-identification, Gray and Tao [15] propose a boosting
approach based on Adaboost to select a subset of optimal features
for matching people. However, in a boosting framework, good
features are only selected individually and independently in the
original feature space where different classes can be heavily
overlapped. Such selection may not be globally optimal. Rather than selecting features individually and independently (local selection), we instead quantify all features jointly (global selection). Critically, the AdaBoost based feature selection method in [15] could be biased by large variations between the appearances of people, as its modelling shares a similar spirit with a typical discriminant model that tries to maximise the difference between two images of different people. It is thus prone to model over-fitting, as shown in our experiments (see Sec. VI). In contrast, the
proposed RDC model can be seen as a soft discriminant approach.
Our model is thus less susceptible to over-fitting and more tolerant
to intra- and inter-class variations and severe overlapping of
different classes in a multi-dimensional feature space.
Relative distance comparison is a special case of learning to
rank or machine-learned ranking. Ranking techniques such as
RankSVM [16] and RankBoost [9] have been widely used in text
document analysis and information retrieval. In our early work
[28], the primal RankSVM [2] is applied to solve the problem
of global feature quantification for person re-identification. The
primal RankSVM solves the high computational cost problem for
large scale constraint optimisation in a standard RankSVM formulation. Compared to RankSVM and RankBoost, the proposed
new model in this paper is more principled and tractable in three
aspects: (a) RDC is a second-order feature quantification model,
taking into account the joint effect between different features, whereas both RankSVM [2] and RankBoost [9] are first-order models unable to exploit correlations among different features. (b)
RDC utilises a logistic function to provide a soft margin measure
between the difference vectors of different types whilst RankSVM
does not, and such a formulation of our objective function makes
RDC more tolerant to large intra- and inter-class variations and
better suited for coping with data under-sampling; (c) Using a
primal RankSVM, one must determine the weight between the
margin function and the ranking error cost function, which is
computationally costly. In contrast, our RDC model does not
suffer from such a problem, leading to lower computational cost.
More detailed discussion of the differences between RDC and related ranking models is given in Sec. V. Extensive experiments
are presented in Sec. VI-F to validate the advantages of RDC over
RankSVM and RankBoost.
Although it has not previously been exploited for person
re-identification, distance learning in general is a well-studied
problem [35], [13], [36], [34], [15], [29], [33], [20], [5]. The
proposed RDC model is related to several existing distance
learning methods. In particular, our model shares the same spirit
with a number of recent works that exploit the idea of relative distance comparison [29], [33], [20]. However, the relative distance comparison formulations in these works are not quantified using a logistic function as a soft measure, and crucially they are used as an optimisation constraint rather than as an objective function. Therefore, as analysed in more detail in Sec. V, these approaches, either implicitly [29], [20] or explicitly [33], still aim to learn a distance under which each class becomes more compact whilst being more separable from the others in an absolute sense. We
demonstrate through extensive experiments that in practice, they
remain susceptible to model over-fitting and poor tractability for
person re-identification.
In summary, the main contributions of this work are threefold:
1) For the first time, the person re-identification problem
is formulated as a relative distance comparison learning
problem, with strong rationale both conceptually and computationally.
2) We propose a novel logistic function based relative distance
comparison (RDC) model for feature quantification, which
overcomes the limitations of existing distance learning
techniques given under-sampled data with large intra- and
inter-class variations.
3) A novel iterative optimisation algorithm and an ensemble
RDC model are proposed to improve the tractability of
the RDC model and make it more suitable for large scale
learning.
An early version of this work appeared in [38]. In addition
to giving a more detailed description of the RDC model, the
main changes include (1) an ensemble RDC model proposed to
improve the scalability and tractability of the original RDC model,
(2) more in-depth discussion and analysis of its relationship to
alternative learning methods, and (3) more extensive experimental
evaluations including the introduction of a new dataset.
III. QUANTIFYING FEATURES FOR PERSON RE-IDENTIFICATION
A. Proposed Relative Distance Comparison Learning
We formally cast the person re-identification problem into the following distance comparison problem, where we assume each instance of a person is represented by a feature set (e.g. the representation described in Sec. VI-B). For an instance z of person A, we wish to learn a re-identification model to successfully identify another instance z′ of the same person captured elsewhere in space and time. This is achieved by learning a distance function f(·, ·) so that f(z, z′) < f(z, z′′), where z′′ is an instance of any other person except A. To this end, given a training set Z = {(z_i, y_i)}_{i=1}^N, where z_i ∈ R^q is a multi-dimensional feature vector representing the appearance of a person in one view and y_i is its class label (person ID), we define a pairwise set O = {O_i = (x_i^p, x_i^n)}, where each element of the pairwise data O_i is itself computed using a pair of sample feature vectors. More specifically, x_i^p is a difference vector computed between a pair of relevant samples (of the same class/person) and x_i^n is a difference vector from a pair of related irrelevant samples, i.e. only one sample for computing x_i^n is one of the two relevant samples for computing x_i^p and the other is a mis-match from another class (e.g. x_i^p and x_i^n share the same z in the following Eq. (1), while they have different z′). The difference vector x between any two samples z and z′ is computed by

x = d(z, z′),   z, z′ ∈ R^q,    (1)

where d is an entry-wise difference function that outputs a difference vector between z and z′. The specific form of function d will be described in Sec. III-D.
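For concreteness, a minimal sketch of how such a pairwise set O could be assembled from a labelled training set is given below (illustrative code, not the authors' implementation; the exhaustive enumeration of wrong matches and the default choice of d, borrowed from Eq. (16), are assumptions made for illustration):

import numpy as np
from itertools import combinations

def build_pairwise_set(Z, y, d=lambda a, b: np.abs(a - b)):
    # Z: (N, q) array of feature vectors; y: length-N array of person IDs.
    # For every relevant pair (z_i, z_j) of the same person, x^p = d(z_i, z_j);
    # each related x^n re-uses z_i but pairs it with a sample of a different person.
    O = []
    for i, j in combinations(range(len(y)), 2):
        if y[i] != y[j]:
            continue
        xp = d(Z[i], Z[j])                    # relevant (same-person) difference vector
        for k in range(len(y)):
            if y[k] == y[i]:
                continue
            O.append((xp, d(Z[i], Z[k])))     # related irrelevant difference vector
    return O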
Given the pairwise set O, a distance function f takes a difference vector as input and can be learned based on relative distance comparison, so that the distance between a relevant sample pair, f(x_i^p), is smaller than that between the related irrelevant pair, f(x_i^n). In order to differentiate these two types of difference vectors, we propose a logistic function based modelling to describe how a distance between a relevant pair differs from the one between a related but irrelevant pair, as follows:

C_f(x_i^p, x_i^n) = (1 + exp(f(x_i^p) − f(x_i^n)))^{−1}.    (2)
We assume the events of distance comparison between a relevant pair and a related irrelevant pair are independent². Then, we wish to minimise the risk of learning f via all the above relative distance comparisons as follows:

min_f r(f, O),   r(f, O) = −log( ∏_{O_i} C_f(x_i^p, x_i^n) ).    (3)

² Note that we do not assume the data are independent.
The distance function f is parameterised as a Mahalanobis (quadratic) distance function:

f(x) = x^T M x,   M ⪰ 0,    (4)

where M is a positive semidefinite matrix. The distance learning problem thus becomes learning M using Eq. (3). Directly learning M using semidefinite programming techniques is computationally expensive for high dimensional data [33]. In particular, we found in our experiments that given a dimensionality of thousands, typical for visual object representation, a distance learning method based on learning M becomes intractable. To overcome this problem, we perform an eigenvalue decomposition of M:

M = A Λ A^T = W W^T,   W = A Λ^{1/2},    (5)
where the columns of A are orthonormal eigenvectors of M and the leading diagonal of Λ contains the corresponding non-zero eigenvalues. Note that the columns of W form a set of orthogonal vectors. Therefore, learning a function f is equivalent to learning such a matrix W = (w_1, ···, w_ℓ, ···, w_L) such that

min_W r(W, O),   s.t.  w_i^T w_j = 0,  ∀ i ≠ j,
r(W, O) = ∑_{O_i} log(1 + exp(||W^T x_i^p||^2 − ||W^T x_i^n||^2)).    (6)

We call this relative distance comparison learning (RDC) for person re-identification. RDC is based on a logistic function ranging from 0 to 1 in value. This is designed to avoid dramatic changes in the response to different relative distance comparisons.
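To make the criterion concrete, the following short NumPy sketch (illustrative, not the authors' code) evaluates r(W, O) of Eq. (6) for a candidate W, with the difference vectors x_i^p and x_i^n stored as the rows of Xp and Xn:

import numpy as np

def rdc_criterion(W, Xp, Xn):
    # W: (q, L) matrix whose columns are the projections w_1, ..., w_L.
    # Xp, Xn: (N, q) arrays of difference vectors for relevant / related irrelevant pairs.
    dp = np.sum((Xp @ W) ** 2, axis=1)   # ||W^T x_i^p||^2 for every comparison
    dn = np.sum((Xn @ W) ** 2, axis=1)   # ||W^T x_i^n||^2
    return float(np.sum(np.logaddexp(0.0, dp - dn)))   # sum of log(1 + exp(dp - dn))

Minimising this quantity over W, subject to the orthogonality constraint, is what the iterative algorithm in Sec. III-B does one column at a time.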
B. An Iterative Optimisation Algorithm
It is important to point out that our optimisation criterion (6)
may not be a convex optimisation problem against the orthogonal
constraint due to the logistic function based relative comparison
modelling. It means that deriving an global solution by directly
optimising
W is not straightforward. In this work we formulate an
iterative optimisation algorithm to learn an optimal
W, which also
aims to seek a low-rank and non-trivial solution automatically.
This is critical for reducing the model complexity thus alleviating
the overfitting problem given a large number of under-sampled
classes.
Starting from an empty matrix, after iteration ℓ, a new estimated column w_ℓ is added to W. The algorithm terminates after L iterations when a stopping criterion is met. Each iteration consists of two steps as follows:

Step 1. Assume that after ℓ iterations, a total of ℓ orthogonal vectors w_1, ···, w_ℓ have been learned. To learn the next orthogonal vector w_{ℓ+1}, let

a_i^{ℓ+1} = exp{ ∑_{j=0}^{ℓ} ( ||w_j^T x_i^{p,j}||^2 − ||w_j^T x_i^{n,j}||^2 ) },    (7)

where we define w_0 = 0, and x_i^{p,ℓ} and x_i^{n,ℓ} are the difference vectors at the ℓ-th iteration, defined as follows:

x_i^{s,ℓ} = x_i^{s,ℓ−1} − w̃_{ℓ−1} w̃_{ℓ−1}^T x_i^{s,ℓ−1},   s ∈ {p, n},   i = 1, ···, |O|,    (8)

where ℓ ≥ 1 and w̃_{ℓ−1} = w_{ℓ−1}/||w_{ℓ−1}||. Note that we define x_i^{s,0} = x_i^s, s ∈ {p, n}, and w̃_0 = 0.
Step 2. Obtain x_i^{p,ℓ+1} and x_i^{n,ℓ+1} by Eq. (8). Let O^{ℓ+1} = {O_i^{ℓ+1} = (x_i^{p,ℓ+1}, x_i^{n,ℓ+1})}. Then, learn a new optimal projection w_{ℓ+1} on O^{ℓ+1} as follows:

w_{ℓ+1} = arg min_w r_{ℓ+1}(w, O^{ℓ+1}),    (9)

where

r_{ℓ+1}(w, O^{ℓ+1}) = ∑_{O_i^{ℓ+1}} log( 1 + a_i^{ℓ+1} exp( ||w^T x_i^{p,ℓ+1}||^2 − ||w^T x_i^{n,ℓ+1}||^2 ) ).
We seek a solution by a gradient descent method:

w_{ℓ+1} ← w_{ℓ+1} − λ · ∂r_{ℓ+1}/∂w_{ℓ+1},   λ ≥ 0,    (10)

where

∂r_{ℓ+1}/∂w_{ℓ+1} = ∑_{O_i^{ℓ+1}} [ 2 a_i^{ℓ+1} exp( ||w_{ℓ+1}^T x_i^{p,ℓ+1}||^2 − ||w_{ℓ+1}^T x_i^{n,ℓ+1}||^2 ) / ( 1 + a_i^{ℓ+1} exp( ||w_{ℓ+1}^T x_i^{p,ℓ+1}||^2 − ||w_{ℓ+1}^T x_i^{n,ℓ+1}||^2 ) ) ] × ( x_i^{p,ℓ+1} (x_i^{p,ℓ+1})^T − x_i^{n,ℓ+1} (x_i^{n,ℓ+1})^T ) w_{ℓ+1},

and λ is a step length automatically determined at each gradient update step using a strategy similar to that in [23]. According to the descent direction in Eq. (10), the initial value of w_{ℓ+1} for the gradient descent method is set to

w_{ℓ+1} = |O^{ℓ+1}|^{−1} ∑_{O_i^{ℓ+1}} ( x_i^{n,ℓ+1} − x_i^{p,ℓ+1} ).    (11)
Note that the update in Eq. (8) deducts from each sample x_i^{s,ℓ−1} the information affected by w_{ℓ−1}, since w_{ℓ−1}^T x_i^{s,ℓ} = 0, so that the next learned vector w_ℓ will only quantify the part of the data left from the last step, i.e. x_i^{s,ℓ}. In addition, a_i^{ℓ+1} indicates the trend in the change of the distance measures for x_i^p and x_i^n over previous iterations and serves as an a priori weight for learning w_{ℓ+1}.

The iteration of the algorithm (for ℓ > 1) is terminated when the following criterion is met:

r_ℓ(w_ℓ, O^ℓ) − r_{ℓ+1}(w_{ℓ+1}, O^{ℓ+1}) < ε,    (12)

where ε is a small tolerance value, set to 10^{−6} in this work. The algorithm is summarised in Algorithm 1.
Algorithm 1: Learning the RDC model
Data: O = {O_i = (x_i^p, x_i^n)}, ε > 0
begin
    w_0 ← 0, w̃_0 ← 0;
    x_i^{s,0} ← x_i^s, s ∈ {p, n}; O^0 ← O;
    ℓ ← 0;
    while 1 do
        Compute a_i^{ℓ+1} by Eq. (7);
        Compute x_i^{s,ℓ+1}, s ∈ {p, n}, by Eq. (8);
        O^{ℓ+1} ← {O_i^{ℓ+1} = (x_i^{p,ℓ+1}, x_i^{n,ℓ+1})};
        Estimate w_{ℓ+1} using Eq. (9);
        w̃_{ℓ+1} ← w_{ℓ+1} / ||w_{ℓ+1}||;
        if (ℓ > 1) & (r_ℓ(w_ℓ, O^ℓ) − r_{ℓ+1}(w_{ℓ+1}, O^{ℓ+1}) < ε) then
            break;
        end
        ℓ ← ℓ + 1;
    end
end
Output: W = (w_1, ···, w_ℓ)
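For readers who prefer code to pseudo-code, the sketch below mirrors Algorithm 1 in NumPy. It is illustrative only: the step length is fixed rather than determined automatically as in [23], a numerical clipping safeguard is added for stability, and the function name and parameters are our own.

import numpy as np

def learn_rdc(Xp, Xn, lr=1e-3, n_grad_steps=200, eps=1e-6, max_cols=50):
    # Xp, Xn: (N, q) arrays of difference vectors x_i^p and x_i^n.
    N, q = Xp.shape
    W = []                               # learned columns w_1, w_2, ...
    a = np.ones(N)                       # a_i^{l+1} of Eq. (7); exp(0) at the start
    Xp_l, Xn_l = Xp.copy(), Xn.copy()    # deflated difference vectors x_i^{s,l}
    prev_r = None
    for _ in range(max_cols):
        # Step 2: learn w_{l+1} by gradient descent on r_{l+1}, Eqs. (9)-(11)
        w = np.mean(Xn_l - Xp_l, axis=0)                 # initialisation, Eq. (11)
        for _ in range(n_grad_steps):
            diff = np.clip((Xp_l @ w) ** 2 - (Xn_l @ w) ** 2, -50, 50)
            s = a * np.exp(diff)
            coef = 2.0 * s / (1.0 + s)                   # per-pair weight in Eq. (10)
            grad = (coef * (Xp_l @ w)) @ Xp_l - (coef * (Xn_l @ w)) @ Xn_l
            w = w - lr * grad
        diff = np.clip((Xp_l @ w) ** 2 - (Xn_l @ w) ** 2, -50, 50)
        r = float(np.sum(np.log1p(a * np.exp(diff))))    # r_{l+1}(w_{l+1}, O^{l+1})
        if prev_r is not None and prev_r - r < eps:      # stopping criterion, Eq. (12)
            break
        prev_r = r
        W.append(w)
        # Step 1 bookkeeping for the next column: update a_i (Eq. (7)) and deflate (Eq. (8))
        a = a * np.exp(diff)
        w_t = w / np.linalg.norm(w)
        Xp_l = Xp_l - np.outer(Xp_l @ w_t, w_t)
        Xn_l = Xn_l - np.outer(Xn_l @ w_t, w_t)
    return np.column_stack(W) if W else np.zeros((q, 0))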
C. Theoretical Validation
The following two theorems validate the claim that the proposed iterative optimisation algorithm learns a set of orthogonal vectors {w_ℓ} that iteratively decrease the objective function in Criterion (6).

Theorem 1: The learned vectors w_ℓ, ℓ = 1, ···, L, are orthogonal to each other.

Proof: Assume that ℓ − 1 orthogonal vectors {w_j}_{j=1}^{ℓ−1} have been learned. Let w_ℓ be the optimal solution of Criterion (9) at the ℓ-th iteration. First, we know that w_ℓ is in the range space³ of {x_i^{p,ℓ}} ∪ {x_i^{n,ℓ}} according to Eqs. (10) and (11), i.e. w_ℓ ∈ span{x_i^{s,ℓ}, i = 1, ···, |O|, s ∈ {p, n}}. Second, according to Eq. (8), we have

w_j^T x_i^{s,j+1} = 0,   s ∈ {p, n},   j = 1, ···, ℓ − 1,
span{x_i^{s,ℓ}, i = 1, ···, |O|, s ∈ {p, n}} ⊆ span{x_i^{s,ℓ−1}, i = 1, ···, |O|, s ∈ {p, n}} ⊆ ··· ⊆ span{x_i^{s,0}, i = 1, ···, |O|, s ∈ {p, n}}.    (13)

Hence, w_ℓ is orthogonal to w_j, j = 1, ···, ℓ − 1.

³ This can also be shown by using the Lagrangian equation for Eq. (9) for a non-zero w_ℓ.
Theorem 2: r(W_{ℓ+1}, O) ≤ r(W_ℓ, O), where W_ℓ = (w_1, ···, w_ℓ), ℓ ≥ 1. That is, the algorithm iteratively decreases the objective function value.

Proof: Let w_{ℓ+1} be the optimal solution of Eq. (9). By Theorem 1, it is easy to prove that for any j ≥ 1, w_j^T x_i^{s,j} = w_j^T x_i^{s,0} = w_j^T x_i^s, s ∈ {p, n}. Hence we have

r_{ℓ+1}(w_{ℓ+1}, O^{ℓ+1})
= ∑_{O_i^{ℓ+1}} log( 1 + a_i^{ℓ+1} exp( ||w_{ℓ+1}^T x_i^{p,ℓ+1}||^2 − ||w_{ℓ+1}^T x_i^{n,ℓ+1}||^2 ) )
= r(W_{ℓ+1}, O).

Also, r_{ℓ+1}(0, O^{ℓ+1}) = r(W_ℓ, O). Since w_{ℓ+1} is the minimal solution, we have r_{ℓ+1}(w_{ℓ+1}, O^{ℓ+1}) ≤ r_{ℓ+1}(0, O^{ℓ+1}), and therefore r(W_{ℓ+1}, O) ≤ r(W_ℓ, O).
Since Criterion (9) may not be convex, a local optimum could be obtained in each iteration of our algorithm. However, even if the computation were trapped in a local minimum of Eq. (9) at the (ℓ+1)-th iteration, Theorem 2 would still be valid if r_{ℓ+1}(w_{ℓ+1}, O^{ℓ+1}) ≤ r_ℓ(w_ℓ, O^ℓ); otherwise the algorithm will be terminated by the stopping criterion (12). To alleviate the local optimum problem at each iteration, multiple initialisations could be deployed in practice. In this work, we formulate an ensemble algorithm in Sec. IV to alleviate the problem of local optima.
D. Learning in an Absolute Data Difference Space
To compute the data difference vector x defined in Eq. (1), most existing distance learning methods use the following entry-wise difference function

x = d(z, z′) = z − z′    (14)

to learn M = WW^T in the normal data difference space, denoted by DZ = { x_{ij} = z_i − z_j | z_i, z_j ∈ Z }. The learned distance function is thus written as:

f(x_{ij}) = (z_i − z_j)^T M (z_i − z_j) = ||W^T x_{ij}||^2.    (15)

In this work, we compute the difference vector by the following entry-wise absolute difference function:

x = d(z, z′) = |z − z′|,   x(k) = |z(k) − z′(k)|,    (16)

where z(k) is the k-th element of the sample feature vector. M is thus learned in an absolute data difference space, denoted by |DZ| = { |x_{ij}| = |z_i − z_j| | z_i, z_j ∈ Z }, and our distance function, which is a symmetric premetric, becomes:

f(|x_{ij}|) = |z_i − z_j|^T M |z_i − z_j| = ||W^T |x_{ij}| ||^2.    (17)
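The two alternative difference functions and the resulting learned distance can be written compactly as follows (a small illustrative sketch; W stands for the projection matrix learned as in Sec. III-B, and the function names are our own):

import numpy as np

def signed_diff(z1, z2):
    # Entry-wise signed difference of Eq. (14): x = z - z'.
    return z1 - z2

def abs_diff(z1, z2):
    # Entry-wise absolute difference of Eq. (16): x(k) = |z(k) - z'(k)|.
    return np.abs(z1 - z2)

def learned_distance(W, z1, z2, diff=abs_diff):
    # Learned distance of Eq. (17): f(|x|) = || W^T |z - z'| ||^2.
    x = diff(z1, z2)
    return float(np.sum((W.T @ x) ** 2))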
We now explain why learning in an absolute data difference space is more suitable for our relative comparison model. First, we note that

| |z_i(k) − z_j(k)| − |z_i′(k) − z_j′(k)| | ≤ | (z_i(k) − z_j(k)) − (z_i′(k) − z_j′(k)) |,    (18)