Re-identification by Relative Distance Comparison
Wei-Shi Zheng, Member, IEEE, Shaogang Gong, and Tao Xiang
Abstract: Matching people across non-overlapping camera views at different locations and different times, known as person
re-identification, is both a hard and important problem for
associating behaviour of people observed in a large distributed
space over a prolonged period of time. Person re-identification is
fundamentally challenging because of the large visual appearance
changes caused by variations in view angle, lighting, background
clutter and occlusion. To address these challenges, most previous
approaches aim to model and extract distinctive and reliable
visual features. However, seeking an optimal and robust similarity
measure that quantifies a wide range of features against realistic
viewing conditions from a distance is still an open and unsolved
problem for person re-identification. In this paper, we formulate
person re-identification as a relative distance comparison learning
problem in order to learn the optimal similarity measure between
a pair of person images. This approach avoids treating all features
indiscriminately and does not assume the existence of some
universally distinctive and reliable features. To that end, a novel
relative distance comparison (RDC) model is introduced. The
model is formulated to maximise the likelihood of a pair of true
matches having a relatively smaller distance than that of a wrong
match pair in a soft discriminant manner. Moreover, in order to
maintain the tractability of the model in large scale learning, we
further develop an ensemble RDC model. Extensive experiments
on three publicly available benchmarking datasets are carried
out to demonstrate the clear superiority of the proposed RDC
models over related popular person re-identification techniques.
The results also show that the new RDC models are more robust
against visual appearance changes and less susceptible to model
over-fitting compared to other related existing models.
Index Terms: Person re-identification, feature quantification, feature selection, relative distance comparison
I. INTRODUCTION
For understanding behaviour of people in a large area of public
space covered by multiple non-overlapping (disjoint) cameras, it is
critical that when a target disappears from one view, he/she can be
re-identified in another view at a different location among a crowd
of people. Solving this inter-camera people association problem,
known as re-identification, enables tracking of the same person
through different camera views located at different physical sites
[26], [15], [32], [17], [8].
Despite the best efforts from computer vision researchers in
the past five years, the person re-identification problem remains
largely unsolved. This is due to a number of reasons. First, in
a busy uncontrolled environment monitored by cameras from a
distance, person verification relying upon biometrics such as face
and gait is infeasible and unreliable. Second, as the transition time between disjoint cameras¹ varies greatly and with uncertainty from individual to individual, it is hard to impose accurate temporal and spatial constraints. The person re-identification problem is therefore made harder still, as a model can rely mostly on appearance features alone. Third, the visual appearance features, extracted mainly from the clothing and shape of people, are intrinsically indistinctive for matching people (e.g. most people in winter wear dark clothes). In addition, a person's appearance often undergoes large variations across non-overlapping camera views due to significant changes in view angle, lighting, background clutter and occlusion (see Fig. 1), resulting in different people appearing more alike than images of the same person across different camera views (see Figs. 6 and 7).

Fig. 1. Typical examples of appearance changes caused by cross-view variations in view angle, lighting, background clutter and occlusion. Each column shows two images of the same person from two different camera views.

Wei-Shi Zheng is now with the School of Information Science and Technology, Sun Yat-sen University, China, and was with the School of Electronic Engineering and Computer Science, Queen Mary University of London, UK (wszheng@ieee.org).
Shaogang Gong and Tao Xiang are with the School of Electronic Engineering and Computer Science, Queen Mary University of London, UK ({sgg,txiang}@eecs.qmul.ac.uk).
¹ The time gap between a person disappearing in one camera view and re-appearing in another.
Given a query image of a person, in order to find the correct
match among a large number of candidate images captured from
different camera views, two steps need to be taken. First, a feature
representation is computed from both the query and each of
the gallery images. Second, the distance between each pair of
potential matches is measured, which is then used to determine
whether a gallery image contains the same person as the query
image. Most existing studies have focused on the first step, that
is, seeking a more distinctive and reliable feature representation
of people’s appearance, ranging widely from colour histogram
[26], [15], graph model [10], spatial co-occurrence representation
model [32], principal axis [17], rectangle region histogram [6],
part-based models [1], [4] to combinations of multiple features
[15], [8]. After feature extraction, these methods simply choose
a standard distance measure such as the l_1-norm [32], an l_2-norm based distance [17], or the Bhattacharyya distance [15]. However, under
severe changes in viewing conditions that can cause significant
appearance variations (e.g. view angle and lighting condition
changes, occlusion), computing a set of features that are both
distinctive and reliable is extremely hard if not implausible.
Moreover, given that certain features could be more reliable than
others under a certain condition, applying a standard distance
measure is undesirable as it essentially treats all features equally
without discarding bad features selectively in each individual
matching circumstance.
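To make this two-step pipeline concrete, the following minimal sketch (illustrative only; extract_features and distance stand in for whatever representation and similarity measure are chosen, and are not from the paper) ranks the gallery images by their distance to the query:

import numpy as np

def rank_gallery(query_img, gallery_imgs, extract_features, distance):
    # Step 1: compute a feature representation for the query and each gallery image.
    q = extract_features(query_img)
    feats = [extract_features(g) for g in gallery_imgs]
    # Step 2: measure the distance of each potential match and rank the gallery.
    dists = np.array([distance(q, f) for f in feats])
    return np.argsort(dists)  # indices of gallery images, best match first

The quality of the final ranking therefore depends both on the features (step one) and on the distance measure (step two).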
In this paper, we focus on the second step of person re-identification. That is, given a set of features extracted from
each person image, we seek to quantify and differentiate these
features by learning the optimal distance measure that is most
likely to give correct matches. This is significantly different from
most existing approaches in that it requires model learning from
a set of training data. In essence, images of each person in a
training set form a class. This learning problem can be framed as
a distance learning problem which always searches for a distance
that minimises intra-class distances while maximising inter-class
distances. However, the person re-identification problem has four
characteristics: (1) The intra-class variation can be large and more
importantly can vary significantly for different classes as it is
caused by large and unpredictable viewing condition changes (see
Fig. 1). (2) The inter-class variation also varies drastically across
different pairs of classes and there are often severe overlappings
between classes in a feature space due to similar appearance
(e.g. clothing) of different people. (3) The training set for learning
the model consists of images of matched people across different
camera views. In order to capture the large intra- and inter-
variations, the number of classes is necessarily large, typically
in the order of hundreds. This represents a large scale learning
problem that challenges existing machine learning algorithms. (4)
Annotating a large number of matched people across camera
views is not only tedious, but also inherently limited in its
usefulness. Typically each annotated class contains only a handful
of images of a person from different camera views, i.e. the
data are inherently under-sampled for building a representative
class distribution. Due to these intrinsic characteristics of the re-
identification problem, especially the problem of large number of
under-sampled classes, a learning model could easily be over-
fitted and/or intractable if it is learned by minimising intra-
class distance and maximising inter-class distance simultaneously
by brute-force, as typically done by existing popular distance
learning techniques.
To alleviate this inherently ill-posed distance learning problem
in person re-identification, we formulate the problem as a relative
distance comparison problem. That is, we perform feature quantification by learning a relative distance comparison model. More
specifically, a novel relative distance comparison (RDC) model is
formulated in order to differentiate the similarity score of a pair of true match (i.e. two images of person A) from that of a pair of related wrong match (i.e. two images of different people A and B respectively) so that the latter is always smaller. In other
words, the model aims to learn an optimal distance in the sense
that for a given query image, the true match is desired to be ranked
higher than the wrong matches among the gallery image set. The
model cares less about how large the absolute distance between the pair of images of the true match is. This differs conceptually from a conventional distance learning approach, which aims to minimise intra-class variation in an absolute sense (i.e. making all images of person A more similar, or closer in a feature space) whilst maximising inter-class variation (i.e. making two images of persons A and B more dissimilar). A conventional approach thus attempts to maximise the margin between two classes, or in the context of person re-identification, enforces a harder discriminant constraint that the true match is not only ranked higher but also has as small a distance to the query image as possible compared
to that of wrong matches. One of the key advantages of our
relative distance comparison based method is that our model is
not easily biased by large variations across many under-sampled
classes, as it aims to seek an optimised individual comparison
between any two data points rather than comparison among data
distribution boundaries or among clusters of data. This alleviates
the over-fitting problem in person re-identification given under-
sampled training data.
Computationally, learning the proposed relative distance com-
parison model can be a non-convex optimisation problem. It is
also a large scale learning problem even given a moderate training
data size. This is because the distance between each pair of images in a training set needs to be compared exhaustively during
model learning and the feature space for person re-identification
is typically of high dimension. To address this problem, a novel
iterative optimisation algorithm is developed in this work for
learning the RDC model. The algorithm is theoretically validated
and its convergence is guaranteed.
Furthermore, in order to alleviate the large space complexity (memory usage cost) and the local optimum learning problem caused by the proposed iterative algorithm for solving a high-order non-linear optimisation criterion, we develop an ensemble RDC
in this work. The aim is to learn a set of weak RDC models each
computed on a small subset of data and then combine them into
a stronger RDC using ensemble learning.
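As a rough illustration of this idea (the actual ensemble construction and weighting are described in Sec. IV and may differ), a set of weak RDC models learned on small data subsets could be combined by simply averaging the distances they assign to a difference vector; the averaging rule below is an assumption made purely for illustration.

import numpy as np

def ensemble_rdc_distance(weak_models, x):
    # weak_models: list of (q, L_h) matrices W_h, each a weak RDC learned on a data subset.
    # x: difference vector between a query/gallery image pair.
    # Simple averaging is an illustrative assumption, not necessarily the paper's weighting.
    return float(np.mean([np.sum((W.T @ x) ** 2) for W in weak_models]))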
Extensive experiments are conducted on three publicly available large person re-identification datasets, including the ETHZ
[7], i-LIDS [37] and VIPeR [14] datasets. The results demonstrate
that (1) by formulating the person re-identification problem as a
relative distance comparison learning problem based on logistic
function modelling, significant improvement in matching accuracy can be obtained against related popular person re-identification
techniques; and (2) our RDC models outperform not only related
distance learning methods but also related learning methods based
on boosting and rank support vector machines (SVMs), both in
terms of matching accuracy and tractability.
II. RELATED WORKS
The problem of matching people across disjoint camera views
has received increasing attention in recent years. Existing works
predominantly focus on the problem of feature extraction and
representation with a bag-of-word representation of colour and
texture features being the most common choice. Table I summarises the features and representations employed by existing
methods reported in the literature. In addition to matching based
on similarity of visual appearance, contextual cues can also be
exploited. A brightness transfer function has been introduced to explicitly compensate for the lighting condition changes between cameras [3], [27], [18]. However, to learn a brightness transfer function one has to not only annotate a set of matched people but also segment each person from the image, which significantly increases the
already large annotation cost. The temporal relationships between
camera views can be exploited for object tagging. By modelling
the transition time between two camera views one can reduce
the number of potential matches while also using the probability
distribution of transition time as a feature [12], [25], [24], [22].
However, transition time information can be unreliable when camera views are significantly disjoint or feature a large number of moving objects. Nevertheless, when it can be obtained reliably, it has been exploited to good effect (see Table I, column 4). Such contextual constraints can also easily be incorporated into the proposed RDC models, either as part of the representation or as a postprocessing step.
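As a purely illustrative sketch of how such temporal context might be used (this is not part of the proposed method), gallery candidates whose observed transition time is implausible under a learned transition-time distribution can simply be discarded before appearance matching; the threshold and function names below are hypothetical.

def filter_by_transition_time(candidates, transition_prob, min_prob=0.01):
    # candidates: list of (gallery_id, observed_transition_time) pairs.
    # transition_prob: callable giving p(transition_time) for the camera pair.
    # min_prob is an illustrative threshold.
    return [(gid, t) for gid, t in candidates if transition_prob(t) >= min_prob]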
Authors | Year | Image Features | Using Temporal Information | Representation
Javed et al. [19] | 2005 | colour | Yes | colour appearance with colour brightness transform
Gilbert et al. [11] | 2006 | colour | Yes | consensus-colour conversion of Munsell colour space with colour transformation matrix
Gheissari et al. [10] | 2006 | colour and shape | Yes | graph partition based representation
Hu et al. [17] | 2007 | geometry | Yes | principal axis with segmentation
Wang et al. [32] | 2007 | colour, gradient, and shape | No | co-occurrence spatial context
Chen et al. [3] & Prosser et al. [27] | 2008 | colour | Yes | colour appearance with temporal colour brightness transform and spatial information
Javed et al. [18] | 2008 | colour | Yes | colour appearance with spatial-temporal colour brightness transform and spatial information
Gray and Tao [15] | 2008 | colour, gradient, filters | No | selected histogram features by AdaBoost
Zheng et al. [37] | 2009 | colour and gradient | No | grouping as dynamic spatial context
Bak et al. [1] & Cheng et al. [4] | 2010/2011 | colour | No | covariance matrix between parts or pictorial structures modelling
Prosser et al. [28] | 2010 | colour, gradient, filters | No | quantified histogram feature by RankSVM
Farenzena et al. [8] | 2010 | colour and structure | No | symmetry-based ensemble of local features with background subtraction
TABLE I
MAIN DEVELOPMENT OF PERSON RE-IDENTIFICATION.
Since not all features are equally reliable and informative for
person re-identification, Gray and Tao [15] propose a boosting
approach based on Adaboost to select a subset of optimal features
for matching people. However, in a boosting framework, good
features are only selected individually and independently in the
original feature space where different classes can be heavily
overlapped. Such selection may not be globally optimal. Rather than selecting features individually and independently (local selection), we instead quantify all features jointly (global selection). Critically, the AdaBoost based feature selection method in [15] could be biased by large variations between the appearances of people, as its modelling shares a similar spirit with a typical discriminant model that tries to maximise the difference between two images of different people. It is thus prone to model over-fitting, as shown in our experiments (see Sec. VI). In contrast, the
proposed RDC model can be seen as a soft discriminant approach.
Our model is thus less susceptible to over-fitting and more tolerant
to intra- and inter-class variations and severe overlapping of
different classes in a multi-dimensional feature space.
Relative distance comparison is a special case of learning to
rank or machine-learned ranking. Ranking techniques such as
RankSVM [16] and RankBoost [9] have been widely used in text
document analysis and information retrieval. In our early work
[28], the primal RankSVM [2] is applied to solve the problem
of global feature quantification for person re-identification. The
primal RankSVM solves the high computational cost problem for
large scale constraint optimisation in a standard RankSVM formulation. Compared to RankSVM and RankBoost, the proposed
new model in this paper is more principled and tractable in three
aspects: (a) RDC is a second-order feature quantification model,
taking into account the joint effect between different features, whereas both RankSVM [2] and RankBoost [9] are first-order models unable to exploit correlations among different features. (b)
RDC utilises a logistic function to provide a soft margin measure
between the difference vectors of different types whilst RankSVM
does not, and such a formulation of our objective function makes
RDC more tolerant to large intra- and inter-class variations and
better suited for coping with data under-sampling; (c) Using a
primal RankSVM, one must determine the weight between the
margin function and the ranking error cost function, which is
computationally costly. In contrast, our RDC model does not
suffer from such a problem, leading to lower computational cost.
More detailed discussion of the differences between RDC and related ranking models is given in Sec. V. Extensive experiments
are presented in Sec. VI-F to validate the advantages of RDC over
RankSVM and RankBoost.
Although it has not previously been exploited for person
re-identification, distance learning in general is a well-studied
problem [35], [13], [36], [34], [15], [29], [33], [20], [5]. The
proposed RDC model is related to several existing distance
learning methods. In particular, our model shares the same spirit
with a number of recent works that exploit the idea of relative distance comparison [29], [33], [20]. However, the relative distance comparison formulations in these works are not quantified using a logistic function as a soft measure, and crucially they are used as an optimisation constraint rather than as an objective function. Therefore, as analysed in more detail in Sec. V, these approaches, either implicitly [29], [20] or explicitly [33], still aim to learn a distance under which each class becomes more compact whilst being more separable from the others in an absolute sense. We
demonstrate through extensive experiments that in practice, they
remain susceptible to model over-fitting and poor tractability for
person re-identification.
In summary, the main contributions of this work are threefold:
1) For the first time, the person re-identification problem
is formulated as a relative distance comparison learning
problem, with strong rationale both conceptually and computationally.
2) We propose a novel logistic function based relative distance
comparison (RDC) model for feature quantification, which
overcomes the limitations of existing distance learning
techniques given under-sampled data with large intra- and
inter-class variations.
3) A novel iterative optimisation algorithm and an ensemble
RDC model are proposed to improve the tractability of
the RDC model and make it more suitable for large scale
learning.
An early version of this work appeared in [38]. In addition
to giving a more detailed description of the RDC model, the
main changes include (1) an ensemble RDC model proposed to
improve the scalability and tractability of the original RDC model,
(2) more in-depth discussion and analysis of its relationship to
alternative learning methods, and (3) more extensive experimental
evaluations including the introduction of a new dataset.
III. QUANTIFYING FEATURES FOR PERSON RE-IDENTIFICATION
A. Proposed Relative Distance Comparison Learning
We formally cast the person re-identification problem into the following distance comparison problem, where we assume each instance of a person is represented by a feature set (e.g. the representation described in Sec. VI-B). For an instance z of person A, we wish to learn a re-identification model to successfully identify another instance z′ of the same person captured elsewhere in space and time. This is achieved by learning a distance function f(·, ·) so that f(z, z′) < f(z, z′′), where z′′ is an instance of any other person except A. To this end, given a training set Z = {(z_i, y_i)}_{i=1}^N, where z_i ∈ R^q is a multi-dimensional feature vector representing the appearance of a person in one view and y_i is its class label (person ID), we define a pairwise set O = {O_i = (x_i^p, x_i^n)}, where each element of the pairwise data O_i is itself computed using a pair of sample feature vectors. More specifically, x_i^p is a difference vector computed between a pair of relevant samples (of the same class/person) and x_i^n is a difference vector from a pair of related irrelevant samples, i.e. only one sample for computing x_i^n is one of the two relevant samples for computing x_i^p and the other is a mis-match from another class (e.g. x_i^p and x_i^n share the same z in the following Eq. (1), while they have different z′). The difference vector x between any two samples z and z′ is computed by

x = d(z, z′),   z, z′ ∈ R^q,    (1)

where d is an entry-wise difference function that outputs a difference vector between z and z′. The specific form of function d will be described in Sec. III-D.
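For concreteness, a minimal sketch of how such a pairwise set O could be assembled from a labelled training set is given below (illustrative code, not the authors' implementation; the exhaustive enumeration of wrong matches and the default choice of d, borrowed from Eq. (16), are assumptions made for illustration):

import numpy as np
from itertools import combinations

def build_pairwise_set(Z, y, d=lambda a, b: np.abs(a - b)):
    # Z: (N, q) array of feature vectors; y: length-N array of person IDs.
    # For every relevant pair (z_i, z_j) of the same person, x^p = d(z_i, z_j);
    # each related x^n re-uses z_i but pairs it with a sample of a different person.
    O = []
    for i, j in combinations(range(len(y)), 2):
        if y[i] != y[j]:
            continue
        xp = d(Z[i], Z[j])                    # relevant (same-person) difference vector
        for k in range(len(y)):
            if y[k] == y[i]:
                continue
            O.append((xp, d(Z[i], Z[k])))     # related irrelevant difference vector
    return O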
Given the pairwise set O, a distance function f takes a difference vector as input and can be learned based on relative distance comparison, so that the distance between a relevant sample pair, f(x_i^p), is smaller than that between the related irrelevant pair, f(x_i^n). In order to differentiate these two types of difference vectors, we propose a logistic function based modelling to describe how a distance between a relevant pair differs from the one between a related but irrelevant pair, as follows:

C_f(x_i^p, x_i^n) = (1 + exp(f(x_i^p) − f(x_i^n)))^{−1}.    (2)
We assume the events of distance comparison between a relevant pair and a related irrelevant pair are independent². Then, we wish to minimise the risk of learning f via all the above relative distance comparisons as follows:

min_f r(f, O),   r(f, O) = −log( ∏_{O_i} C_f(x_i^p, x_i^n) ).    (3)

² Note that we do not assume the data are independent.
The distance function f is parameterised as a Mahalanobis (quadratic) distance function:

f(x) = x^T M x,   M ⪰ 0,    (4)

where M is a positive semidefinite matrix. The distance learning problem thus becomes learning M using Eq. (3). Directly learning M using semidefinite programming techniques is computationally expensive for high dimensional data [33]. In particular, we found in our experiments that given a dimensionality of thousands, typical for visual object representation, a distance learning method based on learning M becomes intractable. To overcome this problem, we perform an eigenvalue decomposition of M:

M = A Λ A^T = W W^T,   W = A Λ^{1/2},    (5)
where the columns of A are orthonormal eigenvectors of M and the leading diagonal of Λ contains the corresponding non-zero eigenvalues. Note that the columns of W form a set of orthogonal vectors. Therefore, learning a function f is equivalent to learning such a matrix W = (w_1, ···, w_ℓ, ···, w_L) such that

min_W r(W, O),   s.t.  w_i^T w_j = 0,  ∀ i ≠ j,
r(W, O) = ∑_{O_i} log(1 + exp(||W^T x_i^p||^2 − ||W^T x_i^n||^2)).    (6)

We call this relative distance comparison learning (RDC) for person re-identification. RDC is based on a logistic function ranging from 0 to 1 in value. This is designed to avoid dramatic changes in the response to different relative distance comparisons.
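To make the criterion concrete, the following short NumPy sketch (illustrative, not the authors' code) evaluates r(W, O) of Eq. (6) for a candidate W, with the difference vectors x_i^p and x_i^n stored as the rows of Xp and Xn:

import numpy as np

def rdc_criterion(W, Xp, Xn):
    # W: (q, L) matrix whose columns are the projections w_1, ..., w_L.
    # Xp, Xn: (N, q) arrays of difference vectors for relevant / related irrelevant pairs.
    dp = np.sum((Xp @ W) ** 2, axis=1)   # ||W^T x_i^p||^2 for every comparison
    dn = np.sum((Xn @ W) ** 2, axis=1)   # ||W^T x_i^n||^2
    return float(np.sum(np.logaddexp(0.0, dp - dn)))   # sum of log(1 + exp(dp - dn))

Minimising this quantity over W, subject to the orthogonality constraint, is what the iterative algorithm in Sec. III-B does one column at a time.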
B. An Iterative Optimisation Algorithm
It is important to point out that our optimisation criterion (6)
may not be a convex optimisation problem against the orthogonal
constraint due to the logistic function based relative comparison
modelling. It means that deriving an global solution by directly
optimising
W is not straightforward. In this work we formulate an
iterative optimisation algorithm to learn an optimal
W, which also
aims to seek a low-rank and non-trivial solution automatically.
This is critical for reducing the model complexity thus alleviating
the overfitting problem given a large number of under-sampled
classes.
Starting from an empty matrix, after iteration ℓ, a new estimated column w_ℓ is added to W. The algorithm terminates after L iterations when a stopping criterion is met. Each iteration consists of two steps as follows:

Step 1. Assume that after ℓ iterations, a total of ℓ orthogonal vectors w_1, ···, w_ℓ have been learned. To learn the next orthogonal vector w_{ℓ+1}, let

a_i^{ℓ+1} = exp{ ∑_{j=0}^{ℓ} ( ||w_j^T x_i^{p,j}||^2 − ||w_j^T x_i^{n,j}||^2 ) },    (7)

where we define w_0 = 0, and x_i^{p,ℓ} and x_i^{n,ℓ} are the difference vectors at the ℓ-th iteration, defined as follows:

x_i^{s,ℓ} = x_i^{s,ℓ−1} − w̃_{ℓ−1} w̃_{ℓ−1}^T x_i^{s,ℓ−1},   s ∈ {p, n},   i = 1, ···, |O|,    (8)

where ℓ ≥ 1 and w̃_{ℓ−1} = w_{ℓ−1}/||w_{ℓ−1}||. Note that we define x_i^{s,0} = x_i^s, s ∈ {p, n}, and w̃_0 = 0.
Step 2. Obtain x_i^{p,ℓ+1} and x_i^{n,ℓ+1} by Eq. (8). Let O^{ℓ+1} = {O_i^{ℓ+1} = (x_i^{p,ℓ+1}, x_i^{n,ℓ+1})}. Then, learn a new optimal projection w_{ℓ+1} on O^{ℓ+1} as follows:

w_{ℓ+1} = arg min_w r_{ℓ+1}(w, O^{ℓ+1}),    (9)

where

r_{ℓ+1}(w, O^{ℓ+1}) = ∑_{O_i^{ℓ+1}} log( 1 + a_i^{ℓ+1} exp( ||w^T x_i^{p,ℓ+1}||^2 − ||w^T x_i^{n,ℓ+1}||^2 ) ).
We seek a solution by a gradient descent method:

w_{ℓ+1} ← w_{ℓ+1} − λ · ∂r_{ℓ+1}/∂w_{ℓ+1},   λ ≥ 0,    (10)

where

∂r_{ℓ+1}/∂w_{ℓ+1} = ∑_{O_i^{ℓ+1}} [ 2 a_i^{ℓ+1} exp( ||w_{ℓ+1}^T x_i^{p,ℓ+1}||^2 − ||w_{ℓ+1}^T x_i^{n,ℓ+1}||^2 ) / ( 1 + a_i^{ℓ+1} exp( ||w_{ℓ+1}^T x_i^{p,ℓ+1}||^2 − ||w_{ℓ+1}^T x_i^{n,ℓ+1}||^2 ) ) ] × ( x_i^{p,ℓ+1} (x_i^{p,ℓ+1})^T − x_i^{n,ℓ+1} (x_i^{n,ℓ+1})^T ) w_{ℓ+1},

and λ is a step length automatically determined at each gradient update step using a strategy similar to that in [23]. According to the descent direction in Eq. (10), the initial value of w_{ℓ+1} for the gradient descent method is set to

w_{ℓ+1} = |O^{ℓ+1}|^{−1} ∑_{O_i^{ℓ+1}} ( x_i^{n,ℓ+1} − x_i^{p,ℓ+1} ).    (11)
Note that the update in Eq. (8) deducts from each sample x_i^{s,ℓ−1} the information affected by w_{ℓ−1}, since w_{ℓ−1}^T x_i^{s,ℓ} = 0, so that the next learned vector w_ℓ will only quantify the part of the data left from the last step, i.e. x_i^{s,ℓ}. In addition, a_i^{ℓ+1} indicates the trend in the change of the distance measures for x_i^p and x_i^n over previous iterations and serves as an a priori weight for learning w_{ℓ+1}.

The iteration of the algorithm (for ℓ > 1) is terminated when the following criterion is met:

r_ℓ(w_ℓ, O^ℓ) − r_{ℓ+1}(w_{ℓ+1}, O^{ℓ+1}) < ε,    (12)

where ε is a small tolerance value, set to 10^{−6} in this work. The algorithm is summarised in Algorithm 1.
Algorithm 1: Learning the RDC model
Data: O = {O_i = (x_i^p, x_i^n)}, ε > 0
begin
    w_0 ← 0, w̃_0 ← 0;
    x_i^{s,0} ← x_i^s, s ∈ {p, n}; O^0 ← O;
    ℓ ← 0;
    while 1 do
        Compute a_i^{ℓ+1} by Eq. (7);
        Compute x_i^{s,ℓ+1}, s ∈ {p, n}, by Eq. (8);
        O^{ℓ+1} ← {O_i^{ℓ+1} = (x_i^{p,ℓ+1}, x_i^{n,ℓ+1})};
        Estimate w_{ℓ+1} using Eq. (9);
        w̃_{ℓ+1} ← w_{ℓ+1} / ||w_{ℓ+1}||;
        if (ℓ > 1) & (r_ℓ(w_ℓ, O^ℓ) − r_{ℓ+1}(w_{ℓ+1}, O^{ℓ+1}) < ε) then
            break;
        end
        ℓ ← ℓ + 1;
    end
end
Output: W = (w_1, ···, w_ℓ)
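For readers who prefer code to pseudo-code, the sketch below mirrors Algorithm 1 in NumPy. It is illustrative only: the step length is fixed rather than determined automatically as in [23], a numerical clipping safeguard is added for stability, and the function name and parameters are our own.

import numpy as np

def learn_rdc(Xp, Xn, lr=1e-3, n_grad_steps=200, eps=1e-6, max_cols=50):
    # Xp, Xn: (N, q) arrays of difference vectors x_i^p and x_i^n.
    N, q = Xp.shape
    W = []                               # learned columns w_1, w_2, ...
    a = np.ones(N)                       # a_i^{l+1} of Eq. (7); exp(0) at the start
    Xp_l, Xn_l = Xp.copy(), Xn.copy()    # deflated difference vectors x_i^{s,l}
    prev_r = None
    for _ in range(max_cols):
        # Step 2: learn w_{l+1} by gradient descent on r_{l+1}, Eqs. (9)-(11)
        w = np.mean(Xn_l - Xp_l, axis=0)                 # initialisation, Eq. (11)
        for _ in range(n_grad_steps):
            diff = np.clip((Xp_l @ w) ** 2 - (Xn_l @ w) ** 2, -50, 50)
            s = a * np.exp(diff)
            coef = 2.0 * s / (1.0 + s)                   # per-pair weight in Eq. (10)
            grad = (coef * (Xp_l @ w)) @ Xp_l - (coef * (Xn_l @ w)) @ Xn_l
            w = w - lr * grad
        diff = np.clip((Xp_l @ w) ** 2 - (Xn_l @ w) ** 2, -50, 50)
        r = float(np.sum(np.log1p(a * np.exp(diff))))    # r_{l+1}(w_{l+1}, O^{l+1})
        if prev_r is not None and prev_r - r < eps:      # stopping criterion, Eq. (12)
            break
        prev_r = r
        W.append(w)
        # Step 1 bookkeeping for the next column: update a_i (Eq. (7)) and deflate (Eq. (8))
        a = a * np.exp(diff)
        w_t = w / np.linalg.norm(w)
        Xp_l = Xp_l - np.outer(Xp_l @ w_t, w_t)
        Xn_l = Xn_l - np.outer(Xn_l @ w_t, w_t)
    return np.column_stack(W) if W else np.zeros((q, 0))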
C. Theoretical Validation
The following two theorems validate the claim that the proposed iterative optimisation algorithm learns a set of orthogonal vectors {w_ℓ} that iteratively decrease the objective function in Criterion (6).

Theorem 1: The learned vectors w_ℓ, ℓ = 1, ···, L, are orthogonal to each other.

Proof: Assume that ℓ − 1 orthogonal vectors {w_j}_{j=1}^{ℓ−1} have been learned. Let w_ℓ be the optimal solution of Criterion (9) at the ℓ-th iteration. First, we know that w_ℓ is in the range space³ of {x_i^{p,ℓ}} ∪ {x_i^{n,ℓ}} according to Eqs. (10) and (11), i.e. w_ℓ ∈ span{x_i^{s,ℓ}, i = 1, ···, |O|, s ∈ {p, n}}. Second, according to Eq. (8), we have

w_j^T x_i^{s,j+1} = 0,   s ∈ {p, n},   j = 1, ···, ℓ − 1,
span{x_i^{s,ℓ}, i = 1, ···, |O|, s ∈ {p, n}} ⊆ span{x_i^{s,ℓ−1}, i = 1, ···, |O|, s ∈ {p, n}} ⊆ ··· ⊆ span{x_i^{s,0}, i = 1, ···, |O|, s ∈ {p, n}}.    (13)

Hence, w_ℓ is orthogonal to w_j, j = 1, ···, ℓ − 1.

³ This can also be shown by using the Lagrangian equation for Eq. (9) for a non-zero w_ℓ.
Theorem 2: r(W_{ℓ+1}, O) ≤ r(W_ℓ, O), where W_ℓ = (w_1, ···, w_ℓ), ℓ ≥ 1. That is, the algorithm iteratively decreases the objective function value.

Proof: Let w_{ℓ+1} be the optimal solution of Eq. (9). By Theorem 1, it is easy to prove that for any j ≥ 1, w_j^T x_i^{s,j} = w_j^T x_i^{s,0} = w_j^T x_i^s, s ∈ {p, n}. Hence we have

r_{ℓ+1}(w_{ℓ+1}, O^{ℓ+1})
= ∑_{O_i^{ℓ+1}} log( 1 + a_i^{ℓ+1} exp( ||w_{ℓ+1}^T x_i^{p,ℓ+1}||^2 − ||w_{ℓ+1}^T x_i^{n,ℓ+1}||^2 ) )
= r(W_{ℓ+1}, O).

Also, r_{ℓ+1}(0, O^{ℓ+1}) = r(W_ℓ, O). Since w_{ℓ+1} is the minimal solution, we have r_{ℓ+1}(w_{ℓ+1}, O^{ℓ+1}) ≤ r_{ℓ+1}(0, O^{ℓ+1}), and therefore r(W_{ℓ+1}, O) ≤ r(W_ℓ, O).
Since Criterion (9) may not be convex, a local optimum could be obtained in each iteration of our algorithm. However, even if the computation were trapped in a local minimum of Eq. (9) at the (ℓ+1)-th iteration, Theorem 2 would still be valid if r_{ℓ+1}(w_{ℓ+1}, O^{ℓ+1}) ≤ r_ℓ(w_ℓ, O^ℓ); otherwise the algorithm will be terminated by the stopping criterion (12). To alleviate the local optimum problem at each iteration, multiple initialisations could be deployed in practice. In this work, we formulate an ensemble algorithm in Sec. IV to alleviate the problem of local optima.
D. Learning in an Absolute Data Difference Space
To compute the data difference vector x defined in Eq. (1), most existing distance learning methods use the following entry-wise difference function

x = d(z, z′) = z − z′    (14)

to learn M = WW^T in the normal data difference space, denoted by DZ = { x_{ij} = z_i − z_j | z_i, z_j ∈ Z }. The learned distance function is thus written as:

f(x_{ij}) = (z_i − z_j)^T M (z_i − z_j) = ||W^T x_{ij}||^2.    (15)

In this work, we compute the difference vector by the following entry-wise absolute difference function:

x = d(z, z′) = |z − z′|,   x(k) = |z(k) − z′(k)|,    (16)

where z(k) is the k-th element of the sample feature vector. M is thus learned in an absolute data difference space, denoted by |DZ| = { |x_{ij}| = |z_i − z_j| | z_i, z_j ∈ Z }, and our distance function, which is a symmetric premetric, becomes:

f(|x_{ij}|) = |z_i − z_j|^T M |z_i − z_j| = ||W^T |x_{ij}| ||^2.    (17)
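The two alternative difference functions and the resulting learned distance can be written compactly as follows (a small illustrative sketch; W stands for the projection matrix learned as in Sec. III-B, and the function names are our own):

import numpy as np

def signed_diff(z1, z2):
    # Entry-wise signed difference of Eq. (14): x = z - z'.
    return z1 - z2

def abs_diff(z1, z2):
    # Entry-wise absolute difference of Eq. (16): x(k) = |z(k) - z'(k)|.
    return np.abs(z1 - z2)

def learned_distance(W, z1, z2, diff=abs_diff):
    # Learned distance of Eq. (17): f(|x|) = || W^T |z - z'| ||^2.
    x = diff(z1, z2)
    return float(np.sum((W.T @ x) ** 2))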
We now explain why learning in an absolute data difference space is more suitable for our relative comparison model. First, we note that

| |z_i(k) − z_j(k)| − |z_i′(k) − z_j′(k)| | ≤ | (z_i(k) − z_j(k)) − (z_i′(k) − z_j′(k)) |,    (18)