
RGB-Infrared Cross-Modality Person Re-identification

TL;DR: The experiments show that RGB-IR cross-modality matching is very challenging but still feasible using the proposed model with deep zero-padding, giving the best performance.
Abstract: Person re-identification (Re-ID) is an important problem in video surveillance, aiming to match pedestrian images across camera views. Currently, most works focus on RGB-based Re-ID. However, in some applications, RGB images are not suitable, e.g. in a dark environment or at night. Infrared (IR) imaging becomes necessary in many visual systems. To that end, matching RGB images with infrared images is required, which are heterogeneous with very different visual characteristics. For person Re-ID, this is a very challenging cross-modality problem that has not been studied so far. In this work, we address the RGB-IR cross-modality Re-ID problem and contribute a new multiple modality Re-ID dataset named SYSU-MM01, including RGB and IR images of 491 identities from 6 cameras, giving in total 287,628 RGB images and 15,792 IR images. To explore the RGB-IR Re-ID problem, we evaluate existing popular cross-domain models, including three commonly used neural network structures (one-stream, two-stream and asymmetric FC layer) and analyse the relation between them. We further propose deep zero-padding for training one-stream network towards automatically evolving domain-specific nodes in the network for cross-modality matching. Our experiments show that RGB-IR cross-modality matching is very challenging but still feasible using the proposed model with deep zero-padding, giving the best performance. Our dataset is available at http://isee.sysu.edu.cn/project/RGBIRReID.htm.

Summary (3 min read)

1. Introduction

  • Secondly, from imaging principle aspect, the wavelength range of RGB and IR images is different.
  • In existing Re-ID works, colour information is the most important appearance cue for identifying persons.
  • The authors first identify the challenge of RGB-IR Re-ID by conducting extensive evaluations on popularly used cross-modality methods.
  • Considering using neural networks for cross-modality matching, the authors investigate and analyse the relation between different neural network structures, including two-stream structure and asymmetric FC layer structure, in which the domain-specific modelling exists but is designed manually.

2.1. Dataset Description

  • SYSU-MM01 contains images captured by 6 cameras, including two IR cameras and four RGB ones.
  • For each person, there are at least 400 continuous RGB frames with different poses and viewpoints.
  • The IR images have only one channel, and they are different from 3-channel RGB images.
  • Camera 4 and 5 are RGB surveillance cameras placed in two outdoor scenes named gate and garden.
  • These all introduce difficulties for the RGB-IR cross-modality Re-ID problem.

2.2. Evaluation Protocol

  • The authors have a fixed split using 296 identities for training, 99 for validation and 96 for testing.
  • Given a probe image, matching is conducted by computing similarities between the probe image and gallery images.
  • After computing similarities, the authors obtain a ranking list by sorting gallery images in descending order of similarity.
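In code, this ranking step amounts to sorting gallery similarities in descending order. A minimal illustrative sketch (not the authors' code), assuming cosine similarity on extracted features:

```python
import numpy as np

def rank_gallery(probe_feat, gallery_feats):
    """Return gallery indices sorted by descending similarity to the probe."""
    # Cosine similarity between the probe and every gallery image.
    sims = gallery_feats @ probe_feat / (
        np.linalg.norm(gallery_feats, axis=1) * np.linalg.norm(probe_feat) + 1e-12)
    return np.argsort(-sims)  # best match first
```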

3. Network Structure Comparison on Cross-Modality Modelling

  • The authors investigate deep learning network architectures for the task of RGB-IR cross-modality Re-ID.
  • In particular, the authors examine three commonly adopted network structures for visual recognition and cross-modality learning.
  • The authors further exploit the idea of deep zero-padding for model training and give insights on its impact on cross-modality matching task.

3.1. Common Deep Model Network Structures

  • In the past few years, a large number of deep models have been proposed for visual matching and cross-modality modelling, and have achieved satisfactory performance in many tasks.
  • Generally, in these tasks, the inputs to the network are RGB images, which are of the same modality.
  • In the deeper layers, shared parameters are used.
  • The generalized similarity net [26] proposed by Lin et al. for cross-domain visual matching, including the Re-ID task, is one of the representative structures of this type.
  • Compared to the one-stream structure, the two-stream structure achieves two things: domain adaptation and discriminative feature learning.

3.2. Analysis of Network Structures

  • Although the three structures discussed above seem different, the authors interestingly find that all of them can be represented by a one-stream structure in the forward propagation process when the following assumption holds (Assumption 1).
  • On the right is a one-stream network which can be conditionally equivalent to the two-stream one in forward propagation, in which a domain selection sub-network selects the following domain-specific structure.
  • The assumption above is less feasible in practice.
  • Using the above-defined categorization, without loss of generality, $x^{(l)}$ can be factorized into three parts $x^{(l)} = [x^{(l),1spe}; x^{(l),2spe}; x^{(l),s}]$, in which the three components denote the domain1-specific, domain2-specific and shared nodes, respectively.
  • In contrast, if one-stream structure can implicitly learn the structure, the implicit structures corresponding to different domains are partially coupled by shared nodes and shared bias parameters (Equations (4) and (5)), which can provide more flexibility in training for cross-modality matching tasks.

4.1. Analysis of Zero-Padding as Network Input

  • In most cases, a one-stream network is applied to single-domain tasks and treats all samples equally, so that domain-specific nodes may generally not be learned.
  • It would be easier for the neural network to spread the domain-specific nodes in deeper layers.
  • The authors' empirical results on neural network learning support this.
  • As shown in Figure 7 and Figure 8, deep zero-padding helps the network learn domain-specific nodes more easily than training without zero-padding.
  • The details will be illustrated later in Section 4.2.

4.3. Comparison of Cross-Modality Learning

  • While the cross-modality matching task has not drawn much attention in the Re-ID problem, it has been studied extensively in other fields such as information retrieval and face verification.
  • Cross-modality retrieval (e.g. text-image, tag-image) plays an important role in information retrieval.
  • Matching visible face images against near-infrared ones (VIS-NIR) [17, 58, 10] is closely related to RGB-IR cross-modality Re-ID.
  • The remaining useful cue may be body shape, which differs greatly across viewpoints and poses.
  • In comparison, their zero-padding is done at the raw image level, and the domain-specific and shared learning are done by a deep neural network.

5. Experiments

  • The authors conducted extensive evaluations of existing Re-ID and cross-domain matching models as baselines on their SYSU-MM01 dataset.
  • Then, the authors evaluated and analysed the effectiveness of deep models, including the proposed deep zero-padding and three network structures discussed in Section 3.
  • See Section 2.2 for detailed evaluation protocol.

5.1. Compared Models

  • The authors evaluated three favorable handcrafted features and cross-domain metric learning models as baselines.
  • The authors evaluated four deep models shown in Figure 3, including the one-stream network, the two-stream network, the asymmetric FC layer network and the proposed deep zero-padding method (whose network structure is the same as the one-stream network).
  • All of the hyperparameters were kept the same.

5.2. Model Comparisons and Analysis

  • The authors show comparative results in Table 3, including the rank-1, 10, 20 accuracies of CMC [32] and mean average precision (mAP).
  • There were clear gaps among their performances.
  • In Table 3 the authors can see that the deep zero-padding outperformed two-stream network and asymmetric FC layer structure.
  • The authors used the code released by the original authors in the experiments.
  • These models proved inferior when dealing with the much more challenging RGB-IR cross-modality Re-ID problem.

6. Summary

  • To their best knowledge, this work is the first to identify the RGB-IR cross-modality Re-ID problem and introduce a new multi-modality Re-ID dataset named SYSU-MM01.
  • The great difference between RGB and IR images makes RGB-IR cross-modality Re-ID a very challenging problem.
  • The authors have discussed and evaluated three common network structures for cross-domain tasks including one-stream structure, two-stream structure and asymmetric FC layer structure.
  • The authors have analysed the connection between one-stream and two-stream structure and found that one-stream network can learn and evolve domain-specific structure implicitly if there exist domain-specific and shared nodes.
  • The experiments have shown that the one-stream network trained by deep zero-padding achieved the best performance.


RGB-Infrared Cross-Modality Person Re-Identification

Ancong Wu¹, Wei-Shi Zheng²,⁵,⁶, Hong-Xing Yu², Shaogang Gong⁴, and Jianhuang Lai²,³

¹ School of Electronics and Information Technology, Sun Yat-sen University, China
² School of Data and Computer Science, Sun Yat-sen University, China
³ Guangdong Province Key Laboratory of Information Security, China
⁴ Queen Mary University of London, United Kingdom
⁵ Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
⁶ Collaborative Innovation Center of High Performance Computing, NUDT, China

wuancong@mail2.sysu.edu.cn, wszheng@ieee.org, xKoven@gmail.com, s.gong@qmul.ac.uk, stsljh@mail.sysu.edu.cn
Abstract

Person re-identification (Re-ID) is an important problem in video surveillance, aiming to match pedestrian images across camera views. Currently, most works focus on RGB-based Re-ID. However, in some applications, RGB images are not suitable, e.g. in a dark environment or at night. Infrared (IR) imaging becomes necessary in many visual systems. To that end, matching RGB images with infrared images is required, which are heterogeneous with very different visual characteristics. For person Re-ID, this is a very challenging cross-modality problem that has not been studied so far. In this work, we address the RGB-IR cross-modality Re-ID problem and contribute a new multiple modality Re-ID dataset named SYSU-MM01, including RGB and IR images of 491 identities from 6 cameras, giving in total 287,628 RGB images and 15,792 IR images. To explore the RGB-IR Re-ID problem, we evaluate existing popular cross-domain models, including three commonly used neural network structures (one-stream, two-stream and asymmetric FC layer) and analyse the relation between them. We further propose deep zero-padding for training one-stream network towards automatically evolving domain-specific nodes in the network for cross-modality matching. Our experiments show that RGB-IR cross-modality matching is very challenging but still feasible using the proposed model with deep zero-padding, giving the best performance. Our dataset is available at http://isee.sysu.edu.cn/project/RGBIRReID.htm.
1. Introduction

Person re-identification (Re-ID) is an important field in video surveillance. A large number of models for the Re-ID problem have been proposed, including feature learning [29, 48, 23], distance metric learning [55, 15, 22, 28, 23, 24, 49, 57, 21, 44, 56] and end-to-end learning [20, 1, 47, 46]. Most Re-ID methods are based on RGB-RGB matching, the most common single-modality Re-ID problem.

Figure 1. Examples of RGB images and infrared (IR) images captured in two outdoor scenes in the day time and in the night, respectively. The images in every two columns are of the same person. Captured by devices receiving light of different wavelengths, RGB images and IR images of the same person look very different.

However, RGB-RGB Re-ID can be limited in surveillance when lighting is either poor or unavailable. For instance, RGB images become uninformative at night (Figure 1). In such a case, imaging devices that do not rely on visible light should be applied. Infrared (IR) cameras are commonly used in video surveillance systems. While depth images captured by RGB-D cameras such as Kinect are also independent of visible light, they are rarely deployed because they are more expensive, used indoors only, and limited in range. Since most surveillance cameras are able to automatically switch from RGB to IR mode in the dark, it is necessary to study RGB-IR cross-modality matching in 24-hour surveillance systems.

In this work, we introduce the RGB-IR cross-modality Re-ID problem. Although RGB-IR Re-ID is common and significant in real-world applications, to our best knowledge, it has been rarely explored and remains an open issue. RGB-IR Re-ID is a very challenging problem due to the great differences between the two modalities. Firstly, RGB and IR images are intrinsically distinct. As shown in Figure 1, RGB images in the first row have three channels containing colour information of visible light, while IR images in the second row have one channel containing information of invisible light. Thus, they can be regarded as heterogeneous data. Secondly, from the imaging principle aspect, the wavelength range of RGB and IR images is different. In existing Re-ID works, colour information is the most important appearance cue for identifying persons. However, in the RGB-IR Re-ID problem, this cue can hardly be used. As shown in Figure 1, even humans can hardly recognise the persons by colour information. This leads to severe data misalignment within the same class. Moreover, viewpoint change, pose and exposure problems, which cause large intra-class discrepancy in RGB-based Re-ID, also bring difficulties to RGB-IR cross-modality Re-ID, resulting in a much more challenging problem. Although there exist a few Re-ID methods using IR images, such as Jungling et al. [13], they only consider IR-IR video matching for Re-ID and do not consider the cross-modality RGB-IR Re-ID problem.

We first identify the challenge of RGB-IR Re-ID by conducting extensive evaluations on popularly used cross-modality methods. For this purpose, we have collected a new dataset called the SYSU Multiple Modality Re-ID (SYSU-MM01) dataset. The comparison with existing commonly used Re-ID datasets is shown in Table 1. It contains 287,628 RGB images and 15,792 IR images of 491 persons captured by 6 cameras. To our best knowledge, this new RGB-IR Re-ID dataset provides for the first time a meaningful benchmark for the study of cross-modality RGB-IR Re-ID.

Table 1. Comparison between SYSU-MM01 and existing Re-ID datasets. (-/- denotes RGB#/IR#.)

| Datasets | ID# | images# | cameras# | RGB | IR |
|---|---|---|---|---|---|
| VIPER [7] | 632 | 1,264 | 2 | yes | no |
| iLIDS [54] | 119 | 476 | 2 | yes | no |
| CAVIAR [5] | 72 | 610 | 2 | yes | no |
| PRID2011 [11] | 200 | 971 | 2 | yes | no |
| CUHK01 [19] | 972 | 1,942 | 2 | yes | no |
| SYSU [8] | 502 | 24,448 | 2 | yes | no |
| CUHK03 [20] | 1467 | 13,164 | 6 | yes | no |
| Market [53] | 1501 | 32,668 | 6 | yes | no |
| MARS [52] | 1261 | 1,191,003 | 6 | yes | no |
| SYSU-MM01 | 491 | 287,628/15,792 | 6 | yes | yes |

For cross-modality matching tasks, domain-specific modelling is important for extracting shared features for matching because of the domain shift. Considering using neural networks for cross-modality matching, we investigate and analyse the relation between different neural network structures, including the two-stream structure and the asymmetric FC layer structure, in which the domain-specific modelling exists but is designed manually. Alternatively, we propose a deep zero-padding method for training a one-stream network tending towards evolving domain-specific structures automatically. Extensive experiments show the effectiveness of deep zero-padding, which outperforms the compared hand-crafted features and deep models.

The contributions of this paper are: (1) We contribute for the first time a standard benchmark SYSU-MM01 for supporting the study of RGB-IR cross-modality Re-ID. We conducted extensive experiments to evaluate popular baseline deep learning architectures for cross-modality RGB-IR Re-ID. (2) We analyse three different network structures (one-stream structure, two-stream structure and asymmetric FC layer structure) and give insights on their effectiveness for RGB-IR Re-ID. (3) We propose deep zero-padding for evolving domain-specific structure automatically in a one-stream network optimised for RGB-IR Re-ID tasks. Our experiments show that this approach for RGB-IR cross-modality Re-ID outperforms not only a standard one-stream network but also a two-stream network with explicit cross-domain learning and extra computational costs.

Figure 2. Examples of RGB images and infrared (IR) images in our SYSU-MM01 dataset. Cameras 1-3 on the left are indoor scenes and cameras 4-6 on the right are outdoor scenes. Every two columns are of the same person.
2. SYSU-MM01 Dataset

2.1. Dataset Description

SYSU-MM01 contains images captured by 6 cameras, including two IR cameras and four RGB ones. Different from RGB cameras, IR cameras work in dark scenarios. We show the details in Table 2 and some samples from each camera view in Figure 2. RGB images of camera 1 and camera 2 were captured in two bright indoor rooms (room 1 and room 2) by Kinect V1. For each person, there are at least 400 continuous RGB frames with different poses and viewpoints. IR images of camera 3 and camera 6 were captured by IR cameras in the dark. The IR images have only one channel, and they are different from 3-channel RGB images. Camera 3 is placed in room 2 in a dark environment, while camera 6 is placed in an outdoor passage with background clutter. Cameras 4 and 5 are RGB surveillance cameras placed in two outdoor scenes named gate and garden.

Table 2. Overview of the SYSU-MM01 dataset.

| Cam | Location | (In/Out)door | Lighting | ID# | RGB#/ID | IR#/ID |
|---|---|---|---|---|---|---|
| 1 | room1 | indoor | bright | 259 | 400+ | - |
| 2 | room2 | indoor | bright | 259 | 400+ | - |
| 3 | room2 | indoor | dark | 486 | - | 20 |
| 4 | gate | outdoor | bright | 493 | 20 | - |
| 5 | garden | outdoor | bright | 502 | 20 | - |
| 6 | passage | outdoor | dark | 299 | - | 20 |

Observing the samples of the dataset, we can clearly see that the images of the IR cameras (cameras 3 and 6) are distinct from the RGB images, in terms of both colour and exposure. Specifically, although cameras 2 and 3 are in the same scenario, their images suffer from dramatic colour shift and exposure difference. For example, the first person's yellow clothes are distinct from her black trousers under the RGB camera, but this colour distinction is nearly eliminated under the IR camera (Columns 1-2, Rows 2-3 in Figure 2). Moreover, IR images have only one channel and might lose some texture details. The exposure of IR images captured at different distances is also an issue. These all introduce difficulties for the RGB-IR cross-modality Re-ID problem.
2.2. Evaluation Protocol

There are 491 valid IDs in the SYSU-MM01 dataset. We have a fixed split using 296 identities for training, 99 for validation and 96 for testing. During training, all images of the 296 persons in the training set across all cameras can be used. In the testing stage, samples from RGB cameras form the gallery set, and those from IR cameras form the probe set.

We design two modes, all-search mode and indoor-search mode. For all-search mode, RGB cameras 1, 2, 4 and 5 are for the gallery set and IR cameras 3 and 6 are for the probe set. For indoor-search mode, RGB cameras 1 and 2 (excluding outdoor cameras 4 and 5) are for the gallery set and IR cameras 3 and 6 are for the probe set, which is less challenging.

For both modes, we adopt single-shot and multi-shot settings. For every identity under an RGB camera, we randomly choose one/ten image(s) of the identity to form the gallery set for the single-shot/multi-shot setting. As for the probe set, all images are used. Given a probe image, matching is conducted by computing similarities between the probe image and gallery images. Notice that matching is conducted between cameras in different locations (locations are shown in Table 2). Camera 2 and camera 3 are in the same location, so probe images of camera 3 skip the gallery images of camera 2. After computing similarities, we obtain a ranking list by sorting similarities in descending order.

To indicate the performance, we use the Cumulative Matching Characteristic (CMC) [32] and mean average precision (mAP). Notice that, for CMC under the multi-shot setting, only the maximum similarity over all gallery images of the same person is taken to compute the rank list. We repeat the above evaluation 10 times with random splits of gallery and probe sets and finally compute the average performance.
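The metrics above can be made concrete with a short sketch. Here is an illustrative single-shot implementation of CMC and mAP (not the authors' released evaluation code); `invalid[i]` is a boolean mask over the gallery marking excluded same-location pairs (e.g. probe camera 3 against gallery camera 2), and the multi-shot CMC variant would additionally keep only the maximum similarity per gallery identity:

```python
import numpy as np

def cmc_map(sim, p_ids, g_ids, invalid=None, ranks=(1, 10, 20)):
    """CMC rank-k accuracies and mAP from a (num_probe, num_gallery) similarity matrix."""
    num_probe = sim.shape[0]
    g_ids = np.asarray(g_ids)
    cmc = np.zeros(max(ranks))
    aps = []
    for i in range(num_probe):
        keep = np.ones(len(g_ids), dtype=bool)
        if invalid is not None:
            keep &= ~invalid[i]                    # skip same-location gallery images
        order = np.argsort(-sim[i][keep])          # descending similarity
        matches = g_ids[keep][order] == p_ids[i]   # correct-identity flags in rank order
        hits = np.flatnonzero(matches)
        if hits.size == 0:
            continue                               # no valid ground truth for this probe
        cmc[hits[0]:] += 1                         # rank-k hit for every k >= first hit
        prec = np.cumsum(matches) / (np.arange(matches.size) + 1)
        aps.append((prec * matches).sum() / matches.sum())  # average precision
    return {f"rank-{k}": cmc[k - 1] / num_probe for k in ranks}, float(np.mean(aps))
```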
3. Network Structure Comparison on Cross-Modality Modelling

We investigate deep learning network architectures for the task of RGB-IR cross-modality Re-ID. In particular, we examine three commonly adopted network structures for visual recognition and cross-modality learning. We further exploit the idea of deep zero-padding for model training and give insights on its impact on the cross-modality matching task.

Figure 3. Four network structures in our evaluation: the one-stream structure, the two-stream structure, the asymmetric FC layer structure, and the one-stream structure with deep zero-padding augmentation (proposed). The structure of the conv blocks depends on the base network (ResNet [9] in our evaluation). The colour of conv blocks and FC layers indicates whether the parameters are shared or not: red and blue indicate specific parameters and green indicates shared parameters.

3.1. Common Deep Model Network Structures

In the past few years, a large number of deep models have been proposed for visual matching and cross-modality modelling, and have achieved satisfactory performance in many tasks. The most commonly used structures can mainly be categorized into 3 types. All structures that we are going to discuss are shown in Figure 3.

One-stream Structure. The one-stream structure is the most commonly used in vision tasks. As shown in the first network in Figure 3, there is a single input and all parameters are shared in the whole network. Representative networks include AlexNet [16], VGG [38], GoogLeNet [40], ResNet [9] and so on, which perform well in classification, detection, tracking and many other tasks. In the field of Re-ID, JSTL-DGD [47], one of the state-of-the-art networks, uses the one-stream structure as well. Generally, in these tasks, the inputs to the network are RGB images, which are of the same modality, so sharing all parameters in the network is appropriate.

Two-stream Structure. The two-stream structure is commonly used in cross-modality matching tasks. As shown in the second network in Figure 3, there are two inputs, corresponding to data in two different domains. In the shallower layers, the parameters of the network are specific to each domain; in the deeper layers, shared parameters are used. The generalized similarity net [26] proposed by Lin et al. for cross-domain visual matching, including the Re-ID task, is one of the representative structures of this type. Networks with two inputs similar to the two-stream structure are also favorable in Re-ID tasks, for example, Ahmed's net [1], SIR-CIR net [42], gated siamese net [41], etc. Note that except for Lin's structure [26], most of them prefer sharing parameters in the domain-specific layers; this is not exactly identical to our definition of the two-stream structure. The reason may be that, although the images are from different cameras, they are all of the same RGB modality. Compared to the one-stream structure, the two-stream structure achieves two things: domain adaptation and discriminative feature learning. It is assumed that the domain-specific network can extract shared features for the different domains, and then the shared network can extract discriminative features for matching.

Figure 4. Explanation of how a one-stream network can represent a two-stream network under Assumption 1, with a domain indicator and a domain selection sub-network in forward propagation.

Asymmetric FC Layer Structure. The asymmetric FC layer model is also used in multi-domain tasks, for example, MDNet [33] for multi-domain tracking, CVDCA [2] for Re-ID and IDR [10] for VIS-NIR face recognition. As shown in the third network in Figure 3, this structure shares nearly all parameters except for the last FC layer. The design assumes that feature extraction for different domains can be the same and that domain adaptation is achieved at the feature level. This order of feature extraction and domain adaptation is different from the two-stream structure.
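To make the parameter-sharing patterns of the three structures concrete, below is a minimal PyTorch sketch (our illustration, not the authors' implementation). `conv_block` is a stand-in for the ResNet conv blocks used in the paper, and replicating the 1-channel IR image to 3 channels for the fully shared stems is an assumption of this sketch:

```python
import torch.nn as nn

def conv_block(cin, cout):
    # Stand-in for a ResNet conv block (details depend on the base network).
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class OneStream(nn.Module):
    """Single input, all parameters shared (first network in Figure 3)."""
    def __init__(self, n_id):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                                 conv_block(64, 128), nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten(), nn.Linear(128, n_id))
    def forward(self, x):                  # mixed RGB/IR batch, IR replicated to 3ch
        return self.net(x)

class TwoStream(nn.Module):
    """Domain-specific shallow layers, shared deep layers (second network)."""
    def __init__(self, n_id):
        super().__init__()
        self.spec_rgb = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.spec_ir = nn.Sequential(conv_block(1, 32), conv_block(32, 64))
        self.shared = nn.Sequential(conv_block(64, 128), nn.AdaptiveAvgPool2d(1),
                                    nn.Flatten(), nn.Linear(128, n_id))
    def forward(self, x, domain):          # domain: "rgb" or "ir"
        h = self.spec_rgb(x) if domain == "rgb" else self.spec_ir(x)
        return self.shared(h)

class AsymmetricFC(nn.Module):
    """All layers shared except the last, domain-specific FC (third network)."""
    def __init__(self, n_id):
        super().__init__()
        self.shared = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                                    conv_block(64, 128), nn.AdaptiveAvgPool2d(1),
                                    nn.Flatten())
        self.fc_rgb, self.fc_ir = nn.Linear(128, n_id), nn.Linear(128, n_id)
    def forward(self, x, domain):
        h = self.shared(x)
        return self.fc_rgb(h) if domain == "rgb" else self.fc_ir(h)
```

All variants in Figure 3 are trained with the softmax loss; the proposed deep zero-padding variant reuses the one-stream structure on zero-padded inputs (Section 4).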
3.2. Analysis of Network Structures

Connection of One-stream and Two-stream Structures in a Special Case. The three structures discussed above seem to be different, but we interestingly find that all of them can be represented by a one-stream structure in the forward propagation process when the following assumption holds:

Assumption 1. A domain selection sub-network exists somewhere in the network, which can automatically select samples of the corresponding domain as input, and the domain selection sub-network is fixed.

Under Assumption 1, we first give a simple example of how a one-stream network can perform as a two-stream network in forward propagation. As shown in Figure 4, on the left is a simplified two-stream network: two fully connected networks, each with a specific layer (blue and red) and a shared layer (green). On the right is a one-stream network which can be conditionally equivalent to the two-stream one in forward propagation, in which a domain selection sub-network selects the following domain-specific structure. We first define some symbols for illustration. Let $x_{d1} \in \mathbb{R}^d$ and $x_{d2} \in \mathbb{R}^d$ denote the inputs of domain1 and domain2, respectively. We define a domain indicator $y_{ind}$ as a vector with two elements, whose value is $[1, 0]^T$ or $[0, 1]^T$, indicating domain1 or domain2, respectively. Let $f_{sel}(x, y_{ind})$ denote the domain selection sub-network, implementing the following function:

$$f_{sel}(x, y_{ind}) = \begin{cases} [I_d, O_d]^T x, & y_{ind} = [1, 0]^T \\ [O_d, I_d]^T x, & y_{ind} = [0, 1]^T \end{cases} \tag{1}$$

The equation above suggests that if the domain selection sub-network is fixed, the two-stream network can be represented by a one-stream network in forward propagation.
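A small numeric check of Equation (1) (illustrative only): the fixed selection operator routes an input into the first or second half of a $2d$-dimensional vector, so a single weight matrix in the next layer behaves like two decoupled domain-specific layers.

```python
import numpy as np

d = 4
I, O = np.eye(d), np.zeros((d, d))

def f_sel(x, y_ind):
    # Equation (1): embed x into the first half (domain1) or second half (domain2).
    M = np.vstack([I, O]) if y_ind == (1, 0) else np.vstack([O, I])
    return M @ x  # shape (2d,)

x_d1 = np.random.randn(d)
z = f_sel(x_d1, (1, 0))

# A one-stream weight matrix W of shape (k, 2d) uses only its first d columns
# for domain1 inputs and only its last d columns for domain2 inputs, i.e. it
# acts as two decoupled domain-specific layers.
W = np.random.randn(3, 2 * d)
assert np.allclose(W @ z, W[:, :d] @ x_d1)
```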
Analysis of One-stream Structure in the General Case. The assumption we made above is less feasible in practice. Now, we drop this assumption and analyse the domain-specific property of a one-stream network. For cross-modality matching tasks, domain-specific modelling is important for extracting shared components for matching because of domain shift. Generally, in neural networks, e.g., the two-stream and asymmetric FC layer structures, this is modelled by domain-specific structures. Thus we intend to analyse the domain-specific modelling in a one-stream network. Our analysis is based on the following relaxed assumption:

Assumption 2. As shown in Figure 5, for a one-stream network dealing with inputs of two domains, we categorize the output nodes of each layer into three types: domain1-specific nodes, domain2-specific nodes and shared nodes. The categorization depends on whether the response of the node is domain-specific. Let $x^{(l)}_{d1}$ and $x^{(l)}_{d2}$ denote the inputs to layer $l+1$ of domain1 and domain2, respectively. For example, $x^{(0)}_{d1}$ and $x^{(0)}_{d2}$ are inputs of the whole network. Let $\eta^{(l)}_i$ denote the $i$-th node in layer $l$ and $f_{out}(x^{(0)}, i, l)$ denote the output of $\eta^{(l)}_i$ with the network input $x^{(0)}$; we have:

$$f_{out}(x^{(0)}, i, l) = \sigma\Big(\sum_j w^{(l-1)}_{j,i} f_{out}(x^{(0)}, j, l-1) + b^{(l-1)}_i\Big), \tag{2}$$

where $\sigma(\cdot)$ is the activation function, and $w^{(l-1)}_{j,i}$ and $b^{(l-1)}_i$ are weight and bias parameters of layer $l-1$. The type of node $\eta^{(l)}_i$ is defined by

$$type(\eta^{(l)}_i) = \begin{cases} \text{domain1-specific}, & f_{out}(x^{(0)}_{d2}, i, l) \equiv 0 \\ \text{domain2-specific}, & f_{out}(x^{(0)}_{d1}, i, l) \equiv 0 \\ \text{shared}, & \text{otherwise}. \end{cases} \tag{3}$$

For domain1-specific nodes, we use the identity sign in $f_{out}(x^{(0)}_{d2}, i, l) \equiv 0$, which means that for any input of domain2, the output of node $\eta^{(l)}_i$ is always zero.
Under Assumption 2, we define some symbols for analysis. Let $L$ denote the loss function. Let $o^{(l+1)}_i$ denote the output of the $i$-th node before the activation function in layer $l+1$, $x^{(l)}$ denote the input to layer $l+1$, and $w^{(l)}_i$ and $b^{(l)}_i$ denote the weight and bias parameters, i.e., $o^{(l+1)}_i = (w^{(l)}_i)^T x^{(l)} + b^{(l)}_i$. Using the above-defined categorization, without loss of generality, $x^{(l)}$ can be factorized into three parts¹ $x^{(l)} = [x^{(l),1spe}; x^{(l),2spe}; x^{(l),s}]$, in which the three components denote the domain1-specific, domain2-specific and shared nodes, respectively. We can also denote $w^{(l)}_i$ as $w^{(l)}_i = [w^{(l),1spe}_i; w^{(l),2spe}_i; w^{(l),s}_i]$.

(¹ ";" means concatenation of vectors.)

For an input of the network $x^{(0)}_{d1}$ in domain1, according to the categorization definition, $x^{(l),2spe}_{d1} = 0$, because for the output of each domain2-specific node, $f_{out}(x^{(0)}_{d1}, i, l) \equiv 0$. In the forward propagation process, the output of layer $l+1$ is

$$o^{(l+1)}_i = (w^{(l),1spe}_i)^T x^{(l),1spe}_{d1} + (w^{(l),s}_i)^T x^{(l),s}_{d1} + b^{(l)}_i. \tag{4}$$

For an input of the network $x^{(0)}_{d2}$ in domain2, similarly, we have

$$o^{(l+1)}_i = (w^{(l),2spe}_i)^T x^{(l),2spe}_{d2} + (w^{(l),s}_i)^T x^{(l),s}_{d2} + b^{(l)}_i. \tag{5}$$

In the back propagation process, for an input of the network $x^{(0)}_{d1}$ in domain1,

$$\frac{\partial L}{\partial w^{(l),1spe}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\,\frac{\partial o^{(l+1)}_i}{\partial w^{(l),1spe}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\, x^{(l),1spe}_{d1}, \tag{6}$$

$$\frac{\partial L}{\partial w^{(l),s}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\,\frac{\partial o^{(l+1)}_i}{\partial w^{(l),s}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\, x^{(l),s}_{d1}, \tag{7}$$

$$\frac{\partial L}{\partial w^{(l),2spe}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\,\frac{\partial o^{(l+1)}_i}{\partial w^{(l),2spe}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\, x^{(l),2spe}_{d1} = 0. \tag{8}$$

From the analysis above, we have two conclusions. (1) In forward propagation, as shown in Figure 5, the weight parameters $w^{(l),1spe}_i$ (blue connections) and $w^{(l),2spe}_i$ (red connections) only have impact on inputs of the corresponding domain, similar to the domain-specific parameters in two-stream networks, while $w^{(l),s}_i$ (green connections) has impact on both domains, similar to the shared parameters in two-stream networks. Thus, the network can implicitly control the domain-specific structure by domain-specific nodes and the shared structure by shared nodes. (2) In backward propagation, if a node is domain2-specific, then with an input in domain1 its corresponding weight parameters will not be updated, because the gradient is zero. That means the training samples of the other domain do not influence the implicit domain-specific structure. Note that for an input $x^{(0)}_{d2}$, the same conclusion can be drawn in a similar way.

Remark 1. A one-stream network may implicitly learn and evolve the domain-specific and shared structures in the network if the three types of nodes defined by Equation (3) are assumed to exist in the network.
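The zero gradient in Equation (8) can be verified directly with automatic differentiation. A toy sketch for a single linear node (assumed setup: the input is factorized as above, with its domain2-specific part equal to zero for a domain1 sample):

```python
import torch

# Factorized input x = [x_1spe; x_2spe; x_s] for a domain1 sample: x_2spe = 0.
x = torch.tensor([0.7, 0.0, 1.3])  # [domain1-specific, domain2-specific, shared]
w = torch.tensor([0.5, -0.2, 0.9], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

o = w @ x + b                      # Equation (4): o = w^T x + b
loss = (o - 1.0) ** 2              # any loss on the node output
loss.backward()

# dL/dw_i = (dL/do) * x_i, so the domain2-specific weight receives zero gradient.
print(w.grad)                      # middle entry is exactly 0, matching Eq. (8)
```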
Figure 5. Explanation of the deep zero-padding method. In each layer, the blue nodes denote the domain1-specific nodes, the red nodes denote the domain2-specific nodes, the green nodes denote the shared nodes, and the dotted-line nodes denote zero values. (The diagram shows the zero-padded inputs: the domain1 input followed by a zero vector, and a zero vector followed by the domain2 input.)
Remark 2. Consider the two-stream structure and the asymmetric FC layer structure: they are designed manually and fixed during training. Moreover, their domain-specific structures for the two domains are decoupled, while the shared structure is completely identical. In contrast, if a one-stream structure can implicitly learn the structure, the implicit structures corresponding to different domains are partially coupled by shared nodes and shared bias parameters (Equations (4) and (5)), which can provide more flexibility in training for cross-modality matching tasks.
4. Deep Zero-Padding

4.1. Analysis of Zero-Padding as Network Input

The node types we defined in the last section (Equation (3)) are idealized, since they rest on the assumption that $f_{out}(x^{(0)}_{d1}, i, l) \equiv 0$ or $f_{out}(x^{(0)}_{d2}, i, l) \equiv 0$, and how to make the network learn such nodes with the domain-specific property in the training stage remains an important problem. In most cases, a one-stream network is applied to single-domain tasks and treats all samples equally, so that domain-specific nodes may generally not be learned.

As analysed in the previous sections, the structures of the two-stream network and the asymmetric FC layer network are designed manually and fixed during training, while a one-stream network can evolve its structure implicitly by learning domain-specific nodes, which may generate a more optimal structure. For this purpose, we propose to use zero-padded input to stimulate domain-specific responses. As shown in Figure 5, for inputs from the two domains $x_{d1} \in \mathbb{R}^d$ and $x_{d2} \in \mathbb{R}^d$, we apply zero-padding as follows:

$$x^{pad}_{d1} = [x^T_{d1}, O_{1\times d}]^T, \qquad x^{pad}_{d2} = [O_{1\times d}, x^T_{d2}]^T. \tag{9}$$

If we regard the network input as a prior layer (the 0-th layer), then all the nodes in this prior layer are definitely categorized as domain-specific nodes according to our definition in Equation (3). Now, what is the case for the nodes in the next layer? It is hard to tell mathematically, but we find that with zero-padding as network input, the nodes in the network are more likely to become domain-specific nodes.
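Below is a sketch of Equation (9) applied at the raw-image level, as proposed; reducing the RGB image to a single channel so that both padded inputs share the same two-channel shape is an assumption of this illustration:

```python
import numpy as np

def zero_pad_inputs(rgb, ir):
    """Build two-channel zero-padded inputs per Equation (9).

    rgb: (H, W, 3) visible image; ir: (H, W) infrared image.
    Returns two arrays of shape (2, H, W).
    """
    gray = rgb.mean(axis=2)                               # assumed 1-channel reduction
    zeros = np.zeros_like(gray)
    x_pad_d1 = np.stack([gray, zeros])                    # [x_d1; 0] -> channel 1 active
    x_pad_d2 = np.stack([zeros, ir.astype(gray.dtype)])   # [0; x_d2] -> channel 2 active
    return x_pad_d1, x_pad_d2
```

Fed with such inputs, every input node is domain-specific by construction, which is intended to encourage domain-specific nodes to emerge in the subsequent layers.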

Citations
Posted Content
TL;DR: A powerful AGW baseline is designed, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks, and a new evaluation metric (mINP) is introduced, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re- ID system for real applications.
Abstract: Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and increasing demand of intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With the performance saturation under closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for FOUR different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re-ID system for real applications. Finally, some important yet under-investigated open issues are discussed.

737 citations


Cites background from "RGB-Infrared Cross-Modality Person ..."

  • ...[196], both the query and gallery sets may contain different modalities (visible, thermal [20], depth [53] or text description [9])....


  • ...RegDB [51] SYSU-MM01 [20] Visible-Thermal All Search Indoor Search Method R1 mAP R1 mAP R1 mAP Zero-Pad [20] ICCV17 17....


  • ...[20] start the first attempt to address this issue, by proposing a deep zero-padding framework [20] to adaptively learn the modality sharable features....


  • ...spectrums [20], [51], sketches [52] or depth images [53], and even text descriptions [54]....


  • ...of different viewpoints [10], [11], varying low-image resolutions [12], [13], illumination changes [14], unconstrained poses [15], [16], [17], occlusions [18], [19], heterogeneous modalities [9], [20], etc....


Journal ArticleDOI
TL;DR: The authors conduct a comprehensive overview with in-depth analysis of closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization.
Abstract: Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and increasing demand of intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With the performance saturation under closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re-ID system for real applications. Finally, some important yet under-investigated open issues are discussed.

301 citations

Proceedings ArticleDOI
Pingyang Dai1, Rongrong Ji1, Haibin Wang1, Qiong Wu1, Yuyu Huang1 
01 Jul 2018
TL;DR: This paper proposes a novel cross-modality generative adversarial network (termed cmGAN) that integrates both identification loss and cross- modality triplet loss, which minimize inter-class ambiguity while maximizing cross-Modality similarity among instances.
Abstract: Person re-identification (Re-ID) is an important task in video surveillance which automatically searches and identifies people across different cameras. Despite the extensive Re-ID progress in RGB cameras, few works have studied the Re-ID between infrared and RGB images, which is essentially a cross-modality problem and widely encountered in real-world scenarios. The key challenge lies in two folds, i.e., the lack of discriminative information to re-identify the same person between RGB and infrared modalities, and the difficulty to learn a robust metric towards such a large-scale cross-modality retrieval. In this paper, we tackle the above two challenges by proposing a novel cross-modality generative adversarial network (termed cmGAN). To handle the issue of insufficient discriminative information, we leverage the cutting-edge generative adversarial training to design our own discriminator to learn discriminative feature representation from different modalities. To handle the issue of large-scale cross-modality metric learning, we integrate both identification loss and cross-modality triplet loss, which minimize inter-class ambiguity while maximizing cross-modality similarity among instances. The entire cmGAN can be trained in an end-to-end manner by using a standard deep neural network framework. We have quantized the performance of our work in the newly-released SYSU RGB-IR Re-ID benchmark, and have reported superior performance, i.e., Cumulative Match Characteristic curve (CMC) and Mean Average Precision (MAP), over the state-of-the-art works [Wu et al., 2017], respectively.

287 citations

Proceedings ArticleDOI
14 Jun 2020
TL;DR: The aim of this paper is to design a generalizable person ReID framework which trains a model on source domains yet is able to generalize/perform well on target domains, and to enforce a dual causal loss constraint in SNR to encourage the separation of identity-relevant features and identity-irrelevant features.
Abstract: Existing fully-supervised person re-identification (ReID) methods usually suffer from poor generalization capability caused by domain gaps. The key to solving this problem lies in filtering out identity-irrelevant interference and learning domain-invariant person representations. In this paper, we aim to design a generalizable person ReID framework which trains a model on source domains yet is able to generalize/perform well on target domains. To achieve this goal, we propose a simple yet effective Style Normalization and Restitution (SNR) module. Specifically, we filter out style variations (e.g., illumination, color contrast) by Instance Normalization (IN). However, such a process inevitably removes discriminative information. We propose to distill identity-relevant feature from the removed information and restitute it to the network to ensure high discrimination. For better disentanglement, we enforce a dual causal loss constraint in SNR to encourage the separation of identity-relevant features and identity-irrelevant features. Extensive experiments demonstrate the strong generalization capability of our framework. Our models empowered by the SNR modules significantly outperform the state-of-the-art domain generalization approaches on multiple widely-used person ReID benchmarks, and also show superiority on unsupervised domain adaptation.

276 citations


Cites methods from "RGB-Infrared Cross-Modality Person ..."

  • ...To further demonstrate the capability of SNR in handling images with large style variations, we conduct experiment on a more challenging RGB-Infrared cross-modality person ReID task on benchmark dataset SYSU-MM01 [46]....


Proceedings ArticleDOI
01 Jul 2018
TL;DR: A dual-path network with a novel bi-directional dual-constrained top-ranking loss to learn discriminative feature representations and identity loss is further incorporated to model the identity-specific information to handle large intra-class variations.
Abstract: Cross-modality person re-identification between the thermal and visible domains is extremely important for night-time surveillance applications. Existing works in this field mainly focus on learning sharable feature representations to handle the cross-modality discrepancies. However, besides the cross-modality discrepancy caused by different camera spectrums, visible thermal person re-identification also suffers from large cross-modality and intra-modality variations caused by different camera views and human poses. In this paper, we propose a dual-path network with a novel bi-directional dual-constrained top-ranking loss to learn discriminative feature representations. It is advantageous in two aspects: 1) end-to-end feature learning directly from the data without extra metric learning steps, 2) it simultaneously handles the cross-modality and intra-modality variations to ensure the discriminability of the learnt representations. Meanwhile, identity loss is further incorporated to model the identity-specific information to handle large intra-class variations. Extensive experiments on two datasets demonstrate the superior performance compared to the state-of-the-arts.

269 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations

Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network with five convolutional layers, some followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art classification performance on ImageNet.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"RGB-Infrared Cross-Modality Person ..." refers methods in this paper

  • ...Handcrafted features included HOG [4], LOMO [23] and HIPHOP [3]....


Frequently Asked Questions (12)
Q1. What have the authors contributed in "Rgb-infrared cross-modality person re-identification" ?

For person Re-ID, this is a very challenging cross-modality problem that has not been studied so far. In this work, the authors address the RGB-IR cross-modality Re-ID problem and contribute a new multiple modality Re-ID dataset named SYSU-MM01, including RGB and IR images of 491 identities from 6 cameras, giving in total 287,628 RGB images and 15,792 IR images. The authors further propose deep zero-padding for training one-stream network towards automatically evolving domain-specific nodes in the network for cross-modality matching. 

Representative networks include AlexNet [16], VGG [38], GoogleNet [40], ResNet [9] and so on, which perform well in classification, detection, tracking and many other tasks. 

For cross-modality matching tasks, domain-specific modelling is important for extracting shared components for matching because of domain shift. 

For cross-modality matching tasks, domain-specific modelling is important for extracting shared features for matching because of the domain shift. 

Generalised similarity measure proposed by Lin et al. [26] is for cross-domain visual matching tasks, including RGB-RGB Re-ID task. 

In some multi-domain learning methods, e.g., HFA [18] and CRAFT [3], zero-padding at the feature level is applied and proved to be effective.

The former one includes subspace learning methods [25, 30, 60] and deep learning frameworks [45, 6, 14, 12], while the latter one includes linear models [39, 36, 59, 51] and non-linear models [27, 50, 31].

Remark 2. Considering two-stream structure and asymmetric FC layer structure, they are designed manually and fixed during training. 

It is hard to tell mathematically, but the authors find that with zero-padding as network input, the nodes in the network are more likely to become domain-specific nodes.

Networks with two inputs similar to two-stream structure are also favorable in Re-ID tasks, for example, Ahmed’s net [1], SIR-CIR net [42], gated siamese net [41], etc. 

Domain-specific nodes enable the network to convolve images from different domains using different filters, so as to better alleviate the differences (e.g., gradient orientations and exposure differences in Figure 1) between the two domains.

Using deep zero-padding helps to generate more domain-specific nodes, while the proportions without zero-padding are low in most layers.