
RGB-Infrared Cross-Modality Person Re-identification

TL;DR: The experiments show that RGB-IR cross-modality matching is very challenging but still feasible using the proposed model with deep zero-padding, giving the best performance.
Abstract: Person re-identification (Re-ID) is an important problem in video surveillance, aiming to match pedestrian images across camera views. Currently, most works focus on RGB-based Re-ID. However, in some applications, RGB images are not suitable, e.g. in a dark environment or at night. Infrared (IR) imaging becomes necessary in many visual systems. To that end, matching RGB images with infrared images is required, which are heterogeneous with very different visual characteristics. For person Re-ID, this is a very challenging cross-modality problem that has not been studied so far. In this work, we address the RGB-IR cross-modality Re-ID problem and contribute a new multiple modality Re-ID dataset named SYSU-MM01, including RGB and IR images of 491 identities from 6 cameras, giving in total 287,628 RGB images and 15,792 IR images. To explore the RGB-IR Re-ID problem, we evaluate existing popular cross-domain models, including three commonly used neural network structures (one-stream, two-stream and asymmetric FC layer) and analyse the relation between them. We further propose deep zero-padding for training one-stream network towards automatically evolving domain-specific nodes in the network for cross-modality matching. Our experiments show that RGB-IR cross-modality matching is very challenging but still feasible using the proposed model with deep zero-padding, giving the best performance. Our dataset is available at http://isee.sysu.edu.cn/project/RGBIRReID.htm.

Summary (3 min read)

1. Introduction

  • Secondly, from imaging principle aspect, the wavelength range of RGB and IR images is different.
  • In existing Re-ID works, colour information is the most important appearance cue for identifying persons.
  • The authors first identify the challenge of RGB-IR Re-ID by conducting extensive evaluations on popularly used cross-modality methods.
  • Considering using neural networks for cross-modality matching, the authors investigate and analyse the relation between different neural network structures, including two-stream structure and asymmetric FC layer structure, in which the domain-specific modelling exists but is designed manually.

2.1. Dataset Description

  • SYSU-MM01 contains images captured by 6 cameras, including two IR cameras and four RGB ones.
  • For each person, there are at least 400 continuous RGB frames with different poses and viewpoints.
  • The IR images have only one channel, and they are different from 3-channel RGB images.
  • Camera 4 and 5 are RGB surveillance cameras placed in two outdoor scenes named gate and garden.
  • These all introduce difficulties for the RGB-IR cross-modality Re-ID problem.

2.2. Evaluation Protocol

  • The authors have a fixed split using 296 identities for training, 99 for validation and 96 for testing.
  • Given a probe image, matching is conducted by computing similarities between the probe image and gallery images.
  • After computing similarities, the authors obtain a ranking list by sorting gallery images in descending order of similarity.
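In code, this ranking step amounts to sorting gallery similarities in descending order. A minimal illustrative sketch (not the authors' code), assuming cosine similarity on extracted features:

```python
import numpy as np

def rank_gallery(probe_feat, gallery_feats):
    """Return gallery indices sorted by descending similarity to the probe."""
    # Cosine similarity between the probe and every gallery image.
    sims = gallery_feats @ probe_feat / (
        np.linalg.norm(gallery_feats, axis=1) * np.linalg.norm(probe_feat) + 1e-12)
    return np.argsort(-sims)  # best match first
```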

3. Network Structure Comparison on Cross-Modality Modelling

  • The authors investigate deep learning network architectures for the task of RGB-IR cross-modality Re-ID.
  • In particular, the authors examine three commonly adopted network structures for visual recognition and cross-modality learning.
  • The authors further exploit the idea of deep zero-padding for model training and give insights on its impact on cross-modality matching task.

3.1. Common Deep Model Network Structures

  • In the past few years, a large number of deep models have been proposed for visual matching and cross-modality modelling, and have achieved satisfactory performance in many tasks.
  • Generally, in these tasks, the inputs to the network are RGB images, which are of the same modality.
  • In the deeper layers, shared parameters are used.
  • The generalized similarity net [26] proposed by Lin et al. for cross-domain visual matching, including the Re-ID task, is one of the representative structures of this type.
  • Compared to the one-stream structure, the two-stream structure achieves two things: domain adaptation and discriminative feature learning.

3.2. Analysis of Network Structures

  • Although the three structures discussed above seem different, the authors interestingly find that all of them can be represented by a one-stream structure in the forward propagation process when the following assumption holds (Assumption 1).
  • On the right is a one-stream network which can be conditionally equivalent to the two-stream one in forward propagation, in which a domain selection sub-network selects the following domain-specific structure.
  • The assumption above is less feasible in practice.
  • Using the above-defined categorization, without loss of generality, $x^{(l)}$ can be factorized into three parts $x^{(l)} = [x^{(l),1spe}; x^{(l),2spe}; x^{(l),s}]$, in which the three components denote the domain1-specific, domain2-specific and shared nodes, respectively.
  • In contrast, if one-stream structure can implicitly learn the structure, the implicit structures corresponding to different domains are partially coupled by shared nodes and shared bias parameters (Equations (4) and (5)), which can provide more flexibility in training for cross-modality matching tasks.

4.1. Analysis of Zero-Padding as Network Input

  • In most cases, a one-stream network is applied to single-domain tasks and treats all samples equally, so that domain-specific nodes may generally not be learned.
  • It would be easier for the neural network to spread the domain-specific nodes in deeper layers.
  • The authors' empirical results on neural network learning support this.
  • As shown in Figure 7 and Figure 8, deep zero-padding helps the network learn domain-specific nodes more easily than training without zero-padding.
  • The details will be illustrated later in Section 4.2.

4.3. Comparison of Cross-Modality Learning

  • While the cross-modality matching task has not drawn much attention in the Re-ID problem, it has been studied extensively in other fields such as information retrieval and face verification.
  • Cross-modality retrieval (e.g. text-image, tag-image) plays an important role in information retrieval.
  • Matching visible face images against near-infrared ones (VIS-NIR) [17, 58, 10] is closely related to RGB-IR cross-modality Re-ID.
  • The remaining useful cue may be body shape, which differs greatly across viewpoints and poses.
  • In comparison, their zero-padding is done at the raw image level, and the domain-specific and shared learning are done by a deep neural network.

5. Experiments

  • The authors conducted extensive evaluations of existing Re-ID and cross-domain matching models as baselines on their SYSU-MM01 dataset.
  • Then, the authors evaluated and analysed the effectiveness of deep models, including the proposed deep zero-padding and three network structures discussed in Section 3.
  • See Section 2.2 for detailed evaluation protocol.

5.1. Compared Models

  • The authors evaluated three favorable handcrafted features and cross-domain metric learning models as baselines.
  • The authors evaluated four deep models shown in Figure 3, including the one-stream network, the two-stream network, the asymmetric FC layer network and the proposed deep zero-padding method (whose network structure is the same as the one-stream network).
  • All of the hyperparameters were kept the same.

5.2. Model Comparisons and Analysis

  • The authors show comparative results in Table 3, including the rank-1, 10, 20 accuracies of CMC [32] and mean average precision (mAP).
  • There were clear gaps among their performances.
  • In Table 3 the authors can see that the deep zero-padding outperformed two-stream network and asymmetric FC layer structure.
  • The authors used the code released by the original authors in the experiments.
  • These models proved inferior when dealing with the much more challenging RGB-IR cross-modality Re-ID problem.

6. Summary

  • To their best knowledge, this work is the first to identify the RGB-IR cross-modality Re-ID problem and introduce a new multi-modality Re-ID dataset named SYSU-MM01.
  • The great difference between RGB and IR images makes RGB-IR cross-modality Re-ID a very challenging problem.
  • The authors have discussed and evaluated three common network structures for cross-domain tasks including one-stream structure, two-stream structure and asymmetric FC layer structure.
  • The authors have analysed the connection between one-stream and two-stream structure and found that one-stream network can learn and evolve domain-specific structure implicitly if there exist domain-specific and shared nodes.
  • The experiments have shown that the one-stream network trained by deep zero-padding achieved the best performance.


RGB-Infrared Cross-Modality Person Re-Identification

Ancong Wu¹, Wei-Shi Zheng²,⁵,⁶, Hong-Xing Yu², Shaogang Gong⁴, and Jianhuang Lai²,³

¹ School of Electronics and Information Technology, Sun Yat-sen University, China
² School of Data and Computer Science, Sun Yat-sen University, China
³ Guangdong Province Key Laboratory of Information Security, China
⁴ Queen Mary University of London, United Kingdom
⁵ Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
⁶ Collaborative Innovation Center of High Performance Computing, NUDT, China

wuancong@mail2.sysu.edu.cn, wszheng@ieee.org, xKoven@gmail.com, s.gong@qmul.ac.uk, stsljh@mail.sysu.edu.cn
Abstract

Person re-identification (Re-ID) is an important problem in video surveillance, aiming to match pedestrian images across camera views. Currently, most works focus on RGB-based Re-ID. However, in some applications, RGB images are not suitable, e.g. in a dark environment or at night. Infrared (IR) imaging becomes necessary in many visual systems. To that end, matching RGB images with infrared images is required, which are heterogeneous with very different visual characteristics. For person Re-ID, this is a very challenging cross-modality problem that has not been studied so far. In this work, we address the RGB-IR cross-modality Re-ID problem and contribute a new multiple modality Re-ID dataset named SYSU-MM01, including RGB and IR images of 491 identities from 6 cameras, giving in total 287,628 RGB images and 15,792 IR images. To explore the RGB-IR Re-ID problem, we evaluate existing popular cross-domain models, including three commonly used neural network structures (one-stream, two-stream and asymmetric FC layer) and analyse the relation between them. We further propose deep zero-padding for training one-stream network towards automatically evolving domain-specific nodes in the network for cross-modality matching. Our experiments show that RGB-IR cross-modality matching is very challenging but still feasible using the proposed model with deep zero-padding, giving the best performance. Our dataset is available at http://isee.sysu.edu.cn/project/RGBIRReID.htm.
1. Introduction

Person re-identification (Re-ID) is an important field in video surveillance. A large number of models for the Re-ID problem have been proposed, including feature learning [29, 48, 23], distance metric learning [55, 15, 22, 28, 23, 24, 49, 57, 21, 44, 56] and end-to-end learning [20, 1, 47, 46]. Most Re-ID methods are based on RGB-RGB matching, the most common single-modality Re-ID problem.

Figure 1. Examples of RGB images and infrared (IR) images captured in two outdoor scenes in the day time and in the night, respectively. The images in every two columns are of the same person. Captured by devices receiving light of different wavelengths, RGB images and IR images of the same person look very different.

However, RGB-RGB Re-ID can be limited in surveillance when lighting is either poor or unavailable. For instance, RGB images become uninformative at night (Figure 1). In such a case, imaging devices that do not rely on visible light should be applied. Infrared (IR) cameras are commonly used in video surveillance systems. While depth images captured by RGB-D cameras such as Kinect are also independent of visible light, they are rarely deployed because they are more expensive, used indoors only, and limited in range. Since most surveillance cameras are able to automatically switch from RGB to IR mode in the dark, it is necessary to study RGB-IR cross-modality matching in 24-hour surveillance systems.

In this work, we introduce the RGB-IR cross-modality Re-ID problem. Although RGB-IR Re-ID is common and significant in real-world applications, to our best knowledge, it has been rarely explored and remains an open issue. RGB-IR Re-ID is a very challenging problem due to the great differences between the two modalities. Firstly, RGB and IR images are intrinsically distinct. As shown in Figure 1, RGB images in the first row have three channels containing colour information of visible light, while IR images in the second row have one channel containing information of invisible light. Thus, they can be regarded as heterogeneous data. Secondly, from the imaging principle aspect, the wavelength range of RGB and IR images is different. In existing Re-ID works, colour information is the most important appearance cue for identifying persons. However, in the RGB-IR Re-ID problem, this cue can hardly be used. As shown in Figure 1, even humans can hardly recognise the persons by colour information. This leads to severe data misalignment within the same class. Moreover, viewpoint change, pose and exposure problems, which cause large intra-class discrepancy in RGB-based Re-ID, also bring difficulties to RGB-IR cross-modality Re-ID, resulting in a much more challenging problem. Although there exist a few Re-ID methods using IR images, such as Jungling et al. [13], they only consider IR-IR video matching for Re-ID and do not consider the cross-modality RGB-IR Re-ID problem.

We first identify the challenge of RGB-IR Re-ID by conducting extensive evaluations on popularly used cross-modality methods. For this purpose, we have collected a new dataset called the SYSU Multiple Modality Re-ID (SYSU-MM01) dataset. The comparison with existing commonly used Re-ID datasets is shown in Table 1. It contains 287,628 RGB images and 15,792 IR images of 491 persons captured by 6 cameras. To our best knowledge, this new RGB-IR Re-ID dataset provides for the first time a meaningful benchmark for the study of cross-modality RGB-IR Re-ID.

Table 1. Comparison between SYSU-MM01 and existing Re-ID datasets. (-/- denotes RGB#/IR#.)

| Datasets | ID# | images# | cameras# | RGB | IR |
|---|---|---|---|---|---|
| VIPER [7] | 632 | 1,264 | 2 | yes | no |
| iLIDS [54] | 119 | 476 | 2 | yes | no |
| CAVIAR [5] | 72 | 610 | 2 | yes | no |
| PRID2011 [11] | 200 | 971 | 2 | yes | no |
| CUHK01 [19] | 972 | 1,942 | 2 | yes | no |
| SYSU [8] | 502 | 24,448 | 2 | yes | no |
| CUHK03 [20] | 1467 | 13,164 | 6 | yes | no |
| Market [53] | 1501 | 32,668 | 6 | yes | no |
| MARS [52] | 1261 | 1,191,003 | 6 | yes | no |
| SYSU-MM01 | 491 | 287,628/15,792 | 6 | yes | yes |

For cross-modality matching tasks, domain-specific modelling is important for extracting shared features for matching because of the domain shift. Considering using neural networks for cross-modality matching, we investigate and analyse the relation between different neural network structures, including the two-stream structure and the asymmetric FC layer structure, in which the domain-specific modelling exists but is designed manually. Alternatively, we propose a deep zero-padding method for training a one-stream network tending towards evolving domain-specific structures automatically. Extensive experiments show the effectiveness of deep zero-padding, which outperforms the compared hand-crafted features and deep models.

The contributions of this paper are: (1) We contribute for the first time a standard benchmark SYSU-MM01 for supporting the study of RGB-IR cross-modality Re-ID. We conducted extensive experiments to evaluate popular baseline deep learning architectures for cross-modality RGB-IR Re-ID. (2) We analyse three different network structures (one-stream structure, two-stream structure and asymmetric FC layer structure) and give insights on their effectiveness for RGB-IR Re-ID. (3) We propose deep zero-padding for evolving domain-specific structure automatically in a one-stream network optimised for RGB-IR Re-ID tasks. Our experiments show that this approach for RGB-IR cross-modality Re-ID outperforms not only a standard one-stream network but also a two-stream network with explicit cross-domain learning and extra computational costs.

Figure 2. Examples of RGB images and infrared (IR) images in our SYSU-MM01 dataset. Cameras 1-3 on the left are indoor scenes and cameras 4-6 on the right are outdoor scenes. Every two columns are of the same person.
2. SYSU-MM01 Dataset

2.1. Dataset Description

SYSU-MM01 contains images captured by 6 cameras, including two IR cameras and four RGB ones. Different from RGB cameras, IR cameras work in dark scenarios. We show the details in Table 2 and some samples from each camera view in Figure 2. RGB images of camera 1 and camera 2 were captured in two bright indoor rooms (room 1 and room 2) by Kinect V1. For each person, there are at least 400 continuous RGB frames with different poses and viewpoints. IR images of camera 3 and camera 6 were captured by IR cameras in the dark. The IR images have only one channel, and they are different from 3-channel RGB images. Camera 3 is placed in room 2 in a dark environment, while camera 6 is placed in an outdoor passage with background clutter. Cameras 4 and 5 are RGB surveillance cameras placed in two outdoor scenes named gate and garden.

Table 2. Overview of the SYSU-MM01 dataset.

| Cam | Location | (In/Out)door | Lighting | ID# | RGB#/ID | IR#/ID |
|---|---|---|---|---|---|---|
| 1 | room1 | indoor | bright | 259 | 400+ | - |
| 2 | room2 | indoor | bright | 259 | 400+ | - |
| 3 | room2 | indoor | dark | 486 | - | 20 |
| 4 | gate | outdoor | bright | 493 | 20 | - |
| 5 | garden | outdoor | bright | 502 | 20 | - |
| 6 | passage | outdoor | dark | 299 | - | 20 |

Observing the samples of the dataset, we can clearly see that the images of the IR cameras (cameras 3 and 6) are distinct from the RGB images, in terms of both colour and exposure. Specifically, although cameras 2 and 3 are in the same scenario, their images suffer from dramatic colour shift and exposure difference. For example, the first person's yellow clothes are distinct from her black trousers under the RGB camera, but this colour distinction is nearly eliminated under the IR camera (Columns 1-2, Rows 2-3 in Figure 2). Moreover, IR images have only one channel and might lose some texture details. The exposure of IR images captured at different distances is also an issue. These all introduce difficulties for the RGB-IR cross-modality Re-ID problem.
2.2. Evaluation Protocol

There are 491 valid IDs in the SYSU-MM01 dataset. We have a fixed split using 296 identities for training, 99 for validation and 96 for testing. During training, all images of the 296 persons in the training set across all cameras can be used. In the testing stage, samples from RGB cameras form the gallery set, and those from IR cameras form the probe set.

We design two modes, all-search mode and indoor-search mode. For all-search mode, RGB cameras 1, 2, 4 and 5 are for the gallery set and IR cameras 3 and 6 are for the probe set. For indoor-search mode, RGB cameras 1 and 2 (excluding outdoor cameras 4 and 5) are for the gallery set and IR cameras 3 and 6 are for the probe set, which is less challenging.

For both modes, we adopt single-shot and multi-shot settings. For every identity under an RGB camera, we randomly choose one/ten image(s) of the identity to form the gallery set for the single-shot/multi-shot setting. As for the probe set, all images are used. Given a probe image, matching is conducted by computing similarities between the probe image and gallery images. Notice that matching is conducted between cameras in different locations (locations are shown in Table 2). Camera 2 and camera 3 are in the same location, so probe images of camera 3 skip the gallery images of camera 2. After computing similarities, we obtain a ranking list by sorting similarities in descending order.

To indicate the performance, we use the Cumulative Matching Characteristic (CMC) [32] and mean average precision (mAP). Notice that, for CMC under the multi-shot setting, only the maximum similarity over all gallery images of the same person is taken to compute the rank list. We repeat the above evaluation 10 times with random splits of gallery and probe sets and finally compute the average performance.
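The metrics above can be made concrete with a short sketch. Here is an illustrative single-shot implementation of CMC and mAP (not the authors' released evaluation code); `invalid[i]` is a boolean mask over the gallery marking excluded same-location pairs (e.g. probe camera 3 against gallery camera 2), and the multi-shot CMC variant would additionally keep only the maximum similarity per gallery identity:

```python
import numpy as np

def cmc_map(sim, p_ids, g_ids, invalid=None, ranks=(1, 10, 20)):
    """CMC rank-k accuracies and mAP from a (num_probe, num_gallery) similarity matrix."""
    num_probe = sim.shape[0]
    g_ids = np.asarray(g_ids)
    cmc = np.zeros(max(ranks))
    aps = []
    for i in range(num_probe):
        keep = np.ones(len(g_ids), dtype=bool)
        if invalid is not None:
            keep &= ~invalid[i]                    # skip same-location gallery images
        order = np.argsort(-sim[i][keep])          # descending similarity
        matches = g_ids[keep][order] == p_ids[i]   # correct-identity flags in rank order
        hits = np.flatnonzero(matches)
        if hits.size == 0:
            continue                               # no valid ground truth for this probe
        cmc[hits[0]:] += 1                         # rank-k hit for every k >= first hit
        prec = np.cumsum(matches) / (np.arange(matches.size) + 1)
        aps.append((prec * matches).sum() / matches.sum())  # average precision
    return {f"rank-{k}": cmc[k - 1] / num_probe for k in ranks}, float(np.mean(aps))
```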
3. Network Structure Comparison on Cross-Modality Modelling

We investigate deep learning network architectures for the task of RGB-IR cross-modality Re-ID. In particular, we examine three commonly adopted network structures for visual recognition and cross-modality learning. We further exploit the idea of deep zero-padding for model training and give insights on its impact on the cross-modality matching task.

Figure 3. Four network structures in our evaluation: the one-stream structure, the two-stream structure, the asymmetric FC layer structure, and the one-stream structure with deep zero-padding augmentation (proposed). The structure of the conv blocks depends on the base network (ResNet [9] in our evaluation). The colour of conv blocks and FC layers indicates whether the parameters are shared or not: red and blue indicate specific parameters and green indicates shared parameters.

3.1. Common Deep Model Network Structures

In the past few years, a large number of deep models have been proposed for visual matching and cross-modality modelling, and have achieved satisfactory performance in many tasks. The most commonly used structures can mainly be categorized into 3 types. All structures that we are going to discuss are shown in Figure 3.

One-stream Structure. The one-stream structure is the most commonly used in vision tasks. As shown in the first network in Figure 3, there is a single input and all parameters are shared in the whole network. Representative networks include AlexNet [16], VGG [38], GoogLeNet [40], ResNet [9] and so on, which perform well in classification, detection, tracking and many other tasks. In the field of Re-ID, JSTL-DGD [47], one of the state-of-the-art networks, uses the one-stream structure as well. Generally, in these tasks, the inputs to the network are RGB images, which are of the same modality, so sharing all parameters in the network is appropriate.

Two-stream Structure. The two-stream structure is commonly used in cross-modality matching tasks. As shown in the second network in Figure 3, there are two inputs, corresponding to data in two different domains. In the shallower layers, the parameters of the network are specific to each domain; in the deeper layers, shared parameters are used. The generalized similarity net [26] proposed by Lin et al. for cross-domain visual matching, including the Re-ID task, is one of the representative structures of this type. Networks with two inputs similar to the two-stream structure are also favorable in Re-ID tasks, for example, Ahmed's net [1], SIR-CIR net [42], gated siamese net [41], etc. Note that except for Lin's structure [26], most of them prefer sharing parameters in the domain-specific layers; this is not exactly identical to our definition of the two-stream structure. The reason may be that, although the images are from different cameras, they are all of the same RGB modality. Compared to the one-stream structure, the two-stream structure achieves two things: domain adaptation and discriminative feature learning. It is assumed that the domain-specific network can extract shared features for the different domains, and then the shared network can extract discriminative features for matching.

Figure 4. Explanation of how a one-stream network can represent a two-stream network under Assumption 1, with a domain indicator and a domain selection sub-network in forward propagation.

Asymmetric FC Layer Structure. The asymmetric FC layer model is also used in multi-domain tasks, for example, MDNet [33] for multi-domain tracking, CVDCA [2] for Re-ID and IDR [10] for VIS-NIR face recognition. As shown in the third network in Figure 3, this structure shares nearly all parameters except for the last FC layer. The design assumes that feature extraction for different domains can be the same and that domain adaptation is achieved at the feature level. This order of feature extraction and domain adaptation is different from the two-stream structure.
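To make the parameter-sharing patterns of the three structures concrete, below is a minimal PyTorch sketch (our illustration, not the authors' implementation). `conv_block` is a stand-in for the ResNet conv blocks used in the paper, and replicating the 1-channel IR image to 3 channels for the fully shared stems is an assumption of this sketch:

```python
import torch.nn as nn

def conv_block(cin, cout):
    # Stand-in for a ResNet conv block (details depend on the base network).
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class OneStream(nn.Module):
    """Single input, all parameters shared (first network in Figure 3)."""
    def __init__(self, n_id):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                                 conv_block(64, 128), nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten(), nn.Linear(128, n_id))
    def forward(self, x):                  # mixed RGB/IR batch, IR replicated to 3ch
        return self.net(x)

class TwoStream(nn.Module):
    """Domain-specific shallow layers, shared deep layers (second network)."""
    def __init__(self, n_id):
        super().__init__()
        self.spec_rgb = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.spec_ir = nn.Sequential(conv_block(1, 32), conv_block(32, 64))
        self.shared = nn.Sequential(conv_block(64, 128), nn.AdaptiveAvgPool2d(1),
                                    nn.Flatten(), nn.Linear(128, n_id))
    def forward(self, x, domain):          # domain: "rgb" or "ir"
        h = self.spec_rgb(x) if domain == "rgb" else self.spec_ir(x)
        return self.shared(h)

class AsymmetricFC(nn.Module):
    """All layers shared except the last, domain-specific FC (third network)."""
    def __init__(self, n_id):
        super().__init__()
        self.shared = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                                    conv_block(64, 128), nn.AdaptiveAvgPool2d(1),
                                    nn.Flatten())
        self.fc_rgb, self.fc_ir = nn.Linear(128, n_id), nn.Linear(128, n_id)
    def forward(self, x, domain):
        h = self.shared(x)
        return self.fc_rgb(h) if domain == "rgb" else self.fc_ir(h)
```

All variants in Figure 3 are trained with the softmax loss; the proposed deep zero-padding variant reuses the one-stream structure on zero-padded inputs (Section 4).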
3.2. Analysis of Network Structures

Connection of One-stream and Two-stream Structures in a Special Case. The three structures discussed above seem to be different, but we interestingly find that all of them can be represented by a one-stream structure in the forward propagation process when the following assumption holds:

Assumption 1. A domain selection sub-network exists somewhere in the network, which can automatically select samples of the corresponding domain as input, and the domain selection sub-network is fixed.

Under Assumption 1, we first give a simple example of how a one-stream network can perform as a two-stream network in forward propagation. As shown in Figure 4, on the left is a simplified two-stream network: two fully connected networks, each with a specific layer (blue and red) and a shared layer (green). On the right is a one-stream network which can be conditionally equivalent to the two-stream one in forward propagation, in which a domain selection sub-network selects the following domain-specific structure. We first define some symbols for illustration. Let $x_{d1} \in \mathbb{R}^d$ and $x_{d2} \in \mathbb{R}^d$ denote the inputs of domain1 and domain2, respectively. We define a domain indicator $y_{ind}$ as a vector with two elements, whose value is $[1, 0]^T$ or $[0, 1]^T$, indicating domain1 or domain2, respectively. Let $f_{sel}(x, y_{ind})$ denote the domain selection sub-network, implementing the following function:

$$f_{sel}(x, y_{ind}) = \begin{cases} [I_d, O_d]^T x, & y_{ind} = [1, 0]^T \\ [O_d, I_d]^T x, & y_{ind} = [0, 1]^T \end{cases} \tag{1}$$

The equation above suggests that if the domain selection sub-network is fixed, the two-stream network can be represented by a one-stream network in forward propagation.
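A small numeric check of Equation (1) (illustrative only): the fixed selection operator routes an input into the first or second half of a $2d$-dimensional vector, so a single weight matrix in the next layer behaves like two decoupled domain-specific layers.

```python
import numpy as np

d = 4
I, O = np.eye(d), np.zeros((d, d))

def f_sel(x, y_ind):
    # Equation (1): embed x into the first half (domain1) or second half (domain2).
    M = np.vstack([I, O]) if y_ind == (1, 0) else np.vstack([O, I])
    return M @ x  # shape (2d,)

x_d1 = np.random.randn(d)
z = f_sel(x_d1, (1, 0))

# A one-stream weight matrix W of shape (k, 2d) uses only its first d columns
# for domain1 inputs and only its last d columns for domain2 inputs, i.e. it
# acts as two decoupled domain-specific layers.
W = np.random.randn(3, 2 * d)
assert np.allclose(W @ z, W[:, :d] @ x_d1)
```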
Analysis of One-stream Structure in the General Case. The assumption we made above is less feasible in practice. Now, we drop this assumption and analyse the domain-specific property of a one-stream network. For cross-modality matching tasks, domain-specific modelling is important for extracting shared components for matching because of domain shift. Generally, in neural networks, e.g., the two-stream and asymmetric FC layer structures, this is modelled by domain-specific structures. Thus we intend to analyse the domain-specific modelling in a one-stream network. Our analysis is based on the following relaxed assumption:

Assumption 2. As shown in Figure 5, for a one-stream network dealing with inputs of two domains, we categorize the output nodes of each layer into three types: domain1-specific nodes, domain2-specific nodes and shared nodes. The categorization depends on whether the response of the node is domain-specific. Let $x^{(l)}_{d1}$ and $x^{(l)}_{d2}$ denote the inputs to layer $l+1$ of domain1 and domain2, respectively. For example, $x^{(0)}_{d1}$ and $x^{(0)}_{d2}$ are inputs of the whole network. Let $\eta^{(l)}_i$ denote the $i$-th node in layer $l$ and $f_{out}(x^{(0)}, i, l)$ denote the output of $\eta^{(l)}_i$ with the network input $x^{(0)}$; we have:

$$f_{out}(x^{(0)}, i, l) = \sigma\Big(\sum_j w^{(l-1)}_{j,i} f_{out}(x^{(0)}, j, l-1) + b^{(l-1)}_i\Big), \tag{2}$$

where $\sigma(\cdot)$ is the activation function, and $w^{(l-1)}_{j,i}$ and $b^{(l-1)}_i$ are weight and bias parameters of layer $l-1$. The type of node $\eta^{(l)}_i$ is defined by

$$type(\eta^{(l)}_i) = \begin{cases} \text{domain1-specific}, & f_{out}(x^{(0)}_{d2}, i, l) \equiv 0 \\ \text{domain2-specific}, & f_{out}(x^{(0)}_{d1}, i, l) \equiv 0 \\ \text{shared}, & \text{otherwise}. \end{cases} \tag{3}$$

For domain1-specific nodes, we use the identity sign in $f_{out}(x^{(0)}_{d2}, i, l) \equiv 0$, which means that for any input of domain2, the output of node $\eta^{(l)}_i$ is always zero.
Under Assumption 2, we define some symbols for analysis. Let $L$ denote the loss function. Let $o^{(l+1)}_i$ denote the output of the $i$-th node before the activation function in layer $l+1$, $x^{(l)}$ denote the input to layer $l+1$, and $w^{(l)}_i$ and $b^{(l)}_i$ denote the weight and bias parameters, i.e., $o^{(l+1)}_i = (w^{(l)}_i)^T x^{(l)} + b^{(l)}_i$. Using the above-defined categorization, without loss of generality, $x^{(l)}$ can be factorized into three parts¹ $x^{(l)} = [x^{(l),1spe}; x^{(l),2spe}; x^{(l),s}]$, in which the three components denote the domain1-specific, domain2-specific and shared nodes, respectively. We can also denote $w^{(l)}_i$ as $w^{(l)}_i = [w^{(l),1spe}_i; w^{(l),2spe}_i; w^{(l),s}_i]$.

(¹ ";" means concatenation of vectors.)

For an input of the network $x^{(0)}_{d1}$ in domain1, according to the categorization definition, $x^{(l),2spe}_{d1} = 0$, because for the output of each domain2-specific node, $f_{out}(x^{(0)}_{d1}, i, l) \equiv 0$. In the forward propagation process, the output of layer $l+1$ is

$$o^{(l+1)}_i = (w^{(l),1spe}_i)^T x^{(l),1spe}_{d1} + (w^{(l),s}_i)^T x^{(l),s}_{d1} + b^{(l)}_i. \tag{4}$$

For an input of the network $x^{(0)}_{d2}$ in domain2, similarly, we have

$$o^{(l+1)}_i = (w^{(l),2spe}_i)^T x^{(l),2spe}_{d2} + (w^{(l),s}_i)^T x^{(l),s}_{d2} + b^{(l)}_i. \tag{5}$$

In the back propagation process, for an input of the network $x^{(0)}_{d1}$ in domain1,

$$\frac{\partial L}{\partial w^{(l),1spe}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\,\frac{\partial o^{(l+1)}_i}{\partial w^{(l),1spe}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\, x^{(l),1spe}_{d1}, \tag{6}$$

$$\frac{\partial L}{\partial w^{(l),s}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\,\frac{\partial o^{(l+1)}_i}{\partial w^{(l),s}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\, x^{(l),s}_{d1}, \tag{7}$$

$$\frac{\partial L}{\partial w^{(l),2spe}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\,\frac{\partial o^{(l+1)}_i}{\partial w^{(l),2spe}_i} = \frac{\partial L}{\partial o^{(l+1)}_i}\, x^{(l),2spe}_{d1} = 0. \tag{8}$$

From the analysis above, we have two conclusions. (1) In forward propagation, as shown in Figure 5, the weight parameters $w^{(l),1spe}_i$ (blue connections) and $w^{(l),2spe}_i$ (red connections) only have impact on inputs of the corresponding domain, similar to the domain-specific parameters in two-stream networks, while $w^{(l),s}_i$ (green connections) has impact on both domains, similar to the shared parameters in two-stream networks. Thus, the network can implicitly control the domain-specific structure by domain-specific nodes and the shared structure by shared nodes. (2) In backward propagation, if a node is domain2-specific, then with an input in domain1 its corresponding weight parameters will not be updated, because the gradient is zero. That means the training samples of the other domain do not influence the implicit domain-specific structure. Note that for an input $x^{(0)}_{d2}$, the same conclusion can be drawn in a similar way.

Remark 1. A one-stream network may implicitly learn and evolve the domain-specific and shared structures in the network if the three types of nodes defined by Equation (3) are assumed to exist in the network.
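The zero gradient in Equation (8) can be verified directly with automatic differentiation. A toy sketch for a single linear node (assumed setup: the input is factorized as above, with its domain2-specific part equal to zero for a domain1 sample):

```python
import torch

# Factorized input x = [x_1spe; x_2spe; x_s] for a domain1 sample: x_2spe = 0.
x = torch.tensor([0.7, 0.0, 1.3])  # [domain1-specific, domain2-specific, shared]
w = torch.tensor([0.5, -0.2, 0.9], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

o = w @ x + b                      # Equation (4): o = w^T x + b
loss = (o - 1.0) ** 2              # any loss on the node output
loss.backward()

# dL/dw_i = (dL/do) * x_i, so the domain2-specific weight receives zero gradient.
print(w.grad)                      # middle entry is exactly 0, matching Eq. (8)
```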
Figure 5. Explanation of the deep zero-padding method. In each layer, the blue nodes denote the domain1-specific nodes, the red nodes denote the domain2-specific nodes, the green nodes denote the shared nodes, and the dotted-line nodes denote zero values. (The diagram shows the zero-padded inputs: the domain1 input followed by a zero vector, and a zero vector followed by the domain2 input.)
Remark 2. Consider the two-stream structure and the asymmetric FC layer structure: they are designed manually and fixed during training. Moreover, their domain-specific structures for the two domains are decoupled, while the shared structure is completely identical. In contrast, if a one-stream structure can implicitly learn the structure, the implicit structures corresponding to different domains are partially coupled by shared nodes and shared bias parameters (Equations (4) and (5)), which can provide more flexibility in training for cross-modality matching tasks.
4. Deep Zero-Padding

4.1. Analysis of Zero-Padding as Network Input

The node types we defined in the last section (Equation (3)) are idealized, since they rest on the assumption that $f_{out}(x^{(0)}_{d1}, i, l) \equiv 0$ or $f_{out}(x^{(0)}_{d2}, i, l) \equiv 0$, and how to make the network learn such nodes with the domain-specific property in the training stage remains an important problem. In most cases, a one-stream network is applied to single-domain tasks and treats all samples equally, so that domain-specific nodes may generally not be learned.

As analysed in the previous sections, the structures of the two-stream network and the asymmetric FC layer network are designed manually and fixed during training, while a one-stream network can evolve its structure implicitly by learning domain-specific nodes, which may generate a more optimal structure. For this purpose, we propose to use zero-padded input to stimulate domain-specific responses. As shown in Figure 5, for inputs from the two domains $x_{d1} \in \mathbb{R}^d$ and $x_{d2} \in \mathbb{R}^d$, we apply zero-padding as follows:

$$x^{pad}_{d1} = [x^T_{d1}, O_{1\times d}]^T, \qquad x^{pad}_{d2} = [O_{1\times d}, x^T_{d2}]^T. \tag{9}$$

If we regard the network input as a prior layer (the 0-th layer), then all the nodes in this prior layer are definitely categorized as domain-specific nodes according to our definition in Equation (3). Now, what is the case for the nodes in the next layer? It is hard to tell mathematically, but we find that with zero-padding as network input, the nodes in the network are more likely to become domain-specific nodes.
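Below is a sketch of Equation (9) applied at the raw-image level, as proposed; reducing the RGB image to a single channel so that both padded inputs share the same two-channel shape is an assumption of this illustration:

```python
import numpy as np

def zero_pad_inputs(rgb, ir):
    """Build two-channel zero-padded inputs per Equation (9).

    rgb: (H, W, 3) visible image; ir: (H, W) infrared image.
    Returns two arrays of shape (2, H, W).
    """
    gray = rgb.mean(axis=2)                               # assumed 1-channel reduction
    zeros = np.zeros_like(gray)
    x_pad_d1 = np.stack([gray, zeros])                    # [x_d1; 0] -> channel 1 active
    x_pad_d2 = np.stack([zeros, ir.astype(gray.dtype)])   # [0; x_d2] -> channel 2 active
    return x_pad_d1, x_pad_d2
```

Fed with such inputs, every input node is domain-specific by construction, which is intended to encourage domain-specific nodes to emerge in the subsequent layers.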

Citations
Posted Content
TL;DR: A powerful AGW baseline is designed, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks, and a new evaluation metric (mINP) is introduced, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re- ID system for real applications.
Abstract: Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and increasing demand of intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With the performance saturation under closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for FOUR different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re-ID system for real applications. Finally, some important yet under-investigated open issues are discussed.

737 citations


Cites background from "RGB-Infrared Cross-Modality Person ..."

  • ...[196], both the query and gallery sets may contain different modalities (visible, thermal [20], depth [53] or text description [9])....


  • ...RegDB [51] SYSU-MM01 [20] Visible-Thermal All Search Indoor Search Method R1 mAP R1 mAP R1 mAP Zero-Pad [20] ICCV17 17....


  • ...[20] start the first attempt to address this issue, by proposing a deep zero-padding framework [20] to adaptively learn the modality sharable features....


  • ...spectrums [20], [51], sketches [52] or depth images [53], and even text descriptions [54]....


  • ...of different viewpoints [10], [11], varying low-image resolutions [12], [13], illumination changes [14], unconstrained poses [15], [16], [17], occlusions [18], [19], heterogeneous modalities [9], [20], etc....


Journal ArticleDOI
TL;DR: The authors conduct a comprehensive overview with in-depth analysis of closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization.
Abstract: Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and increasing demand of intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With the performance saturation under closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re-ID system for real applications. Finally, some important yet under-investigated open issues are discussed.

301 citations

Proceedings ArticleDOI
Pingyang Dai1, Rongrong Ji1, Haibin Wang1, Qiong Wu1, Yuyu Huang1 
01 Jul 2018
TL;DR: This paper proposes a novel cross-modality generative adversarial network (termed cmGAN) that integrates both identification loss and cross- modality triplet loss, which minimize inter-class ambiguity while maximizing cross-Modality similarity among instances.
Abstract: Person re-identification (Re-ID) is an important task in video surveillance which automatically searches and identifies people across different cameras. Despite the extensive Re-ID progress in RGB cameras, few works have studied the Re-ID between infrared and RGB images, which is essentially a cross-modality problem and widely encountered in real-world scenarios. The key challenge lies in two folds, i.e., the lack of discriminative information to re-identify the same person between RGB and infrared modalities, and the difficulty to learn a robust metric towards such a large-scale cross-modality retrieval. In this paper, we tackle the above two challenges by proposing a novel cross-modality generative adversarial network (termed cmGAN). To handle the issue of insufficient discriminative information, we leverage the cutting-edge generative adversarial training to design our own discriminator to learn discriminative feature representation from different modalities. To handle the issue of large-scale cross-modality metric learning, we integrate both identification loss and cross-modality triplet loss, which minimize inter-class ambiguity while maximizing cross-modality similarity among instances. The entire cmGAN can be trained in an end-to-end manner by using a standard deep neural network framework. We have quantized the performance of our work in the newly-released SYSU RGB-IR Re-ID benchmark, and have reported superior performance, i.e., Cumulative Match Characteristic curve (CMC) and Mean Average Precision (MAP), over the state-of-the-art works [Wu et al., 2017], respectively.

287 citations

Proceedings ArticleDOI
14 Jun 2020
TL;DR: The aim of this paper is to design a generalizable person ReID framework which trains a model on source domains yet is able to generalize/perform well on target domains, and to enforce a dual causal loss constraint in SNR to encourage the separation of identity-relevant features and identity-irrelevant features.
Abstract: Existing fully-supervised person re-identification (ReID) methods usually suffer from poor generalization capability caused by domain gaps. The key to solving this problem lies in filtering out identity-irrelevant interference and learning domain-invariant person representations. In this paper, we aim to design a generalizable person ReID framework which trains a model on source domains yet is able to generalize/perform well on target domains. To achieve this goal, we propose a simple yet effective Style Normalization and Restitution (SNR) module. Specifically, we filter out style variations (e.g., illumination, color contrast) by Instance Normalization (IN). However, such a process inevitably removes discriminative information. We propose to distill identity-relevant feature from the removed information and restitute it to the network to ensure high discrimination. For better disentanglement, we enforce a dual causal loss constraint in SNR to encourage the separation of identity-relevant features and identity-irrelevant features. Extensive experiments demonstrate the strong generalization capability of our framework. Our models empowered by the SNR modules significantly outperform the state-of-the-art domain generalization approaches on multiple widely-used person ReID benchmarks, and also show superiority on unsupervised domain adaptation.

276 citations


Cites methods from "RGB-Infrared Cross-Modality Person ..."

  • ...To further demonstrate the capability of SNR in handling images with large style variations, we conduct experiment on a more challenging RGB-Infrared cross-modality person ReID task on benchmark dataset SYSU-MM01 [46]....


Proceedings ArticleDOI
01 Jul 2018
TL;DR: A dual-path network with a novel bi-directional dual-constrained top-ranking loss to learn discriminative feature representations and identity loss is further incorporated to model the identity-specific information to handle large intra-class variations.
Abstract: Cross-modality person re-identification between the thermal and visible domains is extremely important for night-time surveillance applications. Existing works in this field mainly focus on learning sharable feature representations to handle the cross-modality discrepancies. However, besides the cross-modality discrepancy caused by different camera spectrums, visible thermal person re-identification also suffers from large cross-modality and intra-modality variations caused by different camera views and human poses. In this paper, we propose a dual-path network with a novel bi-directional dual-constrained top-ranking loss to learn discriminative feature representations. It is advantageous in two aspects: 1) end-to-end feature learning directly from the data without extra metric learning steps, 2) it simultaneously handles the cross-modality and intra-modality variations to ensure the discriminability of the learnt representations. Meanwhile, identity loss is further incorporated to model the identity-specific information to handle large intra-class variations. Extensive experiments on two datasets demonstrate the superior performance compared to the state-of-the-arts.

269 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations

Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network with five convolutional layers, some followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art classification performance on ImageNet.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"RGB-Infrared Cross-Modality Person ..." refers methods in this paper

  • ...Handcrafted features included HOG [4], LOMO [23] and HIPHOP [3]....


Frequently Asked Questions (12)
Q1. What have the authors contributed in "Rgb-infrared cross-modality person re-identification" ?

For person Re-ID, this is a very challenging cross-modality problem that has not been studied so far. In this work, the authors address the RGB-IR cross-modality Re-ID problem and contribute a new multiple modality Re-ID dataset named SYSU-MM01, including RGB and IR images of 491 identities from 6 cameras, giving in total 287,628 RGB images and 15,792 IR images. The authors further propose deep zero-padding for training one-stream network towards automatically evolving domain-specific nodes in the network for cross-modality matching. 

Representative networks include AlexNet [16], VGG [38], GoogleNet [40], ResNet [9] and so on, which perform well in classification, detection, tracking and many other tasks. 

For cross-modality matching tasks, domain-specific modelling is important for extracting shared components for matching because of domain shift. 

For cross-modality matching tasks, domain-specific modelling is important for extracting shared features for matching because of the domain shift. 

Generalised similarity measure proposed by Lin et al. [26] is for cross-domain visual matching tasks, including RGB-RGB Re-ID task. 

In some multi-domain learning methods, e.g., HFA [18] and CRAFT [3], zero-padding at the feature level is applied and proved to be effective.

The former one includes subspace learning methods [25, 30, 60] and deep learning frameworks [45, 6, 14, 12], while the latter one includes linear models [39, 36, 59, 51] and non-linear models [27, 50, 31].

Remark 2. Considering two-stream structure and asymmetric FC layer structure, they are designed manually and fixed during training. 

It is hard to tell mathematically, but the authors find that with zero-padding as network input, the nodes in the network are more likely to become domain-specific nodes.

Networks with two inputs similar to two-stream structure are also favorable in Re-ID tasks, for example, Ahmed’s net [1], SIR-CIR net [42], gated siamese net [41], etc. 

Domain-specific nodes enable the network to convolve images from different domains using different filters, so as to better alleviate the differences (e.g., gradient orientations and exposure differences in Figure 1) between the two domains.

Using deep zero-padding helps to generate more domain-specific nodes, while the proportions without zero-padding are low in most layers.