RGB-Infrared Cross-Modality Person Re-identification
Summary (3 min read)
1. Introduction
- Secondly, in terms of imaging principle, RGB and IR images cover different wavelength ranges.
- In existing Re-ID works, colour information is the most important appearance cue for identifying persons.
- The authors first identify the challenge of RGB-IR Re-ID by conducting extensive evaluations of widely used cross-modality methods.
- For cross-modality matching with neural networks, the authors investigate and analyse the relation between different network structures, including the two-stream structure and the asymmetric FC layer structure, in which domain-specific modelling exists but is designed manually.
2.1. Dataset Description
- SYSU-MM01 contains images captured by 6 cameras, including two IR cameras and four RGB ones.
- For each person, there are at least 400 continuous RGB frames with different poses and viewpoints.
- The IR images have only one channel, and they are different from 3-channel RGB images.
- Cameras 4 and 5 are RGB surveillance cameras placed in two outdoor scenes named gate and garden.
- These all introduce difficulties for the RGB-IR cross-modality Re-ID problem.
2.2. Evaluation Protocol
- The authors have a fixed split using 296 identities for training, 99 for validation and 96 for testing.
- Given a probe image, matching is conducted by computing similarities between the probe image and gallery images.
- After computing similarities, a ranking list is obtained by sorting the gallery images in descending order of similarity.
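The matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: cosine similarity is an assumed choice of similarity function, and the feature vectors and the helper names `cosine_similarity` and `rank_gallery` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(probe, gallery):
    """Return gallery indices sorted by descending similarity to the probe."""
    sims = [cosine_similarity(probe, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)

# Toy features (hypothetical values, not taken from the paper).
probe = [1.0, 0.0]
gallery = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
ranking = rank_gallery(probe, gallery)  # gallery index 1 is the closest match
```

Any similarity function could be substituted for `cosine_similarity`; only the descending sort is taken from the protocol described here.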
3. Network Structure Comparison on Cross-Modality Modelling
- The authors investigate deep learning network architectures for the task of RGB-IR cross-modality Re-ID.
- In particular, the authors examine three commonly adopted network structures for visual recognition and cross-modality learning.
- The authors further exploit the idea of deep zero-padding for model training and give insights on its impact on cross-modality matching task.
3.1. Common Deep Model Network Structures
- In the past few years, a large number of deep models have been proposed for visual matching and cross-modality modelling, and have achieved satisfactory performance in many tasks.
- Generally, in these tasks, the inputs to the network are RGB images, which are of the same modality.
- In the deeper layers, shared parameters are used.
- The generalized similarity net [26] proposed by Lin et al. for cross-domain visual matching, including the Re-ID task, is one of the representative structures of this type.
- Compared to the one-stream structure, the two-stream structure achieves two things: domain adaptation and discriminative feature learning.
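As a rough illustration of the two-stream idea (domain-specific parameters in the shallow layers, shared parameters in the deeper layers), the forward pass can be sketched with tiny fully connected layers. The weights, layer sizes and function names below are hypothetical and chosen only to keep the example small. Real models use convolutional layers.

```python
def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(values):
    return [max(0.0, v) for v in values]

# Hypothetical toy parameters: one shallow layer per domain, one shared deeper layer.
DOMAIN_WEIGHTS = {
    "rgb": [[1.0, 0.0], [0.0, 1.0]],
    "ir": [[0.0, 1.0], [1.0, 0.0]],
}
SHARED_WEIGHTS = [[1.0, -1.0]]

def two_stream_forward(x, domain):
    """Route the input through its domain-specific layer, then the shared layer."""
    if domain not in DOMAIN_WEIGHTS:
        raise ValueError(f"unknown domain: {domain!r}")
    hidden = relu(matvec(DOMAIN_WEIGHTS[domain], x))  # domain-specific part
    return matvec(SHARED_WEIGHTS, hidden)             # shared part

rgb_feature = two_stream_forward([1.0, 2.0], "rgb")
ir_feature = two_stream_forward([1.0, 2.0], "ir")
```

The same input yields different features depending on which stream processes it, which is the manually designed domain-specific modelling the text refers to.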
3.2. Analysis of Network Structures
- Although the three structures discussed above seem different, the authors find, interestingly, that all of them can be represented by a one-stream structure in the forward propagation process when the following assumption holds: Assumption 1.
- On the right is a one-stream network that can be conditionally equivalent to the two-stream one in forward propagation, in which a domain selection sub-network selects the subsequent domain-specific structure.
- The assumption stated above is hard to satisfy in practice.
- Using the above defined categorization, without loss of generality, $x^{(l)}$ can be factorized into three parts, $x^{(l)} = [x^{(l),1spe}; x^{(l),2spe}; x^{(l),s}]$, in which the three components denote the domain1-specific, domain2-specific and shared nodes, respectively.
- In contrast, if one-stream structure can implicitly learn the structure, the implicit structures corresponding to different domains are partially coupled by shared nodes and shared bias parameters (Equations (4) and (5)), which can provide more flexibility in training for cross-modality matching tasks.
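The split of a layer into domain1-specific, domain2-specific and shared nodes can be illustrated with a small sketch. It assumes a working definition that this summary does not state explicitly: a node counts as specific to one domain if its output is (near) zero for every input from the other domain. The function name, the `eps` threshold and the toy activations are all hypothetical.

```python
def categorize_nodes(acts_domain1, acts_domain2, eps=1e-8):
    """Label each node of one layer from per-sample activation vectors.

    Assumed rule: a node that never fires for one domain is specific
    to the other domain; a node that fires for both is shared.
    """
    n_nodes = len(acts_domain1[0])
    labels = []
    for j in range(n_nodes):
        active_1 = any(abs(a[j]) > eps for a in acts_domain1)
        active_2 = any(abs(a[j]) > eps for a in acts_domain2)
        if active_1 and active_2:
            labels.append("shared")
        elif active_1:
            labels.append("domain1-specific")
        elif active_2:
            labels.append("domain2-specific")
        else:
            labels.append("inactive")
    return labels

# Toy activations for a 3-node layer (hypothetical values).
acts_rgb = [[0.5, 0.0, 0.2], [0.1, 0.0, 0.0]]
acts_ir = [[0.0, 0.3, 0.4], [0.0, 0.7, 0.1]]
labels = categorize_nodes(acts_rgb, acts_ir)
specific_ratio = sum(label.endswith("specific") for label in labels) / len(labels)
```

Counting the specific labels per layer gives the kind of proportion of domain-specific nodes that the analysis compares across layers.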
4.1. Analysis of Zero-Padding as Network Input
- In most cases, a one-stream network is applied to single-domain tasks and treats all samples equally, so domain-specific nodes generally may not be learned.
- It would be easier for the neural network to spread the domain-specific nodes in deeper layers.
- The authors' experiments with trained networks empirically support this.
- As shown in Figure 7 and Figure 8, deep zero-padding helps the network learn domain-specific nodes more easily than training without zero-padding.
- The details will be illustrated later in Section 4.2.
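A minimal sketch of zero-padding at the network input is given below. It assumes a two-channel layout in which an RGB image is first converted to one grayscale channel and placed in channel 0, while an IR image is placed in channel 1, with the other channel filled with zeros. The channel order, the grayscale weights (the common BT.601 luma coefficients) and all function names are assumptions made for illustration.

```python
def to_grayscale(rgb_image):
    """Convert an H x W image of (r, g, b) tuples to one grayscale channel."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

def zero_pad(channel, domain):
    """Build a 2-channel input: the unused domain's channel is all zeros."""
    zeros = [[0.0] * len(row) for row in channel]
    if domain == "rgb":
        return [channel, zeros]
    if domain == "ir":
        return [zeros, channel]
    raise ValueError(f"unknown domain: {domain!r}")

def pixelwise_response(padded, weights):
    """Weighted sum over the two channels at every pixel (a 1x1 filter)."""
    ch0, ch1 = padded
    w0, w1 = weights
    return [[w0 * a + w1 * b for a, b in zip(row0, row1)]
            for row0, row1 in zip(ch0, ch1)]

rgb_input = zero_pad(to_grayscale([[(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]]), "rgb")
ir_input = zero_pad([[0.5, 0.25]], "ir")
# For an IR input, only the weight on channel 1 contributes to the response.
ir_response = pixelwise_response(ir_input, (2.0, 4.0))
```

Because the padded channel is always zero, the first-layer weights attached to it contribute nothing for that domain, so a single one-stream filter effectively carries separate weights per domain.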
4.3. Comparison of Cross-Modality Learning
- While the cross-modality matching task has not drawn much attention in the Re-ID problem, it has been studied extensively in other fields such as information retrieval and face verification.
- Cross-modality retrieval (e.g. text-image, tag-image) plays an important role in information retrieval.
- Matching visible-light face images against near-infrared ones (VIS-NIR) [17, 58, 10] is closely related to RGB-IR cross-modality Re-ID.
- The remaining useful cues may be body shape, which differs greatly with different viewpoints and poses.
- In comparison, the authors' zero-padding is done at the raw image level, and the domain-specific and shared learning are done by the deep neural network.
5. Experiments
- The authors conducted extensive evaluations of existing Re-ID and cross-domain matching models as baselines on their SYSU-MM01 dataset.
- Then, the authors evaluated and analysed the effectiveness of deep models, including the proposed deep zero-padding and three network structures discussed in Section 3.
- See Section 2.2 for detailed evaluation protocol.
5.1. Compared Models
- The authors evaluated three widely used handcrafted features and cross-domain metric learning models as baselines.
- The authors evaluated four deep models shown in Figure 3, including the one-stream network, the two-stream network, the asymmetric FC layer network and the proposed deep zero-padding method (whose network structure is the same as the one-stream network).
- All of the hyperparameters were kept the same.
5.2. Model Comparisons and Analysis
- The authors show comparative results in Table 3, including the rank-1, 10, 20 accuracies of CMC [32] and mean average precision (mAP).
- There were gaps among their performances to some extent.
- Table 3 shows that deep zero-padding outperformed the two-stream network and the asymmetric FC layer structure.
- The experiments used the code released by the original authors of the compared methods.
- It is inferior for dealing with the much more challenging RGB-IR cross-modality Re-ID problem.
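The reported metrics, rank-k accuracy of the CMC curve and mean average precision (mAP), can be computed from per-probe ranking lists roughly as follows. This is a generic sketch of the two metrics, not the authors' evaluation code. The function names and toy identity labels are hypothetical, and any dataset-specific rules about which gallery images to include are not modelled.

```python
def rank_k_hit(ranked_labels, probe_label, k):
    """True if a correct identity appears among the top-k gallery results."""
    return probe_label in ranked_labels[:k]

def average_precision(ranked_labels, probe_label):
    """Average of the precision values at each correct match position."""
    hits = 0
    precision_sum = 0.0
    for position, label in enumerate(ranked_labels, start=1):
        if label == probe_label:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0

def evaluate(rankings, probe_labels, ks=(1, 10, 20)):
    """Return ({k: rank-k accuracy}, mAP) over all probes."""
    n = len(rankings)
    pairs = list(zip(rankings, probe_labels))
    cmc = {k: sum(rank_k_hit(r, p, k) for r, p in pairs) / n for k in ks}
    mean_ap = sum(average_precision(r, p) for r, p in pairs) / n
    return cmc, mean_ap

# Toy ranking lists of gallery identity labels (hypothetical).
rankings = [["a", "b", "a"], ["c", "b", "b"]]
probe_labels = ["a", "b"]
cmc, mean_ap = evaluate(rankings, probe_labels, ks=(1, 2))
```

Rank-k accuracy only asks whether some correct match appears in the top k, while mAP also rewards placing every correct match early, which is why both are reported.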
6. Summary
- To the best of the authors' knowledge, this work is the first to identify the RGB-IR cross-modality Re-ID problem and to introduce a new multi-modality Re-ID dataset named SYSU-MM01.
- The great difference between RGB and IR images makes RGB-IR cross-modality Re-ID a very challenging problem.
- The authors have discussed and evaluated three common network structures for cross-domain tasks including one-stream structure, two-stream structure and asymmetric FC layer structure.
- The authors have analysed the connection between one-stream and two-stream structure and found that one-stream network can learn and evolve domain-specific structure implicitly if there exist domain-specific and shared nodes.
- The experiments have shown that the one-stream network trained by deep zero-padding achieved the best performance.
Frequently Asked Questions (12)
Q2. What are the common networks used in the field of Re-ID?
Representative networks include AlexNet [16], VGG [38], GoogleNet [40], ResNet [9] and so on, which perform well in classification, detection, tracking and many other tasks.
Q3. What is the importance of domain-specific modelling for cross-modality matching?
For cross-modality matching tasks, domain-specific modelling is important for extracting shared components for matching because of domain shift.
Q4. What is the main reason for the domain shift?
For cross-modality matching tasks, domain-specific modelling is important for extracting shared features for matching because of the domain shift.
Q5. What is the generalised similarity measure proposed by Lin et al.?
The generalised similarity measure proposed by Lin et al. [26] is designed for cross-domain visual matching tasks, including the RGB-RGB Re-ID task.
Q6. What is the method for learning features?
In some multi-domain learning methods, e.g., HFA [18], CRAFT [3], zero-padding in feature level is applied and proved to be effective.
Q7. What are the types of learning frameworks used for cross-modality?
The former one includes subspace learning methods [25, 30, 60] and deep learning frameworks [45, 6, 14, 12], while the latter one includes linear models [39, 36, 59, 51] and non-linear models [27, 50, 31].
Q8. What is the definition of asymmetric FC layers?
Remark 2. Considering two-stream structure and asymmetric FC layer structure, they are designed manually and fixed during training.
Q9. What is the case for the nodes on the next layer?
It is hard to tell mathematically what it is, but the authors find that with zero-padding as the network input, the nodes in the network are more likely to become domain-specific nodes.
Q10. What is the way to use two-stream structure in a Re-ID task?
Networks with two inputs similar to two-stream structure are also favorable in Re-ID tasks, for example, Ahmed’s net [1], SIR-CIR net [42], gated siamese net [41], etc.
Q11. What is the relation between the proportion of domain-specific nodes and layer depth?
Domain-specific nodes enable the network to convolve image from different domains using different filters, so as to better alleviate the differences (e.g., gradient orientations and exposure differences in Figure 1) between two domains.
Q12. What is the relation between the proportion of domain-specific nodes and layer depth?
Using deep zero-padding helps to generate more domain-specific nodes, while the proportions without zero-padding are low in most layers.