Proceedings ArticleDOI

A Two Stream Siamese Convolutional Neural Network for Person Re-identification

01 Oct 2017-pp 1992-2000
TL;DR: This paper presents a two stream convolutional neural network where each stream is a Siamese network that can learn spatial and temporal information separately and proposes a weighted two stream training objective function which combines the Siamese cost of the spatial and temporal streams with the objective of predicting a person's identity.
Abstract: Person re-identification is an important task in video surveillance systems. It can be formally defined as establishing the correspondence between images of a person taken from different cameras at different times. In this paper, we present a two stream convolutional neural network where each stream is a Siamese network. This architecture can learn spatial and temporal information separately. We also propose a weighted two stream training objective function which combines the Siamese cost of the spatial and temporal streams with the objective of predicting a person's identity. We evaluate our proposed method on the publicly available PRID2011 and iLIDS-VID datasets and demonstrate the efficacy of our proposed method. On average, the top rank matching accuracy is 4% higher than the accuracy achieved by the cross-view quadratic discriminant analysis used in combination with the hierarchical Gaussian descriptor (GOG+XQDA), and 5% higher than the recurrent neural network method.
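The weighted objective described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the contrastive-cost form, and the default weights are all assumptions for exposition.

```python
def contrastive_cost(dist, same_person, margin=2.0):
    """Siamese (contrastive) cost for one pair: pull matching pairs
    together, push non-matching pairs apart up to a margin.
    (Hypothetical form; the paper's exact cost may differ.)"""
    if same_person:
        return dist ** 2
    return max(margin - dist, 0.0) ** 2

def two_stream_objective(d_spatial, d_temporal, identity_ce,
                         same_person, w_s=1.0, w_t=1.0, w_id=0.5):
    """Combine the spatial-stream and temporal-stream Siamese costs
    with an identity-prediction (cross-entropy) term, weighted as the
    abstract outlines. Weights w_s, w_t, w_id are illustrative."""
    return (w_s * contrastive_cost(d_spatial, same_person)
            + w_t * contrastive_cost(d_temporal, same_person)
            + w_id * identity_ce)
```

A matching pair at zero distance with a correct identity prediction incurs zero cost; mismatched pairs are penalized until their distance exceeds the margin.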
Citations
Posted Content
TL;DR: A powerful AGW baseline is designed, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks, and a new evaluation metric (mINP) is introduced, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re- ID system for real applications.
Abstract: Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and increasing demand of intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With the performance saturation under closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re-ID system for real applications. Finally, some important yet under-investigated open issues are discussed.

737 citations


Cites background or methods from "A Two Stream Siamese Convolutional ..."

  • ...We review the deeply learned Re-ID models, including CoSeg [135], GLTR [140], STA [139], ADFD [108], STC [19], DRSA [136], Snippet [138], ETAP [128], DuATM [85], SDM [182], TwoS [130], ASTPN [134], RQEN [178], Forest [133], RNN [129]. [figure residue: Rank-1 (%) of state-of-the-art methods by year, (a) SOTA on PRID-2011 [126], (b) SOTA on iLIDS-VID [6]]...

    [...]

  • ...A weighted scheme for spatial and temporal streams is developed in [130]....

    [...]

  • ...Snippet [138], ETAP [128], DuATM [85], SDM [182], TwoS [130], ASTPN [134], RQEN [178], Forest [133], RNN [129]...

    [...]

Proceedings ArticleDOI
18 Jun 2018
TL;DR: This paper introduces the binary segmentation masks to construct synthetic RGB-Mask pairs as inputs, then designs a mask-guided contrastive attention model (MGCAM) to learn features separately from the body and background regions, and proposes a novel region-level triplet loss to restrain the features learnt from different regions.
Abstract: Person Re-identification (ReID) is an important yet challenging task in computer vision. Due to the diverse background clutters, variations on viewpoints and body poses, it is far from solved. How to extract discriminative and robust features invariant to background clutters is the core problem. In this paper, we first introduce the binary segmentation masks to construct synthetic RGB-Mask pairs as inputs, then we design a mask-guided contrastive attention model (MGCAM) to learn features separately from the body and background regions. Moreover, we propose a novel region-level triplet loss to restrain the features learnt from different regions, i.e., pulling the features from the full image and body region close, whereas pushing the features from backgrounds away. We may be the first one to successfully introduce the binary mask into person ReID task and the first one to propose region-level contrastive learning. We evaluate the proposed method on three public datasets, including MARS, Market-1501 and CUHK03. Extensive experimental results show that the proposed method is effective and achieves the state-of-the-art results. Mask and code will be released upon request.
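The region-level triplet loss described above can be illustrated with a short sketch. This is a hypothetical rendering of the idea, not the MGCAM authors' code: the full-image feature is treated as anchor, the body-region feature as positive, and the background-region feature as negative, with an assumed margin.

```python
import math

def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def region_triplet_loss(f_full, f_body, f_bg, margin=1.0):
    """Region-level triplet loss sketch: pull the full-image and
    body-region features close, push the background feature away
    until it is at least `margin` farther than the body feature."""
    return max(l2(f_full, f_body) - l2(f_full, f_bg) + margin, 0.0)
```

When the background feature is already far from the full-image feature, the hinge saturates at zero and contributes no gradient.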

592 citations


Cites background from "A Two Stream Siamese Convolutional ..."

  • ...In addition, some works try to introduce the pair-wise contrastive loss [14], triplet ranking loss [54] and quadruplet loss [8] to further enhance the IDE feature....

    [...]

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed part loss, which automatically generates several parts for an image, and computes the person classification loss on each part separately, which enforces the deep network to focus on the entire human body and learn discriminative representations for different parts.
Abstract: Learning discriminative representations for unseen person images is critical for person Re-Identification (ReID). Most of current approaches learn deep representations in classification tasks, which essentially minimize the empirical classification risk on the training set. As shown in our experiments, such representations commonly focus on several body parts discriminative to the training set, rather than the entire human body. Inspired by the structural risk minimization principle in SVM, we revise the traditional deep representation learning procedure to minimize both the empirical classification risk and the representation learning risk. The representation learning risk is evaluated by the proposed part loss, which automatically generates several parts for an image, and computes the person classification loss on each part separately. Compared with traditional global classification loss, simultaneously considering multiple part loss enforces the deep network to focus on the entire human body and learn discriminative representations for different parts. Experimental results on three datasets, i.e., Market1501, CUHK03, VIPeR, show that our representation outperforms the existing deep representations.
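The part loss described above can be sketched as a per-part classification loss that is averaged over parts. This is a simplified illustration under assumed names: each part is represented here only by its classification logits, whereas the actual method generates the parts automatically from the feature map.

```python
import math

def softmax_ce(logits, label):
    """Numerically stable softmax cross-entropy for one classifier head."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return -math.log(exps[label] / sum(exps))

def part_loss(part_logits, label):
    """Part loss sketch: compute the person-classification loss on each
    part separately and average, so every body part is pushed to carry
    identity information rather than only a few discriminative parts."""
    return sum(softmax_ce(lg, label) for lg in part_logits) / len(part_logits)
```

A part whose classifier is confidently wrong dominates the average, which is what forces the network to make all parts discriminative.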

356 citations

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors conducted a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization.
Abstract: Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and increasing demand of intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With the performance saturation under closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re-ID system for real applications. Finally, some important yet under-investigated open issues are discussed.

301 citations

Proceedings Article
01 Oct 2018
TL;DR: The proposedFD-GAN achieves state-of-the-art performance on three person reID datasets, which demonstrates that the effectiveness and robust feature distilling capability of the proposed FD-GAN.
Abstract: Person re-identification (reID) is an important task that requires retrieving a person's images from an image dataset, given one image of the person of interest. For learning robust person features, the pose variation of person images is one of the key challenges. Existing works targeting the problem either perform human alignment or learn human-region-based representations. Extra pose information and computational cost is generally required for inference. To solve this issue, a Feature Distilling Generative Adversarial Network (FD-GAN) is proposed for learning identity-related and pose-unrelated representations. It is a novel framework based on a Siamese structure with multiple novel discriminators on human poses and identities. In addition to the discriminators, a novel same-pose loss is also integrated, which requires the appearance of a same person's generated images to be similar. After learning pose-unrelated person features with pose guidance, no auxiliary pose information or additional computational cost is required during testing. Our proposed FD-GAN achieves state-of-the-art performance on three person reID datasets, which demonstrates the effectiveness and robust feature-distilling capability of the proposed FD-GAN.

252 citations


Cites background from "A Two Stream Siamese Convolutional ..."

  • ...Person reID [22, 23, 24, 25, 26, 27, 28, 29] is a challenging task due to various human poses, domain differences, occlusions, etc....

    [...]

References
Journal ArticleDOI
28 May 2015-Nature
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

46,982 citations

Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
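The train/test asymmetry described in the abstract can be sketched in a few lines. This is a minimal illustration (drop probability `p`, seeded RNG, and plain lists are assumptions): units are zeroed randomly during training, while at test time nothing is dropped and activations are scaled by the keep probability to approximate averaging over the thinned networks.

```python
import random

def dropout_train(activations, p=0.5, seed=0):
    """During training, drop each unit independently with probability p,
    sampling one 'thinned' network per forward pass."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p=0.5):
    """At test time, keep all units and scale activations by the keep
    probability (1 - p), approximating the average prediction of the
    exponentially many thinned networks."""
    return [a * (1.0 - p) for a in activations]
```

(Modern frameworks usually use the equivalent "inverted dropout", which scales by 1/(1-p) at training time instead; the expected activation is the same.)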

33,597 citations


"A Two Stream Siamese Convolutional ..." refers methods in this paper

  • ...Dropout [40] is also used to reduce the model over-fitting....

    [...]

Proceedings Article
24 Aug 1981
TL;DR: In this paper, the spatial intensity gradient of the images is used to find a good match using a type of Newton-Raphson iteration, which can be generalized to handle rotation, scaling and shearing.
Abstract: Image registration finds a variety of applications in computer vision. Unfortunately, traditional image registration techniques tend to be costly. We present a new image registration technique that makes use of the spatial intensity gradient of the images to find a good match using a type of Newton-Raphson iteration. Our technique is faster because it examines far fewer potential matches between the images than existing techniques. Furthermore, this registration technique can be generalized to handle rotation, scaling and shearing. We show how our technique can be adapted for use in a stereo vision system.

12,944 citations

Proceedings Article
08 Dec 2014
TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Abstract: We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.

6,397 citations


"A Two Stream Siamese Convolutional ..." refers background or methods in this paper

  • ...The effectiveness of using optical flow to learn temporal features are demonstrated in [20, 21]....

    [...]

  • ...To address this limitation, we propose the use of a two stream convolutional neural network (CNN) [21] with weighted objective function where each stream has Siamese structure [22]....

    [...]

Proceedings Article
Jane Bromley, Isabelle Guyon, Yann LeCun, E. Sackinger, Roopak Shah
29 Nov 1993
TL;DR: An algorithm for verification of signatures written on a pen-input tablet based on a novel, artificial neural network called a "Siamese" neural network, which consists of two identical sub-networks joined at their outputs.
Abstract: This paper describes an algorithm for verification of signatures written on a pen-input tablet. The algorithm is based on a novel, artificial neural network, called a "Siamese" neural network. This network consists of two identical sub-networks joined at their outputs. During training the two sub-networks extract features from two signatures, while the joining neuron measures the distance between the two feature vectors. Verification consists of comparing an extracted feature vector with a stored feature vector for the signer. Signatures closer to this stored representation than a chosen threshold are accepted, all other signatures are rejected as forgeries.
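The verification scheme described above can be sketched as follows. This is a toy illustration of the Siamese property, with a trivial linear "sub-network" standing in for the real one; the function names, weights, and threshold are assumptions, not the paper's model.

```python
import math

def shared_subnetwork(x, weights):
    """Toy shared-weight feature extractor: both inputs pass through the
    same function with the same weights -- the defining Siamese property."""
    return [w * v for w, v in zip(weights, x)]

def verify(candidate, stored_feature, weights, threshold=1.0):
    """Accept the signature if its extracted feature vector is closer to
    the stored reference than a chosen threshold; otherwise reject it
    as a forgery, as the abstract describes."""
    feat = shared_subnetwork(candidate, weights)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat, stored_feature)))
    return dist < threshold
```

The same structure underlies the Siamese streams in the citing paper: two weight-tied sub-networks whose output distance measures similarity between the two inputs.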

2,980 citations


"A Two Stream Siamese Convolutional ..." refers background or methods in this paper

  • ...Siamese networks are composed of two sub-networks with shared weights [22]....

    [...]

  • ...Siamese CNNs contain two identical sub-networks with shared weights and are suitable for tasks which involve finding the similarity between two comparable inputs [22]....

    [...]

  • ...To address this limitation, we propose the use of a two stream convolutional neural network (CNN) [21] with weighted objective function where each stream has Siamese structure [22]....

    [...]