Proceedings ArticleDOI

HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis

TL;DR: A new attention-based deep neural network, named HydraPlus-Net (HP-net), that multi-directionally feeds multi-level attention maps to different feature layers to enrich the final feature representation of a pedestrian image.
Abstract: Pedestrian analysis plays a vital role in intelligent video surveillance and is a key component of security-centric computer vision systems. Although convolutional neural networks are remarkable at learning discriminative features from images, learning comprehensive pedestrian features for fine-grained tasks remains an open problem. In this study, we propose a new attention-based deep neural network, named HydraPlus-Net (HP-net), that multi-directionally feeds multi-level attention maps to different feature layers. The attentive deep features learned from the proposed HP-net bring unique advantages: (1) the model is capable of capturing multiple attentions from low level to semantic level, and (2) it exploits the multi-scale selectiveness of attentive features to enrich the final feature representation of a pedestrian image. We demonstrate the effectiveness and generality of the proposed HP-net for pedestrian analysis on two tasks, i.e. pedestrian attribute recognition and person re-identification. Extensive experimental results show that the HP-net outperforms state-of-the-art methods on various datasets.
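To make the multi-directional feeding concrete, here is a minimal PyTorch-style sketch of the idea (module names, channel sizes, and the single-channel attention are illustrative assumptions, not the authors' implementation): an attention map generated from one feature block is applied to its own block and to adjacent lower- and higher-level blocks, and the attended features are fused into one descriptor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDASketch(nn.Module):
    """Minimal sketch of multi-directional attention: an attention map
    produced from the middle block is applied to low-, mid- and high-level
    feature maps. Illustrative simplification, not the HP-net architecture."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        c_low, c_mid, c_high = channels
        self.block1 = nn.Sequential(nn.Conv2d(3, c_low, 3, stride=2, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(c_low, c_mid, 3, stride=2, padding=1), nn.ReLU())
        self.block3 = nn.Sequential(nn.Conv2d(c_mid, c_high, 3, stride=2, padding=1), nn.ReLU())
        # attention generated from the mid-level block (single channel for simplicity)
        self.att = nn.Conv2d(c_mid, 1, kernel_size=1)

    def forward(self, x):
        f1 = self.block1(x)                    # low-level features
        f2 = self.block2(f1)                   # mid-level features
        f3 = self.block3(f2)                   # high-level features
        a = torch.sigmoid(self.att(f2))        # attention map from the mid block
        # feed the same attention map to adjacent levels (resized to match)
        a1 = F.interpolate(a, size=f1.shape[-2:], mode='bilinear', align_corners=False)
        a3 = F.interpolate(a, size=f3.shape[-2:], mode='bilinear', align_corners=False)
        g1 = (a1 * f1).mean(dim=(2, 3))        # global-average-pool attended features
        g2 = (a * f2).mean(dim=(2, 3))
        g3 = (a3 * f3).mean(dim=(2, 3))
        return torch.cat([g1, g2, g3], dim=1)  # concatenated multi-level descriptor
```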
Citations
Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, a Part-based Convolutional Baseline (PCB) is proposed to learn discriminative part-informed features for person retrieval. Two contributions are made: (i) the PCB network, which outputs a convolutional descriptor consisting of several part-level features, and (ii) a refined part pooling (RPP) method that enhances within-part consistency.
Abstract: Employing part-level features offers fine-grained information for pedestrian image description. A prerequisite of part discovery is that each part should be well located. Instead of using external resources like pose estimator, we consider content consistency within each part for precise part location. Specifically, we target at learning discriminative part-informed features for person retrieval and make two contributions. (i) A network named Part-based Convolutional Baseline (PCB). Given an image input, it outputs a convolutional descriptor consisting of several part-level features. With a uniform partition strategy, PCB achieves competitive results with the state-of-the-art methods, proving itself as a strong convolutional baseline for person retrieval. (ii) A refined part pooling (RPP) method. Uniform partition inevitably incurs outliers in each part, which are in fact more similar to other parts. RPP re-assigns these outliers to the parts they are closest to, resulting in refined parts with enhanced within-part consistency. Experiment confirms that RPP allows PCB to gain another round of performance boost. For instance, on the Market-1501 dataset, we achieve (77.4+4.2)% mAP and (92.3+1.5)% rank-1 accuracy, surpassing the state of the art by a large margin. Code is available at: https://github.com/syfafterzy/PCB_RPP
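A minimal sketch of the uniform-partition idea (simplified, with assumed names and sizes such as `num_parts=6` and 751 training identities; not the released code): the backbone's final feature map is pooled into horizontal stripes, each stripe is reduced to a compact part feature, and each part is classified independently.

```python
import torch
import torch.nn as nn

class UniformPartPooling(nn.Module):
    """Sketch of PCB-style uniform partition: split the backbone feature map
    into horizontal stripes and classify each stripe independently."""
    def __init__(self, in_channels=2048, num_parts=6, reduced_dim=256, num_ids=751):
        super().__init__()
        self.num_parts = num_parts
        self.pool = nn.AdaptiveAvgPool2d((num_parts, 1))      # one vector per stripe
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_channels, reduced_dim, 1) for _ in range(num_parts)])
        self.classifiers = nn.ModuleList(
            [nn.Linear(reduced_dim, num_ids) for _ in range(num_parts)])

    def forward(self, feat_map):                               # feat_map: (B, C, H, W)
        stripes = self.pool(feat_map)                          # (B, C, num_parts, 1)
        logits, parts = [], []
        for i in range(self.num_parts):
            g = stripes[:, :, i:i + 1, :]                      # i-th stripe, (B, C, 1, 1)
            h = self.reduce[i](g).flatten(1)                   # (B, reduced_dim)
            parts.append(h)
            logits.append(self.classifiers[i](h))
        # at test time the concatenated part features serve as the descriptor
        return torch.cat(parts, dim=1), logits
```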

1,633 citations

Proceedings ArticleDOI
Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, Xi Zhou
15 Oct 2018
TL;DR: Comprehensive experiments on the mainstream evaluation datasets, including Market-1501, DukeMTMC-reid and CUHK03, indicate that the proposed end-to-end feature learning strategy robustly achieves state-of-the-art performance and outperforms existing approaches by a large margin.
Abstract: The combination of global and partial features has been an essential solution for improving discriminative performance in person re-identification (Re-ID) tasks. Previous part-based methods mainly focus on locating regions with specific pre-defined semantics to learn local representations, which increases learning difficulty and is neither efficient nor robust in scenarios with large variances. In this paper, we propose an end-to-end feature learning strategy integrating discriminative information with various granularities. We carefully design the Multiple Granularity Network (MGN), a multi-branch deep network architecture consisting of one branch for global feature representations and two branches for local feature representations. Instead of learning on semantic regions, we uniformly partition the images into several stripes and vary the number of parts in different local branches to obtain local feature representations with multiple granularities. Comprehensive experiments on the mainstream evaluation datasets, including Market-1501, DukeMTMC-reid and CUHK03, indicate that our method robustly achieves state-of-the-art performance and outperforms existing approaches by a large margin. For example, on the Market-1501 dataset in single query mode, we obtain a top result of Rank-1/mAP = 96.6%/94.2% after re-ranking.
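The multiple-granularity idea can be illustrated with a short sketch (the shared backbone and the softmax/triplet losses are omitted, and the dummy tensor below is an assumption): one branch keeps a single global vector, while other branches pool the same kind of feature map into 2 and 3 horizontal stripes.

```python
import torch
import torch.nn.functional as F

def stripe_pool(feat_map, num_stripes):
    """Average-pool a feature map into `num_stripes` horizontal part vectors
    (num_stripes=1 yields a single global vector)."""
    pooled = F.adaptive_avg_pool2d(feat_map, (num_stripes, 1))   # (B, C, S, 1)
    return [pooled[:, :, i, 0] for i in range(num_stripes)]

# Illustrative combination of three branches of different granularity;
# this dummy tensor stands in for the (omitted) branch backbones.
feat = torch.randn(4, 2048, 24, 8)
parts = stripe_pool(feat, 1) + stripe_pool(feat, 2) + stripe_pool(feat, 3)
descriptor = torch.cat(parts, dim=1)          # global + 2-stripe + 3-stripe features
```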

1,050 citations


Cites background or methods from "HydraPlus-Net: Attentive Deep Featu..."

  • ...Part-based methods for person Re-ID can be divided into three main pathways according to their part locating methods: 1) Locating part regions with strong structural information such as empirical knowledge about human bodies [8, 21, 36, 43] or strong learning-based pose information [33, 44]; 2) Locating part regions by region proposal methods [19, 41]; 3) Enhancing features by middle-level attention on salient partitions [22, 24, 25, 45]....

    [...]

  • ...Softmax loss is almost the only choice of classification loss function for its strong robustness to various kinds of multi-class classification tasks, which can be used individually [1, 19, 22, 25, 36, 39, 41, 47] or combined with other losses [3, 8, 20, 43] in embedding learning procedures for Re-ID....

    [...]

  • ...Attention information can be a powerful complement for discrimination, which are enhanced in [22, 24, 25]....

    [...]

  • ...To locate semantic partitions without strongly learning-based predictors, region proposal methods such as [11, 18] are employed in some part-based methods [19, 22, 25, 41, 45]....

    [...]

Proceedings ArticleDOI
22 Feb 2018
TL;DR: A novel Harmonious Attention CNN (HA-CNN) model is formulated for joint learning of soft pixel attention and hard regional attention along with simultaneous optimisation of feature representations, dedicated to optimise person re-id in uncontrolled (misaligned) images.
Abstract: Existing person re-identification (re-id) methods either assume the availability of well-aligned person bounding box images as model input or rely on constrained attention selection mechanisms to calibrate misaligned images. They are therefore sub-optimal for re-id matching in arbitrarily aligned person images potentially with large human pose variations and unconstrained auto-detection errors. In this work, we show the advantages of jointly learning attention selection and feature representation in a Convolutional Neural Network (CNN) by maximising the complementary information of different levels of visual attention subject to re-id discriminative learning constraints. Specifically, we formulate a novel Harmonious Attention CNN (HA-CNN) model for joint learning of soft pixel attention and hard regional attention along with simultaneous optimisation of feature representations, dedicated to optimise person re-id in uncontrolled (misaligned) images. Extensive comparative evaluations validate the superiority of this new HA-CNN model for person re-id over a wide variety of state-of-the-art methods on three large-scale benchmarks including CUHK03, Market-1501, and DukeMTMC-ReID.
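As a rough illustration of the soft-attention half of the idea (a sketch only; the hard regional attention, the harmonious fusion, and the actual HA-CNN blocks are not reproduced, and the module names are assumptions): a per-pixel saliency map and a per-channel gate jointly re-weight a feature map.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Sketch of soft attention: a spatial map and a channel gate jointly
    re-weight a feature map. Illustrative, not the HA-CNN implementation."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)     # per-pixel saliency
        self.channel = nn.Sequential(                            # squeeze-and-gate channels
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, x):
        s = torch.sigmoid(self.spatial(x))   # (B, 1, H, W) spatial attention
        c = self.channel(x)                  # (B, C, 1, 1) channel attention
        return x * s * c                     # re-weighted features
```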

1,020 citations


Cites background from "HydraPlus-Net: Attentive Deep Featu..."

  • ...While soft re-id attention modelling is considered in [24], this model assumes tight person boxes and is thus less suitable for poor detections....

    [...]

Posted Content
TL;DR: A Part-based Convolutional Baseline (PCB) is proposed that employs part-level features for pedestrian image description and learns discriminative part-informed features for person retrieval, together with a refined part pooling (RPP) method.
Abstract: Employing part-level features for pedestrian image description offers fine-grained information and has been verified as beneficial for person retrieval in very recent literature. A prerequisite of part discovery is that each part should be well located. Instead of using external cues, e.g., pose estimation, to directly locate parts, this paper lays emphasis on the content consistency within each part. Specifically, we target at learning discriminative part-informed features for person retrieval and make two contributions. (i) A network named Part-based Convolutional Baseline (PCB). Given an image input, it outputs a convolutional descriptor consisting of several part-level features. With a uniform partition strategy, PCB achieves competitive results with the state-of-the-art methods, proving itself as a strong convolutional baseline for person retrieval. (ii) A refined part pooling (RPP) method. Uniform partition inevitably incurs outliers in each part, which are in fact more similar to other parts. RPP re-assigns these outliers to the parts they are closest to, resulting in refined parts with enhanced within-part consistency. Experiment confirms that RPP allows PCB to gain another round of performance boost. For instance, on the Market-1501 dataset, we achieve (77.4+4.2)% mAP and (92.3+1.5)% rank-1 accuracy, surpassing the state of the art by a large margin.

778 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, a Transferable Joint Attribute-Identity Deep Learning (TJ-AIDL) model is proposed to simultaneously learn an attribute-semantic and identity-discriminative feature representation space transferable to any new (unseen) target domain for re-id tasks, without the need for collecting new labelled training data from the target domain.
Abstract: Most existing person re-identification (re-id) methods require supervised model learning from a separate large set of pairwise labelled training data for every single camera pair. This significantly limits their scalability and usability in real-world large-scale deployments, which need to perform re-id across many camera views. To address this scalability problem, we develop a novel deep learning method for transferring the labelled information of an existing dataset to a new unseen (unlabelled) target domain for person re-id without any supervised learning in the target domain. Specifically, we introduce a Transferable Joint Attribute-Identity Deep Learning (TJ-AIDL) model for simultaneously learning an attribute-semantic and identity-discriminative feature representation space transferable to any new (unseen) target domain for re-id tasks, without the need for collecting new labelled training data from the target domain (i.e. unsupervised learning in the target domain). Extensive comparative evaluations validate the superiority of this new TJ-AIDL model for unsupervised person re-id over a wide range of state-of-the-art methods on four challenging benchmarks, including VIPeR, PRID, Market-1501, and DukeMTMC-ReID.
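A heavily simplified sketch of joint attribute-identity learning on a shared backbone (only the two supervised heads are shown; the transfer/adaptation machinery that lets TJ-AIDL generalise to unseen domains is not modelled here, and all names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeIdentityNet(nn.Module):
    """Sketch of joint attribute and identity learning on a shared backbone;
    an illustration of the joint-supervision idea, not the TJ-AIDL model."""
    def __init__(self, backbone, feat_dim, num_ids, num_attrs):
        super().__init__()
        self.backbone = backbone                         # any CNN producing feat_dim features
        self.id_head = nn.Linear(feat_dim, num_ids)      # identity-discriminative branch
        self.attr_head = nn.Linear(feat_dim, num_attrs)  # attribute-semantic branch

    def forward(self, x):
        f = self.backbone(x)
        return self.id_head(f), self.attr_head(f)

def joint_loss(id_logits, attr_logits, id_labels, attr_labels, w=1.0):
    # identity: multi-class cross-entropy; attributes: multi-label BCE (float targets)
    return (F.cross_entropy(id_logits, id_labels)
            + w * F.binary_cross_entropy_with_logits(attr_logits, attr_labels))
```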

568 citations

References
Posted Content
Sergey Ioffe, Christian Szegedy
TL;DR: Batch Normalization normalizes layer inputs for each training mini-batch to reduce internal covariate shift in deep neural networks, allowing higher learning rates and achieving state-of-the-art performance on ImageNet.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
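The core transform can be written in a few lines (a minimal sketch of the training-time computation for fully connected activations; the running statistics used at inference are omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalization for one mini-batch of activations x with shape
    (batch, features): normalize per feature, then scale and shift with the
    learnable parameters gamma and beta."""
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta              # restore representational power

# Example: a mini-batch of 4 samples with 3 features
x = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 0.0, 0.0],
              [1.0, 2.0, 3.0]])
y = batch_norm_train(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.var(axis=0))         # ~0 mean, ~1 variance per feature
```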

17,184 citations

Journal ArticleDOI
TL;DR: This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates.
Abstract: This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
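A toy sketch of the REINFORCE rule for a single Bernoulli-logistic unit (the reward function and learning rate here are assumptions chosen for illustration): the weight update follows the gradient of the log-probability of the sampled action, scaled by the received reinforcement minus a baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_update(w, x, lr=0.1, baseline=0.0):
    """One REINFORCE step: sample an action from the stochastic unit, observe
    a reward, and move the weights along grad log P(action) * (reward - baseline).
    The toy reward (1 when the unit fires) is a hypothetical task."""
    p = 1.0 / (1.0 + np.exp(-w @ x))          # firing probability
    y = rng.random() < p                      # stochastic action (0 or 1)
    reward = 1.0 if y else 0.0                # hypothetical reinforcement signal
    grad_logp = (float(y) - p) * x            # d log P(y) / d w for a Bernoulli-logistic unit
    return w + lr * (reward - baseline) * grad_logp

w = np.zeros(3)
x = np.array([1.0, 0.5, -0.5])
for _ in range(200):                           # the unit learns to fire on this input
    w = reinforce_update(w, x)
print(1.0 / (1.0 + np.exp(-w @ x)))            # firing probability approaches 1
```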

7,930 citations


"HydraPlus-Net: Attentive Deep Featu..." refers background in this paper

  • ...Compared to non-differentiable hard attention trained by reinforce algorithms [28], soft attention which weights the feature maps is differentiable and can be trained by back propagation....

    [...]

Proceedings Article
06 Jul 2015
TL;DR: An attention based model that automatically learns to describe the content of images is introduced that can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.
Abstract: Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr9k, Flickr30k and MS COCO.
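A minimal sketch of the deterministic ("soft") attention step described here (shapes and parameter names are assumptions; the full captioning decoder is omitted): each spatial location is scored against the decoder state, the scores are softmaxed into weights, and the context is their weighted average, so the whole step is differentiable and trainable by backpropagation.

```python
import torch
import torch.nn.functional as F

def soft_attention(features, hidden, W_f, W_h, v):
    """One step of soft attention: score each image region against the decoder
    hidden state, softmax the scores, and return the expected context vector.
    W_f, W_h and v stand for learnable parameters (names assumed)."""
    # features: (L, D) image regions; hidden: (H,) decoder state
    scores = torch.tanh(features @ W_f + hidden @ W_h) @ v   # (L,) alignment scores
    alpha = F.softmax(scores, dim=0)                         # attention weights, sum to 1
    context = alpha @ features                               # (D,) expected context
    return context, alpha

# Illustrative shapes: 196 regions of 512-d features, 256-d hidden state
features = torch.randn(196, 512)
hidden = torch.randn(256)
W_f, W_h, v = torch.randn(512, 128), torch.randn(256, 128), torch.randn(128)
context, alpha = soft_attention(features, hidden, W_f, W_h, v)
```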

6,485 citations


"HydraPlus-Net: Attentive Deep Featu..." refers methods or result in this paper

  • ...Most previous studies [34, 21] merely demonstrated the effectiveness of an attention-based model with a limited number of channels...

    [...]

  • ...Attention models In computer vision, attention models have been used in tasks such as image caption generation [34], visual question answering [18, 33] and object detection [2]....

    [...]

  • ...Moreover, the MDA module also differs from the traditional attention-based models [21, 34] that push the attention map back to the same block, and it extends this mechanism by applying the attention maps to adjacent blocks, as shown in lines with varying hot colors in Fig....

    [...]

Book ChapterDOI
08 Oct 2016
TL;DR: This work introduces a novel convolutional network architecture for the task of human pose estimation that is described as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.
Abstract: This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
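The repeated bottom-up, top-down processing can be sketched as a recursive module (an illustrative simplification that replaces the paper's residual blocks with plain convolutions; names and depths are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    """Sketch of one hourglass: pool down recursively, process at the lower
    resolution, upsample, and add a skip branch at each scale."""
    def __init__(self, channels, depth=4):
        super().__init__()
        self.depth = depth
        self.skip = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1)
                                   for _ in range(depth)])
        self.down = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1)
                                   for _ in range(depth)])
        self.bottom = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, level=0):
        if level == self.depth:
            return self.bottom(x)                           # coarsest resolution
        skip = self.skip[level](x)                          # branch at current scale
        low = F.max_pool2d(x, 2)                            # bottom-up: pool
        low = self.down[level](low)
        low = self.forward(low, level + 1)                  # recurse to coarser scales
        up = F.interpolate(low, scale_factor=2)             # top-down: upsample
        return up + skip                                    # merge across scales
```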

3,865 citations


"HydraPlus-Net: Attentive Deep Featu..." refers methods or result in this paper

  • ...Most previous studies [34, 21] merely demonstrated the effectiveness of an attention-based model with a limited number of channels...

    [...]

  • ...Moreover, the MDA module also differs from the traditional attention-based models [21, 34] that push the attention map back to the same block, and it extends this mechanism by applying the attention maps to adjacent blocks, as shown in lines with varying hot colors in Fig....

    [...]

Proceedings ArticleDOI
07 Dec 2015
TL;DR: As a minor contribution, inspired by recent advances in large-scale image search, an unsupervised Bag-of-Words descriptor is proposed that yields competitive accuracy on the VIPeR, CUHK03, and Market-1501 datasets and is scalable to the large-scale 500K dataset.
Abstract: This paper contributes a new high quality dataset for person re-identification, named "Market-1501". Generally, current datasets: 1) are limited in scale, 2) consist of hand-drawn bboxes, which are unavailable under realistic settings, 3) have only one ground truth and one query image for each identity (close environment). To tackle these problems, the proposed Market-1501 dataset is featured in three aspects. First, it contains over 32,000 annotated bboxes, plus a distractor set of over 500K images, making it the largest person re-id dataset to date. Second, images in Market-1501 dataset are produced using the Deformable Part Model (DPM) as pedestrian detector. Third, our dataset is collected in an open system, where each identity has multiple images under each camera. As a minor contribution, inspired by recent advances in large-scale image search, this paper proposes an unsupervised Bag-of-Words descriptor. We view person re-identification as a special task of image search. In experiment, we show that the proposed descriptor yields competitive accuracy on VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large-scale 500k dataset.
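Because each identity has multiple ground-truth gallery images, evaluation in this setting typically reports mean average precision alongside rank-1 accuracy; here is a minimal sketch of per-query average precision (function and variable names are assumptions):

```python
import numpy as np

def average_precision(good_index, ranked_gallery):
    """AP for one query: ranked_gallery is the gallery ordered by descending
    similarity, good_index the set of correct matches."""
    hits, precisions = 0, []
    for rank, g in enumerate(ranked_gallery, start=1):
        if g in good_index:
            hits += 1
            precisions.append(hits / rank)      # precision at each correct hit
    return np.mean(precisions) if precisions else 0.0

# Toy example: correct matches appear at ranks 1 and 3
print(average_precision({"a", "b"}, ["a", "x", "b", "y"]))   # 0.8333...
print(["a", "x", "b", "y"][0] in {"a", "b"})                 # rank-1 hit -> True
```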

3,564 citations


"HydraPlus-Net: Attentive Deep Featu..." refers methods in this paper

  • ...The proposed approach is evaluated on three publicly standard datasets, including CUHK03 [16], VIPeR [8], and Market-1501 [38]....

    [...]

  • ...The Market-1501 [38] dataset is also evaluated with the metric learning WARCA-L [11], a novel Siamese LSTM architecture LOMO+CN [27], the Siamese CNN with learnable gate S-CNN [26], and the bag of words model BoW-best [38]....

    [...]

  • ...For the Market-1501 dataset, the same data separation strategy is used as [38]....

    [...]

  • ...From Table 4, we can observe that the proposed approach achieves the Top-1 accuracies of 91.8%, 56.6% and 76.9% on the CUHK03, ViPeR and Market-1501 datasets, respectively, and it achieves the state-of-the-art performance on all the three datasets....

    [...]

  • ...Besides the quantitative results in [...]:

        Dataset            Market-1501 [38]   CUHK03 [16]   VIPeR [8]
        # identities       1501               1360          632
        # images           32643              13164         1264
        # cameras          6                  2             2
        # training IDs     750                1160          316
        # test IDs         751                100           316
        # probe images     3368               100           316
        # gallery images   19732              100           316...

    [...]