Proceedings ArticleDOI

HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis

TL;DR: A new attention-based deep neural network, named HydraPlus-Net (HP-net), that multi-directionally feeds multi-level attention maps to different feature layers to enrich the final feature representation of a pedestrian image.
Abstract: Pedestrian analysis plays a vital role in intelligent video surveillance and is a key component of security-centric computer vision systems. Although convolutional neural networks are remarkable at learning discriminative features from images, learning comprehensive pedestrian features for fine-grained tasks remains an open problem. In this study, we propose a new attention-based deep neural network, named HydraPlus-Net (HP-net), that multi-directionally feeds multi-level attention maps to different feature layers. The attentive deep features learned from the proposed HP-net bring unique advantages: (1) the model is capable of capturing multiple attentions from low level to semantic level, and (2) it exploits the multi-scale selectiveness of attentive features to enrich the final feature representation of a pedestrian image. We demonstrate the effectiveness and generality of the proposed HP-net for pedestrian analysis on two tasks, i.e. pedestrian attribute recognition and person re-identification. Extensive experimental results show that the HP-net outperforms state-of-the-art methods on various datasets.
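To make the multi-directional feeding concrete, here is a minimal PyTorch-style sketch of the idea (module names, channel sizes, and the single-channel attention are illustrative assumptions, not the authors' implementation): an attention map generated from one feature block is applied to its own block and to adjacent lower- and higher-level blocks, and the attended features are fused into one descriptor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDASketch(nn.Module):
    """Minimal sketch of multi-directional attention: an attention map
    produced from the middle block is applied to low-, mid- and high-level
    feature maps. Illustrative simplification, not the HP-net architecture."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        c_low, c_mid, c_high = channels
        self.block1 = nn.Sequential(nn.Conv2d(3, c_low, 3, stride=2, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(c_low, c_mid, 3, stride=2, padding=1), nn.ReLU())
        self.block3 = nn.Sequential(nn.Conv2d(c_mid, c_high, 3, stride=2, padding=1), nn.ReLU())
        # attention generated from the mid-level block (single channel for simplicity)
        self.att = nn.Conv2d(c_mid, 1, kernel_size=1)

    def forward(self, x):
        f1 = self.block1(x)                    # low-level features
        f2 = self.block2(f1)                   # mid-level features
        f3 = self.block3(f2)                   # high-level features
        a = torch.sigmoid(self.att(f2))        # attention map from the mid block
        # feed the same attention map to adjacent levels (resized to match)
        a1 = F.interpolate(a, size=f1.shape[-2:], mode='bilinear', align_corners=False)
        a3 = F.interpolate(a, size=f3.shape[-2:], mode='bilinear', align_corners=False)
        g1 = (a1 * f1).mean(dim=(2, 3))        # global-average-pool attended features
        g2 = (a * f2).mean(dim=(2, 3))
        g3 = (a3 * f3).mean(dim=(2, 3))
        return torch.cat([g1, g2, g3], dim=1)  # concatenated multi-level descriptor
```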
Citations
Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, a Part-based Convolutional Baseline (PCB) is proposed to learn discriminative part-informed features for person retrieval. Two contributions are made: (i) the PCB network, which outputs a convolutional descriptor consisting of several part-level features, and (ii) a refined part pooling (RPP) method that enhances within-part consistency.
Abstract: Employing part-level features offers fine-grained information for pedestrian image description. A prerequisite of part discovery is that each part should be well located. Instead of using external resources like pose estimator, we consider content consistency within each part for precise part location. Specifically, we target at learning discriminative part-informed features for person retrieval and make two contributions. (i) A network named Part-based Convolutional Baseline (PCB). Given an image input, it outputs a convolutional descriptor consisting of several part-level features. With a uniform partition strategy, PCB achieves competitive results with the state-of-the-art methods, proving itself as a strong convolutional baseline for person retrieval. (ii) A refined part pooling (RPP) method. Uniform partition inevitably incurs outliers in each part, which are in fact more similar to other parts. RPP re-assigns these outliers to the parts they are closest to, resulting in refined parts with enhanced within-part consistency. Experiment confirms that RPP allows PCB to gain another round of performance boost. For instance, on the Market-1501 dataset, we achieve (77.4+4.2)% mAP and (92.3+1.5)% rank-1 accuracy, surpassing the state of the art by a large margin. Code is available at: https://github.com/syfafterzy/PCB_RPP
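A minimal sketch of the uniform-partition idea (simplified, with assumed names and sizes such as `num_parts=6` and 751 training identities; not the released code): the backbone's final feature map is pooled into horizontal stripes, each stripe is reduced to a compact part feature, and each part is classified independently.

```python
import torch
import torch.nn as nn

class UniformPartPooling(nn.Module):
    """Sketch of PCB-style uniform partition: split the backbone feature map
    into horizontal stripes and classify each stripe independently."""
    def __init__(self, in_channels=2048, num_parts=6, reduced_dim=256, num_ids=751):
        super().__init__()
        self.num_parts = num_parts
        self.pool = nn.AdaptiveAvgPool2d((num_parts, 1))      # one vector per stripe
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_channels, reduced_dim, 1) for _ in range(num_parts)])
        self.classifiers = nn.ModuleList(
            [nn.Linear(reduced_dim, num_ids) for _ in range(num_parts)])

    def forward(self, feat_map):                               # feat_map: (B, C, H, W)
        stripes = self.pool(feat_map)                          # (B, C, num_parts, 1)
        logits, parts = [], []
        for i in range(self.num_parts):
            g = stripes[:, :, i:i + 1, :]                      # i-th stripe, (B, C, 1, 1)
            h = self.reduce[i](g).flatten(1)                   # (B, reduced_dim)
            parts.append(h)
            logits.append(self.classifiers[i](h))
        # at test time the concatenated part features serve as the descriptor
        return torch.cat(parts, dim=1), logits
```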

1,633 citations

Proceedings ArticleDOI
Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, Xi Zhou
15 Oct 2018
TL;DR: Comprehensive experiments on the mainstream evaluation datasets, including Market-1501, DukeMTMC-reid and CUHK03, indicate that the proposed end-to-end feature learning strategy robustly achieves state-of-the-art performance and outperforms existing approaches by a large margin.
Abstract: The combination of global and partial features has been an essential solution for improving discriminative performance in person re-identification (Re-ID) tasks. Previous part-based methods mainly focus on locating regions with specific pre-defined semantics to learn local representations, which increases learning difficulty and is neither efficient nor robust in scenarios with large variances. In this paper, we propose an end-to-end feature learning strategy integrating discriminative information with various granularities. We carefully design the Multiple Granularity Network (MGN), a multi-branch deep network architecture consisting of one branch for global feature representations and two branches for local feature representations. Instead of learning on semantic regions, we uniformly partition the images into several stripes and vary the number of parts in different local branches to obtain local feature representations with multiple granularities. Comprehensive experiments on the mainstream evaluation datasets, including Market-1501, DukeMTMC-reid and CUHK03, indicate that our method robustly achieves state-of-the-art performance and outperforms existing approaches by a large margin. For example, on the Market-1501 dataset in single query mode, we obtain a top result of Rank-1/mAP = 96.6%/94.2% after re-ranking.
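The multiple-granularity idea can be illustrated with a short sketch (the shared backbone and the softmax/triplet losses are omitted, and the dummy tensor below is an assumption): one branch keeps a single global vector, while other branches pool the same kind of feature map into 2 and 3 horizontal stripes.

```python
import torch
import torch.nn.functional as F

def stripe_pool(feat_map, num_stripes):
    """Average-pool a feature map into `num_stripes` horizontal part vectors
    (num_stripes=1 yields a single global vector)."""
    pooled = F.adaptive_avg_pool2d(feat_map, (num_stripes, 1))   # (B, C, S, 1)
    return [pooled[:, :, i, 0] for i in range(num_stripes)]

# Illustrative combination of three branches of different granularity;
# this dummy tensor stands in for the (omitted) branch backbones.
feat = torch.randn(4, 2048, 24, 8)
parts = stripe_pool(feat, 1) + stripe_pool(feat, 2) + stripe_pool(feat, 3)
descriptor = torch.cat(parts, dim=1)          # global + 2-stripe + 3-stripe features
```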

1,050 citations


Cites background or methods from "HydraPlus-Net: Attentive Deep Featu..."

  • ...Part-based methods for person Re-ID can be divided into three main pathways according to their part locating methods: 1) Locating part regions with strong structural information such as empirical knowledge about human bodies [8, 21, 36, 43] or strong learning-based pose information [33, 44]; 2) Locating part regions by region proposal methods [19, 41]; 3) Enhancing features by middle-level attention on salient partitions [22, 24, 25, 45]....

    [...]

  • ...Softmax loss is almost the only choice of classification loss function for its strong robustness to various kinds of multi-class classification tasks, which can be used individually [1, 19, 22, 25, 36, 39, 41, 47] or combined with other losses [3, 8, 20, 43] in embedding learning procedures for Re-ID....

    [...]

  • ...Attention information can be a powerful complement for discrimination, which are enhanced in [22, 24, 25]....

    [...]

  • ...To locate semantic partitions without strongly learning-based predictors, region proposal methods such as [11, 18] are employed in some part-based methods [19, 22, 25, 41, 45]....

    [...]

Proceedings ArticleDOI
22 Feb 2018
TL;DR: A novel Harmonious Attention CNN (HA-CNN) model is formulated for joint learning of soft pixel attention and hard regional attention along with simultaneous optimisation of feature representations, dedicated to optimise person re-id in uncontrolled (misaligned) images.
Abstract: Existing person re-identification (re-id) methods either assume the availability of well-aligned person bounding box images as model input or rely on constrained attention selection mechanisms to calibrate misaligned images. They are therefore sub-optimal for re-id matching in arbitrarily aligned person images potentially with large human pose variations and unconstrained auto-detection errors. In this work, we show the advantages of jointly learning attention selection and feature representation in a Convolutional Neural Network (CNN) by maximising the complementary information of different levels of visual attention subject to re-id discriminative learning constraints. Specifically, we formulate a novel Harmonious Attention CNN (HA-CNN) model for joint learning of soft pixel attention and hard regional attention along with simultaneous optimisation of feature representations, dedicated to optimise person re-id in uncontrolled (misaligned) images. Extensive comparative evaluations validate the superiority of this new HA-CNN model for person re-id over a wide variety of state-of-the-art methods on three large-scale benchmarks including CUHK03, Market-1501, and DukeMTMC-ReID.
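As a rough illustration of the soft-attention half of the idea (a sketch only; the hard regional attention, the harmonious fusion, and the actual HA-CNN blocks are not reproduced, and the module names are assumptions): a per-pixel saliency map and a per-channel gate jointly re-weight a feature map.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Sketch of soft attention: a spatial map and a channel gate jointly
    re-weight a feature map. Illustrative, not the HA-CNN implementation."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)     # per-pixel saliency
        self.channel = nn.Sequential(                            # squeeze-and-gate channels
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, x):
        s = torch.sigmoid(self.spatial(x))   # (B, 1, H, W) spatial attention
        c = self.channel(x)                  # (B, C, 1, 1) channel attention
        return x * s * c                     # re-weighted features
```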

1,020 citations


Cites background from "HydraPlus-Net: Attentive Deep Featu..."

  • ...While soft re-id attention modelling is considered in [24], this model assumes tight person boxes and is thus less suitable for poor detections....

    [...]

Posted Content
TL;DR: A Part-based Convolutional Baseline (PCB) is proposed that employs part-level features for pedestrian image description and learns discriminative part-informed features for person retrieval, together with a refined part pooling (RPP) method.
Abstract: Employing part-level features for pedestrian image description offers fine-grained information and has been verified as beneficial for person retrieval in very recent literature. A prerequisite of part discovery is that each part should be well located. Instead of using external cues, e.g., pose estimation, to directly locate parts, this paper lays emphasis on the content consistency within each part. Specifically, we target at learning discriminative part-informed features for person retrieval and make two contributions. (i) A network named Part-based Convolutional Baseline (PCB). Given an image input, it outputs a convolutional descriptor consisting of several part-level features. With a uniform partition strategy, PCB achieves competitive results with the state-of-the-art methods, proving itself as a strong convolutional baseline for person retrieval. (ii) A refined part pooling (RPP) method. Uniform partition inevitably incurs outliers in each part, which are in fact more similar to other parts. RPP re-assigns these outliers to the parts they are closest to, resulting in refined parts with enhanced within-part consistency. Experiment confirms that RPP allows PCB to gain another round of performance boost. For instance, on the Market-1501 dataset, we achieve (77.4+4.2)% mAP and (92.3+1.5)% rank-1 accuracy, surpassing the state of the art by a large margin.

778 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, a Transferable Joint Attribute-Identity Deep Learning (TJ-AIDL) model is proposed to simultaneously learn an attribute-semantic and identity-discriminative feature representation space transferable to any new (unseen) target domain for re-id tasks, without the need for collecting new labelled training data from the target domain.
Abstract: Most existing person re-identification (re-id) methods require supervised model learning from a separate large set of pairwise labelled training data for every single camera pair. This significantly limits their scalability and usability in real-world large-scale deployments, which need to perform re-id across many camera views. To address this scalability problem, we develop a novel deep learning method for transferring the labelled information of an existing dataset to a new unseen (unlabelled) target domain for person re-id without any supervised learning in the target domain. Specifically, we introduce a Transferable Joint Attribute-Identity Deep Learning (TJ-AIDL) model for simultaneously learning an attribute-semantic and identity-discriminative feature representation space transferable to any new (unseen) target domain for re-id tasks, without the need for collecting new labelled training data from the target domain (i.e. unsupervised learning in the target domain). Extensive comparative evaluations validate the superiority of this new TJ-AIDL model for unsupervised person re-id over a wide range of state-of-the-art methods on four challenging benchmarks, including VIPeR, PRID, Market-1501, and DukeMTMC-ReID.
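A heavily simplified sketch of joint attribute-identity learning on a shared backbone (only the two supervised heads are shown; the transfer/adaptation machinery that lets TJ-AIDL generalise to unseen domains is not modelled here, and all names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeIdentityNet(nn.Module):
    """Sketch of joint attribute and identity learning on a shared backbone;
    an illustration of the joint-supervision idea, not the TJ-AIDL model."""
    def __init__(self, backbone, feat_dim, num_ids, num_attrs):
        super().__init__()
        self.backbone = backbone                         # any CNN producing feat_dim features
        self.id_head = nn.Linear(feat_dim, num_ids)      # identity-discriminative branch
        self.attr_head = nn.Linear(feat_dim, num_attrs)  # attribute-semantic branch

    def forward(self, x):
        f = self.backbone(x)
        return self.id_head(f), self.attr_head(f)

def joint_loss(id_logits, attr_logits, id_labels, attr_labels, w=1.0):
    # identity: multi-class cross-entropy; attributes: multi-label BCE (float targets)
    return (F.cross_entropy(id_logits, id_labels)
            + w * F.binary_cross_entropy_with_logits(attr_logits, attr_labels))
```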

568 citations

References
Posted Content
Sergey Ioffe, Christian Szegedy
TL;DR: Batch Normalization normalizes layer inputs for each training mini-batch to reduce internal covariate shift in deep neural networks, allowing higher learning rates and achieving state-of-the-art performance on ImageNet.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
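The core transform can be written in a few lines (a minimal sketch of the training-time computation for fully connected activations; the running statistics used at inference are omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalization for one mini-batch of activations x with shape
    (batch, features): normalize per feature, then scale and shift with the
    learnable parameters gamma and beta."""
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta              # restore representational power

# Example: a mini-batch of 4 samples with 3 features
x = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 0.0, 0.0],
              [1.0, 2.0, 3.0]])
y = batch_norm_train(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.var(axis=0))         # ~0 mean, ~1 variance per feature
```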

17,184 citations

Journal ArticleDOI
TL;DR: This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates.
Abstract: This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
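A toy sketch of the REINFORCE rule for a single Bernoulli-logistic unit (the reward function and learning rate here are assumptions chosen for illustration): the weight update follows the gradient of the log-probability of the sampled action, scaled by the received reinforcement minus a baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_update(w, x, lr=0.1, baseline=0.0):
    """One REINFORCE step: sample an action from the stochastic unit, observe
    a reward, and move the weights along grad log P(action) * (reward - baseline).
    The toy reward (1 when the unit fires) is a hypothetical task."""
    p = 1.0 / (1.0 + np.exp(-w @ x))          # firing probability
    y = rng.random() < p                      # stochastic action (0 or 1)
    reward = 1.0 if y else 0.0                # hypothetical reinforcement signal
    grad_logp = (float(y) - p) * x            # d log P(y) / d w for a Bernoulli-logistic unit
    return w + lr * (reward - baseline) * grad_logp

w = np.zeros(3)
x = np.array([1.0, 0.5, -0.5])
for _ in range(200):                           # the unit learns to fire on this input
    w = reinforce_update(w, x)
print(1.0 / (1.0 + np.exp(-w @ x)))            # firing probability approaches 1
```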

7,930 citations


"HydraPlus-Net: Attentive Deep Featu..." refers background in this paper

  • ...Compared to non-differentiable hard attention trained by reinforce algorithms [28], soft attention which weights the feature maps is differentiable and can be trained by back propagation....

    [...]

Proceedings Article
06 Jul 2015
TL;DR: An attention based model that automatically learns to describe the content of images is introduced that can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.
Abstract: Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr9k, Flickr30k and MS COCO.
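A minimal sketch of the deterministic ("soft") attention step described here (shapes and parameter names are assumptions; the full captioning decoder is omitted): each spatial location is scored against the decoder state, the scores are softmaxed into weights, and the context is their weighted average, so the whole step is differentiable and trainable by backpropagation.

```python
import torch
import torch.nn.functional as F

def soft_attention(features, hidden, W_f, W_h, v):
    """One step of soft attention: score each image region against the decoder
    hidden state, softmax the scores, and return the expected context vector.
    W_f, W_h and v stand for learnable parameters (names assumed)."""
    # features: (L, D) image regions; hidden: (H,) decoder state
    scores = torch.tanh(features @ W_f + hidden @ W_h) @ v   # (L,) alignment scores
    alpha = F.softmax(scores, dim=0)                         # attention weights, sum to 1
    context = alpha @ features                               # (D,) expected context
    return context, alpha

# Illustrative shapes: 196 regions of 512-d features, 256-d hidden state
features = torch.randn(196, 512)
hidden = torch.randn(256)
W_f, W_h, v = torch.randn(512, 128), torch.randn(256, 128), torch.randn(128)
context, alpha = soft_attention(features, hidden, W_f, W_h, v)
```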

6,485 citations


"HydraPlus-Net: Attentive Deep Featu..." refers methods or result in this paper

  • ...Most previous studies [34, 21] merely demonstrated the effectiveness of an attention-based model with a limited number of channels...

    [...]

  • ...Attention models In computer vision, attention models have been used in tasks such as image caption generation [34], visual question answering [18, 33] and object detection [2]....

    [...]

  • ...Moreover, the MDA module also differs from the traditional attention-based models [21, 34] that push the attention map back to the same block, and it extends this mechanism by applying the attention maps to adjacent blocks, as shown in lines with varying hot colors in Fig....

    [...]

Book ChapterDOI
08 Oct 2016
TL;DR: This work introduces a novel convolutional network architecture for the task of human pose estimation that is described as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.
Abstract: This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
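The repeated bottom-up, top-down processing can be sketched as a recursive module (an illustrative simplification that replaces the paper's residual blocks with plain convolutions; names and depths are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    """Sketch of one hourglass: pool down recursively, process at the lower
    resolution, upsample, and add a skip branch at each scale."""
    def __init__(self, channels, depth=4):
        super().__init__()
        self.depth = depth
        self.skip = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1)
                                   for _ in range(depth)])
        self.down = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1)
                                   for _ in range(depth)])
        self.bottom = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, level=0):
        if level == self.depth:
            return self.bottom(x)                           # coarsest resolution
        skip = self.skip[level](x)                          # branch at current scale
        low = F.max_pool2d(x, 2)                            # bottom-up: pool
        low = self.down[level](low)
        low = self.forward(low, level + 1)                  # recurse to coarser scales
        up = F.interpolate(low, scale_factor=2)             # top-down: upsample
        return up + skip                                    # merge across scales
```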

3,865 citations


"HydraPlus-Net: Attentive Deep Featu..." refers methods or result in this paper

  • ...Most previous studies [34, 21] merely demonstrated the effectiveness of an attention-based model with a limited number of channels...

    [...]

  • ...Moreover, the MDA module also differs from the traditional attention-based models [21, 34] that push the attention map back to the same block, and it extends this mechanism by applying the attention maps to adjacent blocks, as shown in lines with varying hot colors in Fig....

    [...]

Proceedings ArticleDOI
07 Dec 2015
TL;DR: As a minor contribution, inspired by recent advances in large-scale image search, an unsupervised Bag-of-Words descriptor is proposed that yields competitive accuracy on the VIPeR, CUHK03, and Market-1501 datasets and is scalable to the large-scale 500K dataset.
Abstract: This paper contributes a new high quality dataset for person re-identification, named "Market-1501". Generally, current datasets: 1) are limited in scale, 2) consist of hand-drawn bboxes, which are unavailable under realistic settings, 3) have only one ground truth and one query image for each identity (close environment). To tackle these problems, the proposed Market-1501 dataset is featured in three aspects. First, it contains over 32,000 annotated bboxes, plus a distractor set of over 500K images, making it the largest person re-id dataset to date. Second, images in Market-1501 dataset are produced using the Deformable Part Model (DPM) as pedestrian detector. Third, our dataset is collected in an open system, where each identity has multiple images under each camera. As a minor contribution, inspired by recent advances in large-scale image search, this paper proposes an unsupervised Bag-of-Words descriptor. We view person re-identification as a special task of image search. In experiment, we show that the proposed descriptor yields competitive accuracy on VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large-scale 500k dataset.
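Because each identity has multiple ground-truth gallery images, evaluation in this setting typically reports mean average precision alongside rank-1 accuracy; here is a minimal sketch of per-query average precision (function and variable names are assumptions):

```python
import numpy as np

def average_precision(good_index, ranked_gallery):
    """AP for one query: ranked_gallery is the gallery ordered by descending
    similarity, good_index the set of correct matches."""
    hits, precisions = 0, []
    for rank, g in enumerate(ranked_gallery, start=1):
        if g in good_index:
            hits += 1
            precisions.append(hits / rank)      # precision at each correct hit
    return np.mean(precisions) if precisions else 0.0

# Toy example: correct matches appear at ranks 1 and 3
print(average_precision({"a", "b"}, ["a", "x", "b", "y"]))   # 0.8333...
print(["a", "x", "b", "y"][0] in {"a", "b"})                 # rank-1 hit -> True
```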

3,564 citations


"HydraPlus-Net: Attentive Deep Featu..." refers methods in this paper

  • ...The proposed approach is evaluated on three publicly standard datasets, including CUHK03 [16], VIPeR [8], and Market-1501 [38]....

    [...]

  • ...The Market-1501 [38] dataset is also evaluated with the metric learning WARCA-L [11], a novel Siamese LSTM architecture LOMO+CN [27], the Siamese CNN with learnable gate S-CNN [26], and the bag of words model BoW-best [38]....

    [...]

  • ...For the Market-1501 dataset, the same data separation strategy is used as [38]....

    [...]

  • ...From Table 4, we can observe that the proposed approach achieves the Top-1 accuracies of 91.8%, 56.6% and 76.9% on the CUHK03, ViPeR and Market-1501 datasets, respectively, and it achieves the state-of-the-art performance on all the three datasets....

    [...]

  • ...Besides the quantitative results in [...]:

        Dataset            Market-1501 [38]   CUHK03 [16]   VIPeR [8]
        # identities       1501               1360          632
        # images           32643              13164         1264
        # cameras          6                  2             2
        # training IDs     750                1160          316
        # test IDs         751                100           316
        # probe images     3368               100           316
        # gallery images   19732              100           316...

    [...]