Proceedings ArticleDOI

Person Search with Natural Language Description

01 Jul 2017-pp 5187-5196
TL;DR: A Recurrent Neural Network with a Gated Neural Attention mechanism (GNA-RNN) is proposed for person search in large-scale image databases using natural language description queries.
Abstract: Searching persons in large-scale image databases with the query of natural language description has important applications in video surveillance. Existing methods mainly focused on searching persons with image-based or attribute-based queries, which have major limitations for practical usage. In this paper, we study the problem of person search with natural language description. Given the textual description of a person, the algorithm of the person search is required to rank all the samples in the person database and then retrieve the most relevant sample corresponding to the queried description. Since there is no person dataset or benchmark with textual description available, we collect a large-scale person description dataset with detailed natural language annotations and person samples from various sources, termed the CUHK Person Description Dataset (CUHK-PEDES). A wide range of possible models and baselines have been evaluated and compared on the person search benchmark. A Recurrent Neural Network with Gated Neural Attention mechanism (GNA-RNN) is proposed to establish the state-of-the-art performance on person search.
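The core retrieval idea lends itself to a compact sketch: an LSTM reads the description word by word, and per-word gates and attentions weight the contribution of visual units extracted from the image, yielding an affinity score used to rank the database. The module below is only an illustration of that idea; the layer names, dimensions, and gating scheme are assumptions, not the paper's released GNA-RNN.

```python
import torch
import torch.nn as nn

class TextImageAffinity(nn.Module):
    """Toy sketch of a gated-attention affinity between a description and an image.

    Illustrative only: dimensions, layer names and the exact gating scheme are
    assumptions, not the GNA-RNN described in the paper.
    """
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256, num_units=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # per-word attention over "visual units" and a scalar gate per word
        self.unit_attn = nn.Linear(hidden_dim, num_units)
        self.word_gate = nn.Linear(hidden_dim, 1)

    def forward(self, word_ids, visual_units):
        # word_ids: (B, T) token indices; visual_units: (B, num_units) unit activations
        h, _ = self.lstm(self.embed(word_ids))            # (B, T, hidden_dim)
        attn = torch.softmax(self.unit_attn(h), dim=-1)   # (B, T, num_units)
        gate = torch.sigmoid(self.word_gate(h))           # (B, T, 1)
        # affinity: gated, attention-weighted sum of unit activations over all words
        per_word = (attn * visual_units.unsqueeze(1)).sum(-1, keepdim=True)  # (B, T, 1)
        return (gate * per_word).sum(dim=1).squeeze(-1)   # (B,) affinity scores

scores = TextImageAffinity()(torch.randint(0, 1000, (2, 12)), torch.rand(2, 64))
```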


Citations
Posted Content
TL;DR: A powerful AGW baseline is designed, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks, and a new evaluation metric (mINP) is introduced, indicating the cost of finding all the correct matches, which provides an additional criterion for evaluating the Re-ID system in real applications.
Abstract: Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and increasing demand of intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With the performance saturation under the closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost for finding all the correct matches, which provides an additional criterion to evaluate the Re-ID system for real applications. Finally, some important yet under-investigated open issues are discussed.
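The mINP metric mentioned above can be computed in a few lines, assuming the commonly cited definition in which a query's inverse negative penalty is the number of its correct matches divided by the (1-based) rank of the hardest, i.e. last-retrieved, correct match; the function name and input format below are illustrative.

```python
import numpy as np

def mean_inp(ranked_match_flags):
    """Mean Inverse Negative Penalty (mINP) over a set of queries.

    ranked_match_flags: list of 1-D boolean arrays, one per query, where entry j
    is True if the gallery item ranked at position j (0-based) is a correct match.
    Assumes the definition INP = |G| / R_hard, with R_hard the 1-based rank of the
    hardest (last-found) correct match.
    """
    inps = []
    for flags in ranked_match_flags:
        flags = np.asarray(flags, dtype=bool)
        if not flags.any():                              # no correct match: skip this query
            continue
        hardest_rank = np.nonzero(flags)[0][-1] + 1      # 1-based rank of the last match
        inps.append(flags.sum() / hardest_rank)
    return float(np.mean(inps))

# toy example: two queries with galleries already ranked by the model
print(mean_inp([[True, False, True, False],    # 2 matches, hardest at rank 3 -> 2/3
                [False, True, False, False]])) # 1 match at rank 2 -> 1/2
```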

737 citations

Proceedings ArticleDOI
15 Jun 2019
TL;DR: LaSOT is presented, a high-quality benchmark for Large-scale Single Object Tracking that consists of 1,400 sequences with more than 3.5M frames in total, and is the largest, to the best of the authors' knowledge, densely annotated tracking benchmark.
Abstract: In this paper, we present LaSOT, a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box, making LaSOT the largest, to the best of our knowledge, densely annotated tracking benchmark. The average video length of LaSOT is more than 2,500 frames, and each sequence comprises various challenges deriving from the wild, where target objects may disappear and reappear in the view. By releasing LaSOT, we expect to provide the community with a large-scale, high-quality dedicated benchmark for both the training of deep trackers and the faithful evaluation of tracking algorithms. Moreover, considering the close connections of visual appearance and natural language, we enrich LaSOT by providing additional language specification, aiming at encouraging the exploration of natural linguistic features for tracking. A thorough experimental evaluation of 35 tracking algorithms on LaSOT is presented with detailed analysis, and the results demonstrate that there is still significant room for improvement.
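Tracker evaluations of this kind are typically reported as success scores over bounding-box overlap thresholds; the snippet below is a generic sketch of that computation, not the official LaSOT toolkit.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax1 + aw, bx1 + bw), min(ay1 + ah, by1 + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_curve(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Fraction of frames whose IoU exceeds each threshold (a success plot)."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return np.array([(overlaps > t).mean() for t in thresholds])

# toy usage: one predicted track vs. ground truth over three frames
curve = success_curve([(0, 0, 10, 10), (1, 1, 10, 10), (5, 5, 10, 10)],
                      [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)])
```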

653 citations

Posted Content
TL;DR: LaSOT, as discussed by the authors, is a high-quality benchmark for large-scale single object tracking, consisting of 1,400 sequences with more than 3.5M frames in total.

501 citations

Proceedings ArticleDOI
01 Oct 2017
TL;DR: A Multi-level Scene Description Network (MSDN) is proposed to solve object detection, scene graph generation and region captioning jointly in an end-to-end manner, with object, phrase, and caption regions aligned by a dynamic graph based on their spatial and semantic connections.
Abstract: Object detection, scene graph generation and region captioning, which are three scene understanding tasks at different semantic levels, are tied together: scene graphs are generated on top of objects detected in an image with their pairwise relationships predicted, while region captioning gives a language description of the objects, their attributes, relations and other context information. In this work, to leverage the mutual connections across semantic levels, we propose a novel neural network model, termed the Multi-level Scene Description Network (MSDN), to solve the three vision tasks jointly in an end-to-end manner. Object, phrase, and caption regions are first aligned with a dynamic graph based on their spatial and semantic connections. Then a feature refining structure is used to pass messages across the three levels of semantic tasks through the graph. We benchmark the learned model on three tasks, and show that joint learning across the three tasks with our proposed method brings mutual improvements over previous models. Particularly, on the scene graph generation task, our proposed method outperforms the state-of-the-art method by a margin of more than 3%. Code has been made publicly available.
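The message-passing idea can be illustrated with a toy refinement step in which object, phrase, and caption features exchange gated messages; the dense connectivity, layer names, and dimensions below are stand-ins, since MSDN builds its graph dynamically from spatial and semantic connections.

```python
import torch
import torch.nn as nn

class LevelMessagePassing(nn.Module):
    """Toy refinement step: object, phrase and caption region features exchange
    messages.  The dense connectivity here is an illustrative stand-in for
    MSDN's dynamically constructed graph."""
    def __init__(self, dim=256):
        super().__init__()
        self.obj_from_phr = nn.Linear(dim, dim)
        self.phr_from_obj = nn.Linear(dim, dim)
        self.phr_from_cap = nn.Linear(dim, dim)
        self.cap_from_phr = nn.Linear(dim, dim)

    def forward(self, obj, phr, cap):
        # obj: (N_o, D), phr: (N_p, D), cap: (N_c, D) region features
        obj = obj + torch.tanh(self.obj_from_phr(phr.mean(0, keepdim=True)))
        cap = cap + torch.tanh(self.cap_from_phr(phr.mean(0, keepdim=True)))
        phr = phr + torch.tanh(self.phr_from_obj(obj.mean(0, keepdim=True))
                               + self.phr_from_cap(cap.mean(0, keepdim=True)))
        return obj, phr, cap

obj, phr, cap = LevelMessagePassing()(torch.rand(5, 256), torch.rand(8, 256), torch.rand(3, 256))
```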

477 citations

Proceedings ArticleDOI
01 Oct 2017
TL;DR: A new attention-based deep neural network, named HydraPlus-Net (HP-net), is proposed that multi-directionally feeds multi-level attention maps to different feature layers to enrich the final feature representations for a pedestrian image.
Abstract: Pedestrian analysis plays a vital role in intelligent video surveillance and is a key component of security-centric computer vision systems. Although convolutional neural networks are remarkable at learning discriminative features from images, the learning of comprehensive features of pedestrians for fine-grained tasks remains an open problem. In this study, we propose a new attention-based deep neural network, named HydraPlus-Net (HP-net), that multi-directionally feeds multi-level attention maps to different feature layers. The attentive deep features learned from the proposed HP-net bring unique advantages: (1) the model is capable of capturing multiple attentions from low level to semantic level, and (2) it explores the multi-scale selectiveness of attentive features to enrich the final feature representations for a pedestrian image. We demonstrate the effectiveness and generality of the proposed HP-net for pedestrian analysis on two tasks, i.e. pedestrian attribute recognition and person re-identification. Extensive experimental results are provided to show that HP-net outperforms the state-of-the-art methods on various datasets.
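The multi-level attention idea can be sketched as follows: an attention map produced at one depth is resized and used to gate feature maps at several depths before pooling and concatenation. Channel counts, the choice of which layer produces the map, and module names below are illustrative assumptions, not the released HP-net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAttentiveFeatures(nn.Module):
    """Toy sketch: one attention map modulates features at several depths."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        # 1x1 conv produces a single-channel spatial attention map from the mid-level features
        self.attn_conv = nn.Conv2d(channels[1], 1, kernel_size=1)

    def forward(self, feats):
        # feats: list of feature maps [(B, C_i, H_i, W_i)] from shallow to deep layers
        attn = torch.sigmoid(self.attn_conv(feats[1]))        # (B, 1, H_1, W_1)
        pooled = []
        for f in feats:
            # resize the attention map to each level and gate that level's features
            a = F.interpolate(attn, size=f.shape[-2:], mode='bilinear', align_corners=False)
            pooled.append(F.adaptive_avg_pool2d(f * a, 1).flatten(1))   # (B, C_i)
        return torch.cat(pooled, dim=1)                        # concatenated descriptor

feats = [torch.rand(2, c, s, s) for c, s in zip((64, 128, 256), (56, 28, 14))]
desc = MultiLevelAttentiveFeatures()(feats)   # (2, 448)
```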

396 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers (8× deeper than VGG nets [40]) but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
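The residual reformulation amounts to having a stack of layers learn F(x) and output F(x) + x. A minimal basic block in that spirit is sketched below (an illustration, not the authors' released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """A minimal residual block: the stacked layers learn F(x) and the block
    outputs F(x) + x (identity shortcut, so channel counts must match)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)   # residual (shortcut) connection

y = BasicBlock(64)(torch.rand(1, 64, 32, 32))   # output has the same shape as the input
```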

123,388 citations

Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
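The architecture the abstract describes (five convolutional layers, max-pooling after some of them, and three fully connected layers with dropout ending in a 1000-way output) can be sketched compactly; the kernel sizes and strides below follow a commonly used configuration and are illustrative rather than the original implementation.

```python
import torch
import torch.nn as nn

# AlexNet-style sketch: 5 conv layers (some followed by max-pooling), then
# 3 fully connected layers with dropout and a 1000-way output layer.
model = nn.Sequential(
    nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(64, 192, 5, padding=2),          nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(192, 384, 3, padding=1),         nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1),         nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1),         nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(), nn.Linear(4096, 4096),        nn.ReLU(),
    nn.Linear(4096, 1000),                      # logits for the 1000 classes
)
logits = model(torch.rand(1, 3, 224, 224))      # (1, 1000)
```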

73,978 citations

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
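The multiplicative gating the abstract describes can be written out directly; the sketch below uses the now-standard formulation, which includes a forget gate added after the original 1997 paper, with shapes chosen purely for illustration.

```python
import torch

def lstm_cell(x, h, c, W_x, W_h, b):
    """One step of a standard LSTM cell (modern formulation, including the
    forget gate that was added after the original 1997 paper).
    x: (B, D_in); h, c: (B, D_h); W_x: (D_in, 4*D_h); W_h: (D_h, 4*D_h); b: (4*D_h,)."""
    gates = x @ W_x + h @ W_h + b
    i, f, g, o = gates.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # input/forget/output gates
    g = torch.tanh(g)                                               # candidate cell update
    c = f * c + i * g          # additive cell-state update (the "constant error carousel")
    h = o * torch.tanh(c)      # gated exposure of the cell state
    return h, c

B, D_in, D_h = 2, 8, 16
h, c = torch.zeros(B, D_h), torch.zeros(B, D_h)
h, c = lstm_cell(torch.rand(B, D_in), h, c,
                 torch.randn(D_in, 4 * D_h), torch.randn(D_h, 4 * D_h), torch.zeros(4 * D_h))
```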

72,897 citations

Book ChapterDOI
06 Sep 2014
TL;DR: A new dataset is presented with the goal of advancing the state-of-the-art in object recognition by placing it in the context of the broader question of scene understanding, achieved by gathering images of complex everyday scenes containing common objects in their natural context.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
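The dataset is commonly accessed through the pycocotools API; a typical usage sketch is shown below (the annotation file path is a placeholder for a locally downloaded file).

```python
from pycocotools.coco import COCO

# Path is a placeholder; point it at a downloaded COCO annotation file.
coco = COCO('annotations/instances_val2017.json')

cat_ids = coco.getCatIds(catNms=['person'])           # category ids for "person"
img_ids = coco.getImgIds(catIds=cat_ids)              # images containing that category
img_info = coco.loadImgs(img_ids[0])[0]               # metadata: file_name, height, width, ...

ann_ids = coco.getAnnIds(imgIds=img_info['id'], catIds=cat_ids, iscrowd=None)
anns = coco.loadAnns(ann_ids)                         # per-instance boxes and segmentations
mask = coco.annToMask(anns[0])                        # binary mask for one instance
```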

30,462 citations

Proceedings Article
06 Jul 2015
TL;DR: An attention-based model that automatically learns to describe the content of images is introduced; it can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.
Abstract: Inspired by recent work in machine translation and object detection, we introduce an attention-based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
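The deterministic ("soft") variant of the attention mechanism reduces to scoring each image region against the decoder state, normalizing the scores with a softmax, and taking the weighted sum of region features as the context vector; the sketch below illustrates this with assumed dimensions and module names, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Deterministic ('soft') attention: a weighted sum of region features,
    with weights predicted from the decoder state.  A sketch of the idea only."""
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, h):
        # regions: (B, L, feat_dim) annotation vectors; h: (B, hidden_dim) decoder state
        e = self.score(torch.tanh(self.feat_proj(regions) + self.state_proj(h).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)                  # (B, L, 1) attention over regions
        context = (alpha * regions).sum(dim=1)           # (B, feat_dim) context vector
        return context, alpha.squeeze(-1)

context, alpha = SoftAttention()(torch.rand(2, 196, 512), torch.rand(2, 512))
```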

6,485 citations