Open Access · Proceedings Article · DOI

Person Search with Natural Language Description

TLDR
The authors propose a Recurrent Neural Network with a Gated Neural Attention mechanism (GNA-RNN) for person search in large-scale image databases with natural language descriptions as queries.
Abstract
Searching for persons in large-scale image databases with natural language descriptions as queries has important applications in video surveillance. Existing methods mainly focus on searching for persons with image-based or attribute-based queries, which have major limitations for practical use. In this paper, we study the problem of person search with natural language description. Given the textual description of a person, the person search algorithm is required to rank all the samples in the person database and then retrieve the most relevant sample corresponding to the queried description. Since no person dataset or benchmark with textual descriptions is available, we collect a large-scale person description dataset with detailed natural language annotations and person samples from various sources, termed the CUHK Person Description Dataset (CUHK-PEDES). A wide range of possible models and baselines are evaluated and compared on the person search benchmark. A Recurrent Neural Network with a Gated Neural Attention mechanism (GNA-RNN) is proposed to establish the state-of-the-art performance on person search.
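
To make the ranking-and-retrieval formulation concrete, here is a minimal sketch, not the paper's GNA-RNN, that ranks a gallery of person images by cosine similarity between a query-description embedding and precomputed image embeddings; the embedding dimensions and the random vectors standing in for learned embeddings are placeholder assumptions.

```python
# Minimal sketch of description-based person retrieval with precomputed embeddings.
# The GNA-RNN in the paper instead computes unit-level affinities with gated attention;
# this only illustrates the rank-then-retrieve formulation of the task.
import numpy as np

def rank_gallery(text_embedding: np.ndarray, gallery_embeddings: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted by descending cosine similarity to the query."""
    q = text_embedding / np.linalg.norm(text_embedding)
    g = gallery_embeddings / np.linalg.norm(gallery_embeddings, axis=1, keepdims=True)
    scores = g @ q                       # affinity between the description and each person image
    return np.argsort(-scores)           # most relevant sample first

# Toy usage with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=256)             # embedding of the textual description
gallery = rng.normal(size=(1000, 256))   # embeddings of person images in the database
print(rank_gallery(query, gallery)[:5])  # indices of the top-5 retrieved samples
```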


Citations
Posted Content

Deep Learning for Person Re-identification: A Survey and Outlook

TL;DR: A powerful AGW baseline is designed, achieving state-of-the-art or at least comparable performance on twelve datasets across four different Re-ID tasks, and a new evaluation metric (mINP) is introduced that indicates the cost of finding all correct matches, providing an additional criterion for evaluating Re-ID systems in real applications.
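
For illustration, below is a minimal sketch of computing mINP, assuming the survey's definition: for each query the negative penalty is NP = (R_hard - |G|) / R_hard, where R_hard is the rank of the hardest (last) correct match and |G| the number of correct matches, so mINP averages 1 - NP = |G| / R_hard over queries.

```python
# Minimal sketch of the mINP metric under the definition assumed above.
import numpy as np

def minp(match_matrix: np.ndarray) -> float:
    """match_matrix[i, k] is 1 if the k-th ranked gallery item matches query i."""
    scores = []
    for row in match_matrix:
        positives = np.flatnonzero(row)
        if positives.size == 0:
            continue                       # queries with no ground-truth match are skipped
        r_hard = positives[-1] + 1         # 1-indexed rank of the last correct match
        scores.append(positives.size / r_hard)
    return float(np.mean(scores))

# Toy usage: two queries, each with 2 correct matches in a ranked gallery of 5.
ranked_matches = np.array([[1, 0, 1, 0, 0],   # last match at rank 3 -> 2/3
                           [1, 1, 0, 0, 0]])  # last match at rank 2 -> 2/2
print(minp(ranked_matches))                   # 0.833...
```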
Proceedings Article · DOI

LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking

TL;DR: LaSOT is presented, a high-quality benchmark for large-scale single object tracking that consists of 1,400 sequences with more than 3.5M frames in total and is, to the best of the authors' knowledge, the largest densely annotated tracking benchmark.
Posted Content

LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking

TL;DR: LaSOT provides a high-quality benchmark for large-scale single object tracking, consisting of 1,400 sequences with more than 3.5M frames in total.
Proceedings Article · DOI

Scene Graph Generation from Objects, Phrases and Region Captions

TL;DR: The authors propose a multi-level scene description network (MSDN) that solves three vision tasks jointly in an end-to-end manner, aligning object, phrase, and caption regions with a dynamic graph based on their spatial and semantic connections.
Proceedings Article · DOI

HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis

TL;DR: A new attention-based deep neural network, named HydraPlus-Net (HP-net), is proposed that multi-directionally feeds multi-level attention maps to different feature layers to enrich the final feature representations for a pedestrian image.
References
Proceedings Article

Exploring models and data for image question answering

TL;DR: This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.
Proceedings Article · DOI

Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction

TL;DR: In this paper, a joint network combining a CNN for Image QA with a parameter prediction network is proposed; it is trained end-to-end through back-propagation, with its weights initialized from a pre-trained CNN and GRU.
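
As a rough illustration of the dynamic-parameter idea, the sketch below predicts the weights of a question-conditioned fully connected layer from a GRU encoding of the question and applies that layer to CNN image features; the layer sizes are assumptions, and the hashing trick used in the original paper to compress the predicted weights is omitted.

```python
# Simplified sketch of a dynamic-parameter layer for Image QA (sizes are assumptions).
import torch
import torch.nn as nn

class DynamicParamVQA(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, img_dim=512, hidden=256, num_answers=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        # Parameter prediction: question encoding -> weights of a (hidden x img_dim) dynamic layer.
        self.param_pred = nn.Linear(hidden, hidden * img_dim)
        self.classifier = nn.Linear(hidden, num_answers)
        self.img_dim, self.hidden = img_dim, hidden

    def forward(self, question_tokens, image_features):
        _, h = self.gru(self.embed(question_tokens))     # h: (1, batch, hidden)
        w = self.param_pred(h.squeeze(0))                # predicted dynamic weights
        w = w.view(-1, self.hidden, self.img_dim)        # (batch, hidden, img_dim)
        # Apply the question-conditioned layer to the image features.
        joint = torch.relu(torch.bmm(w, image_features.unsqueeze(2)).squeeze(2))
        return self.classifier(joint)                    # answer logits

# Toy usage with random inputs standing in for question tokens and CNN features.
model = DynamicParamVQA()
logits = model(torch.randint(0, 1000, (4, 10)), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 100])
```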
Proceedings Article · DOI

Object Detection from Video Tubelets with Convolutional Neural Networks

TL;DR: This paper introduces a complete framework for the object detection from video (VID) task based on still-image object detection and general object tracking, and proposes a temporal convolution network that incorporates temporal information to regularize the detection results, showing its effectiveness for the task.
Posted Content

Simple Baseline for Visual Question Answering

TL;DR: A very simple bag-of-words baseline for visual question answering is presented that concatenates word features from the question with CNN features from the image to predict the answer.
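
Since this baseline is essentially feature concatenation plus a linear classifier, a minimal sketch is easy to give; the vocabulary size, feature dimensions, and answer set below are placeholder assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a bag-of-words VQA baseline with precomputed CNN image features.
import numpy as np

def bow_features(token_ids, vocab_size):
    """Sum one-hot word vectors into a bag-of-words representation of the question."""
    bow = np.zeros(vocab_size)
    for t in token_ids:
        bow[t] += 1.0
    return bow

def predict_answer(token_ids, image_feature, W, b, vocab_size):
    """Score candidate answers from the concatenated question + image features."""
    x = np.concatenate([bow_features(token_ids, vocab_size), image_feature])
    logits = W @ x + b
    return int(np.argmax(logits))

# Toy usage: 1000-word vocabulary, 4096-d image features, 100 candidate answers.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(100, 1000 + 4096)), np.zeros(100)
print(predict_answer([3, 17, 512], rng.normal(size=4096), W, b, vocab_size=1000))
```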
Book Chapter · DOI

Segmentation from Natural Language Expressions

TL;DR: An end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information is proposed; it can produce quality segmentation output from a natural language expression and outperforms baseline methods by a large margin.
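
The sketch below illustrates one plausible reading of this scheme: an LSTM encoding of the expression is tiled over the spatial grid of convolutional image features, and 1x1 convolutions on the concatenation predict a per-pixel mask; the specific dimensions and layer choices are assumptions, not the paper's exact architecture.

```python
# Simplified sketch of segmentation conditioned on a language expression.
import torch
import torch.nn as nn

class LangSegSketch(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, lang_dim=256, img_channels=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, lang_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Conv2d(img_channels + lang_dim, 128, kernel_size=1), nn.ReLU(),
            nn.Conv2d(128, 1, kernel_size=1),           # per-pixel mask logits
        )

    def forward(self, tokens, image_features):
        _, (h, _) = self.lstm(self.embed(tokens))       # h: (1, batch, lang_dim)
        b, _, height, width = image_features.shape
        # Tile the expression encoding over every spatial location, then fuse.
        lang = h.squeeze(0)[:, :, None, None].expand(-1, -1, height, width)
        return self.head(torch.cat([image_features, lang], dim=1))

# Toy usage: a batch of 2 expressions over 16x16 convolutional feature maps.
model = LangSegSketch()
mask_logits = model(torch.randint(0, 1000, (2, 8)), torch.randn(2, 512, 16, 16))
print(mask_logits.shape)  # torch.Size([2, 1, 16, 16])
```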