Open Access · Proceedings Article · DOI

Person Search with Natural Language Description

TLDR
The authors propose a Recurrent Neural Network with a Gated Neural Attention mechanism (GNA-RNN) for person search in large-scale image databases with natural language descriptions as queries.
Abstract
Searching for persons in large-scale image databases with natural language descriptions as queries has important applications in video surveillance. Existing methods mainly focus on searching for persons with image-based or attribute-based queries, which have major limitations for practical use. In this paper, we study the problem of person search with natural language description. Given the textual description of a person, the person search algorithm is required to rank all the samples in the person database and then retrieve the most relevant sample corresponding to the queried description. Since no person dataset or benchmark with textual descriptions is available, we collect a large-scale person description dataset with detailed natural language annotations and person samples from various sources, termed the CUHK Person Description Dataset (CUHK-PEDES). A wide range of possible models and baselines are evaluated and compared on the person search benchmark. A Recurrent Neural Network with a Gated Neural Attention mechanism (GNA-RNN) is proposed to establish the state-of-the-art performance on person search.
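
To make the ranking-and-retrieval formulation concrete, here is a minimal sketch, not the paper's GNA-RNN, that ranks a gallery of person images by cosine similarity between a query-description embedding and precomputed image embeddings; the embedding dimensions and the random vectors standing in for learned embeddings are placeholder assumptions.

```python
# Minimal sketch of description-based person retrieval with precomputed embeddings.
# The GNA-RNN in the paper instead computes unit-level affinities with gated attention;
# this only illustrates the rank-then-retrieve formulation of the task.
import numpy as np

def rank_gallery(text_embedding: np.ndarray, gallery_embeddings: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted by descending cosine similarity to the query."""
    q = text_embedding / np.linalg.norm(text_embedding)
    g = gallery_embeddings / np.linalg.norm(gallery_embeddings, axis=1, keepdims=True)
    scores = g @ q                       # affinity between the description and each person image
    return np.argsort(-scores)           # most relevant sample first

# Toy usage with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=256)             # embedding of the textual description
gallery = rng.normal(size=(1000, 256))   # embeddings of person images in the database
print(rank_gallery(query, gallery)[:5])  # indices of the top-5 retrieved samples
```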


Citations
Posted Content

Deep Learning for Person Re-identification: A Survey and Outlook

TL;DR: A powerful AGW baseline is designed, achieving state-of-the-art or at least comparable performance on twelve datasets across four different Re-ID tasks, and a new evaluation metric (mINP) is introduced that indicates the cost of finding all correct matches, providing an additional criterion for evaluating Re-ID systems in real applications.
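
For illustration, below is a minimal sketch of computing mINP, assuming the survey's definition: for each query the negative penalty is NP = (R_hard - |G|) / R_hard, where R_hard is the rank of the hardest (last) correct match and |G| the number of correct matches, so mINP averages 1 - NP = |G| / R_hard over queries.

```python
# Minimal sketch of the mINP metric under the definition assumed above.
import numpy as np

def minp(match_matrix: np.ndarray) -> float:
    """match_matrix[i, k] is 1 if the k-th ranked gallery item matches query i."""
    scores = []
    for row in match_matrix:
        positives = np.flatnonzero(row)
        if positives.size == 0:
            continue                       # queries with no ground-truth match are skipped
        r_hard = positives[-1] + 1         # 1-indexed rank of the last correct match
        scores.append(positives.size / r_hard)
    return float(np.mean(scores))

# Toy usage: two queries, each with 2 correct matches in a ranked gallery of 5.
ranked_matches = np.array([[1, 0, 1, 0, 0],   # last match at rank 3 -> 2/3
                           [1, 1, 0, 0, 0]])  # last match at rank 2 -> 2/2
print(minp(ranked_matches))                   # 0.833...
```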
Proceedings Article · DOI

LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking

TL;DR: LaSOT is presented, a high-quality benchmark for large-scale single object tracking that consists of 1,400 sequences with more than 3.5M frames in total and is, to the best of the authors' knowledge, the largest densely annotated tracking benchmark.
Posted Content

LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking

TL;DR: LaSOT provides a high-quality benchmark for large-scale single object tracking, consisting of 1,400 sequences with more than 3.5M frames in total.
Proceedings Article · DOI

Scene Graph Generation from Objects, Phrases and Region Captions

TL;DR: The authors propose a multi-level scene description network (MSDN) that solves three vision tasks jointly in an end-to-end manner, aligning object, phrase, and caption regions with a dynamic graph based on their spatial and semantic connections.
Proceedings Article · DOI

HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis

TL;DR: A new attention-based deep neural network, named HydraPlus-Net (HP-net), is proposed that multi-directionally feeds multi-level attention maps to different feature layers to enrich the final feature representations for a pedestrian image.
References
Proceedings Article

Exploring models and data for image question answering

TL;DR: This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.
Proceedings Article · DOI

Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction

TL;DR: In this paper, a joint network combining a CNN for Image QA with a parameter prediction network is proposed; it is trained end-to-end through back-propagation, with its weights initialized from a pre-trained CNN and GRU.
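
As a rough illustration of the dynamic-parameter idea, the sketch below predicts the weights of a question-conditioned fully connected layer from a GRU encoding of the question and applies that layer to CNN image features; the layer sizes are assumptions, and the hashing trick used in the original paper to compress the predicted weights is omitted.

```python
# Simplified sketch of a dynamic-parameter layer for Image QA (sizes are assumptions).
import torch
import torch.nn as nn

class DynamicParamVQA(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, img_dim=512, hidden=256, num_answers=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        # Parameter prediction: question encoding -> weights of a (hidden x img_dim) dynamic layer.
        self.param_pred = nn.Linear(hidden, hidden * img_dim)
        self.classifier = nn.Linear(hidden, num_answers)
        self.img_dim, self.hidden = img_dim, hidden

    def forward(self, question_tokens, image_features):
        _, h = self.gru(self.embed(question_tokens))     # h: (1, batch, hidden)
        w = self.param_pred(h.squeeze(0))                # predicted dynamic weights
        w = w.view(-1, self.hidden, self.img_dim)        # (batch, hidden, img_dim)
        # Apply the question-conditioned layer to the image features.
        joint = torch.relu(torch.bmm(w, image_features.unsqueeze(2)).squeeze(2))
        return self.classifier(joint)                    # answer logits

# Toy usage with random inputs standing in for question tokens and CNN features.
model = DynamicParamVQA()
logits = model(torch.randint(0, 1000, (4, 10)), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 100])
```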
Proceedings Article · DOI

Object Detection from Video Tubelets with Convolutional Neural Networks

TL;DR: This paper introduces a complete framework for the object detection from video (VID) task based on still-image object detection and general object tracking, and proposes a temporal convolution network that incorporates temporal information to regularize the detection results, showing its effectiveness for the task.
Posted Content

Simple Baseline for Visual Question Answering

TL;DR: A very simple bag-of-words baseline for visual question answering is presented that concatenates word features from the question with CNN features from the image to predict the answer.
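
Since this baseline is essentially feature concatenation plus a linear classifier, a minimal sketch is easy to give; the vocabulary size, feature dimensions, and answer set below are placeholder assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a bag-of-words VQA baseline with precomputed CNN image features.
import numpy as np

def bow_features(token_ids, vocab_size):
    """Sum one-hot word vectors into a bag-of-words representation of the question."""
    bow = np.zeros(vocab_size)
    for t in token_ids:
        bow[t] += 1.0
    return bow

def predict_answer(token_ids, image_feature, W, b, vocab_size):
    """Score candidate answers from the concatenated question + image features."""
    x = np.concatenate([bow_features(token_ids, vocab_size), image_feature])
    logits = W @ x + b
    return int(np.argmax(logits))

# Toy usage: 1000-word vocabulary, 4096-d image features, 100 candidate answers.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(100, 1000 + 4096)), np.zeros(100)
print(predict_answer([3, 17, 512], rng.normal(size=4096), W, b, vocab_size=1000))
```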
Book Chapter · DOI

Segmentation from Natural Language Expressions

TL;DR: An end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information is proposed; it can produce quality segmentation output from a natural language expression and outperforms baseline methods by a large margin.
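
The sketch below illustrates one plausible reading of this scheme: an LSTM encoding of the expression is tiled over the spatial grid of convolutional image features, and 1x1 convolutions on the concatenation predict a per-pixel mask; the specific dimensions and layer choices are assumptions, not the paper's exact architecture.

```python
# Simplified sketch of segmentation conditioned on a language expression.
import torch
import torch.nn as nn

class LangSegSketch(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, lang_dim=256, img_channels=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, lang_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Conv2d(img_channels + lang_dim, 128, kernel_size=1), nn.ReLU(),
            nn.Conv2d(128, 1, kernel_size=1),           # per-pixel mask logits
        )

    def forward(self, tokens, image_features):
        _, (h, _) = self.lstm(self.embed(tokens))       # h: (1, batch, lang_dim)
        b, _, height, width = image_features.shape
        # Tile the expression encoding over every spatial location, then fuse.
        lang = h.squeeze(0)[:, :, None, None].expand(-1, -1, height, width)
        return self.head(torch.cat([image_features, lang], dim=1))

# Toy usage: a batch of 2 expressions over 16x16 convolutional feature maps.
model = LangSegSketch()
mask_logits = model(torch.randint(0, 1000, (2, 8)), torch.randn(2, 512, 16, 16))
print(mask_logits.shape)  # torch.Size([2, 1, 16, 16])
```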