Open Access · Proceedings Article · DOI

Scene Graph Generation from Objects, Phrases and Region Captions

TLDR
This paper proposes a Multi-level Scene Description Network (MSDN) that solves object detection, scene graph generation, and region captioning jointly in an end-to-end manner, where object, phrase, and caption regions are aligned with a dynamic graph based on their spatial and semantic connections.
Abstract
Object detection, scene graph generation and region captioning, which are three scene understanding tasks at different semantic levels, are tied together: scene graphs are generated on top of objects detected in an image with their pairwise relationships predicted, while region captioning gives a language description of the objects, their attributes, relations and other context information. In this work, to leverage the mutual connections across semantic levels, we propose a novel neural network model, termed the Multi-level Scene Description Network (MSDN), to solve the three vision tasks jointly in an end-to-end manner. Object, phrase, and caption regions are first aligned with a dynamic graph based on their spatial and semantic connections. Then a feature refining structure is used to pass messages across the three levels of semantic tasks through the graph. We benchmark the learned model on the three tasks and show that joint learning across the three tasks with our proposed method brings mutual improvements over previous models. In particular, on the scene graph generation task, our proposed method outperforms the state-of-the-art method by a margin of more than 3%. Code has been made publicly available.
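To make the feature-refining step concrete, below is a minimal PyTorch sketch of message passing over such a three-level graph: phrases average messages from their connected objects and enclosing caption regions, then objects and caption regions gather messages back from the refined phrases. The layer names, dimensions, and plain averaged messages are assumptions for illustration; the authors' released code uses its own (gated) merging scheme, so treat this as a sketch of the idea rather than the implementation.

```python
import torch
import torch.nn as nn


def avg_neighbors(adj, feats):
    """Average the features of connected source nodes.
    adj: [N_dst, N_src] 0/1 adjacency, feats: [N_src, d]."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    return (adj @ feats) / deg


class CrossLevelMessagePassing(nn.Module):
    """One feature-refining step across the three semantic levels
    (objects, phrases, caption regions) linked by the dynamic graph."""

    def __init__(self, dim=512):
        super().__init__()
        # Hypothetical message transforms (the paper uses a gated merging scheme).
        self.obj_to_phr = nn.Linear(dim, dim)
        self.cap_to_phr = nn.Linear(dim, dim)
        self.phr_to_obj = nn.Linear(dim, dim)
        self.phr_to_cap = nn.Linear(dim, dim)

    def forward(self, f_obj, f_phr, f_cap, A_op, A_pc):
        # f_obj: [No, d], f_phr: [Np, d], f_cap: [Nc, d]
        # A_op[i, j] = 1 if object i takes part in phrase j        -> [No, Np]
        # A_pc[j, k] = 1 if phrase j lies inside caption region k  -> [Np, Nc]
        f_phr = f_phr \
            + torch.relu(self.obj_to_phr(avg_neighbors(A_op.t(), f_obj))) \
            + torch.relu(self.cap_to_phr(avg_neighbors(A_pc, f_cap)))
        # Objects and caption regions gather messages from the refined phrases.
        f_obj = f_obj + torch.relu(self.phr_to_obj(avg_neighbors(A_op, f_phr)))
        f_cap = f_cap + torch.relu(self.phr_to_cap(avg_neighbors(A_pc.t(), f_phr)))
        return f_obj, f_phr, f_cap


# Toy usage: 4 objects, 3 phrases, 2 caption regions.
f_obj, f_phr, f_cap = torch.randn(4, 512), torch.randn(3, 512), torch.randn(2, 512)
A_op = torch.tensor([[1., 0., 0.],   # object 0 takes part in phrase 0
                     [1., 0., 0.],
                     [0., 1., 0.],
                     [0., 0., 1.]])
A_pc = torch.tensor([[1., 0.],       # phrase 0 lies inside caption region 0
                     [1., 0.],
                     [0., 1.]])
f_obj, f_phr, f_cap = CrossLevelMessagePassing()(f_obj, f_phr, f_cap, A_op, A_pc)
```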


Citations
Journal Article · DOI

Deep Learning for Generic Object Detection: A Survey

TL;DR: A comprehensive survey of the recent achievements in this field brought about by deep learning techniques, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics.
Book Chapter · DOI

Exploring Visual Relationship for Image Captioning

TL;DR: The authors propose GCN-LSTM with an attention mechanism to explore the connections between objects for image captioning under an attention-based encoder-decoder framework.
Book Chapter · DOI

Graph R-CNN for Scene Graph Generation

TL;DR: A novel scene graph generation model, Graph R-CNN, is proposed that is both effective and efficient at detecting objects and their relations in images, together with a new evaluation metric that is more holistic and realistic than existing metrics.
Proceedings Article · DOI

Neural Motifs: Scene Graph Parsing with Global Context

TL;DR: In this paper, the role of motifs in scene graph representation is investigated, and the Stacked Motif Networks (MotifNet) architecture is proposed to capture higher-order motifs.
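As a rough sketch of the global-context idea, the snippet below contextualizes detected objects with a bidirectional LSTM before scoring every ordered object pair for a relationship label. The layer names, dimensions, and pair-scoring head are hypothetical; this is not the released Stacked Motif Networks code.

```python
import torch
import torch.nn as nn


class ObjectContext(nn.Module):
    """Hypothetical sketch: a bidirectional LSTM spreads global context
    across the detected objects before relationship classification."""

    def __init__(self, dim=512, n_rel=51):
        super().__init__()
        self.ctx = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.rel_head = nn.Linear(2 * dim, n_rel)

    def forward(self, obj_feats):                     # [N, d] detected-object features
        ctx, _ = self.ctx(obj_feats.unsqueeze(0))     # [1, N, d] contextualized features
        ctx = ctx.squeeze(0)
        # Score every ordered (subject, object) pair for a relationship label.
        N, d = ctx.shape
        pairs = torch.cat([ctx.unsqueeze(1).expand(N, N, d),
                           ctx.unsqueeze(0).expand(N, N, d)], dim=-1)
        return self.rel_head(pairs)                   # [N, N, n_rel] relation scores


rel_scores = ObjectContext()(torch.randn(6, 512))     # toy image with 6 detected objects
```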
Proceedings Article · DOI

Towards Diverse and Natural Image Descriptions via a Conditional GAN

TL;DR: This paper proposes a new framework based on Conditional Generative Adversarial Networks (CGAN), which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content.
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
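A minimal sketch of the "very small filters, greater depth" recipe: stacks of 3x3 convolutions separated by 2x2 max-pooling. The channel plan mirrors the convolutional part of the 16-layer configuration, but the code is illustrative rather than a reproduction of the released model.

```python
import torch
import torch.nn as nn


def vgg_block(in_ch, out_ch, n_convs):
    """A stack of small 3x3 convolutions followed by 2x2 max-pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)


features = nn.Sequential(
    vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
    vgg_block(256, 512, 3), vgg_block(512, 512, 3),   # 13 conv layers, VGG-16 style
)
print(features(torch.randn(1, 3, 224, 224)).shape)     # torch.Size([1, 512, 7, 7])
```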
Proceedings Article · DOI

You Only Look Once: Unified, Real-Time Object Detection

TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
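The unified single-pass formulation can be illustrated by decoding one output tensor: an S x S grid in which each cell predicts B boxes plus C class probabilities. The 7 x 7 x 30 layout below follows the commonly cited configuration, and the random tensor stands in for a real network output.

```python
import torch

# Hypothetical decoding of a YOLO-style output tensor: one forward pass yields
# an S x S grid, each cell predicting B boxes (x, y, w, h, conf) and C classes.
S, B, C = 7, 2, 20
pred = torch.rand(S, S, B * 5 + C)                    # stand-in for the network output

boxes = pred[..., :B * 5].reshape(S, S, B, 5)         # per-box (x, y, w, h, conf)
class_probs = pred[..., B * 5:]                       # per-cell class distribution
# Class-specific confidence for every box: conf * P(class | object).
scores = boxes[..., 4:5] * class_probs.unsqueeze(2)   # [S, S, B, C]
print(scores.flatten().max())                         # highest-scoring detection
```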
Proceedings Article · DOI

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
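A hypothetical sketch of the per-region pipeline: each bottom-up proposal is warped to a fixed size and classified by a CNN in its own forward pass. The backbone, class count, and proposal boxes are placeholders, not the original setup.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18   # stand-in backbone for illustration

image = torch.rand(3, 480, 640)                                    # dummy input image
proposals = torch.tensor([[30., 40., 220., 300.],                  # (x1, y1, x2, y2)
                          [100., 60., 400., 420.]])

backbone = resnet18()                     # supervised pre-training would load weights here
backbone.fc = torch.nn.Linear(backbone.fc.in_features, 21)         # fine-tuned head: 20 classes + background

scores = []
for x1, y1, x2, y2 in proposals.long().tolist():
    crop = image[:, y1:y2, x1:x2].unsqueeze(0)                     # crop the proposal
    crop = F.interpolate(crop, size=(224, 224), mode="bilinear")   # warp to a fixed size
    scores.append(backbone(crop))                                  # one forward pass per region
scores = torch.cat(scores)                                         # [num_proposals, 21]
```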
Book Chapter · DOI

SSD: Single Shot MultiBox Detector

TL;DR: The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, which makes SSD easy to train and straightforward to integrate into systems that require a detection component.
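The default-box tiling can be sketched as follows: at every location of a feature map, boxes of one scale and several aspect ratios are generated in relative coordinates, and the sets from all feature maps are concatenated for prediction. The scales, aspect ratios, and map sizes below are illustrative choices, not SSD's exact configuration.

```python
import numpy as np


def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Tile (cx, cy, w, h) default boxes over an fmap_size x fmap_size map,
    in relative [0, 1] image coordinates."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                w, h = scale * np.sqrt(ar), scale / np.sqrt(ar)
                boxes.append([cx, cy, w, h])
    return np.array(boxes)


# One set per feature map, e.g. a fine 38x38 map with small boxes and a coarse
# 3x3 map with large ones; all sets are concatenated for prediction.
all_boxes = np.concatenate([default_boxes(38, 0.1), default_boxes(3, 0.8)])
print(all_boxes.shape)   # (38*38*3 + 3*3*3, 4) = (4359, 4)
```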
Proceedings Article · DOI

Fast R-CNN

TL;DR: Fast R-CNN is a Fast Region-based Convolutional Network method for object detection that employs several innovations to improve training and testing speed while also increasing detection accuracy, achieving a higher mAP on PASCAL VOC 2012.
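The speed-up comes from running the backbone once per image and pooling each region of interest from the shared feature map. Below is a minimal sketch using torchvision's roi_pool; the feature-map size, stride, and boxes are made-up values.

```python
import torch
from torchvision.ops import roi_pool

# The image is pushed through the CNN once; each region of interest is then
# pooled from the shared feature map instead of being re-processed from pixels.
feature_map = torch.rand(1, 256, 50, 60)                 # backbone output, assumed stride 16
rois = torch.tensor([[0., 48., 48., 320., 320.],         # (batch_idx, x1, y1, x2, y2)
                     [0., 160., 80., 480., 400.]])       # in input-image coordinates
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)   # torch.Size([2, 256, 7, 7]) -> fed to the classifier/regressor heads
```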