Open Access Proceedings ArticleDOI

Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

TLDR
This work uses knowledge of linguistic statistics to regularize visual model learning and suggests that with this linguistic knowledge distillation, the model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships.
Abstract
Understanding the visual relationship between two objects involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the ⟨subj, obj⟩ pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships compared to modeling them independently, but it complicates learning since the semantic space of visual relationships is huge and training data is limited, especially for long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a (subj, obj) pair. As we train the visual model, we distill this knowledge into the deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on the VRD zero-shot testing set).
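The abstract describes two concrete steps: estimating P(pred | subj, obj) from mined triplets and distilling that distribution into the visual model during training. Below is a minimal PyTorch sketch of this idea; the function names, the smoothing constant, and the cross-entropy-plus-KL form of the loss are illustrative assumptions rather than the paper's exact teacher-student objective.

```python
import torch
import torch.nn.functional as F
from collections import Counter, defaultdict

def predicate_prior(triplets, num_predicates, pred_index, smoothing=1e-3):
    """Estimate P(pred | subj, obj) from mined (subj, pred, obj) strings.

    `triplets` stands in for relationships parsed from training annotations
    or external text such as Wikipedia; the paper's actual parsing pipeline
    is more involved than this counting step.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in triplets:
        counts[(subj, obj)][pred_index[pred]] += 1
    prior = {}
    for pair, ctr in counts.items():
        probs = torch.full((num_predicates,), smoothing)  # additive smoothing
        for p, c in ctr.items():
            probs[p] += c
        prior[pair] = probs / probs.sum()  # normalize to a distribution
    return prior

def distillation_loss(logits, targets, prior_probs, alpha=0.5):
    """Combine the usual data term with a pull toward the linguistic prior.

    `logits`: [batch, num_predicates] predicate scores from the visual model.
    `targets`: ground-truth predicate ids. `prior_probs`: the mined
    P(pred | subj, obj) row for each example. `alpha` is a hypothetical
    weight balancing the two terms.
    """
    ce = F.cross_entropy(logits, targets)
    kl = F.kl_div(F.log_softmax(logits, dim=1), prior_probs,
                  reduction="batchmean")
    return ce + alpha * kl
```

In the paper's full formulation the prior is injected through a teacher distribution rather than added directly to the loss; this sketch shows only the underlying regularization idea.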


Citations
Journal ArticleDOI

Improving Visual Relationship Detection With Two-Stage Correlation Exploitation

TL;DR: Experiments show that the proposed unified visual relationship detection framework, which exploits two types of correlation to address the combination-explosion problem in the object-pair proposal stage and the non-exclusive label problem in the predicate recognition stage, outperforms current state-of-the-art methods.
Proceedings ArticleDOI

Memory-Based Network for Scene Graph with Unbalanced Relations

TL;DR: This work proposes a novel scene graph generation model that effectively improves the detection of low-frequency relations, using memory features to transfer high-frequency relation features to low-frequency ones.
Proceedings ArticleDOI

Learning Prototypes for Visual Relationship Detection

TL;DR: This paper proposes a framework for learning predicate prototypes that aims to capture the multimodal nature of predicate distributions, and finds that coupling prototype learning with a nearest-neighbors approach increases performance from 85.4% to 87.6% over a standard classification approach.
Proceedings ArticleDOI

Hierarchical Visual Relationship Detection

TL;DR: A novel VRD task named hierarchical visual relationship detection (HVRD), which encourages predictions with abstract yet compatible relationship triplets when confidence in the specific image content is relatively low, and which can handle the inevitable ambiguity of ground-truth annotation in VRD.
Posted Content

2nd Place Solution to the GQA Challenge 2019

TL;DR: A simple method that achieves unexpectedly strong performance on complex-reasoning visual question answering, together with an analysis that shows significant gaps when the same reasoning model uses 1) ground-truth features, 2) statistical features, or 3) detected features from completely learned detectors, and what these gaps mean for research on visual reasoning.
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings ArticleDOI

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

TL;DR: R-CNN, as discussed by the authors, combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best-performing techniques based on different types of neural networks.
Proceedings ArticleDOI

Fast R-CNN

TL;DR: Fast R-CNN as discussed by the authors proposes a Fast Region-based Convolutional Network method for object detection, which employs several innovations to improve training and testing speed while also increasing detection accuracy, achieving a higher mAP on PASCAL VOC 2012.
Posted Content

Fast R-CNN

TL;DR: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection that builds on previous work to efficiently classify object proposals using deep convolutional networks.