Open Access Proceedings ArticleDOI

Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

TLDR
This work uses knowledge of linguistic statistics to regularize visual model learning and suggests that with this linguistic knowledge distillation, the model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships.
Abstract
Understanding the visual relationship between two objects involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the ⟨subj, obj⟩ pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships compared to modeling them independently, but it complicates learning since the semantic space of visual relationships is huge and training data is limited, especially for long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a ⟨subj, obj⟩ pair. As we train the visual model, we distill this knowledge into the deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on the VRD zero-shot testing set).
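The abstract's pipeline is concrete enough to sketch: count (subj, pred, obj) triples mined from annotations or parsed text to estimate P(pred | subj, obj), then use that distribution as a soft target that regularizes the visual model's predicate predictions during training. Below is a minimal PyTorch sketch of that idea; the function names, the additive `smoothing` estimate, the uniform fallback for unseen pairs, and the single KL-divergence term weighted by `alpha` are illustrative assumptions, not the paper's exact teacher-student objective.

```python
import torch
import torch.nn.functional as F
from collections import defaultdict

# Internal knowledge: estimate P(pred | subj, obj) by counting mined triples.
def mine_predicate_prior(triples, num_predicates, smoothing=1.0):
    """triples: iterable of (subj_id, pred_id, obj_id) from annotations or text."""
    counts = defaultdict(lambda: torch.full((num_predicates,), smoothing))
    for s, p, o in triples:
        counts[(s, o)][p] += 1.0
    # Normalize counts into a conditional distribution per (subj, obj) pair.
    return {pair: c / c.sum() for pair, c in counts.items()}

# Distillation: pull the visual model's predicate distribution toward the prior.
def distillation_loss(logits, labels, pairs, prior, alpha=0.5, temperature=1.0):
    """logits: (B, P) predicate scores; labels: (B,) ground-truth predicate ids;
    pairs: list of (subj_id, obj_id) per example; prior: from mine_predicate_prior."""
    num_predicates = logits.size(1)
    uniform = torch.full((num_predicates,), 1.0 / num_predicates)
    hard = F.cross_entropy(logits, labels)  # standard supervised term
    # Linguistic teacher distribution; fall back to uniform for pairs never mined.
    teacher = torch.stack([prior.get(pair, uniform) for pair in pairs])
    log_student = F.log_softmax(logits / temperature, dim=-1)
    soft = F.kl_div(log_student, teacher, reduction="batchmean")  # KL(teacher || student)
    return (1.0 - alpha) * hard + alpha * soft
```

In this sketch, internal and external knowledge differ only in which corpus the triples are mined from; the external (e.g., Wikipedia) statistics matter most for zero-shot ⟨subj, obj⟩ pairs, which a prior counted purely from training annotations never sees.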


Citations
Journal ArticleDOI

Multiscale Conditional Relationship Graph Network for Referring Relationships in Images

TL;DR: Wang et al. propose a multiscale conditional relationship graph network (CRGN) to localize entities based on visual relationships; an attention pyramid network first generates multiscale attention maps that capture entities of various sizes for entity matching.
Proceedings ArticleDOI

Towards Interpretable Video Anomaly Detection

TL;DR: Wang et al. propose a framework that detects anomalous events in a surveillance video by monitoring the interactions between objects and explains their root causes; the resulting scene graphs provide an interpretation of the anomaly's context while performing competitively with recent state-of-the-art approaches.
Journal ArticleDOI

A Symmetric Fusion Learning Model for Detecting Visual Relations and Scene Parsing

TL;DR: Wang et al. propose a symmetric fusion learning model for visual relationship detection and scene graph parsing that integrates object and relationship features at the visual and semantic levels for better relation feature mapping.
Journal ArticleDOI

Modeling graph-structured contexts for image captioning

TL;DR: Zhang et al. combine scene graphs with a Transformer to explicitly encode the visual relationships between detected objects, achieving state-of-the-art results on all the standard evaluation metrics.
Journal ArticleDOI

Modeling Semantic Correlation and Hierarchy for Real-world Wildlife Recognition

TL;DR: The authors explore the challenges of using human-in-the-loop frameworks to label wildlife recognition datasets with a neural network, and propose leveraging semantic correlation by adding a co-occurrence layer to the model during training so that it trains more effectively.
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings ArticleDOI

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training on an auxiliary task followed by domain-specific fine-tuning yields a significant performance boost.
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large datasets; the quality of these representations is measured on a word similarity task and compared to the previously best-performing techniques based on different types of neural networks.
Proceedings ArticleDOI

Fast R-CNN

TL;DR: Fast R-CNN is a Fast Region-based Convolutional Network method for object detection that employs several innovations to improve training and testing speed while also increasing detection accuracy, achieving a higher mAP on PASCAL VOC 2012.
Posted Content

Fast R-CNN

TL;DR: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection that builds on previous work to efficiently classify object proposals using deep convolutional networks.