Open Access Proceedings ArticleDOI

Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

TLDR
This work uses knowledge of linguistic statistics to regularize visual model learning and suggests that with this linguistic knowledge distillation, the model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships.
Abstract
Understanding the visual relationship between two objects involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the ⟨subj, obj⟩ pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships compared to modeling them independently, but it complicates learning since the semantic space of visual relationships is huge and training data is limited, especially for long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a ⟨subj, obj⟩ pair. As we train the visual model, we distill this knowledge into the deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on the VRD zero-shot testing set).
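The abstract's pipeline is concrete enough to sketch: count (subj, pred, obj) triples mined from annotations or parsed text to estimate P(pred | subj, obj), then use that distribution as a soft target that regularizes the visual model's predicate predictions during training. Below is a minimal PyTorch sketch of that idea; the function names, the additive `smoothing` estimate, the uniform fallback for unseen pairs, and the single KL-divergence term weighted by `alpha` are illustrative assumptions, not the paper's exact teacher-student objective.

```python
import torch
import torch.nn.functional as F
from collections import defaultdict

# Internal knowledge: estimate P(pred | subj, obj) by counting mined triples.
def mine_predicate_prior(triples, num_predicates, smoothing=1.0):
    """triples: iterable of (subj_id, pred_id, obj_id) from annotations or text."""
    counts = defaultdict(lambda: torch.full((num_predicates,), smoothing))
    for s, p, o in triples:
        counts[(s, o)][p] += 1.0
    # Normalize counts into a conditional distribution per (subj, obj) pair.
    return {pair: c / c.sum() for pair, c in counts.items()}

# Distillation: pull the visual model's predicate distribution toward the prior.
def distillation_loss(logits, labels, pairs, prior, alpha=0.5, temperature=1.0):
    """logits: (B, P) predicate scores; labels: (B,) ground-truth predicate ids;
    pairs: list of (subj_id, obj_id) per example; prior: from mine_predicate_prior."""
    num_predicates = logits.size(1)
    uniform = torch.full((num_predicates,), 1.0 / num_predicates)
    hard = F.cross_entropy(logits, labels)  # standard supervised term
    # Linguistic teacher distribution; fall back to uniform for pairs never mined.
    teacher = torch.stack([prior.get(pair, uniform) for pair in pairs])
    log_student = F.log_softmax(logits / temperature, dim=-1)
    soft = F.kl_div(log_student, teacher, reduction="batchmean")  # KL(teacher || student)
    return (1.0 - alpha) * hard + alpha * soft
```

In this sketch, internal and external knowledge differ only in which corpus the triples are mined from; the external (e.g., Wikipedia) statistics matter most for zero-shot ⟨subj, obj⟩ pairs, which a prior counted purely from training annotations never sees.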


Citations
Journal ArticleDOI

Multiscale Conditional Relationship Graph Network for Referring Relationships in Images

TL;DR: Wang et al. propose a multiscale conditional relationship graph network (CRGN) to localize entities based on visual relationships; an attention pyramid network first generates multiscale attention maps that capture entities of various sizes for entity matching.
Proceedings ArticleDOI

Towards Interpretable Video Anomaly Detection

TL;DR: Wang et al. propose a framework that detects anomalous events in a surveillance video by monitoring the interactions between objects and explains their root causes; the resulting scene graphs provide an interpretation of the anomaly's context while performing competitively with recent state-of-the-art approaches.
Journal ArticleDOI

A Symmetric Fusion Learning Model for Detecting Visual Relations and Scene Parsing

TL;DR: Wang et al. propose a symmetric fusion learning model for visual relationship detection and scene graph parsing that integrates object and relationship features at the visual and semantic levels for better relation feature mapping.
Journal ArticleDOI

Modeling graph-structured contexts for image captioning

TL;DR: Zhang et al. combine scene graphs with a Transformer to explicitly encode the visual relationships between detected objects, achieving state-of-the-art results on all the standard evaluation metrics.
Journal ArticleDOI

Modeling Semantic Correlation and Hierarchy for Real-world Wildlife Recognition

TL;DR: The authors explore the challenges of using human-in-the-loop frameworks to label wildlife recognition datasets with a neural network, and propose leveraging semantic correlation by adding a co-occurrence layer to the model during training so that it trains more effectively.
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings ArticleDOI

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training on an auxiliary task followed by domain-specific fine-tuning yields a significant performance boost.
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large datasets; the quality of these representations is measured on a word similarity task and compared to the previously best-performing techniques based on different types of neural networks.
Proceedings ArticleDOI

Fast R-CNN

TL;DR: Fast R-CNN is a Fast Region-based Convolutional Network method for object detection that employs several innovations to improve training and testing speed while also increasing detection accuracy, achieving a higher mAP on PASCAL VOC 2012.
Posted Content

Fast R-CNN

TL;DR: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection that builds on previous work to efficiently classify object proposals using deep convolutional networks.