Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation
Ruichi Yu, Ang Li, Vlad I. Morariu, Larry S. Davis
pp. 1068-1076
TLDR
This work uses knowledge of linguistic statistics to regularize visual model learning and suggests that, with this linguistic knowledge distillation, the model significantly outperforms state-of-the-art methods, especially when predicting unseen relationships.
Abstract:
Understanding the visual relationship between two objects involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the ⟨subj, obj⟩ pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships compared to modeling them independently, but it complicates learning since the semantic space of visual relationships is huge and training data is limited, especially for long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a (subj, obj) pair. As we train the visual model, we distill this knowledge into the deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on VRD zero-shot testing set).
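The abstract's two ingredients, a conditional distribution of predicates given a (subj, obj) pair and a distillation term that pulls the visual model toward it, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy triplets, the additive smoothing, and the weighted cross-entropy form of the loss are not taken from the paper, which may define its objective differently.

```python
from collections import Counter, defaultdict
import math


def predicate_distribution(triplets, predicates, smoothing=0.0):
    """Estimate P(pred | subj, obj) by counting (subj, pred, obj) triplets.

    `smoothing` is a hypothetical additive-smoothing constant, not a
    parameter described in the abstract.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in triplets:
        counts[(subj, obj)][pred] += 1

    dist = {}
    for pair, pred_counts in counts.items():
        total = sum(pred_counts.values()) + smoothing * len(predicates)
        dist[pair] = {p: (pred_counts[p] + smoothing) / total for p in predicates}
    return dist


def distillation_loss(student_probs, label, prior, weight=0.5, eps=1e-12):
    """Illustrative loss: cross-entropy on the ground-truth predicate plus a
    cross-entropy term toward the linguistic prior (an assumed form)."""
    ce_label = -math.log(student_probs[label] + eps)
    ce_prior = -sum(q * math.log(student_probs[p] + eps) for p, q in prior.items())
    return (1.0 - weight) * ce_label + weight * ce_prior


# Toy annotations standing in for "internal knowledge" (hypothetical data).
triplets = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "next to", "horse"),
    ("person", "wear", "hat"),
]
predicates = ["ride", "next to", "wear"]
prior = predicate_distribution(triplets, predicates)

# Two hypothetical visual-model outputs for a (person, horse) pair.
agrees = {"ride": 0.7, "next to": 0.25, "wear": 0.05}
disagrees = {"ride": 0.1, "next to": 0.1, "wear": 0.8}
loss_agrees = distillation_loss(agrees, "ride", prior[("person", "horse")])
loss_disagrees = distillation_loss(disagrees, "ride", prior[("person", "horse")])
```

In this toy setup the prediction that matches the counted statistics receives the lower loss, which is the regularizing effect the abstract describes; external knowledge (e.g., triplets parsed from Wikipedia text) could be fed through the same counting function.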
Citations
Proceedings Article
On Exploring Undetermined Relationships for Visual Relationship Detection
TL;DR: This work proposes a multi-modal feature based undetermined relationship learning network (MF-URLN), which extracts and fuses features of object pairs from three complementary modalities: visual, spatial, and linguistic.
Proceedings Article
Detecting Visual Relationships Using Box Attention
TL;DR: In this article, a box attention mechanism is proposed to model pairwise interactions between objects using standard object detection pipelines, and the resulting model is conceptually clean, expressive and relies on well-justified training and prediction procedures.
Proceedings Article
Hierarchical Graph Attention Network for Visual Relationship Detection
Li Mi, Zhenzhong Chen
TL;DR: A Hierarchical Graph Attention Network (HGAT) is proposed to capture the dependencies on both object-level and triplet-level, which significantly outperforms the state-of-the-art methods on VG and VRD datasets.
Journal Article
Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation
TL;DR: A context-augmented translation embedding model that can capture both common and rare relations, which outperforms previous translation-based models and comes close to or exceeds the state of the art across a range of settings.
Proceedings Article
Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context
TL;DR: This work proposes a novel sliding-window scheme to simultaneously predict short-term and long-term relationships in video visual relation detection and achieves state-of-the-art performance on both the ImageNet-VidVRD and VidOR datasets across multiple tasks.
References
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Proceedings Article
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Posted Content
Efficient Estimation of Word Representations in Vector Space
TL;DR: This paper proposed two novel model architectures for computing continuous vector representations of words from very large data sets, and the quality of these representations is measured in a word similarity task and the results are compared to the previously best performing techniques based on different types of neural networks.
Proceedings Article
Fast R-CNN
TL;DR: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection, which employs several innovations to improve training and testing speed while also increasing detection accuracy, and achieves a higher mAP on PASCAL VOC 2012.
Posted Content
Fast R-CNN
TL;DR: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection that builds on previous work to efficiently classify object proposals using deep convolutional networks.