Proceedings Article•DOI•

Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation

01 Oct 2017, pp. 1068-1076
TL;DR: This work uses knowledge of linguistic statistics to regularize visual model learning and suggests that with this linguistic knowledge distillation, the model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships.
Abstract: Understanding the visual relationship between two objects involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the ⟨subj, obj⟩ pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects. Modeling the three entities jointly more accurately reflects their relationships compared to modeling them independently, but it complicates learning since the semantic space of visual relationships is huge and training data is limited, especially for long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), computing the conditional probability distribution of a predicate given a (subj, obj) pair. As we train the visual model, we distill this knowledge into the deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model outperforms the state-of-the-art methods significantly, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on the VRD zero-shot testing set).
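A minimal sketch of the core regularization idea in PyTorch: mix the usual cross-entropy on ground-truth predicates with a KL term that pulls the model's predicted distribution toward the mined linguistic prior P(pred | subj, obj). The function name, weighting, and shapes are illustrative assumptions, not the authors' exact teacher-student formulation.

```python
import torch
import torch.nn.functional as F

def linguistic_kd_loss(logits, target, ling_prior, alpha=0.5):
    """logits: (B, P) predicate scores from the visual model.
    target: (B,) ground-truth predicate indices.
    ling_prior: (B, P) rows of P(pred | subj, obj) mined from
    annotations (internal) or text such as Wikipedia (external)."""
    ce = F.cross_entropy(logits, target)            # fit ground truth
    kl = F.kl_div(F.log_softmax(logits, dim=1),     # match the mined prior
                  ling_prior, reduction="batchmean")
    return ce + alpha * kl                          # alpha is illustrative
```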
Citations
Proceedings Article•DOI•
01 Jun 2018
TL;DR: This paper proposes the Neuron Importance Score Propagation (NISP) algorithm to propagate the importance scores of final responses to every neuron in the network.
Abstract: To reduce the significant redundancy in deep Convolutional Neural Networks (CNNs), most existing methods prune neurons by only considering the statistics of an individual layer or two consecutive layers (e.g., prune one layer to minimize the reconstruction error of the next layer), ignoring the effect of error propagation in deep networks. In contrast, we argue that for a pruned network to retain its predictive power, it is essential to prune neurons in the entire neural network jointly based on a unified goal: minimizing the reconstruction error of important responses in the "final response layer" (FRL), which is the second-to-last layer before classification. Specifically, we apply feature ranking techniques to measure the importance of each neuron in the FRL, formulate network pruning as a binary integer optimization problem, and derive a closed-form solution to it for pruning neurons in earlier layers. Based on our theoretical analysis, we propose the Neuron Importance Score Propagation (NISP) algorithm to propagate the importance scores of final responses to every neuron in the network. The CNN is pruned by removing the neurons with the least importance, and it is then fine-tuned to recover its predictive power. NISP is evaluated on several datasets with multiple CNN models and demonstrated to achieve significant acceleration and compression with negligible accuracy loss.
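A toy illustration of the propagation rule in the spirit of NISP, for a plain MLP: importance scores assigned in the final response layer flow backwards through the absolute weight matrices, and the least important neurons are pruned. All names and the keep ratio are assumptions for illustration, not the paper's code.

```python
import numpy as np

def propagate_importance(weights, frl_scores):
    """weights: per-layer (out_dim, in_dim) matrices, input side first.
    frl_scores: importance of each neuron in the final response layer."""
    scores = [frl_scores]
    for W in reversed(weights):                  # walk back from the FRL
        scores.append(np.abs(W).T @ scores[-1])  # s_in = |W|^T s_out
    return scores[::-1]                          # input-side layer first

def keep_mask(s, keep_ratio=0.5):
    k = max(1, int(round(len(s) * keep_ratio)))
    return s >= np.sort(s)[-k]                   # keep the top-k neurons
```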

658 citations

Proceedings Article•DOI•
18 Jun 2018
TL;DR: In this paper, the role of motifs in scene graph representation is investigated and the Stacked Motif Networks architecture is proposed to capture higher-order motifs.
Abstract: We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We present new quantitative insights on such repeated structures in the Visual Genome dataset. Our analysis shows that object labels are highly predictive of relation labels but not vice-versa. We also find that there are recurring patterns even in larger subgraphs: more than 50% of graphs contain motifs involving at least two relations. Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. This baseline improves on the previous state-of-the-art by an average of 3.6% relative improvement across evaluation settings. We then introduce Stacked Motif Networks, a new architecture designed to capture higher order motifs in scene graphs that further improves over our strong baseline by an average 7.1% relative gain. Our code is available at github.com/rowanz/neural-motifs.
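The frequency baseline described in the abstract is simple enough to state in a few lines of Python; the tiny triple list below is made up to show the behaviour.

```python
from collections import Counter, defaultdict

def build_freq_table(triples):
    """triples: iterable of (subject, predicate, object) training labels."""
    table = defaultdict(Counter)
    for s, p, o in triples:
        table[(s, o)][p] += 1
    return table

def most_frequent_predicate(table, subj, obj, default=None):
    counts = table.get((subj, obj))
    return counts.most_common(1)[0][0] if counts else default

table = build_freq_table([("man", "riding", "horse"),
                          ("man", "next to", "horse"),
                          ("man", "riding", "horse")])
print(most_frequent_predicate(table, "man", "horse"))  # -> riding
```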

577 citations

Posted Content•
TL;DR: Teacher assistant knowledge distillation is introduced, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher; the effect of teacher assistant size is studied and the framework is extended to multi-step distillation.
Abstract: Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (teacher) pre-trained network is used to train a smaller (student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher, or in other words, a teacher can effectively transfer its knowledge to students up to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multi-step distillation. Theoretical analysis and extensive experiments on CIFAR-10 and CIFAR-100 and ImageNet datasets and on CNN and ResNet architectures substantiate the effectiveness of our proposed approach.
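At each hop of the teacher → assistant → student chain, the loss is ordinary temperature-softened distillation; a hedged sketch (temperature and mixing weight are illustrative, not the paper's settings):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, target, T=4.0, lam=0.9):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)  # standard T^2 rescaling
    hard = F.cross_entropy(student_logits, target)
    return lam * soft + (1 - lam) * hard

# Step 1: train the assistant with kd_loss(assistant_out, teacher_out, y).
# Step 2: train the student  with kd_loss(student_out, assistant_out, y).
```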

399 citations


Cites background from "Visual Relationship Detection with ..."

  • ...To solve this issue, Yim et al.; Yu et al. (2017; 2017) used a shared representation of layers, however, it’s not straightforward to choose the appropriate layer to be matched....


  • ...Since then, knowledge distillation has been widely adopted in a variety of learning tasks (Yim et al. 2017; Yu et al. 2017; Schmitt et al. 2018; Chen et al. 2017)....


Proceedings Article•DOI•
Li Yuan, Francis E. H. Tay, Guilin Li, Tao Wang, Jiashi Feng
14 Jun 2020
TL;DR: It is argued that the success of KD is due not only to the similarity information between categories provided by teachers, but also to the regularization of soft targets, which is equally or even more important.
Abstract: Knowledge Distillation (KD) aims to distill the knowledge of a cumbersome teacher model into a lightweight student model. Its success is generally attributed to the privileged information on similarities among categories provided by the teacher model, and in this sense, only strong teacher models are deployed to teach weaker students in practice. In this work, we challenge this common belief by the following experimental observations: 1) beyond the acknowledgment that the teacher can improve the student, the student can also enhance the teacher significantly by reversing the KD procedure; 2) a poorly-trained teacher with much lower accuracy than the student can still improve the latter significantly. To explain these observations, we provide a theoretical analysis of the relationships between KD and label smoothing regularization. We prove that 1) KD is a type of learned label smoothing regularization and 2) label smoothing regularization provides a virtual teacher model for KD. From these results, we argue that the success of KD is due not only to the similarity information between categories from teachers, but also to the regularization of soft targets, which is equally or even more important. Based on these analyses, we further propose a novel Teacher-free Knowledge Distillation (Tf-KD) framework, where a student model learns from itself or from a manually designed regularization distribution. Tf-KD achieves performance comparable to normal KD from a superior teacher and applies well when a stronger teacher model is unavailable. Meanwhile, Tf-KD is generic and can be directly deployed for training deep neural networks. Without any extra computation cost, Tf-KD achieves up to 0.65% improvement on ImageNet over well-established baseline models, which is superior to label smoothing regularization.
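A sketch of the manually designed "virtual teacher" idea: place probability a on the true class and spread the rest uniformly, which makes the KD objective behave like label smoothing. Hyperparameters and names are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def virtual_teacher(target, num_classes, a=0.9):
    t = torch.full((target.size(0), num_classes),
                   (1 - a) / (num_classes - 1))
    t.scatter_(1, target.unsqueeze(1), a)  # true class gets probability a
    return t                               # soft targets, no teacher network

def tf_kd_loss(logits, target, a=0.9, alpha=0.5):
    soft = virtual_teacher(target, logits.size(1), a)
    kl = F.kl_div(F.log_softmax(logits, dim=1), soft,
                  reduction="batchmean")
    return (1 - alpha) * F.cross_entropy(logits, target) + alpha * kl
```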

368 citations


Cites methods from "Visual Relationship Detection with ..."

  • ...Knowledge Distillation Since [7] proposed knowledge distillation based on prior work [2], KD has been widely adopted or modified [14, 19, 20, 4, 1, 11, 17]....


Proceedings Article•DOI•
15 Jun 2019
TL;DR: This paper proposes a novel scene graph generation algorithm with external knowledge and image reconstruction loss to overcome dataset issues, and extracts commonsense knowledge from the external knowledge base to refine object and phrase features for improving generalizability in scene graph generation.
Abstract: Scene graph generation has received growing attention with the advancements in image understanding tasks such as object detection, attributes and relationship prediction, etc. However, existing datasets are biased in terms of object and relationship labels, or often come with noisy and missing annotations, which makes the development of a reliable scene graph prediction model very challenging. In this paper, we propose a novel scene graph generation algorithm with external knowledge and image reconstruction loss to overcome these dataset issues. In particular, we extract commonsense knowledge from the external knowledge base to refine object and phrase features for improving generalizability in scene graph generation. To address the bias of noisy object annotations, we introduce an auxiliary image reconstruction path to regularize the scene graph generation network. Extensive experiments show that our framework can generate better scene graphs, achieving state-of-the-art performance on two benchmarks: the Visual Relationship Detection and Visual Genome datasets.
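A toy sketch of the knowledge-based refinement direction: look up commonsense facts for a detected label and gate their embedding into the object feature. The fact table, random placeholder embeddings, and scalar gate below are stand-ins for the paper's ConceptNet retrieval and learned refinement module.

```python
import torch

FACTS = {"dog": ["dog is an animal", "dog can run"],
         "horse": ["horse is an animal", "person can ride horse"]}

def embed(text, dim):
    g = torch.Generator().manual_seed(abs(hash(text)) % (2**31))
    return torch.randn(dim, generator=g)    # placeholder text embedding

def refine(obj_feat, label):
    facts = FACTS.get(label, [])
    if not facts:
        return obj_feat
    knowledge = torch.stack([embed(f, obj_feat.numel())
                             for f in facts]).mean(0)
    gate = torch.sigmoid(obj_feat @ knowledge)  # scalar relevance gate
    return obj_feat + gate * knowledge

print(refine(torch.randn(16), "horse").shape)   # torch.Size([16])
```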

271 citations


Cites background from "Visual Relationship Detection with ..."

  • ...Unlike [45], which leverages linguistic knowledge to regularize the network, our knowledge-based module improves the feature refining procedure by reasoning over a basket of commonsense knowledge retrieved from ConceptNet....


  • ...[45] extract linguistic knowledge from training annotations and Wikipedia, and distill knowledge to regularize training and provide extra cues for inference....


  • ...One direction to improve the datadriven model is to distill external knowledge into Deep Neural Networks [39, 45, 18]....


References
Proceedings Article•
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
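The design rule in the abstract (depth via stacked 3x3 filters) is easy to see in code: two 3x3 convolutions cover a 5x5 receptive field with 2·9·C² weights instead of 25·C². A sketch of VGG-16's convolutional trunk in PyTorch, for illustration:

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))           # halve spatial resolution
    return nn.Sequential(*layers)

# 2 + 2 + 3 + 3 + 3 = 13 conv layers; with 3 FC layers this is VGG-16.
features = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2),
                         vgg_block(128, 256, 3), vgg_block(256, 512, 3),
                         vgg_block(512, 512, 3))
```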

49,914 citations

Proceedings Article•DOI•
23 Jun 2014
TL;DR: R-CNN combines high-capacity CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
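The pipeline in the abstract, reduced to a schematic Python sketch: `propose`, `cnn`, and `classifiers` are assumed callables standing in for selective search, the fine-tuned CNN, and the per-class SVMs, respectively.

```python
import numpy as np

def crop_and_warp(image, box, size=224):
    """Warp a proposal to the CNN's fixed input size (nearest neighbour)."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    ys = np.linspace(0, crop.shape[0] - 1, size).astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, size).astype(int)
    return crop[ys][:, xs]

def rcnn_detect(image, propose, cnn, classifiers, threshold=0.5):
    detections = []
    for box in propose(image):                 # ~2k bottom-up proposals
        feat = cnn(crop_and_warp(image, box))  # CNN features per region
        for cls, score_fn in classifiers.items():
            score = score_fn(feat)             # per-class scoring (SVMs)
            if score > threshold:
                detections.append((box, cls, score))
    return detections  # per-class NMS would follow in the full system
```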

21,729 citations

Posted Content•
TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks.
Abstract: We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
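A quick usage sketch of the skip-gram model described above via the gensim library (assuming gensim >= 4 is installed); the paper introduces the architectures themselves rather than this API, and the toy corpus is made up.

```python
from gensim.models import Word2Vec

sentences = [["person", "rides", "horse"],
             ["person", "rides", "elephant"],
             ["dog", "chases", "cat"]]
model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1)             # sg=1 selects skip-gram
print(model.wv.most_similar("horse", topn=2))   # nearest words by cosine
```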

20,077 citations

Proceedings Article•DOI•
Ross Girshick
07 Dec 2015
TL;DR: Fast R-CNN, a Fast Region-based Convolutional Network method for object detection, employs several innovations to improve training and testing speed while also increasing detection accuracy, and achieves a higher mAP on PASCAL VOC 2012.
Abstract: This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
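The key ingredient behind Fast R-CNN's speedup is RoI pooling, which lets all proposals share a single convolutional pass over the image. A sketch using torchvision's roi_pool; the shapes and scale are illustrative.

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 256, 50, 50)          # conv features for one image
rois = [torch.tensor([[0., 0., 100., 100.],
                      [40., 40., 160., 120.]])]  # boxes in image pixels
# spatial_scale maps image coordinates (say 800 px) onto the 50x50 map
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=50 / 800)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]), fed to the FC head
```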

14,824 citations



"Visual Relationship Detection with ..." refers methods in this paper

  • ...To enable fully automatic phrase and relationship detection, we train a Fast R-CNN detector [7] using VGG-16 for object detection....
