Proceedings ArticleDOI

Scene Graph Generation by Iterative Message Passing

01 Jul 2017, pp. 3097-3106
TL;DR: The scene graph generation problem is formulated as message passing between the primal node graph and its dual edge graph, so that contextual cues can be used to make better predictions on objects and their relationships.
Abstract: Understanding a visual scene goes beyond recognizing individual objects in isolation. Relationships between objects also constitute rich semantic information about the scene. In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image. We propose a novel end-to-end model that generates such structured scene representation from an input image. Our key insight is that the graph generation problem can be formulated as message passing between the primal node graph and its dual edge graph. Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. The experiments show that our model significantly outperforms previous methods on the Visual Genome dataset as well as support relation inference in NYU Depth V2 dataset.


Scene Graph Generation by Iterative Message Passing
Danfei Xu¹  Yuke Zhu¹  Christopher B. Choy²  Li Fei-Fei¹
¹Department of Computer Science, Stanford University
²Department of Electrical Engineering, Stanford University
{danfei, yukez, chrischoy, feifeili}@cs.stanford.edu
Abstract
Understanding a visual scene goes beyond recognizing
individual objects in isolation. Relationships between ob-
jects also constitute rich semantic information about the
scene. In this work, we explicitly model the objects and
their relationships using scene graphs, a visually-grounded
graphical structure of an image. We propose a novel end-
to-end model that generates such structured scene repre-
sentation from an input image. The model solves the scene
graph inference problem using standard RNNs and learns
to iteratively improve its predictions via message passing.
Our joint inference model can take advantage of contex-
tual cues to make better predictions on objects and their
relationships. The experiments show that our model signif-
icantly outperforms previous methods on generating scene
graphs using Visual Genome dataset and inferring support
relations with NYU Depth v2 dataset.
1. Introduction
Today’s state-of-the-art perceptual models [15, 32] have
mostly tackled detecting and recognizing individual objects
in isolation. However, understanding a visual scene often
goes beyond recognizing individual objects. Take a look
at the two images in Fig. 1. Even a perfect object detec-
tor would struggle to perceive the subtle difference between
a man feeding a horse and a man standing by a horse. The
rich semantic relationships between these objects have been
largely untapped by these models. As indicated by a series
of previous works [26, 34, 41], one crucial step towards a
deeper understanding of visual scenes is building a struc-
tured representation that captures objects and their semantic
relationships. Such a representation not only offers contextual cues for fundamental recognition tasks [27, 29, 38, 39] but also provides value in a wider variety of high-level visual tasks [18, 44, 40].
The recent success of deep learning-based recognition
models [15, 21, 36] has spurred interest in examining the de-
tailed structures of a visual scene, especially in the form of
[Figure 1 diagram: an object detector outputs the same objects (man, horse) for two different images, while the generated scene graphs distinguish them, e.g., man feeding horse, man holding bucket, horse eating from bucket, man wearing glasses.]
Figure 1. Object detectors perceive a scene by attending to indi-
vidual objects. As a result, even a perfect detector would produce
similar outputs on two semantically distinct images (first row). We
propose a scene graph generation model that takes an image as in-
put, and generates a visually-grounded scene graph (second row,
right) that captures the objects in the image (blue nodes) and their
pairwise relationships (red nodes).
object relationships [5, 20, 26, 33]. Scene graph, proposed
by Johnson et al. [18], offers a platform to explicitly model
objects and their relationships. In short, a scene graph is
a visually-grounded graph over the object instances in an
image, where the edges depict their pairwise relationships
(see example in Fig. 1). The value of scene graph represen-
tation has been proven in a wide range of visual tasks, such
as semantic image retrieval [18], 3D scene synthesis [4],
and visual question answering [37]. Anderson et al. re-
cently proposed SPICE [1] as an enhanced automated cap-
tion evaluation metric defined over scene graphs. However,
these models that use scene graphs either rely on ground-truth annotations [18] or synthetic images [37], or extract a scene graph from the text domain [1, 4]. To truly take advan-
tage of such rich structure, it is crucial to devise a model
that automatically generates scene graphs from images.
In this work, we address the problem of scene graph gen-
eration, where the goal is to generate a visually-grounded
scene graph from an image. In a generated scene graph,
an object instance is characterized by a bounding box with
an object category label, and a relationship is characterized
by a directed edge between two bounding boxes (i.e., ob-

ject and subject) with a relationship predicate (red nodes in
Fig. 1). The major challenge of generating scene graphs
is reasoning about relationships. Much effort has been ex-
pended on localizing and recognizing semantic relation-
ships in images [6, 8, 26, 34, 39]. Most methods have
focused on making local predictions of object relation-
ships [26, 34], which essentially simplify the scene graph
generation problem into independently predicting relation-
ships between pairs of objects. However, by making only local predictions, these models ignore surrounding context, whereas joint reasoning with contextual information can often resolve the ambiguity that arises from predictions made in isolation.
To capture this intuition, we propose a novel end-to-
end model that learns to generate image-grounded scene
graphs (Fig. 2). The model takes an image as input and out-
puts a scene graph that consists of object categories, their
bounding boxes, and semantic relationships between pairs
of objects. Our major contribution is that instead of in-
ferring each component of a scene graph in isolation, the
model passes messages containing contextual information
between a pair of bipartite sub-graphs of the scene graph,
and iteratively refines its predictions using RNNs. We eval-
uate our model on a new scene graph dataset based on Vi-
sual Genome [20], which contains human-annotated scene
graphs on 108,077 images. On average, each image is anno-
tated with 25 objects and 22 pairwise object relationships.
We show that relationship prediction in scene graphs can
be significantly improved by our model. Furthermore, we
also apply our model to the NYU Depth v2 dataset [28],
establishing new state-of-the-art results in reasoning about
spatial relations, such as horizontal and vertical supports.
In summary, we propose an end-to-end model that gen-
erates visually-grounded scene graphs from images. The
model uses a novel inference formulation that iteratively re-
fines its prediction by passing contextual messages along
the topological structure of a scene graph. We demonstrate
its use for generating semantic scene graphs from a new
scene graph dataset as well as predicting support relations
using the NYU Depth v2 dataset [28].
2. Related Work
Scene understanding and relationship prediction. Visual
scene understanding often harnesses the statistical patterns
of object co-occurrence [11, 22, 30, 35] as well as spa-
tial layout [2, 9]. A series of contextual models based on
surrounding pixels and regions have also been developed
for perceptual tasks [3, 13, 25, 27]. Recent works [6, 31]
exploit more complex structures for relationship predic-
tion. However, these works focus on image-level predic-
tions without detailed visual grounding. Physical rela-
tionships, such as support and stability, have been studied
in [17, 28, 42]. Lu et al. [26] directly tackled semantic
relationship detection by combining visual inputs with lan-
[Figure 2 diagram: image → CNN+RPN object proposals → graph inference → scene graph (e.g., man riding horse, man wearing hat and shirt, mountain behind).]
Figure 2. An overview of our model architecture. Given an image
as input, the model first produces a set of object proposals using
a Region Proposal Network (RPN) [32], and then passes the ex-
tracted features of the object regions to our novel graph inference
module. The output of the model is a scene graph [18], which
contains a set of localized objects, categories of each object, and
relationship types between each pair of objects.
guage priors to cope with the long-tail distribution of real-
world relationships. However, their method predicts each
relationship independently. We show that our model out-
performs theirs with joint inference.
Visual scene representation. One of the most popular
ways of representing a visual scene is through text descrip-
tions [14, 34, 44]. Although text-based representation has
been shown to be helpful for scene classification and re-
trieval, its power is often limited by ambiguity and lack
of expressiveness. In comparison, scene graphs [18] of-
fer explicit grounding of visual concepts, avoiding referen-
tial uncertainty in text-based representation. Scene graphs
have been used in many downstream tasks such as image re-
trieval [18], 3D scene synthesis [4] and understanding [10],
visual question answering [37], and automatic caption eval-
uation [1]. However, previous work on scene graphs shied
away from the graph generation problem by either using
ground-truth annotations [18, 37], or extracting the graphs
from other modalities [1, 4, 10]. Our work addresses the
problem of generating scene graphs directly from images.
Graph inference. Conditional Random Fields (CRF) have
been used extensively in graph inference. Johnson et al.
used CRF to infer scene graph grounding distributions for
image retrieval [18]. Yatskar et al. [40] proposed situation-
driven object and action prediction using a deep CRF
model. Our work is closely related to CRFasRNN [43] and
Graph-LSTM [23] in that we also formulate the graph infer-
ence problem using an RNN-based model. A key difference
is that they focus on node inference while treating edges as
pairwise constraints, whereas we enable edge predictions
using a novel primal-dual graph inference scheme. We also

share the same spirit as Structural RNN [16]. A crucial
distinction is that our model iteratively refines its predic-
tions through message passing, whereas the Structural RNN
model only makes one-time predictions along the temporal
dimension, and thus cannot refine its past predictions.
3. Scene Graph Generation
A scene graph, as defined by Johnson et al. [18], is a
structured representation of an image, where nodes in a
scene graph correspond to object bounding boxes with their
object categories, and edges correspond to their pairwise re-
lationships between objects. The task of scene graph gen-
eration is to generate a visually-grounded scene graph that
most accurately correlates with an image. Intuitively, indi-
vidual predictions of objects and relationships can benefit
from their surrounding context. For instance, knowing “a
horse is on a grass field” is likely to increase the chance of
detecting a person and predicting the relationship of “man
riding horse”. To capture this intuition, we propose a joint
inference framework to enable contextual information to
propagate through the scene graph topology via a message
passing scheme.
Inference on a densely connected graph can be very ex-
pensive. As shown in previous work [19] and [43], dense
graph inference can be approximated by mean field in Con-
ditional Random Fields (CRF). Our approach is inspired by
Zheng et al. [43], which designs fully differentiable lay-
ers to enable end-to-end learning with recurrent neural net-
works (RNN). Yet their model relies on purpose-built RNN
layers. To achieve greater flexibility in a more principled
training framework, we use a generic RNN unit instead, in
particular a Gated Recurrent Unit (GRU) [7]. At each iter-
ation, each GRU takes its previous hidden state and an in-
coming message as input, and produces a new hidden state
as output. Each node and edge in the scene graph main-
tains its internal state in its corresponding GRU unit, where
all nodes share the same GRU weights (node GRUs), and
all edges share the other set of GRU weights (edge GRUs).
This setup allows the model to pass messages (i.e., aggre-
gation of GRU hidden states) among the GRU units along
the scene graph topology. We also propose a message pool-
ing function that learns to dynamically aggregate the hidden
states of the GRUs into messages.
We further observe that the unique structure of scene
graphs forms a bipartite structure of message passing chan-
nels. Since messages only pass along the topological struc-
ture of a scene graph, the set of edge GRUs and the set of
node GRUs form a bipartite graph, where no message is
passed inside each set. Inspired by this observation, we
formulate two disjoint sub-graphs that are essentially the
dual graph to each other. The primal graph defines chan-
nels for messages to pass from edge GRUs to node GRUs.
The dual graph defines channels for messages to pass from
node GRUs to edge GRUs. With such primal-dual formu-
lation, we can therefore improve inference efficiency by
iteratively passing messages between these sub-graphs in-
stead of through a densely connected graph. Fig. 3 gives an
overview of our model.
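To make this primal-dual structure concrete, the following Python sketch (our own illustration, not code from the paper; names are ours) enumerates the message channels for a small set of proposals: the primal graph routes edge states into each node, and the dual graph routes the subject and object node states into each edge.

```python
# Sketch of the primal/dual message channels for n object proposals.
# Edges are all ordered pairs (i, j) with i != j; naming is illustrative.

def build_message_channels(n):
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]

    # Primal graph: node i receives messages from the GRUs of its
    # outbound edges (i -> j) and its inbound edges (j -> i).
    primal = {i: {"outbound": [], "inbound": []} for i in range(n)}
    # Dual graph: edge (i -> j) receives messages from the GRUs of its
    # subject node i and its object node j.
    dual = {e: {"subject": e[0], "object": e[1]} for e in edges}

    for (i, j) in edges:
        primal[i]["outbound"].append((i, j))
        primal[j]["inbound"].append((i, j))
    return edges, primal, dual

edges, primal, dual = build_message_channels(3)
print(primal[0])     # {'outbound': [(0, 1), (0, 2)], 'inbound': [(1, 0), (2, 0)]}
print(dual[(0, 1)])  # {'subject': 0, 'object': 1}
```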
3.1. Problem Formulation
We first lay out the mathematical formulation of our
scene graph generation problem. To generate a visually
grounded scene graph, we need to obtain an initial set of
object bounding boxes. These bounding boxes can be ei-
ther from ground-truth human annotation or algorithmically
generated. In practice, we use the Region Proposal Network
(RPN) [32] to automatically generate a set of object bound-
ing box proposals $B_I$ from an image $I$ as the base input to the inference procedure (Fig. 3(a)).
For each object box proposal, we need to infer two types
of object-centric variables: 1) an object class label, and 2)
four bounding box offsets relative to the proposal box co-
ordinates, which are used for refining the proposal boxes.
In addition, we need to infer a relationship-centric variable
between every pair of proposal boxes, which denotes the
predicate type of the relationship between the correspond-
ing object pair. Given a set of object classes $\mathcal{C}$ (including background) and a set of relationship types $\mathcal{R}$ (including none relationship), we denote the set of all variables to be $x = \{x_i^{\text{cls}}, x_i^{\text{bbox}}, x_{i \to j} \mid i = 1 \ldots n,\; j = 1 \ldots n,\; i \neq j\}$, where $n$ is the number of proposal boxes, $x_i^{\text{cls}} \in \mathcal{C}$ is the class label of the $i$-th proposal box, $x_i^{\text{bbox}} \in \mathbb{R}^4$ is the bounding box offsets relative to the $i$-th proposal box coordinates, and $x_{i \to j} \in \mathcal{R}$ is the relationship predicate between the $i$-th and the $j$-th proposal boxes.
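For concreteness, here is a minimal Python sketch of this variable set (a container of our own design; field names are ours, not the paper's): each proposal carries a class label and four box offsets, and each ordered pair of distinct proposals carries a predicate.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SceneGraphVariables:
    """Illustrative container for the variables x inferred per image."""
    x_cls: List[int]                                 # x_i^cls in C, one label per proposal box
    x_bbox: List[Tuple[float, float, float, float]]  # x_i^bbox in R^4, offsets per proposal box
    x_rel: Dict[Tuple[int, int], int] = field(default_factory=dict)  # x_{i->j} in R, per ordered pair

n = 3  # number of proposal boxes
x = SceneGraphVariables(
    x_cls=[0] * n,
    x_bbox=[(0.0, 0.0, 0.0, 0.0)] * n,
    x_rel={(i, j): 0 for i in range(n) for j in range(n) if i != j},
)
print(len(x.x_rel))  # n * (n - 1) = 6 directed relationship variables
```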
At the high level, the inference task is to classify objects,
predict their bounding box offsets, and classify relationship
predicates between each pair of objects. Formally, we for-
mulate the scene graph generation problem as finding the
optimal $x^{*} = \arg\max_{x} \Pr(x \mid I, B_I)$ that maximizes the following probability function given the image $I$ and box proposals $B_I$:

$$\Pr(x \mid I, B_I) = \prod_{i \in V} \prod_{j \neq i} \Pr\left(x_i^{\text{cls}}, x_i^{\text{bbox}}, x_{i \to j} \mid I, B_I\right). \quad (1)$$
In the next subsection, we introduce a way to approx-
imate the inference procedure using an iterative message
passing scheme modeled with Gated Recurrent Units [7].
3.2. Inference using Recurrent Neural Network
We use mean field to perform approximate inference. We
denote the probability of each variable x as Q(x), and as-
sume that the probability only depends on the current state
of each node and edge at each iteration. In contrast to
Zheng et al. [43], we use a generic RNN module to compute

[Figure 3 diagram: node and edge visual features from the object proposals initialize node GRUs and edge GRUs; node message pooling and edge message pooling pass messages between the primal and dual graphs over iterations T = 0, 1, 2, ..., N, producing the output scene graph. Panels (a)-(d) are described in the caption below.]
Figure 3. An illustration of our model architecture (Sec. 3). The model first extracts visual features of nodes and edges from a set of object proposals, and edge GRUs and node GRUs then take the visual features as initial input and produce a set of hidden states (a). A node message pooling function then computes, from these hidden states, the messages that are passed to the node GRUs in the next iteration. Similarly, an edge message pooling function computes messages and feeds them to the edge GRUs (b). The summation symbol in the figure denotes a learnt weighted sum. The model iteratively updates the hidden states of the GRUs (c). At the last iteration step, the hidden states of the GRUs are used to predict object categories, bounding box offsets, and relationship types (d).
the hidden states. In particular, we choose Gated Recurrent Units [7] due to their simplicity and effectiveness. We use the hidden state of the corresponding GRU, a high-dimensional vector, to represent the current state of each node and each edge. As all the nodes (edges) share the same update rule, we share the same set of parameters among all the node GRUs, and the other set of parameters among all the edge GRUs (Fig. 3). We denote the current hidden state of node $i$ as $h_i$ and the current hidden state of edge $i \to j$ as $h_{i \to j}$.
Then the mean field distribution can be formulated as

$$Q(x \mid I, B_I) = \prod_{i=1}^{n} Q(x_i^{\text{cls}}, x_i^{\text{bbox}} \mid h_i)\, Q(h_i \mid f_i^{v}) \prod_{j \neq i} Q(x_{i \to j} \mid h_{i \to j})\, Q(h_{i \to j} \mid f_{i \to j}^{e}) \quad (2)$$

where $f_i^{v}$ is the visual feature of the $i$-th node, and $f_{i \to j}^{e}$ is the visual feature of the edge from the $i$-th node to the $j$-th node. In the first iteration, the GRU units take the visual features $f^{v}$ and $f^{e}$ as input (Fig. 3(a)). We use the visual feature of the proposal box as the visual feature $f_i^{v}$ for the $i$-th node. We use the visual feature of the union box over the proposal boxes $b_i$, $b_j$ as the visual feature $f_{i \to j}^{e}$ for edge $i \to j$. These visual features are extracted by a ROI-pooling layer [12] from the image.
layer [12] from the image. In later iterations, the inputs are
the aggregated messages from other GRU units of the pre-
vious step. We talk about how the messages are aggregated
and passed in the next subsection.
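To make the update schedule concrete, here is a minimal PyTorch sketch (our interpretation of the setup above; tensor names and the number of iterations are illustrative) of shared node and edge GRU cells that consume ROI-pooled visual features at the first step and aggregated messages afterwards. The message pooling is only stubbed out here; a fuller sketch of the pooling functions follows Eqs. (3)-(4) in Sec. 3.3.

```python
import torch
import torch.nn as nn

D = 512                      # hidden size, matching the 512-d GRUs in Sec. 3.4
node_gru = nn.GRUCell(D, D)  # one set of weights shared by all node GRUs
edge_gru = nn.GRUCell(D, D)  # another set of weights shared by all edge GRUs

n = 4                              # number of object proposals (toy value)
f_v = torch.randn(n, D)            # ROI-pooled node features (placeholders)
f_e = torch.randn(n * (n - 1), D)  # union-box edge features (placeholders)

# First iteration: hidden states are produced from the visual features.
h_node = node_gru(f_v, torch.zeros(n, D))
h_edge = edge_gru(f_e, torch.zeros(n * (n - 1), D))

def pool_messages_stub(h_node, h_edge):
    # Placeholder for the learned message pooling of Eqs. (3)-(4); see Sec. 3.3.
    return h_node.clone(), h_edge.clone()

# Later iterations: inputs are the aggregated messages from the previous step.
for _ in range(2):
    m_node, m_edge = pool_messages_stub(h_node, h_edge)
    h_node = node_gru(m_node, h_node)
    h_edge = edge_gru(m_edge, h_edge)
```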
3.3. Primal Dual Update and Message Pooling
Sec. 3.2 offers a generic formulation for solving the graph inference problem using RNNs. However, we observe that
we can further improve the inference efficiency by leverag-
ing the unique bipartite structure of a scene graph. In the
scene graph topology, the neighbors of the edge GRUs are
node GRUs, and vice versa. Passing messages along this
structure forms two disjoint sub-graphs that are the dual
graph to each other. Specifically, we have a node-centric
primal graph, in which each node GRU gets messages from
its inbound and outbound edge GRUs. In the edge-centric
dual graph, each edge GRU gets messages from its sub-
ject node GRU and object node GRU (Fig. 3(b)). We can
therefore improve inference efficiency by iteratively passing
messages between these two sub-graphs instead of through
a densely connected graph (Fig. 3(c)).
As each GRU receives multiple incoming messages, we
need an aggregation function that can fuse information from
all messages into a meaningful representation. A naïve approach would be standard pooling methods such as average- or max-pooling. However, we found that it is more effective
to learn adaptive weights that can modulate the influences of
incoming messages and only keep the relevant information.
We introduce a message pooling function that computes the
weight factors for each incoming message and fuse the mes-
sages using a weighted sum. We provide an empirical anal-
ysis of different message pooling functions in Sec. 4.
Formally, given the current GRU hidden states of nodes and edges $h_i$ and $h_{i \to j}$, we denote the messages to update the $i$-th node as $m_i$, which is computed by a function of its own hidden state $h_i$ and the hidden states of its outbound edge GRUs $h_{i \to j}$ and inbound edge GRUs $h_{j \to i}$. Similarly, we denote the message to update the edge from the $i$-th node to the $j$-th node as $m_{i \to j}$, which is computed by a function of its own hidden state $h_{i \to j}$ and the hidden states of its subject node GRU $h_i$ and its object node GRU $h_j$. To be more specific, $m_i$ and $m_{i \to j}$ are computed by the following two adaptively weighted message pooling functions:

$$m_i = \sum_{j : i \to j} \sigma\left(v_1^{T}[h_i, h_{i \to j}]\right) h_{i \to j} + \sum_{j : j \to i} \sigma\left(v_2^{T}[h_i, h_{j \to i}]\right) h_{j \to i} \quad (3)$$

$$m_{i \to j} = \sigma\left(w_1^{T}[h_i, h_{i \to j}]\right) h_i + \sigma\left(w_2^{T}[h_j, h_{i \to j}]\right) h_j \quad (4)$$

where $[\cdot]$ denotes a concatenation of vectors, and $\sigma$ denotes a sigmoid function. $w_1$, $w_2$ and $v_1$, $v_2$ are learnable parameters. These two equations describe the primal-dual update rules, as shown in Fig. 3(b).
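The following PyTorch sketch shows our reading of Eqs. (3)-(4) for a single node $i$ and a single edge $i \to j$ (variable names are ours; in a real model $v_1$, $v_2$, $w_1$, $w_2$ would be nn.Parameter vectors of length $2D$, since they act on a concatenation of two hidden states).

```python
import torch

D = 512
# Learnable projection vectors from Eqs. (3)-(4); random placeholders here.
v1, v2 = torch.randn(2 * D), torch.randn(2 * D)
w1, w2 = torch.randn(2 * D), torch.randn(2 * D)

def node_message(h_i, h_out, h_in):
    """Eq. (3): h_out / h_in are the hidden states of outbound / inbound edge GRUs."""
    m = torch.zeros_like(h_i)
    for h_ij in h_out:
        m = m + torch.sigmoid(torch.dot(v1, torch.cat([h_i, h_ij]))) * h_ij
    for h_ji in h_in:
        m = m + torch.sigmoid(torch.dot(v2, torch.cat([h_i, h_ji]))) * h_ji
    return m

def edge_message(h_ij, h_i, h_j):
    """Eq. (4): adaptively weighted sum of the subject and object node states."""
    return (torch.sigmoid(torch.dot(w1, torch.cat([h_i, h_ij]))) * h_i
            + torch.sigmoid(torch.dot(w2, torch.cat([h_j, h_ij]))) * h_j)

h_i, h_j = torch.randn(D), torch.randn(D)
h_ij, h_ji = torch.randn(D), torch.randn(D)
print(node_message(h_i, [h_ij], [h_ji]).shape)  # torch.Size([512])
print(edge_message(h_ij, h_i, h_j).shape)       # torch.Size([512])
```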
3.4. Implementation Details
Our final output layers closely follow the Faster R-CNN setup [32]. We use a softmax layer to produce the final
scores for the object class as well as relationship predicate.
We use a fully-connected layer to regress to the bounding
box offsets for each object class separately. We use the cross
entropy loss for the object class and the relationship predi-
cate. We use ℓ1 loss for the bounding box offsets.
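A minimal sketch of this multi-task objective as we understand it (equal weighting of the three terms is our assumption, and the paper's per-class box regression is simplified here to a single set of offsets):

```python
import torch
import torch.nn.functional as F

def scene_graph_loss(cls_logits, cls_targets, rel_logits, rel_targets,
                     bbox_pred, bbox_targets):
    """Cross entropy for object classes and predicates, L1 for box offsets."""
    loss_cls = F.cross_entropy(cls_logits, cls_targets)
    loss_rel = F.cross_entropy(rel_logits, rel_targets)
    loss_box = F.l1_loss(bbox_pred, bbox_targets)
    return loss_cls + loss_rel + loss_box  # equal weights: our assumption

# Toy shapes: 8 proposals, 151 object classes (150 + background),
# 51 predicates (50 + "none"), 20 sampled edges.
loss = scene_graph_loss(torch.randn(8, 151), torch.randint(0, 151, (8,)),
                        torch.randn(20, 51), torch.randint(0, 51, (20,)),
                        torch.randn(8, 4), torch.randn(8, 4))
print(loss.item())
```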
We use an MS COCO-pretrained VGG-16 network to ex-
tract visual features from images. We freeze the weights of
all convolution layers, and only finetune the fully connected
layers, including the GRUs. The node GRUs and the edge
GRUs have both 512-dimensional input and output. Dur-
ing training, we first use NMS to select at most 2,000 boxes
from all proposed boxes $B_I$, and then randomly select 128
boxes as the object proposals. Due to the quadratic number
of edges and sparsity of the annotations, we first sample all
edges that have labels. If an image has fewer than 128 labeled
edges, we fill the rest with unlabeled edges. At test time,
we use NMS to select at most 50 boxes from the object pro-
posals with an IoU threshold of 0.3. We make predictions
on all edges except the self-connections at test time.
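The training-time edge sampling described above might look like the following sketch (our reconstruction; the function name and the exact shuffling are assumptions):

```python
import random

def sample_training_edges(labeled_edges, all_edges, budget=128):
    """Keep every labeled edge, then pad with unlabeled edges up to the budget."""
    sampled = list(labeled_edges)
    if len(sampled) < budget:
        unlabeled = [e for e in all_edges if e not in labeled_edges]
        random.shuffle(unlabeled)
        sampled += unlabeled[:budget - len(sampled)]
    return sampled

n = 16  # toy number of proposals
all_edges = [(i, j) for i in range(n) for j in range(n) if i != j]
labeled = {(0, 1), (2, 3), (4, 5)}
batch = sample_training_edges(labeled, all_edges)
print(len(batch))  # 128: 3 labeled edges plus 125 unlabeled fillers
```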
4. Experiments
We evaluate our model on generating scene graphs from
images. We compare our model against a recently proposed
model on visual relationship prediction [26]. Our goal is to
analyze our model in datasets with both sparse and dense
relationship annotations. We use a new scene graph dataset
based on the Visual Genome dataset [20] in our main ex-
periment. We also evaluate our model on the support rela-
tion inference task in the NYU Depth v2 dataset. The key
difference between these two datasets is that scene graph
annotation is very sparse: among all possible pairings of
objects, only 1.6% of them are labeled with a relationship
predicate. The NYU Depth v2 dataset, on the other hand,
exhaustively annotates the support of every labeled object.
Our experiments show that our model outperforms the base-
line model [26], and can generalize to other types of rela-
tionships, in particular support relations [28], without any
architecture change.
Visual Genome We introduce a new scene graph dataset
based on the Visual Genome dataset [20]. The original VG
scene graph dataset contains 108,077 images with an aver-
age of 38 objects and 22 relationships per image. However,
a substantial fraction of the object annotations have poor-
quality and overlapping bounding boxes and/or ambiguous
object names. We manually cleaned up per-box annota-
tions. On average, this annotation refinement process cor-
rected 22 bounding boxes and/or names, deleted 7.4 boxes,
and merged 5.4 duplicate bounding boxes per image. The
new dataset contains an average of 25 distinct objects and
22 relationships per image. In this experiment, we use the
most frequent 150 object categories and 50 predicates for
evaluation. As a result, each image has a scene graph of
around 11.5 objects and 6.2 relationships. We use 70% of
the images for training and the remaining 30% for testing.
NYU Depth V2 We also evaluate our model on the support
relation graphs from the NYU Depth v2 dataset [28]. The
dataset contains 1,449 RGB-D images captured in 27 indoor
scenes. Each image is annotated with instance segmenta-
tion, region class labels, and support relations between re-
gions. We use the standard split, with 795 images used for
training and 654 images for testing.
4.1. Semantic Scene Graph Generation
Setup Given an image, the scene graph generation task
is to localize a set of objects, classify their category labels,
and predict relationships between each pair of the objects.
We evaluate our model on the new scene graph dataset. We
analyze our model in three setups below.
1. The predicate classification (PREDCLS) task is to
predict the predicates of all pairwise relationships of
a set of localized objects. This task examines the
model’s performance on predicate classification in iso-
lation from other factors.
2. The scene graph classification (SGCLS) task is to
predict the predicate as well as the object categories
of the subject and the object in every pairwise relation-
ship given a set of localized objects.
3. The scene graph generation (SGGEN) task is to si-
multaneously detect a set of objects and predict the
predicate between each pair of the detected objects.
An object is considered to be correctly detected if it
has at least 0.5 IoU overlap with the ground-truth box.
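The 0.5 IoU detection criterion above can be made concrete with a short sketch (our own helper, not code from the paper):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)) >= 0.5)  # False: IoU is about 0.14
```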
We adopted the image-wise recall evaluation metrics,
R@50 and R@100, that are used in Lu et al. [26] for

Citations
Journal ArticleDOI
TL;DR: This article provides a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields and proposes a new taxonomy to divide the state-of-the-art GNNs into four categories, namely, recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial–temporal GNNs.
Abstract: Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications, where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on the existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this article, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art GNNs into four categories, namely, recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial–temporal GNNs. We further discuss the applications of GNNs across various domains and summarize the open-source codes, benchmark data sets, and model evaluation of GNNs. Finally, we propose potential research directions in this rapidly growing field.

4,584 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: It is argued that the organization of 3D point clouds can be efficiently captured by a structure called superpoint graph (SPG), derived from a partition of the scanned scene into geometrically homogeneous elements.
Abstract: We propose a novel deep learning-based framework to tackle the challenge of semantic segmentation of large-scale point clouds of millions of points. We argue that the organization of 3D point clouds can be efficiently captured by a structure called superpoint graph (SPG), derived from a partition of the scanned scene into geometrically homogeneous elements. SPGs offer a compact yet rich representation of contextual relationships between object parts, which is then exploited by a graph convolutional network. Our framework sets a new state of the art for segmenting outdoor LiDAR scans (+11.9 and +8.8 mIoU points for both Semantic3D test sets), as well as indoor scans (+12.4 mIoU points for the S3DIS dataset).

1,083 citations

Book ChapterDOI
08 Sep 2018
TL;DR: The authors propose GCN-LSTM, which integrates semantic and spatial object relationships into the image encoder of an attention-based encoder-decoder framework for image captioning.
Abstract: It is always well believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has not been evidence in support of the idea on image description generation. In this paper, we introduce a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework. Specifically, we present Graph Convolutional Networks plus Long Short-Term Memory (dubbed as GCN-LSTM) architecture that novelly integrates both semantic and spatial object relationships into image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representations of each region proposed on objects are then refined by leveraging graph structure through GCN. With the learnt region-level features, our GCN-LSTM capitalizes on LSTM-based captioning framework with attention mechanism for sentence generation. Extensive experiments are conducted on COCO image captioning dataset, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on COCO testing set.

775 citations

Proceedings ArticleDOI
04 Apr 2018
TL;DR: This work proposes a method for generating images from scene graphs, enabling explicitly reasoning about objects and their relationships, and validates this approach on Visual Genome and COCO-Stuff.
Abstract: To truly understand the visual world our models should be able not only to recognize images but also generate them. To this end, there has been exciting recent progress on generating images from natural language descriptions. These methods give stunning results on limited domains such as descriptions of birds or flowers, but struggle to faithfully reproduce complex sentences with many objects and relationships. To overcome this limitation we propose a method for generating images from scene graphs, enabling explicitly reasoning about objects and their relationships. Our model uses graph convolution to process input graphs, computes a scene layout by predicting bounding boxes and segmentation masks for objects, and converts the layout to an image with a cascaded refinement network. The network is trained adversarially against a pair of discriminators to ensure realistic outputs. We validate our approach on Visual Genome and COCO-Stuff, where qualitative results, ablations, and user studies demonstrate our method's ability to generate complex images with multiple objects.

645 citations

Proceedings ArticleDOI
07 Apr 2019
TL;DR: New ways to train very deep graph convolutional networks (GCNs) are presented by adapting residual/dense connections and dilated convolutions from CNNs, enabling a 56-layer GCN that significantly boosts point cloud semantic segmentation performance.
Abstract: Convolutional Neural Networks (CNNs) achieve impressive performance in a wide variety of fields. Their success benefited from a massive boost when very deep CNN models were able to be reliably trained. Despite their merits, CNNs fail to properly address problems with non-Euclidean data. To overcome this challenge, Graph Convolutional Networks (GCNs) build graphs to represent non-Euclidean data, borrow concepts from CNNs, and apply them in training. GCNs show promising results, but they are usually limited to very shallow models due to the vanishing gradient problem. As a result, most state-of-the-art GCN models are no deeper than 3 or 4 layers. In this work, we present new ways to successfully train very deep GCNs. We do this by borrowing concepts from CNNs, specifically residual/dense connections and dilated convolutions, and adapting them to GCN architectures. Extensive experiments show the positive effect of these deep GCN frameworks. Finally, we use these new concepts to build a very deep 56-layer GCN, and show how it significantly boosts performance (+3.7% mIoU over state-of-the-art) in the task of point cloud semantic segmentation. We believe that the community can greatly benefit from this work, as it opens up many opportunities for advancing GCN-based research.

609 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations

Proceedings Article
03 Dec 2012
TL;DR: The authors trained a deep convolutional neural network with five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieving top-1 and top-5 error rates of 37.5% and 17.0% on the ImageNet LSVRC-2010 contest, considerably better than the previous state of the art.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

55,235 citations


Posted Content
TL;DR: Faster R-CNN as discussed by the authors proposes a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-CNN for detection.
Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

23,183 citations