Proceedings ArticleDOI

Scene Graph Generation by Iterative Message Passing

01 Jul 2017, pp. 3097-3106
TL;DR: The scene graph generation problem is formulated as message passing between the primal node graph and its dual edge graph, so that contextual cues can be used to make better predictions on objects and their relationships.
Abstract: Understanding a visual scene goes beyond recognizing individual objects in isolation. Relationships between objects also constitute rich semantic information about the scene. In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image. We propose a novel end-to-end model that generates such structured scene representation from an input image. Our key insight is that the graph generation problem can be formulated as message passing between the primal node graph and its dual edge graph. Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. The experiments show that our model significantly outperforms previous methods on the Visual Genome dataset as well as support relation inference in NYU Depth V2 dataset.


Scene Graph Generation by Iterative Message Passing
Danfei Xu¹  Yuke Zhu¹  Christopher B. Choy²  Li Fei-Fei¹
¹Department of Computer Science, Stanford University
²Department of Electrical Engineering, Stanford University
{danfei, yukez, chrischoy, feifeili}@cs.stanford.edu
Abstract
Understanding a visual scene goes beyond recognizing
individual objects in isolation. Relationships between ob-
jects also constitute rich semantic information about the
scene. In this work, we explicitly model the objects and
their relationships using scene graphs, a visually-grounded
graphical structure of an image. We propose a novel end-
to-end model that generates such structured scene repre-
sentation from an input image. The model solves the scene
graph inference problem using standard RNNs and learns
to iteratively improve its predictions via message passing.
Our joint inference model can take advantage of contex-
tual cues to make better predictions on objects and their
relationships. The experiments show that our model signif-
icantly outperforms previous methods on generating scene
graphs using Visual Genome dataset and inferring support
relations with NYU Depth v2 dataset.
1. Introduction
Today’s state-of-the-art perceptual models [15, 32] have
mostly tackled detecting and recognizing individual objects
in isolation. However, understanding a visual scene often
goes beyond recognizing individual objects. Take a look
at the two images in Fig. 1. Even a perfect object detec-
tor would struggle to perceive the subtle difference between
a man feeding a horse and a man standing by a horse. The
rich semantic relationships between these objects have been
largely untapped by these models. As indicated by a series
of previous works [26, 34, 41], one crucial step towards a
deeper understanding of visual scenes is building a struc-
tured representation that captures objects and their semantic
relationships. Such a representation not only offers contextual cues for fundamental recognition tasks [27, 29, 38, 39] but also provides value in a wider variety of high-level visual tasks [18, 44, 40].
The recent success of deep learning-based recognition
models [15, 21, 36] has spurred interest in examining the de-
tailed structures of a visual scene, especially in the form of
[Figure 1 diagram: an object detector outputs the same objects (man, horse) for two different images, while the generated scene graphs distinguish them, e.g., man feeding horse, man holding bucket, horse eating from bucket, man wearing glasses.]
Figure 1. Object detectors perceive a scene by attending to indi-
vidual objects. As a result, even a perfect detector would produce
similar outputs on two semantically distinct images (first row). We
propose a scene graph generation model that takes an image as in-
put, and generates a visually-grounded scene graph (second row,
right) that captures the objects in the image (blue nodes) and their
pairwise relationships (red nodes).
object relationships [5, 20, 26, 33]. Scene graph, proposed
by Johnson et al. [18], offers a platform to explicitly model
objects and their relationships. In short, a scene graph is
a visually-grounded graph over the object instances in an
image, where the edges depict their pairwise relationships
(see example in Fig. 1). The value of scene graph represen-
tation has been proven in a wide range of visual tasks, such
as semantic image retrieval [18], 3D scene synthesis [4],
and visual question answering [37]. Anderson et al. re-
cently proposed SPICE [1] as an enhanced automated cap-
tion evaluation metric defined over scene graphs. However,
these models that use scene graphs either rely on ground-truth annotations [18] or synthetic images [37], or extract a scene graph from the text domain [1, 4]. To truly take advan-
tage of such rich structure, it is crucial to devise a model
that automatically generates scene graphs from images.
In this work, we address the problem of scene graph gen-
eration, where the goal is to generate a visually-grounded
scene graph from an image. In a generated scene graph,
an object instance is characterized by a bounding box with
an object category label, and a relationship is characterized
by a directed edge between two bounding boxes (i.e., ob-

ject and subject) with a relationship predicate (red nodes in
Fig. 1). The major challenge of generating scene graphs
is reasoning about relationships. Much effort has been ex-
pended on localizing and recognizing semantic relation-
ships in images [6, 8, 26, 34, 39]. Most methods have
focused on making local predictions of object relation-
ships [26, 34], which essentially simplify the scene graph
generation problem into independently predicting relation-
ships between pairs of objects. However, by making only local predictions, these models ignore surrounding context, whereas joint reasoning with contextual information can often resolve the ambiguity that arises from predictions made in isolation.
To capture this intuition, we propose a novel end-to-
end model that learns to generate image-grounded scene
graphs (Fig. 2). The model takes an image as input and out-
puts a scene graph that consists of object categories, their
bounding boxes, and semantic relationships between pairs
of objects. Our major contribution is that instead of in-
ferring each component of a scene graph in isolation, the
model passes messages containing contextual information
between a pair of bipartite sub-graphs of the scene graph,
and iteratively refines its predictions using RNNs. We eval-
uate our model on a new scene graph dataset based on Vi-
sual Genome [20], which contains human-annotated scene
graphs on 108,077 images. On average, each image is anno-
tated with 25 objects and 22 pairwise object relationships.
We show that relationship prediction in scene graphs can
be significantly improved by our model. Furthermore, we
also apply our model to the NYU Depth v2 dataset [28],
establishing new state-of-the-art results in reasoning about
spatial relations, such as horizontal and vertical supports.
In summary, we propose an end-to-end model that gen-
erates visually-grounded scene graphs from images. The
model uses a novel inference formulation that iteratively re-
fines its prediction by passing contextual messages along
the topological structure of a scene graph. We demonstrate
its use for generating semantic scene graphs from a new
scene graph dataset as well as predicting support relations
using the NYU Depth v2 dataset [28].
2. Related Work
Scene understanding and relationship prediction. Visual
scene understanding often harnesses the statistical patterns
of object co-occurrence [11, 22, 30, 35] as well as spa-
tial layout [2, 9]. A series of contextual models based on
surrounding pixels and regions have also been developed
for perceptual tasks [3, 13, 25, 27]. Recent works [6, 31]
exploit more complex structures for relationship predic-
tion. However, these works focus on image-level predic-
tions without detailed visual grounding. Physical rela-
tionships, such as support and stability, have been studied
in [17, 28, 42]. Lu et al. [26] directly tackled semantic
relationship detection by combining visual inputs with lan-
[Figure 2 diagram: image → CNN+RPN object proposals → graph inference → scene graph (e.g., man riding horse, man wearing hat and shirt, mountain behind).]
Figure 2. An overview of our model architecture. Given an image
as input, the model first produces a set of object proposals using
a Region Proposal Network (RPN) [32], and then passes the ex-
tracted features of the object regions to our novel graph inference
module. The output of the model is a scene graph [18], which
contains a set of localized objects, categories of each object, and
relationship types between each pair of objects.
guage priors to cope with the long-tail distribution of real-
world relationships. However, their method predicts each
relationship independently. We show that our model out-
performs theirs with joint inference.
Visual scene representation. One of the most popular
ways of representing a visual scene is through text descrip-
tions [14, 34, 44]. Although text-based representation has
been shown to be helpful for scene classification and re-
trieval, its power is often limited by ambiguity and lack
of expressiveness. In comparison, scene graphs [18] of-
fer explicit grounding of visual concepts, avoiding referen-
tial uncertainty in text-based representation. Scene graphs
have been used in many downstream tasks such as image re-
trieval [18], 3D scene synthesis [4] and understanding [10],
visual question answering [37], and automatic caption eval-
uation [1]. However, previous work on scene graphs shied
away from the graph generation problem by either using
ground-truth annotations [18, 37], or extracting the graphs
from other modalities [1, 4, 10]. Our work addresses the
problem of generating scene graphs directly from images.
Graph inference. Conditional Random Fields (CRF) have
been used extensively in graph inference. Johnson et al.
used CRF to infer scene graph grounding distributions for
image retrieval [18]. Yatskar et al. [40] proposed situation-
driven object and action prediction using a deep CRF
model. Our work is closely related to CRFasRNN [43] and
Graph-LSTM [23] in that we also formulate the graph infer-
ence problem using an RNN-based model. A key difference
is that they focus on node inference while treating edges as
pairwise constraints, whereas we enable edge predictions
using a novel primal-dual graph inference scheme. We also

share the same spirit as Structural RNN [16]. A crucial
distinction is that our model iteratively refines its predic-
tions through message passing, whereas the Structural RNN
model only makes one-time predictions along the temporal
dimension, and thus cannot refine its past predictions.
3. Scene Graph Generation
A scene graph, as defined by Johnson et al. [18], is a
structured representation of an image, where nodes in a
scene graph correspond to object bounding boxes with their
object categories, and edges correspond to their pairwise re-
lationships between objects. The task of scene graph gen-
eration is to generate a visually-grounded scene graph that
most accurately correlates with an image. Intuitively, indi-
vidual predictions of objects and relationships can benefit
from their surrounding context. For instance, knowing “a
horse is on a grass field” is likely to increase the chance of
detecting a person and predicting the relationship of “man
riding horse”. To capture this intuition, we propose a joint
inference framework to enable contextual information to
propagate through the scene graph topology via a message
passing scheme.
Inference on a densely connected graph can be very ex-
pensive. As shown in previous work [19] and [43], dense
graph inference can be approximated by mean field in Con-
ditional Random Fields (CRF). Our approach is inspired by
Zheng et al. [43], which designs fully differentiable lay-
ers to enable end-to-end learning with recurrent neural net-
works (RNN). Yet their model relies on purpose-built RNN
layers. To achieve greater flexibility in a more principled
training framework, we use a generic RNN unit instead, in
particular a Gated Recurrent Unit (GRU) [7]. At each iter-
ation, each GRU takes its previous hidden state and an in-
coming message as input, and produces a new hidden state
as output. Each node and edge in the scene graph main-
tains its internal state in its corresponding GRU unit, where
all nodes share the same GRU weights (node GRUs), and
all edges share the other set of GRU weights (edge GRUs).
This setup allows the model to pass messages (i.e., aggre-
gation of GRU hidden states) among the GRU units along
the scene graph topology. We also propose a message pool-
ing function that learns to dynamically aggregate the hidden
states of the GRUs into messages.
We further observe that the unique structure of scene
graphs forms a bipartite structure of message passing chan-
nels. Since messages only pass along the topological struc-
ture of a scene graph, the set of edge GRUs and the set of
node GRUs form a bipartite graph, where no message is
passed inside each set. Inspired by this observation, we
formulate two disjoint sub-graphs that are essentially the
dual graph to each other. The primal graph defines chan-
nels for messages to pass from edge GRUs to node GRUs.
The dual graph defines channels for messages to pass from
node GRUs to edge GRUs. With such primal-dual formu-
lation, we can therefore improve inference efficiency by
iteratively passing messages between these sub-graphs in-
stead of through a densely connected graph. Fig. 3 gives an
overview of our model.
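To make this primal-dual structure concrete, the following Python sketch (our own illustration, not code from the paper; names are ours) enumerates the message channels for a small set of proposals: the primal graph routes edge states into each node, and the dual graph routes the subject and object node states into each edge.

```python
# Sketch of the primal/dual message channels for n object proposals.
# Edges are all ordered pairs (i, j) with i != j; naming is illustrative.

def build_message_channels(n):
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]

    # Primal graph: node i receives messages from the GRUs of its
    # outbound edges (i -> j) and its inbound edges (j -> i).
    primal = {i: {"outbound": [], "inbound": []} for i in range(n)}
    # Dual graph: edge (i -> j) receives messages from the GRUs of its
    # subject node i and its object node j.
    dual = {e: {"subject": e[0], "object": e[1]} for e in edges}

    for (i, j) in edges:
        primal[i]["outbound"].append((i, j))
        primal[j]["inbound"].append((i, j))
    return edges, primal, dual

edges, primal, dual = build_message_channels(3)
print(primal[0])     # {'outbound': [(0, 1), (0, 2)], 'inbound': [(1, 0), (2, 0)]}
print(dual[(0, 1)])  # {'subject': 0, 'object': 1}
```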
3.1. Problem Formulation
We first lay out the mathematical formulation of our
scene graph generation problem. To generate a visually
grounded scene graph, we need to obtain an initial set of
object bounding boxes. These bounding boxes can be ei-
ther from ground-truth human annotation or algorithmically
generated. In practice, we use the Region Proposal Network
(RPN) [32] to automatically generate a set of object bound-
ing box proposals $B_I$ from an image $I$ as the base input to the inference procedure (Fig. 3(a)).
For each object box proposal, we need to infer two types
of object-centric variables: 1) an object class label, and 2)
four bounding box offsets relative to the proposal box co-
ordinates, which are used for refining the proposal boxes.
In addition, we need to infer a relationship-centric variable
between every pair of proposal boxes, which denotes the
predicate type of the relationship between the correspond-
ing object pair. Given a set of object classes $\mathcal{C}$ (including background) and a set of relationship types $\mathcal{R}$ (including none relationship), we denote the set of all variables to be $x = \{x_i^{\text{cls}}, x_i^{\text{bbox}}, x_{i \to j} \mid i = 1 \ldots n,\; j = 1 \ldots n,\; i \neq j\}$, where $n$ is the number of proposal boxes, $x_i^{\text{cls}} \in \mathcal{C}$ is the class label of the $i$-th proposal box, $x_i^{\text{bbox}} \in \mathbb{R}^4$ is the bounding box offsets relative to the $i$-th proposal box coordinates, and $x_{i \to j} \in \mathcal{R}$ is the relationship predicate between the $i$-th and the $j$-th proposal boxes.
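For concreteness, here is a minimal Python sketch of this variable set (a container of our own design; field names are ours, not the paper's): each proposal carries a class label and four box offsets, and each ordered pair of distinct proposals carries a predicate.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SceneGraphVariables:
    """Illustrative container for the variables x inferred per image."""
    x_cls: List[int]                                 # x_i^cls in C, one label per proposal box
    x_bbox: List[Tuple[float, float, float, float]]  # x_i^bbox in R^4, offsets per proposal box
    x_rel: Dict[Tuple[int, int], int] = field(default_factory=dict)  # x_{i->j} in R, per ordered pair

n = 3  # number of proposal boxes
x = SceneGraphVariables(
    x_cls=[0] * n,
    x_bbox=[(0.0, 0.0, 0.0, 0.0)] * n,
    x_rel={(i, j): 0 for i in range(n) for j in range(n) if i != j},
)
print(len(x.x_rel))  # n * (n - 1) = 6 directed relationship variables
```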
At the high level, the inference task is to classify objects,
predict their bounding box offsets, and classify relationship
predicates between each pair of objects. Formally, we for-
mulate the scene graph generation problem as finding the
optimal $x^{*} = \arg\max_{x} \Pr(x \mid I, B_I)$ that maximizes the following probability function given the image $I$ and box proposals $B_I$:

$$\Pr(x \mid I, B_I) = \prod_{i \in V} \prod_{j \neq i} \Pr\left(x_i^{\text{cls}}, x_i^{\text{bbox}}, x_{i \to j} \mid I, B_I\right). \quad (1)$$
In the next subsection, we introduce a way to approx-
imate the inference procedure using an iterative message
passing scheme modeled with Gated Recurrent Units [7].
3.2. Inference using Recurrent Neural Network
We use mean field to perform approximate inference. We
denote the probability of each variable x as Q(x), and as-
sume that the probability only depends on the current state
of each node and edge at each iteration. In contrast to
Zheng et al. [43], we use a generic RNN module to compute

[Figure 3 diagram: node and edge visual features from the object proposals initialize node GRUs and edge GRUs; node message pooling and edge message pooling pass messages between the primal and dual graphs over iterations T = 0, 1, 2, ..., N, producing the output scene graph. Panels (a)-(d) are described in the caption below.]
Figure 3. An illustration of our model architecture (Sec. 3). The model first extracts visual features of nodes and edges from a set of object proposals, and edge GRUs and node GRUs then take the visual features as initial input and produce a set of hidden states (a). A node message pooling function then computes, from these hidden states, the messages that are passed to the node GRUs in the next iteration. Similarly, an edge message pooling function computes messages and feeds them to the edge GRUs (b). The summation symbol in the figure denotes a learnt weighted sum. The model iteratively updates the hidden states of the GRUs (c). At the last iteration step, the hidden states of the GRUs are used to predict object categories, bounding box offsets, and relationship types (d).
the hidden states. In particular, we choose Gated Recurrent Units [7] due to their simplicity and effectiveness. We use the hidden state of the corresponding GRU, a high-dimensional vector, to represent the current state of each node and each edge. As all the nodes (edges) share the same update rule, we share the same set of parameters among all the node GRUs, and the other set of parameters among all the edge GRUs (Fig. 3). We denote the current hidden state of node $i$ as $h_i$ and the current hidden state of edge $i \to j$ as $h_{i \to j}$.
Then the mean field distribution can be formulated as

$$Q(x \mid I, B_I) = \prod_{i=1}^{n} Q(x_i^{\text{cls}}, x_i^{\text{bbox}} \mid h_i)\, Q(h_i \mid f_i^{v}) \prod_{j \neq i} Q(x_{i \to j} \mid h_{i \to j})\, Q(h_{i \to j} \mid f_{i \to j}^{e}) \quad (2)$$

where $f_i^{v}$ is the visual feature of the $i$-th node, and $f_{i \to j}^{e}$ is the visual feature of the edge from the $i$-th node to the $j$-th node. In the first iteration, the GRU units take the visual features $f^{v}$ and $f^{e}$ as input (Fig. 3(a)). We use the visual feature of the proposal box as the visual feature $f_i^{v}$ for the $i$-th node. We use the visual feature of the union box over the proposal boxes $b_i$, $b_j$ as the visual feature $f_{i \to j}^{e}$ for edge $i \to j$. These visual features are extracted by a ROI-pooling layer [12] from the image.
layer [12] from the image. In later iterations, the inputs are
the aggregated messages from other GRU units of the pre-
vious step. We talk about how the messages are aggregated
and passed in the next subsection.
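To make the update schedule concrete, here is a minimal PyTorch sketch (our interpretation of the setup above; tensor names and the number of iterations are illustrative) of shared node and edge GRU cells that consume ROI-pooled visual features at the first step and aggregated messages afterwards. The message pooling is only stubbed out here; a fuller sketch of the pooling functions follows Eqs. (3)-(4) in Sec. 3.3.

```python
import torch
import torch.nn as nn

D = 512                      # hidden size, matching the 512-d GRUs in Sec. 3.4
node_gru = nn.GRUCell(D, D)  # one set of weights shared by all node GRUs
edge_gru = nn.GRUCell(D, D)  # another set of weights shared by all edge GRUs

n = 4                              # number of object proposals (toy value)
f_v = torch.randn(n, D)            # ROI-pooled node features (placeholders)
f_e = torch.randn(n * (n - 1), D)  # union-box edge features (placeholders)

# First iteration: hidden states are produced from the visual features.
h_node = node_gru(f_v, torch.zeros(n, D))
h_edge = edge_gru(f_e, torch.zeros(n * (n - 1), D))

def pool_messages_stub(h_node, h_edge):
    # Placeholder for the learned message pooling of Eqs. (3)-(4); see Sec. 3.3.
    return h_node.clone(), h_edge.clone()

# Later iterations: inputs are the aggregated messages from the previous step.
for _ in range(2):
    m_node, m_edge = pool_messages_stub(h_node, h_edge)
    h_node = node_gru(m_node, h_node)
    h_edge = edge_gru(m_edge, h_edge)
```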
3.3. Primal Dual Update and Message Pooling
Sec. 3.2 offers a generic formulation for solving the graph inference problem using RNNs. However, we observe that
we can further improve the inference efficiency by leverag-
ing the unique bipartite structure of a scene graph. In the
scene graph topology, the neighbors of the edge GRUs are
node GRUs, and vice versa. Passing messages along this
structure forms two disjoint sub-graphs that are the dual
graph to each other. Specifically, we have a node-centric
primal graph, in which each node GRU gets messages from
its inbound and outbound edge GRUs. In the edge-centric
dual graph, each edge GRU gets messages from its sub-
ject node GRU and object node GRU (Fig. 3(b)). We can
therefore improve inference efficiency by iteratively passing
messages between these two sub-graphs instead of through
a densely connected graph (Fig. 3(c)).
As each GRU receives multiple incoming messages, we
need an aggregation function that can fuse information from
all messages into a meaningful representation. A naïve approach would be standard pooling methods such as average- or max-pooling. However, we found that it is more effective
to learn adaptive weights that can modulate the influences of
incoming messages and only keep the relevant information.
We introduce a message pooling function that computes the
weight factors for each incoming message and fuse the mes-
sages using a weighted sum. We provide an empirical anal-
ysis of different message pooling functions in Sec. 4.
Formally, given the current GRU hidden states of nodes and edges $h_i$ and $h_{i \to j}$, we denote the messages to update the $i$-th node as $m_i$, which is computed by a function of its own hidden state $h_i$ and the hidden states of its outbound edge GRUs $h_{i \to j}$ and inbound edge GRUs $h_{j \to i}$. Similarly, we denote the message to update the edge from the $i$-th node to the $j$-th node as $m_{i \to j}$, which is computed by a function of its own hidden state $h_{i \to j}$ and the hidden states of its subject node GRU $h_i$ and its object node GRU $h_j$. To be more specific, $m_i$ and $m_{i \to j}$ are computed by the following two adaptively weighted message pooling functions:

$$m_i = \sum_{j : i \to j} \sigma\left(v_1^{T}[h_i, h_{i \to j}]\right) h_{i \to j} + \sum_{j : j \to i} \sigma\left(v_2^{T}[h_i, h_{j \to i}]\right) h_{j \to i} \quad (3)$$

$$m_{i \to j} = \sigma\left(w_1^{T}[h_i, h_{i \to j}]\right) h_i + \sigma\left(w_2^{T}[h_j, h_{i \to j}]\right) h_j \quad (4)$$

where $[\cdot]$ denotes a concatenation of vectors, and $\sigma$ denotes a sigmoid function. $w_1$, $w_2$ and $v_1$, $v_2$ are learnable parameters. These two equations describe the primal-dual update rules, as shown in Fig. 3(b).
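The following PyTorch sketch shows our reading of Eqs. (3)-(4) for a single node $i$ and a single edge $i \to j$ (variable names are ours; in a real model $v_1$, $v_2$, $w_1$, $w_2$ would be nn.Parameter vectors of length $2D$, since they act on a concatenation of two hidden states).

```python
import torch

D = 512
# Learnable projection vectors from Eqs. (3)-(4); random placeholders here.
v1, v2 = torch.randn(2 * D), torch.randn(2 * D)
w1, w2 = torch.randn(2 * D), torch.randn(2 * D)

def node_message(h_i, h_out, h_in):
    """Eq. (3): h_out / h_in are the hidden states of outbound / inbound edge GRUs."""
    m = torch.zeros_like(h_i)
    for h_ij in h_out:
        m = m + torch.sigmoid(torch.dot(v1, torch.cat([h_i, h_ij]))) * h_ij
    for h_ji in h_in:
        m = m + torch.sigmoid(torch.dot(v2, torch.cat([h_i, h_ji]))) * h_ji
    return m

def edge_message(h_ij, h_i, h_j):
    """Eq. (4): adaptively weighted sum of the subject and object node states."""
    return (torch.sigmoid(torch.dot(w1, torch.cat([h_i, h_ij]))) * h_i
            + torch.sigmoid(torch.dot(w2, torch.cat([h_j, h_ij]))) * h_j)

h_i, h_j = torch.randn(D), torch.randn(D)
h_ij, h_ji = torch.randn(D), torch.randn(D)
print(node_message(h_i, [h_ij], [h_ji]).shape)  # torch.Size([512])
print(edge_message(h_ij, h_i, h_j).shape)       # torch.Size([512])
```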
3.4. Implementation Details
Our final output layers closely follow the Faster R-CNN setup [32]. We use a softmax layer to produce the final
scores for the object class as well as relationship predicate.
We use a fully-connected layer to regress to the bounding
box offsets for each object class separately. We use the cross
entropy loss for the object class and the relationship predi-
cate. We use ℓ1 loss for the bounding box offsets.
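A minimal sketch of this multi-task objective as we understand it (equal weighting of the three terms is our assumption, and the paper's per-class box regression is simplified here to a single set of offsets):

```python
import torch
import torch.nn.functional as F

def scene_graph_loss(cls_logits, cls_targets, rel_logits, rel_targets,
                     bbox_pred, bbox_targets):
    """Cross entropy for object classes and predicates, L1 for box offsets."""
    loss_cls = F.cross_entropy(cls_logits, cls_targets)
    loss_rel = F.cross_entropy(rel_logits, rel_targets)
    loss_box = F.l1_loss(bbox_pred, bbox_targets)
    return loss_cls + loss_rel + loss_box  # equal weights: our assumption

# Toy shapes: 8 proposals, 151 object classes (150 + background),
# 51 predicates (50 + "none"), 20 sampled edges.
loss = scene_graph_loss(torch.randn(8, 151), torch.randint(0, 151, (8,)),
                        torch.randn(20, 51), torch.randint(0, 51, (20,)),
                        torch.randn(8, 4), torch.randn(8, 4))
print(loss.item())
```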
We use an MS COCO-pretrained VGG-16 network to ex-
tract visual features from images. We freeze the weights of
all convolution layers, and only finetune the fully connected
layers, including the GRUs. The node GRUs and the edge
GRUs have both 512-dimensional input and output. Dur-
ing training, we first use NMS to select at most 2,000 boxes
from all proposed boxes $B_I$, and then randomly select 128
boxes as the object proposals. Due to the quadratic number
of edges and sparsity of the annotations, we first sample all
edges that have labels. If an image has fewer than 128 labeled
edges, we fill the rest with unlabeled edges. At test time,
we use NMS to select at most 50 boxes from the object pro-
posals with an IoU threshold of 0.3. We make predictions
on all edges except the self-connections at test time.
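The training-time edge sampling described above might look like the following sketch (our reconstruction; the function name and the exact shuffling are assumptions):

```python
import random

def sample_training_edges(labeled_edges, all_edges, budget=128):
    """Keep every labeled edge, then pad with unlabeled edges up to the budget."""
    sampled = list(labeled_edges)
    if len(sampled) < budget:
        unlabeled = [e for e in all_edges if e not in labeled_edges]
        random.shuffle(unlabeled)
        sampled += unlabeled[:budget - len(sampled)]
    return sampled

n = 16  # toy number of proposals
all_edges = [(i, j) for i in range(n) for j in range(n) if i != j]
labeled = {(0, 1), (2, 3), (4, 5)}
batch = sample_training_edges(labeled, all_edges)
print(len(batch))  # 128: 3 labeled edges plus 125 unlabeled fillers
```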
4. Experiments
We evaluate our model on generating scene graphs from
images. We compare our model against a recently proposed
model on visual relationship prediction [26]. Our goal is to
analyze our model in datasets with both sparse and dense
relationship annotations. We use a new scene graph dataset
based on the Visual Genome dataset [20] in our main ex-
periment. We also evaluate our model on the support rela-
tion inference task in the NYU Depth v2 dataset. The key
difference between these two datasets is that scene graph
annotation is very sparse: among all possible pairings of
objects, only 1.6% of them are labeled with a relationship
predicate. The NYU Depth v2 dataset, on the other hand,
exhaustively annotates the support of every labeled object.
Our experiments show that our model outperforms the base-
line model [26], and can generalize to other types of rela-
tionships, in particular support relations [28], without any
architecture change.
Visual Genome We introduce a new scene graph dataset
based on the Visual Genome dataset [20]. The original VG
scene graph dataset contains 108,077 images with an aver-
age of 38 objects and 22 relationships per image. However,
a substantial fraction of the object annotations have poor-
quality and overlapping bounding boxes and/or ambiguous
object names. We manually cleaned up per-box annota-
tions. On average, this annotation refinement process cor-
rected 22 bounding boxes and/or names, deleted 7.4 boxes,
and merged 5.4 duplicate bounding boxes per image. The
new dataset contains an average of 25 distinct objects and
22 relationships per image. In this experiment, we use the
most frequent 150 object categories and 50 predicates for
evaluation. As a result, each image has a scene graph of
around 11.5 objects and 6.2 relationships. We use 70% of
the images for training and the remaining 30% for testing.
NYU Depth V2 We also evaluate our model on the support
relation graphs from the NYU Depth v2 dataset [28]. The
dataset contains 1,449 RGB-D images captured in 27 indoor
scenes. Each image is annotated with instance segmenta-
tion, region class labels, and support relations between re-
gions. We use the standard split, with 795 images used for
training and 654 images for testing.
4.1. Semantic Scene Graph Generation
Setup Given an image, the scene graph generation task
is to localize a set of objects, classify their category labels,
and predict relationships between each pair of the objects.
We evaluate our model on the new scene graph dataset. We
analyze our model in three setups below.
1. The predicate classification (PREDCLS) task is to
predict the predicates of all pairwise relationships of
a set of localized objects. This task examines the
model’s performance on predicate classification in iso-
lation from other factors.
2. The scene graph classification (SGCLS) task is to
predict the predicate as well as the object categories
of the subject and the object in every pairwise relation-
ship given a set of localized objects.
3. The scene graph generation (SGGEN) task is to si-
multaneously detect a set of objects and predict the
predicate between each pair of the detected objects.
An object is considered to be correctly detected if it
has at least 0.5 IoU overlap with the ground-truth box.
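The 0.5 IoU detection criterion above can be made concrete with a short sketch (our own helper, not code from the paper):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)) >= 0.5)  # False: IoU is about 0.14
```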
We adopted the image-wise recall evaluation metrics,
R@50 and R@100, that are used in Lu et al. [26] for

Citations
Journal ArticleDOI
TL;DR: This article provides a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields and proposes a new taxonomy to divide the state-of-the-art GNNs into four categories, namely, recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial–temporal GNNs.
Abstract: Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications, where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on the existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this article, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art GNNs into four categories, namely, recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial–temporal GNNs. We further discuss the applications of GNNs across various domains and summarize the open-source codes, benchmark data sets, and model evaluation of GNNs. Finally, we propose potential research directions in this rapidly growing field.

4,584 citations

Proceedings ArticleDOI
18 Jun 2018
TL;DR: It is argued that the organization of 3D point clouds can be efficiently captured by a structure called superpoint graph (SPG), derived from a partition of the scanned scene into geometrically homogeneous elements.
Abstract: We propose a novel deep learning-based framework to tackle the challenge of semantic segmentation of large-scale point clouds of millions of points. We argue that the organization of 3D point clouds can be efficiently captured by a structure called superpoint graph (SPG), derived from a partition of the scanned scene into geometrically homogeneous elements. SPGs offer a compact yet rich representation of contextual relationships between object parts, which is then exploited by a graph convolutional network. Our framework sets a new state of the art for segmenting outdoor LiDAR scans (+11.9 and +8.8 mIoU points for both Semantic3D test sets), as well as indoor scans (+12.4 mIoU points for the S3DIS dataset).

1,083 citations

Book ChapterDOI
08 Sep 2018
TL;DR: The authors propose GCN-LSTM, which integrates semantic and spatial object relationships into the image encoder of an attention-based encoder-decoder framework for image captioning.
Abstract: It is always well believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has not been evidence in support of the idea on image description generation. In this paper, we introduce a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework. Specifically, we present Graph Convolutional Networks plus Long Short-Term Memory (dubbed as GCN-LSTM) architecture that novelly integrates both semantic and spatial object relationships into image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representations of each region proposed on objects are then refined by leveraging graph structure through GCN. With the learnt region-level features, our GCN-LSTM capitalizes on LSTM-based captioning framework with attention mechanism for sentence generation. Extensive experiments are conducted on COCO image captioning dataset, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on COCO testing set.

775 citations

Proceedings ArticleDOI
04 Apr 2018
TL;DR: This work proposes a method for generating images from scene graphs, enabling explicitly reasoning about objects and their relationships, and validates this approach on Visual Genome and COCO-Stuff.
Abstract: To truly understand the visual world our models should be able not only to recognize images but also generate them. To this end, there has been exciting recent progress on generating images from natural language descriptions. These methods give stunning results on limited domains such as descriptions of birds or flowers, but struggle to faithfully reproduce complex sentences with many objects and relationships. To overcome this limitation we propose a method for generating images from scene graphs, enabling explicitly reasoning about objects and their relationships. Our model uses graph convolution to process input graphs, computes a scene layout by predicting bounding boxes and segmentation masks for objects, and converts the layout to an image with a cascaded refinement network. The network is trained adversarially against a pair of discriminators to ensure realistic outputs. We validate our approach on Visual Genome and COCO-Stuff, where qualitative results, ablations, and user studies demonstrate our method's ability to generate complex images with multiple objects.

645 citations

Proceedings ArticleDOI
07 Apr 2019
TL;DR: New ways to train very deep graph convolutional networks (GCNs) are presented by adapting residual/dense connections and dilated convolutions from CNNs, enabling a 56-layer GCN that significantly boosts point cloud semantic segmentation performance.
Abstract: Convolutional Neural Networks (CNNs) achieve impressive performance in a wide variety of fields. Their success benefited from a massive boost when very deep CNN models were able to be reliably trained. Despite their merits, CNNs fail to properly address problems with non-Euclidean data. To overcome this challenge, Graph Convolutional Networks (GCNs) build graphs to represent non-Euclidean data, borrow concepts from CNNs, and apply them in training. GCNs show promising results, but they are usually limited to very shallow models due to the vanishing gradient problem. As a result, most state-of-the-art GCN models are no deeper than 3 or 4 layers. In this work, we present new ways to successfully train very deep GCNs. We do this by borrowing concepts from CNNs, specifically residual/dense connections and dilated convolutions, and adapting them to GCN architectures. Extensive experiments show the positive effect of these deep GCN frameworks. Finally, we use these new concepts to build a very deep 56-layer GCN, and show how it significantly boosts performance (+3.7% mIoU over state-of-the-art) in the task of point cloud semantic segmentation. We believe that the community can greatly benefit from this work, as it opens up many opportunities for advancing GCN-based research.

609 citations

References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations

Proceedings Article
03 Dec 2012
TL;DR: The authors trained a deep convolutional neural network with five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieving top-1 and top-5 error rates of 37.5% and 17.0% on the ImageNet LSVRC-2010 contest, considerably better than the previous state of the art.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

55,235 citations


Posted Content
TL;DR: Faster R-CNN as discussed by the authors proposes a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-CNN for detection.
Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

23,183 citations