Edinburgh Research Explorer

Citation for published version:
Marcheggiani, D., & Titov, I. (2017). Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pp. 1506–1515. Association for Computational Linguistics, Copenhagen, Denmark. DOI: 10.18653/v1/D17-1159

Document Version: Peer reviewed version

Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling

Diego Marcheggiani¹   Ivan Titov¹,²
¹ILLC, University of Amsterdam
²ILCC, School of Informatics, University of Edinburgh
marcheggiani@uva.nl   ititov@inf.ed.ac.uk
Abstract

Semantic role labeling (SRL) is the task of identifying the predicate-argument structure of a sentence. It is typically regarded as an important step in the standard NLP pipeline. As the semantic representations are closely related to syntactic ones, we exploit syntactic information in our model. We propose a version of graph convolutional networks (GCNs), a recent class of neural networks operating on graphs, suited to model syntactic dependency graphs. GCNs over syntactic dependency trees are used as sentence encoders, producing latent feature representations of words in a sentence. We observe that GCN layers are complementary to LSTM ones: when we stack both GCN and LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009) both for Chinese and English.
1 Introduction

Semantic role labeling (SRL) (Gildea and Jurafsky, 2002) can be informally described as the task of discovering who did what to whom. For example, consider an SRL dependency graph shown above the sentence in Figure 1. Formally, the task includes (1) detection of predicates (e.g., makes); (2) labeling the predicates with a sense from a sense inventory (e.g., make.01); (3) identifying and assigning arguments to semantic roles (e.g., Sequa is A0, i.e., an agent / 'doer' for the corresponding predicate, and engines is A1, i.e., a patient / 'an affected entity'). SRL is often regarded as an important step in the standard NLP pipeline, providing information to downstream tasks such as information extraction and question answering.

[Figure 1: An example sentence, "Sequa makes and repairs jet engines.", annotated with semantic dependencies (top: A0/A1 arcs for the predicates make.01, repair.01 and engine.01) and syntactic dependencies (bottom: SBJ, OBJ, COORD, CONJ, NMOD, ROOT).]
The semantic representations are closely related to syntactic ones, even though the syntax-semantics interface is far from trivial (Levin, 1993). For example, one can observe that many arcs in the syntactic dependency graph (shown in black below the sentence in Figure 1) are mirrored in the semantic dependency graph. Given these similarities, and also because of the availability of accurate syntactic parsers for many languages, it seems natural to exploit syntactic information when predicting semantics. Though historically most SRL approaches did rely on syntax (Thompson et al., 2003; Pradhan et al., 2005; Punyakanok et al., 2008; Johansson and Nugues, 2008), the last generation of SRL models put syntax aside in favor of neural sequence models, namely LSTMs (Zhou and Xu, 2015; Marcheggiani et al., 2017), and outperformed syntactically-driven methods on standard benchmarks. We believe that one of the reasons for this radical choice is the lack of simple and effective methods for incorporating syntactic information into sequential neural networks (namely, at the level of words). In this paper we propose one way to address this limitation.
Specifically, we rely on graph convolutional networks (GCNs) (Duvenaud et al., 2015; Kipf and Welling, 2017; Kearnes et al., 2016), a recent class of multilayer neural networks operating on graphs. For every node in the graph (in our case a word in a sentence), a GCN encodes relevant information about its neighborhood as a real-valued feature vector. GCNs have been studied largely in the context of undirected unlabeled graphs. We introduce a version of GCNs for modeling syntactic dependency structures, generally applicable to labeled directed graphs.

A one-layer GCN encodes only information about immediate neighbors, and K layers are needed to encode K-order neighborhoods (i.e., information about nodes at most K hops away). This contrasts with recurrent and recursive neural networks (Elman, 1990; Socher et al., 2013) which, at least in theory, can capture statistical dependencies across unbounded paths in a tree or in a sequence. However, as we will further discuss in Section 3.3, this is not a serious limitation when GCNs are used in combination with encoders based on recurrent networks (LSTMs). When we stack GCNs on top of LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009), both for English and Chinese.¹

Interestingly, again unlike recursive neural networks, GCNs do not constrain the graph to be a tree. We believe that there are many applications in NLP where GCN-based encoders of sentences or even documents can be used to incorporate knowledge about linguistic structures (e.g., representations of syntax, semantics or discourse). For example, GCNs can take as input combined syntactic-semantic graphs (e.g., the entire graph from Figure 1) and be used within downstream tasks such as machine translation or question answering. However, we leave this for future work and here solely focus on SRL.
The contributions of this paper can be summarized as follows:

- we are the first to show that GCNs are effective for NLP;
- we propose a generalization of GCNs suited to encoding syntactic information at word level;
- we propose a GCN-based SRL model and obtain state-of-the-art results on the English and Chinese portions of the CoNLL-2009 dataset;
- we show that bidirectional LSTMs and syntax-based GCNs have complementary modeling power.

¹ The code is available at https://github.com/diegma/neural-dep-srl.
2 Graph Convolutional Networks

In this section we describe the GCNs of Kipf and Welling (2017). Please refer to Gilmer et al. (2017) for a comprehensive overview of GCN versions.

GCNs are neural networks operating on graphs and inducing features of nodes (i.e., real-valued vectors / embeddings) based on properties of their neighborhoods. In Kipf and Welling (2017), they were shown to be very effective for the node classification task: the classifier was estimated jointly with a GCN, so that the induced node features were informative for the node classification problem. Depending on how many layers of convolution are used, GCNs can capture information only about immediate neighbors (with one layer of convolution) or about any nodes at most K hops away (if K layers are stacked on top of each other).
More formally, consider an undirected graph G = (V, E), where V (|V| = n) and E are sets of nodes and edges, respectively. Kipf and Welling (2017) assume that the edge set contains all self-loops, i.e., (v, v) ∈ E for any v. We can define a matrix X ∈ R^{m×n} with each column x_v ∈ R^m (v ∈ V) encoding node features. The vectors can either encode genuine features (e.g., this vector can encode the title of a paper if citation graphs are considered) or be a one-hot vector. The node representation, encoding information about its immediate neighbors, is computed as

    h_v = ReLU( \sum_{u \in N(v)} (W x_u + b) ),    (1)

where W ∈ R^{m×m} and b ∈ R^m are a weight matrix and a bias, respectively; N(v) are the neighbors of v; ReLU is the rectified linear unit activation function.² Note that v ∈ N(v) (because of self-loops), so the input feature representation of v (i.e., x_v) affects its induced representation h_v.
² We dropped normalization factors used in Kipf and Welling (2017), as they are not used in our syntactic GCNs.
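As an illustration of Eq. (1), here is a minimal sketch of a single GCN layer in numpy. The names (gcn_layer, neighbors, etc.) are ours for illustration only and do not come from the paper's released code; the adjacency list is assumed to already contain self-loops.

```python
# Minimal sketch of the GCN layer in Eq. (1): h_v = ReLU(sum_{u in N(v)} (W x_u + b)).
# Node features are the columns of X (shape m x n); neighbors[v] lists every u with
# an edge (u, v), including the self-loop v itself.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gcn_layer(X, neighbors, W, b):
    m, n = X.shape
    H = np.zeros((m, n))
    for v in range(n):
        msg = sum(W @ X[:, u] + b for u in neighbors[v])  # aggregate over N(v)
        H[:, v] = relu(msg)
    return H

# Toy usage: a 3-node path graph 0 - 1 - 2 with self-loops.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                    # m = 4 features, n = 3 nodes
neighbors = [[0, 1], [0, 1, 2], [1, 2]]
H = gcn_layer(X, neighbors, 0.1 * rng.normal(size=(4, 4)), np.zeros(4))
print(H.shape)                                 # (4, 3)
```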

[Figure 2: A simplified syntactic GCN over the sentence "Lane disputed those estimates" (bias terms and gates are omitted); the syntactic graph of the sentence (NMOD, SBJ, OBJ arcs) is shown with dashed lines at the bottom. Parameter matrices are sub-indexed with syntactic functions, and apostrophes (e.g., subj') signify that information flows in the direction opposite of the dependency arcs (i.e., from dependents to heads).]
As in standard convolutional networks (LeCun et al., 2001), by stacking GCN layers one can incorporate higher degree neighborhoods:

    h^{(k+1)}_v = ReLU( \sum_{u \in N(v)} ( W^{(k)} h^{(k)}_u + b^{(k)} ) ),

where k denotes the layer number and h^{(1)}_v = x_v.
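To make the stacking explicit, the short follow-up below applies K layers with layer-specific parameters, so that each node's final representation reflects its K-hop neighborhood. It reuses the hypothetical gcn_layer from the previous sketch and is an illustration, not the paper's implementation.

```python
# Sketch of stacking K GCN layers; weights and biases hold one (W^(k), b^(k)) per layer.
def gcn_stack(X, neighbors, weights, biases):
    H = X                                  # h^(1)_v = x_v
    for W_k, b_k in zip(weights, biases):  # one pass per layer k
        H = gcn_layer(H, neighbors, W_k, b_k)
    return H
```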
3 Syntactic GCNs

As syntactic dependency trees are directed and labeled (we refer to the dependency labels as syntactic functions), we first need to modify the computation in order to incorporate label information (Section 3.1). In the subsequent section, we incorporate gates in GCNs, so that the model can decide which edges are more relevant to the task in question. Having gates is also important as we rely on automatically predicted syntactic representations, and the gates can detect and downweight potentially erroneous edges.
3.1 Incorporating directions and labels

Now, we introduce a generalization of GCNs appropriate for syntactic dependency trees and, in general, for directed labeled graphs. First note that there is no reason to assume that information flows only along the syntactic dependency arcs (e.g., from makes to Sequa), so we allow it to flow in the opposite direction as well (i.e., from dependents to heads). We use a graph G = (V, E), where the edge set contains all pairs of nodes (i.e., words) adjacent in the dependency tree. In our example, both (Sequa, makes) and (makes, Sequa) belong to the edge set. The graph is labeled, and the label L(u, v) for (u, v) ∈ E contains both information about the syntactic function and an indication of whether the edge is in the same or opposite direction as the syntactic dependency arc. For example, the label for (makes, Sequa) is subj, whereas the label for (Sequa, makes) is subj', with the apostrophe indicating that the edge is in the direction opposite to the corresponding syntactic arc. Similarly, self-loops will have the label self. Consequently, we can simply assume that the GCN parameters are label-specific, resulting in the following computation, also illustrated in Figure 2:

    h^{(k+1)}_v = ReLU( \sum_{u \in N(v)} ( W^{(k)}_{L(u,v)} h^{(k)}_u + b^{(k)}_{L(u,v)} ) ).
This model is over-parameterized,³ especially given that SRL datasets are moderately sized by deep learning standards. So instead of learning the GCN parameters directly, we define them as

    W^{(k)}_{L(u,v)} = V^{(k)}_{dir(u,v)},    (2)

where dir(u, v) indicates whether the edge (u, v) is (1) directed along the syntactic dependency arc, (2) directed in the opposite direction, or (3) a self-loop; V^{(k)}_{dir(u,v)} ∈ R^{m×m}. Our simplification captures the intuition that information should be propagated differently along edges depending on whether this is a head-to-dependent or dependent-to-head edge (i.e., along or opposite the corresponding syntactic arc) and whether it is a self-loop. So we do not share any parameters between these three very different edge types. Syntactic functions are important, but perhaps less crucial, so they are encoded only in the feature vectors b_{L(u,v)}.

³ The Chinese and English CoNLL-2009 datasets used 41 and 48 different syntactic functions, which would result in having 83 and 97 different matrices in every layer, respectively.
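To make Section 3.1 concrete, the sketch below builds the directed labeled edge set from a dependency tree and applies one syntactic GCN layer with the parameter sharing of Eq. (2): one matrix per edge direction and one bias per label. All names (build_edges, syn_gcn_layer, dep_tree, ...) are illustrative assumptions, not the authors' code.

```python
# Sketch of a syntactic GCN layer with direction-specific matrices (Eq. 2)
# and label-specific biases.
import numpy as np

ALONG, OPPOSITE, SELF = 0, 1, 2

def build_edges(dep_tree, n_words):
    """dep_tree: list of (head, dependent, function) triples, e.g. (1, 0, 'subj')."""
    edges = []
    for head, dep, fn in dep_tree:
        edges.append((head, dep, ALONG, fn))           # along the dependency arc
        edges.append((dep, head, OPPOSITE, fn + "'"))  # dependent -> head
    for v in range(n_words):
        edges.append((v, v, SELF, "self"))             # self-loops
    return edges

def syn_gcn_layer(H, edges, V_dir, b_label):
    """H: m x n node features; V_dir: 3 matrices (m x m), one per direction;
    b_label: dict mapping labels (e.g. 'subj', "subj'", 'self') to (m,) biases."""
    m, n = H.shape
    out = np.zeros((m, n))
    for u, v, d, label in edges:
        out[:, v] += V_dir[d] @ H[:, u] + b_label[label]
    return np.maximum(out, 0.0)  # ReLU
```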
3.2 Edge-wise gating

Uniformly accepting information from all neighboring nodes may not be appropriate for the SRL setting. For example, we see in Figure 1 that many semantic arcs just mirror their syntactic counterparts, so they may need to be up-weighted. Moreover, we rely on automatically predicted syntactic structures, and, even for English, syntactic parsers are far from being perfect, especially when used out-of-domain. It is risky for a downstream application to rely on a potentially wrong syntactic edge, so the corresponding message in the neural network may need to be down-weighted.
In order to address the above issues, inspired by recent literature (van den Oord et al., 2016; Dauphin et al., 2016), we calculate for each edge node pair a scalar gate of the form

    g^{(k)}_{u,v} = σ( h^{(k)}_u · v̂^{(k)}_{dir(u,v)} + b̂^{(k)}_{L(u,v)} ),    (3)

where σ is the logistic sigmoid function, and v̂^{(k)}_{dir(u,v)} ∈ R^m and b̂^{(k)}_{L(u,v)} ∈ R are weights and a bias for the gate. With this additional gating mechanism, the final syntactic GCN computation is formulated as

    h^{(k+1)}_v = ReLU( \sum_{u \in N(v)} g^{(k)}_{u,v} ( V^{(k)}_{dir(u,v)} h^{(k)}_u + b^{(k)}_{L(u,v)} ) ).    (4)
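The following is a minimal sketch of how the edge-wise gate of Eqs. (3)-(4) modifies the layer above; the names and data layout are our assumptions rather than the paper's implementation.

```python
# Sketch of the gated syntactic GCN layer: each message is scaled by a scalar
# gate computed from the sending node's representation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_syn_gcn_layer(H, edges, V_dir, b_label, v_hat_dir, b_hat_label):
    """v_hat_dir: 3 gate weight vectors (m,); b_hat_label: dict label -> scalar bias."""
    m, n = H.shape
    out = np.zeros((m, n))
    for u, v, d, label in edges:
        gate = sigmoid(H[:, u] @ v_hat_dir[d] + b_hat_label[label])  # Eq. (3)
        out[:, v] += gate * (V_dir[d] @ H[:, u] + b_label[label])    # Eq. (4)
    return np.maximum(out, 0.0)                                      # ReLU
```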
3.3 Complementarity of GCNs and LSTMs

The inability of GCNs to capture dependencies between nodes far away from each other in the graph may seem like a serious problem, especially in the context of SRL: paths between predicates and arguments often include many dependency arcs (Roth and Lapata, 2016). However, when graph convolution is performed on top of LSTM states (i.e., LSTM states serve as input x_v = h^{(1)}_v to the GCN) rather than static word embeddings, the GCN may not need to capture more than a couple of hops.

To elaborate on this, let us speculate what role GCNs would play when used in combination with LSTMs, given that LSTMs have already been shown to be very effective for SRL (Zhou and Xu, 2015; Marcheggiani et al., 2017). Though LSTMs are capable of capturing at least some degree of syntax (Linzen et al., 2016) without explicit syntactic supervision, SRL datasets are moderately sized, so LSTM models may still struggle with harder cases. Typically, harder cases for SRL involve arguments far away from their predicates. In fact, 20% and 30% of arguments are more than 5 tokens away from their predicate in our English and Chinese collections, respectively. However, if we imagine that we can 'teleport' even over a single (longest) syntactic dependency edge, the 'distance' would shrink: only 9% and 13% of arguments would then be more than 5 LSTM steps away (again for English and Chinese, respectively). GCNs provide this 'teleportation' capability. These observations suggest that LSTMs and GCNs may be complementary, and we will see that empirical results support this intuition.

[Figure 3: Predicting an argument and its label with an LSTM + GCN encoder: word representations for "Lane disputed those estimates" are fed through J BiLSTM layers and then K GCN layers over the predicted syntactic tree (nsubj, dobj, nmod arcs), with a classifier on top predicting the role (here A1).]
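The arrangement in Figure 3 can be sketched roughly as follows: a PyTorch BiLSTM produces the contextual states that serve as GCN inputs (x_v = h^{(1)}_v), and the gated syntactic GCN layer from the earlier sketch is stacked on top. The class and parameter names are hypothetical; this is an illustration of the architecture, not the authors' implementation.

```python
# Sketch of the LSTM + GCN encoder: J BiLSTM layers followed by K gated
# syntactic GCN layers; the output feeds a semantic role classifier.
import torch
import torch.nn as nn

class BiLSTMGCNEncoder(nn.Module):
    def __init__(self, emb_dim, hidden_dim, n_lstm_layers=2):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=n_lstm_layers,
                              bidirectional=True, batch_first=True)

    def forward(self, word_embs, edges, gcn_params):
        # word_embs: (1, n_words, emb_dim) word representations for one sentence
        states, _ = self.bilstm(word_embs)          # (1, n, 2 * hidden_dim)
        H = states.squeeze(0).T.detach().numpy()    # m x n matrix: GCN input x_v
        for layer_params in gcn_params:             # K gated syntactic GCN layers
            H = gated_syn_gcn_layer(H, edges, *layer_params)
        return H                                    # per-word features for the classifier
```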
4 Syntax-Aware Neural SRL Encoder

In this work, we build our semantic role labeler on top of the syntax-agnostic LSTM-based SRL model of Marcheggiani et al. (2017), which already achieves state-of-the-art results on the CoNLL-2009 English dataset. Following their approach, we employ the same bidirectional LSTM (BiLSTM) encoder and enrich it with a syntactic GCN.

The CoNLL-2009 benchmark assumes that predicate positions are already marked in the test set (e.g., we would know that makes, repairs and engines in Figure 1 are predicates), so no predicate identification is needed. Also, as we focus here solely on identifying arguments and labeling them with semantic roles, for predicate disambiguation

References

Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. In ICLR.
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8).
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In NIPS.
Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2).