Edinburgh Research Explorer

Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling

Citation for published version:
Marcheggiani, D & Titov, I 2017, 'Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling', in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Association for Computational Linguistics, Copenhagen, Denmark, pp. 1506–1515. https://doi.org/10.18653/v1/D17-1159

Digital Object Identifier (DOI): 10.18653/v1/D17-1159
Document Version: Peer reviewed version
Published In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)

General rights:
Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and/or other copyright owners, and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy:
The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright, please contact openaccess@ed.ac.uk providing details, and we will remove access to the work immediately and investigate your claim.
Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling

Diego Marcheggiani¹  Ivan Titov¹,²
¹ ILLC, University of Amsterdam
² ILCC, School of Informatics, University of Edinburgh
marcheggiani@uva.nl  ititov@inf.ed.ac.uk
Abstract

Semantic role labeling (SRL) is the task of identifying the predicate-argument structure of a sentence. It is typically regarded as an important step in the standard NLP pipeline. As the semantic representations are closely related to syntactic ones, we exploit syntactic information in our model. We propose a version of graph convolutional networks (GCNs), a recent class of neural networks operating on graphs, suited to model syntactic dependency graphs. GCNs over syntactic dependency trees are used as sentence encoders, producing latent feature representations of words in a sentence. We observe that GCN layers are complementary to LSTM ones: when we stack both GCN and LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009), both for Chinese and English.
1 Introduction

Semantic role labeling (SRL) (Gildea and Jurafsky, 2002) can be informally described as the task of discovering who did what to whom. For example, consider an SRL dependency graph shown above the sentence in Figure 1. Formally, the task includes (1) detection of predicates (e.g., makes); (2) labeling the predicates with a sense from a sense inventory (e.g., make.01); (3) identifying and assigning arguments to semantic roles (e.g., Sequa is A0, i.e., an agent / 'doer' for the corresponding predicate, and engines is A1, i.e., a patient / 'an affected entity'). SRL is often regarded as an important step in the standard NLP pipeline, providing information to downstream tasks such as information extraction and question answering.

[Figure 1: An example sentence, "Sequa makes and repairs jet engines.", annotated with semantic (top) and syntactic dependencies (bottom).]
The semantic representations are closely related to syntactic ones, even though the syntax-semantics interface is far from trivial (Levin, 1993). For example, one can observe that many arcs in the syntactic dependency graph (shown in black below the sentence in Figure 1) are mirrored in the semantic dependency graph. Given these similarities, and also because of the availability of accurate syntactic parsers for many languages, it seems natural to exploit syntactic information when predicting semantics. Though historically most SRL approaches did rely on syntax (Thompson et al., 2003; Pradhan et al., 2005; Punyakanok et al., 2008; Johansson and Nugues, 2008), the last generation of SRL models put syntax aside in favor of neural sequence models, namely LSTMs (Zhou and Xu, 2015; Marcheggiani et al., 2017), and outperformed syntactically-driven methods on standard benchmarks. We believe that one of the reasons for this radical choice is the lack of simple and effective methods for incorporating syntactic information into sequential neural networks (namely, at the level of words). In this paper we propose one way to address this limitation.
Specifically, we rely on graph convolutional networks (GCNs) (Duvenaud et al., 2015; Kipf and Welling, 2017; Kearnes et al., 2016), a recent class of multilayer neural networks operating on graphs. For every node in the graph (in our case a word in a sentence), the GCN encodes relevant information about its neighborhood as a real-valued feature vector. GCNs have been studied largely in the context of undirected unlabeled graphs. We introduce a version of GCNs for modeling syntactic dependency structures that is generally applicable to labeled directed graphs.
A one-layer GCN encodes only information about immediate neighbors, and K layers are needed to encode K-order neighborhoods (i.e., information about nodes at most K hops away). This contrasts with recurrent and recursive neural networks (Elman, 1990; Socher et al., 2013) which, at least in theory, can capture statistical dependencies across unbounded paths in a tree or in a sequence. However, as we will further discuss in Section 3.3, this is not a serious limitation when GCNs are used in combination with encoders based on recurrent networks (LSTMs). When we stack GCNs on top of LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009), both for English and Chinese.
Interestingly, again unlike recursive neural networks, GCNs do not constrain the graph to be a tree. We believe that there are many applications in NLP where GCN-based encoders of sentences or even documents can be used to incorporate knowledge about linguistic structures (e.g., representations of syntax, semantics or discourse). For example, GCNs can take as input combined syntactic-semantic graphs (e.g., the entire graph from Figure 1) and be used within downstream tasks such as machine translation or question answering. However, we leave this for future work and here solely focus on SRL.
The contributions of this paper can be summarized as follows:

• we are the first to show that GCNs are effective for NLP;
• we propose a generalization of GCNs suited to encoding syntactic information at the word level;
• we propose a GCN-based SRL model and obtain state-of-the-art results on the English and Chinese portions of the CoNLL-2009 dataset;
• we show that bidirectional LSTMs and syntax-based GCNs have complementary modeling power.

The code is available at https://github.com/diegma/neural-dep-srl.
2 Graph Convolutional Networks

In this section we describe the GCNs of Kipf and Welling (2017). Please refer to Gilmer et al. (2017) for a comprehensive overview of GCN versions.

GCNs are neural networks operating on graphs and inducing features of nodes (i.e., real-valued vectors / embeddings) based on properties of their neighborhoods. In Kipf and Welling (2017), they were shown to be very effective for the node classification task: the classifier was estimated jointly with a GCN, so that the induced node features were informative for the node classification problem. Depending on how many layers of convolution are used, GCNs can capture information only about immediate neighbors (with one layer of convolution) or about any nodes at most K hops away (if K layers are stacked on top of each other).
More formally, consider an undirected graph $G = (V, E)$, where $V$ (with $|V| = n$) and $E$ are sets of nodes and edges, respectively. Kipf and Welling (2017) assume that the edge set contains all self-loops, i.e., $(v, v) \in E$ for any $v$. We can define a matrix $X \in \mathbb{R}^{m \times n}$ with each of its columns $x_v \in \mathbb{R}^m$ ($v \in V$) encoding node features. The vectors can either encode genuine features (e.g., this vector can encode the title of a paper if citation graphs are considered) or be one-hot vectors. The node representation, encoding information about its immediate neighbors, is computed as

$$h_v = \mathrm{ReLU}\Big(\sum_{u \in N(v)} (W x_u + b)\Big), \qquad (1)$$

where $W \in \mathbb{R}^{m \times m}$ and $b \in \mathbb{R}^m$ are a weight matrix and a bias, respectively; $N(v)$ is the set of neighbors of $v$; ReLU is the rectified linear unit activation function.² Note that $v \in N(v)$ (because of self-loops), so the input feature representation of $v$ (i.e., $x_v$) affects its induced representation $h_v$.

² We dropped the normalization factors used in Kipf and Welling (2017), as they are not used in our syntactic GCNs.
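As a concrete sketch (not the authors' implementation), the convolution in Eq. (1) can be written in a few lines of NumPy; the toy 3-node path graph, random features, and zero bias below are choices made purely for illustration:

```python
import numpy as np

def gcn_layer(X, adj, W, b):
    """One GCN layer as in Eq. (1): for every node v,
    h_v = ReLU(sum over u in N(v) of (W x_u + b)).
    X: (m, n) node features, one column per node.
    adj: (n, n) 0/1 adjacency matrix, assumed to already contain self-loops.
    W: (m, m) weight matrix; b: (m,) bias vector.
    Returns the (m, n) matrix of induced node representations."""
    messages = W @ X + b[:, None]   # W x_u + b computed for every node u at once
    H = messages @ adj              # column v sums the messages from N(v)
    return np.maximum(H, 0.0)       # ReLU

# Toy example: 3-node path graph 0-1-2 with self-loops.
rng = np.random.default_rng(0)
n, m = 3, 4
X = rng.normal(size=(m, n))
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]], dtype=float)
W = rng.normal(size=(m, m))
b = np.zeros(m)
H = gcn_layer(X, adj, W, b)
print(H.shape)  # (4, 3)
```

Because self-loops are in `adj`, column 0 of `H` is ReLU(W(x₀ + x₁)), i.e., node 0's own features contribute to its induced representation, matching the remark after Eq. (1).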
[Figure 2: A simplified syntactic GCN applied to the sentence "Lane disputed those estimates" (bias terms and gates are omitted); the syntactic graph of the sentence (SBJ, OBJ, NMOD arcs) is shown with dashed lines at the bottom. Parameter matrices are sub-indexed with syntactic functions, and apostrophes (e.g., subj') signify that information flows in the direction opposite of the dependency arcs (i.e., from dependents to heads).]
As in standard convolutional networks (LeCun et al., 2001), by stacking GCN layers one can incorporate higher-degree neighborhoods:

$$h_v^{(k+1)} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} W^{(k)} h_u^{(k)} + b^{(k)}\Big),$$

where $k$ denotes the layer number and $h_v^{(1)} = x_v$.
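A small NumPy sketch, under illustrative assumptions (identity weight matrices, zero biases, one-hot input features), makes the K-hop receptive-field claim directly visible:

```python
import numpy as np

def stacked_gcn(X, adj, weights, biases):
    """Apply K GCN layers: h^{(k+1)}_v = ReLU(sum_{u in N(v)} W^{(k)} h^{(k)}_u + b^{(k)}),
    with h^{(1)}_v = x_v. X: (m, n) features; adj: (n, n) adjacency with self-loops;
    weights/biases: lists of per-layer parameters W^{(k)}, b^{(k)}."""
    H = X
    for W, b in zip(weights, biases):
        H = np.maximum((W @ H + b[:, None]) @ adj, 0.0)
    return H

# Path graph 0-1-2-3 with self-loops; one-hot features expose the receptive field.
n = 4
adj = np.eye(n)
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
X = np.eye(n)  # node i starts with a one-hot feature vector
H1 = stacked_gcn(X, adj, [np.eye(n)], [np.zeros(n)])
H2 = stacked_gcn(X, adj, [np.eye(n)] * 2, [np.zeros(n)] * 2)
# After one layer node 0 has only seen nodes {0, 1}; after two layers, {0, 1, 2}.
print(np.flatnonzero(H1[:, 0]))  # [0 1]
print(np.flatnonzero(H2[:, 0]))  # [0 1 2]
```

With identity weights the k-layer output reduces to the k-th power of the adjacency matrix, so a nonzero entry in node 0's column means exactly "reachable in at most k hops".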
3 Syntactic GCNs

As syntactic dependency trees are directed and labeled (we refer to the dependency labels as syntactic functions), we first need to modify the computation in order to incorporate label information (Section 3.1). In the subsequent section, we incorporate gates in GCNs, so that the model can decide which edges are more relevant to the task in question. Having gates is also important as we rely on automatically predicted syntactic representations, and the gates can detect and downweight potentially erroneous edges.
3.1 Incorporating directions and labels

Now, we introduce a generalization of GCNs appropriate for syntactic dependency trees, and in general, for directed labeled graphs. First note that there is no reason to assume that information flows only along the syntactic dependency arcs (e.g., from makes to Sequa), so we allow it to flow in the opposite direction as well (i.e., from dependents to heads). We use a graph $G = (V, E)$, where the edge set contains all pairs of nodes (i.e., words) adjacent in the dependency tree. In our example, both (Sequa, makes) and (makes, Sequa) belong to the edge set. The graph is labeled, and the label $L(u, v)$ for $(u, v) \in E$ contains both information about the syntactic function and an indication of whether the edge is in the same or opposite direction as the syntactic dependency arc. For example, the label for (makes, Sequa) is subj, whereas the label for (Sequa, makes) is subj', with the apostrophe indicating that the edge is in the direction opposite to the corresponding syntactic arc. Similarly, self-loops have the label self. Consequently, we can simply assume that the GCN parameters are label-specific, resulting in the following computation, also illustrated in Figure 2:

$$h_v^{(k+1)} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} W^{(k)}_{L(u,v)} h_u^{(k)} + b^{(k)}_{L(u,v)}\Big).$$
This model is over-parameterized,³ especially given that SRL datasets are moderately sized by deep learning standards. So instead of learning the GCN parameters directly, we define them as

$$W^{(k)}_{L(u,v)} = V^{(k)}_{\mathrm{dir}(u,v)}, \qquad (2)$$

where $\mathrm{dir}(u, v)$ indicates whether the edge $(u, v)$ is (1) directed along the syntactic dependency arc, (2) directed in the opposite direction, or (3) a self-loop; $V^{(k)}_{\mathrm{dir}(u,v)} \in \mathbb{R}^{m \times m}$. Our simplification captures the intuition that information should be propagated differently along edges depending on whether this is a head-to-dependent or dependent-to-head edge (i.e., along or opposite the corresponding syntactic arc) and whether it is a self-loop. So we do not share any parameters between these three very different edge types. Syntactic functions are important, but perhaps less crucial, so they are encoded only in the feature vectors $b^{(k)}_{L(u,v)}$.
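A minimal sketch of this simplified layer: weight matrices are keyed only by edge direction, while the full label selects the bias. The (u, v, direction, label) edge-list encoding and the toy two-word graph are hypothetical choices for this illustration, not the paper's implementation.

```python
import numpy as np

def syntactic_gcn_layer(H, edges, V, b):
    """One simplified syntactic GCN layer (Eq. 2): for each edge (u, v),
    node v receives V[dir] @ h_u + b[label]; a ReLU is applied to the sum.
    H: (m, n) node features; edges: list of (u, v, direction, label);
    V: dict direction -> (m, m) matrix; b: dict label -> (m,) bias."""
    out = np.zeros_like(H)
    for u, v, d, lab in edges:
        out[:, v] += V[d] @ H[:, u] + b[lab]
    return np.maximum(out, 0.0)

# Toy two-word graph for "Sequa makes": word 0 = Sequa, word 1 = makes.
rng = np.random.default_rng(1)
m = 3
H = rng.normal(size=(m, 2))
V = {d: rng.normal(size=(m, m)) for d in ("along", "opp", "self")}
b = {lab: np.zeros(m) for lab in ("subj", "subj'", "self")}
edges = [(1, 0, "along", "subj"),   # head 'makes' -> dependent 'Sequa'
         (0, 1, "opp", "subj'"),    # dependent -> head, opposite the arc
         (0, 0, "self", "self"),
         (1, 1, "self", "self")]
H_out = syntactic_gcn_layer(H, edges, V, b)
```

Note that only three matrices exist per layer regardless of the number of syntactic functions, which is exactly the parameter reduction Eq. (2) is after.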
3.2 Edge-wise gating

Uniformly accepting information from all neighboring nodes may not be appropriate for the SRL setting. For example, we see in Figure 1 that many semantic arcs just mirror their syntactic counterparts, so they may need to be up-weighted. Moreover, we rely on automatically predicted syntactic structures, and, even for English, syntactic parsers are far from being perfect, especially when used out-of-domain. It is risky for a downstream application to rely on a potentially wrong syntactic edge, so the corresponding message in the neural network may need to be down-weighted.

³ The Chinese and English CoNLL-2009 datasets used 41 and 48 different syntactic functions, which would result in having 83 and 97 different matrices in every layer, respectively.

In order to address the above issues, inspired by recent literature (van den Oord et al., 2016; Dauphin et al., 2016), we calculate for each edge node pair a scalar gate of the form
$$g^{(k)}_{u,v} = \sigma\big(h^{(k)}_u \cdot \hat{v}^{(k)}_{\mathrm{dir}(u,v)} + \hat{b}^{(k)}_{L(u,v)}\big), \qquad (3)$$

where $\sigma$ is the logistic sigmoid function, and $\hat{v}^{(k)}_{\mathrm{dir}(u,v)} \in \mathbb{R}^m$ and $\hat{b}^{(k)}_{L(u,v)} \in \mathbb{R}$ are a weight vector and a bias for the gate. With this additional gating mechanism, the final syntactic GCN computation is formulated as

$$h^{(k+1)}_v = \mathrm{ReLU}\Big(\sum_{u \in N(v)} g^{(k)}_{u,v} \big(V^{(k)}_{\mathrm{dir}(u,v)} h^{(k)}_u + b^{(k)}_{L(u,v)}\big)\Big). \qquad (4)$$
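The gated computation in Eqs. (3)-(4) can be sketched similarly; again, the (u, v, direction, label) edge encoding and all parameter values are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_syntactic_gcn_layer(H, edges, V, b, v_hat, b_hat):
    """Edge-wise gated syntactic GCN layer (Eqs. 3-4): each message is scaled
    by a scalar gate g = sigmoid(h_u . v_hat[dir] + b_hat[label]) before summing.
    H: (m, n) node features; edges: list of (u, v, direction, label);
    V: dict direction -> (m, m); b: dict label -> (m,);
    v_hat: dict direction -> (m,) gate weights; b_hat: dict label -> scalar."""
    out = np.zeros_like(H)
    for u, v, d, lab in edges:
        g = sigmoid(H[:, u] @ v_hat[d] + b_hat[lab])   # scalar in (0, 1)
        out[:, v] += g * (V[d] @ H[:, u] + b[lab])
    return np.maximum(out, 0.0)

# Toy two-word graph; all parameter values are illustrative.
rng = np.random.default_rng(2)
m = 3
H = rng.normal(size=(m, 2))
V = {d: rng.normal(size=(m, m)) for d in ("along", "opp", "self")}
b = {lab: np.zeros(m) for lab in ("subj", "subj'", "self")}
v_hat = {d: rng.normal(size=m) for d in ("along", "opp", "self")}
b_hat = {lab: 0.0 for lab in ("subj", "subj'", "self")}
edges = [(1, 0, "along", "subj"), (0, 1, "opp", "subj'"),
         (0, 0, "self", "self"), (1, 1, "self", "self")]
H_out = gated_syntactic_gcn_layer(H, edges, V, b, v_hat, b_hat)
```

Driving a gate toward 0 effectively deletes the corresponding edge, which is how the model can discount an erroneous parser prediction.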
3.3 Complementarity of GCNs and LSTMs

The inability of GCNs to capture dependencies between nodes far away from each other in the graph may seem like a serious problem, especially in the context of SRL: paths between predicates and arguments often include many dependency arcs (Roth and Lapata, 2016). However, when graph convolution is performed on top of LSTM states (i.e., LSTM states serve as input $x_v = h^{(1)}_v$ to the GCN) rather than static word embeddings, the GCN may not need to capture more than a couple of hops.
To elaborate on this, let us speculate what role GCNs would play when used in combination with LSTMs, given that LSTMs have already been shown to be very effective for SRL (Zhou and Xu, 2015; Marcheggiani et al., 2017). Though LSTMs are capable of capturing at least some degree of syntax (Linzen et al., 2016) without explicit syntactic supervision, SRL datasets are moderately sized, so LSTM models may still struggle with harder cases. Typically, harder cases for SRL involve arguments far away from their predicates. In fact, 20% and 30% of arguments are more than 5 tokens away from their predicate in our English and Chinese collections, respectively. However, if we imagine that we can 'teleport' even over a single (longest) syntactic dependency edge, the 'distance' would shrink: only 9% and 13% of arguments would then be more than 5 LSTM steps away (again for English and Chinese, respectively). GCNs provide this 'teleportation' capability. These observations suggest that LSTMs and GCNs may be complementary, and we will see that empirical results support this intuition.

[Figure 3: Predicting an argument and its label with an LSTM + GCN encoder: a word representation feeds J BiLSTM layers, then K GCN layers over the dependency tree of "Lane disputed those estimates" (nsubj, dobj, nmod arcs), and a classifier predicts the role (e.g., A1).]
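The 'teleportation' argument can be made concrete with a toy distance computation; the token positions and the dependency edge below are invented for illustration, and both helper functions are hypothetical, not part of the model.

```python
def lstm_distance(pred: int, arg: int) -> int:
    """Sequential distance an LSTM must carry information over."""
    return abs(pred - arg)

def teleport_distance(pred: int, arg: int, dep_edges) -> int:
    """Effective distance when a single GCN layer lets information jump over
    one dependency edge for free, traversable in either direction."""
    best = lstm_distance(pred, arg)
    for h, d in dep_edges:
        for a, c in ((h, d), (d, h)):
            # walk sequentially to one endpoint, teleport, walk to the argument
            best = min(best, abs(pred - a) + abs(c - arg))
    return best

# Predicate at position 2, argument at position 9, and one long dependency
# edge connecting positions 2 and 8: seven LSTM steps shrink to one.
print(lstm_distance(2, 9))                # 7
print(teleport_distance(2, 9, [(2, 8)]))  # 1
```

This mirrors the statistic quoted above: a single long dependency edge can collapse most of the sequential path between a predicate and a distant argument.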
4 Syntax-Aware Neural SRL Encoder

In this work, we build our semantic role labeler on top of the syntax-agnostic LSTM-based SRL model of Marcheggiani et al. (2017), which already achieves state-of-the-art results on the CoNLL-2009 English dataset. Following their approach, we employ the same bidirectional LSTM (BiLSTM) encoder and enrich it with a syntactic GCN.

The CoNLL-2009 benchmark assumes that predicate positions are already marked in the test set (e.g., we would know that makes, repairs and engines in Figure 1 are predicates), so no predicate identification is needed. Also, as we focus here solely on identifying arguments and labeling them with semantic roles, for predicate disambiguation