Edinburgh Research Explorer

Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling

Citation for published version:
Marcheggiani, D & Titov, I 2017, 'Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling', in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Association for Computational Linguistics, Copenhagen, Denmark, pp. 1506–1515. https://doi.org/10.18653/v1/D17-1159

Digital Object Identifier (DOI): 10.18653/v1/D17-1159
Document Version: Peer reviewed version
Published In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)

General rights:
Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and/or other copyright owners, and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy:
The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright, please contact openaccess@ed.ac.uk providing details, and we will remove access to the work immediately and investigate your claim.
Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling

Diego Marcheggiani¹  Ivan Titov¹,²
¹ ILLC, University of Amsterdam
² ILCC, School of Informatics, University of Edinburgh
marcheggiani@uva.nl  ititov@inf.ed.ac.uk
Abstract

Semantic role labeling (SRL) is the task of identifying the predicate-argument structure of a sentence. It is typically regarded as an important step in the standard NLP pipeline. As the semantic representations are closely related to syntactic ones, we exploit syntactic information in our model. We propose a version of graph convolutional networks (GCNs), a recent class of neural networks operating on graphs, suited to model syntactic dependency graphs. GCNs over syntactic dependency trees are used as sentence encoders, producing latent feature representations of words in a sentence. We observe that GCN layers are complementary to LSTM ones: when we stack both GCN and LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009), both for Chinese and English.
1 Introduction

Semantic role labeling (SRL) (Gildea and Jurafsky, 2002) can be informally described as the task of discovering who did what to whom. For example, consider an SRL dependency graph shown above the sentence in Figure 1. Formally, the task includes (1) detection of predicates (e.g., makes); (2) labeling the predicates with a sense from a sense inventory (e.g., make.01); (3) identifying and assigning arguments to semantic roles (e.g., Sequa is A0, i.e., an agent / 'doer' for the corresponding predicate, and engines is A1, i.e., a patient / 'an affected entity'). SRL is often regarded as an important step in the standard NLP pipeline, providing information to downstream tasks such as information extraction and question answering.

[Figure 1: An example sentence, "Sequa makes and repairs jet engines.", annotated with semantic (top) and syntactic dependencies (bottom).]
The semantic representations are closely related to syntactic ones, even though the syntax-semantics interface is far from trivial (Levin, 1993). For example, one can observe that many arcs in the syntactic dependency graph (shown in black below the sentence in Figure 1) are mirrored in the semantic dependency graph. Given these similarities, and also because of the availability of accurate syntactic parsers for many languages, it seems natural to exploit syntactic information when predicting semantics. Though historically most SRL approaches did rely on syntax (Thompson et al., 2003; Pradhan et al., 2005; Punyakanok et al., 2008; Johansson and Nugues, 2008), the last generation of SRL models put syntax aside in favor of neural sequence models, namely LSTMs (Zhou and Xu, 2015; Marcheggiani et al., 2017), and outperformed syntactically-driven methods on standard benchmarks. We believe that one of the reasons for this radical choice is the lack of simple and effective methods for incorporating syntactic information into sequential neural networks (namely, at the level of words). In this paper we propose one way to address this limitation.
Specifically, we rely on graph convolutional networks (GCNs) (Duvenaud et al., 2015; Kipf and Welling, 2017; Kearnes et al., 2016), a recent class of multilayer neural networks operating on graphs. For every node in the graph (in our case a word in a sentence), the GCN encodes relevant information about its neighborhood as a real-valued feature vector. GCNs have been studied largely in the context of undirected unlabeled graphs. We introduce a version of GCNs for modeling syntactic dependency structures that is generally applicable to labeled directed graphs.
A one-layer GCN encodes only information about immediate neighbors, and K layers are needed to encode K-order neighborhoods (i.e., information about nodes at most K hops away). This contrasts with recurrent and recursive neural networks (Elman, 1990; Socher et al., 2013) which, at least in theory, can capture statistical dependencies across unbounded paths in a tree or in a sequence. However, as we will further discuss in Section 3.3, this is not a serious limitation when GCNs are used in combination with encoders based on recurrent networks (LSTMs). When we stack GCNs on top of LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009), both for English and Chinese.
Interestingly, again unlike recursive neural networks, GCNs do not constrain the graph to be a tree. We believe that there are many applications in NLP where GCN-based encoders of sentences or even documents can be used to incorporate knowledge about linguistic structures (e.g., representations of syntax, semantics or discourse). For example, GCNs can take as input combined syntactic-semantic graphs (e.g., the entire graph from Figure 1) and be used within downstream tasks such as machine translation or question answering. However, we leave this for future work and here solely focus on SRL.
The contributions of this paper can be summarized as follows:

• we are the first to show that GCNs are effective for NLP;
• we propose a generalization of GCNs suited to encoding syntactic information at the word level;
• we propose a GCN-based SRL model and obtain state-of-the-art results on the English and Chinese portions of the CoNLL-2009 dataset;
• we show that bidirectional LSTMs and syntax-based GCNs have complementary modeling power.

The code is available at https://github.com/diegma/neural-dep-srl.
2 Graph Convolutional Networks

In this section we describe the GCNs of Kipf and Welling (2017). Please refer to Gilmer et al. (2017) for a comprehensive overview of GCN versions.

GCNs are neural networks operating on graphs and inducing features of nodes (i.e., real-valued vectors / embeddings) based on properties of their neighborhoods. In Kipf and Welling (2017), they were shown to be very effective for the node classification task: the classifier was estimated jointly with a GCN, so that the induced node features were informative for the node classification problem. Depending on how many layers of convolution are used, GCNs can capture information only about immediate neighbors (with one layer of convolution) or about any nodes at most K hops away (if K layers are stacked on top of each other).
More formally, consider an undirected graph $G = (V, E)$, where $V$ (with $|V| = n$) and $E$ are sets of nodes and edges, respectively. Kipf and Welling (2017) assume that the edge set contains all self-loops, i.e., $(v, v) \in E$ for any $v$. We can define a matrix $X \in \mathbb{R}^{m \times n}$ with each of its columns $x_v \in \mathbb{R}^m$ ($v \in V$) encoding node features. The vectors can either encode genuine features (e.g., this vector can encode the title of a paper if citation graphs are considered) or be one-hot vectors. The node representation, encoding information about its immediate neighbors, is computed as

$$h_v = \mathrm{ReLU}\Big(\sum_{u \in N(v)} (W x_u + b)\Big), \qquad (1)$$

where $W \in \mathbb{R}^{m \times m}$ and $b \in \mathbb{R}^m$ are a weight matrix and a bias, respectively; $N(v)$ is the set of neighbors of $v$; ReLU is the rectified linear unit activation function.² Note that $v \in N(v)$ (because of self-loops), so the input feature representation of $v$ (i.e., $x_v$) affects its induced representation $h_v$.

² We dropped the normalization factors used in Kipf and Welling (2017), as they are not used in our syntactic GCNs.
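As a concrete sketch (not the authors' implementation), the convolution in Eq. (1) can be written in a few lines of NumPy; the toy 3-node path graph, random features, and zero bias below are choices made purely for illustration:

```python
import numpy as np

def gcn_layer(X, adj, W, b):
    """One GCN layer as in Eq. (1): for every node v,
    h_v = ReLU(sum over u in N(v) of (W x_u + b)).
    X: (m, n) node features, one column per node.
    adj: (n, n) 0/1 adjacency matrix, assumed to already contain self-loops.
    W: (m, m) weight matrix; b: (m,) bias vector.
    Returns the (m, n) matrix of induced node representations."""
    messages = W @ X + b[:, None]   # W x_u + b computed for every node u at once
    H = messages @ adj              # column v sums the messages from N(v)
    return np.maximum(H, 0.0)       # ReLU

# Toy example: 3-node path graph 0-1-2 with self-loops.
rng = np.random.default_rng(0)
n, m = 3, 4
X = rng.normal(size=(m, n))
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]], dtype=float)
W = rng.normal(size=(m, m))
b = np.zeros(m)
H = gcn_layer(X, adj, W, b)
print(H.shape)  # (4, 3)
```

Because self-loops are in `adj`, column 0 of `H` is ReLU(W(x₀ + x₁)), i.e., node 0's own features contribute to its induced representation, matching the remark after Eq. (1).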
[Figure 2: A simplified syntactic GCN applied to the sentence "Lane disputed those estimates" (bias terms and gates are omitted); the syntactic graph of the sentence (SBJ, OBJ, NMOD arcs) is shown with dashed lines at the bottom. Parameter matrices are sub-indexed with syntactic functions, and apostrophes (e.g., subj') signify that information flows in the direction opposite of the dependency arcs (i.e., from dependents to heads).]
As in standard convolutional networks (LeCun et al., 2001), by stacking GCN layers one can incorporate higher-degree neighborhoods:

$$h_v^{(k+1)} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} W^{(k)} h_u^{(k)} + b^{(k)}\Big),$$

where $k$ denotes the layer number and $h_v^{(1)} = x_v$.
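A small NumPy sketch, under illustrative assumptions (identity weight matrices, zero biases, one-hot input features), makes the K-hop receptive-field claim directly visible:

```python
import numpy as np

def stacked_gcn(X, adj, weights, biases):
    """Apply K GCN layers: h^{(k+1)}_v = ReLU(sum_{u in N(v)} W^{(k)} h^{(k)}_u + b^{(k)}),
    with h^{(1)}_v = x_v. X: (m, n) features; adj: (n, n) adjacency with self-loops;
    weights/biases: lists of per-layer parameters W^{(k)}, b^{(k)}."""
    H = X
    for W, b in zip(weights, biases):
        H = np.maximum((W @ H + b[:, None]) @ adj, 0.0)
    return H

# Path graph 0-1-2-3 with self-loops; one-hot features expose the receptive field.
n = 4
adj = np.eye(n)
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
X = np.eye(n)  # node i starts with a one-hot feature vector
H1 = stacked_gcn(X, adj, [np.eye(n)], [np.zeros(n)])
H2 = stacked_gcn(X, adj, [np.eye(n)] * 2, [np.zeros(n)] * 2)
# After one layer node 0 has only seen nodes {0, 1}; after two layers, {0, 1, 2}.
print(np.flatnonzero(H1[:, 0]))  # [0 1]
print(np.flatnonzero(H2[:, 0]))  # [0 1 2]
```

With identity weights the k-layer output reduces to the k-th power of the adjacency matrix, so a nonzero entry in node 0's column means exactly "reachable in at most k hops".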
3 Syntactic GCNs

As syntactic dependency trees are directed and labeled (we refer to the dependency labels as syntactic functions), we first need to modify the computation in order to incorporate label information (Section 3.1). In the subsequent section, we incorporate gates in GCNs, so that the model can decide which edges are more relevant to the task in question. Having gates is also important as we rely on automatically predicted syntactic representations, and the gates can detect and downweight potentially erroneous edges.
3.1 Incorporating directions and labels

Now, we introduce a generalization of GCNs appropriate for syntactic dependency trees, and in general, for directed labeled graphs. First note that there is no reason to assume that information flows only along the syntactic dependency arcs (e.g., from makes to Sequa), so we allow it to flow in the opposite direction as well (i.e., from dependents to heads). We use a graph $G = (V, E)$, where the edge set contains all pairs of nodes (i.e., words) adjacent in the dependency tree. In our example, both (Sequa, makes) and (makes, Sequa) belong to the edge set. The graph is labeled, and the label $L(u, v)$ for $(u, v) \in E$ contains both information about the syntactic function and an indication of whether the edge is in the same or opposite direction as the syntactic dependency arc. For example, the label for (makes, Sequa) is subj, whereas the label for (Sequa, makes) is subj', with the apostrophe indicating that the edge is in the direction opposite to the corresponding syntactic arc. Similarly, self-loops have the label self. Consequently, we can simply assume that the GCN parameters are label-specific, resulting in the following computation, also illustrated in Figure 2:

$$h_v^{(k+1)} = \mathrm{ReLU}\Big(\sum_{u \in N(v)} W^{(k)}_{L(u,v)} h_u^{(k)} + b^{(k)}_{L(u,v)}\Big).$$
This model is over-parameterized,³ especially given that SRL datasets are moderately sized by deep learning standards. So instead of learning the GCN parameters directly, we define them as

$$W^{(k)}_{L(u,v)} = V^{(k)}_{\mathrm{dir}(u,v)}, \qquad (2)$$

where $\mathrm{dir}(u, v)$ indicates whether the edge $(u, v)$ is (1) directed along the syntactic dependency arc, (2) directed in the opposite direction, or (3) a self-loop; $V^{(k)}_{\mathrm{dir}(u,v)} \in \mathbb{R}^{m \times m}$. Our simplification captures the intuition that information should be propagated differently along edges depending on whether this is a head-to-dependent or dependent-to-head edge (i.e., along or opposite the corresponding syntactic arc) and whether it is a self-loop. So we do not share any parameters between these three very different edge types. Syntactic functions are important, but perhaps less crucial, so they are encoded only in the feature vectors $b^{(k)}_{L(u,v)}$.
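A minimal sketch of this simplified layer: weight matrices are keyed only by edge direction, while the full label selects the bias. The (u, v, direction, label) edge-list encoding and the toy two-word graph are hypothetical choices for this illustration, not the paper's implementation.

```python
import numpy as np

def syntactic_gcn_layer(H, edges, V, b):
    """One simplified syntactic GCN layer (Eq. 2): for each edge (u, v),
    node v receives V[dir] @ h_u + b[label]; a ReLU is applied to the sum.
    H: (m, n) node features; edges: list of (u, v, direction, label);
    V: dict direction -> (m, m) matrix; b: dict label -> (m,) bias."""
    out = np.zeros_like(H)
    for u, v, d, lab in edges:
        out[:, v] += V[d] @ H[:, u] + b[lab]
    return np.maximum(out, 0.0)

# Toy two-word graph for "Sequa makes": word 0 = Sequa, word 1 = makes.
rng = np.random.default_rng(1)
m = 3
H = rng.normal(size=(m, 2))
V = {d: rng.normal(size=(m, m)) for d in ("along", "opp", "self")}
b = {lab: np.zeros(m) for lab in ("subj", "subj'", "self")}
edges = [(1, 0, "along", "subj"),   # head 'makes' -> dependent 'Sequa'
         (0, 1, "opp", "subj'"),    # dependent -> head, opposite the arc
         (0, 0, "self", "self"),
         (1, 1, "self", "self")]
H_out = syntactic_gcn_layer(H, edges, V, b)
```

Note that only three matrices exist per layer regardless of the number of syntactic functions, which is exactly the parameter reduction Eq. (2) is after.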
3.2 Edge-wise gating

Uniformly accepting information from all neighboring nodes may not be appropriate for the SRL setting. For example, we see in Figure 1 that many semantic arcs just mirror their syntactic counterparts, so they may need to be up-weighted. Moreover, we rely on automatically predicted syntactic structures, and, even for English, syntactic parsers are far from being perfect, especially when used out-of-domain. It is risky for a downstream application to rely on a potentially wrong syntactic edge, so the corresponding message in the neural network may need to be down-weighted.

³ The Chinese and English CoNLL-2009 datasets used 41 and 48 different syntactic functions, which would result in having 83 and 97 different matrices in every layer, respectively.

In order to address the above issues, inspired by recent literature (van den Oord et al., 2016; Dauphin et al., 2016), we calculate for each edge node pair a scalar gate of the form
$$g^{(k)}_{u,v} = \sigma\big(h^{(k)}_u \cdot \hat{v}^{(k)}_{\mathrm{dir}(u,v)} + \hat{b}^{(k)}_{L(u,v)}\big), \qquad (3)$$

where $\sigma$ is the logistic sigmoid function, and $\hat{v}^{(k)}_{\mathrm{dir}(u,v)} \in \mathbb{R}^m$ and $\hat{b}^{(k)}_{L(u,v)} \in \mathbb{R}$ are a weight vector and a bias for the gate. With this additional gating mechanism, the final syntactic GCN computation is formulated as

$$h^{(k+1)}_v = \mathrm{ReLU}\Big(\sum_{u \in N(v)} g^{(k)}_{u,v} \big(V^{(k)}_{\mathrm{dir}(u,v)} h^{(k)}_u + b^{(k)}_{L(u,v)}\big)\Big). \qquad (4)$$
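The gated computation in Eqs. (3)-(4) can be sketched similarly; again, the (u, v, direction, label) edge encoding and all parameter values are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_syntactic_gcn_layer(H, edges, V, b, v_hat, b_hat):
    """Edge-wise gated syntactic GCN layer (Eqs. 3-4): each message is scaled
    by a scalar gate g = sigmoid(h_u . v_hat[dir] + b_hat[label]) before summing.
    H: (m, n) node features; edges: list of (u, v, direction, label);
    V: dict direction -> (m, m); b: dict label -> (m,);
    v_hat: dict direction -> (m,) gate weights; b_hat: dict label -> scalar."""
    out = np.zeros_like(H)
    for u, v, d, lab in edges:
        g = sigmoid(H[:, u] @ v_hat[d] + b_hat[lab])   # scalar in (0, 1)
        out[:, v] += g * (V[d] @ H[:, u] + b[lab])
    return np.maximum(out, 0.0)

# Toy two-word graph; all parameter values are illustrative.
rng = np.random.default_rng(2)
m = 3
H = rng.normal(size=(m, 2))
V = {d: rng.normal(size=(m, m)) for d in ("along", "opp", "self")}
b = {lab: np.zeros(m) for lab in ("subj", "subj'", "self")}
v_hat = {d: rng.normal(size=m) for d in ("along", "opp", "self")}
b_hat = {lab: 0.0 for lab in ("subj", "subj'", "self")}
edges = [(1, 0, "along", "subj"), (0, 1, "opp", "subj'"),
         (0, 0, "self", "self"), (1, 1, "self", "self")]
H_out = gated_syntactic_gcn_layer(H, edges, V, b, v_hat, b_hat)
```

Driving a gate toward 0 effectively deletes the corresponding edge, which is how the model can discount an erroneous parser prediction.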
3.3 Complementarity of GCNs and LSTMs

The inability of GCNs to capture dependencies between nodes far away from each other in the graph may seem like a serious problem, especially in the context of SRL: paths between predicates and arguments often include many dependency arcs (Roth and Lapata, 2016). However, when graph convolution is performed on top of LSTM states (i.e., LSTM states serve as input $x_v = h^{(1)}_v$ to the GCN) rather than static word embeddings, the GCN may not need to capture more than a couple of hops.
To elaborate on this, let us speculate what role GCNs would play when used in combination with LSTMs, given that LSTMs have already been shown to be very effective for SRL (Zhou and Xu, 2015; Marcheggiani et al., 2017). Though LSTMs are capable of capturing at least some degree of syntax (Linzen et al., 2016) without explicit syntactic supervision, SRL datasets are moderately sized, so LSTM models may still struggle with harder cases. Typically, harder cases for SRL involve arguments far away from their predicates. In fact, 20% and 30% of arguments are more than 5 tokens away from their predicate in our English and Chinese collections, respectively. However, if we imagine that we can 'teleport' even over a single (longest) syntactic dependency edge, the 'distance' would shrink: only 9% and 13% of arguments would then be more than 5 LSTM steps away (again for English and Chinese, respectively). GCNs provide this 'teleportation' capability. These observations suggest that LSTMs and GCNs may be complementary, and we will see that empirical results support this intuition.

[Figure 3: Predicting an argument and its label with an LSTM + GCN encoder: a word representation feeds J BiLSTM layers, then K GCN layers over the dependency tree of "Lane disputed those estimates" (nsubj, dobj, nmod arcs), and a classifier predicts the role (e.g., A1).]
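The 'teleportation' argument can be made concrete with a toy distance computation; the token positions and the dependency edge below are invented for illustration, and both helper functions are hypothetical, not part of the model.

```python
def lstm_distance(pred: int, arg: int) -> int:
    """Sequential distance an LSTM must carry information over."""
    return abs(pred - arg)

def teleport_distance(pred: int, arg: int, dep_edges) -> int:
    """Effective distance when a single GCN layer lets information jump over
    one dependency edge for free, traversable in either direction."""
    best = lstm_distance(pred, arg)
    for h, d in dep_edges:
        for a, c in ((h, d), (d, h)):
            # walk sequentially to one endpoint, teleport, walk to the argument
            best = min(best, abs(pred - a) + abs(c - arg))
    return best

# Predicate at position 2, argument at position 9, and one long dependency
# edge connecting positions 2 and 8: seven LSTM steps shrink to one.
print(lstm_distance(2, 9))                # 7
print(teleport_distance(2, 9, [(2, 8)]))  # 1
```

This mirrors the statistic quoted above: a single long dependency edge can collapse most of the sequential path between a predicate and a distant argument.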
4 Syntax-Aware Neural SRL Encoder

In this work, we build our semantic role labeler on top of the syntax-agnostic LSTM-based SRL model of Marcheggiani et al. (2017), which already achieves state-of-the-art results on the CoNLL-2009 English dataset. Following their approach, we employ the same bidirectional LSTM (BiLSTM) encoder and enrich it with a syntactic GCN.

The CoNLL-2009 benchmark assumes that predicate positions are already marked in the test set (e.g., we would know that makes, repairs and engines in Figure 1 are predicates), so no predicate identification is needed. Also, as we focus here solely on identifying arguments and labeling them with semantic roles, for predicate disambiguation