Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1249–1258, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics
What Do Recurrent Neural Network Grammars Learn About Syntax?
Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, Noah A. Smith
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
DeepMind, London, UK
Computer Science & Engineering, University of Washington, Seattle, WA, USA
{akuncoro,lingpenk,gneubig}@cs.cmu.edu
miguel.ballesteros@ibm.com, cdyer@google.com, nasmith@cs.washington.edu
Abstract

Recurrent neural network grammars (RNNG) are a recently proposed probabilistic generative modeling family for natural language. They show state-of-the-art language modeling and parsing performance. We investigate what information they learn, from a linguistic perspective, through various ablations to the model and the data, and by augmenting the model with an attention mechanism (GA-RNNG) to enable closer inspection. We find that explicit modeling of composition is crucial for achieving the best performance. Through the attention mechanism, we find that headedness plays a central role in phrasal representation (with the model's latent attention largely agreeing with predictions made by hand-crafted head rules, albeit with some important differences). By training grammars without nonterminal labels, we find that phrasal representations depend minimally on nonterminals, providing support for the endocentricity hypothesis.
1 Introduction
In this paper, we focus on a recently proposed class of probability distributions, recurrent neural network grammars (RNNGs; Dyer et al., 2016), designed to model syntactic derivations of sentences. We focus on RNNGs as generative probabilistic models over trees, as summarized in §2.

Fitting a probabilistic model to data has often been understood as a way to test or confirm some aspect of a theory. We talk about a model's assumptions and sometimes explore its parameters or posteriors over its latent variables in order to gain understanding of what it "discovers" from the data. In some sense, such models can be thought of as mini-scientists.
Neural networks, including RNNGs, are capable of representing larger classes of hypotheses than traditional probabilistic models, giving them more freedom to explore. Unfortunately, they tend to be bad mini-scientists, because their parameters are difficult for human scientists to interpret.

RNNGs are striking because they obtain state-of-the-art parsing and language modeling performance. Their relative lack of independence assumptions, while still incorporating a degree of linguistically-motivated prior knowledge, affords the model considerable freedom to derive its own insights about syntax. If they are mini-scientists, the discoveries they make should be of particular interest as propositions about syntax (at least for the particular genre and dialect of the data).
This paper manipulates the inductive bias of RNNGs to test linguistic hypotheses.[1] We begin with an ablation study to discover the importance of the composition function in §3. Based on the findings, we augment the RNNG composition function with a novel gated attention mechanism (leading to the GA-RNNG) to incorporate more interpretability into the model in §4. Using the GA-RNNG, we proceed by investigating the role that individual heads play in phrasal representation (§5) and the role that nonterminal category labels play (§6). Our key findings are that lexical heads play an important role in representing most phrase types (although compositions of multiple salient heads are not infrequent, especially for conjunctions) and that nonterminal labels provide little additional information. As a by-product of our investigation, a variant of the RNNG without ensembling achieved the best reported supervised phrase-structure parsing (93.6 F1; English PTB) and, through conversion, dependency parsing (95.8 UAS, 94.6 LAS; PTB SD). The code and pretrained models to replicate our results are publicly available.[2]

[1] RNNGs have less inductive bias relative to traditional unlexicalized probabilistic context-free grammars, but more than models that parse by transducing word sequences to linearized parse trees represented as strings (Vinyals et al., 2015). Inductive bias is necessary for learning (Mitchell, 1980); we believe the important question is not "how little can a model get away with?" but rather the benefit of different forms of inductive bias as data vary.

[2] https://github.com/clab/rnng/tree/master/interpreting-rnng
2 Recurrent Neural Network Grammars
An RNNG defines a joint probability distribution over string terminals and phrase-structure nonterminals.[3] Formally, the RNNG is defined by a triple ⟨N, Σ, Θ⟩, where N denotes the set of nonterminal symbols (NP, VP, etc.), Σ the set of all terminal symbols (we assume that N ∩ Σ = ∅), and Θ the set of all model parameters. Unlike previous works that rely on hand-crafted rules to compose more fine-grained phrase representations (Collins, 1997; Klein and Manning, 2003), the RNNG implicitly parameterizes the information passed through compositions of phrases (in Θ and the neural network architecture), hence weakening the strong independence assumptions in classical probabilistic context-free grammars.

[3] Dyer et al. (2016) also defined a conditional version of the RNNG that can be used only for parsing; here we focus on the generative version since it is more flexible and (rather surprisingly) even learns better estimates of p(y | x).

The RNNG is based on an abstract state machine like those used in transition-based parsing, with its algorithmic state consisting of a stack of partially completed constituents, a buffer of already-generated terminal symbols, and a list of past actions. To generate a sentence x and its phrase-structure tree y, the RNNG samples a sequence of actions to construct y top-down. Given y, there is one such sequence (easily identified), which we call the oracle, a = ⟨a_1, ..., a_n⟩, used during supervised training.
The RNNG uses three different actions: NT(X), where X ∈ N, introduces an open nonterminal symbol onto the stack, e.g., "(NP"; GEN(x), where x ∈ Σ, generates a terminal symbol and places it on the stack and buffer; and REDUCE indicates a constituent is now complete. The elements of the stack that comprise the current constituent (going back to the last open nonterminal) are popped, a composition function is executed, yielding a composed representation that is pushed onto the stack.

[Figure 1: The RNNG consists of a stack, buffer of generated words, and list of past actions that lead to the current configuration. Each component is embedded with LSTMs, and the parser state summary u_t is used as top-layer features to predict a softmax over all feasible actions. This figure is due to Dyer et al. (2016).]

At each timestep, the model encodes the stack, buffer, and past actions, with a separate LSTM (Hochreiter and Schmidhuber, 1997) for each component as features to define a distribution over the next action to take (conditioned on the full algorithmic state). The overall architecture is illustrated in Figure 1; examples of full action sequences can be found in Dyer et al. (2016).
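To make the state machine concrete, the following minimal sketch (ours, not the authors' released code) executes an oracle-style action sequence over a symbolic stack and buffer; it tracks structure only and omits the LSTM encoders and the probability model. The action sequence extends Figure 1's partial derivation of "The hungry cat" with a hypothetical verb.

```python
# Minimal symbolic RNNG transition system: NT(X), GEN(x), REDUCE.
# This sketch tracks only the algorithmic state (stack, buffer, action
# history); the real model additionally embeds each component with LSTMs.

OPEN = object()  # marker paired with a label for an open nonterminal

def execute(actions):
    stack, buffer, history = [], [], []
    for act in actions:
        kind = act[0]
        if kind == "NT":                      # push an open nonterminal, e.g. "(NP"
            stack.append((OPEN, act[1]))
        elif kind == "GEN":                   # generate a terminal onto stack and buffer
            stack.append(act[1])
            buffer.append(act[1])
        elif kind == "REDUCE":                # pop back to the last open nonterminal
            children = []
            while not (isinstance(stack[-1], tuple) and stack[-1][0] is OPEN):
                children.append(stack.pop())
            label = stack.pop()[1]
            # the real model runs the composition function here; we just build a tree
            stack.append((label, list(reversed(children))))
        history.append(act)
    return stack, buffer, history

# Hypothetical completion of Figure 1's derivation of "The hungry cat".
oracle = [("NT", "S"), ("NT", "NP"), ("GEN", "The"), ("GEN", "hungry"),
          ("GEN", "cat"), ("REDUCE",), ("NT", "VP"), ("GEN", "meows"),
          ("REDUCE",), ("REDUCE",)]
stack, buffer, _ = execute(oracle)
print(stack)   # [('S', [('NP', ['The', 'hungry', 'cat']), ('VP', ['meows'])])]
print(buffer)  # ['The', 'hungry', 'cat', 'meows']
```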
A key element of the RNNG is the composition function, which reduces a completed constituent into a single element on the stack. This function computes a vector representation of the new constituent; it also uses an LSTM (here a bidirectional one). This composition function, which we consider in greater depth in §3, is illustrated in Fig. 2.

[Figure 2: RNNG composition function on each REDUCE operation; the network on the right models the structure on the left (Dyer et al., 2016).]
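As a rough illustration of this composition step, the sketch below runs a bidirectional LSTM over the nonterminal's embedding followed by the children's vectors and projects the two final states down to a single constituent vector. The layer sizes, the use of PyTorch, and the exact way the nonterminal label is fed in are our assumptions for illustration, not details taken from the released implementation.

```python
import torch
import torch.nn as nn

class BiLSTMComposition(nn.Module):
    """Sketch of an RNNG-style composition function: children -> one vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, nt_emb: torch.Tensor, children: torch.Tensor) -> torch.Tensor:
        # nt_emb: (dim,) embedding of the open nonterminal being closed
        # children: (num_children, dim) vectors popped from the stack
        seq = torch.cat([nt_emb.unsqueeze(0), children], dim=0).unsqueeze(0)
        out, _ = self.bilstm(seq)                       # (1, len, 2*dim)
        fwd_last = out[0, -1, :out.size(-1) // 2]       # last step, forward direction
        bwd_first = out[0, 0, out.size(-1) // 2:]       # first step, backward direction
        return torch.tanh(self.proj(torch.cat([fwd_last, bwd_first])))

comp = BiLSTMComposition(dim=128)
composed = comp(torch.randn(128), torch.randn(3, 128))  # e.g. an NP with 3 children
print(composed.shape)  # torch.Size([128])
```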
Since the RNNG is a generative model, it attempts to maximize p(x, y), the joint distribution of strings and trees, defined as

    p(x, y) = p(a) = \prod_{t=1}^{n} p(a_t \mid a_1, \ldots, a_{t-1}).

In other words, p(x, y) is defined as a product of local probabilities, conditioned on all past actions. The joint probability estimate p(x, y) can be used for both phrase-structure parsing (finding arg max_y p(y | x)) and language modeling (finding p(x) by marginalizing over the set of possible parses for x). Both inference problems can be solved using an importance sampling procedure.[4] We report all RNNG performance based on the corrigendum to Dyer et al. (2016).

[4] Importance sampling works by using a proposal distribution q(y | x) that is easy to sample from. In Dyer et al. (2016) and this paper, the proposal distribution is the discriminative variant of the RNNG; see Dyer et al. (2016).
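Footnote 4's importance-sampling procedure can be sketched in a few lines: sample trees from the discriminative proposal q(y | x), weight each sample by p(x, y)/q(y | x), and use the weights both to estimate the marginal p(x) and to pick the highest-scoring parse. The function names and the interfaces of the two models below are hypothetical stand-ins, not the released API.

```python
import math

def importance_sample(x, proposal, generative, num_samples=100):
    """Estimate log p(x) and select a parse by importance sampling.

    `proposal.sample(x)` is assumed to return (tree, log q(y | x)) and
    `generative.log_joint(x, tree)` to return log p(x, y); both interfaces
    are hypothetical stand-ins for the discriminative and generative RNNGs.
    """
    samples = []
    for _ in range(num_samples):
        tree, log_q = proposal.sample(x)              # y ~ q(y | x)
        log_p = generative.log_joint(x, tree)         # log p(x, y)
        samples.append((tree, log_p, log_p - log_q))  # importance weight in log space

    # log p(x) ~= log of (1/K) * sum_k p(x, y_k) / q(y_k | x), via log-sum-exp
    log_weights = [w for _, _, w in samples]
    m = max(log_weights)
    log_marginal = m + math.log(sum(math.exp(w - m) for w in log_weights) / len(samples))

    # arg max_y p(y | x) is approximated by the sampled tree with the highest p(x, y)
    best_tree = max(samples, key=lambda s: s[1])[0]
    return log_marginal, best_tree
```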
3 Composition is Key
Given the same data, under both the discriminative and generative settings RNNGs were found to parse with significantly higher accuracy than (respectively) the models of Vinyals et al. (2015) and Choe and Charniak (2016) that represent y as a "linearized" sequence of symbols and parentheses without explicitly capturing the tree structure, or even constraining y to be a well-formed tree (see Table 1). Vinyals et al. (2015) directly predict the sequence of nonterminals, "shifts" (which consume a terminal symbol), and parentheses from left to right, conditional on the input terminal sequence x, while Choe and Charniak (2016) used a sequential LSTM language model on the same linearized trees to create a generative variant of the Vinyals et al. (2015) model. The generative model is used to re-rank parse candidates.
Model                                   F1
Vinyals et al. (2015), PTB only         88.3
Discriminative RNNG                     91.2
Choe and Charniak (2016), PTB only      92.6
Generative RNNG                         93.3

Table 1: Phrase-structure parsing performance on PTB §23. All results are reported using single-model performance and without any additional data.
The results in Table 1 suggest that the RNNG's explicit composition function (Fig. 2), which Vinyals et al. (2015) and Choe and Charniak (2016) must learn implicitly, plays a crucial role in the RNNG's generalization success. Beyond this, Choe and Charniak's generative variant of Vinyals et al. (2015) is another instance where generative models trained on the PTB outperform discriminative models.
3.1 Ablated RNNGs

On close inspection, it is clear that the RNNG's three data structures—stack, buffer, and action history—are redundant. For example, the action history and buffer contents completely determine the structure of the stack at every timestep. Every generated word goes onto the stack, too; and some past words will be composed into larger structures, but through the composition function, they are all still "available" to the network that predicts the next action. Similarly, the past actions are redundant with the stack. Despite this redundancy, only the stack incorporates the composition function. Since each of the ablated models is sufficient to encode all necessary partial tree information, the primary difference is that ablations with the stack use explicit composition, to which we can therefore attribute most of the performance difference.

We conjecture that the stack—the component that makes use of the composition function—is critical to the RNNG's performance, and that the buffer and action history are not. In transition-based parsers built on expert-crafted features, the most recent words and actions are useful if they are salient, although neural representation learners can automatically learn what information should be salient.

To test this conjecture, we train ablated RNNGs that lack each of the three data structures (action history, buffer, stack), as well as one that lacks both the action history and buffer.[5] If our conjecture is correct, performance should degrade most without the stack, and the stack alone should perform competitively.

[5] Note that the ablated RNNG without a stack is quite similar to Vinyals et al. (2015), who encoded a (partial) phrase-structure tree as a sequence of open and close parentheses, terminals, and nonterminal symbols; our action history is quite close to this, with each NT(X) capturing a left parenthesis and X nonterminal, and each REDUCE capturing a right parenthesis.
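To make the ablations concrete, the following sketch (our illustration, not the released code) shows how the parser state summary u_t might be assembled from per-component LSTM encodings, with an ablation simply dropping one or more components from the concatenation before the action softmax.

```python
import torch
import torch.nn as nn

class AblatableStateSummary(nn.Module):
    """Sketch: concatenate encodings of the RNNG's data structures, minus ablated ones."""

    def __init__(self, dim: int, num_actions: int,
                 use_stack=True, use_buffer=True, use_history=True):
        super().__init__()
        self.use = {"stack": use_stack, "buffer": use_buffer, "history": use_history}
        n_components = sum(self.use.values())
        self.action_scores = nn.Linear(n_components * dim, num_actions)

    def forward(self, stack_enc, buffer_enc, history_enc):
        # Each *_enc is a (dim,) vector produced by that component's LSTM.
        parts = []
        if self.use["stack"]:
            parts.append(stack_enc)      # the only component built with explicit composition
        if self.use["buffer"]:
            parts.append(buffer_enc)
        if self.use["history"]:
            parts.append(history_enc)
        u_t = torch.cat(parts)           # parser state summary
        return torch.log_softmax(self.action_scores(u_t), dim=-1)

# "Stack-only" ablation: drop the buffer and the action history.
stack_only = AblatableStateSummary(dim=128, num_actions=100,
                                   use_buffer=False, use_history=False)
log_probs = stack_only(torch.randn(128), torch.randn(128), torch.randn(128))
```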
Experimental settings. We perform our experiments on the English PTB corpus, with §02–21 for training, §24 for validation, and §23 for test; no additional data were used for training. We follow the same hyperparameters as the generative model proposed in Dyer et al. (2016).[6] The generative model did not use any pretrained word embeddings or POS tags; a discriminative variant of the standard RNNG was used to obtain tree samples for the generative model. All further experiments use the same settings and hyperparameters unless otherwise noted.

[6] The model is trained using stochastic gradient descent, with a learning rate of 0.1 and a per-epoch decay of 0.08. All experiments with the generative RNNG used 100 tree samples for each sentence, obtained by sampling from the local softmax distribution of the discriminative RNNG.
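For readers reproducing this setup, footnote 6's optimizer settings can be written down as a small configuration; the decay formula shown (dividing the base rate by 1 + decay × epoch, as in DyNet-style SGD trainers) is our assumption about what "per-epoch decay" means here, not a detail stated in the paper.

```python
# Hyperparameters reported in footnote 6; the schedule form is an assumption.
BASE_LEARNING_RATE = 0.1
PER_EPOCH_DECAY = 0.08
TREE_SAMPLES_PER_SENTENCE = 100  # proposals drawn from the discriminative RNNG

def learning_rate(epoch: int) -> float:
    """Assumed DyNet-style decay: eta_0 / (1 + decay * epoch)."""
    return BASE_LEARNING_RATE / (1.0 + PER_EPOCH_DECAY * epoch)

print([round(learning_rate(e), 4) for e in range(5)])
# [0.1, 0.0926, 0.0862, 0.0806, 0.0758]
```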
Experimental results. We trained each ablation from scratch, and compared these models on three tasks: English phrase-structure parsing (labeled F1), Table 2; dependency parsing, Table 3, by converting parse output to Stanford dependencies (De Marneffe et al., 2006) using the tool by Kong and Smith (2014); and language modeling, Table 4. The last row of each table reports the performance of a novel variant of the (stack-only) RNNG with attention, to be presented in §4.
Model                            F1
Vinyals et al. (2015)†           92.1
Choe and Charniak (2016)         92.6
Choe and Charniak (2016)†        93.8
Baseline RNNG                    93.3
Ablated RNNG (no history)        93.2
Ablated RNNG (no buffer)         93.3
Ablated RNNG (no stack)          92.5
Stack-only RNNG                  93.6
GA-RNNG                          93.5

Table 2: Phrase-structure parsing performance on PTB §23. † indicates systems that use additional unparsed data (semisupervised). The GA-RNNG results will be discussed in §4.
Discussion. The RNNG with only a stack is the strongest of the ablations, and it even outperforms the "full" RNNG with all three data structures. Ablating the stack gives the worst among the new results. This strongly supports the importance of the composition function: a proper REDUCE operation that transforms a constituent's parts and nonterminal label into a single explicit (vector) representation is helpful to performance.

It is noteworthy that the stack alone is stronger than the original RNNG, which—in principle—can learn to disregard the buffer and action history. Since the stack maintains syntactically "recent" information near its top, we conjecture that the learner is overfitting to spurious predictors in the buffer and action history that explain the training data but do not generalize well.

Model                            UAS    LAS
Kiperwasser and Goldberg (2016)  93.9   91.9
Andor et al. (2016)              94.6   92.8
Dozat and Manning (2016)         95.4   93.8
Choe and Charniak (2016)†        95.9   94.1
Baseline RNNG                    95.6   94.4
Ablated RNNG (no history)        95.4   94.2
Ablated RNNG (no buffer)         95.6   94.4
Ablated RNNG (no stack)          95.1   93.8
Stack-only RNNG                  95.8   94.6
GA-RNNG                          95.7   94.5

Table 3: Dependency parsing performance on PTB §23 with Stanford Dependencies (De Marneffe and Manning, 2008). † indicates systems that use additional unparsed data (semisupervised).

A similar performance degradation is seen in language modeling (Table 4): the stack-only RNNG achieves the best performance, and ablating the stack is most harmful. Indeed, modeling syntax without explicit composition (the stack-ablated RNNG) provides little benefit over a sequential LSTM language model.
Model                        Test ppl. (PTB)
IKN 5-gram                   169.3
LSTM LM                      113.4
RNNG                         105.2
Ablated RNNG (no history)    105.7
Ablated RNNG (no buffer)     106.1
Ablated RNNG (no stack)      113.1
Stack-only RNNG              101.2
GA-RNNG                      100.9

Table 4: Language modeling: perplexity. IKN refers to Kneser-Ney 5-gram LM.
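Since the RNNG's p(x) is only available through the importance-sampling estimate sketched in §2, Table 4's perplexities are, at a high level, computed from per-sentence log-marginals. The sketch below shows that standard computation (corpus perplexity as the exponentiated negative average per-token log-likelihood); the `log_marginal` callable is a hypothetical stand-in, and token-counting conventions (e.g., end-of-sentence symbols) are left out.

```python
import math

def corpus_perplexity(sentences, log_marginal):
    """Corpus-level perplexity: exp of the negative average per-token log p(x).

    `log_marginal(x)` is a hypothetical callable returning an estimate of
    log p(x), e.g. the importance-sampling estimate sketched in Section 2.
    """
    total_log_prob = 0.0
    total_tokens = 0
    for x in sentences:
        total_log_prob += log_marginal(x)
        total_tokens += len(x)          # assumes x is a list of tokens
    return math.exp(-total_log_prob / total_tokens)
```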
We remark that the stack-only results are the best published PTB results for both phrase-structure and dependency parsing among supervised models.
4 Gated Attention RNNG
Having established that the composition function is key to RNNG performance (§3), we now seek to understand the nature of the composed phrasal representations that are learned. Like most neural networks, interpreting the composition function's behavior is challenging. Fortunately, linguistic theories offer a number of hypotheses about the nature of representations of phrases that can provide a conceptual scaffolding to understand them.
4.1 Linguistic Hypotheses
We consider two theories about phrasal representation. The first is that phrasal representations are strongly determined by a privileged lexical head. Augmenting grammars with lexical head information has a long history in parsing, starting with the models of Collins (1997), and theories of syntax such as the "bare phrase structure" hypothesis of the Minimalist Program (Chomsky, 1993) posit that phrases are represented purely by single lexical heads. Proposals for multiple-headed phrases (to deal with tricky cases like conjunction) likewise exist (Jackendoff, 1977; Keenan, 1987). Do the phrasal representations learned by RNNGs depend on individual lexical heads or multiple heads? Or do the representations combine all children without any salient head?

Related to the question about the role of heads in phrasal representation is the question of whether phrase-internal material wholly determines the representation of a phrase (an endocentric representation) or whether nonterminal relabeling of a constituent introduces new information (exocentric representations). To illustrate the contrast, an endocentric representation is representing a noun phrase with a noun category, whereas S → NP VP exocentrically introduces a new syntactic category that is neither NP nor VP (Chomsky, 1970).
4.2 Gated Attention Composition
To investigate what the stack-only RNNG learns about headedness (and later endocentricity), we propose a variant of the composition function that makes use of an explicit attention mechanism (Bahdanau et al., 2015) and a sigmoid gate with multiplicative interactions, henceforth called GA-RNNG.

At every REDUCE operation, the GA-RNNG assigns an "attention weight" to each of its children (between 0 and 1, such that the total weight of all children sums to 1), and the parent phrase is represented by the combination of a sum of each child's representation scaled by its attention weight and its nonterminal type. Our weighted sum is more expressive than traditional head rules, however, because it allows attention to be divided among multiple constituents. Head rules, conversely, are analogous to giving all attention to one constituent, the one containing the lexical head.
We now formally define the GA-RNNG's composition function. Recall that u_t is the concatenation of the vector representations of the RNNG's data structures, used to assign probabilities to each of the actions available at timestep t (see Fig. 1, the layer before the softmax at the top). For simplicity, we drop the timestep index here. Let o_nt denote the vector embedding (learned) of the nonterminal being constructed, for the purpose of computing attention weights.

Now let c_1, c_2, ... denote the sequence of vector embeddings for the constituents of the new phrase. The length of these vectors is defined by the dimensionality of the bidirectional LSTM used in the original composition function (Fig. 2). We use semicolon (;) to denote vector concatenation operations.
The attention vector is given by:

    a = \mathrm{softmax}\left( [c_1\ c_2\ \cdots]^{\top} V [u; o_{nt}] \right)    (1)

Note that the length of a is the same as the number of constituents, and that this vector sums to one due to the softmax. It divides a single unit of attention among the constituents.
Next, note that the constituent source vector m = [c_1; c_2; \cdots]\, a is a convex combination of the child-constituents, weighted by attention. We will combine this with another embedding of the nonterminal denoted as t_{nt} (separate from o_{nt}) using a sigmoid gating mechanism:

    g = \sigma(W_1 t_{nt} + W_2 m + b)    (2)

Note that the value of the gate is bounded between [0, 1] in each dimension.

The new phrase's final representation uses element-wise multiplication (\odot) with respect to both t_{nt} and m, a process reminiscent of the LSTM "forget" gate:

    c = g \odot t_{nt} + (1 - g) \odot m.    (3)
The intuition is that the composed representation should incorporate both nonterminal information and information about the constituents (through weighted sum and attention mechanism). The gate g modulates the interaction between them to account for varying importance between the two in different contexts.
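The three equations can be read as a small module: children are scored against [u; o_nt] to obtain attention weights, averaged into m, and then gated against a second nonterminal embedding t_nt. The sketch below is a minimal rendering of Eqs. 1–3 under assumed dimensions; variable names mirror the paper's symbols, but the PyTorch packaging and layer sizes are ours, not the released implementation.

```python
import torch
import torch.nn as nn

class GatedAttentionComposition(nn.Module):
    """Sketch of the GA-RNNG composition function (Eqs. 1-3)."""

    def __init__(self, dim: int, state_dim: int, num_nt: int):
        super().__init__()
        self.o_nt = nn.Embedding(num_nt, dim)        # nonterminal embedding for attention
        self.t_nt = nn.Embedding(num_nt, dim)        # separate nonterminal embedding for the gate
        self.V = nn.Linear(state_dim + dim, dim, bias=False)  # maps [u; o_nt] into child space
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim)                # this layer's bias plays the role of b

    def forward(self, children: torch.Tensor, u: torch.Tensor, nt: torch.Tensor) -> torch.Tensor:
        # children: (num_children, dim); u: (state_dim,); nt: scalar nonterminal id
        o, t = self.o_nt(nt), self.t_nt(nt)
        a = torch.softmax(children @ self.V(torch.cat([u, o])), dim=0)  # Eq. 1: attention over children
        m = children.t() @ a                                            # convex combination of children
        g = torch.sigmoid(self.W1(t) + self.W2(m))                      # Eq. 2: gate in [0, 1]^dim
        return g * t + (1.0 - g) * m                                    # Eq. 3: composed representation

comp = GatedAttentionComposition(dim=128, state_dim=256, num_nt=30)
c = comp(torch.randn(4, 128), torch.randn(256), torch.tensor(3))
print(c.shape)  # torch.Size([128])
```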

References
Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR.

Klein, D. and Manning, C. D. (2003). Accurate Unlexicalized Parsing. In Proceedings of ACL.

Chomsky, N. (1993). A Minimalist Program for Linguistic Theory. MIT Press.