Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1249–1258, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics
What Do Recurrent Neural Network Grammars Learn About Syntax?
Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, Noah A. Smith
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
DeepMind, London, UK
Computer Science & Engineering, University of Washington, Seattle, WA, USA
{akuncoro,lingpenk,gneubig}@cs.cmu.edu
miguel.ballesteros@ibm.com, cdyer@google.com, nasmith@cs.washington.edu
Abstract

Recurrent neural network grammars (RNNG) are a recently proposed probabilistic generative modeling family for natural language. They show state-of-the-art language modeling and parsing performance. We investigate what information they learn, from a linguistic perspective, through various ablations to the model and the data, and by augmenting the model with an attention mechanism (GA-RNNG) to enable closer inspection. We find that explicit modeling of composition is crucial for achieving the best performance. Through the attention mechanism, we find that headedness plays a central role in phrasal representation (with the model's latent attention largely agreeing with predictions made by hand-crafted head rules, albeit with some important differences). By training grammars without nonterminal labels, we find that phrasal representations depend minimally on nonterminals, providing support for the endocentricity hypothesis.
1 Introduction
In this paper, we focus on a recently proposed class of probability distributions, recurrent neural network grammars (RNNGs; Dyer et al., 2016), designed to model syntactic derivations of sentences. We focus on RNNGs as generative probabilistic models over trees, as summarized in §2.

Fitting a probabilistic model to data has often been understood as a way to test or confirm some aspect of a theory. We talk about a model's assumptions and sometimes explore its parameters or posteriors over its latent variables in order to gain understanding of what it "discovers" from the data. In some sense, such models can be thought of as mini-scientists.
Neural networks, including RNNGs, are capable of representing larger classes of hypotheses than traditional probabilistic models, giving them more freedom to explore. Unfortunately, they tend to be bad mini-scientists, because their parameters are difficult for human scientists to interpret.

RNNGs are striking because they obtain state-of-the-art parsing and language modeling performance. Their relative lack of independence assumptions, while still incorporating a degree of linguistically-motivated prior knowledge, affords the model considerable freedom to derive its own insights about syntax. If they are mini-scientists, the discoveries they make should be of particular interest as propositions about syntax (at least for the particular genre and dialect of the data).
This paper manipulates the inductive bias of RNNGs to test linguistic hypotheses.[1] We begin with an ablation study to discover the importance of the composition function in §3. Based on the findings, we augment the RNNG composition function with a novel gated attention mechanism (leading to the GA-RNNG) to incorporate more interpretability into the model in §4. Using the GA-RNNG, we proceed by investigating the role that individual heads play in phrasal representation (§5) and the role that nonterminal category labels play (§6). Our key findings are that lexical heads play an important role in representing most phrase types (although compositions of multiple salient heads are not infrequent, especially for conjunctions) and that nonterminal labels provide little additional information. As a by-product of our investigation, a variant of the RNNG without ensembling achieved the best reported supervised phrase-structure parsing (93.6 F1; English PTB) and, through conversion, dependency parsing (95.8 UAS, 94.6 LAS; PTB SD). The code and pretrained models to replicate our results are publicly available.[2]

[1] RNNGs have less inductive bias relative to traditional unlexicalized probabilistic context-free grammars, but more than models that parse by transducing word sequences to linearized parse trees represented as strings (Vinyals et al., 2015). Inductive bias is necessary for learning (Mitchell, 1980); we believe the important question is not "how little can a model get away with?" but rather the benefit of different forms of inductive bias as data vary.

[2] https://github.com/clab/rnng/tree/master/interpreting-rnng
2 Recurrent Neural Network Grammars
An RNNG defines a joint probability distribution over string terminals and phrase-structure nonterminals.[3] Formally, the RNNG is defined by a triple ⟨N, Σ, Θ⟩, where N denotes the set of nonterminal symbols (NP, VP, etc.), Σ the set of all terminal symbols (we assume that N ∩ Σ = ∅), and Θ the set of all model parameters. Unlike previous works that rely on hand-crafted rules to compose more fine-grained phrase representations (Collins, 1997; Klein and Manning, 2003), the RNNG implicitly parameterizes the information passed through compositions of phrases (in Θ and the neural network architecture), hence weakening the strong independence assumptions in classical probabilistic context-free grammars.

[3] Dyer et al. (2016) also defined a conditional version of the RNNG that can be used only for parsing; here we focus on the generative version since it is more flexible and (rather surprisingly) even learns better estimates of p(y | x).

The RNNG is based on an abstract state machine like those used in transition-based parsing, with its algorithmic state consisting of a stack of partially completed constituents, a buffer of already-generated terminal symbols, and a list of past actions. To generate a sentence x and its phrase-structure tree y, the RNNG samples a sequence of actions to construct y top-down. Given y, there is one such sequence (easily identified), which we call the oracle, a = ⟨a_1, ..., a_n⟩, used during supervised training.
The RNNG uses three different actions: NT(X), where X ∈ N, introduces an open nonterminal symbol onto the stack, e.g., "(NP"; GEN(x), where x ∈ Σ, generates a terminal symbol and places it on the stack and buffer; and REDUCE indicates a constituent is now complete. The elements of the stack that comprise the current constituent (going back to the last open nonterminal) are popped, a composition function is executed, yielding a composed representation that is pushed onto the stack.

[Figure 1: The RNNG consists of a stack, buffer of generated words, and list of past actions that lead to the current configuration. Each component is embedded with LSTMs, and the parser state summary u_t is used as top-layer features to predict a softmax over all feasible actions. This figure is due to Dyer et al. (2016).]

At each timestep, the model encodes the stack, buffer, and past actions, with a separate LSTM (Hochreiter and Schmidhuber, 1997) for each component as features to define a distribution over the next action to take (conditioned on the full algorithmic state). The overall architecture is illustrated in Figure 1; examples of full action sequences can be found in Dyer et al. (2016).
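To make the state machine concrete, the following minimal sketch (ours, not the authors' released code) executes an oracle-style action sequence over a symbolic stack and buffer; it tracks structure only and omits the LSTM encoders and the probability model. The action sequence extends Figure 1's partial derivation of "The hungry cat" with a hypothetical verb.

```python
# Minimal symbolic RNNG transition system: NT(X), GEN(x), REDUCE.
# This sketch tracks only the algorithmic state (stack, buffer, action
# history); the real model additionally embeds each component with LSTMs.

OPEN = object()  # marker paired with a label for an open nonterminal

def execute(actions):
    stack, buffer, history = [], [], []
    for act in actions:
        kind = act[0]
        if kind == "NT":                      # push an open nonterminal, e.g. "(NP"
            stack.append((OPEN, act[1]))
        elif kind == "GEN":                   # generate a terminal onto stack and buffer
            stack.append(act[1])
            buffer.append(act[1])
        elif kind == "REDUCE":                # pop back to the last open nonterminal
            children = []
            while not (isinstance(stack[-1], tuple) and stack[-1][0] is OPEN):
                children.append(stack.pop())
            label = stack.pop()[1]
            # the real model runs the composition function here; we just build a tree
            stack.append((label, list(reversed(children))))
        history.append(act)
    return stack, buffer, history

# Hypothetical completion of Figure 1's derivation of "The hungry cat".
oracle = [("NT", "S"), ("NT", "NP"), ("GEN", "The"), ("GEN", "hungry"),
          ("GEN", "cat"), ("REDUCE",), ("NT", "VP"), ("GEN", "meows"),
          ("REDUCE",), ("REDUCE",)]
stack, buffer, _ = execute(oracle)
print(stack)   # [('S', [('NP', ['The', 'hungry', 'cat']), ('VP', ['meows'])])]
print(buffer)  # ['The', 'hungry', 'cat', 'meows']
```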
A key element of the RNNG is the composition function, which reduces a completed constituent into a single element on the stack. This function computes a vector representation of the new constituent; it also uses an LSTM (here a bidirectional one). This composition function, which we consider in greater depth in §3, is illustrated in Fig. 2.

[Figure 2: RNNG composition function on each REDUCE operation; the network on the right models the structure on the left (Dyer et al., 2016).]
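As a rough illustration of this composition step, the sketch below runs a bidirectional LSTM over the nonterminal's embedding followed by the children's vectors and projects the two final states down to a single constituent vector. The layer sizes, the use of PyTorch, and the exact way the nonterminal label is fed in are our assumptions for illustration, not details taken from the released implementation.

```python
import torch
import torch.nn as nn

class BiLSTMComposition(nn.Module):
    """Sketch of an RNNG-style composition function: children -> one vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, nt_emb: torch.Tensor, children: torch.Tensor) -> torch.Tensor:
        # nt_emb: (dim,) embedding of the open nonterminal being closed
        # children: (num_children, dim) vectors popped from the stack
        seq = torch.cat([nt_emb.unsqueeze(0), children], dim=0).unsqueeze(0)
        out, _ = self.bilstm(seq)                       # (1, len, 2*dim)
        fwd_last = out[0, -1, :out.size(-1) // 2]       # last step, forward direction
        bwd_first = out[0, 0, out.size(-1) // 2:]       # first step, backward direction
        return torch.tanh(self.proj(torch.cat([fwd_last, bwd_first])))

comp = BiLSTMComposition(dim=128)
composed = comp(torch.randn(128), torch.randn(3, 128))  # e.g. an NP with 3 children
print(composed.shape)  # torch.Size([128])
```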
Since the RNNG is a generative model, it attempts to maximize p(x, y), the joint distribution of strings and trees, defined as

    p(x, y) = p(a) = \prod_{t=1}^{n} p(a_t \mid a_1, \ldots, a_{t-1}).

In other words, p(x, y) is defined as a product of local probabilities, conditioned on all past actions. The joint probability estimate p(x, y) can be used for both phrase-structure parsing (finding arg max_y p(y | x)) and language modeling (finding p(x) by marginalizing over the set of possible parses for x). Both inference problems can be solved using an importance sampling procedure.[4] We report all RNNG performance based on the corrigendum to Dyer et al. (2016).

[4] Importance sampling works by using a proposal distribution q(y | x) that is easy to sample from. In Dyer et al. (2016) and this paper, the proposal distribution is the discriminative variant of the RNNG; see Dyer et al. (2016).
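Footnote 4's importance-sampling procedure can be sketched in a few lines: sample trees from the discriminative proposal q(y | x), weight each sample by p(x, y)/q(y | x), and use the weights both to estimate the marginal p(x) and to pick the highest-scoring parse. The function names and the interfaces of the two models below are hypothetical stand-ins, not the released API.

```python
import math

def importance_sample(x, proposal, generative, num_samples=100):
    """Estimate log p(x) and select a parse by importance sampling.

    `proposal.sample(x)` is assumed to return (tree, log q(y | x)) and
    `generative.log_joint(x, tree)` to return log p(x, y); both interfaces
    are hypothetical stand-ins for the discriminative and generative RNNGs.
    """
    samples = []
    for _ in range(num_samples):
        tree, log_q = proposal.sample(x)              # y ~ q(y | x)
        log_p = generative.log_joint(x, tree)         # log p(x, y)
        samples.append((tree, log_p, log_p - log_q))  # importance weight in log space

    # log p(x) ~= log of (1/K) * sum_k p(x, y_k) / q(y_k | x), via log-sum-exp
    log_weights = [w for _, _, w in samples]
    m = max(log_weights)
    log_marginal = m + math.log(sum(math.exp(w - m) for w in log_weights) / len(samples))

    # arg max_y p(y | x) is approximated by the sampled tree with the highest p(x, y)
    best_tree = max(samples, key=lambda s: s[1])[0]
    return log_marginal, best_tree
```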
3 Composition is Key
Given the same data, under both the discriminative and generative settings RNNGs were found to parse with significantly higher accuracy than (respectively) the models of Vinyals et al. (2015) and Choe and Charniak (2016) that represent y as a "linearized" sequence of symbols and parentheses without explicitly capturing the tree structure, or even constraining y to be a well-formed tree (see Table 1). Vinyals et al. (2015) directly predict the sequence of nonterminals, "shifts" (which consume a terminal symbol), and parentheses from left to right, conditional on the input terminal sequence x, while Choe and Charniak (2016) used a sequential LSTM language model on the same linearized trees to create a generative variant of the Vinyals et al. (2015) model. The generative model is used to re-rank parse candidates.
Model                                   F1
Vinyals et al. (2015), PTB only         88.3
Discriminative RNNG                     91.2
Choe and Charniak (2016), PTB only      92.6
Generative RNNG                         93.3

Table 1: Phrase-structure parsing performance on PTB §23. All results are reported using single-model performance and without any additional data.
The results in Table 1 suggest that the RNNG's explicit composition function (Fig. 2), which Vinyals et al. (2015) and Choe and Charniak (2016) must learn implicitly, plays a crucial role in the RNNG's generalization success. Beyond this, Choe and Charniak's generative variant of Vinyals et al. (2015) is another instance where generative models trained on the PTB outperform discriminative models.
3.1 Ablated RNNGs

On close inspection, it is clear that the RNNG's three data structures—stack, buffer, and action history—are redundant. For example, the action history and buffer contents completely determine the structure of the stack at every timestep. Every generated word goes onto the stack, too; and some past words will be composed into larger structures, but through the composition function, they are all still "available" to the network that predicts the next action. Similarly, the past actions are redundant with the stack. Despite this redundancy, only the stack incorporates the composition function. Since each of the ablated models is sufficient to encode all necessary partial tree information, the primary difference is that ablations with the stack use explicit composition, to which we can therefore attribute most of the performance difference.

We conjecture that the stack—the component that makes use of the composition function—is critical to the RNNG's performance, and that the buffer and action history are not. In transition-based parsers built on expert-crafted features, the most recent words and actions are useful if they are salient, although neural representation learners can automatically learn what information should be salient.

To test this conjecture, we train ablated RNNGs that lack each of the three data structures (action history, buffer, stack), as well as one that lacks both the action history and buffer.[5] If our conjecture is correct, performance should degrade most without the stack, and the stack alone should perform competitively.

[5] Note that the ablated RNNG without a stack is quite similar to Vinyals et al. (2015), who encoded a (partial) phrase-structure tree as a sequence of open and close parentheses, terminals, and nonterminal symbols; our action history is quite close to this, with each NT(X) capturing a left parenthesis and X nonterminal, and each REDUCE capturing a right parenthesis.
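To make the ablations concrete, the following sketch (our illustration, not the released code) shows how the parser state summary u_t might be assembled from per-component LSTM encodings, with an ablation simply dropping one or more components from the concatenation before the action softmax.

```python
import torch
import torch.nn as nn

class AblatableStateSummary(nn.Module):
    """Sketch: concatenate encodings of the RNNG's data structures, minus ablated ones."""

    def __init__(self, dim: int, num_actions: int,
                 use_stack=True, use_buffer=True, use_history=True):
        super().__init__()
        self.use = {"stack": use_stack, "buffer": use_buffer, "history": use_history}
        n_components = sum(self.use.values())
        self.action_scores = nn.Linear(n_components * dim, num_actions)

    def forward(self, stack_enc, buffer_enc, history_enc):
        # Each *_enc is a (dim,) vector produced by that component's LSTM.
        parts = []
        if self.use["stack"]:
            parts.append(stack_enc)      # the only component built with explicit composition
        if self.use["buffer"]:
            parts.append(buffer_enc)
        if self.use["history"]:
            parts.append(history_enc)
        u_t = torch.cat(parts)           # parser state summary
        return torch.log_softmax(self.action_scores(u_t), dim=-1)

# "Stack-only" ablation: drop the buffer and the action history.
stack_only = AblatableStateSummary(dim=128, num_actions=100,
                                   use_buffer=False, use_history=False)
log_probs = stack_only(torch.randn(128), torch.randn(128), torch.randn(128))
```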
Experimental settings. We perform our experiments on the English PTB corpus, with §02–21 for training, §24 for validation, and §23 for test; no additional data were used for training. We follow the same hyperparameters as the generative model proposed in Dyer et al. (2016).[6] The generative model did not use any pretrained word embeddings or POS tags; a discriminative variant of the standard RNNG was used to obtain tree samples for the generative model. All further experiments use the same settings and hyperparameters unless otherwise noted.

[6] The model is trained using stochastic gradient descent, with a learning rate of 0.1 and a per-epoch decay of 0.08. All experiments with the generative RNNG used 100 tree samples for each sentence, obtained by sampling from the local softmax distribution of the discriminative RNNG.
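For readers reproducing this setup, footnote 6's optimizer settings can be written down as a small configuration; the decay formula shown (dividing the base rate by 1 + decay × epoch, as in DyNet-style SGD trainers) is our assumption about what "per-epoch decay" means here, not a detail stated in the paper.

```python
# Hyperparameters reported in footnote 6; the schedule form is an assumption.
BASE_LEARNING_RATE = 0.1
PER_EPOCH_DECAY = 0.08
TREE_SAMPLES_PER_SENTENCE = 100  # proposals drawn from the discriminative RNNG

def learning_rate(epoch: int) -> float:
    """Assumed DyNet-style decay: eta_0 / (1 + decay * epoch)."""
    return BASE_LEARNING_RATE / (1.0 + PER_EPOCH_DECAY * epoch)

print([round(learning_rate(e), 4) for e in range(5)])
# [0.1, 0.0926, 0.0862, 0.0806, 0.0758]
```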
Experimental results. We trained each ablation from scratch, and compared these models on three tasks: English phrase-structure parsing (labeled F1), Table 2; dependency parsing, Table 3, by converting parse output to Stanford dependencies (De Marneffe et al., 2006) using the tool by Kong and Smith (2014); and language modeling, Table 4. The last row of each table reports the performance of a novel variant of the (stack-only) RNNG with attention, to be presented in §4.
Model                            F1
Vinyals et al. (2015)†           92.1
Choe and Charniak (2016)         92.6
Choe and Charniak (2016)†        93.8
Baseline RNNG                    93.3
Ablated RNNG (no history)        93.2
Ablated RNNG (no buffer)         93.3
Ablated RNNG (no stack)          92.5
Stack-only RNNG                  93.6
GA-RNNG                          93.5

Table 2: Phrase-structure parsing performance on PTB §23. † indicates systems that use additional unparsed data (semisupervised). The GA-RNNG results will be discussed in §4.
Discussion. The RNNG with only a stack is the strongest of the ablations, and it even outperforms the "full" RNNG with all three data structures. Ablating the stack gives the worst among the new results. This strongly supports the importance of the composition function: a proper REDUCE operation that transforms a constituent's parts and nonterminal label into a single explicit (vector) representation is helpful to performance.

It is noteworthy that the stack alone is stronger than the original RNNG, which—in principle—can learn to disregard the buffer and action history. Since the stack maintains syntactically "recent" information near its top, we conjecture that the learner is overfitting to spurious predictors in the buffer and action history that explain the training data but do not generalize well.

Model                            UAS    LAS
Kiperwasser and Goldberg (2016)  93.9   91.9
Andor et al. (2016)              94.6   92.8
Dozat and Manning (2016)         95.4   93.8
Choe and Charniak (2016)†        95.9   94.1
Baseline RNNG                    95.6   94.4
Ablated RNNG (no history)        95.4   94.2
Ablated RNNG (no buffer)         95.6   94.4
Ablated RNNG (no stack)          95.1   93.8
Stack-only RNNG                  95.8   94.6
GA-RNNG                          95.7   94.5

Table 3: Dependency parsing performance on PTB §23 with Stanford Dependencies (De Marneffe and Manning, 2008). † indicates systems that use additional unparsed data (semisupervised).

A similar performance degradation is seen in language modeling (Table 4): the stack-only RNNG achieves the best performance, and ablating the stack is most harmful. Indeed, modeling syntax without explicit composition (the stack-ablated RNNG) provides little benefit over a sequential LSTM language model.
Model                        Test ppl. (PTB)
IKN 5-gram                   169.3
LSTM LM                      113.4
RNNG                         105.2
Ablated RNNG (no history)    105.7
Ablated RNNG (no buffer)     106.1
Ablated RNNG (no stack)      113.1
Stack-only RNNG              101.2
GA-RNNG                      100.9

Table 4: Language modeling: perplexity. IKN refers to Kneser-Ney 5-gram LM.
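Since the RNNG's p(x) is only available through the importance-sampling estimate sketched in §2, Table 4's perplexities are, at a high level, computed from per-sentence log-marginals. The sketch below shows that standard computation (corpus perplexity as the exponentiated negative average per-token log-likelihood); the `log_marginal` callable is a hypothetical stand-in, and token-counting conventions (e.g., end-of-sentence symbols) are left out.

```python
import math

def corpus_perplexity(sentences, log_marginal):
    """Corpus-level perplexity: exp of the negative average per-token log p(x).

    `log_marginal(x)` is a hypothetical callable returning an estimate of
    log p(x), e.g. the importance-sampling estimate sketched in Section 2.
    """
    total_log_prob = 0.0
    total_tokens = 0
    for x in sentences:
        total_log_prob += log_marginal(x)
        total_tokens += len(x)          # assumes x is a list of tokens
    return math.exp(-total_log_prob / total_tokens)
```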
We remark that the stack-only results are the best published PTB results for both phrase-structure and dependency parsing among supervised models.
4 Gated Attention RNNG
Having established that the composition function is key to RNNG performance (§3), we now seek to understand the nature of the composed phrasal representations that are learned. Like most neural networks, interpreting the composition function's behavior is challenging. Fortunately, linguistic theories offer a number of hypotheses about the nature of representations of phrases that can provide a conceptual scaffolding to understand them.
4.1 Linguistic Hypotheses
We consider two theories about phrasal representation. The first is that phrasal representations are strongly determined by a privileged lexical head. Augmenting grammars with lexical head information has a long history in parsing, starting with the models of Collins (1997), and theories of syntax such as the "bare phrase structure" hypothesis of the Minimalist Program (Chomsky, 1993) posit that phrases are represented purely by single lexical heads. Proposals for multiple-headed phrases (to deal with tricky cases like conjunction) likewise exist (Jackendoff, 1977; Keenan, 1987). Do the phrasal representations learned by RNNGs depend on individual lexical heads or multiple heads? Or do the representations combine all children without any salient head?

Related to the question about the role of heads in phrasal representation is the question of whether phrase-internal material wholly determines the representation of a phrase (an endocentric representation) or whether nonterminal relabeling of a constituent introduces new information (exocentric representations). To illustrate the contrast, an endocentric representation is representing a noun phrase with a noun category, whereas S → NP VP exocentrically introduces a new syntactic category that is neither NP nor VP (Chomsky, 1970).
4.2 Gated Attention Composition
To investigate what the stack-only RNNG learns about headedness (and later endocentricity), we propose a variant of the composition function that makes use of an explicit attention mechanism (Bahdanau et al., 2015) and a sigmoid gate with multiplicative interactions, henceforth called GA-RNNG.

At every REDUCE operation, the GA-RNNG assigns an "attention weight" to each of its children (between 0 and 1, such that the total weight of all children sums to 1), and the parent phrase is represented by the combination of a sum of each child's representation scaled by its attention weight and its nonterminal type. Our weighted sum is more expressive than traditional head rules, however, because it allows attention to be divided among multiple constituents. Head rules, conversely, are analogous to giving all attention to one constituent, the one containing the lexical head.
We now formally define the GA-RNNG's composition function. Recall that u_t is the concatenation of the vector representations of the RNNG's data structures, used to assign probabilities to each of the actions available at timestep t (see Fig. 1, the layer before the softmax at the top). For simplicity, we drop the timestep index here. Let o_nt denote the vector embedding (learned) of the nonterminal being constructed, for the purpose of computing attention weights.

Now let c_1, c_2, ... denote the sequence of vector embeddings for the constituents of the new phrase. The length of these vectors is defined by the dimensionality of the bidirectional LSTM used in the original composition function (Fig. 2). We use semicolon (;) to denote vector concatenation operations.
The attention vector is given by:

    a = \mathrm{softmax}\left( [c_1\ c_2\ \cdots]^{\top} V [u; o_{nt}] \right)    (1)

Note that the length of a is the same as the number of constituents, and that this vector sums to one due to the softmax. It divides a single unit of attention among the constituents.
Next, note that the constituent source vector m = [c_1; c_2; \cdots]\, a is a convex combination of the child-constituents, weighted by attention. We will combine this with another embedding of the nonterminal denoted as t_{nt} (separate from o_{nt}) using a sigmoid gating mechanism:

    g = \sigma(W_1 t_{nt} + W_2 m + b)    (2)

Note that the value of the gate is bounded between [0, 1] in each dimension.

The new phrase's final representation uses element-wise multiplication (\odot) with respect to both t_{nt} and m, a process reminiscent of the LSTM "forget" gate:

    c = g \odot t_{nt} + (1 - g) \odot m.    (3)
The intuition is that the composed representation should incorporate both nonterminal information and information about the constituents (through weighted sum and attention mechanism). The gate g modulates the interaction between them to account for varying importance between the two in different contexts.
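The three equations can be read as a small module: children are scored against [u; o_nt] to obtain attention weights, averaged into m, and then gated against a second nonterminal embedding t_nt. The sketch below is a minimal rendering of Eqs. 1–3 under assumed dimensions; variable names mirror the paper's symbols, but the PyTorch packaging and layer sizes are ours, not the released implementation.

```python
import torch
import torch.nn as nn

class GatedAttentionComposition(nn.Module):
    """Sketch of the GA-RNNG composition function (Eqs. 1-3)."""

    def __init__(self, dim: int, state_dim: int, num_nt: int):
        super().__init__()
        self.o_nt = nn.Embedding(num_nt, dim)        # nonterminal embedding for attention
        self.t_nt = nn.Embedding(num_nt, dim)        # separate nonterminal embedding for the gate
        self.V = nn.Linear(state_dim + dim, dim, bias=False)  # maps [u; o_nt] into child space
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim)                # this layer's bias plays the role of b

    def forward(self, children: torch.Tensor, u: torch.Tensor, nt: torch.Tensor) -> torch.Tensor:
        # children: (num_children, dim); u: (state_dim,); nt: scalar nonterminal id
        o, t = self.o_nt(nt), self.t_nt(nt)
        a = torch.softmax(children @ self.V(torch.cat([u, o])), dim=0)  # Eq. 1: attention over children
        m = children.t() @ a                                            # convex combination of children
        g = torch.sigmoid(self.W1(t) + self.W2(m))                      # Eq. 2: gate in [0, 1]^dim
        return g * t + (1.0 - g) * m                                    # Eq. 3: composed representation

comp = GatedAttentionComposition(dim=128, state_dim=256, num_nt=30)
c = comp(torch.randn(4, 128), torch.randn(256), torch.tensor(3))
print(c.shape)  # torch.Size([128])
```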

References
Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR.

Klein, D. and Manning, C. D. (2003). Accurate Unlexicalized Parsing. In Proceedings of ACL.

Chomsky, N. (1993). A Minimalist Program for Linguistic Theory. MIT Press.