Ordinal Common-sense Inference
Sheng Zhang
Johns Hopkins University
zsheng2@jhu.edu
Rachel Rudinger
Johns Hopkins University
rudinger@jhu.edu
Kevin Duh
Johns Hopkins University
kevinduh@cs.jhu.edu
Benjamin Van Durme
Johns Hopkins University
vandurme@cs.jhu.edu
Abstract
Humans have the capacity to draw common-sense inferences from natural language: various things that are likely but not certain to hold based on established discourse, and are rarely stated explicitly. We propose an evaluation of automated common-sense inference based on an extension of recognizing textual entailment: predicting ordinal human responses on the subjective likelihood of an inference holding in a given context. We describe a framework for extracting common-sense knowledge from corpora, which is then used to construct a dataset for this ordinal entailment task. We train a neural sequence-to-sequence model on this dataset, which we use to score and generate possible inferences. Further, we annotate subsets of previously established datasets via our ordinal annotation protocol in order to then analyze the distinctions between these and what we have constructed.
1 Introduction
We use words to talk about the world. Therefore, to understand what words mean, we must have a prior explication of how we view the world. – Hobbs (1987)
Researchers in Artificial Intelligence and (Computational) Linguistics have long cited the requirement of common-sense knowledge in language understanding.[1]

Sam bought a new clock ; The clock runs
Dave found an axe in his garage ; A car is parked in the garage
Tom was accidentally shot by his teammate in the army ; The teammate dies
Two friends were in a heated game of checkers ; A person shoots the checkers
My friends and I decided to go swimming in the ocean ; The ocean is carbonated

Figure 1: Examples of common-sense inference ranging from very likely, likely, plausible, technically possible, to impossible.

This knowledge is viewed as a key component in filling in the gaps between the telegraphic style of natural language statements. We are able to convey considerable information in a relatively sparse channel, presumably owing to a partially shared model at the start of any discourse.[2]

Common-sense inference – inferences based on common-sense knowledge – is possibilistic: things everyone more or less would expect to hold in a given context, but without the necessary strength of logical entailment.[3]

[1] Schank (1975): It has been apparent ... within ... natural language understanding ... that the eventual limit to our solution ... would be our ability to characterize world knowledge.
[2] McCarthy (1959): a program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows.
[3] Many of the bridging inferences of Clark (1975) make use of common-sense knowledge, such as the following example of “Probable part”: I walked into the room. The windows looked out to the bay. To resolve the definite reference the windows, one needs to know that rooms have windows is probable.

Transactions of the Association for Computational Linguistics, vol. 5, pp. 379–395, 2017. Action Editor: Mark Steedman. Submission batch: 12/2016; Revision batch: 3/2017; Published 11/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Because natural language corpora exhibit human reporting bias (Gordon and Van Durme, 2013), systems that derive knowledge exclusively from such corpora may be more accurately considered models of language, rather than of the
world (Rudinger et al., 2015). Facts such as “A person walking into a room is very likely to be blinking and breathing” are usually unstated in text, so their real-world likelihoods do not align to language model probabilities.[4] We would like to have systems capable of reading a sentence that describes a real-world situation and inferring how likely other statements about that situation are to hold true in the real world. This capability is subtly but crucially distinct from the ability to predict other sentences reported in the same text, as a language model may be trained to do.
We therefore propose a model of knowledge acquisition based on first deriving possibilistic statements from text. As the relative frequency of these statements suffers from the aforementioned reporting bias, we then follow up with human annotation of derived examples. Since we are initially uncertain about the real-world likelihood of the derived common-sense knowledge holding in any particular context, we pair it with various grounded contexts and present it to humans for their own assessment. As these examples vary in assessed plausibility, we propose the task of ordinal common-sense inference, which embraces a wider set of natural conclusions arising from language comprehension (see Fig 1).
In what follows, we describe prior efforts in common-sense and textual inference (§2). We then state our position on how ordinal common-sense inference should be defined (§3), and detail our own framework for large-scale extraction and abstraction, along with a crowdsourcing protocol for assessment (§4). This includes a novel neural model for forward generation of textual inference statements. Together these methods are applied to contexts derived from various prior textual inference resources, resulting in the JHU Ordinal Common-sense Inference (JOCI) corpus, a large collection of diverse common-sense inference examples, judged to hold with varying levels of subjective likelihood (§5). We provide baseline results (§6) for prediction on the JOCI corpus.[5]

[4] For further background see discussions by Van Durme (2010), Gordon and Van Durme (2013), Rudinger et al. (2015) and Misra et al. (2016).
[5] The JOCI corpus is released freely at: http://decomp.net/.
2 Background
Mining Common Sense Building large collec-
tions of common-sense knowledge can be done
manually via professionals (Hobbs and Navarretta,
1993), but at considerable cost in terms of time and
expense (Miller, 1995; Lenat, 1995; Baker et al.,
1998; Friedland et al., 2004). Efforts have pursued
volunteers (Singh, 2002; Havasi et al., 2007) and
games with a purpose (Chklovski, 2003), but are
still left fully reliant on human labor. Many have
pursued automating the process, such as in expand-
ing lexical hierarchies (Hearst, 1992; Snow et al.,
2006), constructing inference patterns (Lin and Pan-
tel, 2001; Berant et al., 2011), reading reference
materials (Richardson et al., 1998; Suchanek et al.,
2007), mining search engine query logs (Paşca and
Van Durme, 2007), and most relevant here: abstract-
ing from instance-level predications discovered in
descriptive texts (Schubert, 2002; Liakata and Pul-
man, 2002; Clark et al., 2003; Banko and Etzioni,
2007). In this article we are concerned with knowl-
edge mining for purposes of seeding a text genera-
tion process (constructing common-sense inference
examples).
Common-sense Tasks Many textual inference
tasks have been designed to require some de-
gree of common-sense knowledge, e.g., the Wino-
grad Schema Challenge discussed by Levesque et
al. (2011). The data for these tasks are either
smaller, carefully constructed evaluation sets by pro-
fessionals, following efforts like the FRACAS test
suite (Cooper et al., 1996), or they rely on crowd-
sourced elicitation (Bowman et al., 2015). Crowd-
sourcing is scalable, but elicitation protocols can
lead to biased responses unlikely to contain a wide
range of possible common-sense inferences. Humans can generally agree on the plausibility of a wide range of possible inference pairs, but they are not likely to generate them from an initial prompt.[6]

[6] McRae et al. (2005): Features such as <is larger than a tulip> or <moves faster than an infant>, for example, although logically possible, do not occur in [human responses] [...] Although people are capable of verifying that a <dog is larger than a pencil>.

The construction of SICK (Sentences Involving Compositional Knowledge) made use of existing paraphrastic sentence pairs (descriptions by different people of the same image), which were modified through a series of rule-based transformations, then judged by humans (Marelli et al., 2014). As with SICK, we rely on humans only for judging provided examples, rather than elicitation of text. Unlike SICK, our generation is based on a process targeted specifically at common sense (see §4.1.1).
Plausibility Researchers in psycholinguistics
have explored a notion of plausibility in human
sentence processing, where, for instance, arguments
to predicates are intuitively more or less “plausible”
as fillers to different thematic roles, as reflected in
human reading times. For example, McRae et al.
(1998) looked at manipulations such as:
(a) The boss hired by the corporation was per-
fect for the job.
(b) The applicant hired by the corporation was
perfect for the job.
where the plausibility of a boss being the agent – as
compared to patient – of the predicate hired might be
measured by looking at delays in reading time in the
words following the predicate. This measurement is
then contrasted with the timing observed in the same
positions in (b).[7]
Rather than measuring according to predictions
such as human reading times, here we ask anno-
tators explicitly to judge plausibility on a 5-point
ordinal scale (See §3). Further, our effort might
be described in this setting as conditional plausibility,[8] where plausibility judgments for a given sentence are expected to be dependent on preceding context.
sibility is an interesting avenue of potential future
work, perhaps through the measurement of human
reading times when using prompts derived from our
ordinal common-sense inference examples. Computational modeling of (unconditional) semantic plausibility has been explored by those such as Padó et
Textual Entailment A multi-year source of tex-
tual inference examples were generated under the
Recognizing Textual Entailment (RTE) Challenges,
introduced by Dagan et al. (2006):
[7] This notion of thematic plausibility is then related to the notion of verb-argument selectional preference (Zernik, 1992; Resnik, 1993; Clark and Weir, 1999), and sortal (in)correctness (Thomason, 1972).
[8] Thanks to the anonymous reviewer for this connection.
We say that T entails H if, typically, a human
reading T would infer that H is most likely
true. This somewhat informal definition is
based on (and assumes) common human un-
derstanding of language as well as common
background knowledge.
This definition strayed from the more strict notion
of entailment as used by linguistic semanticists, such
as those involved with FRACAS. While Giampic-
colo et al. (2008) extended binary RTE with an “un-
known” category, the entailment community has pri-
marily focused on issues such as “paraphrase” and
“monotonicity”. An example of this is the Natural
Logic implementation of MacCartney and Manning
(2007).
Language understanding in context requires not only understanding the entailments of a sentence, but also its plausible inferences, i.e., the new posterior on the world after reading the sentence. A new sentence in a discourse is almost never entailed by another sentence in the discourse, because such a sentence would add no new information. To successfully process a discourse, a system needs some understanding of what new information can, possibly or plausibly, be added to the discourse. Collecting sentence pairs with ordinal entailment connections is potentially useful for improving and testing the language understanding capabilities needed for applications like storytelling.
Garrette et al. (2011) and Beltagy et al. (2017)
treated textual entailment as probabilistic logical in-
ference in Markov Logic Networks (Richardson and
Domingos, 2006). However, the notion of probabil-
ity in their entailment task has a subtle distinction
from our problem of common-sense inference. The
probability of being an entailment given by a proba-
bilistic model trained for a binary classification (be-
ing an entailment or not) is not necessarily the same
as the likelihood of an inference being true. For ex-
ample:
T: A person flips a coin.
H: That flip comes up heads.
No human reading T should infer that H is true.
A model trained to make ordinal predictions should
say: “plausible, with probability 1.0”, whereas a
model trained to make binary entailed/not-entailed
predictions should say: “not entailed, with probabil-
ity 1.0”. The following example exhibits the same
property:
T: An animal eats food.
H: A person eats food.
Again, with high confidence, H is plausible; and,
with high confidence, it is also not entailed.
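The contrast can be made concrete with a toy sketch. The label distributions below, for the coin-flip pair above, are purely illustrative stand-ins, not outputs of any trained model: both models are maximally confident, yet they report different things.

```python
# T: "A person flips a coin."  H: "That flip comes up heads."
# Illustrative (hand-written) output distributions for two kinds of model.

# A confident ordinal model places all its mass on "plausible" ...
ordinal_probs = {"impossible": 0.0, "technically possible": 0.0,
                 "plausible": 1.0, "likely": 0.0, "very likely": 0.0}

# ... while a confident binary RTE model calls the very same pair "not entailed".
binary_probs = {"entailed": 0.0, "not entailed": 1.0}

ordinal_label = max(ordinal_probs, key=ordinal_probs.get)
binary_label = max(binary_probs, key=binary_probs.get)

print(ordinal_label)  # plausible
print(binary_label)   # not entailed
```

Both predictions are correct for their respective tasks, which is exactly why the probability of entailment cannot be read as the likelihood of an inference holding.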
Non-entailing Inference Of the various non-
“entailment” textual inference tasks, a few are most
salient here. Agirre et al. (2012) piloted a Textual
Similarity evaluation which has been refined in sub-
sequent years. Systems produce scalar values corre-
sponding to predictions of how similar the meaning
is between two provided sentences, e.g., the follow-
ing pair from SICK was judged very similar (4.2 out
of 5), while also being a contradiction: There is no
biker jumping in the air and A lone biker is jump-
ing in the air. The ordinal approach we advocate for
relies on a graded notion, like textual similarity.
The Choice of Plausible Alternative (COPA)
task (Roemmele et al., 2011) was a reaction to RTE,
similarly motivated to probe a system’s ability to un-
derstand inferences that are not strictly entailed. A
single context was provided, with two alternative in-
ferences, and a system had to judge which was more
plausible. The COPA dataset was manually elicited,
and is not large; we discuss this data further in §5.
The Narrative Cloze task (Chambers and Juraf-
sky, 2008) requires a system to score candidate in-
ferences as to how likely they are to appear in a
document that also included the provided context.
Many such inferences are then not strictly entailed
by the context. Further, the Cloze task gives the ben-
efit of being able to generate very large numbers of
examples automatically by simply occluding parts
of existing documents and asking a system to pre-
dict what is missing. The LAMBADA dataset (Pa-
perno et al., 2016) is akin to our strategy for auto-
matic generation followed by human filtering, but
for Cloze examples. As our concern is with infer-
ences that are often true but never stated in a doc-
ument, this approach is not viable here. The ROC-
Stories corpus (Mostafazadeh et al., 2016) elicited
a more “plausible” collection of documents in or-
der to retain the narrative Cloze in the context of
common-sense inference. The ROCStories corpus
can be viewed as an extension of the idea behind
the COPA corpus, done at a larger scale with crowd-
sourcing, and with multi-sentence contexts; we con-
sider this dataset in §5.
Alongside the narrative Cloze, Pichotta and
Mooney (2016) made use of a 5-point Likert scale
(very likely to very unlikely) as a secondary evalu-
ation of various script induction techniques. While
they were concerned with measuring their ability to
generate very likely inferences, here we are inter-
ested in generating a wide swath of inference candi-
dates, including those that are impossible.
3 Ordinal Common-sense Inference
Our goal is a system that can perform speculative,
common-sense inference as part of understanding
language. Based on the observed shortfalls of prior
work, we propose the notion of Ordinal Common-
sense Inference (OCI). OCI embraces the notion of
Dagan et al. (2006), in that we are concerned with
human judgments of epistemic modality.[9]
As agreed by many linguists, modality in natural language is a continuous category, but speakers are able to map areas of this axis into discrete values (Lyons, 1977; Horn, 1989; de Haan, 1997) – Saurí and Pustejovsky (2009)
According to Horn (1989), there are two scales of epistemic modality which differ in polarity (positive vs. negative polarity): ⟨certain, likely, possible⟩ and ⟨impossible, unlikely, uncertain⟩. The Square of Opposition (SO) (Fig 2) illustrates the logical relations holding between values in the two scales. Based on these logical relations, we can form an exhaustive set of epistemic modals: ⟨very likely, likely, possible, impossible⟩, where ⟨very likely, likely, possible⟩ lie on a single, positive Horn scale, and impossible, a complementary concept from the corresponding negative Horn scale, completes the set. In this paper, we further replace the value possible with the more fine-grained values technically possible and plausible. This results in a 5-point scale of likelihood: ⟨very likely, likely, plausible, technically possible, impossible⟩. The OCI task definition directly embraces subjective likelihood on such an ordinal scale. Humans are presented with a context C and asked whether a provided hypothesis H is very likely, likely, plausible, technically possible, or impossible. Furthermore, an important part of this process is the generation of H by automatic methods, which seeks to avoid the elicitation bias of many prior works.

[9] Epistemic modality: the likelihood that (some aspect of) a certain state of affairs is/has been/will be true (or false) in the context of the possible world under consideration.
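For concreteness, the 5-point scale can be encoded as an ordered label set. The enum and the median-based aggregation below are a minimal sketch of ours; the aggregation rule is our illustration of one way to combine annotator judgments, not the paper's annotation protocol.

```python
from enum import IntEnum

class OrdinalLabel(IntEnum):
    """The 5-point subjective-likelihood scale of the OCI task."""
    IMPOSSIBLE = 1
    TECHNICALLY_POSSIBLE = 2
    PLAUSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5

def aggregate(judgments):
    """Take the median judgment: this respects the ordering of the
    scale without assuming the labels are evenly spaced."""
    ranked = sorted(judgments)
    return ranked[len(ranked) // 2]

# Three hypothetical annotator judgments for one context/hypothesis pair:
labels = [OrdinalLabel.LIKELY, OrdinalLabel.PLAUSIBLE, OrdinalLabel.VERY_LIKELY]
print(aggregate(labels).name)  # LIKELY
```

Using an `IntEnum` makes the ordinal structure explicit: labels compare and sort by likelihood, which is what distinguishes this task from unordered classification.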
[Figure 2: The Square of Opposition (SO) for epistemic modals (Saurí and Pustejovsky, 2009). The positive scale (certain, likely, possible) and the negative scale (impossible, unlikely, uncertain) occupy the square: certain (A) and impossible (E) are contraries, possible (I) and uncertain (O) are subcontraries, and the diagonals are contradictories.][10]
4 Framework for collecting OCI corpus
We now describe our framework for collecting ordi-
nal common-sense inference examples. It is natural
to collect this data in two stages. In the first stage
(§4.1), we automatically generate inference candi-
dates given some context. We propose two broad
approaches using either general world knowledge or
neural methods. In the second stage (§4.2), we an-
notate these candidates with ordinal labels.
4.1 Generation of Common-sense Inference
Candidates
4.1.1 Generation based on World Knowledge
Our motivation for this approach was first intro-
duced by Schubert (2002):
There is a largely untapped source of general
knowledge in texts, lying at a level beneath the
explicit assertional content. This knowledge
consists of relationships implied to be possi-
ble in the world, or, under certain conditions,
implied to be normal or commonplace in the
world.
Following Schubert (2002) and Van Durme and
Schubert (2008), we define an approach for ab-
stracting over explicit assertions derived from cor-
pora, leading to a large-scale collection of general
possibilistic statements. As shown in Fig 3, this
approach generates common-sense inference candidates in four steps: (a) extracting propositions with predicate-argument structures from texts, (b) abstracting over propositions to generate templates for concepts, (c) deriving properties of concepts via different strategies, and (d) generating possibilistic hypotheses from contexts.

[10] “Contradictories”: exhaustive and mutually exclusive conditions. “Contraries”: non-exhaustive and mutually exclusive. “Subcontraries”: exhaustive and non-mutually exclusive.
[Figure 3: Generating common-sense inferences based on general world knowledge. The pipeline shows (a) extraction of a predicate-argument structured proposition ("[John] borrowed [the books] from [the library]") from plain text; (b) abstraction to propositional templates ("____ borrow book from library", "person borrow ____ from library", "person borrow book from ____") over WordNet concepts (person.n.01, book.n.01, library.n.01); (c) property derivation for concepts such as book.n.01 via a decision tree over features like "person subscribe to ____" and "person borrow ____ from library"; and (d) inference generation from a context ("The professor recommended [books] for this course."), yielding the hypothesis "A person borrows the books from a library."]
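Steps (a)-(b) of the pipeline can be sketched as follows. The toy `CONCEPTS` lookup stands in for a WordNet synset mapping, the preposition is hard-coded for this one proposition, and the function names are our illustration, not the authors' implementation.

```python
# Toy stand-in for a WordNet lookup: argument head -> (lemma, synset id).
CONCEPTS = {"John": ("person", "person.n.01"),
            "books": ("book", "book.n.01"),
            "library": ("library", "library.n.01")}

def abstract(pred, args):
    """Abstract a pred-arg proposition into propositional templates:
    blank out each argument slot in turn, abstracting the remaining
    arguments to concept lemmas, and pair each template with the
    synset of the blanked slot."""
    out = []
    for i in range(len(args)):
        slots = ["____" if j == i else CONCEPTS[a][0] for j, a in enumerate(args)]
        out.append(("{} {} {} from {}".format(slots[0], pred, slots[1], slots[2]),
                    CONCEPTS[args[i]][1]))
    return out

# "[John] borrowed [the books] from [the library]" from Fig 3:
for template, concept in abstract("borrow", ["John", "books", "library"]):
    print(template, "|", concept)
```

Run on the Fig 3 proposition, this yields the three templates shown there, e.g. "person borrow ____ from library" paired with book.n.01; step (c) would then learn which other concepts can fill each blank.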
(a) Extracting propositions: First we extract a
large set of propositions with predicate-argument
structures from noun phrases and clauses, under
which general world presumptions often lie. To
achieve this goal, we use PredPatt[11] (White et al., 2016; Zhang et al., 2017), which defines a frame-

[11] https://github.com/hltcoe/PredPatt