Ordinal Common-sense Inference
Sheng Zhang
Johns Hopkins University
zsheng2@jhu.edu
Rachel Rudinger
Johns Hopkins University
rudinger@jhu.edu
Kevin Duh
Johns Hopkins University
kevinduh@cs.jhu.edu
Benjamin Van Durme
Johns Hopkins University
vandurme@cs.jhu.edu
Abstract
Humans have the capacity to draw common-sense inferences from natural language: various things that are likely but not certain to hold based on established discourse, and are rarely stated explicitly. We propose an evaluation of automated common-sense inference based on an extension of recognizing textual entailment: predicting ordinal human responses on the subjective likelihood of an inference holding in a given context. We describe a framework for extracting common-sense knowledge from corpora, which is then used to construct a dataset for this ordinal entailment task. We train a neural sequence-to-sequence model on this dataset, which we use to score and generate possible inferences. Further, we annotate subsets of previously established datasets via our ordinal annotation protocol in order to then analyze the distinctions between these and what we have constructed.
1 Introduction
We use words to talk about the world. Therefore, to understand what words mean, we must have a prior explication of how we view the world. – Hobbs (1987)
Researchers in Artificial Intelligence and (Computational) Linguistics have long cited the requirement of common-sense knowledge in language understanding.[1]

Sam bought a new clock ; The clock runs
Dave found an axe in his garage ; A car is parked in the garage
Tom was accidentally shot by his teammate in the army ; The teammate dies
Two friends were in a heated game of checkers ; A person shoots the checkers
My friends and I decided to go swimming in the ocean ; The ocean is carbonated

Figure 1: Examples of common-sense inference ranging from very likely, likely, plausible, technically possible, to impossible.

This knowledge is viewed as a key component in filling in the gaps between the telegraphic style of natural language statements. We are able to convey considerable information in a relatively sparse channel, presumably owing to a partially shared model at the start of any discourse.[2]

Common-sense inference – inferences based on common-sense knowledge – is possibilistic: things everyone more or less would expect to hold in a given context, but without the necessary strength of logical entailment.[3]

[1] Schank (1975): It has been apparent ... within ... natural language understanding ... that the eventual limit to our solution ... would be our ability to characterize world knowledge.
[2] McCarthy (1959): a program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows.
[3] Many of the bridging inferences of Clark (1975) make use of common-sense knowledge, such as the following example of “Probable part”: I walked into the room. The windows looked out to the bay. To resolve the definite reference the windows, one needs to know that rooms have windows is probable.

Transactions of the Association for Computational Linguistics, vol. 5, pp. 379–395, 2017. Action Editor: Mark Steedman. Submission batch: 12/2016; Revision batch: 3/2017; Published 11/2017. © 2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Because natural language corpora exhibit human reporting bias (Gordon and Van Durme, 2013), systems that derive knowledge exclusively from such corpora may be more accurately considered models of language, rather than of the
world (Rudinger et al., 2015). Facts such as “A person walking into a room is very likely to be blinking and breathing” are usually unstated in text, so their real-world likelihoods do not align to language model probabilities.[4] We would like to have systems capable of reading a sentence that describes a real-world situation and inferring how likely other statements about that situation are to hold true in the real world. This capability is subtly but crucially distinct from the ability to predict other sentences reported in the same text, as a language model may be trained to do.
We therefore propose a model of knowledge acquisition based on first deriving possibilistic statements from text. As the relative frequency of these statements suffers from the aforementioned reporting bias, we then follow up with human annotation of derived examples. Since we are initially uncertain about the real-world likelihood of the derived common-sense knowledge holding in any particular context, we pair it with various grounded contexts and present it to humans for their own assessment. As these examples vary in assessed plausibility, we propose the task of ordinal common-sense inference, which embraces a wider set of natural conclusions arising from language comprehension (see Fig 1).
In what follows, we describe prior efforts in common-sense and textual inference (§2). We then state our position on how ordinal common-sense inference should be defined (§3), and detail our own framework for large-scale extraction and abstraction, along with a crowdsourcing protocol for assessment (§4). This includes a novel neural model for forward generation of textual inference statements. Together these methods are applied to contexts derived from various prior textual inference resources, resulting in the JHU Ordinal Common-sense Inference (JOCI) corpus, a large collection of diverse common-sense inference examples, judged to hold with varying levels of subjective likelihood (§5). We provide baseline results (§6) for prediction on the JOCI corpus.[5]

[4] For further background see discussions by Van Durme (2010), Gordon and Van Durme (2013), Rudinger et al. (2015) and Misra et al. (2016).
[5] The JOCI corpus is released freely at: http://decomp.net/.
2 Background
Mining Common Sense Building large collec-
tions of common-sense knowledge can be done
manually via professionals (Hobbs and Navarretta,
1993), but at considerable cost in terms of time and
expense (Miller, 1995; Lenat, 1995; Baker et al.,
1998; Friedland et al., 2004). Efforts have pursued
volunteers (Singh, 2002; Havasi et al., 2007) and
games with a purpose (Chklovski, 2003), but are
still left fully reliant on human labor. Many have
pursued automating the process, such as in expand-
ing lexical hierarchies (Hearst, 1992; Snow et al.,
2006), constructing inference patterns (Lin and Pan-
tel, 2001; Berant et al., 2011), reading reference
materials (Richardson et al., 1998; Suchanek et al.,
2007), mining search engine query logs (Paşca and
Van Durme, 2007), and most relevant here: abstract-
ing from instance-level predications discovered in
descriptive texts (Schubert, 2002; Liakata and Pul-
man, 2002; Clark et al., 2003; Banko and Etzioni,
2007). In this article we are concerned with knowl-
edge mining for purposes of seeding a text genera-
tion process (constructing common-sense inference
examples).
Common-sense Tasks Many textual inference
tasks have been designed to require some de-
gree of common-sense knowledge, e.g., the Wino-
grad Schema Challenge discussed by Levesque et
al. (2011). The data for these tasks are either
smaller, carefully constructed evaluation sets by pro-
fessionals, following efforts like the FRACAS test
suite (Cooper et al., 1996), or they rely on crowd-
sourced elicitation (Bowman et al., 2015). Crowd-
sourcing is scalable, but elicitation protocols can
lead to biased responses unlikely to contain a wide
range of possible common-sense inferences. Humans can generally agree on the plausibility of a wide range of possible inference pairs, but they are not likely to generate them from an initial prompt.[6]

[6] McRae et al. (2005): Features such as <is larger than a tulip> or <moves faster than an infant>, for example, although logically possible, do not occur in [human responses] [...] Although people are capable of verifying that a <dog is larger than a pencil>.

The construction of SICK (Sentences Involving Compositional Knowledge) made use of existing paraphrastic sentence pairs (descriptions by different people of the same image), which were modified through a series of rule-based transformations, then judged by humans (Marelli et al., 2014). As with SICK, we rely on humans only for judging provided examples, rather than elicitation of text. Unlike SICK, our generation is based on a process targeted specifically at common sense (see §4.1.1).
Plausibility Researchers in psycholinguistics
have explored a notion of plausibility in human
sentence processing, where, for instance, arguments
to predicates are intuitively more or less “plausible”
as fillers to different thematic roles, as reflected in
human reading times. For example, McRae et al.
(1998) looked at manipulations such as:
(a) The boss hired by the corporation was per-
fect for the job.
(b) The applicant hired by the corporation was
perfect for the job.
where the plausibility of a boss being the agent – as
compared to patient – of the predicate hired might be
measured by looking at delays in reading time in the
words following the predicate. This measurement is
then contrasted with the timing observed in the same
positions in (b).[7]
Rather than measuring according to predictions
such as human reading times, here we ask anno-
tators explicitly to judge plausibility on a 5-point
ordinal scale (See §3). Further, our effort might
be described in this setting as conditional plausibility,[8] where plausibility judgments for a given sentence are expected to be dependent on preceding context.
sibility is an interesting avenue of potential future
work, perhaps through the measurement of human
reading times when using prompts derived from our
ordinal common-sense inference examples. Computational modeling of (unconditional) semantic plausibility has been explored by those such as Padó et
Textual Entailment A multi-year source of tex-
tual inference examples were generated under the
Recognizing Textual Entailment (RTE) Challenges,
introduced by Dagan et al. (2006):
[7] This notion of thematic plausibility is then related to the notion of verb-argument selectional preference (Zernik, 1992; Resnik, 1993; Clark and Weir, 1999), and sortal (in)correctness (Thomason, 1972).
[8] Thanks to the anonymous reviewer for this connection.
We say that T entails H if, typically, a human
reading T would infer that H is most likely
true. This somewhat informal definition is
based on (and assumes) common human un-
derstanding of language as well as common
background knowledge.
This definition strayed from the more strict notion
of entailment as used by linguistic semanticists, such
as those involved with FRACAS. While Giampic-
colo et al. (2008) extended binary RTE with an “un-
known” category, the entailment community has pri-
marily focused on issues such as “paraphrase” and
“monotonicity”. An example of this is the Natural
Logic implementation of MacCartney and Manning
(2007).
Language understanding in context requires not only understanding the entailments of a sentence, but also its plausible inferences, i.e., the new posterior on the world after reading the sentence. A new sentence in a discourse is almost never entailed by another sentence in the discourse, because such a sentence would add no new information. To successfully process a discourse, a system needs some understanding of what new information can, possibly or plausibly, be added to the discourse. Collecting sentence pairs with ordinal entailment connections is potentially useful for improving and testing the language understanding capabilities needed for applications like storytelling.
Garrette et al. (2011) and Beltagy et al. (2017)
treated textual entailment as probabilistic logical in-
ference in Markov Logic Networks (Richardson and
Domingos, 2006). However, the notion of probabil-
ity in their entailment task has a subtle distinction
from our problem of common-sense inference. The
probability of being an entailment given by a proba-
bilistic model trained for a binary classification (be-
ing an entailment or not) is not necessarily the same
as the likelihood of an inference being true. For ex-
ample:
T: A person flips a coin.
H: That flip comes up heads.
No human reading T should infer that H is true.
A model trained to make ordinal predictions should
say: “plausible, with probability 1.0”, whereas a
model trained to make binary entailed/not-entailed
predictions should say: “not entailed, with probabil-
ity 1.0”. The following example exhibits the same
property:
T: An animal eats food.
H: A person eats food.
Again, with high confidence, H is plausible; and,
with high confidence, it is also not entailed.
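The contrast can be made concrete with a toy sketch. The label distributions below, for the coin-flip pair above, are purely illustrative stand-ins, not outputs of any trained model: both models are maximally confident, yet they report different things.

```python
# T: "A person flips a coin."  H: "That flip comes up heads."
# Illustrative (hand-written) output distributions for two kinds of model.

# A confident ordinal model places all its mass on "plausible" ...
ordinal_probs = {"impossible": 0.0, "technically possible": 0.0,
                 "plausible": 1.0, "likely": 0.0, "very likely": 0.0}

# ... while a confident binary RTE model calls the very same pair "not entailed".
binary_probs = {"entailed": 0.0, "not entailed": 1.0}

ordinal_label = max(ordinal_probs, key=ordinal_probs.get)
binary_label = max(binary_probs, key=binary_probs.get)

print(ordinal_label)  # plausible
print(binary_label)   # not entailed
```

Both predictions are correct for their respective tasks, which is exactly why the probability of entailment cannot be read as the likelihood of an inference holding.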
Non-entailing Inference Of the various non-
“entailment” textual inference tasks, a few are most
salient here. Agirre et al. (2012) piloted a Textual
Similarity evaluation which has been refined in sub-
sequent years. Systems produce scalar values corre-
sponding to predictions of how similar the meaning
is between two provided sentences, e.g., the follow-
ing pair from SICK was judged very similar (4.2 out
of 5), while also being a contradiction: There is no
biker jumping in the air and A lone biker is jump-
ing in the air. The ordinal approach we advocate for
relies on a graded notion, like textual similarity.
The Choice of Plausible Alternative (COPA)
task (Roemmele et al., 2011) was a reaction to RTE,
similarly motivated to probe a system’s ability to un-
derstand inferences that are not strictly entailed. A
single context was provided, with two alternative in-
ferences, and a system had to judge which was more
plausible. The COPA dataset was manually elicited,
and is not large; we discuss this data further in §5.
The Narrative Cloze task (Chambers and Juraf-
sky, 2008) requires a system to score candidate in-
ferences as to how likely they are to appear in a
document that also included the provided context.
Many such inferences are then not strictly entailed
by the context. Further, the Cloze task gives the ben-
efit of being able to generate very large numbers of
examples automatically by simply occluding parts
of existing documents and asking a system to pre-
dict what is missing. The LAMBADA dataset (Pa-
perno et al., 2016) is akin to our strategy for auto-
matic generation followed by human filtering, but
for Cloze examples. As our concern is with infer-
ences that are often true but never stated in a doc-
ument, this approach is not viable here. The ROC-
Stories corpus (Mostafazadeh et al., 2016) elicited
a more “plausible” collection of documents in or-
der to retain the narrative Cloze in the context of
common-sense inference. The ROCStories corpus
can be viewed as an extension of the idea behind
the COPA corpus, done at a larger scale with crowd-
sourcing, and with multi-sentence contexts; we con-
sider this dataset in §5.
Alongside the narrative Cloze, Pichotta and
Mooney (2016) made use of a 5-point Likert scale
(very likely to very unlikely) as a secondary evalu-
ation of various script induction techniques. While
they were concerned with measuring their ability to
generate very likely inferences, here we are inter-
ested in generating a wide swath of inference candi-
dates, including those that are impossible.
3 Ordinal Common-sense Inference
Our goal is a system that can perform speculative,
common-sense inference as part of understanding
language. Based on the observed shortfalls of prior
work, we propose the notion of Ordinal Common-
sense Inference (OCI). OCI embraces the notion of
Dagan et al. (2006), in that we are concerned with
human judgments of epistemic modality.[9]
As agreed by many linguists, modality in natural language is a continuous category, but speakers are able to map areas of this axis into discrete values (Lyons, 1977; Horn, 1989; de Haan, 1997) – Saurí and Pustejovsky (2009)
According to Horn (1989), there are two scales of epistemic modality which differ in polarity (positive vs. negative polarity): ⟨certain, likely, possible⟩ and ⟨impossible, unlikely, uncertain⟩. The Square of Opposition (SO) (Fig 2) illustrates the logical relations holding between values in the two scales. Based on these logical relations, we can form an exhaustive set of epistemic modals: ⟨very likely, likely, possible, impossible⟩, where ⟨very likely, likely, possible⟩ lie on a single, positive Horn scale, and impossible, a complementary concept from the corresponding negative Horn scale, completes the set. In this paper, we further replace the value possible with the more fine-grained values technically possible and plausible. This results in a 5-point scale of likelihood: ⟨very likely, likely, plausible, technically possible, impossible⟩. The OCI task definition directly embraces subjective likelihood on such an ordinal scale. Humans are presented with a context C and asked whether a provided hypothesis H is very likely, likely, plausible, technically possible, or impossible. Furthermore, an important part of this process is the generation of H by automatic methods, which seeks to avoid the elicitation bias of many prior works.

[9] Epistemic modality: the likelihood that (some aspect of) a certain state of affairs is/has been/will be true (or false) in the context of the possible world under consideration.
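For concreteness, the 5-point scale can be encoded as an ordered label set. The enum and the median-based aggregation below are a minimal sketch of ours; the aggregation rule is our illustration of one way to combine annotator judgments, not the paper's annotation protocol.

```python
from enum import IntEnum

class OrdinalLabel(IntEnum):
    """The 5-point subjective-likelihood scale of the OCI task."""
    IMPOSSIBLE = 1
    TECHNICALLY_POSSIBLE = 2
    PLAUSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5

def aggregate(judgments):
    """Take the median judgment: this respects the ordering of the
    scale without assuming the labels are evenly spaced."""
    ranked = sorted(judgments)
    return ranked[len(ranked) // 2]

# Three hypothetical annotator judgments for one context/hypothesis pair:
labels = [OrdinalLabel.LIKELY, OrdinalLabel.PLAUSIBLE, OrdinalLabel.VERY_LIKELY]
print(aggregate(labels).name)  # LIKELY
```

Using an `IntEnum` makes the ordinal structure explicit: labels compare and sort by likelihood, which is what distinguishes this task from unordered classification.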
[Figure 2: The Square of Opposition (SO) for epistemic modals (Saurí and Pustejovsky, 2009). The positive scale (certain, likely, possible) and the negative scale (impossible, unlikely, uncertain) occupy the square: certain (A) and impossible (E) are contraries, possible (I) and uncertain (O) are subcontraries, and the diagonals are contradictories.][10]
4 Framework for collecting OCI corpus
We now describe our framework for collecting ordi-
nal common-sense inference examples. It is natural
to collect this data in two stages. In the first stage
(§4.1), we automatically generate inference candi-
dates given some context. We propose two broad
approaches using either general world knowledge or
neural methods. In the second stage (§4.2), we an-
notate these candidates with ordinal labels.
4.1 Generation of Common-sense Inference
Candidates
4.1.1 Generation based on World Knowledge
Our motivation for this approach was first intro-
duced by Schubert (2002):
There is a largely untapped source of general
knowledge in texts, lying at a level beneath the
explicit assertional content. This knowledge
consists of relationships implied to be possi-
ble in the world, or, under certain conditions,
implied to be normal or commonplace in the
world.
Following Schubert (2002) and Van Durme and
Schubert (2008), we define an approach for ab-
stracting over explicit assertions derived from cor-
pora, leading to a large-scale collection of general
possibilistic statements. As shown in Fig 3, this
approach generates common-sense inference candidates in four steps: (a) extracting propositions with predicate-argument structures from texts, (b) abstracting over propositions to generate templates for concepts, (c) deriving properties of concepts via different strategies, and (d) generating possibilistic hypotheses from contexts.

[10] “Contradictories”: exhaustive and mutually exclusive conditions. “Contraries”: non-exhaustive and mutually exclusive. “Subcontraries”: exhaustive and non-mutually exclusive.
[Figure 3: Generating common-sense inferences based on general world knowledge. The pipeline shows (a) extraction of a predicate-argument structured proposition ("[John] borrowed [the books] from [the library]") from plain text; (b) abstraction to propositional templates ("____ borrow book from library", "person borrow ____ from library", "person borrow book from ____") over WordNet concepts (person.n.01, book.n.01, library.n.01); (c) property derivation for concepts such as book.n.01 via a decision tree over features like "person subscribe to ____" and "person borrow ____ from library"; and (d) inference generation from a context ("The professor recommended [books] for this course."), yielding the hypothesis "A person borrows the books from a library."]
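Steps (a)-(b) of the pipeline can be sketched as follows. The toy `CONCEPTS` lookup stands in for a WordNet synset mapping, the preposition is hard-coded for this one proposition, and the function names are our illustration, not the authors' implementation.

```python
# Toy stand-in for a WordNet lookup: argument head -> (lemma, synset id).
CONCEPTS = {"John": ("person", "person.n.01"),
            "books": ("book", "book.n.01"),
            "library": ("library", "library.n.01")}

def abstract(pred, args):
    """Abstract a pred-arg proposition into propositional templates:
    blank out each argument slot in turn, abstracting the remaining
    arguments to concept lemmas, and pair each template with the
    synset of the blanked slot."""
    out = []
    for i in range(len(args)):
        slots = ["____" if j == i else CONCEPTS[a][0] for j, a in enumerate(args)]
        out.append(("{} {} {} from {}".format(slots[0], pred, slots[1], slots[2]),
                    CONCEPTS[args[i]][1]))
    return out

# "[John] borrowed [the books] from [the library]" from Fig 3:
for template, concept in abstract("borrow", ["John", "books", "library"]):
    print(template, "|", concept)
```

Run on the Fig 3 proposition, this yields the three templates shown there, e.g. "person borrow ____ from library" paired with book.n.01; step (c) would then learn which other concepts can fill each blank.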
(a) Extracting propositions: First we extract a
large set of propositions with predicate-argument
structures from noun phrases and clauses, under
which general world presumptions often lie. To
achieve this goal, we use PredPatt[11] (White et al., 2016; Zhang et al., 2017), which defines a frame-

[11] https://github.com/hltcoe/PredPatt