Ordinal Common-sense Inference

05 Nov 2017 - Transactions of the Association for Computational Linguistics (MIT Press) - Vol. 5, Iss. 1, pp. 379-395
TL;DR: The authors propose an evaluation of automated common-sense inference based on an extension of recognizing textual entailment: predicting ordinal human responses on the subjective likelihood of an inference holding in a given context.
Abstract: Humans have the capacity to draw common-sense inferences from natural language: various things that are likely but not certain to hold based on established discourse, and are rarely stated explicitly. We propose an evaluation of automated common-sense inference based on an extension of recognizing textual entailment: predicting ordinal human responses on the subjective likelihood of an inference holding in a given context. We describe a framework for extracting common-sense knowledge from corpora, which is then used to construct a dataset for this ordinal entailment task. We train a neural sequence-to-sequence model on this dataset, which we use to score and generate possible inferences. Further, we annotate subsets of previously established datasets via our ordinal annotation protocol in order to then analyze the distinctions between these and what we have constructed.


Ordinal Common-sense Inference
Sheng Zhang
Johns Hopkins University
zsheng2@jhu.edu
Rachel Rudinger
Johns Hopkins University
rudinger@jhu.edu
Kevin Duh
Johns Hopkins University
kevinduh@cs.jhu.edu
Benjamin Van Durme
Johns Hopkins University
vandurme@cs.jhu.edu
Abstract

Humans have the capacity to draw common-sense inferences from natural language: various things that are likely but not certain to hold based on established discourse, and are rarely stated explicitly. We propose an evaluation of automated common-sense inference based on an extension of recognizing textual entailment: predicting ordinal human responses on the subjective likelihood of an inference holding in a given context. We describe a framework for extracting common-sense knowledge from corpora, which is then used to construct a dataset for this ordinal entailment task. We train a neural sequence-to-sequence model on this dataset, which we use to score and generate possible inferences. Further, we annotate subsets of previously established datasets via our ordinal annotation protocol in order to then analyze the distinctions between these and what we have constructed.
1 Introduction
We use words to talk about the world. Therefore, to understand what words mean, we must have a prior explication of how we view the world. (Hobbs, 1987)
Researchers in Artificial Intelligence and (Computational) Linguistics have long cited the requirement of common-sense knowledge in language understanding.[1]

[1] Schank (1975): It has been apparent ... within ... natural language understanding ... that the eventual limit to our solution ... would be our ability to characterize world knowledge.
Sam bought a new clock → The clock runs
Dave found an axe in his garage → A car is parked in the garage
Tom was accidentally shot by his teammate in the army → The teammate dies
Two friends were in a heated game of checkers → A person shoots the checkers
My friends and I decided to go swimming in the ocean → The ocean is carbonated

Figure 1: Examples of common-sense inference ranging from very likely, likely, plausible, technically possible, to impossible.
This knowledge is viewed as a key component in filling in the gaps between the telegraphic style of natural language statements. We are able to convey considerable information in a relatively sparse channel, presumably owing to a partially shared model at the start of any discourse.[2]

Common-sense inference (inference based on common-sense knowledge) is possibilistic: things everyone more or less would expect to hold in a given context, but without the necessary strength of logical entailment.[3] Because natural language corpora exhibit human reporting bias (Gordon and Van Durme, 2013), systems that derive knowledge exclusively from such corpora may be more accurately considered models of language, rather than of the world (Rudinger et al., 2015). Facts such as A person walking into a room is very likely to be blinking and breathing are usually unstated in text, so their real-world likelihoods do not align to language model probabilities.[4] We would like to have systems capable of reading a sentence that describes a real-world situation and inferring how likely other statements about that situation are to hold true in the real world (e.g., Fig 1). This capability is subtly but crucially distinct from the ability to predict other sentences reported in the same text, as a language model may be trained to do.

[2] McCarthy (1959): a program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows.

[3] Many of the bridging inferences of Clark (1975) make use of common-sense knowledge, such as the following example of "Probable part": I walked into the room. The windows looked out to the bay. To resolve the definite reference the windows, one needs to know that rooms have windows is probable.
We therefore propose a model of knowledge acquisition based on first deriving possibilistic statements from text. As the relative frequency of these statements suffers from the aforementioned reporting bias, we then follow up with human annotation of derived examples. Since we are initially uncertain about the real-world likelihood of the derived common-sense knowledge holding in any particular context, we pair it with various grounded contexts and present them to humans for their own assessment. As these examples vary in assessed plausibility, we propose the task of ordinal common-sense inference, which embraces a wider set of natural conclusions arising from language comprehension (see Fig 1).
In what follows, we describe prior efforts in common-sense and textual inference (§2). We then state our position on how ordinal common-sense inference should be defined (§3), and detail our own framework for large-scale extraction and abstraction, along with a crowdsourcing protocol for assessment (§4). This includes a novel neural model for forward generation of textual inference statements. Together these methods are applied to contexts derived from various prior textual inference resources, resulting in the JHU Ordinal Common-sense Inference (JOCI) corpus, a large collection of diverse common-sense inference examples, judged to hold with varying levels of subjective likelihood (§5). We provide baseline results (§6) for prediction on the JOCI corpus.[5]
[4] For further background see discussions by Van Durme (2010), Gordon and Van Durme (2013), Rudinger et al. (2015) and Misra et al. (2016).

[5] The JOCI corpus is released freely at: http://decomp.net/.
2 Background
Mining Common Sense

Building large collections of common-sense knowledge can be done manually via professionals (Hobbs and Navarretta, 1993), but at considerable cost in terms of time and expense (Miller, 1995; Lenat, 1995; Baker et al., 1998; Friedland et al., 2004). Efforts have pursued volunteers (Singh, 2002; Havasi et al., 2007) and games with a purpose (Chklovski, 2003), but are still left fully reliant on human labor. Many have pursued automating the process, such as in expanding lexical hierarchies (Hearst, 1992; Snow et al., 2006), constructing inference patterns (Lin and Pantel, 2001; Berant et al., 2011), reading reference materials (Richardson et al., 1998; Suchanek et al., 2007), mining search engine query logs (Paşca and Van Durme, 2007), and most relevant here: abstracting from instance-level predications discovered in descriptive texts (Schubert, 2002; Liakata and Pulman, 2002; Clark et al., 2003; Banko and Etzioni, 2007). In this article we are concerned with knowledge mining for purposes of seeding a text generation process (constructing common-sense inference examples).
Common-sense Tasks

Many textual inference tasks have been designed to require some degree of common-sense knowledge, e.g., the Winograd Schema Challenge discussed by Levesque et al. (2011). The data for these tasks are either smaller, carefully constructed evaluation sets by professionals, following efforts like the FRACAS test suite (Cooper et al., 1996), or they rely on crowdsourced elicitation (Bowman et al., 2015). Crowdsourcing is scalable, but elicitation protocols can lead to biased responses unlikely to contain a wide range of possible common-sense inferences. Humans can generally agree on the plausibility of a wide range of possible inference pairs, but they are not likely to generate them from an initial prompt.[6]

The construction of SICK (Sentences Involving Compositional Knowledge) made use of existing paraphrastic sentence pairs (descriptions by different people of the same image), which were modified through a series of rule-based transformations then judged by humans (Marelli et al., 2014). As with SICK, we rely on humans only for judging provided examples, rather than elicitation of text. Unlike SICK, our generation is based on a process targeted specifically at common sense (see §4.1.1).

[6] McRae et al. (2005): Features such as <is larger than a tulip> or <moves faster than an infant>, for example, although logically possible, do not occur in [human responses] [...] Although people are capable of verifying that a <dog is larger than a pencil>.
Plausibility

Researchers in psycholinguistics have explored a notion of plausibility in human sentence processing, where, for instance, arguments to predicates are intuitively more or less "plausible" as fillers to different thematic roles, as reflected in human reading times. For example, McRae et al. (1998) looked at manipulations such as:

(a) The boss hired by the corporation was perfect for the job.
(b) The applicant hired by the corporation was perfect for the job.

where the plausibility of a boss being the agent as compared to patient of the predicate hired might be measured by looking at delays in reading time in the words following the predicate. This measurement is then contrasted with the timing observed in the same positions in (b).[7]
Rather than measuring according to predictions such as human reading times, here we ask annotators explicitly to judge plausibility on a 5-point ordinal scale (see §3). Further, our effort might be described in this setting as conditional plausibility,[8] where plausibility judgments for a given sentence are expected to be dependent on preceding context. Further exploration of conditional plausibility is an interesting avenue of potential future work, perhaps through the measurement of human reading times when using prompts derived from our ordinal common-sense inference examples. Computational modeling of (unconditional) semantic plausibility has been explored by those such as Padó et al. (2009), Erk et al. (2010) and Sayeed et al. (2015).

[7] This notion of thematic plausibility is then related to the notion of verb-argument selectional preference (Zernik, 1992; Resnik, 1993; Clark and Weir, 1999), and sortal (in)correctness (Thomason, 1972).

[8] Thanks to the anonymous reviewer for this connection.
Textual Entailment

A multi-year source of textual inference examples was generated under the Recognizing Textual Entailment (RTE) Challenges, introduced by Dagan et al. (2006):
We say that T entails H if, typically, a human reading T would infer that H is most likely true. This somewhat informal definition is based on (and assumes) common human understanding of language as well as common background knowledge.

This definition strayed from the more strict notion of entailment as used by linguistic semanticists, such as those involved with FRACAS. While Giampiccolo et al. (2008) extended binary RTE with an "unknown" category, the entailment community has primarily focused on issues such as "paraphrase" and "monotonicity". An example of this is the Natural Logic implementation of MacCartney and Manning (2007).
Language understanding in context is not only understanding the entailments of a sentence, but also the plausible inferences of the sentence, i.e. the new posterior on the world after reading the sentence. A new sentence in a discourse is almost never entailed by another sentence in the discourse, because such a sentence would add no new information. In order to successfully process a discourse, there needs to be some understanding of what new information can be, possibly or plausibly, added to the discourse. Collecting sentence pairs with ordinal entailment connections is potentially useful for improving and testing these language understanding capabilities that would be needed by algorithms for applications like storytelling.
Garrette et al. (2011) and Beltagy et al. (2017) treated textual entailment as probabilistic logical inference in Markov Logic Networks (Richardson and Domingos, 2006). However, the notion of probability in their entailment task has a subtle distinction from our problem of common-sense inference. The probability of being an entailment given by a probabilistic model trained for a binary classification (being an entailment or not) is not necessarily the same as the likelihood of an inference being true. For example:
T: A person flips a coin.
H: That flip comes up heads.
No human reading T should infer that H is true. A model trained to make ordinal predictions should say: "plausible, with probability 1.0", whereas a model trained to make binary entailed/not-entailed predictions should say: "not entailed, with probability 1.0". The following example exhibits the same property:
T: An animal eats food.
H: A person eats food.
Again, with high confidence, H is plausible; and,
with high confidence, it is also not entailed.
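To make the contrast concrete, the same premise/hypothesis pair can receive different gold labels depending on which task is being annotated. The following Python snippet is purely illustrative; the field names and label strings are our own assumptions, not a released data format.

# Illustrative only: the same T/H pair labeled under binary RTE vs. the
# ordinal common-sense inference task. Field names are assumptions.
coin_flip = {
    "T": "A person flips a coin.",
    "H": "That flip comes up heads.",
    "rte_label": "not entailed",      # H is not guaranteed by T
    "ordinal_label": "plausible",     # yet H is clearly plausible given T
}

animal_eats = {
    "T": "An animal eats food.",
    "H": "A person eats food.",
    "rte_label": "not entailed",
    "ordinal_label": "plausible",
}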
Non-entailing Inference

Of the various non-"entailment" textual inference tasks, a few are most salient here. Agirre et al. (2012) piloted a Textual Similarity evaluation which has been refined in subsequent years. Systems produce scalar values corresponding to predictions of how similar the meaning is between two provided sentences, e.g., the following pair from SICK was judged very similar (4.2 out of 5), while also being a contradiction: There is no biker jumping in the air and A lone biker is jumping in the air. The ordinal approach we advocate for relies on a graded notion, like textual similarity.

The Choice of Plausible Alternatives (COPA) task (Roemmele et al., 2011) was a reaction to RTE, similarly motivated to probe a system's ability to understand inferences that are not strictly entailed. A single context was provided, with two alternative inferences, and a system had to judge which was more plausible. The COPA dataset was manually elicited, and is not large; we discuss this data further in §5.
The Narrative Cloze task (Chambers and Jurafsky, 2008) requires a system to score candidate inferences as to how likely they are to appear in a document that also included the provided context. Many such inferences are then not strictly entailed by the context. Further, the Cloze task gives the benefit of being able to generate very large numbers of examples automatically by simply occluding parts of existing documents and asking a system to predict what is missing. The LAMBADA dataset (Paperno et al., 2016) is akin to our strategy for automatic generation followed by human filtering, but for Cloze examples. As our concern is with inferences that are often true but never stated in a document, this approach is not viable here. The ROCStories corpus (Mostafazadeh et al., 2016) elicited a more "plausible" collection of documents in order to retain the narrative Cloze in the context of common-sense inference. The ROCStories corpus can be viewed as an extension of the idea behind the COPA corpus, done at a larger scale with crowdsourcing, and with multi-sentence contexts; we consider this dataset in §5.
Alongside the narrative Cloze, Pichotta and Mooney (2016) made use of a 5-point Likert scale (very likely to very unlikely) as a secondary evaluation of various script induction techniques. While they were concerned with measuring their ability to generate very likely inferences, here we are interested in generating a wide swath of inference candidates, including those that are impossible.
3 Ordinal Common-sense Inference
Our goal is a system that can perform speculative, common-sense inference as part of understanding language. Based on the observed shortfalls of prior work, we propose the notion of Ordinal Common-sense Inference (OCI). OCI embraces the notion of Dagan et al. (2006), in that we are concerned with human judgments of epistemic modality.[9]

As agreed by many linguists, modality in natural language is a continuous category, but speakers are able to map areas of this axis into discrete values (Lyons, 1977; Horn, 1989; de Haan, 1997). (Saurí and Pustejovsky, 2009)

According to Horn (1989), there are two scales of epistemic modality which differ in polarity (positive vs. negative polarity): ⟨certain, likely, possible⟩ and ⟨impossible, unlikely, uncertain⟩. The Square of Opposition (SO) (Fig 2) illustrates the logical relations holding between values in the two scales. Based on their logical relations, we can make a set of exhaustive epistemic modals: ⟨very likely, likely, possible, impossible⟩, where ⟨very likely, likely, possible⟩ lie on a single, positive Horn scale, and impossible, a complementary concept from the corresponding negative Horn scale, completes the set. In this paper, we further replace the value possible by the more fine-grained values (technically possible and plausible). This results in a 5-point scale of likelihood: ⟨very likely, likely, plausible, technically possible, impossible⟩. The OCI task definition directly embraces subjective likelihood on such an ordinal scale. Humans are presented with a context C and asked whether a provided hypothesis H is very likely, likely, plausible, technically possible, or impossible. Furthermore, an important part of this process is the generation of H by automatic methods, which seeks to avoid the elicitation bias of many prior works.

[9] Epistemic modality: the likelihood that (some aspect of) a certain state of affairs is/has been/will be true (or false) in the context of the possible world under consideration.
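To make the label set concrete, the following snippet shows one possible, purely illustrative encoding of the 5-point ordinal scale as integers, together with an example record in that form; the variable names and record layout are our own assumptions, not the released JOCI format.

# Illustrative encoding of the 5-point ordinal likelihood scale used in the
# OCI task. The integer mapping and record layout are assumptions for
# exposition, not the released JOCI data format.
ORDINAL_SCALE = [
    "impossible",            # 1
    "technically possible",  # 2
    "plausible",             # 3
    "likely",                # 4
    "very likely",           # 5
]
LABEL_TO_SCORE = {label: i + 1 for i, label in enumerate(ORDINAL_SCALE)}

example = {
    "context": "Two friends were in a heated game of checkers.",
    "hypothesis": "A person shoots the checkers",
    "label": "technically possible",   # cf. the fourth example in Fig 1
}
print(LABEL_TO_SCORE[example["label"]])    # -> 2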
Figure 2: The Square of Opposition (SO) for epistemic modals (Saurí and Pustejovsky, 2009): the positive scale certain / likely / possible and the negative scale impossible / unlikely / uncertain, with contradictory, contrary, and subcontrary relations holding between values on the two scales (the A, E, I, O corners).[10]

[10] "Contradictories": exhaustive and mutually exclusive conditions. "Contraries": non-exhaustive and mutually exclusive. "Subcontraries": exhaustive and non-mutually exclusive.
4 Framework for collecting OCI corpus
We now describe our framework for collecting ordinal common-sense inference examples. It is natural to collect this data in two stages. In the first stage (§4.1), we automatically generate inference candidates given some context. We propose two broad approaches using either general world knowledge or neural methods. In the second stage (§4.2), we annotate these candidates with ordinal labels.
4.1 Generation of Common-sense Inference Candidates
4.1.1 Generation based on World Knowledge
Our motivation for this approach was first introduced by Schubert (2002):

There is a largely untapped source of general knowledge in texts, lying at a level beneath the explicit assertional content. This knowledge consists of relationships implied to be possible in the world, or, under certain conditions, implied to be normal or commonplace in the world.
Following Schubert (2002) and Van Durme and Schubert (2008), we define an approach for abstracting over explicit assertions derived from corpora, leading to a large-scale collection of general possibilistic statements. As shown in Fig 3, this approach generates common-sense inference candidates in four steps: (a) extracting propositions with predicate-argument structures from texts, (b) abstracting over propositions to generate templates for concepts, (c) deriving properties of concepts via different strategies, and (d) generating possibilistic hypotheses from contexts.
Figure 3: Generating common-sense inferences based on general world knowledge. The figure traces the pipeline from plain text ("John borrowed the books from the library.") through (a) extraction of a predicate-argument structured proposition ([John] borrowed [the books] from [the library]); (b) abstraction to an abstracted proposition ([person] borrow [book] from [library], with arguments mapped to person.n.01, book.n.01 and library.n.01) and to propositional templates (____ borrow book from library; person borrow ____ from library; person borrow book from ____); (c) property derivation over concepts, using a decision tree with features such as "person buy ____", "person subscribe to ____" and "person borrow ____ from library" over concepts like publication.n.01, collection.n.02, magazine.n.01 and book.n.01; and (d) inference generation, where hypothesis generation (approximation and verbalization) applied to the context "The professor recommended [books] for this course." yields the inference "A person borrows the books from a library."
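To make steps (a) and (b) concrete, the following is a minimal, illustrative Python sketch of how an extracted predicate-argument proposition might be abstracted into propositional templates using WordNet synsets via NLTK. It is an assumed approximation in the spirit of Fig 3, not the authors' implementation; the Proposition class, the first-sense and capitalization heuristics, and the template format are our own simplifications.

# Illustrative sketch (not the authors' implementation) of steps (a)-(b) in
# Fig 3: abstracting a predicate-argument proposition into propositional
# templates via WordNet synsets.
# Requires NLTK with the WordNet corpus installed: nltk.download('wordnet')
from dataclasses import dataclass
from typing import List, Tuple

from nltk.corpus import wordnet as wn


@dataclass
class Proposition:
    """A predicate-argument structured proposition (e.g. as produced by step (a))."""
    pattern: str     # predicate frame with argument slots, e.g. "{0} borrow {1} from {2}"
    args: List[str]  # argument head words, e.g. ["John", "books", "library"]


def abstract_arg(head: str) -> str:
    """Map an argument head to a WordNet synset name.

    Naive heuristics for illustration only: capitalized tokens are treated as
    person names (a real pipeline would use NER / argument typing); otherwise
    we take the first noun sense of the lemmatized head.
    """
    if head[0].isupper():
        return "person.n.01"
    lemma = wn.morphy(head.lower(), wn.NOUN) or head.lower()
    synsets = wn.synsets(lemma, pos=wn.NOUN)
    return synsets[0].name() if synsets else lemma


def propositional_templates(prop: Proposition) -> Tuple[str, List[str]]:
    """Return the abstracted proposition and one template per blanked-out argument."""
    concepts = [abstract_arg(a) for a in prop.args]   # ['person.n.01', 'book.n.01', 'library.n.01']
    words = [c.split(".")[0] for c in concepts]       # ['person', 'book', 'library']
    abstracted = prop.pattern.format(*words)
    templates = [
        prop.pattern.format(*["____" if j == i else w for j, w in enumerate(words)])
        for i in range(len(words))
    ]
    return abstracted, templates


prop = Proposition(pattern="{0} borrow {1} from {2}",
                   args=["John", "books", "library"])
abstracted, templates = propositional_templates(prop)
# abstracted -> 'person borrow book from library'
# templates  -> ['____ borrow book from library',
#                'person borrow ____ from library',
#                'person borrow book from ____']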
(a) Extracting propositions: First we extract a large set of propositions with predicate-argument structures from noun phrases and clauses, under which general world presumptions often lie. To achieve this goal, we use PredPatt[11] (White et al., 2016; Zhang et al., 2017), which defines a framework

[11] https://github.com/hltcoe/PredPatt
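As an illustration of step (d) above (inference generation), the following sketch fills a propositional template with a noun phrase drawn from a context sentence and verbalizes the result as a hypothesis. It is an assumed, simplified stand-in for the approximation and verbalization stages named in Fig 3; the helper name and the article/agreement heuristics are our own.

# Illustrative sketch of step (d) in Fig 3: generating a possibilistic
# hypothesis by filling a propositional template with a context noun phrase.
# The template format and verbalization heuristics are assumptions.

def verbalize(template: str, context_np: str) -> str:
    """Fill the blank slot with the context NP and lightly verbalize the result."""
    tokens = template.replace("____", context_np).split()
    # Naive heuristics: indefinite articles on the initial/final bare nouns and
    # third-person singular agreement on the verb.
    if tokens[0] == "person":
        tokens[0] = "A person"
        tokens[1] = tokens[1] + "s"        # borrow -> borrows
    if tokens[-1] == "library":
        tokens[-1] = "a library"
    sentence = " ".join(tokens)
    if sentence and sentence[0].islower():
        sentence = sentence[0].upper() + sentence[1:]
    return sentence + "."


context_np = "the books"                       # from "The professor recommended [books] for this course."
template = "person borrow ____ from library"   # produced by steps (a)-(c)
print(verbalize(template, context_np))
# -> "A person borrows the books from a library."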


References
Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations

Proceedings Article
01 Jan 2015
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

20,027 citations

Journal ArticleDOI
TL;DR: WordNet provides a more effective combination of traditional lexicographic information and modern computing, and is an online lexical database designed for use under program control.
Abstract: Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].

15,068 citations

Posted Content
TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

14,077 citations

Proceedings ArticleDOI
08 May 2007
TL;DR: YAGO as discussed by the authors is a light-weight and extensible ontology with high coverage and quality, which includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as HASWONPRIZE).
Abstract: We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as HASWONPRIZE). The facts have been automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships - and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques.

3,710 citations