Proceedings ArticleDOI

Yago: a core of semantic knowledge

08 May 2007-pp 697-706
TL;DR: YAGO as discussed by the authors is a light-weight and extensible ontology with high coverage and quality, which includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as hasWonPrize).
Abstract: We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as hasWonPrize). The facts have been automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships, and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques.

Summary (3 min read)

1.1 Motivation

  • Many applications in modern information technology utilize ontological background knowledge.
  • It would have to comprise not only concepts in the style of WordNet, but also named entities like people, organizations, geographic locations, books, songs, products, etc., and also relations among these such as what-is-located-where, who-was-born-when, who-has-won-which-prize, etc.
  • If such an ontology were available, it could boost the performance of existing applications and also open up the path towards new applications in the Semantic Web era.

1.3 Contributions and Outline

  • This paper presents YAGO, a new ontology that combines high coverage with high quality.
  • Category pages are lists of articles that belong to a specific category (e.g., Zidane is in the category of French football players).
  • To the best of their knowledge, their method is the first approach that accomplishes this unification between WordNet and facts derived from Wikipedia with an accuracy of 97%.
  • The authors observe that the more facts YAGO contains, the better it can be extended.
  • Section 3 describes the sources from which the current YAGO is assembled, namely, Wikipedia and WordNet.

2.1 Structure

  • This makes it possible to express that a certain word refers to a certain entity, as in the following example: "Einstein" means AlbertEinstein.
  • In the YAGO model, relations are entities as well.
  • Common entities that are not classes will be called individuals.
  • Then, an n-ary fact can be represented by a new entity that is linked by these binary relations to all of its arguments (as is proposed for OWL): AlbertEinstein winner EinsteinWonNP1921; NobelPrize prize EinsteinWonNP1921; 1921 time EinsteinWonNP1921.

2.2 Semantics

  • This section will give a model-theoretic semantics to YAGO.
  • The set of common entities C must contain at least the classes entity, class, relation, acyclicTransitiveRelation and classes for all literals (as evident from the following list).
  • Each derivable fact (x, r, y) needs a new fact identifier, which is just fx,r,y.
  • This makes the canonical base a natural choice to efficiently store a YAGO ontology.

2.3 Relation to Other Formalisms

  • Just as YAGO, RDFS knows the properties domain, range, subClassOf and subPropertyOf (i.e. subRelationOf).
  • These properties have a semantics that is equivalent to that of the corresponding YAGO relations.
  • The authors plan to investigate the relation of YAGO and OWL once OWL 1.1 has been fully established.

3.1 WordNet

  • WordNet is a semantic lexicon for the English language developed at the Cognitive Science Laboratory of Princeton University.
  • WordNet distinguishes between words as literally appearing in texts and the actual senses of the words.
  • Thus, each synset identifies one sense (i.e., semantic concept).
  • WordNet provides relations between synsets such as hypernymy/hyponymy (i.e., the relation between a sub-concept and a super-concept) and holonymy/meronymy (i.e., the relation between a part and the whole); for this paper, the authors focus on hypernyms/hyponyms.

3.2 Wikipedia

  • The authors downloaded the English version of Wikipedia in January 2007, which comprised 1,600,000 articles at that time.
  • Each Wikipedia article is a single Web page and usually describes a single topic.
  • The majority of Wikipedia pages have been manually assigned to one or multiple categories.
  • The page about Albert Einstein, for example, is in the categories German language philosophers, Swiss physicists, and 34 more.
  • The categorization of Wikipedia pages and their link structure are available as SQL tables, so that they can be exploited without parsing the actual Wikipedia articles.
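The exploitation of the SQL dumps can be sketched as follows: pulling (page id, category) pairs out of a dump line with a regular expression instead of a full SQL parser. The two-column excerpt and the page id 736 are invented for illustration; real MediaWiki categorylinks dumps use the same INSERT ... VALUES shape but carry more columns.

```python
import re

# Hypothetical one-line excerpt of a category-membership SQL dump
# (invented values; real dumps have additional columns per row).
dump_line = (
    "INSERT INTO `categorylinks` VALUES "
    "(736,'German_physicists'),(736,'Swiss_physicists');"
)

def category_pairs(line):
    """Extract (page_id, category) pairs from a dump line."""
    return [(int(pid), cat)
            for pid, cat in re.findall(r"\((\d+),'([^']+)'\)", line)]

pairs = category_pairs(dump_line)
```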

4. THE YAGO SYSTEM

  • The authors' system is designed to extract a YAGO ontology from WordNet and Wikipedia.
  • Facts extracted by other techniques (e.g. based on statistical learning) can have smaller confidence values.
  • This gives us a (possibly empty) set of conceptual categories for each Wikipedia page.
  • First, the authors introduce a class for each synset known to WordNet (e.g., city).
  • If the words used to refer to these individuals match the common pattern of a given name and a family name, the authors extract the name components and establish the relations givenNameOf and familyNameOf.
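The name-splitting step can be sketched as below; the pattern "two capitalized words" is our own stand-in for the "common pattern of a given name and a family name", not YAGO's exact rule, and the returned dict merely labels the components.

```python
import re

# Assumed pattern: a given name followed by a family name, both capitalized.
NAME = re.compile(r"^([A-Z][a-z]+) ([A-Z][a-z]+)$")

def name_components(word):
    """Split a name into components if it matches the assumed pattern."""
    m = NAME.match(word)
    if not m:
        return None  # no match: extract nothing, as the text suggests
    given, family = m.groups()
    return {"givenNameOf": given, "familyNameOf": family}

facts = name_components("Albert Einstein")
```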

4.2 YAGO Storage

  • The YAGO model itself is independent of a particular data storage format.
  • The authors maintain a folder for each relation and each folder contains files that list the entity pairs.
  • The authors store only facts that cannot be derived by the rewrite rules of YAGO (see 2.2), so that they store in fact the unique canonical base of the ontology.
  • The table has the simple schema FACTS(factId, arg1, relation, arg2, confidence).
  • For their experiments, the authors used the Oracle version of YAGO.
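The FACTS schema can be tried out directly; SQLite stands in here for the Oracle backend named in the text, and the two sample rows (including a fact about a fact via its identifier) are taken from the paper's running example.

```python
import sqlite3

# Mirror of the schema FACTS(factId, arg1, relation, arg2, confidence).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE FACTS (
    factId     TEXT PRIMARY KEY,
    arg1       TEXT,
    relation   TEXT,
    arg2       TEXT,
    confidence REAL
)""")
con.execute("INSERT INTO FACTS VALUES (?,?,?,?,?)",
            ("#1", "AlbertEinstein", "hasWonPrize", "NobelPrize", 1.0))
con.execute("INSERT INTO FACTS VALUES (?,?,?,?,?)",
            ("#2", "#1", "time", "1921", 1.0))  # a fact about fact #1
row = con.execute(
    "SELECT arg2 FROM FACTS WHERE arg1='AlbertEinstein' "
    "AND relation='hasWonPrize'").fetchone()
```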

4.3 Enriching YAGO

  • An application that adds new facts to the YAGO ontology is required to obey the following protocol.
  • For the disambiguation, the application can make use of the extensive information that YAGO provides for the existing entities: the relations to other entities, the words used to refer to the entities, and the context of the entities, as provided by the context relation.
  • The authors propose to take the maximum, but other options can be considered.
  • If (x, r, y) does not yet exist in the ontology, the application has to add the fact together with a new fact identifier.
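The protocol described in these bullets can be sketched as plain Python: if the fact exists, keep the maximum of the old and new confidence; otherwise add it under a fresh identifier. The dict layout and the naive identifier scheme are our own, not the YAGO system's.

```python
# Existing ontology: fact id -> (arg1, relation, arg2, confidence).
ontology = {"#1": ("AlbertEinstein", "hasWonPrize", "NobelPrize", 0.95)}

def add_fact(ontology, x, r, y, confidence):
    """Add (x, r, y); on a duplicate, take the maximum confidence."""
    for fid, (a, rel, b, conf) in ontology.items():
        if (a, rel, b) == (x, r, y):
            ontology[fid] = (a, rel, b, max(conf, confidence))
            return fid
    fid = "#%d" % (len(ontology) + 1)   # fresh identifier (naive scheme)
    ontology[fid] = (x, r, y, confidence)
    return fid

add_fact(ontology, "AlbertEinstein", "hasWonPrize", "NobelPrize", 0.80)
add_fact(ontology, "AlbertEinstein", "bornInYear", "1879", 0.90)
```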

5.1 Manual evaluation

  • The authors presented randomly selected facts of the ontology to human judges and asked them to assess whether the facts were correct.
  • Since common sense often does not suffice to judge the correctness of YAGO facts, the authors also presented them a snippet of the corresponding Wikipedia page.
  • Furthermore, accuracy can usually be varied at the cost of recall.
  • State-of-the-art taxonomy induction as described in [23] achieves an accuracy of 84%. KnowItAll [9] and KnowItNow [4] are reported to have accuracy rates of 85% and 80%, respectively.
  • With the exception of Cyc (which is not publicly available), the facts of these ontologies are in the hundreds of thousands, whereas the facts of YAGO are in the millions.

5.2 Sample facts

  • In YAGO, the word "Paris" can refer to 71 distinct entities.
  • Preprocessing ensures that words in the query are considered in all their possible meanings.
  • The query algorithms are not in the scope of this paper.
  • Here, the authors only show some sample queries to illustrate the applicability of YAGO (Table 6).

5.3 Enrichment experiment

  • To demonstrate how an application can add new facts to the YAGO ontology, the authors conducted an experiment with the knowledge extraction system Leila [25].
  • Leila is a state-of-the-art system that uses pattern matching on natural language text.
  • This relation holds between a company and the city of its headquarters.
  • For each candidate fact, the company and the city have to be mapped to the respective individuals in YAGO.
  • Hence the authors assume that the more facts and entities YAGO contains, the better it can be extended by new facts.

6. CONCLUSION

  • The authors presented YAGO, a large and extendable ontology of high quality.
  • YAGO contains 1 million entities and 5 million facts – more than any other publicly available formal ontology.
  • YAGO is available in different export formats, including plain text, XML, RDFS and SQL database formats at http://www.mpii.mpg.de/~suchanek/yago.
  • YAGO opens the door to numerous new challenges.


HAL Id: hal-01472497
https://hal.archives-ouvertes.fr/hal-01472497
Submitted on 20 Feb 2017
Yago: A Core of Semantic Knowledge Unifying
WordNet and Wikipedia
Fabian Suchanek, Gjergji M Kasneci, Gerhard M Weikum
To cite this version:
Fabian Suchanek, Gjergji M Kasneci, Gerhard M Weikum. Yago: A Core of Semantic Knowledge
Unifying WordNet and Wikipedia. 16th international conference on World Wide Web, May 2007,
Ban, Canada. pp.697 - 697, �10.1145/1242572.1242667�. �hal-01472497�

YAGO: A Core of Semantic Knowledge
Unifying WordNet and Wikipedia
Fabian M. Suchanek
Max-Planck-Institut
Saarbrücken / Germany
suchanek@mpii.mpg.de

Gjergji Kasneci
Max-Planck-Institut
Saarbrücken / Germany
kasneci@mpii.mpg.de

Gerhard Weikum
Max-Planck-Institut
Saarbrücken / Germany
weikum@mpii.mpg.de
ABSTRACT
We present YAGO, a light-weight and extensible ontology
with high coverage and quality. YAGO builds on entities
and relations and currently contains more than 1 million
entities and 5 million facts. This includes the Is-A hierarchy
as well as non-taxonomic relations between entities (such
as hasWonPrize). The facts have been automatically ex-
tracted from Wikipedia and unified with WordNet, using
a carefully designed combination of rule-based and heuris-
tic methods described in this paper. The resulting knowl-
edge base is a major step beyond WordNet: in quality by
adding knowledge about individuals like persons, organiza-
tions, products, etc. with their semantic relationships and
in quantity by increasing the number of facts by more than
an order of magnitude. Our empirical evaluation of fact cor-
rectness shows an accuracy of about 95%. YAGO is based on
a logically clean model, which is decidable, extensible, and
compatible with RDFS. Finally, we show how YAGO can be
further extended by state-of-the-art information extraction
techniques.
Categories and Subject Descriptors
H.0 [Information Systems]: General
General Terms
Knowledge Extraction, Ontologies
Keywords
Wikipedia, WordNet
1. INTRODUCTION
1.1 Motivation
Many applications in modern information technology uti-
lize ontological background knowledge. This applies above
all to applications in the vision of the Semantic Web, but
there are many other application fields. Machine translation
(e.g. [5]) and word sense disambiguation (e.g. [3]) exploit
lexical knowledge, query expansion uses taxonomies (e.g.
[16, 11, 27]), document classification based on supervised or
semi-supervised learning can be combined with ontologies
(e.g. [14]), and [13] demonstrates the utility of background
knowledge for question answering and information retrieval.
Copyright is held by the International World Wide Web Conference Com-
mittee (IW3C2). Distribution of these papers is limited to classroom use,
and personal use by others.
WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.
Furthermore, ontological knowledge structures play an im-
portant role in data cleaning (e.g., for a data warehouse) [6],
record linkage (aka. entity resolution) [7], and information
integration in general [19].
But the existing applications typically use only a single
source of background knowledge (mostly WordNet [10] or
Wikipedia). They could boost their performance if a huge
ontology with knowledge from several sources were available.
Such an ontology would have to be of high quality, with ac-
curacy close to 100 percent, i.e. comparable in quality to
an encyclopedia. It would have to comprise not only con-
cepts in the style of WordNet, but also named entities like
people, organizations, geographic locations, books, songs,
products, etc., and also relations among these such as what-
is-located-where, who-was-born-when, who-has-won-which-
prize, etc. It would have to be extensible, easily re-usable,
and application-independent. If such an ontology were avail-
able, it could boost the performance of existing applications
and also open up the path towards new applications in the
Semantic Web era.
1.2 Related Work
Knowledge representation is an old field in AI and has
provided numerous models from frames and KL-ONE to
recent variants of description logics and RDFS and OWL
(see [22] and [24]). Numerous approaches have been pro-
posed to create general-purpose ontologies on top of these
representations. One class of approaches focuses on extract-
ing knowledge structures automatically from text corpora.
These approaches use information extraction technologies
that include pattern matching, natural-language parsing,
and statistical learning [25, 9, 4, 1, 23, 20, 8]. These tech-
niques have also been used to extend WordNet by Wikipedia
individuals [21]. Another project along these lines is Know-
ItAll [9], which aims at extracting and compiling instances
of unary and binary predicates on a very large scale,
e.g., as many soccer players as possible or almost all com-
pany/CEO pairs from the business world. Although these
approaches have recently improved the quality of their re-
sults considerably, the quality is still significantly below that
of a man-made knowledge base. Typical results contain
many false positives (e.g., IsA(Aachen Cathedral, City), to
give one example from KnowItAll). Furthermore, obtaining
a recall above 90 percent for a closed domain typically en-
tails a drastic loss of precision in return. Thus, information-
extraction approaches are only of little use for applications
that need near-perfect ontologies (e.g. for automated rea-
soning). Furthermore, they typically do not have an explicit
(logic-based) knowledge representation model.

Due to the quality bottleneck, the most successful and
widely employed ontologies are still man-made. These in-
clude WordNet [10], Cyc or OpenCyc [17], SUMO [18], and
especially domain-specific ontologies and taxonomies such as
SNOMED¹ or the GeneOntology². These knowledge sources
have the advantage of satisfying the highest quality expecta-
tions, because they are manually assembled. However, they
suffer from low coverage, high cost for assembly and quality
assurance, and fast aging. No human-made ontology knows
the most recent Windows version or the latest soccer stars.
1.3 Contributions and Outline
This paper presents YAGO³, a new ontology that com-
bines high coverage with high quality. Its core is assem-
bled from one of the most comprehensive lexicons available
today, Wikipedia. But rather than using information ex-
traction methods to leverage the knowledge of Wikipedia,
our approach utilizes the fact that Wikipedia has category
pages. Category pages are lists of articles that belong to a
specific category (e.g., Zidane is in the category of French
football players⁴). These lists give us candidates for enti-
ties (e.g. Zidane), candidates for concepts (e.g. IsA(Zidane,
FootballPlayer)) [15] and candidates for relations (e.g. isC-
itizenOf(Zidane, France)). In an ontology, concepts have to
be arranged in a taxonomy to be of use. The Wikipedia
categories are indeed arranged in a hierarchy, but this hier-
archy is barely useful for ontological purposes. For example,
Zidane is in the super-category named "Football in France",
but Zidane is a football player and not a football. WordNet,
in contrast, provides a clean and carefully assembled hierar-
chy of thousands of concepts. But the Wikipedia concepts
have no obvious counterparts in WordNet.
In this paper we present new techniques that link the
two sources with near-perfect accuracy. To the best of
our knowledge, our method is the first approach that ac-
complishes this unification between WordNet and facts de-
rived from Wikipedia with an accuracy of 97%. This al-
lows the YAGO ontology to profit, on one hand, from the
vast amount of individuals known to Wikipedia, while ex-
ploiting, on the other hand, the clean taxonomy of concepts
from WordNet. Currently, YAGO contains roughly 1 million
entities and 5 million facts about them.
YAGO is based on a data model of entities and binary re-
lations. But by means of reification (i.e., introducing iden-
tifiers for relation instances) we can also express relations
between relation instances (e.g., popularity rankings of pairs
of soccer players and their teams) and general properties of
relations (e.g., transitivity or acyclicity). We show that, de-
spite its expressiveness, the YAGO data model is decidable.
YAGO is designed to be extendable by other sources, be
it by other high quality sources (such as gazetteers of geo-
graphic places and their relations), by domain-specific ex-
tensions, or by data gathered through information extrac-
tion from Web pages. We conduct an enrichment experi-
ment with the state-of-the-art information extraction system
Leila [25]. We observe that the more facts YAGO contains,
the better it can be extended. We hypothesize that this pos-
itive feedback loop could even accelerate future extensions.
The rest of this paper is organized as follows. In Section
2 we introduce YAGO’s data model. Section 3 describes the
¹ http://www.snomed.org
² http://www.geneontology.org/
³ Yet Another Great Ontology
⁴ Soccer is called football in some countries
sources from which the current YAGO is assembled, namely,
Wikipedia and WordNet. In Section 4 we give an overview of
the system behind YAGO. We explain our extraction tech-
niques and we show how YAGO can be extended by new
data. Section 5 presents an evaluation, a comparison to
other ontologies, an enrichment experiment and sample facts
from YAGO. We conclude with a summary in Section 6.
2. THE YAGO MODEL
2.1 Structure
To accommodate the ontological data we already ex-
tracted and to be prepared for future extensions, YAGO
must be based on a thorough and expressive data model.
The model must be able to express entities, facts, relations
between facts and properties of relations. The state-of-the-
art formalism in knowledge representation is currently the
Web Ontology Language OWL [24]. Its most expressive vari-
ant, OWL-full, can express properties of relations, but is
undecidable. The weaker variants of OWL, OWL-lite and
OWL-DL, cannot express relations between facts. RDFS,
the basis of OWL, can express relations between facts, but
provides only very primitive semantics (e.g. it does not know
transitivity). This is why we introduce a slight extension of
RDFS, the YAGO model. The YAGO model can express
relations between facts and relations, while it is at the same
time simple and decidable.
As in OWL and RDFS, all objects (e.g. cities, people,
even URLs) are represented as entities in the YAGO model.
Two entities can stand in a relation. For example, to state
that Albert Einstein won the Nobel Prize, we say that the
entity Albert Einstein stands in the hasWonPrize rela-
tion with the entity Nobel Prize. We write
AlbertEinstein hasWonPrize NobelPrize
Numbers, dates, strings and other literals are represented as
entities as well. This means that they can stand in relations
to other entities. For example, to state that Albert Einstein
was born in 1879, we write:
AlbertEinstein bornInYear 1879
Entities are abstract ontological objects, which are
language-independent in the ideal case. Language uses
words to refer to these entities. In the YAGO model, words
are entities as well. This makes it possible to express that a
certain word refers to a certain entity, like in the following
example:
Einstein means AlbertEinstein
This allows us to deal with synonymy and ambiguity. The
following line says that ”Einstein” may also refer to the mu-
sicologist Alfred Einstein:
Einstein means AlfredEinstein
We use quotes to distinguish words from other entities. Sim-
ilar entities are grouped into classes. For example, the class
physicist comprises all physicists and the class word com-
prises all words. Each entity is an instance of at least one
class. We express this by the type relation:
AlbertEinstein type physicist
Classes are also entities. Thus, each class is itself an instance
of a class, namely of the class class. Classes are arranged

in a taxonomic hierarchy, expressed by the subClassOf re-
lation:
physicist subClassOf scientist
In the YAGO model, relations are entities as well. This
makes it possible to represent properties of relations (like
transitivity or subsumption) within the model. The follow-
ing line, e.g., states that the subClassOf relation is tran-
sitive by making it an instance of the class transitive-
Relation:
subClassOf type transitiveRelation
The triple of an entity, a relation and an entity is called
a fact. The two entities are called the arguments of the
fact. Each fact is given a fact identifier. As RDFS, the
YAGO model considers fact identifiers to be entities as well.
This allows us to represent for example that a certain fact
was found at a certain URL. For example, suppose that the
above fact (Albert Einstein, bornInYear, 1879) had the
fact identifier #1, then the following line would say that this
fact was found in Wikipedia:
#1 foundIn http://www.wikipedia.org/Einstein
We will refer to entities that are neither facts nor relations
as common entities. Common entities that are not classes
will be called individuals. Then, a YAGO ontology over a
finite set of common entities C, a finite set of relation names
R and a finite set of fact identifiers I is a function
y : I → (I ∪ C ∪ R) × R × (I ∪ C ∪ R)
A YAGO ontology y has to be injective and total to ensure
that every fact identifier of I is mapped to exactly one fact.
Some facts require more than two arguments (for example
the fact that Einstein won the Nobel Prize in 1921). One
common way to deal with this problem is to use n-ary re-
lations (as for example in won-prize-in-year(Einstein,
Nobel-Prize, 1921)). In a relational database setting,
where relations correspond to tables, this has the disadvan-
tage that much space will be wasted if not all arguments of
the n-ary facts are known. Worse, if an argument (like e.g.
the place of an event) has not been foreseen in the design
phase of the database, the argument cannot be represented.
Another way of dealing with an n-ary relation is to intro-
duce a binary relation for each argument (e.g. winner,prize,
time). Then, an n-ary fact can be represented by a new
entity that is linked by these binary relations to all of its
arguments (as is proposed for OWL):
AlbertEinstein winner EinsteinWonNP1921
NobelPrize prize EinsteinWonNP1921
1921 time EinsteinWonNP1921
However, this method cannot deal with additional argu-
ments to relations that were designed to be binary. The
YAGO model offers a simple solution to this problem: It is
based on the assumption that for each n-ary relation, a pri-
mary pair of its arguments can be identified. For example,
for the above won-prize-in-year-relation, the pair of the
person and the prize could be considered a primary pair.
The primary pair can be represented as a binary fact with
a fact identifier:
#1 : AlbertEinstein hasWonPrize NobelPrize
All other arguments can be represented as relations that
hold between the primary pair and the other argument:
#2 : #1 time 1921
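The primary-pair idea above can be replayed with plain Python triples: the binary fact gets an identifier, and the extra argument attaches to that identifier. The lookup at the end shows how the year of the prize is recovered.

```python
# fact id -> (arg1, relation, arg2)
facts = {}

def state(fid, x, r, y):
    facts[fid] = (x, r, y)

state("#1", "AlbertEinstein", "hasWonPrize", "NobelPrize")  # primary pair
state("#2", "#1", "time", "1921")  # further argument of the n-ary fact

# "When did Einstein win the prize?": find the primary fact's identifier,
# then look for a time-fact whose first argument is that identifier.
winner_fid = next(fid for fid, f in facts.items()
                  if f == ("AlbertEinstein", "hasWonPrize", "NobelPrize"))
year = next(y for (x, r, y) in facts.values()
            if x == winner_fid and r == "time")
```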
2.2 Semantics
This section will give a model-theoretic semantics to
YAGO. We first prescribe that the set of relation names R
for any YAGO ontology must contain at least the relation
names type, subClassOf, domain, range and subRelation-
Of. The set of common entities C must contain at least
the classes entity, class, relation, acyclicTransitive-
Relation and classes for all literals (as evident from the
following list). For the rest of the paper, we assume a given
set of common entities C and a given set of relations R.
The set of fact identifiers used by a YAGO ontology y is
implicitly given by I = domain(y). To define the semantics
of a YAGO ontology, we consider only the set of possible
facts F = (I ∪ C ∪ R) × R × (I ∪ C ∪ R). We define a
rewrite system ⇒ ⊆ P(F) × P(F), i.e. ⇒ reduces one
set of facts to another set of facts. We use the shorthand
notation {f1, ..., fn} ⇒ f to say that
F ∪ {f1, ..., fn} ⇒ F ∪ {f1, ..., fn} ∪ {f}
for all F ⊆ F, i.e. if a set of facts contains the facts f1, ..., fn,
then the rewrite rule adds f to this set. Our rewrite system
contains the following (axiomatic) rules:⁵
(domain, domain, relation)
(domain, range, class)
(range, domain, relation)
(range, range, class)
(subClassOf, type, acyclicTransitiveRelation)
(subClassOf, domain, class)
(subClassOf, range, class)
(type, range, class)
(subRelationOf, type, acyclicTransitiveRelation)
(subRelationOf, domain, relation)
(subRelationOf, range, relation)
(boolean, subClassOf, literal)
(number, subClassOf, literal)
(rationalNumber, subClassOf, number)
(integer, subClassOf, rationalNumber)
(timeInterval, subClassOf, literal)
(dateTime, subClassOf, timeInterval)
(date, subClassOf, timeInterval)
(string, subClassOf, literal)
(character, subClassOf, string)
(word, subClassOf, string)
(URL, subClassOf, string)
Furthermore, it contains the following rules for all
r, r1, r2 ∈ R, x, y, z, c, c1, c2 ∈ I ∪ C ∪ R, r1 ≠ type,
r2 ≠ subRelationOf, r ≠ subRelationOf, r ≠ type,
c ≠ acyclicTransitiveRelation, c2 ≠ acyclicTransitiveRelation:
(1) {(r1, subRelationOf, r2), (x, r1, y)} ⇒ (x, r2, y)
(2) {(r, type, acyclicTransitiveRelation), (x, r, y), (y, r, z)} ⇒ (x, r, z)
(3) {(r, domain, c), (x, r, y)} ⇒ (x, type, c)
(4) {(r, range, c), (x, r, y)} ⇒ (y, type, c)
(5) {(x, type, c1), (c1, subClassOf, c2)} ⇒ (x, type, c2)
Theorem 1: [Convergence of ⇒]
Given a set of facts F ⊆ F, the largest set S with F ⇒* S
is unique.
(The theorems are proven in the appendix.) Given a YAGO
ontology y, the rules of ⇒ can be applied to its set of facts,
range(y). We call the largest set that can be produced by
applying the rules of ⇒ the set of derivable facts of y, D(y).
⁵ The class hierarchy of literals is inspired by SUMO [18]
Two YAGO ontologies y1, y2 are equivalent if the fact iden-
tifiers in y2 can be renamed so that
D(y1) ⊆ D(y2) ∧ D(y2) ⊆ D(y1), i.e. D(y1) = D(y2)
The deductive closure of a YAGO ontology y is computed by
adding the derivable facts to y. Each derivable fact (x, r, y)
needs a new fact identifier, which is just f_{x,r,y}. Using a
relational notation for the function y, we can write this as
y* := y ∪ { (f_{x,r,y}, (x, r, y)) |
(x, r, y) ∈ D(y), (x, r, y) ∉ range(y) }
A structure for a YAGO ontology y is a triple of
• a set U (the universe)
• a function D : I ∪ C ∪ R → U (the denotation)
• a function E : D(R) → P(U × U) (the extension function)
Like in RDFS, a YAGO structure needs to define the exten-
sions of the relations by the extension function E. E maps
the denotation of a relation symbol to a relation on universe
elements. We define the interpretation Ψ with respect to a
structure < U, D, E > as the following relation:
Ψ := {(e1, r, e2) | (D(e1), D(e2)) ∈ E(D(r))}
We say that a fact (e1, r, e2) is true in a structure, if it
is contained in the interpretation. A model of a YAGO
ontology y is a structure such that
1. all facts of y* are true
2. if Ψ(x, type, literal) for some x, then D(x) = x
3. if Ψ(r, type, acyclicTransitiveRelation) for some r,
then there exists no x such that Ψ(x, r, x)
A YAGO ontology y is called consistent iff there exists a
model for it. Obviously, a YAGO ontology is consistent iff
¬∃ x, r : (r, type, acyclicTransitiveRelation) ∈ D(y)
∧ (x, r, x) ∈ D(y)
Since D(y) is finite, the consistency of a YAGO ontology is
decidable. A base of a YAGO ontology y is any equivalent
YAGO ontology b with b ⊆ y. A canonical base of y is a
base so that there exists no other base with fewer elements.
Theorem 2: [Uniqueness of the Canonical Base]
The canonical base of a consistent YAGO ontology is unique.
In fact, the canonical base of a YAGO ontology can be com-
puted by greedily removing derivable facts from the ontol-
ogy. This makes the canonical base a natural choice to effi-
ciently store a YAGO ontology.
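The greedy computation can be sketched as follows. The `derive` function here is a stand-in that only closes subClassOf transitively, not the full rule set of Section 2.2; plugging in the full rules would yield the canonical base of a real YAGO ontology.

```python
def derive(facts):
    """Toy derivability oracle: transitive closure of subClassOf only."""
    facts = set(facts)
    while True:
        new = {(a, "subClassOf", d)
               for (a, r1, b) in facts if r1 == "subClassOf"
               for (c, r2, d) in facts if r2 == "subClassOf" and c == b}
        if new <= facts:
            return facts
        facts |= new

def canonical_base(facts):
    """Greedily drop every fact that the remaining facts still derive."""
    base = set(facts)
    for f in sorted(facts):
        if f in derive(base - {f}):  # still derivable -> redundant
            base.discard(f)
    return base

facts = derive({("a", "subClassOf", "b"), ("b", "subClassOf", "c")})
base = canonical_base(facts)
```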
2.3 Relation to Other Formalisms
The YAGO model is very similar to RDFS. In RDFS, rela-
tions are called properties. Just as YAGO, RDFS knows the
properties domain, range, subClassOf and subPropertyOf
(i.e. subRelationOf). These properties have a semantics
that is equivalent to that of the corresponding YAGO re-
lations. RDFS also knows fact identifiers, which can occur
as arguments of other facts. The following excerpt shows
how some sample facts of Section 2.1 can be represented in
RDFS. Each fact of YAGO becomes a triple in RDFS.
<rdf:Description
rdf:about="http://mpii.mpg.de/yago#Albert_Einstein">
<yago:bornInYear rdf:ID="f1">1879</yago:bornInYear>
</rdf:Description>
<rdf:Description
rdf:about="http://mpii.mpg.de/yago#f1">
<yago:foundIn rdf:ID="f2" rdf:resource="http:..."/>
</rdf:Description>
However, RDFS does not have a built-in transitive relation
or an acyclic transitive relation, as YAGO does. This en-
tails that the property acyclicTransitiveRelation can be
defined and used, but that RDFS would not know its se-
mantics.
YAGO uses fact identifiers, but it does not have built-
in relations to make logical assertions about facts (e.g. it
does not allow saying that a fact is false). If one relies on
the denotation to map a fact identifier to the corresponding
fact element in the universe, one can consider fact identi-
fiers as simple individuals. This abandons the syntactic link
between a fact identifier and the fact. In return, it opens
up the possibility of mapping a YAGO ontology to an OWL
ontology under certain conditions. OWL has built-in coun-
terparts for almost all built-in data types, classes, and rela-
tions of YAGO. The only concept that does not have an ex-
act built-in counterpart is the acyclicTransitiveRelation.
However, this is about to change. OWL is currently being
refined to its successor, OWL 1.1. The extended description
logic SROIQ [12], which has been adopted as the logical
basis of OWL 1.1, allows to express irreflexivity and transi-
tivity. This allows to define acyclic transitivity. We plan to
investigate the relation of YAGO and OWL once OWL 1.1
has been fully established.
3. SOURCES FOR YAGO
3.1 WordNet
WordNet is a semantic lexicon for the English language
developed at the Cognitive Science Laboratory of Princeton
University. WordNet distinguishes between words as they
literally appear in texts and the actual senses of the
words. A set of words that share one sense is called a
synset. Thus, each synset identifies one sense (i.e., a
semantic concept). Words with multiple meanings (ambiguous
words) belong to multiple synsets. As of the current
version 2.1, WordNet contains 81,426 synsets for 117,097
unique nouns. (WordNet also includes other types of words
like verbs and adjectives, but we consider only nouns in
this paper.) WordNet provides relations between synsets
such as hypernymy/hyponymy (i.e., the relation between a
sub-concept and a super-concept) and holonymy/meronymy
(i.e., the relation between a part and the whole); for this
paper, we focus on hypernyms/hyponyms. Conceptually,
the hypernymy relation in WordNet spans a directed acyclic
graph (DAG) with a single source node called Entity.
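The hypernymy DAG can be illustrated with a small sketch. The synsets and edges below are a hypothetical toy subset, not WordNet's actual data; only the single-source DAG shape mirrors WordNet.

```python
# A toy hypernymy DAG (hypothetical subset of synsets, not WordNet data):
# each synset maps to its direct hypernyms (super-concepts). The single
# source node is "entity", as in WordNet.
hypernyms = {
    "physicist": ["scientist"],
    "scientist": ["person"],
    "person": ["organism"],
    "organism": ["entity"],
    "entity": [],
}

def all_hypernyms(synset):
    """Every super-concept reachable from `synset` (BFS over the DAG)."""
    seen, queue = set(), list(hypernyms.get(synset, []))
    while queue:
        s = queue.pop(0)
        if s not in seen:
            seen.add(s)
            queue.extend(hypernyms.get(s, []))
    return seen
```

For instance, `all_hypernyms("physicist")` walks the chain up to the root and returns all four super-concepts, ending in "entity".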
3.2 Wikipedia
Wikipedia is a multilingual, Web-based encyclopedia. It is
written collaboratively by volunteers and is available for free.
We downloaded the English version of Wikipedia in January
2007, which comprised 1,600,000 articles at that time. Each
Wikipedia article is a single Web page and usually describes
a single topic.
The majority of Wikipedia pages have been manually assigned
to one or multiple categories. The page about Albert

Citations
Book ChapterDOI
11 Nov 2007
TL;DR: The extraction of the DBpedia datasets is described, and how the resulting information is published on the Web for human- and machine-consumption and how DBpedia could serve as a nucleus for an emerging Web of open data.
Abstract: DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against datasets derived from Wikipedia and to link other datasets on the Web to Wikipedia data. We describe the extraction of the DBpedia datasets, and how the resulting information is published on the Web for human- and machine-consumption. We describe some emerging applications from the DBpedia community and show how website authors can facilitate DBpedia content within their sites. Finally, we present the current status of interlinking DBpedia with other open datasets on the Web and outline how DBpedia could serve as a nucleus for an emerging Web of open data.

4,828 citations

Journal ArticleDOI
TL;DR: An overview of the DBpedia community project is given, including its architecture, technical implementation, maintenance, internationalisation, usage statistics and applications, including DBpedia one of the central interlinking hubs in the Linked Open Data (LOD) cloud.
Abstract: The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely available on the Web using Semantic Web and Linked Data technologies. The project extracts knowledge from 111 different language editions of Wikipedia. The largest DBpedia knowledge base which is extracted from the English edition of Wikipedia consists of over 400 million facts that describe 3.7 million things. The DBpedia knowledge bases that are extracted from the other 110 Wikipedia editions together consist of 1.46 billion facts and describe 10 million additional things. The DBpedia project maps Wikipedia infoboxes from 27 different language editions to a single shared ontology consisting of 320 classes and 1,650 properties. The mappings are created via a world-wide crowd-sourcing effort and enable knowledge from the different Wikipedia editions to be combined. The project publishes releases of all DBpedia knowledge bases for download and provides SPARQL query access to 14 out of the 111 language editions via a global network of local DBpedia chapters. In addition to the regular releases, the project maintains a live knowledge base which is updated whenever a page in Wikipedia changes. DBpedia sets 27 million RDF links pointing into over 30 external data sources and thus enables data from these sources to be used together with DBpedia data. Several hundred data sets on the Web publish RDF links pointing to DBpedia themselves and make DBpedia one of the central interlinking hubs in the Linked Open Data (LOD) cloud. In this system report, we give an overview of the DBpedia community project, including its architecture, technical implementation, maintenance, internationalisation, usage statistics and applications.

2,856 citations

Book
02 Feb 2011
TL;DR: This Synthesis lecture provides readers with a detailed technical introduction to Linked Data, including coverage of relevant aspects of Web architecture, as the basis for application development, research or further study.
Abstract: The World Wide Web has enabled the creation of a global information space comprising linked documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to raw data not currently available on the Web or bound up in hypertext documents. Linked Data provides a publishing paradigm in which not only documents, but also data, can be a first class citizen of the Web, thereby enabling the extension of the Web with a global data space based on open standards - the Web of Data. In this Synthesis lecture we provide readers with a detailed technical introduction to Linked Data. We begin by outlining the basic principles of Linked Data, including coverage of relevant aspects of Web architecture. The remainder of the text is based around two main themes - the publication and consumption of Linked Data. Drawing on a practical Linked Data scenario, we provide guidance and best practices on: architectural approaches to publishing Linked Data; choosing URIs and vocabularies to identify and describe resources; deciding what data to return in a description of a resource on the Web; methods and frameworks for automated linking of data sets; and testing and debugging approaches for Linked Data deployments. We give an overview of existing Linked Data applications and then examine the architectures that are used to consume Linked Data from the Web, alongside existing tools and frameworks that enable these. Readers can expect to gain a rich technical understanding of Linked Data fundamentals, as the basis for application development, research or further study.

2,174 citations

Proceedings Article
Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, Li Deng
01 May 2015
TL;DR: It is found that embeddings learned from the bilinear objective are particularly good at capturing relational semantics and that the composition of relations is characterized by matrix multiplication.
Abstract: We consider learning representations of entities and relations in KBs using the neural-embedding approach. We show that most existing models, including NTN (Socher et al., 2013) and TransE (Bordes et al., 2013b), can be generalized under a unified learning framework, where entities are low-dimensional vectors learned from a neural network and relations are bilinear and/or linear mapping functions. Under this framework, we compare a variety of embedding models on the link prediction task. We show that a simple bilinear formulation achieves new state-of-the-art results for the task (achieving a top-10 accuracy of 73.2% vs. 54.7% by TransE on Freebase). Furthermore, we introduce a novel approach that utilizes the learned relation embeddings to mine logical rules such as "BornInCity(a,b) and CityInCountry(b,c) => Nationality(a,c)". We find that embeddings learned from the bilinear objective are particularly good at capturing relational semantics and that the composition of relations is characterized by matrix multiplication. More interestingly, we demonstrate that our embedding-based rule extraction approach successfully outperforms a state-of-the-art confidence-based rule mining approach in mining Horn rules that involve compositional reasoning.

2,132 citations

Journal ArticleDOI
TL;DR: This article provides a systematic review of existing techniques of Knowledge graph embedding, including not only the state-of-the-arts but also those with latest trends, based on the type of information used in the embedding task.
Abstract: Knowledge graph (KG) embedding is to embed components of a KG including entities and relations into continuous vector spaces, so as to simplify the manipulation while preserving the inherent structure of the KG. It can benefit a variety of downstream tasks such as KG completion and relation extraction, and hence has quickly gained massive attention. In this article, we provide a systematic review of existing techniques, including not only the state-of-the-arts but also those with latest trends. Particularly, we make the review based on the type of information used in the embedding task. Techniques that conduct embedding using only facts observed in the KG are first introduced. We describe the overall framework, specific model design, typical training procedures, as well as pros and cons of such techniques. After that, we discuss techniques that further incorporate additional information besides facts. We focus specifically on the use of entity types, relation paths, textual descriptions, and logical rules. Finally, we briefly introduce how KG embedding can be applied to and benefit a wide variety of downstream tasks such as KG completion, relation extraction, question answering, and so forth.

1,905 citations

References
Book
01 Jan 2020
TL;DR: In this article, the authors present a comprehensive introduction to the theory and practice of artificial intelligence for modern applications, including game playing, planning and acting, and reinforcement learning with neural networks.
Abstract: The long-anticipated revision of this #1 selling book offers the most comprehensive, state of the art introduction to the theory and practice of artificial intelligence for modern applications. Intelligent Agents. Solving Problems by Searching. Informed Search Methods. Game Playing. Agents that Reason Logically. First-order Logic. Building a Knowledge Base. Inference in First-Order Logic. Logical Reasoning Systems. Practical Planning. Planning and Acting. Uncertainty. Probabilistic Reasoning Systems. Making Simple Decisions. Making Complex Decisions. Learning from Observations. Learning with Neural Networks. Reinforcement Learning. Knowledge in Learning. Agents that Communicate. Practical Communication in English. Perception. Robotics. For computer professionals, linguists, and cognitive scientists interested in artificial intelligence.

16,983 citations

Journal ArticleDOI
01 Sep 2000-Language
TL;DR: The lexical database: nouns in WordNet, Katherine J. Miller a semantic network of English verbs, and applications of WordNet: building semantic concordances are presented.
Abstract: Part 1 The lexical database: nouns in WordNet, George A. Miller modifiers in WordNet, Katherine J. Miller a semantic network of English verbs, Christiane Fellbaum design and implementation of the WordNet lexical database and searching software, Randee I. Tengi. Part 2: automated discovery of WordNet relations, Marti A. Hearst representing verb alterations in WordNet, Karen T. Kohl et al the formalization of WordNet by methods of relational concept analysis, Uta E. Priss. Part 3 Applications of WordNet: building semantic concordances, Shari Landes et al performance and confidence in a semantic annotation task, Christiane Fellbaum et al WordNet and class-based probabilities, Philip Resnik combining local context and WordNet similarity for word sense identification, Claudia Leacock and Martin Chodorow using WordNet for text retrieval, Ellen M. Voorhees lexical chains as representations of context for the detection and correction of malapropisms, Graeme Hirst and David St-Onge temporal indexing through lexical chaining, Reem Al-Halimi and Rick Kazman COLOR-X - using knowledge from WordNet for conceptual modelling, J.F.M. Burg and R.P. van de Riet knowledge processing on an extended WordNet, Sanda M. Harabagiu and Dan I Moldovan appendix - obtaining and using WordNet.

13,049 citations

Book
01 Jan 1998
TL;DR: This chapter discusses abstract reduction systems, universal algebra, and Grobner bases and Buchberger's algorithm, and a bluffer's guide to ML Bibliography Index.
Abstract: Preface 1. Motivating examples 2. Abstract reduction systems 3. Universal algebra 4. Equational problems 5. Termination 6. Confluence 7. Completion 8. Grobner bases and Buchberger's algorithm 9. Combination problems 10. Equational unification 11. Extensions Appendix 1. Ordered sets Appendix 2. A bluffer's guide to ML Bibliography Index.

2,515 citations

Proceedings ArticleDOI
17 Oct 2001
TL;DR: The strategy used to create the current version of the SUMO is outlined, some of the challenges that were faced in constructing the ontology are discussed, and its most general concepts and the relations between them are described.
Abstract: The Suggested Upper Merged Ontology (SUMO) is an upper level ontology that has been proposed as a starter document for The Standard Upper Ontology Working Group, an IEEE-sanctioned working group of collaborators from the fields of engineering, philosophy, and information science. The SUMO provides definitions for general-purpose terms and acts as a foundation for more specific domain ontologies. In this paper we outline the strategy used to create the current version of the SUMO, discuss some of the challenges that we faced in constructing the ontology, and describe in detail its most general concepts and the relations between them.

1,761 citations

Frequently Asked Questions (13)
Q1. What are the contributions in "Yago: a core of semantic knowledge unifying wordnet and wikipedia" ?

The authors present YAGO, a light-weight and extensible ontology with high coverage and quality. The facts have been automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. Finally, the authors show how YAGO can be further extended by state-of-the-art information extraction techniques. 

The authors observed that the more facts YAGO contains, the easier it is to extend it by further facts. The authors hypothesize that this positive feedback loop could facilitate the growth of the knowledge base in the future. On the theoretical side, the authors plan to investigate the relationship between OWL 1.1 and the YAGO model, once OWL 1.1 has been fully developed. On the practical side, the authors plan to enrich YAGO by further facts that go beyond the current somewhat arbitrary relations – including high-confidence facts from gazetteers, but also extracted information from Web pages.

ontological knowledge structures play an important role in data cleaning (e.g., for a data warehouse) [6], record linkage (aka. entity resolution) [7], and information integration in general [19]. 

This is why a cleaning step is necessary, in which the system filters out all facts with arguments that are not in the domain of the previously established type relation. 

The authors use the shorthand notation {f1, ..., fn} ↪→ f to say that F ∪ {f1, ..., fn} → F ∪ {f1, ..., fn} ∪ {f} for all fact sets F, i.e., if a set of facts contains the facts f1, ..., fn, then the rewrite rule adds f to this set.
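Read this way, a rewrite rule is a production: whenever a fact set contains the rule's body, its head is added, and the rules are applied until a fixpoint is reached. The following Python sketch applies a hypothetical ground rule instance (real rules range over variables); the fact encoding and rule set are assumptions for illustration.

```python
# Fixpoint application of rewrite rules of the form {f1, ..., fn} |-> f:
# if a fact set contains all of f1..fn, add f, until no rule fires anymore.
def saturate(facts, rules):
    """Apply every rule (body, head) until no rule adds a new fact."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= facts and head not in facts:
                facts.add(head)
                changed = True
    return facts

# Hypothetical ground instance of a transitivity rule:
# {(A subClassOf B), (B subClassOf C)} |-> (A subClassOf C)
rules = [
    ({("physicist", "subClassOf", "scientist"),
      ("scientist", "subClassOf", "person")},
     ("physicist", "subClassOf", "person")),
]
```

Calling `saturate` on the two body facts with this rule set yields a three-fact set that additionally contains the derived fact `("physicist", "subClassOf", "person")`.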

a YAGO ontology is consistent iff ¬∃ x, r : (r, type, acyclicTransitiveRelation) ∈ D(y) ∧ (x, r, x) ∈ D(y). Since D(y) is finite, the consistency of a YAGO ontology is decidable.
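Since D(y) is finite, the condition amounts to scanning the derived facts once. A minimal sketch, assuming D(y) is already materialized as a set of Python triples (the triple encoding is an assumption, not YAGO's format):

```python
# Consistency check sketch: an ontology is inconsistent iff some relation r
# declared acyclicTransitiveRelation relates an entity x to itself in the
# deductive closure D(y). `derived_facts` is assumed to be that closure.
def consistent(derived_facts):
    acyclic = {s for (s, p, o) in derived_facts
               if p == "type" and o == "acyclicTransitiveRelation"}
    return not any(s == o and p in acyclic for (s, p, o) in derived_facts)
```

A closure containing both `("locatedIn", "type", "acyclicTransitiveRelation")` and a self-loop `("c", "locatedIn", "c")` is rejected; without the self-loop the check passes.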

the categorization of Wikipedia pages and their link structure are available as SQL tables, so that they can be exploited without parsing the actual Wikipedia articles. 

The information about witnesses will enable applications to use, e.g., only facts extracted by a certain technique, facts extracted from a certain source or facts of a certain date. 

There are roughly 15,000 cases, in which an entity is contributed by both WordNet and Wikipedia (i.e. a WordNet synset contains a common noun that is the name of a Wikipedia page). 

Such an ontology would have to be of high quality, with accuracy close to 100 percent, i.e. comparable in quality to an encyclopedia. 

WordNet provides relations between synsets such as hypernymy/hyponymy (i.e., the relation between a sub-concept and a super-concept) and holonymy/meronymy (i.e., the relation between a part and the whole); for this paper, the authors focus on hypernyms/hyponyms. 

One common way to deal with this problem is to use n-ary relations (as for example in won-prize-in-year(Einstein, Nobel-Prize, 1921)). 

all facts are tagged with their empirical confidence estimation (see Section 5.1.1), which lies between 0.90 and 0.98.