Proceedings Article

Yago: a core of semantic knowledge

Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum

08 May 2007, pp. 697-706

TL;DR: YAGO is a light-weight and extensible ontology with high coverage and quality. It includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as hasWonPrize).


Abstract: We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as hasWonPrize). The facts have been automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships - and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques.


Summary (3 min read)


1.1 Motivation

Many applications in modern information technology utilize ontological background knowledge.
An ontology that serves these applications would have to comprise not only concepts in the style of WordNet, but also named entities like people, organizations, geographic locations, books, songs, products, etc., and also relations among these such as what-is-located-where, who-was-born-when, who-has-won-which-prize, etc.
If such an ontology were available, it could boost the performance of existing applications and also open up the path towards new applications in the Semantic Web era.

1.3 Contributions and Outline

This paper presents YAGO (Yet Another Great Ontology), a new ontology that combines high coverage with high quality.
Category pages are lists of articles that belong to a specific category (e.g., Zidane is in the category of French football players; soccer is called football in some countries).
To the best of their knowledge, their method is the first approach that accomplishes this unification between WordNet and facts derived from Wikipedia with an accuracy of 97%.
The authors observe that the more facts YAGO contains, the better it can be extended.
Section 3 describes the sources from which the current YAGO is assembled, namely, Wikipedia and WordNet.

2.1 Structure

This makes it possible to express that a certain word refers to a certain entity, like in the following example: ”Einstein” means AlbertEinstein.
In the YAGO model, relations are entities as well.
Common entities that are not classes will be called individuals.
Then, an n-ary fact can be represented by a new entity that is linked by these binary relations to all of its arguments (as is proposed for OWL): AlbertEinstein winner EinsteinWonNP1921; NobelPrize prize EinsteinWonNP1921; 1921 time EinsteinWonNP1921.

2.2 Semantics

This section will give a model-theoretic semantics to YAGO.
The set of common entities C must contain at least the classes entity, class, relation, acyclicTransitiveRelation and classes for all literals (as evident from the following list).
Each derivable fact (x, r, y) needs a new fact identifier, which is just $f_{x,r,y}$.
This makes the canonical base a natural choice to efficiently store a YAGO ontology.

2.3 Relation to Other Formalisms

Just as YAGO, RDFS knows the properties domain, range, subClassOf and subPropertyOf (i.e. subRelationOf).
These properties have a semantics that is equivalent to that of the corresponding YAGO relations.
The authors plan to investigate the relation of YAGO and OWL once OWL 1.1 has been fully established.

3.1 WordNet

WordNet is a semantic lexicon for the English language developed at the Cognitive Science Laboratory of Princeton University.
WordNet distinguishes between words as literally appearing in texts and the actual senses of the words.
Thus, each synset identifies one sense (i.e., semantic concept).
WordNet provides relations between synsets such as hypernymy/hyponymy (i.e., the relation between a sub-concept and a super-concept) and holonymy/meronymy (i.e., the relation between a part and the whole); for this paper, the authors focus on hypernyms/hyponyms.

3.2 Wikipedia

The authors downloaded the English version of Wikipedia in January 2007, which comprised 1,600,000 articles at that time.
Each Wikipedia article is a single Web page and usually describes a single topic.
The majority of Wikipedia pages have been manually assigned to one or multiple categories.
The page about Albert Einstein, for example, is in the categories German language philosophers, Swiss physicists, and 34 more.
The categorization of Wikipedia pages and their link structure are available as SQL tables, so that they can be exploited without parsing the actual Wikipedia articles.

4. THE YAGO SYSTEM

The authors' system is designed to extract a YAGO ontology from WordNet and Wikipedia.
Facts extracted by other techniques (e.g. based on statistical learning) can have smaller confidence values.
This gives us a (possibly empty) set of conceptual categories for each Wikipedia page.
First, the authors introduce a class for each synset known to WordNet (e.g., city).
If the words used to refer to these individuals match the common pattern of a given name and a family name, the authors extract the name components and establish the relations givenNameOf and familyNameOf.

4.2 YAGO Storage

The YAGO model itself is independent of a particular data storage format.
The authors maintain a folder for each relation and each folder contains files that list the entity pairs.
The authors store only facts that cannot be derived by the rewrite rules of YAGO (see 2.2), so that they store in fact the unique canonical base of the ontology.
The table has the simple schema FACTS(factId, arg1, relation, arg2, confidence).
For their experiments, the authors used the Oracle version of YAGO.
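The FACTS schema above can be tried out with a minimal sketch. It assumes SQLite as a stand-in for the database system, and the column types and the confidence value 1.0 are illustrative assumptions rather than details taken from the paper.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real storage backend.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE facts (
        factId     TEXT PRIMARY KEY,   -- fact identifier, e.g. '#1'
        arg1       TEXT NOT NULL,      -- first argument (entity or fact id)
        relation   TEXT NOT NULL,
        arg2       TEXT NOT NULL,      -- second argument (entity, literal or fact id)
        confidence REAL NOT NULL       -- illustrative values only
    )
    """
)

rows = [
    ("#1", "AlbertEinstein", "hasWonPrize", "NobelPrize", 1.0),
    ("#2", "#1", "time", "1921", 1.0),  # a fact about fact #1 (reification)
    ("#3", "AlbertEinstein", "bornInYear", "1879", 1.0),
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?, ?)", rows)


def facts_about(entity):
    """Return (relation, arg2) pairs whose first argument is `entity`, sorted by relation."""
    cur = conn.execute(
        "SELECT relation, arg2 FROM facts WHERE arg1 = ? ORDER BY relation",
        (entity,),
    )
    return cur.fetchall()
```

Because fact identifiers are stored in the same columns as ordinary entities, a lookup such as `facts_about("#1")` retrieves statements made about a fact without any extra table.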

4.3 Enriching YAGO

An application that adds new facts to the YAGO ontology is required to obey the following protocol.
For the disambiguation, the application can make use of the extensive information that YAGO provides for the existing entities: the relations to other entities, the words used to refer to the entities, and the context of the entities, as provided by the context relation.
The authors propose to take the maximum, but other options can be considered.
If (x, r, y) does not yet exist in the ontology, the application has to add the fact together with a new fact identifier.
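The last two steps of this protocol can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the dictionary layout, the identifier scheme and the relation name `headquarteredIn` are hypothetical, and entity disambiguation is assumed to have happened already.

```python
import itertools

# Hypothetical generator for fresh fact identifiers ('#1', '#2', ...).
_fact_ids = itertools.count(1)


def add_fact(ontology, x, r, y, confidence):
    """Add (x, r, y) to `ontology`, a dict mapping triples to [fact_id, confidence].

    If the fact already exists, its confidence becomes the maximum of the old
    and the new value (the option proposed in the text); otherwise the fact is
    stored under a new fact identifier. Returns the fact identifier.
    """
    key = (x, r, y)
    if key in ontology:
        entry = ontology[key]
        entry[1] = max(entry[1], confidence)
    else:
        ontology[key] = [f"#{next(_fact_ids)}", confidence]
    return ontology[key][0]


# Usage: the same candidate fact arrives twice with different confidences.
onto = {}
first_id = add_fact(onto, "Microsoft", "headquarteredIn", "Redmond", 0.7)
second_id = add_fact(onto, "Microsoft", "headquarteredIn", "Redmond", 0.9)
```

Taking the maximum keeps the strongest evidence seen so far; other aggregation functions could be substituted in the same place.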

5.1 Manual evaluation

The authors presented randomly selected facts of the ontology to human judges and asked them to assess whether the facts were correct.
Since common sense often does not suffice to judge the correctness of YAGO facts, the authors also presented them a snippet of the corresponding Wikipedia page.
Furthermore, accuracy can usually be varied at the cost of recall.
State-of-the-art taxonomy induction as described in [23] achieves an accuracy of 84%. KnowItAll [9] and KnowItNow [4] are reported to have accuracy rates of 85% and 80%, respectively.
With the exception of Cyc (which is not publicly available), the facts of these ontologies are in the hundreds of thousands, whereas the facts of YAGO are in the millions.

5.2 Sample facts

In YAGO, the word "Paris" can refer to 71 distinct entities.
Preprocessing ensures that words in the query are considered in all their possible meanings.
The query algorithms are not in the scope of this paper.
Here, the authors only show some sample queries to illustrate the applicability of YAGO (Table 6).

5.3 Enrichment experiment

To demonstrate how an application can add new facts to the YAGO ontology, the authors conducted an experiment with the knowledge extraction system Leila [25].
Leila is a state-of-the-art system that uses pattern matching on natural language text.
This relation holds between a company and the city of its headquarters.
For each candidate fact, the company and the city have to be mapped to the respective individuals in YAGO.
Hence the authors assume that the more facts and entities YAGO contains, the better it can be extended by new facts.

6. CONCLUSION

The authors presented YAGO, a large and extendable ontology of high quality.
YAGO contains 1 million entities and 5 million facts – more than any other publicly available formal ontology.
YAGO is available in different export formats, including plain text, XML, RDFS and SQL database formats at http://www.mpii.mpg.de/~suchanek/yago.
YAGO opens the door to numerous new challenges.


HAL Id: hal-01472497

https://hal.archives-ouvertes.fr/hal-01472497

Submitted on 20 Feb 2017


Yago: A Core of Semantic Knowledge Unifying WordNet and Wikipedia

Fabian Suchanek, Gjergji M Kasneci, Gerhard M Weikum

To cite this version:

Fabian Suchanek, Gjergji M Kasneci, Gerhard M Weikum. Yago: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. 16th international conference on World Wide Web, May 2007, Banff, Canada. pp.697 - 697, 10.1145/1242572.1242667. hal-01472497

YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia

Fabian M. Suchanek
Max-Planck-Institut
Saarbrücken / Germany
suchanek@mpii.mpg.de

Gjergji Kasneci
Max-Planck-Institut
Saarbrücken / Germany
kasneci@mpii.mpg.de

Gerhard Weikum
Max-Planck-Institut
Saarbrücken / Germany
weikum@mpii.mpg.de

ABSTRACT

We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as hasWonPrize). The facts have been automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships – and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques.

Categories and Subject Descriptors

H.0 [Information Systems]: General

General Terms

Knowledge Extraction, Ontologies

Keywords

Wikipedia, WordNet

1. INTRODUCTION

1.1 Motivation

Many applications in modern information technology uti-

lize ontological background knowledge. This applies above

all to applications in the vision of the Semantic Web, but

there are many other application ﬁelds. Machine translation

(e.g. [5]) and word sense disambiguation (e.g. [3]) exploit

lexical knowledge, query expansion uses taxonomies (e.g.

[16, 11, 27]), document classiﬁcation based on supervised or

semi-supervised learning can be combined with ontologies

(e.g. [14]), and [13] demonstrates the utility of background

knowledge for question answering and information retrieval.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8–12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

Furthermore, ontological knowledge structures play an im-

portant role in data cleaning (e.g., for a data warehouse) [6],

record linkage (aka. entity resolution) [7], and information

integration in general [19].

But the existing applications typically use only a single

source of background knowledge (mostly WordNet [10] or

Wikipedia). They could boost their performance, if a huge

ontology with knowledge from several sources was available.

Such an ontology would have to be of high quality, with ac-

curacy close to 100 percent, i.e. comparable in quality to

an encyclopedia. It would have to comprise not only con-

cepts in the style of WordNet, but also named entities like

people, organizations, geographic locations, books, songs,

products, etc., and also relations among these such as what-is-located-where, who-was-born-when, who-has-won-which-prize, etc. It would have to be extensible, easily re-usable,

and application-independent. If such an ontology were avail-

able, it could boost the performance of existing applications

and also open up the path towards new applications in the

Semantic Web era.

1.2 Related Work

Knowledge representation is an old ﬁeld in AI and has

provided numerous models from frames and KL-ONE to

recent variants of description logics and RDFS and OWL

(see [22] and [24]). Numerous approaches have been pro-

posed to create general-purpose ontologies on top of these

representations. One class of approaches focuses on extract-

ing knowledge structures automatically from text corpora.

These approaches use information extraction technologies

that include pattern matching, natural-language parsing,

and statistical learning [25, 9, 4, 1, 23, 20, 8]. These techniques have also been used to extend WordNet by Wikipedia

individuals [21]. Another project along these lines is Know-

ItAll [9], which aims at extracting and compiling instances

of unary and binary predicate instances on a very large scale

– e.g., as many soccer players as possible or almost all com-

pany/CEO pairs from the business world. Although these

approaches have recently improved the quality of their re-

sults considerably, the quality is still signiﬁcantly below that

of a man-made knowledge base. Typical results contain

many false positives (e.g., IsA(Aachen Cathedral, City), to

give one example from KnowItAll). Furthermore, obtaining

a recall above 90 percent for a closed domain typically en-

tails a drastic loss of precision in return. Thus, information-

extraction approaches are only of little use for applications

that need near-perfect ontologies (e.g. for automated rea-

soning). Furthermore, they typically do not have an explicit

(logic-based) knowledge representation model.

Due to the quality bottleneck, the most successful and

widely employed ontologies are still man-made. These in-

clude WordNet [10], Cyc or OpenCyc [17], SUMO [18], and

especially domain-speciﬁc ontologies and taxonomies such as

SNOMED or the GeneOntology. These knowledge sources

have the advantage of satisfying the highest quality expecta-

tions, because they are manually assembled. However, they

suﬀer from low coverage, high cost for assembly and quality

assurance, and fast aging. No human-made ontology knows

the most recent Windows version or the latest soccer stars.

1.3 Contributions and Outline

This paper presents YAGO, a new ontology that combines high coverage with high quality. Its core is assembled from one of the most comprehensive lexicons available today, Wikipedia. But rather than using information extraction methods to leverage the knowledge of Wikipedia, our approach utilizes the fact that Wikipedia has category pages. Category pages are lists of articles that belong to a specific category (e.g., Zidane is in the category of French football players). These lists give us candidates for entities (e.g. Zidane), candidates for concepts (e.g. IsA(Zidane, FootballPlayer)) [15] and candidates for relations (e.g. isCitizenOf(Zidane, France)). In an ontology, concepts have to be arranged in a taxonomy to be of use. The Wikipedia categories are indeed arranged in a hierarchy, but this hierarchy is barely useful for ontological purposes. For example, Zidane is in the super-category named "Football in France", but Zidane is a football player and not a football. WordNet, in contrast, provides a clean and carefully assembled hierarchy of thousands of concepts. But the Wikipedia concepts have no obvious counterparts in WordNet.

In this paper we present new techniques that link the

two sources with near-perfect accuracy. To the best of

our knowledge, our method is the ﬁrst approach that ac-

complishes this uniﬁcation between WordNet and facts de-

rived from Wikipedia with an accuracy of 97%. This al-

lows the YAGO ontology to proﬁt, on one hand, from the

vast amount of individuals known to Wikipedia, while ex-

ploiting, on the other hand, the clean taxonomy of concepts

from WordNet. Currently, YAGO contains roughly 1 million

entities and 5 million facts about them.

YAGO is based on a data model of entities and binary re-

lations. But by means of reiﬁcation (i.e., introducing iden-

tiﬁers for relation instances) we can also express relations

between relation instances (e.g., popularity rankings of pairs

of soccer players and their teams) and general properties of

relations (e.g., transitivity or acyclicity). We show that, de-

spite its expressiveness, the YAGO data model is decidable.

YAGO is designed to be extendable by other sources – be

it by other high quality sources (such as gazetteers of geo-

graphic places and their relations), by domain-speciﬁc ex-

tensions, or by data gathered through information extrac-

tion from Web pages. We conduct an enrichment experi-

ment with the state-of-the-art information extraction system

Leila [25]. We observe that the more facts YAGO contains,

the better it can be extended. We hypothesize that this pos-

itive feedback loop could even accelerate future extensions.

The rest of this paper is organized as follows. In Section 2 we introduce YAGO's data model. Section 3 describes the sources from which the current YAGO is assembled, namely, Wikipedia and WordNet. In Section 4 we give an overview of the system behind YAGO. We explain our extraction techniques and we show how YAGO can be extended by new data. Section 5 presents an evaluation, a comparison to other ontologies, an enrichment experiment and sample facts from YAGO. We conclude with a summary in Section 6.

Footnotes to Sections 1.2 and 1.3:
SNOMED: http://www.snomed.org
GeneOntology: http://www.geneontology.org/
YAGO: Yet Another Great Ontology
Football: Soccer is called football in some countries

2. THE YAGO MODEL

2.1 Structure

To accommodate the ontological data we already ex-

tracted and to be prepared for future extensions, YAGO

must be based on a thorough and expressive data model.

The model must be able to express entities, facts, relations

between facts and properties of relations. The state-of-the-

art formalism in knowledge representation is currently the

Web Ontology Language OWL [24]. Its most expressive vari-

ant, OWL-full, can express properties of relations, but is

undecidable. The weaker variants of OWL, OWL-lite and

OWL-DL, cannot express relations between facts. RDFS,

the basis of OWL, can express relations between facts, but

provides only very primitive semantics (e.g. it does not know

transitivity). This is why we introduce a slight extension of

RDFS, the YAGO model. The YAGO model can express

relations between facts and relations, while it is at the same

time simple and decidable.

As in OWL and RDFS, all objects (e.g. cities, people,

even URLs) are represented as entities in the YAGO model.

Two entities can stand in a relation. For example, to state

that Albert Einstein won the Nobel Prize, we say that the

entity Albert Einstein stands in the hasWonPrize relation with the entity Nobel Prize. We write

AlbertEinstein hasWonPrize NobelPrize

Numbers, dates, strings and other literals are represented as

entities as well. This means that they can stand in relations

to other entities. For example, to state that Albert Einstein

was born in 1879, we write:

AlbertEinstein bornInYear 1879

Entities are abstract ontological objects, which are

language-independent in the ideal case. Language uses

words to refer to these entities. In the YAGO model, words

are entities as well. This makes it possible to express that a

certain word refers to a certain entity, like in the following

example:

”Einstein” means AlbertEinstein

This allows us to deal with synonymy and ambiguity. The

following line says that ”Einstein” may also refer to the mu-

sicologist Alfred Einstein:

”Einstein” means AlfredEinstein

We use quotes to distinguish words from other entities. Sim-

ilar entities are grouped into classes. For example, the class

physicist comprises all physicists and the class word com-

prises all words. Each entity is an instance of at least one

class. We express this by the type relation:

AlbertEinstein type physicist

Classes are also entities. Thus, each class is itself an instance

of a class, namely of the class class. Classes are arranged

in a taxonomic hierarchy, expressed by the subClassOf re-

lation:

physicist subClassOf scientist

In the YAGO model, relations are entities as well. This

makes it possible to represent properties of relations (like

transitivity or subsumption) within the model. The following line, e.g., states that the subClassOf relation is transitive by making it an instance of the class transitiveRelation:

subClassOf type transitiveRelation

The triple of an entity, a relation and an entity is called

a fact. The two entities are called the arguments of the

fact. Each fact is given a fact identiﬁer. As RDFS, the

YAGO model considers fact identiﬁers to be entities as well.

This allows us to represent for example that a certain fact

was found at a certain URL. For example, suppose that the

above fact (Albert Einstein, bornInYear, 1879) had the

fact identiﬁer #1, then the following line would say that this

fact was found in Wikipedia:

#1 foundIn http://www.wikipedia.org/Einstein

We will refer to entities that are neither facts nor relations

as common entities. Common entities that are not classes

will be called individuals. Then, a YAGO ontology over a

ﬁnite set of common entities C, a ﬁnite set of relation names

R and a ﬁnite set of fact identiﬁers I is a function

y : I → (I ∪ C ∪ R) × R × (I ∪ C ∪ R)

A YAGO ontology y has to be injective and total to ensure

that every fact identiﬁer of I is mapped to exactly one fact.

Some facts require more than two arguments (for example

the fact that Einstein won the Nobel Prize in 1921). One

common way to deal with this problem is to use n-ary re-

lations (as for example in won-prize-in-year(Einstein,

Nobel-Prize, 1921)). In a relational database setting,

where relations correspond to tables, this has the disadvan-

tage that much space will be wasted if not all arguments of

the n-ary facts are known. Worse, if an argument (like e.g.

the place of an event) has not been foreseen in the design

phase of the database, the argument cannot be represented.

Another way of dealing with an n-ary relation is to introduce a binary relation for each argument (e.g. winner, prize,

time). Then, an n-ary fact can be represented by a new

entity that is linked by these binary relations to all of its

arguments (as is proposed for OWL):

AlbertEinstein winner EinsteinWonNP1921

NobelPrize prize EinsteinWonNP1921

1921 time EinsteinWonNP1921

However, this method cannot deal with additional argu-

ments to relations that were designed to be binary. The

YAGO model oﬀers a simple solution to this problem: It is

based on the assumption that for each n-ary relation, a pri-

mary pair of its arguments can be identiﬁed. For example,

for the above won-prize-in-year-relation, the pair of the

person and the prize could be considered a primary pair.

The primary pair can be represented as a binary fact with

a fact identiﬁer:

#1 : AlbertEinstein hasWonPrize NobelPrize

All other arguments can be represented as relations that

hold between the primary pair and the other argument:

#2 : #1 time 1921
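The definition of a YAGO ontology as a function from fact identifiers to triples, together with the primary-pair representation above, can be sketched in a few lines. The dictionary representation and the helper names are assumptions made for illustration only.

```python
def is_injective(ontology):
    """True if no two fact identifiers are mapped to the same triple."""
    triples = list(ontology.values())
    return len(triples) == len(set(triples))


def facts_about_fact(ontology, fact_id):
    """Return (relation, argument) pairs of facts whose first argument is `fact_id`."""
    return [(r, y) for (x, r, y) in ontology.values() if x == fact_id]


# The ontology y as a dict: fact identifier -> (argument, relation, argument).
# Fact #2 uses the identifier of fact #1 as its first argument.
ontology = {
    "#1": ("AlbertEinstein", "hasWonPrize", "NobelPrize"),
    "#2": ("#1", "time", "1921"),
}
```

A dict is total on its own keys by construction, so only injectivity needs an explicit check here. Further arguments of the n-ary fact (for example a place) could be attached later as additional facts about `#1` without changing the existing entries.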

2.2 Semantics

This section will give a model-theoretic semantics to

YAGO. We ﬁrst prescribe that the set of relation names R

for any YAGO ontology must contain at least the relation

names type, subClassOf, domain, range and subRelation-

Of. The set of common entities C must contain at least

the classes entity, class, relation, acyclicTransitive-

Relation and classes for all literals (as evident from the

following list). For the rest of the paper, we assume a given

set of common entities C and a given set of relations R.

The set of fact identiﬁers used by a YAGO ontology y is

implicitly given by I = domain(y). To deﬁne the semantics

of a YAGO ontology, we consider only the set of possible

facts F = (I ∪ C ∪ R) × R × (I ∪ C ∪ R). We deﬁne a

rewrite system → ⊆ P(F) × P(F), i.e. → reduces one

set of facts to another set of facts. We use the shorthand notation $\{f_1, \ldots, f_n\} \rightarrow f$ to say that

$F \cup \{f_1, \ldots, f_n\} \rightarrow F \cup \{f_1, \ldots, f_n\} \cup \{f\}$

for all $F \subseteq \mathcal{F}$, i.e. if a set of facts contains the facts $f_1, \ldots, f_n$, then the rewrite rule adds $f$ to this set. Our rewrite system contains the following (axiomatic) rules:

∅ → (domain, domain, relation)

∅ → (domain, range, class)

∅ → (range, domain, relation)

∅ → (range, range, class)

∅ → (subClassOf, type, acyclicTransitiveRelation)

∅ → (subClassOf, domain, class)

∅ → (subClassOf, range, class)

∅ → (type, range, class)

∅ → (subRelationOf, type, acyclicTransitiveRelation)

∅ → (subRelationOf, domain, relation)

∅ → (subRelationOf, range, relation)

∅ → (boolean, subClassOf, literal)

∅ → (number, subClassOf, literal)

∅ → (rationalNumber, subClassOf, number)

∅ → (integer, subClassOf, rationalNumber)

∅ → (timeInterval, subClassOf, literal)

∅ → (dateTime, subClassOf, timeInterval)

∅ → (date, subClassOf, timeInterval)

∅ → (string, subClassOf, literal)

∅ → (character, subClassOf, string)

∅ → (word, subClassOf, string)

∅ → (URL, subClassOf, string)

Furthermore, it contains the following rules for all $r, r_1, r_2 \in R$, $x, y, c, c_1, c_2 \in I \cup C \cup R$, $r_1 \neq \text{type}$, $r_2 \neq \text{subRelationOf}$, $r \neq \text{subRelationOf}$, $r \neq \text{type}$, $c \neq \text{acyclicTransitiveRelation}$, $c_2 \neq \text{acyclicTransitiveRelation}$:

(1) $\{(r_1, \text{subRelationOf}, r_2), (x, r_1, y)\} \rightarrow (x, r_2, y)$

(2) $\{(r, \text{type}, \text{acyclicTransitiveRelation}), (x, r, y), (y, r, z)\} \rightarrow (x, r, z)$

(3) $\{(r, \text{domain}, c), (x, r, y)\} \rightarrow (x, \text{type}, c)$

(4) $\{(r, \text{range}, c), (x, r, y)\} \rightarrow (y, \text{type}, c)$

(5) $\{(x, \text{type}, c_1), (c_1, \text{subClassOf}, c_2)\} \rightarrow (x, \text{type}, c_2)$
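Rules (1) to (5) can be read as a fixpoint computation over a set of triples. The following is a naive sketch of that reading, not the authors' implementation: it applies the side conditions literally by variable name (an assumption about how they attach to each rule), leaves out the axiomatic rules, and makes no attempt at efficiency.

```python
ATR = "acyclicTransitiveRelation"
SKIP = ("type", "subRelationOf")  # relation names excluded by the side conditions


def derivable_facts(facts):
    """Apply rules (1)-(5) to a set of (x, r, y) triples until nothing new appears."""
    derived = set(facts)
    while True:
        new = set()
        for (a, p, b) in derived:
            for (c, q, d) in derived:
                # (1) (r1 subRelationOf r2), (x r1 y) -> (x r2 y)
                if p == "subRelationOf" and q == a and a != "type" and b != "subRelationOf":
                    new.add((c, b, d))
                # (2) r acyclic-transitive, (x r y), (y r z) -> (x r z)
                if p == q and b == c and p not in SKIP and (p, "type", ATR) in derived:
                    new.add((a, p, d))
                # (3) (r domain c), (x r y) -> (x type c)
                if p == "domain" and q == a and a not in SKIP and b != ATR:
                    new.add((c, "type", b))
                # (4) (r range c), (x r y) -> (y type c)
                if p == "range" and q == a and a not in SKIP and b != ATR:
                    new.add((d, "type", b))
                # (5) (x type c1), (c1 subClassOf c2) -> (x type c2)
                if p == "type" and q == "subClassOf" and b == c and d != ATR:
                    new.add((a, "type", d))
        if new <= derived:
            return derived
        derived |= new


# Illustrative input: a small class hierarchy and one individual.
example = {
    ("subClassOf", "type", ATR),
    ("physicist", "subClassOf", "scientist"),
    ("scientist", "subClassOf", "person"),
    ("AlbertEinstein", "type", "physicist"),
}
closure = derivable_facts(example)
```

On this input the loop adds that physicist is a subclass of person (rule 2) and that AlbertEinstein is a scientist and a person (rule 5). Since every derived triple only uses names that already occur in the input, the set of candidates is finite and the loop terminates.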

Theorem 1: [Convergence of →]

Given a set of facts $F \subset \mathcal{F}$, the largest set $S$ with $F \rightarrow^{*} S$ is unique.

(The theorems are proven in the appendix.) Given a YAGO ontology y, the rules of → can be applied to its set of facts, range(y). We call the largest set that can be produced by applying the rules of → the set of derivable facts of y, D(y).

Footnote to the list of literal classes above: The class hierarchy of literals is inspired by SUMO [18].

Two YAGO ontologies $y_1$, $y_2$ are equivalent if the fact identifiers in $y_2$ can be renamed so that

$(y_1 \subseteq y_2 \vee y_2 \subseteq y_1) \wedge D(y_1) = D(y_2)$

The deductive closure of a YAGO ontology y is computed by adding the derivable facts to y. Each derivable fact (x, r, y) needs a new fact identifier, which is just $f_{x,r,y}$. Using a relational notation for the function y, we can write this as

$y^{*} := y \cup \{\, (f_{x,r,y}, (x, r, y)) \mid (x, r, y) \in D(y),\ (x, r, y) \notin \mathrm{range}(y) \,\}$

A structure for a YAGO ontology y is a triple of

• a set U (the universe)

• a function D : I ∪ C ∪ R → U (the denotation)

• a function E : D(R) → U ×U (the extension function)

Like in RDFS, a YAGO structure needs to deﬁne the exten-

sions of the relations by the extension function E. E maps

the denotation of a relation symbol to a relation on universe

elements. We deﬁne the interpretation Ψ with respect to a

structure < U, D, E > as the following relation:

$\Psi := \{\, (e_1, r, e_2) \mid (D(e_1), D(e_2)) \in E(D(r)) \,\}$

We say that a fact $(e_1, r, e_2)$ is true in a structure, if it

is contained in the interpretation. A model of a YAGO

ontology y is a structure such that

1. all facts of $y^{*}$ are true

2. if Ψ(x, type, literal) for some x, then D(x) = x

3. if Ψ(r, type, acyclicTransitiveRelation) for some r,

then there exists no x such that Ψ(x, r, x)

A YAGO ontology y is called consistent iff there exists a model for it. Obviously, a YAGO ontology is consistent iff

$\nexists\, x, r : (r, \text{type}, \text{acyclicTransitiveRelation}) \in D(y) \wedge (x, r, x) \in D(y)$

Since D(y) is ﬁnite, the consistency of a YAGO ontology is

decidable. A base of a YAGO ontology y is any equivalent

YAGO ontology b with b ⊆ y. A canonical base of y is a

base so that there exists no other base with less elements.
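The consistency criterion stated above amounts to a single scan over the derivable facts. The sketch below assumes that D(y) has already been computed and is passed in as a set of (x, r, y) triples; the function name is hypothetical.

```python
ATR = "acyclicTransitiveRelation"


def is_consistent(derivable):
    """Check that no acyclic transitive relation relates an entity to itself.

    `derivable` is assumed to be the full set D(y) of derivable triples.
    """
    acyclic = {x for (x, r, c) in derivable if r == "type" and c == ATR}
    return not any(x == y and r in acyclic for (x, r, y) in derivable)


# Illustrative inputs: a clean hierarchy and one containing a cycle.
ok = {
    ("subClassOf", "type", ATR),
    ("physicist", "subClassOf", "scientist"),
}
cyclic = {
    ("subClassOf", "type", ATR),
    ("a", "subClassOf", "b"),
    ("b", "subClassOf", "a"),
    ("a", "subClassOf", "a"),  # derivable from the two facts above by transitivity
    ("b", "subClassOf", "b"),
}
```

The check is only meaningful on the closed set: a cycle such as a, b, a shows up as the reflexive fact (a, subClassOf, a) only after transitivity has been applied.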

Theorem 2: [Uniqueness of the Canonical Base]

The canonical base of a consistent YAGO ontology is unique.

In fact, the canonical base of a YAGO ontology can be com-

puted by greedily removing derivable facts from the ontol-

ogy. This makes the canonical base a natural choice to eﬃ-

ciently store a YAGO ontology.
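The greedy computation of the canonical base can be sketched as follows. To keep the example short, the closure below implements only rules (2) and (5) of Section 2.2, which is a deliberate simplification and an assumption of this sketch; fact identifiers are ignored and facts are plain triples.

```python
ATR = "acyclicTransitiveRelation"


def closure(facts):
    """Fixpoint of rules (2) and (5) only (a reduced rule set for illustration)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        acyclic = {x for (x, r, c) in derived if r == "type" and c == ATR}
        for (x, r, y) in list(derived):
            for (y2, r2, z) in list(derived):
                if y != y2:
                    continue
                new = None
                if r == r2 and r in acyclic and r not in ("type", "subRelationOf"):
                    new = (x, r, z)            # rule (2): transitivity
                elif r == "type" and r2 == "subClassOf" and z != ATR:
                    new = (x, "type", z)       # rule (5): type inheritance
                if new is not None and new not in derived:
                    derived.add(new)
                    changed = True
    return derived


def canonical_base(facts):
    """Greedily drop every fact that is derivable from the remaining facts."""
    base = set(facts)
    for fact in sorted(facts):  # fixed order, for reproducibility only
        rest = base - {fact}
        if fact in closure(rest):
            base = rest
    return base


stored = {
    ("subClassOf", "type", ATR),
    ("physicist", "subClassOf", "scientist"),
    ("scientist", "subClassOf", "person"),
    ("physicist", "subClassOf", "person"),    # redundant: follows by transitivity
    ("AlbertEinstein", "type", "physicist"),
    ("AlbertEinstein", "type", "person"),     # redundant: follows by rule (5)
}
base = canonical_base(stored)
```

The two redundant facts are removed, and the closure of the resulting base equals the closure of the original set, so no information is lost by storing only the base.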

2.3 Relation to Other Formalisms

The YAGO model is very similar to RDFS. In RDFS, rela-

tions are called properties. Just as YAGO, RDFS knows the

properties domain, range, subClassOf and subPropertyOf

(i.e. subRelationOf). These properties have a semantics

that is equivalent to that of the corresponding YAGO re-

lations. RDFS also knows fact identiﬁers, which can occur

as arguments of other facts. The following excerpt shows

how some sample facts of Section 2.1 can be represented in

RDFS. Each fact of YAGO becomes a triple in RDFS.

<rdf:Description

rdf:about="http://mpii.mpg.de/yago#Albert_Einstein">

<yago:bornInYear rdf:ID="f1">1879</yago:bornInYear>

</rdf:Description>

<rdf:Description

rdf:about="http://mpii.mpg.de/yago#f1">

<yago:foundIn rdf:ID="f2" rdf:resource="http:..."/>

</rdf:Description>

However, RDFS does not have a built-in transitive relation

or an acyclic transitive relation, as YAGO does. This en-

tails that the property acyclicTransitiveRelation can be

deﬁned and used, but that RDFS would not know its se-

mantics.

YAGO uses fact identiﬁers, but it does not have built-

in relations to make logical assertions about facts (e.g. it

does not allow one to say that a fact is false). If one relies on

the denotation to map a fact identiﬁer to the corresponding

fact element in the universe, one can consider fact identi-

ﬁers as simple individuals. This abandons the syntactic link

between a fact identiﬁer and the fact. In return, it opens

up the possibility of mapping a YAGO ontology to an OWL

ontology under certain conditions. OWL has built-in coun-

terparts for almost all built-in data types, classes, and rela-

tions of YAGO. The only concept that does not have an ex-

act built-in counterpart is the acyclicTransitiveRelation.

However, this is about to change. OWL is currently being

reﬁned to its successor, OWL 1.1. The extended description

logic SROIQ [12], which has been adopted as the logical

basis of OWL 1.1, allows to express irreﬂexivity and transi-

tivity. This allows to deﬁne acyclic transitivity. We plan to

investigate the relation of YAGO and OWL once OWL 1.1

has been fully established.

3. SOURCES FOR YAGO

3.1 WordNet

WordNet is a semantic lexicon for the English language developed at the Cognitive Science Laboratory of Princeton University. WordNet distinguishes between words as literally appearing in texts and the actual senses of the words. A set of words that share one sense is called a synset. Thus, each synset identifies one sense (i.e., semantic concept). Words with multiple meanings (ambiguous words) belong to multiple synsets. As of the current version 2.1, WordNet contains 81,426 synsets for 117,097 unique nouns. (WordNet also includes other types of words like verbs and adjectives, but we consider only nouns in this paper.) WordNet provides relations between synsets such as hypernymy/hyponymy (i.e., the relation between a sub-concept and a super-concept) and holonymy/meronymy (i.e., the relation between a part and the whole); for this paper, we focus on hypernyms/hyponyms. Conceptually, the hypernymy relation in WordNet spans a directed acyclic graph (DAG) with a single source node called Entity.
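The structure described above can be pictured with a small Python sketch. The data is a toy example, not actual WordNet content: the synset identifiers, the word-to-synset table, and the `ancestors` helper are all hypothetical. Each synset lists its direct hypernyms, an ambiguous word maps to several synsets, and following hypernym links upward always ends at the single source node.

```python
# Toy hypernym DAG: synset -> list of direct hypernyms (NOT real WordNet data).
hypernyms = {
    "entity": [],
    "organization": ["entity"],
    "location": ["entity"],
    "financial_institution": ["organization"],
    "slope": ["location"],
    "bank#1": ["financial_institution"],
    "bank#2": ["slope"],
}

# An ambiguous word belongs to several synsets, one per sense.
senses = {"bank": ["bank#1", "bank#2"]}


def ancestors(synset):
    """Collect all direct and indirect hypernyms of a synset."""
    seen = set()
    stack = list(hypernyms[synset])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(hypernyms[node])
    return seen


# The source nodes are the synsets without any hypernym.
roots = [s for s, parents in hypernyms.items() if not parents]
```

In this toy graph, both senses of "bank" reach `entity`, but through different intermediate concepts.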

3.2 Wikipedia

Wikipedia is a multilingual, Web-based encyclopedia. It is written collaboratively by volunteers and is available for free. We downloaded the English version of Wikipedia in January 2007, which comprised 1,600,000 articles at that time. Each Wikipedia article is a single Web page and usually describes a single topic.

The majority of Wikipedia pages have been manually assigned to one or multiple categories. The page about Albert

Frequently Asked Questions (13)

Q1. What are the contributions in "Yago: a core of semantic knowledge unifying WordNet and Wikipedia"?

The authors present YAGO, a light-weight and extensible ontology with high coverage and quality. The facts have been automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. Finally, the authors show how YAGO can be further extended by state-of-the-art information extraction techniques.

Q2. What is the future work in "Yago: a core of semantic knowledge unifying WordNet and Wikipedia"?

The authors observed that the more facts YAGO contains, the easier it is to extend it by further facts. The authors hypothesize that this positive feedback loop could facilitate the growth of the knowledge base in the future. On the theoretical side, the authors plan to investigate the relationship between OWL 1.1 and the YAGO model, once OWL 1.1 has been fully developed. On the practical side, the authors plan to enrich YAGO by further facts that go beyond the current somewhat arbitrary relations, including high-confidence facts from gazetteers, but also extracted information from Web pages.

Q3. What is the role of ontology in data cleaning?

Ontological knowledge structures play an important role in data cleaning (e.g., for a data warehouse) [6], record linkage (a.k.a. entity resolution) [7], and information integration in general [19].

Q4. What is the purpose of the cleaning step?

This is why a cleaning step is necessary, in which the system filters out all facts with arguments that are not in the domain of the previously established type relation.

Q5. What is the rewrite rule for a YAGO ontology?

The authors use the shorthand notation $\{f_1, \dots, f_n\} \hookrightarrow f$ to say that $F \cup \{f_1, \dots, f_n\} \rightarrow F \cup \{f_1, \dots, f_n\} \cup \{f\}$ for all $F \subseteq \mathcal{F}$, i.e., if a set of facts contains the facts $f_1, \dots, f_n$, then the rewrite rule adds $f$ to this set.

Q6. When is a YAGO ontology consistent?

A YAGO ontology $y$ is consistent iff $\nexists\, x, r : (r, \mathrm{type}, \mathrm{acyclicTransitiveRelation}) \in D(y) \wedge (x, r, x) \in D(y)$. Since $D(y)$ is finite, the consistency of a YAGO ontology is decidable.

Q7. How can the authors exploit Wikipedia's categories and link structure?

The categorization of Wikipedia pages and their link structure are available as SQL tables, so that they can be exploited without parsing the actual Wikipedia articles.

Q8. What information about witnesses will enable applications to use?

The information about witnesses will enable applications to use, e.g., only facts extracted by a certain technique, facts extracted from a certain source or facts of a certain date.

Q9. In how many cases is an entity contributed by both WordNet and Wikipedia?

There are roughly 15,000 cases in which an entity is contributed by both WordNet and Wikipedia (i.e., a WordNet synset contains a common noun that is the name of a Wikipedia page).

Q10. What quality would such an ontology need to have?

Such an ontology would have to be of high quality, with accuracy close to 100 percent, i.e. comparable in quality to an encyclopedia.

Q11. What is the relation between a part and the whole?

WordNet provides relations between synsets such as hypernymy/hyponymy (i.e., the relation between a sub-concept and a super-concept) and holonymy/meronymy (i.e., the relation between a part and the whole); for this paper, the authors focus on hypernyms/hyponyms.

Q12. What is the common way to deal with this problem?

One common way to deal with this problem is to use n-ary relations (as for example in won-prize-in-year(Einstein, Nobel-Prize, 1921)).

Q13. How many facts are tagged with their confidence?

All facts are tagged with their empirical confidence estimation (see Section 5.1.1), which lies between 0.90 and 0.98.
