Yago: a core of semantic knowledge
Summary
1.1 Motivation
- Many applications in modern information technology utilize ontological background knowledge.
- Such background knowledge would have to comprise not only concepts in the style of WordNet, but also named entities like people, organizations, geographic locations, books, songs, products, etc., and also relations among these such as what-is-located-where, who-was-born-when, who-has-won-which-prize, etc.
- If such an ontology were available, it could boost the performance of existing applications and also open up the path towards new applications in the Semantic Web era.
1.3 Contributions and Outline
- This paper presents YAGO, a new ontology that combines high coverage with high quality.
- Category pages are lists of articles that belong to a specific category (e.g., Zidane is in the category of French football players).
- To the best of their knowledge, their method is the first approach that accomplishes this unification between WordNet and facts derived from Wikipedia with an accuracy of 97%.
- The authors observe that the more facts YAGO contains, the better it can be extended.
- YAGO stands for "Yet Another Great Ontology". (Soccer is called football in some countries.)
- The current YAGO is assembled from two sources, namely Wikipedia and WordNet.
2.1 Structure
- This makes it possible to express that a certain word refers to a certain entity, as in the following example: "Einstein" means AlbertEinstein.
- In the YAGO model, relations are entities as well.
- Common entities that are not classes will be called individuals.
- Then, an n-ary fact can be represented by a new entity that is linked by these binary relations to all of its arguments (as is proposed for OWL): (AlbertEinstein, winner, EinsteinWonNP1921); (NobelPrize, prize, EinsteinWonNP1921); (1921, time, EinsteinWonNP1921).
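The representation of an n-ary fact by binary facts can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: the `Fact` type, the `reify` helper, and the `#1`-style fact identifiers are hypothetical names chosen for the example.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    fact_id: str   # hypothetical identifier format, e.g. "#1"
    arg1: str
    relation: str
    arg2: str


def reify(event: str, roles: dict[str, str]) -> list[Fact]:
    """Represent an n-ary fact as binary facts that all point to one event entity."""
    facts = []
    for i, (relation, argument) in enumerate(roles.items(), start=1):
        # Each argument is linked to the event entity by its own binary relation.
        facts.append(Fact(f"#{i}", argument, relation, event))
    return facts


nobel = reify(
    "EinsteinWonNP1921",
    {"winner": "AlbertEinstein", "prize": "NobelPrize", "time": "1921"},
)
for fact in nobel:
    print(fact.arg1, fact.relation, fact.arg2)
```

Every argument of the original ternary fact ends up in its own binary fact, and the shared entity EinsteinWonNP1921 ties them together.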
2.2 Semantics
- This section will give a model-theoretic semantics to YAGO.
- The set of common entities C must contain at least the classes entity, class, relation, acyclicTransitiveRelation and classes for all literals (as evident from the following list).
- Each derivable fact (x, r, y) needs a new fact identifier, which is just $f_{x,r,y}$.
- This makes the canonical base a natural choice to efficiently store a YAGO ontology.
2.3 Relation to Other Formalisms
- Like YAGO, RDFS knows the properties domain, range, subClassOf and subPropertyOf (i.e., subRelationOf).
- These properties have a semantics that is equivalent to that of the corresponding YAGO relations.
- The authors plan to investigate the relation of YAGO and OWL once OWL 1.1 has been fully established.
3.1 WordNet
- WordNet is a semantic lexicon for the English language developed at the Cognitive Science Laboratory of Princeton University.
- WordNet distinguishes between words as literally appearing in texts and the actual senses of the words.
- Thus, each synset identifies one sense (i.e., semantic concept).
- WordNet provides relations between synsets such as hypernymy/hyponymy (i.e., the relation between a sub-concept and a super-concept) and holonymy/meronymy (i.e., the relation between a part and the whole); for this paper, the authors focus on hypernyms/hyponyms.
3.2 Wikipedia
- The authors downloaded the English version of Wikipedia in January 2007, which comprised 1,600,000 articles at that time.
- Each Wikipedia article is a single Web page and usually describes a single topic.
- The majority of Wikipedia pages have been manually assigned to one or multiple categories.
- The page about Albert Einstein, for example, is in the categories German language philosophers, Swiss physicists, and 34 more.
- The categorization of Wikipedia pages and their link structure are available as SQL tables, so that they can be exploited without parsing the actual Wikipedia articles.
4. THE YAGO SYSTEM
- The authors' system is designed to extract a YAGO ontology from WordNet and Wikipedia.
- Facts extracted by other techniques (e.g. based on statistical learning) can have smaller confidence values.
- This gives us a (possibly empty) set of conceptual categories for each Wikipedia page.
- First, the authors introduce a class for each synset known to WordNet (e.g., city).
- If the words used to refer to these individuals match the common pattern of a given name and a family name, the authors extract the name components and establish the relations givenNameOf and familyNameOf.
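The name-component extraction described above could look roughly like the following. This is a minimal sketch under stated assumptions: the regular expression is a simplified stand-in for "the common pattern of a given name and a family name", the function `name_facts` is hypothetical, and the argument order (name first, individual second) is assumed by analogy with the means relation.

```python
import re

# Assumed, simplified pattern: exactly two capitalized tokens.
NAME_PATTERN = re.compile(r"^([A-Z][a-z]+) ([A-Z][a-z]+)$")


def name_facts(word, individual):
    """Return givenNameOf/familyNameOf facts if the word looks like a person name."""
    match = NAME_PATTERN.match(word)
    if match is None:
        return []  # the word does not match the pattern, so nothing is extracted
    given, family = match.groups()
    return [
        (given, "givenNameOf", individual),
        (family, "familyNameOf", individual),
    ]


print(name_facts("Albert Einstein", "AlbertEinstein"))
print(name_facts("Paris", "Paris"))
```

A real extractor would need to handle middle names, particles, and non-ASCII characters, which this pattern deliberately ignores.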
4.2 YAGO Storage
- The YAGO model itself is independent of a particular data storage format.
- The authors maintain a folder for each relation and each folder contains files that list the entity pairs.
- The authors store only facts that cannot be derived by the rewrite rules of YAGO (see 2.2), so that they store in fact the unique canonical base of the ontology.
- The table has the simple schema FACTS(factId, arg1, relation, arg2, confidence).
- For their experiments, the authors used the Oracle version of YAGO.
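As an illustration of the FACTS schema, the table can be set up with Python's built-in `sqlite3` module. The authors used Oracle, so SQLite is a substitution made here for self-containment; the column types, the sample rows, and the confidence values are assumptions for the example, not data from YAGO.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE facts (
        factId     TEXT PRIMARY KEY,
        arg1       TEXT NOT NULL,
        relation   TEXT NOT NULL,
        arg2       TEXT NOT NULL,
        confidence REAL NOT NULL
    )
    """
)

# Illustrative rows only; identifiers and confidences are placeholders.
rows = [
    ("#1", "AlbertEinstein", "type", "physicist", 0.95),
    ("#2", "AlbertEinstein", "bornInYear", "1879", 0.95),
    ("#3", "Einstein", "means", "AlbertEinstein", 0.95),
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?, ?)", rows)


def facts_about(entity):
    """Return (relation, arg2) pairs for all facts whose first argument is the entity."""
    cursor = conn.execute(
        "SELECT relation, arg2 FROM facts WHERE arg1 = ? ORDER BY relation",
        (entity,),
    )
    return cursor.fetchall()


print(facts_about("AlbertEinstein"))
```

Because relations are entities too, a fact about a relation (or about another fact, via its factId) fits into the same table without schema changes.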
4.3 Enriching YAGO
- An application that adds new facts to the YAGO ontology is required to obey the following protocol.
- For the disambiguation, the application can make use of the extensive information that YAGO provides for the existing entities: the relations to other entities, the words used to refer to the entities, and the context of the entities, as provided by the context relation.
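- If the fact already exists, its confidence has to be recomputed from the confidences of its witnesses; the authors propose to take the maximum, but other options can be considered.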
- If (x, r, y) does not yet exist in the ontology, the application has to add the fact together with a new fact identifier.
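The two cases of the protocol (fact already present, fact new) can be sketched as follows. This is a simplified illustration that assumes "take the maximum" refers to combining the stored confidence with the newly reported one; the `Ontology` class and its method names are hypothetical, and the disambiguation step is left out.

```python
import itertools


class Ontology:
    """Toy fact store keyed by (x, r, y); not the authors' implementation."""

    def __init__(self):
        self._facts = {}                 # (x, r, y) -> [fact_id, confidence]
        self._ids = itertools.count(1)   # source of fresh fact identifiers

    def add(self, x, r, y, confidence):
        key = (x, r, y)
        if key in self._facts:
            # Existing fact: keep its identifier, aggregate confidence by maximum.
            entry = self._facts[key]
            entry[1] = max(entry[1], confidence)
            return entry[0]
        # New fact: store it together with a new fact identifier.
        fact_id = f"#{next(self._ids)}"
        self._facts[key] = [fact_id, confidence]
        return fact_id

    def confidence(self, x, r, y):
        return self._facts[(x, r, y)][1]


onto = Ontology()
first = onto.add("AlbertEinstein", "bornInYear", "1879", 0.8)
second = onto.add("AlbertEinstein", "bornInYear", "1879", 0.9)
print(first == second, onto.confidence("AlbertEinstein", "bornInYear", "1879"))
```

Other aggregation functions (for example a noisy-or over witnesses) could be swapped in at the `max` call.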
5.1 Manual evaluation
- The authors presented randomly selected facts of the ontology to human judges and asked them to assess whether the facts were correct.
- Since common sense often does not suffice to judge the correctness of YAGO facts, the authors also presented the judges with a snippet of the corresponding Wikipedia page.
- Furthermore, accuracy can usually be varied at the cost of recall.
- State-of-the-art taxonomy induction as described in [23] achieves an accuracy of 84%. KnowItAll [9] and KnowItNow [4] are reported to have accuracy rates of 85% and 80%, respectively.
- With the exception of Cyc (which is not publicly available), the facts of these ontologies are in the hundreds of thousands, whereas the facts of YAGO are in the millions.
5.2 Sample facts
- In YAGO, the word "Paris" can refer to 71 distinct entities.
- Preprocessing ensures that words in the query are considered in all their possible meanings.
- The query algorithms are not in the scope of this paper.
- Here, the authors only show some sample queries to illustrate the applicability of YAGO (Table 6).
5.3 Enrichment experiment
- To demonstrate how an application can add new facts to the YAGO ontology, the authors conducted an experiment with the knowledge extraction system Leila [25].
- Leila is a state-of-the-art system that uses pattern matching on natural language text.
- This relation holds between a company and the city of its headquarters.
- For each candidate fact, the company and the city have to be mapped to the respective individuals in YAGO.
- Hence the authors assume that the more facts and entities YAGO contains, the better it can be extended by new facts.
6. CONCLUSION
- The authors presented YAGO, a large and extendable ontology of high quality.
- YAGO contains 1 million entities and 5 million facts – more than any other publicly available formal ontology.
- YAGO is available in different export formats, including plain text, XML, RDFS and SQL database formats at http://www.mpii.mpg.de/~suchanek/yago.
- YAGO opens the door to numerous new challenges.
Frequently Asked Questions (13)
Q2. What future work is planned in "Yago: a core of semantic knowledge unifying wordnet and wikipedia"?
The authors observed that the more facts YAGO contains, the easier it is to extend it by further facts. The authors hypothesize that this positive feedback loop could facilitate the growth of the knowledge base in the future. On the theoretical side, the authors plan to investigate the relationship between OWL 1.1 and the YAGO model, once OWL 1.1 has been fully developed. On the practical side, the authors plan to enrich YAGO by further facts that go beyond the current somewhat arbitrary relations – including high confidence facts from gazetteers, but also extracted information from Web pages.
Q3. What is the role of ontology in data cleaning?
Ontological knowledge structures play an important role in data cleaning (e.g., for a data warehouse) [6], record linkage (a.k.a. entity resolution) [7], and information integration in general [19].
Q4. What is the purpose of the cleaning step?
This is why a cleaning step is necessary, in which the system filters out all facts with arguments that are not in the domain of the previously established type relation.
Q5. What is the rewrite rule for a YAGO ontology?
The authors use the shorthand notation $\{f_1, \ldots, f_n\} \hookrightarrow f$ to say that $F \cup \{f_1, \ldots, f_n\} \rightarrow F \cup \{f_1, \ldots, f_n\} \cup \{f\}$ for all $F \subseteq \mathcal{F}$, i.e. if a set of facts contains the facts $f_1, \ldots, f_n$, then the rewrite rule adds $f$ to this set.
Q6. When is a YAGO ontology consistent?
A YAGO ontology $y$ is consistent iff $\nexists\, x, r : (r, \textit{type}, \textit{acyclicTransitiveRelation}) \in D(y) \wedge (x, r, x) \in D(y)$. Since $D(y)$ is finite, the consistency of a YAGO ontology is decidable.
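The consistency condition (no derivable fact (x, r, x) for any r of type acyclicTransitiveRelation) can be checked mechanically once the derivable facts are computed. The sketch below is an illustration only: it assumes a single rewrite rule, transitivity for relations typed acyclicTransitiveRelation, whereas the full rule set is not reproduced in this summary. The function names `derive` and `is_consistent` are hypothetical.

```python
def transitive_relations(facts):
    """Relations r for which (r, type, acyclicTransitiveRelation) is present."""
    return {x for (x, r, y) in facts
            if r == "type" and y == "acyclicTransitiveRelation"}


def derive(facts):
    """Apply the assumed transitivity rule until no new fact appears (a fixpoint)."""
    derived = set(facts)
    while True:
        transitive = transitive_relations(derived)
        # {(x, r, y), (y, r, z)} yields (x, r, z) for every transitive r.
        new = {(x, r, z)
               for (x, r, y) in derived if r in transitive
               for (y2, r2, z) in derived if r2 == r and y2 == y}
        if new <= derived:
            return derived
        derived |= new


def is_consistent(facts):
    """True iff no derivable fact (x, r, x) exists for an acyclic transitive r."""
    derived = derive(facts)
    transitive = transitive_relations(derived)
    return not any(x == y and r in transitive for (x, r, y) in derived)


ok = {
    ("subClassOf", "type", "acyclicTransitiveRelation"),
    ("physicist", "subClassOf", "scientist"),
    ("scientist", "subClassOf", "person"),
}
cyclic = ok | {("person", "subClassOf", "physicist")}
print(is_consistent(ok), is_consistent(cyclic))
```

The loop terminates because the set of possible facts over a finite set of entities is finite, which mirrors the decidability argument in the answer above.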
Q7. How can Wikipedia's categorization and link structure be exploited without parsing the articles?
The categorization of Wikipedia pages and their link structure are available as SQL tables, so that they can be exploited without parsing the actual Wikipedia articles.
Q8. What information about witnesses will enable applications to use?
The information about witnesses will enable applications to use, e.g., only facts extracted by a certain technique, facts extracted from a certain source or facts of a certain date.
Q9. In how many cases is an entity contributed by both WordNet and Wikipedia?
There are roughly 15,000 cases, in which an entity is contributed by both WordNet and Wikipedia (i.e. a WordNet synset contains a common noun that is the name of a Wikipedia page).
Q10. What quality would such an ontology need to have?
Such an ontology would have to be of high quality, with accuracy close to 100 percent, i.e. comparable in quality to an encyclopedia.
Q11. What is the relation between a part and the whole?
WordNet provides relations between synsets such as hypernymy/hyponymy (i.e., the relation between a sub-concept and a super-concept) and holonymy/meronymy (i.e., the relation between a part and the whole); for this paper, the authors focus on hypernyms/hyponyms.
Q12. What is one common way to represent facts with more than two arguments?
One common way to deal with this problem is to use n-ary relations (as for example in won-prize-in-year(Einstein, Nobel-Prize, 1921)).
Q13. Which facts are tagged with a confidence value?
All facts are tagged with their empirical confidence estimation (see Section 5.1.1), which lies between 0.90 and 0.98.