
Showing papers by "Christian Bizer" published in 2012


Proceedings ArticleDOI
30 Mar 2012
TL;DR: Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods, is presented; it is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion.
Abstract: The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality. However, quality is a very subjective matter, and finding a canonic judgement that is suitable for each and every task is not feasible. To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion. We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.

263 citations
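
To make the assessment-and-fusion idea concrete, the following Python sketch illustrates the kind of policy Sieve supports: fusing conflicting values by preferring the fresher source. It is an illustration only; Sieve itself is configured declaratively within LDIF, and the record fields, scoring function and fusion strategy below are invented for the example.

    from datetime import date

    # Hypothetical conflicting values for one property of one real-world object,
    # as they might arrive from the English and Portuguese DBpedia editions.
    candidates = [
        {"source": "dbpedia-en", "value": 11200000, "last_modified": date(2012, 3, 1)},
        {"source": "dbpedia-pt", "value": 11316149, "last_modified": date(2012, 5, 20)},
    ]

    def recency_score(record):
        """Quality assessment: more recently modified values score higher."""
        return record["last_modified"].toordinal()

    def fuse_keep_best(records, score):
        """Fusion: keep only the value from the highest-scoring source."""
        return max(records, key=score)

    best = fuse_keep_best(candidates, recency_score)
    print(best["source"], best["value"])  # dbpedia-pt 11316149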


Journal ArticleDOI
11 Jan 2012
TL;DR: Twenty-five Semantic Web and Database researchers met at the 2011 STI Semantic Summit in Riga, Latvia July 6-8, 2011 to discuss the opportunities and challenges posed by Big Data.
Abstract: Twenty-five Semantic Web and Database researchers met at the 2011 STI Semantic Summit in Riga, Latvia July 6-8, 2011[1] to discuss the opportunities and challenges posed by Big Data for the Semantic Web, Semantic Technologies, and Database communities. The unanimous conclusion was that the greatest shared challenge was not only engineering Big Data, but also doing so meaningfully. The following are four expressions of that challenge from different perspectives.

228 citations


Proceedings Article
01 May 2012
TL;DR: This paper describes the general DBpedia knowledge base as well as the DBpedia data sets that specifically aim at supporting computational linguistics tasks, including Entity Linking, Word Sense Disambiguation, Question Answering, Slot Filling and Relationship Extraction.
Abstract: The DBpedia project extracts structured information from Wikipedia editions in 97 different languages and combines this information into a large multi-lingual knowledge base covering many specific domains and general world knowledge. The knowledge base contains textual descriptions (titles and abstracts) of concepts in up to 97 languages. It also contains structured knowledge that has been extracted from the infobox systems of Wikipedias in 15 different languages and is mapped onto a single consistent ontology by a community effort. The knowledge base can be queried using the SPARQL query language and all its data sets are freely available for download. In this paper, we describe the general DBpedia knowledge base as well as the DBpedia data sets that specifically aim at supporting computational linguistics tasks. These tasks include Entity Linking, Word Sense Disambiguation, Question Answering, Slot Filling and Relationship Extraction. These use cases are outlined, pointing at added value that the structured data of DBpedia provides.

167 citations
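
As a small illustration of the point that the knowledge base can be queried with SPARQL, the sketch below asks the public DBpedia endpoint for the English abstract of one resource using the SPARQLWrapper library. The endpoint URL and the dbo:abstract property reflect the public DBpedia service and may change over time; treat them as assumptions.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {
            <http://dbpedia.org/resource/Berlin> dbo:abstract ?abstract .
            FILTER (lang(?abstract) = "en")
        }
    """)

    # Print the first 200 characters of each matching abstract.
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["abstract"]["value"][:200])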


Journal ArticleDOI
01 Jul 2012
TL;DR: GenLink, as discussed by the authors, learns linkage rules from a set of existing reference links using genetic programming and is capable of generating linkage rules which select discriminative properties for comparison, apply chains of data transformations to normalize property values, choose appropriate distance measures and thresholds, and combine the results of multiple comparisons using non-linear aggregation functions.
Abstract: A central problem in data integration and data cleansing is to find entities in different data sources that describe the same real-world object. Many existing methods for identifying such entities rely on explicit linkage rules which specify the conditions that entities must fulfill in order to be considered to describe the same real-world object. In this paper, we present the GenLink algorithm for learning expressive linkage rules from a set of existing reference links using genetic programming. The algorithm is capable of generating linkage rules which select discriminative properties for comparison, apply chains of data transformations to normalize property values, choose appropriate distance measures and thresholds, and combine the results of multiple comparisons using non-linear aggregation functions. Our experiments show that the GenLink algorithm outperforms the state-of-the-art genetic programming approach to learning linkage rules recently presented by Carvalho et al. and is capable of learning linkage rules which achieve a similar accuracy to human-written rules for the same problem.

107 citations
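
The following sketch shows, in plain Python, the shape of rule the abstract describes: value transformations, a similarity measure with a threshold, and a non-linear aggregation of several comparisons. The field names, weights and thresholds are invented; this is not GenLink's actual rule representation.

    from difflib import SequenceMatcher

    def title_similarity(a, b):
        """Comparison preceded by a chain of transformations (lowercase, strip)."""
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    def year_similarity(a, b):
        return 1.0 if a == b else 0.0

    def linkage_rule(entity_a, entity_b, threshold=0.8):
        """Decide whether two records describe the same real-world object."""
        title_sim = title_similarity(entity_a["title"], entity_b["title"])
        year_sim = year_similarity(entity_a["year"], entity_b["year"])
        # Non-linear aggregation: weighted geometric mean of the comparisons.
        score = (title_sim ** 0.7) * (year_sim ** 0.3)
        return score >= threshold

    a = {"title": "The Matrix ", "year": 1999}
    b = {"title": "the matrix", "year": 1999}
    print(linkage_rule(a, b))  # True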


Posted Content
TL;DR: The GenLink algorithm for learning expressive linkage rules from a set of existing reference links using genetic programming is presented; it is capable of generating linkage rules which select discriminative properties for comparison, apply chains of data transformations to normalize property values, and combine the results of multiple comparisons using non-linear aggregation functions.
Abstract: A central problem in data integration and data cleansing is to find entities in different data sources that describe the same real-world object. Many existing methods for identifying such entities rely on explicit linkage rules which specify the conditions that entities must fulfill in order to be considered to describe the same real-world object. In this paper, we present the GenLink algorithm for learning expressive linkage rules from a set of existing reference links using genetic programming. The algorithm is capable of generating linkage rules which select discriminative properties for comparison, apply chains of data transformations to normalize property values, choose appropriate distance measures and thresholds, and combine the results of multiple comparisons using non-linear aggregation functions. Our experiments show that the GenLink algorithm outperforms the state-of-the-art genetic programming approach to learning linkage rules recently presented by Carvalho et al. and is capable of learning linkage rules which achieve a similar accuracy to human-written rules for the same problem.

89 citations


01 Jan 2012
TL;DR: The LDIF Linked Data Integration Framework is presented; it provides an expressive mapping language for translating data from the various vocabularies used on the Web to a consistent, local target vocabulary and contains a data quality assessment and a data fusion module which allow Web data to be filtered according to different data quality assessment policies.
Abstract: While the Web of Linked Data grows rapidly, the development of Linked Data applications is still cumbersome and hampered due to the lack of software libraries for accessing, integrating and cleansing Linked Data from the Web. In order to make it easier to develop Linked Data applications, we provide the LDIF - Linked Data Integration Framework. LDIF can be used as a component within Linked Data applications to gather Linked Data from the Web and to translate the gathered data into a clean local target representation while keeping track of data provenance. LDIF provides a Linked Data crawler as well as components for accessing SPARQL endpoints and remote RDF dumps. It provides an expressive mapping language for translating data from the various vocabularies that are used on the Web to a consistent, local target vocabulary. LDIF includes an identity resolution component which discovers URI aliases in the input data and replaces them with a single target URI based on flexible, user-provided matching heuristics. For provenance tracking, the LDIF framework employs the Named Graphs data model. LDIF contains a data quality assessment and a data fusion module which allow Web data to be filtered according to different data quality assessment policies and provide for fusing Web data using different conflict resolution methods. In order to deal with use cases of different sizes, we provide an in-memory implementation of the LDIF framework as well as an RDF-store-backed implementation and a Hadoop implementation that can be deployed on Amazon EC2.

85 citations
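
The Named Graphs provenance model mentioned above can be illustrated with rdflib: every gathered statement is stored in a named graph identified by its source, so later quality assessment and fusion steps can see where each value came from. The graph identifiers and example vocabulary below are made up; this is a sketch of the data model, not of LDIF's API.

    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")
    ds = Dataset()

    # One named graph per data source, named after (fictitious) source URLs.
    src_a = ds.graph(URIRef("http://example.org/graphs/source-a"))
    src_b = ds.graph(URIRef("http://example.org/graphs/source-b"))

    src_a.add((EX.Berlin, EX.population, Literal(3500000)))
    src_b.add((EX.Berlin, EX.population, Literal(3501872)))

    # Provenance-aware access: each quad carries the graph (source) it came from.
    for s, p, o, g in ds.quads((EX.Berlin, EX.population, None, None)):
        print(g, o)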


01 Jan 2012
TL;DR: The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF-quads.
Abstract: More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF-quads. In this paper, we give an overview of the project and present statistics about the popularity of the different encoding standards as well as the kinds of data that are published using each format.

71 citations
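
The sketch below illustrates the output side of such an extraction pipeline only: a structured-data item found in a page is written as RDF quads, with the page URL as the fourth element so that every triple remains attributable to its source. The product item is invented and the code does not perform the actual Microdata/RDFa extraction.

    # A made-up schema.org product as it might be found embedded in a page.
    page_url = "http://shop.example.com/item/42"
    item = {
        "type": "http://schema.org/Product",
        "name": "Espresso Machine",
        "price": "199.00",
    }

    subject = "_:item1"  # blank node for the embedded item
    triples = [
        (subject, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", f"<{item['type']}>"),
        (subject, "http://schema.org/name", f"\"{item['name']}\""),
        (subject, "http://schema.org/price", f"\"{item['price']}\""),
    ]

    # Serialize as N-Quads: subject, predicate, object, graph (= source page URL).
    for s, p, o in triples:
        print(f"{s} <{p}> {o} <{page_url}> .")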


01 Jan 2012
TL;DR: A benchmark for comparing the expressivity as well as the runtime performance of data translation systems is presented, based on a set of examples from the LOD Cloud and a catalog of fifteen data translation patterns that aims to reflect the real-world heterogeneities that exist on the Web of Data.
Abstract: Linked Data sources on the Web use a wide range of different vocabularies to represent data describing the same type of entity. For some types of entities, like people or bibliographic records, common vocabularies have emerged that are used by multiple data sources. But even for representing data of these common types, different user communities use different competing common vocabularies. Linked Data applications that want to understand as much data from the Web as possible thus need to overcome vocabulary heterogeneity and translate the original data into a single target vocabulary. To support application developers with this integration task, several Linked Data translation systems have been developed. These systems provide languages to express declarative mappings that are used to translate heterogeneous Web data into a single target vocabulary. In this paper, we present a benchmark for comparing the expressivity as well as the runtime performance of data translation systems. Based on a set of examples from the LOD Cloud, we developed a catalog of fifteen data translation patterns and survey how often these patterns occur in the example set. Based on these statistics, we designed the LODIB (Linked Open Data Integration Benchmark) that aims to reflect the real-world heterogeneities that exist on the Web of Data. We apply the benchmark to test the performance of two data translation systems, Mosto and LDIF, and compare the performance of the systems with the SPARQL 1.1 CONSTRUCT query performance of the Jena TDB RDF store.

25 citations
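
As a minimal illustration of the data-translation task the benchmark measures, the sketch below runs a SPARQL 1.1 CONSTRUCT mapping with rdflib, translating a property from an invented source vocabulary into FOAF. The vocabularies and the mapping are examples, not patterns taken from the benchmark itself.

    from rdflib import Graph

    source = Graph()
    source.parse(data="""
        @prefix va: <http://vocab-a.example.org/> .
        <http://example.org/p1> va:fullName "Alice Example" .
    """, format="turtle")

    # Declarative mapping from the source vocabulary to the target vocabulary.
    mapping = """
        PREFIX va:   <http://vocab-a.example.org/>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        CONSTRUCT { ?person foaf:name ?name }
        WHERE     { ?person va:fullName ?name }
    """

    target = Graph()
    for triple in source.query(mapping):
        target.add(triple)

    print(target.serialize(format="turtle"))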


Book ChapterDOI
23 Jul 2012
TL;DR: This work presents an approach which combines genetic programming and active learning for the interactive generation of expressive linkage rules; the approach automates the generation of a linkage rule and only requires the user to confirm or decline a number of example links.
Abstract: The amount of data that is available as Linked Data on the Web has grown rapidly over the last years. However, the linkage between data sources remains sparse as setting RDF links means effort for the data publishers. Many existing methods for generating these links rely on explicit linkage rules which specify the conditions which must hold true for two entities in order to be interlinked. As writing good linkage rules by hand is a non-trivial problem, the burden to generate links between data sources is still high. In order to reduce the effort and required expertise to write linkage rules, we present an approach which combines genetic programming and active learning for the interactive generation of expressive linkage rules. Our approach automates the generation of a linkage rule and only requires the user to confirm or decline a number of example links. The algorithm minimizes user involvement by selecting example links which yield a high information gain. The proposed approach has been implemented in the Silk Link Discovery Framework. Within our experiments, the algorithm was capable of finding linkage rules with a full F1-measure by asking the user to confirm or decline at most 20 links.

16 citations
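
The sketch below captures the selection idea in isolation: out of a pool of candidate links, ask the user about the link on which the current set of candidate linkage rules disagrees most, since its label carries the most information. The committee of threshold rules and the disagreement measure are simplifications invented for the example, not the algorithm implemented in Silk.

    def disagreement(link, rules):
        """How evenly a committee of candidate linkage rules splits on a link."""
        votes = [rule(link) for rule in rules]
        positive = sum(votes) / len(votes)
        return 1.0 - abs(2 * positive - 1)  # 1.0 = maximal disagreement

    def select_query(candidate_links, rules):
        """Pick the candidate link the user should confirm or decline next."""
        return max(candidate_links, key=lambda link: disagreement(link, rules))

    # Toy committee: rules that compare a precomputed similarity against thresholds.
    rules = [lambda link, t=t: link["similarity"] >= t for t in (0.5, 0.7, 0.9)]
    candidates = [{"id": i, "similarity": s} for i, s in enumerate((0.95, 0.72, 0.40))]

    print(select_query(candidates, rules))  # the link with similarity 0.72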


Proceedings Article
01 May 2012
TL;DR: The impact of the phrase recognition step on the ability of the DBpedia Spotlight system to correctly reproduce the annotations of a gold standard in an unsupervised setting is evaluated.
Abstract: We have developed DBpedia Spotlight, a flexible concept tagging system that is able to annotate entities, topics and other terms in natural language text. The system starts by recognizing phrases to annotate in the input text, and subsequently disambiguates them to a reference knowledge base extracted from Wikipedia. In this paper we evaluate the impact of the phrase recognition step on the ability of the system to correctly reproduce the annotations of a gold standard in an unsupervised setting. We argue that a combination of techniques is needed, and we evaluate a number of alternatives according to an existing evaluation set.

9 citations
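
A quick way to try the system is its public annotation web service; the sketch below calls it with requests. The endpoint URL, parameter names and JSON keys are assumptions based on the public demo service and may differ for a locally deployed instance.

    import requests

    response = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": "Berlin is the capital of Germany.", "confidence": 0.5},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()

    # Each recognized surface form is linked to a DBpedia resource URI.
    for resource in response.json().get("Resources", []):
        print(resource["@surfaceForm"], "->", resource["@URI"])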


Book ChapterDOI
01 Jan 2012
TL;DR: The different techniques that are used to publish structured data on the Web are discussed and statistics about the amount and topics of the data currently published using each technique are provided.
Abstract: Over the last years, an increasing number of web sites have started to embed structured data into HTML documents as well as to publish structured data in addition to HTML documents directly on the Web. This trend has led to the extension of the Web with a global data space—the Web of Data. Like the classic document Web, the Web of Data covers a wide variety of topics ranging from data describing people, organizations, and events over products and reviews to statistical data provided by governments as well as research data from various scientific disciplines. This chapter gives an overview of the topology of the Web of Data. We discuss the different techniques that are used to publish structured data on the Web and provide statistics about the amount and topics of the data currently published using each technique.

Book
06 Aug 2012
TL;DR: The book gives an overview of information quality assessment in the context of web-based systems and develops a quality-driven information filtering framework that allows information consumers to apply a wide range of different filtering policies.
Abstract: Revision with unchanged content. Web-based information systems, such as search engines, news portals, electronic markets and community sites, provide access to information originating from numerous information providers. The quality of provided information varies as information providers have different levels of knowledge and different intentions. Users of web-based systems are therefore confronted with the increasingly difficult task of selecting high-quality information from the vast amount of Web-accessible information. How can information systems support users to distinguish high-quality from low-quality information? Which filtering mechanisms can be applied? How can filtering decisions be explained to the user? The book gives an overview of information quality assessment in the context of web-based systems. Afterwards, a quality-driven information filtering framework is developed. The framework allows information consumers to apply a wide range of different filtering policies. In order to facilitate the information consumers' understanding of filtering decisions, the framework generates explanations of why information satisfies a specific policy. The book targets Web developers who need to handle information quality problems within their applications as well as researchers working on the topic.
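
The sketch below illustrates the general idea of policy-based filtering with explanations in plain Python; the provider-rating policy, threshold and field names are invented and do not correspond to the framework described in the book.

    def apply_policy(statement, provider_ratings, min_rating=0.7):
        """Accept a statement only if its provider's rating clears the threshold,
        and return a human-readable explanation of the decision."""
        rating = provider_ratings.get(statement["provider"], 0.0)
        accepted = rating >= min_rating
        explanation = (
            f"Statement from '{statement['provider']}' was "
            f"{'accepted' if accepted else 'rejected'}: provider rating "
            f"{rating:.2f} vs. required minimum {min_rating:.2f}."
        )
        return accepted, explanation

    ratings = {"news-portal.example.org": 0.9, "anonymous-wiki.example.org": 0.4}
    statement = {"provider": "anonymous-wiki.example.org", "text": "..."}
    print(apply_policy(statement, ratings))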

Dataset
01 Aug 2012
TL;DR: The subset consists of 147 million relational tables, each describing a set of entities with one or more attributes.
Abstract: The subset consists of 147 million relational tables. In relational tables, a set of entities is described with one or more attributes.

Journal ArticleDOI
TL;DR: The Semantic Web Challenge 2011 took place at the 10th International Semantic Web Conference held in Bonn, Germany, from 23 to 27 October 2011 and required that applications be designed to operate in an open Web environment and that they utilize the semantics of the data which they process.