
Showing papers by "Christian Bizer" published in 2012


Proceedings ArticleDOI
30 Mar 2012
TL;DR: Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods, is presented; it is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion.
Abstract: The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality. However, quality is a very subjective matter, and finding a canonic judgement that is suitable for each and every task is not feasible. To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion. We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.

263 citations
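
To make the assessment-and-fusion idea concrete, the following Python sketch illustrates the kind of policy Sieve supports: fusing conflicting values by preferring the fresher source. It is an illustration only; Sieve itself is configured declaratively within LDIF, and the record fields, scoring function and fusion strategy below are invented for the example.

    from datetime import date

    # Hypothetical conflicting values for one property of one real-world object,
    # as they might arrive from the English and Portuguese DBpedia editions.
    candidates = [
        {"source": "dbpedia-en", "value": 11200000, "last_modified": date(2012, 3, 1)},
        {"source": "dbpedia-pt", "value": 11316149, "last_modified": date(2012, 5, 20)},
    ]

    def recency_score(record):
        """Quality assessment: more recently modified values score higher."""
        return record["last_modified"].toordinal()

    def fuse_keep_best(records, score):
        """Fusion: keep only the value from the highest-scoring source."""
        return max(records, key=score)

    best = fuse_keep_best(candidates, recency_score)
    print(best["source"], best["value"])  # dbpedia-pt 11316149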


Journal ArticleDOI
11 Jan 2012
TL;DR: Twenty-five Semantic Web and Database researchers met at the 2011 STI Semantic Summit in Riga, Latvia July 6-8, 2011 to discuss the opportunities and challenges posed by Big Data.
Abstract: Twenty-five Semantic Web and Database researchers met at the 2011 STI Semantic Summit in Riga, Latvia July 6-8, 2011[1] to discuss the opportunities and challenges posed by Big Data for the Semantic Web, Semantic Technologies, and Database communities. The unanimous conclusion was that the greatest shared challenge was not only engineering Big Data, but also doing so meaningfully. The following are four expressions of that challenge from different perspectives.

228 citations


Proceedings Article
01 May 2012
TL;DR: This paper describes the general DBpedia knowledge base as well as the DBpedia data sets that specifically aim at supporting computational linguistics tasks, including Entity Linking, Word Sense Disambiguation, Question Answering, Slot Filling and Relationship Extraction.
Abstract: The DBpedia project extracts structured information from Wikipedia editions in 97 different languages and combines this information into a large multi-lingual knowledge base covering many specific domains and general world knowledge. The knowledge base contains textual descriptions (titles and abstracts) of concepts in up to 97 languages. It also contains structured knowledge that has been extracted from the infobox systems of Wikipedias in 15 different languages and is mapped onto a single consistent ontology by a community effort. The knowledge base can be queried using the SPARQL query language and all its data sets are freely available for download. In this paper, we describe the general DBpedia knowledge base as well as the DBpedia data sets that specifically aim at supporting computational linguistics tasks. These tasks include Entity Linking, Word Sense Disambiguation, Question Answering, Slot Filling and Relationship Extraction. These use cases are outlined, pointing at added value that the structured data of DBpedia provides.

167 citations
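
As a small illustration of the point that the knowledge base can be queried with SPARQL, the sketch below asks the public DBpedia endpoint for the English abstract of one resource using the SPARQLWrapper library. The endpoint URL and the dbo:abstract property reflect the public DBpedia service and may change over time; treat them as assumptions.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {
            <http://dbpedia.org/resource/Berlin> dbo:abstract ?abstract .
            FILTER (lang(?abstract) = "en")
        }
    """)

    # Print the first 200 characters of each matching abstract.
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["abstract"]["value"][:200])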


Journal ArticleDOI
01 Jul 2012
TL;DR: GenLink, as discussed by the authors, learns linkage rules from a set of existing reference links using genetic programming and is capable of generating linkage rules which select discriminative properties for comparison, apply chains of data transformations to normalize property values, choose appropriate distance measures and thresholds, and combine the results of multiple comparisons using non-linear aggregation functions.
Abstract: A central problem in data integration and data cleansing is to find entities in different data sources that describe the same real-world object. Many existing methods for identifying such entities rely on explicit linkage rules which specify the conditions that entities must fulfill in order to be considered to describe the same real-world object. In this paper, we present the GenLink algorithm for learning expressive linkage rules from a set of existing reference links using genetic programming. The algorithm is capable of generating linkage rules which select discriminative properties for comparison, apply chains of data transformations to normalize property values, choose appropriate distance measures and thresholds, and combine the results of multiple comparisons using non-linear aggregation functions. Our experiments show that the GenLink algorithm outperforms the state-of-the-art genetic programming approach to learning linkage rules recently presented by Carvalho et al. and is capable of learning linkage rules which achieve a similar accuracy to human-written rules for the same problem.

107 citations
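
The following sketch shows, in plain Python, the shape of rule the abstract describes: value transformations, a similarity measure with a threshold, and a non-linear aggregation of several comparisons. The field names, weights and thresholds are invented; this is not GenLink's actual rule representation.

    from difflib import SequenceMatcher

    def title_similarity(a, b):
        """Comparison preceded by a chain of transformations (lowercase, strip)."""
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    def year_similarity(a, b):
        return 1.0 if a == b else 0.0

    def linkage_rule(entity_a, entity_b, threshold=0.8):
        """Decide whether two records describe the same real-world object."""
        title_sim = title_similarity(entity_a["title"], entity_b["title"])
        year_sim = year_similarity(entity_a["year"], entity_b["year"])
        # Non-linear aggregation: weighted geometric mean of the comparisons.
        score = (title_sim ** 0.7) * (year_sim ** 0.3)
        return score >= threshold

    a = {"title": "The Matrix ", "year": 1999}
    b = {"title": "the matrix", "year": 1999}
    print(linkage_rule(a, b))  # True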


Posted Content
TL;DR: The GenLink algorithm for learning expressive linkage rules from a set of existing reference links using genetic programming is presented; it is capable of generating linkage rules which select discriminative properties for comparison, apply chains of data transformations to normalize property values, and combine the results of multiple comparisons using non-linear aggregation functions.
Abstract: A central problem in data integration and data cleansing is to find entities in different data sources that describe the same real-world object. Many existing methods for identifying such entities rely on explicit linkage rules which specify the conditions that entities must fulfill in order to be considered to describe the same real-world object. In this paper, we present the GenLink algorithm for learning expressive linkage rules from a set of existing reference links using genetic programming. The algorithm is capable of generating linkage rules which select discriminative properties for comparison, apply chains of data transformations to normalize property values, choose appropriate distance measures and thresholds, and combine the results of multiple comparisons using non-linear aggregation functions. Our experiments show that the GenLink algorithm outperforms the state-of-the-art genetic programming approach to learning linkage rules recently presented by Carvalho et al. and is capable of learning linkage rules which achieve a similar accuracy to human-written rules for the same problem.

89 citations


01 Jan 2012
TL;DR: The LDIF Linked Data Integration Framework is presented; it provides an expressive mapping language for translating data from the various vocabularies used on the Web to a consistent, local target vocabulary and contains a data quality assessment and a data fusion module which allow Web data to be filtered according to different data quality assessment policies.
Abstract: While the Web of Linked Data grows rapidly, the development of Linked Data applications is still cumbersome and hampered due to the lack of software libraries for accessing, integrating and cleansing Linked Data from the Web. In order to make it easier to develop Linked Data applications, we provide the LDIF - Linked Data Integration Framework. LDIF can be used as a component within Linked Data applications to gather Linked Data from the Web and to translate the gathered data into a clean local target representation while keeping track of data provenance. LDIF provides a Linked Data crawler as well as components for accessing SPARQL endpoints and remote RDF dumps. It provides an expressive mapping language for translating data from the various vocabularies that are used on the Web to a consistent, local target vocabulary. LDIF includes an identity resolution component which discovers URI aliases in the input data and replaces them with a single target URI based on flexible, user-provided matching heuristics. For provenance tracking, the LDIF framework employs the Named Graphs data model. LDIF contains a data quality assessment and a data fusion module which allow Web data to be filtered according to different data quality assessment policies and provide for fusing Web data using different conflict resolution methods. In order to deal with use cases of different sizes, we provide an in-memory implementation of the LDIF framework as well as an RDF-store-backed implementation and a Hadoop implementation that can be deployed on Amazon EC2.

85 citations
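
The Named Graphs provenance model mentioned above can be illustrated with rdflib: every gathered statement is stored in a named graph identified by its source, so later quality assessment and fusion steps can see where each value came from. The graph identifiers and example vocabulary below are made up; this is a sketch of the data model, not of LDIF's API.

    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")
    ds = Dataset()

    # One named graph per data source, named after (fictitious) source URLs.
    src_a = ds.graph(URIRef("http://example.org/graphs/source-a"))
    src_b = ds.graph(URIRef("http://example.org/graphs/source-b"))

    src_a.add((EX.Berlin, EX.population, Literal(3500000)))
    src_b.add((EX.Berlin, EX.population, Literal(3501872)))

    # Provenance-aware access: each quad carries the graph (source) it came from.
    for s, p, o, g in ds.quads((EX.Berlin, EX.population, None, None)):
        print(g, o)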


01 Jan 2012
TL;DR: The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF-quads.
Abstract: More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF-quads. In this paper, we give an overview of the project and present statistics about the popularity of the different encoding standards as well as the kinds of data that are published using each format.

71 citations
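
The sketch below illustrates the output side of such an extraction pipeline only: a structured-data item found in a page is written as RDF quads, with the page URL as the fourth element so that every triple remains attributable to its source. The product item is invented and the code does not perform the actual Microdata/RDFa extraction.

    # A made-up schema.org product as it might be found embedded in a page.
    page_url = "http://shop.example.com/item/42"
    item = {
        "type": "http://schema.org/Product",
        "name": "Espresso Machine",
        "price": "199.00",
    }

    subject = "_:item1"  # blank node for the embedded item
    triples = [
        (subject, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", f"<{item['type']}>"),
        (subject, "http://schema.org/name", f"\"{item['name']}\""),
        (subject, "http://schema.org/price", f"\"{item['price']}\""),
    ]

    # Serialize as N-Quads: subject, predicate, object, graph (= source page URL).
    for s, p, o in triples:
        print(f"{s} <{p}> {o} <{page_url}> .")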


01 Jan 2012
TL;DR: A benchmark for comparing the expressivity as well as the runtime performance of data translation systems is presented, based on a set of examples from the LOD Cloud and a catalog of fifteen data translation patterns that aims to reflect the real-world heterogeneities that exist on the Web of Data.
Abstract: Linked Data sources on the Web use a wide range of different vocabularies to represent data describing the same type of entity. For some types of entities, like people or bibliographic records, common vocabularies have emerged that are used by multiple data sources. But even for representing data of these common types, different user communities use different competing common vocabularies. Linked Data applications that want to understand as much data from the Web as possible thus need to overcome vocabulary heterogeneity and translate the original data into a single target vocabulary. To support application developers with this integration task, several Linked Data translation systems have been developed. These systems provide languages to express declarative mappings that are used to translate heterogeneous Web data into a single target vocabulary. In this paper, we present a benchmark for comparing the expressivity as well as the runtime performance of data translation systems. Based on a set of examples from the LOD Cloud, we developed a catalog of fifteen data translation patterns and survey how often these patterns occur in the example set. Based on these statistics, we designed the LODIB (Linked Open Data Integration Benchmark) that aims to reflect the real-world heterogeneities that exist on the Web of Data. We apply the benchmark to test the performance of two data translation systems, Mosto and LDIF, and compare the performance of the systems with the SPARQL 1.1 CONSTRUCT query performance of the Jena TDB RDF store.

25 citations
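
As a minimal illustration of the data-translation task the benchmark measures, the sketch below runs a SPARQL 1.1 CONSTRUCT mapping with rdflib, translating a property from an invented source vocabulary into FOAF. The vocabularies and the mapping are examples, not patterns taken from the benchmark itself.

    from rdflib import Graph

    source = Graph()
    source.parse(data="""
        @prefix va: <http://vocab-a.example.org/> .
        <http://example.org/p1> va:fullName "Alice Example" .
    """, format="turtle")

    # Declarative mapping from the source vocabulary to the target vocabulary.
    mapping = """
        PREFIX va:   <http://vocab-a.example.org/>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        CONSTRUCT { ?person foaf:name ?name }
        WHERE     { ?person va:fullName ?name }
    """

    target = Graph()
    for triple in source.query(mapping):
        target.add(triple)

    print(target.serialize(format="turtle"))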


Book ChapterDOI
23 Jul 2012
TL;DR: This work presents an approach which combines genetic programming and active learning for the interactive generation of expressive linkage rules; the approach automates the generation of a linkage rule and only requires the user to confirm or decline a number of example links.
Abstract: The amount of data that is available as Linked Data on the Web has grown rapidly over the last years. However, the linkage between data sources remains sparse as setting RDF links means effort for the data publishers. Many existing methods for generating these links rely on explicit linkage rules which specify the conditions which must hold true for two entities in order to be interlinked. As writing good linkage rules by hand is a non-trivial problem, the burden to generate links between data sources is still high. In order to reduce the effort and required expertise to write linkage rules, we present an approach which combines genetic programming and active learning for the interactive generation of expressive linkage rules. Our approach automates the generation of a linkage rule and only requires the user to confirm or decline a number of example links. The algorithm minimizes user involvement by selecting example links which yield a high information gain. The proposed approach has been implemented in the Silk Link Discovery Framework. Within our experiments, the algorithm was capable of finding linkage rules with a full F1-measure by asking the user to confirm or decline at most 20 links.

16 citations
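
The sketch below captures the selection idea in isolation: out of a pool of candidate links, ask the user about the link on which the current set of candidate linkage rules disagrees most, since its label carries the most information. The committee of threshold rules and the disagreement measure are simplifications invented for the example, not the algorithm implemented in Silk.

    def disagreement(link, rules):
        """How evenly a committee of candidate linkage rules splits on a link."""
        votes = [rule(link) for rule in rules]
        positive = sum(votes) / len(votes)
        return 1.0 - abs(2 * positive - 1)  # 1.0 = maximal disagreement

    def select_query(candidate_links, rules):
        """Pick the candidate link the user should confirm or decline next."""
        return max(candidate_links, key=lambda link: disagreement(link, rules))

    # Toy committee: rules that compare a precomputed similarity against thresholds.
    rules = [lambda link, t=t: link["similarity"] >= t for t in (0.5, 0.7, 0.9)]
    candidates = [{"id": i, "similarity": s} for i, s in enumerate((0.95, 0.72, 0.40))]

    print(select_query(candidates, rules))  # the link with similarity 0.72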


Proceedings Article
01 May 2012
TL;DR: The impact of the phrase recognition step on the ability of the DBpedia Spotlight system to correctly reproduce the annotations of a gold standard in an unsupervised setting is evaluated.
Abstract: We have developed DBpedia Spotlight, a flexible concept tagging system that is able to annotate entities, topics and other terms in natural language text. The system starts by recognizing phrases to annotate in the input text, and subsequently disambiguates them to a reference knowledge base extracted from Wikipedia. In this paper we evaluate the impact of the phrase recognition step on the ability of the system to correctly reproduce the annotations of a gold standard in an unsupervised setting. We argue that a combination of techniques is needed, and we evaluate a number of alternatives according to an existing evaluation set.

9 citations
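
A quick way to try the system is its public annotation web service; the sketch below calls it with requests. The endpoint URL, parameter names and JSON keys are assumptions based on the public demo service and may differ for a locally deployed instance.

    import requests

    response = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": "Berlin is the capital of Germany.", "confidence": 0.5},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()

    # Each recognized surface form is linked to a DBpedia resource URI.
    for resource in response.json().get("Resources", []):
        print(resource["@surfaceForm"], "->", resource["@URI"])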


Book ChapterDOI
01 Jan 2012
TL;DR: The different techniques that are used to publish structured data on the Web are discussed and statistics about the amount and topics of the data currently published using each technique are provided.
Abstract: Over the last years, an increasing number of web sites have started to embed structured data into HTML documents as well as to publish structured data in addition to HTML documents directly on the Web. This trend has led to the extension of the Web with a global data space—the Web of Data. Like the classic document Web, the Web of Data covers a wide variety of topics ranging from data describing people, organizations, and events over products and reviews to statistical data provided by governments as well as research data from various scientific disciplines. This chapter gives an overview of the topology of the Web of Data. We discuss the different techniques that are used to publish structured data on the Web and provide statistics about the amount and topics of the data currently published using each technique.

Book
06 Aug 2012
TL;DR: The book gives an overview of information quality assessment in the context of web-based systems and develops a quality-driven information filtering framework that allows information consumers to apply a wide range of different filtering policies.
Abstract: Revision with unchanged content. Web-based information systems, such as search engines, news portals, electronic markets and community sites, provide access to information originating from numerous information providers. The quality of provided information varies as information providers have different levels of knowledge and different intentions. Users of web-based systems are therefore confronted with the increasingly difficult task of selecting high-quality information from the vast amount of Web-accessible information. How can information systems support users to distinguish high-quality from low-quality information? Which filtering mechanisms can be applied? How can filtering decisions be explained to the user? The book gives an overview of information quality assessment in the context of web-based systems. Afterwards, a quality-driven information filtering framework is developed. The framework allows information consumers to apply a wide range of different filtering policies. In order to facilitate the information consumers' understanding of filtering decisions, the framework generates explanations of why information satisfies a specific policy. The book targets Web developers who need to handle information quality problems within their applications as well as researchers working on the topic.
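
The sketch below illustrates the general idea of policy-based filtering with explanations in plain Python; the provider-rating policy, threshold and field names are invented and do not correspond to the framework described in the book.

    def apply_policy(statement, provider_ratings, min_rating=0.7):
        """Accept a statement only if its provider's rating clears the threshold,
        and return a human-readable explanation of the decision."""
        rating = provider_ratings.get(statement["provider"], 0.0)
        accepted = rating >= min_rating
        explanation = (
            f"Statement from '{statement['provider']}' was "
            f"{'accepted' if accepted else 'rejected'}: provider rating "
            f"{rating:.2f} vs. required minimum {min_rating:.2f}."
        )
        return accepted, explanation

    ratings = {"news-portal.example.org": 0.9, "anonymous-wiki.example.org": 0.4}
    statement = {"provider": "anonymous-wiki.example.org", "text": "..."}
    print(apply_policy(statement, ratings))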

Dataset
01 Aug 2012
TL;DR: The subset consists of 147 million relational tables, each describing a set of entities with one or more attributes.
Abstract: The subset consists of 147 million relational tables. In relational tables, a set of entities is described with one or more attributes.

Journal ArticleDOI
TL;DR: The Semantic Web Challenge 2011 took place at the 10th International Semantic Web Conference held in Bonn, Germany, from 23 to 27 October 2011 and required that applications be designed to operate in an open Web environment and that they utilize the semantics of the data which they process.