Proceedings ArticleDOI

Sieve: linked data quality assessment and fusion

30 Mar 2012, pp. 116-123
TL;DR: Sieve, a framework for flexibly expressing quality assessment and fusion methods, is presented; it is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion.
Abstract: The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality. However, quality is a very subjective matter, and finding a canonic judgement that is suitable for each and every task is not feasible. To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion. We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.
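
To make the abstract's two steps concrete, here is a minimal sketch (illustrative only: the source scores, property URI and fusion function below are hypothetical, and Sieve itself expresses assessment and fusion declaratively inside LDIF rather than in code): each source receives a quality score, and conflicting property values are fused by keeping the value from the highest-scoring source.

    # Sketch of quality-driven fusion: score each source, then resolve a
    # conflicting property value by keeping the value from the highest-scoring
    # source. All names, scores and numbers are hypothetical; Sieve itself is
    # configured declaratively inside LDIF rather than written as code.
    source_scores = {          # quality assessment output, e.g. recency-based
        "en.dbpedia.org": 0.9,
        "pt.dbpedia.org": 0.6,
    }

    conflicting_values = [     # (source, property, value) for one entity
        ("en.dbpedia.org", "http://dbpedia.org/ontology/populationTotal", 3500000),
        ("pt.dbpedia.org", "http://dbpedia.org/ontology/populationTotal", 3460000),
    ]

    def fuse_keep_best(values, scores):
        """Keep the value whose source has the highest quality score."""
        source, prop, value = max(values, key=lambda t: scores.get(t[0], 0.0))
        return prop, value, source

    prop, value, source = fuse_keep_best(conflicting_values, source_scores)
    print(f"{prop} = {value} (from {source})")
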
Citations
Journal ArticleDOI
TL;DR: A comprehensive introduction to knowledge graphs is provided, tracing the historical events that led to the interweaving of data and knowledge.
Abstract: In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.

560 citations


Cites background from "Sieve: linked data quality assessme..."

  • ...[348] distinguish intensional conciseness (schema level), which refers to the case when the data does not contain redundant schema elements (properties, classes, shapes, etc....

Journal ArticleDOI
TL;DR: A systematic review of approaches for assessing the quality of Linked Data, which unifies and formalizes commonly used terminologies across papers related to data quality and provides a comprehensive list of 18 quality dimensions and 69 metrics.
Abstract: The development and standardization of semantic web technologies has resulted in an unprecedented volume of data being published on the Web as Linked Data (LD). However, we observe widely varying data quality ranging from extensively curated datasets to crowdsourced and extracted data of relatively low quality. In this article, we present the results of a systematic review of approaches for assessing the quality of LD. We gather existing approaches and analyze them qualitatively. In particular, we unify and formalize commonly used terminologies across papers related to data quality and provide a comprehensive list of 18 quality dimensions and 69 metrics. Additionally, we qualitatively analyze the 30 core approaches and 12 tools using a set of attributes. The aim of this article is to provide researchers and data curators a comprehensive understanding of existing work, thereby encouraging further experimentation and development of new approaches focused towards data quality, specifically for LD.
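
As one concrete illustration of the kind of metric such surveys catalogue (a sketch over an assumed toy dataset, not code from the article): extensional conciseness can be computed as the ratio of unique real-world instances to all instance representations in a dataset.

    # Sketch: extensional conciseness = unique instances / all instance
    # representations. The URIs and sameAs mapping below are illustrative
    # assumptions.
    instance_representations = [
        "http://dbpedia.org/resource/Lisbon",
        "http://pt.dbpedia.org/resource/Lisboa",  # co-refers with Lisbon
        "http://dbpedia.org/resource/Berlin",
    ]

    # Assume identity resolution already mapped duplicate URIs to a canonical one.
    same_as = {"http://pt.dbpedia.org/resource/Lisboa":
               "http://dbpedia.org/resource/Lisbon"}

    canonical = {same_as.get(uri, uri) for uri in instance_representations}
    extensional_conciseness = len(canonical) / len(instance_representations)
    print(round(extensional_conciseness, 2))  # 2 unique / 3 representations = 0.67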

531 citations


Cites background from "Sieve: linked data quality assessme..."

  • ...no. of unique instances of a dataset / total number of instance representations in the dataset [45], ∗ 1 − total no....

  • ..., 2012 [45] Sieve: Linked Data Quality Assessment and Fusion...

  • ...[45] distinguished completeness on the schema and the data level....

  • ...[45] that “a dataset is consistent if it is free of conflicting information”....

  • ...of classes and properties [18,45] – CM2: property completeness (i) no....

Proceedings ArticleDOI
07 Apr 2014
TL;DR: This work presents a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development, and argues that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality.
Abstract: Linked Open Data (LOD) comprises an unprecedented volume of structured data on the Web. However, these datasets are of varying quality ranging from extensively curated datasets to crowdsourced or extracted data of often relatively low quality. We present a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development. We argue that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality. We present a methodology for assessing the quality of linked data resources, based on a formalization of bad smells and data quality problems. Our formalization employs SPARQL query templates, which are instantiated into concrete quality test case queries. Based on an extensive survey, we compile a comprehensive library of data quality test case patterns. We perform automatic test case instantiation based on schema constraints or semi-automatically enriched schemata and allow the user to generate specific test case instantiations that are applicable to a schema or dataset. We provide an extensive evaluation of five LOD datasets, manual test case instantiation for five schemas and automatic test case instantiations for all available schemata registered with Linked Open Vocabularies (LOV). One of the main advantages of our approach is that domain specific semantics can be encoded in the data quality test cases, thus being able to discover data quality problems beyond conventional quality heuristics.
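
A minimal sketch of the template-instantiation idea described above (the template, property and sample data are hypothetical, not taken from the paper's pattern library): a SPARQL query template with placeholders is instantiated from a schema constraint, and every row it returns counts as a quality violation.

    # Sketch: instantiating a data-quality test-case template as a SPARQL query
    # and running it against a small RDF graph (requires the rdflib package).
    from rdflib import Graph

    # Pattern: "values of %PROP% must have datatype %DATATYPE%".
    TEMPLATE = """
    SELECT ?s ?value WHERE {
      ?s <%PROP%> ?value .
      FILTER (datatype(?value) != <%DATATYPE%>)
    }
    """

    def instantiate(template, prop, datatype):
        """Turn a test-case pattern into a concrete test query."""
        return template.replace("%PROP%", prop).replace("%DATATYPE%", datatype)

    g = Graph()
    g.parse(format="turtle", data="""
        @prefix ex:  <http://example.org/> .
        @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
        ex:berlin ex:population "3500000"^^xsd:integer .
        ex:lisbon ex:population "unknown" .
    """)

    query = instantiate(TEMPLATE,
                        "http://example.org/population",
                        "http://www.w3.org/2001/XMLSchema#integer")
    violations = list(g.query(query))
    print(f"{len(violations)} violation(s)")  # flags the malformed Lisbon value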

271 citations


Additional excerpts

  • ...ACM 978-1-4503-2744-2/14/04. http://dx.doi.org/10.1145/2566486.2568002 ....

BookDOI
01 Jan 2016
C. Batini, M. Scannapieco: Data and Information Quality

186 citations


Cites background from "Sieve: linked data quality assessme..."

  • ...Reputation of the dataset: analyzing references or page rank, or by assigning a reputation score to the dataset [436]...

  • ..., properties and classes) of a dataset in relation to the overall number of schema elements in a schema [436]....

  • ...Extensional conciseness measures the number of unique entities in relation to the overall number of entities in the dataset [436]....

Book
Eero Hyvönen
19 Oct 2012
TL;DR: This book gives an overview on why, when, and how Linked (Open) Data and Semantic Web technologies can be employed in practice in publishing CH collections and other content on the Web, and motivates and presents a general semantic portal model and publishing framework as a solution approach to distributed semantic content creation, based on an ontology infrastructure.
Abstract: Cultural Heritage (CH) data is syntactically and semantically heterogeneous, multilingual, semantically rich, and highly interlinked. It is produced in a distributed, open fashion by museums, libraries, archives, and media organizations, as well as individual persons. Managing publication of such richness and variety of content on the Web, and at the same time supporting distributed, interoperable content creation processes, poses challenges where traditional publication approaches need to be re-thought. Application of the principles and technologies of Linked Data and the Semantic Web is a new, promising approach to address these problems. This development is leading to the creation of large national and international CH portals, such as Europeana, to large open data repositories, such as the Linked Open Data Cloud, and massive publications of linked library data in the U.S., Europe, and Asia. Cultural Heritage has become one of the most successful application domains of Linked Data and Semantic Web technologies. This book gives an overview on why, when, and how Linked (Open) Data and Semantic Web technologies can be employed in practice in publishing CH collections and other content on the Web. The text first motivates and presents a general semantic portal model and publishing framework as a solution approach to distributed semantic content creation, based on an ontology infrastructure. On the Semantic Web, such an infrastructure includes shared metadata models, ontologies, and logical reasoning, and is supported by shared ontology and other Web services alleviating the use of the new technology and linked data in legacy cataloging systems. The goal of all this is to provide layman users and researchers with new, more intelligent and usable Web applications that can be utilized by other Web applications, too, via well-defined Application Programming Interfaces (API). At the same time, it is possible to provide publishing organizations with more cost-efficient solutions for content creation and publication. This book is targeted to computer scientists, museum curators, librarians, archivists, and other CH professionals interested in Linked Data and CH applications on the Semantic Web. The text is focused on practice and applications, making it suitable to students, researchers, and practitioners developing Web services and applications of CH, as well as to CH managers willing to understand the technical issues and challenges involved in linked data publication. Table of Contents: Cultural Heritage on the Semantic Web / Portal Model for Collaborative CH Publishing / Requirements for Publishing Linked Data / Metadata Schemas / Domain Vocabularies and Ontologies / Logic Rules for Cultural Heritage / Cultural Content Creation / Semantic Services for Human and Machine Users / Conclusions

155 citations

References
Book
01 Jan 1951

2,217 citations


"Sieve: linked data quality assessme..." refers background in this paper

  • ...A popular definition for quality is “fitness for use” [11]....

Book
02 Feb 2011
TL;DR: This Synthesis lecture provides readers with a detailed technical introduction to Linked Data, including coverage of relevant aspects of Web architecture, as the basis for application development, research or further study.
Abstract: The World Wide Web has enabled the creation of a global information space comprising linked documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to raw data not currently available on the Web or bound up in hypertext documents. Linked Data provides a publishing paradigm in which not only documents, but also data, can be a first class citizen of the Web, thereby enabling the extension of the Web with a global data space based on open standards - the Web of Data. In this Synthesis lecture we provide readers with a detailed technical introduction to Linked Data. We begin by outlining the basic principles of Linked Data, including coverage of relevant aspects of Web architecture. The remainder of the text is based around two main themes - the publication and consumption of Linked Data. Drawing on a practical Linked Data scenario, we provide guidance and best practices on: architectural approaches to publishing Linked Data; choosing URIs and vocabularies to identify and describe resources; deciding what data to return in a description of a resource on the Web; methods and frameworks for automated linking of data sets; and testing and debugging approaches for Linked Data deployments. We give an overview of existing Linked Data applications and then examine the architectures that are used to consume Linked Data from the Web, alongside existing tools and frameworks that enable these. Readers can expect to gain a rich technical understanding of Linked Data fundamentals, as the basis for application development, research or further study.

2,174 citations


"Sieve: linked data quality assessme..." refers background in this paper

  • ...) Applications that consume data from the Linked Data cloud are confronted with the challenge of obtaining a homogenized view of this global data space [8]....

Journal ArticleDOI
TL;DR: This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion.
Abstract: The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation.This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.
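
To illustrate the goal the article formalizes (a sketch with assumed data and assumed per-attribute resolution strategies; the article itself works with relational-algebra and SQL operators): records describing the same real-world object are fused into one representation by resolving each conflicting attribute.

    # Sketch: fuse multiple records describing the same real-world object into
    # one clean representation. The records and conflict-resolution strategies
    # are illustrative assumptions.
    from collections import Counter

    records = [
        {"source": "A", "name": "Berlin", "population": 3500000, "country": "Germany"},
        {"source": "B", "name": "Berlin", "population": 3460000, "country": None},
        {"source": "C", "name": "Berlín", "population": None,    "country": "Germany"},
    ]

    def fuse(records, strategies):
        """Resolve each attribute's conflicting non-null values independently."""
        fused = {}
        for attr, resolve in strategies.items():
            values = [r[attr] for r in records if r.get(attr) is not None]
            fused[attr] = resolve(values) if values else None
        return fused

    strategies = {
        "name": lambda vs: Counter(vs).most_common(1)[0][0],  # majority vote
        "population": max,                                    # keep largest
        "country": lambda vs: vs[0],                          # first non-null
    }

    print(fuse(records, strategies))
    # {'name': 'Berlin', 'population': 3500000, 'country': 'Germany'}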

1,797 citations


"Sieve: linked data quality assessme..." refers background or methods in this paper

  • ...Data Integration is commonly applied in order to increase data quality along at least three dimensions: completeness, conciseness and consistency [5]....

  • ...[4] J. Bleiholder and F. Naumann....

  • ...In the context of data integration, Data Fusion is defined as the “process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation” [5]....

  • ...Similarly, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [5]....

  • ...Our data fusion framework is inspired by the work of Bleiholder and Naumann [3]....

Book
20 Feb 2011
TL;DR: This Synthesis lecture provides readers with a detailed technical introduction to Linked Data, including coverage of relevant aspects of Web architecture, and provides guidance and best practices on architectural approaches to publishing Linked Data.
Abstract: The World Wide Web has enabled the creation of a global information space comprising linked documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to raw data not currently available on the Web or bound up in hypertext documents. Linked Data provides a publishing paradigm in which not only documents, but also data, can be a first class citizen of the Web, thereby enabling the extension of the Web with a global data space based on open standards - the Web of Data. In this Synthesis lecture we provide readers with a detailed technical introduction to Linked Data. We begin by outlining the basic principles of Linked Data, including coverage of relevant aspects of Web architecture. The remainder of the text is based around two main themes - the publication and consumption of Linked Data. Drawing on a practical Linked Data scenario, we provide guidance and best practices on: architectural approaches to publishing Linked Data; choosing URIs and vocabularies to identify and describe resources; deciding what data to return in a description of a resource on the Web; methods and frameworks for automated linking of data sets; and testing and debugging approaches for Linked Data deployments. We give an overview of existing Linked Data applications and then examine the architectures that are used to consume Linked Data from the Web, alongside existing tools and frameworks that enable these. Readers can expect to gain a rich technical understanding of Linked Data fundamentals, as the basis for application development, research or further study. Table of Contents: List of Figures / Introduction / Principles of Linked Data / The Web of Data / Linked Data Design Considerations / Recipes for Publishing Linked Data / Consuming Linked Data / Summary and Outlook

1,149 citations

Book ChapterDOI
06 Nov 2009
TL;DR: The Silk Linking Framework is presented: a toolkit for discovering and maintaining data links between Web data sources; its protocol allows data sources to exchange both linksets and detailed change information and enables continuous link recomputation.
Abstract: The Web of Data is built upon two simple ideas: Employ the RDF data model to publish structured data on the Web and to create explicit data links between entities within different data sources. This paper presents the Silk Linking Framework, a toolkit for discovering and maintaining data links between Web data sources. Silk consists of three components: 1. A link discovery engine, which computes links between data sources based on a declarative specification of the conditions that entities must fulfill in order to be interlinked; 2. A tool for evaluating the generated data links in order to fine-tune the linking specification; 3. A protocol for maintaining data links between continuously changing data sources. The protocol allows data sources to exchange both linksets as well as detailed change information and enables continuous link recomputation. The interplay of all the components is demonstrated within a life science use case.
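
A rough illustration of the link-discovery step (a generic similarity-based sketch with assumed data and threshold; Silk expresses such conditions in its own declarative linkage specifications, not in Python): entity pairs from two sources are compared by a string-similarity condition, and pairs above the threshold become owl:sameAs links.

    # Sketch: similarity-based link discovery between two sources. The labels,
    # threshold and comparison function are illustrative assumptions.
    from difflib import SequenceMatcher

    source_a = {"http://dbpedia.org/resource/Lisbon": "Lisbon",
                "http://dbpedia.org/resource/Berlin": "Berlin"}
    source_b = {"http://pt.dbpedia.org/resource/Lisboa": "Lisboa",
                "http://pt.dbpedia.org/resource/Berlim": "Berlim"}

    def similarity(a, b):
        """Normalized string similarity in [0, 1]."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    THRESHOLD = 0.7  # link condition: label similarity must reach 0.7

    links = [(ua, ub)
             for ua, la in source_a.items()
             for ub, lb in source_b.items()
             if similarity(la, lb) >= THRESHOLD]

    for ua, ub in links:
        print(f"<{ua}> owl:sameAs <{ub}> .")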

353 citations


"Sieve: linked data quality assessme..." refers methods in this paper

  • ...Through the identity resolution module (Silk) and the URI translation module, these identity links will be used to merge object descriptions into one URI per object....

  • ...Silk Server - Adding missing Links while consuming Linked Data....

  • ...Furthermore, since data may also include multiple identifiers for the same real-world entity, LDIF also supports an Identity Resolution step through Silk [14][9]....

  • ...As an additional normalization step, LDIF's URITranslator module can homogenize all URI aliases identified as duplicates by Silk, grouping all properties of those URIs into one canonical target URI....
