Proceedings ArticleDOI

Sieve: linked data quality assessment and fusion

30 Mar 2012, pp. 116-123
TL;DR: Sieve, a framework for flexibly expressing quality assessment and fusion methods, is presented; it is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion.
Abstract: The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality. However, quality is a very subjective matter, and finding a canonic judgement that is suitable for each and every task is not feasible. To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion. We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.
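
To make the abstract's two steps concrete, here is a minimal sketch (illustrative only: the source scores, property URI and fusion function below are hypothetical, and Sieve itself expresses assessment and fusion declaratively inside LDIF rather than in code): each source receives a quality score, and conflicting property values are fused by keeping the value from the highest-scoring source.

    # Sketch of quality-driven fusion: score each source, then resolve a
    # conflicting property value by keeping the value from the highest-scoring
    # source. All names, scores and numbers are hypothetical; Sieve itself is
    # configured declaratively inside LDIF rather than written as code.
    source_scores = {          # quality assessment output, e.g. recency-based
        "en.dbpedia.org": 0.9,
        "pt.dbpedia.org": 0.6,
    }

    conflicting_values = [     # (source, property, value) for one entity
        ("en.dbpedia.org", "http://dbpedia.org/ontology/populationTotal", 3500000),
        ("pt.dbpedia.org", "http://dbpedia.org/ontology/populationTotal", 3460000),
    ]

    def fuse_keep_best(values, scores):
        """Keep the value whose source has the highest quality score."""
        source, prop, value = max(values, key=lambda t: scores.get(t[0], 0.0))
        return prop, value, source

    prop, value, source = fuse_keep_best(conflicting_values, source_scores)
    print(f"{prop} = {value} (from {source})")
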
Citations
Journal ArticleDOI
TL;DR: A comprehensive introduction to knowledge graphs is provided, tracing the historical events that led to the interweaving of data and knowledge.
Abstract: In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.

560 citations


Cites background from "Sieve: linked data quality assessme..."

  • ...[348] distinguish intensional conciseness (schema level), which refers to the case when the data does not contain redundant schema elements (properties, classes, shapes, etc....

Journal ArticleDOI
TL;DR: A systematic review of approaches for assessing the quality of Linked Data, which unifies and formalizes commonly used terminologies across papers related to data quality and provides a comprehensive list of 18 quality dimensions and 69 metrics.
Abstract: The development and standardization of semantic web technologies has resulted in an unprecedented volume of data being published on the Web as Linked Data (LD). However, we observe widely varying data quality ranging from extensively curated datasets to crowdsourced and extracted data of relatively low quality. In this article, we present the results of a systematic review of approaches for assessing the quality of LD. We gather existing approaches and analyze them qualitatively. In particular, we unify and formalize commonly used terminologies across papers related to data quality and provide a comprehensive list of 18 quality dimensions and 69 metrics. Additionally, we qualitatively analyze the 30 core approaches and 12 tools using a set of attributes. The aim of this article is to provide researchers and data curators a comprehensive understanding of existing work, thereby encouraging further experimentation and development of new approaches focused towards data quality, specifically for LD.
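
As one concrete illustration of the kind of metric such surveys catalogue (a sketch over an assumed toy dataset, not code from the article): extensional conciseness can be computed as the ratio of unique real-world instances to all instance representations in a dataset.

    # Sketch: extensional conciseness = unique instances / all instance
    # representations. The URIs and sameAs mapping below are illustrative
    # assumptions.
    instance_representations = [
        "http://dbpedia.org/resource/Lisbon",
        "http://pt.dbpedia.org/resource/Lisboa",  # co-refers with Lisbon
        "http://dbpedia.org/resource/Berlin",
    ]

    # Assume identity resolution already mapped duplicate URIs to a canonical one.
    same_as = {"http://pt.dbpedia.org/resource/Lisboa":
               "http://dbpedia.org/resource/Lisbon"}

    canonical = {same_as.get(uri, uri) for uri in instance_representations}
    extensional_conciseness = len(canonical) / len(instance_representations)
    print(round(extensional_conciseness, 2))  # 2 unique / 3 representations = 0.67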

531 citations


Cites background from "Sieve: linked data quality assessme..."

  • ...no. of unique instances of a dataset / total number of instance representations in the dataset [45], ∗ 1 − total no....

  • ..., 2012 [45] Sieve: Linked Data Quality Assessment and Fusion...

  • ...[45] distinguished completeness on the schema and the data level....

  • ...[45] that “a dataset is consistent if it is free of conflicting information”....

  • ...of classes and properties [18,45] – CM2: property completeness (i) no....

Proceedings ArticleDOI
07 Apr 2014
TL;DR: This work presents a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development, and argues that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality.
Abstract: Linked Open Data (LOD) comprises an unprecedented volume of structured data on the Web. However, these datasets are of varying quality ranging from extensively curated datasets to crowdsourced or extracted data of often relatively low quality. We present a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development. We argue that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality. We present a methodology for assessing the quality of linked data resources, based on a formalization of bad smells and data quality problems. Our formalization employs SPARQL query templates, which are instantiated into concrete quality test case queries. Based on an extensive survey, we compile a comprehensive library of data quality test case patterns. We perform automatic test case instantiation based on schema constraints or semi-automatically enriched schemata and allow the user to generate specific test case instantiations that are applicable to a schema or dataset. We provide an extensive evaluation of five LOD datasets, manual test case instantiation for five schemas and automatic test case instantiations for all available schemata registered with Linked Open Vocabularies (LOV). One of the main advantages of our approach is that domain specific semantics can be encoded in the data quality test cases, thus being able to discover data quality problems beyond conventional quality heuristics.
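
A minimal sketch of the template-instantiation idea described above (the template, property and sample data are hypothetical, not taken from the paper's pattern library): a SPARQL query template with placeholders is instantiated from a schema constraint, and every row it returns counts as a quality violation.

    # Sketch: instantiating a data-quality test-case template as a SPARQL query
    # and running it against a small RDF graph (requires the rdflib package).
    from rdflib import Graph

    # Pattern: "values of %PROP% must have datatype %DATATYPE%".
    TEMPLATE = """
    SELECT ?s ?value WHERE {
      ?s <%PROP%> ?value .
      FILTER (datatype(?value) != <%DATATYPE%>)
    }
    """

    def instantiate(template, prop, datatype):
        """Turn a test-case pattern into a concrete test query."""
        return template.replace("%PROP%", prop).replace("%DATATYPE%", datatype)

    g = Graph()
    g.parse(format="turtle", data="""
        @prefix ex:  <http://example.org/> .
        @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
        ex:berlin ex:population "3500000"^^xsd:integer .
        ex:lisbon ex:population "unknown" .
    """)

    query = instantiate(TEMPLATE,
                        "http://example.org/population",
                        "http://www.w3.org/2001/XMLSchema#integer")
    violations = list(g.query(query))
    print(f"{len(violations)} violation(s)")  # flags the malformed Lisbon value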

271 citations


Additional excerpts

  • ...ACM 978-1-4503-2744-2/14/04. http://dx.doi.org/10.1145/2566486.2568002 ....

BookDOI
01 Jan 2016
C. Batini, M. Scannapieco: Data and Information Quality

186 citations


Cites background from "Sieve: linked data quality assessme..."

  • ...Reputation of the dataset: analyzing references or page rank, or by assigning a reputation score to the dataset [436]...

  • ..., properties and classes) of a dataset in relation to the overall number of schema elements in a schema [436]....

  • ...Extensional conciseness measures the number of unique entities in relation to the overall number of entities in the dataset [436]....

Book
Eero Hyvönen
19 Oct 2012
TL;DR: This book gives an overview on why, when, and how Linked (Open) Data and Semantic Web technologies can be employed in practice in publishing CH collections and other content on the Web, and motivates and presents a general semantic portal model and publishing framework as a solution approach to distributed semantic content creation, based on an ontology infrastructure.
Abstract: Cultural Heritage (CH) data is syntactically and semantically heterogeneous, multilingual, semantically rich, and highly interlinked. It is produced in a distributed, open fashion by museums, libraries, archives, and media organizations, as well as individual persons. Managing publication of such richness and variety of content on the Web, and at the same time supporting distributed, interoperable content creation processes, poses challenges where traditional publication approaches need to be re-thought. Application of the principles and technologies of Linked Data and the Semantic Web is a new, promising approach to address these problems. This development is leading to the creation of large national and international CH portals, such as Europeana, to large open data repositories, such as the Linked Open Data Cloud, and massive publications of linked library data in the U.S., Europe, and Asia. Cultural Heritage has become one of the most successful application domains of Linked Data and Semantic Web technologies. This book gives an overview on why, when, and how Linked (Open) Data and Semantic Web technologies can be employed in practice in publishing CH collections and other content on the Web. The text first motivates and presents a general semantic portal model and publishing framework as a solution approach to distributed semantic content creation, based on an ontology infrastructure. On the Semantic Web, such an infrastructure includes shared metadata models, ontologies, and logical reasoning, and is supported by shared ontology and other Web services alleviating the use of the new technology and linked data in legacy cataloging systems. The goal of all this is to provide layman users and researchers with new, more intelligent and usable Web applications that can be utilized by other Web applications, too, via well-defined Application Programming Interfaces (API). At the same time, it is possible to provide publishing organizations with more cost-efficient solutions for content creation and publication. This book is targeted to computer scientists, museum curators, librarians, archivists, and other CH professionals interested in Linked Data and CH applications on the Semantic Web. The text is focused on practice and applications, making it suitable to students, researchers, and practitioners developing Web services and applications of CH, as well as to CH managers willing to understand the technical issues and challenges involved in linked data publication. Table of Contents: Cultural Heritage on the Semantic Web / Portal Model for Collaborative CH Publishing / Requirements for Publishing Linked Data / Metadata Schemas / Domain Vocabularies and Ontologies / Logic Rules for Cultural Heritage / Cultural Content Creation / Semantic Services for Human and Machine Users / Conclusions

155 citations

References
Book
01 Jan 1951

2,217 citations


"Sieve: linked data quality assessme..." refers background in this paper

  • ...A popular definition for quality is “fitness for use” [11]....

Book
02 Feb 2011
TL;DR: This Synthesis lecture provides readers with a detailed technical introduction to Linked Data, including coverage of relevant aspects of Web architecture, as the basis for application development, research or further study.
Abstract: The World Wide Web has enabled the creation of a global information space comprising linked documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to raw data not currently available on the Web or bound up in hypertext documents. Linked Data provides a publishing paradigm in which not only documents, but also data, can be a first class citizen of the Web, thereby enabling the extension of the Web with a global data space based on open standards - the Web of Data. In this Synthesis lecture we provide readers with a detailed technical introduction to Linked Data. We begin by outlining the basic principles of Linked Data, including coverage of relevant aspects of Web architecture. The remainder of the text is based around two main themes - the publication and consumption of Linked Data. Drawing on a practical Linked Data scenario, we provide guidance and best practices on: architectural approaches to publishing Linked Data; choosing URIs and vocabularies to identify and describe resources; deciding what data to return in a description of a resource on the Web; methods and frameworks for automated linking of data sets; and testing and debugging approaches for Linked Data deployments. We give an overview of existing Linked Data applications and then examine the architectures that are used to consume Linked Data from the Web, alongside existing tools and frameworks that enable these. Readers can expect to gain a rich technical understanding of Linked Data fundamentals, as the basis for application development, research or further study.

2,174 citations


"Sieve: linked data quality assessme..." refers background in this paper

  • ...) Applications that consume data from the Linked Data cloud are confronted with the challenge of obtaining a homogenized view of this global data space [8]....

Journal ArticleDOI
TL;DR: This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion.
Abstract: The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation.This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.
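
To illustrate the goal the article formalizes (a sketch with assumed data and assumed per-attribute resolution strategies; the article itself works with relational-algebra and SQL operators): records describing the same real-world object are fused into one representation by resolving each conflicting attribute.

    # Sketch: fuse multiple records describing the same real-world object into
    # one clean representation. The records and conflict-resolution strategies
    # are illustrative assumptions.
    from collections import Counter

    records = [
        {"source": "A", "name": "Berlin", "population": 3500000, "country": "Germany"},
        {"source": "B", "name": "Berlin", "population": 3460000, "country": None},
        {"source": "C", "name": "Berlín", "population": None,    "country": "Germany"},
    ]

    def fuse(records, strategies):
        """Resolve each attribute's conflicting non-null values independently."""
        fused = {}
        for attr, resolve in strategies.items():
            values = [r[attr] for r in records if r.get(attr) is not None]
            fused[attr] = resolve(values) if values else None
        return fused

    strategies = {
        "name": lambda vs: Counter(vs).most_common(1)[0][0],  # majority vote
        "population": max,                                    # keep largest
        "country": lambda vs: vs[0],                          # first non-null
    }

    print(fuse(records, strategies))
    # {'name': 'Berlin', 'population': 3500000, 'country': 'Germany'}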

1,797 citations


"Sieve: linked data quality assessme..." refers background or methods in this paper

  • ...Data Integration is commonly applied in order to increase data quality along at least three dimensions: completeness, conciseness and consistency [5]....

  • ...[4] J. Bleiholder and F. Naumann....

  • ...In the context of data integration, Data Fusion is defined as the “process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation” [5]....

  • ...Similarly, intensional conciseness measures the number of unique attributes of a dataset in relation to the overall number of attributes in a target schema [5]....

  • ...Our data fusion framework is inspired by the work of Bleiholder and Naumann [3]....

Book
20 Feb 2011
TL;DR: This Synthesis lecture provides readers with a detailed technical introduction to Linked Data, including coverage of relevant aspects of Web architecture, and provides guidance and best practices on architectural approaches to publishing Linked Data.
Abstract: The World Wide Web has enabled the creation of a global information space comprising linked documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to raw data not currently available on the Web or bound up in hypertext documents. Linked Data provides a publishing paradigm in which not only documents, but also data, can be a first class citizen of the Web, thereby enabling the extension of the Web with a global data space based on open standards - the Web of Data. In this Synthesis lecture we provide readers with a detailed technical introduction to Linked Data. We begin by outlining the basic principles of Linked Data, including coverage of relevant aspects of Web architecture. The remainder of the text is based around two main themes - the publication and consumption of Linked Data. Drawing on a practical Linked Data scenario, we provide guidance and best practices on: architectural approaches to publishing Linked Data; choosing URIs and vocabularies to identify and describe resources; deciding what data to return in a description of a resource on the Web; methods and frameworks for automated linking of data sets; and testing and debugging approaches for Linked Data deployments. We give an overview of existing Linked Data applications and then examine the architectures that are used to consume Linked Data from the Web, alongside existing tools and frameworks that enable these. Readers can expect to gain a rich technical understanding of Linked Data fundamentals, as the basis for application development, research or further study. Table of Contents: List of Figures / Introduction / Principles of Linked Data / The Web of Data / Linked Data Design Considerations / Recipes for Publishing Linked Data / Consuming Linked Data / Summary and Outlook

1,149 citations

Book ChapterDOI
06 Nov 2009
TL;DR: The Silk Linking Framework is presented: a toolkit for discovering and maintaining data links between Web data sources; its protocol allows data sources to exchange both linksets and detailed change information and enables continuous link recomputation.
Abstract: The Web of Data is built upon two simple ideas: Employ the RDF data model to publish structured data on the Web and to create explicit data links between entities within different data sources. This paper presents the Silk Linking Framework, a toolkit for discovering and maintaining data links between Web data sources. Silk consists of three components: 1. A link discovery engine, which computes links between data sources based on a declarative specification of the conditions that entities must fulfill in order to be interlinked; 2. A tool for evaluating the generated data links in order to fine-tune the linking specification; 3. A protocol for maintaining data links between continuously changing data sources. The protocol allows data sources to exchange both linksets as well as detailed change information and enables continuous link recomputation. The interplay of all the components is demonstrated within a life science use case.
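
A rough illustration of the link-discovery step (a generic similarity-based sketch with assumed data and threshold; Silk expresses such conditions in its own declarative linkage specifications, not in Python): entity pairs from two sources are compared by a string-similarity condition, and pairs above the threshold become owl:sameAs links.

    # Sketch: similarity-based link discovery between two sources. The labels,
    # threshold and comparison function are illustrative assumptions.
    from difflib import SequenceMatcher

    source_a = {"http://dbpedia.org/resource/Lisbon": "Lisbon",
                "http://dbpedia.org/resource/Berlin": "Berlin"}
    source_b = {"http://pt.dbpedia.org/resource/Lisboa": "Lisboa",
                "http://pt.dbpedia.org/resource/Berlim": "Berlim"}

    def similarity(a, b):
        """Normalized string similarity in [0, 1]."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    THRESHOLD = 0.7  # link condition: label similarity must reach 0.7

    links = [(ua, ub)
             for ua, la in source_a.items()
             for ub, lb in source_b.items()
             if similarity(la, lb) >= THRESHOLD]

    for ua, ub in links:
        print(f"<{ua}> owl:sameAs <{ub}> .")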

353 citations


"Sieve: linked data quality assessme..." refers methods in this paper

  • ...Through the identity resolution module (Silk) and the URI translation module, these identity links will be used to merge object descriptions into one URI per object....

  • ...Silk Server - Adding missing Links while consuming Linked Data....

  • ...Furthermore, since data may also include multiple identifiers for the same real-world entity, LDIF also supports an Identity Resolution step through Silk [14][9]....

  • ...As an additional normalization step, LDIF's URITranslator module can homogenize all URI aliases identified as duplicates by Silk, grouping all properties of those URIs into one canonical target URI....
