Detecting Linked Data Quality Issues via
Crowdsourcing: A DBpedia Study
Maribel Acosta (a), Amrapali Zaveri (b), Elena Simperl (c), Dimitris Kontokostas (b), Fabian Flöck (d), Jens Lehmann (b)

(a) Institute AIFB, Karlsruhe Institute of Technology, Germany. E-mail: maribel.acosta@kit.edu
(b) Institut für Informatik, AKSW, Universität Leipzig, Germany. E-mail: {zaveri,kontokostas,lehmann}@informatik.uni-leipzig.de
(c) Web Science and Internet Research Group, University of Southampton, United Kingdom. E-mail: e.simperl@soton.ac.uk
(d) Computational Social Science Group, GESIS - Leibniz Institute for the Social Sciences, Germany. E-mail: fabian.floeck@gesis.org
Abstract. In this paper we examine the use of crowdsourcing as a means to master Linked Data quality problems that are difficult
to solve automatically. We base our approach on the analysis of the most common errors encountered in Linked Data sources,
and a classification of these errors according to the extent to which they are likely to be amenable to crowdsourcing. We then
propose and compare different crowdsourcing approaches to identify these Linked Data quality issues, employing the DBpedia
dataset as our use case: (i) a contest targeting the Linked Data expert community, and (ii) paid microtasks published on Amazon
Mechanical Turk. We secondly focus on adapting the Find-Fix-Verify crowdsourcing pattern to exploit the strengths of experts
and lay workers. By testing two distinct Find-Verify workflows (lay users only and experts verified by lay users) we reveal how
to best combine different crowds’ complementary aptitudes in quality issue detection. The results show that a combination of
the two styles of crowdsourcing is likely to achieve more efficient results than each of them used in isolation, and that human
computation is a promising and affordable way to enhance the quality of Linked Data.
1. Introduction
Many would consider Linked Data (LD) to be one of
the most important technological trends in data man-
agement of the last decade [16]. However, seamless
consumption of LD in applications is still very lim-
ited given the varying quality of the data published
in the Linked Open Data (LOD) Cloud [18,44]. This
is the result of a combination of data- and process-
related factors. The data sets being released into the
LOD Cloud are apart from any factual flaws they
may contain very diverse in terms of formats, struc-
ture, and vocabulary. This heterogeneity and the fact
that some kinds of data tend to be more challenging
to lift to RDF than others make it hard to avoid errors,
especially when the translation happens automatically.
Simple issues like syntax errors or duplicates can be
easily identified and repaired in a fully automatic fash-
ion. However, data quality issues in LD are more chal-
lenging to detect. Current approaches to tackle these
problems still require expert human intervention, e.g.,
for specifying rules [14] or test cases [21], or fail due
to the context-specific nature of quality assessment,
which does not lend itself well to general workflows
and rules that could be executed by a computer pro-
gram. In this paper, we explore an alternative data cu-
ration strategy, which is based on crowdsourcing.
Crowdsourcing [19] refers to the process of solving
a problem formulated as a task by reaching out to a
large network of (often previously unknown) people.
One of the most popular forms of crowdsourcing is
‘microtasks’ (or ‘microwork’), which consists of dividing a task into several smaller subtasks that can be solved independently. Depending on the problem tackled, the level of task granularity can vary (microtasks
lem, the level of task granularity can vary (microtasks
whose results need to be aggregated vs. macrotasks,
which require filtering to identify the most valuable
contributions); as can the incentive structure (e.g., pay-
ments per unit of useful work vs. prizes for top par-
ticipants in a contest). Another major design decision
in the crowdsourcing workflow is the selection of the
crowd. While many (micro)tasks can be performed by
untrained workers, others might require more skilled
human participants, especially in specialized fields of
expertise, such as LD. Of course, expert intervention
usually comes at a higher price, either in monetary rewards or in the form of the effort to recruit participants in another setting, such as volunteer work. Microtask
crowdsourcing platforms such as Amazon Mechanical
Turk (MTurk)^1, on the other hand, offer a formidable
and readily-available workforce at relatively low fees.
In this work, we crowdsource three specific LD
quality issues. We do so building on previous work of ours [43], which analyzed common quality prob-
lems encountered in Linked Data sources and classi-
fied them according to the extent to which they could
be amenable to crowdsourcing. The first research ques-
tion explored is hence: RQ1: Is it feasible to detect
quality issues in LD sets via crowdsourcing mecha-
nisms? This question aims at establishing a general understanding of whether crowdsourcing approaches can be used to find issues in LD sets and, if so, to what degree they are an efficient and effective solution. Secondly, given
the option of different crowds, we formulate RQ2: In a
crowdsourcing approach, can we employ unskilled lay
users to identify quality issues in RDF triple data or to
what extent is expert validation needed and desirable?
As a subquestion to RQ2, we also examined which
type of crowd is most suitable to detect which type of
quality issue (and, conversely, which errors they are
prone to make). With these questions, we are interested
(i) in learning to what extent we can exploit the cost-
efficiency of lay users, or if the quality of error detec-
tion is prohibitively low. We (ii) investigate how well
experts generally perform in a crowdsourcing setting
and if and how they outperform lay users. And lastly,
(iii) it is of interest whether one of the two distinct approaches
performs well in areas that might not be a strength of
the other method and crowd.
1 https://www.mturk.com/
To answer these questions, we (i) first launched
a contest that engaged 58 experts knowledgeable in Linked Data to find and classify erroneous RDF triples from DBpedia (Section 4.1). They inspected 68,976 triples in total. These triples were then (ii) submitted as
paid microtasks on MTurk to be examined by workers
on the MTurk platform in a similar way (Section 4.2).
Each approach (contest and paid microtasks) makes
several assumptions about the audiences they address
(the ‘crowd’) and their skills. This is reflected in the
design of the crowdsourcing tasks and the related in-
centive mechanisms. The results of both crowds were
then compared to a manually created gold standard.
The results of the comparison of experts and turkers,
as discussed in Section 5, indicate that (i) untrained
crowdworkers are in fact able to spot certain quality
issues with satisfactory precision; that (ii) experts perform well on two of the three given types of quality issues, but not on the third; and that lastly (iii) the two approaches
reveal complementary strengths.
Given these insights, RQ3 was formulated: How can
we design better crowdsourcing workflows using lay
users or experts for curating LD sets, beyond one-step
solutions for pointing out quality flaws? To do so, we
adapted the crowdsourcing pattern known as Find-Fix-
Verify, which was originally proposed by Bernstein et al. [3]. Specifically, we wanted to know: can we (i) enhance the results of LD quality issue de-
tection through lay users by adding a subsequent step
of cross-checking (Verify) to the initial Find stage? Or
is it (ii) even more promising to combine experts and
lay workers by letting the latter Verify the results of
the experts’ Find step, hence drawing on the crowds’
complementary skills for deficiency identification that we recognized before?
Accordingly, the results of both Find stages (experts and workers), in the form of sets of triples identified as incorrect and marked with the respective errors, were fed into a subsequent Verify step carried out by MTurk workers (Section 4.3). The task consisted solely of rating whether a previously indicated quality issue for a triple was correctly or wrongly assigned. This
Verify step was, in fact, able to improve the preci-
sion of both Find stages substantially. In particular,
the experts’ Find stage results could be improved to
precision levels of around 0.9 in the Verify stage for
two error types that scored much lower in an expert-only Find approach. The worker-worker
Find-Verify strategy also yielded better results than the
Find-only worker approach, and for one error type
even reached slightly better precision than the expert-

worker model. All in all, we show that (i) a Find-Verify
combination of experts and lay users is likely to pro-
duce the best results, but that (ii) they are not superior
to expert-only evaluation in all cases. We also demonstrate that (iii) lay-user-only Find-Verify approaches can be a viable alternative for the detection of LD qual-
ity issues if experts are not available and that they cer-
tainly outperform Find-only lay user workflows.
Note that we did not implement a Fix step in this
work, as correcting the greater part of the found errors via crowdsourcing is not the most cost-efficient way of addressing these issues. Thus, as we argue in Section 4, a majority of errors can and should be addressed already at the level of the individual wrappers that lift datasets to LD.
To understand the strengths and limitations of
crowdsourcing in this scenario, we further executed
automated baseline approaches to compare them to the
results of our crowdsourcing experiments. We show
that while they may be suitable for pre-filtering RDF triple data for ontological inconsistencies (thus potentially decreasing the number of cases that need to be browsed in the Find stage), a substantial part of quality
issues can only be addressed via human intervention.
Contributions
This paper is an extension of previous work of ours [1], in which we presented the results of combin-
ing LD experts and lay users from MTurk when detect-
ing quality issues in DBpedia. The novel contributions
of our current work can be summarized as follows:
- Definition of the problem of classifying RDF triples into quality issues.
- Formalization of the proposed approach: the adaptation of the Find-Fix-Verify pattern is formalized for the problem of detecting quality issues in RDF triples.
- Introduction of a new crowdsourcing workflow that relies solely on microtask crowdsourcing to detect LD quality issues.
- Analysis of the properties of our approaches to generate microtasks for triple-based quality assessment.
- Empirical evaluation of the proposed workflow.
- Inclusion of a new baseline study by executing the state-of-the-art solution RDFUnit [21], a test-based approach to detect LD quality issues either manually or (semi-)automatically.
Structure of the paper
In Section 2, we discuss the type of LD quality is-
sues that are studied in this work. Section 3 briefly in-
troduces the crowdsourcing methods and related con-
cepts that are used throughout the paper. Our approach
is presented in Section 4, and is empirically evaluated
in Section 5. In Section 6 we summarize the findings of
our experimental study and provide answers to the for-
mulated research questions. Related work is discussed
in Section 7. Conclusions and future work are pre-
sented in Section 8.
2. Linked Data Quality Issues
The Web of Data spans a network of data sources
of varying quality. There are a large number of high-
quality data sets, for instance, in the life-science do-
main, which are the result of decades of thorough
curation and have been recently made available as
Linked Open Data^2. Other data sets, however, have
been (semi-)automatically translated into RDF from
their primary sources, or via crowdsourcing in a decen-
tralized process involving a large number of contrib-
utors, for example DBpedia [23]. While the combina-
tion of machine-driven extraction and crowdsourcing
was a reasonable approach to produce a baseline ver-
sion of a greatly useful resource, it was also the cause
of a wide range of quality problems, in particular in
the mappings between Wikipedia attributes and their
corresponding DBpedia properties.
Our analysis of Linked Data quality issues focuses
on DBpedia as a representative data set for the broader
Web of Data due to the diversity of the types of er-
rors exhibited and the vast domain and scope of the
data set. In our previous work [44], we compiled
a list of data quality dimensions (criteria) applica-
ble to Linked Data quality assessment. Afterwards,
we mapped these dimensions to DBpedia [43]. A
subset of four dimensions of the original framework
were found particularly relevant in this setting: Ac-
curacy, Relevancy, Representational-Consistency and
Interlinking. To provide a comprehensive analysis of
DBpedia quality, we further divided these four cate-
gories of problems into sub-categories. For the purpose
of this paper, from these categories we chose the fol-
lowing three triple-level quality issues.
2 http://beta.bio2rdf.org/

Object incorrectly/incompletely extracted. Consider the triple (dbpedia:Rodrigo_Salinas, dbpedia-owl:birthPlace, dbpedia:Puebla_F.C.). The DBpedia resource describes the person ‘Rodrigo Salinas’, but the value of the birth place is incorrect: instead of extracting the name of a city or country from Wikipedia, the stadium name Puebla F.C. is extracted.
Datatype or language tag incorrectly extracted. This
category refers to triples with an incorrect datatype for
a typed literal or an incorrect language tag for a plain literal. For example, consider the triple (dbpedia:Oreye, dbpedia-owl:postalCode, “4360”@en). The literal “4360” is incorrectly assigned the language tag @en (English) instead of an integer datatype.
Incorrect link. This category refers to RDF triples
whose association between the subject and the object
is incorrect. Erroneous interlinks can associate values
within a dataset or between several data sources. This
category of quality issues also includes incorrect links to external Web sites or other external data sources, such as Wikimedia, Freebase, or GeoSpecies, as well as links generated via the Flickr wrapper; that is, links that do not show any related content pertaining to the resource.
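As a concrete illustration of the second category, the following is a minimal sketch, assuming Python with the rdflib library; the expected-datatype mapping and the flag_literal_issues helper are hypothetical and introduced only for this example. It flags literals whose language tag or datatype conflicts with a simple expectation, such as the postal code above carrying an @en tag instead of an integer datatype. Checks of this kind could pre-filter candidates for crowdsourcing, while context-dependent issues such as incorrect links still require human judgement.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

# Namespaces mirroring the DBpedia example above.
DBP = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
# "Datatype or language tag incorrectly extracted": the postal code
# carries an @en language tag instead of an xsd:integer datatype.
g.add((DBP["Oreye"], DBO["postalCode"], Literal("4360", lang="en")))

# Hypothetical expectation: postalCode objects should be integer literals.
EXPECTED_DATATYPES = {DBO["postalCode"]: XSD.integer}

def flag_literal_issues(graph):
    # Yield triples whose object literal violates the expected datatype
    # or carries a language tag where none is expected.
    for s, p, o in graph:
        expected = EXPECTED_DATATYPES.get(p)
        if expected is None or not isinstance(o, Literal):
            continue
        if o.language is not None or o.datatype != expected:
            yield (s, p, o)

for triple in flag_literal_issues(g):
    print("Suspicious triple:", triple)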
These categories of quality problems occur perva-
sively in DBpedia. These problems might also be present in other data sets that are extracted in a similar fashion to DBpedia. Given the diversity of the situa-
tions in which they can be instantiated (broad range
of datatypes and object values) and their sometimes
deeply contextual character (interlinking), assessing
them automatically is challenging. In the following we
explain how crowdsourcing could support quality as-
sessment processes.
3. Crowdsourcing Preliminaries
3.1. Types of Crowdsourcing
The term crowdsourcing was first proposed by Howe [19] and refers to a problem-solving mechanism in which a task is performed by “an undefined (and generally large) network of people in the form of an open call”. Nowadays, many different forms of crowdsourcing have emerged, e.g., microtasks, contests, macrotasks, and crowdfunding, among others; each form of crowdsourcing is designed to target particular types of problems and to reach out to different
crowds. In the following we briefly describe contest-
based and microtask crowdsourcing, the two crowd-
sourcing methods studied in this work.
3.1.1. Contest-based Crowdsourcing
A contest reaches out to a crowd to solve a given
problem and rewards the best ideas. It exploits com-
petition and intellectual challenge as main drivers for
participation. The idea, originating from open innova-
tion, has been employed in many domains, from cre-
ative industries to sciences, for tasks of varying com-
plexity (from designing logos to building sophisticated
algorithms). In particular, contests as means to suc-
cessfully involve experts in advancing science have
a long-standing tradition in research, e.g., the DARPA challenges^3 and the Netflix Prize^4. Usually, contests as crowd-
sourcing mechanisms are open for a medium to long
period of time in order to attract high quality contribu-
tions. Contests may apply different reward models, but
a common modality is to define one main prize for the
contest winner.
We applied this contest-based model to mobilize
an expert crowd consisting of researchers and Linked
Data enthusiasts to discover and classify quality issues
in DBpedia. The reward mechanism applied in this
contest was “one-participant gets it all”. The winner
was the participant who covered the highest number of
DBpedia resources.
3.1.2. Microtask Crowdsourcing
This form of crowdsourcing is applied to problems
which can be broken down into smaller units of work
(called ‘microtasks’). Microtask crowdsourcing works
best for tasks that rely primarily on basic human abili-
ties, such as visual and audio cognition or natural lan-
guage understanding, and less on acquired skills (such
as subject-matter knowledge).
To be more efficient than traditional outsourcing (or
even in-house resources), microtasks need to be highly
parallelized. This means that the actual work is exe-
cuted by a high number of contributors in a decen-
tralized fashion;^5 this not only leads to significant im-
provements in terms of time of delivery, but also offers
a means to cross-check the accuracy of the answers (as
each task is typically assigned to more than one per-
son). Collecting answers from different workers allows
for techniques such as majority voting (or other aggre-
gation methods) to automatically identify accurate re-
sponses. The most common reward model in micro-
3 http://www.darpa.mil/About/History/Archives.aspx
4 http://www.netflixprize.com/
5 More complex workflows, though theoretically feasible, require additional functionality to handle task dependencies.

task crowdsourcing implies small monetary payments
for each worker who has successfully solved a task.
In our work, we used microtask crowdsourcing as a
fast and cost-efficient way to examine the three types
of DBpedia errors described in Section 2. We pro-
vided specific instructions to workers about how to as-
sess RDF triples according to the three previous qual-
ity issues. We reached out to the crowd of the micro-
task marketplace Amazon Mechanical Turk (MTurk).
In the following we present a summary of the relevant
MTurk terminology:
- Requester: submits tasks to the platform (MTurk).
- Human Intelligence Task (HIT): the unit of work in MTurk, referring to a single microtask. A HIT is a self-contained task submitted by a requester.
- Worker: a human contributor who solves HITs.
- Assignments: the number of different workers to be assigned to solve each HIT. This allows the requester to collect multiple answers for each question. A worker can solve a HIT only once.
- Question: a HIT can be composed of several questions. In the remainder of this paper, we refer to task granularity as the number of questions contained within a HIT.
- Payment: the monetary reward granted to a worker for successfully completing a HIT. Payments are defined by the requester, taking into consideration the complexity of the HIT, mainly determined by the time that workers have to spend to solve the task.
- Qualification type or worker qualification: requesters may specify parameters to prevent certain workers from solving their tasks. MTurk provides a fixed set of qualification types, including the “Approval Rate”, defined as the percentage of tasks successfully solved by a worker. In addition, requesters can create customized qualification types.
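To make these concepts concrete, the following is a minimal sketch of how a requester might publish a HIT programmatically, assuming Python with the boto3 MTurk client; the question file name, title, reward, number of assignments, and approval-rate threshold are illustrative values, not the parameters used in our study.

import boto3

# Connect to the MTurk sandbox (endpoint used here for testing only;
# omit endpoint_url to publish to the production marketplace).
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# The question body is an XML document (e.g., an HTMLQuestion or
# QuestionForm) prepared separately; the file name is hypothetical.
question_xml = open("triple_verification_question.xml").read()

response = mturk.create_hit(
    Title="Verify a DBpedia triple (illustrative)",
    Description="Decide whether the indicated quality issue applies to the triple.",
    Keywords="linked data, rdf, data quality",
    Reward="0.04",                     # payment per assignment, in USD
    MaxAssignments=5,                  # assignments: number of workers per HIT
    AssignmentDurationInSeconds=600,   # time allotted to solve one HIT
    LifetimeInSeconds=7 * 24 * 3600,   # how long the HIT remains available
    Question=question_xml,
    QualificationRequirements=[{
        # Built-in "Approval Rate" qualification (PercentAssignmentsApproved);
        # the ID below is MTurk's system qualification type for it.
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
    }],
)
print("Created HIT:", response["HIT"]["HITId"])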
3.2. Crowdsourcing Pattern Find-Fix-Verify
The Find-Fix-Verify pattern [3] consists of dividing a complex human task into a series of simpler tasks that are carried out in a three-stage process. Each stage in the Find-Fix-Verify pattern corresponds to a verification step over the outcome produced in the immediately preceding stage. The first stage of this crowdsourc-
ing pattern, Find, asks the crowd to identify portions
of data that require attention depending on the task to
be solved. In the second stage, Fix, the crowd corrects
the elements belonging to the outcome of the previous
stage. The Verify stage corresponds to a final quality
control iteration.
Originally, this crowdsourcing pattern was intro-
duced in Soylent [3], a human-enabled word process-
ing interface that contacts microtask workers to edit
and improve parts of a document. The tasks studied in
Soylent include: text shortening, grammar check, and
unifying citation formatting. For example, in the Soy-
lent text shortening task, microtask workers in the Find stage are asked to identify portions of text that can potentially be reduced in each paragraph. Candidate portions that reach a certain degree of consensus among workers move on to the next step. In the Fix stage, workers must shorten the previously identified portions of paragraphs. In the Verify stage, all generated rewrites are assessed by workers, who select the most appropriate one that does not change the meaning of the original text.
The Find-Fix-Verify pattern has proven to produce
reliable results since each stage exploits independent
agreement to filter out potential low-quality answers
from the crowd. In addition, this approach is efficient
in terms of the number of questions asked of the paid microtask crowd; therefore, the costs remain competitive with other crowdsourcing alternatives.
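To illustrate how such independent agreement can be exploited, the sketch below (plain Python; the data structures, labels, and thresholds are hypothetical and not the exact aggregation procedure used in our experiments) applies majority voting to worker answers in a Find stage and forwards only the agreed-upon triples to a Verify stage.

from collections import Counter

def majority_vote(answers, min_agreement=0.5):
    # Return the most frequent answer if its share exceeds the agreement
    # threshold, otherwise None (no consensus).
    if not answers:
        return None
    label, count = Counter(answers).most_common(1)[0]
    return label if count / len(answers) > min_agreement else None

def find_stage(find_answers):
    # find_answers maps a triple to the quality-issue labels assigned by
    # different workers. Keep only triples where the crowd agrees on an issue.
    flagged = {}
    for triple, labels in find_answers.items():
        issue = majority_vote(labels)
        if issue is not None and issue != "no_issue":
            flagged[triple] = issue
    return flagged

def verify_stage(flagged, verify_answers):
    # verify_answers maps a triple to workers' yes/no judgements on whether
    # the previously assigned issue is correct.
    return {t: issue for t, issue in flagged.items()
            if majority_vote(verify_answers.get(t, [])) == "yes"}

# Toy run with the postal-code example from Section 2.
triple = ("dbpedia:Oreye", "dbpedia-owl:postalCode", '"4360"@en')
flagged = find_stage({triple: ["datatype_or_language_tag",
                               "datatype_or_language_tag", "no_issue"]})
confirmed = verify_stage(flagged, {triple: ["yes", "yes", "no"]})
print(confirmed)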
In scenarios in which crowdsourcing is applied to
validate the results of machine computation tasks,
question filtering relies on specific thresholds or his-
torical information about the likelihood that human in-
put will significantly improve the results generated al-
gorithmically. Find-Fix-Verify addresses tasks that ini-
tially can be very complex (or very large), as in our case the discovery and classification of various types
of errors in DBpedia.
The Find-Fix-Verify pattern is highly flexible, since
each stage can employ different types of crowds, as
they require different skills and expertise [3].
4. Our Approach: Crowdsourcing Linked Data
Quality Assessment
Our work on human-driven Linked Data quality
assessment focuses on applying crowdsourcing tech-
niques to annotate RDF triples with their correspond-
ing quality issue. Given a set of quality issues Q and
a set T of RDF triples to be assessed, we formally de-
fine the annotation of triples with their corresponding
quality issues as follows.
Definition 1. (Problem Definition: Mapping RDF
Triples to Quality Issues). Given a set T of RDF triples and a set Q of quality issues, a mapping of
triples to quality issues is defined as a partial function

References

- A Coefficient of Agreement for Nominal Scales
- DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia
- DBpedia - A Crystallization Point for the Web of Data
- Quality-Control Handbook