Detecting Linked Data Quality Issues via
Crowdsourcing: A DBpedia Study
Maribel Acosta (a), Amrapali Zaveri (b), Elena Simperl (c), Dimitris Kontokostas (b), Fabian Flöck (d), Jens Lehmann (b)

(a) Institute AIFB, Karlsruhe Institute of Technology, Germany. E-mail: maribel.acosta@kit.edu
(b) Institut für Informatik, AKSW, Universität Leipzig, Germany. E-mail: {zaveri,kontokostas,lehmann}@informatik.uni-leipzig.de
(c) Web Science and Internet Research Group, University of Southampton, United Kingdom. E-mail: e.simperl@soton.ac.uk
(d) Computational Social Science Group, GESIS - Leibniz Institute for the Social Sciences, Germany. E-mail: fabian.floeck@gesis.org
Abstract. In this paper we examine the use of crowdsourcing as a means to master Linked Data quality problems that are difficult
to solve automatically. We base our approach on the analysis of the most common errors encountered in Linked Data sources,
and a classification of these errors according to the extent to which they are likely to be amenable to crowdsourcing. We then
propose and compare different crowdsourcing approaches to identify these Linked Data quality issues, employing the DBpedia
dataset as our use case: (i) a contest targeting the Linked Data expert community, and (ii) paid microtasks published on Amazon
Mechanical Turk. We secondly focus on adapting the Find-Fix-Verify crowdsourcing pattern to exploit the strengths of experts
and lay workers. By testing two distinct Find-Verify workflows (lay users only and experts verified by lay users) we reveal how
to best combine different crowds’ complementary aptitudes in quality issue detection. The results show that a combination of
the two styles of crowdsourcing is likely to achieve more efficient results than each of them used in isolation, and that human
computation is a promising and affordable way to enhance the quality of Linked Data.
1. Introduction
Many would consider Linked Data (LD) to be one of
the most important technological trends in data man-
agement of the last decade [16]. However, seamless
consumption of LD in applications is still very lim-
ited given the varying quality of the data published
in the Linked Open Data (LOD) Cloud [18,44]. This
is the result of a combination of data- and process-
related factors. The data sets being released into the
LOD Cloud are apart from any factual flaws they
may contain very diverse in terms of formats, struc-
ture, and vocabulary. This heterogeneity and the fact
that some kinds of data tend to be more challenging
to lift to RDF than others make it hard to avoid errors,
especially when the translation happens automatically.
Simple issues like syntax errors or duplicates can be
easily identified and repaired in a fully automatic fash-
ion. However, data quality issues in LD are more chal-
lenging to detect. Current approaches to tackle these
problems still require expert human intervention, e.g.,
for specifying rules [14] or test cases [21], or fail due
to the context-specific nature of quality assessment,
which does not lend itself well to general workflows
and rules that could be executed by a computer pro-
gram. In this paper, we explore an alternative data cu-
ration strategy, which is based on crowdsourcing.
Crowdsourcing [19] refers to the process of solving
a problem formulated as a task by reaching out to a
large network of (often previously unknown) people.
One of the most popular forms of crowdsourcing is
‘microtasks’ (or ‘microwork’), which consists of dividing a task into several smaller subtasks that can be solved independently. Depending on the problem tackled, the level of task granularity can vary (microtasks
lem, the level of task granularity can vary (microtasks
whose results need to be aggregated vs. macrotasks,
which require filtering to identify the most valuable
contributions); as can the incentive structure (e.g., pay-
ments per unit of useful work vs. prizes for top par-
ticipants in a contest). Another major design decision
in the crowdsourcing workflow is the selection of the
crowd. While many (micro)tasks can be performed by
untrained workers, others might require more skilled
human participants, especially in specialized fields of
expertise, such as LD. Of course, expert intervention
usually comes at a higher price, either in monetary rewards or in the form of the effort to recruit participants in another setting, such as volunteer work. Microtask
crowdsourcing platforms such as Amazon Mechanical
Turk (MTurk)^1, on the other hand, offer a formidable
and readily-available workforce at relatively low fees.
In this work, we crowdsource three specific LD
quality issues. We do so building on previous work of ours [43], which analyzed common quality prob-
lems encountered in Linked Data sources and classi-
fied them according to the extent to which they could
be amenable to crowdsourcing. The first research ques-
tion explored is hence: RQ1: Is it feasible to detect
quality issues in LD sets via crowdsourcing mecha-
nisms? This question aims at establishing a general understanding of whether crowdsourcing approaches can be used to find issues in LD sets and, if so, to what degree they are an efficient and effective solution. Secondly, given
the option of different crowds, we formulate RQ2: In a
crowdsourcing approach, can we employ unskilled lay
users to identify quality issues in RDF triple data or to
what extent is expert validation needed and desirable?
As a subquestion to RQ2, we also examined which
type of crowd is most suitable to detect which type of
quality issue (and, conversely, which errors they are
prone to make). With these questions, we are interested
(i) in learning to what extent we can exploit the cost-
efficiency of lay users, or if the quality of error detec-
tion is prohibitively low. We (ii) investigate how well
experts generally perform in a crowdsourcing setting
and if and how they outperform lay users. And lastly,
(iii) it is of interest whether one of the two distinct approaches
performs well in areas that might not be a strength of
the other method and crowd.
1 https://www.mturk.com/
To answer these questions, we (i) first launched
a contest that engaged 58 experts knowledgeable in Linked Data to find and classify erroneous RDF triples from DBpedia (Section 4.1). They inspected 68,976 triples in total. These triples were then (ii) submitted as
paid microtasks on MTurk to be examined by workers
on the MTurk platform in a similar way (Section 4.2).
Each approach (contest and paid microtasks) makes
several assumptions about the audiences they address
(the ‘crowd’) and their skills. This is reflected in the
design of the crowdsourcing tasks and the related in-
centive mechanisms. The results of both crowds were
then compared to a manually created gold standard.
The results of the comparison of experts and turkers,
as discussed in Section 5, indicate that (i) untrained
crowdworkers are in fact able to spot certain quality
issues with satisfactory precision; that (ii) experts perform well on two of the three given types of quality issues, but not on the third; and that lastly (iii) the two approaches
reveal complementary strengths.
Given these insights, RQ3 was formulated: How can
we design better crowdsourcing workflows using lay
users or experts for curating LD sets, beyond one-step
solutions for pointing out quality flaws? To do so, we
adapted the crowdsourcing pattern known as Find-Fix-
Verify, which was originally proposed by Bernstein et al. [3]. Specifically, we wanted to know: can we (i) enhance the results of LD quality issue de-
tection through lay users by adding a subsequent step
of cross-checking (Verify) to the initial Find stage? Or
is it (ii) even more promising to combine experts and
lay workers by letting the latter Verify the results of
the experts’ Find step, hence drawing on the crowds’
complementary skills for deficiency identification that we recognized before?
Accordingly, the results of both Find stages (experts and workers), in the form of sets of triples identified as incorrect and marked with the respective errors, were fed into a subsequent Verify step carried out by MTurk workers (Section 4.3). The task consisted solely of rating whether a previously indicated quality issue for a triple was correctly or wrongly assigned. This
Verify step was, in fact, able to improve the preci-
sion of both Find stages substantially. In particular,
the experts’ Find stage results could be improved to
precision levels of around 0.9 in the Verify stage for
two error types that scored much lower in an expert-only Find approach. The worker-worker
Find-Verify strategy also yielded better results than the
Find-only worker approach, and for one error type
even reached slightly better precision than the expert-

worker model. All in all, we show that (i) a Find-Verify
combination of experts and lay users is likely to pro-
duce the best results, but that (ii) they are not superior
to expert-only evaluation in all cases. We also demonstrate that (iii) lay-user-only Find-Verify approaches can be a viable alternative for the detection of LD qual-
ity issues if experts are not available and that they cer-
tainly outperform Find-only lay user workflows.
Note that we did not implement a Fix step in this
work, as correcting the greater part of the found errors via crowdsourcing is not the most cost-efficient way of addressing these issues. Thus, as we argue in Section 4, a majority of errors can and should be addressed already at the level of the individual wrappers that lift datasets to LD.
To understand the strengths and limitations of
crowdsourcing in this scenario, we further executed
automated baseline approaches to compare them to the
results of our crowdsourcing experiments. We show
that while they may be suitable for pre-filtering RDF triple data for ontological inconsistencies (thus potentially decreasing the number of cases that need to be browsed in the Find stage), a substantial part of quality
issues can only be addressed via human intervention.
Contributions
This paper is an extension of previous work of ours [1], in which we presented the results of combin-
ing LD experts and lay users from MTurk when detect-
ing quality issues in DBpedia. The novel contributions
of our current work can be summarized as follows:
- Definition of the problem of classifying RDF triples into quality issues.
- Formalization of the proposed approach: the adaptation of the Find-Fix-Verify pattern is formalized for the problem of detecting quality issues in RDF triples.
- Introduction of a new crowdsourcing workflow that relies solely on microtask crowdsourcing to detect LD quality issues.
- Analysis of the properties of our approaches to generate microtasks for triple-based quality assessment.
- Empirical evaluation of the proposed workflow.
- Inclusion of a new baseline study by executing the state-of-the-art solution RDFUnit [21], a test-based approach to detect LD quality issues either manually or (semi-)automatically.
Structure of the paper
In Section 2, we discuss the type of LD quality is-
sues that are studied in this work. Section 3 briefly in-
troduces the crowdsourcing methods and related con-
cepts that are used throughout the paper. Our approach
is presented in Section 4, and is empirically evaluated
in Section 5. In Section 6 we summarize the findings of
our experimental study and provide answers to the for-
mulated research questions. Related work is discussed
in Section 7. Conclusions and future work are pre-
sented in Section 8.
2. Linked Data Quality Issues
The Web of Data spans a network of data sources
of varying quality. There are a large number of high-
quality data sets, for instance, in the life-science do-
main, which are the result of decades of thorough
curation and have been recently made available as
Linked Open Data^2. Other data sets, however, have
been (semi-)automatically translated into RDF from
their primary sources, or via crowdsourcing in a decen-
tralized process involving a large number of contrib-
utors, for example DBpedia [23]. While the combina-
tion of machine-driven extraction and crowdsourcing
was a reasonable approach to produce a baseline ver-
sion of a greatly useful resource, it was also the cause
of a wide range of quality problems, in particular in
the mappings between Wikipedia attributes and their
corresponding DBpedia properties.
Our analysis of Linked Data quality issues focuses
on DBpedia as a representative data set for the broader
Web of Data due to the diversity of the types of er-
rors exhibited and the vast domain and scope of the
data set. In our previous work [44], we compiled
a list of data quality dimensions (criteria) applica-
ble to Linked Data quality assessment. Afterwards,
we mapped these dimensions to DBpedia [43]. A
subset of four dimensions of the original framework
were found particularly relevant in this setting: Ac-
curacy, Relevancy, Representational-Consistency and
Interlinking. To provide a comprehensive analysis of
DBpedia quality, we further divided these four cate-
gories of problems into sub-categories. For the purpose
of this paper, from these categories we chose the fol-
lowing three triple-level quality issues.
2 http://beta.bio2rdf.org/

Object incorrectly/incompletely extracted. Consider the triple (dbpedia:Rodrigo_Salinas, dbpedia-owl:birthPlace, dbpedia:Puebla_F.C.). The DBpedia resource describes the person ‘Rodrigo Salinas’, but the value of the birth place is incorrect: instead of extracting the name of a city or country from Wikipedia, the stadium name Puebla F.C. is extracted.
Datatype or language tag incorrectly extracted. This
category refers to triples with an incorrect datatype for
a typed literal or an incorrect language tag for a plain literal. For example, consider the triple (dbpedia:Oreye, dbpedia-owl:postalCode, “4360”@en). The literal “4360” is incorrectly assigned the language tag @en (English) instead of an integer datatype.
Incorrect link. This category refers to RDF triples
whose association between the subject and the object
is incorrect. Erroneous interlinks can associate values
within a dataset or between several data sources. This
category of quality issues also includes incorrect links to external Web sites or other external data sources, such as Wikimedia, Freebase, or GeoSpecies, as well as links generated via the Flickr wrapper; that is, links that do not show any related content pertaining to the resource.
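As a concrete illustration of the second category, the following is a minimal sketch, assuming Python with the rdflib library; the expected-datatype mapping and the flag_literal_issues helper are hypothetical and introduced only for this example. It flags literals whose language tag or datatype conflicts with a simple expectation, such as the postal code above carrying an @en tag instead of an integer datatype. Checks of this kind could pre-filter candidates for crowdsourcing, while context-dependent issues such as incorrect links still require human judgement.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

# Namespaces mirroring the DBpedia example above.
DBP = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
# "Datatype or language tag incorrectly extracted": the postal code
# carries an @en language tag instead of an xsd:integer datatype.
g.add((DBP["Oreye"], DBO["postalCode"], Literal("4360", lang="en")))

# Hypothetical expectation: postalCode objects should be integer literals.
EXPECTED_DATATYPES = {DBO["postalCode"]: XSD.integer}

def flag_literal_issues(graph):
    # Yield triples whose object literal violates the expected datatype
    # or carries a language tag where none is expected.
    for s, p, o in graph:
        expected = EXPECTED_DATATYPES.get(p)
        if expected is None or not isinstance(o, Literal):
            continue
        if o.language is not None or o.datatype != expected:
            yield (s, p, o)

for triple in flag_literal_issues(g):
    print("Suspicious triple:", triple)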
These categories of quality problems occur perva-
sively in DBpedia. These problems might also be present in other data sets that are extracted in a similar fashion to DBpedia. Given the diversity of the situa-
tions in which they can be instantiated (broad range
of datatypes and object values) and their sometimes
deeply contextual character (interlinking), assessing
them automatically is challenging. In the following we
explain how crowdsourcing could support quality as-
sessment processes.
3. Crowdsourcing Preliminaries
3.1. Types of Crowdsourcing
The term crowdsourcing was first proposed by Howe [19] and refers to a problem-solving mechanism in which a task is performed by “an undefined (and generally large) network of people in the form of an open call”. Nowadays, many different forms of crowdsourcing have emerged, e.g., microtasks, contests, macrotasks, and crowdfunding, among others; each form of crowdsourcing is designed to target particular types of problems and to reach out to different
crowds. In the following we briefly describe contest-
based and microtask crowdsourcing, the two crowd-
sourcing methods studied in this work.
3.1.1. Contest-based Crowdsourcing
A contest reaches out to a crowd to solve a given
problem and rewards the best ideas. It exploits com-
petition and intellectual challenge as main drivers for
participation. The idea, originating from open innova-
tion, has been employed in many domains, from cre-
ative industries to sciences, for tasks of varying com-
plexity (from designing logos to building sophisticated
algorithms). In particular, contests as means to suc-
cessfully involve experts in advancing science have
a long-standing tradition in research, e.g., the DARPA challenges^3 and the Netflix Prize^4. Usually, contests as crowd-
sourcing mechanisms are open for a medium to long
period of time in order to attract high quality contribu-
tions. Contests may apply different reward models, but
a common modality is to define one main prize for the
contest winner.
We applied this contest-based model to mobilize
an expert crowd consisting of researchers and Linked
Data enthusiasts to discover and classify quality issues
in DBpedia. The reward mechanism applied in this
contest was “one-participant gets it all”. The winner
was the participant who covered the highest number of
DBpedia resources.
3.1.2. Microtask Crowdsourcing
This form of crowdsourcing is applied to problems
which can be broken down into smaller units of work
(called ‘microtasks’). Microtask crowdsourcing works
best for tasks that rely primarily on basic human abili-
ties, such as visual and audio cognition or natural lan-
guage understanding, and less on acquired skills (such
as subject-matter knowledge).
To be more efficient than traditional outsourcing (or
even in-house resources), microtasks need to be highly
parallelized. This means that the actual work is exe-
cuted by a high number of contributors in a decen-
tralized fashion;^5 this not only leads to significant im-
provements in terms of time of delivery, but also offers
a means to cross-check the accuracy of the answers (as
each task is typically assigned to more than one per-
son). Collecting answers from different workers allows
for techniques such as majority voting (or other aggre-
gation methods) to automatically identify accurate re-
sponses. The most common reward model in micro-
3 http://www.darpa.mil/About/History/Archives.aspx
4 http://www.netflixprize.com/
5 More complex workflows, though theoretically feasible, require additional functionality to handle task dependencies.

task crowdsourcing implies small monetary payments
for each worker who has successfully solved a task.
In our work, we used microtask crowdsourcing as a
fast and cost-efficient way to examine the three types
of DBpedia errors described in Section 2. We pro-
vided specific instructions to workers about how to as-
sess RDF triples according to the three previous qual-
ity issues. We reached out to the crowd of the micro-
task marketplace Amazon Mechanical Turk (MTurk).
In the following we present a summary of the relevant
MTurk terminology:
- Requester: submits tasks to the platform (MTurk).
- Human Intelligence Task (HIT): the unit of work in MTurk, referring to a single microtask. A HIT is a self-contained task submitted by a requester.
- Worker: a human contributor who solves HITs.
- Assignments: the number of different workers to be assigned to solve each HIT. This allows the requester to collect multiple answers for each question. A worker can solve a HIT only once.
- Question: a HIT can be composed of several questions. In the remainder of this paper, we refer to task granularity as the number of questions contained within a HIT.
- Payment: the monetary reward granted to a worker for successfully completing a HIT. Payments are defined by the requester, taking into consideration the complexity of the HIT, mainly determined by the time that workers have to spend to solve the task.
- Qualification type or worker qualification: requesters may specify parameters to prevent certain workers from solving their tasks. MTurk provides a fixed set of qualification types, including the “Approval Rate”, defined as the percentage of tasks successfully solved by a worker. In addition, requesters can create customized qualification types.
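To make these concepts concrete, the following is a minimal sketch of how a requester might publish a HIT programmatically, assuming Python with the boto3 MTurk client; the question file name, title, reward, number of assignments, and approval-rate threshold are illustrative values, not the parameters used in our study.

import boto3

# Connect to the MTurk sandbox (endpoint used here for testing only;
# omit endpoint_url to publish to the production marketplace).
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# The question body is an XML document (e.g., an HTMLQuestion or
# QuestionForm) prepared separately; the file name is hypothetical.
question_xml = open("triple_verification_question.xml").read()

response = mturk.create_hit(
    Title="Verify a DBpedia triple (illustrative)",
    Description="Decide whether the indicated quality issue applies to the triple.",
    Keywords="linked data, rdf, data quality",
    Reward="0.04",                     # payment per assignment, in USD
    MaxAssignments=5,                  # assignments: number of workers per HIT
    AssignmentDurationInSeconds=600,   # time allotted to solve one HIT
    LifetimeInSeconds=7 * 24 * 3600,   # how long the HIT remains available
    Question=question_xml,
    QualificationRequirements=[{
        # Built-in "Approval Rate" qualification (PercentAssignmentsApproved);
        # the ID below is MTurk's system qualification type for it.
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
    }],
)
print("Created HIT:", response["HIT"]["HITId"])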
3.2. Crowdsourcing Pattern Find-Fix-Verify
The Find-Fix-Verify pattern [3] consists of dividing a complex human task into a series of simpler tasks that are carried out in a three-stage process. Each stage in the Find-Fix-Verify pattern corresponds to a verification step over the outcome produced in the immediately preceding stage. The first stage of this crowdsourc-
ing pattern, Find, asks the crowd to identify portions
of data that require attention depending on the task to
be solved. In the second stage, Fix, the crowd corrects
the elements belonging to the outcome of the previous
stage. The Verify stage corresponds to a final quality
control iteration.
Originally, this crowdsourcing pattern was intro-
duced in Soylent [3], a human-enabled word process-
ing interface that contacts microtask workers to edit
and improve parts of a document. The tasks studied in
Soylent include: text shortening, grammar check, and
unifying citation formatting. For example, in the Soy-
lent text shortening task, microtask workers in the Find stage are asked to identify portions of text that can potentially be reduced in each paragraph. Candidate portions that reach a certain degree of consensus among workers move on to the next step. In the Fix stage, workers must shorten the previously identified portions of paragraphs. In the Verify stage, all generated rewrites are assessed by workers, who select the most appropriate one that does not change the meaning of the original text.
The Find-Fix-Verify pattern has proven to produce
reliable results since each stage exploits independent
agreement to filter out potential low-quality answers
from the crowd. In addition, this approach is efficient
in terms of the number of questions asked of the paid microtask crowd; therefore, the costs remain competitive with other crowdsourcing alternatives.
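To illustrate how such independent agreement can be exploited, the sketch below (plain Python; the data structures, labels, and thresholds are hypothetical and not the exact aggregation procedure used in our experiments) applies majority voting to worker answers in a Find stage and forwards only the agreed-upon triples to a Verify stage.

from collections import Counter

def majority_vote(answers, min_agreement=0.5):
    # Return the most frequent answer if its share exceeds the agreement
    # threshold, otherwise None (no consensus).
    if not answers:
        return None
    label, count = Counter(answers).most_common(1)[0]
    return label if count / len(answers) > min_agreement else None

def find_stage(find_answers):
    # find_answers maps a triple to the quality-issue labels assigned by
    # different workers. Keep only triples where the crowd agrees on an issue.
    flagged = {}
    for triple, labels in find_answers.items():
        issue = majority_vote(labels)
        if issue is not None and issue != "no_issue":
            flagged[triple] = issue
    return flagged

def verify_stage(flagged, verify_answers):
    # verify_answers maps a triple to workers' yes/no judgements on whether
    # the previously assigned issue is correct.
    return {t: issue for t, issue in flagged.items()
            if majority_vote(verify_answers.get(t, [])) == "yes"}

# Toy run with the postal-code example from Section 2.
triple = ("dbpedia:Oreye", "dbpedia-owl:postalCode", '"4360"@en')
flagged = find_stage({triple: ["datatype_or_language_tag",
                               "datatype_or_language_tag", "no_issue"]})
confirmed = verify_stage(flagged, {triple: ["yes", "yes", "no"]})
print(confirmed)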
In scenarios in which crowdsourcing is applied to
validate the results of machine computation tasks,
question filtering relies on specific thresholds or his-
torical information about the likelihood that human in-
put will significantly improve the results generated al-
gorithmically. Find-Fix-Verify addresses tasks that ini-
tially can be very complex (or very large), as in our case the discovery and classification of various types
of errors in DBpedia.
The Find-Fix-Verify pattern is highly flexible, since
each stage can employ different types of crowds, as
they require different skills and expertise [3].
4. Our Approach: Crowdsourcing Linked Data
Quality Assessment
Our work on human-driven Linked Data quality
assessment focuses on applying crowdsourcing tech-
niques to annotate RDF triples with their correspond-
ing quality issue. Given a set of quality issues Q and
a set T of RDF triples to be assessed, we formally de-
fine the annotation of triples with their corresponding
quality issues as follows.
Definition 1. (Problem Definition: Mapping RDF
Triples to Quality Issues). Given a set T of RDF triples and a set Q of quality issues, a mapping of
triples to quality issues is defined as a partial function

References

- A Coefficient of Agreement for Nominal Scales
- DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia
- DBpedia - A Crystallization Point for the Web of Data
- Quality-Control Handbook