
Large-scale linked data integration using probabilistic reasoning and crowdsourcing

01 Oct 2013 · Vol. 22, Iss. 5, pp. 665–687
TL;DR: The ZenCrowd system uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud.
Abstract: We tackle the problems of semiautomatically matching linked data sets and of linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find potential candidate results from entities that have been indexed in our system. Our system then analyzes the candidate matches and refines them whenever deemed necessary using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks in case the algorithmic components fail to come up with convincing results. We integrate all results from the inverted indices, from the graph database and from the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve a 95 % average accuracy on instance matching (as compared to the initial 88 % average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14 % over our best automatic approach.

Summary (3 min read)

1 Introduction

  • Semistructured data are becoming more prominent on the Web as more and more data are either interweaved or serialized in HTML pages.
  • This paper describes ZenCrowd, a system the authors have developed in order to create links across large data sets containing similar instances and to semiautomatically identify LOD entities from textual content.
  • In the present work, the authors extend ZenCrowd to handle both instance matching and entity linking.
  • The first task addressed by this paper is that of matching instances of multiple types among two data sets.
  • It is possible to categorize different crowdsourcing strategies based on the different types of incentives used to motivate the crowd to perform such tasks.

3 Preliminaries

  • As already mentioned, ZenCrowd addresses two distinct data integration tasks related to the general problem of entity resolution [24].
  • Given a document d and a LOD data set U1 = {u11, ..., u1n}, the authors define entity linking as the task of identifying all entities in U1 from d and of associating the corresponding identifier u1i to each entity.
  • These two tasks are highly related: Instance matching aims at creating connections between different LOD data sets that describe the same real-world entity using different vocabularies.
  • To improve the result for both tasks, the authors selectively use paid micro-task crowdsourcing.
  • The disadvantages, however, potentially include the following: high financial cost, low availability of workers, and poor workers’ skills or honesty.

4 Architecture

  • ZenCrowd is a hybrid platform that takes advantage of both algorithmic and manual data integration techniques simultaneously.
  • The authors start by giving an overview of their system in Sect. 4.1 and then describe some of its components in more detail in Sects. 4.2–4.4.
  • In the following, the authors describe the different components of the ZenCrowd system focusing first on the instance matching and then on the entity linking pipeline.

4.1.1 Instance matching pipeline

  • In order to create new links, ZenCrowd takes as input a pair of data sets from the LOD cloud.
  • Then, for each instance of the source data set, their system tries to come up with candidate matches from the target data set.
  • The architecture of ZenCrowd: For the instance matching task (green pipeline), the system takes as input a pair of data sets to be interlinked and creates new links between the data sets using <owl:sameAs> RDF triples.
  • At this point, the candidate matches that have a low confidence score are sent to the crowd for further analysis.
  • The Decision Engine collects confidence scores from the previous steps in order to decide what to crowdsource, together with data from the graph database to construct the HITs.

4.1.2 Entity linking pipeline

  • The other task that ZenCrowd performs is entity linking, that is, identifying occurrences of LOD entities in textual content and creating links from the text to corresponding instances stored in a database.
  • The authors' LOD index engine receives as input a list of SPARQL endpoints or LOD dumps as well as a list of triple patterns, and iteratively retrieves all corresponding triples from the LOD data sets.
  • Once extracted, the textual entities are inspected by algorithmic linkers, whose role is to find semantically related entities from the LOD cloud.
  • Given a source instance from a data set, ZenCrowd considers all instances of the target data set as possible matches.
  • Then, given the available resources, top pairs are crowdsourced in batches to improve the accuracy of the matching process (see the sketch after this list).
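The following self-contained sketch shows the general shape of such an entity linking pipeline: look up candidate URIs for each extracted mention, keep the confident links, and send the uncertain ones to the crowd in batches. The candidate table, the confidence measure, the threshold, and the batch size are illustrative stand-ins, not the paper's actual components.

```python
from difflib import SequenceMatcher

# Toy candidate table: entity mention -> candidate LOD URIs.
CANDIDATES = {
    "Obama": ["http://dbpedia.org/resource/Barack_Obama",
              "http://dbpedia.org/resource/Barack_Obama_Sr."],
    "Fribourg": ["http://dbpedia.org/resource/Fribourg",
                 "http://dbpedia.org/resource/Canton_of_Fribourg"],
}

def link_confidence(mention, uri):
    """Stand-in confidence: label similarity between mention and URI tail."""
    label = uri.rsplit("/", 1)[-1].replace("_", " ")
    return SequenceMatcher(None, mention.lower(), label.lower()).ratio()

def plan_crowdsourcing(mentions, batch_size=2, threshold=0.9):
    """Group the mentions whose best candidate link is uncertain into batches."""
    uncertain = []
    for m in mentions:
        cands = CANDIDATES.get(m, [])
        if cands and max(link_confidence(m, u) for u in cands) < threshold:
            uncertain.append((m, cands))
    return [uncertain[i:i + batch_size] for i in range(0, len(uncertain), batch_size)]

# "Obama" is ambiguous and gets crowdsourced; "Fribourg" is linked automatically.
print(plan_crowdsourcing(["Obama", "Fribourg"]))
```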

5 Effective instance matching based on confidence estimation and crowdsourcing

  • The authors describe the final steps of the blocking process that assure high-quality instance matching results.
  • The authors first define their schema-based matching confidence measure, which is then used to decide which candidate matches to crowdsource (a minimal decision sketch follows this list).
  • This second interface defines a simpler task for the worker by presenting directly on the HIT page relevant information about the target entity as well as about the candidate matches.
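The paper's actual confidence measure is computed over the graph database; the sketch below only illustrates the surrounding decision logic, using a made-up confidence based on the score gap between the two best candidates. The scores, thresholds, and the Freebase identifiers other than the Tom Cruise pair are invented for the example.

```python
def confidence(scored_candidates):
    """Illustrative confidence: gap between the best and second-best score."""
    ranked = sorted(scored_candidates, key=lambda c: c[1], reverse=True)
    if len(ranked) < 2:
        return ranked[0][1] if ranked else 0.0
    return ranked[0][1] - ranked[1][1]

def to_crowdsource(all_matchings, budget, threshold=0.2):
    """Pick the least confident candidate matchings, up to the crowd budget."""
    uncertain = [(m, cands) for m, cands in all_matchings
                 if confidence(cands) < threshold]
    # Spend the budget on the hardest (least confident) cases first.
    uncertain.sort(key=lambda pair: confidence(pair[1]))
    return uncertain[:budget]

matchings = [
    ("dbpedia:Tom_Cruise", [("freebase:/m/07r1h", 0.95), ("freebase:/m/xxxx0", 0.30)]),
    ("dbpedia:Springfield", [("freebase:/m/xxxx1", 0.55), ("freebase:/m/xxxx2", 0.50)]),
]
print(to_crowdsource(matchings, budget=5))   # only the ambiguous Springfield case
```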

6 Probabilistic models

  • ZenCrowd exploits probabilistic models to make sensible decisions about candidate results.
  • The authors use factor graphs to graphically represent probabilistic variables and distributions in the following.
  • The authors give below a brief introduction to factor graphs and message-passing techniques.
  • Each candidate can also be examined by human workers wi performing micro-matching tasks, whose clicks cij express whether a given candidate match corresponds to the source instance from their perspective (a simplified aggregation sketch follows this list).
  • Clicks, workers, and matchings are further connected through two factors described below.
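The factor-graph model combines a prior on each candidate match with the clicks cij of workers wi of varying reliability. As a much-simplified, self-contained stand-in (plain Bayes' rule over independent worker votes rather than message passing), the following shows how a prior and a set of clicks could be folded into a posterior; the prior and reliability values are illustrative.

```python
def posterior_match(prior, clicks, reliability):
    """P(match is correct | clicks), assuming independent workers.

    clicks[i] is True if worker i clicked "same entity", False otherwise;
    reliability[i] is the assumed probability that worker i answers correctly.
    """
    p_correct, p_incorrect = prior, 1.0 - prior
    for click, r in zip(clicks, reliability):
        # A "yes" click is likely (prob. r) if the match is correct and
        # unlikely (prob. 1 - r) if it is not; symmetrically for "no" clicks.
        p_correct *= r if click else (1.0 - r)
        p_incorrect *= (1.0 - r) if click else r
    return p_correct / (p_correct + p_incorrect)

# Two reliable workers agree on "yes", one unreliable worker says "no".
print(posterior_match(prior=0.6,
                      clicks=[True, True, False],
                      reliability=[0.9, 0.8, 0.55]))
```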

6.2.2 Unicity constraints for entity linking

  • The instance matching task definition assumes that only one instance from the target data set can be a correct match for the source instance.
  • The authors can thus rule out all configurations where more than one candidate from the same LOD data set is considered as Correct (see the sketch after this list).
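The unicity constraint can be read as a hard factor that assigns zero probability to any assignment in which two candidates from the same data set are both labeled Correct. The toy filter below enumerates assignments explicitly to show the effect, whereas the paper encodes the constraint inside the factor graph; the candidate names are invented.

```python
from itertools import product

def unicity_ok(assignment, dataset_of):
    """True if at most one candidate per data set is labeled Correct."""
    correct_per_ds = {}
    for candidate, label in assignment.items():
        if label == "Correct":
            ds = dataset_of[candidate]
            if correct_per_ds.get(ds):
                return False      # a second Correct candidate from the same data set
            correct_per_ds[ds] = True
    return True

candidates = ["dbpedia:A", "dbpedia:B", "freebase:X"]
dataset_of = {"dbpedia:A": "dbpedia", "dbpedia:B": "dbpedia", "freebase:X": "freebase"}

valid = [dict(zip(candidates, labels))
         for labels in product(["Correct", "Incorrect"], repeat=len(candidates))
         if unicity_ok(dict(zip(candidates, labels)), dataset_of)]
print(len(valid))  # 6 of the 8 possible assignments survive the constraint
```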

6.2.3 SameAs constraints for entity linking

  • SameAs constraints are exclusively used in entity linking graphs.
  • This constraint considerably helps the decision process when strong evidence (good priors, reliable clicks) is available for any of the URIs connected by a SameAs link.
  • As time passes, decisions are reached on the correctness of the various matches, and the probabilistic network iteratively accumulates posterior probabilities on the reliability of the workers.
  • This corresponds to a parameter-learning phase in a probabilistic graphical model when some of the observations are missing (a simplified iterative sketch follows this list).
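The iterative accumulation of worker-reliability posteriors described above is essentially a parameter-learning loop: decide matches from the current reliability estimates, then re-estimate each worker's reliability from how often they agreed with those decisions. The following is a compact, illustrative EM-style loop, not the paper's exact message-passing procedure; the starting prior and smoothing are arbitrary choices.

```python
def learn_reliability(votes, n_iters=10):
    """votes[task][worker] is True/False; returns an estimated reliability per worker."""
    workers = {w for task_votes in votes.values() for w in task_votes}
    reliability = {w: 0.7 for w in workers}          # illustrative starting prior

    for _ in range(n_iters):
        # Decide each task with a reliability-weighted vote.
        decisions = {}
        for task, task_votes in votes.items():
            yes = sum(reliability[w] for w, v in task_votes.items() if v)
            no = sum(reliability[w] for w, v in task_votes.items() if not v)
            decisions[task] = yes >= no
        # Re-estimate reliability as a smoothed agreement rate with the decisions.
        for w in workers:
            answered = [(t, v) for t, tv in votes.items()
                        for ww, v in tv.items() if ww == w]
            agree = sum(1 for t, v in answered if v == decisions[t])
            reliability[w] = (agree + 1) / (len(answered) + 2)   # Laplace smoothing
    return reliability

votes = {
    "match1": {"w1": True, "w2": True, "w3": False},
    "match2": {"w1": False, "w2": False, "w3": True},
    "match3": {"w1": True, "w2": True, "w3": True},
}
print(learn_reliability(votes))   # w3 ends up with the lowest estimate
```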

7 Experiments on instance matching

  • The authors experimentally evaluate the effectiveness of ZenCrowd for the instance matching (IM) task.
  • ZenCrowd takes advantage of a probabilistic framework for making decisions and performs even better, leading to a relative performance improvement of up to 14 % over their best automatic matching approach (going from 0.78 to 0.89).
  • As workers cannot be selected dynamically for a given task on current crowdsourcing platforms (all the authors can do is prevent some workers from receiving further tasks through blacklisting, or decide not to reward workers who consistently perform badly), obtaining perfect matching results is thus, in general, unrealistic in non-controlled settings.
  • According to their ground truth, 383 out of the 488 automatically extracted entities can be correctly linked to URIs in their experiments, while the remaining ones are either wrongly extracted or not available in the LOD cloud the authors consider.
  • In Fig. 12, the authors report on the average recall of the top-5 candidates when classifying results based on the maximum confidence score obtained (top-1 score); a small recall computation in this style follows this list.
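To make the metric in the previous bullet concrete, the snippet below computes top-5 recall restricted to the entities whose top-1 confidence exceeds a threshold. The data and the threshold are invented for illustration and are not the paper's experimental results.

```python
# Each entry: (ranked candidate URIs with confidence scores, ground-truth URI).
results = [
    ([("uriA", 0.90), ("uriB", 0.40)], "uriA"),
    ([("uriC", 0.50), ("uriD", 0.45), ("uriE", 0.30)], "uriE"),
    ([("uriF", 0.95), ("uriG", 0.20)], "uriH"),   # truth was not retrieved at all
]

def recall_at_5(results, min_top1=0.6):
    """Top-5 recall over entities whose top-1 score passes the threshold."""
    selected = [(cands, truth) for cands, truth in results
                if cands and cands[0][1] >= min_top1]
    if not selected:
        return 0.0
    hits = sum(1 for cands, truth in selected
               if truth in [uri for uri, _ in cands[:5]])
    return hits / len(selected)

print(recall_at_5(results))   # 0.5: one hit out of the two high-confidence entities
```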

9 Conclusions

  • As the LOD movement gains momentum, matching instances across data sets and linking traditional Web content to the LOD cloud is getting increasingly important in order to foster automated information processing capabilities.
  • As their approach incorporates a human intelligence component, it typically cannot perform instance matching and entity linking tasks in real time.
  • For the entity linking task, ZenCrowd improves the precision of the results by 4–35 % over a state-of-the-art, manually optimized crowdsourcing approach, and on average by 14 % over their best automatic approach.
  • Moreover, considering documents written in languages other than English could be addressed by exploiting the multilingual property of many LOD data sets.




The VLDB Journal (2013) 22:665–687
DOI 10.1007/s00778-013-0324-z
SPECIAL ISSUE PAPER

Large-scale linked data integration using probabilistic reasoning and crowdsourcing

Gianluca Demartini · Djellel Eddine Difallah · Philippe Cudré-Mauroux

Received: 26 September 2012 / Revised: 15 June 2013 / Accepted: 20 June 2013 / Published online: 18 July 2013
© Springer-Verlag Berlin Heidelberg 2013
This work was supported by the Swiss National Science Foundation under grant number PP00P2_128459.

G. Demartini · D. E. Difallah · P. Cudré-Mauroux
eXascale Infolab, University of Fribourg, Fribourg, Switzerland
e-mail: gianluca.demartini@unifr.ch
D. E. Difallah
e-mail: djelleleddine.difallah@unifr.ch
P. Cudré-Mauroux
e-mail: philippe.cudre-mauroux@unifr.ch
Keywords Instance matching · Entity linking · Data integration · Crowdsourcing · Probabilistic reasoning
1 Introduction
Semistructured data are becoming more prominent on the Web as more and more data are either interweaved or serialized in HTML pages. The linked open data (LOD) community (http://linkeddata.org/), for instance, is bringing structured data to the Web by publishing data sets using the RDF formalism and by interlinking pieces of data coming from heterogeneous sources. As the LOD movement gains momentum, linking traditional Web content to the LOD cloud is giving rise to new possibilities for online information processing. For instance, identifying unique real-world objects, persons, or concepts in textual content and linking them to their LOD counterparts (also referred to as Entities) opens the door to automated text enrichment (e.g., by providing additional information coming from the LOD cloud on entities appearing in the HTML text), as well as streamlined information retrieval and integration (e.g., by using links to retrieve all text articles related to a given concept from the LOD cloud).

As more LOD data sets are being published on the Web, unique entities are getting described multiple times by different sources. It is therefore critical that such openly available data sets are interlinked to each other in order to promote global data interoperability. The interlinking of data sets describing similar entities enables Web developers to cope with the rapid growth of LOD data, by focusing on a small set of well-known data sets (such as DBPedia, http://www.dpbedia.org, or Freebase, http://freebase.org) and by automatically following links from those data sets to retrieve additional information whenever necessary.
Automatizing the process of matching instances from heterogeneous LOD data sets and the process of linking entities appearing in HTML pages to their correct LOD counterpart is currently drawing a lot of attention (see Sect. 2 below). These processes represent, however, a highly challenging task, as instance matching is known to be extremely difficult even in relatively simple contexts. Some of the challenges that arise in this context are (1) to identify entities appearing in natural text, (2) to cope with the large-scale and distributed nature of LOD, (3) to disambiguate candidate concepts, and finally (4) to match instances across data sets.
This paper describes ZenCrowd, a system we have developed in order to create links across large data sets containing similar instances and to semiautomatically identify LOD entities from textual content. In a recent work [17], we focused on the entity linking task, that is, on extracting and identifying occurrences of LOD instances from textual content (e.g., news articles in HTML format). In the present work, we extend ZenCrowd to handle both instance matching and entity linking. Our system gracefully combines algorithmic and manual integration, by first taking advantage of automated data integration techniques and then by improving the automatic results by involving human workers.
The ZenCrowd approach addresses the scalability issues of data integration by proposing a novel three-stage blocking technique that incrementally combines three very different approaches together. In a first step, we use an inverted index built over the entire data set to efficiently determine potential candidates and to obtain an initial ranked list of potential results. Top potential candidates are then analyzed further by taking advantage of more accurate (but also more costly) graph-based instance matching techniques (a similar structured/unstructured hybrid approach has been taken in [45]). Finally, results yielding low confidence values (as determined by probabilistic inference) are used to dynamically create micro-tasks published on a crowdsourcing platform, the assumption being that the tasks in question do not need special expertise to be performed.
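To make the three-stage flow concrete, here is a small, self-contained Python sketch of the idea. The toy data, the blocking and scoring functions, and the confidence threshold are illustrative stand-ins for the components described later in the paper, not the actual ZenCrowd implementation.

```python
from difflib import SequenceMatcher

# Toy "target data set": URI -> label. All data, scores, and the threshold
# below are illustrative stand-ins, not the measures used by ZenCrowd.
TARGET = {
    "http://example.org/e1": "Barack Obama",
    "http://example.org/e2": "Michelle Obama",
    "http://example.org/e3": "Springfield",
}

def label_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def stage1_index(label, top_k=2):
    """Stage 1: cheap, index-style blocking over labels."""
    ranked = sorted(TARGET, key=lambda uri: label_similarity(label, TARGET[uri]),
                    reverse=True)
    return ranked[:top_k]

def stage2_graph_score(label, uri):
    """Stage 2: stand-in for the costlier graph-based matching score."""
    return label_similarity(label, TARGET[uri])

def stage3_crowdsource(label, candidates):
    """Stage 3: placeholder; ZenCrowd turns such pairs into HITs."""
    print("Would crowdsource:", label, "vs", candidates)
    return candidates[0]

def match(label, threshold=0.9):
    candidates = stage1_index(label)
    best = max(candidates, key=lambda uri: stage2_graph_score(label, uri))
    if stage2_graph_score(label, best) < threshold:
        best = stage3_crowdsource(label, candidates)
    return best

print(match("B. Obama"))   # ambiguous enough to trigger the crowdsourcing stage
```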
ZenCrowd does not focus on the algorithmic problems of instance matching and entity linking per se. However, we make a number of key contributions at the interface of algorithmic and manual data integration and discuss in detail how to most effectively and efficiently combine scalable inverted indices, structured graph queries and human computation in order to match large LOD data sets. The contributions of this paper include the following:
  • a new system architecture supporting algorithmic and manual instance matching as well as entity linking in concert;
  • a new three-stage blocking approach that combines highly scalable automatic filtering of semistructured data together with more complex graph-based matching and high-quality manual matching performed by the crowd;
  • a new probabilistic inference framework to dynamically assess the results of arbitrary human workers operating on a crowdsourcing platform and to effectively combine their (conflicting) output, taking into account the results of the automatic stage;
  • an empirical evaluation of our system in a real deployment over different Human Intelligence Task interfaces, showing that ZenCrowd combines the best of both worlds, in the sense that our combined approach turns out to be more effective than both (a) purely algorithmic matching, by improving the accuracy, and (b) fully manual matching, by being cost-effective while mitigating the workers' uncertainty.
The rest of this paper is structured as follows: We review the state of the art in instance matching, entity linking, and crowdsourcing systems in Sect. 2. Section 3 introduces the terminology used throughout the paper. Section 4 gives an overview of the architecture of our system, including its algorithmic matching interface, its probabilistic inference engine, and its templating and crowdsourcing components. Section 5 presents our graph-based matching confidence measure as well as different methods to crowdsource instance matching and entity linking tasks. We describe our formal model to combine both algorithmic and crowdsourcing results using probabilistic networks in Sect. 6. We introduce our evaluation methodology and discuss results from a real deployment of our system for the instance matching task in Sect. 7 and for the entity linking task in Sect. 8, before concluding in Sect. 9.
2 Related work
2.1 Instance matching
The first task addressed by this paper is that of matching instances of multiple types among two data sets. Thanks to the LOD movement, many data sets describing instances have been created and published on the Web.

A lot of attention has been put on the task of automatic instance matching, which is defined as the identification of the same real-world object described in two different data sets. Classical matching approaches are based on string similarities (“Barack Obama” vs. “B. Obama”) such as the edit distance [33], the Jaro similarity [27], or the Jaro-Winkler similarity [50]. More advanced techniques, such as instance group linkage [40], compare groups of records to find matches. A third class of approaches uses semantic information. Reference reconciliation [21], for example, builds a dependency graph and exploits relations to propagate information among entities. Recently, approaches exploiting Wikipedia as background corpus have been proposed as well [9,13]. In [26], the authors propose entity disambiguation techniques using relations between entities in Wikipedia and concepts. The technique uses, for example, the link between “Michael Jordan” and the “University of California, Berkeley” or “basketball” on Wikipedia.
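For readers unfamiliar with these measures, the snippet below implements the classic Levenshtein edit distance and a normalized similarity derived from it; it is a generic textbook version, not the exact variant used by the systems cited above.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize the distance into a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(similarity("Barack Obama", "B. Obama"))  # partial match, well below 1.0
```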
The number of candidate matching pairs between two data sets grows rapidly (i.e., quadratically) with the size of the data, making the matching task rapidly intractable in practice. Methods based on blocking [41,49] have been proposed to tackle scalability issues. The idea is to adopt a computationally inexpensive method to first group together candidate matching pairs and, as a second step, to adopt a more accurate and expensive measure to compare all possible pairs within the candidate set.
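The blocking idea can be illustrated in a few lines: a cheap key (here simply the first token of the lowercased label, an arbitrary choice for the example) partitions the instances, and the expensive pairwise comparison is only run within each block.

```python
from collections import defaultdict
from itertools import product

source = ["Barack Obama", "Springfield (Illinois)"]
target = ["Barack H. Obama", "Springfield, Illinois", "Michael Jordan"]

def blocking_key(label):
    # Cheap, recall-oriented key (first token); real systems use sturdier keys.
    return label.lower().replace(",", " ").split()[0]

blocks = defaultdict(lambda: ([], []))
for s in source:
    blocks[blocking_key(s)][0].append(s)
for t in target:
    blocks[blocking_key(t)][1].append(t)

# The expensive comparison would only run on these within-block pairs.
candidate_pairs = [(s, t) for ss, ts in blocks.values() for s, t in product(ss, ts)]
print(candidate_pairs)
```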
Crowdsourcing techniques have already been leveraged for instance matching. In [48], the authors propose a hybrid human–machine approach that exploits both the scalability of automatic methods and the accuracy of manual matching. The focus of their work is on how to best present the matching task to the crowd. Instead, our work focuses on how to combine automated and manual matching by means of a three-stage blocking technique and a probabilistic network able to identify and weight out low-quality answers.
In idMesh [15], we built disambiguation graphs based on the transitive closures of equivalence links for networks containing uncertain information. Our present work focuses on hybrid matching techniques for LOD data sets, combining both automated processes and human computation in order to obtain a system that is both scalable and highly accurate.
2.2 Entity linking
The other task performed by ZenCrowd is entity linking, that is, identifying instances from textual content and linking them to their description in a database. Entities, that is, real-world objects described following a given schema/ontology, have recently become first-class citizens on the Web. A large amount of online search queries are about entities [42], and search engines exploit entities and structured data to build their result pages [25]. In the field of information retrieval (IR), a lot of attention has been given to entities: At TREC (http://trec.nist.gov), the main IR evaluation initiative, the tasks of Expert Finding, Related Entity Finding, and Entity List Completion have been studied [2,3]. Along similar lines, we have evaluated entity ranking in Wikipedia at INEX (https://inex.mmci.uni-saarland.de/) recently [18].
The problem of assigning identifiers to instances mentioned in textual content (i.e., entity linking) has been widely studied by the database and the semantic Web research communities. A related effort has, for example, been carried out in the context of the OKKAM project (http://www.okkam.org), which suggested the idea of an entity name system (ENS) to assign identifiers to entities on the Web [8]. The ENS could integrate techniques from our paper to improve matching effectiveness.

The first step in entity linking consists in extracting entities from textual content. Several approaches developed within the NLP field provide high-quality entity extraction for persons, locations, and organizations [4,12]. State-of-the-art techniques are implemented in tools like Gate [16], the Stanford Parser [30] (which we use in our experiments), and Extractiv (http://extractiv.com/).
Once entities are extracted, they still need to be disambiguated and matched to semantically similar but syntactically different occurrences of the same real-world object (e.g., “Mr. Obama” and “President of the USA”).

The final step in entity linking is that of deciding which links to retain in order to enrich the entity. Systems performing such a task are available as well (e.g., OpenCalais (http://www.opencalais.com/), DBPedia Spotlight [37]). Relevant approaches aim for instance at enriching documents by automatically creating links to Wikipedia pages [38,44], which can be seen as entity identifiers. While previous work selects uniform resource identifiers (URIs) from a specific corpus (e.g., DBPedia, Wikipedia), our goal in ZenCrowd is to assign entity identifiers from the larger LOD cloud (http://linkeddata.org/) instead.
The present work aims at correctly linking isolated entities to external entities using an effective combination of algorithmic and manual matching techniques. To the best of our knowledge, this paper is the first to propose a principled approach based on crowdsourcing techniques to improve the quality of automated entity linking algorithms.
2.3 Ad hoc object retrieval
Another task related to entity linking is ad hoc object retrieval (AOR) [42], where systems need to retrieve the correct URIs given a keyword query representing an entity. Such a task has been evaluated in the context of the Semantic Search workshop in 2010 (http://km.aifb.kit.edu/ws/semsearch10/) and 2011 (https://km.aifb.kit.edu/ws/semsearch11/), using a set of queries extracted from a commercial search engine query log and crowdsourcing techniques to create the gold standard. Most of the proposed systems for this task (see, for example, Blanco et al. [7]) exploit IR indexing and ranking techniques over the RDF data set used at the Billion Triple Challenge 2009 (http://challenge.semanticweb.org/). Similarly to such tasks, our data set is composed of a large set of triples coming from LOD data sets, while our queries consist of instance labels from the test set, where the gold standard is manually created by experts. In addition to those efforts, we selectively exploit the crowd to improve the accuracy of the task.

ZenCrowd adopts a hybrid architecture that combines unstructured inverted indices together with a structured graph database to optimize the task of instance matching. A similar approach has been taken in our previous work [45], where we combined structured and unstructured representations of graph data to effectively address the task of ad hoc object retrieval.
2.4 Crowdsourcing
ZenCrowd selectively adopts crowdsourcing to improve the quality in data integration tasks. Crowdsourcing is a term used to describe methods that generate or process data by asking a large group of people to complete small tasks. It is possible to categorize different crowdsourcing strategies based on the different types of incentives used to motivate the crowd to perform such tasks. One of the most successful examples of crowdsourcing is the creation of Wikipedia, an online encyclopedia collaboratively written by a large number of Web users. The incentive to create articles in Wikipedia is to help the community and to share knowledge with others.

An incentive that is often leveraged to get input from the crowd is fun. Games with a purpose have studied how to design entertaining applications that can generate useful data to be processed by further algorithms. An example of a successful game that at the same time generates meaningful data is the ESP game [46], where two human players have to agree on the words used to tag a picture. An extension of this game is Peekaboom: a game that asks the player to detect and annotate specific objects within an image [47].
A different type of crowdsourcing uses a monetary incentive to motivate the crowd to perform some tasks. The most popular paid crowdsourcing platform currently available is Amazon MTurk (http://www.mturk.com), where micro-tasks (called Human Intelligence Tasks or HITs) are published by requesters and selected by workers who perform them in exchange for a small monetary reward. We use the MTurk platform as a basis for the ZenCrowd system. Other paid crowdsourcing platforms use the approach of modeling worker skills to select the right worker for a specific HIT [20]. This is beneficial when the tasks are domain-specific and require workers having some domain knowledge. In this paper, we use MTurk as a crowdsourcing platform as we deal with well-known general-domain entities. Alternative platforms could be used for domain-specific data integration tasks like, for example, linking entities described in scientific articles. ZenCrowd uses paid crowdsourcing to enable fast scalability to large amounts of data. This is possible thanks to the continuous availability of human workers on crowdsourcing platforms such as Amazon MTurk.
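For concreteness, the following is a minimal sketch of publishing such a HIT programmatically with boto3's MTurk client. The question form, reward, sandbox endpoint, and all other parameter values are placeholders; ZenCrowd's actual HITs are generated dynamically from templates and entity data, as described later in the paper.

```python
import boto3

# All parameter values below are placeholders.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint so the example does not spend real money.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_xml = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>match</QuestionIdentifier>
    <QuestionContent>
      <Text>Does http://dbpedia.org/resource/Tom_Cruise describe the same entity as http://www.freebase.com/m/07r1h?</Text>
    </QuestionContent>
    <AnswerSpecification>
      <SelectionAnswer>
        <Selections>
          <Selection><SelectionIdentifier>yes</SelectionIdentifier><Text>Yes</Text></Selection>
          <Selection><SelectionIdentifier>no</SelectionIdentifier><Text>No</Text></Selection>
        </Selections>
      </SelectionAnswer>
    </AnswerSpecification>
  </Question>
</QuestionForm>"""

hit = mturk.create_hit(
    Title="Do these two records describe the same entity?",
    Description="Compare an instance from one data set with a candidate match.",
    Keywords="entity matching, linked data",
    Reward="0.02",
    MaxAssignments=3,                    # each task is shown to several workers
    LifetimeInSeconds=24 * 3600,
    AssignmentDurationInSeconds=300,
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```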
Paid crowdsourcing is a relatively recent technique that is currently being investigated in a number of contexts. In the IR community, crowdsourcing techniques have been mainly used to create test collections for repeatable relevance assessment [1,28,29]. The task of the workers is to judge the relevance of a document for a given query. Studies have shown that this is a practically relevant approach, which produces reliable evaluation collections [6]. The database community is currently evaluating how crowdsourcing methods can be used to build RDBMS systems able to answer complex queries where subjective comparison is needed (e.g., “10 papers with the most novel ideas”) [22,43]. Crowdsourcing can also be used for basic computational operations such as sort and join [36] as well as for sentiment analysis and image tagging [35].

In the context of entity identification, crowdsourcing has been used by Finn et al. [23] to annotate entities in Twitter. Their goal is simpler than ours, as they ask human workers to identify entities in text and assign a type (i.e., person, location, or organization) to the identified entities. Our goal is, instead, to assign entity identifiers to large numbers of entities on the Web. The two approaches might be combined to obtain high-quality results for both extraction and linking.
3 Preliminaries
As already mentioned, ZenCrowd addresses two distinct data integration tasks related to the general problem of entity resolution [24].

We define Instance Matching as the task of identifying two instances following different schemas (or ontologies) but referring to the same real-world object. Within the database literature, this task is related to record linkage [11], duplicate detection [5], or entity identification [34] when performed over two relational databases. However, in our setting, the main goal is to create new cross-data set <owl:sameAs> RDF statements. As commonly assumed for record linkage, we also assume that there are no duplicate entities within the same source and leverage this assumption when computing the final probability of a match in our probabilistic reasoning step.

We define Entity Linking as the task of assigning a URI selected from a background knowledge base for an entity mentioned in a textual document. This task is also known as entity resolution [24] or disambiguation [10] in the literature. In addition to the classic entity resolution task, the objective of our task is not only to understand which possible interpretation of the entity is correct (Michael Jordan the basketball player as compared to the UC Berkeley professor), but also to assign a URI to the entity, which can be used to retrieve additional factual information about it.
Given two LOD data sets U1 = {u11, ..., u1n} and U2 = {u21, ..., u2m} containing structured entity descriptions uij, where i identifies the data set and j the entity URI, we define instance matching as the identification of each pair (u1i, u2j) of entity URIs from U1 and U2 referring to the same real-world entity and call such a pair a match. An example of a match is given by the pair u11 = <http://dbpedia.org/resource/Tom_Cruise> and u21 = <http://www.freebase.com/m/07r1h>, where U1 is the DBPedia LOD data set and U2 is the Freebase LOD data set.
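Once such a pair is confirmed as a match, it is materialized as a cross-data set <owl:sameAs> statement. The snippet below shows one way to emit that triple for the Tom Cruise example using the rdflib library (assumed to be available; the serialization format is an arbitrary choice).

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
dbpedia_uri = URIRef("http://dbpedia.org/resource/Tom_Cruise")
freebase_uri = URIRef("http://www.freebase.com/m/07r1h")

# Materialize the confirmed match as a cross-data set owl:sameAs link.
g.add((dbpedia_uri, OWL.sameAs, freebase_uri))

print(g.serialize(format="nt"))
```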
Given a document d and a LOD data set U1 = {u11, ..., u1n}, we define entity linking as the task of identifying all entities in U1 from d and of associating the corresponding identifier u1i to each entity.
These two tasks are highly related: Instance matching aims at creating connections between different LOD data sets that describe the same real-world entity using different vocabularies. Such connections can then be used to run linking on textual documents. Indeed, ZenCrowd uses existing <owl:sameAs> statements as probabilistic priors to take a final decision about which links to select for an entity appearing in a textual document.

Hence, we use in the following the term entity to refer to a real-world object mentioned in a textual document (e.g., a news article), while we use the term instance to refer to its structured description (e.g., a set of RDF triples), which follows the well-defined schema of a LOD data set.

Our system relies on LOD data sets for both tasks. Such linked data sets describe interconnected entities that are commonly mentioned in Web content. As compared to traditional data integration tasks, the use of LOD data may support integration algorithms by means of its structured entity descriptions and entity interlinking within and across data sets (Fig. 1).

In our work, we make use of Human Intelligence at scale to, first, improve the quality of such links across data sets and, second, to connect unstructured documents to the structured representation of the entities they mention. To improve the result for both tasks, we selectively use paid micro-task crowdsourcing. To do this, we create HITs on a crowdsourcing platform. For the entity linking task, a HIT consists of asking which of the candidate links is correct for an entity extracted from a document. For the instance matching task, a HIT consists in finding which instance from a target data set corresponds to a given instance from a source data set. See Figs. 2, 3, and 4, which give examples of such tasks.

Paid crowdsourcing presents enormous advantages for high-quality data processing. The disadvantages, however, potentially include the following: high financial cost, low availability of workers, and poor workers' skills or honesty. To overcome those shortcomings, we alleviate the financial cost using an efficient decision engine that selectively picks tasks that have a high improvement potential. Our present assumption is that entities extracted from HTML news articles could be recognized by the large public, especially when provided with sufficient contextual information. Furthermore, each task is shown to multiple workers to balance out low-quality answers.
4 Architecture
ZenCrowd is a hybrid platform that takes advantage of both algorithmic and manual data integration techniques simultaneously. Figure 1 presents a simplified architecture of our system. We start by giving an overview of our system below in Sect. 4.1 and then describe in more detail some of its components in Sects. 4.2–4.4.
4.1 System overview
In the following, we describe the different components of the ZenCrowd system, focusing first on the instance matching and then on the entity linking pipeline.
4.1.1 Instance matching pipeline
In order to create new links, ZenCrowd takes as input a pair of data sets from the LOD cloud. Among the two data sets, one is selected as the source data set and one as the target data set. Then, for each instance of the source data set, our system tries to come up with candidate matches from the target data set.
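The first, cheap filtering step described in the next paragraph queries an index of the target data set by label. The toy index below illustrates the principle with a simple in-memory token index; a real deployment would use a dedicated inverted-index engine, and the tokenization and scoring here are deliberately simplistic stand-ins.

```python
from collections import defaultdict

# Toy label index over the target data set: token -> URIs whose label contains it.
target_labels = {
    "http://dbpedia.org/resource/Tom_Cruise": "Tom Cruise",
    "http://dbpedia.org/resource/Cruise_ship": "Cruise ship",
    "http://dbpedia.org/resource/Tom_Hanks": "Tom Hanks",
}

index = defaultdict(set)
for uri, label in target_labels.items():
    for token in label.lower().split():
        index[token].add(uri)

def candidates(source_label, top_k=5):
    """Rank target URIs by how many query tokens their labels share."""
    scores = defaultdict(int)
    for token in source_label.lower().split():
        for uri in index.get(token, ()):
            scores[uri] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

print(candidates("Tom Cruise"))  # the exact-label match ranks first
```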
First, the label used to name the source instance is used to query the LOD Index (see Sect. 4.2) in order to obtain a ranked list of candidate matches from the target data set. This can efficiently, and cheaply, filter out numerous clear non-matches out of potentially numerous (in the order of hundreds of millions for some LOD data sets) instances available. Next, top-ranked candidate instances are further examined in the graph database. This step is taken to obtain more complete information about the target instances, both to compute a more accurate matching score and to provide information to the Micro-Task Manager (see Fig. 1), which has to fill the

Citations
Proceedings ArticleDOI
Divesh Srivastava1
19 Dec 2013
TL;DR: This seminar explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.
Abstract: The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of Big Data. BDI differs from traditional data integration in many dimensions: (i) the number of data sources, even for a single domain, has grown to be in the tens of thousands, (ii) many of the data sources are very dynamic, as a huge amount of newly collected data are continuously made available, (iii) the data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities, and (iv) the data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of data provided. This talk explores the progress that has been made by the data integration community in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.

213 citations

Journal ArticleDOI
TL;DR: In this paper, a survey of quality in the context of crowdsourcing along several dimensions is presented to define and characterize it and to understand the current state-of-the-art.
Abstract: Crowdsourcing enables one to leverage on the intelligence and wisdom of potentially large groups of individuals toward solving problems. Common problems approached with crowdsourcing are labeling images, translating or transcribing text, providing opinions or ideas, and similar—all tasks that computers are not good at or where they may even fail altogether. The introduction of humans into computations and/or everyday work, however, also poses critical, novel challenges in terms of quality control, as the crowd is typically composed of people with unknown and very diverse abilities, skills, interests, personal objectives, and technological resources. This survey studies quality in the context of crowdsourcing along several dimensions, so as to define and characterize it and to understand the current state of the art. Specifically, this survey derives a quality model for crowdsourcing tasks, identifies the methods and techniques that can be used to assess the attributes of the model, and the actions and strategies that help prevent and mitigate quality problems. An analysis of how these features are supported by the state of the art further identifies open issues and informs an outlook on hot future research directions.

204 citations

Book
01 Aug 2015
TL;DR: This tutorial provides an overview of the key research results that are relevant to addressing the new challenges in entity resolution posed by the Web of data, in which real world entities are described by interlinked data rather than documents.
Abstract: This tutorial provides an overview of the key research results in the area of entity resolution that are relevant to addressing the new challenges in entity resolution posed by the Web of data, in which real world entities are described by interlinked data rather than documents. Since such descriptions are usually partial, overlapping and sometimes evolving, entity resolution emerges as a central problem both to increase dataset linking but also to search the Web of data for entities and their relations.

160 citations


Cites background from "Large-scale linked data integration..."

  • ...blocking) – When algorithms fail to reach a match decision, ask humans [Demartini et al. 2013] – Humans are used to verify only the most likely matches [Wang et al....


  • ...…history of descriptions (see [Dong & Tan 2015]) Uncertain ER: – consider confidence scores when resolving certain & uncertain entity descriptions (see [Gal 2014] [Demartini et al. 2013]) Privacy-aware ER: – Trade-off between entity obfuscation techniques and ER results quality (see [Whang &…...


Journal ArticleDOI
TL;DR: In this article, a tutorial explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.
Abstract: The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of Big Data.BDI differs from traditional data integration in many dimensions: (i) the number of data sources, even for a single domain, has grown to be in the tens of thousands, (ii) many of the data sources are very dynamic, as a huge amount of newly collected data are continuously made available, (iii) the data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities, and (iv) the data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of data provided. This tutorial explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.

153 citations

Book
01 Feb 2015
TL;DR: In this paper, a tutorial explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.
Abstract: The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of Big Data.BDI differs from traditional data integration in many dimensions: (i) the number of data sources, even for a single domain, has grown to be in the tens of thousands, (ii) many of the data sources are very dynamic, as a huge amount of newly collected data are continuously made available, (iii) the data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities, and (iv) the data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of data provided. This tutorial explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.

133 citations

References
Journal ArticleDOI
TL;DR: A generic message-passing algorithm, the sum-product algorithm, that operates in a factor graph, that computes-either exactly or approximately-various marginal functions derived from the global function.
Abstract: Algorithms that must deal with complicated global functions of many variables often exploit the manner in which the given functions factor as a product of "local" functions, each of which depends on a subset of the variables. Such a factorization can be visualized with a bipartite graph that we call a factor graph, In this tutorial paper, we present a generic message-passing algorithm, the sum-product algorithm, that operates in a factor graph. Following a single, simple computational rule, the sum-product algorithm computes-either exactly or approximately-various marginal functions derived from the global function. A wide variety of algorithms developed in artificial intelligence, signal processing, and digital communications can be derived as specific instances of the sum-product algorithm, including the forward/backward algorithm, the Viterbi algorithm, the iterative "turbo" decoding algorithm, Pearl's (1988) belief propagation algorithm for Bayesian networks, the Kalman filter, and certain fast Fourier transform (FFT) algorithms.

6,637 citations


"Large-scale linked data integration..." refers methods in this paper

  • ...The sum-product algorithm [32] exploits this observation to compute all marginal functions of a factor graph in a concurrent and efficient manner....


Proceedings ArticleDOI
07 Jul 2003
TL;DR: It is demonstrated that an unlexicalized PCFG can parse much more accurately than previously shown, by making use of simple, linguistically motivated state splits, which break down false independence assumptions latent in a vanilla treebank grammar.
Abstract: We demonstrate that an unlexicalized PCFG can parse much more accurately than previously shown, by making use of simple, linguistically motivated state splits, which break down false independence assumptions latent in a vanilla treebank grammar. Indeed, its performance of 86.36% (LP/LR F1) is better than that of early lexicalized PCFG models, and surprisingly close to the current state-of-the-art. This result has potential uses beyond establishing a strong lower bound on the maximum possible accuracy of unlexicalized models: an unlexicalized PCFG is much more compact, easier to replicate, and easier to interpret than more complex lexical models, and the parsing algorithms are simpler, more widely understood, of lower asymptotic complexity, and easier to optimize.

3,291 citations


"Large-scale linked data integration..." refers methods in this paper

  • ...extracted from it using the Stanford Parser [30] as entity extractor....


  • ...As described above, we use a state-of-the-art extractor (the Stanford Parser) for this task....


  • ...State-of-the-art techniques are implemented in tools like Gate [16], the Stanford Parser [30] (which we use in our experiments), and Extractiv....


  • ...State-of-the-art techniques are implemented in tools like Gate [16], the Stanford Parser [30] (which we use in our experiments), and Extractiv.7 Once entities are extracted, they still need to be disambiguated and matched to semantically similar but syntactically different occurrences of the same real-world object (e.g., “Mr. Obama” and “President of the USA”)....


  • ...The test collection we created is available for download at: http://exascale.info/zencrowd/. extracted from it using the Stanford Parser [30] as entity extractor....


Frequently Asked Questions (11)
Q1. What is the other task that ZenCrowd performs?

The other task that ZenCrowd performs is entity linking, that is, identifying occurrences of LOD entities in textual content and creating links from the text to corresponding instances stored in a database. 

The authors observe that the most challenging type of instances to match in their experiment is organizations, while people can be matched with high precision using automatic methods only. 

One of the most successful examples of crowdsourcing is the creation of Wikipedia, an online encyclopedia collaboratively written by a large number of Web users. 

For the entity linking task, ZenCrowd improves the precision of the results by 4–35 % over a state-of-the-art, manually optimized crowdsourcing approach, and on average by 14 % over their best automatic approach. 

In conclusion, ZenCrowd provides a reliable approach to entity linking and instance matching, which exploits the trade-off between large-scale automatic instance matching and high-quality human annotation and which, according to their results, improves the precision of the instance matching results by up to 14 % over their best automatic matching approach. 

Alternative platforms could be used for domain-specific data integration tasks like, for example, linking entities described in scientific articles. 

As time passes, decisions are reached on the correctness of the various matches, and the probabilistic network iteratively accumulates posterior probabilities on the reliability of the workers. 

An example of a successful game that at the same time generates meaningful data is the ESP game [46] where two human players have to agree on the words used to tag a picture. 

Within the database literature, this task is related to record linkage [11], duplicate detection [5], or entity identification [34] when performed over two relational databases. 

Given a document d and a LOD data set U1 = {u11, ..., u1n}, the authors define entity linking as the task of identifying all entities in U1 from d and of associating the corresponding identifier u1i to each entity. 

The test collection the authors created is available for download at: http://exascale.info/zencrowd/. Entities were extracted from each document using the Stanford Parser [30] as entity extractor.