
Large-scale linked data integration using probabilistic reasoning and crowdsourcing

01 Oct 2013 · Vol. 22, Iss. 5, pp. 665–687
TL;DR: The ZenCrowd system uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud.
Abstract: We tackle the problems of semiautomatically matching linked data sets and of linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find potential candidate results from entities that have been indexed in our system. Our system then analyzes the candidate matches and refines them whenever deemed necessary using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks in case the algorithmic components fail to come up with convincing results. We integrate all results from the inverted indices, from the graph database and from the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve a 95 % average accuracy on instance matching (as compared to the initial 88 % average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14 % over our best automatic approach.

Summary (3 min read)

1 Introduction

  • Semistructured data are becoming more prominent on the Web as more and more data are either interweaved or serialized in HTML pages.
  • This paper describes ZenCrowd, a system the authors have developed in order to create links across large data sets containing similar instances and to semiautomatically identify LOD entities from textual content.
  • In the present work, the authors extend ZenCrowd to handle both instance matching and entity linking.
  • The first task addressed by this paper is that of matching instances of multiple types among two data sets.
  • It is possible to categorize different crowdsourcing strategies based on the different types of incentives used to motivate the crowd to perform such tasks.

3 Preliminaries

  • As already mentioned, ZenCrowd addresses two distinct data integration tasks related to the general problem of entity resolution [24].
  • Given a document d and a LOD data set U1 = {u11, ..., u1n}, the authors define entity linking as the task of identifying all entities in U1 from d and of associating the corresponding identifier u1i to each entity.
  • These two tasks are highly related: Instance matching aims at creating connections between different LOD data sets that describe the same real-world entity using different vocabularies.
  • To improve the result for both tasks, the authors selectively use paid micro-task crowdsourcing.
  • The disadvantages, however, potentially include the following: high financial cost, low availability of workers, and poor workers’ skills or honesty.

4 Architecture

  • ZenCrowd is a hybrid platform that takes advantage of both algorithmic and manual data integration techniques simultaneously.
  • The authors start by giving an overview of their system in Sect. 4.1 and then describe some of its components in more detail in Sects. 4.2–4.4.
  • In the following, the authors describe the different components of the ZenCrowd system focusing first on the instance matching and then on the entity linking pipeline.

4.1.1 Instance matching pipeline

  • In order to create new links, ZenCrowd takes as input a pair of data sets from the LOD cloud.
  • Then, for each instance of the source data set, their system tries to come up with candidate matches from the target data set.
  • The architecture of ZenCrowd: For the instance matching task (green pipeline), the system takes as input a pair of data sets to be interlinked and creates new links between the data sets using <owl:sameAs> RDF triples.
  • At this point, the candidate matches that have a low confidence score are sent to the crowd for further analysis.
  • The Decision Engine collects confidence scores from the previous steps in order to decide what to crowdsource, together with data from the graph database to construct the HITs.

4.1.2 Entity linking pipeline

  • The other task that ZenCrowd performs is entity linking, that is, identifying occurrences of LOD entities in textual content and creating links from the text to corresponding instances stored in a database.
  • The authors' LOD index engine receives as input a list of SPARQL endpoints or LOD dumps as well as a list of triple patterns, and iteratively retrieves all corresponding triples from the LOD data sets.
  • Once extracted, the textual entities are inspected by algorithmic linkers, whose role is to find semantically related entities from the LOD cloud.
  • Given a source instance from a data set, ZenCrowd considers all instances of the target data set as possible matches.
  • Then, given the available resources, top pairs are crowdsourced in batches to improve the accuracy of the matching process (see the sketch after this list).
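The following self-contained sketch shows the general shape of such an entity linking pipeline: look up candidate URIs for each extracted mention, keep the confident links, and send the uncertain ones to the crowd in batches. The candidate table, the confidence measure, the threshold, and the batch size are illustrative stand-ins, not the paper's actual components.

```python
from difflib import SequenceMatcher

# Toy candidate table: entity mention -> candidate LOD URIs.
CANDIDATES = {
    "Obama": ["http://dbpedia.org/resource/Barack_Obama",
              "http://dbpedia.org/resource/Barack_Obama_Sr."],
    "Fribourg": ["http://dbpedia.org/resource/Fribourg",
                 "http://dbpedia.org/resource/Canton_of_Fribourg"],
}

def link_confidence(mention, uri):
    """Stand-in confidence: label similarity between mention and URI tail."""
    label = uri.rsplit("/", 1)[-1].replace("_", " ")
    return SequenceMatcher(None, mention.lower(), label.lower()).ratio()

def plan_crowdsourcing(mentions, batch_size=2, threshold=0.9):
    """Group the mentions whose best candidate link is uncertain into batches."""
    uncertain = []
    for m in mentions:
        cands = CANDIDATES.get(m, [])
        if cands and max(link_confidence(m, u) for u in cands) < threshold:
            uncertain.append((m, cands))
    return [uncertain[i:i + batch_size] for i in range(0, len(uncertain), batch_size)]

# "Obama" is ambiguous and gets crowdsourced; "Fribourg" is linked automatically.
print(plan_crowdsourcing(["Obama", "Fribourg"]))
```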

5 Effective instance matching based on confidence estimation and crowdsourcing

  • The authors describe the final steps of the blocking process that assure high-quality instance matching results.
  • The authors first define their schema-based matching confidence measure, which is then used to decide which candidate matches to crowdsource (a minimal decision sketch follows this list).
  • This second interface defines a simpler task for the worker by presenting directly on the HIT page relevant information about the target entity as well as about the candidate matches.
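The paper's actual confidence measure is computed over the graph database; the sketch below only illustrates the surrounding decision logic, using a made-up confidence based on the score gap between the two best candidates. The scores, thresholds, and the Freebase identifiers other than the Tom Cruise pair are invented for the example.

```python
def confidence(scored_candidates):
    """Illustrative confidence: gap between the best and second-best score."""
    ranked = sorted(scored_candidates, key=lambda c: c[1], reverse=True)
    if len(ranked) < 2:
        return ranked[0][1] if ranked else 0.0
    return ranked[0][1] - ranked[1][1]

def to_crowdsource(all_matchings, budget, threshold=0.2):
    """Pick the least confident candidate matchings, up to the crowd budget."""
    uncertain = [(m, cands) for m, cands in all_matchings
                 if confidence(cands) < threshold]
    # Spend the budget on the hardest (least confident) cases first.
    uncertain.sort(key=lambda pair: confidence(pair[1]))
    return uncertain[:budget]

matchings = [
    ("dbpedia:Tom_Cruise", [("freebase:/m/07r1h", 0.95), ("freebase:/m/xxxx0", 0.30)]),
    ("dbpedia:Springfield", [("freebase:/m/xxxx1", 0.55), ("freebase:/m/xxxx2", 0.50)]),
]
print(to_crowdsource(matchings, budget=5))   # only the ambiguous Springfield case
```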

6 Probabilistic models

  • ZenCrowd exploits probabilistic models to make sensible decisions about candidate results.
  • The authors use factor graphs to graphically represent probabilistic variables and distributions in the following.
  • The authors give below a brief introduction to factor graphs and message-passing techniques.
  • Each candidate can also be examined by human workers wi performing micro-matching tasks, whose clicks cij express whether a given candidate match corresponds to the source instance from their perspective (a simplified aggregation sketch follows this list).
  • Clicks, workers, and matchings are further connected through two factors described below.
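The factor-graph model combines a prior on each candidate match with the clicks cij of workers wi of varying reliability. As a much-simplified, self-contained stand-in (plain Bayes' rule over independent worker votes rather than message passing), the following shows how a prior and a set of clicks could be folded into a posterior; the prior and reliability values are illustrative.

```python
def posterior_match(prior, clicks, reliability):
    """P(match is correct | clicks), assuming independent workers.

    clicks[i] is True if worker i clicked "same entity", False otherwise;
    reliability[i] is the assumed probability that worker i answers correctly.
    """
    p_correct, p_incorrect = prior, 1.0 - prior
    for click, r in zip(clicks, reliability):
        # A "yes" click is likely (prob. r) if the match is correct and
        # unlikely (prob. 1 - r) if it is not; symmetrically for "no" clicks.
        p_correct *= r if click else (1.0 - r)
        p_incorrect *= (1.0 - r) if click else r
    return p_correct / (p_correct + p_incorrect)

# Two reliable workers agree on "yes", one unreliable worker says "no".
print(posterior_match(prior=0.6,
                      clicks=[True, True, False],
                      reliability=[0.9, 0.8, 0.55]))
```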

6.2.2 Unicity constraints for entity linking

  • The instance matching task definition assumes that only one instance from the target data set can be a correct match for the source instance.
  • The authors can thus rule out all configurations where more than one candidate from the same LOD data set is considered as Correct (see the sketch after this list).
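The unicity constraint can be read as a hard factor that assigns zero probability to any assignment in which two candidates from the same data set are both labeled Correct. The toy filter below enumerates assignments explicitly to show the effect, whereas the paper encodes the constraint inside the factor graph; the candidate names are invented.

```python
from itertools import product

def unicity_ok(assignment, dataset_of):
    """True if at most one candidate per data set is labeled Correct."""
    correct_per_ds = {}
    for candidate, label in assignment.items():
        if label == "Correct":
            ds = dataset_of[candidate]
            if correct_per_ds.get(ds):
                return False      # a second Correct candidate from the same data set
            correct_per_ds[ds] = True
    return True

candidates = ["dbpedia:A", "dbpedia:B", "freebase:X"]
dataset_of = {"dbpedia:A": "dbpedia", "dbpedia:B": "dbpedia", "freebase:X": "freebase"}

valid = [dict(zip(candidates, labels))
         for labels in product(["Correct", "Incorrect"], repeat=len(candidates))
         if unicity_ok(dict(zip(candidates, labels)), dataset_of)]
print(len(valid))  # 6 of the 8 possible assignments survive the constraint
```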

6.2.3 SameAs constraints for entity linking

  • SameAs constraints are exclusively used in entity linking graphs.
  • This constraint considerably helps the decision process when strong evidence (good priors, reliable clicks) is available for any of the URIs connected by a SameAs link.
  • As time passes, decisions are reached on the correctness of the various matches, and the probabilistic network iteratively accumulates posterior probabilities on the reliability of the workers.
  • This corresponds to a parameter-learning phase in a probabilistic graphical model when some of the observations are missing (a simplified iterative sketch follows this list).
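The iterative accumulation of worker-reliability posteriors described above is essentially a parameter-learning loop: decide matches from the current reliability estimates, then re-estimate each worker's reliability from how often they agreed with those decisions. The following is a compact, illustrative EM-style loop, not the paper's exact message-passing procedure; the starting prior and smoothing are arbitrary choices.

```python
def learn_reliability(votes, n_iters=10):
    """votes[task][worker] is True/False; returns an estimated reliability per worker."""
    workers = {w for task_votes in votes.values() for w in task_votes}
    reliability = {w: 0.7 for w in workers}          # illustrative starting prior

    for _ in range(n_iters):
        # Decide each task with a reliability-weighted vote.
        decisions = {}
        for task, task_votes in votes.items():
            yes = sum(reliability[w] for w, v in task_votes.items() if v)
            no = sum(reliability[w] for w, v in task_votes.items() if not v)
            decisions[task] = yes >= no
        # Re-estimate reliability as a smoothed agreement rate with the decisions.
        for w in workers:
            answered = [(t, v) for t, tv in votes.items()
                        for ww, v in tv.items() if ww == w]
            agree = sum(1 for t, v in answered if v == decisions[t])
            reliability[w] = (agree + 1) / (len(answered) + 2)   # Laplace smoothing
    return reliability

votes = {
    "match1": {"w1": True, "w2": True, "w3": False},
    "match2": {"w1": False, "w2": False, "w3": True},
    "match3": {"w1": True, "w2": True, "w3": True},
}
print(learn_reliability(votes))   # w3 ends up with the lowest estimate
```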

7 Experiments on instance matching

  • The authors experimentally evaluate the effectiveness of ZenCrowd for the instance matching (IM) task.
  • ZenCrowd takes advantage of a probabilistic framework for making decisions and performs even better, leading to a relative performance improvement of up to 14 % over their best automatic matching approach (going from 0.78 to 0.89).
  • As workers cannot be selected dynamically for a given task on current crowdsourcing platforms (all the authors can do is prevent some workers from receiving further tasks through blacklisting, or decide not to reward workers who consistently perform badly), obtaining perfect matching results is thus, in general, unrealistic in non-controlled settings.
  • According to their ground truth, 383 out of the 488 automatically extracted entities can be correctly linked to URIs in their experiments, while the remaining ones are either wrongly extracted or not available in the LOD cloud the authors consider.
  • In Fig. 12, the authors report on the average recall of the top-5 candidates when classifying results based on the maximum confidence score obtained (top-1 score); a small recall computation in this style follows this list.
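To make the metric in the previous bullet concrete, the snippet below computes top-5 recall restricted to the entities whose top-1 confidence exceeds a threshold. The data and the threshold are invented for illustration and are not the paper's experimental results.

```python
# Each entry: (ranked candidate URIs with confidence scores, ground-truth URI).
results = [
    ([("uriA", 0.90), ("uriB", 0.40)], "uriA"),
    ([("uriC", 0.50), ("uriD", 0.45), ("uriE", 0.30)], "uriE"),
    ([("uriF", 0.95), ("uriG", 0.20)], "uriH"),   # truth was not retrieved at all
]

def recall_at_5(results, min_top1=0.6):
    """Top-5 recall over entities whose top-1 score passes the threshold."""
    selected = [(cands, truth) for cands, truth in results
                if cands and cands[0][1] >= min_top1]
    if not selected:
        return 0.0
    hits = sum(1 for cands, truth in selected
               if truth in [uri for uri, _ in cands[:5]])
    return hits / len(selected)

print(recall_at_5(results))   # 0.5: one hit out of the two high-confidence entities
```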

9 Conclusions

  • As the LOD movement gains momentum, matching instances across data sets and linking traditional Web content to the LOD cloud is getting increasingly important in order to foster automated information processing capabilities.
  • As their approach incorporates a human intelligence component, it typically cannot perform instance matching and entity linking tasks in real time.
  • For the entity linking task, ZenCrowd improves the precision of the results by 4–35 % over a state-of-the-art, manually optimized crowdsourcing approach, and on average by 14 % over their best automatic approach.
  • Moreover, considering documents written in languages other than English could be addressed by exploiting the multilingual property of many LOD data sets.




The VLDB Journal (2013) 22:665–687
DOI 10.1007/s00778-013-0324-z
SPECIAL ISSUE PAPER

Large-scale linked data integration using probabilistic reasoning and crowdsourcing

Gianluca Demartini · Djellel Eddine Difallah · Philippe Cudré-Mauroux

Received: 26 September 2012 / Revised: 15 June 2013 / Accepted: 20 June 2013 / Published online: 18 July 2013
© Springer-Verlag Berlin Heidelberg 2013
This work was supported by the Swiss National Science Foundation under grant number PP00P2_128459.

G. Demartini · D. E. Difallah · P. Cudré-Mauroux
eXascale Infolab, University of Fribourg, Fribourg, Switzerland
e-mail: gianluca.demartini@unifr.ch
D. E. Difallah
e-mail: djelleleddine.difallah@unifr.ch
P. Cudré-Mauroux
e-mail: philippe.cudre-mauroux@unifr.ch
Keywords Instance matching · Entity linking · Data integration · Crowdsourcing · Probabilistic reasoning
1 Introduction
Semistructured data are becoming more prominent on the Web as more and more data are either interweaved or serialized in HTML pages. The linked open data (LOD) community (http://linkeddata.org/), for instance, is bringing structured data to the Web by publishing data sets using the RDF formalism and by interlinking pieces of data coming from heterogeneous sources. As the LOD movement gains momentum, linking traditional Web content to the LOD cloud is giving rise to new possibilities for online information processing. For instance, identifying unique real-world objects, persons, or concepts in textual content and linking them to their LOD counterparts (also referred to as Entities) opens the door to automated text enrichment (e.g., by providing additional information coming from the LOD cloud on entities appearing in the HTML text), as well as streamlined information retrieval and integration (e.g., by using links to retrieve all text articles related to a given concept from the LOD cloud).

As more LOD data sets are being published on the Web, unique entities are getting described multiple times by different sources. It is therefore critical that such openly available data sets are interlinked to each other in order to promote global data interoperability. The interlinking of data sets describing similar entities enables Web developers to cope with the rapid growth of LOD data, by focusing on a small set of well-known data sets (such as DBPedia, http://www.dpbedia.org, or Freebase, http://freebase.org) and by automatically following links from those data sets to retrieve additional information whenever necessary.
Automatizing the process of matching instances from heterogeneous LOD data sets and the process of linking entities appearing in HTML pages to their correct LOD counterpart is currently drawing a lot of attention (see Sect. 2 below). These processes represent, however, a highly challenging task, as instance matching is known to be extremely difficult even in relatively simple contexts. Some of the challenges that arise in this context are (1) to identify entities appearing in natural text, (2) to cope with the large-scale and distributed nature of LOD, (3) to disambiguate candidate concepts, and finally (4) to match instances across data sets.
This paper describes ZenCrowd, a system we have developed in order to create links across large data sets containing similar instances and to semiautomatically identify LOD entities from textual content. In a recent work [17], we focused on the entity linking task, that is, on extracting and identifying occurrences of LOD instances from textual content (e.g., news articles in HTML format). In the present work, we extend ZenCrowd to handle both instance matching and entity linking. Our system gracefully combines algorithmic and manual integration, by first taking advantage of automated data integration techniques and then by improving the automatic results by involving human workers.
The ZenCrowd approach addresses the scalability issues of data integration by proposing a novel three-stage blocking technique that incrementally combines three very different approaches together. In a first step, we use an inverted index built over the entire data set to efficiently determine potential candidates and to obtain an initial ranked list of potential results. Top potential candidates are then analyzed further by taking advantage of more accurate (but also more costly) graph-based instance matching techniques (a similar structured/unstructured hybrid approach has been taken in [45]). Finally, results yielding low confidence values (as determined by probabilistic inference) are used to dynamically create micro-tasks published on a crowdsourcing platform, the assumption being that the tasks in question do not need special expertise to be performed.
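To make the three-stage flow concrete, here is a small, self-contained Python sketch of the idea. The toy data, the blocking and scoring functions, and the confidence threshold are illustrative stand-ins for the components described later in the paper, not the actual ZenCrowd implementation.

```python
from difflib import SequenceMatcher

# Toy "target data set": URI -> label. All data, scores, and the threshold
# below are illustrative stand-ins, not the measures used by ZenCrowd.
TARGET = {
    "http://example.org/e1": "Barack Obama",
    "http://example.org/e2": "Michelle Obama",
    "http://example.org/e3": "Springfield",
}

def label_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def stage1_index(label, top_k=2):
    """Stage 1: cheap, index-style blocking over labels."""
    ranked = sorted(TARGET, key=lambda uri: label_similarity(label, TARGET[uri]),
                    reverse=True)
    return ranked[:top_k]

def stage2_graph_score(label, uri):
    """Stage 2: stand-in for the costlier graph-based matching score."""
    return label_similarity(label, TARGET[uri])

def stage3_crowdsource(label, candidates):
    """Stage 3: placeholder; ZenCrowd turns such pairs into HITs."""
    print("Would crowdsource:", label, "vs", candidates)
    return candidates[0]

def match(label, threshold=0.9):
    candidates = stage1_index(label)
    best = max(candidates, key=lambda uri: stage2_graph_score(label, uri))
    if stage2_graph_score(label, best) < threshold:
        best = stage3_crowdsource(label, candidates)
    return best

print(match("B. Obama"))   # ambiguous enough to trigger the crowdsourcing stage
```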
ZenCrowd does not focus on the algorithmic problems of instance matching and entity linking per se. However, we make a number of key contributions at the interface of algorithmic and manual data integration and discuss in detail how to most effectively and efficiently combine scalable inverted indices, structured graph queries and human computation in order to match large LOD data sets. The contributions of this paper include the following:
  • a new system architecture supporting algorithmic and manual instance matching as well as entity linking in concert;
  • a new three-stage blocking approach that combines highly scalable automatic filtering of semistructured data together with more complex graph-based matching and high-quality manual matching performed by the crowd;
  • a new probabilistic inference framework to dynamically assess the results of arbitrary human workers operating on a crowdsourcing platform and to effectively combine their (conflicting) output, taking into account the results of the automatic stage;
  • an empirical evaluation of our system in a real deployment over different Human Intelligence Task interfaces, showing that ZenCrowd combines the best of both worlds, in the sense that our combined approach turns out to be more effective than both (a) purely algorithmic matching, by improving the accuracy, and (b) fully manual matching, by being cost-effective while mitigating the workers' uncertainty.
The rest of this paper is structured as follows: We review the state of the art in instance matching, entity linking, and crowdsourcing systems in Sect. 2. Section 3 introduces the terminology used throughout the paper. Section 4 gives an overview of the architecture of our system, including its algorithmic matching interface, its probabilistic inference engine, and its templating and crowdsourcing components. Section 5 presents our graph-based matching confidence measure as well as different methods to crowdsource instance matching and entity linking tasks. We describe our formal model to combine both algorithmic and crowdsourcing results using probabilistic networks in Sect. 6. We introduce our evaluation methodology and discuss results from a real deployment of our system for the instance matching task in Sect. 7 and for the entity linking task in Sect. 8, before concluding in Sect. 9.
2 Related work
2.1 Instance matching
The first task addressed by this paper is that of matching instances of multiple types among two data sets. Thanks to the LOD movement, many data sets describing instances have been created and published on the Web.

A lot of attention has been put on the task of automatic instance matching, which is defined as the identification of the same real-world object described in two different data sets. Classical matching approaches are based on string similarities (“Barack Obama” vs. “B. Obama”) such as the edit distance [33], the Jaro similarity [27], or the Jaro-Winkler similarity [50]. More advanced techniques, such as instance group linkage [40], compare groups of records to find matches. A third class of approaches uses semantic information. Reference reconciliation [21], for example, builds a dependency graph and exploits relations to propagate information among entities. Recently, approaches exploiting Wikipedia as background corpus have been proposed as well [9,13]. In [26], the authors propose entity disambiguation techniques using relations between entities in Wikipedia and concepts. The technique uses, for example, the link between “Michael Jordan” and the “University of California, Berkeley” or “basketball” on Wikipedia.
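For readers unfamiliar with these measures, the snippet below implements the classic Levenshtein edit distance and a normalized similarity derived from it; it is a generic textbook version, not the exact variant used by the systems cited above.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize the distance into a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(similarity("Barack Obama", "B. Obama"))  # partial match, well below 1.0
```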
The number of candidate matching pairs between two data sets grows rapidly (i.e., quadratically) with the size of the data, making the matching task rapidly intractable in practice. Methods based on blocking [41,49] have been proposed to tackle scalability issues. The idea is to adopt a computationally inexpensive method to first group together candidate matching pairs and, as a second step, to adopt a more accurate and expensive measure to compare all possible pairs within the candidate set.
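The blocking idea can be illustrated in a few lines: a cheap key (here simply the first token of the lowercased label, an arbitrary choice for the example) partitions the instances, and the expensive pairwise comparison is only run within each block.

```python
from collections import defaultdict
from itertools import product

source = ["Barack Obama", "Springfield (Illinois)"]
target = ["Barack H. Obama", "Springfield, Illinois", "Michael Jordan"]

def blocking_key(label):
    # Cheap, recall-oriented key (first token); real systems use sturdier keys.
    return label.lower().replace(",", " ").split()[0]

blocks = defaultdict(lambda: ([], []))
for s in source:
    blocks[blocking_key(s)][0].append(s)
for t in target:
    blocks[blocking_key(t)][1].append(t)

# The expensive comparison would only run on these within-block pairs.
candidate_pairs = [(s, t) for ss, ts in blocks.values() for s, t in product(ss, ts)]
print(candidate_pairs)
```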
Crowdsourcing techniques have already been leveraged for instance matching. In [48], the authors propose a hybrid human–machine approach that exploits both the scalability of automatic methods and the accuracy of manual matching. The focus of their work is on how to best present the matching task to the crowd. Instead, our work focuses on how to combine automated and manual matching by means of a three-stage blocking technique and a probabilistic network able to identify and weight out low-quality answers.
In idMesh [15], we built disambiguation graphs based on the transitive closures of equivalence links for networks containing uncertain information. Our present work focuses on hybrid matching techniques for LOD data sets, combining both automated processes and human computation in order to obtain a system that is both scalable and highly accurate.
2.2 Entity linking
The other task performed by ZenCrowd is entity linking, that is, identifying instances from textual content and linking them to their description in a database. Entities, that is, real-world objects described following a given schema/ontology, have recently become first-class citizens on the Web. A large amount of online search queries are about entities [42], and search engines exploit entities and structured data to build their result pages [25]. In the field of information retrieval (IR), a lot of attention has been given to entities: At TREC (http://trec.nist.gov), the main IR evaluation initiative, the tasks of Expert Finding, Related Entity Finding, and Entity List Completion have been studied [2,3]. Along similar lines, we have evaluated entity ranking in Wikipedia at INEX (https://inex.mmci.uni-saarland.de/) recently [18].
The problem of assigning identifiers to instances mentioned in textual content (i.e., entity linking) has been widely studied by the database and the semantic Web research communities. A related effort has, for example, been carried out in the context of the OKKAM project (http://www.okkam.org), which suggested the idea of an entity name system (ENS) to assign identifiers to entities on the Web [8]. The ENS could integrate techniques from our paper to improve matching effectiveness.

The first step in entity linking consists in extracting entities from textual content. Several approaches developed within the NLP field provide high-quality entity extraction for persons, locations, and organizations [4,12]. State-of-the-art techniques are implemented in tools like Gate [16], the Stanford Parser [30] (which we use in our experiments), and Extractiv (http://extractiv.com/).
Once entities are extracted, they still need to be disambiguated and matched to semantically similar but syntactically different occurrences of the same real-world object (e.g., “Mr. Obama” and “President of the USA”).

The final step in entity linking is that of deciding which links to retain in order to enrich the entity. Systems performing such a task are available as well (e.g., OpenCalais (http://www.opencalais.com/), DBPedia Spotlight [37]). Relevant approaches aim for instance at enriching documents by automatically creating links to Wikipedia pages [38,44], which can be seen as entity identifiers. While previous work selects uniform resource identifiers (URIs) from a specific corpus (e.g., DBPedia, Wikipedia), our goal in ZenCrowd is to assign entity identifiers from the larger LOD cloud (http://linkeddata.org/) instead.
The present work aims at correctly linking isolated entities to external entities using an effective combination of algorithmic and manual matching techniques. To the best of our knowledge, this paper is the first to propose a principled approach based on crowdsourcing techniques to improve the quality of automated entity linking algorithms.
2.3 Ad hoc object retrieval
Another task related to entity linking is ad hoc object retrieval (AOR) [42], where systems need to retrieve the correct URIs given a keyword query representing an entity. Such a task has been evaluated in the context of the Semantic Search workshop in 2010 (http://km.aifb.kit.edu/ws/semsearch10/) and 2011 (https://km.aifb.kit.edu/ws/semsearch11/), using a set of queries extracted from a commercial search engine query log and crowdsourcing techniques to create the gold standard. Most of the proposed systems for this task (see, for example, Blanco et al. [7]) exploit IR indexing and ranking techniques over the RDF data set used at the Billion Triple Challenge 2009 (http://challenge.semanticweb.org/). Similarly to such tasks, our data set is composed of a large set of triples coming from LOD data sets, while our queries consist of instance labels from the test set, where the gold standard is manually created by experts. In addition to those efforts, we selectively exploit the crowd to improve the accuracy of the task.

ZenCrowd adopts a hybrid architecture that combines unstructured inverted indices together with a structured graph database to optimize the task of instance matching. A similar approach has been taken in our previous work [45], where we combined structured and unstructured representations of graph data to effectively address the task of ad hoc object retrieval.
2.4 Crowdsourcing
ZenCrowd selectively adopts crowdsourcing to improve the quality in data integration tasks. Crowdsourcing is a term used to describe methods that generate or process data by asking a large group of people to complete small tasks. It is possible to categorize different crowdsourcing strategies based on the different types of incentives used to motivate the crowd to perform such tasks. One of the most successful examples of crowdsourcing is the creation of Wikipedia, an online encyclopedia collaboratively written by a large number of Web users. The incentive to create articles in Wikipedia is to help the community and to share knowledge with others.

An incentive that is often leveraged to get input from the crowd is fun. Games with a purpose have studied how to design entertaining applications that can generate useful data to be processed by further algorithms. An example of a successful game that at the same time generates meaningful data is the ESP game [46], where two human players have to agree on the words used to tag a picture. An extension of this game is Peekaboom: a game that asks the player to detect and annotate specific objects within an image [47].
A different type of crowdsourcing uses a monetary incentive to motivate the crowd to perform some tasks. The most popular paid crowdsourcing platform currently available is Amazon MTurk (http://www.mturk.com), where micro-tasks (called Human Intelligence Tasks or HITs) are published by requesters and selected by workers who perform them in exchange for a small monetary reward. We use the MTurk platform as a basis for the ZenCrowd system. Other paid crowdsourcing platforms use the approach of modeling worker skills to select the right worker for a specific HIT [20]. This is beneficial when the tasks are domain-specific and require workers having some domain knowledge. In this paper, we use MTurk as a crowdsourcing platform as we deal with well-known general-domain entities. Alternative platforms could be used for domain-specific data integration tasks like, for example, linking entities described in scientific articles. ZenCrowd uses paid crowdsourcing to enable fast scalability to large amounts of data. This is possible thanks to the continuous availability of human workers on crowdsourcing platforms such as Amazon MTurk.
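For concreteness, the following is a minimal sketch of publishing such a HIT programmatically with boto3's MTurk client. The question form, reward, sandbox endpoint, and all other parameter values are placeholders; ZenCrowd's actual HITs are generated dynamically from templates and entity data, as described later in the paper.

```python
import boto3

# All parameter values below are placeholders.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint so the example does not spend real money.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_xml = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>match</QuestionIdentifier>
    <QuestionContent>
      <Text>Does http://dbpedia.org/resource/Tom_Cruise describe the same entity as http://www.freebase.com/m/07r1h?</Text>
    </QuestionContent>
    <AnswerSpecification>
      <SelectionAnswer>
        <Selections>
          <Selection><SelectionIdentifier>yes</SelectionIdentifier><Text>Yes</Text></Selection>
          <Selection><SelectionIdentifier>no</SelectionIdentifier><Text>No</Text></Selection>
        </Selections>
      </SelectionAnswer>
    </AnswerSpecification>
  </Question>
</QuestionForm>"""

hit = mturk.create_hit(
    Title="Do these two records describe the same entity?",
    Description="Compare an instance from one data set with a candidate match.",
    Keywords="entity matching, linked data",
    Reward="0.02",
    MaxAssignments=3,                    # each task is shown to several workers
    LifetimeInSeconds=24 * 3600,
    AssignmentDurationInSeconds=300,
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```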
Paid crowdsourcing is a relatively recent technique that is currently being investigated in a number of contexts. In the IR community, crowdsourcing techniques have been mainly used to create test collections for repeatable relevance assessment [1,28,29]. The task of the workers is to judge the relevance of a document for a given query. Studies have shown that this is a practically relevant approach, which produces reliable evaluation collections [6]. The database community is currently evaluating how crowdsourcing methods can be used to build RDBMS systems able to answer complex queries where subjective comparison is needed (e.g., “10 papers with the most novel ideas”) [22,43]. Crowdsourcing can also be used for basic computational operations such as sort and join [36] as well as for sentiment analysis and image tagging [35].

In the context of entity identification, crowdsourcing has been used by Finn et al. [23] to annotate entities in Twitter. Their goal is simpler than ours, as they ask human workers to identify entities in text and assign a type (i.e., person, location, or organization) to the identified entities. Our goal is, instead, to assign entity identifiers to large numbers of entities on the Web. The two approaches might be combined to obtain high-quality results for both extraction and linking.
3 Preliminaries
As already mentioned, ZenCrowd addresses two distinct data integration tasks related to the general problem of entity resolution [24].

We define Instance Matching as the task of identifying two instances following different schemas (or ontologies) but referring to the same real-world object. Within the database literature, this task is related to record linkage [11], duplicate detection [5], or entity identification [34] when performed over two relational databases. However, in our setting, the main goal is to create new cross-data set <owl:sameAs> RDF statements. As commonly assumed for record linkage, we also assume that there are no duplicate entities within the same source and leverage this assumption when computing the final probability of a match in our probabilistic reasoning step.

We define Entity Linking as the task of assigning a URI selected from a background knowledge base for an entity mentioned in a textual document. This task is also known as entity resolution [24] or disambiguation [10] in the literature. In addition to the classic entity resolution task, the objective of our task is not only to understand which possible interpretation of the entity is correct (Michael Jordan the basketball player as compared to the UC Berkeley professor), but also to assign a URI to the entity, which can be used to retrieve additional factual information about it.
Given two LOD data sets U1 = {u11, ..., u1n} and U2 = {u21, ..., u2m} containing structured entity descriptions uij, where i identifies the data set and j the entity URI, we define instance matching as the identification of each pair (u1i, u2j) of entity URIs from U1 and U2 referring to the same real-world entity and call such a pair a match. An example of a match is given by the pair u11 = <http://dbpedia.org/resource/Tom_Cruise> and u21 = <http://www.freebase.com/m/07r1h>, where U1 is the DBPedia LOD data set and U2 is the Freebase LOD data set.
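Once such a pair is confirmed as a match, it is materialized as a cross-data set <owl:sameAs> statement. The snippet below shows one way to emit that triple for the Tom Cruise example using the rdflib library (assumed to be available; the serialization format is an arbitrary choice).

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
dbpedia_uri = URIRef("http://dbpedia.org/resource/Tom_Cruise")
freebase_uri = URIRef("http://www.freebase.com/m/07r1h")

# Materialize the confirmed match as a cross-data set owl:sameAs link.
g.add((dbpedia_uri, OWL.sameAs, freebase_uri))

print(g.serialize(format="nt"))
```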
Given a document d and a LOD data set U1 = {u11, ..., u1n}, we define entity linking as the task of identifying all entities in U1 from d and of associating the corresponding identifier u1i to each entity.
These two tasks are highly related: Instance matching aims at creating connections between different LOD data sets that describe the same real-world entity using different vocabularies. Such connections can then be used to run linking on textual documents. Indeed, ZenCrowd uses existing <owl:sameAs> statements as probabilistic priors to take a final decision about which links to select for an entity appearing in a textual document.

Hence, we use in the following the term entity to refer to a real-world object mentioned in a textual document (e.g., a news article), while we use the term instance to refer to its structured description (e.g., a set of RDF triples), which follows the well-defined schema of a LOD data set.

Our system relies on LOD data sets for both tasks. Such linked data sets describe interconnected entities that are commonly mentioned in Web content. As compared to traditional data integration tasks, the use of LOD data may support integration algorithms by means of its structured entity descriptions and entity interlinking within and across data sets (Fig. 1).

In our work, we make use of Human Intelligence at scale to, first, improve the quality of such links across data sets and, second, to connect unstructured documents to the structured representation of the entities they mention. To improve the result for both tasks, we selectively use paid micro-task crowdsourcing. To do this, we create HITs on a crowdsourcing platform. For the entity linking task, a HIT consists of asking which of the candidate links is correct for an entity extracted from a document. For the instance matching task, a HIT consists in finding which instance from a target data set corresponds to a given instance from a source data set. See Figs. 2, 3, and 4, which give examples of such tasks.

Paid crowdsourcing presents enormous advantages for high-quality data processing. The disadvantages, however, potentially include the following: high financial cost, low availability of workers, and poor workers' skills or honesty. To overcome those shortcomings, we alleviate the financial cost using an efficient decision engine that selectively picks tasks that have a high improvement potential. Our present assumption is that entities extracted from HTML news articles could be recognized by the large public, especially when provided with sufficient contextual information. Furthermore, each task is shown to multiple workers to balance out low-quality answers.
4 Architecture
ZenCrowd is a hybrid platform that takes advantage of both algorithmic and manual data integration techniques simultaneously. Figure 1 presents a simplified architecture of our system. We start by giving an overview of our system below in Sect. 4.1 and then describe in more detail some of its components in Sects. 4.2–4.4.
4.1 System overview
In the following, we describe the different components of the ZenCrowd system, focusing first on the instance matching and then on the entity linking pipeline.
4.1.1 Instance matching pipeline
In order to create new links, ZenCrowd takes as input a pair of data sets from the LOD cloud. Among the two data sets, one is selected as the source data set and one as the target data set. Then, for each instance of the source data set, our system tries to come up with candidate matches from the target data set.
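The first, cheap filtering step described in the next paragraph queries an index of the target data set by label. The toy index below illustrates the principle with a simple in-memory token index; a real deployment would use a dedicated inverted-index engine, and the tokenization and scoring here are deliberately simplistic stand-ins.

```python
from collections import defaultdict

# Toy label index over the target data set: token -> URIs whose label contains it.
target_labels = {
    "http://dbpedia.org/resource/Tom_Cruise": "Tom Cruise",
    "http://dbpedia.org/resource/Cruise_ship": "Cruise ship",
    "http://dbpedia.org/resource/Tom_Hanks": "Tom Hanks",
}

index = defaultdict(set)
for uri, label in target_labels.items():
    for token in label.lower().split():
        index[token].add(uri)

def candidates(source_label, top_k=5):
    """Rank target URIs by how many query tokens their labels share."""
    scores = defaultdict(int)
    for token in source_label.lower().split():
        for uri in index.get(token, ()):
            scores[uri] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

print(candidates("Tom Cruise"))  # the exact-label match ranks first
```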
First, the label used to name the source instance is used to query the LOD Index (see Sect. 4.2) in order to obtain a ranked list of candidate matches from the target data set. This can efficiently, and cheaply, filter out numerous clear non-matches out of potentially numerous (in the order of hundreds of millions for some LOD data sets) instances available. Next, top-ranked candidate instances are further examined in the graph database. This step is taken to obtain more complete information about the target instances, both to compute a more accurate matching score and to provide information to the Micro-Task Manager (see Fig. 1), which has to fill the

Citations
Proceedings ArticleDOI
Divesh Srivastava1
19 Dec 2013
TL;DR: This seminar explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.
Abstract: The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of Big Data. BDI differs from traditional data integration in many dimensions: (i) the number of data sources, even for a single domain, has grown to be in the tens of thousands, (ii) many of the data sources are very dynamic, as a huge amount of newly collected data are continuously made available, (iii) the data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities, and (iv) the data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of data provided. This talk explores the progress that has been made by the data integration community in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.

213 citations

Journal ArticleDOI
TL;DR: In this paper, a survey of quality in the context of crowdsourcing along several dimensions is presented to define and characterize it and to understand the current state-of-the-art.
Abstract: Crowdsourcing enables one to leverage on the intelligence and wisdom of potentially large groups of individuals toward solving problems. Common problems approached with crowdsourcing are labeling images, translating or transcribing text, providing opinions or ideas, and similar—all tasks that computers are not good at or where they may even fail altogether. The introduction of humans into computations and/or everyday work, however, also poses critical, novel challenges in terms of quality control, as the crowd is typically composed of people with unknown and very diverse abilities, skills, interests, personal objectives, and technological resources. This survey studies quality in the context of crowdsourcing along several dimensions, so as to define and characterize it and to understand the current state of the art. Specifically, this survey derives a quality model for crowdsourcing tasks, identifies the methods and techniques that can be used to assess the attributes of the model, and the actions and strategies that help prevent and mitigate quality problems. An analysis of how these features are supported by the state of the art further identifies open issues and informs an outlook on hot future research directions.

204 citations

Book
01 Aug 2015
TL;DR: This tutorial provides an overview of the key research results that are relevant to addressing the new challenges in entity resolution posed by the Web of data, in which real world entities are described by interlinked data rather than documents.
Abstract: This tutorial provides an overview of the key research results in the area of entity resolution that are relevant to addressing the new challenges in entity resolution posed by the Web of data, in which real world entities are described by interlinked data rather than documents. Since such descriptions are usually partial, overlapping and sometimes evolving, entity resolution emerges as a central problem both to increase dataset linking but also to search the Web of data for entities and their relations.

160 citations


Cites background from "Large-scale linked data integration..."

  • ...blocking) – When algorithms fail to reach a match decision, ask humans [Demartini et al. 2013] – Humans are used to verify only the most likely matches [Wang et al....


  • ...…history of descriptions (see [Dong & Tan 2015]) Uncertain ER: – consider confidence scores when resolving certain & uncertain entity descriptions (see [Gal 2014] [Demartini et al. 2013]) Privacy-aware ER: – Trade-off between entity obfuscation techniques and ER results quality (see [Whang &…...


Journal ArticleDOI
TL;DR: In this article, a tutorial explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.
Abstract: The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of Big Data.BDI differs from traditional data integration in many dimensions: (i) the number of data sources, even for a single domain, has grown to be in the tens of thousands, (ii) many of the data sources are very dynamic, as a huge amount of newly collected data are continuously made available, (iii) the data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities, and (iv) the data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of data provided. This tutorial explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.

153 citations

Book
01 Feb 2015
TL;DR: In this paper, a tutorial explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.
Abstract: The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of Big Data.BDI differs from traditional data integration in many dimensions: (i) the number of data sources, even for a single domain, has grown to be in the tens of thousands, (ii) many of the data sources are very dynamic, as a huge amount of newly collected data are continuously made available, (iii) the data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities, and (iv) the data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of data provided. This tutorial explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges faced by big data integration, and identifies a range of open problems for the community.

133 citations

References
Journal ArticleDOI
TL;DR: A generic message-passing algorithm, the sum-product algorithm, that operates in a factor graph, that computes-either exactly or approximately-various marginal functions derived from the global function.
Abstract: Algorithms that must deal with complicated global functions of many variables often exploit the manner in which the given functions factor as a product of "local" functions, each of which depends on a subset of the variables. Such a factorization can be visualized with a bipartite graph that we call a factor graph, In this tutorial paper, we present a generic message-passing algorithm, the sum-product algorithm, that operates in a factor graph. Following a single, simple computational rule, the sum-product algorithm computes-either exactly or approximately-various marginal functions derived from the global function. A wide variety of algorithms developed in artificial intelligence, signal processing, and digital communications can be derived as specific instances of the sum-product algorithm, including the forward/backward algorithm, the Viterbi algorithm, the iterative "turbo" decoding algorithm, Pearl's (1988) belief propagation algorithm for Bayesian networks, the Kalman filter, and certain fast Fourier transform (FFT) algorithms.

6,637 citations


"Large-scale linked data integration..." refers methods in this paper

  • ...The sum-product algorithm [32] exploits this observation to compute all marginal functions of a factor graph in a concurrent and efficient manner....


Proceedings ArticleDOI
07 Jul 2003
TL;DR: It is demonstrated that an unlexicalized PCFG can parse much more accurately than previously shown, by making use of simple, linguistically motivated state splits, which break down false independence assumptions latent in a vanilla treebank grammar.
Abstract: We demonstrate that an unlexicalized PCFG can parse much more accurately than previously shown, by making use of simple, linguistically motivated state splits, which break down false independence assumptions latent in a vanilla treebank grammar. Indeed, its performance of 86.36% (LP/LR F1) is better than that of early lexicalized PCFG models, and surprisingly close to the current state-of-the-art. This result has potential uses beyond establishing a strong lower bound on the maximum possible accuracy of unlexicalized models: an unlexicalized PCFG is much more compact, easier to replicate, and easier to interpret than more complex lexical models, and the parsing algorithms are simpler, more widely understood, of lower asymptotic complexity, and easier to optimize.

3,291 citations


"Large-scale linked data integration..." refers methods in this paper

  • ...extracted from it using the Stanford Parser [30] as entity extractor....


  • ...As described above, we use a state-of-the-art extractor (the Stanford Parser) for this task....


  • ...State-of-the-art techniques are implemented in tools like Gate [16], the Stanford Parser [30] (which we use in our experiments), and Extractiv....


  • ...State-of-the-art techniques are implemented in tools like Gate [16], the Stanford Parser [30] (which we use in our experiments), and Extractiv.7 Once entities are extracted, they still need to be disambiguated and matched to semantically similar but syntactically different occurrences of the same real-world object (e.g., “Mr. Obama” and “President of the USA”)....


  • ...The test collection we created is available for download at: http://exascale.info/zencrowd/. extracted from it using the Stanford Parser [30] as entity extractor....


Frequently Asked Questions (11)
Q1. What is the other task that ZenCrowd performs?

The other task that ZenCrowd performs is entity linking, that is, identifying occurrences of LOD entities in textual content and creating links from the text to corresponding instances stored in a database. 

The authors observe that the most challenging type of instances to match in their experiment is organizations, while people can be matched with high precision using automatic methods only. 

One of the most successful examples of crowdsourcing is the creation of Wikipedia, an online encyclopedia collaboratively written by a large number of Web users. 

For the entity linking task, ZenCrowd improves the precision of the results by 4–35 % over a state-of-the-art, manually optimized crowdsourcing approach, and on average by 14 % over their best automatic approach. 

In conclusion, ZenCrowd provides a reliable approach to entity linking and instance matching, which exploits the trade-off between large-scale automatic instance matching and high-quality human annotation and which, according to their results, improves the precision of the instance matching results by up to 14 % over their best automatic matching approach. 

Alternative platforms could be used for domain-specific data integration tasks like, for example, linking entities described in scientific articles. 

As time passes, decisions are reached on the correctness of the various matches, and the probabilistic network iteratively accumulates posterior probabilities on the reliability of the workers. 

An example of a successful game that at the same time generates meaningful data is the ESP game [46] where two human players have to agree on the words used to tag a picture. 

Within the database literature, this task is related to record linkage [11], duplicate detection [5], or entity identification [34] when performed over two relational databases. 

Given a document d and a LOD data set U1 = {u11, ..., u1n}, the authors define entity linking as the task of identifying all entities in U1 from d and of associating the corresponding identifier u1i to each entity. 

The test collection the authors created is available for download at: http://exascale.info/zencrowd/. Entities were extracted from each document using the Stanford Parser [30] as entity extractor.