Large-scale linked data integration using probabilistic reasoning and crowdsourcing
Summary
1 Introduction
- Semistructured data are becoming more prominent on the Web as more and more data are either interwoven with or serialized in HTML pages.
- This paper describes ZenCrowd, a system the authors have developed in order to create links across large data sets containing similar instances and to semiautomatically identify LOD entities from textual content.
- In the present work, the authors extend ZenCrowd to handle both instance matching and entity linking.
- The first task addressed by this paper is that of matching instances of multiple types among two data sets.
- It is possible to categorize different crowdsourcing strategies based on the different types of incentives used to motivate the crowd to perform such tasks.
3 Preliminaries
- As already mentioned, ZenCrowd addresses two distinct data integration tasks related to the general problem of entity resolution [24].
- Given a document d and a LOD data set U1 = {u11, ..., u1n}, the authors define entity linking as the task of identifying all entities in U1 from d and of associating the corresponding identifier u1i to each entity.
- These two tasks are highly related: Instance matching aims at creating connections between different LOD data sets that describe the same real-world entity using different vocabularies.
- To improve the result for both tasks, the authors selectively use paid micro-task crowdsourcing.
- The potential disadvantages, however, include high financial cost, low worker availability, and workers with poor skills or questionable honesty.
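The two tasks can be sketched as follows; a toy token-overlap matcher stands in for the authors' actual algorithms, and all identifiers and thresholds are illustrative, not taken from the paper:

```python
# Toy sketch of ZenCrowd's two tasks. Jaccard similarity is a stand-in for the
# paper's matchers; the URIs, labels, and threshold below are made up.

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two labels."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def instance_matching(source, target, threshold=0.5):
    """Propose owl:sameAs links between instances of two data sets."""
    return [(u1, "owl:sameAs", u2)
            for u1, l1 in source.items()
            for u2, l2 in target.items()
            if jaccard(l1, l2) >= threshold]

def entity_linking(document, lod, threshold=0.5):
    """Map entity mentions found in a document to LOD identifiers."""
    mentions = document.split(", ")  # stand-in for a real entity extractor
    return {m: u for m in mentions
            for u, label in lod.items() if jaccard(m, label) >= threshold}

links = instance_matching({"dbp:Barack_Obama": "Barack Obama"},
                          {"fb:m.02mjmr": "Barack Obama"})
linked = entity_linking("Barack Obama, Geneva",
                        {"dbp:Barack_Obama": "Barack Obama",
                         "dbp:Geneva": "Geneva"})
```

The first call emits one owl:sameAs link between the two instances; the second maps each mention to its LOD URI.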
4 Architecture
- ZenCrowd is a hybrid platform that takes advantage of both algorithmic and manual data integration techniques simultaneously.
- The authors start by giving an overview of their system in Sect. 4.1 and then describe some of its components in more detail in Sects. 4.2–4.4.
4.1 System overview
- In the following, the authors describe the different components of the ZenCrowd system focusing first on the instance matching and then on the entity linking pipeline.
4.1.1 Instance matching pipeline
- In order to create new links, ZenCrowd takes as input a pair of data sets from the LOD cloud.
- Then, for each instance of the source data set, their system tries to come up with candidate matches from the target data set.
- The architecture of ZenCrowd: For the instance matching task (green pipeline), the system takes as input a pair of data sets to be interlinked and creates new links between the data sets using <owl:sameAs> RDF triples.
- At this point, the candidate matches that have a low confidence score are sent to the crowd for further analysis.
- The Decision Engine collects confidence scores from the previous steps in order to decide what to crowdsource, together with data from the graph database to construct the HITs.
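The Decision Engine's routing step can be sketched like this; the threshold tau and the example URIs are hypothetical:

```python
# Sketch of the Decision Engine's routing decision: confident matches become
# links directly, low-confidence ones are sent to the crowd as HITs.

def route_candidates(candidates, tau=0.8):
    """Split candidate matches by confidence score.

    candidates: iterable of (source_uri, target_uri, confidence) triples.
    Returns (accepted, crowdsourced) lists of (source_uri, target_uri) pairs.
    """
    accepted, crowdsourced = [], []
    for src, tgt, conf in candidates:
        (accepted if conf >= tau else crowdsourced).append((src, tgt))
    return accepted, crowdsourced

auto, crowd = route_candidates([
    ("dbp:Paris", "geo:2988507", 0.95),  # confident: emit owl:sameAs directly
    ("dbp:Paris", "geo:4717560", 0.40),  # ambiguous (Paris, Texas): ask workers
])
```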
4.1.2 Entity linking pipeline
- The other task that ZenCrowd performs is entity linking, that is, identifying occurrences of LOD entities in textual content and creating links from the text to corresponding instances stored in a database.
- The authors' LOD index engine receives as input a list of SPARQL endpoints or LOD dumps as well as a list of triple patterns, and iteratively retrieves all corresponding triples from the LOD data sets.
- Once extracted, the textual entities are inspected by algorithmic linkers, whose role is to find semantically related entities from the LOD cloud.
- Given a source instance from a data set, ZenCrowd considers all instances of the target data set as possible matches.
- Then, given the available resources, top pairs are crowdsourced by batch to improve the accuracy of the matching process.
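One plausible batching policy is to crowdsource the most uncertain pairs first under a fixed budget of micro-tasks; the sketch below uses distance from 0.5 as the uncertainty criterion, which may differ from the paper's exact ranking:

```python
# Uncertainty-sampling sketch of batch selection under a micro-task budget.
# Ranking by |confidence - 0.5| is an illustrative policy, not the paper's.

def select_batch(candidate_pairs, budget):
    """Pick the pairs whose confidence is closest to 0.5 (most uncertain),
    up to the available budget of micro-tasks.

    candidate_pairs: list of (source_uri, target_uri, confidence).
    """
    ranked = sorted(candidate_pairs, key=lambda p: abs(p[2] - 0.5))
    return ranked[:budget]

batch = select_batch(
    [("s1", "t1", 0.99), ("s1", "t2", 0.55), ("s2", "t3", 0.10)],
    budget=1,
)
# Only the 0.55 pair is selected: the 0.99 pair is already near-certain, and
# crowdsourcing the 0.10 pair would likely just confirm a non-match.
```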
5 Effective instance matching based on confidence estimation and crowdsourcing
- The authors describe the final steps of the blocking process that ensure high-quality instance matching results.
- The authors first define their schema-based matching confidence measure, which is then used to decide which candidate matchings to crowdsource.
- This second interface defines a simpler task for the worker by presenting directly on the HIT page relevant information about the target entity as well as about the candidate matches.
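A crude proxy for a schema-based confidence measure — the fraction of properties shared by two instances whose values agree — can be sketched as follows; the paper's actual measure is more elaborate:

```python
# Illustrative stand-in for a schema-based matching confidence: agreement on
# the schema properties two instances have in common. Property names are made up.

def schema_confidence(source_props, target_props):
    """Fraction of properties present in both instances whose values agree."""
    shared = set(source_props) & set(target_props)
    if not shared:
        return 0.0
    agree = sum(source_props[p] == target_props[p] for p in shared)
    return agree / len(shared)

conf = schema_confidence(
    {"name": "Barack Obama", "birthYear": "1961", "country": "USA"},
    {"name": "Barack Obama", "birthYear": "1961", "party": "Democratic"},
)
# The two shared properties (name, birthYear) agree, so confidence is 1.0;
# such a pair would be accepted automatically rather than crowdsourced.
```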
6 Probabilistic models
- ZenCrowd exploits probabilistic models to make sensible decisions about candidate results.
- The authors use factor graphs to graphically represent probabilistic variables and distributions in the following.
- The authors give below a brief introduction to factor graphs and message-passing techniques.
- Each candidate can also be examined by human workers wi, who perform micro-matching tasks and register clicks cij to express whether a given candidate matching corresponds to the source instance from their perspective.
- Clicks, workers, and matchings are further connected through two factors described below.
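A minimal instance of this model — one binary matching variable, one observed worker click, and made-up prior and reliability numbers — shows how the factors combine into a posterior; on larger graphs, sum-product message passing computes the same marginals without direct enumeration:

```python
# Tiny factor graph over one binary matching variable m with one observed
# worker click. The prior and the worker reliability below are invented.

prior_m = {True: 0.6, False: 0.4}  # algorithmic matcher's prior on the match
reliability = 0.8                  # assumed probability the worker clicks correctly

def click_factor(m, c):
    """Factor linking a worker's click c to the matching variable m:
    a reliable worker's click agrees with the true label most of the time."""
    return reliability if c == m else 1.0 - reliability

observed_click = True  # the worker clicked "Correct"

# Posterior P(m | click) by multiplying factors and normalizing -- the same
# computation sum-product performs via messages on larger graphs.
unnorm = {m: prior_m[m] * click_factor(m, observed_click) for m in (True, False)}
z = sum(unnorm.values())
posterior = {m: p / z for m, p in unnorm.items()}
# One confirming click from a fairly reliable worker lifts the match
# probability from the 0.6 prior to 0.48/0.56 = 6/7 (about 0.857).
```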
6.2.2 Unicity constraints for entity linking
- The instance matching task definition assumes that only one instance from the target data set can be a correct match for the source instance.
- The authors can thus rule out all configurations where more than one candidate from the same LOD data set is considered as Correct.
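This constraint can be expressed as a hard factor over the candidate labels; the boolean encoding below is illustrative:

```python
from itertools import product

def unicity_factor(labels):
    """Hard factor: 1 if at most one candidate is labelled Correct (True),
    0 otherwise, ruling the configuration out entirely."""
    return 1.0 if sum(labels) <= 1 else 0.0

# Of the 2**3 = 8 label configurations for three candidates from the same
# data set, only the all-False one and the three one-hot ones survive.
valid = [cfg for cfg in product([False, True], repeat=3)
         if unicity_factor(cfg) > 0]
```

Because the factor assigns probability zero, inference never has to weigh these configurations, which shrinks the search space for the decision process.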
6.2.3 SameAs constraints for entity linking
- SameAs constraints are exclusively used in entity linking graphs.
- This constraint considerably helps the decision process when strong evidence (good priors, reliable clicks) is available for any of the URIs connected to a SameAs link.
- As time passes, decisions are reached on the correctness of the various matches, and the probabilistic network iteratively accumulates posterior probabilities on the reliability of the workers.
- This corresponds to a learning parameters phase in a probabilistic graphical model when some of the observations are missing.
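An EM-style sketch of that learning step: a worker's reliability is re-estimated as the expected agreement between their clicks and the current posteriors over the matchings. The formula is illustrative, not the paper's exact update:

```python
# EM-flavoured sketch of re-estimating a worker's reliability from the current
# posterior probabilities that each judged matching is Correct.

def update_reliability(clicks, posteriors):
    """Expected fraction of a worker's clicks that agree with the posteriors.

    clicks: list of booleans (True = worker clicked Correct).
    posteriors: current P(matching is Correct) for the same matchings.
    """
    agree = sum(p if click else 1.0 - p
                for click, p in zip(clicks, posteriors))
    return agree / len(clicks)

# A worker who confirms two likely matches and rejects an unlikely one:
r = update_reliability([True, True, False], [0.9, 0.8, 0.3])
# expected agreement = (0.9 + 0.8 + 0.7) / 3 = 0.8
```

As more decisions are reached, these updated reliabilities feed back into the click factors, so unreliable workers gradually lose influence on future decisions.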
7 Experiments on instance matching
- The authors experimentally evaluate the effectiveness of ZenCrowd for the instance matching (IM) task.
- ZenCrowd takes advantage of a probabilistic framework for making decisions and performs even better, leading to a relative performance improvement of up to 14 % over their best automatic matching approach (from 0.78 to 0.89).
- Because workers cannot be selected dynamically for a given task on current crowdsourcing platforms (all the authors can do is blacklist workers from receiving further tasks, or withhold rewards from workers who consistently perform badly), obtaining perfect matching results is in general unrealistic in non-controlled settings.
- According to their ground truth, 383 out of the 488 automatically extracted entities can be correctly linked to URIs in their experiments, while the remaining ones are either wrongly extracted or not available in the LOD cloud the authors consider.
- In Fig. 12, the authors report on the average recall of the top-5 candidates when classifying results based on the maximum confidence score obtained (top-1 score).
9 Conclusions
- As the LOD movement gains momentum, matching instances across data sets and linking traditional Web content to the LOD cloud is getting increasingly important in order to foster automated information processing capabilities.
- As their approach incorporates a human intelligence component, it typically cannot perform instance matching and entity linking tasks in real time.
- For the entity linking task, ZenCrowd improves the precision of the results by 4–35 % over a state-of-the-art, manually optimized crowdsourcing approach, and on average by 14 % over their best automatic approach.
- Moreover, considering documents written in languages other than English could be addressed by exploiting the multilingual property of many LOD data sets.
Frequently Asked Questions (11)
Q2. What is the challenging type of instance to match in their experiment?
The authors observe that the most challenging type of instances to match in their experiment is organizations, while people can be matched with high precision using automatic methods only.
Q3. What is the successful example of crowdsourcing?
One of the most successful examples of crowdsourcing is the creation of Wikipedia, an online encyclopedia collaboratively written by a large number of Web users.
Q4. How does ZenCrowd improve the accuracy of the entity matching task?
For the entity linking task, ZenCrowd improves the precision of the results by 4–35 % over a state of the art and manually optimized crowdsourcing approach, and on average by 14 % over their best automatic approach.
Q5. How does ZenCrowd improve the accuracy of the instance matching task?
In conclusion, ZenCrowd provides a reliable approach to entity linking and instance matching, which exploits the trade-off between large-scale automatic instance matching and high-quality human annotation, and which according to their results improves the precision of the instance matching results up to 14 % over their best automatic matching approach for the instance matching task.
Q6. What kind of data integration tasks could be used on alternative platforms?
Alternative platforms could be used for domain-specific data integration tasks like, for example, linking entities described in scientific articles.
Q7. What is the effect of the probabilistic network on the reliability of the workers?
As time passes, decisions are reached on the correctness of the various matches, and the probabilistic network iteratively accumulates posterior probabilities on the reliability of the workers.
Q8. What is the popular example of a successful game that generates meaningful data?
An example of a successful game that at the same time generates meaningful data is the ESP game [46] where two human players have to agree on the words used to tag a picture.
Q9. What is the main goal of the task of identifying entities in a database?
Within the database literature, this task is related to record linkage [11], duplicate detection [5], or entity identification [34] when performed over two relational databases.
Q10. What is the definition of entity linking?
Given a document d and a LOD data set U1 = {u11, ..., u1n}, the authors define entity linking as the task of identifying all entities in U1 from d and of associating the corresponding identifier u1i to each entity.
Q11. What is the test collection available for download?
The test collection the authors created is available for download at: http://exascale.info/zencrowd/.