Collective entity resolution in relational data
Frequently Asked Questions (12)
Q2. What future work does the paper propose?
Interesting directions for future research include exploring stronger coupling between the extraction and resolution phases of query processing, and investigating localized resolution for offline data cleaning. In this chapter, I look at a potential application of entity resolution in the domain of natural language processing and consider the related problem of word sense disambiguation.
Q3. What is the third dataset used in the IBM KDD Challenge?
The third dataset, describing biology publications, is the Elsevier BioBase dataset, which was used in a recent IBM KDD-Challenge competition.
Q4. What is the secondary similarity measure for Soft TF-IDF?
Jaro-Winkler is reported to be the best secondary similarity measure for Soft TF-IDF, but for completeness, the author also experiments with the Jaro and the Scaled Levenshtein measures.
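To make the secondary measure concrete, here is a minimal sketch of the Jaro and Jaro-Winkler similarities in their standard textbook form. It is not taken from the paper: the function names, the prefix scale of 0.1, and the maximum prefix length of 4 are the conventional defaults and are assumptions here. Soft TF-IDF itself is not implemented; the sketch shows only the token-level measure it would call.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost is what makes Jaro-Winkler suited to names, where agreement on the first few characters is strong evidence of a match.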
Q5. How many pairs are rejected by an O(n^2) approach?
Apart from the scaling issue, most pairs checked by an O(n^2) approach will be rejected since usually only about 1% of all pairs are true matches.
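The arithmetic behind this answer can be sketched as follows. The helper names and the example size of 10,000 references are hypothetical; the 1% match rate is the rough figure quoted above, not a measured value.

```python
def count_all_pairs(n: int) -> int:
    """Number of unordered pairs among n references: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def expected_rejections(n: int, match_rate: float = 0.01) -> int:
    """Pairs an exhaustive comparison would check and then reject,
    assuming a fixed fraction of all pairs are true matches."""
    total = count_all_pairs(n)
    matches = round(total * match_rate)
    return total - matches


n = 10_000  # hypothetical dataset size
print(count_all_pairs(n))      # 49995000
print(expected_rejections(n))  # 49495050
```

Under these assumptions, roughly 49.5 million of the 50 million comparisons yield no match, which is why exhaustive pairwise comparison is wasteful as well as slow.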
Q6. What is the effect of naive relational approaches?
The naive relational approaches (NR and NR*) degrade in performance with higher neighborhood sizes, again highlighting the importance of resolving related references.
Q7. How many merge operations are required to exhaust a queue that has q entries?
If the merge tree is perfectly balanced, then the size of each cluster is doubled by each merge operation and as few as O(log q) merges are required.
Q8. What is the similarity measure for cluster pairs?
The similarity measure for cluster pairs accounts for relationships between different references, and as a result, each merge operation affects similarities for related cluster pairs.
Q9. What is the common way to resolve a sense?
As with entity resolution, the author explores the problem of collective sense disambiguation, where senses are resolved for multiple languages simultaneously.
Q10. What is the effect of adding more relationships between entities?
As more relationships get added between entities, relationship patterns between entities are less informative, and may actually hurt performance.
Q11. What is the relational component of the similarity between clusters?
As a result, the relational component of the similarity between clusters would be zero and merges would occur based on attribute similarity alone.
Q12. How does the author calculate the second-order neighborhood $\mathrm{Nbr}_2(c)$ for a cluster?
The author calculates the second-order neighborhood $\mathrm{Nbr}_2(c)$ for a cluster $c$ by recursively taking the set union (alternatively, multi-set union) of the neighborhoods of all neighboring clusters: $\mathrm{Nbr}_2(c) = \bigcup_{c' \in \mathrm{Nbr}(c)} \mathrm{Nbr}(c')$.
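The definition above can be sketched in a few lines. The adjacency-dictionary representation, the function names, and the toy cluster graph are all assumptions for illustration. The multi-set variant assumes an additive union, so a cluster reachable through several neighbors is counted once per path.

```python
from collections import Counter


def second_order_neighborhood(cluster, nbr):
    """Set-union variant: union of Nbr(c') over every c' in Nbr(cluster)."""
    result = set()
    for neighbor in nbr.get(cluster, ()):
        result |= set(nbr.get(neighbor, ()))
    return result


def second_order_neighborhood_multiset(cluster, nbr):
    """Multi-set variant (additive union, an assumption): counts how many
    neighbors of `cluster` each second-order neighbor is reached through."""
    counts = Counter()
    for neighbor in nbr.get(cluster, ()):
        counts.update(nbr.get(neighbor, ()))
    return counts


# Hypothetical cluster graph: each cluster maps to its first-order neighborhood.
nbr = {
    "c1": {"c2", "c3"},
    "c2": {"c1", "c4"},
    "c3": {"c1", "c4", "c5"},
    "c4": {"c2", "c3"},
    "c5": {"c3"},
}

print(sorted(second_order_neighborhood("c1", nbr)))  # ['c1', 'c4', 'c5']
print(second_order_neighborhood_multiset("c1", nbr)["c4"])  # 2
```

Note that under this definition a cluster can appear in its own second-order neighborhood, as c1 does here, because it is a neighbor of its own neighbors.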