
Collective entity resolution in relational data

TLDR
This article proposes a relational clustering algorithm that uses both attribute and relational information to determine the underlying domain entities, and shows that it improves entity resolution performance over attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively.
Abstract
Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database that contains multiple references to the same entity. This can lead not only to data redundancy but also to inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may co-occur. In these cases, collective entity resolution, in which entities for co-occurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
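
The collective step described above combines attribute evidence with relational evidence from co-occurring references. A minimal sketch of such a combined similarity, assuming a generic string similarity for the attribute part and Jaccard overlap of neighboring-cluster labels for the relational part (the function names, the Jaccard choice, and the weighting scheme are illustrative assumptions, not the paper's exact formulation):

from difflib import SequenceMatcher

def attribute_similarity(name_a: str, name_b: str) -> float:
    # Stand-in for any string similarity over reference attributes.
    return SequenceMatcher(None, name_a, name_b).ratio()

def relational_similarity(neighbors_a: set, neighbors_b: set) -> float:
    # Jaccard overlap of the neighboring-cluster labels of two clusters.
    if not neighbors_a and not neighbors_b:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

def combined_similarity(name_a, name_b, neighbors_a, neighbors_b, alpha=0.5):
    # Weighted combination of attribute and relational evidence, 0 <= alpha <= 1.
    return ((1 - alpha) * attribute_similarity(name_a, name_b)
            + alpha * relational_similarity(neighbors_a, neighbors_b))

For example, combined_similarity("J. Smith", "John Smith", {"c3", "c7"}, {"c3", "c9"}) blends a high name similarity with the fact that the two clusters currently share a neighboring cluster; as clusters merge, the relational term changes, which is what makes the resolution collective.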


ABSTRACT
Title of Dissertation: COLLECTIVE ENTITY RESOLUTION IN RELATIONAL DATA
Indrajit Bhattacharya, Doctor of Philosophy, 2006
Dissertation directed by: Dr. Lise Getoor, Department of Computer Science
Many databases contain imprecise references to real-world entities. For example, a social-network database records names of people. But different people can go by the same name, and there may be different observed names referring to the same person. The goal of entity resolution is to determine the mapping from database references to discovered real-world entities.

Traditional entity resolution approaches consider approximate matches between attributes of individual references, but this does not always work well. In many domains, such as social networks and academic circles, the underlying entities exhibit strong ties to each other, and as a result, their references often co-occur in the data. In this dissertation, I focus on the use of such co-occurrence relationships for jointly resolving entities. I refer to this problem as 'collective entity resolution'. First, I propose a relational clustering algorithm for iteratively discovering entities by clustering references taking into account the clusters of co-occurring references. Next, I propose a probabilistic generative model for collective resolution that finds hidden group structures among the entities and uses the latent groups as evidence for entity resolution. One of my contributions is an efficient unsupervised inference algorithm for this model using Gibbs Sampling techniques that discovers the most likely number of entities. Both of these approaches improve performance over attribute-only baselines in multiple real-world and synthetic datasets. I also perform a theoretical analysis of how the structural properties of the data affect collective entity resolution and verify the predicted trends experimentally. In addition, I motivate the problem of query-time entity resolution. I propose an adaptive algorithm that uses collective resolution for answering queries by recursively exploring and resolving related references. This enables resolution at query time, while preserving the performance benefits of collective resolution. Finally, as an application of entity resolution in the domain of natural language processing, I study the sense disambiguation problem and propose models for collective sense disambiguation using multiple languages that outperform other unsupervised approaches.

COLLECTIVE ENTITY RESOLUTION IN RELATIONAL DATA
by
Indrajit Bhattacharya
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2006
Advisory Committee:
Dr. Lise Getoor, Chair/Advisor
Dr. Carol Espy-Wilson, Dean’s Representative
Dr. Amol Deshpande
Dr. Philip Resnik
Dr. Marie desJardins

© Copyright by
Indrajit Bhattacharya
2006

Dedication
To my parents.

Citations
Journal ArticleDOI

Power-Law Distributions in Empirical Data

TL;DR: This work proposes a principled statistical framework for discerning and quantifying power-law behavior in empirical data by combining maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov-Smirnov (KS) statistic and likelihood ratios.
Journal ArticleDOI

A Review of Relational Machine Learning for Knowledge Graphs

TL;DR: This paper provides a review of how statistical models can be “trained” on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph) and how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web.
Journal ArticleDOI

A Survey of Statistical Network Models

TL;DR: In this paper, the authors provide an overview of the historical development of statistical network modeling, introduce a number of examples that have been studied in the network literature, and focus their subsequent discussion on some prominent static and dynamic network models and their interconnections.
Journal ArticleDOI

A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

TL;DR: A survey of 12 variations of 6 indexing techniques for record linkage and deduplication aimed at reducing the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality is presented.
Journal ArticleDOI

A Survey of Heterogeneous Information Network Analysis

TL;DR: A survey of heterogeneous information network analysis can be found in this article, where the authors introduce basic concepts of HIN analysis, examine its developments on different data mining tasks, discuss some advanced topics, and point out some future research directions.
References
Journal ArticleDOI

A guided tour to approximate string matching

TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Journal ArticleDOI

Friends and neighbors on the Web

TL;DR: In this paper, the authors show that some factors are better indicators of social connections than others, and that these indicators vary between user populations, and provide potential applications in automatically inferring real world connections and discovering, labeling, and characterizing communities.
Journal ArticleDOI

A Theory for Record Linkage

TL;DR: A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events.
Proceedings ArticleDOI

The link prediction problem for social networks

TL;DR: Experiments on large co-authorship networks suggest that information about future interactions can be extracted from network topology alone, and that fairly subtle measures for detecting node proximity can outperform more direct measures.
Proceedings Article

A comparison of string distance metrics for name-matching tasks

TL;DR: Using an open-source, Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.
Frequently Asked Questions (12)
Q1. What are the contributions in this paper?

Traditional entity resolution approaches consider approximate matches between attributes of individual references, but this does not always work well. 

Interesting directions of future research include exploring stronger coupling between the extraction and resolution phases of query processing and investigating localized resolution for offline data cleaning as well. In this chapter, I look at a potential application of entity resolution in the domain of natural language processing and consider the related problem of word sense disambiguation.

The third dataset, describing biology publications, is the Elsevier BioBase dataset, which was used in a recent IBM KDD-Challenge competition.

Jaro-Winkler is reported to be the best secondary similarity measure for Soft TF-IDF, but for completeness, the authors also experiment with the Jaro and the Scaled Levenstein measures.

Apart from the scaling issue, most pairs checked by an O(n²) approach will be rejected, since usually only about 1% of all pairs are true matches.
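
A common way to avoid examining all O(n²) pairs is blocking: group references by a cheap key and compare only pairs that share a block. A minimal sketch, where the name-based key is purely an illustrative assumption:

from collections import defaultdict
from itertools import combinations

def candidate_pairs_by_blocking(references, block_key):
    # references: dict ref_id -> attribute string (e.g. an author name)
    # block_key: cheap function mapping an attribute string to a blocking key
    # Returns the set of reference pairs sharing a block, typically far smaller
    # than the full O(n^2) set of pairs.
    blocks = defaultdict(list)
    for ref_id, name in references.items():
        blocks[block_key(name)].append(ref_id)
    pairs = set()
    for members in blocks.values():
        pairs.update(frozenset(p) for p in combinations(members, 2))
    return pairs

def name_key(name: str) -> str:
    # Illustrative key: lowercased last token plus first initial, "J. Smith" -> "smith_j".
    tokens = name.replace(".", " ").split()
    return f"{tokens[-1].lower()}_{tokens[0][0].lower()}" if tokens else ""

The trade-off is the usual one for blocking: a coarser key keeps more true matches in the candidate set at the cost of more comparisons.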

The naive relational approaches (NR and NR*) degrade in performance with higher neighborhood sizes, again highlighting the importance of resolving related references. 

If the merge tree is perfectly balanced, then the size of each cluster is doubled by each merge operation and as few as O(log q) merges are required. 

The similarity measure for cluster pairs accounts for relationships between different references, and as a result, each merge operation affects similarities for related cluster pairs.
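
The merge-and-update behaviour described in this answer can be sketched as a greedy agglomerative loop; the outline below is illustrative only (dictionary-based clusters, a caller-supplied combined similarity, and full rescoring on every pass), not the dissertation's actual, more efficient implementation:

def collective_cluster(clusters, candidate_pairs, similarity, threshold=0.7):
    # clusters: dict cluster_id -> set of reference ids
    # candidate_pairs: set of frozenset({id_a, id_b}) worth comparing
    # similarity(clusters, a, b): combined attribute + relational score computed
    #   against the *current* clustering, so it changes as merges happen
    while candidate_pairs:
        scored = [(similarity(clusters, a, b), a, b)
                  for a, b in (tuple(sorted(p)) for p in candidate_pairs)]
        best_score, a, b = max(scored)
        if best_score < threshold:
            break
        clusters[a] |= clusters.pop(b)  # merge cluster b into cluster a
        # Redirect pairs that mentioned b to a and drop the resulting self-pair;
        # affected pair similarities are recomputed on the next pass because the
        # relational neighborhoods around a and b have changed.
        candidate_pairs = {frozenset(a if x == b else x for x in p)
                           for p in candidate_pairs}
        candidate_pairs = {p for p in candidate_pairs if len(p) == 2}
    return clusters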

As for entity resolution, the authors explore the problem of collective sense disambiguation, where senses are resolved for multiple languages simultaneously.

As more relationships are added between entities, relationship patterns between entities become less informative and may actually hurt performance.

As a result, the relational component of the similarity between clusters would be zero and merges would occur based on attribute similarity alone.

The authors calculate the second-order neighborhood Nbr²(c) for a cluster c by recursively taking the set union (alternatively, the multiset union) of the neighborhoods of all neighboring clusters: Nbr²(c) = ⋃_{c′ ∈ Nbr(c)} Nbr(c′).
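
The union above translates directly into code. A small sketch, assuming the neighborhood map nbr is a dict from cluster ids to sets of neighboring cluster ids; the multiset variant counts clusters reachable along several paths with multiplicity:

from collections import Counter

def second_order_neighborhood(nbr, c, multiset=False):
    # Nbr^2(c) = union of Nbr(c') over all c' in Nbr(c).
    if multiset:
        result = Counter()
        for c_prime in nbr.get(c, set()):
            result.update(nbr.get(c_prime, set()))  # multiset union
        return result
    result = set()
    for c_prime in nbr.get(c, set()):
        result |= nbr.get(c_prime, set())  # set union
    return result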