
Collective entity resolution in relational data

TLDR
This article proposes a relational clustering algorithm that uses both attribute and relational information to determine the underlying domain entities, and shows that it improves entity resolution performance over attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively.
Abstract
Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database that contains multiple references to the same entity. This can lead not only to data redundancy but also to inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may co-occur. In these cases, collective entity resolution, in which entities for co-occurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
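
The collective step described above combines attribute evidence with relational evidence from co-occurring references. A minimal sketch of such a combined similarity, assuming a generic string similarity for the attribute part and Jaccard overlap of neighboring-cluster labels for the relational part (the function names, the Jaccard choice, and the weighting scheme are illustrative assumptions, not the paper's exact formulation):

from difflib import SequenceMatcher

def attribute_similarity(name_a: str, name_b: str) -> float:
    # Stand-in for any string similarity over reference attributes.
    return SequenceMatcher(None, name_a, name_b).ratio()

def relational_similarity(neighbors_a: set, neighbors_b: set) -> float:
    # Jaccard overlap of the neighboring-cluster labels of two clusters.
    if not neighbors_a and not neighbors_b:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

def combined_similarity(name_a, name_b, neighbors_a, neighbors_b, alpha=0.5):
    # Weighted combination of attribute and relational evidence, 0 <= alpha <= 1.
    return ((1 - alpha) * attribute_similarity(name_a, name_b)
            + alpha * relational_similarity(neighbors_a, neighbors_b))

For example, combined_similarity("J. Smith", "John Smith", {"c3", "c7"}, {"c3", "c9"}) blends a high name similarity with the fact that the two clusters currently share a neighboring cluster; as clusters merge, the relational term changes, which is what makes the resolution collective.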


ABSTRACT
Title of Dissertation: COLLECTIVE ENTITY RESOLUTION IN RELATIONAL DATA
Indrajit Bhattacharya, Doctor of Philosophy, 2006
Dissertation directed by: Dr. Lise Getoor, Department of Computer Science
Many databases contain imprecise references to real-world entities. For example, a social-network database records names of people. But different people can go by the same name, and there may be different observed names referring to the same person. The goal of entity resolution is to determine the mapping from database references to discovered real-world entities.

Traditional entity resolution approaches consider approximate matches between attributes of individual references, but this does not always work well. In many domains, such as social networks and academic circles, the underlying entities exhibit strong ties to each other, and as a result, their references often co-occur in the data. In this dissertation, I focus on the use of such co-occurrence relationships for jointly resolving entities. I refer to this problem as 'collective entity resolution'. First, I propose a relational clustering algorithm for iteratively discovering entities by clustering references taking into account the clusters of co-occurring references. Next, I propose a probabilistic generative model for collective resolution that finds hidden group structures among the entities and uses the latent groups as evidence for entity resolution. One of my contributions is an efficient unsupervised inference algorithm for this model using Gibbs Sampling techniques that discovers the most likely number of entities. Both of these approaches improve performance over attribute-only baselines in multiple real-world and synthetic datasets. I also perform a theoretical analysis of how the structural properties of the data affect collective entity resolution and verify the predicted trends experimentally. In addition, I motivate the problem of query-time entity resolution. I propose an adaptive algorithm that uses collective resolution for answering queries by recursively exploring and resolving related references. This enables resolution at query time, while preserving the performance benefits of collective resolution. Finally, as an application of entity resolution in the domain of natural language processing, I study the sense disambiguation problem and propose models for collective sense disambiguation using multiple languages that outperform other unsupervised approaches.

COLLECTIVE ENTITY RESOLUTION IN RELATIONAL DATA
by
Indrajit Bhattacharya
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2006
Advisory Committee:
Dr. Lise Getoor, Chair/Advisor
Dr. Carol Espy-Wilson, Dean’s Representative
Dr. Amol Deshpande
Dr. Philip Resnik
Dr. Marie desJardins

© Copyright by
Indrajit Bhattacharya
2006

Dedication
To my parents.

Citations
Journal ArticleDOI

Power-Law Distributions in Empirical Data

TL;DR: This work proposes a principled statistical framework for discerning and quantifying power-law behavior in empirical data by combining maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov-Smirnov (KS) statistic and likelihood ratios.
Journal ArticleDOI

A Review of Relational Machine Learning for Knowledge Graphs

TL;DR: This paper provides a review of how statistical models can be “trained” on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph) and how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web.
Journal ArticleDOI

A Survey of Statistical Network Models

TL;DR: In this paper, the authors provide an overview of the historical development of statistical network modeling, introduce a number of examples that have been studied in the network literature, and focus their subsequent discussion on some prominent static and dynamic network models and their interconnections.
Journal ArticleDOI

A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

TL;DR: A survey of 12 variations of 6 indexing techniques for record linkage and deduplication aimed at reducing the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality is presented.
Journal ArticleDOI

A Survey of Heterogeneous Information Network Analysis

TL;DR: A survey of heterogeneous information network analysis can be found in this article, where the authors introduce basic concepts of HIN analysis, examine its developments on different data mining tasks, discuss some advanced topics, and point out some future research directions.
References
Journal ArticleDOI

A guided tour to approximate string matching

TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Journal ArticleDOI

Friends and neighbors on the Web

TL;DR: In this paper, the authors show that some factors are better indicators of social connections than others, and that these indicators vary between user populations, and provide potential applications in automatically inferring real world connections and discovering, labeling, and characterizing communities.
Journal ArticleDOI

A Theory for Record Linkage

TL;DR: A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events.
Proceedings ArticleDOI

The link prediction problem for social networks

TL;DR: Experiments on large co-authorship networks suggest that information about future interactions can be extracted from network topology alone, and that fairly subtle measures for detecting node proximity can outperform more direct measures.
Proceedings Article

A comparison of string distance metrics for name-matching tasks

TL;DR: Using an open-source, Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.
Frequently Asked Questions (12)
Q1. What are the contributions in this paper?

Traditional entity resolution approaches consider approximate matches between attributes of individual references, but this does not always work well. 

Interesting directions of future research include exploring stronger coupling between the extraction and resolution phases of query processing and investigating localized resolution for offline data cleaning as well. In this chapter, I look at a potential application of entity resolution in the domain of natural language processing and consider the related problem of word sense disambiguation.

The third dataset, describing biology publications, is the Elsevier BioBase dataset, which was used in a recent IBM KDD-Challenge competition.

Jaro-Winkler is reported to be the best secondary similarity measure for Soft TF-IDF, but for completeness, the authors also experiment with the Jaro and the Scaled Levenstein measures.

Apart from the scaling issue, most pairs checked by an O(n²) approach will be rejected, since usually only about 1% of all pairs are true matches.
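
A common way to avoid examining all O(n²) pairs is blocking: group references by a cheap key and compare only pairs that share a block. A minimal sketch, where the name-based key is purely an illustrative assumption:

from collections import defaultdict
from itertools import combinations

def candidate_pairs_by_blocking(references, block_key):
    # references: dict ref_id -> attribute string (e.g. an author name)
    # block_key: cheap function mapping an attribute string to a blocking key
    # Returns the set of reference pairs sharing a block, typically far smaller
    # than the full O(n^2) set of pairs.
    blocks = defaultdict(list)
    for ref_id, name in references.items():
        blocks[block_key(name)].append(ref_id)
    pairs = set()
    for members in blocks.values():
        pairs.update(frozenset(p) for p in combinations(members, 2))
    return pairs

def name_key(name: str) -> str:
    # Illustrative key: lowercased last token plus first initial, "J. Smith" -> "smith_j".
    tokens = name.replace(".", " ").split()
    return f"{tokens[-1].lower()}_{tokens[0][0].lower()}" if tokens else ""

The trade-off is the usual one for blocking: a coarser key keeps more true matches in the candidate set at the cost of more comparisons.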

The naive relational approaches (NR and NR*) degrade in performance with higher neighborhood sizes, again highlighting the importance of resolving related references. 

If the merge tree is perfectly balanced, then the size of each cluster is doubled by each merge operation and as few as O(log q) merges are required. 

The similarity measure for cluster pairs accounts for relationships between different references, and as a result, each merge operation affects similarities for related cluster pairs.
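
The merge-and-update behaviour described in this answer can be sketched as a greedy agglomerative loop; the outline below is illustrative only (dictionary-based clusters, a caller-supplied combined similarity, and full rescoring on every pass), not the dissertation's actual, more efficient implementation:

def collective_cluster(clusters, candidate_pairs, similarity, threshold=0.7):
    # clusters: dict cluster_id -> set of reference ids
    # candidate_pairs: set of frozenset({id_a, id_b}) worth comparing
    # similarity(clusters, a, b): combined attribute + relational score computed
    #   against the *current* clustering, so it changes as merges happen
    while candidate_pairs:
        scored = [(similarity(clusters, a, b), a, b)
                  for a, b in (tuple(sorted(p)) for p in candidate_pairs)]
        best_score, a, b = max(scored)
        if best_score < threshold:
            break
        clusters[a] |= clusters.pop(b)  # merge cluster b into cluster a
        # Redirect pairs that mentioned b to a and drop the resulting self-pair;
        # affected pair similarities are recomputed on the next pass because the
        # relational neighborhoods around a and b have changed.
        candidate_pairs = {frozenset(a if x == b else x for x in p)
                           for p in candidate_pairs}
        candidate_pairs = {p for p in candidate_pairs if len(p) == 2}
    return clusters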

As for entity resolution, the authors explore the problem of collective sense disambiguation, where senses are resolved for multiple languages simultaneously.

As more relationships are added between entities, relationship patterns between entities become less informative and may actually hurt performance.

As a result, the relational component of the similarity between clusters would be zero and merges would occur based on attribute similarity alone.

The authors calculate the second-order neighborhood Nbr²(c) for a cluster c by recursively taking the set union (alternatively, the multiset union) of the neighborhoods of all neighboring clusters: Nbr²(c) = ⋃_{c′ ∈ Nbr(c)} Nbr(c′).
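
The union above translates directly into code. A small sketch, assuming the neighborhood map nbr is a dict from cluster ids to sets of neighboring cluster ids; the multiset variant counts clusters reachable along several paths with multiplicity:

from collections import Counter

def second_order_neighborhood(nbr, c, multiset=False):
    # Nbr^2(c) = union of Nbr(c') over all c' in Nbr(c).
    if multiset:
        result = Counter()
        for c_prime in nbr.get(c, set()):
            result.update(nbr.get(c_prime, set()))  # multiset union
        return result
    result = set()
    for c_prime in nbr.get(c, set()):
        result |= nbr.get(c_prime, set())  # set union
    return result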