Open Access

A probabilistic model for entity disambiguation using relationships

TLDR
The paper argues that, although standard feature-based data cleaning approaches can be employed, a better solution exists that analyzes not only features but also relationships.
Abstract
Graphs representing relationships among sets of entities are an increasing focus of interest in data analysis applications. These graphs are typically constructed from existing datasets from which entities and relationships are extracted. For some of the entities, the values of certain attributes refer to other entities; such references determine relationships. In certain datasets, such references are often given in the form of (string) descriptions. Each such description alone may not uniquely identify one entity as it is supposed to, but can instead match the descriptions of multiple entities. Such cases are especially common if the datasets are collected not from one but from multiple heterogeneous sources. Thus the correct linking of entities via relationships can be a nontrivial challenge, and linking done incorrectly can in turn impede further graph-based analyses. To overcome this problem, standard feature-based data cleaning approaches can be employed. In this paper we argue that a better solution exists, one which analyzes not only features but also relationships.
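
As a rough illustration of the problem the abstract describes, the sketch below shows a string description that matches two entities on features alone, and how known relationships can break the tie. It is a toy example, not the probabilistic model of the paper: the data, the `coauthors` field, the 0.6 threshold, and the function names are all hypothetical, and a simple overlap count stands in for whatever relationship analysis the paper actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical toy dataset: two distinct authors both match "J. Smith".
entities = {
    "a1": {"name": "John Smith", "coauthors": {"a3"}},
    "a2": {"name": "Jane Smith", "coauthors": {"a4"}},
    "a3": {"name": "Alice Chen", "coauthors": {"a1"}},
    "a4": {"name": "Bob Kumar", "coauthors": {"a2"}},
}


def name_similarity(a, b):
    """Feature-based score: plain string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidates(description, threshold=0.6):
    """Entities whose name is similar enough to the description."""
    return sorted(
        eid
        for eid, entity in entities.items()
        if name_similarity(description, entity["name"]) >= threshold
    )


def resolve(description, context_ids, threshold=0.6):
    """Prefer the candidate with the most relationships to entities
    already linked in the same record (context_ids).
    Returns None when there is no candidate or the tie remains."""
    cands = candidates(description, threshold)
    if not cands:
        return None
    scored = [(len(entities[c]["coauthors"] & context_ids), c) for c in cands]
    best = max(score for score, _ in scored)
    winners = [c for score, c in scored if score == best]
    return winners[0] if len(winners) == 1 else None


# Features alone leave the reference ambiguous.
print(candidates("J. Smith"))
# A known co-author on the same record breaks the tie.
print(resolve("J. Smith", {"a3"}))
```

With no relationship context, `resolve` returns `None` for "J. Smith", which mirrors the case where a feature-based approach cannot decide between the two matching entities.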


Citations
Patent

Method and apparatus for automatic entity disambiguation

TL;DR: The authors use multiple search keys to efficiently find pairs of mentions that correspond to the same entity, performing within-document entity disambiguation (100) and cross-document entity disambiguation (110) while skipping billions of unnecessary comparisons, yielding a system with very high throughput that can be applied to truly massive data.

Unsupervised Name Disambiguation via Social Network Similarity

TL;DR: Unsupervised methods which simultaneously learn the number of entities represented by a particular name and which observations correspond to the same entity are investigated, suggesting methods which measure similarity based on community, rather than exact, similarity provide more robust disambiguation capability.
Patent

System and method for creating and maintaining a database of disambiguated entity mentions and relations from a corpus of electronic documents

TL;DR: The authors present a method for creating an electronic database of disambiguated entity mentions and relations from a corpus of electronic documents. The method automatically extracts mentions of entities (e.g., references to people, organizations or places) from the corpus, parses them into "mention objects," and executes a series of grouping, comparison and hierarchical fuzzy object clustering algorithms to cluster together all of the mention objects referring to the same entity and all of the mention objects associated with each other by a relationship.
Patent

Fast accurate fuzzy matching

TL;DR: A computer-implemented technique for fuzzy matching is described, which works quickly yet accurately to determine whether a given computer-readable record is represented, by exact match or pretty close match, in a large collection of computer-readable records.
Book ChapterDOI

Semantic Relatedness Approach for Named Entity Disambiguation

TL;DR: This work addresses the problem of giving a sense to proper names in a text, that is, automatically associating words representing Named Entities with their referents, based on Semantic Relatedness Scores obtained with a graph based model over Wikipedia.
References
Journal ArticleDOI

Data mining and knowledge discovery: making sense out of data

TL;DR: Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.
Journal ArticleDOI

A Theory for Record Linkage

TL;DR: A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events.
Proceedings Article

A comparison of string distance metrics for name-matching tasks

TL;DR: Using an open-source, Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.
Journal ArticleDOI

Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida

TL;DR: The theoretical and practical issues encountered in conducting the matching operation and the results of that operation are discussed.
Proceedings ArticleDOI

Efficient clustering of high-dimensional data sets with application to reference matching

TL;DR: This work presents a new technique for clustering large datasets, using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets the authors call canopies, and presents experimental results on grouping bibliographic citations from the reference sections of research papers.