Open Access
A probabilistic model for entity disambiguation using relationships
TLDR
The paper argues that, while standard feature-based data cleaning approaches can be employed, a better solution exists that analyzes not only features but also relationships.
Abstract:
Graphs representing relationships among sets of entities are an increasing focus of interest in data analysis applications. These graphs are typically constructed from existing datasets from which entities and relationships are extracted. For some of the entities, values in certain attributes refer to other entities; such references determine relationships. In certain datasets, such references are often given in the form of (string) descriptions. Each such description alone may not uniquely identify one entity as it is supposed to, but can instead match the descriptions of multiple entities. Such cases are especially common if the datasets are collected not from one but from multiple heterogeneous sources. Thus the correct linking of entities via relationships can be a nontrivial challenge which, if done incorrectly, can in turn impede further graph-based analyses. To overcome this problem, standard feature-based data cleaning approaches can be employed. In this paper we argue that a better solution exists, one that analyzes not only features but also relationships.
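The ambiguity described in the abstract can be illustrated with a small sketch. This is not the paper's probabilistic model; it is a toy example under stated assumptions: the entity records, the `coauthors` relationship, the 0.7 similarity threshold, and both helper functions are hypothetical. It shows a string description matching several entities on features alone, and a known relationship then separating the candidates.

```python
from difflib import SequenceMatcher

# Hypothetical toy data: two authors with near-identical names.
entities = {
    "a1": {"name": "John Smith", "coauthors": {"a3"}},
    "a2": {"name": "Jon Smith", "coauthors": {"a4"}},
    "a3": {"name": "Alice Wong", "coauthors": {"a1"}},
    "a4": {"name": "Bob Lee", "coauthors": {"a2"}},
}


def feature_candidates(description, threshold=0.7):
    """Feature-based step: ids of entities whose name resembles the description."""
    return sorted(
        eid
        for eid, entity in entities.items()
        if SequenceMatcher(
            None, description.lower(), entity["name"].lower()
        ).ratio() >= threshold
    )


def pick_by_relationships(candidates, context_ids):
    """Relationship step: prefer the candidate linked to the most context entities."""
    return max(
        candidates,
        key=lambda eid: len(entities[eid]["coauthors"] & context_ids),
    )


# The description alone matches two different entities.
candidates = feature_candidates("J. Smith")

# Assumed context: the reference appears in a paper also written by Alice Wong (a3).
resolved = pick_by_relationships(candidates, {"a3"})
print(candidates, resolved)
```

Here the name similarity cannot separate the two Smiths, while the co-authorship link can. A real system would have to weigh many such links and handle cases where the context entities are themselves unresolved.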
Citations
Patent
Method and apparatus for automatic entity disambiguation
TL;DR: The authors used multiple search keys to efficiently find pairs of mentions that correspond to the same entity by performing within-document (100) and cross-document (110) entity disambiguation while skipping billions of unnecessary comparisons, yielding a system with very high throughput that can be applied to truly massive data.
Unsupervised Name Disambiguation via Social Network Similarity
TL;DR: Unsupervised methods that simultaneously learn the number of entities represented by a particular name and which observations correspond to the same entity are investigated; the results suggest that methods measuring similarity based on community, rather than exact similarity, provide more robust disambiguation capability.
Patent
System and method for creating and maintaining a database of disambiguated entity mentions and relations from a corpus of electronic documents
TL;DR: The authors present a method for creating an electronic database of disambiguated entity mentions and relations from a corpus of electronic documents. The method automatically extracts from the corpus mentions of entities (e.g., references to people, organizations or places), parses the entity mentions into "mention objects," and executes a series of grouping, comparison and hierarchical fuzzy object clustering algorithms to cluster together all of the mention objects referring to the same entity and all of the mention objects associated with each other by a relationship.
Patent
Fast accurate fuzzy matching
TL;DR: A computer-implemented technique for fuzzy matching is described that works quickly yet accurately to determine whether a given computer-readable record is represented, by exact or close match, in a large collection of computer-readable records.
Book Chapter
Semantic Relatedness Approach for Named Entity Disambiguation
TL;DR: This work addresses the problem of giving a sense to proper names in a text, that is, automatically associating words representing Named Entities with their referents, based on Semantic Relatedness Scores obtained with a graph based model over Wikipedia.
References
Journal Article
Data mining and knowledge discovery: making sense out of data
TL;DR: Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.
Journal Article
A Theory for Record Linkage
Ivan P. Fellegi, Alan B. Sunter
TL;DR: A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events.
Proceedings Article
A comparison of string distance metrics for name-matching tasks
TL;DR: Using an open-source, Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.
Journal Article
Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida
TL;DR: The theoretical and practical issues encountered in conducting the matching operation and the results of that operation are discussed.
Proceedings Article
Efficient clustering of high-dimensional data sets with application to reference matching
TL;DR: This work presents a new technique for clustering large datasets, using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets the authors call canopies, and presents experimental results on grouping bibliographic citations from the reference sections of research papers.
Related Papers (5)
Entity-Based Cross-Document Coreferencing Using the Vector Space Model
Amit Bagga, Breck Baldwin