Journal Article

A Theory for Record Linkage

Abstract
A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events (said to be matched). A comparison is to be made between the recorded characteristics and values in two records (one from each file) and a decision made as to whether or not the members of the comparison-pair represent the same person or event, or whether there is insufficient evidence to justify either of these decisions at stipulated levels of error. These three decisions are referred to as a link (A1), a non-link (A3), and a possible link (A2). The first two decisions are called positive dispositions. The two types of error are defined as the error of the decision A1 when the members of the comparison pair are in fact unmatched, and the error of the decision A3 when the members of the comparison pair are in fact matched. The probabilities of these errors are defined as and respecti...
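
The three decisions and two error types described in the abstract can be illustrated with a minimal Python sketch. The numeric `score` for a comparison pair and the `lower` and `upper` thresholds are hypothetical stand-ins introduced only for illustration; the truncated abstract does not say how a comparison is scored or how the stipulated error levels determine the decision, so this is not the paper's procedure.

```python
from enum import Enum


class Decision(Enum):
    LINK = "A1"            # pair judged to be the same person or event
    POSSIBLE_LINK = "A2"   # insufficient evidence either way
    NON_LINK = "A3"        # pair judged to be different


def decide(score: float, lower: float, upper: float) -> Decision:
    """Three-way decision for one comparison pair.

    `score`, `lower` and `upper` are assumptions of this sketch:
    a higher score means stronger agreement between the two records.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return Decision.LINK
    if score <= lower:
        return Decision.NON_LINK
    return Decision.POSSIBLE_LINK


def error_rates(pairs, lower, upper):
    """Empirical rates of the two error types on labelled pairs.

    `pairs` is a list of (score, is_matched) tuples. Returns
    (share of unmatched pairs given A1, share of matched pairs given A3).
    """
    unmatched = [s for s, is_matched in pairs if not is_matched]
    matched = [s for s, is_matched in pairs if is_matched]
    false_links = sum(decide(s, lower, upper) is Decision.LINK for s in unmatched)
    false_non_links = sum(decide(s, lower, upper) is Decision.NON_LINK for s in matched)
    return (
        false_links / len(unmatched) if unmatched else 0.0,
        false_non_links / len(matched) if matched else 0.0,
    )
```

Moving the two thresholds apart sends more pairs to the possible-link decision A2 and can only lower both error rates; moving them together forces more positive dispositions at the cost of more errors.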

Citations
Journal Article

Duplicate Record Detection: A Survey

TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.
Proceedings Article

A comparison of string distance metrics for name-matching tasks

TL;DR: Using an open-source, Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.
Journal Article

Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida

TL;DR: The theoretical and practical issues encountered in conducting the matching operation and the results of that operation are discussed.
Proceedings Article

Efficient clustering of high-dimensional data sets with application to reference matching

TL;DR: This work presents a new technique for clustering large datasets, using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets the authors call canopies, and presents experimental results on grouping bibliographic citations from the reference sections of research papers.
Proceedings Article

Adaptive duplicate detection using learnable string similarity measures

TL;DR: This paper proposes to employ learnable text distance functions for each database field, and shows that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain.