Duplicate Record Detection: A Survey
Summary (3 min read)
Introduction
- Often, in the real world, entities have two or more representations in databases.
- The authors note that the algorithms developed for mirror detection or for anaphora resolution are often applicable to the task of duplicate detection.
- The authors will use the term duplicate record detection in this paper.
- Date and time formatting and name and title formatting pose other standardization difficulties in a database.
- In the next section, the authors describe techniques for measuring the similarity of individual fields, and later, in Section IV they describe techniques for measuring the similarity of entire records.
A. Character-based similarity metrics
- The character-based similarity metrics are designed to handle typographical errors well.
- Pinheiro and Sun [70] proposed a similarity measure that tries to find the best character alignment for the two compared strings σ1 and σ2, so that the number of character mismatches is minimized.
- The q-grams are short character substrings of length q of the database strings [89], [90].
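To make the q-gram idea concrete, here is a minimal sketch of a q-gram overlap similarity; the padding characters and the max-based normalization are illustrative choices, not something prescribed by the survey:

```python
def qgrams(s: str, q: int = 2) -> set:
    """Return the set of length-q character substrings of s, padded so that
    the first and last characters also contribute edge q-grams."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(s1: str, s2: str, q: int = 2) -> float:
    """Fraction of q-grams shared by the two strings (0 = disjoint, 1 = equal)."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    return len(g1 & g2) / max(len(g1), len(g2))

# e.g. qgram_similarity("johnson", "jonson") stays high despite the typo
```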
B. Token-based similarity metrics
- Character-based similarity metrics work well for typographical errors.
- It is often the case that typographical conventions lead to rearrangement of words (e.g., "John Smith" vs. "Smith, John").
- Based on this algorithm, the similarity of two fields is the number of their matching atomic strings divided by their average number of atomic strings (see the sketch after this list).
- Also, the introduction of frequent words affects the similarity of the two strings only minimally, due to the low idf weight of frequent words.
- This metric handles the insertion and deletion of words nicely.
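A minimal sketch of the atomic-string similarity mentioned above; splitting on whitespace and the prefix-matching convention for abbreviations are illustrative assumptions, and real implementations differ in such details:

```python
def atomic_string_similarity(field1: str, field2: str) -> float:
    """Similarity = number of matching atomic strings / average number of
    atomic strings. Here atomic strings are whitespace-delimited tokens, and
    two tokens 'match' if they are equal or one is a prefix of the other
    (an illustrative convention, e.g. "Univ" vs "University")."""
    tokens1 = field1.lower().split()
    tokens2 = field2.lower().split()
    used = set()
    matches = 0
    for t1 in tokens1:
        for j, t2 in enumerate(tokens2):
            if j not in used and (t1 == t2 or t1.startswith(t2) or t2.startswith(t1)):
                used.add(j)
                matches += 1
                break
    average = (len(tokens1) + len(tokens2)) / 2
    return matches / average if average else 1.0
```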
C. Phonetic similarity metrics
- Character-level and token-based similarity metrics focus on the string-based representation of the database records.
- Strings may be phonetically similar even if they are not similar in a character or token level.
- When the names are of predominantly East Asian origin, this code is less satisfactory, because much of the discriminating power of these names resides in the vowel sounds, which the code ignores.
- The introduction of multiple phonetic encodings greatly enhances the matching performance, with rather small overhead.
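For reference, a compact sketch of the classic Soundex encoding that this discussion refers to; treat it as illustrative rather than a reference implementation:

```python
def soundex(name: str) -> str:
    """Classic Soundex: keep the first letter, map the remaining consonants to
    digits, skip vowels, collapse runs of equal digits (H and W do not break
    a run), and pad the result with zeros to four characters."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    result, prev = name[0], codes.get(name[0], "")
    for c in name[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        if c not in "HW":          # vowels reset the run; H and W do not
            prev = code
    return result.ljust(4, "0")

# soundex("Robert") == soundex("Rupert") == "R163"
```

Note how the vowels are discarded entirely, which is exactly why the code performs poorly on names whose discriminating power lies in the vowel sounds.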
D. Numeric Similarity Metrics
- While multiple methods exist for detecting similarities of string-based data, the methods for capturing similarities in numeric data are rather primitive.
- Typically, the numbers are either treated as strings (and compared using the metrics described above) or matched with simple range queries that locate numbers with similar values (a sketch follows).
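A minimal sketch of one such range-style measure; the linear decay and the tolerance parameter are illustrative choices rather than anything prescribed by the survey:

```python
def numeric_similarity(a: float, b: float, tolerance: float) -> float:
    """Map the absolute difference between two numbers into [0, 1]:
    1.0 for equal values, decaying linearly to 0.0 once the values
    differ by `tolerance` or more."""
    if tolerance <= 0:
        return 1.0 if a == b else 0.0
    return max(0.0, 1.0 - abs(a - b) / tolerance)

# numeric_similarity(1995, 1996, tolerance=5) -> 0.8 (e.g. publication years)
```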
E. Concluding Remarks
- The large number of field comparison metrics reflects the large number of errors or transformations that may occur in real-life data.
- They show that the Monge-Elkan metric has the highest average performance across data sets and across character-based distance metrics.
- The authors review methods that are used for matching records with multiple fields.
- The rest of this section is organized as follows: initially, in Section IV-A the authors describe the notation.
- Finally, Section IV-G covers unsupervised machine learning techniques, and Section IV-H provides some concluding remarks.
B. Probabilistic Matching Models
- Newcombe et al. [64] were the first to recognize duplicate detection as a Bayesian inference problem.
- The main assumption is that x is a random vector whose density function is different for each of the two classes.
- The values of p(xi|M) and p(xi|U) can be computed using a training set of pre-labeled record pairs.
- 2) The Bayes Decision Rule for Minimum Cost: Often, in practice, the minimization of the probability of error is not the best criterion for creating decision rules, as the misclassifications of M and U samples may have different consequences.
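To make the model concrete, here is a hedged sketch of the decision rule under the conditional-independence (naive Bayes) assumption; the variable names and the binary agree/disagree encoding of the comparison vector are illustrative:

```python
import math

def log_likelihood_ratio(x, p_M, p_U):
    """x: binary comparison vector (x[i] = 1 if field i agrees across the pair).
    p_M[i] = P(x_i = 1 | M) and p_U[i] = P(x_i = 1 | U), estimated from a
    training set of pre-labeled record pairs; fields are assumed conditionally
    independent, and probabilities are assumed strictly between 0 and 1."""
    ratio = 0.0
    for xi, pm, pu in zip(x, p_M, p_U):
        ratio += math.log(pm / pu) if xi else math.log((1 - pm) / (1 - pu))
    return ratio

def decide(x, p_M, p_U, prior_M, cost_fp=1.0, cost_fn=1.0):
    """Minimum-cost Bayes rule: declare a match M when the likelihood ratio
    exceeds a threshold set by the class priors and the costs of a false
    match (cost_fp) versus a missed match (cost_fn)."""
    threshold = math.log(cost_fp * (1 - prior_M) / (cost_fn * prior_M))
    return "M" if log_likelihood_ratio(x, p_M, p_U) > threshold else "U"
```

With equal costs this reduces to the minimum-error Bayes test; raising cost_fp shifts the threshold so that more ambiguous pairs are classified as non-matches.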
C. Supervised and Semi-Supervised Learning
- The probabilistic model uses a Bayesian approach to classify record pairs into two classes, M and U.
- While the Fellegi-Sunter approach dominated the field for more than two decades, the development of new classification techniques in the machine learning and statistics communities prompted the development of new deduplication techniques.
- A typical post-processing step for these techniques (including the probabilistic techniques of Section IV-B) is to construct a graph for all the records in the database, linking together the matching records (see the sketch below).
- The underlying assumption is that the only differences are due to different representations of the same entity (e.g., "Google" and "Google Inc.") and that there is no erroneous information in the attribute values (e.g., someone mistakenly entering "Bismarck, ND" as the location of Google headquarters).
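A minimal sketch of that graph-linking post-processing step: records are clustered by taking the transitive closure of the pairwise match decisions, here via a standard union-find pass (the function names are illustrative):

```python
def cluster_matches(num_records: int, matching_pairs):
    """Group record ids into clusters: any two records connected by a chain
    of pairwise 'match' decisions end up in the same cluster."""
    parent = list(range(num_records))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for a, b in matching_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(num_records):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# cluster_matches(5, [(0, 1), (1, 2)]) -> [[0, 1, 2], [3], [4]]
```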
D. Active-Learning-Based Techniques
- One of the problems with the supervised learning techniques is the requirement for a large number of training examples.
- The main idea behind ALIAS is that most duplicate and non-duplicate pairs are clearly distinct.
- In the sequel, the initial classifier is used for predicting the status of unlabeled pairs of records.
- The goal is to seek out from the unlabeled data pool those instances which, when labeled, will improve the accuracy of the classifier at the fastest possible rate.
- Using this technique, ALIAS can quickly learn the peculiarities of a data set and rapidly detect duplicates using only a small amount of training data.
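The selection step can be sketched as follows; this is a generic uncertainty-sampling criterion in the spirit of ALIAS, not the paper's exact procedure, and it assumes a scikit-learn-style classifier exposing predict_proba:

```python
def most_ambiguous_pairs(classifier, candidate_features, k=10):
    """Rank unlabeled record pairs by how uncertain the current classifier is
    about them (predicted match probability closest to 0.5) and return the
    indices of the k most ambiguous pairs for a human to label next."""
    probabilities = classifier.predict_proba(candidate_features)[:, 1]
    order = sorted(range(len(probabilities)),
                   key=lambda i: abs(probabilities[i] - 0.5))
    return order[:k]
```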
E. Distance-Based Techniques
- Even active learning techniques require some training data or some human effort to create the matching models.
- Guha et al. map the problem to the minimum cost perfect matching problem, and then develop efficient solutions for identifying the top-k matching records.
- This approach is conceptually similar to the work of Perkowitz et al. [67] and of Dasu et al. [25], which examine the contents of fields to locate the matching fields across two tables (see Section II).
- This would nullify the major advantage of distance-based techniques, which is the ability to operate without training data.
- Recently, Chaudhuri et al. [16] proposed a new framework for distance-based duplicate detection, observing that the distance threshold for detecting real duplicate entries differs for each database tuple.
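A minimal sketch of the distance-based setup: a record distance assembled as a weighted combination of per-field distances, compared against a threshold. The weights and the single global threshold are illustrative (the per-tuple thresholds of Chaudhuri et al. would replace the constant):

```python
def record_distance(rec1, rec2, field_metrics, weights):
    """Weighted average of per-field distances; rec1/rec2 are tuples of field
    values, field_metrics holds one distance function per field, and weights
    encode the relative importance of each field."""
    total = sum(w * metric(a, b)
                for w, metric, a, b in zip(weights, field_metrics, rec1, rec2))
    return total / sum(weights)

def is_candidate_duplicate(rec1, rec2, field_metrics, weights, threshold=0.3):
    """Flag pairs whose distance falls below the threshold. No training data
    is needed, only the metrics, the weights, and the threshold."""
    return record_distance(rec1, rec2, field_metrics, weights) < threshold
```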
F. Rule-based Approaches
- Wang and Madnick [94] proposed a rule-based approach for the duplicate detection problem.
- By using such rules, Wang and Madnick hoped to generate unique keys that can cluster multiple records that represent the same real-world entity.
- Specifying such an inference in the equational theory requires a declarative rule language.
- AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations starting from some input source data.
- It is noteworthy that such rule-based approaches, which require a human expert to devise meticulously crafted matching rules, typically result in systems with high accuracy.
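To illustrate the flavor of such hand-crafted rules, here is a hedged sketch of one possible matching rule; the field names, the thresholds, and the token-Jaccard placeholder metric are all hypothetical:

```python
def address_similarity(a: str, b: str) -> float:
    """Placeholder field similarity (token Jaccard); any of the field-matching
    metrics from Section III could be substituted here."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def same_person_rule(r1: dict, r2: dict) -> bool:
    """IF the last names match AND the first names match (or one is just an
    initial of the other) AND the addresses are highly similar,
    THEN declare the two records duplicates."""
    if r1["last"].lower() != r2["last"].lower():
        return False
    f1, f2 = r1["first"].lower(), r2["first"].lower()
    first_ok = (f1 == f2) or (min(len(f1), len(f2)) == 1 and f1[:1] == f2[:1])
    return first_ok and address_similarity(r1["addr"], r2["addr"]) > 0.8
```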
H. Concluding Remarks
- There are multiple techniques for duplicate record detection.
- The authors can divide the techniques into two broad categories: ad-hoc techniques that work quickly on existing relational databases, and more principled techniques that are based on probabilistic inference models.
V. Improving the Efficiency of Duplicate Detection
- In Section V-A the authors describe techniques that substantially reduce the number of required comparisons.
- Another factor that can lead to increased computational expense is the cost of each individual record comparison.
A. Reducing the Number of Record Comparisons
- One traditional method for identifying identical records in a database table is to scan the table and compute the value of a hash function for each record (sketched below).
- Verykios et al. [91] propose a set of techniques for reducing the complexity of record comparison.
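That hash-based scan can be sketched as follows; the field-delimiter choice is illustrative, the scheme catches only identical representations (a strict implementation would re-compare colliding records), and near-duplicates still require the comparison techniques above:

```python
import hashlib
from collections import defaultdict

def exact_duplicate_groups(records):
    """One pass over the table: hash each record, then report the groups of
    record ids whose hashes collide, i.e. the identically represented rows."""
    buckets = defaultdict(list)
    for idx, record in enumerate(records):
        key = hashlib.sha1("|".join(map(str, record)).encode()).hexdigest()
        buckets[key].append(idx)
    return [ids for ids in buckets.values() if len(ids) > 1]

# exact_duplicate_groups([("john", "smith"), ("jane", "doe"), ("john", "smith")])
# -> [[0, 2]]
```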
Frequently Asked Questions (14)
Q2. What is the way to avoid the need for training data?
One way of avoiding the need for training data is to define a distance metric for records, which does not need tuning through training data.
Q3. What is the way to reduce the complexity of the record comparison process?
By using a feature selection algorithm (e.g., [44]) as a preprocessing step the record comparison process uses only a small subset of the record fields, which speeds up the comparison process.
Q4. How can the authors improve the quality of duplicate detection in databases?
Ananthakrishna et al. show that by using foreign key co-occurrence information, they can substantially improve the quality of duplicate detection in databases that use multiple tables to store the entries of a record.
Q5. What is the common metric used to measure token similarity?
The token similarity is measured using a metric that works well for short strings, such as edit distance and Jaro.
Q6. How do they learn to label the data?
The basic idea, also known as co-training [10], is to use very few labeled data, and then use unsupervised learning techniques to appropriately label the data with unknown labels.
Q7. What should be made available to developers?
A repository of benchmark data sources with known and diverse characteristics should be made available to developers so they may evaluate their methods during the development process.
Q8. How long does it take to compute the q-gram overlap between two strings?
With the appropriate use of hash-based indexes, the average time required for computing the q-gram overlap between two strings σ1 and σ2 is O(max{|σ1|, |σ2|}).
Q9. How can the authors compute the distance between two strings using a dynamic programming technique?
The distance between two strings can be computed using a dynamic programming technique, based on the Needleman and Wunsch algorithm [60].
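A minimal sketch of that dynamic program in its unit-cost (Levenshtein) form; the Needleman-Wunsch formulation generalizes it with alignment scores and gap penalties:

```python
def edit_distance(s1: str, s2: str) -> int:
    """d[i][j] = minimum number of insertions, deletions, and substitutions
    needed to turn s1[:i] into s2[:j]; the answer is d[len(s1)][len(s2)]."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + substitution)  # match / substitute
    return d[m][n]

# edit_distance("smith", "smyth") == 1
```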
Q10. What is the probability of directing a record pair to an expert?
By setting thresholds for the conditional error on M and U, the authors can define the reject region and the reject probability, which measure the probability of directing a record pair to an expert for review.
Q11. How many pre-labeled record pairs are required to learn matching models?
Verykios et al. show that the classifiers generated using the new, larger training set have high accuracy, and require only a minimal number of pre-labeled record pairs.
Q12. What is the way to estimate p(x|M)?
When the conditional independence is not a reasonable assumption, then Winkler [97] suggested using the general expectation maximization algorithm to estimate p(x|M) and p(x|U).
Q13. What are the effective edit distance metrics?
The edit distance metrics work well for catching typographical errors, but they are typically ineffective for other types of mismatches.
Q14. What was the main reason for the development of new deduplication techniques?
While the Fellegi-Sunter approach dominated the field for more than two decades, the development of new classification techniques in the machine learning and statistics communities prompted the development of new deduplication techniques.