Duplicate Record Detection: A Survey
References
Basic Local Alignment Search Tool
Maximum likelihood from incomplete data via the EM algorithm
Pattern Classification and Scene Analysis
A general method applicable to the search for similarities in the amino acid sequence of two proteins
The Elements of Statistical Learning
Frequently Asked Questions (14)
Q2. What is the way to avoid the need for training data?
One way to avoid the need for training data is to define a distance metric for records that does not require tuning through training data.
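As a rough illustration of such a metric (the field names, weights, and use of `difflib` are my assumptions, not the survey's), a training-free record distance can be a fixed, hand-weighted average of per-field string dissimilarities:

```python
# Illustrative sketch: a record distance that needs no training data,
# only a hand-chosen field weighting.
from difflib import SequenceMatcher

def field_sim(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_distance(r1: dict, r2: dict, weights=None) -> float:
    """Weighted average dissimilarity over a fixed set of fields."""
    weights = weights or {"name": 0.6, "city": 0.4}  # hypothetical fields
    total = sum(w * (1.0 - field_sim(r1.get(f, ""), r2.get(f, "")))
                for f, w in weights.items())
    return total / sum(weights.values())
```

Record pairs whose distance falls below a hand-picked threshold would be declared duplicates; no labeled pairs are needed, only a sensible choice of weights.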
Q3. What is the way to reduce the complexity of the record comparison process?
By using a feature selection algorithm (e.g., [44]) as a preprocessing step, the record comparison process uses only a small subset of the record fields, which speeds up the comparison process.
Q4. How can the authors improve the quality of duplicate detection in databases?
Ananthakrishna et al. show that by using foreign key co-occurrence information, they can substantially improve the quality of duplicate detection in databases that use multiple tables to store the entries of a record.
Q5. What is the common metric used to measure token similarity?
The token similarity is measured using a metric that works well for short strings, such as edit distance and the Jaro metric.
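For illustration, here is a minimal implementation of the Jaro metric, one of the short-string metrics the answer mentions (this sketch is mine, not code from the survey):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; higher means more similar."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1 = [False] * len(s1)
    m2 = [False] * len(s2)
    matches = 0
    # Count characters that agree within the matching window.
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters that are out of order.
    t, j = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3.0
```

On the classic example pair "MARTHA"/"MARHTA" the metric finds 6 matching characters and 1 transposition, giving a similarity of 17/18.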
Q6. How do they learn to label the data?
The basic idea, also known as co-training [10], is to start with very few labeled data and then use unsupervised learning techniques to appropriately label the data with unknown labels.
Q7. What should be made available to developers?
A repository of benchmark data sources with known and diverse characteristics should be made available to developers so they may evaluate their methods during the development process.
Q8. How long does it take to compute the q-gram overlap between two strings?
With the appropriate use of hash-based indexes, the average time required for computing the q-gram overlap between two strings σ1 and σ2 is O(max{|σ1|, |σ2|}).
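A hash-based dictionary makes this concrete: each string's q-grams are tabulated once, and the overlap is a single pass over the smaller table. The sketch below is my illustration of that idea, not code from the survey:

```python
# q-gram overlap via a hash-based index (a Python Counter): building each
# table and intersecting them is linear in the string lengths on average.
from collections import Counter

def qgrams(s: str, q: int = 2) -> Counter:
    """All overlapping substrings of length q (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_overlap(s1: str, s2: str, q: int = 2) -> int:
    """Number of q-grams shared by s1 and s2, counting multiplicity."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    return sum(min(c, g2[g]) for g, c in g1.items())
```

For example, "nichols" and "nichleson" share the bigrams "ni", "ic", and "ch", so their 2-gram overlap is 3.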
Q9. How can the authors compute the distance between two strings using a dynamic programming technique?
The distance between two strings can be computed using a dynamic programming technique, based on the Needleman and Wunsch algorithm [60].
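A compact sketch of that dynamic program follows (the unit edit costs are an illustrative choice; the Needleman and Wunsch formulation [60] allows arbitrary cost functions):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn s1 into s2, via dynamic programming."""
    prev = list(range(len(s2) + 1))   # row for the empty prefix of s1
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # delete c1
                           cur[j - 1] + 1,              # insert c2
                           prev[j - 1] + (c1 != c2)))   # substitute
        prev = cur
    return prev[-1]
```

Keeping only the previous row reduces memory from O(|s1| · |s2|) to O(|s2|) without changing the result.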
Q10. What is the probability of directing a record pair to an expert?
By setting thresholds for the conditional error on M and U, the authors can define the reject region and the reject probability, which measure the probability of directing a record pair to an expert for review.
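The following sketch illustrates the resulting three-way decision (the thresholds and the m- and u-probabilities are invented numbers, not values from the survey): pairs whose likelihood ratio falls between the two thresholds land in the reject region and go to a human expert.

```python
def likelihood_ratio(x, m, u):
    """p(x|M) / p(x|U) under conditional independence of the fields;
    x is a binary comparison vector, m and u the per-field probabilities."""
    r = 1.0
    for xj, mj, uj in zip(x, m, u):
        r *= mj / uj if xj else (1.0 - mj) / (1.0 - uj)
    return r

def decide(x, m, u, upper=100.0, lower=0.01):
    """Three-way Fellegi-Sunter style decision with illustrative thresholds."""
    r = likelihood_ratio(x, m, u)
    if r >= upper:
        return "match"
    if r <= lower:
        return "non-match"
    return "review"   # reject region: route the pair to an expert
```

Tightening the thresholds shrinks the error rates on M and U at the cost of a larger reject probability, i.e., more pairs sent for manual review.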
Q11. How many pre-labeled record pairs are required to learn matching models?
Verykios et al. show that the classifiers generated using the new, larger training set have high accuracy, and require only a minimal number of pre-labeled record pairs.
Q12. What is the way to estimate p(x|M)?
When conditional independence is not a reasonable assumption, Winkler [97] suggested using the general expectation maximization algorithm to estimate p(x|M) and p(x|U).
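As context for this answer, here is a minimal EM sketch under the simpler conditional-independence baseline (my illustration, not Winkler's general algorithm, which relaxes exactly this assumption): binary comparison vectors are modeled as a two-component Bernoulli mixture, yielding per-field estimates of p(x|M) and p(x|U).

```python
def em_fellegi_sunter(vectors, iters=50):
    """EM for a two-component Bernoulli mixture over binary comparison
    vectors; returns (match prior p, m-probabilities, u-probabilities)."""
    n, k = len(vectors), len(vectors[0])
    p = 0.5            # prior probability that a pair is a match
    m = [0.9] * k      # initial guess for P(x_j = 1 | M)
    u = [0.1] * k      # initial guess for P(x_j = 1 | U)
    for _ in range(iters):
        # E-step: posterior probability that each pair is a match.
        g = []
        for x in vectors:
            pm, pu = p, 1.0 - p
            for j in range(k):
                pm *= m[j] if x[j] else 1.0 - m[j]
                pu *= u[j] if x[j] else 1.0 - u[j]
            g.append(pm / (pm + pu))
        # M-step: re-estimate the prior and per-field probabilities.
        gm = sum(g)
        p = gm / n
        for j in range(k):
            m[j] = sum(gi * x[j] for gi, x in zip(g, vectors)) / gm
            u[j] = sum((1 - gi) * x[j] for gi, x in zip(g, vectors)) / (n - gm)
    return p, m, u
```

Because each m[j] and u[j] is estimated independently, this baseline breaks down when fields are correlated, which is the situation Winkler's general EM is designed to handle.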
Q13. What are the effective edit distance metrics?
The edit distance metrics work well for catching typographical errors, but they are typically ineffective for other types of mismatches.
Q14. What was the main reason for the development of new deduplication techniques?
While the Fellegi-Sunter approach dominated the field for more than two decades, the development of new classification techniques in the machine learning and statistics communities prompted the development of new deduplication techniques.