Duplicate Record Detection: A Survey
Ahmed K. Elmagarmid
Purdue University
Panagiotis G. Ipeirotis
New York University
Vassilios S. Verykios
University of Thessaly

Abstract
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this article, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
Index Terms
duplicate detection, data cleaning, data integration, record linkage, data deduplication, instance identification, database hardening, name matching, identity uncertainty, entity resolution, fuzzy duplicate detection, entity matching
I. INTRODUCTION
Databases play an important role in today's IT-based economy. Many industries and systems depend on the accuracy of databases to carry out operations. Therefore, the quality of the information (or the lack thereof) stored in the databases can have significant cost implications for a system that relies on information to function and conduct business. In an error-free system with perfectly clean data, the construction of a comprehensive view of the data consists of linking (in relational terms, joining) two or more tables on their key fields. Unfortunately, data often lack a unique, global identifier that would permit such an operation. Furthermore, the data are neither carefully controlled for quality nor defined in a consistent way across different data sources. Thus, data quality is often compromised by many factors, including data entry errors (e.g., Microsft instead of Microsoft), missing integrity constraints (e.g., allowing entries such as EmployeeAge=567), and multiple conventions for recording information (e.g., 44 W. 4th St. vs. 44 West Fourth Street). To make things worse, in independently managed databases not only the values but also the structure, semantics, and underlying assumptions about the data may differ.

Often, while integrating data from different sources to implement a data warehouse, organizations become aware of potential systematic differences or conflicts. Such problems fall under the umbrella term data heterogeneity [14]. Data cleaning [77], or data scrubbing [96], refers to the process of resolving such identification problems in the data. We distinguish between two types of data heterogeneity: structural and lexical. Structural heterogeneity occurs when the fields of the tuples in the database are structured differently in different databases. For example, in one database the customer address might be recorded in one field named, say, addr, while in another database the same information might be stored in multiple fields such as street, city, state, and zipcode. Lexical heterogeneity occurs when the tuples have identically structured fields across databases, but the data use different representations to refer to the same real-world object (e.g., StreetAddress=44 W. 4th St. vs. StreetAddress=44 West Fourth Street).
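
As a concrete illustration of these two kinds of heterogeneity, the following sketch (in Python, with purely hypothetical field names and values) shows how the same customer might appear in differently structured sources and in identically structured but lexically divergent ones.

    # Structural heterogeneity: the same information is split across
    # differently structured fields in two hypothetical sources.
    record_source_a = {
        "name": "John Smith",
        "addr": "44 W. 4th St., New York, NY 10012",
    }
    record_source_b = {
        "name": "John Smith",
        "street": "44 West Fourth Street",
        "city": "New York",
        "state": "NY",
        "zipcode": "10012",
    }

    # Lexical heterogeneity: identically structured fields, but different
    # textual representations of the same real-world address.
    record_source_c = {"StreetAddress": "44 W. 4th St."}
    record_source_d = {"StreetAddress": "44 West Fourth Street"}
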
In this paper, we focus on the problem of lexical heterogeneity and survey various techniques that have been developed for addressing this problem. We focus on the case where the input is a set of structured and properly segmented records, i.e., we focus mainly on cases of database records. Hence, we do not cover solutions for various other problems, such as that of mirror detection, in which the goal is to detect similar or identical web pages (e.g., see [13], [18]). Also, we do not cover solutions for problems such as anaphora resolution [56], in which the problem is to locate different mentions of the same entity in free text (e.g., to recognize that the phrase "President of the U.S." refers to the same entity as "George W. Bush"). We should note that the algorithms developed for mirror detection or for anaphora resolution are often applicable to the task of duplicate detection. Techniques for mirror detection have been used for the detection of duplicate database records (see, for example, Section V-A.4), and techniques for anaphora resolution are commonly used as an integral part of deduplication in relations that are extracted from free text using information extraction systems [52].
The problem that we study has been known for more than five decades as the record linkage or the record matching problem [31], [61]-[64], [88] in the statistics community. The goal of record matching is to identify records in the same or different databases that refer to the same real-world entity, even if the records are not identical. In slightly ironic fashion, the same problem has multiple names across research communities. In the database community, the problem is described as merge-purge [39], data deduplication [78], and instance identification [94]; in the AI community, the same problem is described as database hardening [21] and name matching [9]. The names coreference resolution, identity uncertainty, and duplicate detection are also commonly used to refer to the same task. We will use the term duplicate record detection in this paper.
The remaining part of this paper is organized as follows: In Section II, we briefly discuss the necessary steps in the data cleaning process, before the duplicate record detection phase. Then, Section III describes techniques used to match individual fields, and Section IV presents techniques for matching records that contain multiple fields. Section V describes methods for improving the efficiency of the duplicate record detection process, and Section VI presents a few commercial, off-the-shelf tools used in industry for duplicate record detection and for evaluating the initial quality of the data and of the matched records. Finally, Section VII concludes the paper and discusses interesting directions for future research.
II. DATA PREPARATION
Duplicate record detection is the process of identifying different or multiple records that refer to one unique real-world entity or object. Typically, the process of duplicate detection is preceded by a data preparation stage, during which data entries are stored in a uniform manner in the database, resolving (at least partially) the structural heterogeneity problem. The data preparation stage includes a parsing, a data transformation, and a standardization step. The approaches that deal with data preparation are also described under the term ETL (Extraction, Transformation, Loading) [43]. These steps improve the quality of the in-flow data and make the data comparable and more usable. While data preparation is not the focus of this survey, for completeness we briefly describe the tasks performed in that stage. A comprehensive collection of papers related to various data transformation approaches can be found in [74].
Parsing is the first critical component in the data preparation stage. Parsing locates, identifies, and isolates individual data elements in the source files. Parsing makes it easier to correct, standardize, and match data because it allows the comparison of individual components, rather than of long complex strings of data. For example, the appropriate parsing of name and address components into consistent packets of information is a crucial part of the data cleaning process. Multiple parsing methods have been proposed recently in the literature (e.g., [1], [11], [53], [71], [84]), and the area continues to be an active field of research.
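
To make the parsing step concrete, here is a minimal sketch in Python that splits a raw US-style address string into individual components using a single regular expression. The pattern, field names, and fallback behavior are illustrative assumptions; the learning-based parsers cited above are considerably more robust.

    import re

    # Isolate individual data elements so that later steps can correct,
    # standardize, and match field by field instead of comparing one
    # long, complex string.
    ADDRESS_PATTERN = re.compile(
        r"^(?P<number>\d+)\s+(?P<street>.+?),\s*(?P<city>[^,]+),\s*"
        r"(?P<state>[A-Z]{2})\s+(?P<zipcode>\d{5})$"
    )

    def parse_address(raw):
        match = ADDRESS_PATTERN.match(raw.strip())
        if match is None:
            return {"unparsed": raw}   # do not guess when the pattern fails
        return match.groupdict()

    # parse_address("44 W. 4th St., New York, NY 10012")
    # -> {'number': '44', 'street': 'W. 4th St.', 'city': 'New York',
    #     'state': 'NY', 'zipcode': '10012'}
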
Data transformation refers to simple conversions that can be applied to the data in order for them to conform to the data types of their corresponding domains. In other words, this type of conversion focuses on manipulating one field at a time, without taking into account the values in related fields. The most common form of a simple transformation is the conversion of a data element from one data type to another. Such a data type conversion is usually required when a legacy or parent application stored data in a data type that makes sense within the context of the original application, but not in a newly developed or subsequent system. Renaming a field from one name to another is considered data transformation as well. Encoded values in operational systems and in external data are another problem that is addressed at this stage. These values should be converted to their decoded equivalents, so that records from different sources can be compared in a uniform manner. Range checking is yet another kind of data transformation, which involves examining data in a field to ensure that it falls within the expected range, usually a numeric or date range. Lastly, dependency checking is slightly more involved, since it requires comparing the value in a particular field to the values in another field, to ensure a minimal level of consistency in the data.
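
The sketch below illustrates the single-field transformations just described (type conversion, decoding of coded values, range checking, and dependency checking). The field names, the code table, and the valid ranges are assumptions made for the example, not part of the survey.

    # Hypothetical code table: operational systems often store encoded
    # values that must be converted to their decoded equivalents.
    GENDER_CODES = {"1": "male", "2": "female", "9": "unknown"}

    def transform_record(raw):
        clean = dict(raw)

        # Data type conversion: a legacy source stores age as a string.
        clean["age"] = int(raw["age"])

        # Decoding: replace the operational code with a readable value.
        clean["gender"] = GENDER_CODES.get(raw["gender"], "unknown")

        # Range checking: verify that the value falls within an expected
        # numeric range (the bounds here are illustrative).
        clean["age_valid"] = 0 <= clean["age"] <= 120

        # Dependency checking: compare one field against another to ensure
        # a minimal level of consistency (ISO dates compare as strings).
        clean["dates_consistent"] = raw["hire_date"] >= raw["birth_date"]

        return clean

    # transform_record({"age": "567", "gender": "1",
    #                   "birth_date": "1960-03-02", "hire_date": "1985-07-15"})
    # -> age_valid is False, flagging the EmployeeAge=567 style of error.
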
Data standardization refers to the process of standardizing the information represented in certain fields to a specific content format. This is used for information that can be stored in many different ways in various data sources and must be converted to a uniform representation before the duplicate detection process starts. Without standardization, many duplicate entries could erroneously be designated as non-duplicates, based on the fact that common identifying information cannot be compared. One of the most common standardization applications involves address information. There is no single standardized way to capture addresses, so the same address can be represented in many different ways. Address standardization locates (using various parsing techniques) components such as house numbers, street names, post office boxes, apartment numbers, and rural routes, which are then recorded in the database using a standardized format (e.g., 44 West Fourth Street is stored as 44 W. 4th St.). Date and time formatting and name and title formatting pose other standardization difficulties in a database. Typically, when operational applications are designed and constructed, there is very little uniform handling of date and time formats across applications. Because most operational environments have many different formats for representing dates and times, there is a need to transform dates and times into a standardized format. Name standardization identifies components such as first names, last names, titles, and middle initials and records everything using some standardized convention.
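
The following minimal sketch illustrates the standardization step for addresses and dates. The abbreviation table and the accepted date formats are small, assumed examples; production standardizers rely on much larger dictionaries and locale-specific rules.

    from datetime import datetime

    # Tiny, assumed abbreviation table; real address standardizers use far
    # larger dictionaries (street types, directionals, unit designators, ...).
    STREET_ABBREVIATIONS = {"west": "W.", "fourth": "4th", "street": "St."}

    def standardize_street(street):
        tokens = [STREET_ABBREVIATIONS.get(token.lower().strip("."), token)
                  for token in street.split()]
        return " ".join(tokens)

    def standardize_date(value):
        # Try a few common operational formats and emit a single ISO format.
        for fmt in ("%m/%d/%Y", "%d-%b-%Y", "%Y-%m-%d"):
            try:
                return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        return value    # leave unrecognized values untouched

    # standardize_street("44 West Fourth Street")  -> "44 W. 4th St."
    # standardize_date("07/04/2005")               -> "2005-07-04"
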

References
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic Local Alignment Search Tool," Journal of Molecular Biology, 1990.
S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, 1970.
E. R. Ziegel, "The Elements of Statistical Learning," 2003.
Frequently Asked Questions (14)
Q1. What is the way to avoid manual labeling of the comparison vectors?

One way to avoid manual labeling of the comparison vectors is to use clustering algorithms, and group together similar comparison vectors. 

One way of avoiding the need for training data is to define a distance metric for records which does not need tuning through training data.

By using a feature selection algorithm (e.g., [44]) as a preprocessing step, the record comparison process uses only a small subset of the record fields, which speeds up the comparison process.

Ananthakrishna et al. show that by using foreign key co-occurrence information, they can substantially improve the quality of duplicate detection in databases that use multiple tables to store the entries of a record. 

2) Assign the following codes to the remaining letters:
• B, F, P, V → 1
• C, G, J, K, Q, S, X, Z → 2
• D, T → 3

The token similarity is measured using a metric that works well for short strings, such as edit distance and Jaro.
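
The step quoted above corresponds to the digit-assignment stage of the Soundex phonetic coding scheme. As a companion, here is a minimal Soundex sketch in Python; the steps surrounding the digit assignment follow the common textbook formulation (and ignore the special treatment of H and W used by some variants), since the survey's full description is not part of this excerpt.

    # Letter-to-digit table from the step above, completed with the
    # remaining standard Soundex groups (L=4, M/N=5, R=6).
    SOUNDEX_CODES = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            SOUNDEX_CODES[letter] = digit

    def soundex(name):
        name = "".join(c for c in name.upper() if c.isalpha())
        if not name:
            return ""
        # Vowels, H, W, and Y get a placeholder code "0"; they are dropped
        # from the output but still break runs of identical consonant codes.
        codes = [SOUNDEX_CODES.get(c, "0") for c in name]
        result, prev = name[0], codes[0]
        for code in codes[1:]:
            if code != "0" and code != prev:
                result += code
            prev = code
        return (result + "000")[:4]

    # soundex("Robert") -> "R163", soundex("Rupert") -> "R163"
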

The basic idea, also known as co-training [10], is to use very few labeled data and then use unsupervised learning techniques to appropriately label the data with unknown labels.

A repository of benchmark data sources with known and diverse characteristics should be made available to developers so they may evaluate their methods during the development process. 

With the appropriate use of hash-based indexes, the average time required for computing the q-gram overlap between two strings σ1 and σ2 is O(max{|σ1|, |σ2|}). 
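
A minimal sketch of this hash-based computation in Python: the q-grams of each string are stored in a hash table (a Counter), so the multiset overlap is computed in expected time linear in the longer string. The value q = 2 and the padding character are assumptions for the example.

    from collections import Counter

    def qgrams(s, q=2):
        # Pad the string so that its prefix and suffix also form q-grams.
        padded = "#" * (q - 1) + s + "#" * (q - 1)
        return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

    def qgram_overlap(s1, s2, q=2):
        # Hash-based multiset intersection of the two q-gram collections.
        g1, g2 = qgrams(s1, q), qgrams(s2, q)
        return sum(min(count, g2[gram]) for gram, count in g1.items())

    # qgram_overlap("Microsoft", "Microsft") counts the bigrams the two
    # spellings share, which stays high despite the typographical error.
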

The distance between two strings can be computed using a dynamic programming technique, based on the Needleman and Wunsch algorithm [60]. 
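
For illustration, a minimal dynamic-programming sketch in Python for unit-cost edit distance; the Needleman-Wunsch formulation generalizes this recurrence with configurable gap and substitution costs, which are omitted here.

    def edit_distance(s1, s2):
        # d[i][j]: cost of transforming the first i characters of s1 into
        # the first j characters of s2; only the previous row is kept.
        m, n = len(s1), len(s2)
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # deletion
                              curr[j - 1] + 1,     # insertion
                              prev[j - 1] + cost)  # match or substitution
            prev = curr
        return prev[n]

    # edit_distance("Microsft", "Microsoft") -> 1 (one missing character)
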

By setting thresholds for the conditional error on M and U, the authors can define the reject region and the reject probability, which measure the probability of directing a record pair to an expert for review.

Verykios et al. show that the classifiers generated using the new, larger training set have high accuracy, and require only a minimal number of pre-labeled record pairs.

When conditional independence is not a reasonable assumption, Winkler [97] suggested using the general expectation maximization algorithm to estimate p(x|M) and p(x|U).

The edit distance metrics work well for catching typographical errors, but they are typically ineffective for other types of mismatches. 

While the Fellegi-Sunter approach dominated the field for more than two decades, the development of new classification techniques in the machine learning and statistics communities prompted the development of new deduplication techniques.