Journal ArticleDOI

Duplicate Record Detection: A Survey

TL;DR: This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.
Abstract: Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.

Summary (3 min read)

Introduction

  • Often, in the real world, entities have two or more representations in databases.
  • The authors should note that the algorithms developed for mirror detection or for anaphora resolution are often applicable for the task of duplicate detection.
  • The authors will use the term duplicate record detection in this paper.
  • Date and time formatting and name and title formatting pose other standardization difficulties in a database.
  • In the next section, the authors describe techniques for measuring the similarity of individual fields, and later, in Section IV, they describe techniques for measuring the similarity of entire records.

A. Character-based similarity metrics

  • The character-based similarity metrics are designed to handle typographical errors well (a minimal edit-distance sketch follows this list).
  • Pinheiro and Sun [70] proposed a similar similarity measure, which tries to find the best character alignment for the two compared strings σ1 and σ2, so that the number of character mismatches is minimized.
  • The q-grams are short character substrings of length q of the database strings [89], [90].
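To make the character-based metrics concrete, here is a minimal sketch of the classic Levenshtein edit distance, computed with the usual dynamic-programming recurrence. The function names and the normalization to [0, 1] are illustrative choices, not notation from the survey.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn s1 into s2."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))              # row for the empty prefix of s1
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete s1[i-1]
                          curr[j - 1] + 1,     # insert s2[j-1]
                          prev[j - 1] + cost)  # substitute (or keep on a match)
        prev = curr
    return prev[n]


def edit_similarity(s1: str, s2: str) -> float:
    """Normalize the distance to a similarity in [0, 1]; 1.0 means identical."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - edit_distance(s1, s2) / max(len(s1), len(s2))


print(edit_similarity("Microsft", "Microsoft"))   # high despite the typo
```

Dividing by the length of the longer string gives a score that tolerates a small number of typographical errors in longer fields.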

B. Token-based similarity metrics

  • Character-based similarity metrics work well for typographical errors.
  • It is often the case that typographical conventions lead to rearrangement of words (e.g., John Smith vs. Smith, John).
  • Based on this algorithm, the similarity of two fields is the number of their matching atomic strings divided by their average number of atomic strings.
  • Also, the introduction of frequent words affects the similarity of the two strings only minimally, due to the low idf weight of the frequent words.
  • This metric handles the insertion and deletion of words nicely (a tf-idf cosine sketch follows this list).
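The following is a minimal sketch of a token-based, tf-idf weighted cosine similarity in the spirit described above. The tokenization, the smoothed idf formula, and the toy field values are illustrative assumptions, not the exact weighting used by the systems the survey cites.

```python
import math
from collections import Counter


def tfidf_vectors(field_values):
    """Build tf-idf weighted token vectors for a small collection of field values."""
    docs = [v.lower().replace(",", " ").split() for v in field_values]
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf, so very frequent tokens get a low (but nonzero) weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vectors


def cosine(v1, v2):
    """Cosine similarity of two sparse weighted vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


v = tfidf_vectors(["John Smith", "Smith, John", "Jane Smith"])
print(cosine(v[0], v[1]))   # ~1.0: word order does not matter
print(cosine(v[0], v[2]))   # lower: only the frequent token "smith" is shared
```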

C. Phonetic similarity metrics

  • Character-level and token-based similarity metrics focus on the string-based representation of the database records.
  • Strings may be phonetically similar even if they are not similar in a character or token level.
  • When the names are of predominantly East Asian origin, this code is less satisfactory, because much of the discriminating power of these names resides in the vowel sounds, which the code ignores.
  • The introduction of multiple phonetic encodings greatly enhances the matching performance, with rather small overhead (a Soundex sketch follows this list).
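The bullets above refer to phonetic codes such as Soundex (the letter-group codes 1, 2, 3 appear again in the FAQ excerpt later on this page). Below is a minimal sketch of a simplified Soundex encoder; the handling of H, W, and padding varies across implementations, so treat this as one illustrative variant rather than a definitive one.

```python
def soundex(name: str) -> str:
    """Simplified Soundex: keep the first letter, encode the remaining
    consonants, skip repeated codes, and pad/truncate to four characters."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # skip letters that repeat the previous code
            result += code
        if ch not in "HW":          # H and W do not break a run of equal codes;
            prev = code             # vowels (empty code) do
    return (result + "000")[:4]     # letter plus three digits


print(soundex("Smith"), soundex("Smyth"))   # S530 S530 -> phonetically equivalent
```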

D. Numeric Similarity Metrics

  • While multiple methods exist for detecting similarities of string-based data, the methods for capturing similarities in numeric data are rather primitive.
  • Typically, the numbers are either treated as strings (and compared using the metrics described above) or matched with simple range queries that locate numbers with similar values (a small sketch follows this list).
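Since the survey only notes string treatment and range queries for numeric fields, the sketch below shows one simple range-style comparison; the tolerance parameter and the scaling to [0, 1] are assumptions made for illustration.

```python
def numeric_similarity(a: float, b: float, tolerance: float) -> float:
    """Map the absolute difference into [0, 1]; values farther apart
    than `tolerance` are considered completely dissimilar."""
    if tolerance <= 0:
        return float(a == b)
    return max(0.0, 1.0 - abs(a - b) / tolerance)


print(numeric_similarity(1999, 2001, tolerance=5))   # 0.6 -> plausibly the same year
print(numeric_similarity(1999, 2019, tolerance=5))   # 0.0
```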

E. Concluding Remarks

  • The large number of field comparison metrics reflects the large number of errors or transformations that may occur in real-life data.
  • They show that the Monge-Elkan metric has the highest average performance across data sets and across character-based distance metrics.
  • The authors review methods that are used for matching records with multiple fields.
  • The rest of this section is organized as follows: initially, in Section IV-A the authors describe the notation.
  • Finally, Section IV-G covers unsupervised machine learning techniques, and Section IV-H provides some concluding remarks.

B. Probabilistic Matching Models

  • Newcombe et al. [64] were the first to recognize duplicate detection as a Bayesian inference problem.
  • The main assumption is that x is a random vector whose density function is different for each of the two classes.
  • The values of p(xi|M) and p(xi|U) can be computed using a training set of pre-labeled record pairs (a minimal sketch follows this list).
  • 2) The Bayes Decision Rule for Minimum Cost: Often, in practice, the minimization of the probability of error is not the best criterion for creating decision rules, as the misclassifications of M and U samples may have different consequences.
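A minimal sketch of the naive-Bayes style scoring implied above: score a comparison vector x by the log-likelihood ratio of the per-field probabilities p(xi|M) and p(xi|U), assuming conditional independence of the field comparisons given the class. The per-field probabilities and the decision threshold below are made-up values for illustration; in practice they are estimated from labeled pairs (or via EM) and the threshold follows from the error or cost trade-off.

```python
import math


def log_likelihood_ratio(x, p_m, p_u):
    """Sum over fields of log p(x_i | M) / p(x_i | U) for a binary comparison vector x."""
    score = 0.0
    for xi, (pm1, pu1) in zip(x, zip(p_m, p_u)):
        pm = pm1 if xi else 1 - pm1
        pu = pu1 if xi else 1 - pu1
        score += math.log(pm / pu)
    return score


# Illustrative (made-up) per-field agreement probabilities.
p_match   = [0.95, 0.90, 0.80]   # P(field i agrees | M)
p_unmatch = [0.05, 0.20, 0.30]   # P(field i agrees | U)

x = [1, 1, 0]                    # comparison vector: name and address agree, phone differs
score = log_likelihood_ratio(x, p_match, p_unmatch)
# Declare a match when the score exceeds a threshold chosen for the desired
# error (or misclassification cost) trade-off.
print(score, score > math.log(10))
```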

C. Supervised and Semi-Supervised Learning

  • The probabilistic model uses a Bayesian approach to classify record pairs into two classes, M and U .
  • While the Fellegi-Sunter approach dominated the field for more than two decades, the development of new classification techniques in the machine learning and statistics communities prompted the development of new deduplication techniques.
  • A typical post-processing step for these techniques (including the probabilistic techniques of Section IV-B) is to construct a graph for all the records in the database, linking together the matching records (see the clustering sketch after this list).
  • The underlying assumption is that the only differences are due to different representations of the same entity (e.g., Google and Google Inc.) and that there is no erroneous information in the attribute values (e.g., someone mistakenly entering Bismarck, ND as the location of Google headquarters).
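A minimal sketch of the graph-based post-processing step: link every matched pair and take connected components (the transitive closure of the pairwise matches) as the final entity clusters. The union-find structure and the toy pair list are illustrative, not taken from the survey.

```python
def cluster_matches(num_records: int, matching_pairs):
    """Group records into entities by taking the connected components
    of the 'matches' graph (the transitive closure of pairwise links)."""
    parent = list(range(num_records))

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]   # path compression
            r = parent[r]
        return r

    for a, b in matching_pairs:
        parent[find(a)] = find(b)           # union the two clusters

    clusters = {}
    for r in range(num_records):
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())


# Pairs (0,1) and (1,2) imply records 0, 1, 2 all describe the same entity.
print(cluster_matches(4, [(0, 1), (1, 2)]))   # [[0, 1, 2], [3]]
```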

D. Active-Learning-Based Techniques

  • One of the problems with the supervised learning techniques is the requirement for a large number of training examples.
  • The main idea behind ALIAS is that most duplicate and non-duplicate pairs are clearly distinct.
  • Subsequently, the initial classifier is used for predicting the status of unlabeled pairs of records.
  • The goal is to seek out from the unlabeled data pool those instances which, when labeled, will improve the accuracy of the classifier at the fastest possible rate.
  • Using this technique, ALIAS can quickly learn the peculiarities of a data set and rapidly detect duplicates using only a small amount of training data (an uncertainty-sampling sketch follows this list).
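The selection step can be sketched as uncertainty sampling: ask a human to label the unlabeled pairs the current classifier is least sure about. This is a generic sketch of that idea, not ALIAS's exact selection criterion (which measures the disagreement of a committee of classifiers); the probability function and the pair list are stand-ins.

```python
def select_uncertain_pairs(pairs, predict_proba, budget: int):
    """Pick the unlabeled record pairs whose predicted match probability is
    closest to 0.5; labeling these is expected to help the classifier most."""
    scored = [(abs(predict_proba(p) - 0.5), p) for p in pairs]
    scored.sort(key=lambda t: t[0])
    return [p for _, p in scored[:budget]]


# Toy example with a stand-in probability function.
pairs = [("rec1", "rec2"), ("rec1", "rec3"), ("rec2", "rec4")]
probs = {pairs[0]: 0.97, pairs[1]: 0.52, pairs[2]: 0.08}
print(select_uncertain_pairs(pairs, probs.get, budget=1))   # [('rec1', 'rec3')]
```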

E. Distance-Based Techniques

  • Even active learning techniques require some training data or some human effort to create the matching models.
  • Guha et al. map the problem into the minimum cost perfect matching problem, and then develop efficient solutions for identifying the top-k matching records.
  • This approach is conceptually similar to the work of Perkowitz et al. [67] and of Dasu et al. [25], which examine the contents of fields to locate the matching fields across two tables (see Section II).
  • This would nullify the major advantage of distance-based techniques, which is the ability to operate without training data.
  • Recently, Chaudhuri et al. [16] proposed a new framework for distance-based duplicate detection, observing that the distance threshold for detecting real duplicate entries differs from one database tuple to another (a record-distance sketch follows this list).
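A minimal sketch of a distance-based matcher: combine per-field distances into a record distance with hand-set weights and flag pairs below a threshold, with no training data involved. The Jaccard field distance, the weights, and the threshold here are illustrative assumptions.

```python
def record_distance(rec1: dict, rec2: dict, field_distance, weights: dict) -> float:
    """Distance between two records as a weighted combination of per-field distances."""
    total_weight = sum(weights.values())
    return sum(w * field_distance(rec1[f], rec2[f]) for f, w in weights.items()) / total_weight


def jaccard_distance(a: str, b: str) -> float:
    """Token-based field distance: 1 minus the Jaccard overlap of the word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(ta & tb) / len(ta | tb) if ta | tb else 0.0


r1 = {"name": "Google Inc.", "city": "Mountain View"}
r2 = {"name": "Google", "city": "Mountain View"}
d = record_distance(r1, r2, jaccard_distance, {"name": 0.7, "city": 0.3})
print(d, d < 0.4)   # flag as a duplicate when below a (possibly per-record) threshold
```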

F. Rule-based Approaches

  • Wang and Madnick [94] proposed a rule-based approach for the duplicate detection problem.
  • By using such rules, Wang and Madnick hoped to generate unique keys that can cluster multiple records that represent the same real-world entity.
  • Specifying such an inference in the equational theory requires a declarative rule language.
  • AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations starting from some input source data.
  • It is noteworthy that such rule-based approaches, which require a human expert to devise meticulously crafted matching rules, typically result in systems with high accuracy (an example rule is sketched after this list).
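For flavor, the snippet below encodes one hand-crafted matching rule of the kind an equational theory might express (same SSN, or same last name with approximately equal addresses). The specific fields, the similarity routine, and the threshold are hypothetical choices for illustration, not rules taken from the cited systems.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Approximate string equality used by the rule below."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def same_person(r1: dict, r2: dict) -> bool:
    """Hand-crafted matching rule: records denote the same person if their
    SSNs agree, or if the last names agree exactly and the street addresses
    are approximately equal."""
    if r1.get("ssn") and r1["ssn"] == r2.get("ssn"):
        return True
    return (r1["last_name"].lower() == r2["last_name"].lower()
            and similar(r1["address"], r2["address"]))


a = {"ssn": "", "last_name": "Smith", "address": "44 W. 4th St."}
b = {"ssn": "", "last_name": "Smith", "address": "44 West 4th St."}
print(same_person(a, b))   # True
```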

H. Concluding Remarks

  • There are multiple techniques for duplicate record detection.
  • The authors can divide the techniques into two broad categories: ad-hoc techniques that work quickly on existing relational databases, and more principled techniques that are based on probabilistic inference models.
  • V. IMPROVING THE EFFICIENCY OF DUPLICATE DETECTION.
  • In Section V-A the authors describe techniques that substantially reduce the number of required comparisons.
  • Another factor that can lead to increased computation expense is the cost required for a single comparison.

A. Reducing the Number of Record Comparisons

  • One traditional method for identifying identical records in a database table is to scan the table and compute the value of a hash function for each record.
  • Verykios et al. [91] propose a set of techniques for reducing the complexity of record comparison (a simple blocking sketch follows this list).
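A minimal sketch of the blocking idea behind such techniques: derive a cheap key per record and compare in detail only records that share a key, instead of all O(n^2) pairs. The particular key (last-name prefix plus zip code) is an illustrative assumption.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Illustrative blocking key: first three letters of the last name plus the zip code."""
    return record.get("last_name", "")[:3].upper() + "|" + record.get("zipcode", "")


def candidate_pairs(records):
    """Group records into blocks and emit only within-block pairs."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


records = [
    {"last_name": "Smith", "zipcode": "10012"},
    {"last_name": "Smyth", "zipcode": "10012"},   # lands in a different block (SMY|10012)
    {"last_name": "Smith", "zipcode": "10012"},
]
print(list(candidate_pairs(records)))   # [(0, 2)]
```

The trade-off is the usual one: coarser keys produce fewer missed duplicates but more candidate pairs to compare in detail.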


Duplicate Record Detection: A Survey
Ahmed K. Elmagarmid
Purdue University
Panagiotis G. Ipeirotis
New York University
Vassilios S. Verykios
University of Thessaly

Abstract
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this article, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with a coverage of existing tools and with a brief discussion of the big open problems in the area.
Index Terms
duplicate detection, data cleaning, data integration, record linkage, data deduplication, instance identification, database hardening, name matching, identity uncertainty, entity resolution, fuzzy duplicate detection, entity matching
I. INTRODUCTION
Databases play an important role in today's IT-based economy. Many industries and systems depend on the accuracy of databases to carry out operations. Therefore, the quality of the information (or the lack thereof) stored in the databases can have significant cost implications to a system that relies on information to function and conduct business. In an error-free system with perfectly clean data, the construction of a comprehensive view of the data consists of linking (in relational terms, joining) two or more tables on their key fields. Unfortunately, data often lack a unique, global identifier that would permit such an operation. Furthermore, the data are neither carefully controlled for quality nor defined in a consistent way across different data sources. Thus, data quality is often compromised by many factors, including data entry errors (e.g., Microsft instead of Microsoft), missing integrity constraints (e.g., allowing entries such as EmployeeAge=567), and multiple conventions for recording information (e.g., 44 W. 4th St. vs. 44 West Fourth Street). To make things worse, in independently managed databases not only the values, but also the structure, semantics, and underlying assumptions about the data may differ as well.

Often, while integrating data from different sources to implement a data warehouse, organizations become aware of potential systematic differences or conflicts. Such problems fall under the umbrella term data heterogeneity [14]. Data cleaning [77], or data scrubbing [96], refers to the process of resolving such identification problems in the data. We distinguish between two types of data heterogeneity: structural and lexical. Structural heterogeneity occurs when the fields of the tuples in the database are structured differently in different databases. For example, in one database the customer address might be recorded in one field named, say, addr, while in another database the same information might be stored in multiple fields such as street, city, state, and zipcode. Lexical heterogeneity occurs when the tuples have identically structured fields across databases, but the data use different representations to refer to the same real-world object (e.g., StreetAddress=44 W. 4th St. vs. StreetAddress=44 West Fourth Street).
In this paper, we focus on the problem of lexical heterogeneity and survey various techniques which have been developed for addressing this problem. We focus on the case where the input is a set of structured and properly segmented records, i.e., we focus mainly on cases of database records. Hence, we do not cover solutions for the various other problems, such as that of mirror detection, in which the goal is to detect similar or identical web pages (e.g., see [13], [18]). Also, we do not cover solutions for problems such as anaphora resolution [56], in which the problem is to locate different mentions of the same entity in free text (e.g., that the phrase "President of the U.S." refers to the same entity as "George W. Bush"). We should note that the algorithms developed for mirror detection or for anaphora resolution are often applicable for the task of duplicate detection. Techniques for mirror detection have been used for detection of duplicate database records (see, for example, Section V-A.4), and techniques for anaphora resolution are commonly used as an integral part of deduplication in relations that are extracted from free text using information extraction systems [52].
The problem that we study has been known for more than five decades as the record linkage or the record matching problem [31], [61]-[64], [88] in the statistics community. The goal of record matching is to identify records in the same or different databases that refer to the same real-world entity, even if the records are not identical. In slightly ironic fashion, the same problem has multiple names across research communities. In the database community, the problem is described as merge-purge [39], data deduplication [78], and instance identification [94]; in the AI community, the same problem is described as database hardening [21] and name matching [9]. The names coreference resolution, identity uncertainty, and duplicate detection are also commonly used to refer to the same task. We will use the term duplicate record detection in this paper.
The remaining part of this paper is organized as follows: In Section II, we briefly discuss the necessary steps in the data cleaning process, before the duplicate record detection phase. Then, Section III describes techniques used to match individual fields, and Section IV presents techniques for matching records that contain multiple fields. Section V describes methods for improving the efficiency of the duplicate record detection process, and Section VI presents a few commercial, off-the-shelf tools used in industry for duplicate record detection and for evaluating the initial quality of the data and of the matched records. Finally, Section VII concludes the paper and discusses interesting directions for future research.
II. DATA PREPARATION
Duplicate record detection is the process of identifying different or multiple records that refer to one unique real-world entity or object. Typically, the process of duplicate detection is preceded by a data preparation stage, during which data entries are stored in a uniform manner in the database, resolving (at least partially) the structural heterogeneity problem. The data preparation stage includes a parsing, a data transformation, and a standardization step. The approaches that deal with data preparation are also described under the term ETL (Extraction, Transformation, Loading) [43]. These steps improve the quality of the in-flow data and make the data comparable and more usable. While data preparation is not the focus of this survey, for completeness we briefly describe the tasks performed in that stage. A comprehensive collection of papers related to various data transformation approaches can be found in [74].
Parsing is the rst critical component in the data preparation stage. Parsing locates, identies
and isolates individual data elements in the source les. Parsing makes it easier to correct,
standardize, and match data because it allows the comparison of individual components, rather
than of long complex strings of data. For example, the appropriate parsing of name and address
components into consistent packets of information is a crucial part in the data cleaning process.
Multiple parsing methods have been proposed recently in the literature (e.g., [1], [11], [53], [71],
[84]) and the area continues to be an active eld of research.
Data transformation refers to simple conversions that can be applied to the data in order for them to conform to the data types of their corresponding domains. In other words, this type of conversion focuses on manipulating one field at a time, without taking into account the values in related fields. The most common form of a simple transformation is the conversion of a data element from one data type to another. Such a data type conversion is usually required when a legacy or parent application stored data in a data type that makes sense within the context of the original application, but not in a newly developed or subsequent system. Renaming of a field from one name to another is considered data transformation as well. Encoded values in operational systems and in external data are another problem that is addressed at this stage. These values should be converted to their decoded equivalents, so records from different sources can be compared in a uniform manner. Range checking is yet another kind of data transformation which involves examining data in a field to ensure that it falls within the expected range, usually a numeric or date range. Lastly, dependency checking is slightly more involved since it requires comparing the value in a particular field to the values in another field, to ensure a minimal level of consistency in the data.
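A small sketch of the transformations just described, covering type conversion and range checking, together with a date-format conversion of the kind the standardization step below also needs. The accepted date formats and the age bounds are assumptions chosen for the example.

```python
from datetime import datetime


def transform_age(value: str) -> int:
    """Type conversion plus range checking for an age field: convert the raw
    string to an integer and reject values outside a plausible range."""
    age = int(value)
    if not 0 <= age <= 120:
        raise ValueError(f"age out of range: {age}")
    return age


def transform_date(value: str) -> str:
    """Convert one of several observed date formats to ISO 8601."""
    for fmt in ("%m/%d/%Y", "%d-%m-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")


print(transform_date("08/13/2006"))   # 2006-08-13
# transform_age("567") would raise, catching entries like EmployeeAge=567
```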
Data standardization refers to the process of standardizing the information represented in certain fields to a specific content format. This is used for information that can be stored in many different ways in various data sources and must be converted to a uniform representation before the duplicate detection process starts. Without standardization, many duplicate entries could erroneously be designated as non-duplicates, based on the fact that common identifying information cannot be compared. One of the most common standardization applications involves address information. There is no one standardized way to capture addresses, so the same address can be represented in many different ways. Address standardization locates (using various parsing techniques) components such as house numbers, street names, post office boxes, apartment numbers, and rural routes, which are then recorded in the database using a standardized format (e.g., 44 West Fourth Street is stored as 44 W4th St.). Date and time formatting and name and title formatting pose other standardization difficulties in a database. Typically, when operational applications are designed and constructed, there is very little uniform handling of date and time formats across applications. Because most operational environments have many different formats for representing dates and times, there is a need to transform dates and times into a standardized format. Name standardization identifies components such as first names, last names, title, and middle initials, and records everything using some standardized convention. Data standardization

Citations
Journal ArticleDOI
TL;DR: The authors describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked data community as it moves forward.
Abstract: The term “Linked Data” refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions— the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.

5,113 citations

Book
05 Jun 2007
TL;DR: The second edition of Ontology Matching has been thoroughly revised and updated to reflect the most recent advances in this quickly developing area, which resulted in more than 150 pages of new content.
Abstract: Ontologies tend to be found everywhere. They are viewed as the silver bullet for many applications, such as database integration, peer-to-peer systems, e-commerce, semantic web services, or social networks. However, in open or evolving systems, such as the semantic web, different parties would, in general, adopt different ontologies. Thus, merely using ontologies, like using XML, does not reduce heterogeneity: it just raises heterogeneity problems to a higher level. Euzenat and Shvaikos book is devoted to ontology matching as a solution to the semantic heterogeneity problem faced by computer systems. Ontology matching aims at finding correspondences between semantically related entities of different ontologies. These correspondences may stand for equivalence as well as other relations, such as consequence, subsumption, or disjointness, between ontology entities. Many different matching solutions have been proposed so far from various viewpoints, e.g., databases, information systems, and artificial intelligence. The second edition of Ontology Matching has been thoroughly revised and updated to reflect the most recent advances in this quickly developing area, which resulted in more than 150 pages of new content. In particular, the book includes a new chapter dedicated to the methodology for performing ontology matching. It also covers emerging topics, such as data interlinking, ontology partitioning and pruning, context-based matching, matcher tuning, alignment debugging, and user involvement in matching, to mention a few. More than 100 state-of-the-art matching systems and frameworks were reviewed. With Ontology Matching, researchers and practitioners will find a reference book that presents currently available work in a uniform framework. In particular, the work and the techniques presented in this book can be equally applied to database schema matching, catalog integration, XML schema matching and other related problems. The objectives of the book include presenting (i) the state of the art and (ii) the latest research results in ontology matching by providing a systematic and detailed account of matching techniques and matching systems from theoretical, practical and application perspectives.

2,579 citations

Book
02 Feb 2011
TL;DR: This Synthesis lecture provides readers with a detailed technical introduction to Linked Data, including coverage of relevant aspects of Web architecture, as the basis for application development, research or further study.
Abstract: The World Wide Web has enabled the creation of a global information space comprising linked documents. As the Web becomes ever more enmeshed with our daily lives, there is a growing desire for direct access to raw data not currently available on the Web or bound up in hypertext documents. Linked Data provides a publishing paradigm in which not only documents, but also data, can be a first class citizen of the Web, thereby enabling the extension of the Web with a global data space based on open standards - the Web of Data. In this Synthesis lecture we provide readers with a detailed technical introduction to Linked Data. We begin by outlining the basic principles of Linked Data, including coverage of relevant aspects of Web architecture. The remainder of the text is based around two main themes - the publication and consumption of Linked Data. Drawing on a practical Linked Data scenario, we provide guidance and best practices on: architectural approaches to publishing Linked Data; choosing URIs and vocabularies to identify and describe resources; deciding what data to return in a description of a resource on the Web; methods and frameworks for automated linking of data sets; and testing and debugging approaches for Linked Data deployments. We give an overview of existing Linked Data applications and then examine the architectures that are used to consume Linked Data from the Web, alongside existing tools and frameworks that enable these. Readers can expect to gain a rich technical understanding of Linked Data fundamentals, as the basis for application development, research or further study.

2,174 citations

BookDOI
01 Jan 2010
TL;DR: Scientific visualisation and geographic visualisation need information visualisation because they manage multi-valued data with complex topologies that can be visualised using their canonical geometry, and 3D systems use specific types of interfaces that are very different from traditional desktop interfaces.
Abstract: Blending different kinds of visualisations in the same application is becoming more frequent. Scientific visualisation and geographic visualisation need information visualisation because they manage multi-valued data with complex topologies that can be visualised using their canonical geometry. In addition, they can also be explored with more abstract visual representations to avoid geometric artefacts. For example, census data can be visualised as a coloured map, but also as a multi-dimensional dataset where the longitude and latitude are two attributes among others. Clustering this data by some similarity measure will then reveal places that can be far away in space but behave similarly in terms of other attributes (e.g., level of education, level of income, size of houses, etc.), a similarity that would not be visible on a map. On top of these visualisation systems, a user interface allows control of the overall application. User interfaces are well understood, but they can be very different in style. 3D systems use specific types of interfaces that are very different from traditional desktop interfaces. Moreover, information visualisation systems tend to deeply embed the interaction with the visualisation, offering special kinds of controls either directly inside the visualisations (e.g., range sliders on the axes of parallel coordinates) or around them, but with special kinds of widgets (e.g., range sliders for performing range queries). Interoperability can thus be described at several levels: at the data management level, at the architecture model level, and at the interface level. All visual analytics applications start with data that can be either statically collected or dynamically produced. Depending on the nature of the data, visual analytics applications have used various ways of managing their storage. In order of sophistication, these are: flat files using ad-hoc formats; structured file formats such as XML; specialised NoSQL systems, including cloud storage; standard or extended transactional databases (SQL); and workflow or dataflow systems integrating storage, distribution, and data processing. These data storage methods can be compared with respect to the levels of service required by visual analytics, such as persistence (which they all provide by definition), typing, distribution, atomic transactions, notification, interactive performance, and computation.

775 citations

Book
05 Jul 2012
TL;DR: Data matching (also known as record or data linkage, entity resolution, object identification, or field matching) is the task of identifying, matching and merging records that correspond to the same entities from several databases or even within one database as mentioned in this paper.
Abstract: Data matching (also known as record or data linkage, entity resolution, object identification, or field matching) is the task of identifying, matching and merging records that correspond to the same entities from several databases or even within one database. Based on research in various domains including applied statistics, health informatics, data mining, machine learning, artificial intelligence, database management, and digital libraries, significant advances have been achieved over the last decade in all aspects of the data matching process, especially on how to improve the accuracy of data matching, and its scalability to large databases. Peter Christens book is divided into three parts: Part I, Overview, introduces the subject by presenting several sample applications and their special challenges, as well as a general overview of a generic data matching process. Part II, Steps of the Data Matching Process, then details its main steps like pre-processing, indexing, field and record comparison, classification, and quality evaluation. Lastly, part III, Further Topics, deals with specific aspects like privacy, real-time matching, or matching unstructured data. Finally, it briefly describes the main features of many research and open source systems available today. By providing the reader with a broad range of data matching concepts and techniques and touching on all aspects of the data matching process, this book helps researchers as well as students specializing in data quality or data matching aspects to familiarize themselves with recent research advances and to identify open research challenges in the area of data matching. To this end, each chapter of the book includes a final section that provides pointers to further background and research material. Practitioners will better understand the current state of the art in data matching as well as the internal workings and limitations of current systems. Especially, they will learn that it is often not feasible to simply implement an existing off-the-shelf data matching system without substantial adaption and customization. Such practical considerations are discussed for each of the major steps in the data matching process.

713 citations

References
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
TL;DR: Chapter 11 includes more case studies in other areas, ranging from manufacturing to marketing research, and a detailed comparison with other diagnostic tools, such as logistic regression and tree-based methods.
Abstract: Chapter 11 includes more case studies in other areas, ranging from manufacturing to marketing research. Chapter 12 concludes the book with some commentary about the scientific contributions of MTS. The Taguchi method for design of experiment has generated considerable controversy in the statistical community over the past few decades. The MTS/MTGS method seems to lead another source of discussions on the methodology it advocates (Montgomery 2003). As pointed out by Woodall et al. (2003), the MTS/MTGS methods are considered ad hoc in the sense that they have not been developed using any underlying statistical theory. Because the “normal” and “abnormal” groups form the basis of the theory, some sampling restrictions are fundamental to the applications. First, it is essential that the “normal” sample be uniform, unbiased, and/or complete so that a reliable measurement scale is obtained. Second, the selection of “abnormal” samples is crucial to the success of dimensionality reduction when OAs are used. For example, if each abnormal item is really unique in the medical example, then it is unclear how the statistical distance MD can be guaranteed to give a consistent diagnosis measure of severity on a continuous scale when the larger-the-better type S/N ratio is used. Multivariate diagnosis is not new to Technometrics readers and is now becoming increasingly more popular in statistical analysis and data mining for knowledge discovery. As a promising alternative that assumes no underlying data model, The Mahalanobis–Taguchi Strategy does not provide sufficient evidence of gains achieved by using the proposed method over existing tools. Readers may be very interested in a detailed comparison with other diagnostic tools, such as logistic regression and tree-based methods. Overall, although the idea of MTS/MTGS is intriguing, this book would be more valuable had it been written in a rigorous fashion as a technical reference. There is some lack of precision even in several mathematical notations. Perhaps a follow-up with additional theoretical justification and careful case studies would answer some of the lingering questions.

11,507 citations


"Duplicate Record Detection: A Surve..." refers background in this paper

  • ...It can be easily shown [60] that the Bayes test results in the smallest probability of error and it is, in that respect, an optimal classifier....


  • ...classification and regression trees, a linear discriminant algorithm [60], which generates a linear combination of the parameters...


Frequently Asked Questions (14)
Q1. What is the way to avoid manual labeling of the comparison vectors?

One way to avoid manual labeling of the comparison vectors is to use clustering algorithms, and group together similar comparison vectors. 

One way of avoiding the need for training data is to define a distance metric for records, which does not need tuning through training data. 

By using a feature selection algorithm (e.g., [44]) as a preprocessing step, the record comparison process uses only a small subset of the record fields, which speeds up the comparison process. 

Ananthakrishna et al. show that by using foreign key co-occurrence information, they can substantially improve the quality of duplicate detection in databases that use multiple tables to store the entries of a record. 

2) Assign the following codes to the remaining letters: B, F, P, V → 1; C, G, J, K, Q, S, X, Z → 2; D, T → 3. (The token similarity is measured using a metric that works well for short strings, such as edit distance and Jaro.) 

The basic idea, also known as co-training [10], is to use very few labeled data, and then use unsupervised learning techniques to appropriately label the data with unknown labels. 

A repository of benchmark data sources with known and diverse characteristics should be made available to developers so they may evaluate their methods during the development process. 

With the appropriate use of hash-based indexes, the average time required for computing the q-gram overlap between two strings σ1 and σ2 is O(max{|σ1|, |σ2|}). 
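As a concrete illustration of q-gram overlap, here is a minimal sketch using hashed counters (Python dictionaries), which keeps the comparison roughly linear in the longer string; the padding character and the overlap normalization are assumptions made for the example.

```python
from collections import Counter


def qgrams(s: str, q: int = 2):
    """Multiset of q-grams, padded so prefixes and suffixes also contribute."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_overlap(s1: str, s2: str, q: int = 2) -> float:
    """Fraction of shared q-grams, computed with hash-based counters."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    shared = sum((g1 & g2).values())
    return shared / max(sum(g1.values()), sum(g2.values()))


print(qgram_overlap("Elmagarmid", "Elmagarmed"))   # high despite the misspelling
```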

The distance between two strings can be computed using a dynamic programming technique, based on the Needleman and Wunsch algorithm [60]. 

By setting thresholds for the conditional error on M and U, the authors can define the reject region and the reject probability, which measure the probability of directing a record pair to an expert for review. 

Verykios et al. show that the classifiers generated using the new, larger training set have high accuracy, and require only a minimal number of pre-labeled record pairs. 

When the conditional independence is not a reasonable assumption, then Winkler [97] suggested using the general expectation maximization algorithm to estimate p(x|M), p(x|U). 

The edit distance metrics work well for catching typographical errors, but they are typically ineffective for other types of mismatches. 

While the Fellegi-Sunter approach dominated the field for more than two decades, the development of new classification techniques in the machine learning and statistics communities prompted the development of new deduplication techniques.