Duplicate Record Detection: A Survey
Summary (3 min read)
Introduction
- Often, in the real world, entities have two or more representations in databases.
- The authors note that the algorithms developed for mirror detection or for anaphora resolution are often applicable to the task of duplicate detection.
- The authors will use the term duplicate record detection in this paper.
- Date and time formatting and name and title formatting pose other standardization difficulties in a database.
- In the next section, the authors describe techniques for measuring the similarity of individual fields, and later, in Section IV they describe techniques for measuring the similarity of entire records.
A. Character-based similarity metrics
- The character-based similarity metrics are designed to handle typographical errors well.
- Pinheiro and Sun [70] proposed a similarity measure that tries to find the best character alignment for the two compared strings σ1 and σ2, so that the number of character mismatches is minimized.
- The q-grams are short character substrings of length q of the database strings [89], [90].
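To make the q-gram idea concrete, here is a minimal sketch of a q-gram overlap similarity; the padding characters and the max-based normalization are illustrative choices, not something prescribed by the survey:

```python
def qgrams(s: str, q: int = 2) -> set:
    """Return the set of length-q character substrings of s, padded so that
    the first and last characters also contribute edge q-grams."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(s1: str, s2: str, q: int = 2) -> float:
    """Fraction of q-grams shared by the two strings (0 = disjoint, 1 = equal)."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    return len(g1 & g2) / max(len(g1), len(g2))

# e.g. qgram_similarity("johnson", "jonson") stays high despite the typo
```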
B. Token-based similarity metrics
- Character-based similarity metrics work well for typographical errors.
- It is often the case that typographical conventions lead to rearrangement of words (e.g., "John Smith" vs. "Smith, John").
- Based on this algorithm, the similarity of two fields is the number of their matching atomic strings divided by their average number of atomic strings (see the sketch after this list).
- Also, the introduction of frequent words affects the similarity of the two strings only minimally, due to the low idf weight of frequent words.
- This metric handles the insertion and deletion of words nicely.
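A minimal sketch of the atomic-string similarity mentioned above; splitting on whitespace and the prefix-matching convention for abbreviations are illustrative assumptions, and real implementations differ in such details:

```python
def atomic_string_similarity(field1: str, field2: str) -> float:
    """Similarity = number of matching atomic strings / average number of
    atomic strings. Here atomic strings are whitespace-delimited tokens, and
    two tokens 'match' if they are equal or one is a prefix of the other
    (an illustrative convention, e.g. "Univ" vs "University")."""
    tokens1 = field1.lower().split()
    tokens2 = field2.lower().split()
    used = set()
    matches = 0
    for t1 in tokens1:
        for j, t2 in enumerate(tokens2):
            if j not in used and (t1 == t2 or t1.startswith(t2) or t2.startswith(t1)):
                used.add(j)
                matches += 1
                break
    average = (len(tokens1) + len(tokens2)) / 2
    return matches / average if average else 1.0
```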
C. Phonetic similarity metrics
- Character-level and token-based similarity metrics focus on the string-based representation of the database records.
- Strings may be phonetically similar even if they are not similar in a character or token level.
- When the names are of predominantly East Asian origin, this code is less satisfactory, because much of the discriminating power of these names resides in the vowel sounds, which the code ignores.
- The introduction of multiple phonetic encodings greatly enhances the matching performance, with rather small overhead.
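For reference, a compact sketch of the classic Soundex encoding that this discussion refers to; treat it as illustrative rather than a reference implementation:

```python
def soundex(name: str) -> str:
    """Classic Soundex: keep the first letter, map the remaining consonants to
    digits, skip vowels, collapse runs of equal digits (H and W do not break
    a run), and pad the result with zeros to four characters."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    result, prev = name[0], codes.get(name[0], "")
    for c in name[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        if c not in "HW":          # vowels reset the run; H and W do not
            prev = code
    return result.ljust(4, "0")

# soundex("Robert") == soundex("Rupert") == "R163"
```

Note how the vowels are discarded entirely, which is exactly why the code performs poorly on names whose discriminating power lies in the vowel sounds.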
D. Numeric Similarity Metrics
- While multiple methods exist for detecting similarities of string-based data, the methods for capturing similarities in numeric data are rather primitive.
- Typically, the numbers are either treated as strings (and compared using the metrics described above) or matched with simple range queries that locate numbers with similar values (a sketch follows).
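A minimal sketch of one such range-style measure; the linear decay and the tolerance parameter are illustrative choices rather than anything prescribed by the survey:

```python
def numeric_similarity(a: float, b: float, tolerance: float) -> float:
    """Map the absolute difference between two numbers into [0, 1]:
    1.0 for equal values, decaying linearly to 0.0 once the values
    differ by `tolerance` or more."""
    if tolerance <= 0:
        return 1.0 if a == b else 0.0
    return max(0.0, 1.0 - abs(a - b) / tolerance)

# numeric_similarity(1995, 1996, tolerance=5) -> 0.8 (e.g. publication years)
```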
E. Concluding Remarks
- The large number of field comparison metrics reflects the large number of errors or transformations that may occur in real-life data.
- They show that the Monge-Elkan metric has the highest average performance across data sets and across character-based distance metrics.
- The authors review methods that are used for matching records with multiple fields.
- The rest of this section is organized as follows: initially, in Section IV-A the authors describe the notation.
- Finally, Section IV-G covers unsupervised machine learning techniques, and Section IV-H provides some concluding remarks.
B. Probabilistic Matching Models
- Newcombe et al. [64] were the first to recognize duplicate detection as a Bayesian inference problem.
- The main assumption is that x is a random vector whose density function is different for each of the two classes.
- The values of p(xi|M) and p(xi|U) can be computed using a training set of pre-labeled record pairs.
- 2) The Bayes Decision Rule for Minimum Cost: Often, in practice, the minimization of the probability of error is not the best criterion for creating decision rules, as the misclassifications of M and U samples may have different consequences.
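To make the model concrete, here is a hedged sketch of the decision rule under the conditional-independence (naive Bayes) assumption; the variable names and the binary agree/disagree encoding of the comparison vector are illustrative:

```python
import math

def log_likelihood_ratio(x, p_M, p_U):
    """x: binary comparison vector (x[i] = 1 if field i agrees across the pair).
    p_M[i] = P(x_i = 1 | M) and p_U[i] = P(x_i = 1 | U), estimated from a
    training set of pre-labeled record pairs; fields are assumed conditionally
    independent, and probabilities are assumed strictly between 0 and 1."""
    ratio = 0.0
    for xi, pm, pu in zip(x, p_M, p_U):
        ratio += math.log(pm / pu) if xi else math.log((1 - pm) / (1 - pu))
    return ratio

def decide(x, p_M, p_U, prior_M, cost_fp=1.0, cost_fn=1.0):
    """Minimum-cost Bayes rule: declare a match M when the likelihood ratio
    exceeds a threshold set by the class priors and the costs of a false
    match (cost_fp) versus a missed match (cost_fn)."""
    threshold = math.log(cost_fp * (1 - prior_M) / (cost_fn * prior_M))
    return "M" if log_likelihood_ratio(x, p_M, p_U) > threshold else "U"
```

With equal costs this reduces to the minimum-error Bayes test; raising cost_fp shifts the threshold so that more ambiguous pairs are classified as non-matches.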
C. Supervised and Semi-Supervised Learning
- The probabilistic model uses a Bayesian approach to classify record pairs into two classes, M and U.
- While the Fellegi-Sunter approach dominated the field for more than two decades, the development of new classification techniques in the machine learning and statistics communities prompted the development of new deduplication techniques.
- A typical post-processing step for these techniques (including the probabilistic techniques of Section IV-B) is to construct a graph for all the records in the database, linking together the matching records (see the sketch below).
- The underlying assumption is that the only differences are due to different representations of the same entity (e.g., "Google" and "Google Inc.") and that there is no erroneous information in the attribute values (e.g., someone mistakenly entering "Bismarck, ND" as the location of Google headquarters).
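A minimal sketch of that graph-linking post-processing step: records are clustered by taking the transitive closure of the pairwise match decisions, here via a standard union-find pass (the function names are illustrative):

```python
def cluster_matches(num_records: int, matching_pairs):
    """Group record ids into clusters: any two records connected by a chain
    of pairwise 'match' decisions end up in the same cluster."""
    parent = list(range(num_records))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for a, b in matching_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(num_records):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# cluster_matches(5, [(0, 1), (1, 2)]) -> [[0, 1, 2], [3], [4]]
```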
D. Active-Learning-Based Techniques
- One of the problems with the supervised learning techniques is the requirement for a large number of training examples.
- The main idea behind ALIAS is that most duplicate and non-duplicate pairs are clearly distinct.
- In the sequel, the initial classifier is used for predicting the status of unlabeled pairs of records.
- The goal is to seek out from the unlabeled data pool those instances which, when labeled, will improve the accuracy of the classifier at the fastest possible rate.
- Using this technique, ALIAS can quickly learn the peculiarities of a data set and rapidly detect duplicates using only a small amount of training data.
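The selection step can be sketched as follows; this is a generic uncertainty-sampling criterion in the spirit of ALIAS, not the paper's exact procedure, and it assumes a scikit-learn-style classifier exposing predict_proba:

```python
def most_ambiguous_pairs(classifier, candidate_features, k=10):
    """Rank unlabeled record pairs by how uncertain the current classifier is
    about them (predicted match probability closest to 0.5) and return the
    indices of the k most ambiguous pairs for a human to label next."""
    probabilities = classifier.predict_proba(candidate_features)[:, 1]
    order = sorted(range(len(probabilities)),
                   key=lambda i: abs(probabilities[i] - 0.5))
    return order[:k]
```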
E. Distance-Based Techniques
- Even active learning techniques require some training data or some human effort to create the matching models.
- Guha et al. map the problem to the minimum cost perfect matching problem, and then develop efficient solutions for identifying the top-k matching records.
- This approach is conceptually similar to the work of Perkowitz et al. [67] and of Dasu et al. [25], which examine the contents of fields to locate the matching fields across two tables (see Section II).
- This would nullify the major advantage of distance-based techniques, which is the ability to operate without training data.
- Recently, Chaudhuri et al. [16] proposed a new framework for distance-based duplicate detection, observing that the distance threshold for detecting real duplicate entries differs for each database tuple.
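A minimal sketch of the distance-based setup: a record distance assembled as a weighted combination of per-field distances, compared against a threshold. The weights and the single global threshold are illustrative (the per-tuple thresholds of Chaudhuri et al. would replace the constant):

```python
def record_distance(rec1, rec2, field_metrics, weights):
    """Weighted average of per-field distances; rec1/rec2 are tuples of field
    values, field_metrics holds one distance function per field, and weights
    encode the relative importance of each field."""
    total = sum(w * metric(a, b)
                for w, metric, a, b in zip(weights, field_metrics, rec1, rec2))
    return total / sum(weights)

def is_candidate_duplicate(rec1, rec2, field_metrics, weights, threshold=0.3):
    """Flag pairs whose distance falls below the threshold. No training data
    is needed, only the metrics, the weights, and the threshold."""
    return record_distance(rec1, rec2, field_metrics, weights) < threshold
```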
F. Rule-based Approaches
- Wang and Madnick [94] proposed a rule-based approach for the duplicate detection problem.
- By using such rules, Wang and Madnick hoped to generate unique keys that can cluster multiple records that represent the same real-world entity.
- Specifying such an inference in the equational theory requires a declarative rule language.
- AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations starting from some input source data.
- It is noteworthy that such rule-based approaches, which require a human expert to devise meticulously crafted matching rules, typically result in systems with high accuracy.
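To illustrate the flavor of such hand-crafted rules, here is a hedged sketch of one possible matching rule; the field names, the thresholds, and the token-Jaccard placeholder metric are all hypothetical:

```python
def address_similarity(a: str, b: str) -> float:
    """Placeholder field similarity (token Jaccard); any of the field-matching
    metrics from Section III could be substituted here."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def same_person_rule(r1: dict, r2: dict) -> bool:
    """IF the last names match AND the first names match (or one is just an
    initial of the other) AND the addresses are highly similar,
    THEN declare the two records duplicates."""
    if r1["last"].lower() != r2["last"].lower():
        return False
    f1, f2 = r1["first"].lower(), r2["first"].lower()
    first_ok = (f1 == f2) or (min(len(f1), len(f2)) == 1 and f1[:1] == f2[:1])
    return first_ok and address_similarity(r1["addr"], r2["addr"]) > 0.8
```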
H. Concluding Remarks
- There are multiple techniques for duplicate record detection.
- The authors can divide the techniques into two broad categories: ad-hoc techniques that work quickly on existing relational databases, and more principled techniques that are based on probabilistic inference models.
V. Improving the Efficiency of Duplicate Detection
- In Section V-A the authors describe techniques that substantially reduce the number of required comparisons.
- Another factor that can lead to increased computational expense is the cost of each individual record comparison.
A. Reducing the Number of Record Comparisons
- One traditional method for identifying identical records in a database table is to scan the table and compute the value of a hash function for each record (sketched below).
- Verykios et al. [91] propose a set of techniques for reducing the complexity of record comparison.
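That hash-based scan can be sketched as follows; the field-delimiter choice is illustrative, the scheme catches only identical representations (a strict implementation would re-compare colliding records), and near-duplicates still require the comparison techniques above:

```python
import hashlib
from collections import defaultdict

def exact_duplicate_groups(records):
    """One pass over the table: hash each record, then report the groups of
    record ids whose hashes collide, i.e. the identically represented rows."""
    buckets = defaultdict(list)
    for idx, record in enumerate(records):
        key = hashlib.sha1("|".join(map(str, record)).encode()).hexdigest()
        buckets[key].append(idx)
    return [ids for ids in buckets.values() if len(ids) > 1]

# exact_duplicate_groups([("john", "smith"), ("jane", "doe"), ("john", "smith")])
# -> [[0, 2]]
```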
Frequently Asked Questions (14)
Q2. What is the way to avoid the need for training data?
One way of avoiding the need for training data is to define a distance metric for records, which does not need tuning through training data.
Q3. What is the way to reduce the complexity of the record comparison process?
By using a feature selection algorithm (e.g., [44]) as a preprocessing step the record comparison process uses only a small subset of the record fields, which speeds up the comparison process.
Q4. How can the authors improve the quality of duplicate detection in databases?
Ananthakrishna et al. show that by using foreign key co-occurrence information, they can substantially improve the quality of duplicate detection in databases that use multiple tables to store the entries of a record.
Q5. What is the common metric used to measure token similarity?
The token similarity is measured using a metric that works well for short strings, such as edit distance and Jaro.
Q6. How do they learn to label the data?
The basic idea, also known as co-training [10], is to use very few labeled data, and then use unsupervised learning techniques to appropriately label the data with unknown labels.
Q7. What should be made available to developers?
A repository of benchmark data sources with known and diverse characteristics should be made available to developers so they may evaluate their methods during the development process.
Q8. How long does it take to compute the q-gram overlap between two strings?
With the appropriate use of hash-based indexes, the average time required for computing the q-gram overlap between two strings σ1 and σ2 is O(max{|σ1|, |σ2|}).
Q9. How can the authors compute the distance between two strings using a dynamic programming technique?
The distance between two strings can be computed using a dynamic programming technique, based on the Needleman and Wunsch algorithm [60].
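A minimal sketch of that dynamic program in its unit-cost (Levenshtein) form; the Needleman-Wunsch formulation generalizes it with alignment scores and gap penalties:

```python
def edit_distance(s1: str, s2: str) -> int:
    """d[i][j] = minimum number of insertions, deletions, and substitutions
    needed to turn s1[:i] into s2[:j]; the answer is d[len(s1)][len(s2)]."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + substitution)  # match / substitute
    return d[m][n]

# edit_distance("smith", "smyth") == 1
```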
Q10. What is the probability of directing a record pair to an expert?
By setting thresholds for the conditional error on M and U, the authors can define the reject region and the reject probability, which measure the probability of directing a record pair to an expert for review.
Q11. How many pre-labeled record pairs are required to learn matching models?
Verykios et al. show that the classifiers generated using the new, larger training set have high accuracy, and require only a minimal number of pre-labeled record pairs.
Q12. What is the way to estimate p(x|M)?
When the conditional independence is not a reasonable assumption, then Winkler [97] suggested using the general expectation maximization algorithm to estimate p(x|M) and p(x|U).
Q13. What are the effective edit distance metrics?
The edit distance metrics work well for catching typographical errors, but they are typically ineffective for other types of mismatches.
Q14. What was the main reason for the development of new deduplication techniques?
While the Fellegi-Sunter approach dominated the field for more than two decades, the development of new classification techniques in the machine learning and statistics communities prompted the development of new deduplication techniques.