Duplicate Record Detection: A Survey
References
Basic Local Alignment Search Tool
Maximum likelihood from incomplete data via the EM algorithm
Pattern Classification and Scene Analysis
A general method applicable to the search for similarities in the amino acid sequence of two proteins
The Elements of Statistical Learning
Frequently Asked Questions (14)
Q2. What is the way to avoid the need for training data?
One way to avoid the need for training data is to define a distance metric for records that does not require tuning through training data.
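As a rough illustration of such a metric (the field names, weights, and use of `difflib` are my assumptions, not the survey's), a training-free record distance can be a fixed, hand-weighted average of per-field string dissimilarities:

```python
# Illustrative sketch: a record distance that needs no training data,
# only a hand-chosen field weighting.
from difflib import SequenceMatcher

def field_sim(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_distance(r1: dict, r2: dict, weights=None) -> float:
    """Weighted average dissimilarity over a fixed set of fields."""
    weights = weights or {"name": 0.6, "city": 0.4}  # hypothetical fields
    total = sum(w * (1.0 - field_sim(r1.get(f, ""), r2.get(f, "")))
                for f, w in weights.items())
    return total / sum(weights.values())
```

Record pairs whose distance falls below a hand-picked threshold would be declared duplicates; no labeled pairs are needed, only a sensible choice of weights.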
Q3. What is the way to reduce the complexity of the record comparison process?
By using a feature selection algorithm (e.g., [44]) as a preprocessing step, the record comparison process uses only a small subset of the record fields, which speeds up the comparison process.
Q4. How can the authors improve the quality of duplicate detection in databases?
Ananthakrishna et al. show that by using foreign key co-occurrence information, they can substantially improve the quality of duplicate detection in databases that use multiple tables to store the entries of a record.
Q5. What is the common metric used to measure token similarity?
The token similarity is measured using a metric that works well for short strings, such as edit distance and the Jaro metric.
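For illustration, here is a minimal implementation of the Jaro metric, one of the short-string metrics the answer mentions (this sketch is mine, not code from the survey):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; higher means more similar."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1 = [False] * len(s1)
    m2 = [False] * len(s2)
    matches = 0
    # Count characters that agree within the matching window.
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters that are out of order.
    t, j = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3.0
```

On the classic example pair "MARTHA"/"MARHTA" the metric finds 6 matching characters and 1 transposition, giving a similarity of 17/18.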
Q6. How do they learn to label the data?
The basic idea, also known as co-training [10], is to start with very few labeled data and then use unsupervised learning techniques to appropriately label the data with unknown labels.
Q7. What should be made available to developers?
A repository of benchmark data sources with known and diverse characteristics should be made available to developers so they may evaluate their methods during the development process.
Q8. How long does it take to compute the q-gram overlap between two strings?
With the appropriate use of hash-based indexes, the average time required for computing the q-gram overlap between two strings σ1 and σ2 is O(max{|σ1|, |σ2|}).
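A hash-based dictionary makes this concrete: each string's q-grams are tabulated once, and the overlap is a single pass over the smaller table. The sketch below is my illustration of that idea, not code from the survey:

```python
# q-gram overlap via a hash-based index (a Python Counter): building each
# table and intersecting them is linear in the string lengths on average.
from collections import Counter

def qgrams(s: str, q: int = 2) -> Counter:
    """All overlapping substrings of length q (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_overlap(s1: str, s2: str, q: int = 2) -> int:
    """Number of q-grams shared by s1 and s2, counting multiplicity."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    return sum(min(c, g2[g]) for g, c in g1.items())
```

For example, "nichols" and "nichleson" share the bigrams "ni", "ic", and "ch", so their 2-gram overlap is 3.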
Q9. How can the authors compute the distance between two strings using a dynamic programming technique?
The distance between two strings can be computed using a dynamic programming technique, based on the Needleman and Wunsch algorithm [60].
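A compact sketch of that dynamic program follows (the unit edit costs are an illustrative choice; the Needleman and Wunsch formulation [60] allows arbitrary cost functions):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn s1 into s2, via dynamic programming."""
    prev = list(range(len(s2) + 1))   # row for the empty prefix of s1
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # delete c1
                           cur[j - 1] + 1,              # insert c2
                           prev[j - 1] + (c1 != c2)))   # substitute
        prev = cur
    return prev[-1]
```

Keeping only the previous row reduces memory from O(|s1| · |s2|) to O(|s2|) without changing the result.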
Q10. What is the probability of directing a record pair to an expert?
By setting thresholds for the conditional error on M and U, the authors can define the reject region and the reject probability, which measure the probability of directing a record pair to an expert for review.
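The following sketch illustrates the resulting three-way decision (the thresholds and the m- and u-probabilities are invented numbers, not values from the survey): pairs whose likelihood ratio falls between the two thresholds land in the reject region and go to a human expert.

```python
def likelihood_ratio(x, m, u):
    """p(x|M) / p(x|U) under conditional independence of the fields;
    x is a binary comparison vector, m and u the per-field probabilities."""
    r = 1.0
    for xj, mj, uj in zip(x, m, u):
        r *= mj / uj if xj else (1.0 - mj) / (1.0 - uj)
    return r

def decide(x, m, u, upper=100.0, lower=0.01):
    """Three-way Fellegi-Sunter style decision with illustrative thresholds."""
    r = likelihood_ratio(x, m, u)
    if r >= upper:
        return "match"
    if r <= lower:
        return "non-match"
    return "review"   # reject region: route the pair to an expert
```

Tightening the thresholds shrinks the error rates on M and U at the cost of a larger reject probability, i.e., more pairs sent for manual review.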
Q11. How many pre-labeled record pairs are required to learn matching models?
Verykios et al. show that the classifiers generated using the new, larger training set have high accuracy, and require only a minimal number of pre-labeled record pairs.
Q12. What is the way to estimate p(x|M)?
When conditional independence is not a reasonable assumption, Winkler [97] suggested using the general expectation maximization algorithm to estimate p(x|M) and p(x|U).
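As context for this answer, here is a minimal EM sketch under the simpler conditional-independence baseline (my illustration, not Winkler's general algorithm, which relaxes exactly this assumption): binary comparison vectors are modeled as a two-component Bernoulli mixture, yielding per-field estimates of p(x|M) and p(x|U).

```python
def em_fellegi_sunter(vectors, iters=50):
    """EM for a two-component Bernoulli mixture over binary comparison
    vectors; returns (match prior p, m-probabilities, u-probabilities)."""
    n, k = len(vectors), len(vectors[0])
    p = 0.5            # prior probability that a pair is a match
    m = [0.9] * k      # initial guess for P(x_j = 1 | M)
    u = [0.1] * k      # initial guess for P(x_j = 1 | U)
    for _ in range(iters):
        # E-step: posterior probability that each pair is a match.
        g = []
        for x in vectors:
            pm, pu = p, 1.0 - p
            for j in range(k):
                pm *= m[j] if x[j] else 1.0 - m[j]
                pu *= u[j] if x[j] else 1.0 - u[j]
            g.append(pm / (pm + pu))
        # M-step: re-estimate the prior and per-field probabilities.
        gm = sum(g)
        p = gm / n
        for j in range(k):
            m[j] = sum(gi * x[j] for gi, x in zip(g, vectors)) / gm
            u[j] = sum((1 - gi) * x[j] for gi, x in zip(g, vectors)) / (n - gm)
    return p, m, u
```

Because each m[j] and u[j] is estimated independently, this baseline breaks down when fields are correlated, which is the situation Winkler's general EM is designed to handle.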
Q13. What are the effective edit distance metrics?
The edit distance metrics work well for catching typographical errors, but they are typically ineffective for other types of mismatches.
Q14. What was the main reason for the development of new deduplication techniques?
While the Fellegi-Sunter approach dominated the field for more than two decades, the development of new classification techniques in the machine learning and statistics communities prompted the development of new deduplication techniques.