On Evaluation and Training-Set Construction for Duplicate Detection

Open Access Proceedings Article

Mikhail Bilenko and Raymond J. Mooney

pp. 7-12

TLDR

Two new approaches to collecting training data, static-active learning and weakly-labeled non-duplicates, are proposed, and experimental results on their effectiveness are presented.

Abstract:

A variety of experimental methodologies have been used to evaluate the accuracy of duplicate-detection systems. We advocate presenting precision-recall curves as the most informative evaluation methodology. We also discuss a number of issues that arise when evaluating and assembling training data for adaptive systems that use machine learning to tune themselves to specific applications. We consider several different application scenarios and experimentally examine the effectiveness of alternative methods of collecting training data under each scenario. We propose two new approaches to collecting training data called static-active learning and weakly-labeled non-duplicates, and present experimental results on their effectiveness.
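As an illustration of the precision-recall evaluation the abstract advocates, the following is a minimal sketch, not the authors' implementation. It assumes a duplicate detector has already assigned a similarity score to each candidate record pair and that gold duplicate labels are available; the function name `precision_recall_points` and the example scores are hypothetical.

```python
from typing import List, Tuple


def precision_recall_points(
    scored_pairs: List[Tuple[float, bool]],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) points for a ranked pair list.

    Each input item is (similarity_score, is_true_duplicate) for one
    candidate record pair. Pairs scoring at or above a threshold are
    treated as predicted duplicates.
    """
    total_duplicates = sum(1 for _, is_dup in scored_pairs if is_dup)
    if total_duplicates == 0:
        raise ValueError("need at least one true duplicate pair")

    # Rank pairs from most to least similar.
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)

    points = []
    true_positives = 0
    for rank, (score, is_dup) in enumerate(ranked, start=1):
        if is_dup:
            true_positives += 1
        # Emit one point per distinct score, after the last pair of a tie.
        if rank == len(ranked) or ranked[rank][0] != score:
            precision = true_positives / rank
            recall = true_positives / total_duplicates
            points.append((score, precision, recall))
    return points


# Hypothetical scores and labels for five candidate pairs.
example_pairs = [
    (0.95, True),
    (0.90, True),
    (0.80, False),
    (0.70, True),
    (0.40, False),
]
curve = precision_recall_points(example_pairs)
for threshold, precision, recall in curve:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Reporting every point, rather than a single accuracy or F-measure value at one threshold, shows how precision trades off against recall across the detector's whole operating range.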
