Open Access Proceedings Article

On Evaluation and Training-Set Construction for Duplicate Detection

TLDR
Two new approaches to collecting training data, static-active learning and weakly-labeled non-duplicates, are proposed, and experimental results on their effectiveness are presented.
Abstract
A variety of experimental methodologies have been used to evaluate the accuracy of duplicate-detection systems. We advocate presenting precision-recall curves as the most informative evaluation methodology. We also discuss a number of issues that arise when evaluating and assembling training data for adaptive systems that use machine learning to tune themselves to specific applications. We consider several different application scenarios and experimentally examine the effectiveness of alternative methods of collecting training data under each scenario. We propose two new approaches to collecting training data, called static-active learning and weakly-labeled non-duplicates, and present experimental results on their effectiveness.
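
As a rough illustration of the evaluation style the abstract advocates, the sketch below computes precision-recall points for a duplicate detector by sweeping a threshold over scored candidate pairs. It is a minimal example under stated assumptions, not the authors' procedure: the function name `precision_recall_points`, the `(score, is_duplicate)` input format, and the toy scores are all hypothetical.

```python
from typing import List, Tuple


def precision_recall_points(
    scored_pairs: List[Tuple[float, bool]],
) -> List[Tuple[float, float]]:
    """Return (recall, precision) points, one per distinct score threshold.

    Assumption: each element is (similarity_score, is_true_duplicate) for one
    candidate record pair, and a higher score means "more likely a duplicate".
    """
    total_duplicates = sum(1 for _, is_dup in scored_pairs if is_dup)
    if total_duplicates == 0:
        raise ValueError("need at least one true duplicate pair")

    # Rank pairs from most to least similar; lowering the threshold
    # corresponds to walking down this ranking.
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)

    points = []
    true_positives = 0
    for rank, (score, is_dup) in enumerate(ranked, start=1):
        if is_dup:
            true_positives += 1
        # A threshold can only fall between two different scores, so tied
        # pairs are accepted together and yield a single point.
        is_last = rank == len(ranked)
        if is_last or ranked[rank][0] != score:
            recall = true_positives / total_duplicates
            precision = true_positives / rank
            points.append((recall, precision))
    return points


# Hypothetical toy data: five candidate pairs with gold labels.
toy_pairs = [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.40, False)]
for recall, precision in precision_recall_points(toy_pairs):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Reporting the full list of points, rather than accuracy at a single threshold, shows how a system trades precision against recall across all operating points.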

