
A Comparison of String Metrics for Matching Names and Records

TL;DR: An open-source Java toolkit of methods for matching names and records is described and results obtained from using various string distance metrics on the task of matching entity names are summarized.
Abstract: We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We then describe an extension to the toolkit which allows records to be compared. We discuss some issues involved in performing a similar comparison for record-matching techniques, and finally present results for some baseline record-matching algorithms that aggregate string comparisons between fields.
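As a rough illustration of the distinction between these metric families, a token-based measure can be sketched in a few lines (an illustrative Jaccard-style comparison with hypothetical names, not the toolkit's own API):

```python
import re

def jaccard_tokens(a: str, b: str) -> float:
    """Token-based similarity: Jaccard overlap of the two word sets."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# Token-based metrics ignore word order, so reordered names still match:
print(jaccard_tokens("John A. Smith", "Smith, John A."))  # → 1.0
```

Edit-distance metrics, by contrast, operate at the character level and are sensitive to order but tolerant of small spelling variations; hybrid methods combine the two views.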


Citations
Book
05 Jun 2007
TL;DR: The second edition of Ontology Matching has been thoroughly revised and updated to reflect the most recent advances in this quickly developing area, which resulted in more than 150 pages of new content.
Abstract: Ontologies tend to be found everywhere. They are viewed as the silver bullet for many applications, such as database integration, peer-to-peer systems, e-commerce, semantic web services, or social networks. However, in open or evolving systems, such as the semantic web, different parties would, in general, adopt different ontologies. Thus, merely using ontologies, like using XML, does not reduce heterogeneity: it just raises heterogeneity problems to a higher level. Euzenat and Shvaiko's book is devoted to ontology matching as a solution to the semantic heterogeneity problem faced by computer systems. Ontology matching aims at finding correspondences between semantically related entities of different ontologies. These correspondences may stand for equivalence as well as other relations, such as consequence, subsumption, or disjointness, between ontology entities. Many different matching solutions have been proposed so far from various viewpoints, e.g., databases, information systems, and artificial intelligence. The second edition of Ontology Matching has been thoroughly revised and updated to reflect the most recent advances in this quickly developing area, which resulted in more than 150 pages of new content. In particular, the book includes a new chapter dedicated to the methodology for performing ontology matching. It also covers emerging topics, such as data interlinking, ontology partitioning and pruning, context-based matching, matcher tuning, alignment debugging, and user involvement in matching, to mention a few. More than 100 state-of-the-art matching systems and frameworks were reviewed. With Ontology Matching, researchers and practitioners will find a reference book that presents currently available work in a uniform framework. In particular, the work and the techniques presented in this book can be equally applied to database schema matching, catalog integration, XML schema matching and other related problems.
The objectives of the book include presenting (i) the state of the art and (ii) the latest research results in ontology matching by providing a systematic and detailed account of matching techniques and matching systems from theoretical, practical and application perspectives.

2,579 citations

Book ChapterDOI
TL;DR: This paper presents a new classification of schema-based matching techniques that builds on top of the state of the art in both schema and ontology matching and distinguishes between approximate and exact techniques at schema-level; and syntactic, semantic, and external techniques at element- and structure-level.
Abstract: Schema and ontology matching is a critical problem in many application domains, such as semantic web, schema/ontology integration, data warehouses, e-commerce, etc. Many different matching solutions have been proposed so far. In this paper we present a new classification of schema-based matching techniques that builds on top of the state of the art in both schema and ontology matching. Some innovations are in introducing new criteria which are based on (i) general properties of matching techniques, (ii) interpretation of input information, and (iii) the kind of input information. In particular, we distinguish between approximate and exact techniques at schema-level; and syntactic, semantic, and external techniques at element- and structure-level. Based on the classification proposed we overview some of the recent schema/ontology matching systems, pointing out which part of the solution space they cover. The proposed classification provides a common conceptual basis, and, hence, can be used for comparing different existing schema/ontology matching techniques and systems as well as for designing new ones, taking advantage of state-of-the-art solutions.

1,285 citations


Cites background from "A Comparison of String Metrics for ..."

  • ...A comparison of different string matching techniques, from distance-like functions to token-based distance functions can be found in [9]....


Book ChapterDOI
06 Nov 2005
TL;DR: A new string metric for the comparison of names which performs better on the process of ontology alignment as well as to many other field matching problems is presented.
Abstract: Ontologies are today a key part of every knowledge based system. They provide a source of shared and precisely defined terms, resulting in system interoperability by knowledge sharing and reuse. Unfortunately, the variety of ways that a domain can be conceptualized results in the creation of different ontologies with contradicting or overlapping parts. For this reason ontologies need to be brought into mutual agreement (aligned). One important method for ontology alignment is the comparison of class and property names of ontologies using string-distance metrics. Today quite a lot of such metrics exist in literature. But all of them have been initially developed for different applications and fields, resulting in poor performance when applied in this new domain. In the current paper we present a new string metric for the comparison of names which performs better on the process of ontology alignment as well as to many other field matching problems.

465 citations


Cites background or methods from "A Comparison of String Metrics for ..."

  • ...The second one is performed with classical benchmarks found in literature for data integration and retrieval [20]....


  • ...Thus we could not resist but to evaluate it with classical benchmarks found in literature like the ones in [7,8,24,20]....


  • ...In order to evaluate our metric against these datasets we used the SecondString open-source library [20]....


Proceedings ArticleDOI
18 Dec 2006
TL;DR: A well-founded, integrated solution to the entity resolution problem based on Markov logic, which combines first-order logic and probabilistic graphical models by attaching weights to first- order formulas, and viewing them as templates for features of Markov networks.
Abstract: Entity resolution is the problem of determining which records in a database refer to the same entities, and is a crucial and expensive step in the data mining process. Interest in it has grown rapidly in recent years, and many approaches have been proposed. However, they tend to address only isolated aspects of the problem, and are often ad hoc. This paper proposes a well-founded, integrated solution to the entity resolution problem based on Markov logic. Markov logic combines first-order logic and probabilistic graphical models by attaching weights to first-order formulas, and viewing them as templates for features of Markov networks. We show how a number of previous approaches can be formulated and seamlessly combined in Markov logic, and how the resulting learning and inference problems can be solved efficiently. Experiments on two citation databases show the utility of this approach, and evaluate the contribution of the different components.

422 citations


Cites background or methods from "A Comparison of String Metrics for ..."

  • ...Cohen et al. [6] found such hybrid measures to outperform pure word-based and pure string-based ones for entity resolution....


  • ...The second step involves learning an MLN(B+C+T) model on the words inferred by the first stage. This model implements a hybrid similarity measure as proposed by Cohen et al. [6]....


  • ...Several authors have devised, compared and learned similarity measures for use in entity resolution (e.g., [6, 45, 3])....


Journal ArticleDOI
TL;DR: YAKE!, a light-weight unsupervised automatic keyword extraction method which rests on statistical text features extracted from single documents to select the most relevant keywords of a text, is described.

357 citations

References
Book
01 Feb 2005
TL;DR: This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and, more generally, of probabilistic methods of sequence analysis.
Abstract: Probabilistic models are becoming increasingly important in analyzing the huge amount of data being produced by large-scale DNA-sequencing efforts such as the Human Genome Project. For example, hidden Markov models are used for analyzing biological sequences, linguistic-grammar-based probabilistic models for identifying RNA secondary structure, and probabilistic evolutionary models for inferring phylogenies of sequences from different organisms. This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and, more generally, of probabilistic methods of sequence analysis. Written by an interdisciplinary team of authors, it is accessible to molecular biologists, computer scientists, and mathematicians with no formal knowledge of the other fields, and at the same time presents the state of the art in this new and important field.

3,985 citations


"A Comparison of String Metrics for ..." refers methods in this paper

  • ...SecondString supports a range of metrics based on edit distance, including Levenshtein distance, which assigns a unit cost to all edit operations; and the Monge-Elkan distance function (Monge & Elkan 1996), a well-tuned affine variant of the Smith-Waterman distance function (Durbin et al. 1998)....

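The unit-cost edit distance mentioned in the excerpt can be sketched with the standard dynamic-programming recurrence (an illustrative implementation, not SecondString's own code; the Monge-Elkan and Smith-Waterman variants additionally use tuned substitution and affine gap costs):

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance: minimum insertions, deletions,
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from s[:0] to each prefix of t
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution (0 if equal)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```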

Journal ArticleDOI
TL;DR: A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events.
Abstract: A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical persons, objects or events (said to be matched). A comparison is to be made between the recorded characteristics and values in two records (one from each file) and a decision made as to whether or not the members of the comparison-pair represent the same person or event, or whether there is insufficient evidence to justify either of these decisions at stipulated levels of error. These three decisions are referred to as link (A 1), a non-link (A 3), and a possible link (A 2). The first two decisions are called positive dispositions. The two types of error are defined as the error of the decision A 1 when the members of the comparison pair are in fact unmatched, and the error of the decision A 3 when the members of the comparison pair are, in fact matched. The probabilities of these errors are defined as and respecti...
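The decision rule described in the abstract can be sketched as follows. The per-field probabilities and the two thresholds below are purely hypothetical; in practice the m and u probabilities are estimated from data and the thresholds are set to meet the stipulated error levels:

```python
from math import log2

# Hypothetical per-field agreement probabilities:
#   m = P(field agrees | records match), u = P(field agrees | records do not match)
FIELDS = {"surname": (0.95, 0.01), "first": (0.90, 0.05), "zip": (0.85, 0.10)}

def linkage_weight(agreements: dict) -> float:
    """Sum of log-likelihood-ratio weights over the compared fields."""
    w = 0.0
    for field, (m, u) in FIELDS.items():
        if agreements[field]:
            w += log2(m / u)                # agreement weight (positive)
        else:
            w += log2((1 - m) / (1 - u))    # disagreement weight (negative)
    return w

def decide(w: float, upper: float = 6.0, lower: float = 0.0) -> str:
    """Three-way decision: link (A1), possible link (A2), non-link (A3)."""
    if w >= upper:
        return "A1: link"
    if w <= lower:
        return "A3: non-link"
    return "A2: possible link"
```

For example, agreement on all three fields yields a large positive weight and the decision A1, while disagreement on all fields yields a large negative weight and A3; mixed evidence falls into the A2 band for clerical review.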

2,306 citations


"A Comparison of String Metrics for ..." refers methods in this paper

  • ...We have also implemented token-based distance metrics based on Jensen-Shannon distance (Dagan, Lee, & Pereira 1999) with various smoothing methods, and a simplified form of Fellegi and Sunter’s method (Fellegi & Sunter 1969), called SFS below....


  • ...In statistics, a long line of research has been conducted in probabilistic record linkage, largely based on the seminal paper by Fellegi and Sunter (1969)....


Proceedings Article
09 Aug 2003
TL;DR: Using an open-source, Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.
Abstract: Using an open-source, Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme, which was developed in the probabilistic record linkage community.
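The Jaro-Winkler component of that hybrid scheme can be sketched directly from its definition (an illustrative implementation; the TFIDF weighting that the hybrid combines it with is not shown):

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity: based on common characters within a sliding
    window and the number of transpositions among them."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(len(s), len(t)) // 2 - 1
    s_match = [False] * len(s)
    t_match = [False] * len(t)
    matches = 0
    for i, c in enumerate(s):                      # greedy character matching
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_match[j] and t[j] == c:
                s_match[i] = t_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, j = 0, 0                       # count out-of-order matches
    for i in range(len(s)):
        if s_match[i]:
            while not t_match[j]:
                j += 1
            if s[i] != t[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Winkler's variant: boost the Jaro score for a shared prefix (up to 4 chars)."""
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    jw = jaro(s, t)
    return jw + prefix * p * (1 - jw)
```

On the classic example pair "MARTHA"/"MARHTA" this yields a Jaro score of about 0.944 and a Jaro-Winkler score of about 0.961, reflecting the shared three-character prefix.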

1,355 citations

Journal ArticleDOI
TL;DR: The theoretical and practical issues encountered in conducting the matching operation and the results of that operation are discussed.
Abstract: A test census of Tampa, Florida and an independent postenumeration survey (PES) were conducted by the U.S. Census Bureau in 1985. The PES was a stratified block sample with heavy emphasis placed on hard-to-count population groups. Matching the individuals in the census to the individuals in the PES is an important aspect of census coverage evaluation and consequently a very important process for any census adjustment operations that might be planned. For such an adjustment to be feasible, record-linkage software had to be developed that could perform matches with a high degree of accuracy and that was based on an underlying mathematical theory. A principal purpose of the PES was to provide an opportunity to evaluate the newly implemented record-linkage system and associated methodology. This article discusses the theoretical and practical issues encountered in conducting the matching operation and presents the results of that operation. A review of the theoretical background of the record-linkage...

1,347 citations


"A Comparison of String Metrics for ..." refers background or methods in this paper

  • ...These proposals have been, by and large, adopted by subsequent researchers, often with elaborations of the underlying statistical model (Jaro 1989; 1995; Winkler 1999; Larsen 1999; Belin & Rubin 1997)....


  • ...It also supports the Jaro metric (Jaro 1995; 1989), a metric widely used in the record-linkage community, with and without a variation due to Winkler (1999)....


Proceedings ArticleDOI
01 Aug 2000
TL;DR: This work presents a new technique for clustering large datasets, using a cheap, approximate distance measure to eciently divide the data into overlapping subsets the authors call canopies, and presents ex- perimental results on grouping bibliographic citations from the reference sections of research papers.
Abstract: important problems involve clustering large datasets. Although naive implementations of clustering are computa- tionally expensive, there are established ecient techniques for clustering when the dataset has either (1) a limited num- ber of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of eciently clustering datasets that are large in all three ways at once|for example, having millions of data points that exist in many thousands of di- mensions representing many thousands of clusters. We present a new technique for clustering these large, high- dimensional datasets. The key idea involves using a cheap, approximate distance measure to eciently divide the data into overlapping subsets we call canopies .T hen cluster- ing is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under reasonable assumptions about the cheap distance metric, this reduction in computational cost comes without any loss in clustering accuracy. Canopies can be applied to many domains and used with a variety of cluster- ing approaches, including Greedy Agglomerative Clustering, K-means and Expectation-Maximization. We present ex- perimental results on grouping bibliographic citations from the reference sections of research papers. Here the canopy approach reduces computation time over a traditional clus- tering approach by more than an order of magnitude and decreases error in comparison to a previously used algorithm by 25%.
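The canopy construction described above can be sketched as a greedy loop with a loose threshold t1 and a tight threshold t2 (t1 > t2); function and parameter names here are our own, not the paper's:

```python
import random

def canopies(points, cheap_dist, t1, t2):
    """Greedy canopy construction: pick a random center, pull every point
    within t1 into its canopy, and remove points within t2 from further
    consideration as centers. Canopies may overlap."""
    assert t1 > t2, "loose threshold must exceed tight threshold"
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(random.randrange(len(remaining)))
        canopy = [center]
        still_remaining = []
        for p in remaining:
            d = cheap_dist(center, p)
            if d < t1:
                canopy.append(p)        # joins this canopy
            if d >= t2:
                still_remaining.append(p)  # stays eligible for other canopies
        remaining = still_remaining
        result.append(canopy)
    return result
```

Exact (expensive) distances are then computed only within each canopy; every point is guaranteed to land in at least one canopy, so no candidate pair closer than t2 under the cheap metric is lost.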

1,197 citations