Proceedings ArticleDOI

A Comparison of Personal Name Matching: Techniques and Practical Issues

18 Dec 2006-pp 290-294
TL;DR: The characteristics of personal names are discussed, potential sources of variations and errors are presented, and a comprehensive number of commonly used, as well as some recently developed, name matching techniques are reviewed.
Abstract: Finding and matching personal names is at the core of an increasing number of applications: from text and Web mining, search engines, to information extraction, deduplication and data linkage systems. Variations and errors in names make exact string matching problematic, and approximate matching techniques have to be applied. When compared to general text, however, personal names have different characteristics that need to be considered. In this paper we discuss the characteristics of personal names and present potential sources of variations and errors. We then overview a comprehensive number of commonly used, as well as some recently developed name matching techniques. Experimental comparisons using four large name data sets indicate that there is no clear best matching technique.

Summary (3 min read)

1. Introduction

  • Increasingly large amounts of data are being created, communicated and stored by many individuals, organisations and businesses on a daily basis.
  • A lot of this data contains some information about people, for example e-mails, customer and patient records, news articles, and business and political memorandums.
  • As reported in [28], the use of approximate comparison methods does improve the matching quality in these applications.
  • Personal names have characteristics that make them different to general text.
  • While similar comparison studies on matching techniques have been done in the past [9, 17, 20, 25, 32, 34], none has analysed and compared such a comprehensive number of techniques specifically with application to personal names.

2. Personal name characteristics

  • Even when only considering the English-speaking world, a name can have several different spelling forms for a variety of reasons.
  • In the Anglo-Saxon region and most other Western countries, a personal name is usually made of a given name, an optional middle name, and a surname or family name [24].
  • Other specific errors were differences in punctuation marks and whitespaces (for example ‘O’Connor’, ‘OConnor’ and ‘O Connor’) in 12% of errors, and different last names for female patients (8% of errors).
  • Thus, there seem to be significant differences between general text and personal names, which have to be considered when name matching algorithms are being developed and used.

2.1 Sources of name variations

  • Besides the variations in personal names discussed above, the nature of data entry [19] will determine the most likely types of errors and their distribution.
  • Manual keyboard based data entry can result in wrongly typed neighbouring keys (for example ‘n’ and ‘m’, or ‘e’ and ‘r’).
  • Finally, people themselves sometimes report their names differently depending upon the organisation they are in contact with, or deliberately provide wrong or modified names.
  • When matching names, one has to deal with legitimate name variations (that should be preserved and matched), and errors introduced during data entry and recording (that should be corrected) [3].
  • The challenge lies in distinguishing between these two sources of variations.

3. Matching techniques

  • Name matching can be defined as the process of determining whether two name strings are instances of the same name [24].
  • In the following three subsections the authors present the most commonly used as well as several recently proposed new techniques.
  • Without proper parsing and segmentation a name (even if stored in two fields as given- and surname) can contain several words separated by a hyphen, apostrophe, whitespace or other character.
  • Frequency distributions of name values can also be used to improve the quality of name matching.
  • That is, the authors consider only the basic techniques used to compare two names, without taking any context information into account.

3.1 Phonetic encoding

  • Most techniques – including all presented here – have been developed mainly with English in mind.
  • The transformed name string is then encoded into a one-letter, three-digit code (again removing zeros and duplicate numbers) using an encoding table.
  • It contains many rules that take the position within a name, as well as previous and following letters into account (similar to Phonix).
  • When matching names, phonetic encoding can be used as a filtering step (called blocking in data linkage [6, 30]), i.e. only names having the same phonetic code will be compared using a computationally more expensive pattern matching algorithm.
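This blocking idea can be sketched in a few lines of Python. The Soundex variant below is a simplified assumption for illustration (it omits the standard algorithm's special handling of 'h' and 'w', among other details), and `block_by_soundex` is a hypothetical helper, not Febrl's implementation:

```python
from collections import defaultdict

# Illustrative (simplified) Soundex letter-to-digit table.
SOUNDEX_MAP = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def soundex(name: str) -> str:
    """Encode a name as one letter plus three digits, dropping zeros
    (vowels and similar) and runs of duplicate digits."""
    name = name.lower()
    first = name[0].upper()
    digits = [SOUNDEX_MAP.get(c, "0") for c in name]
    code = []
    prev = digits[0]
    for d in digits[1:]:
        if d != "0" and d != prev:
            code.append(d)
        prev = d
    return (first + "".join(code) + "000")[:4]

def block_by_soundex(names):
    """Group names by phonetic code; only names sharing a block need to
    be compared with an expensive pattern matching algorithm."""
    blocks = defaultdict(list)
    for n in names:
        blocks[soundex(n)].append(n)
    return blocks
```

With blocking, an expensive pattern matcher is only run on pairs inside the same block (e.g. 'Gail' and 'Gayle' both encode to G400), which can cut the number of comparisons dramatically on large data sets.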

3.2 Pattern matching

  • A similarity measure can be calculated by dividing the total length of the common sub-strings by the minimum, maximum or average lengths of the two original strings (similar to Smith-Waterman above).
  • Positional q-grams can be padded with start and end characters similar to non-positional q-grams, and similarity measures can be calculated in the same three ways as with non-positional q-grams.
  • The Winkler algorithm therefore increases the Jaro similarity measure for agreeing initial characters (up to four).
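As a concrete illustration of the q-gram measures above, here is a minimal sketch. The padding characters ('#' and '$') and the exact multiset counting of common grams are assumptions of this sketch, not necessarily the variants evaluated in the paper:

```python
def qgrams(s: str, q: int = 2, padded: bool = True):
    """Extract q-grams; optional start/end padding so the first and
    last letters also appear in q-1 extra grams."""
    if padded:
        s = "#" * (q - 1) + s + "$" * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_similarity(s1: str, s2: str, q: int = 2, norm: str = "average") -> float:
    """Count q-grams common to both strings (with multiplicity) and
    normalise by the minimum, maximum or average number of q-grams,
    giving a similarity between 0.0 and 1.0."""
    g1, g2 = qgrams(s1.lower(), q), qgrams(s2.lower(), q)
    common = 0
    g2_left = list(g2)
    for g in g1:
        if g in g2_left:
            common += 1
            g2_left.remove(g)
    denom = {"min": min(len(g1), len(g2)),
             "max": max(len(g1), len(g2)),
             "average": (len(g1) + len(g2)) / 2}[norm]
    return common / denom
```

For example, 'gail' and 'gale' share two padded bigrams ('#g' and 'ga') out of five each, so all three normalisations give 0.4 here; the min/max/average choices only differ when the two names have different lengths.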

3.3 Combined techniques

  • Two techniques combine phonetic encoding and pattern matching with the aim of improving the matching quality.
  • The edit costs in Editex are 0 if two letters are the same, 1 if they are in the same letter group, and 2 otherwise.
  • Similar to basic edit distance, the time and space complexities of matching two strings s1 and s2 with Editex are O(|s1| × |s2|) and O(min(|s1|, |s2|)), respectively.
  • This recently developed technique, called Syllable Alignment Pattern Searching (SAPS) [13], is based on the idea of matching two names syllable by syllable, rather than character by character.
  • The experimental results presented in [13] indicate that SAPS performs better than Editex, edit distance and Soundex on the same large name data set used in [25] (the COMPLETE data set the authors are using in their experiments as well).
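A minimal sketch of an Editex-style distance follows. The letter groups used here are illustrative assumptions (the published Editex groups differ in detail and can overlap), and insertions/deletions are given a flat cost of 2 rather than Editex's context-dependent costs:

```python
# Illustrative phonetic letter groups (an assumption for this sketch).
GROUPS = ["aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fv", "sxz"]
GROUP_OF = {c: i for i, g in enumerate(GROUPS) for c in g}

def letter_cost(a: str, b: str) -> int:
    """0 if the letters are the same, 1 if they share a group, 2 otherwise."""
    if a == b:
        return 0
    if GROUP_OF.get(a, -1) == GROUP_OF.get(b, -2):
        return 1
    return 2

def editex(s1: str, s2: str) -> int:
    """Dynamic-programming distance with group-aware substitution costs,
    O(|s1|*|s2|) time and O(min(|s1|,|s2|))-style row storage,
    like plain edit distance."""
    s1, s2 = s1.lower(), s2.lower()
    n, m = len(s1), len(s2)
    prev = [2 * j for j in range(m + 1)]  # flat indel cost of 2 (simplified)
    for i in range(1, n + 1):
        cur = [2 * i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 2,        # delete s1[i-1]
                         cur[j - 1] + 2,     # insert s2[j-1]
                         prev[j - 1] + letter_cost(s1[i - 1], s2[j - 1]))
        prev = cur
    return prev[m]
```

The group costs are what make this phonetically aware: 'Meier' vs. 'Meyer' costs only 1 (i and y share the vowel group), whereas a substitution across unrelated letters costs 2.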

4 Experiments and discussion

  • In this section the authors discuss the results of a series of comparison experiments using four large name data sets.
  • The aim of these experiments was to see which matching techniques achieve the best matching quality for different personal name types, and to compare their computational performance.
  • All name matching techniques were implemented in Python as part of the Febrl (Freely Extensible Biomedical Record Linkage) data linkage system [5] (http://datamining.anu.edu.au/linkage.html).

4.1 Name data sets

  • Three of the test data sets were based on given- and surnames extracted from a health data set containing midwives’ records (women who gave birth) from the Australian state of New South Wales [4].
  • A deduplication status in this data (indicating which records correspond to the same women) allowed us to extract true name pairs (known matches).
  • The authors then created a full name data set by concatenating given- with surnames (separated by a whitespace).
  • The fourth data set was created in a similar way using the COMPLETE name database [13, 25] by forming surname pairs from 90 randomly chosen and manually matched queries.
  • Table 2 shows the size of their four test data sets.

4.2 Distribution of edit distances

  • In order to better understand their test data, the authors calculated the edit distances for all the known name pairs.
  • This indicates the challenge of name matching: how to correctly classify two names that are very different.
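The kind of analysis described here can be sketched as follows; the name pairs below are hypothetical stand-ins for the known matches extracted from the deduplicated data:

```python
from collections import Counter

def edit_distance(s1: str, s2: str) -> int:
    """Classic Levenshtein distance with unit costs, O(|s1|*|s2|) time."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (c1 != c2)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical known-match pairs standing in for the deduplicated records.
known_pairs = [("gail", "gayle"), ("smith", "smith"),
               ("christine", "tina"), ("meier", "meyer")]

# Tally how far apart the known matches actually are.
histogram = Counter(edit_distance(a, b) for a, b in known_pairs)
```

Pairs like ('christine', 'tina') sit at a large edit distance despite being true matches, which is exactly the classification challenge the distribution exposes.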

4.3 Matching results

  • The two techniques that combine phonetic encoding with pattern matching (Editex and syllable alignment distance) do not perform as well as one might have expected, and neither do skip-grams.
  • Details for the best performing pattern matching techniques on the four data sets can be seen in Figure 1.
  • An optimal value for one data set and technique will very likely result in sub-optimal quality for another data set or technique.

4.4 Timing results

  • As shown in Table 6, the phonetic encoding techniques (times shown include encoding of two names) are generally much faster than pattern matching, due to their complexity being O(|s|) for a given string s.
  • Phonix with its many rules is the slowest phonetic technique (almost ten times as slow as the others), while Smith-Waterman is the slowest pattern matching technique.

5 Recommendations

  • The mixed results presented in the previous section indicate that there is no single best name matching technique, and that the type of personal name data to be matched has to be considered when selecting a matching technique.
  • The following recommendations will help with this.
  • It is important to know the type of names to be matched, and if these names have been properly parsed and standardised [7], or if the name data potentially contains several words with various separators.
  • Phonetic encoding followed by exact comparison of the phonetic codes should not be used.
  • Even small changes of the threshold can result in dramatic drops in matching quality.
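The threshold sensitivity mentioned above can be illustrated with a small sketch, using Python's difflib ratio as a stand-in similarity measure and hypothetical labelled pairs (neither is the paper's data or matcher):

```python
from difflib import SequenceMatcher

# Hypothetical labelled pairs (name1, name2, is_true_match), for illustration only.
pairs = [("gail", "gayle", True), ("meier", "meyer", True),
         ("tina", "christine", True), ("smith", "smyth", True),
         ("peter", "paula", False), ("jones", "james", False)]

def similarity(a: str, b: str) -> float:
    # difflib's ratio as a stand-in for the matchers compared in the paper.
    return SequenceMatcher(None, a, b).ratio()

def accuracy(threshold: float) -> float:
    """Fraction of pairs classified correctly when two names are declared
    a match iff their similarity reaches the threshold."""
    correct = sum((similarity(a, b) >= threshold) == label
                  for a, b, label in pairs)
    return correct / len(pairs)

for t in (0.6, 0.75, 0.9):
    print(f"threshold {t:.2f}: accuracy {accuracy(t):.2f}")
```

Sweeping the threshold like this makes the drop-off visible: a threshold high enough to reject 'jones'/'james' also rejects legitimate variations such as 'gail'/'gayle', so the optimal value is data- and technique-dependent.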


TR-CS-06-02
A Comparison of Personal Name
Matching: Techniques and Practical
Issues
Peter Christen
September 2006
Joint Computer Science Technical Report Series
Department of Computer Science
Faculty of Engineering and Information Technology
Computer Sciences Laboratory
Research School of Information Sciences and Engineering

This technical report series is published jointly by the Department of
Computer Science, Faculty of Engineering and Information Technology,
and the Computer Sciences Laboratory, Research School of Information
Sciences and Engineering, The Australian National University.
Please direct correspondence regarding this series to:
Technical Reports
Department of Computer Science
Faculty of Engineering and Information Technology
The Australian National University
Canberra ACT 0200
Australia
or send email to:
Technical-DOT-Reports-AT-cs-DOT-anu.edu.au
A list of technical reports, including some abstracts and copies of some full
reports may be found at:
http://cs.anu.edu.au/techreports/

A Comparison of Personal Name Matching: Techniques and Practical Issues
Peter Christen
Department of Computer Science, The Australian National University
Canberra ACT 0200, Australia
Peter.Christen@anu.edu.au
Abstract

Finding and matching personal names is at the core of an increasing number of applications: from text and Web mining, information retrieval and extraction, search engines, to deduplication and data linkage systems. Variations and errors in names make exact string matching problematic, and approximate matching techniques based on phonetic encoding or pattern matching have to be applied. When compared to general text, however, personal names have different characteristics that need to be considered.

In this paper we discuss the characteristics of personal names and present potential sources of variations and errors. We overview a comprehensive number of commonly used, as well as some recently developed name matching techniques. Experimental comparisons on four large name data sets indicate that there is no clear best technique. We provide a series of recommendations that will help researchers and practitioners to select a name matching technique suitable for a given data set.
1. Introduction

Increasingly large amounts of data are being created, communicated and stored by many individuals, organisations and businesses on a daily basis. A lot of this data contains some information about people, for example e-mails, customer and patient records, news articles, business and political memorandums. Even most scientific and technical documents contain details about their authors. Personal names are often used to search for documents in large collections. Examples include Web searches (the most popular query in the last few years on Google has always been a celebrity name, with another four or five names ranked in the top ten queries; see http://www.google.com/press/zeitgeist.html), retrieval of medical patient records, or bibliographic searches (using author names). Names are also important pieces of information when databases are deduplicated (e.g. to find and remove duplicate customer records), and when two data sets are linked or integrated and no unique entity identifiers are available [5, 6, 30]. As reported in [28], the use of approximate comparison methods does improve the matching quality in these applications.

Personal names have characteristics that make them different to general text. While there is only one correct spelling for many words, there are often several valid spelling variations for personal names, for example ‘Gail’, ‘Gale’ and ‘Gayle’. People also frequently use (or are given) nicknames in daily life, for example ‘Bill’ rather than the more formal ‘William’. Personal names sometimes change over time, for example when somebody gets married. Names are also heavily influenced by people’s cultural backgrounds. These issues make matching of personal names more challenging compared to matching of general text [3, 24].

As names are often recorded with different spellings, applying exact matching leads to poor results. In [11], for example, the percentage of name mismatches in three large hospital databases ranged between 23% and 36%. To improve matching accuracy, many different techniques for approximate name matching have been developed in the last four decades [15, 20, 25, 34], and new techniques are still being invented [13, 18]. Most techniques are based on pattern matching, phonetic encoding, or a combination of these two approaches.

Computational complexity has to be considered when name matching is done on very large data sets. The time needed to determine if two names match is crucial for the overall performance of an application (besides data structures that allow candidate name pairs to be extracted efficiently while filtering out likely non-matches [23]). Matching speed is vital when quick response times are needed, for example in search engines, or crime and biomedical emergency response systems, where an answer should be available within a couple of seconds.

While similar comparison studies on matching techniques have been done in the past [9, 17, 20, 25, 32, 34], none has analysed and compared such a comprehensive number of techniques specifically with application to personal names. The contributions of this paper are a detailed discussion of the characteristics of personal names and possible sources of variations and errors in them, an overview of a range of name matching techniques, and a comparison of their performance using several large real-world data sets containing personal names.

We start in Section 2 with a discussion of personal name characteristics and sources of variations. In Section 3 we first look at different situations and contexts of name matching, and then present a comprehensive number of name matching techniques. The results of experimental comparisons are discussed in Section 4, and a series of recommendations is given in Section 5 that will help researchers and practitioners who are faced with the problem of selecting a name matching technique. Finally, conclusions and an outlook on future work are discussed in Section 6.
2. Personal name characteristics

Even when only considering the English-speaking world, a name can have several different spelling forms for a variety of reasons. In the Anglo-Saxon region and most other Western countries, a personal name is usually made of a given name, an optional middle name, and a surname or family name [24]. Both ‘Gail Vest’ and ‘Gayle West’ might refer to the same person, while ‘Tina Smith’ might be recorded in the same database as ‘Christine J. Smith’ and as ‘C.J. Smith-Miller’. People change their name over time, most commonly when somebody gets married (in which case there are different cultural conventions and laws of how a person’s name is changed). Compound names are often used by married women, while in certain countries husbands can take on the surname of their wives.

In daily life, people often use (or are given) nicknames. These can be short forms of their given names (like ‘Bob’ for ‘Robert’, or ‘Liz’ for ‘Elizabeth’), they can be variations of their surname (like ‘Vesty’ for ‘Vest’), or they might relate to some life event, character sketch or physical characteristics of a person [3]. While having one given and one middle name is common for Anglo-Saxon names, several European countries favour compound given names instead, for example ‘Hans-Peter’ or ‘Jean-Pierre’. In general, there are no legal regulations of what constitutes a name [3].

In today’s multi-cultural societies and worldwide data collections (e.g. global online businesses or international crime and terrorism databases), the challenge is to be able to match names coming from different cultural backgrounds. For Asian names, for example, there exist several transliteration systems into the Roman alphabet [24], the surname traditionally appears before the given name, and frequently a Western given name is added. Hispanic names can contain two surnames, while Arabic names are often made of several components and contain various affixes that can be separated by hyphens or whitespaces.

An early study [10] on spelling errors in general words found that over 80% of errors were single errors: either a letter was deleted, an extra letter was inserted, a letter was substituted for another letter, or two adjacent letters were transposed. Substitutions were the most common errors, followed by deletions, then insertions and finally transpositions, followed by multiple errors in one word. Other studies [15, 19, 27] reported similar results. However, in a study [11] that looked at patient names within hospital databases, different types and distributions of errors were found. With 36%, insertion of an additional name word, initial or title were the most common errors. This was followed in 14% of errors by several different letters in a name due to nicknames or spelling variations. Other specific errors were differences in punctuation marks and whitespaces (for example ‘O’Connor’, ‘OConnor’ and ‘O Connor’) in 12% of errors, and different last names for female patients (8% of errors). Single errors in this study accounted for 39% of all errors, only around half compared to the 80% reported in [10]. Thus, there seem to be significant differences between general text and personal names, which have to be considered when name matching algorithms are being developed and used. According to [20] the most common name variations can be categorised as:

  • spelling variations (like ‘Meier’ and ‘Meyer’) due to typographical errors that do not affect the phonetic structure of a name but still pose a problem for matching;
  • phonetic variations (like ‘Sinclair’ and ‘St. Clair’) where the phonemes are modified and the structure of a name is changed substantially;
  • compound names (like ‘Hans-Peter’ or ‘Smith Miller’) that might be given in full (potentially with different separators), one component only, or with components swapped;
  • alternative names (like nicknames, married names or other deliberate name changes); and
  • initials only (mainly for given and middle names).

In [19] character level (or non-word) misspellings are classified into (1) typographical errors, where it is assumed that the person doing the data entry does know the correct spelling of a word but makes a typing error (e.g. ‘Sydeny’ instead of ‘Sydney’); (2) cognitive errors, assumed to come from a lack of knowledge or misconceptions; and (3) phonetic errors, coming from substituting a correct spelling with a similar sounding one. The combination of phonetic and spelling variations, as well as potentially totally changed name words, makes name matching challenging.

2.1 Sources of name variations

Besides the variations in personal names discussed above, the nature of data entry [19] will determine the most likely types of errors and their distribution.

  • When handwritten forms are scanned and optical character recognition (OCR) is applied [15, 27], the most likely types of errors will be substitutions between similar looking characters (like ‘q’ and ‘g’), or substitutions of one character with a similar looking character sequence (like ‘m’ and ‘r n’, or ‘b’ and ‘l i’).
  • Manual keyboard based data entry can result in wrongly typed neighbouring keys (for example ‘n’ and ‘m’, or ‘e’ and ‘r’). While in some cases this is quickly corrected by the person doing the data entry, such errors are often not recognised, possibly due to limited time or distractions to the person doing the data entry (imagine a busy receptionist in a hospital emergency department). The likelihood of letter substitutions obviously depends upon the keyboard layout.
  • Data entry over the telephone (for example as part of a survey study) is a confounding factor to manual keyboard entry. The person doing the data entry might not request the correct spelling, but rather assume a default spelling which is based on the person’s knowledge and cultural background. Generally, errors are more likely for names that come from a culture that is different to that of the person doing the data entry, or if names are long or complicated (like ‘Kyzwieslowski’) [11].
  • Limitations in the maximum length of input fields can force people to use abbreviations, initials only, or even disregard some parts of a name.
  • Finally, people themselves sometimes report their names differently depending upon the organisation they are in contact with, or deliberately provide wrong or modified names. Or, while somebody might report her or his name consistently in good faith, others report it inconsistently or wrongly for various reasons.

If data from various sources is used, for example in a text mining, information retrieval or data linkage system, then the variability and error distribution will likely be larger than if the names to be matched come from one source only. This will also limit the use of trained name matching algorithms [2, 9, 31] that are adapted to deal with certain types of variations and errors. Having meta-data that describes the data entry process for all data to be used can be valuable when assessing data quality.

As discussed previously, while there is only one correct spelling for most general words, there are often no wrong name spellings, just several valid name variations. For this reason, in many cases it is not possible to disregard a name as wrong if it is not found in a dictionary of known names. When matching names, one has to deal with legitimate name variations (that should be preserved and matched), and errors introduced during data entry and recording (that should be corrected) [3]. The challenge lies in distinguishing between these two sources of variations.
3. Matching techniques

Name matching can be defined as the process of determining whether two name strings are instances of the same name [24]. As name variations and errors are quite common [11], exact name comparison will not result in good matching quality. Rather, an approximate measure of how similar two names are is desired. Generally, a normalised similarity measure between 1.0 (two names are identical) and 0.0 (two names are totally different) is used.

The two main approaches for matching names are phonetic encoding and pattern matching. Different techniques have been developed for both approaches, and several techniques combine the two with the aim of improving the matching quality. In the following three subsections we present the most commonly used as well as several recently proposed new techniques.

Matching two names can be viewed as an isolated problem or within a wider database or application context. Four different situations can be considered.

1. The matching of two names that consist of a single word each, not containing whitespaces or other separators like hyphens or commas. This is normally the situation when names have been parsed and segmented into components (individual words) [7], and all separators have been removed. Full names are split into their components and stored into fields like title, given name, middle name, surname and alternative surname. Parsing errors, however, can result in a name word being put into the wrong field, thereby increasing the likelihood of wrong matching.

2. Without proper parsing and segmentation a name (even if stored in two fields as given- and surname) can contain several words separated by a hyphen, apostrophe, whitespace or other character. Examples include compound given names, born surname and married name, name pre- and suffixes, and title words (like ‘Ms’, ‘Mr’ or ‘Dr’). In this situation, besides variations in a single word, parts of a name might be in a different order or missing, and there might be different separators. All this will complicate the name matching task.

3. In the first two situations names were matched individually without taking any context information into

Citations
More filters
Book
05 Jul 2012
TL;DR: Data matching (also known as record or data linkage, entity resolution, object identification, or field matching) is the task of identifying, matching and merging records that correspond to the same entities from several databases or even within one database as mentioned in this paper.
Abstract: Data matching (also known as record or data linkage, entity resolution, object identification, or field matching) is the task of identifying, matching and merging records that correspond to the same entities from several databases or even within one database. Based on research in various domains including applied statistics, health informatics, data mining, machine learning, artificial intelligence, database management, and digital libraries, significant advances have been achieved over the last decade in all aspects of the data matching process, especially on how to improve the accuracy of data matching, and its scalability to large databases. Peter Christens book is divided into three parts: Part I, Overview, introduces the subject by presenting several sample applications and their special challenges, as well as a general overview of a generic data matching process. Part II, Steps of the Data Matching Process, then details its main steps like pre-processing, indexing, field and record comparison, classification, and quality evaluation. Lastly, part III, Further Topics, deals with specific aspects like privacy, real-time matching, or matching unstructured data. Finally, it briefly describes the main features of many research and open source systems available today. By providing the reader with a broad range of data matching concepts and techniques and touching on all aspects of the data matching process, this book helps researchers as well as students specializing in data quality or data matching aspects to familiarize themselves with recent research advances and to identify open research challenges in the area of data matching. To this end, each chapter of the book includes a final section that provides pointers to further background and research material. Practitioners will better understand the current state of the art in data matching as well as the internal workings and limitations of current systems. 
Especially, they will learn that it is often not feasible to simply implement an existing off-the-shelf data matching system without substantial adaption and customization. Such practical considerations are discussed for each of the major steps in the data matching process.

713 citations

Journal ArticleDOI
TL;DR: A thorough overview and analysis of the main approaches to entity linking is presented, and various applications, the evaluation of entity linking systems, and future directions are discussed.
Abstract: The large number of potential applications from bridging web data with knowledge bases have led to an increase in the entity linking research. Entity linking is the task to link entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions.

702 citations


Cites background from "A Comparison of Personal Name Match..."

  • ...The ratio of the recursively longest common subsequence [98] to the shorter among the entity mention and the candidate entity name....

    [...]

Journal ArticleDOI
TL;DR: A survey of 12 variations of 6 indexing techniques for record linkage and deduplication aimed at reducing the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality is presented.
Abstract: Record linkage is the process of matching records from several databases that refer to the same entities. When applied on a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of 12 variations of 6 indexing techniques. Their complexity is analyzed, and their performance and scalability is evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.

663 citations


Cites background or methods from "A Comparison of Personal Name Match..."

  • ...7, the normalized edit-distance string measure [35], and a minimum similarity of t 1⁄4 0:85, then the following suffix string pairs and their corresponding record identifier lists will be merged into one block each: “atherina” and “atherine” (with similarity 0....

    [...]

  • ...The first one is made of Soundex (Sndx) encoded givenname (GiN) values concatenated with full postcode (PC) values, the second consists of the first two digits (Fi2D) of postcode values concatenated with Double-Metaphone (DMe) encoded surname (SurN) values, and the third is made of Soundex encoded suburb name (SubN) values concatenated with the last two digits (La2D) of postcode values....

    [...]
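The first blocking key described in this excerpt (a Soundex-encoded given-name value concatenated with the full postcode) can be sketched as follows. The Soundex implementation is a standard textbook version, and the record fields shown are purely illustrative:

```python
def soundex(name):
    """Standard Soundex: keep the first letter, encode remaining consonants
    as digits, drop adjacent repeats and vowels, pad to four characters."""
    codes = {}
    for letters, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                           ('l', '4'), ('mn', '5'), ('r', '6')]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    encoded, prev = '', codes.get(name[0], '')
    for ch in name[1:]:
        digit = codes.get(ch, '')
        if digit and digit != prev:
            encoded += digit
        if ch not in 'hw':   # 'h' and 'w' do not reset the previous code
            prev = digit
    return (name[0].upper() + encoded + '000')[:4]

# Illustrative record with hypothetical field names
record = {'given_name': 'Christine', 'surname': 'Smith', 'postcode': '2602'}
block_key = soundex(record['given_name']) + record['postcode']  # 'C6232602'
```

Records sharing a blocking key are placed in the same block, so only name pairs within a block need to be compared in detail.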

  • ...String fields such as names and addresses were phonetically encoded using the Double-Metaphone [35] algorithm....

    [...]

  • ...Four string similarity functions (Jaro-Winkler, bigram, edit-distance, and longest common substring) [35] were employed for the adaptive sorted neighborhood, the robust suffix array, and the string-map-based indexing techniques....

    [...]

  • ...For strings that contain personal names, for example, phonetic similarity can be obtained by using phonetic encoding functions such as Soundex, NYSIIS, or Double-Metaphone [35]....

    [...]

Journal ArticleDOI
01 Sep 2010
TL;DR: It is found that some challenging resolution tasks such as matching product entities from online shops are not sufficiently solved with conventional approaches based on the similarity of attribute values.
Abstract: Despite the huge amount of recent research efforts on entity resolution (matching) there has not yet been a comparative evaluation on the relative effectiveness and efficiency of alternate approaches. We therefore present such an evaluation of existing implementations on challenging real-world match tasks. We consider approaches both with and without using machine learning to find suitable parameterization and combination of similarity functions. In addition to approaches from the research community we also consider a state-of-the-art commercial entity resolution implementation. Our results indicate significant quality and efficiency differences between different approaches. We also find that some challenging resolution tasks such as matching product entities from online shops are not sufficiently solved with conventional approaches based on the similarity of attribute values.

436 citations


Cites background from "A Comparison of Personal Name Match..."

  • ...The authors of [8] present a comparison of FEBRL's string similarity functions for personal name data....

    [...]

Proceedings Article
23 Aug 2010
TL;DR: This work presents a state of the art system for entity disambiguation that not only addresses challenges but also scales to knowledge bases with several million entries using very little resources.
Abstract: The integration of facts derived from information extraction systems into existing knowledge bases requires a system to disambiguate entity mentions in the text. This is challenging due to issues such as non-uniform variations in entity names, mention ambiguity, and entities absent from a knowledge base. We present a state of the art system for entity disambiguation that not only addresses these challenges but also scales to knowledge bases with several million entries using very little resources. Further, our approach achieves performance of up to 95% on entities mentioned from newswire and 80% on a public test set that was designed to include challenging queries.

356 citations


Cites background from "A Comparison of Personal Name Match..."

  • ...included the ratio of the recursive longest common subsequence (Christen, 2006) to the shorter of the mention or entry name, which is effective at handling some deletions or word reorderings (e....

    [...]

References
More filters
Journal ArticleDOI
TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Abstract: We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices. We conclude with some directions for future work and open problems.

2,723 citations


"A Comparison of Personal Name Match..." refers background or methods in this paper

  • ...Levenshtein or Edit distance [19] is defined as the smallest number of edit operations (inserts, deletes and substitutions) required to change one string into another....

    [...]
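A minimal dynamic-programming sketch of this measure, with the Damerau-Levenshtein transposition variant mentioned in the next excerpt available as an optional flag (the function name is ours):

```python
def edit_distance(s1, s2, transpositions=False):
    """Smallest number of inserts, deletes and substitutions (and,
    optionally, adjacent transpositions) turning s1 into s2."""
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        d[i][0] = i
    for j in range(len(s2) + 1):
        d[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # substitute
            if (transpositions and i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[len(s1)][len(s2)]
```

The transposition option matters for names: `'martha'` versus `'marhta'` has plain edit distance 2 but Damerau-Levenshtein distance 1, since the swapped `'th'` counts as a single operation.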

  • ...Damerau-Levenshtein distance is a variation of edit distance where a transposition of two characters is also considered to be an elementary edit operation [8, 19]....

    [...]

  • ...Pattern matching techniques are commonly used in approximate string matching [12, 14, 19], which has widespread applications, from data linkage and duplicate detection [2, 5, 7, 24], information retrieval [11, 15, 25], correction of spelling errors [8, 16, 23], to bio- and health informatics [9]....

    [...]

Journal ArticleDOI
Fred J. Damerau
TL;DR: The method described assumes that a word which cannot be found in a dictionary has at most one error, which might be a wrong, missing or extra letter or a single transposition.
Abstract: The method described assumes that a word which cannot be found in a dictionary has at most one error, which might be a wrong, missing or extra letter or a single transposition. The unidentified input word is compared to the dictionary again, testing each time to see if the words match—assuming one of these errors occurred. During a test run on garbled text, correct identifications were made for over 95 percent of these error types.

1,591 citations


"A Comparison of Personal Name Match..." refers background or methods in this paper

  • ...Damerau-Levenshtein distance is a variation of edit distance where a transposition of two characters is also considered to be an elementary edit operation [8, 19]....

    [...]

  • ...An early study [8] on spelling errors in general words found that over 80% of errors were single character errors (inserts, deletes, or substitutions)....

    [...]

  • ...Pattern matching techniques are commonly used in approximate string matching [12, 14, 19], which has widespread applications, from data linkage and duplicate detection [2, 5, 7, 24], information retrieval [11, 15, 25], correction of spelling errors [8, 16, 23], to bio- and health informatics [9]....

    [...]

Journal ArticleDOI
TL;DR: Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. This article surveys documented findings on spelling error patterns.
Abstract: Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. In response to the first problem, efficient pattern-matching and n-gram analysis techniques have been developed for detecting strings that do not appear in a given word list. In response to the second problem, a variety of general and application-specific spelling correction techniques have been developed. Some of them were based on detailed studies of spelling error patterns. In response to the third problem, a few experiments using natural-language-processing tools or statistical-language models have been carried out. This article surveys documented findings on spelling error patterns, provides descriptions of various nonword detection and isolated-word error correction techniques, reviews the state of the art of context-dependent word correction techniques, and discusses research issues related to all three areas of automatic error correction in text.

1,417 citations


"A Comparison of Personal Name Match..." refers background or methods in this paper

  • ...Pattern matching techniques are commonly used in approximate string matching [12, 14, 19], which has widespread applications, from data linkage and duplicate detection [2, 5, 7, 24], information retrieval [11, 15, 25], correction of spelling errors [8, 16, 23], to bio- and health informatics [9]....

    [...]

  • ...Besides the variations in personal names discussed above, the nature of data entry [16] will determine the most likely types of errors and their distributions....

    [...]

  • ...Other studies [12, 16, 23] reported similar results....

    [...]

  • ...In [16] character level (or non-word) misspellings are classified into (1) typographical errors, where it is assumed that the person doing the data entry does know the correct spelling of a word but makes a typing error (e....

    [...]

  • ...Q-grams [16] are sub-strings of length q....

    [...]
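Q-gram extraction as described in this excerpt can be sketched in a few lines. Padding with a special start/end character (a common refinement, shown here as an option) gives the first and last letters of a name extra weight:

```python
def qgrams(s, q=2, padded=True):
    """Return the list of sub-strings of length q; with padding, a special
    character is prepended and appended so end letters appear in q grams."""
    if padded:
        s = '#' * (q - 1) + s + '#' * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]
```

For example, the bigrams (q = 2) of `'peter'` with padding are `'#p'`, `'pe'`, `'et'`, `'te'`, `'er'` and `'r#'`.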

Proceedings Article
09 Aug 2003
TL;DR: Using an open-source, Java toolkit of name-matching methods, the authors experimentally compare string distance metrics on the task of matching entity names and find that the best performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme.
Abstract: Using an open-source, Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods Overall, the best-performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme, which was developed in the probabilistic record linkage community

1,355 citations

Journal ArticleDOI
Ming Li, Xin Chen, Xin Li, Bin Ma, Paul M. B. Vitányi
15 Sep 2003
TL;DR: Evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors is reported.
Abstract: We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.

1,087 citations

Frequently Asked Questions (10)
Q1. What are the contributions mentioned in the paper "A comparison of personal name matching: techniques and practical issues" ?

In this paper the authors discuss the characteristics of personal names and present potential sources of variations and errors. The authors provide a series of recommendations that will help researchers and practitioners to select a name matching technique suitable for a given data set. 

If a string contains more than one word (i.e. it contains at least one whitespace or other separator), then the words are first sorted alphabetically before the Winkler technique is applied (to the full strings). 
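The sorted-Winkler approach just described can be sketched as follows. This is a minimal illustration: `jaro` and `winkler` are textbook versions of those comparators, and all function names are ours:

```python
def jaro(s1, s2):
    """Jaro similarity: fraction of agreeing characters within a sliding
    window, penalised by the number of transpositions."""
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t = sum(a != b for a, b in zip(
        (c for i, c in enumerate(s1) if m1[i]),
        (c for j, c in enumerate(s2) if m2[j]))) / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def winkler(s1, s2, p=0.1):
    """Boost the Jaro similarity for up to four agreeing prefix characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def sorted_winkler(name1, name2):
    """Sort the words of multi-word names alphabetically, then compare the
    re-joined full strings with the Winkler technique."""
    return winkler(' '.join(sorted(name1.split())),
                   ' '.join(sorted(name2.split())))
```

Sorting first makes the comparison insensitive to word order, so `'gail west'` and `'west gail'` compare as identical strings.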

When handwritten forms are scanned and optical character recognition (OCR) is applied [15, 27], the most likely types of errors will be substitutions between similar looking characters (like ‘q’ and ‘g’), or substitutions of one character with a similar looking character sequence (like ‘m’ and ‘r n’, or ‘b’ and ‘l i’).

The authors also extracted single names from records that did not have duplicates, and randomly created name pairs (the same number as known matched pairs in order to get balanced test data sets). 

The time needed to determine if two names match is crucial for the overall performance of an application (besides data structures that allow candidate name pairs to be extracted efficiently while filtering out likely non-matches [23]). 

As it allows for gaps, the Smith-Waterman algorithm should be especially suited for compound names that contain initials only or abbreviated names. 

A q-gram similarity measure between two strings is calculated by counting the number of q-grams in common (i.e. q-grams contained in both strings) and dividing by either the number of q-grams in the shorter string (called the Overlap coefficient), the number in the longer string (called the Jaccard similarity) or the average number of q-grams in both strings (called the Dice coefficient). 
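The three normalisations just described can be sketched in one function. Note that "Jaccard" here follows the definition quoted above (dividing by the longer string's q-gram count), which differs from the usual set-based Jaccard coefficient; the function name is ours:

```python
def qgram_similarity(s1, s2, q=2, coeff='dice'):
    """Count q-grams common to both strings (with multiplicity) and
    normalise by the shorter string (overlap), the longer string
    (Jaccard, as defined in the text above) or the average (Dice)."""
    g1 = [s1[i:i + q] for i in range(len(s1) - q + 1)]
    g2 = [s2[i:i + q] for i in range(len(s2) - q + 1)]
    common, rest = 0, list(g2)
    for g in g1:
        if g in rest:
            rest.remove(g)
            common += 1
    if coeff == 'overlap':
        return common / min(len(g1), len(g2))
    if coeff == 'jaccard':
        return common / max(len(g1), len(g2))
    return 2 * common / (len(g1) + len(g2))   # Dice
```

For `'gail'` and `'gayle'` only the bigram `'ga'` is shared, giving Overlap 1/3, Jaccard 1/4 and Dice 2/7, which shows how the choice of denominator shifts the score.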

Increasingly large amounts of data are being created, communicated and stored by many individuals, organisations and businesses on a daily basis. 

As expected, the Bag distance is very fast (followed by simple q-grams), making it suitable as a filtering technique to remove obvious non-matches. 
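A minimal sketch of this filtering measure. The bag distance is a cheap lower bound on edit distance, so any pair whose bag distance already exceeds a threshold can be discarded without running the more expensive edit-distance computation:

```python
from collections import Counter

def bag_distance(s1, s2):
    """Bag distance: the larger of the two multiset differences of the
    strings' characters; never exceeds the true edit distance."""
    b1, b2 = Counter(s1), Counter(s2)
    return max(sum((b1 - b2).values()), sum((b2 - b1).values()))
```

For `'gail'` versus `'gayle'` the bag distance is 2, here coinciding with the edit distance; for anagrams such as `'abc'` and `'cba'` it drops to 0 even though the edit distance is 2, which is why it can only serve as a filter, not a final comparator.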

Their experiments with skip-grams using multi-lingual texts from different European languages show improved results compared to bigrams, trigrams, edit distance and the longest common sub-string technique.