A Comparison of Personal Name Matching: Techniques and Practical Issues
Summary (3 min read)
1. Introduction
- Increasingly large amounts of data are being created, communicated and stored by many individuals, organisations and businesses on a daily basis.
- A lot of this data contains some information about people, for example e-mails, customer and patient records, news articles, business and political memorandums.
- As reported in [28], the use of approximate comparison methods does improve the matching quality in these applications.
- Personal names have characteristics that make them different to general text.
- While similar comparison studies on matching techniques have been done in the past [9, 17, 20, 25, 32, 34], none has analysed and compared such a comprehensive number of techniques specifically with application to personal names.
2. Personal name characteristics
- Even when only considering the English-speaking world, a name can have several different spelling forms for a variety of reasons.
- In the Anglo-Saxon region and most other Western countries, a personal name is usually made of a given name, an optional middle name, and a surname or family name [24].
- Other specific errors were differences in punctuation marks and whitespaces (for example ‘O’Connor’, ‘OConnor’ and ‘O Connor’) in 12% of errors, and different last names for female patients (8% of errors).
- Thus, there seem to be significant differences between general text and personal names, which have to be considered when name matching algorithms are being developed and used.
2.1 Sources of name variations
- Besides the variations in personal names discussed above, the nature of data entry [19] will determine the most likely types of errors and their distribution.
- Manual keyboard based data entry can result in wrongly typed neighbouring keys (for example ‘n’ and ‘m’, or ‘e’ and ‘r’).
- Finally, people themselves sometimes report their names differently depending upon the organisation they are in contact with, or deliberately provide wrong or modified names.
- When matching names, one has to deal with legitimate name variations (that should be preserved and matched), and errors introduced during data entry and recording (that should be corrected) [3].
- The challenge lies in distinguishing between these two sources of variations.
3. Matching techniques
- Name matching can be defined as the process of determining whether two name strings are instances of the same name [24].
- In the following three subsections the authors present the most commonly used as well as several recently proposed new techniques.
- Without proper parsing and segmentation a name (even if stored in two fields as given- and surname) can contain several words separated by a hyphen, apostrophe, whitespace or other character.
- Frequency distributions of name values can also be used to improve the quality of name matching.
- The authors consider only the basic techniques used to compare two names, without taking any context information into account.
3.1 Phonetic encoding
- Most techniques – including all presented here – have been developed mainly with English in mind.
- The transformed name string is then encoded into a one-letter, three-digit code (again removing zeros and duplicate numbers) using an encoding table.
- It contains many rules that take the position within a name, as well as previous and following letters into account (similar to Phonix).
- When matching names, phonetic encoding can be used as a filtering step (called blocking in data linkage [6, 30]), i.e. only names having the same phonetic code will be compared using a computationally more expensive pattern matching algorithm.
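The filtering step can be sketched as follows. The encoding table is not reproduced in this summary, so the sketch assumes the commonly documented American Soundex digit groups; `soundex` and `block_by_code` are illustrative names and are not taken from the paper or from Febrl.

```python
from collections import defaultdict

# Assumed digit groups (common American Soundex); vowels, h, w and y get no digit.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a one-letter, three-digit code for a name (empty string if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    # Pad with zeros and truncate to the letter plus three digits.
    return (first.upper() + "".join(digits) + "000")[:4]


def block_by_code(names):
    """Group names by phonetic code; only names within a block are compared further."""
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(name)].append(name)
    return dict(blocks)
```

Only names that share a block would then be passed to a more expensive pattern matching comparison. Because the first letter is kept as is, names that differ in their initial letter end up in different blocks under this sketch.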
3.2 Pattern matching
- A similarity measure can be calculated by dividing the total length of the common sub-strings by the minimum, maximum or average lengths of the two original strings (similar to Smith-Waterman above).
- Positional q-grams can be padded with start and end characters similar to non-positional q-grams, and similarity measures can be calculated in the same three ways as with non-positional q-grams.
- The Winkler algorithm therefore increases the Jaro similarity measure for agreeing initial characters (up to four).
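The Winkler adjustment can be illustrated with a minimal sketch. It assumes the commonly used form of the Jaro measure (matching window of half the longer length minus one) and a prefix scaling factor of 0.1; the function names and the default `p` are assumptions for illustration, not details taken from the paper.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1] based on common characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    used = [False] * len2          # characters of s2 already matched
    matched1 = []                  # matched characters in s1 order
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    matched2 = [s2[j] for j in range(len2) if used[j]]  # matched characters in s2 order
    t = sum(a != b for a, b in zip(matched1, matched2)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Increase the Jaro similarity for agreeing initial characters (up to max_prefix)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)
```

For example, `jaro("martha", "marhta")` evaluates to 17/18 (about 0.944), and the three agreeing initial characters raise the Winkler value to about 0.961.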
3.3 Combined techniques
- Two techniques combine phonetic encoding and pattern matching with the aim of improving the matching quality.
- The edit costs in Editex are 0 if two letters are the same, 1 if they are in the same letter group, and 2 otherwise.
- Similar to basic edit distance, the time and space complexities of matching two strings s1 and s2 with Editex are O(|s1|×|s2|) and O(min(|s1|, |s2|)), respectively.
- This recently developed technique, called Syllable Alignment Pattern Searching (SAPS) [13], is based on the idea of matching two names syllable by syllable, rather than character by character.
- The experimental results presented in [13] indicate that SAPS performs better than Editex, edit distance and Soundex on the same large name data set used in [25] (the COMPLETE data set the authors are using in their experiments as well).
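The 0/1/2 substitution costs described for Editex can be illustrated with a simplified weighted edit distance. This is a sketch rather than the full Editex algorithm: the letter groups below are recalled from the Editex literature and should be checked against the original description, insertions and deletions are given a fixed cost of 2 (an assumption), and any special handling of individual letters is omitted. The names `editex_like` and `_GROUPS` are hypothetical.

```python
# Assumed letter groups; letters in the same group sound similar.
_GROUPS = ("aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fpv", "sxz", "csz")


def _cost(a: str, b: str) -> int:
    """0 for identical letters, 1 for letters sharing a group, 2 otherwise."""
    if a == b:
        return 0
    if any(a in g and b in g for g in _GROUPS):
        return 1
    return 2


def editex_like(s1: str, s2: str, indel: int = 2) -> int:
    """Weighted edit distance with Editex-style substitution costs (simplified)."""
    s1, s2 = s1.lower(), s2.lower()
    if len(s2) > len(s1):
        s1, s2 = s2, s1  # keep the shorter string along the row: O(min) space
    prev = [j * indel for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, start=1):
        curr = [i * indel]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + indel,             # delete a
                            curr[j - 1] + indel,         # insert b
                            prev[j - 1] + _cost(a, b)))  # substitute or match
        prev = curr
    return prev[-1]
```

Keeping only two rows of the dynamic programming table gives the O(min(|s1|, |s2|)) space bound, while the nested loops give the O(|s1|×|s2|) time bound.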
4. Experiments and discussion
- In this section the authors discuss the results of a series of comparison experiments using four large name data sets.
- The aim of these experiments was to see which matching techniques achieve the best matching quality for different personal name types, and to compare their computational performance.
- All name matching techniques were implemented in Python as part of the Febrl (Freely Extensible Biomedical Record Linkage) data linkage system [5].
4.1 Name data sets
- Three of the test data sets were based on given- and surnames extracted from a health data set containing midwives’ records (women who gave birth) from the Australian state of New South Wales [4].
- A deduplication status in this data (indicating which records correspond to the same women) allowed the authors to extract true name pairs (known matches).
- The authors then created a full name data set by concatenating given- with surnames (separated by a whitespace).
- The fourth data set was created in a similar way using the COMPLETE name database [13, 25] by forming surname pairs from 90 randomly chosen and manually matched queries.
- Table 2 shows the size of their four test data sets.
4.2 Distribution of edit distances
- In order to better understand their test data, the authors calculated the edit distances for all the known name pairs.
- This indicates the challenge of name matching: how to correctly classify two names that are very different. (Febrl: http://datamining.anu.edu.au/linkage.html)
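A distribution of edit distances over known name pairs could be computed along the following lines. This is a minimal sketch using the basic edit distance (inserts, deletes and substitutions, each with cost 1); the function names and the example pairs are made up for illustration and are not drawn from the paper's data sets.

```python
from collections import Counter


def edit_distance(s1: str, s2: str) -> int:
    """Smallest number of inserts, deletes and substitutions turning s1 into s2."""
    if len(s2) > len(s1):
        s1, s2 = s2, s1
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,              # delete
                            curr[j - 1] + 1,          # insert
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]


def distance_distribution(pairs):
    """Count how many name pairs fall at each edit distance."""
    return Counter(edit_distance(a, b) for a, b in pairs)


# Hypothetical known matches, for illustration only.
pairs = [("peter", "pete"), ("gail", "gayle"), ("smith", "smith")]
print(sorted(distance_distribution(pairs).items()))
```

Pairs at large distances are the ones that are hard to classify correctly with a simple distance threshold.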
4.3 Matching results
- The two techniques that combine phonetic encoding with pattern matching (Editex and syllable alignment distance) do not perform as well as one might have expected, and neither do skip-grams.
- Details for the best performing pattern matching techniques on the four data sets can be seen in Figure 1.
- An optimal value for one data set and technique will very likely result in sub-optimal quality for another data set or technique.
4.4 Timing results
- As shown in Table 6, the phonetic encoding techniques (times shown include encoding of two names) are generally much faster than pattern matching, due to their complexity being O(|s|) for a given string s.
- Phonix, with its many rules, is the slowest phonetic technique (almost ten times as slow as the others), while Smith-Waterman is the slowest pattern matching technique.
5. Recommendations
- The mixed results presented in the previous section indicate that there is no single best name matching technique, and that the type of personal name data to be matched has to be considered when selecting a matching technique.
- The following recommendations will help with this.
- It is important to know the type of names to be matched, and if these names have been properly parsed and standardised [7], or if the name data potentially contains several words with various separators.
- Phonetic encoding followed by exact comparison of the phonetic codes should not be used.
- Even small changes of the threshold can result in dramatic drops in matching quality.
Citations
713 citations
702 citations
Cites background from "A Comparison of Personal Name Match..."
...The ratio of the recursively longest common subsequence [98] to the shorter among the entity mention and the candidate entity name....
[...]
663 citations
Cites background or methods from "A Comparison of Personal Name Match..."
...7, the normalized edit-distance string measure [35], and a minimum similarity of t = 0.85, then the following suffix string pairs and their corresponding record identifier lists will be merged into one block each: “atherina” and “atherine” (with similarity 0....
[...]
...The first one is made of Soundex (Sndx) encoded givenname (GiN) values concatenated with full postcode (PC) values, the second consists of the first two digits (Fi2D) of postcode values concatenated with Double-Metaphone (DMe) encoded surname (SurN) values, and the third is made of Soundex encoded suburb name (SubN) values concatenated with the last two digits (La2D) of postcode values....
[...]
...String fields such as names and addresses were phonetically encoded using the Double-Metaphone [35] algorithm....
[...]
...Four string similarity functions (Jaro-Winkler, bigram, edit-distance, and longest common substring) [35] were employed for the adaptive sorted neighborhood, the robust suffix array, and the string-map-based indexing techniques....
[...]
...For strings that contain personal names, for example, phonetic similarity can be obtained by using phonetic encoding functions such as Soundex, NYSIIS, or Double-Metaphone [35]....
[...]
436 citations
Cites background from "A Comparison of Personal Name Match..."
...The authors of [8] present a comparison of FEBRL's string similarity functions for personal name data....
[...]
356 citations
Cites background from "A Comparison of Personal Name Match..."
...included the ratio of the recursive longest common subsequence (Christen, 2006) to the shorter of the mention or entry name, which is effective at handling some deletions or word reorderings (e....
[...]
References
2,723 citations
"A Comparison of Personal Name Match..." refers background or methods in this paper
...Levenshtein or Edit distance [19] is defined as the smallest number of edit operations (inserts, deletes and substitutions) required to change one string into another....
[...]
...Damerau-Levenshtein distance is a variation of edit distance where a transposition of two characters is also considered to be an elementary edit operation [8, 19]....
[...]
...Pattern matching techniques are commonly used in approximate string matching [12, 14, 19], which has widespread applications, from data linkage and duplicate detection [2, 5, 7, 24], information retrieval [11, 15, 25], correction of spelling errors [8, 16, 23], to bio- and health informatics [9]....
[...]
1,591 citations
"A Comparison of Personal Name Match..." refers background or methods in this paper
...Damerau-Levenshtein distance is a variation of edit distance where a transposition of two characters is also considered to be an elementary edit operation [8, 19]....
[...]
...An early study [8] on spelling errors in general words found that over 80% of errors were single character errors (inserts, deletes, or substitutions)....
[...]
...Pattern matching techniques are commonly used in approximate string matching [12, 14, 19], which has widespread applications, from data linkage and duplicate detection [2, 5, 7, 24], information retrieval [11, 15, 25], correction of spelling errors [8, 16, 23], to bio- and health informatics [9]....
[...]
1,417 citations
"A Comparison of Personal Name Match..." refers background or methods in this paper
...Pattern matching techniques are commonly used in approximate string matching [12, 14, 19], which has widespread applications, from data linkage and duplicate detection [2, 5, 7, 24], information retrieval [11, 15, 25], correction of spelling errors [8, 16, 23], to bio- and health informatics [9]....
[...]
...Besides the variations in personal names discussed above, the nature of data entry [16] will determine the most likely types of errors and their distributions....
[...]
...Other studies [12, 16, 23] reported similar results....
[...]
...In [16] character level (or non-word) misspellings are classified into (1) typographical errors, where it is assumed that the person doing the data entry does know the correct spelling of a word but makes a typing error (e....
[...]
...Q-grams [16] are sub-strings of length q....
[...]
1,355 citations
1,087 citations
Frequently Asked Questions (10)
Q2. What is the method for a string to be sorted?
If a string contains more than one word (i.e. it contains at least one whitespace or other separator), then the words are first sorted alphabetically before the Winkler technique is applied (to the full strings).
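The word-sorting step can be sketched as follows. To keep the example self-contained, the similarity function is a stand-in: `difflib.SequenceMatcher.ratio()` from the Python standard library is used where the text applies the Winkler technique. The function names, the separator set (whitespace and hyphens) and the lower-casing are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def sorted_form(name: str) -> str:
    """Split a name on whitespace or hyphens and rejoin the words in alphabetical order."""
    words = [w for w in re.split(r"[\s\-]+", name.strip().lower()) if w]
    return " ".join(sorted(words))


def sorted_similarity(n1: str, n2: str) -> float:
    """Compare the full, word-sorted strings (stand-in measure, not Winkler)."""
    return SequenceMatcher(None, sorted_form(n1), sorted_form(n2)).ratio()
```

With this normalisation, "peter smith" and "smith peter" are compared as the same string, so swapped given names and surnames no longer lower the similarity.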
Q3. What are the likely types of errors in handwritten forms?
When handwritten forms are scanned and optical character recognition (OCR) is applied [15, 27], the most likely types of errors will be substitutions between similar looking characters (like ‘q’ and ‘g’), or substitutions of one character with a similar looking character sequence (like ‘m’ and ‘r n’, or ‘b’ and ‘l i’).
Q4. How did the authors extract the names from the records that did not have duplicates?
The authors also extracted single names from records that did not have duplicates, and randomly created name pairs (the same number as known matched pairs in order to get balanced test data sets).
Q5. What is the importance of determining if two names match?
The time needed to determine if two names match is crucial for the overall performance of an application (besides data structures that allow candidate name pairs to be extracted efficiently while filtering out likely non-matches [23]).
Q6. What is the algorithm for a name that contains initials only?
As it allows for gaps, the Smith-Waterman algorithm should be especially suited for compound names that contain initials only or abbreviated names.
Q7. What is the q-gram similarity measure between two strings?
A q-gram similarity measure between two strings is calculated by counting the number of q-grams in common (i.e. q-grams contained in both strings) and dividing by either the number of q-grams in the shorter string (called Overlap coefficient), the number in the longer string (called Jaccard similarity) or the average number of q-grams in both strings (called the Dice coefficient).
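The three normalisations can be sketched as below. The padding characters `#` and `$`, the function names and the `norm` parameter values are assumptions for illustration; q-grams are counted as multisets so that repeated q-grams are handled consistently.

```python
from collections import Counter


def qgrams(s: str, q: int = 2, pad: bool = True) -> Counter:
    """Return the multiset of q-grams of s, optionally padded at start and end."""
    if pad:
        s = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_similarity(s1: str, s2: str, q: int = 2, norm: str = "average") -> float:
    """Common q-grams divided by the count in the shorter, longer or average string."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    n1, n2 = sum(g1.values()), sum(g2.values())
    common = sum((g1 & g2).values())  # multiset intersection
    denominators = {
        "shorter": min(n1, n2),       # Overlap coefficient in the text
        "longer": max(n1, n2),        # called Jaccard similarity in the text
        "average": (n1 + n2) / 2,     # Dice coefficient in the text
    }
    denom = denominators[norm]
    return common / denom if denom else 0.0
```

For "peter" and "pete" with padded bigrams, four bigrams are shared out of six and five respectively, giving 0.8 for the shorter-string normalisation, about 0.667 for the longer-string one and about 0.727 for the average.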
Q8. How many people are creating, communicating and storing data?
Increasingly large amounts of data are being created, communicated and stored by many individuals, organisations and businesses on a daily basis.
Q9. What is the technique for removing obvious non-matches?
As expected, the Bag distance is very fast (followed by simple q-grams), making it suitable as a filtering technique to remove obvious non-matches.
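The use of bag distance as a filter can be sketched as follows, assuming the usual definition over character multisets (the larger of the two multiset differences). Under that definition the bag distance never exceeds the edit distance, so pairs above a chosen limit can be discarded without running a more expensive comparison. The function names and the `max_distance` parameter are hypothetical.

```python
from collections import Counter


def bag_distance(s1: str, s2: str) -> int:
    """Larger of the two multiset differences between the characters of s1 and s2."""
    b1, b2 = Counter(s1), Counter(s2)
    return max(sum((b1 - b2).values()), sum((b2 - b1).values()))


def filter_candidates(pairs, max_distance: int):
    """Keep only pairs that could still be within max_distance edits of each other."""
    return [(a, b) for a, b in pairs if bag_distance(a, b) <= max_distance]
```

Each call only counts characters, so the cost grows linearly with the string lengths rather than with their product.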
Q10. What are the results of their experiments with skip-grams?
Their experiments with skip-grams using multi-lingual texts from different European languages show improved results compared to bigrams, trigrams, edit distance and the longest common sub-string technique.