Proceedings Article

A Comparison of Personal Name Matching: Techniques and Practical Issues

Peter Christen

18 Dec 2006, pp. 290-294

TL;DR: The characteristics of personal names are discussed, potential sources of variations and errors are presented, and a comprehensive number of commonly used, as well as some recently developed, name matching techniques are reviewed.

Abstract: Finding and matching personal names is at the core of an increasing number of applications: from text and Web mining, search engines, to information extraction, deduplication and data linkage systems. Variations and errors in names make exact string matching problematic, and approximate matching techniques have to be applied. When compared to general text, however, personal names have different characteristics that need to be considered. In this paper we discuss the characteristics of personal names and present potential sources of variations and errors. We then overview a comprehensive number of commonly used, as well as some recently developed name matching techniques. Experimental comparisons using four large name data sets indicate that there is no clear best matching technique.

Summary (3 min read)

1. Introduction

Increasingly large amounts of data are being created, communicated and stored by many individuals, organisations and businesses on a daily basis.
A lot of this data contains some information about people, for example e-mails, customer and patient records, news articles, business and political memorandums.
As reported in [28], the use of approximate comparison methods does improve the matching quality in these applications.
Personal names have characteristics that make them different to general text.
While similar comparison studies on matching techniques have been done in the past [9, 17, 20, 25, 32, 34], none has analysed and compared such a comprehensive number of techniques specifically with application to personal names.

2. Personal name characteristics

Even when only considering the English-speaking world, a name can have several different spelling forms for a variety of reasons.
In the Anglo-Saxon region and most other Western countries, a personal name is usually made of a given name, an optional middle name, and a surname or family name [24].
Other specific errors were differences in punctuation marks and whitespaces (for example ‘O’Connor’, ‘OConnor’ and ‘O Connor’) in 12% of errors, and different last names for female patients (8% of errors).
Thus, there seem to be significant differences between general text and personal names, which have to be considered when name matching algorithms are being developed and used.

2.1 Sources of name variations

Besides the variations in personal names discussed above, the nature of data entry [19] will determine the most likely types of errors and their distribution.
Manual keyboard based data entry can result in wrongly typed neighbouring keys (for example ‘n’ and ‘m’, or ‘e’ and ‘r’).
Finally, people themselves sometimes report their names differently depending upon the organisation they are in contact with, or deliberately provide wrong or modified names.
When matching names, one has to deal with legitimate name variations (that should be preserved and matched), and errors introduced during data entry and recording (that should be corrected) [3].
The challenge lies in distinguishing between these two sources of variations.

3. Matching techniques

Name matching can be defined as the process of determining whether two name strings are instances of the same name [24].
In the following three subsections the authors present the most commonly used as well as several recently proposed new techniques.
Without proper parsing and segmentation a name (even if stored in two fields as given- and surname) can contain several words separated by a hyphen, apostrophe, whitespace or other character.
Frequency distributions of name values can also be used to improve the quality of name matching.
That is, only the basic techniques used to compare two names, without taking any context information into account, are considered.

3.1 Phonetic encoding

Most techniques – including all presented here – have been developed mainly with English in mind.
The transformed name string is then encoded into a one-letter three-digits code (again removing zeros and duplicate numbers) using the following encoding table.
It contains many rules that take the position within a name, as well as previous and following letters into account (similar to Phonix).
When matching names, phonetic encoding can be used as a filtering step (called blocking in data linkage [6, 30]), i.e. only names having the same phonetic code will be compared using a computationally more expensive pattern matching algorithm.
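The blocking idea can be sketched as follows. This is a minimal illustration, assuming the classic Soundex digit table (the encoding table the summary refers to is not reproduced on this page, so the table here is an assumption) and a hypothetical helper `block_by_code`. Only names that land in the same block would then be passed to a more expensive pattern matching comparison.

```python
from collections import defaultdict

# Classic Soundex digit table (assumption: not taken from the source text).
_SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(name: str) -> str:
    """Return a one-letter, three-digit Soundex code, or '' for empty input."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _SOUNDEX_CODES.get(first, "")
    for c in letters[1:]:
        code = _SOUNDEX_CODES.get(c, "")
        # Keep a digit only if it differs from the previous letter's digit.
        if code and code != prev:
            digits.append(code)
        # 'h' and 'w' do not separate duplicate digits; vowels do.
        if c not in "hw":
            prev = code
    # Pad with zeros and truncate to letter + three digits.
    return (first.upper() + "".join(digits) + "000")[:4]


def block_by_code(names):
    """Group names by phonetic code (hypothetical helper for blocking)."""
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(name)].append(name)
    return dict(blocks)


blocks = block_by_code(["Gail", "Gayle", "Gale", "Robert", "Rupert", "Vest", "West"])
```

In this toy run 'Gail', 'Gayle' and 'Gale' share one block, while 'Vest' and 'West' end up in different blocks because Soundex keeps the first letter. A variation in the initial character would therefore never be compared after blocking.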

3.2 Pattern matching

A similarity measure can be calculated by dividing the total length of the common sub-strings by the minimum, maximum or average lengths of the two original strings (similar to Smith-Waterman above).
Positional q-grams can be padded with start and end characters similar to non-positional q-grams, and similarity measures can be calculated in the same three ways as with non-positional q-grams.
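As a rough sketch of the non-positional base case, the following code pads a string, collects its q-grams and divides the number of common q-grams by the count in the shorter string, the longer string, or the average of both. The function names, the padding characters '#' and '$', and the `norm` parameter are assumptions made for illustration; the positional variant is not covered here.

```python
from collections import Counter


def qgrams(s, q=2, pad=True, start="#", end="$"):
    """Return a multiset (Counter) of the q-grams in s, optionally padded."""
    if pad:
        s = start * (q - 1) + s + end * (q - 1)
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_similarity(s1, s2, q=2, norm="average"):
    """Common q-grams divided by the shorter, longer or average q-gram count."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    n1, n2 = sum(g1.values()), sum(g2.values())
    if n1 == 0 or n2 == 0:
        return 0.0
    # Multiset intersection counts each shared q-gram min(count1, count2) times.
    common = sum((g1 & g2).values())
    denominators = {
        "shorter": min(n1, n2),
        "longer": max(n1, n2),
        "average": (n1 + n2) / 2,
    }
    return common / denominators[norm]
```

For 'gail' and 'gayle' with padded bigrams, two bigrams ('#g' and 'ga') are shared out of five and six respectively, so the three normalisations give 2/5, 2/6 and 2/5.5.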
The Winkler algorithm therefore increases the Jaro similarity measure for agreeing initial characters (up to four).
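The prefix adjustment can be sketched as below. This is a minimal version of the commonly described Jaro and Winkler formulas, not necessarily the exact variant evaluated in the paper; the scaling factor `p=0.1` is a widely used default and is an assumption here, while the cap of four prefix characters follows the text.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1] based on common characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as common only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count common characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Increase the Jaro similarity for agreeing initial characters (up to four)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

Because the bonus is scaled by (1 - j), the adjusted value never exceeds 1.0, and two names with no agreeing initial characters keep their plain Jaro similarity.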

3.3 Combined techniques

Two techniques combine phonetic encoding and pattern matching with the aim to improve the matching quality.
The edit costs in Editex are 0 if two letters are the same, 1 if they are in the same letter group, and 2 otherwise.
Similar to basic edit distance, the time and space complexities of matching two strings s1 and s2 with Editex are O(|s1|×|s2|) and O(min(|s1|, |s2|)), respectively.
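The cost scheme and the stated complexities can be illustrated with a simplified sketch. It is not the full Editex algorithm: the letter groups below are illustrative assumptions rather than a table taken from this page, and insertions and deletions are charged a flat cost of 2, which is a simplification. Keeping only two rows of the dynamic programming table, sized by the shorter string, gives the O(min(|s1|, |s2|)) space bound.

```python
# Illustrative letter groups (assumption, not reproduced from the source).
_GROUPS = ["aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fpv", "sxz", "csz"]


def _substitution_cost(a, b):
    """0 for identical letters, 1 for letters sharing a group, 2 otherwise."""
    if a == b:
        return 0
    if any(a in group and b in group for group in _GROUPS):
        return 1
    return 2


def editex_simplified(s1, s2, indel_cost=2):
    """Edit distance with group-aware substitution costs (simplified sketch)."""
    s1, s2 = s1.lower(), s2.lower()
    # Make s2 the shorter string so each row has min(|s1|, |s2|) + 1 entries.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    prev = [j * indel_cost for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, 1):
        curr = [i * indel_cost]
        for j, b in enumerate(s2, 1):
            curr.append(min(
                prev[j] + indel_cost,                    # delete a
                curr[j - 1] + indel_cost,                # insert b
                prev[j - 1] + _substitution_cost(a, b),  # substitute a with b
            ))
        prev = curr
    return prev[-1]
```

With these assumed groups, 'Meier' and 'Meyer' are at distance 1 because 'i' and 'y' share a group, whereas a substitution between unrelated letters costs 2.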
This recently developed technique, called Syllable Alignment Pattern Searching (SAPS) [13], is based on the idea of matching two names syllable by syllable, rather than character by character.
The experimental results presented in [13] indicate that SAPS performs better than Editex, edit distance and Soundex on the same large name data set used in [25] (the COMPLETE data set the authors are using in their experiments as well).

4 Experiments and discussion

In this section the authors discuss the results of a series of comparison experiments using four large name data sets.
The aim of these experiments was to see which matching techniques achieve the best matching quality for different personal name types, and to compare their computational performance.
All name matching techniques were implemented in Python as part of the Febrl (Freely Extensible Biomedical Record Linkage) data linkage system [5] (http://datamining.anu.edu.au/linkage.html).

4.1 Name data sets

Three of the test data sets were based on given- and surnames extracted from a health data set containing midwives’ records (women who gave birth) from the Australian state of New South Wales [4].
A deduplication status in this data (indicating which records correspond to the same women) allowed us to extract true name pairs (known matches).
The authors then created a full name data set by concatenating given- with surnames (separated by a whitespace).
The fourth data set was created in a similar way using the COMPLETE name database [13, 25] by forming surname pairs from 90 randomly chosen and manually matched queries.
Table 2 shows the size of their four test data sets.

4.2 Distribution of edit distances

In order to better understand their test data, the authors calculated the edit distances for all the known name pairs.
This indicates the challenge of name matching: how to correctly classify two names that are very different.

4.3 Matching results

The two techniques that combine phonetic encoding with pattern matching (Editex and syllable alignment distance) do not perform as well as one might have expected, and neither do skip-grams.
Details for the best performing pattern matching techniques on the four data sets can be seen in Figure 1.
An optimal value for one data set and technique will very likely result in sub-optimal quality for another data set or technique.
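The dependence of matching quality on the threshold can be made concrete with a small sketch. It assumes that each candidate pair has a similarity score and a known match status, classifies a pair as a match when its score reaches the threshold, and reports the f-measure (the harmonic mean of precision and recall). The function names and the toy scores are invented for illustration and are not results from the paper.

```python
def f_measure(scored_pairs, threshold):
    """f-measure when pairs with similarity >= threshold are classified as matches.

    scored_pairs: iterable of (similarity, is_true_match) tuples.
    """
    tp = fp = fn = 0
    for similarity, is_true_match in scored_pairs:
        predicted_match = similarity >= threshold
        if predicted_match and is_true_match:
            tp += 1
        elif predicted_match:
            fp += 1
        elif is_true_match:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored_pairs, candidates):
    """Return the candidate threshold with the highest f-measure."""
    pairs = list(scored_pairs)
    return max(candidates, key=lambda t: f_measure(pairs, t))


# Invented toy data: (similarity, known match status).
toy_pairs = [(0.95, True), (0.85, True), (0.80, False), (0.60, True), (0.40, False)]
chosen = best_threshold(toy_pairs, [0.5, 0.7, 0.9])
```

Running the same search on scores from a different technique or data set would generally select a different threshold, which is the point the text makes.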

4.4 Timing results

As shown in Table 6, the phonetic encoding techniques (times shown include encoding of two names) are generally much faster than pattern matching, due to their complexity being O(|s|) for a given string s.
Phonix with its many rules is the slowest phonetic technique (almost ten times as slow as others), while Smith-Waterman is the slowest pattern matching technique.

5 Recommendations

The mixed results presented in the previous section indicate that there is no single best name matching technique, and that the type of personal name data to be matched has to be considered when selecting a matching technique.
The following recommendations will help with this.
It is important to know the type of names to be matched, and if these names have been properly parsed and standardised [7], or if the name data potentially contains several words with various separators.
Phonetic encoding followed by exact comparison of the phonetic codes should not be used.
Even small changes of the threshold can result in dramatic drops in matching quality.

Figures (6)

Figure 1. Best f-measure results for the four data sets (similarity measures on the horizontal and f-measures on the vertical axis).

Table 3. Distribution of edit distances for matched name pairs.

Table 1. Phonetic name encoding examples.

Table 2. Number of name pairs and single names in test data sets used for experiments.

Table 6. Timings results in milli-seconds (shortest times shown boldface and longest times underlined).

Table 4. Average f-measure values (best results shown boldface and worst results underlined).

TR-CS-06-02

A Comparison of Personal Name Matching: Techniques and Practical Issues

Peter Christen

September 2006

Joint Computer Science Technical Report Series

Department of Computer Science, Faculty of Engineering and Information Technology

Computer Sciences Laboratory, Research School of Information Sciences and Engineering

This technical report series is published jointly by the Department of Computer Science, Faculty of Engineering and Information Technology, and the Computer Sciences Laboratory, Research School of Information Sciences and Engineering, The Australian National University.

A Comparison of Personal Name Matching: Techniques and Practical Issues

Peter Christen

Department of Computer Science, The Australian National University

Canberra ACT 0200, Australia

Peter.Christen@anu.edu.au

Abstract

Finding and matching personal names is at the core of an increasing number of applications: from text and Web mining, information retrieval and extraction, search engines, to deduplication and data linkage systems. Variations and errors in names make exact string matching problematic, and approximate matching techniques based on phonetic encoding or pattern matching have to be applied. When compared to general text, however, personal names have different characteristics that need to be considered.

In this paper we discuss the characteristics of personal names and present potential sources of variations and errors. We overview a comprehensive number of commonly used, as well as some recently developed name matching techniques. Experimental comparisons on four large name data sets indicate that there is no clear best technique. We provide a series of recommendations that will help researchers and practitioners to select a name matching technique suitable for a given data set.

1. Introduction

Increasingly large amounts of data are being created, communicated and stored by many individuals, organisations and businesses on a daily basis. A lot of this data contains some information about people, for example e-mails, customer and patient records, news articles, business and political memorandums. Even most scientific and technical documents contain details about their authors. Personal names are often used to search for documents in large collections. Examples include Web searches (the most popular query in the last few years on Google has always been a celebrity name, with another four or five names ranked in the top ten queries; see http://www.google.com/press/zeitgeist.html), retrieval of medical patient records, or bibliographic searches (using author names). Names are also important pieces of information when databases are deduplicated (e.g. to find and remove duplicate customer records), and when two data sets are linked or integrated and no unique entity identifiers are available [5, 6, 30]. As reported in [28], the use of approximate comparison methods does improve the matching quality in these applications.

Personal names have characteristics that make them different to general text. While there is only one correct spelling for many words, there are often several valid spelling variations for personal names, for example ‘Gail’, ‘Gale’ and ‘Gayle’. People also frequently use (or are given) nicknames in daily life, for example ‘Bill’ rather than the more formal ‘William’. Personal names sometimes change over time, for example when somebody gets married. Names are also heavily influenced by people’s cultural backgrounds. These issues make matching of personal names more challenging compared to matching of general text [3, 24].

As names are often recorded with different spellings, applying exact matching leads to poor results. In [11], for example, the percentage of name mismatches in three large hospital databases ranged between 23% and 36%. To improve matching accuracy, many different techniques for approximate name matching have been developed in the last four decades [15, 20, 25, 34], and new techniques are still being invented [13, 18]. Most techniques are based on pattern matching, phonetic encoding, or a combination of these two approaches.

Computational complexity has to be considered when name matching is done on very large data sets. The time needed to determine if two names match is crucial for the overall performance of an application (besides data structures that allow efficient extraction of candidate name pairs while filtering out likely non-matches [23]). Matching speed is vital when quick response times are needed, for example in search engines, or crime and biomedical emergency response systems, where an answer should be available within a couple of seconds.

While similar comparison studies on matching techniques have been done in the past [9, 17, 20, 25, 32, 34], none has analysed and compared such a comprehensive number of techniques specifically with application to personal names. The contributions of this paper are a detailed discussion of the characteristics of personal names and possible sources of variations and errors in them, an overview of a range of name matching techniques, and a comparison of their performance using several large real world data sets containing personal names.

We start in Section 2 with a discussion of personal name characteristics and sources of variations. In Section 3 we first look at different situations and contexts of name matching, and then present a comprehensive number of name matching techniques. The results of experimental comparisons are discussed in Section 4, and a series of recommendations is given in Section 5 that will help researchers and practitioners who are faced with the problem of selecting a name matching technique. Finally, conclusions and an outlook to future work are discussed in Section 6.

2. Personal name characteristics

Even when only considering the English-speaking world, a name can have several different spelling forms for a variety of reasons. In the Anglo-Saxon region and most other Western countries, a personal name is usually made of a given name, an optional middle name, and a surname or family name [24]. Both ‘Gail Vest’ and ‘Gayle West’ might refer to the same person, while ‘Tina Smith’ might be recorded in the same database as ‘Christine J. Smith’ and as ‘C.J. Smith-Miller’. People change their name over time, most commonly when somebody gets married (in which case there are different cultural conventions and laws of how a person’s name is changed). Compound names are often used by married women, while in certain countries husbands can take on the surname of their wives.

In daily life, people often use (or are given) nicknames. These can be short forms of their given names (like ‘Bob’ for ‘Robert’, or ‘Liz’ for ‘Elizabeth’), they can be variations of their surname (like ‘Vesty’ for ‘Vest’) or they might relate to some life event, character sketch or physical characteristics of a person [3]. While having one given and one middle name is common for Anglo-Saxon names, several European countries favour compound given names instead, for example ‘Hans-Peter’ or ‘Jean-Pierre’. In general, there are no legal regulations of what constitutes a name [3].

In today’s multi-cultural societies and worldwide data collections (e.g. global online businesses or international crime and terrorism databases), the challenge is to be able to match names coming from different cultural backgrounds. For Asian names, for example, there exist several transliteration systems into the Roman alphabet [24], the surname traditionally appears before the given name, and frequently a Western given name is added. Hispanic names can contain two surnames, while Arabic names are often made of several components and contain various affixes that can be separated by hyphens or whitespaces.

An early study [10] on spelling errors in general words found that over 80% of errors were single errors: either a letter was deleted, an extra letter was inserted, a letter was substituted for another letter, or two adjacent letters were transposed. Substitutions were the most common errors, followed by deletions, then insertions and finally transpositions, followed by multiple errors in one word. Other studies [15, 19, 27] reported similar results. However, in a study [11] that looked at patient names within hospital databases, different types and distributions of errors were found. With 36%, insertion of an additional name word, initial or title was the most common error. This was followed in 14% of errors by several different letters in a name due to nicknames or spelling variations. Other specific errors were differences in punctuation marks and whitespaces (for example ‘O’Connor’, ‘OConnor’ and ‘O Connor’) in 12% of errors, and different last names for female patients (8% of errors). Single errors in this study accounted for 39% of all errors, only around half compared to the 80% reported in [10]. Thus, there seem to be significant differences between general text and personal names, which have to be considered when name matching algorithms are being developed and used. According to [20] the most common name variations can be categorised as

• spelling variations (like ‘Meier’ and ‘Meyer’) due to typographical errors that do not affect the phonetical structure of a name but still pose a problem for matching;

• phonetic variations (like ‘Sinclair’ and ‘St. Clair’) where the phonemes are modified and the structure of a name is changed substantially;

• compound names (like ‘Hans-Peter’ or ‘Smith Miller’) that might be given in full (potentially with different separators), one component only, or components swapped;

• alternative names (like nicknames, married names or other deliberate name changes); and

• initials only (mainly for given and middle names).

In [19] character level (or non-word) misspellings are classified into (1) typographical errors, where it is assumed that the person doing the data entry does know the correct spelling of a word but makes a typing error (e.g. ‘Sydeny’ instead of ‘Sydney’); (2) cognitive errors, assumed to come from a lack of knowledge or misconceptions; and (3) phonetic errors, coming from substituting a correct spelling with a similar sounding one. The combination of phonetical and spelling variations, as well as potentially totally changed name words, make name matching challenging.

2.1 Sources of name variations

Besides the variations in personal names discussed above, the nature of data entry [19] will determine the most likely types of errors and their distribution.

• When handwritten forms are scanned and optical character recognition (OCR) is applied [15, 27], the most likely types of errors will be substitutions between similar looking characters (like ‘q’ and ‘g’), or substitutions of one character with a similar looking character sequence (like ‘m’ and ‘r n’, or ‘b’ and ‘l i’).

• Manual keyboard based data entry can result in wrongly typed neighbouring keys (for example ‘n’ and ‘m’, or ‘e’ and ‘r’). While in some cases this is quickly corrected by the person doing the data entry, such errors are often not recognised, possibly due to limited time or distractions to the person doing the data entry (imagine a busy receptionist in a hospital emergency department). The likelihood of letter substitutions obviously depends upon the keyboard layout.

• Data entry over the telephone (for example as part of a survey study) is a confounding factor to manual keyboard entry. The person doing the data entry might not request the correct spelling, but rather assume a default spelling which is based on the person’s knowledge and cultural background. Generally, errors are more likely for names that come from a culture that is different to the one of the person doing the data entry, or if names are long or complicated (like ‘Kyzwieslowski’) [11].

• Limitations in the maximum length of input fields can force people to use abbreviations, initials only, or even disregard some parts of a name.

• Finally, people themselves sometimes report their names differently depending upon the organisation they are in contact with, or deliberately provide wrong or modified names. Or, while somebody might report her or his name consistently in good faith, others report it inconsistently or wrongly for various reasons.

If data from various sources is used, for example in a text mining, information retrieval or data linkage system, then the variability and error distribution will likely be larger than if the names to be matched come from one source only. This will also limit the use of trained name matching algorithms [2, 9, 31] that are adapted to deal with certain types of variations and errors. Having meta-data that describes the data entry process for all data to be used can be valuable when assessing data quality.

As discussed previously, while there is only one correct spelling for most general words, there are often no wrong name spellings, just several valid name variations. For this reason, in many cases it is not possible to disregard a name as wrong if it is not found in a dictionary of known names. When matching names, one has to deal with legitimate name variations (that should be preserved and matched), and errors introduced during data entry and recording (that should be corrected) [3]. The challenge lies in distinguishing between these two sources of variations.

3. Matching techniques

Name matching can be defined as the process of determining whether two name strings are instances of the same name [24]. As name variations and errors are quite common [11], exact name comparison will not result in good matching quality. Rather, an approximate measure of how similar two names are is desired. Generally, a normalised similarity measure between 1.0 (two names are identical) and 0.0 (two names are totally different) is used.
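One simple way to obtain such a normalised measure is sketched below, assuming basic edit distance (unit costs for insertions, deletions and substitutions) divided by the length of the longer string. Both the function names and this particular normalisation are assumptions made for illustration; other normalisations are possible.

```python
def edit_distance(s1, s2):
    """Basic edit distance with unit costs, computed with two rows of the table."""
    # Make s2 the shorter string so the rows stay as small as possible.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        curr = [i]
        for j, b in enumerate(s2, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (a != b),  # substitution (free if letters agree)
            ))
        prev = curr
    return prev[-1]


def edit_similarity(s1, s2):
    """Map edit distance to a similarity between 0.0 and 1.0."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(s1, s2) / longest
```

For example, 'gail' and 'gayle' are two edits apart (one substitution and one insertion), which gives a similarity of 1 - 2/5 = 0.6 under this normalisation.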

The two main approaches for matching names are phonetic encoding and pattern matching. Different techniques have been developed for both approaches, and several techniques combine the two with the aim to improve the matching quality. In the following three subsections we present the most commonly used as well as several recently proposed new techniques.

Matching two names can be viewed as an isolated problem or within a wider database or application context. Four different situations can be considered.

1. The matching of two names that consist of a single word each, not containing whitespaces or other separators like hyphens or commas. This is normally the situation when names have been parsed and segmented into components (individual words) [7], and all separators have been removed. Full names are split into their components and stored into fields like title, given name, middle name, surname and alternative surname. Parsing errors, however, can result in a name word being put into the wrong field, thereby increasing the likelihood of wrong matching.

2. Without proper parsing and segmentation a name (even if stored in two fields as given- and surname) can contain several words separated by a hyphen, apostrophe, whitespace or other character. Examples include compound given names, born surname and married name, name pre- and suffixes, and title words (like ‘Ms’, ‘Mr’ or ‘Dr’). In this situation, besides variations in a single word, parts of a name might be in a different order or missing, and there might be different separators. All this will complicate the name matching task.

3. In the first two situations names were matched individually without taking any context information into

Frequently Asked Questions (10)

Q1. What are the contributions mentioned in the paper "A comparison of personal name matching: techniques and practical issues" ?

In this paper the authors discuss the characteristics of personal names and present potential sources of variations and errors. The authors provide a series of recommendations that will help researchers and practitioners to select a name matching technique suitable for a given data set.

Q2. What is the method for a string to be sorted?

If a string contains more than one word (i.e. it contains at least one whitespace or other separator), then the words are first sorted alphabetically before the Winkler technique is applied (to the full strings).

Q3. What are the likely types of errors in handwritten forms?

When handwritten forms are scanned and optical character recognition (OCR) is applied [15, 27], the most likely types of errors will be substitutions between similar looking characters (like ‘q’ and ‘g’), or substitutions of one character with a similar looking character sequence (like ‘m’ and ‘r n’, or ‘b’ and ‘l i’).

Q4. How did the authors extract the names from the records that did not have duplicates?

The authors also extracted single names from records that did not have duplicates, and randomly created name pairs (the same number as known matched pairs in order to get balanced test data sets).

Q5. What is the importance of determining if two names match?

The time needed to determine if two names match is crucial for the overall performance of an application (besides data structures that allow to efficiently extract candidate name pairs while filtering out likely non-matches [23]).

Q6. What is the algorithm for a name that contains initials only?

As it allows for gaps, the Smith-Waterman algorithm should be especially suited for compound names that contain initials only or abbreviated names.

Q7. What is the q-gram similarity measure between two strings?

A q-gram similarity measure between two strings is calculated by counting the number of q-grams in common (i.e. q-grams contained in both strings) and dividing by either the number of q-grams in the shorter string (called the Overlap coefficient), the number in the longer string (called Jaccard similarity) or the average number of q-grams in both strings (called the Dice coefficient).

Q8. How many people are creating, communicating and storing data?

Increasingly large amounts of data are being created, communicated and stored by many individuals, organisations and businesses on a daily basis.

Q9. What is the technique for removing obvious non-matches?

As expected, the Bag distance is very fast (followed by simple q-grams), making it suitable as a filtering technique to remove obvious non-matches.

Q10. What are the results of their experiments with skip-grams?

Their experiments with skip-grams using multi-lingual texts from different European languages show improved results compared to bigrams, trigrams, edit distance and the longest common sub-string technique.
