A taxonomy of privacy-preserving record linkage techniques
Summary
1. Introduction
- Many of these data are about people, or are generated by people.
- One way to improve data quality and allow more sophisticated data analysis and mining is to integrate data from different sources.
- Record linkage can also be applied on a single database to detect duplicate records [8, 14].
- When personal information about people is used in the linking of databases across organizations, then the privacy of this information needs to be carefully protected.
- After discussing these 15 dimensions, the authors provide a detailed review of existing PPRL techniques, and they show how they fit into their taxonomy.
2. Applications of record linkage
- Linking records from different databases with the aim to improve data quality or enrich data for further analysis and mining is occurring in an increasing number of application areas including healthcare, government services, crime and fraud detection, and business applications.
- Record linkage techniques are being used by national security agencies and crime investigators to effectively identify individuals who have committed fraud or crimes [24-26].
- When data from several organizations are linked, then privacy and confidentiality need to be carefully considered, as the following scenarios illustrate.
- This research requires data from hospitals, the police, as well as public and private health insurers.
- Health surveillance: preventing infectious diseases early, before they spread widely across a country, is important for a healthy nation.
3. The record linkage process
- The first step of data pre-processing (data cleaning and standardization) is crucial for quality record linkage outcomes, because most real-world databases contain noisy, incomplete and inconsistent data [2, 4].
- Section 3.2 describes several popular comparison techniques in more detail, including those that have been employed for PPRL.
- This is usually a time-consuming and error-prone process that depends upon the experience of the experts who conduct the review.
- In the following the authors discuss the steps of the record linkage process in more detail, and present techniques that have been used in each of the steps.
3.1. Indexing
- This becomes the major performance bottleneck in the record linkage process, since expensive detailed comparisons between records are required [17, 18].
- To reduce this large number of potential record pair comparisons, some kind of filtering of the unlikely matches can be performed.
- A drawback of these phonetic encodings, however, is that they are language dependent.
- In the traditional standard blocking approach used since the 1960s [10], all records that have the same blocking key value will be inserted into the same block, and only the records within the same block will be compared with each other in detail in the comparison step.
- Related to q-gram based and sorted neighborhood indexing is suffix array based indexing [42, 43], where suffixes are generated from the blocking key values, and blocks are extracted from the sorted array of suffix strings.
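The standard blocking idea described above can be sketched as follows; the attribute names and the simple blocking key (surname initial plus postcode) are illustrative assumptions, with phonetic encodings such as Soundex being a common alternative.

```python
from collections import defaultdict

def blocking_key(record):
    # Illustrative blocking key: first letter of surname plus postcode.
    # Real systems often use phonetic encodings such as Soundex instead.
    return record["surname"][:1].upper() + record["postcode"]

def build_blocks(records):
    """Insert every record into the block identified by its blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks

def candidate_pairs(blocks_a, blocks_b):
    """Only records that share a blocking key value are compared in detail."""
    for key in blocks_a.keys() & blocks_b.keys():
        for ra in blocks_a[key]:
            for rb in blocks_b[key]:
                yield ra, rb
```

For two small databases, only record pairs falling into a common block reach the expensive comparison step; all other pairs are filtered out as unlikely matches.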
3.2. Comparison
- Comparisons between two records can be conducted either at the record level or at the attribute level.
- With comparisons at the attribute level, comparisons are conducted between individual attribute values, with specialized comparison functions used depending upon the type of data in these attributes.
- Approximate comparison functions, on the other hand, measure how similar the values in two attributes are with each other.
- Two surveys of edit-distance based approximate string comparison functions can be found in [46, 47] .
- Winkler later added several improvements to this basic comparison function [53, 54] , such as increased similarity if the beginning of two strings is the same, or weight adjustments based on the lengths of two strings and how many similar characters they contain.
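As a concrete example of the edit-distance based comparison functions surveyed in [46, 47], a minimal Levenshtein distance with a normalised similarity can be sketched as follows; the normalisation by the longer string length is one common convention, not the only one.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string s into string t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distances between s[:0] and every prefix of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def edit_similarity(s, t):
    """Normalised similarity in [0, 1], as typically used in record linkage."""
    if not s and not t:
        return 1.0
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))
```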
3.3. Classification
- These vectors are used to classify record pairs as matches, non-matches, and possible matches, depending upon the decision model used [33] .
- Extensions to the basic Fellegi and Sunter approach include the use of the Expectation Maximization (EM) algorithm to estimate the conditional probabilities required by the method in an unsupervised fashion [54, 56-58].
- Generating rules is often a time-consuming and complex process, since it requires manual effort to build rule systems and also to maintain them.
- These rules are then applied on the comparison vectors to classify candidate record pairs into matches, non-matches, or possible matches (if desired) [14, 15, 60].
- To accurately classify the compared candidate record pairs into matches and non-matches, many recently developed classification techniques for record linkage employ supervised machine learning approaches [8, 61, 62] that require training data with known class labels for matches and non-matches to train a decision model.
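A simple threshold-based decision model of the kind mentioned above can be sketched as follows; the two thresholds and the unweighted average of similarities are illustrative assumptions (Fellegi and Sunter style approaches use weighted match scores instead).

```python
def classify(comparison_vector, t_upper=0.85, t_lower=0.6):
    """Classify a candidate record pair from its comparison vector of
    attribute similarities in [0, 1], using two illustrative thresholds:
    at or above t_upper -> match, below t_lower -> non-match, and
    anything in between -> possible match (for clerical review)."""
    score = sum(comparison_vector) / len(comparison_vector)
    if score >= t_upper:
        return "match"
    if score >= t_lower:
        return "possible match"
    return "non-match"
```

Pairs classified as possible matches are the ones that would typically go to the manual clerical review step discussed earlier.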
3.4. Evaluation
- Evaluating the performance of record linkage algorithms in terms of how efficient and effective they are is the final step in the linkage process.
- A higher reduction ratio value means an indexing technique is more efficient in reducing the number of candidate record pairs that are being generated.
- The quality of a linkage can be measured by using the metrics commonly employed in both information retrieval, and in machine learning and data mining [66, 67] .
- True positives (TP) are the true matching record pairs that are correctly classified as 'matches', while false positives (FP) are the true non-matching record pairs that are wrongly classified as 'matches'; false negatives (FN) and true negatives (TN) are defined analogously for pairs classified as 'non-matches'.
- Based on these four numbers, various measures such as precision, recall and the f-measure can be defined.
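Based on the four counts TP, FP, TN and FN, together with the number of candidate pairs produced by indexing, the common measures can be computed as follows; the numbers in the test are illustrative only.

```python
def linkage_metrics(tp, fp, tn, fn, total_pairs, candidate_pairs):
    """Common record linkage evaluation measures.

    tp/fp/tn/fn count classified candidate record pairs; total_pairs is
    the number of all possible record pairs, and candidate_pairs is the
    number generated by the indexing step (used for the reduction ratio:
    higher values mean indexing filtered out more pairs)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    reduction_ratio = 1.0 - candidate_pairs / total_pairs
    return {"precision": precision, "recall": recall,
            "f-measure": f_measure, "reduction ratio": reduction_ratio}
```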
4. An overview of PPRL
- As the scenarios in Section 2 have shown, the exchange of private or confidential data between organizations is often not feasible due to privacy concerns, legal restrictions, or because of commercial interests.
- The increasing need to link large databases across organizations while, at the same time, preserving the privacy of the entities stored in these databases, has led to the development of a new research area called privacy-preserving record linkage (PPRL) [69-71].
- The information revealed can either be (1) the number of records that have been classified as matches, (2) the identifiers of these matched records, or (3) a selected set of attributes from these matched records.
- Some exchange of information between the data sources about what data pre-processing approaches they use, as well as which attributes they have in common that are to be used for the linkage, is therefore required.
- In a PPRL context, this classification needs to be conducted in such a way that no party learns anything about the records in the other parties' databases that do not match, such as similarity values for certain attributes of individual record pairs, which record pairs have low similarities, or even the distribution of similarity values across all compared record pairs.
4.1. Previous PPRL surveys
- Trepetin [80] theoretically analyzed four different anonymized string matching techniques and concluded that many existing techniques fall short in providing a sound solution either because they are not scalable to large databases, or because they are unable to provide both linkage quality and privacy guarantees.
- Similar conclusions were also drawn in [69, 79], which survey several existing techniques for private matching, ranging from classical record matching techniques enhanced by SMC techniques to provide privacy, to advanced solutions developed specifically to solve the PPRL problem.
- In Durham et al.'s [78] recent survey on privacy-preserving string comparators, six existing comparators that can be used in PPRL for private comparison have been experimentally evaluated in terms of their complexity, correctness, and privacy.
- While all these surveys analyze and compare several private comparison functions, the authors' survey is the first to develop a taxonomy that characterizes all aspects of PPRL, and to provide a comprehensive analysis of current approaches to PPRL.
5.1.1. Number of parties
- Solutions to PPRL can be classified into those that require a third party for performing the linkage and those that do not.
- In three-party protocols, a third party (which the authors call the 'linkage unit') is involved in conducting the linkage, while in two-party protocols only the two database owners participate in the PPRL process.
- The advantage of two-party over three-party protocols is that the former are more secure, because there is no possibility of collusion between one of the database owners and the linkage unit; two-party protocols also often have lower communication costs.
- Two-party protocols generally require more complex techniques to ensure that the two database owners cannot infer any sensitive information from each other during the linkage process.
5.1.2. Adversary model
- PPRL techniques generally consider one of the two adversary models that are commonly used in the field of cryptography, and especially in the area of secure multi-party computation (SMC) [70, 82, 83].
- (i) Honest-but-curious (HBC) behavior: HBC parties are curious in that they try to find out as much as they can about the other parties' inputs while still following the protocol [70, 83].
- A protocol is secure in the HBC model if and only if, at the end of the protocol, no party has gained any knowledge beyond what it could learn from the output, i.e. the record pairs classified as matches.
- Most of the PPRL solutions proposed in the literature assume the HBC adversary model.
- (ii) Malicious behavior: malicious parties may refuse to participate in a protocol, not follow the protocol in the specified way, choose arbitrary values for their data inputs, or abort the protocol at any time [84].
5.1.3. Privacy techniques
- A variety of privacy techniques has been employed to facilitate PPRL.
- In order to prevent dictionary attacks, where an adversary hash-encodes values from a large list of common words using existing hash encoding functions until a matching hash-code is found, a keyed hash encoding approach can be used which significantly improves the security of this privacy technique.
- The basic idea of SMC is that a computation is secure if at the end of the computation no party knows anything except its own input and the final results of the computed function [82, 83, 93] .
- Various SMC techniques have been used in PPRL for accurate computation while preserving privacy.
- When adding extra records there is generally a trade-off between linkage quality, scalability and privacy [101].
- (x) Differential privacy: recently, differential privacy [119, 120] has emerged as an alternative to generalization techniques.
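The keyed hash encoding approach for resisting dictionary attacks can be sketched with HMAC as follows; the secret key shown is a placeholder that, in practice, only the database owners would share (agreed on out of band), so an adversary without the key cannot hash-encode a word list to reverse the encodings.

```python
import hashlib
import hmac

SECRET_KEY = b"shared-secret"  # placeholder; known only to the database owners

def keyed_encode(value):
    """Keyed (HMAC-SHA256) hash encoding of an attribute value.

    Unlike a plain one-way hash, the encoding depends on the secret key,
    so hashing a large list of common words (a dictionary attack) does
    not reveal which encodings correspond to which original values."""
    normalised = value.strip().lower()  # simple pre-processing for matching
    return hmac.new(SECRET_KEY, normalised.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

Both database owners apply the same keyed encoding to their (pre-processed) attribute values, so equal values still produce equal encodings and can be matched exactly.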
5.2. Linkage techniques
- The techniques used in the different steps of the PPRL process, as illustrated in Fig. 2, determine the computational requirements and the quality of the linkage results.
- The dimensions under this topic cover each of the required steps.
5.2.1. Indexing
- The indexing techniques that allow record linkage solutions to scale to very large databases become more challenging to apply when privacy concerns have to be considered.
- In PPRL, the indexing step therefore involves a trade-off not only between accuracy and efficiency, but also with privacy.
- Several approaches have been proposed that address the scalability of PPRL solutions by adapting existing indexing techniques, such as standard blocking, mapping based blocking, clustering, sampling, and locality sensitive hash functions, into a privacy-preserving context, as discussed in Section 6.3.
5.2.2. Comparison
- Linkage quality is heavily influenced by how the values in records or individual attributes are compared with each other [48] .
- As discussed in Section 4, the naïve approach of exact matching of encrypted values does not provide a practical solution.
- Several of the approximate comparison functions that were presented in Section 3.2 have been investigated from a privacy preservation perspective.
- These techniques will be described in detail in Sections 6.2 and 6.3.
- The main challenge with these techniques is how the similarity between pairs of string values held at different parties can be calculated such that neither party learns about the other party's string value.
5.2.3. Classification
- The decision model used in PPRL to securely classify the compared record pairs needs to be effective in providing highly accurate results, such that the number of false negatives and false positives is minimized, while at the same time preserving the privacy of all records that are not part of matching pairs.
- As discussed in Section 3.3, a variety of classification techniques has been developed for record linkage.
- Details of which classification techniques have been used in PPRL will be described for individual approaches in Section 6.
5.3.1. Scalability
- This includes the computation and communication complexities that measure the overall computational efforts and cost of communication required in the PPRL process.
- Generally, big-O notation is used to specify the computation complexity [121].
5.3.2. Linkage quality
- The quality of linkage is theoretically analyzed in terms of fault-tolerance of the matching technique to data errors and variations, whether the matching is based on individual fields or whole records, and the types of data the matching technique can be applied to.
- Fault-tolerance to data errors can be addressed by using approximate matching or pre-processing techniques such as spelling transformations.
- Records can either be compared as a whole (record based) or by comparing the values of individual selected attributes (field based), as was discussed in Section 3.2.
- Several approximate comparison functions have been adapted into a privacy-preserving context as presented in Sections 6.2 and 6.3.
5.3.3. Privacy vulnerabilities
- The privacy vulnerabilities that a PPRL technique is susceptible to provide a theoretical estimate of the privacy guarantees of that technique.
- The main privacy vulnerabilities include frequency attack and dictionary attack (as discussed in Section 5.1.3).
- As Kuzu et al. [122] recently showed, depending upon the number of hash functions employed and the number of bits in a Bloom filter, using a constraint satisfaction solver allows the iterative mapping of individual encoded values back to their original values.
- Another vulnerability associated with three-party and multi-party approaches is collusion between parties.
- The vulnerabilities of individual PPRL techniques are discussed in Section 6.
5.4.3. Privacy evaluation
- Various measures have been used to assess the privacy protection that PPRL techniques provide.
- Here the authors present the most prominent measures used.
- (i) Entropy, Information gain (IG) and Relative information gain (RIG): entropy measures the amount of information contained in a message [101, 123].
- IG assesses the possibility of inferring the original message Y, given its enciphered version X [101, 123].
- To evaluate privacy, a party's view of the execution of a PPRL technique must be simulatable given only that party's input and output.
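The entropy-based privacy measures can be sketched as follows; this is a minimal illustration of entropy and information gain over small lists of values, not the exact formulation used in [101, 123].

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values;
    a building block of the IG and RIG privacy measures."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(originals, encodings):
    """IG: reduction in uncertainty about the original values Y once the
    encoded values X are observed (a larger IG means less privacy)."""
    n = len(originals)
    by_code = {}
    for y, x in zip(originals, encodings):
        by_code.setdefault(x, []).append(y)
    # Conditional entropy H(Y|X): entropy of Y within each encoding group,
    # weighted by how often that encoding occurs.
    h_y_given_x = sum(len(g) / n * entropy(g) for g in by_code.values())
    return entropy(originals) - h_y_given_x
```

If every original value maps to the same encoding, observing the encodings reveals nothing (IG = 0); if each value gets a distinct encoding, all uncertainty is removed (IG equals the entropy of the originals).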
5.5.1. Implementation
- This dimension specifies the implementation techniques that have been used to prototype a PPRL technique in order to conduct its experimental evaluation.
- Some solutions proposed in the literature provide only theoretical proofs but they have not been evaluated experimentally, or no details about their implementation have been published.
5.5.2. Datasets
- Experimental evaluation on one or ideally several datasets is important for the critical evaluation of PPRL techniques.
- Due to the difficulties of obtaining real-world data that contain personal information, synthetically generated databases are commonly used.
- To evaluate the practical aspects of PPRL techniques with regard to their expected performance in real-world applications, evaluations should ideally be done on databases that exhibit real-world properties and error characteristics.
6. A survey of privacy-preserving record linkage techniques
- Research directions for PPRL were provided in [7, 20], stating the needs, problems and current approaches in this area, while various techniques have been developed addressing this research problem [69, 78-80].
- The authors highlight terms that relate back to their taxonomy in italic font.
- The authors categorize PPRL techniques into three generations according to the factors that have been considered.
- These three generations are (1) techniques that consider exact matching of attribute values only; (2) techniques that can conduct approximate matching to improve the quality of linkage; and (3) techniques that also address scalability while conducting approximate matching.
- Each technique is given an identifier composed of the first three letters of the first author and the last two digits of the year of publication, which is then used in Table 1 to identify individual publications.
6.1.1. Three-party techniques
- This approach is cost effective, but it is inappropriate in real-world applications since it can only perform exact matching of attribute values.
- Both database owners will merge the values of their linkage attributes into a single string (record based) which is then double-hashed using a secure hash function and a public key encryption algorithm in order to prevent dictionary attacks.
- The hash strings are then used by a third party to classify the records using a deterministic classification technique.
- Experiments conducted on health databases showed that the accuracy of the classification increases if the concatenated string includes the full date of birth value.
- This approach is useful when health policies preclude the full exchange of identifiers that is commonly required by other more sophisticated algorithms.
6.1.2. Two-party techniques
- Fre05: Privacy-preserving Information Retrieval (PPIR) is a research area related to PPRL; PPIR employs a single query record, whereas PPRL uses all records as match queries.
- Their approach uses SMC techniques (homomorphic encryption) and oblivious pseudo-random functions.
- If a query contains multiple search keywords, the process is repeated for each search keyword in the query (field based).
- The server uses indexing based on blocking for an efficient search that defines L bins and maps the n keywords in the database into these bins using a hash function.
- The approach has a communication complexity which is poly-logarithmic in the size of the database n, and it needs only one round of communication.
6.1.3. Multi-party techniques
- One-way secure hash functions such as SHA are used with two pads added in order to avoid dictionary attacks.
- In their approach, all the records are first converted into a Bloom filter bit array (record based), and each party partitions its Bloom filter into the number of parties involved in the linkage and sends a segment to the corresponding party.
- Kan08: a multi-party approach based on a generalization technique (k-anonymity) for person-specific biomedical data was introduced by Kantarcioglu et al. [108] in 2008.
- This approach performs efficient secure joins of encrypted databases by a third party without decrypting or inferring the contents of the joined records.
6.2.1. Three-party techniques
- Du01: Du et al. [127] in 2001 suggested a secure approach for private remote database access with an untrusted third party that is assumed to not collude with either of the two database owners.
- The four models are Private Information Matching (PIM), PIM from a Public Database, Secure Storage Outsourcing (SSO), and Secure Storage and Computing Outsourcing (SSCO).
- A threshold based classification is used for deciding which record pairs are matches.
- Sch09: an approach based on a combination of Bloom filters and q-grams (to facilitate approximate matching) was proposed by Schnell et al. [114] in 2009.
- The attribute values of each record are concatenated into one string (record based comparison), and the q-grams of that string are mapped to one bit array (a Bloom filter) using multiple cryptographic hash functions.
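The Bloom filter encoding of q-grams, together with the Dice coefficient similarity typically used to compare such filters, can be sketched as follows; the filter length and number of hash functions are illustrative, and real implementations use keyed cryptographic hash functions rather than the index-salted hashes shown here.

```python
import hashlib

BF_LEN = 100   # number of bits per Bloom filter (illustrative)
NUM_HASH = 2   # number of hash functions per q-gram (illustrative)

def qgrams(text, q=2):
    """Split a string into its overlapping substrings of length q (bigrams)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def bloom_encode(text):
    """Map the q-grams of a (concatenated) record string into a bit array
    using multiple hash functions, as in Bloom filter based PPRL."""
    bits = [0] * BF_LEN
    for gram in qgrams(text):
        for k in range(NUM_HASH):
            # Salting with the function index k simulates independent hash
            # functions; real systems would use keyed (HMAC) hashes instead.
            digest = hashlib.sha256(f"{k}:{gram}".encode()).hexdigest()
            bits[int(digest, 16) % BF_LEN] = 1
    return bits

def dice_similarity(bf1, bf2):
    """Dice coefficient of two Bloom filters: twice the number of common
    set bits over the total number of set bits; this measure is
    insensitive to the many matching zeros in long, sparse filters."""
    common = sum(a & b for a, b in zip(bf1, bf2))
    return 2.0 * common / (sum(bf1) + sum(bf2))
```

Similar strings share most of their q-grams, so their Bloom filters share most of their set bit positions and obtain a high Dice similarity, which enables approximate matching on the encoded values.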
6.2.2. Two-party techniques
- Ata03: a two-party protocol was proposed by Atallah et al. [129] in 2003, where the edit distance algorithm, as presented in Section 3.2, is modified to provide privacy for approximate comparisons of genome sequences in the area of bioinformatics.
- The smallest overall cost of transforming one sequence into the other is calculated as the edit distance.
- It is therefore unsuited for tasks with large databases.
- In their work, they presented methods for approximate comparison of values using string distance metrics, specifically TF-IDF, SoftTF-IDF and the Euclidean distance.
- This approach provides privacy under the malicious adversary model as well by adopting an encrypted similarity matrix to store the intermediate results.
6.3.1. Three-party techniques
- All05: Al-Lawati et al. [72] proposed a secure three-party blocking protocol in 2005 that assumes an HBC adversary model, achieving high-performance private record linkage by using secure hash encoding for computing the TF-IDF distance measure in a secure fashion, as illustrated in Fig. 10.
- The approach provides field based and approximate comparison of record pairs which are then classified using a threshold based classification model.
- The third method, the frugal third party blocking, uses a secure set intersection (SSI) SMC protocol to reduce the cost of transferring the whole databases to the third party by first identifying the hash signatures that occur in both databases.
- Increasing the size of the reference table improves the linkage quality to some extent, but this is impractical because it leads to longer run times.
- Locality-sensitive hash (LSH) functions have also been used for private blocking to reduce the computational complexity.
6.3.2. Two-party techniques
- Son00: the approach of Song et al. [99] in 2000, in a two-party context with the HBC model, addresses approximate matching by calculating enciphered permutations of values using pseudo-random functions, for private approximate searching of documents by certain query values.
- Euclidean distance is used to measure the approximate similarity between records.
- The approach combines differential privacy and cryptographic methods to solve the PPRL problem in a two-party protocol following the HBC adversary model.
- They propose an iterative classification approach where the database owners iteratively reveal bits from their Bloom filters without compromising privacy and complexity.
- The pairs that are classified as possible matches are taken to the next iteration where more bit positions are revealed to classify the pairs.
6.3.3. Multi-party techniques
- Moh11: Mohammed et al. [109] in 2011 proposed an approach for efficient PPRL using the k-anonymity based generalization privacy technique without the need for a trusted third party (two parties only); they presented solutions based on the two different adversary models.
- The computation and communication costs of this approach are O(n log n) and O(n), respectively, where n is the number of records in the databases.
- To prevent malicious parties from sending false scores, game-theoretic concepts are used.
- Empirical studies conducted by the authors using the real-world Adult dataset demonstrated the scalability of the solutions.
7. Discussion and research directions
- As their survey has shown, a large variety of techniques has been investigated since the development of PPRL solutions began.
- There is a clear path of progress, starting from early techniques that solve the problem of privacy-preserving exact matching, moving on to techniques that allow approximate matching while keeping the attribute values that are matched secure, and finally in the last few years focusing on techniques that address the issue of scalability of PPRL on large databases.
7.1. Privacy aspects
- With regard to privacy, several topics require further attention in order to make PPRL more applicable for practical applications.
- In a three-party scenario, PPRL protocols can be extended such that all database owners send their data to the linkage unit, which then conducts the linkage.
- The approach by Mohammed et al. [109] uses game theory concepts to deal with malicious parties, and as such provides a novel approach to PPRL.
- Some of the more commonly used techniques include Bloom filters and generalization techniques such as k-anonymity; however, both have their limitations.
- More research is needed to investigate the use of differential privacy and other advanced scalable techniques that provide sufficient privacy protection to work in combination with or even replace expensive SMC based techniques.
7.2. Linkage techniques
- Research in non-PPRL in recent years has developed various advanced techniques that provide improved scalability and linkage quality.
- Indexing: most work in PPRL that has investigated scalability, through some form of indexing technique, has employed the basic standard blocking approach.
- Other efficient techniques such as the sorted neighborhood approach, or suffix-array based indexing techniques, need to be explored in a privacy-preserving setting.
- Comparison: most PPRL solutions in the second and third generations consider approximate comparison.
- However, these approximate comparison techniques are mostly applicable only to data of string type.
7.3. Theoretical analysis
- A standard set of privacy measures is required that allows the comparative theoretical analysis of privacy preservation that can be achieved by PPRL techniques.
- As there are often different privacy requirements in different practical applications of record linkage, a measure such as the privacy spectrum proposed by Reiter and Rubin [142] might be suitable.
7.4. Evaluation
- The evaluation of the implementation of PPRL techniques with regard to their scalability, linkage quality, and privacy preservation poses some unique challenges.
- Without being able to assess linkage quality and completeness, PPRL will not be useful for real-world linkage applications: not knowing how good the results of a linkage project are is not an option in practice, where linkage quality and completeness are crucial factors for success.
- There is currently no framework available for PPRL that facilitates the comparative evaluation of different PPRL techniques with regard to privacy, scalability, and linkage quality.
- A framework for PPRL will need to facilitate the detailed specifications of all building blocks of the PPRL process in the form of abstract representations, such as XML schemas.
- This will make it possible for researchers to implement their novel algorithms and techniques, and integrate them so as to evaluate them comparatively.
7.5. Practical aspects
- So far it seems that no single PPRL technique has outperformed all other techniques in the three aspects of linkage quality, privacy preservation, and scalability to large datasets.
- The lack of comprehensive studies that compare many existing techniques within the same framework and on many different types of data, means that it is currently not possible to determine which technique(s) perform better than others on data with different characteristics and of different sizes.
- Conducting such large experimental studies is one avenue of research that would be highly beneficial to better understand the characteristics of PPRL techniques.
8. Conclusion
- In this paper the authors have presented a survey of historical and current state-of-the-art techniques for PPRL.
- Crucially, there is currently no overarching framework available that allows different approaches to PPRL to be evaluated comparatively.
- Solving these open research questions is a core requirement to make PPRL applicable for practical applications.