Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges

doi:10.1007/978-3-319-49340-4_25


Dinusha Vatsalan¹, Ziad Sehili², Peter Christen¹, and Erhard Rahm²

¹ Research School of Computer Science, The Australian National University, Acton ACT 2601, Australia; {dinusha.vatsalan,peter.christen}@anu.edu.au

² Database Group, University of Leipzig, 04109 Leipzig, Germany; {rahm,sehili}@informatik.uni-leipzig.de

Abstract. The growth of Big Data, especially personal data dispersed in multiple data sources, presents enormous opportunities and insights for businesses to explore and leverage the value of linked and integrated data. However, privacy concerns impede sharing or exchanging data for linkage across different organizations. Privacy-preserving record linkage (PPRL) aims to address this problem by identifying and linking records that correspond to the same real-world entity across several data sources held by different parties without revealing any sensitive information about these entities. PPRL is increasingly being required in many real-world application areas. Examples range from public health surveillance to crime and fraud detection, and national security. PPRL for Big Data poses several challenges, with the three major ones being (1) scalability to multiple large databases, due to their massive volume and the flow of data within Big Data applications, (2) achieving high quality results of the linkage in the presence of variety and veracity of Big Data, and (3) preserving privacy and confidentiality of the entities represented in Big Data collections. In this chapter, we describe the challenges of PPRL in the context of Big Data, survey existing techniques for PPRL, and provide directions for future research.

Keywords: Record linkage, Privacy, Big Data, Scalability

1 Introduction

With the Big Data revolution, many organizations collect and process datasets that contain many millions of records to analyze and mine interesting patterns and knowledge in order to empower efficient and quality decision making [28, 53]. Analyzing and mining such large datasets often requires data from multiple sources to be linked and aggregated. Linking records from different data sources with the aim to improve data quality or enrich data for further analysis is occurring in an increasing number of application areas, such as in healthcare, government services, crime and fraud detection, national security, and business applications [28, 52]. Effective ways of linking data from different sources have also played an increasingly important role in generating new insights for population informatics in the health and social sciences [100].


For example, linking health databases from different organizations facilitates quality health data mining and analytics in applications such as epidemiological studies (outbreak detection of infectious diseases) or adverse drug reaction studies [20, 117]. These applications require data from several organizations to be linked, for example human health data, travel data, consumed drug data, and even animal health data [38]. Linked health databases can also be used for the development of health policies in a more efficient and effective way compared to traditionally used time-consuming survey studies [37, 89].

Record linkage techniques are also being used by national security agencies and crime investigators for effective identification of fraud, crime, or terrorism suspects [74, 125, 168]. Such applications require data from law enforcement agencies, immigration departments, Internet service providers, businesses, as well as financial institutions [125].

In recent times, record linkage is increasingly being required by social scientists in the field of population informatics to gain insights into our society from ‘social genome’ data, the digital traces that contain person-level data about social beings [100]. The ‘Beyond 2011’ program by the Office for National Statistics in the UK, for example, has carried out research to study different possible approaches to producing population and socio-demographic statistics for England and Wales by linking data from several sources [56].

Record linkage within a single organization does not generally involve privacy and confidentiality concerns (assuming there are no internal threats within the organization and the linked data are not being revealed outside the organization). An example application is the deduplication of a customer database by a business using record linkage techniques for conducting effective marketing activities. However, in many countries record linkage across several organizations, as required in the above example applications, might not allow the exchange or the sharing of database records between organizations due to laws or regulations. Some example Acts that describe the legal restrictions of disclosing personal or sensitive data are: (1) the Data-Matching Program Act in Australia³, (2) the European Union (EU) Personal Data Protection Act in Europe⁴, and (3) the Health Insurance Portability and Accountability Act (HIPAA) in the USA⁵.

³ https://www.oaic.gov.au/privacy-law/other-legislation/government-data-matching [Accessed: 15/06/2016]

⁴ http://ec.europa.eu/justice/data-protection/index_en.htm [Accessed: 15/06/2016]

⁵ http://www.hhs.gov/ocr/privacy/ [Accessed: 15/06/2016]

The privacy requirements in the record linkage process have been addressed by developing ‘privacy-preserving record linkage’ (PPRL) techniques, which aim to identify matching records that refer to the same entities in different databases without compromising privacy and confidentiality of these entities. In a PPRL project, the database owners (or data custodians) agree to reveal only selected information about records that have been classified as matches among each other, or to an external party, such as a researcher [165]. However, record linkage requires access to the actual values of certain attributes. Known as quasi-identifiers (QIDs), these attributes need to be common in all databases to be linked and represent identifying characteristics of entities to allow matching of records. Examples of QIDs are first and last names, addresses, telephone numbers, or dates of birth. Such QIDs often contain private and confidential information of entities that cannot be revealed, and therefore the linkage has to be conducted on masked (encoded) versions of the QIDs to preserve the privacy of entities. Several masking techniques have been developed (as we will describe in Sect. 3.4), using two different types of general approaches: (1) secure multi-party computation (SMC) [112] and (2) data perturbation [88].

Leveraging the tremendous opportunities that Big Data can provide for businesses comes with the challenges that PPRL poses, including scalability, quality, and privacy. Big Data implies enormous data volume as well as massive flows (velocity) of data, leading to scalability challenges even with advanced computing technology. The variety and veracity aspects of Big Data require biases, noise, variations and abnormality in data to be considered, which makes the linkage process more challenging. With Big Data containing massive amounts of personal data, linking and mining data may breach the privacy of those represented by the data. A practical PPRL solution that can be used in real-world applications should therefore address these challenges of scalability, linkage quality, and privacy. A variety of PPRL techniques has been developed over the past two decades, as surveyed in [154, 165]. However, these existing approaches for PPRL fall short in providing a sound solution in the Big Data era by not addressing all of the Big Data challenges. Therefore, more research is required to leverage the huge potential that linking databases in the era of Big Data can provide for businesses, government agencies, and research organizations.

In this chapter, we review the existing challenges and techniques, and discuss research directions of PPRL for Big Data. We provide the preliminaries in Sect. 2 and review existing privacy techniques for PPRL in Sect. 3. We then discuss the scalability challenge and existing approaches that address scalability of PPRL in Sect. 4. In Sect. 5, we describe the challenges and existing techniques of PPRL on multiple databases, which is an emerging research avenue that is being increasingly required in many Big Data applications. In Sect. 6 we discuss research directions in PPRL for Big Data, and in Sect. 7 we conclude this chapter with a brief summary of the topics covered.

2 Background

Building on the introduction to record linkage and privacy-preserving record linkage (PPRL) in Sect. 1, we now present background material that contributes to the understanding of the preliminaries. We describe the basic concepts and challenges in Sect. 2.1, and then describe the process of PPRL in Sect. 2.2.

2.1 Overview and Challenges of PPRL

Record linkage is a widely used data pre-processing and data cleaning task where the aim is to link and integrate records that refer to the same entity from two or multiple disparate databases. The record pairs (when linking two databases) or record sets (when linking more than two databases) are compared and classified as ‘matches’ by a linkage model if they are assumed to refer to the same entity, or as ‘non-matches’ if they are assumed to refer to different entities [26, 54]. The frequent absence of unique entity identifiers across the databases to be linked makes it impossible to use a simple SQL-join [30], and therefore linkage requires sophisticated comparisons between a set of QIDs (such as names and addresses) that are commonly available in the records to be linked. However, these QIDs often contain personal information and therefore revealing or exchanging them for linkage is not possible due to privacy and confidentiality concerns.

ID    Given name  Surname  DOB       Gender  Address                 Loan type  Balance
6723  peter       robert   20.06.72  M       16 Main Street 2617     Mortgage   230,000
8345  smith       roberts  11.10.79  M       645 Reader Ave 2602     Personal   8,100
9241  amelia      millar   06.01.74  F       49E Applecross Rd 2415  Mortgage   320,750

Table 1. Example bank database.

PID    Last name  First name  Age  Address                 Sex  Pressure  Stress  Reason
P1209  roberts    peter       41   16 Main St 2617         m    140/90    high    chest pain
P4204  miller     amelia      39   49 Aplecross Road 2415  f    120/80    high    headache
P4894  sieman     jeff        30   123 Norcross Blvd 2602  m    110/80    normal  checkup

Table 2. Example health database.

As an example scenario, assume a demographer who aims to investigate how mortgage stress (having to pay large sums of money on a regular basis to pay off a house) is affecting people with regard to their mental and physical health. This research will require data from financial institutions as well as hospitals as shown in Tables 1 and 2. Neither of these organizations is likely to be willing, or allowed by law, to provide their databases to the researcher. The researcher only requires access to some attributes of the records (such as loan type, balance amount, blood pressure, and stress level) that are linked across these databases, but not the actual identities of the individuals that were linked. However, personal details (such as name, age or date of birth, gender, and address) are needed as QIDs to conduct the linkage due to the absence of unique identifiers across the databases.
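To illustrate why a simple equality test on QIDs is insufficient for the records in Tables 1 and 2, the following minimal Python sketch compares surnames using the Dice coefficient over character bigrams. The choice of this similarity function, the helper names `bigrams` and `dice`, and the selection of sample values are assumptions made for illustration only; the chapter does not prescribe a particular comparison function at this point.

```python
def bigrams(value):
    """Return the set of character 2-grams of a lower-cased, stripped string."""
    s = value.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient of the bigram sets of two strings (1.0 = identical sets)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two strings without bigrams are treated as identical here
    return 2 * len(x & y) / (len(x) + len(y))


# Surnames taken from Table 1 (bank) and Table 2 (health).
print("robert" == "roberts")                # False: an exact join misses this pair
print(round(dice("robert", "roberts"), 3))  # 0.909
print(round(dice("millar", "miller"), 3))   # 0.6
print(round(dice("robert", "sieman"), 3))   # 0.0
```

A linkage model would typically combine such similarities over several QIDs and apply a threshold; the threshold itself depends on the data and is not fixed by this sketch.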

As illustrated in the above example application (shown in Tables 1 and 2), linking records in a privacy-preserving context is important, as sharing or exchanging sensitive and confidential personal data (contained in QIDs of records) between organizations is often not feasible due to privacy concerns, legal restrictions, or commercial interests. Therefore, databases need to be linked in such ways that no sensitive information is being revealed to any of the organizations involved in a cross-organizational linkage project, and no adversary is able to learn anything about these sensitive data. This problem has been addressed by the emerging research area of PPRL [165].

The basic ideas of PPRL techniques are to mask (encode) the databases at their sources and to conduct the linkage using only these masked data. This means no sensitive data are ever exchanged between the organizations involved in a PPRL protocol, or revealed to any other party. At the end of such a PPRL process, the database owners only learn which of their own records match with a high similarity with records from the other database(s). The next steps would be exchanging the values in certain attributes of the matched records (such as loan type, balance amount, blood pressure, and stress level in the above example) between the database owners, or sending selected attribute values to a third party, such as a researcher who requires the linked data for their project [165]. Recent research outcomes and experiments conducted in real health data linkage validate that PPRL can achieve linkage quality with only a small loss compared to traditional record linkage using unencoded QIDs [134, 136].
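The idea of masking at the source and linking only on masked values can be sketched as follows. This is a deliberately simple illustration that assumes a keyed hash (HMAC-SHA256) as the masking function and a secret key shared only by the database owners; it is not one of the specific techniques of Sect. 3.4, and it supports exact matching only. The key, the function names, and the `postcode` attribute (extracted here by hand from the address fields of Tables 1 and 2) are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret agreed between the database owners (never given to the linker).
SHARED_KEY = b"hypothetical-key-known-only-to-the-database-owners"


def mask(value, key=SHARED_KEY):
    """Mask one QID value with a keyed hash after simple normalisation."""
    normalised = " ".join(value.lower().split())
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_database(records, qids):
    """Run by each owner locally: map the masked QID tuple to the local record ID."""
    return {tuple(mask(r[q]) for q in qids): r["id"] for r in records}


def link(masked_a, masked_b):
    """Compare masked QID tuples only; return pairs of record IDs that agree exactly."""
    return sorted((masked_a[k], masked_b[k]) for k in masked_a.keys() & masked_b.keys())


bank = [
    {"id": "6723", "first": "peter", "last": "robert", "postcode": "2617"},
    {"id": "8345", "first": "smith", "last": "roberts", "postcode": "2602"},
    {"id": "9241", "first": "amelia", "last": "millar", "postcode": "2415"},
]
health = [
    {"id": "P1209", "first": "peter", "last": "roberts", "postcode": "2617"},
    {"id": "P4204", "first": "amelia", "last": "miller", "postcode": "2415"},
    {"id": "P4894", "first": "jeff", "last": "sieman", "postcode": "2602"},
]

loose = ["first", "postcode"]
strict = ["first", "last", "postcode"]
print(link(mask_database(bank, loose), mask_database(health, loose)))
# [('6723', 'P1209'), ('9241', 'P4204')]
print(link(mask_database(bank, strict), mask_database(health, strict)))
# []
```

The second result is empty because a single differing character (robert/roberts, millar/miller) changes the hash completely, which is why masking functions that preserve approximate similarity are of interest for PPRL.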

Using PPRL for Big Data involves many challenges, among which the following three key challenges need to be addressed to make PPRL viable for Big Data applications:

1. Scalability: The number of comparisons required for classifying record pairs or sets equals the product of the sizes of the databases that are linked. This is a performance bottleneck in the record linkage process since it potentially requires comparison of all record pairs/sets using expensive comparison functions [9, 33]. Due to the increasing size of Big Data (volume), comparing all records is not feasible in most real-world applications. Blocking and filtering techniques have been used to overcome this challenge by eliminating as many comparisons between non-matching records as possible [9, 29, 149].

2. Linkage quality: The emergence of Big Data brings with it the challenge of dealing with typographical errors and other variations in data (variety and veracity) making the linkage more challenging. The exact matching of QID values, which would classify pairs or sets of records as matches if their QIDs are exactly the same and as non-matches otherwise, will likely lead to low linkage accuracy in the presence of real-world data errors. In addition, the classification models used in record linkage should be effective and accurate in classifying matches and non-matches [33]. Therefore, for practical record linkage applications, techniques are required that facilitate both approximate matching of QID values for comparison, as well as effective classification of record pairs/sets for high linkage accuracy.

3. Privacy: The privacy-preserving requirement in the record linkage process adds a third challenge, privacy, to the two main challenges of scalability and linkage quality [165]. Linking Big Data containing massive amounts of personal data generally involves privacy and confidentiality issues. Privacy needs to be considered in all steps of the record linkage process as only the masked (or encoded) records can be used, making the task of linking databases across organizations more challenging. Several masking techniques have been used for PPRL, as we will discuss in detail in Sect. 3.4.
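The scalability challenge in item 1 can be made concrete with a small Python sketch that contrasts the number of comparisons of a naive all-pairs linkage with the candidate pairs left after blocking. It assumes standard blocking on a single key, here a `postcode` attribute extracted by hand from the addresses in Tables 1 and 2; the function names are hypothetical. The blocking keys are shown unmasked for readability, whereas in a PPRL setting they would have to be masked as well.

```python
from collections import defaultdict


def naive_comparisons(size_a, size_b):
    """Number of record pairs when every record of A is compared with every record of B."""
    return size_a * size_b


def build_blocks(records, blocking_key):
    """Group record IDs by the value of the blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[blocking_key]].append(record["id"])
    return blocks


def candidate_pairs(blocks_a, blocks_b):
    """Only records that share a blocking key value are compared in detail."""
    pairs = []
    for key in sorted(blocks_a.keys() & blocks_b.keys()):
        for id_a in blocks_a[key]:
            for id_b in blocks_b[key]:
                pairs.append((id_a, id_b))
    return pairs


bank = [
    {"id": "6723", "postcode": "2617"},
    {"id": "8345", "postcode": "2602"},
    {"id": "9241", "postcode": "2415"},
]
health = [
    {"id": "P1209", "postcode": "2617"},
    {"id": "P4204", "postcode": "2415"},
    {"id": "P4894", "postcode": "2602"},
]

pairs = candidate_pairs(build_blocks(bank, "postcode"), build_blocks(health, "postcode"))
print(naive_comparisons(len(bank), len(health)))  # 9
print(len(pairs))                                 # 3
print(naive_comparisons(1_000_000, 1_000_000))    # 1000000000000
```

Blocking trades comparisons for potential misses: a true match whose blocking key values differ (for example because of a typographical error in the postcode) is never compared, so the choice of blocking key affects linkage quality as well as scalability.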
