Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges

Dinusha Vatsalan¹, Ziad Sehili², Peter Christen¹, and Erhard Rahm²
¹ Research School of Computer Science, The Australian National University,
Acton ACT 2601, Australia; {dinusha.vatsalan,peter.christen}@anu.edu.au
² Database Group, University of Leipzig, 04109 Leipzig, Germany;
{rahm,sehili}@informatik.uni-leipzig.de
Abstract. The growth of Big Data, especially personal data dispersed
in multiple data sources, presents enormous opportunities and insights
for businesses to explore and leverage the value of linked and integrated
data. However, privacy concerns impede sharing or exchanging data for
linkage across different organizations. Privacy-preserving record linkage
(PPRL) aims to address this problem by identifying and linking records
that correspond to the same real-world entity across several data sources
held by different parties without revealing any sensitive information
about these entities. PPRL is increasingly being required in many real-
world application areas. Examples range from public health surveillance
to crime and fraud detection, and national security. PPRL for Big Data
poses several challenges, with the three major ones being (1) scalability
to multiple large databases, due to their massive volume and the flow of
data within Big Data applications, (2) achieving high quality results of
the linkage in the presence of variety and veracity of Big Data, and (3)
preserving privacy and confidentiality of the entities represented in Big
Data collections. In this chapter, we describe the challenges of PPRL
in the context of Big Data, survey existing techniques for PPRL, and
provide directions for future research.
Keywords: Record linkage, Privacy, Big Data, Scalability
1 Introduction
With the Big Data revolution, many organizations collect and process datasets
that contain many millions of records to analyze and mine interesting patterns
and knowledge in order to empower efficient and quality decision making [28,
53]. Analyzing and mining such large datasets often require data from multiple
sources to be linked and aggregated. Linking records from different data
sources with the aim to improve data quality or enrich data for further analysis
is occurring in an increasing number of application areas, such as in healthcare,
government services, crime and fraud detection, national security, and business
applications [28, 52]. Effective ways of linking data from different sources have
also played an increasingly important role in generating new insights for
population informatics in the health and social sciences [100].

For example, linking health databases from different organizations facilitates
quality health data mining and analytics in applications such as
epidemiological studies (outbreak detection of infectious diseases) or adverse
drug reaction studies [20, 117]. These applications require data from several
organizations to be linked, for example human health data, travel data,
consumed drug data, and even animal health data [38]. Linked health databases
can also be used for the development of health policies in a more efficient and
effective way compared to traditionally used time-consuming survey studies
[37, 89].
Record linkage techniques are also being used by national security agencies
and crime investigators for effective identification of fraud, crime, or
terrorism suspects [74, 125, 168]. Such applications require data from law
enforcement agencies, immigration departments, Internet service providers,
businesses, as well as financial institutions [125].
Recently, record linkage has increasingly been required by social scientists
in the field of population informatics to gain insights into our society from
‘social genome’ data, the digital traces that contain person-level data about
social beings [100]. The ‘Beyond 2011’ program by the Office for National
Statistics in the UK, for example, has carried out research to study different
possible approaches to producing population and socio-demographic statistics
for England and Wales by linking data from several sources [56].
Record linkage within a single organization does not generally involve privacy
and confidentiality concerns (assuming there are no internal threats within the
organization and the linked data are not being revealed outside the
organization). An example application is the deduplication of a customer
database by a business using record linkage techniques for conducting effective
marketing activities. However, in many countries record linkage across several
organizations, as required in the above example applications, might not allow
the exchange or the sharing of database records between organizations due to
laws or regulations. Some example Acts that describe the legal restrictions of
disclosing personal or sensitive data are: (1) the Data-Matching Program Act in
Australia³, (2) the European Union (EU) Personal Data Protection Act in
Europe⁴, and (3) the Health Insurance Portability and Accountability Act
(HIPAA) in the USA⁵.
The privacy requirements in the record linkage process have been addressed
by developing ‘privacy-preserving record linkage’ (PPRL) techniques, which aim
to identify matching records that refer to the same entities in different databases
without compromising privacy and confidentiality of these entities. In a PPRL
project, the database owners (or data custodians) agree to reveal only selected
information about records that have been classified as matches among each other,
or to an external party, such as a researcher [165]. However, record linkage
requires access to the actual values of certain attributes. Known as
quasi-identifiers (QIDs), these attributes need to be common in all databases
to be linked and represent identifying characteristics of entities to allow
matching of records. Examples of QIDs are first and last names, addresses,
telephone numbers, or dates of birth. Such QIDs often contain private and
confidential information of entities that cannot be revealed, and therefore the
linkage has to be conducted on masked (encoded) versions of the QIDs to
preserve the privacy of entities. Several masking techniques have been
developed (as we will describe in Sect. 3.4), using two different types of
general approaches: (1) secure multi-party computation (SMC) [112] and (2) data
perturbation [88].
³ https://www.oaic.gov.au/privacy-law/other-legislation/government-data-matching [Accessed: 15/06/2016]
⁴ http://ec.europa.eu/justice/data-protection/index_en.htm [Accessed: 15/06/2016]
⁵ http://www.hhs.gov/ocr/privacy/ [Accessed: 15/06/2016]
Leveraging the tremendous opportunities that Big Data can provide for
businesses comes with the challenges that PPRL poses, including scalability,
quality, and privacy. Big Data implies enormous data volume as well as massive
flows (velocity) of data, leading to scalability challenges even with advanced
computing technology. The variety and veracity aspects of Big Data require
biases, noise, variations and abnormality in data to be considered, which makes
the linkage process more challenging. With Big Data containing massive amounts
of personal data, linking and mining data may breach the privacy of those
represented by the data. A practical PPRL solution that can be used in
real-world applications should therefore address these challenges of
scalability, linkage quality, and privacy. A variety of PPRL techniques has
been developed over the past two decades, as surveyed in [154, 165]. However,
these existing approaches for PPRL fall short in providing a sound solution in
the Big Data era by not addressing all of the Big Data challenges. Therefore,
more research is required to leverage the huge potential that linking databases
in the era of Big Data can provide for businesses, government agencies, and
research organizations.
In this chapter, we review the existing challenges and techniques, and discuss
research directions of PPRL for Big Data. We provide the preliminaries in Sect. 2
and review existing privacy techniques for PPRL in Sect. 3. We then discuss the
scalability challenge and existing approaches that address scalability of PPRL in
Sect. 4. In Sect. 5, we describe the challenges and existing techniques of PPRL
on multiple databases, which is an emerging research avenue that is being
increasingly required in many Big Data applications. In Sect. 6 we discuss
research directions in PPRL for Big Data, and in Sect. 7 we conclude this
chapter with a brief summary of the topics covered.
2 Background
Building on the introduction to record linkage and privacy-preserving record
linkage (PPRL) in Sect. 1, we now present background material that contributes
to the understanding of the preliminaries. We describe the basic concepts and
challenges in Sect. 2.1, and then describe the process of PPRL in Sect. 2.2.
2.1 Overview and Challenges of PPRL
Record linkage is a widely used data pre-processing and data cleaning task where
the aim is to link and integrate records that refer to the same entity from two or
multiple disparate databases. The record pairs (when linking two databases) or
record sets (when linking more than two databases) are compared and classified
as ‘matches’ by a linkage model if they are assumed to refer to the same entity,
or as ‘non-matches’ if they are assumed to refer to different entities [26, 54]. The
frequent absence of unique entity identifiers across the databases to be linked
makes it impossible to use a simple SQL-join [30], and therefore linkage requires
sophisticated comparisons between a set of QIDs (such as names and addresses)
that are commonly available in the records to be linked. However, these QIDs
often contain personal information and therefore revealing or exchanging them
for linkage is not possible due to privacy and confidentiality concerns.

ID    Given name  Surname  DOB       Gender  Address                 Loan type  Balance
6723  peter       robert   20.06.72  M       16 Main Street 2617     Mortgage   230,000
8345  smith       roberts  11.10.79  M       645 Reader Ave 2602     Personal   8,100
9241  amelia      millar   06.01.74  F       49E Applecross Rd 2415  Mortgage   320,750
Table 1. Example bank database.

PID    Last name  First name  Age  Address                 Sex  Pressure  Stress  Reason
P1209  roberts    peter       41   16 Main St 2617         m    140/90    high    chest pain
P4204  miller     amelia      39   49 Aplecross Road 2415  f    120/80    high    headache
P4894  sieman     jeff        30   123 Norcross Blvd 2602  m    110/80    normal  checkup
Table 2. Example health database.

As an example scenario, assume a demographer who aims to investigate how
mortgage stress (having to pay large sums of money on a regular basis to pay off
a house) is affecting people with regard to their mental and physical health. This
research will require data from financial institutions as well as hospitals as shown
in Tables 1 and 2. Neither of these organizations is likely to be willing or allowed by law
to provide their databases to the researcher. The researcher only requires access
to some attributes of the records (such as loan type, balance amount, blood
pressure, and stress level) that are linked across these databases, but not the
actual identities of the individuals that were linked. However, personal details
(such as name, age or date of birth, gender, and address) are needed as QIDs to
conduct the linkage due to the absence of unique identifiers across the databases.
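The need for approximate comparison of QIDs can be illustrated with the records
of Tables 1 and 2. The following minimal Python sketch is not part of the
original chapter: the helper names (record_similarity, exact_join,
approximate_link), the equal weighting of the three QIDs, and the threshold of
0.8 are assumptions chosen for illustration, and the ratio computed by Python's
difflib.SequenceMatcher merely stands in for whatever comparison function a real
linkage system would use. The sketch works on unencoded values, so it shows only
why an exact join fails, not how privacy is preserved.

```python
from difflib import SequenceMatcher

# QID values transcribed from Table 1 (bank) and Table 2 (health).
bank = [
    {"id": "6723", "first": "peter", "last": "robert", "address": "16 Main Street 2617"},
    {"id": "8345", "first": "smith", "last": "roberts", "address": "645 Reader Ave 2602"},
    {"id": "9241", "first": "amelia", "last": "millar", "address": "49E Applecross Rd 2415"},
]
health = [
    {"id": "P1209", "first": "peter", "last": "roberts", "address": "16 Main St 2617"},
    {"id": "P4204", "first": "amelia", "last": "miller", "address": "49 Aplecross Road 2415"},
    {"id": "P4894", "first": "jeff", "last": "sieman", "address": "123 Norcross Blvd 2602"},
]

QIDS = ("first", "last", "address")  # assumed set of common QIDs


def record_similarity(rec_a, rec_b):
    """Average string similarity over the QIDs (equal weights are an assumption)."""
    scores = [SequenceMatcher(None, rec_a[q], rec_b[q]).ratio() for q in QIDS]
    return sum(scores) / len(scores)


def exact_join(db_a, db_b):
    """Pairs whose QID values are all identical, as a simple SQL join would find."""
    return [(a["id"], b["id"]) for a in db_a for b in db_b
            if all(a[q] == b[q] for q in QIDS)]


def approximate_link(db_a, db_b, threshold=0.8):
    """Pairs whose average QID similarity reaches the (assumed) threshold."""
    return [(a["id"], b["id"]) for a in db_a for b in db_b
            if record_similarity(a, b) >= threshold]


if __name__ == "__main__":
    print("exact join:      ", exact_join(bank, health))
    print("approximate link:", approximate_link(bank, health))
```

Under these assumptions the exact join returns no pairs, whereas the approximate
comparison pairs record 6723 with P1209 and record 9241 with P4204, despite the
variations in surname and address.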
As illustrated in the above example application (shown in Tables 1 and 2),
linking records in a privacy-preserving context is important, as sharing or
exchanging sensitive and confidential personal data (contained in QIDs of
records) between organizations is often not feasible due to privacy concerns,
legal restrictions, or commercial interests. Therefore, databases need to be
linked in such ways that no sensitive information is being revealed to any of
the organizations involved in a cross-organizational linkage project, and no
adversary is able to learn anything about these sensitive data. This problem
has been addressed by the emerging research area of PPRL [165].
The basic ideas of PPRL techniques are to mask (encode) the databases at
their sources and to conduct the linkage using only these masked data. This
means no sensitive data are ever exchanged between the organizations involved
in a PPRL protocol, or revealed to any other party. At the end of such a PPRL
process, the database owners only learn which of their own records match with a
high similarity with records from the other database(s). The next steps would be
exchanging the values in certain attributes of the matched records (such as loan
type, balance amount, blood pressure, and stress level in the above example)
between the database owners, or sending selected attribute values to a third
party, such as a researcher who requires the linked data for their project [165].
Recent research outcomes and experiments conducted in real health data linkage
validate that PPRL can achieve linkage quality with only small loss compared
to traditional record linkage using unencoded QIDs [134, 136].
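The idea of masking QIDs at the source and comparing only the masked values can
be sketched as follows. This is a minimal illustration under stated assumptions,
not the method of this chapter (masking techniques are the subject of Sect. 3.4):
it uses a Bloom-filter style encoding of character bigrams with keyed hashing and
a Dice coefficient for comparison, an approach commonly described in the PPRL
literature. The filter length, the number of hash functions, the padding
character, the function names, and the shared secret key are all hypothetical
choices made for this example, and no claim is made about the resistance of this
particular parameter setting to attacks.

```python
import hashlib
import hmac

FILTER_BITS = 256  # assumed Bloom filter length
NUM_HASHES = 4     # assumed number of keyed hash functions per bigram


def bigrams(value):
    """Character bigrams of the lower-cased value, padded with '_' (an assumption)."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def mask(value, secret_key):
    """Encode a QID value as the set of bit positions set in its Bloom filter."""
    positions = set()
    for gram in bigrams(value):
        for seed in range(NUM_HASHES):
            digest = hmac.new(secret_key, f"{seed}:{gram}".encode("utf-8"),
                              hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return frozenset(positions)


def dice(bits_a, bits_b):
    """Dice coefficient of two masked values (1.0 for identical encodings)."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


if __name__ == "__main__":
    key = b"key-shared-by-the-database-owners"  # hypothetical shared secret
    # Each database owner masks its own values; only the masks are compared.
    millar, miller, sieman = (mask(v, key) for v in ("millar", "miller", "sieman"))
    print("millar vs miller:", round(dice(millar, miller), 2))
    print("millar vs sieman:", round(dice(millar, sieman), 2))
```

Because similar strings share most of their bigrams, their masked versions share
most of their set bits, which is what allows approximate matching without
exchanging the original values. The party comparing the masks does not need the
key.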
Using PPRL for Big Data involves many challenges. Among them, the following
three key challenges need to be addressed to make PPRL viable for Big Data
applications:
1. Scalability: The number of comparisons required for classifying record pairs
or sets is equal to the product of the sizes of the databases that are linked.
This is a performance bottleneck in the record linkage process since it
potentially requires comparison of all record pairs/sets using expensive
comparison functions [9, 33]. Due to the increasing size of Big Data (volume),
comparing all records is not feasible in most real-world applications. Blocking
and filtering techniques have been used to overcome this challenge by
eliminating as many comparisons between non-matching records as possible
[9, 29, 149].
2. Linkage quality: The emergence of Big Data brings with it the challenge of
dealing with typographical errors and other variations in data (variety and
veracity) making the linkage more challenging. The exact matching of QID
values, which would classify pairs or sets of records as matches if their QIDs
are exactly the same and as non-matches otherwise, will likely lead to low
linkage accuracy in the presence of real-world data errors. In addition, the
classification models used in record linkage should be effective and accurate
in classifying matches and non-matches [33]. Therefore, for practical record
linkage applications, techniques are required that facilitate both approximate
matching of QID values for comparison, as well as effective classification of
record pairs/sets for high linkage accuracy.
3. Privacy: The privacy-preserving requirement in the record linkage process
adds a third challenge, privacy, to the two main challenges of scalability
and linkage quality [165]. Linking Big Data containing massive amounts of
personal data generally involves privacy and confidentiality issues. Privacy
needs to be considered in all steps of the record linkage process as only
the masked (or encoded) records can be used, making the task of linking
databases across organizations more challenging. Several masking techniques
have been used for PPRL, as we will discuss in detail in Sect. 3.4.
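The scalability challenge described in item 1 can be made concrete with a small
sketch. The code below is an illustration only, assuming a very simple form of
blocking in which records are compared only if they share a blocking key. The
choice of the postcode (the last token of the address in Tables 1 and 2) as the
key and the function names are hypothetical. The sketch also ignores privacy: in
a PPRL setting the blocking keys themselves would have to be masked.

```python
from collections import defaultdict

# Identifiers and addresses transcribed from Tables 1 and 2.
bank = [
    {"id": "6723", "address": "16 Main Street 2617"},
    {"id": "8345", "address": "645 Reader Ave 2602"},
    {"id": "9241", "address": "49E Applecross Rd 2415"},
]
health = [
    {"id": "P1209", "address": "16 Main St 2617"},
    {"id": "P4204", "address": "49 Aplecross Road 2415"},
    {"id": "P4894", "address": "123 Norcross Blvd 2602"},
]


def postcode(record):
    """Hypothetical blocking key: the last whitespace-separated address token."""
    return record["address"].split()[-1]


def naive_comparisons(db_a, db_b):
    """Comparisons needed without blocking: the product of the database sizes."""
    return len(db_a) * len(db_b)


def build_blocks(db, key_fn):
    """Group the records of one database by their blocking key."""
    blocks = defaultdict(list)
    for record in db:
        blocks[key_fn(record)].append(record)
    return blocks


def candidate_pairs(db_a, db_b, key_fn):
    """Yield only those record pairs that fall into the same block."""
    blocks_a = build_blocks(db_a, key_fn)
    blocks_b = build_blocks(db_b, key_fn)
    for key in blocks_a.keys() & blocks_b.keys():
        for rec_a in blocks_a[key]:
            for rec_b in blocks_b[key]:
                yield rec_a, rec_b


if __name__ == "__main__":
    pairs = list(candidate_pairs(bank, health, postcode))
    print("without blocking:", naive_comparisons(bank, health), "comparisons")
    print("with blocking:   ", len(pairs), "comparisons")
```

For the two example tables, blocking on the postcode reduces the nine possible
comparisons to three candidate pairs while retaining both true matches. A
blocking key that contains an error would, however, cause a true match to be
missed, which is one reason why scalability and linkage quality have to be
considered together.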
