A taxonomy of privacy-preserving record linkage techniques

1 Sep 2013, Information Systems (Pergamon-Elsevier Ltd), Vol. 38, Iss. 6, pp. 946-969
TL;DR: This paper presents an overview of techniques that allow databases to be linked across organizations while preserving the privacy of the data, and develops a taxonomy that characterizes these privacy-preserving record linkage (PPRL) techniques along 15 dimensions.
About: This article was published in Information Systems on 2013-09-01 and is currently open access. It has received 241 citations to date. The article focuses on the topics: Record linkage & Data warehouse.

Summary

1. Introduction

  • Many of these data are about people, or are generated by people.
  • One way to improve data quality and allow more sophisticated data analysis and mining is to integrate data from different sources.
  • Record linkage can also be applied on a single database to detect duplicate records [8, 14].
  • When personal information about people is used in the linking of databases across organizations, then the privacy of this information needs to be carefully protected.
  • After discussing these 15 dimensions, the authors provide a detailed review of existing PPRL techniques, and they show how they fit into their taxonomy.

2. Applications of record linkage

  • Linking records from different databases with the aim to improve data quality or enrich data for further analysis and mining is occurring in an increasing number of application areas including healthcare, government services, crime and fraud detection, and business applications.
  • Record linkage techniques are being used by national security agencies and crime investigators to effectively identify individuals who have committed fraud or crimes [24] [25] [26] .
  • When data from several organizations are linked, then privacy and confidentiality need to be carefully considered, as the following scenarios illustrate.
  • This research requires data from hospitals, the police, as well as public and private health insurers.
  • Health surveillance: preventing infectious diseases early, before they spread widely around a country, is important for a healthy nation.

3. The record linkage process

  • The first step of data pre-processing (data cleaning and standardization) is crucial for quality record linkage outcomes, because most real-world data contain noisy, incomplete and inconsistent data [2, 4].
  • Section 3.2 describes several popular comparison techniques in more detail, including those that have been employed for PPRL.
  • This is usually a time-consuming and error-prone process which depends upon the experience of the experts who conduct the review.
  • In the following the authors discuss the steps of the record linkage process in more detail, and present techniques that have been used in each of the steps.

3.1. Indexing

  • The number of potential record pair comparisons becomes the major performance bottleneck in the record linkage process, since expensive detailed comparisons between records are required [17, 18].
  • To reduce this large number of potential record pair comparisons, some kind of filtering of the unlikely matches can be performed.
  • A drawback of these phonetic encodings, however, is that they are language dependent.
  • In the traditional standard blocking approach used since the 1960s [10], all records that have the same blocking key value will be inserted into the same block, and only the records within the same block will be compared with each other in detail in the comparison step (a minimal sketch of this approach follows this list).
  • Related to q-gram based and sorted neighborhood indexing is suffix array based indexing [42, 43], where suffixes are generated from the blocking key values, and blocks are extracted from the sorted array of suffix strings.
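
As a concrete illustration, here is a minimal Python sketch of standard blocking (not from the paper; the record attributes and the blocking key definition are hypothetical). Records that share a blocking key value land in the same block, and only pairs within a block become candidate pairs.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Hypothetical key: first three letters of the surname plus birth year.
    # Real systems often use phonetic encodings (e.g. Soundex) instead.
    return record["surname"][:3].lower() + str(record["birth_year"])

def standard_blocking(records):
    """Group records by their blocking key value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks

def candidate_pairs(blocks):
    """Generate candidate record pairs within each block (deduplication setting)."""
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"surname": "Miller", "birth_year": 1980},
    {"surname": "Millar", "birth_year": 1980},  # same key: will be compared
    {"surname": "Miller", "birth_year": 1979},  # different key: never compared
]
pairs = list(candidate_pairs(standard_blocking(records)))
print(len(pairs))  # 1, instead of the 3 pairs of a full pairwise comparison
```

Note the trade-off this illustrates: an error in an attribute used for the key (here, the birth year of a true duplicate) silently removes a true match from the candidate set.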

3.2. Comparison

  • Comparisons between two records can be conducted either at the record level or at the attribute level.
  • At the attribute level, comparisons are conducted between individual attribute values, with specialized comparison functions used depending upon the type of data in these attributes.
  • Approximate comparison functions, on the other hand, measure how similar the values in two attributes are with each other.
  • Two surveys of edit-distance based approximate string comparison functions can be found in [46, 47]; a minimal edit distance sketch follows this list.
  • Winkler later added several improvements to the basic Jaro comparison function [53, 54], such as increased similarity if the beginnings of two strings are the same, or weight adjustments based on the lengths of the two strings and how many similar characters they contain.
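
A minimal sketch (illustrative, not the paper's code) of the dynamic-programming edit distance mentioned above, normalized into a [0, 1] similarity; Jaro-Winkler follows the same spirit but additionally rewards strings with a common prefix.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn s1 into s2 (dynamic programming, O(len1*len2))."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (c1 != c2)))   # substitution
        previous = current
    return previous[-1]

def edit_similarity(s1: str, s2: str) -> float:
    """Normalize the distance into a [0, 1] similarity."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / max(len(s1), len(s2))

print(edit_similarity("christine", "christina"))  # 1 - 1/9 = 0.888...
```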

3.3. Classification

  • These vectors are used to classify record pairs as matches, non-matches, and possible matches, depending upon the decision model used [33] (a minimal threshold-based sketch follows this list).
  • Extensions to the basic Fellegi and Sunter approach include the use of the Expectation Maximization (EM) algorithm to estimate the conditional probabilities required by the method in an unsupervised fashion [54, 56-58].
  • Generating rules is often a time-consuming and complex process, since it requires manual effort to both build and maintain rule systems.
  • These rules are then applied on the comparison vectors to classify candidate record pairs into matches, non-matches, or possible matches (if desired) [14, 15, 60].
  • To accurately classify the compared candidate record pairs into matches and non-matches, many recently developed classification techniques for record linkage employ supervised machine learning approaches [8, 61, 62] that require training data with known class labels for matches and non-matches to train a decision model.
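
To make the classification step concrete, here is a minimal threshold-based sketch over a comparison vector (one similarity per compared attribute); the threshold values are assumptions for illustration, not taken from the paper.

```python
def classify(comparison_vector, t_upper=0.85, t_lower=0.6):
    """Average the attribute similarities and classify the pair
    using two (illustrative) thresholds."""
    score = sum(comparison_vector) / len(comparison_vector)
    if score >= t_upper:
        return "match"
    if score <= t_lower:
        return "non-match"
    return "possible match"

# One similarity per compared attribute (e.g. name, address, date of birth):
print(classify([0.95, 0.90, 1.00]))  # match
print(classify([0.70, 0.60, 0.80]))  # possible match (needs clerical review)
print(classify([0.20, 0.10, 0.30]))  # non-match
```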

3.4. Evaluation

  • Evaluating the performance of record linkage algorithms in terms of how efficient and effective they are is the final step in the linkage process.
  • A higher reduction ratio value means an indexing technique is more efficient in reducing the number of candidate record pairs that are being generated.
  • The quality of a linkage can be measured using the metrics commonly employed in information retrieval, machine learning, and data mining [66, 67].
  • True positives (TP) are the true matching record pairs that are correctly classified as 'matches', while false positives (FP) are the true non-matching record pairs that are classified as 'matches'; true negatives (TN) and false negatives (FN) are defined analogously.
  • Based on these four numbers, various measures can be defined, as sketched below.
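
The following sketch (with made-up counts) shows how the common quality measures and the reduction ratio are computed from these numbers.

```python
def linkage_quality(tp, fp, tn, fn):
    """Precision, recall and F-measure from the four classification counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

def reduction_ratio(num_candidate_pairs, total_pairs):
    """Fraction of the full comparison space removed by indexing;
    values closer to 1.0 mean a more efficient indexing technique."""
    return 1.0 - num_candidate_pairs / total_pairs

print(linkage_quality(tp=90, fp=10, tn=880, fn=20))
print(reduction_ratio(num_candidate_pairs=1_000, total_pairs=500_000))  # 0.998
```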

4. An overview of PPRL

  • As the scenarios in Section 2 have shown, the exchange of private or confidential data between organizations is often not feasible due to privacy concerns, legal restrictions, or because of commercial interests.
  • The increasing need to link large databases across organizations while, at the same time, preserving the privacy of the entities stored in these databases has led to the development of a new research area called privacy-preserving record linkage (PPRL) [69-71].
  • The information revealed can either be (1) the number of records that have been classified as matches, (2) the identifiers of these matched records, or (3) a selected set of attributes from these matched records.
  • Some exchange of information between the data sources about what data pre-processing approaches they use, as well as which attributes they have in common that are to be used for the linkage, is therefore required.
  • In a PPRL context, this classification needs to be conducted in such a way that no party learns anything about the records in the other parties' databases that do not match, such as similarity values for certain attributes of individual record pairs, which record pairs have low similarities, or even the distribution of similarity values across all compared record pairs.

4.1. Previous PPRL surveys

  • Trepetin [80] theoretically analyzed four different anonymized string matching techniques and concluded that many existing techniques fall short in providing a sound solution either because they are not scalable to large databases, or because they are unable to provide both linkage quality and privacy guarantees.
  • Similar conclusions were drawn in [69, 79], which survey several existing techniques for private matching, ranging from classical record matching techniques enhanced by SMC techniques to provide privacy, to advanced solutions developed specifically to solve the PPRL problem.
  • In Durham et al.'s [78] recent survey on privacy-preserving string comparators, six existing comparators that can be used in PPRL for private comparison were experimentally evaluated in terms of their complexity, correctness, and privacy.
  • While all these surveys analyze and compare several private comparison functions, the authors' survey is the first to develop a taxonomy that characterizes all aspects of PPRL, and to provide a comprehensive analysis of current approaches to PPRL.

5.1.1. Number of parties

  • Solutions to PPRL can be classified into those that require a third party for performing the linkage and those that do not.
  • In three-party protocols, a third party (which the authors call the 'linkage unit') is involved in conducting the linkage, while in two-party protocols only the two database owners participate in the PPRL process.
  • The advantage of two-party over three-party protocols is that the former are more secure, because there is no possibility of collusion between one of the database owners and the linkage unit; two-party protocols also often have lower communication costs.
  • Two-party protocols generally require more complex techniques to ensure that the two database owners cannot infer any sensitive information from each other during the linkage process.

5.1.2. Adversary model

  • PPRL techniques generally consider one of the two adversary models that are commonly used in the field of cryptography, and especially in the area of secure multiparty computation (SMC) [70, 82, 83].
  • (i) Honest-but-curious (HBC) behavior: HBC parties follow the protocol, but are curious in that they try to find out as much as they can about the other party's inputs [70, 83].
  • A protocol is secure under the HBC model if and only if all parties involved learn nothing at the end of the protocol beyond what they would have learned from the output, i.e. the record pairs classified as matches.
  • Most of the PPRL solutions proposed in the literature assume the HBC adversary model.
  • (ii) Malicious behavior: malicious parties may refuse to participate in a protocol, not follow the protocol in the specified way, choose arbitrary values for their data inputs, or abort the protocol at any time [84].

5.1.3. Privacy techniques

  • A variety of privacy techniques has been employed to facilitate PPRL.
  • In order to prevent dictionary attacks, where an adversary hash-encodes values from a large list of common words using existing hash encoding functions until a matching hash-code is found, a keyed hash encoding approach can be used, which significantly improves the security of this privacy technique (a minimal sketch follows this list).
  • The basic idea of SMC is that a computation is secure if at the end of the computation no party knows anything except its own input and the final results of the computed function [82, 83, 93].
  • Various SMC techniques have been used in PPRL for accurate computation while preserving privacy.
  • When adding extra records there is generally a trade-off between linkage quality, scalability and privacy [101].
  • (x) Differential privacy: recently, differential privacy [119, 120] has emerged as an alternative to generalization techniques.
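
A minimal sketch of the difference between plain and keyed hash encoding, using Python's standard hashlib and hmac modules (the key value is hypothetical):

```python
import hashlib
import hmac

SECRET_KEY = b"shared-only-by-the-database-owners"  # hypothetical shared key

def plain_hash(value: str) -> str:
    # Vulnerable to a dictionary attack: an adversary can hash a large
    # list of common values and look the resulting codes up.
    return hashlib.sha256(value.encode()).hexdigest()

def keyed_hash(value: str) -> str:
    # HMAC with a key known only to the database owners: without the key,
    # an adversary cannot reproduce the codes from a word list.
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

print(plain_hash("smith"))
print(keyed_hash("smith"))
```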

5.2. Linkage techniques

  • The techniques used in the different steps of the PPRL process, as illustrated in Fig. 2, determine the computational requirements and the quality of the linkage results.
  • The dimensions under this topic cover each of the required steps.

5.2.1. Indexing

  • The indexing techniques that allow record linkage solutions to scale to very large databases become more challenging to apply when privacy concerns have to be considered.
  • In PPRL, the indexing step involves a trade-off not only between accuracy and efficiency, but also with privacy.
  • Several approaches have been proposed that address the scalability of PPRL solutions by adapting existing indexing techniques, such as standard blocking, mapping based blocking, clustering, sampling, and locality sensitive hash functions, into a privacy-preserving context, as discussed in Section 6.3.

5.2.2. Comparison

  • Linkage quality is heavily influenced by how the values in records or individual attributes are compared with each other [48].
  • As discussed in Section 4, the naïve approach of exact matching of encrypted values does not provide a practical solution.
  • Several of the approximate comparison functions that were presented in Section 3.2 have been investigated from a privacy preservation perspective.
  • These techniques will be described in detail in Sections 6.2 and 6.3.
  • The main challenge with these techniques is how the similarity between pairs of string values held at different parties can be calculated such that neither party learns about the other party's string value.

5.2.3. Classification

  • The decision model used in PPRL to securely classify the compared record pairs needs to be effective in providing highly accurate results, such that the number of false negatives and false positives is minimized, while at the same time preserving the privacy of all records that are not part of matching pairs.
  • As discussed in Section 3.3, a variety of classification techniques has been developed for record linkage.
  • Details of which classification techniques have been used in PPRL will be described for individual approaches in Section 6.

5.3.1. Scalability

  • This dimension covers the computation and communication complexities that measure the overall computational effort and cost of communication required in the PPRL process.
  • Generally, big-O notation is used to specify the computational complexity [121].

5.3.2. Linkage quality

  • The quality of linkage is theoretically analyzed in terms of fault-tolerance of the matching technique to data errors and variations, whether the matching is based on individual fields or whole records, and the types of data the matching technique can be applied to.
  • Fault-tolerance to data errors can be addressed by using approximate matching or pre-processing techniques such as spelling transformations.
  • Records can either be compared as a whole (record based) or by comparing the values of individual selected attributes (field based), as was discussed in Section 3.2.
  • Several approximate comparison functions have been adapted into a privacy-preserving context as presented in Sections 6.2 and 6.3.

5.3.3. Privacy vulnerabilities

  • The privacy vulnerabilities to which a PPRL technique is susceptible provide a theoretical estimate of the privacy guarantees of that technique.
  • The main privacy vulnerabilities are frequency attacks and dictionary attacks (as discussed in Section 5.1.3); a sketch of a frequency attack follows this list.
  • As Kuzu et al. [122] recently showed, depending upon the number of hash functions employed and the number of bits in a Bloom filter, using a constraint satisfaction solver allows the iterative mapping of individual encoded values back to their original values.
  • Another vulnerability associated with three-party and multi-party approaches is collusion between parties.
  • The vulnerabilities of individual PPRL techniques are discussed in Section 6.
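
A sketch of the frequency attack mentioned above (all values are made up): the adversary ranks the observed encodings and a publicly known plaintext distribution by frequency and aligns the two rankings.

```python
from collections import Counter

def frequency_align(encoded_values, public_values):
    """Align frequency ranks of encodings with those of known plaintexts.
    With skewed attributes (e.g. surnames) the most frequent codes can be
    re-identified with high confidence."""
    enc_ranked = [v for v, _ in Counter(encoded_values).most_common()]
    pub_ranked = [v for v, _ in Counter(public_values).most_common()]
    return dict(zip(enc_ranked, pub_ranked))

encoded = ["h1", "h1", "h1", "h2", "h2", "h3"]                 # observed codes
public = ["smith", "smith", "smith", "jones", "jones", "lee"]  # e.g. a phone book
print(frequency_align(encoded, public))  # {'h1': 'smith', 'h2': 'jones', 'h3': 'lee'}
```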

5.4.3. Privacy evaluation

  • Various measures have been used to assess the privacy protection that PPRL techniques provide.
  • Here the authors present the most prominent measures used.
  • (i) Entropy, information gain (IG) and relative information gain (RIG): entropy measures the amount of information contained in a message [101, 123] (a minimal sketch follows this list).
  • IG assesses the possibility of inferring the original message Y, given its enciphered version X [101, 123].
  • To evaluate privacy formally, a party's view in the execution of a PPRL technique needs to be shown to be simulatable given only its own input and output.
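
A minimal sketch of these measures (the example values are made up): the entropy of an attribute's value distribution, and relative information gain given an externally computed conditional entropy H(Y|X).

```python
import math
from collections import Counter

def entropy(values) -> float:
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def relative_information_gain(h_y: float, h_y_given_x: float) -> float:
    """RIG = (H(Y) - H(Y|X)) / H(Y): the fraction of information about the
    original values Y revealed by the encodings X (0 = nothing, 1 = everything)."""
    return (h_y - h_y_given_x) / h_y

names = ["smith", "smith", "jones", "lee"]
print(entropy(names))                       # 1.5 bits
print(relative_information_gain(1.5, 1.5))  # 0.0: the encoding reveals nothing
```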

5.5.1. Implementation

  • This dimension specifies the implementation techniques that have been used to prototype a PPRL technique in order to conduct its experimental evaluation.
  • Some solutions proposed in the literature provide only theoretical proofs but they have not been evaluated experimentally, or no details about their implementation have been published.

5.5.2. Datasets

  • Experimental evaluation on one or ideally several datasets is important for the critical evaluation of PPRL techniques.
  • Due to the difficulties of obtaining real-world data that contain personal information, synthetically generated databases are commonly used.
  • To evaluate the practical aspects of PPRL techniques with regard to their expected performance in real-world applications, evaluations should ideally be done on databases that exhibit real-world properties and error characteristics.

6. A survey of privacy-preserving record linkage techniques

  • Research directions for PPRL were provided in [7, 20], stating the needs, problems and current approaches in this area, while various techniques have been developed to address this research problem [69, 78-80].
  • The authors highlight terms that relate back to their taxonomy in italic font.
  • The authors categorize PPRL techniques into three generations according to the factors that have been considered.
  • These three generations are (1) techniques that consider exact matching of attribute values only; (2) techniques that can conduct approximate matching to improve the quality of linkage; and (3) techniques that also address scalability while conducting approximate matching.
  • Each technique is given an identifier composed of the first three letters of the first author and the last two digits of the year of publication, which is then used in Table 1 to identify individual publications.

6.1.1. Three-party techniques

  • This approach is cost-effective, but it is inappropriate for real-world applications since it can only perform exact matching of attribute values.
  • Both database owners merge the values of their linkage attributes into a single string (record based), which is then double-hashed using a secure hash function and a public-key encryption algorithm in order to prevent dictionary attacks.
  • The hash strings are then used by a third party to classify the records using a deterministic classification technique (a simplified sketch follows this list).
  • Experiments conducted on health databases showed that the accuracy of the classification increases if the concatenated string includes the full date of birth value.
  • This approach is useful when health policies preclude the full exchange of identifiers that is commonly required by other more sophisticated algorithms.
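
A simplified sketch of this three-party scheme. A keyed hash (HMAC) stands in here for the double hashing with a secure hash function and a public-key encryption algorithm described above; the attribute names and values are illustrative. It also shows why exact matching breaks on the smallest variation.

```python
import hashlib
import hmac

KEY = b"key-agreed-between-the-database-owners"  # hypothetical

def encode_record(record, attributes):
    """Concatenate the linkage attribute values (record based) and encode
    the resulting string with a keyed hash."""
    concatenated = "|".join(str(record[a]) for a in attributes)
    return hmac.new(KEY, concatenated.encode(), hashlib.sha256).hexdigest()

def linkage_unit_match(codes_a, codes_b):
    """The third party sees only hash codes and reports the exact matches."""
    return set(codes_a) & set(codes_b)

attrs = ["given_name", "surname", "dob"]
db_a = [{"given_name": "anne", "surname": "lee", "dob": "1980-02-01"}]
db_b = [{"given_name": "anne", "surname": "lee", "dob": "1980-02-01"},   # matches
        {"given_name": "ann",  "surname": "lee", "dob": "1980-02-01"}]   # missed
print(len(linkage_unit_match([encode_record(r, attrs) for r in db_a],
                             [encode_record(r, attrs) for r in db_b])))  # 1
```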

6.1.2. Two-party techniques

  • Fre05: Privacy-preserving Information Retrieval (PPIR) is a research area related to PPRL; PPIR employs a single query record, while PPRL employs all records as match queries.
  • Their approach uses SMC techniques (homomorphic encryption) and oblivious pseudo-random functions.
  • When a query contains multiple search keywords, the process is repeated for each search keyword in the query (field based).
  • For an efficient search, the server uses blocking-based indexing: it defines L bins and maps the n keywords in the database into these bins using a hash function.
  • The approach has a communication complexity which is poly-logarithmic in the size of the database n, and it needs only one round of communication.

6.1.3. Multi-party techniques

  • One-way secure hash functions such as SHA are used with two pads added in order to avoid dictionary attacks.
  • In their approach, all the records are first converted into a Bloom filter bit array (record based), and each party partitions its Bloom filter into the number of parties involved in the linkage and sends a segment to the corresponding party.
  • Kan08: a multi-party approach based on a generalization technique (k-anonymity) for person-specific biomedical data was introduced by Kantarcioglu et al. [108] in 2008.
  • This approach performs efficient secure joins of encrypted databases by a third party without decrypting or inferring the contents of the joined records.

6.2.1. Three-party techniques

  • Du01: Du et al. [127] in 2001 suggested a secure approach for private remote database access with an untrusted third party that is assumed not to collude with either of the two database owners.
  • The four models are Private Information Matching (PIM), PIM from a Public Database, Secure Storage Outsourcing (SSO), and Secure Storage and Computing Outsourcing (SSCO).
  • A threshold based classification is used for deciding which record pairs are matches.
  • Sch09: an approach based on a combination of Bloom filters and q-grams (to facilitate approximate matching) was proposed by Schnell et al. [114] in 2009.
  • The attribute values of each record are concatenated into one string (record based comparison), and the q-grams of that string are mapped to one bit array (a Bloom filter) using multiple cryptographic hash functions (a minimal sketch follows this list).
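
A minimal sketch of this encoding (the number of bits and of hash functions are illustrative, and salted SHA-256 digests stand in for the keyed cryptographic hash functions): each value's q-grams set bits in a Bloom filter, and two filters are compared with the Dice coefficient.

```python
import hashlib

def qgrams(text: str, q: int = 2):
    padded = f"_{text}_"  # pad so start and end grams are distinguishable
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def bloom_encode(text: str, num_bits: int = 100, num_hashes: int = 4) -> int:
    """Map the q-grams of a value into a Bloom filter (an int used as a bit set)."""
    bits = 0
    for gram in qgrams(text):
        for k in range(num_hashes):  # k salts stand in for k keyed hash functions
            digest = hashlib.sha256(f"{k}:{gram}".encode()).hexdigest()
            bits |= 1 << (int(digest, 16) % num_bits)
    return bits

def dice_similarity(bf1: int, bf2: int) -> float:
    """Dice coefficient on the set bits: 2|A & B| / (|A| + |B|)."""
    common = bin(bf1 & bf2).count("1")
    return 2 * common / (bin(bf1).count("1") + bin(bf2).count("1"))

print(dice_similarity(bloom_encode("christine"), bloom_encode("christina")))  # high
print(dice_similarity(bloom_encode("christine"), bloom_encode("smith")))      # low
```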

6.2.2. Two-party techniques

  • Ata03: a two-party protocol was proposed by Atallah et al. [129] in 2003 in which the edit distance algorithm, as presented in Section 3.2, is modified to provide privacy for approximate comparisons of genome sequences in the area of bioinformatics.
  • The smallest overall cost of transforming one sequence into another is calculated as the edit distance.
  • It is therefore unsuited for tasks with large databases.
  • In their work, they presented methods for approximate comparison of values using string distance metrics, specifically TF-IDF, SoftTF-IDF and the Euclidean distance.
  • This approach provides privacy under the malicious adversary model as well by adopting an encrypted similarity matrix to store the intermediate results.

6.3.1. Three-party techniques

  • All05: Al-Lawati et al. [72] proposed a secure three-party blocking protocol in 2005 that assumes an HBC adversary model and achieves high-performance private record linkage by using secure hash encoding for computing the TF-IDF distance measure in a secure fashion, as illustrated in Fig. 10.
  • The approach provides field based and approximate comparison of record pairs which are then classified using a threshold based classification model.
  • The third method, the frugal third party blocking, uses a secure set intersection (SSI) SMC protocol to reduce the cost of transferring the whole databases to the third party by first identifying the hash signatures that occur in both databases.
  • Increasing the size of the reference table improves the linkage quality to some extent, but this is impractical because it leads to longer run times.
  • Locality-sensitive hash (LSH) functions have also been used for private blocking to reduce the computational complexity (a sketch follows this list).
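
A sketch of LSH-based blocking using MinHash signatures, a common choice (the specific PPRL approaches surveyed may differ in detail, and all parameters here are illustrative): records whose token sets are similar agree on at least one signature band with high probability and therefore share a block key.

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=8):
    """MinHash signature: for each salted hash function, keep the minimum
    hash value over the record's tokens (e.g. its q-grams)."""
    return [min(int(hashlib.sha256(f"{k}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for k in range(num_hashes)]

def lsh_block_keys(tokens, bands=4, num_hashes=8):
    """Split the signature into bands; two records agreeing on any one band
    share a block key, so similar records collide with high probability
    while most dissimilar pairs never meet."""
    sig = minhash_signature(tokens, num_hashes)
    rows = num_hashes // bands
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]

blocks = defaultdict(list)
for rec_id, tokens in [("a1", {"jo", "oh", "hn"}),
                       ("b7", {"jo", "oh", "hn", "ny"})]:
    for key in lsh_block_keys(tokens):
        blocks[key].append(rec_id)
# Collisions are probabilistic; with this much token overlap the two records
# very likely share at least one band.
print([ids for ids in blocks.values() if len(ids) > 1])
```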

6.3.2. Two-party techniques

  • Son00: the approach of Song et al. [99] in 2000, in a two-party context with the HBC model, addresses the problem of approximate matching by calculating enciphered permutations of values using pseudo-random functions for private approximate searching of documents by certain query values.
  • Euclidean distance is used to measure the approximate similarity between records.
  • The approach combines differential privacy and cryptographic methods to solve the PPRL problem in a two-party protocol following the HBC adversary model.
  • They propose an iterative classification approach where the database owners iteratively reveal selected bits from their Bloom filters without compromising privacy, and with acceptable complexity.
  • The pairs that are classified as possible matches are taken to the next iteration, where more bit positions are revealed to classify the pairs (a simplified simulation follows this list).
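
A much-simplified, single-process simulation of this iterative idea (the thresholds, the step size, and the bit-list representation are assumptions; in the actual protocol the parties exchange only the revealed bit positions):

```python
def dice_on_positions(bf1, bf2, positions):
    """Dice coefficient restricted to the bit positions revealed so far."""
    ones1 = sum(bf1[p] for p in positions)
    ones2 = sum(bf2[p] for p in positions)
    common = sum(bf1[p] & bf2[p] for p in positions)
    return 2 * common / (ones1 + ones2) if ones1 + ones2 else 0.0

def iterative_classify(bf1, bf2, t_upper=0.85, t_lower=0.6, step=16):
    """Reveal bit positions in chunks and stop as soon as the partial
    similarity is decisive, so undecided pairs go to the next iteration
    and clear non-matches never expose all of their bits."""
    revealed = []
    for start in range(0, len(bf1), step):
        revealed.extend(range(start, min(start + step, len(bf1))))
        sim = dice_on_positions(bf1, bf2, revealed)
        if sim >= t_upper:
            return "match", len(revealed)
        if sim <= t_lower:
            return "non-match", len(revealed)
    return "possible match", len(revealed)

bf_a = [1 if i % 3 == 0 else 0 for i in range(64)]  # synthetic Bloom filters
bf_b = bf_a[:]                                      # identical pair
bf_c = [1 if i % 3 == 1 else 0 for i in range(64)]  # disjoint set bits
print(iterative_classify(bf_a, bf_b))  # ('match', 16): decided early
print(iterative_classify(bf_a, bf_c))  # ('non-match', 16): decided early
```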

6.3.3. Multi-party techniques

  • Moh11: Mohammed et al. [109] in 2011 proposed an approach for efficient PPRL using the k-anonymity based generalization privacy technique without the need for a trusted third party (two parties only).
  • Mohammed et al. also present solutions for multiple parties (data sources) based on the two different adversary models.
  • The computation and communication costs of this approach are O(n log n) and O(n), respectively, where n is the number of records in the databases.
  • To prevent malicious parties from sending false scores, game-theoretic concepts are used.
  • Empirical studies conducted by the authors using the real-world Adult dataset demonstrated the scalability of the solutions.

7. Discussion and research directions

  • As the authors' survey has shown, a large variety of techniques has been investigated since the beginning of the development of solutions for PPRL.
  • There is a clear path of progress, starting from early techniques that solve the problem of privacy-preserving exact matching, moving on to techniques that allow approximate matching while keeping the attribute values that are matched secure, and finally in the last few years focusing on techniques that address the issue of scalability of PPRL on large databases.

7.1. Privacy aspects

  • With regard to privacy, several topics require further attention in order to make PPRL more applicable for practical applications.
  • In a three-party scenario, PPRL protocols can be extended to multiple database owners by having all database owners send their data to the linkage unit, which then conducts the linkage.
  • The approach by Mohammed et al. [109] uses game theory concepts to deal with malicious parties, and as such provides a novel approach to PPRL.
  • Some of the more commonly used techniques include Bloom filters and generalization techniques such as k-anonymity; however, both have their limitations.
  • More research is needed to investigate the use of differential privacy and other advanced scalable techniques that provide sufficient privacy protection to work in combination with or even replace expensive SMC based techniques.

7.2. Linkage techniques

  • Research in non-PPRL in recent years has developed various advanced techniques that provide improved scalability and linkage quality.
  • Indexing: most work in PPRL that has investigated scalability, through some form of indexing technique, has employed the basic standard blocking approach.
  • Other efficient techniques such as the sorted neighborhood approach, or suffix-array based indexing techniques, need to be explored in a privacy-preserving setting.
  • Comparison: most PPRL solutions in the second and third generations consider approximate comparison.
  • However, these comparison functions are mostly applicable only to string data.

7.3. Theoretical analysis

  • A standard set of privacy measures is required that allows the comparative theoretical analysis of privacy preservation that can be achieved by PPRL techniques.
  • As there are often different privacy requirements in different practical applications of record linkage, a measure such as the privacy spectrum proposed by Reiter and Rubin [142] might be suitable.

7.4. Evaluation

  • The evaluation of the implementation of PPRL techniques with regard to their scalability, linkage quality, and privacy preservation poses some unique challenges.
  • Without being able to assess linkage quality and completeness, PPRL will not be useful for real-world linkage applications: not knowing how good the results of a linkage project are is not an option in practice, where linkage quality and completeness are two crucial factors for successful PPRL.
  • There is currently no framework available for PPRL that facilitates the comparative evaluation of different PPRL techniques with regard to privacy, scalability, and linkage quality.
  • A framework for PPRL will need to facilitate the detailed specifications of all building blocks of the PPRL process in the form of abstract representations, such as XML schemas.
  • This will make it possible for researchers to implement their novel algorithms and techniques, and integrate them so as to evaluate them comparatively.

7.5. Practical aspects

  • So far it seems that no single PPRL technique has outperformed all other techniques in the three aspects of linkage quality, privacy preservation, and scalability to large datasets.
  • The lack of comprehensive studies that compare many existing techniques within the same framework and on many different types of data means that it is currently not possible to determine which technique(s) perform better than others on data with different characteristics and of different sizes.
  • Conducting such large experimental studies is one avenue of research that would be highly beneficial to better understand the characteristics of PPRL techniques.

8. Conclusion

  • In this paper the authors have presented a survey of historical and current state-of-the-art techniques for PPRL.
  • Crucially, there is currently no overarching framework available that allows different approaches to PPRL to be evaluated comparatively.
  • Solving these open research questions is a core requirement to make PPRL applicable for practical applications.
