Performance-oriented privacy-preserving data integration

TL;DR: The uncertainty introduced by collisions caused by hashing and the injection of noise can be leveraged to perform a privacy-preserving relational join operation between a massive public table and a relatively smaller private one.
Abstract: Current solutions to integrating private data with public data have provided useful privacy metrics, such as relative information gain, that can be used to evaluate alternative approaches. Unfortunately, they have not addressed critical performance issues, especially when the public database is very large. The use of hashes and noise yields better performance than existing techniques, while still making it difficult for unauthorized entities to distinguish which data items truly exist in the private database. As we show here, the uncertainty introduced by collisions caused by hashing and the injection of noise can be leveraged to perform a privacy-preserving relational join operation between a massive public table and a relatively smaller private one.

Summary

1. Introduction

  • Data is often generated or collected by various parties, and the need to integrate the resulting disparate data sources has been addressed by the research community [1]-[6].
  • When sharing scientific data, privacy quickly becomes an issue.
  • To address this problem, the authors augment the well-known semi-join framework [11], “hiding” the actual values of the join column of table R by hashing them and including additional artificial values.

2.1. Defining Privacy

  • Privacy loss is likened to a communications channel, in which the difference between a posteriori (i.e., after data has been revealed) and a priori (i.e., before data has been revealed) distributions of data measures privacy loss.
  • In [15] and [16], a metric for measuring the inherent uncertainty of a random variable based on its differential entropy is used as a measure for privacy.
  • The common factor among all these proposed metrics is relative information gain, which has also been used in many privacy-preserving applications [17], making it a likely candidate for measuring privacy loss.

2.2. Correctness

  • The second challenge is producing exact and correct answers to queries posed by users.
  • Work in privacy-preserving data mining [18]-[21] has focused on changing the actual values of data items so that the values are hidden but the distribution of the perturbed data is similar to the original data distribution.
  • The exact original data values cannot be accurately recovered.
  • While this is acceptable in data mining, which looks for trends and patterns rather than exact values, data integration requires exact answers.

2.3. Efficiency and Privacy

  • The third challenge is to perform the join operation efficiently without sacrificing much privacy.
  • It has been shown that to completely guarantee the privacy of the queries, the entire contents of dw should be downloaded [22].
  • In some cases this is not practical.
  • Commutative encryption-based approaches require the exchange of both parties’ encrypted data so that they can mutually encrypt each other’s data.
  • Under oblivious transfer, the party providing the answer to the query does not learn the actual query.

3. Privacy Metric

  • For their work, the authors use relative information gain as a basis for a metric to measure privacy loss when data is exchanged.
  • The remainder of this section defines this metric and explains their motivation for selecting it.

4. Privacy-Preserving Distributed Join

  • The first step projects column B from table R and applies a hashing function h to each value in column B, yielding table h(R) with column h(B).
  • Step 2 will generate artificial hash values, yielding table n.
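A minimal sketch of these two steps, assuming a truncated SHA-256 digest as the hash function and uniformly random noise values (neither the actual hash functions nor the noise-generation procedure is specified in this summary):

```python
import hashlib
import random

def h(value, bits=16):
    # Truncated SHA-256: a stand-in for one of the paper's
    # precomputed hash functions.
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

def hashed_join_column(R_B, n_noise, bits=16, seed=0):
    # Step 1: hash each value of join column B.
    true_hashes = {h(b, bits) for b in R_B}
    # Step 2: inject artificial hash values so dw cannot tell
    # which received values correspond to real tuples of R.
    rng = random.Random(seed)
    noise = set()
    while len(noise) < n_noise:
        v = rng.randrange(1 << bits)
        if v not in true_hashes:
            noise.add(v)
    return true_hashes | noise  # the set transmitted to dw

sent = hashed_join_column(["chr10_a", "chr10_b", "chr10_c"], n_noise=5)
```

Because dw only sees truncated hashes mixed with noise, every received value is ambiguous: it may be noise, and even a true hash maps back to many possible domain values.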

4.1. Privacy Constraint Satisfaction

  • Because different hash functions have various sizes, they yield different collision rates.
  • Large hash functions tend to yield low collision rates; whereas, small hash functions tend to yield high collision rates.
  • When the user wishes to perform a join on his private table R and the public table S, he requires that the privacy loss incurred with respect to the contents of table S not exceed prel.
  • Applying equation 7 to each hash function, the minimum number of hash values |r1|,|r2|,…,|rm| for all m available hash functions on dw can be found.

4.2. Performance Estimation

  • To select the appropriate hash function for the data exchange, the transmission cost costi, normalized with respect to the brute-force method (i.e., downloading table S from dw to db), can be estimated.
  • It is assumed that transmissions costs will dominate the execution costs of the overall join operation since the system will be operating over a limited communications link and search time is kept low with the use of indexes.
  • It is found that on average for a given hash value, the number of values in column B that will collide to the same hash value is |S|/|Hi| for a hash function hi (where |Hi| is the number of possible hash values of hi).
  • The hash function hi (with an associated Ni found with equation 9) that yields the lowest normalized transmission cost according to equation 11 is selected as the hash function for the data exchange and is denoted by h.
  • The set h(R) is computed with hash function h.
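Equations 9-11 are not reproduced in this summary, so the selection loop can only be sketched under assumptions: the cost model below charges for the hash values sent plus the expected false-positive tuples returned, normalized by the cost of downloading all of S, and the per-hash-function noise counts N_i are made up for illustration:

```python
def normalized_cost(size_R, size_S, noise_i, bits_i, cost_ratio=0.1):
    # Transmission cost of using a bits_i-bit hash function,
    # normalized by the brute-force cost of downloading all of S.
    # cost_ratio: cost of sending one hash value relative to one tuple.
    distinct_sent = min(size_R + noise_i, 1 << bits_i)
    # Under uniform hashing, each sent hash value matches about
    # size_S / 2**bits_i tuples of S (collisions yield false positives).
    expected_tuples = distinct_sent * size_S / (1 << bits_i)
    return (distinct_sent * cost_ratio + expected_tuples) / size_S

# Candidate hash sizes with the noise counts N_i each would need to
# satisfy the privacy constraint (the N_i figures here are invented).
candidates = {10: 0, 12: 200, 14: 5_000, 16: 60_000}
best_bits = min(candidates,
                key=lambda b: normalized_cost(1_000, 2_500_000,
                                              candidates[b], b))
```

The sketch captures the trade-off the section describes: a small hash needs little noise but returns many false positives, a large hash needs heavy noise injection, and the minimum lies in between.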

5. Implementation and Results

  • A preliminary implementation was done in Java with MySQL [37] via MySQL’s JDBC connector [38].
  • The hash value sets were stored and indexed in w along with their respective S table.
  • Three sets of data were used for three instances of table S. The first two each consisted of 2.5 million synthetically generated tuples.
  • The third set of data was the “alignment block in rat chain of chromosome 10” table, taken from the UCSC Genome Browser Project [40].
  • There were approximately 123,598 different values for the join column in the genome data set, so the size of domain U for join column values was approximated to be 2^17.

5.1. Execution Time Analysis

  • To begin the execution time analysis, the size of table R in relation to the size of the set of possible key values U (|R|/|U|) is varied, with the required relative privacy loss not to exceed 0.01.
  • As shown later in Figure 8, when |R|/|U| transitions from 0.6 to 0.7, the system experiences the largest increase in hash size |H|, resulting in far fewer collisions; consequently, many more hash values are sent to dw to meet the privacy constraint.
  • For a uniform distribution, the execution time is generally independent of |R|/|U|, except when there is a large transition in the hash values used, because the transmission of noise and false-positives dominates the cost.
  • From this figure, it can also be seen that the execution times for join operations operating over the genome data distribution are lower than for the Gaussian distribution, which are usually lower than for the uniform distribution.
  • Figure 5 shows how execution times vary as the target prel changes.

5.2. Absolute Privacy Loss Analysis

  • Figure 7 shows how absolute privacy loss varies as |R| changes and the target prel is fixed at 0.01.
  • For the uniform distribution, the absolute privacy loss is kept very low and close to the target prel of 0.01, since satisfying the relative privacy loss constraint for a uniform distribution is almost identical to satisfying an absolute privacy constraint of the same magnitude.
  • For the Gaussian and genome data distributions, the absolute privacy loss differs greatly from the target relative prel, because far less effort is required to satisfy the relative privacy loss constraint than an absolute privacy loss constraint of equal magnitude, due to less uniformity in these distributions.
  • For non-uniform distributions, achieving low absolute privacy loss would be much more expensive than achieving low relative privacy loss; whereas, the cost for achieving both for a uniform distribution would be relatively the same.
  • Figure 7 also shows that as |R|/|U| increases, absolute privacy loss decreases.

5.3. Hash Selection Analysis

  • Figure 8 shows that the size of the selected hash function that yields the lowest transmission cost increases as |R|/|U| increases, for all distributions.
  • For the uniform distribution, hash sizes ranging from 10-bits to 16-bits are required, depending on the size of |R|.
  • For the Gaussian distribution, hash sizes ranging from 12-bits to 16-bits are required.
  • Finally, for the genome data set, hash sizes ranging from 14-bits to 16-bits are needed.
  • This experiment shows the necessary hash sizes that need to be precomputed and stored in dw for the various S table distributions.

5.4. Transmission Cost Analysis

  • The transmission costs of the hash/noise method in relation to the brute-force method are studied.
  • For the less uniform genome data, the transmission costs remain relatively constant, averaging 25% of that of the brute-force method, for all target relative prel values when |R|/|U| is 0.1.
  • Like for the other distributions, the general behavior of the observed transmission cost curve was predicted by the estimated transmission cost curves, but the actual transmission costs were poorly predicted.
  • Figure 10 compares the attained normalized transmission costs of the hash/noise method with the costs of simple semi-joins (i.e., no privacy constraints enforced).
  • The graph shows that the cost of the semi-join is directly proportional to |R|/|U|.

5.5. Cost-Ratio Analysis

  • Finally, the effect of the cost-ratio, or the ratio between the transmission costs of sending a hash-value and the transmission costs of sending a tuple, is examined.
  • Figure 11 shows that the cost-ratio has very little effect on the overall performance of the system because the number of tuples in set F makes the cost of transmitting set F the dominating cost of the hash/noise method, regardless of the cost-ratio between sending hash values and tuples from set F.

7. Conclusion

  • Three challenges in solving the private data integration problem were presented: (1) privacy, (2) correctness, and (3) efficiency.
  • The use of relative information gain addresses the first challenge.
  • By making use of predefined hash functions and noise injection to satisfy any privacy constraints that a user may pose, traditional indexing mechanisms can be used, making the total cost of a distributed join dominated mostly by transmission costs rather than by search and computational costs.
  • The hash/noise technique works better for less uniform public data sets than for more uniform data sets stored at the public data warehouse.
  • Furthermore, uniform data distributions require a wider range of hash functions to be predefined than less uniform data distributions.


UCRL-CONF-206647
Performance-Oriented
Privacy-Preserving Data
Integration
R. K. Pon, T. Critchlow
September 20, 2004
2005 SIAM International Conference on Data Mining
Newport Beach, CA, United States
April 21, 2005 through April 23, 2005

Disclaimer
This document was prepared as an account of work sponsored by an agency of the United States
Government. Neither the United States Government nor the University of California nor any of their
employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for
the accuracy, completeness, or usefulness of any information, apparatus, product, or process
disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any
specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise,
does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United
States Government or the University of California. The views and opinions of authors expressed herein
do not necessarily state or reflect those of the United States Government or the University of California,
and shall not be used for advertising or product endorsement purposes.

Performance-Oriented Privacy-Preserving Data Integration
Raymond K. Pon Terence Critchlow
University of California, Los Angeles Lawrence Livermore National Laboratory
rpon@cs.ucla.edu critchlow@llnl.gov
Abstract
Current solutions to integrating private data with
public data have provided useful privacy metrics, such
as relative information gain, that can be used to
evaluate alternative approaches. Unfortunately, they
have not addressed critical performance issues,
especially when the public database is very large. The
use of hashes and noise yields better performance than
existing techniques while still making it difficult for
unauthorized entities to distinguish which data items
truly exist in the private database. As we show here,
leveraging the uncertainty introduced by collisions
caused by hashing and the injection of noise, we
present a technique for performing a relational join
operation between a massive public table and a
relatively smaller private one.
1. Introduction
Data is often generated or collected by various
parties, and the need to integrate the resulting disparate
data sources has been addressed by the research
community [1]-[6]. Although the heterogeneity of the
schemas has been addressed, most data integration
approaches have not yet efficiently addressed the
privacy requirements imposed by data sources.
Legal and social circumstances have made data
privacy a significant issue [7]-[8], resulting in the need
for Hippocratic databases (i.e., “databases that include
privacy as a central concern”) [9], particularly in
sharing scientific or medical data. Without strong
privacy guarantees, scientists often refuse to share data
with other scientists for reasons such as
subject/patient confidentiality, proprietary/sensitive
data restrictions, competition, and potential conflict
and disagreement [10].
When sharing scientific data, privacy quickly
becomes an issue. Suppose that a scientist wishes to
perform a query across a table in his private database
and a table in a public data warehouse in the most
efficient manner possible (shown in Figure 1).
Ignoring privacy restrictions, the problem is reduced to
a distributed database problem that can be solved by
shipping the scientist’s table to the warehouse and
performing the join at the warehouse. However, if the
scientist’s data set is proprietary, it cannot be sent
verbatim to the warehouse. The naive solution is for
the scientist to download the entire public table to his
local machine and perform the query there. But to do
so would be prohibitively expensive if the public table
is very large or the communications link is limited.
Assuming that schema reconciliation has already
been done, the problem can be formalized as
follows. Table R(A, B) from a small private
database db is to be joined with table S(B, C) from
a large data warehouse dw on column B, yielding the
desired table Goal = R ⋈_B S. Table R is private and the
identity of the data items in R cannot be known by any
party other than the owner of db. Table S is
publicly available and accessible.
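A toy instance of this formalization, with illustrative column values:

```python
# R(A, B) is private to db; S(B, C) is public at dw.
R = [("a1", "b1"), ("a2", "b2")]
S = [("b1", "c1"), ("b2", "c2"), ("b3", "c3")]

# Goal = R joined with S on column B.
goal = [(a, b, c) for (a, b) in R for (bs, c) in S if b == bs]
```

The privacy requirement is that dw must not learn which B values ("b1", "b2" here) actually occur in R while this join is computed.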
It is assumed that the system operates in a semi-
honest model, where both parties will behave
according to their prescribed role in any given
protocol. However, there are no restrictions on the use
of information that has been learned during the data
exchange after the protocol is completed. dw is treated
as the adversary. To describe the level of privacy
preserved, relative information gain is used.
To address this problem, we augment the well-
known semi-join framework [11], “hiding” the actual
values of the join column of table R by hashing them
and including additional artificial values. The resulting
collection is sent to the data warehouse to retrieve a
subset of table S that includes the data required to
answer the original query along with some false
positives. Although this method will not provide for
absolute privacy (i.e., the adversary can infer nothing
about the contents of table R), the hash/noise method
can guarantee an upper bound on the amount of
privacy loss when data is exchanged. By sacrificing a
small fraction of privacy, this method incurs

significantly lower transmission costs than downloading
the contents of dw to the private database. As one
might expect, this approach has roots in information
hiding [12].
Section 2 provides a short overview of challenges
related to privacy preservation and discusses related
work. Section 3 describes the privacy metric.
Section 4 formally presents our hash/noise approach.
Section 5 outlines a proof-of-concept implementation
and studies initial experimental results. Finally,
section 6 summarizes our work and explores future
roads of research. The appendix summarizes the
notation used throughout the paper.
2. Challenges and Other Related Works
There are several challenges in privacy-preserving
data integration, ranging from defining privacy and
correctness to efficiency. This section provides a short
summary of the most relevant of these challenges.
2.1. Defining Privacy
First, a metric is needed to measure the amount of
privacy loss that is incurred when data is exposed. In
[13], variable privacy is proposed as a method in
which some information can be revealed for some
benefit. Privacy loss is likened to a communications
channel, in which the difference between a posteriori
(i.e., after data has been revealed) and a priori (i.e.,
before data has been revealed) distributions of data
measures privacy loss. In [14], the likelihood of what
can be inferred about a query posed by the user is used
as a measure of privacy loss. In [15] and [16], a metric
for measuring the inherent uncertainty of a random
variable based on its differential entropy is used as a
measure for privacy. The common factor among all
these proposed metrics is relative information gain,
which has also been used in many privacy-preserving
applications [17], making it a likely candidate for
measuring privacy loss.
2.2. Correctness
The second challenge is producing exact and
correct answers to queries posed by users. Work in
privacy-preserving data mining [18]-[21] have focused
on changing the actual values of data items so that the
values of data items are hidden but the distribution of
the perturbed data is similar to that of the original data
distribution. However, the exact original data values
can not be accurately recovered. While this is
acceptable in data mining applications, since data
mining looks for trends and patterns, not exact values,
for data integration, the exact answers are required.
2.3. Efficiency and Privacy
The third challenge is to perform the join
operation efficiently without sacrificing much privacy.
If the join operation is partitioned into multiple
selection queries (one query for each join column
value in table R), the problem is transformed into
hiding the identity of the queries from dw while still
being able to retrieve the result of such queries from
dw. It has been shown that to completely guarantee the
privacy of the queries, the entire contents of dw should
be downloaded [22]. However, in some cases this is
not practical. If the user is willing to sacrifice a small
portion of his data privacy, the join operation can be
done without retrieving all of table S.
Commutative encryption-based approaches have
also been proposed to solve the private data integration
problem as well [23]-[25]. These approaches take
advantage of a family of encryption functions in which
the order that data item are encrypted by two different
keys does not matter. Although such an approach hides
the contents of query results from one or both parties,
it requires the exchange of both parties’ encrypted data
so that they can both mutually encrypt each others’
data. This makes such an approach expensive.
Oblivious transfer [26]-[28] allows the user to
secretly pose a query and only receive the result of the
query and nothing else. The party providing the answer
to the query does not learn the actual query. However,
under an oblivious transfer protocol, encryption and
transmission of all data items held by dw to the user
are required.
There has also been work in private information
retrieval schemes [22][29], which allow a user to
retrieve information from a database while maintaining
the privacy of his query. In these schemes, table S
would be replicated at multiple sites. Given a query,
multiple queries are generated and sent to each site
such that no site can learn the actual original query by
acting alone. The value of a record in column B of
table R is not revealed to the data warehouse.

Figure 1. General problem.
However, many users working with sensitive data
would be unwilling to trust such a system if there is no
way to enforce non-collusion among the sites in the
system, especially if the user simply sees the
aggregation of the various sites as a black box.
2.4. Other Related Works
The proposed hash/noise method takes an
approach similar to the one discussed in [14],
which takes advantage of collisions caused by hashes
to introduce uncertainty in the true contents of a
private database’s table. A HMAC [30] hash value is
generated for each data item in both tables each time a
query is posed. The size of the hash is varied to control
the amount of privacy loss: when the hash size is
increased, there are fewer possible collisions among
join column values, and thus less uncertainty in the
identity of a join column. Specifically, db first hashes
the values of the column B from table R to truncated
HMAC values small enough to satisfy the privacy
constraint posed by the user. Then it transmits its
hashed values and hash size to dw, where the relevant
subset of table S is identified by performing a join on
R’s hashed values with S’s hashed values (generated
by the same HMAC hash key over column B of table
S). Because a new hash with a new size is generated
for each query to vary the level of privacy, traditional
indexing mechanisms cannot be used to accelerate
querying time and extra computation time is required
to compute the hash values of all data items in both
tables. As a result, the join operation becomes a very
expensive operation.
In contrast, our hash/noise approach uses a set
of fixed hash functions and artificial hash values to
control the amount of uncertainty in the identity of the
join column values in table R, thereby controlling the
level of privacy loss incurred. dw would contain an
auxiliary table having a fixed set of columns. The hash
values of join column values of table S are computed
offline and are indexed. During query time, db will
select the hash function that will yield the best
performance. Artificial hash values will be injected
into the data set communicated to dw by db, if the
selected hash function does not sufficiently satisfy the
privacy constraint. Because the hashes are known in
advance, dw can store the resulting hash values
directly in the database and does not need to
recompute them for each query. A candidate set of
tuples that belong to the result is returned by dw
when it receives the hashed values. The candidate set
is then filtered by db to retrieve the final result.
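The exchange just described can be sketched end to end; the hash function and the single noise value below are stand-ins for illustration, not the paper's actual constructions:

```python
import hashlib

def h(value, bits=12):
    # Stand-in for the precomputed hash function h selected by db.
    d = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(d, "big") % (1 << bits)

# dw side: hash values of S's join column are computed offline
# and indexed alongside table S.
S = [("b1", "c1"), ("b2", "c2"), ("b3", "c3"), ("b4", "c4")]
S_index = {}
for b, c in S:
    S_index.setdefault(h(b), []).append((b, c))

# db side: hash R's join column and inject an artificial value.
# (Here the noise is simply the hash of a value not in R.)
R = [("a1", "b1"), ("a2", "b3")]
sent = {h(b) for _, b in R} | {h("b4")}

# dw side: return the candidate set matching the received hashes.
candidates = [t for hv in sent for t in S_index.get(hv, [])]

# db side: filter false positives using the exact values of B.
result = [(a, b, c) for (a, b) in R for (bs, c) in candidates if b == bs]
```

The candidate set contains the tuple matched by the noise value ("b4", "c4"), so dw cannot tell which of the received hashes came from real rows of R; db discards such false positives locally.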
Furthermore, privacy control by hash truncation
alone as suggested by [14] is very coarse. For
example, suppose that a 16-bit hash does not satisfy
the privacy constraint given a table R, so a 15-bit hash
was selected instead. However, the 15-bit hash doubles
the collision rate of the 16-bit hash, doubling the size
of the candidate set for the join result. In contrast, the
same 16-bit hash with some additional artificial hash
values could have satisfied the same privacy constraint
and yielded far fewer records in the candidate set.
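The arithmetic behind this example can be checked against the 2.5-million-tuple S tables of section 5; the candidate-set sizes are expectations under uniform hashing, and the noise count of 200 is made up:

```python
size_S = 2_500_000   # matches the synthetic S tables in section 5
size_R = 1_000       # illustrative private-table size

# Under uniform hashing, each transmitted hash value matches about
# |S| / 2**bits tuples of S.
per_hash_16 = size_S / 2**16   # roughly 38 tuples
per_hash_15 = size_S / 2**15   # roughly 76: dropping one bit
                               # doubles the collision rate

# Expected candidate-set size when truncating to 15 bits versus
# keeping the 16-bit hash and adding 200 artificial values.
truncated_15 = size_R * per_hash_15
noisy_16 = (size_R + 200) * per_hash_16
```

With these (assumed) figures, the 16-bit hash plus noise returns roughly 40% fewer candidate tuples than truncating to 15 bits, which is the point of preferring noise injection over coarser truncation.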
There has also been work in using Bloom filters
to make joins in a distributed database system more
efficient and private [31]-[33]. Like Bloom filters, our
approach makes use of the uncertainty introduced by
the collisions induced by hashing. However, we
augment the simple hashing approach by introducing
artificial noise values to control the level of privacy
desired by the user in exchange for efficiency.
Furthermore, Bloom filters will not allow the use of
traditional indexing mechanisms to speed up querying.
If a Bloom filter was used to summarize the join
column of table R and transmitted to dw, dw would
have to apply the Bloom filter to each join column
value in table S.
Work in querying remote encrypted data [34]-[35]
is also related to private data integration. However,
when querying remote encrypted data, it is assumed
that the encrypted data is owned by the user but exists
on a public server. In the problem we are addressing,
the data on the public server is generally publicly
available and is not owned by any one user.
3. Privacy Metric
For our work, we use relative information gain as
a basis for a metric to measure privacy loss when data
is exchanged. The remainder of this section defines
this metric and explains our motivation for selecting it.
3.1. Entropy and Relative Information Gain
Entropy and relative information gain were
initially proposed in [30]. Entropy is the amount of
uncertainty in a random variable X. If the random
variable X can take on a finite set of values
x1, x2, …, xn, then its entropy is defined as:

    H(X) = − Σ_{i=1}^{n} P(X = x_i) log2 P(X = x_i)        (1)
The conditional entropy H(X|Y) is the amount of
uncertainty in X after Y has been observed. Relative
information gain, or the fraction of information
revealed by Y about X, is defined as:

    RIG(X|Y) = (H(X) − H(X|Y)) / H(X)        (2)
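Taking the verbal definition at face value, relative information gain is the fraction (H(X) − H(X|Y)) / H(X) of X's uncertainty removed by observing Y; both quantities can be computed directly, here on made-up toy distributions:

```python
from math import log2

def entropy(p):
    # H(X) for a finite distribution given as a list of probabilities.
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def relative_information_gain(p_x, p_y, p_x_given_y):
    # (H(X) - H(X|Y)) / H(X): the fraction of X's uncertainty
    # removed by observing Y. p_x_given_y[j] is the distribution
    # of X given Y = y_j.
    h_x = entropy(p_x)
    h_x_given_y = sum(py * entropy(cond)
                      for py, cond in zip(p_y, p_x_given_y))
    return (h_x - h_x_given_y) / h_x

# Y fully determines X: all of X's uncertainty is revealed.
full_loss = relative_information_gain([0.5, 0.5], [0.5, 0.5],
                                      [[1.0, 0.0], [0.0, 1.0]])
# Y is independent of X: nothing is revealed.
no_loss = relative_information_gain([0.5, 0.5], [0.5, 0.5],
                                    [[0.5, 0.5], [0.5, 0.5]])
```

A privacy loss of 1 thus means the revealed data determines the private values completely, and 0 means the adversary learned nothing, which is why a bound such as prel is placed between these extremes.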

Citations
More filters
Posted Content
TL;DR: This paper shows how to transform PIR schemes into SPIR schemes (with information-theoretic privacy), paying a constant factor in communication complexity, and introduces a new cryptographic primitive, called conditional disclosure of secrets, which it is believed may be a useful building block for the design of other cryptographic protocols.
Abstract: Private information retrieval (PIR) schemes allow a user to retrieve the ith bit of an n-bit data string x, replicated in k?2 databases (in the information-theoretic setting) or in k?1 databases (in the computational setting), while keeping the value of i private. The main cost measure for such a scheme is its communication complexity. In this paper we introduce a model of symmetrically-private information retrieval (SPIR), where the privacy of the data, as well as the privacy of the user, is guaranteed. That is, in every invocation of a SPIR protocol, the user learns only a single physical bit of x and no other information about the data. Previously known PIR schemes severely fail to meet this goal. We show how to transform PIR schemes into SPIR schemes (with information-theoretic privacy), paying a constant factor in communication complexity. To this end, we introduce and utilize a new cryptographic primitive, called conditional disclosure of secrets, which we believe may be a useful building block for the design of other cryptographic protocols. In particular, we get a k-database SPIR scheme of complexity O(n1/(2k?1)) for every constant k?2 and an O(logn)-database SPIR scheme of complexity O(log2n·loglogn). All our schemes require only a single round of interaction, and are resilient to any dishonest behavior of the user. These results also yield the first implementation of a distributed version of (n1)-OT (1-out-of-n oblivious transfer) with information-theoretic security and sublinear communication complexity.

418 citations

Journal Article
TL;DR: In this article, the authors proposed a data distortion scheme for association rule mining that simultaneously provides both privacy to the user and accuracy in the mining results, and demonstrated that by generalizing the distortion process to perform symbol-specific distortion, appropriately choosing the distortion parameters, and applying a variety of optimizations in the reconstruction process, runtime efficiencies that are well within an order of magnitude of undistorted mining can be achieved.
Abstract: Data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information. To encourage users to provide correct inputs, we recently proposed a data distortion scheme for association rule mining that simultaneously provides both privacy to the user and accuracy in the mining results. However, mining the distorted database can be orders of magnitude more time-consuming as compared to mining the original database. In this paper, we address this issue and demonstrate that by (a) generalizing the distortion process to perform symbol-specific distortion, (b) appropriately chooosing the distortion parameters, and (c) applying a variety of optimizations in the reconstruction process, runtime efficiencies that are well within an order of magnitude of undistorted mining can be achieved.

61 citations

Book ChapterDOI
01 Oct 2012
TL;DR: This paper proposes a novel non-interactive differentially private mechanism called BLIP (for BLoom-and-flIP) for randomizing Bloom filters and provides an analysis of the protection offered by BLIP against this profile reconstruction attack by deriving an upper and lower bound for the required value of the differential privacy parameter.
Abstract: In this paper, we consider the scenario in which the profile of a user is represented in a compact way, as a Bloom filter, and the main objective is to privately compute in a distributed manner the similarity between users by relying only on the Bloom filter representation. In particular, we aim at providing a high level of privacy with respect to the profile even if a potentially unbounded number of similarity computations take place, thus calling for a non-interactive mechanism. To achieve this, we propose a novel non-interactive differentially private mechanism called BLIP (for BLoom-and-flIP) for randomizing Bloom filters. This approach relies on a bit flipping mechanism and offers high privacy guarantees while maintaining a small communication cost. Another advantage of this non-interactive mechanism is that similarity computation can take place even when the user is offline, which is impossible to achieve with interactive mechanisms. Another of our contributions is the definition of a probabilistic inference attack, called the "Profile Reconstruction attack", that can be used to reconstruct the profile of an individual from his Bloom filter representation. More specifically, we provide an analysis of the protection offered by BLIP against this profile reconstruction attack by deriving an upper and lower bound for the required value of the differential privacy parameter e.

55 citations


Cites methods from "Performance-oriented privacy-preser..."

  • ...The application domains of these techniques include searching document indexes [13,3,5], private information retrieval [21], private matching [22], private publication of search logs [14] and anti-counterfeiting in supply chains [16]....

    [...]

Book ChapterDOI
24 Sep 2008
TL;DR: This paper develops protocols that enable data holders to merge personal records, thus creating larger profiles and diminishing duplication, and presents an extension to the protocol that permits the revelation of k-anonymous demographics, such that the administrator can perform joins more efficiently.
Abstract: Many organizations capture personal information, but the quantity of records needed to detect statistically significant patterns is often beyond the grasp of a single data collector. In the biomedical realm, this problem has pressed regulatory agencies to require funded investigators to share research-derived data to public repositories. The challenge; however, is that shared records must not reveal the identity of the subjects. In this paper, we extend a secure framework in which data holders contribute and query encrypted person-specific data stored on a third party's server. Specifically, we develop protocols that enable data holders to merge personal records, thus creating larger profiles and diminishing duplication. The repository administrator can merge records via encrypted identifiers without decrypting or inferring the contents of the joined records. Our model is more practical than prior secure join methods because each data holder needs only a single interaction with the central repository. We further present an extension to the protocol that permits the revelation of k-anonymous demographics, such that the administrator can perform joins more efficiently with the guarantee that each record can be linked to no less than k individuals in the population. We prove the privacy preserving features of our protocols and experimentally evaluate their efficiency in a real world Census dataset.

33 citations


Cites methods from "Performance-oriented privacy-preser..."

  • ...To enable more efficient solutions, hash-based noise addition techniques [22] and anonymization-based approaches [23] have been proposed....


Journal ArticleDOI
01 Nov 2009
TL;DR: An extended version of the protocol is introduced in which data holders append k-anonymous features of their consumers to their encrypted submissions, which facilitate a more efficient join computation, while providing a formal guarantee that each record is linkable to no less than k individuals in the union of all organizations' consumers.
Abstract: Organizations, such as federally-funded medical research centers, must share de-identified data on their consumers to publicly accessible repositories to adhere to regulatory requirements. Many repositories are managed by third-parties and it is often unknown if records received from disparate organizations correspond to the same individual. Failure to resolve this issue can lead to biased (e.g., double counting of identical records) and underpowered (e.g., unlinked records of different data types) investigations. In this paper, we present a secure multiparty computation protocol that enables record joins via consumers' encrypted identifiers. Our solution is more practical than prior secure join models in that data holders need to interact with the third party one time per data submission. Though technically feasible, the running time of the basic protocol scales quadratically with the number of records. Thus, we introduce an extended version of our protocol in which data holders append k-anonymous features of their consumers to their encrypted submissions. These features facilitate a more efficient join computation, while providing a formal guarantee that each record is linkable to no less than k individuals in the union of all organizations' consumers. Beyond a theoretical treatment of the problem, we provide an extensive experimental investigation with data derived from the US Census to illustrate the significant gains in efficiency such an approach can achieve.

25 citations

References
Journal ArticleDOI
TL;DR: This final installment of the paper considers the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now.
Abstract: In this final installment of the paper we consider the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now. To a considerable extent the continuous case can be obtained through a limiting process from the discrete case by dividing the continuum of messages and signals into a large but finite number of small regions and calculating the various parameters involved on a discrete basis. As the size of the regions is decreased these parameters in general approach as limits the proper values for the continuous case. There are, however, a few new effects that appear and also a general change of emphasis in the direction of specialization of the general results to particular cases.

65,425 citations

Book
01 Jan 1996
TL;DR: A valuable reference for the novice as well as for the expert who needs a wider scope of coverage within the area of cryptography, this book provides easy and rapid access of information and includes more than 200 algorithms and protocols.
Abstract: From the Publisher: A valuable reference for the novice as well as for the expert who needs a wider scope of coverage within the area of cryptography, this book provides easy and rapid access of information and includes more than 200 algorithms and protocols; more than 200 tables and figures; more than 1,000 numbered definitions, facts, examples, notes, and remarks; and over 1,250 significant references, including brief comments on each paper.

13,597 citations


"Performance-oriented privacy-preser..." refers methods in this paper

  • ...Borrowing a technique from [15], eight hash functions were created by simply truncating the result of the MD5 hash [36]....

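The snippet above describes deriving several hash functions by truncating a single MD5 digest. A minimal sketch of that idea follows; the function count and chunk width shown (eight functions of 16 bits each, exactly covering MD5's 128-bit output) are illustrative assumptions, not parameters taken from the chapter:

```python
import hashlib

def make_truncated_hashes(num_functions=8, bits_per_function=16):
    """Derive several hash functions from one MD5 digest by slicing
    disjoint byte ranges out of its 128-bit result."""
    chunk_bytes = bits_per_function // 8

    def hash_fn(index):
        def h(value: str) -> int:
            digest = hashlib.md5(value.encode("utf-8")).digest()
            chunk = digest[index * chunk_bytes:(index + 1) * chunk_bytes]
            return int.from_bytes(chunk, "big")
        return h

    return [hash_fn(i) for i in range(num_functions)]

hashes = make_truncated_hashes()
# The same key maps to eight independent-looking 16-bit values.
values = [h("patient-42") for h in hashes]
```

Because the slices are disjoint, one MD5 computation yields all eight hash values at once, which is why truncation is attractive when many hash functions are needed per tuple.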

Journal ArticleDOI
16 May 2000
TL;DR: This work considers the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed and proposes a novel reconstruction procedure to accurately estimate the distribution of original data values.
Abstract: A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The resulting data records look very different from the original records and the distribution of data values is also very different from the original distribution. While it is not possible to accurately estimate original values in individual data records, we propose a novel reconstruction procedure to accurately estimate the distribution of original data values. By using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.

3,173 citations
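The perturbation approach described in the abstract above can be sketched in a few lines: each record is released as value plus random noise drawn from a publicly known distribution, so individual values are obscured while aggregate statistics remain estimable. This omits the paper's Bayesian reconstruction procedure and uses illustrative data and noise parameters:

```python
import random

def perturb(values, noise_scale=10.0, seed=0):
    """Value perturbation: release value + noise, where the noise
    distribution (here Gaussian with known scale) is public."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_scale) for v in values]

ages = [23, 31, 45, 52, 38] * 200          # toy private column
released = perturb(ages)

# Individual records are distorted, but the aggregate mean survives:
true_mean = sum(ages) / len(ages)
est_mean = sum(released) / len(released)
```

Since the noise is zero-mean, the sample mean of the released column converges to the true mean; recovering the full original distribution requires the reconstruction step the paper proposes.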

Journal ArticleDOI
TL;DR: This work describes schemes that enable a user to access k replicated copies of a database and privately retrieve information stored in the database, so that each individual server gets no information on the identity of the item retrieved by the user.
Abstract: Publicly accessible databases are an indispensable resource for retrieving up-to-date information. But they also pose a significant risk to the privacy of the user, since a curious database operator can follow the user's queries and infer what the user is after. Indeed, in cases where the users' intentions are to be kept secret, users are often cautious about accessing the database. It can be shown that when accessing a single database, to completely guarantee the privacy of the user, the whole database should be down-loaded; namely n bits should be communicated (where n is the number of bits in the database).In this work, we investigate whether by replicating the database, more efficient solutions to the private retrieval problem can be obtained. We describe schemes that enable a user to access k replicated copies of a database (k≥2) and privately retrieve information stored in the database. This means that each individual server (holding a replicated copy of the database) gets no information on the identity of the item retrieved by the user. Our schemes use the replication to gain substantial saving. In particular, we present a two-server scheme with communication complexity O(n1/3).

1,918 citations
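The two-server idea in the abstract above can be illustrated with the classic XOR construction: the client sends one server a random index mask and the other server the same mask flipped at the target index; XORing the two answers recovers the wanted bit, while each mask alone is uniformly random. This sketch uses O(n) communication for clarity, unlike the cited scheme's O(n^(1/3)):

```python
import secrets

def pir_query(db_size, index):
    """Build two masks that differ only at the target index."""
    q1 = [secrets.randbelow(2) for _ in range(db_size)]
    q2 = list(q1)
    q2[index] ^= 1
    return q1, q2

def pir_answer(db, query):
    """Each server XORs the bits its mask selects; a single mask
    reveals nothing about which index the client wants."""
    acc = 0
    for bit, sel in zip(db, query):
        if sel:
            acc ^= bit
    return acc

db = [1, 0, 0, 1, 1, 0, 1, 0]
q1, q2 = pir_query(len(db), index=3)
recovered = pir_answer(db, q1) ^ pir_answer(db, q2)   # equals db[3]
```

All terms except the target bit appear in both answers and cancel under XOR, which is the whole trick; the privacy guarantee is information-theoretic as long as the two servers do not collude.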

Journal ArticleDOI
24 Apr 2003-Nature
TL;DR: The Human Genome Project (HGP) as mentioned in this paper was the first attempt to obtain a high-quality, comprehensive sequence of the human genome, in this fiftieth anniversary year of the discovery of the double-helical structure of DNA.
Abstract: The completion of a high-quality, comprehensive sequence of the human genome, in this fiftieth anniversary year of the discovery of the double-helical structure of DNA, is a landmark event. The genomic era is now a reality. In contemplating a vision for the future of genomics research,it is appropriate to consider the remarkable path that has brought us here. The rollfold (Figure 1) shows a timeline of landmark accomplishments in genetics and genomics, beginning with Gregor Mendel’s discovery of the laws of heredity and their rediscovery in the early days of the twentieth century.Recognition of DNA as the hereditary material, determination of its structure, elucidation of the genetic code, development of recombinant DNA technologies, and establishment of increasingly automatable methods for DNA sequencing set the stage for the Human Genome Project (HGP) to begin in 1990 (see also www.nature.com/nature/DNA50). Thanks to the vision of the original planners, and the creativity and determination of a legion of talented scientists who decided to make this project their overarching focus, all of the initial objectives of the HGP have now been achieved at least two years ahead of expectation, and a revolution in biological research has begun. The project’s new research strategies and experimental technologies have generated a steady stream of ever-larger and more complex genomic data sets that have poured into public databases and have transformed the study of virtually all life processes. The genomic approach of technology development and large-scale generation of community resource data sets has introduced an important new dimension into biological and biomedical research. Interwoven advances in genetics, comparative genomics, highthroughput biochemistry and bioinformatics

1,704 citations

Frequently Asked Questions (2)
Q1. What have the authors stated about future work in "Performance-oriented privacy-preserving data integration"?

From their initial results, several future research directions can be pursued. In future research, several additional features, such as the distribution of table S, should be incorporated into a new estimate. Another future research direction is the use of a Bloom filter to reduce the size of the set R used by the hash/noise method. If infinite domains are used, their method may be too conservative, since there are an infinite number of actual values that may hash to a given hash value.
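The Bloom filter mentioned as future work would compress the transmitted set R into a bit array: membership tests can return false positives (which only add more uncertainty, as hash collisions already do) but never false negatives, so no true matches are lost. A minimal sketch, with illustrative sizes rather than parameters tuned for the chapter's workload:

```python
import hashlib

class BloomFilter:
    """Compact set sketch: false positives possible, false negatives not."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Reuse the truncated-MD5 trick: disjoint 16-bit slices of one digest.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[2 * i:2 * i + 2], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
for key in ["alice", "bob", "carol"]:
    bf.add(key)
```

Here three keys occupy at most 12 bits of a 128-byte array, versus shipping the keys (or their full hashes) explicitly; the false-positive rate is tuned by choosing the bit-array size and hash count against the expected number of insertions.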

The use of hashes and noise yields better performance than existing techniques while still making it difficult for unauthorized entities to distinguish which data items truly exist in the private database. As the authors show, the uncertainty introduced by hash collisions and injected noise can be leveraged to perform a privacy-preserving relational join between a massive public table and a relatively smaller private one.
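The hash/noise semi-join summarized above can be sketched end to end: the private site publishes the coarsely hashed values of its join column plus random noise buckets, the public site ships back only candidate rows, and the private site discards false matches locally. All names, the bucket count, and the noise level below are illustrative assumptions, not the chapter's exact protocol:

```python
import hashlib
import random

NUM_BUCKETS = 256

def h(value):
    """Coarse hash: collisions are intentional, since they create
    uncertainty about which values the private table really holds."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big") % NUM_BUCKETS

def private_site_request(private_keys, noise_count=16, seed=1):
    """Hash the private join keys and inject random noise buckets."""
    rng = random.Random(seed)
    buckets = {h(k) for k in private_keys}
    buckets |= {rng.randrange(NUM_BUCKETS) for _ in range(noise_count)}
    return buckets

def public_site_semijoin(public_rows, buckets):
    """Server-side filter: ship only rows whose hashed key may match."""
    return [row for row in public_rows if h(row[0]) in buckets]

def private_site_finish(candidate_rows, private_keys):
    """Discard false matches caused by collisions and noise."""
    keys = set(private_keys)
    return [row for row in candidate_rows if row[0] in keys]

public_table = [(i, f"public-{i}") for i in range(1000)]
private_keys = [7, 42, 500]
candidates = public_site_semijoin(public_table,
                                  private_site_request(private_keys))
joined = private_site_finish(candidates, private_keys)
```

The public site learns only a small set of hash buckets, any of which could stem from noise or from colliding values, while the private site transfers far fewer rows than a full table scan would require.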