# Performance-oriented privacy-preserving data integration

## Summary

### 1. Introduction

- Data is often generated or collected by various parties, and the need to integrate the resulting disparate data sources has been addressed by the research community [1]-[6].
- When sharing scientific data, privacy quickly becomes an issue.
- To address this problem, the authors augment the well-known semi-join framework [11], “hiding” the actual values of the join column of table R by hashing them and including additional artificial values.

### 2.1. Defining Privacy

- Privacy loss is likened to a communications channel, in which the difference between a posteriori (i.e., after data has been revealed) and a priori (i.e., before data has been revealed) distributions of data measures privacy loss.
- In [15] and [16], a metric for measuring the inherent uncertainty of a random variable based on its differential entropy is used as a measure for privacy.
- The common factor among all these proposed metrics is relative information gain, which has also been used in many privacy-preserving applications [17], making it a likely candidate for measuring privacy loss.

### 2.2. Correctness

- The second challenge is producing exact and correct answers to queries posed by users.
- Work in privacy-preserving data mining [18]-[21] has focused on perturbing the actual values of data items so that the values are hidden but the distribution of the perturbed data remains similar to that of the original data.
- The exact original data values cannot be accurately recovered.
- While this is acceptable in data mining applications, since data mining looks for trends and patterns, not exact values, for data integration, the exact answers are required.

### 2.3. Efficiency and Privacy

- The third challenge is to perform the join operation efficiently without sacrificing much privacy.
- It has been shown that to completely guarantee the privacy of the queries, the entire contents of dw should be downloaded [22].
- In some cases this is not practical.
- It requires the exchange of both parties’ encrypted data so that each party can encrypt the other’s data.
- The party providing the answer to the query does not learn the actual query.

### 3. Privacy Metric

- For their work, the authors use relative information gain as a basis for a metric to measure privacy loss when data is exchanged.
- The remainder of this section defines this metric and explains their motivation for selecting it.
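The summary does not reproduce the paper's equations, so the following is only a minimal sketch of relative information gain in its textbook form: the drop in Shannon entropy from the a priori to the a posteriori distribution, divided by the a priori entropy. The function names and the example distributions are assumptions made for illustration, not the authors' definitions.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def relative_information_gain(prior, posterior):
    """(H(prior) - H(posterior)) / H(prior); assumes H(prior) > 0."""
    h_prior = entropy(prior)
    return (h_prior - entropy(posterior)) / h_prior


# Hypothetical example: an observer first considers 8 key values
# equally likely (3 bits of uncertainty) ...
prior = [1 / 8] * 8
# ... and after the exchange has narrowed them to 4 equally likely
# values (2 bits of uncertainty).
posterior = [1 / 4] * 4 + [0.0] * 4

loss = relative_information_gain(prior, posterior)
```

Under this definition, a value of 0 means the exchange revealed nothing and a value of 1 means the observer's uncertainty was removed entirely.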

### 4. Privacy-Preserving Distributed Join

- The first step projects column B from table R and applies a hashing function h to each value in column B, yielding table h(R) with column h(B).
- Step 2 will generate artificial hash values, yielding table n.
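The two steps above can be sketched roughly as follows. The paper reports building its hash functions by truncating MD5, which this sketch imitates; the helper names, the choice of keeping the leading bits, and drawing the artificial values uniformly at random are assumptions for illustration only.

```python
import hashlib
import random


def truncated_md5(value, bits):
    """Return the leading `bits` bits of the MD5 digest of `value` as an int."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def hash_with_noise(join_values, bits, total_hashes, rng):
    """Step 1: hash column B. Step 2: pad with artificial hash values."""
    real = {truncated_md5(v, bits) for v in join_values}
    # Never ask for more noise values than the hash space can supply.
    available = (1 << bits) - len(real)
    needed = min(max(0, total_hashes - len(real)), available)
    noise = set()
    while len(noise) < needed:
        candidate = rng.randrange(1 << bits)
        if candidate not in real:  # noise must not duplicate a real hash
            noise.add(candidate)
    return real, noise


# Hypothetical private join-column values and parameters.
rng = random.Random(0)
real, noise = hash_with_noise([101, 202, 303], bits=10, total_hashes=20, rng=rng)
sent = sorted(real | noise)  # what would be transmitted to the warehouse
```

Sorting the combined set before sending avoids leaking, through ordering, which hash values are real and which are artificial.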

### 4.1. Privacy Constraint Satisfaction

- Because different hash functions have various sizes, they yield different collision rates.
- Large hash functions tend to yield low collision rates, whereas small hash functions tend to yield high collision rates.
- When the user wishes to perform a join on his private table R and the public table S, he requires that the privacy loss incurred with respect to the contents of table S not exceed prel.
- Applying equation 7 to each hash function, the minimum number of hash values |r1|,|r2|,…,|rm| for all m available hash functions on dw can be found.

### 4.2. Performance Estimation

- To select the appropriate hash function for the data exchange, the transmission cost cost_i, normalized with respect to the brute-force method (i.e., downloading table S from dw to db), can be estimated for each hash function.
- It is assumed that transmission costs will dominate the execution costs of the overall join operation, since the system will be operating over a limited communications link and search time is kept low with the use of indexes.
- It is found that, on average, the number of values in column B that collide to the same hash value is |S|/|H_i| for a hash function h_i.
- The hash function h_i (with an associated N_i found with equation 9) that yields the lowest normalized transmission cost according to equation 11 is selected as the hash function for the data exchange and is denoted by h.
- The set h(R) is computed with hash function h.
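The selection step can be illustrated with a small sketch. Equations 9 and 11 are not reproduced in this summary, so the cost model below is a hypothetical stand-in: it charges for the hash values sent plus the tuples expected to match them (using the |S|/|H_i| average from the text), and divides by the cost of downloading all of S. The candidate (hash size, number of hash values) pairs are invented for the example.

```python
def normalized_cost(n_hashes, hash_space, table_size, cost_ratio):
    """Hypothetical cost relative to downloading all of table S.

    cost_ratio is the cost of sending one hash value divided by the
    cost of sending one tuple.
    """
    # Each hash value matches |S| / |H_i| tuples on average.
    expected_matches = n_hashes * table_size / hash_space
    upload = n_hashes * cost_ratio  # expressed in tuple-cost units
    return (upload + expected_matches) / table_size


def select_hash(candidates, table_size, cost_ratio):
    """candidates: (bits, n_hashes) pairs. Returns (bits, cost) of the cheapest."""
    costs = [
        (bits, normalized_cost(n, 1 << bits, table_size, cost_ratio))
        for bits, n in candidates
    ]
    return min(costs, key=lambda pair: pair[1])


# Invented candidates: larger hashes need more values to meet the
# privacy constraint, but each value matches fewer tuples.
candidates = [(10, 200), (12, 600), (14, 4000)]
best_bits, best_cost = select_hash(candidates, table_size=2_500_000, cost_ratio=0.1)
```

With these invented numbers the 12-bit hash is cheapest, which shows the trade-off the text describes: neither the smallest nor the largest hash function is automatically the best choice.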

### 5. Implementation and Results

- A preliminary implementation was done in Java with MySQL [37] via MySQL’s JDBC connector [38].
- The hash value sets were stored and indexed in dw along with their respective S table.
- Three sets of data were used for three instances of table S. The first two each comprised 2.5 million synthetically generated tuples.
- The third set of data was the “alignment block in rat chain of chromosome 10” table, taken from the UCSC Genome Browser Project [40].
- There were approximately 123,598 different values for the join column in the genome data set, so the size of domain U for join column values was approximated to be 2^17.
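As a quick check of the domain-size approximation, the next power of two above 123,598 is 2^17 = 131,072. The sketch below assumes the approximation was made by rounding up to a power of two, which the summary does not state explicitly.

```python
import math

distinct_values = 123_598

# Smallest number of bits whose range covers every distinct value.
bits = math.ceil(math.log2(distinct_values))
domain_size = 1 << bits
```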

### 5.1. Execution Time Analysis

- To begin the execution time analysis, the size of table R in relation to the size of the set of possible key values U (|R|/|U|) is varied, with the relative privacy loss required not to exceed 0.01.
- As Figure 8 later shows, when |R|/|U| transitions from 0.6 to 0.7 the system experiences its largest increase in hash size |H|, which results in far fewer collisions; consequently, many more hash values are sent to dw to meet the privacy constraint.
- For a uniform distribution, the execution time is generally independent of |R|/|U|, except when there is a large transition in hash values used, because the transmission of noise and false positives dominates the cost.
- From this figure, it can also be seen that the execution times for join operations operating over the genome data distribution are lower than for the Gaussian distribution, which are usually lower than for the uniform distribution.
- Figure 5 shows how execution times vary as the target prel changes.

### 5.2. Absolute Privacy Loss Analysis

- Figure 7 shows how absolute privacy loss varies as |R| changes and the target prel is fixed at 0.01.
- For the uniform distribution, the absolute privacy loss is kept very low and close to the target prel of 0.01, since satisfying the relative privacy loss constraint for a uniform distribution is almost identical to satisfying an absolute privacy constraint of the same magnitude.
- For the Gaussian and genome data distributions, the absolute privacy loss differs greatly from the target prel because, with less uniformity in these distributions, far less effort is required to satisfy the relative privacy loss constraint than to satisfy an absolute privacy loss constraint of equal magnitude.
- For non-uniform distributions, achieving low absolute privacy loss would be much more expensive than achieving low relative privacy loss, whereas the cost of achieving either for a uniform distribution would be roughly the same.
- Figure 7 also shows that as |R|/|U| increases, absolute privacy loss decreases.

### 5.3. Hash Selection Analysis

- Figure 8 shows that the size of the selected hash function that yields the lowest transmission cost increases as |R|/|U| increases, for all distributions.
- For the uniform distribution, hash sizes ranging from 10 bits to 16 bits are required, depending on the size of |R|.
- For the Gaussian distribution, hash sizes ranging from 12 bits to 16 bits are required.
- Finally, for the genome data set, hash sizes ranging from 14 bits to 16 bits are needed.
- This experiment shows the necessary hash sizes that need to be precomputed and stored in dw for the various S table distributions.

### 5.4. Transmission Cost Analysis

- The transmission costs of the hash/noise method in relation to the brute-force method are studied.
- For the less uniform genome data, the transmission costs remain relatively constant, averaging 25% of those of the brute-force method, for all target prel values when |R|/|U| is 0.1.
- As with the other distributions, the general behavior of the observed transmission cost curve was predicted by the estimated transmission cost curves, but the actual transmission costs were poorly predicted.
- Figure 10 compares the attained normalized transmission costs of the hash/noise method with the costs of simple semi-joins (i.e., no privacy constraints enforced).
- The graph shows that |R|/|U| is directly proportional to what the cost of the semi-join would be.

### 5.5. Cost-Ratio Analysis

- Finally, the effect of the cost-ratio, or the ratio between the transmission costs of sending a hash-value and the transmission costs of sending a tuple, is examined.
- Figure 11 shows that the cost-ratio has very little effect on the overall performance of the system because the number of tuples in set F makes the cost of transmitting set F the dominating cost of the hash/noise method, regardless of the cost-ratio between sending hash values and tuples from set F.

### 7. Conclusion

- Three challenges in solving the private data integration problem were presented: (1) privacy, (2) correctness, and (3) efficiency.
- The use of relative information gain addresses the first challenge.
- By making use of predefined hash functions and noise injection to satisfy any privacy constraints that a user may pose, traditional indexing mechanisms can be used, making the total cost of a distributed join dominated mostly by transmission costs rather than by search and computational costs.
- The hash/noise technique works better for less uniform public data sets than for more uniform data sets stored at the public data warehouse.
- Furthermore, uniform data distributions require a wider range of hash functions to be predefined than less uniform data distributions.

##### References

### "Performance-oriented privacy-preser..." refers methods in this paper

...Borrowing a technique from [15], eight hash functions were created by simply truncating the result of the MD5 hash [36]....

##### Frequently Asked Questions (2)

###### Q2. What contributions have the authors mentioned in the paper "Performance-oriented privacy-preserving data integration"?

The use of hashes and noise yields better performance than existing techniques while still making it difficult for unauthorized entities to distinguish which data items truly exist in the private database. By leveraging the uncertainty introduced by hash collisions and by the injection of noise, the authors present a technique for performing a relational join operation between a massive public table and a relatively smaller private one.