Publishing Set-Valued Data via Differential Privacy
Rui Chen
Concordia University
Montreal, Canada
ru_che@encs.concordia.ca
Noman Mohammed
Concordia University
Montreal, Canada
no_moham@encs.concordia.ca
Benjamin C. M. Fung
Concordia University
Montreal, Canada
fung@ciise.concordia.ca
Bipin C. Desai
Concordia University
Montreal, Canada
bcdesai@cs.concordia.ca
Li Xiong
Emory University
Atlanta, USA
lxiong@mathcs.emory.edu
ABSTRACT
Set-valued data provides enormous opportunities for various data mining tasks. In this paper, we study the problem of publishing set-valued data for data mining tasks under the rigorous differential privacy model. All existing data publishing methods for set-valued data are based on partition-based privacy models, for example k-anonymity, which are vulnerable to privacy attacks based on background knowledge. In contrast, differential privacy provides strong privacy guarantees independent of an adversary's background knowledge and computational power. Existing data publishing approaches for differential privacy, however, are not adequate in terms of both utility and scalability in the context of set-valued data due to its high dimensionality.

We demonstrate that set-valued data can be efficiently released under differential privacy with guaranteed utility with the help of context-free taxonomy trees. We propose a probabilistic top-down partitioning algorithm to generate a differentially private release, which scales linearly with the input data size. We also discuss the applicability of our idea to the context of relational data. We prove that our result is (ǫ, δ)-useful for the class of counting queries, the foundation of many data mining tasks. We show that our approach maintains high utility for counting queries and frequent itemset mining and scales to large datasets through extensive experiments on real-life set-valued datasets.
1. INTRODUCTION
Set-valued data, such as transaction data, web search queries, and click streams, refers to data in which each record owner is associated with a set of items drawn from a universe of items [19, 28, 29]. Sharing set-valued data provides enormous opportunities for various data mining tasks in different application domains such as marketing, advertising, and infrastructure management. However, such data often contains sensitive information that could violate
individual privacy. Such privacy concerns are exacerbated further in emerging computing paradigms, for example cloud computing. Therefore, set-valued data needs to be sanitized before it can be released to the public. In this paper, we consider the problem of publishing set-valued data in a way that simultaneously protects individual privacy under the framework of differential privacy [8] and provides guaranteed utility to data miners.
There has been some existing research [5, 16, 19, 28, 29, 34, 35] on publishing set-valued data based on partition-based privacy models [15], for example k-anonymity [27] (or its relaxation, k^m-anonymity [28, 29]) and/or confidence bounding [5, 30]. However, due to both their vulnerability to adversaries' background knowledge and their deterministic nature, many types of privacy attacks [20, 25, 31] have been identified against approaches derived from these models, leading to privacy compromises. In contrast, differential privacy [8], a relatively new privacy model stemming from the field of statistical disclosure control, provides strong privacy guarantees independent of an adversary's background knowledge, computational power, or subsequent behavior.

Differential privacy, in general, requires that the outcome of any analysis should not overly depend on a single data record. It follows that even if a user had opted in to the database, there would not be a significant change in any computation based on the database. Therefore, differential privacy assures every record owner that any privacy breach will not be a result of participating in the database.
There are two natural settings of data sanitization under differential privacy: interactive and non-interactive. In the interactive setting, a sanitization mechanism sits between the users and the database. Queries posed by the users and/or their responses must be evaluated and may be modified by the mechanism in order to protect privacy. In the non-interactive setting, a data publisher computes and releases a sanitized version of a database, possibly a synthetic database, to the public for future analysis. There have been some lower bound results [6, 8, 9] for differential privacy, indicating that only a limited number of queries can be answered; otherwise, an adversary would be able to precisely reconstruct almost the entire original database, resulting in a serious compromise of privacy. Consequently, most recent works have concentrated on designing various interactive mechanisms that answer, in total, only a number of queries sublinear in the size n of the underlying database, regardless of the number of users. Once this limit is reached, either the database has to be shut down, or any further query must be rejected. This limitation has greatly hindered their applicability, especially in scenarios where a database is made available to many users who legitimately need to pose a large number of queries. Naturally, one would favor a non-interactive release that could be used to answer an arbitrarily large number of queries or support various data analysis tasks. Blum et al. [4] point out that the aforementioned lower bounds can be circumvented in the non-interactive setting at the cost of preserving usefulness for only restricted classes of queries. However, they did not provide an efficient algorithm.
Dwork et al. [10] further propose a more efficient non-interactive sanitization mechanism with a synthetic output. However, this progress is not sufficient to solve the problem of publishing set-valued data for data mining tasks, for two reasons. First, the approach in [10] has runtime complexity poly(|C|, |I|), where |C| is the size of a concept class and |I| is the size of the item universe. A set-valued dataset can be reconstructed by counting queries (see Section 3.3 for a formal definition). This implies a complexity of poly(2^|I| − 1, |I|), which is not desirable for real-life set-valued data, where |I| is typically over a thousand. Second, for data mining tasks the published data needs to be "semantically interpretable"; therefore, synthetic data does not fully meet the publisher's goal [35]. Similarly, the approaches of two very recent papers [32, 33], which are designed for publishing relational data by first enumerating all possible combinations of all different values of different attributes, also suffer from this scalability problem in the context of set-valued data.

We argue that a more efficient solution can be achieved by taking the underlying dataset into consideration. This also has a positive impact on the resulting utility, as there is no need to add noise to every possible combination. The main technical challenge is how to make use of a specific dataset while still satisfying differential privacy.
In this paper, we demonstrate that, in the presence of a context-free taxonomy tree, we can efficiently generate a sanitized release of set-valued data in a differentially private manner, with guaranteed utility for counting queries and many other data mining tasks. Unlike the use of taxonomy trees in the generalization mechanism for partition-based privacy models, where the taxonomy trees are highly specific to a particular application, the taxonomy tree required by our solution does not need to reflect the underlying semantics and is, therefore, context-free. This feature makes our approach flexible enough to apply to various kinds of set-valued datasets.
Contribution. We summarize our contributions as follows.

First, this is the first study of publishing set-valued data via differential privacy. The previous anonymization techniques [5, 16, 19, 28, 29, 34, 35] developed for publishing set-valued data are dedicated to partition-based privacy models. Due to their deterministic nature, they cannot be used to achieve differential privacy. In this paper, we propose a probabilistic top-down partitioning algorithm that provides provable utility under differential privacy, one of the strongest privacy models.

Second, this is the first paper to propose an efficient non-interactive approach scalable to high-dimensional set-valued data with guaranteed utility under differential privacy. We stress that our goal is to publish the data, not data mining results. Publishing data gives data miners much greater flexibility than publishing data mining results. We show that a more efficient and effective solution can be achieved by making use of the underlying dataset, instead of explicitly considering all possible outputs as in the existing works [4, 10, 32, 33]. For a set-valued dataset, this can be done by a top-down partitioning process based on a context-free taxonomy tree. The use of a context-free taxonomy tree makes our approach applicable to all kinds of set-valued datasets. We prove that the result of our approach is (ǫ, δ)-useful for counting queries, which guarantees usefulness for data mining tasks based on counts, e.g., mining frequent patterns and association rules [17]. We argue that the general idea has wider application, for example to relational data in which each attribute is associated with a taxonomy tree. This implies that some traditional data publishing methods, such as TDS [14] and Mondrian [22], could be adapted to satisfy differential privacy.
2. RELATED WORK
Set-Valued Data Publishing. Due to the high dimensionality of set-valued data, the extensive research on privacy-preserving data publishing (PPDP) for relational data does not fit set-valued data well [13]. Some recent papers have started to address the problem of sanitizing set-valued data for the purpose of data mining [5, 11, 16, 19, 28, 29, 34, 35].

Ghinita et al. [16] and Xu et al. [34, 35] divide all items into either sensitive or non-sensitive, and assume that an adversary's background knowledge is strictly confined to non-sensitive items. Ghinita et al. [16] propose a bucketization-based approach that limits the probability of inferring a sensitive item to a specified threshold, while preserving correlations among items for frequent pattern mining. Xu et al. [35] bound the background knowledge of an adversary to at most p non-sensitive items, and employ global suppression to preserve as many item instances as possible. Xu et al. [34] improve the technique in [35] by preserving frequent itemsets and presenting a border representation. Cao et al. [5] further assume that an adversary may possess background knowledge on sensitive items and propose a privacy notion, ρ-uncertainty, which bounds the confidence of inferring a sensitive item from any itemset to ρ.

Terrovitis et al. [28, 29] and He and Naughton [19] eliminate the distinction between sensitive and non-sensitive items. Similar to the idea of [34] and [35], Terrovitis et al. [28] propose to bound the background knowledge of an adversary by a maximum number m of items and propose a new privacy model, k^m-anonymity, a relaxation of k-anonymity. They achieve k^m-anonymity by a bottom-up global generalization solution. To improve utility, Terrovitis et al. [29] recently provided a local recoding method for achieving k^m-anonymity. He and Naughton [19] point out that k^m-anonymity provides weaker privacy protection than k-anonymity and propose a top-down local generalization solution under k-anonymity. We argue that even k-anonymity provides insufficient privacy protection for set-valued data. Evfimievski et al. [11] propose a series of randomization operators to limit the confidence of inferring an item's presence in a dataset, with the goal of association rule mining.
Differential Privacy. In the last few years, differential privacy has been gaining considerable attention in various applications. Most of the research on differential privacy concentrates on the interactive setting, with the goal of either reducing the magnitude of added noise [18, 26] or releasing certain data mining results [2, 3, 12, 21]. Refer to [7] for an overview of recent works on differential privacy.

Lately, several works [4, 10, 32, 33] have started to address the use of differential privacy in the non-interactive setting as a substitute for partition-based privacy models. Blum et al. [4] demonstrate that it is possible to circumvent the lower bound results and release synthetic private databases that are useful for all queries over a discretized domain from a concept class with polynomial Vapnik-Chervonenkis dimension. However, their mechanism is not efficient, with runtime complexity superpoly(|C|, |I|), where |C| is the size of a concept class and |I| the size of the item universe. This makes their mechanism infeasible for practical applications. To improve efficiency, Dwork et al. [10] propose a recursive algorithm for generating a synthetic database with runtime complexity poly(|C|, |I|). As mentioned earlier, this improvement is still insufficient for handling real-life set-valued datasets. In this paper, we propose an algorithm that scales to large real-life set-valued datasets.

Xiao et al. [33] propose a two-step algorithm for relational data. It first issues queries for every possible combination of attribute values to the PINQ interface [23], and then produces a generalized output using the perturbed dataset returned by PINQ. This approach is computationally expensive in the context of set-valued data due to the high dimensionality, since it requires issuing a total of 2^|I| − 1 queries. All these works [4, 10, 33] are based on the query model. In contrast, Xiao et al. [32] assume that their algorithm has direct and unconditional access to the underlying relational data. They propose a wavelet-transformation-based approach that lowers the magnitude of noise compared with adding independent Laplace noise. Similarly, their algorithm needs to process all possible entries in the entire output domain, which causes a scalability problem for set-valued data.
3. PRELIMINARIES
Let I = {I1, I2, ..., I|I|} be the universe of items, where |I| is the size of the universe. The multiset D = {t1, t2, ..., t|D|} denotes a set-valued dataset, where each record ti ∈ D is a non-empty subset of I. Table 1 presents an example of a set-valued dataset with the item universe I = {I1, I2, I3, I4}. An overview of notational conventions is provided in Appendix A.

Table 1: A sample set-valued dataset.

TID | Items
----+-----------------
t1  | {I1, I2, I3, I4}
t2  | {I2, I4}
t3  | {I2}
t4  | {I1, I2}
t5  | {I2}
t6  | {I1}
t7  | {I1, I2, I3, I4}
t8  | {I2, I3, I4}
3.1 Context-Free Taxonomy Tree
A set-valued dataset can be associated with a single taxonomy tree. In the classic generalization mechanism, the required taxonomy tree is highly specific to a particular application. This constraint has been considered a major limitation of applying generalization [1]. The reason for requiring an application-specific taxonomy tree is that the release contains generalized items that need to be semantically consistent with the original items. In our approach, we publish only original items; therefore, the taxonomy tree can be context-free.

Figure 1: A context-free taxonomy tree of the sample data.
Definition 3.1 (Context-Free Taxonomy Tree). A context-free taxonomy tree is a taxonomy tree whose internal nodes are sets of their leaves, not necessarily the semantic generalizations of the leaves.

For example, Figure 1 presents a context-free taxonomy tree for Table 1, with one of its internal nodes being I{1,2,3,4} = {I1, I2, I3, I4}. We say that an item can be generalized to a taxonomy tree node if it is in the node's set. For example, I1 can be generalized to I{1,2} because I1 ∈ {I1, I2}.
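To make Definition 3.1 concrete, here is a minimal sketch in which a taxonomy tree node is represented by nothing more than the set of leaf items it stands for; the class and names are our own illustration, not code from the paper.

```python
# A context-free taxonomy tree node is just a set of leaf items; no semantic
# relationship among the leaves is assumed.
from dataclasses import dataclass, field

@dataclass
class TaxNode:
    items: frozenset                       # the set of leaves this node represents
    children: list = field(default_factory=list)

    def generalizes(self, item):
        # An item can be generalized to this node iff it is in the node's set.
        return item in self.items

# Mirroring the example above: I1 generalizes to I{1,2} because I1 is in {I1, I2}.
i_12 = TaxNode(frozenset({"I1", "I2"}))
assert i_12.generalizes("I1")
```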
3.2 Differential Privacy
Differential privacy requires that the removal or addition of a single database record does not significantly affect the outcome of any analysis. It assures a data record owner that any privacy breach will not be a result of participating in the database, since anything learnable from the database with his record is also learnable from the one without his record. Formally, differential privacy in the non-interactive setting [4] is defined as follows; the parameter α specifies the degree of privacy offered.
Definition 3.2 (α-differential privacy). A privacy mechanism A gives α-differential privacy if for any datasets D1 and D2 differing on at most one record, and for any possible sanitized dataset D̃ ∈ Range(A),

    Pr[A(D1) = D̃] ≤ e^α × Pr[A(D2) = D̃]    (1)

where the probability is taken over the randomness of A.
Two principal techniques for achieving differential privacy have appeared in the literature, one for real-valued outputs [8] and the other for outputs of arbitrary types [24]. A fundamental concept of both techniques is the global sensitivity of a function [8] mapping underlying datasets to (vectors of) reals.
Definition 3.3 (Global Sensitivity). For any function f : D → R^d, the sensitivity of f is

    Δf = max ||f(D1) − f(D2)||_1    (2)

for all D1, D2 differing in at most one record.
Roughly speaking, functions with lower sensitivity are more tolerant of changes to a dataset and, therefore, allow more accurate differentially private mechanisms.
Laplace Mechanism. For analyses with real-valued outputs, a standard mechanism to achieve differential privacy is to add Laplace noise to the true output of a function. Dwork et al. [8] propose the Laplace mechanism, which takes as inputs a dataset D, a function f, and the privacy parameter α. The magnitude of the added noise conforms to a Laplace distribution with probability density function p(x|λ) = (1/2λ)e^(−|x|/λ), where λ is determined by both the global sensitivity of f and the desired privacy level α.

Theorem 3.1. [8] For any function f : D → R^d over an arbitrary domain D, the mechanism A

    A(D) = f(D) + Laplace(Δf/α)    (3)

gives α-differential privacy.

For example, for a single counting query Q over a dataset D, returning Q(D) + Laplace(1/α) maintains α-differential privacy because a counting query has sensitivity 1.
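As a concrete illustration of Theorem 3.1, the sketch below perturbs a real-valued query answer with Laplace noise of scale Δf/α; the helper name is ours, and numpy's Laplace sampler stands in for the abstract Laplace(·) notation.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, alpha):
    # A(D) = f(D) + Laplace(sensitivity/alpha); for a counting query the
    # sensitivity is 1, so the noise scale reduces to 1/alpha.
    return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / alpha)

# e.g., answering a counting query with true answer 5 under alpha = 0.5
noisy_answer = laplace_mechanism(5, sensitivity=1, alpha=0.5)
```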
Exponential Mechanism. For analyses whose outputs are not real-valued or would make no sense after adding noise, McSherry and Talwar [24] propose the exponential mechanism, which selects an output r ∈ R from the output domain by taking into consideration its score under a given utility function q in a differentially private manner. The exponential mechanism assigns exponentially higher probabilities of being selected to outputs with higher scores, so that the final output is close to the optimum with respect to q. The chosen utility function q should be insensitive to changes in any particular record, that is, it should have a low sensitivity. Let the sensitivity of q be Δq = max |q(D1, r) − q(D2, r)| over all outputs r and all D1, D2 differing in at most one record.

Theorem 3.2. [24] Given a utility function q : (D × R) → R for a dataset D, the mechanism A,

    A(D, q) = { return r with probability ∝ exp(αq(D, r) / (2Δq)) }    (4)

gives α-differential privacy.
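A minimal sketch of the sampling step in Theorem 3.2, again with illustrative names of our choosing: each candidate output is selected with probability proportional to exp(αq(D, r)/(2Δq)), given precomputed scores.

```python
import numpy as np

def exponential_mechanism(candidates, scores, alpha, sensitivity):
    # Pr[r] is proportional to exp(alpha * q(D, r) / (2 * sensitivity));
    # shifting all scores by their maximum is only for numerical stability
    # and does not change the resulting distribution.
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(alpha * (scores - scores.max()) / (2.0 * sensitivity))
    probs = weights / weights.sum()
    return candidates[np.random.choice(len(candidates), p=probs)]
```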
For a sequence of differentially private computations, the overall privacy guarantee is provided by the composition properties of differential privacy, namely sequential composition and parallel composition, which are summarized in Appendix B.
3.3 Utility Metrics
Due to the lower bound results [6, 8, 9], we can only guarantee the utility of restricted classes of queries [4] in the non-interactive setting. In this paper, we aim to develop a solution for publishing set-valued data that is useful for counting queries.

Definition 3.4 (Counting Query). For a given itemset I′ ⊆ I, a counting query Q over a dataset D is defined to be Q(D) = |{t ∈ D : I′ ⊆ t}|.
We choose counting queries because they are crucial to several key data mining tasks over set-valued data, for example, mining frequent patterns and association rules [17]. In this paper, we employ (ǫ, δ)-usefulness [4] to theoretically measure the utility of sanitized data for counting queries.

Definition 3.5 ((ǫ, δ)-usefulness). A privacy mechanism A is (ǫ, δ)-useful for queries in class C if, with probability 1 − δ, for every Q ∈ C and every dataset D, for D̃ = A(D), |Q(D̃) − Q(D)| ≤ ǫ.

(ǫ, δ)-usefulness is effective for giving an overall estimate of utility, but fails to provide intuitive experimental results. Therefore, in Section 5.1, we experimentally measure the utility of sanitized data for counting queries by relative error (see Section 5.1 for more details).
4. SANITIZATION ALGORITHM
We present a Differentially private sanitization algorithm that recursively Partitions a given set-valued dataset based on a context-free taxonomy tree (DiffPart).
4.1 Partitioning Algorithm
Intuitively, a differentially private release of a set-valued dataset could be generated by adding Laplace noise to a set of counting queries. A simple yet infeasible approach can be derived from Dwork et al.'s method [8]: first generate all distinct itemsets from the item universe; then, for each itemset, issue a counting query and add Laplace noise to the answer. This approach suffers from two main drawbacks in the context of set-valued data. First, it requires a total of Σ_{k=1}^{|I|} (|I| choose k) = 2^|I| − 1 queries, where k is the number of items in a query, giving rise to a scalability problem. Second, the noise added to the itemsets that never appear in the original dataset accumulates exponentially, rendering the release useless for data analysis tasks. In fact, these are also the main limitations of other non-interactive approaches [4, 10, 32, 33] when applied to set-valued data. We argue that an efficient solution can be achieved by taking the underlying dataset into consideration. However, care must be taken, because identifying the set of counting queries based on the input dataset may leak its sensitive information and, therefore, violate differential privacy.
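For intuition, the infeasible baseline described above might look as follows; the code is only meant to expose the blow-up, and per-query budget accounting is deliberately elided. The loop body executes 2^|I| − 1 times, which is already astronomical for |I| around a thousand.

```python
import itertools
import numpy as np

def naive_release(universe, dataset, alpha):
    # Infeasible baseline: one noisy counting query per non-empty itemset.
    # The number of iterations is 2^|I| - 1 (e.g., 2^1000 - 1 for |I| = 1000),
    # and noise is injected even for itemsets that never occur in the data.
    release = {}
    for k in range(1, len(universe) + 1):
        for itemset in itertools.combinations(universe, k):
            true_count = sum(1 for t in dataset if set(itemset) <= t)
            release[itemset] = true_count + np.random.laplace(0.0, 1.0 / alpha)
    return release
```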
We first provide an overview of DiffPart. It starts by creating the context-free taxonomy tree. It then generalizes all records to a single partition with a common representation. We call the common representation the hierarchy cut, consisting of a set of taxonomy tree nodes. It recursively distributes the records into disjoint sub-partitions with more specific representations in a top-down manner based on the taxonomy tree. For each sub-partition, we determine whether it is empty in a noisy way and further split the sub-partitions considered "non-empty". Our approach stops when no further partitioning is possible in any sub-partition. We call a partition a leaf partition if every node in its hierarchy cut is a leaf of the taxonomy tree. Finally, for each leaf partition, the algorithm asks for its noisy size (the noisy number of records in the partition) to construct the release. Our use of a top-down partitioning process is inspired by its use in [19], but with substantial differences: their approach generates a generalized release satisfying k-anonymity, while ours identifies the set of counting queries used to publish differentially private data.
Algorithm 1 presents our approach in more detail. It takes as inputs the raw set-valued dataset D, the fan-out f used to construct the taxonomy tree, and the total privacy budget B specified by the data publisher, and returns a sanitized dataset D̃ satisfying B-differential privacy.
Top-Down Partitioning. The algorithm first constructs the context-free taxonomy tree H by iteratively grouping f nodes from one level to an upper level until a single root is created. If the size of the item universe is not divisible by f, smaller groups can be created.
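One plausible reading of this construction, reusing the TaxNode sketch from Section 3.1 (our illustration, not the authors' code): group every f nodes of a level under a fresh parent until a single root remains.

```python
def build_taxonomy(items, f):
    # Bottom-up grouping: every f nodes share a new parent; the last group on
    # a level may be smaller when the level size is not divisible by f.
    level = [TaxNode(frozenset({i})) for i in items]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), f):
            group = level[i:i + f]
            merged = frozenset().union(*(c.items for c in group))
            parents.append(TaxNode(merged, children=group))
        level = parents
    return level[0]

root = build_taxonomy(["I1", "I2", "I3", "I4"], f=2)  # same shape as Figure 1
```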
The initial partition p is created by generalizing all records in D under a hierarchy cut of a single taxonomy tree node, namely the root of H. A record can be generalized to a hierarchy cut if every item in the record can be generalized to a node in the cut and every node in the cut generalizes some items in the record. For example, the record {I3, I4} can be generalized to the hierarchy cuts {I{3,4}} and {I{1,2,3,4}}, but not to {I{1,2}, I{3,4}}. The initial partition p is added to an initially empty queue Q.
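The record-to-cut test used in this example can be sketched as below, again reusing the earlier TaxNode class and our own helper names: every item of the record must fall under some node of the cut, and every node of the cut must cover some item of the record.

```python
def generalizes_to_cut(record, cut):
    # record: a set of items; cut: a list of TaxNode objects.
    every_item_covered = all(any(item in node.items for node in cut) for item in record)
    every_node_used = all(any(item in node.items for item in record) for node in cut)
    return every_item_covered and every_node_used

# {I3, I4} generalizes to {I{3,4}} but not to {I{1,2}, I{3,4}}:
i_34 = TaxNode(frozenset({"I3", "I4"}))
assert generalizes_to_cut({"I3", "I4"}, [i_34])
assert not generalizes_to_cut({"I3", "I4"}, [i_12, i_34])
```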

Algorithm 1 DiffPart
Input: Raw set-valued dataset D; fan-out f; privacy budget B
Output: Sanitized dataset D̃
 1: D̃ ← ∅;
 2: Construct a taxonomy tree H with fan-out f;
 3: Partition p ← all records in D;
 4: p.cut ← the root of H;
 5: p.B̃ = B/2; p.α = p.B̃ / |InternalNodes(p.cut)|;
 6: Add p to an initially empty queue Q;
 7: while Q ≠ ∅ do
 8:   Dequeue p′ from Q;
 9:   Sub-partitions P ← SubPart_Gen(p′, H);
10:   for each sub-partition pi ∈ P do
11:     if pi is a leaf partition then
12:       N_pi = NoisyCount(|pi|, B/2 + pi.B̃);
13:       if N_pi ≥ 2C1/(B/2 + pi.B̃) then
14:         Add N_pi copies of pi.cut to D̃;
15:     else
16:       Add pi to Q;
17: return D̃;
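The NoisyCount primitive used in lines 12-13 returns a partition's true size perturbed with Laplace noise calibrated to the supplied budget; adding or removing one record changes a partition size by at most 1, so the sensitivity is 1. A minimal sketch under that assumption, with C1 treated as an opaque threshold constant from the paper:

```python
import numpy as np

def noisy_count(true_size, budget):
    # Laplace mechanism with sensitivity 1: noise scale is 1/budget.
    return true_size + np.random.laplace(0.0, 1.0 / budget)

def keep_leaf(noisy_size, budget, C1):
    # Algorithm 1, line 13: retain a leaf partition only if its noisy size
    # clears the threshold 2*C1/budget, where budget = B/2 + the partition's
    # unused budget.
    return noisy_size >= 2 * C1 / budget
```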
For each partition in the queue, we need to generate its sub-partitions and identify the non-empty ones for further partitioning. Due to the noise required by differential privacy, a sub-partition cannot be deterministically identified as non-empty; probabilistic operations are needed for this purpose. For each operation, a certain portion of the privacy budget is required to obtain the noisy size of a sub-partition, based on which we decide whether it is "empty". Algorithm 1 keeps partitioning "non-empty" sub-partitions until leaf partitions are reached.
Example 4.1. Given the dataset in Table 1 and a fan-out value of 2, a possible taxonomy tree is presented in Figure 1, and a possible partitioning process is illustrated in Figure 2. Partitions {I{3,4}}, {I{1,2}, I3}, and {I{1,2}, I4} are considered "empty" and, therefore, are not further partitioned.
Privacy Budget Allocation. The total privacy budget B needs to be carefully allocated to each probabilistic operation to avoid unexpected termination of the algorithm. Since the operations are used to determine the noisy sizes of the sub-partitions resulting from partition operations, a naive allocation scheme is to bound the maximum number of partition operations needed in the entire algorithm and assign an equal portion of the budget to each of them. This approach, however, does not perform well. Instead, we propose a more sophisticated adaptive scheme. We reserve B/2 to obtain the noisy sizes of leaf partitions, which are used to construct the release, and use the remaining B/2 to guide the partitioning process. For each partition, we independently calculate the maximum number of partition operations further needed and assign privacy budget to partition operations based on this number.

The portion of privacy budget assigned to a partition operation is further allocated to the resulting sub-partitions to check their noisy sizes (to see if they are "empty"). Since all sub-partitions from the same partition operation contain disjoint records, by the parallel composition property [23], this portion of privacy budget can be used in full on each sub-partition. This scheme guarantees that more specific partitions always obtain more privacy budget (see Appendix F.2 for a formal proof), complying with the rationale that more general partitions contain more records and are, therefore, more resistant to a smaller privacy budget.
Procedure 1 SubPart_Gen
Input: Partition p; taxonomy tree H
Output: Noisy non-empty sub-partitions V of p
 1: Initialize a vector V;
 2: Select a node u from p.cut to partition;
 3: Generate all non-empty sub-partitions S;
 4: Allocate records in p to S;
 5: for each sub-partition si ∈ S do
 6:   N_si = NoisyCount(|si|, p.α);
 7:   if N_si ≥ 2C2 × height(p.cut)/p.α then
 8:     si.B̃ = p.B̃ − p.α;
 9:     si.α = si.B̃ / |InternalNodes(si.cut)|;
10:     Add si to V;
11: j = 1; l = number of u's children;
12: while j ≤ 2^l − |S| do
13:   N_j = NoisyCount(0, p.α);
14:   if N_j ≥ 2C2 × height(p.cut)/p.α then
15:     Randomly generate an empty sub-partition sj;
16:     sj.B̃ = p.B̃ − p.α;
17:     sj.α = sj.B̃ / |InternalNodes(sj.cut)|;
18:     Add sj to V;
19:   j = j + 1;
20: return V;
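Steps 3-4 of Procedure 1 distribute the records of a partition among candidate sub-partitions according to which children of the expanded node u their items fall under. One way to sketch that grouping, with simplified bookkeeping and names of our own choosing:

```python
from collections import defaultdict

def distribute_records(records, u):
    # Group records by the subset of u's children their items hit; each
    # distinct subset keys one candidate sub-partition (Procedure 1, steps 3-4).
    # Every record of the partition hits at least one child, because its
    # hierarchy cut contains u.
    groups = defaultdict(list)
    for t in records:
        hit = frozenset(i for i, c in enumerate(u.children) if t & c.items)
        groups[hit].append(t)
    return groups  # the non-empty sub-partitions S
```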
Theorem 4.1. Given a non-leaf partition p with a hierarchy cut cut and an associated taxonomy tree H, the maximum number of partition operations needed to reach leaf partitions is

    |InternalNodes(cut)| = Σ_{ui ∈ cut} |InternalNodes(ui, H)|,

where |InternalNodes(ui, H)| is the number of internal nodes of the subtree of H rooted at ui.

Proof. See Appendix F.1.
Each partition tracks its unused privacy budget B̃ and calculates the portion of privacy budget α for the next partition operation. Any privacy budget left over from the partitioning process is added to the leaf partitions.
Example 4.2. For the partitioning process illustrated in Figure 2, partitions {I1, I2}, {I{1,2}, I{3,4}}, {I{1,2}, I3, I4}, and {I1, I2, I3, I4} receive privacy budgets 5B/6, B/6, B/6, and 2B/3, respectively.
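Theorem 4.1 and the adaptive scheme reduce to counting internal nodes below a partition's hierarchy cut. A small sketch of that bookkeeping (helper names are ours, reusing the TaxNode class from Section 3.1):

```python
def internal_nodes(node):
    # Number of internal nodes of the subtree rooted at `node`.
    if not node.children:
        return 0
    return 1 + sum(internal_nodes(c) for c in node.children)

def next_alpha(cut, remaining_budget):
    # alpha = unused budget / |InternalNodes(cut)|, summed over the cut.
    total = sum(internal_nodes(u) for u in cut)
    return remaining_budget / total if total else 0.0

# For the root cut of Figure 1's tree (3 internal nodes) with unused budget
# B/2, the first partition operation gets alpha = (B/2)/3 = B/6, matching
# Example 4.2.
```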
Sub-Partition Generation. "Non-empty" sub-partitions can be identified by either the exponential mechanism or the Laplace mechanism. With the exponential mechanism, we can obtain the noisy number N of non-empty sub-partitions, and then use the exponential mechanism to extract N sub-partitions, using the number of records in a sub-partition as the score function. This approach, however, does not take advantage of the fact that all sub-partitions contain disjoint datasets, resulting in a relatively small privacy budget for each operation and thus less accurate results. For this reason, we employ the Laplace mechanism for generating sub-partitions, whose details are presented in Procedure 1.

For a non-leaf partition, we generate a candidate set of taxonomy tree nodes from its hierarchy cut, containing all non-leaf nodes of the largest height in H, and then randomly select a node u from the candidate set to expand, generating a total of 2^l sub-partitions, where l ≤ f is the number of u's children in H.

References (partial list)

J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann.
L. Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC), 2006.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. Journal of Privacy and Confidentiality.