Publishing Set-Valued Data via Differential Privacy
Rui Chen
Concordia University
Montreal, Canada
ru_che@encs.concordia.ca
Noman Mohammed
Concordia University
Montreal, Canada
no_moham@encs.concordia.ca
Benjamin C. M. Fung
Concordia University
Montreal, Canada
fung@ciise.concordia.ca
Bipin C. Desai
Concordia University
Montreal, Canada
bcdesai@cs.concordia.ca
Li Xiong
Emory University
Atlanta, USA
lxiong@mathcs.emory.edu
ABSTRACT
Set-valued data provides enormous opportunities for various data mining tasks. In this paper, we study the problem of publishing set-valued data for data mining tasks under the rigorous differential privacy model. All existing data publishing methods for set-valued data are based on partition-based privacy models, for example k-anonymity, which are vulnerable to privacy attacks based on background knowledge. In contrast, differential privacy provides strong privacy guarantees independent of an adversary's background knowledge and computational power. Existing data publishing approaches for differential privacy, however, are not adequate in terms of both utility and scalability in the context of set-valued data due to its high dimensionality.

We demonstrate that set-valued data can be efficiently released under differential privacy with guaranteed utility with the help of context-free taxonomy trees. We propose a probabilistic top-down partitioning algorithm to generate a differentially private release, which scales linearly with the input data size. We also discuss the applicability of our idea to the context of relational data. We prove that our result is (ǫ, δ)-useful for the class of counting queries, the foundation of many data mining tasks. We show that our approach maintains high utility for counting queries and frequent itemset mining and scales to large datasets through extensive experiments on real-life set-valued datasets.
1. INTRODUCTION
Set-valued data, such as transaction data, web search queries, and click streams, refers to data in which each record owner is associated with a set of items drawn from a universe of items [19, 28, 29]. Sharing set-valued data provides enormous opportunities for various data mining tasks in different application domains such as marketing, advertising, and infrastructure management. However, such data often contains sensitive information that could violate
individual privacy. Such privacy concerns are exacerbated further in emerging computing paradigms, for example cloud computing. Therefore, set-valued data needs to be sanitized before it can be released to the public. In this paper, we consider the problem of publishing set-valued data in a way that simultaneously protects individual privacy under the framework of differential privacy [8] and provides guaranteed utility to data miners.
There has been some existing research [5, 16, 19, 28, 29, 34, 35] on publishing set-valued data based on partition-based privacy models [15], for example k-anonymity [27] (or its relaxation, k^m-anonymity [28, 29]) and/or confidence bounding [5, 30]. However, due to both their vulnerability to adversaries' background knowledge and their deterministic nature, many types of privacy attacks [20, 25, 31] have been identified against approaches derived from these models, leading to privacy compromises. In contrast, differential privacy [8], a relatively new privacy model stemming from the field of statistical disclosure control, provides strong privacy guarantees independent of an adversary's background knowledge, computational power, or subsequent behavior.

Differential privacy, in general, requires that the outcome of any analysis should not overly depend on a single data record. It follows that even if a user had opted in to the database, there would not be a significant change in any computation based on the database. Therefore, differential privacy assures every record owner that any privacy breach will not be a result of participating in the database.
There are two natural settings of data sanitization under differential privacy: interactive and non-interactive. In the interactive setting, a sanitization mechanism sits between the users and the database. Queries posed by the users and/or their responses must be evaluated and may be modified by the mechanism in order to protect privacy. In the non-interactive setting, a data publisher computes and releases a sanitized version of a database, possibly a synthetic database, to the public for future analysis. There have been some lower bound results [6, 8, 9] for differential privacy, indicating that only a limited number of queries can be answered; otherwise, an adversary would be able to precisely reconstruct almost the entire original database, resulting in a serious compromise of privacy. Consequently, most recent works have concentrated on designing various interactive mechanisms that answer, in total, only a number of queries sublinear in the size n of the underlying database, regardless of the number of users. Once this limit is reached, either the database has to be shut down, or any further query must be rejected. This limitation has greatly hindered their applicability, especially in scenarios where a database is made available to many users who legitimately need to pose a large number of queries. Naturally, one would favor a non-interactive release that could be used to answer an arbitrarily large number of queries or support various data analysis tasks. Blum et al. [4] point out that the aforementioned lower bounds can be circumvented in the non-interactive setting at the cost of preserving usefulness for only restricted classes of queries. However, they did not provide an efficient algorithm.
Dwork et al. [10] further propose a more efficient non-interactive sanitization mechanism with a synthetic output. However, this progress is not sufficient to solve the problem of publishing set-valued data for data mining tasks, for two reasons. First, the approach in [10] has runtime complexity poly(|C|, |I|), where |C| is the size of a concept class and |I| is the size of the item universe. A set-valued dataset can be reconstructed by counting queries (see Section 3.3 for a formal definition). This implies a complexity of poly(2^|I| − 1, |I|), which is not desirable for real-life set-valued data, where |I| is typically over a thousand. Second, for data mining tasks the published data needs to be "semantically interpretable"; therefore, synthetic data does not fully meet the publisher's goal [35]. Similarly, the approaches of two very recent papers [32, 33], which are designed for publishing relational data by first enumerating all possible combinations of all different values of different attributes, also suffer from this scalability problem in the context of set-valued data.

We argue that a more efficient solution can be achieved by taking the underlying dataset into consideration. This also has a positive impact on the resulting utility, as there is no need to add noise to every possible combination. The main technical challenge is how to make use of a specific dataset while still satisfying differential privacy.
In this paper, we demonstrate that, in the presence of a context-free taxonomy tree, we can efficiently generate a sanitized release of set-valued data in a differentially private manner, with guaranteed utility for counting queries and many other data mining tasks. Unlike the use of taxonomy trees in the generalization mechanism for partition-based privacy models, where the taxonomy trees are highly specific to a particular application, the taxonomy tree required by our solution does not need to reflect the underlying semantics and is, therefore, context-free. This feature makes our approach flexible enough to apply to various kinds of set-valued datasets.
Contribution. We summarize our contributions as follows.

First, this is the first study of publishing set-valued data via differential privacy. The previous anonymization techniques [5, 16, 19, 28, 29, 34, 35] developed for publishing set-valued data are dedicated to partition-based privacy models. Due to their deterministic nature, they cannot be used to achieve differential privacy. In this paper, we propose a probabilistic top-down partitioning algorithm that provides provable utility under differential privacy, one of the strongest privacy models.

Second, this is the first paper to propose an efficient non-interactive approach scalable to high-dimensional set-valued data with guaranteed utility under differential privacy. We stress that our goal is to publish the data, not data mining results. Publishing data gives data miners much greater flexibility than publishing data mining results. We show that a more efficient and effective solution can be achieved by making use of the underlying dataset, instead of explicitly considering all possible outputs as in the existing works [4, 10, 32, 33]. For a set-valued dataset, this can be done by a top-down partitioning process based on a context-free taxonomy tree. The use of a context-free taxonomy tree makes our approach applicable to all kinds of set-valued datasets. We prove that the result of our approach is (ǫ, δ)-useful for counting queries, which guarantees usefulness for data mining tasks based on counts, e.g., mining frequent patterns and association rules [17]. We argue that the general idea has wider application, for example to relational data in which each attribute is associated with a taxonomy tree. This implies that some traditional data publishing methods, such as TDS [14] and Mondrian [22], could be adapted to satisfy differential privacy.
2. RELATED WORK
Set-Valued Data Publishing. Due to the high dimensionality of set-valued data, the extensive research on privacy-preserving data publishing (PPDP) for relational data does not fit set-valued data well [13]. Some recent papers have started to address the problem of sanitizing set-valued data for the purpose of data mining [5, 11, 16, 19, 28, 29, 34, 35].

Ghinita et al. [16] and Xu et al. [34, 35] divide all items into either sensitive or non-sensitive, and assume that an adversary's background knowledge is strictly confined to non-sensitive items. Ghinita et al. [16] propose a bucketization-based approach that limits the probability of inferring a sensitive item to a specified threshold, while preserving correlations among items for frequent pattern mining. Xu et al. [35] bound the background knowledge of an adversary to at most p non-sensitive items, and employ global suppression to preserve as many item instances as possible. Xu et al. [34] improve the technique in [35] by preserving frequent itemsets and presenting a border representation. Cao et al. [5] further assume that an adversary may possess background knowledge on sensitive items and propose a privacy notion, ρ-uncertainty, which bounds the confidence of inferring a sensitive item from any itemset to ρ.

Terrovitis et al. [28, 29] and He and Naughton [19] eliminate the distinction between sensitive and non-sensitive items. Similar to the idea of [34] and [35], Terrovitis et al. [28] propose to bound the background knowledge of an adversary by a maximum number m of items and propose a new privacy model, k^m-anonymity, a relaxation of k-anonymity. They achieve k^m-anonymity by a bottom-up global generalization solution. To improve utility, Terrovitis et al. [29] recently provided a local recoding method for achieving k^m-anonymity. He and Naughton [19] point out that k^m-anonymity provides weaker privacy protection than k-anonymity and propose a top-down local generalization solution under k-anonymity. We argue that even k-anonymity provides insufficient privacy protection for set-valued data. Evfimievski et al. [11] propose a series of randomization operators to limit the confidence of inferring an item's presence in a dataset, with the goal of association rule mining.
Differential Privacy. In the last few years, differential privacy has been gaining considerable attention in various applications. Most of the research on differential privacy concentrates on the interactive setting, with the goal of either reducing the magnitude of added noise [18, 26] or releasing certain data mining results [2, 3, 12, 21]. Refer to [7] for an overview of recent works on differential privacy.

Lately, several works [4, 10, 32, 33] have started to address the use of differential privacy in the non-interactive setting as a substitute for partition-based privacy models. Blum et al. [4] demonstrate that it is possible to circumvent the lower bound results and release synthetic private databases that are useful for all queries over a discretized domain from a concept class with polynomial Vapnik-Chervonenkis dimension. However, their mechanism is not efficient, with runtime complexity superpoly(|C|, |I|), where |C| is the size of a concept class and |I| the size of the item universe. This makes their mechanism infeasible for practical applications. To improve efficiency, Dwork et al. [10] propose a recursive algorithm for generating a synthetic database with runtime complexity poly(|C|, |I|). As mentioned earlier, this improvement is still insufficient for handling real-life set-valued datasets. In this paper, we propose an algorithm that scales to large real-life set-valued datasets.

Xiao et al. [33] propose a two-step algorithm for relational data. It first issues queries for every possible combination of attribute values to the PINQ interface [23], and then produces a generalized output using the perturbed dataset returned by PINQ. This approach is computationally expensive in the context of set-valued data due to the high dimensionality, since it requires issuing a total of 2^|I| − 1 queries. All these works [4, 10, 33] are based on the query model. In contrast, Xiao et al. [32] assume that their algorithm has direct and unconditional access to the underlying relational data. They propose a wavelet-transformation-based approach that lowers the magnitude of noise compared with adding independent Laplace noise. Similarly, their algorithm needs to process all possible entries in the entire output domain, which causes a scalability problem for set-valued data.
3. PRELIMINARIES
Let I = {I1, I2, ..., I|I|} be the universe of items, where |I| is the size of the universe. The multiset D = {t1, t2, ..., t|D|} denotes a set-valued dataset, where each record ti ∈ D is a non-empty subset of I. Table 1 presents an example of a set-valued dataset with the item universe I = {I1, I2, I3, I4}. An overview of notational conventions is provided in Appendix A.

Table 1: A sample set-valued dataset.

TID | Items
----+-----------------
t1  | {I1, I2, I3, I4}
t2  | {I2, I4}
t3  | {I2}
t4  | {I1, I2}
t5  | {I2}
t6  | {I1}
t7  | {I1, I2, I3, I4}
t8  | {I2, I3, I4}
3.1 Context-Free Taxonomy Tree
A set-valued dataset can be associated with a single taxonomy tree. In the classic generalization mechanism, the required taxonomy tree is highly specific to a particular application. This constraint has been considered a major limitation of applying generalization [1]. The reason for requiring an application-specific taxonomy tree is that the release contains generalized items that need to be semantically consistent with the original items. In our approach, we publish only original items; therefore, the taxonomy tree can be context-free.

Figure 1: A context-free taxonomy tree of the sample data.
Definition 3.1 (Context-Free Taxonomy Tree). A context-free taxonomy tree is a taxonomy tree whose internal nodes are sets of their leaves, not necessarily the semantic generalizations of the leaves.

For example, Figure 1 presents a context-free taxonomy tree for Table 1, with one of its internal nodes being I{1,2,3,4} = {I1, I2, I3, I4}. We say that an item can be generalized to a taxonomy tree node if it is in the node's set. For example, I1 can be generalized to I{1,2} because I1 ∈ {I1, I2}.
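To make Definition 3.1 concrete, here is a minimal sketch in which a taxonomy tree node is represented by nothing more than the set of leaf items it stands for; the class and names are our own illustration, not code from the paper.

```python
# A context-free taxonomy tree node is just a set of leaf items; no semantic
# relationship among the leaves is assumed.
from dataclasses import dataclass, field

@dataclass
class TaxNode:
    items: frozenset                       # the set of leaves this node represents
    children: list = field(default_factory=list)

    def generalizes(self, item):
        # An item can be generalized to this node iff it is in the node's set.
        return item in self.items

# Mirroring the example above: I1 generalizes to I{1,2} because I1 is in {I1, I2}.
i_12 = TaxNode(frozenset({"I1", "I2"}))
assert i_12.generalizes("I1")
```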
3.2 Differential Privacy
Differential privacy requires that the removal or addition of a single database record does not significantly affect the outcome of any analysis. It assures a data record owner that any privacy breach will not be a result of participating in the database, since anything learnable from the database with his record is also learnable from the one without his record. Formally, differential privacy in the non-interactive setting [4] is defined as follows; the parameter α specifies the degree of privacy offered.
Definition 3.2 (α-differential privacy). A privacy mechanism A gives α-differential privacy if for any datasets D1 and D2 differing on at most one record, and for any possible sanitized dataset D̃ ∈ Range(A),

    Pr[A(D1) = D̃] ≤ e^α × Pr[A(D2) = D̃]    (1)

where the probability is taken over the randomness of A.
Two principal techniques for achieving differential privacy have appeared in the literature, one for real-valued outputs [8] and the other for outputs of arbitrary types [24]. A fundamental concept of both techniques is the global sensitivity of a function [8] mapping underlying datasets to (vectors of) reals.
Definition 3.3 (Global Sensitivity). For any function f : D → R^d, the sensitivity of f is

    Δf = max ||f(D1) − f(D2)||_1    (2)

for all D1, D2 differing in at most one record.
Roughly speaking, functions with lower sensitivity are more tolerant of changes to a dataset and, therefore, allow more accurate differentially private mechanisms.
Laplace Mechanism. For analyses with real-valued outputs, a standard mechanism to achieve differential privacy is to add Laplace noise to the true output of a function. Dwork et al. [8] propose the Laplace mechanism, which takes as inputs a dataset D, a function f, and the privacy parameter α. The magnitude of the added noise conforms to a Laplace distribution with probability density function p(x|λ) = (1/2λ)e^(−|x|/λ), where λ is determined by both the global sensitivity of f and the desired privacy level α.

Theorem 3.1. [8] For any function f : D → R^d over an arbitrary domain D, the mechanism A

    A(D) = f(D) + Laplace(Δf/α)    (3)

gives α-differential privacy.

For example, for a single counting query Q over a dataset D, returning Q(D) + Laplace(1/α) maintains α-differential privacy because a counting query has sensitivity 1.
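As a concrete illustration of Theorem 3.1, the sketch below perturbs a real-valued query answer with Laplace noise of scale Δf/α; the helper name is ours, and numpy's Laplace sampler stands in for the abstract Laplace(·) notation.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, alpha):
    # A(D) = f(D) + Laplace(sensitivity/alpha); for a counting query the
    # sensitivity is 1, so the noise scale reduces to 1/alpha.
    return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / alpha)

# e.g., answering a counting query with true answer 5 under alpha = 0.5
noisy_answer = laplace_mechanism(5, sensitivity=1, alpha=0.5)
```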
Exponential Mechanism. For analyses whose outputs are not real-valued or would make no sense after adding noise, McSherry and Talwar [24] propose the exponential mechanism, which selects an output r ∈ R from the output domain by taking into consideration its score under a given utility function q in a differentially private manner. The exponential mechanism assigns exponentially higher probabilities of being selected to outputs with higher scores, so that the final output is close to the optimum with respect to q. The chosen utility function q should be insensitive to changes in any particular record, that is, it should have a low sensitivity. Let the sensitivity of q be Δq = max |q(D1, r) − q(D2, r)| over all outputs r and all D1, D2 differing in at most one record.

Theorem 3.2. [24] Given a utility function q : (D × R) → R for a dataset D, the mechanism A,

    A(D, q) = { return r with probability ∝ exp(αq(D, r) / (2Δq)) }    (4)

gives α-differential privacy.
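A minimal sketch of the sampling step in Theorem 3.2, again with illustrative names of our choosing: each candidate output is selected with probability proportional to exp(αq(D, r)/(2Δq)), given precomputed scores.

```python
import numpy as np

def exponential_mechanism(candidates, scores, alpha, sensitivity):
    # Pr[r] is proportional to exp(alpha * q(D, r) / (2 * sensitivity));
    # shifting all scores by their maximum is only for numerical stability
    # and does not change the resulting distribution.
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(alpha * (scores - scores.max()) / (2.0 * sensitivity))
    probs = weights / weights.sum()
    return candidates[np.random.choice(len(candidates), p=probs)]
```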
For a sequence of differentially private computations, the overall privacy guarantee is provided by the composition properties of differential privacy, namely sequential composition and parallel composition, which are summarized in Appendix B.
3.3 Utility Metrics
Due to the lower bound results [6, 8, 9], we can only guarantee the utility of restricted classes of queries [4] in the non-interactive setting. In this paper, we aim to develop a solution for publishing set-valued data that is useful for counting queries.

Definition 3.4 (Counting Query). For a given itemset I′ ⊆ I, a counting query Q over a dataset D is defined to be Q(D) = |{t ∈ D : I′ ⊆ t}|.
We choose counting queries because they are crucial to several key data mining tasks over set-valued data, for example, mining frequent patterns and association rules [17]. In this paper, we employ (ǫ, δ)-usefulness [4] to theoretically measure the utility of sanitized data for counting queries.

Definition 3.5 ((ǫ, δ)-usefulness). A privacy mechanism A is (ǫ, δ)-useful for queries in class C if, with probability 1 − δ, for every Q ∈ C and every dataset D, for D̃ = A(D), |Q(D̃) − Q(D)| ≤ ǫ.

(ǫ, δ)-usefulness is effective for giving an overall estimate of utility, but fails to provide intuitive experimental results. Therefore, in Section 5.1, we experimentally measure the utility of sanitized data for counting queries by relative error (see Section 5.1 for more details).
4. SANITIZATION ALGORITHM
We present a Differentially private sanitization algorithm that recursively Partitions a given set-valued dataset based on a context-free taxonomy tree (DiffPart).
4.1 Partitioning Algorithm
Intuitively, a differentially private release of a set-valued dataset could be generated by adding Laplace noise to a set of counting queries. A simple yet infeasible approach can be derived from Dwork et al.'s method [8]: first generate all distinct itemsets from the item universe; then, for each itemset, issue a counting query and add Laplace noise to the answer. This approach suffers from two main drawbacks in the context of set-valued data. First, it requires a total of Σ_{k=1}^{|I|} (|I| choose k) = 2^|I| − 1 queries, where k is the number of items in a query, giving rise to a scalability problem. Second, the noise added to the itemsets that never appear in the original dataset accumulates exponentially, rendering the release useless for data analysis tasks. In fact, these are also the main limitations of other non-interactive approaches [4, 10, 32, 33] when applied to set-valued data. We argue that an efficient solution can be achieved by taking the underlying dataset into consideration. However, care must be taken, because identifying the set of counting queries based on the input dataset may leak its sensitive information and, therefore, violate differential privacy.
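For intuition, the infeasible baseline described above might look as follows; the code is only meant to expose the blow-up, and per-query budget accounting is deliberately elided. The loop body executes 2^|I| − 1 times, which is already astronomical for |I| around a thousand.

```python
import itertools
import numpy as np

def naive_release(universe, dataset, alpha):
    # Infeasible baseline: one noisy counting query per non-empty itemset.
    # The number of iterations is 2^|I| - 1 (e.g., 2^1000 - 1 for |I| = 1000),
    # and noise is injected even for itemsets that never occur in the data.
    release = {}
    for k in range(1, len(universe) + 1):
        for itemset in itertools.combinations(universe, k):
            true_count = sum(1 for t in dataset if set(itemset) <= t)
            release[itemset] = true_count + np.random.laplace(0.0, 1.0 / alpha)
    return release
```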
We first provide an overview of DiffPart. It starts by creating the context-free taxonomy tree. It then generalizes all records to a single partition with a common representation. We call the common representation the hierarchy cut, consisting of a set of taxonomy tree nodes. It recursively distributes the records into disjoint sub-partitions with more specific representations in a top-down manner based on the taxonomy tree. For each sub-partition, we determine whether it is empty in a noisy way and further split the sub-partitions considered "non-empty". Our approach stops when no further partitioning is possible in any sub-partition. We call a partition a leaf partition if every node in its hierarchy cut is a leaf of the taxonomy tree. Finally, for each leaf partition, the algorithm asks for its noisy size (the noisy number of records in the partition) to construct the release. Our use of a top-down partitioning process is inspired by its use in [19], but with substantial differences: their approach generates a generalized release satisfying k-anonymity, while ours identifies the set of counting queries used to publish differentially private data.
Algorithm 1 presents our approach in more detail. It takes as inputs the raw set-valued dataset D, the fan-out f used to construct the taxonomy tree, and the total privacy budget B specified by the data publisher, and returns a sanitized dataset D̃ satisfying B-differential privacy.
Top-Down Partitioning. The algorithm first constructs the context-free taxonomy tree H by iteratively grouping f nodes from one level to an upper level until a single root is created. If the size of the item universe is not divisible by f, smaller groups can be created.
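One plausible reading of this construction, reusing the TaxNode sketch from Section 3.1 (our illustration, not the authors' code): group every f nodes of a level under a fresh parent until a single root remains.

```python
def build_taxonomy(items, f):
    # Bottom-up grouping: every f nodes share a new parent; the last group on
    # a level may be smaller when the level size is not divisible by f.
    level = [TaxNode(frozenset({i})) for i in items]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), f):
            group = level[i:i + f]
            merged = frozenset().union(*(c.items for c in group))
            parents.append(TaxNode(merged, children=group))
        level = parents
    return level[0]

root = build_taxonomy(["I1", "I2", "I3", "I4"], f=2)  # same shape as Figure 1
```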
The initial partition p is created by generalizing all records in D under a hierarchy cut of a single taxonomy tree node, namely the root of H. A record can be generalized to a hierarchy cut if every item in the record can be generalized to a node in the cut and every node in the cut generalizes some items in the record. For example, the record {I3, I4} can be generalized to the hierarchy cuts {I{3,4}} and {I{1,2,3,4}}, but not to {I{1,2}, I{3,4}}. The initial partition p is added to an initially empty queue Q.
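The record-to-cut test used in this example can be sketched as below, again reusing the earlier TaxNode class and our own helper names: every item of the record must fall under some node of the cut, and every node of the cut must cover some item of the record.

```python
def generalizes_to_cut(record, cut):
    # record: a set of items; cut: a list of TaxNode objects.
    every_item_covered = all(any(item in node.items for node in cut) for item in record)
    every_node_used = all(any(item in node.items for item in record) for node in cut)
    return every_item_covered and every_node_used

# {I3, I4} generalizes to {I{3,4}} but not to {I{1,2}, I{3,4}}:
i_34 = TaxNode(frozenset({"I3", "I4"}))
assert generalizes_to_cut({"I3", "I4"}, [i_34])
assert not generalizes_to_cut({"I3", "I4"}, [i_12, i_34])
```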

Algorithm 1 DiffPart
Input: Raw set-valued dataset D; fan-out f; privacy budget B
Output: Sanitized dataset D̃
 1: D̃ ← ∅;
 2: Construct a taxonomy tree H with fan-out f;
 3: Partition p ← all records in D;
 4: p.cut ← the root of H;
 5: p.B̃ = B/2; p.α = p.B̃ / |InternalNodes(p.cut)|;
 6: Add p to an initially empty queue Q;
 7: while Q ≠ ∅ do
 8:   Dequeue p′ from Q;
 9:   Sub-partitions P ← SubPart_Gen(p′, H);
10:   for each sub-partition pi ∈ P do
11:     if pi is a leaf partition then
12:       N_pi = NoisyCount(|pi|, B/2 + pi.B̃);
13:       if N_pi ≥ 2C1/(B/2 + pi.B̃) then
14:         Add N_pi copies of pi.cut to D̃;
15:     else
16:       Add pi to Q;
17: return D̃;
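The NoisyCount primitive used in lines 12-13 returns a partition's true size perturbed with Laplace noise calibrated to the supplied budget; adding or removing one record changes a partition size by at most 1, so the sensitivity is 1. A minimal sketch under that assumption, with C1 treated as an opaque threshold constant from the paper:

```python
import numpy as np

def noisy_count(true_size, budget):
    # Laplace mechanism with sensitivity 1: noise scale is 1/budget.
    return true_size + np.random.laplace(0.0, 1.0 / budget)

def keep_leaf(noisy_size, budget, C1):
    # Algorithm 1, line 13: retain a leaf partition only if its noisy size
    # clears the threshold 2*C1/budget, where budget = B/2 + the partition's
    # unused budget.
    return noisy_size >= 2 * C1 / budget
```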
For each partition in the queue, we need to generate its sub-partitions and identify the non-empty ones for further partitioning. Due to the noise required by differential privacy, a sub-partition cannot be deterministically identified as non-empty; probabilistic operations are needed for this purpose. For each operation, a certain portion of the privacy budget is required to obtain the noisy size of a sub-partition, based on which we decide whether it is "empty". Algorithm 1 keeps partitioning "non-empty" sub-partitions until leaf partitions are reached.
Example 4.1. Given the dataset in Table 1 and a fan-out value of 2, a possible taxonomy tree is presented in Figure 1, and a possible partitioning process is illustrated in Figure 2. Partitions {I{3,4}}, {I{1,2}, I3}, and {I{1,2}, I4} are considered "empty" and, therefore, are not further partitioned.
Privacy Budget Allocation. The total privacy budget B needs to be carefully allocated to each probabilistic operation to avoid unexpected termination of the algorithm. Since the operations are used to determine the noisy sizes of the sub-partitions resulting from partition operations, a naive allocation scheme is to bound the maximum number of partition operations needed in the entire algorithm and assign an equal portion of the budget to each of them. This approach, however, does not perform well. Instead, we propose a more sophisticated adaptive scheme. We reserve B/2 to obtain the noisy sizes of leaf partitions, which are used to construct the release, and use the remaining B/2 to guide the partitioning process. For each partition, we independently calculate the maximum number of partition operations further needed and assign privacy budget to partition operations based on this number.

The portion of privacy budget assigned to a partition operation is further allocated to the resulting sub-partitions to check their noisy sizes (to see if they are "empty"). Since all sub-partitions from the same partition operation contain disjoint records, by the parallel composition property [23], this portion of privacy budget can be used in full on each sub-partition. This scheme guarantees that more specific partitions always obtain more privacy budget (see Appendix F.2 for a formal proof), complying with the rationale that more general partitions contain more records and are, therefore, more resistant to a smaller privacy budget.
Procedure 1 SubPart_Gen
Input: Partition p; taxonomy tree H
Output: Noisy non-empty sub-partitions V of p
 1: Initialize a vector V;
 2: Select a node u from p.cut to partition;
 3: Generate all non-empty sub-partitions S;
 4: Allocate records in p to S;
 5: for each sub-partition si ∈ S do
 6:   N_si = NoisyCount(|si|, p.α);
 7:   if N_si ≥ 2C2 × height(p.cut)/p.α then
 8:     si.B̃ = p.B̃ − p.α;
 9:     si.α = si.B̃ / |InternalNodes(si.cut)|;
10:     Add si to V;
11: j = 1; l = number of u's children;
12: while j ≤ 2^l − |S| do
13:   N_j = NoisyCount(0, p.α);
14:   if N_j ≥ 2C2 × height(p.cut)/p.α then
15:     Randomly generate an empty sub-partition sj;
16:     sj.B̃ = p.B̃ − p.α;
17:     sj.α = sj.B̃ / |InternalNodes(sj.cut)|;
18:     Add sj to V;
19:   j = j + 1;
20: return V;
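Steps 3-4 of Procedure 1 distribute the records of a partition among candidate sub-partitions according to which children of the expanded node u their items fall under. One way to sketch that grouping, with simplified bookkeeping and names of our own choosing:

```python
from collections import defaultdict

def distribute_records(records, u):
    # Group records by the subset of u's children their items hit; each
    # distinct subset keys one candidate sub-partition (Procedure 1, steps 3-4).
    # Every record of the partition hits at least one child, because its
    # hierarchy cut contains u.
    groups = defaultdict(list)
    for t in records:
        hit = frozenset(i for i, c in enumerate(u.children) if t & c.items)
        groups[hit].append(t)
    return groups  # the non-empty sub-partitions S
```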
Theorem 4.1. Given a non-leaf partition p with a hierarchy cut cut and an associated taxonomy tree H, the maximum number of partition operations needed to reach leaf partitions is

    |InternalNodes(cut)| = Σ_{ui ∈ cut} |InternalNodes(ui, H)|,

where |InternalNodes(ui, H)| is the number of internal nodes of the subtree of H rooted at ui.

Proof. See Appendix F.1.
Each partition tracks its unused privacy budget B̃ and calculates the portion of privacy budget α for the next partition operation. Any privacy budget left over from the partitioning process is added to the leaf partitions.
Example 4.2. For the partitioning process illustrated in Figure 2, partitions {I1, I2}, {I{1,2}, I{3,4}}, {I{1,2}, I3, I4}, and {I1, I2, I3, I4} receive privacy budgets 5B/6, B/6, B/6, and 2B/3, respectively.
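Theorem 4.1 and the adaptive scheme reduce to counting internal nodes below a partition's hierarchy cut. A small sketch of that bookkeeping (helper names are ours, reusing the TaxNode class from Section 3.1):

```python
def internal_nodes(node):
    # Number of internal nodes of the subtree rooted at `node`.
    if not node.children:
        return 0
    return 1 + sum(internal_nodes(c) for c in node.children)

def next_alpha(cut, remaining_budget):
    # alpha = unused budget / |InternalNodes(cut)|, summed over the cut.
    total = sum(internal_nodes(u) for u in cut)
    return remaining_budget / total if total else 0.0

# For the root cut of Figure 1's tree (3 internal nodes) with unused budget
# B/2, the first partition operation gets alpha = (B/2)/3 = B/6, matching
# Example 4.2.
```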
Sub-Partition Generation. "Non-empty" sub-partitions can be identified by either the exponential mechanism or the Laplace mechanism. With the exponential mechanism, we can obtain the noisy number N of non-empty sub-partitions, and then use the exponential mechanism to extract N sub-partitions, using the number of records in a sub-partition as the score function. This approach, however, does not take advantage of the fact that all sub-partitions contain disjoint datasets, resulting in a relatively small privacy budget for each operation and thus less accurate results. For this reason, we employ the Laplace mechanism for generating sub-partitions, whose details are presented in Procedure 1.

For a non-leaf partition, we generate a candidate set of taxonomy tree nodes from its hierarchy cut, containing all non-leaf nodes of the largest height in H, and then randomly select a node u from the candidate set to expand, generating a total of 2^l sub-partitions, where l ≤ f is the number of u's children in H.

References (partial list)

J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann.
L. Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC), 2006.
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. Journal of Privacy and Confidentiality.