
Plausible Deniability for Privacy-Preserving Data Synthesis
Vincent Bindschaedler
UIUC
bindsch2@illinois.edu
Reza Shokri
Cornell Tech
shokri@cornell.edu
Carl A. Gunter
UIUC
cgunter@illinois.edu
ABSTRACT
Releasing full data records is one of the most challenging problems in data privacy. On the one hand, many of the popular techniques such as data de-identification are problematic because of their dependence on the background knowledge of adversaries. On the other hand, rigorous methods such as the exponential mechanism for differential privacy are often computationally impractical to use for releasing high dimensional data or cannot preserve high utility of original data due to their extensive data perturbation.

This paper presents a criterion called plausible deniability that provides a formal privacy guarantee, notably for releasing sensitive datasets: an output record can be released only if a certain amount of input records are indistinguishable, up to a privacy parameter. This notion does not depend on the background knowledge of an adversary. Also, it can efficiently be checked by privacy tests. We present mechanisms to generate synthetic datasets with similar statistical properties to the input data and the same format. We study this technique both theoretically and experimentally. A key theoretical result shows that, with proper randomization, the plausible deniability mechanism generates differentially private synthetic data. We demonstrate the efficiency of this generative technique on a large dataset; it is shown to preserve the utility of original data with respect to various statistical analysis and machine learning measures.
1. INTRODUCTION
There is tremendous interest in releasing datasets for research and development. Privacy policies of data holders, however, prevent them from sharing their sensitive datasets. This is due, to a large extent, to multiple failed attempts of releasing datasets using imperfect privacy-preserving mechanisms such as de-identification. A range of inference attacks on, for example, the AOL search log dataset [2], the Netflix movie rating dataset [39], genomic data [48, 22], location data [18, 46], and social networks data [40], shows that simple modification of sensitive data by removing identifiers or
by generalizing/suppressing data features results in major information leakage and cannot guarantee meaningful privacy for data owners. These simple de-identification solutions, however, preserve data utility as they impose minimal perturbation to real data.
Rigorous privacy definitions, such as differential privacy [15], can theoretically guarantee privacy and bound information leakage about sensitive data. However, known mechanisms, such as the Laplacian mechanism [15] or the exponential mechanism [37], that achieve differential privacy through randomization, have practical limitations. The majority of scenarios where they have been applied are limited to interactive count queries on statistical databases [14]. In a non-interactive setting for releasing generic datasets, these mechanisms are either computationally infeasible on high-dimensional data, or practically ineffective because of their large utility costs [26]. At best, these methods are used to release some privacy-preserving statistics (e.g., histograms [6, 51]) about a dataset, but not full data records. It is not obvious how to protect the privacy of full records as opposed to that of aggregate statistics (by adding random noise).
Despite all these obstacles, releasing full data records is firmly pursued by large-scale data holders such as the U.S. Census Bureau [21, 28, 27]. The purpose of this endeavor is to allow researchers to develop analytic techniques by processing full synthetic data records rather than a limited set of statistics. Synthetic data could also be used for educational purposes, application development for data analysis, sharing sensitive data among different departments in a company, developing and testing pattern recognition and machine learning models, and algorithm design for sensitive data. There exist some inference-based techniques to assess the privacy risks of releasing synthetic data [42, 43]. However, the major open problem is how to generate synthetic full data records with provable privacy that experimentally can achieve acceptable utility in various statistical analytics and machine learning settings.
In this paper, we fill this major gap in data privacy by proposing a generic theoretical framework for generating synthetic data in a privacy-preserving manner. The fundamental difference between our approach and that of existing mechanisms for differential privacy (e.g., the exponential mechanism) is that we disentangle the data generative model from privacy definitions. Instead of forcing a generative model to be privacy-preserving by design, which might significantly degrade its utility, we can use a utility-preserving generative model and release only a subset of its output that satisfies our privacy requirements. Thus, for designing a generative
model, we rely on the state-of-the-art techniques from data
science independently from the privacy requirements. This
enables us to generate high utility synthetic data.
We formalize the notion of plausible deniability for data privacy [3], and generalize it to any type of data. Consider a probabilistic generative model that transforms a real data record, as its seed, into a synthetic data record. We can sample many synthetic data records from each seed using such a generative model. According to our definition, a synthetic record provides plausible deniability if there exists a set of real data records that could have generated the same synthetic data with (more or less) the same probability by which it was generated from its own seed. We design a privacy mechanism that provably guarantees plausible deniability. This mechanism results in input indistinguishability: by observing the output set (i.e., synthetics), an adversary cannot tell for sure whether a particular data record was in the input set (i.e., real data). The degree of this indistinguishability is a parameter in our mechanism.
Plausible deniability is a property of the overall process, and similar to differential privacy, it is independent of any adversary's background knowledge. In fact, we prove that our proposed plausibly deniable data synthesis process can also satisfy differential privacy, if we randomize the indistinguishability parameter in the privacy mechanism. This is a significant theoretical result towards achieving strong privacy using privacy-agnostic utility-preserving generative models. Thus, we achieve differential privacy without artificially downgrading the utility of the synthesized data through output perturbation.
The process of generating a single synthetic data record and testing its plausible deniability can be done independently from that of other data records. Thus, millions of data records can be generated and processed in parallel. This makes our framework extremely efficient and allows implementing it at a large scale. In this paper, we develop our theoretical framework as an open-source tool, and run it on a large dataset: the American Community Survey [47] from the U.S. Census Bureau, which contains over 3.1 million records. In fact, we can generate over one million privacy-preserving synthetic records in less than one hour on a multi-core machine running 12 processes in parallel.
We analyze the utility of synthetic data in two major scenarios: extracting statistics for data analysis, and performing prediction using machine learning. We show that our privacy test does not impose a high utility cost. We also demonstrate that a significant fraction of candidate synthetic records proposed by a generative model can pass the privacy test even for strict privacy parameters.
We show that a strong adversary cannot distinguish a synthetic record from a real one with better than 63.0% accuracy (baseline: 79.8%). Furthermore, when it comes to classification tasks, the accuracy of a model learned on a synthetic dataset is only slightly lower than that of a model trained on real data. For example, for Random Forest the accuracy is 75.3% compared to 80.4% when trained on real data (baseline: 63.8%); whereas for AdaBoostM1 the accuracy is 78.1% compared to 79.3% when trained on real data (baseline: 69.2%). Similar results are obtained when we compare logistic regression (LR) and support vector machine (SVM) classifiers trained on our synthetic datasets with the same classifiers trained (on real data) in a differentially private way (using state-of-the-art techniques). Concretely, the accuracy of classifiers trained on our synthetic data is 77.5% (LR) and 77.1% (SVM); compared to 76.3% (LR) and 78.2% (SVM) for objective-perturbation ε-DP classifiers.
Contributions. We introduce a formal framework for plausible deniability as a privacy definition. We also design a mechanism to achieve it for the case of generating synthetic data. We prove that using a randomized test in our plausible deniability mechanism achieves differential privacy (which is a stronger guarantee). We also show how to construct generative models with differential privacy guarantees. The composition of our generative model and plausible deniability mechanism also satisfies differential privacy. We show the high accuracy of our model and utility of our generated synthetic data. We develop a generic tool and show its high efficiency for generating millions of full data records.
2. PLAUSIBLE DENIABILITY
In this section, we formalize plausible deniability as a new privacy notion for releasing privacy-preserving synthetic data. We also present a mechanism to achieve it. Finally, we prove that our mechanism can also satisfy differential privacy (which is a stronger guarantee) by slightly randomizing our plausible deniability mechanism.
Informally, plausible deniability states that an adversary (with any background knowledge) cannot deduce that a particular record in the input (real) dataset was significantly more responsible for an observed output (synthetic record) than was a collection of other input records. A mechanism ensures plausible deniability if, for a privacy parameter k > 0, there are at least k input records that could have generated the observed output with similar probability.
Unlike the majority of existing approaches (e.g., to achieve differential privacy), designing a mechanism to satisfy plausible deniability for generative models does not require adding artificial noise to the generated data. Instead, we separate the process of releasing privacy-preserving data into running two independent modules: (1) generative models, and (2) privacy test. The first consists in constructing a utility-preserving generative data model. This is ultimately a data science task which requires insight into the type of data for which one wants to generate synthetics. By contrast, the privacy test aims to safeguard the privacy of those individuals whose data records are in the input dataset. Every generated synthetic is subjected to this privacy test; if it passes the test it can be safely released, otherwise it is discarded. This is where the plausible deniability criterion comes into the frame: the privacy test is designed to ensure that any released output can be plausibly denied.
In this section, we assume a generic generative model that, given a data record in the input dataset as seed, produces a synthetic data record. In Section 3, we present a generic generative model based on statistical models, and show how it can be constructed in a differentially-private manner, so that it does not significantly leak about its own training data. Plausibly deniable mechanisms protect the privacy of the seeds, and are not concerned about how the generative models are constructed.

Let M be a probabilistic generative model that, given any data record d, can generate synthetic records y with probability Pr{y = M(d)}. Let k ≥ 1 be an integer and γ ≥ 1 be a real number. Both k and γ are privacy parameters.
Definition 1 (Plausible Deniability). For any dataset D with |D| ≥ k, and any record y generated by a probabilistic generative model M such that y = M(d_1) for d_1 ∈ D, we state that y is releasable with (k, γ)-plausible deniability, if there exist at least k − 1 distinct records d_2, ..., d_k ∈ D \ {d_1} such that

    γ^{-1} ≤ Pr{y = M(d_i)} / Pr{y = M(d_j)} ≤ γ,    (1)

for any i, j ∈ {1, 2, ..., k}.
The larger the privacy parameter k is, the larger the indistinguishability set for the input data record. Also, the closer to 1 the privacy parameter γ is, the stronger the indistinguishability of the input record among other plausible records.

Given a generative model M and a dataset D, we need a mechanism F to guarantee that the privacy criterion is satisfied for any released data. Specifically, F produces data records by using M on dataset D. The following mechanism enforces (k, γ)-plausible deniability by construction.
Mechanism 1 (F with Plausible Deniability). Given a generative model M, dataset D, and parameters k, γ, output a synthetic record y or nothing.

1. Randomly sample a seed record d ∈ D.
2. Generate a candidate synthetic record y = M(d).
3. Invoke the privacy test on (M, D, d, y, k, γ).
4. If the tuple passes the test, then release y. Otherwise, there is no output.
The core of Mechanism 1 (F) is a privacy test that simply rejects a candidate synthetic data record if it does not satisfy a given privacy criterion.

We can think of Definition 1 as a privacy criterion that can be efficiently checked and enforced. So, instead of trying to measure how sensitive the model M is with respect to input data records, we test if there are enough indistinguishable records in the input dataset that could have (plausibly) generated a candidate synthetic data record.
Privacy Test 1 (Deterministic test T). Given a generative model M, dataset D, data records d and y, and privacy parameters k and γ, output pass to allow releasing y, otherwise output fail.

1. Let i ≥ 0 be the (only) integer that fits the inequalities γ^{-i-1} < Pr{y = M(d)} ≤ γ^{-i}.
2. Let k′ be the number of records d_a ∈ D such that γ^{-i-1} < Pr{y = M(d_a)} ≤ γ^{-i}.
3. If k′ ≥ k then return pass, otherwise return fail.
Step 2 counts the number of plausible seeds, i.e., records in D which could have plausibly produced y. Note that for a given y, there may exist some records d_a ∈ D such that Pr{y = M(d_a)} = 0. Such records cannot be plausible seeds of y since no integer i ≥ 0 fits the inequalities.
Remark that Privacy Test 1 (T) enforces a stringent condition: the probability of generating a candidate synthetic y given the seed d and the probability of generating the same record given another plausible seed d_a both fall into a geometric range [γ^{-i-1}, γ^{-i}], for some integer i ≥ 0, assuming γ > 1. Notice that, under this test, the set of k − 1 different d_a's plus d satisfies the plausible deniability condition (1).
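For concreteness, the following Python sketch puts Mechanism 1 and the deterministic Privacy Test 1 together. It assumes a hypothetical model object that exposes generate(d) for sampling a synthetic record from a seed and prob(y, d) returning Pr{y = M(d)}; this interface is an illustration, not part of the paper.

```python
import math
import random

def privacy_test_deterministic(model, dataset, d, y, k, gamma):
    """Privacy Test 1: count plausible seeds of y within a geometric bucket."""
    p_seed = model.prob(y, d)                  # Pr{y = M(d)} for the true seed
    if p_seed <= 0:
        return False
    # The (only) integer i >= 0 with gamma^{-i-1} < Pr{y = M(d)} <= gamma^{-i}.
    i = math.floor(-math.log(p_seed, gamma))   # assumes gamma > 1
    lo, hi = gamma ** (-i - 1), gamma ** (-i)
    # Count records that fall into the same probability bucket as the seed.
    k_prime = sum(1 for d_a in dataset if lo < model.prob(y, d_a) <= hi)
    return k_prime >= k

def mechanism_F(model, dataset, k, gamma):
    """Mechanism 1: propose one synthetic record, or release nothing."""
    d = random.choice(dataset)                 # 1. sample a seed
    y = model.generate(d)                      # 2. generate a candidate synthetic
    if privacy_test_deterministic(model, dataset, d, y, k, gamma):
        return y                               # 4. release if the test passes
    return None                                # otherwise, no output
```

Releasing a full synthetic dataset then amounts to running mechanism_F repeatedly (possibly in parallel) and keeping only the non-None outputs.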
Informally, the threshold k prevents releasing implausible synthetic records y. As k increases, the number of plausible records which could have produced y also increases. Thus, an adversary with only partial knowledge of the input dataset cannot readily determine whether a particular input record d was the seed of any released record y. This is because there are at least k − 1 other records d_i ≠ d in the input dataset which could plausibly have been the seed. However, whether y passes the privacy test itself reveals something about the number of plausible seeds, which could potentially reveal whether a particular d is included in the input data. This can be prevented by using a privacy test which randomizes the threshold k (as Section 2.1 shows), in which case the mechanism achieves (ε, δ)-differential privacy.
2.1 Relationship with Differential Privacy
We show a connection between Plausible Deniability and
Differential Privacy, given the following definition.
Definition 2 (Differential Privacy [16]). Mechanism F satisfies (ε, δ)-differential privacy if for any neighboring datasets D, D′, and any output S ⊆ Range(F):

    Pr{F(D′) ∈ S} ≤ e^ε Pr{F(D) ∈ S} + δ.

Typically, one chooses δ smaller than an inverse polynomial in the size of the dataset, e.g., δ ≤ |D|^{-c}, for some c > 1.
In this section, we prove that if the privacy test is randomized in a certain way, then Mechanism 1 (F) is in fact (ε, δ)-differentially private for some δ > 0 and ε > 0. Privacy Test 1 simply counts the number of plausible seeds for an output and only releases a candidate synthetic if that number is at least k. We design Privacy Test 2, which is identical except that it randomizes the threshold k.
Privacy Test 2 (Randomized test T′). Given a generative model M, dataset D, data records d and y, privacy parameters k and γ, and randomness parameter ε_0, output pass to allow releasing y, otherwise output fail.

1. Randomize k by adding fresh noise: k̃ = k + Lap(1/ε_0).
2. Let i ≥ 0 be the (only) integer that fits the inequalities γ^{-i-1} < Pr{y = M(d)} ≤ γ^{-i}.
3. Let k′ be the number of records d_a ∈ D such that γ^{-i-1} < Pr{y = M(d_a)} ≤ γ^{-i}.
4. If k′ ≥ k̃ then return pass, otherwise return fail.

Here z ∼ Lap(b) is a sample from the Laplace distribution (1/(2b)) exp(−|z|/b) with mean 0 and shape parameter b > 0.
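The randomized variant only changes the threshold; a sketch under the same hypothetical model interface as above, with the Laplace noise drawn via numpy:

```python
import math
import numpy as np

def privacy_test_randomized(model, dataset, d, y, k, gamma, eps0, rng=None):
    """Privacy Test 2: like Test 1, but with a Laplace-randomized threshold."""
    rng = rng or np.random.default_rng()
    k_tilde = k + rng.laplace(loc=0.0, scale=1.0 / eps0)   # 1. noisy threshold
    p_seed = model.prob(y, d)
    if p_seed <= 0:
        return False
    i = math.floor(-math.log(p_seed, gamma))               # 2. bucket index
    lo, hi = gamma ** (-i - 1), gamma ** (-i)
    k_prime = sum(1 for d_a in dataset if lo < model.prob(y, d_a) <= hi)  # 3.
    return k_prime >= k_tilde                               # 4.
```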
Theorem 1 (Differential Privacy of F). Let F denote Mechanism 1 with the (randomized) Privacy Test 2 and parameters k ≥ 1, γ > 1, and ε_0 > 0. For any neighboring datasets D and D′ such that |D|, |D′| ≥ k, any set of outcomes Y ⊆ U, and any integer 1 ≤ t < k, we have:

    Pr{F(D′) ∈ Y} ≤ e^ε Pr{F(D) ∈ Y} + δ,

for δ = e^{-ε_0 (k − t)} and ε = ε_0 + ln(1 + γ/t).
The privacy level offered by Theorem 1 is meaningful provided k is such that δ is sufficiently small. For example, if we want δ ≤ 1/n^c for some c > 1, then we can set k ≥ t + (c/ε_0) ln n. Here t provides a trade-off between δ and ε.
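As a purely illustrative calculation (the parameter values below are arbitrary and not taken from the paper's experiments), Theorem 1 can be instantiated as follows:

```python
import math

eps0, k, t, gamma = 0.3, 100, 20, 4.0   # illustrative values only
delta = math.exp(-eps0 * (k - t))       # delta = e^{-eps0 (k - t)} ~ 3.8e-11
eps = eps0 + math.log(1 + gamma / t)    # eps = eps0 + ln(1 + gamma/t) ~ 0.48
print(f"epsilon ~ {eps:.2f}, delta ~ {delta:.1e}")
```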
The proof of Theorem 1 can be found in the extended
version of the paper ([4] Appendix C). Roughly speaking,
the theorem says that, except with some small probability δ,
adding a record to a dataset cannot change the probability
that any synthetic record y is produced by more than a
small multiplicative factor. The intuition behind this is the
following.
Fix an arbitrary synthetic record y. Observe that, given y, all the records in the dataset are partitioned into (disjoint) sets according to their probabilities of generating y (with respect to M). That is, the i-th partition (or set) contains those records d such that γ^{-(i+1)} < Pr{y = M(d)} ≤ γ^{-i}. (Records d such that Pr{y = M(d)} = 0 can be ignored.) Note that: for y to be released from partition i, the seed must be in partition i, and it must pass the privacy test; and the probability of passing the privacy test depends only on the number of records in the partition of the seed.

Suppose we add d′ to the dataset and let j be the partition that d′ falls into. The number of plausible seeds can increase by at most one (this occurs when the seed is in partition j), and so the probability of passing the privacy test changes by a factor of at most e^{ε_0} due to adding Laplacian noise to the threshold k. Now, suppose the seed belongs to partition j. On the one hand, if partition j contains only l records, such that l ≪ k, then the change in probability (due to adding d′) could be unbounded. (For example, it could be that d′ is the only record for which Pr{y = M(d′)} > 0.) However, in this case, the probability of passing the privacy test is negligible (at most δ). On the other hand, if l is large enough (say l ≥ k) so that passing the privacy test is likely, then the probability of generating y from partition j can only change by a small multiplicative factor. Indeed, the probabilities of generating y from d′ or from any of the other l records in partition j are γ-close.
3. GENERATIVE MODEL
In this section, we present our generative model, and the process of using it to generate synthetic data. The core of our synthesizer is a probabilistic model that captures the joint distribution of attributes. We learn this model from training data samples drawn from our real dataset D. Thus, the model itself needs to be privacy-preserving with respect to its training set. We show how to achieve this with differential privacy guarantees.

Let D_S, D_T, and D_P be three non-overlapping subsets of dataset D. We use these datasets in the process of synthesis, structure learning, and parameter learning, respectively.
3.1 Model
Let {x_1, x_2, ..., x_m} be the set of random variables associated with the attributes of the data records in D. Let G be a directed acyclic graph (DAG), where the nodes are the random variables, and the edges represent the probabilistic dependencies between them. A directed edge from x_j to x_i indicates the probabilistic dependence of attribute i on attribute j. Let P_G(i) be the set of parents of random variable i according to the dependency graph G. The following model, which we use in Section 3.2 to generate synthetic data, represents the joint probability of data attributes:

    Pr{x_1, ..., x_m} = ∏_{i=1}^{m} Pr{x_i | {x_j}_{j ∈ P_G(i)}}    (2)
This model is based on a structure between random variables, captured by G, and a set of parameters that construct the conditional probabilities. In Section 3.3 and Section 3.4, we present our differentially-private algorithms to learn the structure and parameters of the model from D, respectively.
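For intuition, the factorization in (2) can be realized as a lookup of conditional probability tables indexed by parent configurations. The sketch below uses hypothetical data structures (a dict of parent lists and a dict of conditional tables) that are not specified in the paper:

```python
import numpy as np

class DAGModel:
    """Joint distribution factored as in (2): Pr{x_1..x_m} = prod_i Pr{x_i | parents of i}.

    parents[i] lists the parent attributes P_G(i);
    cpts[i] maps a tuple of parent values to a dict {value: probability};
    order is a topological order of the DAG G.
    """
    def __init__(self, parents, cpts, order):
        self.parents = parents
        self.cpts = cpts
        self.order = order

    def sample_record(self, rng=None):
        rng = rng or np.random.default_rng()
        record = {}
        for i in self.order:                        # parents are sampled first
            key = tuple(record[j] for j in self.parents[i])
            dist = self.cpts[i][key]
            values, probs = zip(*dist.items())
            record[i] = values[rng.choice(len(values), p=np.array(probs))]
        return record
```

Sampling in a topological order of G guarantees that every parent value is available before its children are drawn, mirroring the re-sampling order σ used in Section 3.2.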
3.2 Synthesis
Using a generative model, we probabilistically transform a real data record (called the seed) into a synthetic data record, by updating its attributes. Let {x_1, x_2, ..., x_m} be the values for the set of data attributes for a randomly selected record in the seed dataset D_S. Let ω be the number of attributes for which we generate new values. Thus, we keep (i.e., copy over) the values of m − ω attributes from the seed to the synthetic data. Let σ be a permutation over {1, 2, ..., m} to determine the re-sampling order of attributes.

We set the re-sampling order σ to be the dependency order between random variables. More precisely, ∀j ∈ P_G(i): σ(j) < σ(i). We fix the values of the first m − ω attributes according to σ (i.e., the synthetic record and the seed overlap on their {σ(1), ..., σ(m − ω)} attributes). We then generate a new value for each of the remaining ω attributes, using the conditional probabilities (2). As we update the record while we re-sample, each new value can depend on attributes with updated values as well as the ones with original (seed) values.

We re-sample attribute σ(i), for i > m − ω, as

    x′_{σ(i)} ∼ Pr{x_{σ(i)} | {x_{σ(j)} = x_{σ(j)}}_{j ∈ P_G(i), j ≤ m−ω}, {x_{σ(j)} = x′_{σ(j)}}_{j ∈ P_G(i), j > m−ω}}    (3)
In Section 2, we show how to protect the privacy of the
seed data record using our plausible deniability mechanisms.
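Continuing the hypothetical DAGModel sketch above, the seed-based re-sampling of (3) might look like this: the first m − ω attributes in the σ order are copied from the seed, and the remaining ones are drawn from their conditional distributions given the values fixed so far.

```python
import numpy as np

def synthesize_from_seed(model, seed, sigma, omega, rng=None):
    """Re-sample the last `omega` attributes (in sigma order) of a seed record.

    `sigma` is a list of attributes in a dependency-compatible order, so
    every parent of sigma[i] appears before position i (sigma(j) < sigma(i)).
    """
    rng = rng or np.random.default_rng()
    m = len(sigma)
    record = {}
    for pos, attr in enumerate(sigma):
        if pos < m - omega:
            record[attr] = seed[attr]              # copy the seed's value
        else:
            # Condition on parents, whether copied from the seed or re-sampled.
            key = tuple(record[j] for j in model.parents[attr])
            dist = model.cpts[attr][key]
            values, probs = zip(*dist.items())
            record[attr] = values[rng.choice(len(values), p=np.array(probs))]
    return record
```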
Baseline: Marginal Synthesis. As a baseline generative model, we consider a synthesizer that (independently from any seed record) samples a value for an attribute from its marginal distribution. Thus, for every attribute i, we generate x_i ∼ Pr{x_i}. This is based on an assumption of independence between the attributes' random variables, i.e., it assumes Pr{x_1, ..., x_m} = ∏_{i=1}^{m} Pr{x_i}.
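For comparison, the marginal baseline ignores the seed entirely; a minimal sketch (again with hypothetical data structures):

```python
import numpy as np

def synthesize_marginal(marginals, rng=None):
    """Baseline: sample each attribute independently from its marginal.

    `marginals` maps each attribute to a dict {value: probability}.
    """
    rng = rng or np.random.default_rng()
    record = {}
    for attr, dist in marginals.items():
        values, probs = zip(*dist.items())
        record[attr] = values[rng.choice(len(values), p=np.array(probs))]
    return record
```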
3.3 Privacy-Preserving Structure Learning
Our generative model depends on the dependency structure between the random variables that represent data attributes. The dependency graph G embodies this structure. In this section, we present an algorithm that learns G from real data, in a privacy-preserving manner such that G does not significantly depend on individual data records.

The algorithm is based on maximizing a scoring function that reflects how correlated the attributes are according to the data. There are multiple approaches to this problem in the literature [35]. We use a method based on a well-studied machine learning problem: feature selection. For each attribute, the goal is to find the best set of features (among all attributes) to predict it, and add them as the attribute's parents, under the condition that the dependency graph remains acyclic.
The machine learning literature proposes several ways to rank features in terms of how well they can predict a particular attribute. One possibility is to calculate the information gain of each feature with the target attribute. The major downside with this approach is that it ignores the redundancy in information between the features. We propose to use a different approach, namely Correlation-based Feature Selection (CFS) [20], which consists in determining the best subset of predictive features according to some correlation measure. This is an optimization problem to select a subset of features that have high correlation with the target attribute and at the same time have low correlation among themselves. The task is to find the best subset of features which maximizes a merit score that captures our objective.
We follow [20] to compute the merit score for a parent set P_G(i) for attribute i as

    score(P_G(i)) = [ Σ_{j ∈ P_G(i)} corr(x_i, x_j) ] / √( |P_G(i)| + Σ_{j,k ∈ P_G(i)} corr(x_j, x_k) ),    (4)
where |P_G(i)| is the size of the parent set, and corr() is the correlation between the two random variables associated with two attributes. The numerator rewards correlation between parent attributes and the target attribute, and the denominator penalizes the inner-correlation among parent attributes. The suggested correlation metric in [20], which we use, is the symmetrical uncertainty coefficient:

    corr(x_i, x_j) = 2 − 2 H(x_i, x_j) / (H(x_i) + H(x_j)),    (5)
where H() is the entropy function.
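The symmetrical uncertainty in (5) can be estimated from the empirical distributions of two discrete attributes; a sketch:

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (in bits) of a discrete sample."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetrical_uncertainty(xi, xj):
    """corr(x_i, x_j) = 2 - 2 H(x_i, x_j) / (H(x_i) + H(x_j)), as in (5)."""
    h_i, h_j = entropy(xi), entropy(xj)
    h_ij = entropy(list(zip(xi, xj)))       # joint entropy H(x_i, x_j)
    if h_i + h_j == 0:
        return 0.0
    return 2.0 - 2.0 * h_ij / (h_i + h_j)
```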
The optimization objective in constructing G is to maximize the total score(P_G(i)) over all attributes i. Unfortunately, the number of possible solutions to search is exponential in the number of attributes, making it impractical to find the optimal solution. The greedy algorithm suggested in [20] is to start with an empty parent set for a target attribute and always add the attribute (feature) that maximizes the score.
There are two constraints in our optimization problem. First, the resulting dependency graph obtained from the set of best predictive features (i.e., parent attributes) for all attributes should be acyclic. This would allow us to decompose and compute the joint distribution over attributes as represented in (2).

Second, we enforce a maximum allowable complexity cost for the set of parents for each attribute. The cost is proportional to the number of possible joint value assignments (configurations) for the parent attributes. So, for each attribute i, the complexity cost constraint is
    cost(P_G(i)) = ∏_{j ∈ P_G(i)} |x_j| ≤ maxcost    (6)
where |x_j| is the total number of possible values that attribute j takes. This constraint prevents selecting too many parent attribute combinations for predicting an attribute. The larger the joint cardinality of attribute i's parents is, the fewer data points can be found to estimate the conditional probability Pr{x_i | {x_j}_{j ∈ P_G(i)}}. This would cause overfitting of the conditional probabilities on the data, which results in low-confidence parameter estimation in Section 3.4. The constraint prevents this.
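A greedy parent-set search combining the merit score (4) with the complexity constraint (6) could be sketched as follows. This is only an outline: the acyclicity constraint is assumed to be handled by the caller (e.g., by restricting the candidate set), and the exact tie-breaking and stopping rules of the paper's implementation are not reproduced here.

```python
import math

def merit_score(target, parents, corr):
    """Merit of a parent set for `target`, as in (4); corr(a, b) is symmetric."""
    if not parents:
        return 0.0
    num = sum(corr(target, j) for j in parents)
    den = math.sqrt(len(parents) +
                    sum(corr(j, l) for j in parents for l in parents if j != l))
    return num / den

def greedy_parent_set(target, candidates, corr, cardinality, maxcost):
    """Greedily add the feature that most improves the merit score,
    subject to the complexity constraint (6) on the parent set."""
    parents, best = [], 0.0
    while True:
        best_gain, best_feature = 0.0, None
        for f in candidates:
            if f in parents:
                continue
            cost = math.prod(cardinality[j] for j in parents + [f])  # as in (6)
            if cost > maxcost:
                continue
            score = merit_score(target, parents + [f], corr)
            if score - best > best_gain:
                best_gain, best_feature = score - best, f
        if best_feature is None:
            return parents
        parents.append(best_feature)
        best += best_gain
```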
To compute the score and cost functions, we discretize the parent attributes. Let bkt() be a discretizing function that partitions an attribute's values into buckets. If the attribute is continuous, it becomes discrete, and if it is already discrete, bkt() might reduce the number of its bins. Thus, we update the conditional probabilities as follows:

    Pr{x_i | {x_j}_{j ∈ P_G(i)}} ≈ Pr{x_i | {bkt(x_j)}_{j ∈ P_G(i)}}    (7)

where the discretization, of course, varies for each attribute. We update (4) and (6) according to (7). This approximation itself decreases the cost complexity of a parent set, and further prevents overfitting on the data.
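The paper does not fix a particular bkt(); one simple choice is equal-width bucketing, sketched below with numpy:

```python
import numpy as np

def bkt(values, n_buckets=10):
    """Discretize an attribute into equal-width buckets (returns bucket ids)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_buckets + 1)
    # np.digitize assigns each value to a bucket using the interior edges.
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_buckets - 1)
```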
3.3.1 Differential-Privacy Protection
In this section, we show how to safeguard the privacy of individuals whose records are in D, and could influence the model structure (which might leak about their data).

All the computations required for structure learning are reduced to computing the correlation metric (5) from D. Thus, we can achieve differential privacy [16] for the structure learning by simply adding appropriate noise to the metric. As the correlation metric is based on the entropy of a single or a pair of random variables, we only need to compute the entropy functions in a differentially-private way. We also need to make sure that the correlation metric remains in the [0, 1] range after using noisy entropy values.
Let H̃(z) be the noisy version of the entropy of a random variable z, where in our case z could be a single or a pair of random variables associated with the attributes and their discretized versions (as presented in (7)). To be able to compute the differentially-private correlation metric in all cases, we need to compute the noisy entropies H̃(x_i), H̃(bkt(x_i)), H̃(x_i, x_j), and H̃(x_i, bkt(x_j)), for all attributes i and j. For each of these cases, we generate fresh noise drawn from the Laplace distribution and compute the differentially-private entropy as

    H̃(z) = H(z) + Lap(Δ_H / ε_H)    (8)

where Δ_H is the sensitivity of the entropy function, and ε_H is the differential privacy parameter.
It can be shown that if z is a random variable with a probability distribution estimated from n_T = |D_T| data records, then an upper bound on the entropy sensitivity is

    Δ_H ≤ (1/n_T) [2 + 1/ln(2) + 2 log_2 n_T] = O(log_2(n_T) / n_T)    (9)
The proof of (9) can be found in the extended version of the paper ([4] Appendix B). Remark that Δ_H is a function of n_T (the number of records in D_T), which per se needs to be protected. As a defense, we compute Δ_H in a differentially-private manner, by once randomizing the number of records:

    ñ_T = n_T + Lap(1/ε_{n_T})    (10)

By using the randomized entropy values, according to (8), the model structure, which will be denoted by G̃, is differentially private. In Section 3.5, we use the composition theorems to analyze the total privacy of our algorithm for obtaining a differentially-private structure.
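Putting (8), (9), and (10) together, a differentially-private empirical entropy could be computed along these lines (a sketch; the max(2, ·) clamp on the noisy count is an added safeguard against negative noise, not part of the paper):

```python
import math
from collections import Counter

import numpy as np

def dp_entropy(values, eps_H, eps_nT, rng=None):
    """Differentially-private empirical entropy: H(z) + Lap(Delta_H / eps_H), as in (8)."""
    rng = rng or np.random.default_rng()
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    h = float(-(p * np.log2(p)).sum())                     # empirical H(z) in bits
    n_T = len(values)
    # (10): randomize the record count before using it in the sensitivity bound,
    # since Delta_H itself depends on |D_T|.
    n_tilde = max(2.0, n_T + rng.laplace(scale=1.0 / eps_nT))
    # (9): upper bound on the sensitivity of the entropy.
    delta_H = (2 + 1 / math.log(2) + 2 * math.log2(n_tilde)) / n_tilde
    return h + rng.laplace(scale=delta_H / eps_H)          # (8)
```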