
Plausible Deniability for Privacy-Preserving Data Synthesis
Vincent Bindschaedler
UIUC
bindsch2@illinois.edu
Reza Shokri
Cornell Tech
shokri@cornell.edu
Carl A. Gunter
UIUC
cgunter@illinois.edu
ABSTRACT
Releasing full data records is one of the most challenging problems in data privacy. On the one hand, many of the popular techniques such as data de-identification are problematic because of their dependence on the background knowledge of adversaries. On the other hand, rigorous methods such as the exponential mechanism for differential privacy are often computationally impractical to use for releasing high dimensional data or cannot preserve high utility of original data due to their extensive data perturbation.

This paper presents a criterion called plausible deniability that provides a formal privacy guarantee, notably for releasing sensitive datasets: an output record can be released only if a certain amount of input records are indistinguishable, up to a privacy parameter. This notion does not depend on the background knowledge of an adversary. Also, it can efficiently be checked by privacy tests. We present mechanisms to generate synthetic datasets with similar statistical properties to the input data and the same format. We study this technique both theoretically and experimentally. A key theoretical result shows that, with proper randomization, the plausible deniability mechanism generates differentially private synthetic data. We demonstrate the efficiency of this generative technique on a large dataset; it is shown to preserve the utility of original data with respect to various statistical analysis and machine learning measures.
1. INTRODUCTION
There is tremendous interest in releasing datasets for research and development. Privacy policies of data holders, however, prevent them from sharing their sensitive datasets. This is due, to a large extent, to multiple failed attempts of releasing datasets using imperfect privacy-preserving mechanisms such as de-identification. A range of inference attacks on, for example, the AOL search log dataset [2], the Netflix movie rating dataset [39], genomic data [48, 22], location data [18, 46], and social networks data [40], shows that simple modification of sensitive data by removing identifiers or
by generalizing/suppressing data features results in major information leakage and cannot guarantee meaningful privacy for data owners. These simple de-identification solutions, however, preserve data utility as they impose minimal perturbation to real data.
Rigorous privacy definitions, such as differential privacy [15], can theoretically guarantee privacy and bound information leakage about sensitive data. However, known mechanisms, such as the Laplacian mechanism [15] or the exponential mechanism [37], that achieve differential privacy through randomization, have practical limitations. The majority of scenarios where they have been applied are limited to interactive count queries on statistical databases [14]. In a non-interactive setting for releasing generic datasets, these mechanisms are either computationally infeasible on high-dimensional data, or practically ineffective because of their large utility costs [26]. At best, these methods are used to release some privacy-preserving statistics (e.g., histograms [6, 51]) about a dataset, but not full data records. It is not obvious how to protect the privacy of full records as opposed to that of aggregate statistics (by adding random noise).
Despite all these obstacles, releasing full data records is firmly pursued by large-scale data holders such as the U.S. Census Bureau [21, 28, 27]. The purpose of this endeavor is to allow researchers to develop analytic techniques by processing full synthetic data records rather than a limited set of statistics. Synthetic data could also be used for educational purposes, application development for data analysis, sharing sensitive data among different departments in a company, developing and testing pattern recognition and machine learning models, and algorithm design for sensitive data. There exist some inference-based techniques to assess the privacy risks of releasing synthetic data [42, 43]. However, the major open problem is how to generate synthetic full data records with provable privacy that experimentally can achieve acceptable utility in various statistical analytics and machine learning settings.
In this paper, we fill this major gap in data privacy by proposing a generic theoretical framework for generating synthetic data in a privacy-preserving manner. The fundamental difference between our approach and that of existing mechanisms for differential privacy (e.g., the exponential mechanism) is that we disentangle the data generative model from privacy definitions. Instead of forcing a generative model to be privacy-preserving by design, which might significantly degrade its utility, we can use a utility-preserving generative model and release only a subset of its output that satisfies our privacy requirements. Thus, for designing a generative
model, we rely on the state-of-the-art techniques from data
science independently from the privacy requirements. This
enables us to generate high utility synthetic data.
We formalize the notion of plausible deniability for data privacy [3], and generalize it to any type of data. Consider a probabilistic generative model that transforms a real data record, as its seed, into a synthetic data record. We can sample many synthetic data records from each seed using such a generative model. According to our definition, a synthetic record provides plausible deniability if there exists a set of real data records that could have generated the same synthetic data with (more or less) the same probability by which it was generated from its own seed. We design a privacy mechanism that provably guarantees plausible deniability. This mechanism results in input indistinguishability: by observing the output set (i.e., synthetics), an adversary cannot tell for sure whether a particular data record was in the input set (i.e., real data). The degree of this indistinguishability is a parameter in our mechanism.
Plausible deniability is a property of the overall process, and similar to differential privacy, it is independent of any adversary's background knowledge. In fact, we prove that our proposed plausibly deniable data synthesis process can also satisfy differential privacy, if we randomize the indistinguishability parameter in the privacy mechanism. This is a significant theoretical result towards achieving strong privacy using privacy-agnostic utility-preserving generative models. Thus, we achieve differential privacy without artificially downgrading the utility of the synthesized data through output perturbation.
The process of generating a single synthetic data record and testing its plausible deniability can be done independently from that of other data records. Thus, millions of data records can be generated and processed in parallel. This makes our framework extremely efficient and allows implementing it at a large scale. In this paper, we develop our theoretical framework as an open-source tool, and run it on a large dataset: the American Community Survey [47] from the U.S. Census Bureau, which contains over 3.1 million records. In fact, we can generate over one million privacy-preserving synthetic records in less than one hour on a multi-core machine running 12 processes in parallel.
We analyze the utility of synthetic data in two major scenarios: extracting statistics for data analysis, and performing prediction using machine learning. We show that our privacy test does not impose a high utility cost. We also demonstrate that a significant fraction of candidate synthetic records proposed by a generative model can pass the privacy test even for strict privacy parameters.
We show that a strong adversary cannot distinguish a synthetic record from a real one with better than 63.0% accuracy (baseline: 79.8%). Furthermore, when it comes to classification tasks, the accuracy of a model learned on a synthetic dataset is only slightly lower than that of a model trained on real data. For example, for Random Forest the accuracy is 75.3% compared to 80.4% when trained on real data (baseline: 63.8%); whereas for AdaBoostM1 the accuracy is 78.1% compared to 79.3% when trained on real data (baseline: 69.2%). Similar results are obtained when we compare logistic regression (LR) and support vector machine (SVM) classifiers trained on our synthetic datasets with the same classifiers trained (on real data) in a differentially private way (using state-of-the-art techniques). Concretely, the accuracy of classifiers trained on our synthetic data is 77.5% (LR) and 77.1% (SVM); compared to 76.3% (LR) and 78.2% (SVM) for objective-perturbation ε-DP classifiers.
Contributions. We introduce a formal framework for plausible deniability as a privacy definition. We also design a mechanism to achieve it for the case of generating synthetic data. We prove that using a randomized test in our plausible deniability mechanism achieves differential privacy (which is a stronger guarantee). We also show how to construct generative models with differential privacy guarantees. The composition of our generative model and plausible deniability mechanism also satisfies differential privacy. We show the high accuracy of our model and utility of our generated synthetic data. We develop a generic tool and show its high efficiency for generating millions of full data records.
2. PLAUSIBLE DENIABILITY
In this section, we formalize plausible deniability as a new privacy notion for releasing privacy-preserving synthetic data. We also present a mechanism to achieve it. Finally, we prove that our mechanism can also satisfy differential privacy (which is a stronger guarantee) by slightly randomizing our plausible deniability mechanism.
Informally, plausible deniability states that an adversary (with any background knowledge) cannot deduce that a particular record in the input (real) dataset was significantly more responsible for an observed output (synthetic record) than was a collection of other input records. A mechanism ensures plausible deniability if, for a privacy parameter k > 0, there are at least k input records that could have generated the observed output with similar probability.
Unlike the majority of existing approaches (e.g., to achieve differential privacy), designing a mechanism to satisfy plausible deniability for generative models does not require adding artificial noise to the generated data. Instead, we separate the process of releasing privacy-preserving data into running two independent modules: (1) generative models, and (2) privacy test. The first consists in constructing a utility-preserving generative data model. This is ultimately a data science task which requires insight into the type of data for which one wants to generate synthetics. By contrast, the privacy test aims to safeguard the privacy of those individuals whose data records are in the input dataset. Every generated synthetic is subjected to this privacy test; if it passes the test it can be safely released, otherwise it is discarded. This is where the plausible deniability criterion comes into the frame: the privacy test is designed to ensure that any released output can be plausibly denied.
In this section, we assume a generic generative model that, given a data record in the input dataset as seed, produces a synthetic data record. In Section 3, we present a generic generative model based on statistical models, and show how it can be constructed in a differentially-private manner, so that it does not significantly leak about its own training data. Plausibly deniable mechanisms protect the privacy of the seeds, and are not concerned about how the generative models are constructed.

Let M be a probabilistic generative model that, given any data record d, can generate synthetic records y with probability Pr{y = M(d)}. Let k ≥ 1 be an integer and γ ≥ 1 be a real number. Both k and γ are privacy parameters.
Definition 1 (Plausible Deniability). For any dataset D with |D| ≥ k, and any record y generated by a probabilistic generative model M such that y = M(d_1) for d_1 ∈ D, we state that y is releasable with (k, γ)-plausible deniability, if there exist at least k − 1 distinct records d_2, ..., d_k ∈ D \ {d_1} such that

    γ^{-1} ≤ Pr{y = M(d_i)} / Pr{y = M(d_j)} ≤ γ,    (1)

for any i, j ∈ {1, 2, ..., k}.
The larger the privacy parameter k is, the larger the indistinguishability set for the input data record. Also, the closer to 1 the privacy parameter γ is, the stronger the indistinguishability of the input record among other plausible records.

Given a generative model M and a dataset D, we need a mechanism F to guarantee that the privacy criterion is satisfied for any released data. Specifically, F produces data records by using M on dataset D. The following mechanism enforces (k, γ)-plausible deniability by construction.
Mechanism 1 (F with Plausible Deniability). Given a generative model M, dataset D, and parameters k, γ, output a synthetic record y or nothing.

1. Randomly sample a seed record d ∈ D.
2. Generate a candidate synthetic record y = M(d).
3. Invoke the privacy test on (M, D, d, y, k, γ).
4. If the tuple passes the test, then release y. Otherwise, there is no output.
The core of Mechanism 1 (F) is a privacy test that simply rejects a candidate synthetic data record if it does not satisfy a given privacy criterion.

We can think of Definition 1 as a privacy criterion that can be efficiently checked and enforced. So, instead of trying to measure how sensitive the model M is with respect to input data records, we test if there are enough indistinguishable records in the input dataset that could have (plausibly) generated a candidate synthetic data record.
Privacy Test 1 (Deterministic test T). Given a generative model M, dataset D, data records d and y, and privacy parameters k and γ, output pass to allow releasing y, otherwise output fail.

1. Let i ≥ 0 be the (only) integer that fits the inequalities γ^{-i-1} < Pr{y = M(d)} ≤ γ^{-i}.
2. Let k′ be the number of records d_a ∈ D such that γ^{-i-1} < Pr{y = M(d_a)} ≤ γ^{-i}.
3. If k′ ≥ k then return pass, otherwise return fail.
Step 2 counts the number of plausible seeds, i.e., records in D which could have plausibly produced y. Note that for a given y, there may exist some records d_a ∈ D such that Pr{y = M(d_a)} = 0. Such records cannot be plausible seeds of y since no integer i ≥ 0 fits the inequalities.
Remark that Privacy Test 1 (T) enforces a stringent condition: the probability of generating a candidate synthetic y given the seed d and the probability of generating the same record given another plausible seed d_a both fall into a geometric range [γ^{-i-1}, γ^{-i}], for some integer i ≥ 0, assuming γ > 1. Notice that, under this test, the set of k − 1 different d_a's plus d satisfies the plausible deniability condition (1).
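For concreteness, the following Python sketch puts Mechanism 1 and the deterministic Privacy Test 1 together. It assumes a hypothetical model object that exposes generate(d) for sampling a synthetic record from a seed and prob(y, d) returning Pr{y = M(d)}; this interface is an illustration, not part of the paper.

```python
import math
import random

def privacy_test_deterministic(model, dataset, d, y, k, gamma):
    """Privacy Test 1: count plausible seeds of y within a geometric bucket."""
    p_seed = model.prob(y, d)                  # Pr{y = M(d)} for the true seed
    if p_seed <= 0:
        return False
    # The (only) integer i >= 0 with gamma^{-i-1} < Pr{y = M(d)} <= gamma^{-i}.
    i = math.floor(-math.log(p_seed, gamma))   # assumes gamma > 1
    lo, hi = gamma ** (-i - 1), gamma ** (-i)
    # Count records that fall into the same probability bucket as the seed.
    k_prime = sum(1 for d_a in dataset if lo < model.prob(y, d_a) <= hi)
    return k_prime >= k

def mechanism_F(model, dataset, k, gamma):
    """Mechanism 1: propose one synthetic record, or release nothing."""
    d = random.choice(dataset)                 # 1. sample a seed
    y = model.generate(d)                      # 2. generate a candidate synthetic
    if privacy_test_deterministic(model, dataset, d, y, k, gamma):
        return y                               # 4. release if the test passes
    return None                                # otherwise, no output
```

Releasing a full synthetic dataset then amounts to running mechanism_F repeatedly (possibly in parallel) and keeping only the non-None outputs.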
Informally, the threshold k prevents releasing implausible synthetic records y. As k increases, the number of plausible records which could have produced y also increases. Thus, an adversary with only partial knowledge of the input dataset cannot readily determine whether a particular input record d was the seed of any released record y. This is because there are at least k − 1 other records d_i ≠ d in the input dataset which could plausibly have been the seed. However, whether y passes the privacy test itself reveals something about the number of plausible seeds, which could potentially reveal whether a particular d is included in the input data. This can be prevented by using a privacy test which randomizes the threshold k (as Section 2.1 shows), in which case the mechanism achieves (ε, δ)-differential privacy.
2.1 Relationship with Differential Privacy
We show a connection between Plausible Deniability and
Differential Privacy, given the following definition.
Definition 2 (Differential Privacy [16]). Mechanism F satisfies (ε, δ)-differential privacy if for any neighboring datasets D, D′, and any output S ⊆ Range(F):

    Pr{F(D′) ∈ S} ≤ e^ε Pr{F(D) ∈ S} + δ.

Typically, one chooses δ smaller than an inverse polynomial in the size of the dataset, e.g., δ ≤ |D|^{-c}, for some c > 1.
In this section, we prove that if the privacy test is randomized in a certain way, then Mechanism 1 (F) is in fact (ε, δ)-differentially private for some δ > 0 and ε > 0. Privacy Test 1 simply counts the number of plausible seeds for an output and only releases a candidate synthetic if that number is at least k. We design Privacy Test 2, which is identical except that it randomizes the threshold k.
Privacy Test 2 (Randomized test T′). Given a generative model M, dataset D, data records d and y, privacy parameters k and γ, and randomness parameter ε_0, output pass to allow releasing y, otherwise output fail.

1. Randomize k by adding fresh noise: k̃ = k + Lap(1/ε_0).
2. Let i ≥ 0 be the (only) integer that fits the inequalities γ^{-i-1} < Pr{y = M(d)} ≤ γ^{-i}.
3. Let k′ be the number of records d_a ∈ D such that γ^{-i-1} < Pr{y = M(d_a)} ≤ γ^{-i}.
4. If k′ ≥ k̃ then return pass, otherwise return fail.

Here z ∼ Lap(b) is a sample from the Laplace distribution (1/(2b)) exp(−|z|/b) with mean 0 and shape parameter b > 0.
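The randomized variant only changes the threshold; a sketch under the same hypothetical model interface as above, with the Laplace noise drawn via numpy:

```python
import math
import numpy as np

def privacy_test_randomized(model, dataset, d, y, k, gamma, eps0, rng=None):
    """Privacy Test 2: like Test 1, but with a Laplace-randomized threshold."""
    rng = rng or np.random.default_rng()
    k_tilde = k + rng.laplace(loc=0.0, scale=1.0 / eps0)   # 1. noisy threshold
    p_seed = model.prob(y, d)
    if p_seed <= 0:
        return False
    i = math.floor(-math.log(p_seed, gamma))               # 2. bucket index
    lo, hi = gamma ** (-i - 1), gamma ** (-i)
    k_prime = sum(1 for d_a in dataset if lo < model.prob(y, d_a) <= hi)  # 3.
    return k_prime >= k_tilde                               # 4.
```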
Theorem 1 (Differential Privacy of F). Let F denote Mechanism 1 with the (randomized) Privacy Test 2 and parameters k ≥ 1, γ > 1, and ε_0 > 0. For any neighboring datasets D and D′ such that |D|, |D′| ≥ k, any set of outcomes Y ⊆ U, and any integer 1 ≤ t < k, we have:

    Pr{F(D′) ∈ Y} ≤ e^ε Pr{F(D) ∈ Y} + δ,

for δ = e^{-ε_0 (k − t)} and ε = ε_0 + ln(1 + γ/t).
The privacy level offered by Theorem 1 is meaningful provided k is such that δ is sufficiently small. For example, if we want δ ≤ 1/n^c for some c > 1, then we can set k ≥ t + (c/ε_0) ln n. Here t provides a trade-off between δ and ε.
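As a purely illustrative calculation (the parameter values below are arbitrary and not taken from the paper's experiments), Theorem 1 can be instantiated as follows:

```python
import math

eps0, k, t, gamma = 0.3, 100, 20, 4.0   # illustrative values only
delta = math.exp(-eps0 * (k - t))       # delta = e^{-eps0 (k - t)} ~ 3.8e-11
eps = eps0 + math.log(1 + gamma / t)    # eps = eps0 + ln(1 + gamma/t) ~ 0.48
print(f"epsilon ~ {eps:.2f}, delta ~ {delta:.1e}")
```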
The proof of Theorem 1 can be found in the extended
version of the paper ([4] Appendix C). Roughly speaking,
the theorem says that, except with some small probability δ,
adding a record to a dataset cannot change the probability
that any synthetic record y is produced by more than a
small multiplicative factor. The intuition behind this is the
following.
Fix an arbitrary synthetic record y. Observe that, given y, all the records in the dataset are partitioned into (disjoint) sets according to their probabilities of generating y (with respect to M). That is, the i-th partition (or set) contains those records d such that γ^{-(i+1)} < Pr{y = M(d)} ≤ γ^{-i}. (Records d such that Pr{y = M(d)} = 0 can be ignored.) Note that: for y to be released from partition i, the seed must be in partition i, and it must pass the privacy test; and the probability of passing the privacy test depends only on the number of records in the partition of the seed.

Suppose we add d′ to the dataset and let j be the partition that d′ falls into. The number of plausible seeds can increase by at most one (this occurs when the seed is in partition j), and so the probability of passing the privacy test changes by a factor of at most e^{ε_0} due to adding Laplacian noise to the threshold k. Now, suppose the seed belongs to partition j. On the one hand, if partition j contains only l records, such that l ≪ k, then the change in probability (due to adding d′) could be unbounded. (For example, it could be that d′ is the only record for which Pr{y = M(d′)} > 0.) However, in this case, the probability of passing the privacy test is negligible (at most δ). On the other hand, if l is large enough (say l ≥ k) so that passing the privacy test is likely, then the probability of generating y from partition j can only change by a small multiplicative factor. Indeed, the probabilities of generating y from d′ or from any of the other l records in partition j are γ-close.
3. GENERATIVE MODEL
In this section, we present our generative model, and the process of using it to generate synthetic data. The core of our synthesizer is a probabilistic model that captures the joint distribution of attributes. We learn this model from training data samples drawn from our real dataset D. Thus, the model itself needs to be privacy-preserving with respect to its training set. We show how to achieve this with differential privacy guarantees.

Let D_S, D_T, and D_P be three non-overlapping subsets of dataset D. We use these datasets in the process of synthesis, structure learning, and parameter learning, respectively.
3.1 Model
Let {x_1, x_2, ..., x_m} be the set of random variables associated with the attributes of the data records in D. Let G be a directed acyclic graph (DAG), where the nodes are the random variables, and the edges represent the probabilistic dependencies between them. A directed edge from x_j to x_i indicates the probabilistic dependence of attribute i on attribute j. Let P_G(i) be the set of parents of random variable i according to the dependency graph G. The following model, which we use in Section 3.2 to generate synthetic data, represents the joint probability of data attributes:

    Pr{x_1, ..., x_m} = ∏_{i=1}^{m} Pr{x_i | {x_j}_{j ∈ P_G(i)}}    (2)
This model is based on a structure between random variables, captured by G, and a set of parameters that construct the conditional probabilities. In Section 3.3 and Section 3.4, we present our differentially-private algorithms to learn the structure and parameters of the model from D, respectively.
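For intuition, the factorization in (2) can be realized as a lookup of conditional probability tables indexed by parent configurations. The sketch below uses hypothetical data structures (a dict of parent lists and a dict of conditional tables) that are not specified in the paper:

```python
import numpy as np

class DAGModel:
    """Joint distribution factored as in (2): Pr{x_1..x_m} = prod_i Pr{x_i | parents of i}.

    parents[i] lists the parent attributes P_G(i);
    cpts[i] maps a tuple of parent values to a dict {value: probability};
    order is a topological order of the DAG G.
    """
    def __init__(self, parents, cpts, order):
        self.parents = parents
        self.cpts = cpts
        self.order = order

    def sample_record(self, rng=None):
        rng = rng or np.random.default_rng()
        record = {}
        for i in self.order:                        # parents are sampled first
            key = tuple(record[j] for j in self.parents[i])
            dist = self.cpts[i][key]
            values, probs = zip(*dist.items())
            record[i] = values[rng.choice(len(values), p=np.array(probs))]
        return record
```

Sampling in a topological order of G guarantees that every parent value is available before its children are drawn, mirroring the re-sampling order σ used in Section 3.2.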
3.2 Synthesis
Using a generative model, we probabilistically transform a real data record (called the seed) into a synthetic data record, by updating its attributes. Let {x_1, x_2, ..., x_m} be the values for the set of data attributes for a randomly selected record in the seed dataset D_S. Let ω be the number of attributes for which we generate new values. Thus, we keep (i.e., copy over) the values of m − ω attributes from the seed to the synthetic data. Let σ be a permutation over {1, 2, ..., m} to determine the re-sampling order of attributes.

We set the re-sampling order σ to be the dependency order between random variables. More precisely, ∀j ∈ P_G(i): σ(j) < σ(i). We fix the values of the first m − ω attributes according to σ (i.e., the synthetic record and the seed overlap on their {σ(1), ..., σ(m − ω)} attributes). We then generate a new value for each of the remaining ω attributes, using the conditional probabilities (2). As we update the record while we re-sample, each new value can depend on attributes with updated values as well as the ones with original (seed) values.

We re-sample attribute σ(i), for i > m − ω, as

    x′_{σ(i)} ∼ Pr{x_{σ(i)} | {x_{σ(j)} = x_{σ(j)}}_{j ∈ P_G(i), j ≤ m−ω}, {x_{σ(j)} = x′_{σ(j)}}_{j ∈ P_G(i), j > m−ω}}    (3)
In Section 2, we show how to protect the privacy of the
seed data record using our plausible deniability mechanisms.
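Continuing the hypothetical DAGModel sketch above, the seed-based re-sampling of (3) might look like this: the first m − ω attributes in the σ order are copied from the seed, and the remaining ones are drawn from their conditional distributions given the values fixed so far.

```python
import numpy as np

def synthesize_from_seed(model, seed, sigma, omega, rng=None):
    """Re-sample the last `omega` attributes (in sigma order) of a seed record.

    `sigma` is a list of attributes in a dependency-compatible order, so
    every parent of sigma[i] appears before position i (sigma(j) < sigma(i)).
    """
    rng = rng or np.random.default_rng()
    m = len(sigma)
    record = {}
    for pos, attr in enumerate(sigma):
        if pos < m - omega:
            record[attr] = seed[attr]              # copy the seed's value
        else:
            # Condition on parents, whether copied from the seed or re-sampled.
            key = tuple(record[j] for j in model.parents[attr])
            dist = model.cpts[attr][key]
            values, probs = zip(*dist.items())
            record[attr] = values[rng.choice(len(values), p=np.array(probs))]
    return record
```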
Baseline: Marginal Synthesis. As a baseline generative model, we consider a synthesizer that (independently from any seed record) samples a value for an attribute from its marginal distribution. Thus, for every attribute i, we generate x_i ∼ Pr{x_i}. This is based on an assumption of independence between the attributes' random variables, i.e., it assumes Pr{x_1, ..., x_m} = ∏_{i=1}^{m} Pr{x_i}.
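For comparison, the marginal baseline ignores the seed entirely; a minimal sketch (again with hypothetical data structures):

```python
import numpy as np

def synthesize_marginal(marginals, rng=None):
    """Baseline: sample each attribute independently from its marginal.

    `marginals` maps each attribute to a dict {value: probability}.
    """
    rng = rng or np.random.default_rng()
    record = {}
    for attr, dist in marginals.items():
        values, probs = zip(*dist.items())
        record[attr] = values[rng.choice(len(values), p=np.array(probs))]
    return record
```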
3.3 Privacy-Preserving Structure Learning
Our generative model depends on the dependency structure between the random variables that represent data attributes. The dependency graph G embodies this structure. In this section, we present an algorithm that learns G from real data, in a privacy-preserving manner such that G does not significantly depend on individual data records.

The algorithm is based on maximizing a scoring function that reflects how correlated the attributes are according to the data. There are multiple approaches to this problem in the literature [35]. We use a method based on a well-studied machine learning problem: feature selection. For each attribute, the goal is to find the best set of features (among all attributes) to predict it, and add them as the attribute's parents, under the condition that the dependency graph remains acyclic.
The machine learning literature proposes several ways to rank features in terms of how well they can predict a particular attribute. One possibility is to calculate the information gain of each feature with the target attribute. The major downside with this approach is that it ignores the redundancy in information between the features. We propose to use a different approach, namely Correlation-based Feature Selection (CFS) [20], which consists in determining the best subset of predictive features according to some correlation measure. This is an optimization problem to select a subset of features that have high correlation with the target attribute and at the same time have low correlation among themselves. The task is to find the best subset of features which maximizes a merit score that captures our objective.
We follow [20] to compute the merit score for a parent set P_G(i) for attribute i as

    score(P_G(i)) = [ Σ_{j ∈ P_G(i)} corr(x_i, x_j) ] / √( |P_G(i)| + Σ_{j,k ∈ P_G(i)} corr(x_j, x_k) ),    (4)
where |P_G(i)| is the size of the parent set, and corr() is the correlation between the two random variables associated with two attributes. The numerator rewards correlation between parent attributes and the target attribute, and the denominator penalizes the inner-correlation among parent attributes. The suggested correlation metric in [20], which we use, is the symmetrical uncertainty coefficient:

    corr(x_i, x_j) = 2 − 2 H(x_i, x_j) / (H(x_i) + H(x_j)),    (5)
where H() is the entropy function.
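The symmetrical uncertainty in (5) can be estimated from the empirical distributions of two discrete attributes; a sketch:

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (in bits) of a discrete sample."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetrical_uncertainty(xi, xj):
    """corr(x_i, x_j) = 2 - 2 H(x_i, x_j) / (H(x_i) + H(x_j)), as in (5)."""
    h_i, h_j = entropy(xi), entropy(xj)
    h_ij = entropy(list(zip(xi, xj)))       # joint entropy H(x_i, x_j)
    if h_i + h_j == 0:
        return 0.0
    return 2.0 - 2.0 * h_ij / (h_i + h_j)
```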
The optimization objective in constructing G is to maximize the total score(P_G(i)) over all attributes i. Unfortunately, the number of possible solutions to search is exponential in the number of attributes, making it impractical to find the optimal solution. The greedy algorithm suggested in [20] is to start with an empty parent set for a target attribute and always add the attribute (feature) that maximizes the score.
There are two constraints in our optimization problem. First, the resulting dependency graph obtained from the set of best predictive features (i.e., parent attributes) for all attributes should be acyclic. This would allow us to decompose and compute the joint distribution over attributes as represented in (2).

Second, we enforce a maximum allowable complexity cost for the set of parents for each attribute. The cost is proportional to the number of possible joint value assignments (configurations) for the parent attributes. So, for each attribute i, the complexity cost constraint is
    cost(P_G(i)) = ∏_{j ∈ P_G(i)} |x_j| ≤ maxcost    (6)
where |x_j| is the total number of possible values that attribute j takes. This constraint prevents selecting too many parent attribute combinations for predicting an attribute. The larger the joint cardinality of attribute i's parents is, the fewer data points can be found to estimate the conditional probability Pr{x_i | {x_j}_{j ∈ P_G(i)}}. This would cause overfitting of the conditional probabilities on the data, which results in low-confidence parameter estimation in Section 3.4. The constraint prevents this.
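A greedy parent-set search combining the merit score (4) with the complexity constraint (6) could be sketched as follows. This is only an outline: the acyclicity constraint is assumed to be handled by the caller (e.g., by restricting the candidate set), and the exact tie-breaking and stopping rules of the paper's implementation are not reproduced here.

```python
import math

def merit_score(target, parents, corr):
    """Merit of a parent set for `target`, as in (4); corr(a, b) is symmetric."""
    if not parents:
        return 0.0
    num = sum(corr(target, j) for j in parents)
    den = math.sqrt(len(parents) +
                    sum(corr(j, l) for j in parents for l in parents if j != l))
    return num / den

def greedy_parent_set(target, candidates, corr, cardinality, maxcost):
    """Greedily add the feature that most improves the merit score,
    subject to the complexity constraint (6) on the parent set."""
    parents, best = [], 0.0
    while True:
        best_gain, best_feature = 0.0, None
        for f in candidates:
            if f in parents:
                continue
            cost = math.prod(cardinality[j] for j in parents + [f])  # as in (6)
            if cost > maxcost:
                continue
            score = merit_score(target, parents + [f], corr)
            if score - best > best_gain:
                best_gain, best_feature = score - best, f
        if best_feature is None:
            return parents
        parents.append(best_feature)
        best += best_gain
```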
To compute the score and cost functions, we discretize the parent attributes. Let bkt() be a discretizing function that partitions an attribute's values into buckets. If the attribute is continuous, it becomes discrete, and if it is already discrete, bkt() might reduce the number of its bins. Thus, we update the conditional probabilities as follows:

    Pr{x_i | {x_j}_{j ∈ P_G(i)}} ≈ Pr{x_i | {bkt(x_j)}_{j ∈ P_G(i)}}    (7)

where the discretization, of course, varies for each attribute. We update (4) and (6) according to (7). This approximation itself decreases the cost complexity of a parent set, and further prevents overfitting on the data.
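The paper does not fix a particular bkt(); one simple choice is equal-width bucketing, sketched below with numpy:

```python
import numpy as np

def bkt(values, n_buckets=10):
    """Discretize an attribute into equal-width buckets (returns bucket ids)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_buckets + 1)
    # np.digitize assigns each value to a bucket using the interior edges.
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_buckets - 1)
```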
3.3.1 Differential-Privacy Protection
In this section, we show how to safeguard the privacy of individuals whose records are in D, and could influence the model structure (which might leak about their data).

All the computations required for structure learning are reduced to computing the correlation metric (5) from D. Thus, we can achieve differential privacy [16] for the structure learning by simply adding appropriate noise to the metric. As the correlation metric is based on the entropy of a single or a pair of random variables, we only need to compute the entropy functions in a differentially-private way. We also need to make sure that the correlation metric remains in the [0, 1] range after using noisy entropy values.
Let H̃(z) be the noisy version of the entropy of a random variable z, where in our case z could be a single or a pair of random variables associated with the attributes and their discretized versions (as presented in (7)). To be able to compute the differentially-private correlation metric in all cases, we need to compute the noisy entropies H̃(x_i), H̃(bkt(x_i)), H̃(x_i, x_j), and H̃(x_i, bkt(x_j)), for all attributes i and j. For each of these cases, we generate fresh noise drawn from the Laplace distribution and compute the differentially-private entropy as

    H̃(z) = H(z) + Lap(Δ_H / ε_H)    (8)

where Δ_H is the sensitivity of the entropy function, and ε_H is the differential privacy parameter.
It can be shown that if z is a random variable with a probability distribution estimated from n_T = |D_T| data records, then an upper bound on the entropy sensitivity is

    Δ_H ≤ (1/n_T) [2 + 1/ln(2) + 2 log_2 n_T] = O(log_2(n_T) / n_T)    (9)
The proof of (9) can be found in the extended version of the paper ([4] Appendix B). Remark that Δ_H is a function of n_T (the number of records in D_T), which per se needs to be protected. As a defense, we compute Δ_H in a differentially-private manner, by once randomizing the number of records:

    ñ_T = n_T + Lap(1/ε_{n_T})    (10)

By using the randomized entropy values, according to (8), the model structure, which will be denoted by G̃, is differentially private. In Section 3.5, we use the composition theorems to analyze the total privacy of our algorithm for obtaining a differentially-private structure.
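Putting (8), (9), and (10) together, a differentially-private empirical entropy could be computed along these lines (a sketch; the max(2, ·) clamp on the noisy count is an added safeguard against negative noise, not part of the paper):

```python
import math
from collections import Counter

import numpy as np

def dp_entropy(values, eps_H, eps_nT, rng=None):
    """Differentially-private empirical entropy: H(z) + Lap(Delta_H / eps_H), as in (8)."""
    rng = rng or np.random.default_rng()
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    h = float(-(p * np.log2(p)).sum())                     # empirical H(z) in bits
    n_T = len(values)
    # (10): randomize the record count before using it in the sensitivity bound,
    # since Delta_H itself depends on |D_T|.
    n_tilde = max(2.0, n_T + rng.laplace(scale=1.0 / eps_nT))
    # (9): upper bound on the sensitivity of the entropy.
    delta_H = (2 + 1 / math.log(2) + 2 * math.log2(n_tilde)) / n_tilde
    return h + rng.laplace(scale=delta_H / eps_H)          # (8)
```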