
Membership Inference Attacks Against
Machine Learning Models
Reza Shokri
Cornell Tech
shokri@cornell.edu
Marco Stronati
INRIA (research performed while the author was at Cornell Tech)
marco@stronati.org
Congzheng Song
Cornell
cs2296@cornell.edu
Vitaly Shmatikov
Cornell Tech
shmat@cs.cornell.edu
Abstract—We quantitatively investigate how machine learning
models leak information about the individual data records on
which they were trained. We focus on the basic membership
inference attack: given a data record and black-box access to
a model, determine if the record was in the model’s training
dataset. To perform membership inference against a target model,
we make adversarial use of machine learning and train our own
inference model to recognize differences in the target model’s
predictions on the inputs that it trained on versus the inputs
that it did not train on.
We empirically evaluate our inference techniques on classi-
fication models trained by commercial “machine learning as a
service” providers such as Google and Amazon. Using realistic
datasets and classification tasks, including a hospital discharge
dataset whose membership is sensitive from the privacy perspec-
tive, we show that these models can be vulnerable to membership
inference attacks. We then investigate the factors that influence
this leakage and evaluate mitigation strategies.
I. INTRODUCTION
Machine learning is the foundation of popular Internet
services such as image and speech recognition and natural lan-
guage translation. Many companies also use machine learning
internally, to improve marketing and advertising, recommend
products and services to users, or better understand the data
generated by their operations. In all of these scenarios, ac-
tivities of individual users—their purchases and preferences,
health data, online and offline transactions, photos they take,
commands they speak into their mobile phones, locations they
travel to—are used as the training data.
Internet giants such as Google and Amazon are already
offering “machine learning as a service.” Any customer in
possession of a dataset and a data classification task can upload
this dataset to the service and pay it to construct a model.
The service then makes the model available to the customer,
typically as a black-box API. For example, a mobile-app maker
can use such a service to analyze users’ activities and query
the resulting model inside the app to promote in-app purchases
to users when they are most likely to respond. Some machine-
learning services also let data owners expose their models to
external users for querying or even sell them.
Our contributions. We focus on the fundamental question
known as membership inference: given a machine learning
model and a record, determine whether this record was used as
part of the model’s training dataset or not. We investigate this
question in the most difficult setting, where the adversary’s
access to the model is limited to black-box queries that
return the model’s output on a given input. In summary,
we quantify membership information leakage through the
prediction outputs of machine learning models.
To answer the membership inference question, we turn
machine learning against itself and train an attack model
whose purpose is to distinguish the target model’s behavior
on the training inputs from its behavior on the inputs that it
did not encounter during training. In other words, we turn the
membership inference problem into a classification problem.
Attacking black-box models such as those built by com-
mercial “machine learning as a service” providers requires
more sophistication than attacking white-box models whose
structure and parameters are known to the adversary. To
construct our attack models, we invented a shadow training
technique. First, we create multiple “shadow models” that
imitate the behavior of the target model, but for which we
know the training datasets and thus the ground truth about
membership in these datasets. We then train the attack model
on the labeled inputs and outputs of the shadow models.
We developed several effective methods to generate training
data for the shadow models. The first method uses black-box
access to the target model to synthesize this data. The second
method uses statistics about the population from which the
target’s training dataset was drawn. The third method assumes
that the adversary has access to a potentially noisy version
of the target’s training dataset. The first method does not
assume any prior knowledge about the distribution of the target
model’s training data, while the second and third methods
allow the attacker to query the target model only once before
inferring whether a given record was in its training dataset.
Our inference techniques are generic and not based on any
particular dataset or model type. We evaluate them against
neural networks, as well as black-box models trained using
Amazon ML and Google Prediction API. All of our experi-
ments on Amazon’s and Google’s platforms were done without
knowing the learning algorithms used by these services, nor
the architecture of the resulting models, since Amazon and
Google don’t reveal this information to the customers. For our
evaluation, we use realistic classification tasks and standard
model-training procedures on concrete datasets of images,
retail purchases, location traces, and hospital inpatient stays. In
addition to demonstrating that membership inference attacks
are successful, we quantify how their success relates to the
classification tasks and the standard metrics of overfitting.
Inferring information about the model’s training dataset
should not be confused with techniques such as model in-
version that use a model’s output on a hidden input to infer
something about this input [17] or to extract features that
characterize one of the model’s classes [16]. As explained
in [27] and Section IX, model inversion does not produce an
actual member of the model’s training dataset, nor, given a
record, does it infer whether this record was in the training
dataset. By contrast, the membership inference problem we
study in this paper is essentially the same as the well-known
problem of identifying the presence of an individual’s data in a
mixed pool given some statistics about the pool [3], [15], [21],
[29]. In our case, however, the goal is to infer membership
given a black-box API to a model of unknown structure, as
opposed to explicit statistics.
Our experimental results show that models created using
machine-learning-as-a-service platforms can leak a lot of in-
formation about their training datasets. For multi-class clas-
sification models trained on 10,000-record retail transaction
datasets using Google’s and Amazon’s services in default
configurations, our membership inference achieves median
accuracy of 94% and 74%, respectively. Even if we make
no prior assumptions about the distribution of the target
model’s training data and use fully synthetic data for our
shadow models, the accuracy of membership inference against
Google-trained models is 90%. Our results for the Texas
hospital discharge dataset (over 70% accuracy) indicate that
membership inference can present a risk to health-care datasets
if these datasets are used to train machine learning models
and access to the resulting models is open to the public.
Membership in such datasets is highly sensitive.
We discuss the root causes that make these attacks possi-
ble and quantitatively compare mitigation strategies such as
limiting the model’s predictions to top k classes, decreasing
the precision of the prediction vector, increasing its entropy,
or using regularization while training the model.
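As an illustration of the first three mitigations, the following minimal sketch (not the configuration evaluated in this paper; the function name, parameter values, and use of NumPy are assumptions) post-processes a prediction vector by keeping only the top k classes, coarsening the confidence values, and raising their entropy with a softmax temperature:

import numpy as np

def mitigate_prediction(probs, top_k=3, digits=2, temperature=1.0):
    """Post-process a prediction vector: keep only the top-k classes,
    coarsen the confidence values, and optionally raise their entropy
    with a softmax temperature > 1. Illustrative sketch only."""
    probs = np.asarray(probs, dtype=float)

    # Increase entropy by re-normalizing the log-probabilities at a temperature > 1.
    if temperature != 1.0:
        logits = np.log(probs + 1e-12) / temperature
        probs = np.exp(logits) / np.exp(logits).sum()

    # Restrict the output to the top-k classes and renormalize.
    if top_k is not None and top_k < probs.size:
        keep = np.argsort(probs)[-top_k:]
        truncated = np.zeros_like(probs)
        truncated[keep] = probs[keep]
        probs = truncated / truncated.sum()

    # Decrease precision by rounding each confidence value.
    return np.round(probs, digits)

# Example: a sharp 5-class prediction becomes coarser and flatter.
print(mitigate_prediction([0.90, 0.05, 0.03, 0.01, 0.01], top_k=3, digits=2, temperature=2.0))

Such post-processing limits what the prediction API reveals without retraining the underlying model.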
In summary, this paper demonstrates and quantifies the
problem of machine learning models leaking information
about their training datasets. To create our attack models, we
developed a new shadow learning technique that works with
minimal knowledge about the target model and its training
dataset. Finally, we quantify how the leakage of membership
information is related to model overfitting.
II. MACHINE LEARNING BACKGROUND
Machine learning algorithms help us better understand and
analyze complex data. When the model is created using
unsupervised training, the objective is to extract useful features
from the unlabeled data and build a model that explains its
hidden structure. When the model is created using supervised
training, which is the focus of this paper, the training records
(as inputs of the model) are assigned labels or scores (as
outputs of the model). The goal is to learn the relationship
between the data and the labels and construct a model that can
generalize to data records beyond the training set [19]. Model-
training algorithms aim to minimize the model’s prediction er-
ror on the training dataset and thus may overfit to this dataset,
producing models that perform better on the training inputs
than on the inputs drawn from the same population but not
used during the training. Many regularization techniques have
been proposed to prevent models from becoming overfitted
to their training datasets while minimizing their prediction
error [19].
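A minimal sketch of the resulting train-test accuracy gap, a standard way to measure overfitting (the dataset, model, and scikit-learn setup below are illustrative assumptions, not this paper's experimental configuration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Build a synthetic multi-class task and deliberately give the model room to overfit.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500,
                      random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)   # accuracy on the training inputs
test_acc = model.score(X_test, y_test)      # accuracy on unseen inputs from the same population
print(f"train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}, "
      f"overfitting gap {train_acc - test_acc:.2f}")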
Supervised training is often used for classification and other
prediction tasks. For example, a retailer may train a model
that predicts a customer’s shopping style in order to offer her
suitable incentives, while a medical researcher may train a
model to predict which treatment is most likely to succeed
given a patient’s clinical symptoms or genetic makeup.
Machine learning as a service. Major Internet companies
now offer machine learning as a service on their cloud
platforms. Examples include Google Prediction API,
1
Amazon
Machine Learning (Amazon ML),
2
Microsoft Azure Machine
Learning (Azure ML),
3
and BigML.
4
These platforms provide simple APIs for uploading the data
and for training and querying models, thus making machine
learning technologies available to any customer. For example,
a developer may create an app that gathers data from users,
uploads it into the cloud platform to train a model (or update
an existing model with new data), and then uses the model’s
predictions inside the app to improve its features or better
interact with the users. Some platforms even envision data
holders training a model and then sharing it with others
through the platform’s API for profit (e.g., the Google Prediction API
gallery, https://cloud.google.com/prediction/docs/gallery).
The details of the models and the training algorithms are
hidden from the data owners. The type of the model may be
chosen by the service adaptively, depending on the data and
perhaps accuracy on validation subsets. Service providers do
not warn customers about the consequences of overfitting and
provide little or no control over regularization. For example,
Google Prediction API hides all details, while Amazon ML
provides only a very limited set of pre-defined options (L1- or
L2-norm regularization). The models cannot be downloaded
and are accessed only through the service’s API. Service
providers derive revenue mainly by charging customers for
queries through this API. Therefore, we treat “machine learn-
ing as a service” as a black box. All inference attacks we
demonstrate in this paper are performed entirely through the
services’ standard APIs.
III. PRIVACY IN MACHINE LEARNING
Before dealing with inference attacks, we need to define
what privacy means in the context of machine learning or,
alternatively, what it means for a machine learning model to
breach privacy.
A. Inference about members of the population
A plausible notion of privacy, known in statistical disclosure
control as the “Dalenius desideratum,” states that the model
should reveal no more about the input to which it is applied
than would have been known about this input without applying
the model. This cannot be achieved by any useful model [14].
A related notion of privacy appears in prior work on model
inversion [17]: a privacy breach occurs if an adversary can
use the model’s output to infer the values of unintended
(sensitive) attributes used as input to the model. As observed
in [27], it may not be possible to prevent this “breach” if
the model is based on statistical facts about the population.
For example, suppose that training the model has uncovered
a high correlation between a person’s externally observable
phenotype features and their genetic predisposition to a certain
disease. This correlation is now a publicly known scientific
fact that allows anyone to infer information about the person’s
genome after observing that person.
Critically, this correlation applies to all members of a given
population. Therefore, the model breaches “privacy” not just of
the people whose data was used to create the model, but also of
other people from the same population, even those whose data
was not used and whose identities may not even be known to
the model’s creator (i.e., this is “spooky action at a distance”).
Valid models generalize, i.e., they make accurate predictions
on inputs that were not part of their training datasets. This
means that the creator of a generalizable model cannot do
anything to protect “privacy” as defined above because the
correlations on which the model is based—and the inferences
that these correlations enable—hold for the entire population,
regardless of how the training sample was chosen or how the
model was created from this sample.
B. Inference about members of the training dataset
To bypass the difficulties inherent in defining and protecting
privacy of the entire population, we focus on protecting privacy
of the individuals whose data was used to train the model. This
motivation is closely related to the original goals of differential
privacy [13].
Of course, members of the training dataset are members
of the population, too. We investigate what the model reveals
about them beyond what it reveals about an arbitrary member
of the population. Our ultimate goal is to measure the mem-
bership risk that a person incurs if they allow their data to be
used to train a model.
The basic attack in this setting is membership inference,
i.e., determining whether a given data record was part of the
model’s training dataset or not. When a record is fully known
to the adversary, learning that it was used to train a particular
model is an indication of information leakage through the
model. In some cases, it can directly lead to a privacy breach.
For example, knowing that a certain patient’s clinical record
was used to train a model associated with a disease (e.g., to
determine the appropriate medicine dosage or to discover the
genetic basis of the disease) can reveal that the patient has this
disease.
We investigate the membership inference problem in the
black-box scenario where the adversary can only supply inputs
to the model and receive the model’s output(s). In some
situations, the model is available to the adversary indirectly.
For example, an app developer may use a machine-learning
service to construct a model from the data collected by the app
and have the app make API calls to the resulting model. In this
case, the adversary would supply inputs to the app (rather than
directly to the model) and receive the app’s outputs (which are
based on the model’s outputs). The details of internal model
usage vary significantly from app to app. For simplicity and
generality, we will assume that the adversary directly supplies
inputs to and receives outputs from the black-box model.
IV. PROBLEM STATEMENT
Consider a set of labeled data records sampled from some
population and partitioned into classes. We assume that a
machine learning algorithm is used to train a classification
model that captures the relationship between the content of
the data records and their labels.
For any input data record, the model outputs the prediction
vector of probabilities, one per class, that the record belongs
to a certain class. We will also refer to these probabilities
as confidence values. The class with the highest confidence
value is selected as the predicted label for the data record.
The accuracy of the model is evaluated by measuring how it
generalizes beyond its training set and predicts the labels of
other data records from the same population.
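For example, a minimal sketch with made-up numbers (not taken from any dataset used in this paper):

import numpy as np

# A 3-class prediction vector of confidence values; entries lie in [0, 1] and sum to 1.
prediction_vector = np.array([0.07, 0.81, 0.12])

predicted_label = int(np.argmax(prediction_vector))        # class with the highest confidence
confidence = float(prediction_vector[predicted_label])
print(predicted_label, confidence)                         # -> 1 0.81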
We assume that the attacker has query access to the model
and can obtain the model’s prediction vector on any data
record. The attacker knows the format of the inputs and
outputs of the model, including their number and the range of
values they can take. We also assume that the attacker either
(1) knows the type and architecture of the machine learning
model, as well as the training algorithm, or (2) has black-box
access to a machine learning oracle (e.g., a “machine learning
as a service” platform) that was used to train the model. In
the latter case, the attacker does not know a priori the model’s
structure or meta-parameters.
The attacker may have some background knowledge about
the population from which the target model’s training dataset
was drawn. For example, he may have independently drawn
samples from the population, disjoint from the target model’s
training dataset. Alternatively, the attacker may know some
general statistics about the population, for example, the
marginal distribution of feature values.
The setting for our inference attack is as follows. The
attacker is given a data record and black-box query access
to the target model. The attack succeeds if the attacker can
correctly determine whether this data record was part of the
model’s training dataset or not. The standard metrics for attack
accuracy are precision (what fraction of records inferred as
members are indeed members of the training dataset) and
recall (what fraction of the training dataset’s members are
correctly inferred as members by the attacker).

Fig. 1: Membership inference attack in the black-box setting. The
attacker queries the target model with a data record and obtains
the model’s prediction on that record. The prediction is a vector of
probabilities, one per class, that the record belongs to a certain class.
This prediction vector, along with the label of the target record, is
passed to the attack model, which infers whether the record was in
or out of the target model’s training dataset.

Fig. 2: Training shadow models using the same machine learning
platform as was used to train the target model. The training datasets
of the target and shadow models have the same format but are disjoint.
The training datasets of the shadow models may overlap. All models’
internal parameters are trained independently.
V. MEMBERSHIP INFERENCE
A. Overview of the attack
Our membership inference attack exploits the observation
that machine learning models often behave differently on the
data that they were trained on versus the data that they “see”
for the first time. Overfitting is a common reason but not the
only one (see Section VII). The objective of the attacker is to
construct an attack model that can recognize such differences
in the target model’s behavior and use them to distinguish
members from non-members of the target model’s training
dataset based solely on the target model’s output.
Our attack model is a collection of models, one for each
output class of the target model. This increases the accuracy of the
attack because the target model produces different distributions
over its output classes depending on the input’s true class.
To train our attack model, we build multiple “shadow”
models intended to behave similarly to the target model. In
contrast to the target model, we know the ground truth for each
shadow model, i.e., whether a given record was in its training
dataset or not. Therefore, we can use supervised training on
the inputs and the corresponding outputs (each labeled “in” or
“out”) of the shadow models to teach the attack model how to
distinguish the shadow models’ outputs on members of their
training datasets from their outputs on non-members.
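A minimal sketch of this labeling step (an illustration; the helper name attack_training_records, the predict_proba interface, and the data layout are assumptions, not the paper's code):

import numpy as np

def attack_training_records(shadow_models, shadow_splits):
    """shadow_models[i] exposes predict_proba(X) (e.g., a scikit-learn classifier);
    shadow_splits[i] = (X_in, y_in, X_out, y_out), where X_in/y_in were used to train
    shadow model i and X_out/y_out were held out from it."""
    rows, membership = [], []
    for model, (X_in, y_in, X_out, y_out) in zip(shadow_models, shadow_splits):
        for X, y, label in ((X_in, y_in, 1), (X_out, y_out, 0)):
            for vector, true_class in zip(model.predict_proba(X), y):
                rows.append((int(true_class), vector))   # attack input: true class + prediction vector
                membership.append(label)                 # ground truth: 1 = "in", 0 = "out"
    return rows, np.array(membership)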
Formally, let f_target() be the target model, and let D_target^train
be its private training dataset, which contains labeled data
records (x^(i), y^(i))_target. A data record x_target^(i) is the input to
the model, and y_target^(i) is the true label that can take values
from a set of classes of size c_target. The output of the target
model is a probability vector of size c_target. The elements of
this vector are in [0, 1] and sum up to 1.
Let f_attack() be the attack model. Its input x_attack is composed
of a correctly labeled record and a prediction vector of size
c_target. Since the goal of the attack is decisional membership
inference, the attack model is a binary classifier with two
output classes, “in” and “out.”
Figure 1 illustrates our end-to-end attack process. For a
labeled record (x, y), we use the target model to compute
the prediction vector 𝐲 = f_target(x). The distribution of 𝐲
(classification confidence values) depends heavily on the true
class of x. This is why we pass the true label y of x in
addition to the model’s prediction vector 𝐲 to the attack
model. Given how the probabilities in 𝐲 are distributed around
y, the attack model computes the membership probability
Pr{(x, y) ∈ D_target^train}, i.e., the probability that ((x, y), 𝐲)
belongs to the “in” class or, equivalently, that x is in the
training dataset of f_target().
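Continuing the illustrative sketch above, the per-class attack models can be realized, for example, with off-the-shelf binary classifiers (the choice of MLPClassifier and its hyperparameters are assumptions, not the attack-model architecture evaluated in this paper):

import numpy as np
from sklearn.neural_network import MLPClassifier

def train_attack_models(rows, membership, num_classes):
    """rows: list of (true_class, prediction_vector); membership: 1 = "in", 0 = "out"."""
    models = {}
    for c in range(num_classes):
        idx = [i for i, (true_class, _) in enumerate(rows) if true_class == c]
        X = np.array([rows[i][1] for i in idx])
        y = membership[idx]
        # One binary in/out classifier per output class of the target model.
        models[c] = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                  random_state=0).fit(X, y)
    return models

def membership_probability(attack_models, true_class, prediction_vector):
    # Probability that (x, y) was a member of the target model's training dataset.
    return attack_models[true_class].predict_proba([prediction_vector])[0, 1]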
The main challenge is how to train the attack model to
distinguish members from non-members of the target model’s
training dataset when the attacker has no information about the
internal parameters of the target model and only limited query
access to it through the public API. To solve this conundrum,
we developed a shadow training technique that lets us train
the attack model on proxy targets for which we do know the
training dataset and can thus perform supervised training.
B. Shadow models
The attacker creates k shadow models f_shadow^i(). Each
shadow model i is trained on a dataset D_shadow_i^train of the same
format as and distributed similarly to the target model’s train-
ing dataset. These shadow training datasets can be generated
using one of the methods described in Section V-C. We assume
that the datasets used for training the shadow models are
disjoint from the private dataset used to train the target model
(∀i, D_shadow_i^train ∩ D_target^train = ∅). This is the worst case for the
attacker; the attack will perform even better if the training
datasets happen to overlap.
The shadow models must be trained in a similar way to
the target model. This is easy if the target’s training algorithm
(e.g., neural networks, SVM, logistic regression) and model
structure (e.g., the wiring of a neural network) are known.
Machine learning as a service is more challenging. Here the
type and structure of the target model are not known, but
the attacker can use exactly the same service (e.g., Google
Prediction API) to train the shadow model as was used to
train the target model—see Figure 2.

Algorithm 1 Data synthesis using the target model
procedure SYNTHESIZE(class c)
    x ← RANDRECORD()                    ▷ initialize a record randomly
    y*_c ← 0
    j ← 0
    k ← k_max
    for iteration = 1 ... iter_max do
        y ← f_target(x)                 ▷ query the target model
        if y_c ≥ y*_c then              ▷ accept the record
            if y_c > conf_min and c = arg max(y) then
                if rand() < y_c then    ▷ sample
                    return x            ▷ synthetic data
                end if
            end if
            x* ← x
            y*_c ← y_c
            j ← 0
        else
            j ← j + 1
            if j > rej_max then         ▷ many consecutive rejects
                k ← max(k_min, k/2)
                j ← 0
            end if
        end if
        x ← RANDRECORD(x*, k)           ▷ randomize k features of x*
    end for
    return ⊥                            ▷ failed to synthesize
end procedure
The more shadow models, the more accurate the attack
model will be. As described in Section V-D, the attack model
is trained to recognize differences in shadow models’ behavior
when these models operate on inputs from their own training
datasets versus inputs they did not encounter during training.
Therefore, more shadow models provide more training fodder
for the attack model.
C. Generating training data for shadow models
To train shadow models, the attacker needs training data
that is distributed similarly to the target model’s training data.
We developed several methods for generating such data.
Model-based synthesis. If the attacker has neither real
training data nor any statistics about its distribution, he can
generate synthetic training data for the shadow models using
the target model itself. The intuition is that records that are
classified by the target model with high confidence should
be statistically similar to the target’s training dataset and thus
provide good fodder for shadow models.
The synthesis process runs in two phases: (1) search, using
a hill-climbing algorithm, the space of possible data records
to find inputs that are classified by the target model with high
confidence; (2) sample synthetic data from these records. After
this process synthesizes a record, the attacker can repeat it until
the training dataset for shadow models is full.
See Algorithm 1 for the pseudocode of our synthesis
procedure. First, fix class c for which the attacker wants to
generate synthetic data. The first phase is an iterative process.
Start by randomly initializing a data record x. Assuming that
the attacker knows only the syntactic format of data records,
sample the value for each feature uniformly at random from
among all possible values of that feature. In each iteration,
propose a new record. A proposed record is accepted only
if it increases the hill-climbing objective: the probability of
being classified by the target model as class c.
Each iteration involves proposing a new candidate record by
changing k randomly selected features of the latest accepted
record x*. This is done by flipping binary features or resam-
pling new values for features of other types. We initialize k to
k_max and divide it by 2 when rej_max subsequent proposals
are rejected. This controls the diameter of the search around the
accepted record when proposing a new record. We set the
minimum value of k to k_min, which controls the speed of the
search for new records with a potentially higher classification
probability y_c.
The second, sampling phase starts when the target model’s
probability y_c that the proposed data record is classified as
belonging to class c is larger than the probabilities for all
other classes and also larger than a threshold conf_min. This
ensures that the predicted label for the record is c, and that the
target model is sufficiently confident in its label prediction. We
select such a record for the synthetic dataset with probability y_c
and, if selection fails, repeat until a record is selected.
This synthesis procedure works only if the adversary can
efficiently explore the space of possible inputs and discover
inputs that are classified by the target model with high confi-
dence. For example, it may not work if the inputs are high-
resolution images and the target model performs a complex
image classification task.
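A minimal, runnable rendering of Algorithm 1 (a sketch assuming binary features and a query function that returns the target model's prediction vector; the default parameter values are illustrative, not those used in the experiments):

import random

def synthesize(query, num_features, c, k_max=16, k_min=1,
               iter_max=1000, conf_min=0.8, rej_max=10):
    """Hill-climbing search followed by sampling, mirroring Algorithm 1.
    query(x) must return the target model's prediction vector for record x."""
    x = [random.randint(0, 1) for _ in range(num_features)]   # random initial record
    x_star, y_c_star = list(x), 0.0                           # last accepted record and its score
    j, k = 0, min(k_max, num_features)
    for _ in range(iter_max):
        y = query(x)                              # query the target model
        if y[c] >= y_c_star:                      # accept the proposal
            if y[c] > conf_min and c == max(range(len(y)), key=lambda i: y[i]):
                if random.random() < y[c]:        # sample with probability y_c
                    return x                      # synthetic record
            x_star, y_c_star, j = list(x), y[c], 0
        else:
            j += 1
            if j > rej_max:                       # too many consecutive rejects
                k, j = max(k_min, k // 2), 0      # shrink the search diameter
        x = list(x_star)
        for i in random.sample(range(num_features), k):
            x[i] = 1 - x[i]                       # randomize k features of the accepted record
    return None                                   # failed to synthesize

For instance, synthesize(query, num_features=600, c=3) would attempt to produce one record that the target model confidently assigns to class 3.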
Statistics-based synthesis. The attacker may have some statis-
tical information about the population from which the target
model’s training data was drawn. For example, the attacker
may have prior knowledge of the marginal distributions of
different features. In our experiments, we generate synthetic
training records for the shadow models by independently
sampling the value of each feature from its own marginal
distribution. The resulting attack models are very effective.
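A minimal sketch of this statistics-based synthesis (illustrative; the per-feature marginals format and the function name are assumptions):

import numpy as np

def sample_from_marginals(marginals, n_records, seed=0):
    """marginals: one dict per feature, mapping feature value -> marginal probability."""
    rng = np.random.default_rng(seed)
    records = []
    for _ in range(n_records):
        # Draw each feature independently from its own marginal distribution.
        record = [int(rng.choice(list(m.keys()), p=list(m.values()))) for m in marginals]
        records.append(record)
    return records

# Example: two binary features with different marginal frequencies.
print(sample_from_marginals([{0: 0.7, 1: 0.3}, {0: 0.1, 1: 0.9}], n_records=3))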
Noisy real data. The attacker may have access to some data
that is similar to the target model’s training data and can be
considered as a “noisy” version thereof. In our experiments
with location datasets, we simulate this by flipping the (bi-
nary) values of 10% or 20% randomly selected features, then

Citations
Proceedings ArticleDOI
30 Oct 2017
TL;DR: In this paper, the authors proposed a protocol for secure aggregation of high-dimensional data in federated deep learning, which allows a server to compute the sum of large, user-held data vectors from mobile devices in a secure manner without learning each user's individual contribution.
Abstract: We design a novel, communication-efficient, failure-robust protocol for secure aggregation of high-dimensional data. Our protocol allows a server to compute the sum of large, user-held data vectors from mobile devices in a secure manner (i.e. without learning each user's individual contribution), and can be used, for example, in a federated learning setting, to aggregate user-provided model updates for a deep neural network. We prove the security of our protocol in the honest-but-curious and active adversary settings, and show that security is maintained even if an arbitrarily chosen subset of users drop out at any time. We evaluate the efficiency of our protocol and show, by complexity analysis and a concrete implementation, that its runtime and communication overhead remain low even on large data sets and client pools. For 16-bit input values, our protocol offers 1.73× communication expansion for 2^10 users and 2^20-dimensional vectors, and 1.98× expansion for 2^14 users and 2^24-dimensional vectors over sending data in the clear.

1,890 citations

Journal ArticleDOI
TL;DR: It is found that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art.
Abstract: Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems-patient classification, fundamental biological processes and treatment of patients-and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

1,491 citations

Journal ArticleDOI
TL;DR: In this paper, the authors review recent findings on adversarial examples for DNNs, summarize the methods for generating adversarial samples, and propose a taxonomy of these methods.
Abstract: With rapid progress and significant successes in a wide spectrum of applications, deep learning is being applied in many safety-critical environments. However, deep neural networks (DNNs) have been recently found vulnerable to well-designed input samples called adversarial examples . Adversarial perturbations are imperceptible to human but can easily fool DNNs in the testing/deploying stage. The vulnerability to adversarial examples becomes one of the major risks for applying DNNs in safety-critical environments. Therefore, attacks and defenses on adversarial examples draw great attention. In this paper, we review recent findings on adversarial examples for DNNs, summarize the methods for generating adversarial examples, and propose a taxonomy of these methods. Under the taxonomy, applications for adversarial examples are investigated. We further elaborate on countermeasures for adversarial examples. In addition, three major challenges in adversarial examples and the potential solutions are discussed.

1,203 citations

Posted Content
TL;DR: Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
Abstract: Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.

1,107 citations


Cites background from "Membership Inference Attacks Against Machine Learning Models":

  • ...For classic (non-federated) models of computation, understanding a model’s susceptibility to attacks is an active and challenging research area [167, 357, 91, 293]....

Proceedings ArticleDOI
19 May 2019
TL;DR: In this article, passive and active inference attacks are proposed to exploit the leakage of information about participants' training data in federated learning, where each participant can infer the presence of exact data points and properties that hold only for a subset of the training data and are independent of the properties of the joint model.
Abstract: Collaborative machine learning and related techniques such as federated learning allow multiple participants, each with his own training dataset, to build a joint model by training locally and periodically exchanging model updates. We demonstrate that these updates leak unintended information about participants' training data and develop passive and active inference attacks to exploit this leakage. First, we show that an adversarial participant can infer the presence of exact data points -- for example, specific locations -- in others' training data (i.e., membership inference). Then, we show how this adversary can infer properties that hold only for a subset of the training data and are independent of the properties that the joint model aims to capture. For example, he can infer when a specific person first appears in the photos used to train a binary gender classifier. We evaluate our attacks on a variety of tasks, datasets, and learning configurations, analyze their limitations, and discuss possible defenses.

1,084 citations

References
Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

33,597 citations


"Membership Inference Attacks Agains..." refers background in this paper

  • ...Regularization techniques such as dropout [31] can help defeat overfitting and also strengthen...

    [...]

Book
28 Jul 2013
TL;DR: In this paper, the authors describe the important ideas in these areas in a common conceptual framework, and the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting---the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for ``wide'' data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

19,261 citations

Posted Content
TL;DR: This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
Abstract: A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

12,857 citations


"Membership Inference Attacks Agains..." refers methods in this paper

  • ...This technique, also used in knowledge distillation and information transfer between models [20], would increase the entropy of the prediction vector....

    [...]

Journal ArticleDOI
TL;DR: The Elements of Statistical Learning: Data Mining, Inference, and Prediction is a widely used reference for statistical learning, data mining, and prediction.
Abstract: (2004). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Journal of the American Statistical Association: Vol. 99, No. 466, pp. 567-567.

10,549 citations

Book ChapterDOI
04 Mar 2006
TL;DR: In this article, the authors show that for several particular applications substantially less noise is needed than was previously understood to be the case, and also show the separation results showing the increased value of interactive sanitization mechanisms over non-interactive.
Abstract: We continue a line of research initiated in [10,11]on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function f mapping databases to reals, the so-called true answer is the result of applying f to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user. Previous work focused on the case of noisy sums, in which f = ∑ig(xi), where xi denotes the ith row of the database and g maps database rows to [0,1]. We extend the study to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function f. Roughly speaking, this is the amount that any single argument to f can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case. The first step is a very clean characterization of privacy in terms of indistinguishability of transcripts. Additionally, we obtain separation results showing the increased value of interactive sanitization mechanisms over non-interactive.

6,211 citations