Towards Making Systems Forget with Machine Unlearning
Yinzhi Cao and Junfeng Yang
Columbia University
{yzcao, junfeng}@cs.columbia.edu
Abstract—Today’s systems produce a rapidly exploding
amount of data, and the data further derives more data, forming
a complex data propagation network that we call the data’s
lineage. There are many reasons that users want systems to forget
certain data including its lineage. From a privacy perspective,
users who become concerned with new privacy risks of a system
often want the system to forget their data and lineage. From a
security perspective, if an attacker pollutes an anomaly detector
by injecting manually crafted data into the training data set,
the detector must forget the injected data to regain security.
From a usability perspective, a user can remove noise and
incorrect entries so that a recommendation engine gives useful
recommendations. Therefore, we envision forgetting systems,
capable of forgetting certain data and their lineages, completely
and quickly.
This paper focuses on making learning systems forget, the
process of which we call machine unlearning, or simply unlearning.
We present a general, efficient unlearning approach
by transforming learning algorithms used by a system into a
summation form. To forget a training data sample, our approach
simply updates a small number of summations, which is asymptotically
faster than retraining from scratch. Our approach is general,
because the summation form comes from statistical query learning,
in which many machine learning algorithms can be implemented.
Our approach also applies to all stages of machine learning,
including feature selection and modeling. Our evaluation, on four
diverse learning systems and real-world workloads, shows that
our approach is general, effective, fast, and easy to use.
I. INTRODUCTION
A. The Need for Systems to Forget
Today’s systems produce a rapidly exploding amount of
data, ranging from personal photos and office documents to
logs of user clicks on a website or mobile device [15]. From
this data, the systems perform a myriad of computations to
derive even more data. For instance, backup systems copy data
from one place (e.g., a mobile device) to another (e.g., the
cloud). Photo storage systems re-encode a photo into different
formats and sizes [23, 53]. Analytics systems aggregate raw
data such as click logs into insightful statistics. Machine learn-
ing systems extract models and properties (e.g., the similarities
of movies) from training data (e.g., historical movie ratings)
using advanced algorithms. This derived data can recursively
derive more data, such as a recommendation system predicting
a user’s rating of a movie based on movie similarities. In short,
a piece of raw data in today’s systems often goes through
a series of computations, “creeping” into many places and
appearing in many forms. The data, computations, and derived
data together form a complex data propagation network that
we call the data’s lineage.
For a variety of reasons, users want a system to forget
certain sensitive data and its complete lineage. Consider pri-
vacy first. After Facebook changed its privacy policy, many
users deleted their accounts and the associated data [69].
The iCloud photo hacking incident [8] led to online articles
teaching users how to completely delete iOS photos including
the backups [79]. New privacy research revealed that machine
learning models for personalized warfarin dosing leak patients’
genetic markers [43], and a small set of statistics on genet-
ics and diseases suffices to identify individuals [78]. Users
unhappy with these newfound risks naturally want their data
and its influence on the models and statistics to be completely
forgotten. System operators or service providers have strong
incentives to honor users’ requests to forget data, both to keep
users happy and to comply with the law [72]. For instance,
Google had removed 171,183 links [50] by October 2014
under the “right to be forgotten” ruling of the highest court in
the European Union.
Security is another reason that users want data to be
forgotten. Consider anomaly detection systems. The security
of these systems hinges on the model of normal behaviors extracted
from the training data. By polluting¹ the training data,
attackers pollute the model, thus compromising security. For
instance, Perdisci et al. [56] show that PolyGraph [55], a worm
detection engine, fails to generate useful worm signatures if
the training data is injected with well-crafted fake network
flows. Once the polluted data is identified, the system must
completely forget the data and its lineage to regain security.
Usability is a third reason. Consider the recommendation
or prediction system Google Now [7]. It infers a user’s
preferences from her search history, browsing history, and
other analytics. It then pushes recommendations, such as news
about a show, to the user. Noise or incorrect entries in analytics
can seriously degrade the quality of the recommendation. One
of our lab members experienced this problem first-hand. He
loaned his laptop to a friend who searched for a TV show
(“Jeopardy!”) on Google [1]. He then kept getting news about
this show on his phone, even after he deleted the search record
from his search history.
We believe that systems must be designed under the core
principle of completely and quickly forgetting sensitive data
and its lineage for restoring privacy, security, and usability.
Such forgetting systems must carefully track data lineage
even across statistical processing or machine learning, and
make this lineage visible to users. They let users specify
¹ In this paper, we use the term pollute [56] instead of poison [47, 77].

the data to forget with different levels of granularity. For
instance, a privacy-conscious user who accidentally searches
for a sensitive keyword without concealing her identity can
request that the search engine forget that particular search
record. These systems then remove the data and revert its
effects so that all future operations run as if the data had never
existed. They collaborate to forget data if the lineage spans
across system boundaries (e.g., in the context of web mashup
services). This collaborative forgetting potentially scales to
the entire Web. Users trust forgetting systems to comply
with requests to forget, because the aforementioned service
providers have strong incentives to comply, but other trust
models are also possible. The usefulness of forgetting systems
can be evaluated with two metrics: how completely they can
forget data (completeness) and how quickly they can do so
(timeliness). The higher these metrics, the better the systems are
at restoring privacy, security, and usability.
We foresee easy adoption of forgetting systems because they
benefit both users and service providers. With the flexibility
to request that systems forget data, users have more control
over their data, so they are more willing to share data with the
systems. More data also benefit the service providers, because
they have more profit opportunities from their services and fewer legal
risks. In addition, we envision forgetting systems playing a
crucial role in emerging data markets [3, 40, 61] where users
trade data for money, services, or other data because the
mechanism of forgetting enables a user to cleanly cancel a
data transaction or rent out the use rights of her data without
giving up the ownership.
Forgetting systems are complementary to much existing
work [55, 75, 80]. Systems such as Google Search [6] can
forget a user’s raw data upon request, but they ignore the
lineage. Secure deletion [32, 60, 70] prevents deleted data from
being recovered from the storage media, but it largely ignores
the lineage, too. Information flow control [41, 67] can be
leveraged by forgetting systems to track data lineage. However,
it typically tracks only direct data duplication, not statistical
processing or machine learning, to avoid taint explosion.
Differential privacy [75, 80] preserves the privacy of each indi-
vidual item in a data set equally and invariably by restricting
accesses only to the whole data set’s statistics fuzzed with
noise. This restriction is at odds with today’s systems such
as Facebook and Google Search which, authorized by billions
of users, routinely access personal data for accurate results.
Unsurprisingly, it is impossible to strike a balance between
utility and privacy in state-of-the-art implementations [43]. In
contrast, forgetting systems aim to restore privacy on select
data. Although private data may still propagate, the lineage of
this data within the forgetting systems is carefully tracked and
removed completely and in a timely manner upon request. In
addition, this fine-grained data removal caters to an individual
user’s privacy consciousness and the data item’s sensitivity.
Forgetting systems conform to the trust and usage models of
today’s systems, representing a more practical privacy vs util-
ity tradeoff. Researchers also proposed mechanisms to make
systems more robust against training data pollution [27, 55].
Fig. 1: Unlearning idea. Instead of making a model directly depend
on each training data sample (left), we convert the learning algorithm
into a summation form (right). Specifically, each summation is the
sum of transformed data samples, where the transformation functions
g_i are efficiently computable. There are only a small number of
summations, and the learning algorithm depends only on summations.
To forget a data sample, we simply update the summations and then
compute the updated model. This approach is asymptotically much
faster than retraining from scratch.
However, despite these mechanisms (and the others discussed
so far such as differential privacy), users may still request
systems to forget data due to, for example, policy changes and
new attacks against the mechanisms [43, 56]. These requests
can be served only by forgetting systems.
B. Machine Unlearning
While there are numerous challenges in making systems
forget, this paper focuses on one of the most difficult chal-
lenges: making machine learning systems forget. These sys-
tems extract features and models from training data to answer
questions about new data. They are widely used in many
areas of science [25, 35, 37, 46, 55, 63–65]. To forget a piece
of training data completely, these systems need to revert the
effects of the data on the extracted features and models. We
call this process machine unlearning, or unlearning for short.
A naïve approach to unlearning is to retrain the features
and models from scratch after removing the data to forget.
However, when the set of training data is large, this approach
is quite slow, increasing the timing window during which the
system is vulnerable. For instance, with a real-world data set
from Huawei (see §VII), it takes Zozzle [35], a JavaScript
malware detector, over a day to retrain and forget a polluted
sample.
We present a general approach to efficient unlearning, with-
out retraining from scratch, for a variety of machine learning
algorithms widely used in real-world systems. To prepare for
unlearning, we transform learning algorithms in a system to
a form consisting of a small number of summations [33].
Each summation is the sum of some efficiently computable
transformation of the training data samples. The learning
algorithms depend only on the summations, not individual
data. These summations are saved together with the trained
model. (The rest of the system may still ask for individual data
and there is no injected noise as there is in differential privacy.)
Then, in the unlearning process, we subtract the data to forget

from each summation, and then update the model. As Figure 1
illustrates, forgetting a data item now requires recomputing
only a small number of terms, asymptotically faster than
retraining from scratch by a factor equal to the size of the
training data set. For the aforementioned Zozzle example, our
unlearning approach takes less than a second, compared to
a day for retraining. It is general because the summation form
is from statistical query (SQ) learning [48]. Many machine
learning algorithms, such as naïve Bayes classifiers, support
vector machines, and k-means clustering, can be implemented
as SQ learning. Our approach also applies to all stages of
machine learning, including feature selection and modeling.
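
To make the summation form concrete, here is a minimal sketch (our illustration in Python, not the paper's code; the class and function names are ours) of a linear least-squares learner, in the spirit of the summation-form examples in [33]. The model depends on the training data only through the summations S_xx = sum_i x_i x_i^T and S_xy = sum_i y_i x_i, so forgetting one sample subtracts that sample's contribution and re-solves a small linear system, without rescanning the remaining samples.

# Illustrative summation-form learner (assumed names; not from the paper).
import numpy as np

class SummationLeastSquares:
    def __init__(self, dim):
        self.S_xx = np.zeros((dim, dim))   # sum over samples of x x^T
        self.S_xy = np.zeros(dim)          # sum over samples of y * x

    def learn(self, X, y):
        # X: (n, dim) training samples, y: (n,) targets.
        self.S_xx += X.T @ X
        self.S_xy += X.T @ y

    def unlearn(self, x, y):
        # Forget one sample: a constant-size update, independent of the training set size.
        self.S_xx -= np.outer(x, x)
        self.S_xy -= y * x

    def model(self):
        # Recompute the weights from the summations alone (assumes S_xx is invertible).
        return np.linalg.solve(self.S_xx, self.S_xy)

Retraining would rebuild S_xx and S_xy from all n samples; unlearning touches only the forgotten sample, matching the asymptotic speedup by a factor of the training data size described above.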
We evaluated our unlearning approach on four diverse
learning systems including (1) LensKit [39], an open-source
recommendation system used by several websites for confer-
ence [5], movie [14], and book [4] recommendations; (2) an
independent re-implementation of Zozzle, the aforementioned
closed-source JavaScript malware detector whose algorithm
was adopted by Microsoft Bing [42]; (3) an open-source online
social network (OSN) spam filter [46]; and (4) PJScan, an
open-source PDF malware detector [51]. We also used real-
world workloads such as more than 100K JavaScript malware
samples from Huawei. Our evaluation shows:
All four systems are prone to attacks targeting learn-
ing. For LensKit, we reproduced an existing privacy
attack [29]. For each of the other three systems, because
there is no known attack, we created a new, practical data
pollution attack to decrease the detection effectiveness.
One particular attack requires careful injection of mul-
tiple features in the training data set to mislead feature
selection and model training (see §VII).
Our unlearning approach applies to all learning algo-
rithms in LensKit, Zozzle, and PJScan. In particular,
enabled by our approach, we created the first effi-
cient unlearning algorithm for normalized cosine similar-
ity [37, 63] commonly used by recommendation systems
(e.g., LensKit) and for one-class support vector machine
(SVM) [71] commonly used by classification/anomaly
detection systems (e.g., PJScan uses it to learn a model of
malicious PDFs). We show analytically that, for all these
algorithms, our approach is both complete (completely
removing a data sample’s lineage) and timely (asymptot-
ically much faster than retraining). For the OSN spam
filter, we leveraged existing techniques for unlearning.
Using real-world data, we show empirically that unlearn-
ing prevents the attacks and the speedup over retraining
is often huge, matching our analytical results.
Our approach is easy to use. It is straightforward to
modify the systems to support unlearning. For each
system, we modified from 20 to 300 lines of code, less
than 1% of the system.
C. Contributions and Paper Organization
This paper makes four main contributions:
The concept of forgetting systems that restore privacy, se-
curity, and usability by forgetting data lineage completely
and quickly;
A general unlearning approach that converts learning al-
gorithms into a summation form for efficiently forgetting
data lineage;
An evaluation of our approach on real-world systems/al-
gorithms demonstrating that it is practical, complete, fast,
and easy to use; and
The practical data pollution attacks we created against
real-world systems/algorithms.
While prior work proposed incremental machine learning
for several specific learning algorithms [31, 62, 73], the key
difference in our work is that we propose a general efficient
unlearning approach applicable to any algorithm that can be
converted to the summation form, including some that cur-
rently have no incremental versions, such as normalized cosine
similarity and one-class SVM. In addition, our unlearning
approach handles all stages of learning, including feature
selection and modeling. We also demonstrated our approach
on real systems.
Our unlearning approach is inspired by prior work on speed-
ing up machine learning algorithms with MapReduce [33]. We
believe we are the first to establish the connection between
unlearning and the summation form. In addition, we are the
first to convert non-standard real-world learning algorithms
such as normalized cosine similarity to the summation form.
The conversion is complex and challenging (see §VI). In con-
trast, the prior work converts nine standard machine learning
algorithms using only simple transformations.
The rest of the paper is organized as follows. In §II, we
present some background on machine learning systems and
the extended motivation of unlearning. In §III, we present the
goals and work flow of unlearning. In §IV, we present the core
approach of unlearning, i.e., transforming a system into the
summation form, and its formal backbone. In §V, we overview
our evaluation methodology and summarize results. In §VI–
§IX, we report detailed case studies on four real-world learning
systems. In §X and §XI, we discuss some issues in unlearning
and related work, and in §XII, we conclude.
II. BACKGROUND AND ADVERSARIAL MODEL
This section presents some background on machine learning
(§II-A) and the extended motivation of unlearning (§II-B).
A. Machine Learning Background
Figure 2 shows a general machine learning system with
three processing stages.
Feature selection. During this stage, the system selects,
from all features of the training data, a set of features
most crucial for classifying data. The selected feature
set is typically small to make later stages more accurate
and efficient. Feature selection can be (1) manual where
system builders carefully craft the feature set or (2) au-
tomatic where the system runs some learning algorithms

Fig. 2: A General Machine Learning System. Given a set of training
data including both malicious (+) and benign (−) samples, the system
first selects a set of features most crucial for classifying data. It then
uses the training data to construct a model. To process an unknown
sample, the system examines the features in the sample and uses
the model to predict the sample as malicious or benign. The lineage
of the training data thus flows to the feature set, the model, and the
prediction results. An attacker can feed different samples to the model
and observe the results to steal private information from every step
along the lineage, including the training data set (system inference
attack). She can pollute the training data and subsequently every step
along the lineage to alter prediction results (training data pollution
attack).
such as clustering and the chi-squared test to compute how
crucial the features are and select the most crucial ones.
Model training. The system extracts the values of the
selected features from each training data sample into
a feature vector. It feeds the feature vectors and the
malicious or benign labels of all training data samples
into some machine learning algorithm to construct a
succinct model.
Prediction. When the system receives an unknown data
sample, it extracts the sample’s feature vector and uses
the model to predict whether the sample is malicious or
benign.
Note that a learning system may or may not contain all
three stages, work with labeled training data, or classify data
as malicious or benign. We present the system in Figure 2 be-
cause it matches many machine learning systems for security
purposes such as Zozzle. Without loss of generality, we refer
to this system as an example in the later sections of the paper.
B. Adversarial Model
To further motivate the need for unlearning, we describe
several practical attacks in the literature that target learning
systems. They either violate privacy by inferring private in-
formation in the trained models (§II-B1), or reduce security
by polluting the prediction (detection) results of anomaly
detection systems (§II-B2).
1) System Inference Attacks: The training data sets, such
as movie ratings, online purchase histories, and browsing
histories, often contain private data. As shown in Figure 2,
the private data lineage flows through the machine learning
algorithms into the feature set, the model, and the prediction
results. By exploiting this lineage, an attacker gains an oppor-
tunity to infer private data by feeding samples into the system
and observing the prediction results. Such an attack is called
a system inference attack [29].²
Consider a recommendation system that uses item-item
collaborative filtering which learns item-item similarities from
users’ purchase histories and recommends to a user the items
most similar to the ones she previously purchased. Calandrino
et al. [29] show that once an attacker learns (1) the item-item
similarities, (2) the list of recommended items for a user before
she purchased an item, and (3) the list after, the attacker can
accurately infer what the user purchased by essentially invert-
ing the computation done by the recommendation algorithm.
For example, on LibraryThing [12], a book cataloging service
and recommendation engine, this attack successfully inferred
six book purchases per user with 90% accuracy for over one
million users!
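
As context for this attack, here is a minimal sketch of item-item collaborative filtering (our illustration in Python with assumed names, not LensKit's code): item similarities are computed from a user-item matrix, and a user is recommended the items most similar to those she already purchased, which is exactly the state the attack observes and inverts.

# Minimal item-item collaborative filtering sketch (illustrative only).
import numpy as np

def item_similarities(R):
    # R[u, i] = 1 if user u purchased item i (or a rating); columns are items.
    norms = np.linalg.norm(R, axis=0) + 1e-12
    return (R.T @ R) / np.outer(norms, norms)   # cosine similarity between item columns

def recommend(R, sims, user, k=5):
    scores = sims @ R[user]            # aggregate similarity to the user's items
    scores[R[user] > 0] = -np.inf      # never re-recommend items the user already has
    return np.argsort(scores)[::-1][:k]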
Similarly, consider a personalized warfarin dosing system
that guides medical treatments based on a patient’s genotype
and background. Fredrikson et al. [43] show that with the
model and some demographic information about a patient,
an attacker can infer the genetic markers of the patient with
accuracy as high as 75%.
2) Training Data Pollution Attacks: Another way to exploit
the lineage in Figure 2 is using training data pollution attacks.
An attacker injects carefully polluted data samples into a
learning system, misleading the algorithms to compute an in-
correct feature set and model. Subsequently, when processing
unknown samples, the system may flag a large number of benign
samples as malicious, generating too many false positives,
or it may flag a large number of malicious samples as benign, so true
malicious samples evade detection.
Unlike system inference in which an attacker exploits an
easy-to-access public interface of a learning system, data
pollution requires an attacker to tackle two relatively difficult
issues. First, the attacker must trick the learning system into
including the polluted samples in the training data set. There
are a number of reported ways to do so [54, 56, 77]. For
instance, she may sign up as a crowdsourcing worker and
intentionally mislabel benign emails as spam [77]. She may
also attack the honeypots or other baiting traps intended for
collecting malicious samples, such as sending polluted emails
to a spamtrap [17], or compromising a machine in a honeynet
and sending packets with polluted protocol header fields [56].
Second, the attacker must carefully pollute enough data to
mislead the machine learning algorithms. In the crowdsourcing
case, she, as the administrator of the crowdsourcing sites, directly
pollutes the labels of some training data [77]; mislabeling 3% of the
training data turned out to be enough to significantly decrease
detection efficacy. In the honeypot cases [17, 56], the attacker
cannot change the labels of the polluted data samples because
the honeypot automatically labels them as malicious. However,
² In this paper, we use system inference instead of model inversion [43].

she controls what features appear in the samples, so she
can inject benign features into these samples, misleading the
system into relying on these features for detecting malicious
samples. For instance, Nelson et al. injected words that also
occur in benign emails into the emails sent to a spamtrap,
causing a spam detector to classify 60% of the benign emails
as spam. Perdisci et al. injected many packets with the same
randomly generated strings into a honeynet, so that true
malicious packets without these strings evade detection.
III. OVERVIEW
This section presents the goals (§III-A) and work flow
(§III-B) of unlearning.
A. Unlearning Goals
Recall that forgetting systems have two goals: (1) com-
pleteness, or how completely they can forget data; and (2)
timeliness, or how quickly they can forget. We discuss what
these goals mean in the context of unlearning.
1) Completeness: Intuitively, completeness requires that
once a data sample is removed, all its effects on the feature set
and the model are also cleanly reversed. It essentially captures
how consistent an unlearned system is with the system that
has been retrained from scratch. If, for every possible sample,
the unlearned system gives the same prediction result as the
retrained system, then an attacker, operator, or user has no
way of discovering that the unlearned data and its lineage
existed in the system by feeding input samples to the unlearned
system or even observing its features, model, and training
data. Such unlearning is complete. To empirically measure
completeness, we quantify the percentage of input samples that
receive the same prediction results from both the unlearned
and the retrained system using a representative test data set.
The higher the percentage, the more complete the unlearning.
Note that completeness does not depend on the correctness
of prediction results: an incorrect but consistent prediction by
both systems does not decrease completeness.
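
As a minimal sketch of this measurement (ours; the model and test-set interfaces are assumptions), completeness is simply the agreement rate between the two systems:

# Empirical completeness: fraction of test samples on which the unlearned
# and the retrained system give the same prediction (illustrative sketch).
def completeness(unlearned_model, retrained_model, test_samples):
    agree = sum(1 for x in test_samples
                if unlearned_model.predict(x) == retrained_model.predict(x))
    return agree / len(test_samples)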
Our notion of completeness is subject to such factors as
how representative the test data set is and whether the learning
algorithm is randomized. In particular, given the same training
data set, the same randomized learning algorithm may compute
different models which subsequently predict differently. Thus,
we consider unlearning complete as long as the unlearned
system is consistent with one of the retrained systems.
2) Timeliness: Timeliness in unlearning captures how much
faster unlearning is than retraining at updating the features
and the model in the system. The more timely the unlearning,
the faster the system is at restoring privacy, security, and
usability. Analytically, unlearning updates only a small number
of summations and then runs a learning algorithm on these
summations, whereas retraining runs the learning algorithm
on the entire training data set, so unlearning is asymptotically
faster by a factor of the training data size. To empirically mea-
sure timeliness, we quantify the speedup of unlearning over
retraining. Unlearning does not replace retraining. Unlearning
works better when the data to forget is small compared to the
training set. This case is quite common. For instance, a single
user’s private data is typically small compared to the whole
training data of all users. Similarly, an attacker needs only a
small amount of data to pollute a learning system (e.g., 1.75%
in the OSN spam filter [46] as shown in §VIII). When the data
to forget becomes large, retraining may work better.
B. Unlearning Work Flow
Given a training data sample to forget, unlearning updates
the system in two steps, following the learning process shown
in Figure 2. First, it updates the set of selected features. The
inputs at this step are the sample to forget, the old feature
set, and the summations previously computed for deriving the
old feature set. The outputs are the updated feature set and
summations. For example, Zozzle selects features using the
chi-squared test, which scores a feature based on four counts
(the simplest form of summations): how many malicious or
benign samples contain or do not contain this feature. To
support unlearning, we augmented Zozzle to store the score
and these counts for each feature. To unlearn a sample,
we update these counts to exclude this sample, re-score the
features, and select the top scored features as the updated
feature set. This process does not depend on the training data
set, and is much faster than retraining which has to inspect
each sample for each feature. The updated feature set in our
experiments is very similar to the old one with a couple of
features removed and added.
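
The sketch below illustrates this bookkeeping (our code with assumed names and the standard 2x2 chi-squared statistic, not Zozzle's implementation): the four counts are stored per feature, adjusted to exclude the forgotten sample, and the features are then re-scored and re-selected without touching the training set.

# Per-feature counts kept alongside the feature set (illustrative sketch):
#   A: malicious samples containing the feature,  B: benign samples containing it,
#   C: malicious samples lacking it,              D: benign samples lacking it.
def chi_squared(A, B, C, D):
    n = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return 0.0 if denom == 0 else n * (A * D - B * C) ** 2 / denom

def exclude_sample(counts, sample_features, is_malicious):
    # counts: dict feature -> [A, B, C, D]; remove one sample's contribution.
    for f, (A, B, C, D) in counts.items():
        present = f in sample_features
        if is_malicious:
            A, C = (A - 1, C) if present else (A, C - 1)
        else:
            B, D = (B - 1, D) if present else (B, D - 1)
        counts[f] = [A, B, C, D]

def reselect_features(counts, k):
    # Re-score every feature from its counts and keep the top k.
    return sorted(counts, key=lambda f: chi_squared(*counts[f]), reverse=True)[:k]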
Second, unlearning updates the model. The inputs at this
step are the sample to forget, the old feature set, the updated
feature set, the old model, and the summations previously
computed for deriving the old model. The outputs are the
updated model and summations. If a feature is removed from
the feature set, we simply splice out the feature’s data from
the model. If a feature is added, we compute its data in the
model. In addition, we update summations that depend on
the sample to forget, and update the model accordingly. For
Zozzle, which classifies data as malicious or benign using naïve
Bayes, the summations are probabilities (e.g., the probability
that a training data sample is malicious given that it contains
a certain feature) computed using the counts recorded in the
first step. Updating the probabilities and the model is thus
straightforward, and much faster than retraining.
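
Continuing the Zozzle example, the sketch below (ours, with assumed names and Laplace smoothing; not the authors' code) rebuilds the naïve Bayes probabilities directly from the per-feature counts updated in the first step, so the model update likewise avoids any pass over the training data.

# Rebuild the naive Bayes model from the per-feature counts (A, B, C, D above);
# illustrative sketch: no pass over the training data set is needed.
import math

def rebuild_model(counts, selected_features):
    A, B, C, D = counts[next(iter(counts))]
    n_mal, n_ben = A + C, B + D                 # class totals are the same for every feature
    model = {
        "log_prior_mal": math.log(n_mal / (n_mal + n_ben)),
        "log_prior_ben": math.log(n_ben / (n_mal + n_ben)),
        "feature_probs": {},
    }
    for f in selected_features:
        A, B, C, D = counts[f]
        model["feature_probs"][f] = {           # Laplace-smoothed conditional probabilities
            "p_contains_given_mal": (A + 1) / (n_mal + 2),
            "p_contains_given_ben": (B + 1) / (n_ben + 2),
        }
    return model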
IV. UNLEARNING APPROACH
As previously depicted in Figure 1, our unlearning approach
introduces a layer of a small number of summations between
the learning algorithm and the training data to break down
the dependencies. Now, the learning algorithm depends only
on the summations, each of which is the sum of some
efficiently computable transformations of the training data
samples. Chu et al. [33] show that many popular machine
learning algorithms, such as naïve Bayes, can be represented
in this form. To remove a data sample, we simply remove
the transformations of this data sample from the summations
that depend on this sample, which has O(1) complexity, and
