Communication-Efficient Learning of Deep Networks
from Decentralized Data
H. Brendan McMahan   Eider Moore   Daniel Ramage   Seth Hampson   Blaise Agüera y Arcas
Google, Inc., 651 N 34th St., Seattle, WA 98103 USA
Abstract
Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10–100× as compared to synchronized stochastic gradient descent.
1 Introduction
Increasingly, phones and tablets are the primary computing devices for many people [30, 2]. The powerful sensors on these devices (including cameras, microphones, and GPS), combined with the fact they are frequently carried, means they have access to an unprecedented amount of data, much of it private in nature. Models learned on such data hold the
promise of greatly improving usability by powering more
intelligent applications, but the sensitive nature of the data
means there are risks and responsibilities to storing it in a
centralized location.
We investigate a learning technique that allows users to
collectively reap the benefits of shared models trained from
this rich data, without the need to centrally store it. We term
our approach Federated Learning, since the learning task is
solved by a loose federation of participating devices (which
we refer to as clients) which are coordinated by a central
server. Each client has a local training dataset which is
never uploaded to the server. Instead, each client computes
an update to the current global model maintained by the
server, and only this update is communicated. This is a
direct application of the principle of focused collection or
data minimization proposed by the 2012 White House report
on privacy of consumer data [39]. Since these updates are
specific to improving the current model, there is no reason
to store them once they have been applied.
A principal advantage of this approach is the decoupling of
model training from the need for direct access to the raw
training data. Clearly, some trust of the server coordinat-
ing the training is still required. However, for applications
where the training objective can be specified on the basis
of data available on each client, federated learning can sig-
nificantly reduce privacy and security risks by limiting the
attack surface to only the device, rather than the device and
the cloud.
Our primary contributions are 1) the identification of the
problem of training on decentralized data from mobile de-
vices as an important research direction; 2) the selection of
a straightforward and practical algorithm that can be applied
to this setting; and 3) an extensive empirical evaluation of
the proposed approach. More concretely, we introduce the
FederatedAveraging
algorithm, which combines lo-
cal stochastic gradient descent (SGD) on each client with
a server that performs model averaging. We perform ex-
tensive experiments on this algorithm, demonstrating it is
robust to unbalanced and non-IID data distributions, and
can reduce the rounds of communication needed to train a
deep network on decentralized data by orders of magnitude.

Federated Learning
Ideal problems for federated learn-
ing have the following properties: 1) Training on real-world
data from mobile devices provides a distinct advantage over
training on proxy data that is generally available in the data
center. 2) This data is privacy sensitive or large in size (com-
pared to the size of the model), so it is preferable not to log
it to the data center purely for the purpose of model training
(in service of the focused collection principle). 3) For super-
vised tasks, labels on the data can be inferred naturally from
user interaction.
Many models that power intelligent behavior on mobile
devices fit the above criteria. As two examples, we con-
sider image classification, for example predicting which
photos are most likely to be viewed multiple times in the
future, or shared; and language models, which can be used
to improve voice recognition and text entry on touch-screen
keyboards by improving decoding, next-word-prediction,
and even predicting whole replies [10]. The potential train-
ing data for both these tasks (all the photos a user takes and
everything they type on their mobile keyboard, including
passwords, URLs, messages, etc.) can be privacy sensitive.
The distributions from which these examples are drawn are
also likely to differ substantially from easily available proxy
datasets: the use of language in chat and text messages is
generally much different than standard language corpora,
e.g., Wikipedia and other web documents; the photos people
take on their phone are likely quite different than typical
Flickr photos. And finally, the labels for these problems are
directly available: entered text is self-labeled for learning
a language model, and photo labels can be defined by natu-
ral user interaction with their photo app (which photos are
deleted, shared, or viewed).
Both of these tasks are well-suited to learning a neural net-
work. For image classification, feed-forward deep networks, and in particular convolutional networks, are well-known to provide state-of-the-art results [26, 25]. For language modeling tasks, recurrent neural networks, and in particular LSTMs, have achieved state-of-the-art results [20, 5, 22].
Privacy
Federated learning has distinct privacy advantages compared to data center training on persisted data. Holding even an "anonymized" dataset can still put user privacy at risk via joins with other data [37]. In contrast, the information transmitted for federated learning is the minimal update necessary to improve a particular model (naturally, the strength of the privacy benefit depends on the content of the updates).¹ The updates themselves can (and should) be ephemeral. They will never contain more information than the raw training data (by the data processing inequality), and will generally contain much less. Further, the source of the updates is not needed by the aggregation algorithm, so updates can be transmitted without identifying meta-data over a mix network such as Tor [7] or via a trusted third party. We briefly discuss the possibility of combining federated learning with secure multiparty computation and differential privacy at the end of the paper.

¹For example, if the update is the total gradient of the loss on all of the local data, and the features are a sparse bag-of-words, then the non-zero gradients reveal exactly which words the user has entered on the device. In contrast, the sum of many gradients for a dense model such as a CNN offers a harder target for attackers seeking information about individual training instances (though attacks are still possible).
Federated Optimization
We refer to the optimization
problem implicit in federated learning as federated optimiza-
tion, drawing a connection (and contrast) to distributed opti-
mization. Federated optimization has several key properties
that differentiate it from a typical distributed optimization
problem:
• Non-IID: The training data on a given client is typically based on the usage of the mobile device by a particular user, and hence any particular user's local dataset will not be representative of the population distribution.
• Unbalanced: Similarly, some users will make much heavier use of the service or app than others, leading to varying amounts of local training data.
• Massively distributed: We expect the number of clients participating in an optimization to be much larger than the average number of examples per client.
• Limited communication: Mobile devices are frequently offline or on slow or expensive connections.
In this work, our emphasis is on the non-IID and unbalanced
properties of the optimization, as well as the critical nature
of the communication constraints. A deployed federated
optimization system must also address a myriad of practical
issues: client datasets that change as data is added and
deleted; client availability that correlates with the local data
distribution in complex ways (e.g., phones from speakers
of American English will likely be plugged in at different
times than speakers of British English); and clients that
never respond or send corrupted updates.
These issues are beyond the scope of the current work;
instead, we use a controlled environment that is suitable
for experiments, but still addresses the key issues of client
availability and unbalanced and non-IID data. We assume
a synchronous update scheme that proceeds in rounds of
communication. There is a fixed set of K clients, each with a fixed local dataset. At the beginning of each round, a random fraction C of clients is selected, and the server sends the current global algorithm state to each of these clients (e.g., the current model parameters). We only select
a fraction of clients for efficiency, as our experiments show
diminishing returns for adding more clients beyond a certain
point. Each selected client then performs local computation
based on the global state and its local dataset, and sends an
update to the server. The server then applies these updates
to its global state, and the process repeats.

While we focus on non-convex neural network objectives, the algorithm we consider is applicable to any finite-sum objective of the form

\[
\min_{w \in \mathbb{R}^d} f(w) \quad\text{where}\quad f(w) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(w). \tag{1}
\]

For a machine learning problem, we typically take f_i(w) = ℓ(x_i, y_i; w), that is, the loss of the prediction on example (x_i, y_i) made with model parameters w. We assume there are K clients over which the data is partitioned, with P_k the set of indexes of data points on client k, and n_k = |P_k|. Thus, we can re-write the objective (1) as

\[
f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w) \quad\text{where}\quad F_k(w) = \frac{1}{n_k}\sum_{i \in P_k} f_i(w).
\]

If the partition P_k was formed by distributing the training examples over the clients uniformly at random, then we would have E_{P_k}[F_k(w)] = f(w), where the expectation is over the set of examples assigned to a fixed client k. This is the IID assumption typically made by distributed optimization algorithms; we refer to the case where this does not hold (that is, F_k could be an arbitrarily bad approximation to f) as the non-IID setting.
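To make the decomposition concrete, here is a small NumPy check (a minimal sketch; the per-example losses and partition sizes are synthetic placeholders) that the n_k/n-weighted average of the client objectives F_k recovers the global objective f:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-example losses f_i(w) for a fixed w, with n = 1000 examples.
n = 1000
per_example_loss = rng.random(n)          # stands in for f_i(w)

# Partition the example indices over K = 10 clients.
K = 10
indices = rng.permutation(n)
partitions = np.array_split(indices, K)   # P_k for k = 1..K

f_global = per_example_loss.mean()        # f(w) = (1/n) sum_i f_i(w)

# F_k(w) = (1/n_k) sum_{i in P_k} f_i(w); weight each client by n_k / n.
f_federated = sum(
    (len(P_k) / n) * per_example_loss[P_k].mean() for P_k in partitions
)

assert np.isclose(f_global, f_federated)  # the two formulations agree
```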
In data center optimization, communication costs are rela-
tively small, and computational costs dominate, with much
of the recent emphasis being on using GPUs to lower these
costs. In contrast, in federated optimization communication
costs dominate: we will typically be limited by an upload
bandwidth of 1 MB/s or less. Further, clients will typically
only volunteer to participate in the optimization when they
are charged, plugged-in, and on an unmetered wi-fi connec-
tion. Further, we expect each client will only participate in a
small number of update rounds per day. On the other hand,
since any single on-device dataset is small compared to the
total dataset size, and modern smartphones have relatively
fast processors (including GPUs), computation becomes
essentially free compared to communication costs for many
model types. Thus, our goal is to use additional computation
in order to decrease the number of rounds of communica-
tion needed to train a model. There are two primary ways
we can add computation: 1) increased parallelism, where
we use more clients working independently between each
communication round; and, 2) increased computation on
each client, where rather than performing a simple computa-
tion like a gradient calculation, each client performs a more
complex calculation between each communication round.
We investigate both of these approaches, but the speedups
we achieve are due primarily to adding more computation
on each client, once a minimum level of parallelism over
clients is used.
Related Work
Distributed training by iteratively averaging locally trained models has been studied by McDonald et al. [28] for the perceptron and Povey et al. [31] for speech recognition DNNs. Zhang et al. [42] studies an asynchronous approach with "soft" averaging. These works only consider the cluster / data center setting (at most 16 workers, wall-clock time based on fast networks), and do not consider datasets that are unbalanced and non-IID, properties that are essential to the federated learning setting. We adapt this style of algorithm to the federated setting and perform the appropriate empirical evaluation, which asks different questions than those relevant in the data center setting, and requires different methodology.

Using similar motivation to ours, Neverova et al. [29] also discusses the advantages of keeping sensitive user data on device. The work of Shokri and Shmatikov [35] is related in several ways: they focus on training deep networks, emphasize the importance of privacy, and address communication costs by only sharing a subset of the parameters during each round of communication; however, they also do not consider unbalanced and non-IID data, and the empirical evaluation is limited.

In the convex setting, the problem of distributed optimization and estimation has received significant attention [4, 15, 33], and some algorithms do focus specifically on communication efficiency [45, 34, 40, 27, 43]. In addition to assuming convexity, this existing work generally requires that the number of clients is much smaller than the number of examples per client, that the data is distributed across the clients in IID fashion, and that each node has an identical number of data points; all of these assumptions are violated in the federated optimization setting. Asynchronous distributed forms of SGD have also been applied to training neural networks, e.g., Dean et al. [12], but these approaches require a prohibitive number of updates in the federated setting. Distributed consensus algorithms (e.g., [41]) relax the IID assumption, but are still not a good fit for communication-constrained optimization over very many clients.

One endpoint of the (parameterized) algorithm family we consider is simple one-shot averaging, where each client solves for the model that minimizes (possibly regularized) loss on their local data, and these models are averaged to produce the final global model. This approach has been studied extensively in the convex case with IID data, and it is known that in the worst-case, the global model produced is no better than training a model on a single client [44, 3, 46].
2 The FederatedAveraging Algorithm
The recent multitude of successful applications of deep
learning have almost exclusively relied on variants of
stochastic gradient descent (SGD) for optimization; in fact,
many advances can be understood as adapting the struc-
ture of the model (and hence the loss function) to be more
amenable to optimization by simple gradient-based meth-
ods [16]. Thus, it is natural that we build algorithms for

federated optimization by starting from SGD.
SGD can be applied naively to the federated optimization
problem, where a single batch gradient calculation (say on
a randomly selected client) is done per round of commu-
nication. This approach is computationally efficient, but
requires very large numbers of rounds of training to produce
good models (e.g., even using an advanced approach like
batch normalization, Ioffe and Szegedy
[21]
trained MNIST
for 50000 steps on minibatches of size 60). We consider
this baseline in our CIFAR-10 experiments.
In the federated setting, there is little cost in wall-clock time
to involving more clients, and so for our baseline we use
large-batch synchronous SGD; experiments by Chen et al.
[8]
show this approach is state-of-the-art in the data center
setting, where it outperforms asynchronous approaches. To
apply this approach in the federated setting, we select a C-fraction of clients on each round, and compute the gradient of the loss over all the data held by these clients. Thus, C controls the global batch size, with C = 1 corresponding to full-batch (non-stochastic) gradient descent.² We refer to this baseline algorithm as FederatedSGD (or FedSGD).
A typical implementation of FedSGD with C = 1 and a fixed learning rate η has each client k compute g_k = ∇F_k(w_t), the average gradient on its local data at the current model w_t, and the central server aggregates these gradients and applies the update

\[
w_{t+1} \leftarrow w_t - \eta \sum_{k=1}^{K} \frac{n_k}{n} g_k,
\]

since Σ_{k=1}^{K} (n_k/n) g_k = ∇f(w_t). An equivalent update is given by ∀k, w^k_{t+1} ← w_t − η g_k, and then

\[
w_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n}\, w^k_{t+1}.
\]
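This equivalence is easy to verify numerically. Below is a minimal NumPy sketch (the gradients, client sizes, and learning rate are synthetic placeholders) showing that averaging gradients and then stepping gives the same result as stepping on each client and then averaging the models:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, eta = 5, 4, 0.1

w_t = rng.normal(size=d)                       # current global model
n_k = np.array([30, 10, 50, 10])               # examples per client
n = n_k.sum()

# Per-client gradients g_k = grad F_k(w_t); random stand-ins here.
g = rng.normal(size=(K, d))

# (a) Server averages the gradients, then takes one SGD step.
w_grad_avg = w_t - eta * sum((n_k[k] / n) * g[k] for k in range(K))

# (b) Each client takes one local SGD step, server averages the models.
w_k = np.stack([w_t - eta * g[k] for k in range(K)])
w_model_avg = sum((n_k[k] / n) * w_k[k] for k in range(K))

assert np.allclose(w_grad_avg, w_model_avg)    # identical updates
```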
That is, each client locally takes one step of gradient descent on the current model using its local data, and the server then takes a weighted average of the resulting models. Once the algorithm is written this way, we can add more computation to each client by iterating the local update w^k ← w^k − η∇F_k(w^k) multiple times before the averaging step. We term this approach FederatedAveraging (or FedAvg). The amount of computation is controlled by three key parameters: C, the fraction of clients that perform computation on each round; E, the number of training passes each client makes over its local dataset on each round; and B, the local minibatch size used for the client updates. We write B = ∞ to indicate that the full local dataset is treated as a single minibatch. Thus, at one endpoint of this algorithm family, we can take B = ∞ and E = 1, which corresponds exactly to FedSGD. For a client with n_k local examples, the number of local updates per round is given by u_k = E·n_k/B; complete pseudo-code is given in Algorithm 1.
For general non-convex objectives, averaging models in
parameter space could produce an arbitrarily bad model.
²While the batch selection mechanism is different than selecting a batch by choosing individual examples uniformly at random, the batch gradients g computed by FedSGD still satisfy E[g] = ∇f(w).
[Figure 1: two panels plotting loss against the mixing weight θ; left panel "Independent initialization", right panel "Common initialization".]

Figure 1: The loss on the full MNIST training set for models generated by averaging the parameters of two models w and w′ using θw + (1 − θ)w′ for 50 evenly spaced values θ ∈ [−0.2, 1.2]. The models w and w′ were trained using SGD on different small datasets. For the left plot, w and w′ were initialized using different random seeds; for the right plot, a shared seed was used. Note the different y-axis scales. The horizontal line gives the best loss achieved by w or w′ (which were quite close, corresponding to the vertical lines at θ = 0 and θ = 1). With shared initialization, averaging the models produces a significant reduction in the loss on the total training set (much better than the loss of either parent model).
Following the approach of Goodfellow et al. [17], we see exactly this bad behavior when we average two MNIST digit-recognition models³ trained from different initial conditions (Figure 1, left). For this figure, the parent models w and w′ were each trained on non-overlapping IID samples of 600 examples from the MNIST training set. Training was via SGD with a fixed learning rate of 0.1 for 240 updates on minibatches of size 50 (or E = 20 passes over the mini-datasets of size 600). This is approximately the amount of training where the models begin to overfit their local datasets.
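The interpolation experiment in Figure 1 can be reproduced in outline with a few lines of code. The sketch below is a stand-in, not the paper's setup: it uses a small logistic-regression model on synthetic data rather than the MNIST 2NN, trains two copies from a shared initialization on disjoint subsets, and evaluates the loss of θw + (1 − θ)w′ over a grid of θ.

```python
import numpy as np

rng = np.random.default_rng(0)

# One synthetic binary-classification problem; two disjoint 600-example
# "client" datasets drawn from it (standing in for the MNIST samples).
d = 10
X = rng.normal(size=(1200, d))
true_w = rng.normal(size=d)
y = (X @ true_w + 0.5 * rng.normal(size=1200) > 0).astype(float)
(X1, y1), (X2, y2) = (X[:600], y[:600]), (X[600:], y[600:])

def loss(w, X, y):                      # mean logistic loss
    z = X @ w
    return np.mean(np.log1p(np.exp(-z)) + (1.0 - y) * z)

def train(w, X, y, lr=0.1, steps=240, batch=50):
    w = w.copy()
    for _ in range(steps):              # plain minibatch SGD, as in the figure
        idx = rng.choice(len(X), batch, replace=False)
        z = X[idx] @ w
        w -= lr * (X[idx].T @ (1.0 / (1.0 + np.exp(-z)) - y[idx])) / batch
    return w

w_init = 0.01 * rng.normal(size=d)      # shared initialization (right panel)
w, w_prime = train(w_init, X1, y1), train(w_init, X2, y2)

for theta in np.linspace(-0.2, 1.2, 8): # coarse grid of mixing weights
    w_mix = theta * w + (1.0 - theta) * w_prime
    print(f"theta={theta:+.2f}  loss={loss(w_mix, X, y):.4f}")
```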
Recent work indicates that in practice, the loss surfaces of sufficiently over-parameterized NNs are surprisingly well-behaved and in particular less prone to bad local minima than previously thought [11, 17, 9]. And indeed, when we start two models from the same random initialization and then again train each independently on a different subset of the data (as described above), we find that naive parameter averaging works surprisingly well (Figure 1, right): the average of these two models, ½w + ½w′, achieves significantly lower loss on the full MNIST training set than the best model achieved by training on either of the small datasets independently. While Figure 1 starts from a random initialization, note a shared starting model w_t is used for each round of FedAvg, and so the same intuition applies.
³We use the "2NN" multi-layer perceptron described in Section 3.

Algorithm 1 FederatedAveraging. The K clients are indexed by k; B is the local minibatch size, E is the number of local epochs, and η is the learning rate.

Server executes:
   initialize w_0
   for each round t = 1, 2, . . . do
      m ← max(C · K, 1)
      S_t ← (random set of m clients)
      for each client k ∈ S_t in parallel do
         w^k_{t+1} ← ClientUpdate(k, w_t)
      w_{t+1} ← Σ_{k=1}^{K} (n_k / n) w^k_{t+1}

ClientUpdate(k, w):   // Run on client k
   B ← (split P_k into batches of size B)
   for each local epoch i from 1 to E do
      for batch b ∈ B do
         w ← w − η ∇ℓ(w; b)
   return w to server
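For concreteness, the following is a minimal NumPy sketch of Algorithm 1 (not the authors' implementation). The model, gradient function, client datasets, and hyperparameters are generic placeholders, and the treatment of clients not selected in a round is one common reading of the n_k/n aggregation in the pseudo-code.

```python
import numpy as np

rng = np.random.default_rng(0)

def client_update(w, X, y, grad_fn, E, B, eta):
    """ClientUpdate(k, w): E local epochs of minibatch SGD on (X, y)."""
    w = w.copy()
    for _ in range(E):
        perm = rng.permutation(len(X))
        for start in range(0, len(X), B):
            b = perm[start:start + B]
            w -= eta * grad_fn(w, X[b], y[b])
    return w

def fed_avg(client_data, grad_fn, d, rounds=100, C=0.1, E=5, B=10, eta=0.1):
    """Server loop of FederatedAveraging over client_data = [(X_k, y_k), ...]."""
    K = len(client_data)
    n_k = np.array([len(X) for X, _ in client_data], dtype=float)
    n = n_k.sum()
    w = np.zeros(d)                                   # initialize w_0
    for _ in range(rounds):
        m = max(int(C * K), 1)
        S_t = rng.choice(K, size=m, replace=False)    # random set of m clients
        updates = {k: client_update(w, *client_data[k], grad_fn, E, B, eta)
                   for k in S_t}
        # Weighted average; clients not selected this round contribute the
        # current global w, one common simplification of the aggregation step.
        w = sum((n_k[k] / n) * updates.get(k, w) for k in range(K))
    return w

# Example usage with a least-squares objective on synthetic client datasets.
def lsq_grad(w, X, y):
    return X.T @ (X @ w - y) / len(X)

d = 5
clients = []
for _ in range(20):
    X = rng.normal(size=(60, d))
    clients.append((X, X @ np.arange(1.0, d + 1.0) + 0.1 * rng.normal(size=60)))

w_hat = fed_avg(clients, lsq_grad, d)
print(np.round(w_hat, 2))   # should approach [1, 2, 3, 4, 5]
```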
3 Experimental Results
We are motivated by both image classification and language
modeling tasks where good models can greatly enhance the
usability of mobile devices. For each of these tasks we first
picked a proxy dataset of modest enough size that we could
thoroughly investigate the hyperparameters of the
FedAvg
algorithm. While each individual training run is relatively
small, we trained over 2000 individual models for these
experiments. We then present results on the benchmark
CIFAR-10 image classification task. Finally, to demonstrate
the effectiveness of
FedAvg
on a real-world problem with
a natural partitioning of the data over clients, we evaluate
on a large language modeling task.
Our initial study includes three model families on two datasets. The first two are for the MNIST digit recognition task [26]: 1) A simple multilayer-perceptron with 2 hidden layers of 200 units each using ReLU activations (199,210 total parameters), which we refer to as the MNIST 2NN. 2) A CNN with two 5x5 convolution layers (the first with 32 channels, the second with 64, each followed by 2x2 max pooling), a fully connected layer with 512 units and ReLU activation, and a final softmax output layer (1,663,370 total parameters). To study federated optimization, we also need to specify how the data is distributed over the clients. We study two ways of partitioning the MNIST data over clients: IID, where the data is shuffled and then partitioned into 100 clients each receiving 600 examples, and Non-IID, where we first sort the data by digit label, divide it into 200 shards of size 300, and assign each of 100 clients 2 shards. This is a pathological non-IID partition of the data, as most clients will only have examples of two digits. Thus, this lets us explore the degree to which our algorithms will break on highly non-IID data. Both of these partitions are balanced, however.⁴
Table 1: Effect of the client fraction C on the MNIST 2NN with E = 1 and CNN with E = 5. Note C = 0.0 corresponds to one client per round; since we use 100 clients for the MNIST data, the rows correspond to 1, 10, 20, 50, and 100 clients. Each table entry gives the number of rounds of communication necessary to achieve a test-set accuracy of 97% for the 2NN and 99% for the CNN, along with the speedup relative to the C = 0 baseline. Five runs with the large batch size did not reach the target accuracy in the allowed time (shown as –).

2NN, E = 1          IID                        Non-IID
  C             B = ∞        B = 10        B = ∞         B = 10
  0.0           1455         316           4278          3275
  0.1           1474 (1.0×)  87 (3.6×)     1796 (2.4×)   664 (4.9×)
  0.2           1658 (0.9×)  77 (4.1×)     1528 (2.8×)   619 (5.3×)
  0.5           (–)          75 (4.2×)     (–)           443 (7.4×)
  1.0           (–)          70 (4.5×)     (–)           380 (8.6×)

CNN, E = 5
  0.0           387          50            1181          956
  0.1           339 (1.1×)   18 (2.8×)     1100 (1.1×)   206 (4.6×)
  0.2           337 (1.1×)   18 (2.8×)     978 (1.2×)    200 (4.8×)
  0.5           164 (2.4×)   18 (2.8×)     1067 (1.1×)   261 (3.7×)
  1.0           246 (1.6×)   16 (3.1×)     (–)           97 (9.9×)
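Returning to the MNIST partitions described above, the pathological non-IID split is straightforward to express in code. A minimal sketch (assuming a generic label array; not the authors' preprocessing):

```python
import numpy as np

def pathological_non_iid_partition(labels, num_clients=100, shards_per_client=2, seed=0):
    """Sort example indices by label, cut into shards, give each client 2 shards."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")        # sort by digit label
    num_shards = num_clients * shards_per_client     # 200 shards of size ~300
    shards = np.array_split(order, num_shards)
    shard_ids = rng.permutation(num_shards)
    return [np.concatenate([shards[s] for s in
                            shard_ids[c * shards_per_client:(c + 1) * shards_per_client]])
            for c in range(num_clients)]

# Example with placeholder labels shaped like MNIST (60,000 examples, 10 classes).
labels = np.repeat(np.arange(10), 6000)
clients = pathological_non_iid_partition(labels)
print(len(clients), len(clients[0]))                 # 100 clients, 600 examples each
print(np.unique(labels[clients[0]]))                 # at most two distinct digits
```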
For language modeling, we built a dataset from The Complete Works of William Shakespeare [32]. We construct a client dataset for each speaking role in each play with at least two lines. This produced a dataset with 1146 clients. For each client, we split the data into a set of training lines (the first 80% of lines for the role), and test lines (the last 20%, rounded up to at least one line). The resulting dataset has 3,564,579 characters in the training set, and 870,014 characters⁵ in the test set. This data is substantially unbalanced, with many roles having only a few lines, and a few with a large number of lines. Further, observe the test set is not a random sample of lines, but is temporally separated by the chronology of each play. Using an identical train/test split, we also form a balanced and IID version of the dataset, also with 1146 clients.
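The per-role train/test split described above can be sketched as follows. This is one reading of the "80% / last 20% rounded up to at least one line" rule, with a hypothetical helper name and toy data:

```python
import math

def split_role_lines(lines):
    """Per-client split: first 80% of a role's lines for training,
    the remaining lines (at least one) for test."""
    if len(lines) < 2:
        raise ValueError("roles with fewer than two lines are excluded")
    num_test = max(1, math.ceil(0.2 * len(lines)))
    num_train = len(lines) - num_test
    return lines[:num_train], lines[num_train:]

# Example: a role with 7 lines gets 5 training lines and 2 test lines.
train, test = split_role_lines([f"line {i}" for i in range(1, 8)])
print(len(train), len(test))   # 5 2
```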
On this data we train a stacked character-level LSTM language model, which after reading each character in a line, predicts the next character [22]. The model takes a series of characters as input and embeds each of these into a learned 8 dimensional space. The embedded characters are then processed through 2 LSTM layers, each with 256 nodes. Finally the output of the second LSTM layer is sent to a softmax output layer with one node per character. The full model has 866,578 parameters, and we trained using an unroll length of 80 characters.
⁴We performed additional experiments on unbalanced versions of these datasets, and found them to in fact be slightly easier for FedAvg.
⁵We always use character to refer to a one byte string, and use role to refer to a part in the play.
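For reference, a minimal PyTorch sketch of such a stacked character-level LSTM follows. This is an illustrative reconstruction, not the authors' model: the vocabulary size is a placeholder, so the parameter count will not exactly match the 866,578 reported above.

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Embed each character into 8 dims, run 2 LSTM layers of 256 units,
    and predict the next character with a per-position softmax layer."""
    def __init__(self, vocab_size, embed_dim=8, hidden=256, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state          # logits for every position

vocab_size = 90                            # placeholder; depends on the corpus
model = CharLSTM(vocab_size)
x = torch.randint(0, vocab_size, (4, 80))  # batch of 4, unroll length 80
logits, _ = model(x)
print(logits.shape)                        # torch.Size([4, 80, 90])
print(sum(p.numel() for p in model.parameters()))
```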
