Federated Learning: Strategies for Improving Communication Efficiency

TL;DR: Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.


Under review as a conference paper at ICLR 2018
FEDERATED LEARNING: STRATEGIES FOR IMPROVING
COMMUNICATION EFFICIENCY
Anonymous authors
Paper under double-blind review
ABSTRACT
Federated Learning is a machine learning setting where the goal is to train a high-
quality centralized model while training data remains distributed over a large num-
ber of clients each with unreliable and relatively slow network connections. We
consider learning algorithms for this setting where on each round, each client in-
dependently computes an update to the current model based on its local data, and
communicates this update to a central server, where the client-side updates are
aggregated to compute a new global model. The typical clients in this setting are
mobile phones, and communication efficiency is of the utmost importance.
In this paper, we propose two ways to reduce the uplink communication costs:
structured updates, where we directly learn an update from a restricted space
parametrized using a smaller number of variables, e.g. either low-rank or a random
mask; and sketched updates, where we learn a full model update and then com-
press it using a combination of quantization, random rotations, and subsampling
before sending it to the server. Experiments on both convolutional and recurrent
networks show that the proposed methods can reduce the communication cost by
two orders of magnitude.
1 INTRODUCTION
As datasets grow larger and models more complex, training machine learning models increasingly
requires distributing the optimization of model parameters over multiple machines. Existing ma-
chine learning algorithms are designed for highly controlled environments (such as data centers)
where the data is distributed among machines in a balanced and i.i.d. fashion, and high-throughput
networks are available.
Recently, Federated Learning (and related decentralized approaches) (McMahan & Ramage, 2017; Konečný et al., 2016; McMahan et al., 2017; Shokri & Shmatikov, 2015) have been proposed as
an alternative setting: a shared global model is trained under the coordination of a central server,
from a federation of participating devices. The participating devices (clients) are typically large in
number and have slow or unstable internet connections. A principal motivating example for Feder-
ated Learning arises when the training data comes from users’ interaction with mobile applications.
Federated Learning enables mobile phones to collaboratively learn a shared prediction model while
keeping all the training data on device, decoupling the ability to do machine learning from the need
to store the data in the cloud. The training data is kept locally on users’ mobile devices, and the
devices are used as nodes performing computation on their local data in order to update a global
model. This goes beyond the use of local models that make predictions on mobile devices, by bring-
ing model training to the device as well. The above framework differs from conventional distributed
machine learning (Reddi et al., 2016; Ma et al., 2017; Shamir et al., 2014; Zhang & Lin, 2015; Dean
et al., 2012; Chilimbi et al., 2014) due to the very large number of clients, highly unbalanced and
non-i.i.d. data available on each client, and relatively poor network connections. In this work, our
focus is on the last constraint, since these unreliable and asymmetric connections pose a particular
challenge to practical Federated Learning.
For simplicity, we consider synchronized algorithms for Federated Learning where a typical round
consists of the following steps:
1. A subset of existing clients is selected, each of which downloads the current model.
2. Each client in the subset computes an updated model based on their local data.
3. The model updates are sent from the selected clients to the server.
4. The server aggregates these models (typically by averaging) to construct an improved
global model; a minimal code sketch of such a round is given below.
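As a concrete illustration of the above round structure, the following is a minimal NumPy sketch of one synchronized round. It is not the implementation used in the paper; the local solver here is plain gradient descent on a toy least-squares objective, and names such as `client_update` and `federated_round` are our own.

```python
import numpy as np

def client_update(W, X, y, lr=0.1, epochs=5):
    """Toy local update: a few gradient steps on a least-squares loss."""
    W = W.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ W - y) / len(X)   # gradient of (1/2n)||XW - y||^2
        W -= lr * grad
    return W

def federated_round(W_global, clients, num_selected=10, seed=None):
    """One synchronized round: select clients, collect their updates, average them."""
    rng = np.random.default_rng(seed)
    selected = rng.choice(len(clients), size=min(num_selected, len(clients)), replace=False)
    updates = []
    for i in selected:
        X, y = clients[i]                         # 1. client downloads the current model
        W_local = client_update(W_global, X, y)   # 2. client computes an updated model locally
        updates.append(W_local - W_global)        # 3. only the update H_i is sent to the server
    H = np.mean(updates, axis=0)                  # 4. server averages the updates
    return W_global + H                           #    and applies them (eta_t = 1)

# Usage with synthetic per-client data: 100 clients, each with 20 examples.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 5)), rng.normal(size=(20, 3))) for _ in range(100)]
W = np.zeros((5, 3))
for _ in range(5):
    W = federated_round(W, clients)
```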
A naive implementation of the above framework requires that each client sends a full model (or a full model update) back to the server in each round. For large models, this step is likely to be the bottleneck of Federated Learning due to multiple factors. One factor is the asymmetry of internet connection speeds: the uplink is typically much slower than the downlink. The US average broadband speed was 55.0 Mbps download vs. 18.9 Mbps upload, with some internet service providers being significantly more asymmetric, e.g., Xfinity at 125 Mbps down vs. 15 Mbps up (speedtest.net, 2016). Additionally, existing model compression schemes such as Han et al. (2015) can reduce the bandwidth necessary to download the current model, and cryptographic protocols put in place to ensure that no individual client's update can be inspected before averaging with hundreds or thousands of other updates (Bonawitz et al., 2017) further increase the number of bits that need to be uploaded.
It is therefore important to investigate methods which can reduce the uplink communication cost. In
this paper, we study two general approaches:
• Structured updates, where we directly learn an update from a restricted space that can be parametrized using a smaller number of variables.
• Sketched updates, where we learn a full model update, then compress it before sending it to the server.
These approaches, explained in detail in Sections 2 and 3, can be combined, e.g., by first learning a structured update and then sketching it; however, we do not experiment with this combination in this work.
In the following, we formally describe the problem. The goal of Federated Learning is to learn a model with parameters embodied in a real matrix $W \in \mathbb{R}^{d_1 \times d_2}$ from data stored across a large number of clients.¹ We first provide a communication-naive version of Federated Learning. In round $t \geq 0$, the server distributes the current model $W_t$ to a subset $S_t$ of $n_t$ clients. These clients independently update the model based on their local data. Let the updated local models be $W_t^1, W_t^2, \dots, W_t^{n_t}$, so the update of client $i$ can be written as $H_t^i := W_t^i - W_t$, for $i \in S_t$. These updates could be a single gradient computed on the client, but typically will be the result of a more complex calculation, for example, multiple steps of stochastic gradient descent (SGD) taken on the client's local dataset. In any case, each selected client then sends the update back to the server, where the global update is computed by aggregating² all the client-side updates:
$$W_{t+1} = W_t + \eta_t H_t, \qquad H_t := \frac{1}{n_t} \sum_{i \in S_t} H_t^i.$$
The server chooses the learning rate $\eta_t$. For simplicity, we choose $\eta_t = 1$.
In Section 4, we describe Federated Learning for neural networks, where we use a separate 2D matrix $W$ to represent the parameters of each layer. We suppose that $W$ gets right-multiplied, i.e., $d_1$ and $d_2$ represent the output and input dimensions, respectively. Note that the parameters of a fully connected layer are naturally represented as 2D matrices. However, the kernel of a convolutional layer is a 4D tensor of shape #input × width × height × #output. In such a case, $W$ is reshaped from the kernel to the shape (#input × width × height) × #output.
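For example, under the layout just described, the reshaping of a convolutional kernel into the 2D matrix W can be done as follows (an illustrative NumPy sketch; the layer sizes are hypothetical).

```python
import numpy as np

# A convolutional kernel with the 4D shape (#input, width, height, #output) from the text;
# the particular sizes below are only illustrative.
in_ch, width, height, out_ch = 64, 3, 3, 128
kernel = np.random.randn(in_ch, width, height, out_ch)

# Reshape to the 2D matrix W of shape (#input * width * height) x #output.
W = kernel.reshape(in_ch * width * height, out_ch)
print(W.shape)  # (576, 128)
```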
Outline and summary. The goal of increasing the communication efficiency of Federated Learning is to reduce the cost of sending $H_t^i$ to the server, while learning from data stored across a large number of devices with limited internet connection and availability for computation. We propose two general classes of approaches, structured updates and sketched updates. In the Experiments section, we evaluate the effect of these methods in training deep neural networks.
In simulated experiments on CIFAR data, we investigate the effect of these techniques on the convergence of the Federated Averaging algorithm (McMahan et al., 2017). With only a slight degradation in convergence speed, we are able to reduce the total amount of data communicated by two orders of magnitude. This lets us obtain good prediction accuracy with an all-convolutional model, while in total communicating less information than the size of the original CIFAR data. In a larger, realistic experiment on user-partitioned text data, we show that we are able to efficiently train a recurrent neural network for next word prediction, before even using the data of every user once. Finally, we note that we achieve the best results when the updates are preprocessed with structured random rotations. The practical utility of this step is unique to our setting, as the cost of applying the random rotations would be dominant in typical parallel implementations of SGD, but is negligible compared to the local training in Federated Learning.

¹ For the sake of simplicity, we discuss only the case of a single matrix, since everything carries over to the setting with multiple matrices, for instance corresponding to individual layers in a deep neural network.
² A weighted sum might be used in place of the average, depending on the specific implementation.
2 STRUCTURED UPDATE
The first type of communication-efficient update restricts the update $H_t^i$ to have a pre-specified structure. Two types of structure are considered in the paper: low rank and random mask. It is important to stress that we directly train updates of this structure, as opposed to approximating/sketching general updates with an object of a specific structure, which is discussed in Section 3.
Low rank. We enforce every update to the local model, $H_t^i \in \mathbb{R}^{d_1 \times d_2}$, to be a low-rank matrix of rank at most $k$, where $k$ is a fixed number. In order to do so, we express $H_t^i$ as the product of two matrices: $H_t^i = A_t^i B_t^i$, where $A_t^i \in \mathbb{R}^{d_1 \times k}$ and $B_t^i \in \mathbb{R}^{k \times d_2}$. In subsequent computation, we generate $A_t^i$ randomly and consider it a constant during the local training procedure, and we optimize only $B_t^i$. Note that in a practical implementation, $A_t^i$ can in this case be compressed in the form of a random seed, and the clients only need to send the trained $B_t^i$ to the server. Such an approach immediately saves a factor of $d_1/k$ in communication. We generate the matrix $A_t^i$ afresh in each round and for each client independently.
We also tried fixing $B_t^i$ and training $A_t^i$, as well as training both $A_t^i$ and $B_t^i$; neither performed as well. Our approach seems to perform as well as the best techniques considered in Denil et al. (2013), without the need for any hand-crafted features. An intuitive explanation for this observation is the following. We can interpret $B_t^i$ as a projection matrix and $A_t^i$ as a reconstruction matrix. Fixing $A_t^i$ and optimizing for $B_t^i$ is akin to asking: "Given a random reconstruction, what is the projection that will recover the most information?" In this case, if the reconstruction is full rank, a projection that recovers the space spanned by the top $k$ eigenvectors exists. However, if we randomly fix the projection and search for a reconstruction, we can be unlucky and the important subspaces might have been projected out, meaning that either no reconstruction will perform well, or a good one will be very hard to find.
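A minimal sketch of the low-rank construction, assuming NumPy and a toy gradient oracle for the local loss, is given below. The seed-sharing convention and all function names are ours, not the paper's; the point is that $A_t^i$ is regenerated from a shared seed on both sides, so the client uploads only $B_t^i$.

```python
import numpy as np

def make_A(seed, d1, k):
    """Regenerate the fixed random matrix A from a shared seed (the seed, not A, is communicated)."""
    return np.random.default_rng(seed).normal(size=(d1, k)) / np.sqrt(k)

def lowrank_update(grad_fn, W, seed, k, steps=10, lr=0.01):
    """Client side: train only B so that the update is H = A @ B, with A held fixed."""
    d1, d2 = W.shape
    A = make_A(seed, d1, k)
    B = np.zeros((k, d2))
    for _ in range(steps):
        G = grad_fn(W + A @ B)   # gradient of the local loss at the shifted model
        B -= lr * (A.T @ G)      # chain rule: dL/dB = A^T (dL/dH)
    return B                     # upload B only: d1/k times smaller than the full update

def reconstruct(seed, B, d1):
    """Server side: rebuild H = A @ B from the same seed and the received B."""
    k = B.shape[0]
    return make_A(seed, d1, k) @ B

# Usage with a toy quadratic loss L(W) = 0.5 * ||W - W_star||^2, whose gradient is W - W_star.
d1, d2, k = 100, 50, 5
W_star = np.random.default_rng(0).normal(size=(d1, d2))
B = lowrank_update(lambda W: W - W_star, np.zeros((d1, d2)), seed=42, k=k)
H = reconstruct(42, B, d1)       # rank-k approximation of a step towards W_star
```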
Random mask. We restrict the update $H_t^i$ to be a sparse matrix, following a pre-defined random sparsity pattern (i.e., a random mask). The pattern is generated afresh in each round and for each client independently. Similar to the low-rank approach, the sparsity pattern can be fully specified by a random seed, and therefore it is only required to send the values of the non-zero entries of $H_t^i$, along with the seed.
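The random-mask variant can be sketched in the same style (again an illustrative NumPy sketch with our own naming; only the non-zero values and the seed would travel over the network).

```python
import numpy as np

def make_mask(seed, shape, keep_fraction):
    """Random sparsity pattern fully determined by a shared seed."""
    return np.random.default_rng(seed).random(shape) < keep_fraction

def masked_update(grad_fn, W, seed, keep_fraction=0.25, steps=10, lr=0.01):
    """Client side: train only the unmasked entries of H; upload the kept values plus the seed."""
    mask = make_mask(seed, W.shape, keep_fraction)
    H = np.zeros_like(W)
    for _ in range(steps):
        G = grad_fn(W + H)
        H -= lr * (G * mask)     # SGD step restricted to the pre-defined sparsity pattern
    return H[mask]               # dense vector of the non-zero entries, in a fixed order

def expand_masked(seed, values, shape, keep_fraction=0.25):
    """Server side: place the received values back into the sparse update."""
    mask = make_mask(seed, shape, keep_fraction)
    H = np.zeros(shape)
    H[mask] = values
    return H
```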
3 SKETCHED UPDATE
The second type of update addressing the communication cost, which we call sketched, first computes the full $H_t^i$ during local training without any constraints, and then approximates, or encodes, the update in a (lossy) compressed form before sending it to the server. The server decodes the updates before doing the aggregation. Such sketching methods have applications in many domains (Woodruff, 2014). We experiment with multiple tools for performing the sketching, which are mutually compatible and can be used jointly:
Subsampling. Instead of sending $H_t^i$, each client only communicates a matrix $\hat{H}_t^i$, which is formed from a random subset of the (scaled) values of $H_t^i$. The server then averages the subsampled updates, producing the global update $\hat{H}_t$. This can be done so that the average of the sampled updates is an unbiased estimator of the true average: $\mathbb{E}[\hat{H}_t] = H_t$. Similar to the random mask structured update, the mask is randomized independently for each client in each round, and the mask itself can be stored as a synchronized seed.
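A sketch of unbiased subsampling, assuming NumPy; scaling the kept entries by the inverse keep probability is what makes each client's sketch an unbiased estimator of its update. The function names are ours.

```python
import numpy as np

def subsample(H, seed, keep_fraction=0.1):
    """Client side: send only a random fraction of the entries, scaled so that E[H_hat] = H."""
    mask = np.random.default_rng(seed).random(H.shape) < keep_fraction
    return H[mask] / keep_fraction        # inverse-probability scaling gives unbiasedness

def expand_subsampled(seed, values, shape, keep_fraction=0.1):
    """Server side: rebuild the sparse unbiased estimate from the shared seed and the values."""
    mask = np.random.default_rng(seed).random(shape) < keep_fraction
    H_hat = np.zeros(shape)
    H_hat[mask] = values
    return H_hat

# The server then averages the expanded H_hat of all selected clients to obtain H_t.
```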
Probabilistic quantization. Another way of compressing the updates is by quantizing the weights.
We first describe the algorithm of quantizing each scalar to one bit. Consider the update $H_t^i$, let $h = (h_1, \dots, h_{d_1 \times d_2}) = \mathrm{vec}(H_t^i)$, and let $h_{\max} = \max_j(h_j)$, $h_{\min} = \min_j(h_j)$. The compressed update of $h$, denoted by $\tilde{h}$, is generated as follows:
$$\tilde{h}_j = \begin{cases} h_{\max}, & \text{with probability } \dfrac{h_j - h_{\min}}{h_{\max} - h_{\min}}, \\[4pt] h_{\min}, & \text{with probability } \dfrac{h_{\max} - h_j}{h_{\max} - h_{\min}}. \end{cases}$$
It is easy to show that $\tilde{h}$ is an unbiased estimator of $h$. This method provides 32× compression compared to a 4-byte float. The error incurred by this compression scheme was analysed, for instance, in Suresh et al. (2017), and it is a special case of the protocol proposed in Konečný & Richtárik (2016).
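For concreteness, the 1-bit probabilistic quantizer described above can be sketched as follows (our own illustrative NumPy code; packing the resulting bit vector into bytes for transmission is omitted).

```python
import numpy as np

def quantize_1bit(h, rng=None):
    """Stochastically round each entry to h_min or h_max so that E[h_tilde] = h."""
    if rng is None:
        rng = np.random.default_rng()
    h_min, h_max = h.min(), h.max()
    if h_max == h_min:                        # degenerate case: constant vector
        return np.zeros(h.shape, dtype=bool), h_min, h_max
    p_max = (h - h_min) / (h_max - h_min)     # probability of rounding up to h_max
    bits = rng.random(h.shape) < p_max        # one bit per coordinate
    return bits, h_min, h_max

def dequantize_1bit(bits, h_min, h_max):
    return np.where(bits, h_max, h_min)

# Empirical unbiasedness check: the average of many quantized copies approaches h.
rng = np.random.default_rng(0)
h = rng.normal(size=10_000)
est = np.mean([dequantize_1bit(*quantize_1bit(h, rng)) for _ in range(200)], axis=0)
print(np.abs(est - h).mean())   # small, and shrinks as more copies are averaged
```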
One can also generalize the above to more than 1 bit for each scalar. For $b$-bit quantization, we first equally divide $[h_{\min}, h_{\max}]$ into $2^b$ intervals. Suppose $h_j$ falls in the interval bounded by $h'$ and $h''$. The quantization operates by replacing $h_{\min}$ and $h_{\max}$ in the above equation by $h'$ and $h''$, respectively. The parameter $b$ then allows for a simple way of balancing accuracy and communication costs.
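The b-bit generalization can be sketched in the same style (again our own illustrative NumPy code). Note one detail glossed over here: $2^b$ intervals have $2^b + 1$ distinct endpoints, so a practical encoding must handle the extra level; the sketch below simply stores an integer level per coordinate.

```python
import numpy as np

def quantize_bbit(h, b, rng=None):
    """Stochastic b-bit quantization: round each entry to an endpoint of its interval, unbiasedly."""
    if rng is None:
        rng = np.random.default_rng()
    h_min, h_max = h.min(), h.max()
    if h_max == h_min:
        return np.zeros(h.shape, dtype=np.int64), h_min, h_max
    n_intervals = 2 ** b
    step = (h_max - h_min) / n_intervals
    lower = np.floor((h - h_min) / step).clip(0, n_intervals - 1)   # index of h' for each entry
    p_up = (h - (h_min + lower * step)) / step                      # prob. of rounding up to h''
    levels = (lower + (rng.random(h.shape) < p_up)).astype(np.int64)
    return levels, h_min, h_max                                     # 2**b + 1 possible levels

def dequantize_bbit(levels, h_min, h_max, b):
    return h_min + levels * (h_max - h_min) / (2 ** b)
```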
Another quantization approach, also motivated by reducing communication while averaging vectors, was recently proposed in Alistarh et al. (2016). Incremental, randomized, and distributed optimization algorithms can be similarly analysed in a quantized-updates setting (Rabbat & Nowak, 2005; Golovin et al., 2013; Gamal & Lai, 2016).
Improving the quantization by structured random rotations. The above 1-bit and multi-bit quantization approaches work best when the scales are approximately equal across different dimensions. For example, when $h_{\max} = 1$, $h_{\min} = -1$, and most of the values are 0, the 1-bit quantization will lead to a large error. We note that applying a random rotation to $h$ before the quantization (multiplying $h$ by a random orthogonal matrix) solves this issue. This claim is theoretically supported in Suresh et al. (2017), which shows that the structured random rotation can reduce the quantization error by a factor of $O(d / \log d)$, where $d$ is the dimension of $h$. We will show its practical utility in the next section.
In the decoding phase, the server needs to perform the inverse rotation before aggregating all the updates. Note that in practice, the dimension of $h$ can easily be as high as $d = 10^6$ or more, and it is computationally prohibitive to generate ($O(d^3)$) and apply ($O(d^2)$) a general rotation matrix. As in Suresh et al. (2017), we use a type of structured rotation matrix which is the product of a Walsh-Hadamard matrix and a binary diagonal matrix. This reduces the computational complexity of generating and applying the matrix to $O(d)$ and $O(d \log d)$, which is negligible compared to the local training within Federated Learning.
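A sketch of the structured rotation: multiply by a random ±1 diagonal matrix and then by a normalized Walsh-Hadamard matrix, implemented with the O(d log d) fast transform. This is our own illustrative implementation; it pads the vector to the next power of two, a detail not discussed above.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform of a length-2^k vector, O(d log d)."""
    x = np.asarray(x, dtype=float).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)        # normalization makes the transform orthogonal (and self-inverse)

def random_rotation(v, seed):
    """Apply HD: a random +/-1 diagonal D, then the normalized Hadamard transform H."""
    n = 1 << int(np.ceil(np.log2(len(v))))               # pad to the next power of two
    padded = np.zeros(n)
    padded[:len(v)] = v
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)
    return fwht(padded * signs)

def inverse_rotation(rotated, seed, original_len):
    """Server side: undo H, then undo D, then drop the padding."""
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=len(rotated))
    return (fwht(rotated) * signs)[:original_len]

# Round-trip check.
v = np.random.default_rng(0).normal(size=1000)
r = random_rotation(v, seed=7)
print(np.allclose(inverse_rotation(r, seed=7, original_len=len(v)), v))   # True
```

In practice the quantizer from the previous paragraphs would be applied to the rotated vector, and the server would dequantize and then apply the inverse rotation before averaging.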
4 EXPERIMENTS
We conducted experiments using Federated Learning to train deep neural networks for two different tasks. First, we experiment with the CIFAR-10 image classification task (Krizhevsky, 2009) with convolutional networks and an artificially partitioned dataset, and explore the properties of our proposed algorithms in detail. Second, we use a more realistic scenario for Federated Learning: the public Reddit post data (Google BigQuery), to train a recurrent network for next word prediction.
The Reddit dataset is particularly useful for simulated Federated Learning experiments, as it comes with a natural per-user data partition (by author of the posts). This gives rise to many of the characteristics expected in a practical implementation: for example, many users have relatively few data points, and the words used by most users are clustered around a specific topic of interest of that particular user.
In all of our experiments, we employ the Federated Averaging algorithm (McMahan et al., 2017), which significantly decreases the number of rounds of communication required to train a good model. Nevertheless, we expect our techniques to show a similar reduction in communication costs when applied to synchronous distributed SGD; see for instance Alistarh et al. (2016). For Federated Averaging, on each round we select multiple clients uniformly at random, each of which performs several epochs of SGD with a learning rate of η on its local dataset. For the structured updates, SGD is restricted to update only in the restricted space, that is, only the entries of $B_t^i$ for low-rank updates and only the unmasked entries for the random-mask technique. From this updated model we compute the updates for each layer, $H_t^i$. In all cases, we run the experiments with a range of choices of the learning rate and report the best result.

Figure 1: Structured updates with the CIFAR data for various size-reduction modes. Low-rank updates in the top row, random-mask updates in the bottom row.
4.1 CONVOLUTIONAL MODELS ON THE CIFAR-10 DATASET
In this section we use the CIFAR-10 dataset to investigate the properties of our proposed methods as part of the Federated Averaging algorithm.
There are 50,000 training examples in the CIFAR-10 dataset, which we randomly partitioned into 100 clients, each containing 500 training examples. The model architecture we used was the all-convolutional model taken from what is described as "Model C" in Springenberg et al. (2014), for a total of over $10^6$ parameters. While this model is not the state of the art, it is sufficient for our needs, as our goal is to evaluate our compression methods, not to achieve the best possible accuracy on this task.
The model has 9 convolutional layers, the first and last of which have significantly fewer parameters than the others. Hence, throughout this section, when we try to reduce the size of the individual updates, we only compress the inner 7 layers, each of which has the same number of parameters.³ We denote this by the keyword 'mode', for all approaches. For low-rank updates, 'mode = 25%' refers to the rank of the update being set to 1/4 of the rank of the full layer transformation; for random mask or sketching, it refers to all but 25% of the parameters being zeroed out.
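As a small worked example of the 'mode' keyword under the definitions above (the layer shape here is hypothetical, not taken from the paper):

```python
# Hypothetical inner layer reshaped to a d1 x d2 matrix (sizes are illustrative only).
d1, d2 = 576, 192
mode = 0.25

rank_k = int(mode * min(d1, d2))      # low rank: k = 25% of the full rank        -> 48
kept = int(mode * d1 * d2)            # random mask / sketching: 25% of entries kept
full_size = d1 * d2                   # entries in a full update                  -> 110592
lowrank_upload = rank_k * d2          # only B is sent (A comes from a seed)      -> 9216
mask_upload = kept                    # only the kept values are sent (plus seed) -> 27648
print(rank_k, full_size, lowrank_upload, mask_upload)
```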
In the first experiment, summarized in Figure 1, we compare the two types of structured updates introduced in Section 2: low rank in the top row and random mask in the bottom row. The main message is that random mask performs significantly better than low rank as we reduce the size of the updates. In particular, the convergence speed of random mask seems to be essentially unaffected when measured in terms of the number of rounds. Consequently, if the goal were only to minimize the upload size, the version with reduced update size is a clear winner, as seen in the right column.
In Figure 2, we compare the performance of structured and sketched updates, without any quantization. Since the structured random-mask updates performed better above, we omit the low-rank update from this comparison for clarity. We compare this with the performance of the sketched updates, with and without preprocessing the update using random rotations, as described in Section 3, and for two different modes. We denote the randomized Hadamard rotation by 'HD' and no rotation by 'I'.
³ We also tried reducing the size of all 9 layers, but this yields negligible savings in communication, while slightly degrading convergence speed.

Citations
Posted Content
TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

6,953 citations


Cites background from "Federated Learning: Strategies for ..."

  • ...However, it will always be the case that there are applications and scenarios where using a smaller or less expensive model is helpful, for example when performing client-side inference or federated learning [Konečnỳ et al., 2015, 2016]....


Posted Content
H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Aguera y Arcas
TL;DR: This work presents a practical method for the federated learning of deep networks based on iterative model averaging, and conducts an extensive empirical evaluation, considering five different model architectures and four datasets.
Abstract: Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.

5,936 citations

Journal ArticleDOI
TL;DR: In this paper, a taxonomy of recent contributions related to explainability of different machine learning models, including those aimed at explaining Deep Learning methods, is presented, and a second dedicated taxonomy is built and examined in detail.

2,827 citations

Journal ArticleDOI
TL;DR: This work introduces a comprehensive secure federated-learning framework, which includes horizontal federated learning, vertical federated learning, and federated transfer learning, and provides a comprehensive survey of existing works on this subject.
Abstract: Today’s artificial intelligence still faces two major challenges. One is that, in most industries, data exists in the form of isolated islands. The other is the strengthening of data privacy and security. We propose a possible solution to these challenges: secure federated learning. Beyond the federated-learning framework first proposed by Google in 2016, we introduce a comprehensive secure federated-learning framework, which includes horizontal federated learning, vertical federated learning, and federated transfer learning. We provide definitions, architectures, and applications for the federated-learning framework, and provide a comprehensive survey of existing works on this subject. In addition, we propose building data networks among organizations based on federated mechanisms as an effective solution to allowing knowledge to be shared without compromising user privacy.

2,593 citations


Cites background from "Federated Learning: Strategies for ..."

  • ...The concept of federated learning is proposed by Google recently [36, 37, 41]....


Posted Content
TL;DR: Previous efforts to define explainability in Machine Learning are summarized, establishing a novel definition that covers prior conceptual propositions with a major focus on the audience for which explainability is sought, and a taxonomy of recent contributions related to the explainability of different Machine Learning models are proposed.
Abstract: In the last years, Artificial Intelligence (AI) has achieved a notable momentum that may deliver the best of expectations over many application sectors across the field. For this to occur, the entire community stands in front of the barrier of explainability, an inherent problem of AI techniques brought by sub-symbolism (e.g. ensembles or Deep Neural Networks) that were not present in the last hype of AI. Paradigms underlying this problem fall within the so-called eXplainable AI (XAI) field, which is acknowledged as a crucial feature for the practical deployment of AI models. This overview examines the existing literature in the field of XAI, including a prospect toward what is yet to be reached. We summarize previous efforts to define explainability in Machine Learning, establishing a novel definition that covers prior conceptual propositions with a major focus on the audience for which explainability is sought. We then propose and discuss about a taxonomy of recent contributions related to the explainability of different Machine Learning models, including those aimed at Deep Learning methods for which a second taxonomy is built. This literature analysis serves as the background for a series of challenges faced by XAI, such as the crossroads between data fusion and explainability. Our prospects lead toward the concept of Responsible Artificial Intelligence, namely, a methodology for the large-scale implementation of AI methods in real organizations with fairness, model explainability and accountability at its core. Our ultimate goal is to provide newcomers to XAI with a reference material in order to stimulate future research advances, but also to encourage experts and professionals from other disciplines to embrace the benefits of AI in their activity sectors, without any prior bias for its lack of interpretability.

1,602 citations

References
Dissertation
01 Jan 2009
TL;DR: In this paper, the authors describe how to train a multi-layer generative model of natural images, using a dataset of millions of tiny colour images, described in the next section.
Abstract: In this work we describe how to train a multi-layer generative model of natural images. We use a dataset of millions of tiny colour images, described in the next section. This has been attempted by several groups but without success. The models on which we focus are RBMs (Restricted Boltzmann Machines) and DBNs (Deep Belief Networks). These models learn interesting-looking filters, which we show are more useful to a classifier than the raw pixels. We train the classifier on a labeled subset that we have collected and call the CIFAR-10 dataset.

15,005 citations

Posted Content
H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Aguera y Arcas
TL;DR: This work presents a practical method for the federated learning of deep networks based on iterative model averaging, and conducts an extensive empirical evaluation, considering five different model architectures and four datasets.
Abstract: Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.

5,936 citations

Proceedings Article
03 Dec 2012
TL;DR: This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
Abstract: Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly- sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.

3,475 citations


"Federated Learning: Strategies for ..." refers background in this paper

  • ...The above framework differs from conventional distributed machine learning [17, 11, 19, 22, 5, 4] due to the the large number of clients, highly unbalanced and non-i....


Posted Content
TL;DR: A new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are unevenly distributed over an extremely large number of nodes, is introduced, to train a high-quality centralized model.
Abstract: We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are unevenly distributed over an extremely large number of nodes. The goal is to train a high-quality centralized model. We refer to this setting as Federated Optimization. In this setting, communication efficiency is of the utmost importance and minimizing the number of rounds of communication is the principal goal. A motivating example arises when we keep the training data locally on users' mobile devices instead of logging it to a data center for training. In federated optimziation, the devices are used as compute nodes performing computation on their local data in order to update a global model. We suppose that we have extremely large number of devices in the network --- as many as the number of users of a given service, each of which has only a tiny fraction of the total data available. In particular, we expect the number of data points available locally to be much smaller than the number of devices. Additionally, since different users generate data with different patterns, it is reasonable to assume that no device has a representative sample of the overall distribution. We show that existing algorithms are not suitable for this setting, and propose a new algorithm which shows encouraging experimental results for sparse convex problems. This work also sets a path for future research needed in the context of \federated optimization.

1,272 citations


"Federated Learning: Strategies for ..." refers background in this paper

  • ...Recent works show that a careful choice of the server-side learning rate can lead to faster convergence [13, 12, 10]....


  • ...Federated learning [10, 14, 9] proposes an alternative setting, where we train a shared global model under the coordination of a central server, from a federation of participating devices....


Posted Content
TL;DR: This work presents a practical method for the federated learning of deep networks that proves robust to the unbalanced and non-IID data distributions that naturally arise, and allows high-quality models to be trained in relatively few rounds of communication.
Abstract: Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data-center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks that proves robust to the unbalanced and non-IID data distributions that naturally arise. This method allows high-quality models to be trained in relatively few rounds of communication, the principal constraint for federated learning. The key insight is that despite the non-convex loss functions we optimize, parameter averaging over updates from multiple clients produces surprisingly good results, for example decreasing the communication needed to train an LSTM language model by two orders of magnitude.

985 citations


"Federated Learning: Strategies for ..." refers background or methods in this paper

  • ...We employ the Federated Averaging algorithm [13], which significantly decreases the number of rounds of communication required to train a good model....


  • ...Federated learning [9, 13] proposes an alternative setting, where we train a shared global model under the coordination of a central server, from a federation of participating devices which maintain control of their own data....

