Federated Learning: Strategies for Improving Communication Efficiency

TL;DR: Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.


Under review as a conference paper at ICLR 2018
FEDERATED LEARNING: STRATEGIES FOR IMPROVING
COMMUNICATION EFFICIENCY
Anonymous authors
Paper under double-blind review
ABSTRACT
Federated Learning is a machine learning setting where the goal is to train a high-
quality centralized model while training data remains distributed over a large num-
ber of clients each with unreliable and relatively slow network connections. We
consider learning algorithms for this setting where on each round, each client in-
dependently computes an update to the current model based on its local data, and
communicates this update to a central server, where the client-side updates are
aggregated to compute a new global model. The typical clients in this setting are
mobile phones, and communication efficiency is of the utmost importance.
In this paper, we propose two ways to reduce the uplink communication costs:
structured updates, where we directly learn an update from a restricted space
parametrized using a smaller number of variables, e.g. either low-rank or a random
mask; and sketched updates, where we learn a full model update and then com-
press it using a combination of quantization, random rotations, and subsampling
before sending it to the server. Experiments on both convolutional and recurrent
networks show that the proposed methods can reduce the communication cost by
two orders of magnitude.
1 INTRODUCTION
As datasets grow larger and models more complex, training machine learning models increasingly
requires distributing the optimization of model parameters over multiple machines. Existing ma-
chine learning algorithms are designed for highly controlled environments (such as data centers)
where the data is distributed among machines in a balanced and i.i.d. fashion, and high-throughput
networks are available.
Recently, Federated Learning (and related decentralized approaches) (McMahan & Ramage, 2017; Konečný et al., 2016; McMahan et al., 2017; Shokri & Shmatikov, 2015) have been proposed as
an alternative setting: a shared global model is trained under the coordination of a central server,
from a federation of participating devices. The participating devices (clients) are typically large in
number and have slow or unstable internet connections. A principal motivating example for Feder-
ated Learning arises when the training data comes from users’ interaction with mobile applications.
Federated Learning enables mobile phones to collaboratively learn a shared prediction model while
keeping all the training data on device, decoupling the ability to do machine learning from the need
to store the data in the cloud. The training data is kept locally on users’ mobile devices, and the
devices are used as nodes performing computation on their local data in order to update a global
model. This goes beyond the use of local models that make predictions on mobile devices, by bring-
ing model training to the device as well. The above framework differs from conventional distributed
machine learning (Reddi et al., 2016; Ma et al., 2017; Shamir et al., 2014; Zhang & Lin, 2015; Dean
et al., 2012; Chilimbi et al., 2014) due to the very large number of clients, highly unbalanced and
non-i.i.d. data available on each client, and relatively poor network connections. In this work, our
focus is on the last constraint, since these unreliable and asymmetric connections pose a particular
challenge to practical Federated Learning.
For simplicity, we consider synchronized algorithms for Federated Learning where a typical round
consists of the following steps:
1. A subset of existing clients is selected, each of which downloads the current model.
2. Each client in the subset computes an updated model based on their local data.
3. The model updates are sent from the selected clients to the server.
4. The server aggregates these models (typically by averaging) to construct an improved
global model; a minimal code sketch of such a round is given below.
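As a concrete illustration of the above round structure, the following is a minimal NumPy sketch of one synchronized round. It is not the implementation used in the paper; the local solver here is plain gradient descent on a toy least-squares objective, and names such as `client_update` and `federated_round` are our own.

```python
import numpy as np

def client_update(W, X, y, lr=0.1, epochs=5):
    """Toy local update: a few gradient steps on a least-squares loss."""
    W = W.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ W - y) / len(X)   # gradient of (1/2n)||XW - y||^2
        W -= lr * grad
    return W

def federated_round(W_global, clients, num_selected=10, seed=None):
    """One synchronized round: select clients, collect their updates, average them."""
    rng = np.random.default_rng(seed)
    selected = rng.choice(len(clients), size=min(num_selected, len(clients)), replace=False)
    updates = []
    for i in selected:
        X, y = clients[i]                         # 1. client downloads the current model
        W_local = client_update(W_global, X, y)   # 2. client computes an updated model locally
        updates.append(W_local - W_global)        # 3. only the update H_i is sent to the server
    H = np.mean(updates, axis=0)                  # 4. server averages the updates
    return W_global + H                           #    and applies them (eta_t = 1)

# Usage with synthetic per-client data: 100 clients, each with 20 examples.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 5)), rng.normal(size=(20, 3))) for _ in range(100)]
W = np.zeros((5, 3))
for _ in range(5):
    W = federated_round(W, clients)
```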
A naive implementation of the above framework requires that each client sends a full model (or a full model update) back to the server in each round. For large models, this step is likely to be the bottleneck of Federated Learning due to multiple factors. One factor is the asymmetry of internet connection speeds: the uplink is typically much slower than the downlink. The US average broadband speed was 55.0 Mbps download vs. 18.9 Mbps upload, with some internet service providers being significantly more asymmetric, e.g., Xfinity at 125 Mbps down vs. 15 Mbps up (speedtest.net, 2016). Additionally, existing model compression schemes such as Han et al. (2015) can reduce the bandwidth necessary to download the current model, and cryptographic protocols put in place to ensure that no individual client's update can be inspected before averaging with hundreds or thousands of other updates (Bonawitz et al., 2017) further increase the number of bits that need to be uploaded.
It is therefore important to investigate methods which can reduce the uplink communication cost. In
this paper, we study two general approaches:
• Structured updates, where we directly learn an update from a restricted space that can be parametrized using a smaller number of variables.
• Sketched updates, where we learn a full model update, then compress it before sending it to the server.
These approaches, explained in detail in Sections 2 and 3, can be combined, e.g., by first learning a structured update and then sketching it; however, we do not experiment with this combination in this work.
In the following, we formally describe the problem. The goal of Federated Learning is to learn a model with parameters embodied in a real matrix $W \in \mathbb{R}^{d_1 \times d_2}$ from data stored across a large number of clients.¹ We first provide a communication-naive version of Federated Learning. In round $t \geq 0$, the server distributes the current model $W_t$ to a subset $S_t$ of $n_t$ clients. These clients independently update the model based on their local data. Let the updated local models be $W_t^1, W_t^2, \dots, W_t^{n_t}$, so the update of client $i$ can be written as $H_t^i := W_t^i - W_t$, for $i \in S_t$. These updates could be a single gradient computed on the client, but typically will be the result of a more complex calculation, for example, multiple steps of stochastic gradient descent (SGD) taken on the client's local dataset. In any case, each selected client then sends the update back to the server, where the global update is computed by aggregating² all the client-side updates:
$$W_{t+1} = W_t + \eta_t H_t, \qquad H_t := \frac{1}{n_t} \sum_{i \in S_t} H_t^i.$$
The server chooses the learning rate $\eta_t$. For simplicity, we choose $\eta_t = 1$.
In Section 4, we describe Federated Learning for neural networks, where we use a separate 2D matrix $W$ to represent the parameters of each layer. We suppose that $W$ gets right-multiplied, i.e., $d_1$ and $d_2$ represent the output and input dimensions, respectively. Note that the parameters of a fully connected layer are naturally represented as 2D matrices. However, the kernel of a convolutional layer is a 4D tensor of shape #input × width × height × #output. In such a case, $W$ is reshaped from the kernel to the shape (#input × width × height) × #output.
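For example, under the layout just described, the reshaping of a convolutional kernel into the 2D matrix W can be done as follows (an illustrative NumPy sketch; the layer sizes are hypothetical).

```python
import numpy as np

# A convolutional kernel with the 4D shape (#input, width, height, #output) from the text;
# the particular sizes below are only illustrative.
in_ch, width, height, out_ch = 64, 3, 3, 128
kernel = np.random.randn(in_ch, width, height, out_ch)

# Reshape to the 2D matrix W of shape (#input * width * height) x #output.
W = kernel.reshape(in_ch * width * height, out_ch)
print(W.shape)  # (576, 128)
```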
Outline and summary. The goal of increasing the communication efficiency of Federated Learning is to reduce the cost of sending $H_t^i$ to the server, while learning from data stored across a large number of devices with limited internet connection and availability for computation. We propose two general classes of approaches, structured updates and sketched updates. In the Experiments section, we evaluate the effect of these methods in training deep neural networks.
In simulated experiments on CIFAR data, we investigate the effect of these techniques on the convergence of the Federated Averaging algorithm (McMahan et al., 2017). With only a slight degradation in convergence speed, we are able to reduce the total amount of data communicated by two orders of magnitude. This lets us obtain good prediction accuracy with an all-convolutional model, while in total communicating less information than the size of the original CIFAR data. In a larger, realistic experiment on user-partitioned text data, we show that we are able to efficiently train a recurrent neural network for next word prediction, before even using the data of every user once. Finally, we note that we achieve the best results when the updates are preprocessed with structured random rotations. The practical utility of this step is unique to our setting, as the cost of applying the random rotations would be dominant in typical parallel implementations of SGD, but is negligible compared to the local training in Federated Learning.

¹ For the sake of simplicity, we discuss only the case of a single matrix, since everything carries over to the setting with multiple matrices, for instance corresponding to individual layers in a deep neural network.
² A weighted sum might be used in place of the average, depending on the specific implementation.
2 STRUCTURED UPDATE
The first type of communication-efficient update restricts the update $H_t^i$ to have a pre-specified structure. Two types of structure are considered in the paper: low rank and random mask. It is important to stress that we directly train updates of this structure, as opposed to approximating/sketching general updates with an object of a specific structure, which is discussed in Section 3.
Low rank. We enforce every update to the local model, $H_t^i \in \mathbb{R}^{d_1 \times d_2}$, to be a low-rank matrix of rank at most $k$, where $k$ is a fixed number. In order to do so, we express $H_t^i$ as the product of two matrices: $H_t^i = A_t^i B_t^i$, where $A_t^i \in \mathbb{R}^{d_1 \times k}$ and $B_t^i \in \mathbb{R}^{k \times d_2}$. In subsequent computation, we generate $A_t^i$ randomly and consider it a constant during the local training procedure, and we optimize only $B_t^i$. Note that in a practical implementation, $A_t^i$ can in this case be compressed in the form of a random seed, and the clients only need to send the trained $B_t^i$ to the server. Such an approach immediately saves a factor of $d_1/k$ in communication. We generate the matrix $A_t^i$ afresh in each round and for each client independently.
We also tried fixing $B_t^i$ and training $A_t^i$, as well as training both $A_t^i$ and $B_t^i$; neither performed as well. Our approach seems to perform as well as the best techniques considered in Denil et al. (2013), without the need for any hand-crafted features. An intuitive explanation for this observation is the following. We can interpret $B_t^i$ as a projection matrix and $A_t^i$ as a reconstruction matrix. Fixing $A_t^i$ and optimizing for $B_t^i$ is akin to asking: "Given a random reconstruction, what is the projection that will recover the most information?" In this case, if the reconstruction is full rank, a projection that recovers the space spanned by the top $k$ eigenvectors exists. However, if we randomly fix the projection and search for a reconstruction, we can be unlucky and the important subspaces might have been projected out, meaning that either no reconstruction will perform well, or a good one will be very hard to find.
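A minimal sketch of the low-rank construction, assuming NumPy and a toy gradient oracle for the local loss, is given below. The seed-sharing convention and all function names are ours, not the paper's; the point is that $A_t^i$ is regenerated from a shared seed on both sides, so the client uploads only $B_t^i$.

```python
import numpy as np

def make_A(seed, d1, k):
    """Regenerate the fixed random matrix A from a shared seed (the seed, not A, is communicated)."""
    return np.random.default_rng(seed).normal(size=(d1, k)) / np.sqrt(k)

def lowrank_update(grad_fn, W, seed, k, steps=10, lr=0.01):
    """Client side: train only B so that the update is H = A @ B, with A held fixed."""
    d1, d2 = W.shape
    A = make_A(seed, d1, k)
    B = np.zeros((k, d2))
    for _ in range(steps):
        G = grad_fn(W + A @ B)   # gradient of the local loss at the shifted model
        B -= lr * (A.T @ G)      # chain rule: dL/dB = A^T (dL/dH)
    return B                     # upload B only: d1/k times smaller than the full update

def reconstruct(seed, B, d1):
    """Server side: rebuild H = A @ B from the same seed and the received B."""
    k = B.shape[0]
    return make_A(seed, d1, k) @ B

# Usage with a toy quadratic loss L(W) = 0.5 * ||W - W_star||^2, whose gradient is W - W_star.
d1, d2, k = 100, 50, 5
W_star = np.random.default_rng(0).normal(size=(d1, d2))
B = lowrank_update(lambda W: W - W_star, np.zeros((d1, d2)), seed=42, k=k)
H = reconstruct(42, B, d1)       # rank-k approximation of a step towards W_star
```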
Random mask. We restrict the update $H_t^i$ to be a sparse matrix, following a pre-defined random sparsity pattern (i.e., a random mask). The pattern is generated afresh in each round and for each client independently. Similar to the low-rank approach, the sparsity pattern can be fully specified by a random seed, and therefore it is only required to send the values of the non-zero entries of $H_t^i$, along with the seed.
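The random-mask variant can be sketched in the same style (again an illustrative NumPy sketch with our own naming; only the non-zero values and the seed would travel over the network).

```python
import numpy as np

def make_mask(seed, shape, keep_fraction):
    """Random sparsity pattern fully determined by a shared seed."""
    return np.random.default_rng(seed).random(shape) < keep_fraction

def masked_update(grad_fn, W, seed, keep_fraction=0.25, steps=10, lr=0.01):
    """Client side: train only the unmasked entries of H; upload the kept values plus the seed."""
    mask = make_mask(seed, W.shape, keep_fraction)
    H = np.zeros_like(W)
    for _ in range(steps):
        G = grad_fn(W + H)
        H -= lr * (G * mask)     # SGD step restricted to the pre-defined sparsity pattern
    return H[mask]               # dense vector of the non-zero entries, in a fixed order

def expand_masked(seed, values, shape, keep_fraction=0.25):
    """Server side: place the received values back into the sparse update."""
    mask = make_mask(seed, shape, keep_fraction)
    H = np.zeros(shape)
    H[mask] = values
    return H
```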
3 SKETCHED UPDATE
The second type of update addressing the communication cost, which we call sketched, first computes the full $H_t^i$ during local training without any constraints, and then approximates, or encodes, the update in a (lossy) compressed form before sending it to the server. The server decodes the updates before doing the aggregation. Such sketching methods have applications in many domains (Woodruff, 2014). We experiment with multiple tools for performing the sketching, which are mutually compatible and can be used jointly:
Subsampling. Instead of sending $H_t^i$, each client only communicates a matrix $\hat{H}_t^i$, which is formed from a random subset of the (scaled) values of $H_t^i$. The server then averages the subsampled updates, producing the global update $\hat{H}_t$. This can be done so that the average of the sampled updates is an unbiased estimator of the true average: $\mathbb{E}[\hat{H}_t] = H_t$. Similar to the random mask structured update, the mask is randomized independently for each client in each round, and the mask itself can be stored as a synchronized seed.
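A sketch of unbiased subsampling, assuming NumPy; scaling the kept entries by the inverse keep probability is what makes each client's sketch an unbiased estimator of its update. The function names are ours.

```python
import numpy as np

def subsample(H, seed, keep_fraction=0.1):
    """Client side: send only a random fraction of the entries, scaled so that E[H_hat] = H."""
    mask = np.random.default_rng(seed).random(H.shape) < keep_fraction
    return H[mask] / keep_fraction        # inverse-probability scaling gives unbiasedness

def expand_subsampled(seed, values, shape, keep_fraction=0.1):
    """Server side: rebuild the sparse unbiased estimate from the shared seed and the values."""
    mask = np.random.default_rng(seed).random(shape) < keep_fraction
    H_hat = np.zeros(shape)
    H_hat[mask] = values
    return H_hat

# The server then averages the expanded H_hat of all selected clients to obtain H_t.
```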
Probabilistic quantization. Another way of compressing the updates is by quantizing the weights.
We first describe the algorithm of quantizing each scalar to one bit. Consider the update $H_t^i$, let $h = (h_1, \dots, h_{d_1 \times d_2}) = \mathrm{vec}(H_t^i)$, and let $h_{\max} = \max_j(h_j)$, $h_{\min} = \min_j(h_j)$. The compressed update of $h$, denoted by $\tilde{h}$, is generated as follows:
$$\tilde{h}_j = \begin{cases} h_{\max}, & \text{with probability } \dfrac{h_j - h_{\min}}{h_{\max} - h_{\min}}, \\[4pt] h_{\min}, & \text{with probability } \dfrac{h_{\max} - h_j}{h_{\max} - h_{\min}}. \end{cases}$$
It is easy to show that $\tilde{h}$ is an unbiased estimator of $h$. This method provides 32× compression compared to a 4-byte float. The error incurred by this compression scheme was analysed, for instance, in Suresh et al. (2017), and it is a special case of the protocol proposed in Konečný & Richtárik (2016).
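For concreteness, the 1-bit probabilistic quantizer described above can be sketched as follows (our own illustrative NumPy code; packing the resulting bit vector into bytes for transmission is omitted).

```python
import numpy as np

def quantize_1bit(h, rng=None):
    """Stochastically round each entry to h_min or h_max so that E[h_tilde] = h."""
    if rng is None:
        rng = np.random.default_rng()
    h_min, h_max = h.min(), h.max()
    if h_max == h_min:                        # degenerate case: constant vector
        return np.zeros(h.shape, dtype=bool), h_min, h_max
    p_max = (h - h_min) / (h_max - h_min)     # probability of rounding up to h_max
    bits = rng.random(h.shape) < p_max        # one bit per coordinate
    return bits, h_min, h_max

def dequantize_1bit(bits, h_min, h_max):
    return np.where(bits, h_max, h_min)

# Empirical unbiasedness check: the average of many quantized copies approaches h.
rng = np.random.default_rng(0)
h = rng.normal(size=10_000)
est = np.mean([dequantize_1bit(*quantize_1bit(h, rng)) for _ in range(200)], axis=0)
print(np.abs(est - h).mean())   # small, and shrinks as more copies are averaged
```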
One can also generalize the above to more than 1 bit for each scalar. For $b$-bit quantization, we first equally divide $[h_{\min}, h_{\max}]$ into $2^b$ intervals. Suppose $h_j$ falls in the interval bounded by $h'$ and $h''$. The quantization operates by replacing $h_{\min}$ and $h_{\max}$ in the above equation by $h'$ and $h''$, respectively. The parameter $b$ then allows for a simple way of balancing accuracy and communication costs.
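The b-bit generalization can be sketched in the same style (again our own illustrative NumPy code). Note one detail glossed over here: $2^b$ intervals have $2^b + 1$ distinct endpoints, so a practical encoding must handle the extra level; the sketch below simply stores an integer level per coordinate.

```python
import numpy as np

def quantize_bbit(h, b, rng=None):
    """Stochastic b-bit quantization: round each entry to an endpoint of its interval, unbiasedly."""
    if rng is None:
        rng = np.random.default_rng()
    h_min, h_max = h.min(), h.max()
    if h_max == h_min:
        return np.zeros(h.shape, dtype=np.int64), h_min, h_max
    n_intervals = 2 ** b
    step = (h_max - h_min) / n_intervals
    lower = np.floor((h - h_min) / step).clip(0, n_intervals - 1)   # index of h' for each entry
    p_up = (h - (h_min + lower * step)) / step                      # prob. of rounding up to h''
    levels = (lower + (rng.random(h.shape) < p_up)).astype(np.int64)
    return levels, h_min, h_max                                     # 2**b + 1 possible levels

def dequantize_bbit(levels, h_min, h_max, b):
    return h_min + levels * (h_max - h_min) / (2 ** b)
```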
Another quantization approach, also motivated by reducing communication while averaging vectors, was recently proposed in Alistarh et al. (2016). Incremental, randomized, and distributed optimization algorithms can be similarly analysed in a quantized-updates setting (Rabbat & Nowak, 2005; Golovin et al., 2013; Gamal & Lai, 2016).
Improving the quantization by structured random rotations. The above 1-bit and multi-bit quantization approaches work best when the scales are approximately equal across different dimensions. For example, when $h_{\max} = 1$, $h_{\min} = -1$, and most of the values are 0, the 1-bit quantization will lead to a large error. We note that applying a random rotation to $h$ before the quantization (multiplying $h$ by a random orthogonal matrix) solves this issue. This claim is theoretically supported in Suresh et al. (2017), which shows that the structured random rotation can reduce the quantization error by a factor of $O(d / \log d)$, where $d$ is the dimension of $h$. We will show its practical utility in the next section.
In the decoding phase, the server needs to perform the inverse rotation before aggregating all the updates. Note that in practice, the dimension of $h$ can easily be as high as $d = 10^6$ or more, and it is computationally prohibitive to generate ($O(d^3)$) and apply ($O(d^2)$) a general rotation matrix. As in Suresh et al. (2017), we use a type of structured rotation matrix which is the product of a Walsh-Hadamard matrix and a binary diagonal matrix. This reduces the computational complexity of generating and applying the matrix to $O(d)$ and $O(d \log d)$, which is negligible compared to the local training within Federated Learning.
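A sketch of the structured rotation: multiply by a random ±1 diagonal matrix and then by a normalized Walsh-Hadamard matrix, implemented with the O(d log d) fast transform. This is our own illustrative implementation; it pads the vector to the next power of two, a detail not discussed above.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform of a length-2^k vector, O(d log d)."""
    x = np.asarray(x, dtype=float).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)        # normalization makes the transform orthogonal (and self-inverse)

def random_rotation(v, seed):
    """Apply HD: a random +/-1 diagonal D, then the normalized Hadamard transform H."""
    n = 1 << int(np.ceil(np.log2(len(v))))               # pad to the next power of two
    padded = np.zeros(n)
    padded[:len(v)] = v
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)
    return fwht(padded * signs)

def inverse_rotation(rotated, seed, original_len):
    """Server side: undo H, then undo D, then drop the padding."""
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=len(rotated))
    return (fwht(rotated) * signs)[:original_len]

# Round-trip check.
v = np.random.default_rng(0).normal(size=1000)
r = random_rotation(v, seed=7)
print(np.allclose(inverse_rotation(r, seed=7, original_len=len(v)), v))   # True
```

In practice the quantizer from the previous paragraphs would be applied to the rotated vector, and the server would dequantize and then apply the inverse rotation before averaging.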
4 EXPERIMENTS
We conducted experiments using Federated Learning to train deep neural networks for two different tasks. First, we experiment with the CIFAR-10 image classification task (Krizhevsky, 2009) with convolutional networks and an artificially partitioned dataset, and explore the properties of our proposed algorithms in detail. Second, we use a more realistic scenario for Federated Learning: the public Reddit post data (Google BigQuery), to train a recurrent network for next word prediction.
The Reddit dataset is particularly useful for simulated Federated Learning experiments, as it comes with a natural per-user data partition (by author of the posts). This gives rise to many of the characteristics expected in a practical implementation: for example, many users have relatively few data points, and the words used by most users are clustered around a specific topic of interest of that particular user.
In all of our experiments, we employ the Federated Averaging algorithm (McMahan et al., 2017), which significantly decreases the number of rounds of communication required to train a good model. Nevertheless, we expect our techniques to show a similar reduction in communication costs when applied to synchronous distributed SGD; see for instance Alistarh et al. (2016). For Federated Averaging, on each round we select multiple clients uniformly at random, each of which performs several epochs of SGD with a learning rate of η on its local dataset. For the structured updates, SGD is restricted to update only in the restricted space, that is, only the entries of $B_t^i$ for low-rank updates and only the unmasked entries for the random-mask technique. From this updated model we compute the updates for each layer, $H_t^i$. In all cases, we run the experiments with a range of choices of the learning rate and report the best result.

Figure 1: Structured updates with the CIFAR data for various size-reduction modes. Low-rank updates in the top row, random-mask updates in the bottom row.
4.1 CONVOLUTIONAL MODELS ON THE CIFAR-10 DATASET
In this section we use the CIFAR-10 dataset to investigate the properties of our proposed methods as part of the Federated Averaging algorithm.
There are 50,000 training examples in the CIFAR-10 dataset, which we randomly partitioned into 100 clients, each containing 500 training examples. The model architecture we used was the all-convolutional model taken from what is described as "Model C" in Springenberg et al. (2014), for a total of over $10^6$ parameters. While this model is not the state of the art, it is sufficient for our needs, as our goal is to evaluate our compression methods, not to achieve the best possible accuracy on this task.
The model has 9 convolutional layers, the first and last of which have significantly fewer parameters than the others. Hence, throughout this section, when we try to reduce the size of the individual updates, we only compress the inner 7 layers, each of which has the same number of parameters.³ We denote this by the keyword 'mode', for all approaches. For low-rank updates, 'mode = 25%' refers to the rank of the update being set to 1/4 of the rank of the full layer transformation; for random mask or sketching, it refers to all but 25% of the parameters being zeroed out.
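As a small worked example of the 'mode' keyword under the definitions above (the layer shape here is hypothetical, not taken from the paper):

```python
# Hypothetical inner layer reshaped to a d1 x d2 matrix (sizes are illustrative only).
d1, d2 = 576, 192
mode = 0.25

rank_k = int(mode * min(d1, d2))      # low rank: k = 25% of the full rank        -> 48
kept = int(mode * d1 * d2)            # random mask / sketching: 25% of entries kept
full_size = d1 * d2                   # entries in a full update                  -> 110592
lowrank_upload = rank_k * d2          # only B is sent (A comes from a seed)      -> 9216
mask_upload = kept                    # only the kept values are sent (plus seed) -> 27648
print(rank_k, full_size, lowrank_upload, mask_upload)
```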
In the first experiment, summarized in Figure 1, we compare the two types of structured updates introduced in Section 2: low rank in the top row and random mask in the bottom row. The main message is that random mask performs significantly better than low rank as we reduce the size of the updates. In particular, the convergence speed of random mask seems to be essentially unaffected when measured in terms of the number of rounds. Consequently, if the goal were only to minimize the upload size, the version with reduced update size is a clear winner, as seen in the right column.
In Figure 2, we compare the performance of structured and sketched updates, without any quantization. Since the structured random-mask updates performed better above, we omit the low-rank update from this comparison for clarity. We compare this with the performance of the sketched updates, with and without preprocessing the update using random rotations, as described in Section 3, and for two different modes. We denote the randomized Hadamard rotation by 'HD' and no rotation by 'I'.
³ We also tried reducing the size of all 9 layers, but this yields negligible savings in communication, while slightly degrading convergence speed.

Citations
Posted Content
TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

6,953 citations


Cites background from "Federated Learning: Strategies for ..."

  • ...However, it will always be the case that there are applications and scenarios where using a smaller or less expensive model is helpful, for example when performing client-side inference or federated learning [Konečnỳ et al., 2015, 2016]....


Posted Content
H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Aguera y Arcas
TL;DR: This work presents a practical method for the federated learning of deep networks based on iterative model averaging, and conducts an extensive empirical evaluation, considering five different model architectures and four datasets.
Abstract: Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.

5,936 citations

Journal ArticleDOI
TL;DR: In this paper, a taxonomy of recent contributions related to explainability of different machine learning models, including those aimed at explaining Deep Learning methods, is presented, and a second dedicated taxonomy is built and examined in detail.

2,827 citations

Journal ArticleDOI
TL;DR: This work introduces a comprehensive secure federated-learning framework, which includes horizontal federated learning, vertical federated learning, and federated transfer learning, and provides a comprehensive survey of existing works on this subject.
Abstract: Today’s artificial intelligence still faces two major challenges. One is that, in most industries, data exists in the form of isolated islands. The other is the strengthening of data privacy and security. We propose a possible solution to these challenges: secure federated learning. Beyond the federated-learning framework first proposed by Google in 2016, we introduce a comprehensive secure federated-learning framework, which includes horizontal federated learning, vertical federated learning, and federated transfer learning. We provide definitions, architectures, and applications for the federated-learning framework, and provide a comprehensive survey of existing works on this subject. In addition, we propose building data networks among organizations based on federated mechanisms as an effective solution to allowing knowledge to be shared without compromising user privacy.

2,593 citations


Cites background from "Federated Learning: Strategies for ..."

  • ...The concept of federated learning is proposed by Google recently [36, 37, 41]....


Posted Content
TL;DR: Previous efforts to define explainability in Machine Learning are summarized, establishing a novel definition that covers prior conceptual propositions with a major focus on the audience for which explainability is sought, and a taxonomy of recent contributions related to the explainability of different Machine Learning models are proposed.
Abstract: In the last years, Artificial Intelligence (AI) has achieved a notable momentum that may deliver the best of expectations over many application sectors across the field. For this to occur, the entire community stands in front of the barrier of explainability, an inherent problem of AI techniques brought by sub-symbolism (e.g. ensembles or Deep Neural Networks) that were not present in the last hype of AI. Paradigms underlying this problem fall within the so-called eXplainable AI (XAI) field, which is acknowledged as a crucial feature for the practical deployment of AI models. This overview examines the existing literature in the field of XAI, including a prospect toward what is yet to be reached. We summarize previous efforts to define explainability in Machine Learning, establishing a novel definition that covers prior conceptual propositions with a major focus on the audience for which explainability is sought. We then propose and discuss about a taxonomy of recent contributions related to the explainability of different Machine Learning models, including those aimed at Deep Learning methods for which a second taxonomy is built. This literature analysis serves as the background for a series of challenges faced by XAI, such as the crossroads between data fusion and explainability. Our prospects lead toward the concept of Responsible Artificial Intelligence, namely, a methodology for the large-scale implementation of AI methods in real organizations with fairness, model explainability and accountability at its core. Our ultimate goal is to provide newcomers to XAI with a reference material in order to stimulate future research advances, but also to encourage experts and professionals from other disciplines to embrace the benefits of AI in their activity sectors, without any prior bias for its lack of interpretability.

1,602 citations

References
Dissertation
01 Jan 2009
TL;DR: In this paper, the authors describe how to train a multi-layer generative model of natural images, using a dataset of millions of tiny colour images, described in the next section.
Abstract: In this work we describe how to train a multi-layer generative model of natural images. We use a dataset of millions of tiny colour images, described in the next section. This has been attempted by several groups but without success. The models on which we focus are RBMs (Restricted Boltzmann Machines) and DBNs (Deep Belief Networks). These models learn interesting-looking filters, which we show are more useful to a classifier than the raw pixels. We train the classifier on a labeled subset that we have collected and call the CIFAR-10 dataset.

15,005 citations

Posted Content
H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Aguera y Arcas
TL;DR: This work presents a practical method for the federated learning of deep networks based on iterative model averaging, and conducts an extensive empirical evaluation, considering five different model architectures and four datasets.
Abstract: Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.

5,936 citations

Proceedings Article
03 Dec 2012
TL;DR: This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
Abstract: Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly- sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.

3,475 citations


"Federated Learning: Strategies for ..." refers background in this paper

  • ...The above framework differs from conventional distributed machine learning [17, 11, 19, 22, 5, 4] due to the the large number of clients, highly unbalanced and non-i....


Posted Content
TL;DR: A new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are unevenly distributed over an extremely large number of nodes, is introduced, to train a high-quality centralized model.
Abstract: We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are unevenly distributed over an extremely large number of nodes. The goal is to train a high-quality centralized model. We refer to this setting as Federated Optimization. In this setting, communication efficiency is of the utmost importance and minimizing the number of rounds of communication is the principal goal. A motivating example arises when we keep the training data locally on users' mobile devices instead of logging it to a data center for training. In federated optimziation, the devices are used as compute nodes performing computation on their local data in order to update a global model. We suppose that we have extremely large number of devices in the network --- as many as the number of users of a given service, each of which has only a tiny fraction of the total data available. In particular, we expect the number of data points available locally to be much smaller than the number of devices. Additionally, since different users generate data with different patterns, it is reasonable to assume that no device has a representative sample of the overall distribution. We show that existing algorithms are not suitable for this setting, and propose a new algorithm which shows encouraging experimental results for sparse convex problems. This work also sets a path for future research needed in the context of \federated optimization.

1,272 citations


"Federated Learning: Strategies for ..." refers background in this paper

  • ...Recent works show that a careful choice of the server-side learning rate can lead to faster convergence [13, 12, 10]....


  • ...Federated learning [10, 14, 9] proposes an alternative setting, where we train a shared global model under the coordination of a central server, from a federation of participating devices....


Posted Content
TL;DR: This work presents a practical method for the federated learning of deep networks that proves robust to the unbalanced and non-IID data distributions that naturally arise, and allows high-quality models to be trained in relatively few rounds of communication.
Abstract: Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data-center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks that proves robust to the unbalanced and non-IID data distributions that naturally arise. This method allows high-quality models to be trained in relatively few rounds of communication, the principal constraint for federated learning. The key insight is that despite the non-convex loss functions we optimize, parameter averaging over updates from multiple clients produces surprisingly good results, for example decreasing the communication needed to train an LSTM language model by two orders of magnitude.

985 citations


"Federated Learning: Strategies for ..." refers background or methods in this paper

  • ...We employ the Federated Averaging algorithm [13], which significantly decreases the number of rounds of communication required to train a good model....


  • ...Federated learning [9, 13] proposes an alternative setting, where we train a shared global model under the coordination of a central server, from a federation of participating devices which maintain control of their own data....

