Multi-Task Federated Learning for Personalised Deep Neural Networks in Edge Computing
Jed Mills, Jia Hu, Geyong Min
IEEE Transactions on Parallel and Distributed Systems, Vol. 33, Iss. 3, pp. 630-641, 01 Mar 2022 (Open Access)

ORE Open Research Exeter
TITLE
Multi-Task Federated Learning for Personalised Deep Neural Networks in Edge Computing
AUTHORS
Mills, J; Hu, J; Min, G
JOURNAL
IEEE Transactions on Parallel and Distributed Systems
DEPOSITED IN ORE
12 August 2021
This version available at
http://hdl.handle.net/10871/126748
COPYRIGHT AND REUSE
Open Research Exeter makes this work available in accordance with publisher policies.
A NOTE ON VERSIONS
The version presented here may differ from the published version. If citing, you are advised to consult the published version for pagination, volume/issue and date of
publication

Multi-Task Federated Learning for Personalised
Deep Neural Networks in Edge Computing
Jed Mills, Jia Hu, Geyong Min
Abstract—Federated Learning (FL) is an emerging approach for collaboratively training Deep Neural Networks (DNNs) on mobile
devices, without private user data leaving the devices. Previous works have shown that non-Independent and Identically Distributed
(non-IID) user data harms the convergence speed of the FL algorithms. Furthermore, most existing work on FL measures global-model
accuracy, but in many cases, such as user content-recommendation, improving individual User model Accuracy (UA) is the real
objective. To address these issues, we propose a Multi-Task FL (MTFL) algorithm that introduces non-federated Batch-Normalization
(BN) layers into the federated DNN. MTFL benefits UA and convergence speed by allowing users to train models personalised to their
own data. MTFL is compatible with popular iterative FL optimisation algorithms such as Federated Averaging (FedAvg), and we show
empirically that a distributed form of Adam optimisation (FedAvg-Adam) benefits convergence speed even further when used as the
optimisation strategy within MTFL. Experiments using MNIST and CIFAR10 demonstrate that MTFL is able to significantly reduce the
number of rounds required to reach a target UA, by up to 5× when using existing FL optimisation strategies, and with a further 3×
improvement when using FedAvg-Adam. We compare MTFL to competing personalised FL algorithms, showing that it is able to
achieve the best UA for MNIST and CIFAR10 in all considered scenarios. Finally, we evaluate MTFL with FedAvg-Adam on an
edge-computing testbed, showing that its convergence and UA benefits outweigh its overhead.
Index Terms—Federated Learning, Multi-Task Learning, Deep Learning, Edge Computing, Adaptive Optimization.
1 INTRODUCTION
MULTI-access Edge Computing (MEC) [?] moves Cloud services to the network edge, enabling low-latency and real-time processing of applications via content caching and computation offloading [?] [?]. Coupled with the rapidly increasing quantity of data collected by smartphones, Internet-of-Things (IoT) devices, and social networks (SNs), MEC presents an opportunity to store and process huge quantities of data at the edge, close to their source.
Deep Neural Networks (DNNs) for Machine Learning (ML) are becoming increasingly popular for their huge range of potential applications, ease of deployment, and state-of-the-art performance. Training DNNs in supervised learning, however, can be computationally expensive and require an enormous amount of training data, especially with the trend of increasing DNN size. The use of DNNs in MEC has typically involved collecting data from mobile phones/IoT devices/SNs, performing training in the cloud, and then deploying the model at the edge. Concerns about data privacy, however, mean that users are increasingly unwilling to upload their potentially sensitive data, raising the question of how these models will be trained.
Federated Learning (FL) [?] opens new horizons for ML
at the edge. In FL, participating clients collaboratively train an
ML model (typically DNNs), without revealing their private data.
McMahan et al. [?] published an initial investigation into FL with
the Federated Averaging (FedAvg) algorithm. FedAvg works by
initialising a model at a coordinating server before distributing
this model to clients. These clients perform a round of training
J. Mills, J. Hu and G. Min are with the Department of Computer Science,
University of Exeter, EX4 4QF, United Kingdom. E-mail: {jm729, j.hu,
g.min}@exeter.ac.uk. Corresponding authors: Jia Hu, Geyong Min.
on their local datasets and push their new models to the server.
The server averages these models together before sending the new
aggregated model to the clients for the next round of training. We
refer to the people/institutions/etc. that own data for FL as ‘users’,
and to the devices that actually participate in FL as ‘clients’.
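To make the aggregation step concrete, the following minimal sketch shows the server-side weighted average at the heart of FedAvg in plain NumPy. It is an illustration under simple assumptions (client models as lists of arrays, toy dataset sizes), not the authors' implementation.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_num_samples):
    """Weighted average of client models; each client's contribution is proportional
    to the size of its local dataset, as in FedAvg."""
    total = sum(client_num_samples)
    aggregate = [np.zeros_like(layer) for layer in client_weights[0]]
    for weights, n_k in zip(client_weights, client_num_samples):
        for i, layer in enumerate(weights):
            aggregate[i] += (n_k / total) * layer
    return aggregate

# Toy example: three clients, a two-layer model, random "locally trained" weights.
rng = np.random.default_rng(0)
clients = [[rng.normal(size=(4, 3)), rng.normal(size=(3,))] for _ in range(3)]
sizes = [100, 50, 200]                     # hypothetical local dataset sizes
global_model = fedavg_aggregate(clients, sizes)
```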
FL is a very promising approach for distributed ML in situations where data cannot be uploaded, in order to protect clients’ privacy.
Therefore, FL is well suited for real-world scenarios such as
analysing sensitive healthcare data [?] [?], next-word prediction on
mobile keyboards [?], and content-recommendation [?]. However,
FL presents multiple unique challenges:
• Clients usually do not have Independent and Identically Distributed (IID) training data. Each client has data generated by itself, and can have noisy data or only a subset of all features/labels. These factors can all substantially hinder training of the FL model.
• FL research typically uses the performance metric of global-model accuracy on a centralised test-set. However, in many cases, individual model accuracy on clients is the real objective, motivating ‘personalised’ FL that creates unique models for FL clients to improve local performance. Yet the best way of incorporating personalisation into FL remains an under-researched topic.
• Due to the non-IID nature of client datasets, the performance of the global FL model may be higher on some clients than others. This could even lead some clients to receive a worse model than the one they could have trained independently.
This paper addresses the above challenges by proposing a Multi-Task FL (MTFL) algorithm that allows clients to train personalised DNNs that both improve local model accuracy and help to further enhance client privacy. MTFL has a lower storage cost for personalisation and a lower computing cost than other personalised FL algorithms, as it does not require extra steps of SGD on clients during the training loop or at personalisation time [?] [?] [?] [?].
As client datasets in FL are usually non-IID, clients can be viewed as attempting to optimise their models during local training for disparate tasks. Our MTFL approach takes the Batch-Normalisation (BN) layers that are commonly incorporated into DNN architectures, and keeps them private to each client. Mudrakarta et al. [?] previously showed that private BN layers improved Multi-Task Learning (MTL) performance for joint training on ImageNet/Places-365 in the centralised setting.
Using private BN layers has the dual benefit of personalising
each model to the clients’ local data as well as helping to preserve
data privacy: as some parameters of client models are not uploaded
to the server, less information about a client’s data distribution
can be gleaned from the uploaded model. Our MTFL approach
using BN layers also has a storage-cost benefit compared to other
personalised FL algorithms: BN layers typically contain a tiny
fraction of the total parameters of a DNN, and only these BN
parameters need to be stored between FL rounds, compared to
entire personalised DNN models of competing algorithms [?] [?]
[?].
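To illustrate the storage argument, the sketch below (PyTorch-style, with a small generic CNN used purely for illustration rather than the paper's architecture) separates the BN state a client would keep private between rounds from the federated parameters it uploads:

```python
import torch.nn as nn

# A generic CNN with BN layers; the architecture is an assumption for illustration only.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, 3), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)

bn_state = {}   # kept private on the client between rounds (trained params + statistics)
for name, module in model.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        for key, value in module.state_dict().items():   # weight, bias, running_mean, running_var, ...
            bn_state[f"{name}.{key}"] = value

# Everything else is federated: uploaded to the server and averaged each round.
fed_state = {k: v for k, v in model.state_dict().items() if k not in bn_state}

total = sum(v.numel() for v in model.state_dict().values())
print(f"BN fraction of model state: {sum(v.numel() for v in bn_state.values()) / total:.2%}")
```

Even in this toy model the BN state amounts to only a few per cent of all parameters, and the fraction shrinks further for larger DNNs, since BN size scales with the number of channels rather than with the number of weights.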
MTFL adds personalisation on top of the typical iterative FL
framework. FedAvg and other popular algorithms are instances of
this iterative optimisation framework [?] [?] [?]. Most of these FL algorithms use vanilla Stochastic Gradient Descent (SGD) on clients; however, momentum-based optimisation strategies such as Adam [?] have the potential to improve the convergence speed of FL training. We show that a distributed optimisation technique
using Adam (FedAvg-Adam) shows substantial speedup in terms
of communication rounds compared to FedAvg, and works very
well within the MTFL algorithm.
Our work makes the following contributions:
• We propose an MTFL algorithm that adds Multi-Task learning on top of general iterative-FL algorithms, allowing users to learn DNN models that are personalised for their own data. MTFL uses private Batch Normalisation (BN) layers to achieve this personalisation, which provides an added privacy benefit.
• We propose a new metric for measuring the performance of FL algorithms: User model Accuracy (UA). UA better reflects a common objective of FL (increasing test accuracy on clients), as opposed to the standard global-model accuracy.
• We analyse the impact that private BN layers have on the activations of MTFL models during inference, providing insights into the source of their impact. We also analyse the training and testing performance of MTFL when keeping either the trained parameters or statistics of BN layers private, demonstrating that MTFL provides a better balance between convergence and regularisation compared to FL or independent training.
• We conduct extensive simulations on the MNIST and CIFAR10 datasets. The results show that MTFL with FedAvg is able to reach a target UA in up to 5× fewer rounds than when using only FL, with FedAvg-Adam providing a further 3× improvement. Other experiments show that MTFL is able to significantly improve average UA compared to other state-of-the-art personalised FL algorithms.
• We perform experiments using an MEC-like testbed consisting of Raspberry Pi clients and an FL server. The results show that MTFL with FedAvg-Adam’s overheads are outweighed by its substantial UA and convergence speed benefits.
The rest of this paper is organised as follows: Section 2
describes related work; Section 3 details the proposed MTFL
algorithm, the effect that keeping private BN layers within MTFL
has on training and inference, and the FedAvg-Adam optimisation
strategy; Section 4 presents and discusses experiments using both
simulations and an MEC-like testbed; and Section 5 concludes the
paper.
2 RELATED WORK
As this work addresses several challenges to existing FL algorithms, we overview the related work in three sub-topics of FL: works considering personalisation, works dealing with practical and deployment challenges, and works aiming to improve convergence speed and global-model performance.
2.1 Personalised Federated Learning
Several authors have considered the approach of ‘personalising’
FL models in order to tailor model performance to non-IID user
datasets.
Meta-Learning aims to train a model that is easy to fine-tune with few samples. Fallah et al. [?] proposed the Per-FedAvg algorithm based on Model-Agnostic Meta-Learning (MAML), which adds a first-order adaptation term to the client loss functions, so they can be tuned to client datasets with one step. Jiang et al. [?] highlighted the connection between FedAvg and first-order MAML updates, and proposed a three-stage training algorithm to improve personalisation.
Other authors propose training a combination of local and global models in FL to improve personalisation. Hanzely and Richtárik [?] added a learnable parameter to allow clients to control the extent of local and global model mixing. Dinh et al. [?] kept a global model and a personal model for each user, performing SGD on their personal model and then updating their copy of the global model in an outer loop. Huang et al. [?] kept a local model on each client, and added a proximal term to client loss functions to keep these models close to a ‘personalised’ cloud model, for the cross-silo FL setting.
Smith et al. [?] proposed MOCHA, which performs Federated MTL by formulating FL as a function of the model weight matrix and a relationship matrix. Their algorithm takes into account the heterogeneous hardware of clients, meaning MOCHA is not directly comparable to our MTFL scheme. Recently, Dinh et al. [?] generalised MOCHA and other algorithms into the FedU framework, including proposing a decentralised version.
Our work proposes a Multi-Task learning approach to achieve personalisation in FL (MTFL). We later show that our approach has substantial convergence speed, personalisation performance, privacy, and storage cost benefits compared to existing personalised FL algorithms.
2.2 Federated Learning in Edge Computing
FL performs distributed computing at the network edge. Some
authors have considered the system design and communication
costs of FL in this environment. Jiang et al. [?] proposed an
FL system that reduces the total data clients upload by selecting
model weights with the largest gradient magnitudes. They also

considered implementation details such as asynchronous or round-robin client updates. Bonawitz et al. [?] produced a FedAvg system design, specifying client/server roles, fault handling, and security. They also provide analytics for their deployment of this system with over 10 million clients. To address the non-IID nature of client datasets in FL, Duan et al. [?] proposed the Astraea framework: client datasets are augmented to help reduce local class imbalances, and mediators are introduced to the global aggregation method.

Fig. 1: Operation of the MTFL algorithm in Edge Computing. Training is performed in rounds until a termination condition is met. Step 1: the server selects a subset of clients from its database to participate in the round, and sends a work request to them. Step 2: clients reply with an accept message depending on physical state and local preferences. Step 3: clients download the global model (and any optimisation parameters) from the server, and update their copy of the global model with private patches (in this work, we use BN layers as patches). Step 4: clients perform local training, before saving their personal patches for the next round. Step 5: the server waits for C fraction of clients to upload their non-private model and optimiser values, or until a time limit. Step 6: the server averages all models, saves the aggregate, and starts a new round.
Several authors have also investigated the impacts of wirelessly connected FL clients. Yang et al. [?] studied different scheduling policies in a wireless FL scenario. Their analysis showed that with a low Signal-to-Interference-plus-Noise Ratio (SINR), simple FL schemes perform well, but that as SINR increases, more intelligent methods of selecting clients are needed. Ahn et al. [?] proposed a Hybrid Federated Distillation scheme for FL with wireless edge devices, including using over-the-air computing and compression methods. Their results showed that their scheme gave better performance in high-noise wireless scenarios.
Other authors have proposed schemes for considering the computing, networking and communication resources of FL clients in edge computing. Wang et al. [?] performed experiments with smartphones to argue that the computation-time (as opposed to communication-time) of FedAvg is the most significant bottleneck for real-world FL, and proposed algorithms to accommodate this computational heterogeneity. Nishio and Yonetani [?] designed a system that collects information about the computing and wireless resources of clients before initiating a round of FL, reducing the real time taken to reach a target accuracy for FedAvg.
These previous works have proposed implementations of FL
systems. However, they do not consider MTL within FL, which is
a main contribution of our work with the MTFL algorithm.
2.3 Federated Learning Performance
The seminal FedAvg algorithm [?] collaboratively trains a model
by sending an initial model to participating clients, who each
perform SGD on the model using their local data. These new
models are sent to the server for averaging and a new round
is begun. Some progress has been made towards improving the
convergence rate of FedAvg. Leroy et al. [?] used Adam adaptive
optimisation when updating the global model on the server. Reddi
et al. [?] also generalised other adaptive optimisation techniques in
the same style and provided convergence guarantees. Our FedAvg-
Adam algorithm differs from these as clients in FedAvg-Adam
perform Adam SGD (as opposed to vanilla SGD), and the Adam
parameters are averaged alongside model weights at the server.
Liu et al. [?] used momentum-SGD on clients, and aggregated the
momentum values of clients on the server alongside the model
weights as an alternative method of accelerating convergence.
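As a minimal sketch of this kind of aggregation (our reading of the description above, not the authors' code), each client could upload its model weights together with its local Adam first- and second-moment estimates, and the server would average all three, weighted by local dataset size:

```python
import numpy as np

def average_states(client_states, client_sizes):
    """Average per-client dicts of arrays (model weights plus Adam moments m and v),
    weighted by each client's dataset size."""
    total = sum(client_sizes)
    keys = client_states[0].keys()
    return {k: sum((n / total) * state[k] for state, n in zip(client_states, client_sizes))
            for k in keys}

# Toy example: each uploaded state holds weights "w" and Adam moment estimates "m", "v".
rng = np.random.default_rng(1)
uploads = [{"w": rng.normal(size=5), "m": rng.normal(size=5), "v": rng.random(5)}
           for _ in range(4)]
sizes = [120, 80, 200, 60]
global_state = average_states(uploads, sizes)   # broadcast back to clients next round
```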
Some works have been produced investigating FL with non-
IID or poor-quality client data. Zhao et al. [?] proposed sharing a
small amount of data between clients to decrease the differences
in their data distributions and improve global model accuracy.
Konstantinov and Lampert [?] evaluated which clients have poor-
quality data by finding the difference between a client model’s
local predictions and predictions using a trusted dataset. Wang
et al. [?] ignored irrelevant client updates during training by
checking if each client’s update aligns with the global model.
The FedAvg-Adam optimisation method presented here uses
adaptive optimisation on clients, rather than SGD, which we later
show converges much faster than when using FedAvg or Adam
optimisation purely on the server.

3 MULTI-TASK FEDERATED LEARNING (MTFL)
Fig. 1 shows a high-level overview of how the MTFL algorithm would operate in the edge-computing environment. More detailed descriptions of the use of BN patches in MTFL, and of optimisation on clients, are given in the later subsections.
The MTFL algorithm is based on the client-server framework; however, rounds are initiated by the server, as shown in Fig. 1. First, the server selects all, or a subset of all, known clients from its database and sends a Work Request message asking them to participate in the FL round (Step 1). Clients will accept a Work Request depending on user preferences (for example, users can set their device to only participate in FL if charging and connected to WiFi). All accepting clients then send an Accept message to the server (Step 2). The server sends the global model (and any associated optimisation parameters) to all accepting clients, who augment their copy of the global model with private patches (Step 3). Clients then perform local training using their own data, creating a different model. Clients save the patch layers from their new model locally, and upload their non-private model parameters to the server (Step 4).
The server waits for clients to finish training and upload their
models (Step 5). It can either wait for a maximum time limit, or for
a given fraction of clients to upload before continuing, depending
on the server preferences. After this, the server will aggregate all
received models to produce a single global model (Step 6) which
is saved on the server, before starting a new round.
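The round just described can be summarised as a server-side loop. The sketch below is pseudocode-like Python: every server/client method it calls (accepts_work, download, start_local_training, poll_uploads, aggregate) is a hypothetical placeholder for whatever transport and aggregation a real deployment would use.

```python
import random
import time

def run_round(server, clients, C=0.5, time_limit_s=600):
    """One MTFL round from the server's perspective (hypothetical interfaces throughout)."""
    selected = random.sample(clients, max(1, int(C * len(clients))))   # Step 1: work requests
    accepted = [c for c in selected if c.accepts_work()]               # Step 2: accept messages
    for client in accepted:
        client.download(server.global_model, server.optimiser_state)   # Step 3: send global model
        client.start_local_training()                                  # Step 4: local training; BN patches stay on-device

    # Step 5: wait for a fraction of clients to upload, or until the time limit expires.
    updates, deadline = [], time.time() + time_limit_s
    while len(updates) < C * len(clients) and time.time() < deadline:
        updates.extend(server.poll_uploads())
        time.sleep(1)

    if updates:                                                        # Step 6: aggregate and save
        server.global_model = server.aggregate(updates)
```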
MTFL therefore offloads the vast majority of computation
to client devices, who perform the actual model training. It
preserves users’ data-privacy more strongly than FedAvg and other
personalised-FL algorithms: not only is user data not uploaded, but
key parts of their local models are not uploaded. The framework
also accounts for client stragglers with its round time/uploading
client fraction limit. Moreover, MTFL utilises patch layers to
improve local model performance on individual users’ non-IID
datasets, making MTFL more personalised.
3.1 User Model Accuracy and MTFL
In many FL works, such as the original FedAvg paper [?], the
authors use a central IID test-set to measure FL performance.
Depending on the FL scenario, this metric may or may not be
desirable. If the intention is to create a single model that has good
performance on IID data, then this method would be suitable.
However, in many FL scenarios, the desire is to create a model that
has good performance on individual user devices. For example,
Google have used FedAvg for their GBoard next-word-prediction
software [?]. The objective was to improve the prediction score
for individual users. As users typically have non-IID data, a single global model may display good performance for some users, and worse performance for others.
We propose using the average User model Accuracy (UA) as
an alternative metric of FL performance. UA is the accuracy on a
client using a local test-set. This test-set for each client should be
drawn from a similar distribution as its training data. In this paper,
we perform experiments on classification problems, but UA could
be altered for different metrics (e.g. error, recall).
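Concretely, average UA can be computed as the unweighted mean of per-client test accuracies, with each client evaluated on its own held-out data. A small illustrative helper (the inputs are hypothetical per-client counts):

```python
def average_user_accuracy(per_client_correct, per_client_total):
    """Mean of per-client test accuracies; every client is evaluated on its own local test-set."""
    accuracies = [correct / total for correct, total in zip(per_client_correct, per_client_total)]
    return sum(accuracies) / len(accuracies)

# e.g. three clients with local test-sets of different sizes
print(average_user_accuracy([45, 88, 19], [50, 100, 25]))   # -> approximately 0.847
```

Whether to weight clients equally (as here) or by local dataset size is a design choice; an unweighted mean treats every user's experience as equally important.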
In FL, user data is often non-IID, so users could be considered
as having different but related learning tasks. It is possible for
an FL scheme to achieve good global-model accuracy, but poor
UA, as the aggregate model may perform poorly on some clients’
datasets (especially if they have a small number of local samples, so are weighted less in the FedAvg averaging step). We propose the MTFL algorithm that allows clients to build different models, while still benefiting from FL, in order to improve the average UA.

Fig. 2: Example composition of a DNN model used in MTFL. Each client’s model consists of shared global parameters ($\Omega_1 \ldots \Omega_4$) for Convolutional (Conv) and Fully-Connected (FC) layers, and private Batch-Normalization (BN) patch layers ($P^k_1$, $P^k_2$, $P^k_3$).
Mudrakarta et al. [?] have previously shown that adding small
per-task ‘patch’ layers to DNNs improved their performance in
MTL scenarios. Patches are therefore a good candidate for training
personalised models for clients.
In FL, the aim is to minimise the following objective function:

$$F_{FL} = \sum_{k=1}^{K} \frac{n_k}{n} \ell_k(\Omega) \qquad (1)$$

where $K$ is the total number of clients, $n_k$ is the number of samples on client $k$, $n$ is the total number of samples across all clients, $\ell_k$ is the loss function on client $k$, and $\Omega$ is the set of global model parameters. Adding unique client patches to the FL model changes the objective function of MTFL to:

$$F_{MTFL} = \sum_{k=1}^{K} \frac{n_k}{n} \ell_k(M_k) \qquad (2)$$

$$M_k = \left(\Omega_1 \cdots \Omega_{i_1 - 1},\; P^k_1,\; \Omega_{i_1 + 1} \cdots \Omega_{i_m - 1},\; P^k_m,\; \Omega_{i_m + 1} \cdots \Omega_j\right) \qquad (3)$$

where $M_k$ is the patched model on client $k$, composed of Federated model parameters $\Omega_1 \cdots \Omega_j$ ($j$ being the total number of Federated layers) and patch parameters $P^k_1 \cdots P^k_m$ ($m$ being the total number of local patches, $\{i\}$ being the set of indexes of the patch parameters) unique to client $k$. Fig. 2 shows an example composition of a DNN model used in MTFL.
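As a concrete reading of Eq. (3), a client builds its model $M_k$ by taking the downloaded federated parameters and substituting its private patches at the patch indexes. The dictionary-based sketch below is purely illustrative (layer contents are placeholder strings):

```python
def compose_patched_model(global_params, private_patches, patch_indexes):
    """Build M_k: start from the federated parameters and place client k's private
    patch parameters (e.g. BN layers) at the designated patch indexes."""
    model = dict(global_params)                 # copy of the downloaded global model
    for index, patch in zip(patch_indexes, private_patches):
        model[index] = patch                    # insert P_1^k ... P_m^k
    return model

# Toy example: five layers, where layers 1 and 3 are private BN patches of client k.
omega = {i: f"federated_layer_{i}" for i in range(5)}
patches_k = ["bn_patch_1_of_client_k", "bn_patch_2_of_client_k"]
M_k = compose_patched_model(omega, patches_k, patch_indexes=[1, 3])
```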
MTFL is a general algorithm for incorporating MTL into
FL. Different optimisation strategies (including FedAvg-Adam
described in Section 3.3) can be used within MTFL, and we later
show that MTFL can substantially reduce the number of rounds to
reach target UA, regardless of the optimisation strategy used.
As shown in Algorithm 1, MTFL runs rounds of communication until a given termination criterion (such as a target UA) is met (Line 2). At each round, a subset $S_r$ of clients is selected to participate from the set of all clients $S$ (Line 3). These clients

References
• D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in Proc. ICLR, 2015.
• H. B. McMahan et al., “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proc. AISTATS, 2017.
• Y. Mao et al., “A Survey on Mobile Edge Computing: The Communication Perspective,” IEEE Communications Surveys & Tutorials, 2017.
• T. Li et al., “Federated Learning: Challenges, Methods, and Future Directions,” IEEE Signal Processing Magazine, 2020.
• Y. Zhao et al., “Federated Learning with Non-IID Data,” arXiv preprint, 2018.