Multi-Task Federated Learning for Personalised Deep Neural Networks in Edge Computing
Jed Mills, Jia Hu, Geyong Min
IEEE Transactions on Parallel and Distributed Systems, Vol. 33, Iss. 3, pp. 630-641, 01 Mar 2022 (Open Access)

ORE Open Research Exeter
TITLE
Multi-Task Federated Learning for Personalised Deep Neural Networks in Edge Computing
AUTHORS
Mills, J; Hu, J; Min, G
JOURNAL
IEEE Transactions on Parallel and Distributed Systems
DEPOSITED IN ORE
12 August 2021
This version available at
http://hdl.handle.net/10871/126748
COPYRIGHT AND REUSE
Open Research Exeter makes this work available in accordance with publisher policies.
A NOTE ON VERSIONS
The version presented here may differ from the published version. If citing, you are advised to consult the published version for pagination, volume/issue and date of
publication

Multi-Task Federated Learning for Personalised
Deep Neural Networks in Edge Computing
Jed Mills, Jia Hu, Geyong Min
Abstract—Federated Learning (FL) is an emerging approach for collaboratively training Deep Neural Networks (DNNs) on mobile
devices, without private user data leaving the devices. Previous works have shown that non-Independent and Identically Distributed
(non-IID) user data harms the convergence speed of the FL algorithms. Furthermore, most existing work on FL measures global-model
accuracy, but in many cases, such as user content-recommendation, improving individual User model Accuracy (UA) is the real
objective. To address these issues, we propose a Multi-Task FL (MTFL) algorithm that introduces non-federated Batch-Normalization
(BN) layers into the federated DNN. MTFL benefits UA and convergence speed by allowing users to train models personalised to their
own data. MTFL is compatible with popular iterative FL optimisation algorithms such as Federated Averaging (FedAvg), and we show
empirically that a distributed form of Adam optimisation (FedAvg-Adam) benefits convergence speed even further when used as the
optimisation strategy within MTFL. Experiments using MNIST and CIFAR10 demonstrate that MTFL is able to significantly reduce the
number of rounds required to reach a target UA, by up to 5× when using existing FL optimisation strategies, and with a further 3×
improvement when using FedAvg-Adam. We compare MTFL to competing personalised FL algorithms, showing that it is able to
achieve the best UA for MNIST and CIFAR10 in all considered scenarios. Finally, we evaluate MTFL with FedAvg-Adam on an
edge-computing testbed, showing that its convergence and UA benefits outweigh its overhead.
Index Terms—Federated Learning, Multi-Task Learning, Deep Learning, Edge Computing, Adaptive Optimization.
1 INTRODUCTION
MULTI-access Edge Computing (MEC) [?] moves Cloud services to the network edge, enabling low-latency and real-time processing of applications via content caching and computation offloading [?] [?]. Coupled with the rapidly increasing quantity of data collected by smartphones, Internet-of-Things (IoT) devices, and social networks (SNs), MEC presents an opportunity to store and process huge quantities of data at the edge, close to their source.
Deep Neural Networks (DNNs) for Machine Learning (ML) are becoming increasingly popular for their huge range of potential applications, ease of deployment, and state-of-the-art performance. Training DNNs in supervised learning, however, can be computationally expensive and require an enormous amount of training data, especially with the trend of increasing DNN size. The use of DNNs in MEC has typically involved collecting data from mobile phones/IoT devices/SNs, performing training in the cloud, and then deploying the model at the edge. Concerns about data privacy, however, mean that users are increasingly unwilling to upload their potentially sensitive data, raising the question of how these models will be trained.
Federated Learning (FL) [?] opens new horizons for ML
at the edge. In FL, participating clients collaboratively train an
ML model (typically DNNs), without revealing their private data.
McMahan et al. [?] published an initial investigation into FL with
the Federated Averaging (FedAvg) algorithm. FedAvg works by
initialising a model at a coordinating server before distributing
this model to clients. These clients perform a round of training
J. Mills, J. Hu and G. Min are with the Department of Computer Science,
University of Exeter, EX4 4QF, United Kingdom. E-mail: {jm729, j.hu,
g.min}@exeter.ac.uk. Corresponding authors: Jia Hu, Geyong Min.
on their local datasets and push their new models to the server.
The server averages these models together before sending the new
aggregated model to the clients for the next round of training. We
refer to the people/institutions/etc. that own data for FL as ‘users’,
and to the devices that actually participate in FL as ‘clients’.
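To make the aggregation step concrete, the following minimal sketch shows the server-side weighted average at the heart of FedAvg in plain NumPy. It is an illustration under simple assumptions (client models as lists of arrays, toy dataset sizes), not the authors' implementation.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_num_samples):
    """Weighted average of client models; each client's contribution is proportional
    to the size of its local dataset, as in FedAvg."""
    total = sum(client_num_samples)
    aggregate = [np.zeros_like(layer) for layer in client_weights[0]]
    for weights, n_k in zip(client_weights, client_num_samples):
        for i, layer in enumerate(weights):
            aggregate[i] += (n_k / total) * layer
    return aggregate

# Toy example: three clients, a two-layer model, random "locally trained" weights.
rng = np.random.default_rng(0)
clients = [[rng.normal(size=(4, 3)), rng.normal(size=(3,))] for _ in range(3)]
sizes = [100, 50, 200]                     # hypothetical local dataset sizes
global_model = fedavg_aggregate(clients, sizes)
```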
FL is a very promising approach for distributed ML in situations where data cannot be uploaded, in order to protect clients’ privacy.
Therefore, FL is well suited for real-world scenarios such as
analysing sensitive healthcare data [?] [?], next-word prediction on
mobile keyboards [?], and content-recommendation [?]. However,
FL presents multiple unique challenges:
• Clients usually do not have Independent and Identically Distributed (IID) training data. Each client has data generated by itself, and can have noisy data or only a subset of all features/labels. These factors can all substantially hinder training of the FL model.
• FL research typically uses the performance metric of global-model accuracy on a centralised test-set. However, in many cases, individual model accuracy on clients is the real objective, motivating ‘personalised’ FL that creates unique models for FL clients to improve local performance. Yet the best way of incorporating personalisation into FL remains an under-researched topic.
• Due to the non-IID nature of client datasets, the performance of the global FL model may be higher on some clients than others. This could even lead some clients to receive a worse model than the one they could have trained independently.
This paper addresses the above challenges by proposing a Multi-Task FL (MTFL) algorithm that allows clients to train personalised DNNs that both improve local model accuracy and help to further enhance client privacy. MTFL has a lower storage cost for personalisation and a lower computing cost than other personalised FL algorithms, as it does not require extra steps of SGD on clients during the training loop or at personalisation time [?] [?] [?] [?].
As client datasets in FL are usually non-IID, clients can be viewed as attempting to optimise their models during local training for disparate tasks. Our MTFL approach takes the Batch-Normalisation (BN) layers that are commonly incorporated into DNN architectures, and keeps them private to each client. Mudrakarta et al. [?] previously showed that private BN layers improved Multi-Task Learning (MTL) performance for joint training on ImageNet/Places-365 in the centralised setting.
Using private BN layers has the dual benefit of personalising
each model to the clients’ local data as well as helping to preserve
data privacy: as some parameters of client models are not uploaded
to the server, less information about a client’s data distribution
can be gleaned from the uploaded model. Our MTFL approach
using BN layers also has a storage-cost benefit compared to other
personalised FL algorithms: BN layers typically contain a tiny
fraction of the total parameters of a DNN, and only these BN
parameters need to be stored between FL rounds, compared to
entire personalised DNN models of competing algorithms [?] [?]
[?].
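To illustrate the storage argument, the sketch below (PyTorch-style, with a small generic CNN used purely for illustration rather than the paper's architecture) separates the BN state a client would keep private between rounds from the federated parameters it uploads:

```python
import torch.nn as nn

# A generic CNN with BN layers; the architecture is an assumption for illustration only.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, 3), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)

bn_state = {}   # kept private on the client between rounds (trained params + statistics)
for name, module in model.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        for key, value in module.state_dict().items():   # weight, bias, running_mean, running_var, ...
            bn_state[f"{name}.{key}"] = value

# Everything else is federated: uploaded to the server and averaged each round.
fed_state = {k: v for k, v in model.state_dict().items() if k not in bn_state}

total = sum(v.numel() for v in model.state_dict().values())
print(f"BN fraction of model state: {sum(v.numel() for v in bn_state.values()) / total:.2%}")
```

Even in this toy model the BN state amounts to only a few per cent of all parameters, and the fraction shrinks further for larger DNNs, since BN size scales with the number of channels rather than with the number of weights.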
MTFL adds personalisation on top of the typical iterative FL
framework. FedAvg and other popular algorithms are instances of
this iterative optimisation framework [?] [?] [?]. Most of these FL algorithms use vanilla Stochastic Gradient Descent (SGD) on clients; however, momentum-based optimisation strategies such as Adam [?] have the potential to improve the convergence speed of FL training. We show that a distributed optimisation technique
using Adam (FedAvg-Adam) shows substantial speedup in terms
of communication rounds compared to FedAvg, and works very
well within the MTFL algorithm.
Our work makes the following contributions:
• We propose an MTFL algorithm that adds Multi-Task learning on top of general iterative-FL algorithms, allowing users to learn DNN models that are personalised for their own data. MTFL uses private Batch Normalisation (BN) layers to achieve this personalisation, which provides an added privacy benefit.
• We propose a new metric for measuring the performance of FL algorithms: User model Accuracy (UA). UA better reflects a common objective of FL (increasing test accuracy on clients), as opposed to the standard global-model accuracy.
• We analyse the impact that private BN layers have on the activations of MTFL models during inference, providing insights into the source of their impact. We also analyse the training and testing performance of MTFL when keeping either the trained parameters or statistics of BN layers private, demonstrating that MTFL provides a better balance between convergence and regularisation compared to FL or independent training.
• We conduct extensive simulations on the MNIST and CIFAR10 datasets. The results show that MTFL with FedAvg is able to reach a target UA in up to 5× fewer rounds than when using only FL, with FedAvg-Adam providing a further 3× improvement. Other experiments show that MTFL is able to significantly improve average UA compared to other state-of-the-art personalised FL algorithms.
• We perform experiments using an MEC-like testbed consisting of Raspberry Pi clients and an FL server. The results show that MTFL with FedAvg-Adam’s overheads are outweighed by its substantial UA and convergence speed benefits.
The rest of this paper is organised as follows: Section 2
describes related work; Section 3 details the proposed MTFL
algorithm, the effect that keeping private BN layers within MTFL
has on training and inference, and the FedAvg-Adam optimisation
strategy; Section 4 presents and discusses experiments using both
simulations and an MEC-like testbed; and Section 5 concludes the
paper.
2 RELATED WORK
As this work addresses several challenges to existing FL algorithms, we overview the related work in three sub-topics of FL: works considering personalisation, works dealing with practical and deployment challenges, and works aiming to improve convergence speed and global-model performance.
2.1 Personalised Federated Learning
Several authors have considered the approach of ‘personalising’
FL models in order to tailor model performance to non-IID user
datasets.
Meta-Learning aims to train a model that is easy to fine-tune with few samples. Fallah et al. [?] proposed the Per-FedAvg algorithm based on Model-Agnostic Meta-Learning (MAML), which adds a first-order adaptation term to the client loss functions, so they can be tuned to client datasets with one step. Jiang et al. [?] highlighted the connection between FedAvg and first-order MAML updates, and proposed a three-stage training algorithm to improve personalisation.
Other authors propose training a combination of local and global models in FL to improve personalisation. Hanzely and Richtárik [?] added a learnable parameter to allow clients to control the extent of local and global model mixing. Dinh et al. [?] kept a global model and a personal model for each user, performing SGD on their personal model and then updating their copy of the global model in an outer loop. Huang et al. [?] kept a local model on each client, and added a proximal term to client loss functions to keep these models close to a ‘personalised’ cloud model, for the cross-silo FL setting.
Smith et al. [?] proposed MOCHA, which performs Federated MTL by formulating FL as a function of the model weight matrix and a relationship matrix. Their algorithm takes into account the heterogeneous hardware of clients, meaning MOCHA is not directly comparable to our MTFL scheme. Recently, Dinh et al. [?] generalised MOCHA and other algorithms into the FedU framework, including proposing a decentralised version.
Our work proposes a Multi-Task learning approach to achieve personalisation in FL (MTFL). We later show that our approach has substantial convergence speed, personalisation performance, privacy, and storage cost benefits compared to existing personalised FL algorithms.
2.2 Federated Learning in Edge Computing
FL performs distributed computing at the network edge. Some
authors have considered the system design and communication
costs of FL in this environment. Jiang et al. [?] proposed an
FL system that reduces the total data clients upload by selecting
model weights with the largest gradient magnitudes. They also

considered implementation details such as asynchronous or round-robin client updates. Bonawitz et al. [?] produced a FedAvg system design, specifying client/server roles, fault handling, and security. They also provide analytics for their deployment of this system with over 10 million clients. To address the non-IID nature of client datasets in FL, Duan et al. [?] proposed the Astraea framework: client datasets are augmented to help reduce local class imbalances, and mediators are introduced to the global aggregation method.

Fig. 1: Operation of the MTFL algorithm in Edge Computing. Training is performed in rounds until a termination condition is met. Step 1: the server selects a subset of clients from its database to participate in the round, and sends a work request to them. Step 2: clients reply with an accept message depending on physical state and local preferences. Step 3: clients download the global model (and any optimisation parameters) from the server, and update their copy of the global model with private patches (in this work, we use BN layers as patches). Step 4: clients perform local training, before saving their personal patches for the next round. Step 5: the server waits for C fraction of clients to upload their non-private model and optimiser values, or until a time limit. Step 6: the server averages all models, saves the aggregate, and starts a new round.
Several authors have also investigated the impacts of wirelessly connected FL clients. Yang et al. [?] studied different scheduling policies in a wireless FL scenario. Their analysis showed that with a low Signal-to-Interference-plus-Noise Ratio (SINR), simple FL schemes perform well, but that as SINR increases, more intelligent methods of selecting clients are needed. Ahn et al. [?] proposed a Hybrid Federated Distillation scheme for FL with wireless edge devices, including using over-the-air computing and compression methods. Their results showed that their scheme gave better performance in high-noise wireless scenarios.
Other authors have proposed schemes for considering the computing, networking and communication resources of FL clients in edge computing. Wang et al. [?] performed experiments with smartphones to argue that the computation-time (as opposed to communication-time) of FedAvg is the most significant bottleneck for real-world FL, and proposed algorithms to accommodate this computational heterogeneity. Nishio and Yonetani [?] designed a system that collects information about the computing and wireless resources of clients before initiating a round of FL, reducing the real time taken to reach a target accuracy for FedAvg.
These previous works have proposed implementations of FL
systems. However, they do not consider MTL within FL, which is
a main contribution of our work with the MTFL algorithm.
2.3 Federated Learning Performance
The seminal FedAvg algorithm [?] collaboratively trains a model
by sending an initial model to participating clients, who each
perform SGD on the model using their local data. These new
models are sent to the server for averaging and a new round
is begun. Some progress has been made towards improving the
convergence rate of FedAvg. Leroy et al. [?] used Adam adaptive
optimisation when updating the global model on the server. Reddi
et al. [?] also generalised other adaptive optimisation techniques in
the same style and provided convergence guarantees. Our FedAvg-
Adam algorithm differs from these as clients in FedAvg-Adam
perform Adam SGD (as opposed to vanilla SGD), and the Adam
parameters are averaged alongside model weights at the server.
Liu et al. [?] used momentum-SGD on clients, and aggregated the
momentum values of clients on the server alongside the model
weights as an alternative method of accelerating convergence.
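As a minimal sketch of this kind of aggregation (our reading of the description above, not the authors' code), each client could upload its model weights together with its local Adam first- and second-moment estimates, and the server would average all three, weighted by local dataset size:

```python
import numpy as np

def average_states(client_states, client_sizes):
    """Average per-client dicts of arrays (model weights plus Adam moments m and v),
    weighted by each client's dataset size."""
    total = sum(client_sizes)
    keys = client_states[0].keys()
    return {k: sum((n / total) * state[k] for state, n in zip(client_states, client_sizes))
            for k in keys}

# Toy example: each uploaded state holds weights "w" and Adam moment estimates "m", "v".
rng = np.random.default_rng(1)
uploads = [{"w": rng.normal(size=5), "m": rng.normal(size=5), "v": rng.random(5)}
           for _ in range(4)]
sizes = [120, 80, 200, 60]
global_state = average_states(uploads, sizes)   # broadcast back to clients next round
```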
Some works have been produced investigating FL with non-
IID or poor-quality client data. Zhao et al. [?] proposed sharing a
small amount of data between clients to decrease the differences
in their data distributions and improve global model accuracy.
Konstantinov and Lampert [?] evaluated which clients have poor-
quality data by finding the difference between a client model’s
local predictions and predictions using a trusted dataset. Wang
et al. [?] ignored irrelevant client updates during training by
checking if each client’s update aligns with the global model.
The FedAvg-Adam optimisation method presented here uses
adaptive optimisation on clients, rather than SGD, which we later
show converges much faster than when using FedAvg or Adam
optimisation purely on the server.

3 MULTI-TASK FEDERATED LEARNING (MTFL)
Fig. 1 shows a high-level overview of how the MTFL algorithm would operate in the edge-computing environment. More detailed descriptions of the use of BN patches in MTFL, and of optimisation on clients, are given in the later subsections.
The MTFL algorithm is based on the client-server framework; however, rounds are initiated by the server, as shown in Fig. 1. First, the server selects all, or a subset of all, known clients from its database and sends a Work Request message asking them to participate in the FL round (Step 1). Clients will accept a Work Request depending on user preferences (for example, users can set their device to only participate in FL if charging and connected to WiFi). All accepting clients then send an Accept message to the server (Step 2). The server sends the global model (and any associated optimisation parameters) to all accepting clients, who augment their copy of the global model with private patches (Step 3). Clients then perform local training using their own data, creating a different model. Clients save the patch layers from their new model locally, and upload their non-private model parameters to the server (Step 4).
The server waits for clients to finish training and upload their
models (Step 5). It can either wait for a maximum time limit, or for
a given fraction of clients to upload before continuing, depending
on the server preferences. After this, the server will aggregate all
received models to produce a single global model (Step 6) which
is saved on the server, before starting a new round.
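The round just described can be summarised as a server-side loop. The sketch below is pseudocode-like Python: every server/client method it calls (accepts_work, download, start_local_training, poll_uploads, aggregate) is a hypothetical placeholder for whatever transport and aggregation a real deployment would use.

```python
import random
import time

def run_round(server, clients, C=0.5, time_limit_s=600):
    """One MTFL round from the server's perspective (hypothetical interfaces throughout)."""
    selected = random.sample(clients, max(1, int(C * len(clients))))   # Step 1: work requests
    accepted = [c for c in selected if c.accepts_work()]               # Step 2: accept messages
    for client in accepted:
        client.download(server.global_model, server.optimiser_state)   # Step 3: send global model
        client.start_local_training()                                  # Step 4: local training; BN patches stay on-device

    # Step 5: wait for a fraction of clients to upload, or until the time limit expires.
    updates, deadline = [], time.time() + time_limit_s
    while len(updates) < C * len(clients) and time.time() < deadline:
        updates.extend(server.poll_uploads())
        time.sleep(1)

    if updates:                                                        # Step 6: aggregate and save
        server.global_model = server.aggregate(updates)
```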
MTFL therefore offloads the vast majority of computation
to client devices, who perform the actual model training. It
preserves users’ data-privacy more strongly than FedAvg and other
personalised-FL algorithms: not only is user data not uploaded, but
key parts of their local models are not uploaded. The framework
also accounts for client stragglers with its round time/uploading
client fraction limit. Moreover, MTFL utilises patch layers to
improve local model performance on individual users’ non-IID
datasets, making MTFL more personalised.
3.1 User Model Accuracy and MTFL
In many FL works, such as the original FedAvg paper [?], the
authors use a central IID test-set to measure FL performance.
Depending on the FL scenario, this metric may or may not be
desirable. If the intention is to create a single model that has good
performance on IID data, then this method would be suitable.
However, in many FL scenarios, the desire is to create a model that
has good performance on individual user devices. For example,
Google have used FedAvg for their GBoard next-word-prediction
software [?]. The objective was to improve the prediction score
for individual users. As users typically have non-IID data, a single global model may display good performance for some users, and worse performance for others.
We propose using the average User model Accuracy (UA) as
an alternative metric of FL performance. UA is the accuracy on a
client using a local test-set. This test-set for each client should be
drawn from a similar distribution as its training data. In this paper,
we perform experiments on classification problems, but UA could
be altered for different metrics (e.g. error, recall).
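Concretely, average UA can be computed as the unweighted mean of per-client test accuracies, with each client evaluated on its own held-out data. A small illustrative helper (the inputs are hypothetical per-client counts):

```python
def average_user_accuracy(per_client_correct, per_client_total):
    """Mean of per-client test accuracies; every client is evaluated on its own local test-set."""
    accuracies = [correct / total for correct, total in zip(per_client_correct, per_client_total)]
    return sum(accuracies) / len(accuracies)

# e.g. three clients with local test-sets of different sizes
print(average_user_accuracy([45, 88, 19], [50, 100, 25]))   # -> approximately 0.847
```

Whether to weight clients equally (as here) or by local dataset size is a design choice; an unweighted mean treats every user's experience as equally important.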
In FL, user data is often non-IID, so users could be considered
as having different but related learning tasks. It is possible for
an FL scheme to achieve good global-model accuracy, but poor
UA, as the aggregate model may perform poorly on some clients’
datasets (especially if they have a small number of local samples, so are weighted less in the FedAvg averaging step). We propose the MTFL algorithm that allows clients to build different models, while still benefiting from FL, in order to improve the average UA.

Fig. 2: Example composition of a DNN model used in MTFL. Each client’s model consists of shared global parameters ($\Omega_1 \ldots \Omega_4$) for Convolutional (Conv) and Fully-Connected (FC) layers, and private Batch-Normalization (BN) patch layers ($P^k_1$, $P^k_2$, $P^k_3$).
Mudrakarta et al. [?] have previously shown that adding small
per-task ‘patch’ layers to DNNs improved their performance in
MTL scenarios. Patches are therefore a good candidate for training
personalised models for clients.
In FL, the aim is to minimise the following objective function:

$$F_{FL} = \sum_{k=1}^{K} \frac{n_k}{n} \ell_k(\Omega) \qquad (1)$$

where $K$ is the total number of clients, $n_k$ is the number of samples on client $k$, $n$ is the total number of samples across all clients, $\ell_k$ is the loss function on client $k$, and $\Omega$ is the set of global model parameters. Adding unique client patches to the FL model changes the objective function of MTFL to:

$$F_{MTFL} = \sum_{k=1}^{K} \frac{n_k}{n} \ell_k(M_k) \qquad (2)$$

$$M_k = \left(\Omega_1 \cdots \Omega_{i_1 - 1},\; P^k_1,\; \Omega_{i_1 + 1} \cdots \Omega_{i_m - 1},\; P^k_m,\; \Omega_{i_m + 1} \cdots \Omega_j\right) \qquad (3)$$

where $M_k$ is the patched model on client $k$, composed of Federated model parameters $\Omega_1 \cdots \Omega_j$ ($j$ being the total number of Federated layers) and patch parameters $P^k_1 \cdots P^k_m$ ($m$ being the total number of local patches, $\{i\}$ being the set of indexes of the patch parameters) unique to client $k$. Fig. 2 shows an example composition of a DNN model used in MTFL.
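As a concrete reading of Eq. (3), a client builds its model $M_k$ by taking the downloaded federated parameters and substituting its private patches at the patch indexes. The dictionary-based sketch below is purely illustrative (layer contents are placeholder strings):

```python
def compose_patched_model(global_params, private_patches, patch_indexes):
    """Build M_k: start from the federated parameters and place client k's private
    patch parameters (e.g. BN layers) at the designated patch indexes."""
    model = dict(global_params)                 # copy of the downloaded global model
    for index, patch in zip(patch_indexes, private_patches):
        model[index] = patch                    # insert P_1^k ... P_m^k
    return model

# Toy example: five layers, where layers 1 and 3 are private BN patches of client k.
omega = {i: f"federated_layer_{i}" for i in range(5)}
patches_k = ["bn_patch_1_of_client_k", "bn_patch_2_of_client_k"]
M_k = compose_patched_model(omega, patches_k, patch_indexes=[1, 3])
```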
MTFL is a general algorithm for incorporating MTL into
FL. Different optimisation strategies (including FedAvg-Adam
described in Section 3.3) can be used within MTFL, and we later
show that MTFL can substantially reduce the number of rounds to
reach target UA, regardless of the optimisation strategy used.
As shown in Algorithm 1, MTFL runs rounds of communication until a given termination criterion (such as a target UA) is met (Line 2). At each round, a subset $S_r$ of clients is selected to participate from the set of all clients $S$ (Line 3). These clients

References
• D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in Proc. ICLR, 2015.
• H. B. McMahan et al., “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proc. AISTATS, 2017.
• Y. Mao et al., “A Survey on Mobile Edge Computing: The Communication Perspective,” IEEE Communications Surveys & Tutorials, 2017.
• T. Li et al., “Federated Learning: Challenges, Methods, and Future Directions,” IEEE Signal Processing Magazine, 2020.
• Y. Zhao et al., “Federated Learning with Non-IID Data,” arXiv preprint, 2018.