To cite this version: Yoann Desmouceaux, Marcel Enguehard, Thomas Heide Clausen. Joint Monitorless Load-Balancing and Autoscaling for Zero-Wait-Time in Data Centers. IEEE Transactions on Network and Service Management, IEEE, 2021, 18 (1), pp. 672-686. DOI: 10.1109/TNSM.2020.3045059. HAL: hal-03171974.

Joint Monitorless Load-Balancing and Autoscaling
for Zero-Wait-Time in Data Centers
Yoann Desmouceaux, Marcel Enguehard, Thomas H. Clausen
Abstract—Cloud architectures achieve scaling through two main functions: (i) load-balancers, which dispatch queries among replicated virtualized application instances, and (ii) autoscalers, which automatically adjust the number of replicated instances to accommodate variations in load patterns. These functions are often provided through centralized load monitoring, incurring operational complexity. This paper introduces a unified and centralized-monitoring-free architecture achieving both autoscaling and load-balancing, reducing operational overhead while improving response-time performance. Application instances are virtually ordered in a chain, and new queries are forwarded along this chain until an instance, based on its local load, accepts the query. Autoscaling is triggered by the last application instance, which inspects its average load and infers whether its chain is under- or over-provisioned. An analytical model of the system is derived, and proves that the proposed technique can achieve asymptotic zero-wait-time with high (and controllable) probability. This result is confirmed by extensive simulations, which highlight close-to-ideal performance in terms of both response time and resource costs.
Index Terms—Load balancing, auto-scaling, segment routing,
application-aware, performance analysis.
I. INTRODUCTION
Virtualization and cloud architectures, wherein different
tenants share computing resources to run their workloads,
have made fast task allocation and deallocation a commodity
primitive in data centers [1]. To optimize costs while pre-
serving Quality of Service (QoS), applications are thus (i)
replicated among multiple instances running, e.g., in containers
or in virtual machines (VMs) [2], [3], and (ii) the number of
aforementioned instances is automatically scaled up or down
to meet a given Service Level Agreement (SLA) [4]. Two
functions enable this: (i) a load-balancer, which dispatches
queries onto identical replicas of the application, and (ii) an
autoscaler, which monitors these instances and automatically
adjusts their number according to the incoming load.
A challenge for network load-balancers is to provide
performance and resiliency while satisfying per-application
SLAs. Some architectures, such as Equal Cost Multi-Path
(ECMP) [5] or Maglev [6], distribute flows among applica-
tion instances pseudo-randomly, forwarding packets without
terminating Layer-4 connections, and thus providing a high
throughput. The use of consistent hashing also provides re-
siliency for when an existing flow is handed over to another
load-balancer [6]. This requires, nonetheless, that flows be
Y. Desmouceaux is with Cisco Systems, 92130 Issy-les-
Moulineaux, France. M. Enguehard is with Polyconseil, 75008
Paris, France. T. H. Clausen is with École Polytechnique, 91120
Palaiseau, France. Emails: ydesmouc@cisco.com, marcel@enguehard.org,
thomas.clausen@polytechnique.edu
assigned to instances regardless of their load state, even though
it has been demonstrated [7] that considering application load
can greatly improve overall performance. Other load-balancing
architectures do take application state into account, by termi-
nating Layer-4 connections [8], and/or by using centralized monitoring [9], thus incurring both a performance overhead and a degradation in resiliency.
Similarly, autoscalers use centralized monitoring, with an
external agent gathering load metrics from all servers so as to
make scaling decisions [10], [11]. The delay incurred by an
external agent collecting these metrics causes such decisions to
be made based on out-of-date information. Furthermore, such
agents typically collect external metrics (e.g., CPU load of a
VM as seen by the hypervisor), ignoring application-specific
metrics possibly more suitable for making scaling decisions.
A. Statement of Purpose
While workloads lasting hours or minutes (e.g., data pro-
cessing tasks) can be efficiently scheduled with offline op-
timization algorithms [12], and while sub-millisecond work-
loads require over-provisioning, as the time to commission a new instance is too long compared to the application execution time, mid-sized workloads (lasting from 100 ms to 1 s, e.g., Web workloads) are amenable to reactive autoscaling, as container boot times are typically sub-second [13]. Thus, in this paper, the problem of mid-sized workload scalability
under QoS constraints is explored, for replicated applications
deployed, e.g., as containers. In particular, a centralized-
monitoring-free architecture for achieving asymptotic zero-
wait-time is introduced. More precisely, the architecture is
centralized-monitoring-free as it relies on the applications themselves monitoring their load, without reporting load information to a central controller. It yields asymptotic zero-
wait-time in the sense that each incoming query finds, with
probability converging to one as the number of application
instances goes to infinity, an idle application instance. The
architecture relies on two interdependent components: a load-
aware load-balancing algorithm and a decentralized autoscal-
ing policy.
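Stated formally (in our notation, which the paper does not use at this point: W_n denotes the waiting time of a query arriving while n instances are provisioned), the target property is

    \lim_{n \to \infty} \Pr\left[ W_n > 0 \right] = 0,

i.e., the probability that an incoming query finds no idle instance vanishes as the chain grows.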
First, a centralized-monitoring-free load-balancing algorithm
is introduced: Join-the-First-Idle-Queue (JFIQ). JFIQ relies on
ordering the available application instances in a chain along
which incoming queries are directed. Each of the instances in
the chain makes a local decision based on its load, accepting
the query if it has available capacity, and forwarding the query
to the next instance in the chain otherwise. The proposed
architecture operates entirely within the network layer (Layer-
3) using IPv6 Segment Routing (SRv6) [14], thus removing

the need for terminating or proxying network connections.
Second, to achieve asymptotic zero-wait-time, JFIQ is
complemented with a centralized-monitoring-free autoscaling
policy which uses the fact that the busyness of the last
instance in the chain is an indicator of the busyness of the
whole system. This allows offloading autoscaling decisions
to that last instance, by measuring its occupancy ratio over
time. Upscaling/downscaling is triggered if that ratio crosses
pre-determined maximum/minimum thresholds. An analytical
model demonstrates the validity of using this autoscaling
policy conjointly with JFIQ to achieve asymptotic zero-wait-
time, and quantifies the behavior of the system in terms of
response time.
Finally, this analytical model is complemented with exten-
sive simulations, capturing the dynamics of the architecture,
and showing that the proposed mechanism makes it possible to precisely control the tail of the response time distribution. These simulations illustrate that the proposed mechanism reduces the
resource cost (i.e., the number of necessary instances) for an
identical target response time by an order of magnitude in
the evaluated scenario, when compared to the simpler policies
used in consistent-hashing-based load-balancers.
B. Related work
This section discusses the literature on network load-
balancing (section I-B1) and autoscaling (section I-B2).
1) Load-balancing: The goal of a load-balancer is to as-
sign incoming queries for a given service to one of several
distributed instances of this service. As such, this requires:
(i) selecting the instance so as to minimize response time,
and (ii) making sure that the load-balancer does not become
a bottleneck.
Several load-aware load-balancing algorithms exist [15],
including Random (RND), where queries are assigned ran-
domly to one of n application instances, and Round-Robin
(RR), where the i-th query is assigned to the (i mod n)-th
instance. The optimal policy is the Least-Work-Left (LWL)
policy, which assigns queries to the application instance with
the least amount of pending work [16]. A simpler algorithm
is Join-the-Shortest-Queue (JSQ), which assigns queries to
the least loaded of the application instances. JSQ does not
require knowledge of the remaining work time of currently-
served queries, and provides near-optimal performance [17],
even in high-load regimes [18]. JSQ needs to query the state
of all application instances for each incoming query, which
incurs a monitoring overhead of n messages per query. A
more scalable algorithm, Join-the-Idle-Queue (JIQ), has been
proposed in [19]: queries are assigned to an idle application
instance if one exists, or to a random instance otherwise. This
is implemented by maintaining a centralized idle queue of
the identities of currently idle application instances, minimiz-
ing the monitoring overhead as compared to JSQ. Another
algorithm is Join-the-Shortest-of-d-Queues (JSQ_d) [7], which assigns queries to the least loaded of d randomly sampled application instances, and which is therefore more decentralized but less efficient than JIQ (as stated in [20]). The algorithms listed above have been analyzed in the heavy-traffic limit (where the query rate approaches the stability limit), making it possible to quantify the achieved expected waiting time as a function of the number of application instances [20], [21].
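For concreteness, the decision rules of these policies fit in a few lines; the Python sketch below is our illustration only (queues[i] stands for the instantaneous queue length of instance i; function names and the d = 2 default are ours, not taken from the cited works):

    import random

    def rnd(queues):
        # RND: assign to a uniformly random instance.
        return random.randrange(len(queues))

    def jsq(queues):
        # JSQ: assign to the least-loaded instance
        # (requires full state: n messages per query).
        return min(range(len(queues)), key=lambda i: queues[i])

    def jsq_d(queues, d=2):
        # JSQ_d: sample d instances, assign to the least loaded of them.
        sample = random.sample(range(len(queues)), d)
        return min(sample, key=lambda i: queues[i])

    def jiq(queues):
        # JIQ: assign to an idle instance if one exists,
        # to a uniformly random instance otherwise.
        idle = [i for i, q in enumerate(queues) if q == 0]
        return random.choice(idle) if idle else random.randrange(len(queues))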
The above has summarized a set of algorithms for assigning
flows to applications, as well as their key performance charac-
teristics. It is equally important to be able to actually distribute
network flows across application instances, at the network
layer. This consists of directing flows (e.g., TCP packets)
corresponding to queries for a given service (described by a
virtual IP address, VIP) to the physical IP address (PIP) of
a deployed instance. This load-balancing function can itself
be replicated, in which case it is deployed behind a layer
of ECMP routers, which can arbitrarily redistribute packets
between load-balancer instances, for new flows as well as for
already-established flows. It is thus necessary to maintain Per-
Connection-Consistency (PCC), i.e., to ensure that already-
established flows are always directed to the same application
instance, regardless of the load-balancer they are handled by.
Maglev [6] and Ananta [22] use a combination of consistent
hashing and per-flow tables to ensure PCC. This has been com-
plemented by enabling hardware-support [23], [24], [25], or by
using in-packet state to maintain PCC [25], [26], [27]. While
providing per-connection consistency, these architectures do
not consider the application instance load, using a naïve RND
policy at the cost of decreased application performance [15],
[28]. A first step towards considering the load of application
instances is 6LB [29], where consistent hashing is used with
a variant of the JSQ_2 algorithm that assigns queries to the first available from among two candidate instances. Some
architectures [9] rely on Software-Defined Networking (SDN)
to monitor the network and the servers, and thus make load-
aware decisions but at the cost of a monitoring overhead.
2) Autoscaling: Methods to provide autoscaling have been
classified as reactive and proactive [4]. Reactive methods reg-
ularly gather measurements, and take actions when thresholds
are crossed. For instance, in [10] up/downscaling is triggered
when bounds on some observed metrics are reached; a similar
approach can be found in [11], but with dynamic threshold
adjustment. These incur an overhead from gathering statistics,
and a time gap between detection of violations and appropriate
reaction. Similar threshold-based approaches include [30],
[31], [32].
Conversely, proactive approaches consist of anticipating
state and acting correspondingly. For example, in [33], moving
averages are used to anticipate the future value of metrics
of interest. Similarly, [34] uses Machine Learning (ML) to
classify workloads by their resource allocation preferences,
and in [35], neural networks are used to predict CPU load
trends of application instances and provision resources ac-
cordingly. A Tree-Augmented Naive Bayesian network is used
in [36] to detect SLA violations, and scale resources up when
this happens. In [37], [38], control theory is used to track
CPU usage and to allocate resources accordingly, and in [39],
control theory is used to adapt the amount of CPU resources
allocated to each query so that they complete within a deadline.
While solving the issue of timeliness, proactive approaches suffer from the need to collect statistics and perform centralized
computations.

Approaches based on queuing theory have also been proposed [40], [41].
In [42], an autoscaling scheme for JIQ is proposed, by creating
a feedback loop that decommissions application instances that
remain idle for a long period of time, and commissions a new
application instance for each new query. In [43], a similar
token-based mechanism is introduced, with a new application
instance being commissioned only when a task finds all instances busy.
C. Paper Outline
The remainder of this paper is organized as follows. Sec-
tion II gives an overview of the architecture introduced in this
paper. An analytical model for the response time of the system
with a fixed number of instances is introduced in section III,
and the asymptotic behavior of the system is characterized.
Numerical results are given in section IV, along with computa-
tional simulations providing further insight. Finally, section V
concludes this paper.
II. JOINT LOAD-BALANCING AND AUTOSCALING
In this paper, an application is replicated on a set of n
application instances {s_1, . . . , s_n} with identical processing
capacities. The goal is to minimize response time, i.e., queries
should be served with zero waiting time, by way of (i) en-
suring that enough application instances are available, and (ii)
mapping the query to an idle application instance. To address
the challenges introduced in sections I-B1 and I-B2, this
goal is attained through joint load-balancing and autoscaling
strategies which provide not only close-to-ideal algorithmic
performance, but which can also be efficiently implemented,
i.e., both the load-balancing and autoscaling functions must
incur minimal state and network overhead. The proposed
architecture relies on three intertwined building blocks: (i) a
load-balancing algorithm that achieves asymptotic zero-wait-
time if the number of application instances is correctly scaled;
(ii) an enhanced IPv6 dataplane to perform query dispatching
in a decentralized and stateless fashion; (iii) a centralized-
monitoring-free autoscaling technique to adapt the number of
application instances while incurring no monitoring cost.
A. Join-the-First-Idle-Queue Load-Balancing
An ideal load-balancing algorithm should achieve asymp-
totic zero-wait time for a properly-scaled set of application
instances. In particular, this is the behaviour of the reference
JIQ policy, which keeps track of available instances by means
of a centralized idle queue, with instances communicating
their availability to a centralized controller upon completion
of a query. The drawbacks of JIQ are twofold: it requires
centralized communication (which can create implementabil-
ity and scalability issues), and it requires centralized load
monitoring if used in conjunction with an autoscaler. To
address these issues, this paper proposes a new load-balancing
technique: Join-the-First-Idle-Queue (JFIQ), which does not
rely on centralized load tracking.
JFIQ relies on ordering the n application instances in a
chain (since the application instances are assumed to have
Figure 1. Join-the-First-Idle-Queue LB (Algorithm 1) with n = 4 instances.
Figure 2. Example of SR load-balancing [29] with 3 instances, wherein the second one accepts the connection: the client SYN enters the data center, the LB applies the segment list {S1, S2, S3}, S1 refuses, S2 accepts, and the SYN-ACK returns to the client via the LB.
identical capacity, the actual order of the instances in the chain
does not matter, so long as it remains consistent throughout
the lifetime of the system). Then, JFIQ enforces that each of
the first (n−1) instances never serves more than 1 query at a given time (see figure 1). Formally, each query is forwarded along the chain (s_1, . . . , s_n) of n application instances. Each instance s_i ≠ s_n in the list either accepts the query if it is currently idle, and otherwise forwards it to the next instance s_{i+1}. To ensure that all queries are served, the last instance s_n must always accept queries. Thus, each of the first (n−1) instances can hold only 0 or 1 query, ensuring zero waiting time for queries served by those. As shown later in section III-B, JFIQ makes it possible to predictably control the probability of having a blocked task (i.e., a task waiting for the last application instance to become idle) by varying the number n of instances.
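A minimal Python sketch of this acceptance rule follows (our illustration, not the paper's implementation; busy[i] holds the number of queries at instance s_{i+1}):

    def jfiq_dispatch(busy):
        # Walk the chain: the first n-1 instances accept only when idle.
        n = len(busy)
        for i in range(n - 1):
            if busy[i] == 0:   # s_{i+1} is idle: it accepts the query
                busy[i] = 1
                return i
        busy[-1] += 1          # s_n always accepts, possibly queueing the task
        return n - 1

A query is thus blocked exactly when the loop falls through to the last instance while it is already busy.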
B. Network-level JFIQ using SRv6
To achieve JFIQ at the network layer while enabling
application-awareness, this paper leverages the dataplane of
6LB [29] and SHELL [44], summarized in figure 2. This
dataplane is based on SRv6, a source-routing architecture
which allows specifying, within a specific IPv6 Extension
Header [45], a list of segments to be traversed by a given
packet, where each segment is an IPv6 address representing
an instruction to be performed on the packet.
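As a mental model, an SRv6 header can be reduced to a segment list plus a pointer; the sketch below is a simplification of ours (the on-wire encoding of RFC 8754 stores the segment list in reverse order and carries further fields):

    from dataclasses import dataclass

    @dataclass
    class SRHeader:
        # Simplified model of an SRv6 extension header.
        segments: list       # IPv6 addresses, in traversal order
        next_index: int = 0  # index of the next segment to visit

        def advance(self) -> str:
            # Consume one segment, returning the next destination address.
            seg = self.segments[self.next_index]
            self.next_index += 1
            return seg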
First, a control plane provisions the ingress router with a fixed list of application instances to be used by the JFIQ algorithm. Then, when a connection establishment packet (e.g., a TCP SYN) destined for the VIP is received by the ingress router, it inserts an SRv6 header with a list of PIPs corresponding to that list of instances. Instances then implement the
JFIQ algorithm as described in algorithm 1, by either handling
the packet locally or forwarding it to the next instance. To
avoid perpetual triangular traffic, a “stickiness” mechanism
is then used to let subsequent packets within this flow be
directed to the instance having accepted the connection [44].
A specific field of the transport header is used as a covert
channel to encode the index of the application instance that
has accepted the connection; examples of such fields include
QUIC session ID, low-order bits of TCP timestamps, or high-
order bits of TCP sequence numbers. This field must be able

Algorithm 1 Local Connection Request Handling
p ← connection establishment packet  ▷ e.g., TCP SYN
v ← p.lastSegment  ▷ VIP
b ← number of busy threads for v
if b = 0 then  ▷ application instance is available
    p.segmentsLeft ← 0
    p.dst ← v
    forward p to local workload v
else  ▷ forward to next application instance
    p.segmentsLeft ← p.segmentsLeft − 1
    p.dst ← p.nextSegment
    transfer p to p.dst
end if
Figure 3. Autoscaling when n = 3 and µ = 1. The level of red in each application instance shows the average number of concurrently-served queries as computed in section III. When the query rate increases from λ = 1.4 to λ = 1.7, the third instance observes that it has become highly occupied and thus requests upscaling.
to be set by the application instance and transparently echoed
in packets sent by the client, thus allowing the ingress router to
statelessly determine to which application instance non-SYN
packets should be forwarded.
Therefore, the load-balancing function does not require
per-flow state, consisting of (i) applying a fixed SR list on
connection establishment packets, or (ii) applying a one-
segment list on other packets, with a destination address that
depends on the value encoded in the covert channel found in
the packet. This makes the load-balancing function simpler,
thus more amenable to low-latency, high-throughput hardware
implementations. Moreover, as the functionality performed by the
ingress router does not require any synchronization, it can
be distributed among several routers, yielding scalability and
flexibility.
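The resulting ingress behaviour is a pure function of the packet, as the sketch below illustrates (our simplification: packets are modelled as dictionaries, and covert_index is a hypothetical accessor standing for whichever transport field carries the covert channel):

    def ingress_dispatch(packet, sr_list, instance_pips):
        # Stateless ingress: no per-flow table is consulted.
        if packet["is_syn"]:
            # (i) Connection establishment: apply the fixed SR list,
            # so that instances run Algorithm 1 along the chain.
            packet["segments"] = list(sr_list)
        else:
            # (ii) Established flow: the covert channel echoed by the
            # client identifies the accepting instance, so a
            # one-segment list suffices.
            packet["segments"] = [instance_pips[packet["covert_index"]]]
        packet["dst"] = packet["segments"][0]
        return packet

Because ingress_dispatch keeps no state, any replica of the ingress can handle any packet of any flow.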
C. Autoscaling
A key feature of JFIQ (compared, e.g., to JIQ) is that the
last instance has a unique view on whether the system is
overloaded or not. By construction, all instances but the last
only accept queries when idle. This can be exploited to per-
form autoscaling: when the last instance detects that it serves
too many or too few queries, it asks the control plane to scale the chain up or down. The control plane then provisions or deprovisions an instance as needed, and updates the ingress router with the new list of instances to be used by the load-balancing function. This allows for centralized-monitoring-free autoscaling, as illustrated in figure 3.
Algorithm 2 Local Autoscaling at Last Application Instance
p_e^−, p_e^+ ← parameter  ▷ up/downscaling thresholds
r_avg ← parameter  ▷ average application execution time (1)
W ← 1000 × r_avg  ▷ window size for EWMA
t_0 ← time()  ▷ timestamp of last event for EWMA
p̂_e ← 0  ▷ EWMA sample of p_e = P[N_n = 0]
r ← 0  ▷ number of events
for each connection establishment packet p from client do
    r ← r + 1
    N_n ← number of busy threads for v
    α ← 1 − exp(−(time() − t_0)/W)
    p̂_e ← (1 − α) p̂_e + α · 1{N_n = 0}
    t_0 ← time()
    if r > 50 then  ▷ make sure to have a significant sample
        if p̂_e > p_e^+ then
            request downscaling; reset all variables
        else if p̂_e < p_e^− then
            request upscaling; reset all variables
        end if
    end if
    p.segmentsLeft ← 0
    p.dst ← v
    forward p to local workload v
end for
for each connection termination packet p from application do
    r ← r + 1
    α ← 1 − exp(−(time() − t_0)/W)
    p̂_e ← (1 − α) p̂_e  ▷ N_n was > 0 over the last period
    t_0 ← time()
    forward p
end for
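The core of Algorithm 2, i.e., the irregularly-sampled EWMA of the idle probability and the threshold test, can be rendered compactly in Python. The sketch below is our own and, for brevity, handles establishment and termination events through a single observe() method taking the busy count sampled at the event:

    import math
    import time

    class LastInstanceAutoscaler:
        def __init__(self, p_lo, p_hi, window):
            self.p_lo = p_lo       # upscaling threshold (p_e^-)
            self.p_hi = p_hi       # downscaling threshold (p_e^+)
            self.window = window   # W, e.g., 1000 x r_avg
            self.reset()

        def reset(self):
            self.p_hat = 0.0       # EWMA estimate of p_e = P[N_n = 0]
            self.t0 = time.time()  # timestamp of last event
            self.r = 0             # number of observed events

        def observe(self, n_busy):
            # Irregular-interval EWMA: the weight of a sample decays
            # with the time elapsed since the previous event.
            now = time.time()
            alpha = 1.0 - math.exp(-(now - self.t0) / self.window)
            self.p_hat = (1 - alpha) * self.p_hat \
                + alpha * (1.0 if n_busy == 0 else 0.0)
            self.t0 = now
            self.r += 1
            if self.r > 50:        # require a significant sample first
                if self.p_hat > self.p_hi:
                    self.reset()
                    return "downscale"
                if self.p_hat < self.p_lo:
                    self.reset()
                    return "upscale"
            return None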
Figure 4. JFIQ autoscaling: example of upscaling for p_e^∗ = 0.4 and ρ ∈ (20, 25): the number n of instances adapts to maintain p̂_e between p_e^− and p_e^+ (the thick line marks the target p_e^∗). The bottom graph depicts the probability p_e against the request rate ρ = λ/µ, for fixed n ∈ {27, . . . , 34} and for the autoscaling policy; the top graph depicts the corresponding expected response time E[T], numerically computed with the method introduced in section III.
As formalized in Algorithm 2, the last instance in the chain
keeps statistics about its queue size over time. The fraction of time p_e during which the last instance is empty is sampled (with an Exponentially-Weighted Moving Average, EWMA), and the autoscaling mechanism tries to maintain it close to a fixed, tunable, target p_e^∗. When p̂_e goes below a threshold p_e^−, the instance triggers upscaling of the chain. Conversely, when it goes above a threshold p_e^+, the instance triggers downscaling of the chain. To avoid oscillations, the proposed autoscaling method ensures that p_e^{n−1}, the fraction of time

Frequently Asked Questions (8)
Q1. What are the contributions mentioned in the paper "Joint monitorless load-balancing and autoscaling for zero-wait-time in data centers" ?

This paper introduces a unified and centralized-monitoring-free architecture achieving both autoscaling and load-balancing, reducing operational overhead while increasing response time performance. 

In [37], [38], control theory is used to track CPU usage and to allocate resources accordingly, and in [39], control theory is used to adapt the amount of CPU resources allocated to each query so that they complete within a deadline. 

Several load-aware load-balancing algorithms exist [15], including Random (RND), where queries are assigned randomly to one of n application instances, and Round-Robin (RR), where the i-th query is assigned to the (i mod n)-th instance. 

With the JFIQ algorithm, when exposed to a query rate increase, the last instance might have to accept an important number of queries before deciding that upscaling is necessary. 

Due to taking the load of all instances into account, JFIQ autoscaling performs better than policies RND and JSQ2 (when ρ > 1.2), respectively, and yields results close to those of the reference policy JIQ. 

In particular, query rates vary between 300 and 700 req/s, and the expected number of queries injected into the system is:∫ 864000 λ(t)dt = 43.2 ·106. 

To evaluate the performance of JFIQ when using a fixed number of instances, the expected number of queries handled by the system is computed (as described in section III-B) as a function of the query rate ρ, for different values of the number n of instances. 

Each application instance has an identical processing capacity µ > 0, with exponentially-distributed service times (i.e., the probability of a query completing in less than t is 1− e−µt).