What are the contributions in this paper?

The authors describe THEMIS, a federated stream processing system for resource-starved, multi-site deployments. The authors provide the BALANCE-SIC distributed load shedding algorithm that balances the SIC values of result data. Their approach also incurs a low execution time overhead.

What is the SIC assignment for derived tuples?

The assignment of SIC values to derived tuples is performed as per Equation (3), which requires the sets of input and output tuples.

How does the load shedder calculate the result SIC?

To reduce the impact of delays when disseminating the result SIC values by the query coordinator to nodes hosting query fragments, the load shedder estimates the result SIC values of queries based on its local shedding.

What is the domain of tuples in the query graph?

Certain operators in the query graph are connected to a finite set of sources, which are denoted by S and produce source tuples at time-varying rates.

How does the algorithm increase the result SIC of all queries?

The algorithm follows a gradient ascent approach to increase gradually the result SIC values of all queries while minimising the pairwise SIC differences of the two queries with the lowest SIC values.
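The balancing idea in this answer can be illustrated with a small sketch. This is not the authors' BALANCE-SIC algorithm: it swaps in a simple greedy loop that always serves the query with the lowest current SIC value, which raises the lowest values first and so narrows the gap between the worst-off queries. The function name `balance_sic_greedy`, the dictionary layout and the fixed per-round `capacity` are assumptions made for illustration only.

```python
def balance_sic_greedy(result_sic, pending, capacity):
    """Pick up to `capacity` tuples, always serving the query with the lowest SIC.

    result_sic: dict query id -> current (estimated) result SIC value
    pending:    dict query id -> list of SIC values of tuples awaiting processing
    Returns (kept, estimate): the chosen tuple SIC values per query and the
    locally updated SIC estimate per query.
    """
    estimate = dict(result_sic)
    # Sort each queue ascending so that pop() returns the most valuable tuple.
    queues = {q: sorted(values) for q, values in pending.items()}
    kept = {q: [] for q in pending}
    for _ in range(capacity):
        candidates = [q for q in queues if queues[q]]
        if not candidates:
            break
        # Serve the query that is currently worst off (hypothetical policy).
        q = min(candidates, key=lambda name: estimate.get(name, 0.0))
        value = queues[q].pop()  # largest remaining SIC value
        kept[q].append(value)
        estimate[q] = estimate.get(q, 0.0) + value
    return kept, estimate


kept, estimate = balance_sic_greedy(
    result_sic={"q1": 0.5, "q2": 0.1},
    pending={"q1": [0.1, 0.1], "q2": [0.1, 0.1, 0.1]},
    capacity=3,
)
print(kept)
print(estimate)
```

In this example all three slots go to `q2`, because its estimated SIC stays below that of `q1` throughout; everything not selected would be shed.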

Why is it difficult to capture the tuples that are not dropped?

It is challenging in practice to capture accurately the sets $\tilde{T}^S$, $T^S$, $\tilde{T}^R$ and $T^R$ because source tuples are successively transformed to derived tuples by operators and some are shed: derived tuples are “lost”, e.g. due to filters and joins, which only select a subset of their input tuples.

What is the SIC metric used to determine the importance of tuples?

In their model, the SIC metric captures the importance of tuples: the higher the SIC value, the more important the tuple. The algorithm thus always keeps the most valuable tuples (max($x_{SIC}$) in line 16).
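A minimal sketch of "keep the most valuable tuples", assuming each tuple carries its SIC value and that a node can process a fixed number of tuples per round. The function `keep_most_valuable`, the `(sic, payload)` pair layout and the `capacity` parameter are hypothetical and are not taken from the paper's algorithm listing.

```python
import heapq


def keep_most_valuable(tuples, capacity):
    """Return the `capacity` tuples with the highest SIC value.

    tuples: list of (sic, payload) pairs; `payload` is opaque here.
    """
    return heapq.nlargest(capacity, tuples, key=lambda t: t[0])


batch = [(0.02, "a"), (0.10, "b"), (0.05, "c"), (0.01, "d")]
kept = keep_most_valuable(batch, capacity=2)
print(kept)  # [(0.1, 'b'), (0.05, 'c')]
```

The two tuples that are not returned are the ones a shedder following this policy would drop.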

What is the SIC value of an individual source tuple ts?

The SIC value of an individual source tuple $t^s$ is inversely proportional to $|T^S_s|$ and is also normalised by the number of sources $|S|$ in a query, which yields a query-independent metric.
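The normalisation can be checked with a short worked example. It assumes that $|T^S_s|$ is the number of tuples that source $s$ produces in the interval under consideration and that the proportionality constant is 1, so that a source tuple has SIC value $1 / (|T^S_s| \cdot |S|)$; both are assumptions of this sketch, as are the function name and the tuple counts.

```python
from fractions import Fraction


def source_tuple_sic(tuples_from_source, num_sources):
    """SIC of one source tuple, assuming t_SIC = 1 / (|T_s| * |S|)."""
    return Fraction(1, tuples_from_source * num_sources)


# Hypothetical query with two sources producing 4 and 10 tuples per interval.
counts = {"s1": 4, "s2": 10}
sic = {s: source_tuple_sic(n, len(counts)) for s, n in counts.items()}

# With no shedding, the SIC values of all source tuples add up to 1.
total = sum(sic[s] * n for s, n in counts.items())
print(sic["s1"], sic["s2"], total)  # 1/8 1/20 1

# If half of the tuples of s2 are shed, only 3/4 of the SIC mass remains.
after_shedding = sic["s1"] * 4 + sic["s2"] * 5
print(after_shedding)  # 3/4
```

Under these assumptions a slow source contributes as much in total as a fast one, which is what makes the value comparable across queries with different input rates.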

How many fragments does the BALANCE-SIC fairness algorithm accept?

Figure 11 shows that, when more queries are multi-fragmented, the BALANCE-SIC fairness algorithm converges to a fairer system, as more queries span nodes.

What is the SIC value of a result tuple?

The query SIC value of result tuples is: $q_{SIC} := \sum_{t^r \in \tilde{T}^R} t^r_{SIC}$ (4), where the authors only consider result tuples $t^r \in \tilde{T}^R \subseteq T^R$ that are derived from source tuples ∈

What are the different approaches for load shedding?

There exist semantic load shedding approaches for specific operator types, such as joins [21, 26, 17], aggregates [10, 35] and XML operators [38].

How does Jain’s fairness index work?

The solution of [44] is obtained using Matlab, and Jain’s fairness index for the resulting distribution of utilities (normalised log-output rates) equals 0.87.


City, University of London Institutional Repository

Citation: Kalyvianaki, E., Fiscato, M., Salonidis, T. and Pietzuch, P. (2016). THEMIS: Fairness in Federated Stream Processing under Overload. Paper presented at the 2016 ACM International Conference on Management of Data (SIGMOD), 26 Jun - 01 Jul 2016, San Francisco, USA.

This is the accepted version of the paper. This version of the publication may differ from the final published version.

Permanent repository link: https://openaccess.city.ac.uk/id/eprint/13546/


THEMIS: Fairness in Federated Stream Processing under Overload

Evangelia Kalyvianaki

City University London

sbbj913@city.ac.uk

Marco Fiscato

Imperial College London

mfiscato@doc.ic.ac.uk

Theodoros Salonidis

IBM TJ Watson Research Center

tsaloni@us.ibm.com

Peter Pietzuch

Imperial College London

prp@imperial.ac.uk

ABSTRACT

Federated stream processing systems, which utilise nodes from multiple independent domains, can be found increasingly in multi-provider cloud deployments, internet-of-things systems, collaborative sensing applications and large-scale grid systems. To pool resources from several sites and take advantage of local processing, submitted queries are split into query fragments, which are executed collaboratively by different sites. When supporting many concurrent users, however, queries may exhaust available processing resources, thus requiring constant load shedding. Given that individual sites have autonomy over how they allocate query fragments on their nodes, it is an open challenge how to ensure global fairness on processing quality experienced by queries in a federated scenario.

We describe THEMIS, a federated stream processing system for resource-starved, multi-site deployments. It executes queries in a globally fair fashion and provides users with constant feedback on the experienced processing quality for their queries. THEMIS associates stream data with its source information content (SIC), a metric that quantifies the contribution of that data towards the query result, based on the amount of source data used to generate it. We provide the BALANCE-SIC distributed load shedding algorithm that balances the SIC values of result data. Our evaluation shows that the BALANCE-SIC algorithm yields balanced SIC values across queries, as measured by Jain’s Fairness Index. Our approach also incurs a low execution time overhead.

1. INTRODUCTION

Federated stream processing systems (FSPSs) [14, 13] continuously process data streams using computation and network resources from several autonomous sites [9]. Submitted queries are split into query fragments, which can be deployed across multiple sites. For example, a cloud-based stream processing system may span more than one cloud provider to benefit from lower costs, higher resilience or closer proximity to data sources. In collaborative e-science applications, FSPSs such as OGSA-DAI [3] and Astro-WISE [1] pool resources from multiple organisations to provide a shared processing service for high stream rates and computationally expensive queries. Participatory sensing and smart city infrastructures [5, 31] require deployments of systems that combine independent domains with distinct data or processing capabilities for a large user base.

A challenge is that FSPSs are likely to suffer from long-term overload conditions. As a shared processing platform with many users, they can experience a “tragedy of the commons” [19] when users submit more queries than what can be sustained given the available resources. Instead of adopting a rigid admission policy, which rejects user queries when available resources are low, it is more desirable for an FSPS to use load-shedding techniques [33, 27]. Under load-shedding, the FSPS provides a best-effort service by reducing the resource requirements of queries through dropping a fraction of tuples from the input data streams.

Appropriate load shedding in an FSPS, however, is complicated by the fact that individual sites are autonomous and may implement their own resource allocation policies. For example, a site may prioritise queries belonging to local users at the expense of external query fragments. Without coordination of load-shedding decisions across sites, multi-site queries may experience significant variations in processing quality, depending on the load distribution across sites. It is therefore an open challenge how to ensure that queries spanning multiple autonomous sites in an FSPS experience globally fair processing quality under overload conditions.

Many stream processing systems support load shedding mechanisms to handle overload conditions. Load shedding mechanisms that operate at the granularity of individual nodes [33, 10, 35], however, cannot achieve fair shedding decisions for queries spanning multiple nodes. Proposals for distributed load shedding [34, 44] associate a utility function with query output rates and aim to maximise the sum of utilities, which is not a representative measure of fairness. In addition, they assume special structure and a-priori knowledge of utility functions. Load shedding decisions are controlled by a centralised entity or are based on pre-computed shedding plans, neither of which is practical in an FSPS in which domains retain control. Operator-specific semantic shedding approaches for, e.g. joins [21, 26, 17], aggregates [10, 35] or XML streams [38] cannot be applied in a federated context when users employ diverse sets of operators or customised, user-defined ones.

We describe a new approach for distributed load shedding in an FSPS that treats queries in a globally fair fashion. The key idea is to define a query-independent metric to measure the quality of processing that query fragments have experienced, and then to use this information for load shedding:

(1) We associate stream data with a metric called source information content (SIC), which represents the contribution of that data to the result in a query-independent way. The SIC metric quantifies the amount of source data that was used to generate a given query result data item. Intuitively, data that was aggregated over many stream sources is considered to be more important to the final query result. The SIC metric thus decouples processing quality from the semantics of the operators and provides a query-independent way to capture the quality of query processing with respect to tuple shedding. This is particularly suited to accommodate a diverse set of user queries that execute operators of various semantics, even user-defined ones.

(2) Overloaded nodes in the FSPS invoke a distributed semantic fair load-shedding algorithm that aims to balance the SIC values of query results across all queries, referred to as the BALANCE-SIC fairness policy. This policy balances the SIC values of query results (i.e. maximises Jain’s Fairness Index, a normalised scalar metric that quantifies balance). It effectively utilises the processing capacity of FSPS nodes, given the practical constraints of the placement of queries on sites and their autonomy.
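For reference, Jain’s Fairness Index of values $x_1, \dots, x_n$ is $(\sum_i x_i)^2 / (n \sum_i x_i^2)$. It equals 1 when all values are identical and drops to $1/n$ when a single value holds everything. The following sketch computes it for hypothetical per-query SIC values; the function name and the sample inputs are illustrative.

```python
def jain_index(values):
    """Jain's Fairness Index: (sum x)^2 / (n * sum x^2), in the range [1/n, 1]."""
    n = len(values)
    square_sum = sum(v * v for v in values)
    if n == 0 or square_sum == 0:
        raise ValueError("need at least one non-zero value")
    total = sum(values)
    return (total * total) / (n * square_sum)


print(jain_index([0.5, 0.5, 0.5, 0.5]))  # 1.0
print(jain_index([1.0, 0.0, 0.0, 0.0]))  # 0.25
```

Balancing the result SIC values of all queries therefore pushes this index towards 1.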

When queries are assigned across FSPS sites, it becomes challenging to control per-node tuple shedding and yet provide global BALANCE-SIC fair processing. This stems from the fact that shedding tuples at a node affects its resource availability and also the processing quality of other queries. Since queries span across sites and share resources, such effects are spread across sites, affecting shedding decisions on the rest of the nodes. It is therefore non-trivial to control tuple shedding globally in a federated setting.

In our approach, each node takes independent yet informed shedding decisions about the overall processing quality of locally-hosted queries. Queries provide continuous feedback on their processing quality through the SIC metric. The shedding of tuples eventually converges to global fairness as each node continuously adjusts its shedding behaviour in response to that of other FSPS nodes.

To demonstrate the practicality of our fair load-shedding approach, we describe THEMIS, an FSPS for overloaded deployments. Our evaluation of THEMIS shows that: (a) the SIC metric captures the result degradation across a variety of query types; (b) in contrast to the baseline of random shedding, THEMIS achieves 33% fairer query processing, according to Jain’s Fairness Index, even with skewed workload distributions; and (c) our approach has low overhead and scales well with the number of nodes and queries.

In summary, the contributions and the paper outline are:

1. a query-independent model and a metric called SIC for quantifying the quality of stream processing based on the amount of information contributed by data sources (§4) and a practical approximation for computing it (§6);

2. the definition of the BALANCE-SIC fairness in an overloaded FSPS based on the processing quality of queries; and a distributed algorithm for globally BALANCE-SIC-fair semantic load-shedding in an FSPS, which takes the loss of information suffered by queries into account (§5);

3. the design and implementation of THEMIS, an FSPS that implements efficiently the BALANCE-SIC fair load-shedding policy (§6); and

4. results from an experimental evaluation that demonstrate that the approach achieves fair query processing under various workloads in a federated setting (§7).

According to Greek mythology, THEMIS is the Titan goddess of law and order.

Figure 1: Example of a multi-site FSPS deployment for urban micro-climate monitoring. Labels recovered from the figure: Paris (cloud-based data center), Rome (governmental institute data center), Mexico (research institute data center); legend: sensors, query fragment, node.

2. OVERLOAD IN FEDERATED STREAM PROCESSING

In this section, we describe the problem of fairness in query processing, which arises in an overloaded FSPS. We identify the key characteristics of an FSPS using an example application for query data processing over micro-climate sensor-generated data (§2.1). We then introduce the BALANCE-SIC fairness goal for an overloaded FSPS (§2.2) and discuss related work (§2.3).

2.1 Federated Stream Processing

Consider a use case of a globally-distributed FSPS for urban micro-climate monitoring. Figure 1 shows a deployment of such a system across three sites (i.e. Rome, Paris and Mexico), with environmental sensors as data sources. The FSPS collects data from a range of sensors, such as air temperature, humidity and carbon monoxide, and processes the data in real-time for analysis. Queries are issued, e.g. by government agencies for urban planning, transport authorities, citizens with respiratory problems and meteorological researchers. Sample high-level data streaming queries may continuously report: “the 10 highest values of carbon monoxide concentration measurements on highways in Mexico every minute” and “the covariance matrix between measurements of (temperature, airflow) and (carbon dioxide, nitrogen) in Paris every 10 minutes”. Each site consists of a data centre with physical nodes running a local distributed stream processing system, and we assume seamless integration across all these systems at the federated sites [12].

Below, we provide a high-level overview of data stream processing in an FSPS. Sources generate tuples for processing by queries. Queries are subdivided into query fragments and deployed at one or more sites. A query fragment consists of one or more operators, and each fragment of the same query is deployed on a different FSPS node. Query fragments use resources, i.e. CPU, memory, disk space and network bandwidth, to process incoming tuples and generate output tuples. Output tuples may be further processed by fragments of the same query, until result tuples are sent to the user issuing the query. Nodes share their resources among fragments belonging to different queries.

Below, we identify three main characteristics regarding user behaviour and resource utilisation in such an FSPS:

C1. Skewed query workload distribution. Sites primarily host queries of local users so the overall load distribution across sites may be skewed, with some sites being more loaded than others. In general, query fragments cannot be allocated uniformly across sites due to local policy constraints or the reliance on local sources. For example, queries using forecasting algorithms may be restricted to running at a given site due to licensing constraints, which may limit the number of authorised users or remote sites using the system.

C2. Permanent resource overload. Due to the shared nature of an FSPS, we assume that the system is constantly overloaded, i.e. its resources are lower than required for perfect execution of all queries. In the above example, queries are issued by a large user population, leading to high demand. A common strategy for an FSPS to handle overload is to use tuple shedding [27, 33].

C3. Site autonomy. The collaborative nature of an FSPS means that a site should accept incoming queries, even under high load. However, sites belonging to different administrative domains are managed autonomously. It is therefore infeasible to assume centralised control over all tuple shedding decisions, enforced across all sites. Instead, sites elect to cooperate, having only a partial view of all resource allocation decisions across the whole FSPS.

2.2 Fairness in FSPS

The problem of how to implement fair query processing arises naturally in an overloaded FSPS. There exist many different approaches to address overload conditions. For example, admission control rejects incoming queries under overload [41, 40]. Such methods are not applicable in a federated context because the collaborative nature means that submitted queries must be accepted. Other approaches redistribute operators for load balancing [43, 40, 42, 11]. However, query placement in an FSPS is typically controlled by users, e.g. to leverage characteristics such as proximity.

We employ distributed load shedding to address overload conditions in an FSPS. By using load shedding, we assume that users agree to use the FSPS and receive degraded query processing for their queries. We assume that users submit queries whose results remain useful, even when their processing is degraded due to load shedding, such as aggregates [10], including averages, counts as well as top-k queries. Finally, we use distributed load shedding to comply with site autonomy (see C3 in §2.1).

There are two challenges in implementing distributed load shedding. First, there is a need for a query-independent processing metric to capture the impact of shedding on the quality of query processing. Ideally, we require a measure for processing quality that quantifies the processing degradation under shedding but is query-independent, i.e. it does not have to be adapted manually to the semantics of specific queries. With such a measure, it becomes possible to compare the impact of tuple shedding across queries and hence guide shedding decisions according to a fairness policy. In §4, we introduce the SIC query-independent metric that captures the quality of processing by measuring the contribution of source tuples actually used for generating query results.

Second, depending on the deployment of query fragments to sites, some queries may get more penalised due to overload than others. We therefore want to achieve global fairness across all queries by enforcing load shedding at all sites so that all queries are equally penalised by the shedding. We achieve this by aiming to equalise a fairness measure of all queries after shedding. It is a challenge how to implement fairness across queries executing on overloaded, distributed and autonomous sites, regardless of their deployment. In §5, we present a new distributed load shedding algorithm that maintains BALANCE-SIC fairness of queries.

2.3 Related Work

The research community recognises the need for FSPSs, exploring relevant research challenges. Tatbul [32] argues for the integration of multiple stream processing engines for a variety of applications and pinpoints the challenges when dealing with heterogeneous query semantics. Botan et al. [12] present MaxStream, an FSPS for business intelligence applications. Our focus instead is on fairness in an overloaded FSPS using load shedding.

Centralised load shedding. Early proposals for load shedding focus on single-node systems [4, 33, 28]. A simple way to address overload is through random shedding [33] that discards arbitrary tuples. This baseline approach is easy to implement and has low overhead; however, it cannot be used to control which tuples are shed.
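The random shedding baseline can be sketched in a few lines. This is a generic illustration, assuming each tuple is dropped independently with a fixed probability; the function name `random_shed` and its parameters are hypothetical and do not correspond to the interface of any particular system.

```python
import random


def random_shed(tuples, drop_probability, rng=None):
    """Drop each tuple independently with probability `drop_probability`."""
    if rng is None:
        rng = random.Random()
    # rng.random() is in [0, 1), so probability 0 keeps all and 1 drops all.
    return [t for t in tuples if rng.random() >= drop_probability]


rng = random.Random(42)  # fixed seed only to make the run repeatable
kept = random_shed(list(range(1000)), drop_probability=0.3, rng=rng)
print(len(kept))  # roughly 700 of the 1000 tuples survive
```

The shedder never inspects the tuples, which is why it is cheap and also why it cannot favour tuples that matter more to the query result.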

In contrast, semantic shedding discards tuples using a function that correlates them with their contribution to the quality of the query output [33]. Tuples are discarded in a way that maximises result quality. Carney et al. [15] describe generic drop- and value-based functions to quantify the contribution of tuples on the result. A drop-based function specifies how the result quality of a query decreases with the number of discarded tuples. Many systems discard tuples so as to maximise the output tuple rate [17, 34]. In some cases, a value-based function correlates the query output quality with the values of the output tuples [23]. In contrast, our goal is to maximise the contribution of the source tuples used for processing.

There exist semantic load shedding approaches for specific operator types, such as joins [21, 26, 17], aggregates [10, 35] and XML operators [38]. These approaches require domain knowledge of the operator semantics, while we treat operators as black-boxes.

Distributed load shedding. The problem of distributed load allocation has been studied for stream processing systems. Zhao et al. [44] consider the allocation problem for applications with tasks modelled as synchronous and asynchronous forks and joins, commenting that this approach can be applied to distributed stream processing. Their work emphasises a theoretical framework for convergence to an optimal solution and presents simulations over two queries and three output streams. In contrast, we provide a fair stream processing system based on the contributions of source tuples and evaluate a prototype implementation.

Tatbul et al. [34] employ distributed shedding to maximise the total weighted throughput of queries by computing the drop selectivity of random or window drop operators inserted at the input streams of a stream processing system. Shedding decisions are made sequentially by each node along a query, starting from the leaves and propagating through metadata up to the input nodes. They assume identical layouts of queries on nodes (e.g. all root components are deployed on the same node), which is not applicable in a federated system. Finally, the scalability of the approach remains unclear, as the simulation-based evaluation only includes a handful of applications. In our prototype evaluation, we execute several hundreds of queries across tens of nodes.

Both approaches [34, 44] perform load shedding to maximise the sum of utility functions but sum maximisation does not achieve fairness. In addition, they require utility functions of special structure (either linear weighted functions [34] or concave functions [44] of rate), which does not capture query utility in practice. Finally, they require a-priori knowledge of the utility functions, which is challenging to estimate offline. In contrast, our approach targets fairness without assuming specific, a-priori utility functions. The only assumption is that the “utility” (as captured by the SIC metric) decreases with shedding and is implicitly modelled during system operation through the propagation and updating of the SIC metric in the data tuples.
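A toy example illustrates why maximising a sum of utilities need not be fair. It assumes linear weighted utilities, a single shared capacity and unit cost per unit of output rate, which is a deliberate simplification rather than the model of [34] or [44]; the function names, weights and numbers are hypothetical.

```python
def max_weighted_sum(weights, demand, capacity):
    """Allocate capacity to maximise sum(w_i * r_i) with 0 <= r_i <= demand.

    With unit costs, serving queries in order of decreasing weight is optimal.
    """
    rates = [0.0] * len(weights)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for i in order:
        rates[i] = min(demand, capacity)
        capacity -= rates[i]
    return rates


def equal_share(weights, demand, capacity):
    """Give every query the same rate, capped by its demand."""
    return [min(demand, capacity / len(weights))] * len(weights)


weights = [3.0, 2.0, 1.0]
print(max_weighted_sum(weights, demand=10.0, capacity=15.0))  # [10.0, 5.0, 0.0]
print(equal_share(weights, demand=10.0, capacity=15.0))       # [5.0, 5.0, 5.0]
```

The sum-maximising allocation reaches a weighted sum of 40 against 30 for the equal split, but it does so by starving the query with the lowest weight.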

An important issue for load shedding is the selection of drop locations in a query plan [33, 10]. The most efficient way is to discard tuples at upstream operators, close to sources [15, 20, 29]. This, however, is difficult to do in an FSPS because it requires global information about query plans that span multiple sites.

