This paper is included in the Proceedings of the
12th USENIX Symposium on Operating Systems Design
and Implementation (OSDI ’16).
November 2–4, 2016 • Savannah, GA, USA
ISBN 978-1-931971-33-1
Open access to the Proceedings of the
12th USENIX Symposium on Operating Systems
Design and Implementation
is sponsored by USENIX.
Morpheus: Towards Automated SLOs
for Enterprise Clusters
Sangeetha Abdu Jyothi, Microsoft and University of Illinois at Urbana–Champaign;
Carlo Curino, Ishai Menache, and Shravan Matthur Narayanamurthy, Microsoft;
Alexey Tumanov, Microsoft and Carnegie Mellon University; Jonathan Yaniv, Technion—
Israel Institute of Technology; Ruslan Mavlyutov, Microsoft and University of Fribourg;
Íñigo Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao, Microsoft
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/jyothi
Morpheus: Towards Automated SLOs for Enterprise Clusters
Sangeetha Abdu Jyothi (m,u), Carlo Curino (m), Ishai Menache (m), Shravan Matthur Narayanamurthy (m),
Alexey Tumanov (m,c), Jonathan Yaniv (t), Ruslan Mavlyutov (m,f), Íñigo Goiri (m), Subru Krishnan (m),
Janardhan Kulkarni (m), Sriram Rao (m)
(m) Microsoft, (u) University of Illinois at Urbana–Champaign, (c) Carnegie Mellon University,
(t) Technion–Israel Institute of Technology, (f) University of Fribourg
Abstract
Modern resource management frameworks for large-scale analytics leave unresolved the problematic tension between high cluster utilization and jobs' performance predictability—respectively coveted by operators and users. We address this in Morpheus, a new system that: 1) codifies implicit user expectations as explicit Service Level Objectives (SLOs), inferred from historical data, 2) enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and 3) mitigates inherent performance variance (e.g., due to failures) by means of dynamic re-provisioning of jobs. We validate these ideas against production traces from a 50k-node cluster, and show that Morpheus can lower the number of deadline violations by 5× to 13×, while retaining cluster utilization and lowering cluster footprint by 14% to 28%. We demonstrate the scalability and practicality of our implementation by deploying Morpheus on a 2700-node cluster and running it against production-derived workloads.
1 Introduction
Commercial enterprises ranging from Fortune-500 com-
panies to venture-capital funded startups are increas-
ingly relying on multi-tenanted clusters for running their
business-critical data analytics jobs. These jobs comprise multiple tasks that are run on different cluster nodes, where the unit of per-task resource allocation is a container (i.e., a bundle of resources such as CPU, RAM, and disk I/O) on an individual machine. From an analysis
of large-scale production workloads, we observe signifi-
cant variance in job runtimes, which sometimes results in
missed deadlines and negative business impact. This is
perceived by users as an unpredictable execution experi-
ence, and it accounts for 25% of (resource-provisioning
related) user escalations in Microsoft big-data clusters.
Unpredictability comes from several sources, which for
discussion purposes, we roughly group as follows:
• Sharing-induced – performance variability caused
by inconsistent allocations of resources across job
runs—a scheduling policy artifact.
• Inherent – performance variability due to changes in
the job input (size, skew, availability), source code
tweaks, failures, and hardware churn—this is en-
demic even in dedicated and lightly used clusters.
Unpredictability is most noticeable to users who sub-
mit periodic jobs (i.e., scheduled runs of the same job on
newly arriving data). Their recurrent nature prompts users
to form an expectation on jobs’ runtime performance as
well as react to any deviation from it, particularly, if the
job is business-critical (i.e., a production job).
Unfortunately, widely deployed resource managers [9,
27, 51, 55] provide limited mechanisms (e.g., fairness
weights, priorities, job killing) for users to cope with un-
predictability of such jobs. Given these basic tools, users
resort to a combination of ad-hoc tricks, often pivoting
around conservative over-provisioning for important production jobs. These coarse compensating actions are manual and inherently error-prone. Worse, they may adversely impact cluster utilization—a key metric for cluster operators. Owing to the substantial costs involved in building and operating large-scale clusters, operators seek a good return on investment (ROI) by maximizing utilization.
Divergent predictability and utilization requirements
are poorly handled by existing systems. This is taxing
and leads to tension between users and operators.
An ideal resource management infrastructure would
provide predictable execution as a core primitive, while
achieving high cluster utilization. This is a worthwhile
infrastructure to build, particularly, because periodic, pro-
duction jobs make up the majority of cluster workloads,
as reported by [43] and as we observe in §2.
In this paper, we move the state of the art towards this
ideal, by proposing a system called Morpheus. Building Morpheus poses several interesting challenges, namely automatically: 1) capturing user predictability expectations, 2) controlling sharing-induced unpredictability, and 3) coping with inherent unpredictability. We elaborate on these challenges next.
Inferring SLOs and modeling job resource demands.
Our first challenge is to formalize the implicit user pre-
dictability expectation in an explicit form that is action-
able for the underlying resource management infrastruc-
ture. We refer to the resulting characterization as an (in-
ferred) Service Level Objective (SLO). We focus on com-
pletion time SLOs or deadlines. The next step consists of
quantifying the amount of resources that must be provi-
sioned during the execution of the job to meet the SLO
without wastefully over-provisioning resources. Naturally, the precise resource requirements of each job depend on numerous factors, such as the function being computed, the degree of parallelism, and data size and skew.
The above is hard to accomplish for arbitrary jobs for
two reasons: 1) target SLOs are generally unknown to
operators, and often hard to define even for the users—
see §2, and 2) automatic provisioning is a known hard
problem even when fixing the application framework [52,
15, 26, 19]. However, the periodic nature of our work-
load makes this problem tractable by means of history-
driven approaches. We tackle this problem using a com-
bination of techniques: First, we statistically derive a
target SLO for a periodic job by analyzing all inter-job
data dependencies and ingress/egress operations (§ 4).
Second, we leverage telemetry of historical runs to derive a job resource model—a time-varying skyline of resource demands. We employ a Linear Programming formulation that explicitly controls the penalty of over-/under-provisioning—balancing predictability and utilization (§5). Programmatically deriving SLOs and job resource models enables a tuning-free user experience, where users
can simply sign-off on the proposed contract. Users may
alternatively override any parameter of the inferred SLO
and the job resource model, which becomes binding if ac-
cepted by our system.
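To make the over/under-provisioning trade-off concrete, here is a minimal, hypothetical sketch of its simplest special case: choosing a single constant allocation level against historical demand samples. The paper's actual formulation is a Linear Program over a time-varying skyline; the penalty weights `alpha`/`beta` and the function name below are our assumptions.

```python
# Hedged sketch of the provisioning trade-off: pick one constant
# allocation level a that minimizes a weighted sum of over- and
# under-provisioning penalties against observed demand samples.
# For this piecewise-linear objective, the optimum lies at one of
# the observed demand values (a weighted-quantile search).

def fit_allocation(demands, alpha=1.0, beta=10.0):
    """Return the allocation a minimizing
    sum(alpha * max(0, a - d) + beta * max(0, d - a) for d in demands)."""
    def cost(a):
        return sum(alpha * max(0, a - d) + beta * max(0, d - a)
                   for d in demands)
    # Evaluate only the observed demand values (candidate optima).
    return min(sorted(set(demands)), key=cost)
```

With a high under-provisioning penalty (`beta`), the fit tracks peak demand; with a high over-provisioning penalty (`alpha`), it tracks typical demand.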
Eliminating sharing-induced unpredictability. Our second challenge is to enforce SLOs while retaining high utilization in a shared environment. This consists of
controlling performance variance with minimal resource
over-provisioning. As noted above, sharing-induced un-
predictability is a scheduling artifact. Accordingly, we
structurally eliminate it by leveraging the notion of recur-
ring reservation, a scheduling construct that isolates peri-
odic production jobs from the noisiness of sharing. A key
property of recurring reservations is that once a periodic
job is admitted, each of its instantiations will have a predictable resource allocation. High utilization is achieved by means of a new online planning algorithm (§6). The
algorithm leverages jobs’ flexibility (e.g., deadline slack)
to pack reservations tightly.
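The packing idea can be illustrated with a hedged sketch: a greedy placement that exploits a reservation's slack (the gap between its earliest start and its deadline) to keep the cluster's peak load low. The actual online algorithm in §6 is more sophisticated; `place` and its parameters are hypothetical.

```python
# Hypothetical greedy placement of a reservation needing `demand`
# containers for `duration` consecutive slots, startable anywhere in
# [earliest, deadline - duration]. We pick the start time that keeps
# the resulting peak load lowest, then commit it to the plan.

def place(load, earliest, deadline, duration, demand):
    """Mutate `load` (containers reserved per time slot) in place and
    return the chosen start slot."""
    candidates = range(earliest, deadline - duration + 1)
    best = min(candidates,
               key=lambda s: max(load[t] + demand
                                 for t in range(s, s + duration)))
    for t in range(best, best + duration):
        load[t] += demand
    return best
```

With two identical jobs sharing a window, the second is shifted past the first rather than stacked on top of it, flattening the provisioning load over time.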
Mitigating inherent unpredictability. Our last challenge
is dealing with inherent performance variance (i.e., ex-
ogenous factors, such as task failures, code/data changes,
etc.). We do this by dynamically re-provisioning the cur-
rent instance of a reservation, in response to job resource
consumption, in relationship to the SLO. This compen-
sates for short-term drifts, while continuous retraining of
our SLO and job resource model extractors captures long-
term effects. This problem is similar in spirit to the one addressed in Jockey [19], as we discuss in §7.
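As a rough illustration of the reprovisioning idea (not the paper's actual controller), the sketch below compares observed progress against the time remaining before the deadline and resizes the allocation, assuming throughput scales linearly with allocated containers; all names are ours.

```python
# Hypothetical reprovisioning sketch: if a job is lagging behind its
# deadline, grow its reservation; if it is ahead, shrink it. Assumes
# progress rate is proportional to allocation (a simplification).

def reprovision(done_fraction, elapsed, deadline, current_alloc,
                min_alloc=1, max_alloc=None):
    """Return an allocation sized so the remaining work fits in the
    remaining time at the rate observed so far."""
    remaining_work = 1.0 - done_fraction
    remaining_time = max(deadline - elapsed, 1e-9)
    # Observed progress rate per unit of allocation.
    rate = done_fraction / max(elapsed, 1e-9) / current_alloc
    needed = remaining_work / (remaining_time * rate)
    new_alloc = max(min_alloc, needed)
    return min(new_alloc, max_alloc) if max_alloc else new_alloc
```

A job that is half done at half its deadline keeps its allocation; one that is only a quarter done must triple it to finish on time.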
We emphasize that all of the above techniques are
framework-independent—this is key for our production
clusters as they support multiple application frameworks.
Experimental validation. We validate our design by implementing Morpheus atop Hadoop/YARN [51] (§8).
We then perform several faithful simulations with traces
of a production cluster with over 50k nodes, and show
that the SLOs we derived are representative of the job’s
needs. The combination of tight job provisioning, reservation packing, and dynamic reprovisioning allows us to achieve a 5× to 13× reduction in potential SLO violations (with respect to user-defined static provisioning) at identical cluster utilization. Meanwhile, our packing algorithms leverage the flexibility in target SLOs to smooth the provisioning load over time and achieve better ROI, reducing cluster footprint by 14% to 28%. We conclude by deploying Morpheus on a 2700-node cluster and performing stress-tests with a production-derived workload. This confirms both the scalability of our design and the practicality of our implementation (§9). We intend to release components of Morpheus as open-source; progress can be tracked at [2].
2 Motivation
In the early phases of our project, we set out to confirm or deny our informal intuitions of how big-data clusters are operated and used. We did so by analyzing four data
sources: 1) execution logs of millions of jobs running on
clusters with more than 50k nodes, 2) infrastructure de-
ployment/upgrade logs, 3) interviews, discussion threads,
and escalation tickets from users, operators and decision
makers, and 4) targeted micro-benchmarks. We summa-
rize below the main findings of our analysis.
Figure 1: Analysis of user escalations and recurrent behaviors of production workloads.
2.1 Cluster workloads
Proper execution of production jobs is crucial. Production jobs represent over 75% of our workload and a similar percentage of the provisioned capacity—the rest is dedicated to ad-hoc jobs (10-20%) or held in reserve to handle growth/failures (5-10%). All unassigned capacity is redistributed fairly to speed up jobs. As expected, users care
mostly about proper execution of production jobs. Fig. 1a
shows that over 90% of all escalations relate to production
jobs, and this percentage grows to 100% for high/extreme
severity escalations.
Predictability trumps fairness. Further analysis of the
escalations of Fig. 1a and of discussion threads, indicates
that users are 120× more likely to complain about the
performance (un)predictability (25% of all job/resource-
management escalations) than about fairness (< 0.2%),
despite the fact that our system does not enforce fairness
strictly. This outcome may be expected, as customers can-
not observe how “fair” allocations really are.
Production jobs are often periodic. Over 60% of the
jobs in our larger/busier clusters are recurrent. Most
of these recurring jobs are production jobs operating on
continuously arriving data, hence are periodic in nature.
Fig. 1b shows the distribution of the period for periodic
jobs. Interestingly, most of the distribution mass is con-
tributed by a small number of natural values (e.g., once-
a-day, once-an-hour, etc.); this property will be useful to
our allocation mechanisms (§6). Fig. 1c provides further
evidence of recurrent behavior, by showing that job start
times are more densely distributed around the “start-of-
the-hour”. This confirms that most jobs are submitted au-
tomatically on a fixed schedule.
The above evidence confirms that the most important
portion of our workloads is strongly recurrent. This al-
lows for planning the cluster agenda, without being overly
conservative in the resource provisioning of jobs.
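As a hedged illustration of how this recurrent structure might be detected, the sketch below infers a job's period from its submission timestamps by snapping the median inter-arrival gap to a small set of natural periods (once-an-hour, once-a-day, etc.); the paper does not specify this algorithm, and all names and thresholds are our assumptions.

```python
from statistics import median

# Hypothetical period inference for a recurring job: compute the median
# inter-arrival gap of its submissions (seconds) and snap it to the
# nearest "natural" period, mirroring the clustering observed in Fig. 1b.
NATURAL_PERIODS = [3600, 6 * 3600, 12 * 3600, 24 * 3600, 7 * 24 * 3600]

def infer_period(submission_times, tolerance=0.1):
    """Return the natural period closest to the median gap, or None if
    no natural period is within `tolerance` (relative) of it."""
    if len(submission_times) < 3:
        return None
    times = sorted(submission_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    gap = median(gaps)
    best = min(NATURAL_PERIODS, key=lambda p: abs(p - gap))
    return best if abs(best - gap) / best <= tolerance else None
```

A daily job with minutes of submission jitter still snaps cleanly to a 24-hour period, while irregular ad-hoc submissions yield no period at all.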
2.2 Predictability challenges
Manual tuning of job allocation is hard. Fig. 2a shows
the distribution of the ratio between the total amount of
resources provisioned by the job’s owner and the job’s
actual resource usage (both comparing peak parallelism
and area). The wide range of over/under-allocation indicates that it is very hard for users to optimally provision resources for their jobs (or that they lack incentives to do so). We further validate this hunch through a user study in [15]. The graph shows that 75% of jobs are over-provisioned (even at their peak), with 20% of them over-provisioned by more than 10×. This is likely due to users statically setting their provisioning for a periodic job. We confirm this by observing that, in a one-month period, over 80% of periodic jobs had no changes in their resource provisioning. Large under-provisioned jobs partially offset the impact of over-provisioning on cluster utilization.
Sources of performance variance. It is hard to precisely
establish the sources of variance from the production logs
we have. We observe a small but positive correlation
(0.16) between the amount of sharing (above provisioned
resources) and job runtime variance. This indicates that
increased sharing affects runtime variance.
We investigate further the roles of sharing-induced and inherent performance variance by means of a simple micro-benchmark. Fig. 2b shows the normalized runtime of 5 TPC-H queries¹. We consider two configurations: one with constrained parallelism (500 containers) and one with unconstrained parallelism (>2000 containers); each container is a bundle of <1 core, 8 GB RAM>.
Each query was run 100 times in each configuration on an
empty cluster at 10TB scale. The graph shows that even
when removing common sources of inherent variability
(data availability, failures, network congestion), runtimes
remain unpredictable (e.g., due to stragglers, §7).
By analyzing these experiments and observing produc-
tion environments, we conclude that: 1) history-based
approaches can model well the “normal” behavior of a
query (small box), 2) handling outliers (as in the long
whiskers) without wasting resources requires a dynamic
component that performs reprovisioning online, and 3) while each source of variance may be addressed with an ad-hoc solution, providing a general-purpose line of defense is paramount—see §7 for our solution.
¹ Box shows [25th, 75th] percentiles; whiskers show [min, max].
Figure 2: A) Empirical CDF of provisioning vs. used resources, B) box-whisker plot of normalized runtime of TPC-H
queries running with 500 containers (left) and >2000 containers (right). C) cluster capacity of different machine types.
2.3 Changing conditions
Cluster conditions keep evolving—jobs may run on
different server types. We provide in Fig. 2c a measure of hardware churn in our clusters. We refer to different machine configurations as Stock Keeping Units (SKUs). Over a period of a year, the ratio between the number of machines of type SKU1 and type SKU2 changed from 80/20 to 55/45; the total number of nodes also kept changing over that period. This is notable, because even seemingly minor hardware differences can impact job runtime significantly—e.g., a 40% difference in runtime on SKU1 vs. SKU2 for a Spark production job.
User scripts keep evolving. We perform an analysis of the versioning of user scripts/UDFs. We remove all simple parameterizations that naturally change with every instantiation, and then construct a fuzzy match of the code structure. Within one month of trace data, we detect that 15-20% of periodic jobs had at least one large code delta (more than 10% code difference), and over 50% had at least one small delta (any change that breaks the MD5 of the parameter-stripped code). Hence, even an optimal static tuning is likely to drift out of optimality over time.
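A minimal sketch of the small-delta detection described above, assuming a toy normalization: strip string literals and numeric parameters that change with every instantiation, then compare MD5 digests. The regexes and names here are our assumptions; the paper's fuzzy structural match for large deltas is not reproduced.

```python
import hashlib
import re

# Hypothetical parameter stripping: replace string literals and
# number-like tokens (dates, counts) with placeholders so routine
# per-run parameterization does not register as a code change.
def normalized_md5(script):
    s = re.sub(r'"[^"]*"', '"?"', script)        # string literals
    s = re.sub(r'\b\d[\d/:.-]*\b', 'N', s)        # numbers and dates
    s = re.sub(r'\s+', ' ', s).strip()            # whitespace
    return hashlib.md5(s.encode()).hexdigest()

def small_delta(old, new):
    """True if the scripts differ even after parameter stripping."""
    return normalized_md5(old) != normalized_md5(new)
```

Two runs of the same query over different date partitions hash identically, while a genuine edit to the query body registers as a delta.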
Motivated by all of the above evidence, we focus on
building a resource management substrate that provides
predictable execution as a core primitive.
3 Overview of Morpheus
Morpheus is a system that continuously observes and
learns as periodic jobs execute over time. The findings are
used to economically reserve resources for the job ahead
of job execution, and dynamically adapt to changing con-
ditions at runtime. To give an informal sense of the key
functionalities in Morpheus, we start our overview by fol-
lowing a typical life-cycle of a periodic job (JobX) as it is
governed by Morpheus (§3.1). Next, we describe the core
subsystems (§3.2). Fig. 3 provides a logical view of the
architecture, and “zooms in” on a particular job.
3.1 “Life” of a periodic job
With reference to Fig. 3, a typical periodic job goes through the following stages.
1. The user periodically submits JobX with manually pro-
visioned resources. In the meantime, the underlying
infrastructure captures:
(a) Data-dependencies and ingress/egress operations
in the Provenance Graph (PG).
(b) Resource utilization of each run (marked as the
R1-R4 skylines in Fig. 3) in a Telemetry-History
(TH) database.
2. The SLO Inference performs an offline analysis of the
successful runs of JobX:
(a) From the PG, it derives a deadline d—the SLO.
(b) From the TH, it derives a model of the job resource demand over time, R*. We refer to R* as the job resource model.
3. The user signs off (or optionally overrides) the
automatically-generated SLO and job resource model.
4. Morpheus enforces SLOs via recurring reservations:
(a) Adds a recurring reservation for JobX into the cluster agenda—this sets aside resources over time, based on the job resource model R*.
(b) New instances of JobX run within the recurring
reservation (dedicated resources).
5. The Dynamic Reprovisioning component monitors the
job progress online, and increases/decreases the reser-
vation, to mitigate inherent execution variability.
6. Morpheus constantly feeds back into Step 2 the PG
and telemetry information of the new runs for contin-
uous learning and refinement of the SLO and the job
resource model.
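The life-cycle above can be compressed into a heavily simplified, hypothetical sketch; every class, function, and heuristic here (e.g., the 1.2× slack on the observed maximum runtime) is ours, not Morpheus's actual API.

```python
# Hypothetical compression of the life-cycle: learn from history,
# then commit a recurring reservation to the cluster agenda.

class Agenda:
    def __init__(self):
        self.reservations = {}
    def add_recurring(self, job, deadline, model):
        self.reservations[job] = (deadline, model)

def lifecycle(job, history, agenda):
    """history: list of (runtime, peak_demand) pairs from past runs."""
    # Step 2: crude stand-ins for SLO inference and the resource model.
    deadline = 1.2 * max(rt for rt, _ in history)   # deadline with slack
    model = max(peak for _, peak in history)        # peak-demand model
    # Steps 3-4: assume the user signs off; reserve resources ahead of time.
    agenda.add_recurring(job, deadline, model)
    return deadline, model
```

Steps 5-6 (online reprovisioning and continuous retraining) would then refine `deadline` and `model` after every run.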