Morpheus: towards automated SLOs for enterprise clusters

TLDR
Morpheus is a new system that codifies implicit user expectations as explicit Service Level Objectives (SLOs) inferred from historical data, enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and mitigates inherent performance variance by means of dynamic reprovisioning of jobs.

This paper is included in the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16).
November 2–4, 2016 • Savannah, GA, USA
ISBN 978-1-931971-33-1
Open access to the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.
Morpheus: Towards Automated SLOs for Enterprise Clusters
Sangeetha Abdu Jyothi, Microsoft and University of Illinois at Urbana–Champaign; Carlo Curino, Ishai Menache, and Shravan Matthur Narayanamurthy, Microsoft; Alexey Tumanov, Microsoft and Carnegie Mellon University; Jonathan Yaniv, Technion–Israel Institute of Technology; Ruslan Mavlyutov, Microsoft and University of Fribourg; Íñigo Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao, Microsoft
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/jyothi

Morpheus: Towards Automated SLOs for Enterprise Clusters
Sangeetha Abdu Jyothi (m,u), Carlo Curino (m), Ishai Menache (m), Shravan Matthur Narayanamurthy (m), Alexey Tumanov (m,c), Jonathan Yaniv (t), Ruslan Mavlyutov (m,f), Íñigo Goiri (m), Subru Krishnan (m), Janardhan Kulkarni (m), Sriram Rao (m)
(m) Microsoft, (u) University of Illinois at Urbana–Champaign, (c) Carnegie Mellon University, (t) Technion–Israel Institute of Technology, (f) University of Fribourg
Abstract
Modern resource management frameworks for large-scale analytics leave unresolved the problematic tension between high cluster utilization and job's performance predictability—respectively coveted by operators and users. We address this in Morpheus, a new system that: 1) codifies implicit user expectations as explicit Service Level Objectives (SLOs), inferred from historical data, 2) enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and 3) mitigates inherent performance variance (e.g., due to failures) by means of dynamic reprovisioning of jobs. We validate these ideas against production traces from a 50k node cluster, and show that Morpheus can lower the number of deadline violations by 5× to 13×, while retaining cluster utilization, and lowering cluster footprint by 14% to 28%. We demonstrate the scalability and practicality of our implementation by deploying Morpheus on a 2700-node cluster and running it against production-derived workloads.
1 Introduction
Commercial enterprises ranging from Fortune-500 companies to venture-capital funded startups are increasingly relying on multi-tenanted clusters for running their business-critical data analytics jobs. These jobs comprise multiple tasks that are run on different cluster nodes, where the unit of per-task resource allocation is a container (i.e., a bundle of resources such as CPU, RAM and disk I/O) on an individual machine. From an analysis of large-scale production workloads, we observe significant variance in job runtimes, which sometimes results in missed deadlines and negative business impact. This is perceived by users as an unpredictable execution experience, and it accounts for 25% of (resource-provisioning related) user escalations in Microsoft big-data clusters. Unpredictability comes from several sources, which for discussion purposes we roughly group as follows:

- Sharing-induced performance variability, caused by inconsistent allocations of resources across job runs—a scheduling policy artifact.
- Inherent performance variability, due to changes in the job input (size, skew, availability), source code tweaks, failures, and hardware churn—this is endemic even in dedicated and lightly used clusters.

Unpredictability is most noticeable to users who submit periodic jobs (i.e., scheduled runs of the same job on newly arriving data). Their recurrent nature prompts users to form an expectation of jobs' runtime performance, and to react to any deviation from it, particularly if the job is business-critical (i.e., a production job).

Unfortunately, widely deployed resource managers [9, 27, 51, 55] provide limited mechanisms (e.g., fairness weights, priorities, job killing) for users to cope with unpredictability of such jobs. Given these basic tools, users resort to a combination of ad-hoc tricks, often pivoting around conservative over-provisioning for important production jobs. These coarse compensating actions are manual and inherently error-prone. Worse, they may adversely impact cluster utilization—a key metric for cluster operators. Owing to the substantial costs involved in building and operating large-scale clusters, operators seek good return on investment (ROI) by maximizing utilization.

Divergent predictability and utilization requirements are poorly handled by existing systems. This is taxing and leads to tension between users and operators.

An ideal resource management infrastructure would provide predictable execution as a core primitive, while achieving high cluster utilization. This is a worthwhile infrastructure to build, particularly because periodic, production jobs make up the majority of cluster workloads, as reported by [43] and as we observe in §2.

In this paper, we move the state of the art towards this ideal by proposing a system called Morpheus. Building Morpheus poses several interesting challenges, namely automatically: 1) capturing user predictability expectations, 2) controlling sharing-induced unpredictability, and 3) coping with inherent unpredictability. We elaborate on these challenges next.
Inferring SLOs and modeling job resource demands. Our first challenge is to formalize the implicit user predictability expectation in an explicit form that is actionable for the underlying resource management infrastructure. We refer to the resulting characterization as an (inferred) Service Level Objective (SLO). We focus on completion-time SLOs, or deadlines. The next step consists of quantifying the amount of resources that must be provisioned during the execution of the job to meet the SLO without wastefully over-provisioning resources. Naturally, the precise resource requirements of each job depend on numerous factors, such as the function being computed, the degree of parallelism, data size and skew.

The above is hard to accomplish for arbitrary jobs for two reasons: 1) target SLOs are generally unknown to operators, and often hard to define even for the users—see §2, and 2) automatic provisioning is a known hard problem even when fixing the application framework [52, 15, 26, 19]. However, the periodic nature of our workload makes this problem tractable by means of history-driven approaches. We tackle this problem using a combination of techniques: First, we statistically derive a target SLO for a periodic job by analyzing all inter-job data dependencies and ingress/egress operations (§4). Second, we leverage telemetry of historical runs to derive a job resource model—a time-varying skyline of resource demands. We employ a Linear Programming formulation that explicitly controls the penalty of over/under-provisioning, balancing predictability and utilization (§5). Programmatically deriving the SLO and the job resource model enables a tuning-free user experience, where users can simply sign off on the proposed contract. Users may alternatively override any parameter of the inferred SLO and the job resource model, which becomes binding if accepted by our system.
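To make the skyline-fitting step concrete, here is a minimal sketch of one way such a linear program could look: given the resource skylines of several historical runs, it picks a single provisioned skyline that trades off per-step over-provisioning against (more heavily penalized) under-provisioning. The penalty weights alpha and beta, the variable layout, and the use of scipy's linprog are illustrative assumptions, not the actual formulation of §5.

```python
# Minimal sketch (not the paper's exact LP from Section 5): choose a provisioned
# skyline r[t] that minimizes alpha * over-provisioning + beta * under-provisioning
# against K historical demand skylines d[k][t], per time step t.
import numpy as np
from scipy.optimize import linprog

def fit_resource_model(history, alpha=1.0, beta=10.0):
    """history: K x T array of observed resource demand skylines (containers per step)."""
    d = np.asarray(history, dtype=float)
    K, T = d.shape
    # Variables: r[0..T-1], then slacks o[k,t] >= r[t]-d[k,t] and u[k,t] >= d[k,t]-r[t].
    n = T + 2 * K * T
    c = np.zeros(n)
    c[T:T + K * T] = alpha / K          # penalize over-provisioning
    c[T + K * T:] = beta / K            # penalize under-provisioning (weighted higher)

    A, b = [], []
    for k in range(K):
        for t in range(T):
            # r[t] - d[k,t] <= o[k,t]  ->  r[t] - o[k,t] <= d[k,t]
            row = np.zeros(n); row[t] = 1; row[T + k * T + t] = -1
            A.append(row); b.append(d[k, t])
            # d[k,t] - r[t] <= u[k,t]  ->  -r[t] - u[k,t] <= -d[k,t]
            row = np.zeros(n); row[t] = -1; row[T + K * T + k * T + t] = -1
            A.append(row); b.append(-d[k, t])

    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(0, None)] * n, method="highs")
    return res.x[:T]  # provisioned skyline r[t]

# Example: three past runs of a 6-step job.
skyline = fit_resource_model([[10, 40, 40, 20, 5, 0],
                              [12, 38, 45, 22, 6, 0],
                              [ 9, 42, 41, 18, 5, 1]])
```

With beta much larger than alpha, the fitted skyline hugs the upper envelope of past runs; with the weights closer together it tracks typical demand, which is the predictability-vs-utilization knob the text describes.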
Eliminating sharing-induced unpredictability. Our second challenge is to enforce SLOs while retaining high utilization in a shared environment. This consists of controlling performance variance with minimal resource over-provisioning. As noted above, sharing-induced unpredictability is a scheduling artifact. Accordingly, we structurally eliminate it by leveraging the notion of a recurring reservation, a scheduling construct that isolates periodic production jobs from the noisiness of sharing. A key property of recurring reservations is that once a periodic job is admitted, each of its instantiations will have a predictable resource allocation. High utilization is achieved by means of a new online planning algorithm (§6). The algorithm leverages jobs' flexibility (e.g., deadline slack) to pack reservations tightly.
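To illustrate how deadline slack helps pack reservations, the following toy sketch slides each job's skyline within its [submission, deadline] window and keeps the placement that minimizes the provisioned cluster peak. It is a greedy stand-in for the online planning algorithm of §6; the peak-minimizing cost and the flat agenda representation are assumptions made for illustration.

```python
# Toy packing sketch (stand-in for the online planning algorithm of Section 6):
# slide each recurring job's skyline within its deadline slack and keep the
# placement that minimizes the provisioned cluster peak.
def place_reservation(cluster, skyline, submit_step, deadline_step):
    """cluster: provisioned capacity per time step (mutated in place).
    skyline: per-step demand of the job; it must finish by deadline_step."""
    duration = len(skyline)
    best_offset, best_peak = None, float("inf")
    for start in range(submit_step, deadline_step - duration + 1):
        peak = max(cluster[start + i] + skyline[i] for i in range(duration))
        if peak < best_peak:
            best_offset, best_peak = start, peak
    if best_offset is None:
        raise ValueError("job does not fit within its deadline window")
    for i in range(duration):
        cluster[best_offset + i] += skyline[i]
    return best_offset

# Example: a 24-step (e.g., hourly) agenda with two periodic jobs.
agenda = [0] * 24
place_reservation(agenda, [40, 40, 20], submit_step=0, deadline_step=12)
place_reservation(agenda, [30, 30, 30, 10], submit_step=2, deadline_step=12)
```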
Mitigating inherent unpredictability. Our last challenge is dealing with inherent performance variance (i.e., exogenous factors, such as task failures, code/data changes, etc.). We do this by dynamically re-provisioning the current instance of a reservation, in response to the job's resource consumption relative to its SLO. This compensates for short-term drifts, while continuous retraining of our SLO and job resource model extractors captures long-term effects. This problem is similar in spirit to what was proposed in Jockey [19], as we discuss in §7.
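One plausible shape for such a re-provisioning loop is sketched below: it compares the fraction of work completed with the fraction of the deadline elapsed, and grows or shrinks the remaining reservation accordingly. The progress metric, the scaling rule, and the thresholds are illustrative assumptions rather than the exact policy of §7.

```python
# Illustrative re-provisioning loop (not the exact policy of Section 7):
# if the job is behind its expected progress, grow the remaining reservation;
# if it is comfortably ahead, shrink it and return resources to the cluster.
def reprovision(work_done, work_total, elapsed, deadline, current_containers,
                slack=0.1, max_containers=2000):
    progress = work_done / work_total            # fraction of tasks finished
    expected = elapsed / deadline                # fraction of the deadline spent
    if progress + slack < expected:              # behind schedule: scale up
        scale = expected / max(progress, 1e-6)
        return min(max_containers, int(current_containers * scale))
    if progress > expected + slack:              # ahead of schedule: scale down
        return max(1, int(current_containers * expected / progress))
    return current_containers                    # on track: leave the reservation alone

# Example: halfway to the deadline but only 30% of the work done -> grow the reservation.
new_size = reprovision(work_done=300, work_total=1000, elapsed=30, deadline=60,
                       current_containers=500)
```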
We emphasize that all of the above techniques are framework-independent—this is key for our production clusters, as they support multiple application frameworks.
Experimental validation. We validate our design by implementing Morpheus atop Hadoop/YARN [51] (§8). We then perform several faithful simulations with traces of a production cluster with over 50k nodes, and show that the SLOs we derive are representative of the jobs' needs. The combination of tight job provisioning, reservation packing, and dynamic reprovisioning allows us to achieve a 5× to 13× reduction in potential SLO violations (with respect to user-defined static provisioning) at identical cluster utilization. Moreover, our packing algorithms leverage the flexibility in target SLOs to smooth the provisioning load over time, achieving better ROI by reducing the cluster footprint by 14% to 28%. We conclude by deploying Morpheus on a 2700-node cluster and performing stress-tests with a production-derived workload. This confirms both the scalability of our design and the practicality of our implementation (§9). We intend to release components of Morpheus as open source; progress can be tracked at [2].
2 Motivation
In the early phases of our project, we set out to confirm or deny our informal intuitions of how big-data clusters are operated and used. We did so by analyzing four data sources: 1) execution logs of millions of jobs running on clusters with more than 50k nodes, 2) infrastructure deployment/upgrade logs, 3) interviews, discussion threads, and escalation tickets from users, operators and decision makers, and 4) targeted micro-benchmarks. We summarize below the main findings of our analysis.

Figure 1: Analysis of user escalations and recurrent behaviors of production workloads. (A: fraction of escalations (%), by severity (low/medium/high/extreme), for production vs. ad-hoc jobs; B: distribution of periods of periodic jobs; C: distribution of job start times.)
2.1 Cluster workloads
Proper execution of production jobs is crucial. Production jobs represent over 75% of our workload and a similar percentage of the provisioned capacity—the rest being dedicated to ad-hoc jobs (10-20%) or kept ready to handle growth/failures (5-10%). All unassigned capacity is redistributed fairly to speed up jobs. As expected, users care mostly about proper execution of production jobs. Fig. 1a shows that over 90% of all escalations relate to production jobs, and this percentage grows to 100% for high/extreme severity escalations.
Predictability trumps fairness. Further analysis of the escalations of Fig. 1a and of discussion threads indicates that users are 120× more likely to complain about performance (un)predictability (25% of all job/resource-management escalations) than about fairness (<0.2%), despite the fact that our system does not enforce fairness strictly. This outcome may be expected, as customers cannot observe how "fair" allocations really are.
Production jobs are often periodic. Over 60% of the jobs in our larger/busier clusters are recurrent. Most of these recurring jobs are production jobs operating on continuously arriving data, and hence are periodic in nature. Fig. 1b shows the distribution of the period for periodic jobs. Interestingly, most of the distribution mass is contributed by a small number of natural values (e.g., once-a-day, once-an-hour, etc.); this property will be useful to our allocation mechanisms (§6). Fig. 1c provides further evidence of recurrent behavior, by showing that job start times are more densely distributed around the "start of the hour". This confirms that most jobs are submitted automatically on a fixed schedule.
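As a rough illustration of how such recurrence can be detected from submission logs alone, the sketch below snaps the median gap between successive submissions of a job to the nearest natural period (hourly, daily, weekly). The heuristic and its tolerance are our own assumptions, not the analysis pipeline behind Fig. 1b.

```python
# Heuristic sketch: infer a job's period from its submission timestamps by
# snapping the median inter-submission gap to a natural period (hour/day/week).
from statistics import median

NATURAL_PERIODS = {"hourly": 3600, "daily": 86400, "weekly": 604800}

def infer_period(submit_times, tolerance=0.1):
    """submit_times: sorted unix timestamps of one recurring job's submissions."""
    if len(submit_times) < 3:
        return None
    gaps = [b - a for a, b in zip(submit_times, submit_times[1:])]
    med = median(gaps)
    for name, seconds in NATURAL_PERIODS.items():
        if abs(med - seconds) <= tolerance * seconds:
            return name
    return f"{med:.0f}s"   # recurrent, but not on a natural period

# Example: a job submitted a few minutes past every hour.
print(infer_period([0, 3620, 7190, 10810, 14400]))  # -> "hourly"
```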
The above evidence confirms that the most important portion of our workloads is strongly recurrent. This allows for planning the cluster agenda, without being overly conservative in the resource provisioning of jobs.
2.2 Predictability challenges
Manual tuning of job allocation is hard. Fig. 2a shows the distribution of the ratio between the total amount of resources provisioned by the job's owner and the job's actual resource usage (comparing both peak parallelism and area). The wide range of over/under-allocation indicates that it is very hard for users to optimally provision resources for their jobs (or that they lack the incentives to do so). We further validate this hunch through a user study in [15]. The graph shows that 75% of jobs are over-provisioned (even at their peak), with 20% of them over-provisioned by more than 10×. This is likely due to users statically setting their provisioning for a periodic job. We confirm this by observing that, over a one-month period, more than 80% of periodic jobs had no changes in their resource provisioning. Large under-provisioned jobs partially offset the impact of over-provisioning on cluster utilization.
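To pin down the two ratios plotted in Fig. 2a, the short sketch below computes them for a single run from its provisioned and observed skylines: a peak ratio (provisioned parallelism vs. peak used parallelism) and an area ratio (provisioned vs. used container-time). The exact definitions used for the figure may differ; this is one reasonable reading.

```python
# Sketch of the two over/under-provisioning ratios discussed around Fig. 2a:
# peak ratio compares provisioned parallelism to the observed peak, and
# area ratio compares provisioned to used container-time.
def provisioning_ratios(provisioned, used):
    """provisioned, used: per-time-step container counts for one job run."""
    peak_ratio = max(provisioned) / max(used)
    area_ratio = sum(provisioned) / sum(used)
    return peak_ratio, area_ratio

# Example: a job provisioned flat at 100 containers but peaking at 40.
peak, area = provisioning_ratios([100] * 6, [10, 40, 40, 20, 5, 1])
# peak == 2.5 (2.5x over-provisioned at peak); area is roughly 5.2x overall.
```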
Sources of performance variance. It is hard to precisely establish the sources of variance from the production logs we have. We observe a small but positive correlation (0.16) between the amount of sharing (above provisioned resources) and job runtime variance. This indicates that increased sharing affects runtime variance.
We investigate further the roles of sharing-induced and inherent performance variance by means of a simple micro-benchmark. Fig. 2b shows the normalized runtime of 5 TPC-H queries.¹ We consider two configurations: one with constrained parallelism (500 containers), and one with unconstrained parallelism (>2000 containers); each container is a bundle of <1 core, 8 GB RAM>. Each query was run 100 times in each configuration on an empty cluster at 10 TB scale. The graph shows that even when removing common sources of inherent variability (data availability, failures, network congestion), runtimes remain unpredictable (e.g., due to stragglers, §7).
By analyzing these experiments and observing production environments, we conclude that: 1) history-based approaches can model well the "normal" behavior of a query (small box), 2) handling outliers (as in the long whiskers) without wasting resources requires a dynamic component that performs reprovisioning online, and 3) while each source of variance may be addressed with an ad-hoc solution, providing a general-purpose line of defense is paramount—see §7 for our solution.
¹ The box shows the [25th, 75th] percentiles, and the whiskers show [min, max].

Figure 2: A) Empirical CDF of provisioned vs. used resources; B) box-whisker plot of the normalized runtime of TPC-H queries running with 500 containers (left) and >2000 containers (right); C) cluster capacity (%) by machine type (SKU1 vs. SKU2) over time.
2.3 Changing conditions
Cluster conditions keep evolving—jobs may run on different server types. We provide in Fig. 2c a measure of hardware churn in our clusters. We refer to different machine configurations as Stock Keeping Units (SKUs). Over a period of a year, the ratio between the number of machines of type SKU1 and type SKU2 changed from 80/20 to 55/45; the total number of nodes also kept changing over that period. This is notable because even seemingly minor hardware differences can impact job runtime significantly—e.g., a 40% difference in runtime on SKU1 vs. SKU2 for a Spark production job.
User scripts keep evolving. We perform an analysis of the versioning of user scripts/UDFs. We remove all simple parameterizations that naturally change with every instantiation, and then construct a fuzzy match of the code structure. Within one month of trace data, we detect that 15-20% of periodic jobs had at least one large code delta (more than 10% code difference), and over 50% had at least one small delta (any change that breaks the MD5 of the parameter-stripped code). Hence, even an optimal static tuning is likely to drift out of optimality over time.
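A minimal sketch of the "small delta" check described above: strip literal parameters from a script, hash the remainder with MD5, and flag any change in the hash across versions. The parameter-stripping regular expressions are illustrative assumptions, and the fuzzy structural match used for large deltas is not reproduced here.

```python
# Sketch of the "small delta" detector: hash the script with literal parameters
# stripped, and report a change whenever the MD5 of consecutive versions differs.
import hashlib
import re

def parameter_stripped_md5(script_text):
    # Replace string and numeric literals (e.g., dates, paths, sizes) with placeholders,
    # so that routine per-run parameter changes do not affect the hash.
    stripped = re.sub(r'"[^"]*"|\'[^\']*\'', "<str>", script_text)
    stripped = re.sub(r"\b\d+(\.\d+)?\b", "<num>", stripped)
    stripped = re.sub(r"\s+", " ", stripped).strip()
    return hashlib.md5(stripped.encode("utf-8")).hexdigest()

def has_small_delta(prev_version, new_version):
    return parameter_stripped_md5(prev_version) != parameter_stripped_md5(new_version)

# Example: changing only the date parameter is not a delta; changing the logic is.
v1 = 'out = SELECT * FROM logs WHERE day == "2016-01-01";'
v2 = 'out = SELECT * FROM logs WHERE day == "2016-01-02";'
v3 = 'out = SELECT user, COUNT(*) FROM logs WHERE day == "2016-01-02" GROUP BY user;'
assert not has_small_delta(v1, v2)
assert has_small_delta(v2, v3)
```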
Motivated by all of the above evidence, we focus on building a resource management substrate that provides predictable execution as a core primitive.
3 Overview of Morpheus
Morpheus is a system that continuously observes and learns as periodic jobs execute over time. The findings are used to economically reserve resources for the job ahead of job execution, and to dynamically adapt to changing conditions at runtime. To give an informal sense of the key functionalities in Morpheus, we start our overview by following the typical life-cycle of a periodic job (JobX) as it is governed by Morpheus (§3.1). Next, we describe the core subsystems (§3.2). Fig. 3 provides a logical view of the architecture, and "zooms in" on a particular job.
3.1 “Life” of a periodic job
With reference to Fig. 3, a typical periodic job goes through the following stages (a code sketch of this life-cycle follows the list):

1. The user periodically submits JobX with manually provisioned resources. In the meantime, the underlying infrastructure captures:
   (a) Data dependencies and ingress/egress operations in the Provenance Graph (PG).
   (b) Resource utilization of each run (marked as the R1-R4 skylines in Fig. 3) in a Telemetry-History (TH) database.
2. The SLO Inference performs an offline analysis of the successful runs of JobX:
   (a) From the PG, it derives a deadline d—the SLO.
   (b) From the TH, it derives a model of the job resource demand over time, R. We refer to R as the job resource model.
3. The user signs off on (or optionally overrides) the automatically generated SLO and job resource model.
4. Morpheus enforces SLOs via recurring reservations:
   (a) It adds a recurring reservation for JobX into the cluster agenda—this sets aside resources over time based on the job resource model R.
   (b) New instances of JobX run within the recurring reservation (dedicated resources).
5. The Dynamic Reprovisioning component monitors the job's progress online, and increases/decreases the reservation to mitigate inherent execution variability.
6. Morpheus constantly feeds the PG and telemetry information of the new runs back into Step 2, for continuous learning and refinement of the SLO and the job resource model.
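The sketch below renders these stages as a small orchestration loop over illustrative data types (an inferred SLO, a job resource model, and a recurring reservation). All class, method, and function names are hypothetical scaffolding meant only to make the control flow concrete; they are not Morpheus' actual interfaces.

```python
# Hypothetical scaffolding for the life-cycle above (names are illustrative,
# not Morpheus' real interfaces): infer an SLO and resource model from history,
# reserve resources for each instance, and re-provision while it runs.
from dataclasses import dataclass
from typing import List

@dataclass
class InferredSLO:
    period_s: int          # e.g., 86400 for a daily job
    deadline_s: int        # completion deadline within each period

@dataclass
class RecurringReservation:
    job_id: str
    slo: InferredSLO
    skyline: List[int]     # provisioned containers per time step (job resource model R)

def slo_inference(provenance_graph, telemetry_history, job_id):
    """Step 2: derive the SLO from the PG and the resource model R from the TH."""
    slo = provenance_graph.derive_deadline(job_id)          # 2(a), assumed helper
    skyline = telemetry_history.fit_resource_model(job_id)  # 2(b), assumed helper
    return RecurringReservation(job_id, slo, skyline)

def run_periodic_job(agenda, reservation, monitor):
    """Steps 4-5: place each instance in the agenda, then re-provision online."""
    instance = agenda.admit(reservation)          # 4(a)-(b): dedicated resources
    while not instance.finished():
        if monitor.behind_schedule(instance):     # 5: mitigate inherent variability
            instance.grow()
        elif monitor.ahead_of_schedule(instance):
            instance.shrink()
    return instance.telemetry()                   # 6: fed back into slo_inference
```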

References

- Apache Hadoop YARN: Yet Another Resource Negotiator. Summarizes the design, development, and deployment state of the next generation of Hadoop's compute platform, YARN, which decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. Shows that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.
- Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. Proposes a simple algorithm, delay scheduling, which achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2× while preserving fairness.
- Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. Proposes Dominant Resource Fairness (DRF), a generalization of max-min fairness to multiple resource types, and shows that it leads to better throughput and fairness than the slot-based fair sharing schemes in current cluster schedulers.
- Large-Scale Cluster Management at Google with Borg. Presents a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience.