Morpheus: towards automated SLOs for enterprise clusters

TLDR
Morpheus is a new system that codifies implicit user expectations as explicit Service Level Objectives (SLOs) inferred from historical data, enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and mitigates inherent performance variance by means of dynamic reprovisioning of jobs.

This paper is included in the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16).
November 2–4, 2016 • Savannah, GA, USA
ISBN 978-1-931971-33-1
Open access to the Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.
Morpheus: Towards Automated SLOs for Enterprise Clusters
Sangeetha Abdu Jyothi, Microsoft and University of Illinois at Urbana–Champaign; Carlo Curino, Ishai Menache, and Shravan Matthur Narayanamurthy, Microsoft; Alexey Tumanov, Microsoft and Carnegie Mellon University; Jonathan Yaniv, Technion–Israel Institute of Technology; Ruslan Mavlyutov, Microsoft and University of Fribourg; Íñigo Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao, Microsoft
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/jyothi

Morpheus: Towards Automated SLOs for Enterprise Clusters
Sangeetha Abdu Jyothi (m,u), Carlo Curino (m), Ishai Menache (m), Shravan Matthur Narayanamurthy (m), Alexey Tumanov (m,c), Jonathan Yaniv (t), Ruslan Mavlyutov (m,f), Íñigo Goiri (m), Subru Krishnan (m), Janardhan Kulkarni (m), Sriram Rao (m)
(m) Microsoft, (u) University of Illinois at Urbana–Champaign, (c) Carnegie Mellon University, (t) Technion–Israel Institute of Technology, (f) University of Fribourg
Abstract
Modern resource management frameworks for large-scale analytics leave unresolved the problematic tension between high cluster utilization and job's performance predictability—respectively coveted by operators and users. We address this in Morpheus, a new system that: 1) codifies implicit user expectations as explicit Service Level Objectives (SLOs), inferred from historical data, 2) enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and 3) mitigates inherent performance variance (e.g., due to failures) by means of dynamic reprovisioning of jobs. We validate these ideas against production traces from a 50k node cluster, and show that Morpheus can lower the number of deadline violations by 5× to 13×, while retaining cluster utilization, and lowering cluster footprint by 14% to 28%. We demonstrate the scalability and practicality of our implementation by deploying Morpheus on a 2700-node cluster and running it against production-derived workloads.
1 Introduction
Commercial enterprises ranging from Fortune-500 companies to venture-capital funded startups are increasingly relying on multi-tenanted clusters for running their business-critical data analytics jobs. These jobs comprise multiple tasks that are run on different cluster nodes, where the unit of per-task resource allocation is a container (i.e., a bundle of resources such as CPU, RAM and disk I/O) on an individual machine. From an analysis of large-scale production workloads, we observe significant variance in job runtimes, which sometimes results in missed deadlines and negative business impact. This is perceived by users as an unpredictable execution experience, and it accounts for 25% of (resource-provisioning related) user escalations in Microsoft big-data clusters. Unpredictability comes from several sources, which for discussion purposes we roughly group as follows:

- Sharing-induced performance variability, caused by inconsistent allocations of resources across job runs—a scheduling policy artifact.
- Inherent performance variability, due to changes in the job input (size, skew, availability), source code tweaks, failures, and hardware churn—this is endemic even in dedicated and lightly used clusters.

Unpredictability is most noticeable to users who submit periodic jobs (i.e., scheduled runs of the same job on newly arriving data). Their recurrent nature prompts users to form an expectation of jobs' runtime performance, and to react to any deviation from it, particularly if the job is business-critical (i.e., a production job).

Unfortunately, widely deployed resource managers [9, 27, 51, 55] provide limited mechanisms (e.g., fairness weights, priorities, job killing) for users to cope with unpredictability of such jobs. Given these basic tools, users resort to a combination of ad-hoc tricks, often pivoting around conservative over-provisioning for important production jobs. These coarse compensating actions are manual and inherently error-prone. Worse, they may adversely impact cluster utilization—a key metric for cluster operators. Owing to the substantial costs involved in building and operating large-scale clusters, operators seek good return on investment (ROI) by maximizing utilization.

Divergent predictability and utilization requirements are poorly handled by existing systems. This is taxing and leads to tension between users and operators.

An ideal resource management infrastructure would provide predictable execution as a core primitive, while achieving high cluster utilization. This is a worthwhile infrastructure to build, particularly because periodic, production jobs make up the majority of cluster workloads, as reported by [43] and as we observe in §2.

In this paper, we move the state of the art towards this ideal by proposing a system called Morpheus. Building Morpheus poses several interesting challenges, namely automatically: 1) capturing user predictability expectations, 2) controlling sharing-induced unpredictability, and 3) coping with inherent unpredictability. We elaborate on these challenges next.
Inferring SLOs and modeling job resource demands. Our first challenge is to formalize the implicit user predictability expectation in an explicit form that is actionable for the underlying resource management infrastructure. We refer to the resulting characterization as an (inferred) Service Level Objective (SLO). We focus on completion-time SLOs, or deadlines. The next step consists of quantifying the amount of resources that must be provisioned during the execution of the job to meet the SLO without wastefully over-provisioning resources. Naturally, the precise resource requirements of each job depend on numerous factors, such as the function being computed, the degree of parallelism, data size and skew.

The above is hard to accomplish for arbitrary jobs for two reasons: 1) target SLOs are generally unknown to operators, and often hard to define even for the users—see §2, and 2) automatic provisioning is a known hard problem even when fixing the application framework [52, 15, 26, 19]. However, the periodic nature of our workload makes this problem tractable by means of history-driven approaches. We tackle this problem using a combination of techniques: First, we statistically derive a target SLO for a periodic job by analyzing all inter-job data dependencies and ingress/egress operations (§4). Second, we leverage telemetry of historical runs to derive a job resource model—a time-varying skyline of resource demands. We employ a Linear Programming formulation that explicitly controls the penalty of over/under-provisioning, balancing predictability and utilization (§5). Programmatically deriving the SLO and the job resource model enables a tuning-free user experience, where users can simply sign off on the proposed contract. Users may alternatively override any parameter of the inferred SLO and the job resource model, which becomes binding if accepted by our system.
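To make the skyline-fitting step concrete, here is a minimal sketch of one way such a linear program could look: given the resource skylines of several historical runs, it picks a single provisioned skyline that trades off per-step over-provisioning against (more heavily penalized) under-provisioning. The penalty weights alpha and beta, the variable layout, and the use of scipy's linprog are illustrative assumptions, not the actual formulation of §5.

```python
# Minimal sketch (not the paper's exact LP from Section 5): choose a provisioned
# skyline r[t] that minimizes alpha * over-provisioning + beta * under-provisioning
# against K historical demand skylines d[k][t], per time step t.
import numpy as np
from scipy.optimize import linprog

def fit_resource_model(history, alpha=1.0, beta=10.0):
    """history: K x T array of observed resource demand skylines (containers per step)."""
    d = np.asarray(history, dtype=float)
    K, T = d.shape
    # Variables: r[0..T-1], then slacks o[k,t] >= r[t]-d[k,t] and u[k,t] >= d[k,t]-r[t].
    n = T + 2 * K * T
    c = np.zeros(n)
    c[T:T + K * T] = alpha / K          # penalize over-provisioning
    c[T + K * T:] = beta / K            # penalize under-provisioning (weighted higher)

    A, b = [], []
    for k in range(K):
        for t in range(T):
            # r[t] - d[k,t] <= o[k,t]  ->  r[t] - o[k,t] <= d[k,t]
            row = np.zeros(n); row[t] = 1; row[T + k * T + t] = -1
            A.append(row); b.append(d[k, t])
            # d[k,t] - r[t] <= u[k,t]  ->  -r[t] - u[k,t] <= -d[k,t]
            row = np.zeros(n); row[t] = -1; row[T + K * T + k * T + t] = -1
            A.append(row); b.append(-d[k, t])

    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(0, None)] * n, method="highs")
    return res.x[:T]  # provisioned skyline r[t]

# Example: three past runs of a 6-step job.
skyline = fit_resource_model([[10, 40, 40, 20, 5, 0],
                              [12, 38, 45, 22, 6, 0],
                              [ 9, 42, 41, 18, 5, 1]])
```

With beta much larger than alpha, the fitted skyline hugs the upper envelope of past runs; with the weights closer together it tracks typical demand, which is the predictability-vs-utilization knob the text describes.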
Eliminating sharing-induced unpredictability. Our second challenge is to enforce SLOs while retaining high utilization in a shared environment. This consists of controlling performance variance with minimal resource over-provisioning. As noted above, sharing-induced unpredictability is a scheduling artifact. Accordingly, we structurally eliminate it by leveraging the notion of a recurring reservation, a scheduling construct that isolates periodic production jobs from the noisiness of sharing. A key property of recurring reservations is that once a periodic job is admitted, each of its instantiations will have a predictable resource allocation. High utilization is achieved by means of a new online planning algorithm (§6). The algorithm leverages jobs' flexibility (e.g., deadline slack) to pack reservations tightly.
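To illustrate how deadline slack helps pack reservations, the following toy sketch slides each job's skyline within its [submission, deadline] window and keeps the placement that minimizes the provisioned cluster peak. It is a greedy stand-in for the online planning algorithm of §6; the peak-minimizing cost and the flat agenda representation are assumptions made for illustration.

```python
# Toy packing sketch (stand-in for the online planning algorithm of Section 6):
# slide each recurring job's skyline within its deadline slack and keep the
# placement that minimizes the provisioned cluster peak.
def place_reservation(cluster, skyline, submit_step, deadline_step):
    """cluster: provisioned capacity per time step (mutated in place).
    skyline: per-step demand of the job; it must finish by deadline_step."""
    duration = len(skyline)
    best_offset, best_peak = None, float("inf")
    for start in range(submit_step, deadline_step - duration + 1):
        peak = max(cluster[start + i] + skyline[i] for i in range(duration))
        if peak < best_peak:
            best_offset, best_peak = start, peak
    if best_offset is None:
        raise ValueError("job does not fit within its deadline window")
    for i in range(duration):
        cluster[best_offset + i] += skyline[i]
    return best_offset

# Example: a 24-step (e.g., hourly) agenda with two periodic jobs.
agenda = [0] * 24
place_reservation(agenda, [40, 40, 20], submit_step=0, deadline_step=12)
place_reservation(agenda, [30, 30, 30, 10], submit_step=2, deadline_step=12)
```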
Mitigating inherent unpredictability. Our last challenge is dealing with inherent performance variance (i.e., exogenous factors, such as task failures, code/data changes, etc.). We do this by dynamically re-provisioning the current instance of a reservation, in response to the job's resource consumption relative to its SLO. This compensates for short-term drifts, while continuous retraining of our SLO and job resource model extractors captures long-term effects. This problem is similar in spirit to what was proposed in Jockey [19], as we discuss in §7.
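One plausible shape for such a re-provisioning loop is sketched below: it compares the fraction of work completed with the fraction of the deadline elapsed, and grows or shrinks the remaining reservation accordingly. The progress metric, the scaling rule, and the thresholds are illustrative assumptions rather than the exact policy of §7.

```python
# Illustrative re-provisioning loop (not the exact policy of Section 7):
# if the job is behind its expected progress, grow the remaining reservation;
# if it is comfortably ahead, shrink it and return resources to the cluster.
def reprovision(work_done, work_total, elapsed, deadline, current_containers,
                slack=0.1, max_containers=2000):
    progress = work_done / work_total            # fraction of tasks finished
    expected = elapsed / deadline                # fraction of the deadline spent
    if progress + slack < expected:              # behind schedule: scale up
        scale = expected / max(progress, 1e-6)
        return min(max_containers, int(current_containers * scale))
    if progress > expected + slack:              # ahead of schedule: scale down
        return max(1, int(current_containers * expected / progress))
    return current_containers                    # on track: leave the reservation alone

# Example: halfway to the deadline but only 30% of the work done -> grow the reservation.
new_size = reprovision(work_done=300, work_total=1000, elapsed=30, deadline=60,
                       current_containers=500)
```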
We emphasize that all of the above techniques are framework-independent—this is key for our production clusters, as they support multiple application frameworks.
Experimental validation. We validate our design by implementing Morpheus atop Hadoop/YARN [51] (§8). We then perform several faithful simulations with traces of a production cluster with over 50k nodes, and show that the SLOs we derive are representative of the jobs' needs. The combination of tight job provisioning, reservation packing, and dynamic reprovisioning allows us to achieve a 5× to 13× reduction in potential SLO violations (with respect to user-defined static provisioning) at identical cluster utilization. Moreover, our packing algorithms leverage the flexibility in target SLOs to smooth the provisioning load over time, achieving better ROI by reducing the cluster footprint by 14% to 28%. We conclude by deploying Morpheus on a 2700-node cluster and performing stress-tests with a production-derived workload. This confirms both the scalability of our design and the practicality of our implementation (§9). We intend to release components of Morpheus as open source; progress can be tracked at [2].
2 Motivation
In the early phases of our project, we set out to confirm or deny our informal intuitions of how big-data clusters are operated and used. We did so by analyzing four data sources: 1) execution logs of millions of jobs running on clusters with more than 50k nodes, 2) infrastructure deployment/upgrade logs, 3) interviews, discussion threads, and escalation tickets from users, operators and decision makers, and 4) targeted micro-benchmarks. We summarize below the main findings of our analysis.

Figure 1: Analysis of user escalations and recurrent behaviors of production workloads. (A: fraction of escalations (%), by severity (low/medium/high/extreme), for production vs. ad-hoc jobs; B: distribution of periods of periodic jobs; C: distribution of job start times.)
2.1 Cluster workloads
Proper execution of production jobs is crucial. Production jobs represent over 75% of our workload and a similar percentage of the provisioned capacity—the rest being dedicated to ad-hoc jobs (10-20%) or kept ready to handle growth/failures (5-10%). All unassigned capacity is redistributed fairly to speed up jobs. As expected, users care mostly about proper execution of production jobs. Fig. 1a shows that over 90% of all escalations relate to production jobs, and this percentage grows to 100% for high/extreme severity escalations.
Predictability trumps fairness. Further analysis of the escalations of Fig. 1a and of discussion threads indicates that users are 120× more likely to complain about performance (un)predictability (25% of all job/resource-management escalations) than about fairness (<0.2%), despite the fact that our system does not enforce fairness strictly. This outcome may be expected, as customers cannot observe how "fair" allocations really are.
Production jobs are often periodic. Over 60% of the jobs in our larger/busier clusters are recurrent. Most of these recurring jobs are production jobs operating on continuously arriving data, and hence are periodic in nature. Fig. 1b shows the distribution of the period for periodic jobs. Interestingly, most of the distribution mass is contributed by a small number of natural values (e.g., once-a-day, once-an-hour, etc.); this property will be useful to our allocation mechanisms (§6). Fig. 1c provides further evidence of recurrent behavior, by showing that job start times are more densely distributed around the "start of the hour". This confirms that most jobs are submitted automatically on a fixed schedule.
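As a rough illustration of how such recurrence can be detected from submission logs alone, the sketch below snaps the median gap between successive submissions of a job to the nearest natural period (hourly, daily, weekly). The heuristic and its tolerance are our own assumptions, not the analysis pipeline behind Fig. 1b.

```python
# Heuristic sketch: infer a job's period from its submission timestamps by
# snapping the median inter-submission gap to a natural period (hour/day/week).
from statistics import median

NATURAL_PERIODS = {"hourly": 3600, "daily": 86400, "weekly": 604800}

def infer_period(submit_times, tolerance=0.1):
    """submit_times: sorted unix timestamps of one recurring job's submissions."""
    if len(submit_times) < 3:
        return None
    gaps = [b - a for a, b in zip(submit_times, submit_times[1:])]
    med = median(gaps)
    for name, seconds in NATURAL_PERIODS.items():
        if abs(med - seconds) <= tolerance * seconds:
            return name
    return f"{med:.0f}s"   # recurrent, but not on a natural period

# Example: a job submitted a few minutes past every hour.
print(infer_period([0, 3620, 7190, 10810, 14400]))  # -> "hourly"
```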
The above evidence confirms that the most important portion of our workloads is strongly recurrent. This allows for planning the cluster agenda, without being overly conservative in the resource provisioning of jobs.
2.2 Predictability challenges
Manual tuning of job allocation is hard. Fig. 2a shows the distribution of the ratio between the total amount of resources provisioned by the job's owner and the job's actual resource usage (comparing both peak parallelism and area). The wide range of over/under-allocation indicates that it is very hard for users to optimally provision resources for their jobs (or that they lack the incentives to do so). We further validate this hunch through a user study in [15]. The graph shows that 75% of jobs are over-provisioned (even at their peak), with 20% of them over-provisioned by more than 10×. This is likely due to users statically setting their provisioning for a periodic job. We confirm this by observing that, over a one-month period, more than 80% of periodic jobs had no changes in their resource provisioning. Large under-provisioned jobs partially offset the impact of over-provisioning on cluster utilization.
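To pin down the two ratios plotted in Fig. 2a, the short sketch below computes them for a single run from its provisioned and observed skylines: a peak ratio (provisioned parallelism vs. peak used parallelism) and an area ratio (provisioned vs. used container-time). The exact definitions used for the figure may differ; this is one reasonable reading.

```python
# Sketch of the two over/under-provisioning ratios discussed around Fig. 2a:
# peak ratio compares provisioned parallelism to the observed peak, and
# area ratio compares provisioned to used container-time.
def provisioning_ratios(provisioned, used):
    """provisioned, used: per-time-step container counts for one job run."""
    peak_ratio = max(provisioned) / max(used)
    area_ratio = sum(provisioned) / sum(used)
    return peak_ratio, area_ratio

# Example: a job provisioned flat at 100 containers but peaking at 40.
peak, area = provisioning_ratios([100] * 6, [10, 40, 40, 20, 5, 1])
# peak == 2.5 (2.5x over-provisioned at peak); area is roughly 5.2x overall.
```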
Sources of performance variance. It is hard to precisely establish the sources of variance from the production logs we have. We observe a small but positive correlation (0.16) between the amount of sharing (above provisioned resources) and job runtime variance. This indicates that increased sharing affects runtime variance.
We investigate further the roles of sharing-induced and inherent performance variance by means of a simple micro-benchmark. Fig. 2b shows the normalized runtime of 5 TPC-H queries.¹ We consider two configurations: one with constrained parallelism (500 containers), and one with unconstrained parallelism (>2000 containers); each container is a bundle of <1 core, 8 GB RAM>. Each query was run 100 times in each configuration on an empty cluster at 10 TB scale. The graph shows that even when removing common sources of inherent variability (data availability, failures, network congestion), runtimes remain unpredictable (e.g., due to stragglers, §7).
By analyzing these experiments and observing production environments, we conclude that: 1) history-based approaches can model well the "normal" behavior of a query (small box), 2) handling outliers (as in the long whiskers) without wasting resources requires a dynamic component that performs reprovisioning online, and 3) while each source of variance may be addressed with an ad-hoc solution, providing a general-purpose line of defense is paramount—see §7 for our solution.
¹ The box shows the [25th, 75th] percentiles, and the whiskers show [min, max].

Figure 2: A) Empirical CDF of provisioned vs. used resources; B) box-whisker plot of the normalized runtime of TPC-H queries running with 500 containers (left) and >2000 containers (right); C) cluster capacity (%) by machine type (SKU1 vs. SKU2) over time.
2.3 Changing conditions
Cluster conditions keep evolving—jobs may run on different server types. We provide in Fig. 2c a measure of hardware churn in our clusters. We refer to different machine configurations as Stock Keeping Units (SKUs). Over a period of a year, the ratio between the number of machines of type SKU1 and type SKU2 changed from 80/20 to 55/45; the total number of nodes also kept changing over that period. This is notable because even seemingly minor hardware differences can impact job runtime significantly—e.g., a 40% difference in runtime on SKU1 vs. SKU2 for a Spark production job.
User scripts keep evolving. We perform an analysis of the versioning of user scripts/UDFs. We remove all simple parameterizations that naturally change with every instantiation, and then construct a fuzzy match of the code structure. Within one month of trace data, we detect that 15-20% of periodic jobs had at least one large code delta (more than 10% code difference), and over 50% had at least one small delta (any change that breaks the MD5 of the parameter-stripped code). Hence, even an optimal static tuning is likely to drift out of optimality over time.
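A minimal sketch of the "small delta" check described above: strip literal parameters from a script, hash the remainder with MD5, and flag any change in the hash across versions. The parameter-stripping regular expressions are illustrative assumptions, and the fuzzy structural match used for large deltas is not reproduced here.

```python
# Sketch of the "small delta" detector: hash the script with literal parameters
# stripped, and report a change whenever the MD5 of consecutive versions differs.
import hashlib
import re

def parameter_stripped_md5(script_text):
    # Replace string and numeric literals (e.g., dates, paths, sizes) with placeholders,
    # so that routine per-run parameter changes do not affect the hash.
    stripped = re.sub(r'"[^"]*"|\'[^\']*\'', "<str>", script_text)
    stripped = re.sub(r"\b\d+(\.\d+)?\b", "<num>", stripped)
    stripped = re.sub(r"\s+", " ", stripped).strip()
    return hashlib.md5(stripped.encode("utf-8")).hexdigest()

def has_small_delta(prev_version, new_version):
    return parameter_stripped_md5(prev_version) != parameter_stripped_md5(new_version)

# Example: changing only the date parameter is not a delta; changing the logic is.
v1 = 'out = SELECT * FROM logs WHERE day == "2016-01-01";'
v2 = 'out = SELECT * FROM logs WHERE day == "2016-01-02";'
v3 = 'out = SELECT user, COUNT(*) FROM logs WHERE day == "2016-01-02" GROUP BY user;'
assert not has_small_delta(v1, v2)
assert has_small_delta(v2, v3)
```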
Motivated by all of the above evidence, we focus on building a resource management substrate that provides predictable execution as a core primitive.
3 Overview of Morpheus
Morpheus is a system that continuously observes and learns as periodic jobs execute over time. The findings are used to economically reserve resources for the job ahead of job execution, and to dynamically adapt to changing conditions at runtime. To give an informal sense of the key functionalities in Morpheus, we start our overview by following the typical life-cycle of a periodic job (JobX) as it is governed by Morpheus (§3.1). Next, we describe the core subsystems (§3.2). Fig. 3 provides a logical view of the architecture, and "zooms in" on a particular job.
3.1 “Life” of a periodic job
With reference to Fig. 3, a typical periodic job goes through the following stages (a code sketch of this life-cycle follows the list):

1. The user periodically submits JobX with manually provisioned resources. In the meantime, the underlying infrastructure captures:
   (a) Data dependencies and ingress/egress operations in the Provenance Graph (PG).
   (b) Resource utilization of each run (marked as the R1-R4 skylines in Fig. 3) in a Telemetry-History (TH) database.
2. The SLO Inference performs an offline analysis of the successful runs of JobX:
   (a) From the PG, it derives a deadline d—the SLO.
   (b) From the TH, it derives a model of the job resource demand over time, R. We refer to R as the job resource model.
3. The user signs off on (or optionally overrides) the automatically generated SLO and job resource model.
4. Morpheus enforces SLOs via recurring reservations:
   (a) It adds a recurring reservation for JobX into the cluster agenda—this sets aside resources over time based on the job resource model R.
   (b) New instances of JobX run within the recurring reservation (dedicated resources).
5. The Dynamic Reprovisioning component monitors the job's progress online, and increases/decreases the reservation to mitigate inherent execution variability.
6. Morpheus constantly feeds the PG and telemetry information of the new runs back into Step 2, for continuous learning and refinement of the SLO and the job resource model.
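The sketch below renders these stages as a small orchestration loop over illustrative data types (an inferred SLO, a job resource model, and a recurring reservation). All class, method, and function names are hypothetical scaffolding meant only to make the control flow concrete; they are not Morpheus' actual interfaces.

```python
# Hypothetical scaffolding for the life-cycle above (names are illustrative,
# not Morpheus' real interfaces): infer an SLO and resource model from history,
# reserve resources for each instance, and re-provision while it runs.
from dataclasses import dataclass
from typing import List

@dataclass
class InferredSLO:
    period_s: int          # e.g., 86400 for a daily job
    deadline_s: int        # completion deadline within each period

@dataclass
class RecurringReservation:
    job_id: str
    slo: InferredSLO
    skyline: List[int]     # provisioned containers per time step (job resource model R)

def slo_inference(provenance_graph, telemetry_history, job_id):
    """Step 2: derive the SLO from the PG and the resource model R from the TH."""
    slo = provenance_graph.derive_deadline(job_id)          # 2(a), assumed helper
    skyline = telemetry_history.fit_resource_model(job_id)  # 2(b), assumed helper
    return RecurringReservation(job_id, slo, skyline)

def run_periodic_job(agenda, reservation, monitor):
    """Steps 4-5: place each instance in the agenda, then re-provision online."""
    instance = agenda.admit(reservation)          # 4(a)-(b): dedicated resources
    while not instance.finished():
        if monitor.behind_schedule(instance):     # 5: mitigate inherent variability
            instance.grow()
        elif monitor.ahead_of_schedule(instance):
            instance.shrink()
    return instance.telemetry()                   # 6: fed back into slo_inference
```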

References

- Apache Hadoop YARN: Yet Another Resource Negotiator. Summarizes the design, development, and deployment state of the next generation of Hadoop's compute platform, YARN, which decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. Shows that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.
- Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. Proposes a simple algorithm, delay scheduling, which achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2× while preserving fairness.
- Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. Proposes Dominant Resource Fairness (DRF), a generalization of max-min fairness to multiple resource types, and shows that it leads to better throughput and fairness than the slot-based fair sharing schemes in current cluster schedulers.
- Large-Scale Cluster Management at Google with Borg. Presents a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience.