Co-scheduling Amdahl applications on cache-partitioned systems
Guillaume Aupy (a), Anne Benoit (b), Sicheng Dai (c), Loïc Pottier (b), Padma Raghavan (d), Yves Robert (b,e), Manu Shantharam (f)

(a) Inria, Université de Bordeaux, France
(b) Laboratoire LIP, École Normale Supérieure de Lyon, France
(c) East China Normal University, China
(d) Vanderbilt University, Nashville TN, USA
(e) University of Tennessee Knoxville, USA
(f) San Diego Supercomputer Center, San Diego CA, USA
Abstract
Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively
reserved for some applications. This technique dramatically limits interactions between applications that
are concurrently executing on a multi-core machine. Consider n applications that execute concurrently, with
the objective to minimize the makespan, defined as the maximum completion time of the n applications.
Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given
to each application? In this paper, we provide answers to (i) and (ii) for Amdahl applications. Even though
the problem is shown to be NP-complete, we give key elements to determine the subset of applications
that should share the LLC (while remaining ones only use their smaller private cache). Building upon
these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the
usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
Keywords: Co-scheduling; cache partitioning; complexity results.
1. Introduction
At scale, the I/O movements of High Performance Computing (HPC) applications are expected to be one of the most critical problems [Adv14]. Observations on the Intrepid machine at Argonne National Laboratory (ANL) show that I/O transfers can be slowed down by up to 70% due to congestion [GAB+15]. When ANL upgraded its supercomputer from Intrepid (peak performance: 0.56 PFlops; peak I/O throughput: 88 GB/s) to Mira (peak performance: 10 PFlops; peak I/O throughput: 240 GB/s), the net result for an application whose I/O throughput scales linearly (or worse) with performance was a downgrade from 88/0.56 ≈ 160 GB/PFlop to 240/10 = 24 GB/PFlop!
To cope with such an imbalance (which is not expected to decrease on future platforms), a possible approach is to develop in situ co-scheduling analysis and data preprocessing on dedicated nodes [Adv14]. This scheme applies to data-intensive periodic workflows where data is generated by the main simulation, and parallel processes are run to process this data, with the constraint that output results should be sent to disk storage before newly generated data arrives for processing. These solutions are starting to be implemented for HPC applications. Sewell et al. [SHF+15] explain that in the case of the HACC application (a cosmological code), petabytes of data are created to be analyzed later. The analysis is done by multiple independent processes. The idea of their work is to minimize the amount of data copied to the I/O filesystem, by performing the analysis at the same time as HACC is running (what they call in situ). The main constraint is that these processes are data-intensive and are handled by a dedicated machine. Also, the execution of these processes should be done efficiently enough so that they finish before the next batch of data arrives, hence resulting in a pipelined approach. All these frameworks motivate the design of efficient co-scheduling strategies.
Email addresses: guillaume.aupy@inria.fr (Guillaume Aupy), Anne.Benoit@ens-lyon.fr (Anne Benoit), 51151500012@ecnu.cn (Sicheng Dai), Loic.Pottier@ens-lyon.fr (Loïc Pottier), padma.raghavan@vanderbilt.edu (Padma Raghavan), Yves.Robert@inria.fr (Yves Robert), shantharam.manu@gmail.com (Manu Shantharam)

One main issue of co-scheduling is to evaluate co-run degradations due to cache sharing [ZBF10]. Many studies have shown that interferences on the shared last-level cache (LLC) can be detrimental to co-scheduled applications [LK14]. Previous solutions consisted in preventing the co-scheduling of possibly interfering workloads, or terminating low-importance applications [ZLMT14]. Lo et al. [LCG+16] recently showed experimentally that important gains could be reached by co-scheduling applications with strict cache partitioning enabled. Cache partitioning, the technique at the core of this work, consists in reserving exclusivity of subsections of the LLC of a chip multi-processor (CMP) to some of the applications running on this CMP. This functionality was recently introduced by Intel under the name Cache Allocation Technology [Int14]. With the advent of large shared-memory multi-core machines (e.g., Sunway TaihuLight, the current #1 supercomputer, uses 256-core processor chips with a shared memory of 32GB [Don16]), the design of algorithms that co-schedule applications efficiently and decide how to partition the shared memory (seen as the cache here) is becoming critical.
In this work, we study the following problem. We are given a set of Amdahl applications, i.e., parallel applications obeying Amdahl's speedup law [Amd67] (see Equation (1) for details). Amdahl's law has had a profound impact on the evolution of HPC [Hea15], and many scientific applications, including most NAS Parallel Benchmarks, obey this law [CE00]. We are also given a multi-core processor with a shared last-level cache (LLC). How can we best partition the LLC to minimize the total execution time (or makespan), i.e., the moment when the last application finishes its computation? For each application, we assume that we know the number of compute operations to perform, and the miss rate on a fixed-size cache. For the multi-core processor, we know the cost of a cache miss, the cost of a cache hit, the size of the LLC, and the total number of processors. For the theoretical study, we assume that these processors can be shared by two applications through multi-threading [KSS12], hence we can assign a rational number of processors to each application; this allows us to study the intrinsic complexity of co-scheduling with cache partitioning. Equipped with all these application and platform parameters, recent work [HSPE08, RKB+09, KSS12] shows how to model the impact of cache misses and to accurately predict the execution time of an application. In this context, we make the following main contributions:
With rational numbers of processors, we show that the co-scheduling problem is NP-complete, even
when applications are perfectly parallel, i.e., their speed-up scales up linearly with the number of
processors.
With rational numbers of processors, we show several results that characterize optimal solutions, and
in particular that the co-scheduling cache-partitioning problem reduces to deciding which subset of
applications will share the LLC; when this subset is known, we show how to determine the optimal
cache fractions and rational number of processors for perfectly-parallel applications. Furthermore, we
show that all applications should finish at the same time, even if they are not perfectly parallel.
These theoretical results guide the design of heuristics for Amdahl applications. We show through ex-
tensive simulations (using both rational and integer numbers of processors) that our heuristics greatly
improve the performance of cache-partitioning algorithms, even for parallel applications obeying Am-
dahl’s law with a large sequential fraction, hence with a limited speedup profile.
The rest of the paper is organized as follows. Section 2 provides an overview of related work. Section 3 is
devoted to formally defining the framework and all model parameters. Section 4 gives our main theoretical
contributions. The heuristics are defined in Section 5, and evaluated through simulations in Section 6.
Finally, Section 7 outlines our main findings and discusses directions for future work.
2. Related work
Since the advent of systems with tens of cores, co-scheduling has received considerable attention. Due to
lack of space, we refer to [MSM+11, DJF+15, LCG+16] for a survey of many approaches to co-scheduling.
The main idea is to execute several applications concurrently rather than in sequence, with the objective to
increase platform throughput. Indeed, some individual applications may well not need all available cores, or
some others could use all resources, but at the price of a dramatic performance loss. In particular, the latter
case is encountered whenever application speedup becomes too low beyond a given processor count.

The main difficulty of co-scheduling is to decide which applications to execute concurrently, and how many
cores to assign to each of them. Indeed, when executing simultaneously, any two applications will compete
for shared resources, which will create interferences and decrease their throughput. Modeling application
interference is a challenging task. Dynamic schedulers are used when application behavior is unknown [QP06,
TJS09]. Static schedulers aim at optimizing the sharing of the resources by relying on application knowledge
such as estimated workload, speed-up profile, cache behavior, etc. One widely-used approach is to build an
interference graph whose vertices are applications and whose edges represent degradation factors [JSCT08,
ZHG+15, HZJ16]. This approach is interesting but hard to implement. Indeed, the interaction of two
applications depends on many factors, such as their size, their core count, the memory bandwidth, etc.
Obtaining the speedup profile of a single application is already difficult and requires intensive benchmarking
campaigns. Obtaining the degradation profile of two applications is even more difficult and can be achieved
only for regular applications. To further darken the picture, the interference graph subsumes only pairwise
interactions, while a global picture of the processor and cache requirements for all applications is needed by
the scheduler.
Shared resources include cache, memory, I/O channels and network links, but among potential degrada-
tion factors, cache accesses are prominent. When several applications share the cache, they are granted a
fraction of cache lines as opposed to the whole cache, and their cache miss ratio increases accordingly. Mul-
tiple cache partitioning strategies have been proposed [BCSM08, GSYY09, BZF10, DFB+12]. In this paper,
we focus on a static allocation of LLC cache fractions, and processor numbers, to concurrent applications
as a function of several parameters (cache-miss ratio, access frequency, operation count). To the best of our
knowledge, this work is the first analytical model and complexity study for this challenging problem.
3. Model
This section details platform and application parameters, and formally states the optimization problem.
Architecture. We consider a parallel platform of p homogeneous computing elements, or processors, that share two storage locations:
A small storage S_s with low latency, governed by a LRU replacement policy, also called cache;
A large storage S_l with high latency, also called memory.
More specifically, C_s (resp. C_l) denotes the size of S_s (resp. S_l), and l_s (resp. l_l) the latency of S_s (resp. S_l). In this work, we assume that C_l = +∞. We have the relation l_s ≪ l_l.
In this work, we consider the cache partitioning technique [Int14], where one can allocate a portion of the cache to applications so that they can execute without interference from other applications.
Applications. There are n independent parallel applications to be scheduled on the parallel platform, whose speedup profiles obey Amdahl's law [Amd67]. For an application T_i, we define several parameters:
w_i, the number of computing operations needed for T_i;
s_i, the sequential fraction of T_i;
f_i, the frequency of data accesses of T_i: f_i is the number of data accesses per computing operation;
a_i, the memory footprint of T_i.
We use these parameters to model the execution of each application as follows.
We use these parameters to model the execution of each application as follows.
Parallel execution time. Let F l
i
(p
i
) be the number of operations performed by each processor for applica-
tion T
i
, when executed on p
i
processors. According to Amdahl’s speedup profile [Amd67], we have
F l
i
(p
i
) = s
i
w
i
+ (1 s
i
)
w
i
p
i
(1)
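To make Equation (1) concrete, here is a minimal Python sketch of the Amdahl operation count; the function name and the numeric values in the example are illustrative choices, not taken from the paper.

```python
def amdahl_ops_per_processor(w_i, s_i, p_i):
    """Operations executed by each processor under Amdahl's law (Equation (1)).

    w_i: total number of computing operations of application T_i
    s_i: sequential fraction of T_i (0 <= s_i <= 1)
    p_i: (rational) number of processors allocated to T_i, p_i > 0
    """
    return s_i * w_i + (1.0 - s_i) * w_i / p_i

# Hypothetical example: 10% sequential fraction, 8 processors.
print(amdahl_ops_per_processor(w_i=1e9, s_i=0.1, p_i=8))  # 2.125e+08
```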
The power law of cache misses. In chip multi-processors, many authors have observed that the Power Law accurately models how the cache size affects the miss rate [HSPE08, RKB+09, KSS12]. Mathematically, the power law states that if m_0 is the miss rate of a workload for a baseline cache size C_0, the miss rate m for a new cache size C can be expressed as m = m_0 (C_0 / C)^α, where α is the sensitivity factor from the Power Law of Cache Misses [HSPE08, RKB+09, KSS12] and typically ranges between 0.3 and 0.7, with an average at 0.5.
Note that, by definition, a rate cannot be higher than 1, hence we extend this definition as:

m = min(1, m_0 (C_0 / C)^α).    (2)

This formula can be read as follows: if the cache size allocated is too small, then the execution goes as if no cache was allocated, and all accesses will be misses.
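A short Python sketch of the capped power law of Equation (2) follows; the baseline miss rate, cache sizes, and sensitivity factor used in the example are hypothetical.

```python
def miss_rate(m0, C0, C, alpha=0.5):
    """Capped power law of cache misses (Equation (2)).

    m0:    miss rate measured with the baseline cache size C0
    C0:    baseline cache size (same unit as C)
    C:     cache size actually allocated
    alpha: sensitivity factor, typically between 0.3 and 0.7
    """
    if C <= 0:
        return 1.0  # no cache allocated: every access is a miss
    return min(1.0, m0 * (C0 / C) ** alpha)

# Hypothetical example: 5% miss rate with a 1 MB baseline, scaled to a 4 MB slice.
print(miss_rate(m0=0.05, C0=1.0, C=4.0))  # 0.025
```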
Computations and data movement. We use the cost model introduced by Krishna et al. [KSS12] to evaluate the execution cost of an application as a function of the cache fraction that it has been allocated. Specifically, for each application, we define m_0, the miss rate of application T_i with a cache of size C_0 (we can also use the miss rate of applications with a cache of another fixed size). We express the execution time of T_i as a function of p_i, the number of processors allocated to T_i, and x_i, the fraction of S_s allocated to T_i (recall both are rational numbers). Let Fl_i(p_i) be the number of operations performed by each processor for application T_i, given that the application is executed on p_i processors. We have Fl_i(p_i) = s_i w_i + (1 − s_i) w_i / p_i according to Amdahl's speedup profile. Finally,

Exe_i(p_i, x_i) =
  Fl_i(p_i) (1 + f_i (l_s + l_l))                                      if x_i = 0;
  Fl_i(p_i) (1 + f_i (l_s + l_l · min(1, m_0 (C_0 / (x_i C_s))^α)))    if x_i C_s ≤ a_i;
  Fl_i(p_i) (1 + f_i (l_s + l_l · min(1, m_0 (C_0 / a_i)^α)))          otherwise.
    (3)
Indeed, for each operation, we pay the cost of the computing operation, plus the cost of data accesses, and by definition we have f_i accesses per operation. At each access, we pay a latency l_s, and an additional latency l_l in case of cache miss (see Equation (2)). The last case states that we cannot use a portion of cache greater than the memory footprint a_i of application T_i. This model is somewhat pessimistic: cache accesses to the same variable by two different processors are counted twice. We show in Section 6 that despite this conservative assumption (no sharing), co-scheduling can outperform classical approaches that sequentially deploy each application on the whole set of available resources.
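Putting Equations (1)-(3) together, a sketch of the execution-time model could look as follows; it reuses the amdahl_ops_per_processor and miss_rate helpers sketched above, and the parameter names are ours, not the paper's.

```python
def exe_time(w_i, s_i, f_i, a_i, p_i, x_i, m0, C0, Cs, l_s, l_l, alpha=0.5):
    """Execution time Exe_i(p_i, x_i) following Equation (3) (illustrative sketch).

    p_i: (rational) number of processors given to T_i
    x_i: fraction of the shared cache S_s given to T_i (0 <= x_i <= 1)
    Cs:  size of the shared cache S_s; l_s, l_l: cache and memory latencies
    """
    ops = amdahl_ops_per_processor(w_i, s_i, p_i)         # Equation (1)
    if x_i == 0:
        miss = 1.0                                        # first case: every access misses
    else:
        effective_cache = min(x_i * Cs, a_i)              # cannot exploit more cache than the footprint a_i
        miss = miss_rate(m0, C0, effective_cache, alpha)  # Equation (2)
    return ops * (1.0 + f_i * (l_s + l_l * miss))
```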
Equation (3) calls for a few observations. For notational convenience, let d_i = m_0 (C_0 / C_s)^α:
It is useless to give a fraction of cache larger than a_i / C_s to application T_i;
Because of the minimum min(1, d_i / (x_i)^α), either x_i > d_i^{1/α}, or x_i = 0: indeed, if we give application T_i a fraction of cache smaller than d_i^{1/α}, the minimum is equal to 1, and this fraction is wasted.
Hence, we have for all i:

x_i = 0   or   d_i^{1/α} < x_i ≤ a_i / C_s.    (4)

Of course, if d_i^{1/α} ≥ a_i / C_s for some application T_i, then x_i = 0.
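The observations above translate into a tiny helper that, under the same assumptions, returns the range of cache fractions worth considering for T_i (Equation (4)); names and defaults are ours.

```python
def useful_cache_fraction_range(m0, C0, Cs, a_i, alpha=0.5):
    """Range (lo, hi) of useful cache fractions for T_i, following Equation (4).

    T_i should receive either x_i = 0, or a fraction strictly above lo and
    at most hi. If lo >= hi, the only sensible choice is x_i = 0.
    """
    d_i = m0 * (C0 / Cs) ** alpha
    lo = d_i ** (1.0 / alpha)  # below this, the min() in Equation (3) saturates at 1
    hi = a_i / Cs              # above this, the allocated cache exceeds the footprint
    return lo, hi
```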
We denote by Exe_i^seq(x_i) = Exe_i(1, x_i) the sequential execution time of application T_i with a fraction of cache x_i.
Scheduling problem. Given n applications T_1, ..., T_n, we aim at partitioning the shared cache and assigning processors so that the concurrent execution of these applications takes minimal time. In other words, we aim at minimizing the execution time of the longest application, when all applications start their execution at the same time. Formally:

Definition 1 (CoSchedCache). Given n applications T_1, ..., T_n and a platform with p identical processors sharing a cache of size C_s, find a schedule {(p_1, x_1), ..., (p_n, x_n)} with Σ_{i=1}^{n} p_i ≤ p and Σ_{i=1}^{n} x_i ≤ 1, that minimizes max_{1≤i≤n} Exe_i(p_i, x_i).
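To make the objective of Definition 1 concrete, here is a naive grid-search sketch that evaluates candidate schedules with the exe_time helper above; it is not one of the heuristics of Section 5, it uses integer processor counts for simplicity (the paper also studies rational ones), and it scales poorly with n.

```python
from itertools import product

def naive_coschedule(apps, p, Cs, C0, l_s, l_l, alpha=0.5, steps=10):
    """Toy exhaustive search for CoSchedCache on a discretized grid.

    apps: list of dicts with keys w, s, f, a, m0 (one per application T_i)
    p:    total number of processors; Cs: size of the shared cache S_s
    Returns the best (makespan, [(p_i, x_i), ...]) found on the grid.
    """
    n = len(apps)
    fractions = [k / steps for k in range(steps + 1)]
    best = (float("inf"), None)
    for xs in product(fractions, repeat=n):            # candidate cache partitions
        if sum(xs) > 1.0:
            continue
        for ps in product(range(1, p + 1), repeat=n):  # candidate processor counts
            if sum(ps) > p:
                continue
            makespan = max(
                exe_time(a["w"], a["s"], a["f"], a["a"], pi, xi,
                         a["m0"], C0, Cs, l_s, l_l, alpha)
                for a, pi, xi in zip(apps, ps, xs))
            if makespan < best[0]:
                best = (makespan, list(zip(ps, xs)))
    return best
```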
We pay particular attention in the following to perfectly parallel applications, i.e., applications T_i with s_i = 0. In this case, Exe_i(p_i, x_i) = Exe_i(1, x_i) / p_i = Exe_i^seq(x_i) / p_i. The co-scheduling problem for such applications is denoted CoSchedCachePP.

4. Complexity Results
In this section, we focus on the CoSchedCache problem with rational numbers of processors in order
to study the intrinsic complexity of co-scheduling with cache partitioning. We first prove that in an optimal
execution, all applications must complete at the same time when using rational numbers of processors
(Section 4.1). We recall that CoSchedCache is NP-complete, even for perfectly parallel applications
(Section 4.2), and we show several dominance results on the optimal solution (Section 4.3). While some of
these dominance results only hold for perfectly parallel applications, they will guide the design of heuristics
for general applications in Section 5.
4.1. All applications complete at the same time
Lemma 1. To minimize the makespan when using rational numbers of processors, all applications must
finish at the same time.
Proof. Consider n applications T_1, ..., T_n that obey Amdahl's law, and a solution S = {(p_i, x_i)}_{1≤i≤n} to CoSchedCache. Let D_S = max_i Exe_i(p_i, x_i) be the makespan of this solution. For simplicity, we let

A_i = 1 + f_i (l_s + l_l · min(1, m_i^{1MB} (10^6 / (x_i C_s))^α)),
b_i = A_i w_i s_i,
c_i = A_i w_i (1 − s_i),

where m_i^{1MB} denotes the miss rate of T_i with a 1MB baseline cache (Equation (3) instantiated with C_0 = 10^6). Hence, Exe_i(p_i, x_i) = b_i + c_i / p_i. The set of applications whose execution time is exactly D_S is denoted by I_S.
We show the result by contradiction. We consider an optimal solution S whose subset I_S has minimal size (i.e., for any other optimal solution S_o, |I_S| ≤ |I_{S_o}|). Then we show that if |I_S| ≠ n, we can construct a solution S' with either (i) a smaller makespan if |I_S| = 1 (contradicting the optimality hypothesis), or (ii) one less application whose execution time is exactly D_S (contradicting the minimality hypothesis).
Assume |I_S| ≠ n, let T_{i_0} ∈ I_S and T_{i_1} ∉ I_S. We have Exe_{i_1}(p_{i_1}, x_{i_1}) < Exe_{i_0}(p_{i_0}, x_{i_0}) = D_S, that is b_{i_1} + c_{i_1} / p_{i_1} < b_{i_0} + c_{i_0} / p_{i_0}, and hence

(b_{i_1} − b_{i_0}) p_{i_0} p_{i_1} − c_{i_0} p_{i_1} + c_{i_1} p_{i_0} < 0.    (5)
We now prove that we can always find 0 < ε < p_{i_1} s.t. Exe_{i_0}(p_{i_0}, x_{i_0}) > Exe_{i_0}(p_{i_0} + ε, x_{i_0}) > Exe_{i_1}(p_{i_1} − ε, x_{i_1}), i.e.,

D_S = b_{i_0} + c_{i_0} / p_{i_0} > b_{i_0} + c_{i_0} / (p_{i_0} + ε) > b_{i_1} + c_{i_1} / (p_{i_1} − ε).
The left part of inequality, b_{i_0} + c_{i_0} / p_{i_0} > b_{i_0} + c_{i_0} / (p_{i_0} + ε), is always true when ε > 0. For the right part of the inequality above, we have:

−(b_{i_1} − b_{i_0}) ε² + [(p_{i_1} − p_{i_0})(b_{i_1} − b_{i_0}) + c_{i_0} + c_{i_1}] ε + (b_{i_1} − b_{i_0}) p_{i_0} p_{i_1} − c_{i_0} p_{i_1} + c_{i_1} p_{i_0} < 0.    (6)
From Equation (5), we know that the constant term (b_{i_1} − b_{i_0}) p_{i_0} p_{i_1} − c_{i_0} p_{i_1} + c_{i_1} p_{i_0} is negative; since the left-hand side of Equation (6) is continuous in ε and equals this negative constant at ε = 0, we can always find a 0 < ε < p_{i_1} that makes Equation (6) satisfied.
Then clearly, S' = {(p'_i, x_i)}_i, where p'_i is (i) p_i if i ∉ {i_0, i_1}, (ii) p_{i_0} + ε if i = i_0, (iii) p_{i_1} − ε if i = i_1, is a valid solution: we have the property Σ_i p'_i = Σ_i p_i ≤ p, and Σ_i x'_i = Σ_i x_i ≤ 1.
Hence,
If |I_S| = 1, then for all i, Exe_i(p'_i, x_i) < D_S, hence showing that S is not optimal;
Else, I_{S'} = I_S \ {i_0}, and D_{S'} = D_S, hence showing that S is not minimal.
This shows that necessarily, |I_S| = n.
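A small numerical illustration of the exchange argument (with made-up values for b_i, c_i and p_i, not taken from the paper): shifting ε processors from the non-critical application i_1 to the critical application i_0 strictly decreases the larger of the two completion times.

```python
# Hypothetical values: application i0 is critical (largest completion time), i1 is not.
b  = {"i0": 2.0, "i1": 1.0}    # sequential parts b_i
c  = {"i0": 40.0, "i1": 10.0}  # parallel parts c_i
pr = {"i0": 4.0, "i1": 4.0}    # rational processor shares p_i

def exe(i, p):
    return b[i] + c[i] / p     # Exe_i(p_i, x_i) = b_i + c_i / p_i

eps = 0.5                      # 0 < eps < p_{i1}
before = max(exe("i0", pr["i0"]), exe("i1", pr["i1"]))
after  = max(exe("i0", pr["i0"] + eps), exe("i1", pr["i1"] - eps))
print(before, after)           # 12.0 -> about 10.89: the makespan strictly decreases
```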

Frequently Asked Questions (12)
Q1. What are the contributions mentioned in the paper "Co-scheduling amdahl applications on cache-partitioned systems" ?

In this paper, the authors provide answers to (i) and (ii) for Amdahl applications.

Future work will be devoted to gaining access to, and conducting real experiments on, a cache-partitioned system with a high core count: this would allow the authors to further validate the accuracy of the model and to confirm the impact of their promising results. On the theoretical side, the authors plan to focus on the problem with integer numbers of processors, and they hope to derive interesting results that could help design even more efficient heuristics.

The authors simplify the design of the heuristics by temporarily allocating processors as if the applications were perfectly parallel, and then concentrating on strategies that partition the cache efficiently among some applications (and give no cache fraction to the remaining ones).

For the simulations, the authors use a cache configuration representing an Intel Xeon CPU E5-2690, with a 40 MB last-level cache per 8-core processor.

Extensive simulation results demonstrate that the use of dominant partitions always leads to better results than more naive approaches, as soon as there is a small sequential fraction of work in application speedup profiles. 

The main difficulty of co-scheduling is to decide which applications to execute concurrently, and how many cores to assign to each of them. 

The authors show that the ratio processors/applications has a significant impact on performance: when many processors are available for a few applications, it is less crucial to use efficient cache-partitioning and all applications can share the cache, hence Fair obtains good results, close to DomS-MinRatio. 


According to the literature [KKSM13, MHSN15, PB14], the last-level cache (LLC) latency is on average four to ten times lower than the DDR latency, and the authors enforce a ratio of 5.88 in the simulations.

Lemma 3. Given a set of applications T_1, ..., T_n and a partition (I_C, Ī_C), the optimal solution to CSCPP-Ext(I_C, Ī_C) is x_i = (w_i f_i d_i)^{1/(α+1)} / Σ_{j ∈ I_C} (w_j f_j d_j)^{1/(α+1)} if i ∈ I_C, and x_i = 0 otherwise.
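As an illustration of Lemma 3, a one-function Python sketch computing the optimal cache fractions for the applications in I_C; the inputs in the example are hypothetical.

```python
def lemma3_cache_fractions(apps_in_IC, alpha=0.5):
    """Optimal cache fractions x_i of Lemma 3 for the applications sharing the LLC.

    apps_in_IC: list of (w_i, f_i, d_i) tuples for the applications in I_C;
    applications outside I_C receive x_i = 0.
    """
    weights = [(w * f * d) ** (1.0 / (alpha + 1.0)) for (w, f, d) in apps_in_IC]
    total = sum(weights)
    return [wt / total for wt in weights]

# Hypothetical example with three applications sharing the cache.
print(lemma3_cache_fractions([(1e9, 0.2, 0.04), (5e8, 0.3, 0.02), (2e9, 0.1, 0.05)]))
```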

Results are normalized with the makespan of AllProcCache, which is the execution without any co-scheduling: in the AllProcCache heuristic, applications are executed sequentially, each using all processors and all the cache. 

With more applications, the authors obtain the same ranking of heuristics, except that Fair is always the worst heuristic: since there are fewer processors on average per application, a good co-scheduling policy is necessary (see [ABD+17] for detailed results).