Co-scheduling Amdahl applications on cache-partitioned systems
Guillaume Aupy (a), Anne Benoit (b), Sicheng Dai (c), Loïc Pottier (b), Padma Raghavan (d), Yves Robert (b,e), Manu Shantharam (f)

(a) Inria, Université de Bordeaux, France
(b) Laboratoire LIP, École Normale Supérieure de Lyon, France
(c) East China Normal University, China
(d) Vanderbilt University, Nashville TN, USA
(e) University of Tennessee Knoxville, USA
(f) San Diego Supercomputer Center, San Diego CA, USA
Abstract
Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively
reserved for some applications. This technique dramatically limits interactions between applications that
are concurrently executing on a multi-core machine. Consider n applications that execute concurrently, with
the objective to minimize the makespan, defined as the maximum completion time of the n applications.
Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given
to each application? In this paper, we provide answers to (i) and (ii) for Amdahl applications. Even though
the problem is shown to be NP-complete, we give key elements to determine the subset of applications
that should share the LLC (while remaining ones only use their smaller private cache). Building upon
these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the
usefulness of co-scheduling when our efficient cache partitioning strategies are deployed.
Keywords: Co-scheduling; cache partitioning; complexity results.
1. Introduction
At scale, the I/O movements of High Performance Computing (HPC) applications are expected to be one of the most critical problems [Adv14]. Observations on the Intrepid machine at Argonne National Laboratory (ANL) show that I/O transfers can be slowed down by up to 70% due to congestion [GAB+15]. When ANL upgraded its supercomputer from Intrepid (peak performance: 0.56 PFlops; peak I/O throughput: 88 GB/s) to Mira (peak performance: 10 PFlops; peak I/O throughput: 240 GB/s), the net result for an application whose I/O throughput scales linearly (or worse) with performance was a downgrade from 88/0.56 ≈ 160 GB/PFlop to 240/10 = 24 GB/PFlop!
To cope with such an imbalance (which is not expected to decrease on future platforms), a possible approach is to develop in situ co-scheduling analysis and data preprocessing on dedicated nodes [Adv14]. This scheme applies to data-intensive periodic workflows where data is generated by the main simulation, and parallel processes are run to process this data, with the constraint that output results should be sent to disk storage before newly generated data arrives for processing. These solutions are starting to be implemented for HPC applications. Sewell et al. [SHF+15] explain that in the case of the HACC application (a cosmological code), petabytes of data are created to be analyzed later. The analysis is done by multiple independent processes. The idea of their work is to minimize the amount of data copied to the I/O filesystem, by performing the analysis at the same time as HACC is running (what they call in situ). The main constraint is that these processes are data-intensive and are handled by a dedicated machine. Also, the execution of these processes should be done efficiently enough so that they finish before the next batch of data arrives, hence resulting in a pipelined approach. All these frameworks motivate the design of efficient co-scheduling strategies.
Email addresses: guillaume.aupy@inria.fr (Guillaume Aupy), Anne.Benoit@ens-lyon.fr (Anne Benoit), 51151500012@ecnu.cn (Sicheng Dai), Loic.Pottier@ens-lyon.fr (Loïc Pottier), padma.raghavan@vanderbilt.edu (Padma Raghavan), Yves.Robert@inria.fr (Yves Robert), shantharam.manu@gmail.com (Manu Shantharam)

One main issue of co-scheduling is to evaluate co-run degradations due to cache sharing [ZBF10]. Many studies have shown that interferences on the shared last-level cache (LLC) can be detrimental to co-scheduled applications [LK14]. Previous solutions consisted in preventing the co-scheduling of possibly interfering workloads, or terminating low-importance applications [ZLMT14]. Lo et al. [LCG+16] recently showed experimentally that important gains could be reached by co-scheduling applications with strict cache partitioning enabled. Cache partitioning, the technique at the core of this work, consists in reserving exclusivity of subsections of the LLC of a chip multi-processor (CMP) to some of the applications running on this CMP. This functionality was recently introduced by Intel under the name Cache Allocation Technology [Int14]. With the advent of large shared-memory multi-core machines (e.g., Sunway TaihuLight, the current #1 supercomputer, uses 256-core processor chips with a shared memory of 32GB [Don16]), the design of algorithms that co-schedule applications efficiently and decide how to partition the shared memory (seen as the cache here) is becoming critical.
In this work, we study the following problem. We are given a set of Amdahl applications, i.e., parallel applications obeying Amdahl's speedup law [Amd67] (see Equation (1) for details). Amdahl's law has had a profound impact on the evolution of HPC [Hea15], and many scientific applications, including most NAS Parallel Benchmarks, obey this law [CE00]. We are also given a multi-core processor with a shared last-level cache (LLC). How can we best partition the LLC to minimize the total execution time (or makespan), i.e., the moment when the last application finishes its computation? For each application, we assume that we know the number of compute operations to perform, and the miss rate on a fixed-size cache. For the multi-core processor, we know the cost of a cache miss, the cost of a cache hit, the size of the LLC, and the total number of processors. For the theoretical study, we assume that these processors can be shared by two applications through multi-threading [KSS12], hence we can assign a rational number of processors to each application; this allows us to study the intrinsic complexity of co-scheduling with cache partitioning. Equipped with all these application and platform parameters, recent work [HSPE08, RKB+09, KSS12] shows how to model the impact of cache misses and to accurately predict the execution time of an application. In this context, we make the following main contributions:
With rational numbers of processors, we show that the co-scheduling problem is NP-complete, even
when applications are perfectly parallel, i.e., their speed-up scales up linearly with the number of
processors.
With rational numbers of processors, we show several results that characterize optimal solutions, and
in particular that the co-scheduling cache-partitioning problem reduces to deciding which subset of
applications will share the LLC; when this subset is known, we show how to determine the optimal
cache fractions and rational number of processors for perfectly-parallel applications. Furthermore, we
show that all applications should finish at the same time, even if they are not perfectly parallel.
These theoretical results guide the design of heuristics for Amdahl applications. We show through ex-
tensive simulations (using both rational and integer numbers of processors) that our heuristics greatly
improve the performance of cache-partitioning algorithms, even for parallel applications obeying Am-
dahl’s law with a large sequential fraction, hence with a limited speedup profile.
The rest of the paper is organized as follows. Section 2 provides an overview of related work. Section 3 is
devoted to formally defining the framework and all model parameters. Section 4 gives our main theoretical
contributions. The heuristics are defined in Section 5, and evaluated through simulations in Section 6.
Finally, Section 7 outlines our main findings and discusses directions for future work.
2. Related work
Since the advent of systems with tens of cores, co-scheduling has received considerable attention. Due to
lack of space, we refer to [MSM+11, DJF+15, LCG+16] for a survey of many approaches to co-scheduling.
The main idea is to execute several applications concurrently rather than in sequence, with the objective to
increase platform throughput. Indeed, some individual applications may well not need all available cores, or
some others could use all resources, but at the price of a dramatic performance loss. In particular, the latter
case is encountered whenever application speedup becomes too low beyond a given processor count.

The main difficulty of co-scheduling is to decide which applications to execute concurrently, and how many
cores to assign to each of them. Indeed, when executing simultaneously, any two applications will compete
for shared resources, which will create interferences and decrease their throughput. Modeling application
interference is a challenging task. Dynamic schedulers are used when application behavior is unknown [QP06,
TJS09]. Static schedulers aim at optimizing the sharing of the resources by relying on application knowledge
such as estimated workload, speed-up profile, cache behavior, etc. One widely-used approach is to build an
interference graph whose vertices are applications and whose edges represent degradation factors [JSCT08,
ZHG+15, HZJ16]. This approach is interesting but hard to implement. Indeed, the interaction of two
applications depends on many factors, such as their size, their core count, the memory bandwidth, etc.
Obtaining the speedup profile of a single application is already difficult and requires intensive benchmarking
campaigns. Obtaining the degradation profile of two applications is even more difficult and can be achieved
only for regular applications. To further darken the picture, the interference graph subsumes only pairwise
interactions, while a global picture of the processor and cache requirements for all applications is needed by
the scheduler.
Shared resources include cache, memory, I/O channels and network links, but among potential degrada-
tion factors, cache accesses are prominent. When several applications share the cache, they are granted a
fraction of cache lines as opposed to the whole cache, and their cache miss ratio increases accordingly. Mul-
tiple cache partitioning strategies have been proposed [BCSM08, GSYY09, BZF10, DFB+12]. In this paper,
we focus on a static allocation of LLC cache fractions, and processor numbers, to concurrent applications
as a function of several parameters (cache-miss ratio, access frequency, operation count). To the best of our
knowledge, this work is the first analytical model and complexity study for this challenging problem.
3. Model
This section details platform and application parameters, and formally states the optimization problem.
Architecture. We consider a parallel platform of p homogeneous computing elements, or processors, that share two storage locations:
A small storage S_s with low latency, governed by a LRU replacement policy, also called cache;
A large storage S_l with high latency, also called memory.
More specifically, C_s (resp. C_l) denotes the size of S_s (resp. S_l), and l_s (resp. l_l) the latency of S_s (resp. S_l). In this work, we assume that C_l = +∞. We have the relation l_s ≪ l_l.
In this work, we consider the cache partitioning technique [Int14], where one can allocate a portion of the cache to applications so that they can execute without interference from other applications.
Applications. There are n independent parallel applications to be scheduled on the parallel platform, whose speedup profiles obey Amdahl's law [Amd67]. For an application T_i, we define several parameters:
w_i, the number of computing operations needed for T_i;
s_i, the sequential fraction of T_i;
f_i, the frequency of data accesses of T_i: f_i is the number of data accesses per computing operation;
a_i, the memory footprint of T_i.
We use these parameters to model the execution of each application as follows.
We use these parameters to model the execution of each application as follows.
Parallel execution time. Let F l
i
(p
i
) be the number of operations performed by each processor for applica-
tion T
i
, when executed on p
i
processors. According to Amdahl’s speedup profile [Amd67], we have
F l
i
(p
i
) = s
i
w
i
+ (1 s
i
)
w
i
p
i
(1)
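To make Equation (1) concrete, here is a minimal Python sketch of the Amdahl operation count; the function name and the numeric values in the example are illustrative choices, not taken from the paper.

```python
def amdahl_ops_per_processor(w_i, s_i, p_i):
    """Operations executed by each processor under Amdahl's law (Equation (1)).

    w_i: total number of computing operations of application T_i
    s_i: sequential fraction of T_i (0 <= s_i <= 1)
    p_i: (rational) number of processors allocated to T_i, p_i > 0
    """
    return s_i * w_i + (1.0 - s_i) * w_i / p_i

# Hypothetical example: 10% sequential fraction, 8 processors.
print(amdahl_ops_per_processor(w_i=1e9, s_i=0.1, p_i=8))  # 2.125e+08
```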
The power law of cache misses. In chip multi-processors, many authors have observed that the Power Law accurately models how the cache size affects the miss rate [HSPE08, RKB+09, KSS12]. Mathematically, the power law states that if m_0 is the miss rate of a workload for a baseline cache size C_0, the miss rate m for a new cache size C can be expressed as m = m_0 (C_0 / C)^α, where α is the sensitivity factor from the Power Law of Cache Misses [HSPE08, RKB+09, KSS12] and typically ranges between 0.3 and 0.7, with an average at 0.5.
Note that, by definition, a rate cannot be higher than 1, hence we extend this definition as:

m = min(1, m_0 (C_0 / C)^α).    (2)

This formula can be read as follows: if the cache size allocated is too small, then the execution goes as if no cache was allocated, and all accesses will be misses.
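A short Python sketch of the capped power law of Equation (2) follows; the baseline miss rate, cache sizes, and sensitivity factor used in the example are hypothetical.

```python
def miss_rate(m0, C0, C, alpha=0.5):
    """Capped power law of cache misses (Equation (2)).

    m0:    miss rate measured with the baseline cache size C0
    C0:    baseline cache size (same unit as C)
    C:     cache size actually allocated
    alpha: sensitivity factor, typically between 0.3 and 0.7
    """
    if C <= 0:
        return 1.0  # no cache allocated: every access is a miss
    return min(1.0, m0 * (C0 / C) ** alpha)

# Hypothetical example: 5% miss rate with a 1 MB baseline, scaled to a 4 MB slice.
print(miss_rate(m0=0.05, C0=1.0, C=4.0))  # 0.025
```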
Computations and data movement. We use the cost model introduced by Krishna et al. [KSS12] to evaluate the execution cost of an application as a function of the cache fraction that it has been allocated. Specifically, for each application, we define m_0, the miss rate of application T_i with a cache of size C_0 (we can also use the miss rate of applications with a cache of another fixed size). We express the execution time of T_i as a function of p_i, the number of processors allocated to T_i, and x_i, the fraction of S_s allocated to T_i (recall both are rational numbers). Let Fl_i(p_i) be the number of operations performed by each processor for application T_i, given that the application is executed on p_i processors. We have Fl_i(p_i) = s_i w_i + (1 − s_i) w_i / p_i according to Amdahl's speedup profile. Finally,

Exe_i(p_i, x_i) =
  Fl_i(p_i) (1 + f_i (l_s + l_l))                                      if x_i = 0;
  Fl_i(p_i) (1 + f_i (l_s + l_l · min(1, m_0 (C_0 / (x_i C_s))^α)))    if x_i C_s ≤ a_i;
  Fl_i(p_i) (1 + f_i (l_s + l_l · min(1, m_0 (C_0 / a_i)^α)))          otherwise.
    (3)
Indeed, for each operation, we pay the cost of the computing operation, plus the cost of data accesses, and by definition we have f_i accesses per operation. At each access, we pay a latency l_s, and an additional latency l_l in case of cache miss (see Equation (2)). The last case states that we cannot use a portion of cache greater than the memory footprint a_i of application T_i. This model is somewhat pessimistic: cache accesses to the same variable by two different processors are counted twice. We show in Section 6 that despite this conservative assumption (no sharing), co-scheduling can outperform classical approaches that sequentially deploy each application on the whole set of available resources.
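Putting Equations (1)-(3) together, a sketch of the execution-time model could look as follows; it reuses the amdahl_ops_per_processor and miss_rate helpers sketched above, and the parameter names are ours, not the paper's.

```python
def exe_time(w_i, s_i, f_i, a_i, p_i, x_i, m0, C0, Cs, l_s, l_l, alpha=0.5):
    """Execution time Exe_i(p_i, x_i) following Equation (3) (illustrative sketch).

    p_i: (rational) number of processors given to T_i
    x_i: fraction of the shared cache S_s given to T_i (0 <= x_i <= 1)
    Cs:  size of the shared cache S_s; l_s, l_l: cache and memory latencies
    """
    ops = amdahl_ops_per_processor(w_i, s_i, p_i)         # Equation (1)
    if x_i == 0:
        miss = 1.0                                        # first case: every access misses
    else:
        effective_cache = min(x_i * Cs, a_i)              # cannot exploit more cache than the footprint a_i
        miss = miss_rate(m0, C0, effective_cache, alpha)  # Equation (2)
    return ops * (1.0 + f_i * (l_s + l_l * miss))
```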
Equation (3) calls for a few observations. For notational convenience, let d_i = m_0 (C_0 / C_s)^α:
It is useless to give a fraction of cache larger than a_i / C_s to application T_i;
Because of the minimum min(1, d_i / (x_i)^α), either x_i > d_i^{1/α}, or x_i = 0: indeed, if we give application T_i a fraction of cache smaller than d_i^{1/α}, the minimum is equal to 1, and this fraction is wasted.
Hence, we have for all i:

x_i = 0   or   d_i^{1/α} < x_i ≤ a_i / C_s.    (4)

Of course, if d_i^{1/α} ≥ a_i / C_s for some application T_i, then x_i = 0.
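The observations above translate into a tiny helper that, under the same assumptions, returns the range of cache fractions worth considering for T_i (Equation (4)); names and defaults are ours.

```python
def useful_cache_fraction_range(m0, C0, Cs, a_i, alpha=0.5):
    """Range (lo, hi) of useful cache fractions for T_i, following Equation (4).

    T_i should receive either x_i = 0, or a fraction strictly above lo and
    at most hi. If lo >= hi, the only sensible choice is x_i = 0.
    """
    d_i = m0 * (C0 / Cs) ** alpha
    lo = d_i ** (1.0 / alpha)  # below this, the min() in Equation (3) saturates at 1
    hi = a_i / Cs              # above this, the allocated cache exceeds the footprint
    return lo, hi
```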
We denote by Exe_i^seq(x_i) = Exe_i(1, x_i) the sequential execution time of application T_i with a fraction of cache x_i.
Scheduling problem. Given n applications T_1, ..., T_n, we aim at partitioning the shared cache and assigning processors so that the concurrent execution of these applications takes minimal time. In other words, we aim at minimizing the execution time of the longest application, when all applications start their execution at the same time. Formally:

Definition 1 (CoSchedCache). Given n applications T_1, ..., T_n and a platform with p identical processors sharing a cache of size C_s, find a schedule {(p_1, x_1), ..., (p_n, x_n)} with Σ_{i=1}^{n} p_i ≤ p and Σ_{i=1}^{n} x_i ≤ 1, that minimizes max_{1≤i≤n} Exe_i(p_i, x_i).
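To make the objective of Definition 1 concrete, here is a naive grid-search sketch that evaluates candidate schedules with the exe_time helper above; it is not one of the heuristics of Section 5, it uses integer processor counts for simplicity (the paper also studies rational ones), and it scales poorly with n.

```python
from itertools import product

def naive_coschedule(apps, p, Cs, C0, l_s, l_l, alpha=0.5, steps=10):
    """Toy exhaustive search for CoSchedCache on a discretized grid.

    apps: list of dicts with keys w, s, f, a, m0 (one per application T_i)
    p:    total number of processors; Cs: size of the shared cache S_s
    Returns the best (makespan, [(p_i, x_i), ...]) found on the grid.
    """
    n = len(apps)
    fractions = [k / steps for k in range(steps + 1)]
    best = (float("inf"), None)
    for xs in product(fractions, repeat=n):            # candidate cache partitions
        if sum(xs) > 1.0:
            continue
        for ps in product(range(1, p + 1), repeat=n):  # candidate processor counts
            if sum(ps) > p:
                continue
            makespan = max(
                exe_time(a["w"], a["s"], a["f"], a["a"], pi, xi,
                         a["m0"], C0, Cs, l_s, l_l, alpha)
                for a, pi, xi in zip(apps, ps, xs))
            if makespan < best[0]:
                best = (makespan, list(zip(ps, xs)))
    return best
```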
We pay particular attention in the following to perfectly parallel applications, i.e., applications T_i with s_i = 0. In this case, Exe_i(p_i, x_i) = Exe_i(1, x_i) / p_i = Exe_i^seq(x_i) / p_i. The co-scheduling problem for such applications is denoted CoSchedCachePP.

4. Complexity Results
In this section, we focus on the CoSchedCache problem with rational numbers of processors in order
to study the intrinsic complexity of co-scheduling with cache partitioning. We first prove that in an optimal
execution, all applications must complete at the same time when using rational numbers of processors
(Section 4.1). We recall that CoSchedCache is NP-complete, even for perfectly parallel applications
(Section 4.2), and we show several dominance results on the optimal solution (Section 4.3). While some of
these dominance results only hold for perfectly parallel applications, they will guide the design of heuristics
for general applications in Section 5.
4.1. All applications complete at the same time
Lemma 1. To minimize the makespan when using rational numbers of processors, all applications must
finish at the same time.
Proof. Consider n applications T_1, ..., T_n that obey Amdahl's law, and a solution S = {(p_i, x_i)}_{1≤i≤n} to CoSchedCache. Let D_S = max_i Exe_i(p_i, x_i) be the makespan of this solution. For simplicity, we let

A_i = 1 + f_i (l_s + l_l · min(1, m_i^{1MB} (10^6 / (x_i C_s))^α)),
b_i = A_i w_i s_i,
c_i = A_i w_i (1 − s_i),

where m_i^{1MB} denotes the miss rate of T_i with a 1MB baseline cache (Equation (3) instantiated with C_0 = 10^6). Hence, Exe_i(p_i, x_i) = b_i + c_i / p_i. The set of applications whose execution time is exactly D_S is denoted by I_S.
We show the result by contradiction. We consider an optimal solution S whose subset I_S has minimal size (i.e., for any other optimal solution S_o, |I_S| ≤ |I_{S_o}|). Then we show that if |I_S| ≠ n, we can construct a solution S' with either (i) a smaller makespan if |I_S| = 1 (contradicting the optimality hypothesis), or (ii) one less application whose execution time is exactly D_S (contradicting the minimality hypothesis).
Assume |I_S| ≠ n, let T_{i_0} ∈ I_S and T_{i_1} ∉ I_S. We have Exe_{i_1}(p_{i_1}, x_{i_1}) < Exe_{i_0}(p_{i_0}, x_{i_0}) = D_S, that is b_{i_1} + c_{i_1} / p_{i_1} < b_{i_0} + c_{i_0} / p_{i_0}, and hence

(b_{i_1} − b_{i_0}) p_{i_0} p_{i_1} − c_{i_0} p_{i_1} + c_{i_1} p_{i_0} < 0.    (5)
We now prove that we can always find 0 < ε < p_{i_1} s.t. Exe_{i_0}(p_{i_0}, x_{i_0}) > Exe_{i_0}(p_{i_0} + ε, x_{i_0}) > Exe_{i_1}(p_{i_1} − ε, x_{i_1}), i.e.,

D_S = b_{i_0} + c_{i_0} / p_{i_0} > b_{i_0} + c_{i_0} / (p_{i_0} + ε) > b_{i_1} + c_{i_1} / (p_{i_1} − ε).
The left part of inequality, b_{i_0} + c_{i_0} / p_{i_0} > b_{i_0} + c_{i_0} / (p_{i_0} + ε), is always true when ε > 0. For the right part of the inequality above, we have:

−(b_{i_1} − b_{i_0}) ε² + [(p_{i_1} − p_{i_0})(b_{i_1} − b_{i_0}) + c_{i_0} + c_{i_1}] ε + (b_{i_1} − b_{i_0}) p_{i_0} p_{i_1} − c_{i_0} p_{i_1} + c_{i_1} p_{i_0} < 0.    (6)
From Equation (5), we know that the constant term (b_{i_1} − b_{i_0}) p_{i_0} p_{i_1} − c_{i_0} p_{i_1} + c_{i_1} p_{i_0} is negative; since the left-hand side of Equation (6) is continuous in ε and equals this negative constant at ε = 0, we can always find a 0 < ε < p_{i_1} that makes Equation (6) satisfied.
Then clearly, S' = {(p'_i, x_i)}_i, where p'_i is (i) p_i if i ∉ {i_0, i_1}, (ii) p_{i_0} + ε if i = i_0, (iii) p_{i_1} − ε if i = i_1, is a valid solution: we have the property Σ_i p'_i = Σ_i p_i ≤ p, and Σ_i x'_i = Σ_i x_i ≤ 1.
Hence,
If |I_S| = 1, then for all i, Exe_i(p'_i, x_i) < D_S, hence showing that S is not optimal;
Else, I_{S'} = I_S \ {i_0}, and D_{S'} = D_S, hence showing that S is not minimal.
This shows that necessarily, |I_S| = n.
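A small numerical illustration of the exchange argument (with made-up values for b_i, c_i and p_i, not taken from the paper): shifting ε processors from the non-critical application i_1 to the critical application i_0 strictly decreases the larger of the two completion times.

```python
# Hypothetical values: application i0 is critical (largest completion time), i1 is not.
b  = {"i0": 2.0, "i1": 1.0}    # sequential parts b_i
c  = {"i0": 40.0, "i1": 10.0}  # parallel parts c_i
pr = {"i0": 4.0, "i1": 4.0}    # rational processor shares p_i

def exe(i, p):
    return b[i] + c[i] / p     # Exe_i(p_i, x_i) = b_i + c_i / p_i

eps = 0.5                      # 0 < eps < p_{i1}
before = max(exe("i0", pr["i0"]), exe("i1", pr["i1"]))
after  = max(exe("i0", pr["i0"] + eps), exe("i1", pr["i1"] - eps))
print(before, after)           # 12.0 -> about 10.89: the makespan strictly decreases
```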

Frequently Asked Questions (12)
Q1. What are the contributions mentioned in the paper "Co-scheduling amdahl applications on cache-partitioned systems" ?

In this paper, the authors provide answers to (i) and (ii) for Amdahl applications.

Future work will be devoted to gaining access to, and conducting real experiments on, a cache-partitioned system with a high core count: this would allow the authors to further validate the accuracy of the model and to confirm the impact of their promising results. On the theoretical side, the authors plan to focus on the problem with integer numbers of processors, and they hope to derive interesting results that could help design even more efficient heuristics.

The authors simplify the design of the heuristics by temporarily allocating processors as if the applications were perfectly parallel, and then concentrating on strategies that partition the cache efficiently among some applications (and give no cache fraction to the remaining ones).

For the simulations, the authors use a cache configuration representing an Intel Xeon CPU E5-2690, with a 40 MB last-level cache per 8-core processor.

Extensive simulation results demonstrate that the use of dominant partitions always leads to better results than more naive approaches, as soon as there is a small sequential fraction of work in application speedup profiles. 

The main difficulty of co-scheduling is to decide which applications to execute concurrently, and how many cores to assign to each of them. 

The authors show that the ratio processors/applications has a significant impact on performance: when many processors are available for a few applications, it is less crucial to use efficient cache-partitioning and all applications can share the cache, hence Fair obtains good results, close to DomS-MinRatio. 


According to the literature [KKSM13, MHSN15, PB14], the last-level cache (LLC) latency is on average four to ten times lower than the DDR latency, and the authors enforce a ratio of 5.88 in the simulations.

Lemma 3. Given a set of applications T_1, ..., T_n and a partition (I_C, Ī_C), the optimal solution to CSCPP-Ext(I_C, Ī_C) is x_i = (w_i f_i d_i)^{1/(α+1)} / Σ_{j ∈ I_C} (w_j f_j d_j)^{1/(α+1)} if i ∈ I_C, and x_i = 0 otherwise.
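As an illustration of Lemma 3, a one-function Python sketch computing the optimal cache fractions for the applications in I_C; the inputs in the example are hypothetical.

```python
def lemma3_cache_fractions(apps_in_IC, alpha=0.5):
    """Optimal cache fractions x_i of Lemma 3 for the applications sharing the LLC.

    apps_in_IC: list of (w_i, f_i, d_i) tuples for the applications in I_C;
    applications outside I_C receive x_i = 0.
    """
    weights = [(w * f * d) ** (1.0 / (alpha + 1.0)) for (w, f, d) in apps_in_IC]
    total = sum(weights)
    return [wt / total for wt in weights]

# Hypothetical example with three applications sharing the cache.
print(lemma3_cache_fractions([(1e9, 0.2, 0.04), (5e8, 0.3, 0.02), (2e9, 0.1, 0.05)]))
```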

Results are normalized with the makespan of AllProcCache, which is the execution without any co-scheduling: in the AllProcCache heuristic, applications are executed sequentially, each using all processors and all the cache. 

With more applications, the authors obtain the same ranking of heuristics, except that Fair is always the worst heuristic: since there are fewer processors on average per application, a good co-scheduling policy is necessary (see [ABD+17] for detailed results).