Compiler Techniques for Massively Scalable
Implicit Task Parallelism
Timothy G. Armstrong, Justin M. Wozniak, Michael Wilde, Ian T. Foster
Dept. of Computer Science, University of Chicago, Chicago, IL, USA
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL, USA
Abstract—Swift/T is a high-level language for writing concise,
deterministic scripts that compose serial or parallel codes im-
plemented in lower-level programming models into large-scale
parallel applications. It executes using a data-driven task parallel
execution model that is capable of orchestrating millions of
concurrently executing asynchronous tasks on homogeneous or
heterogeneous resources. Producing code that executes efficiently
at this scale requires sophisticated compiler transformations:
poorly optimized code inhibits scaling with excessive synchro-
nization and communication. We present a comprehensive set of
compiler techniques for data-driven task parallelism, including
novel compiler optimizations and intermediate representations.
We report application benchmark studies, including unbalanced
tree search and simulated annealing, and demonstrate that our
techniques greatly reduce communication overhead and enable
extreme scalability, distributing up to 612 million dynamically
load balanced tasks per second at scales of up to 262,144 cores
without explicit parallelism, synchronization, or load balancing
in application code.
I. INTRODUCTION
In recent years, large-scale computation has become an
indispensable tool in many fields, including those that have
not traditionally used high-performance computing. These in-
clude data-intensive applications such as machine learning and
scientific data crunching and compute-intensive applications
such as high-fidelity simulations.
The traditional development model for high-performance
computing requires close cooperation between domain experts
and parallel computing experts to build applications that
efficiently run on distributed-memory systems, with careful
attention given to low-level concerns such as distribution of
data, load balancing, and synchronization. Many real-world
applications, however, are amenable to generic approaches to
these concerns. In particular, many applications are naturally
expressed with data-driven task parallelism, in which massive
numbers of concurrently executing tasks are dynamically
assigned to execution resources, with synchronization and
communication handled using intertask data dependencies.
Variants of this execution model for distributed-memory and
heterogeneous systems have received significant attention be-
cause of the attractive confluence of high performance with
ease of development for many applications. Data-driven task
parallelism can expose more parallelism than can alternative
models such as fork-join [29], and it addresses challenges
of utilizing heterogeneous, distributed-memory resources with
transparent data movement between devices and dynamic
data-aware task scheduling. Recent work has explored imple-
menting this execution model with libraries and conservative
language extensions to C for distributed-memory and heterogeneous
systems [3], [8], [9], [28] and has shown that performance
can match or exceed performance of code directly using the
underlying interfaces (e.g., message passing or threads). One
reason for this success is that sophisticated algorithms for
load balancing (e.g., work stealing) or data movement, usually
impractical to reimplement for each application, can be imple-
mented in an application-independent manner. Another reason
is that the asynchronous execution model is effective at hiding
latency and exploiting available resources in applications with
irregular parallelism or unpredictable task runtimes.
Swift/T [36] is a high-level implicitly parallel programming
language that aims to make writing massively parallel code
for this execution model as easy and intuitive as sequential
scripting in languages such as Python. Implementing a very
high-level language such as Swift/T efficiently and scalably
is challenging, however, because the programmer only spec-
ifies synchronization and communication implicitly through
function composition or reads and writes to variables and
data structures. Thus, internode data movement, parallel task
management, and memory management are left entirely to
the language’s compiler and runtime system. Since large-
scale applications may require execution rates of hundreds
of millions of tasks per second on many thousands of cores,
this complex coordination logic must be implemented both
efficiently and scalably.
For this reason, we have developed, adapted, and im-
plemented a range of compiler techniques for data-driven
task parallelism, presented here. By optimizing the use of a
distributed runtime system’s operations, communication and
synchronization overhead is reduced by an order of magnitude.
In addition to this primary outcome of the work, we make the
following technical contributions:
• Characterization of the novel compiler optimization problems arising in data-driven implicit task parallelism.
• Design of an intermediate representation for effective optimization of the execution model.
• Novel compiler optimizations that reduce coordination costs by an order of magnitude.
• Compiler techniques that achieve low-overhead distributed automatic memory management at massive scale.
SC14, November 16-21, 2014, New Orleans, Louisiana, USA. 978-1-4799-5500-8/14/$31.00 © 2014 IEEE

Fig. 1: Task and data dependencies in data-driven task parallelism, forming a spawn tree rooted at task a. Data dependencies on shared data defer execution of tasks until the variables are frozen.
II. DATA-DRIVEN TASK PARALLELISM AND SWIFT/T
We introduce the data-driven task parallelism execution
model (Section II-A), show how it is programmable with the
high-level Swift/T language (Section II-B), and describe a
massively scalable implementation (Section II-C).
A. Abstract Execution Model
In data-driven task parallelism, a program is organized
into task definitions with explicit inputs. A task is a runtime
instantiation of a task definition with inputs bound to specific
data. Once executing, tasks run to completion and are not
preempted.
Each task can spawn asynchronous child tasks, resulting in
a spawn tree of tasks as in Figure 1. We assume support for
shared data: data items that can be read or written by any
task that obtains a reference to the data. Parent tasks can pass
data to their child tasks at spawn time, for example small
data such as numbers or short strings, along with references
to arbitrary shared data. Shared data items provide a means
for coordination between multiple tasks. For example, a task
can spawn two tasks, passing both a reference to a shared
data item, which one task reads and the other writes. Data
dependencies, which defer the execution of tasks, are the
primary synchronization mechanism. The execution model
permits a task to write (or not write) any data it holds a
reference to, allowing many runtime data dependency patterns
beyond static task graphs.
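To make this concrete, the pattern can be written at the Swift/T level as a minimal sketch (produce and consume are hypothetical functions standing in for arbitrary tasks):

  int x;               // shared single-assignment data item
  x = produce();       // one task writes x
  int y = consume(x);  // another task reads x; its execution is
                       // deferred until x is written and frozen

Here the two calls may run on different workers; the only coordination between them is the data dependency on x.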
The execution model is much lower level than high-level
programming models such as the Swift/T language discussed
in the next section. There is no high-level syntax, and safety
guarantees are limited. A task can execute arbitrary code that
performs arbitrary computation, I/O, and runtime operations
such as spawning tasks or reading/writing data. This makes
invalid, non-deterministic, or otherwise unsafe behavior possi-
ble. For example, race conditions are possible if shared data is
read without synchronizing using data dependencies. Explicit
bookkeeping is also needed for both memory management and
correct freezing of variables. Programming errors could result
in memory leaks, prematurely freed data, or deadlocks. Many
other more restrictive task-parallel programming models, such
as task graphs or fork-join parallelism, can be expressed with
these basic constructs, so optimizations for this model are
broadly applicable.
B. Overview of Swift/T Programming Language
The overall Swift/T system has been described in previous
work [36], so we focus here on language semantics relevant
to compiler optimization. The Swift/T language’s syntax and
semantics are derived from the Swift language [34]. Swift/T
focuses on high-performance fine-grained task parallelism,
such as calling foreign functions (including C and Fortran)
with in-memory data and launching kernels on GPUs and other
accelerators [16]. These foreign functions are integrated into
the Swift/T language as typed leaf functions that encapsulate
computationally intensive code, leaving parallel coordination,
task distribution, and data dependency management to the
Swift/T dataflow programming model. Figure 2 illustrates how
leaf functions can be composed into an application, with
complexities such as data-dependent control flow expressible
naturally in the language.
The Swift/T language is a global-view implicitly parallel
language, meaning that, by default, execution order of state-
ments is constrained only by data dependencies, and that
execution location is left to language implementation, with
program variables logically accessible to code regardless of
where it executes. That is, program logic can be expressed
without explicit concurrency, communication, or data parti-
tioning. Certain control structures, including conditionals and
explicit wait statements, add additional control flow dependen-
cies to code, while annotations can provide hints or constraints
for data or task placement. Two types of loops are available:
foreach loops, for parallel iteration, and for loops, where
iterations are pipelined, with data passed from one iteration
to the next. Swift/T also supports unbounded recursion.
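A schematic sketch of the two loop types (f, step, s0, and N are hypothetical placeholders, and the for-loop syntax is illustrative rather than definitive):

  // foreach: iterations may all execute in parallel, ordered
  // only by data dependencies among their reads and writes
  foreach i in [1:N] {
    A[i] = f(i);
  }

  // for: iterations are pipelined; the loop-carried variable s
  // computed in one iteration is passed to the next
  for (int i = 1, int s = s0; i <= N; i = i + 1, s = step(s, i)) {
    trace(i, s);
  }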
Swift/T can guarantee deterministic execution even with
implicit parallelism because its standard data types are mono-
tonic; that is, they cannot be mutated in such a way that
information is lost or overwritten. A monotonic variable starts
off empty, then incrementally accumulates information until
it is frozen, whereupon it cannot be further modified. One
can construct a wide variety of monotonic data types [11],
[17]. Basic Swift/T variables are single-assignment I-vars [21],
which are frozen when first assigned. Composite monotonic
data types can be incrementally assigned in parts but cannot
be overwritten. Programs that attempt to overwrite data will
fail at runtime (or compile time if the compiler determines
that the write is definitely erroneous). Swift/T programs using
only monotonic variables are deterministic by construction,
up to the order of side-effects such as I/O. For example, the
output value of an arbitrarily complex function involving many
data and control structures is deterministic, but the order in
which debug print statements execute depends on the nonde-
terministic order in which tasks run. Further nondeterminism
is introduced only by non-Swift/T code, library functions such
as rand(), or by rarely-used nonmonotonic data types that
are outside the scope of this paper.
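For example, with an I-var and a monotonic array (a minimal sketch):

  int x;
  x = 1;      // first assignment; x is now frozen
  // x = 2;   // would be rejected: overwriting frozen data fails
              // at runtime or, if provable, at compile time

  int A[];
  A[0] = 10;  // composite monotonic data is assigned in parts,
  A[1] = 20;  // in any order, but no element may be written twice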

blob models[], res[][];
foreach m in [1:N_models] {
  models[m] = load(sprintf("model%i.data", m));
}

foreach i in [1:M] {
  foreach j in [1:N] {
    // initial quick evaluation of parameters
    p, m = evaluate(i, j);
    if (p > 0) {
      // run ensemble of simulations
      blob res2[];
      foreach k in [1:S] {
        res2[k] = simulate(models[m], i, j, k);
      }
      res[i][j] = summarize(res2);
    }
  }
}

// Summarize results to file
foreach i in [1:M] {
  file out<sprintf("output%i.txt", i)>;
  out = analyze(res[i]);
}
(a) Implicitly parallel Swift/T code.
(b) Visualization of optimized parallel tasks and data dependencies for parameters M = 2, N = 2, S = 3. Tasks and data are mapped dynamically to compute resources at runtime.
Fig. 2: An application, an amalgam of several real scientific applications, that runs an ensemble of simulations for many parameter combinations. The code executes with implicit parallelism, ordered by data dependencies. Data dependencies are implied by reads and writes to scalar variables (e.g. p and m) and associative arrays (e.g. models and res). Swift/T semantics allow functions (e.g. load, evaluate, and simulate) to execute in parallel when execution resources are available and data dependencies are satisfied.
Fig. 3: Runtime process layout on a distributed-memory system. Processes are divided into workers and servers, which are then mapped onto the processes of multi-core systems.
The sparse dynamically sized array is the main composite
data type in Swift/T. Integer indices are the default, but
other index types including strings are supported. The array
can be assigned all at once (e.g., int A[] = f();), or
in parts (e.g., int A[]; A[i] = a; A[j] = b;). The
array lookup operation A[i] will return when A[i] is set.
An incomplete array lookup does not prevent progress; other
statements can execute concurrently.
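Putting these fragments together (a minimal sketch; the literal values are arbitrary):

  int A[];
  int B[];
  A[0] = 5;         // partial assignments may arrive in any order
  A[2] = 7;
  B[0] = A[2] + 1;  // returns once A[2] is set
  B[1] = A[0] + 1;  // not blocked by still-unwritten cells of A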
C. Massively Scalable Data-Driven Task Parallelism
The ADLB [18] and Turbine [35] runtime libraries provide
the runtime support for massively scalable data-driven task
parallelism on an MPI-2 or MPI-3 communication layer [31].
In this runtime system, MPI processes are divided into
two roles: workers and servers, which can be laid out in
various ways, for example with one server process allocated
to each shared-memory node, as shown in Figure 3. Worker
processes execute any program logic, coordinating with each
other through remote execution of data and task operations
on servers, as shown in Figure 4. These operations are
low-latency, typically taking microseconds to process, which
minimizes delays to worker processes. If needed, parallel
MPI functions can be executed by worker processes that
are dynamically grouped into “teams” with a shared MPI
communicator [37].

Fig. 4: Runtime architecture showing distributed worker processes coordinating through task and data operations. Ready/waiting tasks and shared data items are stored on servers, with each server storing a subset of tasks and data. Servers must communicate to redistribute tasks through work stealing, and to request/receive notifications about data availability on other servers.
The data functionality includes rich data structures such as scalar values, strings, binary blobs, structs, and associative arrays, providing the primitives needed to implement Swift's monotonic data types as shared data items. Memory management of this data is supported using read and write reference counters for each data item, allowing unused data to be deleted and frozen data to be made read-only. The task functionality implements a scalable distributed task queue, with load balancing using randomized work stealing between servers. Task data dependencies are supported, so that tasks can be released when data is frozen, at the granularity of an entire data structure or individual array subscripts, as shown in Figure 4. Figure 5 illustrates the scalability and task throughput of Swift/T programs using the runtime system on the Blue Waters supercomputer, where Swift/T achieved a peak throughput of 1.47 billion tasks/s on 524,288 cores running the Sweep benchmark described later in Section IV. Tasks of 1 ms or more achieve high efficiency because the servers are lightly loaded and queuing delays are minimal.

Fig. 5: Throughput and scaling of runtime system for varying task durations.

foreach i in [1:N] {
  foreach j in [1:M] {
    a, b, c = A[i-1][j-1], A[i-1][j], A[i][j-1];
    A[i][j] = h(f(g(a)), f(g(b)), f(g(c)));
  }
}

Fig. 6: Swift code fragment illustrating the wavefront pattern.
III. COMPILER OPTIMIZATION
STC is a whole-program optimizing compiler for Swift/T
that targets the distributed runtime described previously.
Within STC we have implemented optimizations aimed at
reducing communication and synchronization without loss of
parallelism (Section III-A). An intermediate representation for
the program captures the execution model (Section III-B),
allowing optimization of synchronization, shared data, and
reference counting (Sections III-C, III-E, III-F, respectively).
A. Optimization Goals for Data-driven Task Parallelism
To optimize a wide range of data-driven task parallelism
patterns, we need compiler optimization techniques that can
understand the semantics of task parallelism and monotonic
variables in order to perform major transformations of the task
structure of programs to reduce synchronization and commu-
nication at runtime, while preserving parallelism. Excessive
runtime operations impair program efficiency because tasks
waste time waiting for communication; they can also impair
scalability by causing bottlenecks for data or task queues.
Fig. 7: STC compiler architecture. The frontend produces IR-1, to which optimization passes are applied to produce successively more optimized IR-1 trees. Postprocessing adds intertask data passing and reference counting information to produce IR-2 for code generation.

The implicitly parallel Swift/T code in Figure 6 illustrates
the opportunities and challenges of optimization. The code
specifies a dynamic, data-driven wavefront pattern of paral-
lelism, where evaluation of cell values is dynamically sched-
uled based on data availability at runtime, allowing execution
to adapt to variable task latencies. Two straightforward trans-
formations give immediate improvements: representing input
parameters such as i and j as regular local variables rather
than shared monotonic variables and hoisting the lookups of
A[i-1] and A[i] out of the inner loop body. The real
challenge, however, is in efficiently resolving implied data
dependencies between loop iterations. The naïve approach uses
three data dependencies per input cell; but with this strategy,
synchronization can quickly become a bottleneck. Smarter
approaches can identify common inputs of neighboring cells
to avoid redundant data reads, or defer task spawns until input
data is available: if the task for (i-1, j) spawns the task for
(i, j), only grid cell A[i][j-1] must be resolved at runtime
since both other inputs were available at (i-1, j). The charac-
teristics of the f, g, and h functions also affect performance of
different parallelization schemes. Fusing f and g invocations
is a clear improvement because no parallelism is lost; but,
depending on function runtimes and other factors, the optimal
parallel structure is not immediately obvious. To maximize
parallelism, we would implement the loop body invocations as
three independent f(g(...)) tasks that produce the input
data for a h(...) task. To minimize runtime overhead, on
the other hand, we would merge these four tasks into a single
task that executes the f(g(...)) calls sequentially.
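In source-level terms, the trade-off amounts to choosing between two schematic forms of the loop body (h_fused is a hypothetical fused function; the actual transformation operates on the IR, not the source):

  // maximum parallelism: three f(g(...)) tasks may run
  // concurrently, each producing one input of h
  A[i][j] = h(f(g(a)), f(g(b)), f(g(c)));

  // minimum overhead: one task evaluates the three chains
  // sequentially and then calls h, eliminating three task
  // spawns and the associated synchronization
  A[i][j] = h_fused(a, b, c);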
B. Intermediate Representation
The STC compiler uses a medium-level intermediate rep-
resentation (IR) that captures the execution model of data-
driven task parallelism. Two IR variants are used by stages
of the compiler (Figure 7). IR-1 is generated by the compiler
frontend and then optimized. IR-2 includes additional infor-
mation for code generation: explicit bookkeeping for reference
counts and data passing to child tasks. Sample IR-1 code for a
parallel, recursive Fibonacci calculation is shown in Figure 8.
Each IR procedure is structured as a tree of blocks. Each
block is represented as a sequence of statements. State-
ments are either composite conditional statements or single
IR instructions operating on input/output variables, giving a
flat, simple-to-analyze representation. Control flow is repre-
sented with high-level structures: if statements, foreach loops,
do/while loops, and so forth. The statements in each block

execute sequentially, but blocks within some control-flow
structures execute asynchronously and some IR instructions
spawn asynchronous tasks. Data-dependent execution is im-
plicit in some asynchronous IR instructions or explicit with
wait statements that execute a code block after a set of
variables is frozen.
Variables are either single-assignment locally stored values
or references to shared data, that is, unique identifiers used
to locate data on a remote process. A reference is either the
initial reference to a variable allocated in the block or an alias
obtained by duplicating a reference, by acquiring a reference
stored in a data structure, or by subscripting a composite
variable. Shared monotonic data is a first-class construct in
the IR, so optimizations can exploit monotonic semantics.
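As a small illustration in the notation of Figure 8 (a schematic sketch, not actual compiler output), a data-dependent addition z := x + y over shared integers x and y could be expressed implicitly or explicitly:

  // implicit: AsyncOp defers execution until x and y are frozen
  AsyncOp <plus_int> z x y

  // explicit: a wait statement defers the block, which then
  // operates on loaded local values
  wait (x, y) {
    declare $int v_x, $int v_y, $int v_z
    LoadScalar v_x x
    LoadScalar v_y y
    LocalOp <plus_int> v_z v_x v_y
    StoreScalar z v_z
  }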
C. Adaptation of Traditional Optimizations
The foundation of our suite of optimizations is a range
of traditional optimization techniques adapted from conven-
tional compilers [19] to our intermediate representation and
execution model in general. This required substantial changes
to many of the techniques, particularly to generalize them to
monotonic variables, and also to be able to effectively optimize
across task boundaries and with concurrent semantics.
STC includes a powerful value numbering [19] (VN) analysis
that discovers congruence relations in an IR function
between various expression types, including variables, array
cells, constants, arithmetic expressions, and function calls.
Annotations on functions, including standard library and user
functions, assist this optimization. For example, the annotation
@pure asserts that a function output is deterministic and
that it has no side-effects. The VN pass identifies congruence
relations for each IR block. Value congruence, for example
retrieve(x) =V y*2 =V 6, means that multiple expressions
have the same value. Alias congruence, for example
y =A z =A A[0], means that IR variables refer to the same runtime
shared data. Alias congruence implies value congruence. A
relation for a block B applies to B and all descendant
blocks, because of the monotonicity of IR variables. A set
of expressions congruent in B defines a congruence class.
STC's VN implementation visits all IR instructions in an
IR function with a reverse postorder tree walk. Each IR
instruction, for example StoreInt A 1, can yield congruence
relations: in this case A =V store(1) and 1 =V retrieve(A).
These new relations are added to the known
relations, perhaps merging existing congruence classes. For
example, if B =V store(1), then A =V B. Erroneous user
code that double-assigns a variable forces VN to abort, since
the correctness of the analysis depends on each variable having
a consistent value. Congruence relations in a block always
apply to descendant blocks. We also propagate congruence
relations upward to parent blocks in the case of conditional
statements. For example, if x =V 1 on both branches of
an if statement, it is propagated to the parent. We create
temporary variables if necessary to do this; for example, if
x =V retrieve(A) and y =V retrieve(A) on the branches, a
temporary variable holding retrieve(A) can be introduced in the parent block.
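A source-level illustration of what VN enables (schematic; f is a hypothetical function annotated @pure):

  // before: the two right-hand sides fall into one congruence
  // class, since @pure f is deterministic and side-effect free
  int x = f(a) * 2;
  int y = f(a) * 2;

  // after (conceptually): the congruent expression is computed once
  int t = f(a) * 2;
  int x = t;
  int y = t;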
main () {
  int n = argv("n");  // Get command line argument
  int f = fib(n);

  // Print result once computation finishes
  printf("fib(%i)=%i", n, f);
}

(int o) fib (int i) {
  if (i == 0) {
    o = 0;
  } else if (i == 1) {
    o = 1;
  } else {
    // Compute fib(i-1) and fib(i-2) concurrently
    o = fib(i - 1) + fib(i - 2);
  }
}
(a) Swift/T code for recursive Fibonacci.
() @main () {
  declare $int v_n, int f                // variables for block
  CallExtLocal argv [ v_n ] [ "n" ]      // call to argv
  Call fib [ f ] [ v_n ]                 // fib runs asynchronously
  wait (f) {                    // execute block once f is frozen
    declare $int v_f
    LoadScalar v_f f            // Load value of f to v_f
    CallExtLocal printf [ ] [ "fib(%i)=%i" v_n v_f ]
  }
}

// Compute o := fibonacci(i)
// input v_i is a value, output o is a shared variable
(int o) @fib ($int v_i) {
  declare $boolean t0
  LocalOp <eq_int> t0 v_i 0        // t0 := (v_i == 0)
  if (t0) {
    StoreScalar o 0                // o := 0
  } else {
    declare $boolean t2
    LocalOp <eq_int> t2 v_i 1      // t2 := (v_i == 1)
    if (t2) {
      StoreScalar o 1              // o := 1
    } else {
      declare $int v_i1, $int v_i2, int f1, int f2
      LocalOp <minus_int> v_i1 v_i 1   // v_i1 := v_i - 1
      // Fib calls run asynchronously
      Call fib [ f1 ] [ v_i1 ]
      LocalOp <minus_int> v_i2 v_i 2   // v_i2 := v_i - 2
      Call fib [ f2 ] [ v_i2 ]
      // Compute sum once f1, f2 assigned
      AsyncOp <plus_int> o f1 f2       // o := f1 + f2
    }
  }
}
(b) IR-1 optimized at -O2.
Fig. 8: Sample Swift/T program and corresponding IR for the recursive Fibonacci algorithm. The IR comprises two functions, main and fib. IR instructions include Swift function calls (e.g. Call fib), foreign function calls (e.g. CallExtLocal printf), immediate arithmetic operations (e.g. LocalOp <eq_int>), data-dependent arithmetic operations (e.g. AsyncOp <plus_int>), and reads and writes of shared data items (LoadScalar and StoreScalar, respectively). Control flow constructs used include conditional if statements and wait statements for data-dependent execution.

References
• Steven Muchnick. Advanced Compiler Design and Implementation. Book.
• StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Journal article.
• Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. Proceedings article.
• Code Generation in the Polyhedral Model Is Easier Than You Think. Proceedings article.
• Workflow scheduling algorithms for grid computing. Book chapter.