Compiler Techniques for Massively Scalable
Implicit Task Parallelism
Timothy G. Armstrong, Justin M. Wozniak, Michael Wilde, Ian T. Foster
Dept. of Computer Science, University of Chicago, Chicago, IL, USA
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL, USA
Abstract—Swift/T is a high-level language for writing concise,
deterministic scripts that compose serial or parallel codes im-
plemented in lower-level programming models into large-scale
parallel applications. It executes using a data-driven task parallel
execution model that is capable of orchestrating millions of
concurrently executing asynchronous tasks on homogeneous or
heterogeneous resources. Producing code that executes efficiently
at this scale requires sophisticated compiler transformations:
poorly optimized code inhibits scaling with excessive synchro-
nization and communication. We present a comprehensive set of
compiler techniques for data-driven task parallelism, including
novel compiler optimizations and intermediate representations.
We report application benchmark studies, including unbalanced
tree search and simulated annealing, and demonstrate that our
techniques greatly reduce communication overhead and enable
extreme scalability, distributing up to 612 million dynamically
load balanced tasks per second at scales of up to 262,144 cores
without explicit parallelism, synchronization, or load balancing
in application code.
I. INTRODUCTION
In recent years, large-scale computation has become an
indispensable tool in many fields, including those that have
not traditionally used high-performance computing. These in-
clude data-intensive applications such as machine learning and
scientific data crunching and compute-intensive applications
such as high-fidelity simulations.
The traditional development model for high-performance
computing requires close cooperation between domain experts
and parallel computing experts to build applications that
efficiently run on distributed-memory systems, with careful
attention given to low-level concerns such as distribution of
data, load balancing, and synchronization. Many real-world
applications, however, are amenable to generic approaches to
these concerns. In particular, many applications are naturally
expressed with data-driven task parallelism, in which massive
numbers of concurrently executing tasks are dynamically
assigned to execution resources, with synchronization and
communication handled using intertask data dependencies.
Variants of this execution model for distributed-memory and
heterogeneous systems have received significant attention be-
cause of the attractive confluence of high performance with
ease of development for many applications. Data-driven task
parallelism can expose more parallelism than can alternative
models such as fork-join [29], and it addresses challenges
of utilizing heterogeneous, distributed-memory resources with
transparent data movement between devices and dynamic
data-aware task scheduling. Recent work has explored imple-
menting this execution model with libraries and conservative
language extensions to C for distributed-memory and heterogeneous
systems [3], [8], [9], [28] and has shown that performance
can match or exceed performance of code directly using the
underlying interfaces (e.g., message passing or threads). One
reason for this success is that sophisticated algorithms for
load balancing (e.g., work stealing) or data movement, usually
impractical to reimplement for each application, can be imple-
mented in an application-independent manner. Another reason
is that the asynchronous execution model is effective at hiding
latency and exploiting available resources in applications with
irregular parallelism or unpredictable task runtimes.
Swift/T [36] is a high-level implicitly parallel programming
language that aims to make writing massively parallel code
for this execution model as easy and intuitive as sequential
scripting in languages such as Python. Implementing a very
high-level language such as Swift/T efficiently and scalably
is challenging, however, because the programmer only spec-
ifies synchronization and communication implicitly through
function composition or reads and writes to variables and
data structures. Thus, internode data movement, parallel task
management, and memory management are left entirely to
the language’s compiler and runtime system. Since large-
scale applications may require execution rates of hundreds
of millions of tasks per second on many thousands of cores,
this complex coordination logic must be implemented both
efficiently and scalably.
For this reason, we have developed, adapted, and im-
plemented a range of compiler techniques for data-driven
task parallelism, presented here. By optimizing the use of a
distributed runtime system’s operations, communication and
synchronization overhead is reduced by an order of magnitude.
In addition to this primary outcome of the work, we make the
following technical contributions:
• Characterization of the novel compiler optimization problems arising in data-driven implicit task parallelism.
• Design of an intermediate representation for effective optimization of the execution model.
• Novel compiler optimizations that reduce coordination costs by an order of magnitude.
• Compiler techniques that achieve low-overhead distributed automatic memory management at massive scale.
SC14, November 16-21, 2014, New Orleans, Louisiana, USA. 978-1-4799-5500-8/14/$31.00 © 2014 IEEE

Fig. 1: Task and data dependencies in data-driven task parallelism, forming a spawn tree rooted at task a. Data dependencies on shared data defer execution of tasks until the variables are frozen.
II. DATA-DRIVEN TASK PARALLELISM AND SWIFT/T
We introduce the data-driven task parallelism execution
model (Section II-A), show how it is programmable with the
high-level Swift/T language (Section II-B), and describe a
massively scalable implementation (Section II-C).
A. Abstract Execution Model
In data-driven task parallelism, a program is organized
into task definitions with explicit inputs. A task is a runtime
instantiation of a task definition with inputs bound to specific
data. Once executing, tasks run to completion and are not
preempted.
Each task can spawn asynchronous child tasks, resulting in
a spawn tree of tasks as in Figure 1. We assume support for
shared data: data items that can be read or written by any
task that obtains a reference to the data. Parent tasks can pass
data to their child tasks at spawn time, for example small
data such as numbers or short strings, along with references
to arbitrary shared data. Shared data items provide a means
for coordination between multiple tasks. For example, a task
can spawn two tasks, passing both a reference to a shared
data item, which one task reads and the other writes. Data
dependencies, which defer the execution of tasks, are the
primary synchronization mechanism. The execution model
permits a task to write (or not write) any data it holds a
reference to, allowing many runtime data dependency patterns
beyond static task graphs.
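To make this concrete, the pattern can be written at the Swift/T level as a minimal sketch (produce and consume are hypothetical functions standing in for arbitrary tasks):

  int x;               // shared single-assignment data item
  x = produce();       // one task writes x
  int y = consume(x);  // another task reads x; its execution is
                       // deferred until x is written and frozen

Here the two calls may run on different workers; the only coordination between them is the data dependency on x.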
The execution model is much lower level than high-level
programming models such as the Swift/T language discussed
in the next section. There is no high-level syntax, and safety
guarantees are limited. A task can execute arbitrary code that
performs arbitrary computation, I/O, and runtime operations
such as spawning tasks or reading/writing data. This makes
invalid, non-deterministic, or otherwise unsafe behavior possi-
ble. For example, race conditions are possible if shared data is
read without synchronizing using data dependencies. Explicit
bookkeeping is also needed for both memory management and
correct freezing of variables. Programming errors could result
in memory leaks, prematurely freed data, or deadlocks. Many
other more restrictive task-parallel programming models, such
as task graphs or fork-join parallelism, can be expressed with
these basic constructs, so optimizations for this model are
broadly applicable.
B. Overview of Swift/T Programming Language
The overall Swift/T system has been described in previous
work [36], so we focus here on language semantics relevant
to compiler optimization. The Swift/T language’s syntax and
semantics are derived from the Swift language [34]. Swift/T
focuses on high-performance fine-grained task parallelism,
such as calling foreign functions (including C and Fortran)
with in-memory data and launching kernels on GPUs and other
accelerators [16]. These foreign functions are integrated into
the Swift/T language as typed leaf functions that encapsulate
computationally intensive code, leaving parallel coordination,
task distribution, and data dependency management to the
Swift/T dataflow programming model. Figure 2 illustrates how
leaf functions can be composed into an application, with
complexities such as data-dependent control flow expressible
naturally in the language.
The Swift/T language is a global-view implicitly parallel
language, meaning that, by default, execution order of state-
ments is constrained only by data dependencies, and that
execution location is left to language implementation, with
program variables logically accessible to code regardless of
where it executes. That is, program logic can be expressed
without explicit concurrency, communication, or data parti-
tioning. Certain control structures, including conditionals and
explicit wait statements, add additional control flow dependen-
cies to code, while annotations can provide hints or constraints
for data or task placement. Two types of loops are available:
foreach loops, for parallel iteration, and for loops, where
iterations are pipelined, with data passed from one iteration
to the next. Swift/T also supports unbounded recursion.
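A schematic sketch of the two loop types (f, step, s0, and N are hypothetical placeholders, and the for-loop syntax is illustrative rather than definitive):

  // foreach: iterations may all execute in parallel, ordered
  // only by data dependencies among their reads and writes
  foreach i in [1:N] {
    A[i] = f(i);
  }

  // for: iterations are pipelined; the loop-carried variable s
  // computed in one iteration is passed to the next
  for (int i = 1, int s = s0; i <= N; i = i + 1, s = step(s, i)) {
    trace(i, s);
  }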
Swift/T can guarantee deterministic execution even with
implicit parallelism because its standard data types are mono-
tonic; that is, they cannot be mutated in such a way that
information is lost or overwritten. A monotonic variable starts
off empty, then incrementally accumulates information until
it is frozen, whereupon it cannot be further modified. One
can construct a wide variety of monotonic data types [11],
[17]. Basic Swift/T variables are single-assignment I-vars [21],
which are frozen when first assigned. Composite monotonic
data types can be incrementally assigned in parts but cannot
be overwritten. Programs that attempt to overwrite data will
fail at runtime (or compile time if the compiler determines
that the write is definitely erroneous). Swift/T programs using
only monotonic variables are deterministic by construction,
up to the order of side-effects such as I/O. For example, the
output value of an arbitrarily complex function involving many
data and control structures is deterministic, but the order in
which debug print statements execute depends on the nonde-
terministic order in which tasks run. Further nondeterminism
is introduced only by non-Swift/T code, library functions such
as rand(), or by rarely-used nonmonotonic data types that
are outside the scope of this paper.
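For example, with an I-var and a monotonic array (a minimal sketch):

  int x;
  x = 1;      // first assignment; x is now frozen
  // x = 2;   // would be rejected: overwriting frozen data fails
              // at runtime or, if provable, at compile time

  int A[];
  A[0] = 10;  // composite monotonic data is assigned in parts,
  A[1] = 20;  // in any order, but no element may be written twice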

blob models[], res[][];
foreach m in [1:N_models] {
  models[m] = load(sprintf("model%i.data", m));
}

foreach i in [1:M] {
  foreach j in [1:N] {
    // initial quick evaluation of parameters
    p, m = evaluate(i, j);
    if (p > 0) {
      // run ensemble of simulations
      blob res2[];
      foreach k in [1:S] {
        res2[k] = simulate(models[m], i, j, k);
      }
      res[i][j] = summarize(res2);
    }
  }
}

// Summarize results to file
foreach i in [1:M] {
  file out<sprintf("output%i.txt", i)>;
  out = analyze(res[i]);
}
(a) Implicitly parallel Swift/T code.
(b) Visualization of optimized parallel tasks and data dependencies for parameters M = 2, N = 2, S = 3. Tasks and data are mapped dynamically to compute resources at runtime.
Fig. 2: An application, an amalgam of several real scientific applications, that runs an ensemble of simulations for many parameter combinations. The code executes with implicit parallelism, ordered by data dependencies. Data dependencies are implied by reads and writes to scalar variables (e.g. p and m) and associative arrays (e.g. models and res). Swift/T semantics allow functions (e.g. load, evaluate, and simulate) to execute in parallel when execution resources are available and data dependencies are satisfied.
Fig. 3: Runtime process layout on a distributed-memory system. Processes are divided into workers and servers, which are then mapped onto the processes of multi-core systems.
The sparse dynamically sized array is the main composite
data type in Swift/T. Integer indices are the default, but
other index types including strings are supported. The array
can be assigned all at once (e.g., int A[] = f();), or
in parts (e.g., int A[]; A[i] = a; A[j] = b;). The
array lookup operation A[i] will return when A[i] is set.
An incomplete array lookup does not prevent progress; other
statements can execute concurrently.
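Putting these fragments together (a minimal sketch; the literal values are arbitrary):

  int A[];
  int B[];
  A[0] = 5;         // partial assignments may arrive in any order
  A[2] = 7;
  B[0] = A[2] + 1;  // returns once A[2] is set
  B[1] = A[0] + 1;  // not blocked by still-unwritten cells of A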
C. Massively Scalable Data-Driven Task Parallelism
The ADLB [18] and Turbine [35] runtime libraries provide
the runtime support for massively scalable data-driven task
parallelism on an MPI-2 or MPI-3 communication layer [31].
In this runtime system, MPI processes are divided into
two roles: workers and servers, which can be laid out in
various ways, for example with one server process allocated
to each shared-memory node, as shown in Figure 3. Worker
processes execute any program logic, coordinating with each
other through remote execution of data and task operations
on servers, as shown in Figure 4. These operations are
low-latency, typically taking microseconds to process, which
minimizes delays to worker processes. If needed, parallel
MPI functions can be executed by worker processes that
are dynamically grouped into “teams” with a shared MPI
communicator [37].

Fig. 4: Runtime architecture showing distributed worker processes coordinating through task and data operations. Ready/waiting tasks and shared data items are stored on servers, with each server storing a subset of tasks and data. Servers must communicate to redistribute tasks through work stealing, and to request/receive notifications about data availability on other servers.
The data functionality includes rich data structures such as scalar values, strings, binary blobs, structs, and associative arrays, providing the primitives needed to implement Swift's monotonic data types as shared data items. Memory management of this data is supported using read and write reference counters for each data item, allowing unused data to be deleted and frozen data to be made read-only. The task functionality implements a scalable distributed task queue, with load balancing using randomized work stealing between servers. Task data dependencies are supported, so that tasks can be released when data is frozen, at the granularity of an entire data structure or individual array subscripts, as shown in Figure 4. Figure 5 illustrates the scalability and task throughput of Swift/T programs using the runtime system on the Blue Waters supercomputer, where Swift/T achieved a peak throughput of 1.47 billion tasks/s on 524,288 cores running the Sweep benchmark described later in Section IV. Tasks of 1 ms or more achieve high efficiency because the servers are lightly loaded and queuing delays are minimal.

Fig. 5: Throughput and scaling of runtime system for varying task durations.

foreach i in [1:N] {
  foreach j in [1:M] {
    a, b, c = A[i-1][j-1], A[i-1][j], A[i][j-1];
    A[i][j] = h(f(g(a)), f(g(b)), f(g(c)));
  }
}

Fig. 6: Swift code fragment illustrating the wavefront pattern.
III. COMPILER OPTIMIZATION
STC is a whole-program optimizing compiler for Swift/T
that targets the distributed runtime described previously.
Within STC we have implemented optimizations aimed at
reducing communication and synchronization without loss of
parallelism (Section III-A). An intermediate representation for
the program captures the execution model (Section III-B),
allowing optimization of synchronization, shared data, and
reference counting (Sections III-C, III-E, III-F, respectively).
A. Optimization Goals for Data-driven Task Parallelism
To optimize a wide range of data-driven task parallelism
patterns, we need compiler optimization techniques that can
understand the semantics of task parallelism and monotonic
variables in order to perform major transformations of the task
structure of programs to reduce synchronization and commu-
nication at runtime, while preserving parallelism. Excessive
runtime operations impair program efficiency because tasks
waste time waiting for communication; they can also impair
scalability by causing bottlenecks for data or task queues.
Fig. 7: STC compiler architecture. The frontend produces IR-1, to which optimization passes are applied to produce successively more optimized IR-1 trees. Postprocessing adds intertask data passing and reference counting information to produce IR-2 for code generation.

The implicitly parallel Swift/T code in Figure 6 illustrates
the opportunities and challenges of optimization. The code
specifies a dynamic, data-driven wavefront pattern of paral-
lelism, where evaluation of cell values is dynamically sched-
uled based on data availability at runtime, allowing execution
to adapt to variable task latencies. Two straightforward trans-
formations give immediate improvements: representing input
parameters such as i and j as regular local variables rather
than shared monotonic variables and hoisting the lookups of
A[i-1] and A[i] out of the inner loop body. The real
challenge, however, is in efficiently resolving implied data
dependencies between loop iterations. The naïve approach uses
three data dependencies per input cell; but with this strategy,
synchronization can quickly become a bottleneck. Smarter
approaches can identify common inputs of neighboring cells
to avoid redundant data reads, or defer task spawns until input
data is available: if the task for (i-1, j) spawns the task for
(i, j), only grid cell A[i][j-1] must be resolved at runtime
since both other inputs were available at (i-1, j). The charac-
teristics of the f, g, and h functions also affect performance of
different parallelization schemes. Fusing f and g invocations
is a clear improvement because no parallelism is lost; but,
depending on function runtimes and other factors, the optimal
parallel structure is not immediately obvious. To maximize
parallelism, we would implement the loop body invocations as
three independent f(g(...)) tasks that produce the input
data for a h(...) task. To minimize runtime overhead, on
the other hand, we would merge these four tasks into a single
task that executes the f(g(...)) calls sequentially.
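In source-level terms, the trade-off amounts to choosing between two schematic forms of the loop body (h_fused is a hypothetical fused function; the actual transformation operates on the IR, not the source):

  // maximum parallelism: three f(g(...)) tasks may run
  // concurrently, each producing one input of h
  A[i][j] = h(f(g(a)), f(g(b)), f(g(c)));

  // minimum overhead: one task evaluates the three chains
  // sequentially and then calls h, eliminating three task
  // spawns and the associated synchronization
  A[i][j] = h_fused(a, b, c);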
B. Intermediate Representation
The STC compiler uses a medium-level intermediate rep-
resentation (IR) that captures the execution model of data-
driven task parallelism. Two IR variants are used by stages
of the compiler (Figure 7). IR-1 is generated by the compiler
frontend and then optimized. IR-2 includes additional infor-
mation for code generation: explicit bookkeeping for reference
counts and data passing to child tasks. Sample IR-1 code for a
parallel, recursive Fibonacci calculation is shown in Figure 8.
Each IR procedure is structured as a tree of blocks. Each
block is represented as a sequence of statements. State-
ments are either composite conditional statements or single
IR instructions operating on input/output variables, giving a
flat, simple-to-analyze representation. Control flow is repre-
sented with high-level structures: if statements, foreach loops,
do/while loops, and so forth. The statements in each block

execute sequentially, but blocks within some control-flow
structures execute asynchronously and some IR instructions
spawn asynchronous tasks. Data-dependent execution is im-
plicit in some asynchronous IR instructions or explicit with
wait statements that execute a code block after a set of
variables is frozen.
Variables are either single-assignment locally stored values
or references to shared data, that is, unique identifiers used
to locate data on a remote process. A reference is either the
initial reference to a variable allocated in the block or an alias
obtained by duplicating a reference, by acquiring a reference
stored in a data structure, or by subscripting a composite
variable. Shared monotonic data is a first-class construct in
the IR, so optimizations can exploit monotonic semantics.
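As a small illustration in the notation of Figure 8 (a schematic sketch, not actual compiler output), a data-dependent addition z := x + y over shared integers x and y could be expressed implicitly or explicitly:

  // implicit: AsyncOp defers execution until x and y are frozen
  AsyncOp <plus_int> z x y

  // explicit: a wait statement defers the block, which then
  // operates on loaded local values
  wait (x, y) {
    declare $int v_x, $int v_y, $int v_z
    LoadScalar v_x x
    LoadScalar v_y y
    LocalOp <plus_int> v_z v_x v_y
    StoreScalar z v_z
  }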
C. Adaptation of Traditional Optimizations
The foundation of our suite of optimizations is a range
of traditional optimization techniques adapted from conven-
tional compilers [19] to our intermediate representation and
execution model in general. This required substantial changes
to many of the techniques, particularly to generalize them to
monotonic variables, and also to be able to effectively optimize
across task boundaries and with concurrent semantics.
STC includes a powerful value numbering [19] (VN) analysis
that discovers congruence relations in an IR function
between various expression types, including variables, array
cells, constants, arithmetic expressions, and function calls.
Annotations on functions, including standard library and user
functions, assist this optimization. For example, the annotation
@pure asserts that a function output is deterministic and
that it has no side-effects. The VN pass identifies congruence
relations for each IR block. Value congruence, for example
retrieve(x) =V y*2 =V 6, means that multiple expressions
have the same value. Alias congruence, for example
y =A z =A A[0], means that IR variables refer to the same runtime
shared data. Alias congruence implies value congruence. A
relation for a block B applies to B and all descendant
blocks, because of the monotonicity of IR variables. A set
of expressions congruent in B defines a congruence class.
STC's VN implementation visits all IR instructions in an
IR function with a reverse postorder tree walk. Each IR
instruction, for example StoreInt A 1, can yield congruence
relations: in this case A =V store(1) and 1 =V retrieve(A).
These new relations are added to the known
relations, perhaps merging existing congruence classes. For
example, if B =V store(1), then A =V B. Erroneous user
code that double-assigns a variable forces VN to abort, since
the correctness of the analysis depends on each variable having
a consistent value. Congruence relations in a block always
apply to descendant blocks. We also propagate congruence
relations upward to parent blocks in the case of conditional
statements. For example, if x =V 1 on both branches of
an if statement, it is propagated to the parent. We create
temporary variables if necessary to do this; for example, if
x =V retrieve(A) and y =V retrieve(A) on the branches, a
temporary variable holding retrieve(A) can be introduced in the parent block.
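A source-level illustration of what VN enables (schematic; f is a hypothetical function annotated @pure):

  // before: the two right-hand sides fall into one congruence
  // class, since @pure f is deterministic and side-effect free
  int x = f(a) * 2;
  int y = f(a) * 2;

  // after (conceptually): the congruent expression is computed once
  int t = f(a) * 2;
  int x = t;
  int y = t;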
main () {
  int n = argv("n");  // Get command line argument
  int f = fib(n);

  // Print result once computation finishes
  printf("fib(%i)=%i", n, f);
}

(int o) fib (int i) {
  if (i == 0) {
    o = 0;
  } else if (i == 1) {
    o = 1;
  } else {
    // Compute fib(i-1) and fib(i-2) concurrently
    o = fib(i - 1) + fib(i - 2);
  }
}
(a) Swift/T code for recursive Fibonacci.
() @main () {
  declare $int v_n, int f                // variables for block
  CallExtLocal argv [ v_n ] [ "n" ]      // call to argv
  Call fib [ f ] [ v_n ]                 // fib runs asynchronously
  wait (f) {                    // execute block once f is frozen
    declare $int v_f
    LoadScalar v_f f            // Load value of f to v_f
    CallExtLocal printf [ ] [ "fib(%i)=%i" v_n v_f ]
  }
}

// Compute o := fibonacci(i)
// input v_i is a value, output o is a shared variable
(int o) @fib ($int v_i) {
  declare $boolean t0
  LocalOp <eq_int> t0 v_i 0        // t0 := (v_i == 0)
  if (t0) {
    StoreScalar o 0                // o := 0
  } else {
    declare $boolean t2
    LocalOp <eq_int> t2 v_i 1      // t2 := (v_i == 1)
    if (t2) {
      StoreScalar o 1              // o := 1
    } else {
      declare $int v_i1, $int v_i2, int f1, int f2
      LocalOp <minus_int> v_i1 v_i 1   // v_i1 := v_i - 1
      // Fib calls run asynchronously
      Call fib [ f1 ] [ v_i1 ]
      LocalOp <minus_int> v_i2 v_i 2   // v_i2 := v_i - 2
      Call fib [ f2 ] [ v_i2 ]
      // Compute sum once f1, f2 assigned
      AsyncOp <plus_int> o f1 f2       // o := f1 + f2
    }
  }
}
(b) IR-1 optimized at -O2.
Fig. 8: Sample Swift/T program and corresponding IR for the recursive Fibonacci algorithm. The IR comprises two functions, main and fib. IR instructions include Swift function calls (e.g. Call fib), foreign function calls (e.g. CallExtLocal printf), immediate arithmetic operations (e.g. LocalOp <eq_int>), data-dependent arithmetic operations (e.g. AsyncOp <plus_int>), and reads and writes of shared data items (LoadScalar and StoreScalar, respectively). Control flow constructs used include conditional if statements and wait statements for data-dependent execution.

References
• Steven Muchnick. Advanced Compiler Design and Implementation. Book.
• StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Journal article.
• Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. Proceedings article.
• Code Generation in the Polyhedral Model Is Easier Than You Think. Proceedings article.
• Workflow scheduling algorithms for grid computing. Book chapter.