Edinburgh Research Explorer
Generating Performance Portable Code using Rewrite Rules:
From High-Level Functional Expressions to High-Performance
OpenCL Code
Citation for published version:
Steuwer, M, Fensch, C, Lindley, S & Dubach, C 2015, Generating Performance Portable Code using
Rewrite Rules: From High-Level Functional Expressions to High-Performance OpenCL Code. in
Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming. ACM
SIGPLAN Notices, no. 9, vol. 50, ACM, Vancouver, BC, Canada, pp. 205-217, 20th ACM SIGPLAN
International Conference on Functional Programming, Vancouver, British Columbia, Canada, 31/08/15.
https://doi.org/10.1145/2784731.2784754
Digital Object Identifier (DOI):
10.1145/2784731.2784754
Link:
Link to publication record in Edinburgh Research Explorer
Document Version:
Peer reviewed version
Published In:
Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming

Generating Performance Portable Code using Rewrite Rules
From High-Level Functional Expressions to High-Performance OpenCL Code
Michel Steuwer
The University of Edinburgh (UK)
University of Münster (Germany)
michel.steuwer@ed.ac.uk
Christian Fensch
Heriot-Watt University (UK)
c.fensch@hw.ac.uk
Sam Lindley
The University of Edinburgh (UK)
sam.lindley@ed.ac.uk
Christophe Dubach
The University of Edinburgh (UK)
christophe.dubach@ed.ac.uk
Abstract
Computers have become increasingly complex with the emergence
of heterogeneous hardware combining multicore CPUs and GPUs.
These parallel systems exhibit tremendous computational power
at the cost of increased programming effort resulting in a tension
between performance and code portability. Typically, code is either
tuned in a low-level imperative language using hardware-specific
optimizations to achieve maximum performance or is written in a
high-level, possibly functional, language to achieve portability at
the expense of performance.
We propose a novel approach aiming to combine high-level pro-
gramming, code portability, and high-performance. Starting from a
high-level functional expression we apply a simple set of rewrite
rules to transform it into a low-level functional representation, close
to the OpenCL programming model, from which OpenCL code is
generated. Our rewrite rules define a space of possible implementa-
tions which we automatically explore to generate hardware-specific
OpenCL implementations. We formalize our system with a core
dependently-typed λ-calculus along with a denotational semantics
which we use to prove the correctness of the rewrite rules.
We test our design in practice by implementing a compiler
which generates high performance imperative OpenCL code. Our
experiments show that we can automatically derive hardware-
specific implementations from simple functional high-level al-
gorithmic expressions offering performance on a par with highly
tuned code for multicore CPUs and GPUs written by experts.
Categories and Subject Descriptors D.3.2 [Programming Languages]: Language Classifications: Applicative (functional) languages; Concurrent, distributed, and parallel languages; D.3.4 [Processors]: Code generation; Compilers; Optimization
Keywords Algorithmic patterns, rewrite rules, performance porta-
bility, GPU, OpenCL, code generation
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
ICFP ’15, August 31 – September 2, 2015, Vancouver, BC, Canada.
Copyright © 2015 ACM 978-1-4503-3669-7/15/08...$15.00.
http://dx.doi.org/10.1145/2784731.2784754
1. Introduction
In recent years, graphics processing units (GPUs) have emerged as
the workhorse of high-performance computing. These devices offer
enormous raw performance but require programmers to have a
deep understanding of the hardware in order to maximize perfor-
mance. This means software is written and tuned on a per-device
basis and needs to be adapted frequently to keep pace with ever
changing hardware.
Programming models such as OpenCL offer the promise of
functional portability of code across different parallel processors.
However, performance portability often remains elusive; code
achieving high performance for one device might only achieve a
fraction of the available performance on a different device. Figure 1
illustrates this problem by showing how a parallel reduce (a.k.a.
fold) implementation, written and optimized for one particular de-
vice, performs on other devices. Three implementations have been
tuned to maximize performance on each device: the Nvidia_opt
and AMD_opt implementations are tuned for the Nvidia and AMD
GPU respectively, implementing tree-based reduce using an itera-
tive approach with carefully specified synchronization primitives.
The Nvidia_opt version utilizes the local (a.k.a. shared) memory to
store intermediate results and exploits a hardware feature of Nvidia
GPUs to avoid certain synchronization barriers. The AMD_opt
version does not perform these two optimizations but instead uses
vectorized operations. The Intel_opt parallel implementation, tuned
for an Intel CPU, also relies on vectorized operations. However, it
uses a much coarser form of parallelism with fewer threads, in
which each thread performs more work.
[Figure 1 is a bar chart of relative performance (0.0–1.0) for the Nvidia_opt, AMD_opt, and Intel_opt implementations on an Nvidia GPU, an AMD GPU, and an Intel CPU; Nvidia_opt fails on the Intel CPU.]
Figure 1: Performance is not portable across devices. Each bar
represents the device-specific optimized implementation of a paral-
lel reduce implemented in OpenCL and tuned for an Nvidia GPU,
AMD GPU, and Intel CPU respectively. Performance is normalized
with respect to the best implementation on each device.

Figure 1 shows the performance achieved by each implementa-
tion on three different devices. Running an implementation which
has been optimized on a different device leads to suboptimal per-
formance in all cases. Consider the AMD_opt implementation, for
instance, where we see that the performance loss is 20% when run-
ning on the Nvidia GPU and 90% (i.e., 10× slower) when running
on the CPU. The CPU optimized version, Intel_opt, achieves less
than 20% (i.e., 5× slower) when run on a GPU. Finally, it is worth
noting that the Nvidia_opt version, which performs quite badly on
the AMD GPU, actually fails to execute correctly on the CPU. This
is due to a low-level optimization which removes synchronization
barriers which can be avoided on the GPU, but are required on the
CPU for correctness.
This lack of performance portability is mainly due to the low-
level nature of the programming model; the dominant programming
interfaces for parallel devices such as GPUs expose programmers
to many hardware-specific details. As a result, programming be-
comes complex, time-consuming, and error prone.
Several high-level programming models have been proposed
to tackle the programmability issue and shield programmers from
low-level hardware details. High-level dataflow programming
languages such as StreamIt [25] and LiquidMetal [19] allow the pro-
grammer to easily express different implementations at the algo-
rithm level. Nvidia’s NOVA [12] language takes a more functional
approach in which higher-order functions such as map and reduce
are expressed as primitives recognized by the backend compiler.
Similarly, Accelerate [9] allows the programmer to write high-level
functional code in a DSL embedded in Haskell, and automatically
generate CUDA code for the GPU. For instance, the parallel reduce
discussed earlier would be written in Accelerate as:
sum xs = fold (+) 0 (use xs)
These kinds of approaches hide the complexity of parallelism
and low-level optimizations from the user. However, they rely on
hard-coded device-specific implementations or heuristics to drive
the optimization process. When targeting different devices, the
library implementation or backend compiler has to be re-tuned or
even worse, re-engineered. In order to address the performance
portability issue, we aim to develop mechanisms that can effec-
tively explore device-specific optimizations. The core idea is not
to commit to a specific implementation or set of optimizations but
instead to let a tool automate the process.
In this paper we present an approach which compiles a high-
level functional expression similar to the one written in Accel-
erate into highly optimized device-specific OpenCL code. We
show that we achieve performance on a par with expert-written
implementations on an Intel multicore CPU, an AMD GPU, and
an Nvidia GPU. Central to our approach is a set of rewrite rules
that systematically translate high-level algorithmic concepts into
low-level hardware paradigms, both expressed in a functional style.
The rewrite rules are based on the kind of algebraic reasoning
well-known to functional programmers, and pioneered by Bird [5]
and others in the 1980s. They are used to systematically transform
programs into a low-level representation, from which high perfor-
mance code is generated automatically.
The power of our technique lies in the rewrite rules, written once
by an expert system designer. These rules encode the different al-
gorithmic choices and low-level hardware specific optimizations.
The rewrite rules play the dual role of enabling the composition
of high-level algorithmic concepts and enabling the mapping of
these onto hardware paradigms, but also critically provide correct-
ness preserving exploration of the implementation space. The rules
enable a clear separation of concerns between high-level algorith-
mic concepts and low-level hardware paradigms while using a uni-
fied framework. The defined implementation space is automatically
searched to produce high performance code.
[Figure 2 is a diagram: high-level expressions (e.g., dot product, vector reduce, BlackScholes) built from algorithmic primitives (map, reduce, iterate, split, join, reorder, ...) are transformed, via exploration with rewrite rules encoding algorithmic choices and hardware optimizations, into low-level expressions built from OpenCL primitives (map-workgroup, map-local, toLocal, vectorize, ...); code generation maps these onto hardware paradigms (workgroups, local memory, barriers, vector units).]
Figure 2: The programmer expresses the problem with high-level
algorithmic primitives. These are systematically transformed into
low-level primitives using a rule rewriting system. OpenCL code
is generated by mapping the low-level primitives directly to the
OpenCL programming model representing hardware paradigms.
This paper demonstrates that our approach yields high-performance
code with OpenCL as our target hardware platform. We compare
the performance of our approach with highly-tuned linear algebra
functions extracted from state-of-the-art libraries and with bench-
marks such as BlackScholes. We express them as compositions of
high-level algorithmic primitives which are systematically mapped
to low-level OpenCL primitives.
The primary contributions of our paper are as follows:
a collection of high-level functional algorithmic primitives
for the programmer and low-level functional OpenCL primi-
tives representing the OpenCL programming model;
a core dependently-typed calculus and denotational semantics;
a set of rewrite rules that systematically express algorithmic
and optimization choices, bridging the gap between high-level
functional programs and OpenCL;
proofs of the soundness of the rewrite rules with respect to the
denotational semantics;
achieving performance portability by systematically applying
rewrite rules to yield device-specific implementations, with per-
formance on a par with the best hand-tuned versions.
The remainder of the paper is structured as follows. Section 2
provides an overview of our technique. Sections 3 and 4 present
our functional primitives and rewrite rules. Section 5 presents a
core language and denotational semantics, which we use to jus-
tify the rewrite rules. Section 6 explains our automatic search strat-
egy, while Section 7 introduces our benchmarks. Our experimental
setup and performance results are shown in Sections 8 and 9. Fi-
nally, Section 10 discusses related work and Section 11 concludes.
2. Overview
The overview of our approach is presented in Figure 2. The pro-
grammer writes a high-level expression composed of algorithmic
primitives. Using rewriting rules, we map this high-level expres-
sion into a low-level expression consisting of OpenCL primitives. In
the rewriting stage, different algorithmic and optimization choices
can be explored. The generated low-level expression is then fed
into our code generator that emits an OpenCL program compiled
to machine code by the vendor provided OpenCL compiler.

λ xs . map (λ x . x * 3) xs
(a) High-level expression written by the programmer.
rewrite rules
λ xs . (join ∘ mapWorkgroup (joinVec ∘
mapLocal (mapVec (λ x . x * 3))
∘ splitVec 4) ∘ split 1024) xs
(b) Low-level expression derived using rewrite rules and search.
code generator
1 int4 mul3(int4 x) { return x * 3; }
2 kernel vectorScal(global int* in,out, int len){
3 for (int i=get_group_id; i < len/1024;
4 i+=get_num_groups) {
5 global int* grp_in = in+(i*1024);
6 global int* grp_out = out+(i*1024);
7 for (int j=get_local_id; j < 1024/4;
8 j+=get_local_size) {
9 global int4* in_vec4 =(int4*)grp_in+(j*4);
10 global int4* out_vec4=(int4*)grp_out+(j*4);
11 *out_vec4 = mul3(*in_vec4);
12 } } }
(c) OpenCL program produced by our code generator.
Figure 3: Pseudo-code representing vector scaling. The user maps
a function multiplying an element by 3 over the input array (a). This
high-level expression is transformed into a low-level expression (b)
using rewrite rules in a search process. Finally, our code generator
turns the low-level expression into an OpenCL program (c).
We illustrate the mechanisms of our approach using a simple
vector scaling example shown in Figure 3. The user expresses
the computation by writing a high-level expression using the map
primitive as shown in Figure 3a. Our expressions are glued together
with lambda abstractions and function composition; we formally
define the syntax in Section 5.
Our technique first rewrites the high-level expression into a
low-level expression closer to the OpenCL programming model.
This is achieved by applying the rewrite rules presented later in
Section 4, possibly using an automatic search strategy discussed in
Section 6. Figure 3b shows one possible derivation of the original
high-level expression. Starting from the last line, the input (xs) is
split into chunks of 1024 elements. Each chunk is mapped onto a
group of threads, called a workgroup, with the mapWorkgroup low-
level primitive. Within a workgroup, we pack every 4 elements into a
SIMD vector; each vector is mapped to a local thread inside the
workgroup via the mapLocal primitive. Finally, the mapVec primitive applies
the vectorized form of the user defined function. The exact meaning
of our primitives will be given later in Section 3.
The last step consists of traversing the low-level expression
and generating OpenCL code for each low-level primitive encoun-
tered (Figure 3c). The two map primitives generate the for-loops
(line 3–4 and 7–8) that iterate over the input array assigning work
to the workgroups and local threads. The information of how many
chunks each workgroup and thread processes comes from the corre-
sponding split. In line 11 the vectorized version of the user defined
function (mul3 defined in line 1) is finally applied to the input array.
To summarize, our approach is able to generate OpenCL code
starting from a high-level program representation. This is achieved
by systematically transforming the high-level expression into a
low-level form suitable for code generation using an automated
search process.
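To see that such a derivation preserves meaning, it helps to model the primitives directly. The following Python sketch (an illustration of the semantics, not the authors' implementation) models arrays as lists and checks that the split/join decomposition of Figure 3b computes the same result as the plain map:

```python
def split(n, xs):
    # split n : [A]_{n*I} -> [[A]_n]_I  (chunk into sublists of length n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    # join : [[A]_I]_J -> [A]_{I*J}  (flatten one level of nesting)
    return [x for xs in xss for x in xs]

def mul3(x):
    return x * 3

xs = list(range(4096))

# High-level expression: map (λx. x * 3) xs
high = [mul3(x) for x in xs]

# Shape of the derived low-level expression, reading every mapX as a
# plain map and modelling splitVec/joinVec by a second split/join level:
low = join([join([[mul3(x) for x in vec]          # mapVec (λx. x * 3)
                  for vec in split(4, chunk)])    # mapLocal ... splitVec 4
            for chunk in split(1024, xs)])        # mapWorkgroup ... split 1024

assert low == high
```

Because each rewrite rule preserves this list-level semantics, any derivation reachable in the search space computes the same function as the original expression.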
map_{A,B,I} : (A → B) → [A]_I → [B]_I
zip_{A,B,I} : [A]_I → [B]_I → [A × B]_I
reduce_{A,I} : ((A × A) → A) → A → [A]_I → [A]_1
split_{A,I} : (n : size) → [A]_{n×I} → [[A]_n]_I
join_{A,I,J} : [[A]_I]_J → [A]_{I×J}
iterate_{A,I,J} : (n : size) → ((m : size) → [A]_{I×m} → [A]_m) → [A]_{I^n×J} → [A]_J
reorder_{A,I} : [A]_I → [A]_I

Figure 4: High-level algorithmic primitives.
3. Algorithmic and OpenCL Primitives
A key idea of this paper is to expose algorithmic choices and
hardware-specific program optimizations in a functional style. This
allows for systematic transformations using a collection of rewrite
rules (Section 4). The high-level algorithmic primitives can either
be used by the programmer directly, as a stand-alone language (or
embedded DSL), or be used as an intermediate representation tar-
geted by another language. Once a program is represented by our
high-level primitives, we can automatically transform it into low-
level hardware primitives. These represent hardware-specific fea-
tures in a programming model such as OpenCL, the target chosen
for this paper. Following the same approach, a different set of low-
level primitives might be designed to target other low-level pro-
gramming models such as MPI.
In this section we give a high-level account of the primitives;
Section 5 gives a more formal account. Figure 4 and 5 present our
algorithmic and OpenCL primitives. The type system we present
here is monomorphic (largely to keep the formal presentation in
Section 5 simple), however, we do rely on a restricted form of
dependent types. The only kind of type-dependency we allow is
for array types, whose size may depend on a run-time value. Type
inference is beyond the scope of this paper, but in the future we
intend to apply ideas from systems such as DML [45] to our setting.
We let I range over sizes. A size can be a size variable m, n, a
natural number i, or a product I × J or power I^J of sizes I and J.
We let A, B range over types. We write A → B for a function from
type A to type B and (n : size) → B for a dependent function
from size n to type B (where B may include array types whose
sizes depend on n). We write A × B for the product of types A
and B and 1 for the unit type. We write [A]_I for an array of size
I with elements of type A. The primitives are annotated with type
and size subscripts. Thus, formally each one actually represents a
type-indexed family of primitives. We often omit subscripts when
they are not relevant or can be trivially inferred.
3.1 Algorithmic Primitives
As in Accelerate [9, 30], we deliberately restrict ourselves to a
set of primitives for which we know that high performance CPU
and GPU implementations exist. In contrast to Accelerate, we al-
low nesting of primitives to express nested parallelism. Nesting of
arrays is used to represent multi-dimensional data structures like
matrices. Figure 4 presents the high-level primitives used to define
programs at the algorithmic level. The map and zip primitives are
standard.
The reduce primitive is a special case of a fold returning a single
reduced element in an array of size 1. We assume the supplied
function is associative and commutative in order to admit efficient
parallel implementations. Returning the result as an array with a
single element allows for a more compositional design, in which
our primitives operate on arrays rather than scalar values.
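This array-in, array-out behaviour of reduce can be written down directly. A minimal Python sketch of the intended semantics (not the parallel implementation; the supplied operator is assumed associative and commutative, as the paper requires):

```python
import functools

def reduce_prim(f, z, xs):
    # reduce : ((A × A) → A) → A → [A]_I → [A]_1
    # f is assumed associative and commutative so that parallel
    # implementations are admissible; the result is a one-element
    # array, keeping the primitive array-to-array and compositional.
    return [functools.reduce(f, xs, z)]

assert reduce_prim(lambda a, b: a + b, 0, [1, 2, 3, 4]) == [10]
```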

mapWorkgroup_{A,B,I} : (A → B) → [A]_I → [B]_I
mapLocal_{A,B,I} : (A → B) → [A]_I → [B]_I
mapGlobal_{A,B,I} : (A → B) → [A]_I → [B]_I
mapSeq_{A,B,I} : (A → B) → [A]_I → [B]_I
toLocal_{A,B} : (A → B) → (A → B)
toGlobal_{A,B} : (A → B) → (A → B)
reduceSeq_{A,B,I} : ((A × B) → A) → A → [B]_I → [A]_1
reducePart_{A,I} : ((A × A) → A) → A → (n : size) → [A]_{I×n} → [A]_n
reorderStride_{A,I} : (n : size) → [A]_{n×I} → [A]_{n×I}
mapVec_{A,B,I} : (A → B) → ⟨A⟩_I → ⟨B⟩_I
splitVec_{A,I} : (n : size) → [A]_{n×I} → [⟨A⟩_n]_I
joinVec_{A,I,J} : [⟨A⟩_I]_J → [A]_{I×J}

Figure 5: Low-level OpenCL primitives used for code generation.
The split and join primitives transform the shape of array data.
The expression split n xs transforms array xs of size n × I, with
elements of type A, into an array of size I with elements that are
A arrays of size n; join is the inverse of split. (In practice A itself
may be an array type, in which case we can view split as adding a
dimension to and join as subtracting a dimension from a matrix.)
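The inverse relationship is easy to state concretely. A small Python sketch of the two primitives' semantics (lists model arrays; not the authors' implementation):

```python
def split(n, xs):
    # split n : [A]_{n×I} -> [[A]_n]_I
    assert len(xs) % n == 0
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    # join : [[A]_I]_J -> [A]_{I×J}
    return [x for xs in xss for x in xs]

xs = list(range(8))
assert split(4, xs) == [[0, 1, 2, 3], [4, 5, 6, 7]]
assert join(split(4, xs)) == xs   # join is the inverse of split
# Viewed dimensionally: split turns an 8-vector into a 2×4 matrix,
# and join flattens the matrix back into a vector.
```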
The iterate primitive repeatedly applies a given function. The
expression iterate n f applies the function f repeatedly n times.
The type of iterate is instructive. The function f may change the
length of the processed array at each iteration step. We currently
restrict the length to stay the same or shrink in each iteration by a
fixed factor (given by the implicit subscript I), which is sufficient
to express, e.g., iterative reduce (see Section 4). We intend to lift
this restriction in the future, which will probably require a richer
type system. Given n, the type of iterate expresses that the input
array will shrink by a factor of I^n.
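The shrinking behaviour of iterate can be demonstrated with an iterative reduce, the motivating example above. A Python sketch of the semantics (lists model arrays; the pairwise-sum step is an illustrative choice, not fixed by the primitive):

```python
def iterate(n, f, xs):
    # iterate n f : apply f to the array n times. Here f shrinks its
    # input by a fixed factor I = 2 each step, so an array of size
    # I^n × J ends up with J elements.
    for _ in range(n):
        xs = f(xs)
    return xs

def halve_sum(xs):
    # one tree-reduction step: pairwise sums, halving the length
    return [xs[2 * i] + xs[2 * i + 1] for i in range(len(xs) // 2)]

assert iterate(3, halve_sum, [1] * 8) == [8]   # 2^3 elements shrink to 1
```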
Finally, the reorder primitive allows the programmer to express
that the order of elements in an array is unimportant, allowing
a number of useful optimizations—as we will see in Section 4.
This primitive bears an obvious similarity to the unordered operation
of the Ferry query language [21], which asserts that the order of
elements in a list is unimportant.
3.2 OpenCL-specific Primitives
In order to achieve high performance on manycore CPUs and
GPUs, programmers often use a set of rules of thumb to drive
the optimization of their application. Each hardware vendor pro-
vides optimization guides [1, 31] that extensively cover hardware
idiosyncrasies and optimizations. The main idea behind our work
is to identify common optimization patterns and express them with
the help of low-level primitives coupled with a rewrite system. Fig-
ure 5 lists the OpenCL-specific primitives we have identified.
Maps Each mapX primitive has the same high-level semantics
as plain map, but represents a specific way of mapping computa-
tions to the hardware and exploiting parallelism in OpenCL. The
mapWorkgroup primitive assigns work to a group of threads, called
a workgroup in OpenCL, with every workgroup applying the given
function on an element of the input array. Similarly, the mapLocal
primitive assigns work to a local thread inside a workgroup. As
workgroups are optional in OpenCL, mapGlobal assigns work to a
thread not organized in a workgroup. This allows us to map com-
putations in different ways to the thread hierarchy. The mapSeq
primitive performs a sequential map within a single thread.
Generating OpenCL code for all of these primitives is simi-
lar; we describe this using mapWorkgroup as an example. A loop
is generated, where the iteration variable is determined by the
workgroup-id function from the OpenCL API. Inside the loop, a
pointer is generated to partition the input array, so that every work-
group calls the given function f on a different chunk of data. An
output pointer is generated similarly. We continue with the body of
the loop by generating the code for the function f recursively. Fi-
nally, an appropriate synchronization mechanism is added for the
given map primitive. For instance, after a mapLocal we add a bar-
rier synchronization for the threads inside the workgroup.
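The loop-skeleton generation described above can be sketched as a toy emitter. This is a hypothetical illustration of the recipe (loop driven by the workgroup id, per-workgroup pointer arithmetic, recursive generation of the body), not the paper's code generator; all names in it are assumptions:

```python
def emit_map_workgroup(chunk_size, emit_body):
    # Hypothetical toy emitter: produce the mapWorkgroup loop skeleton.
    # The iteration variable comes from the OpenCL workgroup id; the
    # pointers partition the input/output so each workgroup sees its
    # own chunk; the body of f is emitted recursively via emit_body.
    lines = [
        f"for (int i = get_group_id(0); i < len/{chunk_size}; "
        f"i += get_num_groups(0)) {{",
        f"  global int* grp_in  = in  + (i * {chunk_size});",
        f"  global int* grp_out = out + (i * {chunk_size});",
    ]
    lines += ["  " + l for l in emit_body("grp_in", "grp_out")]
    lines.append("}")
    return lines

code = emit_map_workgroup(1024, lambda i, o: [f"{o}[j] = f({i}[j]);"])
assert "get_group_id(0)" in code[0]
```

A real generator would additionally append the synchronization barrier appropriate to the map primitive, as described above.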
Local/Global The toLocal and toGlobal primitives are used to
determine where the result of the given function f should be
stored. OpenCL defines two distinct address spaces: global and
local. Global memory is the commonly used large but slow mem-
ory. On GPUs, the small local memory has a high bandwidth with
low latency and is used to store frequently accessed data or for effi-
cient communication between local threads (shared memory). With
these two primitives, we can in effect exploit the memory hierar-
chy defined in OpenCL. These primitives act similarly to a typecast
(their high-level semantics is that of the identity function) and are
in fact implemented as such, so that no code is emitted directly. We
check for incorrect use of these primitives in our implementation.
For example, the implementation checks that a toLocal primitive is
eventually followed by a toGlobal primitive to ensure that the final
result is copied back into global memory, as required by OpenCL.
We plan to extend our type system in the future to track the memory
location of arrays using an effect system.
In our design, every function reads its input and writes its output
using pointers provided by its caller. As a result, we can
force a store to local memory by wrapping any function with the
toLocal function. In the code generator, this will simply change the
output pointer of function f to an area in local memory.
Sequential Reduce The reduceSeq primitive performs a sequen-
tial reduce within a single thread. The generated code consists of
an accumulation variable which is initialized with the given initial
value. A loop is generated iterating over the array and calling the
given function which stores its intermediate result in the accumu-
lation variable. Note that we require the function passed to reduce
to be associative and commutative in order to enable an efficient
parallel implementation. We do not impose the same restriction for
the reduceSeq function, as here we guarantee a sequential order of
execution; thus reduceSeq has a more general type.
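The sequential-order guarantee is what makes the more general type safe. A Python sketch of the reduceSeq semantics (lists model arrays; not the generated OpenCL code), using a deliberately non-commutative operator to show that order is preserved:

```python
def reduce_seq(f, z, xs):
    # reduceSeq : ((A × B) → A) → A → [B]_I → [A]_1
    # Strictly left-to-right accumulation, so f need not be
    # associative or commutative, and A and B may differ.
    acc = z
    for x in xs:          # the generated loop over the array
        acc = f(acc, x)   # intermediate result in the accumulator
    return [acc]

# Order matters for a non-commutative operator -- string append:
assert reduce_seq(lambda a, b: a + str(b), "", [1, 2, 3]) == ["123"]
```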
Partial Reduce The reducePart primitive performs a partial re-
duce, i.e., an array of n elements is reduced to an array of m el-
ements where 1 ≤ m ≤ n. While not directly used to generate
OpenCL code, reducePart is useful as an intermediate representa-
tion for deriving different implementations of reduce as we will see
in the next section.
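One way to read the type [A]_{I×n} → [A]_n concretely is chunk-wise reduction. The Python sketch below is one possible model of reducePart (an assumption for illustration, not the only implementation the rewrite rules permit):

```python
import functools

def reduce_part(f, z, n, xs):
    # reducePart : ((A × A) → A) → A → (n : size) → [A]_{I×n} → [A]_n
    # Model: cut the I×n input into n chunks of size I and fully
    # reduce each chunk, yielding n partial results.
    size = len(xs) // n
    return [functools.reduce(f, xs[i * size:(i + 1) * size], z)
            for i in range(n)]

xs = [1, 2, 3, 4, 5, 6, 7, 8]
assert reduce_part(lambda a, b: a + b, 0, 4, xs) == [3, 7, 11, 15]
assert reduce_part(lambda a, b: a + b, 0, 1, xs) == [36]   # n = 1: full reduce
```

With n = 1 this coincides with reduce, which is what makes reducePart a useful intermediate form for deriving different reduce implementations.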
Reorder Stride The high-level semantics of reorderStride_{A,I} n
is just like reorder_{A,I}. The low-level implementation actually per-
forms a specific reordering in which the array is reordered with a
stride n, that is, element i is mapped to element i/I + n(i % I). In
the generated OpenCL code this primitive ensures that after split-
ting the workload, consecutive threads access consecutive mem-
ory elements (i.e., coalesce memory access), which is beneficial on
modern GPUs as it maximizes memory bandwidth.
Our implementation does not produce code directly, but gener-
ates instead an index function, which is used when accessing the
array the next time. While beyond the scope of this paper, our de-
sign supports user-defined index functions as well.
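The index formula above is easy to check in isolation. A Python sketch applying the mapping i ↦ i/I + n(i % I) eagerly (the real implementation, as described, only generates an index function):

```python
def reorder_stride(n, xs):
    # reorderStride n on an array of size n×I: element i moves to
    # position i/I + n*(i % I) (integer division), so consecutive
    # threads end up accessing consecutive memory elements.
    I = len(xs) // n
    out = [None] * len(xs)
    for i, x in enumerate(xs):
        out[i // I + n * (i % I)] = x
    return out

xs = list(range(8))
assert sorted(reorder_stride(2, xs)) == xs   # it is a permutation
assert reorder_stride(2, xs) == [0, 4, 1, 5, 2, 6, 3, 7]
```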
Vectorization The OpenCL programming model supports SIMD
vector data types such as int4 where any operations on this type
will be executed in the hardware vector units. In the absence of
vector units in the hardware, the OpenCL compiler scalarizes the
code automatically.

References
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters.
S. Che et al. Rodinia: A benchmark suite for heterogeneous computing.
W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A Language for Streaming Applications.
J. Ragan-Kelley et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines.