Edinburgh Research Explorer
Generating Performance Portable Code using Rewrite Rules:
From High-Level Functional Expressions to High-Performance
OpenCL Code
Citation for published version:
Steuwer, M, Fensch, C, Lindley, S & Dubach, C 2015, Generating Performance Portable Code using
Rewrite Rules: From High-Level Functional Expressions to High-Performance OpenCL Code. in
Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming. ACM
SIGPLAN Notices, no. 9, vol. 50, ACM, Vancouver, BC, Canada, pp. 205-217, 20th ACM SIGPLAN
International Conference on Functional Programming, Vancouver, British Columbia, Canada, 31/08/15.
https://doi.org/10.1145/2784731.2784754
Digital Object Identifier (DOI):
10.1145/2784731.2784754
Link:
Link to publication record in Edinburgh Research Explorer
Document Version:
Peer reviewed version
Published In:
Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming

Generating Performance Portable Code using Rewrite Rules
From High-Level Functional Expressions to High-Performance OpenCL Code
Michel Steuwer
The University of Edinburgh (UK)
University of Münster (Germany)
michel.steuwer@ed.ac.uk
Christian Fensch
Heriot-Watt University (UK)
c.fensch@hw.ac.uk
Sam Lindley
The University of Edinburgh (UK)
sam.lindley@ed.ac.uk
Christophe Dubach
The University of Edinburgh (UK)
christophe.dubach@ed.ac.uk
Abstract
Computers have become increasingly complex with the emergence
of heterogeneous hardware combining multicore CPUs and GPUs.
These parallel systems exhibit tremendous computational power
at the cost of increased programming effort resulting in a tension
between performance and code portability. Typically, code is either
tuned in a low-level imperative language using hardware-specific
optimizations to achieve maximum performance or is written in a
high-level, possibly functional, language to achieve portability at
the expense of performance.
We propose a novel approach aiming to combine high-level pro-
gramming, code portability, and high-performance. Starting from a
high-level functional expression we apply a simple set of rewrite
rules to transform it into a low-level functional representation, close
to the OpenCL programming model, from which OpenCL code is
generated. Our rewrite rules define a space of possible implementa-
tions which we automatically explore to generate hardware-specific
OpenCL implementations. We formalize our system with a core
dependently-typed λ-calculus along with a denotational semantics
which we use to prove the correctness of the rewrite rules.
We test our design in practice by implementing a compiler
which generates high performance imperative OpenCL code. Our
experiments show that we can automatically derive hardware-
specific implementations from simple functional high-level al-
gorithmic expressions offering performance on a par with highly
tuned code for multicore CPUs and GPUs written by experts.
Categories and Subject Descriptors D.3.2 [Programming Languages]: Language Classifications: Applicative (functional) languages; Concurrent, distributed, and parallel languages; D.3.4 [Processors]: Code generation; Compilers; Optimization
Keywords Algorithmic patterns, rewrite rules, performance porta-
bility, GPU, OpenCL, code generation
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
ICFP ’15, August 31 – September 2, 2015, Vancouver, BC, Canada.
Copyright © 2015 ACM 978-1-4503-3669-7/15/08...$15.00.
http://dx.doi.org/10.1145/2784731.2784754
1. Introduction
In recent years, graphics processing units (GPUs) have emerged as
the workhorse of high-performance computing. These devices offer
enormous raw performance but require programmers to have a
deep understanding of the hardware in order to maximize perfor-
mance. This means software is written and tuned on a per-device
basis and needs to be adapted frequently to keep pace with ever
changing hardware.
Programming models such as OpenCL offer the promise of
functional portability of code across different parallel processors.
However, performance portability often remains elusive; code
achieving high performance for one device might only achieve a
fraction of the available performance on a different device. Figure 1
illustrates this problem by showing how a parallel reduce (a.k.a.
fold) implementation, written and optimized for one particular de-
vice, performs on other devices. Three implementations have been
tuned to maximize performance on each device: the Nvidia_opt
and AMD_opt implementations are tuned for the Nvidia and AMD
GPU respectively, implementing tree-based reduce using an itera-
tive approach with carefully specified synchronization primitives.
The Nvidia_opt version utilizes the local (a.k.a. shared) memory to
store intermediate results and exploits a hardware feature of Nvidia
GPUs to avoid certain synchronization barriers. The AMD_opt
version does not perform these two optimizations but instead uses
vectorized operations. The Intel_opt parallel implementation, tuned
for an Intel CPU, also relies on vectorized operations. However, it
uses a much coarser form of parallelism with fewer threads, in
which each thread performs more work.
[Figure 1 is a bar chart of relative performance (0.0–1.0) for the Nvidia_opt, AMD_opt, and Intel_opt implementations on an Nvidia GPU, an AMD GPU, and an Intel CPU; Nvidia_opt fails on the Intel CPU.]
Figure 1: Performance is not portable across devices. Each bar
represents the device-specific optimized implementation of a paral-
lel reduce implemented in OpenCL and tuned for an Nvidia GPU,
AMD GPU, and Intel CPU respectively. Performance is normalized
with respect to the best implementation on each device.

Figure 1 shows the performance achieved by each implementa-
tion on three different devices. Running an implementation which
has been optimized on a different device leads to suboptimal per-
formance in all cases. Consider the AMD_opt implementation, for
instance, where we see that the performance loss is 20% when run-
ning on the Nvidia GPU and 90% (i.e., 10× slower) when running
on the CPU. The CPU optimized version, Intel_opt, achieves less
than 20% (i.e., 5× slower) when run on a GPU. Finally, it is worth
noting that the Nvidia_opt version, which performs quite badly on
the AMD GPU, actually fails to execute correctly on the CPU. This
is due to a low-level optimization which removes synchronization
barriers which can be avoided on the GPU, but are required on the
CPU for correctness.
This lack of performance portability is mainly due to the low-
level nature of the programming model; the dominant programming
interfaces for parallel devices such as GPUs expose programmers
to many hardware-specific details. As a result, programming be-
comes complex, time-consuming, and error prone.
Several high-level programming models have been proposed
to tackle the programmability issue and shield programmers from
low-level hardware details. High-level dataflow programming
languages such as StreamIt [25] and LiquidMetal [19] allow the pro-
grammer to easily express different implementations at the algo-
rithm level. Nvidia’s NOVA [12] language takes a more functional
approach in which higher-order functions such as map and reduce
are expressed as primitives recognized by the backend compiler.
Similarly, Accelerate [9] allows the programmer to write high-level
functional code in a DSL embedded in Haskell, and automatically
generate CUDA code for the GPU. For instance, the parallel reduce
discussed earlier would be written in Accelerate as:
sum xs = fold (+) 0 (use xs)
These kinds of approaches hide the complexity of parallelism
and low-level optimizations from the user. However, they rely on
hard-coded device-specific implementations or heuristics to drive
the optimization process. When targeting different devices, the
library implementation or backend compiler has to be re-tuned or
even worse, re-engineered. In order to address the performance
portability issue, we aim to develop mechanisms that can effec-
tively explore device-specific optimizations. The core idea is not
to commit to a specific implementation or set of optimizations but
instead to let a tool automate the process.
In this paper we present an approach which compiles a high-
level functional expression similar to the one written in Accel-
erate into highly optimized device-specific OpenCL code. We
show that we achieve performance on a par with expert-written
implementations on an Intel multicore CPU, an AMD GPU, and
an Nvidia GPU. Central to our approach is a set of rewrite rules
that systematically translate high-level algorithmic concepts into
low-level hardware paradigms, both expressed in a functional style.
The rewrite rules are based on the kind of algebraic reasoning
well-known to functional programmers, and pioneered by Bird [5]
and others in the 1980s. They are used to systematically transform
programs into a low-level representation, from which high perfor-
mance code is generated automatically.
The power of our technique lies in the rewrite rules, written once
by an expert system designer. These rules encode the different al-
gorithmic choices and low-level hardware specific optimizations.
The rewrite rules play the dual role of enabling the composition
of high-level algorithmic concepts and enabling the mapping of
these onto hardware paradigms, but also critically provide correct-
ness preserving exploration of the implementation space. The rules
enable a clear separation of concerns between high-level algorith-
mic concepts and low-level hardware paradigms while using a uni-
fied framework. The defined implementation space is automatically
searched to produce high performance code.
[Figure 2 is a diagram: high-level expressions (e.g., dot product, vector reduce, BlackScholes) built from algorithmic primitives (map, reduce, iterate, split, join, reorder, ...) are transformed, via exploration with rewrite rules encoding algorithmic choices and hardware optimizations, into low-level expressions built from OpenCL primitives (map-workgroup, map-local, toLocal, vectorize, ...); code generation maps these onto hardware paradigms (workgroups, local memory, barriers, vector units).]
Figure 2: The programmer expresses the problem with high-level
algorithmic primitives. These are systematically transformed into
low-level primitives using a rule rewriting system. OpenCL code
is generated by mapping the low-level primitives directly to the
OpenCL programming model representing hardware paradigms.
This paper demonstrates that our approach yields high-performance
code with OpenCL as our target hardware platform. We compare
the performance of our approach with highly-tuned linear algebra
functions extracted from state-of-the-art libraries and with bench-
marks such as BlackScholes. We express them as compositions of
high-level algorithmic primitives which are systematically mapped
to low-level OpenCL primitives.
The primary contributions of our paper are as follows:
a collection of high-level functional algorithmic primitives
for the programmer and low-level functional OpenCL primi-
tives representing the OpenCL programming model;
a core dependently-typed calculus and denotational semantics;
a set of rewrite rules that systematically express algorithmic
and optimization choices, bridging the gap between high-level
functional programs and OpenCL;
proofs of the soundness of the rewrite rules with respect to the
denotational semantics;
achieving performance portability by systematically applying
rewrite rules to yield device-specific implementations, with per-
formance on a par with the best hand-tuned versions.
The remainder of the paper is structured as follows. Section 2
provides an overview of our technique. Sections 3 and 4 present
our functional primitives and rewrite rules. Section 5 presents a
core language and denotational semantics, which we use to jus-
tify the rewrite rules. Section 6 explains our automatic search strat-
egy, while Section 7 introduces our benchmarks. Our experimental
setup and performance results are shown in Sections 8 and 9. Fi-
nally, Section 10 discusses related work and Section 11 concludes.
2. Overview
The overview of our approach is presented in Figure 2. The pro-
grammer writes a high-level expression composed of algorithmic
primitives. Using rewriting rules, we map this high-level expres-
sion into a low-level expression consisting of OpenCL primitives. In
the rewriting stage, different algorithmic and optimization choices
can be explored. The generated low-level expression is then fed
into our code generator that emits an OpenCL program compiled
to machine code by the vendor provided OpenCL compiler.

λ xs . map (λ x . x * 3) xs
(a) High-level expression written by the programmer.
rewrite rules
λ xs . (join ∘ mapWorkgroup (joinVec ∘
mapLocal (mapVec (λ x . x * 3))
∘ splitVec 4) ∘ split 1024) xs
(b) Low-level expression derived using rewrite rules and search.
code generator
1 int4 mul3(int4 x) { return x * 3; }
2 kernel vectorScal(global int* in,out, int len){
3 for (int i=get_group_id; i < len/1024;
4 i+=get_num_groups) {
5 global int* grp_in = in+(i*1024);
6 global int* grp_out = out+(i*1024);
7 for (int j=get_local_id; j < 1024/4;
8 j+=get_local_size) {
9 global int4* in_vec4 =(int4*)grp_in+(j*4);
10 global int4* out_vec4=(int4*)grp_out+(j*4);
11 *out_vec4 = mul3(*in_vec4);
12 } } }
(c) OpenCL program produced by our code generator.
Figure 3: Pseudo-code representing vector scaling. The user maps
a function multiplying an element by 3 over the input array (a). This
high-level expression is transformed into a low-level expression (b)
using rewrite rules in a search process. Finally, our code generator
turns the low-level expression into an OpenCL program (c).
We illustrate the mechanisms of our approach using a simple
vector scaling example shown in Figure 3. The user expresses
the computation by writing a high-level expression using the map
primitive as shown in Figure 3a. Our expressions are glued together
with lambda abstractions and function composition; we formally
define the syntax in Section 5.
Our technique first rewrites the high-level expression into a
low-level expression closer to the OpenCL programming model.
This is achieved by applying the rewrite rules presented later in
Section 4, possibly using an automatic search strategy discussed in
Section 6. Figure 3b shows one possible derivation of the original
high-level expression. Starting from the last line, the input (xs) is
split into chunks of 1024 elements. Each chunk is mapped onto a
group of threads, called a workgroup, with the mapWorkgroup low-
level primitive. Within a workgroup, we pack every 4 elements into a
SIMD vector; each vector is mapped to a local thread inside the
workgroup via the mapLocal primitive. Finally, the mapVec primitive applies
the vectorized form of the user defined function. The exact meaning
of our primitives will be given later in Section 3.
The last step consists of traversing the low-level expression
and generating OpenCL code for each low-level primitive encoun-
tered (Figure 3c). The two map primitives generate the for-loops
(line 3–4 and 7–8) that iterate over the input array assigning work
to the workgroups and local threads. The information of how many
chunks each workgroup and thread processes comes from the corre-
sponding split. In line 11 the vectorized version of the user defined
function (mul3 defined in line 1) is finally applied to the input array.
To summarize, our approach is able to generate OpenCL code
starting from a high-level program representation. This is achieved
by systematically transforming the high-level expression into a
low-level form suitable for code generation using an automated
search process.
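To see that such a derivation preserves meaning, it helps to model the primitives directly. The following Python sketch (an illustration of the semantics, not the authors' implementation) models arrays as lists and checks that the split/join decomposition of Figure 3b computes the same result as the plain map:

```python
def split(n, xs):
    # split n : [A]_{n*I} -> [[A]_n]_I  (chunk into sublists of length n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    # join : [[A]_I]_J -> [A]_{I*J}  (flatten one level of nesting)
    return [x for xs in xss for x in xs]

def mul3(x):
    return x * 3

xs = list(range(4096))

# High-level expression: map (λx. x * 3) xs
high = [mul3(x) for x in xs]

# Shape of the derived low-level expression, reading every mapX as a
# plain map and modelling splitVec/joinVec by a second split/join level:
low = join([join([[mul3(x) for x in vec]          # mapVec (λx. x * 3)
                  for vec in split(4, chunk)])    # mapLocal ... splitVec 4
            for chunk in split(1024, xs)])        # mapWorkgroup ... split 1024

assert low == high
```

Because each rewrite rule preserves this list-level semantics, any derivation reachable in the search space computes the same function as the original expression.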
map_{A,B,I} : (A → B) → [A]_I → [B]_I
zip_{A,B,I} : [A]_I → [B]_I → [A × B]_I
reduce_{A,I} : ((A × A) → A) → A → [A]_I → [A]_1
split_{A,I} : (n : size) → [A]_{n×I} → [[A]_n]_I
join_{A,I,J} : [[A]_I]_J → [A]_{I×J}
iterate_{A,I,J} : (n : size) → ((m : size) → [A]_{I×m} → [A]_m) → [A]_{I^n×J} → [A]_J
reorder_{A,I} : [A]_I → [A]_I

Figure 4: High-level algorithmic primitives.
3. Algorithmic and OpenCL Primitives
A key idea of this paper is to expose algorithmic choices and
hardware-specific program optimizations in a functional style. This
allows for systematic transformations using a collection of rewrite
rules (Section 4). The high-level algorithmic primitives can either
be used by the programmer directly, as a stand-alone language (or
embedded DSL), or be used as an intermediate representation tar-
geted by another language. Once a program is represented by our
high-level primitives, we can automatically transform it into low-
level hardware primitives. These represent hardware-specific fea-
tures in a programming model such as OpenCL, the target chosen
for this paper. Following the same approach, a different set of low-
level primitives might be designed to target other low-level pro-
gramming models such as MPI.
In this section we give a high-level account of the primitives;
Section 5 gives a more formal account. Figure 4 and 5 present our
algorithmic and OpenCL primitives. The type system we present
here is monomorphic (largely to keep the formal presentation in
Section 5 simple), however, we do rely on a restricted form of
dependent types. The only kind of type-dependency we allow is
for array types, whose size may depend on a run-time value. Type
inference is beyond the scope of this paper, but in the future we
intend to apply ideas from systems such as DML [45] to our setting.
We let I range over sizes. A size can be a size variable m, n, a
natural number i, or a product I × J or power I^J of sizes I and J.
We let A, B range over types. We write A → B for a function from
type A to type B and (n : size) → B for a dependent function
from size n to type B (where B may include array types whose
sizes depend on n). We write A × B for the product of types A
and B and 1 for the unit type. We write [A]_I for an array of size
I with elements of type A. The primitives are annotated with type
and size subscripts. Thus, formally each one actually represents a
type-indexed family of primitives. We often omit subscripts when
they are not relevant or can be trivially inferred.
3.1 Algorithmic Primitives
As in Accelerate [9, 30], we deliberately restrict ourselves to a
set of primitives for which we know that high performance CPU
and GPU implementations exist. In contrast to Accelerate, we al-
low nesting of primitives to express nested parallelism. Nesting of
arrays is used to represent multi-dimensional data structures like
matrices. Figure 4 presents the high-level primitives used to define
programs at the algorithmic level. The map and zip primitives are
standard.
The reduce primitive is a special case of a fold returning a single
reduced element in an array of size 1. We assume the supplied
function is associative and commutative in order to admit efficient
parallel implementations. Returning the result as an array with a
single element allows for a more compositional design, in which
our primitives operate on arrays rather than scalar values.
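This array-in, array-out behaviour of reduce can be written down directly. A minimal Python sketch of the intended semantics (not the parallel implementation; the supplied operator is assumed associative and commutative, as the paper requires):

```python
import functools

def reduce_prim(f, z, xs):
    # reduce : ((A × A) → A) → A → [A]_I → [A]_1
    # f is assumed associative and commutative so that parallel
    # implementations are admissible; the result is a one-element
    # array, keeping the primitive array-to-array and compositional.
    return [functools.reduce(f, xs, z)]

assert reduce_prim(lambda a, b: a + b, 0, [1, 2, 3, 4]) == [10]
```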

mapWorkgroup_{A,B,I} : (A → B) → [A]_I → [B]_I
mapLocal_{A,B,I} : (A → B) → [A]_I → [B]_I
mapGlobal_{A,B,I} : (A → B) → [A]_I → [B]_I
mapSeq_{A,B,I} : (A → B) → [A]_I → [B]_I
toLocal_{A,B} : (A → B) → (A → B)
toGlobal_{A,B} : (A → B) → (A → B)
reduceSeq_{A,B,I} : ((A × B) → A) → A → [B]_I → [A]_1
reducePart_{A,I} : ((A × A) → A) → A → (n : size) → [A]_{I×n} → [A]_n
reorderStride_{A,I} : (n : size) → [A]_{n×I} → [A]_{n×I}
mapVec_{A,B,I} : (A → B) → ⟨A⟩_I → ⟨B⟩_I
splitVec_{A,I} : (n : size) → [A]_{n×I} → [⟨A⟩_n]_I
joinVec_{A,I,J} : [⟨A⟩_I]_J → [A]_{I×J}

Figure 5: Low-level OpenCL primitives used for code generation.
The split and join primitives transform the shape of array data.
The expression split n xs transforms array xs of size n × I, with
elements of type A, into an array of size I with elements that are
A arrays of size n; join is the inverse of split. (In practice A itself
may be an array type, in which case we can view split as adding a
dimension to and join as subtracting a dimension from a matrix.)
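The inverse relationship is easy to state concretely. A small Python sketch of the two primitives' semantics (lists model arrays; not the authors' implementation):

```python
def split(n, xs):
    # split n : [A]_{n×I} -> [[A]_n]_I
    assert len(xs) % n == 0
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    # join : [[A]_I]_J -> [A]_{I×J}
    return [x for xs in xss for x in xs]

xs = list(range(8))
assert split(4, xs) == [[0, 1, 2, 3], [4, 5, 6, 7]]
assert join(split(4, xs)) == xs   # join is the inverse of split
# Viewed dimensionally: split turns an 8-vector into a 2×4 matrix,
# and join flattens the matrix back into a vector.
```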
The iterate primitive repeatedly applies a given function. The
expression iterate n f applies the function f repeatedly n times.
The type of iterate is instructive. The function f may change the
length of the processed array at each iteration step. We currently
restrict the length to stay the same or shrink in each iteration by a
fixed factor (given by the implicit subscript I), which is sufficient
to express, e.g., iterative reduce (see Section 4). We intend to lift
this restriction in the future, which will probably require a richer
type system. Given n, the type of iterate expresses that the input
array will shrink by a factor of I^n.
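The shrinking behaviour of iterate can be demonstrated with an iterative reduce, the motivating example above. A Python sketch of the semantics (lists model arrays; the pairwise-sum step is an illustrative choice, not fixed by the primitive):

```python
def iterate(n, f, xs):
    # iterate n f : apply f to the array n times. Here f shrinks its
    # input by a fixed factor I = 2 each step, so an array of size
    # I^n × J ends up with J elements.
    for _ in range(n):
        xs = f(xs)
    return xs

def halve_sum(xs):
    # one tree-reduction step: pairwise sums, halving the length
    return [xs[2 * i] + xs[2 * i + 1] for i in range(len(xs) // 2)]

assert iterate(3, halve_sum, [1] * 8) == [8]   # 2^3 elements shrink to 1
```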
Finally, the reorder primitive allows the programmer to express
that the order of elements in an array is unimportant, allowing
a number of useful optimizations—as we will see in Section 4.
This primitive bears an obvious similarity to the unordered operation
of the Ferry query language [21], which asserts that the order of
elements in a list is unimportant.
3.2 OpenCL-specific Primitives
In order to achieve high performance on manycore CPUs and
GPUs, programmers often use a set of rules of thumb to drive
the optimization of their application. Each hardware vendor pro-
vides optimization guides [1, 31] that extensively cover hardware
idiosyncrasies and optimizations. The main idea behind our work
is to identify common optimization patterns and express them with
the help of low-level primitives coupled with a rewrite system. Fig-
ure 5 lists the OpenCL-specific primitives we have identified.
Maps Each mapX primitive has the same high-level semantics
as plain map, but represents a specific way of mapping computa-
tions to the hardware and exploiting parallelism in OpenCL. The
mapWorkgroup primitive assigns work to a group of threads, called
a workgroup in OpenCL, with every workgroup applying the given
function on an element of the input array. Similarly, the mapLocal
primitive assigns work to a local thread inside a workgroup. As
workgroups are optional in OpenCL, mapGlobal assigns work to a
thread not organized in a workgroup. This allows us to map com-
putations in different ways to the thread hierarchy. The mapSeq
primitive performs a sequential map within a single thread.
Generating OpenCL code for all of these primitives is simi-
lar; we describe this using mapWorkgroup as an example. A loop
is generated, where the iteration variable is determined by the
workgroup-id function from the OpenCL API. Inside the loop, a
pointer is generated to partition the input array, so that every work-
group calls the given function f on a different chunk of data. An
output pointer is generated similarly. We continue with the body of
the loop by generating the code for the function f recursively. Fi-
nally, an appropriate synchronization mechanism is added for the
given map primitive. For instance, after a mapLocal we add a bar-
rier synchronization for the threads inside the workgroup.
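The loop-skeleton generation described above can be sketched as a toy emitter. This is a hypothetical illustration of the recipe (loop driven by the workgroup id, per-workgroup pointer arithmetic, recursive generation of the body), not the paper's code generator; all names in it are assumptions:

```python
def emit_map_workgroup(chunk_size, emit_body):
    # Hypothetical toy emitter: produce the mapWorkgroup loop skeleton.
    # The iteration variable comes from the OpenCL workgroup id; the
    # pointers partition the input/output so each workgroup sees its
    # own chunk; the body of f is emitted recursively via emit_body.
    lines = [
        f"for (int i = get_group_id(0); i < len/{chunk_size}; "
        f"i += get_num_groups(0)) {{",
        f"  global int* grp_in  = in  + (i * {chunk_size});",
        f"  global int* grp_out = out + (i * {chunk_size});",
    ]
    lines += ["  " + l for l in emit_body("grp_in", "grp_out")]
    lines.append("}")
    return lines

code = emit_map_workgroup(1024, lambda i, o: [f"{o}[j] = f({i}[j]);"])
assert "get_group_id(0)" in code[0]
```

A real generator would additionally append the synchronization barrier appropriate to the map primitive, as described above.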
Local/Global The toLocal and toGlobal primitives are used to
determine where the result of the given function f should be
stored. OpenCL defines two distinct address spaces: global and
local. Global memory is the commonly used large but slow mem-
ory. On GPUs, the small local memory has a high bandwidth with
low latency and is used to store frequently accessed data or for effi-
cient communication between local threads (shared memory). With
these two primitives, we can in effect exploit the memory hierar-
chy defined in OpenCL. These primitives act similarly to a typecast
(their high-level semantics is that of the identity function) and are
in fact implemented as such, so that no code is emitted directly. We
check for incorrect use of these primitives in our implementation.
For example, the implementation checks that a toLocal primitive is
eventually followed by a toGlobal primitive to ensure that the final
result is copied back into global memory, as required by OpenCL.
We plan to extend our type system in the future to track the memory
location of arrays using an effect system.
In our design, every function reads its input and writes its output
using pointers provided by its caller. As a result, we can
force a store to local memory by wrapping any function with the
toLocal function. In the code generator, this will simply change the
output pointer of function f to an area in local memory.
Sequential Reduce The reduceSeq primitive performs a sequen-
tial reduce within a single thread. The generated code consists of
an accumulation variable which is initialized with the given initial
value. A loop is generated iterating over the array and calling the
given function which stores its intermediate result in the accumu-
lation variable. Note that we require the function passed to reduce
to be associative and commutative in order to enable an efficient
parallel implementation. We do not impose the same restriction for
the reduceSeq function, as here we guarantee a sequential order of
execution; thus reduceSeq has a more general type.
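The sequential-order guarantee is what makes the more general type safe. A Python sketch of the reduceSeq semantics (lists model arrays; not the generated OpenCL code), using a deliberately non-commutative operator to show that order is preserved:

```python
def reduce_seq(f, z, xs):
    # reduceSeq : ((A × B) → A) → A → [B]_I → [A]_1
    # Strictly left-to-right accumulation, so f need not be
    # associative or commutative, and A and B may differ.
    acc = z
    for x in xs:          # the generated loop over the array
        acc = f(acc, x)   # intermediate result in the accumulator
    return [acc]

# Order matters for a non-commutative operator -- string append:
assert reduce_seq(lambda a, b: a + str(b), "", [1, 2, 3]) == ["123"]
```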
Partial Reduce The reducePart primitive performs a partial re-
duce, i.e., an array of n elements is reduced to an array of m el-
ements where 1 ≤ m ≤ n. While not directly used to generate
OpenCL code, reducePart is useful as an intermediate representa-
tion for deriving different implementations of reduce as we will see
in the next section.
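One way to read the type [A]_{I×n} → [A]_n concretely is chunk-wise reduction. The Python sketch below is one possible model of reducePart (an assumption for illustration, not the only implementation the rewrite rules permit):

```python
import functools

def reduce_part(f, z, n, xs):
    # reducePart : ((A × A) → A) → A → (n : size) → [A]_{I×n} → [A]_n
    # Model: cut the I×n input into n chunks of size I and fully
    # reduce each chunk, yielding n partial results.
    size = len(xs) // n
    return [functools.reduce(f, xs[i * size:(i + 1) * size], z)
            for i in range(n)]

xs = [1, 2, 3, 4, 5, 6, 7, 8]
assert reduce_part(lambda a, b: a + b, 0, 4, xs) == [3, 7, 11, 15]
assert reduce_part(lambda a, b: a + b, 0, 1, xs) == [36]   # n = 1: full reduce
```

With n = 1 this coincides with reduce, which is what makes reducePart a useful intermediate form for deriving different reduce implementations.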
Reorder Stride The high-level semantics of reorderStride_{A,I} n
is just like reorder_{A,I}. The low-level implementation actually per-
forms a specific reordering in which the array is reordered with a
stride n, that is, element i is mapped to element i/I + n(i % I). In
the generated OpenCL code this primitive ensures that after split-
ting the workload, consecutive threads access consecutive mem-
ory elements (i.e., coalesce memory access), which is beneficial on
modern GPUs as it maximizes memory bandwidth.
Our implementation does not produce code directly, but gener-
ates instead an index function, which is used when accessing the
array the next time. While beyond the scope of this paper, our de-
sign supports user-defined index functions as well.
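The index formula above is easy to check in isolation. A Python sketch applying the mapping i ↦ i/I + n(i % I) eagerly (the real implementation, as described, only generates an index function):

```python
def reorder_stride(n, xs):
    # reorderStride n on an array of size n×I: element i moves to
    # position i/I + n*(i % I) (integer division), so consecutive
    # threads end up accessing consecutive memory elements.
    I = len(xs) // n
    out = [None] * len(xs)
    for i, x in enumerate(xs):
        out[i // I + n * (i % I)] = x
    return out

xs = list(range(8))
assert sorted(reorder_stride(2, xs)) == xs   # it is a permutation
assert reorder_stride(2, xs) == [0, 4, 1, 5, 2, 6, 3, 7]
```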
Vectorization The OpenCL programming model supports SIMD
vector data types such as int4 where any operations on this type
will be executed in the hardware vector units. In the absence of
vector units in the hardware, the OpenCL compiler scalarizes the
code automatically.

References
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters.
S. Che et al. Rodinia: A benchmark suite for heterogeneous computing.
W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A Language for Streaming Applications.
J. Ragan-Kelley et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines.