This work proposes a novel approach aiming to combine high-level programming, code portability, and high performance: a simple set of rewrite rules transforms a high-level functional expression into a low-level functional representation close to the OpenCL programming model, from which OpenCL code is generated.
Abstract:
Computers have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort, resulting in a tension between performance and code portability. Typically, code is either tuned in a low-level imperative language using hardware-specific optimizations to achieve maximum performance, or is written in a high-level, possibly functional, language to achieve portability at the expense of performance. We propose a novel approach aiming to combine high-level programming, code portability, and high performance. Starting from a high-level functional expression, we apply a simple set of rewrite rules to transform it into a low-level functional representation, close to the OpenCL programming model, from which OpenCL code is generated. Our rewrite rules define a space of possible implementations which we automatically explore to generate hardware-specific OpenCL implementations. We formalize our system with a core dependently-typed lambda-calculus along with a denotational semantics, which we use to prove the correctness of the rewrite rules. We test our design in practice by implementing a compiler which generates high-performance imperative OpenCL code. Our experiments show that we can automatically derive hardware-specific implementations from simple functional high-level algorithmic expressions, offering performance on a par with highly tuned code for multicore CPUs and GPUs written by experts.
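As a hedged illustration of the rewrite-rule idea described in the abstract, the classic split-join rule rewrites a map over an array into a map of maps over chunks, exposing an extra level of parallelism while preserving the result. The names split, join, and map_ below are illustrative, not the paper's actual implementation:

```python
# Sketch of the split-join rewrite rule: map(f) == join . map(map(f)) . split(n)
# All definitions here are an assumption-laden model, not the paper's code.

def split(n, xs):
    # Partition xs into chunks of length n (assumes len(xs) is divisible by n).
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    # Flatten a list of chunks back into one array.
    return [x for chunk in xss for x in chunk]

def map_(f, xs):
    return [f(x) for x in xs]

f = lambda x: x * x
xs = list(range(8))

high_level = map_(f, xs)                                      # map f xs
rewritten = join(map_(lambda c: map_(f, c), split(4, xs)))    # join (map (map f) (split 4 xs))

assert high_level == rewritten  # the rewrite preserves semantics
```

In a real compiler the rewritten form is not run sequentially like this; instead, the outer and inner maps can be assigned to different levels of the OpenCL thread hierarchy.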
TL;DR: Basic concepts, such as model-driven engineering, metamodelling, model transformation and technological space, are defined, and the state-of-the-art implementations of these concepts are described.
TL;DR: This paper describes how programs in the LIFT IR, a new data-parallel IR that encodes OpenCL-specific constructs as functional patterns, are compiled into efficient OpenCL code; the IR is flexible enough to express GPU programs with complex optimizations, achieving performance on par with manually optimized code.
TL;DR: This paper presents the design and implementation of three key features of Futhark that seek a suitable middle ground with imperative approaches and presents a flattening transformation aimed at enhancing the degree of parallelism.
TL;DR: This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives, and shows that this approach outperforms existing compiler approaches and hand-tuned codes.
TL;DR: In this article, the authors present a system for the automatic differentiation of a higher-order functional array-processing language, which simultaneously supports both source-to-source forward-mode AD and global optimisations such as loop transformations.
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
TL;DR: The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain and the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations.
TL;DR: This paper presents a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation that describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule.
In their approach, vectorization is achieved by using the splitVec and corresponding joinVec primitives, which change the element type of an array and adjust its length accordingly.
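A minimal sketch of the splitVec/joinVec semantics, modeling vector values as tuples (the primitive names follow the text, but this implementation is an illustrative assumption):

```python
# splitVec(w, xs) reinterprets an array of n scalars as an array of n/w
# vectors of width w; joinVec undoes the reinterpretation.

def split_vec(w, xs):
    assert len(xs) % w == 0
    return [tuple(xs[i:i + w]) for i in range(0, len(xs), w)]

def join_vec(vs):
    return [x for v in vs for x in v]

xs = [1.0, 2.0, 3.0, 4.0]
vecs = split_vec(2, xs)      # length halved, element type widened to a 2-vector
assert vecs == [(1.0, 2.0), (3.0, 4.0)]
assert join_vec(vecs) == xs  # joinVec . splitVec == identity
```

In generated OpenCL, the widened element type would correspond to a hardware vector type such as float2, so operations on it can map onto SIMD units.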
Q2. What is the use of the small local memory on GPUs?
On GPUs, the small local memory has high bandwidth and low latency; it is used to store frequently accessed data and for efficient communication between threads in the same work-group (it corresponds to what CUDA calls shared memory).
Q3. What is the function that stores its intermediate result in the accumulation variable?
A loop is generated iterating over the array and calling the given function which stores its intermediate result in the accumulation variable.
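The generated loop can be sketched as follows, written in Python rather than the emitted OpenCL C, with illustrative names:

```python
# Sketch of the loop a code generator emits for a sequential reduction:
# iterate over the array, calling the given function, which stores its
# intermediate result in the accumulation variable acc.

def reduce_seq(f, init, xs):
    acc = init          # the accumulation variable
    for x in xs:
        acc = f(acc, x) # each call overwrites acc with the intermediate result
    return acc

assert reduce_seq(lambda a, b: a + b, 0, [1, 2, 3, 4]) == 10
```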
Q4. What is the use case for reordering?
In their current implementation, the authors support two types of reordering: no reordering, represented by the id function, and reorderStride, which reorders elements with a certain stride n.
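One plausible reading of reorderStride, sketched below, views the array as a matrix and gathers elements that lie a stride apart so they become adjacent; the exact index formula here is an assumption for illustration, not the paper's definition:

```python
# Illustrative reorderStride: for an array of length n = s * m, place the
# elements xs[0], xs[s], xs[2s], ... first, then xs[1], xs[1+s], ..., so that
# elements s apart in memory end up adjacent in the result.

def reorder_stride(s, xs):
    n = len(xs)
    assert n % s == 0
    m = n // s
    return [xs[i + j * s] for i in range(s) for j in range(m)]

out = reorder_stride(2, list(range(6)))
assert out == [0, 2, 4, 1, 3, 5]
# Reordering with the complementary stride restores the original order.
assert reorder_stride(3, out) == list(range(6))
```

On a GPU, such a reordering can be realized purely as an index transformation when reading memory, which is how coalesced access patterns are obtained without physically moving data.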
Q5. What type of data types are supported by the OpenCL programming model?
The OpenCL programming model supports SIMD vector data types such as int4, where operations on such a type are executed by the hardware vector units.
Q6. What is central to the approach?
Central to their approach is a set of rewrite rules that systematically translate high-level algorithmic concepts into low-level hardware paradigms, both expressed in a functional style.
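Another rewrite rule in this spirit, sketched with illustrative names, is map-fusion, which rewrites a composition of two maps into a single map and thereby eliminates an intermediate array:

```python
# Sketch of the map-fusion rewrite rule: map(f) . map(g) == map(f . g).
# Names and encoding are illustrative, not taken from the paper's code.

def map_(f, xs):
    return [f(x) for x in xs]

f = lambda x: x + 1
g = lambda x: x * 2
xs = [1, 2, 3]

lhs = map_(f, map_(g, xs))         # map f (map g xs) -- allocates an intermediate list
rhs = map_(lambda x: f(g(x)), xs)  # map (f . g) xs   -- single traversal

assert lhs == rhs
```

Because both sides are provably equal, the compiler is free to pick whichever form is cheaper on the target hardware, which is the essence of exploring the space defined by the rules.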
Q7. What does the reducePart primitive do?
The reducePart primitive performs a partial reduction: an array of n elements is reduced to an array of m elements, where 1 ≤ m ≤ n.
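A hedged sketch of one way to realize this partial reduction, splitting the input into m chunks and reducing each chunk independently (the chunking strategy is an assumption; the primitive only specifies the n-to-m shape):

```python
# Illustrative reducePart: reduce n elements to m partial results, 1 <= m <= n.
# Assumes n is divisible by m for simplicity.

def reduce_part(f, init, m, xs):
    n = len(xs)
    assert 1 <= m <= n and n % m == 0
    size = n // m
    out = []
    for c in range(m):
        acc = init
        for x in xs[c * size:(c + 1) * size]:
            acc = f(acc, x)
        out.append(acc)
    return out

add = lambda a, b: a + b
assert reduce_part(add, 0, 2, [1, 2, 3, 4]) == [3, 7]   # two partial sums
assert reduce_part(add, 0, 1, [1, 2, 3, 4]) == [10]     # m = 1 is a full reduction
```

The m independent partial reductions are exactly what makes the primitive parallelizable: each chunk can be assigned to a different thread, with a later step combining the m results.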
Q8. What type system do the authors use to represent hardware-specific features in a programming model?
The type system the authors present here is monomorphic (largely to keep the formal presentation in Section 5 simple), however, the authors do rely on a restricted form of dependent types.
Q9. Why does the Nvidia_opt version fail to perform on the GPU?
This lack of performance portability is mainly due to the low-level nature of the programming model; the dominant programming interfaces for parallel devices such as GPUs expose programmers to many hardware-specific details.
Q10. Why does the Nvidia_opt version fail to execute correctly on the CPU?
This is due to a low-level optimization that removes synchronization barriers; these barriers can be avoided on the GPU but are required on the CPU for correctness.
Q11. How can the authors produce OpenCL code for each single primitive?
Once an expression is in its lowest-level form, their code generator can easily produce OpenCL code for each individual primitive, as described in the previous section.
Q12. What is the last step in the process?
The last step consists of traversing the low-level expression and generating OpenCL code for each low-level primitive encountered (Figure 3c).
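This traversal can be sketched as follows; the primitive names and the emitted OpenCL fragments are purely illustrative assumptions, meant only to show the shape of a per-primitive code generator:

```python
# Minimal sketch of the final code-generation step: recursively traverse the
# low-level expression (modeled here as nested tuples) and emit an OpenCL-like
# code fragment for each primitive encountered.

def generate(expr):
    op = expr[0]
    if op == "mapGlb":          # hypothetical: map over global work-items
        _, f, _arr = expr
        body = generate(f)
        return f"int i = get_global_id(0); out[i] = {body};"
    elif op == "apply":         # hypothetical: apply a user function per element
        _, fname = expr
        return f"{fname}(in[i])"
    raise ValueError(f"unknown primitive: {op}")

code = generate(("mapGlb", ("apply", "square"), "in"))
assert "get_global_id(0)" in code and "square(in[i])" in code
```

A real generator would also handle memory allocation, address spaces, and barrier insertion, but the overall structure is the same: one code template per low-level primitive, composed by walking the expression tree.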