UC Riverside
UC Riverside Previously Published Works
Title
High-level language tools for reconfigurable computing
Permalink
https://escholarship.org/uc/item/6xg1852d
Journal
Proceedings of the IEEE, 103(3)
ISSN
0018-9219
Authors
Windh, S
Ma, X
Halstead, RJ
et al.
Publication Date
2015-03-01
DOI
10.1109/JPROC.2015.2399275
Peer reviewed
eScholarship.org Powered by the California Digital Library
University of California

INVITED PAPER
High-Level Language Tools for Reconfigurable Computing
This paper provides a focused survey of five tools to improve productivity in developing code for FPGAs.
By Skyler Windh, Xiaoyin Ma, Student Member IEEE, Robert J. Halstead, Prerna Budhkar, Zabdiel Luna, Omar Hussaini, and Walid A. Najjar, Fellow IEEE
ABSTRACT | In the past decade or so we have witnessed a steadily increasing interest in FPGAs as hardware accelerators: they provide an excellent mid-point between the reprogrammability of software devices (CPUs, DSPs, and GPUs) and the performance and low energy consumption of ASICs. However, the programmability of FPGA-based accelerators remains one of the biggest obstacles to their wider adoption. Developing FPGA programs requires extensive familiarity with hardware design and experience with a tedious and complex tool chain. For half a century, layers of abstractions have been developed that simplify the software development process: languages, compilers, dynamically linked libraries, operating systems, APIs, etc. Very little, if any, such abstractions exist in the development of FPGA programs. In this paper, we review the history of using FPGAs as hardware accelerators and summarize the challenges facing the raising of the programming abstraction layers. We survey five High-Level Language tools for the development of FPGA programs: Xilinx Vivado, Altera OpenCL, Bluespec BSV, ROCCC, and LegUp to provide an overview of their tool flow, the optimizations they provide, and a qualitative analysis of their hardware implementations of high-level code.
KEYWORDS | Compiler optimization; high-level synthesis; max filter; reconfigurable computing
I. INTRODUCTION
In recent years we have witnessed a tremendous growth in the size and speed of FPGAs, accompanied by an ever widening spectrum of application domains where they are extensively used. Furthermore, a large number of specialized functional units are being added to their architectures, such as DSP units, multiported on-chip memories, and CPUs. Modern FPGAs are used as platforms for configurable computing that combine the flexibility and reprogrammability of CPUs with the efficiency of ASICs. Commercial as well as research projects using FPGA accelerators on a wide variety of applications have reported speedups, over both CPUs and GPUs, ranging from one to three orders of magnitude, as well as reduced energy consumption per result ranging from one to two orders of magnitude. Application domains have included signal, image, and video processing; encryption/decryption; decompression (text, integer data, images, etc.); databases [1], [2]; dense and sparse linear algebra; graph algorithms; data mining; information processing and text analysis; packet processing; intrusion detection; bioinformatics; financial analysis; seismic data analysis; etc.
FPGAs are programmed using Hardware Description Languages (HDLs) such as VHDL, Verilog, SystemC, and SystemVerilog that are used for digital circuit design and implementation. In these languages the circuit to be mapped on an FPGA is designed at a fairly low level: the datapaths and state machine controllers are built from the bottom up, timing primitives are used to provide synchronization between signals, the registering of data values is explicitly stated, parallelism is implicit, and sequential ordering of events must be explicitly enforced via the state machine. Traditionally trained software developers are not familiar with such programming paradigms. Beyond the program development stage, the tool chains are language and vendor specific and consist of a synthesis phase, where the HDL code is translated to a netlist; mapping, where logic expressions are translated into hardware primitives specific to a device; and place and route, where hardware logic blocks are placed on the device and wires are routed to connect them. This last phase attempts to solve an NP-complete problem using heuristics, such as simulated annealing, and may take hours or days to complete depending
Manuscript received August 21, 2014; revised December 2, 2014; accepted January 20, 2015. Date of current version April 14, 2015. This work has been supported in part by National Science Foundation Awards CCF-1219180 and IIS-1161997.
S. Windh, R. Halstead, P. Budhkar, Z. Luna, O. Hussaini, and W. A. Najjar are with the Department of Computer Science and Engineering, University of California at Riverside, Riverside, CA 92521 USA (e-mail: najjar@cs.ucr.edu).
X. Ma is with the Department of Electrical and Computer Engineering, University of California at Riverside, Riverside, CA 92521 USA.
Digital Object Identifier: 10.1109/JPROC.2015.2399275
0018-9219 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
390 Proceedings of the IEEE | Vol. 103, No. 3, March 2015

on the size of the circuit relative to the device size as well as
the timing constraints imposed by the user. The steepness of
the learning curve for such languages and tools makes their
use a daunting and expensive proposition for most projects.
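As a concrete illustration of the heuristic placement step mentioned above, the sketch below anneals a toy standard-cell placement: cells on a line, nets as cell pairs, and total wirelength as the cost. The netlist, cooling schedule, and the linear congruential generator are all invented for illustration; production placers use far richer cost models and schedules.

```c
#include <stdlib.h>

/* Toy placement by simulated annealing: N cells occupy N slots on a
 * line, each net connects two cells, and cost is total wirelength.
 * A deterministic LCG replaces rand() so runs are reproducible. */
#define N 8
static const int nets[][2] = { {0,7}, {1,6}, {2,5}, {3,4}, {0,4}, {3,7} };
#define NNETS ((int)(sizeof nets / sizeof nets[0]))

static unsigned lcg_state = 12345u;
static unsigned lcg(void) { return lcg_state = lcg_state * 1103515245u + 12345u; }

/* pos[c] is the slot assigned to cell c. */
static int wirelength(const int pos[N]) {
    int cost = 0;
    for (int i = 0; i < NNETS; i++) {
        int d = pos[nets[i][0]] - pos[nets[i][1]];
        cost += d < 0 ? -d : d;
    }
    return cost;
}

/* Swap two cells per step; keep improvements, and allow worsening
 * moves with a probability that shrinks as the "temperature" t drops. */
static int anneal(int pos[N]) {
    int best = wirelength(pos);
    for (int t = 1000; t > 0; t--) {
        int a = (int)(lcg() % N), b = (int)(lcg() % N);
        int before = wirelength(pos);
        int tmp = pos[a]; pos[a] = pos[b]; pos[b] = tmp;
        int after = wirelength(pos);
        if (after - before > 0 && (int)(lcg() % 1000) >= t) {
            tmp = pos[a]; pos[a] = pos[b]; pos[b] = tmp;  /* undo uphill move */
        } else if (after < best) {
            best = after;   /* track the best placement cost seen */
        }
    }
    return best;
}
```

Even this toy captures why the phase is slow: every candidate move requires re-evaluating the cost, and real devices have millions of cells and far more elaborate cost functions.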
This paper provides a qualitative survey of the currently available tools, both research and commercial, for programming FPGAs as hardware accelerators. We start with a historical perspective on the use of FPGA-based hardware accelerators (Section II), showing that the role of FPGAs as accelerators emerged very shortly after their introduction, as glue-logic replacements, in the 1980s. In Section III we discuss the efficiency of the hardware computing model over the stored program model and review the challenges posed by using High-Level Languages (HLLs) as programming tools to generate hardware structures on FPGAs. Related works and five High-Level Synthesis (HLS) tools are described in Section IV: three commercial tools (Xilinx Vivado, Altera OpenCL, and Bluespec BSV) and two university research tools (ROCCC and LegUp). We use a simple image filter, dilation, and AES encryption routines to describe the style of programming these tools and explore their capabilities in implementing compiler-based transformations that enhance the throughput of the generated structure (Sections V and VI). Area, performance, and power results for both benchmarks are compared in Section VII and, finally, concluding remarks are presented in Section VIII. Note: in this paper we report results and compare tools only to the extent allowed by the terms of the user license agreements.
II. A HISTORICAL PERSPECTIVE
The use of FPGAs as hardware accelerators is not a new concept. Very shortly after the introduction of the first SRAM-based FPGA device (Xilinx, 1985), the PAM (Programmable Active Memory) [3], [4] was conceived and built at the DEC Paris Research Lab (PReL). The PAM P0 consists of a 5 × 5 array of Xilinx XC3020 FPGAs (Fig. 1) connected to various memory modules as well as to a host workstation via a VME bus. It had a maximum frequency of 25 MHz, 0.5 MB of RAM, and communicated on a host bus of 8 MB/s. The PAM P1 was built using a slightly newer FPGA, the Xilinx XC3090. It operated with a maximum frequency of 40 MHz, had 4 MB of RAM, and used a 100-MB/s host bus. It was described as a "universal hardware co-processor closely coupled to a standard host computer" [5]. It was evaluated using 10 benchmark codes [5] consisting of: long multiplication, RSA cryptography, Ziv-Lempel compression, edit distance calculations, heat and Laplace equations, N-body calculations, binary 2-D convolution, the Boltzmann machine model, 3-D graphics (including translation, rotation, clipping, and perspective projection), and the discrete cosine transform. It is interesting to note that most of these benchmarks are still today subjects of research and development efforts in hardware acceleration. Bertin et al. [5] conclude that PAM delivered a performance comparable to that of ASIC chips or supercomputers of the time, and was one to two orders of magnitude faster than software. They also state that because of the PAM's large off-chip I/O bandwidth (6.4 Gb/s) it was ideally suited for "on-the-fly data acquisition and filtering," which is exactly the computational model, streaming data, adopted by most of the hardware acceleration projects that rely on FPGA platforms.
This first reconfigurable platform was rapidly followed by the SPLASH 1 (1989) and SPLASH 2 (1992) [6]–[10] projects at the Supercomputer Research Center (Table 1). Each was a linear array of FPGAs with local memory modules. They were designed for accelerating string-based operations and computations such as edit distance calculations. The SPLASH 2 was reported to achieve four orders of magnitude speedup, over a SUN SPARC 10, on edit distance computation using dynamic programming.

The PAM and SPLASH projects laid the foundation of reconfigurable computing by using FPGA-based hardware accelerators. In the past two decades the density and speed of FPGAs have grown tremendously: the density by several orders of magnitude, the clock speed by just over one order of magnitude. Both of these projects could easily be implemented on a single moderately sized modern FPGA device. However, the main challenge to FPGAs as hardware accelerators, namely the abstraction gap between application development and FPGA programming, not only remains unchanged but has probably gotten worse due to the increase in complexity of the applications enabled by larger device sizes. FPGA hardware accelerators are still beyond the reach of traditionally trained application code developers.

Fig. 1. Architecture of the DEC PReL PAM P0.

Table 1 Architecture Parameters of the SPLASH 1 and SPLASH 2 Accelerators
III. HARDWARE AND SOFTWARE COMPUTING MODELS
In this section we discuss two issues that define the complexity of compiling HLLs to hardware circuits: 1) the semantic gap between the sequential stored-program execution model implicit in these languages and the parallel, spatial model of hardware; and 2) the effects of abstractions, or lack thereof, on the complexity of the compiler.

A. Efficiency and Universality
The stored program model is a universal computing model: it is equivalent to a Turing machine, with the limitations on the size of the tape imposed by the virtual address space. It can therefore be programmed to execute any computable function. Hardware execution, on the other hand, is not universal unless it has an attached microprocessor. It is, however, extremely efficient. Consider an image filter applied on a 3 × 3 pixel window over a frame: the forall loop implemented in hardware can be both pipelined (let d be the pipeline depth) and unrolled so as to compute multiple windows concurrently (let the unroll factor be k). In the steady state, d × k operations are being executed concurrently, producing k output results per cycle. On a CPU, the innermost loop of a typical image filter requires 20–30 machine instructions per loop body, including nine load instructions. Assuming an average instruction-level parallelism (ILP) of two, each result takes 10–15 machine cycles, which matches the ratio of the respective clock speeds of CPUs and GPUs to FPGAs. However, that same loop can be replicated many times on the FPGA, achieving a much higher throughput (at least an order of magnitude). Furthermore, the ability to configure local customized storage on the FPGA makes it possible to reduce the number of memory accesses, mostly reads, by reusing already fetched data, resulting in a more efficient use of the memory bandwidth and lower energy consumption per task [11]. Hence the higher speedup or throughput observed on a very wide range of applications using FPGA accelerators over multicores (CPUs and GPUs). Further details on CPU efficiency for image filters are discussed in Section V-A.
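The loop body in question can be sketched as follows: a minimal 3 × 3 max (dilation) filter kernel, with nine loads per output pixel. Border handling is omitted for brevity, and the function name and interface are our own; this is the kind of loop body an HLS tool would pipeline (depth d) and unroll (factor k).

```c
/* One output pixel of a 3x3 max (dilation) filter. The caller must
 * pass interior coordinates (1 <= x < width-1, likewise for y);
 * border handling is omitted to keep the inner loop visible. */
static unsigned char max3x3(const unsigned char *img, int width, int x, int y)
{
    unsigned char m = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            unsigned char v = img[(y + dy) * width + (x + dx)]; /* one of 9 loads */
            if (v > m)
                m = v;
        }
    return m;
}
```

On a CPU this body compiles to a few dozen instructions executed per output; in hardware the nine comparisons become a small tree evaluated every cycle, and replicating the tree k times yields k outputs per cycle.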
B. Semantics of the Execution Models
CPUs and GPUs are inherently stored-program (or von Neumann) machines, and so are the programming languages used on them. Most of the languages in use today reflect the stored program paradigm. As such, they are bound by its sequential consistency, both at the language and machine levels. While CPU and GPU architectures exploit various forms of parallelism, such as instruction-, data-, and thread-level parallelism, they do so by circumventing the sequential consistency implied in the source code internally (branch prediction, out-of-order execution, SIMD parallelism, etc.), while preserving the appearance of a sequentially consistent execution externally (reorder buffers, precise interrupts, etc.). Compiling HLL code to a CPU or GPU is therefore the translation from one stored program machine model, the HLL, to another, the machine's Instruction Set Architecture (ISA). In the stored program paradigm the compiler can generate a parallel execution only when doing so is provably safe, in other words, when the record of execution can be proved, by the compiler, to be either identical or equivalent to the sequential execution. For example, in a single-level forall loop, any interleaving of the iterations produces a correct result. Also, in a single-threaded CPU execution the producer/consumer relationship is not a parallel construct, since the semantics imply that all the data must be produced before any datum can be consumed. Hence all the data is stored in memory by the producer loop before the consumer loop starts execution.
A digital circuit, on the other hand, is inherently parallel and spatial, with distributed storage and timed behavior. HDLs (e.g., VHDL, Verilog, SystemC, and Bluespec) are arguably the most commonly used parallel languages. In a digital circuit the producer/consumer relation is a parallel structure: the data produced is temporarily stored in a local buffer, the size of which is determined by the relative rates of production and consumption. Furthermore, any implementation would be expected to include back-pressure and synchronization mechanisms to 1) halt the production before the buffer is full and 2) stall the consumption when the buffer is empty, to achieve a correct implementation. Buffering the data is not necessary when compiling individual kernels (e.g., stand-alone filters). However, it becomes a necessity, and often a major challenge, when compiling larger systems. Consider data streaming through a series of filters: buffers and back-pressure are necessary to hold the data between filters. Automatically inferring efficient buffering schemes without user assistance in the form of pragmas or annotations is a major challenge.
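The bounded buffer with back-pressure described above can be modeled in software as a FIFO whose push refuses data when full and whose pop refuses when empty; in hardware those refusals correspond to stalling the producer or the consumer. This sketch is our own illustration, not code from any of the surveyed tools.

```c
/* Software model of a hardware producer/consumer buffer: a bounded
 * FIFO. push returning 0 models back-pressure (the producer must
 * stall); pop returning 0 models an empty buffer (the consumer stalls). */
#define FIFO_CAP 4

typedef struct {
    int buf[FIFO_CAP];
    int head, tail, count;
} fifo_t;

static int fifo_push(fifo_t *f, int v)
{
    if (f->count == FIFO_CAP)
        return 0;                       /* full: producer must stall */
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return 1;
}

static int fifo_pop(fifo_t *f, int *v)
{
    if (f->count == 0)
        return 0;                       /* empty: consumer must stall */
    *v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}
```

Choosing FIFO_CAP is exactly the sizing problem the text describes: it must absorb the mismatch between production and consumption rates, and an HLS tool must infer it (or be told it via pragmas) for every inter-filter link.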
Edwards [12] makes the case that C-based languages are not well suited for HLS. The major challenges described in the paper for C-based languages apply to most HLLs. These challenges are the lack of: 1) timing information in the code; 2) size-based data types (or variable bit-length data types); 3) built-in concurrency model(s); and 4) local memories separated from the abstraction of one large shared memory. While all these points are valid, the main attraction of C-based languages is familiarity. Most HLS tools using C-based languages provide workarounds for one or more of these obstacles, as described in [12].
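Point 2 above can be made concrete: plain C has no size-based integer types, so HLS dialects add arbitrary-precision integer types of their own. A rough stand-in in standard C, shown here purely as an illustration (the type and helper names are ours), is an unsigned bitfield, which wraps modulo 2^width on assignment much like a fixed-width hardware register.

```c
/* Emulating a 12-bit register in plain C with an unsigned bitfield.
 * Values stored into v are reduced modulo 2^12, so arithmetic "wraps"
 * the way a 12-bit hardware datapath would. */
typedef struct {
    unsigned v : 12;   /* 12-bit value, range 0..4095 */
} uint12_t;

static unsigned uint12_add(uint12_t *r, unsigned x)
{
    r->v = r->v + x;   /* result reduced modulo 4096 on assignment */
    return r->v;
}
```

The workaround is imperfect (bitfields cannot be array elements of packed arbitrary widths, and layout is implementation-defined), which is precisely why HLS vendors supply dedicated arbitrary-precision types instead.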
The abstraction and semantic gaps between the hardware and software computing models are summarized in Table 2. Translating a HLL to a circuit requires a transformation of the sequential model into a spatial/parallel one, with the creation of custom sequencing, timed synchronizations, distributed storage, pipelining, etc. The storage in the von Neumann model is abstracted as a single large virtual address space with uniform access time (in theory). The spatial model is better served by multiple small local memories. The parallelism in the von Neumann model can be dynamic: threads are created and, on completion, relinquish resources. In hardware, every thread must be provisioned with resources statically. The software model relies on an implicit sequential consistency, where all instructions execute in program order and no instruction starts before all previous instructions have completed execution. Hardware execution is dataflow driven.
Raising the abstraction level of FPGA programming to that of CPU or GPU programming is a daunting task that is yet to be fully completed. It is of critical importance in the programming of accelerators, as opposed to the high-level design of arbitrary digital circuits, which is the focus of high-level synthesis. Hardware accelerators differ from general purpose logic design in one important way: the starting point of logic design is a device whose behavior is specified by a hardware description implemented in an HDL such as VHDL, Verilog, SystemC, SystemVerilog, or Bluespec. The starting point of a hardware accelerator is an existing software application, a subset of which, being frequently executed, is translated into hardware. That subset is, quasi by definition, a loop nest. Hopefully that loop nest is parallelizable and can therefore exploit the FPGA resources. By focusing on loop nests, the task of compiling HLLs to FPGAs is simplified and opportunities for loop transformations and optimizations abound. The ROCCC compiler takes this approach and is described later in this paper.
C. Related Work
As the number of tools supporting HLS for FPGAs has increased, so has the number of surveys comparing and contrasting such tools. However, the rapidly shifting landscape of HLS tools for reconfigurable computing makes most such endeavors obsolete within a few years.
A description of the historical evolution of HLS tools, starting with the pioneering work in the 1970s, can be found in [13]. The authors offer an interesting analysis of the reasons behind the successes and failures of the various generations of HLS tools. While the survey is not focused on HLS tools for FPGAs, it does mention several FPGA-specific tools, such as Handel-C, as well as general HLS tools that could be used for FPGAs.
The major research efforts in compiling high-level languages to reconfigurable computing are surveyed in [14]. The paper offers an in-depth analysis of the tools available at that time.
AutoESL is described in [15]. The paper also provides an extensive survey of HLS in general and of tools specifically for FPGA programming.
In [16] the authors reviewed six high-level languages/tools based on programming productivity and generated hardware performance (frequency, area). User experience with the targeted languages is recorded and normalized as a measure of productivity in this study. However, most of the tools evaluated in this work are no longer supported by their developers.
An extensive evaluation of 12 HLS tools in terms of capabilities, usability, and quality of results is presented in [17]. The authors use Sobel edge detection to evaluate the tools along eight specific criteria: documentation, learning curve, ease of implementation, abstraction level, data types, exploration, verification, and quality of the results.
Daoud et al. [18] survey past and current HLS tools.
IV. HIGH LEVEL LANGUAGE TOOLS
In this section we provide a brief description of the tools, where they were developed, and highlight some of their optimizations and the user experience of developing with each tool. The following tools can be divided into two classes: commercial software and research projects. We first cover the commercial software, followed by the university research projects.

A. Xilinx Vivado HLS
Vivado High-Level Synthesis is a complete HLS environment from Xilinx. It has been in development for the last several years, following Xilinx's acquisition of AutoESL [19]–[21]. Vivado HLS is available as a component of
Table 2 Features and Characteristics of Stored Program and Spatial Computation Models

Citations
Journal ArticleDOI

Are We There Yet? A Study on the State of High-Level Synthesis

TL;DR: HLS is currently a viable option for fast prototyping and for designs with a short time to market; to help close the QoR gap, the paper surveys literature focused on improving HLS.
Journal ArticleDOI

Software-defined Radios: Architecture, state-of-the-art, and challenges

TL;DR: In this article, a survey of the state-of-the-art software-defined radio (SDR) platforms in the context of wireless communication protocols is presented, with a focus on programmability, flexibility, portability, and energy efficiency.
Proceedings ArticleDOI

A survey on reconfigurable accelerators for cloud computing

TL;DR: A thorough survey of the frameworks for the efficient utilization of the FPGAs in the data centers and the hardware accelerators that have been implemented for the most widely used cloud computing applications are presented.
Proceedings Article

Do OS abstractions make sense on FPGAs

TL;DR: Coyote is built and evaluated, an open source, portable, configurable “shell” for FPGAs which provides a full suite of OS abstractions, working with the host OS.
Journal ArticleDOI

A Hybrid FPGA-Based System for EEG- and EMG-Based Online Movement Prediction

TL;DR: A novel Field Programmable Gate Array (FPGA) -based system for real-time movement prediction using physiological data that achieves a classification accuracy similar to systems with double precision floating-point precision.
References
Proceedings ArticleDOI

LLVM: a compilation framework for lifelong program analysis & transformation

TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Journal ArticleDOI

High-Level Synthesis for FPGAs: From Prototyping to Deployment

TL;DR: AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx are used as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains.
Journal ArticleDOI

SUIF: an infrastructure for research on parallelizing and optimizing compilers

TL;DR: The SUIF compiler is built into a powerful, flexible system that may be useful for many other researchers and the authors invite you to use and welcome your contributions to this infrastructure.
Proceedings ArticleDOI

LegUp: high-level synthesis for FPGA-based processor/accelerator systems

TL;DR: A new open source high-level synthesis tool called LegUp that allows software techniques to be used for hardware design and produces hardware solutions of comparable quality to a commercial high- level synthesis tool.
Book

High-Level Synthesis: from Algorithm to Digital Circuit

TL;DR: This book presents an excellent collection of contributions addressing different aspects of high-level synthesis from both industry and academia, and should be on each designer's and CAD developer's shelf.
Frequently Asked Questions (19)
Q1. What are the contributions in this paper?

In the past decade or so the authors have witnessed a steadily increasing interest in FPGAs as hardware accelerators: they provide an excellent mid-point between the reprogrammability of software devices (CPUs, DSPs, and GPUs) and the performance and low energy consumption of ASICs. In this paper, the authors review the history of using FPGAs as hardware accelerators and summarize the challenges facing the raising of the programming abstraction layers. The authors survey five High-Level Language tools for the development of FPGA programs: Xilinx Vivado, Altera OpenCL, BluespecBSV, ROCCC, and LegUp to provide an overview of their tool flow, the optimizations they provide, and a qualitative analysis of their hardware implementations of high level code.

The arithmetic for the matrix multiplication in this step is done in the Galois field GF(2^8), in which addition becomes XOR and multiplication becomes bit shifting and XORing.
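The GF(2^8) arithmetic just described can be sketched in a few lines: addition is XOR, and multiplication is shift-and-XOR with reduction by the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B). The function name is ours; the algorithm is the standard AES field multiplication.

```c
#include <stdint.h>

/* Multiplication in GF(2^8) as used by AES MixColumns: addition is XOR,
 * multiplication is shift-and-XOR, reducing modulo the AES polynomial
 * x^8 + x^4 + x^3 + x + 1 whenever a shift carries out of bit 7. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;             /* "add" (XOR) the current partial product */
        uint8_t hi = a & 0x80;
        a <<= 1;                /* multiply a by x */
        if (hi)
            a ^= 0x1B;          /* reduce modulo the AES polynomial */
        b >>= 1;
    }
    return p;
}
```

Because every step is an XOR or a one-bit shift, this operation maps to a handful of LUTs in hardware, which is why AES is such a popular FPGA acceleration benchmark.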

At the loop level, dataflow pipelining, and the common optimizations of loop-unrolling, loop-merging, loop-rotation, dead-code elimination, etc., are also available. 

There are several compiler optimizations that can be applied to OpenCL code: kernel vectorization, static memory coalescing, generating multiple compute units, and loop unrolling. 

FPGAs are programmed using Hardware Description Languages (HDLs) such as VHDL, Verilog, SystemC, and SystemVerilog that are used for digital circuit design and implementation. 

To have VivadoHLS process the input as a stream, and thus pass the input as a pointer, a protocol must be created to interface between the stream and the circuit. 

ROCCC generates a general-purpose kernel for any architecture, which includes architectures having high bandwidth and large memory latencies that often support many outstanding requests. 

Commercial as well as research projects using FPGA accelerators on a wide variety of applications have reported speed-up, over both CPUs and GPUs, ranging from one to three orders of magnitude as well as reduced energy consumption per result ranging from one to two orders of magnitude. 

Various applications using OpenCL to program FPGA accelerators have been demonstrated, such as information filtering [31], Monte Carlo simulation [30], finite difference [32], particle simulations [32], and video compression [33]. 

Since the information the authors are interested in is how the tool compiles the kernel and not the data passing, the authors elected to use an input array of fixed size to avoid the extra overhead. 

Most operations can be handled through a provided makefile, from compiling and simulating to automatic project creation and synthesis. 

the main challenge to FPGAs as hardware accelerators, namely the abstraction gap between application development and FPGA programming, not only remains unchanged but has probably gotten worse due to the increase in complexity of the applications enabled by the larger device sizes.
The OpenCL system overview is shown in Fig. 3. Unlike the OpenCL compiler for CPUs and GPUs, where parallel threads are executed on different cores, AOC transforms kernel functions into deeply pipelined hardware circuits to achieve parallelism. 

In the direct case, the compiled assembly for x86 has 32 instructions for the inner loop, meaning 32 machine instructions executed for every output pixel generated. 

The GUI provides the user a list of code regions (targeted at loops, function bodies, and other bracketed regions) that can be optimized using synthesis directives to guide the RTL generation. 

ROCCC does not make any assumptions regarding the interface to the outside world, e.g., memory, therefore unrolling eight folds would require that eight data elements can be fetched each cycle. 

It is possible to create hybrid designs with portions of code running on a soft-core processor communicating with custom hardware accelerators. 

An important note the authors want to point out is that for every test, the total number of write memory accesses is exactly the same because LegUp only duplicates the hardware engines, but does not merge their computations. 

The authors implemented four different versions, including cache blocked memory accesses to determine the best performing implementationVrow based access and nonmemory blocking.