UC Riverside
UC Riverside Previously Published Works
Title
High-level language tools for reconfigurable computing
Permalink
https://escholarship.org/uc/item/6xg1852d
Journal
Proceedings of the IEEE, 103(3)
ISSN
0018-9219
Authors
Windh, S
Ma, X
Halstead, RJ
et al.
Publication Date
2015-03-01
DOI
10.1109/JPROC.2015.2399275
Peer reviewed
eScholarship.org Powered by the California Digital Library
University of California

INVITED PAPER
High-Level Language Tools for Reconfigurable Computing
This paper provides a focused survey of five tools to improve productivity in developing code for FPGAs.
By Skyler Windh, Xiaoyin Ma, Student Member IEEE, Robert J. Halstead, Prerna Budhkar, Zabdiel Luna, Omar Hussaini, and Walid A. Najjar, Fellow IEEE
ABSTRACT | In the past decade or so we have witnessed a steadily increasing interest in FPGAs as hardware accelerators: they provide an excellent mid-point between the reprogrammability of software devices (CPUs, DSPs, and GPUs) and the performance and low energy consumption of ASICs. However, the programmability of FPGA-based accelerators remains one of the biggest obstacles to their wider adoption. Developing FPGA programs requires extensive familiarity with hardware design and experience with a tedious and complex tool chain. For half a century, layers of abstractions have been developed that simplify the software development process: languages, compilers, dynamically linked libraries, operating systems, APIs, etc. Very little, if any, such abstractions exist in the development of FPGA programs. In this paper, we review the history of using FPGAs as hardware accelerators and summarize the challenges facing the raising of the programming abstraction layers. We survey five High-Level Language tools for the development of FPGA programs: Xilinx Vivado, Altera OpenCL, Bluespec BSV, ROCCC, and LegUp to provide an overview of their tool flow, the optimizations they provide, and a qualitative analysis of their hardware implementations of high-level code.
KEYWORDS | Compiler optimization; high-level synthesis; max filter; reconfigurable computing
I. INTRODUCTION
In recent years we have witnessed a tremendous growth in the size and speed of FPGAs, accompanied by an ever widening spectrum of application domains where they are extensively used. Furthermore, a large number of specialized functional units are being added to their architectures, such as DSP units, multiported on-chip memories, and CPUs. Modern FPGAs are used as platforms for configurable computing that combine the flexibility and reprogrammability of CPUs with the efficiency of ASICs. Commercial as well as research projects using FPGA accelerators on a wide variety of applications have reported speedups, over both CPUs and GPUs, ranging from one to three orders of magnitude, as well as reduced energy consumption per result ranging from one to two orders of magnitude. Application domains have included signal, image, and video processing; encryption/decryption; decompression (text, integer data, images, etc.); databases [1], [2]; dense and sparse linear algebra; graph algorithms; data mining; information processing and text analysis; packet processing; intrusion detection; bioinformatics; financial analysis; seismic data analysis; etc.
FPGAs are programmed using Hardware Description Languages (HDLs) such as VHDL, Verilog, SystemC, and SystemVerilog that are used for digital circuit design and implementation. In these languages the circuit to be mapped on an FPGA is designed at a fairly low level: the datapaths and state machine controllers are built from the bottom up, timing primitives are used to provide synchronization between signals, the registering of data values is explicitly stated, parallelism is implicit, and sequential ordering of events must be explicitly enforced via the state machine. Traditionally trained software developers are not familiar with such programming paradigms. Beyond the program development stage, the tool chains are language and vendor specific and consist of a synthesis phase, where the HDL code is translated to a netlist; mapping, where logic expressions are translated into hardware primitives specific to a device; and place and route, where hardware logic blocks are placed on the device and wires are routed to connect them. This last phase attempts to solve an NP-complete problem using heuristics, such as simulated annealing, and may take hours or days to complete depending
Manuscript received August 21, 2014; revised December 2, 2014; accepted January 20, 2015. Date of current version April 14, 2015. This work has been supported in part by National Science Foundation Awards CCF-1219180 and IIS-1161997.
S. Windh, R. Halstead, P. Budhkar, Z. Luna, O. Hussaini, and W. A. Najjar are with the Department of Computer Science and Engineering, University of California at Riverside, Riverside, CA 92521 USA (e-mail: najjar@cs.ucr.edu).
X. Ma is with the Department of Electrical and Computer Engineering, University of California at Riverside, Riverside, CA 92521 USA.
Digital Object Identifier: 10.1109/JPROC.2015.2399275
0018-9219 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
390 Proceedings of the IEEE | Vol. 103, No. 3, March 2015

on the size of the circuit relative to the device size as well as
the timing constraints imposed by the user. The steepness of
the learning curve for such languages and tools makes their
use a daunting and expensive proposition for most projects.
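As a concrete illustration of the heuristic placement step mentioned above, the sketch below anneals a toy standard-cell placement: cells on a line, nets as cell pairs, and total wirelength as the cost. The netlist, cooling schedule, and the linear congruential generator are all invented for illustration; production placers use far richer cost models and schedules.

```c
#include <stdlib.h>

/* Toy placement by simulated annealing: N cells occupy N slots on a
 * line, each net connects two cells, and cost is total wirelength.
 * A deterministic LCG replaces rand() so runs are reproducible. */
#define N 8
static const int nets[][2] = { {0,7}, {1,6}, {2,5}, {3,4}, {0,4}, {3,7} };
#define NNETS ((int)(sizeof nets / sizeof nets[0]))

static unsigned lcg_state = 12345u;
static unsigned lcg(void) { return lcg_state = lcg_state * 1103515245u + 12345u; }

/* pos[c] is the slot assigned to cell c. */
static int wirelength(const int pos[N]) {
    int cost = 0;
    for (int i = 0; i < NNETS; i++) {
        int d = pos[nets[i][0]] - pos[nets[i][1]];
        cost += d < 0 ? -d : d;
    }
    return cost;
}

/* Swap two cells per step; keep improvements, and allow worsening
 * moves with a probability that shrinks as the "temperature" t drops. */
static int anneal(int pos[N]) {
    int best = wirelength(pos);
    for (int t = 1000; t > 0; t--) {
        int a = (int)(lcg() % N), b = (int)(lcg() % N);
        int before = wirelength(pos);
        int tmp = pos[a]; pos[a] = pos[b]; pos[b] = tmp;
        int after = wirelength(pos);
        if (after - before > 0 && (int)(lcg() % 1000) >= t) {
            tmp = pos[a]; pos[a] = pos[b]; pos[b] = tmp;  /* undo uphill move */
        } else if (after < best) {
            best = after;   /* track the best placement cost seen */
        }
    }
    return best;
}
```

Even this toy captures why the phase is slow: every candidate move requires re-evaluating the cost, and real devices have millions of cells and far more elaborate cost functions.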
This paper provides a qualitative survey of the currently available tools, both research and commercial, for programming FPGAs as hardware accelerators. We start with a historical perspective on the use of FPGA-based hardware accelerators (Section II), showing that the role of FPGAs as accelerators emerged very shortly after their introduction, as glue-logic replacements, in the 1980s. In Section III we discuss the efficiency of the hardware computing model over the stored program model and review the challenges posed by using High-Level Languages (HLLs) as programming tools to generate hardware structures on FPGAs. Related works and five High-Level Synthesis (HLS) tools are described in Section IV: three commercial tools (Xilinx Vivado, Altera OpenCL, and Bluespec BSV) and two university research tools (ROCCC and LegUp). We use a simple image filter, dilation, and AES encryption routines to describe the style of programming these tools and explore their capabilities in implementing compiler-based transformations that enhance the throughput of the generated structure (Sections V and VI). Area, performance, and power results for both benchmarks are compared in Section VII and, finally, concluding remarks are presented in Section VIII. Note: in this paper we report results and compare tools only to the extent allowed by the terms of the user license agreements.
II. A HISTORICAL PERSPECTIVE
The use of FPGAs as hardware accelerators is not a new concept. Very shortly after the introduction of the first SRAM-based FPGA device (Xilinx, 1985), the PAM (Programmable Active Memory) [3], [4] was conceived and built at the DEC Paris Research Lab (PReL). The PAM P0 consists of a 5 × 5 array of Xilinx XC3020 FPGAs (Fig. 1) connected to various memory modules as well as to a host workstation via a VME bus. It had a maximum frequency of 25 MHz, 0.5 MB of RAM, and communicated on a host bus of 8 MB/s. The PAM P1 was built using a slightly newer FPGA, the Xilinx XC3090. It operated with a maximum frequency of 40 MHz, had 4 MB of RAM, and used a 100-MB/s host bus. It was described as a "universal hardware co-processor closely coupled to a standard host computer" [5]. It was evaluated using 10 benchmark codes [5] consisting of: long multiplication, RSA cryptography, Ziv-Lempel compression, edit distance calculations, heat and Laplace equations, N-body calculations, binary 2-D convolution, the Boltzmann machine model, 3-D graphics (including translation, rotation, clipping, and perspective projection), and the discrete cosine transform. It is interesting to note that most of these benchmarks are still today subjects of research and development efforts in hardware acceleration. Bertin et al. [5] conclude that PAM delivered a performance comparable to that of ASIC chips or supercomputers of the time, and was one to two orders of magnitude faster than software. They also state that because of the PAM's large off-chip I/O bandwidth (6.4 Gb/s) it was ideally suited for "on-the-fly data acquisition and filtering," which is exactly the computational model, streaming data, adopted by most of the hardware acceleration projects that rely on FPGA platforms.
This first reconfigurable platform was rapidly followed by the SPLASH 1 (1989) and SPLASH 2 (1992) [6]–[10] projects at the Supercomputer Research Center (Table 1). Each was a linear array of FPGAs with local memory modules. They were designed for accelerating string-based operations and computations such as edit distance calculations. The SPLASH 2 was reported to achieve four orders of magnitude speedup, over a SUN SPARC 10, on edit distance computation using dynamic programming.

The PAM and SPLASH projects laid the foundation of reconfigurable computing by using FPGA-based hardware accelerators. In the past two decades the density and speed of FPGAs have grown tremendously: the density by several orders of magnitude, the clock speed by just over one order of magnitude. Both of these projects could easily be implemented on a single moderately sized modern FPGA device. However, the main challenge to FPGAs as hardware accelerators, namely the abstraction gap between application development and FPGA programming, not only remains unchanged but has probably gotten worse due to the increase in complexity of the applications enabled by larger device sizes. FPGA hardware accelerators are still beyond the reach of traditionally trained application code developers.

Fig. 1. Architecture of the DEC PReL PAM P0.

Table 1 Architecture Parameters of the SPLASH 1 and SPLASH 2 Accelerators
III. HARDWARE AND SOFTWARE COMPUTING MODELS
In this section we discuss two issues that define the complexity of compiling HLLs to hardware circuits: 1) the semantic gap between the sequential stored-program execution model implicit in these languages and the parallel, spatial model of hardware; and 2) the effects of abstractions, or lack thereof, on the complexity of the compiler.

A. Efficiency and Universality
The stored program model is a universal computing model: it is equivalent to a Turing machine, with the limitations on the size of the tape imposed by the virtual address space. It can therefore be programmed to execute any computable function. Hardware execution, on the other hand, is not universal unless it has an attached microprocessor. It is, however, extremely efficient. Consider an image filter applied on a 3 × 3 pixel window over a frame: the forall loop implemented in hardware can be both pipelined (let d be the pipeline depth) and unrolled so as to compute multiple windows concurrently (let the unroll factor be k). In the steady state, d × k operations are being executed concurrently, producing k output results per cycle. On a CPU, the innermost loop of a typical image filter requires 20–30 machine instructions per loop body, including nine load instructions. Assuming an average instruction-level parallelism (ILP) of two, each result takes 10–15 machine cycles, which matches the ratio of the respective clock speeds of CPUs and GPUs to FPGAs. However, that same loop can be replicated many times on the FPGA, achieving a much higher throughput (at least an order of magnitude). Furthermore, the ability to configure local customized storage on the FPGA makes it possible to reduce the number of memory accesses, mostly reads, by reusing already fetched data, resulting in a more efficient use of the memory bandwidth and lower energy consumption per task [11]. Hence the higher speedup or throughput observed on a very wide range of applications using FPGA accelerators over multicores (CPUs and GPUs). Further details on CPU efficiency for image filters are discussed in Section V-A.
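The loop body in question can be sketched as follows: a minimal 3 × 3 max (dilation) filter kernel, with nine loads per output pixel. Border handling is omitted for brevity, and the function name and interface are our own; this is the kind of loop body an HLS tool would pipeline (depth d) and unroll (factor k).

```c
/* One output pixel of a 3x3 max (dilation) filter. The caller must
 * pass interior coordinates (1 <= x < width-1, likewise for y);
 * border handling is omitted to keep the inner loop visible. */
static unsigned char max3x3(const unsigned char *img, int width, int x, int y)
{
    unsigned char m = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            unsigned char v = img[(y + dy) * width + (x + dx)]; /* one of 9 loads */
            if (v > m)
                m = v;
        }
    return m;
}
```

On a CPU this body compiles to a few dozen instructions executed per output; in hardware the nine comparisons become a small tree evaluated every cycle, and replicating the tree k times yields k outputs per cycle.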
B. Semantics of the Execution Models
CPUs and GPUs are inherently stored-program (or von Neumann) machines, and so are the programming languages used on them. Most of the languages in use today reflect the stored program paradigm. As such, they are bound by its sequential consistency, both at the language and machine levels. While CPU and GPU architectures exploit various forms of parallelism, such as instruction-, data-, and thread-level parallelism, they do so by circumventing the sequential consistency implied in the source code internally (branch prediction, out-of-order execution, SIMD parallelism, etc.), while preserving the appearance of a sequentially consistent execution externally (reorder buffers, precise interrupts, etc.). Compiling HLL code to a CPU or GPU is therefore the translation from one stored program machine model, the HLL, to another, the machine's Instruction Set Architecture (ISA). In the stored program paradigm the compiler can generate a parallel execution only when doing so is provably safe, in other words, when the record of execution can be proved, by the compiler, to be either identical or equivalent to the sequential execution. For example, in a single-level forall loop, any interleaving of the iterations produces a correct result. Also, in a single-threaded CPU execution the producer/consumer relationship is not a parallel construct, since the semantics imply that all the data must be produced before any datum can be consumed. Hence all the data is stored in memory by the producer loop before the consumer loop starts execution.
A digital circuit, on the other hand, is inherently parallel and spatial, with distributed storage and timed behavior. HDLs (e.g., VHDL, Verilog, SystemC, and Bluespec) are arguably the most commonly used parallel languages. In a digital circuit the producer/consumer relation is a parallel structure: the data produced is temporarily stored in a local buffer, the size of which is determined by the relative rates of production and consumption. Furthermore, any implementation would be expected to include back-pressure and synchronization mechanisms to 1) halt the production before the buffer is full and 2) stall the consumption when the buffer is empty, to achieve a correct implementation. Buffering the data is not necessary when compiling individual kernels (e.g., stand-alone filters). However, it becomes a necessity, and often a major challenge, when compiling larger systems. Consider data streaming through a series of filters: buffers and back-pressure are necessary to hold the data between filters. Automatically inferring efficient buffering schemes without user assistance in the form of pragmas or annotations is a major challenge.
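The bounded buffer with back-pressure described above can be modeled in software as a FIFO whose push refuses data when full and whose pop refuses when empty; in hardware those refusals correspond to stalling the producer or the consumer. This sketch is our own illustration, not code from any of the surveyed tools.

```c
/* Software model of a hardware producer/consumer buffer: a bounded
 * FIFO. push returning 0 models back-pressure (the producer must
 * stall); pop returning 0 models an empty buffer (the consumer stalls). */
#define FIFO_CAP 4

typedef struct {
    int buf[FIFO_CAP];
    int head, tail, count;
} fifo_t;

static int fifo_push(fifo_t *f, int v)
{
    if (f->count == FIFO_CAP)
        return 0;                       /* full: producer must stall */
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return 1;
}

static int fifo_pop(fifo_t *f, int *v)
{
    if (f->count == 0)
        return 0;                       /* empty: consumer must stall */
    *v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}
```

Choosing FIFO_CAP is exactly the sizing problem the text describes: it must absorb the mismatch between production and consumption rates, and an HLS tool must infer it (or be told it via pragmas) for every inter-filter link.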
Edwards [12] makes the case that C-based languages are not well suited for HLS. The major challenges described in the paper for C-based languages apply to most HLLs. These challenges are the lack of: 1) timing information in the code; 2) size-based data types (or variable bit-length data types); 3) built-in concurrency model(s); and 4) local memories separated from the abstraction of one large shared memory. While all these points are valid, the main attraction of C-based languages is familiarity. Most HLS tools using C-based languages provide workarounds for one or more of these obstacles, as described in [12].
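Point 2 above can be made concrete: plain C has no size-based integer types, so HLS dialects add arbitrary-precision integer types of their own. A rough stand-in in standard C, shown here purely as an illustration (the type and helper names are ours), is an unsigned bitfield, which wraps modulo 2^width on assignment much like a fixed-width hardware register.

```c
/* Emulating a 12-bit register in plain C with an unsigned bitfield.
 * Values stored into v are reduced modulo 2^12, so arithmetic "wraps"
 * the way a 12-bit hardware datapath would. */
typedef struct {
    unsigned v : 12;   /* 12-bit value, range 0..4095 */
} uint12_t;

static unsigned uint12_add(uint12_t *r, unsigned x)
{
    r->v = r->v + x;   /* result reduced modulo 4096 on assignment */
    return r->v;
}
```

The workaround is imperfect (bitfields cannot be array elements of packed arbitrary widths, and layout is implementation-defined), which is precisely why HLS vendors supply dedicated arbitrary-precision types instead.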
The abstraction and semantic gaps between the hardware and software computing models are summarized in Table 2. Translating a HLL to a circuit requires a transformation of the sequential model into a spatial/parallel one, with the creation of custom sequencing, timed synchronizations, distributed storage, pipelining, etc. The storage in the von Neumann model is abstracted as a single large virtual address space with uniform access time (in theory). The spatial model is better served by multiple small local memories. The parallelism in the von Neumann model can be dynamic: threads are created and, on completion, relinquish resources. In hardware, every thread must be provisioned with resources statically. The software model relies on an implicit sequential consistency, where all instructions execute in program order and no instruction starts before all previous instructions have completed execution. Hardware execution is dataflow driven.
Raising the abstraction level of FPGA programming to that of CPU or GPU programming is a daunting task that is yet to be fully completed. It is of critical importance in the programming of accelerators, as opposed to the high-level design of arbitrary digital circuits, which is the focus of high-level synthesis. Hardware accelerators differ from general purpose logic design in one important way: the starting point of logic design is a device whose behavior is specified by a hardware description implemented in an HDL such as VHDL, Verilog, SystemC, SystemVerilog, or Bluespec. The starting point of a hardware accelerator is an existing software application, a subset of which, being frequently executed, is translated into hardware. That subset is, quasi by definition, a loop nest. Hopefully that loop nest is parallelizable and can therefore exploit the FPGA resources. By focusing on loop nests, the task of compiling HLLs to FPGAs is simplified and opportunities for loop transformations and optimizations abound. The ROCCC compiler takes this approach and is described later in this paper.
C. Related Work
As the number of tools supporting HLS for FPGAs has increased, so has the number of surveys comparing and contrasting such tools. However, the rapidly shifting landscape of HLS tools for reconfigurable computing makes most such endeavors obsolete within a few years.
A description of the historical evolution of HLS tools, starting with the pioneering work in the 1970s, can be found in [13]. The authors offer an interesting analysis of the reasons behind the successes and failures of the various generations of HLS tools. While the survey is not focused on HLS tools for FPGAs, it does mention several FPGA-specific tools, such as Handel-C, as well as general HLS tools that could be used for FPGAs.
The major research efforts in compiling high-level languages to reconfigurable computing are surveyed in [14]. The paper offers an in-depth analysis of the tools available at that time.
AutoESL is described in [15]. The paper also provides an extensive survey of HLS in general and of tools specifically for FPGA programming.
In [16] the authors reviewed six high-level languages/tools based on programming productivity and generated hardware performance (frequency, area). User experience with the targeted languages is recorded and normalized as a measure of productivity in this study. However, most of the tools evaluated in this work are no longer supported by their developers.
An extensive evaluation of 12 HLS tools in terms of capabilities, usability, and quality of results is presented in [17]. The authors use Sobel edge detection to evaluate the tools along eight specific criteria: documentation, learning curve, ease of implementation, abstraction level, data types, exploration, verification, and quality of the results.
Daoud et al. [18] survey past and current HLS tools.
IV. HIGH LEVEL LANGUAGE TOOLS
In this section we provide a brief description of the tools, where they were developed, and highlight some of their optimizations and the user experience of developing with each tool. The following tools can be divided into two classes: commercial software and research projects. We first cover the commercial software, followed by the university research projects.

A. Xilinx Vivado HLS
Vivado High-Level Synthesis is a complete HLS environment from Xilinx. It has been in development for the last several years, following Xilinx's acquisition of AutoESL [19]–[21]. Vivado HLS is available as a component of
Table 2 Features and Characteristics of Stored Program and Spatial Computation Models

Citations
Journal ArticleDOI

Are We There Yet? A Study on the State of High-Level Synthesis

TL;DR: HLS is currently a viable option for fast prototyping and for designs with a short time to market; to help close the QoR gap, the paper surveys literature focused on improving HLS.
Journal ArticleDOI

Software-defined Radios: Architecture, state-of-the-art, and challenges

TL;DR: In this article, a survey of the state-of-the-art software-defined radio (SDR) platforms in the context of wireless communication protocols is presented, with a focus on programmability, flexibility, portability, and energy efficiency.
Proceedings ArticleDOI

A survey on reconfigurable accelerators for cloud computing

TL;DR: A thorough survey of the frameworks for the efficient utilization of the FPGAs in the data centers and the hardware accelerators that have been implemented for the most widely used cloud computing applications are presented.
Proceedings Article

Do OS abstractions make sense on FPGAs

TL;DR: Coyote is built and evaluated, an open source, portable, configurable “shell” for FPGAs which provides a full suite of OS abstractions, working with the host OS.
Journal ArticleDOI

A Hybrid FPGA-Based System for EEG- and EMG-Based Online Movement Prediction

TL;DR: A novel Field Programmable Gate Array (FPGA) -based system for real-time movement prediction using physiological data that achieves a classification accuracy similar to systems with double precision floating-point precision.
References
Proceedings ArticleDOI

LLVM: a compilation framework for lifelong program analysis & transformation

TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Journal ArticleDOI

High-Level Synthesis for FPGAs: From Prototyping to Deployment

TL;DR: AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx are used as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains.
Journal ArticleDOI

SUIF: an infrastructure for research on parallelizing and optimizing compilers

TL;DR: The SUIF compiler is built into a powerful, flexible system that may be useful for many other researchers and the authors invite you to use and welcome your contributions to this infrastructure.
Proceedings ArticleDOI

LegUp: high-level synthesis for FPGA-based processor/accelerator systems

TL;DR: A new open source high-level synthesis tool called LegUp that allows software techniques to be used for hardware design and produces hardware solutions of comparable quality to a commercial high- level synthesis tool.
Book

High-Level Synthesis: from Algorithm to Digital Circuit

TL;DR: This book presents an excellent collection of contributions addressing different aspects of high-level synthesis from both industry and academia, and should be on each designer's and CAD developer's shelf.
Frequently Asked Questions (19)
Q1. What are the contributions in this paper?

In the past decade or so the authors have witnessed a steadily increasing interest in FPGAs as hardware accelerators: they provide an excellent mid-point between the reprogrammability of software devices (CPUs, DSPs, and GPUs) and the performance and low energy consumption of ASICs. In this paper, the authors review the history of using FPGAs as hardware accelerators and summarize the challenges facing the raising of the programming abstraction layers. The authors survey five High-Level Language tools for the development of FPGA programs: Xilinx Vivado, Altera OpenCL, BluespecBSV, ROCCC, and LegUp to provide an overview of their tool flow, the optimizations they provide, and a qualitative analysis of their hardware implementations of high level code.

The arithmetic for the matrix multiplication in this step is done in the Galois field GF(2^8), in which addition becomes XOR and multiplication becomes bit shifting and XORing.
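The GF(2^8) arithmetic just described can be sketched in a few lines: addition is XOR, and multiplication is shift-and-XOR with reduction by the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B). The function name is ours; the algorithm is the standard AES field multiplication.

```c
#include <stdint.h>

/* Multiplication in GF(2^8) as used by AES MixColumns: addition is XOR,
 * multiplication is shift-and-XOR, reducing modulo the AES polynomial
 * x^8 + x^4 + x^3 + x + 1 whenever a shift carries out of bit 7. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;             /* "add" (XOR) the current partial product */
        uint8_t hi = a & 0x80;
        a <<= 1;                /* multiply a by x */
        if (hi)
            a ^= 0x1B;          /* reduce modulo the AES polynomial */
        b >>= 1;
    }
    return p;
}
```

Because every step is an XOR or a one-bit shift, this operation maps to a handful of LUTs in hardware, which is why AES is such a popular FPGA acceleration benchmark.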

At the loop level, dataflow pipelining, and the common optimizations of loop-unrolling, loop-merging, loop-rotation, dead-code elimination, etc., are also available. 

There are several compiler optimizations that can be applied to OpenCL code: kernel vectorization, static memory coalescing, generating multiple compute units, and loop unrolling. 

FPGAs are programmed using Hardware Description Languages (HDLs) such as VHDL, Verilog, SystemC, and SystemVerilog that are used for digital circuit design and implementation. 

To have VivadoHLS process the input as a stream, and thus pass the input as a pointer, a protocol must be created to interface between the stream and the circuit. 

ROCCC generates a general-purpose kernel for any architecture, which includes architectures having high bandwidth and large memory latencies that often support many outstanding requests. 

Commercial as well as research projects using FPGA accelerators on a wide variety of applications have reported speed-up, over both CPUs and GPUs, ranging from one to three orders of magnitude as well as reduced energy consumption per result ranging from one to two orders of magnitude. 

Various applications using OpenCL to program FPGA accelerators have been demonstrated, such as information filtering [31], Monte Carlo simulation [30], finite difference [32], particle simulations [32], and video compression [33]. 

Since the information the authors are interested in is how the tool compiles the kernel and not the data passing, the authors elected to use an input array of fixed size to avoid the extra overhead. 

Most operations can be handled through a provided makefile, from compiling and simulating to automatic project creation and synthesis. 

the main challenge to FPGAs as hardware accelerators, namely the abstraction gap between application development and FPGA programming, not only remains unchanged but has probably gotten worse due to the increase in complexity of the applications enabled by the larger device sizes.
The OpenCL system overview is shown in Fig. 3. Unlike the OpenCL compiler for CPUs and GPUs, where parallel threads are executed on different cores, AOC transforms kernel functions into deeply pipelined hardware circuits to achieve parallelism. 

In the direct case, the compiled assembly for x86 has 32 instructions for the inner loop, meaning 32 machine instructions executed for every output pixel generated. 

The GUI provides the user a list of code regions (targeted at loops, function bodies, and other bracketed regions) that can be optimized using synthesis directives to guide the RTL generation. 

ROCCC does not make any assumptions regarding the interface to the outside world, e.g., memory, therefore unrolling eight folds would require that eight data elements can be fetched each cycle. 

It is possible to create hybrid designs with portions of code running on a soft-core processor communicating with custom hardware accelerators. 

An important note the authors want to point out is that for every test, the total number of write memory accesses is exactly the same because LegUp only duplicates the hardware engines, but does not merge their computations. 

The authors implemented four different versions, including cache blocked memory accesses to determine the best performing implementationVrow based access and nonmemory blocking.