High-Level Language Tools for Reconfigurable Computing
Citations
Are We There Yet? A Study on the State of High-Level Synthesis
Software-defined Radios: Architecture, state-of-the-art, and challenges
A survey on reconfigurable accelerators for cloud computing
Do OS abstractions make sense on FPGAs
A Hybrid FPGA-Based System for EEG- and EMG-Based Online Movement Prediction
References
LLVM: a compilation framework for lifelong program analysis & transformation
High-Level Synthesis for FPGAs: From Prototyping to Deployment
SUIF: an infrastructure for research on parallelizing and optimizing compilers
LegUp: high-level synthesis for FPGA-based processor/accelerator systems
High-Level Synthesis: from Algorithm to Digital Circuit
Frequently Asked Questions (19)
Q2. What is the arithmetic for the matrix multiplication in this step?
The arithmetic for the matrix multiplication in this step is done in the Galois field GF(2^8), in which addition becomes XOR and multiplication becomes bit shifting and XORing.
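The shift-and-XOR arithmetic described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the reduction polynomial is assumed to be the AES one (x^8 + x^4 + x^3 + x + 1, i.e. 0x11B), which the answer does not state.

```python
# Assumption: reduction polynomial x^8 + x^4 + x^3 + x + 1 (the AES choice).
REDUCTION_POLY = 0x11B


def gf_add(a: int, b: int) -> int:
    """Addition in GF(2^8) is a bitwise XOR."""
    return a ^ b


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements using only shifts and XORs."""
    result = 0
    while b:
        if b & 1:            # lowest bit of b set: add the current multiple of a
            result ^= a
        a <<= 1              # multiply a by x
        if a & 0x100:        # degree-8 overflow: reduce modulo the polynomial
            a ^= REDUCTION_POLY
        b >>= 1
    return result


def gf_matvec(matrix, vector):
    """Matrix-vector product where every add and multiply is in GF(2^8)."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, value in zip(row, vector):
            acc = gf_add(acc, gf_mul(coeff, value))
        out.append(acc)
    return out
```

Under the assumed polynomial, gf_mul(0x57, 0x83) evaluates to 0xC1, the worked example given in the AES specification. In hardware, a multiplication by a small constant reduces to a handful of XOR gates, which is one reason this arithmetic maps well to FPGAs.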
Q3. What are the common optimizations of the Vivado HLS tool?
At the loop level, dataflow pipelining is available, along with common optimizations such as loop unrolling, loop merging, loop rotation, and dead-code elimination.
Q4. What are the compiler optimizations that can be applied to OpenCL code?
There are several compiler optimizations that can be applied to OpenCL code: kernel vectorization, static memory coalescing, generating multiple compute units, and loop unrolling.
Q5. What are the main languages used for FPGAs?
FPGAs are programmed using Hardware Description Languages (HDLs) such as VHDL, Verilog, SystemC, and SystemVerilog that are used for digital circuit design and implementation.
Q6. What protocol must be created to interface between the input and the circuit?
To have Vivado HLS process the input as a stream, and thus pass the input as a pointer, a protocol must be created to interface between the stream and the circuit.
Q7. What is the general purpose kernel for ROCCC?
ROCCC generates a general-purpose kernel for any architecture, which includes architectures having high bandwidth and large memory latencies that often support many outstanding requests.
Q8. How many orders of magnitude have FPGAs achieved?
Commercial as well as research projects using FPGA accelerators on a wide variety of applications have reported speed-up, over both CPUs and GPUs, ranging from one to three orders of magnitude as well as reduced energy consumption per result ranging from one to two orders of magnitude.
Q9. What are some examples of applications using OpenCL to program FPGA accelerators?
Various applications using OpenCL to program FPGA accelerators have been demonstrated, such as information filtering [31], Monte Carlo simulation [30], finite difference [32], particle simulations [32], and video compression [33].
Q10. Why did the authors choose to use an input array of fixed size?
Since the information the authors are interested in is how the tool compiles the kernel and not the data passing, the authors elected to use an input array of fixed size to avoid the extra overhead.
Q11. What is the common way to handle a project?
Most operations can be handled through a provided makefile, from compiling and simulating to automatic project creation and synthesis.
Q12. What is the main challenge to FPGAs as hardware accelerators?
The main challenge to FPGAs as hardware accelerators, namely the abstraction gap between application development and FPGA programming, not only remains unchanged but has probably gotten worse due to the increase in complexity of the applications enabled by larger device sizes.
Q13. What is the system overview of the Altera OpenCL SDK?
The OpenCL system overview is shown in Fig. 3. Unlike the OpenCL compiler for CPUs and GPUs, where parallel threads are executed on different cores, AOC transforms kernel functions into deeply pipelined hardware circuits to achieve parallelism.
Q14. How many instructions are executed for each output pixel?
In the direct case, the compiled assembly for x86 has 32 instructions for the inner loop, meaning 32 machine instructions executed for every output pixel generated.
Q15. What is the way to optimize the code regions?
The GUI provides the user a list of code regions (targeted at loops, function bodies, and other bracketed regions) that can be optimized using synthesis directives to guide the RTL generation.
Q16. Does ROCCC make any assumptions regarding the interface to the outside world?
ROCCC does not make any assumptions regarding the interface to the outside world (e.g., memory); unrolling by a factor of eight would therefore require that eight data elements can be fetched each cycle.
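The link between the unroll factor and the required fetch rate can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name and the summation kernel are hypothetical, and Python executes the reads sequentially, whereas unrolled hardware would issue them in the same cycle.

```python
def sum_unrolled(data, factor=8):
    """Sum `data` with the loop body replicated `factor` times per iteration.

    Returns (total, iterations). Each iteration consumes `factor` elements,
    which models the number of elements the memory interface would have to
    deliver per cycle for the unrolled circuit to run without stalling.
    """
    if len(data) % factor != 0:
        raise ValueError("length must be a multiple of the unroll factor")
    total = 0
    iterations = 0
    for base in range(0, len(data), factor):
        # One unrolled iteration: `factor` independent reads.
        for offset in range(factor):
            total += data[base + offset]
        iterations += 1
    return total, iterations
```

For 16 elements and a factor of 8, the model completes in 2 iterations instead of 16, but only if 8 elements are available per iteration. If the memory interface delivers fewer, the extra replicated hardware sits idle.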
Q17. What is the way to create a hybrid design?
It is possible to create hybrid designs with portions of code running on a soft-core processor communicating with custom hardware accelerators.
Q18. Why is the total number of write memory accesses the same?
The authors note that, for every test, the total number of write memory accesses is exactly the same because LegUp only duplicates the hardware engines but does not merge their computations.
Q19. How many versions of the code did the authors implement?
The authors implemented four different versions, including cache-blocked memory accesses, to determine the best-performing implementation: row-based access and non-memory blocking.