High-level synthesis of dynamic data structures: A case study using Vivado HLS
References
An efficient k-means clustering algorithm: analysis and implementation
LegUp: high-level synthesis for FPGA-based processor/accelerator systems
Designing Modular Hardware Accelerators in C with ROCCC 2.0
An overview of today's high-level synthesis tools
FPGA-based K-means clustering using tree-based data structures
Frequently Asked Questions (19)
Q2. What is the synthesis directive for Lloyd's algorithm?
In order to match the parallelism of computational units and memory ports, the authors partition the centre position and centroid buffer arrays into P banks using the array partitioning directive.
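As a hedged illustration, this kind of partitioning might look as follows in Vivado HLS C; the array names, sizes, and partitioning factor here are illustrative assumptions, not taken from the paper's listings:

```c
#include <assert.h>

#define P 4    /* number of banks, matching the assumed compute parallelism */
#define K 16   /* number of centres (illustrative) */
#define D 3    /* point dimensionality (illustrative) */

/* Hypothetical kernel fragment: ARRAY_PARTITION splits each array into
 * P cyclic banks so that P distance units can read centre positions and
 * update centroid buffers in the same cycle. */
int accumulate(int centre_positions[K][D], int centroid_buffer[K]) {
#pragma HLS ARRAY_PARTITION variable=centre_positions cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=centroid_buffer cyclic factor=4 dim=1
    int sum = 0;
    for (int k = 0; k < K; k++) {
#pragma HLS UNROLL factor=4
        for (int d = 0; d < D; d++)
            sum += centre_positions[k][d];  /* reads spread across banks */
        centroid_buffer[k]++;               /* writes spread across banks */
    }
    return sum;
}
```

In software this compiles as plain C (the pragmas are ignored by an ordinary compiler); under HLS the directives lift the memory-port bottleneck so the unrolled loop iterations can proceed in parallel.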
Q3. What is the data passed between recursive instances?
The data passed between recursive instances are the objects treeNode, centreSet (set of candidate centres), and the variable k (current set size).
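A minimal sketch of that recursive interface, with struct layouts and the node-counting body invented purely to keep the example self-contained (the real kernel performs the closest-centre search and pruning at each node):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kd-tree node mirroring the treeNode object the paper
 * passes between recursive instances. */
typedef struct treeNode {
    struct treeNode *left, *right;
    /* bounding-box and point data would live here in the real design */
} treeNode;

/* Hypothetical set of candidate centres (size illustrative). */
typedef struct {
    int centres[16];
} centreSet;

/* Each recursive instance receives the current node, the candidate
 * centre set, and k, the current size of that set. Here the body just
 * counts visited nodes so the sketch is runnable. */
int filter(const treeNode *node, const centreSet *candidates, int k) {
    if (node == NULL) return 0;
    /* ...closest-centre search and pruning would shrink k here... */
    return 1 + filter(node->left, candidates, k)
             + filter(node->right, candidates, k);
}
```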
Q4. How much latency is the filtering algorithm using?
The HLS design of the filtering algorithm also consumes a ‘close-to-hand-written’ amount of FPGA resources, but latency is initially degraded by a factor of 30×.
Q5. What is the way to automate the loop nest?
Automating this step requires a program analysis capable of identifying disjoint regions (in terms of access patterns) in the monolithic heap memory space.
Q6. What is the purpose of this paper?
The authors propose three research directions to improve HLS support for (widely used) programs operating on dynamic, pointer-based data structures: an analysis for finding tight bounds on dynamically allocated heap memory, an automated analysis of dependencies carried by data structures accessed through pointers, and an automated analysis to identify and privatise disjoint regions in the monolithic heap memory.
Q7. What is the use of the recursive function calls?
In addition, the implementation uses recursive function calls (beyond tail recursion) which usually requires the presence of a stack.
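The standard way to make such code synthesisable is to replace the non-tail recursion with an explicit, statically sized stack of pending work. A minimal sketch of that transformation on a tree traversal (node layout, depth bound, and the summing body are illustrative, not the paper's filtering kernel):

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    struct node *left, *right;
    int value;
} node;

#define MAX_DEPTH 64  /* static bound required for an on-chip HLS stack */

/* Recursion replaced by an explicit stack of pending nodes: the
 * transformation that removes the need for a call stack in hardware. */
int sum_iterative(node *root) {
    node *stack[MAX_DEPTH];
    int top = 0, sum = 0;
    if (root) stack[top++] = root;
    while (top > 0) {
        node *n = stack[--top];
        sum += n->value;
        if (n->right) stack[top++] = n->right;
        if (n->left)  stack[top++] = n->left;
    }
    return sum;
}
```

The price of this rewrite is that a worst-case stack depth (here MAX_DEPTH) must be known at compile time, which echoes the paper's theme of bounding dynamic resource usage statically.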
Q8. What is the way to avoid the pruning approach?
When inadequate memory is available to service an allocation request, the algorithm can abandon the pruning approach and instead consider all candidate centres [7].
Q9. What is the way to infer a bound of N − 1 centre sets?
A generic framework to infer such an average-case bound (semi-)automatically while still supporting the worst case requirement would be a valuable tool to support dynamic memory allocation in an HLS context.
Q10. What is the way to allocate memory?
Their custom implementation of the fixed-size allocator uses a free-list which keeps track of unoccupied memory space; the on-chip heap memory can accommodate an ‘average-case’ number of centre sets.
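A fixed-size free-list allocator of this kind can be sketched as below; the pool size, payload type, and index-based free list are assumptions for illustration, not the paper's implementation:

```c
#include <assert.h>

#define HEAP_BLOCKS 256  /* 'average-case' capacity (illustrative) */

typedef struct { int centres[16]; } centre_set_t;  /* payload, illustrative */

/* Static pool plus a singly linked free list threaded through the
 * unused blocks by index: allocation and deallocation are O(1). */
static centre_set_t pool[HEAP_BLOCKS];
static int next_free[HEAP_BLOCKS];
static int free_head = 0;
static int initialised = 0;

static void heap_init(void) {
    for (int i = 0; i < HEAP_BLOCKS; i++)
        next_free[i] = i + 1;
    next_free[HEAP_BLOCKS - 1] = -1;  /* end of free list */
    free_head = 0;
    initialised = 1;
}

/* Returns a block index, or -1 when the heap is exhausted (the point at
 * which the algorithm would fall back to considering all centres). */
int heap_alloc(void) {
    if (!initialised) heap_init();
    if (free_head < 0) return -1;
    int idx = free_head;
    free_head = next_free[idx];
    return idx;
}

void heap_free(int idx) {
    next_free[idx] = free_head;
    free_head = idx;
}
```

Because every block has the same size, no coalescing or size headers are needed, which is what makes such an allocator cheap enough for on-chip implementation.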
Q11. What is the synthesisability of the main kernel?
The synthesisability of the main kernel as in Listing 2 requires the removal of the recursive function calls and the calls to malloc and free (discussed in 1 and 2) and code transformations to improve QoR (discussed in 3 and 4).
Q12. How many centre sets must be retained in memory?
In this case, N − 1 centre sets must be retained in memory before they can be disposed, and hence, the heap memory for centre sets must be able to accommodate N−1 sets to ensure functional correctness in this worst-case scenario.
Q13. What is the synthesis directive for the RTL?
Using synthesis directives and a minor source-code modification to ensure correct indexing of the parallel instances of the centroid buffer, the authors are able to produce an RTL design which is architecturally similar to its hand-written counterpart.
Q14. How many centre sets can be disposed?
The authors select a bound of B = 256 centre sets (8 BRAMs), rather than the worst-case Nmax − 1, which practically causes no runtime degradation in the scenarios considered here.
Q15. What is the commonality of the benchmark cases?
These works have in common that the chosen benchmark cases are data-flow-centric, stream-based applications with simple control flow.
Q16. What is the difference between Lloyd’s algorithm and the filtering algorithm?
The former is data-flow centric and has regular control flow and regular memory accesses, whereas the implementation of the filtering algorithm uses dynamic memory management and is based on recursive traversal of a pointer-linked tree structure.
Q17. How do the authors split the tree structure into sub-trees?
As in the manual RTL design, the authors split the tree structure into P independent sub-trees to parallelise the application by instantiating P parallel processing kernels.
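The general pattern can be illustrated as follows, although the paper partitions an actual pointer-linked tree into P sub-trees, whereas this sketch emulates the split with a cyclic partition of an index range (P, the split scheme, and the summing "kernel" are all assumptions):

```c
#include <assert.h>

#define P 4  /* number of parallel processing kernels (illustrative) */

/* One kernel instance: processes only its own partition of the data.
 * Here "processing" is a sum so the sketch is runnable. */
int process_partition(const int *data, int n, int part) {
    int sum = 0;
    for (int i = part; i < n; i += P)  /* cyclic split, illustrative */
        sum += data[i];
    return sum;
}

/* The P kernels are independent, so in hardware they run concurrently;
 * in this software sketch they run sequentially and are then combined. */
int process_all(const int *data, int n) {
    int total = 0;
    for (int p = 0; p < P; p++)
        total += process_partition(data, n, p);
    return total;
}
```

The key property, in the sketch as in the paper's design, is that the partitions share no state, so each kernel can own a private sub-tree (here, a private slice) and no synchronisation is needed until results are combined.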
Q18. What is the average case of a tree degenerating?
In the average case, however, the tree is unlikely to be fully degenerate and the instantaneous memory requirement is significantly lower because centre sets can be disposed earlier.
Q19. What are the key computational parts of the filtering algorithm?
The key computational parts of the filtering algorithm (Listing 2) are the closest centre search (lines 2-6) and the candidate pruning (lines 12-18, pruningTest, which removes centres from the current set).
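The closest centre search can be sketched as a squared-Euclidean-distance argmin over the candidate set; the dimensionality, types, and function names below are illustrative assumptions, not the paper's Listing 2:

```c
#include <assert.h>

#define D 2  /* point dimensionality (illustrative) */

/* Squared Euclidean distance: the square root is unnecessary for
 * comparing distances, and avoiding it keeps the hardware simple. */
static long sq_dist(const int a[D], const int b[D]) {
    long s = 0;
    for (int d = 0; d < D; d++) {
        long diff = (long)a[d] - b[d];
        s += diff * diff;
    }
    return s;
}

/* Closest centre search: return the index of the candidate centre
 * nearest to the query point among the k candidates. */
int closest_centre(const int point[D], const int centres[][D], int k) {
    int best = 0;
    long best_d = sq_dist(point, centres[0]);
    for (int i = 1; i < k; i++) {
        long d = sq_dist(point, centres[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}
```

The pruning step then discards candidates that cannot be closest for any point inside the current tree node's bounding box, shrinking k before the recursive calls.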