It is well known that if each coefficient value of a digital filter is a sum of signed power-of-two (SPT) terms, the filter can be implemented without using multipliers. In the past decade, several methods have been developed for the design of filters whose coefficient values are sums of SPT terms. Most of these methods are for the design of filters where all the coefficient values have the same number of SPT terms. It has also been demonstrated recently that a significant advantage can be achieved if the coefficient values are allocated different numbers of SPT terms while keeping the total number of SPT terms for the filter fixed. In this paper, we present a new method for allocating the number of SPT terms to each coefficient value. In our method, the number of SPT terms allocated to a coefficient is determined by the statistical quantization step-size of that coefficient and the sensitivity of the frequency response of the filter to that coefficient. After the assignment of the SPT terms, an integer-programming algorithm is used to optimize the coefficient values. Our technique yields excellent results but does not guarantee an optimum assignment of SPT terms. Nevertheless, for any particular assignment of SPT terms, the result obtained is optimum with respect to that assignment.

Signed Power-of-Two Term Allocation Scheme for the Design of Digital Filters
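To make the notion of SPT terms concrete, the following is a minimal Python sketch that rewrites an integer coefficient as a sum of signed powers of two using the canonical signed-digit (non-adjacent) form, then multiplies by it with shifts and additions only. It is not the allocation method of the paper above: the function names `to_spt_terms` and `multiply_by_shifts` are hypothetical, and the sketch assumes the coefficient has already been scaled to an integer.

```python
def to_spt_terms(value: int) -> list[tuple[int, int]]:
    """Return (sign, exponent) pairs such that sum(sign * 2**exponent) == value.

    Uses the canonical signed-digit form, in which no two adjacent
    digit positions are both nonzero.
    """
    terms = []
    exponent = 0
    while value != 0:
        if value & 1:
            # value mod 4 == 1 -> digit +1, value mod 4 == 3 -> digit -1.
            # After subtracting the digit, value is divisible by 4,
            # so the next digit position is guaranteed to be zero.
            digit = 2 - (value & 3)
            terms.append((digit, exponent))
            value -= digit
        value >>= 1
        exponent += 1
    return terms


def multiply_by_shifts(x: int, terms: list[tuple[int, int]]) -> int:
    """Multiply x by the coefficient described by terms using shifts and adds."""
    total = 0
    for sign, exponent in terms:
        total += sign * (x << exponent)
    return total


# 23 is 10111 in binary (four nonzero bits) but needs only three SPT terms:
# 23 = 32 - 8 - 1.
terms_for_23 = to_spt_terms(23)
product = multiply_by_shifts(5, terms_for_23)
```

Each nonzero term costs one adder or subtractor in hardware, which is why the total number of SPT terms across all coefficients is the budget that an allocation scheme distributes.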

We evaluate the power and performance of the Rodinia benchmark suite using the Altera SDK for OpenCL targeting a Stratix V FPGA against a modern CPU and GPU. We study multiple OpenCL kernels per benchmark, ranging from direct ports of the original GPU implementations to loop-pipelined kernels specifically optimized for FPGAs. Based on our results, we find that even though OpenCL is functionally portable across devices, direct ports of GPU-optimized code do not perform well compared to kernels optimized with FPGA-specific techniques such as sliding windows. However, by exploiting FPGA-specific optimizations, it is possible to achieve up to 3.4x better power efficiency using an Altera Stratix V FPGA in comparison to an NVIDIA K20c GPU, and better run time and power efficiency in comparison to CPU. We also present preliminary results for Arria 10, which, due to hardened FPUs, exhibits noticeably better performance compared to Stratix V in floating-point-intensive benchmarks.

Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs

As general-purpose processors have hit the power wall and chip fabrication cost escalates alarmingly, coarse-grained reconfigurable architectures (CGRAs) are attracting increasing interest from both academia and industry, because they offer the performance and energy efficiency of hardware with the flexibility of software. However, CGRAs are not yet mature in terms of programmability, productivity, and adaptability. This article reviews the architecture and design of CGRAs thoroughly for the purpose of exploiting their full potential. First, a novel multidimensional taxonomy is proposed. Second, major challenges and the corresponding state-of-the-art techniques are surveyed and analyzed. Finally, the future development is discussed.

https://dl.acm.org/doi/pdf/10.1145/3357375

A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications

With the pursuit of improving compute performance under strict power constraints, there is an increasing need for deploying applications to heterogeneous hardware architectures with accelerators, such as GPUs and FPGAs. However, although these heterogeneous computing platforms are becoming widely available, they are very difficult to program, especially with FPGAs. As a result, the use of such platforms has been limited to a small subset of programmers with specialized hardware knowledge. To tackle this challenge, we introduce HeteroCL, a programming infrastructure composed of a Python-based domain-specific language (DSL) and an FPGA-targeted compilation flow. The HeteroCL DSL provides a clean programming abstraction that decouples algorithm specification from three important types of hardware customization in compute, data types, and memory architectures. HeteroCL further captures the interdependence among these different customization techniques, allowing programmers to explore various performance/area/accuracy trade-offs in a systematic and productive manner. In addition, our framework produces highly efficient hardware implementations for a variety of popular workloads by targeting spatial architecture templates such as systolic arrays and stencil with dataflow architectures. Experimental results show that HeteroCL allows programmers to explore the design space efficiently in both performance and accuracy by combining different types of hardware customization and targeting spatial architectures, while keeping the algorithm code intact.

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing

Modern high-level synthesis (HLS) tools greatly reduce the turn-around time of designing and implementing complex FPGA-based accelerators. They also expose various optimization opportunities, which cannot be easily explored at the register-transfer level. With the increasing adoption of the HLS design methodology and continued advances of synthesis optimization, there is a growing need for realistic benchmarks to (1) facilitate comparisons between tools, (2) evaluate and stress-test new synthesis techniques, and (3) establish meaningful performance baselines to track progress of the HLS technology. While several HLS benchmark suites already exist, they are primarily comprised of small textbook-style function kernels, instead of complete and complex applications. To address this limitation, we introduce Rosetta, a realistic benchmark suite for software programmable FPGAs. Designs in Rosetta are fully-developed applications. They are associated with realistic performance constraints, and optimized with advanced features of modern HLS tools. We believe that Rosetta is not only useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users. In this paper we describe the characteristics of our benchmarks and the optimization techniques applied to them. We further report experimental results on an embedded FPGA device as well as a cloud FPGA platform.

https://dl.acm.org/doi/pdf/10.1145/3174243.3174255

Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs

Recently, FPGA vendors such as Altera and Xilinx have released OpenCL SDKs for programming FPGAs. However, the architecture of FPGAs is significantly different from that of CPUs/GPUs, for which OpenCL was originally designed. Tuning OpenCL code for good performance on FPGAs is still an open problem, since the existing OpenCL tools and models designed for CPUs/GPUs are not directly applicable to FPGAs. In this paper, we present an FPGA-based performance analysis framework that can shed light on the performance bottleneck and thus guide the code tuning for OpenCL applications on FPGAs. In particular, we leverage static and dynamic analysis to develop an analytical performance model, which captures the key architectural features of FPGA abstractions under OpenCL. Then, we provide four programmer-interpretable metrics to quantify the performance potential of the OpenCL program with the input optimization combination for the next optimization step. We evaluate our framework with a number of use cases, and demonstrate that 1) our analytical performance model can accurately predict the performance of OpenCL programs with different optimization combinations on FPGAs, and 2) our tool can be used to effectively guide code tuning to alleviate the performance bottleneck.

A performance analysis framework for optimizing OpenCL applications on FPGAs

Big data applications often incur large costs in I/O, data transfer and copying overhead, especially when operating in cloud environments. Since most such computations are distributed, data processing operations offloaded to the network card (NIC) could potentially reduce the data movement overhead by enabling near-data processing at several points of a distributed system. Following this idea, in this paper we present StRoM, a programmable, FPGA-based RoCE v2 NIC supporting the offloading of application-level kernels. These kernels can be used to perform memory access operations directly from the NIC, such as traversal of remote data structures, as well as filtering or aggregation over RDMA data streams on both the sending and receiving sides. StRoM bypasses the CPU entirely and extends the semantics of RDMA to enable multi-step data access operations and in-network processing of RDMA streams. We demonstrate the versatility and potential of StRoM with four different kernels extending one-sided RDMA commands: 1) Traversal of remote data structures through pointer chasing, 2) Consistent retrieval of remote data blocks, 3) Data shuffling on the NIC by partitioning incoming data to different memory regions or CPU cores, and 4) Cardinality estimation on data streams.

StRoM: smart remote memory

We present an efficient combined single-path delay commutator-feedback (SDC-SDF) radix-2 pipelined fast Fourier transform architecture, which includes $\log_{2} N - 1$ SDC stages and 1 SDF stage. The SDC processing engine is proposed to achieve 100% hardware resource utilization by sharing the common arithmetic resources, including both adders and multipliers, in a time-multiplexed manner. Thus, the required number of complex multipliers is reduced to $\log_{4} N - 0.5$, compared with $\log_{2} N - 1$ for the other radix-2 SDC/SDF architectures. In addition, the proposed architecture requires a roughly minimal number of complex adders, $\log_{2} N + 1$, and complex delay memory, $2N + 1.5\log_{2} N - 1.5$.

A Combined SDC-SDF Architecture for Normal I/O Pipelined Radix-2 FFT
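As a worked illustration of the resource formulas quoted in the abstract above, the short Python sketch below evaluates them for a given transform size $N$. The function name `sdc_sdf_costs` and the dictionary keys are hypothetical; the sketch only plugs numbers into the stated expressions (using $\log_{4} N = \tfrac{1}{2}\log_{2} N$) and says nothing about how the architecture achieves them.

```python
def sdc_sdf_costs(n: int) -> dict[str, float]:
    """Evaluate the abstract's resource formulas for an N-point radix-2 FFT."""
    log2_n = n.bit_length() - 1
    if n < 4 or n != 1 << log2_n:
        raise ValueError("n must be a power of two and at least 4")
    return {
        # Proposed combined SDC-SDF design: log4(N) - 0.5
        "complex_multipliers": log2_n / 2 - 0.5,
        # Other radix-2 SDC/SDF designs, per the abstract: log2(N) - 1
        "baseline_complex_multipliers": log2_n - 1,
        # log2(N) + 1
        "complex_adders": log2_n + 1,
        # 2N + 1.5 * log2(N) - 1.5
        "complex_delay_memory": 2 * n + 1.5 * log2_n - 1.5,
    }


costs_1024 = sdc_sdf_costs(1024)
```

For $N = 1024$ the formulas give 4.5 complex multipliers against 9 for the baseline, 11 complex adders, and a complex delay memory of 2061.5 words.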

Considerable research effort has been devoted to accelerating relational database applications on FPGAs, due to their high energy efficiency and high throughput. Most of the existing studies are based on hardware description languages (HDLs). Recently, FPGA vendors have started to develop OpenCL SDKs for much better programmability. In this paper, we investigate the performance of relational database applications on OpenCL-based FPGAs. As a start, we study the performance of data partitioning, a core operation widely used in relational databases. Due to random memory accesses, data partitioning is time-consuming and can become a major bottleneck for database operators such as hash join. We start with the state-of-the-art OpenCL implementation, which was originally designed for CPUs/GPUs, and find that it suffers from lock overheads and memory bandwidth overheads. To reduce lock overheads, we develop a simple yet efficient multi-kernel approach that leverages two emerging features of the Altera OpenCL SDK, namely task kernels and channels. Moreover, on-chip buckets are employed to reduce the number of memory transactions. We further develop a cost model to guide the parameter configuration. We evaluate the proposed design on a recent Altera Stratix V FPGA. Our results demonstrate that 1) our cost model can accurately predict the performance of data partitioning under different parameter settings; 2) our proposed multi-kernel approach can achieve a 10.7X speedup over the existing OpenCL implementation. Also, the experiments with three case studies show that the optimized implementations can achieve 4–12X performance improvement over the original implementations.

A study of data partitioning on OpenCL-based FPGAs
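The idea of staging tuples in small buckets before writing them out, mentioned in the abstract above, can be sketched in plain Python as follows. This is only a software analogy under stated assumptions, not the paper's multi-kernel OpenCL design: the function name `partition_with_buckets`, the modulo hash, and the use of a flush counter as a proxy for memory transactions are all hypothetical choices made for illustration.

```python
def partition_with_buckets(keys, num_partitions, bucket_size):
    """Hash-partition keys, flushing each partition's bucket as one burst.

    Returns the partitioned output and the number of bucket flushes,
    used here as a simple stand-in for memory transactions.
    """
    buckets = [[] for _ in range(num_partitions)]  # small staging buffers
    output = [[] for _ in range(num_partitions)]   # stands in for off-chip memory
    transactions = 0

    for key in keys:
        p = key % num_partitions  # assumed hash function: simple modulo
        buckets[p].append(key)
        if len(buckets[p]) == bucket_size:
            # A full bucket is written out in a single burst.
            output[p].extend(buckets[p])
            buckets[p].clear()
            transactions += 1

    # Write out whatever remains in partially filled buckets.
    for p, bucket in enumerate(buckets):
        if bucket:
            output[p].extend(bucket)
            bucket.clear()
            transactions += 1

    return output, transactions


keys = list(range(64))
buffered_out, buffered_tx = partition_with_buckets(keys, 4, 8)
direct_out, direct_tx = partition_with_buckets(keys, 4, 1)
```

With 64 keys and 4 partitions, a bucket size of 8 needs 8 flushes, whereas a bucket size of 1 (one write per tuple) needs 64, while both produce the same partitions.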

Network Function Virtualization (NFV) virtualizes software network functions to offer flexibility in their design, management and deployment. Although GPUs have demonstrated their power in significantly accelerating network functions, they have not been effectively integrated into NFV systems for the following reasons. First, GPUs are severely underutilized in NFV systems with existing GPU virtualization approaches. Second, data isolation in the GPU memory is not guaranteed. Third, building an efficient network function on CPU-GPU architectures demands huge development efforts. In this paper, we propose G-NET, an NFV system with a GPU virtualization scheme that supports spatial GPU sharing, a service chain based GPU scheduler, and a scheme to guarantee data isolation in the GPU. We also develop an abstraction for building efficient network functions on G-NET, which significantly reduces development efforts. With our proposed design, G-NET enhances overall throughput by up to 70.8% and reduces the latency by up to 44.3%, in comparison with existing GPU virtualization solutions.

G-NET: Effective GPU Sharing in NFV Systems

Zeke Wang

Papers

A performance analysis framework for optimizing OpenCL applications on FPGAs

StRoM: smart remote memory

A Combined SDC-SDF Architecture for Normal I/O Pipelined Radix-2 FFT

A study of data partitioning on OpenCL-based FPGAs

G-NET: Effective GPU Sharing in NFV Systems