
Showing papers in "ACM Transactions on Reconfigurable Technology and Systems in 2017"


Journal ArticleDOI
TL;DR: A scalable parallel framework is proposed that exploits four levels of parallelism in hardware acceleration, and a systematic design space exploration methodology is put forward to search for the optimal solution that maximizes accelerator throughput under the FPGA constraints.
Abstract: Deep convolutional neural networks (CNNs) have gained great success in various computer vision applications. State-of-the-art CNN models for large-scale applications are computation intensive and memory expensive and, hence, are mainly processed on high-performance processors like server CPUs and GPUs. However, there is an increasing demand for high-accuracy or real-time object detection tasks in large-scale clusters or embedded systems, which requires energy-efficient accelerators because of green-computing requirements or limited battery capacity. Due to the advantages of energy efficiency and reconfigurability, Field-Programmable Gate Arrays (FPGAs) have been widely explored as CNN accelerators. In this article, we present an in-depth analysis of the computation complexity and memory footprint of each CNN layer type. Then a scalable parallel framework is proposed that exploits four levels of parallelism in hardware acceleration. We further put forward a systematic design space exploration methodology to search for the optimal solution that maximizes accelerator throughput under FPGA constraints such as on-chip memory, computational resources, external memory bandwidth, and clock frequency. Finally, we demonstrate the methodology by optimizing three representative CNNs (LeNet, AlexNet, and VGG-S) on a Xilinx VC709 board. The average performance of the three accelerators is 424.7, 445.6, and 473.4 GOP/s, respectively, under a 100MHz working frequency, which significantly outperforms the CPU and previous work.
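
To make the design-space exploration step concrete, here is a minimal sketch of the idea, assuming a brute-force search over four illustrative parallelism factors and a toy DSP/bandwidth cost model; the budgets and formulas below are placeholders, not the authors' actual models:

    from itertools import product

    DSP_BUDGET = 3600          # assumed VC709-class DSP slice count
    BW_GOPS_CAP = 500.0        # toy roofline cap from external memory bandwidth

    def explore(freq_mhz=100):
        """Return the throughput-optimal parallelism tuple that fits the budget."""
        best = None
        # Four illustrative parallelism levels: batch, feature map, kernel, pixel.
        for pb, pf, pk, px in product([1, 2], [8, 16, 32, 64], [1, 9], [1, 2, 4]):
            macs = pb * pf * pk * px            # toy resource model: one DSP per MAC
            if macs > DSP_BUDGET:
                continue                        # violates the resource constraint
            gops = min(2 * macs * freq_mhz / 1e3, BW_GOPS_CAP)   # 1 MAC = 2 ops
            if best is None or gops > best[0]:
                best = (gops, (pb, pf, pk, px))
        return best

    print(explore())   # -> (throughput in GOP/s, best (pb, pf, pk, px))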

82 citations


Journal ArticleDOI
TL;DR: An efficient, lightweight, and scalable PUF identification (ID) generator circuit that offers a compact design with good uniqueness and reliability properties and is specifically designed for FPGAs, and the proposed post-characterisation method can be generally used for any FPGA-based PUF design.
Abstract: Physical unclonable functions (PUFs), a form of physical security primitive, enable digital identifiers to be extracted from devices, such as field programmable gate arrays (FPGAs). Many PUF implementations have been proposed to generate these unique n-bit binary strings. However, they often offer insufficient uniqueness and reliability when implemented on FPGAs and can consume excessive resources. To address these problems, in this article we present an efficient, lightweight, and scalable PUF identification (ID) generator circuit that offers a compact design with good uniqueness and reliability properties and is specifically designed for FPGAs. A novel post-characterisation methodology is also proposed that improves the reliability of a PUF without the need for any additional hardware resources. Moreover, the proposed post-characterisation method can be generally used for any FPGA-based PUF designs. The PUF ID generator consumes 8.95% of the hardware resources of a low-cost Xilinx Spartan-6 LX9 FPGA and 0.81% of a Xilinx Artix-7 FPGA. Experimental results show good uniqueness, reliability, and uniformity with no occurrence of bit-aliasing. In particular, the reliability of the PUF is close to 100% over an environmental temperature range of 25°C to 70°C with ± 10% variation in the supply voltage.
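
For reference, a minimal sketch of how the quoted PUF quality metrics are conventionally computed from response bit-strings (uniqueness as mean inter-chip Hamming distance, reliability from the worst intra-chip Hamming distance); the random IDs below merely stand in for measured responses, and the paper's post-characterisation method itself is not reproduced:

    import numpy as np

    def hamming_pct(a, b):
        return 100.0 * np.count_nonzero(a != b) / a.size

    def uniqueness(responses):                # one n-bit ID per chip; ideal: 50%
        hds = [hamming_pct(responses[i], responses[j])
               for i in range(len(responses))
               for j in range(i + 1, len(responses))]
        return np.mean(hds)

    def reliability(reference, remeasurements):   # ideal: 100%
        worst = max(hamming_pct(reference, r) for r in remeasurements)
        return 100.0 - worst

    rng = np.random.default_rng(0)
    ids = [rng.integers(0, 2, 128) for _ in range(10)]    # 10 chips, 128-bit IDs
    # Re-measurements of chip 0 with ~2% bit-flip noise (illustrative).
    noisy = [np.where(rng.random(128) < 0.02, 1 - ids[0], ids[0]) for _ in range(5)]
    print(uniqueness(ids), reliability(ids[0], noisy))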

53 citations


Journal ArticleDOI
TL;DR: A hardware-based bandwidth compression technique that can be applied to field-programmable gate array (FPGA)--based high-performance computation with a logically wider effective memory bandwidth, and a multichannel serializer and deserializer that enable applications to use multiple channels of computational data with the bandwidth compression.
Abstract: Although computational performance is often limited by insufficient bandwidth to/from an external memory, it is not easy to physically increase off-chip memory bandwidth. In this study, we propose a hardware-based bandwidth compression technique that can be applied to field-programmable gate array (FPGA)--based high-performance computation with a logically wider effective memory bandwidth. Our proposed hardware approach can boost the performance of FPGA-based stream computations by applying a data compression technique to effectively transfer more data streams. To apply this data compression technique to bandwidth compression via hardware, several requirements must first be satisfied, including an acceptable level of compression performance and a sufficiently small hardware footprint. Our proposed hardware-based bandwidth compressor utilizes an efficient prediction-based data compression algorithm. Moreover, we propose a multichannel serializer and deserializer that enable applications to use multiple channels of computational data with the bandwidth compression. The serializer encodes compressed data blocks of multiple channels into a data stream, which is efficiently written to an external memory. Based on a preliminary evaluation, we define an encoding format considering both a high compression ratio and a small hardware area. As a result, we demonstrate that our area-saving bandwidth compressor increases the performance of an FPGA-based fluid dynamics simulation by deploying more processing elements to exploit spatial parallelism with the enhanced memory bandwidth.
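
To illustrate the prediction-based idea in software, a hedged sketch assuming a simple last-value predictor and a variable-length byte encoding of the XOR residual; the paper's actual predictor and encoding format differ:

    def compress(words):                  # words: list of 32-bit integers
        out, prev = [], 0
        for w in words:
            residual = w ^ prev           # last-value prediction: residual is
            prev = w                      # small when consecutive words are close
            nbytes = max(1, (residual.bit_length() + 7) // 8)
            out.append((nbytes, residual.to_bytes(4, "little")[:nbytes]))
        return out                        # (length tag, 1-4 payload bytes) each

    def decompress(blocks):
        vals, prev = [], 0
        for nbytes, payload in blocks:
            residual = int.from_bytes(payload + b"\x00" * (4 - nbytes), "little")
            prev ^= residual              # undo the prediction
            vals.append(prev)
        return vals

    data = [100, 101, 101, 96, 1000, 1002]
    assert decompress(compress(data)) == data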

16 citations


Journal ArticleDOI
TL;DR: This work compares the micro-architecture, performance, and area of two soft-processor overlays: the Octavo multi-threaded soft-processor and the MXP soft vector processor, and finds that Octavo's higher operating frequency and MXP's more efficient code execution result in similar performance from both, but with a penalty of an order of magnitude greater area.
Abstract: Field-Programmable Gate Arrays (FPGAs) can yield higher performance and lower power than software solutions on CPUs or GPUs. However, designing with FPGAs requires specialized hardware design skills and hours-long CAD processing times. To reduce and accelerate the design effort, we can implement an overlay architecture on the FPGA, on which we then more easily construct the desired system, but at a large cost in performance and area relative to a direct FPGA implementation. In this work, we compare the micro-architecture, performance, and area of two soft-processor overlays: the Octavo multi-threaded soft-processor and the MXP soft vector processor. To measure the area and performance penalties of these overlays relative to the underlying FPGA hardware, we compare direct FPGA implementations of the micro-benchmarks written in C and synthesized with the LegUp HLS tool, and also written in the Verilog HDL. Overall, Octavo's higher operating frequency and MXP's more efficient code execution result in similar performance from both, within an order of magnitude of direct FPGA implementations, but with a penalty of an order of magnitude greater area.

10 citations


Journal ArticleDOI
TL;DR: This article presents a method based on linear programming (LP) that determines the optimal operation distribution for a particular device and application with respect to performance, power, or dependability metrics, and demonstrates its effectiveness with two case studies involving dot-product and distance-calculation kernels on a range of Virtex-5 FPGAs.
Abstract: Field-programmable gate arrays (FPGAs) are an increasingly attractive alternative to traditional microprocessor-based computing architectures in extreme-computing domains, such as aerospace and supercomputing. FPGAs provide several resource types that offer different tradeoffs between speed, power, and area, which makes FPGAs highly flexible for varying application computational requirements. However, since an application's computational operations can map to different resource types, a major challenge in leveraging resource-diverse FPGAs is determining the optimal distribution of these operations across the device's available resources for varying FPGA devices, resulting in an extremely large design space. To facilitate fast design-space exploration, this article presents a method based on linear programming (LP) that determines the optimal operation distribution for a particular device and application with respect to performance, power, or dependability metrics. Our LP method is an effective tool for exploring early designs by quickly analyzing thousands of FPGAs to determine the best FPGA devices and operation distributions, which significantly reduces design time. We demonstrate our LP method's effectiveness with two case studies involving dot-product and distance-calculation kernels on a range of Virtex-5 FPGAs. Results show that our LP method selects optimal distributions of operations to within an average of 4% of actual values.
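
A toy version of such an LP, using scipy.optimize.linprog and made-up per-operator resource costs and budgets (the authors' real model covers more resource types, devices, and metrics): decide how many multipliers to build from DSP slices versus LUT fabric to maximize operators per cycle on a fixed device.

    from scipy.optimize import linprog

    # x = [mults_on_dsps, mults_on_luts]; linprog minimizes, so negate to maximize.
    c = [-1.0, -1.0]                      # maximize total multipliers
    A = [[3, 0],                          # DSP48s consumed per DSP-based multiplier
         [0, 700],                        # LUTs consumed per LUT-based multiplier
         [50, 700]]                       # shared routing/logic budget (toy)
    b = [288, 28800, 40000]               # assumed Virtex-5-like budgets
    res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
    print(res.x)                          # optimal (fractional) operation distribution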

9 citations


Journal ArticleDOI
TL;DR: A novel power-efficient, fast, and versatile hardware architecture whose objective is to monitor a set of target patterns and maintain their frequency counts over a stream of data, and demonstrates that FIM hardware acceleration is particularly efficient for large and low-density datasets.
Abstract: Stream processing has become extremely popular for analyzing huge volumes of data for a variety of applications, including IoT, social networks, retail, and software log analysis. Streams of data are produced continuously and are mined to extract patterns characterizing the data. A class of data-mining algorithms, called generate-and-test, produces a set of candidate patterns that are then evaluated over the data. The main challenges for these algorithms are to achieve high throughput, low latency, and reduced power consumption. In this article, we present a novel power-efficient, fast, and versatile hardware architecture whose objective is to monitor a set of target patterns and maintain their frequency counts over a stream of data. This accelerator can be used to accelerate data-mining algorithms, including itemset and sequence mining. The massive fine-grain reconfiguration capability of field-programmable gate array (FPGA) technologies is ideal for implementing the high number of pattern-detection units needed for these intensive data-mining applications. We have thus designed and implemented an IP that features high-density FPGA occupation and a high working frequency. We provide a detailed description of the IP's internal micro-architecture and its actual implementation and optimization for the targeted FPGA resources. We validate our architecture by developing a co-designed implementation of the Apriori Frequent Itemset Mining (FIM) algorithm, and perform numerous experiments against existing hardware and software solutions. We demonstrate that FIM hardware acceleration is particularly efficient for large and low-density datasets (i.e., long-tailed datasets). Our IP reaches a data throughput of 250 million items/s and monitors up to 11.6k patterns simultaneously, on a prototyping board that consumes 24W overall in the worst case. Furthermore, our hardware accelerator remains generic and can be integrated into other generate-and-test algorithms.
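
A software model of the monitoring idea: each hardware pattern-detection unit corresponds to one candidate itemset whose support counter increments whenever the itemset is contained in a streamed transaction. The names and structure here are illustrative, not the IP's micro-architecture:

    from itertools import combinations

    def count_supports(stream, candidates):
        support = {c: 0 for c in candidates}
        for transaction in stream:        # one transaction = one set of items
            t = frozenset(transaction)
            for c in candidates:          # each candidate models one detection unit
                if c <= t:                # pattern detected in this transaction
                    support[c] += 1
        return support

    stream = [{1, 2, 3}, {2, 3}, {1, 3, 4}, {2, 3, 4}]
    cands = [frozenset(c) for c in combinations([1, 2, 3, 4], 2)]
    print(count_supports(stream, cands))  # Apriori-style "test" phase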

8 citations


Journal ArticleDOI
TL;DR: Methods are presented for fast and cycle-accurate emulation of NoCs with up to thousands of nodes using a single FPGA, covering both direct and indirect networks with a focus on commonly used meshes and fat-trees, different from prior work that considers only direct networks.
Abstract: Modeling and simulation/emulation play a major role in research and development of novel Networks-on-Chip (NoCs). However, conventional software simulators are so slow that studying NoCs for emerging many-core systems with hundreds to thousands of cores is challenging. State-of-the-art FPGA-based NoC emulators have shown great potential in speeding up the NoC simulation, but they cannot emulate large-scale NoCs due to the FPGA capacity constraints. Moreover, emulating large-scale NoCs under synthetic workloads on FPGAs typically requires a large amount of memory and thus involves the use of off-chip memory, which makes the overall design much more complicated and may substantially degrade the emulation speed. This article presents methods for fast and cycle-accurate emulation of NoCs with up to thousands of nodes using a single FPGA. We first describe how to emulate a NoC under a synthetic workload using only FPGA on-chip memory (BRAMs). We next present a novel use of time-division multiplexing where BRAMs are effectively used for emulating a network using a small number of nodes, thereby overcoming the FPGA capacity constraints. We propose methods for emulating both direct and indirect networks, focusing on the commonly used meshes and fat-trees (k-ary n-trees). This is different from prior work that considers only direct networks. Using the proposed methods, we build a NoC emulator, called FNoC, and demonstrate the emulation of some mesh-based and fat-tree-based NoCs with canonical router architectures. Our evaluation results show that (1) the size of the largest NoC that can be emulated depends only on the FPGA on-chip memory capacity; (2) a mesh-based NoC with 16,384 nodes (128×128 NoC) and a fat-tree-based NoC with 6,144 switch nodes and 4,096 terminal nodes (4-ary 6-tree NoC) can be emulated using a single Virtex-7 FPGA; and (3) when emulating these two NoCs, we achieve, respectively, 5,047× and 232× speedups over BookSim, one of the most widely used software-based NoC simulators, while maintaining the same level of accuracy.
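
The time-division-multiplexing idea can be sketched as follows, with Python lists standing in for the BRAMs that hold per-node state; the real FNoC routers are cycle-accurate hardware pipelines, and this shows only the scheduling scheme:

    # P physical router pipelines emulate N logical routers by sweeping
    # through per-node state kept in on-chip memory. One emulated NoC
    # cycle therefore takes N/P passes of the physical pipelines.
    N_LOGICAL, P_PHYSICAL = 16384, 64     # e.g., a 128x128 mesh on 64 pipelines

    node_state = [{"queue": []} for _ in range(N_LOGICAL)]   # lives in BRAM

    def emulate_cycle(step_router):
        for base in range(0, N_LOGICAL, P_PHYSICAL):
            for p in range(P_PHYSICAL):   # the P pipelines work in parallel
                step_router(base + p, node_state[base + p])

    def step_router(node_id, state):      # placeholder per-node update
        if state["queue"]:
            state["queue"].pop(0)

    emulate_cycle(step_router)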

7 citations


Journal ArticleDOI
TL;DR: This paper proposes the first fully pipelined implementation of the kernel normalised least mean squares algorithm for regression; the PCI Express based floating-point system achieves 80% of the core's speed, a speedup of 10× over an optimised implementation on a desktop processor and 2.66× over a GPU.
Abstract: Kernel adaptive filters (KAFs) are online machine learning algorithms which are amenable to highly efficient streaming implementations. They require only a single pass through the data and can act as universal approximators, i.e. approximate any continuous function with arbitrary accuracy. KAFs are members of a family of kernel methods which apply an implicit non-linear mapping of input data to a high dimensional feature space, permitting learning algorithms to be expressed entirely as inner products. Such an approach avoids explicit projection into the feature space, enabling computational efficiency. In this paper, we propose the first fully pipelined implementation of the kernel normalised least mean squares algorithm for regression. Independent training tasks necessary for hyperparameter optimisation fill pipeline stages, so no stall cycles to resolve dependencies are required. Together with other optimisations to reduce resource utilisation and latency, our core achieves 161 GFLOPS on a Virtex 7 XC7VX485T FPGA for a floating point implementation and 211 GOPS for fixed point. Our PCI Express based floating-point system implementation achieves 80% of the core’s speed, this being a speedup of 10× over an optimised implementation on a desktop processor and 2.66× over a GPU.
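
For reference, a minimal NumPy sketch of the KNLMS algorithm the core implements, with a coherence-based dictionary growth criterion; the step size, regulariser, kernel width, and threshold are illustrative assumptions, not the paper's parameters:

    import numpy as np

    def gauss(D, x, gamma=1.0):
        return np.exp(-gamma * np.sum((D - x) ** 2, axis=1))

    def knlms(xs, ds, eta=0.5, eps=1e-2, mu0=0.5, gamma=1.0):
        D, alpha, preds = xs[:1].copy(), np.zeros(1), []
        for x, d in zip(xs, ds):
            k = gauss(D, x, gamma)            # kernel evaluations vs. dictionary
            preds.append(alpha @ k)           # prediction is an inner product
            if k.max() <= mu0:                # coherence criterion: grow dictionary
                D = np.vstack([D, x])
                alpha = np.append(alpha, 0.0)
                k = gauss(D, x, gamma)
            # Normalised LMS update of the expansion coefficients.
            alpha += eta * (d - alpha @ k) * k / (eps + k @ k)
        return D, alpha, preds

    xs = np.linspace(0, 4, 200).reshape(-1, 1)
    ds = np.sin(2 * xs[:, 0])
    D, alpha, preds = knlms(xs, ds)
    print(len(D), np.mean((np.array(preds[50:]) - ds[50:]) ** 2))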

6 citations


Journal ArticleDOI
TL;DR: This article demonstrates the capability of the proposed interconnected-FPGAs system to accelerate join operations in a relational database and develops a new parallel join algorithm, PPJoin, targeted to big-data analysis in a shared-nothing architecture.
Abstract: A huge amount of data is being generated and accumulated in data centers, which leads to a significant increase in the energy required to analyze these data. Thus, we must consider redesigning current computer system architectures to be friendlier to applications based on distributed algorithms that require high data-transfer rates. Novel computer architectures that introduce dedicated accelerators to enable near-data processing have been discussed and developed for high-speed big-data analysis. In this work, we propose a computer system with an FPGA-based accelerator, namely, interconnected FPGAs, which offers two advantages: (1) direct data transmission and (2) offloading computation into the data-flow in the FPGA. In this article, we demonstrate the capability of the proposed interconnected-FPGAs system to accelerate join operations in a relational database. We developed a new parallel join algorithm, PPJoin, targeted at big-data analysis in a shared-nothing architecture. PPJoin is an extended version of a NUMA-based parallel join algorithm, created by overlapping computation by multicore processors with data communication. The data communication between computational nodes can be accelerated by direct data transmission without passing through the main memory of the hosts. To confirm the performance of the PPJoin algorithm and its acceleration using an interconnected-FPGA platform, we evaluated a simple query over large tables. Additionally, to demonstrate practical applicability, we also evaluated an actual benchmark query. Our evaluation results confirm that the PPJoin algorithm is faster than a software-based query engine by 1.5--5 times. Moreover, we experimentally confirmed that direct data transmission by interconnected FPGAs reduces computational time by around 20% for PPJoin.
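
A simplified single-process model of a hash-partitioned, shared-nothing join in the spirit of PPJoin (the authors' overlap of computation and communication and the FPGA offload are not reproduced): rows are partitioned by join-key hash so that each node can join its partition independently, and the partition exchange is the traffic that the interconnected FPGAs stream directly between nodes.

    def partition(rows, key, n_nodes):
        parts = [[] for _ in range(n_nodes)]
        for row in rows:
            parts[hash(row[key]) % n_nodes].append(row)   # exchange phase
        return parts

    def local_hash_join(r_part, s_part, key):             # per-node join phase
        table = {}
        for r in r_part:
            table.setdefault(r[key], []).append(r)
        return [(r, s) for s in s_part for r in table.get(s[key], [])]

    R = [{"id": i, "a": i * 2} for i in range(8)]
    S = [{"id": i % 4, "b": i} for i in range(8)]
    out = []
    for rp, sp in zip(partition(R, "id", 2), partition(S, "id", 2)):
        out += local_hash_join(rp, sp, "id")              # runs per node in parallel
    print(len(out))                                       # 8 matching pairs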

4 citations


Journal ArticleDOI
TL;DR: A Network-on-Chip in a hierarchical star topology to enable agents’ transactions through message broadcasting using the Open Core Protocol as an interface between hardware modules is proposed and a multi-agent system is created to simulate and analyse message exchanges in a generic heavy traffic load agent-based application.
Abstract: A system of interacting agents is, by definition, very demanding in terms of computational resources. Although multi-agent systems have been used to solve complex problems in many areas, it is usually very difficult to perform large-scale simulations in their targeted serial computing platforms. Reconfigurable hardware, in particular Field Programmable Gate Arrays devices, have been successfully used in High Performance Computing applications due to their inherent flexibility, data parallelism, and algorithm acceleration capabilities. Indeed, reconfigurable hardware seems to be the next logical step in the agency paradigm, but only a few attempts have been successful in implementing multi-agent systems in these platforms. This article discusses the problem of inter-agent communications in Field Programmable Gate Arrays. It proposes a Network-on-Chip in a hierarchical star topology to enable agents’ transactions through message broadcasting using the Open Core Protocol as an interface between hardware modules. A customizable router microarchitecture is described and a multi-agent system is created to simulate and analyse message exchanges in a generic heavy traffic load agent-based application. Experiments have shown a throughput of 1.6Gbps per port at 100MHz without packet loss and seamless scalability characteristics.
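
A toy model of a broadcast over the two-level star topology described (the OCP interface and the router micro-architecture are not modeled; the Router class and callback-based agents are illustrative):

    class Router:
        def __init__(self, children):
            self.children = children          # sub-routers or agent callbacks

        def broadcast(self, msg):
            for child in self.children:
                if isinstance(child, Router):
                    child.broadcast(msg)      # fan out to the next star level
                else:
                    child(msg)                # deliver to a leaf agent

    agents = [lambda m, i=i: print(f"agent{i} got: {m}") for i in range(4)]
    root = Router([Router(agents[:2]), Router(agents[2:])])
    root.broadcast("hello")                   # reaches all four agents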

3 citations


Journal ArticleDOI
TL;DR: This article demonstrates that the Thread- and Instruction-Level parallel Template architecture (TILT), a programmable FPGA-based horizontally microcoded compute engine designed to highly utilize floating point (FP) functional units (FUs), can significantly improve the average throughput of eight FP-intensive applications compared to a soft scalar CPU.
Abstract: By using resource-sharing field-programmable gate array (FPGA) compute engines, we can reduce the performance gap between soft scalar CPUs and resource-intensive custom datapath designs. This article demonstrates that the Thread- and Instruction-Level parallel Template architecture (TILT), a programmable FPGA-based horizontally microcoded compute engine designed to highly utilize floating point (FP) functional units (FUs), can significantly improve the average throughput of eight FP-intensive applications compared to a soft scalar CPU (similar to an FP-extended Nios). For eight benchmark applications, we show that: (i) a base TILT configuration having a single instance of each FU type can improve performance over a soft scalar CPU by 15.8×, while requiring on average 26% of the custom datapaths' area; (ii) selectively increasing the number of FUs can more than double TILT's average throughput, reducing the custom-datapath throughput gap from 576× to 14×; and (iii) replicated instances of the most computationally dense TILT configuration that fit within the area of each custom datapath design can reduce the gap to 8.27×, while replicated instances of application-tuned configurations of TILT can reduce the gap to an average of 5.22×, and down to 3.41× for the Matrix Multiply benchmark. Last, we present methods for design-space reduction, and we correctly predict the computationally densest design for seven out of eight benchmarks.

Journal ArticleDOI
TL;DR: A new method for Monte Carlo (MC) option pricing using field-programmable gate arrays (FPGAs), which uses a discrete-space random walk over a binomial lattice rather than the continuous-space walks used by existing approaches.
Abstract: This article presents a new method for Monte Carlo (MC) option pricing using field-programmable gate arrays (FPGAs), which uses a discrete-space random walk over a binomial lattice rather than the continuous-space walks used by existing approaches. The underlying hypothesis is that the discrete-space walk will significantly reduce the area needed for each MC engine, and that the resulting increase in parallelisation and raw performance outweighs any accuracy losses introduced by the discretisation. Experimental results support this hypothesis, showing that for a given MC simulation size, there is no significant loss in accuracy from using a discrete-space model for path-dependent exotic financial options. Analysis of the binomial simulation model shows that only limited-precision fixed-point arithmetic is needed, and also shows that pairs of MC kernels are able to share RAM resources. When using realistic constraints on pricing problems, it was found that the size of a discrete-space MC engine can be kept to 370 Flip-Flops and 233 Lookup Tables, allowing up to 3,000 variance-reduced MC cores in one FPGA. The combination of a highly parallelisable architecture and model-specific optimisations means that the binomial pricing technique allows for a 50× improvement in throughput compared to existing FPGA approaches, without any reduction in accuracy.
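
A hedged sketch of the discrete-space walk, assuming a standard Cox-Ross-Rubinstein (CRR) lattice parameterisation and an arithmetic-average Asian call as the path-dependent payoff; the paper's fixed-point, variance-reduced hardware engines are far more constrained than this floating-point model:

    import math, random

    def binomial_mc_asian(S0, K, r, sigma, T, steps, paths, seed=1):
        dt = T / steps
        u = math.exp(sigma * math.sqrt(dt))      # CRR up factor
        d = 1 / u                                # down factor
        p = (math.exp(r * dt) - d) / (u - d)     # risk-neutral up probability
        rng = random.Random(seed)
        total = 0.0
        for _ in range(paths):
            s, acc = S0, 0.0
            for _ in range(steps):               # discrete walk on the lattice
                s *= u if rng.random() < p else d
                acc += s
            total += max(acc / steps - K, 0.0)   # arithmetic-average payoff
        return math.exp(-r * T) * total / paths  # discounted MC estimate

    print(binomial_mc_asian(100, 100, 0.05, 0.2, 1.0, 128, 20000))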

Journal ArticleDOI
TL;DR: A high-performance B&B implementation on FPGAs that introduces workers that autonomously cooperate using work stealing to allow parallel execution and full utilization of the target FPGA and demonstrates how instance-specific designs can be generated just-in-time such that the provided speedups outweigh the additional time required for design synthesis.
Abstract: Branch and bound (B&B) algorithms structure the search space as a tree and eliminate infeasible solutions early by pruning subtrees that cannot lead to a valid or optimal solution. Custom hardware designs significantly accelerate the execution of these algorithms. In this article, we demonstrate a high-performance B&B implementation on FPGAs. First, we identify general elements of B&B algorithms and describe their implementation as a finite state machine. Then, we introduce workers that autonomously cooperate using work stealing to allow parallel execution and full utilization of the target FPGA. Finally, we explore advantages of instance-specific designs that target a specific problem instance to improve performance. We evaluate our concepts by applying them to a branch and bound problem, the reconstruction of corrupted AES keys obtained from cold-boot attacks. The evaluation shows that our work stealing approach is scalable with the available resources and provides speedups proportional to the number of workers. Instance-specific designs allow us to achieve an overall speedup of 47× compared to the fastest implementation of AES key reconstruction so far. Finally, we demonstrate how instance-specific designs can be generated just-in-time such that the provided speedups outweigh the additional time required for design synthesis.
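
A compact model of the work-stealing scheme: each worker expands nodes from its own deque LIFO-style and steals the oldest node from a random victim when idle. A toy 0/1 knapsack stands in for the paper's AES key-reconstruction search; the bounding function and all names are illustrative:

    import random
    from collections import deque

    items = [(60, 10), (100, 20), (120, 30)]      # (value, weight)
    CAP, best = 50, [0]

    def bound(value, idx):                        # optimistic remaining value
        return value + sum(v for v, _ in items[idx:])

    def expand(node, my_deque):                   # one B&B step
        value, weight, idx = node
        if weight > CAP or bound(value, idx) <= best[0]:
            return                                # prune this subtree
        best[0] = max(best[0], value)             # feasible: update incumbent
        if idx < len(items):
            v, w = items[idx]
            my_deque.append((value + v, weight + w, idx + 1))   # take item
            my_deque.append((value, weight, idx + 1))           # skip item

    deques = [deque() for _ in range(4)]          # one deque per worker
    deques[0].append((0, 0, 0))                   # root of the search tree
    while any(deques):
        for dq in deques:                         # round-robin "workers"
            if not dq:                            # idle worker: try to steal
                victims = [d for d in deques if d]
                if not victims:
                    continue
                dq.append(random.choice(victims).popleft())     # steal oldest node
            expand(dq.pop(), dq)                  # expand newest node (LIFO)
    print(best[0])                                # optimal knapsack value: 220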