
Showing papers in "ACM Transactions on Reconfigurable Technology and Systems in 2018"


Journal ArticleDOI
TL;DR: The second generation of the FINN framework is described: an end-to-end tool that enables design-space exploration and automates the creation of fully customized inference engines on FPGAs, optimizing for given platforms, design targets, and a specific precision.
Abstract: Convolutional Neural Networks have rapidly become the most successful machine-learning algorithm, enabling ubiquitous machine vision and intelligent decisions on even embedded computing systems. While the underlying arithmetic is structurally simple, compute and memory requirements are challenging. One of the promising opportunities is leveraging reduced-precision representations for inputs, activations, and model parameters. The resulting scalability in performance, power efficiency, and storage footprint provides interesting design compromises in exchange for a small reduction in accuracy. FPGAs are ideal for exploiting low-precision inference engines leveraging custom precisions to achieve the required numerical accuracy for a given application. In this article, we describe the second generation of the FINN framework, an end-to-end tool that enables design-space exploration and automates the creation of fully customized inference engines on FPGAs. Given a neural network description, the tool optimizes for given platforms, design targets, and a specific precision. We introduce formalizations of resource cost functions and performance predictions and elaborate on the optimization algorithms. Finally, we evaluate a selection of reduced precision neural networks ranging from CIFAR-10 classifiers to YOLO-based object detection on a range of platforms including PYNQ and AWS F1, demonstrating unprecedented measured throughput of 50 TOp/s on AWS F1 and 5 TOp/s on embedded devices.
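
As an editorial illustration of the kind of performance prediction such a tool formalizes, the sketch below shows a common first-order throughput and folding estimate for a matrix-vector compute unit. The formulas and parameter names (pe, simd) are generic assumptions, not FINN's actual cost model.

    def peak_throughput_ops(pe: int, simd: int, f_clk_hz: float) -> float:
        """Peak ops/s of a matrix-vector unit with `pe` processing elements,
        each `simd` lanes wide; one multiply-accumulate counts as two ops."""
        return 2 * pe * simd * f_clk_hz

    def layer_cycles(rows: int, cols: int, pe: int, simd: int) -> int:
        """Cycles to push one input vector through a rows x cols weight
        matrix folded onto pe x simd hardware (assumes pe divides rows
        and simd divides cols)."""
        return (rows // pe) * (cols // simd)

    # Example: 64 PEs x 32 SIMD lanes at 200MHz
    print(peak_throughput_ops(64, 32, 200e6) / 1e12)  # 0.8192 TOp/s
    print(layer_cycles(1024, 1024, 64, 32))           # 512 cycles per input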

204 citations


Journal ArticleDOI
TL;DR: NEURAghe is presented, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs that leverages the synergistic usage of Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic.
Abstract: Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: While the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN SW stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169GOps/s, and an energy efficiency of 17GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state-of-the-art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6fps on ResNet-18.

57 citations


Journal ArticleDOI
TL;DR: This work proposes and develops a deconvolution architecture for efficient FPGA implementation, and a hardware mapping framework is developed to automatically generate the low-latency hardware design for any given CNN model on the target device.
Abstract: Convolutional Neural Network (CNN)-based algorithms have been successful in solving image recognition problems, showing very large accuracy improvements. In recent years, deconvolution layers have been widely used as key components in state-of-the-art CNNs for end-to-end training and for models supporting tasks such as image segmentation and super resolution. However, deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications. In particular, there has been little research on efficient implementations of deconvolution algorithms on FPGA platforms, which practitioners and researchers widely use to accelerate CNN algorithms due to their high performance and power efficiency. In this work, we propose and develop a deconvolution architecture for efficient FPGA implementation. FPGA-based accelerators are proposed for both deconvolution and CNN algorithms. In addition, memory sharing between the computation modules is proposed for the FPGA-based CNN accelerator, along with other optimization techniques. A non-linear optimization model based on the performance model is introduced to efficiently explore the design space, achieving optimal processing speed and improved power efficiency. Furthermore, a hardware mapping framework is developed to automatically generate the low-latency hardware design for any given CNN model on the target device. Finally, we implement our designs on the Xilinx Zynq ZC706 board; the deconvolution accelerator achieves a performance of 90.1 giga operations per second (GOPS) at a 200MHz working frequency and a performance density of 0.10 GOPS/DSP using 32-bit quantization, which significantly outperforms previous designs on FPGAs. A real-time scene segmentation application on the Cityscapes dataset is used to evaluate our CNN accelerator on the Zynq ZC706 board; the system achieves a performance of 107 GOPS and 0.12 GOPS/DSP using 16-bit quantization and supports up to 17 frames per second for 512 × 512 image inputs with a power consumption of only 9.6W.
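
For readers unfamiliar with the operation being accelerated, the sketch below shows the textbook zero-insertion formulation of deconvolution (transposed convolution) in NumPy. It is a functional reference only and does not reflect the paper's hardware architecture.

    import numpy as np

    def transposed_conv2d(x, k, stride=2):
        """Deconvolution via zero insertion: upsample the input by
        inserting (stride - 1) zeros between pixels, then apply an
        ordinary 'full' convolution with the flipped kernel."""
        h, w = x.shape
        kh, kw = k.shape
        up = np.zeros((h * stride - (stride - 1), w * stride - (stride - 1)))
        up[::stride, ::stride] = x                  # zero-insertion upsampling
        pad = np.pad(up, ((kh - 1,), (kw - 1,)))    # 'full' convolution padding
        kf = k[::-1, ::-1]                          # flipped kernel
        out = np.zeros((pad.shape[0] - kh + 1, pad.shape[1] - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(pad[i:i+kh, j:j+kw] * kf)
        return out

    x = np.arange(9, dtype=float).reshape(3, 3)
    k = np.ones((3, 3))
    print(transposed_conv2d(x, k).shape)            # (7, 7): (3-1)*2 + 3 = 7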

48 citations


Journal ArticleDOI
TL;DR: FPGA architectural changes such as increasing DSP block count, enhancing low-precision support in DSP blocks and rethinking the on-chip memories are suggested to reduce the programmability gap for DL applications.
Abstract: Recently, deep learning (DL) has become best-in-class for numerous applications but at a high computational cost that necessitates high-performance energy-efficient acceleration. The reconfigurability of FPGAs is appealing due to the rapid change in DL models but also causes lower performance and area-efficiency compared to ASICs. In this article, we implement three state-of-the-art computing architectures (CAs) for convolutional neural network (CNN) inference on FPGAs and ASICs. By comparing the FPGA and ASIC implementations, we highlight the area and performance costs of programmability to pinpoint the inefficiencies in current FPGA architectures. We perform our experiments using three variations of these CAs for AlexNet, VGG-16 and ResNet-50 to allow extensive comparisons. We find that the performance gap varies significantly from 2.8× to 6.3×, while the area gap is consistent across CAs with an 8.7 average FPGA-to-ASIC area ratio. Among different blocks of the CAs, the convolution engine, constituting up to 60% of the total area, has a high area ratio ranging from 13 to 31. Motivated by our FPGA vs. ASIC comparisons, we suggest FPGA architectural changes such as increasing DSP block count, enhancing low-precision support in DSP blocks and rethinking the on-chip memories to reduce the programmability gap for DL applications.

39 citations


Journal ArticleDOI
TL;DR: Using field-programmable gate arrays (FPGAs) as a substrate to deploy soft graphics processing units (GPUs) would enable offering the FPGA compute power in a very flexible GPU-like tool flow.
Abstract: Using field-programmable gate arrays (FPGAs) as a substrate to deploy soft graphics processing units (GPUs) would enable offering the FPGA compute power in a very flexible GPU-like tool flow. Application-specific adaptations like selective hardening of floating-point operations and instruction set subsetting would mitigate the high area and power demands of soft GPUs. This work explores the capabilities and limitations of soft General Purpose Computing on GPUs (GPGPU) for both fixed- and floating-point arithmetic. For this purpose, we have developed FGPU: a configurable, scalable, and portable GPU architecture designed especially for FPGAs. FGPU is open-source and implemented entirely in RTL. It can be programmed in OpenCL and controlled through a Python API. This article introduces its hardware architecture as well as its tool flow. We evaluated the proposed GPGPU approach against multiple other solutions. In comparison to homogeneous Multi-Processor System-On-Chips (MPSoCs), we found that using a soft GPU is a Pareto-optimal solution regarding throughput per area and energy consumption. On average, FGPU has a 2.9× better compute density and 11.2× less energy consumption than a single MicroBlaze processor when computing in IEEE-754 floating-point format. An average speedup of about 4× over the ARM Cortex-A9 supported with the NEON vector co-processor has been measured for fixed- or floating-point benchmarks. In addition, the largest FGPU cores we could implement on a Xilinx Zynq-7000 System-On-Chip (SoC) can deliver similar performance to equivalent implementations with High-Level Synthesis (HLS).

17 citations


Journal ArticleDOI
TL;DR: An instruction-driven CNN accelerator for object detection is proposed, supporting the Winograd algorithm and cross-layer scheduling, together with a hardware architecture that supports the instructions with Winograd computation units and reaches state-of-the-art energy efficiency.
Abstract: In recent years, Convolutional Neural Networks (CNNs) have been widely applied in computer vision and have achieved significant improvements in object detection tasks. Although there are many optimizing methods to speed up CNN-based detection algorithms, it is still difficult to deploy detection algorithms on real-time low-power systems. The Field-Programmable Gate Array (FPGA) has been widely explored as a platform for accelerating CNNs due to its promising performance, high energy efficiency, and flexibility. Previous works show that the energy consumption of CNN accelerators is dominated by memory access. By fusing multiple layers in a CNN, the intermediate data transfer can be reduced. However, previous accelerators with cross-layer scheduling are designed for a particular CNN model. In addition to the memory access optimization, the Winograd algorithm can greatly improve the computational performance of convolution. In this article, to improve the flexibility of hardware, we design an instruction-driven CNN accelerator, supporting the Winograd algorithm and cross-layer scheduling, for object detection. We modify the loop unrolling order of the CNN, so that we can schedule a CNN across different layers with instructions and eliminate the intermediate data transfer. We propose a hardware architecture to support the instructions with Winograd computation units and reach state-of-the-art energy efficiency. To deploy image detection algorithms onto the proposed accelerator with fixed-point computation units, we adopt a fixed-point fine-tuning method, which can guarantee the accuracy of the detection algorithms. We evaluate our accelerator and scheduling policy on the Xilinx KU115 FPGA platform. The intermediate data transfer can be reduced by more than 90% on the VGG-D CNN model with the cross-layer strategy. Thus, the performance of our hardware accelerator reaches 1700GOP/s on the classification model VGG-D. We also implement a framework for object detection algorithms, which achieves 2.3× and 50× higher energy efficiency than GPU and CPU, respectively. Compared with floating-point algorithms, the accuracy of the fixed-point detection algorithms drops by less than 1%.
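
The multiplication savings the Winograd algorithm brings to convolution can be seen in a minimal 1-D F(2,3) example using the standard transforms from the literature (Lavin and Gray). This verifies the arithmetic such compute units exploit; it is not a description of the accelerator's RTL.

    import numpy as np

    # Standard 1-D Winograd F(2,3) transforms: two outputs of a 3-tap
    # filter computed with 4 multiplications instead of 6.
    BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
    G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]])
    AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

    d = np.array([1.0, 2.0, 3.0, 4.0])   # 4 input samples
    g = np.array([0.5, 1.0, -1.0])       # 3 filter taps

    m = (G @ g) * (BT @ d)               # the only 4 multiplications
    y = AT @ m
    direct = [np.dot(d[i:i+3], g) for i in range(2)]
    print(y, direct)                     # identical results: [-0.5, 0.0]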

17 citations


Journal ArticleDOI
TL;DR: This work details an architecture dedicated to inference using ternary weights and activations, which achieves up to 5.2k frames per second per watt for classification on a VC709 board using approximately half of the resources of the FPGA.
Abstract: Although performing inference with artificial neural networks (ANN) was until quite recently considered essentially compute intensive, the emergence of deep neural networks coupled with the evolution of integration technology has transformed inference into a memory-bound problem. With this established, many recent works have focused on minimizing memory accesses, either by enforcing and exploiting sparsity on weights or by using few bits to represent activations and weights, so that ANN inference can run on embedded devices. In this work, we detail an architecture dedicated to inference using ternary {−1, 0, 1} weights and activations. This architecture is configurable at design time to provide throughput vs. power trade-offs to choose from. It is also generic in the sense that it uses information drawn from the target technology (memory geometries and cost, number of available cuts, etc.) to adapt at best to the FPGA resources. This allows it to achieve up to 5.2k frames per second per watt for classification on a VC709 board using approximately half of the resources of the FPGA.
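
The arithmetic advantage of ternary weights is that multiply-accumulate degenerates to add, subtract, or skip; the sketch below is purely illustrative and not the paper's architecture.

    def ternary_dot(activations, weights):
        """Dot product with ternary {-1, 0, 1} weights: no multiplier is
        needed, since each weight selects add, subtract, or skip."""
        acc = 0
        for a, w in zip(activations, weights):
            if w == 1:
                acc += a
            elif w == -1:
                acc -= a
            # w == 0: skip (in hardware, also saves the memory fetch)
        return acc

    print(ternary_dot([3, -1, 4, 2], [1, 0, -1, 1]))  # 3 - 4 + 2 = 1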

17 citations


Journal ArticleDOI
TL;DR: This work combines measurements of register-level switching activity and system-level power to build an adaptive online model that produces live breakdowns of power consumption within the design and proposes a strategy allowing for the identification and subsequent elimination of counters found to be of low significance at runtime, reducing algorithmic complexity without sacrificing significant accuracy.
Abstract: In an FPGA system-on-chip design, it is often insufficient to merely assess the power consumption of the entire circuit by compile-time estimation or runtime power measurement. Instead, to make better decisions, one must understand the power consumed by each module in the system. In this work, we combine measurements of register-level switching activity and system-level power to build an adaptive online model that produces live breakdowns of power consumption within the design. Online model refinement avoids time-consuming characterization while also allowing the model to track long-term operating condition changes. Central to our method is an automated flow that selects signals predicted to be indicative of high power consumption, instrumenting them for monitoring. We named this technique KAPow, for ‘K’ounting Activity for Power estimation, which we show to be accurate and to have low overheads across a range of representative benchmarks. We also propose a strategy allowing for the identification and subsequent elimination of counters found to be of low significance at runtime, reducing algorithmic complexity without sacrificing significant accuracy. Finally, we demonstrate an application example in which a module-level power breakdown can be used to determine an efficient mapping of tasks to modules and reduce system-wide power consumption by up to 7%.
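
The adaptive online model can be pictured as fitting a linear map from per-signal activity counts to measured system-level power. The sketch below uses a simple LMS update as a stand-in; the paper's actual refinement algorithm may differ, and the counter values and coefficients here are synthetic.

    import numpy as np

    def lms_update(w, activity, measured_power, lr=0.1):
        """One online refinement step: nudge the per-counter power
        coefficients toward the live system-level measurement."""
        x = np.append(activity, 1.0)        # activity counters + bias term
        err = measured_power - w @ x
        return w + lr * err * x

    rng = np.random.default_rng(0)
    true_w = np.array([0.5, 1.5, 0.2, 3.0]) # hidden per-counter costs + static power
    w = np.zeros(4)
    for _ in range(5000):
        act = rng.random(3)                 # synthetic switching-activity sample
        w = lms_update(w, act, true_w @ np.append(act, 1.0))
    print(np.round(w, 2))                   # converges to [0.5, 1.5, 0.2, 3.0]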

17 citations


Journal ArticleDOI
TL;DR: This article proposes a new DNN architecture, LightNN, which replaces the multiplications with one shift or a constrained number of shifts and adds; for large DNN configurations, LightNNs have a regularization effect that makes them more accurate than conventional DNNs.
Abstract: Hardware implementations of deep neural networks (DNNs) have been adopted in many systems because of their higher classification speed. However, while they may be characterized by better accuracy, larger DNNs require significant energy and area, thereby limiting their wide adoption. The energy consumption of DNNs is driven by both memory accesses and computation. Binarized neural networks (BNNs), as a tradeoff between accuracy and energy consumption, can achieve great energy reduction and have good accuracy for large DNNs due to their regularization effect. However, BNNs show poor accuracy when a smaller DNN configuration is adopted. In this article, we propose a new DNN architecture, LightNN, which replaces the multiplications with one shift or a constrained number of shifts and adds. Our theoretical analysis for LightNNs shows that their accuracy is maintained while dramatically reducing storage and energy requirements. For a fixed DNN configuration, LightNNs have better accuracy than BNNs at a slight energy increase, yet are more energy efficient than conventional DNNs with only slightly less accuracy. Therefore, LightNNs provide more options for hardware designers to trade off accuracy and energy. Moreover, for large DNN configurations, LightNNs have a regularization effect, making them more accurate than conventional DNNs. These conclusions are verified by experiments using the MNIST and CIFAR-10 datasets for different DNN configurations. Our FPGA implementation for conventional DNNs and LightNNs confirms all theoretical and simulation results and shows that LightNNs reduce latency and use fewer FPGA resources compared to conventional DNN architectures.
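
The core idea, replacing a multiply with a small number of shifts and adds, can be sketched as follows. The greedy quantizer below is illustrative only; LightNN learns weights under this constraint during training rather than post-quantizing them.

    import math

    def to_k_shifts(w, k=2):
        """Greedily approximate integer weight w as a signed sum of k
        powers of two, e.g., 13 is approximated as 16 - 4."""
        terms, r = [], w
        for _ in range(k):
            if r == 0:
                break
            s = 1 if r > 0 else -1
            e = round(math.log2(abs(r)))
            terms.append((s, e))
            r -= s * (2 ** e)
        return terms

    def shift_mul(x, terms):
        """x * w using only shifts and adds (non-negative exponents assumed)."""
        return sum(s * (x << e) for s, e in terms)

    terms = to_k_shifts(13, k=2)
    print(terms, shift_mul(3, terms))   # [(1, 4), (-1, 2)] 36  (exact 3*13 = 39)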

16 citations


Journal ArticleDOI
TL;DR: The implementation results show that the proposed architecture is resistant to SPA attacks and yields better performance than existing state-of-the-art BEC designs for computing point multiplication (PM).
Abstract: In this article, we present a high-performance hardware architecture for the elliptic-curve-based (authenticated) key agreement protocol “Elliptic Curve Menezes, Qu and Vanstone” (ECMQV) over a Binary Edwards Curve (BEC). We begin by analyzing the inversion module over a 251-bit binary field. Subsequently, we present Field Programmable Gate Array (FPGA) implementations of the unified formula for computing elliptic curve point addition on a BEC in affine and projective coordinates and investigate the relative performance of these two coordinate systems. Then, we implement the w-coordinate based differential addition formulae suitable for use in the Montgomery ladder. Next, we present a novel hardware architecture for BEC point multiplication using mixed w-coordinates of the Montgomery laddering algorithm and analyze it in terms of resistance to Simple Power Analysis (SPA) attacks. To improve performance, the architecture utilizes registers efficiently and employs efficient scheduling mechanisms for the BEC arithmetic implementations. Our implementation results show that the proposed architecture is resistant to SPA attacks and yields better performance than existing state-of-the-art BEC designs for computing point multiplication (PM). Finally, we present an FPGA design of the ECMQV key agreement protocol using a BEC defined over GF(2^251). The execution of the ECMQV protocol takes 66.47μs using 32,479 slices on a Virtex-4 FPGA and 52.34μs using 15,988 slices on a Virtex-5 FPGA. To the best of our knowledge, this is the first FPGA design of the ECMQV protocol using a BEC.
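
The SPA resistance claimed for the Montgomery ladder stems from its uniform operation schedule: each scalar bit triggers exactly one addition and one doubling regardless of its value. The sketch below shows only this structure over a toy group; the paper's design instead uses differential w-coordinate addition formulas on the Binary Edwards Curve.

    def montgomery_ladder(k, P, add, dbl, identity):
        """Scalar multiplication k*P, processing bits MSB-first. Every
        iteration performs exactly one add and one double whatever the
        key bit is, so the operation sequence leaks nothing to simple
        power analysis. Structural sketch only, over a generic group."""
        R0, R1 = identity, P
        for bit in bin(k)[2:]:
            if bit == '1':
                R0, R1 = add(R0, R1), dbl(R1)
            else:
                R0, R1 = dbl(R0), add(R0, R1)
        return R0

    # Toy group: integers under addition (identity 0), so k*P is literally k*P.
    print(montgomery_ladder(201, 7, lambda a, b: a + b, lambda a: 2 * a, 0))  # 1407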

16 citations


Journal ArticleDOI
TL;DR: This article presents a thorough account of the Rathlin image processing language (RIPL), a high-level image processing domain-specific language for FPGAs, and motivates its design, based on higher-order algorithmic skeletons, with requirements from the image processing domain.
Abstract: Specialized FPGA implementations can deliver higher performance and greater power efficiency than embedded CPU or GPU implementations for real-time image processing. Programming challenges limit their wider use, because the implementation of FPGA architectures at the register transfer level is time consuming and error prone. Existing software languages supported by high-level synthesis (HLS), although providing a productivity improvement, are too general purpose to generate efficient hardware without the use of hardware-specific code optimizations. Such optimizations leak hardware details into the abstractions that software languages are there to provide, and they require knowledge of FPGAs to generate efficient hardware, such as by using language pragmas to partition data structures across memory blocks. This article presents a thorough account of the Rathlin image processing language (RIPL), a high-level image processing domain-specific language for FPGAs. We motivate its design, based on higher-order algorithmic skeletons, with requirements from the image processing domain. RIPL’s skeletons suffice to elegantly describe image processing stencils, as well as recursive algorithms with nonlocal random access patterns. At its core, RIPL employs a dataflow intermediate representation. We give a formal account of the compilation scheme from RIPL skeletons to static and cyclostatic dataflow models to describe their data rates and static scheduling on FPGAs. RIPL compares favorably to the Vivado HLS OpenCV library and C++ compiled with Vivado HLS. RIPL achieves between 54 and 191 frames per second (FPS) at 100MHz for four synthetic benchmarks, faster than HLS OpenCV in three cases. Two real-world algorithms are implemented in RIPL: visual saliency and mean shift segmentation. For the visual saliency algorithm, RIPL achieves 71 FPS compared to optimized C++ at 28 FPS. RIPL is also concise, being 5x shorter than C++ and 111x shorter than an equivalent direct dataflow implementation. For mean shift segmentation, RIPL achieves 7 FPS compared to optimized C++ on 64 CPU cores at 1.1 FPS, and RIPL is 10x shorter than the direct dataflow FPGA implementation.
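
To illustrate what a higher-order algorithmic skeleton is (RIPL's actual syntax and skeleton set are not reproduced here), the sketch below shows a stencil skeleton in Python: the programmer supplies only the per-neighbourhood function, while the skeleton fixes the iteration structure, which is what makes static dataflow scheduling derivable.

    def stencil(f, img, radius=1):
        """Apply f to every clamped (2*radius+1)^2 neighbourhood of img
        (a list of lists); the traversal order is fixed by the skeleton."""
        h, w = len(img), len(img[0])
        out = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                nbhd = [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                        for dy in range(-radius, radius + 1)
                        for dx in range(-radius, radius + 1)]
                out[y][x] = f(nbhd)
        return out

    blur = lambda nb: sum(nb) // len(nb)   # 3x3 box filter as a skeleton instance
    print(stencil(blur, [[0, 9, 0], [9, 9, 9], [0, 9, 0]]))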

Journal ArticleDOI
TL;DR: The results suggest that an open-source version of the layouts of the actual FPGA building blocks should be created, so their actual layout area can be used to achieve a highly accurate ranking of the implementation area of FPGA architectures built upon these layouts.
Abstract: This work provides an evaluation of the accuracy of minimum-width transistor area models in ranking the actual layout area of FPGA architectures. Both the original VPR area model and the new COFFE area model are compared against actual layouts with up to three metal layers for the various FPGA building blocks. We found that both models have significant variations with respect to the accuracy of their predictions across the building blocks. In particular, the original VPR model overestimates the layout area of larger buffers, full adders, and multiplexers by as much as 38%, while it underestimates the layout area of smaller buffers and multiplexers by as much as 58%, for an overall prediction error variation of 96%. The newer COFFE model also significantly overestimates the layout area of full adders by 13% and underestimates the layout area of multiplexers by a maximum of 60%, for a prediction error variation of 73%. Such variations are particularly significant considering that sensitivity analyses are not routinely performed in FPGA architectural studies. Our results suggest that such analyses are extremely important in studies that employ the minimum-width area models, so the tolerance of the architectural conclusions against the prediction error variations can be quantified. Furthermore, an open-source version of the layouts of the actual FPGA building blocks should be created so their actual layout area can be used to achieve a highly accurate ranking of the implementation area of FPGA architectures built upon these layouts.

Journal ArticleDOI
TL;DR: This article shows how priority inversion and starvation can be solved by making the reconfiguration process preemptive—that is, allowing it to be interrupted at any time and resumed at a later time without restarting it from scratch.
Abstract: To improve computing performance in real-time applications, modern embedded platforms comprise hardware accelerators that speed up the task’s most compute-intensive parts. A recent trend in the design of real-time embedded systems is to integrate field-programmable gate arrays (FPGA) that are reconfigured with different accelerators at runtime, to cope with dynamic workloads that are subject to timing constraints. One of the major limitations when dealing with partial FPGA reconfiguration in real-time systems is that the reconfiguration port can only perform one reconfiguration at a time: if a high-priority task issues a reconfiguration request while the reconfiguration port is already occupied by a lower-priority task, the high-priority task has to wait until the current reconfiguration is completed (a phenomenon known as priority inversion), unless the current reconfiguration is aborted (introducing unbounded delays in low-priority tasks, a phenomenon known as starvation). This article shows how priority inversion and starvation can be solved by making the reconfiguration process preemptive—that is, allowing it to be interrupted at any time and resumed at a later time without restarting it from scratch. Such a feature is crucial for the design of runtime reconfigurable real-time systems but not yet available in today’s platforms. Furthermore, the trade-off of achieving a guaranteed bound on the reconfiguration delay for low-priority tasks and the maximum delay induced for high-priority tasks when preempting an ongoing reconfiguration has been identified and analyzed. Experimental results on the Xilinx Zynq-7000 platform show that the proposed implementation of preemptive reconfiguration introduces a low runtime overhead, thus effectively solving priority inversion and starvation.
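
The scheduling effect of a preemptive reconfiguration port can be modeled in a few lines: a higher-priority request takes over the port mid-transfer, and the preempted transfer later resumes from its saved progress. This is a software sketch of the policy only, not the Zynq-7000 mechanism.

    import heapq

    class PreemptivePort:
        """One reconfiguration port; lower number = higher priority.
        Per-request progress is kept so a preempted transfer resumes
        where it stopped instead of restarting (no starvation), and a
        high-priority arrival never waits out a low-priority transfer
        (no priority inversion)."""
        def __init__(self):
            self.queue = []                  # (priority, name, length) min-heap
            self.progress = {}               # words already written per request

        def request(self, prio, name, length):
            self.progress.setdefault(name, 0)
            heapq.heappush(self.queue, (prio, name, length))

        def tick(self):
            """Write one word for the highest-priority pending request."""
            if not self.queue:
                return None
            prio, name, length = self.queue[0]
            self.progress[name] += 1
            if self.progress[name] == length:
                heapq.heappop(self.queue)
            return name

    port = PreemptivePort()
    port.request(2, "low_prio_accel", 3)
    trace = [port.tick()]
    port.request(1, "high_prio_accel", 2)    # arrives mid-transfer, preempts
    trace += [port.tick() for _ in range(4)]
    print(trace)  # low, high, high, then low resumes and finishes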

Journal ArticleDOI
TL;DR: This article addresses the vulnerability of space processing applications deployed on SRAM-based Field Programmable Gate Arrays by proposing a fine-grained module-based error recovery technique that, without additional system hardware, can localize and correct errors that classic MER cannot.
Abstract: Space processing applications deployed on SRAM-based Field Programmable Gate Arrays (FPGAs) are vulnerable to radiation-induced Single Event Upsets (SEUs). Compared with the well-known SEU mitigation solution—Triple Modular Redundancy (TMR) with configuration memory scrubbing—TMR with module-based error recovery (MER) is notably more energy efficient and responsive in repairing soft errors in the system. Unfortunately, TMR-MER systems also need to resort to scrubbing when errors occur between sub-components, such as in interconnection nets, which are not recovered by MER. This article addresses this problem by proposing a fine-grained module-based error recovery technique, which, without additional system hardware, can localize and correct errors that classic MER cannot. We evaluate our proposal via fault-injection campaigns on three types of circuits implemented in Xilinx 7-Series devices. With respect to scrubbing, we observed reductions in the mean time to repair configuration memory errors of between 48.5% and 89.4%, while reductions in the energy used recovering from configuration memory errors were estimated at between 77.4% and 96.1%. These improvements result in higher reliability for systems employing TMR with fine-grained reconfiguration than for equivalent systems relying on scrubbing for configuration error recovery.

Journal ArticleDOI
TL;DR: ReDCrypt, the first reconfigurable hardware-accelerated framework that empowers privacy-preserving inference of deep learning models in cloud servers, is proposed, along with a high-throughput and power-efficient FPGA implementation of the GC protocol for the privacy-sensitive phase.
Abstract: Artificial Intelligence (AI) is increasingly incorporated into the cloud business in order to improve the functionality (e.g., accuracy) of the service. The adoption of AI as a cloud service raises serious privacy concerns in applications where the risk of data leakage is not acceptable. Examples of such applications include scenarios where clients hold potentially sensitive private information such as medical records, financial data, and/or location. This article proposes ReDCrypt, the first reconfigurable hardware-accelerated framework that empowers privacy-preserving inference of deep learning models in cloud servers. ReDCrypt is well-suited for streaming (a.k.a. real-time AI) settings where clients need to dynamically analyze their data as it is collected over time without having to queue the samples to meet a certain batch size. Unlike prior work, ReDCrypt neither requires changing how AI models are trained nor relies on two non-colluding servers. The privacy-preserving computation in ReDCrypt is executed using Yao’s Garbled Circuit (GC) protocol. We break down the deep learning inference task into two phases: (i) privacy-insensitive (local) computation, and (ii) privacy-sensitive (interactive) computation. We devise a high-throughput and power-efficient implementation of the GC protocol on FPGA for the privacy-sensitive phase. ReDCrypt’s accompanying API provides support for seamless integration of ReDCrypt into any deep learning framework. Proof-of-concept evaluations for different DL applications demonstrate up to 57-fold higher throughput per core compared to the best prior solution, with no drop in accuracy.
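
The two-phase split can be illustrated with a mock of a single linear-then-ReLU layer: the linear part runs as ordinary local computation, while the nonlinear comparison is the part ReDCrypt evaluates under Yao's Garbled Circuits on the FPGA. The placeholder below performs no actual garbling and is not secure; it only marks where the interactive phase sits.

    import numpy as np

    def local_phase(x, W, b):
        """Privacy-insensitive phase: linear algebra executed locally."""
        return W @ x + b

    def interactive_phase_relu(y):
        """Privacy-sensitive phase: in ReDCrypt, this comparison would be
        evaluated under the GC protocol between the parties; the plain
        NumPy call here is a placeholder, not a secure computation."""
        return np.maximum(y, 0.0)

    W = np.array([[1.0, -2.0], [0.5, 1.0]])
    b = np.array([0.1, -0.3])
    x = np.array([0.7, 0.2])
    print(interactive_phase_relu(local_phase(x, W, b)))   # [0.4  0.25]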

Journal ArticleDOI
TL;DR: A detailed comparison study of SRAM and MTJ BRAMs is conducted that includes cell designs that are robust to device variation, transistor-level design and optimization of all the required BRAM-specific circuits, and variation-aware simulation at the 22nm node.
Abstract: While plentiful on-chip memory is necessary for many designs to fully utilize an FPGA’s computational capacity, SRAM scaling is becoming more difficult because of increasing device variation. An alternative is to build FPGA block RAM (BRAM) from magnetic tunnel junctions (MTJs), as this emerging embedded memory has a small cell size, low energy usage, and good scalability. We conduct a detailed comparison study of SRAM and MTJ BRAMs that includes cell designs that are robust to device variation, transistor-level design and optimization of all the required BRAM-specific circuits, and variation-aware simulation at the 22nm node. At a 256Kb block size, MTJ-BRAM is 3.06× denser and 55% more energy efficient, and its Fmax is 274MHz, which is adequate for most FPGA system clock domains. We also detail further enhancements that allow these 256Kb MTJ BRAMs to operate at a higher speed of 353MHz for streaming FIFOs, which are very common in FPGA designs, and describe how the non-volatility of MTJ BRAM enables novel on-chip configuration and power-down modes. For a RAM architecture similar to the latest commercial FPGAs, MTJ-BRAMs could expand FPGA memory capacity by 2.95× with no die size increase.

Journal ArticleDOI
TL;DR: This work shows that continuously monitoring on-chip delays at the LUT-to-LUT link level during operation allows an FPGA to detect and self-adapt to aging and environmental effects on timing, and introduces a strategy for rapid, in-system incremental repair of links with degraded timing.
Abstract: We show that continuously monitoring on-chip delays at the LUT-to-LUT link level during operation allows a field-programmable gate array to detect and self-adapt to aging and environmental timing effects.

Journal ArticleDOI
TL;DR: The proposed technique is scalable to the large number of configuration options available in modern soft core processors by relying on rapid and accurate estimation models instead of time-consuming FPGA synthesis and execution-based techniques.
Abstract: The large number of embedded soft core processors available today makes it tedious and time consuming to select the best processor for a given application. This task is even more challenging due to the numerous configuration options available for a single soft core processor while optimizing for conflicting design requirements such as performance and area. In this article, we propose a generic framework for rapid performance estimation of applications on soft core processors. The proposed technique is scalable to the large number of configuration options available in modern soft core processors by relying on rapid and accurate estimation models instead of time-consuming FPGA synthesis and execution-based techniques. Experimental results on two leading commercial soft core processors executing applications from the widely used CHStone benchmark suite show an average error of less than 6% while running in the order of minutes, compared to the hours taken by synthesis-based techniques.
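
The estimation-model idea can be sketched generically: fit per-instruction-class cycle costs for a given soft-core configuration from a few profiled applications, then predict new applications from their instruction mix alone. The feature set and numbers below are invented for illustration and are not the article's models.

    import numpy as np

    # Profiled training applications: [ALU, load/store, branch] counts
    # and their measured cycle totals on one soft-core configuration.
    train_mix = np.array([[1e6, 2e5, 5e4],
                          [4e5, 6e5, 1e5],
                          [8e5, 1e5, 2e5]])
    train_cycles = np.array([1.9e6, 2.3e6, 1.6e6])
    costs, *_ = np.linalg.lstsq(train_mix, train_cycles, rcond=None)

    # Predict a new application from its instruction mix alone,
    # with no synthesis or execution required.
    new_app_mix = np.array([6e5, 3e5, 8e4])
    print(f"predicted cycles: {new_app_mix @ costs:.3e}")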

Journal ArticleDOI
TL;DR: Wotan, a tool to quickly estimate routability for a wide range of architectures without the use of benchmark circuits, is presented; it efficiently counts paths through the FPGA routing graph to estimate the probability of node congestion and the probability of successfully routing a randomized subset of (source, sink) pairs, which are then combined into an overall routability metric.
Abstract: FPGA routing architectures consist of routing wires and programmable switches that together account for the majority of the fabric delay and area, making evaluation and optimization of an FPGA’s routing architecture very important. Routing architectures have traditionally been evaluated using a full synthesize, pack, place and route CAD flow over a suite of benchmark circuits. While the results are accurate, a full CAD flow has a long runtime and is often tuned to a specific FPGA architecture type, which limits exploration of different architecture options early in the design process. In this article, we present Wotan, a tool to quickly estimate routability for a wide range of architectures without the use of benchmark circuits. At its core, our routability predictor efficiently counts paths through the FPGA routing graph to (1) estimate the probability of node congestion and (2) estimate the probabilities to successfully route a randomized subset of (source, sink) pairs, which are then combined into an overall routability metric. We describe our predictor and present routability estimates for a range of 6-LUT and 4-LUT architectures using mixes of wire types connected in complex ways, showing a rank correlation of 0.91 with routability results from the full VPR CAD flow while requiring 18× less CPU effort.
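
The path-counting core can be sketched on a toy routing DAG: the number of source-sink paths through a node is the count of paths from the source to it times the count of paths from it to the sink, and nodes carrying a large share of all paths are likelier congestion hotspots. This shows the intuition only; Wotan's actual probability calculation differs.

    from functools import lru_cache

    # Tiny routing DAG: node -> successors (illustrative, not a real fabric).
    G = {"src": ["a", "b"], "a": ["c"], "b": ["c", "snk"], "c": ["snk"], "snk": []}
    R = {n: [] for n in G}                      # reverse edges
    for n, succs in G.items():
        for m in succs:
            R[m].append(n)

    @lru_cache(maxsize=None)
    def down(n):                                # paths n -> snk
        return 1 if n == "snk" else sum(down(m) for m in G[n])

    @lru_cache(maxsize=None)
    def up(n):                                  # paths src -> n
        return 1 if n == "src" else sum(up(p) for p in R[n])

    total = down("src")
    for n in G:
        print(f"{n}: {up(n) * down(n)} of {total} paths "
              f"({up(n) * down(n) / total:.0%})")   # 'c' carries 2 of 3 paths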

Journal ArticleDOI
TL;DR: The results obtained show how the proposed IL methodology can simplify the description of pipelined architectures while enabling performances that are close to those achievable through an RTL design methodology.
Abstract: While modern field-programmable gate arrays (FPGAs) enable high computing performance and efficiency, programming them with low-level hardware description languages is time-consuming and remains a major obstacle to their adoption. High-level synthesis compilers are able to produce register-transfer-level (RTL) designs from C/C++ algorithmic descriptions, but despite allowing significant design-time improvements, these tools are not always able to generate hardware designs that match handmade RTL designs. In this article, we consider synthesis from an intermediate-level (IL) language that allows the description of algorithmic state machines handling connections between streaming sources and sinks. However, the interconnection of streaming sources and sinks can lead to cyclic combinational relations, resulting in undesirable behaviors or un-synthesizable designs. We propose a functional-level methodology to automate the resolution of such cyclic relations into acyclic combinational functions. The proposed IL synthesis methodology has been applied to the design of pipelined floating-point cores. The results obtained show how the proposed IL methodology can simplify the description of pipelined architectures while enabling performance that is close to that achievable through an RTL design methodology.

Journal ArticleDOI
TL;DR: This article takes an important step toward out-of-order execution in soft processors by exploring instruction scheduling in an FPGA substrate; it discusses both circuit and microarchitectural trade-offs and compares three circuit structures for the scheduler, including a new structure called a fused-logic matrix scheduler.
Abstract: Soft processors have a role to play in simplifying field-programmable gate array (FPGA) application design, as they can be deployed only when needed, and it is easier to write and debug single-threaded software code than to create hardware. The breadth of this role increases when the performance of the soft processor increases, yet the sophisticated out-of-order superscalar approaches that arrived in the mid-1990s are not employed, despite their area cost now being easily tolerable. In this article, we take an important step toward out-of-order execution in soft processors by exploring instruction scheduling in an FPGA substrate. This differs from the hard-processor design problem because the logic substrate is restricted to LUTs, whereas hard processor scheduling circuits employ CAM and wired-OR structures to great benefit. We discuss both circuit and microarchitectural trade-offs and compare three circuit structures for the scheduler, including a new structure called a fused-logic matrix scheduler. Using our optimized circuits, we show that four-issue distributed schedulers with up to 54 entries can be built with the same cycle time as the commercial Nios II/f soft processor (240MHz). This careful design has the potential to significantly increase both the IPC and raw compute performance of a soft processor, compared to current commercial soft processors.
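
A matrix scheduler's wakeup logic can be modeled in software to show the structure being mapped onto LUTs: one dependency-bit row per entry, a broadcast column clear on completion, and readiness when a row empties. This models the classic matrix scheduler generically; the paper's fused-logic variant differs at the circuit level.

    class MatrixScheduler:
        """Bit-matrix wakeup: row i holds the in-flight producers that
        entry i still waits on. When a producer completes, its column is
        cleared in every row; any valid entry whose row becomes empty is
        ready to issue."""
        def __init__(self, entries):
            self.deps = [0] * entries            # one dependency bitmask per entry
            self.valid = [False] * entries

        def insert(self, slot, producers):
            self.valid[slot] = True
            self.deps[slot] = 0
            for p in producers:
                self.deps[slot] |= 1 << p

        def complete(self, slot):                # broadcast: clear this column
            self.valid[slot] = False
            mask = ~(1 << slot)
            self.deps = [row & mask for row in self.deps]

        def ready(self):
            return [i for i, v in enumerate(self.valid) if v and self.deps[i] == 0]

    s = MatrixScheduler(4)
    s.insert(0, []); s.insert(1, []); s.insert(2, [0, 1])
    print(s.ready())              # [0, 1]: entry 2 still waits on 0 and 1
    s.complete(0); s.complete(1)
    print(s.ready())              # [2]: row 2 emptied, so it wakes up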