Showing papers in "ACM Transactions on Reconfigurable Technology and Systems in 2010"


Journal ArticleDOI
TL;DR: This article presents the design and implementation of a massively parallelized Quasi-Monte Carlo simulation engine on an FPGA-based supercomputer, called Maxwell, and compares this implementation with equivalent graphics processing units (GPUs) and general purpose processors (GPP)-based implementations.
Abstract: Quasi-Monte Carlo simulation is a special Monte Carlo simulation method that uses quasi-random or low-discrepancy numbers as random sample sets. In many applications, this method has proved advantageous compared to the traditional Monte Carlo simulation method, which uses pseudo-random numbers, thanks to its faster convergence and higher level of accuracy. This article presents the design and implementation of a massively parallelized Quasi-Monte Carlo simulation engine on an FPGA-based supercomputer, called Maxwell. It also compares this implementation with equivalent graphics processing unit (GPU)- and general-purpose processor (GPP)-based implementations. The detailed comparison between these three implementations (FPGA vs. GPP vs. GPU) is done in the context of financial derivatives pricing based on our Quasi-Monte Carlo simulation engine. Real hardware implementations on the Maxwell machine show that FPGAs outperform equivalent GPP-based software implementations by 2 orders of magnitude, with the speed-up figure scaling linearly with the number of processing nodes used (FPGAs/GPPs). The same implementations show that FPGAs achieve a ~3x speedup compared to equivalent GPU-based implementations. Power consumption measurements also show FPGAs to be 336x more energy efficient than CPUs, and 16x more energy efficient than GPUs.
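For readers unfamiliar with the quasi-Monte Carlo idea, the sketch below contrasts pseudo-random and low-discrepancy (Halton) sample sets on a toy integrand in plain Python. It illustrates the sampling concept only, not the Maxwell engine or the financial payoffs priced in the article; the integrand and sample counts are arbitrary.

```python
import random

def halton(i, base):
    """i-th element (1-indexed) of the van der Corput sequence in the given base."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def estimate(points):
    """Estimate E[f(u1, u2)] for an illustrative payoff f over the unit square."""
    payoff = lambda u1, u2: max(u1 + u2 - 1.0, 0.0)   # toy integrand, exact mean = 1/6
    return sum(payoff(u1, u2) for u1, u2 in points) / len(points)

N = 4096
pseudo = [(random.random(), random.random()) for _ in range(N)]
quasi  = [(halton(i, 2), halton(i, 3)) for i in range(1, N + 1)]   # 2-D Halton point set

print("pseudo-random MC :", estimate(pseudo))
print("quasi-MC (Halton):", estimate(quasi))
```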

66 citations


Journal ArticleDOI
TL;DR: An overview of the components in the VFloat library are given and their use in an implementation of the K-means clustering algorithm applied to multispectral satellite images is demonstrated.
Abstract: Optimal reconfigurable hardware implementations may require the use of arbitrary floating-point formats that do not necessarily conform to IEEE specified sizes. We present a variable precision floating-point library (VFloat) that supports general floating-point formats including IEEE standard formats. Most previously published floating-point formats for use with reconfigurable hardware are subsets of our format. Custom datapaths with optimal bitwidths for each operation can be built using the variable precision hardware modules in the VFloat library, enabling a higher level of parallelism. The VFloat library includes three types of hardware modules for format control, arithmetic operations, and conversions between fixed-point and floating-point formats. The format conversions allow for hybrid fixed- and floating-point operations in a single design. This gives the designer control over a large number of design possibilities including format as well as number range within the same application. In this article, we give an overview of the components in the VFloat library and demonstrate their use in an implementation of the K-means clustering algorithm applied to multispectral satellite images.
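To make the notion of an arbitrary floating-point format concrete, here is a small Python sketch that packs and unpacks a value with user-chosen exponent and mantissa widths. It mirrors the idea behind variable precision formats only and is not the VFloat hardware (which consists of VHDL modules); denormals, rounding modes, and special values are deliberately ignored.

```python
def to_custom_float(x, exp_bits, man_bits):
    """Encode x into a (sign, exponent, mantissa) word of 1+exp_bits+man_bits bits.
    Simplified: no denormals, no NaN/Inf, truncating mantissa."""
    bias = (1 << (exp_bits - 1)) - 1
    sign = 1 if x < 0 else 0
    x = abs(x)
    if x == 0:
        return 0
    e = 0
    while x >= 2.0:           # normalize into [1, 2)
        x /= 2.0
        e += 1
    while x < 1.0:
        x *= 2.0
        e -= 1
    mant = int((x - 1.0) * (1 << man_bits))      # hidden leading 1 is dropped
    return (sign << (exp_bits + man_bits)) | ((e + bias) << man_bits) | mant

def from_custom_float(word, exp_bits, man_bits):
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if word >> (exp_bits + man_bits) else 1.0
    e = ((word >> man_bits) & ((1 << exp_bits) - 1)) - bias
    mant = 1.0 + (word & ((1 << man_bits) - 1)) / (1 << man_bits)
    return sign * mant * 2.0 ** e

# IEEE single precision corresponds to exp_bits=8, man_bits=23;
# a narrower custom format trades accuracy for a smaller datapath.
w = to_custom_float(3.14159, exp_bits=6, man_bits=10)
print(from_custom_float(w, 6, 10))
```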

55 citations


Journal ArticleDOI
TL;DR: Three lookup-table-based AES implementations that efficiently use the BlockRAM and DSP units embedded within Xilinx Virtex-5 FPGAs and implementations of a BRAM- based AES key-expansion, CMAC, and CTR modes of operation are presented.
Abstract: We present three lookup-table-based AES implementations that efficiently use the BlockRAM and DSP units embedded within Xilinx Virtex-5 FPGAs. An iterative module outputs a 32-bit AES round column every clock cycle, with a throughput of 1.67 Gbit/s when processing two 128-bit inputs. This construct is then replicated four times to provide a complete AES round per cycle with 6.7 Gbit/s throughput when processing eight input streams. This, in turn, is replicated ten times for a fully unrolled design providing over 52 Gbit/s of throughput. We also present implementations of a BRAM-based AES key-expansion, CMAC, and CTR modes of operation. Results for designs where DSPs are replaced by regular logic are also presented. The combination and arrangement of the specialized embedded functions available in the FPGA allows us to implement our designs using very few traditional user logic elements such as flip-flops and lookup tables, yet still achieve these high throughputs. HDL source code, simulation testbenches, and software tool commands to reproduce reported results for the three AES variants and CMAC mode are made publicly available. Our contribution concludes with a discussion on comparing cipher implementations in the literature, and why these comparisons can be meaningless without a common reporting methodology, or within the context of a constrained target application.
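The lookup-table formulation that maps so well onto BlockRAM can be sketched in software: one 32-bit round column is four table lookups XORed with a round-key word. The Python below builds the S-box and a single T-table and computes one column. Byte and word ordering conventions vary between implementations, and key expansion and the final round are omitted, so this is an illustration of the technique rather than the article's design.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

def build_sbox():
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0) if x else 0
        s = 0
        for i in range(8):      # affine transformation over GF(2)
            bit = ((inv >> i) ^ (inv >> ((i + 4) % 8)) ^ (inv >> ((i + 5) % 8)) ^
                   (inv >> ((i + 6) % 8)) ^ (inv >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
            s |= bit << i
        sbox.append(s)
    return sbox

SBOX = build_sbox()
# T0 fuses SubBytes and MixColumns for one input byte; T1..T3 are byte rotations
# of T0, so a single BRAM-friendly table (plus rotations) suffices.
T0 = [(gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3) for s in SBOX]
ror = lambda w, n: ((w >> n) | (w << (32 - n))) & 0xFFFFFFFF

def round_column(state, j, rk_word):
    """One 32-bit output column of an AES round: four table lookups plus XORs.
    `state` is the 4x4 byte matrix as a list of rows; ShiftRows is folded into
    the column indices."""
    return (T0[state[0][j]] ^
            ror(T0[state[1][(j + 1) % 4]], 8) ^
            ror(T0[state[2][(j + 2) % 4]], 16) ^
            ror(T0[state[3][(j + 3) % 4]], 24) ^
            rk_word)

state = [[0x32, 0x88, 0x31, 0xe0],
         [0x43, 0x5a, 0x31, 0x37],
         [0xf6, 0x30, 0x98, 0x07],
         [0xa8, 0x8d, 0xa2, 0x34]]
print(hex(round_column(state, 0, rk_word=0x00000000)))
```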

54 citations


Journal ArticleDOI
TL;DR: This article presents a widely parallel and deeply pipelined hardware CG implementation, targeted at modern FPGA architectures, particularly suited for accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as building block to solve higher-order systems.
Abstract: Recent developments in the capacity of modern Field Programmable Gate Arrays (FPGAs) have significantly expanded their applications. One such field is the acceleration of scientific computation and one type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient (CG) algorithm. In this article we present a widely parallel and deeply pipelined hardware CG implementation, targeted at modern FPGA architectures. This implementation is particularly suited for accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as a building block to solve higher-order systems. In this article it is shown that through parallelization it is possible to convert the computation time per iteration for an order n matrix from Θ(n²) clock cycles on a microprocessor to Θ(n) on an FPGA. Through deep pipelining it is also possible to solve several problems in parallel and maximize both performance and efficiency. I/O requirements are shown to be scalable and convergent to a constant value with the increase of matrix order. Post place-and-route results on a readily available VirtexII-6000 demonstrate sustained performance of 5 GFlops, and results on a Virtex5-330 indicate sustained performance of 35 GFlops. A comparison with an optimized software implementation running on a high-end CPU demonstrates that this FPGA implementation represents a significant speedup of at least an order of magnitude.
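For reference, the CG iteration being mapped to hardware is compact; the plain-Python sketch below (dense matrices, no FPGA mapping) shows that the dominant per-iteration cost is the matrix-vector product, which is the Θ(n²) term the parallel datapath reduces to Θ(n) cycles.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite dense matrix A (list of rows)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                       # residual r = b - A x  (x = 0 initially)
    p = r[:]
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        # Matrix-vector product: the Theta(n^2) work per iteration
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(conjugate_gradient(A, b))    # approximately [0.0909, 0.6364]
```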

54 citations


Journal ArticleDOI
TL;DR: This work systematically explores the design space of the force pipeline with respect to arithmetic algorithm, arithmetic mode, precision, and various other optimizations, and finds that for the Stratix-III, and for the best single precision designs, 11 pipelines running at 250 MHz can fit on the FPGA.
Abstract: The acceleration of molecular dynamics (MD) simulations using high-performance reconfigurable computing (HPRC) has been much studied. Given the intense competition from multicore and GPUs, there is now a question whether MD on HPRC can be competitive. We concentrate here on the MD kernel computation: determining the short-range force between particle pairs. In one part of the study, we systematically explore the design space of the force pipeline with respect to arithmetic algorithm, arithmetic mode, precision, and various other optimizations. We examine simplifications and find that some have little effect on simulation quality. In the other part, we present the first FPGA study of the filtering of particle pairs with nearly zero mutual force, a standard optimization in MD codes. There are several innovations, including a novel partitioning of the particle space, and new methods for filtering and mapping work onto the pipelines. As a consequence, highly efficient filtering can be implemented with only a small fraction of the FPGA’s resources. Overall, we find that, for an Altera Stratix-III EP3ES260, 8 force pipelines running at nearly 200 MHz can fit on the FPGA, and that they can perform at 95% efficiency. This results in an 80-fold per core speed-up for the short-range force, which is likely to make FPGAs highly competitive for MD.
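The pair-filtering idea can be stated in a few lines of software: pairs beyond the cutoff contribute (nearly) zero force and are discarded before the expensive force evaluation. The Python sketch below uses a Lennard-Jones force purely for illustration; the particle-space partitioning and pipeline mapping described in the article are not represented.

```python
import itertools

def lj_force(r2, epsilon=1.0, sigma=1.0):
    """Lennard-Jones force magnitude divided by r, for squared distance r2."""
    s2 = sigma * sigma / r2
    s6 = s2 ** 3
    return 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r2

def short_range_forces(positions, cutoff):
    """Filter particle pairs by the cutoff, then evaluate forces only for survivors.
    In an FPGA design this filtering happens in cheap logic ahead of the expensive
    force pipelines; here it is just an if-test."""
    cutoff2 = cutoff * cutoff
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        d = [positions[i][k] - positions[j][k] for k in range(3)]
        r2 = sum(c * c for c in d)
        if r2 >= cutoff2:          # pair contributes (nearly) zero force: skip it
            continue
        f = lj_force(r2)
        for k in range(3):
            forces[i][k] += f * d[k]
            forces[j][k] -= f * d[k]
    return forces

pos = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
print(short_range_forces(pos, cutoff=2.5))
```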

52 citations


Journal ArticleDOI
TL;DR: A hierarchical taxonomy of computing devices, concepts and terminology describing reconfigurability, and computational density and internal memory bandwidth metrics to compare devices is presented.
Abstract: As on-chip transistor counts increase, the computing landscape has shifted to multi- and many-core devices. Computational accelerators have adopted this trend by incorporating both fixed and reconfigurable many-core and multi-core devices. As more, disparate devices enter the market, there is an increasing need for concepts, terminology, and classification techniques to understand the device tradeoffs. Additionally, computational performance, memory performance, and power metrics are needed to objectively compare devices. These metrics will assist application scientists in selecting the appropriate device early in the development cycle. This article presents a hierarchical taxonomy of computing devices, concepts and terminology describing reconfigurability, and computational density and internal memory bandwidth metrics to compare devices.

50 citations


Journal ArticleDOI
TL;DR: TMD-MPI is shown to address current design challenges in HPRC usage, suggesting that the MPI standard has enough syntax and semantics to program these new types of parallel architectures.
Abstract: High-Performance Reconfigurable Computers (HPRCs) consist of one or more standard microprocessors tightly-coupled with one or more reconfigurable FPGAs. HPRCs have been shown to provide good speedups and good cost/performance ratios, but not necessarily ease of use, leading to a slow acceptance of this technology. HPRCs introduce new design challenges, such as the lack of portability across platforms, incompatibilities with legacy code, user reluctance to change their code base, a prolonged learning curve, and the need for a system-level Hardware/Software co-design development flow. This article presents the evolution and current work on TMD-MPI, which started as an MPI-based programming model for Multiprocessor Systems-on-Chip implemented in FPGAs, and has now evolved to include multiple X86 processors. TMD-MPI is shown to address current design challenges in HPRC usage, suggesting that the MPI standard has enough syntax and semantics to program these new types of parallel architectures. Also presented is the TMD-MPI Ecosystem, which consists of research projects and tools that are developed around TMD-MPI to further improve HPRC usability. Finally, we present preliminary communication performance measurements.
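TMD-MPI itself is a lightweight implementation of MPI primitives for embedded processors and hardware engines; the fragment below only illustrates the message-passing pattern the article argues is sufficient, using the standard mpi4py bindings on a workstation. The file name and payload are illustrative.

```python
# Run with e.g.:  mpirun -n 2 python ranks.py   (file name is illustrative)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # In a TMD-MPI system this rank could be an X86 host or an embedded processor.
    comm.send({"coeffs": [1.0, 2.0, 3.0]}, dest=1, tag=0)
    result = comm.recv(source=1, tag=1)
    print("rank 0 received", result)
elif rank == 1:
    # ...and this rank could equally be a hardware compute engine on an FPGA.
    task = comm.recv(source=0, tag=0)
    comm.send(sum(task["coeffs"]), dest=0, tag=1)
```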

38 citations


Journal ArticleDOI
TL;DR: Modifications to the Rings design are proposed which significantly improve its robustness against attacks, alleviate implementation-related problems, and simultaneously improve its area, throughput, and power performance.
Abstract: A ring oscillator-based true-random number generator design (Rings design) was introduced in Sunar et al. [2007]. The design was rigorously analyzed under a simple mathematical model and its performance characteristics were established. In this article we focus on the practical aspects of the Rings design on a reconfigurable logic platform and determine their implications on the earlier analysis framework. We make recommendations for avoiding pitfalls in real-life implementations by considering ring interaction, transistor-level effects, narrow signal rejection, transmission line attenuation, and sampler bias. Furthermore, we present experimental results showing that changing operating conditions such as the power supply voltage or the operating temperature may affect the output quality when the signal is subsampled. Hence, an attacker may shift the operating point via a simple noninvasive influence and easily bias the TRNG output. Finally, we propose modifications to the design which significantly improve its robustness against attacks, alleviate implementation-related problems, and simultaneously improve its area, throughput, and power performance.

36 citations


Journal ArticleDOI
TL;DR: A complete partitioning and floorplanning algorithm tailored for reconfigurable architectures deployable on FPGAs and considering communication infrastructure feasibility is described and named floorplacer in order to underline the great differences with respect to traditional floorplanners.
Abstract: The aim of this article is to describe a complete partitioning and floorplanning algorithm tailored for reconfigurable architectures deployable on FPGAs and considering communication infrastructure feasibility. This article proposes a novel approach for resource- and reconfiguration-aware floorplanning. Different from existing approaches, our floorplanning algorithm takes specific physical constraints such as resource distribution and the granularity of reconfiguration possible for a given FPGA device into account. Due to the introduction of constraints typical of other problems like partitioning and placement, the proposed approach is named floorplacer in order to underline the great differences with respect to traditional floorplanners. These physical constraints are typically considered at the later placement stage. Different aspects of the problem are described, focusing particularly on FPGA resource heterogeneity and the temporal dimension typical of reconfigurable systems. Once the problem is introduced, a comparison with related works is provided and their limits are pointed out. Experimental results prove the validity of the proposed approach.

36 citations


Journal ArticleDOI
TL;DR: This article proposes an approach for solving large 3-SAT problems on FPGA using a WSAT algorithm, and can solve larger problems than previous works with less hardware resources, and shows higher performance.
Abstract: WSAT and its variants are one of the best performing stochastic local search algorithms for the satisfiability (SAT) problem. In this article, we propose an approach for solving large 3-SAT problems on an FPGA using a WSAT algorithm. In hardware solvers, it is important to solve large problems efficiently. In WSAT algorithms, an assignment of binary values to the variables that satisfies all clauses is searched by repeatedly choosing a variable in an unsatisfied clause using a heuristic, and flipping its value. In our solver, (1) only the clauses that may be unsatisfied by the flipping are evaluated in parallel to minimize the circuit size, and (2) several independent tries are executed at the same time on the pipelined circuit to achieve high performance. Our FPGA solver can solve larger problems than previous works with fewer hardware resources, and shows higher performance.
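As a point of reference for the algorithm being accelerated, here is a generic software WalkSAT loop in Python (pick an unsatisfied clause, then flip a variable chosen either at random or greedily). It shows the heuristic the abstract describes, not the parallel clause evaluation or pipelined multi-try scheme of the FPGA solver; the noise parameter and flip limit are arbitrary.

```python
import random

def walksat(clauses, num_vars, p=0.5, max_flips=100000):
    """clauses: list of clauses, each a list of nonzero ints (positive = variable,
    negative = negated variable), e.g. [1, -2, 3]. Returns an assignment dict or None."""
    assign = {v: random.choice([False, True]) for v in range(1, num_vars + 1)}
    lit_true = lambda lit: assign[abs(lit)] == (lit > 0)

    def num_unsat_after_flip(var):
        assign[var] = not assign[var]
        n = sum(not any(lit_true(l) for l in c) for c in clauses)
        assign[var] = not assign[var]
        return n

    for _ in range(max_flips):
        unsat = [c for c in clauses if not any(lit_true(l) for l in c)]
        if not unsat:
            return assign                      # all clauses satisfied
        clause = random.choice(unsat)          # pick an unsatisfied clause
        if random.random() < p:                # random walk move
            var = abs(random.choice(clause))
        else:                                  # greedy move: flip the least damaging variable
            var = min((abs(l) for l in clause), key=num_unsat_after_flip)
        assign[var] = not assign[var]
    return None

# tiny 3-SAT instance: (x1 v x2 v x3) & (~x1 v x2 v ~x3) & (x1 v ~x2 v x3)
print(walksat([[1, 2, 3], [-1, 2, -3], [1, -2, 3]], num_vars=3))
```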

32 citations


Journal ArticleDOI
TL;DR: The designs presented here enable a Xilinx Virtex4 FPGA to achieve 270 MHz IEEE compliant double precision floating-point performance with a 9-stage adder pipeline and 14-stage multiplier pipeline.
Abstract: Floating-point applications are a growing trend in the FPGA community. As such, it has become critical to create floating-point units optimized for standard FPGA technology. Unfortunately, the FPGA design space is very different from the VLSI design space; thus, optimizations for FPGAs can differ significantly from optimizations for VLSI. In particular, the FPGA environment constrains the design space such that only limited parallelism can be effectively exploited to reduce latency. Obtaining the right balances between clock speed, latency, and area in FPGAs can be particularly challenging. This article presents implementation details for an IEEE-754 standard floating-point adder and multiplier for FPGAs. The designs presented here enable a Xilinx Virtex4 FPGA (-11 speed grade) to achieve 270 MHz IEEE compliant double precision floating-point performance with a 9-stage adder pipeline and 14-stage multiplier pipeline. The area requirement is approximately 500 slices for the adder and under 750 slices for the multiplier.

Journal ArticleDOI
TL;DR: A scalable FPGA-array with bandwidth-reduction mechanism (BRM) to implement high-performance and power-efficient CCMs for scientific simulations based on finite difference methods and it is shown that BRM works effectively for benchmark computations, and therefore commercially available low-end FPGAs with relatively narrow I/O bandwidth can be utilized to construct a scalable FPGA-array.
Abstract: For scientific numerical simulation that requires a relatively high ratio of data access to computation, the scalability of memory bandwidth is the key to performance improvement, and therefore custom-computing machines (CCMs) are one of the promising approaches to provide bandwidth-aware structures tailored for individual applications. In this article, we propose a scalable FPGA-array with bandwidth-reduction mechanism (BRM) to implement high-performance and power-efficient CCMs for scientific simulations based on finite difference methods. With the FPGA-array, we construct a systolic computational-memory array (SCMA), which is given a minimum of programmability to provide flexibility and high productivity for various computing kernels and boundary computations. Since the systolic computational-memory architecture of SCMA provides scalability of both memory bandwidth and arithmetic performance according to the array size, we introduce a homogeneous partitioning approach to the SCMA so that it is extensible over a 1D or 2D array of FPGAs connected with a mesh network. To satisfy the bandwidth requirement of inter-FPGA communication, we propose BRM based on time-division multiplexing. BRM decreases the required number of communication channels between the adjacent FPGAs at the cost of delay cycles. We formulate the trade-off between bandwidth and delay of inter-FPGA data-transfer with BRM. To demonstrate feasibility and evaluate performance quantitatively, we design and implement the SCMA of 192 processing elements over two ALTERA Stratix II FPGAs. The implemented SCMA running at 106 MHz has a peak performance of 40.7 GFlops in single precision. We demonstrate that the SCMA achieves sustained performances of 32.8 to 35.7 GFlops for three benchmark computations with high utilization of computing units. The SCMA has complete scalability to the increasing number of FPGAs due to the highly localized computation and communication. In addition, we also demonstrate that the FPGA-based SCMA is power-efficient: it consumes 69% to 87% of the power and requires only 2.8% to 7.0% of the energy of the same computations performed by a 3.4-GHz Pentium4 processor. With software simulation, we show that BRM works effectively for benchmark computations, and therefore commercially available low-end FPGAs with relatively narrow I/O bandwidth can be utilized to construct a scalable FPGA-array.
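A back-of-the-envelope way to see the BRM trade-off (not the article's exact formulation): time-multiplexing C logical boundary channels over P physical inter-FPGA channels divides the pin requirement by the multiplexing factor but adds roughly that many cycles to each boundary exchange. All numbers in the sketch are invented.

```python
import math

def brm_tradeoff(logical_channels, physical_channels, words_per_step):
    """Simplified model: `logical_channels` boundary values share `physical_channels`
    inter-FPGA wires each time step; serialization adds delay cycles."""
    m = math.ceil(logical_channels / physical_channels)    # time-division factor
    extra_cycles = (m - 1) * words_per_step                # added latency per boundary exchange
    return m, extra_cycles

for phys in (256, 64, 16):
    m, extra = brm_tradeoff(logical_channels=256, physical_channels=phys, words_per_step=1)
    print(f"{phys:3d} physical channels -> multiplex x{m}, ~{extra} extra cycle(s) per exchange")
```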

Journal ArticleDOI
John M. Bodily, Brent Nelson, Zhaoyi Wei, Dah-Jye Lee, Jeff Chase
TL;DR: This article reports on a series of experiments mapping a collection of different algorithms onto both an FPGA and a GPU, finding that for two different optical flow algorithms the GPU had better performance, while for a set of digital comm MIMO computations, they had similar performance.
Abstract: FPGA devices have often found use as higher-performance alternatives to programmable processors for implementing computations. Applications successfully implemented on FPGAs typically contain high levels of parallelism and often use simple statically scheduled control and modest arithmetic. Recently introduced computing devices such as coarse-grain reconfigurable arrays, multi-core processors, and graphical processing units promise to significantly change the computational landscape and take advantage of many of the same application characteristics that fit well on FPGAs. One real-time computing task, optical flow, is difficult to apply in robotic vision applications because of its high computational and data rate requirements, and so is a good candidate for implementation on FPGAs and other custom computing architectures. This article reports on a series of experiments mapping a collection of different algorithms onto both an FPGA and a GPU. For two different optical flow algorithms the GPU had better performance, while for a set of digital comm MIMO computations, they had similar performance. In all cases the FPGA implementations required 10x the development time. Finally, a discussion of the two technologies’ characteristics is given to show how they achieve high performance in different ways.

Journal ArticleDOI
TL;DR: Simulations on randomly generated task graphs indicate that RDMS algorithm can reduce interconfiguration communication time and intertask communication time by 11% and 44% respectively, compared with two other approaches that consider data dependency and hardware resource utilization only.
Abstract: High-performance reconfigurable computing involves acceleration of significant portions of an application using reconfigurable hardware. When the hardware tasks of an application cannot simultaneously fit in an FPGA, the task graph needs to be partitioned and scheduled into multiple FPGA configurations, in a way that minimizes the total execution time. This article proposes the Reduced Data Movement Scheduling (RDMS) algorithm that aims to improve the overall performance of hardware tasks by taking into account the reconfiguration time, data dependency between tasks, intertask communication as well as task resource utilization. The proposed algorithm uses the dynamic programming method. A mathematical analysis of the algorithm shows that the execution time would at most exceed the optimal solution by a factor of around 1.6 in the worst case. Simulations on randomly generated task graphs indicate that the RDMS algorithm can reduce interconfiguration communication time by 11% and 44% respectively, compared with two other approaches that consider data dependency and hardware resource utilization only. The practicality, as well as the efficiency of the proposed algorithm over other approaches, is demonstrated by simulating a task graph from a real-life application - N-body simulation - along with constraints for bandwidth and FPGA parameters from existing high-performance reconfigurable computers. Experiments on SRC-6 are carried out to validate the approach.

Journal ArticleDOI
TL;DR: A novel configuration scrubbing core is presented, used for internal detection and correction of radiation-induced configuration single and multiple bit errors, without requiring external scrubbing, which significantly improves the availability in hostile radiation environments of FPGA-based designs.
Abstract: This article presents a novel configuration scrubbing core, used for internal detection and correction of radiation-induced configuration single and multiple bit errors, without requiring external scrubbing. The proposed technique combines the benefits of fast radiation-induced fault detection with fast restoration of the device functionality and small area and power overheads. Experimental results demonstrate that the novel approach significantly improves the availability in hostile radiation environments of FPGA-based designs. When implemented using a Xilinx XC2V1000 Virtex-II device, the presented technique detects and corrects single bit upsets and double, triple and quadruple multi bit upsets, occupying just 1488 slices and dissipating less than 30 mW at a 50MHz running frequency.

Journal ArticleDOI
TL;DR: This article presents a novel and portable framework for runtime performance analysis of HLL applications for FPGAs, including an automated tool for performance analysis of designs created with Impulse C, a commercial HLL for FPGAs.
Abstract: High-Level Languages (HLLs) for Field-Programmable Gate Arrays (FPGAs) facilitate the use of reconfigurable computing resources for application developers by using familiar, higher-level syntax, semantics, and abstractions, typically enabling faster development times than with traditional Hardware Description Languages (HDLs). However, programming at a higher level of abstraction is typically accompanied by some loss of performance as well as reduced transparency of application behavior, making it difficult to understand and improve application performance. While runtime tools for performance analysis are often featured in development with traditional HLLs for sequential and parallel programming, HLL-based development for FPGAs has an equal or greater need yet lacks these tools. This article presents a novel and portable framework for runtime performance analysis of HLL applications for FPGAs, including an automated tool for performance analysis of designs created with Impulse C, a commercial HLL for FPGAs. As a case study, this tool is used to successfully locate performance bottlenecks in a molecular dynamics kernel in order to gain speedup.

Journal ArticleDOI
TL;DR: A new RDI logic design is proposed that can be used to cost-efficiently implement RDI on FPGA devices and it is demonstrated that RDI is an efficient countermeasure technique onFPGA in comparison to other countermeasures.
Abstract: Side-channel attacks (SCA) threaten electronic cryptographic devices and can be carried out by monitoring the physical characteristics of security circuits. Differential Power Analysis (DPA) is one of the most widely studied side-channel attacks. Numerous countermeasure techniques, such as Random Delay Insertion (RDI), have been proposed to reduce the risk of DPA attacks against cryptographic devices. The RDI technique was first proposed for microprocessors but it was shown to be unsuccessful when implemented on smartcards as it was vulnerable to a variant of the DPA attack known as the Sliding-Window DPA attack. Previous research by the authors investigated the use of the RDI countermeasure for Field Programmable Gate Array (FPGA) based cryptographic devices. A split-RDI technique was proposed to improve the security of the RDI countermeasure. A set of critical parameters was also proposed that could be utilized in the design stage to optimize a security algorithm design with RDI in terms of area, speed and power. The authors also showed that RDI is an efficient countermeasure technique on FPGA in comparison to other countermeasures. In this article, a new RDI logic design is proposed that can be used to cost-efficiently implement RDI on FPGA devices. Sliding-Window DPA and realignment attacks, which were shown to be effective against RDI implemented on smartcard devices, are performed on the improved RDI FPGA implementation. We demonstrate that these attacks are unsuccessful and we also propose a realignment technique that can be used to demonstrate the weakness of RDI implementations.

Journal ArticleDOI
TL;DR: This work proposes security primitives using ideas centered around the notion of “moats and drawbridges,” which encompass four design properties: logical isolation, interconnect traceability, secure reconfigurable broadcast, and configuration scrubbing.
Abstract: Computing systems designed using reconfigurable hardware are increasingly composed using a number of different Intellectual Property (IP) cores, which are often provided by third-party vendors that may have different levels of trust. Unlike traditional software where hardware resources are mediated using an operating system, IP cores have fine-grain control over the underlying reconfigurable hardware. To address this problem, the embedded systems community requires novel security primitives that address the realities of modern reconfigurable hardware. In this work, we propose security primitives using ideas centered around the notion of “moats and drawbridges.” The primitives encompass four design properties: logical isolation, interconnect traceability, secure reconfigurable broadcast, and configuration scrubbing. Each of these is a fundamental operation with easily understood formal properties, yet they map cleanly and efficiently to a wide variety of reconfigurable devices. We carefully quantify the required overheads of the security techniques on modern FPGA architectures across a number of different applications.

Journal ArticleDOI
TL;DR: Field Programmable Gate Arrays (FPGAs) offer a possible alternative with their customizable and application-targeted memory sub-system and processing elements.
Abstract: Double precision floating point Sparse Matrix-Vector Multiplication (SMVM) is a critical computational kernel used in iterative solvers for systems of sparse linear equations. The poor data locality exhibited by sparse matrices along with the high memory bandwidth requirements of SMVM result in poor performance on general purpose processors. Field Programmable Gate Arrays (FPGAs) offer a possible alternative with their customizable and application-targeted memory sub-system and processing elements. In this work we investigate two separate implementations of the SMVM on an SRC-6 MAPStation workstation. The first implementation investigates the peak performance capability, while the second implementation balances the amount of instantiated logic with the available sustained bandwidth of the FPGA subsystem. Both implementations yield the same sustained performance with the second producing a much more efficient solution. The metrics of processor and application balance are introduced to help provide some insight into the efficiencies of the FPGA- and CPU-based solutions, explicitly showing the tight coupling of the available bandwidth to peak floating point performance. Due to the FPGA's ability to balance the amount of implemented logic to the available memory bandwidth, it can provide a much more efficient solution. Finally, making use of the lessons learned implementing the SMVM, we present a fully implemented non-preconditioned Conjugate Gradient Algorithm utilizing the second SMVM design.
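For context, the kernel itself is tiny; the standard Compressed Sparse Row formulation below (plain Python) makes it clear that every multiply-accumulate drags along several memory fetches, which is why sustained memory bandwidth, not arithmetic, limits both the CPU and FPGA versions.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix in Compressed Sparse Row form.
    Each output element needs one multiply-accumulate per stored nonzero, but also
    one value, one column index and one x fetch: memory-bandwidth bound."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# 3x3 example:  [[2, 0, 1],
#                [0, 3, 0],
#                [4, 0, 5]]
values  = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_spmv(values, col_idx, row_ptr, x=[1.0, 1.0, 1.0]))   # [3.0, 3.0, 9.0]
```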

Journal ArticleDOI
TL;DR: This work proposes a new Relocatable Hardware-Software Scheduling (RHSS) method that not only can be applied to dynamically relocatable hardware-software tasks, but also increases the reconfigurable hardware resource utilization, reduces the reconfigured hardware resource fragmentation with realistic placement methods, and makes best efforts at meeting the real-time constraints of tasks.
Abstract: With the gradually fading distinction between hardware and software, it is now possible to relocate tasks from a microprocessor to reconfigurable logic and vice versa. However, existing hardware-software scheduling can rarely cope with such runtime task relocation. In this work, we propose a new Relocatable Hardware-Software Scheduling (RHSS) method that not only can be applied to dynamically relocatable hardware-software tasks, but also increases the reconfigurable hardware resource utilization, reduces the reconfigurable hardware resource fragmentation with realistic placement methods, and makes best efforts at meeting the real-time constraints of tasks. The feasibility of the proposed relocatable hardware-software scheduling algorithm was proved by applying it to some randomly generated examples and a real dynamically reconfigurable network security system example. Compared to the quadratic time complexity of the state-of-the-art Adaptive Hardware-Software Allocation (AHSA) method, RHSS is linear in time complexity, and improves the reconfigurable hardware utilization by as much as 117.8%. The scheduling and placement time and the memory usage are also drastically reduced by as much as 89.5% and 96.4%, respectively.

Journal ArticleDOI
TL;DR: This work introduces two different architectures for accelerating the task of finding homologous RNA molecules in a genome database that takes advantage of the tree-like configuration of the covariance models used to represent the consensus secondary structure of an RNA family and converts it directly into a highly-pipelined processing engine.
Abstract: The search for homologous RNA molecules---sequences of RNA that might behave similarly due to similarity in their physical (secondary) structure---is currently a computationally intensive task. Moreover, RNA sequences are populating genome databases at a pace unmatched by gains in standard processor performance. While software tools such as Infernal can efficiently find homologies among RNA families and genome databases of modest size, the continuous advent of new RNA families and the explosive growth in volume of RNA sequences necessitate a faster approach. This work introduces two different architectures for accelerating the task of finding homologous RNA molecules in a genome database. The first architecture takes advantage of the tree-like configuration of the covariance models used to represent the consensus secondary structure of an RNA family and converts it directly into a highly-pipelined processing engine. Results for this architecture show a 24× speedup over Infernal when processing a small RNA model. It is estimated that the architecture could potentially offer several thousands of times speedup over Infernal on larger models, provided that there are sufficient hardware resources available. The second architecture is introduced to address the steep resource requirements of the first architecture. It utilizes a uniform array of processing elements and schedules all of the computations required to scan for an RNA homolog onto those processing elements. The estimated speedup for this architecture over the Infernal software package ranges from just under 20× to over 2,350×.

Journal ArticleDOI
TL;DR: The design of a novel framework for system-level simulative performance prediction of RC systems and applications is presented and a set of simulative case studies are presented to illustrate the various capabilities of the framework to quickly obtain a wide range of performance prediction results and power consumption estimates.
Abstract: Reconfigurable computing (RC) is rapidly emerging as a promising technology for the future of high-performance and embedded computing, enabling systems with the computational density and power of custom-logic hardware and the versatility of software-driven hardware in an optimal mix. Novel methods for rapid virtual prototyping, performance prediction, and evaluation are of critical importance in the engineering of complex reconfigurable systems and applications. These techniques can yield insightful tradeoff analyses while saving valuable time and resources for researchers and engineers alike. The research described herein provides a methodology for mapping arbitrary applications to targeted reconfigurable platforms in a simulation environment called RCSE. By splitting the process into two domains, the application and simulation domains, characterization of each element can occur independently and in parallel, leading to fast and accurate performance prediction results for large and complex systems. This article presents the design of a novel framework for system-level simulative performance prediction of RC systems and applications. The article also presents a set of case studies analyzing two applications, Hyperspectral Imaging (HSI) and Molecular Dynamics (MD), across three disparate RC platforms within the simulation framework. The validation results using each of these applications and systems show that our framework can quickly obtain performance prediction results with reasonable accuracy on a variety of platforms. Finally, a set of simulative case studies are presented to illustrate the various capabilities of the framework to quickly obtain a wide range of performance prediction results and power consumption estimates.

Journal ArticleDOI
Xu Guo, Patrick Schaumont
TL;DR: The impact of the communication link between CPU and coprocessor hardware for a typical Elliptic Curve Cryptography design is studied, and it is demonstrated that the SoC may become performance-limited due to cop rocessor data- and instruction-transfers.
Abstract: Most hardware/software (HW/SW) codesigns of Elliptic Curve Cryptography have focused on the computational aspect of the ECC hardware, and not on the system integration into a System-on-Chip (SoC) architecture. We study the impact of the communication link between CPU and coprocessor hardware for a typical ECC design, and demonstrate that the SoC may become performance-limited due to coprocessor data- and instruction-transfers. A dual strategy is proposed to remove the bottleneck: introduction of control hierarchy as well as local storage. The performance of the ECC coprocessor can be almost independent of the selection of bus protocols. Besides performance, the proposed ECC coprocessor is also optimized for scalability. Using design space exploration of a large number of system configurations of different architectures, our proposed ECC coprocessor architecture enables trade-offs between area, speed, and security.

Journal ArticleDOI
TL;DR: A new architecture called sarfum is proposed that, in addition to ensuring bitstream confidentiality and integrity, precludes the replay of old bitstreams and also includes a protocol for the system designer to remotely monitor the running configuration of the FPGA.
Abstract: Remote update of hardware platforms or embedded systems is a convenient service enabled by Field Programmable Gate Array (FPGA)-based systems. This service is often essential in applications like space-based FPGA systems or set-top boxes. However, having the source of the update be remote from the FPGA system opens the door to a set of attacks that may challenge the confidentiality and integrity of the FPGA configuration, the bitstream. Existing schemes propose to encrypt and authenticate the bitstream to thwart these attacks. However, we show that they do not prevent the replay of old bitstream versions, and thus give adversaries an opportunity for downgrading the system. In this article, we propose a new architecture called sarfum that, in addition to ensuring bitstream confidentiality and integrity, precludes the replay of old bitstreams. sarfum also includes a protocol for the system designer to remotely monitor the running configuration of the FPGA. Following our presentation and analysis of the security protocols, we propose an example of implementation with the CCM (Counter with CBC-MAC) authenticated encryption standard. We also evaluate the impact of our architecture on the configuration time for different FPGA devices.
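A minimal software sketch of the replay-protection idea, assuming the Python cryptography package's AESCCM primitive; it is not the sarfum hardware protocol. Binding a monotonically increasing version number to the CCM-authenticated bitstream lets the receiver reject an old, validly encrypted configuration.

```python
import os, struct
from cryptography.hazmat.primitives.ciphers.aead import AESCCM

key = AESCCM.generate_key(bit_length=128)      # shared between designer and FPGA
aesccm = AESCCM(key)

def package_update(bitstream: bytes, version: int):
    """Designer side: encrypt-and-authenticate the bitstream, binding the version number."""
    nonce = os.urandom(13)
    header = struct.pack(">I", version)        # authenticated but not encrypted
    return nonce, header, aesccm.encrypt(nonce, bitstream, header)

def accept_update(nonce, header, ciphertext, current_version: int) -> bytes:
    """FPGA side: reject anything not newer than the installed version (anti-replay),
    then check integrity/confidentiality via CCM."""
    (version,) = struct.unpack(">I", header)
    if version <= current_version:
        raise ValueError("replay or downgrade attempt")
    return aesccm.decrypt(nonce, ciphertext, header)

nonce, hdr, ct = package_update(b"toy bitstream v7", version=7)
print(accept_update(nonce, hdr, ct, current_version=6))    # accepted
# accept_update(nonce, hdr, ct, current_version=7)         # would raise: replay
```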

Journal ArticleDOI
TL;DR: This article addresses the interconnection architecture by comparing several types of local crossbar-oriented interconnections in terms of hardware resources and flexibility, and shows how the use of local interconnection can result in routing area savings.
Abstract: In addition to the usual articles, it is our pleasure to present extended versions of some of the top papers presented at the 2009 International Workshop on Applied Reconfigurable Computing (ARC’09). The papers cover a wide range of topics from application solutions (Saiprasert et al.) to tools (Kahoul et al., Kępa et al.), novel architectures (Inoue et al., Guo and Schaumont) and even techniques to address radiation effects (Sterpone, Lanuzza et al.). The first article by Saiprasert et al. represents a good example of how researchers must investigate application requirements to derive highly efficient FPGA implementations. They do this by highlighting a novel approach to optimizing a key random-number generator function which allows FPGA resources to be efficiently used by performing a detailed analysis on the impact of the error due to truncation/rounding. The issue of circuit implementation is covered by the work of Kahoul et al. and Kępa et al. Reconfigurability and the move toward heterogeneous resources present challenges for place and route tools. Kahoul et al. present an approach based on integer linear programming, which combines the efficiency of analytical techniques with the accuracy of empirical tools. The issue of low-level tools for bitstream generation debugging and IP core design assurance is addressed in the work by Kępa et al. They present a tool framework which gives a set of high-level application programming interfaces for abstracting Xilinx FPGAs. Inoue et al. propose to overcome the power consumption implications of FPGAs by developing a variable-grain logic cell architecture. In this article, they address the interconnection architecture by comparing several types of local crossbar-oriented interconnections in terms of hardware resources and flexibility and show how the use of local interconnection can result in routing area savings. Guo and Schaumont cover the system aspects of integrating encryption hardware by identifying the importance of communications and proposing efficient control strategies and local storage to overcome the problem and improve scalability. The issue of mitigating radiation effects is a problem particularly in space and avionics applications and is covered by the work of Sterpone and Lanuzza et al. Sterpone avoids the use of redundant hardware to overcome Single Event Upsets (SEUs) by proposing a new timing-driven placement algorithm for implementing soft-error-resilient circuits on SRAM-based FPGAs. Lanuzza et al. adopt a different approach by creating a configuration scrubbing core.

Journal ArticleDOI
TL;DR: A new timing-driven placement algorithm for implementing soft-errors resilient circuits on SRAM-based FPGAs with a negligible degradation of performance is proposed based on a placement heuristic able to remove the crossing error domains while decreasing the routing congestions and delay inserted by the TMR routing and voting scheme.
Abstract: Electronic systems for safety-critical applications such as space and avionics need the maximum level of dependability to guarantee the success of their missions. Simultaneously, the computation capability required in these fields is constantly increasing to afford the implementation of different kinds of applications ranging from signal processing to networking. SRAM-based FPGAs are candidate devices to achieve this goal thanks to their high versatility in implementing complex circuits with a very short development time. However, in critical environments, the presence of Single Event Upsets (SEUs) affecting the FPGA’s functionality requires the adoption of specific fault-tolerant techniques, like Triple Modular Redundancy (TMR), which increase the protection capability against radiation effects but, on the other hand, introduce a dramatic penalty in terms of performance. In this paper, a new timing-driven placement algorithm is proposed for implementing soft-error-resilient circuits on SRAM-based FPGAs with a negligible degradation of performance. The algorithm is based on a placement heuristic able to remove the crossing error domains while decreasing the routing congestion and delay inserted by the TMR routing and voting scheme. Experimental analysis based on timing analysis and SEU static analysis points out a performance improvement of 29% on average with respect to the standard TMR approach and an increased robustness against SEUs affecting the FPGA’s configuration memory. Accurate analyses of SEU sensitivity and performance optimization have been performed on a real microprocessor core, demonstrating a performance improvement of more than 62%.

Journal ArticleDOI
TL;DR: This work considers the need to develop MS techniques that are fast enough to allow interactive synthesis times and repeated applications of the MS to explore different possibilities of synthesizing the circuits, and formalizes the problem of reducing the number of parallel memory references in every row of the kernel by a novel combinatorial setting.
Abstract: In High-Level Synthesis (HLS), extracting parallelism in order to create small and fast circuits is the main advantage of HLS over software execution. Modulo Scheduling (MS) is a technique in which a loop is parallelized by overlapping different parts of successive iterations. This ability to extract parallelism makes MS an attractive synthesis technique for loop acceleration. In this work we consider two problems involved in the use of MS which are central when targeting FPGAs. Current MS scheduling techniques sacrifice execution times in order to meet resource and delay constraints. Let “ideal” execution times be the ones that could have been obtained by MS had we ignored resource and delay constraints. Here we pose the opposite problem, which is more suitable for HLS, namely, how to reduce resource constraints without sacrificing the ideal execution time. We focus on reducing the number of memory ports used by the MS synthesis, which we believe is a crucial resource for HLS. In addition to reducing the number of memory ports we consider the need to develop MS techniques that are fast enough to allow interactive synthesis times and repeated applications of the MS to explore different possibilities of synthesizing the circuits. Current solutions for MS synthesis that can handle memory constraints are too slow to support interactive synthesis. We formalize the problem of reducing the number of parallel memory references in every row of the kernel by a novel combinatorial setting. The proposed technique is based on inserting dummy operations in the kernel and by doing so, performing modulo-shift operations such that the maximal number of parallel memory references in a row is reduced. Experimental results suggest improved execution times for the synthesized circuit. The synthesis takes only a few seconds even for large-size loops.
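The notion of kernel rows and modulo-shifting can be illustrated with a toy schedule (this is not the article's combinatorial algorithm): with initiation interval II, operations whose start cycles are congruent modulo II share a row, and delaying one memory operation by a dummy slot moves it to another row, lowering the peak number of parallel memory references. The schedule below is made up.

```python
from collections import defaultdict

def ports_per_row(schedule, ii):
    """schedule: {op_name: (start_cycle, is_memory_op)}.  With initiation interval `ii`,
    operations whose start cycles are congruent modulo ii share a kernel row and, if they
    are memory operations, compete for memory ports in the same cycle."""
    rows = defaultdict(list)
    for op, (cycle, is_mem) in schedule.items():
        if is_mem:
            rows[cycle % ii].append(op)
    return {r: ops for r, ops in sorted(rows.items())}

II = 2
sched = {"load_a": (0, True), "load_b": (0, True), "mul": (1, False),
         "load_c": (2, True), "store": (3, True)}
print("before shift:", ports_per_row(sched, II))   # row 0 holds 3 memory ops -> 3 ports needed

# Modulo-shift: delay load_c by one cycle (a dummy slot in its column), moving it to
# the other kernel row; the per-row maximum drops from 3 ports to 2.
sched["load_c"] = (3, True)
print("after shift: ", ports_per_row(sched, II))
```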

Journal ArticleDOI
TL;DR: The paper illustrates the application of FDAT (FPGA Design Analysis Tool), a versatile, modular and open tools framework for low-level analysis and verification of FPGA designs, for bit-pattern analysis of Virtex-II Pro and Virtex-5 inter-tile routing and verification of the spatial isolation between designs.
Abstract: The growth of the Reconfigurable Computing (RC) systems community exposes diverse requirements with regard to functionality of Electronic Design Automation (EDA) tools. Low-level design tools are increasingly required for RC bitstream debugging and IP core design assurance, particularly in multiparty Partially Reconfigurable (PR) designs. While tools for low-level analysis of design netlists do exist, there is increasing demand for automated and customisable bitstream analysis tools.This article discusses the need for low-level IP core verification within PR-enabled FPGA systems and reports FDAT (FPGA Design Analysis Tool), a versatile, modular and open tools framework for low-level analysis and verification of FPGA designs. FDAT provides a set of high-level Application Programming Interfaces (APIs) abstracting the Xilinx FPGA fabric, the implemented design (e.g., placed and routed netlist) and the related bitstream. A lightweight graphic front-end allows custom visualisation of the design within the FPGA fabric. The operation of FDAT is governed by “recipe” scripts which support rapid prototyping of the abstract algorithms for system-level design verification. FDAT recipes, being Python scripts, can be ported to embedded FPGA systems, for example, the previously reported Secure Reconfiguration Controller (SeReCon) which enforces an IP core spatial isolation policy in order to provide run-time protection to the PR system.The paper illustrates the application of FDAT for bit-pattern analysis of Virtex-II Pro and Virtex-5 inter-tile routing and verification of the spatial isolation between designs.

Journal ArticleDOI
TL;DR: Comparing the three approaches indicates that the custom RTL approach has the lead in terms of performance, but both the AccelDSP and the Tensilica Xtensa approaches show fast design time and early architectural exploration capability.
Abstract: This work investigates several approaches for implementing the OFDM functions of the fixed-WiMax standard on reconfigurable platforms. In the first phase, a custom RTL approach, using VHDL, is investigated. The approach shows the capability of a medium-size FPGA to accommodate the OFDM functions of a fixed-WiMax transceiver with only a 50% occupation rate. In the second phase, a high-level approach based on the AccelDSP tool is used and compared to the custom RTL approach. The approach presents an easy flow to transfer MATLAB floating-point code into synthesizable cores. The AccelDSP approach shows an area overhead of 10%, while allowing early architectural exploration and accelerating the design time by a factor of two. However, the performance figure obtained is almost 1/4 of that obtained in the custom RTL approach. In the third phase, the Tensilica Xtensa configurable processor is targeted, which presents remarkable figures in terms of power, area, and design time. Comparing the three approaches indicates that the custom RTL approach has the lead in terms of performance. However, both the AccelDSP and the Tensilica Xtensa approaches show fast design time and early architectural exploration capability. In terms of power, the obtained estimation results show that the configurable Xtensa processor approach has the lead, with total power consumption approximately 12--15 times lower than that obtained by the other two approaches.

Journal ArticleDOI
TL;DR: Experimental results reveal that the hardware resource usage on an FPGA as well as the error in the approximation of the distribution of interest are significantly reduced by the use of the optimization techniques introduced in the proposed approach.
Abstract: Monte Carlo simulation is one of the most widely used techniques for computationally intensive simulations in mathematical analysis and modeling. A multivariate Gaussian random number generator is one of the main building blocks of such a system. Field Programmable Gate Arrays (FPGAs) are gaining increased popularity as an alternative means to the traditional general purpose processors targeting the acceleration of the computationally expensive random number generator block. This article presents a novel approach for mapping a multivariate Gaussian random number generator onto an FPGA by optimizing the computational path in terms of hardware resource usage subject to an acceptable error in the approximation of the distribution of interest. The proposed approach is based on the eigenvalue decomposition algorithm which leads to a design with different precision requirements in the computational paths. An analysis on the impact of the error due to truncation/rounding operation along the computational path is performed and an analytical expression of the error inserted into the system is presented. Based on the error analysis, three algorithms that optimize the resource utilization and at the same time minimize the error in the output of the system are presented and compared. Experimental results reveal that the hardware resource usage on an FPGA as well as the error in the approximation of the distribution of interest are significantly reduced by the use of the optimization techniques introduced in the proposed approach.
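The eigenvalue-decomposition formulation behind such a generator is short; the numpy sketch below draws correlated Gaussian vectors as x = m + U·sqrt(Λ)·z in double precision. The article's actual contribution, fixed bit-widths per computational path and the associated truncation-error analysis, is not reproduced here, and the mean and covariance are arbitrary.

```python
import numpy as np

def mvn_sampler(mean, cov):
    """Return a sampler for N(mean, cov) built from the eigendecomposition of cov:
    x = mean + U * sqrt(L) * z with z ~ N(0, I).  In a hardware mapping, each column
    of U*sqrt(L) becomes a multiply-accumulate path with its own precision."""
    eigvals, eigvecs = np.linalg.eigh(cov)                 # cov = U diag(L) U^T
    transform = eigvecs @ np.diag(np.sqrt(np.maximum(eigvals, 0.0)))
    def sample(n):
        z = np.random.standard_normal((len(mean), n))      # independent univariate Gaussians
        return (np.asarray(mean)[:, None] + transform @ z).T
    return sample

mean = [0.0, 1.0]
cov = [[2.0, 0.6],
       [0.6, 1.0]]
draw = mvn_sampler(mean, cov)
xs = draw(200000)
print("sample mean      :", xs.mean(axis=0))
print("sample covariance:\n", np.cov(xs.T))
```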