
Showing papers on "Reconfigurable computing published in 2014"


Proceedings ArticleDOI
11 May 2014
TL;DR: A first attempt at closely fitting FPGAs into existing cloud computing models, where resources are virtualized, flexible, and have the illusion of infinite scalability, shows that FPGA cloud compute resources can easily outperform virtual machines, while the system's virtualization and abstraction significantly reduces design iteration time and design complexity.
Abstract: We present a new approach for integrating virtualized FPGA-based hardware accelerators into commercial-scale cloud computing systems, with minimal virtualization overhead. Partially reconfigurable regions across multiple FPGAs are offered as generic cloud resources through OpenStack (open-source cloud software), thereby allowing users to "boot" custom designed or predefined network-connected hardware accelerators with the same commands they would use to boot a regular Virtual Machine. We propose a hardware and software framework to enable this virtualization. This is a first attempt at closely fitting FPGAs into existing cloud computing models, where resources are virtualized, flexible, and have the illusion of infinite scalability. Our system can set up and tear down virtual accelerators in approximately 2.6 seconds on average, much faster than regular virtual machines. The static virtualization hardware on the physical FPGAs causes only a three-cycle latency increase and a one-cycle pipeline stall per packet in accelerators when compared to a non-virtualized system. We present a case study analyzing the design and performance of an application-level load balancer using a fully implemented prototype of our system. Our study shows that FPGA cloud compute resources can easily outperform virtual machines, while the system's virtualization and abstraction significantly reduce design iteration time and design complexity.
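
For intuition, here is a minimal Python sketch of the allocation model the abstract describes: partially reconfigurable (PR) regions pooled and "booted" like VMs. All names (PRRegion, FpgaPool, boot_accelerator) are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch: PR regions offered as generic, schedulable cloud
# resources. Programming a region stands in for loading a user's partial
# bitstream (~2.6 s setup in the paper's measurements).
class PRRegion:
    def __init__(self, fpga_id, slot):
        self.fpga_id, self.slot, self.in_use = fpga_id, slot, False

    def program(self, partial_bitstream):
        # A real system would drive the FPGA configuration port here.
        self.in_use = True
        print(f"FPGA {self.fpga_id} slot {self.slot}: loaded {partial_bitstream}")

class FpgaPool:
    """Allocates PR regions the way a cloud scheduler allocates VMs."""
    def __init__(self, regions):
        self.regions = regions

    def boot_accelerator(self, partial_bitstream):
        for r in self.regions:
            if not r.in_use:
                r.program(partial_bitstream)
                return r
        raise RuntimeError("no free PR region")

pool = FpgaPool([PRRegion(0, s) for s in range(4)])
pool.boot_accelerator("load_balancer.bit")   # "boot" like a VM
```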

210 citations


Proceedings ArticleDOI
Fei Chen, Yi Shan, Yu Zhang, Yu Wang, Hubertus Franke, Xiaotao Chang, Kun Wang
20 May 2014
TL;DR: A general framework for integrating FPGAs into the cloud is proposed and a prototype of the framework is implemented based on OpenStack, Linux-KVM and Xilinx FPGA, which enables isolation between multiple processes in multiple VMs, precise quantitative acceleration resource allocation, and priority-based workload scheduling.
Abstract: Cloud computing is becoming a major trend for delivering and accessing infrastructure on demand via the network. Meanwhile, the usage of FPGAs (Field Programmable Gate Arrays) for computation acceleration has made significant inroads into multiple application domains due to their ability to achieve high throughput and predictable latency, while providing programmability, low power consumption and time-to-value. Many types of workloads, e.g. databases, big data analytics, and high performance computing, can be and have been accelerated by FPGAs. As more and more workloads are being deployed in the cloud, it is appropriate to consider how to make FPGAs and their capabilities available in the cloud. However, such integration is non-trivial due to issues related to FPGA resource abstraction and sharing, compatibility with applications and accelerator logics, and security, among others. In this paper, a general framework for integrating FPGAs into the cloud is proposed and a prototype of the framework is implemented based on OpenStack, Linux-KVM and Xilinx FPGAs. The prototype enables isolation between multiple processes in multiple VMs, precise quantitative acceleration resource allocation, and priority-based workload scheduling. Experimental results demonstrate the effectiveness of this prototype, an acceptable overhead, and good scalability when hosting multiple VMs and processes.

198 citations


Journal ArticleDOI
08 Jul 2014
TL;DR: Motivated by specific threats, this paper describes FPGA security primitives from multiple FPGA vendors and gives examples of those primitives in use in applications.
Abstract: Since their inception, field-programmable gate arrays (FPGAs) have grown in capacity and complexity so that now FPGAs include millions of gates of logic, megabytes of memory, high-speed transceivers, analog interfaces, and whole multicore processors. Applications running in the FPGA include communications infrastructure, digital cinema, sensitive database access, critical industrial control, and high-performance signal processing. As the value of the applications and the data they handle have grown, so has the need to protect those applications and data. Motivated by specific threats, this paper describes FPGA security primitives from multiple FPGA vendors and gives examples of those primitives in use in applications.

144 citations


Journal ArticleDOI
TL;DR: ZyCAP combines high-throughput configuration with a high-level software interface that frees the processor from detailed PR management, making PR on the Zynq easy and efficient.
Abstract: New hybrid FPGA platforms that couple processors with a reconfigurable fabric, such as the Xilinx Zynq, offer an alternative view of reconfigurable computing where software applications leverage hardware resources through the use of often reconfigured accelerators. For this to be feasible, reconfiguration overheads must be reduced so that the processor is not burdened with managing the process. We discuss partial reconfiguration (PR) on these architectures, and present an open source controller, ZyCAP, that overcomes the limitations of existing methods, offering more effective use of hardware resources in such architectures. ZyCAP combines high-throughput configuration with a high-level software interface that frees the processor from detailed PR management, making PR on the Zynq easy and efficient.
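
As a feel for what such an interface buys, the following Python stand-in mimics the division of labor: software names an accelerator and the controller handles the transfer. The class and method names are hypothetical; ZyCAP itself is a hardware controller with a driver API.

```python
class PRController:
    """Hypothetical high-level PR interface in the spirit of ZyCAP."""
    def __init__(self, bitstreams):
        self.bitstreams = bitstreams   # accelerator name -> partial bitstream
        self.loaded = None

    def load(self, name):
        # The key idea: one high-level call; a dedicated DMA engine streams
        # the bitstream to the configuration port at high throughput and
        # signals completion by interrupt, leaving the CPU free.
        self._dma_to_config_port(self.bitstreams[name])
        self.loaded = name

    def _dma_to_config_port(self, data):
        pass                           # stand-in for the hardware transfer

ctl = PRController({"fir": b"<fir.bit>", "fft": b"<fft.bit>"})
ctl.load("fft")
assert ctl.loaded == "fft"
```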

114 citations


Journal ArticleDOI
TL;DR: ReconOS allows for rapid design-space exploration, supports a structured application development process, and improves the portability of applications between different reconfigurable computing systems.
Abstract: The ReconOS operating system for reconfigurable computing offers a unified multithreaded programming model and OS services for threads executing in software and threads mapped to reconfigurable hardware. The OS interface lets hardware threads interact with software threads using well-known mechanisms such as semaphores, mutexes, condition variables, and message queues. By semantically integrating hardware accelerators into a standard OS environment, ReconOS allows for rapid design-space exploration, supports a structured application development process, and improves the portability of applications between different reconfigurable computing systems.
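
A software-only Python stand-in conveys the programming model: both sides use ordinary OS primitives, and in ReconOS the worker below would run in the fabric as a hardware thread. The queue and semaphore names are illustrative.

```python
import threading, queue

to_hw, from_hw = queue.Queue(), queue.Queue()   # ReconOS-style mboxes
done = threading.Semaphore(0)

def hw_thread():
    data = to_hw.get()         # blocking mbox read, as a hardware thread would
    from_hw.put(data * 2)      # placeholder for the accelerator's computation
    done.release()             # semaphore shared with software threads

threading.Thread(target=hw_thread).start()
to_hw.put(21)                  # software thread posts work...
done.acquire()                 # ...and synchronizes as with any other thread
print(from_hw.get())           # -> 42
```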

111 citations


Proceedings ArticleDOI
08 Jun 2014
TL;DR: Methods to improve the efficiency of SGM on general-purpose PCs, through fine-grained parallelization and use of multiple cores, are studied; the approach is scalable to the number of available cores and portable to embedded processors.
Abstract: Semi-Global Matching (SGM) is widely used for real-time stereo vision in the automotive context. Despite its popularity, only implementations using reconfigurable hardware (FPGA) or graphics hardware (GPU) achieve high enough frame rates for intelligent vehicles. Existing real-time implementations for general-purpose PCs use image and disparity sub-sampling at the expense of matching quality. We study methods to improve the efficiency of SGM on general-purpose PCs through fine-grained parallelization and use of multiple cores. The different approaches are evaluated on the KITTI benchmark, which provides real imagery with LIDAR ground truth. The system is able to compute disparity maps of VGA image pairs with a disparity range of 128 values at more than 16 Hz. The approach is scalable to the number of available cores and portable to embedded processors.
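
The kernel being parallelized is the standard SGM path-cost recurrence, L(p, d) = C(p, d) + min(L(p-1, d), L(p-1, d±1) + P1, min_k L(p-1, k) + P2) - min_k L(p-1, k). A NumPy sketch for a single left-to-right path follows; the penalty values P1 and P2 are typical placeholders, not the paper's settings.

```python
import numpy as np

def sgm_path(C, P1=10, P2=120):
    """C: matching cost volume, shape (width, disparities); one L-to-R path."""
    W, D = C.shape
    L = np.empty_like(C)
    L[0] = C[0]
    for p in range(1, W):
        prev = L[p - 1]
        best = prev.min()                          # min_k L(p-1, k)
        cand = np.stack([
            prev,                                  # stay at same disparity
            np.r_[prev[1:], np.inf] + P1,          # step to d+1
            np.r_[np.inf, prev[:-1]] + P1,         # step to d-1
            np.full(D, best + P2),                 # larger disparity jump
        ])
        L[p] = C[p] + cand.min(axis=0) - best      # subtract to bound values
    return L

costs = np.random.rand(640, 128).astype(np.float32)
print(sgm_path(costs).shape)                       # (640, 128)
```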

108 citations


Posted Content
TL;DR: Techniques for an efficient Cumulative Distribution Table (CDT) based Gaussian sampler on reconfigurable hardware, involving Peikert's convolution lemma and the Kullback-Leibler divergence, and a first BLISS architecture for Xilinx Spartan-6 FPGAs that integrates fast FFT/NTT-based polynomial multiplication, sparse multiplication, and a Keccak hash function are presented.
Abstract: The recent Bimodal Lattice Signature Scheme (BLISS) showed that lattice-based constructions have evolved to practical alternatives to RSA or ECC. It offers small signatures of 5600 bits for a 128-bit level of security, and proved to be very fast in software. However, due to the complex sampling of Gaussian noise with high precision, it is not clear whether this scheme can be mapped efficiently to embedded devices. Even though the authors of BLISS also proposed a new sampling algorithm using Bernoulli variables, this approach is more complex than previous methods using large precomputed tables. The clear disadvantage of using large tables for high performance is that they cannot be used in constrained computing environments, such as FPGAs, with limited memory. In this work we thus present techniques for an efficient Cumulative Distribution Table (CDT) based Gaussian sampler on reconfigurable hardware involving Peikert's convolution lemma and the Kullback-Leibler divergence. Based on our enhanced sampler design, we provide a scalable implementation of BLISS signing and verification on a Xilinx Spartan-6 FPGA supporting either 128-bit, 160-bit, or 192-bit security. For high speed we integrate fast FFT/NTT-based polynomial multiplication, parallel sparse multiplication, Huffman compression of signatures, and Keccak as hash function. Additionally, we compare the CDT with the Bernoulli approach and show that for the particular BLISS-I parameter set the improved CDT approach is faster with lower area consumption. Our BLISS-I core uses 2,291 slices, 5.5 BRAMs, and 5 DSPs and performs a signing operation in 114.1 μs on average. Verification is even faster with a latency of 61.2 μs and 17,101 supported verification operations per second.
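
The CDT sampler itself is conceptually simple: store the cumulative distribution of the discrete Gaussian and turn uniform randomness into a sample by table search. A floating-point Python sketch is below; real designs, including this one, use high-precision fixed-point tables and constant-time lookups, and the tail cut here is an assumption (sigma matches BLISS-I).

```python
import bisect, math, random

sigma, tail = 215.0, 13                    # BLISS-I sigma; tail cut assumed
bound = int(tail * sigma)
rho = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(bound)]
total = rho[0] + 2 * sum(rho[1:])          # mass of +x and -x for x > 0

cdt, acc = [], 0.0
for x in range(bound):                     # cumulative table over |x|
    acc += (rho[x] if x == 0 else 2 * rho[x]) / total
    cdt.append(acc)

def sample():
    x = bisect.bisect_right(cdt, random.random())    # table search
    return x if x == 0 or random.random() < 0.5 else -x

print([sample() for _ in range(8)])
```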

100 citations


Journal ArticleDOI
TL;DR: A performance analysis of the FPGA and the GPU implementations, and an extra CPU reference implementation, shows the competitive throughput of the proposed architecture even at a much lower clock frequency than those of the GPU and the CPU.
Abstract: This work presents a new flexible, parameterizable architecture for image and video processing with reduced latency and memory requirements, supporting a variable input resolution. The proposed architecture is optimized for feature detection, more specifically the Canny edge detector and the Harris corner detector. The architecture contains neighborhood extractors and threshold operators that can be parameterized at runtime. Also, algorithm simplifications are employed to reduce mathematical complexity, memory requirements, and latency without losing reliability. Furthermore, we present the proposed architecture implementation on an FPGA-based platform and its analogous optimized implementation on a GPU-based architecture for comparison. A performance analysis of the FPGA and the GPU implementations, and an extra CPU reference implementation, shows the competitive throughput of the proposed architecture even at a much lower clock frequency than those of the GPU and the CPU. The results also show a clear advantage of the proposed architecture in terms of power consumption, and it maintains reliable performance with noisy images while keeping latency and memory requirements low.
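
Of the two detectors, the Harris response illustrates the per-pixel arithmetic such an architecture streams: R = det(M) - k*tr(M)^2 over a smoothed structure tensor of image gradients. A NumPy sketch with a 3x3 box filter standing in for the neighborhood extractor; the window size and k = 0.04 are conventional choices, not taken from the paper.

```python
import numpy as np

def harris(img, k=0.04):
    Iy, Ix = np.gradient(img.astype(float))          # image gradients
    def box(a):                                      # 3x3 box filter
        p = np.pad(a, 1)
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3)) / 9.0
    Sxx, Syy, Sxy = box(Ix * Ix), box(Iy * Iy), box(Ix * Iy)
    return Sxx * Syy - Sxy ** 2 - k * (Sxx + Syy) ** 2

img = np.random.rand(64, 64)
corners = harris(img) > 0.01        # runtime-parameterizable threshold
```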

91 citations


Book ChapterDOI
23 Sep 2014
TL;DR: In this paper, an efficient Cumulative Distribution Table (CDT) based Gaussian sampler was proposed for the Bimodal Lattice Signature Scheme (BLISS) on reconfigurable hardware involving Peikert's convolution lemma and Kullback-Leibler divergence.
Abstract: The recent Bimodal Lattice Signature Scheme (BLISS) showed that lattice-based constructions have evolved to practical alternatives to RSA or ECC. Besides reasonably small signatures with 5600 bits for a 128-bit level of security, BLISS enables extremely fast signing and signature verification in software. However, due to the complex sampling of Gaussian noise with high precision, it is not clear whether this scheme can be mapped efficiently to embedded devices. Even though the authors of BLISS also proposed a new sampling algorithm using Bernoulli variables, this approach is more complex than previous methods using large precomputed tables. The clear disadvantage of using large tables for high performance is that they cannot be used in constrained computing environments, such as FPGAs, with limited memory. In this work we thus present techniques for an efficient Cumulative Distribution Table (CDT) based Gaussian sampler on reconfigurable hardware involving Peikert's convolution lemma and the Kullback-Leibler divergence. Based on our enhanced sampler design, we provide a first BLISS architecture for Xilinx Spartan-6 FPGAs that integrates fast FFT/NTT-based polynomial multiplication, sparse multiplication, and a Keccak hash function. Additionally, we compare the CDT with the Bernoulli approach and show that for the particular BLISS-I parameter set the improved CDT approach is faster with lower area consumption. Our core uses 2,431 slices, 7.5 BRAMs, and 6 DSPs and performs a signing operation in 126 μs on average. Verification takes even less time at 70 μs.

85 citations


Journal ArticleDOI
31 Mar 2014-Sensors
TL;DR: A review of developments in the use of FPGAs in sensor systems is presented, describing as well the FPGA technologies employed by the different research groups and providing an overview of future research within this field.
Abstract: The current trend in the evolution of sensor systems seeks ways to provide more accuracy and resolution, while at the same time decreasing size and power consumption. The use of Field Programmable Gate Arrays (FPGAs) provides specific reprogrammable hardware technology that can be properly exploited to obtain a reconfigurable sensor system. This adaptation capability enables the implementation of complex applications using partial reconfigurability at very low power consumption. For highly demanding tasks, FPGAs have been favored due to the high efficiency provided by their architectural flexibility (parallelism, on-chip memory, etc.), reconfigurability, and superb performance in the development of algorithms. FPGAs have improved the performance of sensor systems and have triggered a clear increase in their use in new fields of application. A new generation of smarter, reconfigurable, and lower-power sensors is being developed in Spain based on FPGAs. In this paper, a review of these developments is presented, describing the FPGA technologies employed by the different research groups and providing an overview of future research within this field.

79 citations


Proceedings ArticleDOI
20 Oct 2014
TL;DR: In this article, the authors compare the trends of these computing architectures for high-performance computing and survey these platforms in the execution of algorithms belonging to different scientific application domains, showing that FPGAs are increasing the gap to GPUs and many-core CPUs moving them away from highperformance computing with intensive floating-point calculations.
Abstract: Floating-point computing with more than one TFLOP of peak performance is already a reality in recent Field-Programmable Gate Arrays (FPGAs). General-Purpose Graphics Processing Units (GPGPUs) and recent many-core CPUs have also taken advantage of the recent technological innovations in integrated circuit (IC) design and have dramatically improved their peak performances. In this paper, we compare the trends of these computing architectures for high-performance computing and survey these platforms in the execution of algorithms belonging to different scientific application domains. Trends in peak performance, power consumption and sustained performance, for particular applications, show that FPGAs are increasing the gap to GPUs and many-core CPUs, moving them away from high-performance computing with intensive floating-point calculations. FPGAs become competitive for custom floating-point or fixed-point representations, for smaller input sizes of certain algorithms, for combinational logic problems and parallel map-reduce problems.

Journal ArticleDOI
TL;DR: This work sets new area records by proposing the hardware architecture of the smallest block cipher ever published on field-programmable gate arrays (FPGAs) at a 128-bit level of security.
Abstract: While the advanced encryption standard (AES) is extensively in use in a number of applications, its area cost limits its deployment in resource-constrained platforms. In this letter, we have implemented SIMON, a recent promising low-cost alternative to AES, on reconfigurable platforms. The Feistel network, the construction of the round function, and the key generation of SIMON enable bit-serial hardware architectures which can significantly reduce the cost. Moreover, encryption and decryption can be done using the same hardware. The results show that with an equivalent security level, SIMON is 86% smaller than AES, 70% smaller than PRESENT (a standardized low-cost AES alternative), and its smallest hardware architecture only costs 36 slices (72 LUTs, 30 registers). To the best of our knowledge, this work sets new area records as we propose the hardware architecture of the smallest block cipher ever published on field-programmable gate arrays (FPGAs) at a 128-bit level of security. Therefore, SIMON is a strong alternative to AES for low-cost FPGA-based applications.
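
SIMON's area efficiency comes from a round function of rotates, one AND, and XORs, which a bit-serial datapath reduces to a handful of LUTs. A Python sketch of the Feistel round for Simon128/128 (64-bit words, 68 rounds); the key schedule is omitted and dummy round keys are used for illustration.

```python
MASK = (1 << 64) - 1

def rotl(x, r):
    return ((x << r) | (x >> (64 - r))) & MASK

def simon_round(x, y, k):
    # f(x) = (x <<< 1 & x <<< 8) ^ (x <<< 2), then the Feistel swap;
    # decryption runs the same structure in reverse, so hardware is shared.
    return y ^ (rotl(x, 1) & rotl(x, 8)) ^ rotl(x, 2) ^ k, x

x, y = 0x0123456789ABCDEF, 0xFEDCBA9876543210
for k in range(68):                 # Simon128/128 uses 68 rounds
    x, y = simon_round(x, y, k)     # dummy round keys, for illustration only
print(hex(x), hex(y))
```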

Proceedings ArticleDOI
24 Nov 2014
TL;DR: A two-phase detection approach to differentiate recycled (used) FPGAs from new ones; both phases rely on machine learning via support vector machines (SVM) for classification.
Abstract: The counterfeit electronic component industry continues to threaten the security and reliability of systems by infiltrating recycled components into the supply chain. With the increased use of FPGAs in critical systems, recycled FPGAs cause significant concerns for government and industry. In this paper, we propose a two-phase detection approach to differentiate recycled (used) FPGAs from new ones. Both phases rely on machine learning via support vector machines (SVM) for classification. The first phase examines suspect FPGAs “as is” while the second phase requires some accelerated aging. To be more specific, Phase I detects recycled FPGAs by comparing the frequencies of ring oscillators (ROs) distributed on the FPGAs against a golden model. Experimental results on Xilinx FPGAs show that Phase I can correctly classify 8 out of 20 FPGAs under test. However, Phase I fails to detect FPGAs at fast corners and with lesser prior usage. Phase II is then used to complement Phase I and overcome its limitations. The second phase performs a short aging step on the suspect FPGAs and exploits the aging speed reduction (due to prior usage) to cover the cases missed by Phase I. In our silicon results, Phase II detects all the fresh and recycled FPGAs correctly.
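
Phase I reduces to supervised classification over RO-frequency vectors. A scikit-learn sketch on synthetic data follows; the paper trains against measurements from golden (known-new) devices, and the frequencies and spreads below are invented.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
new_fpgas  = rng.normal(100.0, 0.5, (20, 16))   # 16 RO freqs (MHz) per chip
used_fpgas = rng.normal( 98.5, 0.7, (20, 16))   # aging slows the ROs down

X = np.vstack([new_fpgas, used_fpgas])
y = np.array([0] * 20 + [1] * 20)               # 0 = new, 1 = recycled

clf = SVC(kernel="rbf").fit(X, y)
suspect = rng.normal(98.6, 0.6, (1, 16))        # device under test
print("recycled" if clf.predict(suspect)[0] else "new")
```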

Proceedings ArticleDOI
17 Aug 2014
TL;DR: This work designs and builds a flexible and consolidated framework, OpenANFV, to support virtualized accelerators for MBs in the cloud environment, providing high performance on top of commodity hardware to cope with various virtual function requirements.
Abstract: The resources of dedicated accelerators (e.g. FPGAs) are still required to bridge the gap between software-based middleboxes (MBs) and commodity hardware. To consolidate various hardware resources in an elastic, programmable and reconfigurable manner, we design and build a flexible and consolidated framework, OpenANFV, to support virtualized accelerators for MBs in the cloud environment. OpenANFV is seamlessly and efficiently integrated into OpenStack to provide high performance on top of commodity hardware to cope with various virtual function requirements. OpenANFV works as an independent component to manage and virtualize the acceleration resources (e.g. cinder manages block storage resources and nova manages computing resources). Specifically, OpenANFV has the following three features. (1) Automated management. Provisioning for multiple Virtualized Network Functions (VNFs) is automated to meet the dynamic requirements of the NFV environment. Such automation alleviates the time pressure of complicated provisioning and configuration, and reduces the probability of manually induced configuration errors. (2) Elasticity. VNFs are created, migrated, and destroyed on demand in real time. The reconfigurable hardware resources in the pool can rapidly and flexibly offload the corresponding services to the accelerator platform in the dynamic NFV environment. (3) Coordination with OpenStack. The design and implementation of the OpenANFV APIs coordinate with the mechanisms in OpenStack to support the required virtualized MBs for multiple tenants.

Journal ArticleDOI
14 Jun 2014
TL;DR: The single-graph multiple-flows (SGMF) architecture that combines coarse-grain reconfigurable computing with dynamic dataflow to deliver massive thread-level parallelism is presented, positioned as an energy-efficient design alternative for GPGPUs.
Abstract: We present the single-graph multiple-flows (SGMF) architecture that combines coarse-grain reconfigurable computing with dynamic dataflow to deliver massive thread-level parallelism. The CUDA-compatible SGMF architecture is positioned as an energy-efficient design alternative for GPGPUs. The architecture maps a compute kernel, represented as a dataflow graph, onto a coarse-grain reconfigurable fabric composed of a grid of interconnected functional units. Each unit dynamically schedules instances of the same static instruction originating from different CUDA threads. The dynamically scheduled functional units enable streaming the data of multiple threads (or graph flows, in SGMF parlance) through the grid. The combination of statically mapped instructions and direct communication between functional units obviates the need for a full instruction pipeline and a centralized register file, whose energy overheads burden GPGPUs. We show that the SGMF architecture delivers performance comparable to that of contemporary GPGPUs while consuming 57% less energy on average.

Proceedings ArticleDOI
20 Oct 2014
TL;DR: This work proposes an automated methodology to generate FPGA bitstreams from high-level programs written in Domain-Specific Languages (DSLs), leveraging the domain-knowledge conveyed by the DSL and its domain-specific semantics to extract application parallelism, perform optimizations and also identify a suitable system-architecture for the implementation, thereby relieving the user from most of the hardware-level details.
Abstract: Field Programmable Gate Arrays (FPGAs) are very versatile devices, but their complicated programming model has stymied their widespread usage. While modern High-Level Synthesis (HLS) tools provide better programming models, the interface they offer is still too low-level. In order to produce good quality hardware designs with these tools, users are forced to manually perform optimizations that demand detailed knowledge of both the application and the implementation platform. Additionally, many HLS tools only generate isolated hardware modules that the user still needs to integrate into a system design before generating the FPGA bitstream. These problems make HLS tools difficult to use for application developers who have little hardware design knowledge. To address these problems, we propose an automated methodology to generate FPGA bitstreams from high-level programs written in Domain-Specific Languages (DSLs). We leverage the domain knowledge conveyed by the DSL and its domain-specific semantics to extract application parallelism, perform optimizations, and identify a suitable system architecture for the implementation, thereby relieving the user from most of the hardware-level details. We demonstrate the high productivity and high design quality this approach offers by automatically generating hardware systems from applications written in OptiML, a machine-learning DSL. To evaluate our methodology, we use four OptiML applications and show that we can easily generate different solutions which achieve different trade-offs between performance and area. More importantly, the results reveal that our generated hardware achieves much better performance compared to that obtained from the HLS tool without platform-specific optimizations.

Journal ArticleDOI
Gordon J. Brebner, Weirong Jiang
TL;DR: A tool chain is presented that maps a domain-specific packet-processing language called PX to high-performance reconfigurable-computing architectures based on field-programmable gate array (FPGA) technology.
Abstract: Internet applications, notably streaming video, demand extremely high communication speeds in core networks, currently 100 Gbps and moving toward 400 Gbps and beyond. Data packets must be processed at these rates, presenting serious challenges for traditional computing approaches. This article presents a tool chain that maps a domain-specific packet-processing language called PX to high-performance reconfigurable-computing architectures based on field-programmable gate array (FPGA) technology. PX is a declarative language with object-oriented semantics. A customized computing architecture is generated to match the exact requirements expressed in the PX description. The architecture includes components for packet parsing and editing, and for table lookups. It is expressed in a register transfer level (RTL) description, which is then processed using standard FPGA implementation tools. The architecture is dynamically programmable via custom firmware updates when the packet-processing system is in operation. The authors illustrate the language, tool chain, and implementation results through a practical example involving a 100-Gbps OpenFlow implementation.

Book ChapterDOI
14 Apr 2014
TL;DR: It is shown that an extended architecture with a dedicated inverter stage can achieve a performance of more than 32,000 point multiplications per second on a (small) Xilinx Zynq 7020 FPGA, making the design suitable for cheap deployment in many future security applications.
Abstract: Elliptic curve cryptography (ECC) has become the predominant asymmetric cryptosystem found in most devices during the last years. Despite significant progress in efficient implementations, computations over standardized elliptic curves still come with enormous complexity, in particular when implemented on small, embedded devices. In this context, Bernstein proposed the highly efficient ECC instance Curve25519, which was shown to achieve new ECC speed records in software while providing a high security level comparable to AES with a 128-bit key. These very tempting results from the software domain have led to the adoption of Curve25519 by several security-related applications, such as the NaCl cryptographic library or anonymous routing networks (nTor). In this work we demonstrate that even better efficiency of Curve25519 can be realized on reconfigurable hardware, in particular by employing its Digital Signal Processor (DSP) blocks. In a first proposal, we present a DSP-based single-core architecture that provides high performance despite moderate resource requirements. As a second proposal, we show that an extended architecture with a dedicated inverter stage can achieve a performance of more than 32,000 point multiplications per second on a (small) Xilinx Zynq 7020 FPGA. This clearly outperforms speed results of any software-based and most hardware-based implementations known so far, making our design suitable for cheap deployment in many future security applications.
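
The accelerated operation is the x-only Montgomery ladder, whose inner loop is exactly the multiply-heavy kernel that maps well onto DSP blocks. A compact Python sketch following the RFC 7748 formulas (scalar clamping and side-channel hardening omitted):

```python
P = 2**255 - 19
A24 = (486662 - 2) // 4                 # (A - 2) / 4 for Curve25519

def x25519(k, u):
    x2, z2, x3, z3, swap = 1, 0, u, 1, 0
    for t in reversed(range(255)):
        bit = (k >> t) & 1
        if swap ^ bit:                  # conditional swap of ladder state
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = bit
        a, b = (x2 + z2) % P, (x2 - z2) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        da, cb = d * a % P, c * b % P
        x3, z3 = pow(da + cb, 2, P), u * pow(da - cb, 2, P) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        x2, z2 = aa * bb % P, e * (aa + A24 * e) % P
    if swap:
        x2, z2 = x3, z3
    return x2 * pow(z2, P - 2, P) % P   # one field inversion per scalar mult

print(hex(x25519(9, 9)))                # scalar 9 times the base point x = 9
```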

Journal ArticleDOI
TL;DR: High-Q tunable filters are in demand in both wireless and satellite applications and provide the network operator the means for efficiently managing hardware resources, while accommodating multistandards requirements and achieving network traffic/capacity optimization.
Abstract: High-Q tunable filters are in demand in both wireless and satellite applications. The need for tunability and configurability in wireless systems arises when deploying different systems that coexist geographically. Such deployments take place regularly when an operator has already installed a network and needs to add a new-generation network, for example, to add a long-term evolution (LTE) network to an existing third-generation (3G) network. The availability of tunable/reconfigurable hardware will also provide the network operator the means for efficiently managing hardware resources, while accommodating multistandard requirements and achieving network traffic/capacity optimization. Wireless systems can also benefit from tunable filter technologies in other areas; for example, installing wireless infrastructure equipment, such as a remote radio unit (RRU) on top of a 15-story-high communication tower, is a very costly task. By using tunable filters, one installation can serve for many years, since if there is a need to change the frequency or bandwidth, this can be done through remote electronic tuning rather than installing a new filter. Additionally, in urban areas there is very limited space for wireless service providers to install their base stations, due to expensive real estate and/or maximum weight-loading constraints on certain installation locations such as light poles or power lines. Therefore, once an installation site is acquired, it is natural for wireless service providers to use tunable filters to pack many functions, such as multiple standards and multiple bands, into one site.

Proceedings ArticleDOI
19 May 2014
TL;DR: An efficient reconfigurable architecture for parallel BFS that adopts new optimizations for utilizing memory bandwidth, a custom graph representation based on the compressed sparse row (CSR) format, and a restructuring of the conventional BFS algorithm.
Abstract: Large-scale graph structures are considered a keystone for many emerging high-performance computing applications, in which Breadth-First Search (BFS) is an important building block. For such graph structures, BFS operations tend to be memory-bound rather than compute-bound. In this paper, we present an efficient reconfigurable architecture for parallel BFS that adopts new optimizations for utilizing memory bandwidth. Our architecture adopts a custom graph representation based on the compressed sparse row (CSR) format, as well as a restructuring of the conventional BFS algorithm. By taking maximum advantage of available memory bandwidth, our architecture continuously keeps our processing elements active. Using a commercial high-performance reconfigurable computing system (the Convey HC-2), our results demonstrate a 5× speedup over previously published FPGA-based implementations.
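
In CSR form a BFS sweep is two streaming array reads, row_ptr then col_idx, which is why the workload is memory-bound. A small Python sketch of the representation follows; the paper's restructured, hardware-oriented algorithm differs in the details.

```python
from collections import deque

def bfs_csr(row_ptr, col_idx, source):
    dist = [-1] * (len(row_ptr) - 1)
    dist[source] = 0
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for e in range(row_ptr[v], row_ptr[v + 1]):   # v's edge range
            u = col_idx[e]
            if dist[u] < 0:                           # undiscovered vertex
                dist[u] = dist[v] + 1
                frontier.append(u)
    return dist

# 4-vertex example: edges 0->1, 0->2, 1->3, 2->3
print(bfs_csr([0, 2, 3, 4, 4], [1, 2, 3, 3], 0))      # [0, 1, 1, 2]
```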

Proceedings ArticleDOI
01 Jun 2014
TL;DR: This work investigates lightweight aspects and suitable parameter sets for Ring-LWE encryption and shows optimizations that enable implementations even with very few resources on a reconfigurable hardware device.
Abstract: Ideal lattice-based cryptography has gained significant attention in recent years due to its versatility, simplicity, and performance in implementations. Nevertheless, existing implementations of encryption schemes reported only results trimmed for high performance, which is certainly not sufficient for all applications in practice. In this work, by contrast, we investigate lightweight aspects and suitable parameter sets for Ring-LWE encryption and show optimizations that enable implementations even with very few resources on a reconfigurable hardware device. Despite this restriction, we still achieve reasonable throughput that is sufficient for many of today's and future applications.
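
For orientation, textbook Ring-LWE encryption over R_q = Z_q[x]/(x^n + 1) is sketched below with schoolbook multiplication; the paper's contribution is choosing parameters and optimizations so this arithmetic fits in minimal hardware. Here n is toy-sized and the noise bound is invented, while q = 7681 is a commonly used Ring-LWE modulus.

```python
import random

n, q = 16, 7681

def mul(a, b):                        # multiply in Z_q[x] / (x^n + 1)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k, s = (i + j) % n, 1 if i + j < n else -1   # wrap with sign flip
            c[k] = (c[k] + s * a[i] * b[j]) % q
    return c

def add(a, b): return [(x + y) % q for x, y in zip(a, b)]
def noise():   return [random.randint(-2, 2) % q for _ in range(n)]

a = [random.randrange(q) for _ in range(n)]       # public polynomial
s, e = noise(), noise()                           # secret key, noise
b = add(mul(a, s), e)                             # public key (a, b)

m = [random.randint(0, 1) for _ in range(n)]      # one bit per coefficient
r, e1, e2 = noise(), noise(), noise()
c1 = add(mul(a, r), e1)                           # ciphertext part 1
c2 = add(add(mul(b, r), e2), [mi * (q // 2) for mi in m])

d = [(x - y) % q for x, y in zip(c2, mul(c1, s))] # decrypt: c2 - c1*s
out = [1 if q // 4 < x < 3 * q // 4 else 0 for x in d]
print(out == m)                                   # True for this noise level
```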

Journal ArticleDOI
TL;DR: A digital hardware emulation of device-level models for the insulated gate bipolar transistor and the power diode on the field-programmable gate array (FPGA) features a fully paralleled implementation using an accurate floating-point data representation in the VHSIC Hardware Description Language (VHDL).
Abstract: Accurate models of power electronic devices are necessary for hardware-in-the-loop (HIL) simulators. This paper proposes a digital hardware emulation of device-level models for the insulated gate bipolar transistor (IGBT) and the power diode on the field programmable gate array (FPGA). The hardware emulation utilizes detailed physics-based nonlinear models for these devices, and features a fully paralleled implementation using an accurate floating-point data representation in the VHSIC Hardware Description Language (VHDL). A dc-dc buck converter circuit is emulated to validate the hardware IGBT and diode models, and the nonlinear circuit simulation process. The captured oscilloscope results demonstrate the high accuracy of the emulator in comparison to offline simulation of the original system using Saber software.
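
As a feel for what a fixed-step emulator evaluates every time step, here is a toy forward-Euler update of a dc-dc buck converter with an ideal switch standing in for the paper's physics-based IGBT/diode models; all component values are invented.

```python
def buck_step(iL, vC, switch_on, dt, Vin=48.0, L=1e-3, C=1e-4, R=10.0):
    vSW = Vin if switch_on else 0.0     # idealized IGBT/diode pair
    diL = (vSW - vC) / L                # inductor current derivative
    dvC = (iL - vC / R) / C             # capacitor voltage derivative
    return iL + dt * diL, vC + dt * dvC

iL = vC = 0.0
dt, fsw, duty = 1e-7, 20e3, 0.5         # 100 ns step, 20 kHz PWM, 50% duty
for step in range(200_000):             # 20 ms of simulated time
    on = (step * dt * fsw) % 1.0 < duty
    iL, vC = buck_step(iL, vC, on, dt)
print(round(vC, 2))                     # settles near duty * Vin = 24 V
```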

Journal ArticleDOI
11 Dec 2014
TL;DR: This work presents a multi-granularity FPGA suitable for mobile computing that achieves a 3-4× interconnect area reduction over commercial FPGAs for comparable connectivity, reducing overall area and leakage by 2-2.5×, and delivering up to 50% lower active power.
Abstract: Following the rapid expansion of mobile computing in the past decade, mobile system-on-a-chip (SoC) designs have off-loaded most compute-intensive tasks to dedicated accelerators to improve energy efficiency. An increasing number of accelerators in power-limited SoCs results in large regions of “dark silicon.” Such accelerators lack flexibility, thus any design change requires a SoC re-spin, significantly impacting cost and timeline. To address the need for efficiency and flexibility, this work presents a multi-granularity FPGA suitable for mobile computing. Occupying 20.5 mm² in 40 nm CMOS, the chip incorporates 2,760 fine-grained configurable logic blocks (CLBs) with 11,040 6-input look-up-tables (LUTs) for random logic, basic arithmetic, shift registers, and distributed memories, 42 medium-grained 48b DSP processors for MAC and SIMD operations, 16 32K×1b to 512×72b reconfigurable block RAMs, and 2 coarse-grained kernels: a 64-8192-point fast Fourier transform (FFT) processor and a 16-core universal DSP (UDSP) for software-defined radio (SDR). Using a mixed-radix hierarchical interconnect, the chip achieves a 4× interconnect area reduction over commercial FPGAs for comparable connectivity, reducing overall area and leakage by 2.5×, and delivering 10-50% lower active power. With coarse-grained kernels, the chip's energy efficiency reaches within 4-5× of ASIC designs.

Journal ArticleDOI
TL;DR: This work proposes a novel modular Bit-Vector (BV) based architecture to perform high-speed packet classification on Field Programmable Gate Array (FPGA) and introduces an algorithm named StrideBV and modularize the BV architecture to achieve better scalability than traditional BV methods.
Abstract: Packet classification is widely used as a core function for various applications in network infrastructure. With increasing demands in throughput, performing wire-speed packet classification has become challenging. Also, the performance of today's packet classification solutions depends on the characteristics of rulesets. In this work, we propose a novel modular Bit-Vector (BV) based architecture to perform high-speed packet classification on Field Programmable Gate Arrays (FPGAs). We introduce an algorithm named StrideBV and modularize the BV architecture to achieve better scalability than traditional BV methods. Further, we incorporate range search in our architecture to eliminate ruleset expansion caused by range-to-prefix conversion. The post place-and-route results of our implementation on a state-of-the-art FPGA show that the proposed architecture is able to operate at 100+ Gbps for minimum-size packets while supporting large rulesets of up to 28K rules using only on-chip memory resources. Our solution is ruleset-feature independent, i.e., the above performance can be guaranteed for any ruleset regardless of the composition of the ruleset.
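
The BV principle that StrideBV refines: each field lookup yields a bitmap over rules, and ANDing the bitmaps leaves exactly the rules matching every field, which is why performance does not depend on ruleset composition. A toy two-field Python sketch (rules invented for illustration):

```python
rules = [                 # (source prefix, destination port), by priority
    ("10.0.", 80),
    ("10.0.", 443),
    ("192.168.", 80),
]

def field_bv(value, match):
    bv = 0
    for i, rule in enumerate(rules):
        if match(value, rule):
            bv |= 1 << i                  # rule i matches this field
    return bv

def classify(src, dport):
    bv = field_bv(src,   lambda v, r: v.startswith(r[0])) \
       & field_bv(dport, lambda v, r: v == r[1])
    if bv == 0:
        return None
    return (bv & -bv).bit_length() - 1    # lowest index = highest priority

print(classify("10.0.3.7", 443))          # -> rule 1
```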

Proceedings ArticleDOI
20 May 2014
TL;DR: A novel redundancy-based protection approach based on Trojan tolerance that modifies the application mapping process to provide a high level of protection against Trojans of varying forms and sizes is proposed.
Abstract: Reconfigurable hardware, including field programmable gate arrays (FPGAs), is used in a wide range of embedded applications including signal processing, multimedia, and security. FPGA device production is often outsourced to off-shore facilities for economic reasons. This opens up opportunities for the insertion of malicious design alterations in the foundry, referred to as hardware Trojan attacks, to cause logical and physical malfunction. The vulnerability of these devices to hardware attacks raises security concerns regarding hardware and design assurance. In this paper, we analyze hardware Trojan attacks in FPGAs considering diverse activation and payload characteristics, and derive a taxonomy of Trojan attacks in FPGAs. To our knowledge, this is the first effort to analyze Trojan threats in FPGA hardware. Next, we propose a novel redundancy-based protection approach based on Trojan tolerance that modifies the application mapping process to provide a high level of protection against Trojans of varying forms and sizes. We show that the proposed approach achieves significantly higher security at lower overhead than conventional fault-tolerance schemes by exploiting the nature of Trojans and the reconfiguration of FPGA resources.

Journal ArticleDOI
01 Oct 2014
TL;DR: This work characterises the communication overheads in such a hybrid system to motivate the importance of lean management, before quantifying the context switch overhead of the hypervisor approach, and compares the resulting idle time for a standard Linux implementation and the proposed hypervisor method, showing two orders of magnitude improved performance.
Abstract: Emerging hybrid reconfigurable platforms tightly couple capable processors with high performance reconfigurable fabrics. This promises to move the focus of reconfigurable computing systems from static accelerators to a more software oriented view, where reconfiguration is a key enabler for exploiting the available resources. This requires a revised look at how to manage the execution of such hardware tasks within a processor-based system, and in doing so, how to virtualize the resources to ensure isolation and predictability. This view is further supported by trends towards amalgamation of computation in the automotive and avionics domains, where such properties are essential to overall system reliability. We present the virtualized execution and management of software and hardware tasks using a microkernel-based hypervisor running on a commercial hybrid computing platform (the Xilinx Zynq). The CODEZERO hypervisor has been modified to leverage the capabilities of the FPGA fabric, with support for discrete hardware accelerators, dynamically reconfigurable regions, and regions of virtual fabric. We characterise the communication overheads in such a hybrid system to motivate the importance of lean management, before quantifying the context switch overhead of the hypervisor approach. We then compare the resulting idle time for a standard Linux implementation and the proposed hypervisor method, showing two orders of magnitude improved performance with the hypervisor.

Journal ArticleDOI
TL;DR: This work demonstrates how modern analog communication systems like community radio schemes and the Radio Data System (RDS), and digital communication systems such as simple Digital Video Broadcasting (DVB) and OFDM-based data communication, can be developed using the open-source hardware USRP1.
Abstract: Modern communication devices are highly intelligent and interconnected, but upgrading the hardware in existing devices is not easy: new hardware must remain compatible with the old, while new protocols may or may not support older ones. Reconfigurable hardware design addresses these problems, since the hardware can be reprogrammed as technology evolves. The cost of the commercially available hardware and software required to set up such a module is very high; this can be addressed with open-source hardware and software such as the Universal Software Radio Peripheral (USRP) and GNU Radio. This work demonstrates how modern analog communication systems, such as community radio schemes and the Radio Data System (RDS), and digital communication systems, such as simple Digital Video Broadcasting (DVB) and OFDM-based data communication, can be developed using the open-source USRP1 hardware. The work is accessible even to first-year engineering students, allowing them to implement communication and control applications at low cost.

Book ChapterDOI
20 Oct 2014
TL;DR: This work presents the Latency-insensitive Environment for Application Programming (LEAP), an FPGA operating system built around latency-insensitive communication channels, and presents an extensible interface for compile-time management of resources.
Abstract: FPGAs offer attractive power and performance for many applications, especially relative to traditional sequential architectures. In spite of these advantages, FPGAs have been deployed in only a few niche domains. We argue that the difficulty of programming FPGAs all but precludes their use in more general systems: FPGA programmers are currently exposed to all the gory system details that software operating systems long ago abstracted away. In this work, we present the Latency-insensitive Environment for Application Programming (LEAP), an FPGA operating system built around latency-insensitive communication channels. LEAP alleviates the FPGA programming problem by providing a rich set of portable latency-insensitive abstraction layers for program development. Unlike software operating system services, which are generally dynamic, the nature of FPGAs requires that many configuration decisions be made at compile time. We present an extensible interface for compile-time management of resources. We demonstrate that LEAP provides design portability, while consuming as little as 3% of FPGA area, by mapping several designs onto various FPGA platforms.
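
The central abstraction is easy to mimic in software: a latency-insensitive channel exposes only blocking enqueue/dequeue, so the same program works whether the channel compiles to wires, a BRAM FIFO, or an inter-FPGA link. A Python stand-in:

```python
import queue, threading

chan = queue.Queue(maxsize=8)   # bounded: full -> back-pressure, empty -> stall

def producer():
    for i in range(4):
        chan.put(i)             # blocks if the consumer falls behind

def consumer():
    for _ in range(4):
        print("got", chan.get())  # blocks until data arrives, at any latency

threading.Thread(target=producer).start()
consumer()
```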

Journal ArticleDOI
03 Sep 2014
TL;DR: The lean DSP Extension Architecture (iDEA) presented in this article builds around the dynamic programmability of a single DSP48E1 primitive, with minimal additional logic to create a general-purpose processor supporting a full instruction-set architecture.
Abstract: DSP blocks in modern FPGAs can be used for a wide range of arithmetic functions, offering increased performance while saving logic resources for other uses. They have evolved to better support a plethora of signal processing tasks, meaning that in other application domains they may be underutilised. The DSP48E1 primitives in new Xilinx devices support dynamic programmability that can help extend their usefulness; the specific function of a DSP block can be modified on a cycle-by-cycle basis. However, the standard synthesis flow does not leverage this flexibility in the vast majority of cases. The lean DSP Extension Architecture (iDEA) presented in this article builds around the dynamic programmability of a single DSP48E1 primitive, with minimal additional logic to create a general-purpose processor supporting a full instruction-set architecture. The result is a very compact, fast processor that can execute a full gamut of general machine instructions. We show a number of simple applications compiled using a MIPS compiler and translated to the iDEA instruction set, comparing with a Xilinx MicroBlaze to show estimated performance figures. Being based on the DSP48E1, this processor can be deployed across next-generation Xilinx Artix-7, Kintex-7, Virtex-7, and Zynq families.

Proceedings ArticleDOI
03 Nov 2014
TL;DR: Efficient and effective packing and analytical placement algorithms for large-scale heterogeneous FPGAs to deal with issues on heterogeneity, datapath regularity, and scalability are presented.
Abstract: As FPGA architecture evolves, complex heterogeneous blocks, such as RAMs and DSPs, are widely used to effectively implement various circuit applications. These complex blocks often consist of datapath-intensive circuits, which are not adequately addressed by existing packing and placement algorithms. Besides, scalability has become a first-order metric for modern FPGA design, mainly due to dramatically increasing design complexity. This paper presents efficient and effective packing and analytical placement algorithms for large-scale heterogeneous FPGAs that deal with issues of heterogeneity, datapath regularity, and scalability. Compared to the well-known academic tool VPR, experimental results show that our packing and placement algorithms achieve respective 199.80X and 3.07X speedups with better wirelength, and our overall flow achieves 50% shorter wirelength with an 18.30X overall speedup.